R: converting a dfm to a data frame


I have a dfm result from quanteda:

library(quanteda)
library(dplyr)  # supplies %>% in case the installed quanteda does not export it
df <- data.frame(id = c(1), text = c("I am loving it"), stringsAsFactors = FALSE)

myDfm <- df$text %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
    tokens_remove(pattern = c(stopwords(source = "smart"))) %>%
    dfm()

What I tried:

convert(myDfm, to = "data.frame")
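
For reference, here is a minimal sketch of what this call returns, assuming a working quanteda install. convert() gives a wide table: one row per document, a leading column with the document names, and one column per remaining feature. The intermediate name `toks` is only for illustration, and the label of the first column may differ between quanteda versions.

```r
library(quanteda)

df <- data.frame(id = c(1), text = c("I am loving it"), stringsAsFactors = FALSE)

# Same steps as in the question, written without pipes
toks <- tokens(df$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks <- tokens_remove(toks, pattern = stopwords(source = "smart"))
myDfm <- dfm(toks)

out <- convert(myDfm, to = "data.frame")

str(out)    # wide layout: one row per document, one column per feature
names(out)  # the first entry is the document-name column
```

Because the result is wide, getting one row per (document, feature) pair requires reshaping it to long form.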

It is a bit involved, but the code below does it:

library(dplyr)
library(tidyr)
library(quanteda)

out <- convert(myDfm, to = "data.frame")
pivot_longer(out, cols = !contains("document"), names_to = "features", values_to = "count")  %>% 
  mutate(id = as.integer(gsub("[a-z]", "", document))) %>% 
  inner_join(df) %>% # joins on id
  select(id, features) # select only the id and features column

Joining, by = "id"
# A tibble: 1 x 2
     id features
  <dbl> <chr>   
1     1 loving

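Could you give an example of the output you expect? That would help you get an answer faster.

@ErrorJordan Yes, please check it in the update.

If the dfm result contains multiple words for id 1, what result do you expect? Don't you need the counts?

What exactly do you mean by "a data frame which will have the number of rows and columns as the input, but in the text column it will have the clean text of the dfm process"? What are the number of rows and columns? And by "clean text of the dfm process", do you mean the feature names?

Error: `!contains("document")` must evaluate to column positions or names, not a logical vector. Run `rlang::last_error()` to see where the error occurred. Thanks, but I get this error when running it.

@Nathalie, based on your example the code runs fine on my machine. You may want to add a filter(count != 0) after the mutate statement to filter out the 0 values. I added an example to the answer.

Thanks. I checked the update, but I still get the same error.

@Nathalie, what happens when you update tidyr (and tidyselect)? Some changes in these packages were not in sync at the last update. I run R 4.0 with the latest versions of quanteda, dplyr, tidyr, tidyselect, etc.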

The same approach with two documents and the zero counts filtered out:

df <- data.frame(id = c(1,2), text = c("I am loving it", "I am hating it"), stringsAsFactors = FALSE)

myDfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = c(stopwords(source = "smart"))) %>%
  dfm()

out <- convert(myDfm, to = "data.frame")
pivot_longer(out, cols = !contains("document"), names_to = "features", values_to = "count")  %>% 
  mutate(id = as.integer(gsub("[a-z]", "", document))) %>% 
  filter(count != 0) %>% 
  inner_join(df) %>% # joins on id
  select(id, features) # select only the id and features column

Joining, by = "id"
# A tibble: 2 x 2
     id features
  <dbl> <chr>   
1     1 loving  
2     2 hating
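
If the `!contains("document")` error from the comments persists, the sketch below is a variant that may be less sensitive to package versions. It rests on two assumptions that I have not verified against every release: older tidyselect versions did not accept `!` in selections (a negative position such as `-1` avoids it), and the name of the first column returned by convert() differs between quanteda versions (so it is renamed to a fixed label). The label `doc_name` and the variable `result` are hypothetical names chosen for this example.

```r
library(quanteda)
library(dplyr)
library(tidyr)

df <- data.frame(id = c(1, 2),
                 text = c("I am loving it", "I am hating it"),
                 stringsAsFactors = FALSE)

myDfm <- tokens(df$text, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords(source = "smart")) %>%
  dfm()

out <- convert(myDfm, to = "data.frame")

# Assumption: the first column holds the document names ("text1", "text2", ...).
# Give it a fixed, hypothetical label so the code does not depend on its original name.
names(out)[1] <- "doc_name"

result <- out %>%
  # Select every column except the first by position instead of using `!`
  pivot_longer(cols = -1, names_to = "features", values_to = "count") %>%
  filter(count != 0) %>%                                   # drop features absent from a document
  mutate(id = as.numeric(gsub("\\D", "", doc_name))) %>%   # "text1" becomes 1
  inner_join(df, by = "id") %>%                            # explicit key, so no join message
  select(id, features)

print(result)
```

Recovering the id from the document name only works while the documents keep their default names in the same order as the rows of df; with custom document names, a different join key would be needed.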