R: Convert a dfm to a data frame (tags: r, quanteda)

I have a dfm result from quanteda:

library(quanteda)
library(magrittr)  # provides %>% in case your quanteda version does not export it
df <- data.frame(id = c(1), text = c("I am loving it"), stringsAsFactors = FALSE)

myDfm <- df$text %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
    tokens_remove(pattern = c(stopwords(source = "smart"))) %>%
    dfm()

What I tried:

convert(myDfm, to = "data.frame")

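It is a bit involved, but the code below does it: convert the dfm to a data frame, reshape it to long format, recover the id from the document name, and join back to df.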

library(dplyr)
library(tidyr)
library(quanteda)

out <- convert(myDfm, to = "data.frame")
pivot_longer(out, cols = !contains("document"), names_to = "features", values_to = "count")  %>% 
  mutate(id = as.integer(gsub("[a-z]", "", document))) %>% 
  inner_join(df) %>% # joins on id
  select(id, features) # select only the id and features column

Joining, by = "id"
# A tibble: 1 x 2
     id features
  <dbl> <chr>   
1     1 loving

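Can you give an example of the output you expect? That would help you get an answer faster.

@ErrorJordan Yes, please check it in the update.

If the dfm result contains more than one word for id 1, what is your expected result? Don't you need the counts?

What exactly do you mean by "a data frame which will have the number of rows and columns as the input, but in the text column it will have the clean text of the dfm process"? What are the numbers of rows and columns? By "the clean text of the dfm process", do you mean the feature names?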

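Error: `!contains("document")` must evaluate to column positions or names, not a logical vector. Run `rlang::last_error()` to see where the error occurred. Thanks, I get this error when I run it.

@Nathalie, based on your example, the code runs fine on my machine. You may want to add filter(count != 0) after the mutate statement to filter out the 0 values. I added an example to the answer.

Thanks. I checked the update but I still get the same error.

@Nathalie, what happens when you update tidyr (and tidyselect)? Some behavior in these packages changed in recent updates. I run R 4.0 and the latest versions of quanteda, dplyr, tidyr, tidyselect, etc.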
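Since the discussion above comes down to package versions, a quick way to compare setups is to print them. This is only a minimal sketch: the package list is taken from the comment above, and any package that is not installed is reported as NA.

```r
# Packages mentioned in the comment thread
pkgs <- c("quanteda", "dplyr", "tidyr", "tidyselect")

# Look up the installed version of each package; NA if it is not installed
versions <- vapply(pkgs, function(p) {
  if (requireNamespace(p, quietly = TRUE)) {
    as.character(packageVersion(p))
  } else {
    NA_character_
  }
}, character(1))

print(R.version.string)
print(versions)
```

Posting this output alongside the error makes it easier to tell whether an outdated tidyr or tidyselect is the cause.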

Example with two documents, adding filter(count != 0) to drop the zero counts:

df <- data.frame(id = c(1,2), text = c("I am loving it", "I am hating it"), stringsAsFactors = FALSE)

myDfm <- df$text %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = c(stopwords(source = "smart"))) %>%
  dfm()

out <- convert(myDfm, to = "data.frame")
pivot_longer(out, cols = !contains("document"), names_to = "features", values_to = "count")  %>% 
  mutate(id = as.integer(gsub("[a-z]", "", document))) %>% 
  filter(count != 0) %>% 
  inner_join(df) %>% # joins on id
  select(id, features) # select only the id and features column

Joining, by = "id"
# A tibble: 2 x 2
     id features
  <dbl> <chr>   
1     1 loving  
2     2 hating
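
If the `!contains("document")` selection fails as reported in the comments and updating packages is not an option, one possible workaround is to do the reshape in base R, which avoids tidyselect altogether. The following is a minimal sketch, not the answer's original method. It assumes the converted data frame has a document column with values like text1, text2 as in the output above; `out` is a hand-built stand-in for the result of convert(), and `feature_cols`, `long` and `result` are illustrative names. Depending on the quanteda version the identifier column may be named differently, so check names(out) first.

```r
# Hand-built stand-in for convert(myDfm, to = "data.frame") (assumption)
out <- data.frame(
  document = c("text1", "text2"),
  loving   = c(1, 0),
  hating   = c(0, 1),
  stringsAsFactors = FALSE
)

# Every column except the document identifier is a feature
feature_cols <- setdiff(names(out), "document")

# Wide to long: one row per (document, feature) pair
long <- data.frame(
  document = rep(out$document, times = length(feature_cols)),
  features = rep(feature_cols, each = nrow(out)),
  count    = unlist(out[feature_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Keep only features that occur in the document
long <- long[long$count != 0, ]

# Strip the lowercase letters from "text1", "text2" to recover the id
long$id <- as.integer(gsub("[a-z]", "", long$document))

result <- long[, c("id", "features")]
print(result)
```

As in the answer, the id is recovered from the default document names, so this only works when the documents keep quanteda's default text1, text2, ... naming and the ids in df follow the same order.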