R: Convert a dfm to a data frame
I have a dfm result from quanteda:
library(quanteda);
df <- data.frame(id = c(1), text = c("I am loving it"), stringsAsFactors = FALSE)
myDfm <- df$text %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_remove(pattern = c(stopwords(source = "smart"))) %>%
dfm()
What I tried:
convert(myDfm, to = "data.frame")
It is a bit involved, but the code below does it:
library(dplyr)
library(tidyr)
library(quanteda)
out <- convert(myDfm, to = "data.frame")
pivot_longer(out, cols = !contains("document"), names_to = "features", values_to = "count") %>%
mutate(id = as.integer(gsub("[a-z]", "", document))) %>%
inner_join(df) %>% # joins on id
select(id, features) # select only the id and features column
Joining, by = "id"
# A tibble: 1 x 2
id features
<dbl> <chr>
1 1 loving
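Could you give an example of the output you expect? That will help you get an answer faster.
@ErrorJordan Yes, please check the update.
If the dfm result contains several words for id 1, what is your expected result? Don't you need the counts?
What exactly do you mean by "a data frame which will have the same number of rows and columns as the input, but in the text column it will have the clean text of the dfm process"? Which rows and columns? By "the clean text of the dfm process", do you mean the feature names?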
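Thanks. I get this error when I run it:
Error: `!contains("document")` must evaluate to column positions or names, not a logical vector
Run `rlang::last_error()` to see where the error occurred.
@Nathalie, with your example the code runs fine on my machine. You may want to add filter(count != 0) after the mutate statement to filter out the 0 values. I added an example to the answer.
Thank you. I checked the update, but I still get the same error.
@Nathalie, what happens if you update tidyr (and tidyselect)? Some behavior in these packages changed with recent updates. I am running R 4.0 with the latest versions of quanteda, dplyr, tidyr, tidyselect, etc.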
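If updating the packages is not an option, one possible workaround is to avoid the `!` operator in the column selection. As far as I know, `!` negation of selection helpers is only supported from tidyselect 1.0.0 onward, while dropping a column with `-` works in older releases too. The sketch below uses a hand-built stand-in for the output of convert(myDfm, to = "data.frame"); its column names and counts are assumptions for illustration. Depending on the quanteda version, the document column may be called `doc_id` rather than `document`, so check names(out) first.

```r
library(dplyr)
library(tidyr)

# Stand-in for convert(myDfm, to = "data.frame");
# column names and counts are illustrative assumptions.
out <- data.frame(document = c("text1", "text2"),
                  loving = c(1, 0),
                  hating = c(0, 1),
                  stringsAsFactors = FALSE)

long <- out %>%
  # -document selects every column except `document`,
  # without relying on the `!` operator
  pivot_longer(cols = -document, names_to = "features", values_to = "count") %>%
  # keep only features that actually occur in each document
  filter(count != 0)

long
```

The rest of the pipeline (mutate, inner_join, select) can then follow unchanged.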
df <- data.frame(id = c(1,2), text = c("I am loving it", "I am hating it"), stringsAsFactors = FALSE)
myDfm <- df$text %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_remove(pattern = c(stopwords(source = "smart"))) %>%
dfm()
out <- convert(myDfm, to = "data.frame")
pivot_longer(out, cols = !contains("document"), names_to = "features", values_to = "count") %>%
mutate(id = as.integer(gsub("[a-z]", "", document))) %>%
filter(count != 0) %>%
inner_join(df) %>% # joins on id
select(id, features) # select only the id and features column
Joining, by = "id"
# A tibble: 2 x 2
id features
<dbl> <chr>
1 1 loving
2 2 hating
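Regarding the question in the comments about documents that keep more than one word: the result above has one row per feature. If the goal is a single cleaned text string per id (an assumption about what the asker wants), the features could be pasted back together with dplyr. The `long` data frame below is a hypothetical stand-in for the pipeline output. Note that a dfm does not preserve the original word order, so the pasted text follows feature order only.

```r
library(dplyr)

# Hypothetical long-format result: one row per (id, feature) pair
long <- data.frame(id = c(1, 1, 2),
                   features = c("loving", "pizza", "hating"),
                   stringsAsFactors = FALSE)

clean <- long %>%
  group_by(id) %>%
  # collapse all features of a document into one space-separated string
  summarise(text = paste(features, collapse = " "))

clean
```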