How do I transform a document-term matrix in R?


Hello, I have a document-term matrix that I converted with the tidy() function, and that worked perfectly. I now want to plot a word cloud based on the frequency of the words. My converted table looks like this:

> head(Wcloud.Data)
# A tibble: 6 x 3
  document term       count
  <chr>    <chr>      <dbl>
1 1        accept         1
2 1        access         1
3 1        accomplish     1
4 1        account        4
5 1        accur          2
6 1        achiev         1
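
For context, here is a minimal sketch of how such a table is typically produced: a tm DocumentTermMatrix tidied with tidytext::tidy() into one row per document/term pair (the corpus below is made-up example text, not the asker's data):

library(tm)
library(tidytext)

# Toy corpus purely to illustrate the tidy() step described above
docs <- VCorpus(VectorSource(c("accept access account account",
                               "account accur accur achiev")))
dtm <- DocumentTermMatrix(docs)

# tidy() melts the sparse matrix into a tibble with columns document, term, count
Wcloud.Data <- tidy(dtm)
head(Wcloud.Data)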
Update: I have added a picture of my data frame.


Using dplyr you can do the following:

library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

Wcloud.Data<- data.frame(Document= c(rep(1,6)), 
                         term = c("accept", "access","accomplish", "account", "accur", "achiev"),
                         count = c(1,1,1,4,2,1))

Data<-Wcloud.Data %>% 
  group_by(term) %>% 
  summarise(Frequency = sum(count))
set.seed(1234)
wordcloud(words = Data$term, freq = Data$Frequency, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

Could you provide us with a subset of the data in Wcloud.Data (possibly using dput), so that we can reproduce the problem on your dataset? I think I have a solution for you, but I need to confirm it locally. Thanks :)

It is normal that the same word shows up more than once, because you have multiple documents (76717): if a word occurs with a high frequency in several documents, it is printed several times. If you want a word cloud that contains each word only once, drop the document column and aggregate the counts per word.

@phiver Thank you for your answer. How can I fix this automatically? I don't want the duplicates. I don't know why, but I am having trouble producing the dput output; or maybe R is still computing and just needs some time. What do you think?

Anything that gives 100-1000 rows of Wcloud.Data would help.

Thanks Carles, but my problem is different; phiver found what my problem is. Well, I have now edited my question so that you can answer it. Also, if you just try to group by "term" and take sum(count), you should get a result.

This was my problem: the same word appears several times because there are multiple documents (76717). Yes, that is what a term-frequency matrix is: a word count per document. If the most typical word happens to occur in several documents, it appears several times. So what should I do?
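
As an aside, one way to share such a subset in the question (a sketch; dput() and head() are base R, and Wcloud.Data is the asker's tidied table) is:

# Print a copy-pasteable representation of the first 100 rows,
# which can be posted directly in the question
dput(head(as.data.frame(Wcloud.Data), 100))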
library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")

Wcloud.Data<- data.frame(Document= c(rep(1,6)), 
                         term = c("accept", "access","accomplish", "account", "accur", "achiev"),
                         count = c(1,1,1,4,2,1))

Data<-Wcloud.Data %>% 
  group_by(term) %>% 
  summarise(Frequency = sum(count))
set.seed(1234)
wordcloud(words = Data$term, freq = Data$Frequency, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))
Another approach is to build the document-feature matrix with quanteda and then sum the counts per feature:

library(tibble)
library(quanteda)
Data <- data_frame(text = c("Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "this is china",
                              "china is here",
                              'hello china',
                              "Chinese Beijing Chinese",
                              "Chinese Chinese Shanghai",
                              "this is china",
                              "china is here",
                              'hello china',
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              "Kyoto Japan",
                              "Tokyo Japan Chinese",
                              'japan'))
DocTerm <- quanteda::dfm(Data$text)
DocTerm
# Document-feature matrix of: 19 documents, 11 features (78.5% sparse).
# 19 x 11 sparse Matrix of class "dfm"
# features
# docs     chinese beijing shanghai this is china here hello kyoto japan tokyo
# text1        2       1        0    0  0     0    0     0     0     0     0
# text2        2       0        1    0  0     0    0     0     0     0     0
# text3        0       0        0    1  1     1    0     0     0     0     0
# text4        0       0        0    0  1     1    1     0     0     0     0
# text5        0       0        0    0  0     1    0     1     0     0     0
# text6        2       1        0    0  0     0    0     0     0     0     0
# text7        2       0        1    0  0     0    0     0     0     0     0
# text8        0       0        0    1  1     1    0     0     0     0     0
# text9        0       0        0    0  1     1    1     0     0     0     0
# text10       0       0        0    0  0     1    0     1     0     0     0
# text11       0       0        0    0  0     0    0     0     1     1     0
# text12       1       0        0    0  0     0    0     0     0     1     1
# text13       0       0        0    0  0     0    0     0     1     1     0
# text14       1       0        0    0  0     0    0     0     0     1     1
# text15       0       0        0    0  0     0    0     0     1     1     0
# text16       1       0        0    0  0     0    0     0     0     1     1
# text17       0       0        0    0  0     0    0     0     1     1     0
# text18       1       0        0    0  0     0    0     0     0     1     1
# text19       0       0        0    0  0     0    0     0     0     1     0

Mat<-quanteda::convert(DocTerm,"data.frame")[,2:ncol(DocTerm)] # Converting to a Dataframe without taking into account the text variable
Result<- colSums(Mat) # This is what you are interested in
names(Result)<-colnames(Mat)
# > Result
# chinese  beijing shanghai     this       is    china     here    hello    kyoto    japan 
# 24        4        4        4        8       12        4        4        8       18
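
From there, Result can be passed straight to wordcloud(), the same way as in the dplyr version above (a sketch, assuming the wordcloud and RColorBrewer packages are installed):

library(wordcloud)
library(RColorBrewer)

# names(Result) are the terms, the values are their total frequencies
set.seed(1234)
wordcloud(words = names(Result), freq = Result, min.freq = 1,
          max.words = 200, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))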