R tm package: creating a matrix of the N most frequent terms

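I created a TermDocumentMatrix using the tm package in R.

I am trying to build a matrix or data frame that contains the 50 most frequently occurring terms.

When I try to convert it to a matrix, I get the following error: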

> ap.m <- as.matrix(mydata.dtm)
Error: cannot allocate vector of size 2.0 Gb

My ideal output would look like this:

term      frequency
the         2123
and         2095
able         883
...          ...

Any suggestions?

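The term-document matrix in tm is already stored as a sparse matrix. Here, mydata.tdm$i and mydata.tdm$j are the index vectors of the matrix (the row and column positions of the non-zero entries), and mydata.tdm$v is the corresponding vector of frequencies. (The object is called mydata.dtm in the question.) So you can create a sparse matrix with sparseMatrix from the Matrix package:

sparseMatrix(i=mydata.tdm$i, j=mydata.tdm$j, x=mydata.tdm$v)

You can then use rowSums to get the total frequency of each term, and link the rows of interest to the terms they represent using $terms (in the structure printed for this object, the term names sit under $dimnames$Terms).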
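The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: `tdm` is a small hand-made list that mimics the `i`, `j`, `v`, `nrow`, `ncol` and `dimnames` fields of a tm term-document matrix (it is not the real data), and the names `sm`, `freq`, `top` and `term.freq` are made up for the example.

```r
library(Matrix)

# Hypothetical stand-in for the real term-document matrix: a plain list
# with the same triplet fields (i = term index, j = document index,
# v = count) that str() shows for a TermDocumentMatrix.
tdm <- list(
  i = c(1L, 2L, 1L, 3L, 2L),
  j = c(1L, 1L, 2L, 2L, 3L),
  v = c(2, 1, 3, 1, 2),
  nrow = 3L,
  ncol = 3L,
  dimnames = list(Terms = c("the", "and", "able"),
                  Docs  = c("1", "2", "3"))
)

# Build a sparse matrix; passing dims keeps empty trailing rows/columns.
sm <- sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v,
                   dims = c(tdm$nrow, tdm$ncol))

# Row sums give the total frequency of each term across all documents.
freq <- Matrix::rowSums(sm)
names(freq) <- tdm$dimnames$Terms

# Keep the (up to) 50 most frequent terms and shape them as a data frame.
top <- head(sort(freq, decreasing = TRUE), 50)
term.freq <- data.frame(term = names(top),
                        frequency = unname(top),
                        stringsAsFactors = FALSE)
print(term.freq)
```

With a real object, the same code should apply after replacing `tdm` with the term-document matrix itself, since its fields are accessed the same way. The dense matrix is never created, so memory use stays proportional to the number of non-zero entries.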

HTH

Structure and printed summary of mydata.dtm:

> str(mydata.dtm)
List of 6
 $ i       : int [1:430206] 377 468 725 3067 3906 4150 4393 5188 5793 6665 ...
 $ j       : int [1:430206] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:430206] 1 1 1 1 1 1 1 1 2 3 ...
 $ nrow    : int 15643
 $ ncol    : int 17207
 $ dimnames:List of 2
  ..$ Terms: chr [1:15643] "000" "0mm" "100" "1000" ...
  ..$ Docs : chr [1:17207] "1" "2" "3" "4" ...
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
> mydata.dtm
A term-document matrix (15643 terms, 17207 documents)

Non-/sparse entries: 430206/268738895
Sparsity           : 100%
Maximal term length: 54 
Weighting          : term frequency (tf)
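The figures in this summary account for the allocation error. A dense numeric matrix needs 8 bytes per cell, zeros included, while the triplet form stores only the non-zero entries. A rough back-of-the-envelope check, using the dimensions printed above (the variable names are made up for the example, and the sparse estimate ignores the small overhead for dimnames and attributes):

```r
# Dimensions and non-zero count taken from the printed summary.
n_terms   <- 15643
n_docs    <- 17207
n_nonzero <- 430206

# Dense storage: one 8-byte double per cell.
dense_bytes <- n_terms * n_docs * 8
dense_gib   <- dense_bytes / 2^30

# Triplet storage: two 4-byte integers (i, j) plus one 8-byte double (v)
# per non-zero entry.
sparse_bytes <- n_nonzero * (4 + 4 + 8)
sparse_mib   <- sparse_bytes / 2^20

cat(sprintf("dense:  %.2f GiB\n", dense_gib))
cat(sprintf("sparse: %.2f MiB\n", sparse_mib))
```

The dense estimate comes out at about 2.0 GiB, which matches the size in the error message, whereas the sparse representation needs only a few megabytes.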
