Arranging the words of a document-term matrix by frequency in R


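Sorry for yet another question, but I am new to text mining and need advice from the pros. After a long struggle with `content_transformer`, I now have a clean corpus. My next question: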

1. How can I select from `dtm` the words with small frequencies, so that their frequencies add up to no more than 1%?

For example, I need this format:

x 0.5% of all words in the dataset
y 0.2%
z 0.3%

So here the total frequency sum = 1%.

How can I do this?

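You can look at the `TermDocumentMatrix` function of the `tm` package (or its transposed counterpart `DocumentTermMatrix`, used below). It counts how often each word occurs in each document. Summing those counts over the whole corpus should get you where you want to be.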

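library(tm)

# `corpus` is the cleaned corpus from the question
dtm <- DocumentTermMatrix(corpus)
# word counts for the complete corpus (one entry per term)
counts <- colSums(as.matrix(dtm))

# total number of word occurrences in the corpus
total <- sum(counts)
# relative frequency of each word, as a share of all words
freqs <- counts / total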

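Could you please show the code for selecting the low-frequency words so that their frequency does not exceed 1%?

Thanks, that works well. But how do I find the words whose total frequency sums to 1% and write them to a new data set? Could you show me the code?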
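
One possible way to approach the follow-up is sketched below. It is only an illustration under some assumptions: `counts` is taken to be the named vector produced by `colSums(as.matrix(dtm))`, but a small made-up vector stands in for it here so the snippet runs without a corpus. The names `pct`, `rare`, and `rare_df` are hypothetical, and "the words whose frequencies sum to 1%" is read as "the rarest words, taken in ascending order, for as long as their cumulative share stays within 1%".

```r
# Toy stand-in for `counts <- colSums(as.matrix(dtm))` (made-up values)
counts <- c(the = 500, data = 300, text = 190, x = 5, y = 2, z = 3)

# Share of each word among all word occurrences, in percent
pct <- 100 * counts / sum(counts)

# Sort ascending so the rarest words come first
pct_sorted <- sort(pct)

# Keep the rarest words while their cumulative share stays within the threshold
threshold <- 1  # percent
rare <- pct_sorted[cumsum(pct_sorted) <= threshold]

# Collect the selection in a new data set
rare_df <- data.frame(word = names(rare),
                      percent = as.numeric(rare),
                      row.names = NULL)
print(rare_df)
```

If the result should go to a file, something like `write.csv(rare_df, "rare_words.csv", row.names = FALSE)` would do; the file name is just a placeholder. Note that this greedy selection stops at the first word that would push the total above the threshold, so the selected shares may add up to less than 1%.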