TermDocumentMatrix as.matrix uses a large amount of memory


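I am currently using the tm package to extract terms for duplicate detection in a moderately sized database of 25k items (30 MB). This runs fine on my desktop, but when I try to run it on my server it seems to take an enormous amount of time. On closer inspection, I found that I had eaten through 4 GB of swap running the line apply(posts.TmDoc, 1, sum) to calculate the term frequencies. Furthermore, even running as.matrix on its own produces a 3 GB object on my desktop, see [link missing in source].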

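Is this really necessary just to generate a frequency count for 18k terms across 25k items? Is there any other way to generate the frequency counts without coercing the TermDocumentMatrix to a matrix or a vector?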

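I cannot remove terms based on sparseness, because that is how the actual algorithm is implemented. It looks for terms that are common to at least 2 but not more than 50 items and groups on them, calculating a similarity value for each group.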

Here is the code in context for reference:

min_word_length = 5
max_word_length = Inf
max_term_occurance = 50
min_term_occurance = 2


# Get All The Posts
Posts = db.getAllPosts()
posts.corpus = Corpus(VectorSource(Posts[,"provider_title"]))

# remove things we don't want
posts.corpus = tm_map(posts.corpus,content_transformer(tolower))
posts.corpus = tm_map(posts.corpus, removePunctuation)
posts.corpus = tm_map(posts.corpus, removeNumbers)
posts.corpus = tm_map(posts.corpus, removeWords, stopwords('english'))

# keep only words of at least min_word_length (5) characters
posts.TmDoc = TermDocumentMatrix(posts.corpus, control=list(wordLengths=c(min_word_length, max_word_length)))

# get the words that occur at least twice, but fewer than 50 times
clustterms = names(which(apply(posts.TmDoc, 1, sum) >= min_term_occurance  & apply(posts.TmDoc, 1, sum) < max_term_occurance))


Since I never actually need the frequency counts themselves, I can use the findFreqTerms command:

setdiff(findFreqTerms(posts.TmDoc, 2), findFreqTerms(posts.TmDoc, 50))

is the same as

names(which(apply(posts.TmDoc, 1, sum) >= min_term_occurance  & apply(posts.TmDoc, 1, sum) < max_term_occurance))



but runs almost instantaneously.
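For anyone who does want the counts themselves, here is a minimal sketch of the same filter done on the sparse representation. It assumes the slam package (which tm depends on) is installed and uses its row_sums function, which works directly on the triplet form of a TermDocumentMatrix. The toy titles and the names toy.corpus, toy.tdm, min_occ and max_occ are made up for illustration and are not from the original post.

```r
library(tm)    # TermDocumentMatrix, findFreqTerms
library(slam)  # row_sums for sparse triplet matrices

# Hypothetical toy data standing in for Posts[, "provider_title"]
titles <- c("alpha bravo charlie", "alpha bravo", "alpha delta")
toy.corpus <- VCorpus(VectorSource(titles))
toy.tdm <- TermDocumentMatrix(toy.corpus)

# Illustrative thresholds, playing the role of
# min_term_occurance and max_term_occurance in the question
min_occ <- 2
max_occ <- 3

# Sum each term's row on the sparse structure; no dense matrix is built
term_counts <- row_sums(toy.tdm)
by_row_sums <- names(term_counts)[term_counts >= min_occ & term_counts < max_occ]

# The findFreqTerms approach from the answer, for comparison
by_find_freq <- setdiff(findFreqTerms(toy.tdm, min_occ),
                        findFreqTerms(toy.tdm, max_occ))

print(term_counts)
print(by_row_sums)
print(by_find_freq)
```

Both selections should agree on the toy data. The row_sums variant is only worth the extra line if the numeric counts are needed later, for example to weight the similarity value per group.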
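Do the calculation: 18e3*25e3*8/1024^3 gives roughly 3.3 GB. So yes, that is the memory consumption of the matrix. Use a sparse matrix instead. Your question is similar to [links missing in source].

@Andrie, it appears the sparse methods still need to convert it to a regular matrix before actually producing a sparse matrix. Once it is converted it is about 800 KB, but until the conversion it sits in memory. I will try the row-by-row approach from the second link, using inspect to pull one row at a time and then storing the row sums in a named list.

@Andrie, neither of the previous two answers was actually the most efficient/elegant solution. I have submitted an answer, and plus one for helping find a better way!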
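The back-of-the-envelope estimate from the first comment can be reproduced in a few lines of base R. This is only a sketch: it assumes every cell is stored as an 8-byte double, which is what as.matrix would typically produce here, and the variable names are made up for illustration.

```r
# Dimensions quoted in the question
n_terms <- 18e3
n_docs  <- 25e3
bytes_per_double <- 8

# Size of a fully dense numeric matrix, in GiB
dense_gib <- n_terms * n_docs * bytes_per_double / 1024^3
print(round(dense_gib, 2))

# Small-scale check: a dense 1000 x 1000 numeric matrix costs
# about 8 bytes per cell regardless of how many cells are zero
small_dense <- matrix(0, nrow = 1000, ncol = 1000)
small_bytes <- as.numeric(object.size(small_dense))
print(small_bytes / (1000 * 1000))
```

The second part illustrates why sparsity does not help once the object has been coerced: the zeros are stored like any other value.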