Converting a large character vector in R into a word-frequency matrix for Gephi (tags: r, twitter, frequency, tm, gephi)


r twitter


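I want to compute pairwise word frequencies for a large collection of tweets I have gathered, so that I can visualize them as a network graph in Gephi. The data currently looks like this; it is a character vector: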

head(Tweet_text)
[1] "habits that separates successful persons from mediocre persons habit success startup entrepreneurship"                 
[2] "business entrepreneurship tech watch youtube star casey neistat ride a drone dressed as santa"        
[3] "how to beat procrastination work deadlines workculture productivity management hr hrd entrepreneurship"
[4] "reading on entrepreneurship and startups and enjoying my latte"                                        
[5] "not a very nice way to encourage entrepreneurship and in the same sentence dog another company"        
[6] "us robotics founder on earlyday internet entrepreneurship articles management"

Its structure is as follows:

str(Tweet_text)
 chr [1:8661] "habits that separates successful persons from mediocre persons habit success startup entrepreneurship" ...

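This sample data set contains 8661 tweets. I now want to compute the pairwise word frequencies across all of these tweets so that I can export them to Gephi. The end result I am looking for is: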

+------------------------+--------------+------+
| term1                  | term2        | Freq |
+------------------------+--------------+------+
| entrepreneurship       | startup      |  2   |
+------------------------+--------------+------+

So I started with the DocumentTermMatrix function from the tm package:

library(tm)
dtm <- DocumentTermMatrix(Corpus(VectorSource(Tweet_text)))

This correctly gives the frequencies, for example for "success" in the first tweet:

inspect(dtm[1, c("success")])
<<DocumentTermMatrix (documents: 1, terms: 1)>>
Non-/sparse entries: 1/0
Sparsity           : 0%
Maximal term length: 7
Weighting          : term frequency (tf)

    Terms
Docs success
   1       1

After that, I tried to get these frequencies into the desired table format with:

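m <- as.matrix(dtm)                     # dense copy of the document-term matrix
m[m >= 1] <- 1                          # binarise: 1 if the term occurs in the tweet
m <- m %*% t(m)                         # note: documents x documents; t(m) %*% m would be terms x terms
Dat_Freq <- as.data.frame(as.table(m))  # long format with columns Var1, Var2, Freq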

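But this is where the first problem starts: the matrix is far too large. Second, I do not know how to restrict the pairwise word frequencies to a given threshold. For example, a pair should need a frequency greater than 10, so that the matrix does not get too big.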

I would very much appreciate any advice on how to get the pairwise frequencies in CSV format.

All the best
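Both problems in the question (the dense matrix and the missing threshold) can be approached by keeping everything sparse. The following is only a sketch under stated assumptions, not a tested solution for the full 8661 tweets: it uses the Matrix package (not mentioned in the question), a three-tweet toy stand-in for Tweet_text, a hypothetical threshold min_freq, and a hypothetical output file name. crossprod() computes t(m) %*% m, i.e. a terms-by-terms matrix whose entries count the tweets containing both terms.

```r
library(tm)
library(Matrix)

# Toy stand-in for Tweet_text (assumption; the real vector has 8661 tweets)
Tweet_text <- c(
  "habit success startup entrepreneurship",
  "reading on entrepreneurship and startup",
  "business entrepreneurship tech"
)

dtm <- DocumentTermMatrix(VCorpus(VectorSource(Tweet_text)))

# Rebuild the tm triplet matrix as a sparse 0/1 matrix, never calling as.matrix()
sp <- sparseMatrix(
  i = dtm$i, j = dtm$j, x = rep(1, length(dtm$i)),
  dims = dim(dtm), dimnames = list(NULL, Terms(dtm))
)

# terms x terms: number of tweets in which both terms occur
cooc <- crossprod(sp)

# Keep each unordered pair once (strict upper triangle) and list it in long format
upper <- summary(triu(cooc, k = 1))
edges <- data.frame(
  term1 = colnames(cooc)[upper$i],
  term2 = colnames(cooc)[upper$j],
  Freq  = upper$x,
  stringsAsFactors = FALSE
)

# Hypothetical threshold; the question suggests something like "greater than 10"
min_freq <- 2
edges <- edges[edges$Freq >= min_freq, ]

# Hypothetical file name
write.csv(edges, "pair_freq.csv", row.names = FALSE)
```

Only pairs that actually co-occur are stored, so memory grows with the number of observed pairs rather than with the square of the vocabulary. Gephi's spreadsheet import usually expects edge columns named Source, Target and Weight, so the columns may need renaming before import.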

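I think you should have a look at the RWeka library, especially the NGramTokenizer function. It will let you obtain all possible word pairs. Then you can use the findFreqTerms function to keep only the terms with a frequency greater than 10.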
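A minimal sketch of this suggestion, assuming RWeka and a working Java installation are available. Note that NGramTokenizer with min = max = 2 produces bigrams, i.e. pairs of adjacent words, not every pair of words within a tweet. The toy Tweet_text and the helper name BigramTokenizer are assumptions for illustration; on the real data, lowfreq = 11 would correspond to "greater than 10".

```r
library(tm)
library(RWeka)  # requires Java

# Toy stand-in for Tweet_text (assumption)
Tweet_text <- c(
  "habit success startup entrepreneurship",
  "tech startup entrepreneurship news",
  "reading on entrepreneurship"
)

# Hypothetical helper: tokenize each document into two-word n-grams
BigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# VCorpus is used because custom tokenizers are not applied to SimpleCorpus objects
corpus <- VCorpus(VectorSource(Tweet_text))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))

# findFreqTerms() keeps terms whose total count is at least lowfreq
frequent <- findFreqTerms(tdm, lowfreq = 2)
print(frequent)
```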

Another thing you can do is use the tidytext package.

Assuming your data is in a data frame called tweets and text is the corresponding variable:

library(tidytext)
library(dplyr)

tweets %>%
   unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
   count(bigram, sort = TRUE) %>%
   head(100)

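will give you the 100 most frequent bigrams. Of course, it may be a good idea to remove stopwords first, so take a look at the recipes in the book that accompanies the package.

Thank you very much, I will check right away whether this works.

Thanks, Yannis P.! I did not know about this book before, so I will definitely have a look. I will post a comment if I find a solution.

Yes, you are right, I need to clean out the stopwords first. It took me a while to check, but the tidytext library works really well for pairwise word frequencies! Thank you, Yannis!

It seems to be the new "go-to" for text processing in R, and it is quite nice. Compared with the other approaches in the tm library (e.g. TermDocumentMatrix), I like the idea of one token per document per row.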
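The stopword point raised in the answer and comments could be sketched as follows. This is an illustration under assumptions, not the code the commenters actually ran: the toy tweets data frame, the threshold min_freq and the output file name are hypothetical, tidyr is an extra dependency for separate(), and stop_words is the stopword table shipped with tidytext. Each bigram is split into two columns so that the result already has the term1 / term2 / Freq shape asked for in the question.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Toy stand-in for the tweets data frame (assumption)
tweets <- data.frame(
  text = c(
    "habit success startup entrepreneurship",
    "tech startup entrepreneurship news",
    "reading on entrepreneurship"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical threshold; the question suggests something like "greater than 10"
min_freq <- 2

bigram_counts <- tweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%                       # tweets with fewer than two words
  separate(bigram, into = c("term1", "term2"), sep = " ") %>%
  filter(!term1 %in% stop_words$word,              # drop pairs containing a stopword
         !term2 %in% stop_words$word) %>%
  count(term1, term2, name = "Freq", sort = TRUE) %>%
  filter(Freq >= min_freq)

# Hypothetical file name
write.csv(bigram_counts, "bigram_freq.csv", row.names = FALSE)
```

As with any bigram approach, only adjacent words are paired; terms that merely appear in the same tweet are not counted together.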