R中用户随时间变化的词频_R_Tm

R中用户随时间变化的词频

R中用户随时间变化的词频,r,tm,R,Tm,我的目标是随着时间的推移对单词频率进行评估。我有大约36000条用户评论和相关日期的个人条目。我这里有25个用户示例：我试图在给定的日期里找到最常用的单词（可能是前10个？）。我觉得我的方法很接近，但不完全正确： library("tm") frequencylist <- list(0) for(i in unique(sampledf[,2])){ subset <- subset(sampledf, sampledf[,2]==i) comments

我的目标是随着时间的推移对单词频率进行评估。我有大约36000条用户评论和相关日期的个人条目。我这里有25个用户示例：

我试图在给定的日期里找到最常用的单词（可能是前10个？）。我觉得我的方法很接近，但不完全正确：

    library("tm")

frequencylist <- list(0)

for(i in unique(sampledf[,2])){

  subset <- subset(sampledf, sampledf[,2]==i)

  comments <- as.vector(subset[,1])
  verbatims <- Corpus(VectorSource(comments))
  verbatims <- tm_map(verbatims, stripWhitespace)
  verbatims <- tm_map(verbatims, content_transformer(tolower))
  verbatims <- tm_map(verbatims, removeWords, stopwords("english"))
  verbatims <- tm_map(verbatims, removePunctuation)

  stopwords2 <- c("game")
  verbatims2 <- tm_map(verbatims, removeWords, stopwords2)
  dtm <- DocumentTermMatrix(verbatims2)
  dtm2 <- as.matrix(dtm)
  frequency <- colSums(dtm2)
  frequency <- sort(frequency, decreasing=TRUE)
  frequencydf <- data.frame(frequency)
  frequencydf$comments <- row.names(frequencydf)
  frequencydf$date <- i

  frequencylist[[i]] <- frequencydf 
}

library（“tm”）
frequencylist评论者是正确的，因为有更好的方法来建立一个可复制的例子。此外，您的答案可以更具体地说明您试图作为输出完成什么。（我无法让您的代码无误地执行。）
但是：您要求使用更简单、更好的方法。这是我认为两者都是的。它使用quanteda文本包，并在创建文档特征矩阵时利用组
特征。然后，它在“dfm”上执行一些排名，以获得您需要的每日学期排名
请注意，这是基于我使用read.delim（“sampledf.tsv”，stringsAsFactors=FALSE）
加载了链接数据
require（quanteda）
#使用日期文档变量创建语料库
myCorpus可以dput（）
数据吗？看起来你在里面有一些奇怪的角色。听起来，至少有一次观察的长度为0时，频率发生了变化。你应该检查一下。另外，当向列表中添加项时，它应该是frequencylist[[i]]哎呀，我注意到subset函数中有一个bug，刚刚修复了它。你指的是什么数据dput（）
用于子集数据或数据帧的结果列表？一个dput（）优于指向外部粘贴站点的链接。看见既然你没有错误，这真的不是一个具体的问题。如果您的代码可以工作，但您希望有人检查它的样式，那么这更适合，而不是堆栈溢出。风格问题没有一个明确的答案可以接受。
require(quanteda)
# create a corpus with a date document variable
myCorpus <- corpus(sampledf$content_strip, 
                   docvars = data.frame(date = as.Date(sampledf$postedDate_fix, "%M/%d/%Y")))

# construct a dfm, group on date, and remove stopwords plus the term "game"
myDfm <- dfm(myCorpus, groups = "date", ignoredFeatures = c("game", stopwords("english")))
## Creating a dfm from a corpus ...
## ... grouping texts by variable: date
## ... lowercasing
## ... tokenizing
## ... indexing documents: 20 documents
## ... indexing features: 198 feature types
## ... removed 47 features, from 175 supplied (glob) feature types
## ... created a 20 x 151 sparse dfm
## ... complete. 
## Elapsed time: 0.009 seconds.

myDfm <- sort(myDfm) # not required, just for presentation
# remove a really nasty long term
myDfm <- removeFeatures(myDfm, "^a{10}", valuetype = "regex")
## removed 1 feature, from 1 supplied (regex) feature types

# make a data.frame of the daily ranks of each feature
featureRanksByDate <- as.data.frame(t(apply(myDfm, 1, order, decreasing = TRUE)))
names(featureRanksByDate) <- features(myDfm)
featureRanksByDate[, 1:10]
##              â great nice play  go will can get ever first
## 2013-10-02   1    18   19   20  21   22  23  24   25    26
## 2013-10-04   3     1    2    4   5    6   7   8    9    10
## 2013-10-05   3     9   28   29   1    2   4   5    6     7
## 2013-10-06   7     4    8   10  11   30  31  32   33    34
## 2013-10-07   5     1    2    3   4    6   7   8    9    10
## 2013-10-09  12    42   43    1   2    3   4   5    6     7
## 2013-10-13   1    14    6    9  10   13  44  45   46    47
## 2013-10-16   2     3   84   85   1    4   5   6    7     8
## 2013-10-18  15     1    2    3   4    5   6   7    8     9
## 2013-10-19   3    86    1    2   4    5   6   7    8     9
## 2013-10-22   2    87   88   89  90   91  92  93   94    95
## 2013-10-23  13    98   99  100 101  102 103 104  105   106
## 2013-10-25   4     6    5   12  16  109 110 111  112   113
## 2013-10-27   8     4    6   15  17  124 125 126  127   128
## 2013-10-30  11     1    2    3   4    5   6   7    8     9
## 2014-10-01   7    16  139    1   2    3   4   5    6     8
## 2014-10-02 140     1    2    3   4    5   6   7    8     9
## 2014-10-03 141   142  143    1   2    3   4   5    6     7
## 2014-10-05 144   145  146  147 148    1   2   3    4     5
## 2014-10-06  17   149  150    1   2    3   4   5    6     7

# top n features by day
n <- 10 
as.data.frame(apply(featureRanksByDate, 1, function(x) {
    todaysTopFeatures <- names(featureRanksByDate)
    names(todaysTopFeatures) <- x
    todaysTopFeatures[as.character(1:n)]
}), row.names = 1:n)
##    2013-10-02 2013-10-04 2013-10-05 2013-10-06 2013-10-07 2013-10-09 2013-10-13 2013-10-16 2013-10-18 2013-10-19 2013-10-22 2013-10-23
## 1           â      great         go     triple      great       play          â         go      great       nice       year       year
## 2         win       nice       will      niple       nice         go    created          â       nice       play          â       give
## 3        year          â          â   backflip       play       will      wasnt      great       play          â       give       good
## 4        give       play        can      great         go        can      money       will         go         go       good       hard
## 5        good         go        get      scope          â        get     prizes        can       will       will       hard       time
## 6        hard       will       ever       ball       will       ever       nice        get        can        can       time     triple
## 7        time        can      first          â        can      first      piece       ever        get        get     triple      niple
## 8      triple        get        fun       nice        get        fun       dead      first       ever       ever      niple   backflip
## 9       niple       ever      great   testical       ever        win       play        fun      first      first   backflip      scope
## 10   backflip      first        win       play      first       year         go        win        fun        fun      scope       ball
##    2013-10-25 2013-10-27 2013-10-30 2014-10-01 2014-10-02 2014-10-03 2014-10-05 2014-10-06
## 1       scope      scope      great       play      great       play       will       play
## 3    testical   testical       play       will       play       will        get       will
## 2        ball       ball       nice         go       nice         go        can         go
## 4           â      great         go        can         go        can       ever        can
## 5        nice       shot       will        get       will        get      first        get
## 6       great       nice        can       ever        can       ever        fun       ever
## 7        shot       head        get          â        get      first        win      first
## 8        head          â       ever      first       ever        fun       year        fun
## 9     dancing    dancing      first        fun      first        win       give        win
## 10        cow        cow        fun        win        fun       year       good       year