Math of R tm::findAssocs: how does this function work?


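I have been using findAssocs() from the text mining package (tm), but I realized that something seems to be wrong with my dataset.

My dataset is 1500 open-ended answers saved in one column of a csv file. So I load the dataset like this and use the typical tm_map calls to turn it into a corpus: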

library(tm)
Q29 <- read.csv("favoritegame2.csv")
corpus <- Corpus(VectorSource(Q29$Q29))
corpus <- tm_map(corpus, content_transformer(tolower))  # tm >= 0.6 needs base functions wrapped
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus<- tm_map(corpus, removeWords, stopwords("english"))
dtm<- DocumentTermMatrix(corpus)

findAssocs(dtm, "like", .2)
> cousin  fill  ....
  0.28    0.20

The resulting data frame has 1500 obs. of 1689 variables (or is that because the data is saved in a single row of the csv file?).

Question 2. Even though the target term like appears once, and cousin and fill each appear once, the scores differ as shown above. Shouldn't they be the same?

I have tried to find the math behind findAssocs(), but with no success so far. Any advice is much appreciated.

 findAssocs
#function (x, term, corlimit) 
#UseMethod("findAssocs", x)
#<environment: namespace:tm>

methods(findAssocs )
#[1] findAssocs.DocumentTermMatrix* findAssocs.matrix*   findAssocs.TermDocumentMatrix*

 getAnywhere(findAssocs.DocumentTermMatrix)
#-------------
A single object matching ‘findAssocs.DocumentTermMatrix’ was found
It was found in the following places
  registered S3 method for findAssocs from namespace tm
  namespace:tm
with value

function (x, term, corlimit) 
{
    ind <- term == Terms(x)
    suppressWarnings(x.cor <- cor(as.matrix(x[, ind]), as.matrix(x[, 
        !ind])))

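Your dtm has 1689 variables because that is the number of unique words in your observations (excluding stop words and numbers). The word "like" probably appears more than once across the 1500 observations, and it is not always accompanied by "cousin" and "fill". Have you counted how many times "like" appears?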
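To illustrate why two terms that each occur once can still get different scores, here is a minimal base-R sketch. The small count matrix is made up for illustration (it is not the asker's data), and it stands in for `as.matrix(dtm)`: rows are documents, columns are terms.

```r
# Hypothetical document-term counts: 6 documents, 3 terms
m <- cbind(
  like   = c(2, 1, 1, 0, 0, 0),  # "like" occurs 4 times in total
  cousin = c(1, 0, 0, 0, 0, 0),  # occurs once, in the document with the most "like"
  fill   = c(0, 1, 0, 0, 0, 0)   # occurs once, in a document with a single "like"
)

# How often does each term appear across all documents?
total_counts <- colSums(m)
print(total_counts)

# Correlation of the "like" column with each of the other two columns
assoc <- c(
  cousin = cor(m[, "like"], m[, "cousin"]),
  fill   = cor(m[, "like"], m[, "fill"])
)
print(round(assoc, 2))  # cousin 0.8, fill 0.2
```

Both `cousin` and `fill` appear exactly once, but the score depends on how the whole column of counts lines up with the `like` column, not on the totals alone.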
By the way, if your term-document matrix is very large, you may want to try this version of findAssocs:
# u is a term document matrix (transpose of a DTM)
# term is your term
# corlimit is a value -1 to 1

findAssocsBig <- function(u, term, corlimit){
  suppressWarnings(x.cor <-  gamlr::corr(t(u[ !u$dimnames$Terms == term, ]),        
                                         as.matrix(t(u[  u$dimnames$Terms == term, ]))  ))  
  x <- sort(round(x.cor[(x.cor[, term] > corlimit), ], 2), decreasing = TRUE)
  return(x)
}

I don't think anyone has answered your final question:

"I have tried to find the math behind findAssocs(), but with no success so far. Any advice is much appreciated."

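The math of findAssocs() is based on the standard function cor() in R's stats package. Given two numeric vectors, cor() computes their covariance divided by the product of their standard deviations.

So, given a DocumentTermMatrix dtm containing the terms "word1" and "word2", if findAssocs(dtm, "word1", 0) returns "word2" with a value of x, then the correlation between the term vectors of "word1" and "word2" is x.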
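The definition can be checked by hand in base R. This is only a sketch: the two count vectors are hypothetical term columns over six documents, and the variable names are chosen for illustration.

```r
# Hypothetical term counts for two terms across six documents
word1 <- c(0, 1, 1, 1, 1, 1)
word2 <- c(0, 0, 1, 1, 1, 1)

# Pearson correlation written out:
# covariance divided by the product of the two standard deviations
manual <- cov(word1, word2) / (sd(word1) * sd(word2))

# The built-in function gives the same value
builtin <- cor(word1, word2)

print(c(manual = manual, builtin = builtin))
print(round(manual, 2))  # 0.63
```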
For a long-winded example:
> data <-  c("", "word1", "word1 word2","word1 word2 word3","word1 word2 word3 word4","word1 word2 word3 word4 word5") 
> dtm <- DocumentTermMatrix(VCorpus(VectorSource(data)))
> as.matrix(dtm)
    Terms
Docs word1 word2 word3 word4 word5
   1     0     0     0     0     0
   2     1     0     0     0     0
   3     1     1     0     0     0
   4     1     1     1     0     0
   5     1     1     1     1     0
   6     1     1     1     1     1
> findAssocs(dtm, "word1", 0) 
$word1
word2 word3 word4 word5 
 0.63  0.45  0.32  0.20 

> cor(as.matrix(dtm)[,"word1"], as.matrix(dtm)[,"word2"])
[1] 0.6324555
> cor(as.matrix(dtm)[,"word1"], as.matrix(dtm)[,"word3"])
[1] 0.4472136


And so on for words 4 and 5.
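There is no "textmining" package on CRAN. Please include the library() or require() calls you have used.

@Dwin - it appears to be in the recent package "tm".

Thanks for the edit! One caveat I found is that findAssocs requires the correlation limit to be >= 0. The underlying cor may return negative values to indicate the direction of the relationship, but that does not seem to be possible through findAssocs.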
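If negative associations matter, one option is to call cor() directly on the count matrix, in the same spirit as the method source shown earlier. The sketch below uses a made-up matrix in place of `as.matrix(dtm)`; the term names and the `target` variable are assumptions for illustration, not part of tm.

```r
# Hypothetical document-term counts: 6 documents, 3 terms
m <- cbind(
  tea    = c(1, 1, 1, 0, 0, 0),
  coffee = c(0, 0, 0, 1, 1, 1),  # never appears together with "tea"
  cup    = c(1, 1, 0, 1, 0, 0)
)

target <- "tea"
ind <- colnames(m) == target

# Correlate the target column with every other column (1 x 2 matrix)
x.cor <- cor(m[, ind, drop = FALSE], m[, !ind, drop = FALSE])

# Named vector of all associations, negative ones included
assoc <- round(x.cor[target, ], 2)
print(sort(assoc, decreasing = TRUE))  # cup 0.33, coffee -1

# A non-negative threshold hides the negative relationship
print(assoc[assoc >= 0])
```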