R 如何计算常用词并将结果存储在矩阵中?

R 如何计算常用词并将结果存储在矩阵中?,r,matrix,R,Matrix,我有很多文本句子要比较,但这里有一个小红帽的例子 text1 <- "Once upon a time" text2 <- "there was a dear little girl" text3 <- "who was loved by everyone who looked at her" 是否有一种有效的方法来完成矩阵?我有100多个句子要比较。这是一篇类似但不相等的帖子:试试这个: CommonWordsMatrixOld <- function(vList)

我有很多文本句子要比较,但这里有一个小红帽的例子

text1 <- "Once upon a time"
text2 <- "there was a dear little girl"
text3 <- "who was loved by everyone who looked at her"
是否有一种有效的方法来完成矩阵?我有100多个句子要比较。这是一篇类似但不相等的帖子:

试试这个:

CommonWordsMatrixOld <- function(vList) {
    v <- lapply(vList, tolower)
    do.call(rbind, lapply(v, function(x) {
             xSplit <- strsplit(x, " ")[[1]]
             do.call(c, lapply(v, function(y) length(intersect(xSplit, strsplit(y, " ")[[1]]))))
        }))
}

myText <- list(text1, text2, text3)
对于OP请求的数据大小,它的速度相当快。获得的数据如下:

更新

这是一个更快的算法,它可以减少许多不必要的操作,并利用
lower.tri
,同时保持非常通用性

CommonWordsMatrixNew <- function(vList) {
    v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]]))
    s <- length(v)
    m <- do.call(rbind, lapply(1L:s, function(x) {
        c(rep(0L,(x-1L)), do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]])))))
    }))
    m[lower.tri(m)] <- t(m)[lower.tri(m)]
    m
}

新算法将调用
strsplit
的次数减少了
n^2-n
次(例如,在上面的示例中,
strplit
在原始算法中被调用
10000次,在更新版本中仅被调用
100次)。此外,由于生成的矩阵是对称的,因此不需要多次计算每个句子之间的交互作用,因此
lappy
函数中的
x=1:s
y=x:s
。这些循环的计算数量从
n^2
减少到
=(n*(n+1)/2)
(例如,在上面的示例中,从
10000
减少到
5050
)。在那之后,我们依赖于
R
中的索引功能,它通常比手工制造快得多。

我发现预先拆分会提高速度,因此

CommonWordsMatrix <- function(vList) {
  v <- lapply(vList, tolower)
  do.call(rbind, lapply(v, function(x) {
    do.call(c, lapply(v, function(y) length(intersect(x, y))))
  }))
}

CommonWordsMatrix非常感谢,下面我根据您的回复发布了一些内容,似乎是faster@pachamaltese,第一种算法是进行不必要的计算。我修改了我原来的算法,以减少许多操作。此外,上述算法仍然通用(即,它们不依赖于预分割向量)。顺便说一句,问得好。说得好!我在函数本身之前使用strsplit进行了预拆分
CommonWordsMatrixOld(myText)
     [,1] [,2] [,3]
[1,]    4    1    0
[2,]    1    6    1
[3,]    0    1    8
testWords <- read.csv("4000-most-common-english-words-csv.csv", stringsAsFactors = FALSE)

set.seed(1111)
myTestText <- lapply(1:100, function(x) {
         paste(testWords[sample(1000:1020, sample(30, 1), replace = TRUE),],collapse = " ")
    })

myTestText[[15]]
[1] "access restaurant video opinion video eventually fresh eventually
 reform credit publish judge Senate publish fresh restaurant publish
 version Senate critical release recall relation version"

system.time(test1 <- CommonWordsMatrixOld(myTestText))
 user  system elapsed 
0.625   0.009   0.646
test1[1:10,1:10]
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    9    3    5    1    3    4    4    2    2     1
 [2,]    3    5    3    1    1    3    3    0    0     1
 [3,]    5    3   12    0    3    8    4    3    2     1
 [4,]    1    1    0    1    0    0    1    0    0     0
 [5,]    3    1    3    0    4    2    1    1    1     0
 [6,]    4    3    8    0    2   13    7    4    1     1
 [7,]    4    3    4    1    1    7   10    4    1     1
 [8,]    2    0    3    0    1    4    4    7    3     0
 [9,]    2    0    2    0    1    1    1    3    4     0
[10,]    1    1    1    0    0    1    1    0    0     2
CommonWordsMatrixNew <- function(vList) {
    v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]]))
    s <- length(v)
    m <- do.call(rbind, lapply(1L:s, function(x) {
        c(rep(0L,(x-1L)), do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]])))))
    }))
    m[lower.tri(m)] <- t(m)[lower.tri(m)]
    m
}
microbenchmark(New=CommonWordsMatrixNew(myTestText), 
               Old=CommonWordsMatrixOld(myTestText),
               Pach=CommonWordsMatrixPach(PreSplit1), times = 10)
Unit: milliseconds
expr       min        lq      mean    median        uq      max neval
 New  78.64434  79.07127  86.10754  79.72828  81.39679 137.0695    10
 Old 321.49031 323.89835 326.61801 325.75221 328.50877 335.3306    10
Pach 138.34742 143.00504 145.35147 145.17376 148.34699 151.5535    10

identical(CommonWordsMatrixNew(myTestText), CommonWordsMatrixOld(myTestText), CommonWordsMatrixPach(PreSplit1))
[1] TRUE
CommonWordsMatrix <- function(vList) {
  v <- lapply(vList, tolower)
  do.call(rbind, lapply(v, function(x) {
    do.call(c, lapply(v, function(y) length(intersect(x, y))))
  }))
}