R 如何计算常用词并将结果存储在矩阵中？_R_Matrix

R 如何计算常用词并将结果存储在矩阵中？

r matrix

R 如何计算常用词并将结果存储在矩阵中？,r,matrix,R,Matrix,我有很多文本句子要比较，但这里有一个小红帽的例子 text1 <- "Once upon a time" text2 <- "there was a dear little girl" text3 <- "who was loved by everyone who looked at her" 是否有一种有效的方法来完成矩阵？我有100多个句子要比较。这是一篇类似但不相等的帖子：试试这个： CommonWordsMatrixOld <- function(vList)

我有很多文本句子要比较，但这里有一个小红帽的例子

text1 <- "Once upon a time"
text2 <- "there was a dear little girl"
text3 <- "who was loved by everyone who looked at her"

是否有一种有效的方法来完成矩阵？我有100多个句子要比较。这是一篇类似但不相等的帖子：

试试这个：

CommonWordsMatrixOld <- function(vList) {
    v <- lapply(vList, tolower)
    do.call(rbind, lapply(v, function(x) {
             xSplit <- strsplit(x, " ")[[1]]
             do.call(c, lapply(v, function(y) length(intersect(xSplit, strsplit(y, " ")[[1]]))))
        }))
}

myText <- list(text1, text2, text3)

对于OP请求的数据大小，它的速度相当快。获得的数据如下：

更新

这是一个更快的算法，它可以减少许多不必要的操作，并利用

lower.tri

，同时保持非常通用性

CommonWordsMatrixNew <- function(vList) {
    v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]]))
    s <- length(v)
    m <- do.call(rbind, lapply(1L:s, function(x) {
        c(rep(0L,(x-1L)), do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]])))))
    }))
    m[lower.tri(m)] <- t(m)[lower.tri(m)]
    m
}

新算法将调用

strsplit

的次数减少了

n^2-n

次（例如，在上面的示例中，

strplit

在原始算法中被调用

10000次，在更新版本中仅被调用100次）。此外，由于生成的矩阵是对称的，因此不需要多次计算每个句子之间的交互作用，因此lappy
函数中的x=1:s
和y=x:s
。这些循环的计算数量从n^2
减少到=（n*（n+1）/2）
（例如，在上面的示例中，从10000
减少到5050
）。在那之后，我们依赖于R
中的索引功能，它通常比手工制造快得多。
我发现预先拆分会提高速度，因此
CommonWordsMatrix <- function(vList) {
  v <- lapply(vList, tolower)
  do.call(rbind, lapply(v, function(x) {
    do.call(c, lapply(v, function(y) length(intersect(x, y))))
  }))
}

CommonWordsMatrix非常感谢，下面我根据您的回复发布了一些内容，似乎是faster@pachamaltese，第一种算法是进行不必要的计算。我修改了我原来的算法，以减少许多操作。此外，上述算法仍然通用（即，它们不依赖于预分割向量）。顺便说一句，问得好。说得好！我在函数本身之前使用strsplit进行了预拆分
CommonWordsMatrixOld(myText)
     [,1] [,2] [,3]
[1,]    4    1    0
[2,]    1    6    1
[3,]    0    1    8

testWords <- read.csv("4000-most-common-english-words-csv.csv", stringsAsFactors = FALSE)

set.seed(1111)
myTestText <- lapply(1:100, function(x) {
         paste(testWords[sample(1000:1020, sample(30, 1), replace = TRUE),],collapse = " ")
    })

myTestText[[15]]
[1] "access restaurant video opinion video eventually fresh eventually
 reform credit publish judge Senate publish fresh restaurant publish
 version Senate critical release recall relation version"

system.time(test1 <- CommonWordsMatrixOld(myTestText))
 user  system elapsed 
0.625   0.009   0.646

test1[1:10,1:10]
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    9    3    5    1    3    4    4    2    2     1
 [2,]    3    5    3    1    1    3    3    0    0     1
 [3,]    5    3   12    0    3    8    4    3    2     1
 [4,]    1    1    0    1    0    0    1    0    0     0
 [5,]    3    1    3    0    4    2    1    1    1     0
 [6,]    4    3    8    0    2   13    7    4    1     1
 [7,]    4    3    4    1    1    7   10    4    1     1
 [8,]    2    0    3    0    1    4    4    7    3     0
 [9,]    2    0    2    0    1    1    1    3    4     0
[10,]    1    1    1    0    0    1    1    0    0     2

CommonWordsMatrixNew <- function(vList) {
    v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]]))
    s <- length(v)
    m <- do.call(rbind, lapply(1L:s, function(x) {
        c(rep(0L,(x-1L)), do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]])))))
    }))
    m[lower.tri(m)] <- t(m)[lower.tri(m)]
    m
}

microbenchmark(New=CommonWordsMatrixNew(myTestText), 
               Old=CommonWordsMatrixOld(myTestText),
               Pach=CommonWordsMatrixPach(PreSplit1), times = 10)
Unit: milliseconds
expr       min        lq      mean    median        uq      max neval
 New  78.64434  79.07127  86.10754  79.72828  81.39679 137.0695    10
 Old 321.49031 323.89835 326.61801 325.75221 328.50877 335.3306    10
Pach 138.34742 143.00504 145.35147 145.17376 148.34699 151.5535    10

identical(CommonWordsMatrixNew(myTestText), CommonWordsMatrixOld(myTestText), CommonWordsMatrixPach(PreSplit1))
[1] TRUE

CommonWordsMatrix <- function(vList) {
  v <- lapply(vList, tolower)
  do.call(rbind, lapply(v, function(x) {
    do.call(c, lapply(v, function(y) length(intersect(x, y))))
  }))
}