R 如何计算常用词并将结果存储在矩阵中?
我有很多文本句子要比较,但这里有一个小红帽的例子R 如何计算常用词并将结果存储在矩阵中?,r,matrix,R,Matrix,我有很多文本句子要比较,但这里有一个小红帽的例子 text1 <- "Once upon a time" text2 <- "there was a dear little girl" text3 <- "who was loved by everyone who looked at her" 是否有一种有效的方法来完成矩阵?我有100多个句子要比较。这是一篇类似但不相等的帖子:试试这个: CommonWordsMatrixOld <- function(vList)
text1 <- "Once upon a time"
text2 <- "there was a dear little girl"
text3 <- "who was loved by everyone who looked at her"
是否有一种有效的方法来完成矩阵?我有100多个句子要比较。这是一篇类似但不相等的帖子:试试这个:
CommonWordsMatrixOld <- function(vList) {
v <- lapply(vList, tolower)
do.call(rbind, lapply(v, function(x) {
xSplit <- strsplit(x, " ")[[1]]
do.call(c, lapply(v, function(y) length(intersect(xSplit, strsplit(y, " ")[[1]]))))
}))
}
myText <- list(text1, text2, text3)
对于OP请求的数据大小,它的速度相当快。获得的数据如下:
更新
这是一个更快的算法,它可以减少许多不必要的操作,并利用lower.tri
,同时保持非常通用性
CommonWordsMatrixNew <- function(vList) {
v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]]))
s <- length(v)
m <- do.call(rbind, lapply(1L:s, function(x) {
c(rep(0L,(x-1L)), do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]])))))
}))
m[lower.tri(m)] <- t(m)[lower.tri(m)]
m
}
新算法将调用
strsplit
的次数减少了n^2-n
次(例如,在上面的示例中,strplit
在原始算法中被调用10000次,在更新版本中仅被调用100次)。此外,由于生成的矩阵是对称的,因此不需要多次计算每个句子之间的交互作用,因此lappy
函数中的x=1:s
和y=x:s
。这些循环的计算数量从n^2
减少到=(n*(n+1)/2)
(例如,在上面的示例中,从10000
减少到5050
)。在那之后,我们依赖于R
中的索引功能,它通常比手工制造快得多。我发现预先拆分会提高速度,因此
CommonWordsMatrix <- function(vList) {
v <- lapply(vList, tolower)
do.call(rbind, lapply(v, function(x) {
do.call(c, lapply(v, function(y) length(intersect(x, y))))
}))
}
CommonWordsMatrix非常感谢,下面我根据您的回复发布了一些内容,似乎是faster@pachamaltese,第一种算法是进行不必要的计算。我修改了我原来的算法,以减少许多操作。此外,上述算法仍然通用(即,它们不依赖于预分割向量)。顺便说一句,问得好。说得好!我在函数本身之前使用strsplit进行了预拆分
CommonWordsMatrixOld(myText)
[,1] [,2] [,3]
[1,] 4 1 0
[2,] 1 6 1
[3,] 0 1 8
testWords <- read.csv("4000-most-common-english-words-csv.csv", stringsAsFactors = FALSE)
set.seed(1111)
myTestText <- lapply(1:100, function(x) {
paste(testWords[sample(1000:1020, sample(30, 1), replace = TRUE),],collapse = " ")
})
myTestText[[15]]
[1] "access restaurant video opinion video eventually fresh eventually
reform credit publish judge Senate publish fresh restaurant publish
version Senate critical release recall relation version"
system.time(test1 <- CommonWordsMatrixOld(myTestText))
user system elapsed
0.625 0.009 0.646
test1[1:10,1:10]
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 9 3 5 1 3 4 4 2 2 1
[2,] 3 5 3 1 1 3 3 0 0 1
[3,] 5 3 12 0 3 8 4 3 2 1
[4,] 1 1 0 1 0 0 1 0 0 0
[5,] 3 1 3 0 4 2 1 1 1 0
[6,] 4 3 8 0 2 13 7 4 1 1
[7,] 4 3 4 1 1 7 10 4 1 1
[8,] 2 0 3 0 1 4 4 7 3 0
[9,] 2 0 2 0 1 1 1 3 4 0
[10,] 1 1 1 0 0 1 1 0 0 2
CommonWordsMatrixNew <- function(vList) {
v <- lapply(vList, function(x) tolower(strsplit(x, " ")[[1]]))
s <- length(v)
m <- do.call(rbind, lapply(1L:s, function(x) {
c(rep(0L,(x-1L)), do.call(c, lapply(x:s, function(y) length(intersect(v[[x]], v[[y]])))))
}))
m[lower.tri(m)] <- t(m)[lower.tri(m)]
m
}
microbenchmark(New=CommonWordsMatrixNew(myTestText),
Old=CommonWordsMatrixOld(myTestText),
Pach=CommonWordsMatrixPach(PreSplit1), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
New 78.64434 79.07127 86.10754 79.72828 81.39679 137.0695 10
Old 321.49031 323.89835 326.61801 325.75221 328.50877 335.3306 10
Pach 138.34742 143.00504 145.35147 145.17376 148.34699 151.5535 10
identical(CommonWordsMatrixNew(myTestText), CommonWordsMatrixOld(myTestText), CommonWordsMatrixPach(PreSplit1))
[1] TRUE
CommonWordsMatrix <- function(vList) {
v <- lapply(vList, tolower)
do.call(rbind, lapply(v, function(x) {
do.call(c, lapply(v, function(y) length(intersect(x, y))))
}))
}