R 提高在大向量字符串上计算单词分数总和的性能？_R_String_Performance_Loops_Vectorization

R 提高在大向量字符串上计算单词分数总和的性能？

r string performance loops

R 提高在大向量字符串上计算单词分数总和的性能？,r,string,performance,loops,vectorization,R,String,Performance,Loops,Vectorization,我有一个字符串，如下所示： [1] "What can we learn from the Mahabharata " [2] "What are the most iconic songs associated with the Vietnam War " [3] "What are

我有一个字符串，如下所示：

 [1] "What can we learn from the Mahabharata "                                                                
 [2] "What are the most iconic songs associated with the Vietnam War "                                        
 [3] "What are some major social faux pas to avoid when visiting Malta "                                      
 [4] "Will Ready Boost technology contribute to CFD software usage "                                          
 [5] "Who is Jon Snow " ...

以及为每个单词分配分数的数据框：

   word score
   the    11
    to     9
  What     9
     I     7
     a     6
   are     6

我想给我的每个字符串分配包含在其中的单词的分数之和，我的解决方案是下面的函数

 score_fun<- function(x)

 # obtaining the list of words 

 {z <- unlist(strsplit(x,' ')); 

 # returning the sum of the words' scores     

 return(sum(word_scores$score[word_scores$word %in% z]))} 

 # using sapply() in conjunction with the function  

 scores <- sapply(my_strings, score_fun, USE.NAMES = F)

 # the output will look like 
 scores
 [1] 20 26 24  9  0  0 38 32 30  0

score\u fun考虑创建一个拆分单词的数据框，然后合并到word\u分数，最后按短语id聚合分数。这种方法避免了迭代sapply
循环计算
list_strings <- lapply(my_strings, function(i) unique(unlist(strsplit(s, split=" "))))      

ids <- lapply(seq(length(list_strings)), function(i) rep(i, length(list_strings[[i]])))

phrases_df <- data.frame(id=Reduce(append, ids), word=Reduce(append, list_strings))      
aggdf <- aggregate(score~id, merge(phrases_df, word_scores, by="word"), FUN=sum)

aggdf 
#   id score
# 1  1    20
# 2  2    26
# 3  3    24
# 4  4     9
# 5  7    38
# 6  8    32
# 7  9    30

您可以使用tidytext:：unnest_标记将单词标记化，然后加入并聚合：
库（tidyverse）
图书馆（tidytext）
数据帧（字符串=我的字符串，id=沿（字符串）的顺序））%>%
unnest_标记（单词、字符串、“单词”to_lower=FALSE）%>%
不同的（）%>%
左加入（单词分数）%>%
分组依据（id）%>%
总结（分数=总和（分数，na.rm=真））
#>#A tible:10×2
#>id分数
#>     
#> 1      1    20
#> 2      2    26
#> 3      3    24
#> 4      4     9
#> 5      5     0
#> 6      6     0
#> 7      7    38
#> 8      8    32
#> 9      9    30
#> 10    10     0

如果愿意，保留原始字符串，或者在末尾通过ID重新连接它们
在小数据上，速度要慢得多，但在缩放时速度会更快，例如，当my_strings
重采样到长度10000时：
单位：毫秒
expr最小lq平均uq最大neval
减少5440.03300 5656.41350 5815.2094 5814.0406 5944.9969 6206.2502 100
sapply 460.75930 486.94336 511.2762 503.4932 532.2363 746.8376 100
tidytext 86.92182 94.65745 101.7064 100.1487 107.3289 134.7276 100
当你说“我有一百万个单词”时，在你的查找表中有多少分数不为零的唯一单词？另外，看起来您正在尝试手动实现一个单词向量器，这些已经存在（在R、Python、word2vec等中），请看一看。你也许能够以词向量的形式存储你的输入句子；词序不重要，我想你不在乎计数？（多次出现？）查找例如“单词的矢量表示”。在这方面有大量的现有工作和软件包。如果您想获得实际的性能比较数字，请添加一个特定的代码读取，以及您的word分数示例查找表（/字典）。感谢您的回答，它似乎无法满足我的需要（除非我遗漏了什么），总分数应该分配给每个字符串，答案的第一个命令的输出<代码>列出字符串我喜欢这个解决方案，通过步骤<代码>左键连接（单词分数）%>%出于某种原因不断破坏我的Rstudio，即使是单独完成：（R，或RStudio？有很多理由R可以爆炸出来，但也许可以尝试更新到和全新的R 3.4.0，以防它是一个已经修补过的bug。我想我发现了问题，连接是在角色和因素之间（通常不是什么大问题）.我解决了这个问题，现在它可以工作了，而且速度非常快，我无法获得准确的时间，因为Sys.time（）由于某种原因无法工作。thanx！
list_strings <- lapply(my_strings, function(i) unique(unlist(strsplit(s, split=" "))))      

ids <- lapply(seq(length(list_strings)), function(i) rep(i, length(list_strings[[i]])))

phrases_df <- data.frame(id=Reduce(append, ids), word=Reduce(append, list_strings))      
aggdf <- aggregate(score~id, merge(phrases_df, word_scores, by="word"), FUN=sum)

aggdf 
#   id score
# 1  1    20
# 2  2    26
# 3  3    24
# 4  4     9
# 5  7    38
# 6  8    32
# 7  9    30

library(micorbenchmark)

microbenchmark({
   list_strings <- lapply(my_strings, function(i) unique(unlist(strsplit(s, split=" "))))

   ids <- lapply(seq(length(list_strings)), function(i) rep(i, length(list_strings[[i]])))

   phrases_df <- data.frame(id=Reduce(append, ids), word=Reduce(append, list_strings))      
   aggdf <- aggregate(score~id, merge(phrases_df, word_scores, by="word"), FUN=sum)

})

# Unit: milliseconds
#      min       lq     mean   median       uq      max neval
# 5.623328 5.808831 6.177336 5.964018 6.252019 10.09706   100

microbenchmark({
  score_fun<- function(x) {
     z <- unlist(strsplit(x,' '))
     return(sum(word_scores$score[word_scores$word %in% z]))
  } 
  scores <- sapply(my_strings, score_fun, USE.NAMES = F)
})

# Unit: microseconds
#       min      lq     mean  median       uq     max neval
# 809.382 843.307 1005.366 865.442 1209.983 1873.32   100