Sum rows that contain strings from a reference table in R
For a list of strings stored as rows in a table, I want to determine how often those strings appear in the rows of another data table in R. At the same time, I want to sum the values of the rows that contain those strings. For example, my reference table containing the list of strings looks like this:
+-----------------------------+
|String |
+-----------------------------+
|Dixon |
+-----------------------------+
|Nina Kraviz |
+-----------------------------+
|DJ Tennis |
+-----------------------------+
The table I want to analyze looks like this:
+--------------------------------+
|String |Score |
+--------------------------------+
|Nina Kraviz @ Hyde |100 |
+--------------------------------+
|DJ Tennis? |200 |
+--------------------------------+
|From Dixon |100 |
+--------------------------------+
|From Kevin Saunderson |100 |
+--------------------------------+
|Dixon |300 |
+--------------------------------+
|Nina Kraviz |200 |
+--------------------------------+
I would like the resulting table to look like this:
+---------------------------------+
|String |Score |
+---------------------------------+
|Dixon |400 |
+---------------------------------+
|Nina Kraviz |300 |
+---------------------------------+
|DJ Tennis |200 |
+---------------------------------+
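I tried using n-grams and tokenization, but that is not straightforward, because an artist's name can contain 1, 2, or 3 words. Any help would be greatly appreciated.

We can match each row of the second data.frame against the reference strings by partial match, drop the rows that match none of them, and then sum Score per matched string: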
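library(dplyr)
library(stringr)
# Build one regex that matches any reference name as a whole word
pat <- str_c("\\b(", str_c(df1$String, collapse="|"), ")\\b")
df2 %>%
  # Group by the reference name found in each String (NA if none is found)
  group_by(String = str_extract(String, pat)) %>%
  # Drop the rows that matched no reference name
  filter(!is.na(String)) %>%
  summarise(Score = sum(Score, na.rm = TRUE))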
# A tibble: 3 x 2
# String Score
# <chr> <dbl>
#1 Dixon 400
#2 DJ Tennis 200
#3 Nina Kraviz 300
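The question also asks for the frequency of each string, which the summary above does not report. The following is a minimal sketch of one way to get both numbers, not a definitive implementation. It assumes df1 and df2 look like the tables in the question (they are rebuilt here from those tables), the Frequency column name and the escape_regex helper are hypothetical additions, and the names are escaped and sorted longest first on the assumption that a reference name might contain regex metacharacters or be a prefix of another name.

```r
library(dplyr)
library(stringr)

# Data rebuilt from the tables in the question (an assumption about the
# original objects; only the column names String and Score are known).
df1 <- data.frame(String = c("Dixon", "Nina Kraviz", "DJ Tennis"))
df2 <- data.frame(
  String = c("Nina Kraviz @ Hyde", "DJ Tennis?", "From Dixon",
             "From Kevin Saunderson", "Dixon", "Nina Kraviz"),
  Score = c(100, 200, 100, 100, 300, 200)
)

# Hypothetical helper: backslash-escape regex metacharacters so that a
# name is matched literally.
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

# Put longer names first so the alternation prefers the longest name
# when one reference name is a prefix of another.
ref <- df1$String[order(-nchar(df1$String))]
pat <- str_c("\\b(", str_c(escape_regex(ref), collapse = "|"), ")\\b")

result <- df2 %>%
  # Replace each String with the reference name found in it (NA if none)
  mutate(String = str_extract(String, pat)) %>%
  filter(!is.na(String)) %>%
  group_by(String) %>%
  # n() counts the matching rows; sum() adds up their scores
  summarise(Frequency = n(), Score = sum(Score, na.rm = TRUE))

print(result)
```

On the question's data this gives Dixon a frequency of 2 and a score of 400, Nina Kraviz a frequency of 2 and a score of 300, and DJ Tennis a frequency of 1 and a score of 200. Note that str_extract returns only the first match in each row, so a row that mentions two reference names is counted once, under the name that appears first.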
Data
df1
Please share an example, including a small sample of the data you are using.