Summing rows that contain strings from a reference table in R

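For a list of strings stored as rows of one table, I want to determine how often each string occurs in the rows of another data table in R. At the same time, I want to sum the values of the rows that contain each string.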

For example, my reference table containing the list of strings looks like this:

+-----------------------------+
|String                       |
+-----------------------------+
|Dixon                        |
+-----------------------------+
|Nina Kraviz                  |
+-----------------------------+
|DJ Tennis                    |
+-----------------------------+
The table I want to analyze looks like this:

+--------------------------------+
|String                |Score    |
+--------------------------------+
|Nina Kraviz @ Hyde    |100      |
+--------------------------------+
|DJ Tennis?            |200      |
+--------------------------------+
|From Dixon            |100      |
+--------------------------------+
|From Kevin Saunderson |100      |
+--------------------------------+
|Dixon                 |300      |
+--------------------------------+
|Nina Kraviz           |200      |
+--------------------------------+
I would like the resulting table to look like this:

+---------------------------------+
|String             |Score        |
+---------------------------------+
|Dixon              |400          |
+---------------------------------+
|Nina Kraviz        |300          |
+---------------------------------+
|DJ Tennis          |200          |
+---------------------------------+

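I tried n-grams and tokenization, but that is not straightforward because an artist's name can consist of one, two, or three words. Any help would be much appreciated.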

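We can match the rows of the second data.frame against the reference strings by partial matching, keep only the rows that match, and sum Score per matched string: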

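library(dplyr)
library(stringr)
# Build one alternation pattern from the reference strings, wrapped in word boundaries
pat <- str_c("\\b(", str_c(df1$String, collapse = "|"), ")\\b")
df2 %>%
  # Replace each String with the reference name it contains (NA if none)
  group_by(String = str_extract(String, pat)) %>%
  # Drop the rows that contain no reference name
  filter(!is.na(String)) %>%
  # Total the scores for each reference name
  summarise(Score = sum(Score, na.rm = TRUE))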
# A tibble: 3 x 2
#  String      Score
#  <chr>       <dbl>
#1 Dixon         400
#2 DJ Tennis     200
#3 Nina Kraviz   300
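
The question also asks for the frequency of each string, which the pipeline above does not report. The following is a minimal base-R sketch of one way to get both the count and the sum. The data frames are retyped from the tables in the question, and the helper name summarise_one and the column name Freq are hypothetical, chosen only for illustration.

```r
# Reference table and scored table, retyped from the tables in the question.
# The names df1 and df2 follow the answer; the contents are an assumption
# based on those tables.
df1 <- data.frame(String = c("Dixon", "Nina Kraviz", "DJ Tennis"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  String = c("Nina Kraviz @ Hyde", "DJ Tennis?", "From Dixon",
             "From Kevin Saunderson", "Dixon", "Nina Kraviz"),
  Score  = c(100, 200, 100, 100, 300, 200),
  stringsAsFactors = FALSE
)

# For one reference string, find the rows of df2 that contain it as a whole
# word or phrase, then return the number of matching rows and their total score.
# The name is used as a regular expression as-is, so names that contain regex
# metacharacters would need to be escaped first.
summarise_one <- function(name) {
  hit <- grepl(paste0("\\b", name, "\\b"), df2$String, perl = TRUE)
  data.frame(String = name,
             Freq   = sum(hit),
             Score  = sum(df2$Score[hit]),
             stringsAsFactors = FALSE)
}

# One result row per reference string, in the order of df1
result <- do.call(rbind, lapply(df1$String, summarise_one))
print(result)
```

With this data, Dixon matches 2 rows (total 400), Nina Kraviz matches 2 rows (total 300), and DJ Tennis matches 1 row (total 200). Unlike str_extract, which keeps only the first match in each row, this version counts a row once for every reference string it contains, so a row naming two artists would contribute to both totals.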
Data
df1

Please share an example, including a small sample of the data you are using.