R: detect multiple strings from one df within the strings of another df and, if detected, return the detected strings


r string tags


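I am learning to use R, so please bear with me.

I have a dataset of Google Play Store apps (master_tib). Each row is one Play Store app. There is a column titled Description that contains text about what the app does.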

master_tib

App     Description
App1    Reduce your depression and anxiety
App2    Help your depression 
App3    This app helps with Anxiety 
App4    Dog walker app 3000

I also have a df of tags (master_tags) containing important words that I have predefined. There is a column titled Tag, and each row contains one tag.

master_tag

Tag
Depression
Anxiety
Stress
Mood

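My goal is to tag the apps in the master_tib df with the tags from the master_tags df, based on whether each tag appears in the description. The matching tags would then be printed in a new column. The end result would be a master_tib df that looks like this: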

App     Description                            Tag
App1    Reduce your depression and anxiety     depression, anxiety
App2    Help your depression                   depression
App3    This app helps with anxiety            anxiety
App4    Dog walker app 3000                    FALSE

Here is what I have done so far, using a combination of str_detect and mapply:

# define function to use in mapply

detect_tag <- function(description, tag){ 
  if(str_detect(description, tag, FALSE)) {
    return (tag)
  } else { 
    return (FALSE)
  }
}

index <-  mapply(FUN = detect_tag, description = master_tib$description, master_tags$tag)

master_tib[index,]

This does not give me the desired result:

App     Description                            Tag
App1    Reduce your depression and anxiety     depression, anxiety

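I have not yet gotten as far as printing the result to a new column. I would love to hear anyone's insights or ideas, and I apologize in advance for my poor R skills.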

You can use str_c to combine the words in master_tag into a single pattern, and use str_extract_all to get all the words that match that pattern:
library(stringr)
master_tib$Tag <- sapply(str_extract_all(tolower(master_tib$Description), 
              str_c('\\b', tolower(master_tag$Tag), '\\b', collapse = "|")), 
              function(x) toString(unique(x)))
master_tib$Tag
#[1] "depression, anxiety" "depression"          "anxiety"             "" 
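
To see what the '\\b' word boundaries in the pattern above contribute, here is a small sketch. The descriptions, the tag vector, and the helper extract_tags are made up for illustration and are not part of the original data.

```r
library(stringr)

# Hypothetical descriptions and tags, chosen to show partial-word matches
desc <- c("Track your mood daily", "For moody teenagers", "Stress-free stressful days")
tags <- c("Mood", "Stress")

# Pattern without word boundaries: also matches inside longer words
loose <- str_c(tolower(tags), collapse = "|")
# Pattern with word boundaries, built the same way as in the answer above
strict <- str_c("\\b", tolower(tags), "\\b", collapse = "|")

# Hypothetical helper: extract matches and keep each one once per description
extract_tags <- function(text, pattern) {
  sapply(str_extract_all(tolower(text), pattern),
         function(x) toString(unique(x)))
}

extract_tags(desc, loose)
# [1] "mood"   "mood"   "stress"
extract_tags(desc, strict)
# [1] "mood"   ""       "stress"
```

Without the boundaries, "moody" is tagged as "mood"; with them, only whole-word matches are returned.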

Similar to @Ronak Shah's answer, but in base R:
apply(
  sapply(master_tag$Tag, grepl, master_tib$Description, ignore.case = TRUE),
  1, function(a) paste(master_tag$Tag[a], collapse = ","))
# [1] "Depression,Anxiety" "Depression"         "Anxiety"           
# [4] ""                  

(This comes without the lowercase or "comma-space" niceties, which can easily be added if needed.)
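
As a rough sketch of how the lowercase and "comma-space" formatting could be added to the base R approach: the data frames below are rebuilt from the tables in the question, and the intermediate name hits is only illustrative. Note that grepl matches substrings here, not whole words.

```r
# Data rebuilt from the tables shown in the question
master_tib <- data.frame(
  App = c("App1", "App2", "App3", "App4"),
  Description = c("Reduce your depression and anxiety",
                  "Help your depression",
                  "This app helps with Anxiety",
                  "Dog walker app 3000"),
  stringsAsFactors = FALSE
)
master_tag <- data.frame(Tag = c("Depression", "Anxiety", "Stress", "Mood"),
                         stringsAsFactors = FALSE)

# Logical matrix: one row per description, one column per tag
hits <- sapply(master_tag$Tag, grepl, master_tib$Description, ignore.case = TRUE)

# For each row, paste the lowercased matching tags, separated by ", "
master_tib$Tag <- apply(hits, 1, function(a) {
  paste(tolower(master_tag$Tag[a]), collapse = ", ")
})
master_tib$Tag
# [1] "depression, anxiety" "depression"          "anxiety"
# [4] ""
```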
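
Using a few packages from the tidyverse (dplyr, stringr, tidyr) and the data shown in @Ronak Shah's answer.
First convert the tags into a pattern:
library(dplyr)
library(stringr)
library(tidyr)

pattern <- master_tag$Tag %>%
  tolower() %>%
  str_c(collapse = "|")

Then find all matches and create the desired output: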
master_tib %>%
  mutate(Tag = str_extract_all(tolower(Description), pattern)) %>%
  unnest(Tag, keep_empty = TRUE) %>%
  group_by(App, Description) %>% 
  summarise(Tag = str_c(Tag, collapse=", "))

This yields
# A tibble: 4 x 3
# Groups:   App [4]
  App   Description                        Tag                
  <chr> <chr>                              <chr>              
1 App1  Reduce your depression and anxiety depression, anxiety
2 App2  Help your depression               depression         
3 App3  This app helps with Anxiety        anxiety            
4 App4  Dog walker app 3000                NA 

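Thanks for your answer. When I applied it to my dataset, the result kept every instance of a matched word, so descriptions that mention depression several times ended up with depression listed repeatedly. Is there a way to remove the duplicates within each row, so that a matched word such as depression appears only once per row?

@jsole Yes, we can get only the unique values. See the updated answer.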
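
For the tidyverse pipeline, one possible way to drop repeated tags within a row is to call distinct() before collapsing. This is only a sketch: the tibble apps and its descriptions are made up so that one description repeats a tag, and the distinct() step is an assumption about how one might deduplicate, not something taken from the answers above.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical data: the first description mentions "depression" twice
apps <- tibble(
  App = c("App1", "App2"),
  Description = c("Depression help: beat depression and anxiety",
                  "Dog walker app 3000")
)
pattern <- str_c(tolower(c("Depression", "Anxiety", "Stress", "Mood")),
                 collapse = "|")

result <- apps %>%
  mutate(Tag = str_extract_all(tolower(Description), pattern)) %>%
  unnest(Tag, keep_empty = TRUE) %>%
  distinct(App, Description, Tag) %>%   # keep each tag once per app
  group_by(App, Description) %>%
  summarise(Tag = str_c(Tag, collapse = ", "), .groups = "drop")

result$Tag
# [1] "depression, anxiety" NA
```

Apps without any match keep NA in the Tag column, as in the output shown earlier.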