R: Find the strings in data frame A that contain a substring from data frame B


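I have two data frames, A and B. A holds complete sentences, and B holds recurring phrases that I am looking for. I want to find every row of A that contains a string (or partial string) from data frame B. For example:

Data frame A has: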

    "Sally is great"
     "John is great"
  "Sally likes peas"
 "John likes onions"
  "Jane is in Paris"
"Archie is in Paris"
Data frame B has:

"in Paris"
"is great"
The output would be:

    "Sally is great"
     "John is great"
  "Jane is in Paris"
"Archie is in Paris"
because these rows contain a string/substring from data frame B.

This is the equivalent of x LIKE '%substring%' in SQL, but applied to a whole set of substrings.


I have nearly 2 million rows in A and about 300,000 rows in B. I considered using str_match in a loop, but given the size of the data that may not be a viable solution.

One approach is to loop over the elements of the smaller set and use grep to check whether each element occurs in the larger set:

big = c("Sally is great",
        "John is great",
        "Sally likes peas",
        "John likes onions",
        "Jane is in Paris",
        "Archie is in Paris")
small = c("in Paris",
          "is great")

big[unlist(lapply(small, function(a) grep(a, big)))]
#[1] "Jane is in Paris"   "Archie is in Paris" "Sally is great"     "John is great"     
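
Two things are worth noting about the loop above: grep treats each phrase as a regular expression, and a sentence that matches several phrases is returned once per match. A minimal sketch of a variant, assuming the phrases should be matched literally and each sentence reported at most once, keeps a single logical vector and only re-checks sentences that have not matched yet. The names hit and todo are illustrative only.

```r
big <- c("Sally is great",
         "John is great",
         "Sally likes peas",
         "John likes onions",
         "Jane is in Paris",
         "Archie is in Paris")
small <- c("in Paris",
           "is great")

# TRUE once a sentence has matched at least one phrase
hit <- logical(length(big))

for (p in small) {
  todo <- !hit              # sentences that have not matched yet
  if (!any(todo)) break     # everything already matched, stop early
  # fixed = TRUE matches the phrase literally instead of as a regex
  hit[todo] <- grepl(p, big[todo], fixed = TRUE)
}

matched <- big[hit]         # original order, no duplicates
print(matched)
```

Because the result is built by logical indexing, the matches come back in the order of big rather than grouped by phrase.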

We can use stri_detect from stringi:

library(stringi)
big[stri_detect(big, regex = paste(small, collapse="|"))]
#[1] "Sally is great"     "John is great"      "Jane is in Paris"  
#[4] "Archie is in Paris"
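
Joining every phrase with "|" builds one regular expression, so any phrase containing characters such as ".", "(" or "|" would change its meaning, and with roughly 300,000 phrases the pattern becomes very long. The following is a sketch only, swapping in base R's grepl with perl = TRUE in place of stringi: each phrase is wrapped in \Q...\E so it is read literally, and the phrases are processed in chunks. The chunk_size value is an arbitrary assumption that would need tuning, and the quoting assumes no phrase itself contains the sequence \E.

```r
big <- c("Sally is great",
         "John is great",
         "Sally likes peas",
         "John likes onions",
         "Jane is in Paris",
         "Archie is in Paris")
small <- c("in Paris",
           "is great")

chunk_size <- 1000L  # assumed value, tune for the real data

# Split the phrases into groups of at most chunk_size elements
chunks <- split(small, ceiling(seq_along(small) / chunk_size))

hit <- logical(length(big))
for (chunk in chunks) {
  # \Q...\E quotes each phrase so regex metacharacters are taken literally
  pattern <- paste0("\\Q", chunk, "\\E", collapse = "|")
  hit <- hit | grepl(pattern, big, perl = TRUE)
}

matched <- big[hit]
print(matched)
```

Whether chunked alternation is faster than a phrase-by-phrase loop on data of this size is something to measure rather than assume.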
OK, after some more searching I realized that str_match_all might do the job, so I am testing it. Are there any other functions/packages I could look at?

Data
big <- c("Sally is great",
    "John is great",
    "Sally likes peas",
    "John likes onions",
    "Jane is in Paris",
    "Archie is in Paris")
small <- c("in Paris",
      "is great")
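
The question talks about data frames while the answers work on plain vectors. A small sketch of how the same idea might be applied to data frames, and extended to record which phrase matched each row, is shown here. The column names text, phrase and matched are hypothetical, since the question does not give them. The match matrix has one row per sentence and one column per phrase, so this form is only practical for small inputs or for one chunk of a large input at a time.

```r
A <- data.frame(text = c("Sally is great",
                         "John is great",
                         "Sally likes peas",
                         "John likes onions",
                         "Jane is in Paris",
                         "Archie is in Paris"),
                stringsAsFactors = FALSE)
B <- data.frame(phrase = c("in Paris",
                           "is great"),
                stringsAsFactors = FALSE)

# Logical matrix: rows are sentences of A, columns are phrases of B
m <- vapply(B$phrase,
            function(p) grepl(p, A$text, fixed = TRUE),
            logical(nrow(A)))

# For each sentence, list the phrases it contains (empty string if none)
A$matched <- apply(m, 1, function(r) paste(B$phrase[r], collapse = "; "))

# Keep only the rows that matched at least one phrase
result <- A[rowSums(m) > 0, , drop = FALSE]
print(result)
```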