R: Find the strings in data frame A that contain a substring appearing in data frame B
I have two data frames, A and B. A contains full sentences, and B contains the recurring phrases I am looking for. I want to find all rows in A that contain a string or partial string present in data frame B. For example, data frame A has:
"Sally is great"
"John is great"
"Sally likes peas"
"John likes onions"
"Jane is in Paris"
"Archie is in Paris"
Data frame B has:
"in Paris"
"is great"
The output would be:
"Sally is great"
"John is great"
"Jane is in Paris"
"Archie is in Paris"
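because each of these rows contains a string/substring from data frame B.
This is the equivalent of x LIKE '%substring%' in SQL, but applied to a whole set of substrings.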
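I have nearly 2 million rows in A and about 300,000 rows in B. I considered using str_match in a loop, but given the size of the data that may not be a viable solution.

One approach is to iterate over the elements of the smaller set and use grep to check whether each element occurs in the larger set: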
big = c("Sally is great",
"John is great",
"Sally likes peas",
"John likes onions",
"Jane is in Paris",
"Archie is in Paris")
small = c("in Paris",
"is great")
big[unlist(lapply(small, function(a) grep(a, big)))]
#[1] "Jane is in Paris" "Archie is in Paris" "Sally is great" "John is great"
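The grep approach above returns rows in pattern order, would repeat a row that matches more than one phrase, and treats each phrase as a regular expression. A possible variation, offered only as a sketch under the assumption that the phrases should be matched literally, keeps a logical vector of hits and uses fixed = TRUE. The names hit, todo, and matched are illustrative and not part of the original answer.

```r
big <- c("Sally is great", "John is great", "Sally likes peas",
         "John likes onions", "Jane is in Paris", "Archie is in Paris")
small <- c("in Paris", "is great")

# TRUE where a sentence contains at least one phrase.
hit <- logical(length(big))
for (p in small) {
  todo <- which(!hit)                 # only search rows not matched yet
  if (length(todo) == 0L) break       # every row already matched
  # fixed = TRUE: treat the phrase as a literal string, not a regex
  hit[todo[grepl(p, big[todo], fixed = TRUE)]] <- TRUE
}

matched <- big[hit]                   # original row order, no duplicates
print(matched)
```

Skipping rows that have already matched means later phrases scan a shrinking set, which may help when many rows match early. Whether this is fast enough for 2 million rows and 300,000 phrases would need to be measured.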
We can use stri_detect from the stringi package:
library(stringi)
big[stri_detect(big, regex = paste(small, collapse="|"))]
#[1] "Sally is great" "John is great" "Jane is in Paris"
#[4] "Archie is in Paris"
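With about 300,000 phrases, a single alternation pattern may become very large, and any regex metacharacters in the phrases would change the match. The sketch below is one possible adaptation rather than the answer's own method: it splits the phrases into chunks and quotes each phrase with \Q...\E so that it is matched literally. The helper detect_any and its chunk_size parameter are hypothetical names introduced here, and it assumes no phrase itself contains the sequence \E.

```r
library(stringi)

# Hypothetical helper: TRUE for each element of `big` that contains
# at least one element of `small` as a literal substring.
detect_any <- function(big, small, chunk_size = 1000L) {
  hit <- logical(length(big))
  # Split the phrases into groups of at most `chunk_size`.
  chunks <- split(small, ceiling(seq_along(small) / chunk_size))
  for (chunk in chunks) {
    # \Q...\E quotes each phrase so regex metacharacters are literal.
    pattern <- paste0("\\Q", chunk, "\\E", collapse = "|")
    hit <- hit | stri_detect_regex(big, pattern)
  }
  hit
}

big <- c("Sally is great", "John is great", "Sally likes peas",
         "John likes onions", "Jane is in Paris", "Archie is in Paris")
small <- c("in Paris", "is great")

res <- big[detect_any(big, small, chunk_size = 1L)]
print(res)
```

The chunk size is a tuning knob: larger chunks mean fewer passes over big but bigger patterns, so a suitable value would have to be found by timing on the real data.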
OK, after some more searching I realized that str_match_all might do the job, so I am testing it. Are there other functions or packages I could look at?

Data
big <- c("Sally is great",
"John is great",
"Sally likes peas",
"John likes onions",
"Jane is in Paris",
"Archie is in Paris")
small <- c("in Paris",
"is great")