Python: matching similar characters/words
I have the following data frame, which contains columns X and Y:
    X                                Y
1   SAN DIEGO                        FOND DU LAC
2   THE RIO GRANDE                   RIO GRANDE
3   RIO GRANDE                       RIO GRANDE
4   WEST TENNESSEE                   TENNESSEE
5   EP De SAN JOAQUIN                De SAN JOAQUIN
6   SOUTHERN VIRGINIA                VIRGINIA
7   SOUTHERN VIRGINIA                SOUTHWESTERN VIRGINIA
8   EN COLOMBIA                      COLOMBIA
9   THE EP De NORTHERN CALIFORNIA    De NORTHERN CALIFORNIA
10  FLORIDA                          NEW JERSY
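I want to get the rows that do not match, which are rows 1 and 10. Rows 2-9 are matches or near matches, so they are fine. My expected data frame is: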
    X            Y
1   SAN DIEGO    FOND DU LAC
10  FLORIDA      NEW JERSY
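In R, we split the strings in each column on whitespace (strsplit), check whether the two word lists have any words in common (intersect), take the length of each intersection (lengths), and keep the rows where that length is 0: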
df1[!lengths(Map(intersect, strsplit(df1$X, "\\s+"), strsplit(df1$Y, "\\s+"))),]
# X Y
#1 SAN DIEGO FOND DU LAC
#10 FLORIDA NEW JERSY
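Since the question is tagged Python as well, the same word-intersection idea can be sketched there. This is only an illustration using the standard library: the `rows` list and the `shares_word` helper are hypothetical names introduced here, not part of the original answer, and the data is copied from the example table.

```python
# Rows of (row_id, X, Y), copied from the example data in the question.
rows = [
    (1, "SAN DIEGO", "FOND DU LAC"),
    (2, "THE RIO GRANDE", "RIO GRANDE"),
    (3, "RIO GRANDE", "RIO GRANDE"),
    (4, "WEST TENNESSEE", "TENNESSEE"),
    (5, "EP De SAN JOAQUIN", "De SAN JOAQUIN"),
    (6, "SOUTHERN VIRGINIA", "VIRGINIA"),
    (7, "SOUTHERN VIRGINIA", "SOUTHWESTERN VIRGINIA"),
    (8, "EN COLOMBIA", "COLOMBIA"),
    (9, "THE EP De NORTHERN CALIFORNIA", "De NORTHERN CALIFORNIA"),
    (10, "FLORIDA", "NEW JERSY"),
]


def shares_word(x, y):
    """Return True if x and y have at least one whitespace-separated word in common."""
    return bool(set(x.split()) & set(y.split()))


# Keep only the rows whose X and Y share no word at all.
unmatched = [row for row in rows if not shares_word(row[1], row[2])]

for row_id, x, y in unmatched:
    print(row_id, x, y)
```

The comparison is case-sensitive and matches whole words only, just like the R version, so a misspelling such as "JERSY" would not match "JERSEY". With pandas, the same helper could presumably be applied row by row to build a boolean mask.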
We can also loop over the columns with lapply to do the split, then pass the resulting lists to Map via do.call:
df1[!lengths(do.call(Map, c(intersect, unname(lapply(df1, strsplit, split="\\s+"))))),]
# X Y
#1 SAN DIEGO FOND DU LAC
#10 FLORIDA NEW JERSY
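Another option is the stringdist package: compute the q-gram distance between X and Y for each row, then keep the rows with the two largest distances: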
library(stringdist)
i1 <- with(df1, stringdist(X, Y, method = "qgram"))
df1[i1 %in% tail(sort(i1), 2),]
# X Y
#1 SAN DIEGO FOND DU LAC
#10 FLORIDA NEW JERSY
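For readers working in Python, the q-gram distance can be approximated with the standard library. The sketch below assumes the distance is the sum of absolute differences between q-gram counts with q = 1 (single characters, spaces included), which is my understanding of the stringdist default. `qgram_distance`, `rows`, and `top_two` are hypothetical names introduced for illustration.

```python
from collections import Counter

# Rows of (row_id, X, Y), copied from the example data in the question.
rows = [
    (1, "SAN DIEGO", "FOND DU LAC"),
    (2, "THE RIO GRANDE", "RIO GRANDE"),
    (3, "RIO GRANDE", "RIO GRANDE"),
    (4, "WEST TENNESSEE", "TENNESSEE"),
    (5, "EP De SAN JOAQUIN", "De SAN JOAQUIN"),
    (6, "SOUTHERN VIRGINIA", "VIRGINIA"),
    (7, "SOUTHERN VIRGINIA", "SOUTHWESTERN VIRGINIA"),
    (8, "EN COLOMBIA", "COLOMBIA"),
    (9, "THE EP De NORTHERN CALIFORNIA", "De NORTHERN CALIFORNIA"),
    (10, "FLORIDA", "NEW JERSY"),
]


def qgram_distance(a, b, q=1):
    """Sum of absolute differences between the q-gram counts of a and b."""
    grams_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    grams_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    # Counter returns 0 for q-grams that occur in only one of the strings.
    return sum(abs(grams_a[g] - grams_b[g]) for g in grams_a.keys() | grams_b.keys())


# Distance per row, keyed by row id.
distances = {row_id: qgram_distance(x, y) for row_id, x, y in rows}

# Ids of the two rows with the largest distance, in ascending id order.
top_two = sorted(sorted(distances, key=distances.get)[-2:])
print(top_two)
```

Picking the two largest distances only works because we already know there are two non-matching rows. On this data row 6 (distance 9) comes close to row 1 (distance 10), so a fixed cutoff would be fragile on other inputs. Unlike the R `%in%` filter, this sketch also returns exactly two rows even when distances tie.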