Python: matching similar characters/words (Python, R)

I have the following data frame with columns X and Y:

    X                                   Y
1   SAN DIEGO                           FOND DU LAC
2   THE RIO GRANDE                      RIO GRANDE
3   RIO GRANDE                          RIO GRANDE
4   WEST TENNESSEE                      TENNESSEE
5   EP De SAN JOAQUIN                   De SAN JOAQUIN
6   SOUTHERN VIRGINIA                   VIRGINIA
7   SOUTHERN VIRGINIA                   SOUTHWESTERN VIRGINIA
8   EN COLOMBIA                         COLOMBIA
9   THE EP De NORTHERN CALIFORNIA       De NORTHERN CALIFORNIA
10  FLORIDA                             NEW JERSY
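I want to get the rows that do not match, which are rows 1 and 10. Rows 2-9 are matches or near matches, and those are fine. My expected data frame is: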

    X                                   Y
1   SAN DIEGO                           FOND DU LAC
10  FLORIDA                             NEW JERSY

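In R, we can split the strings in each column on whitespace, check whether the two word lists have any words in common with intersect, take the lengths of the resulting list, and keep the rows of the dataset where that length is 0: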

df1[!lengths(Map(intersect, strsplit(df1$X, "\\s+"), strsplit(df1$Y, "\\s+"))),]
#          X           Y
#1  SAN DIEGO FOND DU LAC
#10   FLORIDA   NEW JERSY
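
Since the question is also tagged Python, the same word-intersection idea can be sketched with the standard library alone. This is only an illustration, not part of the original answer: the rows list and the helper shares_no_word are hypothetical names, and the data is copied from the question.

```python
# Rows from the question as (row number, X, Y) tuples.
rows = [
    (1, "SAN DIEGO", "FOND DU LAC"),
    (2, "THE RIO GRANDE", "RIO GRANDE"),
    (3, "RIO GRANDE", "RIO GRANDE"),
    (4, "WEST TENNESSEE", "TENNESSEE"),
    (5, "EP De SAN JOAQUIN", "De SAN JOAQUIN"),
    (6, "SOUTHERN VIRGINIA", "VIRGINIA"),
    (7, "SOUTHERN VIRGINIA", "SOUTHWESTERN VIRGINIA"),
    (8, "EN COLOMBIA", "COLOMBIA"),
    (9, "THE EP De NORTHERN CALIFORNIA", "De NORTHERN CALIFORNIA"),
    (10, "FLORIDA", "NEW JERSY"),
]


def shares_no_word(x, y):
    """Return True when x and y have no whitespace-separated word in common."""
    return set(x.split()).isdisjoint(y.split())


# Keep only the rows whose X and Y share no word (hypothetical helper above).
unmatched = [row for row in rows if shares_no_word(row[1], row[2])]

for number, x, y in unmatched:
    print(number, x, y, sep="\t")
# 1	SAN DIEGO	FOND DU LAC
# 10	FLORIDA	NEW JERSY
```

The comparison is case-sensitive and needs an exact word match, so a misspelled word would count as a mismatch.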

We can also loop over the columns and do the split:

df1[!lengths(do.call(Map, c(intersect, unname(lapply(df1, strsplit, split="\\s+"))))),]
#      X           Y
#1  SAN DIEGO FOND DU LAC
#10   FLORIDA   NEW JERSY

Or another option is stringdist, keeping the rows with the two largest q-gram distances:

library(stringdist)
i1 <- with(df1, stringdist(X, Y, method = "qgram"))
df1[i1 %in% tail(sort(i1), 2),]
#          X           Y
#1  SAN DIEGO FOND DU LAC
#10   FLORIDA   NEW JERSY
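
A rough Python counterpart of the distance-based approach is sketched below. It assumes the q-gram distance is the sum of absolute differences between the q-gram counts of the two strings, with q = 1 as in stringdist's default; qgram_distance and pairs are hypothetical names introduced for illustration, not a port of the library.

```python
from collections import Counter

# (X, Y) pairs from the question, in row order 1..10.
pairs = [
    ("SAN DIEGO", "FOND DU LAC"),
    ("THE RIO GRANDE", "RIO GRANDE"),
    ("RIO GRANDE", "RIO GRANDE"),
    ("WEST TENNESSEE", "TENNESSEE"),
    ("EP De SAN JOAQUIN", "De SAN JOAQUIN"),
    ("SOUTHERN VIRGINIA", "VIRGINIA"),
    ("SOUTHERN VIRGINIA", "SOUTHWESTERN VIRGINIA"),
    ("EN COLOMBIA", "COLOMBIA"),
    ("THE EP De NORTHERN CALIFORNIA", "De NORTHERN CALIFORNIA"),
    ("FLORIDA", "NEW JERSY"),
]


def qgram_distance(a, b, q=1):
    """Sum of absolute differences between the q-gram counts of a and b."""
    grams_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    grams_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    # Counter returns 0 for q-grams that occur in only one of the strings.
    return sum(abs(grams_a[g] - grams_b[g]) for g in grams_a.keys() | grams_b.keys())


distances = [qgram_distance(x, y) for x, y in pairs]
print(distances)
# [10, 4, 0, 5, 3, 9, 4, 3, 7, 14]

# Mirror tail(sort(i1), 2): keep rows whose distance is among the two largest.
largest_two = sorted(distances)[-2:]
unmatched_rows = [i + 1 for i, d in enumerate(distances) if d in largest_two]
print(unmatched_rows)
# [1, 10]
```

Under this definition row 6 scores 9, just below row 1 at 10, so the cutoff of two only works when the number of non-matching rows is known in advance.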