Joining by multiple columns with stringdist_join

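I have two data frames in which column x may contain typos and column y is always correct. I don't understand why joining on multiple columns with stringdist produces the following pairs: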

library(dplyr)
library(fuzzyjoin)
a <- data.frame(x = c("season", "season", "season", "package", "package"), y = c("1","2", "3", "1","6"))

b <- data.frame(x = c("season", "seson", "seson", "package", "pakkage"), y = c("1","2", "3", "2","6"))

c <- a %>%
  stringdist_left_join(b, by = c("x", "y"), max_dist = c(1,0))

      x.x y.x     x.y  y.y
1  season   1  season    1
2  season   1   seson    2
3  season   1   seson    3
4  season   2   seson    2
5  season   3  season    1
6  season   3   seson    2
7  season   3   seson    3
8 package   1 package    2
9 package   6    <NA> <NA>

cbind reproduces the output you want:

cbind(a,b)
       x y      x y
1 season 1 season 1
2 season 2  seson 2
3 season 3  seson 3
4 season 4  seson 4
5 season 6  seson 6
Edit

If a and b have different numbers of rows, you can try full_join from dplyr:

full_join(a,b, by = "y")
     x.x y    x.y
1 season 1 season
2 season 2  seson
3 season 3  seson
4 season 4  seson
5 season 6  seson

We can create a new column based on the similarity of the values in the 'x' column of the two datasets, and then do a left_join:

library(stringdist)
library(dplyr)
a %>%
    mutate(grp = phonetic(x)) %>%
   left_join(b %>% mutate(grp = phonetic(x), y2 = y), by = c('grp', 'y')) %>% 
   select(-grp)
Output:

#      x.x y     x.y   y2
#1  season 1  season    1
#2  season 2   seson    2
#3  season 3   seson    3
#4 package 1    <NA> <NA>
#5 package 6 pakkage    6
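According to ?"stringdist-metrics":

For the soundex distance (method='soundex'), strings are translated to a soundex code (see phonetic for a specification). The distance between strings is 0 when they have the same soundex code, otherwise 1. Note that soundex recoding is only meaningful for characters in the ranges a-z and A-Z. A warning is emitted when non-printable or non-ascii characters are encountered.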
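As a quick check of the quoted behaviour, here is a minimal sketch that assumes only that the stringdist package is installed. It compares the soundex codes of the values from the question and confirms that the soundex distance is 0 for a typo pair and 1 for unrelated words.

```r
library(stringdist)

# Values taken from the x columns in the question
words <- c("season", "seson", "package", "pakkage")

# phonetic() returns the soundex code of each string
codes <- phonetic(words)
print(codes)

# Same soundex code gives distance 0, different code gives distance 1
d_same <- stringdist("season", "seson", method = "soundex")
d_diff <- stringdist("season", "package", method = "soundex")
print(c(d_same, d_diff))
```

Because the typo variants share a code with the correct spelling, grouping on phonetic(x) lets an ordinary left_join treat them as equal.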


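The data frames I'm working with are in a different order / of equal length, which is why I can't use cbind.
Please check my edited answer. Does it work as you expect?
I think my reproducible example wasn't a good one. Sorry, I've edited my question! Do you also happen to know why setting max_dist = c(1,0) inside stringdist_left_join doesn't work?
@Maya You can change the method, as in the updated post.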
library(fuzzyjoin)
a %>%
   stringdist_left_join(b, by = c("x", "y"), max_dist = c(1,0), 
            method = "soundex")
#      x.x y.x     x.y  y.y
#1  season   1  season    1
#2  season   2   seson    2
#3  season   3   seson    3
#4 package   1    <NA> <NA>
#5 package   6 pakkage    6
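On the question of why max_dist = c(1,0) gives odd pairs: as far as I can tell, max_dist in stringdist_join is meant to be a single threshold rather than one value per column, so a length-2 vector appears to be recycled across the value pairs being compared instead of being applied column by column. If you want a different tolerance per column, one option is fuzzy_left_join with one match function per column. The following is only a sketch under that assumption; match_x and match_y are illustrative helper names, not part of fuzzyjoin.

```r
library(fuzzyjoin)
library(stringdist)

a <- data.frame(x = c("season", "season", "season", "package", "package"),
                y = c("1", "2", "3", "1", "6"),
                stringsAsFactors = FALSE)
b <- data.frame(x = c("season", "seson", "seson", "package", "pakkage"),
                y = c("1", "2", "3", "2", "6"),
                stringsAsFactors = FALSE)

# Hypothetical helpers: x may differ by at most one edit (OSA distance),
# y has to match exactly
match_x <- function(u, v) stringdist(u, v, method = "osa") <= 1
match_y <- function(u, v) u == v

# A named list supplies one match function per column listed in `by`
res <- fuzzy_left_join(a, b, by = c("x", "y"),
                       match_fun = list(x = match_x, y = match_y))
print(res)
```

With the example data this should keep one row per row of a, leaving the package / 1 row unmatched because b has no package-like entry with y equal to 1.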