Joining on multiple columns with stringdist_join
I have two data frames in which column x may contain typos and column y is always correct. I don't understand why a stringdist join on multiple columns produces the following pairs:
library(dplyr)
library(fuzzyjoin)
a <- data.frame(x = c("season", "season", "season", "package", "package"), y = c("1","2", "3", "1","6"))
b <- data.frame(x = c("season", "seson", "seson", "package", "pakkage"), y = c("1","2", "3", "2","6"))
c <- a %>%
stringdist_left_join(b, by = c("x", "y"), max_dist = c(1,0))
x.x y.x x.y y.y
1 season 1 season 1
2 season 1 seson 2
3 season 1 seson 3
4 season 2 seson 2
5 season 3 season 1
6 season 3 seson 2
7 season 3 seson 3
8 package 1 package 2
9 package 6 <NA> <NA>
cbind can reproduce the output you want:
cbind(a,b)
x y x y
1 season 1 season 1
2 season 2 seson 2
3 season 3 seson 3
4 season 4 seson 4
5 season 6 seson 6
Edit
If a and b have a different number of rows, you can try full_join from dplyr:
full_join(a,b, by = "y")
x.x y x.y
1 season 1 season
2 season 2 seson
3 season 3 seson
4 season 4 seson
5 season 6 seson
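We can create a new column based on the similarity of the values in the 'x' column of both datasets, and then do a left_join on that column together with 'y':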
library(stringdist)
library(dplyr)
a %>%
mutate(grp = phonetic(x)) %>%
left_join(b %>% mutate(grp = phonetic(x), y2 = y), by = c('grp', 'y')) %>%
select(-grp)
Output:
# x.x y x.y y2
#1 season 1 season 1
#2 season 2 seson 2
#3 season 3 seson 3
#4 package 1 <NA> <NA>
#5 package 6 pakkage 6
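According to ?"stringdist-metrics":
For the soundex distance (method='soundex'), strings are translated to a soundex code (see phonetic for a specification). The distance between strings is 0 when they have the same soundex code, otherwise 1. Note that soundex recoding is only meaningful for characters in the ranges a-z and A-Z. A warning is emitted when non-printable or non-ascii characters are encountered.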
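As a quick check of the behavior quoted above, the following sketch compares the strings from the example under the soundex method. It assumes only the stringdist package; the variable names codes, d_typo and d_other are made up for illustration.

```r
library(stringdist)

# Soundex codes for the example strings; a misspelling that sounds the
# same should receive the same code as the correct spelling.
codes <- phonetic(c("season", "seson", "package", "pakkage"))

# The soundex distance is 0 when the two codes are equal, 1 otherwise.
d_typo  <- stringdist("season", "seson", method = "soundex")
d_other <- stringdist("season", "package", method = "soundex")
```

Here d_typo should come out as 0 and d_other as 1, which is why grouping on phonetic(x) pairs "season" with "seson" but never with "package".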
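The data frames I'm working with do not have the same order or the same length, which is why I can't use cbind.
Please check my edited answer. Does it work as you expect?
I think my reproducible example wasn't a good one. Sorry, I edited my question! Do you also happen to know why setting max_dist = c(1,0) inside stringdist_left_join doesn't work?
@Maya You can change the method, as in the updated post: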
library(fuzzyjoin)
a %>%
stringdist_left_join(b, by = c("x", "y"), max_dist = c(1,0),
method = "soundex")
# x.x y.x x.y y.y
#1 season 1 season 1
#2 season 2 seson 2
#3 season 3 seson 3
#4 package 1 <NA> <NA>
#5 package 6 pakkage 6
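If the goal is a different tolerance per column (fuzzy on x, exact on y), one option that may be worth trying is fuzzy_left_join with one match function per column. This is only a sketch: it assumes that match_fun accepts a list of functions, one for each column in by, as described in the fuzzyjoin documentation. The helper name within_one and the result name res are hypothetical.

```r
library(fuzzyjoin)
library(stringdist)

a <- data.frame(x = c("season", "season", "season", "package", "package"),
                y = c("1", "2", "3", "1", "6"), stringsAsFactors = FALSE)
b <- data.frame(x = c("season", "seson", "seson", "package", "pakkage"),
                y = c("1", "2", "3", "2", "6"), stringsAsFactors = FALSE)

# Hypothetical helper: TRUE when two strings are at most one edit apart
# (stringdist's default method).
within_one <- function(u, v) stringdist(u, v) <= 1

# One match function per column: fuzzy on x, exact equality on y.
res <- fuzzy_left_join(a, b, by = c("x", "y"),
                       match_fun = list(x = within_one, y = `==`))
```

With the example data, this should pair each row of a with the row of b that has the same y and a near-identical x, and leave the row with package and y = 1 unmatched.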