Merging two data frames by string match with dplyr and stringdist
I am trying to do a dplyr left join on two data frames, DF1 and DF2, based on very similar (but not exact) text. I do this with the stringdist package to get the approximate-match positions as a vector:
titlematch <- amatch(df1$title,df2$showname)
Normally, if I had exact matches, I would do:
blended <- left_join(df1, df2, by = c("title" = "showname"))
The desired result excludes the third, non-matching row, since the vector has no possible match for it (NA). Here is a snippet:
library(stringdist)
library(tidyverse)

df1 %>%
  as_tibble() %>%
  mutate(temp = amatch(title, df2$showname, maxDist = 10)) %>%
  bind_cols(df2[.$temp, ]) %>%
  select(-temp)

# A tibble: 3 x 4
  title              records showname          counts
  <chr>                <int> <chr>              <int>
1 Bob's show, part 1      42 Bob's show part 1    772
2 Time for dinner         77 Dinner time           89
3 Horsecrap              121 Dinner time           89
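Note that with maxDist = 10 the third row is still paired with "Dinner time", because amatch() returns the closest candidate within the cutoff and only gives NA when nothing is that close. Below is a minimal sketch of a stricter variant. It assumes a cutoff of 2 edits (an illustrative value, not taken from the answer above) and the default OSA distance, and it looks up the df2 columns by the matched index so that unmatched rows stay NA.

```r
library(stringdist)
library(dplyr)

# Sample data as given in the question
df1 <- data.frame(
  title = c("Bob's show, part 1", "Time for dinner", "Horsecrap"),
  records = c(42L, 77L, 121L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  showname = c("Bob's show part 1", "Dinner time", "No way Jose"),
  counts = c(772L, 89L, 123L),
  stringsAsFactors = FALSE
)

# Assumed cutoff: at most 2 edits under the default "osa" method
blended <- df1 %>%
  mutate(
    idx      = amatch(title, df2$showname, maxDist = 2),  # NA when no candidate is within the cutoff
    showname = df2$showname[idx],                         # indexing with NA yields NA
    counts   = df2$counts[idx]
  ) %>%
  select(-idx)

print(blended)
```

With this cutoff only the first title finds a partner. A plain edit-distance threshold tight enough to reject "Horsecrap" may also reject "Time for dinner", since its words are reordered relative to "Dinner time", so the cutoff and the method argument probably need tuning on real data.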
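Have you looked at the fuzzyjoin package?
I had never heard of fuzzyjoin before, but I tried it and love it. stringdist_left_join is exactly what I needed.
You can use tidyr::crossing() to build the Cartesian product and then filter, though that gets expensive when the data sets are large.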
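The Cartesian-product suggestion can be sketched roughly as follows. This is only an illustration under stated assumptions: the cutoff max_dist = 2 and the helper column name dist are hypothetical choices, not part of the original thread, and the distance is the default OSA method of stringdist().

```r
library(dplyr)
library(tidyr)
library(stringdist)

# Sample data as given in the question
df1 <- data.frame(
  title = c("Bob's show, part 1", "Time for dinner", "Horsecrap"),
  records = c(42L, 77L, 121L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  showname = c("Bob's show part 1", "Dinner time", "No way Jose"),
  counts = c(772L, 89L, 123L),
  stringsAsFactors = FALSE
)

max_dist <- 2  # assumed cutoff, chosen for illustration

# Every df1 row paired with every df2 row (3 x 3 = 9 candidate pairs),
# then only the pairs whose titles are within the cutoff are kept
matches <- crossing(df1, df2) %>%
  mutate(dist = stringdist(title, showname, method = "osa")) %>%
  filter(dist <= max_dist)

# Join the surviving pairs back so that unmatched df1 rows are kept with NA
blended <- left_join(df1, matches, by = c("title", "records"))

print(blended)
```

The candidate table has nrow(df1) * nrow(df2) rows, which is why this gets costly on large inputs. fuzzyjoin's stringdist_left_join(df1, df2, by = c("title" = "showname"), max_dist = 2) expresses a similar join in one call.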
titlematch
1
2
NA
title | records | showname | counts
Bob's show, part 1 | 42 | Bob's show part 1 | 772
Time for dinner | 77 | Dinner time | 89
df1 <- structure(list(title = c("Bob's show, part 1", "Time for dinner",
"Horsecrap"), records = c(42L, 77L, 121L)), .Names = c("title",
"records"), row.names = c(NA, -3L), class = "data.frame")
df2 <- structure(list(showname = c("Bob's show part 1", "Dinner time",
"No way Jose"), counts = c(772L, 89L, 123L)), .Names = c("showname",
"counts"), row.names = c(NA, -3L), class = "data.frame")