R data.table: fuzzy and exact matching with strings

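My question is really what the title says: I have two very large datasets and need to join them using exact matches on some columns and fuzzy matches on others. I want the matches to be exact on the date-of-birth column DOB and the gender column gender, but only "similar" on the names column.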

By "similar", I mean that I want to be able to apply a specific set of criteria, such as:

  • OSA distance
    Error in bmerge(i, x, leftcols, rightcols, roll, rollends, nomatch, mult,  : 
      roll='nearest' can't be applied to a character column, yet.
    
    # copy left data
    df <- base
    
    # rename columns
    names(df)[c(1, 3)] <- c("ID", "loc")
    
    # copy right data
    df_alt <- name_unique
    
    # rename columns
    names(df_alt)[c(1, 3)] <- c("ID", "loc")
    
    
    # implement Lyngbakr's answer with stringdist() instead of abs()
    df_alt[df
           , on = .(ID, loc)
           , roll = "nearest"
           , .(ID, loc.x = i.loc, loc.y = x.loc, value, delta = stringdist(i.loc, x.loc))]
    
    library(data.table)
    library(tidyverse)
    
    base <- data.table(DOB = c("1956-01-01", "1994-05-13", "2001-07-03",
                               "1998-04-02", "1991-05-28", "2001-09-15",
                               "1999-04-05", "2001-04-10", "1996-01-14",
                               "2000-01-19") %>% as.Date,
                       gender = c("F", "F", "M", "F", "M", "F", "M", "F",
                                  "F", "F"),
                       names = c("Regina_Douglas", "Tamar_Hurley", "John_Moreno",
                                 "Josephine_Bone_O' Brian", "Borys_Holland",
                                 "Tonisha_Moran", "Jarrad_Kaur", "Abbi_Kane",
                                 "Leslie_Davis", "Blossom_Povey"),
                       row = 1:10)
    
    
    name_unique <-
            data.table(s_DOB = c("1941-01-09", "1976-09-22", "1996-08-07",
                                 "1993-09-24", "1991-05-28", "2001-09-15",
                                 "1969-03-21", "1939-06-25", "1996-01-14",
                                 "1978-07-27") %>% as.Date,
                       s_gen = c("M", "M", "F", "M", "M", "F", "M", "F", "F",
                                 "F"),
                       s_name = c("Brandon_Hampton", "John_Moreno", "Sally_Kemper",
                                  "Nickolas_Bolden", "Boris_Holland", "Tonisha_Morann",
                                  "Bryant_Lopez", "Kathryn_Krebs", "Lesli_David",
                                  "Kelley__Owens"),
                       s_identif = c(178, 184, 136, 188, 198, 133, 197,
                                     143, 200, 132))
    
    DOB         gender  names                    row  s_identif
    1956-01-01  F       Regina_Douglas           1    NA
    1994-05-13  F       Tamar_Hurley             2    NA
    2001-07-03  M       John_Moreno              3    NA
    1998-04-02  F       Josephine_Bone_O' Brian  4    NA
    1991-05-28  M       Borys_Holland            5    198
    2001-09-15  F       Tonisha_Moran            6    133
    1999-04-05  M       Jarrad_Kaur              7    NA
    2001-04-10  F       Abbi_Kane                8    NA
    1996-01-14  F       Leslie_Davis             9    200
    2000-01-19  F       Blossom_Povey            10   NA
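
One possible way around the `roll = "nearest"` limitation, sketched here rather than offered as a definitive answer, is to split the task in two: first do an ordinary exact join on `DOB` and `gender` to generate candidate pairs, then score each pair with `stringdist()` and keep only the close ones. The sketch below repeats a four-row subset of the question's data so that it runs on its own. The threshold `max_dist` and the helper column `osa` are assumptions introduced for illustration; the question does not specify a cutoff.

```r
library(data.table)
library(stringdist)

# Four-row subset of the question's data, repeated so the sketch is self-contained
base <- data.table(
  DOB    = as.Date(c("2001-07-03", "1991-05-28", "2001-09-15", "1996-01-14")),
  gender = c("M", "M", "F", "F"),
  names  = c("John_Moreno", "Borys_Holland", "Tonisha_Moran", "Leslie_Davis"),
  row    = c(3L, 5L, 6L, 9L)
)
name_unique <- data.table(
  s_DOB     = as.Date(c("1976-09-22", "1991-05-28", "2001-09-15", "1996-01-14")),
  s_gen     = c("M", "M", "F", "F"),
  s_name    = c("John_Moreno", "Boris_Holland", "Tonisha_Morann", "Lesli_David"),
  s_identif = c(184, 198, 133, 200)
)

# Assumed cutoff: names at most 2 OSA edits apart count as "similar"
max_dist <- 2

# Step 1: exact left join on DOB and gender to build candidate pairs
cand <- merge(base, name_unique,
              by.x = c("DOB", "gender"), by.y = c("s_DOB", "s_gen"),
              all.x = TRUE, allow.cartesian = TRUE)

# Step 2: OSA distance between the two name columns (NA where no candidate exists)
cand[, osa := stringdist(names, s_name, method = "osa")]

# Step 3: drop the identifier for candidates that are missing or too far apart
cand[is.na(osa) | osa > max_dist, s_identif := NA_real_]

# Step 4: keep the closest candidate per base row
setorder(cand, row, osa, na.last = TRUE)
result <- cand[, .SD[1], by = row][, .(DOB, gender, names, row, s_identif)]
print(result)
```

On this subset the identifiers 198, 133 and 200 are kept for rows 5, 6 and 9, while row 3 stays `NA` because the two `John_Moreno` records have different birth dates. Because the exact join runs first, string distances are computed only within groups that share `DOB` and `gender`, which should keep the number of comparisons manageable on large tables. Any other `method` supported by `stringdist()` can be substituted in step 2.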