
Merging data.tables in R on partial matches between different columns


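This question may have been asked before, but I am looking for a data.table solution, if possible without using other packages. I have a data.table DT1 that serves as the reference: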

> require(data.table)
> DT1 <- data.table(col1 = c("AA", "BA", "ABC", "ABC BC", "AB")
                  , col2 = c(1,4,5,3,2))
> DT1
     col1 col2
1:     AA    1
2:     BA    4
3:    ABC    5
4: ABC BC    3
5:     AB    2

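Given the size of the problem (DT1[(1:50000),(1:25)] - DT2[(1:50000000),(1:55)]), it is probably not feasible to build a CJ (cross join) of the IDs before running grepl in both directions.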
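To make the scale concrete, here is a back-of-the-envelope sketch of how many ID pairs a full cross join would produce for the dimensions quoted above. The figure of 16 bytes per pair is an assumption (two 8-byte pointers on a 64-bit build of R), not a measurement.

```r
# Row counts quoted in the thread
n1 <- 5e4   # rows in DT1
n2 <- 5e7   # rows in DT2

# Number of (ID1, ID2) pairs a full cross join would materialize
pairs <- n1 * n2

# Assumption: each pair costs two 8-byte pointers and nothing else
bytes_per_pair <- 2 * 8
tib_needed <- pairs * bytes_per_pair / 1024^4

pairs        # 2.5e+12
tib_needed   # roughly 36 TiB
```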

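Breaking the problem down by type of match, we can 1) first look for exact matches, 2) then look for partial matches where an ID from one table occurs as a substring of an ID in the other table, and 3) then do the same in the opposite direction.

Finally, we row-bind all the results and left-join the original DT2 to the row-bound result to get the desired output.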

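# 1) Exact matches: inner join on the ID columns, then restore the ID2 column
exactMatches <- DT1[DT2, on=c("ID1"="ID2"), nomatch=0L][,
    ID2 := ID1]

# 2) For each row of DT2: rows of DT1 whose ID1 contains ID2 (exact matches excluded)
substr1in2 <- DT2[, c(.SD, DT1[grepl(ID2, ID1) & ID1 != ID2]), 
    by=1:DT2[,.N]][!is.na(VAL1), -1L]

# 3) For each row of DT1: rows of DT2 whose ID2 contains ID1 (exact matches excluded)
substr2in1 <- DT1[, c(.SD, DT2[grepl(ID1, ID2) & ID2 != ID1]), 
    by=1:DT1[,.N]][!is.na(VAL2), -1L]

# Stack the three result sets, matching columns by name
binded <- rbindlist(list(exactMatches, substr1in2, substr2in1), 
    use.names=TRUE, fill=TRUE)

# Left join: keep every row of DT2, with NA where nothing matched
binded[DT2, on=.(ID2, VAL2)]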
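One thing to be aware of with the code above: grepl treats its first argument as a regular expression by default, so IDs containing characters such as "." or "(" may match more (or fail differently) than intended. A minimal sketch of the difference, using made-up IDs that are not part of the original data:

```r
# Hypothetical IDs chosen only to show the effect of regex metacharacters
ids <- c("AB", "A.B", "AXB")

# Default: "A.B" is a regular expression, "." matches any single character
regex_hits <- grepl("A.B", ids)

# fixed = TRUE: "A.B" is matched as a literal substring
literal_hits <- grepl("A.B", ids, fixed = TRUE)

regex_hits     # FALSE  TRUE  TRUE
literal_hits   # FALSE  TRUE FALSE
```

If the IDs are meant to be compared as plain strings, adding fixed = TRUE to the grepl calls is probably the safer choice, and it is usually faster as well.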

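Comments:

A data.table is a data.frame. So why not use a data.frame? Because to make a data.table you need an extra package:

Because it is a very large dataset, and data.frame merge operations are very slow.

This is related: ?

Hmm. It is related, but it is not exactly the same problem.

Would splitting col1 of DT1 on spaces before joining with DT2 work for your dataset?

Thanks @chinsoon12, this is already great. But the problem with the current solution is that grepl needs to work both ways, or at least the other way round, with grepl(ID, PATTERN) returning VAL1. Unfortunately my initial example did not show this case, but I have edited the question to clarify (as can be seen from the matching cases); I added a row to DT2 for the reverse partial match case.

What are the dimensions of DT1 and DT2?

DT1[(1:50000),(1:25)] - DT2[(1:50000000),(1:55)]

If needed, DT2 can be split into smaller chunks.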
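The last comment suggests processing DT2 in smaller chunks. A minimal sketch of that idea follows, assuming the sample DT1 and DT2 from the answer; match_chunk and chunk_size are hypothetical names introduced here, and the helper uses literal (fixed = TRUE) substring matching in both directions rather than the answer's exact three-step code.

```r
library(data.table)

DT1 <- data.table(ID1 = c("AA", "BA", "ABC", "ABC BC", "AB"),
                  VAL1 = c(1, 4, 5, 3, 2))
DT2 <- data.table(VAL2 = c(0, 5, 2, 7, 1, 0),
                  ID2 = c("BA", "ABC", "DC", "AA", "AB", "R AB"))

# Hypothetical helper: for one chunk of DT2, return every row of `ref`
# whose ID1 contains ID2 or is contained in ID2 (exact matches included)
match_chunk <- function(chunk, ref) {
  rbindlist(lapply(seq_len(nrow(chunk)), function(i) {
    id2 <- chunk$ID2[i]
    val2 <- chunk$VAL2[i]
    hit <- grepl(id2, ref$ID1, fixed = TRUE) |
      vapply(ref$ID1, grepl, logical(1), x = id2, fixed = TRUE,
             USE.NAMES = FALSE)
    if (!any(hit)) return(NULL)      # rbindlist skips NULL entries
    out <- ref[hit]                  # subsetting makes a copy of the hits
    out[, ID2 := id2]
    out[, VAL2 := val2]
    out
  }), use.names = TRUE, fill = TRUE)
}

# Assumption: tiny chunk size for the demo; tune it to available memory
chunk_size <- 2L
chunk_id <- (seq_len(nrow(DT2)) - 1L) %/% chunk_size

matched <- rbindlist(lapply(split(DT2, chunk_id), match_chunk, ref = DT1),
                     use.names = TRUE, fill = TRUE)

# Left join so that unmatched DT2 rows are kept with NA
result <- matched[DT2, on = .(ID2, VAL2)]
result
```

Each chunk only ever compares chunk_size rows of DT2 against DT1, so peak memory is bounded by the chunk size; the chunks are also independent, which leaves room for writing intermediate results to disk or running them in parallel.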
> desired_output <- data.table(col1 = c(0,5,5,2,7,1,1,1,0)
                                 , col2 = c("BA", "ABC", "ABC", "DC", "AA",  "AB", "AB", "AB", "R AB")
                                 , col3 = c(4,5,3,NA,1,5,3,2,2))
> desired_output
   col1 col2 col3
1:    0   BA    4
2:    5  ABC    5
3:    5  ABC    3
4:    2   DC   NA
5:    7   AA    1
6:    1   AB    5
7:    1   AB    3
8:    1   AB    2
9:    0 R AB    2
col1/DT1    col2/DT2
  "AB"       "There is ABhere"    # it's a match
  "ABC"      "someABC"            # it's a match
  "ABC BC"   "ABC"                # it's a reverse match
  "DR"       "ADD"                # no match
  "BA"       "HABAHA"             # two matches
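The matching cases listed above can be checked with a small base R sketch. either_contains is a hypothetical helper name, and it assumes the IDs should be compared as literal strings (fixed = TRUE), which the original post does not state explicitly.

```r
# Hypothetical helper: TRUE when either string contains the other
# as a literal substring (so both forward and reverse matches count)
either_contains <- function(a, b) {
  mapply(function(x, y) {
    grepl(x, y, fixed = TRUE) || grepl(y, x, fixed = TRUE)
  }, a, b, USE.NAMES = FALSE)
}

col1 <- c("AB", "ABC", "ABC BC", "DR", "BA")
col2 <- c("There is ABhere", "someABC", "ABC", "ADD", "HABAHA")

is_match <- either_contains(col1, col2)
is_match   # TRUE TRUE TRUE FALSE TRUE
```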
binded[DT2, on=.(ID2, VAL2)]
       ID1 VAL1 VAL2  ID2
 1:     BA    4    0   BA
 2:    ABC    5    5  ABC
 3: ABC BC    3    5  ABC
 4:     AB    2    5  ABC
 5:   <NA>   NA    2   DC
 6:     AA    1    7   AA
 7:     AB    2    1   AB
 8:    ABC    5    1   AB
 9: ABC BC    3    1   AB
10:     AB    2    0 R AB
# Sample data used by the answer's code above
DT1 <- data.table(ID1 = c("AA", "BA", "ABC", "ABC BC", "AB"), 
    VAL1 = c(1,4,5,3,2))

DT2 <- data.table(VAL2 = c(0,5,2,7,1,0),
    ID2 = c("BA", "ABC", "DC", "AA", "AB", "R AB"))