R data.table: add a new column to each row via a lookup

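I have two data.tables in R, as follows:

First table

Second table

As you can see, the last names in the first table are truncated. In addition, the combination of first and last name is unique in the first table but not in the second. I want to join on the combination of first and last name, under the incredibly naive assumptions that first and last name together identify a person and that truncating the last name introduces no ambiguity. The result should look like this: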

id | first | trunc |       last | val1 
=======================================
 1 |   Bob | Smith |      Smith |   10
 2 |   Sue | Goldm |    Goldman |   20
 3 |   Sue | Wollw |  Wollworth |   30
 4 |   Bob | Bellb | Bellbottom |   40

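Basically, for each row in first_table, I need to find a row that fills in the full last name:

for each row in first_table:
    find the first row in second_table where:
        first matches & trunc is a substring of last
    then join on that row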
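The pseudocode above can be written out literally as a row-by-row lookup. This is only a minimal sketch for comparison with a vectorized solution: the two input tables are hypothetical, built to be consistent with the expected result (the duplicate rows and the `val2` column in `second_table` are assumptions), and `lookup_last` is an illustrative helper name. "Substring" is read here as "prefix", since the names are truncated from the right.

```r
library(data.table)

# Hypothetical inputs, reconstructed from the expected result above.
first_table <- data.table(
  id    = 1:4,
  first = c("Bob", "Sue", "Sue", "Bob"),
  trunc = c("Smith", "Goldm", "Wollw", "Bellb"),
  val1  = c(10, 20, 30, 40)
)
# (first, last) is deliberately not unique; val2 is an assumed extra column.
second_table <- data.table(
  first = c("Bob", "Bob", "Sue", "Sue", "Bob", "Sue"),
  last  = c("Smith", "Smith", "Goldman", "Wollworth", "Bellbottom", "Goldman"),
  val2  = 1:6
)

# For one (first name, truncated last name) pair, return the last name of the
# first row in second_table whose first name matches and whose last name
# starts with the truncated form; NA if nothing matches.
lookup_last <- function(f, tr) {
  hit <- second_table[first == f & startsWith(last, tr), last]
  if (length(hit) > 0L) hit[1L] else NA_character_
}

# Apply the lookup once per row of first_table (not vectorized).
result <- copy(first_table)
result[, last := mapply(lookup_last, first, trunc, USE.NAMES = FALSE)]
print(result)
```

This scans `second_table` once per row of `first_table`, so it is mainly useful as a readable baseline against which a join-based answer can be checked.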

Is there a simple vectorized way to do this with data.table?

One approach is to join on the first name only, then filter on the substring match:

first_table[
    unique(second_table[, .(first, last)])
    , on = "first"
    , nomatch = 0
][
    substr(last, 1, nchar(trunc)) == trunc
]

#    id first trunc val1       last
# 1:  1   Bob Smith   10      Smith
# 2:  2   Sue Goldm   20    Goldman
# 3:  3   Sue Wollw   30  Wollworth
# 4:  4   Bob Bellb   40 Bellbottom
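To make the join-then-filter approach reproducible, here is a self-contained sketch with hypothetical input tables chosen to be consistent with the output shown (the duplicate rows and `val2` in `second_table` are assumptions). Two details differ from the snippet above and are my additions: `allow.cartesian = TRUE`, because joining on `first` alone multiplies rows and data.table may refuse such a join on larger inputs, and `startsWith()` in place of the `substr()` comparison, which should be equivalent for this prefix test.

```r
library(data.table)

# Hypothetical inputs, reconstructed from the expected result.
first_table <- data.table(
  id    = 1:4,
  first = c("Bob", "Sue", "Sue", "Bob"),
  trunc = c("Smith", "Goldm", "Wollw", "Bellb"),
  val1  = c(10, 20, 30, 40)
)
second_table <- data.table(
  first = c("Bob", "Bob", "Sue", "Sue", "Bob", "Sue"),
  last  = c("Smith", "Smith", "Goldman", "Wollworth", "Bellbottom", "Goldman"),
  val2  = 1:6  # assumed extra column, not used in the join
)

# Step 1: pair every first_table row with every distinct (first, last)
# candidate that shares its first name.
candidates <- first_table[
  unique(second_table[, .(first, last)]),
  on = "first",
  nomatch = 0,
  allow.cartesian = TRUE
]

# Step 2: keep only the pairs where the full last name starts with trunc.
joined <- candidates[startsWith(last, trunc)]
setorder(joined, id)  # row order otherwise follows second_table
print(joined)
```

The intermediate `candidates` table grows with the number of people sharing a first name, so on large inputs the two-column join on `first` and `trunc` is likely to be cheaper.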

Alternatively, truncate the last names in second_table to match the first table, then join on both columns:

first_table[
    unique(second_table[, .(first, last, trunc = substr(last, 1, 5))])
    , on = c("first", "trunc")
    , nomatch = 0
]
## yields the same answer
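Since the goal is to add one column to first_table, an update join is another option. This is an alternative sketch rather than part of the answer above: it reuses the truncate-then-join idea but assigns `last` by reference with `:=`, which keeps every row of `first_table` in its original order and leaves `NA` where nothing matches. The input tables are hypothetical, and deriving the truncation width from the data (instead of hard-coding 5) assumes every surname was cut at the same length.

```r
library(data.table)

# Hypothetical inputs, reconstructed from the expected result.
first_table <- data.table(
  id    = 1:4,
  first = c("Bob", "Sue", "Sue", "Bob"),
  trunc = c("Smith", "Goldm", "Wollw", "Bellb"),
  val1  = c(10, 20, 30, 40)
)
second_table <- data.table(
  first = c("Bob", "Bob", "Sue", "Sue", "Bob", "Sue"),
  last  = c("Smith", "Smith", "Goldman", "Wollworth", "Bellbottom", "Goldman"),
  val2  = 1:6  # assumed extra column, not used in the join
)

# Truncation width taken from the data (assumption: one fixed cut-off length).
width <- max(nchar(first_table$trunc))

# Distinct (first, last) pairs plus the truncated key to join on.
lookup <- unique(second_table[, .(first, last, trunc = substr(last, 1, width))])

# Update join: add `last` to first_table in place; i.last refers to the
# column coming from `lookup`.
first_table[lookup, on = c("first", "trunc"), last := i.last]
print(first_table)
```

If two different surnames in `lookup` share the same first name and truncated form, the update join silently keeps only one of them, so the no-ambiguity assumption from the question matters here as well.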
