Duplicated columns in an R data.table join


I simply join two data.tables as follows:

library(data.table)

set.seed(1)
DT1 <- data.table(
  Idx = 1:100,
  x1 = round(rnorm(100, 0.75, 0.3), 2),
  x2 = round(rnorm(100, 0.75, 0.3), 2),
  x3 = round(rnorm(100, 0.75, 0.3), 2))

DT2 <- data.table(
  Idx2 = 1:100,
  x1 = round(rep(pi, 100), 2),
  targetcol = rep(999, 100))

# Join DT1 onto DT2, matching DT2$Idx2 against DT1$Idx
DT2[DT1, on = c(Idx2 = "Idx")]
However, it results in a different column order and naming (x1.x, x1.y), and I have also read that it is slower than the other way.


What is the best way to solve this (when there are more columns and more duplicates; this is only meant to illustrate the problem)?

Answer moved from the comments, slightly modified from HubertL's code:

DT1[DT2[, .(Idx2, targetcol)], on = c(Idx = "Idx2")]
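
To see why this avoids the clash, here is a minimal sketch that rebuilds the tables from the question and compares the column names of both joins. It assumes only that data.table is installed; the variable names full and res are made up for illustration.

```r
library(data.table)

set.seed(1)
DT1 <- data.table(
  Idx = 1:100,
  x1 = round(rnorm(100, 0.75, 0.3), 2),
  x2 = round(rnorm(100, 0.75, 0.3), 2),
  x3 = round(rnorm(100, 0.75, 0.3), 2))
DT2 <- data.table(
  Idx2 = 1:100,
  x1 = round(rep(pi, 100), 2),
  targetcol = rep(999, 100))

# Joining the full tables keeps both x1 columns, so one of them gets renamed
full <- DT2[DT1, on = c(Idx2 = "Idx")]
print(names(full))

# Keeping only the key and targetcol from DT2 leaves nothing to clash with
res <- DT1[DT2[, .(Idx2, targetcol)], on = c(Idx = "Idx2")]
print(names(res))
# "Idx" "x1" "x2" "x3" "targetcol"
```

With many columns, the inner selection has to list every column of DT2 you want to bring over, which is the trade-off of this approach.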

Not a data.table solution, but possibly still relevant.

With my package, you have a few options:

# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
(1) eat the columns you need, naming them explicitly:

eat(DT1, DT2, targetcol, .by = c(Idx = "Idx2"))

(2) eat the columns that follow a given pattern:

eat(DT1, DT2, starts_with("target"), .by = c(Idx = "Idx2"))

(3) eat everything (or use safe_left_join), but keep the first column in case of conflict:

eat(DT1, DT2, .by = c(Idx = "Idx2"), .conflict = ~.x)
safe_left_join(DT1, DT2, by = c(Idx = "Idx2"), conflict = ~.x) # same thing here
They all give the following output:

#   Idx   x1   x2   x3 targetcol
# 1   1 0.56 0.50 1.20       999
# 2   2 0.81 0.90 0.87       999
# 3   3 0.50 0.97 0.56       999
# 4   4 1.23 0.92 0.09       999
# 5   5 0.85 0.66 1.09       999

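What about DT2[DT1[, -"x1"], on = c(Idx2 = "Idx")]?

OK, thanks. That one does not rely on the usual list syntax in the j expression, though. I am still wondering: if I want to use the list syntax, can I select all columns but one that way? Something like (Idx2, x1, x2, x3, targetcol), but more concise? Also, the column order differs from that of merge.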
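
One possible way to address this comment, offered as a sketch rather than a definitive answer: data.table accepts a negated character vector both directly in j and in .SDcols, and setcolorder can rearrange the result by reference. The small tables below are made-up stand-ins for the ones in the question, and a, b and res are illustrative names.

```r
library(data.table)

# Small hypothetical stand-ins for DT1 and DT2 from the question
DT1 <- data.table(Idx = 1:3, x1 = c(0.1, 0.2, 0.3), x2 = c(1, 2, 3), x3 = c(4, 5, 6))
DT2 <- data.table(Idx2 = 1:3, x1 = rep(3.14, 3), targetcol = rep(999, 3))

# Select every column except x1 without listing the others
a <- DT1[, !"x1"]
b <- DT1[, .SD, .SDcols = !"x1"]

# Join, then move the chosen columns to the front by reference;
# columns not named in setcolorder keep their relative order after them
res <- DT2[DT1[, !"x1"], on = c(Idx2 = "Idx")]
setcolorder(res, c("Idx2", "x2", "x3"))
print(names(res))
```

The .SDcols form is the closer match to the list syntax, since .SD can be combined with further expressions in j.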