Duplicated columns in R data.table join
I simply join two data.tables as follows:
library(data.table)
set.seed(1)
DT1 <- data.table(
Idx = rep(1:100),
x1 = round(rnorm(100,0.75,0.3),2),
x2 = round(rnorm(100,0.75,0.3),2),
x3 = round(rnorm(100,0.75,0.3),2))
DT2 <- data.table(
Idx2 = rep(1:100),
x1 = round(rep(pi,100),2),
targetcol = rep(999,100))
DT2[DT1,on = c(Idx2 = "Idx")]
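But it results in a different column order and naming (x1.x and x1.y), and I have also read that it is slower than the other way.
What is the best way to solve this (with more columns and duplicates; this is just to illustrate the problem)?
Answer moved from the comments, slightly modified from HubertL's code: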
DT1[DT2[, .(Idx2, targetcol)], on = c(Idx = "Idx2")]
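With more columns, typing the list by hand gets tedious. A minimal sketch of one way to compute it instead, assuming the join keys have different names in the two tables, as they do here (`keep` and `res` are names made up for this illustration):

```r
library(data.table)

set.seed(1)
DT1 <- data.table(
  Idx = 1:100,
  x1 = round(rnorm(100, 0.75, 0.3), 2),
  x2 = round(rnorm(100, 0.75, 0.3), 2),
  x3 = round(rnorm(100, 0.75, 0.3), 2))
DT2 <- data.table(
  Idx2 = 1:100,
  x1 = round(rep(pi, 100), 2),
  targetcol = rep(999, 100))

# Columns of DT2 whose names do not clash with DT1.
# Idx2 survives only because DT1 calls its key Idx (assumption stated above).
keep <- setdiff(names(DT2), names(DT1))

# Same join as above, with the column list computed rather than typed out.
res <- DT1[DT2[, ..keep], on = c(Idx = "Idx2")]
names(res)
```

If the key had the same name in both tables, `setdiff()` would drop it as well, so it would have to be added back to `keep` before subsetting.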
Not a data.table solution, but it may still be relevant.
With my package safejoin, you have a few options:
# devtools::install_github("moodymudskipper/safejoin")
library(safejoin)
(1) eat explicitly specified columns:
eat(DT1, DT2, targetcol, .by = c(Idx = "Idx2"))
(2) eat columns that follow a given pattern:
eat(DT1, DT2, starts_with("target"), .by = c(Idx = "Idx2"))
(3) eat everything (or use safe_left_join), but keep the first column in case of conflict:
eat(DT1, DT2, .by = c(Idx = "Idx2"), .conflict = ~.x)
safe_left_join(DT1, DT2, by = c(Idx = "Idx2"), conflict = ~.x) # same thing here
They all give the following output:
#   Idx   x1   x2   x3 targetcol
# 1   1 0.56 0.50 1.20       999
# 2   2 0.81 0.90 0.87       999
# 3   3 0.50 0.97 0.56       999
# 4   4 1.23 0.92 0.09       999
# 5   5 0.85 0.66 1.09       999
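What about DT2[DT1[, -"x1"], on = c(Idx2 = "Idx")]?
OK, thanks. Now this has nothing to do with the usual list syntax in the j expression. I still wonder: if I want to use the list syntax, can I select all columns but one that way? Something like .(Idx2, x1, x2, x3, targetcol), but more concise? Also, the column order differs from that of merge.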
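On the two points in that comment, a rough sketch rather than a definitive answer: a column can be excluded by name either directly in j or through `.SDcols`, and `setcolorder()` can then put the result into whatever order is wanted. The object names `a`, `b` and `res` are made up for this illustration, and the target order is simply the one listed in the comment.

```r
library(data.table)

set.seed(1)
DT1 <- data.table(
  Idx = 1:100,
  x1 = round(rnorm(100, 0.75, 0.3), 2),
  x2 = round(rnorm(100, 0.75, 0.3), 2),
  x3 = round(rnorm(100, 0.75, 0.3), 2))
DT2 <- data.table(
  Idx2 = 1:100,
  x1 = round(rep(pi, 100), 2),
  targetcol = rep(999, 100))

# Two ways to select every column of DT1 except x1.
a <- DT1[, !"x1"]
b <- DT1[, .SD, .SDcols = setdiff(names(DT1), "x1")]

# Join without the clashing column, then reorder the result by reference.
res <- DT2[DT1[, !"x1"], on = c(Idx2 = "Idx")]
setcolorder(res, c("Idx2", "x1", "x2", "x3", "targetcol"))
names(res)
```

Note that in this direction the surviving x1 is the one from DT2, because DT1's x1 was dropped before the join.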