
Merging two sets of data by the nearest value via data.table's roll = 'nearest' feature


I have two sets of data.

Example of set A (total rows: 45467):

Example of set B (total rows: 4798):

The result I am interested in: for set_B, the rows from set_A are matched on the nearest value of time_a to time_b (total output rows: 4798). In set_A, a value of time_A can be repeated several times (e.g. ID_A[8,] and ID_A[9,]); which of those rows gets merged with the row in set_B (in this case ID_B[1,]) does not matter. Example of the expected result:

ID_b    b1  b2  source  time_b      ID_a    a1  a2  a3  time_a
2   34.20   15.114  set1.csv.1  20.35750    8   85640   5274.1  301.6041    20.01000
7   67.20   16.114  set1.csv.2  21.35778    7   85697   5345.2  301.6043    21.00972
12  12.20   33.114  set1.csv.3  22.35806    4   65694   9375.2  301.6049    22.00972
17  73.20   67.114  set2.csv.1  23.35833    3   85694   9278.9  301.6051    23.00972
23  88.20   42.114  set2.csv.2  19.35861    5   85653   4375.5  301.6047    19.00972
28  90.20   52.114  set3.csv.1  00.35889    2   35694   5245.2  301.6053    00.00944
I have come across many similar questions on Stack Overflow, and I really like the data.table solutions because they look very elegant. However, after several failed attempts I ended up either with a table built from both sets (total rows: 45467) or with only a single column, time_a, merged into set_B... I won't be picky, though; if anyone has other ideas, I would greatly appreciate the help.

An example of the code I have been working on:

setDT(set_B)
setDT(set_A)
setkey(set_B, time_b)[, time_a := time_b]
test_ab <- set_B[set_A, roll = 'nearest']
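A minimal sketch of why that attempt returned 45467 rows, using hypothetical toy tables (A, B and their time values are made up for illustration): in X[Y, roll = "nearest"], the result has one row per row of Y, the table inside the brackets, not per row of X.

```r
library(data.table)

# Toy tables (hypothetical, for illustration only)
A <- data.table(time = c(1.0, 2.0, 3.0, 4.0), a = letters[1:4], key = "time")
B <- data.table(time = c(1.9, 3.2),           b = c("x", "y"),  key = "time")

# The output of X[Y, roll = "nearest"] has one row per row of Y:
nrow(B[A, roll = "nearest"])  # 4: one row per row of A
nrow(A[B, roll = "nearest"])  # 2: one row per row of B
```

So set_B[set_A, ...] yields one row per row of set_A (45467); to get 4798 rows, set_B has to go inside the brackets.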

Here is a step-by-step example based on the sample data you provided:

# Sample data
library(data.table)
setDT(set_A)
setDT(set_B)    

# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
setkey(set_A, time)
setkey(set_B, time)

# Rolling join by nearest time
set_merged <- set_B[set_A, roll = "nearest"]

unique(set_merged[order(ID_b)], by = "time")
#    ID_b   b1     b2     source   time_b     time ID_a    a1     a2       a3
# 1:    2 34.2 15.114 set1.csv.1 20.35750 20.01000    8 85640 5274.1 301.6041
# 2:    7 67.2 16.114 set1.csv.2 21.35778 21.00972    7 85697 5345.2 301.6043
# 3:   12 12.2 33.114 set1.csv.3 22.35806 22.00972    4 65694 9375.2 301.6049
# 4:   17 73.2 67.114 set2.csv.1 23.35833 23.00972    3 85694 9278.9 301.6051
# 5:   23 88.2 42.114 set2.csv.2 19.35861 19.00972    5 85653 4375.5 301.6047
# 6:   28 90.2 52.114 set3.csv.1  0.35889  0.00944    2 35694 5245.2 301.6053
#      time_a
# 1: 20.01000
# 2: 21.00972
# 3: 22.00972
# 4: 23.00972
# 5: 19.00972
# 6:  0.00944
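A quick sketch of what the unique(..., by = ) step does, using a made-up toy table: it keeps the first row for each distinct value of the given column, which collapses the duplicates that the join carried over.

```r
library(data.table)

# Made-up toy table with a duplicated time value
dt <- data.table(ID_b = c(2, 7, 9),
                 time = c(20.4, 21.4, 21.4),
                 v    = 1:3)

unique(dt, by = "time")  # 2 rows: first row kept per distinct time
unique(dt, by = "ID_b")  # 3 rows: ID_b is unique, so nothing is dropped
```

This also illustrates why deduplicating by a genuinely unique column can be safer when time values may repeat.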

Sample data: set_A and set_B (the read.table code is at the end of this post).

Thank you very much for the code and the clear, simple explanation, it helped me a lot! I made just one change: instead of the line unique(set_merged[order(ID_b)], by = "time") I wrote unique(set_merged[order(ID_b)], by = "ID_b"), because some time values in set_B were also duplicated (ID_b is unique).

Do you want to look up the rows of set_A (in x) using set_B (in i)? Then use set_A[set_B, on = "time", roll = "nearest"]. That way you don't need the unique step. Also, if you join with on, the two setkey steps are not needed. In addition, setDT updates by reference, so the assignment (

Small addition: if you use fread instead of read.table, the result of the read is already a data.table. Using TRUE instead of T is also (or seems to be) better practice. Nice small example in this question, a pleasure to get a better grip on the data.table join types. Personally, I like using the on argument rather than setting keys: in x[i, on = , ...] it makes explicit what you are joining on. Some good examples of on. Cheers
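The fread suggestion in the comment above can be sketched as follows (the file and its columns are hypothetical, written to a temp file just for the demo):

```r
library(data.table)

# Hypothetical CSV written to a temp file for illustration
tmp <- tempfile(fileext = ".csv")
writeLines("ID_a,a1,time_a
2,35694,0.00944
3,85694,23.00972", tmp)

set_A <- fread(tmp)  # fread() already returns a data.table
class(set_A)         # no setDT() call needed afterwards
```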
The approach from the comments, joining on time with on = instead of setkey():

library(data.table)
setDT(set_A)
setDT(set_B)    

# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]

set_A[set_B, on = "time", roll = "nearest"][order(ID_a)]
#   ID_a    a1     a2       a3   time_a     time ID_b   b1     b2     source
#1:    2 35694 5245.2 301.6053  0.00944  0.35889   28 90.2 52.114 set3.csv.1
#2:    3 85694 9278.9 301.6051 23.00972 23.35833   17 73.2 67.114 set2.csv.1
#3:    5 85653 4375.5 301.6047 19.00972 19.35861   23 88.2 42.114 set2.csv.2
#4:    6 12694 5236.3 301.6045 22.00972 22.35806   12 12.2 33.114 set1.csv.3
#5:    7 85697 5345.2 301.6043 21.00972 21.35778    7 67.2 16.114 set1.csv.2
#6:    9 30694 5279.0 301.6039 20.01000 20.35750    2 34.2 15.114 set1.csv.1
#  time_b
#1:  0.35889
#2: 23.35833
#3: 19.35861
#4: 22.35806
#5: 21.35778
#6: 20.35750
The sample data:

set_A <- read.table(text =
    "ID_a    a1  a2  a3  time_a
2   35694   5245.2  301.6053    00.00944
3   85694   9278.9  301.6051    23.00972
4   65694   9375.2  301.6049    22.00972
5   85653   4375.5  301.6047    19.00972
6   12694   5236.3  301.6045    22.00972
7   85697   5345.2  301.6043    21.00972
8   85640   5274.1  301.6041    20.01000
9   30694   5279.0  301.6039    20.01000", header = T)

set_B <- read.table(text =
    "ID_b    b1  b2  source  time_b
2   34.20   15.114  set1.csv.1  20.35750
7   67.20   16.114  set1.csv.2  21.35778
12  12.20   33.114  set1.csv.3  22.35806
17  73.20   67.114  set2.csv.1  23.35833
23  88.20   42.114  set2.csv.2  19.35861
28  90.20   52.114  set3.csv.1  00.35889", header = T)