R: How do I join two data frames by the nearest date-time?
I have two datasets, each containing date-time values in POSIXlt format along with some other numeric and character variables.

I want to merge the two datasets on their date-time columns. However, the timestamps in the two datasets do not match, so I need to combine them by the nearest date-time (before or after).

In my example, the data value "e" at 2016-03-01 23:52:00 needs to be combined with "binH" at 2016-03-02 00:00:00, not with "binG".

Is there a function that lets me combine the datasets by the closest date-time value, even when that value comes later?

I have found ways to match each date to the previous one using the cut() function or roll=Inf in data.table, but I cannot get my timestamps into any format that roll="nearest" will accept.
> df1
                date1 value
1 2016-03-01 17:52:00     a
2 2016-03-01 18:01:30     b
3 2016-03-01 18:05:00     c
4 2016-03-01 20:42:30     d
5 2016-03-01 23:52:00     e
> df2
                date2 bin_name
1 2016-03-01 17:00:00     binA
2 2016-03-01 18:00:00     binB
3 2016-03-01 19:00:00     binC
4 2016-03-01 20:00:00     binD
5 2016-03-01 21:00:00     binE
6 2016-03-01 22:00:00     binF
7 2016-03-01 23:00:00     binG
8 2016-03-02 00:00:00     binH
9 2016-03-02 01:00:00     binI
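data.table should be able to do this (can you explain the error you ran into?), although it does tend to convert POSIXlt to POSIXct on its own (perhaps convert the datetime column manually to keep data.table happy). Also make sure the key column is set before using roll.

(I created my own example tables here to make my life easier. If you want to dput your own example tables, I am happy to update this example with your data.)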
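A minimal sketch of such a rolling join, rebuilding the question's two tables. The table construction, the helper column `join_time`, and the explicit UTC time zone are assumptions made for illustration, not the answerer's own code:

```r
library(data.table)

# Rebuild the question's tables with POSIXct (not POSIXlt) timestamps
df1 <- data.table(
  date1 = as.POSIXct(c("2016-03-01 17:52:00", "2016-03-01 18:01:30",
                       "2016-03-01 18:05:00", "2016-03-01 20:42:30",
                       "2016-03-01 23:52:00"), tz = "UTC"),
  value = c("a", "b", "c", "d", "e")
)
df2 <- data.table(
  date2 = seq(as.POSIXct("2016-03-01 17:00:00", tz = "UTC"),
              by = "hour", length.out = 9),
  bin_name = paste0("bin", LETTERS[1:9])
)

# Join on a copy of each timestamp so both original columns survive the join
df1[, join_time := date1]
df2[, join_time := date2]
setkey(df1, join_time)
setkey(df2, join_time)

# For each row of df1, take the df2 row whose key is closest (earlier or later)
result <- df2[df1, roll = "nearest"]
print(result[, .(date1, value, date2, bin_name)])
```

With this data, "e" at 23:52:00 is paired with binH at 00:00:00 rather than binG, which is the behaviour the question asks for.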
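I ran into a similar problem, but was not using data.table or tidyverse. I wrote my own function for "approximate merging", amerge. It takes 4 arguments:
- the two data frames
- a vector of column names for the "firm" (non-approximate) merge; these must exist in both data frames
- the name of a single column (present in both data frames) to use for the approximate merge. It works with any numeric values, including dates
The idea is to merge the best-matching rows 1-to-1 without losing any rows from either data frame. Below is my commented code and a working example.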
amerge <- function(d1, d2, firm=NULL, approx=NULL) {
  rt = Sys.time()
  # Take care of conflicting column names
  n2 = data.frame(oldname = names(d2), newname = names(d2))
  n2$newname = as.character(n2$newname)
  n2$newname[(n2$oldname %in% names(d1)) & !(n2$oldname %in% firm)] =
    paste(n2$newname[(n2$oldname %in% names(d1)) & !(n2$oldname %in% firm)], "2", sep=".")
  # Add unique row IDs
  if (length(c(firm, approx))>1) {
    d1$ID1 = factor(apply(d1[,c(approx,firm)], 1, paste, collapse=" "))
    d2$ID2 = factor(apply(d2[,c(approx,firm)], 1, paste, collapse=" "))
  } else {
    d1$ID1 = factor(d1[,c(approx,firm)])
    d2$ID2 = factor(d2[,c(approx,firm)])
  }
  # Perform initial merge on the 'firm' parameters, if any
  # Otherwise match all to all
  if (length(firm)>0) {
    t1 = merge(d1, d2, by=firm, all=TRUE, suffixes=c("",".2"))
  } else {
    names(d2) = c(n2$newname,"ID2")
    t1 = data.frame()
    for (i1 in 1:nrow(d1)) {
      trow = d1[i1,]
      t1 = rbind(t1, cbind(trow, d2))
    }
  }
  # Match by the most approximate record
  if (length(approx)==1) {
    # Calculate the differential for approximate merging
    t1$DIFF = abs(t1[,approx] - t1[,n2$newname[n2$oldname==approx]])
    # Sort data by ascending DIFF, so that best matching records are used first
    t1 = t1[order(t1$DIFF, t1$ID1, t1$ID2),]
    t2 = data.frame()
    d2$used = 0
    # For each record of d1, find match from d2
    for (i1 in na.omit(unique(t1$ID1))) {
      tx = t1[!is.na(t1$DIFF) & t1$ID1==i1,]
      # If there are non-missing records, get the one with minimum DIFF (top one)
      if (nrow(tx)>0) {
        tx = tx[1,]
        # If matching record found, remove it from the pool, so it's not used again
        t1[!is.na(t1$ID2) & t1$ID2==tx$ID2, c(n2$newname[!(n2$newname %in% firm)], "DIFF")] = NA
        # And mark it as used
        d2$used[d2$ID2==tx$ID2] = 1
      } else {
        # If there are no non-missing records, just get the first one from the top
        tx = t1[!is.na(t1$ID1) & t1$ID1==i1,][1,]
      }
      t2 = rbind(t2,tx)
    }
  } else {
    t2 = t1
  }
  # Make the records the same order as d1
  t2 = t2[match(d1$ID1, t2$ID1),]
  # Add unmatched records from d2 to the end of output
  if (any(d2$used==0)) {
    tx = t1[t1$ID2 %in% d2$ID2[d2$used==0], ]
    tx = tx[!duplicated(tx$ID2),]
    tx[, names(d1)[!(names(d1) %in% c(firm))]] = NA
    t2 = rbind(t2,tx)
    t2[is.na(t2[,approx]), approx] = t2[is.na(t2[,approx]), n2$newname[n2$oldname==approx]]
  }
  t2$DIFF = t2$ID1 = t2$ID2 = NULL
  # 'units' must be passed by name: the third positional argument of difftime() is 'tz'
  cat("* Run time: ", round(as.numeric(difftime(Sys.time(), rt, units="secs")),1), " seconds.\n", sep="")
  return(t2)
}
An example:
new <- data.frame(ID=c(1,1,1,2), date = as.POSIXct( c("2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2016-04-12 11:03:00")), new = c("t","u","v","x"))
old <- data.frame(ID=c(1,1,1,1,1), date = as.POSIXct( c("2016-03-07 12:20:00", "2016-04-02 12:20:00", "2016-03-01 10:09:00", "2015-04-12 10:09:00","2016-03-03 12:20:00")), old = c("a","b","c","d","e"))
amerge(old, new, firm="ID", approx="date")
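I hope you really mean POSIXct format. Storing POSIXlt vectors (which are actually lists) in a data frame causes a lot of confusion.

@user2223405 Did this solution work for you? Either way, make sure you do not abandon your question. Come back and tell us what worked and what did not.

I tried this approach on a similar case of mine, but there the merge of the two data.tables came out wrong. There seems to be a two-hour offset: the nearest datetime value matched between the two data.tables is always 2 hours away, even though closer datetime values actually exist.

Are these two timestamp columns in the same time zone?

Yes, I think so. I mean the values are in the same time zone, but I am not sure whether the time zone has to be defined before running the code.

The time zone is stored as an attribute of the column. If that attribute differs between the two columns being joined, it could produce the effect you are seeing. You can inspect it with attr(dt$col, "tzone").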
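A small sketch of how a constant offset like the one described in the comments can arise. The variable names and the choice of "Europe/Berlin" are assumptions for illustration only; the commenter's actual time zones are not known:

```r
# The same clock string parsed under two different time zones
t_utc    <- as.POSIXct("2016-03-01 12:00:00", tz = "UTC")
t_berlin <- as.POSIXct("2016-03-01 12:00:00", tz = "Europe/Berlin")

# The time zone is kept as an attribute on each POSIXct vector
attr(t_utc, "tzone")
attr(t_berlin, "tzone")

# Both print as 12:00:00, but they are different instants, so a nearest
# join between columns built this way would be shifted by this constant gap
gap_hours <- as.numeric(difftime(t_utc, t_berlin, units = "hours"))
print(gap_hours)
```

Parsing both columns with the same explicit `tz` argument before joining is one way to rule this out.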
Output of the amerge(old, new, firm="ID", approx="date") call:
   ID                date  old              date.2  new
2   1 2016-03-07 12:20:00    a 2016-03-07 12:20:00    u
6   1 2016-04-02 12:20:00    b 2016-04-02 12:20:00    v
7   1 2016-03-01 10:09:00    c                <NA> <NA>
10  1 2015-04-12 10:09:00    d                <NA> <NA>
13  1 2016-03-03 12:20:00    e 2016-03-02 12:20:00    t
16  2 2016-04-12 11:03:00 <NA> 2016-04-12 11:03:00    x
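For comparison, the plain nearest-timestamp lookup from the question can also be sketched in base R. This is a simplified illustration on the question's data, not a replacement for amerge: unlike amerge, it lets several rows of df1 share the same df2 row and drops unmatched df2 rows. The names `nearest` and `merged` and the UTC time zone are assumptions:

```r
df1 <- data.frame(
  date1 = as.POSIXct(c("2016-03-01 17:52:00", "2016-03-01 18:01:30",
                       "2016-03-01 18:05:00", "2016-03-01 20:42:30",
                       "2016-03-01 23:52:00"), tz = "UTC"),
  value = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  date2 = seq(as.POSIXct("2016-03-01 17:00:00", tz = "UTC"),
              by = "hour", length.out = 9),
  bin_name = paste0("bin", LETTERS[1:9]),
  stringsAsFactors = FALSE
)

# For each df1 timestamp, find the index of the closest df2 timestamp
nearest <- vapply(seq_len(nrow(df1)), function(i) {
  which.min(abs(as.numeric(df2$date2) - as.numeric(df1$date1[i])))
}, integer(1))

# Attach the matched df2 rows side by side
merged <- cbind(df1, df2[nearest, ])
rownames(merged) <- NULL
print(merged)
```

This compares every pair of rows, so it is only practical for small tables.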