R xts-将不等时间步长xts重采样为等距时间序列
我在R和xts时间序列一起工作 我所拥有的: 具有不等间隔时间步长的时间序列数据集 我想要的: 具有等距时间步长的时间序列,其值对应于与时间步长重叠的原始值的比例(参见下面的示例) 示例:具有如下原始系列:R xts-将不等时间步长xts重采样为等距时间序列,r,xts,R,Xts,我在R和xts时间序列一起工作 我所拥有的: 具有不等间隔时间步长的时间序列数据集 我想要的: 具有等距时间步长的时间序列,其值对应于与时间步长重叠的原始值的比例(参见下面的示例) 示例:具有如下原始系列: sample_xts <- as.xts(read.zoo(text=' 2016-07-01 00:00:20, 0.0 2016-07-01 00:01:20, 60.0 2016-07-01 00:01:50, 30.0 2016-07-01 00:02:30, 40.
sample_xts <- as.xts(read.zoo(text='
2016-07-01 00:00:20, 0.0
2016-07-01 00:01:20, 60.0
2016-07-01 00:01:50, 30.0
2016-07-01 00:02:30, 40.0
2016-07-01 00:04:20, 110.0
2016-07-01 00:05:30, 140.0
2016-07-01 00:06:00, 97.0
2016-07-01 00:07:12, 144.0
2016-07-01 00:08:09, 0.0
', sep=',', index=1, tz='', format="%Y-%m-%d %H:%M:%S"))
names(sample_xts) <- c('x')
x
2016-07-01 00:00:00, 0.0
2016-07-01 00:01:00, 40.0
2016-07-01 00:02:00, 60.0
2016-07-01 00:03:00, 60.0
2016-07-01 00:04:00, 60.0
2016-07-01 00:05:00, 100.0
2016-07-01 00:06:00, 157.0
2016-07-01 00:07:00, 120.0
2016-07-01 00:08:00, 24.0
2016-07-01 00:09:00, 0.0
注:
- 某些原始时间步长小于新时间步长,而 其他的更大李>
- x的列和保持不变(即621)
任何帮助都将不胜感激!谢谢。以下是我使用
zoo
编写的一些代码-我没有太多使用xts
,因此我不知道是否可以应用相同的函数。希望有帮助
功能
以下函数为原始数据的每个间隔计算与给定间隔重叠的分数(注意:在以下所有代码中,变量名ta1
和ta2
指给定时间间隔的开始和结束(例如,需要作为输出的每个相等间隔),而tb1
和tb2
指原始数据(不等)间隔的开始和结束:
(注意:这对您提供的示例数据很有效,但在较大的数据集上,我发现速度太慢了。由于我编写此代码是为了使用常规时间步长对时间序列重新采样,因此我通常使用固定的时间间隔来完成此步骤,这会大大加快速度。修改代码可能很容易(请参阅下一个函数的代码)以根据原始数据的间隔加快此步骤。)
下一个函数使用前两个函数来计算间隔ta1
-ta2
的重采样值:
check.overlap <- function(ta1,ta2,tb1,tb2){
ov <- vector("logical",4)
ov[1] <- (tb1 <= ta1 & tb2 >= ta2) # Interval 2 starts earlier and ends later than interval 1
ov[2] <- (tb1 >= ta1 & tb2 <= ta2) # Interval 2 is fully contained within interval 1
ov[3] <- (tb1 <= ta1 & tb2 >= ta1) # Interval 2 partly overlaps with interval 1 (starts earlier, ends earlier)
ov[4] <- (tb1 <= ta2 & tb2 >= ta2) # Interval 2 partly overlaps with interval 1 (starts later, ends later)
return(as.logical(sum(ov))) # Gives TRUE if at least one element of ov is TRUE, otherwise FALSE
}
fracres <- function(tstart,interval,input){
# tstart: POSIX object
# interval: length of interval in seconds
# input: zoo object
ta1 <- tstart
ta2 <- tstart + interval
# First, determine which records of the original data (input) overlap with the current
# interval, to avoid going through the whole object at every iteration
ind <- index(input)
ind1 <- index(lag(input,-1))
recs <- which(sapply(1:length(ind),function(x) check.overlap(ta1,ta2,ind[x],ind1[x])))
#recs <- which(abs(as.numeric(difftime(ind,ta1,units="secs"))) < 601)
# For each record overlapping with the current interval, return the fraction of the input data interval contained in the current interval
if(length(recs) > 0){
fracs <- sapply(1:length(recs), function(x) frac.overlap(ta1,ta2,ind[recs[x]],ind1[recs[x]]))
return(sum(coredata(input)[recs]*fracs))
} else {
return(0)
}
}
看起来您的数据集包含瞬时值(“在01:20
,x的值为60”)。由于我编写此代码是为了求和值,因此时间戳的含义不同(“从01:20开始的记录的值为60”)。要更正此问题,需要移动记录:
sample_zoo <- lag(sample_zoo,1)
然后,我们可以应用函数fracres
,如上所述:
data.out <- sapply(1:length(time.out), function(x) fracres(tstart=time.out[x],interval=60,input=sample_zoo))
最后,时间序列再次以与之前相反的方向移动一步:
zoo.out <- lag(zoo.out,-1)
2016-07-01 00:01:00 2016-07-01 00:02:00 2016-07-01 00:03:00 2016-07-01 00:04:00 2016-07-01 00:05:00 2016-07-01 00:06:00 2016-07-01 00:07:00 2016-07-01 00:08:00 2016-07-01 00:09:00
40 60 60 60 100 157 120 24 0
zoo.out我终于决定走“while循环路”有了这个,我创建了下面的解决方案。它工作得很好-速度不是很快,但执行时间似乎与时间序列的长度成正比。我用我在问题中发布的小示例和具有330000个观察值的源时间序列以及约110000个时间步的目标时间序列对它进行了测试
源序列和目标序列都可以有不规则的时间步长。结果序列的总和与源序列的总和相同。
性能:速度还行,但我确信它可以更快。我想它显然适合RCpp版本,对于长系列来说应该更快。现在这对我来说已经足够了,如果/当我开始创建RCpp版本时,我会在这里发布
如果您对绩效改进有任何建议,请发布!
谢谢
sameEndTime <- function(i,j,src_index,dest_index){
if(src_index[i] == dest_index[j]){
TRUE
} else {
FALSE
}
}
wholeSourceStepIsWithinDestStep <- function(i,j,src_index,dest_index){
if(dest_index[j-1] <= src_index[i-1] & src_index[i] <= dest_index[j]){
TRUE
} else {
FALSE
}
}
wholeDestStepIsWithinSourceStep <- function(i,j,src_index,dest_index){
if(src_index[i-1] <= dest_index[j-1] & dest_index[j] <= src_index[i]){
TRUE
} else {
FALSE
}
}
onlyEndOfSourceStepIsWithinDestStep <- function(i,j,src_index,dest_index){
if(src_index[i-1] < dest_index[j-1] & src_index[i] < dest_index[j] & src_index[i] > dest_index[j-1]){
TRUE
} else {
FALSE
}
}
onlyStartOfSourceStepIsWithinDestStep <- function(i,j,src_index,dest_index){
if(src_index[i-1] < dest_index[j] & src_index[i-1] > dest_index[j-1] & src_index[i] > dest_index[j]){
TRUE
} else {
FALSE
}
}
resampleToDestTimeSteps <- function(src, dest){
# src and dest are both xts with only one time series each
# src is the original series and
# dest holds the time steps of the final series
#
# NB: there is an issue with the very first time step
# (gets ignored in this version)
#
original_names <- names(src)
names(src) <- c("value")
names(dest) <- c("value")
dest$value <- dest$value*0.0
dest$value[is.na(dest$value)] <- 0.0
dest[1]$value = 0.0
for(k in 2:length(src)){
src[k]$value <- src[k]$value/as.numeric(difftime(index(src[k]),index(src[k-1]),units="secs"))
}
# First value is NA due to lag at this point (we don't want that)
src$value[1] = 0.0
i = 2 # source timestep counter
j = 2 # destination timestep counter
src_index = index(src)
dest_index = index(dest)
src_length = length(src)
dest_length = length(dest)
# Make sure we start with an overlap
if(src_index[2] < dest_index[1]){
while(src_index[i] < dest_index[1]){
i = i + 1
}
} else if(dest_index[2] < src_index[1]){
while(dest_index[j] < src_index[1]){
j = j + 1
}
}
while(i <= src_length & j <= dest_length){
if( wholeSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(src_index[i],src_index[i-1],units="secs"))
if(sameEndTime(i,j,src_index,dest_index)){
j = j+1
}
i = i+1
} else if( wholeDestStepIsWithinSourceStep(i,j,src_index,dest_index) ){
dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(dest_index[j],dest_index[j-1],units="secs"))
if(sameEndTime(i,j,src_index,dest_index)){
i = i+1
}
j = j+1
} else if( onlyEndOfSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(src_index[i],dest_index[j-1],units="secs"))
i = i+1
} else if( onlyStartOfSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
diff_time = difftime(dest_index[j],src_index[i-1],units="secs")
dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(diff_time)
j = j+1
} else {
print("======================================================")
print(paste0("i=",i,", j=",j))
print(paste0("src_index[i] =",src_index[i]))
print(paste0("dest_index[j] =",dest_index[j]))
print(" ")
print(paste0("src_index[i-1] =",src_index[i-1]))
print(paste0("dest_index[j-1]=",dest_index[j-1]))
print("======================================================")
stop("This should never happen.")
}
}
names(dest) <- original_names
return(dest)
}
sameEndTime谢谢@m.chips!终于有时间在我的实时系列上试用了。效果很好,但是,是的,正如你指出的,它变得“慢得让人望而却步”即使是很短的序列。执行时间似乎与序列的长度成反比-指数增长或2^N。我的序列在30万到1 mil之间。观察结果受你的算法启发,我决定尝试其他方法。在下面的问题答案中发布它。
data.out <- sapply(1:length(time.out), function(x) fracres(tstart=time.out[x],interval=60,input=sample_zoo))
zoo.out <- read.zoo(data.frame(time.out,data.out))
zoo.out <- lag(zoo.out,-1)
2016-07-01 00:01:00 2016-07-01 00:02:00 2016-07-01 00:03:00 2016-07-01 00:04:00 2016-07-01 00:05:00 2016-07-01 00:06:00 2016-07-01 00:07:00 2016-07-01 00:08:00 2016-07-01 00:09:00
40 60 60 60 100 157 120 24 0
sameEndTime <- function(i,j,src_index,dest_index){
if(src_index[i] == dest_index[j]){
TRUE
} else {
FALSE
}
}
wholeSourceStepIsWithinDestStep <- function(i,j,src_index,dest_index){
if(dest_index[j-1] <= src_index[i-1] & src_index[i] <= dest_index[j]){
TRUE
} else {
FALSE
}
}
wholeDestStepIsWithinSourceStep <- function(i,j,src_index,dest_index){
if(src_index[i-1] <= dest_index[j-1] & dest_index[j] <= src_index[i]){
TRUE
} else {
FALSE
}
}
onlyEndOfSourceStepIsWithinDestStep <- function(i,j,src_index,dest_index){
if(src_index[i-1] < dest_index[j-1] & src_index[i] < dest_index[j] & src_index[i] > dest_index[j-1]){
TRUE
} else {
FALSE
}
}
onlyStartOfSourceStepIsWithinDestStep <- function(i,j,src_index,dest_index){
if(src_index[i-1] < dest_index[j] & src_index[i-1] > dest_index[j-1] & src_index[i] > dest_index[j]){
TRUE
} else {
FALSE
}
}
resampleToDestTimeSteps <- function(src, dest){
# src and dest are both xts with only one time series each
# src is the original series and
# dest holds the time steps of the final series
#
# NB: there is an issue with the very first time step
# (gets ignored in this version)
#
original_names <- names(src)
names(src) <- c("value")
names(dest) <- c("value")
dest$value <- dest$value*0.0
dest$value[is.na(dest$value)] <- 0.0
dest[1]$value = 0.0
for(k in 2:length(src)){
src[k]$value <- src[k]$value/as.numeric(difftime(index(src[k]),index(src[k-1]),units="secs"))
}
# First value is NA due to lag at this point (we don't want that)
src$value[1] = 0.0
i = 2 # source timestep counter
j = 2 # destination timestep counter
src_index = index(src)
dest_index = index(dest)
src_length = length(src)
dest_length = length(dest)
# Make sure we start with an overlap
if(src_index[2] < dest_index[1]){
while(src_index[i] < dest_index[1]){
i = i + 1
}
} else if(dest_index[2] < src_index[1]){
while(dest_index[j] < src_index[1]){
j = j + 1
}
}
while(i <= src_length & j <= dest_length){
if( wholeSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(src_index[i],src_index[i-1],units="secs"))
if(sameEndTime(i,j,src_index,dest_index)){
j = j+1
}
i = i+1
} else if( wholeDestStepIsWithinSourceStep(i,j,src_index,dest_index) ){
dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(dest_index[j],dest_index[j-1],units="secs"))
if(sameEndTime(i,j,src_index,dest_index)){
i = i+1
}
j = j+1
} else if( onlyEndOfSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(difftime(src_index[i],dest_index[j-1],units="secs"))
i = i+1
} else if( onlyStartOfSourceStepIsWithinDestStep(i,j,src_index,dest_index) ){
diff_time = difftime(dest_index[j],src_index[i-1],units="secs")
dest[j]$value = dest[j]$value + as.numeric(src[i]$value)*as.numeric(diff_time)
j = j+1
} else {
print("======================================================")
print(paste0("i=",i,", j=",j))
print(paste0("src_index[i] =",src_index[i]))
print(paste0("dest_index[j] =",dest_index[j]))
print(" ")
print(paste0("src_index[i-1] =",src_index[i-1]))
print(paste0("dest_index[j-1]=",dest_index[j-1]))
print("======================================================")
stop("This should never happen.")
}
}
names(dest) <- original_names
return(dest)
}