How to subsample a data frame based on a datetime column in R


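I would like to subsample a data frame at hourly intervals based on its datetime column, starting from the time value in the first row. My data frame runs at 10-minute intervals from the first row to the last. Example data: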

df <- structure(list(datetime = structure(1:19, .Label = c("30/03/2011 05:09", 
"30/03/2011 05:19", "30/03/2011 05:29", "30/03/2011 05:39", "30/03/2011 05:49", 
"30/03/2011 05:59", "30/03/2011 06:09", "30/03/2011 06:19", "30/03/2011 06:29", 
"30/03/2011 06:39", "30/03/2011 06:49", "30/03/2011 06:59", "30/03/2011 07:09", 
"30/03/2011 07:19", "30/03/2011 07:29", "30/03/2011 07:39", "30/03/2011 07:49", 
"30/03/2011 07:59", "30/03/2011 08:09"), class = "factor"), a_count = c(66L, 
34L, 33L, 20L, 12L, 44L, 36L, 29L, 21L, 22L, 17L, 38L, 24L, 19L, 
60L, 54L, 27L, 36L, 45L), b_count = c(166.49, 167.54, 168.31, 
168.81, 169.24, 169.61, 169.96, 170.29, 170.63, 170.98, 171.31, 
171.62, 171.94, 172.29, 172.68, 173.15, 173.71, 174.34, 174.99
)), .Names = c("datetime", "a_count", "b_count"), class = "data.frame", row.names = c(NA, 
-19L))
df

I would like to end up with the following data frame:

        datetime   a_count b_count
30/03/2011 05:09       66  166.49
30/03/2011 06:09       36  169.96
30/03/2011 07:09       24  171.94
30/03/2011 08:09       45  174.99

Any suggestions would be greatly appreciated.

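It is hard to guess what structure your data has. Are you guaranteed to have a value at the first time value plus x times 60 minutes? What should happen if that value is missing? What should happen if there are two values at that time? Do you need approximate matching, so that, say, 09:10 counts as 09:09?

Here is an idea to get you started: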

# I will call your dataframe `d`. 
# Transform datetime to a POSIXct object, R's datatype for timestamps
d$datetime <- as.POSIXct(as.character(d$datetime), format='%d/%m/%Y %H:%M')
# Extract the minutes
d$minute <- as.numeric(format(d$datetime, '%M'))
# And select by identical minute.
subset(d, minute == d$minute[1])
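
Matching on the minute alone assumes every hour has exactly one row with that minute. As a hedged alternative sketch (not part of the original answer), you could build the hourly target times explicitly with seq() and match rows against them, or take the nearest row to each target if approximate matching is acceptable. The data below is a rebuilt copy of the question's sample, and the names targets, hourly and nearest are chosen here for illustration only.

```r
# Assumption: rebuild the question's sample data so this sketch runs on its own.
times <- sprintf("30/03/2011 %02d:%02d",
                 rep(5:8, each = 6),
                 rep(c(9, 19, 29, 39, 49, 59), times = 4))[1:19]
d <- data.frame(
  datetime = times,
  a_count = c(66L, 34L, 33L, 20L, 12L, 44L, 36L, 29L, 21L, 22L,
              17L, 38L, 24L, 19L, 60L, 54L, 27L, 36L, 45L),
  stringsAsFactors = FALSE
)
d$datetime <- as.POSIXct(d$datetime, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Hourly target times, starting at the first row's timestamp.
targets <- seq(from = d$datetime[1], to = max(d$datetime), by = "hour")

# Exact matching: keep rows whose timestamp is one of the targets.
hourly <- d[d$datetime %in% targets, ]

# Approximate matching: for each target, pick the row closest in time.
row_secs <- as.numeric(d$datetime)
nearest <- vapply(as.numeric(targets),
                  function(t) which.min(abs(row_secs - t)),
                  integer(1))
hourly_approx <- d[nearest, ]
```

With exact matching, a missing timestamp simply drops that hour from the result; the nearest-row variant always returns one row per target, so you may want to add a tolerance check if large gaps are possible.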

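I have the same questions as Thilo, but here is another solution.

You can also use the lubridate package to convert your time format, which may be more intuitive and easier to remember.

In addition, you can add a variable based on the hour and then summarize the data with plyr.

In the example below, I take the sum and the mean of a_count. You may need something different depending on your purpose.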

library(plyr)
library(lubridate)

df2 <- mutate(df, dt = dmy_hm(as.character(datetime)), hour = hour(dt), minute = minute(dt))
summary <- ddply(df2, .(hour), summarize, a_mean = mean(a_count), a_sum = sum(a_count))

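@Thilo: This is exactly what I wanted. Thank you very much! My data runs at exact 10-minute intervals from the first row of the data frame to the last. I will update the question.

Jake: Thank you very much for your answer. This also works very well!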
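Another option is to compute the minutes elapsed since the first row and keep the rows where that total is a multiple of 60:

> df$datetime <- as.POSIXct(strptime(df$datetime, format = "%d/%m/%Y %H:%M"))
> # Request minutes explicitly so the result does not depend on difftime's automatic unit choice
> df$dif <- c(0, cumsum(as.numeric(diff(df$datetime), units = "mins")))
> df[df$dif %% 60 == 0,]

               datetime a_count b_count dif
  2011-03-30 05:09:00      66  166.49   0
  2011-03-30 06:09:00      36  169.96  60
  2011-03-30 07:09:00      24  171.94 120
  2011-03-30 08:09:00      45  174.99 180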
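
Since the asker confirms the rows are spaced exactly 10 minutes apart with no gaps, a simpler route may be to index every sixth row without parsing the timestamps at all. This is only a sketch under that assumption; the variable step is a name introduced here, and only the a_count column is rebuilt so the example runs on its own.

```r
# Assumption: rows are exactly 10 minutes apart with no gaps or duplicates.
df <- data.frame(
  a_count = c(66L, 34L, 33L, 20L, 12L, 44L, 36L, 29L, 21L, 22L,
              17L, 38L, 24L, 19L, 60L, 54L, 27L, 36L, 45L)
)

# Number of rows per hour at a 10-minute sampling interval.
step <- 60 / 10

# Take the first row and every sixth row after it.
hourly <- df[seq(1, nrow(df), by = step), , drop = FALSE]
```

If a reading is ever missing or duplicated, the row positions drift away from the hour marks, so the timestamp-based approaches are the safer choice for irregular data.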