R: How do I count the number of movies being played each hour when I have a data frame of movie start and end times?
I have a data frame that looks like this:
start_timestamp end_timestamp
2012-11-18 05:53:36.0 2012-11-18 7:46:40.0
2012-11-18 06:34:23.0 2012-12-18 09:21:57.0
I would like the output to look like this:
hour movies_being_played
2012-11-18 05:00:00.0 1
2012-11-18 06:00:00.0 2
2012-11-18 07:00:00.0 2
2012-11-18 08:00:00.0 1
2012-11-18 09:00:00.0 1
The only approach I can think of is to create a table like this:
hour movies_being_played
2012-11-18 05:00:00.0 NA
2012-11-18 06:00:00.0 NA
2012-11-18 07:00:00.0 NA
2012-11-18 08:00:00.0 NA
2012-11-18 09:00:00.0 NA
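and then use a for loop that iterates over every hour in the given period and counts how many start_timestamp values are earlier than that hour and are paired with an end_timestamp that is later, but that seems very inefficient.
@alistaire's comment is a concise, efficient solution, and it should probably be both an actual answer and the accepted one.
I am posting this one anyway to show a general idiom for similar but more complex situations (and there aren't enough do() examples):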
library(dplyr)
# Example data: one row per movie
df <- data_frame(
  start_timestamp = as.POSIXct(c("2012-11-18 05:53:36.0", "2012-11-18 06:34:23.0")),
  end_timestamp = as.POSIXct(c("2012-11-18 07:46:40.0", "2012-11-18 09:21:57.0"))
)
# For a single row, return every clock hour between its start and end
hourly_count <- function(x) {
  range(x$start_timestamp, x$end_timestamp) %>%
    format("%Y-%m-%d %H:00:00") %>%
    as.POSIXct() -> rng
  hrs <- seq(from = rng[1], to = rng[2], by = "1 hour")
  data_frame(hour = hrs, is_playing = TRUE)
}
# Expand each row into its hours, then count the rows per hour
rowwise(df) %>%
  do(hourly_count(.)) %>%
  count(hour, is_playing) %>%
  select(-is_playing, movies_being_played = n)
## Source: local data frame [5 x 2]
## Groups: hour [5]
##
## hour movies_being_played
## <dttm> <int>
## 1 2012-11-18 05:00:00 1
## 2 2012-11-18 06:00:00 2
## 3 2012-11-18 07:00:00 2
## 4 2012-11-18 08:00:00 1
## 5 2012-11-18 09:00:00 1
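For anyone without dplyr at hand, the same expand-then-count idiom can be sketched in base R. This is only an illustration under my own assumptions: the helper name expand_hours is hypothetical, the timestamps are pinned to UTC to keep the result reproducible, and the hour column comes back as character rather than POSIXct.

```r
# Same example data as in the answer, pinned to UTC (an assumption made here).
df <- data.frame(
  start_timestamp = as.POSIXct(c("2012-11-18 05:53:36", "2012-11-18 06:34:23"), tz = "UTC"),
  end_timestamp   = as.POSIXct(c("2012-11-18 07:46:40", "2012-11-18 09:21:57"), tz = "UTC")
)

# Hypothetical helper: every clock hour that row i touches.
expand_hours <- function(i) {
  seq(
    from = as.POSIXct(trunc(df$start_timestamp[i], "hours")),
    to   = as.POSIXct(trunc(df$end_timestamp[i], "hours")),
    by   = "hour"
  )
}

# Stack the hours of all rows into one vector.
all_hours <- do.call(c, lapply(seq_len(nrow(df)), expand_hours))

# Tabulate how often each hour occurs, i.e. how many movies touch it.
tab <- table(format(all_hours, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
hourly <- data.frame(
  hour = names(tab),
  movies_being_played = as.integer(tab),
  stringsAsFactors = FALSE
)
print(hourly)
```

Like the do() version, this creates one element per movie per hour, so it is convenient but not memory-friendly when movies span long periods.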
Not that bad: sapply(seq(trunc(min(df$start_timestamp), 'hour'), max(df$end_timestamp), by = 'hour'), function(x){sum(x >= df$start_timestamp & x …
Much more elegant than I expected!
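A spelled-out base R sketch in the spirit of that one-liner may help. The overlap rule used here is my own assumption, not taken from the comment: a movie counts toward an hour if it started in or before that hour and ends at or after the top of it. The names start_hour, hours, counts and result are illustrative, and the timestamps are pinned to UTC.

```r
# Same example data as in the answer, pinned to UTC (an assumption made here).
df <- data.frame(
  start_timestamp = as.POSIXct(c("2012-11-18 05:53:36", "2012-11-18 06:34:23"), tz = "UTC"),
  end_timestamp   = as.POSIXct(c("2012-11-18 07:46:40", "2012-11-18 09:21:57"), tz = "UTC")
)

# Floor each start to the top of its hour so a movie counts for the hour it starts in.
start_hour <- as.POSIXct(trunc(df$start_timestamp, "hours"))

# One grid point per hour from the earliest start to the latest end.
hours <- seq(min(start_hour), max(df$end_timestamp), by = "hour")

# Assumed rule: playing during hour h if started in or before h and ends at or after h.
counts <- vapply(
  hours,
  function(h) sum(start_hour <= h & df$end_timestamp >= h),
  integer(1)
)

result <- data.frame(hour = hours, movies_being_played = counts)
print(result)
```

This compares every hour against every row, so the work grows with hours times rows, but it never materialises one row per movie per hour.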