In R, how do I split and aggregate timestamped interval data with IDs into regular slots?

r for-loop


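I'm working on the next step of my data aggregation, following on from an earlier question. There, Jon Spring pointed me to a solution for counting the number of active events in a given time interval.

In the next step, I would like to aggregate this data and get the number of observations with the same ID that were active at any point during a fixed time interval.

Starting with a toy dataset of seven events with five IDs: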

library(tidyverse); library(lubridate)

df1 <- tibble::tibble(
  id = c("a", "b", "c", "c", "c", "d", "e"),
  start = c(ymd_hms("2018-12-10 13:01:00"),
                 ymd_hms("2018-12-10 13:07:00"),
                 ymd_hms("2018-12-10 14:45:00"),
                 ymd_hms("2018-12-10 14:48:00"),
                 ymd_hms("2018-12-10 14:52:00"),
                 ymd_hms("2018-12-10 14:45:00"),
                 ymd_hms("2018-12-10 14:45:00")),
  end = c(ymd_hms("2018-12-10 13:05:00"),
               ymd_hms("2018-12-10 13:17:00"),
               ymd_hms("2018-12-10 14:46:00"),
               ymd_hms("2018-12-10 14:50:00"),
               ymd_hms("2018-12-10 15:01:00"),
               ymd_hms("2018-12-10 14:51:00"),
               ymd_hms("2018-12-10 15:59:00")))

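I can brute-force a loop over every row of the data frame and "expand" each record into the specified intervals from its start to its end, here using 15 minutes: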

for (i in 1:nrow(df1)) {

  right <- df1 %>% 
    slice(i) %>% 
    mutate(start_floor = floor_date(start, "15 mins"))

  left <- tibble::tibble(
    timestamp = seq.POSIXt(right$start_floor, 
                           right$end, 
                           by  = "15 mins"),
    id = right$id)

  if (i == 1){
    result <- left
  }
  else {
    result <- bind_rows(result, left) %>% 
      distinct()
  }
}

Then a simple aggregation gives the final result:

result_agg <- result %>% 
  group_by(timestamp) %>% 
  summarise(users_mac = n())

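This gives the desired result, but it probably won't scale well to larger datasets. I currently need to use it on roughly 7 million records, and the data is growing.

Is there a better way to solve this problem?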
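
For comparison, here is a minimal sketch of the same expansion without a row-by-row loop. It swaps the tidyverse calls for base R so that it runs with no packages installed, and it rests on assumptions that are not in the question: timestamps are in UTC (so that flooring epoch seconds to a multiple of 900 matches floor_date(start, "15 mins")), and the names slot, first_slot, last_slot, n_slots and expanded are purely illustrative. Like the loop, it gives one row per id and slot touched, then counts ids per slot.

```r
# Toy data from the question, rebuilt with base R only (UTC assumed).
to_time <- function(x) as.POSIXct(x, tz = "UTC")
df1 <- data.frame(
  id    = c("a", "b", "c", "c", "c", "d", "e"),
  start = to_time(c("2018-12-10 13:01:00", "2018-12-10 13:07:00",
                    "2018-12-10 14:45:00", "2018-12-10 14:48:00",
                    "2018-12-10 14:52:00", "2018-12-10 14:45:00",
                    "2018-12-10 14:45:00")),
  end   = to_time(c("2018-12-10 13:05:00", "2018-12-10 13:17:00",
                    "2018-12-10 14:46:00", "2018-12-10 14:50:00",
                    "2018-12-10 15:01:00", "2018-12-10 14:51:00",
                    "2018-12-10 15:59:00")),
  stringsAsFactors = FALSE
)

slot <- 15 * 60  # slot width in seconds

# Index of the first and last slot touched by each interval.
first_slot <- as.numeric(df1$start) %/% slot
last_slot  <- as.numeric(df1$end) %/% slot
n_slots    <- last_slot - first_slot + 1

# Expand every row to one row per slot in a single vectorised step.
expanded <- data.frame(
  id   = rep(df1$id, n_slots),
  slot = rep(first_slot, n_slots) + sequence(n_slots) - 1
)

# Keep one row per (id, slot), then count ids per slot.
expanded <- unique(expanded)
per_slot <- tapply(expanded$id, expanded$slot, length)

result_agg <- data.frame(
  timestamp = as.POSIXct(as.numeric(names(per_slot)) * slot,
                         origin = "1970-01-01", tz = "UTC"),
  users_mac = as.vector(per_slot)
)
print(result_agg)
```

On the toy data this yields seven slots, from 13:00 to 15:45, with counts 2, 1, 3, 2, 1, 1, 1. The expanded table still grows with the total number of slots covered, so for millions of long intervals memory may become the limit rather than the loop.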

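I'm not sure about the efficiency, but one way is to create a sequence of 15-minute intervals from the minimum to the maximum time in the data and then find the users who fall within each interval.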

library(tidyverse)
library(lubridate)

timestamp = floor_date(seq(min(df1$start), max(df1$end), by = "15 mins"), "15 mins")

tibble(timestamp) %>%
     mutate(users_mac = map_dbl(timestamp,~with(df1, n_distinct(id[(
  start > . | end > .) & (start < . + minutes(15) | end < . + minutes(15))])))) %>%
     filter(users_mac != 0)

#    timestamp           users_mac
#    <dttm>                  <dbl>
#1 2018-12-10 13:00:00         2
#2 2018-12-10 13:15:00         1
#3 2018-12-10 14:45:00         3
#4 2018-12-10 15:00:00         2
#5 2018-12-10 15:15:00         1
#6 2018-12-10 15:30:00         1
#7 2018-12-10 15:45:00         1
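
The filter condition above can be read more easily once it is reduced. Assuming start <= end on every row (true for the toy data, but an assumption for real data), `start > t | end > t` is the same as `end > t`, and `start < t + 15 min | end < t + 15 min` is the same as `start < t + 15 min`. The following base R sketch checks that equivalence on the toy data; the names secs, width, slots, original and reduced are illustrative, and UTC is assumed.

```r
# Toy data from the question as seconds since the epoch (UTC assumed).
secs  <- function(x) as.numeric(as.POSIXct(x, tz = "UTC"))
id    <- c("a", "b", "c", "c", "c", "d", "e")
start <- secs(paste("2018-12-10", c("13:01:00", "13:07:00", "14:45:00", "14:48:00",
                                    "14:52:00", "14:45:00", "14:45:00")))
end   <- secs(paste("2018-12-10", c("13:05:00", "13:17:00", "14:46:00", "14:50:00",
                                    "15:01:00", "14:51:00", "15:59:00")))

width <- 15 * 60
slots <- seq(floor(min(start) / width) * width, max(end), by = width)

# Condition as written in the answer (rows = events, columns = slots).
original <- (outer(start, slots, ">") | outer(end, slots, ">")) &
  (outer(start, slots + width, "<") | outer(end, slots + width, "<"))

# Reduced form, valid because start <= end on every row.
reduced <- outer(end, slots, ">") & outer(start, slots + width, "<")

same <- identical(original, reduced)

# Distinct ids per slot, including the empty slots.
users_mac <- apply(reduced, 2, function(hit) length(unique(id[hit])))
print(same)
print(users_mac)
```

The events-by-slots matrix makes the logic visible but needs memory proportional to rows times slots, so it is meant as an illustration rather than something to run on millions of records.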

Use lubridate's as.interval and int_overlaps functions, followed by some tidyverse data wrangling to get the summary data:

library(dplyr)
library(tidyr)
library(lubridate)

# list of 15-minute time increments (buckets)
timestamp <- tibble(start = floor_date(seq(min(df1$start), max(df1$end), by = "15 mins"), "15 mins"),
                    end = lead(start, 1),
                    interval = as.interval(start, end)) %>%
  na.omit() %>%
  .$interval

# add in interval on df1 start -- end times
df1 <- mutate(df1, interval = as.interval(start, end))

# find if each record is in each bucket - may not scale if there are many buckets?
tmp <- sapply(df1$interval,
       function(x, timestamp) int_overlaps(x, timestamp),
       timestamp) %>%
  t()
colnames(tmp) <- int_start(timestamp) %>% as.character()

# count how many unique ids in each time bucket
bind_cols(df1, as_tibble(tmp)) %>%
  select(-start, -end, -interval) %>%
  gather(key = start, value = logged, -id) %>%
  filter(logged) %>%
  group_by(start) %>%
  summarise(n = n_distinct(id))

# A tibble: 7 x 2
  start                   n
  <chr>               <int>
1 2018-12-10 13:00:00     2
2 2018-12-10 13:15:00     1
3 2018-12-10 14:30:00     3
4 2018-12-10 14:45:00     3
5 2018-12-10 15:00:00     2
6 2018-12-10 15:15:00     1
7 2018-12-10 15:30:00     1
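
This output has a 14:30 row with 3 ids and no 15:45 row, unlike the other results on this page. One plausible reading, offered as an assumption rather than something checked against lubridate's source, is that the overlap test treats touching endpoints as overlapping, so events starting exactly at 14:45 also match the 14:30 to 14:45 bucket, while na.omit() drops the last bucket because its end is NA. The difference between a closed and a half-open test can be sketched in plain R; all names below are illustrative.

```r
# Times in minutes since midnight, to keep the arithmetic readable.
bucket_start <- 14 * 60 + 30   # 14:30
bucket_end   <- 14 * 60 + 45   # 14:45
ev_start     <- 14 * 60 + 45   # an event starting exactly at 14:45
ev_end       <- 14 * 60 + 46   # and ending at 14:46

# Closed test: touching endpoints count as an overlap.
closed <- ev_start <= bucket_end & bucket_start <= ev_end

# Half-open test [bucket_start, bucket_end): the boundary belongs to the next bucket.
half_open <- ev_start < bucket_end & bucket_start <= ev_end

print(c(closed = closed, half_open = half_open))
# closed is TRUE, half_open is FALSE
```

If the half-open behaviour is what is wanted, one option is to end each bucket one second before the next one starts.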

A tidy solution can be achieved with the tsibble package.

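library(tidyverse)
#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang
#> Registered S3 method overwritten by 'rvest':
#>   method            from
#>   read_xml.response xml2
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#>     date
library(tsibble, warn.conflicts = FALSE)

# df1 as defined in the question
df1 %>%
  mutate(
    start = floor_date(start, "15 mins"),
    end = floor_date(end, "15 mins")
  ) %>%
  gather("label", "index", start:end) %>%
  distinct(id, index) %>%
  mutate(date = as_date(index)) %>%
  as_tsibble(key = c(id, date), index = index) %>%
  fill_gaps() %>%
  index_by(index) %>%
  summarise(users_mac = n())
#> # A tsibble: 7 x 2 [15m]
#>   index               users_mac
#>   <dttm>                  <int>
#> 1 2018-12-10 13:00:00         2
#> 2 2018-12-10 13:15:00         1
#> 3 2018-12-10 14:45:00         3
#> 4 2018-12-10 15:00:00         2
#> 5 2018-12-10 15:15:00         1
#> 6 2018-12-10 15:30:00         1
#> 7 2018-12-10 15:45:00         1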

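Created on 2019-05-17 by the reprex package (v0.2.1)

The 14:45 to 15:00 slot should count 3 user IDs, not 5. That is user c, who has repeated events. From 13:00 to 13:15 there should be 2 users; from 13:15 to 13:30, 1; and from 15:00 to 15:15, 3.

@radek I've updated the answer, but I feel it may be overcomplicated. Take a look.

I tested it with larger data, and the results are incorrect when there are multiple dates. Any idea what the cause might be? o_O

I have updated the answer. It now also depends on the date.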