Determine the number of processes running each day and the average number of days since those processes started, in R

Tags: r, tidyverse, rolling-computation, cumulative-sum

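I have a large dataset of processes (their IDs) with start dates and the corresponding end dates.

What I want comes in two parts. First, how many processes are running on each day. Second, for the processes running on a given day, the average number of days they have been running since their start.

A sample dataset is shown below.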

> dput(df)
structure(list(Process = c("P001", "P002", "P003", "P004", "P005"
), Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020", 
"13-01-2020"), End = c("10-01-2020", "09-01-2020", "04-01-2020", 
"17-01-2020", "19-01-2020")), class = "data.frame", row.names = c(NA, 
-5L))

> df
  Process      Start        End
1    P001 01-01-2020 10-01-2020
2    P002 02-01-2020 09-01-2020
3    P003 03-01-2020 04-01-2020
4    P004 08-01-2020 17-01-2020
5    P005 13-01-2020 19-01-2020

For the first part, I proceeded like this:

library(tidyverse)

df %>% pivot_longer(cols = c(Start, End), names_to = 'event', values_to = 'dates') %>%
  mutate(dates = as.Date(dates, format = "%d-%m-%Y")) %>%
  mutate(dates = if_else(event == 'End', dates+1, dates)) %>%
  arrange(dates, event) %>%
  mutate(processes = ifelse(event == 'Start', 1, -1),
         processes = cumsum(processes)) %>%
  select(-Process, -event) %>%
  complete(dates = seq.Date(min(dates), max(dates), by = '1 day')) %>%
  fill(processes)

# A tibble: 20 x 2
   dates      processes
   <date>         <dbl>
 1 2020-01-01         1
 2 2020-01-02         2
 3 2020-01-03         3
 4 2020-01-04         3
 5 2020-01-05         2
 6 2020-01-06         2
 7 2020-01-07         2
 8 2020-01-08         3
 9 2020-01-09         3
10 2020-01-10         2
11 2020-01-11         1
12 2020-01-12         1
13 2020-01-13         2
14 2020-01-14         2
15 2020-01-15         2
16 2020-01-16         2
17 2020-01-17         2
18 2020-01-18         1
19 2020-01-19         1
20 2020-01-20         0
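
As a cross-check on the counts above, here is a minimal sketch in base R that counts, for each day, the processes whose interval covers that day. The names start, end, days and running are introduced only for this illustration, and it assumes a process counts as running on both its start date and its end date, which is what the output above implies.

```r
# Sample data from the question
df <- data.frame(
  Process = c("P001", "P002", "P003", "P004", "P005"),
  Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020", "13-01-2020"),
  End = c("10-01-2020", "09-01-2020", "04-01-2020", "17-01-2020", "19-01-2020")
)

# Parse the day-month-year strings into Date objects
start <- as.Date(df$Start, format = "%d-%m-%Y")
end <- as.Date(df$End, format = "%d-%m-%Y")

# One entry per calendar day, from the first start to the day after the last end
days <- seq(min(start), max(end) + 1, by = "day")

# Assumption: a process is running on a day when start <= day <= end
running <- vapply(
  seq_along(days),
  function(i) sum(start <= days[i] & days[i] <= end),
  integer(1)
)

data.frame(dates = days, processes = running)
```

This compares every day against every process, so it is slower than the cumulative-sum approach on a large dataset, but it is easy to reason about and useful for validating the result on a sample.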


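For the second part, the desired output is something like the "mean days" column in the following screenshot, together with an explanation.

A tidyverse approach is preferred, please.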

Here is one way to do it:

library(tidyverse)

df %>%
  #Convert to date
  mutate(across(c(Start, End), lubridate::dmy),
  #Create a sequence of dates from start to end
        Dates = map2(Start, End, seq, by = 'day')) %>%
  #Get data in long format
  unnest(Dates) %>%
  #Remove columns
  select(-Start, -End) %>%
  #For each process
  group_by(Process) %>%
  #Count number of days spent on it
  mutate(days_spent = row_number() - 1) %>%
  #For each date
  group_by(Dates) %>%
  #Count number of process running and average days
  summarise(process = n(), 
            mean_days = mean(days_spent))

This returns:

#   Dates      process mean_days
#   <date>       <int>     <dbl>
# 1 2020-01-01       1      0   
# 2 2020-01-02       2      0.5 
# 3 2020-01-03       3      1   
# 4 2020-01-04       3      2   
# 5 2020-01-05       2      3.5 
# 6 2020-01-06       2      4.5 
# 7 2020-01-07       2      5.5 
# 8 2020-01-08       3      4.33
# 9 2020-01-09       3      5.33
#10 2020-01-10       2      5.5 
#11 2020-01-11       1      3   
#12 2020-01-12       1      4   
#13 2020-01-13       2      2.5 
#14 2020-01-14       2      3.5 
#15 2020-01-15       2      4.5 
#16 2020-01-16       2      5.5 
#17 2020-01-17       2      6.5 
#18 2020-01-18       1      5   
#19 2020-01-19       1      6
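
To see where a value such as 4.33 comes from, the row for 2020-01-08 can be reproduced by hand. The following is a small sketch in base R using the sample data from the question; the variable names day, active and days_since_start are made up for this illustration.

```r
# Start and end dates of P001..P005 from the sample data
start <- as.Date(
  c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020", "13-01-2020"),
  format = "%d-%m-%Y"
)
end <- as.Date(
  c("10-01-2020", "09-01-2020", "04-01-2020", "17-01-2020", "19-01-2020"),
  format = "%d-%m-%Y"
)

# The day to inspect
day <- as.Date("2020-01-08")

# Processes whose interval covers that day (P001, P002 and P004)
active <- start <= day & day <= end

# Days elapsed since each active process started
days_since_start <- as.numeric(day - start[active])
days_since_start        # 7 6 0

# Average over the active processes: (7 + 6 + 0) / 3
mean(days_since_start)  # 4.333333
```

P003 has already ended and P005 has not yet started on that date, so neither contributes to the average.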


Thanks, Ronak. I had just worked out something similar myself. Accepted and upvoted.