Determine the number of processes running each day and the average number of days since those processes started, in R

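I have a large dataset of processes (their IDs), with a start date and a corresponding end date for each.

What I want has two parts. First, how many processes are running on each day. Second, for the processes running on a given day, the average number of days they have been running since they started.

A sample dataset is shown below.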

> dput(df)
structure(list(Process = c("P001", "P002", "P003", "P004", "P005"
), Start = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020", 
"13-01-2020"), End = c("10-01-2020", "09-01-2020", "04-01-2020", 
"17-01-2020", "19-01-2020")), class = "data.frame", row.names = c(NA, 
-5L))

df

> df
  Process      Start        End
1    P001 01-01-2020 10-01-2020
2    P002 02-01-2020 09-01-2020
3    P003 03-01-2020 04-01-2020
4    P004 08-01-2020 17-01-2020
5    P005 13-01-2020 19-01-2020
For the first part, I proceeded like this:

library(tidyverse)

df %>% pivot_longer(cols = c(Start, End), names_to = 'event', values_to = 'dates') %>%
  mutate(dates = as.Date(dates, format = "%d-%m-%Y")) %>%
  mutate(dates = if_else(event == 'End', dates+1, dates)) %>%
  arrange(dates, event) %>%
  mutate(processes = ifelse(event == 'Start', 1, -1),
         processes = cumsum(processes)) %>%
  select(-Process, -event) %>%
  complete(dates = seq.Date(min(dates), max(dates), by = '1 day')) %>%
  fill(processes)

# A tibble: 20 x 2
   dates      processes
   <date>         <dbl>
 1 2020-01-01         1
 2 2020-01-02         2
 3 2020-01-03         3
 4 2020-01-04         3
 5 2020-01-05         2
 6 2020-01-06         2
 7 2020-01-07         2
 8 2020-01-08         3
 9 2020-01-09         3
10 2020-01-10         2
11 2020-01-11         1
12 2020-01-12         1
13 2020-01-13         2
14 2020-01-14         2
15 2020-01-15         2
16 2020-01-16         2
17 2020-01-17         2
18 2020-01-18         1
19 2020-01-19         1
20 2020-01-20         0
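
As a cross-check on the counts above, here is a minimal base R sketch that counts running processes by comparing each calendar day against every start and end date directly. It assumes, as the pipeline above does, that a process still counts as running on its end date. The names `start`, `end`, `days`, `running` and `check` are made up for this illustration and do not come from the question.

```r
# Sample data from the question
df <- data.frame(
  Process = c("P001", "P002", "P003", "P004", "P005"),
  Start   = c("01-01-2020", "02-01-2020", "03-01-2020", "08-01-2020", "13-01-2020"),
  End     = c("10-01-2020", "09-01-2020", "04-01-2020", "17-01-2020", "19-01-2020")
)

start <- as.Date(df$Start, format = "%d-%m-%Y")
end   <- as.Date(df$End,   format = "%d-%m-%Y")

# Every calendar day from the earliest start to the latest end
days <- seq(min(start), max(end), by = "day")

# A process is treated as running on day d when start <= d <= end
# (end date inclusive, which is an assumption matching the dates + 1 step above)
running <- vapply(
  seq_along(days),
  function(i) sum(start <= days[i] & days[i] <= end),
  integer(1)
)

check <- data.frame(dates = days, processes = running)
print(check)
```

This covers 2020-01-01 through 2020-01-19 only, so it has no trailing row for 2020-01-20, the day after the last process ends. It is quadratic in the worst case (days times processes), so for a very large dataset the cumulative-sum approach above is likely the better fit.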
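For the second part, the desired output is something like the average-days column in the following screenshot, with an explanation (screenshot not preserved).

A tidyverse approach is preferred.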

Here is one way:

library(tidyverse)

df %>%
  #Convert to date
  mutate(across(c(Start, End), lubridate::dmy),
  #Create a sequence of dates from start to end
        Dates = map2(Start, End, seq, by = 'day')) %>%
  #Get data in long format
  unnest(Dates) %>%
  #Remove columns
  select(-Start, -End) %>%
  #For each process
  group_by(Process) %>%
  #Count number of days spent on it
  mutate(days_spent = row_number() - 1) %>%
  #For each date
  group_by(Dates) %>%
  #Count number of process running and average days
  summarise(process = n(), 
            mean_days = mean(days_spent))

This returns:

#   Dates      process mean_days
#   <date>       <int>     <dbl>
# 1 2020-01-01       1      0   
# 2 2020-01-02       2      0.5 
# 3 2020-01-03       3      1   
# 4 2020-01-04       3      2   
# 5 2020-01-05       2      3.5 
# 6 2020-01-06       2      4.5 
# 7 2020-01-07       2      5.5 
# 8 2020-01-08       3      4.33
# 9 2020-01-09       3      5.33
#10 2020-01-10       2      5.5 
#11 2020-01-11       1      3   
#12 2020-01-12       1      4   
#13 2020-01-13       2      2.5 
#14 2020-01-14       2      3.5 
#15 2020-01-15       2      4.5 
#16 2020-01-16       2      5.5 
#17 2020-01-17       2      6.5 
#18 2020-01-18       1      5   
#19 2020-01-19       1      6   
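
To make the meaning of `mean_days` concrete: on 2020-01-08, P001 has been running for 7 days, P002 for 6 days, and P004 has just started (0 days), so the average is 13 / 3 ≈ 4.33. The following base R sketch recomputes the column that way, as a sanity check rather than a replacement for the tidyverse answer. The variable names `start`, `end`, `days` and `mean_days_check` are hypothetical and chosen only for this illustration.

```r
# Start and end dates of P001..P005 from the question, already in ISO format
start <- as.Date(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-08", "2020-01-13"))
end   <- as.Date(c("2020-01-10", "2020-01-09", "2020-01-04", "2020-01-17", "2020-01-19"))

# Every calendar day from the earliest start to the latest end
days <- seq(min(start), max(end), by = "day")

# For each day: average (day - start) over the processes running on that day.
# The end date is treated as inclusive, matching the answer's seq(Start, End).
mean_days_check <- vapply(
  seq_along(days),
  function(i) {
    active <- start <= days[i] & days[i] <= end
    mean(as.numeric(days[i] - start[active], units = "days"))
  },
  numeric(1)
)

print(data.frame(Dates = days, mean_days = round(mean_days_check, 2)))
```

In this sample at least one process is running on every day in the range. With gaps in the data, `mean()` of an empty vector would return `NaN` for those days, which would need handling separately.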

Thanks, Ronak. I had just done something similar myself. Accepted and upvoted.