R: Expanding data by group with multiple conditions
I have data on Jenkins job pipeline executions, and I am trying to determine the average duration from development to production based on the start and end times in the data. The data is somewhat like a transactional database: an execution of a Dev pipeline is one unique record, and the execution of the same pipeline to production is another unique record (the two share only one grouping variable, the team that ran the job). Here is a sample of the data I am starting with:
  job_id startTime         endTime           env_type Team_ID
1    100 8/4/2017 17:14:00 8/4/2017 17:16:00 DEV      A
2    101 8/4/2017 17:20:00 8/4/2017 17:21:00 DEV      A
3    102 8/4/2017 17:24:00 8/4/2017 17:27:00 DEV      B
4    103 8/4/2017 17:38:00 8/4/2017 17:40:00 DEV      B
5    104 8/4/2017 17:40:00 8/4/2017 17:42:00 DEV      C
6    105 8/4/2017 17:51:00 8/4/2017 17:54:00 DEV      C
In my first attempt at widening the data, I used mutate to create new columns and copy the start and end times depending on env_type:
df %>%
  mutate(prod_job_id = ifelse(env_type == "PROD", job_id, ""),
         prod_start_time = ifelse(env_type == "PROD", startTime, ""),
         prod_end_time = ifelse(env_type == "PROD", endTime, ""),
         dev_job_id = ifelse(env_type == "DEV", job_id, ""),
         dev_start_time = ifelse(env_type == "DEV", startTime, ""),
         dev_end_time = ifelse(env_type == "DEV", endTime, ""))
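That gave me something like this (I also converted the times with as.POSIXct):
The tricky part is that a pipeline may go to dev several times before going to prod, and may even go to prod again afterwards without returning to dev, as you can see in the data frame above.
I am trying to figure out how to create a loop (or a series of dplyr/purrr commands, or some *ply function) to align the data so that I can use difftime to get the deployment duration. The end goal is to get the difftimes for all pipelines going from dev to prod and then average that number.
To get there, I am trying to transform the data into something like this (after processing, env_type will no longer be valid, but that is fine because I am ultimately only interested in the difftime):
In plain English, I think what I need is:
For each row where env_type == "PROD", find the most recent Dev timestamp and overwrite the Dev columns with that value, something like max(dev_end_time where dev_end_time is not greater than prod_start_time and dev_end_time is greater than the previous value of prod_end_time). I know the data needs to be grouped by Team_ID and put in order. I also know that I have to start by looking at the prod pipelines and then work backwards.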
I started with the following:
df %>%
  group_by(Team_ID) %>%
  arrange(Team_ID, startTime)
to group and arrange the data chronologically. But where do I go from here? My first thought was that mutate might work:
mutate(dev_start_time = ifelse((dev_end_time < prod_start_time & dev_end_time > (prod_start_time - 1)), dev_start_time, ""))
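But I don't know how to make R look at the correct row (prod_start_time-1 should be the previous prod row, not the time minus 1).
I know there must be some way to do this, but I am just not familiar with the function(s) that would accomplish it.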
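Edit:
For @LetEpsilonBeLessThanZero:
I get the gist of grouping by pipeline id, but then filtering to pipelines that have at least 1 dev and 1 prod row would remove valuable data. To demonstrate, consider the following data: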
   Team_ID pipeline_id env_type dev_start_time      dev_end_time        prod_start_time     prod_end_time
1  A       1000        DEV      2018-08-01 12:00:00 2018-08-01 13:00:00 <NA>                <NA>
2  A       1000        DEV      2018-08-02 12:00:00 2018-08-02 13:00:00 <NA>                <NA>
3  A       1000        PROD     <NA>                <NA>                2018-08-02 14:00:00 2018-08-02 15:00:00
4  A       1000        PROD     <NA>                <NA>                2018-08-02 16:00:00 2018-08-02 17:00:00
5  B       2000        DEV      2018-08-01 12:00:00 2018-08-01 13:00:00 <NA>                <NA>
6  B       2000        DEV      2018-08-02 12:00:00 2018-08-02 13:00:00 <NA>                <NA>
7  B       2000        PROD     <NA>                <NA>                2018-08-02 16:00:00 2018-08-02 17:00:00
8  C       3000        DEV      2018-08-05 12:00:00 2018-08-05 13:00:00 <NA>                <NA>
9  C       3000        DEV      2018-08-06 12:00:00 2018-08-06 13:00:00 <NA>                <NA>
10 C       3000        TEST     2018-08-06 14:00:00 2018-08-06 15:00:00 <NA>                <NA>
11 D       4000        DEV      2018-08-01 12:00:00 2018-08-01 13:00:00 <NA>                <NA>
12 D       4000        DEV      2018-08-02 12:00:00 2018-08-02 13:00:00 <NA>                <NA>
13 D       5000        PROD     <NA>                <NA>                2018-08-02 14:00:00 2018-08-02 15:00:00
14 D       5000        PROD     <NA>                <NA>                2018-08-02 16:00:00 2018-08-02 17:00:00
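Notice how team D created one distinct dev pipeline and one distinct prod pipeline. I still need a way to link them and measure the time difference, because I know the deployments are for the same application, but it cannot be done by grouping on pipeline id the way you suggested.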
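To illustrate the point, here is a minimal sketch on a cut-down, hypothetical version of the table above (the object names runs and kept are made up for the example). Grouping by pipeline_id and keeping only pipelines with at least one DEV and one PROD row retains team A but drops team D entirely, because team D's DEV and PROD runs sit under different pipeline ids.

```r
library(dplyr)

# Hypothetical, reduced version of the table above: one DEV and one PROD
# run for team A (same pipeline) and for team D (two different pipelines).
runs <- data.frame(
  Team_ID     = c("A", "A", "D", "D"),
  pipeline_id = c(1000, 1000, 4000, 5000),
  env_type    = c("DEV", "PROD", "DEV", "PROD")
)

# Keep only pipelines that contain at least one DEV and one PROD row.
kept <- runs %>%
  group_by(pipeline_id) %>%
  filter(any(env_type == "DEV"), any(env_type == "PROD")) %>%
  ungroup()

# Only team A's rows remain; both of team D's rows are filtered out.
kept
```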
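On the other hand, I know we need a new way of grouping these teams' jobs so that they are easier to correlate, and there are now plans to make that happen. But I still need to find a way to get this data as best I can with what I currently have, so I really appreciate all the help.

How about the code below? I modified one of your dummy datasets so that I could test a few different scenarios.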
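The df data frame is the unaltered dummy dataset.
df_w_implied_proj_id shows you the guts of how I determine the "proj_id" field that I created. proj_id represents a "true" pipeline.
mean_dev_df calculates the mean total difftime across the proj_ids.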
library(dplyr)

df = data.frame(startTime = as.POSIXct(c("2018-08-01 12:00:00",
                                         "2018-08-02 10:00:00",
                                         "2018-08-02 14:00:00",
                                         "2018-08-02 16:00:00",
                                         "2018-08-01 12:00:00",
                                         "2018-08-02 12:00:00",
                                         "2018-08-02 16:00:00",
                                         "2018-08-05 12:00:00",
                                         "2018-08-06 12:00:00",
                                         "2018-08-06 14:00:00",
                                         "2018-08-06 16:00:00",
                                         "2018-08-06 18:00:00",
                                         "2018-08-01 12:00:00",
                                         "2018-08-02 12:00:00",
                                         "2018-08-02 14:00:00",
                                         "2018-08-02 16:00:00"), format="%Y-%m-%d %H:%M:%S"),
                endTime = as.POSIXct(c("2018-08-01 13:00:00",
                                       "2018-08-02 13:00:00",
                                       "2018-08-02 15:00:00",
                                       "2018-08-02 18:00:00",
                                       "2018-08-01 13:00:00",
                                       "2018-08-02 13:00:00",
                                       "2018-08-02 18:00:00",
                                       "2018-08-05 13:00:00",
                                       "2018-08-06 13:00:00",
                                       "2018-08-06 15:00:00",
                                       "2018-08-06 17:00:00",
                                       "2018-08-06 19:00:00",
                                       "2018-08-01 13:00:00",
                                       "2018-08-02 13:00:00",
                                       "2018-08-02 15:00:00",
                                       "2018-08-02 21:00:00"), format="%Y-%m-%d %H:%M:%S"),
                env_type = c("DEV","DEV","PROD","PROD","DEV","DEV","PROD","DEV","DEV","PROD","DEV","PROD","DEV","DEV","PROD","PROD"),
                Team_ID = c("A","A","A","A","B","B","B","C","C","C","C","C","D","D","D","D"))

# Sort by team and time, compute each job's duration, and derive an implied
# project id from the DEV/PROD switches (every second switch starts a new project).
df_w_implied_proj_id = df %>%
  arrange(Team_ID, startTime) %>%
  mutate(diffTimeSecs = difftime(endTime, startTime, units = "secs"),
         proj_id = cumsum(env_type != lag(env_type, default = first(env_type))) %/% 2 + 1) %>%
  group_by(proj_id) %>%
  mutate(total_proj_diffTimeSecs = sum(diffTimeSecs))

# Total job duration per project, then the mean of those totals.
mean_dev_df = df_w_implied_proj_id %>%
  group_by(proj_id) %>%
  summarise(temp_totals = sum(diffTimeSecs)) %>%
  ungroup() %>%
  summarise(mean_total_proj_diffTimeSecs = mean(temp_totals))
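Note that the code above adds up the durations of the individual jobs in each project. If what is wanted is the elapsed wall-clock time from the first DEV start to the last PROD end, the same proj_id idea can be reused. The following is only a sketch under that assumption, on a small hypothetical dataset; the names toy, elapsed, elapsed_hours and mean_elapsed_hours are made up for the example.

```r
library(dplyr)

# Hypothetical toy data: team A has DEV, DEV, PROD, PROD; team B has DEV, PROD.
toy <- data.frame(
  startTime = as.POSIXct(c("2018-08-01 12:00:00", "2018-08-02 10:00:00",
                           "2018-08-02 14:00:00", "2018-08-02 16:00:00",
                           "2018-08-01 12:00:00", "2018-08-02 14:00:00"),
                         tz = "UTC"),
  endTime   = as.POSIXct(c("2018-08-01 13:00:00", "2018-08-02 13:00:00",
                           "2018-08-02 15:00:00", "2018-08-02 18:00:00",
                           "2018-08-01 13:00:00", "2018-08-02 16:00:00"),
                         tz = "UTC"),
  env_type  = c("DEV", "DEV", "PROD", "PROD", "DEV", "PROD"),
  Team_ID   = c("A", "A", "A", "A", "B", "B")
)

elapsed <- toy %>%
  arrange(Team_ID, startTime) %>%
  # Same implied project id as in the code above.
  mutate(proj_id = cumsum(env_type != lag(env_type, default = first(env_type))) %/% 2 + 1) %>%
  group_by(proj_id) %>%
  # Elapsed time from the earliest start to the latest end within each project.
  summarise(elapsed_hours = as.numeric(difftime(max(endTime), min(startTime),
                                                units = "hours")))

mean_elapsed_hours <- mean(elapsed$elapsed_hours)
```

With this toy data the two projects span 30 and 28 hours, so mean_elapsed_hours is 29.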
The main workhorse of the code is this line:
proj_id = cumsum(env_type != lag(env_type, default = first(env_type))) %/% 2 + 1
To understand it, let's look at the env_type values in the dataset:
env_type
DEV
DEV
PROD
PROD
DEV
DEV
PROD
DEV
DEV
PROD
DEV
PROD
DEV
DEV
PROD
PROD
The lag function simply returns the value from the previous row. So, as a random example, lag(c("A", "B", "C"), default = "BALLOON") returns c("BALLOON", "A", "B").
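As a small sketch of that behaviour, the snippet below runs the example and also contrasts lag() with subtracting 1 from a date-time, which is the mix-up mentioned in the question. The vector prod_start and the result names are hypothetical and exist only for the illustration.

```r
library(dplyr)

# lag() shifts the vector down by one position and fills the gap with `default`.
shifted <- lag(c("A", "B", "C"), default = "BALLOON")
shifted

# Hypothetical PROD start times, to contrast "previous row" with "time minus 1".
prod_start <- as.POSIXct(c("2018-08-02 14:00:00", "2018-08-02 16:00:00"), tz = "UTC")

one_second_earlier <- prod_start - 1    # arithmetic: each time moves back by one second
previous_row       <- lag(prod_start)   # positional: NA first, then the first start time
```

So prod_start - 1 never looks at another row, whereas lag(prod_start) does.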
So env_type != lag(env_type, default = first(env_type)) returns the following:
env_type != lag(env_type, default = first(env_type))
0 (note: there's no row before the first row, so the lag statement defaults this to the first element of env_type vector, which is "DEV". And "DEV" != "DEV" evaluates to FALSE aka 0)
0 (note: "DEV" != "DEV" evaluates to FALSE aka 0)
1 (note: "PROD" != "DEV" evaluates to TRUE aka 1)
0 (note: "PROD" != "PROD" evaluates to FALSE aka 0. By now you hopefully get the gist of what's going on.)
1
0
1
1
0
1
1
1
1
0
1
0
Then comes the cumsum(...) of that vector of 0s and 1s, which is integer-divided by 2 and increased by 1:
proj_id = cumsum(env_type != lag(env_type, default = first(env_type))) %/% 2 + 1
env_type  env_type != lag(...)  cumsum(...)  cumsum(...) %/% 2 + 1
DEV       0                     0            1
DEV       0                     0            1
PROD      1                     1            1
PROD      0                     1            1
DEV       1                     2            2
DEV       0                     2            2
PROD      1                     3            2
DEV       1                     4            3
DEV       0                     4            3
PROD      1                     5            3
DEV       1                     6            4
PROD      1                     7            4
DEV       1                     8            5
DEV       0                     8            5
PROD      1                     9            5
PROD      0                     9            5
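The three steps can be reproduced on the bare env_type vector, which makes it easy to check each intermediate result. This is just a sketch of the same expression split into pieces; the names changed and running are made up for the example.

```r
library(dplyr)

env_type <- c("DEV", "DEV", "PROD", "PROD", "DEV", "DEV", "PROD", "DEV",
              "DEV", "PROD", "DEV", "PROD", "DEV", "DEV", "PROD", "PROD")

# Step 1: TRUE wherever the environment differs from the previous row.
changed <- env_type != lag(env_type, default = first(env_type))

# Step 2: running count of environment switches seen so far.
running <- cumsum(changed)

# Step 3: every two switches (DEV -> PROD, then PROD -> DEV) start a new project.
proj_id <- running %/% 2 + 1
proj_id
# 1 1 1 1 2 2 2 3 3 3 4 4 5 5 5 5
```

One thing to keep in mind: the counter is not grouped by Team_ID, so it appears to rely on every project starting with DEV and ending with PROD. If one team's last row and the next team's first row had the same env_type, no switch would be counted at the team boundary and the ids would shift.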
df %>%
  group_by(Team_ID) %>%
  arrange(Team_ID, startTime) %>%
  mutate("Dev-Prod" = as.numeric(difftime(prod_end_time, lag(dev_start_time), units = "secs"))) %>%
  filter(!is.na(`Dev-Prod`))
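For comparison, here is a minimal sketch of the rule as stated in the question (for each PROD row, take the latest DEV end time since the previous PROD run of the same team), working directly on the long-format data. It is one possible reading, not a definitive solution. The data and the names jobs, cycle, deploys and dev_to_prod_mins are hypothetical, and it assumes that DEV jobs finish before the following PROD job starts.

```r
library(dplyr)

# Hypothetical long-format data: one row per job, as in the question.
jobs <- data.frame(
  startTime = as.POSIXct(c("2017-08-04 17:14:00", "2017-08-04 17:20:00",
                           "2017-08-04 17:24:00", "2017-08-04 17:38:00",
                           "2017-08-04 17:45:00", "2017-08-04 17:51:00"),
                         tz = "UTC"),
  endTime   = as.POSIXct(c("2017-08-04 17:16:00", "2017-08-04 17:21:00",
                           "2017-08-04 17:27:00", "2017-08-04 17:40:00",
                           "2017-08-04 17:47:00", "2017-08-04 17:54:00"),
                         tz = "UTC"),
  env_type  = c("DEV", "DEV", "PROD", "DEV", "PROD", "PROD"),
  Team_ID   = "A"
)

deploys <- jobs %>%
  arrange(Team_ID, startTime) %>%
  group_by(Team_ID) %>%
  # The counter goes up on the row after each PROD job, so every PROD row
  # shares its cycle number with the DEV rows that ran since the previous PROD.
  mutate(cycle = cumsum(lag(env_type == "PROD", default = FALSE))) %>%
  group_by(Team_ID, cycle) %>%
  # Drop cycles without a DEV run (PROD deployed again with no DEV in between)
  # and trailing DEV runs that have not reached PROD yet.
  filter(any(env_type == "DEV"), any(env_type == "PROD")) %>%
  summarise(dev_end    = max(endTime[env_type == "DEV"]),
            prod_start = startTime[env_type == "PROD"],
            .groups = "drop") %>%
  mutate(dev_to_prod_mins = as.numeric(difftime(prod_start, dev_end, units = "mins")))

mean(deploys$dev_to_prod_mins)
```

With this toy data two deployments are matched (3 and 5 minutes), and the third PROD run is skipped because no DEV run precedes it in its cycle.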