Group by, apply multiple conditions, and calculate duration in R or Python
Goal
I have a dataset, df, and I want to calculate durations based on specific conditions and display output with the Recipient, Start time, End time, Duration, and Length.
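Problem
I first need to group the messages when all of the following conditions apply:
Folder == 'out' or 'draft'
Message == ''
Edit == "T"
the Recipient and Length columns are the same on consecutive rows
Ideally, this would give me Group A and its duration. For example, the first "chunk" of data would be labeled "Group A", with a start time of 1/2/2020 1:00:01 AM and an end time of 1/2/2020 1:00:30 AM.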
Subject  Re                     Length  Folder  Message  Date                 Edit
         a@mail.com,b@mail.com  80      out              1/2/2020 1:00:01 AM  T
         a@mail.com,b@mail.com  80      out              1/2/2020 1:00:05 AM  T
hey      a@mail.com,b@mail.com  80      out              1/2/2020 1:00:10 AM  T
hey      a@mail.com,b@mail.com  80      out              1/2/2020 1:00:15 AM  T
hey      a@mail.com,b@mail.com  80      out              1/2/2020 1:00:30 AM  T
some
some
some
hey      a@mail.com,b@mail.com  80      draft            1/2/2020 1:02:00 AM  T
hey      a@mail.com,b@mail.com  80      draft            1/2/2020 1:02:05 AM  T
no
no
no
no
hey      a@mail.com,b@mail.com  80      out              1/2/2020 1:03:10 AM  T
hey      a@mail.com,b@mail.com  80      out              1/2/2020 1:03:20 AM  T
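The grouping rule described above can be sketched in plain Python (standard library only, not pandas or dplyr). This is only an illustration under assumptions: the rows are a shortened copy of the table, blank Message cells are treated as empty strings, and the helper names `qualifies` and `block_key` are made up for this example.

```python
from itertools import groupby

# Rows as (subject, recipient, length, folder, message, edit).
# A shortened copy of the table above; blank cells are "" or None (an assumption).
REC = "a@mail.com,b@mail.com"
rows = [
    ("",     REC, 80,   "out",   "", True),
    ("",     REC, 80,   "out",   "", True),
    ("hey",  REC, 80,   "out",   "", True),
    ("some", "",  None, "",      "", None),
    ("hey",  REC, 80,   "draft", "", True),
    ("hey",  REC, 80,   "draft", "", True),
    ("no",   "",  None, "",      "", None),
    ("hey",  REC, 80,   "out",   "", True),
]

def qualifies(row):
    """True when the row meets the Folder, Message, and Edit conditions."""
    _subject, _recipient, _length, folder, message, edit = row
    return folder in ("out", "draft") and message == "" and edit is True

def block_key(row):
    """Key that must stay equal on consecutive rows for them to share a block."""
    if not qualifies(row):
        return None  # non-qualifying rows act as separators
    return (row[1], row[2])  # (recipient, length); Subject is deliberately left out

# groupby merges only adjacent rows with equal keys, which matches the
# "consecutively the same" requirement. Separator runs (key None) are dropped.
blocks = [list(g) for key, g in groupby(rows, key=block_key) if key is not None]
block_sizes = [len(b) for b in blocks]
print(block_sizes)
```

Because Subject is not part of the key, the two rows with a blank Subject stay in the first block together with the first "hey" row.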
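Additionally, I want to "match" Group A with another chunk of data if the Subject, Re, and Length in its last row match the Subject, Re, and Length in the first row of the other chunk. The second Group A chunk would therefore start at 1/2/2020 1:02:00 AM and end at 1/2/2020 1:02:05 AM.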
Desired output
Start                End                  Duration  Group  Subject  Length
1/2/2020 1:00:01 AM  1/2/2020 1:00:30 AM  29        A      hey      80
1/2/2020 1:02:00 AM  1/2/2020 1:02:05 AM  5         A      hey      80
1/2/2020 1:03:10 AM  1/2/2020 1:03:20 AM  10        A      hey      80
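As a rough check of the numbers in the desired output, here is a minimal Python sketch that starts from blocks that have already been identified. It computes Start, End, and Duration in seconds, and gives blocks with the same Subject, Recipient, and Length the same letter. The `blocks` layout and the label assignment are assumptions for illustration, not the asker's method.

```python
from datetime import datetime
from string import ascii_uppercase

# 12-hour clock with an AM/PM marker, as in the Date column of the question.
FMT = "%m/%d/%Y %I:%M:%S %p"
REC = "a@mail.com,b@mail.com"

# One entry per block: (subject, recipient, length, timestamps in that block).
blocks = [
    ("hey", REC, 80, ["1/2/2020 1:00:01 AM", "1/2/2020 1:00:05 AM",
                      "1/2/2020 1:00:10 AM", "1/2/2020 1:00:15 AM",
                      "1/2/2020 1:00:30 AM"]),
    ("hey", REC, 80, ["1/2/2020 1:02:00 AM", "1/2/2020 1:02:05 AM"]),
    ("hey", REC, 80, ["1/2/2020 1:03:10 AM", "1/2/2020 1:03:20 AM"]),
]

labels = {}   # (subject, recipient, length) -> group letter
summary = []  # (start, end, duration_seconds, group, subject, length)
for subject, recipient, length, stamps in blocks:
    times = [datetime.strptime(s, FMT) for s in stamps]
    start, end = min(times), max(times)
    key = (subject, recipient, length)
    if key not in labels:
        # Next unused letter; this simple scheme only covers up to 26 groups.
        labels[key] = ascii_uppercase[len(labels)]
    duration = int((end - start).total_seconds())
    summary.append((start, end, duration, labels[key], subject, length))

for row in summary:
    print(row[2], row[3], row[4], row[5])
```

Note that the format string uses `%I` with `%p`. A 24-hour `%H` without `%p` would ignore the AM/PM marker.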
dput:
structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 4L,
4L, 4L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("",
"hey", "no", "some"), class = "factor"), Recipient = structure(c(3L,
3L, 3L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L,
2L), .Label = c("", " ", "a@mail.com,b@mail.com"), class = "factor"),
Length = c(80L, 80L, 80L, 80L, 80L, NA, NA, NA, 80L, 80L,
NA, NA, NA, NA, 80L, 80L, NA, NA), Folder = structure(c(3L,
3L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L,
1L, 1L), .Label = c("", "draft", "out"), class = "factor"),
Message = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA), Date = structure(c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 1L, 1L), .Label = c("",
"1/2/2020 1:00", "1/2/2020 1:02", "1/2/2020 1:03"), class = "factor"),
Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, NA, NA, NA, TRUE,
TRUE, NA, NA, NA, NA, TRUE, TRUE, NA, NA)), class = "data.frame", row.names = c(NA,
-18L))
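I am using the code below, but I want to keep the rows where Subject is blank; I do not want them filtered out. As the first rows of this example show, a row should still be included in the first "chunk" even though its Subject field is empty. When I remove this part: filter(Subject != '') %>%, I get some errors. Should I remove another part of the code too? (Keep in mind that I still want to display the Subject in the output.) Any advice is appreciated.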
df1 <- df %>%
  mutate_if(is.factor, as.character) %>%
  mutate_at(c("Subject", "Recipient"), ~if_else(is.na(.), "", stringr::str_trim(.))) %>%
  filter(Subject != '') %>%
  mutate(Date = as.POSIXct(Date, format = '%m/%d/%Y %H:%M:%OS')) %>%
  mutate(cond = Edit & Folder %in% c('out', 'draft') & Message == '') %>%
  mutate(segment = cumsum(!cond)) %>%
  filter(cond) %>%
  group_by(Subject, Recipient, Length, segment) %>%
  summarize(Start = min(Date),
            End = max(Date),
            Duration = End - Start) %>%
  mutate(new_group = (Subject != lag(Subject, 1, "")) *
                     (Recipient != lag(Recipient, 1, "")) *
                     (Length != lag(Length, 1, ""))) %>%
  ungroup() %>%
  mutate(group = LETTERS[cumsum(new_group)])
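It would be best to resubmit the dput data. I have fixed the start of it, but it seems to have other problems as well.
OK, thanks @R.S.
Your dput has some errors: the Subject column has only 7 entries, while the rest have 13.
I have updated it completely, @Rohit. My apologies.
If you want the rows with a blank Subject to be part of the first chunk, then don't filter them out. Also, don't group by Subject.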
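The effect of leaving Subject out of the grouping can be illustrated with a small standard-library Python sketch. It assumes the Subject shown for a chunk is its last non-empty Subject, which is one reading of the question and not something the answer states. The helper `display_subject` is hypothetical.

```python
from itertools import groupby

REC = "a@mail.com,b@mail.com"

# (subject, recipient, length) for the first chunk of the example data.
rows = [
    ("",    REC, 80),
    ("",    REC, 80),
    ("hey", REC, 80),
    ("hey", REC, 80),
    ("hey", REC, 80),
]

# With Subject in the key, the chunk splits where Subject changes from "" to "hey".
with_subject = [list(g) for _, g in groupby(rows, key=lambda r: (r[0], r[1], r[2]))]

# With only Recipient and Length in the key, the blank-Subject rows stay in the chunk.
without_subject = [list(g) for _, g in groupby(rows, key=lambda r: (r[1], r[2]))]

def display_subject(block):
    """Subject to report for a chunk: the last non-empty one, or "" if there is none."""
    subjects = [r[0] for r in block if r[0]]
    return subjects[-1] if subjects else ""

shown = [display_subject(b) for b in without_subject]
print(len(with_subject), len(without_subject), shown)
```

Subject can then still be displayed in the output as a summarized value per chunk, without being a grouping column.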