Group by, apply multiple conditions, and calculate duration in R or Python

Tags: python, r, pandas, loops, dplyr

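Goal

I have a dataset, df, and I would like to calculate durations based on certain conditions and produce output showing the Recipient, Start time, End time, Duration, and Length.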

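Problem

I first need to group the messages when the following conditions apply: Folder == 'out' or 'draft', Message == '', Edit == "T", and the Recipient and Length columns are the same in consecutive rows.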

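Ideally, this would give me Group A and its duration. For example, the first "chunk" of data would be labeled Group A, with a start time of 1/2/2020 1:00:01 AM and an end time of 1/2/2020 1:00:30 AM.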

Subject Re                    Length         Folder      Message   Date                   Edit     
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:01 AM     T                               
        a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:05 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:10 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:15 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:00:30 AM     T 
some
some
some
hey     a@mail.com,b@mail.com 80            draft                  1/2/2020 1:02:00 AM     T                        
hey     a@mail.com,b@mail.com 80            draft                  1/2/2020 1:02:05 AM     T                        
no
no
no
no
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:10 AM     T                        
hey     a@mail.com,b@mail.com 80             out                   1/2/2020 1:03:20 AM     T                        
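
For readers who think in Python rather than dplyr, the chunking rule described above can be sketched with the standard library alone. This is only an illustration under assumptions: the rows are hand-copied from the sample table as tuples, the "some"/"no" lines are assumed to have every other field blank, and `msg`, `filler`, and `chunk_key` are hypothetical helper names, not part of the original code.

```python
from itertools import groupby

RE = "a@mail.com,b@mail.com"

def msg(subject, folder):
    """A qualifying row: (Subject, Re, Length, Folder, Message, Edit)."""
    return (subject, RE, 80, folder, "", "T")

def filler(subject):
    """A row like the "some"/"no" lines (assumed blank in every other field)."""
    return (subject, "", None, "", "", "")

# Rows in the same order as the sample table.
rows = (
    [msg("", "out")] * 2 + [msg("hey", "out")] * 3
    + [filler("some")] * 3
    + [msg("hey", "draft")] * 2
    + [filler("no")] * 4
    + [msg("hey", "out")] * 2
)

def chunk_key(row):
    """Key qualifying rows by (Re, Length); non-qualifying rows get None."""
    _subject, re_, length, folder, message, edit = row
    qualifies = folder in ("out", "draft") and message == "" and edit == "T"
    return (re_, length) if qualifies else None

# groupby only merges *consecutive* rows with equal keys, which is the
# "same Re and Length in consecutive rows" rule; None runs are dropped.
chunks = [list(group) for key, group in groupby(rows, key=chunk_key) if key is not None]
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [5, 2, 2]
```

Subject is left out of the key on purpose, so the two blank-Subject rows at the top stay in the first chunk.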
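
In addition, I would like to "match" Group A with another chunk of data if the Subject, Re, and Length columns of its last row match the Subject, Re, and Length of the first row of the other chunk. So the second Group A would have a start time of 1/2/2020 1:02:00 AM and an end time of 1/2/2020 1:02:05 AM.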

Desired output

 Start                  End                        Duration          Group  Subject  Length
 1/2/2020 1:00:01 AM    1/2/2020 1:00:30 AM        29                A      hey       80
 1/2/2020 1:02:00 AM    1/2/2020 1:02:05 AM        5                 A      hey       80
 1/2/2020 1:03:10 AM    1/2/2020 1:03:20 AM        10                A      hey       80
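
The Duration column above appears to be End minus Start in seconds. A small standard-library check of that arithmetic is sketched below; it assumes the timestamps are month/day/year with a 12-hour clock and an AM/PM marker, and `duration_seconds` is a hypothetical helper name.

```python
from datetime import datetime

# Assumed format: month/day/year, 12-hour clock with AM/PM, as in the sample table.
FMT = "%m/%d/%Y %I:%M:%S %p"

def duration_seconds(start, end):
    """Whole seconds elapsed between two timestamp strings in FMT."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())

durations = [
    duration_seconds("1/2/2020 1:00:01 AM", "1/2/2020 1:00:30 AM"),
    duration_seconds("1/2/2020 1:02:00 AM", "1/2/2020 1:02:05 AM"),
    duration_seconds("1/2/2020 1:03:10 AM", "1/2/2020 1:03:20 AM"),
]
print(durations)  # [29, 5, 10]
```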
dput:

 structure(list(Subject = structure(c(1L, 1L, 2L, 2L, 2L, 4L, 
 4L, 4L, 2L, 2L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 1L), .Label = c("", 
 "hey", "no", "some"), class = "factor"), Recipient = structure(c(3L, 
3L, 3L, 3L, 3L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 
2L), .Label = c("", " ", "a@mail.com,b@mail.com"), class = "factor"), 
Length = c(80L, 80L, 80L, 80L, 80L, NA, NA, NA, 80L, 80L, 
NA, NA, NA, NA, 80L, 80L, NA, NA), Folder = structure(c(3L, 
3L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 
1L, 1L), .Label = c("", "draft", "out"), class = "factor"), 
Message = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA), Date = structure(c(2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 1L, 1L), .Label = c("", 
"1/2/2020 1:00", "1/2/2020 1:02", "1/2/2020 1:03"), class = "factor"), 
Edit = c(TRUE, TRUE, TRUE, TRUE, TRUE, NA, NA, NA, TRUE, 
TRUE, NA, NA, NA, NA, TRUE, TRUE, NA, NA)), class = "data.frame", row.names =   c(NA, 
  -18L))
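
I am using the code below, but I want to keep the rows where Subject is blank; I do not want them filtered out. As the first few rows of this example show, those rows should still be included in the first "chunk" even though their Subject field is blank. When I remove this part:

   filter(Subject != '') %>%

I get some errors. Should I remove another part of the code too? (Keep in mind, I still want to display Subject in the output.) Any advice is appreciated.
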
 library(dplyr)

 df1 <- df %>%
   # work with character columns rather than factors
   mutate_if(is.factor, as.character) %>%
   # treat NA as blank and trim stray whitespace
   mutate_at(c("Subject", "Recipient"), ~if_else(is.na(.), "", stringr::str_trim(.))) %>%
   filter(Subject != '') %>%
   mutate(Date = as.POSIXct(Date, format = '%m/%d/%Y %H:%M:%OS')) %>%
   # flag qualifying rows; every non-qualifying row starts a new segment
   mutate(cond = Edit & Folder %in% c('out', 'draft') & Message == '') %>%
   mutate(segment = cumsum(!cond)) %>%
   filter(cond) %>%
   group_by(Subject, Recipient, Length, segment) %>%
   summarize(Start = min(Date),
             End = max(Date),
             Duration = End - Start) %>%
   mutate(new_group = (Subject   != lag(Subject, 1, "")) *
                      (Recipient != lag(Recipient, 1, "")) *
                      (Length    != lag(Length, 1, ""))) %>%
   ungroup() %>%
   mutate(group = LETTERS[cumsum(new_group)])
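
It would be best to resubmit the dput data. I have fixed the start, but it seems to have other problems as well.

OK, thank you @R.S.

Your dput has some errors. The Subject column has only 7 entries, while the rest have 13.

I have completely updated it, @Rohit. My apologies.

If you want the rows with a blank Subject to be part of the first chunk, then don't filter them out. Also, don't group by Subject.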
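
One way to follow that advice and still show a Subject for each chunk is to take it from the chunk's last non-blank row, then label groups by (Subject, Re, Length). The plain-Python sketch below illustrates the idea on hypothetical, already-segmented chunks; picking the last non-blank Subject is an assumption about what the asker wants, and `chunk_identity` is a made-up helper name.

```python
from string import ascii_uppercase

RE = "a@mail.com,b@mail.com"

# Hypothetical pre-segmented chunks; each row is (Subject, Re, Length).
chunks = [
    [("", RE, 80), ("", RE, 80), ("hey", RE, 80), ("hey", RE, 80), ("hey", RE, 80)],
    [("hey", RE, 80), ("hey", RE, 80)],
    [("hey", RE, 80), ("hey", RE, 80)],
]

def chunk_identity(chunk):
    """(Subject, Re, Length) for a chunk, with Subject from its last non-blank row."""
    subjects = [subject for subject, _, _ in chunk if subject]
    _, re_, length = chunk[-1]
    return (subjects[-1] if subjects else "", re_, length)

labels = {}  # identity -> group letter, in order of first appearance
groups = []
for chunk in chunks:
    identity = chunk_identity(chunk)
    if identity not in labels:
        labels[identity] = ascii_uppercase[len(labels)]  # sketch: at most 26 groups
    groups.append((labels[identity], identity[0]))

print(groups)  # [('A', 'hey'), ('A', 'hey'), ('A', 'hey')]
```

Because the blank-Subject rows are kept inside the first chunk rather than filtered out, that chunk still starts at its first row, and all three chunks land in Group A as in the desired output.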