R: Combine observations based on multiple conditions
I am currently writing my master's thesis, but I am having some trouble merging rows under multiple conditions. I have illustrated my problem and the desired result below. I hope you can help me :) Here is an example of what my dataset looks like:
df <- data.frame(
userID = c(1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
sessionID = c(1, 2, 3, 4, 5, 1, 2, 1, 2, 3, 4),
date = as.Date(c("2019-03-15", "2019-03-18", "2019-03-19", "2019-03-21","2019-03-30", "2019-04-05",
"2019-06-06", "2019-11-22", "2019-12-22", "2019-12-24", "2020-01-15"),
format = "%Y-%m-%d"),
purchase=c(0,1,0,0,0,0,0,0,0,1,0))
So far I have calculated the difference in days between sessions:

df %>%
  mutate(diff = date - lag(date))
However, if the difference between two rows is less than 10 days, I want them to be merged. I want the 10-day window to reset every time there is activity (a new sessionID). Also, when purchase is 1 the window should stop, and a new 10-day window should start with the next sessionID.
I have tried many dplyr functions, filter and summarise, but did not get the expected result. Also, I really don't know how to include the purchase condition.
My expected result looks like this:
df2 <- data.frame(
userID = c(1, 1, 2, 2, 3, 3, 3),
sessionID = c("1 + 2", "3 + 4 + 5", "1", "2", "1", "2 + 3", "4"),
date.start = as.Date(c("2019-03-15","2019-03-19", "2019-04-05",
"2019-06-06", "2019-11-22", "2019-12-22", "2020-01-15"),
format = "%Y-%m-%d"),
date.end = as.Date(c("2019-03-18", "2019-03-30", "2019-04-05", "2019-06-06",
"2019-11-22", "2019-12-24", "2020-01-15"), format = "%Y-%m-%d"),
purchase=c(1,0,0,0,0,1,0))
Grouped by 'userID', create a new group based on the occurrence of a 1 in 'purchase' by taking the cumulative sum of the lag of 'purchase'; then create another grouping from the difference of adjacent 'date' values, i.e. take the cumulative sum of whether the difference is greater than or equal to 10 days; then summarise by paste()ing the 'sessionID' values together, taking the first and last elements of 'date', and whether 'purchase' contains a 1, as the summary columns.
library(dplyr)
library(stringr)
df %>%
  group_by(userID) %>%
  group_by(grp = cumsum(lag(purchase, default = first(purchase))), .add = TRUE) %>%
  group_by(cat = cumsum(difftime(date, lag(date, default = first(date)),
                                 units = 'days') >= 10), .add = TRUE) %>%
  summarise(sessionID = str_c(sessionID, collapse = ' + '),
            date.start = first(date), date.end = last(date),
            purchase = +(any(purchase == 1)), .groups = 'drop') %>%
  select(-grp, -cat)
-output
# A tibble: 7 x 5
userID sessionID date.start date.end purchase
<dbl> <chr> <date> <date> <int>
1 1 1 + 2 2019-03-15 2019-03-18 1
2 1 3 + 4 + 5 2019-03-19 2019-03-30 0
3 2 1 2019-04-05 2019-04-05 0
4 2 2 2019-06-06 2019-06-06 0
5 3 1 2019-11-22 2019-11-22 0
6 3 2 + 3 2019-12-22 2019-12-24 1
7 3 4 2020-01-15 2020-01-15 0
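To make the two helper groups above easier to follow, here is a minimal sketch (with assumed toy values, not the question's data) of how the `grp` and `cat` counters each advance:

```r
library(dplyr)

# Toy vectors, assumed for illustration only
purchase <- c(0, 1, 0, 0)
date <- as.Date(c("2019-03-15", "2019-03-18", "2019-03-19", "2019-03-30"))

# lag() shifts the purchase flag down one row, so the cumulative sum only
# increments on the row *after* a purchase -- the purchase row itself
# stays in the old group.
grp <- cumsum(lag(purchase, default = first(purchase)))
grp
#> [1] 0 0 1 1

# The 10-day check: a gap of >= 10 days to the previous row opens a new group.
cat <- cumsum(difftime(date, lag(date, default = first(date)),
                       units = "days") >= 10)
cat
#> [1] 0 0 0 1
```

Grouping by both counters at once (plus userID) yields exactly the row blocks that get pasted together in the summarise step.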
Dedicated to my dear friend @akrun
Here is just another way of achieving the final output; it is not as elegant and concise as dear @akrun's suggestion. In fact, I spent several hours on this problem, and it was very important to me to see it through. As always, dear @akrun has been an inspiration to me. I hope it is useful to you:
library(dplyr)
library(purrr)
df %>%
mutate(cum = cumsum(purchase == 1),
cum = ifelse(cum - lag(cum, default = 0) == 1, lag(cum), cum),
Days = as.numeric(date - lag(date, default = first(date)))) %>%
group_by(cum) %>%
mutate(diff = ifelse(Days < 10, 0, 1)) %>%
ungroup() %>%
mutate(diff = cumsum(diff),
start = date,
end = date) %>%
mutate(across(sessionID, as.character)) %>%
group_split(userID, cum, diff) %>%
map_dfr(~ add_row(.x, userID = .x$userID[1],
sessionID = paste(.x$sessionID, collapse = "+"),
start = .x$date[1], end = .x$date[length(.x$date)])) %>%
filter(if_any(date:diff, ~ is.na(.x))) %>%
select(!date:diff)
# A tibble: 7 x 4
userID sessionID start end
<dbl> <chr> <date> <date>
1 1 1+2 2019-03-15 2019-03-18
2 1 3+4+5 2019-03-19 2019-03-30
3 2 1 2019-04-05 2019-04-05
4 2 2 2019-06-06 2019-06-06
5 3 1 2019-11-22 2019-11-22
6 3 2+3 2019-12-22 2019-12-24
7 3 4 2020-01-15 2020-01-15
Another tidyverse strategy using accumulate2
library(dplyr)
library(purrr)
df %>%
  group_by(userID, grp = cumsum(sessionID == 1)) %>%
  mutate(diff = as.numeric(date - lag(date, default = first(date)))) %>%
  group_by(grp2 = unlist(accumulate2(diff, purchase[-n()],
                                     ~ if (..2 > 10 | ..3 == 1) ..1 + 1 else ..1)),
           .add = TRUE) %>%
  summarise(sessionID = paste(sessionID, collapse = ' + '),
            start.date = first(date),
            end.date = last(date), .groups = 'drop') %>%
  select(!starts_with('grp'))
#> # A tibble: 7 x 4
#>   userID sessionID start.date end.date
#>    <dbl> <chr>     <date>     <date>
#> 1 1 1 + 2 2019-03-15 2019-03-18
#> 2 1 3 + 4 + 5 2019-03-19 2019-03-30
#> 3 2 1 2019-04-05 2019-04-05
#> 4 2 2 2019-06-06 2019-06-06
#> 5 3 1 2019-11-22 2019-11-22
#> 6 3 2 + 3 2019-12-22 2019-12-24
#> 7 3 4 2020-01-15 2020-01-15
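Since accumulate2() is the least familiar piece here, a minimal sketch (toy values assumed) of how it carries the running group id: ..1 is the accumulated value, ..2 the current element of the first vector (the day gap), and ..3 the current element of the second vector (the previous row's purchase flag):

```r
library(purrr)

# Toy vectors, assumed for illustration only
diff     <- c(0, 3, 12, 2)  # day gaps; diff[1] seeds the accumulator
purchase <- c(0, 1, 0, 0)   # purchase flags; the last one is dropped

# Start a new group (..1 + 1) when the gap exceeds 10 days (..2 > 10)
# or the previous row was a purchase (..3 == 1); otherwise keep ..1.
grp2 <- unlist(accumulate2(diff, purchase[-length(purchase)],
                           ~ if (..2 > 10 | ..3 == 1) ..1 + 1 else ..1))
grp2
#> [1] 0 0 1 1
```

Note that both conditions fire on the third row (gap of 12 after a purchase), but the counter still increments only once, which is exactly the desired behaviour.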
Created on 2021-06-10 (v2.0.0)
Thank you very much! This example works perfectly; however, I cannot reproduce it with my original dataset. Does it matter if userID and sessionID are defined as characters? — @Andre It doesn't matter, since I am only pasteing the sessionID values. — @Andre You may have some NAs in your original data, which causes problems in the cumsum. — The problem with reproducing the example lies in its last four rows (so, userID 3). In the example it works; with my original dataset I get the following: userID: 3; sessionID: 4 + 3; start.date: 2020-01-15; end.date: 2019-12-14; purchase: 1. And another row is userID: 3; sessionID: 2 + 1; start.date: 2019-12-22; end.date: 2019-11-22; purchase: 0. I fixed it by adding an arrange(date) step. Thanks for your help and the quick response! Thank you very much :)
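The fix described in the last comment can be sketched like this (toy dates assumed): with rows out of chronological order, the lag()-based gap turns negative and the grouping breaks; sorting first restores it:

```r
library(dplyr)

# Out-of-order dates, assumed toy data
d <- data.frame(date = as.Date(c("2020-01-15", "2019-11-22", "2019-12-22")))

# Without sorting, the gap to the "previous" row can be negative:
unsorted <- d %>%
  mutate(gap = as.numeric(date - lag(date, default = first(date))))
unsorted$gap
#> [1]   0 -54  30

# arrange(date) before computing the gaps gives the grouping sane input:
sorted <- d %>%
  arrange(date) %>%
  mutate(gap = as.numeric(date - lag(date, default = first(date))))
sorted$gap
#> [1]  0 30 24
```

In the full pipelines above, the arrange() call would go right after `df %>%` (and before any group_by), so every user's sessions are in chronological order before the windows are built.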