R: calculate a sum based on date differences, grouping by two or more columns
Suppose I have a dataset similar to the following:

| id | Date | Buyer | diff | Amount | ConsecutiveSum |
|------|:---------:|------:|------|--------|----------------|
| 334 | 6/15/2018 | Simon | NA | 1948 | 0 |
| 334 | 6/20/2018 | Simon | 5 | 4290 | 6238 |
| 334 | 8/17/2018 | Simon | 58 | 4260 | 8550 |
| 334 | 8/20/2018 | Simon | 3 | 79 | 4339 |
| 334 | 8/7/2018 | Wang | NA | 2145 | 0 |
| 334 | 8/9/2018 | Wang | 2 | 4192 | 6337 |
| 5006 | 3/4/2019 | Wang | NA | 1700 | 0 |
| 5006 | 3/7/2019 | Wang | 3 | 335 | 2035 |
| 5006 | 5/5/2019 | Wang | 59 | 4400 | 4735 |
| 5006 | 5/9/2019 | Wang | 4 | 2700 | 7100 |
| 5006 | 5/14/2019 | Wang | 5 | 4355 | 7055 |
| 5006 | 5/17/2019 | Wang | 3 | 3100 | 7455 |

Expected output:

| id | Date | Buyer | diff | Amount | ConsecutiveSum |
|------|:---------:|------:|------|--------|----------------|
| 334 | 6/15/2018 | Simon | NA | 1948 | 0 |
| 334 | 6/20/2018 | Simon | 5 | 4290 | 6238 |
| 334 | 8/7/2018 | Wang | NA | 2145 | 0 |
| 334 | 8/9/2018 | Wang | 2 | 4192 | 6337 |
| 5006 | 5/5/2019 | Wang | 59 | 4400 | 4735 |
| 5006 | 5/9/2019 | Wang | 4 | 2700 | 7100 |
| 5006 | 5/14/2019 | Wang | 5 | 4355 | 7055 |
| 5006 | 5/17/2019 | Wang | 3 | 3100 | 7455 |
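
I need the transactions where, for the same Buyer and the same id, the sum of Amount over consecutive rows is >= 5000 and the two rows are no more than 5 days apart, i.e. diff <= 5 and ConsecutiveSum >= 5000. The transactions made on 8/17/2018 and 8/20/2018 are also within 5 days of each other, but their ConsecutiveSum is not greater than or equal to 5000, so I do not want them in the output.

In addition, the transactions made by Wang on 5/5/2019 and 5/9/2019 are within 5 days of each other, but based on this post I can only get the 5/9/2019 transaction and not the 5/5/2019 one.

How can I restructure the code to include such transactions?

Here is the code:
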
    library(dplyr)

    # Toy data: every column starts as character and is converted below
    df <- data.frame(
      id = c("334", "334", "334", "334", "334", "334",
             "5006", "5006", "5006", "5006", "5006", "5006"),
      Date = c("6/15/2018", "6/20/2018", "8/17/2018", "8/20/2018", "8/7/2018", "8/9/2018",
               "3/4/2019", "3/7/2019", "5/5/2019", "5/9/2019", "5/14/2019", "5/17/2019"),
      Buyer = c("Simon", "Simon", "Simon", "Simon", "Chang", "Chang",
                "Chang", "Chang", "Chang", "Chang", "Chang", "Chang"),
      diff = c("NA", "5", "58", "3", "NA", "2", "NA", "3", "59", "4", "5", "3"),
      Amount = c("1948", "4290", "4260", "79", "2145", "4192",
                 "1700", "335", "4400", "2700", "4355", "3100"),
      ConsecutiveSum = c("0", "6238", "8550", "4339", "0", "6337",
                         "0", "2035", "4735", "7100", "7055", "7455"),
      stringsAsFactors = FALSE
    )
    df$Date <- as.Date(df$Date, "%m/%d/%Y")
    df$Amount <- as.numeric(df$Amount)
    df$diff <- as.numeric(df$diff)  # the string "NA" becomes NA
    df$ConsecutiveSum <- as.numeric(df$ConsecutiveSum)

    # Recompute the two-row sum per Buyer/id, then filter
    df_sum <- df %>%
      group_by(Buyer, id) %>%
      mutate(rank = dense_rank(Date)) %>%
      mutate(ConsecutiveSum = ifelse(is.na(lag(Amount)), 0, Amount + lag(Amount, default = 0))) %>%
      filter(diff <= 5 & ConsecutiveSum >= 5000 | ConsecutiveSum == 0 & lead(ConsecutiveSum) >= 5000)
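
A quick way to see why the 5/5/2019 row is dropped is to evaluate the two branches of the filter separately. The second branch only rescues rows whose ConsecutiveSum is 0, that is, the first row of a group, whereas the 5/5/2019 row has a ConsecutiveSum of 4735. The sketch below does this in base R for the id 5006 rows only; the names `g`, `lead1`, `branch1` and `branch2` are illustrative and not part of the original code.

```r
# Rows for id 5006 / Buyer Chang, copied from the data above
g <- data.frame(
  Date = as.Date(c("3/4/2019", "3/7/2019", "5/5/2019",
                   "5/9/2019", "5/14/2019", "5/17/2019"), "%m/%d/%Y"),
  diff = c(NA, 3, 59, 4, 5, 3),
  ConsecutiveSum = c(0, 2035, 4735, 7100, 7055, 7455)
)

# Minimal stand-in for dplyr::lead(): shift a vector up by one, pad with NA
lead1 <- function(x) c(x[-1], NA)

# First branch: the row itself closes a qualifying pair
branch1 <- g$diff <= 5 & g$ConsecutiveSum >= 5000
# Second branch: the row opens a qualifying pair, but only when its own sum is 0
branch2 <- g$ConsecutiveSum == 0 & lead1(g$ConsecutiveSum) >= 5000

# Side-by-side view; the 2019-05-05 row is FALSE in both columns
data.frame(Date = g$Date, branch1, branch2)
```

Relaxing the second branch to `lead(diff) <= 5 & lead(ConsecutiveSum) >= 5000` would be one way to let that row through, since it then keeps any row whose successor closes a qualifying pair.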
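
Here is one possibility using the helper variables keep1 and keep2. First run all the lines from the example up to the `df$ConsecutiveSum` line.

Following your requirements, I came up with a very simple approach that follows your logic and gives the expected result. Note that row 5 of the expected result does not come from the toy data.frame provided.
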
    library(data.table)
    setDT(df)

    # Day difference between consecutive dates within each id/Buyer group
    df[, lagdays := c(NA, diff(Date)), by = .(id, Buyer)]

    # Keep rows where lagdays is at most 5 or NA (first row of an id/Buyer group)
    # AND ConsecutiveSum is at least 5000 or 0 (again the first row of a group).
    # `[, lagdays := NULL]` drops the helper column afterwards.
    df[(lagdays <= 5 | is.na(lagdays)) & (ConsecutiveSum == 0 | ConsecutiveSum >= 5000), ][, lagdays := NULL][]

         id       Date Buyer diff Amount ConsecutiveSum
    1:  334 2018-06-15 Simon   NA   1948              0
    2:  334 2018-06-20 Simon    5   4290           6238
    3:  334 2018-08-07 Chang   NA   2145              0
    4:  334 2018-08-09 Chang    2   4192           6337
    5: 5006 2019-03-04 Chang   NA   1700              0
    6: 5006 2019-05-09 Chang    4   2700           7100
    7: 5006 2019-05-14 Chang    5   4355           7055
    8: 5006 2019-05-17 Chang    3   3100           7455
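
The filter above keeps the first row of every id/Buyer group, which is why 2019-03-04 appears in the result while 2019-05-05 does not. If the goal is to keep both rows of each qualifying pair, one option is to flag a row when it, or the row right after it in the same group, has lagdays <= 5 and ConsecutiveSum >= 5000. This is only a sketch under that reading of the question, not part of the answer above; the column names `hit` and `next_hit` and the result name `res` are made up for illustration, and rows are assumed to be in date order within each group.

```r
library(data.table)

df <- data.table(
  id = rep(c("334", "5006"), each = 6),
  Date = as.Date(c("6/15/2018", "6/20/2018", "8/17/2018", "8/20/2018",
                   "8/7/2018", "8/9/2018", "3/4/2019", "3/7/2019",
                   "5/5/2019", "5/9/2019", "5/14/2019", "5/17/2019"),
                 "%m/%d/%Y"),
  Buyer = rep(c("Simon", "Chang"), times = c(4, 8)),
  Amount = c(1948, 4290, 4260, 79, 2145, 4192,
             1700, 335, 4400, 2700, 4355, 3100)
)

# Day gap to the previous row within each id/Buyer group (NA on the first row)
df[, lagdays := as.numeric(Date) - shift(as.numeric(Date)), by = .(id, Buyer)]
# Sum of this row's and the previous row's Amount, 0 on the first row of a group
df[, ConsecutiveSum := ifelse(is.na(lagdays), 0, Amount + shift(Amount, fill = 0)),
   by = .(id, Buyer)]

# hit: this row closes a qualifying pair (first rows of a group are never hits)
df[, hit := !is.na(lagdays) & lagdays <= 5 & ConsecutiveSum >= 5000]
# next_hit: the following row in the same group closes a qualifying pair
df[, next_hit := shift(hit, type = "lead", fill = FALSE), by = .(id, Buyer)]

# Keep both members of every qualifying pair, then drop the helper columns
res <- df[hit | next_hit][, c("hit", "next_hit", "lagdays") := NULL][]
res
```

On the toy data this keeps eight rows, including 2019-05-05, and drops 2019-03-04 because the pair it opens only sums to 2035.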
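
Do you need the rank column in df_sum?

@akrun I think so, because it will be ranked differently each time a buyer orders an item.

In your expected output, row 5 has a ConsecutiveSum of 8592. Where does it come from?
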
        id       Date Buyer diff Amount ConsecutiveSum
    1  334 2018-06-15 Simon    0   1948              0
    2  334 2018-06-20 Simon    5   4290           6238
    3  334 2018-08-07 Chang    0   2145              0
    4  334 2018-08-09 Chang    2   4192           6337
    5 5006 2019-05-05 Chang   59   4400           4735
    6 5006 2019-05-09 Chang    4   2700           7100
    7 5006 2019-05-14 Chang    5   4355           7055
    8 5006 2019-05-17 Chang    3   3100           7455