R: Count the number of distinct categories within a specified time range

Tags: r, data.table, dplyr, distinct-values

Here is some dummy data:

  user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple

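For each user_id, I want to count the number of distinct category values within a specified time window (e.g. the past 7 or 14 days), including the current order.

The expected result looks like this: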
 user_id       date category distinct_7 distinct_14
      27 2016-01-01    apple          1           1
      27 2016-01-03    apple          1           1
      27 2016-01-05     pear          2           2
      27 2016-01-07     plum          3           3
      27 2016-01-10    apple          3           3
      27 2016-01-14     pear          3           3
      27 2016-01-16     plum          3           3
      11 2016-01-01    apple          1           1
      11 2016-01-03     pear          2           2
      11 2016-01-05     pear          2           2
      11 2016-01-07     pear          2           2
      11 2016-01-10    apple          2           2
      11 2016-01-14    apple          2           2
      11 2016-01-16    apple          1           2

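Similar questions have been posted, but none of them addresses counting cumulative unique values within a specified time period. Many thanks for your help.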
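Here are two data.table solutions, one using two nested lapply calls and the other using non-equi joins.

The first is a rather clumsy data.table solution, but it reproduces the expected answer, and it works for an arbitrary number of time frames. (The concise tidyverse solution that @alistaire suggested in the comments could be modified to do the same.)

It uses two nested lapply calls. The outer one loops over the time frames and the inner one loops over the dates. The temporary result is joined with the original data and then reshaped from long to wide format, so that we end up with a separate column for each time frame.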
library(data.table)
tmp <- rbindlist(
  lapply(c(7L, 14L), 
         function(ldays) rbindlist(
           lapply(unique(dt$date), 
                  function(ldate) {
                    dt[between(date, ldate - ldays, ldate), 
                       .(distinct = sprintf("distinct_%02i", ldays), 
                         date = ldate, 
                         N = uniqueN(category)), 
                       by = .(user_id)]
                  })
         )
  )
)
dcast(tmp[dt, on=c("user_id", "date")], 
      ... ~ distinct, value.var = "N")[order(-user_id, date, category)] 
#          date user_id category distinct_07 distinct_14
# 1: 2016-01-01      27    apple           1           1
# 2: 2016-01-03      27    apple           1           1
# 3: 2016-01-05      27     pear           2           2
# 4: 2016-01-07      27     plum           3           3
# 5: 2016-01-10      27    apple           3           3
# 6: 2016-01-14      27     pear           3           3
# 7: 2016-01-16      27     plum           3           3
# 8: 2016-01-01      11    apple           1           1
# 9: 2016-01-03      11     pear           2           2
#10: 2016-01-05      11     pear           2           2
#11: 2016-01-07      11     pear           2           2
#12: 2016-01-10      11    apple           2           2
#13: 2016-01-14      11    apple           2           2
#14: 2016-01-16      11    apple           1           2
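
The non-equi join variant mentioned above is not reproduced in the text, so the following is only a minimal sketch of how such a join might look, not the original author's code. It assumes the same dummy data as the question; the helper name count_distinct and the column names win_start and win_end are hypothetical. For each row, a window [date - days, date] is built and joined back onto the data, and uniqueN() counts the categories that fall into each window.

```r
library(data.table)

# Same dummy data as in the question
dt <- data.table(
  user_id  = rep(c(27L, 11L), each = 7),
  date     = as.IDate(rep(c("2016-01-01", "2016-01-03", "2016-01-05",
                            "2016-01-07", "2016-01-10", "2016-01-14",
                            "2016-01-16"), times = 2)),
  category = c("apple", "apple", "pear", "plum", "apple", "pear", "plum",
               "apple", "pear", "pear", "pear", "apple", "apple", "apple")
)

# Hypothetical helper (an assumption, not part of the original answer):
# one window per row, joined back onto the data with a non-equi join.
count_distinct <- function(dt, days) {
  windows <- dt[, .(user_id, win_start = date - days, win_end = date)]
  dt[windows,
     on = .(user_id, date >= win_start, date <= win_end),
     .(n = uniqueN(category)),   # distinct categories inside each window
     by = .EACHI]$n              # one result per row of `windows`, in order
}

dt[, distinct_7  := count_distinct(dt, 7L)]
dt[, distinct_14 := count_distinct(dt, 14L)]
dt
```

Because by = .EACHI returns one row per window in the order of the windows table, the counts line up with the rows of dt and can be assigned directly. Further time frames only need another call with a different days value.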

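Data: the dt object used above is built by the fread() call listed further down this page.

In the tidyverse, you can use map_int to iterate over a set of values and simplify the result to an integer vector, à la sapply or vapply. Count the distinct occurrences with n_distinct (equivalent to length(unique(...))) on a subset of category, selected with between or a pair of comparisons, where the lower bound is the current date minus the appropriate number of days: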
library(tidyverse)

df %>% group_by(user_id) %>% 
    mutate(distinct_7  = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])), 
           distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))

## Source: local data frame [14 x 5]
## Groups: user_id [2]
## 
##    user_id       date category distinct_7 distinct_14
##      <int>     <date>   <fctr>      <int>       <int>
## 1       27 2016-01-01    apple          1           1
## 2       27 2016-01-03    apple          1           1
## 3       27 2016-01-05     pear          2           2
## 4       27 2016-01-07     plum          3           3
## 5       27 2016-01-10    apple          3           3
## 6       27 2016-01-14     pear          3           3
## 7       27 2016-01-16     plum          3           3
## 8       11 2016-01-01    apple          1           1
## 9       11 2016-01-03     pear          2           2
## 10      11 2016-01-05     pear          2           2
## 11      11 2016-01-07     pear          2           2
## 12      11 2016-01-10    apple          2           2
## 13      11 2016-01-14    apple          2           2
## 14      11 2016-01-16    apple          1           2
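
Since the explanation compares map_int to vapply and n_distinct to length(unique(...)), the same idea can also be sketched in base R without any packages. This is only an illustration under the assumption that the data matches the question's dummy data; the helper name count_distinct_in_window is hypothetical.

```r
# Same dummy data as in the question, built without external packages
df <- data.frame(
  user_id  = rep(c(27L, 11L), each = 7),
  date     = as.Date(rep(c("2016-01-01", "2016-01-03", "2016-01-05",
                           "2016-01-07", "2016-01-10", "2016-01-14",
                           "2016-01-16"), times = 2)),
  category = c("apple", "apple", "pear", "plum", "apple", "pear", "plum",
               "apple", "pear", "pear", "pear", "apple", "apple", "apple"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for every date, count the distinct values whose
# date lies in the closed window [date - days, date].
count_distinct_in_window <- function(dates, values, days) {
  vapply(seq_along(dates), function(i) {
    in_window <- dates >= dates[i] - days & dates <= dates[i]
    length(unique(values[in_window]))
  }, integer(1))
}

# Apply the helper separately to each user's rows
df$distinct_7  <- NA_integer_
df$distinct_14 <- NA_integer_
for (id in unique(df$user_id)) {
  rows <- df$user_id == id
  df$distinct_7[rows]  <- count_distinct_in_window(df$date[rows], df$category[rows], 7)
  df$distinct_14[rows] <- count_distinct_in_window(df$date[rows], df$category[rows], 14)
}
df
```

Like the map_int version, this compares every date against every other date of the same user, so the work grows quadratically with the number of rows per user. That is fine for small data but may be slow for long histories.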

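I recommend the runner package. It lets you apply any R function over running windows with the runner function. The code listed at the end of this page produces the desired output, i.e. the past 7 days plus the current one and the past 14 days plus the current one (windows of 8 and 15 days in total).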
For more information, see the package and function documentation.
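Comments:

Why does it start with 0?

That was a typo on my part. It is corrected now, thanks!

Are you sure the values in distinct_7 are correct? Looking at 2016-01-10, shouldn't it start as a new group? Also, if you look at the distinct_7 values for user_id 11, it starts from 0.

For distinct_7, between 2016-01-03 and 2016-01-10, user 27 has 3 categories in total while user 11 has 2. Does it make sense now?

You can iterate over date, i.e. library(tidyverse); df %>% group_by(user_id) %>% mutate(distinct_7 = map_int(date, ~n_distinct(category[date >= .x - 7 & date ... [comment truncated]
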
Data for the data.table solution:

dt <- fread("user_id       date category
       27 2016-01-01    apple
       27 2016-01-03    apple
       27 2016-01-05     pear
       27 2016-01-07     plum
       27 2016-01-10    apple
       27 2016-01-14     pear
       27 2016-01-16     plum
       11 2016-01-01    apple
       11 2016-01-03     pear
       11 2016-01-05     pear
       11 2016-01-07     pear
       11 2016-01-10    apple
       11 2016-01-14    apple
       11 2016-01-16    apple")
dt[, date := as.IDate(date)]

Data for the tidyverse and runner solutions:

df <- read.table(
  text = "  user_id       date category
       27 2016-01-01    apple
  27 2016-01-03    apple
  27 2016-01-05     pear
  27 2016-01-07     plum
  27 2016-01-10    apple
  27 2016-01-14     pear
  27 2016-01-16     plum
  11 2016-01-01    apple
  11 2016-01-03     pear
  11 2016-01-05     pear
  11 2016-01-07     pear
  11 2016-01-10    apple
  11 2016-01-14    apple
  11 2016-01-16    apple", header = TRUE, colClasses = c("integer", "Date", "character"))



Code for the runner solution:

library(dplyr)
library(runner)
df %>%
  group_by(user_id) %>%
  mutate(distinct_7  = runner(category, k = 7 + 1, idx = date, 
                              f = function(x) length(unique(x))),
         distinct_14 = runner(category, k = 14 + 1, idx = date, 
                              f = function(x) length(unique(x))))