R Algorithm Efficiency - Time Difference Loop


I have a dataset named visitsPerDay that looks like this, but with 405890 rows and 10406 unique customer IDs:

> CUST_ID   Date
> 1         2013-09-19
> 1         2013-10-03
> 1         2013-10-08
> 1         2013-10-12
> 1         2013-10-20
> 1         2013-10-25
> 1         2013-11-01
> 1         2013-11-02
> 1         2013-11-08
> 1         2013-11-15
> 1         2013-11-23
> 1         2013-12-02
> 1         2013-12-04
> 1         2013-12-09
> 2         2013-09-16
> 2         2013-09-17
> 2         2013-09-18

What I want to do is create a new variable: the lagged difference between visit dates. Here is the code I am currently using:

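visitsPerDay <- visitsPerDay[order(visitsPerDay$CUST_ID), ]
visitsPerDay$MBTV <- NA_real_  # pre-allocate the new column
cust_id <- 0
for (i in 1:nrow(visitsPerDay)) {
  if (visitsPerDay$CUST_ID[i] != cust_id) {
    # first row of a new customer: there is no previous visit to compare with
    cust_id <- visitsPerDay$CUST_ID[i]
    visitsPerDay$MBTV[i] <- NA
  } else {
    # days since this customer's previous visit
    visitsPerDay$MBTV[i] <- as.numeric(visitsPerDay$Date[i] - visitsPerDay$Date[i-1])
  }
}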

Here is an approach using tapply:

# transform 'Date' to values of class 'Date' (maybe already done)
visitsPerDay$Date <- as.Date(visitsPerDay$Date) 

visitsPerDay <- transform(visitsPerDay, 
                          MBTV = unlist(tapply(Date, 
                                               CUST_ID, 
                                               FUN = function(x) c(NA,diff(x)))))
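
As a variation on the same idea, base R's ave() also applies a function within each group and returns the results in the original row order. The following is a minimal sketch, not part of the original answer; the data frame visits is a hypothetical six-row sample modelled on the data in the question, and the dates within each customer are assumed to be in chronological order.

```r
# Hypothetical sample data modelled on the first rows of the question
visits <- data.frame(
  CUST_ID = c(1, 1, 1, 2, 2, 2),
  Date = as.Date(c("2013-09-19", "2013-10-03", "2013-10-08",
                   "2013-09-16", "2013-09-17", "2013-09-18"))
)

# ave() runs FUN separately on each CUST_ID group and puts every result
# back in the row it came from; the first visit of a customer gets NA
visits$MBTV <- ave(as.numeric(visits$Date), visits$CUST_ID,
                   FUN = function(x) c(NA, diff(x)))
print(visits)
```

Because ave() writes each value back to its source row, this variant does not depend on the rows being grouped by customer beforehand.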

Edit: a faster approach:

# transform 'Date' to values of class 'Date' (maybe already done)
visitsPerDay$Date <- as.Date(visitsPerDay$Date) 

visitsPerDay$MBTV <- c(NA_integer_, 
                       "is.na<-"(diff(visitsPerDay$Date), 
                                 !duplicated(visitsPerDay$CUST_ID)[-1]))
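
The one-liner above is compact, so here is the same idea unrolled into separate steps. This is a sketch under the assumption that rows are grouped by customer and sorted by date within each customer; visits, gaps and first_of_customer are hypothetical names used only for illustration.

```r
# Hypothetical sample data modelled on the first rows of the question
visits <- data.frame(
  CUST_ID = c(1, 1, 1, 2, 2, 2),
  Date = as.Date(c("2013-09-19", "2013-10-03", "2013-10-08",
                   "2013-09-16", "2013-09-17", "2013-09-18"))
)

# Differences between every pair of adjacent rows (length n - 1)
gaps <- as.numeric(diff(visits$Date))

# TRUE on the first row of each customer
first_of_customer <- !duplicated(visits$CUST_ID)

# A difference that crosses from one customer to the next is meaningless
gaps[first_of_customer[-1]] <- NA

# The very first row has no predecessor either
visits$MBTV <- c(NA, gaps)
print(visits)
```

The whole computation is a handful of vectorized calls over the full column, with no per-row or per-group function call.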
Since you are sorting by customer id, you can speed this up by using a bucket sort instead of a general-purpose sort. Note that the bottleneck of the algorithm (in big-O terms) is the sort, which is O(n log n).

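The pseudocode for this approach, listed after the comments below, assumes the data is already sorted by date (the same assumption the code suggested above requires).

Here is a data.table solution. It may be faster and more readable: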

library(data.table)
dt = data.table(visitsPerDay)

dt[, MBTV := c(NA, diff(as.Date(Date))), by = CUST_ID]
dt
#    CUST_ID       Date    MBTV
# 1:       1 2013-09-19 NA days
# 2:       1 2013-10-03 14 days
# 3:       1 2013-10-08  5 days
# 4:       1 2013-10-12  4 days
# 5:       1 2013-10-20  8 days
# 6:       1 2013-10-25  5 days
# 7:       1 2013-11-01  7 days
# 8:       1 2013-11-02  1 days
# 9:       1 2013-11-08  6 days
#10:       1 2013-11-15  7 days
#11:       1 2013-11-23  8 days
#12:       1 2013-12-02  9 days
#13:       1 2013-12-04  2 days
#14:       1 2013-12-09  5 days
#15:       2 2013-09-16 NA days
#16:       2 2013-09-17  1 days
#17:       2 2013-09-18  1 days

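Unless order() is a stable sort and the original data is already sorted by date, the algorithm is wrong.

I see your point, amit. To make sure the dates are in chronological order, I should also sort by date. However, that is not the crux of my current problem. I am running the algorithm listed above right now, and it has already passed the 5-minute mark.

I suggest you read up on loops in R; they are usually best avoided.

Saw it: absolutely perfect. It ran in 2.58 seconds. Well done.

@brittenb 2.6 seconds is rather slow. If you implement it with the data.table package, you should be able to do this faster.

@brittenb I added a (hopefully) faster approach.

@Roland 2.6 seconds may not be the fastest possible, which is what the title of this post implies I am looking for, so I fully understand the comment. However, I just needed something faster than what I was using. There is a clear trade-off between speed/efficiency and readability/adaptability, because other people need to read and understand the code. The initial solution Sven provided is clean and elegant, and its execution time is minimal. Given the limited number of times this code will be run, any further tuning to shave off 1 second of computation time seems counterproductive.
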
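
The first comments note that ordering by CUST_ID alone is only safe if the dates are already chronological within each customer. A minimal sketch of sorting on both keys follows; the data frame visits is a hypothetical, deliberately shuffled sample and is not taken from the original data.

```r
# Hypothetical sample rows, deliberately out of order
visits <- data.frame(
  CUST_ID = c(2, 1, 1, 2, 1, 2),
  Date = as.Date(c("2013-09-18", "2013-10-08", "2013-09-19",
                   "2013-09-16", "2013-10-03", "2013-09-17"))
)

# order() accepts several keys: rows are sorted by CUST_ID first,
# and ties within a customer are broken by Date
visits <- visits[order(visits$CUST_ID, visits$Date), ]
rownames(visits) <- NULL
print(visits)
```

With both keys in the call, the result no longer depends on how the input rows happened to be arranged.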
// Pseudocode for the bucket-sort approach
// bucket sort: collect the dates of each customer in its own bucket
customers <- new array of size 10406
for each (cust_id, date):
    if customers[cust_id] == nil:
        customers[cust_id] <- []
    customers[cust_id].append(date)
// find differences between consecutive dates within each bucket
for each list in customers:
    i <- list.iter()
    prev <- i.next()
    while (i.hasNext()):
        curr <- i.next()
        output diff(prev, curr)
        prev <- curr
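
The pseudocode above can be approximated in R as follows. This is a rough sketch, not the answer author's implementation: split() stands in for the bucket step (it groups values by a key and is not literally a bucket sort), and visits, buckets and gaps are hypothetical names. As in the pseudocode, the dates are assumed to be sorted already.

```r
# Hypothetical sample data modelled on the first rows of the question
visits <- data.frame(
  CUST_ID = c(1, 1, 1, 2, 2, 2),
  Date = as.Date(c("2013-09-19", "2013-10-03", "2013-10-08",
                   "2013-09-16", "2013-09-17", "2013-09-18"))
)

# "Bucket" step: one vector of dates per customer id
buckets <- split(visits$Date, visits$CUST_ID)

# Difference step: gaps in days between consecutive dates in each bucket
gaps <- lapply(buckets, function(d) as.numeric(diff(d)))
print(gaps)
```

Each element of gaps is one shorter than its bucket, because a customer's first visit has no predecessor.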