R: Calculating the time difference between specific events
Tags: r, date, dplyr, lubridate

I have the following dataset:

df = data.frame(cbind(user_id = c(rep(1, 4), rep(2,4)),
                  complete_order = c(rep(c(1,0,0,1), 2)),
                  order_date = c('2015-01-28', '2015-01-31', '2015-02-08', '2015-02-23', '2015-01-25', '2015-01-28', '2015-02-06', '2015-02-21')))  

library(lubridate)
df$order_date = as_date(df$order_date)

user_id complete_order order_date
      1              1 2015-01-28
      1              0 2015-01-31
      1              0 2015-02-08
      1              1 2015-02-23
      2              1 2015-01-25
      2              0 2015-01-28
      2              0 2015-02-06
      2              1 2015-02-21
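I'm trying to calculate, for each user, the difference in days measured from completed orders only. The desired result looks like this: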

user_id complete_order order_date complete_order_time_diff
<fctr>         <fctr>     <date>              <time>
   1              1    2015-01-28             NA days
   1              0    2015-01-31              3 days
   1              0    2015-02-08             11 days
   1              1    2015-02-23             26 days
   2              1    2015-01-25             NA days
   2              0    2015-01-28              3 days
   2              0    2015-02-06             12 days
   2              1    2015-02-21             27 days
My attempt returns an error:

Error: incompatible size (3), expecting 4 (the group size) or 1

Any help with this would be great, thank you!

I think you could add a filter() call instead of subsetting with order_date[complete_order == 1], and make sure order_date (and the other variables) have the correct data type by adding stringsAsFactors = F to data.frame():
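A minimal sketch of how this suggestion might look, since the answer's own code is not shown here. It assumes the data frame is built without cbind() (cbind() would coerce every column to character) and that dplyr is installed; the name completed is made up for illustration. Note that filtering keeps only the completed orders, so cancelled rows get no time difference this way.

```r
library(dplyr)

# Rebuild the example data with proper column types (no cbind, no factors)
df <- data.frame(
  user_id = c(rep(1, 4), rep(2, 4)),
  complete_order = rep(c(1, 0, 0, 1), 2),
  order_date = c("2015-01-28", "2015-01-31", "2015-02-08", "2015-02-23",
                 "2015-01-25", "2015-01-28", "2015-02-06", "2015-02-21"),
  stringsAsFactors = FALSE
)
df$order_date <- as.Date(df$order_date)

# Keep completed orders only, then take the gap to the previous one per user
completed <- df %>%
  filter(complete_order == 1) %>%
  group_by(user_id) %>%
  mutate(complete_order_time_diff = order_date - lag(order_date)) %>%
  ungroup()

as.numeric(completed$complete_order_time_diff)
# NA 26 NA 27
```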

Try this:

library(dplyr)

df %>% group_by(user_id, complete_order) %>% 
   mutate(c1 = order_date - lag(order_date)) %>% 
   group_by(user_id) %>% mutate(c2 = order_date - lag(order_date)) %>% ungroup %>% 
   mutate(complete_order_time_diff = ifelse(complete_order==0, c2, c1)) %>% 
   select(-c(c1, c2))
Update: for multiple cancelled orders

 df %>% mutate(c3=cumsum( complete_order != "0")) %>% group_by(user_id, complete_order) %>% 
  mutate(c1 = order_date - lag(order_date)) %>% 
  group_by(user_id) %>% mutate(c2 = order_date - lag(order_date)) %>% 
  mutate(c2=as.numeric(c2)) %>% group_by(user_id, c3) %>% 
  mutate(c2=cumsum(ifelse(complete_order==1, 0, c2))) %>% ungroup %>% 
  mutate(complete_order_time_diff = ifelse(complete_order==0, c2, c1)) %>% 
  select(-c(c1, c2, c3))
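Logic

c3 is an id that increments by 1 every time there is a completed order (i.e. complete_order is not 0).

c1 calculates the difference in days per user_id (but the result is wrong for non-completed orders).

c2 fixes this inconsistency in c1 for the non-completed orders.

Hope this clarifies things.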
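To make the role of c3 more concrete, here is a small base R sketch for a single user. The vectors complete_order and days_since_prev are invented for illustration and are not the question's data; ave() stands in for the grouped mutate() used in the answer.

```r
# Hypothetical single-user example: 1 = completed, 0 = cancelled
complete_order  <- c(1, 0, 0, 1, 0, 1)
# Hypothetical gap in days between each order and the order just before it
days_since_prev <- c(0, 3, 8, 15, 2, 5)

# Run id: increases by 1 at every completed order
c3 <- cumsum(complete_order != 0)
c3
# 1 1 1 2 2 3

# Within each run, accumulate the gaps of the cancelled orders so that each
# cancelled order is measured from the completed order that opened the run
days_since_completed <- ave(ifelse(complete_order == 1, 0, days_since_prev),
                            c3, FUN = cumsum)
days_since_completed
# 0 3 11 0 2 0
```

Completed orders show 0 here because, as in the answer, their own difference comes from a separate column (c1).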


I suggest you use the combination of group_by() and mutate(cumsum()) to get a better understanding of the results of having multiple grouping variables.

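You seem to be looking for the distance of each order from the last completed one. Given a binary vector x, c(NA, cummax(x * seq_along(x))[-length(x)]) gives, for each element, the index of the last "1" seen before it. Subtracting from each element of "order_date" the "order_date" at that index then gives the desired output. E.g.: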

set.seed(1453); x = sample(0:1, 10, TRUE)
set.seed(1821); y = sample(5, 10, TRUE)
cbind(x, y, 
      last_x = c(NA, cummax(x * seq_along(x))[-length(x)]), 
      y_diff = y - y[c(NA, cummax(x * seq_along(x))[-length(x)])])
#      x y last_x y_diff
# [1,] 1 3     NA     NA
# [2,] 0 3      1      0
# [3,] 1 5      1      2
# [4,] 0 1      3     -4
# [5,] 0 3      3     -2
# [6,] 1 5      3      0
# [7,] 1 1      6     -4
# [8,] 0 3      7      2
# [9,] 0 4      7      3
#[10,] 1 5      7      4
On your data, first format df for convenience:

df$order_date = as.Date(df$order_date)
df$complete_order = df$complete_order == "1"  # lose the 'factor'
Then apply the above approach after grouping:

library(dplyr)
df %>% group_by(user_id) %>% 
   mutate(time_diff = order_date - 
order_date[c(NA, cummax(complete_order * seq_along(complete_order))[-length(complete_order)])])
Or, to possibly avoid the grouping, try an operation that accounts for the indices where "user_id" changes (assuming "user_id" is sorted):

# save variables to vectors and keep a "logical" of when "id" changes
id = df$user_id
id_change = c(TRUE, id[-1] != id[-length(id)])

compl = df$complete_order
dord = df$order_date

# accounting for changes in "id", locate last completed order
i = c(NA, cummax((compl | id_change) * seq_along(compl))[-length(compl)])
is.na(i) = id_change

dord - dord[i]
#Time differences in days
#[1] NA  3 11 26 NA  3 12 27


Hi Joshua, thanks, but I need to keep the cancelled orders (complete_order == 0) in the dataset and calculate the time difference for them too.

Thanks, dimitris, unfortunately this doesn't work for me: as suggested in the expected result, I also need to calculate the time difference for cancelled orders (complete_order == 0)!

I tried this before posting here; this solution also calculates the time difference between completed and cancelled orders. If we define time_diff = x - y, then x can be any type of order, but y always has to be a completed order. Hope that makes sense.

This looks promising now, let me test it on a more complex dataset before accepting the answer, thanks!

Also, I think the last rows of your desired result should be 9 days rather than 6 days, and 24 days; otherwise I'm still missing something.

OK, I can see that your solution works in most cases, but not when a customer has several cancelled orders in a row. I'll edit the example dataset so that it covers such a scenario. Thanks again for your help!

It seems you could try applying, by "user_id", the function ff = function(complete, date) date - date[c(NA, cummax(complete * seq_along(complete))[-length(complete)])], where "complete_order" and "order_date" are passed as "complete" and "date" respectively.

Thanks, @alexis_laz, this is definitely the right direction. However, when I test your solution (with all the data preprocessing steps) I get NAs for cancelled orders (complete_order == 0). Any idea how to solve this?

@KasiaKulma: Do you mean the "df" in the example or your actual data? If the latter, could you provide/dput an example where it returns NA? Do both the "dplyr" approach and the last approach return NA?

Hi, it turned out the grouping in my test set was wrong; after correcting it, the solution works great, thanks! Also, thank you for the clear explanation of how the solution works, very useful.

@KasiaKulma: Glad you found it useful. Good luck!

I tried to apply the same principle to cancelled orders, but the results are inconsistent, even though the position of the last cancelled order is computed correctly. Would you mind having a chat about this?
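For reference, a rough base R sketch of how the ff helper mentioned in the comments could be applied per "user_id". The split()/unsplit() wiring and the rebuilt df (with a logical complete_order and Date order_date) are assumptions made for illustration, not part of the original answers.

```r
# Example data in the "convenient" format: logical flag and Date column
df <- data.frame(
  user_id = c(rep(1, 4), rep(2, 4)),
  complete_order = rep(c(TRUE, FALSE, FALSE, TRUE), 2),
  order_date = as.Date(c("2015-01-28", "2015-01-31", "2015-02-08", "2015-02-23",
                         "2015-01-25", "2015-01-28", "2015-02-06", "2015-02-21"))
)

# Difference between each date and the date of the last completed order before it
ff <- function(complete, date) {
  date - date[c(NA, cummax(complete * seq_along(complete))[-length(complete)])]
}

# Apply ff separately within each user and put the pieces back in row order
parts <- lapply(split(df, df$user_id), function(d) {
  as.numeric(ff(d$complete_order, d$order_date))
})
df$time_diff <- unsplit(parts, df$user_id)

df$time_diff
# NA 3 11 26 NA 3 12 27
```

Because the split happens per user, this version does not rely on "user_id" being sorted.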