R: Creating a sequence with data.table

Tags: r, data.table, aggregate-functions

I have a data table of the form

id | pet   | name  
2011-01-01 | "dog" | "a"  
2011-01-02 | "dog" | "b"  
2011-01-03 | "cat" | "c"  
2011-01-04 | "dog" | "a"  
2011-01-05 | "dog" | "some"   
2011-01-06 | "cat" | "thing"

I want to perform an aggregation that concatenates all the dog names that appear before each cat, for example

id | pet   | name   | prior  
2011-01-01 | "dog" | "a"     |  
2011-01-02 | "dog" | "b"     |  
2011-01-03 | "cat" | "c"     |  "a b"  
2011-01-04 | "dog" | "a"     |  
2011-01-05 | "dog" | "some"  |  
2011-01-06 | "cat" | "thing" | "a some"  
Try

Data
df1

Here is another option

indx <- setDT(DT)[, list(.I[.N], paste(name[-.N], collapse = ' ')), 
                    by = list(c(0L, cumsum(pet == "cat")[-nrow(DT)]))]
DT[indx$V1, prior := indx$V2]
DT
#            id pet  name  prior
# 1: 2011-01-01 dog     a     NA
# 2: 2011-01-02 dog     b     NA
# 3: 2011-01-03 cat     c    a b
# 4: 2011-01-04 dog     a     NA
# 5: 2011-01-05 dog  some     NA
# 6: 2011-01-06 cat thing a some
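
To see why the `by` expression in the option above puts each cat in the same group as the dogs before it, the intermediate vectors can be inspected on the sample data. This is only a minimal sketch: the construction of `DT` and the names `cats_so_far`, `grp`, `row` are assumptions made for illustration and are not part of the original answer.

```r
library(data.table)

# Sample data rebuilt from the question (assumed column types)
DT <- data.table(
  id   = as.Date("2011-01-01") + 0:5,
  pet  = c("dog", "dog", "cat", "dog", "dog", "cat"),
  name = c("a", "b", "c", "a", "some", "thing")
)

# Running count of cats seen so far: 0 0 1 1 1 2
cats_so_far <- cumsum(DT$pet == "cat")

# Lag the count by one row so that each cat closes its own group: 0 0 0 1 1 1
grp <- c(0L, cats_so_far[-nrow(DT)])

# Per group: .I[.N] is the row number of the last row (the cat),
# and name[-.N] is every name in the group except that last one
indx <- DT[, list(row = .I[.N], prior = paste(name[-.N], collapse = " ")),
           by = list(grp = grp)]

# Write the concatenated names only into the cat rows
DT[indx$row, prior := indx$prior]
print(DT)
```

Note that if the table ends with dog rows that are not followed by a cat, the last of those dogs would receive a `prior` value under this grouping, so the trailing group may need separate handling.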

I ran each solution on my dataset and compared the run times with rbenchmark.

I can't share the dataset, but here is some basic information:

dim(event_source_causal_parts)
[1] 311127      4

Code used for the comparison

require(rbenchmark)
benchmark({
  event_source_causal_parts <- augmented_data_no_software[, list(PROD_ID, Source, Event_Date, Causal_Part_Number)] 
  setDT(event_source_causal_parts)[, prior := paste(Causal_Part_Number[-.N], collapse = ' '), .(group=cumsum(c(0,diff(Source == "Warranty")) < 0))][Source != 'Warranty', prior := '']
 })

benchmark({
  event_source_causal_parts <- augmented_data_no_software[, list(PROD_ID, Source, Event_Date, Causal_Part_Number)] 
  setDT(event_source_causal_parts)[, prior := paste(Causal_Part_Number[-.N], collapse = ' '), .(group=cumsum(shift(Source, fill="Warranty") == "Warranty"))][Source != 'Warranty', prior := ''] 
  })


benchmark({
  event_source_causal_parts <- augmented_data_no_software[, list(PROD_ID, Source, Event_Date, Causal_Part_Number)] 
  indx <- setDT(event_source_causal_parts)[, list(.I[.N], paste(Causal_Part_Number[-.N], collapse = " ")),
                                       by = list(c(0L, cumsum(Source == "Warranty")[-nrow(event_source_causal_parts)]))]
})
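
The first two benchmark calls above use two different grouping expressions, one based on `diff` and one based on `shift`. A small sketch on the pet data from the question shows what each expression produces; `make_pets` is a hypothetical helper introduced only for this illustration, and the column names are taken from the question rather than from the benchmarked dataset.

```r
library(data.table)

# Hypothetical helper: returns a fresh copy of the sample data from the question
make_pets <- function() {
  data.table(
    pet  = c("dog", "dog", "cat", "dog", "dog", "cat"),
    name = c("a", "b", "c", "a", "some", "thing")
  )
}

# Grouping 1: a new group starts where pet == "cat" flips from TRUE to FALSE
a <- make_pets()
a[, prior := paste(name[-.N], collapse = " "),
  by = .(group = cumsum(c(0, diff(pet == "cat")) < 0))][pet != "cat", prior := ""]

# Grouping 2: a new group starts on every row whose previous row is a cat
b <- make_pets()
b[, prior := paste(name[-.N], collapse = " "),
  by = .(group = cumsum(shift(pet, fill = "cat") == "cat"))][pet != "cat", prior := ""]

print(a)
print(b)
```

On this sample both variants give the same `prior` column, with empty strings in the dog rows. They are not equivalent in general: when two cat rows are adjacent, the `diff` version keeps them in one group while the `shift` version starts a new group after each cat.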

My environment:

R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rbenchmark_1.0.0 stringr_0.6.2    data.table_1.9.5 vimcom_1.2-6    

loaded via a namespace (and not attached):
[1] chron_2.3-45    grid_3.1.2      lattice_0.20-30 tools_3.1.2     zoo_1.7-11 

R was using the Intel MKL math library.

Based on these results, I think @akrun's second solution is the fastest.

I ran the tests again, this time with data.table recompiled with -O3 and R updated to 3.2.0. The results are quite different:

  replications elapsed relative user.self sys.self user.child sys.child
1          100   21.22        1     20.73     0.48         NA        NA

  replications elapsed relative user.self sys.self user.child sys.child
1          100   11.31        1     10.39     0.92         NA        NA

  replications elapsed relative user.self sys.self user.child sys.child
1          100   35.77        1     35.53     0.25         NA        NA

So with the newer R and -O3, the best solution got faster, but the second-best solution got much slower.
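
Because the original dataset cannot be shared, a comparison of this kind can only be approximated on synthetic data. The sketch below is one way to do that under stated assumptions: the table `synth`, its size `n`, the value distribution and the wrapper functions `by_diff` and `by_shift` are all hypothetical, so the timings it produces say nothing about the numbers reported above. Passing both expressions to a single `benchmark()` call makes the `relative` column compare them directly, whereas separate calls always report a relative value of 1.

```r
library(data.table)

set.seed(1)
n <- 1e4  # hypothetical size, much smaller than the 311127 rows in the post

# Synthetic stand-in for the real table (assumed value distribution)
synth <- data.table(
  Source = sample(c("Warranty", "Other"), n, replace = TRUE, prob = c(0.2, 0.8)),
  Causal_Part_Number = sample(sprintf("P%03d", 1:50), n, replace = TRUE)
)

# Each wrapper works on a copy so repeated runs start from the same input
by_diff <- function(d) {
  d <- copy(d)
  d[, prior := paste(Causal_Part_Number[-.N], collapse = " "),
    by = .(group = cumsum(c(0, diff(Source == "Warranty")) < 0))][
      Source != "Warranty", prior := ""]
  d
}

by_shift <- function(d) {
  d <- copy(d)
  d[, prior := paste(Causal_Part_Number[-.N], collapse = " "),
    by = .(group = cumsum(shift(Source, fill = "Warranty") == "Warranty"))][
      Source != "Warranty", prior := ""]
  d
}

# Run the comparison only if rbenchmark is installed
if (requireNamespace("rbenchmark", quietly = TRUE)) {
  print(rbenchmark::benchmark(
    diff_grouping  = by_diff(synth),
    shift_grouping = by_shift(synth),
    replications   = 10,
    columns        = c("test", "replications", "elapsed", "relative")
  ))
}
```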

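
What have you tried?

Please don't "extend the question" here. If you have a new question, you should probably post it as a new question, preferably with a small, easily reproducible example.

Sorry, it does work! Thanks. I'll run some benchmarks on my problem to compare the solutions.

@Anton, out of curiosity, did my answer not work either? Or did you just not bother to give any feedback on the free help you got here?

@DavidArenburg I think the OP intends to compare the solutions in both posts (not sure, though), and if your second solution wins, I'll get 7.5 points :)

@DavidArenburg I will test your solution too.
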
Results of the initial run under R 3.1.2, one table per benchmark call in the order listed:

  replications elapsed relative user.self sys.self user.child sys.child
1          100   12.91        1     12.76     0.05         NA        NA

  replications elapsed relative user.self sys.self user.child sys.child
1          100    12.7        1     12.66     0.05         NA        NA

  replications elapsed relative user.self sys.self user.child sys.child
1          100   61.97        1     61.65        0         NA        NA