R data.table option for checks and batch linear models


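I would like to know whether there is a data.table option for running batch linear models over a dataset, with a check applied first.

I need to run a set of linear models for each unique identifier, but first I need to do a check. For each unique id and year, I need to check that there are at least 24 months of prior monthly data, and use no more than 60 months. So when I run the regression, it should include 24-60 observations of prior monthly data for each individual and year. If fewer than 24 months of data are available for a given year, that individual-year is dropped; if more than 60 months are available, only the most recent 60 are used.

Thanks to this post (thanks @akrun), I was able to build a linear model for each individual, run them, and output the beta as the sum of two betas. The problem is that this only runs the regression on the current year (12 obs), not on the prior 24-60 months.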

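Previous post:

I was hoping for a dplyr option, but it does not seem to work, and both the post and the ddply approach below take hours to run. However, I need to run this many times on various datasets in the range of 1.1 million obs.

dput example: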

   tdata <- structure(list(cusip = c(101L, 101L, 101L, 101L, 101L, 101L, 
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 
101L, 101L, 101L), date = c(19901130L, 19901031L, 19900928L, 
19900831L, 19900731L, 19900629L, 19900531L, 19900430L, 19900330L, 
19900228L, 19900131L, 19891229L, 19891130L, 19891031L, 19890929L, 
19890831L, 19890731L, 19890630L, 19890531L, 19890428L, 19890331L, 
19890228L, 19890131L, 19881230L, 19881130L, 19881031L, 19880930L, 
19880831L, 19880729L, 19880630L, 19880531L, 19880429L, 19880331L, 
19880229L, 19880129L, 19871231L, 19871130L, 19871030L, 19870930L, 
19870831L, 19870731L, 19870630L, 19870529L, 19870430L, 19870331L, 
19870227L, 19870130L, 19861231L, 19861128L, 19861031L, 19860930L, 
19860829L, 19860731L), fyear = c("1990", "1990", "1990", "1990", 
"1990", "1990", "1990", "1990", "1990", "1990", "1990", "1989", 
"1989", "1989", "1989", "1989", "1989", "1989", "1989", "1989", 
"1989", "1989", "1989", "1988", "1988", "1988", "1988", "1988", 
"1988", "1988", "1988", "1988", "1988", "1988", "1988", "1987", 
"1987", "1987", "1987", "1987", "1987", "1987", "1987", "1987", 
"1987", "1987", "1987", "1986", "1986", "1986", "1986", "1986", 
"1986"), month = c("11", "10", "09", "08", "07", "06", "05", 
"04", "03", "02", "01", "12", "11", "10", "09", "08", "07", "06", 
"05", "04", "03", "02", "01", "12", "11", "10", "09", "08", "07", 
"06", "05", "04", "03", "02", "01", "12", "11", "10", "09", "08", 
"07", "06", "05", "04", "03", "02", "01", "12", "11", "10", "09", 
"08", "07"), ret = c("0.117647", "0.030303", "-0.161017", "-0.186207", 
"-0.131737", "0.128378", "0.027778", "-0.162791", "0.131579", 
"0.178295", "-0.091549", "0.163934", "-0.089552", "0.007519", 
"0.117647", "0.155340", "0.211765", "0.024096", "0.338710", "0.377778", 
"0.071429", "-0.176471", "0.378378", "-0.026316", "-0.050000", 
"-0.047619", "-0.086957", "-0.061224", "0.088889", "-0.062500", 
"-0.040000", "-0.056604", "0.081633", "0.042553", "-0.096154", 
"0.238095", "-0.263158", "-0.393617", "-0.160714", "0.400000", 
"-0.090909", "-0.200000", "-0.098361", "-0.152778", "0.000000", 
"0.107692", "0.460674", "-0.101010", "-0.019802", "0.246914", 
"-0.052632", "0.179310", "-0.064516"), ewretd = c(0.035468, -0.057155, 
-0.080468, -0.108911, -0.025732, 0.005359, 0.045675, -0.028117, 
0.021315, 0.015434, -0.046408, -0.012375, -0.0058, -0.049934, 
0.005532, 0.018626, 0.031017, -0.007744, 0.025054, 0.029089, 
0.01806, 0.002988, 0.062124, 0.018872, -0.036484, -0.011485, 
0.016951, -0.025001, 0.000289, 0.047677, -0.017671, 0.014016, 
0.03569, 0.060265, 0.077392, 0.026065, -0.05085, -0.272248, -0.015876, 
0.014544, 0.035123, 0.021487, 0.000573, -0.017709, 0.036283, 
0.074612, 0.117565, -0.034609, -0.006263, 0.023777, -0.059071, 
0.023269, -0.073128), lagewretd = c(-0.004526, 0.035468, -0.057155, 
-0.080468, -0.108911, -0.025732, 0.005359, 0.045675, -0.028117, 
0.021315, 0.015434, -0.046408, -0.012375, -0.0058, -0.049934, 
0.005532, 0.018626, 0.031017, -0.007744, 0.025054, 0.029089, 
0.01806, 0.002988, 0.062124, 0.018872, -0.036484, -0.011485, 
0.016951, -0.025001, 0.000289, 0.047677, -0.017671, 0.014016, 
0.03569, 0.060265, 0.077392, 0.026065, -0.05085, -0.272248, -0.015876, 
0.014544, 0.035123, 0.021487, 0.000573, -0.017709, 0.036283, 
0.074612, 0.117565, -0.034609, -0.006263, 0.023777, -0.059071, 
0.023269)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-53L), .Names = c("cusip", "date", "fyear", "month", "ret", "ewretd", 
"lagewretd"))

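tdata <- tdata %>% … as.integer) %>%
  arrange(fyear, month)
## work out the cumulative number of months available for each year (for each cusip)
yearstuff <- tdata %>%
  group_by(cusip, fyear) %>%
  summarise(n = n()) %>%
  mutate(n_cum = cumsum(n))
## iterate over the rows of yearstuff (for each cusip)
models <- … %>% coef
}
})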

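I would write a separate function for all the calculations needed to get the coefficients. You can then use plyr, dplyr, or data.table. You should probably rerun the benchmark below with a larger dataset.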

library(dplyr)
library(data.table)
library(microbenchmark)

# function to get coefficients 
# (further optimization should probably focus on improving this function)
get_coefs <- function(.cusip, .fyear, .n_cum){
  if(.n_cum < 24) {
    # fewer than 24 months available up to this year: return NA coefficients
    data_frame(`(Intercept)` = NA_real_, ewretd = NA_real_, lagewretd = NA_real_)
  } else {
    # rank this cusip's rows up to .fyear from newest (rn = 1) to oldest
    my_dat <- tdata %>%
      filter(cusip == .cusip, fyear <= .fyear) %>%
      mutate(rn = row_number(desc(date)))
    # fit on the 60 most recent months only
    lm(ret ~ ewretd + lagewretd, my_dat, subset = rn < 61) %>% 
      coef %>% 
      as.list %>% 
      as_data_frame
  }
}
microbenchmark(
  models_plyr <- plyr::ddply(yearstuff, ~ cusip + fyear, function(y)
    get_coefs(y$cusip, y$fyear, y$n_cum))
  ,
  models_dplyr <- yearstuff %>% 
    group_by(cusip, fyear) %>%
    do(get_coefs(.$cusip, .$fyear, .$n_cum))
  ,
  models_dt <- as.data.table(as.data.frame(yearstuff))[, get_coefs(cusip, fyear, n_cum), by = list(cusip, fyear)]
)
##      min       lq     mean   median       uq      max neval cld
## 12.69178 13.29136 13.62600 13.45849 13.67471 16.73910   100   c
## 12.45302 12.94036 13.33589 13.14721 13.59907 14.73485   100  b 
## 10.66120 11.09856 11.43126 11.21593 11.45625 13.69591   100 a  
all.equal(models_plyr %>% data.frame, 
          models_dplyr %>% data.frame)
## [1] TRUE
all.equal(models_plyr %>% data.frame, 
          models_dt %>% data.frame) 
## [1] TRUE
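
For comparison, here is a rough sketch of how the same 24-60 month rule could be written with data.table alone. It is an untested-at-scale illustration, not a benchmarked replacement: the helper name window_coefs, its min_n and max_n arguments, and the toy data are all hypothetical, and it assumes ret is numeric and fyear is an integer year.

```r
library(data.table)

# Hypothetical toy data: one cusip with 30 consecutive months.
# 1988 has only 12 months so far (< 24), so it should come back as NA;
# 1989 (24 cumulative months) and 1990 (30) should be fitted.
set.seed(1)
n <- 30
toy <- data.table(
  cusip     = 101L,
  date      = as.integer(format(
    seq(as.Date("1988-01-01"), by = "month", length.out = n), "%Y%m%d")),
  ewretd    = rnorm(n),
  lagewretd = rnorm(n)
)
toy[, fyear := date %/% 10000L]
toy[, ret := 0.5 * ewretd + 0.2 * lagewretd + rnorm(.N, sd = 0.01)]

# Hypothetical helper: take all rows for one cusip up to and including
# .fyear, keep the max_n most recent, and fit only if at least min_n remain.
window_coefs <- function(dt, .cusip, .fyear, min_n = 24L, max_n = 60L) {
  w <- dt[cusip == .cusip & fyear <= .fyear]
  w <- head(w[order(-date)], max_n)
  if (nrow(w) < min_n) {
    return(list(intercept = NA_real_, ewretd = NA_real_, lagewretd = NA_real_))
  }
  b <- coef(lm(ret ~ ewretd + lagewretd, data = w))
  list(intercept = b[["(Intercept)"]],
       ewretd    = b[["ewretd"]],
       lagewretd = b[["lagewretd"]])
}

# One row of coefficients per (cusip, fyear) pair.
groups <- unique(toy[, .(cusip, fyear)])
models <- groups[, window_coefs(toy, cusip, fyear), by = .(cusip, fyear)]
```

The function arguments are prefixed with a dot so they cannot be confused with the cusip and fyear columns when they are used inside the data.table filter.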


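I think that for a benchmark the objects should be prepared in advance, so that as.data.table(as.data.frame(yearstuff)) is not included in the timing.

@JanGorecki: I think that depends on the rest of the OP's workflow. If converting to data.table and then back to data.frame is necessary, those steps should probably be included in the benchmark. Otherwise, they probably should not. Either way, I doubt that the conversion between data.table and data.frame is the main bottleneck in these calculations.

Then setDT or setDF should be used, which do not require a copy; this makes a significant difference on larger sets. It is not the bottleneck here, but benchmarking unequal processes is worthless. Just wrap your third call in setDF(yearstuff_dt[…]) and you get what you need without the added overhead.

@shadow Thank you! This is exactly what I wanted. Unfortunately, because the dataset is so large (1.1m obs), it still takes 2 hours with dplyr. I guess I will have to be patient.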
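
The by-reference conversion suggested in the comments can be sketched roughly as follows. The small data frame is a made-up stand-in for yearstuff, and the aggregation is only a placeholder for the real per-group call; setDT and setDF are data.table functions that change the object's class in place instead of copying it.

```r
library(data.table)

# Hypothetical stand-in for yearstuff
df <- data.frame(cusip = c(101L, 101L, 102L), n = c(12L, 12L, 6L))

# Convert to data.table by reference (no copy, unlike as.data.table)
setDT(df)

# Placeholder for the grouped computation
out <- df[, .(n_total = sum(n)), by = cusip]

# Convert the result back to a plain data.frame, again by reference
setDF(out)
```

Doing the setDT step once before the benchmark keeps the conversion cost out of the timed expression.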
