R: making a for loop based on checks more efficient


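I have written a for loop that performs some checks and returns 0 or 1 depending on the result. However, running it on a large dataset takes a very long time (I left it running overnight and it was still going in the morning). Any ideas on how to make this more efficient with dplyr or other tools? Thanks.

Here is some test data: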

tdata <- structure(list(cusip = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2), fyear = c("1971", "1971", "1971", "1971", 
"1971", "1971", "1971", "1971", "1971", "1971", "1971", "1971", 
"1972", "1972", "1972", "1972", "1972", "1972", "1972", "1972", 
"1972", "1972", "1972", "1972", "1972", "1973", "1973", "1973", 
"1973", "1973", "1973", "1973", "1973", "1973", "1973", "1973", 
"1973", "1974", "1974", "1974", "1974", "1974", "1974", "1974", 
"1974", "1974", "1974", "1974", "1974", "1975", "1975", "1975", 
"1975", "1975", "1975", "1975", "1975", "1975", "1975", "1975"
), datadate = c(19711231L, 19710129L, 19710226L, 19710331L, 19710430L, 
19710528L, 19710630L, 19710730L, 19710831L, 19710930L, 19711029L, 
19711130L, 19721231L, 19720131L, 19720229L, 19720330L, 19720428L, 
19720531L, 19720630L, 19720731L, 19720831L, 19720929L, 19721031L, 
19721130L, 19721229L, 19731231L, 19730131L, 19730228L, 19730330L, 
19730430L, 19730531L, 19730629L, 19730731L, 19730831L, 19730928L, 
19731031L, 19731130L, 19741231L, 19740131L, 19740228L, 19740329L, 
19740430L, 19740531L, 19740628L, 19740731L, 19740830L, 19740930L, 
19741031L, 19741129L, 19751231L, 19750131L, 19750228L, 19750331L, 
19750430L, 19750530L, 19750630L, 19750731L, 19750829L, 19750930L, 
19751031L), month = c("12", "01", "02", "03", "04", "05", "06", 
"07", "08", "09", "10", "11", "12", "01", "02", "03", "04", "05", 
"06", "07", "08", "09", "10", "11", "12", "12", "01", "02", "03", 
"04", "05", "06", "07", "08", "09", "10", "11", "12", "01", "02", 
"03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "01", 
"02", "03", "04", "05", "06", "07", "08", "09", "10")), .Names = c("cusip", 
"fyear", "datadate", "month"), row.names = c(NA, -60L), class = c("tbl_df", 
"tbl", "data.frame"))

A solution using grouping and a lagged cumulative sum:

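library(dplyr)

tdata %>%
  group_by(cusip, fyear) %>%
  # one row per cusip-year; 60 is the number of rows in the test data
  summarise(number = n(), share = n() / 60)  %>% 
  # summarise() drops the fyear grouping, so lag() and cumsum() run within each cusip
  mutate( cum_y = lag(cumsum(share)),   # cumulative share up to the previous year
          cum_y4 = lag(cum_y, 4),       # the same running total four years earlier
          last4 = ifelse(is.na(cum_y4), cum_y, cum_y - cum_y4),  # share over the previous 4 years
          check = as.numeric( last4 >= 0.4 )
          ) %>%
  select(cusip, fyear, last4, check)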
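Explanation:

  • Group by cusip and fyear, count the observations, and compute each year's share of all observations
  • cum_y is the lagged cumulative sum of share
  • cum_y4 is cum_y lagged by 4 years
  • last4 is the difference between cum_y and cum_y4 (or cum_y itself while cum_y4 is still NA)
  • check is 1 when last4 is at least 0.4 and 0 otherwise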
Update: to join the result back to the variables in the original data:

    ... %>%
      left_join(tdata, by = c("cusip", "fyear"))
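
The lagged cumulative sum described in the explanation can be checked on a plain vector without dplyr. The following is only a sketch: prev_window_sum is a hypothetical helper name (it does not appear in the answer), and the yearly counts 12, 13, 12, 12, 11 are the per-fyear row counts of the test data above.

```r
# Hypothetical helper: for each position, the sum of the previous k values
# (the current value is excluded), computed from a cumulative sum, not a loop.
prev_window_sum <- function(x, k = 4) {
  n <- length(x)
  # equivalent of lag(cumsum(x)): running total before each position
  cum_prev <- c(NA, cumsum(x))[seq_len(n)]
  # equivalent of lag(cum_prev, k): the same running total k positions earlier
  cum_prev_k <- c(rep(NA, k), cum_prev)[seq_len(n)]
  # while fewer than k earlier values exist, the running total is the answer
  ifelse(is.na(cum_prev_k), cum_prev, cum_prev - cum_prev_k)
}

# Yearly shares for fyear 1971 to 1975 in the test data (60 rows in total)
shares <- c(12, 13, 12, 12, 11) / 60
last4  <- prev_window_sum(shares, k = 4)
check  <- as.numeric(last4 >= 0.4)
print(check)  # NA 0 1 1 1
```

The first year has no history, so its value is NA rather than 0; whether that should count as a failed check depends on what the original loop did, which the question does not show.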
    

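Can you explain in words what the for loop does?

@DavidArenburg see the edit, where I outlined the idea.

Thanks, but this does not take the unique id (cusip) into account. Does the code need to be changed to group_by(cusip, fyear)?

Thanks a lot. However, it does not return all the variables on the larger dataset, even after I removed the select at the end. Do you know why?

Because of the grouping: summarise keeps only the grouping columns and the new summary columns. If the variables are constant within each cusip-fyear pair, you can add them to the group_by list. If not, use left_join on the original data frame with by = c("cusip", "fyear").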
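
The effect of joining the summary back to the original rows can be sketched with base R's merge, used here as a stand-in for left_join so the example runs without dplyr. The two small data frames are made-up toy data, not part of the question.

```r
# Toy stand-in for the per-(cusip, fyear) summary produced by the pipeline
summary_tbl <- data.frame(
  cusip = c(2, 2, 2),
  fyear = c("1971", "1972", "1973"),
  check = c(NA, 0, 1),
  stringsAsFactors = FALSE
)

# Toy stand-in for the original rows, with a column the summary dropped
rows <- data.frame(
  cusip = c(2, 2, 2, 2),
  fyear = c("1971", "1971", "1972", "1973"),
  month = c("01", "02", "01", "01"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of the left table, as a left join does;
# each summary row is repeated once per matching original row
joined <- merge(summary_tbl, rows, by = c("cusip", "fyear"), all.x = TRUE)
print(joined)
```

Unlike left_join, merge sorts the result by the key columns, so the row order can differ from the dplyr version.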