R: Need a fast rolling apply function with start-to-stop indices

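Below is a piece of code. It gives the percentile of the trade price level over a rolling 15-minute (historical) window. It runs quickly if the length is 500 or 1000, but as you can see there are 45K observations, and on the whole data set it is very slow. Can I apply any plyr function here? Any other suggestions are welcome.

This is what the trade data looks like: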

> str(trade)
'data.frame':   45571 obs. of  5 variables:
 $ time    : chr  "2013-10-20 22:00:00.489" "2013-10-20 22:00:00.807" "2013-10-20 22:00:00.811" "2013-10-20 22:00:00.811" ...
 $ prc     : num  121 121 121 121 121 ...
 $ siz     : int  1 4 1 2 3 3 2 2 3 4 ...
 $ aggress : chr  "B" "B" "B" "B" ...
 $ time.pos: POSIXlt, format: "2013-10-20 22:00:00.489" "2013-10-20 22:00:00.807" "2013-10-20 22:00:00.811" "2013-10-20 22:00:00.811" ...
This is how the new column trade$time.pos is created, and what the data looks like afterwards:

trade$time.pos <- strptime(trade$time, format="%Y-%m-%d %H:%M:%OS") 

> head(trade)
                     time      prc siz aggress                time.pos
1 2013-10-20 22:00:00.489 121.3672   1       B 2013-10-20 22:00:00.489
2 2013-10-20 22:00:00.807 121.3750   4       B 2013-10-20 22:00:00.807
3 2013-10-20 22:00:00.811 121.3750   1       B 2013-10-20 22:00:00.811
4 2013-10-20 22:00:00.811 121.3750   2       B 2013-10-20 22:00:00.811
5 2013-10-20 22:00:00.811 121.3750   3       B 2013-10-20 22:00:00.811
6 2013-10-20 22:00:00.811 121.3750   3       B 2013-10-20 22:00:00.811

#t_15_index function returns the indices of the trades that were executed in last 15 minutes from the current trade(t-15 to t).
t_15_index <- function(data_vector,index) {
  which(data_vector[index] - data_vector[1:index]<=15*60)
}

get_percentile <- function(data) {
  len_d <- dim(trade)[1]  

  price_percentile = vector(length=len_d)  

  for(i in 1: len_d) {   

    t_15 = t_15_index(trade$time.pos,i)
    #ecdf(rep(..)) gets the empirical distribution of the the trade size on a particular trade-price level
    price_dist = ecdf(rep(trade$prc[t_15],trade$siz[t_15]))
    #percentile of the current price level depending on current (t-15 to t) subset of data
    price_percentile[i] = price_dist(trade$prc[i])
  }
  trade$price_percentile = price_percentile
  trade
}


res_trade = get_percentile(trade)

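There may be a way to speed up the rolling application, but because the window size varies, I don't think the standard tools (e.g. rollapply) will work, though someone more familiar with them may have ideas. In the meantime, you can optimize the percentile calculation. Instead of using ecdf, which creates a function with all the associated overhead, you can compute a decent approximation directly: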

> vec <- rnorm(10000, 0, 3)
> val <- 5
> max(which(sort(vec) < val)) / length(vec)
[1] 0.9543
> ecdf(vec)(val)
[1] 0.9543
> microbenchmark(max(which(sort(vec) < val)) / length(vec))
Unit: milliseconds
expr      min       lq   median       uq      max neval
max(which(sort(vec) < val))/length(vec) 1.093434 1.105231 1.116364 1.141204 1.449141   100
> microbenchmark(ecdf(vec)(val))
Unit: milliseconds
expr      min       lq   median       uq      max neval
ecdf(vec)(val) 2.552946 2.808041 3.043579 3.439269 4.208202   100

That is roughly a 2.5x improvement. The improvement is larger for smaller samples.
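As a side note on the direct calculation, the question builds its distribution with `rep(prc, siz)`, and the same value can be obtained from the weights without expanding or sorting anything. The following is only a minimal sketch on made-up data (the `prc`, `siz` and `val` values here are illustrative assumptions, not the asker's data):

```r
set.seed(42)
prc <- round(rnorm(1000, 121, 0.05), 2)    # made-up prices
siz <- sample(1:10, 1000, replace = TRUE)  # made-up trade sizes
val <- prc[1000]                           # price whose percentile we want

# Approach from the question: expand each price siz times, then build an ecdf
p_ecdf <- ecdf(rep(prc, siz))(val)

# Direct version: share of total size traded at or below val
p_direct <- sum(siz[prc <= val]) / sum(siz)

all.equal(p_ecdf, p_direct)
```

For the unweighted case, `mean(vec <= val)` gives the same value as `ecdf(vec)(val)` without sorting.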

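OK, this question got me interested. Here is what I did:

  • Replaced ecdf with a custom percentile calculation
  • Changed time.pos to numeric (it is in seconds anyway), because of the overhead associated with [.POSIXct vs [
  • Changed t_15_index to look back only as far as the earliest timestamp of the previous window; since the data is sorted, we don't need to look all the way back to index 1

And this is the result: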

    > system.time(res2 <- get_percentile2(trade))
      user  system elapsed 
    14.458   0.768  15.215 
    > system.time(res1 <- get_percentile(trade))
       user  system elapsed 
    110.851  17.974 128.736 
    
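    About an 8.5x improvement. Note that the improvement will be larger if there are fewer items per 15-minute interval. This test packs 45K points into 24 hours, so if your 45K points actually span several days, you can expect more improvement. Here is the code: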

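    # Assumes trade$time.pos.2 holds time.pos as numeric seconds,
    # e.g. trade$time.pos.2 <- as.numeric(as.POSIXct(trade$time.pos))
    # `minutes` was undefined in the snippet; a default of 15 is assumed here
    t_15_index2 <- function(data_vector, index, min.index, minutes = 15) {
      # Only look back as far as min.index, the start of the previous window
      which(data_vector[index] - data_vector[min.index:index] <= minutes * 60) + min.index - 1L
    }
    get_percentile2 <- function(trade) {
      len_d <- nrow(trade)
      price_percentile <- numeric(len_d)
      min.index <- 1L
      for (i in seq_len(len_d)) {
        t_15 <- t_15_index2(trade$time.pos.2, i, min.index)
        vec <- rep(trade$prc[t_15], trade$siz[t_15])
        # Share of size-weighted prices at or below the current price
        price_percentile[i] <- max(0, which(sort(vec) <= trade$prc[i])) / length(vec)
        min.index <- t_15[[1]]
      }
      trade$price_percentile <- price_percentile
      trade
    }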
    
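    Finally, if you go down this route, you could also come up with clever ways to recompute the percentile incrementally, since you know what your 15-minute window contains, what gets added, and what gets removed.

    I'm not 100% sure whether the bookkeeping required to maintain a FIFO 15-minute window would end up outweighing the benefits of doing so.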
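To make the "what gets added, what gets removed" idea a little more concrete, here is a minimal sketch, not a tuned implementation. It keeps a moving start pointer and a running total of `siz` for the window, and weights by `siz` directly instead of calling `rep` and `sort`. The function name `rolling_weighted_percentile`, its arguments, and the sample data are all hypothetical; it assumes times are sorted and given as numeric seconds.

```r
rolling_weighted_percentile <- function(time_sec, prc, siz, window_sec = 15 * 60) {
  n <- length(time_sec)
  out <- numeric(n)
  start <- 1L    # first index still inside the window
  total_w <- 0   # running sum of siz inside the window
  for (i in seq_len(n)) {
    total_w <- total_w + siz[i]                 # trade i enters the window
    while (time_sec[start] < time_sec[i] - window_sec) {
      total_w <- total_w - siz[start]           # oldest trade leaves the window
      start <- start + 1L
    }
    idx <- start:i
    # Size traded at or below the current price, as a share of window size
    out[i] <- sum(siz[idx][prc[idx] <= prc[i]]) / total_w
  }
  out
}

# Made-up data: 200 trades with integer-second timestamps
set.seed(1)
time_sec <- cumsum(sample(1:120, 200, replace = TRUE))
prc <- round(100 + rnorm(200), 2)
siz <- sample(1:10, 200, replace = TRUE)
res <- rolling_weighted_percentile(time_sec, prc, siz)
```

Only the total weight is maintained incrementally here; the numerator is still recomputed over the window at every step, so a fully incremental percentile would need more bookkeeping than this.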

    Here is a quick attempt at finding, more efficiently, the index of the time that occurred 15 minutes ago:

    # Create some sample data (modified from BrodieG)
    set.seed(1)
    
    ticks <- 45000
    start <- as.numeric(as.POSIXct("2013-01-01"))
    end <- as.numeric(as.POSIXct("2013-01-02"))
    times <- as.POSIXct(runif(ticks, start, end), origin=as.POSIXct("1970-01-01"))
    trade <- data.frame(
      time = sort(times),
      prc = 100 + rnorm(ticks, 0, 5),
      siz = sample(1:10, ticks, replace = TRUE)
    )
    
    # For a vector of times, find the index of the first time that falls within
    # the last `minutes` minutes of the current time. Assumes times are sorted
    minutes_ago <- function(time, minutes = 15) {
      secs <- minutes * 60
      time <- as.numeric(time)
      out <- integer(length(time))
    
      before <- 1
    
      for(i in seq_along(out)) {
        while(time[before] < time[i] - secs) {
          before <- before + 1
        }
        out[i] <- before
    
      }
      out
    }
    system.time(minutes_ago(trade$time))
    # Takes about 0.2s on my machine
    
    library(Rcpp)
    cppFunction("IntegerVector minutes_ago2(NumericVector time, int minutes = 15) {
      int secs = minutes * 60;
      int n = time.size();
      IntegerVector out(n);
    
      int before = 0;
      for (int i = 0; i < n; ++i) {
        // Could do even better here with a binary search
        while(time[before] < time[i] - secs) {
          before++;
        }
        out[i] = before + 1;
      }
      return out;
    }")
    
    system.time(minutes_ago2(trade$time, 10))
    # Takes less than 0.001s
    
    all.equal(minutes_ago(trade$time, 15), minutes_ago2(trade$time, 15))
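The binary search mentioned in the C++ comment is available in base R through `findInterval`, which searches a sorted vector for many query values at once. The sketch below is one possible vectorized alternative rather than part of the answer above. It uses made-up times and rebuilds the pointer-walking loop inline only to check that both approaches agree (`window_start` and `ref` are hypothetical names):

```r
set.seed(1)
time <- sort(runif(1000, 0, 24 * 3600))  # made-up sorted times, in seconds
secs <- 15 * 60

# With left.open = TRUE, findInterval() returns how many elements of `time`
# are strictly below each cutoff; adding one gives the first index whose
# time is >= time[i] - secs.
window_start <- findInterval(time - secs, time, left.open = TRUE) + 1L

# Reference: the same pointer-walking loop as minutes_ago
ref <- integer(length(time))
before <- 1L
for (i in seq_along(time)) {
  while (time[before] < time[i] - secs) before <- before + 1L
  ref[i] <- before
}

identical(window_start, ref)
```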
    
    
    If you want to use ecdf in dplyr, use seq_along/sapply inside mutate to get a faster result:

    y <- x %>% mutate(percentile.score = sapply(seq_along(score), function(i){sum(score[1:i] <= score[i])/i}))
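To show what that one-liner computes, here is a base R sketch without dplyr, using a made-up `score` vector. Note that `score[1:i]` is an expanding window (everything up to row `i`), not a 15-minute rolling one.

```r
score <- c(3, 1, 4, 1, 5, 9, 2, 6)  # made-up scores

# Share of the values seen so far that are <= the current value
expanding_pct <- vapply(
  seq_along(score),
  function(i) mean(score[seq_len(i)] <= score[i]),
  numeric(1)
)

# Same quantity via ecdf, for comparison
via_ecdf <- vapply(
  seq_along(score),
  function(i) ecdf(score[seq_len(i)])(score[i]),
  numeric(1)
)

all.equal(expanding_pct, via_ecdf)
```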
    

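    get_percentile takes one input, data, and never references data inside the function.

    If you have time to learn something new, you could look at the data.table package, which implements a rolling join through its roll argument. I'm pretty sure that would be many times faster than using plyr.

    This answer should get you started if you want to try Rcpp. In C++ you would have to work out ecdf yourself. Once the data is properly organized, the compiled function will probably run in milliseconds.

    If anyone is looking for a fast function to find points within an interval, check out findInterval.

    You would do much better if you computed all the t_15 values in a single sweep. Assuming the data is sorted, you should only need one pass through it.

    @hadley, do you mean storing all the possible 15-minute rolling intervals (I estimate there are about 20MM entries) and only then operating on them? Right now, both the original approach and my tweak sweep through the data once.

    See my answer. You are sweeping multiple times in t_15_index (not over the whole vector, but it is still fairly inefficient).

    I see what you mean. The t_15 index-building step takes about 1.3 seconds (just running t_15_index2), so the non-Rcpp version would save about 1 second. Unfortunately that is small compared to the total run time of 17 seconds (measured on the machine I'm using now, which is different from the one I used when I wrote the original code). +1 for the education.

    @BrodieG I think for really fast code you could use the same strategy to compute a rolling ecdf, but it would be quite complicated.
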
    # Original approach; `minutes` must be defined before calling (15 assumed here)
    minutes <- 15
    t_15_index <- function(data_vector,index) {
      which(data_vector[index] - data_vector[1:index]<=minutes*60)
    }
    get_percentile <- function(trade) {
      len_d <- dim(trade)[1]    
      price_percentile = vector(length=len_d)  
      for(i in 1: len_d) {       
        t_15 = t_15_index(trade$time.pos,i)
        price_dist = ecdf(rep(trade$prc[t_15],trade$siz[t_15]))
        price_percentile[i] = price_dist(trade$prc[i])
      }
      trade$price_percentile = price_percentile
      trade
    }
    
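    # Benchmark comparing two ways of maintaining a 2000-entry window over `vec`
    # (`vec` is assumed to be a long numeric vector defined beforehand)
    # Version that pulls whole 2000 entries each time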
    sub.vec <- numeric(2000)
    system.time(r1 <- for(i in seq_len(length(vec) - 2000)) sub.vec <- vec[i:(i+1999)])
    #  user  system elapsed 
    # 17.507   4.723  22.211 
    
    # Version that overwrites 1 value at a time keeping the most recent 2K
    sub.vec <- numeric(2001) # need to make this slightly larger because of 2000 %% 2000 == 0
    system.time(r2 <- for(i in seq_len(length(vec) - 2000)) sub.vec[[(i %% 2000) + 1]] <- vec[[i]])
    
    #  user  system elapsed 
    # 2.642   0.009   2.650 
    
    all.equal(r1, tail(r2, -1L))
    # [1] TRUE
    