Performance 在R中加速循环操作_Performance_R_Loops_Rcpp_R Faq

Performance 在R中加速循环操作

performance r loops

Performance 在R中加速循环操作,performance,r,loops,rcpp,r-faq,Performance,R,Loops,Rcpp,R Faq,我在R中遇到了一个很大的性能问题。我编写了一个函数，该函数迭代data.frame对象。它只需在data.frame中添加一个新列并累加一些内容。（操作简单）。data.frame大约有850K行。我的电脑还在工作（现在大约10小时），我不知道运行时间 dayloop2 <- function(temp){ for (i in 1:nrow(temp)){ temp[i,10] <- i if (i > 1) {

我在R中遇到了一个很大的性能问题。我编写了一个函数，该函数迭代

data.frame

对象。它只需在

data.frame

中添加一个新列并累加一些内容。（操作简单）。

data.frame

大约有850K行。我的电脑还在工作（现在大约10小时），我不知道运行时间

dayloop2 <- function(temp){
    for (i in 1:nrow(temp)){    
        temp[i,10] <- i
        if (i > 1) {             
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                temp[i,10] <- temp[i,9] + temp[i-1,10]                    
            } else {
                temp[i,10] <- temp[i,9]                                    
            }
        } else {
            temp[i,10] <- temp[i,9]
        }
    }
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}

dayloop2通过使用索引或嵌套的ifelse（）
语句跳过循环，可以大大加快速度
idx <- 1:nrow(temp)
temp[,10] <- idx
idx1 <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
temp[idx1,10] <- temp[idx1,9] + temp[which(idx1)-1,10] 
temp[!idx1,10] <- temp[!idx1,9]    
temp[1,10] <- temp[1,9]
names(temp)[names(temp) == "V10"] <- "Kumm."

idx在R中，通常可以通过使用apply
系列函数来加速循环处理（在您的情况下，它可能是replicate
）。查看提供进度条的plyr
包
另一个选择是完全避免循环，并用矢量化算法代替它们。我不确定您到底在做什么，但您可能可以一次将函数应用于所有行：
temp[1:nrow(temp), 10] <- temp[1:nrow(temp), 9] + temp[0:(nrow(temp)-1), 10]

temp[1:nrow（temp），10]最大的问题和无效的根源是对data.frame进行索引，我指的是使用temp[，]的所有这些行
尽量避免这种情况。我接受了您的函数，更改了索引，这里是版本A
dayloop2_A <- function(temp){
    res <- numeric(nrow(temp))
    for (i in 1:nrow(temp)){    
        res[i] <- i
        if (i > 1) {             
            if ((temp[i,6] == temp[i-1,6]) & (temp[i,3] == temp[i-1,3])) { 
                res[i] <- temp[i,9] + res[i-1]                   
            } else {
                res[i] <- temp[i,9]                                    
            }
        } else {
            res[i] <- temp[i,9]
        }
    }
    temp$`Kumm.` <- res
    return(temp)
}

结果是

您可以看到您的版本以指数形式依赖于nrow（X）
。修改后的版本具有线性关系，简单的lm
模型预测850000行的计算需要6分10秒
矢量化的威力
正如Shane和Calimo在其答案中所述，矢量化是提高性能的关键。
您可以从代码中移出循环：

调节
结果的初始化（即temp[i，9]
）

这导致了这个代码
dayloop2_B <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in 1:nrow(temp)) {
        if (cond[i]) res[i] <- temp[i,9] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}

您获得的性能在很大程度上取决于数据结构。精确-条件中TRUE值的百分比。
对于我的模拟数据，在1秒以下需要85万行的计算时间

我希望你能更进一步，我认为至少可以做两件事：

编写一个C
代码来执行条件求和
若你们知道在你们的数据中，max序列并不大，那个么你们可以把循环改为矢量化，比如
while (any(cond)) {
    indx <- c(FALSE, cond[-1] & !cond[-n])
    res[indx] <- res[indx] + res[which(indx)-1]
    cond[indx] <- FALSE
}

while（任何（条件））{
indx如果您使用进行循环，则很可能是将R编码为C、Java或其他类型。正确矢量化的R代码速度非常快
以这两个简单的代码位为例，按顺序生成10000个整数的列表：
第一个代码示例是如何使用传统的编码范式对循环进行编码。完成循环需要28秒
system.time({
    a <- NULL
    for(i in 1:1e5)a[i] <- i
})
   user  system elapsed 
  28.36    0.07   28.61 

加速R代码的一般策略
首先，找出真正慢的部分在哪里。不需要优化运行不慢的代码。对于少量代码，只需仔细思考就可以了。如果失败，RProf和类似的分析工具可能会有所帮助
找出瓶颈后，考虑更高效的算法来做你想做的事情。如果可能，计算应该只运行一次，因此：

存储结果并访问它们，而不是重复地重新计算
将非循环相关计算从循环中取出
避免不必要的计算（例如）

使用更有效的函数可以产生中等或较大的速度增益。例如，paste0
产生较小的效率增益，但。colSums（）
及其相关函数会产生更显著的增益。mean

这样你就可以避免一些特别常见的麻烦了：

cbind
会让你很快慢下来
初始化数据结构，然后填入
即使使用预分配，您也可以切换到按引用传递的方法，而不是按值传递的方法，但这可能不值得这么麻烦
请看一看，以了解更多要避免的陷阱

尝试更好的矢量化，这通常会有帮助，但并不总是有帮助。在这方面，像ifelse
、diff
等固有的矢量化命令将比apply
命令系列（在编写良好的循环中提供很少甚至没有速度提升）提供更多的改进
您还可以尝试向R函数提供更多信息。例如，使用和指定。速度增益将根据您消除的猜测量而变化
接下来，考虑<强>优化包：该包在使用时可能产生大量的速度增益，在数据操作和读取大量数据时（<代码> Frad ）。
下一步，尝试通过更有效的呼叫R的方式来提高速度：

编译你的R脚本。或者使用Ra
和jit
包进行即时编译（Dirk在中有一个例子）
确保您使用的是优化的BLAS。这些系统提供了全面的速度提升。老实说，R在安装时没有自动使用最高效的库，这是一个遗憾。希望Revolution R将为整个社区贡献他们在这里所做的工作
Radford Neal做了一系列优化，其中一些被采用到R Core中，还有许多被分叉到R Core中

<>和最后，如果上述所有的还没有达到你所需要的速度，你可能需要移动到慢速代码段的<强>快的语言。<>代码> Rcpp < /C> >和<>代码>内联 >这里只使用C++代码替换最慢的部分，特别是容易的。，然后它就吹了
dayloop2_D <- function(temp){
    cond <- c(FALSE, (temp[-nrow(temp),6] == temp[-1,6]) & (temp[-nrow(temp),3] == temp[-1,3]))
    res <- temp[,9]
    for (i in (1:nrow(temp))[cond]) {
        res[i] <- res[i] + res[i-1]
    }
    temp$`Kumm.` <- res
    return(temp)
}

while (any(cond)) {
    indx <- c(FALSE, cond[-1] & !cond[-n])
    res[indx] <- res[indx] + res[which(indx)-1]
    cond[indx] <- FALSE
}

system.time({
    a <- NULL
    for(i in 1:1e5)a[i] <- i
})
   user  system elapsed 
  28.36    0.07   28.61 

system.time({
    a <- rep(1, 1e5)
    for(i in 1:1e5)a[i] <- i
})

   user  system elapsed 
   0.30    0.00    0.29 

system.time(a <- 1:1e5)

   user  system elapsed 
      0       0       0 

body <- 'Rcpp::NumericMatrix nm(temp);
         int nrtemp = Rccp::as<int>(nrt);
         for (int i = 0; i < nrtemp; ++i) {
             temp(i, 9) = i
             if (i > 1) {
                 if ((temp(i, 5) == temp(i - 1, 5) && temp(i, 2) == temp(i - 1, 2) {
                     temp(i, 9) = temp(i, 8) + temp(i - 1, 9)
                 } else {
                     temp(i, 9) = temp(i, 8)
                 }
             } else {
                 temp(i, 9) = temp(i, 8)
             }
         return Rcpp::wrap(nm);
        '

settings <- getPlugin("Rcpp")
# settings$env$PKG_CXXFLAGS <- paste("-I", getwd(), sep="") if you want to inc files in wd
dayloop <- cxxfunction(signature(nrt="numeric", temp="numeric"), body-body,
    plugin="Rcpp", settings=settings, cppargs="-I/usr/include")

dayloop2 <- function(temp) {
    # extract a numeric matrix from temp, put it in tmp
    nc <- ncol(temp)
    nm <- dayloop(nc, temp)
    names(temp)[names(temp) == "V10"] <- "Kumm."
    return(temp)
}

inc <- '#include <header.h>

n <- 1000000
df <- as.data.frame(matrix(sample(1:10, n*9, TRUE), n, 9))
colnames(df) <- paste("col", 1:9, sep = "")

library(data.table)

dayloop2.dt <- function(df) {
  dt <- data.table(df)
  dt[, Kumm. := {
    res <- .I;
    ifelse (res > 1,             
      ifelse ((col6 == shift(col6, fill = 0)) & (col3 == shift(col3, fill = 0)) , 
        res <- col9 + shift(res)                   
      , # else
        res <- col9                                 
      )
     , # else
      res <- col9
    )
  }
  ,]
  res <- data.frame(dt)
  return (res)
}

res <- dayloop2.dt(df)

m <- microbenchmark(dayloop2.dt(df), times = 100)
#Unit: milliseconds
#       expr      min        lq     mean   median       uq      max neval
#dayloop2.dt(df) 436.4467 441.02076 578.7126 503.9874 575.9534 966.1042    10

nrow=function(x){ ##required as I use nrow at times.
  if(class(x)=='list') {
    length(x[[names(x)[1]]])
  }else{
    base::nrow(x)
  }
}

system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
})

system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  d=as.list(d) #become a list
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
  d=as.data.frame(d) #revert back to data.frame
})

   user  system elapsed 
   0.53    0.00    0.53

   user  system elapsed 
   0.04    0.00    0.03 

system.time({
  d=data.frame(seq=1:10000,r=rnorm(10000))
  d$foo=d$r
  d$seq=1:5
  class(d)='list'
  mark=NA
  for(i in 1:nrow(d)){
    if(d$seq[i]==1) mark=d$r[i]
    d$foo[i]=mark
  }
  class(d)='data.frame'
})
head(d)

dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    cat(round(i/nrow(temp)*100,2),"%    \r") # prints the percentage complete in realtime.
    # do stuff
  }
  return(blah)
}

dayloop2 <- function(temp){
  temp <- lapply(1:nrow(temp), function(i) {
    cat(round(i/nrow(temp)*100,2),"%    \r")
    #do stuff
  })
  return(temp)
}

dayloop2 <- function(temp){
  for (i in 1:nrow(temp)){
    if(i %% 100 == 0) cat(round(i/nrow(temp)*100,2),"%    \r") # prints every 100 times through the loop
    # do stuff
  }
  return(temp)
}

dayloop_accumulate <- function(temp) {
  temp %>%
    as_tibble() %>%
     mutate(cond = c(FALSE, (V6 == lag(V6) & V3 == lag(V3))[-1])) %>%
    mutate(V10 = V9 %>% 
             purrr::accumulate2(.y = cond[-1], .f = function(.i_1, .i, .y) {
               if(.y) {
                 .i_1 + .i
               } else {
                 .i
               }
             }) %>% unlist()) %>%
    select(-cond)
}