Reusing existing memory for BLAS operations in R

Tags: r, rcpp, blas, armadillo

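I have an operation in a tight loop in R that I'd like to optimize. It updates the weights in an IRLS algorithm by computing the Schur product of a vector and a matrix. That is, it multiplies each element of the matrix by the corresponding row value of the vector, producing a result with the same dimensions as the matrix. In oversimplified schematic form, it looks like this: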

reweight = function(iter, w, Q) {
  for (i in 1:iter) {
    wT = w * Q
  }
}
In plain R code, each iteration creates a new matrix of dim() [rows, cols]:

cols = 1000
rows = 1000000
w = runif(rows)
Q = matrix(1.0, rows, cols)

Rprofmem()
reweight(5, w, Q)
Rprofmem(NULL)
nate@ubuntu:~/R$ less Rprofmem.out

8000000040 :"reweight"
8000000040 :"reweight"
8000000040 :"reweight"
8000000040 :"reweight"
8000000040 :"reweight"
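If the matrix is large (multiple GB), the cost of memory allocation exceeds the time spent on the numerical operations:

nate@ubuntu:~/R$ perf record -p `pgrep R` sleep 5 && perf report

49.93%  R  [kernel.kallsyms]  [k] clear_page_c_e
47.67%  R  libR.so            [.] real_binary
 0.57%  R  [kernel.kallsyms]  [k] get_page_from_freelist
 0.35%  R  [kernel.kallsyms]  [k] clear_huge_page
 0.34%  R  libR.so            [.] RunGenCollect
 0.20%  R  [kernel.kallsyms]  [k] clear_page
It also consumes a lot of memory:

USER       PID VSZ    RSS    COMMAND
nate     17099 22.5GB 22.5GB /usr/local/lib/R/bin/exec/R --vanilla
If the matrix is smaller (several MB) but the number of iterations is larger, memory usage is more reasonable, but at the cost of the garbage collector taking more time than the numerical computation:

cols = 100
rows = 10000
w = runif(rows)
Q = matrix(1.0, rows, cols)
reweight(1000, w, Q)
(Note that this is a new process started from scratch.)

If I write my own function with Rcpp and do the work in place, I get the memory allocation I want:

library(Rcpp)
cppFunction('
void weightMatrix(NumericVector w,
                  NumericMatrix Q,
                  NumericMatrix wQ) {
    size_t numRows = Q.rows();
    for (size_t row = 0; row < numRows; row++) {
       wQ(row,_) = w(row) * Q(row,_);  
    }
    return;
}
')

reweightCPP = function(iter, w, Q) {
  # Initialize workspace to non-NA
  wQ = matrix(1.0, nrow(Q), ncol(Q))
  for (i in 1:iter) {
    weightMatrix(w, Q, wQ)
  }
}

cols = 100
rows = 10000
w = runif(rows)
Q = matrix(1.0, rows, cols)
wQ = matrix(NA, rows, cols)
Rprofmem()
reweightCPP(5, w, Q)
Rprofmem(NULL)
Rprofmem.out for this run shows only the one-time workspace allocation and a small allocation per call:

8000040 :"matrix" "reweightCPP"
2544 :"<Anonymous>" "weightMatrix" "reweightCPP"
2544 :"<Anonymous>" "weightMatrix" "reweightCPP"
2544 :"<Anonymous>" "weightMatrix" "reweightCPP"
2544 :"<Anonymous>" "weightMatrix" "reweightCPP"
2544 :"<Anonymous>" "weightMatrix" "reweightCPP"
The perf profile for this version, however, shows more than 20% of the time going to R attribute lookups (Rf_getAttrib, getAttrib0, Rf_isMatrix) rather than to the multiplication itself:

76.53%  R  sourceCpp_82335.so  [.] _Z12weightMatrixN4Rcpp6VectorILi14ENS_15PreserveStorageEEENS_6MatrixILi14ES1_EES4_
10.46%  R  libR.so             [.] Rf_getAttrib
 9.53%  R  libR.so             [.] getAttrib0
 2.06%  R  libR.so             [.] Rf_isMatrix
 0.42%  R  libR.so             [.] INTEGER
But I can get around this by using lower-level C++: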

cppFunction('
void weightMatrix(NumericVector w_,
                  NumericMatrix Q_,
                  NumericMatrix wQ_) {
    size_t numCols = Q_.ncol();
    size_t numRows = Q_.nrow();
    double * __restrict__ w = &w_[0];
    double * __restrict__ Q = &Q_[0];
    double * __restrict__ wQ = &wQ_[0];
    for (size_t row = 0; row < numRows; row++) {
        size_t colOffset = 0;
        for (size_t col = 0; col < numCols; col++) {
            wQ[colOffset + row] = w[row] * Q[colOffset + row];
            colOffset += numRows;
        }
    }
    return;
}
')

99.18%  R  sourceCpp_59392.so  [.] sourceCpp_48203_weightMatrix
 0.06%  R  libR.so             [.] PutRNGstate
 0.06%  R  libR.so             [.] do_begin
 0.06%  R  libR.so             [.] Rf_eval
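That said, I haven't yet found a way to get the compiler to reliably generate efficient assembly without using SIMD intrinsics to force the use of VMULPD. Even with the ugly "__restrict__" attributes, in the form shown here it seems to have to reverse the loop order and do a lot of unnecessary work. But I'll probably eventually find the magic cross-compiler syntax or, more likely, call a Fortran BLAS function.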
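
The loop-order remark above may be easier to see in isolation. R stores matrices in column-major order, so in the version shown the inner loop over col strides through memory by numRows elements on every step. The following is only a minimal sketch, assuming plain std::vector buffers instead of Rcpp types and a hypothetical helper name scale_rows_colmajor; it puts the column loop outermost so that the inner loop reads and writes contiguous memory:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical standalone helper (not part of the Rcpp code in this post).
// Scales row i of a column-major rows x cols matrix Q by w[i] and writes
// the result into the caller-supplied workspace wQ (no allocation here).
void scale_rows_colmajor(const std::vector<double>& w,
                         const std::vector<double>& Q,
                         std::vector<double>& wQ,
                         std::size_t rows, std::size_t cols) {
    assert(w.size() == rows);
    assert(Q.size() == rows * cols);
    assert(wQ.size() == rows * cols);
    const double* wp = w.data();
    for (std::size_t col = 0; col < cols; ++col) {
        // Each column is a contiguous run of `rows` doubles.
        const double* q = Q.data() + col * rows;
        double* out = wQ.data() + col * rows;
        for (std::size_t row = 0; row < rows; ++row) {
            out[row] = wp[row] * q[row];  // unit-stride reads and writes
        }
    }
}
```

Whether a given compiler then emits packed multiplies such as VMULPD still depends on the compiler and its flags; the sketch only removes the strided access pattern.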

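Which brings me to my question:

Is there any way to get the effect I want without going to all this trouble? Failing that, is there a way I can at least hide it behind the scenes, so that an end user in R can write "wQ = w * Q" and have it magically reuse wQ rather than allocating and throwing away another giant matrix?

The BLAS wrappers in R seem to do reasonably well when the answer can be written into one of the operands (Q = w * Q), but I haven't found any way to do this when I need a "third-party" workspace. Is there a reasonable way to define a method for %=% that would convert "wQ = w * Q" into "op_mult(w, Q, wQ)"?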

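To preempt the obvious question: yes, I have measured it, and it matters. The use case is an ensemble of cross-validated logistic regressions run in a loop over large amounts of longitudinal data. Each analysis will be called millions, if not billions, of times. A good optimization of this function would help reduce the running time from "impossible" to "days". A great optimization (or rather a combination of several such optimizations) might bring it down to "hours" or even "minutes".

Edit: In the comments, Henrik correctly points out that the example loop has been simplified to the point where it just repeats the same calculation several times. I hoped this would focus the discussion, but it may have muddied it instead. In the real version there are more steps in the loop, and the "w" in "w * Q" is different on every iteration. Below is an untested draft of the actual function. It is a "semi-optimized" logistic regression written in straight R: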

logistic_irls_qrnewton = function(A, y, maxIter=25, targetSSE=1e-16) {
    # warn user below on first weight less than threshold
    tinyWeightsFound = FALSE
    tiny = sqrt(.Machine$double.eps)

    # decompose A to QR (only once, Choleski done in loop)
    QR = qr(A)     # A[rows=samples, cols=covariates]
    Q  = qr.Q(QR)  # Q[rows, cols] (same dimensions as A)
    R  = qr.R(QR)  # R[cols, cols] (upper right triangular)

    # copying now prevents copying each time y is used as argument
    y = y + 0;     # y[rows]

    # first pass is outside loop since initial values are constant
    iter = 1
    t = (y - 0.5) * 4.0       # t[rows] = (y - m) * initial weight
    C = chol(crossprod(Q, Q)) # C[cols, cols]
    t = crossprod(Q,t)
    s = forwardsolve(t(C), t) # s[cols]
    s = backsolve(C, s)
    t = Q %*% s
    sse = crossprod(s)        # sum of squared errors
    print(as.vector(sse))
    converged = ifelse(sse < targetSSE, 1, 0)

    while (converged == 0 && iter < maxIter) {
        iter = iter + 1

        # only t is required as an input
        dim(t) = NULL     # matrix to vector to counteract crossprod
        e = exp(t)
        m = e / (e + 1)       # mu = exp(eta) / (1 + exp(eta))
        d = m / (e + 1)       # mu.eta = exp(eta) / (1 + exp(eta))^2
        w = d * d / (m - m^2) # W = mu.eta^2 / variance, variance = mu * (1 - mu)
        if(tinyWeightsFound == FALSE && min(w) < tiny) {
            print("Tiny weights found")
            tinyWeightsFound = TRUE
        }
        t = crossprod(Q, w * (((y - m) / d) + t))
        C = chol(crossprod(Q, w * Q))
        n = forwardsolve(t(C), t)
        n = backsolve(C, n)
        t = Q %*% n
        sse = crossprod(n - s) # divergence from previous
        s = n # save divergence for difference from next
        print(as.vector(sse))
        if (sse < targetSSE) converged = iter
    }

    if (converged == 0) {
        print(paste("Failed to converge after", iter, "iterations"))
        print(paste("Final SSE was", sse))
    } else {
        print(paste("Convergence after iteration", iter))
    }

    coefficients = backsolve(R, crossprod(Q,t))
    dim(coefficients) = NULL # return as a vector
    coefficients
}
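
In the function above, the line C = chol(crossprod(Q, w * Q)) is where the full-size temporary w * Q appears on every iteration. That product equals Q^T diag(w) Q, which is only cols x cols, so one possibility (an assumption, not something measured in this post) is to accumulate the small result directly and never form the rows x cols temporary. A naive standalone sketch of that arithmetic, using plain std::vector buffers in column-major order and a hypothetical helper name weighted_crossprod:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the post): returns the cols x cols matrix
// Q^T diag(w) Q in column-major order, i.e. the value R computes for
// crossprod(Q, w * Q), without materializing the rows x cols matrix w * Q.
std::vector<double> weighted_crossprod(const std::vector<double>& w,
                                       const std::vector<double>& Q,
                                       std::size_t rows, std::size_t cols) {
    assert(w.size() == rows);
    assert(Q.size() == rows * cols);
    std::vector<double> C(cols * cols, 0.0);
    for (std::size_t j = 0; j < cols; ++j) {
        const double* qj = Q.data() + j * rows;      // column j of Q
        for (std::size_t i = 0; i <= j; ++i) {
            const double* qi = Q.data() + i * rows;  // column i of Q
            double sum = 0.0;
            for (std::size_t r = 0; r < rows; ++r) {
                sum += qi[r] * w[r] * qj[r];
            }
            C[j * cols + i] = sum;  // entry (i, j)
            C[i * cols + j] = sum;  // entry (j, i), by symmetry
        }
    }
    return C;
}
```

The triple loop is only meant to show the arithmetic. It is not tuned, and it says nothing about how this would compare in speed with a BLAS-backed routine.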