Speeding up the calculation of the Dice coefficient in C/Rcpp

c++ performance r algorithm

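I need to calculate a similarity measure called the Dice coefficient over large matrices (600,000 x 500) of binary vectors in R. For speed I use C/Rcpp. The function runs well, but since I am not a computer scientist by background, I would like to know whether it could run faster. The code is suitable for parallelisation, but I have no experience parallelising C code.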

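The Dice coefficient is a simple measure of similarity/dissimilarity (depending on how you look at it). It is intended for comparing asymmetric binary vectors, meaning that one of the combinations (usually 0-0) is not important, and agreement (1-1 pairs) carries more weight than disagreement (1-0 or 0-1 pairs). Imagine the following contingency table: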

   1    0
1  a    b
0  c    d

The Dice coefficient is: (2*a) / (2*a + b + c)
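To make the formula concrete, here is a minimal standalone C++ sketch that counts a, b and c for a single pair of binary vectors and applies the formula above. The helper name dice_pair is hypothetical and does not come from the question; it is only meant to illustrate the definition, not to be fast.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: Dice coefficient of two equal-length binary vectors.
// a counts 1-1 pairs, b counts 1-0 pairs, c counts 0-1 pairs.
// 0-0 pairs (d in the contingency table) are deliberately ignored.
double dice_pair(const std::vector<int>& x, const std::vector<int>& y) {
    assert(x.size() == y.size());
    double a = 0, b = 0, c = 0;
    for (std::size_t k = 0; k < x.size(); ++k) {
        if (x[k] > 0 && y[k] > 0) {
            a++;            // both vectors have a 1
        } else if (x[k] > 0) {
            b++;            // only x has a 1
        } else if (y[k] > 0) {
            c++;            // only y has a 1
        }
    }
    // If both vectors contain only zeros this is 0/0, which is NaN
    // under IEEE floating point; the sketch does not special-case it.
    return (2 * a) / (2 * a + b + c);
}
```

For x = (1, 1, 0, 0, 1) and y = (1, 0, 1, 0, 1) the counts are a = 2, b = 1 and c = 1, so the coefficient is 4 / 6, roughly 0.667.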

Here is my Rcpp implementation:

library(Rcpp)
cppFunction('
    NumericMatrix dice(NumericMatrix binaryMat){
        int nrows = binaryMat.nrow(), ncols = binaryMat.ncol();
        NumericMatrix results(ncols, ncols);
        for(int i=0; i < ncols-1; i++){ // columns fixed
            for(int j=i+1; j < ncols; j++){ // columns moving
                double a = 0;
                double d = 0;
                for (int l = 0; l < nrows; l++) {
                    if(binaryMat(l, i)>0){
                        if(binaryMat(l, j)>0){
                            a++;
                        }
                    }else{
                        if(binaryMat(l, j)<1){
                            d++;
                        }
                    }
                }
                // compute Dice coefficient         
                double abc = nrows - d;
                double bc = abc - a;
                results(j,i) = (2*a) / (2*a + bc);          
            }
        }
        return wrap(results);
    }
')

I can't run your function at work, but are the results the same as this?
library(arules)
plot(dissimilarity(X,method="dice"))

system.time(dissimilarity(X,method="dice"))
#user  system elapsed 
#0.04    0.00    0.04 

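The solution proposed by Roland did not fully satisfy my use case. So, based on the source code of the arules package, I implemented a much faster version. The code in arules relies on the algorithm from Leisch (2005), which uses the tcrossprod() function in R.

First, I wrote an Rcpp/RcppEigen version of crossprod that is 2-3 times faster: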
library(Rcpp)
library(RcppEigen)
library(inline)
crossprodCpp <- '
using Eigen::Map;
using Eigen::MatrixXi;
using Eigen::Lower;

const Map<MatrixXi> A(as<Map<MatrixXi> >(AA));

const int m(A.rows()), n(A.cols());

MatrixXi AtA(MatrixXi(n, n).setZero().selfadjointView<Lower>().rankUpdate(A.adjoint()));

return wrap(AtA);
'

fcprd <- cxxfunction(signature(AA = "matrix"), crossprodCpp, "RcppEigen")

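The results are not the same. You need to do this: m
This gives almost the same timing. I mean: m
It looks like you could also use dissimilarity(as(X, "itemMatrix"), method="dice", which="items"), but it is still not much faster than your function.
Nice. If you have time, maybe clean it up and post it as an answer? Thanks!
Will do. I built a package around it, which I will also publish on GitHub.
Glad to see you found a good solution. Don't forget to accept your answer.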

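diceR <- function(X){
    # a[i, j]: number of rows in which columns i and j are both 1
    a <- fcprd(X)

    nx <- ncol(X)
    rsx <- colSums(X)

    # c[i, j]: ones in column i that are not matched by a one in column j
    c <- matrix(rsx, nrow = nx, ncol = nx) - a
    # b <- matrix(rsx, nrow = nx, ncol = nx, byrow = TRUE) - a
    b <- t(c)

    m <- (2 * a) / (2*a + b + c)
    return(m)
}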

library(microbenchmark)
library(arules)
m <- microbenchmark(dice(X), diceR(X), dissimilarity(t(X), method="dice"), times=100)
m
# Unit: milliseconds
#                                  expr       min       lq    median       uq      max neval
#                               dice(X) 791.34558 809.8396 812.19480 814.6735 910.1635   100
#                              diceR(X)  62.98642  76.5510  92.02528 159.2557 507.1662   100
#  dissimilarity(t(X), method = "dice") 264.07997 342.0484 352.59870 357.4632 520.0492   100
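
The idea behind diceR can also be sketched in plain C++ without Eigen. For 0/1 data, the cross-product entry a[i][j] is the number of 1-1 pairs between columns i and j, and the column sums give b = n_j - a and c = n_i - a, so the denominator 2a + b + c reduces to n_i + n_j. The function name dice_from_crossprod and the column-of-vectors layout are assumptions made for this illustration; it is a naive sketch of the identity, not the RcppEigen code above and not tuned for speed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Hypothetical sketch: cols[i] is the i-th binary (0/1) column.
// All columns are assumed to have the same length.
Matrix dice_from_crossprod(const std::vector<std::vector<int>>& cols) {
    const std::size_t n = cols.size();
    std::vector<double> colsum(n, 0.0);
    Matrix a(n, std::vector<double>(n, 0.0));

    // Column sums: the number of ones in each column.
    for (std::size_t i = 0; i < n; ++i) {
        assert(cols[i].size() == cols[0].size());
        for (int v : cols[i]) colsum[i] += v;
    }

    // Cross-product X'X: a[i][j] counts the 1-1 pairs of columns i and j.
    // Only the lower triangle is computed, then mirrored.
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j <= i; ++j) {
            double dot = 0.0;
            for (std::size_t l = 0; l < cols[i].size(); ++l) {
                dot += cols[i][l] * cols[j][l];
            }
            a[i][j] = dot;
            a[j][i] = dot;
        }
    }

    // Dice coefficient from a and the column sums, as in diceR.
    Matrix m(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const double c = colsum[i] - a[i][j];  // ones only in column i
            const double b = colsum[j] - a[i][j];  // ones only in column j
            m[i][j] = (2 * a[i][j]) / (2 * a[i][j] + b + c);
        }
    }
    return m;
}
```

Unlike the dice() function in the question, which fills only the lower triangle, this sketch returns the full symmetric matrix with ones on the diagonal, which is also what the diceR formula produces for columns that contain at least one 1.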