Sorting columns of an Rcpp NumericMatrix for median calculations

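I have been testing Rcpp and RcppArmadillo for computing summary statistics on large matrices. On roughly 4 million rows by 45 columns, the Rcpp version below is much faster (5 to 10 times) than base R's colMeans or my Armadillo version.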

colMeansRcpp <- cxxfunction(signature(X_="integer"), 
                            plugin='Rcpp',
                            body='
                            Rcpp::IntegerMatrix X = X_;
                            int ncol = X.ncol(); int nrow = X.nrow();                      
                            Rcpp::NumericVector out(ncol);
                            for(int col = 0; col < ncol; col++){
                              out[col]=Rcpp::sum(X(_, col));
                            }                             
                            return wrap(out/nrow);
                          ')

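I can't work out how to express "sort a column in place" as some mixture of Rcpp sugar and standard C++. Sorry, I can see that what I am doing is wrong, but a hint at the right syntax would be lovely.

PS: Am I right that I need to call clone() so that I don't change the R object?

EDIT: I have added the RcppArmadillo code and a benchmark comparison to address the answer and comments below. For a quick reply the benchmark uses only 50,000 rows, but I recall the picture being similar with many more rows. I realise you are the author of Rcpp, so many thanks for your time.

My thought is that I may be doing something dumb in the RcppArmadillo code that makes it run much slower than base colMeans or the Rcpp version.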

colMeansRcppArmadillo <- cxxfunction(signature(X_="integer"), 
                                     plugin="RcppArmadillo",
                                      body='
                                      arma::mat X = Rcpp::as<arma::mat > (X_);
                                      arma::rowvec MD= arma::mean(X, 0);
                                      return wrap(MD);
                                    ')

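You did not actually show the RcppArmadillo code. I have been very happy with the performance of RcppArmadillo code when I needed row/column subsets.

You can instantiate Armadillo matrices via Rcpp about as efficiently as Rcpp matrices (no copy, reusing the R object's memory), so I would try that.

As for your question: yes, you want clone() to get a distinct copy. I think you get one for free if you use the default RcppArmadillo constructor (as opposed to the more efficient two-step approach).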

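Edit, a few hours later

You left open the question of why your Armadillo code was slow. In the meantime, Vincent solved the median problem for you, but here is a revisited, cleaner solution using your code as well as Vincent's.

Note how it instantiates the Armadillo matrix without a copy, which makes it faster. It also avoids mixing integer and numeric matrices. The code first: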

#include <RcppArmadillo.h> 

using namespace Rcpp;

// [[Rcpp::depends(RcppArmadillo)]]

// [[Rcpp::export]]
NumericVector colMedianRcpp(NumericMatrix x) {
    int nrow = x.nrow();
    int ncol = x.ncol();
    int position = nrow / 2; // integer division: index of the (upper) middle element
    NumericVector out(ncol);
    for (int j = 0; j < ncol; j++) { 
        NumericVector y = x(_,j); // copy the column; the original is not modified
        std::nth_element(y.begin(), y.begin() + position, y.end()); 
        out[j] = y[position];  
    }
    return out;
}

// [[Rcpp::export]]
arma::rowvec colMeansRcppArmadillo(NumericMatrix x){
    arma::mat X = arma::mat(x.begin(), x.nrow(), x.ncol(), false); 
    return arma::mean(X, 0); 
}

// [[Rcpp::export]]
NumericVector colMeansRcpp(NumericMatrix X) {
    int ncol = X.ncol();
    int nrow = X.nrow(); 
    Rcpp::NumericVector out(ncol);
    for (int col = 0; col < ncol; col++){
        out[col]=Rcpp::sum(X(_, col)); 
    } 
    return wrap(out/nrow);
} 

/*** R
set.seed(42)
X <- matrix(rnorm(100*10), 100, 10)
library(microbenchmark)

mb <- microbenchmark(colMeans(X), colMeansRcpp(X), colMeansRcppArmadillo(X),
                     colMedianRcpp(X), times=50)  
print(mb)
*/

And the result:

R> sourceCpp("/tmp/stephen.cpp") 
R> set.seed(42)
R> X <- matrix(rnorm(100*10), 100, 10)
R> library(microbenchmark)
R> mb <- microbenchmark(colMeans(X), colMeansRcpp(X), colMeansRcppArmadillo(X),
+                      colMedianRcpp(X), times=50) 
R> print(mb)
Unit: microseconds
                     expr    min     lq  median     uq    max neval
              colMeans(X)  9.469 10.422 11.5810 12.421 30.597    50 
          colMeansRcpp(X)  3.922  4.281  4.5245  5.306 18.020    50 
 colMeansRcppArmadillo(X)  4.196  4.549  4.9295  5.927 11.159    50 
         colMedianRcpp(X) 15.615 16.291 16.7290 17.971 27.026    50 
R>
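The no-copy idea can be illustrated without Rcpp or Armadillo. The sketch below is only a plain C++ analogy, not how either library is implemented: `ColMajorView` and `col_means` are hypothetical names, and the view simply borrows a pointer to column-major storage (the layout R uses for matrices) instead of copying it. In the answer's code, the final `false` passed to the `arma::mat` constructor plays the comparable role of asking Armadillo to use the existing memory directly.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical non-owning view over column-major storage.
struct ColMajorView {
    const double* data;  // borrowed pointer; the view never copies or frees it
    std::size_t nrow;
    std::size_t ncol;
};

// Column means computed directly on the borrowed memory.
std::vector<double> col_means(const ColMajorView& m) {
    std::vector<double> out(m.ncol, 0.0);
    if (m.nrow == 0) return out;  // nothing to average; avoid dividing by zero
    for (std::size_t j = 0; j < m.ncol; ++j) {
        const double* col = m.data + j * m.nrow;  // start of column j
        double sum = 0.0;                         // accumulate in double
        for (std::size_t i = 0; i < m.nrow; ++i) sum += col[i];
        out[j] = sum / static_cast<double>(m.nrow);
    }
    return out;
}
```

Because the view holds a `const` pointer, a read-only computation such as a mean is safe on borrowed memory. An in-place algorithm such as a sort would need its own copy.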

You can copy a column into a new vector with

NumericVector y = x(_,j);

Complete example:

library(Rcpp)
cppFunction('
  NumericVector colMedianRcpp(NumericMatrix x) {
    int nrow = x.nrow();
    int ncol = x.ncol();
    int position = nrow / 2; // integer division: index of the (upper) middle element
    NumericVector out(ncol);
    for (int j = 0; j < ncol; j++) {
      NumericVector y = x(_,j); // copy the column; the original is not modified
      std::nth_element(y.begin(), y.begin() + position, y.end());
      out[j] = y[position];
    }
    return out;
  }
')
x <- matrix( sample(1:12), 3, 4 )
x
colMedianRcpp(x)
x   # Unchanged
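One detail worth noting: the example above takes the element at index `nrow / 2`, which for an even number of rows is the upper of the two middle values, whereas R's `median()` averages the two. The following is a minimal sketch in plain C++ (no Rcpp) of the same copy-then-`std::nth_element` approach with that case handled. The name `col_medians` is hypothetical, and the code assumes column-major data, at least one row, and no NaN values.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Median of each column of a column-major matrix stored in `data`.
// Hypothetical helper; assumes nrow > 0 and no NaN values.
std::vector<double> col_medians(const std::vector<double>& data,
                                std::size_t nrow, std::size_t ncol) {
    std::vector<double> out(ncol);
    const std::size_t mid = nrow / 2;
    for (std::size_t j = 0; j < ncol; ++j) {
        // Copy column j so the partial sort does not reorder the input.
        const double* first = data.data() + j * nrow;
        std::vector<double> y(first, first + nrow);
        std::nth_element(y.begin(), y.begin() + mid, y.end());
        const double hi = y[mid];
        if (nrow % 2 == 1) {
            out[j] = hi;
        } else {
            // After nth_element, everything before `mid` is <= y[mid],
            // so the lower middle value is the maximum of that prefix.
            const double lo = *std::max_element(y.begin(), y.begin() + mid);
            out[j] = (lo + hi) / 2.0;
        }
    }
    return out;
}
```

For a column holding 3, 1, 2, 4 this returns 2.5, and the input vector is left in its original order because only the copy is rearranged.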

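Hi. No, forgive me, I wasn't clear that that was not the Armadillo code. I have edited the question to show it along with the benchmarks. Am I doing something dumb that makes the Armadillo code particularly slow? My point about clone() was whether I am right to make a copy in this example, so that I don't alter the R object if I sort the column in place.

This is also very useful, as the Armadillo way is certainly more concise. And about as fast... when you do it right. I have read through your excellent Rcpp Gallery, but obviously I have not grasped some of the details yet. Many thanks.

If you only need the median or another quantile, you do not have to sort the whole array: std::nth_element works like std::sort, but stops sorting once the requested element is in its final position.

Nice comment (as usual) by Vincent. In fact, we have an Rcpp Gallery article about exactly this.

Thanks, I didn't know about std::nth_element, and the Rcpp Gallery article does a good job of showing how to use the STL to sort, partially sort, etc. a vector... but I still don't know the syntax for sorting a matrix column (if it exists).

Thanks. Following your hint about nth_element, this is what I did. I had just talked myself into the idea that, since there is sugar to express columns and rows, maybe there was also sugar to apply STL algorithms to them without copying. Obviously that would make everything too easy ;)

@StephenHenderson You have to make a copy for this kind of thing. Because std::nth_element and std::sort work in place, they would change the original data if you did not use a copy. You would not want that.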