Why is my Rcpp implementation for finding the number of unique items slower than base R? (C++, R, Rcpp)

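I am trying to write a function that counts the number of unique items in a string vector (my actual problem is slightly more complex, but this is reproducible). I based it on an answer I found for C++. My code consists of two C++ functions, `unique_sort` (sort, then `std::unique`) and `unique_set` (build a `std::unordered_set`), which I benchmark against `length(unique(x))`; the benchmark and its timings are listed further down.

Looking at the R source code for the `unique` function (which is a bit hard to follow), it seems to loop over the array, adding unique elements to a hash and checking whether each element is already in that hash.

So I thought it should be equivalent to the unordered_set approach. I don't understand why the unordered_set approach is 5 times slower.

TL;DR: Why is my C++ code slow?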
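The two strategies compared in the question can be sketched in standalone C++ (no Rcpp). This is only an illustration of the hashing loop and the sort-based alternative as described above, not R's actual implementation; the function names `count_unique_hash` and `count_unique_sort` are made up for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hash-based count: one pass over the input, inserting each element into a
// hash set. insert() leaves the set unchanged when the key is already present.
std::size_t count_unique_hash(const std::vector<std::string>& x) {
    std::unordered_set<std::string> seen;
    seen.reserve(x.size());  // avoid rehashing while the set grows
    for (const std::string& s : x) {
        seen.insert(s);
    }
    return seen.size();
}

// Sort-based count: sort a copy (x is taken by value), then let std::unique
// move the distinct elements to the front and measure that prefix.
std::size_t count_unique_sort(std::vector<std::string> x) {
    std::sort(x.begin(), x.end());
    const std::size_t n =
        static_cast<std::size_t>(std::unique(x.begin(), x.end()) - x.begin());
    assert(n <= x.size());  // the distinct prefix can never exceed the input
    return n;
}
```

Both functions return the same count for the same input. The hash version does expected linear work, while the sort version pays for an O(n log n) sort of string comparisons, which is consistent with `unique_sort` being the slowest entry in the timings.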

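First, please make examples reproducible. The code above is missing the Rcpp attributes, the C++11 plugin, and the necessary header includes.

Second, the problem shown here is the overhead of performing a deep copy of the data from R into a C++ structure. Most of the time in your benchmark is spent in that copy. It is triggered by the choice of `std::vector<std::string>` instead of `Rcpp::CharacterVector`, which holds a `SEXP`, an S-expression, that is, a pointer to the data. By bypassing the proxy model offered by Rcpp objects, which performs only shallow copies, the cost of importing the data into C++ becomes far larger than the mere microseconds of overhead described elsewhere.

With that said, let's talk about how to modify the example above to use Rcpp objects. First, note that Rcpp objects have a member function called `.sort()` that correctly sorts an `Rcpp::CharacterVector` containing missing values (this assumes no capitalization or special locale). Second, the `SEXP` type can be used as the key type of the `std::unordered_set` even when the data is imported as an `Rcpp::CharacterVector`, so the set stores pointers rather than copies of the strings. These modifications can be found in the C++ functions with "native" in their names.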

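#include <Rcpp.h>
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// [[Rcpp::plugins(cpp11)]]

// Deep copy: converting the R character vector to std::vector<std::string>
// copies every string before any work is done.
// [[Rcpp::export]]
int unique_sort(std::vector<std::string> x) {
  std::sort(x.begin(), x.end());
  return std::unique(x.begin(), x.end()) - x.begin();
}

// Deep copy again, then every std::string is hashed and copied into the set.
// [[Rcpp::export]]
int unique_set(std::vector<std::string> x) {
  std::unordered_set<std::string> tab(x.begin(), x.end());
  return tab.size();
}

// Shallow copy: x is a proxy for the R object. Note that this means sort()
// reorders the caller's vector in place; use Rcpp::clone(x) to avoid that.
// [[Rcpp::export]]
int unique_sort_native(Rcpp::CharacterVector x) {
  x.sort();
  return std::unique(x.begin(), x.end()) - x.begin();
}

// Shallow copy: only the SEXP pointers of the elements are hashed.
// [[Rcpp::export]]
int unique_set_native(Rcpp::CharacterVector x) {
  std::unordered_set<SEXP> tab(x.begin(), x.end());
  return tab.size();
}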

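As a result, when Rcpp objects are used to avoid the deep copy, the `unique_set_native` function beats the `length(unique())` call by about 30 milliseconds on the minimum timing (roughly 59 ms on the median) in the benchmark results listed at the end.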
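Why hashing `SEXP` values gives the right count is not spelled out above. As far as I understand R's internals, character strings are kept in a global cache, so two equal strings (in the same encoding) are represented by the same `CHARSXP` pointer, and comparing pointers is then equivalent to comparing contents. The standalone sketch below mimics that situation without Rcpp. `StringPool`, `count_by_value`, and `count_by_pointer` are hypothetical names invented for this illustration; they are not part of R or Rcpp.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an interned string pool: every distinct string is
// stored once, and equal strings are represented by the same pointer.
class StringPool {
public:
    const std::string* intern(const std::string& s) {
        // std::unordered_set never relocates its elements on rehash,
        // so the returned address stays valid while the pool lives.
        const std::string* p = &*pool_.insert(s).first;
        assert(*p == s);
        return p;
    }

private:
    std::unordered_set<std::string> pool_;
};

// Deep-copy route: each element is copied into a new std::string and hashed
// character by character (analogous to unique_set).
std::size_t count_by_value(const std::vector<const std::string*>& x) {
    std::unordered_set<std::string> tab;
    for (const std::string* p : x) {
        tab.insert(*p);
    }
    return tab.size();
}

// Shallow route: only the pointers are hashed and compared, and no string is
// copied (analogous to unique_set_native with std::unordered_set<SEXP>).
std::size_t count_by_pointer(const std::vector<const std::string*>& x) {
    std::unordered_set<const std::string*> tab(x.begin(), x.end());
    return tab.size();
}
```

Both functions return the same count as long as every element comes from the same pool. The pointer version does a fixed amount of work per element regardless of string length, which is the kind of saving the "native" functions rely on.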

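What would "highly optimized compiled code" do to achieve this? Rcpp is already compiled. Also, other functions can be written to perform the same as compiled R code (for example, the tabulate function). I feel like I am missing something algorithmic that makes it 5 times slower.

Dirk explains it better. I think this is a similar situation.

In that post, the difference due to overhead is measured in nanoseconds, which is not particularly relevant here, where the difference is seconds. I can show you other R functions where the difference is very small.

I know this doesn't answer your question, but here is an Rcpp function that is 5 times faster than base R: `int unique_size(CharacterVector x) { return Rcpp::unique(x).size(); }`. I have been looking through the source code but couldn't find anything.

That is indeed interesting. Thanks, I appreciate it. Hopefully someone can answer the "why" question :)

Could you add a base R approach to the benchmark that calls the S3 method directly instead of the S3 generic? It would be interesting to see whether Rcpp is still faster if you avoid method dispatch (I'd expect not).

@Roland, removing the dispatch doesn't appear to do much in this case. I have increased the number of evaluations to investigate further... Potentially, the extra time difference can be attributed to the `length()` call in R?

`length` is also a generic. But that doesn't seem to fully explain the difference.

Great answer, once again! Thank you, I really appreciate the explanation and the code examples!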

# Benchmark from the question
library(microbenchmark)
x <- paste0("x", sample(1:1e5, 1e7, replace = TRUE))
microbenchmark(length(unique(x)), unique_sort(x), unique_set(x), times = 3)

Unit: milliseconds
              expr        min         lq       mean     median         uq
 length(unique(x))   365.0213   373.4018   406.0209   381.7823   426.5206
    unique_sort(x) 10732.1918 10847.0532 10907.6882 10961.9146 10995.4363
     unique_set(x)  1948.6517  2230.3383  2334.4040  2512.0249  2527.2802

# install.packages(c("microbenchmark"))

# Note, it is more efficient to supply an integer rather than a vector
# in sample()'s first parameter.
x <- paste0("x", sample(1e5, 1e7, replace = TRUE))

# Run a microbenchmark
microbenchmark::microbenchmark(
  length(unique(x)),
  length(unique.default(x)),
  unique_sort(x),
  unique_set(x),
  unique_sort_native(x),
  unique_set_native(x),
  times = 10
)

Unit: milliseconds
                      expr     min      lq    mean  median      uq     max neval
         length(unique(x))   208.0   235.3   235.7   237.2   240.2   247.4    10
 length(unique.default(x))   230.9   232.8   238.8   233.7   241.8   266.6    10
            unique_sort(x) 12759.4 12877.1 12993.8 12920.1 13043.2 13416.7    10
             unique_set(x)  2528.1  2545.3  2590.1  2590.3  2631.3  2670.1    10
     unique_sort_native(x)  7452.6  7482.4  7568.5  7509.0  7563.6  7917.8    10
      unique_set_native(x)   175.8   176.9   179.2   178.3   182.3   183.4    10
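
To put the table in relative terms, the medians can be divided by the base R median. The small sketch below does that arithmetic. The numbers are copied from the median column above, while `Timing`, `benchmark_medians`, and `relative_to_baseline` are names invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One row of the benchmark table: the expression and its median time in ms.
struct Timing {
    std::string expr;
    double median_ms;
};

// Median timings copied from the benchmark table above.
std::vector<Timing> benchmark_medians() {
    return {
        {"length(unique(x))", 237.2},
        {"length(unique.default(x))", 233.7},
        {"unique_sort(x)", 12920.1},
        {"unique_set(x)", 2590.3},
        {"unique_sort_native(x)", 7509.0},
        {"unique_set_native(x)", 178.3},
    };
}

// Ratio of each median to the first row (the base R baseline).
// Values above 1 are slower than base R, values below 1 are faster.
std::vector<double> relative_to_baseline(const std::vector<Timing>& rows) {
    std::vector<double> out;
    if (rows.empty()) return out;
    const double baseline = rows.front().median_ms;
    assert(baseline > 0.0);
    for (const Timing& r : rows) {
        out.push_back(r.median_ms / baseline);
    }
    return out;
}
```

With these medians, `unique_sort` takes roughly 54 times as long as `length(unique(x))`, `unique_set` roughly 11 times, `unique_sort_native` roughly 32 times, and `unique_set_native` about 0.75 times, so only the pointer-hashing variant is faster than base R.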