Performance: why is std::pair faster than std::tuple?
Here is the test code.

Tuple test:
#include <vector>
#include <tuple>

using namespace std;

int main() {
    vector<tuple<int, int>> v;
    for (int var = 0; var < 100000000; ++var) {
        v.push_back(make_tuple(var, var));
    }
}
I would like to know why there is such a big difference between these two data structures at -O0, given that they should be very similar, while at -O2 there is only a small difference.

Why is the difference at -O0 so large, and why is there any difference at all?
Edit:

Timings for the code with v.resize(), pair vs. tuple, are in the table at the end of this page.
Edit:

My system:
g++ (GCC) 4.8.3 20140911 (Red Hat 4.8.3-7)
GLIBCXX_3.4.19
You are missing some crucial information: What compiler do you use? What do you use to measure the performance of your microbenchmark? Which standard library implementation do you use?

My system:
g++ (GCC) 4.9.1 20140903 (prerelease)
GLIBCXX_3.4.20
Anyhow, I ran your examples, but reserved the appropriate size of the vectors first to get rid of the memory-allocation overhead. With that, interestingly, I observe the opposite of what you see:
g++ -std=c++11 -O2 pair.cpp -o pair
perf stat -r 10 -d ./pair
Performance counter stats for './pair' (10 runs):
1647.045151 task-clock:HG (msec) # 0.993 CPUs utilized ( +- 1.94% )
346 context-switches:HG # 0.210 K/sec ( +- 40.13% )
7 cpu-migrations:HG # 0.004 K/sec ( +- 22.01% )
182,978 page-faults:HG # 0.111 M/sec ( +- 0.04% )
3,394,685,602 cycles:HG # 2.061 GHz ( +- 2.24% ) [44.38%]
2,478,474,676 stalled-cycles-frontend:HG # 73.01% frontend cycles idle ( +- 1.24% ) [44.55%]
1,550,747,174 stalled-cycles-backend:HG # 45.68% backend cycles idle ( +- 1.60% ) [44.66%]
2,837,484,461 instructions:HG # 0.84 insns per cycle
# 0.87 stalled cycles per insn ( +- 4.86% ) [55.78%]
526,077,681 branches:HG # 319.407 M/sec ( +- 4.52% ) [55.82%]
829,623 branch-misses:HG # 0.16% of all branches ( +- 4.42% ) [55.74%]
594,396,822 L1-dcache-loads:HG # 360.887 M/sec ( +- 4.74% ) [55.59%]
20,842,113 L1-dcache-load-misses:HG # 3.51% of all L1-dcache hits ( +- 0.68% ) [55.46%]
5,474,166 LLC-loads:HG # 3.324 M/sec ( +- 1.81% ) [44.23%]
<not supported> LLC-load-misses:HG
1.658671368 seconds time elapsed ( +- 1.82% )
And for the tuple:
Performance counter stats for './tuple' (10 runs):
1018.980969 task-clock:HG (msec) # 0.999 CPUs utilized ( +- 0.47% )
8 context-switches:HG # 0.008 K/sec ( +- 29.74% )
3 cpu-migrations:HG # 0.003 K/sec ( +- 42.64% )
195,396 page-faults:HG # 0.192 M/sec ( +- 0.00% )
2,103,574,740 cycles:HG # 2.064 GHz ( +- 0.30% ) [44.28%]
1,088,827,212 stalled-cycles-frontend:HG # 51.76% frontend cycles idle ( +- 0.47% ) [44.56%]
697,438,071 stalled-cycles-backend:HG # 33.15% backend cycles idle ( +- 0.41% ) [44.76%]
3,305,631,646 instructions:HG # 1.57 insns per cycle
# 0.33 stalled cycles per insn ( +- 0.21% ) [55.94%]
675,175,757 branches:HG # 662.599 M/sec ( +- 0.16% ) [56.02%]
656,205 branch-misses:HG # 0.10% of all branches ( +- 0.98% ) [55.93%]
475,532,976 L1-dcache-loads:HG # 466.675 M/sec ( +- 0.13% ) [55.69%]
19,430,992 L1-dcache-load-misses:HG # 4.09% of all L1-dcache hits ( +- 0.20% ) [55.49%]
5,161,624 LLC-loads:HG # 5.065 M/sec ( +- 0.47% ) [44.14%]
<not supported> LLC-load-misses:HG
1.020225388 seconds time elapsed ( +- 0.48% )
So keep in mind: -flto is your friend, and failed inlining can have extreme results on heavily templated code. Use perf stat to find out what is actually happening.

milianw did not address the difference between -O0 and -O2, so I would like to add an explanation for that:
It is entirely to be expected that std::tuple will be slower than std::pair when not optimized, because it is a more complicated object. A pair has exactly two members, so its methods are straightforward to define. But a tuple has an arbitrary number of members, and the only way to iterate over a template parameter list is with recursion. Hence most of tuple's functions handle one member and then recurse to handle the rest, so for a 2-tuple you have twice as many function calls.

Now, when optimization is turned on, the compiler inlines that recursion and there is no significant difference, which the tests clearly confirm. The same applies to heavily templated code in general: templates can be written to provide abstraction with little or no runtime overhead, but that relies on the optimizer inlining all the trivial functions.

Comments:

I think you should do v.reserve(100000000) before the loop in both cases, to make it a more accurate test.

Measuring performance at -O0 is pointless: just compare optimized code when benchmarking.

For performance measurements you should run and measure the program not just once but several times, and take the average (or median) of the measured runtimes; otherwise a stray system call can perturb your measurement non-deterministically. So first make sure you are measuring time correctly. I also wonder whether -O3 would make a difference. PS: I had assumed pair was the slower one because it might be implemented in terms of tuple and happen to hit an inlining depth limit.

Wrong assumption: std::pair is required to have two actual data members, called first and second, so it cannot be implemented in terms of anything else. Besides, pair appeared in C++ before tuple did, and who would implement a simple structure with a more complicated one?

@LưuVĩnhPhúc pair could still be implemented using tuple. I suppose code gets rewritten all the time; C++ compilers, for instance, are written in C++ :)

I would have answered along the same lines (arbitrary number of arguments), but then I saw that std::tuple has an extra pair of constructors ((5)), which should avoid iterating over the parameter list.

@SyntotuaRez: Constructors are not even used here, and even they could not avoid the recursion, since the structure of the tuple itself is recursive.

Is that required by the language? I would have thought the standard might specify std::tuple that way, but an implementation can do what it wants. For example, for all common (
Timing results from the question (code with v.resize()):

| | -O0 | -O2 |
|:------|:-------:|:--------:|
| Pair | 5.01 s | 0.77 s |
| Tuple | 10.6 s | 0.87 s |
g++ -std=c++11 -O2 tuple.cpp -o tuple
perf stat -r 10 -d ./tuple
Performance counter stats for './tuple' (10 runs):
996.090514 task-clock:HG (msec) # 0.996 CPUs utilized ( +- 2.41% )
102 context-switches:HG # 0.102 K/sec ( +- 64.61% )
4 cpu-migrations:HG # 0.004 K/sec ( +- 32.24% )
181,701 page-faults:HG # 0.182 M/sec ( +- 0.06% )
2,052,505,223 cycles:HG # 2.061 GHz ( +- 2.22% ) [44.45%]
1,212,930,513 stalled-cycles-frontend:HG # 59.10% frontend cycles idle ( +- 2.94% ) [44.56%]
621,104,447 stalled-cycles-backend:HG # 30.26% backend cycles idle ( +- 3.48% ) [44.69%]
2,700,410,991 instructions:HG # 1.32 insns per cycle
# 0.45 stalled cycles per insn ( +- 1.66% ) [55.94%]
486,476,408 branches:HG # 488.386 M/sec ( +- 1.70% ) [55.96%]
959,651 branch-misses:HG # 0.20% of all branches ( +- 4.78% ) [55.82%]
547,000,119 L1-dcache-loads:HG # 549.147 M/sec ( +- 2.19% ) [55.67%]
21,540,926 L1-dcache-load-misses:HG # 3.94% of all L1-dcache hits ( +- 2.73% ) [55.43%]
5,751,650 LLC-loads:HG # 5.774 M/sec ( +- 3.60% ) [44.21%]
<not supported> LLC-load-misses:HG
1.000126894 seconds time elapsed ( +- 2.47% )
With -flto added, the pair run becomes nearly identical to the tuple:

Performance counter stats for './pair' (10 runs):
1021.922944 task-clock:HG (msec) # 0.997 CPUs utilized ( +- 1.15% )
63 context-switches:HG # 0.062 K/sec ( +- 77.23% )
5 cpu-migrations:HG # 0.005 K/sec ( +- 34.21% )
195,396 page-faults:HG # 0.191 M/sec ( +- 0.00% )
2,109,877,147 cycles:HG # 2.065 GHz ( +- 0.92% ) [44.33%]
1,098,031,078 stalled-cycles-frontend:HG # 52.04% frontend cycles idle ( +- 0.93% ) [44.46%]
701,553,535 stalled-cycles-backend:HG # 33.25% backend cycles idle ( +- 1.09% ) [44.68%]
3,288,420,630 instructions:HG # 1.56 insns per cycle
# 0.33 stalled cycles per insn ( +- 0.88% ) [55.89%]
672,941,736 branches:HG # 658.505 M/sec ( +- 0.80% ) [56.00%]
660,278 branch-misses:HG # 0.10% of all branches ( +- 2.05% ) [55.93%]
474,314,267 L1-dcache-loads:HG # 464.139 M/sec ( +- 1.32% ) [55.73%]
19,481,787 L1-dcache-load-misses:HG # 4.11% of all L1-dcache hits ( +- 0.80% ) [55.51%]
5,155,678 LLC-loads:HG # 5.045 M/sec ( +- 1.69% ) [44.21%]
<not supported> LLC-load-misses:HG
1.025083895 seconds time elapsed ( +- 1.03% )