C 一个线程计数,另一个线程执行任务和测量
我想实现一个2线程模型,其中1是计数(无限增加一个值),另一个是记录第一个计数器,完成工作,记录第二个记录,并测量两个计数器之间经过的时间 以下是我迄今为止所做的工作:C 一个线程计数,另一个线程执行任务和测量,c,multithreading,gcc,inline-assembly,cpu-registers,C,Multithreading,Gcc,Inline Assembly,Cpu Registers,我想实现一个2线程模型,其中1是计数(无限增加一个值),另一个是记录第一个计数器,完成工作,记录第二个记录,并测量两个计数器之间经过的时间 以下是我迄今为止所做的工作: // global counter register unsigned long counter asm("r13"); // unsigned long counter; void* counter_thread(){ // affinity is set to some isolated CPU so the no
// global counter
register unsigned long counter asm("r13");
// unsigned long counter;
void* counter_thread(){
// affinity is set to some isolated CPU so the noise will be minimal
while(1){
//counter++; // Line 1*
asm volatile("add $1, %0" : "+r"(counter) : ); // Line 2*
}
}
void* measurement_thread(){
// affinity is set somewhere over here
unsigned long meas = 0;
unsigned long a = 5;
unsigned long r1,r2;
sleep(1.0);
while(1){
mfence();
r1 = counter;
a *=3; // dummy operation that I want to measure
r2 = counter;
mfence();
meas = r2-r1;
printf("counter:%ld \n", counter);
break;
}
}
让我解释一下我到目前为止所做的工作:
// global counter
register unsigned long counter asm("r13");
// unsigned long counter;
void* counter_thread(){
// affinity is set to some isolated CPU so the noise will be minimal
while(1){
//counter++; // Line 1*
asm volatile("add $1, %0" : "+r"(counter) : ); // Line 2*
}
}
void* measurement_thread(){
// affinity is set somewhere over here
unsigned long meas = 0;
unsigned long a = 5;
unsigned long r1,r2;
sleep(1.0);
while(1){
mfence();
r1 = counter;
a *=3; // dummy operation that I want to measure
r2 = counter;
mfence();
meas = r2-r1;
printf("counter:%ld \n", counter);
break;
}
}
因为我希望计数器是准确的,所以我将亲和性设置为一个独立的CPU。此外,如果我使用第1*行中的计数器,解串功能将为:
d4c: 4c 89 e8 mov %r13,%rax
d4f: 48 83 c0 01 add $0x1,%rax
d53: 49 89 c5 mov %rax,%r13
d56: eb f4 jmp d4c <counter_thread+0x37>
它可以工作,但它没有给我我想要的粒度
编辑:
我的环境:
- CPU:AMD Ryzen 3600
- 内核:5.0.0-32-generic
- 操作系统:Ubuntu 18.04
register unsigned long counter asm("r13");
结果2:测量线程始终读取0。在反汇编代码中,我可以看到两者都处理R13寄存器(计数器),然而,我相信它并不是以某种方式共享的
实验3:
设置:与设置2相同,除了在计数器线程中,我没有执行计数器++,而是执行一个内联程序集,以确保执行1个周期的操作。我的反汇编文件如下所示:
cd1: 49 83 c5 01 add $0x1,%r13
cd5: eb fa jmp cd1 <counter_thread+0x37>
cd1:49 83 c5 01添加$0x1,%r13
cd5:eb-fa jmp cd1
Results3:测量线程的读数如上所述为0。每个线程都有自己的寄存器。每个逻辑CPU内核都有自己的体系结构寄存器,线程在内核上运行时使用这些寄存器。只有信号处理程序(或裸机上的中断)可以修改其线程的寄存器 声明一个类似您的
。。。多线程程序中的asm(“r13”)
有效地提供了线程本地存储,而不是真正的共享全局存储
线程之间只共享内存,不共享寄存器。这就是为什么多个线程可以同时运行,而不需要彼此踩在一起,每个线程都使用它们的寄存器
编译器可以自由使用未声明为register global的寄存器,因此在内核之间共享它们根本不起作用。(而且GCC无法根据您如何声明它们来让它们共享或私有。)
除此之外,寄存器全局不是易失性
或原子性
sor1=计数器代码>和r2=计数器
can CSE所以r2-r1
是编译时常量零,即使您的本地R13正在从信号处理程序更改
如何确保两个线程都在使用寄存器进行计数器值的读/写操作
您不能这样做。内核之间没有共享状态,可以以比缓存更低的延迟进行读/写
如果你想计时,考虑使用,或者<代码> RDPMC < /代码>来读取一个性能计数器(你可能已经设置为计算核心时钟周期)。
使用另一个线程来增加计数器是不必要的,也没有帮助,因为没有非常低的开销从另一个内核读取内容
我机器中的rdtscp指令给出36-72-108。。。循环分辨率最好。因此,我无法区分2个循环和35个循环之间的区别,因为它们都会给出36个循环
那么您使用的rdtsc
是错误的。它不是串行化的,因此需要在定时区域周围使用lfence
。请看我的答案。但是,是的,rdtsc
是昂贵的,rdpmc
的开销只是稍微低一些
但更重要的是,您无法有效地测量a*=3代码>以C表示,以周期为单位表示单个成本
。首先,它可以根据上下文进行不同的编译
但是假设一个正常的lea eax,[rax+rax*2]
,一个现实的指令成本模型有三个维度:uop计数(前端)、后端端口压力和从输入到输出的延迟
有关单个指令计时的更多信息,请参见上的答案。以不同的方式将其放入一个循环中,以测量吞吐量和延迟,并查看性能计数器以获得UOP->ports。或者看看Agner Fog的指导表,因为人们已经做了这些测试
也
同样,这是一条asm指令的计时方式,而不是C语句。启用优化后,C语句的成本可能取决于它如何优化周围的代码。(和/或像所有现代x86 CPU一样,在无序执行CPU上,周围操作的延迟是否隐藏了成本。)
那么你用错了rdtsc。它不是序列化的,所以你需要lfence
在定时区域周围。请参阅我关于如何获得CPU周期的答案
从C++中计算x8664。但是,是的,rdtsc是昂贵的,而rdpmc是昂贵的
只是稍微降低了一些开销
嗯。我做了家庭作业
第一件事。我知道rdtscp
是序列化指令
register unsigned long counter asm("r13");
cd1: 49 83 c5 01 add $0x1,%r13
cd5: eb fa jmp cd1 <counter_thread+0x37>
======================= AMD RYZEN EXPERIMENTS =========================
RYZEN 3600
100_000 iteration
Using a *=3
Not that, almost all sums are divisible by 36, which is my machine's timer resolution.
I also checked where the sums are not divisible by 36.
This is the case where I don't use fence instructions with rdtsc.
It turns out that the read value is either 35, or 1,
which I believe the instruction(rdtsc) cannot read the value correctly.
Mfenced rtdscP reads:
Sum: 25884432
Avg: 258
Sum, removed outliers: 25800120
Avg, removed outliers: 258
Mfenced rtdsc reads:
Sum: 17579196
Avg: 175
Sum, removed outliers: 17577684
Avg, removed outliers: 175
Lfenced rtdscP reads:
Sum: 7511688
Avg: 75
Sum, removed outliers: 7501608
Avg, removed outliers: 75
Lfenced rtdsc reads:
Sum: 7024428
Avg: 70
Sum, removed outliers: 7015248
Avg, removed outliers: 70
NOT fenced rtdscP reads:
Sum: 6024888
Avg: 60
Sum, removed outliers: 6024888
Avg, removed outliers: 60
NOT fenced rtdsc reads:
Sum: 3274866
Avg: 32
Sum, removed outliers: 3232913
Avg, removed outliers: 35
======================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum: 36217404
Avg: 362
Sum, removed outliers: 36097164
Avg, removed outliers: 361
Mfenced rtdsc reads:
Sum: 22973400
Avg: 229
Sum, removed outliers: 22939236
Avg, removed outliers: 229
Lfenced rtdscP reads:
Sum: 13178196
Avg: 131
Sum, removed outliers: 13177872
Avg, removed outliers: 131
Lfenced rtdsc reads:
Sum: 12631932
Avg: 126
Sum, removed outliers: 12631932
Avg, removed outliers: 126
NOT fenced rtdscP reads:
Sum: 12115548
Avg: 121
Sum, removed outliers: 12103236
Avg, removed outliers: 121
NOT fenced rtdsc reads:
Sum: 3335997
Avg: 33
Sum, removed outliers: 3305333
Avg, removed outliers: 35
=================== END OF AMD RYZEN EXPERIMENTS =========================
======================= AMD BULLDOZER EXPERIMENTS =========================
AMD A6-4455M
100_000 iteration
Using a *=3;
Mfenced rtdscP reads:
Sum: 32120355
Avg: 321
Sum, removed outliers: 27718117
Avg, removed outliers: 278
Mfenced rtdsc reads:
Sum: 23739715
Avg: 237
Sum, removed outliers: 23013028
Avg, removed outliers: 230
Lfenced rtdscP reads:
Sum: 14274916
Avg: 142
Sum, removed outliers: 13026199
Avg, removed outliers: 131
Lfenced rtdsc reads:
Sum: 11083963
Avg: 110
Sum, removed outliers: 10905271
Avg, removed outliers: 109
NOT fenced rtdscP reads:
Sum: 9361738
Avg: 93
Sum, removed outliers: 8993886
Avg, removed outliers: 90
NOT fenced rtdsc reads:
Sum: 4766349
Avg: 47
Sum, removed outliers: 4310312
Avg, removed outliers: 43
=================================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum: 38748536
Avg: 387
Sum, removed outliers: 36719312
Avg, removed outliers: 368
Mfenced rtdsc reads:
Sum: 35106459
Avg: 351
Sum, removed outliers: 33514331
Avg, removed outliers: 335
Lfenced rtdscP reads:
Sum: 23867349
Avg: 238
Sum, removed outliers: 23203849
Avg, removed outliers: 232
Lfenced rtdsc reads:
Sum: 21991975
Avg: 219
Sum, removed outliers: 21394828
Avg, removed outliers: 215
NOT fenced rtdscP reads:
Sum: 19790942
Avg: 197
Sum, removed outliers: 19701909
Avg, removed outliers: 197
NOT fenced rtdsc reads:
Sum: 10841074
Avg: 108
Sum, removed outliers: 10583085
Avg, removed outliers: 106
=================== END OF AMD BULLDOZER EXPERIMENTS =========================
======================= INTEL EXPERIMENTS =========================
INTEL 4710HQ
100_000 iteration
Using a *=3
Mfenced rtdscP reads:
Sum: 10914893
Avg: 109
Sum, removed outliers: 10820879
Avg, removed outliers: 108
Mfenced rtdsc reads:
Sum: 7866322
Avg: 78
Sum, removed outliers: 7606613
Avg, removed outliers: 76
Lfenced rtdscP reads:
Sum: 4823705
Avg: 48
Sum, removed outliers: 4783842
Avg, removed outliers: 47
Lfenced rtdsc reads:
Sum: 3634106
Avg: 36
Sum, removed outliers: 3463079
Avg, removed outliers: 34
NOT fenced rtdscP reads:
Sum: 2216884
Avg: 22
Sum, removed outliers: 1435830
Avg, removed outliers: 17
NOT fenced rtdsc reads:
Sum: 1736640
Avg: 17
Sum, removed outliers: 986250
Avg, removed outliers: 12
===================================================================
Using 3 dependent floating point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
Mfenced rtdscP reads:
Sum: 22008705
Avg: 220
Sum, removed outliers: 16097871
Avg, removed outliers: 177
Mfenced rtdsc reads:
Sum: 13086713
Avg: 130
Sum, removed outliers: 12627094
Avg, removed outliers: 126
Lfenced rtdscP reads:
Sum: 9882409
Avg: 98
Sum, removed outliers: 9753927
Avg, removed outliers: 97
Lfenced rtdsc reads:
Sum: 8854943
Avg: 88
Sum, removed outliers: 8435847
Avg, removed outliers: 84
NOT fenced rtdscP reads:
Sum: 7302577
Avg: 73
Sum, removed outliers: 7190424
Avg, removed outliers: 71
NOT fenced rtdsc reads:
Sum: 1726126
Avg: 17
Sum, removed outliers: 1029630
Avg, removed outliers: 12
=================== END OF INTEL EXPERIMENTS =========================
#define lfence() asm volatile("lfence\n");
#define mfence() asm volatile("mfence\n");
// reading the low end is enough for the measurement because I don't measure too complex result.
// For complex measurements, I need to shift and OR
#define rdtscp(_readval) asm volatile("rdtscp\n": "=a"(_readval)::"rcx", "rdx");
void rdtscp_doublemfence(){
uint64_t scores[MEASUREMENT_ITERATION] = {0};
printf("Mfenced rtdscP reads:\n");
initvars();
for(int i = 0; i < MEASUREMENT_ITERATION; i++){
mfence();
rdtscp(read1);
mfence();
calculation_to_measure();
mfence();
rdtscp(read2);
mfence();
scores[i] = read2-read1;
initvars();
}
calculate_sum_avg(scores);
}
9fd: 0f ae f0 mfence
a00: 0f 01 f9 rdtscp
a03: 48 89 05 36 16 20 00 mov %rax,0x201636(%rip) # 202040 <read1>
a0a: 0f ae f0 mfence
a0d: 8b 05 15 16 20 00 mov 0x201615(%rip),%eax # 202028 <a21>
a13: 83 c0 03 add $0x3,%eax #Either this or division operations for measurement
a16: 89 05 0c 16 20 00 mov %eax,0x20160c(%rip) # 202028 <a21>
a1c: 0f ae f0 mfence
a1f: 0f 01 f9 rdtscp
a22: 48 89 05 0f 16 20 00 mov %rax,0x20160f(%rip) # 202038 <read2>
a29: 0f ae f0 mfence
a2c: 48 8b 15 05 16 20 00 mov 0x201605(%rip),%rdx # 202038 <read2>
a33: 48 8b 05 06 16 20 00 mov 0x201606(%rip),%rax # 202040 <read1>
a3a: 48 29 c2 sub %rax,%rdx
a3d: 8b 85 ec ca f3 ff mov -0xc3514(%rbp),%eax
register unsigned long a21 asm("r13");
#define calculation_to_measure(){\
a21 +=3;\
}
#define initvars(){\
read1 = 0;\
read2 = 0;\
a21= 21;\
}
// =========== RDTSCP, double mfence ================
// Reference code, others are similar
void rdtscp_doublemfence(){
uint64_t scores[MEASUREMENT_ITERATION] = {0};
printf("Mfenced rtdscP reads:\n");
initvars();
for(int i = 0; i < MEASUREMENT_ITERATION; i++){
mfence();
rdtscp(read1);
mfence();
calculation_to_measure();
mfence();
rdtscp(read2);
mfence();
scores[i] = read2-read1;
initvars();
}
calculate_sum_avg(scores);
}
9ac: 0f ae f0 mfence
9af: 0f 01 f9 rdtscp
9b2: 48 89 05 7f 16 20 00 mov %rax,0x20167f(%rip) # 202038 <read1>
9b9: 0f ae f0 mfence
9bc: 4c 89 e8 mov %r13,%rax
9bf: 48 83 c0 03 add $0x3,%rax
9c3: 49 89 c5 mov %rax,%r13
9c6: 0f ae f0 mfence
9c9: 0f 01 f9 rdtscp
9cc: 48 89 05 5d 16 20 00 mov %rax,0x20165d(%rip) # 202030 <read2>
9d3: 0f ae f0 mfence
Mfenced rtdscP reads:
Sum: 32846796
Avg: 328
Sum, removed outliers: 32626008
Avg, removed outliers: 327
Mfenced rtdsc reads:
Sum: 18235980
Avg: 182
Sum, removed outliers: 18108180
Avg, removed outliers: 181
Lfenced rtdscP reads:
Sum: 14351508
Avg: 143
Sum, removed outliers: 14238432
Avg, removed outliers: 142
Lfenced rtdsc reads:
Sum: 11179368
Avg: 111
Sum, removed outliers: 10994400
Avg, removed outliers: 115
NOT fenced rtdscP reads:
Sum: 6064488
Avg: 60
Sum, removed outliers: 6064488
Avg, removed outliers: 60
NOT fenced rtdsc reads:
Sum: 3306394
Avg: 33
Sum, removed outliers: 3278450
Avg, removed outliers: 35
934: 0f ae f0 mfence
937: 0f 01 f9 rdtscp
93a: 48 89 05 f7 16 20 00 mov %rax,0x2016f7(%rip) # 202038 <read1>
941: 0f ae f0 mfence
944: 49 83 c5 03 add $0x3,%r13
948: 0f ae f0 mfence
94b: 0f 01 f9 rdtscp
94e: 48 89 05 db 16 20 00 mov %rax,0x2016db(%rip) # 202030 <read2>
955: 0f ae f0 mfence
Mfenced rtdscP reads:
Sum: 22819428
Avg: 228
Sum, removed outliers: 22796064
Avg, removed outliers: 227
Mfenced rtdsc reads:
Sum: 20630736
Avg: 206
Sum, removed outliers: 19937664
Avg, removed outliers: 199
Lfenced rtdscP reads:
Sum: 13375008
Avg: 133
Sum, removed outliers: 13374144
Avg, removed outliers: 133
Lfenced rtdsc reads:
Sum: 9840312
Avg: 98
Sum, removed outliers: 9774036
Avg, removed outliers: 97
NOT fenced rtdscP reads:
Sum: 8784684
Avg: 87
Sum, removed outliers: 8779932
Avg, removed outliers: 87
NOT fenced rtdsc reads:
Sum: 3274209
Avg: 32
Sum, removed outliers: 3255480
Avg, removed outliers: 36
a89: 0f ae f0 mfence
a8c: 0f 31 rdtsc
a8e: 48 89 05 a3 15 20 00 mov %rax,0x2015a3(%rip) # 202038 <read1>
a95: 0f ae f0 mfence
a98: 49 83 c5 03 add $0x3,%r13
a9c: 0f ae f0 mfence
a9f: 0f 31 rdtsc
aa1: 48 89 05 88 15 20 00 mov %rax,0x201588(%rip) # 202030 <read2>
aa8: 0f ae f0 mfence
Mfenced rtdscP reads:
Sum: 28041804
Avg: 280
Sum, removed outliers: 27724464
Avg, removed outliers: 277
Mfenced rtdsc reads:
Sum: 17936460
Avg: 179
Sum, removed outliers: 17931024
Avg, removed outliers: 179
Lfenced rtdscP reads:
Sum: 7110144
Avg: 71
Sum, removed outliers: 7110144
Avg, removed outliers: 71
Lfenced rtdsc reads:
Sum: 6691140
Avg: 66
Sum, removed outliers: 6672924
Avg, removed outliers: 66
NOT fenced rtdscP reads:
Sum: 5970888
Avg: 59
Sum, removed outliers: 5965236
Avg, removed outliers: 59
NOT fenced rtdsc reads:
Sum: 3402920
Avg: 34
Sum, removed outliers: 3280111
Avg, removed outliers: 35