C: implementing a caching technique in the closest-pair algorithm
I am trying to optimize the closest-pair algorithm with cache blocking and compare it against the non-cached program, but I am stuck. The main problem is that when I use loop blocking the performance gets worse: cached time is almost 2 x non-cached time. Changing the block size makes no difference. For the x,y coordinates I use an array P of point structs.

Here is the non-cached code:
    void compare_points_BF(int *N, point *P){
        int i, j, p1, p2;
        float dx, dy, distance=0, min_dist=inf();
        long calc = 0;
        for (i=0; i<(*N-1) ; i++){
            for (j=i+1; j<*N; j++){
                dx = P[i].x - P[j].x;
                dy = P[i].y - P[j].y;
                //calculate distance of current points
                distance = (dx * dx) + (dy * dy);
                calc++;
                if (distance < min_dist){
                    min_dist = distance;
                    p1 = i;
                    p2 = j;
                }
            }
        }
        printf("%ld calculations\t", calc);
    }
But with the cached version I get:
33550336 calculations Block_size = 128 N = 8192 Run time: 0.402 sec
33550336 calculations Block_size = 256 N = 8192 Run time: 0.383 sec
33550336 calculations Block_size = 512 N = 8192 Run time: 0.384 sec
33550336 calculations Block_size = 1024 N = 8192 Run time: 0.381 sec
33550336 calculations Block_size = 2048 N = 8192 Run time: 0.398 sec
33550336 calculations Block_size = 4096 N = 8192 Run time: 0.400 sec
33550336 calculations Block_size = 8192 N = 8192 Run time: 0.401 sec
33550336 calculations Block_size = 16384 N = 8192 Run time: 0.383 sec
134209536 calculations Block_size = 128 N = 16384 Run time: 1.579 sec
134209536 calculations Block_size = 256 N = 16384 Run time: 1.610 sec
134209536 calculations Block_size = 512 N = 16384 Run time: 1.630 sec
134209536 calculations Block_size = 1024 N = 16384 Run time: 1.530 sec
134209536 calculations Block_size = 2048 N = 16384 Run time: 1.537 sec
134209536 calculations Block_size = 4096 N = 16384 Run time: 1.562 sec
134209536 calculations Block_size = 8192 N = 16384 Run time: 1.520 sec
134209536 calculations Block_size = 16384 N = 16384 Run time: 1.626 sec
536854528 calculations Block_size = 128 N = 32768 Run time: 6.170 sec
536854528 calculations Block_size = 256 N = 32768 Run time: 6.207 sec
536854528 calculations Block_size = 512 N = 32768 Run time: 6.219 sec
536854528 calculations Block_size = 1024 N = 32768 Run time: 6.131 sec
536854528 calculations Block_size = 2048 N = 32768 Run time: 6.077 sec
536854528 calculations Block_size = 4096 N = 32768 Run time: 6.216 sec
536854528 calculations Block_size = 8192 N = 32768 Run time: 6.130 sec
536854528 calculations Block_size = 16384 N = 32768 Run time: 6.181 sec
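The calculation counts in these runs equal the number of unordered pairs among N points, N(N-1)/2, so the blocked loops perform the same number of distance evaluations regardless of block size. A minimal sketch to check that arithmetic follows; the helper name `pair_count` is hypothetical and does not appear in the original post.

```c
/* Number of unordered pairs among n points: n * (n - 1) / 2.
 * An all-pairs closest-pair search evaluates one distance per pair.
 * (pair_count is an illustrative name, not part of the original code.) */
long long pair_count(long long n)
{
    return n * (n - 1) / 2;
}
```

For example, `pair_count(8192)` evaluates to 33550336, the figure printed for N = 8192.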
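I have checked the code over and over and it seems correct. What am I missing?
Is the compiler optimizing the code to achieve better cache usage than what I am trying to implement?
Thanks in advance.

It has been a long time; I am posting this just to answer the question.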
    #include <float.h>   /* FLT_MAX */
    #include <math.h>    /* sqrt */

    /* N = number of points, B = block size.
       NOTE: p1 and p2 are passed by value, so the assignments below
       are not visible to the caller. */
    float compare_points_BF(register int N, register int B, point *P, register point *p1, register point *p2){
        register int i, j, ib, jb, iin, jjn, num_blocks = (N + (B-1)) / B;
        register float distance=0, min_dist=FLT_MAX, regx, regy;
        //break array data in N/B blocks
        for (i = 0; i < num_blocks; i++){
            for (j = i; j < num_blocks; j++){
                iin = ( ((i+1)*B) < N ? ((i+1)*B) : N);
                jjn = ( ((j+1)*B) < N ? ((j+1)*B) : N);
                //reads the moving frame block to compare with the i block
                for (jb = j * B; jb < jjn; jb++){
                    //Registers Allocated
                    regx = P[jb].x;
                    regy = P[jb].y;
                    //when i block == j block, start after jb to avoid redundant comparisons
                    for (ib = (i==j ? jb+1 : i*B); ib < iin; ib++){
                        //calculate squared distance of current points
                        if((distance = (P[ib].x - regx) * (P[ib].x - regx) +
                                       (P[ib].y - regy) * (P[ib].y - regy)) < min_dist){
                            min_dist = distance;
                            p1 = &P[ib];
                            p2 = &P[jb];
                        }
                    }
                }
            }
        }
        return sqrt(min_dist);
    }
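Before comparing timings, it may help to confirm that a blocked traversal visits exactly the same pairs as the plain double loop. The following is a minimal sketch of such a check, not the poster's code: the `point` layout (two floats) is an assumption because the post never shows the struct, and the names `dist2`, `closest_plain` and `closest_blocked` are hypothetical. Both functions return the squared minimum distance.

```c
#include <float.h>
#include <stddef.h>

/* ASSUMPTION: the post never shows the struct; two floats is a guess. */
typedef struct { float x, y; } point;

/* Squared distance between two points. */
static float dist2(const point *a, const point *b)
{
    float dx = a->x - b->x;
    float dy = a->y - b->y;
    return dx * dx + dy * dy;
}

/* Plain all-pairs search; returns the smallest squared distance,
 * or FLT_MAX when there are fewer than two points. */
float closest_plain(size_t n, const point *P)
{
    float best = FLT_MAX;
    for (size_t i = 0; i + 1 < n; i++)
        for (size_t j = i + 1; j < n; j++) {
            float d = dist2(&P[i], &P[j]);
            if (d < best) best = d;
        }
    return best;
}

/* Blocked all-pairs search with block size B (B must be >= 1). */
float closest_blocked(size_t n, size_t B, const point *P)
{
    float best = FLT_MAX;
    size_t num_blocks = (n + B - 1) / B;
    for (size_t i = 0; i < num_blocks; i++) {
        size_t i_end = (i + 1) * B < n ? (i + 1) * B : n;
        for (size_t j = i; j < num_blocks; j++) {
            size_t j_end = (j + 1) * B < n ? (j + 1) * B : n;
            for (size_t jb = j * B; jb < j_end; jb++) {
                /* Inside a diagonal block, start past jb so that
                 * self-pairs and already visited pairs are skipped. */
                size_t ib = (i == j) ? jb + 1 : i * B;
                for (; ib < i_end; ib++) {
                    float d = dist2(&P[ib], &P[jb]);
                    if (d < best) best = d;
                }
            }
        }
    }
    return best;
}
```

Calling both functions on the same array and comparing the results for several block sizes (including sizes that do not divide N and sizes larger than N) gives a quick sanity check before any benchmarking.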
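Comments:

Four nested loops compared with two, and there seem to be more expressions to evaluate in the "cached" version. No, this will not be faster.

@JoachimPileborg But how can I break the data into blocks without using extra for loops? The calculations are the same, though.

You only count iterations of the innermost loop as "calculations", but there are more expressions in the "cached" version. Your "uncached" version is simple, and simple is usually good. Even if you manage to squeeze out 1% or 2% of the run time, it is usually by making the algorithm more complex and therefore harder to understand.

What are you really trying to solve? Is it how to optimize the N-squared algorithm for the closest pair? It is only your assumption that caching will speed up the implementation. Have you considered other techniques such as loop unrolling or running on a GPU?

For the simple variant, the compiler (with optimization enabled) will keep P[i].x and P[i].y in registers and then sweep sequentially through the rest of the array, which is a very good access pattern and hard to beat. That holds only if you change the function to return min_dist or print the indices p1, p2; otherwise it just counts iterations.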
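The last comment can be illustrated with a version of the simple loop that returns its result, so an optimizing compiler has no freedom to discard the distance computation. This is a sketch under stated assumptions rather than the poster's code: the `point` layout is assumed to be two floats, and `pair_result` and `closest_pair_bf` are hypothetical names.

```c
#include <float.h>
#include <stddef.h>

/* ASSUMPTION: the post never shows the struct; two floats is a guess. */
typedef struct { float x, y; } point;

/* Hypothetical result type: indices of the closest pair and their
 * squared distance (FLT_MAX if fewer than two points were given). */
typedef struct { size_t a, b; float dist2; } pair_result;

pair_result closest_pair_bf(size_t n, const point *P)
{
    pair_result r = { 0, 0, FLT_MAX };
    for (size_t i = 0; i + 1 < n; i++) {
        /* The fixed point is copied into locals, which a compiler can
         * keep in registers while the inner loop walks the array
         * sequentially from i + 1 to the end. */
        const float xi = P[i].x;
        const float yi = P[i].y;
        for (size_t j = i + 1; j < n; j++) {
            float dx = xi - P[j].x;
            float dy = yi - P[j].y;
            float d = dx * dx + dy * dy;
            if (d < r.dist2) {
                r.dist2 = d;
                r.a = i;
                r.b = j;
            }
        }
    }
    /* Returning the result makes the computed values observable. */
    return r;
}
```

Because the caller receives the indices and the squared distance, a timing loop around this function measures the distance computation itself and not only the iteration count.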
Run time, presumably in seconds, by block size (rows) and number of elements (columns):

Block size      8192     16384     32768     65536    131072    262144    524288   1048576
       128     0.079     0.310     1.260     4.960    19.740    78.990   315.661  1260.862
       256     0.079     0.310     1.250     4.940    19.830    78.820   315.410  1258.402
       512     0.080     0.320     1.260     4.920    19.640    78.480   313.851  1253.141
      1024     0.080     0.320     1.250     4.870    19.430    77.540   310.120  1237.772
      2048     0.079     0.310     1.240     4.850    19.340    77.061   308.211  1229.892
      4096     0.079     0.300     1.210     4.890    19.670    78.300   313.310  1250.572
      8192     0.078     0.310     1.210     4.870    19.510    78.110   312.770  1249.091
     16384               0.300     1.200     4.860    19.420    77.870   312.151  1246.192
     32768                         1.190     4.780    19.310    77.460   310.970  1242.102
     65536                                   4.760    19.230    77.660   312.191  1249.872
    131072                                            18.972    76.850   310.470  1246.261
    262144                                                      76.400   307.521  1239.402