CUDA clock()导致零时钟周期
我想使用clock()来比较不同的内核实现。我试图在一个简单的SAXPY示例中实现它,但它会导致零时钟周期,这是非常不可能的 我已经找到了一些关于如何实现clock()的示例。和。但不知何故,传输到我的代码不起作用 以下是我正在使用的代码:CUDA clock()导致零时钟周期,cuda,gpgpu,clock,Cuda,Gpgpu,Clock,我想使用clock()来比较不同的内核实现。我试图在一个简单的SAXPY示例中实现它,但它会导致零时钟周期,这是非常不可能的 我已经找到了一些关于如何实现clock()的示例。和。但不知何故,传输到我的代码不起作用 以下是我正在使用的代码: /* SAXPY code example from https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/ */ #include <stdio.h&g
/* SAXPY code example from https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/ */
#include <stdio.h>
// The declaration specifier __global__ defines a kernel. This code
// will be copied to the device and will be executed there in parallel
__global__
void saxpy(int n, float a, float *x, float *y, int *kernel_clock)
{
// The indexing of the single threads is done with the following
// code line
int i = blockIdx.x*blockDim.x + threadIdx.x;
clock_t start = clock();
// Each thread is executing just one position of the arrays
if (i < n) y[i] = a*x[i] + y[i];
clock_t stop = clock();
kernel_clock[i] = (int) (stop-start);
}
int main(void)
{
// Clock cycles of threads
int *kernel_clock;
int *d_kernel_clock;
// Creating a huge number
int N = 1<<20;
float *x, *y, *d_x, *d_y;
// Allocate an array on the *host* of the size of N
x = (float*)malloc(N*sizeof(float));
y = (float*)malloc(N*sizeof(float));
kernel_clock = (int*)malloc(N*sizeof(int));
// Allocate an array on the *device* of the size of N
cudaMalloc(&d_x, N*sizeof(float));
cudaMalloc(&d_y, N*sizeof(float));
cudaMalloc(&d_kernel_clock, N*sizeof(int));
// Filling the array of the host
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
// Copy the host array to the device array
cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_kernel_clock, kernel_clock, N*sizeof(int), cudaMemcpyHostToDevice);
// Perform SAXPY on 1M elements. The triple chevrons dedicates how
// the threads are grouped on the device
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y, d_kernel_clock);
cudaDeviceSynchronize();
// Copy the result from the device to the host
cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(kernel_clock, d_kernel_clock, N*sizeof(int), cudaMemcpyDeviceToHost);
// Calculate average clock time
float average_clock = 0;
for (int i = 0; i < N; i++) {
average_clock += (float) (kernel_clock[i]);
}
average_clock /= N;
// Display the time to the screen
printf ("Kernel clock cycles: %.4f\n", average_clock);
// Free the memory on the host and device
free(x);
free(y);
free(kernel_clock);
cudaFree(d_x);
cudaFree(d_y);
cudaFree(d_kernel_clock);
}
我不确定我做错了什么。所以我的问题是:我怎样才能真正得到一个合理的结果?引用你在问题中链接到的一个答案 您还应该知道编译器和汇编器确实执行 指令重新排序,因此您可能需要检查时钟 在SASS输出中,调用不会彼此相邻 (使用cuobjdump进行检查) 我相信这是你问题的根源。如果我使用CUDA 8发行版工具包编译内核,然后使用cuobjdump反汇编生成的机器代码,我会得到以下结果:
code for sm_52
Function : _Z5saxpyifPfS_Pi
.headerflags @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
/* 0x001c4400fe0007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ { CS2R R7, SR_CLOCKLO; /* 0x50c8000005070007 */
/*0018*/ S2R R0, SR_CTAID.X; } /* 0xf0c8000002570000 */
/* 0x083fc400e3e007f0 */
/*0028*/ { CS2R R8, SR_CLOCKLO; /* 0x50c8000005070008 */
/*0030*/ S2R R2, SR_TID.X; } /* 0xf0c8000002170002 */
/*0038*/ XMAD.MRG R3, R0.reuse, c[0x0] [0x8].H1, RZ; /* 0x4f107f8000270003 */
/* 0x081fc400fec207f6 */
/*0048*/ XMAD R2, R0.reuse, c[0x0] [0x8], R2; /* 0x4e00010000270002 */
/*0050*/ XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2; /* 0x5b30011800370000 */
/*0058*/ ISETP.GE.AND P0, PT, R0.reuse, c[0x0][0x140], PT; /* 0x4b6d038005070007 */
/* 0x001fd400fc2007ec */
/*0068*/ SHR R9, R0, 0x1f; /* 0x3829000001f70009 */
/*0070*/ @!P0 SHF.L.U64 R2, RZ, 0x2, R0; /* 0x36f800400028ff02 */
/*0078*/ @!P0 SHF.L.U64 R3, R0, 0x2, R9; /* 0x36f804c000280003 */
/* 0x001fc040fe4207f6 */
/*0088*/ @!P0 IADD R4.CC, R2.reuse, c[0x0][0x148]; /* 0x4c10800005280204 */
/*0090*/ @!P0 IADD.X R5, R3.reuse, c[0x0][0x14c]; /* 0x4c10080005380305 */
/*0098*/ { @!P0 IADD R2.CC, R2, c[0x0][0x150]; /* 0x4c10800005480202 */
/*00a8*/ @!P0 LDG.E R4, [R4]; } /* 0x0005c400fe400076 */
/* 0xeed4200000080404 */
/*00b0*/ @!P0 IADD.X R3, R3, c[0x0][0x154]; /* 0x4c10080005580303 */
/*00b8*/ @!P0 LDG.E R6, [R2]; /* 0xeed4200000080206 */
/* 0x001fd800fea007e1 */
/*00c8*/ LEA R10.CC, R0, c[0x0][0x158], 0x2; /* 0x4bd781000567000a */
/*00d0*/ IADD R8, -R7, R8; /* 0x5c12000000870708 */
/*00d8*/ LEA.HI.X R9, R0, c[0x0][0x15c], R9, 0x2; /* 0x1a17048005770009 */
/* 0x001fc008fe4007f1 */
/*00e8*/ MOV R7, R9; /* 0x5c98078000970007 */
/*00f0*/ @!P0 FFMA R0, R4, c[0x0][0x144], R6; /* 0x4980030005180400 */
/*00f8*/ { MOV R6, R10; /* 0x5c98078000a70006 */
/*0108*/ @!P0 STG.E [R2], R0; } /* 0x001ffc005e2001f2 */
/* 0xeedc200000080200 */
/*0110*/ STG.E [R6], R8; /* 0xeedc200000070608 */
/*0118*/ EXIT; /* 0xe30000000007000f */
/* 0x001f8000fc0007ff */
/*0128*/ BRA 0x120; /* 0xe2400fffff07000f */
/*0130*/ NOP; /* 0x50b0000000070f00 */
/*0138*/ NOP; /* 0x50b0000000070f00 */
.................................
您可以看到,时钟指令已经重新排序,因此调用它们时,它们之间没有任何代码。这将导致运行此代码的许多扭曲(如果不是全部的话)的时钟测量为零或非常接近零。我没有看到任何错误检查。如果使用
cuda memcheck
运行代码,会发生什么情况?cuda memcheck
提供0个错误==========错误摘要:0个错误
谢谢!我现在明白了问题所在,但在您的输出中,我在哪一行看到了这一点?
code for sm_52
Function : _Z5saxpyifPfS_Pi
.headerflags @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
/* 0x001c4400fe0007f6 */
/*0008*/ MOV R1, c[0x0][0x20]; /* 0x4c98078000870001 */
/*0010*/ { CS2R R7, SR_CLOCKLO; /* 0x50c8000005070007 */
/*0018*/ S2R R0, SR_CTAID.X; } /* 0xf0c8000002570000 */
/* 0x083fc400e3e007f0 */
/*0028*/ { CS2R R8, SR_CLOCKLO; /* 0x50c8000005070008 */
/*0030*/ S2R R2, SR_TID.X; } /* 0xf0c8000002170002 */
/*0038*/ XMAD.MRG R3, R0.reuse, c[0x0] [0x8].H1, RZ; /* 0x4f107f8000270003 */
/* 0x081fc400fec207f6 */
/*0048*/ XMAD R2, R0.reuse, c[0x0] [0x8], R2; /* 0x4e00010000270002 */
/*0050*/ XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2; /* 0x5b30011800370000 */
/*0058*/ ISETP.GE.AND P0, PT, R0.reuse, c[0x0][0x140], PT; /* 0x4b6d038005070007 */
/* 0x001fd400fc2007ec */
/*0068*/ SHR R9, R0, 0x1f; /* 0x3829000001f70009 */
/*0070*/ @!P0 SHF.L.U64 R2, RZ, 0x2, R0; /* 0x36f800400028ff02 */
/*0078*/ @!P0 SHF.L.U64 R3, R0, 0x2, R9; /* 0x36f804c000280003 */
/* 0x001fc040fe4207f6 */
/*0088*/ @!P0 IADD R4.CC, R2.reuse, c[0x0][0x148]; /* 0x4c10800005280204 */
/*0090*/ @!P0 IADD.X R5, R3.reuse, c[0x0][0x14c]; /* 0x4c10080005380305 */
/*0098*/ { @!P0 IADD R2.CC, R2, c[0x0][0x150]; /* 0x4c10800005480202 */
/*00a8*/ @!P0 LDG.E R4, [R4]; } /* 0x0005c400fe400076 */
/* 0xeed4200000080404 */
/*00b0*/ @!P0 IADD.X R3, R3, c[0x0][0x154]; /* 0x4c10080005580303 */
/*00b8*/ @!P0 LDG.E R6, [R2]; /* 0xeed4200000080206 */
/* 0x001fd800fea007e1 */
/*00c8*/ LEA R10.CC, R0, c[0x0][0x158], 0x2; /* 0x4bd781000567000a */
/*00d0*/ IADD R8, -R7, R8; /* 0x5c12000000870708 */
/*00d8*/ LEA.HI.X R9, R0, c[0x0][0x15c], R9, 0x2; /* 0x1a17048005770009 */
/* 0x001fc008fe4007f1 */
/*00e8*/ MOV R7, R9; /* 0x5c98078000970007 */
/*00f0*/ @!P0 FFMA R0, R4, c[0x0][0x144], R6; /* 0x4980030005180400 */
/*00f8*/ { MOV R6, R10; /* 0x5c98078000a70006 */
/*0108*/ @!P0 STG.E [R2], R0; } /* 0x001ffc005e2001f2 */
/* 0xeedc200000080200 */
/*0110*/ STG.E [R6], R8; /* 0xeedc200000070608 */
/*0118*/ EXIT; /* 0xe30000000007000f */
/* 0x001f8000fc0007ff */
/*0128*/ BRA 0x120; /* 0xe2400fffff07000f */
/*0130*/ NOP; /* 0x50b0000000070f00 */
/*0138*/ NOP; /* 0x50b0000000070f00 */
.................................