Performance: How to use CUDA together with C to speed up a piece of C code

performance optimization cuda

This is the device code I have written so far:

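__global__ void syndrom(int *d_s, int *d_cx) {
    // d_alpha_to and d_index_of are device lookup tables declared elsewhere (not shown).
    int tid = threadIdx.x + blockDim.x * blockIdx.x + 1;  // 1-based syndrome index
    int t2 = 5460;
    int N_BCH = 16383;
    if (tid <= t2) {
        d_s[tid] = 0;  // posted as d_s[Usetid], which is undefined; tid is presumably meant
        for (int j = 0; j < N_BCH; j++) {
            if (d_cx[j] != 0) {
                d_s[tid] ^= d_alpha_to[(tid * j) % N_BCH];
            }
        }
        d_s[tid] = d_index_of[d_s[tid]];
    }
}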

But it is not very fast, and I would like some help. Thanks.

This is not really the kind of question you are expected to post on StackOverflow. For example, what is d_alpha_to? Still, I can offer a few suggestions:

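Use more threads instead of having each thread iterate many times. A GPU parallelizes work by keeping its processors saturated with threads that are ready to compute.

Don't operate repeatedly on the same location in global memory. Copy d_s[tid] into a local variable, which will be held in a register, do the work there, and write it back once at the end. Accessing global memory is much slower than accessing a register.

Decorate your pointers with __restrict__ and declare d_cx as a pointer to const. Read up on __restrict__ for more detail.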
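The second and third suggestions can be sketched roughly as follows. This is only an illustration, not the asker's final code: the kernel name syndrom_reg and the host wrapper run_syndrom_reg are made up for the example, and passing d_alpha_to and d_index_of as kernel arguments is an assumption, since the question never shows how those tables are declared. The sketch also assumes index_of is large enough to be indexed by any XOR of alpha_to entries.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int T2 = 5460;      // number of syndromes (t2 in the question)
constexpr int N_BCH = 16383;  // codeword length (N_BCH in the question)

// Same computation as the posted kernel, but the running XOR is kept in a
// local variable and the read-only inputs are marked const __restrict__.
// Assumption: the lookup tables are passed in as arguments.
__global__ void syndrom_reg(int *__restrict__ d_s,
                            const int *__restrict__ d_cx,
                            const int *__restrict__ d_alpha_to,
                            const int *__restrict__ d_index_of)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x + 1;  // 1-based, as posted
    if (tid <= T2) {
        int s = 0;  // local accumulator instead of d_s[tid] in global memory
        for (int j = 0; j < N_BCH; j++) {
            if (d_cx[j] != 0) {
                s ^= d_alpha_to[(tid * j) % N_BCH];
            }
        }
        d_s[tid] = d_index_of[s];  // one global write per thread
    }
}

// Hypothetical host wrapper: copies the inputs to the device, launches the
// kernel, and returns d_s as T2 + 1 ints (entry 0 is unused).
std::vector<int> run_syndrom_reg(const std::vector<int> &cx,
                                 const std::vector<int> &alpha_to,
                                 const std::vector<int> &index_of)
{
    assert((int)cx.size() == N_BCH && (int)alpha_to.size() >= N_BCH);
    int *d_s, *d_cx, *d_alpha_to, *d_index_of;
    cudaMalloc(&d_s, (T2 + 1) * sizeof(int));
    cudaMalloc(&d_cx, cx.size() * sizeof(int));
    cudaMalloc(&d_alpha_to, alpha_to.size() * sizeof(int));
    cudaMalloc(&d_index_of, index_of.size() * sizeof(int));
    cudaMemset(d_s, 0, (T2 + 1) * sizeof(int));
    cudaMemcpy(d_cx, cx.data(), cx.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_alpha_to, alpha_to.data(), alpha_to.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_index_of, index_of.data(), index_of.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (T2 + block - 1) / block;  // enough threads to cover T2 syndromes
    syndrom_reg<<<grid, block>>>(d_s, d_cx, d_alpha_to, d_index_of);

    std::vector<int> s(T2 + 1);
    cudaMemcpy(s.data(), d_s, s.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_s);
    cudaFree(d_cx);
    cudaFree(d_alpha_to);
    cudaFree(d_index_of);
    return s;
}
```

The arithmetic is unchanged from the posted kernel; only where the intermediate value lives has changed. How much this helps depends on the hardware and compiler, so it is worth timing both versions.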

This code won't even compile. What is Usetid?

d_alpha_to and d_index_of are arrays of length 16383.

@worldholl: Here you go.

dim3 grid(96);
dim3 block(256);
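With a launch like the one above, each of the 5460 working threads still walks all 16383 positions on its own. One possible way to follow the "use more threads" suggestion is to give every syndrome a whole block and let the threads of that block share the loop, combining their partial results with an XOR reduction in shared memory. The sketch below is an assumption-laden illustration: the names syndrom_block, run_syndrom_block and THREADS are invented here, the tables are passed as arguments because the thread never shows their declarations, and index_of is assumed large enough to be indexed by any XOR of alpha_to entries.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int T2 = 5460;      // number of syndromes (t2 in the question)
constexpr int N_BCH = 16383;  // codeword length (N_BCH in the question)
constexpr int THREADS = 256;  // threads per block; must be a power of two here

// One block per syndrome; the block's threads split the loop over j.
__global__ void syndrom_block(int *__restrict__ d_s,
                              const int *__restrict__ d_cx,
                              const int *__restrict__ d_alpha_to,
                              const int *__restrict__ d_index_of)
{
    __shared__ int partial[THREADS];
    int syn = blockIdx.x + 1;  // 1-based syndrome index, as in the question

    // Each thread XORs a strided slice of the codeword into a local variable.
    int s = 0;
    for (int j = threadIdx.x; j < N_BCH; j += blockDim.x) {
        if (d_cx[j] != 0) {
            s ^= d_alpha_to[(syn * j) % N_BCH];
        }
    }
    partial[threadIdx.x] = s;
    __syncthreads();

    // Tree reduction; XOR is associative and commutative, so order is irrelevant.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] ^= partial[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        d_s[syn] = d_index_of[partial[0]];
    }
}

// Hypothetical host wrapper: returns d_s as T2 + 1 ints (entry 0 is unused).
std::vector<int> run_syndrom_block(const std::vector<int> &cx,
                                   const std::vector<int> &alpha_to,
                                   const std::vector<int> &index_of)
{
    assert((int)cx.size() == N_BCH && (int)alpha_to.size() >= N_BCH);
    int *d_s, *d_cx, *d_alpha_to, *d_index_of;
    cudaMalloc(&d_s, (T2 + 1) * sizeof(int));
    cudaMalloc(&d_cx, cx.size() * sizeof(int));
    cudaMalloc(&d_alpha_to, alpha_to.size() * sizeof(int));
    cudaMalloc(&d_index_of, index_of.size() * sizeof(int));
    cudaMemset(d_s, 0, (T2 + 1) * sizeof(int));
    cudaMemcpy(d_cx, cx.data(), cx.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_alpha_to, alpha_to.data(), alpha_to.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_index_of, index_of.data(), index_of.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    // T2 blocks of THREADS threads: blockDim.x must equal THREADS.
    syndrom_block<<<T2, THREADS>>>(d_s, d_cx, d_alpha_to, d_index_of);

    std::vector<int> s(T2 + 1);
    cudaMemcpy(s.data(), d_s, s.size() * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_s);
    cudaFree(d_cx);
    cudaFree(d_alpha_to);
    cudaFree(d_index_of);
    return s;
}
```

This launches 5460 blocks of 256 threads instead of 96 blocks of 256. Whether it actually beats the one-thread-per-syndrome layout depends on the device, so treat it as something to benchmark rather than a guaranteed win.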