
C++ CUDA: Is there a faster way to write to global memory?


I am writing an n-body simulation, and the whole operation is basically:

-Prepare CUDA memory
 loop {
    -Copy data to CUDA
    -Launch kernel
    -Copy data to host
    -Operations using data (drawing etc.)
 }
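A minimal host-side sketch of that loop (my illustration; the function and host-array names are hypothetical, while the kernel signature matches the one shown below) could look like this:

 #include <cuda_runtime.h>

 // Hedged sketch of the host-side loop above, assuming float arrays of
 // length particlesNumber on the host. Error checking omitted for brevity.
 void runSimulation(float *hostXpos, float *hostYpos, float *hostZpos,
                    float *hostXforces, float *hostYforces, float *hostZforces,
                    float *hostMasses, int particlesNumber, int steps) {
     size_t bytes = particlesNumber * sizeof(float);
     float *dXpos, *dYpos, *dZpos, *dXf, *dYf, *dZf, *dMasses;
     // -Prepare CUDA memory
     cudaMalloc(&dXpos, bytes); cudaMalloc(&dYpos, bytes); cudaMalloc(&dZpos, bytes);
     cudaMalloc(&dXf, bytes);   cudaMalloc(&dYf, bytes);   cudaMalloc(&dZf, bytes);
     cudaMalloc(&dMasses, bytes);
     cudaMemcpy(dMasses, hostMasses, bytes, cudaMemcpyHostToDevice); // masses do not change

     int blockSize = 256;                                            // illustrative launch config
     int gridSize  = (particlesNumber + blockSize - 1) / blockSize;
     for (int step = 0; step < steps; ++step) {
         // -Copy data to CUDA
         cudaMemcpy(dXpos, hostXpos, bytes, cudaMemcpyHostToDevice);
         cudaMemcpy(dYpos, hostYpos, bytes, cudaMemcpyHostToDevice);
         cudaMemcpy(dZpos, hostZpos, bytes, cudaMemcpyHostToDevice);
         cudaMemcpy(dXf, hostXforces, bytes, cudaMemcpyHostToDevice);
         cudaMemcpy(dYf, hostYforces, bytes, cudaMemcpyHostToDevice);
         cudaMemcpy(dZf, hostZforces, bytes, cudaMemcpyHostToDevice);
         // -Launch kernel
         calculateForcesCuda<<<gridSize, blockSize>>>(dXpos, dYpos, dZpos,
                                                      dXf, dYf, dZf,
                                                      dMasses, particlesNumber);
         // -Copy data to host
         cudaMemcpy(hostXforces, dXf, bytes, cudaMemcpyDeviceToHost);
         cudaMemcpy(hostYforces, dYf, bytes, cudaMemcpyDeviceToHost);
         cudaMemcpy(hostZforces, dZf, bytes, cudaMemcpyDeviceToHost);
         // -Operations using data (drawing etc.) would happen here on the host
     }
     cudaFree(dXpos); cudaFree(dYpos); cudaFree(dZpos);
     cudaFree(dXf);   cudaFree(dYf);   cudaFree(dZf);
     cudaFree(dMasses);
 }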
I noticed that almost 90% of the time is spent writing the data to global device memory in the kernel. Here is the kernel:

 __global__ void calculateForcesCuda(float *deviceXpos, float *deviceYpos, float *deviceZpos,
                                    float *deviceXforces, float *deviceYforces, float *deviceZforces,
                                    float *deviceMasses, int particlesNumber) {
     int tid = threadIdx.x + blockIdx.x * blockDim.x;
     if (tid < particlesNumber) {  // bounds check: < rather than <= avoids an out-of-range access
         float particleXpos = deviceXpos[tid];
         float particleYpos = deviceYpos[tid];
         float particleZpos = deviceZpos[tid];
         float xForce = 0.0f;
         float yForce = 0.0f;
         float zForce = 0.0f;
         for (int index=0; index<particlesNumber; index++) {
             if (tid != index) {
                 float otherXpos = deviceXpos[index];
                 float otherYpos = deviceYpos[index];
                 float otherZpos = deviceZpos[index];
                 float mass = deviceMasses[index];
                 float distx = particleXpos - otherXpos;
                 float disty = particleYpos - otherYpos;
                 float distz = particleZpos - otherZpos;
                 float distance = sqrt((distx*distx + disty*disty + distz*distz) + 0.01f);
                 xForce += 10.0f * mass / distance * (otherXpos - particleXpos);
                 yForce += 10.0f * mass / distance * (otherYpos - particleYpos);
                 zForce += 10.0f * mass / distance * (otherZpos - particleZpos);
             }
         }
         deviceXforces[tid] += xForce;
         deviceYforces[tid] += yForce;      
         deviceZforces[tid] += zForce;
     }
 }

... the total execution time drops to about 0.92 seconds, which means that writing to global device memory takes roughly 86% of the execution time. Is there any way to improve the performance of these writes?
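(Side note: measuring by deleting the final writes is unreliable, because the compiler can then eliminate the whole computation as dead code, as the comments below point out. A minimal sketch of timing the kernel alone with CUDA events, using the question's names plus an assumed gridSize/blockSize launch configuration:)

 // Sketch: time only the kernel, rather than the whole loop (illustrative).
 cudaEvent_t start, stop;
 cudaEventCreate(&start);
 cudaEventCreate(&stop);
 cudaEventRecord(start);
 calculateForcesCuda<<<gridSize, blockSize>>>(deviceXpos, deviceYpos, deviceZpos,
                                              deviceXforces, deviceYforces, deviceZforces,
                                              deviceMasses, particlesNumber);
 cudaEventRecord(stop);
 cudaEventSynchronize(stop);              // wait until the kernel has finished
 float ms = 0.0f;
 cudaEventElapsedTime(&ms, start, stop);  // elapsed kernel time in milliseconds
 cudaEventDestroy(start);
 cudaEventDestroy(stop);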

In this kind of computation, memory is usually the bottleneck, even if it does not take up the 90% of the time you measured. I would suggest two things.

Load the device...[index] values into shared memory. Currently, all threads read the same deviceXpos[index], deviceYpos[index], deviceZpos[index] and deviceMasses[index]. You can load them into shared memory instead:

static const int blockSize = ....;

__shared__ float shXpos[blockSize];
__shared__ float shYpos[blockSize];
__shared__ float shZpos[blockSize];
__shared__ float shMasses[blockSize];
for (int mainIndex=0; mainIndex<particlesNumber; mainIndex+=blockSize) { //note: increment mainIndex (assumes particlesNumber is a multiple of blockSize)
    __syncthreads(); //ensure computation from previous iteration has completed
    shXpos[threadIdx.x] = deviceXpos[mainIndex + threadIdx.x];
    shYpos[threadIdx.x] = deviceYpos[mainIndex + threadIdx.x];
    shZpos[threadIdx.x] = deviceZpos[mainIndex + threadIdx.x];
    shMasses[threadIdx.x] = deviceMasses[mainIndex + threadIdx.x];
    __syncthreads(); //ensure all data is read before computation starts
    for (int index=0; index<blockSize; ++index) {
        .... //your computation, using sh....[index] values
    }
}
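Putting the two pieces together, a complete tiled kernel might look like the sketch below. This is my filling-in of the answer's skeleton, not code from the thread; it assumes particlesNumber is a multiple of blockSize (no tail handling), and takes the 0.01f softening term and force formula from the question's kernel:

 // Sketch: the question's kernel with the answer's shared-memory tiling filled in.
 static const int blockSize = 256;   // illustrative value; must match the launch configuration

 __global__ void calculateForcesCudaTiled(float *deviceXpos, float *deviceYpos, float *deviceZpos,
                                          float *deviceXforces, float *deviceYforces, float *deviceZforces,
                                          float *deviceMasses, int particlesNumber) {
     int tid = threadIdx.x + blockIdx.x * blockDim.x;
     float particleXpos = deviceXpos[tid];
     float particleYpos = deviceYpos[tid];
     float particleZpos = deviceZpos[tid];
     float xForce = 0.0f, yForce = 0.0f, zForce = 0.0f;

     __shared__ float shXpos[blockSize];
     __shared__ float shYpos[blockSize];
     __shared__ float shZpos[blockSize];
     __shared__ float shMasses[blockSize];

     for (int mainIndex = 0; mainIndex < particlesNumber; mainIndex += blockSize) {
         __syncthreads(); // ensure computation from the previous tile has completed
         shXpos[threadIdx.x]   = deviceXpos[mainIndex + threadIdx.x];
         shYpos[threadIdx.x]   = deviceYpos[mainIndex + threadIdx.x];
         shZpos[threadIdx.x]   = deviceZpos[mainIndex + threadIdx.x];
         shMasses[threadIdx.x] = deviceMasses[mainIndex + threadIdx.x];
         __syncthreads(); // ensure the whole tile is loaded before it is used

         for (int index = 0; index < blockSize; ++index) {
             if (tid != mainIndex + index) {  // skip self-interaction
                 float distx = particleXpos - shXpos[index];
                 float disty = particleYpos - shYpos[index];
                 float distz = particleZpos - shZpos[index];
                 float distance = sqrtf(distx*distx + disty*disty + distz*distz + 0.01f);
                 float s = 10.0f * shMasses[index] / distance;
                 xForce -= s * distx;  // -distx == (otherXpos - particleXpos), as in the question
                 yForce -= s * disty;
                 zForce -= s * distz;
             }
         }
     }
     deviceXforces[tid] += xForce;
     deviceYforces[tid] += yForce;
     deviceZforces[tid] += zForce;
 }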
You have misunderstood what is happening. The memory writes are not the bottleneck in this code; removing them just lets the compiler optimize most of the code away.

@talonmies God, you are completely right. So the computation itself is actually slow. I will leave the question up in case someone else makes the same mistake.

I doubt the computation is the problem. The memory loads in the loop will be the biggest issue. Start thinking about data reuse and cache performance.

As talonmies said, the code is probably memory-bound. But as a side note: performance-wise, the computation performed in this code would benefit from using the rnorm3d() function.
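For reference, a sketch of what that change to the inner loop might look like (my illustration, not from the thread). Note that the question's kernel adds a 0.01f softening term inside the square root, so rsqrtf() on the softened sum is the closer drop-in; CUDA's rnorm3df(x, y, z) computes 1/sqrt(x*x + y*y + z*z) directly, without softening:

 // Sketch: replacing sqrt + divide with a reciprocal square root (illustrative).
 float distx = particleXpos - otherXpos;
 float disty = particleYpos - otherYpos;
 float distz = particleZpos - otherZpos;
 // With the question's 0.01f softening term, apply rsqrtf to the softened sum:
 float invDistance = rsqrtf(distx*distx + disty*disty + distz*distz + 0.01f);
 // Without softening, rnorm3df() would compute the reciprocal norm in one call:
 // float invDistance = rnorm3df(distx, disty, distz);
 float s = 10.0f * mass * invDistance;
 xForce -= s * distx;  // equivalent to += s * (otherXpos - particleXpos)
 yForce -= s * disty;
 zForce -= s * distz;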