C++ CUDA: dynamic shared memory triggers thrust::system::system_error


I have just started learning CUDA programming through Udacity. When I try to use dynamic shared memory, I get the following error:

CUDA error at: main.cpp:55
invalid argument cudaGetLastError()
terminate called after throwing an instance of 'thrust::system::system_error'
what():  unload of CUDA runtime failed

We are unable to execute your code. Did you set the grid and/or block size correctly?
I have searched a lot but still cannot figure out what is wrong. Interestingly, if I change the last two lines to

    compact_kernel<<<numBlocks, numThreadsPerBlock, sizeof(int)*1000>>>(d_inputVals, d_inputPos, d_outputVals, d_outputPos, numElems, 0);   
    compact_kernel<<<numBlocks, numThreadsPerBlock, sizeof(int)*1000>>>(d_inputVals, d_inputPos, &d_outputVals[numElems/2], &d_outputPos[numElems/2], numElems, 1); 
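the code runs without raising any error. But this makes no sense to me, because the space for a dynamic allocation should not have to be a constant. Maybe the problem is not my code but the setup on Udacity? Below is the code I wrote. Any help would be greatly appreciated.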

__global__ void compact_kernel(unsigned int* const d_inputVals,
    unsigned int* const d_inputPos,
    unsigned int* const d_outputVals,
    unsigned int* const d_outputPos,
    const size_t numElems,
    const size_t refBit)
{
    const size_t tid = blockIdx.x * blockDim.x + threadIdx.x;

    // predicate
    const bool predicate = (d_inputVals[tid] & 1) == refBit;
    extern __shared__ int s[];   
}

void your_sort(unsigned int* const d_inputVals,
    unsigned int* const d_inputPos,
    unsigned int* const d_outputVals,
    unsigned int* const d_outputPos,
    const size_t numElems)
{ 
    const size_t numBlocks = numElems/512;
    const size_t numThreadsPerBlock = 256;
    compact_kernel<<<numBlocks, numThreadsPerBlock, sizeof(int)*numElems>>>(d_inputVals, d_inputPos, d_outputVals, d_outputPos, numElems, 0);   
    compact_kernel<<<numBlocks, numThreadsPerBlock, sizeof(int)*numElems>>>(d_inputVals, d_inputPos, &d_outputVals[numElems/2], &d_outputPos[numElems/2], numElems, 1); 
}

Edit:
The value of numElems is 220480. Is this number too large for a dynamic shared memory allocation?

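Shared memory is limited to 48 KB per thread block by default (some newer architectures allow more, but only through an explicit opt-in). The third launch parameter is the number of bytes of dynamic shared memory allocated for each block, not for the whole grid. With numElems = 220480, the launch asks for sizeof(int)*220480 = 881,920 bytes (about 861 KB, assuming a 4-byte int) per block, which is far above the limit, so the launch fails with "invalid argument". The sizeof(int)*1000 variant asks for only 4,000 bytes, which is why it runs.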
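As a rough illustration of that arithmetic, here is a minimal host-side sketch in plain C++. The names `kAssumedSharedMemLimit`, `sharedBytesForInts` and `fitsInSharedMemory` are hypothetical helpers, not part of the CUDA API, and the 48 KB constant is an assumption. A real program would more likely read the `sharedMemPerBlock` field filled in by `cudaGetDeviceProperties` than hard-code it.

```cpp
#include <cassert>
#include <cstddef>

// ASSUMPTION: default per-block shared memory limit of 48 KB. In real code,
// prefer the sharedMemPerBlock value reported by cudaGetDeviceProperties.
constexpr std::size_t kAssumedSharedMemLimit = 48 * 1024;

// Bytes that the third <<<...>>> launch parameter requests when a block
// needs room for `count` ints of dynamic shared memory.
constexpr std::size_t sharedBytesForInts(std::size_t count) {
    return sizeof(int) * count;
}

// True if a per-block dynamic shared memory request fits under the limit.
constexpr bool fitsInSharedMemory(std::size_t bytes,
                                  std::size_t limit = kAssumedSharedMemLimit) {
    return bytes <= limit;
}

// Values taken from the question.
constexpr std::size_t numElems = 220480;
constexpr std::size_t numThreadsPerBlock = 256;

// Sizing by the whole input does not fit in one block's shared memory.
static_assert(!fitsInSharedMemory(sharedBytesForInts(numElems)),
              "whole-array request exceeds the assumed limit");

// The fixed 1000-int request from the question fits.
static_assert(fitsInSharedMemory(sharedBytesForInts(1000)),
              "1000-int request fits");

// One int per thread of the block also fits comfortably.
static_assert(fitsInSharedMemory(sharedBytesForInts(numThreadsPerBlock)),
              "per-block request fits");
```

If the kernel only needs one int of scratch space per thread, which is an assumption about the intended algorithm, then passing `sharedBytesForInts(numThreadsPerBlock)` as the third launch parameter keeps the request well under the limit regardless of numElems.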

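Comments:

What is the value of numElems?

Shared memory is limited to 48 KB per thread block. Your number exceeds that limit.

Thank you very much, that was it. Could you post your comment as an answer?

Thanks, but the credit should go to @talonmies, since he had already hinted at the answer.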