CUDA: for loop in a __device__ function on a Compute Capability 1.1 device

cuda

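I wrote a __device__ function that uses a for loop. It works on a GTX 460 card (compute capability 2.1) but not on a 9500 GT (compute capability 1.1).
The function looks roughly like this: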
__device__ void myFuncD(float4 *myArray, float4 *result, uint index, uint foo, uint *here, uint *there)
{
    uint j;
    float4 myValue = myArray[index];
    uint idxHere = here[foo];
    uint idxThere = there[foo];
    float4 temp;

    for(j=idxHere;j<idxThere;j++){
        temp = myArray[j];

        //do things with myValue and temp, write result to *result
        result->x += /* some calculations with myValue.x and temp.x */
        result->y += /* some calculations with myValue.y and temp.y */
        result->z += /* some calculations with myValue.z and temp.z */
    }
}

__global__ void myKernelD(float4 *myArray, float4 *myResults, uint *here, uint *there)
{
    uint index = blockDim.x*blockIdx.x+threadIdx.x;

    float4 result = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    uint foo1, foo2, foo3, foo4;

    //compute foo1, foo2, foo3, foo4 based on myArray[index]

    myFuncD(myArray, &result, index, foo1, here, there);
    myFuncD(myArray, &result, index, foo2, here, there);
    myFuncD(myArray, &result, index, foo3, here, there);
    myFuncD(myArray, &result, index, foo4, here, there);

    myResults[index] = result;
}

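On the GTX 460, myResults has the correct values, but on the 9500 GT every component of each of its members is zero.
How can I achieve the same result on a compute capability 1.1 device?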
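The user was trying to launch the kernel with too many threads per block and received the error "too many resources requested for launch". Reducing the number of threads per block allowed the kernel to launch.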
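One way to find a block size that a given kernel can actually launch with is to ask the runtime. The sketch below is only an illustration under assumptions: stubKernel stands in for the question's kernel (whose body is elided there), and clampBlockSize and the requested size of 512 are hypothetical. cudaFuncGetAttributes fills in maxThreadsPerBlock for the kernel on the current device, a limit that already reflects the kernel's register and shared memory usage.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel (assumption): the real myKernelD body is not shown in the question.
__global__ void stubKernel(float4 *out, int n)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index < n)
        out[index] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
}

// Hypothetical helper: never request more threads than the kernel supports.
int clampBlockSize(int requested, int maxAllowed)
{
    return requested < maxAllowed ? requested : maxAllowed;
}

int main()
{
    // Query per-kernel limits on the current device.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, stubKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread: %d, max threads per block: %d\n",
           attr.numRegs, attr.maxThreadsPerBlock);

    const int n = 1 << 14;
    const int block = clampBlockSize(512, attr.maxThreadsPerBlock);
    const int grid = (n + block - 1) / block;  // round up to cover all n elements

    float4 *d_out = NULL;
    err = cudaMalloc((void **)&d_out, n * sizeof(float4));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    stubKernel<<<grid, block>>>(d_out, n);
    err = cudaGetLastError();  // reports launch failures
    if (err != cudaSuccess)
        fprintf(stderr, "launch: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Compiling with nvcc --ptxas-options=-v also prints the register count per thread, which helps explain why a block size that fits on one device may not fit on another with a smaller register file.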
What exactly do you mean by "it doesn't work on the 9500 GT"? I don't see anything in the code that is specifically illegal on SM 1.1. In particular, I don't see the recursion-like behavior mentioned in the title.
So now you have changed the question considerably, and every mention of recursion is gone. But you still haven't said what doesn't work on the compute 1.1 device. Please edit your question again to include a description of the problem.
I meant the for loop. Sorry about that. I read a discussion on another question saying that SM 1.1 does not support recursion, and I mixed up the terms. Also, the __device__ function is a void function that accesses result with the -> operator, and every member of myResults is (0.0, 0.0, 0.0, 0.0). Side question: even though I don't need the w component, am I right to assume that using float4 is better than using float3?
Does your code check for errors reported by the CUDA runtime? I suspect you simply have a runtime error and the kernel never ran...
Ah, you are right, it reports "too many resources requested for launch". So what causes that?
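The runtime check suggested in the comments could look roughly like the following sketch. CUDA_CHECK and emptyKernel are hypothetical names introduced for illustration. A kernel launch returns no status itself, so cudaGetLastError is called right after the launch to catch launch failures, and cudaDeviceSynchronize then surfaces errors that occur while the kernel runs.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical macro: abort with a readable message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel (assumption); substitute the real kernel and arguments.
__global__ void emptyKernel() {}

int main()
{
    emptyKernel<<<1, 64>>>();

    // Launch errors (for example an unsupported block size) show up here.
    CUDA_CHECK(cudaGetLastError());

    // Errors raised during execution show up once the kernel has finished.
    CUDA_CHECK(cudaDeviceSynchronize());

    printf("kernel launched and completed without error\n");
    return 0;
}
```

Without a check like this, a failed launch leaves the output buffer untouched, which would match the all-zero myResults described in the question.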