Passing an array of structs (AoS) to a CUDA kernel? Only element 0 works


I want to pass an array of structs to a CUDA kernel.

My struct:

struct Group_Output_Places
{
    float Parameter[3];
    int Place_ID[3];
};

The host generates the AoS:

struct Group_Output_Places Group_Places[31]; // 31 places

Group_Places[0].Parameter[0] = 360.2f; // the f suffix marks a float literal, so the compiler doesn't complain about a double
Group_Places[0].Place_ID[0] = 1;

Group_Places[0].Parameter[1] = 128.4f;
Group_Places[0].Place_ID[1] = 2;
...
struct Group_Output_Places *Dev_Group_Places;

cudaMalloc((void**)&Dev_Group_Places, sizeof(struct Group_Output_Places) * 31);

cudaMemcpy(Dev_Group_Places, &Group_Places, sizeof(struct Group_Output_Places) * 31, cudaMemcpyHostToDevice); // sizeof(Group_Output_Places) * 31 because it is an array

AddInts<<<1, 1>>>(Dev_Group_Places);

The problem is that only the first element of Group_Places reaches the kernel. How can I get the whole AoS into the kernel?

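I think you more or less have it correct. You are probably confused about array access in general, and about how to make different threads access different elements of the array. This code is legal: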

__global__ void AddInts(struct Group_Output_Places *Dev_Group_Places){
    struct Group_Output_Places GPU_Group_Places;
    GPU_Group_Places = *Dev_Group_Places;
}
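However, it gives your code no access to anything other than the first element of the array, because *Dev_Group_Places is equivalent to Dev_Group_Places[0].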

In CUDA, it is common for a kernel to create a globally unique thread index and then use that index to index into an array. Something like this:

#include <stdio.h>

const int num_places = 32;

struct Group_Output_Places
{
    float Parameter[3];
    int Place_ID[3];
};

__global__ void AddInts(struct Group_Output_Places *Dev_Group_Places, int num_places){
    int i = threadIdx.x+blockDim.x*blockIdx.x; // create globally unique ID
    if (i < num_places){
        struct Group_Output_Places GPU_Group_Places;
        GPU_Group_Places = Dev_Group_Places[i];  // index into array
        printf("from thread %d, place id: %d\n", i, GPU_Group_Places.Place_ID[0]);
    }
}

int main(){

    struct Group_Output_Places Group_Places[num_places];

    for (int i = 0; i < num_places; i++){
        Group_Places[i].Parameter[0] = 360.2f; // the f suffix marks a float literal, so the compiler doesn't complain about a double
        Group_Places[i].Place_ID[0] = i+1;

        Group_Places[i].Parameter[1] = 128.4f;
        Group_Places[i].Place_ID[1] = 2;
    }
    struct Group_Output_Places *Dev_Group_Places;

    cudaMalloc((void**)&Dev_Group_Places, sizeof(struct Group_Output_Places) * num_places);

    cudaMemcpy(Dev_Group_Places, &Group_Places, sizeof(struct Group_Output_Places) * num_places, cudaMemcpyHostToDevice); // num_places elements, because it is an array
    if (num_places <= 1024) // a single block holds at most 1024 threads
        AddInts<<<1, num_places>>>(Dev_Group_Places, num_places);
    cudaDeviceSynchronize();
}
$ nvcc -o t1085 t1085.cu
$ cuda-memcheck ./t1085
========= CUDA-MEMCHECK
from thread 0, place id: 1
from thread 16, place id: 17
from thread 1, place id: 2
from thread 2, place id: 3
from thread 17, place id: 18
from thread 3, place id: 4
from thread 4, place id: 5
from thread 18, place id: 19
from thread 5, place id: 6
from thread 6, place id: 7
from thread 19, place id: 20
from thread 7, place id: 8
from thread 8, place id: 9
from thread 20, place id: 21
from thread 9, place id: 10
from thread 10, place id: 11
from thread 21, place id: 22
from thread 11, place id: 12
from thread 12, place id: 13
from thread 22, place id: 23
from thread 13, place id: 14
from thread 14, place id: 15
from thread 23, place id: 24
from thread 15, place id: 16
from thread 24, place id: 25
from thread 25, place id: 26
from thread 26, place id: 27
from thread 27, place id: 28
from thread 28, place id: 29
from thread 29, place id: 30
from thread 30, place id: 31
from thread 31, place id: 32
========= ERROR SUMMARY: 0 errors
$

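Any time you are having trouble with a CUDA code, it is good practice to use proper CUDA error checking. At a minimum, run your code with cuda-memcheck, as I have done above.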
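As a rough illustration of what such error checking can look like, here is a minimal sketch. The macro name CUDA_CHECK and the empty Noop kernel are hypothetical and not part of the original answer; only the CUDA runtime calls themselves (cudaGetErrorString, cudaGetLastError, cudaDeviceSynchronize) are real API.

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper macro (the name is an assumption, not from the answer):
// prints the error string with file/line and exits if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel used only to demonstrate checking a launch.
__global__ void Noop() {}

int main() {
    int *dev_buf = NULL;
    CUDA_CHECK(cudaMalloc((void **)&dev_buf, 32 * sizeof(int)));

    Noop<<<1, 32>>>();
    CUDA_CHECK(cudaGetLastError());       // reports launch errors (bad configuration, etc.)
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CUDA_CHECK(cudaFree(dev_buf));
    printf("no CUDA errors reported\n");
    return 0;
}
```

Wrapping every runtime call this way means a failed cudaMalloc or cudaMemcpy is reported at the line where it happened instead of surfacing later as missing or wrong data in the kernel.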

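If you want help debugging a code, you need to provide a complete example that others can compile and study. I can't diagnose problems in incomplete code.

Thank you for providing such a detailed answer. Unfortunately, there seems to be some confusion. At the moment I need to send the whole dataset to every thread. I don't want each thread to process a single element of the dataset.

Any thread can access any element of the array, or all of the elements, by modifying the index appropriately. Put whatever index you want where I have commented "index into array". This is just C programming at this point.

At this point, can I clone the entire array of structs with a single one-liner? GPU_Group_Places[0:3] = Dev_Group_Places[0:33];

Well, can you do that in C or C++? I think that if you want to make a "local" copy like that, you need a loop to do the copying. But it may not be necessary. Just as every thread can run a loop to make a local copy, it can also access the global data (Dev_Group_Places) whenever it needs it.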
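To make that last point concrete, here is a minimal sketch in which every thread reads the whole array directly from global memory, with no per-thread local copy. The kernel name SumPlaceIDs, the summation, and the launch with 4 threads are illustrative assumptions, not something from the thread above.

```cuda
#include <stdio.h>

const int num_places = 32;

struct Group_Output_Places
{
    float Parameter[3];
    int Place_ID[3];
};

// Hypothetical kernel: each thread loops over ALL n elements of the array
// in global memory instead of handling only the element at its own index.
__global__ void SumPlaceIDs(const struct Group_Output_Places *Dev_Group_Places, int n){
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    int sum = 0;
    for (int j = 0; j < n; j++){
        sum += Dev_Group_Places[j].Place_ID[0];  // any thread may read any element
    }
    printf("thread %d read %d elements, Place_ID[0] sum: %d\n", tid, n, sum);
}

int main(){
    struct Group_Output_Places Group_Places[num_places] = {};  // zero-initialize all fields
    for (int i = 0; i < num_places; i++){
        Group_Places[i].Place_ID[0] = i + 1;  // 1..32, so every thread should report 528
    }

    struct Group_Output_Places *Dev_Group_Places;
    cudaMalloc((void**)&Dev_Group_Places, sizeof(struct Group_Output_Places) * num_places);
    cudaMemcpy(Dev_Group_Places, Group_Places, sizeof(struct Group_Output_Places) * num_places, cudaMemcpyHostToDevice);

    SumPlaceIDs<<<1, 4>>>(Dev_Group_Places, num_places);  // 4 threads, each sees the full array
    cudaDeviceSynchronize();
    cudaFree(Dev_Group_Places);
    return 0;
}
```

The device pointer is the same for every thread, so the whole dataset is already "sent" to all of them by the single cudaMemcpy; a local copy is only worth making if a thread needs to modify its own private version of the data.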