Implementing a neural network with CUDA

cuda artificial-intelligence neural-network

I am trying to create a neural network using CUDA.

My kernel looks like this:

__global__ void feedForward(float *input, float *output, float **weight) {

//Here the threadId uniquely identifies weight in a neuron
int weightIndex = threadIdx.x;

//Here the blockId uniquely identifies a neuron
int neuronIndex = blockIdx.x;

if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex]
        * input[weightIndex];
}

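Am I doing something wrong?

Is it because I am using the block index and the thread index to address the weight matrix, or is the problem somewhere else?

I allocate the weight matrix as follows: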

cudaMallocPitch((void**)&d_Weight, &pitch_W,input_size,NO_OF_NEURONS);

My kernel call is:

feedForward<<<NO_OF_NEURONS,NO_OF_WEIGHTS>>>(d_Input,d_Output,d_Weight);

After that, I call cudaThreadSynchronize().

I am new to CUDA programming. Any help would be appreciated.

Thanks

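You are using cudaMallocPitch but do not show how the variables are initialized; I would bet that this is the source of your error. cudaMallocPitch is rather tricky: the third argument (the width) must be given in bytes, whereas the fourth (the height) is a number of rows, not bytes. For example: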

int width = 64, height = 64;
float* devPtr;
size_t pitch;
// width is given in bytes, height in rows
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);

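Is your variable input_size in bytes? If not, you may be allocating too little memory (that is, you think you are requesting 64 elements but you actually get 64 bytes), so the kernel will access memory out of range. In my experience, an "unspecified launch failure" error usually means a segfault.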
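To make the bytes-versus-rows distinction concrete, here is a minimal sketch of a pitched allocation together with a kernel that reads it. It assumes the weights are stored row-major with one row per neuron; the names `rowPtr`, `dotRows`, and `runPitchedDot` are hypothetical and not part of the original code. Note that cudaMallocPitch returns one flat allocation, so this kernel takes a `float *` plus the pitch. If a pointer obtained this way were indexed as `weight[neuronIndex][weightIndex]` through a `float **`, the kernel would dereference weight data as if it were a pointer.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Address of row `row` in a pitched allocation (pitch is in bytes, not elements).
__host__ __device__ inline const float *rowPtr(const float *base, size_t pitch, int row) {
    return (const float *)((const char *)base + row * pitch);
}

// One thread per neuron: serial dot product of that neuron's weight row with the input.
__global__ void dotRows(const float *input, float *output, const float *weight,
                        size_t pitch, int nNeurons, int nWeights) {
    int neuron = blockIdx.x * blockDim.x + threadIdx.x;
    if (neuron >= nNeurons) return;
    const float *w = rowPtr(weight, pitch, neuron);
    float sum = 0.0f;
    for (int i = 0; i < nWeights; ++i) sum += w[i] * input[i];
    output[neuron] = sum;
}

// Hypothetical host helper: hWeight holds nNeurons * nWeights floats, row-major.
// Returns the outputs, or an empty vector if any CUDA call fails.
std::vector<float> runPitchedDot(const std::vector<float> &hWeight,
                                 const std::vector<float> &hInput,
                                 int nNeurons, int nWeights) {
    float *dWeight = nullptr, *dInput = nullptr, *dOutput = nullptr;
    size_t pitch = 0;
    size_t rowBytes = nWeights * sizeof(float);   // width in BYTES
    size_t outBytes = nNeurons * sizeof(float);
    std::vector<float> hOutput(nNeurons, 0.0f);

    bool ok =
        cudaMallocPitch((void **)&dWeight, &pitch, rowBytes, nNeurons) == cudaSuccess &&
        cudaMalloc((void **)&dInput, rowBytes) == cudaSuccess &&
        cudaMalloc((void **)&dOutput, outBytes) == cudaSuccess &&
        // The host rows are densely packed, so the source pitch equals rowBytes.
        cudaMemcpy2D(dWeight, pitch, hWeight.data(), rowBytes, rowBytes, nNeurons,
                     cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dInput, hInput.data(), rowBytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        int threads = 128;
        int blocks = (nNeurons + threads - 1) / threads;
        dotRows<<<blocks, threads>>>(dInput, dOutput, dWeight, pitch, nNeurons, nWeights);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hOutput.data(), dOutput, outBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dWeight);
    cudaFree(dInput);
    cudaFree(dOutput);
    if (!ok) hOutput.clear();
    return hOutput;
}
```

The same pitch value is what a later cudaMemcpy2D from device to host would need in order to copy the weights back.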

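There is also a problem in the code that writes the output. It would not produce the error you describe, but it would produce incorrect results: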

int neuronIndex = blockIdx.x;

if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex] * input[weightIndex];

I built a very simple MLP network using CUDA. If you are interested, you can find my code here:
If you have any questions, just ask.
Daniel
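How do you allocate the weights? Show the code that allocates the memory, as well as the kernel launch.

We would need to know how the weight array is allocated.

An unspecified launch failure usually means that your kernel failed at something. Check for errors before the copy. I bet you did not copy the weights in the right way.

1) How do you launch the kernel? 2) The write to the output array is wrong: all threads in a block write to a single memory cell at the same time. You can replace this part of the code with a reduction in shared memory followed by a single write to global memory.

If you comment out everything that reads or writes global memory and the kernel then runs without error, it means a segmentation fault, i.e. your indices are wrong.

Also, are you sure you have not swapped the indices, so that weight[neuronIndex][weightIndex] should really be weight[weightIndex][neuronIndex]? Usually the smaller index of a 2D array is the latter one.

input_size is in bytes. It is initialized as: int input_size = NO_OF_WEIGHTS * sizeof(float); so I think we can rule that out.

Do you use the pitch value anywhere?

Not yet... but I will need it later to copy the weights from the device to the host.

Thanks for making me aware of the reduction. But that did not solve the problem. Does anyone have another solution?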
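The advice to check for errors before the copy can be sketched as follows. The kernel `fill` and the helpers `check` and `checkedLaunch` are hypothetical stand-ins; the runtime calls themselves (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are standard, and cudaDeviceSynchronize is the current replacement for the deprecated cudaThreadSynchronize. Checking each stage separately shows whether a failure comes from the launch, from the kernel body, or from the copy.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Trivial stand-in kernel: writes `value` to every element.
__global__ void fill(float *data, int n, float value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Prints the CUDA error string with a label; returns false on failure.
bool check(cudaError_t err, const char *what) {
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Runs the kernel and copies n floats into hOut, checking every stage.
bool checkedLaunch(float *hOut, int n, float value) {
    float *dOut = nullptr;
    if (!check(cudaMalloc((void **)&dOut, n * sizeof(float)), "cudaMalloc")) return false;

    fill<<<(n + 127) / 128, 128>>>(dOut, n, value);

    bool ok = check(cudaGetLastError(), "kernel launch")          // invalid launch configuration
           && check(cudaDeviceSynchronize(), "kernel execution")  // faults inside the kernel
           && check(cudaMemcpy(hOut, dOut, n * sizeof(float),
                               cudaMemcpyDeviceToHost), "cudaMemcpy");
    cudaFree(dOut);
    return ok;
}
```

Without the synchronize step, an out-of-bounds access inside the kernel is typically only reported by the next runtime call, which is why the error can appear to come from the copy.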

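The kernel with the output write replaced by a shared-memory reduction:

__global__ void feedForward(float *input, float *output, float **weight) {

  int weightIndex = threadIdx.x;
  int neuronIndex = blockIdx.x;
  // One partial product per thread. The reduction below assumes that
  // NO_OF_WEIGHTS is a power of two and equals the block size.
  __shared__ float out_reduce[NO_OF_WEIGHTS];

  out_reduce[weightIndex] =
     (weightIndex<NO_OF_WEIGHTS && neuronIndex<NO_OF_NEURONS) ?
       weight[neuronIndex][weightIndex] * input[weightIndex]
       : 0.0f;
  __syncthreads();

  // Tree reduction: start at half the array so that
  // weightIndex + s stays inside out_reduce
  for (int s = NO_OF_WEIGHTS / 2; s > 0 ; s >>= 1)
  {
    if (weightIndex < s) out_reduce[weightIndex] += out_reduce[weightIndex + s];
    __syncthreads();
  }

  // A single thread per block writes the neuron's sum to global memory
  if (weightIndex == 0) output[neuronIndex] += out_reduce[weightIndex];
}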
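For weight counts that are not a power of two, a variant of the same reduction might look like the sketch below. This is an illustration under stated assumptions, not the answerer's code: the block size `BLOCK` is a hypothetical compile-time constant that must be a power of two, the weights are a dense row-major `float *` array (no pitch, to keep the example short), and the names `feedForwardReduce` and `runReduce` are invented for this example. Each thread first accumulates a strided slice of the products, then the block reduces the partial sums in shared memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 128;  // threads per block; must be a power of two (assumption)

// One block per neuron; nWeights may be any positive size.
__global__ void feedForwardReduce(const float *input, float *output,
                                  const float *weight, int nWeights) {
    __shared__ float partial[BLOCK];
    int neuron = blockIdx.x;
    const float *w = weight + (size_t)neuron * nWeights;

    // Each thread sums the products at indices tid, tid + BLOCK, tid + 2*BLOCK, ...
    float sum = 0.0f;
    for (int i = threadIdx.x; i < nWeights; i += BLOCK) sum += w[i] * input[i];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction over the BLOCK partial sums.
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }

    // Overwrite instead of accumulate, so the output needs no prior zeroing.
    if (threadIdx.x == 0) output[neuron] = partial[0];
}

// Hypothetical host helper: returns the outputs, or an empty vector on any CUDA error.
std::vector<float> runReduce(const std::vector<float> &hWeight,
                             const std::vector<float> &hInput,
                             int nNeurons, int nWeights) {
    float *dWeight = nullptr, *dInput = nullptr, *dOutput = nullptr;
    size_t wBytes = hWeight.size() * sizeof(float);
    size_t inBytes = nWeights * sizeof(float);
    size_t outBytes = nNeurons * sizeof(float);
    std::vector<float> hOutput(nNeurons, 0.0f);

    bool ok =
        cudaMalloc((void **)&dWeight, wBytes) == cudaSuccess &&
        cudaMalloc((void **)&dInput, inBytes) == cudaSuccess &&
        cudaMalloc((void **)&dOutput, outBytes) == cudaSuccess &&
        cudaMemcpy(dWeight, hWeight.data(), wBytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(dInput, hInput.data(), inBytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        feedForwardReduce<<<nNeurons, BLOCK>>>(dInput, dOutput, dWeight, nWeights);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hOutput.data(), dOutput, outBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dWeight);
    cudaFree(dInput);
    cudaFree(dOutput);
    if (!ok) hOutput.clear();
    return hOutput;
}
```

Writing with `=` rather than `+=` is a deliberate difference from the kernel above: with `+=` the result depends on whatever the output buffer held before the launch.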