Speeding up a neural network on the GPU (tags: neural-network, gpgpu, thrust, cublas)


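I'm trying to implement a neural network that runs on the GPU using the Thrust and cuBLAS libraries, but I'm having a lot of trouble getting it to run faster than our current multithreaded and vectorized CPU implementation. The network has one hidden layer of logistic units and an output layer of linear units. Here is the code: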

// Functor to add bias before computing logistic
template <typename T>
struct bias_logistic_f {
        __host__ __device__
        T operator()(const T& x, const T& y) const {
                return 1/(1+exp(-(x+y)));
        }
};
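// Instantiate the functor for FLT (with trailing "()" this line would declare a function instead)
bias_logistic_f<FLT> bias_logistic;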

// Thrust vectors for input/hidden/output units
thrust::device_vector<FLT> batch(batch_rows*ndim);
thrust::device_vector<FLT> hid(batch_rows*nhid);
thrust::device_vector<FLT> gpu_code(ndata*ncode);

// ...Load data and network weights...

// Multiply input (batch) by weights (vis2hid)
// Our matrices are stored row-major, but BLAS wants column-major,
// so pretend they're transposed and compute hid' = vis2hid' * batch'
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, nhid, batch_rows, ndim,
            &alpha, thrust::raw_pointer_cast(&vis2hid[0]), nhid,
                    thrust::raw_pointer_cast(&batch[0]), ndim,
             &beta, thrust::raw_pointer_cast(&hid[0]), nhid);

// Add hidbiases to hid and compute logistic
thrust::transform(hid.begin(), hid.end(), hidbiases.begin(), hid.begin(),
                  bias_logistic);

// Multiply hid by weights (hid2code)
cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, ncode, batch_rows, nhid,
            &alpha, thrust::raw_pointer_cast(&hid2code[0]), ncode,
                    thrust::raw_pointer_cast(&hid[0]), nhid,
             &beta, thrust::raw_pointer_cast(&gpu_code[b*batch_rows*ncode]), ncode);

// Add codebiases
thrust::transform(gpu_code.begin() + b*batch_rows*ncode, gpu_code.begin() + (b+1)*batch_rows*ncode,
                  codebiases.begin(), gpu_code.begin() + b*batch_rows*ncode,
                  thrust::plus<FLT>());
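The row-major/column-major trick and the bias step above can be checked against a small host-side reference. The following is a minimal sketch, not the poster's code: `gemm_colmajor` and `hidden_layer` are hypothetical helper names, `FLT` is assumed to be `double` (suggested by the use of `cublasDgemm`), and `hidbias` is assumed to hold one value per hidden unit. Under that last assumption the bias has to be broadcast across rows (index `i % nhid`), because the binary `thrust::transform` consumes one element of its second input range for every element of `hid`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Column-major GEMM reference: C (m x n) = A (m x k) * B (k x n),
// no transposes, alpha = 1, beta = 0, with BLAS-style leading dimensions.
void gemm_colmajor(int m, int n, int k,
                   const std::vector<double>& A, int lda,
                   const std::vector<double>& B, int ldb,
                   std::vector<double>& C, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                sum += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = sum;
        }
    }
}

// Hidden layer for row-major inputs:
//   batch   : batch_rows x ndim
//   vis2hid : ndim x nhid
//   hidbias : nhid values (ASSUMPTION: one bias per hidden unit)
// Returns hid as a row-major batch_rows x nhid matrix.
std::vector<double> hidden_layer(int batch_rows, int ndim, int nhid,
                                 const std::vector<double>& batch,
                                 const std::vector<double>& vis2hid,
                                 const std::vector<double>& hidbias) {
    std::vector<double> hid(static_cast<std::size_t>(batch_rows) * nhid);
    // Same argument order as the cublasDgemm call: hid' = vis2hid' * batch'.
    // A row-major matrix read as column-major is its own transpose.
    gemm_colmajor(nhid, batch_rows, ndim, vis2hid, nhid, batch, ndim, hid, nhid);
    const std::size_t n = static_cast<std::size_t>(nhid);
    for (std::size_t i = 0; i < hid.size(); ++i) {
        const double b = hidbias[i % n];  // broadcast the bias across rows
        hid[i] = 1.0 / (1.0 + std::exp(-(hid[i] + b)));
    }
    return hid;
}
```

On the GPU, the same broadcast could be expressed without copying the bias vector, for example by indexing it with the element position modulo `nhid`. If `hidbiases` is already tiled to `batch_rows*nhid` entries, the original `thrust::transform` call is fine as written.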

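Our input data is a sparse matrix with roughly 150,000 rows and 6,500 columns, averaging about 100 non-zero elements per row. That is too large to store the full matrix densely on the GPU, so I loop over the sparse matrix and expand it one batch of 1,000 rows at a time to feed the network: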

for(int b=0; b<nbatch; ++b) {
    // Zero out batch b
    thrust::fill(batch.begin(), batch.end(), 0.0f);
    // batch_val contains the non-zero values for the current batch, batch_idx the indices within the batch,
    // and batch_ptr indexes into batch_val/batch_idx
    // This is like CSR format except instead of compressing rows, it's compressing submatrices of 1,000 rows
    thrust::scatter(batch_val.begin() + batch_ptr[b],
                    batch_val.begin() + batch_ptr[b+1],
                    batch_idx.begin() + batch_ptr[b],
                    batch.begin());

    // ...Input batch to network (shown above)...
}
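To make the batch layout concrete, here is a small host-side sketch of the same fill-and-scatter step. It is an illustration only: `expand_batch` is a hypothetical helper, values are assumed to be `double`, and `batch_idx` is assumed to hold flat offsets (`row_in_batch * ndim + col`) into the dense batch, which is what the `thrust::scatter` call implies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expand batch b of the compressed matrix into a dense, zero-filled buffer.
// batch_ptr[b] .. batch_ptr[b+1] delimits the non-zeros of batch b;
// batch_idx[k] is the flat position of batch_val[k] inside the dense batch.
std::vector<double> expand_batch(int b, std::size_t dense_size,
                                 const std::vector<double>& batch_val,
                                 const std::vector<int>& batch_idx,
                                 const std::vector<int>& batch_ptr) {
    assert(b >= 0 && static_cast<std::size_t>(b) + 1 < batch_ptr.size());
    std::vector<double> batch(dense_size, 0.0);   // plays the role of thrust::fill
    for (int k = batch_ptr[b]; k < batch_ptr[b + 1]; ++k) {
        assert(static_cast<std::size_t>(batch_idx[k]) < dense_size);
        batch[batch_idx[k]] = batch_val[k];       // plays the role of thrust::scatter
    }
    return batch;
}
```

With 1,000 rows of 6,500 columns and about 100 non-zeros per row, each batch writes roughly 100,000 values into a buffer of 6.5 million entries, so about 98% of the dense batch is zeros.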

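Our CPU implementation does the same thing using STL vectors. When I ran both and compared their run times, I was surprised to find that the GPU code takes about 38 seconds on average to process our data, while the CPU code takes only 27 seconds. Part of this may be because the GPU is a Tesla C1060 that is a few years old, while the server is a newer 24-core machine. Still, I would have thought that with thousands of threads available it wouldn't end up roughly 40% slower.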

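Is there any way to make this code run faster? I'm new to GPU programming, so I'm not sure what I might be doing wrong. Is there a more efficient way to handle the sparse matrix than what I'm doing here, such as using the cuSPARSE library? Or would it be better to forget the high-level libraries entirely and write my own CUDA kernels that combine the matrix multiply/logistic/add steps?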

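You have a few major pieces here: expanding the submatrix, multiplying by the input weights, multiplying by the hidden weights, and so on. Have you profiled these pieces to break down your overall execution time? Is the 38 seconds for a single pass through the code above, or for multiple passes? Is the CSR-like sparse representation stored entirely in GPU memory, or are host->device transfers going on? Since you're doing matrix multiplication, I'd say that using a sparse library such as CUSP or cuSPARSE would probably give better results than treating the data as dense.

The entire sparse matrix is stored on the GPU, so there is only one transfer before the loop and one afterwards to copy the output back. I ran nvprof, and it says 96% of the time is spent in the two dgemm calls.

Switching to cuSPARSE should improve things a lot. I think you should be able to collapse your loop into a single sequence of calls with no loop.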
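To illustrate why a sparse multiply helps the first layer, here is a host-side sketch of a CSR-times-dense product. It shows the operation a sparse library routine would perform, not the cuSPARSE or CUSP API. `csr_times_dense` is a hypothetical name, and the input is assumed to be in standard CSR form (`row_ptr`, `col_idx`, `val`) rather than the per-batch format used in the question. With about 100 non-zeros per row instead of 6,500 columns, the inner work for the input layer drops by a factor of roughly 65. The second product (hidden activations times `hid2code`) stays dense.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// out (nrows x nhid, row-major) = X (CSR, nrows x ndim) * W (ndim x nhid, row-major).
// Only the stored non-zeros of X are visited, so the cost is
// (number of non-zeros) * nhid multiply-adds instead of nrows * ndim * nhid.
std::vector<double> csr_times_dense(int nrows, int ndim, int nhid,
                                    const std::vector<int>& row_ptr,
                                    const std::vector<int>& col_idx,
                                    const std::vector<double>& val,
                                    const std::vector<double>& W) {
    assert(row_ptr.size() == static_cast<std::size_t>(nrows) + 1);
    assert(W.size() == static_cast<std::size_t>(ndim) * nhid);
    std::vector<double> out(static_cast<std::size_t>(nrows) * nhid, 0.0);
    for (int r = 0; r < nrows; ++r) {
        double* o = &out[static_cast<std::size_t>(r) * nhid];
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; ++k) {
            assert(col_idx[k] >= 0 && col_idx[k] < ndim);
            const double x = val[k];
            // Row of W selected by the column index of this non-zero.
            const double* w = &W[static_cast<std::size_t>(col_idx[k]) * nhid];
            for (int h = 0; h < nhid; ++h) {
                o[h] += x * w[h];
            }
        }
    }
    return out;
}
```

Because the whole sparse matrix already fits in GPU memory, a sparse product of this kind would also remove the need to expand 1,000-row dense batches, which is what makes it possible to replace the loop with a single sequence of calls.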