Parallelising word2vec pseudocode to work on the GPU with CUDA: overcoming unaligned memory accesses


Below is the pseudocode I am trying to parallelise (taken from the word2vec C code). First, I will list the data structures with their corresponding sizes, and then the pseudocode:

1.  long long sen[MAX_SENTENCE_LENGTH]  
// In the C code, MAX_SENTENCE_LENGTH = 1000. Increasing this should be  
//fine.

2.  float neu1[N] (hidden layer values)
//N is the length of each vector. For now, max N = 400

3.  float neu1e[N] (hidden layer error values)

4.  float syn0[V * N] (input to hidden layer weight matrix)
// For now, we can assume that V * N is small enough to be stored on the GPU
   // In the test data, V = 72k words

5.  float syn1neg[V * N] (back propagation weights used during negative sampling)

6. float exptable[1000] 
The input to the program is a text file. The program then processes the text one word at a time to build the vocabulary. For example, if my text file contains the sentence

"parallel programming is very interesting"

then the vocabulary would look like this (because the code sorts the vocabulary by word frequency):

After building the vocabulary, the code starts processing the text again, 1000 words at a time. The first 1000 words are stored in sen[MAX_SENTENCE_LENGTH], then a neural network is trained for all the words in sen, and this process continues until the end of the file. For the sentence above, sen would look like this:

[1,2,3,0,0,4]

Assuming that training is done in just one iteration, the pseudocode is as follows:

for sen in text
{ 
    for word in sen
    {

        for (c = 0; c < N; c++) 
            neu1[c] = 0;

        for (c = 0; c < N; c++) 
            neu1e[c] = 0;   

       /*The variable window is a user supplied parameter. 
        It is used to consider the context  around a word in a sentence. 
        For example, if I am looking at the first word in the sentence
        (target word is word1), and window = 5, then the words in the 
        window = {word2, word3, word4, word5}. 
        If I am looking at the third word in the sentence 
        (target word is word3), then window = {word1, word2, word4, word5}*/    

        for word in window
        {
            for (c = 0; c < N; c++) 
                neu1[c] += syn0[c + word * N];
        }

        for (c = 0; c < N; c++) 
            neu1[c] /= window;

        //negative: number of negative samples to provide (assume it to be 
             //between 5 to 25)
        for (d = 0; d < negative + 1; d++) 
        {

            target = sen[random_index]  
            l2 = target * N;
            f = 0;
            for (c = 0; c < N; c++) 
                f += neu1[c] * syn1neg[c + l2];

           gradient = exptable[function of f] //f is calculated in the loop above

           for (c = 0; c < N; c++) 
              neu1e[c] += gradient * syn1neg[c + l2];

           for (c = 0; c < N; c++) 
              syn1neg[c + l2] += gradient * neu1[c];

          } //Negative Sampling ends    

        for word in window
        {
             for (c = 0; c < N; c++) 
                syn0[c + word * N] += neu1e[c];
        }

   } // word in sen loop ends

 } // sen in text loop ends
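
To make the thread layout proposed below concrete, here is a minimal CUDA sketch under my own assumptions (the kernel and parameter names are hypothetical, not from the original code): one thread block per target word in sen, thread c owning component c of the N-float vectors. It covers only the read-only hidden-layer accumulation, because concurrent words racing on the syn1neg and syn0 updates is exactly the open design question.

```cuda
// Hypothetical sketch: one block per target word, blockDim.x >= N (<= 400).
__global__ void hidden_layer_kernel( const float     *syn0,     // [V * N], global memory
                                     const long long *ctx,      // [num_words * max_ctx] context-word indices
                                     const int       *ctx_len,  // [num_words] context sizes
                                           float     *neu1,     // [num_words * N] output
                                     int N, int max_ctx )
{
    int word = blockIdx.x;     // which target word of the sentence
    int c    = threadIdx.x;    // which vector component this thread owns
    if ( c >= N ) return;

    float acc = 0.0f;          // accumulate in a register: no races on neu1
    int   len = ctx_len[word];
    for ( int k = 0; k < len; k++ )
        // threads 0..N-1 read consecutive floats of one syn0 row, so each
        // row read is coalesced even though rows are selected "randomly"
        acc += syn0[ c + ctx[word * max_ctx + k] * N ];

    neu1[word * N + c] = acc / (float) len;
}
```

A plausible launch for one sen batch would be `hidden_layer_kernel<<< num_words, 512 >>>( ... )` with 512 >= N; the write phases (the syn1neg and syn0 updates) are where inter-word races appear, which is why they may warrant separate kernels or atomics.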
I think the best way to parallelise this is to process the words in a sentence in parallel. Considering all the loops, I think I should use N threads per word, so that a single thread accesses global memory (syn0, syn1neg) only once per loop. Also, since all the neu1 and neu1e updates are independent, they can reside in the private memory of the threads and be updated independently.

My main concerns right now are:

  • Global memory accesses occur in a random fashion, because syn0 and
    syn1neg are accessed based on the value of the word variable (its index
    in the vocabulary). And, as we have seen, the words in a sentence do
    not occur in any particular order.

  • Is this a big problem? Or can we hide the memory latency by giving the
    GPU a sufficient number of threads? Also, I don't understand whether
    this access pattern is really random, because N threads/word will
    access sequential data in syn0 and syn1neg, but the next set of N
    threads may access sequential data that is far away in memory.
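
The index arithmetic behind that second point can be checked with a tiny CPU-side model (plain C; the helper is hypothetical, purely to illustrate the stride pattern): thread t working on vocabulary entry word touches syn0[word * N + t], so threads within one word are stride-1 (one coalescable row), and only the jump between rows is random.

```c
/* Hypothetical model of the per-thread global index proposed above:
   thread t of the group handling vocabulary entry `word` reads
   syn0[word * N + t]. */
long long syn0_index(long long word, int N, int t)
{
    return word * (long long)N + t;
}
```

Adjacent threads therefore always cover one contiguous N-float segment per word; the scattering of the segments themselves mostly affects cache behaviour rather than intra-warp coalescing.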

  • In the negative-sampling loop, a reduction operation needs to be
    performed. The variable f is a sum of dot products. The problem is
    that I plan to keep neu1 in the private memory of each thread, while
    syn1neg is in global memory.

  • Will negative sampling require a separate kernel? It looks like it
    needs an approach different from just launching N threads/word, but
    I'm not sure which approach would work best.
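
On the reduction question, a common pattern (my sketch, not from the original code) is an in-block shared-memory tree reduction, which fits the planned N-threads-per-word layout without a separate kernel, because __syncthreads() synchronises exactly the threads that hold the neu1 components:

```cuda
// Hypothetical block-wide dot product: f = sum_c neu1[c] * syn1neg[l2 + c].
// Assumes blockDim.x is a power of two and >= N (e.g. 512 for N <= 400);
// each thread passes in its own register-resident neu1 component.
__device__ float block_dot( float neu1_c, const float *syn1neg,
                            long long l2, int N )
{
    __shared__ float partial[512];
    int c = threadIdx.x;

    partial[c] = ( c < N ) ? neu1_c * syn1neg[l2 + c] : 0.0f;
    __syncthreads();

    // classic tree reduction: halve the number of active threads each step
    for ( int stride = blockDim.x / 2; stride > 0; stride >>= 1 ) {
        if ( c < stride ) partial[c] += partial[c + stride];
        __syncthreads();
    }
    return partial[0];   // every thread reads back the same reduced f
}
```

Each thread keeps its neu1 component in a register, stages one product into shared memory, and after log2(blockDim.x) halving steps every thread can read the same f from partial[0] to apply the gradient.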

    Apart from these issues, please point out anything else that is problematic about the way I am approaching this code.

    Preamble: you have opened a can of worms here (even with no SLOC
    present), so hopefully you will accept the comments below part by
    part, as elephant-slicing seems to be the only way to address the
    complex topic as a whole, rather than "escaping" from the principal
    problems into the comfort zone of individual lines of code in the
    implementation domain, where the big picture is usually missed, if it
    was not already lost beforehand.


    A1: Yes, this is a principal fact (a.k.a. "problem")

    GPU devices are designed in silicon and optimised as SIMD
    (single-instruction, multiple-data) hardware architectures. They
    therefore perform best on a code + data layout that never needs (over
    its whole lifetime) more than a small memory area (kilobytes) that
    fits into the on-chip memory area of the SIMD SM core (no spill-overs,
    eventually LRU-maintained L1 cache), and that thus does not introduce
    any of the "idealisation"-devastating latency penalties of roughly
    350-700 ns per gloMEM access.

    [B]-Fig.2 states:

    Tesla SM has 8 [SMX ...]
    
    __global__ void
    __launch_bounds__( 1,     // dim3 tbGridSIZE <maxThreadsPerBLOCK>         COMPILE-TIME_ADVICE_FOR_OPTIMISING_COMPILER_____________REGISTERs, CACHE_, FETCH_, PROXIMITY_PATTERNs ANALYSES
                       1  /*, // dim3 tBlockSIZE <minBlocksPerMULTIPROCESSOR> COMPILE-TIME_ADVICE_FOR_OPTIMISING_COMPILER_____________OPTIMUM_SCHEDULE_TO_FILL_FETCH_LATENCIES
                       ?,     // iAsyncSeqOfCmdsQUEUE_Stream_ID <<- TO LET BE FREELY ASSIGNABLE ... NON-BLOCKING EXEC'd KERNEL
                       0  */  // iSharedMemSIZE
                       )
                     Device_printf_GPU_CLK( int const iTag ){
                            ...
                            return;
    }
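
    For completeness, a host-side launch of this kernel might look as
    follows (a sketch; the <<< 1, 1 >>> configuration mirrors the
    __launch_bounds__( 1, 1 ) advice above, and the commented-out stream
    and shared-memory slots map onto the third and fourth launch
    parameters):

```cuda
cudaStream_t stream;
cudaStreamCreate( &stream );

// 1 block x 1 thread, 0 B dynamic shared memory, own non-blocking stream
Device_printf_GPU_CLK<<< 1, 1, 0, stream >>>( 1 );

cudaStreamSynchronize( stream );   // wait for the asynchronously queued kernel
cudaStreamDestroy( stream );
```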