Parallelising word2vec pseudocode to work on the GPU with CUDA: overcoming unaligned memory accesses


Below is the pseudocode I am trying to parallelise (taken from the word2vec C code). First, I will list the data structures with their corresponding sizes, and then the pseudocode:

1.  long long sen[MAX_SENTENCE_LENGTH]  
// In the C code, MAX_SENTENCE_LENGTH = 1000. Increasing this should be  
//fine.

2.  float neu1[N] (hidden layer values)
//N is the length of each vector. For now, max N = 400

3.  float neu1e[N] (hidden layer error values)

4.  float syn0[V * N] (input to hidden layer weight matrix)
// For now, we can assume that V * N is small enough to be stored on the GPU
   // In the test data, V = 72k words

5.  float syn1neg[V * N] (back propagation weights used during negative sampling)

6. float exptable[1000] 
The input to the program is a text file. The program then processes the text one word at a time to build the vocabulary. For example, if my text file contains the sentence

"parallel programming is very interesting"

then the vocabulary would look like this (because the code sorts the vocabulary by word frequency):

After building the vocabulary, the code starts processing the text again, 1000 words at a time. The first 1000 words are stored in sen[MAX_SENTENCE_LENGTH], then a neural network is trained for all the words in sen, and this process continues until the end of the file. For the sentence above, sen would look like this:

[1,2,3,0,0,4]

Assuming that training is done in just one iteration, the pseudocode is as follows:

for sen in text
{ 
    for word in sen
    {

        for (c = 0; c < N; c++) 
            neu1[c] = 0;

        for (c = 0; c < N; c++) 
            neu1e[c] = 0;   

       /*The variable window is a user supplied parameter. 
        It is used to consider the context  around a word in a sentence. 
        For example, if I am looking at the first word in the sentence
        (target word is word1), and window = 5, then the words in the 
        window = {word2, word3, word4, word5}. 
        If I am looking at the third word in the sentence 
        (target word is word3), then window = {word1, word2, word4, word5}*/    

        for word in window
        {
            for (c = 0; c < N; c++) 
                neu1[c] += syn0[c + word * N];
        }

        for (c = 0; c < N; c++) 
            neu1[c] /= window;

        //negative: number of negative samples to provide (assume it to be 
             //between 5 to 25)
        for (d = 0; d < negative + 1; d++) 
        {

            target = sen[random_index]  
            l2 = target * N;
            f = 0;
            for (c = 0; c < N; c++) 
                f += neu1[c] * syn1neg[c + l2];

           gradient = exptable[function of f] //f is calculated in the loop above

           for (c = 0; c < N; c++) 
              neu1e[c] += gradient * syn1neg[c + l2];

           for (c = 0; c < N; c++) 
              syn1neg[c + l2] += gradient * neu1[c];

          } //Negative Sampling ends    

        for word in window
        {
             for (c = 0; c < N; c++) 
                syn0[c + word * N] += neu1e[c];
        }

   } // word in sen loop ends

 } // sen in text loop ends
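
To make the thread layout proposed below concrete, here is a minimal CUDA sketch under my own assumptions (the kernel and parameter names are hypothetical, not from the original code): one thread block per target word in sen, thread c owning component c of the N-float vectors. It covers only the read-only hidden-layer accumulation, because concurrent words racing on the syn1neg and syn0 updates is exactly the open design question.

```cuda
// Hypothetical sketch: one block per target word, blockDim.x >= N (<= 400).
__global__ void hidden_layer_kernel( const float     *syn0,     // [V * N], global memory
                                     const long long *ctx,      // [num_words * max_ctx] context-word indices
                                     const int       *ctx_len,  // [num_words] context sizes
                                           float     *neu1,     // [num_words * N] output
                                     int N, int max_ctx )
{
    int word = blockIdx.x;     // which target word of the sentence
    int c    = threadIdx.x;    // which vector component this thread owns
    if ( c >= N ) return;

    float acc = 0.0f;          // accumulate in a register: no races on neu1
    int   len = ctx_len[word];
    for ( int k = 0; k < len; k++ )
        // threads 0..N-1 read consecutive floats of one syn0 row, so each
        // row read is coalesced even though rows are selected "randomly"
        acc += syn0[ c + ctx[word * max_ctx + k] * N ];

    neu1[word * N + c] = acc / (float) len;
}
```

A plausible launch for one sen batch would be `hidden_layer_kernel<<< num_words, 512 >>>( ... )` with 512 >= N; the write phases (the syn1neg and syn0 updates) are where inter-word races appear, which is why they may warrant separate kernels or atomics.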
I think the best way to parallelise this is to process the words in a sentence in parallel. Considering all the loops, I think I should use N threads per word, so that a single thread accesses global memory (syn0, syn1neg) only once per loop. Also, since all the neu1 and neu1e updates are independent, they can reside in the private memory of the threads and be updated independently.

My main concerns right now are:

  • Global memory accesses occur in a random fashion, because syn0 and
    syn1neg are accessed based on the value of the word variable (its index
    in the vocabulary). And, as we have seen, the words in a sentence do
    not occur in any particular order.

  • Is this a big problem? Or can we hide the memory latency by giving the
    GPU a sufficient number of threads? Also, I don't understand whether
    this access pattern is really random, because N threads/word will
    access sequential data in syn0 and syn1neg, but the next set of N
    threads may access sequential data that is far away in memory.
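
The index arithmetic behind that second point can be checked with a tiny CPU-side model (plain C; the helper is hypothetical, purely to illustrate the stride pattern): thread t working on vocabulary entry word touches syn0[word * N + t], so threads within one word are stride-1 (one coalescable row), and only the jump between rows is random.

```c
/* Hypothetical model of the per-thread global index proposed above:
   thread t of the group handling vocabulary entry `word` reads
   syn0[word * N + t]. */
long long syn0_index(long long word, int N, int t)
{
    return word * (long long)N + t;
}
```

Adjacent threads therefore always cover one contiguous N-float segment per word; the scattering of the segments themselves mostly affects cache behaviour rather than intra-warp coalescing.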

  • In the negative-sampling loop, a reduction operation needs to be
    performed. The variable f is a sum of dot products. The problem is
    that I plan to keep neu1 in the private memory of each thread, while
    syn1neg is in global memory.

  • Will negative sampling require a separate kernel? It looks like it
    needs an approach different from just launching N threads/word, but
    I'm not sure which approach would work best.
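
On the reduction question, a common pattern (my sketch, not from the original code) is an in-block shared-memory tree reduction, which fits the planned N-threads-per-word layout without a separate kernel, because __syncthreads() synchronises exactly the threads that hold the neu1 components:

```cuda
// Hypothetical block-wide dot product: f = sum_c neu1[c] * syn1neg[l2 + c].
// Assumes blockDim.x is a power of two and >= N (e.g. 512 for N <= 400);
// each thread passes in its own register-resident neu1 component.
__device__ float block_dot( float neu1_c, const float *syn1neg,
                            long long l2, int N )
{
    __shared__ float partial[512];
    int c = threadIdx.x;

    partial[c] = ( c < N ) ? neu1_c * syn1neg[l2 + c] : 0.0f;
    __syncthreads();

    // classic tree reduction: halve the number of active threads each step
    for ( int stride = blockDim.x / 2; stride > 0; stride >>= 1 ) {
        if ( c < stride ) partial[c] += partial[c + stride];
        __syncthreads();
    }
    return partial[0];   // every thread reads back the same reduced f
}
```

Each thread keeps its neu1 component in a register, stages one product into shared memory, and after log2(blockDim.x) halving steps every thread can read the same f from partial[0] to apply the gradient.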

    Apart from these issues, please point out anything else that is problematic about the way I am approaching this code.

    Preamble: you have opened a can of worms here (even with no SLOC
    present), so hopefully you will accept the comments below part by
    part, as elephant-slicing seems to be the only way to address the
    complex topic as a whole, rather than "escaping" from the principal
    problems into the comfort zone of individual lines of code in the
    implementation domain, where the big picture is usually missed, if it
    was not already lost beforehand.


    A1: Yes, this is a principal fact (a.k.a. "problem")

    GPU devices are designed in silicon and optimised as SIMD
    (single-instruction, multiple-data) hardware architectures. They
    therefore perform best on a code + data layout that never needs (over
    its whole lifetime) more than a small memory area (kilobytes) that
    fits into the on-chip memory area of the SIMD SM core (no spill-overs,
    eventually LRU-maintained L1 cache), and that thus does not introduce
    any of the "idealisation"-devastating latency penalties of roughly
    350-700 ns per gloMEM access.

    [B]-Fig.2 states:

    Tesla SM has 8 [SMX ...]
    
    __global__ void
    __launch_bounds__( 1,     // dim3 tbGridSIZE <maxThreadsPerBLOCK>         COMPILE-TIME_ADVICE_FOR_OPTIMISING_COMPILER_____________REGISTERs, CACHE_, FETCH_, PROXIMITY_PATTERNs ANALYSES
                       1  /*, // dim3 tBlockSIZE <minBlocksPerMULTIPROCESSOR> COMPILE-TIME_ADVICE_FOR_OPTIMISING_COMPILER_____________OPTIMUM_SCHEDULE_TO_FILL_FETCH_LATENCIES
                       ?,     // iAsyncSeqOfCmdsQUEUE_Stream_ID <<- TO LET BE FREELY ASSIGNABLE ... NON-BLOCKING EXEC'd KERNEL
                       0  */  // iSharedMemSIZE
                       )
                     Device_printf_GPU_CLK( int const iTag ){
                            ...
                            return;
    }
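
    For completeness, a host-side launch of this kernel might look as
    follows (a sketch; the <<< 1, 1 >>> configuration mirrors the
    __launch_bounds__( 1, 1 ) advice above, and the commented-out stream
    and shared-memory slots map onto the third and fourth launch
    parameters):

```cuda
cudaStream_t stream;
cudaStreamCreate( &stream );

// 1 block x 1 thread, 0 B dynamic shared memory, own non-blocking stream
Device_printf_GPU_CLK<<< 1, 1, 0, stream >>>( 1 );

cudaStreamSynchronize( stream );   // wait for the asynchronously queued kernel
cudaStreamDestroy( stream );
```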