CUDA: parallelizing pseudocode to work on a GPU: overcoming unaligned memory access
Below is the pseudocode I am trying to parallelize (taken from the word2vec C code). First I will list the data structures and their corresponding sizes, then the pseudocode:
1. long long sen[MAX_SENTENCE_LENGTH]
   // In the C code, MAX_SENTENCE_LENGTH = 1000. Increasing this should be fine.
2. float neu1[N] (hidden layer values)
   // N is the length of each vector. For now, max N = 400
3. float neu1e[N] (hidden layer error values)
4. float syn0[V * N] (input to hidden layer weight matrix)
   // For now, we can assume that V * N is small enough to be stored on the GPU
   // In the test data, V = 72k words
5. float syn1neg[V * N] (back-propagation weights used during negative sampling)
6. float exptable[1000]
The input to the program is a text file. The program then processes the text one word at a time to build the vocabulary. For example, if my text file contains the sentence

"parallel programming is very interesting"

then the vocabulary looks like this (because the code sorts the vocabulary by word frequency):

After building the vocabulary, the code starts processing the text again, 1000 words at a time. The first 1000 words are stored in sen[MAX_SENTENCE_LENGTH], then a neural network is trained for all the words in sen, and this process continues until the end of the file. For the sentence above, sen would look like [1, 2, 3, 0, 0, 4].

Assuming training is done in just one iteration, the pseudocode is as follows:
for sen in text
{
for word in sen
{
for (c = 0; c < N; c++)
neu1[c] = 0;
for (c = 0; c < N; c++)
neu1e[c] = 0;
/*The variable window is a user supplied parameter.
It is used to consider the context around a word in a sentence.
For example, if I am looking at the first word in the sentence
(target word is word1), and window = 5, then the words in the
window = {word2, word3, word4, word5}.
If I am looking at the third word in the sentence
(target word is word3), then window = {word1, word2, word4, word5}*/
for word in window
{
for (c = 0; c < N; c++)
neu1[c] += syn0[c + word * N];
}
for (c = 0; c < N; c++)
neu1[c] /= window;
//negative: number of negative samples to provide (assume it to be
//between 5 to 25)
for (d = 0; d < negative + 1; d++)
{
target = sen[random_index]
l2 = target * N;
f = 0;
for (c = 0; c < N; c++)
f += neu1[c] * syn1neg[c + l2];
gradient = exptable[function of f] //f is calculated in the loop above
for (c = 0; c < N; c++)
neu1e[c] += gradient * syn1neg[c + l2];
for (c = 0; c < N; c++)
syn1neg[c + l2] += gradient * neu1[c];
} //Negative Sampling ends
for word in window
{
for (c = 0; c < N; c++)
syn0[c + word * N] += neu1e[c];
}
} // word in sen loop ends
} // sen in text loop ends
I think the best way to parallelize this is to process the words in a sentence in parallel. Considering all the loops, I think I should use N threads per word, so that a single thread accesses global memory (syn0, syn1neg) only once per loop. Also, since all the neu1 and neu1e updates are independent, they can reside in the threads' private memory and be updated independently.

My main concerns right now are:

1. syn0 and syn1neg are accessed based on the value of the word variable (its index in the vocabulary), and, as we can see, the words in a sentence do not appear in any order. This leads to unaligned memory accesses.
2. f is the sum of a dot product. The problem is that I plan to store neu1 in each thread's private memory, while syn1neg is in global memory, so the per-thread partial products have to be combined across threads.
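Regarding the second concern: combining per-thread partial products of neu1[c] * syn1neg[l2 + c] into a single f is what a shared-memory tree reduction does on a GPU. Below is a minimal C sketch emulating that pattern sequentially, assuming N is a power of two; the dot_reduce helper is hypothetical, not part of word2vec.

```c
/* Emulates how N "threads" could reduce their partial products
   neu1[c] * syn1neg[l2 + c] into a single f, the way a shared-memory
   tree reduction would on the GPU. ASSUMPTION: N is a power of two
   (the word2vec code has no such restriction). */
static float dot_reduce(const float *neu1, const float *syn1neg,
                        long long l2, int N, float *partial) {
    /* Step 1: each "thread" c writes its partial product. */
    for (int c = 0; c < N; c++)
        partial[c] = neu1[c] * syn1neg[l2 + c];
    /* Step 2: tree reduction. Each halving of the stride corresponds to
       one __syncthreads()-separated round on the GPU. */
    for (int stride = N / 2; stride > 0; stride /= 2)
        for (int c = 0; c < stride; c++)
            partial[c] += partial[c + stride];
    return partial[0];
}
```

On the device, `partial` would live in shared memory, so the reduction never touches global memory after the initial loads.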
Hopefully you can accept the comments stated in the sections below, because "elephant slicing" seems to be the only way to attack a complex subject as a whole, rather than "escaping" from the principal issues into the comfort zone of a few isolated source lines of code (SLOC) in the implementation domain, which usually misses the big picture, if the big picture was not already lost beforehand.
A1:

Yes, this is a principal fact (also known as the "problem").

GPU devices are designed and optimized in silicon as SIMD (Single Instruction, Multiple Data) hardware architectures. They therefore perform best on code-plus-data layouts that never need, throughout their whole lifetime, more than small memory areas (a few KB) that fit into the SIMD SM core's on-chip memory (the SM registers, with no spillover, and eventually the LRU-maintained L1 cache), and thus do not introduce any of the "idealised"-performance-devastating latency penalties of roughly 350 to 700 ns per gloMEM (global memory) access.

[B], Fig. 2 states: Tesla has an SM with 8 [SMX…
__global__ void
__launch_bounds__( 1, // dim3tbGridSIZE <maxThreadsPerBLOCK> COMPILE-TIME_ADVICE_FOR_OPTIMISING_COMPILER_____________REGISTERs, CACHE_, FETCH_, PROXIMITY_PATTERNs ANALYSES
1 /*, // dim3tBlockSIZE <minBlocksPerMULTIPROCESSOR> COMPILE-TIME_ADVICE_FOR_OPTIMISING_COMPILER_____________OPTIMUM_SCHEDULE_TO_FILL_FETCH_LATENCIES
?, // iAsyncSeqOfCmdsQUEUE_Stream_ID <<- TO LET BE FREELY ASSIGNABLE ... NON-BLOCKING EXEC'd KERNEL
0 */ // iSharedMemSIZE
)
Device_printf_GPU_CLK( int const iTag ){
...
return;
}