CUDA dynamic parallelism and computing vortex-induced velocity components with the Biot-Savart law

c cuda

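I am using CUDA to solve the Biot-Savart law, to determine the velocity at NP points induced by NV line-element vortices (where NP ~ 2NV ~ 10^7).

By the nature of the problem, every vortex influences every point. It therefore makes sense to assign each point to a thread and have that thread compute the influence of all NV vortices on that point.

Although occupancy is apparently very high (NP >> number of processors), parallelising the problem this way with CUDA runs very slowly (i.e. it takes about as long as running on the CPU alone). I suspect this is because the kernel is rather complex: each kernel invocation contains a for loop that runs through ~10^7 calculations.

I have considered using dynamic parallelism, where the parent thread for each point spawns multiple threads (one per vortex) to replace the inner for loop. In other words, the child kernel is the simple element: it computes the interaction between a single vortex and a single point.

I'm not a computer scientist, so while I was able to implement a basic CUDA approach (see 1 below), I am really struggling to work out the indexing I need even to try benchmarking a dynamic approach. Please excuse the naivety here.

So I have two questions:

1. Am I right? Would others expect the dynamic approach to be more efficient than just parallelising the outer loop (as in pseudocode snippet 1 below)?

2. What is the best/correct indexing strategy? I just can't work out how threadIdx behaves in the children. When I launch a kernel from a parent thread, do that thread's children have their own threadIdx, or does it somehow depend on the parent's threadIdx (in which case, how do I compute the child thread index)?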

Pseudocode examples:

Pre-dynamic CUDA implementation (snippet 1)

__global__ void biot_predynamic_kernel(vector3 *input1, vector3 *input2, vector3 *answer)
{
    // Get current thread's index
    int idx = blockIdx.x*blockDim.x + threadIdx.x;

    // Sizes and other inputs omitted for clarity

    // Counter up to the size of the input array NV
    int j;

    // Temporary storage for accumulating result
    vector3 temp; 

    if(idx < NP)
    {
        // initialise tempUind and answer[idx] to 0.0

        for(j=0; j<NV; j++)
        {
            // contribution of the current vortex element 
            temp = someInlineFcn(input1[j], input2[idx]);

            // accumulate contributions from each vortex element
            answer[idx] = temp + answer[idx];
        }
    }
}


int main(void)
{

    // allocate arrays of vector3 type and do checks
    // read in data
    // Get number of output points and use that to set blockcount and threadsperblock

    dim3 dimGrid(blockcount);
    dim3 dimBlock(threadsperblock);

    // arguments follow the kernel signature: vortices, points, output
    biot_predynamic_kernel<<<dimGrid,dimBlock>>>(input1, input2, answer);

    // Block execution until device has completed
    // (cudaDeviceSynchronize replaces the deprecated cudaThreadSynchronize)
    cudaDeviceSynchronize();

    // Check for errors and write results
}

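Your problem appears to fall into the category of N-body problems, in which you have to compute the effect of N sources on N observation points. From your code snippet #1, it is clear that you are using a "brute-force" approach, which scales as O(N^2). Before trying dynamic parallelism, it would be worth studying tree-based methods for N-body problems, which reduce the computational complexity to the order of O(N log N). Some pointers can be found at: [links missing]

The basic idea is to exploit the 1/r^2 decay of each vortex's effect, computing the near-field interactions "exactly" and the far-field interactions approximately. This is the same basic idea as the Multilevel Fast Multipole Method (MLFMM) in electrodynamics. The CUDA Toolkit contains an N-body sample, which Nicholas Wilt explains well in The CUDA Handbook. For your own knowledge, you may also wish to look at MLFMM. An example of how the computational theory helps in applying the Biot-Savart law can be found at [link missing].

Finally, I would point out that CUDA dynamic parallelism helps improve performance in two cases:

1. When the algorithm does not lend itself to flat, single-level parallelism; for example, interpolation lends itself to two-level parallelism, see this [link missing].
2. When the discretization of the computational grid is non-uniform over the computational domain and changes with time; the classic example is the turbulence case reported in [link missing].

The CUDA Handbook cited above and the book by David B. Kirk and Wen-mei W. Hwu both give a good discussion of the performance improvements that CUDA dynamic parallelism can bring.

CUDA dynamic parallelism is not primarily about performance. It is a feature intended to make programs easier to implement in a programmer-friendly way, to increase opportunities for code reuse, to improve programmer productivity, and perhaps to improve code readability. However, it is unlikely to help if the machine is already fully utilized by the first implementation. Instead, as @JackOLantern suggests, you may want to revisit your basic algorithm. Analysis-driven optimization may also help.

@RobertCrovella CUDA dynamic parallelism impr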
Dynamic CUDA implementation (snippet 2 of the question)

__global__ void biot_dynamic_child_kernel(vector3 *input1, vector3 point, vector3 *result)
{
    // Get current thread indices...?????
    int vortexIdx = ??; // index to the current vortex in the input1 vector

    if(vortexIdx < NV)
    {
        // contribution of the current vortex element to the current point
        vector3 temp = someInlineFcn(input1[vortexIdx], point);

        // accumulate atomically into the parent's result; CUDA's atomicAdd works on
        // scalars, so add each component (assumes vector3 has float members x, y, z)
        atomicAdd(&result->x, temp.x);
        atomicAdd(&result->y, temp.y);
        atomicAdd(&result->z, temp.z);
    }
}

__global__ void biot_dynamic_parent_kernel(vector3 *input1, vector3 *input2, vector3 *answer)
{
    // Get current thread index
    int pointIdx = ???;

    if(pointIdx < NP)
    {
        // Initialise tempUind and answer[idx] to 0.0

        // parallelise the inner loop over all filaments - atomic addition inside the kernel prevents overwrite of answer[pointIdx] by concurrent threads
        // the point is passed by value, the result slot by address
        biot_dynamic_child_kernel<<<???,???>>>(input1, input2[pointIdx], &answer[pointIdx]);

        // do I need to sync threads here? If so how to sync threads spawned by just this parent as opposed to all parents?
    }
}


int main(void)
{

    // allocate arrays of vector3 type and do checks
    // read in data
    // Get number of output points and use that to set blockcount and threadsperblock ???

    dim3 dimGrid(blockcount); //????
    dim3 dimBlock(threadsperblock); //????

    // arguments follow the kernel signature: vortices, points, output
    biot_dynamic_parent_kernel<<<dimGrid,dimBlock>>>(input1, input2, answer);

    // Block execution until device has completed
    // (cudaDeviceSynchronize replaces the deprecated cudaThreadSynchronize)
    cudaDeviceSynchronize();

    // Check for errors and write results
}
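On the indexing question: a grid launched from device code is an ordinary grid with its own `gridDim`, `blockDim`, `blockIdx` and `threadIdx`. These start from zero for every launch and do not depend on the parent thread's indices, so anything the child needs from the parent (here the point and the output slot) has to be passed as a kernel argument. The following is a minimal sketch under assumptions, not a recommended design: `vec3`, `interaction`, `blocks_for`, `child_kernel` and `parent_kernel` are hypothetical names, and `interaction` is placeholder arithmetic rather than the Biot-Savart kernel. The device code is compiled only by nvcc, which needs relocatable device code (`-rdc=true`) and a device of compute capability 3.5 or higher.

```c
#include <assert.h>

#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD /* plain C build: the CUDA qualifiers vanish */
#endif

typedef struct { float x, y, z; } vec3;   /* hypothetical stand-in for vector3 */

/* Number of blocks needed to cover n items with `threads` threads per block. */
HD static int blocks_for(int n, int threads)
{
    assert(threads > 0);
    return (n + threads - 1) / threads;
}

#ifdef __CUDACC__
/* Placeholder for someInlineFcn: NOT Biot-Savart, just the separation vector. */
__device__ vec3 interaction(vec3 vortex, vec3 point)
{
    vec3 d = { point.x - vortex.x, point.y - vortex.y, point.z - vortex.z };
    return d;
}

/* Child grid: one thread per vortex, all contributing to the same point. */
__global__ void child_kernel(const vec3 *vortices, int nv, vec3 point, vec3 *result)
{
    /* Indices of the child grid itself; unrelated to the parent's threadIdx. */
    int vortexIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (vortexIdx < nv) {
        vec3 v = interaction(vortices[vortexIdx], point);
        /* atomicAdd is defined for scalars, so add component by component */
        atomicAdd(&result->x, v.x);
        atomicAdd(&result->y, v.y);
        atomicAdd(&result->z, v.z);
    }
}

/* Parent grid: one thread per point, each launching its own child grid. */
__global__ void parent_kernel(const vec3 *vortices, int nv,
                              const vec3 *points, int np, vec3 *answer)
{
    int pointIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (pointIdx < np) {
        /* zero the output first; global writes made before the launch are visible to the child */
        answer[pointIdx].x = 0.0f;
        answer[pointIdx].y = 0.0f;
        answer[pointIdx].z = 0.0f;
        child_kernel<<<blocks_for(nv, 256), 256>>>(vortices, nv,
                                                   points[pointIdx], &answer[pointIdx]);
        /* No explicit sync here: the parent grid is not considered complete until
           its child grids have finished, so a host-side cudaDeviceSynchronize()
           after the parent launch observes the final sums. */
    }
}
#endif
```

With NP ~ 10^7 this design issues one device-side launch per point and one atomic update per vortex-point pair, so it should be benchmarked against the flat kernel rather than assumed to be faster, in line with the comments above.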