How do CUDA's nppiMalloc... functions guarantee alignment?
Something that has confused me for a while is the alignment requirement for allocated CUDA memory. I know that if rows are aligned, accessing row elements is much more efficient.

First, some background. According to the CUDA C Programming Guide (section 5.3.2):

"Global memory resides in device memory, and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions."

My understanding is that for a 2D interleaved array of type T (say, pixel values in R, G, B order), if numChannels*sizeof(T) is 4, 8, or 16, then the array has to be allocated with cudaMallocPitch if performance matters. So far this has worked fine for me. Before allocating a 2D array I check numChannels*sizeof(T), and if it is 4, 16, or 32, I allocate it with cudaMallocPitch and everything works.
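To make the quoted rule concrete, here is a minimal host-side sketch of what "naturally aligned" means, and of why a misaligned row start can cost an extra transaction. The function names are made up for this illustration, and the segment count is a simplified model of the rule quoted from the guide, not a description of what any particular GPU actually does.

```cpp
#include <cassert>
#include <cstdint>

// True if a segment of `size` bytes starting at `address` is naturally
// aligned, i.e. its first address is a multiple of its size.
inline bool isNaturallyAligned(std::uint64_t address, std::uint64_t size)
{
    assert(size == 32 || size == 64 || size == 128); // sizes named in the guide
    return address % size == 0;
}

// Simplified model: number of aligned `size`-byte segments that overlap
// the byte range [address, address + numBytes).
inline std::uint64_t segmentsTouched(std::uint64_t address,
                                     std::uint64_t numBytes,
                                     std::uint64_t size)
{
    assert(size > 0 && numBytes > 0);
    std::uint64_t first = address / size;
    std::uint64_t last  = (address + numBytes - 1) / size;
    return last - first + 1;
}
```

Under this model, reading 128 bytes starting at address 0 touches one 128-byte segment, while the same read starting at address 64 touches two. That is the reason a padded row pitch can pay off.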
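Now the question:

I have realized that NVIDIA's NPP library has a family of allocator functions (nppiMalloc..., such as nppiMalloc_32f_C1 and so on). NVIDIA recommends using these functions for performance. My question is: how do these functions guarantee alignment? More specifically, what kind of math do they use to come up with a suitable value for the pitch?

For a single-channel 512x512 pixel image (float pixel values in the range [0, 1]) I used both cudaMallocPitch and nppiMalloc_32f_C1. cudaMallocPitch gave me a pitch of 2048, while nppiMalloc_32f_C1 gave me 2560. Where does the latter number come from, and how exactly is it computed?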
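For reference, the pitch returned by either allocator is the distance in bytes between the starts of consecutive rows, so the two results above differ only in how much padding follows each row. The sketch below shows the usual way to address an element in a pitched buffer and to compute that padding. The helper names are hypothetical and the code runs on the host purely for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Returns a pointer to element (x, y) of a pitched 2D array.
// `pitchBytes` is the byte distance between the starts of consecutive rows.
template <typename T>
inline T* pitchedElement(void* base, std::size_t pitchBytes,
                         std::size_t x, std::size_t y)
{
    assert(base != nullptr);
    char* row = static_cast<char*>(base) + y * pitchBytes; // step rows in bytes
    return reinterpret_cast<T*>(row) + x;                  // step columns in elements
}

// Bytes at the end of each row that are padding rather than pixel data.
inline std::size_t paddingBytes(std::size_t widthPixels,
                                std::size_t bytesPerPixel,
                                std::size_t pitchBytes)
{
    assert(pitchBytes >= widthPixels * bytesPerPixel);
    return pitchBytes - widthPixels * bytesPerPixel;
}
```

For the 512-pixel float rows in the question, a pitch of 2048 leaves no padding, while a pitch of 2560 leaves 512 bytes of padding per row.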
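Why I care about this: I am writing a synchronized memory class template for keeping values on the GPU and the CPU in sync. This class is supposed to take care of allocating pitched memory (where possible) under the hood. Since I want the class to be interoperable with NVIDIA's NPP, I would like to handle all allocations in a way that gives good performance for CUDA kernels as well as NPP operations.

My impression was that nppiMalloc calls cudaMallocPitch under the hood, but it seems I was wrong.

An interesting question. However, there may be no definite answer at all, for several reasons. The implementation of these functions is not publicly available, so one has to assume that NVIDIA uses some special tricks and tweaks internally. Moreover, the resulting pitch is not specified, so one has to assume that it may change between releases of CUDA/NPP. In particular, it is not unlikely that the actual pitch depends on the hardware version (the "compute capability") of the device the function is executed on.

Nevertheless, I was curious about this and wrote the following test: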
#include <stdio.h>
#include <npp.h>

// For one nppiMalloc variant, prints the largest width (in pixels) that
// still receives each distinct stride (pitch in bytes).
template <typename T>
void testStepBytes(const char* name, int elementSize, int numComponents,
                   T (*allocator)(int, int, int*))
{
    printf("%s\n", name);
    int dw = 1;
    int prevStepBytes = 0;
    for (int w=1; w<2050; w+=dw)
    {
        int stepBytes;
        void *p = allocator(w, 1, &stepBytes);  // one-row image of width w
        nppiFree(p);
        if (stepBytes != prevStepBytes)
        {
            // The stride changed: report the last width that used the old one
            printf("Stride %5d is used up to w=%5d (%6d bytes)\n",
                prevStepBytes, (w-dw), (w-dw)*elementSize*numComponents);
            prevStepBytes = stepBytes;
        }
    }
}

int main(int argc, char *argv[])
{
    // Arguments: name, bytes per component, components per pixel, allocator
    testStepBytes("nppiMalloc_8u_C1", 1, 1, &nppiMalloc_8u_C1);
    testStepBytes("nppiMalloc_8u_C2", 1, 2, &nppiMalloc_8u_C2);
    testStepBytes("nppiMalloc_8u_C3", 1, 3, &nppiMalloc_8u_C3);
    testStepBytes("nppiMalloc_8u_C4", 1, 4, &nppiMalloc_8u_C4);

    testStepBytes("nppiMalloc_16u_C1", 2, 1, &nppiMalloc_16u_C1);
    testStepBytes("nppiMalloc_16u_C2", 2, 2, &nppiMalloc_16u_C2);
    testStepBytes("nppiMalloc_16u_C3", 2, 3, &nppiMalloc_16u_C3);
    testStepBytes("nppiMalloc_16u_C4", 2, 4, &nppiMalloc_16u_C4);

    testStepBytes("nppiMalloc_32f_C1", 4, 1, &nppiMalloc_32f_C1);
    testStepBytes("nppiMalloc_32f_C2", 4, 2, &nppiMalloc_32f_C2);
    testStepBytes("nppiMalloc_32f_C3", 4, 3, &nppiMalloc_32f_C3);
    testStepBytes("nppiMalloc_32f_C4", 4, 4, &nppiMalloc_32f_C4);

    return 0;
}
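This confirmed that for an image of width 512 (with nppiMalloc_32f_C1), a stride of 2560 is used. The expected stride of 2048 is only used for images up to a width of 504.

The numbers seemed a bit odd, so I ran another test for nppiMalloc_8u_C1 that covers all possible image row sizes (in bytes), with larger image sizes, and noticed a strange pattern. The first increase of the pitch (from 512 to 1024) occurred when the row was larger than 480 bytes, and 480 = 512 - 32. The next step (from 1024 to 1536) occurred when the row was larger than 992 bytes, and 992 = 480 + 512. The next step (from 1536 to 2048) occurred when the row was larger than 1536 bytes, and 1536 = 992 + 512 + 32. From there on it mostly proceeds in steps of 512, except for a few sizes in between. The further steps are summarized here: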
nppiMalloc_8u_C1
Stride 0 is used up to w= 0 ( 0 bytes, delta 0)
Stride 512 is used up to w= 480 ( 480 bytes, delta 480)
Stride 1024 is used up to w= 992 ( 992 bytes, delta 512)
Stride 1536 is used up to w= 1536 ( 1536 bytes, delta 544)
Stride 2048 is used up to w= 2016 ( 2016 bytes, delta 480) \
Stride 2560 is used up to w= 2560 ( 2560 bytes, delta 544) | 4
Stride 3072 is used up to w= 3072 ( 3072 bytes, delta 512) |
Stride 3584 is used up to w= 3584 ( 3584 bytes, delta 512) /
Stride 4096 is used up to w= 4064 ( 4064 bytes, delta 480) \
Stride 4608 is used up to w= 4608 ( 4608 bytes, delta 544) |
Stride 5120 is used up to w= 5120 ( 5120 bytes, delta 512) |
Stride 5632 is used up to w= 5632 ( 5632 bytes, delta 512) | 8
Stride 6144 is used up to w= 6144 ( 6144 bytes, delta 512) |
Stride 6656 is used up to w= 6656 ( 6656 bytes, delta 512) |
Stride 7168 is used up to w= 7168 ( 7168 bytes, delta 512) |
Stride 7680 is used up to w= 7680 ( 7680 bytes, delta 512) /
Stride 8192 is used up to w= 8160 ( 8160 bytes, delta 480) \
Stride 8704 is used up to w= 8704 ( 8704 bytes, delta 544) |
Stride 9216 is used up to w= 9216 ( 9216 bytes, delta 512) |
Stride 9728 is used up to w= 9728 ( 9728 bytes, delta 512) |
Stride 10240 is used up to w= 10240 ( 10240 bytes, delta 512) |
Stride 10752 is used up to w= 10752 ( 10752 bytes, delta 512) |
Stride 11264 is used up to w= 11264 ( 11264 bytes, delta 512) |
Stride 11776 is used up to w= 11776 ( 11776 bytes, delta 512) | 16
Stride 12288 is used up to w= 12288 ( 12288 bytes, delta 512) |
Stride 12800 is used up to w= 12800 ( 12800 bytes, delta 512) |
Stride 13312 is used up to w= 13312 ( 13312 bytes, delta 512) |
Stride 13824 is used up to w= 13824 ( 13824 bytes, delta 512) |
Stride 14336 is used up to w= 14336 ( 14336 bytes, delta 512) |
Stride 14848 is used up to w= 14848 ( 14848 bytes, delta 512) |
Stride 15360 is used up to w= 15360 ( 15360 bytes, delta 512) |
Stride 15872 is used up to w= 15872 ( 15872 bytes, delta 512) /
Stride 16384 is used up to w= 16352 ( 16352 bytes, delta 480) \
Stride 16896 is used up to w= 16896 ( 16896 bytes, delta 544) |
Stride 17408 is used up to w= 17408 ( 17408 bytes, delta 512) |
... ... 32
Stride 31232 is used up to w= 31232 ( 31232 bytes, delta 512) |
Stride 31744 is used up to w= 31744 ( 31744 bytes, delta 512) |
Stride 32256 is used up to w= 32256 ( 32256 bytes, delta 512) /
Stride 32768 is used up to w= 32736 ( 32736 bytes, delta 480) \
Stride 33280 is used up to w= 33280 ( 33280 bytes, delta 544) |
Stride 33792 is used up to w= 33792 ( 33792 bytes, delta 512) |
Stride 34304 is used up to w= 34304 ( 34304 bytes, delta 512) |
... ... 64
Stride 64512 is used up to w= 64512 ( 64512 bytes, delta 512) |
Stride 65024 is used up to w= 65024 ( 65024 bytes, delta 512) /
Stride 65536 is used up to w= 65504 ( 65504 bytes, delta 480) \
Stride 66048 is used up to w= 66048 ( 66048 bytes, delta 544) |
Stride 66560 is used up to w= 66560 ( 66560 bytes, delta 512) |
Stride 67072 is used up to w= 67072 ( 67072 bytes, delta 512) |
.... ... 128
Stride 130048 is used up to w=130048 (130048 bytes, delta 512) |
Stride 130560 is used up to w=130560 (130560 bytes, delta 512) /
Stride 131072 is used up to w=131040 (131040 bytes, delta 480) \
Stride 131584 is used up to w=131584 (131584 bytes, delta 544) |
Stride 132096 is used up to w=132096 (132096 bytes, delta 512) |
... | guess...
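There is obviously a pattern. The pitches are multiples of 512. For sizes of 512*2^n, with n a non-negative integer, there are some odd -32 and +32 offsets in the size limits that cause a larger pitch to be used.

Maybe I will have another look at this. I am pretty sure one could derive a formula that covers this odd progression of pitches. But again: this may depend on the underlying CUDA version, the NPP version, or even the compute capability of the card that is used.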
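As a rough attempt at such a formula, the sketch below encodes one rule that happens to reproduce every threshold in the listing above: round the row size up to a multiple of 512, and if that result is a power of two and the row is within 32 bytes of it, move on to the next multiple of 512. This is only an empirical fit to these measurements. The name modelNppiPitch is made up, nothing here comes from NVIDIA, and the real behavior may differ on other CUDA/NPP versions or devices.

```cpp
#include <cassert>
#include <cstddef>

// Rounds `value` up to the next multiple of `multiple`.
inline std::size_t roundUp(std::size_t value, std::size_t multiple)
{
    assert(multiple > 0);
    return (value + multiple - 1) / multiple * multiple;
}

inline bool isPowerOfTwo(std::size_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

// Hypothetical model, fitted to the measured nppiMalloc_8u_C1 listing only.
// `rowBytes` is the row size in bytes (width * components * element size).
inline std::size_t modelNppiPitch(std::size_t rowBytes)
{
    assert(rowBytes > 0);
    std::size_t pitch = roundUp(rowBytes, 512);
    // Observed quirk: when the rounded pitch is a power of two, rows longer
    // than (pitch - 32) bytes already get the next multiple of 512.
    if (isPowerOfTwo(pitch) && rowBytes > pitch - 32)
        pitch += 512;
    return pitch;
}
```

For the 512-pixel float image from the question, the row is 2048 bytes, which is a power of two and therefore falls into the early-jump case, giving 2560 as observed.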
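And, just for completeness: this strange pitch size may simply be a bug in NPP. You never know.

I thought I would contribute a list for a few other allocation types. I am using a GTX 860M and CUDA version 7.5.

cudaMallocPitch aligns to the textureAlignment property, not to texturePitchAlignment as I had suspected. nppiMalloc also aligns to the textureAlignment boundary, but it sometimes over-allocates and jumps to the next 512 bytes early.

Since all of these functions align each row to textureAlignment instead of the smaller texturePitchAlignment, more space is used, but a texture should be able to bind to any starting row without needing a byte offset in the address calculation. The documentation may not be clear about textures, but it turns out that they need a row pitch that is a multiple of 32 (on this generation of hardware; this is the texturePitchAlignment property), and the address of the starting point must be a multiple of 128, 256, or 512, depending on the hardware and CUDA version (textureAlignment). Textures can bind to smaller multiples; in my own experience, before I found the right properties, 256-byte alignment seemed to work well.

The 512-byte alignment is rather large, but using it instead of the texturePitchAlignment value may improve performance for both texture and non-texture access. I have not tested this.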
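The point about binding a texture to an arbitrary starting row can be checked with simple arithmetic. The sketch below is host-only, and the function name is invented for this illustration. It tests whether the start of a given row lands on a textureAlignment boundary, using 512 and 32 as example values because those are the properties reported for the device in this answer.

```cpp
#include <cassert>
#include <cstdint>

// True if row `row` of a pitched allocation starts on a boundary of
// `textureAlignment` bytes, so a texture could start there with no offset.
inline bool rowStartIsAligned(std::uint64_t baseAddress,
                              std::uint64_t pitchBytes,
                              std::uint64_t row,
                              std::uint64_t textureAlignment)
{
    assert(textureAlignment > 0);
    return (baseAddress + row * pitchBytes) % textureAlignment == 0;
}
```

With a pitch of 2048 (a multiple of 512) every row start is aligned. With a pitch of 2016 (a multiple of 32 only) row 1 is not aligned to 512, and alignment only recurs every 16 rows.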
#include <stdio.h>
#include <cuda_runtime.h>
#include <npp.h>

int main(int argc, char **argv)
{
    void *dmem;
    int pitch, pitchOld = 0;
    size_t pitch2;
    int iOld = 0;
    int maxAllocation = 5000;  // largest row size to try, in bytes

    // Report the two alignment properties of device 0 (both are size_t)
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("textureAlignment %zu texturePitchAlignment %zu\n",
           prop.textureAlignment, prop.texturePitchAlignment);

    // cudaMallocPitch: the width is given in bytes
    printf("cudaMallocPitch\n");
    for (int i = 0; i < maxAllocation; ++i) {
        cudaMallocPitch(&dmem, &pitch2, i, 1);
        if ((int)pitch2 != pitchOld && i != 0) {
            // The pitch changed: print the width range that used the old one
            printf("width %dto%d -> pitch %d\n", iOld, i-1, pitchOld);
            pitchOld = (int)pitch2;
            iOld = i;
        }
        cudaFree(dmem);
    }

    // nppiMalloc_8u_C1: the width is given in 1-byte pixels
    pitchOld = 0;
    iOld = 0;  // reset so the first printed range starts at 0 again
    printf("nppiMalloc_8u_C1\n");
    for (int i = 0; i < maxAllocation / (int)sizeof(Npp8u); ++i) {
        dmem = nppiMalloc_8u_C1(i, 1, &pitch);
        if (pitch != pitchOld && i != 0) {
            printf("width %dto%d -> pitch %d\n", iOld, i-1, pitchOld);
            pitchOld = pitch;
            iOld = i;
        }
        nppiFree(dmem);  // NPP allocations are released with nppiFree
    }

    // nppiMalloc_32f_C1: the width is given in 4-byte pixels
    pitchOld = 0;
    iOld = 0;
    printf("nppiMalloc_32f_C1\n");
    for (int i = 0; i < maxAllocation / (int)sizeof(Npp32f); ++i) {
        dmem = nppiMalloc_32f_C1(i, 1, &pitch);
        if (pitch != pitchOld && i != 0) {
            printf("width %dto%d -> pitch %d\n", iOld, i-1, pitchOld);
            pitchOld = pitch;
            iOld = i;
        }
        nppiFree(dmem);
    }
    return 0;
}
textureAlignment 512 texturePitchAlignment 32
cudaMallocPitch
width 0to0 -> pitch 0
width 1to512 -> pitch 512
width 513to1024 -> pitch 1024
width 1025to1536 -> pitch 1536
width 1537to2048 -> pitch 2048
width 2049to2560 -> pitch 2560
width 2561to3072 -> pitch 3072
width 3073to3584 -> pitch 3584
width 3585to4096 -> pitch 4096
width 4097to4608 -> pitch 4608
nppiMalloc_8u_C1
width 0to0 -> pitch 0
width 1to480 -> pitch 512
width 481to992 -> pitch 1024
width 993to1536 -> pitch 1536
width 1537to2016 -> pitch 2048
width 2017to2560 -> pitch 2560
width 2561to3072 -> pitch 3072
width 3073to3584 -> pitch 3584
width 3585to4064 -> pitch 4096
width 4065to4608 -> pitch 4608
nppiMalloc_32f_C1
width 0to0 -> pitch 0
width 1to120 -> pitch 512
width 121to248 -> pitch 1024
width 249to384 -> pitch 1536
width 385to504 -> pitch 2048
width 505to640 -> pitch 2560
width 641to768 -> pitch 3072
width 769to896 -> pitch 3584
width 897to1016 -> pitch 4096
width 1017to1152 -> pitch 4608
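The cudaMallocPitch part of this output is consistent with simply rounding the row size in bytes up to the next multiple of textureAlignment (512 on this device). The sketch below models that and also converts pixel widths to bytes, which makes the nppiMalloc_32f_C1 thresholds comparable to the 8u ones. Both function names are hypothetical, and the default alignment of 512 is an assumption taken from this particular output rather than a guaranteed constant.

```cpp
#include <cassert>
#include <cstddef>

// Row size in bytes for `widthPixels` pixels of `channels` components,
// each `bytesPerComponent` bytes wide.
inline std::size_t rowBytes(std::size_t widthPixels,
                            std::size_t bytesPerComponent,
                            std::size_t channels)
{
    return widthPixels * bytesPerComponent * channels;
}

// Model of the cudaMallocPitch listing above: round the row size up to
// the next multiple of `alignment` (assumed to be 512 on the tested device).
inline std::size_t modelCudaPitch(std::size_t widthBytes,
                                  std::size_t alignment = 512)
{
    assert(alignment > 0);
    return (widthBytes + alignment - 1) / alignment * alignment;
}
```

For example, the 32f thresholds of 120, 248, and 504 pixels correspond to 480, 992, and 2016 bytes, the same limits as in the nppiMalloc_8u_C1 listing, and a 512-pixel float row (2048 bytes) maps to a modeled cudaMallocPitch pitch of 2048, as reported in the question.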