CUDA - the same algorithm works on the CPU, but not on the GPU


I am currently working on my first CUDA project and I have come across something strange that must be inherent to CUDA and that I either do not understand or am overlooking. The same algorithm (literally the same one, and it involves no parallel work) works on the CPU but not on the GPU.

Let me explain in more detail. I threshold using repeated computation, but with reduced transfer time. Long story short, this function:

__device__ double computeThreshold(unsigned int* histogram, int* nbPixels){
    double sum = 0;
    for (int i = 0; i < 256; i++){
        sum += i*histogram[i];
    }
    int sumB = 0, wB = 0, wF = 0;
    double mB, mF, max = 1, between = 0, threshold1 = 0, threshold2 = 0;
    for (int j = 0; j < 256 && !(wF == 0 && j != 0 && wB != 0); j++){
        wB += histogram[j];
        if (wB != 0) {
            wF = *nbPixels - wB;
            if (wF != 0){
                sumB += j*histogram[j];
                mB = sumB / wB;
                mF = (sum - sumB) / wF;
                between = wB * wF *(mB - mF) *(mB - mF);
                if (max < 2.0){
                    threshold1 = j;
                    if (between > max){
                        threshold2 = j;
                    }
                    max = between;
                }
            }
        }
    }

    return (threshold1 + threshold2) / 2.0;
}
EDIT 4: the specifications of my GPU

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 750M"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 2048 MBytes (2147483648 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            1085 MHz (1.09 GHz)
  Memory Clock rate:                             900 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GT 750M
Result = PASS

Not very revealing, is it?

So, was that really so hard, once a compilable example was provided? I cannot reproduce any error with the code on a 64-bit Linux, compute 3.0 device, CUDA 7.0 release:

$ nvcc -arch=sm_30 -Xptxas="-v" histogram.cu 
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z11imageKernelPjS_PlPd' for 'sm_30'
ptxas info    : Function properties for _Z11imageKernelPjS_PlPd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 34 registers, 352 bytes cmem[0], 16 bytes cmem[2]

$ for i in `seq 1 20`;
> do
>     cuda-memcheck ./a.out
> done
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors

So, if you can reproduce a runtime error where I cannot, then your environment/hardware/toolkit version differs from mine in some subtle way. In any case, the code itself is valid, and you have a platform-specific problem that I cannot reproduce.

OK, it turns out the bug was not on my side after all: Windows decided that 2 s was enough and reset the GPU, stopping my computation right there. Many thanks to @RobertCrovella, without whom I would never have found this. And thanks to everyone who tried to answer this question.
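For reference, the timeout behind this is Windows TDR (Timeout Detection and Recovery), and its delay can be raised through the registry. The key below is documented by Microsoft, but treat this as a sketch: verify the key and value for your Windows version, and note that a reboot is required for the change to take effect.

```reg
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
; Seconds Windows waits before resetting an unresponsive GPU (default: 2).
; 0x0000000a raises the timeout to 10 seconds.
"TdrDelay"=dword:0000000a
```

Raising the delay is a workaround for development; long-running kernels on a WDDM display GPU will still freeze the desktop while they run.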

You need to provide the host code. In this kind of debugging question, unless you can supply the shortest complete code that someone else can copy and paste into an editor, compile, run, and use to reproduce your error, nobody can help you. CUDA ships with tools such as cuda-memcheck for detecting memory access errors. Have you tried using them?

@Talonmes I know it is hard, if not impossible, to find such errors, but I thought there might be a relatively basic principle I had overlooked. I have tried cuda-memcheck, yes, and it found no errors.

@Nico: there may well be a basic principle you have overlooked. But without code I can analyse, I cannot tell you what it is. I cannot analyse what you posted; there are too many undefined variables. You may be hitting the Windows TDR timeout.

It turns out I do have a cuda-memcheck error that, for some reason, did not show up the first time:
========= CUDA-MEMCHECK
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb7802) [0xdb1e2]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc764]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15454]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb7802) [0xdb1e2]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc788]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15454]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuProfilerStop + 0xb7802) [0xdb1e2]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0x160f]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xc7a6]
=========     Host Frame:C:\Users\Nicolas\Cours\3PC\test.exe [0xfe24]
=========     Host Frame:C:\WINDOWS\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x22) [0x13d2]
=========     Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x34) [0x15454]
=========
========= ERROR SUMMARY: 3 errors