CUDA: Why does Thrust's set_intersection return an unexpected result?

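I am using the Thrust library to compute the intersection of two large sets of integers. In tests with two small inputs I get the correct result, but when I use two sets with 10^8 and 65535*1024 elements, I get an empty set. Can anyone explain this? If I change the first two variables to smaller values, Thrust returns the expected intersection. My code is below.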

#include <thrust/set_operations.h>
#include <thrust/functional.h>   // thrust::less
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <iostream>
#include <stdio.h>
#include <stdlib.h>              // malloc


int main() {
    int sizeArrayLonger = 100*1000*1000;
    int sizeArraySmaller = 65535*1024;
    int length_result = sizeArraySmaller;    
    int* list = (int*) malloc(4*sizeArrayLonger);
    int* list_smaller = (int*) malloc(4*sizeArraySmaller);
    int* result = (int*) malloc(4*length_result);

    int* list_gpu;
    int* list_smaller_gpu;
    int* result_gpu;

    // THE NEXT TWO FORS TRANSFORMS THE SMALLER ARRAY IN A SUBSET OF THE LARGER ARRAY
    for (int i=0; i < sizeArraySmaller; i++) {
        list_smaller[i] = i+1;
        list[i] = i+1;
    }
    for (int i=sizeArraySmaller; i < sizeArrayLonger; i++) {
        list[i] = i+1;
    }

    cudaMalloc(&list_gpu, sizeof(int) * sizeArrayLonger);
    cudaMalloc(&list_smaller_gpu, sizeof(int) * sizeArraySmaller);
    cudaMalloc(&result_gpu, sizeof(int) * length_result);

    cudaMemcpy(list_gpu, list, sizeof(int) * sizeArrayLonger, cudaMemcpyHostToDevice);
    cudaMemcpy(list_smaller_gpu, list_smaller, sizeof(int) * sizeArraySmaller, cudaMemcpyHostToDevice);
    cudaMemset(result_gpu, 0, sizeof(int) * length_result);

    typedef thrust::device_ptr<int> device_ptr;

    thrust::set_intersection(device_ptr(list_gpu), device_ptr(list_gpu + sizeArrayLonger), device_ptr(list_smaller_gpu),
        device_ptr(list_smaller_gpu + sizeArraySmaller), device_ptr(result_gpu), thrust::less<int>() );

    // MOVING TO CPU THE MARKER ARRAY OF ELEMENTS OF INTERSECTION SET
    cudaMemcpy(result, result_gpu, sizeof(int)*length_result, cudaMemcpyDeviceToHost);

    cudaDeviceSynchronize();

    // THIS LOOP ITERATES ALL ARRAY NAMED "result" WHERE THE POSITION ARE MARKED WITH 1
    int counter = 0;
    for (int i=0; i < length_result; i++)
        if (result[i]) {
            printf("\n-> %d", result[i]);
            counter++;
        }

    printf("\nTHRUST -> Total of elements: %d\n", counter);

    cudaDeviceReset();

    return 0;
}

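The OP doesn't seem to have visited recently, so I will expand on my comments for the benefit of other readers. (I was hoping for some confirmation that specifying, at compile time, the compute target of the device actually in use would also fix the OP's observation.)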

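According to my testing, the OP's code will:

pass if compiled for a cc2.0 device and run on a cc2.0 device
pass if compiled for a cc3.0 device and run on a cc3.x device
fail if compiled for a cc2.0 device and run on a cc3.x device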

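The last result is somewhat non-intuitive. Normally we think of CUDA code compiled with PTX (e.g. nvcc -arch=sm_20 … or similar) as forward-compatible with future architectures, because the embedded PTX can be JIT-compiled for the newer device at run time.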

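However, there is a trap here (and a related issue in Thrust). It is not uncommon for CUDA code to query the device it is actually running on (e.g. via cudaGetDeviceProperties) and to make decisions, such as kernel configuration decisions, based on the device in use. Specifically, in this case Thrust launches a kernel under the hood and chooses the grid x-dimension for that kernel based on the device actually in use. CC 2.x devices are limited to 65535 for this parameter, whereas CC 3.x and later devices allow up to 2^31-1. So for a large enough data set, if Thrust detects that it is running on a cc3.0 device, it configures this particular kernel with a grid x-dimension larger than 65535. (For a small enough data set it does not do this, so the possible error does not surface. The problem is therefore loosely connected to problem size.)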
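To make the size dependence concrete, here is a minimal host-only sketch of the arithmetic. It assumes a hypothetical kernel that uses 1024 threads per block and one element per thread; Thrust's real launch configuration for set_intersection may differ, so the block counts are illustrative only. The function names are made up for this example; the two grid limits are the documented hardware limits.

```cpp
#include <cassert>
#include <cstdint>

// Number of blocks needed to cover n elements at one element per thread
// (ceiling division). Hypothetical helper, not part of Thrust.
std::int64_t blocks_needed(std::int64_t n, std::int64_t threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Maximum grid x-dimension for a given compute capability major version:
// 65535 up to cc 2.x, 2^31-1 from cc 3.0 onwards.
std::int64_t max_grid_x(int cc_major) {
    return cc_major >= 3 ? 2147483647LL : 65535LL;
}

// True if a 1D grid of `blocks` blocks is launchable under the limits
// of the given compute capability.
bool fits_grid_x(std::int64_t blocks, int cc_major) {
    return blocks <= max_grid_x(cc_major);
}
```

Under these assumptions, the smaller array (65535*1024 elements) needs exactly 65535 blocks, which is right at the cc 2.x limit, while the larger array (10^8 elements) needs more blocks than cc 2.x code can launch in the x-dimension, even though a cc 3.x device would accept the same grid from code built for cc 3.x.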

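If both cc 2.x and cc 3.x PTX (or appropriate SASS) were embedded in the binary, there would still be no problem. However, if only cc2.x PTX is embedded in the binary, the JIT process uses it to create machine code suitable for running on a cc 3.x device (if that is the device being used). This forward-JIT-compiled SASS is still subject to the CC 2.x limitations, including the grid x-dimension limit of 65535. Yet cudaGetDeviceProperties reports that the device is a cc3.x device, so this information is misleading if it is used for this particular decision (the acceptable grid x-dimension).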

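As a result of this sequence, the kernel is configured incorrectly and the kernel launch fails with a particular kind of non-sticky CUDA runtime API error. A non-sticky error of this type does not corrupt the CUDA context, so further CUDA operations are still permitted, and future CUDA API calls will not return this error. To catch such an error, it is necessary to issue a cudaGetLastError() or cudaPeekAtLastError() call after the kernel launch, as is usually recommended. If this is not done, the error is "lost" and cannot be discovered from future CUDA API calls (other than cudaGetLastError() or cudaPeekAtLastError()), because their status return values will not indicate this error or the failed kernel launch.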
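One way to surface such a failure in the OP's scenario is to query the error state right after the Thrust call. The following is a sketch rather than the OP's program: it uses thrust::device_vector and thrust::sequence in place of the manual malloc/cudaMalloc code, and it needs roughly 1 GB of free device memory plus whatever temporary storage Thrust allocates.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/set_operations.h>
#include <thrust/system_error.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <new>

int main() {
    const int n_long  = 100 * 1000 * 1000;  // same sizes as in the question
    const int n_small = 65535 * 1024;

    try {
        thrust::device_vector<int> a(n_long), b(n_small), out(n_small);
        thrust::sequence(a.begin(), a.end(), 1);  // 1, 2, ..., n_long
        thrust::sequence(b.begin(), b.end(), 1);  // 1, 2, ..., n_small (subset of a)

        // The returned iterator marks the end of the intersection in `out`.
        thrust::device_vector<int>::iterator out_end =
            thrust::set_intersection(a.begin(), a.end(),
                                     b.begin(), b.end(),
                                     out.begin());

        // A failed launch leaves a non-sticky error behind; later API calls
        // would not report it, so query it explicitly here.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            std::fprintf(stderr, "CUDA error after set_intersection: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }

        std::printf("intersection size: %ld\n",
                    static_cast<long>(out_end - out.begin()));
    } catch (const thrust::system_error &e) {
        // Thrust reports some CUDA failures by throwing.
        std::fprintf(stderr, "Thrust error: %s\n", e.what());
        return 1;
    } catch (const std::bad_alloc &e) {
        // Thrown when a device allocation fails.
        std::fprintf(stderr, "allocation failed: %s\n", e.what());
        return 1;
    }
    return 0;
}
```

Using the iterator returned by set_intersection to get the result size also avoids scanning the output buffer for non-zero entries. Per the test results above, building for the architecture of the device actually in use (for example nvcc -arch=sm_30 for a cc 3.0 device) is what makes the original case pass; the check here only makes the failure visible.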

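Most of the above can be discovered by careful use of the CUDA profiling tools, such as nvprof, together with cuda-memcheck, in both the passing and the failing case. In the passing case, cuda-memcheck reports no errors, and the profiler shows 8 calls to cudaLaunch as well as 8 kernels actually executed on the GPU. In the failing case, cuda-memcheck reports two kernel launch failures of the kind described above.

The following excerpts from the Thrust source show the kernel launch and the synchronization step that follows it. In host code, synchronize() is only called when __THRUST_SYNCHRONOUS is defined.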

#ifndef __CUDA_ARCH__ 
  kernel<<<(unsigned int) num_blocks, (unsigned int) block_size, (unsigned int) smem_size, stream(thrust::detail::derived_cast(exec))>>>(f); 
#else 
  ...
#endif // __CUDA_ARCH__ 
  synchronize_if_enabled("launch_closure_by_value");

inline __host__ __device__
void synchronize_if_enabled(const char *message)
{
// XXX this could potentially be a runtime decision
//     note we always have to synchronize in __device__ code
#if __THRUST_SYNCHRONOUS || defined(__CUDA_ARCH__)
  synchronize(message);
#else
  // WAR "unused parameter" warning
  (void) message;
#endif
} // end synchronize_if_enabled()

inline __host__ __device__ 
void synchronize(const char *message) 
{ 
  throw_on_error(cudaDeviceSynchronize(), message); 
} // end synchronize()