Compress "sparse data" with CUDA (CCL: connected component labeling reduction)

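I have a list of about 5 million 32-bit integers (actually a 2048 x 2560 image), 90% of which are zero. The nonzero cells are labels (e.g. 2049, 8195, 1334300, 34320923, 4320932) that are not sequential or consecutive in any way (this is the output of our custom connected component labeling (CCL) algorithm). I am working with an NVIDIA Tesla K40, so if this needs any prefix-scan work, I would love it to use shuffle, ballot, or any of the higher compute capability features.

I don't need a fully worked example, just some advice.

To illustrate, consider one blob labeled by our CCL algorithm, in which every cell holds the label 2242.

Other blobs will have a different unique label (e.g. 13282). But all of them will be surrounded by zeros and are ellipsoid in shape. (We optimized our CCL for ellipsoids, which is why we don't use the libraries.) One side effect is that the blob labels are not consecutive numbers. We don't care in what order they are numbered, but we want one blob labeled #1, another labeled #2, and the last one labeled #n, where n is the number of blobs in the image.

What do I mean by "labeled #1"? I mean that every cell holding 2242 should be replaced with 1, every cell holding 13282 should be replaced with 2, and so on.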
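
To pin down the desired output, here is a minimal sequential sketch in standard C++. It is a host-side reference for checking results, not the GPU solution, and the names `relabel_reference` and `check_relabel_reference` are hypothetical. It numbers labels in order of first appearance, which is acceptable here because the numbering order does not matter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical host-side reference: replaces every distinct nonzero label
// with a consecutive id 1..n, in order of first appearance. Zeros stay zero.
std::vector<std::int32_t> relabel_reference(const std::vector<std::int32_t>& labels)
{
  std::unordered_map<std::int32_t, std::int32_t> new_id;  // old label -> new id
  std::int32_t next_id = 1;
  std::vector<std::int32_t> out(labels.size(), 0);
  for (std::size_t i = 0; i < labels.size(); ++i) {
    const std::int32_t old_label = labels[i];
    if (old_label == 0) continue;  // background stays 0
    auto it = new_id.find(old_label);
    if (it == new_id.end())
      it = new_id.emplace(old_label, next_id++).first;  // first time this label is seen
    out[i] = it->second;
  }
  return out;
}

// Small check using the two labels mentioned in the question.
void check_relabel_reference()
{
  const std::vector<std::int32_t> in = {0, 2242, 2242, 0, 13282, 2242, 0, 13282};
  const std::vector<std::int32_t> expected = {0, 1, 1, 0, 2, 1, 0, 2};
  assert(relabel_reference(in) == expected);
}
```

A reference like this is mainly useful for validating a parallel implementation on small inputs. A parallel version may number the blobs in a different order, so compare the blob partition rather than the raw ids.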

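The maximum blob number from our CCL is equal to 2048x2560, so we know the size of the array.

Actually, Robert Crovella already gave a great answer to this a day ago. It was not exact, but I now see how to apply it, so I don't need any more help. But he was so generous with his time and effort, and asked me to rewrite the question with examples, so I did.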

One possible approach is to use the following sequence:

thrust::transform - to convert the input data to all 1 or 0:

0 27 42  0 18 99 94 91  0  -- input data
0  1  1  0  1  1  1  1  0  -- this will be our "mask vector"

thrust::inclusive_scan - to convert the mask vector into a progressive sequence:

0  1  1  0  1  1  1  1  0  -- "mask" vector
0  1  2  2  3  4  5  6  6  -- "sequence" vector

Another thrust::transform - to mask the non-increasing values:

0  1  1  0  1  1  1  1  0  -- "mask" vector
0  1  2  2  3  4  5  6  6  -- "sequence" vector
-------------------------
0  1  2  0  3  4  5  6  0  -- result of "AND" operation
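
The three steps can also be written out one by one, with an explicit mask vector. The sketch below is a host-side version in standard C++, where `std::transform` and `std::partial_sum` stand in for `thrust::transform` and `thrust::inclusive_scan`. The names `number_nonzeros` and `check_worked_example` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: host-side version of the three steps above.
std::vector<int> number_nonzeros(const std::vector<int>& input)
{
  // Step 1: mask vector, 1 where the input is nonzero, 0 elsewhere.
  std::vector<int> mask(input.size());
  std::transform(input.begin(), input.end(), mask.begin(),
                 [](int v) { return v != 0 ? 1 : 0; });

  // Step 2: inclusive prefix sum of the mask gives the "sequence" vector.
  std::vector<int> sequence(input.size());
  std::partial_sum(mask.begin(), mask.end(), sequence.begin());

  // Step 3: keep the sequence value only where the mask is set.
  std::vector<int> result(input.size());
  std::transform(mask.begin(), mask.end(), sequence.begin(), result.begin(),
                 [](int m, int s) { return m ? s : 0; });
  return result;
}

// Reproduces the worked example from the text.
void check_worked_example()
{
  const std::vector<int> input = {0, 27, 42, 0, 18, 99, 94, 91, 0};
  const std::vector<int> expected = {0, 1, 2, 0, 3, 4, 5, 6, 0};
  assert(number_nonzeros(input) == expected);
}
```

The third step is a select rather than a bitwise AND: the sequence value is kept where the mask is 1 and replaced by 0 elsewhere.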

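Note that we could combine the first two steps with thrust::transform_inclusive_scan and then perform the third step as a thrust::transform with a slightly different transform functor. This modification removes the need to create the temporary "mask" vector.

Here is a fully worked example showing the "modified" approach using thrust::transform_inclusive_scan: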
$ cat t635.cu
#include <iostream>
#include <iterator>
#include <stdlib.h>

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <thrust/transform.h>
#include <thrust/transform_scan.h>
#include <thrust/generate.h>
#include <thrust/copy.h>


#define DSIZE 20
#define PCT_ZERO 40

// Maps each element to 1 if it is nonzero and to 0 otherwise (the "mask").
struct my_unary_op
{
  __host__ __device__
  int operator()(const int data) const
  {
    return (!data) ? 0 : 1;
  }
};

// Keeps the scanned value d2 only where the input value d1 is nonzero.
struct my_binary_op
{
  __host__ __device__
  int operator()(const int d1, const int d2) const
  {
    return (!d1) ? 0 : d2;
  }
};

int main(){

  // generate DSIZE random integers below 1000; about PCT_ZERO% are forced to zero
  thrust::host_vector<int> h_data(DSIZE);
  thrust::generate(h_data.begin(), h_data.end(), rand);
  for (int i = 0; i < DSIZE; i++)
    if ((rand()%100)< PCT_ZERO) h_data[i] = 0;
    else h_data[i] %= 1000;
  thrust::device_vector<int> d_data = h_data;
  thrust::device_vector<int> d_result(DSIZE);
  // steps 1 and 2 fused: build the mask on the fly and take its inclusive prefix sum
  thrust::transform_inclusive_scan(d_data.begin(), d_data.end(), d_result.begin(), my_unary_op(), thrust::plus<int>());
  // step 3: zero out the scanned values at positions where the input is zero
  thrust::transform(d_data.begin(), d_data.end(), d_result.begin(), d_result.begin(), my_binary_op());
  thrust::copy(d_data.begin(), d_data.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  thrust::copy(d_result.begin(), d_result.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}

$ nvcc -o t635 t635.cu
$ ./t635
0,886,777,0,793,0,386,0,649,0,0,0,0,59,763,926,540,426,0,736,
0,1,2,0,3,0,4,0,5,0,0,0,0,6,7,8,9,10,0,11,
$
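
The example above numbers every nonzero element by position, whereas the question asks for every cell of a blob to receive the same number. One possible adaptation applies the same mask-and-scan idea to a table indexed by label value instead of by pixel position. This is an assumption: the thread does not show what the asker actually implemented. It relies on the question's statement that labels never exceed the array size. The host-side C++ sketch below uses hypothetical names (`relabel_by_scan`, `max_label`, `check_relabel_by_scan`).

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: relabels blobs as 1..n using a table indexed by label.
// Assumes 0 <= label <= max_label for every cell.
std::vector<int> relabel_by_scan(const std::vector<int>& labels, int max_label)
{
  // Step 1: present[l] = 1 if label l occurs in the image. Label 0 is the
  // background and is deliberately never flagged.
  std::vector<int> present(static_cast<std::size_t>(max_label) + 1, 0);
  for (int l : labels) {
    assert(l >= 0 && l <= max_label);
    if (l != 0) present[l] = 1;
  }

  // Step 2: the inclusive prefix sum of the flags gives, for every label that
  // occurs, its rank among the occurring labels (1..n).
  std::vector<int> new_id(present.size());
  std::partial_sum(present.begin(), present.end(), new_id.begin());

  // Step 3: look up the new id of each cell. Background 0 maps to new_id[0] == 0.
  std::vector<int> out(labels.size());
  for (std::size_t i = 0; i < labels.size(); ++i) out[i] = new_id[labels[i]];
  return out;
}

// Small check: labels are renumbered in increasing order of their old value.
void check_relabel_by_scan()
{
  const std::vector<int> in = {0, 2242, 2242, 0, 13282, 2242, 0, 50};
  const std::vector<int> expected = {0, 2, 2, 0, 3, 2, 0, 1};
  assert(relabel_by_scan(in, 20000) == expected);
}
```

Each step is data-parallel (a scatter of flags, a prefix sum, and a gather), so the same structure should carry over to the GPU. The table has one entry per possible label, which by the question's bound is the same size as the image.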

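My answer is similar to the one @RobertCrovella gave, but I think it is simpler to use thrust::lower_bound instead of a custom binary search. (Now that it is pure Thrust, the backend can be interchanged.)

1. Copy the input data
2. Sort the copied data
3. Create a unique list from the sorted data
4. Find the lower bound of each input in the unique list

I have included a full example below. Interestingly, the process can become much faster if the sorting step is preceded by another call to thrust::unique. Depending on the input data, this can dramatically reduce the number of elements in the sort, which is the bottleneck here.
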
#include <iostream>
#include <iterator>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/binary_search.h>
#include <thrust/copy.h>

int main()
{
  const int ndata = 20;
  // Generate host input data
  thrust::host_vector<int> h_data(ndata);
  thrust::generate(h_data.begin(), h_data.end(), rand);
  for (int i = 0; i < ndata; i++)
  {
    if ((rand() % 100) < 40)
      h_data[i] = 0;
    else
      h_data[i] %= 10;
  }

  // Copy data to the device
  thrust::device_vector<int> d_data = h_data;
  // Make a second copy of the data
  thrust::device_vector<int> d_result = d_data;
  // Sort the data copy
  thrust::sort(d_result.begin(), d_result.end());
  // Allocate an array to store unique values
  thrust::device_vector<int> d_unique = d_result;
  {
    // Compress all duplicates
    const auto end = thrust::unique(d_unique.begin(), d_unique.end());
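    // Search for every original label in this compressed range and write its
    // index back as the result. With non-negative labels, the background value 0
    // sorts first, so it maps to index 0 whenever the data contains a zero.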
    thrust::lower_bound(
      d_unique.begin(), end, d_data.begin(), d_data.end(), d_result.begin());
  }

  thrust::copy(
    d_data.begin(), d_data.end(), std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  thrust::copy(d_result.begin(),
               d_result.end(),
               std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}
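
The speed-up mentioned above, an extra unique pass before the sort, is not part of the example. A minimal host-side sketch of that ordering in standard C++ follows, with `std::unique`, `std::sort`, and `std::lower_bound` standing in for their Thrust counterparts. The names `relabel_sorted_unique` and `check_relabel_sorted_unique` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: sort / unique / lower_bound relabeling, with an extra
// unique pass before the sort to shrink the range that has to be sorted.
std::vector<int> relabel_sorted_unique(const std::vector<int>& data)
{
  std::vector<int> uniq = data;

  // Pre-pass: collapse runs of equal neighbors. Labeled images contain long
  // runs of zeros and of identical labels, so this can remove many elements.
  uniq.erase(std::unique(uniq.begin(), uniq.end()), uniq.end());

  // Sort the shortened copy, then remove the duplicates that remain.
  std::sort(uniq.begin(), uniq.end());
  uniq.erase(std::unique(uniq.begin(), uniq.end()), uniq.end());

  // The new label of each value is its index in the sorted unique list.
  std::vector<int> result(data.size());
  std::transform(data.begin(), data.end(), result.begin(), [&uniq](int v) {
    return static_cast<int>(std::lower_bound(uniq.begin(), uniq.end(), v) - uniq.begin());
  });
  return result;
}

// Small check: 0 stays 0 because it is the smallest value present.
void check_relabel_sorted_unique()
{
  const std::vector<int> in = {0, 0, 7, 7, 7, 0, 3, 3, 0, 7};
  const std::vector<int> expected = {0, 0, 2, 2, 2, 0, 1, 1, 0, 2};
  assert(relabel_sorted_unique(in) == expected);
}
```

The pre-pass only removes adjacent duplicates, so the second unique after the sort is still required. How much the pre-pass helps depends on how long the runs of equal values are in the input.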
