
Parallel multiplication of many small matrices by a fixed vector

matrix cuda parallel-processing


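The situation is the following: I have a number (1000s) of elements, each given by a small matrix of dimensions 4x2, 9x3, ... you get the idea. All matrices have the same dimensions.

I want to multiply each of these matrices by a fixed vector of precomputed values. In short: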

for(i = 1...n)
    X[i] = M[i] . N;

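What is the best approach to do this in parallel using Thrust? How should I lay out the data in memory?

Note: there may be specialized, more suitable libraries for doing this on GPUs. I am interested in Thrust because it lets me deploy to different backends, not just CUDA.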

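One possible approach:

1. Flatten the arrays (matrices) into a single data vector. This is an advantageous step for enabling general Thrust processing anyway.

2. Use a strided range mechanism to take your scaling vector (one entry per matrix element) and extend it to the overall length of the flattened data vector.

3. Use thrust::transform with thrust::multiplies to multiply the two vectors together element by element.

If you later need to access the matrices from the flattened data vector (or the result vector), you can do so with pointer arithmetic or a combination of fancy iterators.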
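
As a rough illustration of the flattened layout and the pointer arithmetic mentioned above, here is a host-only C++ sketch. It assumes the matrices are stored back to back and that each matrix is row-major; neither is stated in the answer. The names `H`, `W`, `matrix_ptr`, and `element` are hypothetical helpers, not part of Thrust.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed dimensions for illustration; every matrix has the same shape.
constexpr std::size_t H = 4;  // rows per matrix
constexpr std::size_t W = 3;  // columns per matrix

// Pointer to the first element of matrix i inside the flattened buffer.
inline const int* matrix_ptr(const std::vector<int>& flat, std::size_t i) {
  assert((i + 1) * H * W <= flat.size());  // matrix i must lie inside the buffer
  return flat.data() + i * H * W;
}

// Element (r, c) of matrix i, assuming row-major order within each matrix.
inline int element(const std::vector<int>& flat, std::size_t i,
                   std::size_t r, std::size_t c) {
  assert(r < H && c < W);
  return matrix_ptr(flat, i)[r * W + c];
}
```

With this layout, element (r, c) of matrix i lives at flat index `i * H * W + r * W + c`. The same arithmetic applies to a raw device pointer obtained from a device vector.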

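If you need to re-use the extended scaling vector, you may want to follow step 2 exactly, i.e. use that method to create an actual vector (the scaling vector repeated once per matrix). If you only do this once, you can achieve the same effect with a counting iterator, followed by a transform iterator (taking the index modulo the number of elements in one matrix), followed by a permutation iterator that indexes into the original scaling vector (whose length is that of one matrix).

The following example implements the above, without using the strided range iterator method: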

#include <iostream>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/functional.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/transform.h>

#define N_MAT 1000
#define H_MAT 4
#define W_MAT 3
#define RANGE 1024

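// Maps a flat element index to its position within one matrix, so that the
// short scale vector can be indexed repeatedly.
struct my_modulo_functor : public thrust::unary_function<int, int>
{
  __host__ __device__
  int operator() (int idx) const {
    return idx%(H_MAT*W_MAT);}
};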

int main(){

  thrust::host_vector<int> data(N_MAT*H_MAT*W_MAT);
  thrust::host_vector<int> scale(H_MAT*W_MAT);
  // synthetic; instead flatten/copy matrices into data vector
  for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++) data[i] = rand()%RANGE;
  for (int i = 0; i < H_MAT*W_MAT; i++) scale[i] = rand()%RANGE;

  thrust::device_vector<int> d_data = data;
  thrust::device_vector<int> d_scale = scale;
  thrust::device_vector<int> d_result(N_MAT*H_MAT*W_MAT);

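  // Multiply each data element by d_scale[index % (H_MAT*W_MAT)]: the counting
  // iterator supplies the flat index, the transform iterator reduces it modulo
  // the matrix size, and the permutation iterator reads d_scale at that position.
  thrust::transform(
    d_data.begin(), d_data.end(),
    thrust::make_permutation_iterator(
      d_scale.begin(),
      thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                      my_modulo_functor())),
    d_result.begin(),
    thrust::multiplies<int>());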

  thrust::host_vector<int> result = d_result;

  // Verify the device result against a straightforward host computation.
  for (int i = 0; i < N_MAT*H_MAT*W_MAT; i++) {
    int expected = data[i] * scale[i%(H_MAT*W_MAT)];
    if (result[i] != expected) {
      std::cout << "Mismatch at: " << i << " cpu result: " << expected
                << " gpu result: " << result[i] << std::endl;
      return 1;
    }
  }
  std::cout << "Success!" << std::endl;
  return 0;
}
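
The trade-off described before the example (materialize the extended scaling vector for re-use, or index into the short vector on the fly) can be sketched on the host with the standard library alone. This is not Thrust code and says nothing about GPU performance; it only shows that the two variants compute the same result. The function names `expand_scale`, `scale_with_expanded`, and `scale_with_modulo` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Build the extended scale vector once: the per-matrix pattern repeated
// n_mat times, so it lines up element for element with the flattened data.
std::vector<int> expand_scale(const std::vector<int>& scale, std::size_t n_mat) {
  std::vector<int> expanded;
  expanded.reserve(scale.size() * n_mat);
  for (std::size_t m = 0; m < n_mat; ++m)
    expanded.insert(expanded.end(), scale.begin(), scale.end());
  return expanded;
}

// Variant A: multiply against a materialized, re-usable extended vector.
std::vector<int> scale_with_expanded(const std::vector<int>& data,
                                     const std::vector<int>& expanded) {
  assert(data.size() == expanded.size());
  std::vector<int> out(data.size());
  std::transform(data.begin(), data.end(), expanded.begin(), out.begin(),
                 std::multiplies<int>());
  return out;
}

// Variant B: index the short scale vector modulo its length, which is what
// the counting/transform/permutation iterator chain does in the Thrust example.
std::vector<int> scale_with_modulo(const std::vector<int>& data,
                                   const std::vector<int>& scale) {
  assert(!scale.empty() && data.size() % scale.size() == 0);
  std::vector<int> out(data.size());
  for (std::size_t k = 0; k < data.size(); ++k)
    out[k] = data[k] * scale[k % scale.size()];
  return out;
}
```

Variant A costs extra memory proportional to the whole data set but keeps the inner operation a plain two-input transform. Variant B needs no extra storage and pays an index computation per element instead.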

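When looking for a simple software library for multiplying small matrices, you may want to have a look at LIBXSMM. The first LIBXSMM snippet at the end of this page requests a specialized matrix kernel according to the typical GEMM parameters (please note that some limitations apply).

Given that kernel, you can go ahead and run "xmm" over an entire series of (small) matrices without a particular data structure (the second snippet, the loop guarded by if (0 < n), also uses "prefetch locations").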

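In addition to manual loop control as in the loop snippet, libxsmm_gemm_batch (or libxsmm_gemm_batch_omp) can be used (see the LIBXSMM documentation). The latter is useful if data structures already exist that describe the series of operands (A, B, and C matrices).
There are two reasons why this library delivers superior performance: (1) on-the-fly code specialization using an in-memory code generation technique, and (2) loading the next matrix operands while the current product is being computed.
(If you are looking for something that blends well with C/C++, this library supports it. However, it does not target CUDA/Thrust.)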
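This seems like a decent approach. Any thoughts on the overhead of using fancy iterators rather than a more conservative approach?
What characterizes a more conservative approach? Do you mean using the strided range method to create a full-length scale vector?
I'm curious how much overhead a transform(numbers, iterator) operation introduces compared with a standard transform(numbers, other_numbers) operation.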
double alpha = 1, beta = 1;
const char transa = 'N', transb = 'N';
int flags = LIBXSMM_GEMM_FLAGS(transa, transb);
int prefetch = LIBXSMM_PREFETCH_AUTO;
libxsmm_blasint m = 23, n = 23, k = 23;
libxsmm_dmmfunction xmm = NULL;

xmm = libxsmm_dmmdispatch(m, n, k,
  &m/*lda*/, &k/*ldb*/, &m/*ldc*/,
  &alpha, &beta, &flags, &prefetch);

/* in this loop, n is the number of matrices in the batch (not the GEMM dimension above) */
if (0 < n) { /* check that n is at least 1 */
# pragma omp parallel for private(i)
  for (i = 0; i < (n - 1); ++i) {
    const double *const ai = a + i * asize;
    const double *const bi = b + i * bsize;
    double *const ci = c + i * csize;
    xmm(ai, bi, ci, ai + asize, bi + bsize, ci + csize);
  }
  xmm(a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize,
  /* pseudo prefetch for last element of batch (avoids page fault) */
      a + (n - 1) * asize, b + (n - 1) * bsize, c + (n - 1) * csize);
}