Algorithm 尽可能快地计算矩阵的零空间_Algorithm_Math_Matrix_Cuda

Algorithm 尽可能快地计算矩阵的零空间

algorithm math matrix cuda

Algorithm 尽可能快地计算矩阵的零空间,algorithm,math,matrix,cuda,Algorithm,Math,Matrix,Cuda,我需要并行计算数千个小矩阵的零空间（8x9，而不是我前面写的4x3）（CUDA）。所有的参考文献都指向SVD，但是数值公式中的算法似乎非常昂贵，并且给了我很多我并不真正需要的零空间以外的东西。高斯消去真的不是一种选择吗？还有其他常用的方法吗？“似乎非常昂贵”-您有哪些数据支持这一点也许这就是你寻求的答案也许吧 JAMA和Apache Commons Math都有Java中的SVD实现。为什么不把它们拿出来试试呢？为你的案例获取一些真实的数据，而不是印象。这不会花费你很多，因为代码已经编写和测

我需要并行计算数千个小矩阵的零空间（8x9，而不是我前面写的4x3）（CUDA）。所有的参考文献都指向SVD，但是数值公式中的算法似乎非常昂贵，并且给了我很多我并不真正需要的零空间以外的东西。高斯消去真的不是一种选择吗？还有其他常用的方法吗？

“似乎非常昂贵”-您有哪些数据支持这一点

也许这就是你寻求的答案

也许吧

JAMA和Apache Commons Math都有Java中的SVD实现。为什么不把它们拿出来试试呢？为你的案例获取一些真实的数据，而不是印象。这不会花费你很多，因为代码已经编写和测试过了。

高斯消去法对于4x3矩阵非常快。IIRC我已经用Java做了大约每秒500万次的并行处理。对于这样一个小问题，您最好的选择是自己编写例程（行减少等）；否则，您将浪费大部分时间将数据转换为外部例程的正确格式。

我认为CUDA最重要的事情是找到一种不依赖条件分支的算法（在图形硬件上非常慢）。可以优化为条件赋值的简单if语句更好（或者可以使用？：运算符）

如有必要，您应该能够使用条件赋值进行某种形式的旋转。实际上，确定如何存储结果可能更难：如果您的矩阵秩不足，您希望您的CUDA程序对此做些什么

如果你假设你的4x3矩阵实际上没有秩亏，你可以在没有任何条件的情况下找到你的（单个）空空间向量：矩阵足够小，你可以有效地使用Cramer规则

实际上，因为你并不关心零向量的尺度，你不需要除以行列式，你只需要取子向量的行列式：

    x1 x2 x3
M = y1 y2 y3
    z1 z2 z3
    w1 w2 w3

         |y1 y2 y3|        |x1 x2 x3|       |x1 x2 x3|        |x1 x2 x3|
->  x0 = |z1 z2 z3|  y0 = -|z1 z2 z3|  z0 = |y1 y2 y3|  w0 = -|y1 y2 y3|
         |w1 w2 w3|        |w1 w2 w3|       |w1 w2 w3|        |z1 z2 z3|

请注意，这些3x3行列式只是三重积；您可以通过重用叉积来节省计算。

我想知道矩阵是否是相关的，而不仅仅是随机的，因此您正在寻找的零空间可以被认为是N空间（N=9）中曲线的一维切线。如果是这样的话，您可以使用牛顿的方法来求解二次方程组Ax=0，|x | ^2=1的连续实例，从以前的零空间向量开始，从而加快速度。牛顿的方法使用一阶导数收敛到一个解，因此将使用高斯消去法求解9x9系统。使用这项技术需要您能够通过改变一个参数，从一个矩阵到另一个矩阵进行小步操作

因此，我们的想法是在第一个矩阵上使用SVD进行初始化，但是之后，我们从一个矩阵到另一个矩阵，使用一个矩阵的零空间向量作为下一个矩阵迭代的起点。您需要一到两次迭代才能获得收敛。如果您没有得到Convergence，您可以使用SVD重新启动。如果这种情况就是你所拥有的，那么它比在每个矩阵上重新开始要快得多

很久以前，我用它来绘制与电力系统行为相关的50 x 50二次方程组解的等高线。

直接回答你的问题。。。对QR分解

设A是秩为n的m×n矩阵。QR分解发现正交m-by-m矩阵Q和上三角m-by-n矩阵R，从而A=QR。如果我们定义Q=[Q1 Q2]，其中Q1是m-by-n，Q2是m-by-（m-n），那么Q2的列构成了A^T的空空间

QR分解由Gram-Schmidt、Givens旋转或Householder反射计算。它们具有不同的稳定性和操作计数

你说得对：SVD很贵！我无法说明最先进的东西使用了什么，但当我听到“计算零空间”（编辑：以一种我很容易理解的方式）时，我认为QR。

我不认为上面提出的方法总是给出整个零空间。概括地说：“A=QR，其中Q=[Q1 Q2]，Q1是m-by-n，Q2是m-by-（m-n）。然后Q2的列构成A^T的空空间。”

事实上，这可能只给出零空间的一个子空间。简单的反例是当A=0时，在这种情况下，A^T的零空间是整个R^m

因此，也有必要检查R。根据我在Matlab的经验，如果R的一行是直的0，那么Q中相应的列也应该是a^T的零空间的基础。显然，这种观察是启发式的，取决于用于QR分解的特定算法。

在上面的答案中，已经指出了如何使用QR或SVD方法计算矩阵的零空间。当需要精度时，应首选SVD，另请参见

截至2015年2月，CUDA 7（现为候选版本）通过其新的cuSOLVER库提供SVD。下面我报告一个关于如何使用cuSOLVER的SVD计算矩阵的零空间的示例

请注意，您所关注的问题涉及到几个小矩阵的计算，因此您应该通过使用streams来调整我下面提供的示例，使其对您的案例有意义。将流与您可以使用的每个任务关联

cudaStreamCreate()

及

内核.cu

#include "cuda_runtime.h"
#include "device_launch_paraMeters.h"

#include<iostream>
#include<iomanip>
#include<stdlib.h>
#include<stdio.h>
#include<assert.h>
#include<math.h>

#include <cusolverDn.h>
#include <cuda_runtime_api.h>

#include "Utilities.cuh"

/********/
/* MAIN */
/********/
int main(){

    // --- gesvd only supports Nrows >= Ncols
    // --- column major memory ordering

    const int Nrows = 7;
    const int Ncols = 5;

    // --- cuSOLVE input/output parameters/arrays
    int work_size = 0;
    int *devInfo;           gpuErrchk(cudaMalloc(&devInfo,          sizeof(int)));

    // --- CUDA solver initialization
    cusolverDnHandle_t solver_handle;
    cusolverDnCreate(&solver_handle);

    // --- Singular values threshold
    double threshold = 1e-12;

    // --- Setting the host, Nrows x Ncols matrix
    double *h_A = (double *)malloc(Nrows * Ncols * sizeof(double));
    for(int j = 0; j < Nrows; j++)
        for(int i = 0; i < Ncols; i++)
            h_A[j + i*Nrows] = (i + j*j) * sqrt((double)(i + j));

    // --- Setting the device matrix and moving the host matrix to the device
    double *d_A;            gpuErrchk(cudaMalloc(&d_A,      Nrows * Ncols * sizeof(double)));
    gpuErrchk(cudaMemcpy(d_A, h_A, Nrows * Ncols * sizeof(double), cudaMemcpyHostToDevice));

    // --- host side SVD results space
    double *h_U = (double *)malloc(Nrows * Nrows     * sizeof(double));
    double *h_V = (double *)malloc(Ncols * Ncols     * sizeof(double));
    double *h_S = (double *)malloc(min(Nrows, Ncols) * sizeof(double));

    // --- device side SVD workspace and matrices
    double *d_U;            gpuErrchk(cudaMalloc(&d_U,  Nrows * Nrows     * sizeof(double)));
    double *d_V;            gpuErrchk(cudaMalloc(&d_V,  Ncols * Ncols     * sizeof(double)));
    double *d_S;            gpuErrchk(cudaMalloc(&d_S,  min(Nrows, Ncols) * sizeof(double)));

    // --- CUDA SVD initialization
    cusolveSafeCall(cusolverDnDgesvd_bufferSize(solver_handle, Nrows, Ncols, &work_size));
    double *work;   gpuErrchk(cudaMalloc(&work, work_size * sizeof(double)));

    // --- CUDA SVD execution
    cusolveSafeCall(cusolverDnDgesvd(solver_handle, 'A', 'A', Nrows, Ncols, d_A, Nrows, d_S, d_U, Nrows, d_V, Ncols, work, work_size, NULL, devInfo));
    int devInfo_h = 0;  gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
    if (devInfo_h != 0) std::cout   << "Unsuccessful SVD execution\n\n";

    // --- Moving the results from device to host
    gpuErrchk(cudaMemcpy(h_S, d_S, min(Nrows, Ncols) * sizeof(double), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaMemcpy(h_U, d_U, Nrows * Nrows     * sizeof(double), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaMemcpy(h_V, d_V, Ncols * Ncols     * sizeof(double), cudaMemcpyDeviceToHost));

    for(int i = 0; i < min(Nrows, Ncols); i++) 
        std::cout << "d_S["<<i<<"] = " << std::setprecision(15) << h_S[i] << std::endl;

    printf("\n\n");

    int count = 0;
    bool flag = 0;
    while (!flag) {
        if (h_S[count] < threshold) flag = 1;
        if (count == min(Nrows, Ncols)) flag = 1;
        count++;
    }
    count--;
    printf("The null space of A has dimension %i\n\n", min(Ncols, Nrows) - count);

    for(int j = count; j < Ncols; j++) {
        printf("Basis vector nr. %i\n", j - count);
        for(int i = 0; i < Ncols; i++)
                std::cout << "d_V["<<i<<"] = " << std::setprecision(15) << h_U[j*Ncols + i] << std::endl;
        printf("\n");
    }

    cusolverDnDestroy(solver_handle);

    return 0;

}

#include <stdio.h>
#include <assert.h>

#include "cuda_runtime.h"
#include <cuda.h>

#include <cusolverDn.h>

/*******************/
/* iDivUp FUNCTION */
/*******************/
extern "C" int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }

/********************/
/* CUDA ERROR CHECK */
/********************/
// --- Credit to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) { exit(code); }
   }
}

extern "C" void gpuErrchk(cudaError_t ans) { gpuAssert((ans), __FILE__, __LINE__); }

/**************************/
/* CUSOLVE ERROR CHECKING */
/**************************/
static const char *_cudaGetErrorEnum(cusolverStatus_t error)
{
    switch (error)
    {
        case CUSOLVER_STATUS_SUCCESS:
            return "CUSOLVER_SUCCESS";

        case CUSOLVER_STATUS_NOT_INITIALIZED:
            return "CUSOLVER_STATUS_NOT_INITIALIZED";

        case CUSOLVER_STATUS_ALLOC_FAILED:
            return "CUSOLVER_STATUS_ALLOC_FAILED";

        case CUSOLVER_STATUS_INVALID_VALUE:
            return "CUSOLVER_STATUS_INVALID_VALUE";

        case CUSOLVER_STATUS_ARCH_MISMATCH:
            return "CUSOLVER_STATUS_ARCH_MISMATCH";

        case CUSOLVER_STATUS_EXECUTION_FAILED:
            return "CUSOLVER_STATUS_EXECUTION_FAILED";

        case CUSOLVER_STATUS_INTERNAL_ERROR:
            return "CUSOLVER_STATUS_INTERNAL_ERROR";

        case CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
            return "CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED";

    }

    return "<unknown>";
}

inline void __cusolveSafeCall(cusolverStatus_t err, const char *file, const int line)
{
    if(CUSOLVER_STATUS_SUCCESS != err) {
        fprintf(stderr, "CUSOLVE error in file '%s', line %d\n %s\nerror %d: %s\nterminating!\n",__FILE__, __LINE__,err, \
                                _cudaGetErrorEnum(err)); \
        cudaDeviceReset(); assert(0); \
    }
}

extern "C" void cusolveSafeCall(cusolverStatus_t err) { __cusolveSafeCall(err, __FILE__, __LINE__); }

实用程序.cu

#include "cuda_runtime.h"
#include "device_launch_paraMeters.h"

#include<iostream>
#include<iomanip>
#include<stdlib.h>
#include<stdio.h>
#include<assert.h>
#include<math.h>

#include <cusolverDn.h>
#include <cuda_runtime_api.h>

#include "Utilities.cuh"

/********/
/* MAIN */
/********/
int main(){

    // --- gesvd only supports Nrows >= Ncols
    // --- column major memory ordering

    const int Nrows = 7;
    const int Ncols = 5;

    // --- cuSOLVE input/output parameters/arrays
    int work_size = 0;
    int *devInfo;           gpuErrchk(cudaMalloc(&devInfo,          sizeof(int)));

    // --- CUDA solver initialization
    cusolverDnHandle_t solver_handle;
    cusolverDnCreate(&solver_handle);

    // --- Singular values threshold
    double threshold = 1e-12;

    // --- Setting the host, Nrows x Ncols matrix
    double *h_A = (double *)malloc(Nrows * Ncols * sizeof(double));
    for(int j = 0; j < Nrows; j++)
        for(int i = 0; i < Ncols; i++)
            h_A[j + i*Nrows] = (i + j*j) * sqrt((double)(i + j));

    // --- Setting the device matrix and moving the host matrix to the device
    double *d_A;            gpuErrchk(cudaMalloc(&d_A,      Nrows * Ncols * sizeof(double)));
    gpuErrchk(cudaMemcpy(d_A, h_A, Nrows * Ncols * sizeof(double), cudaMemcpyHostToDevice));

    // --- host side SVD results space
    double *h_U = (double *)malloc(Nrows * Nrows     * sizeof(double));
    double *h_V = (double *)malloc(Ncols * Ncols     * sizeof(double));
    double *h_S = (double *)malloc(min(Nrows, Ncols) * sizeof(double));

    // --- device side SVD workspace and matrices
    double *d_U;            gpuErrchk(cudaMalloc(&d_U,  Nrows * Nrows     * sizeof(double)));
    double *d_V;            gpuErrchk(cudaMalloc(&d_V,  Ncols * Ncols     * sizeof(double)));
    double *d_S;            gpuErrchk(cudaMalloc(&d_S,  min(Nrows, Ncols) * sizeof(double)));

    // --- CUDA SVD initialization
    cusolveSafeCall(cusolverDnDgesvd_bufferSize(solver_handle, Nrows, Ncols, &work_size));
    double *work;   gpuErrchk(cudaMalloc(&work, work_size * sizeof(double)));

    // --- CUDA SVD execution
    cusolveSafeCall(cusolverDnDgesvd(solver_handle, 'A', 'A', Nrows, Ncols, d_A, Nrows, d_S, d_U, Nrows, d_V, Ncols, work, work_size, NULL, devInfo));
    int devInfo_h = 0;  gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
    if (devInfo_h != 0) std::cout   << "Unsuccessful SVD execution\n\n";

    // --- Moving the results from device to host
    gpuErrchk(cudaMemcpy(h_S, d_S, min(Nrows, Ncols) * sizeof(double), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaMemcpy(h_U, d_U, Nrows * Nrows     * sizeof(double), cudaMemcpyDeviceToHost));
    gpuErrchk(cudaMemcpy(h_V, d_V, Ncols * Ncols     * sizeof(double), cudaMemcpyDeviceToHost));

    for(int i = 0; i < min(Nrows, Ncols); i++) 
        std::cout << "d_S["<<i<<"] = " << std::setprecision(15) << h_S[i] << std::endl;

    printf("\n\n");

    int count = 0;
    bool flag = 0;
    while (!flag) {
        if (h_S[count] < threshold) flag = 1;
        if (count == min(Nrows, Ncols)) flag = 1;
        count++;
    }
    count--;
    printf("The null space of A has dimension %i\n\n", min(Ncols, Nrows) - count);

    for(int j = count; j < Ncols; j++) {
        printf("Basis vector nr. %i\n", j - count);
        for(int i = 0; i < Ncols; i++)
                std::cout << "d_V["<<i<<"] = " << std::setprecision(15) << h_U[j*Ncols + i] << std::endl;
        printf("\n");
    }

    cusolverDnDestroy(solver_handle);

    return 0;

}

#include <stdio.h>
#include <assert.h>

#include "cuda_runtime.h"
#include <cuda.h>

#include <cusolverDn.h>

/*******************/
/* iDivUp FUNCTION */
/*******************/
extern "C" int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }

/********************/
/* CUDA ERROR CHECK */
/********************/
// --- Credit to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) { exit(code); }
   }
}

extern "C" void gpuErrchk(cudaError_t ans) { gpuAssert((ans), __FILE__, __LINE__); }

/**************************/
/* CUSOLVE ERROR CHECKING */
/**************************/
static const char *_cudaGetErrorEnum(cusolverStatus_t error)
{
    switch (error)
    {
        case CUSOLVER_STATUS_SUCCESS:
            return "CUSOLVER_SUCCESS";

        case CUSOLVER_STATUS_NOT_INITIALIZED:
            return "CUSOLVER_STATUS_NOT_INITIALIZED";

        case CUSOLVER_STATUS_ALLOC_FAILED:
            return "CUSOLVER_STATUS_ALLOC_FAILED";

        case CUSOLVER_STATUS_INVALID_VALUE:
            return "CUSOLVER_STATUS_INVALID_VALUE";

        case CUSOLVER_STATUS_ARCH_MISMATCH:
            return "CUSOLVER_STATUS_ARCH_MISMATCH";

        case CUSOLVER_STATUS_EXECUTION_FAILED:
            return "CUSOLVER_STATUS_EXECUTION_FAILED";

        case CUSOLVER_STATUS_INTERNAL_ERROR:
            return "CUSOLVER_STATUS_INTERNAL_ERROR";

        case CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
            return "CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED";

    }

    return "<unknown>";
}

inline void __cusolveSafeCall(cusolverStatus_t err, const char *file, const int line)
{
    if(CUSOLVER_STATUS_SUCCESS != err) {
        fprintf(stderr, "CUSOLVE error in file '%s', line %d\n %s\nerror %d: %s\nterminating!\n",__FILE__, __LINE__,err, \
                                _cudaGetErrorEnum(err)); \
        cudaDeviceReset(); assert(0); \
    }
}

extern "C" void cusolveSafeCall(cusolverStatus_t err) { __cusolveSafeCall(err, __FILE__, __LINE__); }

#包括
#包括
#包括“cuda_runtime.h”
#包括
#包括
/*******************/
/*iDivUp函数*/
/*******************/
外部“C”intIDIVUP（inta，intb）{返回（（a%b）！=0）？（a/b+1）：（a/b）；}
/********************/
/*CUDA错误检查*/
/********************/
//---归功于http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
无效gpuAssert（错误代码