SVD speed on CPU and GPU in MATLAB

I am testing svd in MATLAB R2014a, and there seems to be no CPU-vs-GPU speedup. I am using a GTX 460 card and a Core 2 Duo E8500.

Here is my code:

%test SVD
n=10000;
%host
Mh= rand(n,1000);
tic
%[Uh,Sh,Vh]= svd(Mh);
svd(Mh);
toc
%device
Md = gpuArray.rand(n,1000);
tic
%[Ud,Sd,Vd]= svd(Md);
svd(Md);
toc

Also, the run times differ from run to run, but the CPU and GPU versions take roughly the same time. Why is there no speedup?

Here are some tests:

for i=1:10
    clear;
    m= 10000;
    n= 100;
    %host
    Mh= rand(m,n);
    tic
    [Uh,Sh,Vh]= svd(Mh);
    toc
    %device
    Md = gpuArray.rand(m,n);
    tic
    [Ud,Sd,Vd]= svd(Md);
    toc
end

>> test_gpu_svd
Elapsed time is 43.124130 seconds.
Elapsed time is 43.842277 seconds.
Elapsed time is 42.993283 seconds.
Elapsed time is 44.293410 seconds.
Elapsed time is 42.924541 seconds.
Elapsed time is 43.730343 seconds.
Elapsed time is 43.125938 seconds.
Elapsed time is 43.645095 seconds.
Elapsed time is 43.492129 seconds.
Elapsed time is 43.459277 seconds.
Elapsed time is 43.327012 seconds.
Elapsed time is 44.040959 seconds.
Elapsed time is 43.242291 seconds.
Elapsed time is 43.390881 seconds.
Elapsed time is 43.275379 seconds.
Elapsed time is 43.408705 seconds.
Elapsed time is 43.320387 seconds.
Elapsed time is 44.232156 seconds.
Elapsed time is 42.984002 seconds.
Elapsed time is 43.702430 seconds.


for i=1:10
    clear;
    m= 10000;
    n= 100;
    %host
    Mh= rand(m,n,'single');
    tic
    [Uh,Sh,Vh]= svd(Mh);
    toc
    %device
    Md = gpuArray.rand(m,n,'single');
    tic
    [Ud,Sd,Vd]= svd(Md);
    toc
end

>> test_gpu_svd
Elapsed time is 21.140301 seconds.
Elapsed time is 21.334361 seconds.
Elapsed time is 21.275991 seconds.
Elapsed time is 21.582602 seconds.
Elapsed time is 21.093408 seconds.
Elapsed time is 21.305413 seconds.
Elapsed time is 21.482931 seconds.
Elapsed time is 21.327842 seconds.
Elapsed time is 21.120969 seconds.
Elapsed time is 21.701752 seconds.
Elapsed time is 21.117268 seconds.
Elapsed time is 21.384318 seconds.
Elapsed time is 21.359225 seconds.
Elapsed time is 21.911570 seconds.
Elapsed time is 21.086259 seconds.
Elapsed time is 21.263040 seconds.
Elapsed time is 21.472175 seconds.
Elapsed time is 21.561370 seconds.
Elapsed time is 21.330314 seconds.
Elapsed time is 21.546260 seconds.

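In general, SVD is a routine that is hard to parallelize. You can check that even on high-end Tesla cards the speedup is not very impressive.

You have a GTX 460 card. This card is optimized for gaming (single-precision computation) rather than HPC (double-precision computation). Its single-precision/double-precision throughput ratio is 12, so the card delivers 873 GFLOPS SP / 72 GFLOPS DP.

So if the Md array holds double-precision elements, computation on it will be rather slow. Also, when the CPU routine is called, there is a good chance that it uses all CPU cores, which reduces the possible gain from running the routine on the GPU. On top of that, in the GPU run you pay for the time needed to transfer the buffer to the device.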

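As Divakar suggested, you can use Md = single(Md) to convert the array to single precision and run the benchmark again. You can also try larger data set sizes to see whether anything changes. I would not expect much of a gain on your GPU.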
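One way to rerun the benchmark in single precision is sketched below. It is not the code from the question: it assumes the Parallel Computing Toolbox is installed, uses a smaller matrix than the question to keep the run short, and replaces tic/toc with timeit and gputimeit, which repeat the call several times and, for the GPU, wait until the device has finished. The variable names are illustrative.

```matlab
m = 2000;  n = 100;                      % illustrative size, smaller than in the question
Mh = rand(m, n, 'single');               % single-precision host matrix

% Host timing: timeit calls the function several times and returns a typical time
tCPU = timeit(@() svd(Mh));
fprintf('CPU svd (single)     : %.4f s\n', tCPU);

% Device timing only if the toolbox and a supported GPU are present (assumption)
if exist('gpuDeviceCount', 'file') && gpuDeviceCount > 0
    Md = gpuArray(Mh);                   % copy the same data to the device

    % gputimeit synchronizes with the device before it stops the clock
    tGPU  = gputimeit(@() svd(Md));
    tXfer = gputimeit(@() gpuArray(Mh)); % cost of the host-to-device copy alone

    fprintf('GPU svd (single)     : %.4f s\n', tGPU);
    fprintf('host-to-device copy  : %.4f s\n', tXfer);
end
```

Timing the copy separately makes it easier to see how much of the GPU figure is transfer rather than computation.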

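Update 1:

After you posted the results, I see that the DP/SP time ratio is 2. On the CPU side this is normal, because the SSE registers hold half as many double values as single values. On the GPU side, however, a ratio of only 2 means that the GPU code does not make the best use of the SM cores, since the theoretical ratio is 12. In other words, I would have expected optimized code to perform much better in SP than in DP. That does not seem to be the case.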
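The ratios discussed here can be reproduced with a few lines of arithmetic. The sketch below takes only the first host/device pair from each of the two timing logs in the question (the first toc in each loop iteration is the host run) and the peak figures quoted in the answer; the variable names are illustrative.

```matlab
% First host/device pair from each timing log in the question (seconds)
t_cpu_double = 43.124130;   t_gpu_double = 43.842277;
t_cpu_single = 21.140301;   t_gpu_single = 21.334361;

ratio_cpu = t_cpu_double / t_cpu_single;   % measured DP/SP time ratio on the CPU
ratio_gpu = t_gpu_double / t_gpu_single;   % measured DP/SP time ratio on the GPU

% Peak throughput figures quoted in the answer for the GTX 460 (GFLOPS)
ratio_peak = 873 / 72;

fprintf('measured DP/SP time ratio, CPU : %.3f\n', ratio_cpu);
fprintf('measured DP/SP time ratio, GPU : %.3f\n', ratio_gpu);
fprintf('peak SP/DP throughput ratio    : %.3f\n', ratio_peak);
```

Both measured ratios come out close to 2, while the peak throughput ratio is a little over 12.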

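As VAndrei has already pointed out, SVD is an algorithm that is hard to parallelize.

Your main problem is the size of the matrix. The performance of the SVD drops rapidly as the matrix grows, so your main goal should be to reduce the size of the matrix. This can be done with the Gauss normal equations (basically a reduction of an overdetermined linear system in the least-squares sense).

This is done by simply multiplying the transpose by the matrix: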

MhReduced = Mh' * Mh;

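This reduces the matrix to size cols*cols (where cols is the number of columns of Mh). Then you simply call

[U,S,V] = svd(MhReduced);

Note: this method may produce singular vectors of opposite sign (this matters if you want to compare the methods).
If your matrix is well-conditioned, this should not be a problem. For an ill-conditioned matrix, however, the method may fail to produce a usable result, whereas applying the SVD directly could still yield a usable result thanks to the robustness of the SVD.
This should increase your performance immediately, at least for matrices that are large enough. Another advantage is that you can work with much larger matrices. You may not need the GPU at all (because either the matrix is so large that copying it to the GPU costs too much, or the reduced matrix is so small that the GPU speedup is not big enough).
Also note that you lose a large part of the performance if you use return values. If you are only interested in the performance of the SVD computation, do not request any return values. If you are only interested in the "solution vector", just get V (and access its last column): [~,~,V] = svd(Mh)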
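The relation between the two decompositions can be checked numerically. The sketch below is an illustration, not the poster's code: it relies on the fact that the singular values of Mh'*Mh are the squares of the singular values of Mh, and it aligns the signs of the right singular vectors before comparing them. The matrix size and variable names are illustrative, and bsxfun is used so that the code also runs on older releases such as R2014a.

```matlab
rng(0);                                   % fixed seed so the run is repeatable
Mh = rand(2000, 50);                      % tall test matrix (illustrative size)

% Direct economy-size SVD of the full matrix
[~, S_direct, V_direct] = svd(Mh, 0);
s_direct = diag(S_direct);

% SVD of the small 50x50 matrix Mh' * Mh
[~, S_red, V_red] = svd(Mh' * Mh);
s_reduced = sqrt(diag(S_red));            % undo the squaring of the singular values

rel_err = norm(s_direct - s_reduced) / norm(s_direct);
fprintf('relative difference in singular values : %.2e\n', rel_err);

% Each singular vector is only defined up to sign, so flip columns where needed
signs = sign(sum(V_direct .* V_red, 1));  % +1 or -1 per column
V_aligned = bsxfun(@times, V_red, signs);
v_err = max(abs(V_direct(:) - V_aligned(:)));
fprintf('max difference in V after sign alignment: %.2e\n', v_err);
```

For a well-conditioned random matrix like this one the two results agree closely; the agreement gets worse as the condition number of Mh grows.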
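Edit:
I have looked at your sample code, but I am not sure what it is that you are computing. I also realized that it is rather hard to understand what I did with A'*A, so I will explain it in detail.
Given a linear system A*x=b, with A the coefficient matrix with m rows and n columns, x the solution vector (n rows), and b the constant vector (m rows), the solution can be computed as follows:

If A is square (m=n): x = A^-1*b
If A is not square (m!=n, m>n):
A*x = b
A'*A*x = A'*b
x = (A'*A)^-1*A'*b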

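A" = (A'*A)^-1*A' is usually called the pseudo-inverse. However, this calculation does affect the condition number of the matrix negatively. A solution to this problem is to use a singular value decomposition (SVD).
If [U,S,V] = svd(A) denotes the result of the SVD, the pseudo-inverse is given by V*S"*U', where S" is formed by taking the inverse of the non-zero elements of S.
So A" = V*S"*U'

x = A"*b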
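The construction of the pseudo-inverse from the SVD can be sketched as follows. The code is an illustration with made-up sizes and variable names; the cutoff tol that decides which singular values count as non-zero is an assumption modeled on the default tolerance of MATLAB's pinv, which is used here only as a cross-check.

```matlab
rng(2);                                  % fixed seed so the run is repeatable
A = rand(500, 20);                       % overdetermined system: m > n
b = rand(500, 1);

[U, S, V] = svd(A, 0);                   % economy-size SVD: U is 500x20, S is 20x20
s = diag(S);

% Invert only the singular values that are numerically non-zero
tol = max(size(A)) * eps(max(s));
s_inv = zeros(size(s));
s_inv(s > tol) = 1 ./ s(s > tol);

A_pinv = V * diag(s_inv) * U';           % A" = V * S" * U'
x_svd  = A_pinv * b;                     % x = A" * b

x_ref = pinv(A) * b;                     % built-in pseudo-inverse, for comparison
fprintf('difference to pinv(A)*b: %.2e\n', norm(x_svd - x_ref));
```

In practice the explicit matrix A_pinv is rarely needed; V*(s_inv .* (U'*b)) gives the same x with less work.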

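However, an SVD is rather costly, especially for large matrices. If the matrix A is well-conditioned and very precise results are not necessarily required (we are talking about 1e-13 or 1e-14), the much faster approach of computing the pseudo-inverse via (A'*A)^-1*A' can be used.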
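Why conditioning matters here can be seen from the fact that, in the 2-norm, cond(A'*A) equals cond(A)^2. The sketch below builds a test matrix with a prescribed condition number of about 1e6 (an arbitrary illustrative choice) and compares the error of the normal-equations solution with that of backslash on the original system. All names and sizes are illustrative.

```matlab
rng(1);                                  % fixed seed so the run is repeatable

% 200x20 test matrix with singular values from 1 down to 1e-6
[Q1, ~] = qr(randn(200, 20), 0);         % 200x20 with orthonormal columns
[Q2, ~] = qr(randn(20));                 % 20x20 orthogonal
A = Q1 * diag(logspace(0, -6, 20)) * Q2';

x_true = ones(20, 1);
b = A * x_true;                          % consistent right-hand side

x_ne = (A' * A) \ (A' * b);              % normal equations
x_bs = A \ b;                            % backslash on the original system

cA  = cond(A);
cAA = cond(A' * A);
fprintf('cond(A)     = %.2e\n', cA);
fprintf('cond(A''*A)  = %.2e\n', cAA);
fprintf('error, normal equations: %.2e\n', norm(x_ne - x_true));
fprintf('error, backslash       : %.2e\n', norm(x_bs - x_true));
```

The squared condition number is what limits the accuracy of the fast route, so it is worth checking cond(A) before relying on it.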
If your case is actually A*x=0, just use an SVD and read the last column vector of V; it is the solution.
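A minimal sketch of the homogeneous case, assuming a small rank-deficient test matrix built for illustration (the sizes and the coefficients of the dependent column are made up):

```matlab
rng(3);                                  % fixed seed so the run is repeatable

% 50x4 matrix of rank 3: the fourth column is a combination of the first three,
% so A*x = 0 has a non-trivial solution
B = rand(50, 3);
A = [B, B * [1; -2; 0.5]];

[~, S, V] = svd(A, 0);
x = V(:, end);                           % right singular vector of the smallest singular value

fprintf('smallest singular value: %.2e\n', S(end, end));
fprintf('norm(A*x)              : %.2e\n', norm(A * x));
fprintf('norm(x)                : %.4f\n', norm(x));
```

The returned x has unit length, which rules out the trivial solution x = 0. When A has full column rank, the same vector minimizes norm(A*x) over all unit vectors instead of making it exactly zero.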
If you use the SVD not to solve a linear system but for the resulting U and S (as your example suggests), I am not sure that what I have posted will help you.
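Here is some sample code for you to test. Test it with large matrices and you will see that using (A'*A)^-1*A' is much faster than the other approaches.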
clear all

nbRows = 30000;
nbCols = 100;
% Matrix A
A = rand(nbRows,nbCols);

% Vector b
b = rand(nbRows,1);

% A*x=b

% Solve for x, using SVD
% [U,S,V]=svd(A,0);
% x= V*((U'*b)./diag(S))
tic
[U1,S1,V1]=svd(A,0);
x1= V1*((U1'*b)./diag(S1));
toc

tic
[U1,S1,V1]=svd(A,0);
x2 = V1*inv(S1)*U1'*b;
toc

% Solve for x, using manual pseudo-inverse
% A*x=b
% A'*A*x = A'*b
% x = (A'*A)^-1 * A'*b
tic
x3 = inv(A'*A) * A'*b;
toc

% Solve for x, let MATLAB decide how (for a non-square A, backslash uses a QR-based least-squares solve)
tic
x4 = A\b;
toc
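The timings above say nothing about whether the approaches return the same x. A small check is sketched below; it is not part of the original answer, uses a smaller matrix to keep the run short, and solves the normal equations with backslash instead of inv, which is a swapped-in variant of the x3 line.

```matlab
rng(4);                                  % fixed seed so the run is repeatable
A = rand(3000, 50);
b = rand(3000, 1);

% SVD-based least-squares solution
[U, S, V] = svd(A, 0);
x_svd = V * ((U' * b) ./ diag(S));

% Normal equations, solved without forming an explicit inverse
x_ne = (A' * A) \ (A' * b);

% Backslash on the original overdetermined system
x_bs = A \ b;

d_ne = norm(x_ne - x_svd) / norm(x_svd);
d_bs = norm(x_bs - x_svd) / norm(x_svd);
fprintf('normal equations vs SVD: %.2e\n', d_ne);
fprintf('backslash vs SVD       : %.2e\n', d_bs);
```

For a well-conditioned random matrix all three agree to many digits, which matches the statement that the fast route is acceptable in that case.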

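I have been trying to parallelize SVD on my laptop with a GTX 460 for more than a month; it was also part of my undergraduate thesis. I ran a lot of experiments and later found that MATLAB is very fast and outperforms my code. By the way, I used one-sided Jacobi, and I have not yet seen any paper revealing an algorithm fa...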
clear all
close all
clc

Nrows = 2500;
Ncols = 2500;

NumTests = 10;

h_A = rand(Nrows, Ncols);
d_A = gpuArray.rand(Nrows, Ncols);

timingCPU = 0;
timingGPU = 0;

for k = 1 : NumTests
    % --- Host
    tic
    [h_U, h_S, h_V] = svd(h_A);
%     h_S = svd(h_A);
    timingCPU = timingCPU + toc;

    % --- Device
    tic
    [d_U, d_S, d_V] = svd(d_A);
%     d_S = svd(d_A);
    timingGPU = timingGPU + toc;
end

fprintf('Timing CPU = %f; Timing GPU = %f\n', timingCPU / NumTests, timingGPU / NumTests);

              Sing. values only  | Full SVD          | Sing. values only | Full SVD
Matrix size   CPU       GPU      | CPU       GPU     |                   |
 200 x  200   0.0021    0.043    |  0.0051    0.024  |   0.098           |   0.15
1000 x 1000   0.0915    0.3      |  0.169     0.458  |   0.5             |   2.3
2500 x 2500   3.35      2.13     |  4.62      3.97   |   2.9             |  23
5000 x 5000   5.2      13.1      | 26.6      73.8    |  16.1             | 161

 200 x  200      0.036
1000 x 1000      0.2
2500 x 2500      4.5
5000 x 5000     29

#include <arrayfire.h>
#include <cstdio>
#include <cstdlib>
#include <fstream>

using namespace af;

int main(int argc, char *argv[])
{
    const int N = 1000;

    try {

        // --- Select a device and display arrayfire info
        int device = argc > 1 ? atoi(argv[1]) : 0;
        af::setDevice(device);
        af::info();

        array A = randu(N, N, f64);
        af::array U, S, Vt;

        // --- Warming up
        timer time_last = timer::start();
        af::svd(U, S, Vt, A);
        S.eval();
        af::sync();
        double elapsed = timer::stop(time_last);
        printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);

        time_last = timer::start();
        af::svd(U, S, Vt, A);
        S.eval();
        af::sync();
        elapsed = timer::stop(time_last);
        printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);

    }
    catch (af::exception& e) {

        fprintf(stderr, "%s\n", e.what());
        throw;
    }

    return 0;
}