How can one nicely vectorize the following partial derivative with respect to a vector in MATLAB?


I am trying to implement the following equation:
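(The equation image did not come through in this copy. Judging from the corrected loop implementation in the answer below, it is presumably of the form

$$\frac{\partial J}{\partial t^{(1)}_{i,j}} \;\propto\; (y - f(x))\; e^{-z^{(1)}_{i,j}} \left(x_i - t^{(1)}_{i,j}\right) \sum_{k_2} c_{k_2}\, e^{-z^{(2)}_{k_2}} \left(a^{(2)}_{i,j} - t^{(2)}_{k_2,ij}\right)$$

up to constant prefactors; the question's code also carries the factor of (y - f(x)).)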

in MATLAB. To explain some of the notation: df/dt^{(1)}_{i,j} should be a vector, z^{(2)}_{k2} is a real number, a^{(2)}_{i,j} is a real number, t^{(2)}_{k2} is a vector, x_i is a vector, and t^{(1)}_{i,j} is a vector. For more clarification on the notation, see the question related to this one. In addition, I have tried to comment the code heavily with notes on what the inputs and outputs should be, to minimize confusion about the dimensions of the variables involved.

I do have a potential implementation (which I believe is correct), but sometimes MATLAB has nice hidden tricks, so I am wondering whether this is a good vectorized implementation of the equation above or whether there is a better one.

Currently my code is as follows:

function [ dJ_dt1 ] = compute_t1_gradient(t1,x,y,f,z_l1,z_l2,a_l2,c,t2,lambda)
%compute_t1_gradient_loops - computes the t1 parameter of a 2 layer HBF
%   Computes dJ_dt1 according to:
%       dJ_dt1
%   Input:
%       t1 = centers (Dp x Dd x Np)
%       x = data (D x 1)
%       y = label (1 x 1)
%       f = f(x) (1 x 1)
%       z_l1 = inputs l2 (Np x Dd)
%       z_l2 = inputs l1 (K2 x 1)
%       a_l2 = activations l2 (Np x Dd)
%       a_l3 = activations l3 (K2 x 1)
%       c = weights (K2 x 1)
%       t2 = centers (K1 x K2)
%       lambda = reg param (1 x 1)
%       mu_c = step size (1 x 1)
%   Output:
%       dJ_dt1 = gradient (Dp x Dd x Np)
[Dp, ~, ~] = size(t1);
[Np, Dd] = size(a_l2);
x_parts = reshape(x, [Dp, Np])'; % Np x Dp
K1 = Np * Dd;
a_l2_col_vec = reshape(a_l2', [K1, 1]); %K1 x 1
alpha = bsxfun(@minus, a_l2_col_vec, t2); %K1 x K2
c_z_l2 = (c .* exp(-z_l2))'; % 1 x K2
alpha = bsxfun(@times, c_z_l2, alpha); %K1 x K2
alpha = bsxfun(@times, reshape(exp(-z_l1'),[K1, 1]) , alpha);
alpha = sum(alpha, 2); %K1 x 1
xi_t1 = bsxfun(@minus, x_parts', permute(t1, [1,3,2]));
% alpha K1 x 1
% xi_t1 Dp x Np x Dd
dJ_dt1 = bsxfun(@minus, reshape(alpha,[Dd, Np]), permute(xi_t1, [3, 2, 1]));
dJ_dt1 = permute(dJ_dt1,[3,1,2]);
dJ_dt1 = -4*(y-f)*dJ_dt1;
dJ_dt1 = dJ_dt1 + lambda * 0; %TODO
end
At this point I actually decided to re-implement the function above as a for loop. Unfortunately, the two do not give the same answer, which makes me doubt that the version above is correct. I will paste the for-loop code that I wanted/intended to vectorize:

function [ dJ_dt1 ] = compute_t1_gradient_loops(t1,x,y,f,z_l1,z_l2,a_l2,c,t2)
%compute_t1_gradient_loops - computes the t1 parameter of a 2 layer HBF
%   Computes t1 according to:
%       t1 := t1 - mu_c * dJ/dt1
%   Input:
%       t1 = centers (Dp x Dd x Np)
%       x = data (D x 1)
%       y = label (1 x 1)
%       f = f(x) (1 x 1)
%       z_l1 = inputs l2 (Np x Dd)
%       z_l2 = inputs l1 (K2 x 1)
%       a_l2 = activations l2 (Np x Dd)
%       a_l3 = activations l3 (K2 x 1)
%       c = weights (K2 x 1)
%       t2 = centers (K1 x K2)
%       lambda = reg param (1 x 1)
%       mu_c = step size (1 x 1)
%   Output:
%       dJ_dt1 = gradient (Dp x Dd x Np)
[Dp, ~, ~] = size(t1); %(Dp x Dd x Np)
[Np, Dd] = size(a_l2);
K2 = length(c);
t2_tensor = reshape(t2, Dd, Np, K2);
x_parts = reshape(x, [Dp, Np]);
dJ_dt1 = zeros(Dp, Dd, Np);
for i=1:Dd
    xi = x_parts(:,i);
    for j=1:Np
        t_l1_ij = t1(:,i,j);
        a_l2_ij = a_l2(j, i);
        z_l1_ij = z_l1(j,i);
        alpha_ij = 0;
        for k2=1:K2
            t2_k2ij = t2_tensor(i,j,k2);
            c_k2 = c(k2);
            z_l2_k2 = z_l2(k2);
            new_delta = c_k2*-1*exp(-z_l2_k2)*2*(a_l2_ij - t2_k2ij);
            alpha_ij = alpha_ij + new_delta;
        end
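        % note: the next line overwrites alpha_ij and discards the k2-sum
        % accumulated above -- this is the bug pointed out in the answer below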
        alpha_ij = 2*(y-f)*-1*exp(-z_l1_ij)*2*(xi - t_l1_ij);
        dJ_dt1(:,i,j) = alpha_ij;
    end
end
end
In fact, I even went as far as approximating the derivative with a gradient-checking style equation:
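(The equation image is missing here; from the test code below it is the standard central-difference check

$$\frac{\partial J}{\partial t^{(1)}_{1,1,1}} \approx \frac{J\!\left(t^{(1)} + e_{111}\right) - J\!\left(t^{(1)} - e_{111}\right)}{2\epsilon},$$

where e_{111} is zero everywhere except for the value \epsilon = 10^{-4} in entry (1,1,1).)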

and I even wrote code for that:

%% update t1 unit test
%% dimensions
Dp = 3;
Np = 4;
Dd = 2;
K2 = 5;
K1 = Dd * Np;
%% fake data & params
x = (1:Dp*Np)';
y = 3;
c = (1:K2)';
t2 = rand(K1, K2);
t1 = rand(Dp, Dd, Np);
lambda = 0;
mu_t1 = 1;
%% call f(x)
[f, z_l1, z_l2, a_l2, ~ ] = f_star(x,c,t1,t2,Np,Dp);
%% update gradient
dJ_dt1_ij_loops = compute_t1_gradient_loops(t1,x,y,f,z_l1,z_l2,a_l2,c,t2);
dJ_dt1 = compute_t1_gradient(t1,x,y,f,z_l1,z_l2,a_l2,c,t2,lambda);
eps = 1e-4;
e_111 = zeros( size(t1) );
e_111(1,1,1) = eps;
derivative = (J(y, x, c, t2, t1 + e_111, Np, Dp) - J(y, x, c, t2, t1  - e_111, Np, Dp) ) / (2*eps);
derivative
dJ_dt1_ij_loops(1,1,1)
dJ_dt1(1,1,1)
However, neither derivative seems to match the "approximated" derivative. The output of one run looks as follows:

>> update_t1_gradient_unit_test

derivative =

    0.0027

dJ_dt1_ij_loops

ans =

    0.0177

dJ_dt1

ans =

   -0.5182

>> 
It is not clear to me whether there is an error... it looks like it almost matches the loop version, but is that close enough?

Andrew Ng says:

However, I do not see 4 significant figures agreeing! Not even the same order of magnitude :( I assume both are wrong, but I cannot seem to figure out why, or where/how.
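One common way to quantify the agreement is the relative error between the two scalars printed by the unit test above (as a rule of thumb, not from the original post: a correct analytic gradient usually gives a relative error well below 1e-4):

    rel_err = abs(derivative - dJ_dt1_ij_loops(1,1,1)) / ...
              max(abs(derivative), abs(dJ_dt1_ij_loops(1,1,1)))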


On a related note, I have also asked for the derivative at the top to be checked, to see whether it is actually (mathematically) correct, since at this point I am not sure which part is wrong and which part is right. The link to that question is here:


Update

I have implemented a new version of the derivative with loops, and it almost agrees with a small example I created.

Here is the new implementation (it has a bug somewhere...):

Here is the code that computes the numerical derivatives (it is correct and works as expected):

I will provide the code for f and the numbers I actually use, in case people want to reproduce my results:

Here is the code for what f does (it is also correct and works as expected):

Here is the data I used for testing:

%% Test 1: 
% dimensions
disp('>>>>>>++++======--------> update t1 unit test');
% fake data & params
x = (1:6)'/norm(1:6,2)
c = [29, 30, 31, 32]'
t2 = [(13:16)/norm((13:16),2); (17:20)/norm((17:20),2); (21:24)/norm((21:24),2); (25:28)/norm((25:28),2)]'
Dp = 3;
Dd = 2;
Np = 2;
t1 = zeros(Dp,Dd, Np); % (Dp, Dd, Np)
t1(:,:,1) = [(1:3)/norm((1:3),2); (4:6)/norm((4:6),2)]';
t1(:,:,2) = [(7:9)/norm((7:9),2); (10:12)/norm((10:12),2)]';
t1
% call f(x)
[f, z_l1, z_l2, a_l2, a_l3 ] = f_star_loops(x,c,t1,t2)
% gradient
df_dt1_loops = compute_df_dt1_loops3(t1,x,z_l1,z_l2,a_l2,c,t2);
df_dt1_loops2 = compute_df_dt1_loops3(t1,x,z_l1,z_l2,a_l2,c,t2);
eps = 1e-10;
dJ_dt1_numerical = compute_numerical_derivatives( x, c, t1, t2, eps);
disp('---- Derivatives ----');
for np=1:Np
    np
    dJ_dt1_numerical_np = dJ_dt1_numerical(:,:,np);
    dJ_dt1_numerical_np
    df_dt1_loops2_np = df_dt1_loops(:,:,np);
    df_dt1_loops2_np
end
Note that the numerical derivatives are now correct (I am confident of this because I compared them against values returned by Mathematica and they match; in addition, f has already been debugged, so it works the way I want).

Here is a sample output (where the numerical-derivative matrices should match the derivative matrices obtained using my equation):


Answer

Update: I had some misunderstanding about the indices of some of the quantities in the formula, see also the updated question. I leave the original answer below (since the vectorization should be done the same way), and at the end I add the final vectorized version corresponding to the OP's actual problem, for completeness.

The problem

There are some inconsistencies between your code and your formula. In the formula you refer to x_i, but the corresponding size of your x array is that of the index j. This agrees with your math.stackexchange question, where i and j seem to be interchanged with respect to the notation you use here.
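A minimal illustration of the mismatch, using the two reshapes that actually occur in this post (the _j/_i suffixes are only for illustration):

    % the question's code: x has D = Dp*Np entries and is split along the j (Np) index
    x_parts_j = reshape(x, [Dp, Np]);
    % the formula's x_i suggests D = Dp*Dd entries, split along the i (Dd) index
    x_parts_i = reshape(x, [Dp, Dd]);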

Anyway, here is a fixed loopy version of your function:

function [ dJ_dt1 ] = compute_t1_gradient_loops(t1,x,y,f,z_l1,z_l2,a_l2,c,t2)
%compute_t1_gradient_loops - computes the t1 parameter of a 2 layer HBF
%   Input:
%       t1 = (Dp x Dd x Np)
%       x = (D x 1)
%       z_l1 = (Np x Dd)
%       z_l2 = (K2 x 1)
%       a_l2 = (Np x Dd)
%       c =  (K2 x 1)
%       t2 = (K1 x K2)
%
%       K1=Dd*Np
%        D=Dp*Dd
%       Dp,Np,Dd,K2 unique
%
%   Output:
%       dJ_dt1 = gradient (Dp x Dd x Np)
[Dp, ~, ~] = size(t1); %(Dp x Dd x Np)
[Np, Dd] = size(a_l2);
K2 = length(c);
t2_tensor = reshape(t2, Dd, Np, K2);  %Dd x Np x K2
x_parts = reshape(x, [Dp, Dd]);       %Dp x Dd
dJ_dt1 = zeros(Dp, Dd, Np);           %Dp x Dd x Np
for i=1:Dd
    xi = x_parts(:,i);
    for j=1:Np
        t_l1_ij = t1(:,i,j);
        a_l2_ij = a_l2(j, i);
        z_l1_ij = z_l1(j,i);
        alpha_ij = 0;
        for k2=1:K2
            t2_k2ij = t2_tensor(i,j,k2);
            c_k2 = c(k2);
            z_l2_k2 = z_l2(k2);
            new_delta = c_k2*exp(-z_l2_k2)*(a_l2_ij - t2_k2ij);
            alpha_ij = alpha_ij + new_delta;
        end
        alpha_ij = -4*alpha_ij* exp(-z_l1_ij)*(xi - t_l1_ij);
        dJ_dt1(:,i,j) = alpha_ij;
    end
end
end
A few things to note:

  • I changed the size of x to D=Dp*Dd in order to preserve the i index of the formula; otherwise more things would have had to be reconsidered.
  • You could simply use Dp=size(t1,1).
  • In your loopy version you forgot to keep alpha_ij after the summation: you overwrote the accumulated value with the prefactor term instead of multiplying by it (see the short before/after snippet following this list).
  • If I misunderstood your intentions, please let me know and I will change the loopy version accordingly.
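For concreteness, here is the line in question, first as it appears in the question's loop version and then as fixed above (the accumulated k2-sum is reused instead of being discarded):

    alpha_ij = 2*(y-f)*-1*exp(-z_l1_ij)*2*(xi - t_l1_ij); % bug: overwrites the k2-sum
    alpha_ij = -4*alpha_ij* exp(-z_l1_ij)*(xi - t_l1_ij); % fixed: multiplies the k2-sum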

Vectorized version

Assuming the loopy version does what you want, here is a vectorized version, similar to your original attempt:

    function [ dJ_dt1 ] = compute_t1_gradient_vect(t1,x,y,f,z_l1,z_l2,a_l2,c,t2)
    %compute_t1_gradient_vect - computes the t1 parameter of a 2 layer HBF
    %   Input:
    %       t1 = (Dp x Dd x Np)
    %       x = (D x 1)
    %       y = (1 x 1)
    %       f = (1 x 1)
    %       z_l1 = (Np x Dd)
    %       z_l2 = (K2 x 1)
    %       a_l2 = (Np x Dd)
    %       c =  (K2 x 1)
    %       t2 = (K1 x K2)
    %
    %       K1=Dd*Np
    %        D=Dp*Dd
    %       Dp,Np,Dd,K2 unique
    %
    %   Output:
    %       dJ_dt1 = gradient (Dp x Dd x Np)
    Dp = size(t1,1);
    [Np, Dd] = size(a_l2);
    K2 = length(c);
    t2_tensor = reshape(t2, Dd, Np, K2);  %Dd x Np x K2
    x_parts = reshape(x, [Dp, Dd]);       %Dp x Dd
    
    %reorder things to align for bsxfun later
    a_l2=a_l2'; %Dd x Np <-> i,j
    z_l1=z_l1'; %Dd x Np <-> i,j
    t2_tensor = permute(t2_tensor,[3 1 2]); %K2 x Dd x Np
    
    %the 1D part of the sum to be used in partialsum
    %prefactors also put here to minimize computational effort
    tempvar_k2 = -4*c.*exp(-z_l2); % K2 x 1
    
    %compute sum(b(k)*(c-d(k)) as c*sum(b(k))-sum(b(k)*d(k))  (NB)
    partialsum = a_l2*sum(tempvar_k2) ...
                 -squeeze(sum(bsxfun(@times,tempvar_k2,t2_tensor),1)); %Dd x Np
    
    %alternative computation by definition:
    %partialsum = bsxfun(@minus,a_l2,t2_tensor); %Dd x Np x K2
    %partialsum = permute(partialsum,[3 1 2]); %K2 x Dd x Np
    %partialsum = squeeze(sum(bsxfun(@times,tempvar_k2,partialsum),1)); %Dd x Np
    
    %last part of the formula, (x-t1)
    tempvar_lastterm = bsxfun(@minus,x_parts,t1); %Dp x Dd x Np
    tempvar_lastterm = permute(tempvar_lastterm,[2 3 1]); %Dd x Np x Dp
    
    %put together what we have
    dJ_dt1 = bsxfun(@times,partialsum.*exp(-z_l1),tempvar_lastterm); %Dd x Np x Dp
    dJ_dt1 = permute(dJ_dt1,[3 1 2]); %Dp x Dd x Np
    
Note that I again changed the definition of x, and that ..._vect2 stands for the "naive" version of the vectorized code. It turns out that the resulting derivatives agree exactly for the loopy version and the naive vectorized version, while there is at most a 2e-14 difference between those and the optimized vectorized version. This means that we are good, and the difference close to machine precision is only due to the fact that the operations are performed in a different order.
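A sketch of how such a comparison can be done, assuming the three gradients have been computed as in the test script reproduced further below:

    % per the numbers reported above: 0 for the naive version, ~2e-14 for the optimized one
    max_diff_naive = max(abs(dJ_dt1_ij_loops(:) - dJ_dt1_vect2(:)))
    max_diff_opt   = max(abs(dJ_dt1_ij_loops(:) - dJ_dt1_vect(:)))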

To measure performance, I multiplied the dimensions of the original test case by 100:

    %% dimensions
    Dp = 300;
    Np = 400;
    Dd = 200;
    K2 = 500;
    K1 = Dd * Np;
    
I also set variables to check cputime before and after each function call (since tic/toc only measures wall-clock time). The measured times were 23 seconds, 2 seconds and 4 seconds for the loopy, the optimized and the "naive" vectorized versions, respectively. On the other hand, the maximal difference between the latter two derivatives is now 1.8e-5. Of course our test data are random, which is not the best-conditioned data to say the least. In an actual application this difference will probably not be an issue, but you should always be careful with loss of precision (and in the optimized version we are specifically subtracting two possibly large numbers).
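A minimal timing sketch along those lines, assuming the test data has been regenerated with the enlarged dimensions (cputime measures CPU time, unlike tic/toc):

    t0 = cputime;
    dJ_dt1_ij_loops = compute_t1_gradient_loops(t1,x,y,f,z_l1,z_l2,a_l2,c,t2);
    t_loops = cputime - t0;

    t0 = cputime;
    dJ_dt1_vect = compute_t1_gradient_vect(t1,x,y,f,z_l1,z_l2,a_l2,c,t2);
    t_vect = cputime - t0;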

For reference, this is the sample output referred to in the question above, comparing the numerical derivative with the loop-based derivative:

    ---- Derivatives ----
    np =
         1
    dJ_dt1_numerical_np =
        7.4924   13.1801
       14.9851   13.5230
       22.4777   13.8660
    df_dt1_loops2_np =
        7.4925    5.0190
       14.9851    6.2737
       22.4776    7.5285
    np =
         2
    dJ_dt1_numerical_np =
       11.4395   13.3836
        6.9008    6.6363
        2.3621   -0.1108
    df_dt1_loops2_np =
       14.9346   13.3835
       13.6943    6.6363
       12.4540   -0.1108
    
For reference, here is the test script used to compare the three versions (..._vect is the optimized vectorized version and ..._vect2 the naive variant mentioned above, which is not reproduced here):

    %% update t1 unit test
    %% dimensions
    Dp = 3;
    Np = 4;
    Dd = 2;
    K2 = 5;
    K1 = Dd * Np;
    %% fake data & params
    x = (1:Dp*Dd)';
    y = 3;
    c = (1:K2)';
    t2 = rand(K1, K2);
    t1 = rand(Dp, Dd, Np);
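    % (assumed, mirroring the question's unit test above: f, z_l1, z_l2 and a_l2
    %  have to be computed before the gradients can be evaluated)
    [f, z_l1, z_l2, a_l2, ~ ] = f_star(x,c,t1,t2,Np,Dp);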
    %% update gradient
    dJ_dt1_ij_loops = compute_t1_gradient_loops(t1,x,y,f,z_l1,z_l2,a_l2,c,t2);
    dJ_dt1_vect = compute_t1_gradient_vect(t1,x,y,f,z_l1,z_l2,a_l2,c,t2);
    dJ_dt1_vect2 = compute_t1_gradient_vect2(t1,x,y,f,z_l1,z_l2,a_l2,c,t2);
    
Finally, as announced in the update at the top of this answer, here is the final vectorized version corresponding to the OP's actual problem, where x is split along the j index (D = Dp*Np):

    function [ dJ_dt1 tempout] = compute_t1_gradient_vect(t1,x,z_l1,z_l2,a_l2,c,t2)
    %compute_t1_gradient_vect - computes the t1 parameter of a 2 layer HBF
    %   Input:
    %       t1 = (Dp x Dd x Np)
    %       x = (D x 1)
    %       z_l1 = (Np x Dd)
    %       z_l2 = (K2 x 1)
    %       a_l2 = (Np x Dd)
    %       c =  (K2 x 1)
    %       t2 = (K1 x K2)
    %
    %       K1=Dd*Np
    %        D=Dp*Np
    %       Dp,Np,Dd,K2 unique
    %
    %   Output:
    %       dJ_dt1 = gradient (Dp x Dd x Np)
    Dp = size(t1,1);
    [Np, Dd] = size(a_l2);
    K2 = length(c);
    t2_tensor = reshape(t2, Dd, Np, K2);  %Dd x Np x K2
    x_parts = reshape(x, [Dp, Np]);       %Dp x Np
    t1 = permute(t1,[1 3 2]);             %Dp x Np x Dd
    
    a_l2=a_l2'; %Dd x Np <-> j,i
    z_l1=z_l1'; %Dd x Np <-> j,i
    
    tempvar_k2 = -4*c.*exp(-z_l2); % K2 x 1
    
    partialsum = bsxfun(@minus,a_l2,t2_tensor); %Dd x Np x K2
    partialsum = permute(partialsum,[3 1 2]);   %K2 x Dd x Np
    partialsum = squeeze(sum(bsxfun(@times,tempvar_k2,partialsum),1)); %Dd x Np
    
    tempvar_lastterm = bsxfun(@minus,x_parts,t1);         %Dp x Np x Dd
    tempvar_lastterm = permute(tempvar_lastterm,[3 2 1]); %Dd x Np x Dp
    
    dJ_dt1 = bsxfun(@times,partialsum.*exp(-z_l1),tempvar_lastterm); %Dd x Np x Dp
    tempout=tempvar_lastterm;
    dJ_dt1 = permute(dJ_dt1,[3 1 2]); %Dp x Dd x Np
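As a usage sketch, assuming the f_star_loops and compute_numerical_derivatives helpers from the question's test script, this final (7-argument) version can be checked against the numerical derivative like this:

    [f, z_l1, z_l2, a_l2, a_l3 ] = f_star_loops(x,c,t1,t2);
    df_dt1_vect = compute_t1_gradient_vect(t1,x,z_l1,z_l2,a_l2,c,t2);
    df_dt1_numerical = compute_numerical_derivatives(x, c, t1, t2, 1e-10);
    max(abs(df_dt1_vect(:) - df_dt1_numerical(:)))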