How to implement the Softmax derivative independently of any loss function? (Neural Network, Regression, Backpropagation, Derivative, Softmax)

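For a neural network library, I implemented some activation functions and loss functions together with their derivatives. They can be combined arbitrarily, and the derivative at the output layer is simply the product of the loss derivative and the activation derivative.

However, I failed to implement the derivative of the Softmax activation function independently of any loss function. Because of the normalization, i.e. the denominator in the equation, changing a single input activation changes all output activations, not just one.

Here is my Softmax implementation, whose derivative fails gradient checking by about 1%. How can I implement the Softmax derivative so that it can be combined with any loss function?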

import numpy as np


class Softmax:

    def compute(self, incoming):
        exps = np.exp(incoming)
        return exps / exps.sum()

    def delta(self, incoming, outgoing):
        exps = np.exp(incoming)
        others = exps.sum() - exps
        return 1 / (2 + exps / others + others / exps)


activation = Softmax()
cost = SquaredError()

outgoing = activation.compute(incoming)
delta_output_layer = activation.delta(incoming, outgoing) * cost.delta(outgoing)

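It should look like this (x is the input to the softmax layer, dy is the delta coming from the loss above it):

But the way you compute the error should be: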

    yact = activation.compute(x)
    ycost = cost.compute(yact)
    dsoftmax = activation.delta(x, cost.delta(yact, ycost, ytrue))

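Explanation: because the delta function is part of the backpropagation algorithm, its responsibility is to multiply the vector dy (in my code; outgoing in your case) by the Jacobian of the compute(x) function evaluated at x. If you work out what this Jacobian looks like for softmax [1] and then multiply it from the left by the vector dy, after a bit of algebra you will find that you get something that corresponds to my Python code.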
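The product described above can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the code from the answer: the names softmax and softmax_delta are hypothetical, and the closed form used is the standard result of contracting dy with the softmax Jacobian. The last lines compare it against the explicit Jacobian product.

```python
import numpy as np

def softmax(x):
    # Shift by the maximum for numerical stability; the result is unchanged.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_delta(x, dy):
    """Multiply dy by the softmax Jacobian at x without forming the Jacobian."""
    y = softmax(x)
    # dx_i = y_i * (dy_i - sum_j dy_j * y_j)
    return y * (dy - (dy * y).sum(axis=-1, keepdims=True))

# Compare against the explicit n x n Jacobian: diag(y) - y y^T.
x = np.array([0.5, -1.0, 2.0])
dy = np.array([0.1, -0.3, 0.2])
y = softmax(x)
explicit = dy @ (np.diag(y) - np.outer(y, y))
print(np.allclose(softmax_delta(x, dy), explicit))
```

Because dy can come from any loss, a function of this shape stays independent of the loss function, which is what the question asks for.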

[1]

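Mathematically, the derivative of Softmax σ(j) with respect to the logit Zi (for example, Wi*X) is

$$\frac{\partial \sigma_j}{\partial z_i} = \sigma_j \, (\delta_{ij} - \sigma_i)$$

where δ_ij is the Kronecker delta.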
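For readers who want to see where that derivative comes from, here is a short derivation using the quotient rule. The symbol S for the normalizing sum is introduced here only for illustration.

```latex
\sigma_j = \frac{e^{z_j}}{S}, \qquad S = \sum_k e^{z_k}

\frac{\partial \sigma_j}{\partial z_i}
  = \frac{\delta_{ij}\, e^{z_j}\, S - e^{z_j}\, e^{z_i}}{S^2}
  = \delta_{ij}\,\sigma_j - \sigma_j\,\sigma_i
  = \sigma_j\,(\delta_{ij} - \sigma_i)
```

The second term is nonzero even when i differs from j, which is why changing one input changes every output.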

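If implemented iteratively:

def softmax_grad(s):
    # input s is the softmax value of the original input x; it must be a 1-D array of shape (n,),
    # because np.diag only builds an n x n matrix from 1-D input
    # e.g. s = np.array([0.3, 0.7]) for x = np.array([0, 1]) (values rounded)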

    # make the matrix whose size is n^2.
    jacobian_m = np.diag(s)

    for i in range(len(jacobian_m)):
        for j in range(len(jacobian_m)):
            if i == j:
                jacobian_m[i][j] = s[i] * (1 - s[i])
            else: 
                jacobian_m[i][j] = -s[i] * s[j]
    return jacobian_m

Test:

If implemented in a vectorized version:

soft_max = softmax(x)    

# reshape softmax to 2d so np.dot gives matrix multiplication

def softmax_grad(softmax):
    s = softmax.reshape(-1,1)
    return np.diagflat(s) - np.dot(s, s.T)

softmax_grad(soft_max)

#array([[ 0.19661193, -0.19661193],
#       [-0.19661193,  0.19661193]])

Here is a C++ vectorized version using intrinsics (22 times (!) faster than the non-SSE version). The full listing, SoftmaxGrad_fromResult together with its non-SSE counterpart, appears further down this page.

In case you process in batches, here is an implementation in NumPy (tested vs TensorFlow). However, I would suggest avoiding the associated tensor operations by combining the Jacobian with the cross-entropy, which leads to a very simple and efficient expression.

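import numpy as np
import tensorflow as tf

def softmax(z):
  # subtract the per-row maximum for numerical stability
  exps = np.exp(z - np.max(z, axis=1, keepdims=True))
  return exps / np.sum(exps, axis=1, keepdims=True)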

def softmax_jacob(s):
  return np.einsum('ij,jk->ijk', s, np.eye(s.shape[-1])) \
       - np.einsum('ij,ik->ijk', s, s)

def np_softmax_test(z):
  return softmax_jacob(softmax(z))

def tf_softmax_test(z):
  z = tf.constant(z, dtype=tf.float32)
  with tf.GradientTape() as g:
    g.watch(z)
    a = tf.nn.softmax(z) 
  jacob = g.batch_jacobian(a, z)
  return jacob.numpy()

z = np.random.randn(3, 5)
np.all(np.isclose(np_softmax_test(z), tf_softmax_test(z)))
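
The suggestion above to merge the Jacobian with the cross-entropy can be sketched as follows. This is an illustration under my own assumptions, not code from the answer: the function names are hypothetical, labels are assumed to be integer class indices, and the loss is assumed to be averaged over the batch. Under those assumptions the gradient with respect to the logits reduces to the softmax output minus the one-hot target, divided by the batch size.

```python
import numpy as np

def softmax(z):
    # Per-row shift for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def softmax_xent_loss(z, labels):
    """Mean cross-entropy of softmax(z) against integer class labels."""
    p = softmax(z)
    return -np.mean(np.log(p[np.arange(len(labels)), labels]))

def softmax_xent_grad(z, labels):
    """Gradient of the mean loss w.r.t. the logits: (softmax(z) - one_hot) / batch."""
    grad = softmax(z)
    grad[np.arange(len(labels)), labels] -= 1.0
    return grad / len(labels)

rng = np.random.default_rng(0)
z = rng.standard_normal((3, 5))
labels = np.array([0, 2, 4])
analytic = softmax_xent_grad(z, labels)
```

No (batch, n, n) Jacobian tensor is built here, so memory grows linearly with the number of classes instead of quadratically. The price is that this gradient is tied to cross-entropy and is no longer loss-independent.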

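The other answers are great; here I share a simple forward/backward implementation that does not depend on the loss function.

The figure that accompanied this answer (captioned "backward") gave a brief derivation of the backward pass for softmax. Its second equation depends on the loss function and is not part of our implementation.

Verified by manual gradient checking:
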
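import numpy as np

class Softmax:
    def forward(self, x):
        mx = np.max(x, axis=1, keepdims=True)
        x = x - mx  # log-sum-exp trick
        e = np.exp(x)
        probs = e / np.sum(e, axis=1, keepdims=True)
        return probs

    def backward(self, x, probs, bp_err):
        dim = x.shape[1]
        output = np.empty(x.shape)
        for j in range(dim):
            d_prob_over_xj = -(probs * probs[:, [j]])  # i.e. -prob_k * prob_j, no matter k==j or not
            d_prob_over_xj[:, j] += probs[:, j]  # i.e. when k==j, +prob_j
            output[:, j] = np.sum(bp_err * d_prob_over_xj, axis=1)
        return output

def compute_manual_grads(x, pred_fn):
    # central finite differences; x is perturbed in place and restored
    eps = 1e-3
    batch_size, dim = x.shape
    grads = np.empty(x.shape)
    for i in range(batch_size):
        for j in range(dim):
            x[i, j] += eps
            y1 = pred_fn(x)
            x[i, j] -= 2 * eps
            y2 = pred_fn(x)
            grads[i, j] = (y1 - y2) / (2 * eps)
            x[i, j] += eps
    return grads

def loss_fn(probs, ys, loss_type):
    batch_size = probs.shape[0]
    # dummy mse
    if loss_type == "mse":
        loss = np.sum((np.take_along_axis(probs, ys.reshape(-1, 1), axis=1) - 1) ** 2) / batch_size
        values = 2 * (np.take_along_axis(probs, ys.reshape(-1, 1), axis=1) - 1) / batch_size
    # cross entropy
    if loss_type == "xent":
        loss = -np.sum(np.take_along_axis(np.log(probs), ys.reshape(-1, 1), axis=1)) / batch_size
        values = -1 / np.take_along_axis(probs, ys.reshape(-1, 1), axis=1) / batch_size
    err = np.zeros(probs.shape)
    np.put_along_axis(err, ys.reshape(-1, 1), values, axis=1)
    return loss, err

if __name__ == "__main__":
    batch_size = 10
    dim = 5
    x = np.random.rand(batch_size, dim)
    ys = np.random.randint(0, dim, batch_size)
    for loss_type in ["mse", "xent"]:
        S = Softmax()
        probs = S.forward(x)
        loss, bp_err = loss_fn(probs, ys, loss_type)
        grads = S.backward(x, probs, bp_err)

        def pred_fn(x, ys):
            pred = S.forward(x)
            loss, err = loss_fn(pred, ys, loss_type)
            return loss

        manual_grads = compute_manual_grads(x, lambda x: pred_fn(x, ys))
        # compare both grads
        print(f"loss_type = {loss_type}, grad diff = {np.sum((grads - manual_grads) ** 2) / batch_size}")
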
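Thanks for the answer. What do you mean by res?

I meant dx (I refactored this answer's code by hand and forgot about this occurrence =). I fixed it in the answer.

Your solution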
#include <immintrin.h> // __m256 and the _mm256_* intrinsics
#include <cassert>
#include <cstddef>     // size_t

// How many floats fit into __m256 "group".
// Used by vectors and matrices, to ensure their dimensions are appropriate for 
// intrinsics.
// Otherwise, consecutive rows of matrices will not be 32-byte aligned, and 
// operations on them will be incorrect.
#define F_MULTIPLE_OF_M256 8


//check to quickly see if your rows are divisible by m256.
//you can 'undefine' to save performance, after everything was verified to be correct.
#define ASSERT_THE_M256_MULTIPLES
#ifdef ASSERT_THE_M256_MULTIPLES
    #define assert_is_m256_multiple(x)  assert( (x%F_MULTIPLE_OF_M256) == 0)
#else
    #define assert_is_m256_multiple(q) 
#endif


// usually used at the end of our Reduce functions,
// where the final __m256 mSum needs to be collapsed into 1 scalar.
static inline float slow_hAdd_ps(__m256 x){
    const float *sumStart = reinterpret_cast<const float*>(&x);
    float sum = 0.0f;

    for(size_t i=0; i<F_MULTIPLE_OF_M256; ++i){
        sum += sumStart[i];
    }
    return sum;
}



f_vec SoftmaxGrad_fromResult(const float *softmaxResult,  size_t size,  
                             const float *gradFromAbove){//<--gradient vector, flowing into us from the above layer
assert_is_m256_multiple(size);
//allocate vector, where to store output:
f_vec grad_v(size, true);//true: skip filling with zeros, to save performance.

const __m256* end   = (const __m256*)(softmaxResult + size);


for(size_t i=0; i<size; ++i){// <--for every row
    //go through this i'th row:
    __m256 sum =  _mm256_set1_ps(0.0f);

    const __m256 neg_sft_i  =  _mm256_set1_ps( -softmaxResult[i] );
    const __m256 *s  =  (const __m256*)softmaxResult;
    const __m256 *gAbove  =   (__m256*)gradFromAbove;

    for (s;  s<end; ){
        __m256 mul =  _mm256_mul_ps(*s, neg_sft_i);  //  sftmaxResult_j  *  (-sftmaxResult_i)
        mul =  _mm256_mul_ps( mul, *gAbove );

        sum =  _mm256_add_ps( sum,  mul );//adding to the total sum of this row.
        ++s;
        ++gAbove;
    }
    grad_v[i]  =  slow_hAdd_ps( sum );//collapse the sum into 1 scalar (true sum of this row).
}//end for every row

//reset back to start and add a vector, to account for Kronecker delta:
__m256 *g =  (__m256*)grad_v._contents;
__m256 *s =  (__m256*)softmaxResult;
__m256 *gAbove =  (__m256*)gradFromAbove;

for(s; s<end; ){
    __m256 mul = _mm256_mul_ps(*s, *gAbove);
    *g = _mm256_add_ps( *g, mul );
    ++s; 
    ++g;
    ++gAbove;//advance the upstream gradient together with s and g
}

return grad_v;

}

inline static void SoftmaxGrad_fromResult_nonSSE(const float* softmaxResult,  
                                                 const float *gradFromAbove,  //<--gradient vector, flowing into us from the above layer
                                                 float *gradOutput,  
                                                 size_t count ){
    // every pre-softmax element in a layer contributed to the softmax of every other element
    // (it went into the denominator). So gradient will be distributed from every post-softmax element to every pre-elem.
    for(size_t i=0; i<count; ++i){
        //go through this i'th row:
        float sum =  0.0f;

        const float neg_sft_i  =  -softmaxResult[i];

        for(size_t j=0; j<count; ++j){
            float mul =  gradFromAbove[j] * softmaxResult[j] * neg_sft_i;
            sum +=  mul;//adding to the total sum of this row.
        }
        //NOTICE: equals, overwriting any old values:
        gradOutput[i]  =  sum;
    }//end for every row

    for(size_t i=0; i<count; ++i){
        gradOutput[i] +=  softmaxResult[i] * gradFromAbove[i];
    }
}
