What is wrong with this gradient descent implementation in Python?


I am trying to implement linear regression with gradient descent, but my error diverges to infinity. I have gone through my code carefully and still cannot find what is wrong. I am hoping someone can help me debug why this linear regression implementation does not work.

With N=100 there is no problem, but with N=1000 the error diverges to infinity.

import numpy as np

class Regression:
    def __init__(self, xs, ys, w,alpha):
        self.w = w
        self.xs = xs
        self.ys = ys
        self.a = alpha
        self.N = float(len(xs))

    def error(self, ys, yhat):
        return (1./self.N)*np.sum((ys-yhat)**2)

    def propagate(self):
        # one gradient-descent step on the mean squared error
        yhat = self.xs*self.w[0] + self.w[1]
        loss = yhat - self.ys

        # gradients of the error with respect to the slope w[0] and the intercept w[1]
        r1 = (2./self.N)*np.sum(loss*self.xs)
        r2 = (2./self.N)*np.sum(loss)

        self.w[0] -= self.a*r1
        self.w[1] -= self.a*r2


N = 600
xs = np.arange(0,N)
bias = np.random.sample(size=N)*10
ys = xs * 2. + 2. + bias
ws = np.array([0.,0.])

regressor = Regression(
    xs, ys, ws,
    0.00001)

for i in range(1000):
    regressor.propagate()
    # print the current error each step; this produces the values listed below
    print(regressor.error(ys, xs*ws[0] + ws[1]))
Output:

...
2.71623180177e+286
5.27841816362e+286
1.02574818143e+287
1.99332318715e+287
3.87359919362e+287
7.52751526171e+287
1.46281231441e+288
2.84266426942e+288
5.52411274435e+288
1.07349369184e+289
2.0861064206e+289
4.05390365232e+289
7.87789858657e+289
1.5309018532e+290
2.97498179035e+290
5.78124367308e+290
1.12346161297e+291
2.18320843611e+291
4.24260074438e+291
8.2445912074e+291
1.6021607564e+292
3.11345829619e+292
6.05034327761e+292
1.17575539141e+293
2.28483026006e+293
4.4400811218e+293
8.62835227315e+293

As you increase N, the gradient components r1 and r2 at the starting point w = [0, 0] scale quadratically and linearly with N, respectively. For a large enough N, the initial step taken by w is more than twice the size of its error, so each correction overshoots and actually increases the error. This positive feedback makes w oscillate around the correct value with ever-growing amplitude instead of converging.

If you shrink alpha by a factor of ten, you will find that N = 1000 converges.
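
To make that scaling claim concrete, here is a small sketch (not part of the original answer) that evaluates the first slope gradient r1 at the starting point w = [0, 0] for a few values of N, reusing the question's data-generating code; the helper name first_r1 is made up for illustration:

import numpy as np

def first_r1(N):
    # slope component of the gradient at w = [0, 0], exactly as computed in propagate()
    xs = np.arange(0, N)
    ys = xs * 2. + 2. + np.random.sample(size=N) * 10
    loss = np.zeros(N) - ys          # yhat is 0 everywhere when w = [0, 0]
    return (2. / N) * np.sum(loss * xs)

for N in (100, 600, 1000):
    # |r1| grows roughly like N**2, since sum(x*y) is on the order of 2*N**3/3
    print(N, first_r1(N))

For N = 600 this gives about -4.8e5, which (up to noise) matches the r1 column in the trace shown in the next answer, so a fixed alpha that is stable at N = 100 overshoots badly at N = 1000.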

You have exceeded the convergence radius of the method. I added a print statement at the bottom of propagate to track the effect:

    self.w = np.array(res).astype(np.float)
    print self.error(ys, yhat), '\t', r1, '\t', r2, '\t', self.w
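
(The res variable in the snippet above comes from the answerer's own reworked propagate, which is not shown. A rough reconstruction in terms of the question's code, with res assumed to be the pair of updated weights and the print rewritten in Python 3 style, might look like this:)

    def propagate(self):
        yhat = self.xs * self.w[0] + self.w[1]
        loss = yhat - self.ys

        r1 = (2. / self.N) * np.sum(loss * self.xs)
        r2 = (2. / self.N) * np.sum(loss)

        # assumed shape of the answerer's res: the updated weight pair
        res = [self.w[0] - self.a * r1, self.w[1] - self.a * r2]
        self.w = np.array(res).astype(float)
        print(self.error(self.ys, yhat), '\t', r1, '\t', r2, '\t', self.w)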
As K. A. Buhr points out, r1 scales quadratically with N. Choose your learning rate based on your input; it is not the constant that the SGD algorithm promises. Here is the output of the first 20 iterations at N = 600, as given in your code:

486826.997899   -482786.592791  -1211.52883528  [ 4.82786593  0.01211529]
946024.542374   673013.376697   1680.38708612   [-1.90226784 -0.00468858]
1838377.19732   -938192.956012  -2350.99664804  [ 7.47966172  0.01882138]
3572474.5816    1307858.19046   3268.82617841   [-5.59892018 -0.01386688]
6942323.62211   -1823178.2573   -4565.30975898  [ 12.63286239   0.03178622]
13490907.7204   2541543.91414   6355.61930844   [-12.78257675  -0.03176997]
26216686.5837   -3542958.75828  -8868.35584965  [ 22.64701083   0.05691359]
50946528.2176   4938949.44036   12354.1444796   [-26.74248357  -0.06662786]
99003709.9274   -6884985.98436  -17230.4097511  [ 42.10737627   0.10567624]
192392610.191   9597796.6223    24011.0009034   [-53.87058995  -0.13443377]
373874053.385   -13379504.31    -33480.2810842  [ 79.92445315   0.20036904]
726544597.0     18651274.1534   46663.6193386   [-106.58828839   -0.26626715]
1411884707.51   -26000217.8559  -65058.4461128  [ 153.41389017    0.38431731]
2743697288.89   36244780.0586   90684.1600127   [-209.03391041   -0.52252429]
5331791469.79   -50525887.4157  -126423.886221  [ 296.22496374    0.74171457]
10361201450.4   70434012.7562   176228.707876   [-408.11516382   -1.02057251]
20134788880.2   -98186304.1721  -245674.553107  [ 573.7478779     1.43617302]
39127675046.8   136873506.894   342466.322375   [-794.98719104   -1.9884902 ]
76036305324.8   -190804176.229  -477412.833248  [ 1113.05457125     2.78563813]
147760369643.0  265984517.38    665513.730619   [-1546.79060255    -3.86949918]
However, with alpha set to 1e-6 (rather than 1e-5), the first 10 lines are:

14495.6359775   -13788.3126768  -211.542964687  [ 0.01378831  0.00021154]
14306.0982004   -13697.7438847  -210.177498646  [ 0.02748606  0.00042172]
14119.0422005   -13607.7699931  -208.821001646  [ 0.04109383  0.00063054]
13934.4354818   -13518.3870942  -207.473414775  [ 0.05461221  0.00083801]
13752.2459738   -13429.5913063  -206.134679506  [ 0.0680418   0.00104415]
13572.4420258   -13341.3787729  -204.804737697  [ 0.08138318  0.00124895]
13394.9924018   -13253.7456628  -203.483531589  [ 0.09463693  0.00145244]
13219.8662747   -13166.6881702  -202.171003801  [ 0.10780362  0.00165461]
13047.0332208   -13080.202514   -200.867097331  [ 0.12088382  0.00185548]
12876.4632151   -12994.2849383  -199.571755548  [ 0.13387811  0.00205505]
12708.1266257   -12908.9317115  -198.284922195  [ 0.14678704  0.00225333]

... and it keeps converging. Incidentally, even at N = 600, 1000 iterations are not enough for proper convergence; you may want to stop on an epsilon threshold rather than on a fixed number of iterations.
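
As a rough illustration of both points, here is a sketch (my own, not from the answer) that picks alpha from the scale of the inputs and stops on an error threshold instead of a fixed iteration count; the curvature heuristic, the 1e-9 tolerance and the 100000 cap are all arbitrary choices:

# curvature of the MSE along the slope direction; keep the step well below 2/curvature
curvature = (2. / N) * np.sum(xs.astype(float) ** 2)
alpha = 1.0 / curvature                    # about 4e-6 for N = 600, smaller for larger N

regressor = Regression(xs, ys, np.array([0., 0.]), alpha)

prev_err = float('inf')
for i in range(100000):                    # hard cap so the loop always terminates
    regressor.propagate()
    err = regressor.error(ys, xs * regressor.w[0] + regressor.w[1])
    if abs(prev_err - err) < 1e-9:         # epsilon-based stopping
        break
    prev_err = err

print(i, err, regressor.w)

With the question's data this stays stable for N = 1000 as well, because alpha shrinks automatically as the gradient grows with N.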

This is better suited to codereview.stackexchange.com.

Is there an algorithm for choosing the learning rate? That is, the success of the method depends heavily on the learning rate; where can I read more about how to deal with this?