
Efficient online linear regression algorithm in Python


I have a two-dimensional dataset with two columns, x and y. I would like to obtain the linear regression coefficient and intercept on the fly as new data comes in. Using scikit-learn, I can compute over all currently available data as follows:

import numpy as np
from sklearn.linear_model import LinearRegression

regr = LinearRegression()
x = np.arange(100).reshape(-1, 1)  # sklearn expects a 2-D feature array
y = np.arange(100) + 10 * np.random.random_sample((100,))
regr.fit(x, y)
print(regr.coef_)
print(regr.intercept_)
However, my dataset is fairly large (more than 10k rows in total), and I want to recompute the coefficient and intercept as quickly as possible whenever new rows arrive. Currently, computing over 10k rows takes about 600 microseconds, and I would like to speed this up.


Scikit-learn does not appear to offer an online-update option for its linear regression module. Is there a better way to do this?

I found the solution in this post: . The implementation is as follows:

import numpy as np

def lr(x_avg, y_avg, Sxy, Sx, n, new_x, new_y):
    """
    x_avg: average of the previous x values; 0 if there are no previous samples
    y_avg: average of the previous y values; 0 if there are no previous samples
    Sxy: sum of cross-deviations of the previous x and y (n times their covariance); 0 if there are no previous samples
    Sx: sum of squared deviations of the previous x (n times its variance); 0 if there are no previous samples
    n: number of previous samples
    new_x: new incoming 1-D numpy array x
    new_y: new incoming 1-D numpy array y
    """
    new_n = n + len(new_x)

    # Merge the old mean with the new batch.
    new_x_avg = (x_avg * n + np.sum(new_x)) / new_n
    new_y_avg = (y_avg * n + np.sum(new_y)) / new_n

    # Reference point chosen so that a single pass over the new batch
    # updates the deviation sums exactly.
    if n > 0:
        x_star = (x_avg * np.sqrt(n) + new_x_avg * np.sqrt(new_n)) / (np.sqrt(n) + np.sqrt(new_n))
        y_star = (y_avg * np.sqrt(n) + new_y_avg * np.sqrt(new_n)) / (np.sqrt(n) + np.sqrt(new_n))
    elif n == 0:
        x_star = new_x_avg
        y_star = new_y_avg
    else:
        raise ValueError("n must be non-negative")

    # Fold the new batch into the deviation sums.
    new_Sx = Sx + np.sum((new_x - x_star) ** 2)
    new_Sxy = Sxy + np.sum((new_x - x_star).reshape(-1) * (new_y - y_star).reshape(-1))

    # Ordinary least-squares slope and intercept.
    beta = new_Sxy / new_Sx
    alpha = new_y_avg - beta * new_x_avg
    return new_Sxy, new_Sx, new_n, alpha, beta, new_x_avg, new_y_avg
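
As noted in the comments below, the incremental update produces the same slope and intercept as scikit-learn's batch fit. A quick sanity check (a minimal sketch; the two-batch split is arbitrary):

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(100)
y = np.arange(100) + 10 * np.random.random_sample((100,))

# Feed the data in two batches through the incremental update.
Sxy, Sx, n, alpha, beta, x_avg, y_avg = lr(0, 0, 0, 0, 0, x[:60], y[:60])
Sxy, Sx, n, alpha, beta, x_avg, y_avg = lr(x_avg, y_avg, Sxy, Sx, n, x[60:], y[60:])

# Batch fit over the full dataset for comparison.
regr = LinearRegression().fit(x.reshape(-1, 1), y)

assert np.isclose(beta, regr.coef_[0])
assert np.isclose(alpha, regr.intercept_)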
Performance comparison:

Scikit-learn version, computed over all 10k samples:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(10000).reshape(-1, 1)
y = np.arange(10000) + 100 * np.random.random_sample((10000,))
regr = LinearRegression()
%timeit regr.fit(x, y)
# 419 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
My version, assuming 9k samples have already been processed:

Sxy, Sx, n, alpha, beta, new_x_avg, new_y_avg = lr(0, 0, 0, 0, 0, x[:9000], y[:9000])
new_x, new_y = x[9000:], y[9000:]
%timeit lr(new_x_avg, new_y_avg, Sxy, Sx, n, new_x, new_y)
# 38.7 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

This is about 10 times faster, as expected, since only the new rows need to be processed.
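In a streaming setting you keep the running state between calls and fold in each new chunk as it arrives. A minimal sketch (the batch size and synthetic data are illustrative):

import numpy as np

# Running state; all zeros before any data has been seen.
Sxy, Sx, n, x_avg, y_avg = 0, 0, 0, 0, 0

rng = np.random.default_rng(0)
for _ in range(10):
    # Stand-in for a newly arrived batch of rows.
    new_x = rng.random(100)
    new_y = 2 * new_x + rng.random(100)

    Sxy, Sx, n, alpha, beta, x_avg, y_avg = lr(x_avg, y_avg, Sxy, Sx, n, new_x, new_y)

# Slope should approach 2 and intercept 0.5 for this synthetic data.
print(alpha, beta)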


Comments:

In sklearn, only certain estimators have online-learning capability.
@VivekKumar Is there any other formula or package that can solve this?
sklearn.linear_model.SGDRegressor is linear regression, but instead of least squares it uses gradient descent. You should give it a try and see whether your outputs are close enough (or at least whether the "loss" is the same); plus, SGD (stochastic gradient descent) is much faster on large datasets with high-dimensional features.
Do you get similar predictions/coefficients compared with sklearn?
@VivekKumar Their coefficients and intercepts are identical.
Sxy is n times the covariance, and Sx is n times the variance of x, isn't it?
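
As suggested in the comments, SGDRegressor supports incremental updates via partial_fit. A minimal sketch for comparison (default hyperparameters; the synthetic data is illustrative, and SGD is sensitive to feature scale, so features here are kept in [0, 1)):

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
regr = SGDRegressor()  # squared loss by default, i.e. linear regression

for _ in range(100):
    # Stand-in for a newly arrived batch of rows.
    new_x = rng.random((100, 1))
    new_y = 2 * new_x.ravel() + 1 + 0.1 * rng.standard_normal(100)

    # Each call updates the model in place using only the new batch.
    regr.partial_fit(new_x, new_y)

print(regr.coef_, regr.intercept_)  # should be close to [2.] and [1.]

Unlike the exact update in the lr function above, SGD only approximates the least-squares solution, so the coefficients will be close but not identical.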