Median-based linear regression in Python

python pandas numpy scipy linear-regression


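I would like to perform a one-dimensional linear regression by minimizing the median absolute error.

I initially assumed this would be a fairly standard use case, but a quick search surprisingly showed that all the regression and interpolation functions use the mean squared error.

So my question is: is there a function that performs median-error-based linear regression in one dimension?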

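As already pointed out in the comments, even though what you are asking for is well-defined in itself, the right approach to solving it depends on the properties of your model. Let's see why, let's see how far a general-purpose optimization approach gets you, and let's see how a bit of math can simplify the problem. A copy-pastable solution is included at the bottom.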

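First of all, least squares fitting is "easier" than what you are trying to do, in the sense that specialized algorithms apply; for example, SciPy's leastsq uses the Levenberg-Marquardt algorithm, which assumes that your optimization objective is a sum of squares. Of course, in the special case of linear regression, the problem can also be solved analytically.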
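To make the "solved analytically" remark concrete, here is a minimal sketch of an ordinary least squares line fit done with plain linear algebra. The synthetic data and the variable names are assumptions chosen for illustration; they are not part of the question.

```python
import numpy as np

# Synthetic data (assumed for illustration): a noisy line with slope 2, intercept 3
rng = np.random.default_rng(0)
xs = np.arange(200)
ys = 2*xs + 3 + rng.normal(0, 30, xs.size)

# Least squares solved directly: columns of A are [1, x]
A = np.column_stack([np.ones(xs.size), xs])
theta_ols, *_ = np.linalg.lstsq(A, ys, rcond=None)  # [intercept, slope]

# np.polyfit solves the same problem; coefficients come highest degree first
slope, intercept = np.polyfit(xs, ys, 1)
```

No iterative optimizer or initial guess is involved here, which is exactly what is lost when the objective becomes a median.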

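Besides the practical advantages, least squares linear regression can also be justified theoretically: if the residuals of your observations are independent and normally distributed (which you may be able to justify if you find that the central limit theorem applies to your model), then the maximum likelihood estimate of your model parameters is the one obtained by least squares. Similarly, the parameters minimizing the mean absolute error are the maximum likelihood estimates for Laplace-distributed residuals. Now, what you are trying to do will have an advantage over ordinary least squares if you know beforehand that your data is so dirty that the assumption of normal residuals will fail, but even then you may be able to justify other assumptions that affect the choice of objective function, so I'm curious how you ended up in that situation.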
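The link between the noise distribution and the objective can be written out in two lines. The notation here (residuals $r_i$, noise scale $\sigma$ for the normal case and $b$ for the Laplace case) is introduced only for this sketch.

```latex
% Model: y_i = \theta_0 + \theta_1 x_i + \varepsilon_i,
% residual r_i(\theta) = y_i - \theta_0 - \theta_1 x_i

% Independent normal noise, \varepsilon_i \sim \mathcal{N}(0, \sigma^2):
-\log L(\theta) = \frac{n}{2}\log(2\pi\sigma^2)
                  + \frac{1}{2\sigma^2}\sum_{i=1}^{n} r_i(\theta)^2

% Independent Laplace noise with scale b:
-\log L(\theta) = n\log(2b) + \frac{1}{b}\sum_{i=1}^{n} \lvert r_i(\theta)\rvert
```

In both cases only the sum depends on $\theta$, so maximizing the likelihood amounts to minimizing the sum of squared residuals and the sum of absolute residuals, respectively.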

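Using numerical methods

With that out of the way, some general remarks apply. First of all, note that SciPy does provide general-purpose optimizers that can be applied directly to your case. As an example, let's see how to apply minimize in the univariate case.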

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

# Generate some data
np.random.seed(0)
n = 200
xs = np.arange(n)
ys = 2*xs + 3 + np.random.normal(0, 30, n)

# Define the optimization objective
def f(theta):
    return np.median(np.abs(theta[1]*xs + theta[0] - ys))

# Provide a poor, but not terrible, initial guess to challenge SciPy a bit
initial_theta = [10, 5]
res = minimize(f, initial_theta)

# Plot the results
plt.scatter(xs, ys, s=1)
plt.plot(res.x[1]*xs + res.x[0])

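So, things could certainly be worse. As pointed out by @sascha in the comments, the non-smoothness of the objective quickly becomes an issue, but, again depending on what exactly your model looks like, you may find yourself looking at something well-behaved enough to save you.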

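If your parameter space is low-dimensional, simply plotting the optimization landscape can give you an intuition for how robust the optimization is.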

theta0s = np.linspace(-100, 100, 200)
theta1s = np.linspace(-5, 5, 200)
costs = [[f([theta0, theta1]) for theta0 in theta0s] for theta1 in theta1s]
plt.contour(theta0s, theta1s, costs, 50)
plt.xlabel('$\\theta_0$')
plt.ylabel('$\\theta_1$')
plt.colorbar()

In the particular example above, general-purpose optimization algorithms fail if the initial guess is far off.

initial_theta = [10, 10000]
res = minimize(f, initial_theta)
plt.scatter(xs, ys, s=1)
plt.plot(res.x[1]*xs + res.x[0])

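Note also that many of SciPy's algorithms benefit from being given the Jacobian of the objective. Even though your objective is not differentiable everywhere, your residuals may well be, again depending on what you are trying to optimize. As a result, your objective may be differentiable almost everywhere, and you can provide its derivatives (for example, the derivative of a median becomes the derivative of the function whose value is the median).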
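As a rough illustration of that last remark, the gradient of the median absolute residual can be taken from the single residual that attains the median. This is a sketch under stated assumptions: the function names are made up for this example, the number of points is odd so that np.median does not interpolate, and ties are ignored.

```python
import numpy as np

def median_abs_error(theta, xs, ys):
    """Median absolute residual of the line theta[0] + theta[1]*x."""
    return np.median(np.abs(theta[1]*xs + theta[0] - ys))

def median_abs_error_grad(theta, xs, ys):
    """Gradient of median_abs_error where it exists (assumes odd len(xs))."""
    residuals = theta[1]*xs + theta[0] - ys
    # With an odd number of points, one residual attains the median,
    # so the gradient of the median is the gradient of that |residual|.
    i = np.argsort(np.abs(residuals))[len(residuals)//2]
    sign = np.sign(residuals[i])
    return np.array([sign, sign*xs[i]])

# Synthetic data (assumed for illustration)
rng = np.random.default_rng(0)
xs = np.arange(201)
ys = 2*xs + 3 + rng.normal(0, 30, xs.size)
theta = np.array([10.0, 5.0])

grad = median_abs_error_grad(theta, xs, ys)

# Central finite differences as a sanity check of the analytic gradient
eps = 1e-6
numeric = np.array([
    (median_abs_error(theta + eps*e, xs, ys)
     - median_abs_error(theta - eps*e, xs, ys)) / (2*eps)
    for e in np.eye(2)
])
```

A function of this shape can be passed to minimize through its jac argument, with xs and ys supplied via args.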

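In our case, providing the Jacobian does not appear to help much, as the following example illustrates; here, we have increased the variance of the residuals just enough to make the whole thing fall apart.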

np.random.seed(0)
n = 201
xs = np.arange(n)
ys = 2*xs + 3 + np.random.normal(0, 50, n)
initial_theta = [10, 5]
res = minimize(f, initial_theta)
plt.scatter(xs, ys, s=1)
plt.plot(res.x[1]*xs + res.x[0])

In this example, we find ourselves stuck among the singularities of the objective.

theta = res.x
delta = 0.01
theta0s = np.linspace(theta[0]-delta, theta[0]+delta, 200)
theta1s = np.linspace(theta[1]-delta, theta[1]+delta, 200)
costs = [[f([theta0, theta1]) for theta0 in theta0s] for theta1 in theta1s]

plt.contour(theta0s, theta1s, costs, 100)
plt.xlabel('$\\theta_0$')
plt.ylabel('$\\theta_1$')
plt.colorbar()

Moreover, this is the mess we find around the minimum:

theta0s = np.linspace(-20, 30, 300)
theta1s = np.linspace(1, 3, 300)
costs = [[f([theta0, theta1]) for theta0 in theta0s] for theta1 in theta1s]

plt.contour(theta0s, theta1s, costs, 50)
plt.xlabel('$\\theta_0$')
plt.ylabel('$\\theta_1$')
plt.colorbar()

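If you find yourself here, a different approach will probably be necessary. Approaches that still rely on general-purpose optimization include, as @sascha mentions, replacing the objective with something simpler. Another simple one is to run the optimization from a variety of different initial inputs: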

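# fder is not defined in the text above; this gradient of f is an assumed
# definition, valid where a single residual attains the median (n is odd here).
def fder(theta):
    residuals = theta[1]*xs + theta[0] - ys
    i = np.argsort(np.abs(residuals))[len(residuals)//2]
    sign = np.sign(residuals[i])
    return np.array([sign, sign*xs[i]])

# Restart the local optimizer from random initial guesses and keep the best result
min_f = float('inf')
for _ in range(100):
    initial_theta = np.random.uniform(-10, 10, 2)
    res = minimize(f, initial_theta, jac=fder)
    if res.fun < min_f:
        min_f = res.fun
        theta = res.x
plt.scatter(xs, ys, s=1)
plt.plot(theta[1]*xs + theta[0])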
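The remaining steps rely on the one-dimensional version of the problem: finding the constant c that minimizes the median of |x_i - c|. For an odd number of points this has a direct solution, namely the midpoint of the narrowest window that contains just over half of the sorted points. The sketch below uses the helper name least_median_abs_1d that appears in the code further down; the small test array is an assumption made for illustration.

```python
import numpy as np

def least_median_abs_1d(x):
    """Return c minimizing median(|x - c|); assumes len(x) is odd."""
    X = np.sort(x)
    h = len(X)//2
    # Width of every window of h + 1 consecutive sorted points
    diffs = X[h:] - X[:h+1]
    min_i = np.argmin(diffs)
    # Midpoint of the narrowest such window
    return X[min_i] + diffs[min_i]/2

x = np.array([0.0, 1.0, 2.0, 10.0, 11.0])
c = least_median_abs_1d(x)
best = np.median(np.abs(x - c))
```

Here the narrowest window of three points is [0, 2], so c is its midpoint and the two far-away points have no influence on the result.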

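Now, the trick is that for fixed theta1, the value of theta0 that minimizes f(theta0, theta1) is obtained by applying the one-dimensional minimizer least_median_abs_1d (the helper that also appears in the complete solution at the end) to ys - theta1*xs. In other words, we have reduced the problem to minimizing a function of a single variable, called g below.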

def best_theta0(theta1):
    # Here we use the data points defined above
    rs = ys - theta1*xs
    return least_median_abs_1d(rs)

def g(theta1):
    return f([best_theta0(theta1), theta1])

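While this is likely to be much more tractable than the two-dimensional optimization problem above, we are not completely out of the woods yet, as this new function has local minima of its own: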

theta1s = np.linspace(0, 3, 500)
plt.plot(theta1s, [g(theta1) for theta1 in theta1s])

In my limited testing, basinhopping seemed to be able to find the minimum consistently.

from scipy.optimize import basinhopping
res = basinhopping(g, -10)
print(res.x)  # prints [ 1.72529806]

At this point, we can wrap everything up and check that the result looks reasonable:

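import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import basinhopping

def least_median(xs, ys, guess_theta1):
    def least_median_abs_1d(x: np.ndarray):
        # Midpoint of the narrowest window of h + 1 sorted points.
        # The slicing below assumes an odd number of points.
        X = np.sort(x)
        h = len(X)//2
        diffs = X[h:] - X[:h+1]
        min_i = np.argmin(diffs)
        return diffs[min_i]/2 + X[min_i]

    def best_median(theta1):
        # Smallest median absolute error achievable for this slope
        rs = ys - theta1*xs
        theta0 = least_median_abs_1d(rs)
        return np.median(np.abs(rs - theta0))

    # Global search over the slope only; the intercept follows from it
    res = basinhopping(best_median, guess_theta1)
    theta1 = res.x[0]
    theta0 = least_median_abs_1d(ys - theta1*xs)
    return np.array([theta0, theta1]), res.fun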

theta, med = least_median(xs, ys, 10)
# Use different colors for the sets of points within and outside the median error
active = ((ys < theta[1]*xs + theta[0] + med) & (ys > theta[1]*xs + theta[0] - med))
not_active = np.logical_not(active)
plt.plot(xs[not_active], ys[not_active], 'g.')
plt.plot(xs[active], ys[active], 'r.')
plt.plot(xs, theta[1]*xs + theta[0], 'b')
plt.plot(xs, theta[1]*xs + theta[0] + med, 'b--')
plt.plot(xs, theta[1]*xs + theta[0] - med, 'b--')
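As an independent cross-check of such a result, one could also search the slope by brute force on a grid instead of relying on a stochastic optimizer. The following is only a sketch: the helper names, the grid, and the synthetic data (an exact line with a large share of points pushed far off it) are assumptions for illustration, and the number of points is again assumed to be odd.

```python
import numpy as np

def shortest_half_center(x):
    """Midpoint and half-width of the narrowest window of h + 1 sorted points."""
    X = np.sort(x)
    h = len(X)//2
    diffs = X[h:] - X[:h+1]
    i = np.argmin(diffs)
    return X[i] + diffs[i]/2, diffs[i]/2

def least_median_grid(xs, ys, slopes):
    """Brute-force search over candidate slopes (hypothetical helper)."""
    best = (np.inf, None, None)
    for theta1 in slopes:
        # For a fixed slope, the best intercept and its median error
        theta0, med = shortest_half_center(ys - theta1*xs)
        if med < best[0]:
            best = (med, theta0, theta1)
    med, theta0, theta1 = best
    return np.array([theta0, theta1]), med

# Synthetic check: an exact line with 40 of 101 points moved far above it
rng = np.random.default_rng(0)
xs = np.arange(101)
ys = 2.0*xs + 3.0
outliers = rng.choice(xs.size, 40, replace=False)
ys[outliers] += rng.uniform(50, 200, outliers.size)

theta, med = least_median_grid(xs, ys, np.linspace(0, 4, 401))
```

The grid resolution limits the accuracy of the slope, so this is better suited to validating or seeding the basinhopping search than to replacing it.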

