Python: fitting a curve to the edge of a scatter plot (tags: python, pandas, scipy, curve-fitting)

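I'm trying to fit a curve to the edge of a scatter plot.

I've done a fit with the (simplified) code below. It cuts the DataFrame into small vertical strips of width width and then finds the minimum in each strip, ignoring NaNs. (The function is monotonically decreasing.)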

def func(val):
    """Return some function of 'val'."""
    return val * 2

for i in range(0, max_val, width):
    _df = df[(df.val > i) & (df.val ...


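I then fit the strip minima with scipy.optimize.curve_fit. My question is: is there a more natural or Pythonic way to do this fit? And is there any way to improve the accuracy (for example, by giving more weight to regions of the scatter plot with a higher point density)?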
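I found this problem very interesting, so I decided to give it a try. I can't say whether it is Pythonic or natural, but I think I've found a more accurate way of fitting an edge to a data set like yours, one that uses the information from every point.
First, let's generate some random data that looks like what you have shown. This part can easily be skipped; I'm posting it only so that the code is complete and reproducible. I used two bivariate normal distributions to simulate the overdensities and sprinkled a layer of uniformly distributed random points over them. These are then added to a curve equation similar to yours, and everything below the curve is cut off.

Here is the code snippet that generates the data and the figure: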
import numpy as np

x_res = 1000
x_data = np.linspace(0, 2000, x_res)

# true parameters and a function that takes them
true_pars = [80, 70, -5]
model = lambda x, a, b, c: (a / np.sqrt(x + b) + c)
y_truth = model(x_data, *true_pars)

mu_prim, mu_sec = [1750, 0], [450, 1.5]
cov_prim = [[300**2, 0     ],
            [     0, 0.2**2]]
# covariance matrix of the second dist is trickier
cov_sec = [[200**2, -1     ],
           [    -1,  1.0**2]]
prim = np.random.multivariate_normal(mu_prim, cov_prim, x_res*10).T
sec = np.random.multivariate_normal(mu_sec, cov_sec, x_res*1).T
uni = np.vstack([x_data, np.random.rand(x_res) * 7])

# censoring points that will end up below the curve
prim = prim[:, prim[1] > 0]
sec = sec[:, sec[1] > 0]

# rescaling to data
for dset in [uni, sec, prim]:
    dset[1] += model(dset[0], *true_pars)

# this code block plots the mock data together with the true edge:
import matplotlib.pyplot as plt
plt.figure()
plt.plot(prim[0], prim[1], '.', alpha=0.1, label = '2D Gaussian #1')
plt.plot(sec[0], sec[1], '.', alpha=0.5, label = '2D Gaussian #2')
plt.plot(uni[0], uni[1], '.', alpha=0.5, label = 'Uniform')
plt.plot(x_data, y_truth, 'k:', lw = 3, zorder = 1.0, label = 'True edge')
plt.xlim(0, 2000)
plt.ylim(-8, 6)
plt.legend(loc = 'lower left')
plt.show()

# mashing it all together
dset = np.concatenate([prim, sec, uni], axis = 1)

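Now that we have the data and the model, we can brainstorm how to fit the edge of the point distribution. Common regression methods such as nonlinear least squares (scipy.optimize.curve_fit) take the data values y and optimize the free parameters of the model so that the residual between y and model(x) is minimal. Nonlinear least squares is an iterative process that wiggles the curve parameters at every step to improve the fit. Clearly this is not what we want here: an ordinary fit settles in the middle of the scatter, whereas we want the minimization routine to take us as far away from that best-fit curve as possible (but not too far).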
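To illustrate why a plain least-squares fit is not an edge fit, here is a small sketch on made-up data: a uniform band of points sitting on top of a known edge. The data, the seed and the starting guess are assumptions for the demonstration only.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    return a / np.sqrt(x + b) + c

# Made-up scatter: every point lies between the true edge and 7 units above it.
rng = np.random.default_rng(1)
true_pars = (80.0, 70.0, -5.0)
x = rng.uniform(0, 2000, 4000)
y = model(x, *true_pars) + rng.uniform(0, 7, x.size)

# Ordinary nonlinear least squares on all points.
popt, _ = curve_fit(model, x, y, p0=(100, 100, 0), maxfev=5000)

# How far above the true edge the fitted curve sits, on average,
# and what share of the points ends up below the fitted curve.
offset = np.mean(model(x, *popt) - model(x, *true_pars))
frac_below = np.mean(y < model(x, *popt))
print(offset, frac_below)
```

The fitted curve should land roughly in the middle of the band, with about half of the points below it, rather than on the lower edge.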
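So instead, consider the following function. Rather than simply returning the residual, it also "flips" the points above the curve at every step of the iteration and factors them in. This way there are effectively always more points below the curve than above it, so the curve shifts downward with every iteration! Once the lowest points are reached, the minimum of the function is found, and so is the edge of the scatter. Of course, this method assumes there are no outliers below the curve, but your figure doesn't seem to suffer much from them.
Here are the functions implementing this idea: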
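def get_flipped(y_data, y_model):
    # Negative offset (y_model - y_data) for points above the curve,
    # zero for points on or below it.
    flipped = y_model - y_data
    flipped[flipped > 0] = 0
    return flipped

def flipped_resid(pars, x, y):
    """
    For every iteration, everything above the currently proposed
    curve is moved down onto it (zero residual), so only the points
    below the curve pull on the fit and the next iterations
    progressively shift the curve downwards.
    """
    y_model = model(x, *pars)
    flipped = get_flipped(y, y_model)
    # Points above the curve cancel to 0; points below keep (y - y_model)**2.
    # Note that leastsq squares the returned values once more.
    resid = np.square(y + flipped - y_model)
    # print(pars, resid.sum())  # uncomment to check the iteration parameters
    return np.nan_to_num(resid)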
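A quick worked check of what the arithmetic in get_flipped and flipped_resid does, on three made-up points around a flat curve at y = 0. The helper names below are hypothetical; one_sided_resid is just an equivalent way of writing the same quantity.

```python
import numpy as np

def flipped_resid_values(y_data, y_model):
    # Same arithmetic as get_flipped / flipped_resid above, without the model call.
    flipped = y_model - y_data
    flipped[flipped > 0] = 0
    return np.square(y_data + flipped - y_model)

def one_sided_resid(y_data, y_model):
    # Squared distance for points below the curve, zero for points above it.
    return np.square(np.minimum(y_data - y_model, 0.0))

y_model = np.zeros(3)                  # flat curve at y = 0
y_data = np.array([-1.0, 0.5, 2.0])    # one point below, two above

print(flipped_resid_values(y_data, y_model))  # [1. 0. 0.]
print(one_sided_resid(y_data, y_model))       # [1. 0. 0.]
```

Only the point below the curve contributes, which is what pushes the curve downward during the fit.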

Let's see how this performs on the data above:
# plotting the mock data
plt.plot(dset[0], dset[1], '.', alpha=0.2, label = 'Test data')

# mask bad data (we accidentally generated some NaN values)
gmask = np.isfinite(dset[1])
dset = dset[np.vstack([gmask, gmask])].reshape((2, -1))

from scipy.optimize import leastsq
guesses =[100, 100, 0]
fit_pars, flag = leastsq(func = flipped_resid, x0 = guesses,
                         args = (dset[0], dset[1]))
# plot the fit:
y_fit = model(x_data, *fit_pars)
y_guess = model(x_data, *guesses)
plt.plot(x_data, y_fit, 'r-', zorder = 0.9, label = 'Edge')
plt.plot(x_data, y_guess, 'g-', zorder = 0.9, label = 'Guess')
plt.legend(loc = 'lower left')
plt.show()

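The most important part above is the call to the leastsq function. Be very careful with the initial guesses: if the guess does not fall onto the scattered points, the model may not converge properly. After putting in an appropriate guess...

Voilà! The fitted edge matches the true one.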
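One way to sanity-check an initial guess before calling leastsq is sketched below. It relies on the fact that the residual defined earlier is zero for every point above the curve, so a guess lying entirely below the data gives the optimizer nothing to follow. The helper fraction_below and the test data are hypothetical.

```python
import numpy as np

def model(x, a, b, c):
    return a / np.sqrt(x + b) + c

def fraction_below(pars, x, y):
    """Hypothetical helper: share of finite points lying below the guess curve."""
    y_model = model(x, *pars)
    ok = np.isfinite(y_model) & np.isfinite(y)
    return np.mean(y[ok] < y_model[ok])

# Made-up scatter sitting on top of a known edge.
rng = np.random.default_rng(2)
x = rng.uniform(0, 2000, 2000)
y = model(x, 80, 70, -5) + rng.uniform(0, 7, x.size)

good_guess = (100, 100, 0)   # passes through the scatter
low_guess = (80, 70, -10)    # lies entirely below the scatter

f_good = fraction_below(good_guess, x, y)
f_low = fraction_below(low_guess, x, y)
print(f_good)   # strictly between 0 and 1: some points pull the curve down
print(f_low)    # 0.0: every residual is zero, so the fit cannot move
```

A guess with a fraction well above zero starts inside the scatter, which is the situation the method needs.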
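This is an interesting problem that I'm also trying to solve (and implement in Python).
I think that, rather than taking the min, it is better to take the mean of the k lowest (or k highest, depending on the problem) data points in each strip and fit those means (one should also check that the fitted parameters are robust with respect to k).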
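The k-lowest idea could be sketched roughly as follows. The data, the strip edges, the model form and the helper k_lowest_means are all assumptions made for illustration; the loop over k is a simple robustness check.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    return a / np.sqrt(x + b) + c

def k_lowest_means(x, y, edges, k):
    """Hypothetical helper: mean x and mean y of the k lowest points per strip."""
    xs, ys = [], []
    strip = np.digitize(x, edges)          # strip index of every point
    for i in range(1, len(edges)):
        sel = strip == i
        if sel.sum() < k:                  # skip strips with too few points
            continue
        order = np.argsort(y[sel])[:k]     # indices of the k lowest points
        xs.append(x[sel][order].mean())
        ys.append(y[sel][order].mean())
    return np.array(xs), np.array(ys)

# Made-up scatter sitting on top of a known edge.
rng = np.random.default_rng(3)
true_pars = (80.0, 70.0, -5.0)
x = rng.uniform(0, 2000, 5000)
y = model(x, *true_pars) + rng.uniform(0, 7, x.size)
edges = np.arange(0, 2001, 50)
grid = np.linspace(50, 1950, 200)

# Fit the strip means for several k and compare each fit with the true edge.
fits = {}
for k in (1, 3, 5, 10):
    xk, yk = k_lowest_means(x, y, edges, k)
    popt, _ = curve_fit(model, xk, yk, p0=(100, 100, 0), maxfev=5000)
    fits[k] = model(grid, *popt)
    print(k, np.abs(fits[k] - model(grid, *true_pars)).max())
```

Averaging more points reduces the noise of each strip value but lifts it further above the true edge, so the choice of k is a trade-off worth checking on the real data.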
For an example, see this [link].
This work deserves a thousand upvotes, but I can only upvote it once. Thank you very much, @Vlas.