Python: efficient way to find the orthogonal vectors connecting many points (x, y) to the nearest point on a known function y(x)
I have a data set consisting of a long array of x values and an equally long array of y values. For each (x, y) pair, I want to find the nearest point on a known function y(x).

In principle I could loop over every pair and run a minimization, e.g. with scipy.optimize.cobyla, but looping in Python is slow. SciPy's odr package looks interesting, but I can't figure out how to make it simply return the orthogonal vectors without minimizing the whole thing (setting the maximum number of iterations "maxit" to zero does not give me what I want).

Is there a simple way to get this done with the speed of numpy arrays?

The answer is simple: don't loop over the points in a Python list; let the loop happen inside numpy's own computation:
import numpy as np

# x and y are your numpy arrays of point coords
x = np.array([1, 2])
y = np.array([3, 4])

# this is your "y(x)" function
def f(z):
    return z**2

xmin = x.min()
xmax = x.max()
step = 0.01  # choose your step at the precision you want

# find distances to every point
zpoints = np.arange(xmin, xmax, step)
distances_squared = np.array([(y - f(z))**2 + (x - z)**2 for z in zpoints])

# find z coords of closest points
zmin = zpoints[distances_squared.argmin(axis=0)]
fmin = np.array([f(z) for z in zmin])

for i in range(len(x)):
    print("point on the curve {},{} is closest to {},{}".format(zmin[i], fmin[i], x[i], y[i]))
point on the curve 1.6700000000000006,2.788900000000002 is closest to 1,3
point on the curve 1.990000000000009,3.9601000000000033 is closest to 2,4
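As a side note, the Python-level list comprehension over zpoints can itself be replaced by NumPy broadcasting; a minimal sketch of the same brute-force search (same arrays and step as above, with no Python loop at all):

```python
import numpy as np

def f(z):
    return z**2

x = np.array([1, 2])
y = np.array([3, 4])
zpoints = np.arange(x.min(), x.max(), 0.01)

# Broadcasting builds the full (len(zpoints), len(x)) matrix of squared
# distances in one shot, instead of one row per loop iteration.
d2 = (y - f(zpoints)[:, None])**2 + (x - zpoints[:, None])**2
zmin = zpoints[d2.argmin(axis=0)]
fmin = f(zmin)
```

This gives the same zmin and fmin as the loop version, just without the per-z Python overhead.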
There is a way to speed up Hennadii Madan's approach by having numpy do the looping instead of Python. As usual, this comes at the cost of extra RAM.

Below is the function I now use for the 2d case. A nice feature is that it is symmetric: you can swap the two data sets and the computation time will be the same.
import numpy as _n

def find_nearests_2d(x1, y1, x2, y2):
    """
    Given two data sets d1 = (x1, y1) and d2 = (x2, y2), return the x,y pairs
    from d2 that are closest to each pair from d1, the difference vectors,
    the squared distances, and the d2 indices of these closest points.

    Parameters
    ----------
    x1
        1D array of x-values for data set 1.
    y1
        1D array of y-values for data set 1 (must match size of x1).
    x2
        1D array of x-values for data set 2.
    y2
        1D array of y-values for data set 2 (must match size of x2).

    Returns
    -------
    x2mins
        1D array of minimum-distance x-values from data set 2. One value for each x1.
    y2mins
        1D array of minimum-distance y-values from data set 2. One value for each y1.
    xdiffs
        1D array of differences in x. One value for each x1.
    ydiffs
        1D array of differences in y. One value for each y1.
    d2s
        1D array of squared distances. One value for each point in data set 1.
    indices
        Indices of each minimum-distance point in data set 2. One for each point in
        data set 1.
    """
    # Generate every combination of points for subtracting
    x1s, x2s = _n.meshgrid(x1, x2)
    y1s, y2s = _n.meshgrid(y1, y2)

    # Calculate all the differences
    dx = x1s - x2s
    dy = y1s - y2s
    d2 = dx**2 + dy**2

    # Find the index of the minimum for each data point
    n = _n.argmin(d2, 0)

    # Index for extracting from the meshgrids
    m = range(len(n))

    return x2s[n, m], y2s[n, m], dx[n, m], dy[n, m], d2[n, m], n
This function can then also be used to quickly estimate the distances between x,y pairs and a function:
def find_nearests_function(x, y, f, *args, fpoints=1000):
    """
    Takes a data set (arrays of x and y values), and a function f(x, *args),
    then estimates the points on the curve f(x) that are closest to each of
    the data set's x,y pairs.

    Parameters
    ----------
    x
        1D array of x-values for the data set.
    y
        1D array of y-values for the data set (must match size of x).
    f
        A function of the form f(x, *args) with optional additional arguments.
    *args
        Optional additional arguments to send to f (after argument x).
    fpoints=1000
        Number of evenly-spaced points to search in the x-domain (automatically
        the maximum possible range).
    """
    # Make sure everything is a numpy array
    x = _n.array(x)
    y = _n.array(y)

    # First figure out the range we need for f. Since the function is single-
    # valued, we can put bounds on the x-range: for each point, calculate the
    # y-distance, and subtract / add this to the x-values
    dys = _n.abs(f(x, *args) - y)
    xmin = min(x - dys)
    xmax = max(x + dys)

    # Get "dense" function arrays
    xf = _n.linspace(xmin, xmax, fpoints)
    yf = f(xf, *args)

    # Find all the minima
    xfs, yfs, dxs, dys, d2s, n = find_nearests_2d(x, y, xf, yf)

    # Return this info plus the function arrays used
    return xfs, yfs, dxs, dys, d2s, n, xf, yf
If this is part of an orthogonal distance regression (as it is in my case), the differences dx and dy can easily be scaled by error-bar data sets without much overhead, so that the returned distances are studentized (unitless) residuals.
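To illustrate that scaling (with made-up difference vectors and error bars, not values produced by the code above), the studentization is just an element-wise division before taking the norm:

```python
import numpy as np

# Hypothetical difference vectors from the nearest-point search, plus
# per-point error bars for x and y (all values invented for illustration).
dx = np.array([0.30, -0.10])
dy = np.array([0.05, 0.20])
ex = np.array([0.10, 0.10])
ey = np.array([0.05, 0.05])

# Scale each component by its error bar; the result is unitless.
studentized_residuals = np.sqrt((dx / ex)**2 + (dy / ey)**2)
```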
Ultimately, this "search uniformly everywhere" technique will only get you close, and it will fail if the function isn't reasonably smooth within the range of the x data.
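One way to get from "close" to an accurate answer, assuming f is smooth near the coarse solution (this refinement step is my own addition, not part of the answer above), is to polish each grid result with a bounded local minimization:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def refine_nearest(xi, yi, f, z0, half_width):
    """Polish a coarse estimate z0 of the curve point closest to (xi, yi)
    by a bounded local minimization of the squared distance."""
    res = minimize_scalar(
        lambda z: (z - xi)**2 + (f(z) - yi)**2,
        bounds=(z0 - half_width, z0 + half_width),
        method="bounded",
    )
    return res.x

# Polish the grid estimate z ~ 1.67 for the point (1, 3) on the curve z**2.
z = refine_nearest(1.0, 3.0, lambda z: z**2, 1.67, 0.05)
```

The half_width only needs to exceed the grid step, so the bounded search stays cheap and cannot wander off to a different local minimum.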
Quick test code:
import numpy as _n

x = [1, 2, 5]
y = [1, -1, 1]
def f(x): return _n.cos(x)

fxmin, fymin, dxmin, dymin, d2min, n, xf, yf = find_nearests_function(x, y, f)

import pylab
pylab.plot(x, y, marker='o', ls='', color='m', label='input points')
pylab.plot(xf, yf, color='b', label='function')
pylab.plot(fxmin, fymin, marker='o', ls='', color='r', label='nearest points')
pylab.legend()
pylab.show()
This produces a plot of the input points, the function curve, and the nearest points found on it.
Thanks for the reply! It's an interesting idea to brute-force the search with numpy. If I understand correctly, it will scale in time and memory like (number of data points) x (number of function points), which could get large (my function curve has many more points than my data set). I guess this could be sped up by automating slicing on zpoints, comparing only against the relevant points of the function curve. I'll play with this and see whether it beats scipy constrained minimization.

@Jack, premature optimization is the root of all evil. Just run the code and see whether you care about the potential speed difference.

Hey, thanks. I'm building a general-purpose library rather than solving a single problem. I'll think it over (partly because it's fun to do).

Well, the point about optimization still stands: unless you profile on some "realistic" data and identify the bottleneck, there's no point discussing speed-up strategies. The main concern here is the cost of function evaluation. In any case, without regularity assumptions about the curve (at least differentiability), you're unlikely to do much better than my approach; if you do have them, a gradient-based optimizer might come in handy.

There actually is a way to get a big boost (at the cost of RAM), based on this simple approach, by using numpy's loops (written in C) instead of Python loops (slow). I'll post it shortly.
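Regarding the memory concern raised here ((number of data points) x (number of function points)): one simple way to cap it is to process the data points in fixed-size chunks. A sketch (the chunking scheme and function name are my own, not from the answers above):

```python
import numpy as np

def nearest_on_curve_chunked(x, y, xf, yf, chunk=1024):
    """Same brute-force nearest-point search, but processed in chunks of
    data points, so peak memory is roughly chunk * len(xf) floats rather
    than len(x) * len(xf)."""
    idx = np.empty(len(x), dtype=np.intp)
    for start in range(0, len(x), chunk):
        sl = slice(start, start + chunk)
        # (chunk, len(xf)) distance matrix for this chunk only
        d2 = (x[sl, None] - xf)**2 + (y[sl, None] - yf)**2
        idx[sl] = d2.argmin(axis=1)
    return xf[idx], yf[idx], idx

# Tiny demo with a dense cosine curve
x = np.array([1.0, 2.0, 5.0])
y = np.array([1.0, -1.0, 1.0])
xf = np.linspace(0.0, 6.0, 1000)
yf = np.cos(xf)
xn, yn, idx = nearest_on_curve_chunked(x, y, xf, yf, chunk=2)
```

The chunk size trades RAM for the (small) overhead of the outer Python loop, which now runs len(x)/chunk times instead of len(x) times.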