Python: choosing n better-distributed points from a bunch of points (python, numpy, scipy)


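I have a numpy array of points in the XY plane.

I want to select n of these points (say 100) that are better distributed. That is, I want the density of points to be roughly constant everywhere.

Is there a Pythonic way or a numpy/scipy function to do this?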

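Unless you give a specific criterion that defines "better distributed", we can't give a definite answer.

The phrase "constant density at any point" is also misleading, because you have to specify the empirical method used to compute density. Are you approximating it on a grid? If so, the grid size will matter, and points near the boundary won't be represented correctly.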

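A different approach might be as follows:

1. Compute the distance matrix between all pairs of points.

2. Treating this distance matrix as a weighted network, compute a centrality measure for each point in the data.

3. Order the points by centrality measure in descending order and keep the first 100.

4. Repeat steps 1-3, possibly using a different notion of "distance" between points and a different centrality measure.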

Many of these functions are provided directly by SciPy, NetworkX, and scikits.learn, and will work directly on NumPy arrays.
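The steps above can be sketched with SciPy alone. This is a minimal illustration, not the answerer's code: the choice of closeness centrality is an assumption made for the example, and the names `top_by_closeness`, `pts`, and `keep_idx` are hypothetical. Because the weights are Euclidean distances, the direct edge between two points is already the shortest path, so closeness reduces to a row sum of the distance matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def top_by_closeness(points, n_keep):
    """Rank points by a closeness-style centrality and keep the top n_keep."""
    # Step 1: full pairwise Euclidean distance matrix (m x m, so O(m^2) memory).
    dist = squareform(pdist(points))
    # Step 2: closeness centrality on the complete weighted graph:
    # (m - 1) divided by the summed distance to all other points.
    m = len(points)
    closeness = (m - 1) / dist.sum(axis=1)
    # Step 3: sort by centrality in descending order and keep the first n_keep.
    order = np.argsort(closeness)[::-1]
    return order[:n_keep], closeness


rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 2))  # hypothetical sample data
keep_idx, closeness = top_by_closeness(pts, 100)
subset = pts[keep_idx]
```

With this particular measure the highest-ranked points are the ones closest to everything else, so step 4 (trying other distances and other measures) matters a great deal for what the final subset looks like.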

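If you are truly committed to thinking about this in terms of regular spacing and grid density, you could look at quasi-Monte Carlo (QMC) methods. In particular, you could try to compute the convex hull of the set of points and then apply a QMC technique to sample regularly from anywhere within that convex hull. But again, this privileges the exterior of the region, which should be sampled far less than the interior.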
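A rough sketch of the convex-hull idea, assuming SciPy 1.7 or newer (for `scipy.stats.qmc`). The Halton sequence, the rejection step against the bounding box, and the final snap to the nearest observed point are all choices made for this example; the function name `qmc_in_hull` is hypothetical.

```python
import numpy as np
from scipy.spatial import Delaunay, cKDTree
from scipy.stats import qmc


def qmc_in_hull(points, n_samples, seed=0):
    """Return n_samples low-discrepancy locations inside the convex hull."""
    # A Delaunay triangulation covers exactly the convex hull, so
    # find_simplex(...) >= 0 works as a point-in-hull test.
    tri = Delaunay(points)
    lo, hi = points.min(axis=0), points.max(axis=0)
    sampler = qmc.Halton(d=points.shape[1], seed=seed)
    kept = np.empty((0, points.shape[1]))
    while len(kept) < n_samples:
        # Draw a batch in the unit square and stretch it to the bounding box.
        batch = qmc.scale(sampler.random(n_samples), lo, hi)
        inside = tri.find_simplex(batch) >= 0
        kept = np.vstack([kept, batch[inside]])
    return kept[:n_samples]


rng = np.random.default_rng(0)
pts = rng.normal(size=(5000, 2))  # hypothetical sample data
targets = qmc_in_hull(pts, 100)

# Optional: replace each target location with its nearest observed point.
# Two targets can share a nearest point, so this may return fewer than 100.
_, nearest = cKDTree(pts).query(targets)
subset = pts[np.unique(nearest)]
```

The targets are evenly spread over the whole hull, including its sparsely populated outer parts, which is the caveat raised above.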

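Another interesting approach would be to run the K-means algorithm on the scattered data with a fixed number of clusters, K=100. After the algorithm converges, you will have 100 points in your space (the mean of each cluster). You could repeat this several times with different random starting points for the cluster means, and then sample from that larger set of possible means. Since your data do not appear to cluster naturally into 100 components, the convergence of this approach won't be very good, and it may require running the algorithm for a large number of iterations. It also has the drawback that the resulting 100 points are not necessarily points from the observed data; they are local averages of many points.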
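One way this might look with `scipy.cluster.vq.kmeans2`, assuming SciPy 1.7 or newer for the `seed` argument. The `'++'` initialisation, the iteration count, and the nearest-neighbour snap that maps each mean back onto an observed point are assumptions added for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pts = rng.normal(size=(5000, 2))  # hypothetical sample data

k = 100
# centroids has shape (k, 2); labels assigns every input point to a cluster.
centroids, labels = kmeans2(pts, k, iter=50, minit='++', seed=0)

# The centroids are local averages, not observed points. If observed points
# are required, take the nearest data point to each centroid instead.
# Duplicates are dropped, so the result can contain fewer than k points.
_, nearest = cKDTree(pts).query(centroids)
subset = pts[np.unique(nearest)]
```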

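@EMS is very correct that you should think carefully about exactly what you want.

There are more sophisticated ways to do this (EMS's suggestions are very good!), but a brute-force approach is to bin the points onto a regular, rectangular grid and draw a random point from each bin.

The major downside is that you won't get the number of points you ask for. Instead, you'll get some number smaller than that.

A bit of creative indexing with pandas makes this "gridding" approach quite easy, though you can certainly do it with "pure" numpy as well.

As an example of the simplest possible brute-force grid approach: (There's a lot we could do better, here.)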
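A minimal pure-numpy sketch of the gridding idea is given here as an illustration; it is not the answerer's original code, and the function name `grid_subsample` and its parameters are hypothetical. It assumes 2D points and an equal number of bins along each axis.

```python
import numpy as np


def grid_subsample(points, bins_per_axis, seed=None):
    """Return indices of one random point from each occupied grid cell."""
    rng = np.random.default_rng(seed)
    lo, hi = points.min(axis=0), points.max(axis=0)
    # Integer cell coordinates along each axis; the clip puts points that
    # lie exactly on the upper edge into the last cell.
    cell = np.floor((points - lo) / (hi - lo) * bins_per_axis).astype(int)
    cell = np.clip(cell, 0, bins_per_axis - 1)
    # Collapse the 2D cell coordinates into a single id per cell.
    cell_id = cell[:, 0] * bins_per_axis + cell[:, 1]
    # Shuffle, then keep the first point seen in each occupied cell,
    # which amounts to one uniformly random point per cell.
    order = rng.permutation(len(points))
    _, first = np.unique(cell_id[order], return_index=True)
    return order[first]


rng = np.random.default_rng(0)
pts = rng.normal(size=(100000, 2))  # hypothetical sample data
idx = grid_subsample(pts, 10, seed=0)
subset = pts[idx]
```

A 10 x 10 grid yields at most 100 points; empty cells contribute nothing, which is why the result is usually smaller than the number requested.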

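Loosely based on @EMS's suggestion in a comment, here's another approach.

We'll calculate the density of points using a kernel density estimate, and then use the inverse of the kernel density estimate as the probability that a given point will be chosen.

scipy.stats.gaussian_kde is not optimized for this use case (or for large numbers of points in general). It's the bottleneck here. It's possible to write a more optimized version for this specific use case in several ways (approximations, special-casing the pairwise distances here, etc.). However, that's beyond the scope of this question. Just be aware that for this specific example with 1e5 points, it will take a minute or two to run.

The advantage of this method is that you get the exact number of points that you asked for. The disadvantage is that you are likely to have local clusters of selected points.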

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

total_num = 100000
subset_num = 1000
x, y = np.random.normal(0, 1, (2, total_num))

# Let's approximate the PDF of the point distribution with a kernel density
# estimate. scipy.stats.gaussian_kde is slow for large numbers of points, so
# you might want to use another implementation in some cases.
xy = np.vstack([x, y])
dens = gaussian_kde(xy)(xy)

# Try playing around with this weight. Compare 1/dens,  1-dens, and (1-dens)**2
weight = 1 / dens
weight /= weight.sum()

# Draw point indices (np.random.choice expects a 1D input) with the specified
# probabilities. replace=False keeps the same point from being drawn twice.
idx = np.random.choice(total_num, subset_num, replace=False, p=weight)

# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].scatter(x, y, c=dens, edgecolor='none')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(x[idx], y[idx], 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(idx)))
# 'box-forced' was removed from matplotlib; 'box' works with shared axes.
plt.setp(axes, aspect=1, adjustable='box')
fig.tight_layout()
plt.show()

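This method iteratively picks, from the remaining points, the one whose minimum distance to the already picked points is largest. It has terrible time complexity, but produces pretty uniformly distributed results: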

from numpy import argmax, array, ndarray, vstack
from numpy.random import randint
from scipy.spatial.distance import cdist


def well_spaced_points(points: ndarray, num_points: int) -> ndarray:
    """
    Pick `num_points` well-spaced points from `points` array.

    :param points: An m x n array of m n-dimensional points.
    :param num_points: The number of points to pick (at most m).
    :rtype: ndarray
    :return: A num_points x n array of points from the original array.
    """
    # pick a random starting point from the whole input array
    current_point_index = randint(0, points.shape[0])
    picked_points = array([points[current_point_index]])
    remaining_points = vstack((
        points[: current_point_index],
        points[current_point_index + 1:]
    ))
    # while there are more points to pick
    while picked_points.shape[0] < num_points:
        # for each remaining point, find the distance to its nearest picked
        # point, then take the remaining point where that distance is largest
        distance_pk_rmn = cdist(picked_points, remaining_points)
        min_distance_pk = distance_pk_rmn.min(axis=0)
        i_furthest = argmax(min_distance_pk)
        # add it to picked points and remove it from remaining
        picked_points = vstack((
            picked_points,
            remaining_points[i_furthest]
        ))
        remaining_points = vstack((
            remaining_points[: i_furthest],
            remaining_points[i_furthest + 1:]
        ))

    return picked_points
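The same greedy selection can be done with far fewer distance evaluations by keeping a running minimum distance for every point, instead of recomputing all picked-to-remaining distances on each pass. The following is a sketch under that assumption; the name `farthest_point_sample` is hypothetical, and it returns indices into the input rather than the points themselves. It assumes `num_points` does not exceed the number of distinct input points.

```python
import numpy as np


def farthest_point_sample(points, num_points, seed=None):
    """Greedy farthest-point selection; returns indices into `points`."""
    rng = np.random.default_rng(seed)
    picked = np.empty(num_points, dtype=int)
    # Start from a random point of the input.
    picked[0] = rng.integers(len(points))
    # min_dist[i] is the distance from point i to its nearest picked point.
    min_dist = np.linalg.norm(points - points[picked[0]], axis=1)
    for k in range(1, num_points):
        # The next pick is the point farthest from everything picked so far.
        picked[k] = np.argmax(min_dist)
        # Only distances to the newly picked point need to be computed.
        new_dist = np.linalg.norm(points - points[picked[k]], axis=1)
        np.minimum(min_dist, new_dist, out=min_dist)
    return picked


rng = np.random.default_rng(0)
pts = rng.normal(size=(2000, 2))  # hypothetical sample data
idx = farthest_point_sample(pts, 100, seed=0)
subset = pts[idx]
```

Already picked points have a running minimum distance of zero, so they are not selected again as long as unpicked distinct points remain.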

What is "better distributed"? Are they the n points farthest from the mean?

I want to have a constant density.