Python: clustering data based on a distance threshold

python performance pandas numpy

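I want to remove every data point that lies within 10 cm of an earlier data point.

This is what I have, but it takes a lot of computation time because my dataset is very large: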

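import numpy as np

for i in range(len(data)):
    for j in range(i, len(data)):
        if i == j:
            continue
        # As posted: the second term subtracts data[i, 1] from itself (see the comments).
        elif np.sqrt((data[i, 0]-data[j, 0])**2 + (data[i, 1]-data[i, 1])**2) <= 0.1:
            data[j, 0] = np.nan
# Drop every row that was marked with NaN
data = data[~np.isnan(data).any(axis=1)]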

We can use a single loop:

import numpy as np
from scipy.spatial.distance import pdist

def cluster_data_pdist_v1(a, dist_thresh = 0.1):
    # Condensed pairwise distances and the pairs that fall within the threshold
    d = pdist(a)
    mask = d<=dist_thresh

    # Row i of the condensed vector occupies d[idx[i]:idx[i+1]] (points i+1..n-1)
    n = len(a)
    idx = np.concatenate(( [0], np.arange(n-1,0,-1).cumsum() ))
    start, stop = idx[:-1], idx[1:]
    idx_out = np.zeros(mask.sum(), dtype=int) # use np.empty for bit more speedup
    cur_start = 0
    for iterID,(i,j) in enumerate(zip(start, stop)):
        # Skip points that an earlier kept point has already marked for removal
        if iterID not in idx_out[:cur_start]:
            rm_idx = np.flatnonzero(mask[i:j])+iterID+1
            L = len(rm_idx)
            idx_out[cur_start:cur_start+L] = rm_idx
            cur_start += L

    return np.delete(a, idx_out[:cur_start], axis=0)

This gives a speedup of roughly 400x for a setup like this.
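The index arithmetic in the loop relies on how `pdist` lays out its condensed output. As a small check (a sketch on five made-up points, not part of the original answer), the snippet below confirms that the slice `d[idx[i]:idx[i+1]]` holds the distances from point `i` to points `i+1 .. n-1`, using `squareform` only for comparison:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
a = rng.random((5, 2))          # 5 hypothetical points, for illustration only
n = len(a)

d = pdist(a)                    # condensed distances, length n*(n-1)/2
full = squareform(d)            # full n x n matrix, used here only as a check

# Same offsets as in cluster_data_pdist_v1: start of each row in the condensed vector
idx = np.concatenate(([0], np.arange(n - 1, 0, -1).cumsum()))

for i, (start, stop) in enumerate(zip(idx[:-1], idx[1:])):
    # Row i of the condensed vector equals the upper-triangle part of row i
    assert np.allclose(d[start:stop], full[i, i + 1:])
```

This is why the function adds `iterID+1` to the positions returned by `np.flatnonzero`: position 0 inside a row's slice corresponds to point `iterID+1`.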

Here is an approach using a KDTree:
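A minimal sketch of how a KDTree-based version could look, assuming `scipy.spatial.cKDTree` and the same greedy rule as `cluster_data_pdist_v1` (walk the points in order, keep a point, drop every later point within the threshold). The name `cluster_data_kdtree_sketch` is hypothetical, and this is not necessarily the code that produced the timings in this answer:

```python
import numpy as np
from scipy.spatial import cKDTree

def cluster_data_kdtree_sketch(a, thr=0.1):
    """Greedy thinning: keep a point, drop all later points within thr of it."""
    tree = cKDTree(a)
    keep = np.ones(len(a), dtype=bool)
    for i in range(len(a)):
        if not keep[i]:
            continue  # already dropped by an earlier kept point
        # Indices of all points within thr of point i (includes i itself)
        neighbours = np.asarray(tree.query_ball_point(a[i], thr), dtype=int)
        # Only later points are dropped, so earlier kept points stay untouched
        keep[neighbours[neighbours > i]] = False
    return a[keep]

# Usage on a few made-up points along the x-axis
pts = np.array([[0.0, 0.0], [0.05, 0.0], [0.2, 0.0], [0.25, 0.0], [0.12, 0.0]])
print(cluster_data_kdtree_sketch(pts, thr=0.1))
```

The tree answers each radius query without building the full pairwise distance matrix, so the loop only does work for points that are still kept.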

Borrowing @Divakar's test case, we can see that it delivers another 100x speedup on top of the 400x Divakar reported. Compared to the OP, we extrapolate a ridiculous 40000x:

np.random.seed(0)
data1 = np.random.rand(10000,2)
data2 = data1.copy()

from timeit import timeit
kwds = dict(globals=globals(), number=10)

print(timeit("cluster_data_KDTree(data1)", **kwds))
print(timeit("cluster_data_pdist_v1(data2)", **kwds))

np.random.seed(0)
data1 = np.random.rand(10000,2)
data2 = data1.copy()

out1 = cluster_data_KDTree(data1, thr=0.1)
out2 = cluster_data_pdist_v1(data2, dist_thresh = 0.1)
print(np.allclose(out1, out2))

Sample output:

0.05073001119308174
5.646531613077968
True

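It turns out that this test case happens to be very favorable to my approach, because there are very few clusters and therefore very few iterations.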

If we drastically increase the number of clusters by changing the threshold to 0.01, KDTree still wins, but the speedup drops from 100x to 15x:

0.33647687803022563
5.28947562398389
True

This looks like a typo: (data[i, 1]-data[i, 1])**2. It looks like it should have been (data[i, 1]-data[j, 1])**2. Is that right?

@Divakar Thanks. The occasions where I manage to come out ahead of your stuff are precious few.