
Fast vectorized nearest-neighbour search in 4D space with Python


For each observation in X (there are 20), I want to get the k (3) nearest neighbours. How can I make this fast enough to support 3 to 4 million rows? Is it possible to speed up the loop iterating over the elements, perhaps via numpy, numba, or some form of vectorization?

Naive loop in Python:

import numpy as np
from sklearn.neighbors import KDTree

n_points = 20
d_dimensions = 4
k_neighbours = 3

rng = np.random.RandomState(0)
X = rng.random_sample((n_points, d_dimensions))
print(X)
tree = KDTree(X, leaf_size=2, metric='euclidean')

for element in X:
    print('********')
    print(element)

    # when simply using the first row:
    # element = X[:1]
    # print(element)

    # potential optimization: query_radius https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html#sklearn.neighbors.KDTree.query_radius
    dist, ind = tree.query([element], k=k_neighbours, return_distance=True, dualtree=False, breadth_first=False, sort_results=True)

    # indices of 3 closest neighbors
    print(ind)
    #[[0 9 1]] !! includes self (element that was searched for)
    print(dist)  # distances to 3 closest neighbors
    #[[0.         0.38559188 0.40997835]] !! includes self (element that was searched for)

    # actual returned elements for index:
    print(X[ind])

    ## after removing self
    print(X[ind][0][1:])
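The loop above can be avoided entirely: `KDTree.query` accepts the full array of query points, so all rows can be queried in a single vectorized call. The following sketch (not part of the original question) queries k+1 neighbours and then drops the self-match in column 0:

```python
import numpy as np
from sklearn.neighbors import KDTree

n_points = 20
d_dimensions = 4
k_neighbours = 3

rng = np.random.RandomState(0)
X = rng.random_sample((n_points, d_dimensions))
tree = KDTree(X, leaf_size=2, metric='euclidean')

# query all rows at once; ask for k+1 neighbours because each point
# is its own nearest neighbour at distance 0
dist, ind = tree.query(X, k=k_neighbours + 1)

# drop the self-match in column 0
dist, ind = dist[:, 1:], ind[:, 1:]

# coordinates of the neighbours, shape (n_points, k_neighbours, d_dimensions)
neighbour_coords = X[ind]
```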
The ideal output would be a pandas.DataFrame with the following structure:

lat_1,long_1,lat_2,long_2,neighbours_list
0.5488135,0.71518937,0.60276338,0.54488318,[[0.61209572 0.616934   0.94374808 0.6818203 ] [0.4236548  0.64589411 0.43758721 0.891773  ]]
EDIT

Currently, I have a pandas-based implementation:

df = df.dropna() # there are sometimes only parts of the tuple (either left or right) defined
X = df[['lat1', 'long1', 'lat2', 'long2']]
tree = KDTree(X, leaf_size=4, metric='euclidean')

k_neighbours = 3
def neighbors_as_list(row, index, complete_list):
    dist, ind = index.query([[row['lat1'], row['long1'], row['lat2'], row['long2']]], k=k_neighbours, return_distance=True, dualtree=False, breadth_first=False, sort_results=True)
    return complete_list.values[ind][0][1:]    
df['neighbors'] = df.apply(neighbors_as_list, index=tree, complete_list=X, axis=1)
df.head()
But this is very slow.
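The row-wise `df.apply` is slow because it issues one tree query per row. As a sketch (using hypothetical data with the same column layout as above), the whole frame can be queried in one batched call and the result stored as an object column:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree

# hypothetical data with the same column layout as the question's df
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.random_sample((20, 4)),
                  columns=['lat1', 'long1', 'lat2', 'long2'])

X = df[['lat1', 'long1', 'lat2', 'long2']].values
tree = KDTree(X, leaf_size=4, metric='euclidean')

k_neighbours = 3
# one batched query over all rows replaces the row-wise df.apply;
# k+1 neighbours so the self-match in column 0 can be dropped
_, ind = tree.query(X, k=k_neighbours + 1)

# list() stores one (k, d) coordinate array per row as an object column
df['neighbors'] = list(X[ind[:, 1:]])
```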

EDIT 2

Sure, here is a version:

import numpy as np
import pandas as pd

from sklearn.neighbors import KDTree
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
#n_points = 4_000_000
n_points = 20
d_dimensions = 4
k_neighbours = 3

X = rng.random_sample((n_points, d_dimensions))
X


df = pd.DataFrame(X)
df = df.reset_index(drop=False)
df.columns = ['id_str', 'lat_1', 'long_1', 'lat_2', 'long_2']
df.id_str = df.id_str.astype(object)
display(df.head())

tree = cKDTree(df[['lat_1', 'long_1', 'lat_2', 'long_2']])
dist, ind = tree.query(X, k=k_neighbours, n_jobs=-1)
ind_out = ind[:, 1:]  # drop the self-match in column 0

display(dist)
print(df[['lat_1', 'long_1', 'lat_2', 'long_2']].shape)
print(X[ind_out].shape)
X[ind_out]

# fails with
# AssertionError: Shape of new values must be compatible with manager shape
df['neighbors'] = X[ind_out]
df

But it fails because I cannot assign the result back to the DataFrame.
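The `AssertionError` arises because `X[ind_out]` is a 3-d array, which pandas cannot place into a single column. One workaround (a sketch, not from the original answer) is to wrap it in `list()`, so each row stores its own 2-d array in an object column:

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
X = rng.random_sample((20, 4))
df = pd.DataFrame(X, columns=['lat_1', 'long_1', 'lat_2', 'long_2'])

tree = cKDTree(X)
dist, ind = tree.query(X, k=3)
ind_out = ind[:, 1:]  # drop the self-match in column 0

# X[ind_out] is 3-d, which pandas refuses to assign to one column;
# wrapping it in list() stores one (k-1, d) array per row instead
df['neighbors'] = list(X[ind_out])
```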

You can use scipy's cKDTree.

Example:

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
n_points = 4_000_000
d_dimensions = 4
k_neighbours = 3

X = rng.random_sample((n_points, d_dimensions))

tree = cKDTree(X)
#%timeit tree = cKDTree(X)
#3.74 s ± 29.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

#%%timeit
_, ind = tree.query(X, k=k_neighbours, n_jobs=-1)
# drop the self-match in column 0; shape=(4000000, 2)
ind_out = ind[:, 1:]

# coordinates of the neighbours; shape=(4000000, 2, 4)
coords_out = X[ind_out]
#7.13 s ± 87.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

About 11 seconds is quite good for a problem of this size.

But is this 11 seconds for a single iteration, or do I already get the data for all rows? That is, is the for-each loop no longer necessary?

@GeorgHeiler Yes, you get all indices and coordinates of the neighbours. The only thing left would be to build a pandas DataFrame, if the arrays are not enough.

Wow, that's great. But it also means my for-each loop (and even the vectorized pandas implementation) is completely outclassed by the native implementation. Is it possible to keep IDs, similar to pandas.apply? Or do I need to join them manually based on the index?

I don't have much experience with pandas, but I would guess it is quite similar to obtaining coords_out; at least for numpy objects it is the same. Maybe add an example to the question?
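Regarding keeping IDs: the indices returned by `cKDTree.query` are row positions in X, which match the DataFrame's row order, so an id column can be indexed directly with them. A sketch, using a hypothetical `id_str` column like the one in the question:

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
X = rng.random_sample((20, 4))
df = pd.DataFrame(X, columns=['lat_1', 'long_1', 'lat_2', 'long_2'])
# hypothetical id column, as in the question's id_str
df['id_str'] = ['row_%d' % i for i in range(len(df))]

tree = cKDTree(X)
_, ind = tree.query(X, k=3)
ind_out = ind[:, 1:]  # drop the self-match in column 0

# the query indices are row positions in X, which match df's row
# order, so the id column can be fancy-indexed directly
ids = df['id_str'].values
df['neighbour_ids'] = list(ids[ind_out])
```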