Vectorized Haversine distance calculation in Python
I am trying to calculate a distance matrix for a long list of locations identified by latitude and longitude, using the Haversine formula, which takes two tuples of coordinate pairs and returns the distance:
def haversine(point1, point2, miles=False):
    """ Calculate the great-circle distance between two points on the Earth surface.
    :input: two 2-tuples, containing the latitude and longitude of each point
    in decimal degrees.
    Example: haversine((45.7597, 4.8422), (48.8567, 2.3508))
    :output: Returns the distance between the two points.
    The default unit is kilometers. Miles can be returned
    if the ``miles`` parameter is set to True.
    """
I can calculate the distance between all points using a nested for loop, as follows:
data.head()
id coordinates
0 1 (16.3457688674, 6.30354512503)
1 2 (12.494749307, 28.6263955635)
2 3 (27.794615136, 60.0324947881)
3 4 (44.4269923769, 110.114216113)
4 5 (-69.8540884125, 87.9468778773)
length = 500
df = pd.DataFrame({'id':np.arange(length), 'coordinates':tuple(zip(np.random.uniform(-90, 90, length), np.random.uniform(-180, 180, length)))})
Using a simple function:
distance = {}
def haver_loop(df):
    for i, point1 in df.iterrows():
        distance[i] = []
        for j, point2 in df.iterrows():
            distance[i].append(haversine(point1.coordinates, point2.coordinates))
    return pd.DataFrame.from_dict(distance, orient='index')
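The body of haversine is not shown above, only its docstring. A minimal sketch of what such a function might look like follows, assuming a mean Earth radius of 6371 km and a kilometers-to-miles factor of 0.621371; both constants and their names are assumptions for illustration, not taken from the question.

```python
from math import radians, sin, cos, asin, sqrt

AVG_EARTH_RADIUS_KM = 6371  # assumed mean Earth radius, in kilometers
KM_TO_MILES = 0.621371      # assumed conversion factor


def haversine(point1, point2, miles=False):
    """Great-circle distance between two (lat, lng) points in decimal degrees."""
    lat1, lng1 = point1
    lat2, lng2 = point2
    # Convert all coordinates from degrees to radians
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    # Haversine formula
    d = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    km = 2 * AVG_EARTH_RADIUS_KM * asin(sqrt(d))
    return km * KM_TO_MILES if miles else km


# The docstring's example pair: roughly 392 km apart
lyon_paris_km = haversine((45.7597, 4.8422), (48.8567, 2.3508))
```

Because this version works on one pair of points at a time with the math module, every call goes through the Python interpreter, which is why the nested loop scales poorly.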
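But this takes quite a while given the quadratic time complexity, running at around 20 s for 500 points, and my list is much longer. This led me to look into vectorization, and I came across numpy.vectorize, but I can't figure out how to apply it in this context.

You can pass your function as an argument to np.vectorize(), and then use the result as an argument to pandas.groupby.apply, as follows: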
haver_vec = np.vectorize(haversine, otypes=[np.int16])
distance = df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))
For example, with the 500-point sample DataFrame df constructed above, comparing the two approaches:
def haver_vect(data):
    distance = data.groupby('id').apply(lambda x: pd.Series(haver_vec(data.coordinates, x.coordinates)))
    return distance
%timeit haver_loop(df): 1 loops, best of 3: 35.5 s per loop
%timeit haver_vect(df): 1 loops, best of 3: 593 ms per loop
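As a side note, np.vectorize applies NumPy broadcasting to its inputs, so a pairwise matrix can also be built without groupby. The sketch below assumes a hypothetical scalar helper, haversine_scalar, that takes four floats instead of two tuples, and it uses float64 output; otypes=[np.int16] as used above should cast the distances to integers.

```python
import numpy as np
from math import radians, sin, cos, asin, sqrt


def haversine_scalar(lat1, lng1, lat2, lng2):
    """Scalar haversine distance in km; inputs in decimal degrees (hypothetical helper)."""
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    d = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(d))


# np.vectorize wraps the scalar function; the element-wise loop still runs in Python
haver_vec = np.vectorize(haversine_scalar, otypes=[np.float64])

lat = np.array([45.7597, 48.8567, 16.3457688674])
lng = np.array([4.8422, 2.3508, 6.30354512503])

# Column vectors (N, 1) against row vectors (N,) broadcast to an N x N result
dist = haver_vec(lat[:, None], lng[:, None], lat, lng)
```

Here dist[i, j] holds the distance from point i to point j. The function is still called once per pair, so this is mainly a convenience.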
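Start by using itertools.product:
results = [(p1, p2, haversine(p1, p2)) for p1, p2 in itertools.product(points, repeat=2)]
That said, I'm not sure how fast it will be. This looks like it might be a duplicate of another question.

The haversine function looks highly parallelizable. So, using NumPy broadcasting, one of the best tools for vectorization, and replacing the math functions with their NumPy equivalents, here is a vectorized solution -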
# Get data as a Nx2 shaped NumPy array
data = np.array(df['coordinates'].tolist())
# Convert to radians
data = np.deg2rad(data)
# Extract col-1 and 2 as latitudes and longitudes
lat = data[:,0]
lng = data[:,1]
# Elementwise differences for latitudes & longitudes
diff_lat = lat[:,None] - lat
diff_lng = lng[:,None] - lng
# Finally calculate haversine (6371 is the Earth's mean radius in km)
d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
dist = 2 * 6371 * np.arcsin(np.sqrt(d))
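The key step in the snippet above is the indexing with None, which turns a length-N array into an N x 1 column so that subtraction produces every pairwise difference at once. A tiny illustration with made-up latitudes:

```python
import numpy as np

lat = np.array([10.0, 20.0, 35.0])  # illustrative values only

# lat[:, None] has shape (3, 1); subtracting the (3,) array broadcasts to (3, 3)
diff_lat = lat[:, None] - lat

# diff_lat[i, j] equals lat[i] - lat[j]
print(diff_lat.shape)
print(diff_lat)
```

The same pattern applies to the longitudes and to the product of cosines, so the whole distance matrix is computed by compiled NumPy loops rather than Python ones.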
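Runtime test -
The other np.vectorize-based solution shows some promise in performance over the original code, so this section compares the broadcasting-based approach against it.
Function definitions -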
def vectotized_based(df):
    haver_vec = np.vectorize(haversine, otypes=[np.int16])
    return df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))
def broadcasting_based(df):
    data = np.array(df['coordinates'].tolist())
    data = np.deg2rad(data)
    lat = data[:,0]
    lng = data[:,1]
    diff_lat = lat[:,None] - lat
    diff_lng = lng[:,None] - lng
    d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))
Timings -
In [123]: # Input
...: length = 500
...: d1 = np.random.uniform(-90, 90, length)
...: d2 = np.random.uniform(-180, 180, length)
...: coords = tuple(zip(d1, d2))
...: df = pd.DataFrame({'id':np.arange(length), 'coordinates':coords})
...:
In [124]: %timeit vectotized_based(df)
1 loops, best of 3: 1.12 s per loop
In [125]: %timeit broadcasting_based(df)
10 loops, best of 3: 68.7 ms per loop
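Timings are only meaningful if the fast version returns correct distances, so a quick sanity check is worthwhile. The sketch below restates the broadcasting approach on a plain N x 2 array instead of a DataFrame; the function name haversine_matrix and the np.clip guard are additions for illustration, not part of the answer above.

```python
import numpy as np


def haversine_matrix(coords_deg):
    """Pairwise haversine distances in km for an (N, 2) array of (lat, lng) in degrees."""
    coords = np.deg2rad(np.asarray(coords_deg, dtype=float))
    lat, lng = coords[:, 0], coords[:, 1]
    diff_lat = lat[:, None] - lat
    diff_lng = lng[:, None] - lng
    d = np.sin(diff_lat / 2) ** 2 + np.cos(lat[:, None]) * np.cos(lat) * np.sin(diff_lng / 2) ** 2
    # Added precaution: keep d inside [0, 1] in case rounding pushes it slightly out
    d = np.clip(d, 0.0, 1.0)
    return 2 * 6371 * np.arcsin(np.sqrt(d))


coords = [(45.7597, 4.8422), (48.8567, 2.3508), (16.3457688674, 6.30354512503)]
dist = haversine_matrix(coords)

# Basic properties of a distance matrix: zero diagonal and symmetry
assert np.allclose(np.diag(dist), 0.0)
assert np.allclose(dist, dist.T)
```

Note that the result is a dense N x N float64 array, so memory grows quadratically with the number of points; for lists much longer than 500 it may be necessary to process the rows in chunks.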
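vectorize is provided for convenience only; as far as I know, it generally does not provide any speedup (at least not a meaningful one).
Thanks, I needed to see an implementation example; I'm new to all of this and couldn't work it out from the documentation...
Possibly a duplicate.
Thanks, I missed that!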