How to calculate the minimum distance using lat-lon data in Python
I have two dataframes. One has user ids with lat/lon data, and the other has store codes with store lat/lon data. There are about 89 million rows. I want the nearest store code (by minimum distance) for each user's lat/lon.
df1 -
id user_lat user_lon
1 13.031885 80.235574
2 19.099819 72.915288
3 22.226980 84.836070
df2 -
store_no s_lat s_lon
22 29.91 73.88
23 28.57 77.33
24 26.86 80.95
This is what I have done so far -
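import numpy as np
import pandas as pd
# DistanceMetric lives in sklearn.metrics since scikit-learn 1.0
# (older versions: sklearn.neighbors)
from sklearn.metrics import DistanceMetric
dist = DistanceMetric.get_metric('haversine')
df1 = df1[['user_lat','user_lon']]
df2 = df2[['s_lat','s_lon']]
# Cross join: pair every user with every store
x = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
    .drop(columns='k')
x.head(20)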
user_lat user_lon s_lat s_lon
0 13.031885 80.235574 29.91 73.88
1 13.031885 80.235574 28.57 77.33
2 13.031885 80.235574 26.86 80.95
3 19.099819 72.915288 29.91 73.88
4 19.099819 72.915288 28.57 77.33
5 19.099819 72.915288 26.86 80.95
6 22.226980 84.836070 29.91 73.88
7 22.226980 84.836070 28.57 77.33
8 22.226980 84.836070 26.86 80.95
# pairwise(users, stores) is user-major, which matches the row order of x
x['dist'] = np.ravel(dist.pairwise(np.radians(df1.to_numpy()), np.radians(df2.to_numpy())) * 6367)
user_lat user_lon s_lat s_lon dist
0 13.031885 80.235574 29.91 73.88 1986.237557
1 13.031885 80.235574 28.57 77.33 1752.628427
2 13.031885 80.235574 26.86 80.95 1538.449674
3 19.099819 72.915288 29.91 73.88 1205.217610
4 19.099819 72.915288 28.57 77.33 1143.731258
5 19.099819 72.915288 26.86 80.95 1190.620278
6 22.226980 84.836070 29.91 73.88 1386.069611
7 22.226980 84.836070 28.57 77.33 1031.246453
8 22.226980 84.836070 26.86 80.95 647.477461
But I want the dataframe to look like this, with the store_no of each user's nearest store -
user_lat user_lon s_lat s_lon dist store_no
0 13.031885 80.235574 29.91 73.88 1986.237557 24
1 13.031885 80.235574 28.57 77.33 1752.628427 24
2 13.031885 80.235574 26.86 80.95 1538.449674 24
3 19.099819 72.915288 29.91 73.88 1205.217610 23
4 19.099819 72.915288 28.57 77.33 1143.731258 23
5 19.099819 72.915288 26.86 80.95 1190.620278 23
6 22.226980 84.836070 29.91 73.88 1386.069611 24
7 22.226980 84.836070 28.57 77.33 1031.246453 24
8 22.226980 84.836070 26.86 80.95 647.477461 24
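Finding the nearest store for each user is a classic use case for the k-d tree or ball tree data structures. Scikit-learn implements both, but only BallTree accepts the haversine distance metric, so we'll use that.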
import pandas as pd
import numpy as np
from sklearn.neighbors import BallTree
# Set up example data
df1 = pd.DataFrame({'id': [1, 2, 3],
                    'user_lat': [13.031885, 19.099819, 22.22698],
                    'user_lon': [80.235574, 72.915288, 84.83607]})
df2 = pd.DataFrame({'store_no': [22, 23, 24],
                    's_lat': [29.91, 28.57, 26.86],
                    's_lon': [73.88, 77.33, 80.95]})
# Build a ball tree with the haversine distance metric, which expects
# (lat, lon) in radians and returns distances in radians
tree = BallTree(np.radians(df2[['s_lat', 's_lon']].to_numpy()), metric='haversine')
coords = np.radians(df1[['user_lat', 'user_lon']].to_numpy())
# Query the single nearest store per user; both results have shape (n_users, 1)
dists, ilocs = tree.query(coords)
# dists is in rad; convert to km (6367 km is the Earth radius used here)
df1['dist'] = dists.flatten() * 6367
df1['nearest_store'] = df2.iloc[ilocs.flatten()]['store_no'].values
# Result:
df1
id user_lat user_lon dist nearest_store
0 1 13.031885 80.235574 1538.449674 24
1 2 19.099819 72.915288 1143.731258 23
2 3 22.226980 84.836070 647.477461 24
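As a sanity check on the radians-to-km conversion, the haversine formula can also be written directly in NumPy. This is only a sketch and not part of the original answer: the function name haversine_km and the constant EARTH_RADIUS_KM are made up for illustration, and the radius of 6367 km is simply the value already used in this thread. It computes the distance between user 3 and store 24 from the example data.

```python
import numpy as np

EARTH_RADIUS_KM = 6367  # assumption: same radius as used in this thread


def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance in km between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula: a is the squared half-chord length between the points
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius * np.arcsin(np.sqrt(a))


# User 3 (22.22698, 84.83607) to store 24 (26.86, 80.95)
d = haversine_km(22.22698, 84.83607, 26.86, 80.95)
print(round(float(d), 1))  # roughly 647.5 (km)
```

Because the function uses only NumPy ufuncs, it also accepts whole arrays or pandas columns, which makes it handy for spot-checking a sample of rows.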
Thanks for the reply. But I have 90M records. If I use a for loop, Python may crash... please suggest an alternative to a for loop.
Somehow I was stuck in for-loop mode. Edited!
Thanks for the quick reply... but I also want the distance value (in km).
You're welcome. I'm away from my computer right now and can't test, but I think I've edited the code to provide distances. (BallTree.query returns both the distances and the indices of the results.) I will check and finalize the edit in about 8 hours; curious to know whether it handles 90 million rows on your machine!
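On the 90M-row concern raised in the comments: one possible way to keep memory bounded is to build the tree once over the stores and query the users in batches. The sketch below has not been tested at that scale. The helper nearest_store and its chunk_size parameter are hypothetical names introduced here, and the column names follow the example data above. The loop runs once per batch, not once per row.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6367  # assumption: same radius as used in the answer


def nearest_store(users, stores, chunk_size=1_000_000):
    """Return (distance in km, store_no) arrays, one entry per user row."""
    # Build the ball tree once, over the (much smaller) store table
    tree = BallTree(np.radians(stores[['s_lat', 's_lon']].to_numpy()),
                    metric='haversine')
    store_nos = stores['store_no'].to_numpy()
    dist_km = np.empty(len(users))
    nearest = np.empty(len(users), dtype=store_nos.dtype)
    # Query the users in batches so only one batch of coordinates is converted at a time
    for start in range(0, len(users), chunk_size):
        stop = start + chunk_size
        coords = np.radians(
            users[['user_lat', 'user_lon']].iloc[start:stop].to_numpy())
        d, idx = tree.query(coords, k=1)
        dist_km[start:stop] = d[:, 0] * EARTH_RADIUS_KM
        nearest[start:stop] = store_nos[idx[:, 0]]
    return dist_km, nearest


df1 = pd.DataFrame({'id': [1, 2, 3],
                    'user_lat': [13.031885, 19.099819, 22.22698],
                    'user_lon': [80.235574, 72.915288, 84.83607]})
df2 = pd.DataFrame({'store_no': [22, 23, 24],
                    's_lat': [29.91, 28.57, 26.86],
                    's_lon': [73.88, 77.33, 80.95]})

# A tiny chunk_size here just to exercise the batching on the example data
df1['dist'], df1['nearest_store'] = nearest_store(df1, df2, chunk_size=2)
print(df1)
```

Whether this is fast enough for 89 to 90 million users depends on the machine and on the number of stores; the batch size is a tuning knob, not a recommendation.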