
Finding the closest dealer to a given customer location, where the dealer dataset has 25k+ records and the customer dataset has 200k+ records, using Python


I have two tables: Dealers and Customers. For each customer location in the Customers table, I need to find the closest dealer from the Dealers table.

I have code that works, but it takes hours to run. I need help optimizing my solution.

The Dealers table has 25k+ rows and the Customers table has 200k+ rows. Both tables have 3 main columns: (DealerID, Lat, Long) and (CustomerID, Lat, Long). My desired output looks like this:

CustomerID  Lat     Long     ClosestDealer  Distance
Customer1   61.61   -149.58  Dealer3        15.53
Customer2   42.37   -72.52   Dealer258      8.02
Customer3   42.42   -72.1    Dealer1076     32.92
Customer4   31.59   -89.87   Dealer32       3.85
Customer5   36.75   -94.84   Dealer726      7.90
I ran into the same problem a few weeks ago, and I found the best approach was to use the K-Nearest Neighbors algorithm:

from sklearn.neighbors import KNeighborsClassifier

# Drop duplicates to get the unique dealer list
dealer_df = df_s[["DealerID", "LAT", "LNG"]].drop_duplicates()
dealer_df = dealer_df.set_index("DealerID")

# Instantiate with n_neighbors=1 and weights="distance";
# the dealer IDs in the index serve as the class labels
knn = KNeighborsClassifier(n_neighbors=1, weights="distance", n_jobs=-1)
knn.fit(dealer_df.values, dealer_df.index)

# Predict the nearest dealer for every customer location
df_c["Nearest Dealer"] = knn.predict(df_c[["LAT", "LNG"]].values)
I used the same approach on nearly 1.8 million data points, and it took about 5 minutes.
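One caveat with fitting directly on Lat/Long is that the default Euclidean metric treats degrees as flat coordinates. If you need great-circle distances, scikit-learn's BallTree supports the haversine metric; below is a minimal sketch, assuming coordinates are converted to radians first (the sample coordinates and the 6371 km Earth radius are my own assumptions, not from the original post):

```python
import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical dealer and customer coordinates in degrees [lat, lon]
dealers = np.array([[61.60, -149.50], [42.40, -72.50], [31.60, -89.90]])
customers = np.array([[61.61, -149.58], [42.37, -72.52]])

# BallTree with the haversine metric expects [lat, lon] in radians
tree = BallTree(np.radians(dealers), metric="haversine")
dist_rad, idx = tree.query(np.radians(customers), k=1)

# Convert great-circle distance from radians to kilometres
dist_km = dist_rad.ravel() * 6371
nearest = idx.ravel()  # row positions of the closest dealers
```

The query itself is still a single vectorized call, so it scales similarly to the KNN approach above.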

Generating some sample data:

import numpy as np
import pandas as pd

N = 25000
dealers = pd.DataFrame({"DealerID": "Dealer" + pd.RangeIndex(1, N+1).astype(str),
                        "Lat": np.random.uniform(30, 65, N),
                        "Long": np.random.uniform(-150, -70, N)}
                      ).set_index("DealerID")

N = 200000
customers = pd.DataFrame({"CustomerID": "Customer" + pd.RangeIndex(1, N+1).astype(str),
                          "Lat": np.random.uniform(30, 65, N),
                          "Long": np.random.uniform(-150, -70, N)}
                        ).set_index("CustomerID")
You can use KDTree from SciPy:

from scipy.spatial import KDTree

# Build a k-d tree from the dealer coordinates, then query the
# nearest dealer for every customer in one vectorized call
distances, indices = KDTree(dealers).query(customers)
It runs in a few seconds:

>>> customers.assign(ClosestDealer=dealers.iloc[indices].index, Distance=distances)
                      Lat        Long ClosestDealer  Distance
CustomerID
Customer1       30.748900 -133.231319   Dealer22102  0.189255
Customer2       38.636134  -98.618844    Dealer1510  0.282966
Customer3       60.282135  -97.100096    Dealer2715  0.182832
Customer4       42.995473 -120.135218   Dealer10539  0.423006
Customer5       50.809563  -80.662491   Dealer12022  0.091765
...                   ...         ...           ...       ...
Customer199996  47.387618  -88.420528   Dealer17124  0.325962
Customer199997  53.618939 -124.432385    Dealer9177  0.133110
Customer199998  58.506937 -146.024708   Dealer15558  0.299639
Customer199999  48.329325 -129.149631   Dealer18371  0.023172
Customer200000  36.599969 -145.019091    Dealer2316  0.199344

[200000 rows x 4 columns]
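Note that the Distance column above is measured in degrees of latitude/longitude, not in a physical unit. If real distances are needed for the matched pairs, a haversine conversion can be applied afterwards; this helper is a sketch under my own assumptions (mean Earth radius of 6371 km, hypothetical example coordinates):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km (assumption)
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Example: distance between a customer and their matched dealer
d = haversine_km(61.61, -149.58, 61.60, -149.50)
```

Applied row-wise (or vectorized with NumPy) over the customer/dealer pairs returned by the KDTree query, this turns the degree-based distances into kilometres.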

You can use a k-d tree to find, for each point in one dataset, the nearest neighbor in a second dataset. In that question the asker had millions of points in their dataset.

Hey, thanks for pointing me to that post. I am currently using the KDTree approach and it works. I will also look into the BallTree method.

Hey, thanks for this solution. It is not entirely feasible for my dataset: even if I increase "n_neighbors" to widen the search, some of my nearest dealers lie beyond a search of 200+ records, and increasing "n_neighbors" to a larger value would again increase the program's runtime. But for a differently structured dataset this might well work.

Hey, thanks for this solution. It worked great for my dataset, with a runtime measured in seconds. :)