Fastest way in Python to compute the minimum haversine distance between a lat/long pair and a series of lat/longs in PySpark?


python pyspark


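Context: I'm looking for a way, in PySpark, to efficiently compute the distances between one lat/long pair and a set of lat/long pairs, and then take the minimum of those distances.

How this would work:

Step 1: I have a Spark DataFrame containing restaurant IDs, with latitude and longitude as columns.

Step 2: I have a pandas DataFrame of gas stations.

Step 3: I now want to compute the haversine distance between each restaurant and every gas station location, and then take the minimum distance. For example:
- haversine distance between restaurant ID 123 and gas station 456 = 5 m
- haversine distance between restaurant ID 123 and gas station 789 = 12 m

Then I want to return 5 m as the value, because it is the minimum distance. I want to do this for every restaurant ID. Some pseudocode to make the problem clearer: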

# Pseudocode illustrating the desired logic
for each_restaurant in a list of restaurants:
    calculate the distance between the restaurant and ALL the gas stations
    return minimum distance
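
The loop above can be made concrete for a single restaurant. The following is a minimal sketch in NumPy, assuming a spherical Earth with a mean radius of 6,371,000 m; the function name `min_haversine_m` and the sample coordinates are hypothetical and do not come from the original post.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius, in metres


def min_haversine_m(lat, lon, station_lats, station_lons):
    """Return the smallest haversine distance (metres) from one point
    to an array of points. All inputs are in degrees."""
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2 = np.radians(np.asarray(station_lats, dtype=float))
    lon2 = np.radians(np.asarray(station_lons, dtype=float))

    # Haversine formula, evaluated for every station at once
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    distances = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))
    return float(distances.min())


# Hypothetical example: one restaurant and three gas stations,
# the second of which sits at the restaurant's own coordinates
nearest = min_haversine_m(52.5200, 13.4050,
                          [52.5300, 52.5200, 48.8566],
                          [13.4050, 13.4050, 2.3522])
```

Evaluating all stations in one array expression replaces the inner "ALL the gas stations" loop; only the outer loop over restaurants remains.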

Progress so far: I have tried both a vectorized (pandas) UDF and a regular UDF, as shown below.

import numpy as np
import pandas as pd
from math import radians
from pyspark.sql.functions import pandas_udf

@pandas_udf('double')
def haversine_distance(lat: pd.Series, long: pd.Series) -> pd.Series:
    """Get haversine distances from a single (lat, long) pair to an array
    of (lat, long) pairs.
    """
    # Convert the lat long to radians
    lat = lat.apply(radians)
    long = long.apply(radians)

    unit = 'm'
    single_loc = pd.DataFrame([lat, long]).T
    single_loc.columns = ['Latitude', 'Longitude']

    # gas_stations_df is a pandas DataFrame
    other_locs = gas_stations_df[['Latitude', 'Longitude']].values

    dist_l = []
    for index, row in single_loc.iterrows():
        # .... do haversine distance calculations
        # d = haversine distances from this row to every row of other_locs
        dist_l.append(np.min(d))

    return pd.Series(dist_l)

Then I apply the pandas UDF as follows:

restaurant_df = restaurant_df.withColumn('distance_to_nearest_gas_station', haversine_distance('latitude', 'longitude'))

Although this approach works, it still scales rather slowly, and I'm wondering whether there is a simpler way to do this.

Thank you very much for reading.
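
One likely contributor to the slowdown described in the question is the row-by-row `iterrows()` loop inside the UDF. A possible alternative, sketched below, uses NumPy broadcasting to build the whole restaurant-by-station distance matrix for a batch in one step. The function name `nearest_station_distance`, the Earth radius, and the sample `gas_stations_df` contents are assumptions for illustration. Memory use grows with batch size times the number of stations, so this only suits a gas-station table that fits comfortably in memory.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius, in metres

# Hypothetical stand-in for the gas-station pandas DataFrame
gas_stations_df = pd.DataFrame({
    "Latitude": [0.0, 10.0],
    "Longitude": [1.0, 10.0],
})
# Convert the stations to radians once, outside the per-batch function
station_rad = np.radians(gas_stations_df[["Latitude", "Longitude"]].to_numpy())


def nearest_station_distance(lat: pd.Series, long: pd.Series) -> pd.Series:
    """Minimum haversine distance (metres) from each row to any station."""
    # Column vectors of shape (n, 1) broadcast against station vectors (m,)
    lat1 = np.radians(lat.to_numpy(dtype=float))[:, None]
    lon1 = np.radians(long.to_numpy(dtype=float))[:, None]
    lat2, lon2 = station_rad[:, 0], station_rad[:, 1]

    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))  # shape (n, m)
    return pd.Series(dist.min(axis=1), index=lat.index)


# Local check with plain pandas; no Spark session is needed here
result = nearest_station_distance(pd.Series([0.0, 10.0]),
                                  pd.Series([0.0, 10.0]))
```

In Spark, a function with this Series-in, Series-out shape could be wrapped with `pyspark.sql.functions.pandas_udf`, for example `pandas_udf(nearest_station_distance, 'double')`, and applied through `withColumn` on the latitude and longitude columns.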

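I would drop the haversine requirement at first and use a (2D or 3D) spatial index to narrow the candidates down to a few points, which should be very fast. If you then want or need the exact distance to that point, you can apply whatever formula you like.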
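
As one possible reading of this suggestion, the sketch below uses a k-d tree as the spatial index, via `scipy.spatial.cKDTree`. This is an assumption, since the answer does not say which structure it has in mind. Points are mapped to 3D unit vectors so that the nearest neighbour by straight-line (chord) distance is also the nearest by great-circle distance, and the chord length is then converted back to a great-circle distance, which on a sphere is the quantity the haversine formula computes. The function names, the Earth radius, and the sample coordinates are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius, in metres


def to_unit_xyz(lat_deg, lon_deg):
    """Map latitude/longitude in degrees to 3D points on the unit sphere."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))


def nearest_distance_m(tree, lat_deg, lon_deg):
    """Great-circle distance (metres) from each query point to its
    nearest neighbour in the tree."""
    chord, _ = tree.query(to_unit_xyz(lat_deg, lon_deg), k=1)
    # Chord length c on the unit sphere -> central angle 2 * arcsin(c / 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.clip(chord / 2, 0.0, 1.0))


# Hypothetical gas stations; the index is built once and reused
tree = cKDTree(to_unit_xyz([0.0, 10.0], [1.0, 10.0]))
# Hypothetical restaurants
dists = nearest_distance_m(tree, [0.0, 10.0], [0.0, 10.0])
```

Each lookup then costs roughly logarithmic time in the number of gas stations, instead of one haversine evaluation per station.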
