Python: calculations between columns (longitude/latitude) are very slow

python pandas math


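I have two separate datasets, df and df2, each with Longitude and Latitude columns. For every point in df2, I want to find the closest point in df, compute the distance between them (in km), and store each value in a new column of df2.

I have come up with a solution, but keep in mind that df has more than 700,000 rows and df2 has about 60,000 rows, so my solution takes a very long time to run. The only approach I could think of uses a double for loop: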

from math import sin, cos, sqrt, atan2, radians

def compute_shortest_dist(df, df2):
    # array to store all closest distances
    shortest_dist = []

    # radius of earth (used for calculation)
    R = 6373.0
    for i in df2.index:
        # keeps track of current minimum distance
        min_dist = -1

        # latitude and longitude from df2, converted to radians
        # (assumes the columns are stored in degrees)
        lat1 = radians(df2.loc[i, 'Latitude'])
        lon1 = radians(df2.loc[i, 'Longitude'])

        for j in df.index:

            # the following is just the calculation necessary
            # to calculate the distance between each point in km
            lat2 = radians(df.loc[j, 'Latitude'])
            lon2 = radians(df.loc[j, 'Longitude'])
            dlon = lon2 - lon1
            dlat = lat2 - lat1
            a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
            c = 2 * atan2(sqrt(a), sqrt(1 - a))
            distance = R * c

            # store new shortest distance
            if min_dist == -1 or distance < min_dist:
                min_dist = distance
        # append shortest distance to array
        shortest_dist.append(min_dist)
    return shortest_dist
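For reference, the expression inside the loop is the haversine formula. Written out, with \varphi for latitude and \lambda for longitude (both taken to be in radians, which is an assumption imposed by the trigonometric functions rather than something the question states) and R for the Earth's radius in km:

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
     + \cos\varphi_1 \cos\varphi_2 \,
       \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
c &= 2 \operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right) \\
d &= R \, c
\end{aligned}
```

Because d grows monotonically with a on [0, 1], the nearest point can be found by minimising a and converting only the winner to kilometres.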

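This function takes far too long to run. I know there must be a more efficient way, but I'm not very good with pandas syntax.

Any help is much appreciated.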

You can write the inner loop in numpy, which will speed things up considerably:

import numpy as np

def compute_shortest_dist(df, df2):
    # array to store all closest distances
    shortest_dist = []

    # radius of earth (used for calculation)
    R = 6373.0
    # Haversine needs radians; this assumes the columns hold degrees.
    # .values hands numpy plain arrays instead of pandas Series.
    lat1 = np.radians(df['Latitude'].values)
    lon1 = np.radians(df['Longitude'].values)
    for i in df2.index:
        # the following is just the calculation necessary
        # to calculate the distance between each point in km
        lat2 = np.radians(df2.loc[i, 'Latitude'])
        dlat = lat1 - lat2
        dlon = lon1 - np.radians(df2.loc[i, 'Longitude'])
        a = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2
        distance = 2 * R * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

        # append shortest distance to array
        shortest_dist.append(distance.min())
    return shortest_dist
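The same idea can be pushed a little further by handling several df2 rows per iteration with broadcasting. The following is only a sketch under stated assumptions, not a definitive implementation: the function name nearest_distance_km and the chunk_size parameter are hypothetical, inputs are assumed to be 1-D arrays in degrees, and the arithmetic is still one haversine term per pair. It also takes the minimum of a before converting to kilometres, which is valid because the distance increases monotonically with a.

```python
import numpy as np

EARTH_RADIUS_KM = 6373.0  # same radius as the code above


def nearest_distance_km(ref_lat, ref_lon, query_lat, query_lon, chunk_size=16):
    """For each query point, return the haversine distance (km) to the
    closest reference point. All inputs are 1-D sequences in degrees."""
    ref_lat = np.radians(np.asarray(ref_lat, dtype=float))
    ref_lon = np.radians(np.asarray(ref_lon, dtype=float))
    query_lat = np.radians(np.asarray(query_lat, dtype=float))
    query_lon = np.radians(np.asarray(query_lon, dtype=float))

    cos_ref = np.cos(ref_lat)  # computed once, reused for every chunk
    out = np.empty(query_lat.shape[0])

    for start in range(0, query_lat.shape[0], chunk_size):
        stop = start + chunk_size
        # (chunk, 1) columns broadcast against the (n_ref,) reference arrays
        q_lat = query_lat[start:stop, None]
        q_lon = query_lon[start:stop, None]
        a = (np.sin((ref_lat - q_lat) / 2) ** 2
             + np.cos(q_lat) * cos_ref * np.sin((ref_lon - q_lon) / 2) ** 2)
        # smallest haversine term per query row; clip guards against rounding
        a_min = np.clip(a.min(axis=1), 0.0, 1.0)
        out[start:stop] = 2 * EARTH_RADIUS_KM * np.arctan2(
            np.sqrt(a_min), np.sqrt(1 - a_min))
    return out
```

Each chunk allocates temporary arrays of shape (chunk_size, len(ref_lat)), so with roughly 700,000 reference points a small chunk_size keeps memory use manageable.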

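The loop has to cover roughly 700,000 × 60,000 ≈ 42 billion point pairs, so I don't think there is a truly efficient way to compute this. You could try using a parallel-processing module to make it two or four times faster, but I can't think of a better way to compute it.

numpy.ndarray should be faster than pandas.core.series.Series. You could try adding .values when extracting the latitude and longitude.

Possibly related: don't use ix; df2.loc[i, 'Longitude'] is better.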
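Putting the suggestions from these comments together, a minimal sketch might look like the following. The function name add_nearest_distance and the column name nearest_km are hypothetical, the Latitude and Longitude columns are assumed to be in degrees, and .to_numpy() is used as the current spelling of .values.

```python
import numpy as np
import pandas as pd


def add_nearest_distance(df, df2, radius_km=6373.0):
    """Return a copy of df2 with a 'nearest_km' column (hypothetical name)
    holding the haversine distance to the closest point in df."""
    # Pull plain ndarrays out once, instead of indexing the frames per row
    lat1 = np.radians(df['Latitude'].to_numpy(dtype=float))
    lon1 = np.radians(df['Longitude'].to_numpy(dtype=float))
    lat2 = np.radians(df2['Latitude'].to_numpy(dtype=float))
    lon2 = np.radians(df2['Longitude'].to_numpy(dtype=float))
    cos_lat1 = np.cos(lat1)

    nearest = np.empty(len(df2))
    for k in range(len(df2)):
        # haversine term against every point of df at once
        a = (np.sin((lat1 - lat2[k]) / 2) ** 2
             + cos_lat1 * np.cos(lat2[k]) * np.sin((lon1 - lon2[k]) / 2) ** 2)
        a_min = min(a.min(), 1.0)  # guard against rounding above 1
        nearest[k] = 2 * radius_km * np.arctan2(np.sqrt(a_min), np.sqrt(1 - a_min))

    out = df2.copy()
    out['nearest_km'] = nearest  # assigned by position, index is preserved
    return out
```

Calling df2 = add_nearest_distance(df, df2) then gives the new column asked for in the question while leaving the original df2 untouched.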