Python: nearest record and corresponding distance for each record across two DataFrames


Suppose I have two DataFrames, XA and XB, for example each with 3 rows and 2 columns:

import pandas as pd

XA = pd.DataFrame({
    'x1': [1, 2, 3],
    'x2': [4, 5, 6]
})

XB = pd.DataFrame({
    'x1': [8, 7, 6],
    'x2': [5, 4, 3]
})

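For each record in XA, I want to find the nearest record in XB (for example, by Euclidean distance), along with the corresponding distance. For example, this could return a DataFrame indexed on id_A, with columns id_B and distance.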

How can I do this most efficiently?

One approach is to compute the full distance matrix, melt it, and aggregate with nsmallest, which returns both the index and the value:

from scipy.spatial.distance import cdist

def nearest_record(XA, XB):
    """Get the nearest record in XB for each record in XA.

    Args:
        XA: DataFrame. Each record is matched against the nearest in XB.
        XB: DataFrame.

    Returns:
        DataFrame with columns for id_A (from XA), id_B (from XB), and dist.
        Each id_A maps to a single id_B, which is the nearest record from XB.
    """
    dist = pd.DataFrame(cdist(XA, XB)).reset_index().melt('index')
    dist.columns = ['id_A', 'id_B', 'dist']
    # id_B is sometimes returned as an object.
    dist['id_B'] = dist.id_B.astype(int)
    dist.reset_index(drop=True, inplace=True)
    nearest = dist.groupby('id_A').dist.nsmallest(1).reset_index()
    return nearest.set_index('level_1').join(dist.id_B).reset_index(drop=True)

This shows that id_B 2 is the nearest record to all three records in XA:

nearest_record(XA, XB)

   id_A      dist  id_B
0     0  5.099020     2
1     1  4.472136     2
2     2  4.242641     2

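However, because this computes the full distance matrix, it is slow or fails outright when XA and XB are large. An alternative that computes the nearest distance one row at a time would probably be faster.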

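To avoid building the full distance matrix, you can find the nearest record and its distance for a single row of XA (nearest_record1()), then use apply to run that on every row (nearest_record()). This cut the run time by about 85%.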
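The two functions named above are not reproduced here, so the following is only a minimal sketch of what such a pair might look like, not the answerer's original code. It assumes all columns are numeric, that XA and XB have the same columns in the same order, and that id_A and id_B are positional row numbers (as in the question's output). The parameter name XB_values is an illustrative choice.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

XA = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6]})
XB = pd.DataFrame({'x1': [8, 7, 6], 'x2': [5, 4, 3]})


def nearest_record1(row, XB_values):
    """Nearest row of XB (by position) and its distance for one record of XA."""
    # cdist needs 2-D inputs, so the record is reshaped to (1, n_features).
    # Only a 1 x len(XB) distance vector exists at any one time.
    d = cdist(row.to_numpy(dtype=float).reshape(1, -1), XB_values)[0]
    pos = int(np.argmin(d))
    return pd.Series({'id_B': pos, 'dist': d[pos]})


def nearest_record(XA, XB):
    """Apply nearest_record1 to every row of XA."""
    XB_values = XB.to_numpy(dtype=float)
    # Resetting the index makes id_A positional, matching the question.
    out = XA.reset_index(drop=True).apply(
        nearest_record1, axis=1, XB_values=XB_values)
    # The mixed int/float Series is upcast to float, so restore the int ids.
    out['id_B'] = out['id_B'].astype(int)
    out.index.name = 'id_A'
    return out.reset_index()
```

The design trade-off is memory for Python-level looping: each call holds one row of distances instead of the whole len(XA) by len(XB) matrix.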

This also returns the correct result:

nearest_record(XA, XB)
    id_A    id_B        dist
0      0       2    5.099020
1      1       2    4.472136
2      2       2    4.242641

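For larger inputs, a spatial index is another way to avoid the full distance matrix. The sketch below swaps in a different technique from the one described above: it builds a k-d tree on XB with scipy.spatial.cKDTree and queries it with all of XA at once. It assumes numeric columns, Euclidean distance, and positional ids; the function name nearest_record_kdtree is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

XA = pd.DataFrame({'x1': [1, 2, 3], 'x2': [4, 5, 6]})
XB = pd.DataFrame({'x1': [8, 7, 6], 'x2': [5, 4, 3]})


def nearest_record_kdtree(XA, XB):
    """Nearest row of XB (by position) for each row of XA, via a k-d tree."""
    tree = cKDTree(XB.to_numpy(dtype=float))
    # k=1 returns, for each query point, the distance to and the position of
    # its single nearest neighbour; the default metric is Euclidean (p=2).
    dist, pos = tree.query(XA.to_numpy(dtype=float), k=1)
    return pd.DataFrame({
        'id_A': np.arange(len(XA)),
        'id_B': pos,
        'dist': dist,
    })
```

Calling nearest_record_kdtree(XA, XB) on the example data should give the same three matches as the outputs shown earlier. Tree-based lookups tend to help most when the number of columns is small; with many columns the benefit shrinks.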