Python: ID matching between DataFrames with an apply function

python performance pandas numpy

I have two DataFrames, as shown below:

df_A：

df_B：

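I want to add a column to df_B holding, for each identifier, the Euclidean distance between the x, y coordinates in df_B and those in df_A. The expected result would be: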

ID    x     y    dist
a     2     1    1.732
c     3     5    3
b     1     2    3.162

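The identifiers are not necessarily in the same order. I know how to do this by looping over the rows of df_A and looking up the matching ID in df_B, but I would like to avoid a for loop because this will run on data with tens of millions of rows. Is there some way to use apply, but conditioned on matching IDs?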

If ID is not the index, make it the index first:

df_B.set_index('ID', inplace=True)
df_A.set_index('ID', inplace=True)

df_B['dist'] = ((df_A - df_B) ** 2).sum(1) ** .5

Because pandas aligns the two frames on both index and columns, the plain arithmetic is enough.
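To make the alignment behaviour concrete, here is a minimal, self-contained sketch. The sample values are borrowed from the answers in this thread, the column names x and y come from the question, and the example assumes both frames contain exactly the same set of IDs.

```python
import numpy as np
import pandas as pd

# Sample data borrowed from the answers in this thread; row order differs on purpose.
df_A = pd.DataFrame({"ID": ["a", "c", "b"], "x": [0, 3, 2], "y": [0, 2, 5]})
df_B = pd.DataFrame({"ID": ["c", "a", "b"], "x": [3, 2, 1], "y": [5, 1, 2]})

# Non-mutating equivalent of set_index(..., inplace=True).
df_A = df_A.set_index("ID")
df_B = df_B.set_index("ID")

# Subtraction aligns on the index (ID) and on the columns (x, y),
# so each row of df_B is paired with the df_A row that has the same ID.
diff = df_A - df_B

# Row-wise Euclidean norm; the assignment aligns on the index once more,
# so df_B keeps its original row order.
df_B["dist"] = (diff ** 2).sum(axis=1) ** 0.5
print(df_B)
```

No explicit lookup is needed because label alignment does the matching. If some IDs could be missing from one frame, that case would need separate handling, which this sketch does not attempt.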

A solution using NumPy:

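For performance, you may want to work with NumPy arrays. For Euclidean distances between corresponding rows, np.einsum can do the computation very efficiently.

Including the step that reorders the rows so that they line up, here is an implementation: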

import numpy as np

# Positions that would sort df_A's index
sidx = df_A.index.argsort()
# For each ID in df_B, the row position of the same ID in df_A
idx = sidx[df_A.index.searchsorted(df_B.index, sorter=sidx)]

# Reorder A's rows to match B, then take the elementwise differences
s = df_A.values[idx] - df_B.values

# einsum squares and sums each row; sqrt then gives the distances
df_B['dist'] = np.sqrt(np.einsum('ij,ij->i', s, s))

Sample input and output:

In [121]: df_A
Out[121]: 
   0  1
a  0  0
c  3  2
b  2  5

In [122]: df_B
Out[122]: 
   0  1
c  3  5
a  2  1
b  1  2

In [124]: df_B  # After code run
Out[124]: 
   0  1      dist
c  3  5  3.000000
a  2  1  2.236068
b  1  2  3.162278
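The argsort/searchsorted step is the least obvious part of this answer, so here is a small walk-through on the sample IDs alone. It uses plain NumPy arrays in place of the DataFrame indexes (the names ids_A and ids_B are illustrative), and it assumes every ID in B also occurs in A, since searchsorted does not check for that.

```python
import numpy as np

# IDs in the order they appear in each frame (from the sample above).
ids_A = np.array(["a", "c", "b"])
ids_B = np.array(["c", "a", "b"])

# Positions that would sort ids_A.
sidx = ids_A.argsort()

# searchsorted with sorter=sidx locates each ids_B entry inside the sorted
# view of ids_A; indexing sidx maps those positions back to rows of ids_A.
pos_in_sorted = np.searchsorted(ids_A, ids_B, sorter=sidx)
idx = sidx[pos_in_sorted]

print(idx)         # row of A matching each row of B
print(ids_A[idx])  # now in the same order as ids_B
```

Once idx is known, indexing A's values with it puts A's rows in B's order, which is what allows the plain array subtraction.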

Here is a comparison of einsum with a few other methods.
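The comparison referred to above is not reproduced on this page. As a rough sketch of what such a comparison might cover, the following computes row-wise norms in three equivalent ways on random stand-in data. It only checks that the results agree; which variant is fastest depends on array shape and NumPy version, so it is worth timing them yourself (for example with timeit).

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.standard_normal((1000, 2))  # stand-in for the aligned row differences

# einsum: multiply s by itself elementwise and sum over the column axis
d_einsum = np.sqrt(np.einsum("ij,ij->i", s, s))

# explicit square-and-sum
d_sum = np.sqrt((s ** 2).sum(axis=1))

# built-in row-wise 2-norm
d_norm = np.linalg.norm(s, axis=1)

print(np.allclose(d_einsum, d_sum), np.allclose(d_einsum, d_norm))
```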

Nice solution! Did any of the posted solutions work for you?

In [73]: A
Out[73]:
    x  y
ID
a   0  0
c   3  2
b   2  5

In [74]: B
Out[74]:
    x  y
ID
a   2  1
c   3  5
b   1  2

In [75]: from sklearn.metrics.pairwise import paired_distances

In [76]: B['dist'] = paired_distances(B, A)

In [77]: B
Out[77]:
    x  y      dist
ID
a   2  1  2.236068
c   3  5  3.000000
b   1  2  3.162278
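Note that paired_distances pairs rows by position, not by label, so the call above relies on A and B listing the IDs in the same order. When the order differs, as the question allows, one option is to reorder A with reindex first. The sketch below assumes every ID in B exists in A and computes the distance with NumPy to stay dependency-free; the reindexed frame could equally be passed to paired_distances.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame({"x": [0, 3, 2], "y": [0, 2, 5]},
                 index=pd.Index(["a", "c", "b"], name="ID"))
B = pd.DataFrame({"x": [3, 2, 1], "y": [5, 1, 2]},
                 index=pd.Index(["c", "a", "b"], name="ID"))

# Put A's rows in the same ID order as B (IDs missing from A would become NaN rows).
A_aligned = A.reindex(B.index)

# Positional row-wise distance is now safe because the labels line up.
B["dist"] = np.linalg.norm(B[["x", "y"]].to_numpy() - A_aligned.to_numpy(), axis=1)
print(B)
```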
