Python: Performing a Cartesian product (CROSS JOIN) with pandas


python pandas numpy dataframe merge


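The content of this post was originally meant to be part of another post, but because of the nature and size of the content needed to fully do justice to this topic, it has been moved to its own Q&A.

Given two simple DataFrames: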

left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})

left

  col1  col2
0    A     1
1    B     2
2    C     3

right

  col1  col2
0    X    20
1    Y    30
2    Z    50

The cross product of these frames can be computed, and it looks like this:

A       1      X      20
A       1      Y      30
A       1      Z      50
B       2      X      20
B       2      Y      30
B       2      Z      50
C       3      X      20
C       3      Y      30
C       3      Z      50

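What is the most performant way to compute this result?

Let's start by establishing a baseline. The simplest way to solve this is to use a temporary "key" column.

This works by assigning both DataFrames a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".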
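A minimal sketch of the temporary-key approach described above. The function name cartesian_product_basic is taken from the later snippets that call it; the body is an assumption reconstructed from the description, and drop(columns=...) is used so that it also runs on current pandas versions.

```python
import pandas as pd

left = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
right = pd.DataFrame({'col1': ['X', 'Y', 'Z'], 'col2': [20, 30, 50]})

def cartesian_product_basic(left, right):
    # Give both frames the same constant "key", join on it (many-to-many),
    # then drop the helper column again.
    return (left.assign(key=1)
                .merge(right.assign(key=1), on='key')
                .drop(columns='key'))

result = cartesian_product_basic(left, right)
print(result)
```

Because both frames share the column names col1 and col2, merge applies its default _x and _y suffixes to the output columns.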

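While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively poor performance on larger data.

A faster implementation requires NumPy. Several well-known implementations exist, and we can build on some of these performant solutions to get the output we want. My favourite, however, is @senderle's first implementation.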

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)
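To see why the function above works, it can help to look at what np.ix_ returns. The following is a small sketch with made-up input arrays a and b (not from the original post): np.ix_ produces "open mesh" arrays whose shapes broadcast against each other, so assigning them into the last axis of a pre-allocated grid fills in every combination.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20])

# np.ix_ reshapes each 1-D input so that it varies along its own axis only.
ga, gb = np.ix_(a, b)
print(ga.shape, gb.shape)   # shapes of the two open mesh arrays

# Same steps as cartesian_product, written out for two inputs.
grid = np.empty((len(a), len(b), 2), dtype=np.result_type(a, b))
grid[..., 0] = ga   # broadcast down the second axis
grid[..., 1] = gb   # broadcast down the first axis
pairs = grid.reshape(-1, 2)
print(pairs)
```

Each row of pairs is one combination, with the last input varying fastest.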

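Generalizing: CROSS JOIN on unique or non-unique indexed DataFrames

Disclaimer
These solutions are optimized for DataFrames with non-mixed scalar dtypes. If you are dealing with mixed dtypes, use them at your own risk!

This trick works on any kind of DataFrame. We compute the Cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use it to reindex the DataFrames, and stack the results column-wise: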

def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:,0]], right.values[idx[:,1]]]))

cartesian_product_generalized(left, right)

   0  1  2   3
0  A  1  X  20
1  A  1  Y  30
2  A  1  Z  50
3  B  2  X  20
4  B  2  Y  30
5  B  2  Z  50
6  C  3  X  20
7  C  3  Y  30
8  C  3  Z  50

np.array_equal(cartesian_product_generalized(left, right),
               cartesian_product_basic(left, right))
True
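The result above carries integer column labels because it is rebuilt from raw NumPy values. The sketch below shows one way to keep the original names. It is a hypothetical variant, not the author's function: the name cartesian_product_named and the suffixes parameter are assumptions, and it builds the row positions with np.repeat and np.tile instead of cartesian_product so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
right = pd.DataFrame({'col1': ['X', 'Y', 'Z'], 'col2': [20, 30, 50]})

def cartesian_product_named(left, right, suffixes=('_x', '_y')):
    # Hypothetical helper: same positional-index idea, but with named columns.
    la, lb = len(left), len(right)
    ia = np.repeat(np.arange(la), lb)   # 0,0,0,1,1,1,... (each left row lb times)
    ib = np.tile(np.arange(lb), la)     # 0,1,2,0,1,2,... (right rows cycled la times)
    # Every column gets a suffix here, whether or not the names overlap.
    columns = ([f'{c}{suffixes[0]}' for c in left.columns] +
               [f'{c}{suffixes[1]}' for c in right.columns])
    return pd.DataFrame(
        np.column_stack([left.values[ia], right.values[ib]]),
        columns=columns)

out = cartesian_product_named(left, right)
print(out)
```

As with the other NumPy-based versions, mixed dtypes end up as object columns, so the disclaimer above still applies.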

The same holds for DataFrames with non-unique indices:

left2 = left.copy()
left2.index = ['s1', 's2', 's1']

right2 = right.copy()
right2.index = ['x', 'y', 'y']
    

left2
   col1  col2
s1    A     1
s2    B     2
s1    C     3

right2
  col1  col2
x    X    20
y    Y    30
y    Z    50

np.array_equal(cartesian_product_generalized(left2, right2),
               cartesian_product_basic(left2, right2))
True

This solution generalizes to multiple DataFrames. For example,

def cartesian_product_multi(*dfs):
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:,i]] for i,df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()

   0  1  2   3  4  5
0  A  1  X  20  A  1
1  A  1  X  20  B  2
2  A  1  X  20  C  3
3  A  1  Y  30  A  1
4  A  1  Y  30  B  2

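Further simplification

When dealing with just two DataFrames, a simpler solution that does not involve @senderle's cartesian_product is possible. Using np.broadcast_arrays, we can achieve almost the same level of performance.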

def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])

    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))

np.array_equal(cartesian_product_simplified(left, right),
               cartesian_product_basic(left2, right2))
True
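The index arithmetic inside cartesian_product_simplified can be inspected in isolation. This is a small sketch with assumed sizes la = 3 and lb = 2 (chosen only for illustration): np.ogrid yields one column vector and one row vector, and np.broadcast_arrays expands both to the full grid shape before they are flattened into row positions.

```python
import numpy as np

la, lb = 3, 2

# Open grid: `rows` has shape (3, 1), `cols` has shape (1, 2).
rows, cols = np.ogrid[:la, :lb]

# Broadcast both to the common shape (3, 2).
ia2, ib2 = np.broadcast_arrays(rows, cols)

left_pos = ia2.ravel()    # position into `left` for each output row
right_pos = ib2.ravel()   # position into `right` for each output row
print(left_pos)
print(right_pos)
```

Indexing left.values with left_pos and right.values with right_pos then lines up every left row with every right row.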

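Performance comparison

These solutions were benchmarked on some contrived DataFrames with unique indices.

Note that timings may vary based on your setup, your data, and your choice of cartesian_product helper function, where applicable.

Performance benchmarking code
This is the timing script. All functions called here are defined above.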

from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
       index=['cartesian_product_basic', 'cartesian_product_generalized', 
              'cartesian_product_multi', 'cartesian_product_simplified'],
       columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        # print(f,c)
        left2 = pd.concat([left] * c, ignore_index=True)
        right2 = pd.concat([right] * c, ignore_index=True)
        stmt = '{}(left2, right2)'.format(f)
        setp = 'from __main__ import left2, right2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()


Another approach uses itertools.product and rebuilds the values in a DataFrame:

import itertools
l=list(itertools.product(left.values.tolist(),right.values.tolist()))
pd.DataFrame(list(map(lambda x : sum(x,[]),l)))
   0  1  2   3
0  A  1  X  20
1  A  1  Y  30
2  A  1  Z  50
3  B  2  X  20
4  B  2  Y  30
5  B  2  Z  50
6  C  3  X  20
7  C  3  Y  30
8  C  3  Z  50

Here is an approach that uses three concat calls:

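# Repeat `left` so equal rows are adjacent, cycle `right`, then join side by side.
# `axis` is passed by keyword; pandas 2.0 no longer accepts it positionally.
m = pd.concat([pd.concat([left]*len(right)).sort_index().reset_index(drop=True),
       pd.concat([right]*len(left)).reset_index(drop=True)], axis=1)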

    col1  col2 col1  col2
0     A     1    X    20
1     A     1    Y    30
2     A     1    Z    50
3     B     2    X    20
4     B     2    Y    30
5     B     2    Z    50
6     C     3    X    20
7     C     3    Y    30
8     C     3    Z    50

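Would you mind sharing your input on GitHub as well? I think adding a cross join to pandas would be a good way to match all of the join functions in SQL.

Why do the column names become integers? When I try to rename them, .rename() runs, but the integers remain.

@CameronTaylor Did you forget to call rename with the axis=1 argument?

Nope, I was being even more dense: I had put quotes around the integers. Thank you.

One more question. I am using cartesian_product_simplified, and I (predictably) run out of memory when I try to join a 50K-row df to a 30K-row df. Any tips on getting around the memory issue?

@CameronTaylor Do the other Cartesian product functions also throw a memory error? I think you could use a Cartesian product function here.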
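On the two points raised in the comments: pandas 1.2 and later accept how='cross' in merge, and one way to limit peak memory is to produce the result in pieces. The sketch below combines both. The helper name cross_join_chunks and its chunk_rows parameter are hypothetical, and the approach assumes each chunk can be processed or written out before the next one is built.

```python
import pandas as pd

left = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3]})
right = pd.DataFrame({'col1': ['X', 'Y', 'Z'], 'col2': [20, 30, 50]})

def cross_join_chunks(left, right, chunk_rows=1000):
    # Hypothetical helper: cross join a slice of `left` at a time, so only
    # one chunk of the result has to exist in memory at once.
    for start in range(0, len(left), chunk_rows):
        yield left.iloc[start:start + chunk_rows].merge(right, how='cross')

# Collected into a list here only to inspect the pieces; in practice each
# chunk would be consumed (aggregated, filtered, written to disk) in turn.
chunks = list(cross_join_chunks(left, right, chunk_rows=2))
full = left.merge(right, how='cross')
print([len(c) for c in chunks], len(full))
```

Chunking does not shrink the total size of the result (50K by 30K rows is still 1.5 billion rows), so it only helps when the chunks are reduced or stored as they are produced.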
