Python: create a MultiIndex from a Cartesian product, but "expand" multiple levels in the same way

python pandas

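I want to create a pandas MultiIndex from a Cartesian product in which one level is "special": it is associated with an arbitrary number of additional levels, and I want those additional levels to be "expanded" in the same way as the special level. The end result is easier to show than to describe.

The code below shows such a case: I want to build a MultiIndex from the Cartesian product of id and loc, but expand color and shape in the same way as id. The example shows two different approaches. For this contrived case both are adequate, but in my real use case the DataFrame will have more than 10 million rows, and neither approach meets my performance requirements. What is the best way to create such a MultiIndex?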

import pandas as pd
import numpy as np

id = np.asarray([1,2,3,4,5])
color= np.asarray(['red','blue','green','orange','purple'])
shape = np.asarray(['square','circle','triangle','rectangle','oval'])
loc = np.asarray(['CA','OR'])

idx = pd.MultiIndex.from_product([id,loc], names=['ID','LOC'])
data = np.ravel(np.random.rand(5,2))

# Approach 1
df1 = pd.DataFrame(data, index=idx)
# idx.codes[0] holds, for every row, the integer position of its ID in idx.levels[0]
df1['color'] = color[idx.codes[0]]
df1['shape'] = shape[idx.codes[0]]
df1.set_index(['color','shape'],append=True,inplace=True)
print(df1)

# Approach 2 
idx2 = pd.MultiIndex.from_arrays(
    [id[idx.codes[0]], loc[idx.codes[1]], color[idx.codes[0]], shape[idx.codes[0]]],
    names=['ID', 'LOC', 'color', 'shape']
)
df2 = pd.DataFrame(data, index=idx2)
print(df2)
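
One caveat worth checking before relying on either approach: the integer codes of a MultiIndex built with from_product point into the index's levels, which pandas stores in sorted order, not into the original input array. The two coincide in the example above only because id is already sorted and unique. The sketch below uses a hypothetical unsorted ids array (not part of the question) to illustrate the difference, and shows positions built with np.repeat as one possible alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical unsorted ids, chosen only to illustrate the caveat
ids = np.asarray([3, 1, 2])
color = np.asarray(['red', 'blue', 'green'])  # color[k] belongs to ids[k]
loc = np.asarray(['CA', 'OR'])

idx = pd.MultiIndex.from_product([ids, loc], names=['ID', 'LOC'])

# Codes are positions in idx.levels[0] (sorted), not positions in ids
level_codes = np.asarray(idx.codes[0])

# Positions in the original ids array: each one repeated len(loc) times
positions = np.repeat(np.arange(len(ids)), len(loc))

wrong = color[level_codes]  # pairs id 3 with 'green' instead of 'red'
right = color[positions]    # keeps each color next to its own id
```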

One option is to build the tuples directly and pass them to pd.MultiIndex.from_tuples:

midx = pd.MultiIndex.from_tuples(
    [(id[i], l, color[i], shape[i])
     for i in range(len(id)) for l in loc],
    names=['ID', 'LOC', 'color', 'shape']
)

df3 = pd.DataFrame(data, midx)

df3

                                0
ID LOC color  shape              
1  CA  red    square     0.583714
   OR  red    square     0.038577
2  CA  blue   circle     0.879020
   OR  blue   circle     0.542611
3  CA  green  triangle   0.185523
   OR  green  triangle   0.289909
4  CA  orange rectangle  0.788596
   OR  orange rectangle  0.915843
5  CA  purple oval       0.701603
   OR  purple oval       0.726648

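# Alternative: NumPy indexing plus pd.MultiIndex.from_arrays
# i and j hold the row position in id and in loc for every cell of the product
i, j = np.indices((len(id), len(loc)))
# Note: stacking id with the string arrays coerces the ID values to strings
a = np.column_stack([
    np.column_stack([id, color, shape])[i.ravel()],
    loc[j.ravel()]
])[:, [0, -1, 1, 2]]  # reorder the columns to ID, LOC, color, shape

# from_arrays expects one array per level, so pass the columns of a (a.T)
midx = pd.MultiIndex.from_arrays(a.T.tolist(), names=['ID', 'LOC', 'color', 'shape'])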

df4 = pd.DataFrame(data, midx)

df4

                                0
ID LOC color  shape              
1  CA  red    square     0.583714
   OR  red    square     0.038577
2  CA  blue   circle     0.879020
   OR  blue   circle     0.542611
3  CA  green  triangle   0.185523
   OR  green  triangle   0.289909
4  CA  orange rectangle  0.788596
   OR  orange rectangle  0.915843
5  CA  purple oval       0.701603
   OR  purple oval       0.726648
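
For the large case described in the question, a further possibility is to skip both the Python-level loop and the string stacking by computing integer positions with np.repeat and np.tile. The sketch below is an illustration, not a benchmarked solution: the name ids (used instead of id to avoid shadowing the builtin) and the variables id_pos, loc_pos, and midx_direct are introduced here for the example. The second construction passes levels and codes straight to the pd.MultiIndex constructor, which assumes that the values in every level are unique and leaves the levels unsorted.

```python
import numpy as np
import pandas as pd

ids = np.asarray([1, 2, 3, 4, 5])
color = np.asarray(['red', 'blue', 'green', 'orange', 'purple'])
shape = np.asarray(['square', 'circle', 'triangle', 'rectangle', 'oval'])
loc = np.asarray(['CA', 'OR'])

n_id, n_loc = len(ids), len(loc)

# Row positions for the Cartesian product of ids and loc
id_pos = np.repeat(np.arange(n_id), n_loc)  # 0, 0, 1, 1, 2, 2, ...
loc_pos = np.tile(np.arange(n_loc), n_id)   # 0, 1, 0, 1, 0, 1, ...

# Variant A: expand every array with the positions, then let pandas factorize
midx = pd.MultiIndex.from_arrays(
    [ids[id_pos], loc[loc_pos], color[id_pos], shape[id_pos]],
    names=['ID', 'LOC', 'color', 'shape'],
)

# Variant B: hand levels and codes to the constructor directly.
# Assumption: each level contains unique values (true for this example).
midx_direct = pd.MultiIndex(
    levels=[ids, loc, color, shape],
    codes=[id_pos, loc_pos, id_pos, id_pos],
    names=['ID', 'LOC', 'color', 'shape'],
)

data = np.random.rand(n_id * n_loc)
df5 = pd.DataFrame(data, index=midx)
```

Variant A keeps ID as integers, unlike the column_stack approach. Whether either variant is fast enough at more than 10 million rows would need to be measured on the real data.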