Python: Widening a DataFrame using multiple source columns


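Following on from a related question, is it possible to perform a similar "widening" operation in pandas when each "entity" has multiple source columns?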

If my data currently looks like this:

Box,Code,Category
Green,1221,Active
Green,8391,Inactive
Red,3709,Inactive
Red,2911,Pending
Blue,9820,Active
Blue,4530,Active

what is the most efficient way to get to:

Box,Code0,Category0,Code1,Category1
Green,1221,Active,8391,Inactive
Red,3709,Inactive,2911,Pending
Blue,9820,Active,4530,Active

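So far, the only "working" solution I have been able to put together follows the example on the linked page: build two separate DataFrames, one grouped by Box with the Code values and the other grouped by Box with the Category values, and then merge the two on Box.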

a = get_clip.groupby('Box')['Code'].apply(list)
b = get_clip.groupby('Box')['Category'].apply(list)
broadeneda = pd.DataFrame(a.values.tolist(), index = a.index).add_prefix('Code').reset_index()
broadenedb = pd.DataFrame(b.values.tolist(), index = b.index).add_prefix('Category').reset_index()
merged = pd.merge(broadeneda, broadenedb, on='Box', how = 'inner')

Is there a way to achieve this without widening each column separately and merging at the end?

Use groupby with cumcount, then unstack:

df1=df.assign(n=df.groupby('Box').cumcount()).set_index(['Box','n']).unstack(1)
df1.columns=df1.columns.map('{0[0]}{0[1]}'.format) 
df1
Out[141]: 
       Code0  Code1 Category0 Category1
Box                                    
Blue    9820   4530    Active    Active
Green   1221   8391    Active  Inactive
Red     3709   2911  Inactive   Pending
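
The same approach as a self-contained sketch, broken into steps. The DataFrame is rebuilt from the sample data in the question. The explicit column reordering and the final reindex are additions that are not part of the answer above; they are only there to reproduce the interleaved column layout (Code0, Category0, Code1, Category1) and the Box order shown in the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Box": ["Green", "Green", "Red", "Red", "Blue", "Blue"],
    "Code": [1221, 8391, 3709, 2911, 9820, 4530],
    "Category": ["Active", "Inactive", "Inactive", "Pending", "Active", "Active"],
})
value_cols = ["Code", "Category"]

# Number the rows within each Box: 0 for its first row, 1 for its second, ...
n = df.groupby("Box").cumcount()

# Put (Box, n) in the index, then pivot the n level into the columns.
wide = df.set_index(["Box", n]).unstack()

# Addition (not in the answer): interleave the columns by row number.
order = pd.MultiIndex.from_tuples(
    [(col, i) for i in range(n.max() + 1) for col in value_cols]
)
wide = wide.reindex(columns=order)

# Flatten the (name, number) column tuples into labels such as "Code0".
wide.columns = [f"{col}{i}" for col, i in wide.columns]

# Addition (not in the answer): restore the Box order of the input.
wide = wide.reindex(pd.Index(df["Box"].unique(), name="Box")).reset_index()

print(wide)
```

If a Box has fewer rows than the others, the unstack step leaves the corresponding cells as NaN instead of failing.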

Option 1
Using set_index, pipe, and set_axis

df.set_index(['Box', df.groupby('Box').cumcount()]).unstack().pipe(
    lambda d: d.set_axis(d.columns.map('{0[0]}{0[1]}'.format), axis=1)
)

       Code0  Code1 Category0 Category1
Box                                    
Blue    9820   4530    Active    Active
Green   1221   8391    Active  Inactive
Red     3709   2911  Inactive   Pending

Option 2
Using defaultdict

from collections import defaultdict

d = defaultdict(dict)

for a, *b in df.values:
    i = len(d[a]) // len(b)
    c = (f'Code{i}', f'Category{i}')
    d[a].update(dict(zip(c, b)))

pd.DataFrame.from_dict(d, 'index').rename_axis('Box')

       Code0 Category0  Code1 Category1
Box                                    
Blue    9820    Active   4530    Active
Green   1221    Active   8391  Inactive
Red     3709  Inactive   2911   Pending

This can be done by iterating over sub-DataFrames:

cols = ["Box","Code0","Category0","Code1","Category1"]
newdf = pd.DataFrame(columns = cols)    # create an empty dataframe to be filled
for box in pd.unique(df.Box):           # for each color in Box
    subdf = df[df.Box == box]           # get a sub-dataframe
    newrow = subdf.values[0].tolist()   # get its values and then its full first row
    newrow.extend(subdf.values[1].tolist()[1:3])    # add second and third entries of second row
    newdf = pd.concat([newdf, pd.DataFrame(data=[newrow], columns=cols)], axis=0)   # add to new dataframe

print(newdf)

Output:

     Box   Code0 Category0   Code1 Category1
0  Green  1221.0    Active  8391.0  Inactive
0    Red  3709.0  Inactive  2911.0   Pending
0   Blue  9820.0    Active  4530.0    Active

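It seems that rows of the same color appear consecutively and that every color has the same number of rows (two important assumptions). Given that, we can split df into its odd-numbered rows, df[::2], and its even-numbered rows, df[1::2], and then merge the two parts together: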

pd.merge(df[::2], df[1::2], on="Box")

     Box  Code_x Category_x  Code_y Category_y
0  Green    1221     Active    8391   Inactive
1    Red    3709   Inactive    2911    Pending
2   Blue    9820     Active    4530     Active

You can easily rename the columns by reassigning them.
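
One way to avoid the separate renaming step, as a sketch: pd.merge accepts a suffixes argument, so the numbered labels can be produced directly. This assumes exactly two consecutive rows per Box, as in the sample data; the check on group sizes is an added safeguard, not part of the answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Box": ["Green", "Green", "Red", "Red", "Blue", "Blue"],
    "Code": [1221, 8391, 3709, 2911, 9820, 4530],
    "Category": ["Active", "Inactive", "Inactive", "Pending", "Active", "Active"],
})

# Added safeguard: the slicing trick only works with exactly two rows per Box.
if not (df.groupby("Box").size() == 2).all():
    raise ValueError("expected exactly two rows per Box")

first = df.iloc[::2]    # 1st, 3rd, 5th, ... rows
second = df.iloc[1::2]  # 2nd, 4th, 6th, ... rows

# suffixes replaces the default "_x"/"_y" on the overlapping column names.
merged = pd.merge(first, second, on="Box", suffixes=("0", "1"))

print(merged)
```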

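If my starting dataset has more columns, do I adapt this simply by adding more elements to the line

df1.columns=df1.columns.map('{0[0]}{0[1]}'.format)

? As in, if I add another column (say, arbitrarily named "Subject"), would I change the line above to:

df1.columns=df1.columns.map('{0[0]}{0[1]}{0[2]}'.format)

? (Edit: removed comment formatting that did not render correctly.) I experimented with the statement, and I now realize that {0[0]}{0[1]} seem to be references to the variable names rather than positional references.

Will rows of the same color always appear consecutively, and does each color have the same number of rows? Or could data be missing? If not, I will delete my answer.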
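
On the question about extra columns, a small sketch suggests the format string does not need to change. After unstack, every column label is a 2-tuple of (original column name, row number), so {0[0]} picks the name and {0[1]} picks the number no matter how many value columns there are. The Subject values below are made up for illustration.

```python
import pandas as pd

# Sample data from the question plus a hypothetical "Subject" column.
df = pd.DataFrame({
    "Box": ["Green", "Green", "Red", "Red", "Blue", "Blue"],
    "Code": [1221, 8391, 3709, 2911, 9820, 4530],
    "Category": ["Active", "Inactive", "Inactive", "Pending", "Active", "Active"],
    "Subject": ["Math", "Art", "Math", "Music", "Art", "Music"],  # made-up values
})

n = df.groupby("Box").cumcount()
wide = df.set_index(["Box", n]).unstack()

# Column labels are still 2-tuples such as ("Subject", 1).
tuple_lengths = {len(label) for label in wide.columns}

# The unchanged two-field format string handles the third column as well.
wide.columns = wide.columns.map("{0[0]}{0[1]}".format)

print(tuple_lengths)
print(wide)
```

A third field such as {0[2]} would only be needed if the columns had three levels, which is not the case here.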