Python: splitting a DataFrame by grouping on multiple columns
I have a DataFrame that I want to split into several DataFrames according to the unique values found across multiple columns. I can already do this based on a single column with the following code:
df_list = [d for _, d in df.groupby(['a'])]
I can then perform the operation I want like this:
for df in df_list:
    df["e"] = df.apply(lambda x: df.loc[x.name + 1:, "c"].mean(), axis=1)
Output:
df_list
[       a      b  c  d    e
 2  black   grey  0  0  2.0
 5  black  brown  2  8  NaN,
        a    b  c  d    e
 1  brown  red  4  5  NaN,
        a     b  c  d    e
 4  green  blue  0  3  NaN,
      a      b  c  d    e
 0  red  green  1  2  5.0
 3  red   blue  6  1  4.0
 6  red   grey  4  6  NaN]
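For reference, the single-column split can be written as a self-contained sketch. It assumes the sample data given in the answer further down, and it stores the pieces in a dictionary named `groups` (a name chosen here for illustration) instead of a list. Taking a `.copy()` of each piece is an addition that avoids assigning a new column to a view of the original frame.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["red", "green", 1, 2],
        ["brown", "red", 4, 5],
        ["black", "grey", 0, 0],
        ["red", "blue", 6, 1],
        ["green", "blue", 0, 3],
        ["black", "brown", 2, 8],
        ["red", "grey", 4, 6],
    ],
    columns=["a", "b", "c", "d"],
)

# One independent copy per unique value of column a, keyed by that value.
groups = {key: part.copy() for key, part in df.groupby("a")}

for part in groups.values():
    # For each row, average c over the rows whose index label is larger.
    part["e"] = part.apply(
        lambda row: part.loc[row.name + 1 :, "c"].mean(), axis=1
    )

print(groups["red"])
```

Keying by the group value makes it possible to look up one piece directly, for example `groups["red"]`, without relying on the order of the list.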
But how can I do this across multiple columns?
Expected result for the value "red":
You can extract the unique values of columns a and b and use each one as a filter. For example:
import pandas as pd

df = pd.DataFrame(
    [
        ["red", "green", 1, 2],
        ["brown", "red", 4, 5],
        ["black", "grey", 0, 0],
        ["red", "blue", 6, 1],
        ["green", "blue", 0, 3],
        ["black", "brown", 2, 8],
        ["red", "grey", 4, 6],
    ],
    columns=["a", "b", "c", "d"],
)

# Collect every distinct value that appears in column a or column b.
colors = pd.unique(df[['a', 'b']].values.ravel('K'))

>>> colors
array(['red', 'brown', 'black', 'green', 'grey', 'blue'], dtype=object)
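Note that `ravel('K')` reads the values in the order they occur in memory, so the order of `colors` depends on how pandas happens to store the two columns. As a hedged alternative (not part of the original answer), the sketch below fixes the order explicitly by stacking column a on top of column b before taking the unique values.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["red", "green", 1, 2],
        ["brown", "red", 4, 5],
        ["black", "grey", 0, 0],
        ["red", "blue", 6, 1],
        ["green", "blue", 0, 3],
        ["black", "brown", 2, 8],
        ["red", "grey", 4, 6],
    ],
    columns=["a", "b", "c", "d"],
)

# All of column a first, then all of column b; unique() keeps the first
# occurrence of each value in that order.
colors = pd.concat([df["a"], df["b"]]).unique()

print(list(colors))
```

With this data, "red" comes first because it is the first value of column a, so `df_list[0]` in the loop that follows would still be the "red" frame.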
Iterate over each color and, after filtering, perform the operation on the resulting current_df:
df_list = []
for color in colors:
    # Keep the rows where the color appears in column a or b, renumbered from 0.
    current_df = df[(df.a == color) | (df.b == color)].copy().reset_index(drop=True)
    current_df["e"] = current_df.apply(
        lambda x: (
            # Rows below the current one: c where the color is in a, d where it is in b.
            current_df[(current_df.a == color)].loc[x.name + 1 :, "c"].sum()
            + current_df[(current_df.b == color)].loc[x.name + 1 :, "d"].sum()
        )
        / (current_df.shape[0] - x.name - 1),
        axis=1,
    )
    df_list.append(current_df)
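(current_df.shape[0] - x.name - 1) is the number of values being summed, because x.name is the row number (after reset_index) and current_df.shape[0] is the total number of rows in the currently filtered df. This is equivalent to: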
df_list = []
for color in colors:
    # Same filter as before, but the original index labels are kept.
    current_df = df[(df.a == color) | (df.b == color)].copy()
    current_df["e"] = current_df.apply(
        lambda x: (
            # Sum of the values below the current row.
            current_df[(current_df.a == color)].loc[x.name + 1 :, "c"].sum()
            + current_df[(current_df.b == color)].loc[x.name + 1 :, "d"].sum()
        )
        / (
            # Number of values that went into that sum.
            current_df[(current_df.a == color)].loc[x.name + 1 :, "c"].size
            + current_df[(current_df.b == color)].loc[x.name + 1 :, "d"].size
        ),
        axis=1,
    )
    df_list.append(current_df)
Result for red:
>>> df_list[0]
       a      b  c  d    e
0    red  green  1  2  5.0
1  brown    red  4  5  5.0
3    red   blue  6  1  4.0
6    red   grey  4  6  NaN
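The same per-color computation can also be sketched without a row-wise `apply`, using a reversed cumulative sum. This is an alternative formulation, not the answer's own method. The helper name `split_by_color` is hypothetical, and the sketch assumes a color never appears in both a and b of the same row (in that case it would use only c, whereas the loop above would count both c and d).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ["red", "green", 1, 2],
        ["brown", "red", 4, 5],
        ["black", "grey", 0, 0],
        ["red", "blue", 6, 1],
        ["green", "blue", 0, 3],
        ["black", "brown", 2, 8],
        ["red", "grey", 4, 6],
    ],
    columns=["a", "b", "c", "d"],
)


def split_by_color(frame, color):
    """Hypothetical helper: rows containing `color` in a or b, plus column e."""
    sub = frame[(frame["a"] == color) | (frame["b"] == color)].copy()
    # Use c when the color is in column a, otherwise d (assumes a != b per row).
    value = np.where(sub["a"] == color, sub["c"], sub["d"]).astype(float)
    # Sum of the values strictly below each row:
    # reversed cumulative sum, minus the row's own value.
    following_sum = value[::-1].cumsum()[::-1] - value
    # Number of rows below each row; 0 becomes NaN so the last row yields NaN.
    following_count = np.arange(len(sub) - 1, -1, -1, dtype=float)
    following_count[following_count == 0] = np.nan
    sub["e"] = following_sum / following_count
    return sub


red_df = split_by_color(df, "red")
print(red_df)
```

Calling the helper once per entry of `colors` would rebuild `df_list`; the original index labels are kept, as in the second version of the loop.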
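Great! But there is one problem: it uses the values in column "c" throughout, whereas in one row "red" appears in column b, whose value is in column "d", so the result for the first row should be based on 5+4+6, not 4+6+4.
You're right. Let me correct my answer.
Check it now, @charlesalakiss.
Great! Thank you very much!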