Python: getting the corresponding value in a groupby
I have a dataset like this:
Serial   A    B
1       12
1       31
1
1       12
1       31  203
1       10
1        2
2       32  100
2       32  242
2        3
3        2
3       23  100
3
3       23
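To try the snippets on this page, the sample data can be rebuilt as a DataFrame. This is only a sketch: it assumes the blank cells in the table are missing values (NaN) and that pandas and NumPy are installed. The variable name df matches the one used in the rest of the post.

```python
import numpy as np
import pandas as pd

# Sample data from the question; blank cells are assumed to be NaN
df = pd.DataFrame({
    "Serial": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "A": [12, 31, np.nan, 12, 31, 10, 2, 32, 32, 3, 2, 23, np.nan, 23],
    "B": [np.nan, np.nan, np.nan, np.nan, 203, np.nan, np.nan,
          100, 242, np.nan, np.nan, 100, np.nan, np.nan],
})
print(df)
```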
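I group the DataFrame by Serial and find the maximum of column A in each group with df['A_MAX'] = df.groupby('Serial')['A'].transform('max').values, then keep only the first value per group with df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '').
Now, for a B_corresponding column, I want the B value that corresponds to A_MAX. I thought of looking up the index of the maximum A, but a group can contain several rows with the same maximum A. As an additional condition, for example in Serial 2, I want the smallest B value among the rows where A is 32.

The idea is to sort with DataFrame.sort_values so that the rows with the largest A come first in each group, then remove rows with a missing B using DataFrame.dropna and keep the first row per Serial with DataFrame.drop_duplicates. Create a Series with DataFrame.set_index and finally use Series.map.
Missing values can be converted to empty strings, but the columns then hold mixed values, numeric and string, so later processing may be problematic: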
df['A_MAX'] = df.groupby('Serial')['A'].transform('max')
df['A_MAX'] = df['A_MAX'].mask(df['Serial'].duplicated(), '')
# largest A first within each Serial; B ascending so ties on A keep the smallest B
s = (df.sort_values(['Serial', 'A', 'B'], ascending=[True, False, True])
       .dropna(subset=['B'])
       .drop_duplicates('Serial')
       .set_index('Serial')['B'])
df['B_corresponding'] = df['Serial'].map(s).mask(df['Serial'].duplicated(), '')
print(df)
    Serial     A      B A_MAX B_corresponding
0        1  12.0    NaN    31             203
1        1  31.0    NaN
2        1   NaN    NaN
3        1  12.0    NaN
4        1  31.0  203.0
5        1  10.0    NaN
6        1   2.0    NaN
7        2  32.0  100.0    32             100
8        2  32.0  242.0
9        2   3.0    NaN
10       3   2.0    NaN    23             100
11       3  23.0  100.0
12       3   NaN    NaN
13       3  23.0    NaN
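Because the empty strings turn A_MAX and B_corresponding into mixed object columns, a variant that keeps NaN in the repeated rows may be easier to process further. The sketch below is an alternative, not the sort-based method shown earlier: it filters the rows whose A equals the group maximum and takes the smallest B among them. The names first_in_group, a_max and b_at_max are illustrative, and the sample data is rebuilt on the assumption that blank cells are NaN. One difference to be aware of: if every row holding the maximum A in a group has a missing B, this version yields NaN for that group instead of falling back to a row with a lower A.

```python
import numpy as np
import pandas as pd

# Sample data from the question; blank cells are assumed to be NaN
df = pd.DataFrame({
    "Serial": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "A": [12, 31, np.nan, 12, 31, 10, 2, 32, 32, 3, 2, 23, np.nan, 23],
    "B": [np.nan, np.nan, np.nan, np.nan, 203, np.nan, np.nan,
          100, 242, np.nan, np.nan, 100, np.nan, np.nan],
})

# True only on the first row of each Serial
first_in_group = ~df["Serial"].duplicated()

# Group maximum of A, broadcast to every row of the group
a_max = df.groupby("Serial")["A"].transform("max")

# Smallest non-missing B among the rows whose A equals the group maximum
b_at_max = df.loc[df["A"].eq(a_max)].groupby("Serial")["B"].min()

# Keep the values on the first row of each group and NaN elsewhere,
# so both columns stay float64 instead of becoming mixed object columns
df["A_MAX"] = a_max.where(first_in_group)
df["B_corresponding"] = df["Serial"].map(b_at_max).where(first_in_group)
print(df)
```

With numeric columns, the blanks can still be produced at display time, for example with df.fillna('') just before printing.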
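If you would rather not rely on pandas alone, you can also use dictionaries to get the same result:
# smallest B for each (Serial, A) pair, so equal A values in different groups stay separate
serial_a_to_b_mapping = df.groupby(['Serial', 'A'])['B'].min().to_dict()
# largest A for each Serial
series_to_a_mapping = df.groupby('Serial')['A'].max().to_dict()
agg_rows = []
for series, a in series_to_a_mapping.items():
    agg_rows.append((series, a, serial_a_to_b_mapping.get((series, a))))
agg_df = pd.DataFrame(agg_rows, columns=['Series', 'A_max', 'B_corresponding'])
agg_df.head()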
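If you like, you can join this back to the original DataFrame and mask the repeated rows:
# agg_df is keyed by its 'Series' column, which holds the Serial values
dft = df.join(agg_df.set_index('Series'), on='Serial', how='left')
# blank every row except the first one of each Serial
repeated = dft['Serial'].duplicated()
dft['A_max'] = dft['A_max'].mask(repeated, '')
dft['B_corresponding'] = dft['B_corresponding'].mask(repeated, '')
dft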
For the sample data, agg_df.head() gives:
Series A_max B_corresponding
0 1 31.0 203.0
1 2 32.0 100.0
2 3 23.0 100.0