Python 3.x pandas groupby: frequency calculation over subgroups, inserting new rows and rearranging columns
I need some help performing a few operations on subgroups, and I am really confused. I will try to describe the operations and the desired output briefly, with comments: (1) compute the percentage frequency of occurrences within each subgroup; (2) show records that do not exist, filled with 0; (3) rearrange the order of records and columns. Assume the following df as the raw data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
                   'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories', 'clothes', 'bags', 'bags', 'clothes', 'clothes']})
The grouped_df below is close to what I have in mind, but I cannot get the desired output:
grouped_df=df.groupby(['store', 'branch', 'products']).size().unstack('products').replace({np.nan:0})
# output:
products      accessories  bags  clothes  shoes
store branch
1     A               0.0   0.0      1.0    1.0
      C               0.0   0.0      1.0    0.0
2     C               1.0   0.0      1.0    1.0
3     A               0.0   2.0      1.0    0.0
      C               0.0   0.0      1.0    0.0
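A side note on the float values in that table: they appear because unstack first introduces NaN for missing combinations, which forces a float dtype before the replace. A minimal sketch of two ways to get the same counts as integers, assuming nothing beyond the df from the question (the names counts and same_counts are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
                   'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                                'clothes', 'bags', 'bags', 'clothes', 'clothes']})

# Fill missing (store, branch, product) combinations with 0 while unstacking,
# so no NaN is ever created and the counts stay integers.
counts = (
    df.groupby(['store', 'branch', 'products'])
      .size()
      .unstack('products', fill_value=0)
)

# pd.crosstab builds the same contingency table in one call.
same_counts = pd.crosstab([df['store'], df['branch']], df['products'])

print(counts)
```

Either form can stand in for the size/unstack/replace chain above; which one reads better is a matter of taste.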
# desired output, if (1), (2) and (3) are all applied:
products      clothes  shoes  accessories  bags
store branch
1     B             0      0            0     0   # store 1 has 3 records in total across branches A and C, so each single occurrence becomes 33.3%
      A          33.3   33.3            0     0
      C          33.3    0.0            0     0
2     B             0      0            0     0
      A             0      0            0     0
      C          33.3   33.3         33.3     0
3     B             0      0            0     0   # store 3 has 4 records in total across branches A and C, so the 2 bags become 50% and each single occurrence 25%
      A            25      0            0    50
      C            25      0            0     0
# (3) columns rearranged so that "clothes" and "shoes" come first
# (3)+(2) branch B appears, and the order of branches changes to B, A, C
# (1) percentages of the occurrences are computed within each store, as the comments above try to explain
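I tried to handle each group separately, but (i) it does not account for the replaced NaN values, and (ii) I should avoid processing each group on its own, because I will need to concatenate many groups later (this df is only an example) in order to plot all of them together: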
grouped_df.loc[[1]].transform(lambda x: x*100/sum(x)).round(0)
>>>
products      accessories  bags  clothes  shoes
store branch
1     A               NaN   NaN     50.0  100.0   # why has it transformed on axis='columns'?
      C               NaN   NaN     50.0    0.0
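On the question in the comment above: DataFrame.transform calls the function once per column by default, so sum(x) is the total of a single column of the store-1 block, not the total of the block. That is why clothes becomes 50/50, shoes becomes 100/0, and the all-zero columns become NaN (0 divided by 0). A small sketch of the difference, assuming the same df; the names block, per_column and per_store are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
                   'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                                'clothes', 'bags', 'bags', 'clothes', 'clothes']})

grouped_df = (
    df.groupby(['store', 'branch', 'products'])
      .size()
      .unstack('products', fill_value=0)
)

block = grouped_df.loc[[1]]   # rows of store 1 only

# transform passes one column at a time, so x.sum() is a column total.
per_column = block.transform(lambda x: x * 100 / x.sum())

# Dividing by the total of the whole block gives each cell's share of store 1.
per_store = (block * 100 / block.to_numpy().sum()).round(1)

print(per_column)
print(per_store)
```

This only handles one store at a time, so it explains the behaviour rather than solving the concatenation concern raised in (ii).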
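I hope my question makes sense. Many thanks in advance for any insight into what I am trying to do.
The day before I posted this answer, with the help of the people who tried to help with this problem, I managed to find a solution.
To explain the last part of the calculation: I transform each element by dividing it by the total count of its group, which gives each element's frequency within its level-0 group (the store) rather than a row, column or overall frequency.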
grouped_df = (
    df.groupby(['store', 'branch', 'products']).size()
    .unstack('branch')
    .reindex(['B', 'C', 'A'], axis=1, fill_value=0)   # add the missing branch B and set the branch order
    .stack('branch')
    .unstack('products')
    .replace({np.nan: 0})
    # divide every count by the number of records in its store (index level 0)
    .transform(lambda x: x * 100 / df.groupby(['store']).size())
    .round(1)
    .reindex(['clothes', 'shoes', 'accessories', 'bags'], axis='columns')
)
Running the snippet above produces the desired output (columns shown in the order set by the final reindex):
products      clothes  shoes  accessories  bags
store branch
1     B           0.0    0.0          0.0   0.0
      C          33.3    0.0          0.0   0.0
      A          33.3   33.3          0.0   0.0
2     B           0.0    0.0          0.0   0.0
      C          33.3   33.3         33.3   0.0
3     B           0.0    0.0          0.0   0.0
      C          25.0    0.0          0.0   0.0
      A          25.0    0.0          0.0  50.0
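The division step can also be written without a lambda. The sketch below, assuming the same df, uses DataFrame.div with level='store' so that the per-store totals are broadcast across the (store, branch) rows explicitly; the names counts, store_sizes and pct are only illustrative, and this is one possible spelling rather than a correction of the code above:

```python
import pandas as pd

df = pd.DataFrame({'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
                   'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                                'clothes', 'bags', 'bags', 'clothes', 'clothes']})

counts = (
    df.groupby(['store', 'branch', 'products'])
      .size()
      .unstack('products', fill_value=0)
)

# Number of records per store: 3 for store 1, 3 for store 2, 4 for store 3.
store_sizes = df.groupby('store').size()

# axis=0 aligns on the row index; level='store' matches the totals against
# the 'store' level of the (store, branch) MultiIndex.
pct = counts.div(store_sizes, axis=0, level='store').mul(100).round(1)

print(pct)
```

This sketch does not add branch B or reorder anything; it only isolates the percentage calculation.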
I still believe that unstacking and restacking several times is not the most Pythonic way to get the object into the desired format, and I welcome any other answer with more elegant and sophisticated code. To give an idea, I thought of astype('category') and reindex(level='branch'), for example, but I am not yet proficient enough with categorical indexes.
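Following up on the categorical idea, here is a rough sketch of how it might look, not a tested replacement for the answer above. It assumes the branch order B, A, C from the question's desired output and that branch B is known in advance; branch_order, product_order, tidy, counts and pct are illustrative names. Grouping on categorical columns with observed=False keeps every declared category, so the missing rows appear with a count of 0 and no unstack/stack round trip is needed:

```python
import pandas as pd

df = pd.DataFrame({'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
                   'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                                'clothes', 'bags', 'bags', 'clothes', 'clothes']})

# Assumed orders, taken from the desired output in the question.
branch_order = ['B', 'A', 'C']
product_order = ['clothes', 'shoes', 'accessories', 'bags']

# Declare the full set of categories, including the unobserved branch B.
tidy = df.assign(
    branch=pd.Categorical(df['branch'], categories=branch_order, ordered=True),
    products=pd.Categorical(df['products'], categories=product_order, ordered=True),
)

# observed=False keeps unobserved category combinations, counted as 0.
counts = (
    tidy.groupby(['store', 'branch', 'products'], observed=False)
        .size()
        .unstack('products')
)

# Total number of records per store, repeated on every (store, branch) row.
store_totals = counts.sum(axis=1).groupby(level='store').transform('sum')

pct = counts.div(store_totals, axis=0).mul(100).round(1)
print(pct)
```

Unlike the stack-based version, this keeps the all-zero row for store 2, branch A, which is what the desired output in the question shows.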