Python 3.x pandas groupby: frequency calculation over subgroups, inserting new rows, and rearranging columns

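I need some help performing a few operations on subgroups, and I am really confused. I will try to describe the operations and the desired output quickly, using comments.

(1) Calculate the percentage frequency of occurrence within each subgroup

(2) Show records that do not exist, filled with 0

(3) Rearrange the order of the records and columns

Assume the following df as the raw data: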

import numpy as np
import pandas as pd

df=pd.DataFrame({'store':[1,1,1,2,2,2,3,3,3,3],
                 'branch':['A','A','C','C','C','C','A','A','C','A'],
                 'products':['clothes', 'shoes', 'clothes', 'shoes', 'accessories', 'clothes', 'bags', 'bags', 'clothes', 'clothes']})

The grouped_df below is close to what I have in mind, but I cannot get the desired output:

grouped_df=df.groupby(['store', 'branch', 'products']).size().unstack('products').replace({np.nan:0})

# output:
products      accessories  bags  clothes  shoes
store branch                                   
1     A               0.0   0.0      1.0    1.0
      C               0.0   0.0      1.0    0.0
2     C               1.0   0.0      1.0    1.0
3     A               0.0   2.0      1.0    0.0
      C               0.0   0.0      1.0    0.0

# desirable output: if (1), (2) and (3) take place somehow...
products      clothes  shoes  accessories  bags
store branch                                   
1     B             0      0            0     0  # store 1 has 3 records in total (A: 1 clothes, 1 shoes; C: 1 clothes), so each count of 1 becomes 33.3%
      A          33.3   33.3            0     0
      C          33.3    0.0            0     0
2     B             0      0            0     0
      A             0      0            0     0
      C          33.3   33.3         33.3     0
3     B             0      0            0     0  # store 3 has 4 records in total (A: 2 bags, 1 clothes; C: 1 clothes), so the 2 bags become 50% and each single item 25%
      A            25      0            0    50
      C            25      0            0     0
# (3) columns are rearranged so that "clothes" and "shoes" come first
# (3)+(2) branch B has appeared and the order of the branches has changed to B, A, C
# (1) percentages of the occurrences are calculated within each store, as described in the comments above

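I tried handling each group separately, but (i) it does not take the replaced NaN values into account, and (ii) I should avoid processing each group individually, because I will need to concatenate many groups later (this df is only an example) in order to plot them all together.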

grouped_df.loc[[1]].transform(lambda x: x*100/sum(x)).round(0)
>>>
products      accessories  bags  clothes  shoes
store branch                                   
1     A               NaN   NaN     50.0  100.0  #why has it transformed on axis='columns'?
      C               NaN   NaN     50.0    0.0
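As a side note on the output above, here is a minimal sketch of why the result looks column-wise: `DataFrame.transform` hands each column to the function as a Series, so `sum(x)` is the column total, not the store total. The names `counts`, `store1`, `by_column` and `by_block` are illustrative only and do not come from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
    'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                 'clothes', 'bags', 'bags', 'clothes', 'clothes'],
})

# Count table: rows are (store, branch), columns are products, missing combinations are 0.
counts = (
    df.groupby(['store', 'branch', 'products']).size()
    .unstack('products', fill_value=0)
)

store1 = counts.loc[[1]]  # only the rows of store 1

# transform calls the function once per column, so each column is divided by
# its own total; columns that sum to 0 end up as 0/0 = NaN.
by_column = store1.transform(lambda col: col * 100 / col.sum())

# Dividing by the total of the whole block gives the per-store percentage instead.
by_block = store1 * 100 / store1.to_numpy().sum()

print(by_column.round(1))
print(by_block.round(1))
```

This also accounts for the NaN columns: for store 1, "accessories" and "bags" sum to zero, so the column-wise division is 0/0.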

I hope my question makes sense. Many thanks in advance for any insight into what I am trying to do.

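With the help of the people who tried to solve this problem, I managed to find a solution the day before posting this answer.

To explain the last part of the computation: I transformed each element by dividing it by the total count of its store (index level 0), which gives each element's frequency within its level-0 group rather than a row, column, or grand-total frequency.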

grouped_df = (
    df.groupby(['store', 'branch', 'products']).size()
    .unstack('branch')
    # add the missing branch B as a column of zeros and set the branch order
    .reindex(['B', 'C', 'A'], axis=1, fill_value=0)
    .stack('branch')
    .unstack('products')
    .replace({np.nan: 0})
    # divide each count by the number of records in its store, as a percentage
    .transform(lambda x: x * 100 / df.groupby(['store']).size())
    .round(1)
    .reindex(['clothes', 'shoes', 'accessories', 'bags'], axis='columns')
)

Running the snippet above produces the desired output:

products      accessories  bags  clothes  shoes
store branch                                   
1     B               0.0   0.0      0.0    0.0
      C               0.0   0.0     33.3    0.0
      A               0.0   0.0     33.3   33.3
2     B               0.0   0.0      0.0    0.0
      C              33.3   0.0     33.3   33.3
3     B               0.0   0.0      0.0    0.0
      C               0.0   0.0     25.0    0.0
      A               0.0  50.0     25.0    0.0
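The per-store normalisation step can also be sketched in isolation. The version below is only one possible way to write it, not the code from the answer: it assumes `DataFrame.div` with `level='store'` is an acceptable substitute for the `transform` lambda, and the names `counts`, `store_totals` and `pct` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
    'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                 'clothes', 'bags', 'bags', 'clothes', 'clothes'],
})

# Counts per (store, branch) row and product column, with 0 for missing combinations.
counts = (
    df.groupby(['store', 'branch', 'products']).size()
    .unstack('products', fill_value=0)
)

# Number of records per store: the denominator for every row of that store.
store_totals = df.groupby('store').size()

# axis=0 aligns the totals with the row index; level='store' broadcasts each
# total to all (store, branch) rows that share the same store.
pct = counts.div(store_totals, axis=0, level='store').mul(100)

print(pct.round(1))
```

Making the `level` explicit documents which index level the denominator belongs to, instead of relying on implicit alignment inside a lambda.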

I still believe that unstacking and restacking several times is not the most Pythonic way to get the object into the desired shape, and I welcome any other answers with more elegant and sophisticated code. To give an idea, I thought of astype('category') and reindex(level='branch'), for example, but I am not yet proficient enough with categorical indexes.
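A rough sketch of the categorical idea mentioned above, offered as one possible approach and not as a tested replacement for the answer. It assumes the full lists of branches and products are known in advance; `branch_order`, `product_order`, `cat_df` and `result` are hypothetical names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
    'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                 'clothes', 'bags', 'bags', 'clothes', 'clothes'],
})

# Assumed: the complete, ordered lists of categories, including the unseen branch B.
branch_order = ['B', 'A', 'C']
product_order = ['clothes', 'shoes', 'accessories', 'bags']

cat_df = df.assign(
    branch=pd.Categorical(df['branch'], categories=branch_order),
    products=pd.Categorical(df['products'], categories=product_order),
)

# observed=False keeps every category combination, so unseen ones get a count of 0,
# and both rows and columns follow the category order given above.
counts = (
    cat_df.groupby(['store', 'branch', 'products'], observed=False).size()
    .unstack('products')
)

# Percentage of each cell relative to the number of records in its store.
result = (
    counts.div(df.groupby('store').size(), axis=0, level='store')
    .mul(100)
    .round(1)
)

print(result)
```

With this layout there is a single unstack, and branches with no records in a store (such as A in store 2) appear as rows of zeros.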