Pythonic/pandas way to create a function for groupby


python python-3.x pandas


I'm fairly new to programming and I'm looking for a more Pythonic way to implement some code. Here is some dummy data:

import numpy as np
import pandas as pd

# Note: pd.util.testing was deprecated in pandas 1.0 and is not available in
# pandas 2.x, so the 'Customer' line needs an older pandas to run as written.
df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B'], 10000),
    'Sub-Category': np.random.choice(['X', 'Y', 'Z'], 10000),
    'Sub-Category-2': np.random.choice(['G', 'F', 'I'], 10000),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=10000),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2018', freq='D'), 10000),
})

I have a lot of transactional data like this, and I perform various groupbys on it. My current solution is to build a master groupby like this:

master = df.groupby(['Customer','Category','Sub-Category','Product',pd.Grouper(key='Date',freq='A')])['Units_Sold'].sum()\
.unstack()

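From there, I run various groupbys with .groupby(level=) to aggregate the information the way I need. I usually produce a summary at each level. In addition, I create subtotals at each level using some variation of the code below: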

y = master.groupby(level=[0,1,2]).sum()
y.index = pd.MultiIndex.from_arrays([
    y.index.get_level_values(0),
    y.index.get_level_values(1),
    y.index.get_level_values(2) + ' Total',
    len(y.index)*['']
])

y1 = master.groupby(level=[0,1]).sum()
y1.index = pd.MultiIndex.from_arrays([
    y1.index.get_level_values(0),
    y1.index.get_level_values(1)+ ' Total',
    len(y1.index)*[''],
    len(y1.index)*['']
])

y2 = master.groupby(level=[0]).sum()
y2.index = pd.MultiIndex.from_arrays([
    y2.index.get_level_values(0)+ ' Total',
    len(y2.index)*[''],
    len(y2.index)*[''],
    len(y2.index)*['']
])

(pd.concat([master, y, y1, y2])
   .sort_index()
   .assign(Diff=lambda x: x.iloc[:, -1] - x.iloc[:, -2])
   # Diff is now the last column, so -2 and -3 are the two most recent years
   .assign(Diff_Perc=lambda x: (x.iloc[:, -2] / x.iloc[:, -3]) - 1)
   .dropna(how='all'))
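The three near-identical subtotal blocks above differ only in how many index levels they keep, so they could be collapsed into a loop. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper name add_subtotals, its suffix parameter, and the tiny hand-made data frame are hypothetical, the index levels are assumed to hold strings, and the year is taken from Date with .dt.year rather than pd.Grouper so the example does not depend on a particular frequency alias.

```python
import pandas as pd


def add_subtotals(master, suffix=' Total'):
    """Append one subtotal row per group for every index level except the last.

    Assumes `master` has a MultiIndex whose levels hold strings.
    """
    n = master.index.nlevels
    pieces = [master]
    for depth in range(n - 1, 0, -1):
        # Aggregate over the first `depth` levels (depth=1 means per customer).
        sub = master.groupby(level=list(range(depth))).sum()
        arrays = [sub.index.get_level_values(i) for i in range(depth)]
        arrays[-1] = arrays[-1] + suffix           # label the deepest kept level
        arrays += [[''] * len(sub)] * (n - depth)  # blank out the remaining levels
        sub.index = pd.MultiIndex.from_arrays(arrays, names=master.index.names)
        pieces.append(sub)
    return pd.concat(pieces).sort_index()


# Tiny hand-made stand-in for the transaction data (hypothetical values).
df = pd.DataFrame({
    'Customer': ['c1', 'c1', 'c1', 'c1', 'c2', 'c2'],
    'Category': ['Group A', 'Group A', 'Group B', 'Group A', 'Group A', 'Group B'],
    'Product': ['Product 1', 'Product 2', 'Product 1',
                'Product 1', 'Product 1', 'Product 2'],
    'Date': pd.to_datetime(['2017-03-01', '2017-06-01', '2018-01-15',
                            '2018-02-01', '2017-07-04', '2018-09-09']),
    'Units_Sold': [10, 20, 30, 40, 50, 60],
})

master = (df.groupby(['Customer', 'Category', 'Product',
                      df['Date'].dt.year.rename('Year')])['Units_Sold']
            .sum()
            .unstack())

result = add_subtotals(master)
print(result)
```

Because the loop derives everything from the number of index levels, the same helper should work unchanged if master is built with three, four, or more grouping keys.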

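This is just one example. I might run the same exercise with the groupby in a different order. For example, next I might want to group by 'Category', 'Product', and then 'Customer', so I would have to do: master.groupby(level=[1,3,0]).sum()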

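Then I would have to repeat the whole exercise for the subtotals above. I also change the time period frequently: it could be a year ending in a specific month, year to date, by quarter, and so on.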
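One way to make the time period a parameter is to filter on Date first and then group by a period derived from it. The sketch below is only an illustration under assumptions: the function name units_by_period and its parameters are hypothetical, the sample data is hand-made, and it uses Series.dt.to_period with the period aliases 'Y', 'Q' and 'M' instead of pd.Grouper.

```python
import pandas as pd


def units_by_period(df, keys, start=None, end=None, freq='Y'):
    """Sum Units_Sold per `keys` and per period.

    `freq` is a pandas period alias: 'Y' yearly, 'Q' quarterly, 'M' monthly.
    `start` and `end` are optional inclusive date bounds.
    """
    mask = pd.Series(True, index=df.index)
    if start is not None:
        mask &= df['Date'] >= pd.Timestamp(start)
    if end is not None:
        mask &= df['Date'] <= pd.Timestamp(end)
    data = df[mask]
    period = data['Date'].dt.to_period(freq).rename('Period')
    return data.groupby(list(keys) + [period])['Units_Sold'].sum().unstack()


# Tiny hand-made stand-in for the transaction data (hypothetical values).
df = pd.DataFrame({
    'Customer': ['c1', 'c1', 'c1', 'c1', 'c2', 'c2'],
    'Category': ['Group A', 'Group A', 'Group B', 'Group A', 'Group A', 'Group B'],
    'Date': pd.to_datetime(['2017-03-01', '2017-06-01', '2018-01-15',
                            '2018-02-01', '2017-07-04', '2018-09-09']),
    'Units_Sold': [10, 20, 30, 40, 50, 60],
})

# Quarterly totals per customer, restricted to 2017.
quarterly = units_by_period(df, ['Customer'],
                            start='2017-01-01', end='2017-12-31', freq='Q')

# Yearly totals per customer and category over the whole data set.
yearly = units_by_period(df, ['Customer', 'Category'])

print(quarterly)
print(yearly)
```

A year-to-date view would then be a call with only start set to the first day of the current year.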

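From what I've learned about programming so far (which is obviously minimal!), you should write a function any time you find yourself repeating code.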

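Is there a way to construct a function to which you can pass the groupby levels as well as a time frame, and which also creates the subtotals for each level?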

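Thanks for any guidance on this. It is very much appreciated.

For a DRY-er solution, consider generalizing your current approach into a defined method that filters the original data frame by a date range and runs the aggregations, receiving the groupby levels and the date range (the latter optional) as passed-in parameters: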

Method

def multiple_agg(mylevels, start_date='2016-01-01', end_date='2018-12-31'):

    filter_df = df[df['Date'].between(start_date, end_date)]

    master = (filter_df.groupby(['Customer', 'Category', 'Sub-Category', 'Product', 
                     pd.Grouper(key='Date',freq='A')])['Units_Sold']
                .sum()
                .unstack()
              )

    y = master.groupby(level=mylevels[:-1]).sum()
    y.index = pd.MultiIndex.from_arrays([
        y.index.get_level_values(0),
        y.index.get_level_values(1),
        y.index.get_level_values(2) + ' Total',
        len(y.index)*['']
    ])

    y1 = master.groupby(level=mylevels[0:2]).sum()
    y1.index = pd.MultiIndex.from_arrays([
        y1.index.get_level_values(0),
        y1.index.get_level_values(1)+ ' Total',
        len(y1.index)*[''],
        len(y1.index)*['']
    ])

    y2 = master.groupby(level=mylevels[0]).sum()
    y2.index = pd.MultiIndex.from_arrays([
        y2.index.get_level_values(0)+ ' Total',
        len(y2.index)*[''],
        len(y2.index)*[''],
        len(y2.index)*['']
    ])

    final_df = (pd.concat([master,y,y1,y2])
                         .sort_index()
                         .assign(Diff = lambda x: x.iloc[:,-1] - x.iloc[:,-2])
                         .assign(Diff_Perc = lambda x: (x.iloc[:,-2] / x.iloc[:,-3])- 1)
                         .dropna(how='all')
                         .reorder_levels(mylevels)
                )

    return final_df

Aggregation runs (different levels and date ranges)

Testing (final_df is the OP's pd.concat() output)

I think you can do this using sum with the level parameter:

# Note: DataFrame.sum(level=...) was removed in pandas 2.0;
# master.groupby(level=...).sum() is the equivalent there.
master = (df.groupby(['Customer', 'Category', 'Sub-Category', 'Product',
                      pd.Grouper(key='Date', freq='A')])['Units_Sold']
            .sum()
            .unstack())

s1 = master.sum(level=[0,1,2]).assign(Product='Total').set_index('Product', append=True)

s2 = master.sum(level=[0,1])
# Wanted to use the assign method, but the hyphen in the column name prevents it.
# Also use the Z in front for sorting purposes.
s2['Sub-Category'] = 'ZTotal'
s2['Product'] = ''
s2 = s2.set_index(['Sub-Category', 'Product'], append=True)

s3 = master.sum(level=[0])
s3['Category'] = 'Total'
s3['Sub-Category'] = ''
s3['Product'] = ''
s3 = s3.set_index(['Category', 'Sub-Category', 'Product'], append=True)

master_new = pd.concat([master, s1, s2, s3]).sort_index()
master_new
Output:

Date                                        2016-12-31  2017-12-31  2018-12-31
Customer   Category Sub-Category Product
30XWmt1jm0 Group A  X            Product 1       651.0       341.0       453.0
                                 Product 2       267.0       445.0       117.0
                                 Product 3       186.0       280.0       352.0
                                 Total          1104.0      1066.0       922.0
                    Y            Product 1       426.0       417.0       670.0
                                 Product 2       362.0       210.0       380.0
                                 Product 3       232.0       290.0       430.0
                                 Total          1020.0       917.0      1480.0
                    Z            Product 1       196.0       212.0       703.0
                                 Product 2       277.0       340.0       579.0
                                 Product 3       416.0       392.0       259.0
                                 Total           889.0       944.0      1541.0
                    ZTotal                      3013.0      2927.0      3943.0
           Group B  X            Product 1       356.0       230.0       407.0
                                 Product 2       402.0       370.0       590.0
                                 Product 3       262.0       381.0       377.0
                                 Total          1020.0       981.0      1374.0
                    Y            Product 1       575.0       314.0       643.0
                                 Product 2       557.0       375.0       411.0
                                 Product 3       344.0       246.0       280.0
                                 Total          1476.0       935.0      1334.0
                    Z            Product 1       278.0       152.0       392.0
                                 Product 2       149.0       596.0       303.0
                                 Product 3       234.0       505.0       521.0
                                 Total           661.0      1253.0      1216.0
                    ZTotal                      3157.0      3169.0      3924.0
           Total                                6170.0      6096.0      7867.0
3U2anYOD6o Group A  X            Product 1       214.0       443.0       195.0
                                 Product 2       170.0       220.0       423.0
                                 Product 3       111.0       469.0       369.0
...                                                ...         ...         ...
somc22Y2Hi Group B  Z            Total           906.0      1063.0       680.0
                    ZTotal                      3070.0      3751.0      2736.0
           Total                                6435.0      7187.0      6474.0
zRZq6MSKuS Group A  X            Product 1       421.0       182.0       387.0
                                 Product 2       359.0       287.0       331.0
                                 Product 3       232.0       394.0       279.0
                                 Total          1012.0       863.0       997.0
                    Y            Product 1       245.0       366.0       111.0
                                 Product 2       377.0       148.0       239.0
                                 Product 3       372.0       219.0       310.0
                                 Total           994.0       733.0       660.0
                    Z            Product 1       280.0       363.0       354.0
                                 Product 2       384.0       604.0       178.0
                                 Product 3       219.0       462.0       366.0
                                 Total           883.0      1429.0       898.0
                    ZTotal                      2889.0      3025.0      2555.0
           Group B  X            Product 1       466.0       413.0       187.0
                                 Product 2       502.0       370.0       368.0
                                 Product 3       745.0       480.0       318.0
                                 Total          1713.0      1263.0       873.0
                    Y            Product 1       218.0       226.0       385.0
                                 Product 2       123.0       382.0       570.0
                                 Product 3       173.0       572.0       327.0
                                 Total           514.0      1180.0      1282.0
                    Z            Product 1       480.0       317.0       604.0
                                 Product 2       256.0       215.0       572.0
                                 Product 3       463.0        50.0       349.0
                                 Total          1199.0       582.0      1525.0
                    ZTotal                      3426.0      3025.0      3680.0
           Total                                6315.0      6050.0      6235.0

[675 rows x 3 columns]

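Is your y2 really what you meant, or should it be level=[0]?

You're right, and I have edited it to reflect that. Thanks!

Thank you very much for your help! I'm trying to better understand the function. When I run the different aggregations, agg_df1 returns what I expect, including the subtotals. When I run agg_df2 and agg_df3, they don't return what I'd expect: the order is the same as in agg_df1, even though different levels were passed? Also, the subtotals are missing. Finally, I can't seem to get your testing section to work. Thanks so much for your help!

The other two agg_dfs demonstrate how to change the parameters as you indicated. So no, they do not match anything you posted; adjust them to your specific needs. You may need to assign your pd.concat() output to a final_df variable. And be sure to add np.random.seed(###) at the top to reproduce the same random numbers.

Thanks again! Sorry for being unclear. I meant that running multiple_agg(…, start_date='2017-01-01', end_date='2018-12-31') returns the same MultiIndex (Customer, then Category, Sub-Category, then Product). I thought the MultiIndex order would change based on the order in which the levels were passed.

I see. It comes down to the pd.concat at the end of the function. Since master is the first item, everything is appended according to its index order. However, you can use reorder_levels with the mylevels argument. But you now have to pass a list of 4 numbers, where the last number is never used in the aggregation. See the edit.

@Parfait Nice solution.

Thanks, Scott. This is a great solution, and I will continue to use it to create subtotals. I'm hoping to turn it into a function-like solution that I can pass levels and dates to. Any suggestions?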
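As one possible direction for that last request, the subtotal steps could be wrapped in a function that takes the grouping columns by name, in the order the index should follow, plus a date range. This is a hedged sketch rather than either answerer's method: the name subtotal_report, its parameters, and the sample data are hypothetical, it groups by year via .dt.year instead of pd.Grouper, and it uses groupby(level=...).sum() so it does not rely on sum(level=...).

```python
import pandas as pd


def subtotal_report(df, keys, start, end, value='Units_Sold'):
    """Yearly totals of `value` grouped by `keys` (in that order) plus 'Total' rows."""
    keys = list(keys)
    data = df[df['Date'].between(start, end)]
    year = data['Date'].dt.year.rename('Year')
    # Grouping by column names means the index order follows `keys` directly.
    master = data.groupby(keys + [year])[value].sum().unstack()

    n = len(keys)
    pieces = [master]
    for depth in range(n - 1, 0, -1):
        sub = master.groupby(level=keys[:depth]).sum()
        arrays = [sub.index.get_level_values(k) for k in keys[:depth]]
        arrays.append(['Total'] * len(sub))            # label in the next level
        arrays += [[''] * len(sub)] * (n - depth - 1)  # blank deeper levels
        sub.index = pd.MultiIndex.from_arrays(arrays, names=keys)
        pieces.append(sub)

    out = pd.concat(pieces).sort_index()
    years = list(master.columns)
    if len(years) >= 2:
        # Change between the two most recent years in the selected range.
        out['Diff'] = out[years[-1]] - out[years[-2]]
    return out


# Tiny hand-made stand-in for the transaction data (hypothetical values).
df = pd.DataFrame({
    'Customer': ['c1', 'c1', 'c1', 'c1', 'c2', 'c2'],
    'Category': ['Group A', 'Group A', 'Group B', 'Group A', 'Group A', 'Group B'],
    'Product': ['Product 1', 'Product 2', 'Product 1',
                'Product 1', 'Product 1', 'Product 2'],
    'Date': pd.to_datetime(['2017-03-01', '2017-06-01', '2018-01-15',
                            '2018-02-01', '2017-07-04', '2018-09-09']),
    'Units_Sold': [10, 20, 30, 40, 50, 60],
})

report = subtotal_report(df, ['Category', 'Customer', 'Product'],
                         '2017-01-01', '2018-12-31')
print(report)
```

Note that the 'Total' rows are placed by plain lexical sorting, so where they land relative to the detail rows depends on the labels; prefixing the label, as with 'ZTotal' above, is one way to push them to the end.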