Python: creating a groupby apply function from an array of functions

I have a dataset similar to this example:

import pandas as pd

df = pd.DataFrame({
    'Store' : [100, 100, 100, 100, 101, 101, 101, 101],
    'Product' : [5, 3, 10, 1, 3, 11, 2, 5],
    'Category' : ['A', 'B', 'C', 'A', 'B', 'A', 'C', 'A'],
    'Sales' : [100, 235, 120, 56, 789, 230, 300, 35]
})

So it looks like this:

Store   Product Category    Sales
100      5       A           100
100      3       B           235
100      10      C           120
100      1       A           56
101      3       B           789
101      11      A           230
101      2       C           300
101      5       A           35

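Each store has some products, and each product belongs to a category. I need to find the total sales for each store and the percentage of sales that each category contributes within each store. So the result must look like this: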

         total_Sales    Category_A  Category_B  Category_C
Store               
100       511            30.528376   45.988258   23.483366
101       1354           19.571640   58.271787   22.156573

The category columns are percentages (%).

Currently I am doing it like this:

# Total sales per store
df1 = df.groupby(['Store']).apply(lambda x: x['Sales'].sum())
df1 = df1.to_frame()
# Name the column as in the expected output above
df1 = df1.rename(columns={0 : 'total_Sales'})

def category_util(x, col, cat):
    """Return the percentage of the group's sales that belongs to category `cat`."""
    total_sales = x['Sales'].sum()
    cat_sales = x.loc[x[col] == cat, 'Sales'].sum()
    if cat_sales == 0:
        return 0
    else:
        return cat_sales*100/total_sales

# One separate groupby/apply pass per category
df1['Category_A'] = df.groupby(['Store']).apply(lambda x: category_util(x, 'Category', 'A'))
df1['Category_B'] = df.groupby(['Store']).apply(lambda x: category_util(x, 'Category', 'B'))
df1['Category_C'] = df.groupby(['Store']).apply(lambda x: category_util(x, 'Category', 'C'))

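df1 is the desired output. It works fine, but every apply call groups and sorts on the grouping column all over again, which is very time-consuming on a large dataset. I would like to do this in a single function call. I tried something like this: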

df.groupby(['Store']).agg([lambda x: category_util(x, 'Category', 'A'),
                          lambda x: category_util(x, 'Category', 'B'),
                          lambda x: category_util(x, 'Category', 'C')])

But it fails with a KeyError for 'Sales':

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: 'Sales'

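Is there any way to solve this? Is there a way to use apply with an array of lambda functions and compute all the columns in one pass? If that cannot be done with apply, can it be done with agg? It would really save me a lot of time. Thanks in advance.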
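A note on the error, followed by a minimal sketch of the single-pass apply the question asks for. When `agg` is called on a DataFrame groupby with a list of functions, each function receives one column of the group at a time as a Series, so `x['Sales']` is looked up as an index label and raises the KeyError. `apply`, by contrast, receives the whole group as a DataFrame, so one function can return every column at once as a Series. The helper name `store_summary` and its `categories` parameter are hypothetical and not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'Store': [100, 100, 100, 100, 101, 101, 101, 101],
    'Product': [5, 3, 10, 1, 3, 11, 2, 5],
    'Category': ['A', 'B', 'C', 'A', 'B', 'A', 'C', 'A'],
    'Sales': [100, 235, 120, 56, 789, 230, 300, 35],
})

def store_summary(group, categories=('A', 'B', 'C')):
    """Hypothetical helper: total sales and per-category percentages for one store."""
    total = group['Sales'].sum()
    # Sales per category inside this store
    by_cat = group.groupby('Category')['Sales'].sum()
    out = {'total_Sales': total}
    for cat in categories:
        # A category missing from this store counts as 0; guard against a zero total
        out[f'Category_{cat}'] = by_cat.get(cat, 0) * 100 / total if total else 0.0
    return pd.Series(out)

# Selecting the needed columns first keeps the grouping column out of the groups
summary = df.groupby('Store')[['Category', 'Sales']].apply(store_summary)
print(summary)
```

Because the returned Series mixes integers and floats, `total_Sales` comes back as a float column here. This still calls a Python function once per store, so on a large dataset a vectorized reshape is likely to be faster.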

You can use pivot_table and unstack.
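The code for this suggestion did not survive on this page, so the following is only a sketch of how a `pivot_table` based approach might look, not the answerer's original code. It builds the Store by Category table directly with `pivot_table`, so no separate `unstack` step is needed in this variant. The variable names are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    'Store': [100, 100, 100, 100, 101, 101, 101, 101],
    'Product': [5, 3, 10, 1, 3, 11, 2, 5],
    'Category': ['A', 'B', 'C', 'A', 'B', 'A', 'C', 'A'],
    'Sales': [100, 235, 120, 56, 789, 230, 300, 35],
})

# One row per store, one column per category, cells hold summed sales
pivot = df.pivot_table(index='Store', columns='Category', values='Sales',
                       aggfunc='sum', fill_value=0)

# Row sums are the per-store totals
total = pivot.sum(axis=1)

# Divide every row by its own total and scale to percent
result = pivot.div(total, axis=0).mul(100).add_prefix('Category_')
result.insert(0, 'total_Sales', total)
result = result.rename_axis(columns=None)
print(result)
```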

We can use groupby with unstack and then divide by the row sums (the sum along axis=1).
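The code for this answer is also missing from the page. A minimal sketch of the described idea, under the assumption that it groups on both columns, unstacks the category level into columns, and divides each row by its sum, could look like this (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Store': [100, 100, 100, 100, 101, 101, 101, 101],
    'Product': [5, 3, 10, 1, 3, 11, 2, 5],
    'Category': ['A', 'B', 'C', 'A', 'B', 'A', 'C', 'A'],
    'Sales': [100, 235, 120, 56, 789, 230, 300, 35],
})

# Single groupby on both keys, then move Category from the index into the columns
sales = (df.groupby(['Store', 'Category'])['Sales']
           .sum()
           .unstack('Category', fill_value=0))

# Sum along axis=1 gives the per-store total
total = sales.sum(axis=1)

# Divide each row by its total (aligning on the index) to get percentages
pct = sales.div(total, axis=0).mul(100).add_prefix('Category_')

result = pd.concat([total.rename('total_Sales'), pct], axis=1)
print(result)
```

Only one groupby is performed here; the totals and percentages are derived from the reshaped table.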

We can use two relatively cheap groupby operations and pipe in a function that returns a DataFrame containing the totals and the percentages:

# Per-store groupby, used to broadcast each store's total back onto its rows
group1 = df.groupby('Store')

(df.assign(total_sales=group1.Sales.transform('sum'))
 .groupby(['Store', 'Category'])
 # g is the (Store, Category) groupby object; total_sales is constant within a store
 .pipe(lambda g: pd.DataFrame({"res": g.Sales.sum()
                                       .div(g.total_sales.max())
                                       .mul(100),
                               "total_sales": g.total_sales.max()}))
 .set_index('total_sales', append=True)   # keep the total in the index while reshaping
 .unstack('Category')                     # one column per category
 .droplevel(0, axis=1)                    # drop the "res" column level
 .add_prefix('Category_')
 .rename_axis(columns=None)
 .reset_index()
)

   Store  total_sales  Category_A  Category_B  Category_C
0    100          511   30.528376   45.988258   23.483366
1    101         1354   19.571640   58.271787   22.156573

Thank you very much. This is indeed the best solution.
