Python: groupby, filter by row, and return the count

python pandas filter

I have a sample in the following DataFrame:

import pandas as pd

df = pd.DataFrame({
    'cluster': ['A', 'B', 'C', 'A', 'B', 'C', 'D', 'D'],
    'profit': [-1.0, 1.5, 1, 0.5, 3.0, -2, -1, -2]
})

Before outputting to another DataFrame, I perform a series of groupby operations, most of which work as required:

df['cluster_total_profit'] = df.groupby(['cluster'])['profit'].transform('sum')

df['cluster_mean_profit'] = df.groupby(['cluster'])['profit'].transform('mean')

df['occurances'] = df.groupby(['cluster'])['profit'].transform('count')

df['std'] = df.groupby(['cluster'])['profit'].transform('std')

clusters = df[['cluster','cluster_total_profit', 'cluster_mean_profit', 'occurances', 'std']].drop_duplicates().reset_index(drop=True)

The result is:

  cluster  cluster_total_profit  cluster_mean_profit  occurances       std
0       A                  -0.5                -0.25           2   1.06066
1       B                   4.5                 2.25           2   1.06066
2       C                  -1.0                -0.50           2   2.12132
3       D                  -3.0                -1.50           2  0.707107

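The last transformation I am trying to make is to count the number of profitable events in each group and populate df with that count. The output could then be collected in the table above, like this: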

  cluster  cluster_total_profit  cluster_mean_profit  occurances       std  profitable_events
0       A                  -0.5                -0.25           2   1.06066                  1
1       B                   4.5                 2.25           2   1.06066                  2
2       C                  -1.0                -0.50           2   2.12132                  1
3       D                  -3.0                -1.50           2  0.707107                  0

I have looked at related examples, but I can't adapt them to my exact use case. Here is my code:

df['profitable_events'] = df.cluster.map(df.groupby(['cluster']).filter(lambda x: x[x['profit'] > 0.0].count()))

clusters = df[['cluster','cluster_total_profit', 'cluster_mean_profit', 'occurances', 'std', 'profitable_events']].drop_duplicates().reset_index(drop=True)

and:

Both raise the error "TypeError: filter function returned a Series, but expected a scalar bool".

I also tried:

df['profitable_events'] = df.cluster.map(df.groupby(['cluster']).filter(lambda x: len(x[x['profit'] > 0.0].index)))

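which raises the error "TypeError: filter function returned a int, but expected a scalar bool".

I'm sure there is a quick fix, but I'm not sure what it is.

Many thanks in advance.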

You can count the profitable events with a custom function:

df.groupby('cluster')['profit'].agg([
    'sum','mean','count','std',
    ('profitable_event', lambda x: x.gt(0).sum())   
])

Output:

         sum  mean  count       std  profitable_event
cluster                                              
A       -0.5 -0.25      2  1.060660               1.0
B        4.5  2.25      2  1.060660               2.0
C       -1.0 -0.50      2  2.121320               1.0
D       -3.0 -1.50      2  0.707107               0.0
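
The errors in the question come from how `GroupBy.filter` works: its callable must return a single True/False per group, and the result is the rows of the groups that pass, not a count. The sketch below contrasts that with one possible way to put the per-cluster count on every row of df, as the question asks. It is a minimal sketch rather than the answerer's method; the names `groups_with_a_profit` and `is_profitable` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'cluster': ['A', 'B', 'C', 'A', 'B', 'C', 'D', 'D'],
    'profit': [-1.0, 1.5, 1, 0.5, 3.0, -2, -1, -2],
})

# filter keeps or drops whole groups: the callable returns one bool per group.
# Cluster D has no profitable rows, so both of its rows are dropped here.
groups_with_a_profit = df.groupby('cluster').filter(
    lambda g: (g['profit'] > 0).any()
)

# To count instead, turn "profit > 0" into a 0/1 column (hypothetical name),
# group it by cluster and sum it. transform('sum') broadcasts each group's
# total back onto every row of that group.
is_profitable = df['profit'].gt(0).astype(int)
df['profitable_events'] = is_profitable.groupby(df['cluster']).transform('sum')

print(df[['cluster', 'profit', 'profitable_events']])
```

With this column in place, the `drop_duplicates` step from the question can pick up `profitable_events` alongside the other per-cluster columns.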

Why use transform and then drop_duplicates? Isn't that just df.groupby('cluster').agg(['sum', 'mean', 'count', 'std'])?

Ah, this is really useful, but it doesn't quite answer the question. I'm looking for a way to count the rows in each group where profit is greater than 0. If I understand correctly, this method runs sum, std, mean and count over every row in the group, rather than over a filtered group?

Yes, I was just being lazy :-). Hopefully the OP gets the idea.
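
Following the suggestion in the comments, the whole `clusters` table can probably be built in one aggregation, without the `transform` and `drop_duplicates` steps. The sketch below assumes a pandas version that supports keyword-style named aggregation (0.25 or later); the output column names are copied from the question, including its spelling of `occurances`.

```python
import pandas as pd

df = pd.DataFrame({
    'cluster': ['A', 'B', 'C', 'A', 'B', 'C', 'D', 'D'],
    'profit': [-1.0, 1.5, 1, 0.5, 3.0, -2, -1, -2],
})

# One row per cluster; each keyword names an output column and gives the
# aggregation to apply to that cluster's 'profit' values.
clusters = (
    df.groupby('cluster')['profit']
      .agg(
          cluster_total_profit='sum',
          cluster_mean_profit='mean',
          occurances='count',  # spelling kept from the question
          std='std',
          # Count the rows whose profit is strictly positive.
          profitable_events=lambda s: int(s.gt(0).sum()),
      )
      .reset_index()
)

print(clusters)
```

The lambda only counts the positive rows; the other statistics are still computed over all rows in each cluster, which matches the expected output in the question.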