Python pandas: groupby cumulative sum that starts after the first period

I am trying to calculate cumulative revenue by account. Here is some sample data:

import pandas as pd
data = {
    'account_id': ['111','111','111','222','222','333','333','333','666','666'],
    'company': ['initech','initech','initech','jackson steinem & co','jackson steinem & co','ingen','ingen','ingen','enron','enron'],
    'cohort_period': [0,1,2,0,1,0,1,2,0,1],
    'revenue':[3.67,9.95,9.95,193.29,299.95,83.03,499.95,99.95,1.52,19.95]
}
df = pd.DataFrame(data)
Which produces:

In [17]: df
Out[17]:
  account_id  cohort_period               company  revenue
0        111              0               initech     3.67
1        111              1               initech     9.95
2        111              2               initech     9.95
3        222              0  jackson steinem & co   193.29
4        222              1  jackson steinem & co   299.95
5        333              0                 ingen    83.03
6        333              1                 ingen   499.95
7        333              2                 ingen    99.95
8        666              0                 enron     1.52
9        666              1                 enron    19.95
There are plenty of examples of how to do this, and they essentially boil down to:

df['cumulative_revenue'] = df.groupby('account_id')['revenue'].cumsum()
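However, there is a catch: in this data, the revenue for cohort period 0 is prorated, and I don't care about it for the purposes of my analysis. What I need is for the cumulative sum to start at cohort period 1. For example, Initech's cumulative revenue should look like this: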

0    nan
1    9.95
2    19.90
Here is one way:

# check valid cohort_period
valid_cohort = df.cohort_period.ne(0)

# cumulative sum revenue where cohort_period is not equal to zero and mask otherwise as nan
df['cum_revenue'] = valid_cohort.mul(df.revenue).groupby(df.account_id).cumsum().where(valid_cohort)

print(df)
#  account_id  cohort_period               company  revenue  cum_revenue
#0        111              0               initech     3.67          NaN
#1        111              1               initech     9.95         9.95
#2        111              2               initech     9.95        19.90
#3        222              0  jackson steinem & co   193.29          NaN
#4        222              1  jackson steinem & co   299.95       299.95
#5        333              0                 ingen    83.03          NaN
#6        333              1                 ingen   499.95       499.95
#7        333              2                 ingen    99.95       599.90
#8        666              0                 enron     1.52          NaN
#9        666              1                 enron    19.95        19.95
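
To see why the one-liner above works, it can help to split it into its intermediate steps. This is only a sketch on a trimmed-down copy of the sample data; the names `zeroed` and `running` are illustrative and do not appear in the answer itself.

```python
import pandas as pd

# Trimmed-down copy of the sample data (two accounts only)
df = pd.DataFrame({
    'account_id': ['111', '111', '111', '222', '222'],
    'cohort_period': [0, 1, 2, 0, 1],
    'revenue': [3.67, 9.95, 9.95, 193.29, 299.95],
})

# True for every row that should count towards the running total
valid_cohort = df['cohort_period'].ne(0)

# Multiplying by the boolean mask turns period-0 revenue into 0.0
zeroed = valid_cohort.mul(df['revenue'])

# Running total within each account; period-0 rows contribute nothing
running = zeroed.groupby(df['account_id']).cumsum()

# Period-0 rows currently hold a running total of 0.0; blank them out as NaN
df['cum_revenue'] = running.where(valid_cohort)
print(df)
```

The final `where` is what distinguishes "no revenue counted yet" (NaN) from a genuine running total of zero.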

I would create a new column, 'New':

import numpy as np

# copy revenue, then blank out the prorated period-0 rows
df['New']=df.revenue
df.loc[df['cohort_period']==0,'New']=np.nan
df['cumulative_revenue']=df.groupby('account_id')['New'].cumsum()
df
Out[63]: 
  account_id  cohort_period               company  revenue     New  \
0        111              0               initech     3.67     NaN   
1        111              1               initech     9.95    9.95   
2        111              2               initech     9.95    9.95   
3        222              0  jackson steinem & co   193.29     NaN   
4        222              1  jackson steinem & co   299.95  299.95   
5        333              0                 ingen    83.03     NaN   
6        333              1                 ingen   499.95  499.95   
7        333              2                 ingen    99.95   99.95   
8        666              0                 enron     1.52     NaN   
9        666              1                 enron    19.95   19.95   
   cumulative_revenue  
0                 NaN  
1                9.95  
2               19.90  
3                 NaN  
4              299.95  
5                 NaN  
6              499.95  
7              599.90  
8                 NaN  
9               19.95  
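
If the helper column is not wanted in the final frame, a possible variant (not taken from the answer above) is to compute the cumulative sum only over the rows with `cohort_period != 0` and rely on pandas aligning on the index when the result is assigned back. Rows that were filtered out have no matching label and are left as NaN. The sketch below assumes a trimmed-down copy of the sample data.

```python
import pandas as pd

# Trimmed-down copy of the sample data (two accounts only)
df = pd.DataFrame({
    'account_id': ['111', '111', '111', '222', '222'],
    'cohort_period': [0, 1, 2, 0, 1],
    'revenue': [3.67, 9.95, 9.95, 193.29, 299.95],
})

# Keep only the rows that should count, then take the per-account running total
later_periods = df[df['cohort_period'] != 0]
partial = later_periods.groupby('account_id')['revenue'].cumsum()

# Assignment aligns on the index, so the dropped period-0 rows become NaN
df['cumulative_revenue'] = partial
print(df)
```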
Using mask:

df.groupby('account_id').apply(lambda x :x['revenue'].mask(x['cohort_period'].eq(0)).cumsum())
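
The expression above returns its result rather than adding a column to `df`. As a hedged sketch of the same `mask` idea written so that it assigns a column directly, the masking can be done on the whole `revenue` Series first and the grouping applied afterwards, which avoids `apply` altogether. The column name `cumulative_revenue` and the trimmed-down data are assumptions made for illustration.

```python
import pandas as pd

# Trimmed-down copy of the sample data (two accounts only)
df = pd.DataFrame({
    'account_id': ['111', '111', '111', '222', '222'],
    'cohort_period': [0, 1, 2, 0, 1],
    'revenue': [3.67, 9.95, 9.95, 193.29, 299.95],
})

df['cumulative_revenue'] = (
    df['revenue']
    .mask(df['cohort_period'].eq(0))   # period-0 revenue becomes NaN
    .groupby(df['account_id'])         # group the masked Series by account
    .cumsum()                          # NaN rows stay NaN and are skipped in the sum
)
print(df)
```

Because the result keeps the original row index, it lines up with `df` without any reindexing.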