Python pandas - count since last transaction

python pandas dataframe

I have a dataframe of monetary transaction records (call it txn_df). These are the columns that matter for this question:

txn_year    txn_month   custid  withdraw    deposit
2011        4           123     0.0         100.0
2011        5           123     0.0         0.0
2011        6           123     0.0         0.0
2011        7           123     50.1        0.0
2011        8           123     0.0         0.0

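Assume there are multiple customers. A withdraw or deposit of 0.0 means that no transaction took place. What I want is a new column indicating how many months have passed since the last transaction, something like this: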

txn_year    txn_month   custid  withdraw    deposit     num_months_since_last_txn
2011        4           123     0.0         100.0       0
2011        5           123     0.0         0.0         1
2011        6           123     0.0         0.0         2           
2011        7           123     50.1        0.0         3
2011        8           123     0.0         0.0         1

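So far the only thing I have come up with is generating a new column has_txn (1/0 or True/False) that is set when either withdraw or deposit is greater than 0.0, but I can't get any further from there.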

One way to solve this:

df['series'] =  df[['withdraw','deposit']].ne(0).sum(axis=1)
m = df['series']>=1

As @Chris A commented:

m = df[['withdraw','deposit']].gt(0).any(axis=1)  # replacement for the snippet above

df['num_months_since_last_txn'] = df.groupby(m.cumsum()).cumcount()
df.loc[df['num_months_since_last_txn']==0,'num_months_since_last_txn']=(df['num_months_since_last_txn']+1).shift(1).fillna(0)
print(df)

Output:

   txn_year  txn_month  custid  withdraw  deposit
0      2011          4     123       0.0    100.0
1      2011          5     123       0.0      0.0
2      2011          6     123       0.0      0.0
3      2011          7     123      50.1      0.0
4      2011          8     123       0.0      0.0
   txn_year  txn_month  custid  withdraw  deposit  num_months_since_last_txn
0      2011          4     123       0.0    100.0                        0.0
1      2011          5     123       0.0      0.0                        1.0
2      2011          6     123       0.0      0.0                        2.0
3      2011          7     123      50.1      0.0                        3.0
4      2011          8     123       0.0      0.0                        1.0

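Explanation:

1. To find out whether a transaction happened or not, use ne and sum to get a binary value per row.

2. Each row where the transaction flag is 1 starts a new group; use groupby, cumsum and cumcount to create the sequence 0, 1, 2 ... n within each group.

3. Use .loc to overwrite the 0 on each transaction row with the previous row's count plus 1.

Note: I may have made this more complicated than it needs to be, but it should give you an idea of how to approach the problem.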
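The steps above can be sketched with the intermediate values made visible. This is only an illustration on the sample data from the question: the names block, run_count and steps are made up for this sketch, and Series.where is used in place of the .loc assignment so that the result stays an integer column.

```python
import pandas as pd

# Sample data copied from the question (single customer).
df = pd.DataFrame({
    "txn_year": [2011] * 5,
    "txn_month": [4, 5, 6, 7, 8],
    "custid": [123] * 5,
    "withdraw": [0.0, 0.0, 0.0, 50.1, 0.0],
    "deposit": [100.0, 0.0, 0.0, 0.0, 0.0],
})

# Step 1: True on rows where a withdrawal or a deposit happened.
m = df[["withdraw", "deposit"]].gt(0).any(axis=1)

# Step 2: every transaction starts a new block; count rows inside each block.
block = m.cumsum()
run_count = df.groupby(block).cumcount()

# Step 3: on the first row of a block the count restarts at 0, so use the
# previous row's count plus 1 there instead (0 for the very first row).
prev_plus_one = (run_count + 1).shift(1).fillna(0).astype(int)
result = run_count.where(run_count != 0, prev_plus_one)

steps = pd.DataFrame(
    {"m": m, "block": block, "run_count": run_count, "result": result}
)
print(steps)
```

Printing steps side by side makes it easier to see why the transaction row in month 7 ends up with 3 rather than 0.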

Solution that takes the customer ID into account:

df=df.sort_values(by=['custid','txn_month'])
mask=~df.duplicated(subset=['custid'],keep='first')
m = df[['withdraw','deposit']].gt(0).any(axis=1)
df['num_months_since_last_txn'] = df.groupby(m.cumsum()).cumcount()
df.loc[df['num_months_since_last_txn']==0,'num_months_since_last_txn']=(df['num_months_since_last_txn']+1).shift(1)
df.loc[mask,'num_months_since_last_txn']=0

Sample input:

   txn_year  txn_month  custid  withdraw  deposit
0      2011          4     123       0.0    100.0
1      2011          5     123       0.0      0.0
2      2011          4    1245       0.0    100.0
3      2011          5    1245       0.0      0.0
4      2011          6     123       0.0      0.0
5      2011          7    1245      50.1      0.0
6      2011          7     123      50.1      0.0
7      2011          8     123       0.0      0.0
8      2011          6    1245       0.0      0.0
9      2011          8    1245       0.0      0.0

Sample output:

   txn_year  txn_month  custid  withdraw  deposit  num_months_since_last_txn
0      2011          4     123       0.0    100.0                        0.0
1      2011          5     123       0.0      0.0                        1.0
4      2011          6     123       0.0      0.0                        2.0
6      2011          7     123      50.1      0.0                        3.0
7      2011          8     123       0.0      0.0                        1.0
2      2011          4    1245       0.0    100.0                        0.0
3      2011          5    1245       0.0      0.0                        1.0
8      2011          6    1245       0.0      0.0                        2.0
5      2011          7    1245      50.1      0.0                        3.0
9      2011          8    1245       0.0      0.0                        1.0

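Explanation of the customer ID version

The code above works on the intervals between consecutive transactions (from one 1 in the mask to the next). To keep the data in that form, sort df by custid and txn_month; txn_year can be added to the sort keys later.

fillna(0) does not work here, because shift does not produce a NaN at the start of the next customer. To reset the count to 0, find each customer's first row with duplicated on custid (keep='first') and set it to 0.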
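As a possible variation, the same idea can be applied inside each customer by grouping on custid, so that no count carries over from one customer to the next (which the version above would do if a customer's first rows had no transaction). This is a sketch, not the answer's own code: the function name months_since_last_txn is made up, txn_year is added to the sort keys, and it assumes one row per customer per month with no missing months.

```python
import pandas as pd

def months_since_last_txn(txn_df):
    """Hypothetical helper: per-customer count of rows since the last transaction."""
    out = txn_df.sort_values(["custid", "txn_year", "txn_month"]).copy()
    has_txn = out[["withdraw", "deposit"]].gt(0).any(axis=1)

    # Block id = number of transactions seen so far for this customer.
    block = has_txn.astype(int).groupby(out["custid"]).cumsum()
    # Position of each row inside its (customer, block) run.
    run = out.groupby([out["custid"], block]).cumcount()
    # Previous row's count plus 1, shifted within each customer only;
    # the first row of every customer has no predecessor and becomes 0.
    prev_plus_one = (
        run.groupby(out["custid"]).shift(1).add(1).fillna(0).astype(int)
    )
    out["num_months_since_last_txn"] = run.where(run != 0, prev_plus_one)
    return out

# Sample input copied from the answer (two interleaved customers).
sample = pd.DataFrame({
    "txn_year": [2011] * 10,
    "txn_month": [4, 5, 4, 5, 6, 7, 7, 8, 6, 8],
    "custid": [123, 123, 1245, 1245, 123, 1245, 123, 123, 1245, 1245],
    "withdraw": [0.0, 0.0, 0.0, 0.0, 0.0, 50.1, 50.1, 0.0, 0.0, 0.0],
    "deposit": [100.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
})
result = months_since_last_txn(sample)
print(result)
```

Because the shift happens within each custid group, the first row of every customer gets NaN and is filled with 0, so no separate duplicated mask is needed.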

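Nice answer. Suppose there is both a deposit and a withdraw in the same month: df['series'] would then be 2, so that row would not be picked up by the boolean mask's logic. You can condense lines 1 and 2 into:

m = df[['withdraw','deposit']].gt(0).any(axis=1)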
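A small check of the point made in this comment, on made-up data. The comment does not show the mask it refers to; an equality test against 1 is assumed here purely for illustration, and the names eq_mask and any_mask are hypothetical.

```python
import pandas as pd

# Made-up rows: the second month has both a withdrawal and a deposit.
df = pd.DataFrame({
    "withdraw": [0.0, 20.0, 0.0],
    "deposit": [100.0, 30.0, 0.0],
})

# Number of non-zero transaction columns per row.
series = df[["withdraw", "deposit"]].ne(0).sum(axis=1)

# Assumed mask: an equality test misses the row where the sum is 2.
eq_mask = series == 1

# The suggested mask: True if any of the two columns is greater than 0.
any_mask = df[["withdraw", "deposit"]].gt(0).any(axis=1)

print(series.tolist())
print(eq_mask.tolist())
print(any_mask.tolist())
```

A mask such as series >= 1 would also catch the second row; the any(axis=1) form simply skips the intermediate column.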

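@Chris A - You are right, thanks for the help. I have updated the solution.

Just curious, does this solution include grouping by custid?

@menorah84 - No, it doesn't; it works on positions from one 1 to the next. However, you can apply the same approach after sorting the dataframe by custid. You then have to replace the fillna call with a reset at the start of each custid.

Sorry for posting so late on this thread, but is there a way to do this in PySpark?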