Python pandas: adding a new aggregated column
I have this DataFrame in pandas:
   day customer  amount
0    1    cust1     500
1    2    cust2     100
2    1    cust1      50
3    2    cust1     100
4    2    cust2     250
5    6    cust1      20
For convenience:
import pandas as pd

df = pd.DataFrame({'day': [1, 2, 1, 2, 2, 6],
                   'customer': ['cust1', 'cust2', 'cust1', 'cust1', 'cust2', 'cust1'],
                   'amount': [500, 100, 50, 100, 250, 20]})
I want to create a new column "amount2days" that adds up each customer's amounts over the past two days, to get the following DataFrame:
   day customer  amount  amount2days
0    1    cust1     500          500   (no past transactions)
1    2    cust2     100          100   (no past transactions)
2    1    cust1      50          550   (500 + 50, rows 0,2)
3    2    cust1     100          650   (500 + 50 + 100, rows 0,2,3)
4    2    cust2     250          350   (100 + 250, rows 1,4)
5    6    cust1      20           20   (notice day is 6, and no day=5 for cust1)
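i.e., for each row I want to add up that customer's amounts from the row's own day and the day before, counting only rows up to and including the current one. What is the most convenient way to do this?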
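To make the target unambiguous, the rule can be sketched as a plain loop. This is only a reference implementation under one reading of the expected table (each row sees itself and the rows above it); the helper name amount_last_two_days is made up for illustration, and the loop is quadratic, so it is not meant for large frames.

```python
import pandas as pd

df = pd.DataFrame({'day': [1, 2, 1, 2, 2, 6],
                   'customer': ['cust1', 'cust2', 'cust1', 'cust1', 'cust2', 'cust1'],
                   'amount': [500, 100, 50, 100, 250, 20]})


def amount_last_two_days(frame):
    """Hypothetical helper: per-row sum over the same customer's rows
    whose day is the current day or the day before."""
    totals = []
    for pos in range(len(frame)):
        row = frame.iloc[pos]
        seen = frame.iloc[:pos + 1]  # the current row and every row above it
        mask = ((seen['customer'] == row['customer'])
                & seen['day'].between(row['day'] - 1, row['day']))
        totals.append(int(seen.loc[mask, 'amount'].sum()))
    return totals


df['amount2days'] = amount_last_two_days(df)
print(df)
```

Any faster solution can be checked against this column.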
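The sum should be taken over days, but the day does not have to increase from one row to the next, as the example shows. I still want the total over the past two days.

You can use pandas' rolling for moving-window operations (depending on your pandas version, resetting the index as in jezrael's answer may be safer).

Use groupby with rolling and sum.

Note: the reset_index call in the following code is a necessary addition to avoid misaligned data: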
df['amount2days'] = (df.groupby('customer')['amount']
                       .rolling(2, min_periods=0)
                       .sum()
                       .reset_index(level=0, drop=True))
print(df)
   day customer  amount  amount2days
1    1    cust1     500        500.0
2    2    cust1     100        600.0
3    3    cust1     250        350.0
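One caveat may be worth spelling out: rolling(2) counts rows, not days. The sketch below applies the same call to the question's six-row frame and compares it with the expected column from the question. The column name rolling2rows and the variable mismatch are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'day': [1, 2, 1, 2, 2, 6],
                   'customer': ['cust1', 'cust2', 'cust1', 'cust1', 'cust2', 'cust1'],
                   'amount': [500, 100, 50, 100, 250, 20]})

# Window of the last 2 ROWS per customer, whatever their 'day' values are.
df['rolling2rows'] = (df.groupby('customer')['amount']
                        .rolling(2, min_periods=0)
                        .sum()
                        .reset_index(level=0, drop=True))

# Expected values copied from the question's target table.
expected = [500, 100, 550, 650, 350, 20]
mismatch = df.index[df['rolling2rows'] != expected].tolist()
print(df)
print(mismatch)
```

Rows 3 and 5 differ: row 3 loses the first day-1 amount because only two rows fit in the window, and row 5 picks up the day-2 amount even though day 6 has no neighbouring day.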
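Why not use .to_numpy() here? Because if the index is not the default one, the output would be assigned to the wrong rows. Check the following sample: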
df = pd.DataFrame({'day': {0: 1, 2: 2, 5: 3, 1: 1, 6: 2, 4: 3},
                   'customer': {0: 'cust2', 2: 'cust2', 5: 'cust2',
                                1: 'cust1', 6: 'cust1', 4: 'cust1'},
                   'amount': {0: 5000, 2: 1000, 5: 2500, 1: 500, 6: 100, 4: 250}})
print(df)
   day customer  amount
0    1    cust2    5000
2    2    cust2    1000
5    3    cust2    2500
1    1    cust1     500
6    2    cust1     100
4    3    cust1     250
EDIT: General solution:
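def f(x):
    N = 1                      # how many days to look back
    x = x.copy()               # avoid writing into a slice of the original frame
    for i in pd.unique(x['day']):
        # all rows of this customer with day in [i - N, i]
        y = x[x['day'].between(i - N, i)]
        # store the window total on the last row of that window
        x.loc[y.index[-1], 'amountNdays'] = y['amount'].sum()
    return x

# group_keys=False keeps the original index instead of adding a 'customer' level
df = df.groupby('customer', group_keys=False).apply(f)
# rows that did not receive a window total fall back to their own amount
df['amountNdays'] = df['amountNdays'].fillna(df['amount'])
print(df)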
   day customer  amount  amountNdays
0    1    cust1     500        500.0
1    2    cust2     100        100.0
2    1    cust1      50        550.0
3    2    cust1     100        650.0
4    2    cust2     250        350.0
5    6    cust1      20         20.0
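For larger frames, the same day-based window could also be computed without a Python loop over days. What follows is a sketch of an alternative technique (cumulative sums plus numpy.searchsorted), not the method from the answer above. The helper name window_sum_by_day and its parameter n_back are hypothetical, and the code assumes every row should receive the running total of its own window, as in the question's target table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'day': [1, 2, 1, 2, 2, 6],
                   'customer': ['cust1', 'cust2', 'cust1', 'cust1', 'cust2', 'cust1'],
                   'amount': [500, 100, 50, 100, 250, 20]})


def window_sum_by_day(group, n_back=1):
    """Hypothetical helper: for each row, sum the amounts of rows at or
    before it whose day lies in [day - n_back, day]."""
    # A stable sort keeps the original order among rows that share a day.
    group = group.sort_values('day', kind='stable')
    days = group['day'].to_numpy()
    amounts = group['amount'].to_numpy()
    # csum[k] holds the total of the first k amounts.
    csum = np.concatenate(([0], np.cumsum(amounts)))
    # First position whose day is >= day - n_back.
    start = np.searchsorted(days, days - n_back, side='left')
    end = np.arange(1, len(days) + 1)  # window ends at the current row
    return pd.Series(csum[end] - csum[start], index=group.index)


parts = [window_sum_by_day(g) for _, g in df.groupby('customer')]
# Assigning a Series aligns on the index, so rows end up in the right place.
df['amountNdays'] = pd.concat(parts)
print(df)
```

Looping over the groups explicitly sidesteps the version-dependent details of groupby.apply, at the cost of one concat at the end.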
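Comments:

Does this answer your question?

Thanks, I updated my question to make it clearer. My data doesn't necessarily increase "day" in every row, but I still want to sum 2 days back. Would a simple rolling work in this case?

It still doesn't work when the earlier rows have nothing to sum. I edited my example again (the result should be just 20, but this method gives 120). Thanks.

@jezarel by "there" I mean "amount2days" in the last row. Maybe there is a way to easily generalize this to amount2days?

@user112112 - added a general solution.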
Both variants applied to the non-default-index sample above:
# positional assignment: the values land on the wrong rows
df['amount2days'] = (df.groupby('customer', sort=False).amount
                       .rolling(2, min_periods=0)
                       .sum()
                       .to_numpy())
# index-aligned assignment: the values land on the matching rows
df['amount2days1'] = (df.groupby('customer')['amount']
                        .rolling(2, min_periods=0)
                        .sum()
                        .reset_index(level=0, drop=True))
print(df)
   day customer  amount  amount2days  amount2days1
0    1    cust2    5000        500.0        5000.0
2    2    cust2    1000        600.0        6000.0
5    3    cust2    2500        350.0        3500.0
1    1    cust1     500       5000.0         500.0
6    2    cust1     100       6000.0         600.0
4    3    cust1     250       3500.0         350.0
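The difference comes down to how column assignment works: a Series is matched to rows by index label, while a NumPy array is written in row order. A minimal sketch on a made-up three-row frame (the names totals, by_label and by_position are illustrative only):

```python
import pandas as pd

# Frame whose index is deliberately not in 0, 1, 2 order.
df = pd.DataFrame({'amount': [5000, 500, 1000]}, index=[0, 2, 1])

# Values labelled 0, 1, 2, as a grouped result might come back.
totals = pd.Series([1, 2, 3], index=[0, 1, 2])

df['by_label'] = totals                # aligned on index labels
df['by_position'] = totals.to_numpy()  # written in row order, labels ignored
print(df)
```

The row labelled 2 receives 3 in by_label but 2 in by_position, which is the same kind of mix-up seen in the amount2days column above.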