Python pandas groupby: expanding mean by column value
I'm new to pandas and a bit lost about what to do here. I have a DataFrame imported from a CSV that (highly simplified) looks like this:
import pandas as pd
date = ['2013-08-10','2013-08-10','2013-08-10','2013-08-10','2013-08-10',
        '2013-08-10','2013-08-10','2013-08-10','2013-08-10','2013-08-10']
event = ['213','213','213','213','214','214','214','215','215','215']
side = ['A','B','B','B','A','B','A','B','A','B']
value = [0.193,0.193,0.092,0.027,0.027,0.058,0.027,0.079,0.193,0.159]
# zip() is lazy in Python 3, so materialise it before building the frame
df = pd.DataFrame(list(zip(event,date,side,value)),
                  columns=['event','date','side','value'])
  event        date side  value
0   213  2013-08-10    A  0.193
1   213  2013-08-10    B  0.193
2   213  2013-08-10    B  0.092
3   213  2013-08-10    B  0.027
4   214  2013-08-10    A  0.027
5   214  2013-08-10    B  0.058
6   214  2013-08-10    A  0.027
7   215  2013-08-10    B  0.079
8   215  2013-08-10    A  0.193
9   215  2013-08-10    B  0.159
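What I want is to sum the values for each side of each event. I did this with groupby:
# select 'value' so the string 'date' column is not summed as well
groupby = df.groupby(['event','side'])[['value']].sum()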
            value
event side
213   A     0.193
      B     0.312
214   A     0.054
      B     0.058
215   A     0.193
      B     0.238
But I would also like to add a new column holding the expanding mean for each side, like this:
            value  roll_mean
event side
213   A     0.193  0
      B     0.312  0
214   A     0.054  0.193
      B     0.058  0.312
215   A     0.193  0.124
      B     0.238  0.185
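Note that each event has two sides, but they are not always A and B. What I want is something like Excel's mean.if function: for each row, the expanding mean of all values for the current side, computed over all preceding rows. Any help with this would be much appreciated.
I think you are actually looking for an expanding mean rather than a rolling mean. An expanding mean takes all previous values into account. I'll start from where you left off: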
In [63]: res = df.groupby(['event','side'])[['value']].sum()
In [64]: res
Out[64]:
            value
event side
213   A     0.193
      B     0.312
214   A     0.054
      B     0.058
215   A     0.193
      B     0.238
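As an aside, the difference between a rolling and an expanding mean is easy to see on a toy series. This is only a minimal sketch: the numbers are made up for illustration, and it uses the method-style `Series.rolling` and `Series.expanding` API found in newer pandas versions.

```python
import pandas as pd

# Toy values (hypothetical, not taken from the question's data).
s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Rolling mean: only the last `window` observations contribute.
rolling = s.rolling(window=2).mean()

# Expanding mean: every observation up to and including the current one contributes.
expanding = s.expanding().mean()

print(pd.DataFrame({'value': s, 'rolling_2': rolling, 'expanding': expanding}))
```

The rolling column forgets old values once they leave the window, while the expanding column keeps averaging over everything seen so far, which is what the question describes.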
Now we group by side and take the expanding mean:
In [65]: res['expanding_mean'] = res.groupby(level='side')['value'].transform(lambda s: s.expanding().mean()).shift(2)
In [66]: res
Out[66]:
            value  expanding_mean
event side
213   A     0.193             NaN
      B     0.312             NaN
214   A     0.054          0.1930
      B     0.058          0.3120
215   A     0.193          0.1235
      B     0.238          0.1850
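Your result needs to be shifted by 2 because you want the mean to cover all previous values but not the current one (make sure this is really what you want; it seems a bit odd). To make this more general when there are more than two sides, you can replace shift(2) with shift(len(res.index.levels[1])).
I added more sides to the DataFrame so that it also works when the sides are not just 'A' or 'B'. Is this what you want?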
import pandas as pd
import numpy as np
date = ['2013-08-10','2013-08-10','2013-08-10','2013-08-10','2013-08-10',
'2013-08-10','2013-08-10','2013-08-10','2013-08-10','2013-08-10']
event = ['213','213','213','213','214','214','214','215','215','215']
side = ['A','B','A','B','C','A','C','A','C','A',]
value = [0.193,0.193,0.092,0.027,0.027,0.058,0.027,0.079,0.193,0.159]
df = pd.DataFrame(list(zip(event,date,side,value)),
columns=['event','date','side','value'])
print(df)
  event        date side  value
0   213  2013-08-10    A  0.193
1   213  2013-08-10    B  0.193
2   213  2013-08-10    A  0.092
3   213  2013-08-10    B  0.027
4   214  2013-08-10    C  0.027
5   214  2013-08-10    A  0.058
6   214  2013-08-10    C  0.027
7   215  2013-08-10    A  0.079
8   215  2013-08-10    C  0.193
9   215  2013-08-10    A  0.159
ds = df.groupby(['event','side'])[['value']].sum()
print(ds)
            value
event side
213   A     0.285
      B     0.220
214   A     0.058
      C     0.054
215   A     0.238
      C     0.193
ds.reset_index(inplace=True)
ds['exp_mean'] = np.nan
for s in ds.side.unique():
    # rows belonging to this side only
    ndx = ds[ds.side == s].index
    # expanding mean within the side, shifted by one to exclude the current event
    ds.loc[ndx, 'exp_mean'] = ds.loc[ndx, 'value'].expanding().mean().shift(1)
ds.set_index(['event', 'side'], inplace=True, drop=True)
print(ds)
            value  exp_mean
event side
213   A     0.285       NaN
      B     0.220       NaN
214   A     0.058    0.2850
      C     0.054       NaN
215   A     0.238    0.1715
      C     0.193    0.0540
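For what it's worth, the per-side loop can probably be written as a single groupby/transform. The following is a sketch under the assumption of a reasonably recent pandas (with `Series.expanding`); it rebuilds the same sample data, leaves out the `date` column for brevity, and the column name `exp_mean` is just carried over from above.

```python
import pandas as pd

event = ['213', '213', '213', '213', '214', '214', '214', '215', '215', '215']
side = ['A', 'B', 'A', 'B', 'C', 'A', 'C', 'A', 'C', 'A']
value = [0.193, 0.193, 0.092, 0.027, 0.027, 0.058, 0.027, 0.079, 0.193, 0.159]
df = pd.DataFrame({'event': event, 'side': side, 'value': value})

# Sum per (event, side); selecting 'value' keeps other columns out of the sum.
ds = df.groupby(['event', 'side'])[['value']].sum()

# Within each side: expanding mean, then shift by one row of that side
# so the current event is excluded from its own average.
ds['exp_mean'] = (
    ds.groupby(level='side')['value']
      .transform(lambda s: s.expanding().mean().shift(1))
)
print(ds)
```

Because the shift happens inside each side's group, it does not matter which sides an event has or in what order they appear.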
See the following (lines 60-78):
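What do you consider the rolling window to be? Why is the rolling mean zero at the edges? Wouldn't it be better as null if it can't be computed?
The window would be any previous events, and yes, it should be null.
I didn't know about the expanding mean. That's exactly what I wanted. Thanks!
As it turns out, this isn't what I needed. Later in the DataFrame, sides other than A and B appear, which complicates things. What I need is something like Excel's mean.if() function, where the condition is that the values belong to the same side: A, B, C, and so on. I hope that makes sense. In other words, the shift doesn't work because the different sides don't appear in any particular order.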
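The last comment's point can be checked directly. The sketch below retypes the per-(event, side) sums from the second answer's sample data and compares a positional shift over the whole result with a shift applied inside each side. The names `global_shift` and `group_shift` are made up for this illustration, and a recent pandas is assumed.

```python
import pandas as pd

# Per-(event, side) sums from the second answer's sample, typed in directly.
idx = pd.MultiIndex.from_tuples(
    [('213', 'A'), ('213', 'B'), ('214', 'A'),
     ('214', 'C'), ('215', 'A'), ('215', 'C')],
    names=['event', 'side'],
)
sums = pd.Series([0.285, 0.220, 0.058, 0.054, 0.238, 0.193], index=idx, name='value')

by_side = sums.groupby(level='side')

# Positional shift over the whole result: assumes every event lists
# the same sides in the same order.
global_shift = by_side.transform(lambda s: s.expanding().mean()).shift(2)

# Shift inside each side's group: only rows of the same side are consulted.
group_shift = by_side.transform(lambda s: s.expanding().mean().shift(1))

print(pd.DataFrame({'value': sums,
                    'global_shift': global_shift,
                    'group_shift': group_shift}))
```

With this data, the row (214, C) picks up 0.220 under the positional shift, which is B's sum from event 213, whereas the per-side shift leaves it as NaN because C has no earlier event.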