Grouping and resampling time series data in Python
Data:
ohlc_dict = {
    'Open': 'first',
    'High': 'max',
    'Low': 'min',
    'Last': 'last',
    'Volume': 'sum'}
data['hod'] = [r.hour for r in data.index]
data.head(10)
Out[61]:
                        Open     High      Low     Last  Volume  hod       dow
Timestamp
2014-05-08 08:00:00  136.230  136.290  136.190  136.290    7077    8  Thursday
2014-05-08 08:15:00  136.290  136.300  136.240  136.250    3881    8  Thursday
2014-05-08 08:30:00  136.240  136.270  136.230  136.230    2540    8  Thursday
2014-05-08 08:45:00  136.230  136.260  136.230  136.250    2293    8  Thursday
2014-05-08 09:00:00  136.250  136.360  136.240  136.360   15014    9  Thursday
2014-05-08 09:15:00  136.350  136.360  136.260  136.270   11697    9  Thursday
2014-05-08 09:30:00  136.270  136.270  136.190  136.200   15600    9  Thursday
2014-05-08 09:45:00  136.200  136.270  136.200  136.240    9025    9  Thursday
2014-05-08 10:00:00  136.240  136.270  136.240  136.260    7128   10  Thursday
2014-05-08 10:15:00  136.250  136.260  136.200  136.200    6100   10  Thursday
Question:
Both of the following change the timeframe from 15-minute to 1-hour intervals:
Method 1:
# .loc is required for partial-string row selection in pandas 2.0 and later
data.loc['2016'].groupby('hod').Volume.mean().head()
hod
8 8452.597
9 16485.398
10 15619.626
11 14132.666
12 11470.058
Name: Volume, dtype: float64
Method 2:
df_h1 = data.resample('1h').agg(ohlc_dict).dropna()
df_h1['hod'] = [r.hour for r in df_h1.index]
# .loc is required for partial-string row selection in pandas 2.0 and later
df_h1.loc['2016'].groupby('hod')['Volume'].mean()
Timestamp
2014-05-08 08:00:00 15791.000
2014-05-08 09:00:00 51336.000
2014-05-08 10:00:00 28855.000
2014-05-08 11:00:00 56543.000
2014-05-08 12:00:00 25249.000
Name: Volume, dtype: float64
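Only method 2 gives accurate output for the Volume data. How can I change method 1 to get the same Volume output as method 2, but using groupby instead of resample? I don't know how to use ohlc_dict in method 1, which I feel is required.

Comments:

In method 1, you take a straight average of all observations for that hour of the day. In method 2, you first sum the volume within each hour, then average those hourly totals. I would expect method 2 to give a result that is a multiple of method 1, where the multiple is the number of observations per hour.

Hi @piRSquared, thanks for your input. I suspected this, which is why I asked how to use ohlc_dict in method 1. Is that possible?

@piRSquared (or anyone else interested): I have cross-referenced the data against a commercial charting package, and method 2 appears to give the correct output. It would be useful to see how the groupby (method 1) syntax needs to be modified to get the same result as method 2.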
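One possible way to do this with groupby only, following the explanation in the comments: group the 15-minute rows by their timestamp floored to the hour and apply ohlc_dict there, then group those hourly bars by hour of day. The sketch below is not run against the asker's data. It uses a small synthetic frame (the dates, prices, and volumes are made up), and names such as bars_groupby and method2_groupby are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

ohlc_dict = {
    'Open': 'first',
    'High': 'max',
    'Low': 'min',
    'Last': 'last',
    'Volume': 'sum'}

# Synthetic stand-in for the asker's frame (assumption): two trading days of
# 15-minute bars from 08:00 to 11:45, with made-up prices and volumes.
idx = pd.DatetimeIndex(
    [ts for day in ('2016-01-04', '2016-01-05')
     for ts in pd.date_range(f'{day} 08:00', f'{day} 11:45', freq='15min')],
    name='Timestamp')
rng = np.random.default_rng(0)
last = 136 + rng.normal(0, 0.05, len(idx)).cumsum()
data = pd.DataFrame({
    'Open': last - 0.01,
    'High': last + 0.03,
    'Low': last - 0.03,
    'Last': last,
    'Volume': rng.integers(1_000, 20_000, size=len(idx))}, index=idx)

# Method 1 from the question: straight mean of every 15-minute row per hour of day.
method1 = data.groupby(data.index.hour)['Volume'].mean()

# Method 2 from the question: build hourly bars with resample, then average per hour of day.
bars_resample = data.resample('1h').agg(ohlc_dict).dropna()
method2_resample = bars_resample.groupby(bars_resample.index.hour)['Volume'].mean()

# groupby-only variant: flooring each timestamp to the hour plays the role of
# resample('1h'). Hours with no rows never form a group, so no dropna() is needed.
bars_groupby = data.groupby(data.index.floor('h')).agg(ohlc_dict)
method2_groupby = bars_groupby.groupby(bars_groupby.index.hour)['Volume'].mean()

print(method1)
print(method2_resample)
print(method2_groupby)
```

In this synthetic frame every hour holds exactly four 15-minute rows, so the method 2 values come out as four times the method 1 values, which matches the multiple described in the comments. On real data with missing bars the ratio would vary from hour to hour.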