Python: applying a groupby aggregation on a DataFrame index (DatetimeIndex)

python pandas

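I am trying to reduce meteorological data using pandas 0.13.1. I have a large DataFrame of floats. Thanks to an earlier answer, I can group the data into half-hour intervals efficiently. I use groupby + apply rather than resample because I need to examine several columns at once.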

>>> winddata
                            sonic_Ux  sonic_Uy  sonic_Uz
TIMESTAMP                                               
2014-04-30 14:13:12.300000  0.322444  2.530129  0.347921
2014-04-30 14:13:12.400000  0.357793  2.571811  0.360840
2014-04-30 14:13:12.500000  0.469529  2.400510  0.193011
2014-04-30 14:13:12.600000  0.298787  2.212599  0.404752
2014-04-30 14:13:12.700000  0.259310  2.054919  0.066324
2014-04-30 14:13:12.800000  0.342952  1.962965  0.070500
2014-04-30 14:13:12.900000  0.434589  2.210533 -0.010147
                                 ...       ...       ...

[4361447 rows x 3 columns]
>>> winddata.dtypes
sonic_Ux    float64
sonic_Uy    float64
sonic_Uz    float64
dtype: object
>>> hhdata = winddata.groupby(TimeGrouper('30T')); hhdata
<pandas.core.groupby.DataFrameGroupBy object at 0xb440790c>

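I can loop through the groupby object just fine. I can also wrap the result in a Series or DataFrame, but wrapping the values requires adding an index, and that index gets combined with my original index into tuples. Removing the duplicate index, as suggested in another question, did not work as expected. Since I can reproduce both the problem and the solution from that question, I wonder whether it behaves differently here because I am grouping on the index, which is a DatetimeIndex.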

>>> for name, g in hhdata:
...     print name, atan2(g['sonic_Ux'].mean(), g['sonic_Uy'].mean()), '   wd'
... 
2014-04-30 14:00:00 0.13861912975    wd
2014-04-30 14:30:00 0.511709085506    wd
2014-04-30 15:00:00 -1.5088990774    wd
2014-04-30 15:30:00 0.13200013186    wd
    <<snip>>
>>> def winddir(g):
...     return pd.Series(atan2( np.mean(g['sonic_Ux']), np.mean(g['sonic_Uy']) ), name='wd')
... 
>>> hhdata.apply(winddir)
2014-04-30 14:00:00  0    0.138619
2014-04-30 14:30:00  0    0.511709
2014-04-30 15:00:00  0   -1.508899
2014-04-30 15:30:00  0    0.132000
...
2014-05-05 14:00:00  0   -2.551593
2014-05-05 14:30:00  0   -2.523250
2014-05-05 15:00:00  0   -2.698828
Name: wd, Length: 243, dtype: float64
>>> hhdata.apply(winddir).index[0]
(Timestamp('2014-04-30 14:00:00', tz=None), 0)
>>> def winddir(g):
...     return pd.DataFrame({'wd':atan2(g['sonic_Ux'].mean(), g['sonic_Uy'].mean())}, index=[g.name])
... 
>>> hhdata.apply(winddir)
                                               wd
2014-04-30 14:00:00 2014-04-30 14:00:00  0.138619
2014-04-30 14:30:00 2014-04-30 14:30:00  0.511709
2014-04-30 15:00:00 2014-04-30 15:00:00 -1.508899
2014-04-30 15:30:00 2014-04-30 15:30:00  0.132000
                                              ...

[243 rows x 1 columns]
>>> hhdata.apply(winddir).index[0]
(Timestamp('2014-04-30 14:00:00', tz=None), Timestamp('2014-04-30 14:00:00', tz=None))
>>> 
>>> tsfast.groupby(TimeGrouper('30T')).apply(lambda g:
...     Series({'wd': atan2(g.sonic_Ux.mean(), g.sonic_Uy.mean()), 
...             'ws': np.sqrt(g.sonic_Ux.mean()**2 + g.sonic_Uy.mean()**2)}))
2014-04-30 14:00:00  wd    0.138619
                     ws    1.304311
2014-04-30 14:30:00  wd    0.511709
                     ws    0.143762
2014-04-30 15:00:00  wd   -1.508899
                     ws    0.856643
...
2014-05-05 14:30:00  wd   -2.523250
                     ws    3.317810
2014-05-05 15:00:00  wd   -2.698828
                     ws    3.279520
Length: 486, dtype: float64
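
One way to avoid the tuple index described above is to have the applied function return a plain scalar rather than a Series or DataFrame, so that apply has no inner index to stack under the group key. The following is only a sketch on synthetic data: the random values and the smaller frame are assumptions, and it uses `pd.Grouper` with the `'30min'` alias from newer pandas releases rather than the `TimeGrouper('30T')` call of pandas 0.13.1.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wind data: 2 hours of 10 Hz samples (assumed values).
rng = np.random.default_rng(0)
idx = pd.date_range("2014-04-30 14:00:00", periods=72000, freq="100ms")
winddata = pd.DataFrame(
    {
        "sonic_Ux": rng.normal(size=len(idx)),
        "sonic_Uy": rng.normal(loc=2.0, size=len(idx)),
    },
    index=idx,
)


def winddir(g):
    # Return a plain float, not a Series or DataFrame,
    # so the result is indexed by the group key only.
    return np.arctan2(g["sonic_Ux"].mean(), g["sonic_Uy"].mean())


wd = (
    winddata.groupby(pd.Grouper(freq="30min"))[["sonic_Ux", "sonic_Uy"]]
    .apply(winddir)
    .rename("wd")
)

print(wd.index.nlevels)   # number of index levels in the result
print(type(wd.index[0]))  # each label is a single timestamp, not a tuple
```

Selecting the two columns before calling `apply` keeps the function's input small and limits each group to the columns the function actually reads.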

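It would be much more efficient not to use apply at all: compute the mean aggregation first, then use np.arctan2 on the result. I will post an example tomorrow.

Looking at your exception, it seems you are trying to apply the function to each row but did not specify axis=1; for example, df.apply(f, axis=1) applies the function to each row.

That is useful, but it does not completely solve my problem. I will rewrite the question to make it clearer.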

>>> winddata.index.name = 'WASINDEX'
>>> data2 = winddata.reset_index()
>>> def to_hh(x): # <-- big hammer
...     ts = x.isoformat()
...     return ts[:14] + ('30:00' if int(ts[14:16]) >= 30 else '00:00')
... 
>>> data2['TS'] = data2['WASINDEX'].apply(lambda x: to_hh(x))
>>> wd = data2.groupby('TS').apply(lambda df: Series({'wd': np.arctan2(df.x.mean(), df.y.mean())}))
>>> type(wd)
pandas.core.frame.DataFrame
>>> wd.columns
Index([u'wd'], dtype=object)
>>> wd.index
Index([u'2014-04-30T14:00:00', u'2014-04-30T14:30:00', <<snip>> dtype=object)
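
The string-based `to_hh` helper above can probably be replaced by flooring the timestamps to the start of their half-hour bin. This is a sketch under assumptions: `Series.dt.floor` comes from later pandas releases and is not available in 0.13.1, and the four timestamps below are hypothetical samples, not the real data.

```python
import pandas as pd

# Hypothetical timestamps standing in for the 'WASINDEX' column.
data2 = pd.DataFrame({
    "WASINDEX": pd.to_datetime([
        "2014-04-30 14:13:12.300",
        "2014-04-30 14:29:59.900",
        "2014-04-30 14:30:00.000",
        "2014-04-30 15:47:05.100",
    ])
})


def to_hh(x):
    # String-based helper from the workaround above, kept only for comparison.
    ts = x.isoformat()
    return ts[:14] + ("30:00" if int(ts[14:16]) >= 30 else "00:00")


# Vectorized: round every timestamp down to the start of its 30-minute bin.
data2["TS"] = data2["WASINDEX"].dt.floor("30min")

# Formatted as ISO strings, the floored timestamps should match the helper's labels.
labels = data2["WASINDEX"].apply(to_hh)
same = (data2["TS"].dt.strftime("%Y-%m-%dT%H:%M:%S") == labels).all()
print(same)
```

Unlike the string labels, the floored column keeps its datetime dtype, so a later `groupby('TS')` yields timestamps rather than text in the index.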

In [31]: pd.set_option('max_rows',10)

In [32]: winddata = DataFrame({ 'x' : np.random.randn(N), 'y' : np.random.randn(N)+2, 'z' : np.random.randn(N) },pd.date_range('20140430 14:13:12',periods=N,freq='100ms'))

In [33]: winddata
Out[33]: 
                                   x         y         z
2014-04-30 14:13:12        -0.065350  0.567525  2.212534
2014-04-30 14:13:12.100000 -0.436498  2.591799  2.424359
2014-04-30 14:13:12.200000 -1.059038  3.120631 -0.645579
2014-04-30 14:13:12.300000  1.973474  0.630424  0.966405
2014-04-30 14:13:12.400000  0.575082  1.941845 -0.674695
...                              ...       ...       ...
2014-05-05 15:22:16.200000  0.601962  0.027834 -0.101967
2014-05-05 15:22:16.300000  0.741777  1.764745  0.991516
2014-05-05 15:22:16.400000 -0.494253  1.765930  2.493000
2014-05-05 15:22:16.500000 -2.643749  0.671604  0.275096
2014-05-05 15:22:16.600000  0.676698  0.958903  0.946942

[4361447 rows x 3 columns]

In [34]: winddata.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 4361447 entries, 2014-04-30 14:13:12 to 2014-05-05 15:22:16.600000
Freq: 100L
Data columns (total 3 columns):
x    float64
y    float64
z    float64
dtypes: float64(3)

In [35]: g = winddata.groupby(pd.Grouper(freq='30T'))

In [36]: results = DataFrame({'x' : g['x'].mean(), 'y' : g['y'].mean() })

In [37]: results['wd'] = np.arctan2(results['x'],results['y'])

In [38]: results['ws'] = np.sqrt(results['x']**2+results['y']**2)

In [39]: results
Out[39]: 
                            x         y        wd        ws
2014-04-30 14:00:00  0.005060  1.986778  0.002547  1.986784
2014-04-30 14:30:00  0.004922  2.015551  0.002442  2.015557
2014-04-30 15:00:00 -0.004209  1.988889 -0.002116  1.988893
2014-04-30 15:30:00  0.008410  2.003453  0.004198  2.003470
2014-04-30 16:00:00  0.004027  1.997369  0.002016  1.997373
...                       ...       ...       ...       ...
2014-05-05 13:00:00  0.006901  1.991252  0.003466  1.991264
2014-05-05 13:30:00  0.005458  2.008731  0.002717  2.008739
2014-05-05 14:00:00 -0.000805  2.000045 -0.000402  2.000045
2014-05-05 14:30:00 -0.004556  1.997437 -0.002281  1.997443
2014-05-05 15:00:00  0.003444  2.000182  0.001722  2.000185

[243 rows x 4 columns]
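
The transcript above leaves `N` undefined and relies on names imported elsewhere in the session. A self-contained version on a much smaller synthetic frame might look like the sketch below. The value of `N`, the random seed, and the `'30min'` alias (the newer spelling of `'30T'`) are assumptions, and `np.hypot` is swapped in for the explicit square root.

```python
import numpy as np
import pandas as pd

N = 72000  # assumed: 2 hours at 10 Hz; the frame in the transcript has 4361447 rows
rng = np.random.default_rng(42)
winddata = pd.DataFrame(
    {"x": rng.normal(size=N), "y": rng.normal(size=N) + 2, "z": rng.normal(size=N)},
    index=pd.date_range("2014-04-30 14:13:12", periods=N, freq="100ms"),
)

# Aggregate once per 30-minute bin, then do the trigonometry on the small result.
results = winddata.groupby(pd.Grouper(freq="30min"))[["x", "y"]].mean()
results["wd"] = np.arctan2(results["x"], results["y"])
results["ws"] = np.hypot(results["x"], results["y"])

# Cross-check against an explicit loop over the groups.
for name, g in winddata.groupby(pd.Grouper(freq="30min")):
    expected = np.arctan2(g["x"].mean(), g["y"].mean())
    assert np.isclose(results.loc[name, "wd"], expected)

print(results)
```

Because the means are computed in one vectorized pass and the index is simply the bin start time, no extra index level appears and nothing needs to be dropped afterwards.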