Python groupby.aggregate and modify, then recast/reindex


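I would like to apply some deviation to a data frame at monthly granularity and then recast the result onto the initial data frame. I first do a groupby and aggregate; that part works fine. Then I reindex, and I get NaN. I would like the reindexing to be done by matching the month of each groupby element to the initial data frame. I would also like to be able to do this at different granularities (year -> month and year, ...).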

Does anyone know a solution to this problem?

>>> df['profile']
date
2015-01-01 00:00:00    3.000000
2015-01-01 01:00:00    3.000143
2015-01-01 02:00:00    3.000287
2015-01-01 03:00:00    3.000430
2015-01-01 04:00:00    3.000574
...
2015-12-31 20:00:00    2.999426
2015-12-31 21:00:00    2.999570
2015-12-31 22:00:00    2.999713
2015-12-31 23:00:00    2.999857
Freq: H, Name: profile, Length: 8760

### Deviation on monthly basis
>>> dev_monthly = np.random.uniform(0.5, 1.5, len(df['profile'].groupby(df.index.month).aggregate(np.sum)))


>>> df['profile_monthly'] = (df['profile'].groupby(df.index.month).aggregate(np.sum) * dev_monthly).reindex(df)

>>> df['profile_monthly']
date
2015-01-01 00:00:00   NaN
2015-01-01 01:00:00   NaN
2015-01-01 02:00:00   NaN
...
2015-12-31 22:00:00   NaN
2015-12-31 23:00:00   NaN
Freq: H, Name: profile_monthly, Length: 8760
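
The NaNs above most likely come from a label mismatch: the grouped result is indexed by month number (1 to 12), while the frame is indexed by timestamps, so reindex finds nothing to align. A minimal sketch of one way to broadcast the monthly values back, assuming hourly data shaped like the question's (the constant profile and the fixed seed are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic hourly profile for 2015 (illustrative data, not the asker's).
idx = pd.date_range("2015-01-01", periods=8760, freq="h")
df = pd.DataFrame({"profile": np.full(len(idx), 3.0)}, index=idx)

# Monthly sums, indexed by month number 1..12.
monthly_sum = df["profile"].groupby(df.index.month).sum()

# One random deviation factor per month (seed fixed only for reproducibility).
rng = np.random.default_rng(0)
dev_monthly = rng.uniform(0.5, 1.5, len(monthly_sum))
scaled = monthly_sum * dev_monthly

# Look up each row's month number in the 12-element result, then
# attach the values positionally to the original hourly index.
df["profile_monthly"] = scaled.reindex(df.index.month).to_numpy()
```

Grouping by month number alone merges the same month across different years, so this only suits data that spans a single year.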
Take a look at the documentation.

You are looking for resample, followed by fillna with method='bfill'.

In [105]: df = DataFrame({'profile': normal(3, 0.1, size=10000)}, pd.date_range(start='2015-01-01', freq='H', periods=10000))

In [106]: df['profile_monthly'] = df.profile.resample('M').sum()

In [107]: df
Out[107]:
                     profile  profile_monthly
2015-01-01 00:00:00   2.8328              NaN
2015-01-01 01:00:00   3.0607              NaN
2015-01-01 02:00:00   3.0138              NaN
2015-01-01 03:00:00   3.0402              NaN
2015-01-01 04:00:00   3.0335              NaN
2015-01-01 05:00:00   3.0087              NaN
2015-01-01 06:00:00   3.0557              NaN
2015-01-01 07:00:00   2.9280              NaN
2015-01-01 08:00:00   3.1359              NaN
2015-01-01 09:00:00   2.9681              NaN
2015-01-01 10:00:00   3.1240              NaN
2015-01-01 11:00:00   3.0635              NaN
2015-01-01 12:00:00   2.9206              NaN
2015-01-01 13:00:00   3.0714              NaN
2015-01-01 14:00:00   3.0688              NaN
2015-01-01 15:00:00   3.0703              NaN
2015-01-01 16:00:00   2.9102              NaN
2015-01-01 17:00:00   2.9368              NaN
2015-01-01 18:00:00   3.0864              NaN
2015-01-01 19:00:00   3.2124              NaN
2015-01-01 20:00:00   2.8988              NaN
2015-01-01 21:00:00   3.0659              NaN
2015-01-01 22:00:00   2.7973              NaN
2015-01-01 23:00:00   3.0824              NaN
2015-01-02 00:00:00   3.0199              NaN
                         ...              ...

[10000 rows x 2 columns]

In [108]: df.dropna()
Out[108]:
            profile  profile_monthly
2015-01-31   2.9769        2230.9931
2015-02-28   2.9930        2016.1045
2015-03-31   2.7817        2232.4096
2015-04-30   3.1695        2158.7834
2015-05-31   2.9040        2236.5962
2015-06-30   2.8697        2162.7784
2015-07-31   2.9278        2231.7232
2015-08-31   2.8289        2236.4603
2015-09-30   3.0368        2163.5916
2015-10-31   3.1517        2233.2285
2015-11-30   3.0450        2158.6998
2015-12-31   2.8261        2228.5550
2016-01-31   3.0264        2229.2221

[13 rows x 2 columns]

In [110]: df.fillna(method='bfill')
Out[110]:
                     profile  profile_monthly
2015-01-01 00:00:00   2.8328        2230.9931
2015-01-01 01:00:00   3.0607        2230.9931
2015-01-01 02:00:00   3.0138        2230.9931
2015-01-01 03:00:00   3.0402        2230.9931
2015-01-01 04:00:00   3.0335        2230.9931
2015-01-01 05:00:00   3.0087        2230.9931
2015-01-01 06:00:00   3.0557        2230.9931
2015-01-01 07:00:00   2.9280        2230.9931
2015-01-01 08:00:00   3.1359        2230.9931
2015-01-01 09:00:00   2.9681        2230.9931
2015-01-01 10:00:00   3.1240        2230.9931
2015-01-01 11:00:00   3.0635        2230.9931
2015-01-01 12:00:00   2.9206        2230.9931
2015-01-01 13:00:00   3.0714        2230.9931
2015-01-01 14:00:00   3.0688        2230.9931
2015-01-01 15:00:00   3.0703        2230.9931
2015-01-01 16:00:00   2.9102        2230.9931
2015-01-01 17:00:00   2.9368        2230.9931
2015-01-01 18:00:00   3.0864        2230.9931
2015-01-01 19:00:00   3.2124        2230.9931
2015-01-01 20:00:00   2.8988        2230.9931
2015-01-01 21:00:00   3.0659        2230.9931
2015-01-01 22:00:00   2.7973        2230.9931
2015-01-01 23:00:00   3.0824        2230.9931
2015-01-02 00:00:00   3.0199        2230.9931
                         ...              ...

[10000 rows x 2 columns]

When I use your code, I get different values for 2015-12-31 00:00:00 and 2015-12-31 01:00:00, as shown here:

>>> df.fillna(method='bfill')[np.logical_and(df.index.month==12, df.index.day==31)]
                    profile  profile_monthly
2015-12-31 00:00:00  2.926504      2232.288997
2015-12-31 01:00:00  3.008543      2234.470731
2015-12-31 02:00:00  2.930133      2234.470731
2015-12-31 03:00:00  3.078552      2234.470731
2015-12-31 04:00:00  3.141578      2234.470731
2015-12-31 05:00:00  3.061820      2234.470731
2015-12-31 06:00:00  2.981626      2234.470731
2015-12-31 07:00:00  3.010749      2234.470731
2015-12-31 08:00:00  2.878577      2234.470731
2015-12-31 09:00:00  2.915487      2234.470731
2015-12-31 10:00:00  3.072721      2234.470731
2015-12-31 11:00:00  3.087866      2234.470731
2015-12-31 12:00:00  3.089208      2234.470731
2015-12-31 13:00:00  2.957047      2234.470731
2015-12-31 14:00:00  3.002072      2234.470731
2015-12-31 15:00:00  3.106656      2234.470731
2015-12-31 16:00:00  3.100891      2234.470731
2015-12-31 17:00:00  3.077835      2234.470731
2015-12-31 18:00:00  3.032497      2234.470731
2015-12-31 19:00:00  2.959838      2234.470731
2015-12-31 20:00:00  2.878819      2234.470731
2015-12-31 21:00:00  3.041171      2234.470731
2015-12-31 22:00:00  3.061970      2234.470731
2015-12-31 23:00:00  3.019011      2234.470731

[24 rows x 2 columns]
So I finally arrived at the following solution:

>>> AA = df.groupby([df.index.year, df.index.month]).aggregate('mean')
>>> AA['dev'] = np.random.normal(0, 1, len(AA))  # one N(0, 1) draw per (year, month)
>>> df['dev'] = AA['dev'].reindex(pd.MultiIndex.from_arrays([df.index.year, df.index.month])).values
Short and fast. The only remaining question is:

=> How do I handle other granularities (half-year, quarter, week, etc.)?
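
For other granularities, one possible approach is to derive a group key per row (for example with DatetimeIndex.to_period) and broadcast the per-group result with transform. The sketch below is only an illustration under assumptions: broadcast_scaled_sum is a hypothetical helper name, the data is synthetic, and the 0.5 to 1.5 factor range is borrowed from the question.

```python
import numpy as np
import pandas as pd

def broadcast_scaled_sum(values, keys, rng):
    """Hypothetical helper: per-group sum times one random factor per
    group, returned with one value for every row of `values`."""
    # Integer code per row identifying its group, plus the distinct groups.
    codes, uniques = pd.factorize(keys)
    # One random factor per group (range taken from the question).
    factors = rng.uniform(0.5, 1.5, len(uniques))
    # transform("sum") repeats each group's sum on every row of that group.
    sums = values.groupby(codes).transform("sum")
    return sums * factors[codes]

# Synthetic hourly data for 2015 (illustrative only).
idx = pd.date_range("2015-01-01", periods=8760, freq="h")
df = pd.DataFrame({"profile": np.full(len(idx), 3.0)}, index=idx)
rng = np.random.default_rng(0)

# Month, quarter and week keys come straight from to_period.
df["dev_month"] = broadcast_scaled_sum(df["profile"], df.index.to_period("M"), rng)
df["dev_quarter"] = broadcast_scaled_sum(df["profile"], df.index.to_period("Q"), rng)
df["dev_week"] = broadcast_scaled_sum(df["profile"], df.index.to_period("W"), rng)

# Half-years use a hand-built key: year * 10 + (0 or 1).
half_key = df.index.year * 10 + (df.index.month - 1) // 6
df["dev_half"] = broadcast_scaled_sum(df["profile"], half_key, rng)
```

Because period keys include the year, the same code should carry over to data spanning several years without merging, say, January 2015 with January 2016.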

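@Philip Cloud: This doesn't work! Your answer runs into a problem on the last day of the year: the result is correct at 00:00, but not at 01:00 and the hours after it. This is caused by the resample method, which samples at 2015-01-01 00:00:00, 2015-02-01 00:00:00, ... Is there another way?

Can you post an example of the output you want?

You may need to actually read the documentation and play with some of the parameters to resample. I'm pretty sure you can change some combination of closed, label and loffset to get the result you want.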
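
One possibility along those lines (an assumption, not necessarily what the commenter had in mind) is to label each monthly bin by its first instant using the 'MS' frequency and then forward-fill. Every hour of a month then picks up that month's sum, including the hours after 00:00 on the last day. The data below is synthetic and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

# Synthetic hourly data similar in shape to the answer's example.
idx = pd.date_range("2015-01-01", periods=10000, freq="h")
rng = np.random.default_rng(0)
df = pd.DataFrame({"profile": rng.normal(3, 0.1, size=len(idx))}, index=idx)

# "MS" labels each monthly bin with the first instant of the month.
monthly = df["profile"].resample("MS").sum()

# Forward-fill from each month start so all hours of a month share its sum.
df["profile_monthly"] = monthly.reindex(df.index, method="ffill")
```

With month-start labels the fill direction is forward rather than backward, which sidesteps the boundary case reported for 2015-12-31.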