Python pandas: rolling mean by time interval

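I have a bunch of polling data; I want to compute a rolling mean to get an estimate for each day based on a three-day window. As I understand it, the rolling_* functions compute the window based on a specified number of values, not a specific datetime range.

How can I implement this functionality?

Sample input data: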

polls_subset.tail(20)
Out[185]: 
            favorable  unfavorable  other
enddate                                  
2012-10-25       0.48         0.49   0.03
2012-10-25       0.51         0.48   0.02
2012-10-27       0.51         0.47   0.02
2012-10-26       0.56         0.40   0.04
2012-10-28       0.48         0.49   0.04
2012-10-28       0.46         0.46   0.09
2012-10-28       0.48         0.49   0.03
2012-10-28       0.49         0.48   0.03
2012-10-30       0.53         0.45   0.02
2012-11-01       0.49         0.49   0.03
2012-11-01       0.47         0.47   0.05
2012-11-01       0.51         0.45   0.04
2012-11-03       0.49         0.45   0.06
2012-11-04       0.53         0.39   0.00
2012-11-04       0.47         0.44   0.08
2012-11-04       0.49         0.48   0.03
2012-11-04       0.52         0.46   0.01
2012-11-04       0.50         0.47   0.03
2012-11-05       0.51         0.46   0.02
2012-11-07       0.51         0.41   0.00

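The output would have only one row for each date.

How about something like this:

First resample the data frame into 1D intervals. This takes the mean of the values for all duplicate days. Use the fill_method option to fill in missing date values. Next, pass the resampled frame into pd.rolling_mean with a window of 3 and min_periods=1: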

pd.rolling_mean(df.resample("1D", fill_method="ffill"), window=3, min_periods=1)

            favorable  unfavorable     other
enddate
2012-10-25   0.495000     0.485000  0.025000
2012-10-26   0.527500     0.442500  0.032500
2012-10-27   0.521667     0.451667  0.028333
2012-10-28   0.515833     0.450000  0.035833
2012-10-29   0.488333     0.476667  0.038333
2012-10-30   0.495000     0.470000  0.038333
2012-10-31   0.512500     0.460000  0.029167
2012-11-01   0.516667     0.456667  0.026667
2012-11-02   0.503333     0.463333  0.033333
2012-11-03   0.490000     0.463333  0.046667
2012-11-04   0.494000     0.456000  0.043333
2012-11-05   0.500667     0.452667  0.036667
2012-11-06   0.507333     0.456000  0.023333
2012-11-07   0.510000     0.443333  0.013333

Update: as Ben points out in the comments, with the new syntax this would be:

df.resample("1d").sum().fillna(0).rolling(window=3, min_periods=1).mean()
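Note that the one-liner above sums the polls on each day and fills empty days with 0, so it does not appear to reproduce the table shown earlier, which came from a daily mean followed by a forward fill. A minimal sketch of the mean-then-forward-fill steps in the newer method-chaining syntax, assuming a recent pandas and using only the first few "favorable" values from the question (the names daily and smoothed are illustrative):

```python
import pandas as pd

# First nine polls from the question, indexed by enddate (unsorted, as in the sample).
df = pd.DataFrame(
    {"favorable": [0.48, 0.51, 0.51, 0.56, 0.48, 0.46, 0.48, 0.49, 0.53]},
    index=pd.to_datetime([
        "2012-10-25", "2012-10-25", "2012-10-27", "2012-10-26",
        "2012-10-28", "2012-10-28", "2012-10-28", "2012-10-28",
        "2012-10-30",
    ]),
)
df.index.name = "enddate"

# One row per calendar day: average the polls that share a day,
# then carry the last known value forward over days without polls.
daily = df.sort_index().resample("1D").mean().ffill()

# Three-row window over the daily frame, i.e. a three-day window.
smoothed = daily.rolling(window=3, min_periods=1).mean()
print(smoothed)
```

On this subset the first rows come out as 0.495, 0.5275 and 0.521667, matching the "favorable" column of the table above.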

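I just had the same question, but with irregularly spaced data points. Resampling is not really an option here, so I created my own function. Maybe it will be useful for others too: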

from pandas import Series, DataFrame
import pandas as pd
from datetime import datetime, timedelta
import numpy as np

def rolling_mean(data, window, min_periods=1, center=False):
    ''' Function that computes a rolling mean

    Parameters
    ----------
    data : DataFrame or Series
           If a DataFrame is passed, the rolling_mean is computed for all columns.
    window : int or string
             If int is passed, window is the number of observations used for calculating 
             the statistic, as defined by the function pd.rolling_mean()
             If a string is passed, it must be a frequency string, e.g. '90S'. This is
             internally converted into a DateOffset object, representing the window size.
    min_periods : int
                  Minimum number of observations in window required to have a value.

    Returns
    -------
    Series or DataFrame, if more than one column    
    '''
    def f(x):
        '''Function to apply that actually computes the rolling mean'''
        if center == False:
            dslice = col[x-pd.datetools.to_offset(window).delta+timedelta(0,0,1):x]
                # adding a microsecond because when slicing with labels start and endpoint
                # are inclusive
        else:
            dslice = col[x-pd.datetools.to_offset(window).delta/2+timedelta(0,0,1):
                         x+pd.datetools.to_offset(window).delta/2]
        if dslice.size < min_periods:
            return np.nan
        else:
            return dslice.mean()

    data = DataFrame(data.copy())
    dfout = DataFrame()
    if isinstance(window, int):
        dfout = pd.rolling_mean(data, window, min_periods=min_periods, center=center)
    elif isinstance(window, basestring):
        idx = Series(data.index.to_pydatetime(), index=data.index)
        for colname, col in data.iterkv():
            result = idx.apply(f)
            result.name = colname
            dfout = dfout.join(result, how='outer')
    if dfout.columns.size == 1:
        dfout = dfout.ix[:,0]
    return dfout


# Example
idx = [datetime(2011, 2, 7, 0, 0),
       datetime(2011, 2, 7, 0, 1),
       datetime(2011, 2, 7, 0, 1, 30),
       datetime(2011, 2, 7, 0, 2),
       datetime(2011, 2, 7, 0, 4),
       datetime(2011, 2, 7, 0, 5),
       datetime(2011, 2, 7, 0, 5, 10),
       datetime(2011, 2, 7, 0, 6),
       datetime(2011, 2, 7, 0, 8),
       datetime(2011, 2, 7, 0, 9)]
idx = pd.Index(idx)
vals = np.arange(len(idx)).astype(float)
s = Series(vals, index=idx)
rm = rolling_mean(s, window='2min')
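The function above relies on APIs from older pandas and Python 2 (pd.datetools, pd.rolling_mean, basestring, iterkv, .ix). Since pandas 0.19, rolling accepts an offset string directly on a datetime-indexed object, so the non-centered case can be sketched without a helper. This is an illustration under that assumption, not a drop-in replacement for every option of the function:

```python
from datetime import datetime

import numpy as np
import pandas as pd

# Same irregularly spaced timestamps as in the example above.
idx = pd.DatetimeIndex([
    datetime(2011, 2, 7, 0, 0),
    datetime(2011, 2, 7, 0, 1),
    datetime(2011, 2, 7, 0, 1, 30),
    datetime(2011, 2, 7, 0, 2),
    datetime(2011, 2, 7, 0, 4),
    datetime(2011, 2, 7, 0, 5),
    datetime(2011, 2, 7, 0, 5, 10),
    datetime(2011, 2, 7, 0, 6),
    datetime(2011, 2, 7, 0, 8),
    datetime(2011, 2, 7, 0, 9),
])
s = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Offset window: each value is the mean of the observations in (t - 2min, t],
# the same half-open interval the helper builds by adding one microsecond.
rm = s.rolling("2min", min_periods=1).mean()
print(rm)
```

The index must be monotonic for an offset window; sort it first if it is not.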

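user2689410's code was exactly what I needed. Here is my version (credit goes to user2689410), which is faster because it calculates the mean for whole rows of the DataFrame at once.

I hope my suffix conventions are readable: _s: string, _i: int, _b: bool, _ser: Series and _df: DataFrame. Where you find multiple suffixes, the type can be either.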

import pandas as pd
from datetime import datetime, timedelta
import numpy as np

def time_offset_rolling_mean_df_ser(data_df_ser, window_i_s, min_periods_i=1, center_b=False):
    """ Function that computes a rolling mean

    Credit goes to user2689410 at http://stackoverflow.com/questions/15771472/pandas-rolling-mean-by-time-interval

    Parameters
    ----------
    data_df_ser : DataFrame or Series
         If a DataFrame is passed, the time_offset_rolling_mean_df_ser is computed for all columns.
    window_i_s : int or string
         If int is passed, window_i_s is the number of observations used for calculating
         the statistic, as defined by the function pd.rolling_mean()
         If a string is passed, it must be a frequency string, e.g. '90S'. This is
         internally converted into a DateOffset object, representing the window_i_s size.
    min_periods_i : int
         Minimum number of observations in window_i_s required to have a value.

    Returns
    -------
    Series or DataFrame, if more than one column

    >>> idx = [
    ...     datetime(2011, 2, 7, 0, 0),
    ...     datetime(2011, 2, 7, 0, 1),
    ...     datetime(2011, 2, 7, 0, 1, 30),
    ...     datetime(2011, 2, 7, 0, 2),
    ...     datetime(2011, 2, 7, 0, 4),
    ...     datetime(2011, 2, 7, 0, 5),
    ...     datetime(2011, 2, 7, 0, 5, 10),
    ...     datetime(2011, 2, 7, 0, 6),
    ...     datetime(2011, 2, 7, 0, 8),
    ...     datetime(2011, 2, 7, 0, 9)]
    >>> idx = pd.Index(idx)
    >>> vals = np.arange(len(idx)).astype(float)
    >>> ser = pd.Series(vals, index=idx)
    >>> df = pd.DataFrame({'s1':ser, 's2':ser+1})
    >>> time_offset_rolling_mean_df_ser(df, window_i_s='2min')
                          s1   s2
    2011-02-07 00:00:00  0.0  1.0
    2011-02-07 00:01:00  0.5  1.5
    2011-02-07 00:01:30  1.0  2.0
    2011-02-07 00:02:00  2.0  3.0
    2011-02-07 00:04:00  4.0  5.0
    2011-02-07 00:05:00  4.5  5.5
    2011-02-07 00:05:10  5.0  6.0
    2011-02-07 00:06:00  6.0  7.0
    2011-02-07 00:08:00  8.0  9.0
    2011-02-07 00:09:00  8.5  9.5
    """

    def calculate_mean_at_ts(ts):
        """Function (closure) to apply that actually computes the rolling mean"""
        if center_b == False:
            dslice_df_ser = data_df_ser[
                ts-pd.datetools.to_offset(window_i_s).delta+timedelta(0,0,1):
                ts
            ]
            # adding a microsecond because when slicing with labels start and endpoint
            # are inclusive
        else:
            dslice_df_ser = data_df_ser[
                ts-pd.datetools.to_offset(window_i_s).delta/2+timedelta(0,0,1):
                ts+pd.datetools.to_offset(window_i_s).delta/2
            ]
        if  (isinstance(dslice_df_ser, pd.DataFrame) and dslice_df_ser.shape[0] < min_periods_i) or \
            (isinstance(dslice_df_ser, pd.Series) and dslice_df_ser.size < min_periods_i):
            return dslice_df_ser.mean()*np.nan   # keeps number format and whether Series or DataFrame
        else:
            return dslice_df_ser.mean()

    if isinstance(window_i_s, int):
        mean_df_ser = pd.rolling_mean(data_df_ser, window=window_i_s, min_periods=min_periods_i, center=center_b)
    elif isinstance(window_i_s, basestring):
        idx_ser = pd.Series(data_df_ser.index.to_pydatetime(), index=data_df_ser.index)
        mean_df_ser = idx_ser.apply(calculate_mean_at_ts)

    return mean_df_ser

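The functions above call .delta on the window offset, which only fixed-duration frequencies provide; a month-end window fails with:

AttributeError: 'MonthEnd' object has no attribute 'delta'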

>>> wt = df.resample('D').count()

            favorable  unfavorable  other
enddate                                  
2012-10-25          2            2      2
2012-10-26          1            1      1
2012-10-27          1            1      1

>>> df2 = df.resample('D').mean()

            favorable  unfavorable  other
enddate                                  
2012-10-25      0.495        0.485  0.025
2012-10-26      0.560        0.400  0.040
2012-10-27      0.510        0.470  0.020

>>> df3 = df2 * wt
>>> df3 = df3.rolling(3,min_periods=1).sum()
>>> wt3 = wt.rolling(3,min_periods=1).sum()

>>> df3 = df3 / wt3  

            favorable  unfavorable     other
enddate                                     
2012-10-25   0.495000     0.485000  0.025000
2012-10-26   0.516667     0.456667  0.030000
2012-10-27   0.515000     0.460000  0.027500
2012-10-28   0.496667     0.465000  0.041667
2012-10-29   0.484000     0.478000  0.042000
2012-10-30   0.488000     0.474000  0.042000
2012-10-31   0.530000     0.450000  0.020000
2012-11-01   0.500000     0.465000  0.035000
2012-11-02   0.490000     0.470000  0.040000
2012-11-03   0.490000     0.465000  0.045000
2012-11-04   0.500000     0.448333  0.035000
2012-11-05   0.501429     0.450000  0.032857
2012-11-06   0.503333     0.450000  0.028333
2012-11-07   0.510000     0.435000  0.010000
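The steps above amount to a rolling mean weighted by the number of polls on each day: every poll in the three-day window counts equally, instead of every day counting equally. A compact sketch of the same idea, assuming a recent pandas and using only the first nine "favorable" values from the question; it divides a rolling sum of daily totals by a rolling sum of daily counts (the names polls, daily_sum, daily_n and pooled are illustrative):

```python
import pandas as pd

# First nine polls from the question, sorted by enddate.
polls = pd.DataFrame(
    {"favorable": [0.48, 0.51, 0.51, 0.56, 0.48, 0.46, 0.48, 0.49, 0.53]},
    index=pd.to_datetime([
        "2012-10-25", "2012-10-25", "2012-10-27", "2012-10-26",
        "2012-10-28", "2012-10-28", "2012-10-28", "2012-10-28",
        "2012-10-30",
    ]),
).sort_index()

daily_sum = polls.resample("D").sum()   # total per day, 0 on days without polls
daily_n = polls.resample("D").count()   # number of polls per day

# Pooled mean over the last three days: sum of values / number of polls.
pooled = (daily_sum.rolling(3, min_periods=1).sum()
          / daily_n.rolling(3, min_periods=1).sum())
print(pooled)
```

On this subset the result for 2012-10-29 is 0.484 and for 2012-10-30 is 0.488, in line with the table above. A window with no polls at all would divide by zero and yield NaN, which this short sample does not exercise.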

In [1]: df = DataFrame({'B': range(5)})

In [2]: df.index = [Timestamp('20130101 09:00:00'),
   ...:             Timestamp('20130101 09:00:02'),
   ...:             Timestamp('20130101 09:00:03'),
   ...:             Timestamp('20130101 09:00:05'),
   ...:             Timestamp('20130101 09:00:06')]

In [3]: df
Out[3]: 
                     B
2013-01-01 09:00:00  0
2013-01-01 09:00:02  1
2013-01-01 09:00:03  2
2013-01-01 09:00:05  3
2013-01-01 09:00:06  4

In [4]: df.rolling(2, min_periods=1).sum()
Out[4]: 
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  5.0
2013-01-01 09:00:06  7.0

In [5]: df.rolling('2s', min_periods=1).sum()
Out[5]: 
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  3.0
2013-01-01 09:00:06  7.0

import pandas as pd
import datetime as dt

#populate your dataframe: "df"
#...

df[df.index<(df.index[0]+dt.timedelta(hours=1))] #gives you a slice. you can then take .sum() .mean(), whatever

data.index = pd.to_datetime(data['Index']).values

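import pandas as pd
import matplotlib.pyplot as plt

# Read the polls; np.float was removed from NumPy, so use the builtin float.
df = pd.read_csv('poll.csv', parse_dates=['enddate'],
                 dtype={'favorable': float, 'unfavorable': float, 'other': float})

# set_index returns a new frame, so the result has to be assigned back.
df = df.set_index('enddate')
df = df.fillna(0)

# Raw data, with enddate (now the index) on the x axis.
fig, axs = plt.subplots(figsize=(5, 10))
df.plot(ax=axs)
plt.show()

# Rolling mean over 3 rows; each window needs 3 observations to produce a value.
df.rolling(window=3, min_periods=3).mean().plot()
plt.show()
print("The larger the window coefficient the smoother the line will appear")
print('The min_periods is the minimum number of observations in the window required to have a value')

# A wider window of 6 rows gives a smoother line.
df.rolling(window=6, min_periods=3).mean().plot()
plt.show()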