Python: Rolling OLS with Time as the Independent Variable

python pandas

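I'm trying to build a rolling OLS model in pandas from a DataFrame/time series of stock prices. What I want to do is run the OLS calculation over the last N days, return the predicted price and the slope, and add them to their own columns in the DataFrame. As far as I can tell, my only option is PandasRollingOLS from pyfinance, so I'll use it in my example, but if there is another way I'm happy to use that instead.

My DataFrame looks like this, for example: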

Date                     Price
....
2019-03-31 08:59:59.999  1660
2019-03-31 09:59:59.999  1657
2019-03-31 10:59:59.999  1656
2019-03-31 11:59:59.999  1652
2019-03-31 12:59:59.999  1646
2019-03-31 13:59:59.999  1645
2019-03-31 14:59:59.999  1650
2019-03-31 15:59:59.999  1669
2019-03-31 16:59:59.999  1674

I want to run a rolling regression with the Date column as the independent variable. Normally I would do:

from pyfinance import ols

X = df['Date']
y = df['Price']
model = ols.PandasRollingOLS(y, X, window=250)

Unsurprisingly, though, using df['Date'] as my X returns an error.

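So my first question is what I need to do to my Date column to get PandasRollingOLS to work. My next question is what I need to call to return the predicted value and the slope. With a regular OLS I would do something like model.predict and model.slope, but those options are obviously not available with PandasRollingOLS.

What I actually want is to add these values to new columns in my df, so I was thinking of something like df['Predict'] = model.predict, for example, but obviously that is not the answer. The ideal result would be:
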
Date                     Price  Predict  Slope
....
2019-03-31 08:59:59.999  1660   1665     0.10
2019-03-31 09:59:59.999  1657   1663     0.10
2019-03-31 10:59:59.999  1656   1661     0.09
2019-03-31 11:59:59.999  1652   1658     0.08
2019-03-31 12:59:59.999  1646   1651     0.07
2019-03-31 13:59:59.999  1645   1646     0.07
2019-03-31 14:59:59.999  1650   1643     0.07
2019-03-31 15:59:59.999  1669   1642     0.07
2019-03-31 16:59:59.999  1674   1645     0.08

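Thanks a lot for any help, cheers.

You can use datetime.datetime.strptime and time.mktime to convert the dates to numbers (seconds since the epoch), then use statsmodels together with a custom function that handles the rolling windows to build a model for each desired subset of the DataFrame:

Output: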
                         Price      Predict     Slope
Date                                                 
2019-03-31 10:59:59.999   1656  1657.670504  0.000001
2019-03-31 11:59:59.999   1652  1655.003830  0.000001
2019-03-31 12:59:59.999   1646  1651.337151  0.000001
2019-03-31 13:59:59.999   1645  1647.670478  0.000001
2019-03-31 14:59:59.999   1650  1647.003818  0.000001
2019-03-31 15:59:59.999   1669  1654.670518  0.000001
2019-03-31 16:59:59.999   1674  1664.337207  0.000001

Code:

#%%
# imports
import datetime, time
import pandas as pd
import numpy as np
import statsmodels.api as sm
from collections import OrderedDict

# your data in a more easily reproducible format
data = {'Date': ['2019-03-31 08:59:59.999', '2019-03-31 09:59:59.999', '2019-03-31 10:59:59.999',
        '2019-03-31 11:59:59.999',  '2019-03-31 12:59:59.999', '2019-03-31 13:59:59.999',
        '2019-03-31 14:59:59.999', '2019-03-31 15:59:59.999', '2019-03-31 16:59:59.999'],
        'Price': [1660, 1657, 1656, 1652, 1646, 1645, 1650, 1669, 1674]}

# function to make a useful time structure as independent variable
def myTime(date_time_str):
    date_time_obj = datetime.datetime.strptime(date_time_str, '%Y-%m-%d %H:%M:%S.%f')
    return(time.mktime(date_time_obj.timetuple()))

# add time structure to dataset
data['Time'] = [myTime(obs) for obs in data['Date']]

# time for pandas
df = pd.DataFrame(data)

# Function for rolling OLS of a desired window size on a pandas dataframe

def RegressionRoll(df, subset, dependent, independent, const, win):
    """
    RegressionRoll takes a dataframe, makes a subset of the data if you like,
    and runs a series of regressions with a specified window length, and
    returns a dataframe with the price, the prediction and the coefficients
    for each window split of the data.

    Parameters:
    ===========
    df -- pandas dataframe
    subset -- integer - has to be smaller than the size of the df or 0 if no subset.
    dependent -- string that specifies name of dependent variable
    independent -- LIST of strings that specifies name of independent variables
    const -- boolean - whether or not to include a constant term
    win -- integer - window length of each model

    Example:
    ========
    df_rolling = RegressionRoll(df=df, subset = 0, 
                                dependent = 'Price', independent = ['Time'],
                                const = False, win = 3)

    """

    # Data subset
    if subset != 0:
        df = df.tail(subset)
    else:
        df = df

    # Loopinfo
    end = df.shape[0]+1
    win = win
    rng = np.arange(start = win, stop = end, step = 1)

    # Subset and store dataframes
    frames = {}
    n = 1

    for i in rng:
        df_temp = df.iloc[:i].tail(win)
        newname = 'df' + str(n)
        frames.update({newname: df_temp})
        n += 1

    # Analysis on subsets
    df_results = pd.DataFrame()
    for frame in frames:

        # debug
        # print(frames[frame])

        # Rolling data frames
        dfr = frames[frame]
        y = dependent
        x = independent

        # Model with or without constant
        if const:
            x = sm.add_constant(dfr[x])
            model = sm.OLS(dfr[y], x).fit()
        else:
            model = sm.OLS(dfr[y], dfr[x]).fit()

        # Retrieve price and price prediction (fitted value at the last row of the window)
        Prediction = model.predict()[-1]
        d = {'Price': dfr['Price'].iloc[-1], 'Predict': Prediction}
        df_prediction = pd.DataFrame(d, index = dfr['Date'][-1:])

        # Retrieve parameters (constant and slope, or slope only)
        theParams = model.params[0:]
        coefs = theParams.to_frame()
        df_temp = pd.DataFrame(coefs.T)
        df_temp.index = dfr['Date'][-1:]

        # Build dataframe with Price, Prediction and Slope (+constant if desired)
        df_temp2 = pd.concat([df_prediction, df_temp], axis = 1)
        df_temp2 = df_temp2.rename(columns = {'Time':'Slope'})
        df_results = pd.concat([df_results, df_temp2], axis = 0)

    return(df_results)

# test run
df_rolling = RegressionRoll(df=df, subset = 0, 
                            dependent = 'Price', independent = ['Time'],
                            const = False, win = 3)
print(df_rolling)
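
A side note on the date handling in myTime above: time.mktime interprets the parsed timestamp in the machine's local time zone, and timetuple() drops the milliseconds. The following is only a minimal sketch of a pandas-only alternative, assuming that "hours elapsed since the first row" is an acceptable regressor. The Hours column name is made up for illustration and is not part of the code above.

```python
import pandas as pd

# Small sample in the same shape as the question's data
df = pd.DataFrame({
    "Date": ["2019-03-31 08:59:59.999",
             "2019-03-31 09:59:59.999",
             "2019-03-31 10:59:59.999"],
    "Price": [1660, 1657, 1656],
})

# Parse the strings into real timestamps
df["Date"] = pd.to_datetime(df["Date"])

# Hypothetical regressor: hours elapsed since the first observation.
# Small numbers like these also keep the slope on a readable scale.
df["Hours"] = (df["Date"] - df["Date"].iloc[0]).dt.total_seconds() / 3600.0

print(df)
```

With a regressor on this scale the slope is expressed in price units per hour rather than per second since 1970.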

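The code could easily be shortened by assigning fewer variables and putting more expressions directly into the dictionaries and functions, but we can look at that once we know the output really is what you're after. Also, you didn't say whether the analysis should include a constant term, so I've included an option to handle that as well.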
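
To give an idea of what a shorter version with a constant term might look like, here is a rough sketch that swaps statsmodels for numpy.polyfit. The function name rolling_ols and the Hours column are assumptions made for this illustration. Because it fits an intercept and measures time in hours, its numbers will not match the output shown above, which was produced with const = False.

```python
import numpy as np
import pandas as pd


def rolling_ols(df, x_col, y_col, window):
    """Fit y = intercept + slope * x on each trailing window of rows."""
    x = df[x_col].to_numpy(dtype=float)
    y = df[y_col].to_numpy(dtype=float)
    predict = np.full(len(df), np.nan)
    slope = np.full(len(df), np.nan)
    for end in range(window, len(df) + 1):
        xs = x[end - window:end]
        ys = y[end - window:end]
        # polyfit with deg=1 returns [slope, intercept]
        b, a = np.polyfit(xs, ys, deg=1)
        slope[end - 1] = b
        # fitted value at the last observation of the window
        predict[end - 1] = a + b * xs[-1]
    return df.assign(Predict=predict, Slope=slope)


df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-03-31 08:59:59.999", "2019-03-31 09:59:59.999",
        "2019-03-31 10:59:59.999", "2019-03-31 11:59:59.999",
        "2019-03-31 12:59:59.999", "2019-03-31 13:59:59.999",
        "2019-03-31 14:59:59.999", "2019-03-31 15:59:59.999",
        "2019-03-31 16:59:59.999"]),
    "Price": [1660, 1657, 1656, 1652, 1646, 1645, 1650, 1669, 1674],
})
# Hypothetical regressor: hours since the first observation
df["Hours"] = (df["Date"] - df["Date"].iloc[0]).dt.total_seconds() / 3600.0

result = rolling_ols(df, "Hours", "Price", window=3)
print(result[["Date", "Price", "Predict", "Slope"]])
```

The first window - 1 rows stay NaN because there is not enough history to fit a model there.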
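@top bantz Happy to help! Rolling regression questions that are similar but not quite identical come up from time to time. One of the most interesting parts of the problem is how to structure the desired output. You can also take a look at my post on a more general approach to the challenge, which includes other parameter options such as R^2.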
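
For the special case of a single regressor plus an intercept, slope and R^2 per window can also be obtained without an explicit loop, since slope = cov(x, y) / var(x) and R^2 is the squared correlation. This is only a sketch under that assumption and is not the approach from the post mentioned above. The Hours and R2 column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Hours": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "Price": [1660, 1657, 1656, 1652, 1646, 1645, 1650, 1669, 1674],
})

window = 3
roll_x = df["Hours"].rolling(window)
roll_y = df["Price"].rolling(window)

# Simple regression with an intercept: slope = cov(x, y) / var(x)
df["Slope"] = roll_x.cov(df["Price"]) / roll_x.var()

# With one regressor, R^2 equals the squared Pearson correlation
df["R2"] = roll_x.corr(df["Price"]) ** 2

# The fitted line passes through the window means, so the fitted value
# at the last point is mean(y) + slope * (x_last - mean(x))
df["Predict"] = roll_y.mean() + df["Slope"] * (df["Hours"] - roll_x.mean())

print(df)
```

This relies on the rolling cov, var and corr methods sharing the same window alignment. It does not generalize to several regressors, where a per-window fit is needed.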