Python: incorrect scikit-learn linear model prediction with a date offset


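I am trying to forecast time series data by shifting the results by date_offset time points before training and predicting. The reason for doing this is to try to predict a future time point from the current data.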

In summary: data = [1, 2, 3, 4, 5] should predict result = [2, 3, 4, 5, 6] if date_offset = 1.
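
To make the intended alignment concrete, here is a minimal sketch of the shift on a toy series. The helper name `shift_pairs` and the single-series setup are assumptions made for illustration; the code in the question uses two separate arrays (`data` and `results`) but trims them in the same way.

```python
import numpy as np

def shift_pairs(series, date_offset):
    """Pair each value with the value `date_offset` steps later (hypothetical helper)."""
    series = np.asarray(series, dtype=float)
    if date_offset < 1 or date_offset >= len(series):
        raise ValueError("date_offset must be between 1 and len(series) - 1")
    X = series[:-date_offset].reshape(-1, 1)  # features: value at time t
    y = series[date_offset:]                  # targets: value at time t + date_offset
    return X, y

X, y = shift_pairs([1, 2, 3, 4, 5, 6], date_offset=1)
print(X.ravel())  # inputs 1..5
print(y)          # targets 2..6
```

Note that the last known value (6 here) only ever appears as a target. To forecast beyond the end of the series, the fitted model would have to be given the most recent value as a new input row.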

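The result in the plot below shows the red line shifted by date_offset, rather than predicting date_offset points into the future. No matter how large I make the date offset, the line keeps shifting and never predicts beyond the last result I have, i.e. result = 5 (which is already known). In fact, the red line should not be shifted at all; it should only become less accurate as the offset grows. What am I doing wrong?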

See the example code and result image below:

from sklearn import linear_model
import matplotlib.pyplot as plt
import numpy as np

date_offset = 1

data = np.array([9330.0, 9470.0, 9550.0, 9620.0, 9600.0, 9585.0, 9600.0, 9600.0, 9430.0, 9460.0, 9450.0, 9650.0, 9620.0, 9650.0, 9500.0, 9400.0, 9165.0, 9100.0, 8755.0, 8850.0, 8990.0, 9150.0, 9195.0, 9175.0, 9250.0, 9200.0, 9350.0, 9280.0, 9370.0, 9470.0, 9445.0, 9440.0, 9280.0, 9325.0, 9170.0, 9270.0, 9200.0, 9450.0, 9510.0, 9371.0, 9499.0, 9499.0, 9400.0, 9500.0, 9550.0, 9670.0, 9700.0, 9760.0, 9767.4599999999991, 9652.0, 9520.0, 9600.0, 9610.0, 9700.0, 9825.0, 9900.0, 9950.0, 9801.0, 9770.0, 9545.0, 9630.0, 9710.0, 9700.0, 9700.0, 9600.0, 9615.0, 9575.0, 9500.0, 9600.0, 9480.0, 9565.0, 9510.0, 9475.0, 9600.0, 9400.0, 9400.0, 9400.0, 9300.0, 9430.0, 9410.0, 9380.0, 9320.0, 9000.0, 9100.0, 9000.0, 9200.0, 9210.0, 9251.0, 9460.0, 9400.0, 9600.0, 9621.0, 9440.0, 9490.0, 9675.0, 9850.0, 9680.0, 10100.0, 9900.0, 10100.0, 9949.0, 10040.0, 10050.0, 10200.0, 10400.0, 10350.0, 10200.0, 10175.0, 10001.0, 10110.0, 10400.0, 10401.0, 10300.0, 10548.0, 10515.0, 10475.0, 10200.0, 10481.0, 10500.0, 10540.0, 10559.0, 10300.0, 10400.0, 10202.0, 10330.0, 10450.0, 10540.0, 10540.0, 10650.0, 10450.0, 10550.0, 10501.0, 10206.0, 10250.0, 10345.0, 10225.0, 10330.0, 10506.0, 11401.0, 11245.0, 11360.0, 11549.0, 11415.0, 11450.0, 11460.0, 11600.0, 11530.0, 11450.0, 11402.0, 11299.0])
data = data[np.newaxis].T

results = np.array([9470.0, 9545.0, 9635.0, 9640.0, 9600.0, 9622.0, 9555.0, 9429.0, 9495.0, 9489.0, 9630.0, 9612.0, 9630.0, 9501.0, 9372.0, 9165.0, 9024.0, 8780.0, 8800.0, 8937.0, 9051.0, 9100.0, 9166.0, 9220.0, 9214.0, 9240.0, 9254.0, 9400.0, 9450.0, 9470.0, 9445.0, 9301.0, 9316.0, 9170.0, 9270.0, 9251.0, 9422.0, 9466.0, 9373.0, 9440.0, 9415.0, 9410.0, 9500.0, 9520.0, 9620.0, 9705.0, 9760.0, 9765.0, 9651.0, 9520.0, 9600.0, 9610.0, 9700.0, 9805.0, 9900.0, 9950.0, 9800.0, 9765.0, 9602.0, 9630.0, 9790.0, 9710.0, 9800.0, 9649.0, 9580.0, 9780.0, 9560.0, 9501.0, 9511.0, 9530.0, 9498.0, 9475.0, 9595.0, 9500.0, 9460.0, 9400.0, 9310.0, 9382.0, 9375.0, 9385.0, 9320.0, 9100.0, 8990.0, 9045.0, 9129.0, 9201.0, 9251.0, 9424.0, 9440.0, 9500.0, 9621.0, 9490.0, 9512.0, 9599.0, 9819.0, 9684.0, 10025.0, 9984.0, 10110.0, 9950.0, 10048.0, 10095.0, 10200.0, 10338.0, 10315.0, 10200.0, 10166.0, 10095.0, 10110.0, 10400.0, 10445.0, 10360.0, 10548.0, 10510.0, 10480.0, 10180.0, 10488.0, 10520.0, 10510.0, 10565.0, 10450.0, 10400.0, 10240.0, 10338.0, 10410.0, 10540.0, 10481.0, 10521.0, 10530.0, 10325.0, 10510.0, 10446.0, 10249.0, 10236.0, 10211.0, 10340.0, 10394.0, 11370.0, 11250.0, 11306.0, 11368.0, 11415.0, 11400.0, 11452.0, 11509.0, 11500.0, 11455.0, 11400.0, 11300.0, 11369.0])

# Date offset to predict next i-days results
data = data[:-date_offset]
results = results[date_offset:]

train_data = data[:-50]
train_results = results[:-50]

test_data = data[-50:]
test_results = results[-50:]

# `normalize=True` was dropped here: the argument was removed in scikit-learn 1.2.
regressor = linear_model.BayesianRidge()
regressor.fit(train_data, train_results)

plt.figure(figsize=(8,6))
plt.plot(regressor.predict(test_data), '--', color='#EB3737', linewidth=2, label='Prediction')
plt.plot(test_results, label='True', color='green', linewidth=2)
plt.legend(loc='best')
plt.show()

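First, the model is not bad. For example, when the actual value is 10450 it predicts 10350, which is quite close. Clearly, the further ahead the predicted point lies, the less accurate the prediction becomes, because the variance grows and sometimes the bias grows too. You cannot expect the opposite.

Second, it is a linear model, so it cannot be perfectly accurate when the predicted variable is not inherently linear.

Third, the predicted variable must be chosen carefully. For example, in this case you might try to predict not the value at time T but the change in value at time T (i.e. C[T] = V[T] - V[T-1]), or a moving average of the last K values. Here you may (or may not) find that you are trying to model a so-called "random walk", which is hard to predict accurately because of its random nature.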


Finally, you could consider other models, such as ARIMA, which are better suited to forecasting time series.
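
As a rough sketch of the third point, the snippet below builds the two alternative targets mentioned there: the change C[T] = V[T] - V[T-1] and a moving average of the last K values. The helper names `value_changes` and `moving_average`, the choice K = 3, and the five sample values (the first five points of the question's `data` array) are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def value_changes(values):
    """Return C[T] = V[T] - V[T-1] for T = 1 .. len(values) - 1."""
    values = np.asarray(values, dtype=float)
    return np.diff(values)

def moving_average(values, k):
    """Return the mean of every run of k consecutive values."""
    values = np.asarray(values, dtype=float)
    if k < 1 or k > len(values):
        raise ValueError("k must be between 1 and len(values)")
    kernel = np.ones(k) / k
    # mode="valid" keeps only positions where the full window fits.
    return np.convolve(values, kernel, mode="valid")

values = [9330.0, 9470.0, 9550.0, 9620.0, 9600.0]
changes = value_changes(values)        # 140, 80, 70, -20
smoothed = moving_average(values, 3)   # one value per full window of 3
print(changes)
print(smoothed)
```

If a model is trained on the changes, a level forecast can be recovered by adding the predicted change to the last known value. If the changes turn out to be essentially unpredictable, that is a hint that the series behaves like the random walk described above.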

Adding the organize_data step back in:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import linear_model


def organize_data(to_forecast, window, horizon):
    """
     Input:
      to_forecast, univariate time series organized as numpy array
      window, number of items to use in the forecast window
      horizon, horizon of the forecast
     Output:
      X, a matrix where each row contains a forecast window
      y, the target values for each row of X
    """
    # Build a strided view in which row i is to_forecast[i:i + window].
    shape = to_forecast.shape[:-1] + \
            (to_forecast.shape[-1] - window + 1, window)
    strides = to_forecast.strides + (to_forecast.strides[-1],)
    X = np.lib.stride_tricks.as_strided(to_forecast,
                                        shape=shape,
                                        strides=strides)
    # The target for row i is the last value of the window `horizon` rows later.
    y = np.array([X[i+horizon][-1] for i in range(len(X)-horizon)])
    return X[:-horizon], y

data = np.array([9330.0, 9470.0, 9550.0, 9620.0, 9600.0, 9585.0, 9600.0, 9600.0, 9430.0, 9460.0, 9450.0, 9650.0, 9620.0, 9650.0, 9500.0, 9400.0, 9165.0, 9100.0, 8755.0, 8850.0, 8990.0, 9150.0, 9195.0, 9175.0, 9250.0, 9200.0, 9350.0, 9280.0, 9370.0, 9470.0, 9445.0, 9440.0, 9280.0, 9325.0, 9170.0, 9270.0, 9200.0, 9450.0, 9510.0, 9371.0, 9499.0, 9499.0, 9400.0, 9500.0, 9550.0, 9670.0, 9700.0, 9760.0, 9767.4599999999991, 9652.0, 9520.0, 9600.0, 9610.0, 9700.0, 9825.0, 9900.0, 9950.0, 9801.0, 9770.0, 9545.0, 9630.0, 9710.0, 9700.0, 9700.0, 9600.0, 9615.0, 9575.0, 9500.0, 9600.0, 9480.0, 9565.0, 9510.0, 9475.0, 9600.0, 9400.0, 9400.0, 9400.0, 9300.0, 9430.0, 9410.0, 9380.0, 9320.0, 9000.0, 9100.0, 9000.0, 9200.0, 9210.0, 9251.0, 9460.0, 9400.0, 9600.0, 9621.0, 9440.0, 9490.0, 9675.0, 9850.0, 9680.0, 10100.0, 9900.0, 10100.0, 9949.0, 10040.0, 10050.0, 10200.0, 10400.0, 10350.0, 10200.0, 10175.0, 10001.0, 10110.0, 10400.0, 10401.0, 10300.0, 10548.0, 10515.0, 10475.0, 10200.0, 10481.0, 10500.0, 10540.0, 10559.0, 10300.0, 10400.0, 10202.0, 10330.0, 10450.0, 10540.0, 10540.0, 10650.0, 10450.0, 10550.0, 10501.0, 10206.0, 10250.0, 10345.0, 10225.0, 10330.0, 10506.0, 11401.0, 11245.0, 11360.0, 11549.0, 11415.0, 11450.0, 11460.0, 11600.0, 11530.0, 11450.0, 11402.0, 11299.0])

train_window = 50
k = 5   # number of previous observations to use
h = 2   # forecast horizon
X,y = organize_data(data, k, h)

train_data = X[:train_window]
train_results = y[:train_window]

test_data = X[train_window:]
test_results = y[train_window:]

# `normalize=True` was dropped here: the argument was removed in scikit-learn 1.2.
regressor = linear_model.BayesianRidge()
regressor.fit(train_data, train_results)

plt.figure(figsize=(8,6))
plt.plot(regressor.predict(X), '--', color='#EB3737', linewidth=2, label='Prediction')
plt.plot(y, label='True', color='green', linewidth=2)
plt.legend(loc='best')
plt.show()
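
For readers on NumPy 1.20 or later, the windowing step can also be sketched with `numpy.lib.stride_tricks.sliding_window_view`, which avoids computing shapes and strides by hand. The function name `organize_data_windows` is an assumption for illustration. It is intended to return the same X and y as `organize_data` above for a one-dimensional series and a horizon of at least 1.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def organize_data_windows(to_forecast, window, horizon):
    """Sliding-window variant of organize_data (hypothetical name).

    Row i of X is to_forecast[i:i + window]; y[i] is the value
    `horizon` steps after the last element of that window.
    """
    series = np.asarray(to_forecast, dtype=float)
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    if window < 1 or window + horizon > len(series):
        raise ValueError("window + horizon must not exceed the series length")
    windows = sliding_window_view(series, window)  # read-only view, shape (n - window + 1, window)
    X = windows[:-horizon]                         # drop windows that have no target
    y = series[window - 1 + horizon:]              # value `horizon` steps past each window
    return X, y

X_demo, y_demo = organize_data_windows(np.arange(10.0), window=3, horizon=2)
print(X_demo[0], y_demo[0])  # first window is 0, 1, 2 and its target is 4
```

With `window=3` and `horizon=2`, the first row holds the values at positions 0 to 2 and its target is the value at position 4, two steps past the end of the window.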