Including multiple seasonal terms in Python statsmodels.tsa ARIMA

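I am trying to model a time series in Python, using Python 2.7.11 and the excellent statsmodels.tsa package. My data consist of hourly measurements of traffic intensity over several weeks. The data therefore have multiple seasonal components: days form a 24-hour period, and weeks form a 168-hour period.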

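At this point, the modeling options in statsmodels.tsa are not set up to handle multiple seasonalities, since they only allow a single seasonal factor to be specified. However, I came across Rob Hyndman's work on multiple seasonality in R. He models the seasonal components of a time series with Fourier series, including in the model one Fourier series at the frequency of each seasonal period.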
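For reference, the Fourier-term idea can be sketched in a few lines. This is a minimal illustration, not Hyndman's implementation: the helper name `fourier_terms`, the series length (1008 hours, i.e. six weeks), and the number of harmonics per period are all assumptions chosen for the example.

```python
import numpy as np


def fourier_terms(n_obs, period, k):
    """Return an (n_obs, 2*k) array of sin/cos regressors for one period.

    Column order is sin(1), cos(1), sin(2), cos(2), ... for harmonics 1..k.
    """
    t = np.arange(n_obs, dtype=float)
    columns = []
    for j in range(1, k + 1):
        angle = 2.0 * np.pi * j * t / period
        columns.append(np.sin(angle))
        columns.append(np.cos(angle))
    return np.column_stack(columns)


# Assumed setup: six weeks of hourly data, 3 harmonics for the
# daily cycle and 2 harmonics for the weekly cycle.
n_obs = 1008
exog = np.hstack([
    fourier_terms(n_obs, period=24.0, k=3),
    fourier_terms(n_obs, period=168.0, k=2),
])
```

The resulting matrix has one sine and one cosine column per harmonic and could be passed as exogenous regressors. Note that the 7th harmonic of the 168-hour period coincides with the first harmonic of the 24-hour period, so the weekly `k` should stay below 7 here to avoid duplicated columns.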

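Using Welch's method, I obtained the power spectral density of the signal in my observed time series, extracted the peaks that correspond to the frequencies of the seasonal effects I expected, and used those frequencies and amplitudes to generate sine waves corresponding to the seasonal trends I expect in my data. As an aside, I believe this lets me bypass Hyndman's step of selecting the value of k based on AIC, because I am using the signal inherent in the observed data.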
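The peak-extraction step can be reproduced on synthetic data to check that it finds the expected periods. The sketch below is an assumption-laden toy, not the traffic data: the series is two sinusoids plus noise, `fs=1.0` expresses frequency in cycles per hour rather than Hz, and the segment length is picked so that both seasonal frequencies fall exactly on a frequency bin.

```python
import numpy as np
from scipy.signal import welch

# Synthetic hourly series (assumed values): a daily and a weekly
# sinusoid plus noise, six weeks long.
rng = np.random.RandomState(0)
t = np.arange(1008, dtype=float)
series = (3.0 * np.sin(2 * np.pi * t / 24.0)
          + 2.5 * np.sin(2 * np.pi * t / 168.0)
          + rng.normal(scale=0.5, size=t.size))

# fs=1.0 means one sample per hour, so freqs are in cycles per hour.
# nperseg=504 (three weeks) puts 1/24 and 1/168 on exact bins.
freqs, power = welch(series, fs=1.0, nperseg=504, noverlap=252,
                     scaling="spectrum")

# Take the two largest bins, skipping the zero-frequency bin.
top = np.argsort(power[1:])[-2:] + 1
periods = np.sort(1.0 / freqs[top])
```

Here `periods` comes out as the 24-hour and 168-hour cycles. One caveat when simply taking the n largest bins: the window spreads each peak into its neighbouring bins, so the neighbours of a strong peak can outrank a weaker genuine peak. Checking that the selected frequencies are not adjacent is a cheap safeguard.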

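To make sure the sine waves line up with the seasonal patterns in the data, I matched the peaks of both sine waves to the peaks in the observed data: I visually selected a peak within a 24-hour cycle and matched the time at which it occurs to the highest value of the variable representing the sine wave. Before doing so, I checked that the daily peak consistently occurs at the same hour.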
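An alternative to aligning peaks by eye, offered here only as a sketch: a sine and a cosine at the same frequency together can represent a sinusoid of any phase, because A*sin(wt + p) = A*cos(p)*sin(wt) + A*sin(p)*cos(wt). Regressing on both lets the fit estimate amplitude and phase. The amplitude, phase, and series length below are made-up values for illustration.

```python
import numpy as np

# Synthetic daily cycle with an arbitrary phase shift (assumed values).
t = np.arange(240, dtype=float)  # ten days of hourly samples
true_amp, true_phase = 2.0, 0.9
y = true_amp * np.sin(2 * np.pi * t / 24.0 + true_phase)

# Regress on a sine and a cosine at the daily frequency.
X = np.column_stack([np.sin(2 * np.pi * t / 24.0),
                     np.cos(2 * np.pi * t / 24.0)])
(b_sin, b_cos), _, _, _ = np.linalg.lstsq(X, y, rcond=None)

# b_sin = A*cos(p) and b_cos = A*sin(p), so both can be recovered.
est_amp = np.hypot(b_sin, b_cos)
est_phase = np.arctan2(b_cos, b_sin)
```

With a sine-only regressor, by contrast, any error in the manual alignment stays in the model, since a single coefficient can only rescale the wave and cannot shift it.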

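So far, so good: the sine waves plotted with the obtained frequencies and amplitudes roughly match the observed data. I then fitted an ARIMA(2,0,0) model, including the two decomposition-based variables as exogenous variables. At this point I wanted to test the predictive utility of the model. However, this is where things get complicated.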

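When I use ARIMA from the statsmodels package, the estimates I get from the fitted model form a pattern that replicates the sine waves, but with a range of values that matches my observations. A lot of variance remains in the observations that is not explained by the seasonal trends, which leads me to believe that something in the model-fitting procedure is not working as intended.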

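Unfortunately, I am not well-versed enough in the art of time series modeling to know whether my unexpected results are due to the nature of the exogenous variables I am including, to statsmodels functionality that I should be using but am overlooking, or to wrong assumptions about the concept of seasonal trends.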

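I have a few specific questions:

Is it possible to include multiple seasonal trends (i.e. Fourier-based or decomposition-based) in an ARIMA model using statsmodels in Python?
Could reconstructing the seasonal trends with sine waves cause difficulties when the sine waves are included as exogenous variables, as in the model described above and the code below?
Why does the model specified in the code below not produce predictions that match the observed data more closely?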

Any help is much appreciated.

Best wishes, and thanks in advance,

Evert

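P.S.: Apologies if my code sample and data file are too long. Because I am not sure what is causing the unexpected results, I thought I would post the whole thing. Apologies also for not always following PEP8; I am still learning :)

Code sample: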

import os
import re
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.signal import welch
import operator


# Function which plots rolling mean of data set in order to estimate stationarity
# 'timeseries' = Data to be used for ARIMA modeling
#


def plotmean(timeseries, show=0, path=''):
    rolmean = pd.rolling_mean(timeseries, window=12)
    rolstd = pd.rolling_std(timeseries, window=12)
    fig = plt.figure(figsize=(12, 8))
    orig = plt.plot(timeseries, color='blue', label='Observed scores')
    mean = plt.plot(rolmean, color='red', label='Rolling mean')
    std = plt.plot(rolstd, color='black', label='Rolling SD')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()


#
# Function to decompose a function over time f(t) into a spectrum of signal amplitude and frequency
# 'dta' = The dataset used
# 'show' = Whether or not to show plot
# 'path' = Where to store plot, if desirable
#
# Output:
# frequency range and spectral density range
#


def runwelch(dta, show, path):
    nps = (len(dta) / 2) + 8
    nov = nps / 2
    fft = nps
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    f, Pxx_den = welch(dta, fs=fs_temp, nperseg=nps, noverlap=nov, nfft=fft, scaling="spectrum")
    plt.plot(f, Pxx_den)
    plt.ylim([0.5e-7, 10])
    plt.xlabel('frequency [Hz]')
    plt.ylabel('PSD [V**2/Hz]')
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return f, Pxx_den


#
# Function which gets amplitude and frequency of n most important periodical cycles, and provides plot
# to visually inspect if they correspond to expected seasonal components.
# 'freq' = output of Welch decomposition
# 'density' = output of Welch decomposition
# 'n' = desired number of peaks to extract
# 'show' = whether to show plots of corresponding sine functions


def getsines(n_obs, freq, density, n, show):
    ftemp = freq
    dtemp = density
    fstore = []
    dstore = []
    astore = []
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    samplespace = n_obs * 3600
    for a in range(0, n, 1):
        max_index, max_value = max(enumerate(dtemp), key=operator.itemgetter(1))
        dstore.append(max_value)
        fstore.append(ftemp[max_index])
        astore.append(np.sqrt(max_value))
        dtemp[max_index] = 0
    if show == 1:
        for b in range(0, len(fstore), 1):
            sound_sine = sine(fstore[b], samplespace, fs_temp, astore[b])
            plt.plot(sound_sine)
            plt.show()
            plt.clf()
    return fstore, astore


def sine(freq, time_interval, rate, amp):
    w = 2. * np.pi * freq
    # np.linspace expects an integer sample count, so truncate explicitly
    t = np.linspace(0, time_interval, int(time_interval * rate))
    y = amp * np.sin(w * t)
    return y


#
# Function which adapts the calculated sine waves for the returned sines for k = 1 through k = kmax
# 'dta' = Data set


def buildFterms(dta, fstore, astore):
    n = len(fstore)
    n_obs = len(dta)
    fs_temp = .0002778
    # Set to 1/3600 because of hourly sampling
    samplespace = n_obs * 3600 + (24 * 3600)
    # Add one excess day for later fitting of sine waves to peaks
    store = []
    for i in range(0, n, 1):
        tmp = sine(fstore[i], samplespace, 0.0002778, astore[i])
        store.append(tmp)
    k_168_store = store[0]
    k_24_store = store[1]
    k_24 = np.transpose(k_24_store)
    k_168 = np.transpose(k_168_store)
    k_24 = pd.Series(k_24)
    k_168 = pd.Series(k_168)
    dta_ind, dta_val = max(enumerate(dta.iloc[120:143]), key=operator.itemgetter(1))
    # Visually inspect mean plot, select interval which has clear and representative peak, use to determine index.
    k_24_ind, k_24_val = max(enumerate(k_24.iloc[0:23]), key=operator.itemgetter(1))
    # peak in sound level at index 1 is matched by peak in sine wave at index 7. Thus, sound level[0] corresponds to\
    # sine waves[6]
    # print dta_ind, dta_val, k_24_ind, k_24_val
    k_24_sel = k_24[6:1014]
    k_168_sel = k_168[6:1014]
    exog = pd.concat([k_24_sel, k_168_sel], axis=1)
    return exog


#
# Function which takes data, makes a plot of the ACF and PACF, and saves the plot, if needed
# 'x' = Time series data, time indexed, over which to plot the ACF and PACF.
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
# Use output plot to visually interpret necessary parameters p, d, q, and seasonal component for SARIMAX procedure
#


def plotpacf(x, show=0, path=''):
    dflength = len(x)
    nlags = dflength * .80
    fig = plt.figure(figsize=(12, 8))
    ax1 = fig.add_subplot(211)
    fig = sm.graphics.tsa.plot_acf(x.squeeze(), lags=nlags, ax=ax1)
    ax2 = fig.add_subplot(212)
    fig = sm.graphics.tsa.plot_pacf(x, lags=nlags, ax=ax2)
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()


#
# Function to calculate the Dickey-Fuller test of stationarity
# 'dta' = Time series data, time indexed, over which to test for stationarity using the Dickey-Fuller test.
#

def dftest(dta):
    print 'Results of Dickey-Fuller Test:'
    dftest = sm.tsa.stattools.adfuller(dta, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    if dfoutput[0] < dfoutput[4]:
        dfoutput['Stationary'] = 'True'
    else:
        dfoutput['Stationary'] = 'False'
    print dfoutput


#
# Function to difference the time series, in order to determine optimal value of d for ACF and PACF
# 'dta' = Data, time series indexed, to be differenced
# 'd' = Order of differencing to be applied
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#


def diffit(dta, d, show, path=''):
    templist = []
    for i in range(0, (len(dta) - d), 1):
        tempval = dta[i] - dta[i + d]
        templist.append(tempval)
    y = templist[d:len(templist)]
    y = pd.Series(y)
    plotpacf(y, show, path)
    return y


#
# Function to fit the ARIMA model based on parameters obtained from the ACF / PACF plot
# 'dta' = Time series data, time indexed, over which to fit a SARIMAX model.
# 'exog' = Exogenous variables used in ARIMA model
# 'p' = Number of AutoRegressive lags, initially based on the cutoff point of the ACF
# 'd' = Order of differencing based on visual examination of ACF and PACF plots
# 'q' = Number of Moving Average lags, initially based on the cutoff point of the PACF
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#


def runARIMA(dta, exogvar, p, d, q, show=0, path=''):
    mod = sm.tsa.ARIMA(dta, (p, d, q), exogvar)
    results = mod.fit()
    resids = results.resid.values
    summarised = results.summary()
    print summarised
    plotpacf(resids, show, path)
    return results


#
# Function to use fitted ARIMA for prediction of observed data, compare predicted to observed
# 'dta' = Data used in ARIMA prediction
# 'exog' = Exogenous variables fitted in the model
# 'arima' = Result from correctly fitted ARIMA model, likely on the residuals of a decomposed time series
# 'datrng' = Range of dates used for original time series definition, used for specifying predictions
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#


def ARIMAcompare(dta, exogvar, arima, datrng, show=0, path=''):
    dflength = len(datrng) - 1
    observation = dta
    prediction = arima.predict(start=3, end=dflength, exog=exogvar, dynamic=True)
    df = pd.concat([prediction, observation], axis=1)
    df.columns = ['predicted', 'observed']
    plt.plot(prediction)
    plt.plot(observation)
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return df


#
# Function use fitted ARIMA model for predictions
# 'pred_hours' = number of hours we want to predict scores for
# 'firsttime' = last timestamp in observations
# 'df' = data frame containing data on which the ARIMA model was previously fitted
# 'results' = output of the modeling procedure
# 'freq' = Frequency of seasonal cycle that was used in decomposition
# 'decomp' = Output of the time series decomposition step
# 'mark' = Amount of hours included in the graph prior to prediction. Set at as close to 2 weeks as possible.
# 'show' = Whether or not to show the resulting plot (0 = don't show [default], 1 = show)
# 'path' = A full file path specification indicating whether or not the file should be saved (default = 0, don't save)
#
# Output: A dataframe with observed and predicted values. Note that predictions > 5 time units are considered unreliable
# by modeling standards.
#


def pred(pred_hours, k, df, arima, show=0, path=''):
    n_obs = len(df.index)
    lastdt = df.index[n_obs - 1]
    lastdt = lastdt.to_datetime()
    datrng = pd.date_range(lastdt, periods=(pred_hours + 1), freq='H')
    future = pd.DataFrame(index=datrng, columns=df.columns)
    df = pd.concat([df, future])
    lendf = len(df.index)
    df['predicted'] = arima.predict(start=n_obs, end=lendf, exog=k, dynamic=True)
    print df
    marked = 2 * pred_hours
    df[['predicted', 'observed']].ix[-marked:].plot(figsize=(12, 8))
    if show != 0:
        plt.show()
    if path != '':
        plt.savefig(path, format='png', bbox_inches='tight')
    plt.clf()
    return df[['predicted', 'observed']].ix[-marked:]


dirnow = os.getcwd()
fpath = dirnow + '/sounds_full2.csv'
fhand = open(fpath)
dta = pd.read_csv(fhand, sep=',')
dta_sel = dta.iloc[1248:2256, 2]
#
#
#
# Extract start and end date of measurements from sound data, adding one hour because
# the last hour of the last day is not counted
#
sound_start = dta.iloc[1248, 0]
# The above .iloc value needs to be changed depending on the length of the sound data set being read in.
#
# Establish start date
sound_start = re.sub('-', '/', sound_start)
sound_start = re.sub('_', ' ', sound_start)
sound_start = sound_start + ':00'
sound_start = pd.to_datetime(sound_start, format='%d/%m/%Y %H:%M:%S')
#
# Establish end date
indexer = len(dta.index) - 1
sound_end = dta.iloc[indexer, 0]
sound_end = re.sub('-', '/', sound_end)
sound_end = re.sub('_', ' ', sound_end)
sound_end = sound_end + ':00'
sound_end = pd.to_datetime(sound_end, format='%d/%m/%Y %H:%M:%S')
sound_diff = sound_end - sound_start
#
# Derive number of periods and create data set
num_observed = (sound_diff.days * 24) + ((sound_diff.seconds + 3600) / 3600)
usedates3 = pd.date_range(sound_start, periods=num_observed, freq='H')
usedates3 = pd.Series(usedates3)
usedates3.index = dta_sel.index
timedfreq = pd.concat([usedates3, dta_sel], axis=1)
timedfreq.index = timedfreq.iloc[:, 0]
freqset = pd.Series(timedfreq.iloc[:, 1])
filepath = dirnow + '/Sound_RollingMean.png'
plotmean(freqset, 0, filepath)
# Plotted mean shows recurring (seasonal) trends at periods of 24 hours and 168 hours.
# This means a seasonal model is needed that accounts for both of these influences
# To do so, Fourier series representing the 24- and 168 hour seasonal trends can be added to the ARIMA-model
#
#
#
#
# Check for stationarity of data
#
dftest(freqset)
# Time series can be considered stationary
#
#
#
# Establish frequencies and amplitudes with which to fit ARIMA model
#
# Decompose signal into frequency and amplitude
#
filepath = dirnow + "/Welch.png"
f, Pxx_den = runwelch(freqset, 0, filepath)
#
# Obtain sine wave parameters, optionally view test plots to check periodicity
freqs, amplitudes = getsines(len(freqset), f, Pxx_den, 2, 0)
#
# Use parameters to build Fourier series for observed data with varying values for k
exog_sel = buildFterms(freqset, freqs, amplitudes)
exog_sel.index = freqset.index
#
# fit ARIMA model, plot ACF and PACF for fitted model, check for effects orders of differencing on residuals
#
filepath = dirnow + '/Sound_resid_ACFPACF.png'
Sound_ARIMA = runARIMA(freqset, exog_sel, 1, 0, 0, show=0, path=filepath)
sound_residuals = Sound_ARIMA.resid
#
# Plot various acf / pacf plots of differencing given model residuals
filepath = dirnow + '/Sound_resid_ACFPACF_d1.png'
tempdta_d1 = diffit(sound_residuals, 1, 0, filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_d2.png'
tempdta_d2 = diffit(sound_residuals, 2, 0, filepath)
# Of the two differenced models, one order of differencing seems to yield the best results
# Visual inspection of plots and model output suggests model with p = 2, d = 0 or p = 1, d = 1 to be optimal.
#
#
#
# Find optimal form of model
filepath = dirnow + '/Sound_resid_ACFPACF_200.png'
Sound_ARIMA_200 = runARIMA(freqset, exog_sel, 2, 0, 0, show=0, path=filepath)
filepath = dirnow + '/Sound_resid_ACFPACF_110.png'
Sound_ARIMA_110 = runARIMA(freqset, exog_sel, 1, 1, 0, show=0, path=filepath)
# Based on model output and ACF / PACF plot comparison for 'Sound_resid_ACFPACF_110.png' and \
# 'Sound_resid_ACFPACF_200.png', the model parameters for p = 2, d = 0, q = 0 are closer to optimal.
#
# Use selected model to predict observed values
filepath = dirnow + '/Sound_PredictObserved.png'
sound_comparison = ARIMAcompare(freqset, exog_sel, Sound_ARIMA_200, usedates3, 0, filepath)
#
# Predict values and store for Sound dataset
filepath = dirnow + '/Sound_PredictFuture.png'
sound_storepred = pred(168, exog_sel.iloc[0:170, :], sound_comparison, Sound_ARIMA_200, 0, filepath)
