Last time, we explored our time series further and concluded that several hints suggest it can be modelled as an autoregressive (AR) model of order 7.
Let’s create the AR model object:
import matplotlib.pyplot as plt
from statsmodels.tsa.ar_model import AutoReg

# start from a clean figure
plt.clf()
plt.close()

# AR(7): each sample is regressed on the previous 7 samples
future = AutoReg(series, lags=7)
future_fit = future.fit()
After fitting the model, we can look at its summary:
>>> future_fit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
AutoReg Model Results
==============================================================================
Dep. Variable: pps No. Observations: 63
Model: AutoReg(7) Log Likelihood -409.435
Method: Conditional MLE S.D. of innovations 362.272
Date: Tue, 06 Feb 2024 AIC 12.106
Time: 07:00:59 BIC 12.432
Sample: 01-22-2024 HQIC 12.232
- 01-22-2024
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
intercept 1.129e+04 2882.429 3.918 0.000 5644.467 1.69e+04
pps.L1 -0.4296 0.110 -3.904 0.000 -0.645 -0.214
pps.L2 -0.4301 0.110 -3.905 0.000 -0.646 -0.214
pps.L3 -0.4302 0.110 -3.908 0.000 -0.646 -0.214
pps.L4 -0.4296 0.110 -3.902 0.000 -0.645 -0.214
pps.L5 -0.4310 0.110 -3.916 0.000 -0.647 -0.215
pps.L6 -0.4279 0.110 -3.885 0.000 -0.644 -0.212
pps.L7 0.5676 0.110 5.159 0.000 0.352 0.783
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 -0.9020 -0.4341j 1.0010 -0.4286
AR.2 -0.9020 +0.4341j 1.0010 0.4286
AR.3 -0.2228 -0.9753j 1.0004 -0.2857
AR.4 -0.2228 +0.9753j 1.0004 0.2857
AR.5 0.6236 -0.7820j 1.0001 -0.1429
AR.6 0.6236 +0.7820j 1.0001 0.1429
AR.7 1.7564 -0.0000j 1.7564 -0.0000
-----------------------------------------------------------------------------
"""
The summary contains very useful information and tells us how the software builds the function y(t) that models our phenomenon.
Recall that an AR model of order n can be written as y(t) = const + white_noise + alpha_1*y(t-1) + … + alpha_n*y(t-n).
The summary gives us those alphas! For example, alpha_1 is -0.4296 (pps.L1), and so on.
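As a sanity check on the notation, the formula above can be sketched as a plain Python function (the function name and the numbers in the usage line are made up, not taken from our series):

```python
def ar_predict(const, alphas, history):
    """Next value of an AR(n) process, ignoring the white-noise term.

    history[0] must be y(t-1), history[1] must be y(t-2), and so on.
    """
    return const + sum(a * y for a, y in zip(alphas, history))

# toy AR(2): y(t) = 1.0 + 0.5*y(t-1) + 0.25*y(t-2)
ar_predict(1.0, [0.5, 0.25], [4.0, 8.0])  # 1.0 + 2.0 + 2.0 = 5.0
```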
There are other interesting values like AIC and BIC but we will ignore them for now.
Params can also be extracted as follows:
>>> future_fit.params
intercept 11293.922977
pps.L1 -0.429555
pps.L2 -0.430062
pps.L3 -0.430156
pps.L4 -0.429636
pps.L5 -0.431049
pps.L6 -0.427854
pps.L7 0.567574
dtype: float64
The intercept is the constant term of the formula (the white-noise term is a random shock, so there is no single fitted value for it).
Now, let’s do some math.
>>> p = future_fit.params.to_list()[1:]
>>> d = list(series[-7:].to_dict()['pps'].values())
>>> p
[-0.4295547793075185, -0.4300618439790047, -0.4301563158147217, -0.42963563769891255, -0.4310488915892021, -0.4278542615375023, 0.5675738415555855]
>>> d
[0, 0, 850, 8500, 16978, 0, 0]
>>> d.reverse()
>>> d
[0, 0, 16978, 8500, 850, 0, 0]
>>> tot = 0
>>> for x in range(len(d)):
...     tot += p[x] * d[x]
...
>>> tot + 11293.922977
-27.56543119392336
What do we do?
- we get the list of alpha parameters (dropping the intercept)
- we get the list of the last seven samples (order 7)
- we reverse the list of the last seven samples so that the first element is the most recent one
- with a for loop, we compute sum(p[n]*d[n]): this is the AR(7) formula, “predicting” y(t) from y(t-1), …, y(t-7)
- we add the intercept
- the final result is about -27, which is a “weird” value for pps, but ok…
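The loop above is just a dot product. Reusing the same p and d values printed earlier, the whole manual computation collapses to one line:

```python
import numpy as np

# alpha coefficients, as printed by future_fit.params (intercept excluded)
p = [-0.4295547793075185, -0.4300618439790047, -0.4301563158147217,
     -0.42963563769891255, -0.4310488915892021, -0.4278542615375023,
     0.5675738415555855]
# last 7 samples, most recent first (the reversed d list)
d = [0, 0, 16978, 8500, 850, 0, 0]
intercept = 11293.922977

# same result as the for loop: about -27.5654
prediction = np.dot(p, d) + intercept
```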
Then, we use statsmodels to predict future values:
>>> forecast = future_fit.forecast(steps=12)
>>> forecast
2024-01-22 03:16:00 -27.565431
2024-01-22 03:17:00 -16.181718
2024-01-22 03:18:00 840.057212
2024-01-22 03:19:00 8512.156971
2024-01-22 03:20:00 16931.281156
2024-01-22 03:21:00 17.734689
2024-01-22 03:22:00 1.099462
2024-01-22 03:23:00 -25.227558
2024-01-22 03:24:00 -15.385912
2024-01-22 03:25:00 839.910973
2024-01-22 03:26:00 8529.643842
2024-01-22 03:27:00 16887.907546
Freq: T, dtype: float64
Check the first predicted value: it is exactly the one we obtained by computing the next value manually!
Last, we plot both the sampled values and the predicted ones:
plt.plot(series, label='Actual')
plt.plot(forecast, label='Predicted')
plt.legend()
plt.savefig('7lags_prediction.png')
And we obtain this:
Look at that plot. The prediction looks great. The seasonality is there, suggesting the model really is able to fit our data and reliably predict the future.
True, we know the first predicted value was negative, but overall this shows the AR model gives us something worth a chance.
To understand the importance of the order, I did the same with order 8:
Still good 🙂
or order 6:
good again 🙂
but with order 4:
the prediction gets noticeably worse.
Orders 6, 7 and 8 looked pretty good so I tried to get more insights:
import numpy as np

# actual values: the last 7 samples we hold out from training
actual = np.array(list(series[-7:].to_dict()['pps'].values()))

abserr = {}
for lags in (6, 7, 8):
    # fit on everything except the last 7 samples, then predict those 7
    fit = AutoReg(series[:-7], lags=lags).fit()
    forecast = np.array(fit.forecast(steps=7).to_list())
    abserr[lags] = np.abs(forecast - actual)

e6abs, e7abs, e8abs = abserr[6], abserr[7], abserr[8]

plt.close()
plt.clf()
xpoints = np.arange(7)
plt.plot(xpoints, e6abs, label='6 lags')
plt.plot(xpoints, e7abs, label='7 lags')
plt.plot(xpoints, e8abs, label='8 lags')
plt.legend()
plt.savefig('abserr_6vs7vs8_prediction.png')
- for each lag value, I created a model excluding the last 7 samples and predicted 7 samples
- this way, I can compare the 7 predicted samples with the actual 7 values
- I computed the absolute error between each predicted sample and its actual value, and plotted the errors for all three lag values (6, 7, 8)
My idea was to see whether one model has lower errors, meaning it is “closer” to reality:
It does not look like that.
The AR(7) and AR(8) models seem pretty similar.
AR(6) was less precise for the first and last predictions but more precise in the middle.
At this point, I did this:
>>> from statistics import mean
>>> mean(e6abs)
84.3659734121268
>>> mean(e7abs)
64.69594873334134
>>> mean(e8abs)
63.983032787370384
I computed the average of the errors. Averages are always a bit risky, as they can “smooth” things too much (a single very high or very low outlier can pollute the final result), but it is worth having a look.
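A toy example (made-up numbers, nothing to do with our series) shows how a single outlier drags the mean, while the median barely moves:

```python
from statistics import mean, median

errors = [1, 2, 3, 100]  # one extreme outlier
mean(errors)    # 26.5 -> dominated by the outlier
median(errors)  # 2.5  -> robust to it
```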
Based on those numbers, order 8 seems to be the one leading to the lowest average error.
After all, if we recall the ACF/PACF plots,
the PACF cuts off at lag 8, suggesting order 8 is the right one… even though the seasonality has period 7.
Anyhow, both models, AR(7) and AR(8), made good predictions (the difference is very small), which shows that picking the right order is not so straightforward. Are there other indicators that could help us choose the right order? Yes, there are… but next time.
Ciao
IoSonoUmberto