Last time, we explored our time series further and concluded that several hints suggest it can be modelled as an autoregressive (AR) model of order 7.
Let’s create the AR model object:
import matplotlib.pyplot as plt
from statsmodels.tsa.ar_model import AutoReg

# start from a clean figure
plt.clf()
plt.close()

# AR(7): each sample is regressed on the previous 7 samples
future = AutoReg(series, lags=7)
future_fit = future.fit()
After fitting the model, we can look at its summary:
>>> future_fit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
AutoReg Model Results
==============================================================================
Dep. Variable: pps No. Observations: 63
Model: AutoReg(7) Log Likelihood -409.435
Method: Conditional MLE S.D. of innovations 362.272
Date: Tue, 06 Feb 2024 AIC 12.106
Time: 07:00:59 BIC 12.432
Sample: 01-22-2024 HQIC 12.232
- 01-22-2024
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
intercept 1.129e+04 2882.429 3.918 0.000 5644.467 1.69e+04
pps.L1 -0.4296 0.110 -3.904 0.000 -0.645 -0.214
pps.L2 -0.4301 0.110 -3.905 0.000 -0.646 -0.214
pps.L3 -0.4302 0.110 -3.908 0.000 -0.646 -0.214
pps.L4 -0.4296 0.110 -3.902 0.000 -0.645 -0.214
pps.L5 -0.4310 0.110 -3.916 0.000 -0.647 -0.215
pps.L6 -0.4279 0.110 -3.885 0.000 -0.644 -0.212
pps.L7 0.5676 0.110 5.159 0.000 0.352 0.783
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 -0.9020 -0.4341j 1.0010 -0.4286
AR.2 -0.9020 +0.4341j 1.0010 0.4286
AR.3 -0.2228 -0.9753j 1.0004 -0.2857
AR.4 -0.2228 +0.9753j 1.0004 0.2857
AR.5 0.6236 -0.7820j 1.0001 -0.1429
AR.6 0.6236 +0.7820j 1.0001 0.1429
AR.7 1.7564 -0.0000j 1.7564 -0.0000
-----------------------------------------------------------------------------
"""
The summary contains very useful information and tells us how the software builds the function y(t) that models our phenomenon.
Recall that an AR model of order n can be written as y(t) = const + white_noise + alpha_1*y(t-1) + … + alpha_n*y(t-n).
The summary gives us those alphas! For example, alpha_1 is -0.4296 (pps.L1), and so on.
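As a sanity check on the notation, the formula above can be sketched as a plain Python function (the function name and the numbers in the usage line are made up, not taken from our series):

```python
def ar_predict(const, alphas, history):
    """Next value of an AR(n) process, ignoring the white-noise term.

    history[0] must be y(t-1), history[1] must be y(t-2), and so on.
    """
    return const + sum(a * y for a, y in zip(alphas, history))

# toy AR(2): y(t) = 1.0 + 0.5*y(t-1) + 0.25*y(t-2)
ar_predict(1.0, [0.5, 0.25], [4.0, 8.0])  # 1.0 + 2.0 + 2.0 = 5.0
```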
There are other interesting values like AIC and BIC but we will ignore them for now.
Params can also be extracted as follows:
>>> future_fit.params
intercept 11293.922977
pps.L1 -0.429555
pps.L2 -0.430062
pps.L3 -0.430156
pps.L4 -0.429636
pps.L5 -0.431049
pps.L6 -0.427854
pps.L7 0.567574
dtype: float64
The intercept is the constant term of the formula (the white-noise term is a random shock, so there is no single fitted value for it).
Now, let’s do some math.
>>> p = future_fit.params.to_list()[1:]
>>> d = list(series[-7:].to_dict()['pps'].values())
>>> p
[-0.4295547793075185, -0.4300618439790047, -0.4301563158147217, -0.42963563769891255, -0.4310488915892021, -0.4278542615375023, 0.5675738415555855]
>>> d
[0, 0, 850, 8500, 16978, 0, 0]
>>> d.reverse()
>>> d
[0, 0, 16978, 8500, 850, 0, 0]
>>> tot = 0
>>> for x in range(len(d)):
...     tot += p[x] * d[x]
...
>>> tot + 11293.922977
-27.56543119392336
What do we do?
- we get the list of alpha parameters (dropping the intercept)
- we get the list of the last seven samples (order 7)
- we reverse the list of the last seven samples so that the first element is the most recent one
- with a for loop, we compute sum(p[n]*d[n]): this is the AR(7) formula, “predicting” y(t) from y(t-1), …, y(t-7)
- we add the intercept
- the final result is about -27, which is a “weird” value for pps, but ok…
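The loop above is just a dot product. Reusing the same p and d values printed earlier, the whole manual computation collapses to one line:

```python
import numpy as np

# alpha coefficients, as printed by future_fit.params (intercept excluded)
p = [-0.4295547793075185, -0.4300618439790047, -0.4301563158147217,
     -0.42963563769891255, -0.4310488915892021, -0.4278542615375023,
     0.5675738415555855]
# last 7 samples, most recent first (the reversed d list)
d = [0, 0, 16978, 8500, 850, 0, 0]
intercept = 11293.922977

# same result as the for loop: about -27.5654
prediction = np.dot(p, d) + intercept
```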
Then, we use statsmodels to predict future values:
>>> forecast = future_fit.forecast(steps=12)
>>> forecast
2024-01-22 03:16:00 -27.565431
2024-01-22 03:17:00 -16.181718
2024-01-22 03:18:00 840.057212
2024-01-22 03:19:00 8512.156971
2024-01-22 03:20:00 16931.281156
2024-01-22 03:21:00 17.734689
2024-01-22 03:22:00 1.099462
2024-01-22 03:23:00 -25.227558
2024-01-22 03:24:00 -15.385912
2024-01-22 03:25:00 839.910973
2024-01-22 03:26:00 8529.643842
2024-01-22 03:27:00 16887.907546
Freq: T, dtype: float64
Check the first predicted value: it is exactly the one we obtained by computing the next value manually!
Last, we plot both the sampled values and the predicted ones:
plt.plot(series, label='Actual')
plt.plot(forecast, label='Predicted')
plt.legend()
plt.savefig('7lags_prediction.png')
And we obtain this:
Look at that plot. The prediction looks great. The seasonality is there, suggesting the model really is able to fit our data and reliably predict the future.
True, we know the first predicted value was negative, but overall this shows the AR model gives us something worth a chance.
To understand the importance of the order, I did the same with order 8:
Still good 🙂
or order 6:
good again 🙂
but with order 4:
the prediction gets noticeably worse.
Orders 6, 7 and 8 looked pretty good so I tried to get more insights:
import numpy as np

# actual values: the last 7 samples we hold out from training
actual = np.array(list(series[-7:].to_dict()['pps'].values()))

abserr = {}
for lags in (6, 7, 8):
    # fit on everything except the last 7 samples, then predict those 7
    fit = AutoReg(series[:-7], lags=lags).fit()
    forecast = np.array(fit.forecast(steps=7).to_list())
    abserr[lags] = np.abs(forecast - actual)

e6abs, e7abs, e8abs = abserr[6], abserr[7], abserr[8]

plt.close()
plt.clf()
xpoints = np.arange(7)
plt.plot(xpoints, e6abs, label='6 lags')
plt.plot(xpoints, e7abs, label='7 lags')
plt.plot(xpoints, e8abs, label='8 lags')
plt.legend()
plt.savefig('abserr_6vs7vs8_prediction.png')
- for each lag value, I created a model excluding the last 7 samples and predicted 7 samples
- this way, I can compare the 7 predicted samples with the actual 7 values
- I computed the absolute error between each predicted sample and its actual value, and plotted the errors for all three lag values (6, 7, 8)
My idea was to see whether one model has lower errors, meaning it is “closer” to reality:
It does not look like that.
The AR(7) and AR(8) models seem pretty similar.
AR(6) was less precise for the first and last predictions but more precise in the middle.
At this point, I did this:
>>> from statistics import mean
>>> mean(e6abs)
84.3659734121268
>>> mean(e7abs)
64.69594873334134
>>> mean(e8abs)
63.983032787370384
I computed the average of the errors. Averages are always a bit risky, as they can “smooth” things too much (a single very high or very low outlier can pollute the final result), but it is worth having a look.
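A toy example (made-up numbers, nothing to do with our series) shows how a single outlier drags the mean, while the median barely moves:

```python
from statistics import mean, median

errors = [1, 2, 3, 100]  # one extreme outlier
mean(errors)    # 26.5 -> dominated by the outlier
median(errors)  # 2.5  -> robust to it
```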
Based on those numbers, order 8 seems to be the one leading to the lowest average error.
After all, if we recall the ACF/PACF plots,
the PACF cuts off at lag 8, suggesting order 8 is the right one… even though the seasonality has period 7.
Anyhow, both models, AR(7) and AR(8), made good predictions (the difference is very small), which shows that picking the right order is not so straightforward. Are there other indicators that could help us choose the right order? Yes, there are… but next time.
Ciao
IoSonoUmberto