A few ways to find the AR/MA model order

Last time, I built an AR model to predict the future of a time series.

A key element to find out was the order of the model. We picked 7 as that was the “period” of our series.

Anyhow, in the last part of the post, we started having the feeling that, maybe, another order could be better.

For this reason, I’ve tried to explore other ways to select the right order.

The first option is to build multiple models with different orders and compute their AIC/BIC values. The order that gives the lowest AIC/BIC values might be the right candidate.

This is the code to compute AIC/BIC for multiple orders:

import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.ar_model import AutoReg

plt.clf()
plt.close()
max_order = 10
aic_values = []
bic_values = []
# fit an AR(p) model for each candidate order and record its AIC/BIC
for p in range(1, max_order + 1):
    model = AutoReg(series, lags=p)
    result = model.fit()
    aic_values.append(result.aic)
    bic_values.append(result.bic)
order = range(1, max_order + 1)
plt.plot(order, aic_values, marker='o', label='AIC')
plt.plot(order, bic_values, marker='o', label='BIC')
plt.xlabel('Order (p)')
plt.ylabel('Information Criterion Value')
plt.title('Comparison of AIC and BIC')
plt.legend()
plt.savefig('aic_bic_ipps.png')

We can visualize the data:

As we can see, the lowest values are at order 8. This is confirmed by getting the index of the minimum in the AIC/BIC lists (+1, since orders start at 1 while list indices start at 0):

>>> aic_values.index(min(aic_values)) + 1
8
>>> bic_values.index(min(bic_values)) + 1
8

Another way is to use the ar_select_order function:

>>> from statsmodels.tsa.ar_model import ar_select_order
>>> sel = ar_select_order(series, 13)
>>> sel.ar_lags
array([1, 2, 3, 4, 5, 6, 7, 8])
>>> sel.ar_lags[-1]
8

Yet another method is to leverage the Ljung-Box test (and the Box-Pierce one):

from statsmodels.tsa.ar_model import AutoReg
from statsmodels.stats.diagnostic import acorr_ljungbox

max_order = 10
lbp_values = []
bpp_values = []
# for each candidate order, fit an AR(p) model and run the Ljung-Box and
# Box-Pierce tests on its residuals, keeping the resulting p-values
for p in range(1, max_order + 1):
    model = AutoReg(series, lags=p)
    result = acorr_ljungbox(model.fit().resid, lags=[p], return_df=True, boxpierce=True)
    lbp_values.append(result.iloc[0, 1])  # Ljung-Box p-value
    bpp_values.append(result.iloc[0, 3])  # Box-Pierce p-value
threshold = 0.05
lb_best_order = lbp_values.index(min(lbp_values)) + 1
bp_best_order = bpp_values.index(min(bpp_values)) + 1
print("LB P Values: ", lbp_values)
print("LB Selected Order (p):", lb_best_order)
print("LB Valid order:", min(lbp_values) < threshold)
print("BP P Values: ", bpp_values)
print("BP Selected Order (p):", bp_best_order)
print("BP Valid order:", min(bpp_values) < threshold)

Again, we compute p-values for multiple orders and pick the order with the lowest one.

The result is:

LB Selected Order (p): 6
LB Valid order: True
BP Selected Order (p): 6
BP Valid order: True

Result is 6 this time!

This tells us something pretty important. Different methods give us different results. There is no correct order “no matter what”!

We can use all these techniques to get an idea of “good orders”. If the same order always appears, that is a good indication it is the right one.

Otherwise, when different values are given as potentially “good orders”, the only thing to do is to try them all and see which model fits reality best.

At the end of the day, we are trying to forecast the future and what do we normally do with our everyday life? Trial and error. This is not so different 🙂

Ciao
IoSonoUmberto

Predicting the future with an AR model

Last time, we discovered more about our time series and concluded that several hints suggest it might be modelled with an autoregressive (AR) model of order 7.

Let’s create the AR model object:

from statsmodels.tsa.ar_model import AutoReg
import matplotlib.pyplot as plt

plt.clf()
plt.close()
future = AutoReg(series, lags=7)
future_fit = future.fit()

After fitting the model, we can look at its summary:

>>> future_fit.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
                            AutoReg Model Results
==============================================================================
Dep. Variable:                    pps   No. Observations:                   63
Model:                     AutoReg(7)   Log Likelihood                -409.435
Method:               Conditional MLE   S.D. of innovations            362.272
Date:                Tue, 06 Feb 2024   AIC                             12.106
Time:                        07:00:59   BIC                             12.432
Sample:                    01-22-2024   HQIC                            12.232
                         - 01-22-2024
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
intercept   1.129e+04   2882.429      3.918      0.000    5644.467    1.69e+04
pps.L1        -0.4296      0.110     -3.904      0.000      -0.645      -0.214
pps.L2        -0.4301      0.110     -3.905      0.000      -0.646      -0.214
pps.L3        -0.4302      0.110     -3.908      0.000      -0.646      -0.214
pps.L4        -0.4296      0.110     -3.902      0.000      -0.645      -0.214
pps.L5        -0.4310      0.110     -3.916      0.000      -0.647      -0.215
pps.L6        -0.4279      0.110     -3.885      0.000      -0.644      -0.212
pps.L7         0.5676      0.110      5.159      0.000       0.352       0.783
                                    Roots
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1           -0.9020           -0.4341j            1.0010           -0.4286
AR.2           -0.9020           +0.4341j            1.0010            0.4286
AR.3           -0.2228           -0.9753j            1.0004           -0.2857
AR.4           -0.2228           +0.9753j            1.0004            0.2857
AR.5            0.6236           -0.7820j            1.0001           -0.1429
AR.6            0.6236           +0.7820j            1.0001            0.1429
AR.7            1.7564           -0.0000j            1.7564           -0.0000
-----------------------------------------------------------------------------
"""

It contains very useful information and tells us how the software builds the function y(t) representing our phenomenon.

Remember an AR model of order n can be represented as y(t) = const + white_noise + alpha_1*y(t-1) + … + alpha_n*y(t-n).

The summary tells us those alphas! For example, alpha_1 is -0.4296 (pps.L1), and so on.

There are other interesting values like AIC and BIC but we will ignore them for now.

Params can also be extracted as follows:

>>> future_fit.params

intercept 11293.922977
pps.L1 -0.429555
pps.L2 -0.430062
pps.L3 -0.430156
pps.L4 -0.429636
pps.L5 -0.431049
pps.L6 -0.427854
pps.L7 0.567574
dtype: float64

Intercept is the constant (or we might consider it constant + white noise).
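Plugging the fitted numbers into the formula above (rounding the coefficients), our AR(7) model becomes:

y(t) = 11293.92 + wn - 0.4296*y(t-1) - 0.4301*y(t-2) - 0.4302*y(t-3) - 0.4296*y(t-4) - 0.4310*y(t-5) - 0.4279*y(t-6) + 0.5676*y(t-7)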

Now, let’s do some math.

>>> p=future_fit.params.to_list()[1:]

>>> d=list(series[-7:].to_dict()['pps'].values())
>>>
>>>
>>> p
[-0.4295547793075185, -0.4300618439790047, -0.4301563158147217, -0.42963563769891255, -0.4310488915892021, -0.4278542615375023, 0.5675738415555855]
>>> d
[0, 0, 850, 8500, 16978, 0, 0]
>>>
>>> d.reverse()
>>> d
[0, 0, 16978, 8500, 850, 0, 0]
>>>
>>>
>>>
>>> tot=0
>>> for x in range(0, len(d)):
...     tot += p[x]*d[x]
...
>>> tot+11293.922977
-27.56543119392336

What do we do?

  • we get the list of alpha parameters
  • we get the list of the last seven samples (order 7)
  • we reverse the list of the last seven samples so that the first element is the latest one
  • with a for loop, we compute sum(p[n]*d[n]). Basically, we compute the next element of the series: we “predict” y(t) using y(t-1),…,y(t-7) with the AR(7) formula (a more compact version is sketched right after this list)
  • we add the intercept
  • the final result is about -27, which is a bit “weird” for a pps value, but ok…
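The same one-step computation can also be written more compactly with numpy. This is just a sketch of the math above, assuming future_fit and series are the objects built earlier:

import numpy as np

params = future_fit.params
# last 7 samples, newest first, to match pps.L1 ... pps.L7
last7 = series['pps'].to_numpy()[-7:][::-1]
next_val = params['intercept'] + np.dot(params.iloc[1:].to_numpy(), last7)
print(next_val)  # about -27.56, same as the manual computation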

Then, we use Python to predict future values:

>>> forecast = future_fit.forecast(steps=12)

>>> forecast
2024-01-22 03:16:00 -27.565431
2024-01-22 03:17:00 -16.181718
2024-01-22 03:18:00 840.057212
2024-01-22 03:19:00 8512.156971
2024-01-22 03:20:00 16931.281156
2024-01-22 03:21:00 17.734689
2024-01-22 03:22:00 1.099462
2024-01-22 03:23:00 -25.227558
2024-01-22 03:24:00 -15.385912
2024-01-22 03:25:00 839.910973
2024-01-22 03:26:00 8529.643842
2024-01-22 03:27:00 16887.907546
Freq: T, dtype: float64

Check the first predicted value… it is the one we obtained by manually computing the first next value!
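For the following steps, the model keeps applying the same formula, feeding its own predictions back in as the newest “samples”. A rough sketch of the idea (not the actual statsmodels implementation), reusing the p and d lists built earlier:

history = d[:]  # newest value first, as built above
preds = []
for _ in range(12):
    nxt = 11293.922977 + sum(p[i] * history[i] for i in range(7))
    preds.append(nxt)
    # the prediction becomes the newest "sample" for the next step
    history = [nxt] + history[:-1]
print(preds[:2])  # roughly -27.57 and -16.18, matching the forecast above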

Last, we plot both the sampled values and the predicted ones:

plt.plot(series, label='Actual')
plt.plot(forecast, label='Predicted')
plt.legend()
plt.savefig('7lags_prediction.png')

And we obtain this:

Look at that plot. The prediction looks great. The seasonality is there, suggesting the model really is able to fit our data and reliably predict the future.

Still, we know the first predicted value was negative but… ok, this still shows that this AR model gives us something worth a chance.

To understand the importance of the order, I did the same with order 8:

Still good 🙂

or order 6:

good again 🙂

but with order 4:

the prediction gets much worse.

Orders 6, 7 and 8 looked pretty good so I tried to get more insights:

from statsmodels.tsa.ar_model import AutoReg
import numpy as np
import matplotlib.pyplot as plt

# the actual last 7 samples, used as the reference to compare forecasts against
actual = np.array(list(series[-7:].to_dict()['pps'].values()))

# AR(6): fit on the series minus the last 7 samples, then forecast 7 steps
future = AutoReg(series[:-7], lags=6).fit()
forecast = future.forecast(steps=7).to_list()
e6 = np.subtract(np.array(forecast), actual)
e6abs = [abs(ele) for ele in e6]

# AR(7)
future = AutoReg(series[:-7], lags=7).fit()
forecast = future.forecast(steps=7).to_list()
e7 = np.subtract(np.array(forecast), actual)
e7abs = [abs(ele) for ele in e7]

# AR(8)
future = AutoReg(series[:-7], lags=8).fit()
forecast = future.forecast(steps=7).to_list()
e8 = np.subtract(np.array(forecast), actual)
e8abs = [abs(ele) for ele in e8]

plt.close()
plt.clf()
xpoints = np.arange(7)
plt.plot(xpoints, e6abs, label='6 lags')
plt.plot(xpoints, e7abs, label='7 lags')
plt.plot(xpoints, e8abs, label='8 lags')
plt.legend()
plt.savefig('abserr_6vs7vs8_prediction.png')

  • for each lag value, I created a model excluding the last 7 samples and predicted 7 samples
  • this way, I can compare the 7 predicted samples with the actual 7 values
  • I computed the absolute error between each predicted sample and its actual value and plotted it for all three lag values (6, 7, 8)

My idea was to see if one model has lower error values, meaning it is “closer” to reality:

It does not look like that.
The AR7 and AR8 models seem pretty similar.
AR6 was less precise for the first and last predictions but more precise “in the middle”.

At this point, I did this:

>>> from statistics import mean

>>> mean(e6abs)
84.3659734121268
>>> mean(e7abs)
64.69594873334134
>>> mean(e8abs)
63.983032787370384

I computed the average of the errors. Averages are always a bit risky as they can “smooth” things too much (a single very high/low outlier might pollute the final result), but it is worth having a look at the results.
Based on those numbers, order 8 seems to be the one leading to the lowest average error.
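Since a single outlier can skew an average, a quick complementary check (just an idea, not part of the original comparison) is to also look at the median and the maximum of the same absolute-error lists:

from statistics import median

print(median(e6abs), median(e7abs), median(e8abs))
print(max(e6abs), max(e7abs), max(e8abs))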

After all, if we recall the ACF/PACF plots:

PACF cutoff is at 8, suggesting order 8 is the right one… even if seasonality has period 7.

Anyhow, both models, AR7 and AR8, made good predictions (the difference is very small) and this shows us that picking the right order is not so straightforward. Are there other indicators that could help us choose the right order? Yes, there are…but next time

Ciao
IoSonoUmberto

To predict the future we need a model

Last time, we started looking at our series.

We have looked at the correlation between the series and some of its lagged versions and verified whether the series was stationary.

Our goal is to find a model able to forecast future values of our series.

We said we want to understand whether to use an Auto Regressive model or a Moving Average model (there are other models but to start let’s focus on these two).

We also saw that our series is not stationary, while stationarity is, in theory, a condition needed to use one of those models.
Anyhow, we are going to try anyway.

Before jumping to the AR vs MA choice, let’s quickly see how AR and MA models work.

Simply put, those models are nothing more than a mathematical formula describing how series values evolve over time. Actually, it is not just a generic “evolution over time”: these models also describe how a value at time t depends on one or more previous values.

On the internet, it is easy to bump into this image describing an AR model:

Basically, what the formula tells us is that the value at a given time depends on previous values plus a constant and some white noise. Moreover, each previous value is multiplied by a certain parameter.

One simple question arises: how many “past” values do I consider? It depends; that’s something we have to find out. We have to determine the so-called order of the model. Order p means considering p past values to predict the next value.
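In plain text, an AR model of order p can be written as:

y(t) = const + white_noise + alpha_1*y(t-1) + alpha_2*y(t-2) + ... + alpha_p*y(t-p)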

The moving average formula looks similar:

The main difference is that we do not consider past values but past errors. Moreover, the constant is replaced by the mean of the series. q represents the order of the model; it is the equivalent of p for an AR model.
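Written in the same plain-text style, an MA model of order q looks something like:

y(t) = mean + e(t) + theta_1*e(t-1) + theta_2*e(t-2) + ... + theta_q*e(t-q)

where the e(t-i) terms are the past errors (white noise) mentioned above.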

Later, we are going to see how to find out those alpha/theta parameters and, most importantly, the order p/q of the model.

Before coming to that, there is another matter to solve first. Do I use an AR or a MA model?

To make a decision, we are going to get some help from two “friends”: auto-correlation plot and partial auto-correlation plot.

Let’s have a look at the auto-correlation plot (ACF) to understand how it works and what it represents.

On the x-axis we have the number of lags, while on the y-axis we have the correlation.

For example, x=0 shows the correlation of the series with itself (of course it is 1…but it is also useless information). Instead, x=3 shows the correlation between the series and its version shifted by 3.

More generally, the ACF at lag n computes the correlation between y(t) and y(t+n), taking into account the contribution of the values between y(t) and y(t+n).

The goal of this plot is to identify the lags where correlation is highest. The idea is that those lags are the most adequate to build a model able to predict future values by leveraging that high correlation.

You can see a blue area. That represents the 95% confidence interval. Whatever is outside that area is statistically significant. In our case, lag 1 is not significant (and should be ignored) while lag 7 is significant.

The other useful plot is the partial auto-correlation one, known as PACF. The main principle behind this plot is similar; anyhow, the main difference is that this tool measures the direct correlation between y(t) and y(t+n), removing the contribution of the values in between.

Here is the code to plot them both:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plt.clf()
plt.close()
plot_acf(series, lags=31)
plt.savefig('acf_ipps.png')

plt.clf()
plt.close()
plot_pacf(series, lags=30, method='ywm')
plt.savefig('pacf_ipps.png')

For PACF, I used method ‘ywm’ as it gives more reliable results.
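If you prefer numbers over a picture, the same correlation values can also be computed directly. A small sketch, assuming series is the DataFrame loaded earlier:

from statsmodels.tsa.stattools import acf, pacf

acf_vals = acf(series['pps'], nlags=30)
pacf_vals = pacf(series['pps'], nlags=30, method='ywm')
print(acf_vals[7], pacf_vals[7])  # correlation values at lag 7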

We have ACF and PACF. So what? How can I decide AR or MA?

Again, the internet often provides this small table as a simple rule of thumb:
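In short, the usual rule of thumb looks like this:

  • AR(p): the ACF decays geometrically (tails off), the PACF cuts off after lag p
  • MA(q): the ACF cuts off after lag q, the PACF decays geometrically (tails off)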

Let’s have a look at both ACF and PACF:

The ACF seems to follow a geometric decay. Anyhow, we have some lags (7, 14, 21) that break the “global decay”, although their own trend is still a geometric decay. This happens because the usually expected geometric decay is seen with stationary series. Our series is not stationary but has seasonality, and that is the kind of ACF plot we get when seasonality is part of the game.

Those high-correlation lags are useful to understand the seasonal pattern. Values 7, 14, 21 are not random: they indicate that the period is 7.

Once we clarified this, the table above is still useful:

  • ACF geometric decay
  • PACF has a significant cut off at lag 7

This first analysis tells us we should use an AR model with 7 lags.

This means our AR model will look something like:

y(t) = C + wn + alpha1*y(t-1) + alpha2*y(t-2) + ... + alpha7*y(t-7)

Just a note: this is not an “always true” rule or an exact science. Seeing certain behaviors in ACF/PACF plots does not guarantee that we will be able to build a model capable of forecasting the future in a reliable way.

Next time, we are going to see how to build an AR model with order 7 and find out all those alpha parameters.

What kind of time series do I have in front of me?

Last time, I collected some data from a router and plotted it with Python. What I obtained was a so-called time series, showing how the collected variable evolves over time.

This is what our series looks like:

Now, we want to better understand which sort of time series we have in front of us.

This will help us understand how to model it in order to use the right tools “to predict the future”.

Looking at it, it looks like we have a repeating pattern, with a spike every 6-8 minutes.

Let’s see if we can say something better than “looks like”.

The first thing we can do is to look at possible correlation. We can do this by plotting a simple lag plot, where each series value is paired with the previous value of the series. In other words, on the x-axis we have y(t), our time series, and on the y-axis we have y(t+1). The goal is to see if there is some dependency between our time series and its lagged version (in this case the lag is 1, as we are comparing the series with the same series “one step behind”).

Python allows us to get a lag plot easily:

import matplotlib.pyplot as plt
from pandas.plotting import lag_plot

plt.clf()
plt.close()
lag_plot(series).get_figure().savefig('corr_ipps.png')

And we get this:

It is pretty clear there is no strong or visible correlation. There seem to be some “correlated subsets”, though. For example, after a high value (e.g. about 17000 at time t) we always drop to 0 at time t+1. Another one is a constant behavior where we always go from around 7000 to 17000.

Based on that image, we would have to say there is no correlation, but is that always true? I would tend to say no. There is no correlation, as shown above, when we plot our series against its 1-lagged version.
Anyhow, earlier, we said there seems to be some sort of pattern repeating every 6-8 samples.

We can generate a new lag plot but, this time, with a different number of lags (in this case 7):

lag_plot(series, lag=7).get_figure().savefig('7lags_corr_ipps.png')

This time the image looks pretty different:

We have three main “groups” but all of them are “aligned”, suggesting that, this time, there is a correlation. If so, this would mean that our series is correlated with its 7-lagged version, which is also an indication of a pattern with periodicity 7.

Instead of looking at a plot, we can compute the correlation coefficient.

>>> from pandas import DataFrame, concat
>>> corrvalues = DataFrame(series)

>>> corrframe = concat([corrvalues.shift(1), corrvalues], axis=1)
>>> corrframe.columns = ['t', 't+1']
>>> corrres = corrframe.corr()

>>> corrframe = concat([corrvalues.shift(7), corrvalues], axis=1)
>>> corrframe.columns = ['t', 't+7']
>>> seven_corrres = corrframe.corr()

>>> print(corrres)
            t       t+1
t    1.000000  0.197446
t+1  0.197446  1.000000

>>> print(seven_corrres)
            t       t+7
t    1.000000  0.997731
t+7  0.997731  1.000000

Now we have a number that clearly shows us the difference:

  • with 1 lag, correlation is 0.19, pretty low
  • with 7 lags, correlation is 0.99, extremely high

Finding the number of lags that gives us a very high correlation is very important, especially when setting up the model that will try to forecast future values. For now, let’s remember that lag 7 means high correlation. It will become useful later on.

Next, we are going to look at stationarity. Stationarity means that statistical properties like the mean, covariance or autocorrelation are constant over time. Moreover, it means that the series has no trend or seasonality (repeating patterns).

Stationarity is an important property as Auto Regressive or Moving Average models need the series to be stationary.

In order to understand whether we have a stationary series or not, we can perform the adfuller (Augmented Dickey-Fuller) test and look at the p-value. If the p-value is <= 0.05, we can consider the series stationary.

>>> from statsmodels.tsa.stattools import adfuller
>>> statres = adfuller(series)

>>> statres[1]
0.25282810517473603

Our series is not stationary!

Does this mean we cannot use AR (auto regression) or MA (moving average)? Nope 🙂 An ideal series is stationary. Anyhow, we can still build, for example, an AR model to predict the future, and the fact that our series has seasonality will come in handy when building the model and picking the right parameters.

Next time
Ciao
IoSonoUmberto

Turn router data into a Time Series

Nowadays, we hear about AI and ML everywhere. AI seems capable of very nice things but also sparks some concerns about “how far it might go”. Fortunately, we are mainly using it to generate images of six-fingered people or to ask it which wine goes better with a filet.

I work with networks and, unfortunately, cannot spend time creating images with routers wearing a sombrero riding a rhino.
Anyhow, I found out there is some ML I can try learning and see how it can help with networks. One useful ML technique is Time Series Analysis.

As the name suggests, it is used to analyze how a given variable evolves over time. This means recognizing a trend, detecting a recurring pattern or forecasting future values.

To better understand some of it, I built up a small lab and started getting data.

Getting data is the first step: without data there won’t be any time series.

To collect data from a router we have several options: SNMP, NETCONF or telemetry, to name a few.

I picked NETCONF and, more specifically, I built a small Python script using PyEZ, Juniper’s Python library to interact with Junos devices.

Here is the script I used to periodically connect to a device, extract a variable value (in this case, the input pps of a specific interface) and save all the data into a csv file.

from jnpr.junos import Device
from jnpr.junos.exception import *
from jnpr.junos.factory.factory_loader import FactoryLoader
import time
import sys

dev = Device(host='10.49.106.23', user='root', password='Embe1mpls')
dev.open()

samples = int(sys.argv[1])

i = 0

fo = open("ipps_data.csv", "w")
# csv header: these column names are used later when loading the file with pandas
fo.write("when,pps\n")

while i < samples:
    # fetch interface information and extract the input pps value
    ipps_xml = dev.rpc.get_interface_information(interface_name='ge-0/0/3')
    ipps = int(ipps_xml.findtext('.//input-pps'))
    print(time.strftime('%Y-%m-%d %H:%M') + "," + str(ipps))
    fo.write(time.strftime('%Y-%m-%d %H:%M') + "," + str(ipps) + "\n")
    i += 1
    time.sleep(60)

fo.close()

Of course, this is just an example and can be customized as needed.

Anyhow, we should end up with something like this:

[root@localhost pyai]# cat ipps_data.csv

when,pps
2024-01-22 02:13,0
2024-01-22 02:14,0
2024-01-22 02:15,850
2024-01-22 02:16,8500
2024-01-22 02:17,16978
2024-01-22 02:18,0
2024-01-22 02:19,0
2024-01-22 02:20,0
2024-01-22 02:21,0
2024-01-22 02:22,1276
...

Now, we have our data.

Time to turn it into a time series.

We are going to use several modules and packages. Before starting, make sure to have them available:

pip3 install matplotlib
pip3 install pandas
pip3 install numpy
pip3 install statsmodels

Next, we are ready to load the csv and turn it into a time series data structure:

from pandas import read_csv
from datetime import datetime
import matplotlib.pyplot as plt

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M')

series = read_csv('ipps_data.csv', parse_dates=['when'], date_parser=dateparse, header=0, sep=',', index_col='when')

We do two things here:

  • we define a lambda function telling how to parse time
  • we read the csv file and store its content into variable series

About read_csv:

  • header=0 means first row contains the names of the columns
  • parse_dates is used to specify which columns must be treated as dates
  • date_parser references the lambda function that parses dates properly (based on the format we used when collecting data)
  • sep is the separator (“,” is the de-facto standard separator for csv)
  • index_col is used to tell which column to use as the index (dates)
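A small note: on recent pandas versions (2.x), the date_parser argument is deprecated. If you get a warning or an error there, the date_format argument should do the same job:

series = read_csv('ipps_data.csv', parse_dates=['when'], date_format='%Y-%m-%d %H:%M', header=0, sep=',', index_col='when')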

As a result we get this data structure:

>>> type(series)

<class 'pandas.core.frame.DataFrame'>
>>> series
pps
when
2024-01-22 02:13:00 0
2024-01-22 02:14:00 0
2024-01-22 02:15:00 850
2024-01-22 02:16:00 8500
2024-01-22 02:17:00 16978
... ...
2024-01-22 03:11:00 850
2024-01-22 03:12:00 8500
2024-01-22 03:13:00 16978
2024-01-22 03:14:00 0
2024-01-22 03:15:00 0

[63 rows x 1 columns]
>>> series.index
DatetimeIndex(['2024-01-22 02:13:00', '2024-01-22 02:14:00',
'2024-01-22 02:15:00', '2024-01-22 02:16:00',
'2024-01-22 02:17:00', '2024-01-22 02:18:00',
'2024-01-22 02:19:00', '2024-01-22 02:20:00',
'2024-01-22 02:21:00', '2024-01-22 02:22:00',
'2024-01-22 02:23:00', '2024-01-22 02:24:00',
'2024-01-22 02:25:00', '2024-01-22 02:26:00',
'2024-01-22 02:27:00', '2024-01-22 02:28:00',
'2024-01-22 02:29:00', '2024-01-22 02:30:00',
'2024-01-22 02:31:00', '2024-01-22 02:32:00',
'2024-01-22 02:33:00', '2024-01-22 02:34:00',
'2024-01-22 02:35:00', '2024-01-22 02:36:00',
'2024-01-22 02:37:00', '2024-01-22 02:38:00',
'2024-01-22 02:39:00', '2024-01-22 02:40:00',
'2024-01-22 02:41:00', '2024-01-22 02:42:00',
'2024-01-22 02:43:00', '2024-01-22 02:44:00',
'2024-01-22 02:45:00', '2024-01-22 02:46:00',
'2024-01-22 02:47:00', '2024-01-22 02:48:00',
'2024-01-22 02:49:00', '2024-01-22 02:50:00',
'2024-01-22 02:51:00', '2024-01-22 02:52:00',
'2024-01-22 02:53:00', '2024-01-22 02:54:00',
'2024-01-22 02:55:00', '2024-01-22 02:56:00',
'2024-01-22 02:57:00', '2024-01-22 02:58:00',
'2024-01-22 02:59:00', '2024-01-22 03:00:00',
'2024-01-22 03:01:00', '2024-01-22 03:02:00',
'2024-01-22 03:03:00', '2024-01-22 03:04:00',
'2024-01-22 03:05:00', '2024-01-22 03:06:00',
'2024-01-22 03:07:00', '2024-01-22 03:08:00',
'2024-01-22 03:09:00', '2024-01-22 03:10:00',
'2024-01-22 03:11:00', '2024-01-22 03:12:00',
'2024-01-22 03:13:00', '2024-01-22 03:14:00',
'2024-01-22 03:15:00'],
dtype='datetime64[ns]', name='when', freq=None)
>>> series.keys()
Index(['pps'], dtype='object')
>>> series.values
array([[ 0],
[ 0],
[ 850],
[ 8500],
[16978],
[ 0],
[ 0],
[ 0],
[ 0],
[ 1276],
[ 7654],
[17869],
[ 1],
[ 0],
[ 0],
[ 0],
[ 1021],
[ 9350],
[16152],
[ 0],
[ 0],
[ 0],
[ 0],
[ 870],
[ 8550],
[16800],
[ 0],
[ 0],
[ 0],
[ 0],
[ 750],
[ 8500],
[16478],
[ 0],
[ 0],
[ 0],
[ 0],
[ 860],
[ 8570],
[16278],
[ 0],
[ 0],
[ 0],
[ 0],
[ 850],
[ 8600],
[16878],
[ 0],
[ 0],
[ 0],
[ 0],
[ 855],
[ 8400],
[17178],
[ 0],
[ 0],
[ 0],
[ 0],
[ 850],
[ 8500],
[16978],
[ 0],
[ 0]])

We can plot the time series to get a visual idea of how input pps evolves over time:

series.plot().get_figure().savefig('ts_ipps.png')

The result is:

As you can see, the plot shows some obvious periodicity. It looks like those spikes repeat themselves regularly.

From now on, we are going to look at how different concepts and algorithms can help us understand more about our series. We will try to answer questions like:

  • Is the series stationary?
  • Is there any correlation?
  • If so, at which lags is correlation higher?
  • Can this high correlation help us build a model able to forecast future values?

A lot of things to look at and understand.
Next time, I’ll start tackling some of those questions.

Ciao
IoSonoUmberto