Time series: ARIMA
Time series
A time series is a collection of data points gathered sequentially over a specific time period. The purpose of analyzing this series is to understand past patterns and potentially predict future trends. It is hypothesized that these data points are produced by an underlying process according to a particular statistical distribution. In other words, each data point is a realization of a random variable. This is why a time series can also be referred to as a discrete time stochastic process.
Examples of time series include stock prices and sunspot activity.
Time series possess certain characteristics that we need to familiarize ourselves with, such as trend, seasonality, and serial dependence. A trend refers to a steady directional shift in the time series data. If there’s a logical explanation for this trend, we call it deterministic; if not, it’s considered stochastic. Seasonality reflects the variation in data according to specific times of the year, like temperature fluctuations. Serial correlation, on the other hand, refers to the interdependence between data points, particularly when they are chronologically close.
Stationarity
Let’s define expectation, variance and covariance. The expected value or expectation $E(x)$ of a random variable $x$ is its mean value in the population. We denote it $\mu = E(x)$. The variance $\sigma^2 = E[(x - \mu)^2]$ measures the spread around the mean, and the covariance of two random variables $x$ and $y$ is $\text{Cov}(x, y) = E[(x - \mu_x)(y - \mu_y)]$.
For a time series, the definitions of mean and variance are a bit different. The mean of a time series $x_t$ is the expectation at each point in time, $\mu(t) = E(x_t)$, which in general can change with $t$; the variance is likewise $\sigma^2(t) = E[(x_t - \mu(t))^2]$. Under stationarity, both are assumed constant over time.
A time series is deemed second order stationary if its mean and variance are constant over time and the correlation between data points relies solely on the lag, which is the number of time intervals separating each pair of observations. Then the serial covariance (autocovariance) at lag $k$ is $C_k = E[(x_t - \mu)(x_{t+k} - \mu)]$, and the serial correlation (autocorrelation) at lag $k$ is $\rho_k = C_k / \sigma^2$.
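As a quick illustration, here is a minimal sketch (the helper names are ours, not from any library) computing the sample autocovariance and autocorrelation of a one-dimensional array with numpy:
import numpy as np

def autocovariance(x, k):
    # Sample autocovariance at lag k: mean of (x_t - mu)(x_{t+k} - mu)
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu = x.mean()
    return np.sum((x[:n - k] - mu) * (x[k:] - mu)) / n

def autocorrelation(x, k):
    # Autocorrelation: autocovariance normalized by the lag-0 autocovariance (the variance)
    return autocovariance(x, k) / autocovariance(x, 0)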
White noise
Let’s consider two operators: the backward shift and difference operators. The backward shift (or lag) operator $B$ takes an element and outputs the previous element: $B x_t = x_{t-1}$, and more generally $B^n x_t = x_{t-n}$. The difference operator $\nabla$ subtracts the previous element: $\nabla x_t = x_t - x_{t-1} = (1 - B) x_t$.
A time series model fits the series so that the remaining (residual) series has no autocorrelation. The residual error series is $\epsilon_t = x_t - \hat{x}_t$, the difference between the observed value and the model’s fitted value. White noise is the simplest such series: a sequence $w_t$ of iid random variables with zero mean, constant variance $\sigma^2$, and zero autocorrelation at every nonzero lag.
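To see what white noise looks like, here is a minimal simulation sketch; every autocorrelation beyond lag 0 should be statistically indistinguishable from zero:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

np.random.seed(42)
# iid Gaussian draws with zero mean and constant variance
w = np.random.normal(loc=0, scale=1, size=500)
plot_acf(w, lags=30)
plt.show()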
Random walk
A random walk is another time series model in which the current element is the previous one plus a random step up or down: $x_t = x_{t-1} + w_t$, where $w_t$ is white noise. A random walk is not stationary, since its variance grows with time.
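A minimal simulation sketch: cumulatively summing white noise produces a random walk, and first-differencing the walk recovers the white noise:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
w = np.random.normal(size=500)   # white noise steps
x = np.cumsum(w)                 # random walk: x_t = x_{t-1} + w_t
plt.plot(x)
plt.title('Simulated random walk')
plt.show()
# np.diff(x) gives back the white noise steps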
Autoregressive model - AR(p)
Let’s define a strictly stationary series: one whose joint distribution is unchanged by any arbitrary shift in time, i.e. the distribution of $(x_{t_1}, \ldots, x_{t_n})$ equals that of $(x_{t_1+m}, \ldots, x_{t_n+m})$ for any shift $m$.
We also need a way to choose among models: the Akaike Information Criterion (AIC). If a statistical model has $k$ parameters and $L$ is the maximized value of its likelihood function, then $\text{AIC} = 2k - 2\ln(L)$. Among candidate models we prefer the one with the smallest AIC, which trades goodness of fit against model complexity.
Now we dive into the autoregressive model. It extends the random walk by adding terms further back in time. A time series $x_t$ follows an autoregressive model of order $p$, AR($p$), if $x_t = \alpha_1 x_{t-1} + \cdots + \alpha_p x_{t-p} + w_t$, where $w_t$ is white noise and the $\alpha_i$ are the model coefficients.
To know whether an AR($p$) is stationary or not, we solve the characteristic equation (the autoregressive model in backward shift form): $\theta_p(B) x_t = (1 - \alpha_1 B - \cdots - \alpha_p B^p) x_t = w_t$. The process is stationary if and only if all roots of $\theta_p(B) = 0$ lie outside the unit circle, i.e. have absolute value greater than 1.
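For a concrete check (the AR(2) coefficients 0.5 and 0.25 are made up for illustration), statsmodels can compute the characteristic roots for us:
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess

# AR(2): x_t = 0.5 x_{t-1} + 0.25 x_{t-2} + w_t
# statsmodels expects the polynomial coefficients 1, -alpha_1, -alpha_2
ar2 = ArmaProcess(ar=np.array([1, -0.5, -0.25]), ma=[1])
print(ar2.arroots)        # roots of the characteristic polynomial
print(ar2.isstationary)   # True iff all roots lie outside the unit circle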
Moving average - MA(q)
The difference from AR($p$) is that an MA($q$) is a linear combination of white noise terms: $x_t = w_t + \beta_1 w_{t-1} + \cdots + \beta_q w_{t-q}$, where $w_t$ is white noise with mean zero.
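A minimal sketch (with a made-up coefficient $\beta_1 = 0.6$) simulating an MA(1) and checking that its ACF cuts off after lag 1:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf

np.random.seed(42)
# MA(1): x_t = w_t + 0.6 w_{t-1}; the leading 1 is the zero-lag coefficient
ma1 = ArmaProcess(ar=[1], ma=[1, 0.6])
x = ma1.generate_sample(nsample=500)
plot_acf(x, lags=20)   # should be significant only at lag 1
plt.show()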
Simple moving average (SMA)
We take the unweighted mean of the previous data points inside a sliding window of fixed size.
import pandas as pd
from matplotlib import pyplot as plt

# Monthly electric production data
electric = pd.read_csv('time-series/Electric_Production.csv', encoding='utf-8')
electric.head()
| | DATE | IPG2211A2N |
|---|---|---|
| 0 | 1/1/1985 | 72.5052 |
| 1 | 2/1/1985 | 70.6720 |
| 2 | 3/1/1985 | 62.4502 |
| 3 | 4/1/1985 | 57.4714 |
| 4 | 5/1/1985 | 55.3151 |
# SMA over 10 and 20 month windows (the data is monthly)
electric['SMA_10'] = electric.IPG2211A2N.rolling(10, min_periods=1).mean()
electric['SMA_20'] = electric.IPG2211A2N.rolling(20, min_periods=1).mean()
# Colors: blue = production, red = 10-month SMA, yellow = 20-month SMA
colors = ['blue', 'red', 'yellow']
# Line plot
electric[['IPG2211A2N', 'SMA_10', 'SMA_20']].plot(color=colors, linewidth=3, figsize=(12,6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(labels=['Electric production', '10-month SMA', '20-month SMA'], fontsize=14)
plt.title('Monthly electric production', fontsize=20)
plt.xlabel('Month', fontsize=16)
plt.ylabel('Electric production', fontsize=16)
Cumulative Moving Average (CMA)
We take the unweighted mean of all past values up to the current point; the window expands as new data arrives.
# CMA: expanding-window mean of all values seen so far
electric['CMA'] = electric.IPG2211A2N.expanding().mean()
# Colors: blue = monthly electric production, yellow = CMA
colors = ['blue', 'yellow']
# Line plot
electric[['IPG2211A2N', 'CMA']].plot(color=colors, linewidth=3, figsize=(12,6))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(labels=['Monthly electric production', 'CMA'], fontsize=14)
plt.title('Monthly electric production', fontsize=20)
plt.xlabel('Month', fontsize=16)
plt.ylabel('Electric production', fontsize=16)
Exponential Moving Average (EMA)
This method gives more weight to recent observations, so the EMA reacts faster to changes and is more sensitive to them.
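Concretely, with smoothing factor $\alpha$ the EMA follows the recursion below; this is what pandas computes with ewm(alpha=..., adjust=False):

$$\text{EMA}_t = \alpha x_t + (1 - \alpha)\,\text{EMA}_{t-1}, \qquad \text{EMA}_0 = x_0$$

A larger $\alpha$ discounts old observations faster.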
# EMA of electric production
# Smoothing factor alpha = 0.1
electric['EMA_0.1'] = electric.IPG2211A2N.ewm(alpha=0.1, adjust=False).mean()
# Smoothing factor alpha = 0.3
electric['EMA_0.3'] = electric.IPG2211A2N.ewm(alpha=0.3, adjust=False).mean()
# Colors: blue = production, red = EMA (alpha=0.1), yellow = EMA (alpha=0.3)
colors = ['blue', 'red', 'yellow']
electric[['IPG2211A2N', 'EMA_0.1', 'EMA_0.3']].plot(color=colors, linewidth=3, figsize=(12,6), alpha=0.8)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.legend(labels=['Monthly electric production', 'EMA - alpha=0.1', 'EMA - alpha=0.3'], fontsize=14)
plt.title('Monthly electric production', fontsize=20)
plt.xlabel('Month', fontsize=16)
plt.ylabel('Electric production', fontsize=16)
ARMA
ARMA, an acronym for Autoregressive Moving Average, is a model designed to comprehend and forecast a time series. The “AR” part regresses the variable on its own lagged values, while the “MA” part models the error as a linear combination of current and past white noise terms: $x_t = \alpha_1 x_{t-1} + \cdots + \alpha_p x_{t-p} + w_t + \beta_1 w_{t-1} + \cdots + \beta_q w_{t-q}$.
The combined model ARMA(p,q) will have p autoregressive terms and q moving average terms.
The ARMA model incorporates both past trends and unexpected occurrences. The Autoregressive (AR) component considers past data, while the Moving Average (MA) component accounts for sudden, unpredictable events. However, the ARMA model doesn’t handle volatility clustering, which is a situation where variance fluctuates over time, a condition also known as heteroskedasticity.
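As an illustration (a sketch with made-up coefficients 0.7 and 0.4), we can simulate an ARMA(1,1) and check that fitting recovers the parameters; note that ARMA(p, q) is just ARIMA(p, 0, q):
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(42)
# Simulate ARMA(1,1): x_t = 0.7 x_{t-1} + w_t + 0.4 w_{t-1}
sim = ArmaProcess(ar=[1, -0.7], ma=[1, 0.4]).generate_sample(nsample=1000)
fit = ARIMA(sim, order=(1, 0, 1)).fit()
print(fit.params)   # estimates should land near 0.7 and 0.4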
ARIMA
ARIMA is short for Autoregressive Integrated Moving Average. This kind of model explains the series based on its past values (the lags and the lagged forecast errors). An ARIMA model has three terms: p, d and q. As before, p is the order of the AR term and q is the order of the MA term. The extra d is the number of differencing steps needed to make the series stationary.
The first step is to check whether the series is stationary with the Augmented Dickey-Fuller (ADF) test. ADF is a hypothesis test whose null hypothesis is that the series is non-stationary. So if the p-value of the test is less than the significance level of 0.05, we reject the null hypothesis and conclude that the series is stationary.
from statsmodels.tsa.stattools import adfuller

# ADF test on the raw series
result = adfuller(electric.IPG2211A2N.dropna())
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
ADF Statistic: -2.256990
p-value: 0.186215
In our case the p-value is above 0.05, so we cannot reject the null hypothesis and we go ahead with differencing.
import numpy as np
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.figsize':(9,7), 'figure.dpi':120})

# Work with the electric production series
df = electric.IPG2211A2N
# Original Series
fig, axes = plt.subplots(3, 2, sharex=True)
axes[0, 0].plot(df); axes[0, 0].set_title('Original Series')
plot_acf(df, ax=axes[0, 1])
# 1st Differencing
axes[1, 0].plot(df.diff()); axes[1, 0].set_title('1st Order Differencing')
plot_acf(df.diff().dropna(), ax=axes[1, 1])
# 2nd Differencing
axes[2, 0].plot(df.diff().diff()); axes[2, 0].set_title('2nd Order Differencing')
plot_acf(df.diff().diff().dropna(), ax=axes[2, 1])
plt.show()
We can choose to difference the series once or twice, so d can be 1 or 2. To choose p (the order of the AR term), we check the partial autocorrelation (PACF). The partial autocorrelation at lag k is the correlation between the series and its k-th lag after removing the effect of the intermediate lags.
# PACF plot of the 1st differenced series
plt.rcParams.update({'figure.figsize':(9,3), 'figure.dpi':120})
fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(df.diff()); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(0, 1.2))
plot_pacf(df.diff().dropna(), ax=axes[1], method='ywm')
plt.show()
We can choose p = 1. For q (the order of MA), we can look at the ACF plot. Let’s fix q = 1, too.
# ACF plot of the 1st differenced series
plt.rcParams.update({'figure.figsize':(9,3), 'figure.dpi':120})
fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(df.diff()); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(0,1.2))
plot_acf(df.diff().dropna(), ax=axes[1])
plt.show()
from statsmodels.tsa.arima.model import ARIMA
# 1,2,1 ARIMA Model
model = ARIMA(df, order=(1,2,1))
model_fit = model.fit()
print(model_fit.summary())
SARIMAX Results
==============================================================================
Dep. Variable: IPG2211A2N No. Observations: 397
Model: ARIMA(1, 2, 1) Log Likelihood -1342.504
Date: Fri, 19 May 2023 AIC 2691.008
Time: 21:13:20 BIC 2702.945
Sample: 0 HQIC 2695.738
- 397
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.3805 0.085 4.503 0.000 0.215 0.546
ma.L1 -1.0000 10.776 -0.093 0.926 -22.120 20.120
sigma2 51.7549 557.136 0.093 0.926 -1040.211 1143.721
===================================================================================
Ljung-Box (L1) (Q): 24.51 Jarque-Bera (JB): 3.39
Prob(Q): 0.00 Prob(JB): 0.18
Heteroskedasticity (H): 2.76 Skew: -0.02
Prob(H) (two-sided): 0.00 Kurtosis: 2.55
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
The ma.L1 coefficient does not appear significant (p = 0.926), so we rebuild the model without the MA term.
from statsmodels.tsa.arima.model import ARIMA
# 1,2,0 ARIMA Model
model = ARIMA(df, order=(1,2,0))
model_fit = model.fit()
print(model_fit.summary())
SARIMAX Results
==============================================================================
Dep. Variable: IPG2211A2N No. Observations: 397
Model: ARIMA(1, 2, 0) Log Likelihood -1408.868
Date: Fri, 19 May 2023 AIC 2821.737
Time: 21:14:38 BIC 2829.694
Sample: 0 HQIC 2824.889
- 397
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.1411 0.047 2.994 0.003 0.049 0.233
sigma2 73.3746 5.783 12.688 0.000 62.040 84.709
===================================================================================
Ljung-Box (L1) (Q): 1.14 Jarque-Bera (JB): 9.79
Prob(Q): 0.29 Prob(JB): 0.01
Heteroskedasticity (H): 2.85 Skew: -0.35
Prob(H) (two-sided): 0.00 Kurtosis: 2.66
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
The AIC increased, which is not a good sign. Next we plot the residuals; they look good, centered around zero. Then we plot the actual vs. fitted values; they look similar, which is reassuring.
# Plot residual errors
residuals = pd.DataFrame(model_fit.resid)
fig, ax = plt.subplots(1,2)
residuals.plot(title="Residuals", ax=ax[0])
residuals.plot(kind='kde', title='Density', ax=ax[1])
plt.show()
actual = df
predicted = model_fit.predict()
pd.DataFrame({'actual':actual, 'predicted':predicted}).plot(title='Actuals vs Predicted');
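Finally, a sketch of out-of-sample forecasting with the fitted model; the 24-step horizon is an arbitrary choice:
import matplotlib.pyplot as plt

# Forecast the next 24 periods with confidence intervals
forecast = model_fit.get_forecast(steps=24)
mean_forecast = forecast.predicted_mean
conf_int = forecast.conf_int()

ax = df.plot(figsize=(12, 6), label='observed')
mean_forecast.plot(ax=ax, label='forecast')
ax.fill_between(conf_int.index, conf_int.iloc[:, 0], conf_int.iloc[:, 1], alpha=0.2)
ax.legend()
plt.show()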