Mutual Information for Time Series Forecasting in Python

When analysing time series and building forecasting systems, it is essential to understand the predictive strength of endogenous and exogenous features and their effect on the target variable. Would it not be great to have a ranking of features by their predictive power? How could one achieve such an objective?

Of course, there are tools like correlation. However, correlation is not only limited to capturing linear relationships; it can also fail to reflect the predictive power of a feature with respect to the target variable. You have probably heard of Anscombe’s quartet, but there is also a more fun way to illustrate the concept. In the Datasaurus Dozen, all of the datasets share the same summary statistics, including the same correlation between the feature X and the target y.

The Datasaurus Dozen
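
To make this concrete, here is a minimal sketch (the variables and numbers are illustrative assumptions, not from the article) of a relationship in which y is fully determined by x, yet the Pearson correlation is close to zero:

```python
import numpy as np

# A deterministic, perfectly predictable relationship: y = x**2
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=1_000)
y = x ** 2

# Pearson correlation is close to zero even though y is fully determined by x
print(np.corrcoef(x, y)[0, 1])  # roughly 0.0
```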

But let’s not get distracted; now that we know that correlation is far from ideal? How do we really measure the predictive power of features (both internal and external) in terms of their effect on target y?

For time series, there are tools like the autocorrelation function (ACF) that can help identify the impact of lags on the target variable, but the ACF is, again, computed from linear correlations.
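
As a quick illustration (using a hypothetical autoregressive series generated just for this sketch), the ACF can be estimated with statsmodels:

```python
import numpy as np
from statsmodels.tsa.stattools import acf

# Hypothetical AR(1)-style series: each value depends linearly on the previous one
rng = np.random.default_rng(0)
noise = rng.normal(size=500)
series = np.zeros(500)
for t in range(1, 500):
    series[t] = 0.8 * series[t - 1] + noise[t]

# Autocorrelation for the first 10 lags (captures linear dependence only)
print(acf(series, nlags=10))
```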

Can we do better in the era of machine learning models that can capture non-linear relationships between features and the target?

It turns out that we can, using the concept of ‘mutual information.’ Mutual information measures the information shared between two random variables: it quantifies how much knowing one variable tells us about the other, and it does so without relying on any assumptions such as linearity.

Formally, if X and Y are two random variables, the mutual information I(X; Y) between X and Y is defined as the expected value of the logarithm of the ratio between the joint distribution of X and Y and the product of their marginal distributions:

I(X; Y) = E[log (p(X, Y) / (p(X) * p(Y)))]

where p(X, Y) is the joint probability distribution of X and Y, and p(X) and p(Y) are the marginal probability distributions of X and Y.

The mutual information is non-negative and zero if and only if X and Y are independent, i.e., when neither variable provides any information about the other.
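
In Python, mutual information between features and a continuous target can be estimated with scikit-learn’s mutual_info_regression. The sketch below is an illustrative assumption (a synthetic series and hypothetical lag features, not the article’s data): it builds lagged copies of the series and ranks them by estimated mutual information with the target:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Hypothetical series: linear dependence on lag 1, nonlinear dependence on lag 2
rng = np.random.default_rng(1)
noise = rng.normal(size=1_000)
series = np.zeros(1_000)
for t in range(2, 1_000):
    series[t] = 0.5 * series[t - 1] + np.sin(series[t - 2]) + 0.1 * noise[t]

# Build lagged features for the target y
df = pd.DataFrame({"y": series})
for lag in range(1, 6):
    df[f"lag_{lag}"] = df["y"].shift(lag)
df = df.dropna()

X = df.drop(columns="y")
y = df["y"]

# Estimated mutual information between each lagged feature and the target;
# higher scores suggest more predictive information, linear or not
mi = mutual_info_regression(X, y, random_state=0)
print(pd.Series(mi, index=X.columns).sort_values(ascending=False))
```

Features with higher estimated mutual information are stronger candidates for inclusion in a forecasting model, regardless of whether their relationship with the target is linear.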

Valeriy Manokhin, PhD, MBA, CQF

Principal Data Scientist, PhD in Machine Learning, creator of Awesome Conformal Prediction