How to Detect Anomalies — state-of-the-art methods using Conformal Prediction
Anomaly detection is one of the primary use cases for many businesses, with a wide range of applications: cleaning data, preprocessing before building machine learning and statistical models, self-driving cars, predictive maintenance (PdM) for diagnosing machine health, detecting health conditions from medical images, identifying anomalous patterns in credit card transactions, cybersecurity and many more.
Unfortunately, much of both the popular writing and industry practice focuses on simplistic methods that rely on very restrictive assumptions about the data distribution (such as 3-sigma approaches, which require the data to be normally distributed), old data-mining methods such as KNN (invented in 1951), and forecasting libraries such as Facebook Prophet pressed into service for anomaly detection.
Such methods are almost guaranteed not to work: they are not robust, carry no built-in mathematical guarantees, and rely either on simplistic data-mining techniques from the 1950s (KNN) or on restrictive assumptions such as normality that is rarely observed in real data (machine health, demand patterns with long tails and so on). More importantly, since ground-truth labels are rarely available for anomaly detection, there is usually no reliable way to validate such methods after deployment.
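To see why the normality assumption bites, here is a small illustrative sketch on synthetic data (the lognormal sample is made up for the example, not a real workload): applying the 3-sigma rule to long-tailed, anomaly-free data flags far more points than the rule's theory promises.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic long-tailed "normal" behaviour (think demand data):
# a lognormal sample with no anomalies injected at all.
data = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

# Classic 3-sigma rule: flag anything beyond mean +/- 3 * std.
mu, sd = data.mean(), data.std()
flagged = np.abs(data - mu) > 3 * sd

# Under a normal distribution only ~0.27% of points should be flagged;
# on long-tailed data the flag rate is several times higher, and every
# single flag here is a false positive.
print(f"3-sigma flags {flagged.mean():.2%} of anomaly-free points")
```

The rule is not wrong per se; it is simply answering a question about a distribution the data does not follow.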
Meanwhile, using forecasting tools to diagnose anomalies is akin to asking a lawyer for a medical diagnosis. While on the surface forecasting and anomaly detection may seem related (at least in theory), in practice nothing is further from reality. For a forecasting method to perform the secondary function of anomaly detection, it would first have to forecast successfully; in reality, a user relying on a forecasting method for anomaly detection opens the solution to a double layer of risk. The first layer is that the forecasting model will (most likely) not produce accurate point predictions [which is why in forecasting the better paradigm is probabilistic forecasting instead]; the second is that even if the point forecast were spot-on, there is no clear way [in the absence of good prediction intervals] to establish what magnitude of deviation constitutes an anomaly.
An illustration: Facebook Prophet is a forecasting algorithm that rarely (if ever) produces good forecasts in comparison with other algorithms; using it for anomaly detection would be disastrous due to its wildly inaccurate point and probabilistic predictions.
https://valeman.medium.com/benchmarking-facebook-prophet-53273c3ee9c6
If one were to use Facebook Prophet to detect anomalies, the result would be both missed true anomalies and, due to Prophet's significantly mis-calibrated and overconfident prediction intervals, a flood of false positives, quickly resulting in alert fatigue for the user.
Due to the issues above, academic research on forecasting and on anomaly detection rarely overlaps: different research labs specialise in one or the other for very good reasons, because forecasting and anomaly detection are fundamentally and conceptually different tasks [at least until someone produces a crystal ball delivering both accurate point forecasts and, more importantly, assumption-free probabilistic forecasts].
The better approach is therefore to solve anomaly detection as such, bypassing the forecasting step entirely [the analogy in forecasting would be to bypass the inference step and focus directly on prediction, since inference is an unnecessary and risky stepping stone where things can go very wrong; this is why the best state-of-the-art frameworks in prediction and forecasting are assumption-free and parameter-free].
So, with all the arguments above, how can one get the 'Holy Grail' of anomaly detection: 1) effective and robust, 2) assumption-free, 3) non-parametric?
Conformal prediction, the most successful assumption-free, non-parametric prediction framework, adopted by leading machine learning and statistics research departments in the USA such as Berkeley, Carnegie Mellon, Stanford and Chicago, and by companies such as DeepMind, offers the answer.
Conformal prediction has successfully tackled uncertainty quantification for machine learning classification, regression and forecasting tasks for over two decades.
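For a concrete taste of the framework, here is a minimal split-conformal regression sketch on toy data (the model and data are invented for illustration): fit any point predictor on one half of the data, then take a quantile of the absolute calibration residuals on the other half to form a prediction interval with guaranteed marginal coverage, with no distributional assumptions on the noise.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy regression data: y = 2x + noise.
x = rng.uniform(0, 10, size=400)
y = 2 * x + rng.normal(scale=1.0, size=400)

# Split: fit a simple model on one half, calibrate on the other.
x_fit, y_fit = x[:200], y[:200]
x_cal, y_cal = x[200:], y[200:]

# Any point predictor works; here, least squares through the origin.
slope = np.sum(x_fit * y_fit) / np.sum(x_fit ** 2)

# Calibration residuals -> finite-sample-corrected quantile gives
# the interval half-width.
residuals = np.abs(y_cal - slope * x_cal)
alpha = 0.1
n = len(residuals)
q = np.quantile(residuals, np.ceil((1 - alpha) * (n + 1)) / n)

# Prediction interval for a new point: covers the truth with
# probability >= 90% under exchangeability, by construction.
x_new = 5.0
lo, hi = slope * x_new - q, slope * x_new + q
print(f"90% interval at x={x_new}: [{lo:.2f}, {hi:.2f}]")
```

The key point is that the coverage guarantee holds for any underlying model, however bad; a worse model simply yields wider (more honest) intervals.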
However, few people know that conformal prediction is also THE engine driving anomaly detection in Microsoft Azure under the hood.
Why did Microsoft select a conformal-based anomaly detector as the workhorse algorithm for Azure? Quite simply because it works, is robust and, more importantly, is mathematically guaranteed to work: the conformal prediction framework is grounded in rocket 🚀 science grade math, with origins in Kolmogorov's theory of randomness and in approaches successfully used in statistical physics.
Conformal Anomaly Detection (CAD) also won one of the top places in the high-profile Numenta Anomaly Benchmark competition, outperforming most of the alternatives, including Bayesian changepoint detection, the Twitter anomaly detector and many more.
Conformal Anomaly Detection is therefore both a well-established approach that has for several years powered anomaly detection in countless companies across the globe via Microsoft Azure, and a vibrant area of academic research in which many state-of-the-art conformal anomaly detection algorithms are developing fast.
Like conformal prediction, CAD is grounded in a strong and robust mathematical approach: it is free from the subjective bias introduced by concepts such as priors, and it is fully non-parametric and distribution-free.
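A minimal sketch of CAD in its simplest offline form (illustrative code, not Azure's implementation): score each point with a nonconformity measure, here the average distance to its k nearest neighbours in clean reference data, convert the score to a conformal p-value against a held-out calibration set, and flag points whose p-value falls below a chosen significance level.

```python
import numpy as np

def knn_nonconformity(point, reference, k=5):
    """Average distance to the k nearest reference points (1-D data)."""
    return np.sort(np.abs(reference - point))[:k].mean()

def conformal_p_value(x_new, calibration, reference, k=5):
    """Fraction of calibration scores at least as extreme as the new score."""
    cal_scores = np.array([knn_nonconformity(c, reference, k) for c in calibration])
    new_score = knn_nonconformity(x_new, reference, k)
    return (np.sum(cal_scores >= new_score) + 1) / (len(cal_scores) + 1)

rng = np.random.default_rng(42)
normal_data = rng.normal(size=300)          # "clean" historical observations
reference, calibration = normal_data[:200], normal_data[200:]

# A typical point gets a large p-value; a gross outlier a tiny one.
p_typical = conformal_p_value(0.1, calibration, reference)
p_outlier = conformal_p_value(25.0, calibration, reference)

# Flag as an anomaly when the p-value falls below significance level alpha.
alpha = 0.05
print(f"typical: p={p_typical:.3f}, outlier: p={p_outlier:.3f}, "
      f"outlier flagged: {p_outlier < alpha}")
```

The guarantee is the conformal one: for exchangeable data, the p-value of a normal point is (super-)uniform, so the false-alarm rate is controlled at alpha by construction, regardless of the data distribution or the nonconformity measure chosen.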
And with open-source libraries, anyone can leapfrog from Anomaly Detection 101 to state-of-the-art methods.
How can one get started with CAD (conformal anomaly detection)?
There is an easily accessible Medium article explaining the fundamentals (including Colab code), with links to scientific articles in the comments. There is also GitHub code for the CAD-KNN approach that won one of the top places in the Numenta competition.
Resources: