Basic Statistics Overview

This page gives a high-level overview, in simple terms, of what Prelert is doing under the hood.

What is a statistical model? When are they useful?

A statistical model is a mathematical description of the behavior of data (assuming the data is not fully deterministic). These models can be used to understand the typical observations to expect, and to quantify the likelihood of a specific observation.

Anomalies are items, events or observations that do not conform to an expected pattern or to other items in a dataset. If a statistical model can describe the likelihood of a specific observation, then, roughly speaking, an anomaly can be defined as an observation that has low likelihood (or probability). For example, an observation that is only seen once in 1,000,000 observations is rare, and its probability could be estimated as 1/1,000,000 = 0.000001.

An example of a simple statistical model is the use of a Gaussian (or Normal) distribution to describe the behavior of a value over time. If the value has an average of 10 and a standard deviation of 1, then ~95% of the time we expect to see values between 8 and 12. A value of 14 lies four standard deviations from the mean: the Gaussian density there is ~0.00013 (roughly 1 in 7,500), and the probability of seeing a value of 14 or more is ~0.003% (roughly 1 in 32,000), so in this context it may be viewed as an anomaly.
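The figures in this example can be checked with a few lines of Python. This is only an illustrative sketch using the standard library's NormalDist class; the variable names are chosen for this example and have nothing to do with Prelert's own implementation.

```python
from statistics import NormalDist

# Model from the text: mean 10, standard deviation 1.
model = NormalDist(mu=10.0, sigma=1.0)

# Probability of a value falling between 8 and 12 (within two standard deviations).
p_typical = model.cdf(12.0) - model.cdf(8.0)

# Density of the model at 14, and the probability of seeing 14 or more.
density_at_14 = model.pdf(14.0)
p_at_least_14 = 1.0 - model.cdf(14.0)

print(f"P(8 <= x <= 12) = {p_typical:.4f}")
print(f"pdf(14)         = {density_at_14:.6f}")
print(f"P(x >= 14)      = {p_at_least_14:.2e}")
```

Note that the density at a point and the probability of a range are different quantities; for anomaly detection it is usually the tail probability (a value at least this extreme) that matters.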

Statistical models are essential when the system contains components that cannot be readily observed, or when some aspects of the system's behavior are intrinsically random. For example, it is usually not possible to build an accurate deterministic model of a complex system, because there are many hidden states that affect its behavior. Some systems also include behaviors that are intrinsically random, such as people's interactions, which can only be described with a statistical model. Statistical models also have the advantage that they are often very concise descriptions of complex behavior and, for anomaly detection, they capture the information (in the form of a distribution function) that is essential for identifying unusual observations.

How is a statistical model created?

Prelert receives large volumes of diverse time series data as a stream. Naively, creating a statistical model that can compute the probability of a value can be viewed as fitting a mathematical function to a histogram of the data. Time series data also have other characteristics that a good statistical model needs to capture. For example, they may include trends, which in the context of a Gaussian distribution would be like having a time-varying mean. The range may be time dependent, which in the context of a Gaussian would be like having a time-varying standard deviation. There may also be more general correlations between values that are close in time.
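As a rough illustration of a time-varying mean and standard deviation, the sketch below scores each value against the mean and standard deviation of the values immediately before it. It is a deliberately naive stand-in, not Prelert's method: the function name rolling_zscores, the window size of 10 and the synthetic series are all assumptions made for this example.

```python
from statistics import mean, pstdev

def rolling_zscores(values, window=10):
    """Score each value against the mean and standard deviation of the preceding window."""
    scores = []
    for t in range(window, len(values)):
        recent = values[t - window:t]
        mu = mean(recent)        # local (time-varying) mean
        sigma = pstdev(recent)   # local (time-varying) standard deviation
        z = (values[t] - mu) / sigma if sigma > 0 else 0.0
        scores.append((t, z))
    return scores

# Synthetic series: slow upward trend plus a small repeating wiggle, with one spike.
wiggle = [0.2, -0.1, 0.0, 0.1, -0.2]
series = [10 + 0.1 * t + wiggle[t % 5] for t in range(40)]
series[30] += 5.0  # injected anomaly

scores = rolling_zscores(series)
most_unusual_t, most_unusual_z = max(scores, key=lambda s: abs(s[1]))
print(most_unusual_t, round(most_unusual_z, 1))
```

Because the reference mean follows the trend, the steady rise in the series is not itself flagged; the injected spike at index 30 is the point that stands out. A fixed mean and standard deviation computed over the whole series would blur that distinction.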

For example, consider the following time series (assuming the time series is approximately stationary and ergodic, so that the distribution doesn't change over time and can be estimated from a time sample):

[Figure: Simple time series]

The values can be mapped to a histogram which represents the frequencies of occurrence of values. For example, value=1 has been seen over 900 times in the dataset.

[Figure: Simple time series histogram]
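Mapping values to a histogram, and reading an empirical probability off it, can be sketched as follows. The observations list is a small made-up sample for illustration only; it is not the data plotted above.

```python
from collections import Counter

# Hypothetical sample of integer-valued observations (not the data plotted above).
observations = [1, 1, 2, 1, 3, 1, 2, 1, 1, 5, 1, 2, 1, 1, 3, 1, 2, 1, 1, 9]

histogram = Counter(observations)   # value -> number of times seen
total = sum(histogram.values())

# Empirical probability of each value: its share of all observations.
empirical_p = {value: count / total for value, count in histogram.items()}

for value in sorted(histogram):
    print(f"value={value}  count={histogram[value]:2d}  p={empirical_p[value]:.2f}")
```

An empirical histogram like this says nothing useful about values that have never been seen, which is one reason a fitted distribution function is needed for scoring rare observations.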

The probability of a value can be computed using a probability distribution function that fits the data. A Gaussian distribution is an example of a probability distribution function, but it does not fit this data well. In particular, if the model's primary objective is to detect anomalies, it needs to model the probabilities at extreme values (in the tails of the distribution function) well.

Prelert automatically creates complex probability distribution functions for diverse data. The figure below shows how Prelert fits this data compared to a simple Gaussian distribution. Visually, there isn't a massive difference, but in the tails of the distribution (zoomed in) there is an enormous relative difference in the distribution values. This results in Prelert accurately calculating the probability of an anomaly, whilst the Gaussian distribution significantly underestimates the probabilities, which would result in a large number of false positives.

[Figure: Simple time series pdf]
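The effect of a poor tail fit can be reproduced on synthetic data. The sketch below does not reproduce Prelert's modelling; it only fits a plain Gaussian to a skewed sample and compares the Gaussian's upper-tail probability with the fraction of the sample that actually lies in that tail. The log-normal sample, the four-standard-deviation threshold and all variable names are assumptions chosen for this example.

```python
import math
from statistics import NormalDist

# Hypothetical skewed data: 10,000 evenly spaced quantiles of a log-normal distribution.
n = 10_000
standard = NormalDist()
sample = [math.exp(standard.inv_cdf((i + 0.5) / n)) for i in range(n)]

# Fit a simple Gaussian using the sample mean and standard deviation.
gaussian = NormalDist.from_samples(sample)

# Compare upper-tail probabilities beyond four fitted standard deviations.
threshold = gaussian.mean + 4 * gaussian.stdev
gaussian_tail = 1.0 - gaussian.cdf(threshold)
empirical_tail = sum(x > threshold for x in sample) / n

print(f"threshold      = {threshold:.2f}")
print(f"Gaussian tail  = {gaussian_tail:.2e}")
print(f"empirical tail = {empirical_tail:.2e}")
```

On this sample the Gaussian assigns far less probability to the upper tail than the data supports, so ordinary high values would be scored as extremely unlikely and reported as anomalies. That is the false-positive problem described above.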