Understanding the Anomaly Score

This page gives an overview of how anomalies are identified and scored within an analysis.

Defining the job

When creating an anomaly detection job, the job configuration defines which fields to analyze (detectors) and the time interval to analyze across (bucketSpan).

When setting the bucketSpan, take into account the granularity at which you want to analyze, the frequency of the input data, and how frequently alerting is required.

The detectors define what type of analysis needs to be done (e.g. max, average, rare) and upon which fields (e.g. IP address, Host name, Num bytes). You can have more than one detector in a job, which is more efficient than running multiple jobs against the same data stream.
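The sketch below shows, in Python, the general shape such a job configuration might take. It is illustrative only: the detector functions, field names and bucketSpan simply mirror the examples on this page, while key names such as analysisConfig and dataDescription are placeholders. Consult the API documentation for the exact configuration schema.

    # Illustrative sketch only: not a literal request body. The structure
    # mirrors the terms used on this page; consult the API documentation
    # for the exact job configuration schema.
    job_config = {
        "analysisConfig": {
            "bucketSpan": 3600,  # analyze in hourly intervals (seconds)
            "detectors": [
                # One job can hold several detectors, each defining a
                # function and the fields it applies to.
                {"function": "max", "fieldName": "num_bytes", "byFieldName": "host_name"},
                {"function": "rare", "byFieldName": "ip_address"},
            ],
        },
        "dataDescription": {"timeField": "timestamp"},
    }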

Identifying the probability of an anomaly

Based on this job configuration, we analyze the input stream of data. We model the behavior of the data and perform analysis based upon the defined detectors for each time interval. When we see an event occurring outside of our model, we identify it as an anomaly. For each anomaly detected, we store the result records of our analysis, which include the probability of detecting that anomaly.

The lower the probability, the less likely the event was to occur, and therefore the bigger the anomaly. A probability with a value of 1 is a certainty and therefore not an anomaly. Probabilities are calculated to a precision of over 300 decimal places, so very unlikely results will be close to zero and are represented using scientific notation (e.g. 0.000000043 or 4.3e-8).

We calculate a normalizedProbability for each anomaly record, which is a number between 0 and 100. It is a statistically valid and “friendly” representation of the probability of that record, normalized across the period of the model, with 100 being the most anomalous. Anomaly records with a normalizedProbability of 100 are considered to be in the top 1% of the most “interesting” and unlikely anomalies in your analysis.
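For illustration, a single anomaly record might look something like the sketch below. The values are invented; only the probability and normalizedProbability fields come from the description above, and any other keys are placeholders.

    # Illustrative only: an invented anomaly record showing how probability
    # and normalizedProbability relate. Other keys are placeholders.
    record = {
        "timestamp": "2016-02-09T16:00:00Z",
        "probability": 4.3e-8,           # very unlikely, so highly anomalous
        "normalizedProbability": 92.6,   # 0-100, normalized across the model period
    }

    # Very small probabilities are easiest to read in scientific notation.
    print("{:.1e}".format(record["probability"]))   # prints 4.3e-08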

Calculating the Anomaly Score

With high volumes of real-life data, many anomalies may be found. These vary in probability from very likely to highly unlikely, i.e. from not particularly anomalous to highly anomalous. There can be none, one or two, tens, or sometimes hundreds of anomalies found within each bucket, and many thousands found per job.

In order to provide a sensible view of the results, we calculate the anomalyScore for each time interval. An interval with a high anomalyScore is significant and requires investigation.

The anomalyScore is a sophisticated aggregation of the anomaly records. The calculation is optimized for high throughput, gracefully ages historical data, and improves the signal-to-noise ratio. It adjusts for variations in event rate, takes into account the frequency and the level of anomalous activity, and is adjusted relative to past anomalous behavior. In addition, it is boosted if anomalous activity occurs for related entities, for example if disk IO and CPU are both behaving unusually for a given host.
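The exact calculation is internal to the engine, but the toy sketch below may help build intuition for the basic idea of aggregating record-level probabilities into a single bucket-level score. It deliberately ignores the event-rate adjustments, historical aging and related-entity boosting described above, and its scaling constant is arbitrary; it is not the real anomalyScore formula.

    import math

    # Toy illustration only: NOT the actual anomalyScore calculation.
    # It simply shows one way many record probabilities in a bucket could be
    # reduced to a single 0-100 score in which rarer records weigh more.
    def toy_bucket_score(record_probabilities, floor=1e-300):
        if not record_probabilities:
            return 0.0
        # Smaller probabilities (rarer records) contribute larger terms.
        raw = sum(-math.log(max(p, floor)) for p in record_probabilities)
        # Squash onto a 0-100 range; the divisor is an arbitrary choice.
        return 100.0 * (1.0 - math.exp(-raw / 20.0))

    print(toy_bucket_score([0.2, 0.1]))        # unremarkable records, roughly 18
    print(toy_bucket_score([4.3e-8, 1e-12]))   # highly unlikely records, roughly 89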

Querying the results

The anomalyScore can be queried using the API and is returned in both the bucket and record resource objects for easier programmatic access. The anomalyScore is always the aggregated, normalized view of how anomalous a bucket time interval is.

The normalizedProbability is the record-level “friendly”, normalized representation of the probability of the anomaly.

At the bucket level there is also a field called maxNormalizedProbability, which is the value of the most anomalous record in the bucket time interval. This can be used to identify time intervals which contain highly unusual outliers yet may not be the most anomalous intervals overall.
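As a sketch of what programmatic access might look like, the Python snippet below reads bucket results and prints the fields discussed above. The base URL, paths and response envelope are placeholders rather than the documented endpoints; see the API documentation referenced below for the real details.

    import requests

    # Illustrative sketch only: the URL, path and "documents" envelope are
    # placeholders, not the documented API. The result fields printed
    # (anomalyScore, maxNormalizedProbability) are those described above.
    BASE_URL = "http://localhost:8080/engine/v1"   # placeholder address
    JOB_ID = "my-job"                              # placeholder job id

    response = requests.get(
        "{}/results/{}/buckets".format(BASE_URL, JOB_ID),
        params={"start": "2016-02-09T00:00:00Z", "end": "2016-02-10T00:00:00Z"},
    )
    response.raise_for_status()

    for bucket in response.json().get("documents", []):
        print(bucket["timestamp"],
              bucket["anomalyScore"],
              bucket["maxNormalizedProbability"])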

More information on using the API to query results is available here.

Operational best practice

As an operational best practice, our key anomaly indicator is the anomalyScore. Querying it provides a robust, rate-controlled mechanism for identifying and alerting on anomalous time intervals. Once an anomalous time interval has been identified, it can be expanded to view the detailed anomaly records, which are the significant causal factors.

Additionally, you may wish to query the detailed anomaly records directly. There may be many, so you can filter them by percentile and date.
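A minimal sketch of this practice, assuming the same placeholder endpoints as in the earlier snippet, might look like the following: alert when a bucket's anomalyScore crosses a threshold, then fetch the record-level detail for that interval and keep only the most significant records.

    import requests

    # Illustrative sketch only: placeholder URL, paths and response envelope.
    # The logic mirrors the best practice above: alert on bucket anomalyScore,
    # then drill into the detailed records for the anomalous interval.
    BASE_URL = "http://localhost:8080/engine/v1"   # placeholder address
    JOB_ID = "my-job"                              # placeholder job id
    BUCKET_SPAN = 3600                             # seconds, matching the job's bucketSpan
    ALERT_THRESHOLD = 75.0                         # tune to control the alert rate

    buckets = requests.get("{}/results/{}/buckets".format(BASE_URL, JOB_ID)).json()
    for bucket in buckets.get("documents", []):
        if bucket["anomalyScore"] < ALERT_THRESHOLD:
            continue
        start = bucket["timestamp"]                # assumed to be epoch seconds here
        records = requests.get(
            "{}/results/{}/records".format(BASE_URL, JOB_ID),
            params={"start": start, "end": start + BUCKET_SPAN},
        ).json()
        # Keep only records in the top band of normalizedProbability.
        significant = [r for r in records.get("documents", [])
                       if r["normalizedProbability"] >= 75.0]
        print("ALERT at", start,
              "anomalyScore:", bucket["anomalyScore"],
              "significant records:", len(significant))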