Best practices for selecting Influencers

What is an Influencer?

An Influencer is a person or thing that has influenced or contributed to an anomaly.

Results are aggregated for each Influencer, for each bucket, across all detectors. In this way, a combined anomaly score is calculated for each Influencer which determines its relative anomalousness.
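
The aggregation described above can be pictured with a small sketch. The records, field names and scores below are made up for illustration, and taking the maximum across detectors is only a placeholder; the actual rule used to combine scores is not described here.

```python
from collections import defaultdict

# Hypothetical per-detector results:
# (bucket, detector, influencer field, influencer value, score)
records = [
    ("00:00", "high_sum(bytes_sent)", "clientip", "10.0.0.1", 80.0),
    ("00:00", "high_count", "clientip", "10.0.0.1", 35.0),
    ("00:00", "high_count", "clientip", "10.0.0.2", 12.0),
    ("00:05", "rare by country", "clientip", "10.0.0.2", 60.0),
]


def combine_scores(records):
    """Group results by (bucket, field, value) across all detectors.

    ASSUMPTION: max() stands in for the real combination rule.
    """
    combined = defaultdict(float)
    for bucket, _detector, field, value, score in records:
        key = (bucket, field, value)
        combined[key] = max(combined[key], score)
    return dict(combined)


scores = combine_scores(records)
```

Each key in scores identifies one Influencer value in one bucket, so ranking the values within a bucket gives the relative ordering that the text describes.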

You can specify one or more Influencers. Specifying at least one Influencer is strongly recommended for the following reasons:

  • It allows you to blame someone/something for the anomaly
  • It simplifies and aggregates results
  • The Results Dashboards are designed primarily around Influencers

Picking a good Influencer

The best Influencer is the person or thing that you want to blame for the anomaly. In many cases, a user or clientip field makes an excellent Influencer.

By/over/partition fields are usually good candidates for Influencers.

For example, in a firewall log, you may wish to analyze unusual data volumes, unusual event rates and unusual destination countries. Such a job would contain the following three detectors:

high_sum(bytes_sent) over clientip
high_count over clientip
rare by country over clientip

Influencers: clientip, country

By specifying clientip as an Influencer, the results will show the combined anomalousness for each clientip, giving the most unusual clientip the highest score.

By specifying a second Influencer of country, the results will show which country is the most anomalous. This will include countries which may have received unusually high bytes_sent or high event rates as well as rarely visited countries.
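
As a rough illustration, the three detectors and two Influencers above could be turned into a structured configuration as follows. The helper names (parse_detector, build_analysis_config) and the dictionary keys (function, field_name, by_field_name, over_field_name, influencers) are assumptions made for this sketch; check the job configuration reference for the exact names your version expects. The parser only handles the by/over shorthand used in this example.

```python
import re

# Matches shorthand such as "high_sum(bytes_sent) over clientip"
# or "rare by country over clientip".
DETECTOR_RE = re.compile(
    r"^(?P<function>\w+)"
    r"(?:\((?P<field_name>\w+)\))?"
    r"(?:\s+by\s+(?P<by_field_name>\w+))?"
    r"(?:\s+over\s+(?P<over_field_name>\w+))?$"
)


def parse_detector(text):
    """Turn one shorthand detector line into a dict, omitting unused parts."""
    match = DETECTOR_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized detector: {text!r}")
    return {k: v for k, v in match.groupdict().items() if v is not None}


def build_analysis_config(detectors, influencers):
    """Assemble a hypothetical analysis configuration (key names assumed)."""
    return {
        "detectors": [parse_detector(d) for d in detectors],
        "influencers": list(influencers),
    }


config = build_analysis_config(
    [
        "high_sum(bytes_sent) over clientip",
        "high_count over clientip",
        "rare by country over clientip",
    ],
    ["clientip", "country"],
)
```

Note that the Influencers are listed separately from the detectors, which reflects that an Influencer is a property of the job as a whole rather than of any single detector.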

Influencers can be any field in the source data; they do not need to be fields specified in detectors, although they often are.

If the source data has an additional user field, then this would also be a good Influencer. The detectors model the behavior of the population of clientip values. If anomalies occur, the analysis additionally detects whether a particular user contributed significantly to that anomaly.

Impact on results

Consider the following scenario, which looks for anomalies in the rate of errors raised by a system:

high_count by error_code over component

Influencers: component

The field component is a good choice for an Influencer. It allows you to answer the question, “Which components are behaving anomalously?”

The field error_code is probably not a good choice for an Influencer, assuming that you do not need to answer the question, “Which error_codes are behaving anomalously?”

In the example above, adding error_code as an extra Influencer will not change the results for components. The only difference is that additional results are stored, which may need to be filtered out when exploring results. The downside is small, so if you think you might need to look at the results from the point of view of error_codes, then add it as an Influencer.
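
The filtering mentioned above can be sketched as follows. The result records, their keys (influencer_field, influencer_value, score) and the scores are hypothetical and only illustrate the idea; real results will have their own structure.

```python
# Hypothetical influencer results for a job that lists both
# component and error_code as Influencers.
results = [
    {"influencer_field": "component", "influencer_value": "auth", "score": 91.0},
    {"influencer_field": "error_code", "influencer_value": "E503", "score": 88.0},
    {"influencer_field": "component", "influencer_value": "billing", "score": 40.0},
    {"influencer_field": "error_code", "influencer_value": "E404", "score": 15.0},
]


def top_influencers(results, field):
    """Keep results for one Influencer field, highest score first."""
    selected = [r for r in results if r["influencer_field"] == field]
    return sorted(selected, key=lambda r: r["score"], reverse=True)


# View the results from the point of view of components only.
components = top_influencers(results, "component")
```

Calling top_influencers(results, "error_code") gives the alternative view, which is only available because error_code was also listed as an Influencer.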