Detector Configuration

An Anomaly Search requires the following core items to be configured.

  • Search string - This is the Splunk search that defines what data is fed into the analysis engine.
  • Bucket span - This is the window for time series analysis. It is typically between 5 minutes and 1 hour, although this varies depending on the rate of data, the typical duration of anomalies, and the urgency with which you need to be notified.
  • Detectors - These are the analysis functions to be performed, for example sum(bytes) and count. There can be one or more detectors.
  • Influencers - These are the persons or entities likely to be to blame for the anomaly. There can be one or more influencer types defined in an Anomaly Search, for example user and clientip. An overall anomaly score is calculated per bucket for each influencer, across all detectors. Influencers are optional but strongly recommended.
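
Putting these items together, a minimal sketch of a configuration (the firewall scenario, field names and spans are assumptions for illustration) might use a search that returns firewall events containing bytes, user and clientip fields, a 15 minute bucket span, user and clientip as influencers, and a single detector such as:

high_sum(bytes) over clientip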

The Anomaly Search detector configuration options are listed below.

Syntax

function(field1) [by <field2>] [over <field3>] [partitionfield=<field4>] [summarycountfield=<field5>] [categorizationfield=<field6>] [excludefrequent=true] [usenull=true]
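
For illustration, a detector using this syntax with some of the optional settings might look like the following (responsetime, uri and host are assumed field names):

high_mean(responsetime) by uri partitionfield=host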

Functions

Certain functions have high and low sided variants. These are recommended when only anomalies in one direction are of interest. For example, if you are looking to detect peaks in event rate, using the high_count function is a more targeted approach; any anomalies due to a drop in event rate (had they occurred) would not be reported.
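
As a sketch, assuming the data contains a host field:

high_count by host

This flags unusually high event rates for each host; drops in event rate are not reported.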

Multiple detectors, each using different functions, can be specified. For example, in one Anomaly Search you can look at the count and the sum of transactions, as well as look for rare products being purchased.
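
For instance, that scenario could be expressed as three detectors in a single Anomaly Search (transactions and product are assumed field names):

count
sum(transactions)
rare by product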

Count Functions
count, low_count, high_count: the number (or rate) of events within a bucketspan
non_zero_count, low_non_zero_count, high_non_zero_count: the same as count, but ignores empty buckets; for use when data is sparse
distinct_count(), low_distinct_count(), high_distinct_count(): the count of distinct values of a field within a bucketspan

Sum Functions
sum(), low_sum(), high_sum(): the sum of the values of a numerical field within a bucketspan
non_null_sum(), low_non_null_sum(), high_non_null_sum(): the same as sum(), but ignores empty buckets; for use when data is sparse

Metric Functions
min(): the minimum value of a numerical field within a bucketspan
max(): the maximum value of a numerical field within a bucketspan
mean(), low_mean(), high_mean(): the mean (average) value of a numerical field within a bucketspan
metric(): combines min(), max() and mean(); does not include sum()

Rare Functions
rare: rarely occurring items
freq_rare: rare items that occur often (pervasively)

Information Content Functions
info_content(), low_info_content(), high_info_content(): a measure of the information contained in strings within a bucket; especially suited to detecting DNS tunneling where the length of the string alone is not enough

Time Functions
time_of_day: events that occur at unusual times of day
time_of_week: events that occur at unusual times of week

“By” fields

A detector containing only a by field (and not an over field) will model the behavior of each entity over time. It will detect anomalies relative to each entity's own past behavior.

If anomalies occur for many entities in the same bucket, additional emphasis is given to that bucket, as this is indicative of a system-wide issue.
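
A minimal sketch, assuming a web access log with a status field:

count by status

This builds a separate event-rate model for each status code and flags buckets where a code's rate deviates from its own history.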

“Over” fields

A detector containing only an over field will model the behavior of the entire population. It will detect anomalies where the over field entity differs from the population.

A detector that contains both by and over fields will model the behavior of the population separately for each by field entity. It will detect anomalies where the over field entity differs from the population for that by field entity.
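
For illustration (clientip, bytes and uri are assumed field names):

sum(bytes) over clientip
count by uri over clientip

The first detector compares each clientip's total bytes against the population of all clientips. The second, for each uri, compares a clientip's event rate against the rates of other clientips for that uri.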

If you are aware of very active entities that may dominate results, then consider using excludefrequent=true for your population analysis.

“Partition” fields

The partitionfield allows completely separate baselines to be created for each value of the partitionfield. This segments the modeling so that it is performed separately for each partition entity that exists. Partition fields can be used in conjunction with by and/or over fields.

In the following example, separate models are built for each application, e.g. dropbox, facebook, crm, mail:

high_sum(bytes) over clientip partitionfield=application
  • Partitions each application separately
  • Models the sum of bytes transferred
  • Detects clientips that transfer an unusually large amount of data compared to other clientips

Partition fields are ideal when you expect behavior to be similar within each application but different between applications, i.e. the behavior of dropbox will not influence the results for mail. If you want separate modeling performed for each entity, choose a partitionfield.

When aggregating these anomalies to calculate the bucket-level Anomaly Score, emphasis is given to times when a single partitionfield entity behaves anomalously, rather than to times when many partitionfield entities behave anomalously at the same time.

Note: Analysis using a partitionfield has higher memory requirements and is not recommended for more than 50,000 partitions.

Summary count field

count by product summarycountfield=orders_per_minute
sum(total_amt) by product summarycountfield=orders_per_minute

If your input data is already summarized, you can specify the count field as an input to the analysis. This improves the accuracy of anomaly detection on pre-summarized data.

The above example will detect anomalies in both the rate of orders per minute and the sum of the transaction amount, even though the input data is already summarized by the minute.

Using a summary count field is an optional configuration setting, set to off by default. It is not available in Timechart Mode.

Only one summarycountfield may be specified per Anomaly Search.

We recommend using the summary count field if your input data is already summarized, taking into account the following considerations:

  • No examples in results: We are unable to bring back anomaly examples as part of our result set. These are normally available when you drill into the anomaly results.
  • More blocky sparklines: The sparklines will appear more stepped as they will display across fewer points.
  • Only one summarycountfield may be specified: The Anomaly Search configuration may only contain one summarycountfield, which will be used in all detectors.
  • Not recommended for use with categorization: Using this in conjunction with prelertcategory is inappropriate, as the field being categorized is likely to vary in every raw event. In this case the summary count will always be 1, i.e. the data is effectively not summarized.
  • Cannot be used with metric() function: Use of the metric() function is not supported; however, the min(), max() and mean() functions are supported.
  • The bucketspan may be larger than the summarization span: For example, if your input data is summarized at 5 minute intervals, you may use an hourly bucketspan and still benefit from accurate analysis. Please ensure that the bucketspan is a multiple of the summarization span and that the summarization span is less than or equal to the bucketspan.
  • Use the appropriate count or sum analysis functions: The summarycountfield represents the number of data records that were summarized. If you wish to perform an event rate analysis, use a count function rather than attempting to sum the summarycountfield (see the example after this list).
  • Not recommended for use with the stats command: In Evaluation Mode, results are likely to be affected by the Splunk default for output rows, which is limited to 50,000. Anomaly Searches run continuously or on historical data are submitted using the REST API and do not suffer this limitation; however, use of the stats command with summarycountfield is still not recommended.
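
To illustrate the point about count functions, assuming input summarized per minute into an orders_per_minute field as in the earlier example:

Use:   count by product summarycountfield=orders_per_minute
Avoid: sum(orders_per_minute) by product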

If your input data is not summarized, then use of StatsReduce (see Using StatsReduce in distributed environments) in an Anomaly Search is always recommended over manual summarization of the input data using the stats command.

Categorization field

Categorization uses a proprietary algorithm to group together unstructured events. This is a typical requirement when analyzing event rates and rare events in log files that do not have defined fields. Categorization enables this analysis by automatically deriving categories and assigning each data record to the most suitable category.

One categorizationfield may be specified per sourcetype.

count by prelertcategory [categorizationfield=<message>]
rare by prelertcategory [categorizationfield=<message>]
prelertcategory:
 Required. A Prelert-calculated field; specifying it instructs the input data to be categorized.
categorizationfield:
 Optional. The name of the field to be categorized. If not specified, _raw will be used as a default. Only one categorization field can be specified per sourcetype.

The following unstructured data example would be categorized into 3 different message types:

2015-09-30 10:00:00 EST INFO Server started on localhost
2015-09-30 10:00:01 EST INFO Connecting to server at http://localhost:8888/login/
2015-09-30 10:04:15 EST DEBUG Created file banana.txt
2015-09-30 10:05:24 EST DEBUG Created file apple.txt
2015-09-30 10:05:44 EST DEBUG Created file orange.txt
...
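
For illustration, the derived categories would roughly correspond to the three message structures below (the category numbering is an assumption):

prelertcategory 1: INFO Server started on localhost
prelertcategory 2: INFO Connecting to server at <url>
prelertcategory 3: DEBUG Created file <filename>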

Exclude frequent

sum(bytes_out) over hrd excludefrequent=true
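
(In this example, hrd is assumed to be a field containing the highest registered domain of the destination, e.g. gmail.com.)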

Frequent entities may dominate results.

For example, in a typical network log, a lot of data is sent to gmail because it is a popular application. You may or may not wish for this activity to influence the modeling, as it could give less significance to other anomalous destinations.

An entity may be frequent in time or frequent in a population. For example, looking at firewall logs, gmail is frequent in time because data is sent to gmail in most buckets. It is also frequent in the population because most client machines send data to gmail.

When exclude frequent is enabled, we use machine learning techniques to identify frequent entities (in this case “gmail”). Once they are considered frequent, they are automatically excluded from being modeled.

We continue to monitor the entity so that if it is no longer regarded as frequent it can be included back into the analysis. This prevents very frequent entities from having excessive influence which could result in anomalies being overshadowed.

true:Frequent entities will be excluded from the analysis.
false:Default. Frequent entities will be included in the analysis.

Use null

sum(transactions) by customer usenull=true

Controls whether or not a series is modeled for events where the by field or over field is missing. For example, suppose you are analyzing “count by airline” and an event is received that has no airline field, or has a null value or an empty string as the value of airline.

true:Events where the by/over field is null or an empty string will be included in the analysis against an empty string entity value.
false:Default. Events where the by/over field is null will be ignored.

Note: When using a partitionfield, results are always created even if the partitionfield value is null/empty/missing. Effectively, usenull=true behavior is always enabled in the case of a partitionfield.

See also