Using StatsReduce in distributed environments

StatsReduce enables massive scalability gains with distributed analytics.

StatsReduce leverages the distributed, scalable Splunk architecture to pre-summarize data automatically on the Splunk indexers. This improves the scalability of Anomaly Searches, delivering large performance gains with less load on the Splunk environment whilst maintaining data accuracy.

Until now, anomaly detection has been based on simple statistics that Prelert code calculated from raw events grouped into time buckets. StatsReduce works by offloading the calculation of these per-time-bucket statistics to Splunk and feeding the pre-summarized statistics into a deeper part of Prelert’s anomaly detection code.

Along with the summary results, we additionally capture statistical information on the data that was summarized. We use this for summarization-aware modeling that allows us to continue with the same high level of accuracy.

For example, capturing only a single average for a time bucket could cause us to miss an anomaly. Capturing more granular averages, together with the number of events and the range of values, ensures that we maintain precision.
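
The point can be illustrated with a minimal Python sketch. It is not Prelert's code: the summarize function and the sample values are hypothetical, chosen only to show two buckets that share an average but differ in range.

```python
from statistics import mean

def summarize(values):
    """Summarize one time bucket (hypothetical helper, not a Prelert API)."""
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }

# Illustrative values only: both buckets average 50.
normal_bucket = [50, 50, 50, 50]
anomalous_bucket = [0, 0, 0, 200]  # one extreme spike hidden by the average

normal = summarize(normal_bucket)
anomalous = summarize(anomalous_bucket)

# The means are identical, so the mean alone cannot separate the buckets.
print(normal["mean"], anomalous["mean"])
# The range of values exposes the spike.
print(normal["max"], anomalous["max"])
```

With the average alone the two buckets look the same; the minimum, maximum and count are what allow a summarization-aware model to tell them apart.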

What sort of performance gains are achievable?

Performance gains depend on the characteristics of the input data and the number of indexers. Data that summarizes well will realize huge gains, whereas data that contains many by-fields or over-fields may not gain as much.
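
As a rough way to reason about how well data summarizes, the sketch below compares the number of raw events with the number of summary rows. It assumes a simplified model in which each time bucket produces at most one summary row per distinct combination of by-field and over-field values. The function names and figures are hypothetical and are not measurements.

```python
def summary_rows(num_buckets, distinct_field_combinations):
    """Upper bound on summary rows under the simplified model (assumption)."""
    return num_buckets * distinct_field_combinations

def reduction_ratio(raw_events, num_buckets, distinct_field_combinations):
    """Raw events per summary row; higher means the data summarizes better."""
    return raw_events / summary_rows(num_buckets, distinct_field_combinations)

# Illustrative figures: one day of data in 5-minute buckets (288 buckets).
raw_events = 1_000_000
few_fields = reduction_ratio(raw_events, 288, 10)      # few distinct combinations
many_fields = reduction_ratio(raw_events, 288, 3000)   # many distinct combinations

print(round(few_fields, 1), round(many_fields, 2))
```

Under this model, the first case collapses several hundred raw events into each summary row, while the second leaves almost as many rows as events, so there is little left to gain from summarization.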

In our tests, we have seen Anomaly Searches run 15-20 times faster. With more indexers, the gains would have been even bigger. In some cases, searches that would previously have been run on segmented data can now be run in their entirety.

How does it work?

StatsReduce uses the Splunk “stats” command to distribute the calculation of results between the indexers and the search heads.

In the background, when you use the “stats” command, Splunk actually runs two commands: “prestats” and “stats”. In a distributed environment, the “prestats” command runs on the indexers whilst the “stats” command runs on the search head. For the majority of functions, this reduces the amount of data that needs to be transferred from the indexers to the search head.
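
The two-phase pattern can be sketched in Python as follows. This is a conceptual illustration only, not Splunk's implementation: the Partial class and the partial_stats and merge_stats functions are hypothetical names, and each indexer is assumed to hold at least one value.

```python
from dataclasses import dataclass

@dataclass
class Partial:
    """Small partial aggregate sent from an indexer (hypothetical structure)."""
    count: int
    total: float
    minimum: float
    maximum: float

def partial_stats(values):
    """Indexer-side phase: reduce raw values to one partial aggregate."""
    return Partial(len(values), sum(values), min(values), max(values))

def merge_stats(partials):
    """Search-head phase: combine partial aggregates into final statistics."""
    count = sum(p.count for p in partials)
    total = sum(p.total for p in partials)
    return {
        "count": count,
        "mean": total / count,
        "min": min(p.minimum for p in partials),
        "max": max(p.maximum for p in partials),
    }

# Illustrative values held on two indexers.
indexer_a = [10.0, 20.0, 30.0]
indexer_b = [40.0, 50.0]

result = merge_stats([partial_stats(indexer_a), partial_stats(indexer_b)])
print(result)
```

Each indexer sends four numbers rather than its raw values, and the merged mean is still exact because it is rebuilt from the transferred counts and sums rather than by averaging per-indexer averages.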

By using Splunk “stats”, StatsReduce also benefits from other system optimizations. For example, Report Acceleration and Datamodel Acceleration can both improve performance. The size of this improvement depends upon the characteristics of the input data and will be greater with larger bucket sizes and for LookBack.

When should we use it?

StatsReduce is an optional configuration setting for Anomaly Searches and is set to off by default. All Timechart Mode searches run with StatsReduce enabled. StatsReduce is not available when using Evaluation Mode.

We recommend enabling StatsReduce for Anomaly Searches when running in distributed environments, taking into account the following considerations:

  1. No examples in results - We are unable to bring back anomaly examples as part of our result set. These are normally available when you drill into the anomaly results.
  2. More blocky sparklines - The sparklines will appear more stepped as they will display across 10 points rather than 30.
  3. The detector configuration may only contain a [default] stanza - It is not possible to enable StatsReduce if using sourcetype-specific stanzas.
  4. Categorization is not possible - StatsReduce cannot be enabled if the prelertcategory command is used in an Anomaly Search configuration.
  5. Performance benefits will only be realized if the data is suitable for summarization - If your data uses a very small bucket span, is already summarized, or contains many by-fields or over-fields, then StatsReduce may not achieve performance improvements.