Summarization of Input Data

One of the most powerful ways to scale the volume of data on which you detect anomalies is to summarize the input yourself before supplying it to Prelert. Doing this lets you take advantage of the natural parallelism built into many big data stores. For example, if your data is stored in Hadoop, you could run a map-reduce job that calculates the average value of a numeric field for each time bucket and feeds these averages to Prelert instead of the raw data. If your data is stored in Elasticsearch, you can run a search that returns averages instead of raw results, and Elasticsearch will distribute the calculation of the averages for you automatically.

Specifying that you will send summarized data

If you specify a non-null value for the summaryCountFieldName setting in the job's analysis configuration when creating a job, that job will expect to receive summarized input.

For each combination of values of the byFieldName, overFieldName and partitionFieldName fields, the field specified by summaryCountFieldName must store the count of raw events that were summarized for that combination. For event rate functions (such as count, distinct_count and rare), this count of raw events is a sufficient summary of the data. For functions that work on numeric values (such as mean, min and max), you must additionally supply the function value, in the field specified by the fieldName setting.
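The rule above can be sketched in a few lines of Python. This is only an illustration, not part of the Engine API: the helper name summarize_group and its parameters are hypothetical, and it assumes that any function other than mean, min or max is an event rate function needing only the count.

```python
from statistics import fmean

# Functions that work on numeric values need a function value as well as the count.
METRIC_FUNCTIONS = {"mean": fmean, "min": min, "max": max}


def summarize_group(function, raw_values, field_name="value", count_field="count"):
    """Build the summary fields for one by/over/partition combination.

    function    -- the detector function, e.g. "count" or "mean"
    raw_values  -- the raw values (or raw events) being summarized
    field_name  -- hypothetical stand-in for the detector's fieldName
    count_field -- hypothetical stand-in for summaryCountFieldName
    """
    row = {count_field: len(raw_values)}
    if function in METRIC_FUNCTIONS:
        # Metric function: supply the function value in the fieldName field.
        row[field_name] = METRIC_FUNCTIONS[function](raw_values)
    # Assumption: anything else is an event rate function, so the count suffices.
    return row
```

For a mean detector the row carries both the count and the mean; for a count detector it carries the count alone.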

Example

Returning to the airline tutorial, suppose you want to look for anomalies in the response time for booking requests for the different airlines, but instead of sending one input record per booking to Prelert you want to summarize the input yourself.

Your job configuration could look like this:

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8080/engine/v2/jobs' -d '{
    "id" : "farequote_summarized",
    "description" : "airline bookings summarized input",
    "analysisConfig" : {
        "summaryCountFieldName":"count",
        "bucketSpan":3600,
        "detectors" :[{"function":"mean","fieldName":"avg_responsetime","byFieldName":"airline"}]
    },
    "dataDescription" : {
        "fieldDelimiter":",",
        "timeField":"time",
        "timeFormat":"yyyy-MM-dd HH:mm:ssX"
    }
}'

The first hour of input contains the following raw data relating to airline UAL:

time,airline,responsetime,sourcetype
2013-01-28 00:00:00Z,UAL,9.225,farequote
2013-01-28 00:01:10Z,UAL,8.4275,farequote
2013-01-28 00:01:36Z,UAL,9.946,farequote
2013-01-28 00:01:55Z,UAL,10.7749,farequote
2013-01-28 00:03:36Z,UAL,10.1147,farequote
2013-01-28 00:04:45Z,UAL,10.3656,farequote
2013-01-28 00:05:59Z,UAL,11.3093,farequote
2013-01-28 00:07:57Z,UAL,10.9432,farequote
2013-01-28 00:10:24Z,UAL,10.6465,farequote
2013-01-28 00:12:34Z,UAL,7.6049,farequote
2013-01-28 00:12:58Z,UAL,9.6703,farequote
2013-01-28 00:14:19Z,UAL,9.6648,farequote
2013-01-28 00:15:44Z,UAL,9.0402,farequote
2013-01-28 00:18:24Z,UAL,9.9996,farequote
2013-01-28 00:19:31Z,UAL,11.1811,farequote
2013-01-28 00:20:21Z,UAL,11.0237,farequote
2013-01-28 00:21:43Z,UAL,9.5246,farequote
2013-01-28 00:23:46Z,UAL,11.4271,farequote
2013-01-28 00:23:57Z,UAL,8.5392,farequote
2013-01-28 00:24:45Z,UAL,9.9468,farequote
2013-01-28 00:25:36Z,UAL,9.5225,farequote
2013-01-28 00:27:39Z,UAL,10.1003,farequote
2013-01-28 00:29:31Z,UAL,10.984,farequote
2013-01-28 00:32:18Z,UAL,10.0543,farequote
2013-01-28 00:35:08Z,UAL,10.2921,farequote
2013-01-28 00:37:04Z,UAL,11.1567,farequote
2013-01-28 00:38:37Z,UAL,10.0058,farequote
2013-01-28 00:39:23Z,UAL,9.4596,farequote
2013-01-28 00:42:14Z,UAL,9.0432,farequote
2013-01-28 00:43:30Z,UAL,8.8289,farequote
2013-01-28 00:46:02Z,UAL,10.6258,farequote
2013-01-28 00:48:38Z,UAL,10.9136,farequote
2013-01-28 00:49:45Z,UAL,9.5679,farequote
2013-01-28 00:52:37Z,UAL,10.8576,farequote
2013-01-28 00:54:11Z,UAL,9.7534,farequote
2013-01-28 00:55:16Z,UAL,10.1497,farequote
2013-01-28 00:56:09Z,UAL,9.5821,farequote
2013-01-28 00:56:38Z,UAL,12.7192,farequote
2013-01-28 00:58:45Z,UAL,7.9734,farequote

Instead of sending this raw data, you would send the following summary row:

time,airline,avg_responsetime,sourcetype,count
2013-01-28 00:58:45Z,UAL,10.0247,farequote,39

There are 39 raw events for airline UAL during the first hour of the tutorial data, hence the count field is set to 39. The average value of the 39 response times is 10.0247, so the avg_responsetime field is set to this.
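As a rough sketch of how such a summary row might be produced, the following Python uses only the standard library. The function name summarize and the constant BUCKET_SPAN are hypothetical and not part of Prelert. The sketch assumes the raw data is CSV text in the format shown above, groups by sourcetype as well as airline, stamps each summary row with the time of the last raw event in its bucket (as the example row does), and rounds the average to four decimal places to match the example.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SPAN = 3600  # seconds; matches "bucketSpan" in the job configuration
TIME_FORMAT = "%Y-%m-%d %H:%M:%SZ"  # the trailing "Z" is matched literally


def summarize(raw_csv, span=BUCKET_SPAN):
    """Collapse raw booking rows into one summary row per bucket and airline."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(raw_csv)):
        # Timestamps in the sample data are UTC, so attach the UTC zone explicitly.
        ts = datetime.strptime(row["time"], TIME_FORMAT).replace(tzinfo=timezone.utc)
        bucket = int(ts.timestamp()) // span
        key = (bucket, row["airline"], row["sourcetype"])
        groups[key].append((ts, float(row["responsetime"])))

    summary = []
    for (bucket, airline, sourcetype), events in sorted(groups.items()):
        values = [value for _, value in events]
        last_ts = max(ts for ts, _ in events)
        summary.append({
            "time": last_ts.strftime(TIME_FORMAT),
            "airline": airline,
            "avg_responsetime": round(sum(values) / len(values), 4),
            "sourcetype": sourcetype,
            "count": len(values),  # goes in the summaryCountFieldName field
        })
    return summary
```

Passing the 39 raw UAL rows above to summarize should yield a single row equivalent to the summary row shown, which could then be written out with csv.DictWriter and uploaded in place of the raw data.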

Advanced usage

The Engine API's advanced anomaly detection algorithms can make use of data summarized at a finer granularity than the bucket span, if it is available.

For example, suppose your bucket span is 3600 seconds (1 hour). Instead of summarizing your data by 1-hour periods, you could send 10 summaries per bucket, each corresponding to a 360-second sub-interval of the hour. If you do this, Prelert’s models will be able to react better to changes in event rate that accompany changes in metric values.
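A minimal Python sketch of sub-interval summarization follows. The function summarize_sub_intervals and its parameters are hypothetical, and it assumes events arrive as (epoch seconds, value) pairs for a single airline. Labelling each summary with the start of its sub-interval is also an assumption made for simplicity; the earlier example used the time of the last raw event instead.

```python
from collections import defaultdict


def summarize_sub_intervals(events, bucket_span=3600, summaries_per_bucket=10):
    """Summarize (epoch_seconds, value) pairs into sub-intervals of a bucket.

    Returns a list of (interval_start, count, mean) tuples ordered by time.
    With the defaults, each sub-interval is 3600 // 10 = 360 seconds long.
    """
    interval = bucket_span // summaries_per_bucket
    groups = defaultdict(list)
    for ts, value in events:
        # Align the timestamp down to the start of its sub-interval.
        groups[ts - ts % interval].append(value)
    return [
        (start, len(values), sum(values) / len(values))
        for start, values in sorted(groups.items())
    ]
```

Sub-intervals with no raw events simply produce no summary row in this sketch.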

If you do not summarize your data into sub-intervals, Prelert will still find anomalies; it will just not be able to build as good a statistical model as it could with more granular input.