Jobs Endpoint

Jobs are the key resource in the API, representing the configuration and metadata for an analytics task. Each job has a unique identifier, either provided at job creation or automatically generated by the API. This identifier is used in all subsequent operations, including streaming data to the job and querying its results.

Creating a new job

A new analysis job is created by POSTing a JSON job configuration to the jobs endpoint.

A job configuration contains:

Core Properties

The job configuration has three core properties that make it easier to manage your jobs.

id:

This unique identifier is used in all API operations. It may be specified with a “friendly” name. If not provided, a unique ID will be created using the current date-time and a one-off sequence number.

Type: String

description:

A brief description of the job.

Type: String

timeout:

If no data is received within this timeout period, measured in seconds, the job will automatically close. The default value is 600. This is an advanced option; usually left as default. This option is ignored for scheduled jobs.

Note: The job will be automatically re-opened as soon as new data is sent to it.

Type: Long
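
As an illustrative sketch (the id, description, timeout value and detector below are arbitrary examples, not values taken from this documentation), the three core properties sit at the top level of the job configuration alongside the other configuration objects:

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8080/engine/v2/jobs' -d '{
    "id": "example-core-properties",
    "description": "Demonstrates the id, description and timeout core properties",
    "timeout": 3600,
    "analysisConfig": {
        "bucketSpan": 300,
        "detectors": [{"function":"count"}]
    }
}'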

Analysis Configuration (analysisConfig)

A new job requires an analysisConfig parameter, which specifies how the data should be analyzed. It has the following properties:

detectors:

Configuration for the anomaly detectors to be used in the job. Multiple detectors can be specified. The list should contain at least one configured detector. If none are present no analysis will take place and an error will be returned. See Detector Configuration for full details.

Type: Array of Detector Configuration Objects

influencers:

A comma separated list of Influencer field names. Typically these can be the by/over/partition fields used in the Detector Configuration. You may also wish to use a field name that is not specifically named in a detector, but is available as part of the input data.

Use of influencers is strongly recommended as it enables aggregation of results from multiple detectors to a single entity. See Best practices for selecting Influencers and also Example 7 - Specify influencers.

Type: Array of Strings

bucketSpan:

The size of the interval the analysis is aggregated into, measured in seconds, with a default of 300 seconds (5 minutes).

Type: Unsigned Int

latency:

The size of the window, in seconds, in which out-of-sequence data is expected to arrive, with a default of 0 seconds (no latency). See Working with out-of-sequence data for full details.

Type: Unsigned Int

period:

The repeat interval for periodic data in multiples of batchSpan. If not specified, daily and weekly periodicity will be automatically determined. This is an advanced option; usually left as default.

Type: Unsigned Int

batchSpan:

The interval into which to batch seasonal data measured in seconds. Only relevant if period has been specified. This is an advanced option; usually left as default.

Type: Unsigned Int

summaryCountFieldName:

If not null, the input to the job is expected to be pre-summarized, and this is the name of the field in which the count of raw data points that have been summarized must be provided. Cannot be used with the metric function. The same summaryCountFieldName applies to all detectors. See Summarization of Input Data for full details.

Type: String

categorizationFieldName:

If not null, the values of the specified field will be categorized. The resulting categories can be used in a Detector by setting either of byFieldName, overFieldName or partitionFieldName to the keyword prelertcategory. See Categorization for full details.

Type: String

categorizationFilters:

When categorizationFieldName is specified, optional filters can be defined. This parameter expects an array of regular expressions. The expressions are used to filter out matching sequences from the categorization field values. This is useful for fine-tuning categorization by excluding sequences that should not be taken into consideration when defining categories, e.g. SQL statements in log files. See Categorization for full details.

Type: Array of Strings

multivariateByFields:

If set to true then the analysis will automatically find correlations between metrics for a given by field value, and then report anomalies when those correlations cease to hold. For example, suppose CPU and memory usage on host A is usually highly correlated with the same metrics on host B (perhaps because they’re running a load-balanced application). If you enable this option then anomalies will be reported when, for example, CPU usage on host A is high and CPU usage on host B is low (perhaps because the load-balancer has malfunctioned). Defaults to false.

Type: Boolean

overlappingBuckets:

If not null, either true or false.

Type: Boolean
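
Putting several of these properties together, an analysisConfig object might look like the following sketch (the field names bytes, clientip and status are illustrative placeholders for fields in your own data):

"analysisConfig": {
    "bucketSpan": 600,
    "latency": 60,
    "influencers": ["clientip"],
    "detectors": [
        {"function":"sum", "fieldName":"bytes", "byFieldName":"clientip"},
        {"function":"count", "byFieldName":"status"}
    ]
}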

Detector Configuration (detectors)

The detectors property of the Analysis Configuration object specifies which fields in the data are to be analyzed, and using which analytical functions. It is an object with the following properties:

function:

The analysis function to be used. Examples are count, rare, mean, min, max and sum. For a full list of the analytical functions see the Analytical Functions. The default function is metric, which looks for anomalies in all of min, max and mean. The metric function cannot be used with pre-summarized input, in other words, if summaryCountFieldName is not null then you must specify a function other than metric.

Type: String

fieldName:

The field to be analyzed for certain functions (e.g. sum, min, max, mean, info_content). If using an event rate function such as count or rare then this should not be specified. fieldName cannot contain double quotes or backslashes. The field should be renamed to avoid using these characters.

Type: String

byFieldName:

The field used to split the data for analyzing those splits with respect to their own history. Used for finding unusual values in the context of the split.

Type: String

overFieldName:

The field used to split the data for analyzing those splits with respect to the history of all splits. This is used for finding unusual values in the population of all splits.

Type: String

partitionFieldName:

Segment the analysis along this field to have completely independent baselines for each value of this field.

Type: String

useNull:

When there isn’t a value for the by or partition fields, this defines whether a new series should be used as the null series. The default value is false.

Type: Boolean

excludeFrequent:

May contain true, over, by or false. If set, frequent entities will be excluded from influencing the anomaly results. Entities may be considered frequent over time or frequent in a population. If working with both over and by fields, then excludeFrequent may be set to true for all fields, or specifically for the over or the by fields.

Type: String
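
For example, a detector that analyzes the mean of a metric, split by host and with an independent baseline for each data centre, could be configured as in the sketch below (value, host and datacenter are illustrative placeholder field names):

{
    "function": "mean",
    "fieldName": "value",
    "byFieldName": "host",
    "partitionFieldName": "datacenter"
}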

Important

The fieldName, byFieldName, overFieldName and partitionFieldName options in the Detector Configuration and the timeField in the Data Description must map to either a field in the data or the output of a transform. summaryCountFieldName in the Analysis Configuration cannot be the output of a transform and, if used, must be a field present in the data.

For delimited formats the configured detector fields must match elements in the data header. When uploading delimited data to the API, if the header is missing any of the configured fields or the time field, the API will return an error. If the data is in JSON format the API will tolerate individual objects with missing fields as it expects other objects to contain the field.

Field names are case sensitive; for example, the fieldName ‘CPU’ is different from the fieldName ‘cpu’.

Describing your data format (dataDescription)

On creating a new job, the default behavior of the API is to accept data in tab-separated-values format, expecting an Epoch time value in a field named time. The time field must be measured in seconds from the Epoch. If, however, your data is not in this format, you may pass a dataDescription parameter to the jobs endpoint to specify the format of your data. The dataDescription parameter is an object with the following properties:

format:

Either DELIMITED, JSON or SINGLE_LINE. The default is DELIMITED.

Type: String

fieldDelimiter:

If the data is in a delimited format with a header e.g. csv, this is the character separating the data fields. This property is only applicable if format is set to DELIMITED. The default delimiter is the tab character \t.

Type: Character

quoteCharacter:

Delimited formats can be quoted to escape fields containing the field delimiter character. This property is only applicable if format is set to DELIMITED. The default quote character is the double quote character (").

Type: Character

timeField:

The name of the field containing the timestamp. If not set the default is time.

Type: String

timeFormat:

Can be epoch, epoch_ms or a custom pattern for date-time formatting.

Custom patterns must conform to the Java DateTimeFormatter class. When using date-time formatting patterns, it is recommended to provide the full date, time and time zone, as in yyyy-MM-dd'T'HH:mm:ssX. If the given pattern is not sufficient to produce a complete timestamp, job creation will fail.

The default value is epoch which refers to Unix or Epoch time i.e. the number of seconds since 1st January 1970 (corresponding to the time_t type in C and C++).

The value epoch_ms is used where the time is measured in milliseconds since the epoch (as used by Java’s Date class).

epoch and epoch_ms accept either integer or real values.

Type: String
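
For example, a dataDescription for comma-separated data with an ISO 8601 timestamp in a field named log_time might look like the following sketch (log_time is an illustrative placeholder):

"dataDescription": {
    "format": "DELIMITED",
    "fieldDelimiter": ",",
    "quoteCharacter": "\"",
    "timeField": "log_time",
    "timeFormat": "yyyy-MM-dd'T'HH:mm:ssXXX"
}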

Data format

Data is accepted in either JSON, SINGLE_LINE or DELIMITED format.

If the input data is in a delimited format (csv, tsv, etc.) and the field delimiter is not the tab character \t, use the fieldDelimiter property to change the separator.

For unstructured data you can use the SINGLE_LINE format in combination with Data pre-processing transforms (transforms) to extract structured data. SINGLE_LINE data does not require a header record and will be read line by line, assuming each line contains a record. See Categorization.

JSON objects can contain nested objects. If you want to specify a field in a nested object, use dot ‘.’ notation. For example, if the JSON document looks like this:

{
    "metric":"metric1",
    "tags":{
        "tag1":"foo",
        "tag2":"bar"
    },
    "time":1350824400,
    "value":12345.678
}

And you wish to include the tags.tag1 field in your analysis, your detector configuration may look like:

"detectors":[
  {
    "fieldName":"value",
    "byFieldName":"metric",
    "partitionFieldName":"tags.tag1"
  }
]

If the JSON object contains a top-level field with a name that matches the concatenated nested field name, i.e. the object contains a field with the literal name tags.tag1 as well as the nested object tags with a field tag1, then the last field read is used.

Date Time Format

Important

An incorrect time format is one of the easiest and most common configuration mistakes. Please take care to get this right.

Using epoch time is the most efficient for data analysis.

Every record must have a timestamp. You can tell the API where to find the timestamp and how to parse it using the timeField and timeFormat fields. Custom date formats are parsed using the Java DateTimeFormatter pattern.

For example the string yyyy-MM-dd'T'HH:mm:ssXXX describes date-times in ISO 8601 format.

Specifying a time zone is recommended. If timeFormat does not include any time zone information, the following applies:

Data ingest method          Assumed time zone
Engine API data endpoint    The time zone that the REST API Java virtual machine is running in.
Elasticsearch scheduler     UTC

Note that when using cURL on the command line, date formats with embedded quotes such as yyyy-MM-dd'T'HH:mm:ssXXX may need to be escaped. For example:

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8080/engine/v2/jobs' -d '{
     "analysisConfig" : {
         "bucketSpan":900,
         "detectors" :[{"function":"metric","fieldName":"responseTime"}]
     },
     "dataDescription" : {
         "fieldDelimiter":",",
         "timeField":"iso_time",
         "timeFormat":"yyyy-MM-dd'"'T'"'HH:mm:ssXXX"
     }
}'

Here are some examples of common time formats.

Data example                       timeFormat
1457967488                         epoch
1457967488123                      epoch_ms
2016-12-30T22:04:36.231Z           yyyy-MM-dd'T'HH:mm:ss.SSSX
2016-12-30T22:04:36.231-08         yyyy-MM-dd'T'HH:mm:ss.SSSX
2016-12-30T22:04:36-0830           yyyy-MM-dd'T'HH:mm:ssXX
2016-12-30T22:04:36-08:30          yyyy-MM-dd'T'HH:mm:ssXXX
2016-12-30T22:47:36 PST            yyyy-MM-dd'T'HH:mm:ss zz
Tue, 3 Jun 2016 11:05:30 GMT       E, d MMM yyyy HH:mm:ss zz
Tue, 3 Jun 2016 11:05:30 UTC+5     E, d MMM yyyy HH:mm:ss O

Data pre-processing transforms (transforms)

Transforms are pre-processing steps that are applied to the input data before it is sent to the analytics. For more information about the available transforms see Available Transforms and also Example 4 - Using transforms.

transform:

The name of the transform function e.g. concat, exclude. See Available Transforms. If the transform name is not recognized an error is returned at job creation.

Type: String

arguments:

Specify for transforms that require one or more initializer arguments when first declared.

Type: Array of String

inputs:

The input to the transform is a list of field names.

Type: Array of String

outputs:

Each transform has an associated default list of output fieldnames but if you wish to override the defaults, define the output field names here. The output fields should be used as an input to one of the configured detectors.

Type: Array of String

arguments, inputs and outputs are all of the type Array of String but you can set the fields with a single value that the API will automatically convert into an array.

In this example the domain_split transform is used to divide a fully qualified domain name into its highest registered domain and sub-domain. The sub-domain is used for the detector’s over field.

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8080/engine/v2/jobs' -d '{
    "transforms" : [
        {
            "transform":"domain_split",
            "inputs":["domain"],
            "outputs":["sub_domain"]
        }
    ],
    "analysisConfig" : {
        "bucketSpan":900,
        "detectors" :[{"fieldName":"bytes_out", "overFieldName":"sub_domain"}]
    },
    "dataDescription" : {
        "timeFormat":"yyyy-MM-dd'"'T'"'HH:mm:ss.SSSXXX"
    }
}'

Scheduled data extraction (schedulerConfig)

The API can be configured to periodically retrieve input data from a data source.

dataSource:

The data source from which to extract data. Currently only “ELASTICSEARCH” is supported. (In future there will be more options.)

Type: String

dataSourceCompatibility:

Where multiple versions of the chosen dataSource are supported, and different versions need to be accessed in different ways, this field defines which query format will be used. For the “ELASTICSEARCH” data source, the two possible options are “1.7.x” and “2.x.x”. This field is required for those data sources that have multiple compatibility modes. (The “ELASTICSEARCH” data source requires this, but in future there may be data sources that don’t.)

Type: String

baseUrl:

The base URL of the REST API endpoint to be queried. Either HTTP or HTTPS may be used (depending on what the remote server supports). Example: http://myserver:9200

Type: String

username:

Optionally, a username to use with HTTP Basic Authentication when querying for data. If not specified then no HTTP Basic Authentication will be sent with the request. By default, Elasticsearch is not secured, so do not specify any username.

Type: String

password:

The password corresponding to the username. May only be specified if a username is also specified. In the stored job configuration this field will be replaced by one called encryptedPassword. By default, Elasticsearch is not secured, so do not specify any password.

Type: String

queryDelay:

The number of seconds behind real time at which data is queried. For example, if data from 10:04am may not be searchable in Elasticsearch until 10:06am, set this to 120 seconds. The default is 60 seconds.

Type: Long

frequency:

Interval at which scheduled queries should be made, in seconds. The default is either the bucket span for short bucket spans, or, for longer bucket spans, a sensible fraction of the bucket span.

Type: Long

indexes:

List of Elasticsearch indexes to search for input data. May be wildcarded using * to represent an arbitrary suffix.

Type: Array of String

types:

List of Elasticsearch types to search for within the specified indexes.

Type: Array of String

query:

Elasticsearch query DSL. Corresponds to the query object in an Elasticsearch search POST body. All options supported by Elasticsearch may be used, as this object is passed verbatim to Elasticsearch. If not specified, the default is "match_all": {}

Type: Object

script_fields:

Elasticsearch script fields specification. Corresponds to the script_fields object in an Elasticsearch search POST body. Use this to tell Elasticsearch to calculate fields at search time. The syntax is defined in the Elasticsearch documentation. By default, there are no script fields.

Type: Object

aggregations:

Elasticsearch aggregation specification. Corresponds to the aggregations object in an Elasticsearch search POST body. Use this to tell Elasticsearch to provide summary statistics as input to the Prelert analytics. This distributes the work of calculating the summary statistics over many machines in a cluster. Be aware that the output of term aggregations are sensitive to whether fields are declared “analyzed” in the mappings of the input index, and, if so, which Elasticsearch analyzer is configured. For example, Elasticsearch’s default analyzer converts all letters to lower case and creates a separate term for each token in an analyzed field. This means the “by”, “over” and “partition” fields used in the Prelert analysis may not contain full field values from the input index. It is recommended that you consult with Prelert support prior to using aggregations to obtain input data from Elasticsearch. By default, aggregations are not used.

Type: Object

aggs:

Synonym for aggregations (as supported by Elasticsearch). Do not specify both aggregations and aggs.

Type: Object

retrieveWholeSource:

Should the input data be obtained by requesting the minimum subset of fields, or the whole of the _source document? The default is to request the minimum subset of fields. Must be false if script_fields is specified. This setting should only be modified on the advice of a Prelert engineer.

Type: Boolean

scrollSize:

Number of documents to retrieve from Elasticsearch per scroll. Defaults to 1000. This setting should only be modified on the advice of a Prelert engineer.

Type: Integer
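
As an illustrative sketch only (the job ID, index, type and timestamp field are placeholders, and the dataDescription shown assumes JSON documents carrying an epoch_ms timestamp in a field named @timestamp; adjust both to match your own index), a job with a schedulerConfig that queries a local Elasticsearch 2.x cluster might look like:

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8080/engine/v2/jobs' -d '{
    "id": "scheduled-job",
    "analysisConfig": {
        "bucketSpan": 600,
        "detectors": [{"function":"count"}]
    },
    "dataDescription": {
        "format": "JSON",
        "timeField": "@timestamp",
        "timeFormat": "epoch_ms"
    },
    "schedulerConfig": {
        "dataSource": "ELASTICSEARCH",
        "dataSourceCompatibility": "2.x.x",
        "baseUrl": "http://localhost:9200",
        "indexes": ["logstash-*"],
        "types": ["logs"],
        "query": {"match_all": {}},
        "queryDelay": 60
    }
}'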

Limits on the size of the analysis (analysisLimits)

The API provides limits for the size of the internal mathematical models held in memory. These can be set per job, and do not control the memory used by other processes. If necessary, they can also be updated after the job is created.

modelMemoryLimit:

The maximum amount of memory, in MiB, that the internal mathematical models can use. Once this limit is approached, pruning of data becomes more aggressive. Upon exceeding this limit, new entities will not be modeled. The default is 4096. See Example 5 - Specify a modelMemoryLimit.

Type: Long

categorizationExamplesLimit:

The maximum number of examples stored per category, in memory and in the results data store. The default is 4. See Controlling the number of examples stored for each category for more information about adjusting this setting.

Type: Long
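
Both limits are set in an optional top-level analysisLimits object in the job configuration, for example (the values here are illustrative, not recommendations):

"analysisLimits": {
    "modelMemoryLimit": 2048,
    "categorizationExamplesLimit": 8
}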

Model debug (modelDebugConfig)

This advanced configuration option will store model information along with results allowing a more detailed view into anomaly detection. Enabling this can add considerable overhead to the performance of the system and is not feasible for jobs with many entities.

Model debug provides a simplified and indicative view of the model and its bounds. It does not display complex features such as overlapping buckets, multivariate correlations or multimodal data. As such, anomalies may occasionally be reported which cannot be seen in model debug.

Model debug can be configured when the job is created or updated later. It must be disabled if performance issues are experienced.

See Example 8 - Using model debug.

writeTo:

The location to write model debug to, from the following options:

data_store: Writes to Elasticsearch, using the same index as the job results, with _type: modelDebugOutput.
file: Writes to the file $PRELERT_LOGS_DIR/<jobId>/modelDebugData.json.

Type: String

terms:

Limits data collection to this comma-separated list of partition or by field values. If terms is not specified or is an empty string, no filtering is applied.

Type: String

boundsPercentile:

Advanced configuration option. Specifies the percentile for which values for lower and upper bounds will be collected. Recommended value 95.0.

Type: Number

An example of the output is as follows. This can be viewed using standard Kibana visualizations and dashboards. When plotting on a time chart, try to ensure that the time aggregation interval is the same as (or as close as possible to) the bucketSpan:

{
    "timestamp":1397545200000,
    "partitionFieldName":"continent",
    "partitionFieldValue":"Africa",
    "feature":"'mean value by person and attribute'",
    "byFieldName":"resource",
    "byFieldValue":"CPU",
    "debugLower":29.6513,
    "debugUpper":30.2704,
    "debugMedian":29.96,
    "overFieldName":"bar",
    "overFieldValue":"foo",
    "actual":29.9272
}

Warning

Field debugMedian has replaced debugMean since version 2.1.

Examples of creating a new job

These examples all use the cURL command line client.

Example 1 - Create a simple job

Create a job using an analysis configuration with a bucket span of 1 hour and a single detector configured to analyze the field hitcount by the field url, and give the job a succinct description:

curl -X POST -H 'Content-Type: application/json' http://localhost:8080/engine/v2/jobs -d '
{
    "description": "hitcount by url, 1 hour bucket",
    "analysisConfig": {
        "bucketSpan":3600,
        "detectors": [{"function":"metric", "fieldName":"hitcount", "byFieldName":"url"}]
    }
}'

Example 2 - Specify a custom time format

Create a job by supplying an analysis configuration, and tell the API to expect CSV data with a field timestamp containing the record’s timestamp formatted as EEE, d MMM yyyy HH:mm:ss Z (for example Mon, 20 Jan 2014 16:01:00 -0500). The name ‘website-access-logs’ will be used as the job ID in all future references. Remember that for some requests the job ID forms part of the URL, so any white space characters would have to be URL encoded:

curl -X POST -H 'Content-Type: application/json' http://localhost:8080/engine/v2/jobs -d '
{
    "id": "website-access-logs",
    "analysisConfig": {
        "bucketSpan":86400,
        "detectors" : [{"function":"metric", "fieldName":"hitcount", "byFieldName":"url"}]
    },
    "dataDescription": {
        "fieldDelimiter":",",
        "timeField":"timestamp",
        "timeFormat":"EEE, d MMM yyyy HH:mm:ss Z"
    }
}'

Example 3 - Specify an ISO 8601 time format and JSON

Create a job by supplying an analysis configuration, and tell the API to expect JSON data with a field iso_timestamp containing the record’s timestamp in ISO 8601 format. Note the single quotes around the ‘T’ in the timeFormat field.

curl -X POST -H 'Content-Type: application/json' http://localhost:8080/engine/v2/jobs -d '
{
    "id": "website-access-job-json",
    "analysisConfig": {
        "bucketSpan":86400,
        "detectors" : [{"function":"metric", "fieldName":"hitcount", "byFieldName":"url"}]
    },
    "dataDescription": {
        "format":"JSON",
        "timeField":"iso_timestamp",
        "timeFormat":"yyyy-MM-dd'"'T'"'HH:mm:ssXXX"
    }
}'

Example 4 - Using transforms

Create a job that uses the concat transform to make a unique value from the ‘host’ and ‘metric’ fields. The detector uses the output of the concat transform as its byFieldName. Note that when defining transforms, if only a single input or output is used it does not have to be in JSON array notation.

curl -X POST -H 'Content-Type: application/json' http://localhost:8080/engine/v2/jobs -d '
{
    "id": "job-with-transform",
    "transforms" : [
        {
            "transform" : "concat",
            "inputs" : ["host", "metric"],
            "outputs" : "host_metric"
        }
    ],
    "analysisConfig": {
        "bucketSpan":86400,
        "detectors" : [{"function":"mean", "fieldName":"value", "byFieldName":"host_metric"}]
    },
    "dataDescription": {
        "format":"DELIMITED",
        "fieldDelimiter":",",
        "quoteCharacter":"\"",
        "timeField":"starttime",
        "timeFormat":"yyyy-MM-dd'"'T'"'HH:mm:ss.SSSX"
    }

}'

Example 5 - Specify a modelMemoryLimit

Create a job that has a modelMemoryLimit defined to allow the model size to be greater than the default of 4096 MiB.

curl -X POST -H 'Content-Type: application/json' http://localhost:8080/engine/v2/jobs -d '
{
    "id": "big-job",
    "description": "job with modelMemoryLimit set",
    "analysisConfig": {
        "bucketSpan": 86400,
        "detectors" : [{"function":"metric", "fieldName":"hitcount", "byFieldName":"url"}]
    },
    "analysisLimits" : {
        "modelMemoryLimit": 6144
    },
    "dataDescription": {
        "fieldDelimiter":",",
        "timeField":"timestamp",
        "timeFormat":"EEE, d MMM yyyy HH:mm:ss Z"
    }
}'

Example 6 - Specify a custom time format with escape characters

Create a job with a timeFormat that can handle timestamps enclosed in square brackets (e.g. [2015-12-01 10:00:00 -05:00]):

curl -X POST -H 'Content-Type: application/json' http://localhost:8080/engine/v2/jobs -d '
{
    "id": "website-access-logs",
    "analysisConfig": {
        "bucketSpan":86400,
        "detectors" : [{"function":"metric", "fieldName":"hitcount", "byFieldName":"url"}]
    },
    "dataDescription": {
        "fieldDelimiter":",",
        "timeField":"timestamp",
        "timeFormat":"'"'['"'yyyy-MM-dd HH:mm:ss XXX'"']'"'"
    }
}'

Example 7 - Specify influencers

Create a job that uses influencers. Influencers are strongly recommended but are not mandatory.

curl -X POST -H 'Content-Type: application/json' http://localhost:8080/engine/v2/jobs -d '
{
    "id": "influencer-job",
    "description": "job with influencers set",
    "analysisConfig": {
        "bucketSpan": 600,
        "influencers": [ "clientip", "country"],
        "detectors" : [
            {"function":"high_sum", "fieldName":"bytes", "overFieldName":"clientip"},
            {"function":"high_count", "overFieldName":"clientip"} ]
    },
    "dataDescription": {
        "format":"DELIMITED",
        "fieldDelimiter":",",
        "timeField":"timestamp",
        "timeFormat":"epoch"
    }
}'

Example 8 - Using model debug

Create a job that writes model debug information to Elasticsearch.

curl -X POST -H 'Content-Type: application/json' http://localhost:8080/engine/v2/jobs -d '
{
    "id": "model-debug-job",
    "description": "job that stores model data",
    "analysisConfig": {
        "bucketSpan": 600,
        "influencers": [ "errorcode"],
        "detectors" : [
            {"function":"count", "byFieldName":"errorcode"}]
    },
    "dataDescription": {
        "format":"DELIMITED",
        "fieldDelimiter":",",
        "timeField":"timestamp",
        "timeFormat":"epoch"
    },
    "modelDebugConfig": {
        "boundsPercentile": 95.0,
        "writeTo" : "data_store"
    }
}'

Create a job that writes model debug information to a file for the specified performance metrics.

curl -X POST -H 'Content-Type: application/json' http://localhost:8080/engine/v2/jobs -d '
{
    "id": "model-debug-job",
    "description": "job that stores model data",
    "analysisConfig": {
        "bucketSpan": 600,
        "influencers": [ "hostname"],
        "detectors" : [
            {"function":"mean", "fieldName":"metricvalue", "byFieldName":"metricname"}]
    },
    "dataDescription": {
        "format":"DELIMITED",
        "fieldDelimiter":",",
        "timeField":"timestamp",
        "timeFormat":"epoch"
    },
    "modelDebugConfig": {
        "terms": "CPU,NetworkIn,DiskWrites",
        "boundsPercentile": 95.0,
        "writeTo" : "file"
    }
}'