Jobs

The Jobs View displays all known Anomaly Search Jobs and their respective statuses, along with the possible actions that can be taken on them.

Jobs (Anomaly Search Jobs)

Screen 1: Anomaly Search Jobs

Job Summary

Each job in the list is summarized in column format. The columns display the name of the search (job), its description, the number of records processed, its memory status, the job status, the scheduler status, the latest timestamp of the search and, finally, the actions available for that job.

Jobs summary columns

Screen 2: Jobs summary columns

Job Actions

The following per-job actions are available: start/stop the job, display the job's results, open the Anomaly Explorer for the job, edit the job, clone the job and delete the job.

Per job actions

In addition to the per-job actions, there is a Create new job action that enables the user to create a new job via the create job wizard.

Create new job action

Job Details

By expanding the row, all job details can be displayed.

Jobs details

Screen 3: Jobs Details

Creating a New Job

To create a new job, the user clicks the Create new job button within the Jobs page of the Prelert application.

Important

All job configuration properties must be provided when the job is created. Only a small number of configuration options, such as the Job Description, can be changed afterwards. Please make sure you have all details available before you begin.

Jobs >> Create new job

Screen 4: Jobs >> Create new job

The window in Screen 4 appears when the “Create new job” button is clicked. Within this window the user can create jobs that process data from three different sources: an Elasticsearch server, a flat file, or no data specified at this time (the user may upload data later).

Create a new job using Elasticsearch data

Selecting Elasticsearch Server from the list in Screen 4 above takes the user to a page for creating a new job that runs against Elasticsearch data. This page is shown in Screen 5 below.

Create new job wizard

Screen 5: Jobs >> Create new job using Elasticsearch

The Create new job with Elasticsearch wizard prompts for the information needed to create a new job. The first step is to provide the URL of the Elasticsearch server along with its port number (see Screen 5).

Optionally, login information for authenticating to the Elasticsearch instance can be provided by selecting the Authentication checkbox and entering the username and password.

Once the connection (and optional authentication) is provided, you have the option to discover indexes automatically or enter an index or index pattern manually.

To specify an index pattern, select the Input Index checkbox and then manually type in the index pattern, e.g. logstash-*. Screen 5 demonstrates the automatic discovery option in use.

Once the user selects an index, the window will expand to allow selection of the Type and time field within the data. Both fields need to be completed before selecting ‘Next’. Screen 6 demonstrates selection of the two fields.

Jobs >> Create new job >> Elasticsearch server: Server specified and indexes chosen

Screen 6: Jobs >> Create new job >> Elasticsearch server: Server specified and indexes chosen

Once the user enters values for the Type and Time-field name fields, they can select Next to be taken to the Create a new Job window.

Create a new job using file data

Selecting File upload from the list in Screen 4 above takes the user through the steps for creating a new job that runs against data uploaded from a file. This is shown in Screen 7 below.

Jobs >> Create new job >> File Upload

Screen 7: Jobs >> Create new job >> File Upload

Once a file is selected, a preview of the data is presented to the user.

Under the preview window, the user can define the format of the file with options for delimited, JSON and single line.

The system makes its best effort to suggest values that work for the data file; however, please make sure that the delimiter, time field and time format are correct, otherwise the job will fail.

Select Next to be taken to the Create a new Job window.

Warning

A file size limit exists when uploading a file using the browser. For Firefox and Internet Explorer this is 200MB, for Chrome 100MB; this is the uncompressed size. An unspecified error or a timeout will occur if the file size exceeds this, or if the time taken to analyze the data exceeds 10 minutes. For large files or long-running analysis, we recommend using the API Data Endpoint to upload data.

Create a Job for another data source

If you wish to analyze data from other data sources, then select “Other data source”. This will create an empty job, to which you can upload data later using the Engine API Data Endpoint. This could be done using “cURL” or by writing a Connector.

Please note that you will need to know the format of the data to be analyzed when creating this job. Because the system cannot preview any sample data, the data format is unknown and the job configuration must be entered manually.

Create a New Job window

The Create a new Job window is used to build out the details of the job, based on the kind of job that was started from the Create Job window. Whether creating a job from Elasticsearch data, from a file upload, or without supplying data, each approach needs to define the details of the job analysis and data processing; the Create a new Job window is the common intersection point for all job creation. In addition, the job listing view has a Clone Job action that enables any value of the cloned job to be changed.

Regardless of how the user arrives, the same fields must be established for a job. To do this, the Create a new Job window has several tabs that need to be filled in: Job Details, Transforms, Analysis Configuration, Data Description and Scheduler. For convenience, there is also an Edit JSON view that allows all settings to be set or changed in JSON notation.

Job Details

The Job Details configuration tab (as depicted in Screen 8 below) has the following properties, which make it easier to manage your jobs:

  • id: A unique identifier for the job. It may be specified with a “friendly” name. It should be lowercase and only contain alphanumeric characters, underscores or hyphens.
  • description: A brief description of the job.
  • Custom URLs: A means to attach external links to the analysis that is performed. Each custom URL has two fields: the label and the URL itself. By defining a custom URL, the user has a way to drill down from the results set to the source data, which is very important for understanding results. See below for full details on configuring custom URLs.

Job definition >> Job Details

Screen 8: Job definition >> Job Details
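
For illustration, the identifying properties above might look like this in the job's JSON representation (see the Edit JSON tab described later). This is a minimal sketch: the job id shown is a hypothetical example that follows the naming rules above, and the key names are assumptions based on the field labels.

    {
      "id": "it_ops_dns",
      "description": "DNS tunneling activity"
    }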

Configuring custom URLs

Custom URLs provide links from the “Anomalies” table in the Explorer window to custom dashboards or external websites, allowing the end user, for example, to drill into the source data at the time of an anomaly.

Multiple custom URLs can be defined in the Job Details tab at job creation and are stored in the job configuration. Custom URLs can also be created or edited for an existing job. For each custom URL, two properties are configured: a label which is used as the text in the links menu for an anomaly in the Explorer window, and the URL of the link itself.

String substitution in custom URLs

Dollar sign ($) delimited tokens can be used in a custom URL which will then be substituted for the values of the corresponding fields from the anomaly records stored in Elasticsearch. For example, for a configured URL of http://cassandra.datastore.com/dashboards?user=$user_name$, the value of the user_name field from the anomaly record will be substituted into the $user_name$ token when clicking on the link in the Explorer “Anomalies” table.
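
As a sketch, the custom URL from the example above might be stored in the job configuration along the following lines. The customUrls, urlName and urlValue key names are assumptions for this illustration and may differ in your version; the label text is hypothetical.

    {
      "customUrls": [
        {
          "urlName": "User dashboard",
          "urlValue": "http://cassandra.datastore.com/dashboards?user=$user_name$"
        }
      ]
    }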

Four keywords can be used as tokens for String substitution in a custom URL which play a special role when the link is opened: $earliest$, $latest$, $prelertcategoryterms$ and $prelertcategoryregex$.

$earliest$ and $latest$ in custom URLs

$earliest$ and $latest$ tokens are used to pass the time span of the selected anomaly to the target page. The tokens will be substituted with date-time Strings in ISO-8601 format, e.g. 2016-02-08T00:00:00.000Z, as used by Kibana for example when displaying times in dashboards.

When clicking on the custom URL from the Explorer window, if the data in the “Anomalies” table is aggregated by hour, then one hour either side of the anomaly time will be used for the earliest and latest times. If aggregated by day, then the start and end times of that day will be used, i.e. from 00:00:00.000 to 23:59:59.999. If the Anomalies table is set to “Show All” with no aggregation of anomalies, then the start and end times of the anomaly bucket will be used.

$prelertcategoryterms$ and $prelertcategoryregex$ in custom URLs

For jobs which are analyzing data based on the categorization of values, $prelertcategoryterms$ and $prelertcategoryregex$ can be used to pass on details of the category definition for the selected anomaly to the target page. If present in the custom URL, the tokens will be replaced with the category definition terms or regex for the category ID of the selected anomaly, using the value of the prelertcategory field from the anomaly record stored in Elasticsearch. For example, the following custom URL uses a $prelertcategoryterms$ token to open up a dashboard in a user’s Kibana installation to display source data stored in an it_ops_app index for a job using a categorization detector:

http://localhost:5601/app/kibana#/discover?_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:'$earliest$',mode:quick,to:'$latest$'))&_a=(columns:!(_source),index:it_ops_app,interval:auto,query:(query_string:(analyze_wildcard:!t,query:'$prelertcategoryterms$')),sort:!('@timestamp',desc))

When substituting $prelertcategoryterms$, each of the terms are prefixed with a + character to ensure that the Elasticsearch Query String query run in a drilldown Kibana dashboard searches for all of the terms. Therefore if drilling into a non-Kibana URL, the target page should reformat the terms value to remove the + characters as necessary.

Notes on configuring custom URLs

Please be aware of the following points when configuring custom URLs for a job:

  • When creating a link to a Kibana dashboard, note that the URLs of dashboards can be very long so be careful of typos, end of line characters and URL encoding.
  • If an influencer name is used for string substitution e.g. $clientip$, it may not always be available in the results records. The link will still take you to the dashboard, however the query will remain as $clientip$ and will need to be manually corrected.
  • Dates substituted for $earliest$ and $latest$ tokens will be in ISO-8601 format and the target system needs to understand this.
  • If the job performed an analysis against nested JSON fields, the tokens for String substitution may refer to these fields using dot notation e.g. $cpu.total$.
  • Elasticsearch source data mappings may make it difficult for the query string to work. Test the custom URL before saving the job configuration to check it will work as expected, particularly when using String substitution.

Transforms

The Transforms tab (as depicted below in Screen 9) is where parameters are captured that specify how to pre-process your data before it is analyzed. Transforms are executed on the input data prior to it being analyzed by the configured detectors. This tab has a single action: Add Transform.

Transforms can either shape data into the format required by the Engine or extract new fields for use in the analysis. Transforms process one record of the data at a time; the inputs are either one or more of the existing fields in the record or the output of another transform. Chaining transforms together, where the output of one serves as the input to another, is a powerful and effective way to build complex transformations capable of deriving new fields or reformatting the input records.

Some transforms require an initial argument when they are defined. For example split, which splits a string into sub-strings, must be defined with a regular expression around which the split is done. These arguments are set in the arguments field. Each transform has a list of default output names which may be overridden by setting the outputs field.

Job definition >> Transforms

Screen 9: Job definition >> Transforms

Clicking Add Transform will open the dialog as shown in Screen 10 for configuring a new transform. Each transform type has a different range of inputs and outputs. If no labels are specified for the outputs, default labels will be used.

Job definition >> Transforms >> Add New Transform >> domain\_split function

Screen 10: Job definition >> Transforms >> Add New Transform >> domain_split function

Any of the transforms listed below can be used in your job configurations.

  • concat: Concatenate the input fields. Arguments: a delimiter (optional).
  • domain_split: Split a domain name into the highest registered domain and sub-domain. Arguments: none.
  • exclude: Conditionally exclude records from the analysis if the condition evaluates to true. Arguments: a condition.
  • extract: Extract new fields from the given regular expression's capture groups. Arguments: a regular expression with capture groups.
  • geo_unhash: Convert a Geohash value to latitude, longitude. Arguments: none.
  • lowercase: Convert the input field to lower case. Arguments: none.
  • split: Split the field around matches of the given regular expression. Arguments: a regular expression.
  • trim: Remove leading and trailing whitespace from the input field. Arguments: none.
  • uppercase: Convert the input field to upper case. Arguments: none.

Each Transform function is described in detail in our API documentation that is located here.
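
To illustrate how transforms are declared, a minimal sketch of a transforms section is shown below. The transforms key and the per-transform field names (transform, inputs, arguments, outputs) are assumptions matching the concepts described above; the input field names (host, request) and the overridden output names are hypothetical. The domain_split transform uses its default output names, while the split transform is given a regular expression argument and custom outputs. Refer to the API documentation for the exact syntax.

    {
      "transforms": [
        {
          "transform": "domain_split",
          "inputs": [ "host" ]
        },
        {
          "transform": "split",
          "inputs": [ "request" ],
          "arguments": [ "\\?" ],
          "outputs": [ "uri_path", "uri_query" ]
        }
      ]
    }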

Analysis Configuration

The Analysis Configuration tab (as depicted in Screen 11) is where parameters are captured that specify how the data should be analyzed. This tab has the following properties: bucketSpan, summaryCountFieldName, categorizationFieldName, categorizationFilters, detectors and influencers. Each is described below.

Job definition >> Analysis Configuration

Screen 11: Job definition >> Analysis Configuration

  • bucketSpan: The size of the interval the analysis is aggregated into, measured in seconds, with a default of 300 seconds (5 minutes).
  • summaryCountFieldName: If not null, the input to the job is expected to be pre-summarized, and this is the name of the field in which the count of raw data points that have been summarized must be provided. Cannot be used with the metric function. The same summaryCountFieldName applies to all detectors. See Summarization of Input Data for full details.
  • categorizationFieldName: If not null, the values of the specified field will be categorized. The resulting categories can be used in a detector by setting any of byFieldName, overFieldName or partitionFieldName to the keyword prelertcategory. See Categorization for full details.
  • categorizationFilters: When categorizationFieldName is configured, filters can be added in order to exclude text that should have no impact on the categorization. The filters are expected to be regular expressions. See Categorization for full details.
  • detectors: Configuration for the anomaly detectors to be used in the job. The list should contain at least one configured detector.
  • influencers: A list of influencer field names. Typically these are the by/over/partition fields used in the detector configuration. You may also wish to use a field name that is not specifically named in a detector but is available as part of the input data. Use of influencers is not mandatory; however, it is strongly recommended as it enables aggregation of results from multiple detectors to a single entity. See Best practices for selecting Influencers for more details.

Job definition >> Analysis Configuration >> bucketSpan interval selection menu

Screen 12: Job definition >> Analysis Configuration >> bucketSpan interval selection menu
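
Pulling the properties above together, an Analysis Configuration section might look roughly like the following in JSON. The analysisConfig key name and nesting are assumptions for this sketch; the field names (status, clientip, host) are hypothetical, and detector configuration is covered in more detail in the next section.

    {
      "analysisConfig": {
        "bucketSpan": 300,
        "detectors": [
          {
            "function": "count",
            "byFieldName": "status",
            "partitionFieldName": "host"
          }
        ],
        "influencers": [ "clientip", "status", "host" ]
      }
    }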

Add Detectors

Each job requires at least one detector definition, which indicates to Prelert what the search will do. A detector contains an analytical function and an indication of which fields to operate on and how. Detectors are edited one at a time in the Edit Detector popup, as depicted in Screen 13 below. The properties in this dialog are:

Job definition >> Analysis Configuration >> Edit Detector

Screen 13: Job definition >> Analysis Configuration >> Edit Detector

  • Description: A user-friendly description for the detector, which is displayed with the results. E.g. DNS tunneling activity.
  • Function: The analysis function to be used. Examples are count, rare, mean, min, max and sum. See Analytical Functions.
  • Fieldname: The field to be analyzed as a metric. If using an event rate function such as count or rare, this should not be specified. fieldName cannot contain any of the following characters: [ ] ( ) = -. The field should be renamed to avoid using these characters.
  • byFieldName: The field used to split the data, analyzing each split with respect to its own history. Used for finding unusual behavior of an entity compared to its past behavior.
  • overFieldName: The field used to split the data, analyzing each split with respect to the history of all splits. This is used for finding unusual values in the population of all splits.
  • partitionFieldName: Segment the analysis along this field to have completely independent baselines for each value of this field.
  • excludeFrequent: May contain true, over or by. If set, frequent entities will be excluded from influencing the anomaly results. Entities may be considered frequent over time or frequent in a population. If working with both over and by fields, then excludeFrequent may be set to true for all fields, or specifically for the over or the by fields.

Important

Field names are case-sensitive; for example, the field name CPU is different from the field name cpu.
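
As a sketch, a single detector corresponding to the dialog fields above might be expressed as follows in the job's JSON. The key names (in particular detectorDescription) are assumptions based on the field labels, and the field values (bytes, clientip, host) are hypothetical; this example sums bytes per client across the population, partitioned by host, and excludes frequent over-field entities.

    {
      "detectorDescription": "Unusually high bytes sent by a client",
      "function": "sum",
      "fieldName": "bytes",
      "overFieldName": "clientip",
      "partitionFieldName": "host",
      "excludeFrequent": "over"
    }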

Data Description

The Data Description tab is where the structure of the data that will be processed by this job is defined. On creating a new job, the default behavior of the Prelert for Elastic Stack product is to accept data in Elasticsearch format, expecting an epoch time value in a field named time. The time field must be measured in seconds from the epoch.

The Data Description will have been populated with the values selected during the Create Job process. You can override these here if needed.

Important: The date format must be correct with respect to your input data. If not, the data analysis will fail.

The Data Description tab (as depicted in screen 14 below) allows setting the following properties:

Job definition >> Data Description

Screen 14: Job definition >> Data Description

  • Data Format: Either ELASTICSEARCH, DELIMITED, JSON or SINGLE_LINE.
  • Time Field: The name of the field containing the timestamp.
  • Time Format: The format of the date field. This can be epoch, epoch_ms, or a custom date-time pattern supplied as a Java DateTimeFormatter string, e.g. yyyy-MM-dd'T'HH:mm:ssX. (Note this is a special example which requires escape characters around the 'T'.)
  • Field Delimiter: If the data is in a delimited format with a header, e.g. csv, this is the character separating the data fields. This property is only applicable if the format is set to DELIMITED.
  • Quote Character: Delimited formats can be quoted to escape fields containing the field delimiter character. This property is only applicable if the format is set to DELIMITED.
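
For example, a Data Description for a comma-delimited file with an ISO-8601-style timestamp might look roughly like this in JSON. The dataDescription key name and nesting are assumptions for this sketch; the format value and the time format pattern come from the descriptions above, and the time field name is hypothetical.

    {
      "dataDescription": {
        "format": "DELIMITED",
        "timeField": "timestamp",
        "timeFormat": "yyyy-MM-dd'T'HH:mm:ssX",
        "fieldDelimiter": ",",
        "quoteCharacter": "\""
      }
    }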

Scheduler

For jobs that are configured to analyze Elasticsearch data, the Scheduler must be configured. There are several fields as part of scheduling (as depicted in Screen 15 below) that the user can or should set. Each field is described below.

Job definition >> Scheduler

Screen 15: Job definition >> Scheduler

  • Data Source: The source of the data to search for the scheduled job. The default is Elasticsearch.
  • Query: The query string to use to pull data from the data source. This must be set. See the section below for tips on configuring the query to use when analyzing data in Elasticsearch.
  • Query Delay (seconds): The delay before the search is executed. If not set, the default is 60 seconds.
  • Frequency (seconds): How often the search should run. This should be at least as fast as the bucket span setting. If not set, the default is 10 minutes.
  • Scroll Size: The number of documents to retrieve from Elasticsearch per scroll. Defaults to 1000. This setting should only be modified on the advice of a Prelert engineer.
  • Elasticsearch Server address: The address of the server on which to perform the Elasticsearch query.
  • Authenticated: Perform HTTP Basic Authentication at the server address URL. When checked, a Username and Password should be supplied.
  • Username: The username to use with HTTP Basic Authentication when querying for data. If not specified, no HTTP Basic Authentication will be sent with the request.
  • Password: The password corresponding to the username. May only be specified if a username is also specified. In the stored job configuration this field will be replaced by one called encryptedPassword.
  • Indexes: The name of the index or index pattern to pull search results from.
  • Types: The _type value(s) as defined in the index.
  • Time-field name: The time field, also defined in the Data Description tab.
  • Time format: The format of the time field, also defined in the Data Description tab.
  • Retrieve whole _source document: May be required, depending upon Elasticsearch data mappings. Check the Data Preview tab to see if the fields required for analysis are present.

For more information about how to configure a job that queries data from Elasticsearch, see here.
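
Putting the scheduler settings above together, a sketch of a scheduler section in JSON might look like the following. The key names (for example schedulerConfig and baseUrl) and the nesting are assumptions for this illustration; the query shown is a standard Elasticsearch match_all query, and the server, index and type values are hypothetical. The frequency of 150 seconds runs the search at least once per 300-second bucket span, as recommended above.

    {
      "schedulerConfig": {
        "dataSource": "ELASTICSEARCH",
        "baseUrl": "http://localhost:9200",
        "indexes": [ "logstash-*" ],
        "types": [ "logs" ],
        "query": { "match_all": {} },
        "queryDelay": 60,
        "frequency": 150,
        "scrollSize": 1000
      }
    }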

Edit JSON

The Edit JSON tab allows any job setting (job details, analysis configuration, data description or scheduler) to be edited using a simple text editor window. As you would expect, when this JSON changes, the change is reflected in the other tabs, and when data in the other tabs changes, the JSON view is updated accordingly. This window may go away or be moved into an advanced tab in the future.

This tab can be used to configure advanced job configuration options, which are not necessarily available in the UI.

Job definition >> Edit JSON

Screen 16: Job definition >> Edit JSON

Data Preview

To see a sample set of the data to be processed by the job, click the Data Preview tab. In this tab is a simple JSON data viewer containing a few records from the configured index based on the Query string defined in the Scheduler tab. Screen 17 below depicts the data preview tab.

Job definition >> Data Preview

Screen 17: Job definition >> Data Preview

Edit an Existing Job

The Edit Job feature is available to make minor changes to basic elements of a job definition that would not alter the data sets or the processing. To this end, only the descriptions for the anomaly job and its detector(s), and the custom URLs defined, can be altered. See Screen 18 below for further details.

Edit Job

Screen 18: Edit Job

Cloning an Existing Job

The Clone Job feature is the principal means of creating a new job from an existing one while being able to edit its contents. Within the context of the Behavioral Analytics for Elastic Stack product, there is no notion of changing an anomaly search job once it is defined, with the exception of the fields mentioned in Edit an Existing Job.

Within the Clone Job UI, the user can set the Job Details, define the Analysis Configuration, define the Data Description, set the scheduling and also edit the JSON representation of the job. Each of these groups of settings is represented by a separate tab in the interface (see Screen 8 above) and operates in the same way as creating a new job, as described in the previous sections.

Starting a Stopped Job

When a job is created, it ultimately needs to be started in order to process data within Elasticsearch. To this end, on saving a new job you are asked whether the job should be started. If you opt to start the job, the start scheduler window in Screen 19 is displayed. A job that is no longer running (it was stopped or reached the end of the data for processing) can be restarted by clicking the start button in the Actions list for the job on the job listing page. Once start is clicked, the window in Screen 19 is displayed as well.

Start job

Screen 19: Start Job

  • Search Start Time: The time from which to start the search job. This can be now, the start time of the data (the first elements), or a custom time defined using the calendar and time entry provided.
  • Search End Time: The time at which to end the search. There are two choices: real-time mode (continue as new data arrives) or a specified end date and time. If not set, the default is real-time.

Warning

When the Scheduler is re-started, it will continue processing input data from the next millisecond after it was stopped. If your data contains records with the same timestamp (for example, data summarized by minute), then data loss is possible for the timestamp value at which the Scheduler was stopped, as the Scheduler may not have completely processed all data for that millisecond.