Categorization

It is often useful to analyze records based not on their actual values but on a categorization of those values. A typical example is the analysis of log files: these contain different types of messages, and each type may contain messages that are very similar but not identical.

In such a case, an analysis that detects anomalous rates for a certain category of messages, or detects rare categories, could reveal useful insight into a system's operation. The categorization feature automatically derives the categories and enables this analysis.

Categorization works best on messages that contain human-readable sentences, e.g. ERROR: Component 123 failed to connect to service.com with status code: 3546 error: timeout.

Creating an analysis using categorization

The Engine API allows a single field to be categorized per job. When creating a job, use categorizationFieldName in the analysisConfig to set the name of the field to be categorized. Then use the reserved value prelertcategory as the byFieldName, overFieldName or partitionFieldName of a detector.
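
For instance, the relevant fragment of a job configuration might look as follows (a minimal sketch; the field name message is only an illustration, and the complete configurations in the Examples section below show the surrounding settings):

{
    "analysisConfig" : {
        "categorizationFieldName" : "message",
        "detectors" : [{"function" : "count", "byFieldName" : "prelertcategory"}]
    }
}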

Viewing category definitions

As data is categorized, the category definitions are stored and can be accessed via the category definitions endpoint:

$ curl 'http://localhost:8080/engine/v2/results/<jobId>/categorydefinitions/<categoryId>'

An example of a category definition follows:

{
    "categoryId" : "3",
    "terms" : "User has logged in host",
    "regex" : ".*?User.+?has.+?logged.+?in.+?host.*",
    "maxMatchingLength" : 44,
    "examples" : ["User peter has logged in host 127.0.0.1",
                  "User parker has logged in host 127.0.0.2"]
}
categoryId: a unique identifier for each category; it is returned in the anomaly results.
terms: a space-separated list of the common tokens in the category.
regex: a regular expression that matches values of this category.
maxMatchingLength: the maximum length of the fields matched by the category (increased by 10% to allow matching similar fields that have not been analyzed).
examples: a few examples of distinct actual values from the input data.

The terms and regex fields contain generic search terms and, in combination with maxMatchingLength, are intended for use when searching the source data; the exact usage depends on the data store.
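
For instance, if the source data is held in plain log files, the regex of a category could be used directly with a tool such as grep. A hypothetical sketch (the file name server.log is only an example, and the lazy quantifiers in the regex require PCRE support):

$ # find source lines that belong to the category shown above
$ grep -P '.*?User.+?has.+?logged.+?in.+?host.*' server.log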

Controlling the number of examples stored for each category

By default, a maximum of four distinct examples are stored per category. These examples should be enough to communicate the types of values that were detected under each category. However, it is possible to configure this parameter to store more or fewer examples.

Before showing how to achieve this, it is important to explain the trade-off involved in adjusting the number of examples stored per category. Categorization examples are stored in memory, so the more examples stored per category, the more memory is required. Memory usage also increases with the number of categories.

Storing many examples when there are only a handful of categories requires less memory than storing many examples for many categories. It is recommended not to set this parameter to a high value without ensuring that enough resources are available. It is also possible to set the parameter to zero in order to switch off storing examples completely.

To specify a different number of stored examples per category, set the categorizationExamplesLimit field in the job configuration. See Limits on the size of the analysis (analysisLimits).
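
For example, the following fragment of a job configuration would store up to eight examples per category (a sketch assuming categorizationExamplesLimit sits inside the analysisLimits object, as the cross-reference above suggests):

{
    "analysisLimits" : {
        "categorizationExamplesLimit" : 8
    }
}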

Categorization filtering

If too many categories are being created, a possible cause is text segments that are unnecessarily being categorized. For example, consider analyzing log statements that contain various messages followed by SQL statements. In this instance, analysis of the message (not the SQL statement) is more useful.

To fine-tune categorization, filtering is supported. A list of regular expressions can be configured in the categorizationFilters field of analysisConfig in the job configuration. The regular expressions are applied and any matching string is ignored by the categorization algorithms. Each filter is applied in order, until no further matches are found.
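
As a purely illustrative sketch of the effect of a single filter, the following uses sed outside the engine to mimic the removal of a matching substring before categorization (the filter and log line are taken from Example 3 below):

$ echo 'ERROR Query failed due to missing table [SQL query: SELECT * FROM UNKNOWN]' \
    | sed -E 's/\[SQL.*\]//'
ERROR Query failed due to missing table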

Using categorization filters is preferable to using transforms, as performance is better and the filters can be updated after the job has been created. In addition, the field being categorized remains unchanged, so the properties of a categoryDefinition that relate to the full categorized text are not affected by the filtering. In particular, maxMatchingLength and examples are reported based on the full text of the categorized field.

Warning

Categorization filters apply regular expressions to the input data, which is extra work for the analytics to do. Some regular expressions perform better than others, so it is always important to check run time with and without the filters. If significant increases in run time are observed, consider using fewer filters or less resource-intensive regular expressions.

Examples

This section contains three examples of applying categorization:

  • Categorizing semi-structured data, where distinct time and message fields have been defined.
  • Categorizing unstructured data like a log file; useful when uploading a file to the Engine API.
  • Using categorization with filtering.

Example 1: Structured data source

Given a data source as follows:

time,message
2015-04-30 10:00:00 GMT,INFO Server started on localhost
2015-04-30 10:00:01 GMT,INFO Connecting to database at http://localhost:8888/myDatabase/
2015-04-30 10:04:15 GMT,DEBUG Created file foo.txt
2015-04-30 10:05:24 GMT,DEBUG Created file bar.txt
...

The following configuration categorizes the field ‘message’ and performs a count by category analysis:

{
    "id" : "categorization-job",
    "description" : "Count by category analysis",
    "analysisConfig" : {
        "bucketSpan" : 3600,
        "categorizationFieldName" : "message",
        "detectors" : [{"function" : "count", "byFieldName" : "prelertcategory"}]
    },
    "dataDescription" : {
        "fieldDelimiter" : ",",
        "timeField" : "time",
        "timeFormat" : "yyyy-MM-dd HH:mm:ss z"
    }
}
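
Assuming this configuration has been saved to a file named categorization-job.json (a hypothetical file name), the job could be created with a request along the following lines; the jobs endpoint shown here is an assumption based on the base path used by the results endpoint above:

$ curl -X POST -H 'Content-Type: application/json' \
    'http://localhost:8080/engine/v2/jobs' \
    --data-binary @categorization-job.json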

Example 2: Unstructured data source

In this example, the data source is a log file as follows:

2015-04-30 10:00:00 GMT INFO Server started on localhost
2015-04-30 10:00:01 GMT INFO Connecting to database at http://localhost:8888/myDatabase/
2015-04-30 10:04:15 GMT DEBUG Created file foo.txt
2015-04-30 10:05:24 GMT DEBUG Created file bar.txt
...

In cases like this, none of the structured data formats (e.g. DELIMITED, JSON) can be used. However, the data format SINGLE_LINE can be used in combination with a suitable transform that will extract the ‘time’ and ‘message’ fields from the raw lines in the file.

The following configuration achieves the same result as in Example 1: Structured data source:

{
    "id" : "categorization-job",
    "description" : "Transform log file into structured and perform count by category analysis",
    "transforms" : [{
        "transform" : "extract",
        "arguments" : ["(.{23}) (.*)"],
        "inputs" : ["raw"],
        "outputs" : ["time", "message"]
    }],
    "analysisConfig" : {
        "bucketSpan" : 3600,
        "categorizationFieldName" : "message",
        "detectors" : [
            {"function" : "high_count", "byFieldName" : "prelertcategory"},
            {"function" : "rare", "byFieldName" : "prelertcategory"}
        ] },
    "dataDescription" : {
        "format" : "SINGLE_LINE",
        "fieldDelimiter" : ",",
        "timeField" : "time",
        "timeFormat" : "yyyy-MM-dd HH:mm:ss z"
    }
}

In order to extract the fields ‘time’ and ‘message’ we make use of the extract transform. The transform applies a regular expression and assigns the matched groups to the specified outputs. We need to match the timestamp and the actual message. Counting the characters of the timestamp, one concludes that the first 23 characters in each line contain the timestamp. Therefore, we employ a regular expression that assigns the first 23 characters to the first output, ignores the space between the timestamp and the message, and assigns the rest of the line to the second output.
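
The pattern can be checked against a single raw line outside the engine, for example with sed (purely illustrative):

$ echo '2015-04-30 10:00:00 GMT INFO Server started on localhost' \
    | sed -E 's/(.{23}) (.*)/time="\1" message="\2"/'
time="2015-04-30 10:00:00 GMT" message="INFO Server started on localhost"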

Note that the transform requires an input. Since there is no structure to provide a title for the contents of the file, the keyword raw can be used to refer to the entire content of each line.

Also observe that the rest of the configuration remains the same: after the transform we obtain the fields ‘time’ and ‘message’, which are used elsewhere in the configuration. Remember that the term prelertcategory refers to the categoryId that results from the categorization of the field ‘message’.
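
Once such a job has been created, the log file itself can be uploaded for analysis. A sketch, assuming the file is named server.log and that the streaming data endpoint follows the same base path as the results endpoint shown earlier (both the file name and the endpoint are assumptions, not part of this example):

$ curl -X POST --data-binary @server.log \
    'http://localhost:8080/engine/v2/data/<jobId>'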

Example 3: Categorization with filtering

Given input data as follows:

time,message
2015-04-30 10:00:00Z,DEBUG Query successfully returned 42 records [SQL query: SELECT * FROM USER]
2015-04-30 10:00:00Z,ERROR Query failed due to missing table [SQL query: SELECT * FROM UNKNOWN]
2015-04-30 11:00:00Z,DEBUG Query successfully returned 63 records [SQL query: SELECT * FROM GROUP]
2015-04-30 12:00:00Z,ERROR Query failed due to missing table [SQL query: SELECT * FROM ANOTHER_UNKNOWN]
...

The SQL statements contained in the square brackets will be included in categorization by default, which could result in too many non-informative categories being created. This can be improved by using categorization filters. The following configuration categorizes the field ‘message’, filters out the SQL statements, and performs a count by category analysis:

{
    "id" : "categorization-job",
    "description" : "Count by category analysis",
    "analysisConfig" : {
        "bucketSpan" : 3600,
        "categorizationFieldName" : "message",
        "categorizationFilters": ["\\[SQL.*\\]"],
        "detectors" : [{"function" : "count", "byFieldName" : "prelertcategory"}]
    },
    "dataDescription" : {
        "fieldDelimiter" : ",",
        "timeField" : "time",
        "timeFormat" : "yyyy-MM-dd HH:mm:ssXXX"
    }
}

This is similar to Example 1: Structured data source, except that categorizationFilters have now been applied. With the filters in place, categorization ignores the SQL statements within the square brackets, creating fewer and more informative category definitions.

Note that the double backslashes in \\[SQL.*\\] in the example above are required when using cURL to create a job, because the backslash must itself be escaped inside the JSON string. This is not required if using the Kibana UI (use \[SQL.*\] instead).