extract

Extract new fields or transform inputs with regular expression capture groups. The transform accepts a single input and can create multiple outputs. One argument is required when the transform is defined: the regular expression used to extract the new fields.

  • single input
  • multiple outputs
  • default output fieldname: “extract”
  • argument is a regular expression

The Engine API uses the Java regular expression implementation. For more information about features and compatibility see http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

Example 1: Extracting a value

Our Internet of Things smart thermostat reports the temperature in the unusual format seen below:

Data=SensorId 0002 Temperature 22.3

We want to extract the 4-digit SensorId value and the Temperature value from this record and write them to the new fields sensor_id and temperature. A suitable regular expression is:

Data=SensorId ([0-9]{4}) Temperature ([0-9]+\.[0-9]+)
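The way the two capture groups map onto the two output fields can be sketched outside the Engine. The snippet below is only an illustration: it uses Python's re module as a stand-in for the Java regular expression implementation (the two behave the same for this simple pattern), and the extract helper is a hypothetical function, not part of the Engine API. Whether the Engine requires the expression to match the whole input or only part of it is not stated here; the sketch assumes a partial match is enough.

```python
import re

# Regular expression from the example above, with one capture group per output field.
PATTERN = r"Data=SensorId ([0-9]{4}) Temperature ([0-9]+\.[0-9]+)"


def extract(record, output_names):
    """Hypothetical helper: pair each capture group with an output field name.

    Returns None when the record does not match the pattern.
    """
    match = re.search(PATTERN, record)
    if match is None:
        return None
    return dict(zip(output_names, match.groups()))


fields = extract("Data=SensorId 0002 Temperature 22.3", ["sensor_id", "temperature"])
print(fields)  # {'sensor_id': '0002', 'temperature': '22.3'}
```

The first group captures exactly four digits and the second captures a decimal number, so a record with a shorter sensor id does not match at all.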

If the SensorId and Temperature data is in a field called compound_data, the transform configuration would look like this:

"transforms":[
    {
        "transform":"extract",
        "arguments":["Data=SensorId ([0-9]{4}) Temperature ([0-9]+\\.[0-9]+)"],
        "inputs":["compound_data"],
        "outputs":["sensor_id", "temperature"]
    }
]

The JSON format requires that the backslash character (\) be escaped, which is why \. in the regular expression is written as \\. in the arguments string.
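One way to check the escaping is to parse the configuration with a JSON parser and look at the resulting argument. The following is a minimal sketch using Python's json module; the outer braces are added only so that the fragment parses as a complete JSON object, and Python's re module again stands in for the Java implementation.

```python
import json
import re

# The transform configuration from above, wrapped in braces so it is a
# complete JSON document. The raw string keeps the doubled backslash intact.
config_text = r'''
{
    "transforms":[
        {
            "transform":"extract",
            "arguments":["Data=SensorId ([0-9]{4}) Temperature ([0-9]+\\.[0-9]+)"],
            "inputs":["compound_data"],
            "outputs":["sensor_id", "temperature"]
        }
    ]
}
'''

config = json.loads(config_text)
pattern = config["transforms"][0]["arguments"][0]

# After JSON decoding, the doubled backslash becomes a single backslash,
# so the pattern is the regular expression given earlier.
print(pattern)

# The decoded pattern still extracts both values from the sample record.
values = re.search(pattern, "Data=SensorId 0002 Temperature 22.3").groups()
print(values)  # ('0002', '22.3')
```

If the backslash were left unescaped in the JSON text, the sequence \. would be an invalid JSON escape and a strict parser would reject the configuration.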

The output fields sensor_id and temperature can then be chosen as analysis fields in a detector configuration, for example to analyze temperature by sensor_id:

"detectors": [
    {
        "function":"metric",
        "fieldName":"temperature",
        "byFieldName":"sensor_id"
    }
]

Example 2: Unstructured data source

In this example, the data source is a log file as follows:

2015-04-30 10:00:00 GMT INFO Server started on localhost
2015-04-30 10:00:01 GMT INFO Connecting to database at http://localhost:8888/myDatabase/
2015-04-30 10:04:15 GMT DEBUG Created file foo.txt
2015-04-30 10:05:24 GMT DEBUG Created file bar.txt
...

Here, the data format SINGLE_LINE must be used in combination with a transform that will extract the ‘time’ and ‘message’ fields from the raw lines in the file.

This log file does not contain a header, so the special keyword raw is used to describe the input. The transform would look like this:

"transforms" : [
    {
        "transform" : "extract",
        "arguments" : ["(.{23}) (.*)"],
        "inputs" : ["raw"],
        "outputs" : ["time", "message"]
    }
]
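The effect of the expression (.{23}) (.*) on the sample log lines can be sketched as follows. This is an illustration only, using Python's re module in place of the Java implementation and assuming that a match anchored at the start of the line is what the transform applies. The first group takes the first 23 characters, which is exactly the width of a timestamp such as 2015-04-30 10:00:00 GMT, and the second group takes everything after the following space.

```python
import re

# Fixed-width split: 23 characters of timestamp, one space, then the rest.
LOG_PATTERN = re.compile(r"(.{23}) (.*)")

lines = [
    "2015-04-30 10:00:00 GMT INFO Server started on localhost",
    "2015-04-30 10:00:01 GMT INFO Connecting to database at http://localhost:8888/myDatabase/",
    "2015-04-30 10:04:15 GMT DEBUG Created file foo.txt",
    "2015-04-30 10:05:24 GMT DEBUG Created file bar.txt",
]

rows = []
for line in lines:
    match = LOG_PATTERN.match(line)
    if match is not None:
        # Group 1 becomes the time field, group 2 the message field.
        rows.append(match.groups())

for time_value, message in rows:
    print(time_value, "|", message)
```

Because the split relies on a fixed width, it only works while every line starts with a timestamp of exactly 23 characters; a log with variable-width timestamps would need a more specific expression.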

The output fields time and message can now be used for analysis:

"detectors" : [
    {
        "function" : "count",
        "byFieldName" : "prelertcategory"
    }
]