Transforms

Pre-process your data with ‘Transforms’, either to shape it into the format required by the Engine or to extract new fields for use in the analysis. Transforms process the data one record at a time; the inputs to a transform are one or more existing fields in the record or the outputs of another transform. Chaining transforms together, where the output of one serves as the input to another, lets you build complex transformations that derive new fields or reformat the input records. Debug or review the output of a job’s transforms with the Preview Endpoint.

Some transforms require an initial argument when they are defined; for example split, which splits a string into sub-strings, must be defined with a regular expression around which the split is done. These arguments are set in the arguments field. Each transform has a list of default output names which may be overridden by setting the outputs field.
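
To make the per-record and chaining behaviour concrete, here is a minimal local sketch in Python. It is not the Engine's implementation: the functions split, lowercase and apply_chain, the output name host_lc and the sample record are hypothetical stand-ins that only mimic the behaviour described above.

```python
import re

# Hypothetical local stand-ins for two transforms; not the Engine's own code.
def split(fields, inputs, outputs, arguments):
    """Split the input field around matches of the regex in arguments[0]."""
    parts = re.split(arguments[0], fields[inputs[0]])
    return dict(zip(outputs, parts))

def lowercase(fields, inputs, outputs, arguments=None):
    """Convert the input field to lower case."""
    return {outputs[0]: fields[inputs[0]].lower()}

TRANSFORMS = {"split": split, "lowercase": lowercase}

def apply_chain(record, chain):
    """Apply each transform in order to one record.

    Later transforms can read the outputs written by earlier ones.
    """
    fields = dict(record)
    for t in chain:
        fn = TRANSFORMS[t["transform"]]
        fields.update(fn(fields, t["inputs"], t["outputs"], t.get("arguments")))
    return fields

chain = [
    # 'arguments' carries the regular expression that split requires.
    {"transform": "split", "arguments": [":"],
     "inputs": ["host_port"], "outputs": ["host", "port"]},
    # The output 'host' of split is the input of lowercase.
    {"transform": "lowercase", "inputs": ["host"], "outputs": ["host_lc"]},
]

result = apply_chain({"host_port": "Web01.Example.com:8080"}, chain)
print(result)
```

The second transform never sees the raw host_port field; it only reads the host field produced by the first, which is the sense in which the two are chained.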

Configuration

Transforms are defined in the job configuration. Because multiple transforms can be created for a single job, the setting is an array of transforms. In the example below the job is defined with two transforms, ‘concat’ and ‘split’ (the analysisConfig field is elided for brevity). See Analysis Configuration (analysisConfig) for more details about creating jobs with transforms.

Note: arguments, inputs and outputs are all of the type Array of String but you can set the fields with a single value that the API will automatically convert into an array.

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8080/engine/v2/jobs' -d '{
  "id":"job01",
  "description":"Job with transforms",
  "transforms" : [
    {
      "transform":"concat",
      "inputs":["date", "time"],
      "outputs":"datetime"
    },
    {
      "transform":"split",
      "arguments":[":"],
      "inputs":["host_port"],
      "outputs":["host","port"]
    }
  ],
  "analysisConfig" : {
    ...
  }
}'
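
The note above says that a single value given for arguments, inputs or outputs is converted into an array. The Python sketch below shows one way to picture that conversion; the helper names as_string_array and normalise_transform are hypothetical, and the code only illustrates the documented behaviour rather than reproducing the API's own logic.

```python
def as_string_array(value):
    """Wrap a lone string in a list; copy any other sequence into a list."""
    if isinstance(value, str):
        return [value]
    return list(value)

def normalise_transform(transform):
    """Return a copy of the transform with its array-typed fields as lists."""
    out = dict(transform)
    for key in ("arguments", "inputs", "outputs"):
        if key in out:
            out[key] = as_string_array(out[key])
    return out

# The concat transform from the example sets 'outputs' to a single string.
concat = normalise_transform(
    {"transform": "concat", "inputs": ["date", "time"], "outputs": "datetime"}
)
print(concat)
```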

Important

It is a configuration error if the output of a transform is not used either as an input to another transform or as one of the analysis fields (i.e. one of the fieldName, byFieldName, overFieldName or partitionFieldName fields in a Detector Configuration, or the timeField in the Data Description). However, an output cannot be used as the summary count field. Nor can a transform’s output be used as its own input, either directly or transitively through a chain of transforms.
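
If you generate job configurations programmatically, you may want to check these rules before submitting a job. The following Python sketch is one possible client-side check, not the validation the Engine performs: the function check_transforms and its parameters are hypothetical, and it assumes that inputs and outputs are already lists.

```python
def check_transforms(transforms, analysis_fields, summary_count_field=None):
    """Return a list of error messages for the rules described above."""
    errors = []

    # Every field name read by a transform or by the analysis.
    consumed = set(analysis_fields)
    for t in transforms:
        consumed.update(t["inputs"])

    # Map each output name to the index of the transform that writes it.
    produced_by = {}
    for i, t in enumerate(transforms):
        for out in t.get("outputs", []):
            produced_by[out] = i
            if out == summary_count_field:
                errors.append(f"output '{out}' is used as the summary count field")
            elif out not in consumed:
                errors.append(f"output '{out}' is never used")

    # Edge i -> j when transform j reads a field that transform i writes.
    edges = {i: set() for i in range(len(transforms))}
    for j, t in enumerate(transforms):
        for name in t["inputs"]:
            if name in produced_by:
                edges[produced_by[name]].add(j)

    def reaches(start, target, seen):
        """Depth-first search: can 'target' be reached from 'start'?"""
        for nxt in edges[start]:
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                if reaches(nxt, target, seen):
                    return True
        return False

    for i, t in enumerate(transforms):
        if reaches(i, i, set()):
            errors.append(f"transform {i} ({t['transform']}) consumes its own output")
    return errors

transforms = [
    {"transform": "split", "arguments": [":"],
     "inputs": ["host_port"], "outputs": ["host", "port"]},
]
# 'port' is not an analysis field and feeds no other transform.
print(check_transforms(transforms, analysis_fields={"host"}))
```

A transform without outputs, such as exclude, passes the first check trivially because it produces nothing that could go unused.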

Conditional Transforms

Some transforms are applied conditionally, based on the value of the input field; exclude is one such transform. Conditional transforms must be defined with a condition clause.

In this example all records where the ‘DataCenter’ field has the value ‘Berlin’ will be excluded from the analysis:

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8080/engine/v2/jobs' -d '{
  "transforms" : [
    {
      "transform":"exclude",
      "inputs":["DataCenter"],
      "condition": {
          "operator":"match", "value":"Berlin"
      }
    }
  ],
  "analysisConfig" : {
    ...
  }
}'

Conditions have an operator and an operand, set in the operator and value fields of the condition clause. Typically the operand is a hard limit, such as a threshold used with the gt operator, or a regular expression used with the match operator. The table below lists the available operators and their operand types.

Operator | Operand type | Description
eq | Numerical | Equal to
gt | Numerical | Greater than
gte | Numerical | Greater than or equal
lt | Numerical | Less than
lte | Numerical | Less than or equal
match | String | A regular expression match
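
As an illustration of how the operators in the table behave, the Python sketch below evaluates a condition against a field value and drops the matching records, as exclude is described as doing. The helpers condition_holds and exclude are hypothetical, not the Engine's code. Two points are assumptions: that match requires the whole value to match the regular expression, and that numerical operands may arrive as strings.

```python
import operator
import re

# Numerical operators from the table, mapped to Python comparisons.
NUMERIC_OPERATORS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}

def condition_holds(condition, field_value):
    """Return True if the field value satisfies the condition."""
    op = condition["operator"]
    operand = condition["value"]
    if op == "match":
        # Assumption: the whole field value must match the regular expression.
        return re.fullmatch(operand, field_value) is not None
    # Assumption: both sides are compared as floating-point numbers.
    return NUMERIC_OPERATORS[op](float(field_value), float(operand))

def exclude(records, field, condition):
    """Keep only the records for which the condition is false."""
    return [r for r in records if not condition_holds(condition, r[field])]

records = [{"DataCenter": "Berlin"}, {"DataCenter": "London"}]
kept = exclude(records, "DataCenter", {"operator": "match", "value": "Berlin"})
print(kept)
```

If the Engine treats match as a partial (substring) match instead, re.search would be the closer analogue; check the behaviour with the preview endpoint before relying on either reading.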

Debugging Transforms

To review the output of your transforms, upload a sample of your data to the preview endpoint. It will return the converted data after the transforms have been applied; however it will NOT pass the data for further analytical processing. This is a quick way of debugging and verifying complex transforms.

Available Transforms

Any of the transforms listed below can be used in your Engine API jobs.

Name | Description | Arguments
concat | Concatenate the input fields | A delimiter (optional)
domain_split | Split a domain name into highest registered domain and sub-domain | N/A
exclude | Conditionally exclude records from the analysis if the condition evaluates to true | A condition
extract | Extract new fields from the given regular expression’s capture groups | A regular expression with capture groups
geo_unhash | Convert a Geohash value to latitude, longitude | N/A
lowercase | Convert input field to lower case | N/A
split | Split the field around matches of the given regular expression | A regular expression
trim | Remove leading and trailing whitespace from the input field | N/A
uppercase | Convert input field to upper case | N/A