Tutorial: Flight Comparison Website using cURL

This introductory tutorial provides sample data to analyze web response times for a flight comparison website using cURL.

Pre-requisites

Engine API is installed - This worked example assumes that the Engine API is installed locally. If you are working with a remote instance, replace “localhost:8080” with the “ipaddress:port” of that instance.

Getting Started

First check the installation is running:

curl 'http://localhost:8080/engine/v2'

This will return the version number of the Engine API. Don’t worry if the version or build number is not exactly the same as the example below; if your version number is lower, consider upgrading to a newer version.

<!DOCTYPE html>
<html>
<head><title>Prelert Engine</title></head>
<body>
<h1>Prelert Engine REST API</h1>
<h2>Analytics Version:</h2>
<p>prelert_autodetect_api (64 bit): Version 6.1.0 (Build d771b5fc3b9077) Copyright (c) Prelert Ltd 2006-2016</p>
</body>
</html>

Overview

Let’s now analyze an example time series dataset. This data has been taken from a fictional flight comparison website where users can request real-time quotes from multiple airlines. The website makes programmatic calls to each airline to get its latest fare information. It is important that these data requests are quick, as slow-responding airlines will negatively impact the user experience as a whole.

Here we will investigate response times by airline.

We will be using the command-line utility cURL which allows the easy transfer of data using a URL syntax over HTTP.

Before we start, please download the example CSV and JSON files from http://s3.amazonaws.com/prelert_demo/farequote.csv and http://s3.amazonaws.com/prelert_demo/farequote.json.

Time series data should be sent in ascending time order. If your data cannot be sent in time order, see Working with out-of-sequence data. The raw CSV data looks like this:

time,airline,responsetime,sourcetype
2014-06-23 00:00:00Z,AAL,132.2046,farequote
2014-06-23 00:00:00Z,JZA,990.4628,farequote
2014-06-23 00:00:00Z,JBU,877.5927,farequote
2014-06-23 00:00:00Z,KLM,1355.4812,farequote
2014-06-23 00:00:00Z,NKS,9991.3981,farequote
...
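If an export cannot be produced in time order, the records can be sorted before upload. A minimal Python sketch, assuming the time values sort lexicographically (true for the yyyy-MM-dd HH:mm:ssZ format used here):

```python
import csv
import io

def sort_by_time(csv_text):
    """Return the CSV records sorted into ascending time order.

    The 'time' values in farequote.csv (yyyy-MM-dd HH:mm:ssZ) sort
    correctly as plain strings, so no date parsing is needed.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = sorted(reader, key=lambda r: r["time"])
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```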

Tutorial

  1. Create New Job.

This creates an analysis job for the example data file. It will baseline responsetime for each airline and report if any responsetime value deviates significantly from its baseline.

Creating a new job requires both a declaration of how the data is formatted (dataDescription), and how the data is expected to be analyzed (analysisConfig).

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8080/engine/v2/jobs' -d '{
    "id" : "farequote",
    "description" : "airline tutorial",
    "analysisConfig" : {
        "bucketSpan":3600,
        "detectors" :[{"function":"metric","fieldName":"responsetime","byFieldName":"airline"}]
    },
    "dataDescription" : {
        "fieldDelimiter":",",
        "timeField":"time",
        "timeFormat":"yyyy-MM-dd HH:mm:ssX"
    }
}'

In this example, we are specifying that we want the analysis to be executed on the responsetime field. This field contains a numeric value, so we specify the metric function, which expands to all of min, mean and max. (Had we wanted to look at event rate or rare fields we’d have used one of the other available functions.) By declaring byFieldName as airline, a separate analysis is performed for each of the 19 airlines, rather than a single analysis across all airlines combined.

bucketSpan defines that the analysis should be performed across hourly (3600 second) periods.

The dataDescription section informs the API as to how the data is formatted, what character delimits the fields, and what is the format of the timestamp. See Describing your data format (dataDescription) for more information.

We’ve given the job a sensible name and description. If a job with that ID does not already exist, the create succeeds and the new job’s ID is returned. Remember that the job’s ID may have to be URL encoded when used in a request URL if it contains unsafe characters.

{"id":"farequote"}
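For example, a hypothetical job ID containing unsafe characters can be encoded with any standard URL-encoding routine before being placed in a request URL; in Python:

```python
from urllib.parse import quote

job_id = "flight quotes#1"  # hypothetical ID containing unsafe characters
# quote() turns the space into %20 and the '#' into %23
url = "http://localhost:8080/engine/v2/jobs/" + quote(job_id, safe="")
```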

  2. Check Job Status.

Now that the analysis job is created, you can check out the details of the job:

curl 'http://localhost:8080/engine/v2/jobs'

The response will give detailed information about each job’s status and configuration. If you have created multiple jobs, they are ordered by createTime, latest first. For example:

{
  "hitCount" : 1,
  "skip" : 0,
  "take" : 100,
  "nextPage" : null,
  "previousPage" : null,
  "documents" : [ {
    "location" : "http://localhost:8080/engine/v2/jobs/farequote",
    "description" : "airline tutorial",
    "dataEndpoint" : "http://localhost:8080/engine/v2/data/farequote",
    "bucketsEndpoint" : "http://localhost:8080/engine/v2/results/farequote/buckets",
    "recordsEndpoint" : "http://localhost:8080/engine/v2/results/farequote/records",
    "logsEndpoint" : "http://localhost:8080/engine/v2/logs/farequote",
    "status" : "CLOSED",
    "timeout" : 600,
    "id" : "farequote",
    "analysisConfig" : {
      "detectors" : [ {
        "fieldName" : "responsetime",
        "function" : "metric",
        "byFieldName" : "airline"
      } ],
      "bucketSpan" : 3600
    },
    "dataDescription" : {
      "format" : "DELIMITED",
      "fieldDelimiter" : ",",
      "timeField" : "time",
      "timeFormat" : "yyyy-MM-dd HH:mm:ssX",
      "quoteCharacter" : "\""
    },
    "counts" : {
      "bucketCount" : 0,
      "processedRecordCount" : 0,
      "processedFieldCount" : 0,
      "inputRecordCount" : 0,
      "inputBytes" : 0,
      "inputFieldCount" : 0,
      "invalidDateCount" : 0,
      "missingFieldCount" : 0,
      "outOfOrderTimeStampCount" : 0
    },
    "createTime" : "2014-09-30T13:37:26.597+0000"
  } ]
}

For detailed explanation of the output, please refer to the job resource object in the reference documentation. For now, note that the key piece of information is the job ID, which uniquely identifies this job and will be used in the remainder of this tutorial.
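Scripted clients typically parse this response rather than read it by hand. A small Python sketch pulling the job IDs and statuses out of a truncated copy of the listing above:

```python
import json

# Truncated copy of the GET /engine/v2/jobs response shown above
listing = json.loads('''{
  "hitCount": 1,
  "documents": [ { "id": "farequote", "status": "CLOSED" } ]
}''')

# Map each job ID to its current status
jobs = {doc["id"]: doc["status"] for doc in listing["documents"]}
```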

  3. Upload Data.

Now we can send the CSV data to the data endpoint to be processed by the engine. Using cURL, we will use the “-T” option to upload the file. You will need to edit the URL to contain the job ID and specify the path to the farequote.csv file:

curl -X POST -T farequote.csv 'http://localhost:8080/engine/v2/data/farequote'

This will stream the file farequote.csv to the REST API for analysis. This should take less than a minute on modern commodity hardware. Once the command prompt returns, the data upload has completed. Next, we can start looking at the analysis results.

  4. Close the Job.

Since we have uploaded a batch of data with a definite end point it’s best practice to close the job before requesting results. Closing the job tells the API to flush through any data that’s being buffered and store all results. Once again, you will need to edit the URL to contain the correct job ID:

curl -X POST 'http://localhost:8080/engine/v2/data/farequote/close'

Note: in the case of the farequote.csv example data you’ll have enough results to see the anomaly by the time the upload has completed even if you don’t close the job.

  5. View Results.

We can request the results endpoint for our job ID to see what kind of results are available:

curl 'http://localhost:8080/engine/v2/results/farequote/buckets?skip=0&take=100'

This returns a summary of the anomalousness of the data for each time interval. skip and take default to 0 and 100, meaning the first 100 results are returned; see Pagination for instructions on retrieving the next 100 results.

{
  "hitCount" : 119,
  "skip" : 0,
  "take" : 100,
  "nextPage" : "http://localhost:8080/engine/v2/results/farequote/buckets?skip=100&take=100&expand=false&includeInterim=false&anomalyScore=0.0&maxNormalizedProbability=0.0",
  "previousPage" : null,
  "documents" : [ {
    "timestamp" : "2014-06-23T00:00:00.000+0000",
    "bucketSpan" : 3600,
    "anomalyScore" : 0.0,
    "bucketInfluencers": [ ],
    "maxNormalizedProbability" : 0.0,
    "recordCount" : 0,
    "eventCount" : 649
  }, {
    "timestamp" : "2014-06-23T01:00:00.000+0000",
    "bucketSpan" : 3600,
    "anomalyScore" : 0.0,
    "bucketInfluencers": [ ],
    "maxNormalizedProbability" : 0.0,
    "recordCount" : 0,
    "eventCount" : 627
  }, {
  ...
  }
}
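With 119 buckets and a page size of 100, two requests are needed (skip=0 and skip=100). The skip values required to page through any result set can be computed from hitCount and take:

```python
def page_offsets(hit_count, take=100):
    """skip values needed to page through every result."""
    return list(range(0, hit_count, take))
```

Alternatively, a client can simply follow each response’s nextPage URL until it is null.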

The Engine API Dashboard gives a visual view of the results. This can be found here: http://localhost:5601.

In practice, most implementations will process the results programmatically. For the purpose of this tutorial, we will continue using the cURL command line and jump straight to the bucket with the maximum anomalyScore. This has the following timestamp: 1403712000.
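Picking out that bucket can be scripted by parsing the buckets response, taking the document with the highest anomalyScore, and converting its UTC timestamp to the epoch seconds used in the bucket URL. A sketch against an abbreviated copy of the response above:

```python
import calendar
import json
import time

# Two buckets abbreviated from the /buckets response shown earlier
response = json.loads('''{
  "documents": [
    { "timestamp": "2014-06-23T00:00:00.000+0000", "anomalyScore": 0.0 },
    { "timestamp": "2014-06-25T16:00:00.000+0000", "anomalyScore": 94.35376 }
  ]
}''')

worst = max(response["documents"], key=lambda b: b["anomalyScore"])
# Timestamps are UTC, so timegm() gives the epoch seconds for the bucket URL,
# e.g. /engine/v2/results/farequote/buckets/1403712000
epoch = calendar.timegm(time.strptime(worst["timestamp"][:19], "%Y-%m-%dT%H:%M:%S"))
```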

We can request the details of just this one bucket interval as follows:

curl 'http://localhost:8080/engine/v2/results/farequote/buckets/1403712000?expand=true'
{
  "exists" : true,
  "type" : "bucket",
  "document" : {
    "timestamp" : "2014-06-25T16:00:00.000+0000",
    "bucketSpan" : 3600,
    "records" : [ {
      "fieldName" : "responsetime",
      "timestamp" : "2014-06-25T16:00:00.000+0000",
      "function" : "mean",
      "probability" : 1.44722E-79,
      "anomalyScore" : 94.35376,
      "normalizedProbability" : 100.0,
      "byFieldName" : "airline",
      "byFieldValue" : "AAL",
      "typical" : 99.8455,
      "actual" : 242.75
    } ],
    "anomalyScore" : 94.35376,
    "bucketInfluencers": [ {
      "probability": 8.88553E-25,
      "influencerFieldName": "bucketTime",
      "anomalyScore": 94.35376
    } ],
    "maxNormalizedProbability" : 100.0,
    "recordCount" : 1,
    "eventCount" : 909
  }
}

This shows that between 2014-06-25T16:00:00+0000 and 2014-06-25T17:00:00+0000 (the bucket start time plus bucketSpan) the mean responsetime for airline AAL increased from a typical value of 99.8455 to 242.75. The probability of seeing 242.75 is 1.44722E-79 (which is very unlikely).

This increased value is highly unexpected based upon the past behavior of this metric and is thus an outlier.

  6. Delete Job.

Finally, the job can be deleted which shuts down all resources associated with the job, and deletes the results:

curl -X DELETE 'http://localhost:8080/engine/v2/jobs/farequote'

  7. Using JSON data

The same data can be processed in JSON format. Right click on the link to farequote.json and select “Save target as” or “Save link as” to download and save the file to disk. The format of the file is as follows:

{"airline": "AAL", "responsetime": "132.2046", "sourcetype": "farequote", "time": "1403481600"}
{"airline": "JZA", "responsetime": "990.4628", "sourcetype": "farequote", "time": "1403481600"}
{"airline": "JBU", "responsetime": "877.5927", "sourcetype": "farequote", "time": "1403481600"}
...
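Each line of the file is a standalone JSON document. If you only have the CSV export, an equivalent upload body can be generated in a few lines of Python (note the example file also converts the time field to epoch seconds, which this sketch leaves as-is):

```python
import csv
import io
import json

def csv_to_ndjson(csv_text):
    """Convert delimited records into one JSON document per line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row) for row in reader)
```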

The same steps as above can be followed, except that the dataDescription would need to be altered during the job creation:

curl -X POST -H 'Content-Type: application/json' 'http://localhost:8080/engine/v2/jobs' -d '{
    "id" : "farequote-json",
    "analysisConfig" : {
        "bucketSpan":3600,
        "detectors" :[{"function":"metric","fieldName":"responsetime","byFieldName":"airline"}]
    },
    "dataDescription" : {
        "format":"json",
        "timeField":"time"
    }
}'

And the upload data step would need to point to the JSON file:

curl -X POST -T farequote.json 'http://localhost:8080/engine/v2/data/farequote-json'

As with the CSV analysis, the results are accessed through the /results endpoint:

curl 'http://localhost:8080/engine/v2/results/farequote-json/buckets'