Tutorial: Application Performance Management using Python

This tutorial provides sample data to analyze the performance behavior of an IT system, using streamed APM data which contains multiple fields and multiple sourcetypes.

Pre-requisites

  • Engine API is installed - This worked example assumes that the Engine API is installed locally. If you are working with a remote instance, please substitute “localhost:8080” with your remote “ipaddress:port” details.
  • Python 2.7 or later.

Getting Started

First, check that the installation is running by browsing to:

http://localhost:8080/engine/v2

This will return the version number of the Engine API. Don’t worry if the version or build number is not exactly the same as the example below. If your version number is lower, you may want to consider upgrading to a newer version.

Prelert Engine REST API
Analytics Version:
Model State Version 22
prelert_autodetect_api (64 bit): Version 6.1.0 (Build d771b5fc3b9077) Copyright (c) Prelert Ltd 2006-2016
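
If you would rather check from a script, the snippet below does the same thing. It is a minimal sketch that assumes the third-party requests library is installed; the rest of the tutorial does not need it, and browsing to the URL works just as well.

# Optional: check the version from a script instead of a browser.
# Assumes the third-party 'requests' library is installed (pip install requests).
import requests

response = requests.get('http://localhost:8080/engine/v2')
print(response.status_code)   # expect 200
print(response.text)          # version and build information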

Overview

Let’s now try to analyze a real customer dataset (anonymized for use here). The customer had an application that would periodically disconnect from the database and did not know why.

The Engine API accepts a stream of data uploaded to the _data_ endpoint; records are processed as they are uploaded and the results are made available in real time. In this example we walk through the Python script that creates a job, uploads data to it and reads the results as they become available.

All the code snippets presented here are taken from the streamingApm.py file in the Prelert engine-python GitHub repository, and the CSV APM data file can be downloaded from http://s3.amazonaws.com/prelert_demo/network.csv.

APM Data

The data is from a customer’s network APM solution and there are multiple fields we wish to analyze.

time, In Broadcast Pkts,In Octets,Out Octets,In Discards,In Errors,Out Broadcast Pkts,Out Discards,Out Errors,host
Sun 05/18/2014 00:00,0.04,405553.8133,568986.9867,0,0,185.9166667,0,0,netprobe.acme.com
Sun 05/18/2014 00:05,0.026666667,328117.5467,1194843.813,0,0,186.81,0,0,netprobe.acme.com
Sun 05/18/2014 00:10,0.033333333,835395.0933,5479368.293,0,0,186.0633333,0,0,netprobe.acme.com
Sun 05/18/2014 00:15,0.046666667,1932961.387,6558447.147,0,0,187.6066667,0,0,netprobe.acme.com
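
The detectors configured below reference these field names verbatim, including the embedded spaces, so it is worth confirming them first. A minimal sketch, assuming network.csv has been downloaded to the working directory:

# Print the field names from network.csv - the detectors reference these
# names exactly as they appear in the header.
import csv

with open('network.csv', 'rb') as f:   # use open('network.csv', 'r') on Python 3
    header = next(csv.reader(f))

print(header)   # ['time', 'In Broadcast Pkts', 'In Octets', ..., 'host']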

In the Python example the job configuration is defined as:

job_config '{"analysisConfig" : {\
        "bucketSpan":3600,\
        "detectors" :[\
            {"function":"metric","fieldName":"In Discards","byFieldName":"host"},\
            {"function":"metric","fieldName":"In Octets","byFieldName":"host"}\
            {"function":"metric","fieldName":"Out Discards","byFieldName":"host"},\
            {"function":"metric","fieldName":"Out Octets","byFieldName":"host"},\
        ]\
    },\
    "dataDescription" : {\
        "fieldDelimiter":",",\
        "timeField":"time",\
        "timeFormat":"yyyy-MM-dd\'T\'HH:mm:ssXXXs"\
    }\
}'

Important

If using multiple detectors as above, you will require an updated license key. Please contact support@prelert.com to arrange. Alternatively, you can proceed with this tutorial using a single detector.

Four metric detectors are defined, one for each of the In Discards, In Octets, Out Discards and Out Octets fields; the bucket span is set to 1 hour (3600 seconds) and the uploaded data will be in CSV format with the time field in ISO 8601 format.
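
As an aside, the same configuration can be built as a Python dictionary and serialized with json.dumps, which avoids hand-maintaining the escaped string literal. This is an alternative sketch, not how streamingApm.py defines it:

import json

# Equivalent job configuration built as a Python dict, then serialized to JSON.
job_config = json.dumps({
    "analysisConfig": {
        "bucketSpan": 3600,
        "detectors": [
            {"function": "metric", "fieldName": "In Discards", "byFieldName": "host"},
            {"function": "metric", "fieldName": "In Octets", "byFieldName": "host"},
            {"function": "metric", "fieldName": "Out Discards", "byFieldName": "host"},
            {"function": "metric", "fieldName": "Out Octets", "byFieldName": "host"}
        ]
    },
    "dataDescription": {
        "fieldDelimiter": ",",
        "timeField": "time",
        "timeFormat": "yyyy-MM-dd'T'HH:mm:ssXXX"
    }
})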

Create a new job with the configuration

engine_client = EngineApiClient.EngineApiClient('localhost', '/engine/v2', 8080)
(http_status_code, response) = engine_client.createJob(job_config)
if http_status_code != 201:
    print (http_status_code, json.dumps(response))
    return

# the id of the new job is needed for the upload and results calls below
job_id = response['id']

The APM data is made available through a Python generator function. Generators are iterable and produce results on demand using the yield statement. The full source of the generator functions isn’t included here. Refer to https://github.com/prelert/engine-python/blob/master/streamingApm.py for the full implementation.

# generateRecords is the generator function
record_generator = generateRecords(csv_file, start_date, interval, end_date)

# get the csv header (the first record generated) by calling next
header = ','.join(next(record_generator))
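
For illustration only, the sketch below shows the core pattern such a generator can follow: yield the header row first, then each data row with the raw timestamp (e.g. Sun 05/18/2014 00:00) rewritten in the ISO 8601 form the job configuration expects. The repository implementation is more involved (note its start_date, interval and end_date arguments), so treat this as a simplified stand-in.

import csv
from datetime import datetime

def simpleRecordGenerator(csv_file):
    """Simplified stand-in for generateRecords: yields the header row first,
    then every data row with the time field converted to ISO 8601."""
    with open(csv_file, 'rb') as f:          # use 'r' on Python 3
        reader = csv.reader(f)
        yield next(reader)                   # the header row
        for row in reader:
            # raw format in the sample data: 'Sun 05/18/2014 00:00'
            parsed = datetime.strptime(row[0], '%a %m/%d/%Y %H:%M')
            # rewrite as ISO 8601 with an explicit UTC offset
            row[0] = parsed.strftime('%Y-%m-%dT%H:%M:%S') + '+00:00'
            yield row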

Upload the generated records to the Engine API. The records are grouped into batches of 100 then POSTed to the API.

count = 0
data = header + '\n'

for record in record_generator:
    # format as csv and append a new line
    csv = ','.join(record) + '\n'
    data += csv

    count += 1
    if count == 100:
        (http_status_code, response) = engine_client.upload(job_id, data)
        if http_status_code != 202:
            print (http_status_code, json.dumps(response)) # Error!
            break

        # commit the uploaded data and flush the results
        engine_client.close(job_id)

        # the csv header must be sent every time
        data = header + '\n'
        count = 0
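
Note that if the total number of records is not an exact multiple of 100, the loop above finishes with a partial batch still held in data. One way to flush it, using the same variables as above, is sketched here:

# upload any remaining records that did not fill a complete batch
if count > 0:
    (http_status_code, response) = engine_client.upload(job_id, data)
    if http_status_code != 202:
        print (http_status_code, json.dumps(response)) # Error!

# flush the results for the final batch, as in the loop above
engine_client.close(job_id)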

Real Time Results

As the Engine API processes data, new results are emitted once the current bucket boundary is crossed. When the Engine reads a record with a timestamp beyond the span of the current bucket, the results for that bucket are written and processing moves on to the next bucket. Bucket results are available once the POST to the close endpoint returns, and can be polled for using one of the client’s getBuckets… functions.

# getBucketsByDate accepts epoch time or ISO 8601 formatted strings
(http_status_code, response) = engine_client.getBucketsByDate(job_id=job_id,
    start_date=str(next_bucket_id), end_date=None)
if http_status_code != 200:
    print (http_status_code, json.dumps(response))
    break

# print results
for bucket in response:
    print ("{0},{1},{2}".format(bucket['timestamp'], bucket['anomalyScore'],
        bucket['maxNormalizedProbability']))

if len(response) > 0:
    # bucket id is the epoch time
    next_bucket_id = int(response[-1]['id']) + 1

This prints the result bucket time, anomaly score and max normalized probability:

Date,Anomaly Score,Max Normalized Probability
2014-05-18T00:00:00.000+0000,0.0,0.0
2014-05-18T01:00:00.000+0000,0.0,0.0
2014-05-18T02:00:00.000+0000,0.0,0.0

Scanning the results, we see there is a large anomaly at time 2014-05-21T15:00:00Z (epoch time 1400684400). You can view the raw results with curl (or by pasting the URL into your favourite web browser), replacing <job_id> with the appropriate job id:

curl http://localhost:8080/engine/v2/results/<job_id>/buckets/1400684400?expand=true
curl http://localhost:8080/engine/v2/results/20140930123038-00003/buckets/1400684400?expand=true
{
  "exists" : true,
  "type" : "bucket",
  "document" : {
    "timestamp" : "2014-05-21T15:00:00.000+0000",
    "bucketSpan" : 3600,
    "records" : [ {
      "fieldName" : "Out Discards",
      "timestamp" : "2014-05-21T15:00:00.000+0000",
      "function" : "mean",
      "probability" : 2.31215E-13,
      "anomalyScore" : 94.3055,
      "normalizedProbability" : 94.4097,
      "byFieldName" : "host",
      "byFieldValue" : "netprobe.acme.com",
      "typical" : 0.22992,
      "actual" : 34.2611
    }, {
      "fieldName" : "Out Octets",
      "timestamp" : "2014-05-21T15:00:00.000+0000",
      "function" : "mean",
      "probability" : 6.27685E-8,
      "anomalyScore" : 94.3055,
      "normalizedProbability" : 53.3324,
      "byFieldName" : "host",
      "byFieldValue" : "netprobe.acme.com",
      "typical" : 4058670.0,
      "actual" : 4.91168E7
    }, {
      "fieldName" : "In Octets",
      "timestamp" : "2014-05-21T15:00:00.000+0000",
      "function" : "min",
      "probability" : 0.0118201,
      "anomalyScore" : 94.3055,
      "normalizedProbability" : 0.254729,
      "byFieldName" : "host",
      "byFieldValue" : "netprobe.acme.com",
      "typical" : 623247.0,
      "actual" : 7062.4
    } ],
    "anomalyScore" : 94.3055,
    "maxNormalizedProbability" : 94.4097,
    "recordCount" : 3,
    "eventCount" : 12,
    "bucketInfluencers": [ {
      "probability": 6.27685E-8,
      "influencerFieldName": "bucketTime",
      "anomalyScore": 94.3055
    } ]
  }
}

The bucket has 3 anomaly records. The first is a value for the Out Discards field that Prelert considers to have a very low probability of occurring: the mean of the value over the bucket span (“function” : “mean”) is 34 (“actual” : 34.2611) compared to a typical value of 0.2 (“typical” : 0.22992). Additionally, the Out Octets field has spiked at the same time, with an actual mean of 49,116,800 compared to a typical value of 4,058,670.
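
The same drill-down can be scripted. A minimal sketch, again assuming the third-party requests library and substituting your own job id for <job_id>:

import requests

# fetch the expanded bucket and print its anomaly records
url = 'http://localhost:8080/engine/v2/results/<job_id>/buckets/1400684400?expand=true'
bucket = requests.get(url).json()['document']

for record in bucket['records']:
    print('{0}: actual={1}, typical={2}, probability={3}'.format(
        record['fieldName'], record['actual'],
        record['typical'], record['probability']))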

Conclusion

The ultimate root cause was a network spike that flooded the NIC, causing TCP Discards. The spike originated from a network misconfiguration which allowed VMware’s vMotion traffic to occur on the application VLAN instead of the management VLAN.