Top Tips for Writing a Connector

A Connector is required to feed the source data into the Engine API for analysis. This page provides a summary of the basic concepts that you should know about before writing your first Connector. Please use this in conjunction with our full Engine API documentation.

Check out the Python or Java clients on GitHub

We have written Python and Java clients and provided connector examples on GitHub. Rather than start from scratch, we recommend you try these out.
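
If you do write your own connector, at its core it simply reads your source data and POSTs it to the Engine API over HTTP. The following minimal sketch uses the Python requests library; the base URL, job ID and data endpoint path are placeholders, so check them against the Engine API documentation.

    import requests

    # Placeholder values - substitute your own Engine API host and job ID
    BASE_URL = "http://localhost:8080/engine/v2"
    JOB_ID = "my-job-id"

    # Read one batch of source records (here, a CSV file with a header row)
    with open("records.csv", "rb") as f:
        payload = f.read()

    # POST the batch to the job's data endpoint and always check the response
    response = requests.post("%s/data/%s" % (BASE_URL, JOB_ID), data=payload)
    if not response.ok:
        raise RuntimeError("Upload failed: %s %s" % (response.status_code, response.text))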

With C or C++ you can use libcurl, with alternative library options listed here. Please make sure you check the HTTP status codes as well as the libcurl return code.

For prototyping, another option is to use cURL from the command line. As with libcurl, please be sure to check HTTP status codes as well as the cURL return code.

Data should preferably be in time series order

By default, the analysis engine expects data to arrive in time order. Out-of-order data will be discarded, so please ensure that the records you send are ordered by timestamp.

You can see whether any data is being discarded by looking at invalidDateCount in Job Counts.
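
As a rough illustration, assuming the counts are exposed on the job details resource (the endpoint path and response structure below are assumptions; confirm them against the Job Counts reference):

    import requests

    # Hypothetical job details endpoint - verify the path against the API docs
    job = requests.get("http://localhost:8080/engine/v2/jobs/my-job-id").json()

    # 'counts' and 'invalidDateCount' are the names used in this documentation;
    # confirm them in the Job Counts reference before relying on them
    counts = job.get("counts", {})
    if counts.get("invalidDateCount", 0) > 0:
        print("Records discarded due to date problems: %d" % counts["invalidDateCount"])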

When it is not possible to submit the data in time order, a latency window can be specified as explained in Working with out-of-sequence data.
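
As an illustrative sketch only, the latency window is specified in the job configuration when the job is created; the field names used below (analysisConfig, bucketSpan, latency) are assumptions, so confirm them against the job configuration reference.

    # Sketch of a job configuration with a latency window; field names are
    # assumptions - see the job configuration reference for the real schema
    job_config = {
        "analysisConfig": {
            "bucketSpan": 300,   # 5 minute buckets
            "latency": 60,       # accept records up to 60 seconds out of order
            "detectors": [{"function": "count"}]
        }
    }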

Data should be in a single stream

Each job will only accept data from a single thread at a time. It expects a single stream of data, so if you are uploading a batch of records, please wait for it to complete before starting the next batch.
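
In practice this means uploading batches sequentially from a single thread and waiting for each response before sending the next, for example (the URL and file names are placeholders):

    import requests

    DATA_URL = "http://localhost:8080/engine/v2/data/my-job-id"  # placeholder

    # Send batches one at a time from a single thread; requests.post() blocks
    # until the Engine has responded, so the next batch is only sent once the
    # previous one has been fully accepted
    for batch in ["batch1.csv", "batch2.csv", "batch3.csv"]:
        with open(batch, "rb") as f:
            response = requests.post(DATA_URL, data=f)
        if not response.ok:
            raise RuntimeError("Upload of %s failed: %s" % (batch, response.text))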

Some data uploads take longer than others

End of bucket processing is triggered if the data load contains an input timestamp which is in the next bucket. This closes the previous bucket, updates the models and performs further analysis. During this end of bucket processing, the data post will take longer and it is important to wait for this to complete before sending more data. Once end of bucket processing is complete, then there is a lot of spare capacity to catch up.

For example, a job configured with a 5 minute bucketSpan might take ~6 secs to process each bucket. When sending data in real time, say once every second, this means that 1 in every 300 data posts will take ~6 seconds to complete; the rest will take a few milliseconds.

If any timeouts are set in the client code that is responsible for posting data to the Engine API, ensure these are set to a long enough value to allow for end of bucket processing to complete. We would recommend a minimum of 60 seconds, more for higher data rates.
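
With the Python requests library, for example, the timeout is passed per request; a read timeout comfortably above the worst-case end of bucket processing time avoids spurious failures (the URL is a placeholder):

    import requests

    DATA_URL = "http://localhost:8080/engine/v2/data/my-job-id"  # placeholder

    # Allow plenty of time for end of bucket processing: 10s to connect,
    # 120s to read the response (increase for higher data rates)
    with open("records.csv", "rb") as f:
        response = requests.post(DATA_URL, data=f, timeout=(10, 120))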

Check the return code

The quality of the analysis performed by the Engine API depends on the quality of the data being sent. Always wait for the response from the Engine, and check the return HTTP status code to make sure the data is good and being accepted by the Engine.

Respond to any error codes returned by the engine, and take action to rectify the problem. For example, if you receive an error because the engine is being sent too many out-of-order records, take steps to reduce the amount of data arriving out of chronological order. If the error code indicates you have too many jobs running concurrently, reduce the number of concurrent connections. If you are receiving timeouts, investigate the cause.
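
As a sketch of one possible policy (the URL is a placeholder and the specific status codes shown are only examples, not the API's actual codes), a connector might log the error body and then back off or stop depending on the code:

    import logging
    import time
    import requests

    DATA_URL = "http://localhost:8080/engine/v2/data/my-job-id"  # placeholder

    def post_batch(payload):
        response = requests.post(DATA_URL, data=payload, timeout=(10, 120))
        if response.ok:
            return
        # Log the full error body so the root cause can be identified, then
        # decide how to react; the back-off below is just one possible policy,
        # and the status codes chosen here are illustrative only
        logging.error("Engine rejected data (%s): %s",
                      response.status_code, response.text)
        if response.status_code in (408, 503):
            time.sleep(30)   # back off before retrying once
            requests.post(DATA_URL, data=payload, timeout=(10, 120))
        else:
            raise RuntimeError("Unrecoverable Engine API error")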

See the Error Codes documentation for the full list of errors that may be returned by the API.

Check your log files

If you are experiencing problems (assuming the Engine API is running and correctly licensed), the first two places to look are the following log folders:

  1. <INSTALLDIR>/logs/engine_api

    engine_api.log - core system log file

    stderr.log - stderr

  2. <INSTALLDIR>/logs/<jobid>

    engine_api.log - logs relating to the parsing and pre-processing of uploaded data

    autodetect_api.log - job specific logging relating to the analytics

    normalize_api.log - job specific logging relating to normalization

More information about the Logs Endpoint can be found here.

Improving your throughput

There are several ways to improve throughput which are described here.

As a summary of that article, you can do the following:

  • Use a longer bucketSpan
  • Ensure your data quality is good
  • Only send the fields that are required for the analysis
  • Use seconds since epoch for the time format (see the sketch after this list)
  • Consider pre-summarizing your data
  • Use more powerful hardware
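
For instance, a connector might convert timestamps to seconds since epoch and drop unused fields before uploading; the source record field names below are hypothetical:

    import calendar
    import csv
    from datetime import datetime

    # Keep only the fields the detectors actually use, and convert the
    # timestamp to seconds since epoch so the Engine does not have to parse a
    # date format for every record (the field names here are hypothetical)
    def slim_record(row):
        ts = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
        return {
            "time": calendar.timegm(ts.timetuple()),
            "responsetime": row["responsetime"],
            "airline": row["airline"],
        }

    with open("source.csv") as f:
        slim_records = [slim_record(row) for row in csv.DictReader(f)]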

Planning memory requirements

Analyzing big data in real time requires machine resources. The Engine API is highly optimized, but even so you may approach the limits of the machine you are running on.

Model memory usage figures are provided as part of the job stats.

This gives you the size of the mathematical model in memory. The system requires more than this to run; however, the model memory size gives you an indication of how much RAM you will need for each job on top of the system requirements.
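
A rough sketch of reading that figure programmatically (the endpoint path and the modelSizeStats/modelBytes field names are assumptions; confirm them against the job stats reference):

    import requests

    # Hypothetical job details endpoint and field names - confirm against the docs
    job = requests.get("http://localhost:8080/engine/v2/jobs/my-job-id").json()
    model_bytes = job.get("modelSizeStats", {}).get("modelBytes", 0)
    print("Model memory: %.1f MB" % (model_bytes / 1024.0 / 1024.0))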

A population analysis (e.g. modeling a population using the over-field) requires the least memory as a model is stored for the population as a whole. An individual analysis (e.g. modeling behavior using the by-field) uses more memory as it must model the behavior of each entity over time. Partition-fields use more memory than by-fields.
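
For reference, the choice between these is made per detector in the job configuration. The camel-case field names below are assumptions derived from the over/by/partition terminology above, so check the detector configuration reference for the exact schema.

    # Population analysis: one shared model for the population (least memory)
    population_detector = {"function": "count", "overFieldName": "clientip"}

    # Individual analysis: one model per entity (more memory)
    individual_detector = {"function": "count", "byFieldName": "clientip"}

    # Partitioned analysis: a separate analysis per partition value (most memory)
    partitioned_detector = {"function": "count", "byFieldName": "clientip",
                            "partitionFieldName": "datacentre"}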

Memory requirements also increase with the number of detectors, for data with periodic characteristics and when using a latency interval.

Ask us

Please feel free to reach out to support@prelert.com. To help us investigate, please note that we may need to see your log files, a copy of your job configuration, and a representative sample of your data.