Working with out-of-sequence data

This page gives an overview of how an anomaly detection job handles out-of-sequence data.

How it works when data is in chronological order

In the typical case where data arrives in ascending time order, each new record pushes time forward. When a record is received that belongs to a new bucket, the current bucket is considered complete. At this point, the model is updated, final results are calculated for the completed bucket, and a new bucket is created.
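The in-order behavior described above can be sketched as follows. This is a minimal illustration of the bucketing logic, not the engine's actual implementation; the class and method names, and the fixed bucket span, are assumptions for the example.

```python
BUCKET_SPAN = 300  # assumed bucket span in seconds (5-minute buckets)

def bucket_start(ts, span=BUCKET_SPAN):
    """Start time of the bucket a timestamp falls into."""
    return ts - ts % span

class InOrderBucketer:
    """Sketch: in chronological order, only one bucket is open at a time."""
    def __init__(self):
        self.current = None    # start time of the open bucket
        self.records = []      # records in the open bucket
        self.finalized = []    # (bucket_start, record_count) of completed buckets

    def add(self, ts, value):
        start = bucket_start(ts)
        if self.current is None:
            self.current = start
        elif start > self.current:
            # A record belonging to a new bucket completes the current one:
            # here the model would be updated and final results calculated.
            self.finalized.append((self.current, len(self.records)))
            self.current = start
            self.records = []
        self.records.append((ts, value))
```

Because only the single open bucket needs to be kept in memory, this is cheap and fast, but any record older than the open bucket has nowhere to go, which is why out-of-sequence records are simply ignored in this mode.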

Expecting data to arrive in time sequence means that modeling and results calculations can be performed very efficiently and in real time. As a direct consequence of this approach, out-of-sequence records are ignored.

Unfortunately, in the real world data is often out of sequence: clock drift, distributed endpoints and network outages can all cause time-ordering challenges. One way to handle this is to buffer and re-order records in a queue; however, this requires development time and effort and can be memory intensive.

The Engine API now has built-in functionality to handle out-of-sequence data with the least possible impact on memory whilst maintaining the quality of the results.

How out-of-sequence data is handled

When data is expected to arrive out-of-sequence, a latency window can be specified in the job analysis configuration. This defines the window within which data will be accepted and processed, regardless of its time order. For example, a latency window of 3600 seconds will accept any data whose timestamp is within one hour of the latest record seen. Data with a timestamp more than one hour older than the latest record will be discarded. Note that the window is relative to the timestamp of the record, not wall-clock time.
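The acceptance rule can be expressed in a few lines. This is a sketch, assuming the window is anchored to the newest record timestamp seen so far (not wall-clock time); the inclusive boundary is an assumption:

```python
def accepts(record_ts, latest_ts, latency=3600):
    """Sketch: True if a record's timestamp falls within the latency
    window relative to the latest record timestamp seen so far.
    Boundary behavior (>= vs >) is assumed, not confirmed."""
    return record_ts >= latest_ts - latency
```

So with the latest record at timestamp 10000 and a latency of 3600 seconds, a record timestamped 9000 is accepted while one timestamped 6000 is discarded.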

Each bucket within the latency window remains open and able to accept new records. These buckets are considered to contain partial data, and summarized bucket statistics are stored for each. As records with newer timestamps are received, time is pushed forward; the latency window moves along with the most recent timestamp. As buckets leave the latency window, their statistics are applied to the baseline model and the final results are calculated.
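Putting the pieces together, the latency window behavior might be sketched as below. This is an illustration of the described semantics only, with assumed names and parameters; the real engine stores summarized statistics per bucket, which the example reduces to a simple record count:

```python
class LatencyWindowBucketer:
    """Sketch: multiple buckets stay open inside the latency window and
    are committed (finalized) only once they fall out of the window."""
    def __init__(self, bucket_span=300, latency=3600):
        self.bucket_span = bucket_span
        self.latency = latency
        self.latest = None
        self.open = {}        # bucket start -> summarized stats (count here)
        self.finalized = []   # committed (bucket_start, count) pairs

    def add(self, ts, value):
        if self.latest is not None and ts < self.latest - self.latency:
            return False  # outside the latency window: discarded
        start = ts - ts % self.bucket_span
        self.open[start] = self.open.get(start, 0) + 1
        if self.latest is None or ts > self.latest:
            # Time moves forward; the window moves with it.
            self.latest = ts
            cutoff = self.latest - self.latency
            for s in sorted(self.open):
                if s + self.bucket_span <= cutoff:
                    # Bucket has left the window: apply its statistics to
                    # the model and calculate final results.
                    self.finalized.append((s, self.open.pop(s)))
        return True
```

Note the trade-off this illustrates: memory grows with the number of open buckets, i.e. roughly latency divided by bucket span.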

Results for buckets within the latency window can be calculated by flushing the data with the parameter ?calcInterim=true, and these interim results can then be queried. When querying interim results, bear in mind that the buckets are likely to contain partial data.
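As a concrete illustration, a flush request URL might be constructed as below. The base path and endpoint layout here are illustrative assumptions, not the documented API; only the ?calcInterim=true parameter comes from the text above:

```python
from urllib.parse import urlencode

def flush_url(base, job_id, calc_interim=True):
    """Build a flush request URL for a job.
    The '/data/{jobId}/flush' path is an assumed, illustrative layout."""
    params = urlencode({"calcInterim": "true" if calc_interim else "false"})
    return f"{base}/data/{job_id}/flush?{params}"
```

After such a flush, interim results for the partial buckets inside the latency window become available for querying.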

Trade-offs with regard to configuring latency

Arguably, one could set the latency to the highest possible value in order to handle records regardless of their chronological order. However, we recommend using as short a latency window as your data will allow, for the reasons discussed below.

  • Delay in model creation: The model is not created until a whole latency period has passed.
  • Delay in updating the model: The model is only updated from committed buckets, i.e. those that have left the latency window.
  • Delay in final results: The higher the latency, the longer it will be before results are finalized.
  • Interim results: Interim results are calculated based on partial buckets. They are also calculated against the committed model; the reaction to changes in state is therefore delayed by the latency period.
  • Alerting: Alerts are created based on final results and are therefore delayed by the latency.
  • Memory overhead: The higher the latency, the more bucket statistics are kept, increasing the resources needed.
  • Results accuracy: Differences are very slight compared with analyzing data in chronological order.

Operational best practice

The golden configuration for real-time anomaly detection is to process data in chronological order with no latency. This approach incurs no extra overhead and provides final results at the fastest possible rate. However, if the order of the data cannot be guaranteed, we recommend specifying as short a latency window as your data will allow, which can be set in the job analysis configuration.

For cases where the latency has to be significant (e.g. more than 10 buckets or several hours), please contact us to discuss.