Frequently Asked Questions

How is the data “baseline” created?

Although it is common to think of a baseline being created, the Engine API actually models the data's behavior by constructing an appropriate probability distribution function. See more info here.

Back to top

What is the minimum amount of data needed before anomalies can be detected?

The quality of the probabilistic model that is constructed is a function of how much data has been observed. It is also affected by the type of analysis being performed (for example, rare or metric analysis). In general we like to see tens or hundreds of observations per bucket, and tens or hundreds of buckets over time. It is perfectly possible to get tangible results with less data; the recommended minimums are defined below. Please note that these times are based on the timestamp of the input data and not on wall-clock time.

  • all analysis - aim for at least 2 hours and at least 4 buckets of input data (whichever is greater)
  • rare analysis - requires at least 20 buckets before an entity can be regarded as rare
  • periodicity - daily periodic patterns can be detected after 2-3 days, and Saturday/Sunday patterns after 2-3 weekends

Back to top

Is there a way to set the sensitivity of the detection?

The way to manage “sensitivity” is via the output anomaly score, which is an aggregated measure of the anomalousness of the bucket time interval. Not all anomalies are created equal - those that are more of an outlier (less likely statistically) will contribute to a higher anomaly score. We recommend focusing on the buckets with the highest anomaly scores.

Back to top

How much data can be processed?

The Engine API was developed from the ground up to work with Big Data. Processing is limited by hardware more than anything else. If you are using the Engine API against large datasets and need help with tuning and best practices for optimization, please contact support@prelert.com.

Back to top

What are the minimum hardware requirements?

Hardware requirements vary greatly depending upon the data volume and the number of metrics being analyzed. The API itself can easily be evaluated running on simple hardware. On a quad core laptop with 4GB RAM we have analyzed 1.8GB of metric data containing 34 million data points in 678 seconds (and that utilized only a little over one of the laptop's CPU cores).

Back to top

What is a Connector?

The Connector performs the following functions:

  • Extracts data from the source data store
  • Manages jobs in the Engine API
  • Forwards data to the Engine API for analysis

Depending on your requirements, the Connector could be coded either to process data in batches or to stream it in real-time.

Our Connectors will be open source and examples are available at https://github.com/prelert
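
For illustration, a minimal batch Connector can be sketched with cURL. This is a sketch only: job-config.json and extract_source_data are hypothetical, and the job creation and data upload paths are assumptions that should be checked against the API Reference.

# Create a job from a prepared configuration file (job-config.json is hypothetical)
curl -X POST -H 'Content-Type: application/json' --data-binary @job-config.json 'http://localhost:8080/engine/v2/jobs'

# Extract data from the source store (extract_source_data is a hypothetical helper)
# and forward it to the Engine API for analysis
extract_source_data | curl -X POST --data-binary @- 'http://localhost:8080/engine/v2/data/<jobId>'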

The Engine API prefers time series data to be ordered by date. You may wish to perform this ordering as a pre-processing step in your Connector code, or specify a latency window. Please see the notes on handling out-of-sequence data.

For more information, see Top Tips for Writing a Connector and Deployment Guide.

Back to top

What is a Results Processor?

The Results Processor takes the anomaly results and makes them available to view. The results are provided in JSON format. These can be parsed and imported into your preferred reporting / management solution.
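
As a minimal sketch, the JSON results can be filtered on the command line before import. This assumes the paged response wraps results in a documents array and that each bucket carries an anomalyScore field; check the API Reference Results section for the exact schema.

curl -s 'http://localhost:8080/engine/v2/results/<jobId>/buckets' | jq '.documents[] | select(.anomalyScore > 75)'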

The Behavioral Analytics Dashboard is a Kibana-based UI. It is an example of a Results Processor that we have written.

For more information, see Suggested Deployment Patterns.

Back to top

How do I see my results?

When you start off, the Behavioral Analytics Dashboard is a good way to quickly view the results of your analysis.

http://localhost:5601/app/prelert#

In the longer term you’ll want to retrieve the results in JSON format and display them or react to them in some other way within the system you’re integrating with the Engine API.

The analysis results are available in JSON format and the methods to query them are fully documented in the API Reference Results section.
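
For example, the result buckets for a job can be retrieved with cURL (the URL pattern follows the example in How do I stream data? below):

curl 'http://localhost:8080/engine/v2/results/<jobId>/buckets'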

Back to top

How come I can only see 100 results?

By default, the first 100 results are returned. In order to see more, specify the “take” and “skip” options.
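
For example, to retrieve the next 100 results after the first page (parameter names as above; see the API Reference for details):

http://localhost:8080/engine/v2/results/<jobId>/buckets?skip=100&take=100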

Back to top

Why is there some delay in getting my results back?

If a job is not closed after completion, the results output buffer is automatically flushed after 10 minutes. For datasets with a definite end point, such as batches of historical data, it is advisable to close the job once all the data has been submitted in order to flush the output buffer sooner.
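
For example, assuming the close operation sits under the data path (check the API Reference for the exact URL), a job can be closed with:

curl -X POST 'http://localhost:8080/engine/v2/data/<jobId>/close'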

Back to top

Why do I need to order my records chronologically?

The Engine API prefers time series data to be ordered by date because calculations are performed on time buckets. Having the input data ordered by date allows the advanced statistical methods to be applied accurately, rapidly and at scale.

However, in the real world, data is often out of sequence: clock drift, distributed endpoints and network outages all cause date-ordering challenges. Engine API 1.3 and above has built-in functionality to handle out-of-sequence data with the least possible impact on memory whilst maintaining the quality of the results.

Whilst the golden configuration for real-time anomaly detection is to process data in chronological order without latency, this is not always possible. If it is not possible to send the data in time order, we recommend specifying as short a latency window as your data will allow; this can be set in the job analysis configuration.
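
As a sketch, the latency window is set in the job's analysis configuration when the job is created. The snippet below assumes latency and bucketSpan are given in seconds; the field names should be checked against the API Reference.

{
  "analysisConfig": {
    "bucketSpan": 300,
    "latency": 60,
    "detectors": [ { "function": "count" } ]
  }
}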

Back to top

How do I get my data from a database into the API?

If your data is being updated continually, please read the following section titled How do I stream data?

If you need to batch analyze a single dataset and this is currently stored in a database, then write a pre-processing step to extract the data into JSON or a DELIMITED format. It is preferable to ensure that the data is ordered chronologically. If that is not possible, a latency window can be specified (see handling out-of-sequence data). An example for PostgreSQL is here.
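
For instance, with PostgreSQL's psql client the extraction can be done in one step (the events table and its columns are hypothetical):

psql -d mydb -c "\copy (SELECT ts, host, responsetime FROM events ORDER BY ts) TO 'events.csv' WITH CSV HEADER"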

Back to top

How do I stream data?

On a cycle, perform the following:

  • At 12:00, say, upload a chunk of data to a job (i.e. data from 11:59-12:00), but do not close the job.
  • After 60 seconds, at 12:01, upload the next chunk of data to the same job (i.e. data from 12:00-12:01).
  • Repeat these steps.

The cycle length should be adjusted depending upon the volume of data and the frequency of updates.

In a separate (asynchronous) thread you can query the results endpoint for all results, or just the new results, on each poll.

For example:

http://localhost:8080/engine/v2/results/<jobId>/buckets?start=2014-02-12T00:00:00Z&end=2014-02-14T00:00:00Z&expand=true

In this way, you can poll every X minutes for the last X minutes of data programmatically.
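
A minimal upload loop can be sketched in shell. Here extract_next_chunk is a hypothetical step that emits the most recent minute of data, and the data upload path is an assumption to be checked against the API Reference.

while true; do
  # Extract the latest minute of data (hypothetical helper)
  extract_next_chunk > chunk.json
  # Upload the chunk to the job without closing it
  curl -X POST --data-binary @chunk.json 'http://localhost:8080/engine/v2/data/<jobId>'
  sleep 60
done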

Back to top

How can I view the logs?

The main logs are written to $PRELERT_HOME/logs/engine_api/. These roll over; ‘engine_api.log’ will be the most recent, ‘engine_api.log.1’ the second most recent, and so on. ‘stderr.log’ and ‘stdout.log’ are written from the standard error and standard output streams. These logs are the first place to check if unexpected behavior is encountered.

Job-specific logs are located in their own directory (named after the job) within the logs directory, e.g. $PRELERT_HOME/logs/dns_job01.

You can also access any logs using the logs endpoint. This does not require file system access to the server.
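
For example, assuming the logs endpoint follows the same /engine/v2/ pattern as the results endpoint (check the API Reference for the exact path):

curl 'http://localhost:8080/engine/v2/logs/<jobId>'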

Back to top

What data formats do you accept?

The Engine API accepts DELIMITED, JSON and SINGLE_LINE data formats and the results are available in JSON format. This is fully documented in the API Reference Guide.
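
As a sketch, the input format is declared in the job's dataDescription when the job is created. The field names below are indicative and should be checked against the API Reference Guide.

{
  "dataDescription": {
    "format": "DELIMITED",
    "fieldDelimiter": ",",
    "timeField": "timestamp",
    "timeFormat": "epoch"
  }
}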

Back to top

Can I categorize and analyze unstructured log files?

Yes. See Categorization to learn how.

Back to top

How can I improve my throughput?

There are several ways that performance can be enhanced if you have reached the limits of your particular hardware spec. Here are some suggestions:

  1. Increase aggregation

    Data is aggregated into buckets; for example, time series data might be aggregated into time windows of 5 minutes (say). Performance gains can be realized by increasing the level of aggregation to 30 minute windows (say). This is one of the easiest ways to improve throughput.

    Note: Choose a time window that reflects the frequency of the data being analyzed. If logging events are occurring every 15 seconds, then a 15 second bucket span will not result in any aggregation.

  2. Improve data quality

    It is estimated that between 60% and 90% of processing time is spent parsing and aggregating the data. Invalid data values are discarded, so if the data quality is poor to begin with, performance gains can be achieved by running a pre-processing data clean-up exercise before analyzing the data.

    Additionally, parsing time can be reduced by only sending relevant fields to the API. For example, the API will happily accept JSON documents with 100 fields of which only 3 are relevant to the analysis, but a lot of data is then being shuffled about pointlessly (the other 97 fields). Sending just the 3 relevant fields will improve the performance of the API, although we understand that this needs to be offset against any extra processing on the client side to pre-process the input.

    One final point to be aware of is that the date format in the input can affect the CPU usage of the API. Internally, times are held in the form of seconds or milliseconds since the epoch - midnight on 1/1/1970 UTC. If you can upload times as seconds or milliseconds since the epoch, this avoids a lot of CPU usage within the API. Having said that, this date transformation processing occurs in parallel to the main analysis, so it will only slow down the overall throughput if your server is running flat-out.

  3. Summarize data yourself

    If, for example, you want to detect anomalies in the count of particular field values in each time bucket, there is a large gain to be made by sending each distinct field value just once, together with the count of raw events for that field value. By doing this you transfer processing load from Prelert to your big data store, where it can often be distributed between many machines. Similarly, for numeric valued data you can supply the appropriate function of your data for each field value plus the count of raw observations. This feature is called summarization (a short sketch follows this list).

  4. Hardware upgrade

    If you have production requirements to process thousands of metrics across many terabytes of data, then hardware and the network will need to be up to the job. Please contact support@prelert.com. By providing us with a representative sample of your data and your job config, we can advise on possible optimizations and hardware requirements.
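
To illustrate summarization: instead of sending 500 identical raw events for host web01 in one bucket, send a single pre-aggregated record carrying the count. The field names below are hypothetical, and the job must be told which field holds the count; see the summarization documentation.

timestamp,host,events
1400000000,web01,500
1400000000,web02,37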

Back to top

How do I evaluate in a Windows environment?

Both Linux and Windows 64-bit operating systems are now supported. A list of supported platform requirements can be found here.

The examples in this documentation assume that you have installed the Engine API onto the same machine that you are working on (i.e. localhost). Please note that it is easy to install the Engine API onto a different system and to access it remotely using tools such as cURL, your web browser or programmatically.

Back to top

How do you deal with periodicity in data?

The Engine API automatically identifies time-related trends due to the daily and weekly periodic nature of the data through the use of spline interpolation. These trends are identified and smoothed, leaving the residual to be analyzed for anomalies.

The periodic nature of the input data is detected quickly, usually after only 2-3 days. In order to detect weekend (Saturday/Sunday) patterns, at least 2-3 weekends need to have been seen.

Back to top

Should I create many jobs for 1 metric or 1 job for many metrics?

Usually, in the early stages of evaluating and understanding the Engine API, the simplest approach is to create a job for every metric that you want analyzed. This is fine for prototyping at a small scale; however, as your usage evolves, we recommend using multiple metrics within a job for performance and scalability reasons. If using multiple detectors, a new license key will be required. Please contact support@prelert.com to arrange this.

Back to top

What is the relationship between anomaly score and probability?

The probability of the bucket depends upon the individual probabilities of the constituent records of the bucket, and also on how many anomalous records there are in the bucket; the more things that are anomalous together, the more unlikely the bucket. The anomaly score therefore has an inverse relationship to the probability (the more unlikely the bucket, the lower the probability and the higher the anomaly score). This relationship is non-linear, with more sensitivity at the lower end of the probability scale.

More information is available on Understanding the anomaly score.

Back to top

What happens if a state shift occurs in my data?

If a state shift occurs in the data, for example an organizational change which causes a 25% increase in network traffic to become the new normal, the Engine API learns and adapts. Initially, the increase will be seen as an anomaly. If the behavior is sustained, the anomaly score then flattens relatively quickly as it becomes the new normal.

Back to top

How can I serve the API on TCP port 80 (the standard HTTP port)?

Run the Engine API as the same non-root user who installed it and use port forwarding to forward traffic from TCP port 80 to the port that the Engine API is listening on (TCP port 8080 by default). On Linux iptables can be used for port forwarding.
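
For example, on Linux the following iptables rule (run as root) redirects incoming traffic on TCP port 80 to the default Engine API port 8080:

iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 8080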

Do not do this by running prelert_startup.sh as the root user. Doing so means that you will be vulnerable to any security flaws in Elasticsearch, Jetty or Java itself, and will also create files owned by root in your installation that subsequently prevent the correct non-root user from running the software.

Back to top

Can I remove historical results?

Anomaly results are stored in Elasticsearch. For very long running jobs (several months, say) you may want to remove historical results in order to manage the size on disk. This can be run as a scheduled maintenance task for Elasticsearch. An example script is provided below:

https://github.com/prelert/engine-connectors/blob/master/simple-scripts/delete-results.sh

Please note that the model analysis will automatically age out historical data, so the above will just serve to reduce the results that can be queried and will not affect the accuracy of the modeling.

Back to top

Why can’t I see a known anomaly?

Here are some common reasons why you may know an anomaly exists in your input data, yet cannot see it in the results.

  • The final bucket is not closed

    The analysis is triggered as each bucket closes. If you are using synthetic test data and the anomaly exists in your final bucket, please send through a final data point for the next bucket, which will ensure that the anomalous bucket is closed.

  • It’s too early

    Anomaly detection typically requires at least 2 hours and at least 4 buckets of input data in order to initiate the model. Detection of periodic input data requires several days before it is fully optimized. If known anomalies occur early in time, they may not be detected because not enough data has yet been seen. More info on the minimum required amounts of data can be found here.

  • A bigger anomaly occurs later on

    The Anomaly Score and Normalized Probability are both normalized values. This provides a human-readable, ordered measure of relative anomalousness. The normalized values are updated as the analysis progresses; therefore, if a very big anomaly occurs, previous anomaly scores are adjusted down to reflect the fact that they are less severe relative to recent occurrences. If your input data contains an anomaly but it is not being shown as significant, please check more recent data for very large anomalies.

  • Using the mean analysis function

    If there are many data points in the bucket, then a single spiked value may not cause the mean bucket value to be anomalous. Using the max function would be a recommended alternative analysis.

Back to top

Why can’t I see any interim results?

In order for the Engine API to calculate interim results, a data flush must be requested using the parameter ?calcInterim=true. Once these have been calculated, interim results are immediately available in the UI Summary and Explorer dashboards.

In order to view interim results using the API results endpoint or the alerts endpoint, specifically request them using the parameter ?includeInterim=true.
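
For example, assuming the flush operation sits under the data path (check the API Reference for the exact URL), interim results can be calculated and then queried as follows:

curl -X POST 'http://localhost:8080/engine/v2/data/<jobId>/flush?calcInterim=true'
curl 'http://localhost:8080/engine/v2/results/<jobId>/buckets?includeInterim=true'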

Back to top