Behavioral Analytics for the Elastic Stack

Overview

Prelert’s Behavioral Analytics for the Elastic Stack automates the analysis of time-series data by creating accurate baselines of normal behaviors in the data, and identifying anomalous patterns in that data.

Prelert’s proprietary machine learning algorithms detect anomalies related to temporal deviations in values, counts and frequencies, to statistical rarity, and to unusual behavior for a member of a population. Each anomaly is scored and linked with statistically significant influencers in the data.

Automated periodicity detection and quick adaptation to changing data mean that customers don’t need to specify algorithms, models, or other data science-related configurations in order to benefit from Prelert analytics.

Integration with the Elastic Stack

Behavioral Analytics for the Elastic Stack is tightly integrated with the Elastic Stack. Data is pulled from Elasticsearch for analysis, and anomaly results are displayed in Kibana dashboards.

Where does Prelert fit in the Elastic Stack

Typical use cases

Enterprises, government organizations and cloud-based service providers process volumes of machine data every day that are too large for real-time human analysis. Changing behaviors hidden in this data provide the information needed to quickly resolve massive service outages, detect security breaches before they result in the theft of millions of credit records, or identify the next big trend in consumer patterns. Current search and analysis, performance management and cyber security tools are unable to find these anomalies without significant human work in the form of thresholds, rules, signatures and data models.

Advanced anomaly detection techniques learn the normal behavior patterns represented by the data, then identify and cross-correlate anomalies. Performance, security and operational anomalies and their causes can therefore be identified as they develop, and acted on before they impact the business.

Whilst anomaly detection is applicable to any type of data, we focus on machine data scenarios. Enterprise application developers, cloud service providers and technology vendors need to harness the power of machine-learning-based anomaly detection to better manage complex online services, detect the earliest signs of advanced security threats, and gain insight into the business opportunities and risks represented by changing behaviors hidden in their massive data sets. Here are some real-world examples.

Eliminating noise generated by threshold-based alerts

Modern IT systems are highly instrumented and can generate terabytes of machine data a day. Traditional methods for analyzing this data involve alerting when metric values exceed a known value (static thresholds), or looking for simple statistical deviations (dynamic thresholds).

Setting accurate thresholds for each metric at different times of day is practically impossible. As a result, static thresholds generate large volumes of false positives (threshold set too low) and false negatives (threshold set too high).

The Engine API automatically learns the historical behavior of each metric and calculates the probability of a value being anomalous based on that behavior. This enables accurate alerting and highlights only the subset of relevant metrics that have changed. These alerts provide actionable insight into what is a growing mountain of data.
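The difference between a fixed threshold and a probability based on historical behavior can be illustrated with a minimal sketch. This is not the Engine API's algorithm, which is proprietary; it assumes a simple normal distribution fitted to past values, and the function name and sample data are hypothetical.

```python
import math
import statistics


def anomaly_probability(history, value):
    """Return the two-sided tail probability of `value` under a normal
    distribution fitted to `history` (a smaller result is more unusual)."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)  # needs at least two data points
    if stdev == 0:
        # Constant history: any different value is maximally unusual.
        return 1.0 if value == mean else 0.0
    z = abs(value - mean) / stdev
    # P(|X - mean| >= |value - mean|) for a normal distribution.
    return math.erfc(z / math.sqrt(2))


# Hypothetical response times (ms) observed for one metric.
history = [10, 12, 11, 9, 10, 11, 10, 12]

typical = anomaly_probability(history, 10.5)  # close to the baseline
unusual = anomaly_probability(history, 30)    # far outside the baseline
```

Because the score is relative to each metric's own history, the same code can be applied to many metrics without choosing a threshold value for each one. Real data is rarely normally distributed, so this only illustrates the idea.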

Reducing troubleshooting times and subject matter expert (SME) involvement

It is said that 75 percent of troubleshooting time is spent mining data to try to identify the root cause of an incident. The Engine API automatically analyzes data and boils down the massive volume of information to the few metrics or log messages that have changed behavior. This allows the subject matter experts (SMEs) to focus on the subset of information relevant to an issue, rather than on all data, greatly reducing triage time.

At a major credit services provider, within a month of deployment, the company reported that its overall time to triage was reduced by 70 percent and its use of outside SMEs’ time for troubleshooting was decreased by 80 percent.

Finding and fixing issues before they impact the end user

Large-scale systems, such as online banking, typically require complex infrastructures involving hundreds of different interdependent applications; just accessing an account summary page may involve dozens of different databases, systems and applications.

Because of their importance to the business, these systems are typically highly resilient and a critical problem will not be allowed to re-occur. If a problem happens, it is likely to be complicated and to be the result of a causal sequence of events that spans multiple interacting resources. Troubleshooting would require the analysis of large volumes of data with a wide range of characteristics and data types. It would also require a variety of experts from multiple disciplines to participate in time-consuming “war rooms” to mine the data for answers.

By using the Engine API in real time, large volumes of data can be analyzed to provide alerts on early indicators of problems and to highlight the events that are likely to have contributed.

Finding rare events that may be symptomatic of a security issue

When several hundred servers are under management, a new process running on one of them may indicate a security breach.

Using typical operational management techniques, each server would require a period of baselining in order to identify which processes are considered standard. Ideally a baseline would be created for each server (or server group) and would be periodically updated, making this a large management overhead.

By using the Engine API, baselines are automatically built based upon normal behavior patterns for each host and alerts are generated when rare events occur.
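As a rough illustration of per-host baselining, the sketch below counts how often each process has been seen on each host and flags a process that is rare for that host. It is a simplified stand-in rather than the Engine API's method; the class name, the parameters `rarity_threshold` and `min_events`, and the host and process names are all hypothetical.

```python
from collections import Counter, defaultdict


class RareProcessDetector:
    """Keeps a running baseline of process names per host."""

    def __init__(self, rarity_threshold=0.01, min_events=100):
        self.rarity_threshold = rarity_threshold  # max frequency counted as rare
        self.min_events = min_events              # events needed before alerting
        self.counts = defaultdict(Counter)        # host -> process -> count

    def observe(self, host, process):
        """Record one event and return True if it is rare for this host."""
        seen = self.counts[host]
        total = sum(seen.values())
        # Frequency of this process in the host's history so far.
        frequency = seen[process] / total if total else 0.0
        seen[process] += 1  # the baseline keeps updating with every event
        return total >= self.min_events and frequency < self.rarity_threshold


detector = RareProcessDetector()
for _ in range(100):
    detector.observe("web01", "sshd")
    detector.observe("web01", "cron")

new_process_alert = detector.observe("web01", "nc")      # never seen on web01
known_process_alert = detector.observe("web01", "sshd")  # part of the baseline
```

The baseline is built from the events themselves, so no separate baselining period has to be scheduled for each server. The `min_events` guard simply suppresses alerts until a host has some history.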

Finding anomalies in periodic data

For periodic data, it is difficult for standard monitoring tools to tell accurately whether a change in the data is due to a service outage or is simply a result of usual time schedules. Daily and weekly trends, along with peak and off-peak hours, make it difficult to identify anomalies using standard threshold-based methods. A minimum and maximum threshold for SMS text activity at 2am would be very different to the thresholds that would be effective during the day.

By using the Engine API, time-related trends are automatically identified and smoothed, leaving the residual to be analyzed for anomalies.
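The idea of removing a time-related trend and analyzing what remains can be sketched as follows. This is a deliberately simple illustration that assumes a fixed daily pattern modeled as the mean value for each hour of the day; it is not the Engine API's periodicity detection, and the function names and SMS counts are hypothetical.

```python
import statistics
from collections import defaultdict


def hourly_profile(samples):
    """Build the expected value for each hour of day.

    `samples` is an iterable of (hour_of_day, value) pairs.
    """
    buckets = defaultdict(list)
    for hour, value in samples:
        buckets[hour].append(value)
    return {hour: statistics.fmean(values) for hour, values in buckets.items()}


def residual(hour, value, profile):
    """Return what is left of `value` after removing the daily pattern."""
    return value - profile[hour]


# Hypothetical SMS counts for one week: quiet at 2am, busy at 2pm.
history = []
for day in range(7):
    history.append((2, 5 + day % 2))
    history.append((14, 500 + 10 * (day % 3)))

profile = hourly_profile(history)

night_residual = residual(2, 300, profile)  # 300 messages at 2am
day_residual = residual(14, 520, profile)   # 520 messages at 2pm
```

In this example, 300 messages at 2am leaves a far larger residual than 520 messages at 2pm, even though a single static threshold set for daytime traffic would not flag the 2am value.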