Multivariate Analysis

The Engine API can dynamically detect correlations between time series and model them as multivariate. For example, it may detect that DiskWrites and NetworkIO are related, i.e. changes in the values of each occur together. Anomalies in DiskWrites will then be detected based both on its past behavior and on the value of NetworkIO at that time, and vice versa.

There is no denying that multivariate analysis is complex. It is more difficult to model, especially when running in memory and in real time, and the results can be more complicated to interpret. However, in scenarios where data is correlated, multivariate analysis with the Engine API provides more accurate anomaly detection, reduces false positives, and is available with no extra management overhead.

Using multivariate analysis, anomalies in disk and network metrics that occur when CPU is high can be handled differently from those that occur when the system is idle with low CPU. In tests using data known to be strongly correlated, the number of raw anomaly records was reduced by almost half.

Using multivariate analysis

For multivariate anomaly detection, the Engine API requires input data to be in a format that is suitable for by-field analysis. For example:

timestamp,instance,metricname,metricvalue
2016-04-08T03:00:00.000+0000,acmehost,CPU,4.872
2016-04-08T03:00:00.000+0000,acmehost,DiskWrite,6.9
2016-04-08T03:00:00.000+0000,acmehost,DiskRead,7.8
2016-04-08T03:00:00.000+0000,acmehost,NetworkIn,3769325.2
2016-04-08T03:01:00.000+0000,acmehost,CPU,3.872
2016-04-08T03:01:00.000+0000,acmehost,DiskWrite,2.3
2016-04-08T03:01:00.000+0000,acmehost,DiskRead,8.6
2016-04-08T03:01:00.000+0000,acmehost,NetworkIn,2639287.7
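If metrics arrive in a wide format (one column per metric), they can be pivoted into this by-field layout before being sent to the Engine API. A minimal Python sketch, using hypothetical input data:

```python
import csv
import io

# Hypothetical wide-format input: one column per metric.
wide_csv = """timestamp,instance,CPU,DiskWrite
2016-04-08T03:00:00.000+0000,acmehost,4.872,6.9
2016-04-08T03:01:00.000+0000,acmehost,3.872,2.3
"""

def to_by_field(wide_text):
    """Pivot wide rows into the long (by-field) layout shown above."""
    reader = csv.DictReader(io.StringIO(wide_text))
    rows = []
    for row in reader:
        for metric in reader.fieldnames[2:]:  # skip timestamp and instance
            rows.append({
                "timestamp": row["timestamp"],
                "instance": row["instance"],
                "metricname": metric,
                "metricvalue": row[metric],
            })
    return rows

# Each input row yields one output row per metric column.
long_rows = to_by_field(wide_csv)
```

Each of the two input rows expands to one row per metric, matching the timestamp/instance/metricname/metricvalue layout above.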

Multivariate analysis is off by default. To configure a job analyzing the data above, set the configuration option multivariateByFields to true:

{
  "id": "mv-job",
  "description": "Metric monitoring",
  "analysisConfig": {
    "influencers": [ "instance" ],
    "bucketSpan": 3600,
    "multivariateByFields": true,
    "detectors": [
      { "function": "mean", "fieldName": "metricvalue", "byFieldName": "metricname", "partitionFieldName": "instance" }
    ]
  },
  "dataDescription": {
    "format": "DELIMITED",
    "timeField": "timestamp",
    "timeFormat": "yyyy-MM-dd'T'HH:mm:ss.SSSX",
    "fieldDelimiter": ",",
    "quoteCharacter": "\""
  }
}

More job creation examples can be found in the API documentation. Note that embedded quotes may need to be escaped, as detailed in Date Time Format.

How does it work?

The example above uses instance as a partition field, so each partition is modeled separately. For each instance, the Engine API therefore models each metricname, i.e. it models DiskWrite, NetworkIn and so on. As it learns, it detects whether pairs of time series are related (which may change over time). For example, if increases in NetworkIn are matched by increases (or decreases) in DiskWrite, then these time series are considered related. Because NetworkIn may be related to more than one metricname, the Engine API keeps track of the pairs of related time series.
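The idea of tracking related pairs can be illustrated with a simple pairwise correlation check. This Python sketch is illustrative only (the Engine API's actual method is not described in this document) and uses made-up series values:

```python
from itertools import combinations
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical bucketed values per metricname for one instance.
series = {
    "NetworkIn": [1.0, 2.0, 3.0, 4.0, 5.0],
    "DiskWrite": [1.1, 2.1, 2.9, 4.2, 5.0],
    "CPU":       [5.0, 1.0, 4.0, 2.0, 3.0],
}

# Pairs whose absolute correlation exceeds a threshold are tracked as related.
related = [
    (a, b) for a, b in combinations(series, 2)
    if abs(pearson(series[a], series[b])) > 0.9
]
```

With this data, only the NetworkIn/DiskWrite pair clears the threshold; CPU moves independently and is not tracked against either.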

As with univariate analysis, anomalies are considered more significant if a number of other anomalies occur in the same bucket. With multivariate analysis, the probabilities are additionally adjusted according to the anomalousness of each related pair, if any. To summarize results, only the most anomalous record is written per bucket, for each metricname and for each instance. This may be an anomaly given the value of a paired time series, or an anomaly in its own right.
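The summarization step amounts to keeping the single most anomalous record per (bucket, metricname, instance) key. A minimal sketch with hypothetical raw records, ranked by the normalizedProbability field that appears in the results:

```python
# Hypothetical raw anomaly records for one bucket.
records = [
    {"bucket": 0, "instance": "acmehost", "metricname": "DiskWrite",
     "normalizedProbability": 86.6},
    {"bucket": 0, "instance": "acmehost", "metricname": "DiskWrite",
     "normalizedProbability": 12.3},
    {"bucket": 0, "instance": "acmehost", "metricname": "NetworkIn",
     "normalizedProbability": 40.1},
]

# Keep only the most anomalous record per (bucket, metricname, instance).
best = {}
for r in records:
    key = (r["bucket"], r["metricname"], r["instance"])
    if key not in best or r["normalizedProbability"] > best[key]["normalizedProbability"]:
        best[key] = r
```

Here the two DiskWrite records collapse to the one scoring 86.6, so two summary records survive for the bucket.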

Multivariate result

Querying the results using the API

Multivariate results are similar to univariate results and can be retrieved in the same way using the Results Endpoint. Actual and typical values for the anomaly are given and, in addition, the related series is provided. For example, the following shows that DiskWrite is considered anomalous given the value of NetworkIn:

...
"function" : "mean",
"byFieldValue" : "DiskWrite",
"correlatedByFieldValue" : "NetworkIn",
"typical" : 98276,
"actual" : 20001,
"probability" : 3.52203E-39,
"normalizedProbability" : 86.56073,
...

Note that correlatedByFieldValue is not always reported: the record may be an anomaly in its own right, or the Engine API may have learned that the time series are no longer correlated.
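When post-processing results, the presence or absence of correlatedByFieldValue distinguishes the two cases. A minimal Python sketch, using the record fields from the example above (the describe helper is hypothetical):

```python
# A result record as returned by the Results Endpoint (excerpted fields).
record = {
    "function": "mean",
    "byFieldValue": "DiskWrite",
    "correlatedByFieldValue": "NetworkIn",
    "typical": 98276,
    "actual": 20001,
    "normalizedProbability": 86.56073,
}

def describe(rec):
    """Summarize whether the anomaly is standalone or relative to a related series."""
    if "correlatedByFieldValue" in rec:
        return (f"{rec['byFieldValue']} anomalous given "
                f"{rec['correlatedByFieldValue']}")
    return f"{rec['byFieldValue']} anomalous in its own right"
```

For the record above, describe returns "DiskWrite anomalous given NetworkIn"; a record without the field is reported as anomalous in its own right.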

Operational best practice

Multivariate analysis was released in Engine API 2.0. The following is recommended as best practice:

  • Use a long enough bucketSpan to ensure that data points exist for all by-fields in the input data; otherwise, correlations cannot be detected.
  • Allow a slightly longer learning period for the Engine API to detect correlations: typically a minimum of 2 days or 100 buckets, depending on the characteristics of the data.

The following caveats apply:

  • Multivariate analysis is configured per job and is off by default.
  • Multivariate analysis cannot currently be used for population analysis.
  • Analysis of fields containing hyphens causes unexpected results.
  • Multivariate analysis requires slightly more processing and memory than traditional univariate analysis.