Querying data from Elasticsearch using the Scheduler

This page contains tips on how to configure a job that uses a Scheduler to query data from Elasticsearch.

Tips on Elasticsearch Scheduler query configuration

If you want to filter the input data from Elasticsearch, you can specify a filter in the query that the Scheduler uses to pull data when you create the job. The following table contains example filters that illustrate the range of filters that can be used in the Scheduler query.

Elasticsearch 2.x syntax | SQL equivalent | Notes
{"match_all": {}} | select * |
{"match": {"status": "304"}} | where status=304 | Does not work with Elasticsearch 1.7. Use of a term query is preferred (see below).
{"term": {"status": "304"}} | where status=304 | Depending on your field mappings, you may need to use lowercase for strings, and be aware that Elasticsearch may tokenize words, for example splitting us-east-1 into the terms us, east and 1.
{"terms": {"status": ["304", "200"]}} | where status=304 or status=200 |
{"wildcard": {"airline": "a*"}} | where airline="a*" | Make sure wildcards can be used for this field. Check in the Kibana Discover tab.
{"terms": {"clientip": ["82.41.44.140", "204.16.8.218"]}} | where clientip="82.41.44.140" or clientip="204.16.8.218" | Configured field mappings may prevent using wildcard queries on IP address fields.
{"bool": {"must": [{"term": {"rcode": "3"}}, {"term": {"clientip": "192.168.62.9"}}]}} | where rcode=3 and clientip=192.168.62.9 | Compound queries must be nested in a bool query.
{"bool": {"should": [{"wildcard": {"status": "2*"}}, {"wildcard": {"status": "3*"}}]}} | where status=2* or status=3* | Make sure wildcards can be used for this field. Check in the Kibana Discover tab.
{"bool": {"must_not": {"term": {"status": "304"}}}} | where status!=304 |
{"range": {"bytes": {"gte": 100, "lte": 500, "boost": 2.0}}} | where bytes>=100 and bytes<=500 | The optional boost parameter can be used to increase the relative weight of a clause.
{"query_string": {"query": "level:ERROR AND NOT message:\"Not interesting\""}} | where level=ERROR and message!="Not interesting" | Kibana-like queries can be used in a query_string query. Note the necessary escaping of quotes.

Always attempt to validate your proposed query by inserting it in a query template similar to that used by the Engine API. For example, use the following curl command to execute a search against your target Elasticsearch instance, replacing <index> with your target index, <type> with your target type, and <your-query> with your proposed query (e.g. "match_all": {}):

curl -X GET -H 'Content-Type: application/json' 'http://localhost:9200/<index>/<type>/_search' -d '{
    "query": {
      "bool": {
        "filter": [
          {<your-query>}
        ]
      }
    }
}'
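
For example, assuming a hypothetical index named logstash-2016.01.01 with type logs (substitute your own names), a term filter on the status field could be validated as follows:

# Hypothetical index and type names; substitute your own
curl -X GET -H 'Content-Type: application/json' 'http://localhost:9200/logstash-2016.01.01/logs/_search' -d '{
    "query": {
      "bool": {
        "filter": [
          {"term": {"status": "304"}}
        ]
      }
    }
}'

A successful response returns the matching documents in the hits section; an empty hits array suggests that the filter is too restrictive or that the field mapping does not match your expectations.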

An alternative way to validate your proposed query is to use the Discover tab in Kibana. Although the query syntax used by Kibana is based on the Lucene query syntax and differs from the syntax required for the Elasticsearch query, you can still paste the entire JSON object containing the query, as seen above, into the Kibana search bar.

The syntax for the query filter varies depending on the version of Elasticsearch from which you are pulling data. Check the documentation for the specific version of Elasticsearch being used as your data source.

The success of the query depends on the mappings configured for the fields in your data. For example, before using a wildcard query in the Scheduler query, it is good practice to confirm in the Kibana Discover tab that wildcards can be used for the intended field.
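
Alternatively, you can inspect the configured mappings directly via the Elasticsearch mapping API, replacing <index> with your target index:

# Returns the field mappings for the given index
curl -X GET 'http://localhost:9200/<index>/_mapping'

In Elasticsearch 2.x, string fields that are analyzed are tokenized at index time, which affects how term and wildcard queries match; fields mapped as not_analyzed are matched verbatim.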

Use an Elasticsearch indices query to execute a query against specific indices.
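
As an illustrative sketch (the index names and the term filter below are placeholders), an Elasticsearch 2.x indices query restricts the wrapped query to the listed indices, while the optional no_match_query parameter controls what is matched in all other indices:

{
  "indices": {
    "indices": ["<index-1>", "<index-2>"],
    "query": {"term": {"status": "304"}},
    "no_match_query": "none"
  }
}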

See Elasticsearch Query DSL for full details.

See Query string syntax for full details of the query language that can be used in query_string queries.
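
For illustration, the query string syntax also supports ranges and wildcards alongside the boolean operators; the field names below are example assumptions:

{"query_string": {"query": "status:[400 TO 499] AND NOT clientip:192.168.*"}}

This would match documents whose status lies between 400 and 499 inclusive, while excluding client IPs beginning with 192.168.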

Performance guidelines

The Scheduler is designed to minimize the load on the source Elasticsearch cluster by performing small, frequent queries. However, it is recommended to adopt a phased approach when selecting the data to be analyzed, in order to ensure minimal impact. Several factors affect the performance of the data extraction, including:

  • the volume of the data
  • the query used by the Scheduler
  • the configuration of the Elasticsearch cluster that hosts the data
  • the scrollSize parameter of the schedulerConfig
  • the available bandwidth if querying over a network

Long historical searches or large data volumes can result in either slow anomaly detection or a significant load on the Elasticsearch cluster being queried. It is recommended to analyze smaller subsets of data first, and to assess the load on the Elasticsearch source cluster and the Prelert server before proceeding to larger data analysis.

  • If your Scheduler query is complex or you have performance concerns, test the query using the Elasticsearch Search API or Kibana
  • Create a test job running over a subset of data
    • For example, if the data spans multiple indices, create a test job for a single index, or select a limited time period such as 100 buckets.
  • Start the job and monitor the performance
    • Observe the load on the hosting cluster
    • Observe the speed of the analysis (check the latest timestamp being processed)
  • If the performance is good, then re-create the job and run it over a longer (or the required) time period
  • If the performance appears slow, consider amending the scrollSize parameter (see the sketch below):
    • A higher value (default 1000) can speed up the data extraction; however, it may increase or decrease the load on the hosting cluster, depending on the cluster configuration and the available resources.
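
As a rough sketch, scrollSize is set alongside the other schedulerConfig fields when the job is created; the exact set of fields may vary with your Engine API version, and the index, type and query values below are placeholders:

"schedulerConfig": {
    "dataSource": "ELASTICSEARCH",
    "baseUrl": "http://localhost:9200",
    "indexes": ["<index>"],
    "types": ["<type>"],
    "query": {"match_all": {}},
    "scrollSize": 5000
}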

For further assistance please contact support@prelert.com.