Data Endpoint

Overview

Every job has a data endpoint for uploading data:

http://localhost:8080/engine/v2/data/<jobId>

Important

The Engine API can only accept data from a single connection. Do not attempt to access the data endpoint from different threads at the same time. Use a single connection synchronously to send data to, close, flush, or delete a single job.

The Engine API starts processing streamed data immediately, but the server will not send a response to the POST request until all the data has been copied into the Engine. If the upload is successful, the server will respond with an HTTP 202 status code. At the time the server responds the Engine may not have finished processing all the data sent, but it is ready to accept more. A successful POST operation will not return until the API is ready to accept more data.

If the client and server are geographically separate and share an unreliable connection, or if your data is in a very large file, it is prudent to stream the data in chunks; then, if an upload fails, only a single chunk needs to be resent. The Web Service can handle gzipped data, but you cannot split a gzipped file into chunks and upload them piecemeal, as the entire file has to be sent in one message. See Using compressed data for details.
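A minimal sketch of a chunked upload, using the standard split utility and cURL (the chunk size and file names are illustrative only; for delimited data, see the Data format section for how the header line should be handled):

split -l 100000 test_data.csv chunk_
for f in chunk_*; do
  # POST each chunk in order; stop at the first failure so that chunk can be resent.
  curl -X POST -T "$f" http://localhost:8080/engine/v2/data/<jobId> || break
done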

There are many HTTP clients you can use for POSTing data to the Engine API. For example, to stream a data file to the API using the cURL command line client, the command is of the form:

curl http://localhost:8080/engine/v2/data/<jobId> --data-binary @test_data.csv

The --data-binary option is used to preserve line endings in the source data, whilst the @ symbol tells cURL that what follows is a filename.

An alternative is:

curl -X POST -T test_data.csv http://localhost:8080/engine/v2/data/<jobId>

This second form causes cURL to use less memory, because --data-binary loads the entire file into memory before sending it. Note that with -T you must ensure the URL does not end with a slash, as this would cause cURL to append the filename to the URL. It is also necessary to explicitly specify that a POST is to be used, as cURL defaults to PUT when -T is given.

It is possible to upload data to multiple jobs simultaneously by appending a comma-separated list of job IDs to the data endpoint. The Engine will duplicate the data and forward it to each job. The POST operation will not return until all of the jobs have completed. This is useful if you wish to analyze the same data using different bucketSpans or different transforms. For example:

curl -X POST -T test_data.csv http://localhost:8080/engine/v2/data/<jobId1>,<jobId2>,<jobId3>

Once the last of the data has been uploaded, send a POST to the Close Data endpoint to persist the internal model state. A closed job can be restarted at any time, as described in Closing data.

curl -X POST http://localhost:8080/engine/v2/data/<jobId>/close

Data format

The format of the data should match the definition in the data description object. All data, whether JSON, DELIMITED or SINGLE_LINE, must be UTF-8 encoded. If using cURL to upload data, always use the --data-binary option rather than -d in order to preserve newline characters.

JSON

Records in JSON format may be uploaded either as an array of objects with each object storing one input record, or as a stream of single object JSON documents.

If the input records are objects in a JSON array, the upload must start with '[' and end with ']', and all elements of the array must be separated by commas (as required by the JSON grammar). For example:

[
{"sourcetype": "farequote", "timestamp": "1359331200", "airline": "AAL", "responsetime": "132.2046"},
{"sourcetype": "farequote", "timestamp": "1359331200", "airline": "JZA", "responsetime": "990.4628"},
{"sourcetype": "farequote", "timestamp": "1359331200", "airline": "JBU", "responsetime": "877.5927"}
]

Whitespace outside field names/values is not significant.

The alternative format is a plain list of JSON documents:

{"sourcetype": "farequote", "timestamp": "1359331200", "airline": "AAL", "responsetime": "132.2046"}
{"sourcetype": "farequote", "timestamp": "1359331200", "airline": "JZA", "responsetime": "990.4628"}
{"sourcetype": "farequote", "timestamp": "1359331200", "airline": "JBU", "responsetime": "877.5927"}

Note that there is no "," between documents. The overall text uploaded is not a single valid JSON document in this case, but this makes it much easier to stream data without having to build the entire upload as a single string. The Engine API parses a single JSON document, then attempts to parse another JSON document starting from the character after the first one ended, and repeats this until the data stream ends.

This format does not require newline characters between the objects: having them all on a single line is equally acceptable:

{"sourcetype": "farequote", "timestamp": "1359331200", "airline": "AAL", "responsetime": "132.2046"}{"sourcetype": "farequote", "timestamp": "1359331200", "airline": "JZA", "responsetime": "990.4628"}{"sourcetype": "farequote", "timestamp": "1359331200", "airline": "JBU", "responsetime": "877.5927"}

As with the first format, whitespace outside field names and values is not significant.
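If the source data is already a single JSON array, one convenient way to produce this streaming form is with the jq command line utility (jq is a separate tool, not part of the Engine API, and this pipeline is only a sketch):

# Emit each element of the top-level array as a compact single-line JSON document
# and stream the output straight to the data endpoint.
jq -c '.[]' test_data.json | curl -X POST -T - http://localhost:8080/engine/v2/data/<jobId>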

In all cases, field values that are numbers may be encoded as JSON strings (as in the examples above) or as JSON numbers, for example:

{"sourcetype": "farequote", "timestamp": 1359331200, "airline": "AAL", "responsetime": 132.2046}
{"sourcetype": "farequote", "timestamp": 1359331200, "airline": "JZA", "responsetime": 990.4628}
{"sourcetype": "farequote", "timestamp": 1359331200, "airline": "JBU", "responsetime": 877.5927}

Delimited

If the source is in a delimited format, the first line must be a header, and every line must end in a newline character, optionally preceded by a carriage return.

Provided the job has not been closed, it is not strictly necessary to include the header in subsequent uploads. This makes it possible to split a large CSV file into multiple chunks, POST them consecutively, and send the header only in the first chunk. However, this requires the job to remain open, so as best practice we advise including the header in every chunk.

If a field is not enclosed by double-quotes, then whitespace next to the delimiter is considered part of the data.
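For example, assuming a comma is configured as the delimiter in the data description, the CSV equivalent of the JSON records shown earlier would be:

sourcetype,timestamp,airline,responsetime
farequote,1359331200,AAL,132.2046
farequote,1359331200,JZA,990.4628
farequote,1359331200,JBU,877.5927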

SINGLE_LINE

If the source is unstructured but records are contained within single lines, then the SINGLE_LINE data format can be used. Every line must end in a newline character, optionally preceded by a carriage return. This format can only be used in combination with transforms, which must be used to identify the timestamp. See Example 2: Unstructured data source.
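For example, raw log lines such as the following (a purely hypothetical application log; the layout is illustrative only) could be uploaded as-is, with transforms extracting the timestamp:

2013-01-28 00:00:00 INFO airline=AAL responsetime=132.2046
2013-01-28 00:00:00 INFO airline=JZA responsetime=990.4628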

Flushing data

Flushing a job causes the Engine service to ensure that no data is sitting in buffers. The flush call blocks until processing is complete for all previously uploaded data, and any results are ready to query.

Flushing a closed job has no effect.

An empty POST message should be sent to this endpoint at a time when data is not being uploaded by a different thread or process. The server will respond with an HTTP 200 status code once all data uploaded prior to the flush has been processed. For example, with the cURL command line client run the command:

curl -X POST http://localhost:8080/engine/v2/data/<jobId>/flush

Optionally the flush command can request that interim results be created for the most recent bucket for which data has been uploaded. (Usually results are not calculated for a bucket until the first piece of data for the subsequent bucket is seen.) To request that interim results be calculated, specify the calcInterim=true argument. For example:

curl -X POST http://localhost:8080/engine/v2/data/<jobId>/flush?calcInterim=true

Specifying a calcInterim=false argument is equivalent to not specifying any calcInterim argument. Note that because flushing a closed job has no effect, it is not possible to calculate interim results for a closed job. If you need interim results for the last bucket for which data was uploaded, be sure to request calculation of interim results before the job automatically closes due to inactivity.

Closing data

Closing a job causes the Engine service to run housekeeping tasks, to save the internal models and to flush the results buffer.

An empty POST message should be sent to this endpoint once all data has been uploaded, and the server will respond with an HTTP 202 status code. For example, with the cURL command line client run the command:

curl -X POST http://localhost:8080/engine/v2/data/<jobId>/close

When closing a job, its internal models are persisted. Depending on the size of the job, closing could take several minutes, and restarting could take a similar amount of time.

If the data endpoint is inactive for 10 minutes, the job will be automatically closed.

Closed jobs can be restarted at any time simply by posting fresh data to the Send Data endpoint. The API will discover that there is persisted state for that job and restore the internal models to that state before the engine processes the new data.

Using compressed data

If the data is gzip compressed, the HTTP Content-Encoding header must be set to 'gzip':

Content-Encoding: gzip

For example, the cURL command to upload a compressed CSV file test_data.csv.gz would be:

curl -H 'Content-Encoding: gzip' http://localhost:8080/engine/v2/data/<jobId> --data-binary @test_data.csv.gz

The Engine API will automatically decompress any data sent with this header. Gzipped files cannot be broken into smaller chunks and sent piecemeal; the entire file must be uploaded in a single POST.
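If the file is not already compressed, the standard gzip utility can be used to create it, for example:

gzip -c test_data.csv > test_data.csv.gz

The -c option writes the compressed output to stdout, leaving the original file untouched.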

Timestamps

Each record in the data must contain a timestamp. By default the API will look for a field named time containing the timestamp as seconds since the Epoch. If this field is not present in the data, the Data Description must provide the name of the field containing the timestamp and a format string describing how to parse it.
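For example, with the default settings a record such as the following would be accepted without any further timestamp configuration (the other fields simply mirror the earlier examples):

{"time": "1359331200", "airline": "AAL", "responsetime": "132.2046"}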

Chronological ordering

The Engine API prefers time series data to be in ascending chronological order. If the order of the data cannot be guaranteed, a latency window can be specified in the job analysis configuration. Further notes on handling out-of-sequence data are available.

Where latency has not been specified and data is streamed in chunks with multiple POSTs, this ordering must be maintained, with the earliest data in the first POST. If a job is restarted after being finished and persisted, the new data must be temporally ordered after the last data processed by the job before it was finished.

These conditions apply to all data regardless of the format.