Batch APIs
Batch APIs provide a set of endpoints that collectively allow you to process massive amounts of inputs, scenarios, or test cases against a Spark service using dedicated infrastructure specialized in parallel processing. Users can take advantage of Spark's batch processing scaling through:
Coherent's Modeling Center application.
Our upcoming SDKs with batch functionality.
Advanced users can orchestrate batch jobs manually using the API documentation below.
Batch pipeline architecture

In order to use the Batch APIs effectively, you need to understand how a batch works.
Batch has an input buffer: When a user submits data to a batch, we store it in the input buffer. The format in which a user submits a collection of inputs is called a chunk. We will discuss chunks in more detail in later sections.
The chunks API adds your chunks to the input buffer. If the input buffer is full, you will not be able to add any more chunks to it; in that case, the add chunks API will return an error.
Batch has a processing pipeline: As soon as you submit data to a batch, the data is stored in the input buffer. The batch processing pipeline immediately starts processing the data available in the input buffer and stores the resulting output in the output buffer.
The status API provides the status of the batch pipeline, along with how much space is left in the input buffer and the output buffer.
Output buffer: This is separate storage dedicated to storing the resulting outputs.
The chunkresults API fetches data from the output buffer. If the output buffer is full, the batch pipeline will stop processing data from the input buffer. You will have to use the get chunk results endpoint to empty the output buffer in order for the pipeline to resume processing.
How to: Run a batch job with APIs
Set up the Authorization.
Next, create a batch via the /batch endpoint. The response will contain a batch ID.
Using this ID, you can add chunks via the /batch/{batch_id}/chunks endpoint.
You can also check the batch status via /batch/{batch_id}/status.
Finally, use the /batch/{batch_id}/chunkresults endpoint to retrieve the results of the calculations. Note that if chunkresults calls fail to return, the chunks may contain too many records to process at a time.
You can then close the batch if you have no more data to send to the pipeline, or cancel the batch if there is a mistake or the batch is not working as expected.
You can also get information about a particular batch via /batch/{batch_id}, or about all batches via /batch/status.
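The steps above can be sketched in Python with only the standard library. This is a minimal orchestration sketch, not a definitive client: the base URL, token placeholder, and the `id` field name in the create-batch response are assumptions, and chunk payloads are taken as prebuilt objects.

```python
import json
import urllib.request

# Hypothetical values -- substitute your own Spark environment URL and token.
BASE_URL = "https://spark.example.com/api/v4"
HEADERS = {"Authorization": "Bearer <token>", "Content-Type": "application/json"}

def endpoint(batch_id=None, action=None):
    """Build the batch endpoint URLs used in the steps above."""
    url = f"{BASE_URL}/batch"
    if batch_id:
        url += f"/{batch_id}"
    if action:
        url += f"/{action}"
    return url

def call(url, body=None, method="GET"):
    """Send one JSON request and decode the JSON response."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, headers=HEADERS, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_batch(service, chunks):
    """Create a batch, add chunks, check status, and fetch results."""
    batch = call(endpoint(), {"service": service}, method="POST")
    batch_id = batch["id"]  # response field name assumed
    call(endpoint(batch_id, "chunks"), {"chunks": chunks}, method="POST")
    print(call(endpoint(batch_id, "status")))
    results = call(endpoint(batch_id, "chunkresults"))
    # When done, remember to close the batch (see "Close and cancel the batch").
    return results
```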
Sales example
In this how-to guide, we will use a service based upon the attached SalesExample. This Spark service calculates the total cost of goods using the formula total = price * quantity * (1 + tax). The values are rounded to 2 decimal places.
Our batch chunks will include the following information:
price and quantity to calculate the total cost of goods.
An additional field sales_id that is used to correlate each individual record. This is important for merging the batch dataset with the batch results.
In this example, we will use a tax rate of 10% for all sales. We will include this as a parameter so that this constant value does not need to be included in every batch record.
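The SalesExample formula can be reproduced directly, which is handy for sanity-checking batch outputs locally. A small sketch of the stated formula:

```python
def total_cost(price, quantity, tax=0.1):
    """total = price * quantity * (1 + tax), rounded to 2 decimal places."""
    return round(price * quantity * (1 + tax), 2)

print(total_cost(100, 2))    # 2 units at 100 with 10% tax -> 220.0
print(total_cost(19.99, 3))  # -> 65.97
```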
Authorization
Bearer {token}, accessible from Authorization - Bearer token or systematically via Client Credentials.
The request headers should include a key for Authorization with the value Bearer {token}.
API key created from Authorization - API keys.
The request headers should include the keys x-synthetic-key and x-tenant-name with the values of the API key and the tenant name respectively.
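The two header schemes above can be built as plain dictionaries. A small sketch (token, key, and tenant values are placeholders):

```python
def bearer_headers(token):
    """Scheme 1: Bearer token in the Authorization header."""
    return {"Authorization": f"Bearer {token}"}

def api_key_headers(api_key, tenant):
    """Scheme 2: API key plus tenant name headers."""
    return {"x-synthetic-key": api_key, "x-tenant-name": tenant}

print(bearer_headers("<token>"))
print(api_key_headers("<api-key>", "<tenant>"))
```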
POST batch job
Returns: Response from GET the batch pipeline status.
Path parameters
tenant *
Tenant is part of your Log in to Spark URL and also available in the User menu.
Request body
Content-Type: application/json
service *
URI or service_id of the service being called.
Example 1: stocks/NVDA
Example 2: stocks/NVDA[1.4.3]
Example 3: stocks/NVDA[1.4] takes the latest version starting with 1.4.
Example 4: stocks/NVDA[1] takes the latest version starting with 1.
Example 5: /folders/stocks/services/NVDA
Example 6: a5e3f03a-57ca-4889-adae-0630be54bd87
output
Array of strings to denote the outputs to keep in the results. The strings can also contain regular expressions.
Example 1: ["value_*", "valuation_by_*"]
If you are running simulations and only looking for aggregate results, this should be omitted. See Define aggregations using data.summary.
Example: ["total"]
unique_record_key
The column name in your inputs that can uniquely identify the input records. This does not need to be an Xinput in the Spark service! If this value is provided, the same column will be echoed back in the outputs, which can then be used to correlate inputs with outputs.
This is especially important because of the asynchronous nature of the batch process: chunks may not necessarily run or complete in the order they were submitted.
If you are running simulations and only looking for aggregate results, this should be omitted. See Define aggregations using data.summary.
Example: "sales_id"
...
The additional parameters of this API align with those defined for Request body, except that inputs are not provided in this step.
Set performance parameters for the batch
Additional advanced batch settings. You can find the settings that were used for a batch via the Get batch information endpoint.
max_workers
Maximum number of workers to allocate to this batch job. This can be used to reduce the default allocation to allow more simultaneous jobs.
runner_thread_count
The number of threads to run at the same time in a Lambda. Not recommended to adjust this figure from its default of 1.
chunks_per_thread
Number of chunks processed per each thread in the Lambda at the same time. Not recommended to adjust this figure from its default of 1.
max_input_in_mb
Define the size of the maximum input buffer we can receive for a batch (MB). This can be used to reduce the default allocation to allow more simultaneous jobs.
max_output_in_mb
Define the size of the maximum output buffer we can store for a batch (MB). This can be used to reduce the allocation for a batch such that more simultaneous batches can be run.
acceptable_error_percentage
The acceptable percentage of chunks that can fail while the batch is still considered successful. The default is 0, meaning no errors are accepted.
Example: 10 means 10% of chunks can fail and the batch is still considered successful.
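Putting the parameters above together, a create-batch request body might look like the following sketch. The service URI and tuning values are illustrative, and it is assumed (from the descriptions above) that the performance settings sit at the top level of the body alongside service:

```python
# Illustrative values only -- adjust the service URI and tuning numbers.
create_batch_body = {
    "service": "sales/SalesExample",    # URI of the Spark service (hypothetical)
    "output": ["total"],                # keep only the `total` Xoutput
    "unique_record_key": "sales_id",    # echoed back to correlate records
    "max_workers": 4,                   # optional performance settings
    "acceptable_error_percentage": 10,  # up to 10% of chunks may fail
}

print(create_batch_body)
```

Note that inputs are not part of this body; chunks are added in a later step.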
Sample request
Sample response
HTTP 200 OK Content-Type: application/json
POST chunks to the batch pipeline
Returns: Response from GET the batch pipeline status.
If chunks are consistently failing to execute, consider reducing the size of the chunks sent to the batch pipeline. It is highly recommended to test batch jobs thoroughly and use an appropriate batch size.
Users submit data in the form of a collection of inputs called a chunk. This endpoint allows you to add chunks to your batch. You may want to check the available input buffer before adding chunks to a batch; if there is no space in the input buffer, this API will return an error.
Data can be sent in both uncompressed JSON and compressed Binary JSON (BSON).
Path parameters
tenant *
Tenant is part of your Log in to Spark URL and also available in the User menu.
batchId *
id from POST batch job.
Request body
Content-Type: application/json or Content-Type: application/bson
chunks *
Array of JSON chunk objects.
Define data and settings for a chunk
id *
You must generate a universally unique identifier (UUID) for each chunk. This is used to associate a chunk's inputs with the chunkresults outputs.
data *
Object to store inputs, parameters, and summary.
data.inputs *
inputs should contain the input records that you want to apply against the Spark service. These should correspond to the Xinputs on the Spark service. The dataset may contain multiple records, which will be processed by the batch.
Data should conform to the formats described in Request body JSON array format.
data.parameters
Xinput values that stay the same across all dataset records can instead be provided as parameters.
Parameters are a common data set shared by all the inputs in the chunk. This eliminates the need to send repeated data and reduces the size of the chunk.
Example: a tax rate of 0.1 that will be used with each input record.
size
The total number of records in the chunk. This is needed when using BSON, as the number of rows cannot be quickly determined.
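A chunk for the sales example can be assembled as follows. This is a sketch: the id, data, inputs, parameters, and size fields come from the descriptions above, while the exact record shape (an array of objects keyed by Xinput name) and the field values are assumptions based on the sales example.

```python
import uuid

# Two illustrative sales records; sales_id is the unique_record_key.
records = [
    {"sales_id": "S-001", "price": 100.0, "quantity": 2},
    {"sales_id": "S-002", "price": 19.99, "quantity": 3},
]

chunk = {
    "id": str(uuid.uuid4()),         # a fresh UUID per chunk
    "data": {
        "inputs": records,           # the records to run through the service
        "parameters": {"tax": 0.1},  # shared by every record in the chunk
    },
    "size": len(records),            # record count, required for BSON
}

print(chunk["size"])
```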
Define aggregations using data.summary
The data.summary object defines aggregations that will be applied to each chunk. The summary aggregations will be returned in an object called summary_outputs.
This is useful when the batch is being used for the purpose of running simulations where the aggregation of the batch results are more useful than the individual record outputs.
If the unique_record_key and output parameters in Batch APIs are not provided, then GET the chunk results will not return the individual records along with the aggregation.
ignore_error
When set to false, this is analogous to Excel, where SUM(0,1,2,#N/A) = #N/A.
When set to true, records that return an error in the output will not have an effect on the aggregation.
The default value is false.
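The two ignore_error behaviors can be illustrated with a small local function. This only mimics the described semantics for a SUM; the names and the "#N/A" marker are illustrative:

```python
def summarize(values, ignore_error=False):
    """Sum numeric outputs; error markers either poison the result or are skipped."""
    errors = [v for v in values if not isinstance(v, (int, float))]
    if errors and not ignore_error:
        return "#N/A"  # like Excel: SUM(0,1,2,#N/A) = #N/A
    return sum(v for v in values if isinstance(v, (int, float)))

print(summarize([0, 1, 2, "#N/A"]))                     # -> #N/A
print(summarize([0, 1, 2, "#N/A"], ignore_error=True))  # -> 3
```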
aggregation
An array of aggregate instruction objects.
aggregation.output_name
The name of the Xoutput from your Spark service on which you want to apply the aggregate function.
aggregation.operator
The name of the aggregate operator. Currently, only SUM is supported.
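For the sales example, a data.summary object combining the fields above might look like this sketch (the output name "total" is taken from the example; the value choices are illustrative):

```python
summary = {
    "ignore_error": True,  # skip errored records instead of failing the SUM
    "aggregation": [
        {"output_name": "total", "operator": "SUM"},  # only SUM is supported
    ],
}

print(summary)
```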
Sample data.summary object
Sample request
This request sends 2 chunks to the batch without data.summary.
Response
HTTP 200 OK Content-Type: application/json
Returns the response from GET the batch pipeline status.
GET the batch pipeline status
Returns: Response status object including the number of records_available to download and the remaining input_buffer_remaining_bytes and output_buffer_remaining_bytes.
Path parameters
tenant *
Tenant is part of your Log in to Spark URL and also available in the User menu.
batchId *
id from Batch APIs.
Sample request
Sample response
HTTP 200 OK Content-Type: application/json
GET the chunk results
Returns: Response from GET the batch pipeline status and the completed outputs.
If chunks are consistently failing to execute, consider reducing the size of the chunks sent to the batch pipeline in POST chunks to the batch pipeline. It is highly recommended to test batch jobs thoroughly and use an appropriate batch size.
Path parameters
tenant *
Tenant is part of your Log in to Spark URL and also available in the User menu.
batchId *
id from POST batch job.
Query parameters
max
Maximum number of chunks to be returned as part of a chunkresults call. This cannot exceed 100. The default is 50.
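Because the output buffer halts the pipeline when full, results are typically fetched in a loop until the buffer is drained. A sketch of that pattern, with the HTTP call abstracted behind a `fetch` callable (hypothetical; it stands in for GET /batch/{batch_id}/chunkresults?max=N and is assumed to return an empty list once drained):

```python
def drain_results(fetch, page_size=50):
    """Keep fetching chunk results until the output buffer is empty."""
    collected = []
    while True:
        page = fetch(page_size)  # one chunkresults call, at most page_size chunks
        if not page:
            break
        collected.extend(page)
    return collected

# Usage with a fake fetcher that serves two pages and then runs dry:
pages = [[{"id": "c1"}], [{"id": "c2"}], []]
print(drain_results(lambda n: pages.pop(0)))  # -> [{'id': 'c1'}, {'id': 'c2'}]
```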
Sample request
Sample response
HTTP 200 OK Content-Type: application/json
If there are no output records available to download, then the response from GET the batch pipeline status will be returned.
If the POST chunks to the batch pipeline includes Define aggregations using data.summary then:
The data.summary_output object will be populated.
Furthermore, if in POST batch job the parameters unique_record_key and output are not provided, the outputs object will contain [] for the individual records.
Aggregations will be returned with the output name suffixed with _ and the aggregation operator.
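Since chunks can complete out of order, the echoed unique_record_key is what lets you reattach outputs to the original dataset. A sketch of that merge, with illustrative records from the sales example:

```python
inputs = [
    {"sales_id": "S-001", "price": 100.0, "quantity": 2},
    {"sales_id": "S-002", "price": 50.0, "quantity": 1},
]
# Results may arrive out of order; sales_id is echoed back with each output.
outputs = [
    {"sales_id": "S-002", "total": 55.0},
    {"sales_id": "S-001", "total": 220.0},
]

by_key = {row["sales_id"]: row for row in outputs}
merged = [{**rec, **by_key[rec["sales_id"]]} for rec in inputs]
print(merged[0])  # -> {'sales_id': 'S-001', 'price': 100.0, 'quantity': 2, 'total': 220.0}
```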
Close and cancel the batch
Returns: Response from GET the batch pipeline status.
Close batch: If you no longer have any data to add to the batch, you can close it. After closing a batch, the batch will still process the data, and the user will be able to download the remaining output from the get chunk results API.
Cancel batch: When the batch is not working as expected, or you have made a mistake, you can cancel the batch to immediately stop further processing. You won't be able to download any more data after cancelling a batch.
If a batch is not closed or cancelled, the batch times out after 30 minutes. This will release the input and output buffers.
Path parameters
tenant *
Tenant is part of your Log in to Spark URL and also available in the User menu.
batchId *
id from POST batch job.
Request body
Content-Type: application/json
Include this line in the JSON body to close the batch after all inputs have been submitted.
Include this line in the JSON body to cancel the batch at any time. Spark will stop processing inputs immediately.
Sample request
Sample response
HTTP 200 OK Content-Type: application/json
Get batch information
Returns: Detailed information about the batch.
Path parameters
tenant *
Tenant is part of your Log in to Spark URL and also available in the User menu.
batchId *
id from POST batch job.
Sample request
Sample response
HTTP 200 OK Content-Type: application/json
Get batch status across the tenant
Path parameters
tenant *
Tenant is part of your Log in to Spark URL and also available in the User menu.
This endpoint provides information about batches that are in_progress or recently_completed within the past 1 hour.
If you are a supervisor:pf user, you will be able to see all batches run by users within your tenant. Otherwise, you will only see information about the batches that you initiated yourself.
The environment object is a work in progress and may change in future iterations.
Sample request
Sample response
HTTP 200 OK Content-Type: application/json
Reference
batch_status values
created, pending, in_progress, closed, closed_by_timeout, completed, completed_by_timeout, failed, cancelled.
Manage batch performance
For each tenant, there is an allocation of server "workers" that are used to accommodate batch requests. After a batch is started:
There is an initial number of workers that are allocated for the batch job.
For each batch job, the number of workers scales depending on the available workers on the tenant. This is limited by:
The total allocation of workers on the tenant.
The maximum number of workers used for a particular job, which can be overridden in POST batch job using the max_workers parameter. This enables more concurrent batch jobs to be run, but at a lower rate of completion.
After a batch has been closed, there is a cool-off period of 2 min before the buffer is released for any additional batch jobs.
It is very important to understand the Batch pipeline architecture when sending large volumes of data. Each tenant has a total maximum buffer allocation. Each job also has a limited input buffer and output buffer.
A batch job requires the total tenant buffer amount to have at least 110% of (input buffer + output buffer) available to start.
If the output batch buffer for the tenant is full, Spark will also reject any new chunks that are submitted.
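The 110% headroom rule can be checked with simple arithmetic. A sketch (the MB figures are illustrative, and it is assumed the rule applies to the sum of the two per-job buffers as stated above):

```python
def can_start(tenant_free_mb, input_mb, output_mb):
    """A batch needs at least 110% of (input + output) buffer free on the tenant."""
    return tenant_free_mb >= 1.1 * (input_mb + output_mb)

print(can_start(1000, 500, 400))  # needs 990 MB -> True
print(can_start(900, 500, 400))   # needs 990 MB -> False
```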
The batch may experience significant slowdowns when: 1) the input batch buffer is full, 2) a lot of chunks have completed, and 3) those chunks cannot be written to the output batch buffer because it is also full.
Use the API to GET the batch pipeline status, Get batch information, and Get batch status across the tenant to monitor the available buffer levels.
Whenever a batch is completed, use Close and cancel the batch in order to release the buffer allocation.