# Batch APIs

The Batch APIs are a set of endpoints that together let you process massive numbers of inputs, scenarios, or test cases on a Spark service using dedicated infrastructure specialized in parallel processing. This is most easily done using the Python [SDKs and tools](/spark-developer/sdks-and-tools.md).

For advanced users, the pipeline can be orchestrated manually using the API documentation below.

## Batch pipeline architecture

<figure><img src="/files/7lXwfkJbFwvnGaIP0ao0" alt=""><figcaption></figcaption></figure>

To use the Batch APIs effectively, it helps to understand how a batch works.

1. Batch has an **input buffer**: when you submit data to a batch, it is stored in the input buffer. The format in which a collection of inputs is submitted is called a **chunk**; chunks are discussed in more detail in the sections below.
   * The `chunks` API adds your chunks to the input buffer.
   * If the input buffer is full, no further chunks can be added, and the `chunks` API returns an error.
2. Batch has a **processing pipeline**: as soon as data lands in the input buffer, the pipeline starts processing it and stores the resulting output in the output buffer.
   * The `status` API reports the status of the batch pipeline, along with how much space is left in the input and output buffers.
3. **Output buffer**: separate storage dedicated to the resulting outputs.
   * The `chunkresults` API fetches data from the output buffer.
   * If the output buffer is full, the pipeline stops processing data from the input buffer. Use the `chunkresults` endpoint to empty the output buffer so that the pipeline can resume processing.
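The back-pressure behavior above can be sketched as a small check that a serialized chunk fits in the remaining input buffer before it is submitted. This is a minimal illustration: `chunk_fits` is a hypothetical helper, while `input_buffer_remaining_bytes` follows the field name in the `status` response shown later on this page.

```python
import json

def chunk_fits(status: dict, chunk: dict) -> bool:
    """Return True if the serialized chunk fits in the remaining input buffer."""
    chunk_bytes = len(json.dumps(chunk).encode("utf-8"))
    return chunk_bytes <= status["input_buffer_remaining_bytes"]

status = {"input_buffer_remaining_bytes": 70_000_000}
chunk = {
    "id": "f7b52961-ad34-40ce-874f-93c67df11d65",
    "data": {"inputs": [["sale_id", "price", "quantity"], [1, 20, 65]]},
}
print(chunk_fits(status, chunk))  # True: a small chunk easily fits
```

In a real client you would call the `status` endpoint, run a check like this, and wait (or drain the output buffer) before submitting more chunks.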

## How to: Run a batch job with APIs

1. Set up the [#authorization](#authorization "mention").
2. Create a batch via the `/batch` endpoint.
3. The response contains a batch ID. Using this ID, add chunks via the `/batch/{batch_id}/chunks` endpoint.
4. Check progress via the `/batch/{batch_id}/status` endpoint.
5. Use the `/batch/{batch_id}/chunkresults` endpoint to retrieve the results of the calculations.
6. Note that if `chunkresults` calls fail to return, the chunks may contain too many records to process at a time.
7. Close the batch once you have no more data to send to the pipeline, or cancel it if there is a mistake or the batch is not working as expected.
8. You can also get information about a particular batch via `/batch/{batch_id}` or all batches via `/batch/status`.
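The steps above can be sketched as a small path builder for the endpoints involved. Only the endpoint paths come from this page; the helper itself is hypothetical and omits the HTTP calls and headers.

```python
def batch_endpoint(tenant, batch_id=None, action=None):
    """Build the v4 batch endpoint path for a given tenant, batch, and action."""
    path = f"/{tenant}/api/v4/batch"
    if batch_id:
        path += f"/{batch_id}"
    if action:
        path += f"/{action}"
    return path

print(batch_endpoint("mytenant"))                           # step 2: POST to create
print(batch_endpoint("mytenant", "4d3a06ea", "chunks"))     # step 3: POST chunks
print(batch_endpoint("mytenant", "4d3a06ea", "status"))     # step 4: GET status
print(batch_endpoint("mytenant", "4d3a06ea", "chunkresults"))  # step 5: GET results
```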

### Sales example

In this how-to guide, we will use a service based on the attached `SalesExample`. This Spark service calculates the total cost of goods with the formula `total = price * quantity * (1 + tax)`, rounded to `2` decimal places.

Our batch chunks will include the following information:

* `price` and `quantity` to calculate the total cost of goods.
* An additional field `sales_id` that is used to correlate each individual record. This is important for merging the batch dataset with the batch results.
* A `tax` rate of `10%` for all sales, supplied as a `parameter` so that this constant does not need to be repeated in every batch record.
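Before running the batch, you can sanity-check the expected outputs locally by replicating the formula. This is a sketch for verification only; `sales_total` is our own helper, not part of the service.

```python
def sales_total(price, quantity, tax=0.1):
    """Replicate the SalesExample formula: total = price * quantity * (1 + tax),
    rounded to 2 decimal places."""
    return round(price * quantity * (1 + tax), 2)

print(sales_total(20, 65))  # 1430.0
print(sales_total(74, 73))  # 5942.2
```

These values match the `total` outputs shown in the sample chunk results later on this page.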

{% file src="/files/sXd2wtIVXrbPPJoQAGlK" %}

## Authorization

* `Bearer {token}` accessible from [Authorization - Bearer token](/spark-apis/authorization-bearer-token.md) or systematically via [Client Credentials](/identity-and-access-management/client-credentials.md).
  * The request headers should include a key for `Authorization` with the value `Bearer {token}`.
* API key created from [Authorization - API keys](/spark-apis/authorization-api-keys.md).
  * The request headers should include the keys `x-synthetic-key` and `x-tenant-name` with the values of the API key and tenant name respectively.
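Either scheme can be expressed as a small header builder. Only the header names come from this page; the helper itself is a hypothetical convenience for illustration.

```python
def auth_headers(bearer_token=None, api_key=None, tenant_name=None):
    """Build request headers for either authorization scheme described above."""
    if bearer_token is not None:
        return {"Authorization": f"Bearer {bearer_token}"}
    return {"x-synthetic-key": api_key, "x-tenant-name": tenant_name}

print(auth_headers(bearer_token="abc123"))
print(auth_headers(api_key="my-key", tenant_name="mytenant"))
```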

## `POST` batch job

Returns: Response from [#get-the-batch-pipeline-status](#get-the-batch-pipeline-status "mention").

{% code overflow="wrap" %}

```shellscript
POST /{tenant}/api/v4/batch
```

{% endcode %}

### Path parameters

<table><thead><tr><th width="374">Key</th><th>Value</th></tr></thead><tbody><tr><td><code>tenant</code> *</td><td>Tenant is part of your <a data-mention href="/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark">/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark</a> URL and also available in the  <a data-mention href="/pages/ylWjjoVBLOcB7JZks4Cp#user-menu">/pages/ylWjjoVBLOcB7JZks4Cp#user-menu</a>.</td></tr></tbody></table>

### Request body

&#x20;`Content-Type: application/json`

| Name                | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `service` \*        | <p>URI or <code>service\_id</code> of the service being called.<br>Example 1: <code>stocks/NVDA</code></p><p>Example 2: <code>stocks/NVDA\[1.4.3]</code></p><p>Example 3:  <code>stocks/NVDA\[1.4]</code> take the latest version starting with <code>1.4.</code></p><p>Example 4: <code>stocks/NVDA\[1]</code> take the latest version starting with <code>1.</code></p><p>Example 5: <code>/folders/stocks/services/NVDA</code><br>Example 6: <code>a5e3f03a-57ca-4889-adae-0630be54bd87</code></p>                                                                                                                                                                                                                                                               |
| `output`            | <p>Array of strings to denote the <code>output</code>s to keep in the results. The strings can also contain regular expressions.<br>Example 1: <code>\["value\_*", "valuation\_by\_*"]</code></p><p><br>If you are running simulations and only looking for aggregate results, this should be omitted. See <a data-mention href="#define-aggregations-using-data.summary">#define-aggregations-using-data.summary</a>.<br>Example: <code>\["total"]</code></p>                                                                                                                                                                                                                                                                                                      |
| `unique_record_key` | <p><code>unique\_record\_key</code> is the name of a column in your inputs that uniquely identifies the input records. This does not need to be an <code>Xinput</code> in the Spark service! If this value is provided, the same column will be echoed back in the outputs, which can then be used to correlate inputs with outputs.<br>This is especially important because, given the asynchronous nature of the batch process, chunks may not run or complete in the order they were submitted.<br><br>If you are running simulations and only looking for aggregate results, this should be omitted. See <a data-mention href="#define-aggregations-using-data.summary">#define-aggregations-using-data.summary</a>.<br>Example: <code>"sales\_id"</code></p> |
| ...                 | The additional parameters of this API align with those defined for [Execute API (v4)](/spark-apis/execute-api/execute-api-v4.md#request-body) except `inputs` are not provided in this step.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |

#### Set performance parameters for the batch

Additional advanced batch settings. You can find the settings used for a batch via the [#get-batch-information](#get-batch-information "mention") endpoint.

| Name                          | Description                                                                                                                                                                                                                                                                                                                |
| ----------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `max_workers`                 | Maximum number of workers to allocate to this batch job. This can be used to reduce the default allocation to allow more simultaneous jobs.                                                                                                                                                                                |
| `runner_thread_count`         | The number of threads to run at the same time in a Lambda. Not recommended to adjust this figure from its default of `1`.                                                                                                                                                                                                  |
| `chunks_per_thread`           | Number of chunks processed per each thread in the Lambda at the same time. Not recommended to adjust this figure from its default of `1`.                                                                                                                                                                                  |
| `max_input_in_mb`             | Define the size of the maximum input buffer we can receive for a batch (`MB`). This can be used to reduce the default allocation to allow more simultaneous jobs.                                                                                                                                                          |
| `max_output_in_mb`            | Define the size of the maximum output buffer we can store for a batch (`MB`). This can be used to reduce the allocation for a batch such that more simultaneous batches can be run.                                                                                                                                        |
| `acceptable_error_percentage` | <p>The acceptable percentage of failed chunks for the batch to still be considered successful. The default is <code>0</code>, where no errors are accepted.<br><br>Example: <code>10</code> means <code>10%</code> of chunks can fail and the batch is still considered successful.</p> |

### Sample request

```sh
curl --location 'https://excel.myenvironment.coherent.global/mytenant/api/v4/batch' \
--header 'Accept-Encoding: gzip' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer {token}' \
--data '{
    "service": "myfolder/SalesExample",
    "unique_record_key": "sales_id",
    "source_system": "Analysis server",
    "correlation_id": "89f4befd-91fe-4414-a33f-ea0911586fe2",
    "call_purpose": "Quarterly reporting"
}'
```

### Sample response

`HTTP 200 OK` `Content-Type: application/json`

```json
{
    "object": "batch",
    "id": "4d3a06ea-dda4-458c-9036-423a2b74e5cd",
    "data": {
        "service_id": "ec7932a3-3e60-43d0-bd84-704cd4e94ff7",
        "version_id": "ee6849b3-d7c0-44b4-b554-fe55fc128f8f",
        "compiler_version": "Neuron_v1.19.0",
        "correlation_id": "89f4befd-91fe-4414-a33f-ea0911586fe2",
        "source_system": "Analysis server",
        "unique_record_key": null,
        "response_timestamp": "2024-06-26T03:09:13.829Z",
        "batch_status": "created",
        "created_by": "myuser@mydomain.com",
        "created_timestamp": "2024-06-26T03:09:13.695Z",
        "updated_timestamp": "2024-06-26T03:09:13.707Z",
        "service_uri": "myfolder/SalesExample[0.2.0]"
    }
}
```

## `POST` chunks to the batch pipeline

Returns: Response from [#get-the-batch-pipeline-status](#get-the-batch-pipeline-status "mention").

{% code overflow="wrap" %}

```shellscript
POST /{tenant}/api/v4/batch/{batchId}/chunks
```

{% endcode %}

{% hint style="info" %}
If chunks are consistently failing to execute, consider reducing the size of the chunks sent to the batch pipeline. It is highly recommended to test batch jobs thoroughly and use an appropriate batch size.
{% endhint %}

You submit data as a collection of inputs called a chunk. Each chunk contains the records to submit for calculation, and this endpoint adds chunks to your batch. Check the available input buffer space using [#get-the-batch-pipeline-status](#get-the-batch-pipeline-status "mention") before adding chunks; if there is insufficient space, the chunks will be rejected.

The way chunks are assembled has a large impact on the performance of the batch pipeline. The batch size is the number of records included in a chunk.

* Always use a consistent batch size for each chunk. This makes it easier for the batch pipeline to assess the time it takes to complete a chunk and how to allocate work to the pipeline.
* Larger batch sizes can lead to improved performance due to fewer I/O operations to the batch pipeline.
* A batch size that is too large can cause chunks to fail because processing exceeds the lifetime of the workers handling them.
* Smaller batch sizes can leverage greater scalability in a shorter period of time; however, the additional I/O operations add to the overall processing time.
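The guidance above can be sketched as a helper that splits a record set into consistently sized chunks, each with its own UUID and shared parameters. This is a hypothetical illustration; the chunk object shape follows the table in this section.

```python
import uuid

def make_chunks(header_row, rows, batch_size, parameters=None):
    """Split records into consistently sized chunks, each with its own UUID."""
    chunks = []
    for start in range(0, len(rows), batch_size):
        block = rows[start:start + batch_size]
        data = {"inputs": [header_row] + block}
        if parameters:
            data["parameters"] = parameters
        chunks.append({"id": str(uuid.uuid4()), "data": data, "size": len(block)})
    return chunks

rows = [[1, 20, 65], [2, 74, 73], [3, 20, 65], [4, 34, 73], [5, 62, 62]]
chunks = make_chunks(["sale_id", "price", "quantity"], rows, batch_size=2,
                     parameters={"tax": 0.1})
print([c["size"] for c in chunks])  # [2, 2, 1]
```

Keeping `batch_size` constant across chunks, as recommended above, makes the pipeline's work allocation more predictable.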

### Path parameters

<table><thead><tr><th width="374">Key</th><th>Value</th></tr></thead><tbody><tr><td><code>tenant</code> *</td><td>Tenant is part of your <a data-mention href="/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark">/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark</a> URL and also available in the  <a data-mention href="/pages/ylWjjoVBLOcB7JZks4Cp#user-menu">/pages/ylWjjoVBLOcB7JZks4Cp#user-menu</a>.</td></tr><tr><td><code>batchId</code> *</td><td><code>id</code> from <a data-mention href="#post-batch-job">#post-batch-job</a>.</td></tr></tbody></table>

### Request body

`Content-Type: application/json` or `Content-Type: application/bson`&#x20;

Data can be sent as either uncompressed JSON or compressed Binary JSON ([BSON](https://bsonspec.org/)).

| Name        | Description                  |
| ----------- | ---------------------------- |
| `chunks` \* | Array of JSON chunk objects. |

### Define data and settings for a chunk

<table data-full-width="false"><thead><tr><th>Name</th><th>Description</th><th>Example</th></tr></thead><tbody><tr><td><code>id</code> *</td><td>You must generate a universally unique identifier (UUID) for each chunk. This is to associate <code>chunks</code> inputs against the <code>chunkresults</code> outputs.</td><td><pre class="language-json"><code class="lang-json">"4d3a06ea-dda4-458c-9036-423a2b74e5cd"
</code></pre></td></tr><tr><td><code>data</code> *</td><td>Object to store <code>inputs</code>, <code>parameters</code>, and <code>summary</code>.</td><td></td></tr><tr><td><code>data.inputs</code> *</td><td><code>inputs</code> should contain the input records to run against the Spark service. These should correspond to the <code>Xinput</code>s on the Spark service. The dataset can contain multiple records, which will be processed by the batch.<br><br>Data should conform to the JSON array format described in <a data-mention href="/pages/F63N1Fpz7fYruDnmWkEQ#request-body">/pages/F63N1Fpz7fYruDnmWkEQ#request-body</a>.</td><td><pre class="language-json"><code class="lang-json">"inputs": [
    ["sale_id","price","quantity"],
    [1,20,65],
    [2,74,73]
]
</code></pre></td></tr><tr><td><code>data.parameters</code></td><td><p><code>Xinput</code> values that stay the same across all dataset records, can instead be provided as parameters.</p><p><br>Parameters is a common data set for all the inputs in the chunk. This eliminates the need to send repeated data and reduces the size of the chunk.</p></td><td><p><code>tax</code> rate of <code>0.1</code> that will be used with each input record.</p><pre class="language-json"><code class="lang-json">"parameters": {
    "tax": 0.1
}
</code></pre></td></tr><tr><td><code>data.summary</code></td><td>Reference <a data-mention href="#define-aggregations-using-data.summary">#define-aggregations-using-data.summary</a>.</td><td></td></tr><tr><td><code>size</code></td><td>Total number of records in each chunk. This is needed when using BSON as the number of rows cannot be quickly determined.</td><td><pre class="language-json"><code class="lang-json">"size": 2
</code></pre></td></tr></tbody></table>

#### Define aggregations using `data.summary`

The `data.summary` object defines aggregations that will be applied to **each chunk**. The summary aggregations will be returned in an object called `summary_outputs`.

This is useful when the batch is used to run simulations, where the aggregation of the batch results is more useful than the individual record outputs.

If, in [#post-batch-job](#post-batch-job "mention"), the parameters `unique_record_key` and `output` are not provided, then [#get-the-chunk-results](#get-the-chunk-results "mention") will not return the individual records along with the aggregation.

<table data-full-width="false"><thead><tr><th>Name</th><th>Description</th><th>Example</th></tr></thead><tbody><tr><td><code>ignore_error</code></td><td>When set to <code>false</code>, this is analogous to Excel where <code>SUM(0,1,2,#N/A) = #N/A</code>.<br>When set to <code>true</code>, records that return an error in the output will not affect the aggregation.<br>The default value is <code>false</code>.</td><td><pre class="language-json" data-overflow="wrap"><code class="lang-json">"ignore_error": false
</code></pre></td></tr><tr><td><code>aggregation</code></td><td>An array of aggregate instruction objects.</td><td></td></tr><tr><td><code>aggregation.output_name</code></td><td>The name of the <code>Xoutput</code> from your Spark service on which you want to apply aggregate function.</td><td><pre class="language-json" data-overflow="wrap"><code class="lang-json">"output_name": "total"
</code></pre></td></tr><tr><td><code>aggregation.operator</code></td><td>The name of the aggregate operator. Currently, only <code>SUM</code> is supported.</td><td><pre class="language-json" data-overflow="wrap"><code class="lang-json">"operator": "SUM"
</code></pre></td></tr></tbody></table>

**Sample `data.summary` object**

<pre class="language-json"><code class="lang-json"><strong>"summary": {
</strong>    "ignore_error": false,
    "aggregation": [
        {
            "output_name": "total",
            "operator": "SUM"
        }
    ]
}
</code></pre>
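The `ignore_error` semantics can be illustrated locally with a toy SUM aggregator. This is a sketch only; `summarize` is our own helper and treats any non-numeric value as an errored record.

```python
def summarize(values, ignore_error=False):
    """Toy SUM aggregation mirroring data.summary semantics:
    with ignore_error=False, an errored record poisons the aggregate
    (like Excel's SUM(0,1,2,#N/A) = #N/A); with True, errored records
    are skipped."""
    errors = [v for v in values if not isinstance(v, (int, float))]
    if errors and not ignore_error:
        return errors[0]
    return sum(v for v in values if isinstance(v, (int, float)))

print(round(summarize([1430, 5942.2]), 2))            # 7372.2
print(summarize([1430, "#N/A"], ignore_error=False))  # #N/A
print(summarize([1430, "#N/A"], ignore_error=True))   # 1430
```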

### Sample request

This request sends `2` chunks to the batch without `data.summary`.

```sh
curl --location 'https://excel.myenvironment.coherent.global/mytenant/api/v4/batch/4d3a06ea-dda4-458c-9036-423a2b74e5cd/chunks' \
--header 'Accept-Encoding: gzip' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer {token}' \
--data '{
    "chunks": [
        {
            "id": "f7b52961-ad34-40ce-874f-93c67df11d65",
            "data": {
                "inputs": [
                    ["sale_id","price","quantity"],
                    [1,20,65],
                    [2,74,73]
                ],
                "parameters": {
                    "tax": 0.1
                }
            }
        },
        {
            "id": "ec2eec02-c005-4af6-a2af-50cfd7616d64",
            "data": {
                "inputs": [
                    ["sale_id","price","quantity"],
                    [3,20,65],
                    [4,34,73],
                    [5,62,62],
                    [6,43,87],
                    [7,23,35]
                ],
                "parameters": {
                    "tax": 0.1
                }
            }
        }
    ]
}'
```

### Response

`HTTP 200 OK` `Content-Type: application/json`

Returns the response from [#get-the-batch-pipeline-status](#get-the-batch-pipeline-status "mention").

## `GET` the batch pipeline status

Returns: Response `status` object including the number of `records_available` to download and the remaining `input_buffer_remaining_bytes` and `output_buffer_remaining_bytes`.

{% code overflow="wrap" %}

```shellscript
GET /{tenant}/api/v4/batch/{batchId}/status
```

{% endcode %}

### Path parameters

<table><thead><tr><th width="374">Key</th><th>Value</th></tr></thead><tbody><tr><td><code>tenant</code> *</td><td>Tenant is part of your <a data-mention href="/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark">/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark</a> URL and also available in the  <a data-mention href="/pages/ylWjjoVBLOcB7JZks4Cp#user-menu">/pages/ylWjjoVBLOcB7JZks4Cp#user-menu</a>.</td></tr><tr><td><code>batchId</code> *</td><td><code>id</code> from <a data-mention href="#post-batch-job">#post-batch-job</a>.</td></tr></tbody></table>

### Sample request

```sh
curl --location 'https://excel.myenvironment.coherent.global/mytenant/api/v4/batch/4d3a06ea-dda4-458c-9036-423a2b74e5cd/status' \
--header 'Accept-Encoding: gzip' \
--header 'Authorization: Bearer {token}'
```

### Sample response

`HTTP 200 OK` `Content-Type: application/json`

```json
{
    "response_timestamp": "2024-06-26T03:10:45.809Z",
    "request_timestamp": "2024-06-26T03:10:45.789Z",
    "batch_status": "in_progress",
    "pipeline_status": "idle",
    "chunks_available": 2,
    "chunks_submitted": 2,
    "record_submitted": 7,
    "chunks_completed": 2,
    "records_completed": 7,
    "compute_time_ms": 8,
    "input_buffer_used_bytes": 0,
    "input_buffer_remaining_bytes": 70000000,
    "output_buffer_used_bytes": 402,
    "output_buffer_remaining_bytes": 79999598,
    "workers_in_use": 0,
    "records_available": 7
}
```

## `GET` the chunk results

Returns: Response from [#get-the-batch-pipeline-status](#get-the-batch-pipeline-status "mention") and completed `outputs`.

{% code overflow="wrap" %}

```shellscript
GET /{tenant}/api/v4/batch/{batchId}/chunkresults?max={chunks}
```

{% endcode %}

{% hint style="info" %}
If chunks are consistently failing to execute, consider reducing the size of the chunks sent to the batch pipeline in [#post-chunks-to-the-batch-pipeline](#post-chunks-to-the-batch-pipeline "mention"). It is highly recommended to test batch jobs thoroughly and use an appropriate batch size.
{% endhint %}

### Path parameters

<table><thead><tr><th width="374">Key</th><th>Value</th></tr></thead><tbody><tr><td><code>tenant</code> *</td><td>Tenant is part of your <a data-mention href="/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark">/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark</a> URL and also available in the  <a data-mention href="/pages/ylWjjoVBLOcB7JZks4Cp#user-menu">/pages/ylWjjoVBLOcB7JZks4Cp#user-menu</a>.</td></tr><tr><td><code>batchId</code> *</td><td><code>id</code> from <a data-mention href="#post-batch-job">#post-batch-job</a>.</td></tr></tbody></table>

### Query parameters

| Name  | Description                                                                                                              |
| ----- | ------------------------------------------------------------------------------------------------------------------------ |
| `max` | Maximum number of chunks to be returned as part of a `chunkresults` call. This cannot exceed `100`. The default is `50`. |
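Draining the output buffer typically means polling this endpoint until `records_available` reaches `0`. A minimal sketch, with `fetch_page` standing in for the actual HTTP call:

```python
def drain_results(fetch_page, max_per_call=100):
    """Repeatedly fetch chunk results until the returned status reports no
    more records available. fetch_page is any callable performing
    GET .../chunkresults?max={max_per_call} and returning the parsed JSON."""
    collected = []
    while True:
        page = fetch_page(max_per_call)
        collected.extend(page.get("data", []))
        if page["status"]["records_available"] == 0:
            return collected

# Stubbed pages standing in for two successive API responses.
pages = [
    {"data": [{"id": "chunk-1"}], "status": {"records_available": 5}},
    {"data": [{"id": "chunk-2"}], "status": {"records_available": 0}},
]
results = drain_results(lambda max_per_call: pages.pop(0))
print([c["id"] for c in results])  # ['chunk-1', 'chunk-2']
```

Because fetching results also frees output buffer space, a loop like this is what restarts a pipeline stalled on a full output buffer.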

### Sample request

```sh
curl --location 'https://excel.myenvironment.coherent.global/mytenant/api/v4/batch/4d3a06ea-dda4-458c-9036-423a2b74e5cd/chunkresults?max=100' \
--header 'Accept-Encoding: gzip' \
--header 'Authorization: Bearer {token}'
```

### Sample response

`HTTP 200 OK` `Content-Type: application/json`&#x20;

* If there are no output records available to download, the response from [#get-the-batch-pipeline-status](#get-the-batch-pipeline-status "mention") will be returned.
* If output records are available, the response is returned as chunks. Each chunk object includes the `unique_record_key` (if provided) and the calculated service outputs. If the request was submitted using [BSON](https://bsonspec.org/), the response is also returned in BSON.

```json
{
    "data": [
        {
            "id": "f7b52961-ad34-40ce-874f-93c67df11d65",
            "summary_output": [
                []
            ],
            "outputs": [
                ["sale_id", "total"],
                [1, 1430],
                [2, 5942.2]
            ],
            "warnings": [
                null,
                null
            ],
            "errors": [
                null,
                null
            ],
            "process_time": [
                1,
                1
            ]
        },
        {
            "id": "ec2eec02-c005-4af6-a2af-50cfd7616d64",
            "summary_output": [
                []
            ],
            "outputs": [
                ["sale_id", "total"],
                [3, 1430],
                [4, 2730.2],
                [5, 4228.4],
                [6, 4115.1],
                [7, 885.5]
            ],
            "warnings": [
                null,
                null,
                null,
                null,
                null
            ],
            "errors": [
                null,
                null,
                null,
                null,
                null
            ],
            "process_time": [
                1,
                1,
                1,
                1,
                1
            ]
        }
    ],
    "status": {
        "response_timestamp": "2024-06-26T03:17:23.488Z",
        "request_timestamp": "2024-06-26T03:17:23.481Z",
        "batch_status": "in_progress",
        "pipeline_status": "idle",
        "chunks_available": 0,
        "chunks_submitted": 2,
        "record_submitted": 7,
        "chunks_completed": 2,
        "records_completed": 7,
        "compute_time_ms": 8,
        "input_buffer_used_bytes": 0,
        "input_buffer_remaining_bytes": 70000000,
        "output_buffer_used_bytes": 0,
        "output_buffer_remaining_bytes": 80000000,
        "workers_in_use": 0,
        "records_available": 0
    }
}
```

If the [#post-chunks-to-the-batch-pipeline](#post-chunks-to-the-batch-pipeline "mention") includes [#define-aggregations-using-data.summary](#define-aggregations-using-data.summary "mention") then:

* The object `data.summary_output` will be populated.
* Furthermore, if in [#post-batch-job](#post-batch-job "mention") the parameters `unique_record_key` and `output` are not provided, the `outputs` object will contain `[]` for the individual records.
* Aggregations will be returned with the output name suffixed with `_` and the aggregation operator.

```json
{
    "data": [
        {
            "id": "b49be0c0-9e7a-439f-98ed-4eab3054b34c",
            "summary_output": [
                [
                    ["total_SUM"],
                    [7372.2]
                ]
            ],
            "outputs": [
                []
                []
                []
            ],
            "warnings": [
                null,
                null
            ],
            "errors": [
                null,
                null
            ],
            "process_time": [
                1,
                1
            ]
        }
    ],
    "status": {
        "response_timestamp": "2024-06-26T03:20:07.095Z",
        "request_timestamp": "2024-06-26T03:20:07.090Z",
        "batch_status": "in_progress",
        "pipeline_status": "idle",
        "chunks_available": 0,
        "chunks_submitted": 3,
        "record_submitted": 9,
        "chunks_completed": 3,
        "records_completed": 9,
        "compute_time_ms": 10,
        "input_buffer_used_bytes": 0,
        "input_buffer_remaining_bytes": 70000000,
        "output_buffer_used_bytes": 0,
        "output_buffer_remaining_bytes": 80000000,
        "workers_in_use": 1,
        "records_available": 0
    }
}
```

## Close and cancel the batch

Returns: Response from [#get-the-batch-pipeline-status](#get-the-batch-pipeline-status "mention").

{% code overflow="wrap" %}

```shellscript
PATCH /{tenant}/api/v4/batch/{batchId}
```

{% endcode %}

* Close batch: if you have no more data to add to the batch, close it. After a batch is closed, it will still process the remaining data, and you can download the remaining output via the `chunkresults` endpoint.
* Cancel batch: if the batch is not working as expected, or you have made a mistake, cancel the batch to immediately stop further processing. You will not be able to download any more data after cancelling a batch.

If a batch is neither closed nor cancelled, it times out after `30` minutes, which releases the input and output buffers.

### Path parameters

<table><thead><tr><th width="374">Key</th><th>Value</th></tr></thead><tbody><tr><td><code>tenant</code> *</td><td>Tenant is part of your <a data-mention href="/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark">/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark</a> URL and also available in the  <a data-mention href="/pages/ylWjjoVBLOcB7JZks4Cp#user-menu">/pages/ylWjjoVBLOcB7JZks4Cp#user-menu</a>.</td></tr><tr><td><code>batchId</code> *</td><td><code>id</code> from <a data-mention href="#post-batch-job">#post-batch-job</a>.</td></tr></tbody></table>

### Request body

`Content-Type: application/json`

<table><thead><tr><th>Name</th><th>Description</th></tr></thead><tbody><tr><td><pre class="language-json"><code class="lang-json">{"batch_status":"closed"}
</code></pre></td><td>Include this line in the JSON body to close the batch after all inputs have been submitted.</td></tr><tr><td><pre class="language-json"><code class="lang-json"><strong>{"batch_status":"cancelled"}
</strong></code></pre></td><td>Include this line in the JSON body to cancel the batch at any time. Spark will stop processing inputs immediately.</td></tr></tbody></table>

### Sample request

```sh
curl --location --request PATCH 'https://excel.myenvironment.coherent.global/mytenant/api/v4/batch/4d3a06ea-dda4-458c-9036-423a2b74e5cd' \
--header 'Accept-Encoding: gzip' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer {token}' \
--data '{
    "batch_status":"closed"
}'
```

### Sample response

`HTTP 200 OK` `Content-Type: application/json`

```json
{
    "object": "batch",
    "id": "4d3a06ea-dda4-458c-9036-423a2b74e5cd",
    "data": {
        "service_id": "ec7932a3-3e60-43d0-bd84-704cd4e94ff7",
        "version_id": "ee6849b3-d7c0-44b4-b554-fe55fc128f8f",
        "compiler_version": "Neuron_v1.19.0",
        "correlation_id": "89f4befd-91fe-4414-a33f-ea0911586fe2",
        "source_system": "Analysis server",
        "unique_record_key": null,
        "response_timestamp": "2024-06-26T03:25:41.638Z",
        "batch_status": "closed",
        "created_by": "myuser@mydomain.com",
        "created_timestamp": "2024-06-26T03:09:13.717Z",
        "updated_timestamp": "2024-06-26T03:20:06.897Z",
        "service_uri": "myfolder/SalesExample[0.2.0]"
    }
}
```

## Get batch information

Returns: Detailed information about the batch.

{% code overflow="wrap" %}

```shellscript
GET /{tenant}/api/v4/batch/{batchId}
```

{% endcode %}

### Path parameters

<table><thead><tr><th width="374">Key</th><th>Value</th></tr></thead><tbody><tr><td><code>tenant</code> *</td><td>Tenant is part of your <a data-mention href="/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark">/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark</a> URL and also available in the  <a data-mention href="/pages/ylWjjoVBLOcB7JZks4Cp#user-menu">/pages/ylWjjoVBLOcB7JZks4Cp#user-menu</a>.</td></tr><tr><td><code>batchId</code> *</td><td><code>id</code> from <a data-mention href="#post-batch-job">#post-batch-job</a>.</td></tr></tbody></table>

### Sample request

```sh
curl --location 'https://excel.myenvironment.coherent.global/mytenant/api/v4/batch/4d3a06ea-dda4-458c-9036-423a2b74e5cd' \
--header 'Accept-Encoding: gzip' \
--header 'Authorization: Bearer {token}'
```

### Sample response

`HTTP 200 OK` `Content-Type: application/json`

```json
{
    "object": "batch",
    "id": "4d3a06ea-dda4-458c-9036-423a2b74e5cd",
    "data": {
        "service_id": "ec7932a3-3e60-43d0-bd84-704cd4e94ff7",
        "version_id": "ee6849b3-d7c0-44b4-b554-fe55fc128f8f",
        "compiler_version": "Neuron_v1.19.0",
        "correlation_id": "89f4befd-91fe-4414-a33f-ea0911586fe2",
        "source_system": "Analysis server",
        "unique_record_key": null,
        "summary": {
            "chunks_submitted": 3,
            "chunks_retried": 0,
            "chunks_completed": 3,
            "chunks_failed": 0,
            "records_retried": 0,
            "input_size_bytes": 0,
            "output_size_bytes": 0,
            "avg_compute_time_ms": 1,
            "records_submitted": 9,
            "records_failed": 0,
            "records_completed": 9,
            "compute_time_ms": 10,
            "batch_time_ms": 653180.055
        },
        "configuration": {
            "initial_workers": 10,
            "chunks_per_request": 1,
            "runner_thread_count": 1,
            "acceptable_error_percentage": 0,
            "input_buffer_allocated_bytes": 70000000,
            "output_buffer_allocated_bytes": 80000000,
            "max_workers": 3000
        },
        "response_timestamp": "2024-06-26T03:24:27.879Z",
        "batch_status": "in_progress",
        "created_by": "myuser@mydomain.com",
        "created_timestamp": "2024-06-26T03:09:13.717Z",
        "updated_timestamp": "2024-06-26T03:20:06.897Z",
        "service_uri": "myfolder/SalesExample[0.2.0]"
    }
}
```
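
Because the `summary` object includes record counts, a GET response can be turned into a rough progress figure when polling a batch. A sketch (the helper name is ours, not part of any SDK):

```python
def batch_progress(response: dict) -> tuple[str, float]:
    """Return (batch_status, fraction of submitted records finished)
    from a GET /{tenant}/api/v4/batch/{batchId} response body."""
    data = response["data"]
    summary = data["summary"]
    finished = summary["records_completed"] + summary["records_failed"]
    submitted = summary["records_submitted"]
    fraction = finished / submitted if submitted else 0.0
    return data["batch_status"], fraction

# Using the relevant fields of the sample response above:
sample = {"data": {"batch_status": "in_progress",
                   "summary": {"records_submitted": 9,
                               "records_failed": 0,
                               "records_completed": 9}}}
status, done = batch_progress(sample)  # -> ("in_progress", 1.0)
```

Note that `batch_status` can still read `in_progress` while all submitted records are complete, as in the sample: the batch stays open until it is closed, cancelled, or times out.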

## Get batch status across the tenant

```shellscript
GET /{tenant}/api/v4/batch/status
```

### Path parameters

<table><thead><tr><th width="374">Key</th><th>Value</th></tr></thead><tbody><tr><td><code>tenant</code> *</td><td>Tenant is part of your <a data-mention href="/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark">/pages/-MboyUpg0GSvjBVkVMcw#log-in-to-spark</a> URL and also available in the  <a data-mention href="/pages/ylWjjoVBLOcB7JZks4Cp#user-menu">/pages/ylWjjoVBLOcB7JZks4Cp#user-menu</a>.</td></tr></tbody></table>

This endpoint provides information about batches that are `in_progress` or that completed within the past hour.

* If you are a `supervisor:pf` user, you can see all batches run by users within your tenant.
* Otherwise, you will only see information about the batches that you initiated.

{% hint style="info" %}
The `environment` object is a work in progress and may change in future iterations.
{% endhint %}

### Sample request

```sh
curl --location 'https://excel.myenvironment.coherent.global/mytenant/api/v4/batch/status' \
--header 'Accept-Encoding: gzip' \
--header 'Authorization: Bearer {token}'
```

### Sample response

`HTTP 200 OK` `Content-Type: application/json`

```json
{
    "in_progress_batches": [],
    "recent_batches": [
        {
            "object": "batch",
            "id": "4d3a06ea-dda4-458c-9036-423a2b74e5cd",
            "data": {
                "pipeline_status": "closed",
                "summary": {
                    "records_submitted": 9,
                    "records_failed": 0,
                    "records_completed": 9,
                    "compute_time_ms": 10,
                    "batch_time_ms": 987852.52
                },
                "response_timestamp": "2024-06-26T03:30:13.231Z",
                "batch_status": "completed",
                "created_by": "myuser@mydomain.com",
                "created_timestamp": "2024-06-26T03:09:13.717Z",
                "updated_timestamp": "2024-06-26T03:25:41.570Z",
                "service_uri": "myfolder/SalesExample[0.2.0]"
            }
        }
    ],
    "tenant": {
        "configuration": {
            "input_buffer_allocated_bytes": 0,
            "output_buffer_allocated_bytes": 0,
            "max_workers": 3000
        },
        "status": {
            "input_buffer_used_bytes": 0,
            "input_buffer_remaining_bytes": 0,
            "output_buffer_used_bytes": 0,
            "output_buffer_remaining_bytes": 0,
            "workers_in_use": 0
        }
    },
    "environment": {
        "update": 6
    }
}
```

## Reference

### `batch_status` values

* `created`, `pending`, `in_progress`, `closed`, `closed_by_timeout`, `completed`, `completed_by_timeout`, `failed`, `cancelled`.
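
When polling, it helps to separate terminal values from active ones. The split below is our reading of the lifecycle described on this page (note that `closed` is *not* terminal, since a closed batch keeps processing its remaining input), not an official list:

```python
# Statuses after which the pipeline does no further processing
# (interpretation of this page's lifecycle, not an official guarantee).
TERMINAL_STATUSES = frozenset({
    "closed_by_timeout", "completed", "completed_by_timeout",
    "failed", "cancelled",
})

def is_finished(batch_status: str) -> bool:
    """True if the batch has reached a terminal status."""
    return batch_status in TERMINAL_STATUSES
```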

### Manage batch performance

For each tenant, there is an allocation of server "workers" used to accommodate batch requests. After a batch is started:

1. An initial number of workers is allocated to the batch job.
2. Each batch job then scales its worker count based on the workers available on the tenant. This is limited by:
   1. The total allocation of workers on the tenant.
   2. The maximum number of workers used for a particular job, which can be overridden in [#post-batch-job](#post-batch-job "mention") using the `max_workers` parameter. Lowering it enables more concurrent batch jobs, at a slower rate of completion per job.
3. After a batch has been closed, there is a cool-off period of `2 min` before the buffer is released for any additional batch jobs.

It is very important to understand the [#batch-pipeline-architecture](#batch-pipeline-architecture "mention") when sending large volumes of data. Each tenant has a total maximum buffer allocation, and each job has its own limited input buffer and output buffer.

* A batch job requires at least `110%` of its `(input buffer + output buffer)` to be available in the tenant's total buffer before it can start.
* If the tenant's output buffer is full, Spark will also reject any new chunks that are submitted.
* A batch may slow down significantly when the input buffer is full and many completed chunks cannot be written to the output buffer because it is also full.
* Use the APIs to [#get-the-batch-pipeline-status](#get-the-batch-pipeline-status "mention"), [#get-batch-information](#get-batch-information "mention"), and [#get-batch-status-across-the-tenant](#get-batch-status-across-the-tenant "mention") to monitor the available buffer levels.
* Whenever a batch is completed, [#close-and-cancel-the-batch](#close-and-cancel-the-batch "mention") in order to restore the buffer values.
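
The `110%` headroom rule can be checked against the tenant-level `status` payload returned by the tenant status endpoint. A sketch, where the field names come from the sample response above but the arithmetic is our interpretation of the rule:

```python
def has_headroom(tenant: dict, input_bytes: int, output_bytes: int) -> bool:
    """Check whether the tenant's remaining buffer space covers at least
    110% of a prospective job's input + output buffers (interpretation
    of the rule above, not an official formula)."""
    status = tenant["status"]
    available = (status["input_buffer_remaining_bytes"]
                 + status["output_buffer_remaining_bytes"])
    return available >= 1.1 * (input_bytes + output_bytes)

# Example with hypothetical figures: 150 MB remaining across both buffers.
tenant = {"status": {"input_buffer_remaining_bytes": 70_000_000,
                     "output_buffer_remaining_bytes": 80_000_000}}
```

With this example, a job needing 50 MB in and 50 MB out fits (110 MB required), while one needing 70 MB in and 80 MB out does not (165 MB required).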

