With https://github.com/elastic/ml-cpp/pull/2305 we now support caching PyTorch inference responses per node per model. By default, the cache is the same size as the model's on-disk size, because our current best estimate of the memory used to deploy a model is 2 * model_size + constant overhead: the model has to be loaded into memory twice while it is serialized to the native process. Once the model is in memory and accepting requests, its actual memory usage drops below what we have "reserved" for it on the node, so a cache layer that takes advantage of that unused (but reserved) memory is effectively free. In production, especially in search scenarios, caching inference results is critical for decreasing latency.

[role="xpack"]
[[start-trained-model-deployment]]
= Start trained model deployment API
[subs="attributes"]
++++
<titleabbrev>Start trained model deployment</titleabbrev>
++++

Starts a new trained model deployment.

preview::[]

[[start-trained-model-deployment-request]]
== {api-request-title}

`POST _ml/trained_models/<model_id>/deployment/_start`

[[start-trained-model-deployment-prereq]]
== {api-prereq-title}

Requires the `manage_ml` cluster privilege. This privilege is included in the
`machine_learning_admin` built-in role.

[[start-trained-model-deployment-desc]]
== {api-description-title}

Currently only `pytorch` models are supported for deployment. Once deployed,
the model can be used by the <<inference-processor,{infer-cap} processor>>
in an ingest pipeline or directly in the <<infer-trained-model>> API.

Scaling inference performance can be achieved by setting the parameters
`number_of_allocations` and `threads_per_allocation`.

Increasing `threads_per_allocation` means more threads are used when an
inference request is processed on a node. This can improve inference speed for
certain models and may also improve throughput.

Increasing `number_of_allocations` means more threads are used to process
multiple inference requests in parallel, which improves throughput. Each model
allocation uses a number of threads defined by `threads_per_allocation`.
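
For example, the following request is a minimal sketch of starting a deployment
sized for parallel traffic, using the query parameters described below. The
model ID `my_ner_model` is hypothetical and the values are illustrative only:

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_ner_model/deployment/_start?number_of_allocations=2&threads_per_allocation=4
--------------------------------------------------
// TEST[skip:TBD]

With these settings, the deployment can process two inference requests in
parallel, each using up to four threads.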

Model allocations are distributed across {ml} nodes. All allocations assigned
to a node share the same copy of the model in memory. To avoid thread
oversubscription, which is detrimental to performance, model allocations are
distributed in such a way that the total number of threads used does not
exceed the node's allocated processors.
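
For example, if `threads_per_allocation` is 4, a node with 8 allocated
processors can be assigned at most two allocations of the deployment, because
two allocations of four threads each already use all eight allocated
processors.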

[[start-trained-model-deployment-path-params]]
== {api-path-parms-title}

`<model_id>`::
(Required, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]

[[start-trained-model-deployment-query-params]]
== {api-query-parms-title}

`cache_size`::
(Optional, <<byte-units,byte value>>)
The inference cache size (in memory outside the JVM heap) per node for the model.
The default value is the same as `model_size_bytes`. To disable the cache,
specify `0b` (see the examples below).

`number_of_allocations`::
(Optional, integer)
The total number of allocations this model is assigned across {ml} nodes.
Increasing this value generally increases the throughput.
Defaults to 1.

`queue_capacity`::
(Optional, integer)
Controls how many inference requests are allowed in the queue at a time.
Every machine learning node in the cluster where the model can be allocated
has a queue of this size; when the number of requests exceeds the total value,
new requests are rejected with a 429 error. Defaults to 1024.

`threads_per_allocation`::
(Optional, integer)
Sets the number of threads used by each model allocation during inference. This
generally increases the speed per inference request. The inference process is a
compute-bound process; `threads_per_allocation` must not exceed the number of
available allocated processors per node. Defaults to 1. Must be a power of 2.
Max allowed value is 32.

`timeout`::
(Optional, time)
Controls the amount of time to wait for the model to deploy. Defaults to
20 seconds.

`wait_for`::
(Optional, string)
Specifies the allocation status to wait for before returning. Defaults to
`started`. The value `starting` indicates deployment is starting but not yet on
any node. The value `started` indicates the model has started on at least one
node. The value `fully_allocated` indicates the deployment has started on all
valid nodes.

[[start-trained-model-deployment-example]]
== {api-examples-title}

The following example starts a new deployment for the
`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model:

[source,console]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?wait_for=started&timeout=1m
--------------------------------------------------
// TEST[skip:TBD]

The API returns the following results:

[source,console-result]
----
{
    "assignment": {
        "task_parameters": {
            "model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english",
            "model_bytes": 265632637,
            "threads_per_allocation" : 1,
            "number_of_allocations" : 1,
            "queue_capacity" : 1024
        },
        "routing_table": {
            "uckeG3R8TLe2MMNBQ6AGrw": {
                "routing_state": "started",
                "reason": ""
            }
        },
        "assignment_state": "started",
        "start_time": "2022-11-02T11:50:34.766591Z"
    }
}
----
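
The per-node inference cache can be disabled when the deployment is started by
setting the `cache_size` query parameter to `0b`. The following request is a
minimal sketch, reusing the model from the previous example:

[source,console]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?cache_size=0b
--------------------------------------------------
// TEST[skip:TBD]

With the cache disabled, every request is evaluated by the model rather than
potentially being answered from the per-node response cache.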