[role="xpack"]
[[start-trained-model-deployment]]
= Start trained model deployment API
[subs="attributes"]
++++
Start trained model deployment
++++
.New API reference
[sidebar]
--
For the most up-to-date API details, refer to {api-es}/group/endpoint-ml-trained-model[{ml-cap} trained model APIs].
--
Starts a new trained model deployment.
[[start-trained-model-deployment-request]]
== {api-request-title}
POST _ml/trained_models/<model_id>/deployment/_start
[[start-trained-model-deployment-prereq]]
== {api-prereq-title}
Requires the `manage_ml` cluster privilege. This privilege is included in the
`machine_learning_admin` built-in role.
[[start-trained-model-deployment-desc]]
== {api-description-title}
Currently only `pytorch` models are supported for deployment. Once deployed,
the model can be used by the <<inference-processor,{infer} processor>>
in an ingest pipeline or directly in the <<infer-trained-model>> API.
A model can be deployed multiple times by using deployment IDs. A deployment ID
must be unique and should not match any other deployment ID or model ID, unless
it is the same as the ID of the model being deployed. If `deployment_id` is not
set, it defaults to the `model_id`.
You can enable adaptive allocations to automatically scale model allocations up
and down based on the actual resource requirement of the processes.
Manually scaling inference performance can be achieved by setting the parameters
`number_of_allocations` and `threads_per_allocation`.
Increasing `threads_per_allocation` means more threads are used when an
inference request is processed on a node. This can improve inference speed for
certain models and may also improve throughput.
Increasing `number_of_allocations` means more inference requests are processed
in parallel, which improves throughput.
Each model allocation uses a number of threads defined by
`threads_per_allocation`.
Model allocations are distributed across {ml} nodes. All allocations assigned to
a node share the same copy of the model in memory. To avoid thread
oversubscription, which is detrimental to performance, model allocations are
distributed in such a way that the total number of used threads does not surpass
the node's allocated processors.
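
For example, the following request starts a deployment with two allocations,
each using two threads. The model ID `my_model` is an illustrative placeholder;
substitute the ID of a model you have imported:

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start
{
  "number_of_allocations": 2,
  "threads_per_allocation": 2
}
--------------------------------------------------
// TEST[skip:TBD]

With this configuration, up to two inference requests can be processed in
parallel, each using two threads, for a total of four threads across the
assigned {ml} nodes.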
[[start-trained-model-deployment-path-params]]
== {api-path-parms-title}
`<model_id>`::
(Required, string)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=model-id]
[[start-trained-model-deployment-query-params]]
== {api-query-parms-title}
`deployment_id`::
(Optional, string)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=deployment-id]
+
--
Defaults to `model_id`.
--
`timeout`::
(Optional, time)
Controls the amount of time to wait for the model to deploy. Defaults to 30
seconds.
`wait_for`::
(Optional, string)
Specifies the allocation status to wait for before returning. Defaults to
`started`. The value `starting` indicates deployment is starting but not yet on
any node. The value `started` indicates the model has started on at least one
node. The value `fully_allocated` indicates the deployment has started on all
valid nodes.
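
For example, the following request waits up to one minute for a deployment of
the placeholder model `my_model` to start on all valid nodes:

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?wait_for=fully_allocated&timeout=1m
--------------------------------------------------
// TEST[skip:TBD]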
[[start-trained-model-deployment-request-body]]
== {api-request-body-title}
`adaptive_allocations`::
(Optional, object)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]
`enabled`:::
(Optional, Boolean)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]
`max_number_of_allocations`:::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]
`min_number_of_allocations`:::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]
`cache_size`::
(Optional, <<byte-units,byte value>>)
The inference cache size (in memory outside the JVM heap) per node for the
model. In serverless, the cache is disabled by default. Otherwise, the default
value is the size of the model as reported by the `model_size_bytes` field in
the <<get-trained-models-stats>>. To disable the cache, `0b` can be provided.
`number_of_allocations`::
(Optional, integer)
The total number of allocations this model is assigned across {ml} nodes.
Increasing this value generally increases the throughput. Defaults to `1`.
If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.
`priority`::
(Optional, string)
The priority of the deployment. The default value is `normal`.
There are two priority settings:
+
--
* `normal`: Use this for deployments in production. The deployment allocations
are distributed so that node processors are not oversubscribed.
* `low`: Use this for testing model functionality. The intention is that these
deployments are not sent a high volume of input. The deployment is required to
have a single allocation with just one thread. Low priority deployments may be
assigned to nodes that already use all their processors, but they are given a
lower CPU priority than normal deployments. Low priority deployments may be
unassigned in order to satisfy more allocations of normal priority deployments.
--
WARNING: Heavy usage of low priority deployments may impact performance of
normal priority deployments.
`queue_capacity`::
(Optional, integer)
Controls how many inference requests are allowed in the queue at a time.
Every {ml} node in the cluster where the model can be allocated has a queue of
this size; when the number of requests exceeds the total value, new requests
are rejected with a 429 error. Defaults to `10000`. The maximum allowed value
is `100000`.
`threads_per_allocation`::
(Optional, integer)
Sets the number of threads used by each model allocation during inference.
This generally increases the speed per inference request. The inference
process is a compute-bound process; `threads_per_allocation` must not exceed
the number of available allocated processors per node. Defaults to `1`. Must
be a power of 2. The maximum allowed value is `32`.
[[start-trained-model-deployment-example]]
== {api-examples-title}
The following example starts a new deployment for the
`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model:
[source,console]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?wait_for=started&timeout=1m
--------------------------------------------------
// TEST[skip:TBD]
The API returns the following results:
[source,console-result]
----
{
  "assignment": {
    "task_parameters": {
      "model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english",
      "model_bytes": 265632637,
      "threads_per_allocation": 1,
      "number_of_allocations": 1,
      "queue_capacity": 10000,
      "priority": "normal"
    },
    "routing_table": {
      "uckeG3R8TLe2MMNBQ6AGrw": {
        "routing_state": "started",
        "reason": ""
      }
    },
    "assignment_state": "started",
    "start_time": "2022-11-02T11:50:34.766591Z"
  }
}
----
[[start-trained-model-deployment-deployment-id-example]]
=== Using deployment IDs
The following example starts a new deployment for the `my_model` trained model
with the deployment ID `my_model_for_ingest`. The deployment ID can be used in
{infer} API calls or in {infer} processors.
[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_ingest
--------------------------------------------------
// TEST[skip:TBD]
The `my_model` trained model can be deployed again with a different ID:
[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_search
--------------------------------------------------
// TEST[skip:TBD]
[[start-trained-model-deployment-adaptive-allocation-example]]
=== Setting adaptive allocations
The following example starts a new deployment of the `my_model` trained model
with the deployment ID `my_model_for_search` and enables adaptive allocations
with a minimum of 3 and a maximum of 10 allocations:
[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_search
{
  "adaptive_allocations": {
    "enabled": true,
    "min_number_of_allocations": 3,
    "max_number_of_allocations": 10
  }
}
--------------------------------------------------
// TEST[skip:TBD]
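
[[start-trained-model-deployment-low-priority-example]]
=== Setting a low priority deployment

The following sketch starts a low priority deployment of the `my_model` trained
model for functional testing; the deployment ID and the body values are
illustrative, not recommendations. The cache is disabled and the queue is
limited to 100 requests:

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_testing
{
  "priority": "low",
  "cache_size": "0b",
  "queue_capacity": 100
}
--------------------------------------------------
// TEST[skip:TBD]

Because the deployment has `low` priority, it receives a single allocation with
one thread and should not be sent a high volume of input.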