[role="xpack"] [[start-trained-model-deployment]] = Start trained model deployment API [subs="attributes"] ++++ Start trained model deployment ++++ Starts a new trained model deployment. preview::[] [[start-trained-model-deployment-request]] == {api-request-title} `POST _ml/trained_models//deployment/_start` [[start-trained-model-deployment-prereq]] == {api-prereq-title} Requires the `manage_ml` cluster privilege. This privilege is included in the `machine_learning_admin` built-in role. [[start-trained-model-deployment-desc]] == {api-description-title} Currently only `pytorch` models are supported for deployment. Once deployed the model can be used by the <> in an ingest pipeline or directly in the <> API. Scaling inference performance can be achieved by setting the parameters `number_of_allocations` and `threads_per_allocation`. Increasing `threads_per_allocation` means more threads are used when an inference request is processed on a node. This can improve inference speed for certain models. It may also result in improvement to throughput. Increasing `number_of_allocations` means more threads are used to process multiple inference requests in parallel resulting in throughput improvement. Each model allocation uses a number of threads defined by `threads_per_allocation`. Model allocations are distributed across {ml} nodes. All allocations assigned to a node share the same copy of the model in memory. To avoid thread oversubscription which is detrimental to performance, model allocations are distributed in such a way that the total number of used threads does not surpass the node's allocated processors. [[start-trained-model-deployment-path-params]] == {api-path-parms-title} ``:: (Required, string) include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id] [[start-trained-model-deployment-query-params]] == {api-query-parms-title} `cache_size`:: (Optional, <>) The inference cache size (in memory outside the JVM heap) per node for the model. The default value is the same size as the `model_size_bytes`. To disable the cache, `0b` can be provided. `number_of_allocations`:: (Optional, integer) The total number of allocations this model is assigned across {ml} nodes. Increasing this value generally increases the throughput. Defaults to 1. `queue_capacity`:: (Optional, integer) Controls how many inference requests are allowed in the queue at a time. Every machine learning node in the cluster where the model can be allocated has a queue of this size; when the number of requests exceeds the total value, new requests are rejected with a 429 error. Defaults to 1024. `threads_per_allocation`:: (Optional, integer) Sets the number of threads used by each model allocation during inference. This generally increases the speed per inference request. The inference process is a compute-bound process; `threads_per_allocations` must not exceed the number of available allocated processors per node. Defaults to 1. Must be a power of 2. Max allowed value is 32. `timeout`:: (Optional, time) Controls the amount of time to wait for the model to deploy. Defaults to 20 seconds. `wait_for`:: (Optional, string) Specifies the allocation status to wait for before returning. Defaults to `started`. The value `starting` indicates deployment is starting but not yet on any node. The value `started` indicates the model has started on at least one node. The value `fully_allocated` indicates the deployment has started on all valid nodes. 
[[start-trained-model-deployment-example]]
== {api-examples-title}

The following example starts a new deployment for an
`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model:

[source,console]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?wait_for=started&timeout=1m
--------------------------------------------------
// TEST[skip:TBD]

The API returns the following results:

[source,console-result]
----
{
    "assignment": {
        "task_parameters": {
            "model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english",
            "model_bytes": 265632637,
            "threads_per_allocation": 1,
            "number_of_allocations": 1,
            "queue_capacity": 1024
        },
        "routing_table": {
            "uckeG3R8TLe2MMNBQ6AGrw": {
                "routing_state": "started",
                "reason": ""
            }
        },
        "assignment_state": "started",
        "start_time": "2022-11-02T11:50:34.766591Z"
    }
}
----
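
Once the deployment has started, the model can be called through the
<<infer-trained-model>> API. The following is a sketch rather than a verbatim
reference example: the input text is invented, and the `text_field` input
field name is assumed to match the model's default input configuration:

[source,console]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/_infer
{
  "docs": [
    {
      "text_field": "Sarah lives in Paris and works for Elastic."
    }
  ]
}
--------------------------------------------------
// TEST[skip:hypothetical example]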