With https://github.com/elastic/ml-cpp/pull/2305 we now support caching PyTorch inference responses per node per model. By default, the cache is the same size as the model's on-disk size, because our current best estimate of the memory used to deploy a model is 2 * model_size + constant overhead: the model has to be loaded into memory twice while it is serialized to the native process. Once the model is in memory and accepting requests, its actual memory usage drops below what we have "reserved" for it on the node, so a cache layer that takes advantage of that unused (but reserved) memory is effectively free. In production, especially in search scenarios, caching inference results is critical for decreasing latency.

[role="xpack"]
[[start-trained-model-deployment]]
= Start trained model deployment API
[subs="attributes"]
++++
<titleabbrev>Start trained model deployment</titleabbrev>
++++

Starts a new trained model deployment.

preview::[]

[[start-trained-model-deployment-request]]
== {api-request-title}

`POST _ml/trained_models/<model_id>/deployment/_start`

[[start-trained-model-deployment-prereq]]
== {api-prereq-title}

Requires the `manage_ml` cluster privilege. This privilege is included in the
`machine_learning_admin` built-in role.

[[start-trained-model-deployment-desc]]
== {api-description-title}

Currently only `pytorch` models are supported for deployment. Once deployed,
the model can be used by the <<inference-processor,{infer-cap} processor>>
in an ingest pipeline or directly in the <<infer-trained-model>> API.

Scaling inference performance can be achieved by setting the parameters
`number_of_allocations` and `threads_per_allocation`.

Increasing `threads_per_allocation` means more threads are used when an
inference request is processed on a node. This can improve inference speed for
certain models and may also improve throughput.

Increasing `number_of_allocations` means more threads are used to process
multiple inference requests in parallel, which improves throughput. Each model
allocation uses a number of threads defined by `threads_per_allocation`.
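
For example, the following request is a minimal sketch of starting a deployment
sized for parallel traffic, using the query parameters described below. The
model ID `my_ner_model` is hypothetical and the values are illustrative only:

[source,console]
--------------------------------------------------
POST _ml/trained_models/my_ner_model/deployment/_start?number_of_allocations=2&threads_per_allocation=4
--------------------------------------------------
// TEST[skip:TBD]

With these settings, the deployment can process two inference requests in
parallel, each using up to four threads.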

Model allocations are distributed across {ml} nodes. All allocations assigned
to a node share the same copy of the model in memory. To avoid thread
oversubscription, which is detrimental to performance, model allocations are
distributed in such a way that the total number of threads used does not
exceed the node's allocated processors.
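
For example, if `threads_per_allocation` is 4, a node with 8 allocated
processors can be assigned at most two allocations of the deployment, because
two allocations of four threads each already use all eight allocated
processors.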

[[start-trained-model-deployment-path-params]]
== {api-path-parms-title}

`<model_id>`::
(Required, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]

[[start-trained-model-deployment-query-params]]
== {api-query-parms-title}

`cache_size`::
(Optional, <<byte-units,byte value>>)
The inference cache size (in memory outside the JVM heap) per node for the model.
The default value is the same as `model_size_bytes`. To disable the cache,
specify `0b` (see the examples below).

`number_of_allocations`::
(Optional, integer)
The total number of allocations this model is assigned across {ml} nodes.
Increasing this value generally increases the throughput.
Defaults to 1.

`queue_capacity`::
(Optional, integer)
Controls how many inference requests are allowed in the queue at a time.
Every machine learning node in the cluster where the model can be allocated
has a queue of this size; when the number of requests exceeds the total value,
new requests are rejected with a 429 error. Defaults to 1024.

`threads_per_allocation`::
(Optional, integer)
Sets the number of threads used by each model allocation during inference. This
generally increases the speed per inference request. The inference process is a
compute-bound process; `threads_per_allocation` must not exceed the number of
available allocated processors per node. Defaults to 1. Must be a power of 2.
Max allowed value is 32.

`timeout`::
(Optional, time)
Controls the amount of time to wait for the model to deploy. Defaults to
20 seconds.

`wait_for`::
(Optional, string)
Specifies the allocation status to wait for before returning. Defaults to
`started`. The value `starting` indicates deployment is starting but not yet on
any node. The value `started` indicates the model has started on at least one
node. The value `fully_allocated` indicates the deployment has started on all
valid nodes.

[[start-trained-model-deployment-example]]
== {api-examples-title}

The following example starts a new deployment for the
`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model:

[source,console]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?wait_for=started&timeout=1m
--------------------------------------------------
// TEST[skip:TBD]

The API returns the following results:

[source,console-result]
----
{
    "assignment": {
        "task_parameters": {
            "model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english",
            "model_bytes": 265632637,
            "threads_per_allocation" : 1,
            "number_of_allocations" : 1,
            "queue_capacity" : 1024
        },
        "routing_table": {
            "uckeG3R8TLe2MMNBQ6AGrw": {
                "routing_state": "started",
                "reason": ""
            }
        },
        "assignment_state": "started",
        "start_time": "2022-11-02T11:50:34.766591Z"
    }
}
----
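
The per-node inference cache can be disabled when the deployment is started by
setting the `cache_size` query parameter to `0b`. The following request is a
minimal sketch, reusing the model from the previous example:

[source,console]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?cache_size=0b
--------------------------------------------------
// TEST[skip:TBD]

With the cache disabled, every request is evaluated by the model rather than
potentially being answered from the per-node response cache.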