[[infer-service-elasticsearch]]
=== Elasticsearch {infer} service

Creates an {infer} endpoint to perform an {infer} task with the `elasticsearch` service.

NOTE: If you use the ELSER or the E5 model through the `elasticsearch` service, the API request will automatically download and deploy the model if it isn't downloaded yet.


[discrete]
[[infer-service-elasticsearch-api-request]]
==== {api-request-title}

`PUT /_inference/<task_type>/<inference_id>`

[discrete]
[[infer-service-elasticsearch-api-path-params]]
==== {api-path-parms-title}

`<inference_id>`::
(Required, string)
include::inference-shared.asciidoc[tag=inference-id]

`<task_type>`::
(Required, string)
include::inference-shared.asciidoc[tag=task-type]
+
--
Available task types:

* `rerank`,
* `sparse_embedding`,
* `text_embedding`.
--

[discrete]
[[infer-service-elasticsearch-api-request-body]]
==== {api-request-body-title}

`service`::
(Required, string)
The type of service supported for the specified task type. In this case,
`elasticsearch`.

`service_settings`::
(Required, object)
include::inference-shared.asciidoc[tag=service-settings]
+
--
These settings are specific to the `elasticsearch` service.
--

`adaptive_allocations`:::
(Optional, object)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]

`deployment_id`:::
(Optional, string)
The `deployment_id` of an existing trained model deployment.
When `deployment_id` is used, the `model_id` is optional.

`enabled`::::
(Optional, Boolean)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]

`max_number_of_allocations`::::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]

`min_number_of_allocations`::::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]

`model_id`:::
(Required, string)
The name of the model to use for the {infer} task.
It can be the ID of either a built-in model (for example, `.multilingual-e5-small` for E5) or a text embedding model already
{ml-docs}/ml-nlp-import-model.html#ml-nlp-import-script[uploaded through Eland].

`num_allocations`:::
(Required, integer)
The total number of allocations this model is assigned across machine learning nodes.
Increasing this value generally increases the throughput.
If `adaptive_allocations` is enabled, do not set this value, because it's automatically set.

`num_threads`:::
(Required, integer)
Sets the number of threads used by each model allocation during inference.
Increasing this value generally increases the speed per inference request.
The inference process is a compute-bound process; `num_threads` must not exceed the number of available allocated processors per node.
Must be a power of 2. Max allowed value is 32.

`task_settings`::
(Optional, object)
include::inference-shared.asciidoc[tag=task-settings]
+
.`task_settings` for the `rerank` task type
[%collapsible%closed]
=====
`return_documents`:::
(Optional, Boolean)
Returns the document instead of only the index. Defaults to `true`.
=====

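For illustration, the request below sketches a `rerank` endpoint that sets the `return_documents` task setting. The `model_id` shown is hypothetical; replace it with the ID of a rerank-capable model that has already been uploaded through Eland.

[source,console]
------------------------------------------------------------
PUT _inference/rerank/my-rerank-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": "my-eland-rerank-model" <1>
  },
  "task_settings": {
    "return_documents": true <2>
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> A hypothetical `model_id`, used for illustration only.
<2> Returns the reranked documents in the response instead of only their indices.
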
[discrete]
[[inference-example-elasticsearch-elser]]
==== ELSER via the `elasticsearch` service

The following example shows how to create an {infer} endpoint called `my-elser-model` to perform a `sparse_embedding` task type.

The API request below will automatically download the ELSER model if it isn't already downloaded and then deploy the model.

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elasticsearch",
  "service_settings": {
    "adaptive_allocations": { <1>
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 10
    },
    "num_threads": 1,
    "model_id": ".elser_model_2" <2>
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> Adaptive allocations will be enabled with a minimum of 1 and a maximum of 10 allocations.
<2> The `model_id` must be the ID of one of the built-in ELSER models. Valid values are `.elser_model_2` and `.elser_model_2_linux-x86_64`. For further details, refer to the {ml-docs}/ml-nlp-elser.html[ELSER model documentation].

[discrete]
[[inference-example-elasticsearch]]
==== E5 via the `elasticsearch` service

The following example shows how to create an {infer} endpoint called `my-e5-model` to perform a `text_embedding` task type.

The API request below will automatically download the E5 model if it isn't already downloaded and then deploy the model.

[source,console]
------------------------------------------------------------
PUT _inference/text_embedding/my-e5-model
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".multilingual-e5-small" <1>
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The `model_id` must be the ID of one of the built-in E5 models. Valid values are `.multilingual-e5-small` and `.multilingual-e5-small_linux-x86_64`. For further details, refer to the {ml-docs}/ml-nlp-e5.html[E5 model documentation].

[NOTE]
====
You might see a 502 bad gateway error in the response when using the {kib} Console.
This error usually just reflects a timeout, while the model downloads in the background.
You can check the download progress in the {ml-app} UI.
If using the Python client, you can set the `timeout` parameter to a higher value.
====

[discrete]
[[inference-example-eland]]
==== Models uploaded by Eland via the `elasticsearch` service

The following example shows how to create an {infer} endpoint called `my-msmarco-minilm-model` to perform a `text_embedding` task type.

[source,console]
------------------------------------------------------------
PUT _inference/text_embedding/my-msmarco-minilm-model <1>
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": "msmarco-MiniLM-L12-cos-v5" <2>
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> Provide a unique identifier for the {infer} endpoint. The `inference_id` must be unique and must not match the `model_id`.
<2> The `model_id` must be the ID of a text embedding model which has already been {ml-docs}/ml-nlp-import-model.html#ml-nlp-import-script[uploaded through Eland].

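After the endpoint is created, you can send it a quick test request with the {infer} API. The input text below is arbitrary and serves only to verify that the endpoint returns embeddings.

[source,console]
------------------------------------------------------------
POST _inference/text_embedding/my-msmarco-minilm-model
{
  "input": "The quick brown fox jumps over the lazy dog"
}
------------------------------------------------------------
// TEST[skip:TBD]
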
[discrete]
[[inference-example-adaptive-allocation]]
==== Setting adaptive allocations for E5 via the `elasticsearch` service

The following example shows how to create an {infer} endpoint called `my-e5-model` to perform a `text_embedding` task type and configure adaptive allocations.

The API request below will automatically download the E5 model if it isn't already downloaded and then deploy the model.

[source,console]
------------------------------------------------------------
PUT _inference/text_embedding/my-e5-model
{
  "service": "elasticsearch",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 3,
      "max_number_of_allocations": 10
    },
    "num_threads": 1,
    "model_id": ".multilingual-e5-small"
  }
}
------------------------------------------------------------
// TEST[skip:TBD]

[discrete]
[[inference-example-existing-deployment]]
==== Using an existing model deployment with the `elasticsearch` service

The following example shows how to use an already existing model deployment when creating an {infer} endpoint.

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/use_existing_deployment
{
  "service": "elasticsearch",
  "service_settings": {
    "deployment_id": ".elser_model_2" <1>
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The `deployment_id` of the already existing model deployment.

The API response contains the `model_id`, and the threads and allocations settings from the model deployment:

[source,console-result]
------------------------------------------------------------
{
  "inference_id": "use_existing_deployment",
  "task_type": "sparse_embedding",
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 2,
    "num_threads": 1,
    "model_id": ".elser_model_2",
    "deployment_id": ".elser_model_2"
  },
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 250,
    "sentence_overlap": 1
  }
}
------------------------------------------------------------
// NOTCONSOLE

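To verify how the endpoint was configured after creation, you can retrieve it with a GET request. This optional check returns the same configuration shown in the response above.

[source,console]
------------------------------------------------------------
GET _inference/sparse_embedding/use_existing_deployment
------------------------------------------------------------
// TEST[skip:TBD]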