[[infer-service-elasticsearch]]
=== Elasticsearch {infer} integration

.New API reference
[sidebar]
--
For the most up-to-date API details, refer to {api-es}/group/endpoint-inference[{infer-cap} APIs].
--

Creates an {infer} endpoint to perform an {infer} task with the `elasticsearch` service.

[NOTE]
====
* Your {es} deployment contains <<default-enpoints,preconfigured ELSER and E5 {infer} endpoints>>; you only need to create endpoints through the API if you want to customize the settings.
* If you use the ELSER or the E5 model through the `elasticsearch` service, the API request automatically downloads and deploys the model if it isn't downloaded yet.
====

[discrete]
[[infer-service-elasticsearch-api-request]]
==== {api-request-title}

`PUT /_inference/<task_type>/<inference_id>`

[discrete]
[[infer-service-elasticsearch-api-path-params]]
==== {api-path-parms-title}

`<inference_id>`::
(Required, string)
include::inference-shared.asciidoc[tag=inference-id]

`<task_type>`::
(Required, string)
include::inference-shared.asciidoc[tag=task-type]
+
--
Available task types:

* `rerank`,
* `sparse_embedding`,
* `text_embedding`.
--

[discrete]
[[infer-service-elasticsearch-api-request-body]]
==== {api-request-body-title}

`chunking_settings`::
(Optional, object)
include::inference-shared.asciidoc[tag=chunking-settings]
+
An example request that overrides the default chunking settings is shown after this parameter list.

`max_chunk_size`:::
(Optional, integer)
include::inference-shared.asciidoc[tag=chunking-settings-max-chunking-size]

`overlap`:::
(Optional, integer)
include::inference-shared.asciidoc[tag=chunking-settings-overlap]

`sentence_overlap`:::
(Optional, integer)
include::inference-shared.asciidoc[tag=chunking-settings-sentence-overlap]

`strategy`:::
(Optional, string)
include::inference-shared.asciidoc[tag=chunking-settings-strategy]

`service`::
(Required, string)
The type of service supported for the specified task type. In this case,
`elasticsearch`.

`service_settings`::
(Required, object)
include::inference-shared.asciidoc[tag=service-settings]
+
--
These settings are specific to the `elasticsearch` service.
--

`deployment_id`:::
(Optional, string)
The `deployment_id` of an existing trained model deployment.
When `deployment_id` is used, the `model_id` is optional.

`adaptive_allocations`:::
(Optional, object)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]

`enabled`::::
(Optional, Boolean)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]

`max_number_of_allocations`::::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]

`min_number_of_allocations`::::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]

`model_id`:::
(Required, string)
The name of the model to use for the {infer} task.
It can be the ID of a built-in model (for example, `.multilingual-e5-small` for E5) or of a text embedding model already
{ml-docs}/ml-nlp-import-model.html#ml-nlp-import-script[uploaded through Eland].

`num_allocations`:::
(Required, integer)
The total number of allocations this model is assigned across machine learning nodes.
Increasing this value generally increases the throughput.
If `adaptive_allocations` is enabled, do not set this value, because it's set automatically.

`num_threads`:::
(Required, integer)
Sets the number of threads used by each model allocation during inference. Increasing this value generally increases the speed per inference request. The inference process is compute-bound, so `num_threads` must not exceed the number of available allocated processors per node.
Must be a power of 2. Max allowed value is 32.

`task_settings`::
(Optional, object)
include::inference-shared.asciidoc[tag=task-settings]
+
.`task_settings` for the `rerank` task type
[%collapsible%closed]
=====
`return_documents`:::
(Optional, Boolean)
If `true`, the response returns the document text in addition to its index. Defaults to `true`.
=====
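
For example, the default chunking settings can be overridden when the endpoint is created. The following is a minimal sketch that assumes the built-in `.elser_model_2` model; the endpoint name `my-chunked-elser` and the chunking values are illustrative:

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-chunked-elser
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".elser_model_2"
  },
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 250,
    "sentence_overlap": 1
  }
}
------------------------------------------------------------
// TEST[skip:TBD]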

[discrete]
[[inference-example-elasticsearch-elser]]
==== ELSER via the `elasticsearch` service

The following example shows how to create an {infer} endpoint called `my-elser-model` to perform a `sparse_embedding` task type.

The API request below will automatically download the ELSER model if it isn't already downloaded and then deploy the model.

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elasticsearch",
  "service_settings": {
    "adaptive_allocations": { <1>
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 4
    },
    "num_threads": 1,
    "model_id": ".elser_model_2" <2>
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> Adaptive allocations will be enabled with a minimum of 1 and a maximum of 4 allocations.
<2> The `model_id` must be the ID of one of the built-in ELSER models.
Valid values are `.elser_model_2` and `.elser_model_2_linux-x86_64`.
For further details, refer to the {ml-docs}/ml-nlp-elser.html[ELSER model documentation].
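
Once the endpoint is created, you can (for example) generate sparse embeddings by sending text to it with the {infer} API. This is a usage sketch; the input string is illustrative and the response is omitted:

[source,console]
------------------------------------------------------------
POST _inference/sparse_embedding/my-elser-model
{
  "input": "How do I deploy ELSER on my cluster?"
}
------------------------------------------------------------
// TEST[skip:TBD]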

[discrete]
[[inference-example-elastic-reranker]]
==== Elastic Rerank via the `elasticsearch` service

The following example shows how to create an {infer} endpoint called `my-elastic-rerank` to perform a `rerank` task type using the built-in Elastic Rerank cross-encoder model.

The API request below will automatically download the Elastic Rerank model if it isn't already downloaded and then deploy the model.
Once deployed, the model can be used for semantic re-ranking with a <<text-similarity-reranker-retriever-example-elastic-rerank,`text_similarity_reranker` retriever>>.

[source,console]
|
|
------------------------------------------------------------
|
|
PUT _inference/rerank/my-elastic-rerank
|
|
{
|
|
"service": "elasticsearch",
|
|
"service_settings": {
|
|
"model_id": ".rerank-v1", <1>
|
|
"num_threads": 1,
|
|
"adaptive_allocations": { <2>
|
|
"enabled": true,
|
|
"min_number_of_allocations": 1,
|
|
"max_number_of_allocations": 4
|
|
}
|
|
}
|
|
}
|
|
------------------------------------------------------------
|
|
// TEST[skip:TBD]
|
|
<1> The `model_id` must be the ID of the built-in Elastic Rerank model: `.rerank-v1`.
|
|
<2> {ml-docs}/ml-nlp-auto-scale.html#nlp-model-adaptive-allocations[Adaptive allocations] will be enabled with the minimum of 1 and the maximum of 10 allocations.
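
After the endpoint is created, it can (for example) be called directly to re-rank a list of documents against a query. This is a usage sketch; the query and input strings are illustrative and the response is omitted:

[source,console]
------------------------------------------------------------
POST _inference/rerank/my-elastic-rerank
{
  "query": "What is Elastic Rerank?",
  "input": [
    "Elastic Rerank is a cross-encoder model for semantic re-ranking.",
    "Painless is a scripting language designed for Elasticsearch."
  ]
}
------------------------------------------------------------
// TEST[skip:TBD]

To receive only the ranked indices and scores, set the `return_documents` task setting to `false` when you create the endpoint.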

[discrete]
[[inference-example-elasticsearch]]
==== E5 via the `elasticsearch` service

The following example shows how to create an {infer} endpoint called `my-e5-model` to perform a `text_embedding` task type.

The API request below will automatically download the E5 model if it isn't already downloaded and then deploy the model.

[source,console]
------------------------------------------------------------
PUT _inference/text_embedding/my-e5-model
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".multilingual-e5-small" <1>
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The `model_id` must be the ID of one of the built-in E5 models.
Valid values are `.multilingual-e5-small` and `.multilingual-e5-small_linux-x86_64`.
For further details, refer to the {ml-docs}/ml-nlp-e5.html[E5 model documentation].

[NOTE]
====
You might see a 502 bad gateway error in the response when using the {kib} Console.
This error usually just reflects a timeout while the model downloads in the background.
You can check the download progress in the {ml-app} UI.
If using the Python client, you can set the `timeout` parameter to a higher value.
====
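
Once the model is deployed, you can (for example) generate text embeddings with the endpoint. This is a usage sketch; the input string is illustrative and the response is omitted:

[source,console]
------------------------------------------------------------
POST _inference/text_embedding/my-e5-model
{
  "input": "The quick brown fox jumps over the lazy dog."
}
------------------------------------------------------------
// TEST[skip:TBD]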

[discrete]
[[inference-example-eland]]
==== Models uploaded by Eland via the `elasticsearch` service

The following example shows how to create an {infer} endpoint called
`my-msmarco-minilm-model` to perform a `text_embedding` task type.

[source,console]
------------------------------------------------------------
PUT _inference/text_embedding/my-msmarco-minilm-model <1>
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": "msmarco-MiniLM-L12-cos-v5" <2>
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> Provide a unique identifier for the {infer} endpoint. The `inference_id` must be unique and must not match the `model_id`.
<2> The `model_id` must be the ID of a text embedding model which has already been
{ml-docs}/ml-nlp-import-model.html#ml-nlp-import-script[uploaded through Eland].
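
You can (for example) verify the new endpoint with the GET {infer} API. This is a sketch; the response, which echoes the endpoint configuration, is omitted:

[source,console]
------------------------------------------------------------
GET _inference/text_embedding/my-msmarco-minilm-model
------------------------------------------------------------
// TEST[skip:TBD]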

[discrete]
[[inference-example-adaptive-allocation]]
==== Setting adaptive allocations for E5 via the `elasticsearch` service

The following example shows how to create an {infer} endpoint called
`my-e5-model` to perform a `text_embedding` task type and configure adaptive
allocations.

The API request below will automatically download the E5 model if it isn't
already downloaded and then deploy the model.

[source,console]
------------------------------------------------------------
PUT _inference/text_embedding/my-e5-model
{
  "service": "elasticsearch",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 3,
      "max_number_of_allocations": 10
    },
    "num_threads": 1,
    "model_id": ".multilingual-e5-small"
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
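
With adaptive allocations enabled, the number of allocations scales between the configured minimum and maximum based on load. You can (for example) check the current allocation count with the get trained models statistics API. This is a sketch; the response is omitted:

[source,console]
------------------------------------------------------------
GET _ml/trained_models/.multilingual-e5-small/_stats
------------------------------------------------------------
// TEST[skip:TBD]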

[discrete]
[[inference-example-existing-deployment]]
==== Using an existing model deployment with the `elasticsearch` service

The following example shows how to use an already existing model deployment when creating an {infer} endpoint.

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/use_existing_deployment
{
  "service": "elasticsearch",
  "service_settings": {
    "deployment_id": ".elser_model_2" <1>
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The `deployment_id` of the already existing model deployment.

The API response contains the `model_id` and the thread and allocation settings from the model deployment:

[source,console-result]
------------------------------------------------------------
{
  "inference_id": "use_existing_deployment",
  "task_type": "sparse_embedding",
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 2,
    "num_threads": 1,
    "model_id": ".elser_model_2",
    "deployment_id": ".elser_model_2"
  },
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 250,
    "sentence_overlap": 1
  }
}
------------------------------------------------------------
// NOTCONSOLE
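
If the deployment does not exist yet, it can (for example) be started beforehand with the start trained model deployment API. This is a sketch assuming the ELSER model; the `deployment_id` query parameter names the deployment that the {infer} endpoint then references:

[source,console]
------------------------------------------------------------
POST _ml/trained_models/.elser_model_2/deployment/_start?deployment_id=.elser_model_2
------------------------------------------------------------
// TEST[skip:TBD]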