[[infer-service-elser]]
=== ELSER {infer} integration

.New API reference
[sidebar]
--
For the most up-to-date API details, refer to {api-es}/group/endpoint-inference[{infer-cap} APIs].
--

Creates an {infer} endpoint to perform an {infer} task with the `elser` service.
You can also deploy ELSER by using the <>.

[NOTE]
====
* Your {es} deployment contains <>, so you only need to create an endpoint using the API if you want to customize the settings.
* The API request will automatically download and deploy the ELSER model if it isn't already downloaded.
====

[WARNING]
.Deprecated in 8.16
====
The `elser` service is deprecated and will be removed in a future release.
Use the <> instead, with `model_id` included in the `service_settings`.
====

[discrete]
[[infer-service-elser-api-request]]
==== {api-request-title}

`PUT /_inference/<task_type>/<inference_id>`

[discrete]
[[infer-service-elser-api-path-params]]
==== {api-path-parms-title}

`<inference_id>`::
(Required, string)
include::inference-shared.asciidoc[tag=inference-id]

`<task_type>`::
(Required, string)
include::inference-shared.asciidoc[tag=task-type]
+
--
Available task types:

* `sparse_embedding`.
--

[discrete]
[[infer-service-elser-api-request-body]]
==== {api-request-body-title}

`chunking_settings`::
(Optional, object)
include::inference-shared.asciidoc[tag=chunking-settings]

`max_chunk_size`:::
(Optional, integer)
include::inference-shared.asciidoc[tag=chunking-settings-max-chunking-size]

`overlap`:::
(Optional, integer)
include::inference-shared.asciidoc[tag=chunking-settings-overlap]

`sentence_overlap`:::
(Optional, integer)
include::inference-shared.asciidoc[tag=chunking-settings-sentence-overlap]

`strategy`:::
(Optional, string)
include::inference-shared.asciidoc[tag=chunking-settings-strategy]

`service`::
(Required, string)
The type of service supported for the specified task type. In this case,
`elser`.

`service_settings`::
(Required, object)
include::inference-shared.asciidoc[tag=service-settings]
+
--
These settings are specific to the `elser` service.
--

`adaptive_allocations`:::
(Optional, object)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation]

`enabled`::::
(Optional, Boolean)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-enabled]

`max_number_of_allocations`::::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-max-number]

`min_number_of_allocations`::::
(Optional, integer)
include::{es-ref-dir}/ml/ml-shared.asciidoc[tag=adaptive-allocation-min-number]

`num_allocations`:::
(Required, integer)
The total number of allocations this model is assigned across machine learning nodes.
Increasing this value generally increases the throughput.
If `adaptive_allocations` is enabled, do not set this value because it is set automatically.

`num_threads`:::
(Required, integer)
Sets the number of threads used by each model allocation during inference.
Increasing this value generally increases the speed per inference request.
The inference process is a compute-bound process; this value must not exceed the number of allocated processors available per node.
Must be a power of 2. The maximum allowed value is `32`.
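For illustration, the chunking parameters described above are set at the top level of the request body, alongside `service` and `service_settings`. The following is a minimal sketch that assumes the `sentence` chunking strategy; the endpoint name and the chunking values are illustrative only.

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-elser-model-with-chunking
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "chunking_settings": {
    "strategy": "sentence", <1>
    "max_chunk_size": 250, <2>
    "sentence_overlap": 1 <3>
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> An assumed `sentence` strategy; refer to the `strategy` parameter above for the supported values.
<2> Maximum chunk size; `250` is an illustrative value.
<3> Number of overlapping sentences between chunks; `1` is an illustrative value.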
[discrete]
[[inference-example-elser-adaptive-allocation]]
==== ELSER service example with adaptive allocations

When adaptive allocations are enabled, the number of allocations of the model is set automatically based on the current load.

NOTE: For more information on how to optimize your ELSER endpoints, refer to {ml-docs}/ml-nlp-elser.html#elser-recommendations[the ELSER recommendations] section in the model documentation.
To learn more about model autoscaling, refer to the {ml-docs}/ml-nlp-auto-scale.html[trained model autoscaling] page.

The following example shows how to create an {infer} endpoint called `my-elser-model` to perform a `sparse_embedding` task type and configure adaptive allocations.

The request below will automatically download the ELSER model if it isn't already downloaded and then deploy the model.

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 3,
      "max_number_of_allocations": 10
    },
    "num_threads": 1
  }
}
------------------------------------------------------------
// TEST[skip:TBD]

[discrete]
[[inference-example-elser]]
==== ELSER service example without adaptive allocations

The following example shows how to create an {infer} endpoint called `my-elser-model` to perform a `sparse_embedding` task type.
Refer to the {ml-docs}/ml-nlp-elser.html[ELSER model documentation] for more info.

NOTE: If you want to optimize your ELSER endpoint for ingest, set the number of threads to `1` (`"num_threads": 1`).
If you want to optimize your ELSER endpoint for search, set the number of threads to greater than `1`.

The request below will automatically download the ELSER model if it isn't already downloaded and then deploy the model.

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-elser-model
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}
------------------------------------------------------------
// TEST[skip:TBD]

Example response:

[source,console-result]
------------------------------------------------------------
{
  "inference_id": "my-elser-model",
  "task_type": "sparse_embedding",
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {}
}
------------------------------------------------------------
// NOTCONSOLE

[NOTE]
====
You might see a 502 bad gateway error in the response when using the {kib} Console.
This error usually just reflects a timeout while the model downloads in the background.
You can check the download progress in the {ml-app} UI.
If using the Python client, you can set the `timeout` parameter to a higher value.
====
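Once the endpoint is created and the model has been deployed, you can run {infer} through it with the perform inference API. The following is a minimal sketch; the endpoint name matches the examples above and the input text is illustrative.

[source,console]
------------------------------------------------------------
POST _inference/sparse_embedding/my-elser-model
{
  "input": "The sky above the port was the color of television tuned to a dead channel."
}
------------------------------------------------------------
// TEST[skip:TBD]

The response contains the predicted token-weight pairs for the input text.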