[role="xpack"]
[[inference-apis]]
== {infer-cap} APIs

IMPORTANT: The {infer} APIs enable you to use certain services, such as built-in
{ml} models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, Azure,
Google AI Studio, or Hugging Face. For built-in models and models uploaded
through Eland, the {infer} APIs offer an alternative way to use and manage
trained models. However, if you do not plan to use the {infer} APIs with these
models, or if you want to use non-NLP models, use the
<<ml-df-trained-models-apis>>.

.New API reference
[sidebar]
--
For the most up-to-date API details, refer to {api-es}/group/endpoint-inference[{infer-cap} APIs].
--

The {infer} APIs enable you to create {infer} endpoints and integrate with {ml} models from different services, such as Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, or Hugging Face.
Use the following APIs to manage {infer} models and perform {infer}:

* <<delete-inference-api>>
* <<get-inference-api>>
* <<post-inference-api>>
* <<put-inference-api>>
* <<stream-inference-api>>
* <<chat-completion-inference-api>>
* <<update-inference-api>>

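For example, you can list every configured {infer} endpoint with the get {infer} API; the response contains each endpoint's `inference_id`, task type, and configuration:

[source,console]
------------------------------------------------------------
GET _inference/_all
------------------------------------------------------------
// TEST[skip:TBD]
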
[[inference-landscape]]
.A representation of the Elastic inference landscape
image::images/inference-landscape.jpg[A representation of the Elastic inference landscape,align="center"]

An {infer} endpoint enables you to use the corresponding {ml} model without
manual deployment and apply it to your data at ingestion time through
<<semantic-search-semantic-text, semantic text>>.

Choose a model from your service or use ELSER (a retrieval model trained by Elastic), then create an {infer} endpoint with the <<put-inference-api>>.
You can then use <<semantic-search-semantic-text, semantic text>> to perform <<semantic-search, semantic search>> on your data, as in the sketch below.

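Once an index has a `semantic_text` field backed by an {infer} endpoint, a query such as the following performs {infer} at search time (a minimal sketch; the index name `my-index` and the field name `content` are hypothetical):

[source,console]
------------------------------------------------------------
GET my-index/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "Which service should I choose for multilingual text?"
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
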
[discrete]
[[adaptive-allocations]]
=== Adaptive allocations

Adaptive allocations allow inference services to dynamically adjust the number of model allocations based on the current load.

When adaptive allocations are enabled:

* The number of allocations scales up automatically when the load increases.
* Allocations scale down to a minimum of 0 when the load decreases, saving resources.

For more information about adaptive allocations and resources, refer to the {ml-docs}/ml-nlp-auto-scale.html[trained model autoscaling] documentation.

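For example, the following request creates an endpoint with adaptive allocations enabled (a minimal sketch; the endpoint name `my-elser-endpoint` and the allocation limits are illustrative):

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 0,
      "max_number_of_allocations": 4
    },
    "num_threads": 1
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
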
[discrete]
[[default-enpoints]]
=== Default {infer} endpoints

Your {es} deployment contains preconfigured {infer} endpoints, which makes it easier to use them when defining `semantic_text` fields or {infer} processors.
The following list contains the default {infer} endpoints listed by `inference_id`:

* `.elser-2-elasticsearch`: uses the {ml-docs}/ml-nlp-elser.html[ELSER] built-in trained model for `sparse_embedding` tasks (recommended for English language texts).
The `model_id` is `.elser_model_2_linux-x86_64`.
* `.multilingual-e5-small-elasticsearch`: uses the {ml-docs}/ml-nlp-e5.html[E5] built-in trained model for `text_embedding` tasks (recommended for non-English language texts).
The `model_id` is `.e5_model_2_linux-x86_64`.

Use the `inference_id` of the endpoint in a <<semantic-text,`semantic_text`>> field definition or when creating an <<inference-processor,{infer} processor>>, as shown in the sketch below.
The API call will automatically download and deploy the model, which might take a couple of minutes.
Default {infer} endpoints have {ml-docs}/ml-nlp-auto-scale.html#nlp-model-adaptive-allocations[adaptive allocations] enabled.
For these models, the minimum number of allocations is `0`.
If there is no {infer} activity that uses the endpoint, the number of allocations will scale down to `0` automatically after 15 minutes.

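For example, a `semantic_text` field that uses the default ELSER endpoint can be defined as follows (a minimal sketch; the index name `my-index` and the field name `content` are hypothetical):

[source,console]
------------------------------------------------------------
PUT my-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": ".elser-2-elasticsearch"
      }
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
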
[discrete]
[[infer-chunking-config]]
=== Configuring chunking

{infer-cap} endpoints have a limit on the amount of text they can process at once, determined by the model's input capacity.
Chunking is the process of splitting the input text into pieces that remain within these limits.
It occurs when ingesting documents into <<semantic-text,`semantic_text` fields>>.
Chunking also helps produce sections that are digestible for humans.
Returning a long document in search results is less useful than providing the most relevant chunk of text.

Each chunk includes the text subpassage and the corresponding embedding generated from it.

By default, documents are split into sentences and grouped into sections of up to 250 words, with a one-sentence overlap so that each chunk shares a sentence with the previous chunk.
Overlapping ensures continuity and prevents vital contextual information in the input text from being lost by a hard break.

{es} uses the https://unicode-org.github.io/icu-docs/[ICU4J] library to detect word and sentence boundaries for chunking.
https://unicode-org.github.io/icu/userguide/boundaryanalysis/#word-boundary[Word boundaries] are identified by following a series of rules, not just the presence of a whitespace character.
For written languages that do not use whitespace, such as Chinese or Japanese, dictionary lookups are used to detect word boundaries.

[discrete]
==== Chunking strategies

Two strategies are available for chunking: `sentence` and `word`.

The `sentence` strategy splits the input text at sentence boundaries.
Each chunk contains one or more complete sentences, ensuring that the integrity of sentence-level context is preserved, except when a sentence causes a chunk to exceed a word count of `max_chunk_size`, in which case it will be split across chunks.
The `sentence_overlap` option defines the number of sentences from the previous chunk to include in the current chunk, which can be either `0` or `1`.

The `word` strategy splits the input text on individual words up to the `max_chunk_size` limit.
The `overlap` option is the number of words from the previous chunk to include in the current chunk.

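For example, a `word` strategy configuration might look like the following `chunking_settings` fragment (a sketch; the values are illustrative):

[source,js]
------------------------------------------------------------
"chunking_settings": {
  "strategy": "word",
  "max_chunk_size": 100,
  "overlap": 20
}
------------------------------------------------------------
// NOTCONSOLE
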
The default chunking strategy is `sentence`.

NOTE: The default chunking strategy for {infer} endpoints created before 8.16 is `word`.

[discrete]
==== Example of configuring the chunking behavior

The following example creates an {infer} endpoint with the `elasticsearch` service that deploys the ELSER model by default and configures the chunking behavior.

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/small_chunk_size
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 100,
    "sentence_overlap": 0
  }
}
------------------------------------------------------------
// TEST[skip:TBD]

include::delete-inference.asciidoc[]
include::get-inference.asciidoc[]
include::post-inference.asciidoc[]
include::chat-completion-inference.asciidoc[]
include::put-inference.asciidoc[]
include::stream-inference.asciidoc[]
include::update-inference.asciidoc[]
include::elastic-infer-service.asciidoc[]
include::service-alibabacloud-ai-search.asciidoc[]
include::service-amazon-bedrock.asciidoc[]
include::service-anthropic.asciidoc[]
include::service-azure-ai-studio.asciidoc[]
include::service-azure-openai.asciidoc[]
include::service-cohere.asciidoc[]
include::service-elasticsearch.asciidoc[]
include::service-elser.asciidoc[]
include::service-google-ai-studio.asciidoc[]
include::service-google-vertex-ai.asciidoc[]
include::service-hugging-face.asciidoc[]
include::service-jinaai.asciidoc[]
include::service-mistral.asciidoc[]
include::service-openai.asciidoc[]
include::service-voyageai.asciidoc[]
include::service-watsonx-ai.asciidoc[]