[role="xpack"]
[[inference-apis]]
== {infer-cap} APIs

IMPORTANT: The {infer} APIs enable you to use certain services, such as built-in
{ml} models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, Azure,
Google AI Studio, or Hugging Face. For built-in models and models uploaded
through Eland, the {infer} APIs offer an alternative way to use and manage
trained models. However, if you do not plan to use the {infer} APIs with these
models, or if you want to use non-NLP models, use the
<<ml-df-trained-models-apis>>.

.New API reference
[sidebar]
--
For the most up-to-date API details, refer to {api-es}/group/endpoint-inference[{infer-cap} APIs].
--

The {infer} APIs enable you to create {infer} endpoints and integrate with {ml} models from different services, such as Amazon Bedrock, Anthropic, Azure AI Studio, Cohere, Google AI, Mistral, OpenAI, or Hugging Face.
Use the following APIs to manage {infer} models and perform {infer}:

* <<delete-inference-api>>
* <<get-inference-api>>
* <<post-inference-api>>
* <<put-inference-api>>
* <<stream-inference-api>>
* <<chat-completion-inference-api>>
* <<update-inference-api>>

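For example, you can list every configured {infer} endpoint with the get {infer} API; the response contains each endpoint's `inference_id`, task type, and configuration:

[source,console]
------------------------------------------------------------
GET _inference/_all
------------------------------------------------------------
// TEST[skip:TBD]
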
[[inference-landscape]]
.A representation of the Elastic inference landscape
image::images/inference-landscape.jpg[A representation of the Elastic inference landscape,align="center"]

An {infer} endpoint enables you to use the corresponding {ml} model without
manual deployment and apply it to your data at ingestion time through
<<semantic-search-semantic-text, semantic text>>.

Choose a model from your service or use ELSER (a retrieval model trained by Elastic), then create an {infer} endpoint with the <<put-inference-api>>.
You can then use <<semantic-search-semantic-text, semantic text>> to perform <<semantic-search, semantic search>> on your data, as in the sketch below.

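Once an index has a `semantic_text` field backed by an {infer} endpoint, a query such as the following performs {infer} at search time (a minimal sketch; the index name `my-index` and the field name `content` are hypothetical):

[source,console]
------------------------------------------------------------
GET my-index/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "Which service should I choose for multilingual text?"
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
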
[discrete]
[[adaptive-allocations]]
=== Adaptive allocations

Adaptive allocations allow inference services to dynamically adjust the number of model allocations based on the current load.

When adaptive allocations are enabled:

* The number of allocations scales up automatically when the load increases.
* Allocations scale down to a minimum of 0 when the load decreases, saving resources.

For more information about adaptive allocations and resources, refer to the {ml-docs}/ml-nlp-auto-scale.html[trained model autoscaling] documentation.

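For example, the following request creates an endpoint with adaptive allocations enabled (a minimal sketch; the endpoint name `my-elser-endpoint` and the allocation limits are illustrative):

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 0,
      "max_number_of_allocations": 4
    },
    "num_threads": 1
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
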
[discrete]
[[default-enpoints]]
=== Default {infer} endpoints

Your {es} deployment contains preconfigured {infer} endpoints, which makes it easier to use them when defining `semantic_text` fields or {infer} processors.
The following list contains the default {infer} endpoints listed by `inference_id`:

* `.elser-2-elasticsearch`: uses the {ml-docs}/ml-nlp-elser.html[ELSER] built-in trained model for `sparse_embedding` tasks (recommended for English language texts).
The `model_id` is `.elser_model_2_linux-x86_64`.
* `.multilingual-e5-small-elasticsearch`: uses the {ml-docs}/ml-nlp-e5.html[E5] built-in trained model for `text_embedding` tasks (recommended for non-English language texts).
The `model_id` is `.e5_model_2_linux-x86_64`.

Use the `inference_id` of the endpoint in a <<semantic-text,`semantic_text`>> field definition or when creating an <<inference-processor,{infer} processor>>, as shown in the sketch below.
The API call will automatically download and deploy the model, which might take a couple of minutes.
Default {infer} endpoints have {ml-docs}/ml-nlp-auto-scale.html#nlp-model-adaptive-allocations[adaptive allocations] enabled.
For these models, the minimum number of allocations is `0`.
If there is no {infer} activity that uses the endpoint, the number of allocations will scale down to `0` automatically after 15 minutes.

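For example, a `semantic_text` field that uses the default ELSER endpoint can be defined as follows (a minimal sketch; the index name `my-index` and the field name `content` are hypothetical):

[source,console]
------------------------------------------------------------
PUT my-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": ".elser-2-elasticsearch"
      }
    }
  }
}
------------------------------------------------------------
// TEST[skip:TBD]
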
[discrete]
[[infer-chunking-config]]
=== Configuring chunking

{infer-cap} endpoints have a limit on the amount of text they can process at once, determined by the model's input capacity.
Chunking is the process of splitting the input text into pieces that remain within these limits.
It occurs when ingesting documents into <<semantic-text,`semantic_text` fields>>.
Chunking also helps produce sections that are digestible for humans.
Returning a long document in search results is less useful than providing the most relevant chunk of text.

Each chunk includes the text subpassage and the corresponding embedding generated from it.

By default, documents are split into sentences and grouped into sections of up to 250 words, with a one-sentence overlap so that each chunk shares a sentence with the previous chunk.
Overlapping ensures continuity and prevents vital contextual information in the input text from being lost by a hard break.

{es} uses the https://unicode-org.github.io/icu-docs/[ICU4J] library to detect word and sentence boundaries for chunking.
https://unicode-org.github.io/icu/userguide/boundaryanalysis/#word-boundary[Word boundaries] are identified by following a series of rules, not just the presence of a whitespace character.
For written languages that do not use whitespace, such as Chinese or Japanese, dictionary lookups are used to detect word boundaries.

[discrete]
==== Chunking strategies

Two strategies are available for chunking: `sentence` and `word`.

The `sentence` strategy splits the input text at sentence boundaries.
Each chunk contains one or more complete sentences, ensuring that the integrity of sentence-level context is preserved, except when a sentence causes a chunk to exceed a word count of `max_chunk_size`, in which case it will be split across chunks.
The `sentence_overlap` option defines the number of sentences from the previous chunk to include in the current chunk, which can be either `0` or `1`.

The `word` strategy splits the input text on individual words up to the `max_chunk_size` limit.
The `overlap` option is the number of words from the previous chunk to include in the current chunk.

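For example, a `word` strategy configuration might look like the following `chunking_settings` fragment (a sketch; the values are illustrative):

[source,js]
------------------------------------------------------------
"chunking_settings": {
  "strategy": "word",
  "max_chunk_size": 100,
  "overlap": 20
}
------------------------------------------------------------
// NOTCONSOLE
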
The default chunking strategy is `sentence`.

NOTE: The default chunking strategy for {infer} endpoints created before 8.16 is `word`.

[discrete]
==== Example of configuring the chunking behavior

The following example creates an {infer} endpoint with the `elasticsearch` service that deploys the ELSER model by default and configures the chunking behavior.

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/small_chunk_size
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  },
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 100,
    "sentence_overlap": 0
  }
}
------------------------------------------------------------
// TEST[skip:TBD]

include::delete-inference.asciidoc[]
include::get-inference.asciidoc[]
include::post-inference.asciidoc[]
include::chat-completion-inference.asciidoc[]
include::put-inference.asciidoc[]
include::stream-inference.asciidoc[]
include::update-inference.asciidoc[]
include::elastic-infer-service.asciidoc[]
include::service-alibabacloud-ai-search.asciidoc[]
include::service-amazon-bedrock.asciidoc[]
include::service-anthropic.asciidoc[]
include::service-azure-ai-studio.asciidoc[]
include::service-azure-openai.asciidoc[]
include::service-cohere.asciidoc[]
include::service-elasticsearch.asciidoc[]
include::service-elser.asciidoc[]
include::service-google-ai-studio.asciidoc[]
include::service-google-vertex-ai.asciidoc[]
include::service-hugging-face.asciidoc[]
include::service-jinaai.asciidoc[]
include::service-mistral.asciidoc[]
include::service-openai.asciidoc[]
include::service-voyageai.asciidoc[]
include::service-watsonx-ai.asciidoc[]