[role="xpack"]
[[put-inference-api]]
=== Create {infer} API

.New API reference
[sidebar]
--
For the most up-to-date API details, refer to {api-es}/group/endpoint-inference[{infer-cap} APIs].
--

Creates an {infer} endpoint to perform an {infer} task.

[IMPORTANT]
====
* The {infer} APIs enable you to use certain services, such as built-in {ml} models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, Mistral, Azure OpenAI, Google AI Studio, Google Vertex AI, Anthropic, Watsonx.ai, or Hugging Face.
* For built-in models and models uploaded through Eland, the {infer} APIs offer an alternative way to use and manage trained models. However, if you do not plan to use the {infer} APIs with these models, or if you want to use non-NLP models, use the <<ml-df-trained-models-apis>>.
====

[discrete]
[[put-inference-api-request]]
==== {api-request-title}

`PUT /_inference/<task_type>/<inference_id>`
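
For example, the following request sketches the creation of a `sparse_embedding` endpoint named `my-elser-endpoint`, backed by the ELSER service (the endpoint name and the service settings are illustrative; refer to <<infer-service-elser,ELSER>> for the authoritative configuration options):

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1
  }
}
------------------------------------------------------------
// TEST[skip:requires a model deployment]

Here `sparse_embedding` is the task type and `my-elser-endpoint` becomes the `inference_id` that you use to reference the endpoint in later API calls.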

[discrete]
[[put-inference-api-prereqs]]
==== {api-prereq-title}

* Requires the `manage_inference` <<privileges-list-cluster,cluster privilege>>
(the built-in `inference_admin` role grants this privilege).
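
For example, a minimal sketch of a custom role that carries this privilege, created with the create or update roles API (the role name `inference-manager-example` is illustrative):

[source,console]
------------------------------------------------------------
PUT _security/role/inference-manager-example
{
  "cluster": [ "manage_inference" ]
}
------------------------------------------------------------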

[discrete]
[[put-inference-api-path-params]]
==== {api-path-parms-title}

`<inference_id>`::
(Required, string)
include::inference-shared.asciidoc[tag=inference-id]

`<task_type>`::
(Required, string)
include::inference-shared.asciidoc[tag=task-type]
+
--
Refer to the integration list in the <<put-inference-api-desc,API description section>> for the available task types.
--

[discrete]
[[put-inference-api-desc]]
==== {api-description-title}

The create {infer} API enables you to create an {infer} endpoint and configure a {ml} model to perform a specific {infer} task.

[IMPORTANT]
====
* When creating an {infer} endpoint, the associated {ml} model is automatically deployed if it is not already running.
* After creating the endpoint, wait for the model deployment to complete before using it. You can verify the deployment status by using the <<get-trained-models-stats,get trained models statistics>> API. In the response, look for `"state": "fully_allocated"` and ensure that the `"allocation_count"` matches the `"target_allocation_count"`.
* Avoid creating multiple endpoints for the same model unless required, as each endpoint consumes significant resources.
====
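
For example, if your endpoint is backed by the ELSER model, a request along the following lines returns its deployment statistics (the model ID `.elser_model_2` is illustrative; use the ID of the model behind your endpoint):

[source,console]
------------------------------------------------------------
GET _ml/trained_models/.elser_model_2/_stats
------------------------------------------------------------
// TEST[skip:requires a deployed model]

The `state`, `allocation_count`, and `target_allocation_count` fields appear under `allocation_status` in the `deployment_stats` object of the response.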

The following integrations are available through the {infer} API.
You can find the available task types next to the integration name.
Click the links to review the configuration details of the integrations:

* <<infer-service-alibabacloud-ai-search,AlibabaCloud AI Search>> (`completion`, `rerank`, `sparse_embedding`, `text_embedding`)
* <<infer-service-amazon-bedrock,Amazon Bedrock>> (`completion`, `text_embedding`)
* <<infer-service-anthropic,Anthropic>> (`completion`)
* <<infer-service-azure-ai-studio,Azure AI Studio>> (`completion`, `text_embedding`)
* <<infer-service-azure-openai,Azure OpenAI>> (`completion`, `text_embedding`)
* <<infer-service-cohere,Cohere>> (`completion`, `rerank`, `text_embedding`)
* <<infer-service-elasticsearch,Elasticsearch>> (`rerank`, `sparse_embedding`, `text_embedding`; use this service for built-in models and models uploaded through Eland)
* <<infer-service-elser,ELSER>> (`sparse_embedding`)
* <<infer-service-google-ai-studio,Google AI Studio>> (`completion`, `text_embedding`)
* <<infer-service-google-vertex-ai,Google Vertex AI>> (`rerank`, `text_embedding`)
* <<infer-service-hugging-face,Hugging Face>> (`text_embedding`)
* <<infer-service-jinaai,JinaAI>> (`rerank`, `text_embedding`)
* <<infer-service-mistral,Mistral>> (`text_embedding`)
* <<infer-service-openai,OpenAI>> (`chat_completion`, `completion`, `text_embedding`)
* <<infer-service-watsonx-ai,Watsonx.ai>> (`text_embedding`)

The {es} and ELSER services run on a {ml} node in your {es} cluster.
The rest of the integrations connect to external services.
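
For example, a sketch of a request that creates a `text_embedding` endpoint backed by an external service, in this case OpenAI (the endpoint name and the model choice are illustrative; refer to <<infer-service-openai,OpenAI>> for the full set of settings):

[source,console]
------------------------------------------------------------
PUT _inference/text_embedding/openai-embeddings
{
  "service": "openai",
  "service_settings": {
    "api_key": "<api_key>",
    "model_id": "text-embedding-3-small"
  }
}
------------------------------------------------------------
// TEST[skip:requires an OpenAI API key]

Because this endpoint delegates to an external service, it does not deploy a model on a {ml} node in your cluster.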

[discrete]
[[adaptive-allocations-put-inference]]
==== Adaptive allocations

Adaptive allocations allow inference endpoints to dynamically adjust the number of model allocations based on the current load.

When adaptive allocations are enabled:

- The number of allocations scales up automatically when the load increases.
- Allocations scale down to a minimum of 0 when the load decreases, saving resources.
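
As a sketch, you can enable adaptive allocations through the `adaptive_allocations` object in the `service_settings` when you create the endpoint (the endpoint name and the allocation bounds are illustrative):

[source,console]
------------------------------------------------------------
PUT _inference/sparse_embedding/my-autoscaled-endpoint
{
  "service": "elser",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 4
    },
    "num_threads": 1
  }
}
------------------------------------------------------------
// TEST[skip:requires a model deployment]

Setting `min_number_of_allocations` to `1` keeps one allocation warm at all times; omit it if you want allocations to scale down to 0.
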
For more information about adaptive allocations and resources, refer to the {ml-docs}/ml-nlp-auto-scale.html[trained model autoscaling] documentation.