[role="xpack"] [[put-inference-api]] === Create {infer} API .New API reference [sidebar] -- For the most up-to-date API details, refer to {api-es}/group/endpoint-inference[{infer-cap} APIs]. -- Creates an {infer} endpoint to perform an {infer} task. [IMPORTANT] ==== * The {infer} APIs enable you to use certain services, such as built-in {ml} models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, Mistral, Azure OpenAI, Google AI Studio, Google Vertex AI, Anthropic, Watsonx.ai, or Hugging Face. * For built-in models and models uploaded through Eland, the {infer} APIs offer an alternative way to use and manage trained models. However, if you do not plan to use the {infer} APIs to use these models or if you want to use non-NLP models, use the <>. ==== [discrete] [[put-inference-api-request]] ==== {api-request-title} `PUT /_inference//` [discrete] [[put-inference-api-prereqs]] ==== {api-prereq-title} * Requires the `manage_inference` <> (the built-in `inference_admin` role grants this privilege) [discrete] [[put-inference-api-path-params]] ==== {api-path-parms-title} ``:: (Required, string) include::inference-shared.asciidoc[tag=inference-id] ``:: (Required, string) include::inference-shared.asciidoc[tag=task-type] + -- Refer to the integration list in the <> for the available task types. -- [discrete] [[put-inference-api-desc]] ==== {api-description-title} The create {infer} API enables you to create an {infer} endpoint and configure a {ml} model to perform a specific {infer} task. [IMPORTANT] ==== * When creating an {infer} endpoint, the associated {ml} model is automatically deployed if it is not already running. * After creating the endpoint, wait for the model deployment to complete before using it. You can verify the deployment status by using the <> API. In the response, look for `"state": "fully_allocated"` and ensure the `"allocation_count"` matches the `"target_allocation_count"`. * Avoid creating multiple endpoints for the same model unless required, as each endpoint consumes significant resources. ==== The following integrations are available through the {infer} API. You can find the available task types next to the integration name. Click the links to review the configuration details of the integrations: * <> (`completion`, `rerank`, `sparse_embedding`, `text_embedding`) * <> (`completion`, `text_embedding`) * <> (`completion`) * <> (`completion`, `text_embedding`) * <> (`completion`, `text_embedding`) * <> (`completion`, `rerank`, `text_embedding`) * <> (`rerank`, `sparse_embedding`, `text_embedding` - this service is for built-in models and models uploaded through Eland) * <> (`sparse_embedding`) * <> (`completion`, `text_embedding`) * <> (`rerank`, `text_embedding`) * <> (`text_embedding`) * <> (`text_embedding`) * <> (`chat_completion`, `completion`, `text_embedding`) * <> (`text_embedding`, `rerank`) * <> (`text_embedding`) * <> (`text_embedding`, `rerank`) The {es} and ELSER services run on a {ml} node in your {es} cluster. The rest of the integrations connect to external services. [discrete] [[adaptive-allocations-put-inference]] ==== Adaptive allocations Adaptive allocations allow inference endpoints to dynamically adjust the number of model allocations based on the current load. When adaptive allocations are enabled: - The number of allocations scales up automatically when the load increases. - Allocations scale down to a minimum of 0 when the load decreases, saving resources. 
For more information about adaptive allocations and resources, refer to the {ml-docs}/ml-nlp-auto-scale.html[trained model autoscaling] documentation.
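
As an illustration of the overall workflow (the endpoint name `my-e5-endpoint` and the allocation settings are placeholder values), the request below creates a `text_embedding` endpoint backed by the built-in E5 model through the `elasticsearch` service:

[source,console]
----
PUT _inference/text_embedding/my-e5-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": ".multilingual-e5-small"
  }
}
----

Because this service deploys a model on a {ml} node, you can check the deployment status before using the endpoint:

[source,console]
----
GET _ml/trained_models/.multilingual-e5-small/_stats
----

In the response, the deployment is ready when the allocation status reports `"state": "fully_allocated"` and the `"allocation_count"` matches the `"target_allocation_count"`.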