[role="xpack"] [[chat-completion-inference-api]] === Chat completion inference API Streams a chat completion response. IMPORTANT: The {infer} APIs enable you to use certain services, such as built-in {ml} models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, Azure, Google AI Studio, Google Vertex AI, Anthropic, Watsonx.ai, or Hugging Face. For built-in models and models uploaded through Eland, the {infer} APIs offer an alternative way to use and manage trained models. However, if you do not plan to use the {infer} APIs to use these models or if you want to use non-NLP models, use the <>. [discrete] [[chat-completion-inference-api-request]] ==== {api-request-title} `POST /_inference//_stream` `POST /_inference/chat_completion//_stream` [discrete] [[chat-completion-inference-api-prereqs]] ==== {api-prereq-title} * Requires the `monitor_inference` <> (the built-in `inference_admin` and `inference_user` roles grant this privilege) * You must use a client that supports streaming. [discrete] [[chat-completion-inference-api-desc]] ==== {api-description-title} The chat completion {infer} API enables real-time responses for chat completion tasks by delivering answers incrementally, reducing response times during computation. It only works with the `chat_completion` task type for `openai` and `elastic` {infer} services. [NOTE] ==== * The `chat_completion` task type is only available within the _stream API and only supports streaming. * The Chat completion {infer} API and the Stream {infer} API differ in their response structure and capabilities. The Chat completion {infer} API provides more comprehensive customization options through more fields and function calling support. If you use the `openai` service or the `elastic` service, use the Chat completion {infer} API. ==== [discrete] [[chat-completion-inference-api-path-params]] ==== {api-path-parms-title} ``:: (Required, string) The unique identifier of the {infer} endpoint. ``:: (Optional, string) The type of {infer} task that the model performs. If included, this must be set to the value `chat_completion`. [discrete] [[chat-completion-inference-api-request-body]] ==== {api-request-body-title} `messages`:: (Required, array of objects) A list of objects representing the conversation. Requests should generally only add new messages from the user (role `user`). The other message roles (`assistant`, `system`, or `tool`) should generally only be copied from the response to a previous completion request, such that the messages array is built up throughout a conversation. + .Assistant message [%collapsible%closed] ===== `content`:: (Required unless `tool_calls` is specified, string or array of objects) The contents of the message. + include::inference-shared.asciidoc[tag=chat-completion-schema-content-with-examples] + `role`:: (Required, string) The role of the message author. This should be set to `assistant` for this type of message. + `tool_calls`:: (Optional, array of objects) The tool calls generated by the model. + .Examples [%collapsible%closed] ====== [source,js] ------------------------------------------------------------ { "tool_calls": [ { "id": "call_KcAjWtAww20AihPHphUh46Gd", "type": "function", "function": { "name": "get_current_weather", "arguments": "{\"location\":\"Boston, MA\"}" } } ] } ------------------------------------------------------------ // NOTCONSOLE ====== + `id`::: (Required, string) The identifier of the tool call. + `type`::: (Required, string) The type of tool call. This must be set to the value `function`. 
[discrete]
[[chat-completion-inference-api-request-body]]
==== {api-request-body-title}

`messages`::
(Required, array of objects) A list of objects representing the conversation.
Requests should generally only add new messages from the user (role `user`).
The other message roles (`assistant`, `system`, or `tool`) should generally only be copied from the response to a previous completion request, such that the messages array is built up throughout a conversation.
+
.Assistant message
[%collapsible%closed]
=====
`content`::
(Required unless `tool_calls` is specified, string or array of objects)
The contents of the message.
+
include::inference-shared.asciidoc[tag=chat-completion-schema-content-with-examples]
+
`role`::
(Required, string)
The role of the message author.
This should be set to `assistant` for this type of message.
+
`tool_calls`::
(Optional, array of objects)
The tool calls generated by the model.
+
.Examples
[%collapsible%closed]
======
[source,js]
------------------------------------------------------------
{
    "tool_calls": [
        {
            "id": "call_KcAjWtAww20AihPHphUh46Gd",
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": "{\"location\":\"Boston, MA\"}"
            }
        }
    ]
}
------------------------------------------------------------
// NOTCONSOLE
======
+
`id`:::
(Required, string)
The identifier of the tool call.
+
`type`:::
(Required, string)
The type of tool call.
This must be set to the value `function`.
+
`function`:::
(Required, object)
The function that the model called.
+
`name`::::
(Required, string)
The name of the function to call.
+
`arguments`::::
(Required, string)
The arguments to call the function with in JSON format.
=====
+
.System message
[%collapsible%closed]
=====
`content`::
(Required, string or array of objects)
The contents of the message.
+
include::inference-shared.asciidoc[tag=chat-completion-schema-content-with-examples]
+
`role`::
(Required, string)
The role of the message author.
This should be set to `system` for this type of message.
=====
+
.Tool message
[%collapsible%closed]
=====
`content`::
(Required, string or array of objects)
The contents of the message.
+
include::inference-shared.asciidoc[tag=chat-completion-schema-content-with-examples]
+
`role`::
(Required, string)
The role of the message author.
This should be set to `tool` for this type of message.
+
`tool_call_id`::
(Required, string)
The tool call that this message is responding to.
=====
+
.User message
[%collapsible%closed]
=====
`content`::
(Required, string or array of objects)
The contents of the message.
+
include::inference-shared.asciidoc[tag=chat-completion-schema-content-with-examples]
+
`role`::
(Required, string)
The role of the message author.
This should be set to `user` for this type of message.
=====

`model`::
(Optional, string)
The ID of the model to use.
By default, the model ID is set to the value included when creating the inference endpoint.

`max_completion_tokens`::
(Optional, integer)
The upper bound limit for the number of tokens that can be generated for a completion request.

`stop`::
(Optional, array of strings)
A sequence of strings to control when the model should stop generating additional tokens.

`temperature`::
(Optional, float)
The sampling temperature to use.

`tools`::
(Optional, array of objects)
A list of tools that the model can call.
+
.Structure
[%collapsible%closed]
=====
`type`::
(Required, string)
The type of tool, must be set to the value `function`.
+
`function`::
(Required, object)
The function definition.
+
`description`:::
(Optional, string)
A description of what the function does.
This is used by the model to choose when and how to call the function.
+
`name`:::
(Required, string)
The name of the function.
+
`parameters`:::
(Optional, object)
The parameters the function accepts.
This should be formatted as a JSON object.
+
`strict`:::
(Optional, boolean)
Whether to enable schema adherence when generating the function call.
=====
+
.Examples
[%collapsible%closed]
======
[source,js]
------------------------------------------------------------
{
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_price_of_item",
                "description": "Get the current price of an item",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "item": {
                            "id": "12345"
                        },
                        "unit": {
                            "type": "currency"
                        }
                    }
                }
            }
        }
    ]
}
------------------------------------------------------------
// NOTCONSOLE
======

`tool_choice`::
(Optional, string or object)
Controls which tool is called by the model.
+
String representation:::
One of `auto`, `none`, or `required`.
`auto` allows the model to choose between calling tools and generating a message.
`none` causes the model to not call any tools.
`required` forces the model to call one or more tools.
+
Object representation:::
+
.Structure
[%collapsible%closed]
=====
`type`::
(Required, string)
The type of the tool.
This must be set to the value `function`.
+
`function`::
(Required, object)
+
`name`:::
(Required, string)
The name of the function to call.
=====
+
.Examples
[%collapsible%closed]
=====
[source,js]
------------------------------------------------------------
{
    "tool_choice": {
        "type": "function",
        "function": {
            "name": "get_current_weather"
        }
    }
}
------------------------------------------------------------
// NOTCONSOLE
=====

`top_p`::
(Optional, float)
Nucleus sampling, an alternative to sampling with temperature.
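The optional generation parameters above can be combined in a single request.
The following minimal sketch reuses the `openai-completion` endpoint from the examples in the next section; all parameter values are illustrative only, not recommendations.

[source,console]
------------------------------------------------------------
POST _inference/chat_completion/openai-completion/_stream
{
    "messages": [
        {
            "role": "user",
            "content": "Summarize what Elastic does in one sentence."
        }
    ],
    "max_completion_tokens": 100,
    "temperature": 0.2,
    "top_p": 0.9,
    "stop": ["\n\n"]
}
------------------------------------------------------------
// TEST[skip:TBD]

In practice you would usually adjust either `temperature` or `top_p` rather than both; the `stop` sequence here ends generation when the model would emit a blank line.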
[discrete]
[[chat-completion-inference-api-example]]
==== {api-examples-title}

The following example performs a chat completion on the example question with streaming.

[source,console]
------------------------------------------------------------
POST _inference/chat_completion/openai-completion/_stream
{
    "model": "gpt-4o",
    "messages": [
        {
            "role": "user",
            "content": "What is Elastic?"
        }
    ]
}
------------------------------------------------------------
// TEST[skip:TBD]

The following example performs a chat completion using an Assistant message with `tool_calls`.

[source,console]
------------------------------------------------------------
POST _inference/chat_completion/openai-completion/_stream
{
    "messages": [
        {
            "role": "assistant",
            "content": "Let's find out what the weather is",
            "tool_calls": [ <1>
                {
                    "id": "call_KcAjWtAww20AihPHphUh46Gd",
                    "type": "function",
                    "function": {
                        "name": "get_current_weather",
                        "arguments": "{\"location\":\"Boston, MA\"}"
                    }
                }
            ]
        },
        { <2>
            "role": "tool",
            "content": "The weather is cold",
            "tool_call_id": "call_KcAjWtAww20AihPHphUh46Gd"
        }
    ]
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> Each tool call needs a corresponding Tool message.
<2> The corresponding Tool message.
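As described in the request body documentation, the `messages` array is built up throughout a conversation: the previous reply is copied back with role `assistant`, followed by the next user message.
The following sketch shows such a follow-up request; the message contents are illustrative.

[source,console]
------------------------------------------------------------
POST _inference/chat_completion/openai-completion/_stream
{
    "messages": [
        {
            "role": "system",
            "content": "You are a concise assistant."
        },
        {
            "role": "user",
            "content": "What is Elastic?"
        },
        {
            "role": "assistant",
            "content": "Elastic is a search company."
        },
        {
            "role": "user",
            "content": "What products does it offer?"
        }
    ]
}
------------------------------------------------------------
// TEST[skip:TBD]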
The following example performs a chat completion using a User message with `tools` and `tool_choice`.

[source,console]
------------------------------------------------------------
POST _inference/chat_completion/openai-completion/_stream
{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's the price of a scarf?"
                }
            ]
        }
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_current_price",
                "description": "Get the current price of an item",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "item": {
                            "id": "123"
                        }
                    }
                }
            }
        }
    ],
    "tool_choice": {
        "type": "function",
        "function": {
            "name": "get_current_price"
        }
    }
}
------------------------------------------------------------
// TEST[skip:TBD]

The API returns the following response when a request is made to the OpenAI service:

[source,txt]
------------------------------------------------------------
event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":"","role":"assistant"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}

event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":"Elastic"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}

event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[{"delta":{"content":" is"},"index":0}],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk"}}

(...)

event: message
data: {"chat_completion":{"id":"chatcmpl-Ae0TWsy2VPnSfBbv5UztnSdYUMFP3","choices":[],"model":"gpt-4o-2024-08-06","object":"chat.completion.chunk","usage":{"completion_tokens":28,"prompt_tokens":16,"total_tokens":44}}} <1>

event: message
data: [DONE]
------------------------------------------------------------
// NOTCONSOLE
<1> The last object message of the stream contains the token usage information.