# Inference plugin
The inference plugin is a central place to handle all interactions with the Elasticsearch Inference API and external LLM APIs. Its goals are:
- Provide a single place for all interactions with large language models and other generative-AI-adjacent tasks.
- Abstract away differences between LLM providers such as OpenAI, Bedrock and Gemini.
- Host commonly used LLM-based tasks, like generating ES|QL from natural language and knowledge base recall.
- Allow us to move gradually to the `_inference` endpoint without disrupting engineers.
## Architecture and examples

### Terminology
The following concepts are commonly used throughout the plugin:
- **chat completion**: the process in which the LLM generates the next message in the conversation. This is sometimes referred to as inference, text completion, text generation or content generation.
- **tasks**: higher-level tasks that, based on their input, use the LLM in conjunction with other services like Elasticsearch to achieve a result. The example in this POC is natural language to ES|QL.
- **tools**: a set of tools that the LLM can choose to use when generating the next message. In essence, it allows the consumer of the API to define a schema for structured output instead of plain text, and have the LLM select the most appropriate one.
- **tool call**: when the LLM has chosen a tool (schema) to use for its output, and returns a document that matches that schema, this is referred to as a tool call. Both concepts are sketched in the example after this list.
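For illustration (the tool name and schema below are made up, not part of the plugin), a tool definition and a matching tool call look conceptually like this:

```ts
// A tool: a named schema the LLM can choose for structured output.
const tools = {
  get_alerts: {
    description: 'Retrieve the open alerts for a given service',
    schema: {
      type: 'object',
      properties: {
        serviceName: { type: 'string' },
      },
      required: ['serviceName'],
    },
  },
};

// A tool call: the LLM chose `get_alerts` and returned a document
// matching its schema instead of plain text.
const toolCall = {
  name: 'get_alerts',
  arguments: { serviceName: 'checkout' },
};
```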
### Usage examples
```ts
import { schema } from '@kbn/config-schema';
import { lastValueFrom } from 'rxjs';
// MessageRole and the event-filtering helpers (withoutTokenCountEvents,
// withoutChunkEvents) are exported by the inference plugin.

class MyPlugin {
  setup(coreSetup, pluginsSetup) {
    const router = coreSetup.http.createRouter();

    router.post(
      {
        path: '/internal/my_plugin/do_something',
        validate: {
          body: schema.object({
            connectorId: schema.string(),
          }),
        },
      },
      async (context, request, response) => {
        const [coreStart, pluginsStart] = await coreSetup.getStartServices();

        // Scope an inference client to the current request.
        const inferenceClient = pluginsSetup.inference.getClient({ request });

        const chatComplete$ = inferenceClient.chatComplete({
          connectorId: request.body.connectorId,
          system: `Here is my system message`,
          messages: [
            {
              role: MessageRole.User,
              content: 'Do something',
            },
          ],
        });

        // Wait for the final message, ignoring chunk and token count events.
        const message = await lastValueFrom(
          chatComplete$.pipe(withoutTokenCountEvents(), withoutChunkEvents())
        );

        return response.ok({
          body: {
            message,
          },
        });
      }
    );
  }
}
```
## Services
### `chatComplete`

`chatComplete` generates a response to a prompt or a conversation using the LLM. Here's what is supported:
- Normalizes request and response formats across connector types (e.g. OpenAI, Bedrock, Claude, Elastic Inference Service)
- Supports tool calling and validates tool calls (a sketch follows this list)
- Emits token count events
- Emits message events, which contain the concatenated message built from the response chunks
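As a hedged sketch building on the usage example above (the tool shape reuses the hypothetical `get_alerts` record from the terminology section, and the exact response shape may differ):

```ts
const events$ = inferenceClient.chatComplete({
  connectorId: request.body.connectorId,
  messages: [{ role: MessageRole.User, content: 'Any open alerts for checkout?' }],
  tools, // e.g. the hypothetical `get_alerts` record from the terminology sketch
});

// The final message event carries any validated tool calls the LLM made.
const message = await lastValueFrom(
  events$.pipe(withoutTokenCountEvents(), withoutChunkEvents())
);
// message.toolCalls, when present, matches the schema of the chosen tool.
```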
### `output`

`output` is a wrapper around `chatComplete` that is catered towards a single use case: having the LLM output a structured response, based on a schema. It also drops the token count events to simplify usage.
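As a sketch only (the exact signature isn't documented here, so parameter names like `id` and `input` are assumptions), structured output could look like:

```ts
// Hypothetical usage: ask for a structured summary instead of free text.
const result = await lastValueFrom(
  inferenceClient.output({
    id: 'summarize_alert', // assumed task identifier
    connectorId: request.body.connectorId,
    input: 'Summarize the following alert: ...',
    schema: {
      type: 'object',
      properties: {
        summary: { type: 'string' },
        severity: { type: 'string' },
      },
      required: ['summary'],
    },
  })
);
// result.output would match the schema above.
```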
### Observable event streams
These APIs, both on the client and the server, return Observables that emit events. When converting the Observable into a stream, the following things happen:
- Errors are caught and serialized as events sent over the stream; after an error, the stream ends (a simplified sketch of this follows the list).
- The response stream outputs data as server-sent events.
- The client that reads the stream parses the event source as an Observable; when it encounters a serialized error, it deserializes it and throws the error in the Observable.
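The error part of this contract can be sketched with plain rxjs; the event shapes here are illustrative, not the plugin's actual wire format:

```ts
import { Observable, catchError, map, of } from 'rxjs';

// Illustrative: turn an event stream into server-sent-event lines,
// serializing any error as a final event so the stream ends cleanly.
function toServerSentEvents<T>(events$: Observable<T>): Observable<string> {
  return events$.pipe(
    map((event) => `data: ${JSON.stringify(event)}\n\n`),
    catchError((error) =>
      of(`data: ${JSON.stringify({ type: 'error', message: (error as Error).message })}\n\n`)
    )
  );
}
```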
### Errors
All known errors are instances, and not extensions, of the `InferenceTaskError` base class, which has a `code`, a `message`, and `meta` information about the error. This allows us to serialize and deserialize errors over the wire without a complicated factory pattern.
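To make the idea concrete, here is a simplified sketch of such a base class (the real implementation lives in the plugin and may differ):

```ts
// Simplified sketch of the single-error-class pattern: all known errors
// are instances of one class, distinguished only by `code` and `meta`.
class InferenceTaskError<TCode extends string, TMeta> extends Error {
  constructor(
    public readonly code: TCode,
    message: string,
    public readonly meta: TMeta
  ) {
    super(message);
  }

  toJSON() {
    return { code: this.code, message: this.message, meta: this.meta };
  }
}

// Deserializing on the other side of the wire is one constructor call,
// with no per-error-type factory needed.
const revived = new InferenceTaskError('requestError', 'LLM request failed', { status: 500 });
```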
## Tools
Tools are defined as a record, with a `description` and optionally a `schema`. The reason it's a record is type safety: this allows us to have fully typed tool calls (e.g. when the name of the tool being called is `x`, its arguments are typed as the schema of `x`).
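A simplified sketch of how the record shape yields typed tool calls (the plugin's real types infer argument types from the schema itself; here a plain type parameter stands in for that inference):

```ts
// Simplified: a stand-in type parameter plays the role of schema inference.
interface ToolDefinition<TArguments> {
  description: string;
  arguments?: TArguments; // the real plugin derives this from `schema`
}

// Because `tools` is a record, each tool name stays linked to its own
// argument type, producing a discriminated union of tool calls.
type ToolCallsOf<TTools extends Record<string, ToolDefinition<any>>> = {
  [TName in keyof TTools]: {
    name: TName;
    arguments: NonNullable<TTools[TName]['arguments']>;
  };
}[keyof TTools];

declare const tools: {
  get_alerts: ToolDefinition<{ serviceName: string }>;
  get_slos: ToolDefinition<{ sloId: string }>;
};

declare const call: ToolCallsOf<typeof tools>;

if (call.name === 'get_alerts') {
  // Narrowed: call.arguments is typed as { serviceName: string }.
  call.arguments.serviceName;
}
```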