This PR introduces an Inference plugin.

## Goals

- Provide a single place for all interactions with large language models and other generative AI adjacent tasks.
- Abstract away differences between LLM providers such as OpenAI, Bedrock, and Gemini.
- Host commonly used LLM-based tasks, like generating ES|QL from natural language and knowledge base recall.
- Allow us to move gradually to the `_inference` endpoint without disrupting engineers.

## Architecture and examples

## Terminology

The following concepts are referenced throughout this POC:

- **chat completion**: the process in which the LLM generates the next message in the conversation. This is sometimes referred to as inference, text completion, text generation, or content generation.
- **tasks**: higher-level tasks that, based on their input, use the LLM in conjunction with other services like Elasticsearch to achieve a result. The example in this POC is natural language to ES|QL.
- **tools**: a set of tools that the LLM can choose to use when generating the next message. In essence, it allows the consumer of the API to define a schema for structured output instead of plain text, and have the LLM select the most appropriate one.
- **tool call**: when the LLM has chosen a tool (schema) to use for its output and returns a document that matches the schema, this is referred to as a tool call.

## Usage examples

```ts
class MyPlugin {
  setup(coreSetup, pluginsSetup) {
    const router = coreSetup.http.createRouter();

    router.post(
      {
        path: '/internal/my_plugin/do_something',
        validate: {
          body: schema.object({
            connectorId: schema.string(),
          }),
        },
      },
      async (context, request, response) => {
        const [coreStart, pluginsStart] = await coreSetup.getStartServices();

        const inferenceClient = pluginsSetup.inference.getClient({ request });

        const chatComplete$ = inferenceClient.chatComplete({
          connectorId: request.body.connectorId,
          system: `Here is my system message`,
          messages: [
            {
              role: MessageRole.User,
              content: 'Do something',
            },
          ],
        });

        const message = await lastValueFrom(
          chatComplete$.pipe(withoutTokenCountEvents(), withoutChunkEvents())
        );

        return response.ok({
          body: {
            message,
          },
        });
      }
    );
  }
}
```

## Implementation

The bulk of the work here is implementing a `chatComplete` API. Here's what it does:

- Formats the request for the specific LLM that is being called (they all have different API specifications).
- Executes the specified connector with the formatted request.
- Creates and returns an Observable, and starts reading from the stream.
- Every event in the stream is normalized to a format that is close to (but not exactly the same as) OpenAI's format, and emitted as a value from the Observable.
- When the stream ends, the individual events (chunks) are concatenated into a single message.
- If the LLM has called any tools, the tool call is validated according to its schema.
- After emitting the message, the Observable completes.

There's also a thin wrapper around this API, called the `output` API. It simplifies a few things:

- It doesn't require a conversation (a list of messages); a simple `input` string suffices.
- You can define a schema for the output of the LLM.
- It drops the token count events that are emitted.
- It simplifies the event format (update & complete).
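To make the `chatComplete` event flow above more concrete, here is a rough sketch of consuming the Observable event by event instead of waiting for the final message. The event type strings, payload fields, and the `handleChunk`/`recordTokenUsage`/`handleMessage`/`handleError` callbacks are placeholders for illustration, not the exact shapes the plugin emits:

```ts
// Sketch only: event type strings and payload fields are illustrative,
// not the plugin's exact event shapes.
const events$ = inferenceClient.chatComplete({
  connectorId: 'my-connector-id', // hypothetical connector id
  system: `Here is my system message`,
  messages: [{ role: MessageRole.User, content: 'Do something' }],
});

events$.subscribe({
  next: (event) => {
    switch (event.type) {
      case 'chatCompletionChunk':
        // partial content as it streams in, useful for incremental UI updates
        handleChunk(event.content);
        break;
      case 'chatCompletionTokenCount':
        // emitted when the provider reports token usage
        recordTokenUsage(event.tokens);
        break;
      case 'chatCompletionMessage':
        // the concatenated message, including any (validated) tool calls
        handleMessage(event.content, event.toolCalls);
        break;
    }
  },
  error: (err) => {
    // errors from the connector or the LLM surface here
    handleError(err);
  },
});
```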
### Observable event streams

These APIs, both on the client and the server, return Observables that emit events. When converting the Observable into a stream, the following things happen:

- Errors are caught and serialized as events sent over the stream (after an error, the stream ends).
- The response stream outputs data as [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events).
- The client that reads the stream parses the event source as an Observable, and if it encounters a serialized error, it deserializes it and throws an error in the Observable.

### Errors

All known errors are instances of, not extensions of, the `InferenceTaskError` base class, which has a `code`, a `message`, and `meta` information about the error. This allows us to serialize and deserialize errors over the wire without a complicated factory pattern.

### Tools

Tools are defined as a record, with a `description` and optionally a `schema`. The reason why it's a record is type safety: this allows us to have fully typed tool calls (e.g. when the name of the tool being called is `x`, its arguments are typed as the schema of `x`).

## Notes for reviewers

- I've only added one reference implementation for a connector adapter, which is OpenAI. Adding more would create noise in the PR, but I can add them as well. Bedrock would need simulated function calling, which I would also expect to be handled by this plugin.
- Similarly, the natural language to ES|QL task just creates dummy steps, as moving the entire implementation would mean thousands of additional LOC, due to it needing the documentation, for instance.
- Observables over promises/iterators: Observables are a well-defined and widely adopted solution for async programming. Promises are not suitable for streamed/chunked responses because there are no intermediate values. Async iterators are not widely adopted by Kibana engineers.
- JSON Schema over Zod: I've tried using Zod, because I like its ergonomics over plain JSON Schema, but we need to convert it to JSON Schema at some point, which is a lossy conversion, creating a risk of using features that we cannot convert to JSON Schema. Additionally, tools for converting Zod to and [from JSON Schema are not always suitable](https://github.com/StefanTerdell/json-schema-to-zod#use-at-runtime). I've implemented my own JSON Schema to type definition, as [json-schema-to-ts](https://github.com/ThomasAribart/json-schema-to-ts) is very slow.
- There's no option for raw input or output. There could be, but it would defeat the purpose of the normalization that the `chatComplete` API handles. At that point it might be better to use the connector directly.
- That also means that for LangChain, something would be needed to convert the Observable into an async iterator that returns OpenAI-compatible output. This is doable, although it would be nice if we could just use the output from the OpenAI API in that case.
- I have not made room for any vendor-specific parameters in the `chatComplete` API. We might need them, but hopefully not.
- I think type safety is critical here, so there is some TypeScript voodoo in some places to make that happen.
- `system` is not a message in the conversation, but a separate property. Given the semantics of a system message (there can only be one, and only at the beginning of the conversation), I think it's easier to make it a top-level property than a message type.

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
# Inference plugin
The inference plugin is a central place to handle all interactions with the Elasticsearch Inference API and external LLM APIs. Its goals are:
- Provide a single place for all interactions with large language models and other generative AI adjacent tasks.
- Abstract away differences between LLM providers such as OpenAI, Bedrock, and Gemini.
- Host commonly used LLM-based tasks, like generating ES|QL from natural language and knowledge base recall.
- Allow us to move gradually to the `_inference` endpoint without disrupting engineers.
## Architecture and examples
## Terminology
The following concepts are commonly used throughout the plugin:
- **chat completion**: the process in which the LLM generates the next message in the conversation. This is sometimes referred to as inference, text completion, text generation, or content generation.
- **tasks**: higher-level tasks that, based on their input, use the LLM in conjunction with other services like Elasticsearch to achieve a result. The example here is natural language to ES|QL.
- **tools**: a set of tools that the LLM can choose to use when generating the next message. In essence, it allows the consumer of the API to define a schema for structured output instead of plain text, and have the LLM select the most appropriate one.
- **tool call**: when the LLM has chosen a tool (schema) to use for its output and returns a document that matches the schema, this is referred to as a tool call.
## Usage examples

```ts
class MyPlugin {
  setup(coreSetup, pluginsSetup) {
    const router = coreSetup.http.createRouter();

    router.post(
      {
        path: '/internal/my_plugin/do_something',
        validate: {
          body: schema.object({
            connectorId: schema.string(),
          }),
        },
      },
      async (context, request, response) => {
        const [coreStart, pluginsStart] = await coreSetup.getStartServices();

        const inferenceClient = pluginsSetup.inference.getClient({ request });

        const chatComplete$ = inferenceClient.chatComplete({
          connectorId: request.body.connectorId,
          system: `Here is my system message`,
          messages: [
            {
              role: MessageRole.User,
              content: 'Do something',
            },
          ],
        });

        const message = await lastValueFrom(
          chatComplete$.pipe(withoutTokenCountEvents(), withoutChunkEvents())
        );

        return response.ok({
          body: {
            message,
          },
        });
      }
    );
  }
}
```
## Services

### `chatComplete`

`chatComplete` generates a response to a prompt or a conversation using the LLM. Here's what is supported:
- Normalizing request and response formats across different connector types (e.g. OpenAI, Bedrock, Claude, Elastic Inference Service)
- Tool calling and validation of tool calls
- Emitting token count events
- Emitting message events, which contain the concatenated message built from the response chunks
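As a sketch of what tool calling with `chatComplete` could look like (the `tools` option and the `toolCalls` property on the resulting message are assumed here based on the "Tools" section below, so treat the exact names as illustrative):

```ts
const events$ = inferenceClient.chatComplete({
  connectorId: 'my-connector-id', // hypothetical connector id
  system: 'You answer questions about Elasticsearch clusters',
  messages: [{ role: MessageRole.User, content: 'How many documents are in the logs index?' }],
  // a record of tools, each with a description and an optional JSON schema
  tools: {
    get_index_stats: {
      description: 'Retrieve document counts for an index',
      schema: {
        type: 'object',
        properties: {
          index: { type: 'string', description: 'The name of the index' },
        },
        required: ['index'],
      },
    },
  },
});

// Drop token count and chunk events and wait for the final message.
const message = await lastValueFrom(
  events$.pipe(withoutTokenCountEvents(), withoutChunkEvents())
);

// If the model chose the tool, its schema-validated call is available on the
// message; the `toolCalls` property name is an assumption for illustration.
console.log(message.toolCalls);
```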
### `output`

`output` is a wrapper around `chatComplete` that is catered towards a single use case: having the LLM output a structured response, based on a schema. It also drops the token count events to simplify usage.
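A minimal sketch of using `output` with a schema follows. The option names (`id`, `input`, `schema`) and the shape of the final event are assumptions based on the description above, not the exact API surface:

```ts
const output$ = inferenceClient.output({
  id: 'extract_error_summary', // hypothetical task id
  connectorId: 'my-connector-id',
  input: 'Summarize the following error message: Circuit breaker exception ...',
  // JSON schema describing the structured response we want back
  schema: {
    type: 'object',
    properties: {
      summary: { type: 'string' },
      severity: { type: 'string', enum: ['low', 'medium', 'high'] },
    },
    required: ['summary'],
  } as const,
});

// `output` drops token count events, so the last emitted event is the
// complete, schema-shaped result.
const completeEvent = await lastValueFrom(output$);
console.log(completeEvent.output); // e.g. { summary: '...', severity: 'high' }
```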
## Observable event streams
These APIs, both on the client and the server, return Observables that emit events. When converting the Observable into a stream, the following things happen:
- Errors are caught and serialized as events sent over the stream (after an error, the stream ends).
- The response stream outputs data as [server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events)
- The client that reads the stream parses the event source as an Observable, and if it encounters a serialized error, it deserializes it and throws an error in the Observable.
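As a rough illustration of that flow (the plugin ships its own helpers for this; the function below is only a sketch of the idea), converting an Observable of events into a server-sent events body could look like:

```ts
import { PassThrough } from 'stream';
import { Observable } from 'rxjs';

// Sketch only: every event becomes an SSE `data:` line, errors are serialized
// as a final event before the stream ends, and completion closes the stream.
function toServerSentEventsStream(events$: Observable<unknown>): PassThrough {
  const stream = new PassThrough();

  events$.subscribe({
    next: (event) => {
      stream.write(`data: ${JSON.stringify(event)}\n\n`);
    },
    error: (error) => {
      // serialize the error as an event so the client can deserialize and rethrow it
      stream.write(
        `data: ${JSON.stringify({ type: 'error', error: { message: (error as Error).message } })}\n\n`
      );
      stream.end();
    },
    complete: () => {
      stream.end();
    },
  });

  return stream;
}
```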
## Errors

All known errors are instances of, not extensions of, the `InferenceTaskError` base class, which has a `code`, a `message`, and `meta` information about the error. This allows us to serialize and deserialize errors over the wire without a complicated factory pattern.
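In practice, a consumer can catch whatever these APIs throw and branch on the error's `code` and `meta`. A sketch (the specific code value checked below is made up for illustration):

```ts
try {
  const message = await lastValueFrom(
    chatComplete$.pipe(withoutTokenCountEvents(), withoutChunkEvents())
  );
  // ... use the message
} catch (error) {
  if (error instanceof InferenceTaskError) {
    // every inference error exposes a code, a message, and meta information;
    // 'requestError' is an illustrative code, not necessarily a real one
    if (error.code === 'requestError') {
      console.log('request failed', error.message, error.meta);
    }
  } else {
    throw error;
  }
}
```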
## Tools

Tools are defined as a record, with a `description` and optionally a `schema`. The reason why it's a record is type safety: this allows us to have fully typed tool calls (e.g. when the name of the tool being called is `x`, its arguments are typed as the schema of `x`).
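For example, with two tools defined in one record, narrowing on the tool name also narrows the type of its arguments. The `toolCalls` shape on the message (`function.name` / `function.arguments`) is an assumption for illustration:

```ts
const tools = {
  get_weather: {
    description: 'Get the weather for a location',
    schema: {
      type: 'object',
      properties: { location: { type: 'string' } },
      required: ['location'],
    },
  },
  get_time: {
    description: 'Get the current time in UTC',
  },
} as const;

const message = await lastValueFrom(
  inferenceClient
    .chatComplete({
      connectorId: 'my-connector-id', // hypothetical connector id
      messages: [{ role: MessageRole.User, content: "What's the weather in Amsterdam?" }],
      tools,
    })
    .pipe(withoutTokenCountEvents(), withoutChunkEvents())
);

for (const toolCall of message.toolCalls ?? []) {
  if (toolCall.function.name === 'get_weather') {
    // because the name narrowed to 'get_weather', the arguments are typed
    // according to that tool's schema, e.g. { location: string }
    console.log(toolCall.function.arguments.location);
  }
}
```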