[Inference] Inference plugin + chatComplete API (#188280)
This PR introduces an Inference plugin.

## Goals

- Provide a single place for all interactions with large language models
and other generative AI adjacent tasks.
- Abstract away differences between LLM providers like OpenAI,
Bedrock and Gemini.
- Host commonly used LLM-based tasks like generating ES|QL from natural
language and knowledge base recall.
- Allow us to move gradually to the `_inference` endpoint without
disrupting engineers.

## Architecture and examples

![CleanShot 2024-07-14 at 14 45 27@2x](https://github.com/user-attachments/assets/e65a3e47-bce1-4dcf-bbed-4f8ac12a104f)

## Terminology

The following concepts are referenced throughout this PR:

- **chat completion**: the process in which the LLM generates the next
message in the conversation. This is sometimes referred to as inference,
text completion, text generation or content generation.
- **tasks**: higher-level tasks that, based on their input, use the LLM in
conjunction with other services like Elasticsearch to achieve a result.
The example in this PR is natural language to ES|QL.
- **tools**: a set of tools that the LLM can choose to use when
generating the next message. In essence, it allows the consumer of the
API to define a schema for structured output instead of plain text, and
have the LLM select the most appropriate one.
- **tool call**: when the LLM has chosen a tool (schema) to use for its
output, and returns a document that matches the schema, this is referred
to as a tool call.
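
To make the last two concrete, here is a purely illustrative sketch (property names are placeholders, not the plugin's exact types):

```ts
// one tool definition, and the shape of a tool call the LLM could return for it
const tools = {
  get_weather: {
    description: 'Get the current weather for a location',
    schema: {
      type: 'object',
      properties: { location: { type: 'string' } },
      required: ['location'],
    },
  },
};

// a tool call: the LLM chose `get_weather` and produced a document that
// matches its schema
const toolCall = {
  function: { name: 'get_weather', arguments: { location: 'Amsterdam' } },
};
```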

## Usage examples

```ts
import { lastValueFrom } from 'rxjs';
import { schema } from '@kbn/config-schema';
// `MessageRole`, `withoutTokenCountEvents` and `withoutChunkEvents` are
// exported by the inference plugin; the exact import path depends on the
// consuming plugin.

class MyPlugin {
  setup(coreSetup, pluginsSetup) {
    const router = coreSetup.http.createRouter();

    router.post(
      {
        path: '/internal/my_plugin/do_something',
        validate: {
          body: schema.object({
            connectorId: schema.string(),
          }),
        },
      },
      async (context, request, response) => {
        const [, pluginsStart] = await coreSetup.getStartServices();

        const inferenceClient = pluginsStart.inference.getClient({ request });

        const chatComplete$ = inferenceClient.chatComplete({
          connectorId: request.body.connectorId,
          system: `Here is my system message`,
          messages: [
            {
              role: MessageRole.User,
              content: 'Do something',
            },
          ],
        });

        const message = await lastValueFrom(
          chatComplete$.pipe(withoutTokenCountEvents(), withoutChunkEvents())
        );

        return response.ok({
          body: {
            message,
          },
        });
      }
    );
  }
}
```

## Implementation

The bulk of the work here is implementing a `chatComplete` API. Here's
what it does:

- Formats the request for the specific LLM that is being called (all
have different API specifications).
- Executes the specified connector with the formatted request.
- Creates and returns an Observable, and starts reading from the stream.
- Every event in the stream is normalized to a format that is close to
(but not exactly the same as) OpenAI's format, and emitted as a value
from the Observable.
- When the stream ends, the individual events (chunks) are concatenated
into a single message.
- If the LLM has called any tools, the tool call is validated according
to its schema.
- After emitting the message, the Observable completes.
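
For illustration, here's roughly what consuming `chatComplete$` directly could look like. The event type discriminators and payload fields below are placeholders, not necessarily the exact names the plugin uses:

```ts
chatComplete$.subscribe({
  next: (event) => {
    switch (event.type) {
      case 'chatCompletionChunk':
        // incremental content, e.g. to stream to the UI as it arrives
        process.stdout.write(event.content);
        break;
      case 'chatCompletionTokenCount':
        // token usage reported by the provider
        console.log('tokens used:', event.tokens);
        break;
      case 'chatCompletionMessage':
        // the concatenated message, with schema-validated tool calls if any
        console.log('final message:', event.content, event.toolCalls);
        break;
    }
  },
  error: (err) => console.error('stream failed', err),
  complete: () => console.log('done'),
});
```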

There's also a thin wrapper around this API, which is called the
`output` API. It simplifies a few things:

- It doesn't require a conversation (list of messages), a simple `input`
string suffices.
- You can define a schema for the output of the LLM. 
- It drops the token count events that are emitted.
- It simplifies the event format (update & complete).
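
A rough sketch of what calling the `output` API could look like, assuming a client obtained as in the usage example above; the parameter names (`id`, `input`, `schema`) and the shape of the final event are assumptions for illustration:

```ts
const output$ = inferenceClient.output({
  id: 'extract_cluster_info',
  connectorId: request.body.connectorId,
  input: 'The cluster is named "production-eu" and has 12 nodes.',
  schema: {
    type: 'object',
    properties: {
      clusterName: { type: 'string' },
      nodeCount: { type: 'number' },
    },
    required: ['clusterName'],
  } as const,
});

// update events stream partial output; the final event carries the
// complete, schema-validated output
const { output } = await lastValueFrom(output$);
```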

### Observable event streams

These APIs, both on the client and the server, return Observables that
emit events. When converting the Observable into a stream, the following
things happen:

- Errors are caught and serialized as events sent over the stream (after
an error, the stream ends).
- The response stream outputs data as [server-sent
events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events).
- The client that reads the stream parses the event source into an
Observable; if it encounters a serialized error, it deserializes it and
rethrows it in the Observable.
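
As a rough sketch of the idea (not the plugin's actual helper), the server-side conversion is essentially:

```ts
import { catchError, map, of } from 'rxjs';
import type { Observable } from 'rxjs';

function toServerSentEvents(events$: Observable<unknown>): Observable<string> {
  return events$.pipe(
    // serialize errors as regular events; the stream ends after the error event
    catchError((error) =>
      of({ type: 'error', error: { code: error.code, message: error.message, meta: error.meta } })
    ),
    // each event becomes a single server-sent event
    map((event) => `data: ${JSON.stringify(event)}\n\n`)
  );
}
```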

### Errors

All known errors are instances of, rather than extensions of, the
`InferenceTaskError` base class, which has a `code`, a `message`, and
`meta` information about the error. This allows us to serialize and
deserialize errors over the wire without a complicated factory pattern.
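
A sketch of that shape; the field and method names below are illustrative:

```ts
class InferenceTaskError<TCode extends string, TMeta> extends Error {
  constructor(
    public readonly code: TCode,
    message: string,
    public readonly meta: TMeta
  ) {
    super(message);
  }

  // serializing is just picking the three fields...
  toJSON() {
    return { code: this.code, message: this.message, meta: this.meta };
  }
}

// ...and deserializing is a single constructor call, whatever the code is
const error = new InferenceTaskError('requestError', 'Connector not found', { status: 404 });
```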

### Tools

Tools are defined as a record, with a `description` and optionally a
`schema` per tool. It's a record (rather than an array) for type safety:
this allows us to have fully typed tool calls (e.g. when the name of the tool
being called is `x`, its arguments are typed as the schema of `x`).
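
For illustration, the record keys become the tool names, so TypeScript can narrow a tool call's arguments from its name. The call shape below is a hand-written stand-in for what the plugin infers from the schemas:

```ts
const tools = {
  list_alerts: {
    description: 'List active alerts in a time range',
    schema: {
      type: 'object',
      properties: { start: { type: 'string' }, end: { type: 'string' } },
    },
  },
  get_service: {
    description: 'Get a service by name',
    schema: {
      type: 'object',
      properties: { name: { type: 'string' } },
      required: ['name'],
    },
  },
} as const;

declare const toolCall:
  | { function: { name: 'list_alerts'; arguments: { start?: string; end?: string } } }
  | { function: { name: 'get_service'; arguments: { name: string } } };

if (toolCall.function.name === 'get_service') {
  // arguments is narrowed to `{ name: string }` here
  toolCall.function.arguments.name.toUpperCase();
}
```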

## Notes for reviewers

- I've only added one reference implementation for a connector adapter,
which is OpenAI. Adding more would create noise in the PR, but I can add
them as well. Bedrock would need simulated function calling, which I
would also expect to be handled by this plugin.
- Similarly, the natural language to ES|QL task just creates dummy
steps, as moving the entire implementation would mean thousands of
additional LOC, e.g. because it needs to ship the ES|QL documentation.
- Observables over promises/iterators: Observables are a well-defined
and widely-adopted solution for async programming. Promises are not
suitable for streamed/chunked responses because there are no
intermediate values. Async iterators are not yet widely adopted among Kibana
engineers.
- JSON Schema over Zod: I've tried using Zod, because I like its
ergonomics over plain JSON Schema, but we need to convert it to JSON
Schema at some point, which is a lossy conversion, creating a risk of
using features that we cannot convert to JSON Schema. Additionally,
tools for converting Zod to and [from JSON Schema are not always
suitable](https://github.com/StefanTerdell/json-schema-to-zod#use-at-runtime).
I've implemented my own JSON Schema to TypeScript type conversion, as
[json-schema-to-ts](https://github.com/ThomasAribart/json-schema-to-ts)
is very slow.
- There's no option for raw input or output. There could be, but it
would defeat the purpose of the normalization that the `chatComplete`
API handles. At that point it might be better to use the connector
directly.
- That also means that for LangChain, something would be needed to
convert the Observable into an async iterator that returns
OpenAI-compatible output (see the sketch after this list). This is
doable, although it would be nice if we could just use the output from
the OpenAI API in that case.
- I have not made room for any vendor-specific parameters in the
`chatComplete` API. We might need it, but hopefully not.
- I think type safety is critical here, so there is some TypeScript
voodoo in some places to make that happen.
- `system` is not a message in the conversation, but a separate
property. Given the semantics of a system message (there can only be
one, and only at the beginning of the conversation), I think it's easier
to make it a top-level property than a message type.
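
For reference, a generic (and entirely optional) sketch of adapting an Observable to an async iterator, which is roughly what a LangChain integration would need:

```ts
import type { Observable } from 'rxjs';

async function* toAsyncIterator<T>(source$: Observable<T>): AsyncGenerator<T> {
  const buffer: T[] = [];
  let resolveNext: (() => void) | undefined;
  let done = false;
  let error: unknown;

  const subscription = source$.subscribe({
    next: (value) => {
      buffer.push(value);
      resolveNext?.();
    },
    error: (err) => {
      error = err;
      done = true;
      resolveNext?.();
    },
    complete: () => {
      done = true;
      resolveNext?.();
    },
  });

  try {
    while (true) {
      if (buffer.length > 0) {
        // drain buffered values first
        yield buffer.shift()!;
      } else if (error) {
        throw error;
      } else if (done) {
        return;
      } else {
        // wait for the next value, error or completion
        await new Promise<void>((resolve) => (resolveNext = resolve));
        resolveNext = undefined;
      }
    }
  } finally {
    subscription.unsubscribe();
  }
}
```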

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>