[Inference] Inference plugin + chatComplete API (#188280)
This PR introduces the Inference plugin: a central place to handle all
interactions with the Elasticsearch Inference API and external LLM APIs.

## Goals

- Provide a single place for all interactions with large language models
and other generative AI adjacent tasks.
- Abstract away differences between LLM providers like OpenAI,
Bedrock and Gemini.
- Host commonly used LLM-based tasks like generating ES|QL from natural
language and knowledge base recall.
- Allow us to move gradually to the _inference endpoint without
disrupting engineers.

## Architecture and examples

![CleanShot 2024-07-14 at 14 45 27@2x](https://github.com/user-attachments/assets/e65a3e47-bce1-4dcf-bbed-4f8ac12a104f)

## Terminology

The following concepts are referenced throughout this POC:

- **chat completion**: the process in which the LLM generates the next
message in the conversation. This is sometimes referred to as inference,
text completion, text generation or content generation.
- **tasks**: higher-level tasks that, based on their input, use the LLM in
conjunction with other services like Elasticsearch to achieve a result.
The example in this POC is natural language to ES|QL.
- **tools**: a set of tools that the LLM can choose to use when
generating the next message. In essence, this allows the consumer of the
API to define a schema for structured output instead of plain text, and
have the LLM select the most appropriate one.
- **tool call**: when the LLM has chosen a tool (schema) to use for its
output and returns a document that matches that schema, this is referred
to as a tool call (an example follows this list).
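
For illustration, here is roughly what a tool definition and a resulting
tool call could look like. This is a sketch only: the property names are
illustrative, not the plugin's exact types.

```ts
// Illustrative sketch: a tool is a description plus a (JSON) schema for its arguments.
const tools = {
  get_open_alerts: {
    description: 'Retrieve the open alerts for a given service',
    schema: {
      type: 'object',
      properties: {
        serviceName: { type: 'string' },
      },
      required: ['serviceName'],
    },
  },
};

// A tool call is the LLM choosing one of those tools and returning a document
// that matches its schema (again, property names here are illustrative):
const toolCall = {
  name: 'get_open_alerts',
  arguments: { serviceName: 'my-service' },
};
```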

## Usage examples

```ts
import { schema } from '@kbn/config-schema';
import { lastValueFrom } from 'rxjs';
// `MessageRole`, `withoutTokenCountEvents` and `withoutChunkEvents` are
// exported by the inference plugin.

class MyPlugin {
  setup(coreSetup, pluginsSetup) {
    const router = coreSetup.http.createRouter();

    router.post(
      {
        path: '/internal/my_plugin/do_something',
        validate: {
          body: schema.object({
            connectorId: schema.string(),
          }),
        },
      },
      async (context, request, response) => {
        const [coreStart, pluginsStart] = await coreSetup.getStartServices();

        // create an inference client scoped to the incoming request
        const inferenceClient = pluginsStart.inference.getClient({ request });

        // chatComplete returns an Observable of normalized events
        const chatComplete$ = inferenceClient.chatComplete({
          connectorId: request.body.connectorId,
          system: `Here is my system message`,
          messages: [
            {
              role: MessageRole.User,
              content: 'Do something',
            },
          ],
        });

        // drop chunk and token count events and await the final message
        const message = await lastValueFrom(
          chatComplete$.pipe(withoutTokenCountEvents(), withoutChunkEvents())
        );

        return response.ok({
          body: {
            message,
          },
        });
      }
    );
  }
}
```

## Implementation

The bulk of the work here is implementing a `chatComplete` API. Here's
what it does (the resulting event shapes are sketched after this list):

- Formats the request for the specific connector type being called, e.g.
OpenAI, Bedrock (Claude), Gemini or the Elastic Inference Service (they
all have different API specifications).
- Executes the specified connector with the formatted request.
- Creates and returns an Observable, and starts reading from the stream.
- Every event in the stream (message chunks, token counts) is normalized
to a format that is close to (but not exactly the same as) OpenAI's
format, and emitted as a value from the Observable.
- When the stream ends, the individual events (chunks) are concatenated
into a single message.
- If the LLM has called any tools, the tool call is validated according
to its schema.
- After emitting the message, the Observable completes.
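
As a rough sketch of what that means for consumers (the actual type and
property names are defined by the plugin and may differ), the Observable
emits values along these lines:

```ts
// Hedged sketch of the emitted events; the plugin's own types are the source of truth.
type ChatCompletionEvent =
  // an incremental piece of the message being generated
  | { type: 'chunk'; content: string }
  // token usage reported while streaming
  | { type: 'tokenCount'; tokens: { prompt: number; completion: number; total: number } }
  // the final concatenated message, with validated tool calls (if any)
  | {
      type: 'message';
      content: string;
      toolCalls: Array<{ name: string; arguments: Record<string, unknown> }>;
    };
```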

There's also a thin wrapper around this API, called the `output` API,
which is catered towards a single use case: having the LLM return a
structured response based on a schema. It simplifies a few things (a
usage sketch follows this list):

- It doesn't require a conversation (a list of messages); a simple `input`
string suffices.
- You can define a schema for the output of the LLM.
- It drops the token count events that are emitted.
- It simplifies the event format (update & complete).
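
A hedged sketch of what calling it could look like, reusing the
request-scoped `inferenceClient` from the usage example above (the exact
method signature is defined by the plugin and may differ):

```ts
// Sketch only: option names may differ from the real `output` API.
// `inferenceClient` and `connectorId` are assumed to be set up as in the
// usage example above (e.g. connectorId taken from the request body).
const output$ = inferenceClient.output({
  connectorId,
  input: 'Describe the health of the my-service service over the last hour',
  schema: {
    type: 'object',
    properties: {
      summary: { type: 'string' },
      severity: { type: 'string', enum: ['low', 'medium', 'high'] },
    },
    required: ['summary'],
  } as const,
});

// The Observable emits simplified update & complete events; the complete
// event carries the structured output matching the schema above.
```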

### Observable event streams

These APIs, both on the client and the server, return Observables that
emit events. When converting the Observable into a stream, the following
things happen (a minimal serialization sketch follows this list):

- Errors are caught and serialized as events sent over the stream (after
an error, the stream ends).
- The response stream outputs data as [server-sent
events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events).
- The client that reads the stream parses the event source back into an
Observable; if it encounters a serialized error, it deserializes it and
rethrows it in the Observable.
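
A minimal sketch of the server-side serialization, assuming each event
(including a caught error) is simply JSON-encoded into an SSE `data:`
line; the plugin's own stream helpers are the source of truth here:

```ts
import { catchError, map, of } from 'rxjs';
import type { Observable } from 'rxjs';

// Sketch: turn a stream of inference events into server-sent event payloads.
function toServerSentEvents(events$: Observable<unknown>): Observable<string> {
  return events$.pipe(
    // serialize errors as a final event; the stream ends afterwards
    catchError((error) => of({ type: 'error', error: { message: String(error) } })),
    // each event becomes one `data:` line, terminated by a blank line
    map((event) => `data: ${JSON.stringify(event)}\n\n`)
  );
}
```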

### Errors

All known errors are direct instances of the `InferenceTaskError` base
class rather than subclasses of it. The class carries a `code`, a
`message`, and `meta` information about the error. This allows us to
serialize and deserialize errors over the wire without a complicated
factory pattern.
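
Conceptually (a sketch of the shape described above, not the exact
implementation):

```ts
// One concrete error class carrying a code, message and meta, so errors can be
// serialized over the wire and recreated on the other side without a factory
// per error subclass.
class InferenceTaskError<TCode extends string, TMeta> extends Error {
  constructor(public code: TCode, message: string, public meta: TMeta) {
    super(message);
  }

  toJSON() {
    return { type: 'error', error: { code: this.code, message: this.message, meta: this.meta } };
  }
}
```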

### Tools

Tools are defined as a record, with a `description` and optionally a
`schema`. The reason it's a record is type safety: it allows us to have
fully typed tool calls (e.g. when the name of the tool being called is
`x`, its arguments are typed as the schema of `x`).
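
For example, a sketch of the idea (illustrative tool definitions, not the
plugin's exact types):

```ts
// Because `tools` is a record keyed by tool name, the compiler can narrow a
// tool call's arguments based on which tool was called.
const tools = {
  get_service_stats: {
    description: 'Fetch stats for a single service',
    schema: {
      type: 'object',
      properties: { serviceName: { type: 'string' } },
      required: ['serviceName'],
    },
  },
  list_services: {
    description: 'List all known services',
  },
} as const;

// A tool call whose name is 'get_service_stats' then has arguments typed
// according to that tool's schema, while 'list_services' takes none.
```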

## Notes for reviewers

- I've only added one reference implementation for a connector adapter,
which is OpenAI. Adding more would create noise in the PR, but I can add
them as well. Bedrock would need simulated function calling, which I
would also expect to be handled by this plugin.
- Similarly, the natural language to ES|QL task just creates dummy
steps, as moving the entire implementation would mean thousands of
additional LOC, for instance because it needs to bundle the ES|QL
documentation.
- Observables over promises/iterators: Observables are a well-defined
and widely-adopted solution for async programming. Promises are not
suitable for streamed/chunked responses because there are no
intermediate values. Async iterators are not widely adopted among Kibana
engineers.
- JSON Schema over Zod: I've tried using Zod, because I like its
ergonomics over plain JSON Schema, but we need to convert it to JSON
Schema at some point, which is a lossy conversion, creating a risk of
using features that we cannot convert to JSON Schema. Additionally,
tools for converting Zod to and [from JSON Schema are not always
suitable](https://github.com/StefanTerdell/json-schema-to-zod#use-at-runtime).
I've implemented my own JSON Schema to TypeScript type mapping, as
[json-schema-to-ts](https://github.com/ThomasAribart/json-schema-to-ts)
is very slow.
- There's no option for raw input or output. There could be, but it
would defeat the purpose of the normalization that the `chatComplete`
API handles. At that point it might be better to use the connector
directly.
- That also means that for LangChain, something would be needed to
convert the Observable into an async iterator that returns
OpenAI-compatible output (a rough adapter is sketched after this list).
This is doable, although it would be nice if we could just use the
output from the OpenAI API in that case.
- I have not made room for any vendor-specific parameters in the
`chatComplete` API. We might need it, but hopefully not.
- I think type safety is critical here, so there is some TypeScript
voodoo in some places to make that happen.
- `system` is not a message in the conversation, but a separate
property. Given the semantics of a system message (there can only be
one, and only at the beginning of the conversation), I think it's easier
to make it a top-level property than a message type.
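
As mentioned in the LangChain note above, a rough sketch of such an
Observable-to-async-iterator adapter could look like this (generic RxJS,
not part of this PR):

```ts
import type { Observable } from 'rxjs';

// Sketch: buffer values from the Observable and hand them out one by one as an
// async iterator, so consumers like LangChain can use `for await ... of`.
async function* toAsyncIterator<T>(source$: Observable<T>): AsyncGenerator<T> {
  const buffer: T[] = [];
  let done = false;
  let failure: unknown;
  let notify: (() => void) | undefined;

  const subscription = source$.subscribe({
    next: (value) => { buffer.push(value); notify?.(); },
    error: (err) => { failure = err ?? new Error('stream error'); notify?.(); },
    complete: () => { done = true; notify?.(); },
  });

  try {
    while (true) {
      if (buffer.length > 0) { yield buffer.shift() as T; continue; }
      if (failure) throw failure;
      if (done) return;
      // wait until the next value, error or completion arrives
      await new Promise<void>((resolve) => (notify = resolve));
      notify = undefined;
    }
  } finally {
    subscription.unsubscribe();
  }
}
```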

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>