Inference plugin

The inference plugin is a central place to handle all interactions with the Elasticsearch Inference API and external LLM APIs. Its goals are:

  • Provide a single place for all interactions with large language models and other generative AI adjacent tasks.
  • Abstract away differences between LLM providers such as OpenAI, Bedrock, and Gemini.
  • Host commonly used LLM-based tasks like generating ES|QL from natural language and knowledge base recall.
  • Allow us to move gradually to the _inference endpoint without disrupting engineers.

Architecture and examples

[Architecture diagram]

Terminology

The following concepts are commonly used throughout the plugin:

  • chat completion: the process in which the LLM generates the next message in the conversation. This is sometimes referred to as inference, text completion, text generation or content generation.
  • tasks: higher-level operations that, based on their input, use the LLM in conjunction with other services like Elasticsearch to achieve a result. The example in this plugin is natural language to ES|QL.
  • tools: a set of tools that the LLM can choose to use when generating the next message. In essence, they allow the consumer of the API to define a schema for structured output instead of plain text, and to have the LLM select the most appropriate one.
  • tool call: when the LLM has chosen a tool (schema) to use for its output and returns a document matching that schema, this is referred to as a tool call (see the sketch after this list).
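
For illustration, here is a minimal sketch of what a tool and a matching tool call look like. The property names follow the common function-calling convention and are illustrative; see the chatComplete and Tools sections below for usage.

const tools = {
  get_weather: {
    description: 'Get the current weather for a location',
    schema: {
      type: 'object',
      properties: { location: { type: 'string' } },
      required: ['location'],
    },
  },
};

// A tool call is the LLM responding with a document that matches the chosen schema:
const toolCall = {
  function: {
    name: 'get_weather',
    arguments: { location: 'Paris' },
  },
};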

Usage examples

import { lastValueFrom } from 'rxjs';
import { schema } from '@kbn/config-schema';
// MessageRole, withoutTokenCountEvents and withoutChunkEvents are exported by the inference plugin.

class MyPlugin {
  setup(coreSetup, pluginsSetup) {
    const router = coreSetup.http.createRouter();

    router.post(
      {
        path: '/internal/my_plugin/do_something',
        validate: {
          body: schema.object({
            connectorId: schema.string(),
          }),
        },
      },
      async (context, request, response) => {
        const [coreStart, pluginsStart] = await coreSetup.getStartServices();

        const inferenceClient = pluginsStart.inference.getClient({ request });

        const chatComplete$ = inferenceClient.chatComplete({
          connectorId: request.body.connectorId,
          system: `Here is my system message`,
          messages: [
            {
              role: MessageRole.User,
              content: 'Do something',
            },
          ],
        });

        const message = await lastValueFrom(
          chatComplete$.pipe(withoutTokenCountEvents(), withoutChunkEvents())
        );

        return response.ok({
          body: {
            message,
          },
        });
      }
    );
  }
}

Services

chatComplete

chatComplete generates a response to a prompt or a conversation using the LLM. Here's what is supported:

  • Normalizes request and response formats across connector types (e.g. OpenAI, Bedrock, Claude, Elastic Inference Service)
  • Supports tool calling and validates tool calls
  • Emits token count events
  • Emits message events, which contain the full message concatenated from the response chunks
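
As a hedged sketch, building on the usage example above and assuming the final message event exposes any tool calls as toolCalls, a chatComplete call with a tool could look roughly like this:

const events$ = inferenceClient.chatComplete({
  connectorId: 'my-connector-id',
  system: 'You are a helpful assistant',
  messages: [{ role: MessageRole.User, content: 'What is the weather in Paris?' }],
  tools: {
    get_weather: {
      description: 'Get the current weather for a location',
      schema: {
        type: 'object',
        properties: { location: { type: 'string' } },
        required: ['location'],
      },
    },
  },
});

// Collapse the event stream into the final message, dropping chunk and token count events.
const message = await lastValueFrom(
  events$.pipe(withoutTokenCountEvents(), withoutChunkEvents())
);

// If the LLM decided to call the tool, the tool calls are available on the message.
for (const toolCall of message.toolCalls ?? []) {
  console.log(toolCall.function.name, toolCall.function.arguments);
}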

output

output is a wrapper around chatComplete tailored to a single use case: having the LLM return a structured response based on a schema. It also drops the token count events to simplify usage.
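
A minimal sketch of what calling output could look like, assuming the client exposes it next to chatComplete and that it takes an id, a connectorId, an input prompt and a schema; treat the parameter names as illustrative:

const output$ = inferenceClient.output({
  id: 'extract_city',
  connectorId: 'my-connector-id',
  input: 'The user said: "I will be travelling to Paris next week"',
  schema: {
    type: 'object',
    properties: { city: { type: 'string' } },
    required: ['city'],
  },
});

// The last event carries the structured response matching the schema.
const { output } = await lastValueFrom(output$);
console.log(output.city);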

Observable event streams

These APIs, both on the client and the server, return Observables that emit events. When the Observable is converted into a stream, the following happens:

  • Errors are caught and serialized as events sent over the stream (after an error, the stream ends).
  • The response stream outputs data as server-sent events.
  • The client reading the stream parses the event source back into an Observable; if it encounters a serialized error, it deserializes it and throws it in the Observable.
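
As a rough illustration of that flow (this is the general rxjs pattern, not the plugin's actual wire format or helpers):

import { Observable, catchError, map, of } from 'rxjs';

type ErrorEvent = { type: 'error'; error: { message: string } };

// Server side: turn a thrown error into a terminal, serializable event on the stream.
const errorsToEvents = <T>(source$: Observable<T>): Observable<T | ErrorEvent> =>
  source$.pipe(
    catchError((error) => of<ErrorEvent>({ type: 'error', error: { message: String(error) } }))
  );

// Client side: turn serialized error events back into thrown errors.
const eventsToErrors = <T>(source$: Observable<T | ErrorEvent>): Observable<T> =>
  source$.pipe(
    map((event) => {
      if ((event as ErrorEvent).type === 'error') {
        throw new Error((event as ErrorEvent).error.message);
      }
      return event as T;
    })
  );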

Errors

All known errors are instances of, and not extensions of, the InferenceTaskError base class, which carries a code, a message, and meta information about the error. This allows us to serialize and deserialize errors over the wire without a complicated factory pattern.
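
The shape that implies is roughly the following (an illustrative sketch, not the actual class definition; the error code used at the end is hypothetical):

class InferenceTaskError<TCode extends string, TMeta> extends Error {
  constructor(
    public readonly code: TCode,
    message: string,
    public readonly meta: TMeta
  ) {
    super(message);
  }

  // A serializable representation that can be sent over the wire and rebuilt on the other side.
  toJSON() {
    return { type: 'error', error: { code: this.code, message: this.message, meta: this.meta } };
  }
}

// Consumers discriminate on `code` rather than on error subclasses.
const isToolValidationError = (error: InferenceTaskError<string, unknown>) =>
  error.code === 'toolValidationError';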

Tools

Tools are defined as a record, with a description and optionally a schema. It's a record for the sake of type safety: this allows fully typed tool calls (e.g. when the name of the tool being called is x, its arguments are typed as the schema of x), as shown in the sketch below.
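
Continuing the hedged chatComplete sketch above (and still assuming the final message exposes toolCalls), this is roughly what that type safety buys:

for (const toolCall of message.toolCalls ?? []) {
  switch (toolCall.function.name) {
    case 'get_weather':
      // `arguments` is typed from the `get_weather` schema, so `location` is a string here.
      console.log(toolCall.function.arguments.location);
      break;
    default:
      break;
  }
}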