[ML] Inference API rate limit queuing logic refactor (#107706)

* Adding new executor

* Adding in queuing logic

* working tests

* Added cleanup task

* Update docs/changelog/107706.yaml

* Updating yml

* deregistering callbacks for settings changes

* Cleaning up code

* Update docs/changelog/107706.yaml

* Fixing rate limit settings bug and only sleeping least amount

* Removing debug logging

* Removing commented code

* Renaming feedback

* fixing tests

* Updating docs and validation

* Fixing source blocks

* Adjusting cancel logic

* Reformatting ascii

* Addressing feedback

* adding rate limiting for google embeddings and mistral

---------

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
@@ -7,21 +7,17 @@ experimental[]
Creates an {infer} endpoint to perform an {infer} task.
IMPORTANT: The {infer} APIs enable you to use certain services, such as built-in
{ml} models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, Azure OpenAI, Google AI Studio or Hugging Face.
For built-in models and models uploaded through Eland, the {infer} APIs offer an alternative way to use and manage trained models.
However, if you do not plan to use the {infer} APIs to use these models or if you want to use non-NLP models, use the
<<ml-df-trained-models-apis>>.
[discrete]
[[put-inference-api-request]]
==== {api-request-title}
`PUT /_inference/<task_type>/<inference_id>`
[discrete]
[[put-inference-api-prereqs]]
==== {api-prereq-title}
@@ -29,7 +25,6 @@ use these models or if you want to use non-NLP models, use the
* Requires the `manage_inference` <<privileges-list-cluster,cluster privilege>>
(the built-in `inference_admin` role grants this privilege)
[discrete]
[[put-inference-api-desc]]
==== {api-description-title}
@@ -48,25 +43,23 @@ The following services are available through the {infer} API:
* Hugging Face
* OpenAI
[discrete]
[[put-inference-api-path-params]]
==== {api-path-parms-title}
`<inference_id>`::
(Required, string)
The unique identifier of the {infer} endpoint.
`<task_type>`::
(Required, string)
The type of the {infer} task that the model will perform.
Available task types:
* `completion`,
* `rerank`,
* `sparse_embedding`,
* `text_embedding`.
[discrete]
[[put-inference-api-request-body]]
==== {api-request-body-title}
@@ -78,21 +71,18 @@ Available services:
* `azureopenai`: specify the `completion` or `text_embedding` task type to use the Azure OpenAI service.
* `azureaistudio`: specify the `completion` or `text_embedding` task type to use the Azure AI Studio service.
* `cohere`: specify the `completion`, `text_embedding` or the `rerank` task type to use the Cohere service.
* `elasticsearch`: specify the `text_embedding` task type to use the E5 built-in model or text embedding models uploaded by Eland.
* `elser`: specify the `sparse_embedding` task type to use the ELSER service.
* `googleaistudio`: specify the `completion` task to use the Google AI Studio service.
* `hugging_face`: specify the `text_embedding` task type to use the Hugging Face service.
* `openai`: specify the `completion` or `text_embedding` task type to use the OpenAI service.
`service_settings`::
(Required, object)
Settings used to install the {infer} model.
These settings are specific to the
`service` you specified.
+
.`service_settings` for the `azureaistudio` service
@@ -104,11 +94,10 @@ Settings used to install the {infer} model. These settings are specific to the
A valid API key of your Azure AI Studio model deployment.
This key can be found on the overview page for your deployment in the management section of your https://ai.azure.com/[Azure AI Studio] account.
IMPORTANT: You need to provide the API key only once, during the {infer} model creation.
The <<get-inference-api>> does not retrieve your API key.
After creating the {infer} model, you cannot change the associated API key.
If you want to use a different API key, delete the {infer} model and recreate it with the same name and the updated API key.
`target`:::
(Required, string)
@@ -142,11 +131,13 @@ For "real-time" endpoints which are billed per hour of usage, specify `realtime`
By default, the `azureaistudio` service sets the number of requests allowed per minute to `240`.
This helps to minimize the number of rate limit errors returned from Azure AI Studio.
To modify this, set the `requests_per_minute` setting of this object in your service settings:
+
[source,text]
----
"rate_limit": {
"requests_per_minute": <<number_of_requests>>
}
----
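+
The `rate_limit` object sits inside `service_settings` next to the service's other settings. As a minimal sketch - the `api_key` line stands in for whichever other settings the service requires, and the value shown is illustrative:
+
[source,text]
----
"service_settings": {
    "api_key": "<api_key>",
    "rate_limit": {
        "requests_per_minute": 480
    }
}
----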
=====
+
.`service_settings` for the `azureopenai` service
@@ -181,6 +172,22 @@ Your Azure OpenAI deployments can be found through the https://oai.azure.com/[Azu
The Azure API version ID to use.
We recommend using the https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#embeddings[latest supported non-preview version].
`rate_limit`:::
(Optional, object)
The `azureopenai` service sets a default number of requests allowed per minute depending on the task type.
For `text_embedding` it is set to `1440`.
For `completion` it is set to `120`.
This helps to minimize the number of rate limit errors returned from Azure.
To modify this, set the `requests_per_minute` setting of this object in your service settings:
+
[source,text]
----
"rate_limit": {
"requests_per_minute": <<number_of_requests>>
}
----
+
More information about the rate limits for Azure can be found in the https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits[Quota limits docs] and https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/quota?tabs=rest[How to change the quotas].
=====
+
.`service_settings` for the `cohere` service
@@ -188,24 +195,24 @@ We recommend using the https://learn.microsoft.com/en-us/azure/ai-services/opena
=====
`api_key`:::
(Required, string)
A valid API key of your Cohere account.
You can find your Cohere API keys or you can create a new one
https://dashboard.cohere.com/api-keys[on the API keys settings page].
IMPORTANT: You need to provide the API key only once, during the {infer} model creation.
The <<get-inference-api>> does not retrieve your API key.
After creating the {infer} model, you cannot change the associated API key.
If you want to use a different API key, delete the {infer} model and recreate it with the same name and the updated API key.
`embedding_type`::
(Optional, string)
Only for `text_embedding`.
Specifies the types of embeddings you want to get back.
Defaults to `float`.
Valid values are:
* `byte`: use it for signed int8 embeddings (this is a synonym of `int8`).
* `float`: use it for the default float embeddings.
* `int8`: use it for signed int8 embeddings.
`model_id`::
(Optional, string)
@@ -214,50 +221,68 @@ To review the available `rerank` models, refer to the
https://docs.cohere.com/reference/rerank-1[Cohere docs].
To review the available `text_embedding` models, refer to the
https://docs.cohere.com/reference/embed[Cohere docs].
The default value for `text_embedding` is `embed-english-v2.0`.
`rate_limit`:::
(Optional, object)
By default, the `cohere` service sets the number of requests allowed per minute to `10000`.
This value is the same for all task types.
This helps to minimize the number of rate limit errors returned from Cohere.
To modify this, set the `requests_per_minute` setting of this object in your service settings:
+
[source,text]
----
"rate_limit": {
"requests_per_minute": <<number_of_requests>>
}
----
+
More information about Cohere's rate limits can be found in https://docs.cohere.com/docs/going-live#production-key-specifications[Cohere's production key docs].
=====
+
.`service_settings` for the `elasticsearch` service
[%collapsible%closed]
=====
`model_id`:::
(Required, string)
The name of the model to use for the {infer} task.
It can be the ID of either a built-in model (for example, `.multilingual-e5-small` for E5) or a text embedding model already
{ml-docs}/ml-nlp-import-model.html#ml-nlp-import-script[uploaded through Eland].
`num_allocations`:::
(Required, integer)
The number of model allocations to create.
`num_allocations` must not exceed the number of available processors per node divided by the `num_threads`.
For example, with 16 available processors per node and `num_threads` set to `2`, `num_allocations` can be at most `8`.
`num_threads`:::
(Required, integer)
The number of threads to use by each model allocation.
`num_threads` must not exceed the number of available processors per node divided by the number of allocations.
Must be a power of 2.
Max allowed value is 32.
=====
+
.`service_settings` for the `elser` service
[%collapsible%closed]
=====
`num_allocations`:::
(Required, integer)
The number of model allocations to create.
`num_allocations` must not exceed the number of available processors per node divided by the `num_threads`.
`num_threads`:::
(Required, integer)
The number of threads to use by each model allocation.
`num_threads` must not exceed the number of available processors per node divided by the number of allocations.
Must be a power of 2.
Max allowed value is 32.
=====
+
.`service_settings` for the `googleaistudio` service
[%collapsible%closed]
=====
`api_key`:::
(Required, string)
A valid API key for the Google Gemini API.
@@ -274,76 +299,113 @@ This helps to minimize the number of rate limit errors returned from Google AI S
To modify this, set the `requests_per_minute` setting of this object in your service settings:
+
--
[source,text]
----
"rate_limit": {
"requests_per_minute": <<number_of_requests>>
}
----
--
=====
+
.`service_settings` for the `hugging_face` service
[%collapsible%closed]
=====
`api_key`:::
(Required, string)
A valid access token of your Hugging Face account.
You can find your Hugging Face access tokens or you can create a new one
https://huggingface.co/settings/tokens[on the settings page].
IMPORTANT: You need to provide the API key only once, during the {infer} model creation.
The <<get-inference-api>> does not retrieve your API key.
After creating the {infer} model, you cannot change the associated API key.
If you want to use a different API key, delete the {infer} model and recreate it with the same name and the updated API key.
`url`:::
(Required, string)
The URL endpoint to use for the requests.
`rate_limit`:::
(Optional, object)
By default, the `hugging_face` service sets the number of requests allowed per minute to `3000`.
This helps to minimize the number of rate limit errors returned from Hugging Face.
To modify this, set the `requests_per_minute` setting of this object in your service settings:
+
[source,text]
----
"rate_limit": {
"requests_per_minute": <<number_of_requests>>
}
----
=====
+
.`service_settings` for the `openai` service
[%collapsible%closed]
=====
`api_key`:::
(Required, string)
A valid API key of your OpenAI account.
You can find your OpenAI API keys in your OpenAI account under the
https://platform.openai.com/api-keys[API keys section].
IMPORTANT: You need to provide the API key only once, during the {infer} model creation.
The <<get-inference-api>> does not retrieve your API key.
After creating the {infer} model, you cannot change the associated API key.
If you want to use a different API key, delete the {infer} model and recreate it with the same name and the updated API key.
`model_id`:::
(Required, string)
The name of the model to use for the {infer} task.
Refer to the
https://platform.openai.com/docs/guides/embeddings/what-are-embeddings[OpenAI documentation]
for the list of available text embedding models.
`organization_id`:::
(Optional, string)
The unique identifier of your organization.
You can find the Organization ID in your OpenAI account under
https://platform.openai.com/account/organization[**Settings** > **Organizations**].
`url`:::
(Optional, string)
The URL endpoint to use for the requests.
Can be changed for testing purposes.
Defaults to `https://api.openai.com/v1/embeddings`.
`rate_limit`:::
(Optional, object)
The `openai` service sets a default number of requests allowed per minute depending on the task type.
For `text_embedding` it is set to `3000`.
For `completion` it is set to `500`.
This helps to minimize the number of rate limit errors returned from OpenAI.
To modify this, set the `requests_per_minute` setting of this object in your service settings:
+
[source,text]
----
"rate_limit": {
"requests_per_minute": <<number_of_requests>>
}
----
+
More information about the rate limits for OpenAI can be found in your https://platform.openai.com/account/limits[Account limits].
=====
`task_settings`::
(Optional, object)
Settings to configure the {infer} task.
These settings are specific to the
`<task_type>` you specified.
+
.`task_settings` for the `completion` task type
[%collapsible%closed]
=====
`do_sample`:::
(Optional, float)
For the `azureaistudio` service only.
@@ -358,8 +420,8 @@ Defaults to 64.
`user`:::
(Optional, string)
For `openai` service only.
Specifies the user issuing the request, which can be used for abuse detection.
`temperature`:::
(Optional, float)
@@ -378,45 +440,46 @@ Should not be used if `temperature` is specified.
.`task_settings` for the `rerank` task type
[%collapsible%closed]
=====
`return_documents`::
(Optional, boolean)
For `cohere` service only.
Specify whether to return doc text within the results.
`top_n`::
(Optional, integer)
The number of most relevant documents to return.
Defaults to the number of documents.
=====
+
.`task_settings` for the `text_embedding` task type
[%collapsible%closed]
=====
`input_type`:::
(Optional, string)
For `cohere` service only.
Specifies the type of input passed to the model.
Valid values are:
* `classification`: use it for embeddings passed through a text classifier.
* `clustering`: use it for the embeddings run through a clustering algorithm.
* `ingest`: use it for storing document embeddings in a vector database.
* `search`: use it for storing embeddings of search queries run against a vector database to find relevant documents.
`truncate`:::
(Optional, string)
For `cohere` service only.
Specifies how the API handles inputs longer than the maximum token length.
Defaults to `END`.
Valid values are:
* `NONE`: when the input exceeds the maximum input token length an error is returned.
* `START`: when the input exceeds the maximum input token length the start of the input is discarded.
* `END`: when the input exceeds the maximum input token length the end of the input is discarded.
`user`:::
(Optional, string)
For `openai`, `azureopenai` and `azureaistudio` services only.
Specifies the user issuing the request, which can be used for abuse detection.
=====
[discrete]
@@ -470,7 +533,6 @@ PUT _inference/completion/azure_ai_studio_completion
The list of chat completion models that you can choose from in your deployment can be found in the https://ai.azure.com/explore/models?selectedTask=chat-completion[Azure AI Studio model explorer].
[discrete]
[[inference-example-azureopenai]]
===== Azure OpenAI service
@@ -519,7 +581,6 @@ The list of chat completion models that you can choose from in your Azure OpenAI
* https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#gpt-4-and-gpt-4-turbo-models[GPT-4 and GPT-4 Turbo models]
* https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#gpt-35[GPT-3.5]
[discrete]
[[inference-example-cohere]]
===== Cohere service
@@ -565,7 +626,6 @@ PUT _inference/rerank/cohere-rerank
For more examples, also review the
https://docs.cohere.com/docs/elasticsearch-and-cohere#rerank-search-results-with-cohere-and-elasticsearch[Cohere documentation].
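As a sketch of how these settings combine, the following request creates a `text_embedding` endpoint using only the Cohere options documented above; the endpoint name, model and values are illustrative, not a recommended configuration:
[source,console]
------------------------------------------------------------
PUT _inference/text_embedding/cohere-embeddings-sketch
{
    "service": "cohere",
    "service_settings": {
        "api_key": "<api_key>",
        "model_id": "embed-english-v3.0",
        "embedding_type": "byte",
        "rate_limit": {
            "requests_per_minute": 100 <1>
        }
    },
    "task_settings": {
        "input_type": "ingest",
        "truncate": "END"
    }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> A custom limit below the `cohere` service default of `10000` requests per minute.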
[discrete]
[[inference-example-e5]]
===== E5 via the `elasticsearch` service
@@ -586,10 +646,9 @@ PUT _inference/text_embedding/my-e5-model
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The `model_id` must be the ID of one of the built-in E5 models.
Valid values are `.multilingual-e5-small` and `.multilingual-e5-small_linux-x86_64`.
For further details, refer to the {ml-docs}/ml-nlp-e5.html[E5 model documentation].
[discrete]
[[inference-example-elser]]
@@ -597,8 +656,7 @@ further details, refer to the {ml-docs}/ml-nlp-e5.html[E5 model documentation].
The following example shows how to create an {infer} endpoint called
`my-elser-model` to perform a `sparse_embedding` task type.
Refer to the {ml-docs}/ml-nlp-elser.html[ELSER model documentation] for more info.
[source,console]
------------------------------------------------------------
@@ -672,16 +730,17 @@ PUT _inference/text_embedding/hugging-face-embeddings
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> A valid Hugging Face access token.
You can find it on the
https://huggingface.co/settings/tokens[settings page of your account].
<2> The {infer} endpoint URL you created on Hugging Face.
Create a new {infer} endpoint on
https://ui.endpoints.huggingface.co/[the Hugging Face endpoint page] to get an endpoint URL.
Select the model you want to use on the new endpoint creation page - for example `intfloat/e5-small-v2` - then select the `Sentence Embeddings` task under the Advanced configuration section.
Create the endpoint.
Copy the URL after the endpoint initialization has been finished.
[discrete]
[[inference-example-hugging-face-supported-models]]
@@ -695,7 +754,6 @@ The list of recommended models for the Hugging Face service:
* https://huggingface.co/intfloat/multilingual-e5-base[multilingual-e5-base]
* https://huggingface.co/intfloat/multilingual-e5-small[multilingual-e5-small]
[discrete]
[[inference-example-eland]]
===== Models uploaded by Eland via the elasticsearch service
@@ -716,11 +774,9 @@ PUT _inference/text_embedding/my-msmarco-minilm-model
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> The `model_id` must be the ID of a text embedding model which has already been
{ml-docs}/ml-nlp-import-model.html#ml-nlp-import-script[uploaded through Eland].
[discrete]
[[inference-example-openai]]
===== OpenAI service
@@ -756,4 +812,3 @@ PUT _inference/completion/openai-completion
}
------------------------------------------------------------
// TEST[skip:TBD]
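As a final sketch, the following creates an OpenAI `text_embedding` endpoint that lowers the default rate limit, using only the settings documented above; the endpoint name, model and value are illustrative:
[source,console]
------------------------------------------------------------
PUT _inference/text_embedding/openai-embeddings-sketch
{
    "service": "openai",
    "service_settings": {
        "api_key": "<api_key>",
        "model_id": "text-embedding-ada-002",
        "rate_limit": {
            "requests_per_minute": 100 <1>
        }
    }
}
------------------------------------------------------------
// TEST[skip:TBD]
<1> A custom limit below the `text_embedding` default of `3000` requests per minute.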