Commit graph

63 commits

Author SHA1 Message Date
István Zoltán Szabó
e54f46e4eb
[DOCS] Fixes indentation issue on PUT trained models docs page. (#112538) 2024-09-05 10:46:41 +02:00
István Zoltán Szabó
d6c532135e
[DOCS] Adds adaptive_allocations to inference and trained model API docs (#111476) 2024-08-01 12:37:07 +02:00
David Kyle
d38d1af242
[ML] GA the update trained model action (#108868)
Accidentally missed when the other trained model APIs went GA
2024-05-22 13:30:25 +01:00
Max Hniebergall
a2008bd190
[ML] Add option to disable inference process cache by default (#108784)
* Add option to disable inference process cache by default

* Add test

* improve tests

* Update docs and improve code

---------

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2024-05-19 11:11:02 -04:00
Liam Thompson
33a71e3289
[DOCS] Refactor book-scoped variables in docs/reference/index.asciidoc (#107413)
* Remove `es-test-dir` book-scoped variable

* Remove `plugins-examples-dir` book-scoped variable

* Remove `:dependencies-dir:` and `:xes-repo-dir:` book-scoped variables

- In `index.asciidoc`, two variables (`:dependencies-dir:` and `:xes-repo-dir:`) were removed.
- In `sql/index.asciidoc`, the `:sql-tests:` path was updated to a fuller path
- In `esql/index.asciidoc`, the `:esql-tests:` path was updated likewise

* Replace `es-repo-dir` with `es-ref-dir`

* Move `:include-xpack: true` to few files that use it, remove from index.asciidoc
2024-04-17 14:37:07 +02:00
David Kyle
50dcfdc726
[ML] Document wait_for_completion parameter to PUT trained models (#106769) 2024-03-27 16:55:06 +00:00
István Zoltán Szabó
e48b549588
[DOCS] Fixes asciidoc syntax in PUT trained models API docs. (#104741) 2024-01-25 14:22:17 +01:00
David Kyle
330e8b99bf
[ML] Add prefix strings option to trained models (#102089)
Certain NLP models such as multilingual-e5-large require a prefix 
string to be applied to the input text. For asymmetric tasks such as 
information retrieval the prefix can be different when ingesting the
data and when searching it. For example, a text embedding model can
have one prefix applied when the model is evaluated as part of a
knn search and a different prefix when ingesting documents.
2023-11-14 13:02:02 +00:00
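As a rough illustration of the prefix-string idea in the commit above, the following Python sketch prepends one prefix at ingest time and another at search time. The function name, the context labels, and the `passage: ` / `query: ` values are assumptions made for illustration; they are not the configuration format this commit introduced.

```python
# Hypothetical prefixes; the real values depend on the model in use.
PREFIXES = {
    "ingest": "passage: ",
    "search": "query: ",
}


def apply_prefix(text: str, context: str) -> str:
    """Return `text` with the prefix for `context` prepended.

    Unknown contexts get no prefix, so the input passes through unchanged.
    """
    return PREFIXES.get(context, "") + text


ingest_input = apply_prefix("Elasticsearch is a search engine.", "ingest")
search_input = apply_prefix("what is elasticsearch", "search")
```

The same input text therefore reaches the model in two different forms depending on whether it is being indexed or searched.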
István Zoltán Szabó
481ebd2e21
[DOCS] Improves readability of PUT trained models API docs page (#101880)
* [DOCS] Improves readability of PUT trained models API docs page.

* [DOCS] Fixes URLs.
2023-11-08 17:57:57 +01:00
István Zoltán Szabó
c34e0c0746
[DOCS] Clarifies that inference input must be single string (#101301) 2023-10-25 17:18:05 +02:00
Max Hniebergall
7c21ce3f1b
Platform specific models (#99584)
* Added platform architecture field to TrainedModelMetadata and users of TrainedModelMetadata

* Added TransportVersions guarding for TrainedModelMetadata

* Prevent platform-specific models from being deployed on the wrong architecture

* Added logic to only verify node architectures for models which are platform specific

* Handle null platform architecture

* Added logging for the detection of heterogeneous platform architectures among ML nodes and refactoring to support this

* Added platform architecture field to TrainedModelConfig

* Stop platform-specific model when rebalance occurs and the cluster has a heterogeneous architecture among ML nodes

* Added logic to TransportPutTrainedModelAction to return a warning response header when the model is platform-specific and cannot be deployed on the cluster at that time due to heterogeneous architectures among ML nodes

* Added MlPlatformArchitecturesUtilTests

* Updated Create Trained Models API docs to describe the new platform_architecture optional field.

* Updated/incremented InferenceIndexConstants

* Added special override to make models with linux-x86_64 in the model ID platform specific
2023-09-28 13:56:45 -04:00
Jonathan Buttner
1ca66bde91
[ML] Safely drain deployment request queues before allowing node to shutdown (#98406)
* isSafeToShutdown checks routing table

* Rebalancer changes and tests

* Update docs/changelog/98406.yaml

* Forcing lifecycle tests to avoid over time case

* Changes and remaining tests

* Adding node service changes

* Finishing unit tests

* Adding wait for completion parameter

* Adding stop deployment integration tests

* Cleaning up code

* Fixing stop deployment test

* Fixing string formatter issue and timeout

* Investigating deadlock

* More testing

* More logging

* Prevent model reloading while stopping

* Fixing compile error

* More code clean up

* Adding test for loading model after stopping

* Addressing review feedback

* Fixing a couple shutdown -> shutdownNow tests

* Adding doc changes and refactoring
2023-08-31 15:37:11 -04:00
Ed Savage
3682a88199
[ML] Update documentation regarding versioning. (#98320)
Update the ml and transform reference documentation to provide information regarding the new versioning schemes independent from the product versions.

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
2023-08-10 11:20:58 +01:00
David Roberts
e810d7b77b
[ML] inference_config is optional for the infer trained model API (#97464)
It was made optional in #92359 which was released in version 8.6.1,
but the docs weren't updated to reflect this.
2023-07-12 08:35:06 +01:00
Max Hniebergall
3a4113801c
[NLP] Support the different mask tokens used by NLP models for Fill Mask (#97453)
Add mask_token field to fill_mask of _ml/trained_models.

This change will enable users and Kibana to get the particular mask tokens needed for deployed models by adding a mask_token field to the GET _ml/trained_models API, as an enhancement to support kibana#159577.
2023-07-11 14:42:44 -04:00
István Zoltán Szabó
8d5b803bff
[DOCS] Adds API docs for bert_ja text embedding tokenizer option (#96873) 2023-06-26 11:36:08 +02:00
Benjamin Trent
14ca8fee20
[ML] add support for xlm_roberta tokenized models (#94089)
Many multi-lingual and newer models use a tokenization scheme similar to
sentence-piece. This PR adds support for one of those tokenization
schemes, XLMRoBERTa. 

The main changes are:
 - Support for xlm_roberta tokenization configuration
 - Adding `scores` to the vocabulary document stored, requiring that scores be the same size as the vocabulary
 - Adding a new flat text file to resources that is the spm char normalizer.
2023-06-13 08:40:55 -04:00
István Zoltán Szabó
b164555072
[DOCS] Adds deployment ID param documentation to trained model APIs (#96174) 2023-05-17 15:56:58 +02:00
David Kyle
6de8469a51
[ML] Include model definition install status for Pytorch models (#95271)
Adds a new include flag definition_status to the GET trained models API.
When present the trained model configuration returned in the response 
will have the new boolean field fully_defined if the full model definition
exists.
2023-04-17 18:12:26 +01:00
David Kyle
7d90c519ef
[ML] Add embedding_size to text embedding config (#95176) 2023-04-17 11:49:35 +01:00
István Zoltán Szabó
c08c16e311
[DOCS] Removes semantic search reference docs (#93500) 2023-02-06 11:00:25 +01:00
David Kyle
6acfbbcd8b
[ML] Utilise parallel allocations where the inference request contains multiple documents (#92359)
Divide work from the _infer API among all allocations
2023-01-11 12:38:35 +00:00
David Kyle
fbb6abd2f4
[ML] Increase the default timeout for start trained model deployment (#92328)
A 30 second timeout is in line with the default value used in most ML APIs.
2022-12-14 13:32:23 +00:00
David Roberts
6fa3d73fd5
[ML] Make native inference generally available (#92213)
Previously this functionality was beta. This PR changes it to GA.
2022-12-12 15:43:30 +00:00
Nik Everett
6481342466
Fix sneaky docs test failure (#91829)
This prevents docs files from *starting* with a "response" because when
that happens the response is converted to an assertion and appended
to the last snippet that was processed. If that last snippet was in a
different file then it's very hard to reason about the tests. That goes
double because the order we iterate files isn't defined....

Anyway! This adds a guard in the build, removes the offending
"response", and reenables the tests that we'd thought were failing here.

Closes #91081
2022-12-07 11:02:44 -05:00
István Zoltán Szabó
99415818e2
[DOCS] Adds semantic search API to the trained model API list (#91815) 2022-11-22 18:08:06 +01:00
David Kyle
7b9a6fe3db
[ML] Correct index for text_similarity config (#91644) 2022-11-17 10:58:36 +00:00
István Zoltán Szabó
612a7b673a
[DOCS] Highlights inference caching behavior (#91608) 2022-11-16 13:17:49 +01:00
Benjamin Trent
2e8bf33b0a
[ML] allow model_aliases to be used with Pytorch trained models (#91296)
This adds model_alias support for native pytorch models.

Model aliases can be used in `_infer` or within the inference processor. This way the alias can be atomically changed to another deployed model without downtime.

Restrictions:
 - Model alias changes need to be done between two models of the same kind (e.g. pytorch -> pytorch)
 - Model alias change is not allowed from a model that is deployed to a model that is not
 - Model alias change is not allowed from a model that is deployed AND allocated to a model that is deployed but NOT allocated (not assigned to any nodes).
 - A deployment cannot be stopped (without supplying the `force` parameter) when the model has a model alias that is used by a pipeline.


closes: https://github.com/elastic/elasticsearch/issues/90960
2022-11-08 08:35:33 -05:00
Dimitris Athanasiou
4e67df8b05
[ML] Low priority trained model deployments (#91234)
This adds a new parameter to the start trained model deployment API,
namely `priority`. The available settings are `normal` and `low`.

For normal priority deployments the allocations get distributed so that
node processors are never oversubscribed.

Low priority deployments allow users to test model functionality even if there
are no node processors available. They are limited to 1 allocation with a single thread.
In addition, the process is executed in low priority which limits the amount of
CPU that can be used when the CPU is under pressure. The intention of this is to
limit the impact of low priority deployments on normal priority deployments.

When we rebalance model assignments we now:

  1. compute a plan just for normal priority deployments
  2. fix the resources used by normal deployments
  3. compute a plan just for low priority deployments
  4. merge the two plans

Closes #91024
2022-11-04 14:22:30 +02:00
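The four rebalancing steps listed in the commit above can be sketched roughly as follows. This is a deliberately simplified greedy placement written for illustration, not the actual assignment planner; the function name, the input shapes, and the tie-breaking rule are all assumptions.

```python
def plan_assignments(free_cpus, deployments):
    """Hypothetical two-phase plan.

    free_cpus:   {node_name: free processors}, assumed non-empty
    deployments: list of dicts with keys id, priority, allocations, threads
    Returns {deployment_id: {node_name: allocations placed on that node}}.
    """
    remaining = dict(free_cpus)

    # 1. Plan normal priority deployments without oversubscribing processors.
    normal_plan = {}
    for d in deployments:
        if d["priority"] != "normal":
            continue
        wanted = d["allocations"]
        placed = {}
        for node in sorted(remaining):
            fit = min(wanted, remaining[node] // d["threads"])
            if fit > 0:
                placed[node] = fit
                # 2. Fix the resources used by normal deployments.
                remaining[node] -= fit * d["threads"]
                wanted -= fit
        normal_plan[d["id"]] = placed

    # 3. Plan low priority deployments: one allocation with a single thread,
    #    placed even when no processors are left free on any node.
    low_plan = {}
    for d in deployments:
        if d["priority"] == "low":
            node = max(sorted(remaining), key=lambda n: remaining[n])
            low_plan[d["id"]] = {node: 1}

    # 4. Merge the two plans.
    return {**normal_plan, **low_plan}


example = plan_assignments(
    {"node-a": 4, "node-b": 2},
    [
        {"id": "m1", "priority": "normal", "allocations": 3, "threads": 2},
        {"id": "m2", "priority": "low", "allocations": 1, "threads": 1},
    ],
)
```

In this example the normal deployment uses every processor, and the low priority deployment is still placed, which mirrors the stated goal of letting users test a model when no node processors are available.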
Dimitris Athanasiou
16bfc550ea
[ML] Add api to update trained model deployment number_of_allocations (#90728)
This commit adds a new API that users can call with:

```
POST _ml/trained_models/{model_id}/deployment/_update
{
  "number_of_allocations": 4
}
```

This allows a user to update the number of allocations for a deployment
that is `started`.

If the allocations are increased we rebalance and let the assignment
planner find how to allocate the additional allocations.

If the allocations are decreased we cannot use the assignment planner.
Instead, we implement the reduction in a new class `AllocationReducer`
that tries to reduce the allocations so that:

  1. availability zone balance is maintained
  2. assignments that can be completely stopped are preferred to release memory
2022-10-12 10:04:23 +03:00
Lisa Cawley
db2882cbb5
[DOCS] Add links to clear trained model deployment cache API (#90727) 2022-10-06 10:10:55 -07:00
David Kyle
17579ae1af
[ML] Add stat for non cache hit inference time (#90464) 2022-09-29 12:18:27 +01:00
David Roberts
d9ea080d10
[ML] Release native inference functionality as beta (#90418)
Previously this functionality was tech preview (aka experimental).
This PR changes it to beta.
2022-09-28 11:09:02 +01:00
István Zoltán Szabó
cbda0a51c6
[DOCS] Adds text similarity task example to API docs (#89756) 2022-09-01 11:53:26 +02:00
Dimitris Athanasiou
32d512286d
[ML] Validate trained model deployment queue_capacity limit (#89573)
When starting a trained model deployment, a queue is created.
If the queue_capacity is too large, it can lead to OOM and a node
crash.

This commit adds validation that the queue_capacity cannot be more
than 1M.

Closes #89555
2022-08-24 16:52:19 +03:00
Benjamin Trent
d588d456f0
[ML] add new trained model deployment cache clear API (#89074)
This adds a new `_ml/trained_models/<model_id>/deployment/cache/_clear` API. This will clear the inference cache on every node where the model is allocated.
2022-08-04 19:45:15 +01:00
Benjamin Trent
9ce59bb7a9
[ML] add text_similarity nlp task documentation (#88994)
Introduced in: #88439

* [ML] add text_similarity nlp task documentation

* Apply suggestions from code review

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Update docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Apply suggestions from code review

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Update docs/reference/ml/ml-shared.asciidoc

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
2022-08-02 12:17:14 -04:00
David Roberts
15e7b06b79
[ML] Add inference cache hit count to inference node stats (#88807)
The inference node stats for deployed PyTorch inference
models now contain two new fields: `inference_cache_hit_count`
and `inference_cache_hit_count_last_minute`.

These indicate how many inferences on that node were served
from the C++-side response cache that was added in
https://github.com/elastic/ml-cpp/pull/2305. Cache hits
occur when exactly the same inference request is sent to the
same node more than once.

The `average_inference_time_ms` and
`average_inference_time_ms_last_minute` fields now refer to
the time taken to do the cache lookup, plus, if necessary,
the time to do the inference. We would expect average inference
time to be vastly reduced in situations where the cache hit
rate is high.
2022-07-26 17:53:43 +01:00
Benjamin Trent
afa28d49b4
[ML] add new cache_size parameter to trained_model deployments API (#88450)
With: https://github.com/elastic/ml-cpp/pull/2305 we now support caching pytorch inference responses per node per model.

By default, the cache will be the same size as the model's size on disk. This is because our current best estimate for memory used (for deploying) is 2*model_size + constant_overhead.

This is due to the model having to be loaded in memory twice when serializing to the native process. 

But, once the model is in memory and accepting requests, its actual memory usage is reduced vs. what we have "reserved" for it within the node.

Consequently, having a cache layer that takes advantage of that unused (but reserved) memory is effectively free. When used in production, especially in search scenarios, caching inference results is critical for decreasing latency.
2022-07-18 09:19:01 -04:00
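The memory reasoning in the commit above can be worked through as simple arithmetic. The sketch below only restates the formula from the message; the function names and the overhead value used in the example are hypothetical, since the message does not give the constant.

```python
def estimated_deployment_memory(model_size_bytes: int, constant_overhead_bytes: int) -> int:
    """Memory reserved for a deployment: 2 * model_size + constant_overhead."""
    return 2 * model_size_bytes + constant_overhead_bytes


def default_cache_size(model_size_bytes: int) -> int:
    """By default the cache is the same size as the model on disk."""
    return model_size_bytes


MB = 1024 * 1024
model_size = 400 * MB
overhead = 200 * MB  # hypothetical value, chosen only for this example

reserved = estimated_deployment_memory(model_size, overhead)
cache = default_cache_size(model_size)
# After loading, only one copy of the model stays in memory, so the second
# "model_size" worth of reserved memory is what the cache can occupy.
in_use_with_cache = model_size + overhead + cache
```

Under these assumptions the cache fits entirely inside the memory that was already reserved for loading the model.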
István Zoltán Szabó
cf68d0f13c
[DOCS] Updates infer trained model API docs with inference_config (#88500)
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
2022-07-13 17:47:05 +02:00
Dimitris Athanasiou
f3199e968b
[ML] Adjust docs for distributed model allocation (#87955)
[ML] Adjust docs for distributed model allocation

Follow up to #87366
2022-06-23 15:35:58 +03:00
Dimitris Athanasiou
679351e224
[ML] Require that threads_per_allocation is a power of 2 (#87697)
As the number of cores in CPUs is typically a power of 2,
this commit adds a validation that trained model deployments
start with `threads_per_allocation` set to be a power of 2.
When we look for how we distribute the allocations across the
cluster, this prevents situations where we have a lot of wasted
CPU cores.

In addition, we add a max value limit of `32`.
2022-06-17 15:12:37 +03:00
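A minimal sketch of the validation described in the commit above: the value must be a power of 2 and at most 32. The function name and the error message are assumptions; the real check lives in the Elasticsearch Java code and may differ in detail.

```python
MAX_THREADS_PER_ALLOCATION = 32  # upper limit stated in the commit message


def validate_threads_per_allocation(value: int) -> int:
    """Return `value` if it is a power of 2 no greater than 32, else raise."""
    # A positive integer is a power of 2 exactly when it has a single bit set.
    is_power_of_two = value > 0 and (value & (value - 1)) == 0
    if not is_power_of_two or value > MAX_THREADS_PER_ALLOCATION:
        raise ValueError(
            f"threads_per_allocation must be a power of 2 "
            f"no greater than {MAX_THREADS_PER_ALLOCATION}, got {value}"
        )
    return value


accepted = [n for n in range(1, 65) if (n & (n - 1)) == 0 and n <= 32]
```

With this rule the only accepted values are 1, 2, 4, 8, 16, and 32.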
István Zoltán Szabó
f3e8904b2c
[DOCS] Adds settings of question_answering to inference_config of PUT and infer trained model APIs (#86895)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2022-05-19 11:04:14 +02:00
Lisa Cawley
6b7320790f
[DOCS] Updates example output for start trained model deployment API (#86824) 2022-05-17 07:27:44 -07:00
Lisa Cawley
a9c8c12814
[DOCS] Removes infer trained model deployment API (#86497) 2022-05-10 09:56:36 -07:00
Dimitris Athanasiou
68c51f3ada
[ML] Rename threading params in _start trained model deployment API (#86597)
When starting a trained model deployment the user can tweak performance
by setting the `model_threads` and `inference_threads` parameters.
These parameters are hard to understand and cause confusion.

This commit renames these as well as the fields where their values are
reported in the stats API.

- `model_threads` => `number_of_allocations`
- `inference_threads` => `threads_per_allocation`

Now the terminology is as follows.

A model deployment starts with a requested `number_of_allocations`.
Each allocation means the model gets another thread for executing
parallel inference requests. Thus, more allocations should increase
throughput. In turn, each allocation may use a number
of threads to parallelize each individual inference request.
This is the `threads_per_allocation` setting and increases inference
speed (which might also result in improved throughput).
2022-05-10 17:41:00 +03:00
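To make the renamed terminology concrete, the following sketch computes how many threads a deployment would use in total under the description above. The function name is hypothetical, and the sketch assumes every allocation uses its full `threads_per_allocation`.

```python
def total_threads(number_of_allocations: int, threads_per_allocation: int) -> int:
    """Threads used by a deployment when every allocation is fully busy.

    number_of_allocations  -> how many inference requests run in parallel
    threads_per_allocation -> how many threads speed up each single request
    """
    return number_of_allocations * threads_per_allocation


# Two configurations with the same thread budget but different trade-offs:
throughput_oriented = total_threads(number_of_allocations=8, threads_per_allocation=1)
latency_oriented = total_threads(number_of_allocations=2, threads_per_allocation=4)
```

Both configurations use eight threads; the first favors more requests in parallel, the second favors faster individual requests.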
Lisa Cawley
89a3e18e10
[DOCS] Add preview admonition to infer API (#86486) 2022-05-05 13:49:02 -07:00
Benjamin Trent
a907f0bb6f
[ML] add new trained_models/{model_id}/_infer endpoint for all supervised models and deprecate deployment infer api (#86361)
This commit adds a new `_ml/trained_models/{model_id}/_infer` API. This api works for both native NLP models and supervised models trained via Data Frame analytics. 

The format of the API is the same as the old `_ml/trained_models/{model_id}/deployment/_infer`, taking a `docs` and an `inference_config` parameter.

This PR also deprecates the old experimental `_ml/trained_models/{model_id}/deployment/_infer` API.

The biggest difference is that the response now nests all results under an "inference_results" object.

closes: https://github.com/elastic/elasticsearch/issues/86032
2022-05-05 14:58:59 -04:00
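The request and response shape described in the commit above might be handled as in the sketch below. It builds the request without sending it. The helper names, the `text_field` document key, and the sample response content are assumptions for illustration; only the endpoint path, the `docs` and `inference_config` parameters, and the `inference_results` nesting come from the commit message.

```python
import json


def build_infer_request(model_id, docs, inference_config=None):
    """Return (method, path, body) for the _infer endpoint; nothing is sent."""
    body = {"docs": docs}
    if inference_config is not None:
        body["inference_config"] = inference_config
    return "POST", f"_ml/trained_models/{model_id}/_infer", json.dumps(body)


def extract_results(response: dict) -> list:
    """Results are nested under "inference_results" in the new API."""
    return response.get("inference_results", [])


method, path, body = build_infer_request(
    "my-model",  # hypothetical model ID
    [{"text_field": "The quick brown fox"}],  # hypothetical field name
)

# Hypothetical response, shaped only to show the nesting.
sample_response = {"inference_results": [{"predicted_value": "example"}]}
results = extract_results(sample_response)
```

Callers written against the old deployment `_infer` API would need the extra unwrapping step shown in `extract_results`.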
Benjamin Trent
25d1afbe6f
[ML] rename trained model allocations to assignments (#85503)
This renames the internal concept of a trained model allocation into an assignment.

Models are now "assigned" to a node, with routes created for inference, rather than "allocated".

This is an internal rename only. The user facing concepts of trained models and deployments are untouched.
2022-04-18 11:35:10 -04:00