Add mask_token field to fill_mask of _ml/trained_models.
This change enables users and Kibana to retrieve the particular mask token needed for a deployed model by adding a mask_token field to the GET _ml/trained_models API, as an enhancement to support kibana#159577.
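As an illustrative sketch (the model ID and the abbreviated response shape are assumptions, not taken from the PR), the new field would surface alongside the fill_mask configuration:
```
GET _ml/trained_models/bert-base-uncased-fill-mask

{
  "trained_model_configs": [
    {
      "model_id": "bert-base-uncased-fill-mask",
      "inference_config": {
        "fill_mask": {
          "mask_token": "[MASK]"
        }
      }
    }
  ]
}
```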
Many multi-lingual and newer models use a tokenization scheme similar to
sentence-piece. This PR adds support for one of those tokenization
schemes, XLMRoBERTa.
The main changes are:
- Support for xlm_roberta tokenization configuration
- Adding `scores` to the vocabulary document stored, requiring that scores be the same size as the vocabulary
- Adding a new flat text file to resources that is the spm char normalizer.
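A sketch of what a model configuration using the new tokenization might look like, together with a vocabulary document that carries scores; the model ID, task type, vocabulary entries, and values are illustrative assumptions, and both requests are abbreviated:
```
PUT _ml/trained_models/xlm-roberta-base-classifier
{
  "inference_config": {
    "text_classification": {
      "tokenization": {
        "xlm_roberta": {
          "max_sequence_length": 512
        }
      }
    }
  }
}

PUT _ml/trained_models/xlm-roberta-base-classifier/vocabulary
{
  "vocabulary": ["<s>", "</s>", "<unk>", "▁the", "▁of"],
  "scores": [0.0, 0.0, 0.0, -3.2, -4.5]
}
```
Note that the scores array has exactly one entry per vocabulary entry, matching the size requirement described above.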
Adds a new include flag definition_status to the GET trained models API.
When present the trained model configuration returned in the response
will have the new boolean field fully_defined, indicating whether the full model definition
exists.
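For example (the model ID is a placeholder and the response is abbreviated):
```
GET _ml/trained_models/my-ner-model?include=definition_status

{
  "trained_model_configs": [
    {
      "model_id": "my-ner-model",
      "fully_defined": true
    }
  ]
}
```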
This PR adds a new field, `_meta`, to the data frame
analytics configuration.
The `_meta` field stores an arbitrary key-value map.
Keys are strings. Values are arbitrary objects
(possibly also maps).
The `_meta` field can be updated using the data frame
analytics `_update` endpoint.
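A sketch of setting the field via the update endpoint; the job ID and the keys and values stored in `_meta` are arbitrary examples:
```
POST _ml/data_frame/analytics/house-price-regression/_update
{
  "_meta": {
    "created_by": "data-science-team",
    "pipeline": {
      "name": "nightly-retrain",
      "version": 3
    }
  }
}
```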
This is the companion PR to elastic/ml-cpp#2440, adding processing of the multimodal_distribution field in the anomaly score explanation. I added a changelog entry in the ml-cpp PR, hence I mark this PR as a non-issue.
These docs previously implied that you could update datafeed
properties while the datafeed was running, but then would have
to stop and restart it for the changes to take effect.
In fact, datafeed updates can only be made while the datafeed is
stopped (and this has been the case for many years, if not forever).
This prevents docs files from *starting* with a "response" because when
that happens the response is converted to an assertion and appended
to the last snippet that was processed. If that last snippet was in a
different file then it's very hard to reason about the tests. That goes
double because the order we iterate files isn't defined....
Anyway! This adds a guard in the build, removes the offending
"response", and reenables the tests that we'd thought we failing here.
Closes#91081
Currently there is no way to remove user-added annotations when a job is deleted or reset.
This change adds an option, `delete_user_annotations`, to both the delete and reset job APIs.
The default value is false, so the current behaviour of these calls is preserved.
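For example, assuming the option is exposed as a query parameter (the job ID is a placeholder):
```
DELETE _ml/anomaly_detectors/my-job?delete_user_annotations=true

POST _ml/anomaly_detectors/my-job/_reset?delete_user_annotations=true
```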
This adds model_alias support for native pytorch models.
Model aliases can be used in `_infer` or within the inference processor, so the alias can be atomically switched to another deployed model without downtime (see the sketch after the restrictions below).
Restrictions:
- Model alias changes need to be done between two models of the same kind (e.g. pytorch -> pytorch)
- A model alias cannot be changed from a model that is deployed to a model that is not deployed
- A model alias cannot be changed from a model that is deployed AND allocated to a model that is deployed but NOT allocated (not assigned to any nodes)
- A deployment cannot be stopped (without supplying the `force` parameter) when the model has a model alias that is used by a pipeline.
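A sketch of swapping an alias between two deployed PyTorch models; the model IDs and the alias are placeholders, and the `reassign` flag is used on the assumption that the alias already points at the first model:
```
PUT _ml/trained_models/sentence-model-v1/model_aliases/my-search-model

POST _ml/trained_models/my-search-model/_infer
{
  "docs": [
    { "text_field": "find me similar products" }
  ]
}

PUT _ml/trained_models/sentence-model-v2/model_aliases/my-search-model?reassign=true
```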
closes: https://github.com/elastic/elasticsearch/issues/90960
This adds a new parameter to the start trained model deployment API,
namely `priority`. The available settings are `normal` and `low`.
For normal priority deployments the allocations get distributed so that
node processors are never oversubscribed.
Low priority deployments allow users to test model functionality even if there
are no node processors available. They are limited to 1 allocation with a single thread.
In addition, the process is executed at low priority, which limits the amount of
CPU it can use when the CPU is under pressure. The intention of this is to
limit the impact of low priority deployments on normal priority deployments.
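For example, a low priority deployment could be started like this (the model ID is a placeholder):
```
POST _ml/trained_models/my-test-model/deployment/_start?priority=low
```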
When we rebalance model assignments we now:
1. compute a plan just for normal priority deployments
2. fix the resources used by normal deployments
3. compute a plan just for low priority deployments
4. merge the two plans
Closes #91024
This PR surfaces new information about the impact of the factors on the initial anomaly score in the anomaly record:
- single bucket impact is determined by the deviation between actual and typical in the current bucket
- multi-bucket impact is determined by the deviation between actual and typical in the past 12 buckets
- anomaly characteristics are statistical properties of the current anomaly compared to the historical observations
- high variance penalty is the reduction of anomaly score in the buckets with large confidence intervals.
- incomplete bucket penalty is the reduction of anomaly score in the buckets with fewer samples than historically expected.
Additionally, we compute lower- and upper-confidence bounds and the typical value for the anomaly records. This improves the explainability of the cases where the model plot is not activated with only a slight overhead in performance (1-2%).
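A hedged sketch of how this information might appear on an anomaly record; the snake_case field names, the `anomaly_score_explanation` object name, and the values are illustrative assumptions derived from the descriptions above, not an exact response:
```
{
  "record_score": 86.2,
  "initial_record_score": 86.2,
  "anomaly_score_explanation": {
    "single_bucket_impact": 60,
    "multi_bucket_impact": 15,
    "anomaly_characteristics_impact": 25,
    "high_variance_penalty": false,
    "incomplete_bucket_penalty": false,
    "lower_confidence_bound": 93.5,
    "typical_value": 101.3,
    "upper_confidence_bound": 110.8
  }
}
```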
This commit adds a new API that users can call:
```
POST _ml/trained_models/{model_id}/deployment/_update
{
"number_of_allocations": 4
}
```
This allows a user to update the number of allocations for a deployment
that is `started`.
If the allocations are increased we rebalance and let the assignment
planner find how to allocate the additional allocations.
If the allocations are decreased we cannot use the assignment planner.
Instead, we implement the reduction in a new class `AllocationReducer`
that tries to reduce the allocations so that:
1. availability zone balance is maintained
2. assignments that can be completely stopped are preferred to release memory
Categorization of strings which break down to a huge number of tokens can cause the C++ backend process to choke - see elastic/ml-cpp#2403.
This PR adds a limit filter to the default categorization analyzer which caps the number of tokens passed to the backend at 100.
Unfortunately this isn't a complete panacea for all the issues surrounding categorization of large messages with many tokens, as verification checks on the frontend can also fail due to calls to the datafeed _preview API returning an excessive amount of data.
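As a rough sketch of the same kind of cap applied explicitly in a job's categorization analyzer (this is not the exact default analyzer definition; the job name and the other analyzer details are illustrative):
```
PUT _ml/anomaly_detectors/app-log-categorization
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "categorization_analyzer": {
      "tokenizer": "ml_standard",
      "filter": [
        { "type": "limit", "max_token_count": 100 }
      ]
    },
    "detectors": [
      { "function": "count", "by_field_name": "mlcategory" }
    ]
  },
  "data_description": { "time_field": "@timestamp" }
}
```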
When starting a trained model deployment, a queue is created.
If the queue_capacity is too large, it can lead to OOM and a node
crash.
This commit adds validation that the queue_capacity cannot be more
than 1M.
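For example, this request would still be accepted, whereas a queue_capacity above 1,000,000 would now be rejected with a validation error (the model ID is a placeholder):
```
POST _ml/trained_models/my-model/deployment/_start?queue_capacity=100000
```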
Closes #89555
This adds a new `_ml/trained_models/<model_id>/deployment/cache/_clear` API. This will clear the inference cache on every node where the model is allocated.
Introduced in: #88439
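For example (the model ID is a placeholder):
```
POST _ml/trained_models/my-model/deployment/cache/_clear
```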
* [ML] add text_similarity nlp task documentation
* Apply suggestions from code review
* Update docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc
* Update docs/reference/ml/ml-shared.asciidoc
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
The inference node stats for deployed PyTorch inference
models now contain two new fields: `inference_cache_hit_count`
and `inference_cache_hit_count_last_minute`.
These indicate how many inferences on that node were served
from the C++-side response cache that was added in
https://github.com/elastic/ml-cpp/pull/2305. Cache hits
occur when exactly the same inference request is sent to the
same node more than once.
The `average_inference_time_ms` and
`average_inference_time_ms_last_minute` fields now refer to
the time taken to do the cache lookup, plus, if necessary,
the time to do the inference. We would expect average inference
time to be vastly reduced in situations where the cache hit
rate is high.
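An abbreviated, illustrative view of where these fields appear in the node stats; apart from the field names given above, the response shape and values shown here are assumptions:
```
GET _ml/trained_models/my-model/_stats

{
  "trained_model_stats": [
    {
      "deployment_stats": {
        "nodes": [
          {
            "inference_cache_hit_count": 1284,
            "inference_cache_hit_count_last_minute": 9,
            "average_inference_time_ms": 4.7,
            "average_inference_time_ms_last_minute": 1.2
          }
        ]
      }
    }
  ]
}
```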
With: https://github.com/elastic/ml-cpp/pull/2305 we now support caching pytorch inference responses per node per model.
By default, the cache will be the same size as the model's size on disk. This is because our current best estimate for memory used (for deploying) is 2*model_size + constant_overhead.
This is due to the model having to be loaded in memory twice when serializing to the native process.
But, once the model is in memory and accepting requests, its actual memory usage is reduced vs. what we have "reserved" for it within the node.
Consequently, having a cache layer that takes advantage of that unused (but reserved) memory is effectively free. When used in production, especially in search scenarios, caching inference results is critical for decreasing latency.
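As a sketch, and assuming the start deployment API exposes a `cache_size` parameter to override the default described above, the cache could be sized explicitly like this (the model ID and value are placeholders):
```
POST _ml/trained_models/my-model/deployment/_start?cache_size=256mb
```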