Add mask_token field to fill_mask of _ml/trained_models.
This change adds a mask_token field to the GET _ml/trained_models API so that users and Kibana can retrieve the particular mask token a deployed model requires, as an enhancement to support kibana#159577.
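As a hedged sketch (the model ID and token value are illustrative), the new field appears inside the fill_mask inference configuration of the response:
```js
// GET _ml/trained_models/my-fill-mask-model -- response excerpt (hypothetical model ID)
{
  "trained_model_configs": [
    {
      "model_id": "my-fill-mask-model",
      "inference_config": {
        "fill_mask": {
          "mask_token": "[MASK]", // the token this model expects in input text
          "tokenization": { /* <snip> */ }
        }
      }
    }
  ]
}
```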
Many multi-lingual and newer models use a tokenization scheme similar to
SentencePiece. This PR adds support for one of those tokenization
schemes, XLMRoBERTa.
The main changes are:
- Support for the xlm_roberta tokenization configuration (see the sketch after this list)
- Adding `scores` to the stored vocabulary document, requiring that the scores be the same size as the vocabulary
- Adding a new flat text file to resources that is the spm char normalizer.
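As a hedged sketch of where the new tokenization type would appear (the task type and option values are illustrative, not prescriptive):
```js
"inference_config": {
  "text_embedding": {
    "tokenization": {
      "xlm_roberta": {
        "max_sequence_length": 512, // illustrative; options mirror the other tokenization types
        "truncate": "first"
      }
    }
  }
}
```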
This PR adds a new field, `_meta`, to the data frame
analytics configuration.
The `_meta` field stores an arbitrary key-value map.
Keys are strings. Values are arbitrary objects
(possibly also maps).
The `_meta` field can be updated using the data frame
analytics `_update` endpoint.
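As a hedged illustration (the analytics ID and map keys below are hypothetical), `_meta` is a top-level field of the configuration and can be changed through the `_update` endpoint:
```js
// PUT _ml/data_frame/analytics/my-analytics -- body excerpt (hypothetical ID)
{
  // <snip> source, dest, analysis </snip>
  "_meta": {
    "created_by": "data-team",
    "review": { "approved": true } // values may themselves be maps
  }
}

// POST _ml/data_frame/analytics/my-analytics/_update
{
  "_meta": {
    "created_by": "data-team",
    "review": { "approved": false }
  }
}
```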
Introduced in: #88439
* [ML] add text_similarity nlp task documentation
* Apply suggestions from code review
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
* Update docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
* Apply suggestions from code review
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
* Update docs/reference/ml/ml-shared.asciidoc
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
This commit adds initial windowing support for text_classification tasks.
Specifically, a user can now provide a non-negative `span` indicating the tokenization windowing span to use when creating sub-sequences.
The default value of `span` is -1, which indicates that no windowing should take place.
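A hedged sketch of the option (task type, tokenizer, and values are illustrative):
```js
"inference_config": {
  "text_classification": {
    "tokenization": {
      "bert": {
        "max_sequence_length": 128,
        "span": 64 // non-negative enables windowing; -1 (the default) disables it
      }
    }
  }
}
```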
Emit a deprecation warning when creating new jobs with bucket spans that
aren't an integral divisor or multiple of a day.
Relates #81645
Co-authored-by: lcawl <lcawley@elastic.co>
This commit adds support for MPNet based models.
MPNet models differ from BERT-style models in that:
- Special tokens are different
- Input to the model doesn't require token positions.
To configure an MPNet tokenizer for your PyTorch MPNet-based model:
```js
"tokenization": {
  "mpnet": {...}
}
```
The options provided to `mpnet` are the same as the previously supported `bert` configuration.
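As a hedged sketch (the option values shown are illustrative and follow the `bert`-style options mentioned above):
```js
"tokenization": {
  "mpnet": {
    "do_lower_case": false,
    "with_special_tokens": true,
    "max_sequence_length": 512,
    "truncate": "first"
  }
}
```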
For new jobs, when the analysis config field model_prune_window is not set, use a default value of 30 days or 20 times the bucket span, whichever is greater. For example, a job with a 1 hour bucket span gets the 30 day default, while a job with a 2 day bucket span gets 40 days.
Co-authored-by: David Roberts <dave.roberts@elastic.co>
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
If the xpack.ml.use_auto_machine_memory_percent setting is true,
and xpack.ml.max_model_memory_limit is not set then
xpack.ml.max_model_memory_limit is now considered to be set to
the largest size that could be assigned in the cluster.
This functionality will be crucial for Cloud once the Elasticsearch
startup code is setting the Elasticsearch JVM heap size. Then the
Cloud code will no longer be able to accurately set
xpack.ml.max_model_memory_limit, so will not set it at all.
Instead the Cloud code will just set
xpack.ml.use_auto_machine_memory_percent and the ML code will
calculate the appropriate maximum model_memory_limit that should
be permitted.
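For example, a hedged sketch of the cluster settings body the Cloud code might then send (note that xpack.ml.max_model_memory_limit is deliberately left unset):
```js
// PUT _cluster/settings -- body
{
  "persistent": {
    "xpack.ml.use_auto_machine_memory_percent": true
    // xpack.ml.max_model_memory_limit is not set; ML derives the effective maximum
  }
}
```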
This commit makes the following two changes (along with some refactoring):
- NLP results will now indicate whether the input was truncated or not (see the sketch below)
- The default truncation is now `none` instead of `first`
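A hedged sketch of opting back into the previous behaviour, and of the truncation indicator in the results (the result field name here is an assumption for illustration):
```js
// Request excerpt: explicitly restore the old default.
"tokenization": {
  "bert": {
    "truncate": "first" // "none" is now the default
  }
}

// Result excerpt: responses now indicate whether the input was truncated.
{
  "predicted_value": "...",
  "is_truncated": true // field name assumed for illustration
}
```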
Zero-Shot classification allows for text classification tasks without a pre-trained collection of target labels.
This is achieved through models trained on the Multi-Genre Natural Language Inference (MNLI) dataset. This dataset pairs text sequences with "entailment" clauses. An example could be:
"Throughout all of history, man kind has shown itself resourceful, yet astoundingly short-sighted" could have been paired with the entailment clauses: ["This example is history", "This example is sociology"...].
This training set combined with the attention and semantic knowledge in modern day NLP models (BERT, BART, etc.) affords a powerful tool for ad-hoc text classification.
See https://arxiv.org/abs/1909.00161 for a deeper explanation of the MNLI training and how zero-shot works.
The zero-shot classification task is configured as follows:
```js
{
  // <snip> model configuration </snip>
  "inference_config" : {
    "zero_shot_classification": {
      "classification_labels": ["entailment", "neutral", "contradiction"], // <1>
      "labels": ["sad", "glad", "mad", "rad"], // <2>
      "multi_label": false, // <3>
      "hypothesis_template": "This example is {}.", // <4>
      "tokenization": { /*<snip> tokenization configuration </snip>*/ }
    }
  }
}
```
* <1> All zero_shot models return these three particular labels when classifying the target sequence: "entailment" is the positive case, "neutral" is the case where the sequence is neither positive nor negative, and "contradiction" is the negative case
* <2> An optional parameter giving the default labels the model should attempt to classify
* <3> When returning the probabilities, whether the results should assume there is only one true label or allow multiple true labels
* <4> The hypothesis template used when tokenizing the labels. When combined with `sad`, the sequence looks like `This example is sad.`
For inference in a pipeline one may provide label updates:
```js
{
  // <snip> pipeline definition </snip>
  "processors": [
    // <snip> other processors </snip>
    {
      "inference": {
        // <snip> general configuration </snip>
        "inference_config": {
          "zero_shot_classification": {
            "labels": ["humanities", "science", "mathematics", "technology"], // <1>
            "multi_label": true // <2>
          }
        }
      }
    }
    // <snip> other processors </snip>
  ]
}
```
* <1> The `labels` we care about; these replace the default ones if present.
* <2> Whether the results should allow multiple true labels
Similarly, one may provide label changes against the `_infer` endpoint:
```js
{
  "docs": [{ "text_field": "This is a very happy person" }],
  "inference_config": {
    "zero_shot_classification": {
      "labels": ["glad", "sad", "bad", "rad"],
      "multi_label": false
    }
  }
}
```
In #75617 a new setting, system_annotations_retention_days, was
added to control how long system annotations are retained for.
We now feel that this setting is redundant and that system
annotations should be retained for the same period as results.
This is intuitive and defensible, as system annotations can be
considered a type of result.
Followup to #75617
Add configuration for pruning dead split fields in anomaly detection
jobs via the `model_prune_window` field for both the job creation and
update APIs.
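As a hedged sketch (the job ID and values are hypothetical), the field sits inside the analysis config:
```js
// PUT _ml/anomaly_detectors/my-job -- body excerpt (hypothetical job ID)
{
  "analysis_config": {
    "bucket_span": "1h",
    "model_prune_window": "30d" // prune modelling state for dead split fields older than this
    // <snip> detectors, influencers </snip>
  }
  // <snip> data_description and other job settings </snip>
}
```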
Relates to ml-cpp/#1962
Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.
The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, so creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.
It is still possible to configure the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.
To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.
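For example, a hedged sketch of an explicit categorization_analyzer that keeps the new char filter but opts back into the old tokenizer (omit the char_filter entry to opt out of first_non_blank_line):
```js
"analysis_config": {
  // <snip> detectors, categorization_field_name </snip>
  "categorization_analyzer": {
    "char_filter": ["first_non_blank_line"], // remove this entry to opt out
    "tokenizer": "ml_classic"
  }
}
```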
If no categorization_analyzer is specified but categorization_filters
are specified, then the categorization filters are converted to char
filters that are applied after first_non_blank_line.
Closes elastic/ml-cpp#1724
This commit allows documents seen within the same time bucket to be out of order.
This is already supported within the native process.
Additionally, when recording the "latest" record timestamp, we were assuming that the latest seen document was truly the "latest". This is not really the case if latency is used or if documents arrive out of order within the same bucket.
A `model_alias` allows trained models to be referred to by a user-defined moniker.
This not only improves the readability and simplicity of numerous API calls, but it allows for simpler deployment and upgrade procedures for trained models.
Previously, if you referenced a model ID directly within an ingest pipeline and a new model performed better than the earlier referenced one, you had to update the pipeline itself. If that model was used in numerous pipelines, ALL of those pipelines had to be updated.
When using a `model_alias` in an ingest pipeline, only that `model_alias` needs to be updated. Then, the underlying referenced model will change in place for all ingest pipelines automatically.
An additional benefit is that the referenced model is not changed until the new model is fully loaded into cache, so throughput is not hampered by changing models.
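As a hedged sketch (model IDs and the alias are hypothetical), switching an alias to a newer model is a single call, and any pipeline that references the alias picks up the new model once it is loaded:
```js
// PUT _ml/trained_models/flight-delay-model-v2/model_aliases/flight-delay-model?reassign=true
// (the alias previously pointed at flight-delay-model-v1)

// An ingest pipeline references the alias rather than a concrete model ID:
{
  "processors": [
    {
      "inference": {
        "model_id": "flight-delay-model", // the model_alias; no pipeline update needed on upgrade
        "inference_config": { /* <snip> */ }
      }
    }
  ]
}
```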