Add mask_token field to fill_mask of _ml/trained_models.
This change will enable users and Kibana to get the particular mask tokens needed for deployed models by adding a mask_token field to the GET _ml/trained_models API, as an enhancement to support kibana#159577.
Many multi-lingual and newer models use a tokenization scheme similar to
sentence-piece. This PR adds support for one of those tokenization
schemes, XLMRoBERTa.
The main changes are: - Support for xlm_roberta tokenization
configuration - Adding `scores` to the vocabulary document stored,
requiring that scores be the same size as the vocabulary - Adding a new
flat text file to resources that is the spm char normalizer.
Adds a new include flag definition_status to the GET trained models API.
When present the trained model configuration returned in the response
will have the new boolean field fully_defined if the full model definition
is exists.
Introduced in: #88439
* [ML] add text_similarity nlp task documentation
* Apply suggestions from code review
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
* Update docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
* Apply suggestions from code review
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
* Update docs/reference/ml/ml-shared.asciidoc
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
This commit adds initial windowing support for text_classification tasks.
Specifically, a user can now indicate a span (non-negative) indicating the tokenization windowing span when creating
sub-sequences.
Default value is span: -1 indicates that no windowing should take place.
This commit adds support for MPNet based models.
MPNet models differ from BERT style models in that:
- Special tokens are different
- Input to the model doesn't require token positions.
To configure an MPNet tokenizer for your pytorch MPNet based model:
```
"tokenization": {
"mpnet": {...}
}
```
The options provided to `mpnet` are the same as the previously supported `bert` configuration.