Commit graph

140 commits

Author SHA1 Message Date
Simon Cooper
a36d90cf34
Use CLDR locale provider on JDK 23+ (#110222)
JDK 23 removes the COMPAT locale provider, leaving CLDR as the only option. This commit configures Elasticsearch
to use the CLDR provider when on JDK 23, but still use the existing COMPAT provider when on JDK 22 and below.

This causes some differences in locale behaviour; this also adapts various tests to still work whether run on COMPAT or CLDR.
2024-09-04 13:42:40 +01:00
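As an illustration of the version switch described in #110222, here is a minimal sketch. It assumes the provider is chosen through the JDK's `java.locale.providers` system property; the function name and the exact provider lists are hypothetical and are not taken from the Elasticsearch source.

```python
def locale_providers_option(jdk_feature_version: int) -> str:
    """Return a JVM option selecting the locale provider for a JDK feature version."""
    # Assumption: the commit only states "CLDR on JDK 23+, COMPAT on JDK 22 and
    # below"; the real option may list additional providers.
    providers = "CLDR" if jdk_feature_version >= 23 else "COMPAT"
    return f"-Djava.locale.providers={providers}"


for version in (21, 22, 23):
    print(version, locale_providers_option(version))
```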
Valeriy Khakhutskyy
5a7a032cea
[ML] Force time shift documentation (#111668)
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
2024-08-09 11:12:46 +02:00
István Zoltán Szabó
d6c532135e
[DOCS] Adds adaptive_allocations to inference and trained model API docs (#111476) 2024-08-01 12:37:07 +02:00
Ed Savage
c214457b39
[ML] Handle the "output memory allocator bytes" field (#109653)
Handle the "output memory allocator bytes" field if and only if it is present in the model size stats, as reported by the C++ backend.

This PR _must_ be merged prior to the corresponding `ml-cpp` one, to keep CI tests happy.
2024-06-18 15:25:05 +12:00
Liam Thompson
33a71e3289
[DOCS] Refactor book-scoped variables in docs/reference/index.asciidoc (#107413)
* Remove `es-test-dir` book-scoped variable

* Remove `plugins-examples-dir` book-scoped variable

* Remove `:dependencies-dir:` and `:xes-repo-dir:` book-scoped variables

- In `index.asciidoc`, two variables (`:dependencies-dir:` and `:xes-repo-dir:`) were removed.
- In `sql/index.asciidoc`, the `:sql-tests:` path was updated to the full path.
- In `esql/index.asciidoc`, the `:esql-tests:` path was likewise updated to the full path.

* Replace `es-repo-dir` with `es-ref-dir`

* Move `:include-xpack: true` to the few files that use it and remove it from index.asciidoc
2024-04-17 14:37:07 +02:00
István Zoltán Szabó
cfa2b2a2e2
[DOCS] Rephrases sentence in data_description param of PUT job API docs (#104792)
* [DOCS] Rephrase sentence in data_description param of PUT job API docs.

* [DOCS] Further edits.
2024-01-26 14:27:02 +01:00
István Zoltán Szabó
947128e76d
[DOCS] Fixes NOTE display error. (#98783) 2023-08-23 12:18:54 +02:00
Ed Savage
3682a88199
[ML] Update documentation regarding versioning. (#98320)
Update the ml and transform reference documentation to provide information regarding the new versioning schemes independent from the product versions.

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
2023-08-10 11:20:58 +01:00
Max Hniebergall
3a4113801c
[NLP] Support the different mask tokens used by NLP models for Fill Mask (#97453)
Add mask_token field to fill_mask of _ml/trained_models.

This change will enable users and Kibana to get the particular mask tokens needed for deployed models by adding a mask_token field to the GET _ml/trained_models API, as an enhancement to support kibana#159577.
2023-07-11 14:42:44 -04:00
István Zoltán Szabó
8d5b803bff
[DOCS] Adds API docs for bert_ja text embedding tokenizer option (#96873) 2023-06-26 11:36:08 +02:00
Benjamin Trent
14ca8fee20
[ML] add support for xlm_roberta tokenized models (#94089)
Many multi-lingual and newer models use a tokenization scheme similar to
sentence-piece. This PR adds support for one of those tokenization
schemes, XLMRoBERTa. 

The main changes are:

 - Support for xlm_roberta tokenization configuration
 - Adding `scores` to the vocabulary document stored, requiring that scores be the same size as the vocabulary
 - Adding a new flat text file to resources that is the spm char normalizer.
2023-06-13 08:40:55 -04:00
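The size requirement on `scores` in #94089 can be sketched as a simple validation. This is an illustrative check only: the function name is hypothetical, and the document layout (a `vocabulary` list next to an optional `scores` list) is an assumption based on the commit text.

```python
def validate_vocabulary(doc: dict) -> None:
    """Raise ValueError if `scores` is present but differs in size from `vocabulary`."""
    vocabulary = doc["vocabulary"]
    scores = doc.get("scores")  # assumed optional; older vocabularies carry no scores
    if scores is not None and len(scores) != len(vocabulary):
        raise ValueError(
            f"scores has {len(scores)} entries but vocabulary has {len(vocabulary)}"
        )


# A tiny, made-up vocabulary with one score per token passes the check.
validate_vocabulary({"vocabulary": ["<s>", "</s>", "hello"], "scores": [0.0, 0.0, -3.2]})
```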
debadair
777598d602
[DOCS] Remove redirect pages (#88738)
* [DOCS] Remove manual redirects

* [DOCS] Removed refs to modules-discovery-hosts-providers

* [DOCS] Fixed broken internal refs

* Fixing bad cross links in ES book, and adding redirects.asciidoc[] back into docs/reference/index.asciidoc.

* Update docs/reference/search/point-in-time-api.asciidoc

Co-authored-by: James Rodewig <james.rodewig@elastic.co>

* Update docs/reference/setup/restart-cluster.asciidoc

Co-authored-by: James Rodewig <james.rodewig@elastic.co>

* Update docs/reference/sql/endpoints/translate.asciidoc

Co-authored-by: James Rodewig <james.rodewig@elastic.co>

* Update docs/reference/snapshot-restore/restore-snapshot.asciidoc

Co-authored-by: James Rodewig <james.rodewig@elastic.co>

* Update repository-azure.asciidoc

* Update node-tool.asciidoc

* Update repository-azure.asciidoc

---------

Co-authored-by: amyjtechwriter <61687663+amyjtechwriter@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: Amy Jonsson <amy.jonsson@elastic.co>
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2023-05-24 12:32:46 +01:00
István Zoltán Szabó
b164555072
[DOCS] Adds deployment ID param documentation to trained model APIs (#96174) 2023-05-17 15:56:58 +02:00
David Kyle
7d90c519ef
[ML] Add embedding_size to text embedding config (#95176) 2023-04-17 11:49:35 +01:00
David Roberts
708730e27c
[ML] Add _meta field to data frame analytics config (#94529)
This PR adds a new field, `_meta`, to the data frame
analytics configuration.

The `_meta` field stores an arbitrary key-value map.
Keys are strings. Values are arbitrary objects
(possibly also maps).

The `_meta` field can be updated using the data frame
analytics `_update` endpoint.
2023-03-20 11:53:53 +00:00
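To make the shape of `_meta` from #94529 concrete, here is a small sketch of a value that satisfies the description (string keys, arbitrary and possibly nested values). The helper name and the example keys are hypothetical.

```python
import json


def is_valid_meta(meta: object) -> bool:
    """Check that `meta` is a map whose top-level keys are all strings."""
    return isinstance(meta, dict) and all(isinstance(key, str) for key in meta)


# Hypothetical body for the data frame analytics `_update` endpoint.
update_body = {
    "_meta": {
        "owner": "data-science-team",             # plain string value
        "review": {"approved": True, "round": 2},  # nested map value
    }
}

assert is_valid_meta(update_body["_meta"])
print(json.dumps(update_body, indent=2))
```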
Benjamin Trent
9ce59bb7a9
[ML] add text_similarity nlp task documentation (#88994)
Introduced in: #88439

* [ML] add text_similarity nlp task documentation

* Apply suggestions from code review

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Update docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Apply suggestions from code review

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Update docs/reference/ml/ml-shared.asciidoc

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
2022-08-02 12:17:14 -04:00
István Zoltán Szabó
f3e8904b2c
[DOCS] Adds settings of question_answering to inference_config of PUT and infer trained model APIs (#86895)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2022-05-19 11:04:14 +02:00
Benjamin Trent
258d2b71e2
[ML] add roberta/bart docs (#85001)
adds roberta section to NLP tokenization documentation.
2022-03-17 12:14:57 -04:00
Benjamin Trent
45deac4c96
[ML] add windowing support for text_classification (#83989)
This commit adds initial windowing support for text_classification tasks.

Specifically, a user can now set a non-negative `span` value that controls the tokenization windowing span used when creating sub-sequences.

The default value, `span: -1`, indicates that no windowing should take place.
2022-03-01 08:29:12 -05:00
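One way to picture the windowing from #83989 is the sketch below. It assumes that `span` is the number of tokens shared by consecutive windows and that `-1` disables windowing; the function name and the `max_len` parameter are hypothetical, and this is not the Elasticsearch implementation.

```python
def window_tokens(tokens: list, max_len: int, span: int = -1) -> list:
    """Split `tokens` into windows of at most `max_len`, overlapping by `span` tokens."""
    if span == -1:
        return [tokens]  # no windowing: the sequence is passed through unchanged
    if span < 0 or span >= max_len:
        raise ValueError("span must be -1 or in the range [0, max_len)")
    stride = max_len - span  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window reached the end of the input
        start += stride
    return windows


print(window_tokens(list(range(10)), max_len=4, span=2))
```

With ten tokens, `max_len=4` and `span=2`, each window starts two tokens after the previous one, so neighbouring windows share two tokens.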
Tobias Stadler
e3deacf547
[DOCS] Fix typos (#83895) 2022-02-15 12:42:17 -05:00
Lisa Cawley
91cd38df57
[DOCS] Fix links to anomaly detection docs (#82836) 2022-01-19 17:54:18 -08:00
Lisa Cawley
c98833f9c6
[DOCS] Fix links to anomaly detection docs (#82774) 2022-01-18 17:42:16 -08:00
Ed Savage
e8a46649c5
[ML] Warn when creating job with an unusual bucket span (#82145)
Emit deprecation warning when creating new jobs with bucket spans that
aren't an integral divisor or multiple of a day.

Relates #81645

Co-authored-by: lcawl <lcawley@elastic.co>
2022-01-10 17:04:18 +00:00
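The condition in #82145 ("an integral divisor or multiple of a day") can be expressed as a short check. This is a sketch working in whole seconds; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def is_unusual_bucket_span(bucket_span_seconds: int) -> bool:
    """Return True if the span neither divides a day evenly nor is a multiple of a day."""
    if bucket_span_seconds <= 0:
        raise ValueError("bucket span must be positive")
    divides_day = SECONDS_PER_DAY % bucket_span_seconds == 0
    multiple_of_day = bucket_span_seconds % SECONDS_PER_DAY == 0
    return not (divides_day or multiple_of_day)


for seconds in (900, 7 * 3600, 2 * SECONDS_PER_DAY):
    print(seconds, is_unusual_bucket_span(seconds))
```

A 15-minute span divides a day evenly and a two-day span is a multiple of a day, so neither would trigger the warning; a 7-hour span would.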
Benjamin Trent
9dc8aea1cb
[ML] adds new mpnet tokenization for nlp models (#82234)
This commit adds support for MPNet based models.

MPNet models differ from BERT style models in that:

 - Special tokens are different
 - Input to the model doesn't require token positions.

To configure an MPNet tokenizer for your pytorch MPNet based model:

```
"tokenization": {
  "mpnet": {...}
}
```
The options provided to `mpnet` are the same as the previously supported `bert` configuration.
2022-01-05 12:56:47 -05:00
Ed Savage
a646f55c57
[ML] Set default value of 30 days for model prune window (#81377)
For new jobs, when the analysis config field model_prune_window is not set, use a default value of 30 days or 20 times the bucket span, whichever is greater.

Co-authored-by: David Roberts <dave.roberts@elastic.co>
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2021-12-20 11:27:30 +00:00
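The default rule from #81377 is a simple maximum. A minimal sketch, with a hypothetical function name:

```python
from datetime import timedelta


def default_model_prune_window(bucket_span: timedelta) -> timedelta:
    """Default used when model_prune_window is unset: 30 days or 20 bucket spans, whichever is greater."""
    return max(timedelta(days=30), 20 * bucket_span)


print(default_model_prune_window(timedelta(minutes=15)))  # 20 spans is 5 hours, so 30 days wins
print(default_model_prune_window(timedelta(days=2)))      # 20 spans is 40 days, which exceeds 30
```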
Lisa Cawley
1751ced80a
[DOCS] Fix formatting in get anomaly job API (#81682) 2021-12-13 12:56:27 -08:00
Lisa Cawley
d1af86cfdd
[DOCS] Fixes start and stop trained model deployment APIs (#80978) 2021-11-24 10:09:45 -08:00
Lisa Cawley
f3a69ae4b1
[DOCS] Adds missing query parameters to ML APIs (#80863) 2021-11-22 09:25:01 -08:00
Lisa Cawley
fffac5bd08
[DOCS] Adds missing query parameters in get influencer and get snapshot APIs (#80801) 2021-11-18 08:24:24 -08:00
David Roberts
a61088063e
[ML] use_auto_machine_memory_percent now defaults max_model_memory_limit (#80532)
If the xpack.ml.use_auto_machine_memory_percent setting is true,
and xpack.ml.max_model_memory_limit is not set then
xpack.ml.max_model_memory_limit is now considered to be set to
the largest size that could be assigned in the cluster.

This functionality will be crucial for Cloud once the Elasticsearch
startup code is setting the Elasticsearch JVM heap size. Then the
Cloud code will no longer be able to accurately set
xpack.ml.max_model_memory_limit, so will not set it at all.
Instead the Cloud code will just set
xpack.ml.use_auto_machine_memory_percent and the ML code will
calculate the appropriate maximum model_memory_limit that should
be permitted.
2021-11-10 08:38:02 +00:00
David Kyle
0635f2758f
[ML] Consistently apply the default truncation option for the BERT tokenizer (#80339)
The default is Truncate.First
2021-11-05 09:10:59 +00:00
Benjamin Trent
375fc779b4
[ML] update truncation default & adding field output when input is truncated (#79942)
This commit makes the following two changes (along with some refactoring):

 - NLP results will now indicate whether the input was truncated
 - The default truncation is now `none` instead of `first`
2021-10-28 10:40:49 -04:00
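The two behaviours named in #79942 can be pictured with the sketch below. It assumes a single input sequence, that `none` rejects over-long input, and that `first` cuts the sequence down to the limit. The function name, the `max_len` parameter and the returned flag are hypothetical. The mode is a required argument here because the neighbouring commits describe different defaults.

```python
def apply_truncation(tokens: list, max_len: int, truncate: str) -> tuple:
    """Return (tokens, was_truncated) for the given truncation mode."""
    if len(tokens) <= max_len:
        return tokens, False  # input fits, nothing to do
    if truncate == "none":
        # Assumption: with no truncation allowed, over-long input is an error.
        raise ValueError(f"input has {len(tokens)} tokens, limit is {max_len}")
    if truncate == "first":
        return tokens[:max_len], True
    raise ValueError(f"unknown truncate mode: {truncate}")


print(apply_truncation(["a", "b", "c", "d"], max_len=3, truncate="first"))
```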
Benjamin Trent
d2b638356b
[ML] Update trained model docs for truncate parameter for bert tokenization (#79652) 2021-10-28 07:19:10 -04:00
István Zoltán Szabó
c879db98b1
[DOCS] Updates get trained models API docs (#79372)
* [DOCS] Updates get trained models API docs.

* [DOCS] Reviews get trained models related definitions in ml-shared.
2021-10-25 11:47:45 +02:00
István Zoltán Szabó
94ab204a1e
[DOCS] Fixes indentation issue in GET trained models API docs. (#79347) 2021-10-18 12:27:24 +02:00
Benjamin Trent
408489310c
[ML] add zero_shot_classification task for BERT nlp models (#77799)
Zero-Shot classification allows for text classification tasks without a pre-trained collection of target labels.

This is achieved through models trained on the Multi-Genre Natural Language Inference (MNLI) dataset. This dataset pairs text sequences with "entailment" clauses. An example could be:

"Throughout all of history, mankind has shown itself resourceful, yet astoundingly short-sighted" could have been paired with the entailment clauses: ["This example is history", "This example is sociology"...].

This training set combined with the attention and semantic knowledge in modern day NLP models (BERT, BART, etc.) affords a powerful tool for ad-hoc text classification.

See https://arxiv.org/abs/1909.00161 for a deeper explanation of the MNLI training and how zero-shot works.

The zero-shot classification task is configured as follows:
```js
{
   // <snip> model configuration </snip>
  "inference_config" : {
    "zero_shot_classification": {
      "classification_labels": ["entailment", "neutral", "contradiction"], // <1>
      "labels": ["sad", "glad", "mad", "rad"], // <2>
      "multi_label": false, // <3>
      "hypothesis_template": "This example is {}.", // <4>
      "tokenization": { /*<snip> tokenization configuration </snip>*/}
    }
  }
}
```
* <1> All zero_shot models return three particular labels when classifying the target sequence. "entailment" is the positive case, "neutral" is the case where the sequence is neither positive nor negative, and "contradiction" is the negative case.
* <2> An optional parameter giving the default zero_shot labels to attempt to classify.
* <3> When returning the probabilities, whether the results should assume there is only one true label or multiple true labels.
* <4> The hypothesis template used when tokenizing the labels. When combined with `sad`, the sequence looks like `This example is sad.`

For inference in a pipeline one may provide label updates:
```js
{
  //<snip> pipeline definition </snip>
  "processors": [
    //<snip> other processors </snip>
    {
      "inference": {
        // <snip> general configuration </snip>
        "inference_config": {
          "zero_shot_classification": {
             "labels": ["humanities", "science", "mathematics", "technology"], // <1>
             "multi_label": true // <2>
          }
        }
      }
    }
    //<snip> other processors </snip>
  ]
}
```
* <1> The `labels` we care about. These replace the default ones if they exist.
* <2> Whether the results should allow multiple true labels.

Similarly, one may provide label changes against the `_infer` endpoint:
```js
{
   "docs":[{ "text_field": "This is a very happy person"}],
   "inference_config":{"zero_shot_classification":{"labels": ["glad", "sad", "bad", "rad"], "multi_label": false}}
}
```
2021-09-28 09:38:23 -04:00
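To illustrate how `labels` and `hypothesis_template` from #77799 combine, here is a minimal sketch. Filling the `{}` placeholder follows the commit text. The softmax over per-label entailment scores for `multi_label: false` is a common approach in zero-shot pipelines and is an assumption here, not a confirmed description of the Elasticsearch code. Function names and the example scores are hypothetical.

```python
import math


def build_hypotheses(labels: list, hypothesis_template: str = "This example is {}.") -> list:
    """Fill the `{}` placeholder of the template once per label."""
    return [hypothesis_template.replace("{}", label) for label in labels]


def single_label_probabilities(entailment_scores: list) -> list:
    """Softmax across labels, so exactly one label is assumed to be true."""
    peak = max(entailment_scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in entailment_scores]
    total = sum(exps)
    return [value / total for value in exps]


labels = ["sad", "glad", "mad", "rad"]
print(build_hypotheses(labels))
# Made-up entailment scores, one per label, purely for illustration.
print(single_label_probabilities([0.2, 2.5, -1.0, 0.4]))
```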
Benjamin Trent
00defa38a9
[ML] adding some initial document for our pytorch NLP model support (#78270)
Adding docs for:

 - put vocab
 - put model definition part
 - start deployment
 - all the new NLP configuration objects for trained model configurations
2021-09-27 12:46:13 -04:00
István Zoltán Szabó
7faec52a1e
[DOCS] Fixes model_prune_window property description. (#76711) 2021-08-19 16:16:37 +02:00
István Zoltán Szabó
b9d875bf68
[DOCS] Updates description of model_prune_window property in ML shared (#76487) 2021-08-13 12:18:38 +02:00
David Roberts
7ac5ea39df
[ML] Use results retention time for deleting system annotations (#76096)
In #75617 a new setting, system_annotations_retention_days, was
added to control how long system annotations are retained for.
We now feel that this setting is redundant and that system
annotations should be retained for the same period as results.
This is intuitive and defensible, as system annotations can be
considered a type of result.

Followup to #75617
2021-08-04 17:42:31 +01:00
Ed Savage
5651215be1
[ML] Add 'model_prune_window' field to AD job config (#75741)
Add configuration for pruning dead split fields in anomaly detection
jobs via the `model_prune_window` field for both the job creation and
update APIs.

Relates to ml-cpp/#1962
2021-08-03 09:16:43 +01:00
Przemysław Witek
30d9f13436
[ML] Delete expired annotations (#75617) 2021-07-29 15:27:03 +02:00
Lisa Cawley
70b870ee7f
[DOCS] Fixes nesting of datafeed config in APIs (#75502) 2021-07-20 11:27:15 -07:00
Lisa Cawley
3c76bcb3a5
[DOCS] Fixes links to machine learning concepts (#75194) 2021-07-09 13:09:03 -07:00
Lisa Cawley
a6339918ac
[DOCS] Adds defaults to get ML results APIs (#73540)
Co-authored-by: David Roberts <dave.roberts@elastic.co>
2021-06-03 10:05:47 -07:00
David Roberts
0059c59e25
[ML] Make ml_standard tokenizer the default for new categorization jobs (#72805)
Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.

The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, so creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.

It is still possible to configure the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.

To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.

If no categorization_analyzer is specified but categorization_filters
are specified then the categorization filters are converted to char
filters that are applied after first_non_blank_line.

Closes elastic/ml-cpp#1724
2021-06-01 15:11:32 +01:00
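Two points from #72805 can be sketched briefly: what a "first non-blank line" filter does to a multi-line message, and what an explicit opt-back to `ml_classic` might look like. The Python function only imitates the described behaviour. In the config fragment, `bucket_span` and `categorization_field_name` carry hypothetical values.

```python
def first_non_blank_line(message: str) -> str:
    """Return the first line containing non-whitespace characters, or '' if none exists."""
    for line in message.splitlines():
        if line.strip():
            return line
    return ""


# Explicit analyzer: ml_classic is used and, because no char filter is listed,
# first_non_blank_line is not applied either.
analysis_config = {
    "bucket_span": "15m",                    # hypothetical value
    "categorization_field_name": "message",  # hypothetical value
    "categorization_analyzer": {"tokenizer": "ml_classic"},
}

print(first_non_blank_line("\n   \nConnection refused: /var/run/app.sock\n  at Foo.bar()"))
```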
István Zoltán Szabó
1ce2308e2a
[DOCS] Adds max_trees hyperparameter to GET TM API docs (#72298) 2021-05-06 08:18:19 +02:00
Pierre Grimaud
3c44dfec60
[DOCS] Fix typos (#72227) 2021-04-26 12:40:38 -04:00
István Zoltán Szabó
ce389dff5d
[DOCS] Clarifies that custom rules are job rules in Kibana (#71678)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2021-04-15 09:33:03 +02:00
James Rodewig
693807a6d3
[DOCS] Fix double spaces (#71082) 2021-03-31 09:57:47 -04:00