Commit graph

125 commits

Benjamin Trent
9ce59bb7a9
[ML] add text_similarity nlp task documentation (#88994)
Introduced in: #88439

* [ML] add text_similarity nlp task documentation

* Apply suggestions from code review

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Update docs/reference/ml/trained-models/apis/infer-trained-model.asciidoc

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Apply suggestions from code review

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Update docs/reference/ml/ml-shared.asciidoc

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
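A minimal `_infer` sketch for the new task (the model ID and texts here are hypothetical; the shape assumes `text_similarity` takes the comparison string in a `text` option, as the new docs describe):

```js
POST _ml/trained_models/my-text-similarity-model/_infer
{
  "docs": [{ "text_field": "Berlin has a population of roughly 3.5 million people" }],
  "inference_config": {
    "text_similarity": {
      "text": "How many people live in Berlin?"  // each input document is scored against this text
    }
  }
}
```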
2022-08-02 12:17:14 -04:00
István Zoltán Szabó
f3e8904b2c
[DOCS] Adds settings of question_answering to inference_config of PUT and infer trained model APIs (#86895)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
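A hedged request sketch of those settings (model ID, passage, and the optional `max_answer_length` value are illustrative):

```js
POST _ml/trained_models/my-qa-model/_infer
{
  "docs": [{ "text_field": "The Amazon rainforest covers most of the Amazon basin of South America." }],
  "inference_config": {
    "question_answering": {
      "question": "Where is the Amazon rainforest?",
      "max_answer_length": 15  // optional setting; value assumed for illustration
    }
  }
}
```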
2022-05-19 11:04:14 +02:00
Benjamin Trent
258d2b71e2
[ML] add roberta/bart docs (#85001)
adds roberta section to NLP tokenization documentation.
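Mirroring the existing `bert` and `mpnet` example style, a RoBERTa tokenization sketch (the `add_prefix_space` option is an assumption for illustration):

```js
"tokenization": {
  "roberta": {
    "add_prefix_space": false  // assumed RoBERTa-specific option; remaining options mirror `bert`
  }
}
```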
2022-03-17 12:14:57 -04:00
Benjamin Trent
45deac4c96
[ML] add windowing support for text_classification (#83989)
This commit adds initial windowing support for text_classification tasks.

Specifically, a user can now set a non-negative `span` value that controls the tokenization window used when splitting the input into sub-sequences.

The default value of `span: -1` indicates that no windowing should take place.
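A configuration sketch (placement assumed from the tokenization options; windowing is paired here with `truncate: none` so the whole input is windowed rather than cut off):

```js
"inference_config": {
  "text_classification": {
    "tokenization": {
      "bert": {
        "truncate": "none",  // disable truncation so sub-sequences are created instead
        "span": 16           // tokens shared between consecutive sub-sequences
      }
    }
  }
}
```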
2022-03-01 08:29:12 -05:00
Tobias Stadler
e3deacf547
[DOCS] Fix typos (#83895) 2022-02-15 12:42:17 -05:00
Lisa Cawley
91cd38df57
[DOCS] Fix links to anomaly detection docs (#82836) 2022-01-19 17:54:18 -08:00
Lisa Cawley
c98833f9c6
[DOCS] Fix links to anomaly detection docs (#82774) 2022-01-18 17:42:16 -08:00
Ed Savage
e8a46649c5
[ML] Warn when creating job with an unusual bucket span (#82145)
Emit a deprecation warning when creating new jobs with bucket spans that
are not an integral divisor or multiple of a day.

Relates #81645

Co-authored-by: lcawl <lcawley@elastic.co>
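For illustration (job ID and detector are hypothetical): 1440 / 7 is not integral, so a `7m` bucket span would draw the warning, while `15m` (96 buckets per day) would not.

```js
PUT _ml/anomaly_detectors/unusual-bucket-span-job
{
  "analysis_config": {
    "bucket_span": "7m",  // does not divide a day evenly -> deprecation warning
    "detectors": [{ "function": "count" }]
  },
  "data_description": { "time_field": "timestamp" }
}
```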
2022-01-10 17:04:18 +00:00
Benjamin Trent
9dc8aea1cb
[ML] adds new mpnet tokenization for nlp models (#82234)
This commit adds support for MPNet based models.

MPNet models differ from BERT-style models in that:

 - Special tokens are different
 - Input to the model doesn't require token positions.

To configure an MPNet tokenizer for your PyTorch MPNet-based model:

```
"tokenization": {
  "mpnet": {...}
}
```
The options provided to `mpnet` are the same as the previously supported `bert` configuration.
2022-01-05 12:56:47 -05:00
Ed Savage
a646f55c57
[ML] Set default value of 30 days for model prune window (#81377)
For new jobs, when the analysis config field model_prune_window is not set, use a default value of 30 days or 20 times the bucket span, whichever is greater.

Co-authored-by: David Roberts <dave.roberts@elastic.co>
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
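Worked out, the max(30 days, 20 × bucket_span) rule looks like this (hypothetical bucket spans):

```js
// bucket_span: "1h" -> 20 * 1h = 20h, less than 30d  -> model_prune_window defaults to 30d
// bucket_span: "2d" -> 20 * 2d = 40d, greater than 30d -> model_prune_window defaults to 40d
```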
2021-12-20 11:27:30 +00:00
Lisa Cawley
1751ced80a
[DOCS] Fix formatting in get anomaly job API (#81682) 2021-12-13 12:56:27 -08:00
Lisa Cawley
d1af86cfdd
[DOCS] Fixes start and stop trained model deployment APIs (#80978) 2021-11-24 10:09:45 -08:00
Lisa Cawley
f3a69ae4b1
[DOCS] Adds missing query parameters to ML APIs (#80863) 2021-11-22 09:25:01 -08:00
Lisa Cawley
fffac5bd08
[DOCS] Adds missing query parameters in get influencer and get snapshot APIs (#80801) 2021-11-18 08:24:24 -08:00
David Roberts
a61088063e
[ML] use_auto_machine_memory_percent now defaults max_model_memory_limit (#80532)
If the xpack.ml.use_auto_machine_memory_percent setting is true,
and xpack.ml.max_model_memory_limit is not set, then
xpack.ml.max_model_memory_limit is now considered to be set to
the largest size that could be assigned in the cluster.

This functionality will be crucial for Cloud once the Elasticsearch
startup code is setting the Elasticsearch JVM heap size. Then the
Cloud code will no longer be able to accurately set
xpack.ml.max_model_memory_limit, so will not set it at all.
Instead the Cloud code will just set
xpack.ml.use_auto_machine_memory_percent and the ML code will
calculate the appropriate maximum model_memory_limit that should
be permitted.
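A sketch of the Cloud-style configuration described above, via the cluster settings API (only the setting named in this commit is shown):

```js
PUT _cluster/settings
{
  "persistent": {
    "xpack.ml.use_auto_machine_memory_percent": true
    // xpack.ml.max_model_memory_limit is deliberately left unset; ML now derives
    // the effective maximum from the largest size assignable in the cluster
  }
}
```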
2021-11-10 08:38:02 +00:00
David Kyle
0635f2758f
[ML] Consistently apply the default truncation option for the BERT tokenizer (#80339)
The default is Truncate.First
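In configuration terms, leaving the option unset is now equivalent to this sketch:

```js
"tokenization": {
  "bert": {
    "truncate": "first"  // the default this change applies consistently
  }
}
```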
2021-11-05 09:10:59 +00:00
Benjamin Trent
375fc779b4
[ML] update truncation default & adding field output when input is truncated (#79942)
This commit makes the following two changes (along with some refactoring):

 - NLP results will now indicate whether the input was truncated or not
 - The default truncation is now `none` instead of `first`
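A sketch of both changes (assuming the result field is named `is_truncated`; values are illustrative):

```js
// request: truncation now defaults to "none", so it must be enabled explicitly
"tokenization": { "bert": { "truncate": "first" } }

// response: results indicate whether the input was truncated
{ "predicted_value": "positive", "is_truncated": true }
```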
2021-10-28 10:40:49 -04:00
Benjamin Trent
d2b638356b
[ML] Update trained model docs for truncate parameter for bert tokenization (#79652) 2021-10-28 07:19:10 -04:00
István Zoltán Szabó
c879db98b1
[DOCS] Updates get trained models API docs (#79372)
* [DOCS] Updates get trained models API docs.

* [DOCS] Reviews get trained models related definitions in ml-shared.
2021-10-25 11:47:45 +02:00
István Zoltán Szabó
94ab204a1e
[DOCS] Fixes indentation issue in GET trained models API docs. (#79347) 2021-10-18 12:27:24 +02:00
Benjamin Trent
408489310c
[ML] add zero_shot_classification task for BERT nlp models (#77799)
Zero-Shot classification allows for text classification tasks without a pre-trained collection of target labels.

This is achieved through models trained on the Multi-Genre Natural Language Inference (MNLI) dataset. This dataset pairs text sequences with "entailment" clauses. An example could be:

"Throughout all of history, mankind has shown itself resourceful, yet astoundingly short-sighted" could have been paired with the entailment clauses: ["This example is history", "This example is sociology"...].

This training set, combined with the attention and semantic knowledge in modern-day NLP models (BERT, BART, etc.), affords a powerful tool for ad-hoc text classification.

See https://arxiv.org/abs/1909.00161 for a deeper explanation of the MNLI training and how zero-shot works. 

The zero-shot classification task is configured as follows:
```js
{
   // <snip> model configuration </snip>
  "inference_config" : {
    "zero_shot_classification": {
      "classification_labels": ["entailment", "neutral", "contradiction"], // <1>
      "labels": ["sad", "glad", "mad", "rad"], // <2>
      "multi_label": false, // <3>
      "hypothesis_template": "This example is {}.", // <4>
      "tokenization": { /*<snip> tokenization configuration </snip>*/}
    }
  }
}
```
* <1> All zero_shot models work with these 3 particular labels when classifying the target sequence: "entailment" is the positive case, "neutral" the case where the sequence is neither positive nor negative, and "contradiction" the negative case
* <2> An optional parameter giving the default labels that zero-shot classification attempts to assign
* <3> Whether the returned probabilities should assume there is only one true label or allow multiple true labels
* <4> The hypothesis template used when tokenizing the labels. When combined with `sad`, the sequence looks like `This example is sad.`

For inference in a pipeline, one may provide label updates:
```js
{
  //<snip> pipeline definition </snip>
  "processors": [
    //<snip> other processors </snip>
    {
      "inference": {
        // <snip> general configuration </snip>
        "inference_config": {
          "zero_shot_classification": {
             "labels": ["humanities", "science", "mathematics", "technology"], // <1>
             "multi_label": true // <2>
          }
        }
      }
    }
    //<snip> other processors </snip>
  ]
}
```
* <1> The `labels` we care about; these replace the default ones if they exist.
* <2> Whether the results should allow multiple true labels

Similarly, one may provide label changes against the `_infer` endpoint:
```js
{
   "docs":[{ "text_field": "This is a very happy person"}],
   "inference_config":{"zero_shot_classification":{"labels": ["glad", "sad", "bad", "rad"], "multi_label": false}}
}
```
2021-09-28 09:38:23 -04:00
Benjamin Trent
00defa38a9
[ML] adding some initial document for our pytorch NLP model support (#78270)
Adding docs for:

- put vocab
- put model definition part
- start deployment
- all the new NLP configuration objects for trained model configurations
2021-09-27 12:46:13 -04:00
István Zoltán Szabó
7faec52a1e
[DOCS] Fixes model_prune_window property description. (#76711) 2021-08-19 16:16:37 +02:00
István Zoltán Szabó
b9d875bf68
[DOCS] Updates description of model_prune_window property in ML shared (#76487) 2021-08-13 12:18:38 +02:00
David Roberts
7ac5ea39df
[ML] Use results retention time for deleting system annotations (#76096)
In #75617 a new setting, system_annotations_retention_days, was
added to control how long system annotations are retained for.
We now feel that this setting is redundant and that system
annotations should be retained for the same period as results.
This is intuitive and defensible, as system annotations can be
considered a type of result.

Followup to #75617
2021-08-04 17:42:31 +01:00
Ed Savage
5651215be1
[ML] Add 'model_prune_window' field to AD job config (#75741)
Add configuration for pruning dead split fields in anomaly detection
jobs via the `model_prune_window` field for both the job creation and
update APIs.

Relates to ml-cpp/#1962
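A job-creation sketch showing where the new field sits (job ID, detector, and values are illustrative):

```js
PUT _ml/anomaly_detectors/prune-example-job
{
  "analysis_config": {
    "bucket_span": "15m",
    "model_prune_window": "30d",  // prune models for dead split-field values older than this
    "detectors": [{ "function": "mean", "field_name": "value", "by_field_name": "host" }]
  },
  "data_description": { "time_field": "timestamp" }
}
```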
2021-08-03 09:16:43 +01:00
Przemysław Witek
30d9f13436
[ML] Delete expired annotations (#75617) 2021-07-29 15:27:03 +02:00
Lisa Cawley
70b870ee7f
[DOCS] Fixes nesting of datafeed config in APIs (#75502) 2021-07-20 11:27:15 -07:00
Lisa Cawley
3c76bcb3a5
[DOCS] Fixes links to machine learning concepts (#75194) 2021-07-09 13:09:03 -07:00
Lisa Cawley
a6339918ac
[DOCS] Adds defaults to get ML results APIs (#73540)
Co-authored-by: David Roberts <dave.roberts@elastic.co>
2021-06-03 10:05:47 -07:00
David Roberts
0059c59e25
[ML] Make ml_standard tokenizer the default for new categorization jobs (#72805)
Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.

The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, so creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.

It is still possible to configure the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.

To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.

If no categorization_analyzer is specified but categorization_filters
are specified, then the categorization filters are converted to char
filters that are applied after first_non_blank_line.

Closes elastic/ml-cpp#1724
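A sketch of opting back into the old tokenizer while keeping the new char filter (analyzer structure assumed from the standard categorization_analyzer syntax):

```js
"analysis_config": {
  "categorization_field_name": "message",
  "categorization_analyzer": {
    "char_filter": [ "first_non_blank_line" ],  // drop this entry to opt out of the new default
    "tokenizer": "ml_classic"                   // explicitly keep the pre-7.14 tokenizer
  },
  "detectors": [{ "function": "count", "by_field_name": "mlcategory" }]
}
```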
2021-06-01 15:11:32 +01:00
István Zoltán Szabó
1ce2308e2a
[DOCS] Adds max_trees hyperparameter to GET TM API docs (#72298) 2021-05-06 08:18:19 +02:00
Pierre Grimaud
3c44dfec60
[DOCS] Fix typos (#72227) 2021-04-26 12:40:38 -04:00
István Zoltán Szabó
ce389dff5d
[DOCS] Clarifies that custom rules are job rules in Kibana (#71678)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2021-04-15 09:33:03 +02:00
James Rodewig
693807a6d3
[DOCS] Fix double spaces (#71082) 2021-03-31 09:57:47 -04:00
Benjamin Trent
10e637d97c
[ML] allow documents to be out of order within the same time bucket (#70468)
This commit allows documents seen within the same time bucket to be out of order.

This is already supported within the native process.

Additionally, when recording the "latest" record timestamp, we were assuming that the latest seen document was truly the "latest". This is not really the case if latency is utilized or if documents come out of order within the same bucket.
2021-03-17 09:34:49 -04:00
István Zoltán Szabó
59f6280a7b
[DOCS] Changes deprecated syntax to node.role style in datafeed docs. (#70201) 2021-03-10 15:46:01 +01:00
Lisa Cawley
2caba7b11f
[DOCS] Edits machine learning settings (#69947)
Co-authored-by: David Roberts <dave.roberts@elastic.co>
2021-03-09 10:59:12 -08:00
Lisa Cawley
55f0e32fe4
[DOCS] Clarify put data frame analytics API feature processors option (#69158) 2021-02-18 08:53:46 -08:00
Benjamin Trent
26eef892df
[ML] adds new trained model alias API to simplify trained model updates and deployments (#68922)
A `model_alias` allows trained models to be referred to by a user-defined moniker.

This not only improves the readability and simplicity of numerous API calls, but it allows for simpler deployment and upgrade procedures for trained models. 

Previously, if you referenced a model ID directly within an ingest pipeline and a new model performed better than the earlier referenced one, you had to update the pipeline itself. If that model was used in numerous pipelines, ALL of those pipelines had to be updated.

When using a `model_alias` in an ingest pipeline, only that `model_alias` needs to be updated. Then, the underlying referenced model will change in place for all ingest pipelines automatically. 

An additional benefit is that the referenced model is not changed until it is fully loaded into cache; this way, throughput is not hampered by changing models.
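The upgrade flow then reduces to one alias reassignment (model and alias names are hypothetical; the endpoint shape follows the new API):

```js
// repoint the alias at the better model; pipelines referencing "flight-delay" keep working unchanged
PUT _ml/trained_models/flight-delay-v2/model_aliases/flight-delay?reassign=true
```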
2021-02-18 09:41:50 -05:00
Lisa Cawley
a1fb2c3606
[DOCS] Fixes n_gram_encoding in data frame analytics APIs (#69084) 2021-02-16 14:02:00 -08:00
Lisa Cawley
8b6ec07613
[DOCS] Edits ML hyperparameter descriptions (#68880) 2021-02-11 11:55:28 -08:00
Lisa Cawley
683368cc4d
[DOCS] Clarify soft_tree_depth_limit (#68787)
Co-authored-by: Tom Veasey <tveasey@users.noreply.github.com>
2021-02-10 12:51:01 -08:00
István Zoltán Szabó
e45d7a942d
[DOCS] Expands feature processors property description and adds a link of conceptual docs (#68213) 2021-02-02 14:48:43 +01:00
Valeriy Khakhutskyy
78368428b3
[ML] Add early stopping DFA configuration parameter (#68099)
The PR adds the optional early_stopping_enabled data frame analysis configuration parameter. The enhancement was already described in elastic/ml-cpp#1676, so I mark it here as a non-issue.
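A minimal placement sketch (analytics ID, indices, and dependent variable are placeholders):

```js
PUT _ml/data_frame/analytics/example-regression
{
  "source": { "index": "training-data" },
  "dest": { "index": "training-results" },
  "analysis": {
    "regression": {
      "dependent_variable": "target",
      "early_stopping_enabled": false  // optional; disable to always run the full hyperparameter search
    }
  }
}
```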
2021-02-01 11:41:28 +01:00
Dimitris Athanasiou
5c961c1c81
[ML] Expand regression/classification hyperparameters (#67950)
Expands data frame analytics regression and classification
analyses with the following hyperparameters (see the sketch after the list):

- alpha
- downsample_factor
- eta_growth_rate_per_tree
- max_optimization_rounds_per_hyperparameter
- soft_tree_depth_limit
- soft_tree_depth_tolerance
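All values below are illustrative placements, not recommendations:

```js
"analysis": {
  "classification": {
    "dependent_variable": "label",
    "alpha": 0.1,
    "downsample_factor": 0.5,
    "eta_growth_rate_per_tree": 1.02,
    "max_optimization_rounds_per_hyperparameter": 2,
    "soft_tree_depth_limit": 5.0,
    "soft_tree_depth_tolerance": 0.15
  }
}
```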
2021-01-26 12:56:41 +02:00
István Zoltán Szabó
addb5cbd3a
[DOCS] Adds custom feature processors description to PUT DFA API (#67424)
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
2021-01-19 09:47:32 +01:00
David Kyle
22dadfd407
[ML] Docs and HRLC for datafeed runtime mappings (#65810)
For the changes in #65606
2020-12-08 10:06:58 +00:00
David Roberts
49e492f313
[ML] Adding assignment_memory_basis to model_size_stats (#65561)
At present the Java code makes a decision on whether to
use current model memory or model memory limit to calculate
how much memory a job requires to be assigned.

The plan is to move this decision to the C++ code, which will
report it via a new field in the model size stats.  An
additional change will be that once we have made the switch
from using model memory limit to using current model memory
we will never switch back, as this causes large fluctuations
up and down in memory requirement which will be much more
noticeable when autoscaling is in use.

Although the only two options at present are model memory
limit and current model memory, the new enum includes a
third possibility, peak model memory.  To switch to this
now would be tricky, as there have been two bugs in the
implementation of peak model memory which render its value
unreliable in 7.x.  However, in 8.x it might make sense to
switch to using peak model memory instead of current model
memory and it's much easier from a BWC perspective if the
enum contains all the values from the start.

Relates #63163
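A stats sketch of how the new field could surface (byte counts are hypothetical; the alternative enum names are assumptions based on the description above):

```js
"model_size_stats": {
  "model_bytes": 16106127,
  "assignment_memory_basis": "model_memory_limit"  // assumed alternatives: "current_model_bytes", "peak_model_bytes"
}
```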
2020-12-03 17:18:08 +00:00
David Roberts
fc72b39a17
[ML] Adjusting soft_limit description (#65383)
This PR adds detail to the explanation of the soft_limit
memory_status in ML job stats. A consequence that was not
mentioned before is that examples are not added to category
definitions.

Relates elastic/ml-cpp#1590
2020-11-24 09:35:07 +00:00