Zero-shot classification allows text classification without a model pre-trained on a specific collection of target labels.
This is achieved through models trained on the Multi-Genre Natural Language Inference (MNLI) dataset. This dataset pairs text sequences with "entailment" clauses. An example could be:
"Throughout all of history, mankind has shown itself resourceful, yet astoundingly short-sighted" might be paired with the entailment clauses: ["This example is history", "This example is sociology"...].
This training data, combined with the attention mechanisms and semantic knowledge of modern-day NLP models (BERT, BART, etc.), affords a powerful tool for ad-hoc text classification.
See https://arxiv.org/abs/1909.00161 for a deeper explanation of the MNLI training and how zero-shot works.
The zero-shot classification task is configured as follows:
```js
{
// <snip> model configuration </snip>
"inference_config" : {
"zero_shot_classification": {
"classification_labels": ["entailment", "neutral", "contradiction"], // <1>
"labels": ["sad", "glad", "mad", "rad"], // <2>
"multi_label": false, // <3>
"hypothesis_template": "This example is {}.", // <4>
"tokenization": { /*<snip> tokenization configuration </snip>*/}
}
}
}
```
* <1> For all zero-shot models, these are the three labels returned when classifying the target sequence. "entailment" is the positive case, "neutral" the case where the sequence is neither positive nor negative, and "contradiction" is the negative case.
* <2> Optional parameter defining the default labels that the zero-shot task attempts to classify the sequence against.
* <3> When returning the probabilities, whether the results should assume there is exactly one true label or that multiple labels may be true.
* <4> The hypothesis template used when tokenizing the labels. Combined with the label `sad`, the sequence becomes `This example is sad.`
For inference in a pipeline, one may provide label updates:
```js
{
//<snip> pipeline definition </snip>
"processors": [
//<snip> other processors </snip>
{
"inference": {
// <snip> general configuration </snip>
"inference_config": {
"zero_shot_classification": {
"labels": ["humanities", "science", "mathematics", "technology"], // <1>
"multi_label": true // <2>
}
}
}
}
//<snip> other processors </snip>
]
}
```
* <1> The `labels` we care about; these replace the default labels, if any were configured.
* <2> Whether the results should allow multiple true labels.
Similarly, one may provide label changes against the `_infer` endpoint:
```js
{
"docs":[{ "text_field": "This is a very happy person"}],
"inference_config":{"zero_shot_classification":{"labels": ["glad", "sad", "bad", "rad"], "multi_label": false}}
}
```
In #75617 a new setting, system_annotations_retention_days, was
added to control how long system annotations are retained.
We now feel that this setting is redundant and that system
annotations should be retained for the same period as results.
This is intuitive and defensible, as system annotations can be
considered a type of result.
Follow-up to #75617
Add configuration for pruning dead split fields in anomaly detection
jobs via the `model_prune_window` field for both the job creation and
update APIs.
Relates to ml-cpp/#1962
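A rough sketch of where the new field might sit in a job creation request (the job ID and detector are placeholders, and the placement inside `analysis_config` is an assumption to verify):
```js
PUT _ml/anomaly_detectors/my-split-field-job
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [{ "function": "count", "by_field_name": "error_code" }],
    "model_prune_window": "30d" // prune models for split field values not seen for 30 days (placement assumed)
  },
  "data_description": { "time_field": "timestamp" }
}
```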
Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.
The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, so creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.
It is still possible to configure the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.
To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.
If no categorization_analyzer is specified but categorization_filters
are specified, then the categorization filters are converted to char
filters that are applied after first_non_blank_line.
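For example, a hedged sketch of explicitly selecting the ml_classic tokenizer while keeping the first_non_blank_line char filter (job ID, detector, and field names are placeholders):
```js
PUT _ml/anomaly_detectors/my-categorization-job
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "categorization_analyzer": {
      "char_filter": ["first_non_blank_line"], // omit this entry to opt out of the new default char filter
      "tokenizer": "ml_classic"                // or any other Elasticsearch tokenizer
    },
    "detectors": [{ "function": "count", "by_field_name": "mlcategory" }]
  },
  "data_description": { "time_field": "timestamp" }
}
```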
Closes elastic/ml-cpp#1724
This commit allows documents seen within the same time bucket to be out of order.
This is already supported within the native process.
Additionally, when recording the "latest" record timestamp, we were assuming that the latest seen document was truly the "latest". This is not necessarily the case if latency is configured or if documents arrive out of order within the same bucket.
A `model_alias` allows trained models to be referred to by a user-defined moniker.
This not only improves the readability and simplicity of numerous API calls, but it allows for simpler deployment and upgrade procedures for trained models.
Previously, if you referenced a model ID directly within an ingest pipeline and a new model came along that performed better than the earlier referenced model, you had to update the pipeline itself. If that model was used in numerous pipelines, ALL of those pipelines had to be updated.
When using a `model_alias` in an ingest pipeline, only that `model_alias` needs to be updated. Then, the underlying referenced model will change in place for all ingest pipelines automatically.
An additional benefit is that the referenced model is not changed until it is fully loaded into cache, so throughput is not hampered by changing models.
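A minimal sketch of the intended workflow, using hypothetical model IDs, alias, and pipeline name (treat the exact paths and the `reassign` parameter as assumptions to check against the trained models API docs):
```js
// Point the alias at the current model
PUT _ml/trained_models/flight-delay-model-v1/model_aliases/flight-delay-model

// Ingest pipelines reference the alias rather than a concrete model ID
PUT _ingest/pipeline/flight-delays
{
  "processors": [
    { "inference": { "model_id": "flight-delay-model" } }
  ]
}

// Later, repoint the alias at an improved model; the pipelines themselves are untouched
PUT _ml/trained_models/flight-delay-model-v2/model_aliases/flight-delay-model?reassign=true
```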
This PR adds the optional early_stopping_enabled data frame analysis configuration parameter. The enhancement was already described in elastic/ml-cpp#1676, so it is marked here as a non-issue.
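A hedged sketch of where the new parameter would sit in a data frame analytics job (the job ID, index names, dependent variable, and placement inside the regression analysis are assumptions):
```js
PUT _ml/data_frame/analytics/my-regression-job
{
  "source": { "index": "my-source-index" },
  "dest": { "index": "my-dest-index" },
  "analysis": {
    "regression": {
      "dependent_variable": "price",
      "early_stopping_enabled": false // disable early stopping of hyperparameter optimization (placement assumed)
    }
  }
}
```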
At present the Java code makes a decision on whether to
use current model memory or model memory limit to calculate
how much memory a job requires to be assigned.
The plan is to move this decision to the C++ code, which will
report it via a new field in the model size stats. An
additional change will be that once we have made the switch
from using model memory limit to using current model memory
we will never switch back, as this causes large fluctuations
up and down in memory requirement which will be much more
noticeable when autoscaling is in use.
Although the only two options at present are model memory
limit and current model memory, the new enum includes a
third possibility, peak model memory. To switch to this
now would be tricky, as there have been two bugs in the
implementation of peak model memory which render its value
unreliable in 7.x. However, in 8.x it might make sense to
switch to using peak model memory instead of current model
memory and it's much easier from a BWC perspective if the
enum contains all the values from the start.
Relates #63163
This PR adds detail to the explanation of the soft_limit
memory_status in ML job stats. A consequence that was not
mentioned before is that examples are not added to category
definitions.
Relates elastic/ml-cpp#1590
When exporting and cloning ML configurations in a cluster, it can be
frustrating to remove all the fields that were generated by
the plugin, especially as the number of these fields changes
from version to version.
This flag, exclude_generated, allows the GET config APIs to return
configurations with these generated fields removed.
APIs supporting this flag:
- GET _ml/anomaly_detectors/<job_id>
- GET _ml/datafeeds/<datafeed_id>
- GET _ml/data_frame/analytics/<analytics_id>
The following fields are not returned in the objects:
- any field that is not user settable (e.g. version, create_time)
- any field that is a calculated default value (e.g. datafeed chunking_config)
- any field that is automatically set via another Elastic stack process (e.g. anomaly job custom_settings.created_by)
relates to #63055
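For example (IDs are placeholders), a sketch of fetching cloneable configurations with the generated fields stripped:
```js
GET _ml/anomaly_detectors/my-job?exclude_generated=true
GET _ml/datafeeds/my-datafeed?exclude_generated=true
GET _ml/data_frame/analytics/my-analytics?exclude_generated=true
```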
This adds the new `for_export` flag to the following APIs:
- GET _ml/anomaly_detectors/<job_id>
- GET _ml/datafeeds/<datafeed_id>
- GET _ml/data_frame/analytics/<analytics_id>
The flag is designed for cloning or exporting configuration objects to later be put into the same cluster or a separate cluster.
The following fields are not returned in the objects:
- any field that is not user settable (e.g. version, create_time)
- any field that is a calculated default value (e.g. datafeed chunking_config)
- any field that would effectively require changing to be of use (e.g. datafeed job_id)
- any field that is automatically set via another Elastic stack process (e.g. anomaly job custom_settings.created_by)
closes https://github.com/elastic/elasticsearch/issues/63055
Adds a new flag, include, to the get trained models API.
The flag initially has two valid values: definition, total_feature_importance.
Consequently, the old include_model_definition flag is now deprecated.
When total_feature_importance is included, the total_feature_importance field is included in the model metadata object.
Including definition is the same as previously setting include_model_definition=true.
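Illustrative sketches (the model ID is a placeholder, and the base path is shown as in recent versions):
```js
// Equivalent to the now-deprecated include_model_definition=true
GET _ml/trained_models/my-model?include=definition

// Adds the total_feature_importance field to the model metadata object
GET _ml/trained_models/my-model?include=total_feature_importance
```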
Adds HLRC and some docs for the new feature_processors field in Data frame analytics.
Co-authored-by: Przemysław Witek <przemyslaw.witek@elastic.co>
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
Changes:
* Moves `Retrieve selected fields` to its own page and adds a title abbreviation.
* Adds existing script and stored fields content to `Retrieve selected fields`
* Adds a xref for `Retrieve selected fields` to `Search your data`
* Adds related redirects and updates existing xrefs
This commit adds the new configurable field `custom`.
`custom` indicates if the preprocessor was submitted by a user or automatically created by the analytics job.
Eventually, this field will be used in calculating feature importance. When `custom` is true, the feature importance for
the processed fields is calculated. When `false`, the current behavior is unchanged (we calculate the importance for the originating field/feature).
This also adds new required methods to the preprocessor interface. If users are to supply their own preprocessors
in the analytics job configuration, we need to know the input and output field names.
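An illustrative sketch of a user-supplied preprocessor carrying the new field (the analysis type, field names, and hot_map values are hypothetical):
```js
"analysis": {
  "classification": {
    "dependent_variable": "label",
    "feature_processors": [
      {
        "one_hot_encoding": {
          "field": "category",
          "hot_map": { "cat_a": "is_cat_a", "cat_b": "is_cat_b" },
          "custom": true // user-supplied, so feature importance is calculated for the processed output fields
        }
      }
    ]
  }
}
```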
This PR adds the initial Java side changes to enable
use of the per-partition categorization functionality
added in elastic/ml-cpp#1293.
There will be a followup change to complete the work,
as there cannot be any end-to-end integration tests
until elastic/ml-cpp#1293 is merged, and also
elastic/ml-cpp#1293 does not implement some of the
more peripheral functionality, like stop_on_warn and
per-partition stats documents.
The changes so far cover REST APIs, results object
formats, HLRC and docs.
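A sketch of the anomaly detection job configuration this enables (job ID, field names, and bucket span are placeholders; stop_on_warn is the peripheral option noted above):
```js
PUT _ml/anomaly_detectors/my-per-partition-categorization-job
{
  "analysis_config": {
    "bucket_span": "15m",
    "categorization_field_name": "message",
    "per_partition_categorization": {
      "enabled": true,
      "stop_on_warn": true // stop categorizing partitions whose categorization status becomes "warn"
    },
    "detectors": [
      {
        "function": "count",
        "by_field_name": "mlcategory",
        "partition_field_name": "datacenter" // categories are determined independently per partition
      }
    ]
  },
  "data_description": { "time_field": "timestamp" }
}
```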