These docs previously implied that you could update datafeed
properties while the datafeed was running, but then would have
to stop and restart it for the changes to take effect.
In fact datafeed updates can only be made while the datafeed is
stopped (and this has been the case for many years, if not forever).
* [7.16] [ML] Model snapshot upgrade needs a stats endpoint
Previously the ML model snapshot upgrade endpoint did not
provide a way to reliably monitor progress. This could lead
to the upgrade assistant UI thinking that a model snapshot
upgrade had finished when it actually hadn't.
This change adds a new "stats" API that allows external
interested parties to find out the status of each model
snapshot upgrade and which node (if any) each is running on.
Backport of #81641
* Fixing compilation
The char filter replaces the previous default of `first_non_blank_line`.
`first_non_blank_line` worked well to figure out what line had characters at all, but log lines
like the following were handled poorly:
```
--------------------------------------------------------------------------------
Alias 'foo' already exists and this prevents setting up ILM for logs
--------------------------------------------------------------------------------
```
When combined with the `ml_standard` tokenizer, the first line was used:
```
--------------------------------------------------------------------------------
```
This has no valid tokens for our standard tokenizer. Consequently, no tokens were found by `ml_standard` tokenizer.
The new filter, `first_line_with_letters`, returns the first line with any letter character (e.g. `Character#isLetter` returns true).
Given the previously poorly handled log, when combining with our `ml_standard` tokenizer, we get the following, more appropriate, tokens:
```
"tokens" : ["Alias", "foo", "already", "exists", "and", "this", "prevents", "setting", "up", "ILM", "for", "logs"]
```
* [ML] Use results retention time for deleting system annotations
In #75617 a new setting, system_annotations_retention_days, was
added to control how long system annotations are retained for.
We now feel that this setting is redundant and that system
annotations should be retained for the same period as results.
This is intuitive and defensible, as system annotations can be
considered a type of result.
Backport of #76096
* Fix one more merge clash
Previously attempting to delete a job that had a datafeed
would return an exception. However, this was unnecessarily
pedantic - the user would always want to delete both job
and datafeed together, and would react by deleting the
datafeed and then subsequently deleting the job again.
This change makes the delete job API automatically delete
a datafeed associated with the job. The same level of
force is used for this delete datafeed request as was used
on the delete job request. This means that it's possible
to force-delete an open job with a started datafeed (since
force-delete datafeed will automatically stop a started
datafeed). It's still not possible to delete an opened job
without using force.
Backport of #76010
Add configuration for pruning dead split fields in anomaly detection
jobs via the `model_prune_window` field for both the job creation and
update APIs.
Relates to ml-cpp/#1962
Backports #75741
This is a quality of life improvement for typical users. Almost all anomaly jobs will receive their data through a datafeed.
The datafeed config can now be supplied and is available in the datafeed field in the job config for creation and getting jobs.
Previously it was a requirement of the close job API that if the
job had an associated datafeed that that datafeed was stopped
before the job could be closed. Experience has shown that this
is just a pedantic nuisance. If a user closes the job without
first stopping the datafeed then it's just a mistake, and they
then have to make two further calls, to stop the datafeed and
then attempt to close the job again.
This PR changes the behaviour so that if you ask to close a job
whose datafeed is running then the datafeed gets stopped first
as part of the same call. Datafeeds are stopped with the same
level of force as the job close request specified.
Backport of #74257
Adds a new API that allows a user to reset
an anomaly detection job.
To use the API do:
```
POST _ml/anomaly_detectors/<job_id>_reset
```
The API removes all data associated to the job.
In particular, it deletes model state, results and stats.
However, job notifications and user annotations are not removed.
Also, the API can be called asynchronously by setting the parameter
`wait_for_completion` to `false` (defaults to `true`). When run
that way the API returns the task id for further monitoring.
In order to prevent the job from opening while it is resetting,
a new job field has been added called `blocked`. It is an object
that contains a `reason` and the `task_id`. `reason` can take
a value from ["delete", "reset", "revert"] as all these
operations should block the job from opening. The `task_id` is also
included in order to allow tracking the task if necessary.
Finally, this commit also sets the `blocked` field when
the revert snapshot API is called as a job should not be opened
while it is reverted to a different model snapshot.
Backport of #73908
It is useful to know the following information when reading datafeed stats:
- Is the datafeed a "real-time" datafeed, i.e. a datafeed without a configured `end` time
- Has the datafeed processed all past data available at the time of starting.
This object is only available if the datafeed task has been created.
It has the form:
```
"running_state": {
"is_real_time": <boolean>,
"look_back_finished": <boolean>
}
```
Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.
The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, so creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.
It is still possible to config the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.
To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.
If no categorization_analyzer is specified but categorization_filters
are specified then the categorization filters are converted to char
filters applied that are applied after first_non_blank_line.
Backport of #72805
This commit increases the xpack.ml.max_open_jobs from 20 to 512. Additionally, it ignores nodes that cannot provide an accurate view into their native memory.
If a node does not have a view into its native memory, we ignore it for assignment.
This effectively fixes a bug with autoscaling. Autoscaling relies on jobs with adequate memory to assign jobs to nodes. If that is hampered by the xpack.ml.max_open_jobs scaling decisions are hampered.
Previously, a datafeed and job must already exist for the `_preview` API to work.
With this change, users can get an accurate preview of the data that will be sent to the anomaly detection job
without creating either of them.
closes https://github.com/elastic/elasticsearch/issues/70264
The text structure finder API documentation had many references to the "files". While this is one use of the API, the API now has a more generic name. This commit replaces many references to the word "file" to the more generic word "text".
* [ML] move find file structure to a new API endpoint (#67123)
This introduces a new `text-structure` plugin. This is the new home of the find file structure API.
The old REST URL is still available but is deprecated.
The new URL is: `_text_structure/find_structure`. All parameters and behavior are unchanged.
Changes to the high-level REST client and docs will be in separate commit.
related to: https://github.com/elastic/elasticsearch/issues/67001
This commit is fixing a potential bug if we support anomaly detection
results index rollover in the future.
In particular, we determine the current `data_counts` by sorting on the
latest record time. However, this is not correct if the job reverts
to an older model snapshot. To fix this we add `log_time` to `data_counts`
(similarly to `model_size_stats`) and sort on `log_time` to figure
out the current counts for the job.
Backport of #66343
There is little evidence of this endpoint being used
and there is quite a lot of code complexity associated
with the various formats that can be used to upload
data and the different errors that can occur when direct
data upload is open to end users.
In a future release we can make this endpoint internal
so that only datafeeds can use it, and remove all the
options and formats that are not used by datafeeds.
End users will have to store their input data for
anomaly detection in Elasticsearch indices (which we
believe all do today) and use a datafeed to feed it
to anomaly detection jobs.
Backport of #66347
At present the Java code makes a decision on whether to
use current model memory or model memory limit to calculate
how much memory a job requires to be assigned.
The plan is to move this decision to the C++ code, which will
report it via a new field in the model size stats. An
additional change will be that once we have made the switch
from using model memory limit to using current model memory
we will never switch back, as this causes large fluctuations
up and down in memory requirement which will be much more
noticeable when autoscaling is in use.
Although the only two options at present are model memory
limit and current model memory, the new enum includes a
third possibility, peak model memory. To switch to this
now would be tricky, as there have been two bugs in the
implementation of peak model memory which render its value
unreliable in 7.x. However, in 8.x it might make sense to
switch to using peak model memory instead of current model
memory and it's much easier from a BWC perspective if the
enum contains all the values from the start.
Backport of #65561