Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.
The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, so creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.
It is still possible to config the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.
To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.
If no categorization_analyzer is specified but categorization_filters
are specified then the categorization filters are converted to char
filters applied that are applied after first_non_blank_line.
Closeselastic/ml-cpp#1724
This change adds an extra piece of information,
limits.total_ml_memory, to the ML info response.
This returns the total amount of memory that ML
is permitted to use for native processes across
all ML nodes in the cluster. Some of this may
already be in use; the value returned is total,
not available ML memory.
This PR implements the following changes to make ML model snapshot
retention more flexible in advance of adding a UI for the feature in
an upcoming release.
- The default for `model_snapshot_retention_days` for new jobs is now
10 instead of 1
- There is a new job setting, `daily_model_snapshot_retention_after_days`,
that defaults to 1 for new jobs and `model_snapshot_retention_days`
for pre-7.8 jobs
- For days that are older than `model_snapshot_retention_days`, all
model snapshots are deleted as before
- For days that are in between `daily_model_snapshot_retention_after_days`
and `model_snapshot_retention_days` all but the first model snapshot
for that day are deleted
- The `retain` setting of model snapshots is still respected to allow
selected model snapshots to be retained indefinitely
Closes#52150
The ML info endpoint returns the max_model_memory_limit setting
if one is configured. However, it is still possible to create
a job that cannot run anywhere in the current cluster because
no node in the cluster has enough memory to accommodate it.
This change adds an extra piece of information,
limits.effective_max_model_memory_limit, to the ML info
response that returns the biggest model memory limit that could
be run in the current cluster assuming no other jobs were
running.
The idea is that the ML UI will be able to warn users who try to
create jobs with higher model memory limits that their jobs will
not be able to start unless they add a bigger ML node to their
cluster.
Relates elastic/kibana#63942
The categorization job wizard in the ML UI will use this
information when showing the effect of the chosen categorization
analyzer on a sample of input.