Since ignore_case is set to true in our custom stop words filter, the matching will be case-insensitive.
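For reference, a minimal sketch of such a filter definition (the index name, filter name, and stop word list are illustrative):
```
PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_custom_stop": {
          "type": "stop",
          "stopwords": ["and", "is", "the"],
          "ignore_case": true
        }
      }
    }
  }
}
```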
(cherry picked from commit a03fba9d77)
Co-authored-by: Siniša Subašić <68671543+sinisuba@users.noreply.github.com>
This commit removes the experimental tag from kNN search docs and makes some
docs improvements:
* Add a prominent warning about memory usage in the kNN search guide
* Link to the performance tuning guide from the main guide
* Clarify the memory requirements section in the tuning guide
* Update threadpool.asciidoc
Starting from 8.0, the value of the `node.processors` setting is bounded by the number of available
processors (https://github.com/elastic/elasticsearch/pull/44894).
* Update docs/reference/modules/threadpool.asciidoc
Co-authored-by: Adam Locke <adam.locke@elastic.co>
* Refine geo-point and geo-shape docs
While reviewing the docs for another issue, we discovered some
deprecated references to prefix trees, which prompted bringing
the docs a little more up to date.
* Update docs/reference/mapping/types/geo-point.asciidoc
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
* Update docs/reference/mapping/types/geo-shape.asciidoc
Co-authored-by: Abdon Pijpelink <abdon.pijpelink@elastic.co>
Adds a health section to the transform stats endpoint and implements reporting of assignment, indexing/search, and persistence problems, together with an overall health state.
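For illustration, `GET _transform/<transform_id>/_stats` should now include a health section along these lines (the exact field layout shown here is an assumption):
```
"health": {
  "status": "yellow",
  "issues": [
    {
      "issue": "Search task failure",
      "count": 1
    }
  ]
}
```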
This change adds an element_type as an optional mapping parameter for dense vector fields, as
described in #89784. It also adds a byte element_type for dense vector fields that supports storing
dense vectors using only 8 bits per dimension. This is only supported when the mapping parameter
`index` is set to `true`.
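For example, a sketch of a mapping that uses the new parameter (index and field names are illustrative):
```
PUT my-index
{
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "dense_vector",
        "element_type": "byte",
        "dims": 3,
        "index": true,
        "similarity": "l2_norm"
      }
    }
  }
}
```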
The code follows a similar pattern to our NumberFieldMapper, where we have an enum for
ElementType with methods that DenseVectorFieldType and DenseVectorMapper can delegate to
in order to support each available type (just float and byte for now).
* Add CCR limitation
closes https://github.com/elastic/elasticsearch/issues/86121
* Add restored index auto follow pattern restriction
https://github.com/elastic/elasticsearch/issues/87055
* Moving content to existing CCR page + several changes
* Remove sections to consolidate limitation information
* Delete separate file
* Remove restored indices from list of things that aren't replicated
Co-authored-by: Adam Locke <adam.locke@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This commit adds a new field, write_load, to the shard stats. This new stat exposes the average number of write threads used while indexing documents.
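As a sketch, the new stat can be inspected via shard-level index stats, assuming it surfaces under the `indexing` section:
```
GET my-index/_stats?level=shards&filter_path=indices.*.shards.*.indexing.write_load
```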
Closes #90102
Adds a {index}/_semantic_search endpoint which first converts the query text into a dense vector
using an NLP text embedding model, then performs a kNN search against an index containing
dense vectors created with the same embedding model.
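A minimal sketch of a request; the body fields shown (`model_id`, `query_string`) are assumptions based on the description above:
```
GET my-index/_semantic_search
{
  "model_id": "my-text-embedding-model",
  "query_string": "carbon neutral data centers"
}
```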
Adds more detail about the meaning of the results
fields of the `categorize_text` aggregation, and
advice about how to use these fields when searching
for messages that match the categories.
Follow-up to #90723
This PR surfaces new information about the impact of the factors on the initial anomaly score in the anomaly record:
- single bucket impact is determined by the deviation between actual and typical in the current bucket
- multi-bucket impact is determined by the deviation between actual and typical in the past 12 buckets
- anomaly characteristics are statistical properties of the current anomaly compared to the historical observations
- high variance penalty is the reduction of the anomaly score in buckets with large confidence intervals
- incomplete bucket penalty is the reduction of the anomaly score in buckets with fewer samples than historically expected

Additionally, we compute lower and upper confidence bounds and the typical value for the anomaly records. This improves the explainability of cases where the model plot is not activated, with only a slight overhead in performance (1-2%).
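For illustration, an anomaly record might now carry an explanation block along these lines (field names and values are illustrative assumptions):
```
"anomaly_score_explanation": {
  "single_bucket_impact": 62,
  "multi_bucket_impact": 0,
  "lower_confidence_bound": 93.4,
  "typical_value": 101.3,
  "upper_confidence_bound": 109.9
}
```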
When parsing queries on the coordinating node, there is currently no way to share state between the different parsing methods (`fromXContent`). The only query that supports a parse context is bool query, which uses the context to track the nested depth of queries, added with #66204. Such a nested depth tracking mechanism is not 100% accurate, as it tracks bool queries only, while there are many more query types that can hold other queries and hence potentially cause a stack overflow when deeply nested.
This change removes the parsing context that's specific to bool query, introduced with #66204, in favour of generalizing the nested depth tracking to all query types.
The generic tracking is introduced by wrapping the parser and overriding the method that parses named objects through the xcontent registry. Another way would have been to require a context argument when parsing queries, which would mean adding a context argument to all the QueryBuilder#fromXContent static methods. That would be a breaking change for plugins that provide custom queries, hence I went for a different approach.
One aspect that this change requires and introduces is the distinction between parsing a top level query (which wraps the parser, or would create the context if we had one), as opposed to parsing an inner query, which goes ahead with the given parser and context. We already have this distinction, as we have two different static methods in `AbstractQueryBuilder`, but in practice only bool query makes the distinction, being the only context-aware query.
In addition to generalizing the tracking of nested depth when parsing queries, we should be able to adopt this same strategy to track query usage as part of #90176.
Given that the depth check is now more restrictive, as it counts all compound queries and not only bool, we have decided to raise the default limit to `30` to ensure that users are not going to hit the limit due to this change.
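As an illustration, every compound level of the query below now counts toward the depth limit, not just the bool level (the query itself is a trivial sketch):
```
GET my-index/_search
{
  "query": {
    "bool": {
      "must": {
        "constant_score": {
          "filter": {
            "match": { "message": "foo" }
          }
        }
      }
    }
  }
}
```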
We do not support, and don't plan to support, disaster recovery arrangements
where Security configuration is replicated between the production and the
disaster recovery cluster, because the cluster-local Security APIs assume
exclusive write access to the .security system index.
Use a magic value of "null" for the timestamp format override to indicate to the analysis that a timestamp is not expected in the input text. This should improve performance when analysing delimited, NDJSON, or XML formatted text files that don't contain timestamps. For semi-structured text files without timestamps, the magic value indicates that the text should be treated as single-line log messages.
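A sketch of supplying the override; the endpoint and `timestamp_format` parameter exist, while the sample NDJSON body is illustrative:
```
POST _text_structure/find_structure?timestamp_format=null
{"message": "first record"}
{"message": "second record"}
```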
see #55219
This commit adds a new API that users can use calling:
```
POST _ml/trained_models/{model_id}/deployment/_update
{
  "number_of_allocations": 4
}
```
This allows a user to update the number of allocations for a deployment
that is `started`.
If the allocations are increased, we rebalance and let the assignment
planner find how to allocate the additional allocations.
If the allocations are decreased, we cannot use the assignment planner.
Instead, we implement the reduction in a new class, `AllocationReducer`,
that tries to reduce the allocations so that:
1. availability zone balance is maintained
2. assignments that can be completely stopped are preferred to release memory
The new `regex` field in `categorize_text` output is created in
the same way as the `regex` field that appears in the category
definitions created by anomaly detection jobs that do categorization.
It consists of the terms that occur in the same order for every
message that matches the category, separated with a `.+?` wildcard.
It therefore matches the category messages and enforces the order
of the terms that occurred in the same order for all messages used
to create the category.
It is not recommended to use the regex as the primary mechanism for
searching for the original documents that were categorized. Search
using a regular expression is very slow. Instead, the terms of the
category should be used to search for matching documents, as a
terms search can use the inverted index and hence be much faster.
However, there may be situations where it is useful to use the
`regex` field to test whether a small set of messages that have not
been indexed match the category.
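For example, a faster alternative to the regex is a match query over the category terms (index, field, and terms are illustrative):
```
GET my-logs/_search
{
  "query": {
    "match": {
      "message": {
        "query": "Node shutting down",
        "operator": "and"
      }
    }
  }
}
```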
Currently, we report the count of affected nodes and indices as part of
the disk indicator using a leaky abstraction. Namely, we use the status
we internally assign to nodes based on their disk usage (red, yellow,
green, unknown).
However, these statuses don't have an explicit meaning outside the
implementation details, e.g. a red node probably conveys that the node
is experiencing disk issues, but not what kind.
This proposes being explicit in what we return to our health API users,
e.g.
```
"details": {
"indices_with_readonly_block": 2,
"nodes_with_enough_disk_space": 0,
"nodes_with_unknown_disk_status": 0,
"nodes_over_high_watermark": 0,
"nodes_over_flood_watermark": 2
}
```
This commit adds a deprecation warning for when the `remove_binary`
setting is unset. In the future we want to change the default to `true`
(it is currently `false`), so this will let users know they should
set it explicitly to ensure the behavior does not change in a
future (breaking) release.
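For example, setting it explicitly in an ingest attachment processor avoids the warning (pipeline name and field are illustrative):
```
PUT _ingest/pipeline/attachment
{
  "processors": [
    {
      "attachment": {
        "field": "data",
        "remove_binary": true
      }
    }
  ]
}
```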
Relates to #86014
Now that we have the estimated field mappings heap overhead
in nodes stats, we can refer to them in the guide for sizing
data nodes appropriately.
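For instance, the overhead can be inspected per node with something like the following (assuming the new section lands under `indices.mappings` in node stats):
```
GET _nodes/stats?filter_path=nodes.*.indices.mappings
```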
Relates to #86639
Adds to the docs a note that the `100mb` default for
`http.max_content_length` is the recommended maximum, along with
suggestions for what to do when hitting this limit.
Added Cartesian support for centroid aggregation
* First draft of cartesian-centroid docs
However, this is largely a duplicate of the geo-centroid docs, since the behaviour is essentially identical. We should consider merging them.
* Work on isAggregatable caused a minor logic conflict. When that work was done, Point and Shape were not aggregatable, but now they are.
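A quick sketch of the new aggregation (index and field names are illustrative):
```
GET my-index/_search?size=0
{
  "aggs": {
    "centroid": {
      "cartesian_centroid": {
        "field": "location"
      }
    }
  }
}
```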