In LogsDB we would like to use a default value of `8191` for the index-level setting
`index.mapping.ignore_above`. The value for `ignore_above` is the _character count_,
but Lucene counts bytes (its term length limit is `32766` bytes). Here we set the limit to
`32766 / 4 = 8191` since a UTF-8 character may occupy at most 4 bytes.
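A minimal sketch of how such an index-scoped setting with this default could be declared (the class and constant names below are illustrative, not the actual ones introduced by this change):

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;

public final class IgnoreAboveDefaults {
    // Lucene rejects terms longer than 32766 bytes; with at most 4 bytes per
    // UTF-8 character, 32766 / 4 = 8191 characters is always safe.
    static final int LUCENE_MAX_TERM_BYTES = 32766;
    static final int DEFAULT_IGNORE_ABOVE = LUCENE_MAX_TERM_BYTES / 4; // 8191

    // Hypothetical index-scoped setting declaration using this default.
    public static final Setting<Integer> IGNORE_ABOVE_SETTING = Setting.intSetting(
        "index.mapping.ignore_above",
        DEFAULT_IGNORE_ABOVE,
        0,
        Property.IndexScope
    );
}
```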
(cherry picked from commit 521e4341d7)
# Conflicts:
# server/src/main/java/org/elasticsearch/common/settings/Setting.java
Co-authored-by: Salvatore Campagna <93581129+salvatore-campagna@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
If source is required by a block loader, then the StoredFieldsSpec that gets populated in ValuesSourceReaderOperator should be enhanced with SourceLoader#requiredStoredFields(...). Otherwise, in the case of synthetic source, many stored fields aren't loaded, which causes only a subset of _source to be synthesized. For example, unmapped fields or field values that exceed the configured ignore_above limit will not appear in _source.
This happens when field types fall back to a block loader implementation that uses _source. The required field values are then extracted from the source once it is loaded.
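A hedged sketch of the idea (illustrative only, not the exact production code): when the spec already requires _source, widen it with whatever stored fields the SourceLoader needs to synthesize it.

```java
import java.util.Set;

import org.elasticsearch.index.mapper.SourceLoader;
import org.elasticsearch.search.fetch.StoredFieldsSpec;

final class StoredFieldsSpecs {
    // If _source is needed, make sure the stored fields required to synthesize
    // it (e.g. _ignored_source) are loaded as well.
    static StoredFieldsSpec withSourceLoaderFields(StoredFieldsSpec spec, SourceLoader sourceLoader) {
        if (spec.requiresSource() == false) {
            return spec;
        }
        Set<String> required = sourceLoader.requiredStoredFields();
        return spec.merge(new StoredFieldsSpec(true, false, required));
    }
}
```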
This change also reverts the production code changes introduced via #114903. That change only ensured that the _ignored_source field was added to the required list of stored fields, while in reality more fields could be required. This change is a better fix, since it also handles other cases and lets the SourceLoader implementation indicate which stored fields are needed.
Closes #115076
This reverts commit 4c15cc0778.
The reverted commit introduced an orders-of-magnitude regression when searching many shards.
(cherry picked from commit d9baf6f9db)
Co-authored-by: Armin Braun <me@obrown.io>
* Add prefilters only once in the compound and text similarity retrievers (#114983)
This change ensures that the prefilters are propagated to the downstream retrievers only once.
It also removes the ability to extend `explainQuery` in the compound retriever. This is no longer
needed, as the rank docs are now responsible for the explanation.
* Trigger Build
In this PR we add a test and we fix the issues we encountered when we
enabled the failure store for TSDS and logsdb.
**Logsdb** Logsdb worked out of the box, so we just added a test that
indexes a couple of documents with a bulk request and verifies how they
are ingested.
**TSDS** Here it was a bit trickier. We encountered the following
issues:
- TSDS requires a timestamp to determine the write index of the data stream, meaning the failure happens earlier than we had anticipated so far. We added a special exception to detect this case and treat it accordingly.
- The template of a TSDS data stream sets certain settings that we do not want to have in the failure store index. We added an allowlist that gets applied before we add the necessary index settings (see the sketch after this list).
Furthermore, we added a test case to capture this.
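The allowlist idea could look roughly like the following; the allowed keys and names here are hypothetical, not the actual list used by the failure store code.

```java
import java.util.Set;

import org.elasticsearch.common.settings.Settings;

final class FailureStoreIndexSettings {
    // Hypothetical allowlist of template settings that may carry over to the
    // failure store index; TSDS-specific settings (routing paths, time bounds,
    // index mode) are intentionally absent.
    private static final Set<String> ALLOWED_KEYS = Set.of(
        "index.number_of_shards",
        "index.number_of_replicas"
    );

    // Apply the allowlist before the failure-store-specific settings are added.
    static Settings filterTemplateSettings(Settings templateSettings) {
        return templateSettings.filter(ALLOWED_KEYS::contains);
    }
}
```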
* Reprocess operator file settings on service start (#114295)
Changes `FileSettingsService` to reprocess file settings on every
restart or master node change, even if versions match between file and
cluster-state metadata. If the file version is lower than the metadata
version, processing is still skipped to avoid applying stale settings.
This makes it easier for consumers of file settings to change their
behavior w.r.t. file settings contents. For instance, an update of how
role mappings are stored will automatically apply on the next restart,
without the need to manually increment the file settings version to
force reprocessing.
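A hedged sketch of the reprocessing decision described above (the class and method names are illustrative, not the actual FileSettingsService code):

```java
final class FileSettingsReprocessing {
    // Reprocess on every service start or master change, unless the file is
    // strictly older than what is already applied in cluster-state metadata.
    static boolean shouldProcess(long fileVersion, long appliedMetadataVersion) {
        if (fileVersion < appliedMetadataVersion) {
            return false; // stale file: skip to avoid applying old settings
        }
        return true; // equal or newer: reprocess even if the versions match
    }
}
```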
Relates: ES-9628
* Backport 114295
---------
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This adds cancellation checks to the rescore phase. These checks cover
both cancellation of the parent task and search timeouts.
The assumption is that rescore is always significantly more expensive
than a regular query, so we check for timeout as frequently as the most
frequent check in ExitableDirectoryReader.
For LTR, we check on hit inference. Maybe we should also check for per
feature extraction?
For QueryRescorer, we check in the combine method.
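A hedged sketch of the kind of check described above (illustrative; the actual hook points live in the rescorers themselves):

```java
import org.elasticsearch.search.internal.SearchContext;
import org.elasticsearch.tasks.TaskCancelledException;

final class RescoreCancellationChecks {
    // Called between rescored hits: fail fast if the parent task was cancelled.
    // Timeout handling would follow the same pattern, at the same frequency as
    // the most frequent ExitableDirectoryReader check.
    static void checkCancelled(SearchContext context) {
        if (context.getTask() != null && context.getTask().isCancelled()) {
            throw new TaskCancelledException("rescore was cancelled");
        }
    }
}
```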
closes: https://github.com/elastic/elasticsearch/issues/114955
Currently, when the compute engine loads source and the source mode is synthetic, the synthetic source loader is already used. But the _ignored_source field isn't always marked as a required stored field, causing the synthesized source to potentially miss a lot of fields.
This change includes the _ignored_source field as a required stored field and allows keyword fields without doc values or stored fields to be used in the case of synthetic source.
Relying on synthetic source to get the values (because a field doesn't have stored fields / doc values) is slow. With synthetic source we already keep ignored fields/values in a special place, named ignored source. Long term, with synthetic source we should only load ignored source when a field has no doc values or stored fields, as is being explored in #114886, thereby avoiding synthesizing the complete _source in order to get only one field.
This removes all recovery-source-specific SFM singletons. Whether recovery source is enabled can be checked via `DocumentParserContext`. This reduces the number of SFM instances by half.
The main users of this class take as input latitudes and longitudes read from doc values. These coordinates are always
within bounds, so there is no point in trying to normalise them, especially since this piece of code is on the hot path
for aggregations.
When I added the query/fetch metrics, I overlooked that non-primary
shards were being skipped during metrics collection, and the stateful
tests didn't catch it. This change ensures that search metrics are now
collected from every shard copy.
Currently the incremental and non-incremental bulk variations will
return different error codes when the json body provided is invalid.
This commit ensures both versions return status code 400. Additionally,
this renames the incremental rest tests to bulk tests and ensures that
all tests work with both bulk api versions. We set these tests to
randomize which version of the api we test each run.
With recent changes in Lucene 9.12 around not forking execution when not necessary
(see https://github.com/apache/lucene/pull/13472), we have removed the search
worker thread pool in #111099. The worker thread pool had an unlimited queue, and we
feared that we could have much more queueing on the search thread pool if we execute
segment level searches on the same thread pool as the shard level searches, because
every shard search would take up to a thread per slice when executing the query phase.
We have then introduced an additional conditional to stop parallelizing when there
is a queue. That is perhaps a bit extreme, as it's a decision made when creating the
searcher, while a queue may no longer be there once the search is executing.
This has caused some benchmark regressions, given that having a queue may be a transient
scenario, especially with short-lived segment searches being queued up. We may end
up disabling inter-segment concurrency more aggressively than we would want, penalizing
requests that do benefit from concurrency. At the same time, we do want to have some kind
of protection against rejections of shard searches that would be caused by excessive slicing.
When the queue is above a certain size, we can turn off the slicing and effectively disable
inter-segment concurrency. With this commit we set that threshold to be the number of
threads in the search pool.
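A hedged sketch of the decision described above (illustrative; the real check happens when the searcher computes its slices):

```java
import java.util.concurrent.ThreadPoolExecutor;

final class QueryPhaseSlicingDecision {
    // Keep inter-segment concurrency unless the search pool's queue has grown
    // beyond the number of threads in the pool; a small, transient queue no
    // longer disables slicing.
    static boolean enableSlicing(ThreadPoolExecutor searchExecutor) {
        int threads = searchExecutor.getMaximumPoolSize();
        int queued = searchExecutor.getQueue().size();
        return queued <= threads;
    }
}
```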
Here we check for the existence of a `host.name` field in the index sort settings
when the index mode is `logsdb`, and decide whether to inject the field into the
mapping depending on whether it is present. By default `host.name` is required for
sorting in LogsDB. Injecting the `host.name` field only when strictly required
reduces the chances of errors at mapping or template composition time. A user who
wants to override the index sort settings without including a `host.name` field
can now do so without finding an automatically injected `host.name` field in the
mappings. If users override the sort settings and `host.name` is not included,
we don't need to inject the field, since sorting no longer requires it.
As a result of this change we have the following:
* the user does not provide any index sorting configuration: we are responsible for injecting the default sort fields and their mapping (for `logsdb`)
* the user explicitly provides non-empty index sorting configuration: the user is also responsible for providing correct mappings and we do not modify index sorting or mappings
Note also that all sort settings `index.sort.*` are `final` which means doing this
check once, when mappings are merged at template composition time, is enough.
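A minimal sketch of the decision captured by the two bullets above (the class and method names are illustrative, not the actual mapping code):

```java
import java.util.List;

import org.elasticsearch.common.settings.Settings;

final class LogsdbHostNameSortInjection {
    // Inject the default sort fields (and their mappings, e.g. host.name) only
    // when the user has not provided any index sorting configuration; if the
    // user configures index.sort.* explicitly, mappings are left untouched.
    static boolean shouldInjectDefaultSortFields(Settings indexSettings) {
        List<String> sortFields = indexSettings.getAsList("index.sort.field");
        return sortFields.isEmpty();
    }
}
```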
(cherry picked from commit 9bf6e3b0ba)
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
**Introduction**
> In order to make adoption of failure stores simpler for all users, we are introducing a new syntactical feature to index expression resolution: the selector.
>
> Selectors, denoted with a `::` followed by a recognized suffix, will allow users to specify which component of an index abstraction they would like to operate on within an API call. In this case, an index abstraction is a concrete index, data stream, or alias; any abstraction that can be resolved to a set of indices/shards. We define a component of an index abstraction to be some searchable unit of the index abstraction.
>
> To start, we will support two components: data and failures. Concrete indices are their own data components, while the data component for index aliases is all of the indices contained therein. For data streams, the data component corresponds to their backing indices. Data stream aliases mirror this, treating all backing indices of the data streams they correspond to as their data component.
>
> The failure component is only supported by data streams and data stream aliases. The failure component of these abstractions refers to the data streams' failure stores. Indices and index aliases do not have a failure component.
For more details and examples see
https://github.com/elastic/elasticsearch/pull/113144. All this work has
been cherry picked from there.
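As a rough illustration of the selector syntax described above, here is a hedged sketch of parsing an index expression into its abstraction and component; the record and the defaults are hypothetical, not the actual resolver code.

```java
// Hedged sketch: split "logs-app::failures" into ("logs-app", "failures");
// expressions without a selector default to the data component.
record ParsedExpression(String indexAbstraction, String component) {

    static ParsedExpression parse(String expression) {
        int idx = expression.lastIndexOf("::");
        if (idx < 0) {
            return new ParsedExpression(expression, "data");
        }
        String suffix = expression.substring(idx + 2);
        if (suffix.equals("data") == false && suffix.equals("failures") == false) {
            throw new IllegalArgumentException("unrecognized selector [" + suffix + "]");
        }
        return new ParsedExpression(expression.substring(0, idx), suffix);
    }
}
```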
**Purpose of this PR**
This PR replaces the `FailureStoreOptions` with the `SelectorOptions`. There
shouldn't be any perceivable change for the user, since we kept the query
parameter "failure_store" for now. It will be removed in the next PR, which
will introduce the parsing of the expressions.
_The current PR is just a refactoring and does not and should not change
any existing behaviour._
* Allow synthetic source and disabled source for standard indices (#114817)
When using the index.mapping.source.mode setting we need to make sure
that it takes precedence and is applied also when the standard index mode
is used. Without this patch we always return stored source when
`_source.mode` is not used in the mapping but the setting is.
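A hedged sketch of the precedence described above (the enum and method are illustrative, not the actual SourceFieldMapper logic; the stored-source default is shown only for illustration):

```java
final class SourceModePrecedence {
    enum Mode { STORED, SYNTHETIC, DISABLED }

    // The explicit _source.mode in the mapping wins; otherwise the
    // index.mapping.source.mode setting applies, even for standard indices.
    static Mode effectiveMode(Mode fromMapping, Mode fromIndexSetting) {
        if (fromMapping != null) {
            return fromMapping;
        }
        if (fromIndexSetting != null) {
            return fromIndexSetting;
        }
        return Mode.STORED;
    }
}
```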
Relates #114433
(cherry picked from commit 3af4d67fac)
# Conflicts:
# server/src/main/java/org/elasticsearch/index/mapper/SourceFieldMapper.java
* fix: conflict resolution mistake
* fix: error message
---------
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
We actually don't need a cluster feature; a capability that is added when
the feature flag is enabled is enough for testing.
closes https://github.com/elastic/elasticsearch/issues/114787
(cherry picked from commit e87b894f68)
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
`ThreadContext#stashContext` does not yield a completely fresh context:
it preserves headers related to tracing the original request. That may
be appropriate in many situations, but sometimes we really do want to
detach processing entirely from the original task. This commit
introduces new utilities to do that.
* ESQL: Introduce per agg filter (#113735)
Add support for aggregation scoped filters that work dynamically on the
data in each group.
| STATS
success = COUNT(*) WHERE 200 <= code AND code < 300,
redirect = COUNT(*) WHERE 300 <= code AND code < 400,
client_err = COUNT(*) WHERE 400 <= code AND code < 500,
server_err = COUNT(*) WHERE 500 <= code AND code < 600,
total_count = COUNT(*)
Implementation-wise, the base AggregateFunction has been extended to
allow a filter to be passed on. This is required to incorporate the
filter as part of the aggregate equality/identity, which would fail with
the filter as an external component.
As part of the process, the serialization for the existing aggregations
had to be fixed so that AggregateFunction implementations delegate to
their parent first.
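A hedged sketch of the serialization fix mentioned above (illustrative classes, not the actual ESQL aggregate functions): the parent serializes the shared state, including the filter, and subclasses delegate to it before writing their own fields.

```java
import java.io.IOException;

import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Writeable;

class BaseAgg implements Writeable {
    final String field;
    final String filter; // the new per-aggregation filter, may be null

    BaseAgg(String field, String filter) {
        this.field = field;
        this.filter = filter;
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeString(field);
        out.writeOptionalString(filter);
    }
}

class PercentileAgg extends BaseAgg {
    final double percentile;

    PercentileAgg(String field, String filter, double percentile) {
        super(field, filter);
        this.percentile = percentile;
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        super.writeTo(out);          // delegate to the parent first
        out.writeDouble(percentile); // then write subclass-specific state
    }
}
```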
(cherry picked from commit d102659dce)
* Update docs/changelog/114842.yaml
* Delete docs/changelog/114842.yaml