Today an `HttpResponse` is always released via a `ChannelPromise`, which
means the release happens on a network thread. However, it's possible we
try to send an `HttpResponse` after the node has got far enough through
shutdown that it has no running network threads left, in which case
the response just leaks.
This is no big deal in production, where it becomes irrelevant when the
process exits, but in tests we start and stop many nodes within the same
process and so must not leak anything.
At this point in shutdown, all HTTP channels are now closed, so it's
sufficient to check whether the channel is open first, and to fail the
listener on the calling thread if not. That's what this commit does.
Closes #104651
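A minimal sketch of the pattern described above, using hypothetical `Channel`/`ActionListener` stand-ins rather than the actual Netty/Elasticsearch types: check whether the channel is still open and, if not, fail the listener on the calling thread instead of handing work to a network thread that may no longer exist.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not the actual Netty/Elasticsearch types.
class ShutdownSafeSend {
    interface ActionListener { void onResponse(); void onFailure(Exception e); }

    static final class Channel {
        private volatile boolean open = true;
        boolean isOpen() { return open; }
        void close() { open = false; }
    }

    /**
     * If the channel is already closed (e.g. late in node shutdown, when no
     * network threads remain), fail the listener immediately on the calling
     * thread instead of enqueueing work that would never run.
     */
    static void send(Channel channel, ActionListener listener) {
        if (channel.isOpen() == false) {
            listener.onFailure(new IllegalStateException("channel closed"));
            return;
        }
        // Normal path: hand off to the network thread via a ChannelPromise.
        // Simplified here to an immediate completion.
        listener.onResponse();
    }

    // Small demo helper: returns which listener branch was taken.
    static String sendAndRecord(boolean channelOpen) {
        Channel channel = new Channel();
        if (channelOpen == false) channel.close();
        AtomicReference<String> result = new AtomicReference<>();
        send(channel, new ActionListener() {
            public void onResponse() { result.set("sent"); }
            public void onFailure(Exception e) { result.set("failed"); }
        });
        return result.get();
    }
}
```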
There is no need to read a single `BytesReference` (backed by a single large `byte[]`)
here when we only care about the individual values in the list.
Without breaking the behavior of serializing only once when sending to multiple targets, this change:
* lazily serializes as needed and keeps the original terms, so we don't needlessly go through serialization in e.g. a single-node situation
or for requests that are handled directly on the coordinator (concurrency should be fine here: in practice we serialize on the same thread, and
should we ever not be on the same thread at all times, the worst case is serializing multiple times).
* stops allocating a potentially huge `byte[]` when receiving these values over the wire
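The lazy, serialize-at-most-once idea can be sketched as follows. This is illustrative only (the names and the comma-join "serialization" are stand-ins, not the actual Elasticsearch wire format); the benign-race comment mirrors the reasoning above.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Illustrative sketch, not the actual terms-query code.
class LazyTerms {
    private final List<String> terms;   // original values kept for local use
    private byte[] serialized;          // filled on first send, reused afterwards

    LazyTerms(List<String> terms) { this.terms = terms; }

    // Local/coordinator path: use the values directly, no serialization.
    List<String> terms() { return terms; }

    // Worst case a rare race serializes twice; both results are identical,
    // so correctness is unaffected.
    byte[] serialized() {
        if (serialized == null) {
            serialized = String.join(",", terms).getBytes(StandardCharsets.UTF_8);
        }
        return serialized;
    }

    // Demo helper: the same cached array is returned on repeated calls.
    static boolean serializesOnce() {
        LazyTerms t = new LazyTerms(List.of("foo", "bar"));
        return t.serialized() == t.serialized() && t.terms().size() == 2;
    }
}
```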
We are adding a query parameter to the field_caps API in order to filter out
fields with no values. The parameter is called `include_empty_fields` and
defaults to true; if set to false, it filters out of the field_caps
response all fields that have no values in the index.
We keep track of FieldInfos during refresh in order to know which fields have
values in an index. We also added a system property
`es.field_caps_empty_fields_filter` in order to disable this feature if needed.
---------
Co-authored-by: Matthias Wilhelm <ankertal@gmail.com>
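A rough sketch of the filtering behaviour with hypothetical names (the real implementation works on FieldInfos tracked during refresh, not a plain `Set`): when `include_empty_fields=false`, only fields known to have values survive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the include_empty_fields filtering described above.
class FieldCapsFilter {
    static List<String> filter(List<String> allFields,
                               Set<String> fieldsWithValues,
                               boolean includeEmptyFields) {
        if (includeEmptyFields) {
            return allFields;               // default behaviour: keep everything
        }
        List<String> out = new ArrayList<>();
        for (String field : allFields) {
            if (fieldsWithValues.contains(field)) {
                out.add(field);             // drop fields with no values
            }
        }
        return out;
    }
}
```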
To improve cross-cluster search user experience, Kibana needs an endpoint that is accessible
by arbitrary Kibana dashboard search users and provides:
1. a listing of clusters in scope for a CCS query (based on the index expression and whether
there are any indices on each cluster that the Kibana user has access to query).
2. whether that cluster is currently connected to the querying cluster (will it come back as
skipped or failed in a CCS search)
3. the skip_unavailable setting for those clusters (so you can know whether it will
return skipped or failed in a CCS search)
4. the ES version of the cluster
Since no single Elasticsearch endpoint provides all of these features, this PR creates a new endpoint `_resolve/cluster` that works alongside the existing `_resolve/index` endpoint
(and leverages some of its features).
Example usage against a cluster with 2 remote clusters configured:
GET /_resolve/cluster/*,remote*:bl*
Response:
{
  "(local)": {
    "connected": true,
    "skip_unavailable": false,
    "matching_indices": true,
    "version": {
      "number": "8.12.0-SNAPSHOT",
      "build_flavor": "default",
      "minimum_wire_compatibility_version": "7.17.0",
      "minimum_index_compatibility_version": "7.0.0"
    }
  },
  "remote2": {
    "connected": true,
    "skip_unavailable": true,
    "matching_indices": true,
    "version": {
      "number": "8.12.0-SNAPSHOT",
      "build_flavor": "default",
      "minimum_wire_compatibility_version": "7.17.0",
      "minimum_index_compatibility_version": "7.0.0"
    }
  },
  "remote1": {
    "connected": true,
    "skip_unavailable": false,
    "matching_indices": false,
    "version": {
      "number": "8.12.0-SNAPSHOT",
      "build_flavor": "default",
      "minimum_wire_compatibility_version": "7.17.0",
      "minimum_index_compatibility_version": "7.0.0"
    }
  }
}
Almost all errors show up as "error" entries in the response.
Only the local SecurityException returns a 403, since that happens before the ResolveCluster
transport code kicks in.
We see occasional test failures in CI due to the analysis not completing
within this 30s timeout. It doesn't look like anything is actually
wrong, the test machine is just busy and these tests can be quite
IO-intensive. This commit gives them more time.
Closes #99422
Detects and efficiently encodes cyclic ordinals, as proposed by
@jpountz. This is beneficial for encoding dimensions that are
multivalued, such as host.ip.
A follow-up on #99747
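A simplified sketch of the cycle-detection half of this idea (not the actual Lucene/Elasticsearch encoder): find the shortest period of the ordinal sequence, so a cyclic sequence can be stored as a single cycle plus a total length rather than the full array.

```java
// Illustrative sketch of detecting a cyclic pattern in an ordinal sequence.
class CyclicOrdinals {
    /**
     * Returns the shortest cycle length p such that ords[i] == ords[i - p]
     * for all i >= p. A fully non-cyclic sequence returns its own length.
     */
    static int cycleLength(int[] ords) {
        outer:
        for (int p = 1; p < ords.length; p++) {
            for (int i = p; i < ords.length; i++) {
                if (ords[i] != ords[i - p]) {
                    continue outer;     // period p broken, try the next one
                }
            }
            return p;                   // every element repeats with period p
        }
        return ords.length;
    }
}
```

For example, the multivalued-dimension pattern `1,2,3,1,2,3,1` has cycle length 3, so an encoder could store just `1,2,3` and the total count.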
If the client closes the channel while we're in the middle of a chunked
write then today we don't complete the corresponding listener. This
commit fixes the problem.
This reduces the risk of document loss if too many fields are added.
As these component templates are imported by Fleet, this also affects
integrations.
Today the various `ESTestCase#safeAwait` variants do not include a
descriptive message with their failures, which means you have to dig
through the stack trace to work out the reason for a test failure. This
commit adds the missing messages to make it a little easier on the
reader.
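The idea can be sketched like this (a hypothetical helper, not the actual `ESTestCase` code): the failure message names what we were waiting for, so the reader doesn't have to reconstruct it from the stack trace.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a safeAwait variant with a descriptive message.
class SafeAwait {
    static String describeFailure(String description) {
        return "safeAwait: [" + description + "] did not complete in time";
    }

    static void safeAwait(CountDownLatch latch, String description) {
        try {
            if (latch.await(10, TimeUnit.SECONDS) == false) {
                // The description tells the reader *what* timed out.
                throw new AssertionError(describeFailure(description));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new AssertionError(
                "safeAwait: interrupted while waiting for [" + description + "]", e);
        }
    }
}
```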
We sometimes see a need to run a full GC twice within the current 5s interval.
While we should work to improve our allocation pattern for that, it also
seems too conservative not to allow more frequent full GCs, as long as we also
get some real work done. Hence we lower the interval to 2s here, which fixes
the currently problematic cases.
Elasticsearch requires access to some native functions. Historically
this has been achieved with the JNA library. However, JNA is a
complicated, magical library, and has caused various problems booting
Elasticsearch over the years. The new Java Foreign Function and Memory
API allows native functions to be called directly from Java. It also
has the advantage of tight integration with hotspot which can improve
performance of these functions (though performance of Elasticsearch's
native calls has never been much of an issue since they are mostly at
boot time).
This commit adds a new native lib that is internal to Elasticsearch. It
is built to use the Foreign Function API starting with Java 21, and
continues using JNA on Java versions below that.
Only one function, checking whether Elasticsearch is running as root, is
migrated. Future changes will migrate other native functions.
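The version dispatch can be sketched as below; `NativeAccess` and the implementation names are illustrative stand-ins, not the actual Elasticsearch class names.

```java
// Hypothetical sketch of choosing a native-access implementation by
// Java feature version: FFM on 21+, JNA below that.
class NativeAccess {
    interface Impl { String name(); }

    static Impl choose(int javaFeatureVersion) {
        if (javaFeatureVersion >= 21) {
            return () -> "ffm";   // java.lang.foreign based implementation
        }
        return () -> "jna";       // legacy JNA based implementation
    }

    static Impl current() {
        // Runtime.version().feature() is e.g. 21 on Java 21.
        return choose(Runtime.version().feature());
    }
}
```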
Preallocate opens a FileInputStream in order to get a native file
descriptor to pass to native functions. However, getting at the file
descriptor requires breaking modular access. This commit adds native
posix functions for opening/closing and retrieving stats on a file in
order to avoid requiring additional permissions.
Qualified exports in the boot layer only work when they are to other boot
modules. Yet Elasticsearch also has dynamically loaded modules, as in plugins.
For this purpose we have ModuleQualifiedExportsService. This commit
moves loading of ModuleQualifiedExportsService instances in the boot layer
into core so that it can be reused by ProviderLocator when a qualified
export applies to an embedded module.
We have some usages of SearchException that provide neither a cause exception nor a status code. That means the status code of such requests defaults to 500, which is in many cases not a good choice: normally an internal server error has a cause associated with the wrapper exception.
This scenario is not very common, and it looks like a leftover of validation that used to happen on the shards, which can be moved to the coordinating node. This commit moves some of the exceptions thrown in SearchService#parseSource to SearchRequest#validate. This way we fail before serializing the shard-level request to all the shards, which is much better.
Note that for backwards compatibility reasons we need to keep throwing the same exception from the data node, even though it is now intuitively replaced by the same validation on the coordinating node. In a mixed-cluster scenario, an older node acting as coordinator would not perform the validation, and could serialize shard-level requests that still need to be checked on the data nodes to prevent unexpected situations.
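The shape of the change can be sketched as a coordinator-side validate step that runs before any shard-level request is serialized. The specific checks below are illustrative, not the exact ones moved by this commit.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: validate on the coordinating node before fanning
// out to the shards, so bad requests fail once, early, with a 4xx.
class CoordinatorValidation {
    static List<String> validate(int from, int size, boolean scroll) {
        List<String> errors = new ArrayList<>();
        if (from + size > 10_000) {
            errors.add("from + size must be <= 10000");
        }
        if (scroll && from > 0) {
            errors.add("`from` is not supported with scroll");
        }
        return errors; // empty list: safe to serialize shard-level requests
    }
}
```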
This avoids keeping the downsamplingInterval field around. Additionally, the
downsample interval is known when downsampling is invoked and
doesn't change.
This PR extends the repository integrity health indicator to cover also unknown and invalid repositories. Because these errors are local to a node, we extend the `LocalHealthMonitor` to monitor the repositories and report the changes in their health regarding the unknown or invalid status.
To simplify this extension in the future, we introduce the `HealthTracker` abstract class that can be used to create new local health checks.
Furthermore, we change the severity of the health status when the repository integrity indicator reports unhealthy from `RED` to `YELLOW` because even though this is a serious issue, there is no user impact yet.
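A minimal sketch of what such an abstract tracker could look like (hypothetical, not the actual `HealthTracker` API): a subclass computes its current health, and the base class reports only when the value changes.

```java
import java.util.Objects;

// Hypothetical sketch of a local health tracker base class.
abstract class HealthTracker<T> {
    private T lastReported;

    /** Subclasses compute the current health of whatever they monitor. */
    abstract T checkCurrentHealth();

    /** Returns the new health value if it changed since the last report, else null. */
    T pollForChange() {
        T current = checkCurrentHealth();
        if (Objects.equals(current, lastReported)) {
            return null;           // unchanged: nothing to report
        }
        lastReported = current;
        return current;            // changed: report upstream
    }

    // Demo: two distinct health values are reported across three polls.
    static int demo() {
        final String[] state = { "GREEN" };
        HealthTracker<String> tracker = new HealthTracker<String>() {
            String checkCurrentHealth() { return state[0]; }
        };
        int changes = 0;
        if (tracker.pollForChange() != null) changes++; // first poll reports
        if (tracker.pollForChange() != null) changes++; // unchanged: no report
        state[0] = "YELLOW";
        if (tracker.pollForChange() != null) changes++; // change reported
        return changes;
    }
}
```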
* Move doc-values classes needed by ST_INTERSECTS to server
These classes are needed by ESQL spatial queries, and are not licensed in a way that prevents this move.
Since they depend on Lucene, it is not possible to move them to a library.
Instead they are moved to be co-located with the GeoPoint doc-values classes that already exist in server.
* Moved to lucene package org.elasticsearch.lucene.spatial
* Moved Geo/ShapeDocValuesQuery to server because it is Lucene specific
This gives us access to these classes from ESQL for Lucene pushdown of spatial queries.
With this commit we refactor the internal representation of stacktraces
to use plain arrays instead of lists for some of its properties. The
motivation behind this change is simplicity:
* It avoids unnecessary boxing
* We could eliminate a few redundant null checks because we use
primitive types now in some places
* We could slightly simplify run-length decoding
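For illustration, run-length decoding into a primitive `int[]` (no boxing) might look like this sketch; the (count, value) pair layout is an assumption, not the actual profiling wire format.

```java
// Illustrative sketch of run-length decoding with primitive arrays.
class RunLength {
    /** Decodes pairs of (count, value) into a flat int[], avoiding boxing. */
    static int[] decode(int[] encoded) {
        int total = 0;
        for (int i = 0; i < encoded.length; i += 2) {
            total += encoded[i];            // sum the run counts
        }
        int[] out = new int[total];
        int pos = 0;
        for (int i = 0; i < encoded.length; i += 2) {
            for (int n = 0; n < encoded[i]; n++) {
                out[pos++] = encoded[i + 1]; // expand each run
            }
        }
        return out;
    }
}
```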
Improve downsampling by making the following changes:
- Avoid NPE and assert tripping when fetching the last processed tsid.
- If the write block has been set, there is no reason to start the downsample persistent tasks, since shard-level downsampling has already completed and should be skipped; starting them anyway also causes ILM/DSL to get stuck on downsampling.
- Sometimes the source index may not be allocated yet on the node performing the shard-level downsampling operation. This caused an NPE; with this PR, it now fails the shard-level downsample with a less disturbing error.
Additionally unmute
DataStreamLifecycleDownsampleDisruptionIT#testDataStreamLifecycleDownsampleRollingRestart
Relates to #105068
The unconditional rollover that is a consequence of a lazy rollover command is triggered by the creation of a document. In many cases, the user triggering this rollover won't have sufficient privileges to ensure the successful execution of this rollover. For this reason, we introduce a dedicated rollover action and a dedicated internal user to cover this case and enable this functionality.