Commit graph

8978 commits

Author SHA1 Message Date
David Turner
7bcbca1268
Upgrade MinIO test container (#128424)
Earlier versions of MinIO had a bug which can cause repository analysis
failures. This commit upgrades the MinIO test container version to pick
up the bug fix, and reverts the workaround implemented in #127166.

Relates https://github.com/minio/minio/issues/21189
2025-05-27 07:24:41 +01:00
Keith Massey
7207692056
Adding dry_run mode for setting data stream settings (#128269) 2025-05-23 11:29:00 -05:00
Ignacio Vera
de7c91c1d9
Use IndexOrDocValuesQuery in NumberFieldType#termQuery implementations (#128293) 2025-05-23 16:58:50 +02:00
Lorenzo Dematté
6bf531660c
Fix NPE in APMTracer through RestController (#128314)
Our APMTracer doesn't like nulls - this is a sensible thing, as APM in general does not allow nulls (it only allows a precise set of types).
This PR changes the attribute to a sentinel "" in place of null values. It also makes a small change to APMTracer to give a better error message in case of null values in attributes.
2025-05-23 09:32:22 +02:00
David Turner
3504c27e7d
Remove exception-mangling in connect/close listeners (#127954)
The close-listeners are never completed exceptionally today so they do
not need the exception mangling of a `ListenableFuture`. The connect-
and remove-listeners sometimes see an exception if the connection
attempt fails, but they also do not need any exception-mangling.

This commit removes the exception-mangling by replacing these
`ListenableFuture` instances with `SubscribableListener` ones.
2025-05-22 21:22:04 +10:00
David Turner
c3a1d58e25
Remove first FlowControlHandler from HTTP pipeline (#128099)
Today we have a `FlowControlHandler` near the top of the Netty HTTP
pipeline in order to hold back a request body while validating the
request headers. This is inefficient since once we've validated the
headers we can handle the body chunks as fast as they arrive, needing no
more flow control. Moreover today we always fork the validation
completion back onto the event loop, forcing any available chunks to be
buffered in the `FlowControlHandler`.

This commit moves the flow-control mechanism into
`Netty4HttpHeaderValidator` itself so that we can bypass it on validated
message bodies. Moreover in the (common) case that validation completes
immediately, e.g. because the credentials are available in cache, then
with this commit we skip the flow-control-related buffering entirely.
2025-05-22 18:14:11 +10:00
Nick Tindall
268e39b05b
Make GoogleCloudStorageRetryingInputStream request same generation on resume (#127626) 2025-05-22 17:00:20 +10:00
Simon Chase
e713f7c315
transport: pass network channel exceptions to close listeners (#127895)
Previously, exceptions encountered on a netty channel were caught and logged at
some level, but not passed to the TcpChannel or Transport.Connection close
listeners. This limited observability. This change implements this exception
reporting and passing, with TcpChannel.onException and NodeChannels.closeAndFail
reporting exceptions and their close listeners receiving them. Some test
infrastructure (FakeTcpChannel) and assertions in close listener onFailure
methods have been updated.
2025-05-21 12:04:45 -07:00
Pete Gillin
1fe3b77a2a
ES-10063 Add multi-project support for more stats APIs (#127650)
* Add multi-project support for more stats APIs

This affects the following APIs:
 - `GET _nodes/stats`:
   - For `indices`, it now prefixes the index name with the project ID (for non-default projects). Previously, it didn't tell you which project an index was in, and it failed if two projects had the same index name.
   - For `ingest`, it now gets the pipeline and processor stats for all projects, and prefixes the pipeline ID with the project ID. Previously, it only got them for the default project.
 - `GET /_cluster/stats`:
   - For `ingest`, it now aggregates the pipeline and processor stats for all projects. Previously, it only got them for the default project.
 - `GET /_info`:
   - For `ingest`, same as for `GET /_nodes/stats`.

This is done by making `IndicesService.stats()` and `IngestService.stats()` include project IDs in the `NodeIndicesStats` and `IngestStats` objects they return, and making those stats objects incorporate the project IDs when converting to XContent.

The transitive callers of these two methods are rather extensive (including all callers to `NodeService.stats()`, all callers of `TransportNodesStatsAction`, and so on). To ensure the change is safe, the callers were all checked out, and they fall into the following cases:
 - The behaviour change is one of the desired enhancements described above.
 - There is no behaviour change because it was getting node stats but neither `indices` nor `ingest` stats were requested.
 - There is no behaviour change because it was getting `indices` and/or `ingest` stats but only using aggregate values.
 - In `MachineLearningUsageTransportAction` and `TransportGetTrainedModelsStatsAction`, the `IngestStats` returned will return stats from all projects instead of just the default with this change, but they have been changed to filter the non-default project stats out, so this change is a noop there. (These actions are not MP-ready yet.)
 - `MonitoringService` will be affected, but this is the legacy monitoring module which is not in use anywhere that MP is going to be enabled. (If anything, the behaviour is probably improved by this change, as it will now include project IDs, rather than producing ambiguous unqualified results and failing in the case of duplicates.)

* Update test/external-modules/multi-project/build.gradle

Change suggested by Niels.

Co-authored-by: Niels Bauman <33722607+nielsbauman@users.noreply.github.com>

* Respond to review comments

* fix merge weirdness

* [CI] Auto commit changes from spotless

* Fix test compilation following upstream change to base class

* Update x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/datatiers/DataTierUsageFixtures.java

Co-authored-by: Niels Bauman <33722607+nielsbauman@users.noreply.github.com>

* Make projects-by-index map nullable and omit in single-project; always include project prefix in XContent in multi-project, even if default; also incorporate one other review comment

* Add a TODO

* update IT to reflect changed behaviour

* Switch to using XContent.Params to indicate whether it is multi-project or not

* Refactor NodesStatsMultiProjectIT to common up repeated assertions

* Defer use of ProjectIdResolver in REST handlers to keep tests happy

* Include index UUID in "unknown project" case

* Make the index-to-project map empty rather than null in the BWC deserialization case.

This works out fine, for the reasons given in the comment. As it happens, I'd already forgotten to do the null check in the one place it's actively used.

* remove a TODO that is done, and add a comment

* fix typo

* Get REST YAML tests working with project ID prefix TODO finish this

* As a drive-by, fix and un-suppress one of the health REST tests

* [CI] Auto commit changes from spotless

* TODO ugh

* Experiment with different stashing behaviour

* [CI] Auto commit changes from spotless

* Try a more sensible stash behaviour for assertions

* clarify comment

* Make checkstyle happy

* Make the way `Assertion` works more consistent, and simplify implementation

* [CI] Auto commit changes from spotless

* In RestNodesStatsAction, make the XContent params to channel.request(), which is the value it would have had before this change

---------

Co-authored-by: Niels Bauman <33722607+nielsbauman@users.noreply.github.com>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
2025-05-21 19:04:22 +01:00
Keith Massey
bc45087962
Adding rest actions to get and set data stream settings (#127858) 2025-05-21 12:17:56 -05:00
Ryan Ernst
a2b4a6f246
Add temporary LegacyActionRequest (#128107)
In order to remove ActionType, ActionRequest will become strongly typed,
referring to the ActionResponse type. As a precursor to that, this
commit adds a LegacyActionRequest which all existing ActionRequest
implementations now inherit from. This will allow adding the
ActionResponse type to ActionRequest in a future commit without
modifying every implementation at once.
2025-05-20 07:09:27 -07:00
Yang Wang
265848e5ab
[Test] Fix testContentRangeValidation (#128188)
Ensure sufficient bytes for the start position.
2025-05-20 21:52:58 +10:00
Tanguy Leroux
27a3eb0fd1
Increase repository_azure max. threads on serverless (#128130)
On Serverless, the `repository_azure` thread pool is
shared between snapshots and translogs/segments upload
logic. Because snapshots can be rate-limited when
executing in the repository_azure thread pool, we want
to leave enough room for the other upload threads to be
executed.

Relates ES-11391
2025-05-20 09:18:23 +02:00
David Turner
18c60791c3
Make S3 custom query parameter optional (#128043)
Today Elasticsearch will record the purpose for each request to S3 using
a custom query parameter[^1]. This isn't believed to be necessary
outside of the ECH/ECE/ECK/... managed services, and it adds rather a
lot to the request logs, so with this commit we make the feature
optional and disabled by default.

[^1]:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/LogFormat.html#LogFormatCustom
2025-05-20 17:14:39 +10:00
Ryan Ernst
d6ffe01122
Avoid nested docs in painless execute api (#127991)
Painless does not support accessing nested docs (except through
_source). Yet the painless execute api indexes any nested docs that are
found when parsing the sample document. This commit changes the ram
indexing to only index the root document, ignoring any nested docs.

fixes #41004
2025-05-19 08:18:09 -07:00
David Turner
20c02f430d
Set connection: close header on shutdown (#128025)
Lets clients using HTTP pipelining know to cease usage of connections to
shutting-down nodes.

Closes #127984
2025-05-19 06:14:46 +10:00
David Turner
cd1fa77990
Add missing entitlement to repository-azure (#128047)
This entitlement is required, but only if validating the metadata
endpoint against `https://login.microsoft.com/` which isn't something we
can do in a test. Kind of a SDK bug, we should be using an existing
event loop rather than spawning threads randomly like this.
2025-05-14 09:28:15 +01:00
Pete Gillin
ca921a0c31
Flip default metric for data stream auto-sharding (#127930)
This changes the default value of both the
`data_streams.auto_sharding.increase_shards.load_metric` and
`data_streams.auto_sharding.decrease_shards.load_metric` cluster
settings from `PEAK` to `ALL_TIME`. This setting has been applied via
config for several weeks now.

The approach taken to updating the tests was to swap the values given for the all-time and peak loads in all the stats objects provided as input to the tests, and to swap the enum values in the couple of places they appear.
2025-05-12 14:32:41 +01:00
David Turner
0f9c1ead9b
Ensure S3Service is STARTED when creating client (#128026)
It's possible for another component to request an S3 client after the
node has started to shut down, and today the `S3Service` will dutifully
attempt to create a fresh client instance even if it is closed. Such
clients will then leak, resulting in test failures.

With this commit we refuse to create new S3 clients once the service has
started to shut down.
2025-05-12 20:20:32 +10:00
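The guard described in #128026 can be pictured with a minimal sketch. This is not the actual `S3Service` code: the class name, the `State` enum, and the choice of `IllegalStateException` are all assumptions made for illustration. The point is only that client creation checks the lifecycle state first and refuses once shutdown has begun.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a service that hands out clients.
final class ClientServiceSketch {
    enum State { STARTED, STOPPED }

    private final AtomicReference<State> state = new AtomicReference<>(State.STARTED);

    // Creates a fresh client only while the service is STARTED.
    // The exception type is an assumption, not taken from the commit.
    Object client() {
        if (state.get() != State.STARTED) {
            throw new IllegalStateException("service is not STARTED; refusing to create a client");
        }
        // A real implementation would also need to close the race between
        // this check and a concurrent stop(); the sketch ignores that.
        return new Object();
    }

    // Marks the service as shutting down; later client() calls are refused.
    void stop() {
        state.set(State.STOPPED);
    }
}
```

Refusing outright, rather than returning a client that nobody will close, is what prevents the leak mentioned in the commit message.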
Ryan Ernst
8ad272352b
Remove doPrivileged from ES modules (#127848)
Continuing the cleanup of SecurityManager related code, this commit
removes uses of doPrivileged in all Elasticsearch modules.
2025-05-09 14:15:48 -04:00
Nik Everett
da553b11e3
Fix a bug in significant_terms (#127975)
Fix a bug in the `significant_terms` agg where the "subsetSize" array is
too small because we never collect the ordinal for the agg "above" it.

This mostly hits when you do a `range` agg containing a
`significant_terms` AND you only collect the first few ranges. `range`
isn't particularly popular, but `date_histogram` is super popular and it
rewrites into a `range` pretty commonly - so that's likely what's really
hitting this - a `date_histogram` followed by a `significant_text` where
the matches are all early in the date range held by the shard.
2025-05-09 13:48:19 -04:00
Ryan Ernst
ab690ba23f
Check hidden frames in entitlements (#127877)
Entitlements do a stack walk to find the calling class. When method
references are used in a lambda, the frame ends up hidden in the stack
walk. In the case of using a method reference with
AccessController.doPrivileged, the call looks like it is the jdk itself,
so the call is trivially allowed. This commit adds hidden frames to the
stack walk so that the lambda frame created for the method reference is
included. Several internal packages are then necessary to filter out of
the stack.
2025-05-08 16:59:03 -07:00
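The effect described in #127877 can be observed with the JDK's `StackWalker` API alone. The sketch below is not the entitlements implementation; `HiddenFramesSketch` and its methods are hypothetical names. It walks the stack twice from inside a call made through a method reference, once with the default options and once with `StackWalker.Option.SHOW_HIDDEN_FRAMES`.

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

final class HiddenFramesSketch {
    // Collects the class name of every frame on the current stack.
    static List<String> frames(boolean showHidden) {
        StackWalker walker = showHidden
            ? StackWalker.getInstance(StackWalker.Option.SHOW_HIDDEN_FRAMES)
            : StackWalker.getInstance();
        return walker.walk(stream ->
            stream.map(StackWalker.StackFrame::getClassName).collect(Collectors.toList()));
    }

    // Calls frames() through a method reference, so the JDK-generated class
    // that implements Function sits between this method and frames().
    static List<String> viaMethodReference(boolean showHidden) {
        Function<Boolean, List<String>> ref = HiddenFramesSketch::frames;
        return ref.apply(showHidden);
    }

    public static void main(String[] args) {
        System.out.println("default:     " + viaMethodReference(false));
        System.out.println("with hidden: " + viaMethodReference(true));
    }
}
```

On recent JDKs the second list is expected to contain an extra generated class between `viaMethodReference` and `frames`. A caller lookup that skips such frames can misattribute the call, which is the situation the commit addresses.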
David Turner
aa6e1ad8e3
Add comments pointing to Azure creds renewal docs (#127897)
These were some of the places I looked for information about renewal.
Leaving a hint for next time.
2025-05-08 19:10:53 +10:00
David Turner
85d9990d70
Replace auto-read with proper flow-control in HTTP pipeline (#127817)
Re-applying #126441 (cf. #127259) with:

- the extra `FlowControlHandler` needed to ensure one-chunk-per-read
  semantics (also present in #127259).

- no extra `read()` after exhausting a `Netty4HttpRequestBodyStream`
  (the bug behind #127391 and #127392).

See #127111 for related tests.
2025-05-08 17:35:10 +10:00
Mary Gouseti
077b6b949b
Skip the validation when retrieving the index mode during reindexing a time series data stream. (#127824)
During reindexing we retrieve the index mode from the template settings. However, we do not fully resolve the settings as we do when validating a template or when creating a data stream. This results in the error reported in #125607 being thrown.

I do not see a reason to not fix this as suggested in #125607 (comment).

Fixes: #125607
2025-05-08 10:25:53 +03:00
David Turner
d934a0c540
Reinstate use of S3 protocol client setting (#127744)
The `s3.client.CLIENT_NAME.protocol` setting became unused in #126843 as
it is inapplicable in the v2 SDK. However, the v2 SDK requires the
`s3.client.CLIENT_NAME.endpoint` setting to be a URL that includes a
scheme, so in #127489 we prepend a `https://` to the endpoint if needed.
This commit generalizes this slightly so that we prepend `http://` if
the endpoint has no scheme and the `.protocol` setting is set to `http`.
2025-05-07 10:02:22 +01:00
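The endpoint handling described in #127744 and #127489 amounts to a small normalization step. A minimal sketch follows, assuming a hypothetical helper `withScheme` and assuming that "has a scheme" can be detected by looking for `://`; the real code in the S3 repository module may decide this differently.

```java
import java.util.Locale;

final class EndpointSchemeSketch {
    // Returns an endpoint that always carries a scheme, as the v2 SDK requires.
    // `protocol` stands for the value of s3.client.CLIENT_NAME.protocol.
    static String withScheme(String endpoint, String protocol) {
        if (endpoint.contains("://")) {
            // Assumption: an explicit scheme in the endpoint is left untouched.
            return endpoint;
        }
        // Prepend http:// only when the protocol setting says so, else https://.
        String scheme = "http".equals(protocol.toLowerCase(Locale.ROOT)) ? "http" : "https";
        return scheme + "://" + endpoint;
    }

    public static void main(String[] args) {
        System.out.println(withScheme("s3.example.com", "https"));
        System.out.println(withScheme("localhost:9000", "http"));
    }
}
```

With this shape the `.protocol` setting only matters for scheme-less endpoints, which matches the commit's description of the change as a slight generalization.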
David Turner
9765251cd3
Improve Netty4IncrementalRequestHandlingIT (#127768)
* Avoid time-based expiry of channel stats or else `testHttpClientStats`
  will fail if running multiple iterations for more than 5m.

* Assert all bytes received in `testHttpClientStats`.
2025-05-07 06:54:36 +10:00
Ryan Ernst
b78ac7c94c
Remove PrivilegedOperations (#127726)
With the SecurityManager gone, the PrivilegedOperations class is no
longer needed, these operations can be called directly.
2025-05-06 10:50:49 -07:00
Ryan Ernst
22a52a9c64
Remove security manager policy files (#127727)
Now that security manager is gone, the policy files are no longer
needed. This commit removes the server, test and plugin specific policy
files.
2025-05-06 19:37:46 +02:00
Rene Groeschke
a2e580fb60
Update Gradle wrapper to 8.14 (#126519)
* Fix PatternSetFactory incompatibility
* Update ospackage plugin
* Remove ambiguous method definitions
* Cleanup verification metadata
* Some cleanup on unused methods and attributes
2025-05-06 13:00:15 +02:00
Patrick Doyle
5df5cb890e
Propagate file settings health info to the health node (#127397)
* Initial testHealthIndicator that fails

* Refactor: FileSettingsHealthInfo record

* Propagate file settings health indicator to health node

* ensureStableCluster

* Try to induce a failure from returning node-local info

* Remove redundant node from client() call

* Use local node ID in UpdateHealthInfoCacheAction.Request

* Move logger to top

* Test node-local health on master and health nodes

* Fix calculate to use the given info

* mutateFileSettingsHealthInfo

* Test status from local current info

* FileSettingsHealthTracker

* Spruce up HealthInfoTests

* spotless

* randomNonNegativeLong

* Rename variable

Co-authored-by: Niels Bauman <33722607+nielsbauman@users.noreply.github.com>

* Address Niels' comments

* Test one- and two-node clusters

* [CI] Auto commit changes from spotless

* Ensure there's a master node

Co-authored-by: Niels Bauman <33722607+nielsbauman@users.noreply.github.com>

* setBootstrapMasterNodeIndex

---------

Co-authored-by: Niels Bauman <33722607+nielsbauman@users.noreply.github.com>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
2025-05-05 16:39:28 +02:00
Nick Tindall
8dc6bf8893
Default S3 endpoint scheme to HTTPS when not specified (#127489) 2025-05-05 18:47:23 +10:00
Mary Gouseti
ba49d48203
Add rest API capability for failures default retention (#127674)
This PR is adding the API capability to ensure that the API tests that
check for the default failures retention will only be executed when the
version supports this. This was missed in the original PR
(https://github.com/elastic/elasticsearch/pull/127573).
2025-05-04 00:51:37 +10:00
Mary Gouseti
fe36c42eee
[Failure store] Introduce default retention for failure indices (#127573)
We introduce a new global retention setting `data_streams.lifecycle.retention.failures_default` which is used by the data stream lifecycle management as the default retention when the failure store lifecycle of the data stream does not specify one.

Elasticsearch comes with the default value of 30 days. The value can be changed via the settings API to any time value higher than 10 seconds or -1 to indicate no default retention should apply.

The failures default retention can be set to values higher than the max retention, but then the max retention will be effective. The reason for this choice is to ensure that no deployments will be broken if the user has already set up a max retention of less than 30 days.
2025-05-03 15:50:22 +03:00
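The interaction between the failures default retention and the max retention in #127573 can be sketched as a small function. The class and method names are hypothetical, and an empty `Optional` stands in for "not set" (the `-1` case for the default). Only the rule stated in the commit message is modelled; how an explicitly configured retention on the data stream is treated is left out.

```java
import java.time.Duration;
import java.util.Optional;

final class FailuresDefaultRetentionSketch {
    // The value Elasticsearch ships with, per the commit message.
    static final Duration SHIPPED_DEFAULT = Duration.ofDays(30);

    // Returns the default retention that actually applies to failure indices
    // when the failure store lifecycle does not specify one.
    static Optional<Duration> effectiveDefault(Optional<Duration> failuresDefault,
                                               Optional<Duration> maxRetention) {
        if (failuresDefault.isEmpty()) {
            return Optional.empty(); // -1: no default retention applies
        }
        Duration configured = failuresDefault.get();
        if (maxRetention.isPresent() && configured.compareTo(maxRetention.get()) > 0) {
            return maxRetention; // the max retention wins when it is lower
        }
        return failuresDefault;
    }
}
```

For example, a deployment that already has a max retention of 7 days keeps an effective default of 7 days, even though the shipped default is 30 days.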
Ankita Kumar
084542a690
Account for time taken to write index buffers in IndexingMemoryController (#126786)
This PR adds to the indexing write load the time taken to flush write indexing buffers using the indexing threads (this is done here to push back on indexing).

This changes the semantics of InternalIndexingStats#recentIndexMetric and InternalIndexingStats#peakIndexMetric to more accurately account for load on the indexing thread. Addresses ES-11356.
2025-05-01 16:56:14 -04:00
Alexey Ivanov
d362fb337a
New per-project only settings can be defined and used by components (#127280)
This change introduces Settings to ProjectMetadata and adds project scope support for Setting.

For now, project-scoped settings are independent from cluster settings and do not fall back to cluster-level settings.
Also, setting update consumers do not yet work correctly for project-scoped settings. These issues will be addressed separately in future PRs.
2025-05-01 00:22:04 +02:00
Oleksandr Kolomiiets
0c1b3acee2
Properly handle multi fields in block loaders with synthetic source enabled (#127483) 2025-04-30 09:33:35 -07:00
Mary Gouseti
03d77816cf
[Failure store] Introduce dedicated failure store lifecycle configuration (#127314)
The failure store is a set of data stream indices that are used to store certain types of ingestion failures. Until now they shared the configuration of the backing indices. We understand that the two data sets have different lifecycle needs.

We believe that failures will typically need to be retained for much less time than the data. Considering this, we believe the lifecycle needs of the failures are also more limited and better fit the simplicity of the data stream lifecycle feature.

This allows the user to only set the desired retention and we will perform the rollover and other maintenance tasks without the user having to think about them. Furthermore, having only one lifecycle management feature allows us to ensure that this data is managed by default.

This PR introduces the following:

Configuration

We extend the failure store configuration to allow lifecycle configuration too. This configuration reflects only the user's configuration, as shown below:

PUT _data_stream/*/options
{
  "failure_store": {
     "lifecycle": {
       "data_retention": "5d"
     }
  }
}

GET _data_stream/*/options

{
  "data_streams": [
    {
      "name": "my-ds",
      "options": {
        "failure_store": {
          "lifecycle": {
            "data_retention": "5d"
          }
        }
      }
    }
  ]
}
To retrieve the effective configuration you need to use the GET data streams API, see #126668

Functionality

The data stream lifecycle (DLM) will manage the failure indices regardless of whether the failure store is enabled. This will ensure that if the failure store gets disabled we will not have stagnant data.
The data stream options APIs reflect only the user's configuration.
The GET data stream API should be used to check the current state of the effective failure store configuration.
Telemetry
We extend the data stream failure store telemetry to also include the lifecycle telemetry.

{
  "data_streams": {
    "available": true,
    "enabled": true,
    "data_streams": 10,
    "indices_count": 50,
    "failure_store": {
      "explicitly_enabled_count": 1,
      "effectively_enabled_count": 15,
      "failure_indices_count": 30,
      "lifecycle": {
        "explicitly_enabled_count": 5,
        "effectively_enabled_count": 20,
        "data_retention": {
          "configured_data_streams": 5,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "effective_retention": {
          "retained_data_streams": 20,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "global_retention": {
          "max": {
            "defined": false
          },
          "default": {
            "defined": true,  <------ this is the default value applicable for the failure store
            "millis": X
          }
        }
      }
    }
  }
}
Implementation details

We ensure that a partially reset failure store still produces a valid failure store configuration.
We ensure that when a node communicates with a node on a previous version, it does not send the invalid failure store configuration `enabled: null`.
2025-04-30 18:22:06 +03:00
Keith Massey
23b7a31406
Fixing DataStream::getEffectiveSettings for component templates (#127515) 2025-04-29 20:31:54 +02:00
Benjamin Trent
3d67e0e7ca
Fix npe when using source confirmed text query against missing field (#127414)
We should check for the field and statistics actually existing when
checking matches and explanation with `match_only_text` fields

closes: https://github.com/elastic/elasticsearch/issues/125635
2025-04-30 03:05:01 +10:00
Niels Bauman
fd93fad994
Remove test usages of getDefaultBackingIndexName in DS and LogsDB tests (#127384)
We replace usages of time sensitive
`DataStream#getDefaultBackingIndexName` with the retrieval of the name
via an API call. The problem with using the time sensitive method is
that we can have test failures around midnight.

Relates #123376
2025-04-29 14:48:37 +02:00
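The midnight problem mentioned in #127384 comes from the date embedded in the default backing index name. The sketch below is not `DataStream#getDefaultBackingIndexName` itself: the helper `nameAt` is hypothetical, and the `.ds-<data stream>-<yyyy.MM.dd>-<generation>` layout is assumed from the documented naming convention rather than taken from this commit.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class BackingIndexNameSketch {
    private static final DateTimeFormatter DATE =
        DateTimeFormatter.ofPattern("uuuu.MM.dd", Locale.ROOT).withZone(ZoneOffset.UTC);

    // Builds a backing index name for the given instant (assumed name layout).
    static String nameAt(String dataStream, long generation, Instant now) {
        return String.format(Locale.ROOT, ".ds-%s-%s-%06d", dataStream, DATE.format(now), generation);
    }

    public static void main(String[] args) {
        // Index created just before midnight, name computed by the test just after.
        Instant created = Instant.parse("2025-04-28T23:59:59Z");
        Instant asserted = Instant.parse("2025-04-29T00:00:01Z");
        System.out.println(nameAt("logs", 1, created));
        System.out.println(nameAt("logs", 1, asserted));
    }
}
```

A test that computes the expected name on its own clock can disagree with the name the cluster actually chose two seconds earlier, which is why the commit switches to reading the name back through an API call.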
Keith Massey
bdb70c03ee
Adding transport actions for getting and updating data stream settings (#127417) 2025-04-28 10:46:20 -05:00
Keith Massey
7ddc8d9e7e
Using DataStream::getEffectiveSettings (#127282) 2025-04-25 14:40:37 -05:00
David Turner
5c753a81d2
S3BlobContainer: Revert broadened exception handler again (#127405)
Catching `Exception` instead of `SdkException` in `copyBlob` and
`executeMultipart` led to failures in `S3RepositoryAnalysisRestIT` due
to the injected exceptions getting wrapped in `IOExceptions` that
prevented them from being caught and handled in `BlobAnalyzeAction`.

Repeat of #126731, regressed due to #126843
Closes #127399
2025-04-25 12:11:46 -07:00
Oleksandr Kolomiiets
26e2261132
Remove legacy block loader test infrastructure (#127273) 2025-04-25 10:26:27 -07:00
David Turner
6f622e813c
Revert "Replace auto-read with proper flow-control in HTTP pipeline (#127259)" (#127403)
This reverts commit 3cf70614b8 and unmutes
the associated tests

Closes #127391
Closes #127392
2025-04-25 17:52:34 +01:00
Keith Massey
3f736a7826
Updating tika to 2.9.3 (#127353) 2025-04-25 08:43:26 -05:00
Niels Bauman
c72d00fd39
Don't start a new node in InternalTestCluster#getClient (#127318)
This method would default to starting a new node when the cluster was
empty. This is pretty trappy as `getClient()` (or things like
`getMaster()` that depend on `getClient()`) don't look at all like
something that would start a new node.

In any case, the intention of tests is much clearer when they explicitly
define a cluster configuration.
2025-04-25 10:07:52 +02:00
David Turner
15b6e85400
Skip region validation in S3BlobStoreRepositoryTests (#127372)
Today these tests assert that the requests received by the handler are
signed in region `us-east-1` with no region specified, but in fact when
running in EC2 the SDK will pick up the actual region which may be
different. This commit skips this region validation for now (it is
tested elsewhere).
2025-04-25 08:55:47 +01:00
David Turner
3cf70614b8
Replace auto-read with proper flow-control in HTTP pipeline (#127259)
Re-applying #126441 with the extra `FlowControlHandler` needed to ensure
one-chunk-per-read semantics - see #127111 for related tests.
2025-04-25 07:49:20 +01:00