This changes the default value of both the
`data_streams.auto_sharding.increase_shards.load_metric` and
`data_streams.auto_sharding.decrease_shards.load_metric` cluster
settings from `PEAK` to `ALL_TIME`. These values have already been applied via config for several weeks.
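The previous behaviour can still be pinned explicitly. A sketch, assuming these remain dynamically updatable cluster settings (the exact update mechanism depends on your deployment):
```
PUT _cluster/settings
{
  "persistent": {
    "data_streams.auto_sharding.increase_shards.load_metric": "PEAK",
    "data_streams.auto_sharding.decrease_shards.load_metric": "PEAK"
  }
}
```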
The approach taken to updating the tests was to swap the values given for the all-time and peak loads in all the stats objects provided as input to the tests, and to swap the enum values in the couple of places they appear.
During reindexing we retrieve the index mode from the template settings. However, we do not fully resolve the settings as we do when validating a template or when creating a data stream. This results in the error reported in #125607.
I do not see a reason not to fix this as suggested in #125607 (comment).
Fixes: #125607
* Initial testHealthIndicator that fails
* Refactor: FileSettingsHealthInfo record
* Propagate file settings health indicator to health node
* ensureStableCluster
* Try to induce a failure from returning node-local info
* Remove redundant node from client() call
* Use local node ID in UpdateHealthInfoCacheAction.Request
* Move logger to top
* Test node-local health on master and health nodes
* Fix calculate to use the given info
* mutateFileSettingsHealthInfo
* Test status from local current info
* FileSettingsHealthTracker
* Spruce up HealthInfoTests
* spotless
* randomNonNegativeLong
* Rename variable
* Address Niels' comments
* Test one- and two-node clusters
* [CI] Auto commit changes from spotless
* Ensure there's a master node
* setBootstrapMasterNodeIndex
---------
Co-authored-by: Niels Bauman <33722607+nielsbauman@users.noreply.github.com>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
This PR adds the API capability to ensure that the API tests that
check for the default failures retention are only executed when the
version supports it. This was missed in the original PR
(https://github.com/elastic/elasticsearch/pull/127573).
We introduce a new global retention setting `data_streams.lifecycle.retention.failures_default` which is used by the data stream lifecycle management as the default retention when the failure store lifecycle of the data stream does not specify one.
Elasticsearch ships with a default value of 30 days. The value can be changed via the settings API to any time value higher than 10 seconds, or to -1 to indicate that no default retention should apply.
The failures default retention can be set to a value higher than the max retention, but then the max retention will be effective. The reason for this choice is to ensure that no deployments will be broken if the user has already set up a max retention of less than 30 days.
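For example, a sketch of lowering the default via the cluster settings API (the setting name comes from this PR; the `7d` value is illustrative):
```
PUT _cluster/settings
{
  "persistent": {
    "data_streams.lifecycle.retention.failures_default": "7d"
  }
}
```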
This PR adds to the indexing write load the time taken to flush write indexing buffers on the indexing threads (flushing is done there to push back on indexing).
This changes the semantics of InternalIndexingStats#recentIndexMetric and InternalIndexingStats#peakIndexMetric to more accurately account for load on the indexing thread. Addresses ES-11356.
The failure store is a set of data stream indices that are used to store certain types of ingestion failures. Until now they shared the configuration of the backing indices. We understand that the two data sets have different lifecycle needs.
We believe that failures typically need to be retained for much less time than the data. Considering this, the lifecycle needs of the failures are also more limited and fit better with the simplicity of the data stream lifecycle feature.
This allows the user to only set the desired retention, and we will perform the rollover and other maintenance tasks without the user having to think about them. Furthermore, having only one lifecycle management feature allows us to ensure that this data is managed by default.
This PR introduces the following:
Configuration
We extend the failure store configuration to also allow lifecycle configuration. This reflects only the user's configuration, as shown below:
```
PUT _data_stream/*/options
{
  "failure_store": {
    "lifecycle": {
      "data_retention": "5d"
    }
  }
}
```
```
GET _data_stream/*/options
{
  "data_streams": [
    {
      "name": "my-ds",
      "options": {
        "failure_store": {
          "lifecycle": {
            "data_retention": "5d"
          }
        }
      }
    }
  ]
}
```
To retrieve the effective configuration, you need to use the GET data streams API; see #126668.
Functionality
The data stream lifecycle (DLM) will manage the failure indices regardless of whether the failure store is enabled or not. This ensures that if the failure store gets disabled, we will not have stagnant data.
The data stream options APIs reflect only the user's configuration.
The GET data stream API should be used to check the current state of the effective failure store configuration.
Telemetry
We extend the data stream failure store telemetry to also include the lifecycle telemetry.
```
{
  "data_streams": {
    "available": true,
    "enabled": true,
    "data_streams": 10,
    "indices_count": 50,
    "failure_store": {
      "explicitly_enabled_count": 1,
      "effectively_enabled_count": 15,
      "failure_indices_count": 30,
      "lifecycle": {
        "explicitly_enabled_count": 5,
        "effectively_enabled_count": 20,
        "data_retention": {
          "configured_data_streams": 5,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "effective_retention": {
          "retained_data_streams": 20,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "global_retention": {
          "max": {
            "defined": false
          },
          "default": {
            "defined": true, <------ this is the default value applicable for the failure store
            "millis": X
          }
        }
      }
    }
  }
}
```
Implementation details
We ensure that a partially reset failure store still results in a valid failure store configuration.
We ensure that when a node communicates with a node on a previous version, it does not send an invalid failure store configuration of `enabled: null`.
We replace usages of the time-sensitive
`DataStream#getDefaultBackingIndexName` with retrieval of the name
via an API call. The problem with using the time-sensitive method is
that we can have test failures around midnight.
Relates #123376
This method would default to starting a new node when the cluster was
empty. This is pretty trappy as `getClient()` (or things like
`getMaster()` that depend on `getClient()`) don't look at all like
something that would start a new node.
In any case, the intention of tests is much clearer when they explicitly
define a cluster configuration.
To retrieve the effective configuration, you need to use the `GET` data
streams API. For example, if a data stream has empty data stream
options, it might still have the failure store enabled from a cluster
setting. The failure store is managed by default with a lifecycle with
infinite (for now) retention, so the response will look like this:
```
GET _data_stream/*
{
"data_streams": [
{
"name": "my-data-stream",
"timestamp_field": {
"name": "@timestamp"
},
.....
"failure_store": {
"enabled": true,
"lifecycle": {
"enabled": true
},
"rollover_on_write": false,
"indices": [
{
"index_name": ".fs-my-data-stream-2099.03.08-000003",
"index_uuid": "PA_JquKGSiKcAKBA8DJ5gw",
"managed_by": "Data stream lifecycle"
}
]
}
},...
]
```
In case there is a failure index managed by ILM, the failure index info
will be displayed as follows.
```
{
"index_name": ".fs-my-data-stream-2099.03.08-000002",
"index_uuid": "PA_JquKGSiKcAKBA8DJ5gw",
"prefer_ilm": true,
"ilm_policy": "my-lifecycle-policy",
"managed_by": "Index Lifecycle Management"
}
```
The class `DataStreamLifecycle` currently captures the lifecycle
configuration that manages all data stream indices, but soon
enough it will be split into two variants, the data and the failures
lifecycle.
Some pre-work has been done already but as we are progressing in our
POC, we see that it will be really useful if the `DataStreamLifecycle`
is "aware" of the target index component. This will allow us to
correctly apply global retention or to throw an error if a downsampling
configuration is provided to a failure lifecycle.
In this PR, we perform a small refactoring to reduce the noise in
https://github.com/elastic/elasticsearch/pull/125658. Here we introduce
the following:
- A factory method that creates a data lifecycle; for now it's trivial, but it will become more useful soon.
- We rename the "empty" builder to explicitly mention the index component it refers to.
Adds a node feature that is conditionally added to the cluster state if the failure store
feature flag is enabled. Requires all nodes in the cluster to have the node feature
present in order to redirect failed documents to the failure store from the ingest node
or from shard level bulk failures.
If updating the `index.time_series.end_time` fails for one data stream,
then UpdateTimeSeriesRangeService should continue updating this setting for other data streams.
The following error was observed in the wild:
```
[2025-04-07T08:50:39,698][WARN ][o.e.d.UpdateTimeSeriesRangeService] [node-01] failed to update tsdb data stream end times
java.lang.IllegalArgumentException: [index.time_series.end_time] requires [index.mode=time_series]
at org.elasticsearch.index.IndexSettings$1.validate(IndexSettings.java:636) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.index.IndexSettings$1.validate(IndexSettings.java:619) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.common.settings.Setting.get(Setting.java:563) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.common.settings.Setting.get(Setting.java:535) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.datastreams.UpdateTimeSeriesRangeService.updateTimeSeriesTemporalRange(UpdateTimeSeriesRangeService.java:111) ~[?:?]
at org.elasticsearch.datastreams.UpdateTimeSeriesRangeService$UpdateTimeSeriesExecutor.execute(UpdateTimeSeriesRangeService.java:210) ~[?:?]
at org.elasticsearch.cluster.service.MasterService.innerExecuteTasks(MasterService.java:1075) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:1038) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.cluster.service.MasterService.executeAndPublishBatch(MasterService.java:245) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.cluster.service.MasterService$BatchingTaskQueue$Processor.lambda$run$2(MasterService.java:1691) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.action.ActionListener.run(ActionListener.java:452) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.cluster.service.MasterService$BatchingTaskQueue$Processor.run(MasterService.java:1688) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.cluster.service.MasterService$5.lambda$doRun$0(MasterService.java:1283) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.action.ActionListener.run(ActionListener.java:452) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.cluster.service.MasterService$5.doRun(MasterService.java:1262) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1023) ~[elasticsearch-8.17.3.jar:?]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27) ~[elasticsearch-8.17.3.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
at java.lang.Thread.run(Thread.java:1575) ~[?:?]
```
This resulted in the `index.time_series.end_time` index setting not being updated for any data stream. That in turn caused data loss, as metrics couldn't be indexed because no suitable backing index could be resolved:
```
the document timestamp [2025-03-26T15:26:10.000Z] is outside of ranges of currently writable indices [[2025-01-31T07:22:43.000Z,2025-02-15T07:24:06.000Z][2025-02-15T07:24:06.000Z,2025-03-02T07:34:07.000Z][2025-03-02T07:34:07.000Z,2025-03-10T12:45:37.000Z][2025-03-10T12:45:37.000Z,2025-03-10T14:30:37.000Z][2025-03-10T14:30:37.000Z,2025-03-25T12:50:40.000Z][2025-03-25T12:50:40.000Z,2025-03-25T14:35:40.000Z
```
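A minimal sketch of the shape of the fix (hypothetical helper names, not the actual `UpdateTimeSeriesRangeService` code): a failure for one data stream is logged and skipped so the remaining data streams still get their `index.time_series.end_time` updated.
```java
import java.util.List;
import java.util.function.Consumer;

public final class ContinueOnFailureSketch {

    /** Updates every data stream; one bad data stream no longer aborts the whole batch. */
    static void updateAll(List<String> dataStreams, Consumer<String> updateEndTime) {
        for (String dataStream : dataStreams) {
            try {
                updateEndTime.accept(dataStream); // e.g. bump index.time_series.end_time
            } catch (Exception e) {
                // previously an exception here failed the whole batch; now we log and move on
                System.err.println("failed to update end time for [" + dataStream + "]: " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        updateAll(List.of("good-ds", "broken-ds", "another-ds"), name -> {
            if (name.startsWith("broken")) {
                throw new IllegalArgumentException("[index.time_series.end_time] requires [index.mode=time_series]");
            }
        });
    }
}
```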
I noticed that we tend to create the flag instance and call this method
everywhere. This doesn't compile the same way as a real boolean constant
unless you're running with `-XX:+TrustFinalNonStaticFields`.
For most of the code spots changed here that's irrelevant, but the
usage in the mapper parsing code is a little hot and potentially gets
a small speedup from this.
Also we're simply wasting some bytes for the static footprint of ES by
using the `FeatureFlag` indirection instead of just a boolean.
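Roughly, the pattern being replaced and its replacement (illustrative names only; the real `FeatureFlag` plumbing differs): hoisting the check into a `static final boolean` lets the JIT constant-fold it instead of re-reading a field and calling a method at every use site.
```java
/** Illustrative sketch of swapping a FeatureFlag indirection for a boolean constant. */
public final class ExampleFeature {

    // Before (sketch): a flag object whose isEnabled() result the JIT does not treat as a
    // constant unless -XX:+TrustFinalNonStaticFields is in effect.
    // private static final FeatureFlag EXAMPLE_FLAG = new FeatureFlag("example");

    // After (sketch): evaluated once at class initialisation; static final fields are trusted
    // by the JIT, so checks against this compile like a real boolean constant.
    public static final boolean EXAMPLE_FEATURE_ENABLED =
        Boolean.parseBoolean(System.getProperty("es.example_feature_flag_enabled", "false"));

    private ExampleFeature() {}
}
```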
In this PR we add support for the failure store for system data streams.
Specifically:
- We pass the system descriptor so the failure index can be created based on that.
- We extend the tests to ensure it works
- We remove a guard we had but I wasn't able to test it because it only gets triggered if the data stream gets created right after a failure in the ingest pipeline, and I didn't see how to add one (yet).
- We extend the system data stream migration to ensure this is also working.
This commit adds support for system data streams reindexing. The system data stream migration extends the existing system indices migration task and uses the data stream reindex API.
The system index migration task starts a reindex data stream task and tracks its status every second. Only one system index or system data stream is migrated at a time. If a data stream migration fails, the entire system index migration task will also fail.
Port of #123926
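A rough sketch of the polling loop described above (hypothetical helpers, not the actual system index migration task code): start the data stream reindex, check its status once a second, and fail the whole migration if the data stream migration fails.
```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public final class DataStreamMigrationSketch {

    enum Status { RUNNING, COMPLETED, FAILED }

    /** Migrates a single system data stream; only one migration runs at a time. */
    static void migrate(String dataStream, Runnable startReindex, Supplier<Status> pollStatus) throws InterruptedException {
        startReindex.run();
        while (true) {
            Status status = pollStatus.get();
            if (status == Status.COMPLETED) {
                return; // move on to the next system index or system data stream
            }
            if (status == Status.FAILED) {
                throw new IllegalStateException("migration of [" + dataStream + "] failed; failing the migration task");
            }
            TimeUnit.SECONDS.sleep(1); // status is tracked every second
        }
    }
}
```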
When creating an empty lifecycle we used to use the default
constructor. This change is not just an efficiency improvement; it will
also allow us to separate the default data and failures lifecycles in
the future.
Transport actions have associated request and response classes. However,
the base type restrictions are not necessary to duplicate when creating
a map of transport actions. Relatedly, the ActionHandler class doesn't
actually need strongly typed action type and classes since they are lost
when shoved into the node client map. This commit removes these type
restrictions and generic parameters.
This test had a copy paste mistake. When the cluster has only one data
node the replicas cannot be assigned so we end up with a force merge
error. In the case of the failure store this was not asserted correctly.
On the other hand, this test only checked for the existence of an error
and it was not ensuring that the current error is not the rollover error
that should have recovered. We make this test a bit more explicit.
Fixes: https://github.com/elastic/elasticsearch/issues/126252
**Issue** The data stream lifecycle does not correctly register rollover
errors for the failure store.
**Observed behaviour** When the data stream lifecycle encounters a rollover
error it records it, unless it sees that the current write index of this
data stream doesn't match the source index of the request. However, the
write index check does not use the failure write index but the write
backing index, so the failure gets ignored.
**Desired behaviour** When the data stream lifecycle encounters a rollover
error, it will check the relevant write index before it determines
whether the error should be recorded or not.
In this PR we introduce the data stream API in the `es-rest-api` using
feature flag support. This enables us to use `yamlRestTests`
instead of `javaRestTests`.
Now that all actions that DLM depends on are project-aware, we can make DLM itself project-aware.
There still exists only one instance of `DataStreamLifecycleService`, it just loops over all the projects - which matches the approach we've taken for similar scenarios thus far.
This tracks the highest value seen for the recent write load metric
any time the stats for a shard are computed, exposes this value
alongside the recent value, and persists it in index metadata
as well.
The new test in `IndexShardTests` is designed to more thoroughly test
the recent write load metric previously added, as well as to test the
peak metric being added here.
ES-10037 #comment Added peak load metric in https://github.com/elastic/elasticsearch/pull/125521
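A minimal sketch of the tracking idea (hypothetical class, not the actual `IndexingStats` code): each time shard stats are computed, the freshly sampled recent write load is folded into a running maximum that is exposed and persisted alongside it.
```java
/** Minimal sketch: remembers the highest recent-write-load sample seen so far. */
public final class PeakWriteLoadSketch {

    private double peak;

    /** Called with the recent write load each time shard stats are computed. */
    public synchronized double updateAndGetPeak(double recentWriteLoad) {
        if (recentWriteLoad > peak) {
            peak = recentWriteLoad;
        }
        return peak;
    }
}
```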
* Specify index component when retrieving lifecycle
* Add getters for the failure lifecycle
* Conceptually introduce the failure store lifecycle (even though for now it's the same)
These tests had the potential to fail when two consecutive GET data
streams requests would hit two different nodes, where one node already
had the cluster state that contained the new backing index and the other
node didn't yet.
Caused by #122852
Fixes #124882
Fixes #124885
These tests had the potential to fail when two consecutive GET data
streams requests would hit two different nodes, where one node already
had the cluster state that contained the new backing index and the other
node didn't yet.
Caused by #122852
Fixes #124846
Fixes #124950
Fixes #124999
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.
Relates #101805
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.
Relates #101805
This PR updates the different methods in TestProjectResolvers so that
their names are more accurate and their behaviours are more as expected.
For example, in MP-1749, we differentiate between single-project and
single-project-only resolvers. The latter should not support multi-project.
This is part of the work to make DLM project-aware.
These two features were pretty tightly coupled, so I saved some effort
by combining them in one PR.
This uses the recently-added `ExponentiallyWeightedMovingRate` class
to calculate a write load which favours more recent load, and includes
this alongside the existing unweighted all-time write load in
`IndexingStats.Stats`.
As of this change, the new load metric is not used anywhere, although
it can be retrieved with the index stats or node stats APIs.
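As a rough illustration only (names and normalisation are assumptions, not the real `ExponentiallyWeightedMovingRate` class), a decaying-rate calculation of this flavour could look like:
```java
/**
 * Sketch of an exponentially weighted moving rate: recorded work decays with a fixed
 * half-life, so recent indexing activity dominates the reported write load.
 */
public final class EwmRateSketch {

    private final double halfLifeNanos;
    private double weightedTotal; // decayed sum of recorded work, e.g. indexing time in nanos
    private long lastUpdateNanos;

    public EwmRateSketch(double halfLifeNanos, long nowNanos) {
        this.halfLifeNanos = halfLifeNanos;
        this.lastUpdateNanos = nowNanos;
    }

    /** Record an amount of work observed at {@code nowNanos}. */
    public void add(double amount, long nowNanos) {
        decayTo(nowNanos);
        weightedTotal += amount;
    }

    /** Write load estimate that favours recent activity over older activity. */
    public double rate(long nowNanos) {
        decayTo(nowNanos);
        return weightedTotal / halfLifeNanos; // normalising by the half-life is an arbitrary choice here
    }

    private void decayTo(long nowNanos) {
        double elapsed = nowNanos - lastUpdateNanos;
        weightedTotal *= Math.pow(0.5, elapsed / halfLifeNanos);
        lastUpdateNanos = nowNanos;
    }
}
```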
This makes the use of usesDefaultDistribution in our test setup explicit by requiring a reason why it's needed.
This is helpful as part of revisiting the need for all those usages in our code base.
This field is only used (by security) for requests; having it in responses is redundant.
Also, we have a couple of responses that are singletons/quasi-enums where setting the value
needlessly might introduce some strange contention even though it's a plain store.
This isn't just a cosmetic change. It makes it clear at compile time that each response instance
is exclusively defined by the bytes that it is read from. This makes it easier to reason about the
validity of suggested optimizations like https://github.com/elastic/elasticsearch/pull/120010
This PR adds a new MetadataDeleteDataStreamService that allows us to delete system data streams prior to a restore operation. This fixes a bug where system data streams were previously un-restorable.
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.
Relates #101805