Commit graph

674 commits

Author SHA1 Message Date
Keith Massey
528bd9c234
Adding mappings to data streams (#129787) 2025-06-25 15:03:28 -05:00
Keith Massey
0b58a53a98
Adding the ability to unset data stream settings (#129677) 2025-06-24 10:30:15 -05:00
Niels Bauman
bae6e3c66d
Fix data stream stats YAML test (#129813)
Occasional shard allocation issues were causing the YAML tests to fail
because the shard that had the document with the max timestamp in it
would be unavailable.

Fixes #118217
2025-06-24 01:22:48 +10:00
Pooya Salehi
a229c8d932
Integrate project global blocks into existing checks for cluster blocks (Part 2) (#129570)
Relates https://github.com/elastic/elasticsearch/pull/129467
Resolves ES-11209
2025-06-18 19:51:40 +10:00
Keith Massey
90c24d06e7
Fixing for loop in DataStreamSettingsIT (#129564) 2025-06-18 05:30:16 +10:00
Niels Bauman
9f78d11639
Remove usages of DataStream#getDefaultBackingIndexName (#129466)
These usages had the potential to cause test failures when a data
stream was created just before midnight and the backing index name
generation ran just after it the next day - two moments only
milliseconds apart. To avoid these failures, we update the tests to be
robust to these time differences.

Resolves #123376
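The time sensitivity comes from the date embedded in default backing index names. A minimal Python sketch of the convention (the `.ds-<data-stream>-<yyyy.MM.dd>-<generation>` pattern is inferred from index names shown elsewhere in this log; the helper name is illustrative, not the actual implementation):

```python
from datetime import datetime, timezone

def default_backing_index_name(data_stream: str, generation: int, now: datetime) -> str:
    # Date-stamped naming convention: the current day is baked into the name.
    return f".ds-{data_stream}-{now.strftime('%Y.%m.%d')}-{generation:06d}"

# Two computations only milliseconds apart can disagree if they straddle midnight (UTC):
before = datetime(2025, 6, 16, 23, 59, 59, 999000, tzinfo=timezone.utc)
after = datetime(2025, 6, 17, 0, 0, 0, 1000, tzinfo=timezone.utc)
assert default_backing_index_name("my-ds", 1, before) != default_backing_index_name("my-ds", 1, after)
```

Retrieving the name via an API call instead sidesteps the race entirely.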
2025-06-17 11:02:52 -03:00
Niels Bauman
398da36f49
Make use of new projectClient method and remove old one (#129393)
We added a new `projectClient` method on `Client` in #129174. We now
update the usages of the old method (on `ProjectResolver`) to use the
new one and we delete the old method.
2025-06-17 13:39:04 +02:00
Pooya Salehi
d3d10c2efb
Integrate project global blocks into existing checks for cluster blocks (#129467)
This PR updates some of the existing checks for cluster blocks to also consider project global blocks. It adds a few project-aware flavours of existing methods in `ClusterBlocks`; `globalBlockedException(ClusterBlockLevel)` is the most used one. I've updated only some of the obvious ones here.
Follow up to https://github.com/elastic/elasticsearch/pull/127978
Relates ES-11209
2025-06-17 12:36:59 +02:00
Yang Wang
adf4d1005f
Setting for estimated shard heap allocation decider (#128722)
This PR adds a new setting to toggle the collection of shard heap
usage, as well as wiring `ShardHeapUsage` into `ClusterInfoSimulator`.

Relates: #128723
2025-06-17 13:28:00 +10:00
Mary Gouseti
9764730d49
Remove include_default query param from get data stream options. (#128730)
Initially we added the `include_defaults` parameter to the get data
stream options REST API because it was used in the lifecycle API;
however, we decided to simplify the API and not use it. We remove it
now before it gets adopted.
2025-06-03 18:15:42 +10:00
Keith Massey
dc2fbe19a6
Removing the data stream settings feature flag (#128594) 2025-05-29 09:50:14 -05:00
Keith Massey
41f186dca0
Adding prefer_ilm as a whitelisted data stream setting (#128375) 2025-05-27 15:42:08 -05:00
Keith Massey
7207692056
Adding dry_run mode for setting data stream settings (#128269) 2025-05-23 11:29:00 -05:00
Keith Massey
bc45087962
Adding rest actions to get and set data stream settings (#127858) 2025-05-21 12:17:56 -05:00
Pete Gillin
ca921a0c31
Flip default metric for data stream auto-sharding (#127930)
This changes the default value of both the
`data_streams.auto_sharding.increase_shards.load_metric` and
`data_streams.auto_sharding.decrease_shards.load_metric` cluster
settings from `PEAK` to `ALL_TIME`. This setting has been applied via
config for several weeks now.

The approach taken to updating the tests was to swap the values given for the all-time and peak loads in all the stats objects provided as input to the tests, and to swap the enum values in the couple of places they appear.
2025-05-12 14:32:41 +01:00
Mary Gouseti
077b6b949b
Skip the validation when retrieving the index mode during reindexing a time series data stream. (#127824)
During reindexing we retrieve the index mode from the template settings. However, we do not fully resolve the settings as we do when validating a template or when creating a data stream. This results in throwing the error reported in #125607.

I do not see a reason to not fix this as suggested in #125607 (comment).

Fixes: #125607
2025-05-08 10:25:53 +03:00
Patrick Doyle
5df5cb890e
Propagate file settings health info to the health node (#127397)
* Initial testHealthIndicator that fails

* Refactor: FileSettingsHealthInfo record

* Propagate file settings health indicator to health node

* ensureStableCluster

* Try to induce a failure from returning node-local info

* Remove redundant node from client() call

* Use local node ID in UpdateHealthInfoCacheAction.Request

* Move logger to top

* Test node-local health on master and health nodes

* Fix calculate to use the given info

* mutateFileSettingsHealthInfo

* Test status from local current info

* FileSettingsHealthTracker

* Spruce up HealthInfoTests

* spotless

* randomNonNegativeLong

* Rename variable

Co-authored-by: Niels Bauman <33722607+nielsbauman@users.noreply.github.com>

* Address Niels' comments

* Test one- and two-node clusters

* [CI] Auto commit changes from spotless

* Ensure there's a master node

Co-authored-by: Niels Bauman <33722607+nielsbauman@users.noreply.github.com>

* setBootstrapMasterNodeIndex

---------

Co-authored-by: Niels Bauman <33722607+nielsbauman@users.noreply.github.com>
Co-authored-by: elasticsearchmachine <infra-root+elasticsearchmachine@elastic.co>
2025-05-05 16:39:28 +02:00
Mary Gouseti
ba49d48203
Add rest API capability for failures default retention (#127674)
This PR is adding the API capability to ensure that the API tests that
check for the default failures retention will only be executed when the
version supports this. This was missed in the original PR
(https://github.com/elastic/elasticsearch/pull/127573).
2025-05-04 00:51:37 +10:00
Mary Gouseti
fe36c42eee
[Failure store] Introduce default retention for failure indices (#127573)
We introduce a new global retention setting `data_streams.lifecycle.retention.failures_default` which is used by the data stream lifecycle management as the default retention when the failure store lifecycle of the data stream does not specify one.

Elasticsearch comes with the default value of 30 days. The value can be changed via the settings API to any time value higher than 10 seconds or -1 to indicate no default retention should apply.

The failures default retention can be set to a value higher than the max retention, but then the max retention will be effective. The reason for this choice is to ensure that no deployments will be broken if the user has already set up a max retention of less than 30 days.
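The interplay of configured, default, and max retention described above can be sketched as follows (function and parameter names are illustrative, not the actual implementation; retention is modeled in whole days and `None` models -1 / unset):

```python
from typing import Optional

def effective_failures_retention_days(
    configured: Optional[int],        # retention from the data stream's failure lifecycle, if any
    failures_default: Optional[int],  # the failures_default setting; None models -1 (no default)
    max_retention: Optional[int],     # a global max retention, if one is set
) -> Optional[int]:
    # The default only applies when the failure lifecycle specifies no retention.
    retention = configured if configured is not None else failures_default
    if retention is None:
        return None  # no retention applies at all
    if max_retention is not None:
        retention = min(retention, max_retention)  # the max retention wins if exceeded
    return retention

# Shipped default of 30 days, capped by a stricter max retention:
assert effective_failures_retention_days(None, 30, 7) == 7
```

With the shipped 30-day default and no max retention, the effective retention is 30 days; an explicitly configured retention is used as-is.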
2025-05-03 15:50:22 +03:00
Ankita Kumar
084542a690
Account for time taken to write index buffers in IndexingMemoryController (#126786)
This PR adds to the indexing write load the time taken to flush write indexing buffers using the indexing threads (the flush is done there to push back on indexing).

This changes the semantics of `InternalIndexingStats#recentIndexMetric` and `InternalIndexingStats#peakIndexMetric` to more accurately account for load on the indexing thread. Addresses ES-11356.
2025-05-01 16:56:14 -04:00
Mary Gouseti
03d77816cf
[Failure store] Introduce dedicated failure store lifecycle configuration (#127314)
The failure store is a set of data stream indices that are used to store certain types of ingestion failures. Until now they shared the configuration of the backing indices, but we understand that the two data sets have different lifecycle needs.

We believe that failures will typically need to be retained for much less time than the data. Considering this, the lifecycle needs of the failures are also more limited and fit better with the simplicity of the data stream lifecycle feature.

This allows the user to only set the desired retention; we will perform the rollover and other maintenance tasks without the user having to think about them. Furthermore, having only one lifecycle management feature allows us to ensure that this data is managed by default.

This PR introduces the following:

Configuration

We extend the failure store configuration to also allow lifecycle configuration; this reflects only the user's configuration, as shown below:

PUT _data_stream/*/options
{
  "failure_store": {
     "lifecycle": {
       "data_retention": "5d"
     }
  }
}

GET _data_stream/*/options

{
  "data_streams": [
    {
      "name": "my-ds",
      "options": {
        "failure_store": {
          "lifecycle": {
            "data_retention": "5d"
          }
        }
      }
    }
  ]
}
To retrieve the effective configuration you need to use the GET data streams API, see #126668

Functionality

The data stream lifecycle (DLM) will manage the failure indices regardless of whether the failure store is enabled. This ensures that if the failure store gets disabled we will not have stagnant data.
The data stream options APIs reflect only the user's configuration.
The GET data stream API should be used to check the current state of the effective failure store configuration.
Telemetry
We extend the data stream failure store telemetry to also include the lifecycle telemetry.

{
  "data_streams": {
    "available": true,
    "enabled": true,
    "data_streams": 10,
    "indices_count": 50,
    "failure_store": {
      "explicitly_enabled_count": 1,
      "effectively_enabled_count": 15,
      "failure_indices_count": 30,
      "lifecycle": {
        "explicitly_enabled_count": 5,
        "effectively_enabled_count": 20,
        "data_retention": {
          "configured_data_streams": 5,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "effective_retention": {
          "retained_data_streams": 20,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "global_retention": {
          "max": {
            "defined": false
          },
          "default": {
            "defined": true,  <------ this is the default value applicable for the failure store
            "millis": X
          }
        }
      }
    }
  }
}
Implementation details

We ensure that a partially reset failure store will still produce a valid failure store configuration.
We ensure that when a node communicates with a node on a previous version, it will not send an invalid failure store configuration `enabled: null`.
2025-04-30 18:22:06 +03:00
Keith Massey
23b7a31406
Fixing DataStream::getEffectiveSettings for component templates (#127515) 2025-04-29 20:31:54 +02:00
Niels Bauman
fd93fad994
Remove test usages of getDefaultBackingIndexName in DS and LogsDB tests (#127384)
We replace usages of time sensitive
`DataStream#getDefaultBackingIndexName` with the retrieval of the name
via an API call. The problem with using the time sensitive method is
that we can have test failures around midnight.

Relates #123376
2025-04-29 14:48:37 +02:00
Keith Massey
bdb70c03ee
Adding transport actions for getting and updating data stream settings (#127417) 2025-04-28 10:46:20 -05:00
Keith Massey
7ddc8d9e7e
Using DataStream::getEffectiveSettings (#127282) 2025-04-25 14:40:37 -05:00
Niels Bauman
c72d00fd39
Don't start a new node in InternalTestCluster#getClient (#127318)
This method would default to starting a new node when the cluster was
empty. This is pretty trappy as `getClient()` (or things like
`getMaster()` that depend on `getClient()`) don't look at all like
something that would start a new node.

In any case, the intention of tests is much clearer when they explicitly
define a cluster configuration.
2025-04-25 10:07:52 +02:00
Keith Massey
ee2d2f313d
Adding settings to data streams (#126947) 2025-04-23 13:27:40 -05:00
Mary Gouseti
db2992f0f8
[Failure Store] Expose failure store lifecycle information via the GET data stream API (#126668)
To retrieve the effective configuration you need to use the `GET` data
streams API. For example, even if a data stream has empty data stream
options, it might still have the failure store enabled via a cluster
setting. The failure store is managed by default by a lifecycle with
infinite (for now) retention, so the response will look like this:

```
GET _data_stream/*
{
  "data_streams": [
    {
      "name": "my-data-stream",
      "timestamp_field": {
        "name": "@timestamp"
      },
      .....
      "failure_store": {
        "enabled": true,
        "lifecycle": {
          "enabled": true
        },
        "rollover_on_write": false,
        "indices": [
           {
            "index_name": ".fs-my-data-stream-2099.03.08-000003",
            "index_uuid": "PA_JquKGSiKcAKBA8DJ5gw",
            "managed_by": "Data stream lifecycle"
          }
        ]
      }
    },...
]
```

In case there is a failure index managed by ILM, the failure index info
will be displayed as follows.

```
      {
          "index_name": ".fs-my-data-stream-2099.03.08-000002",
          "index_uuid": "PA_JquKGSiKcAKBA8DJ5gw",
          "prefer_ilm": true,
          "ilm_policy": "my-lifecycle-policy",
          "managed_by": "Index Lifecycle Management"
        }
```
2025-04-23 23:44:46 +10:00
Niels Bauman
4207cee3eb
Rename data stream transport actions (#127222)
The new action names are more consistent with the rest of the codebase.
2025-04-23 12:40:38 +02:00
Mary Gouseti
b9917086e1
Create dedicated factory methods for data lifecycle (#126487)
The class `DataStreamLifecycle` currently captures the lifecycle
configuration that manages all data stream indices, but soon enough it
will be split into two variants, the data and the failures lifecycle.

Some pre-work has been done already, but as we progress in our POC we
see that it will be really useful if the `DataStreamLifecycle` is
"aware" of the target index component. This will allow us to correctly
apply global retention, or to throw an error if a downsampling
configuration is provided to a failure lifecycle.

In this PR, we perform a small refactoring to reduce the noise in
https://github.com/elastic/elasticsearch/pull/125658. Here we introduce
the following:

- A factory method that creates a data lifecycle, for now it's trivial but it will be more useful soon.
- We rename the "empty" builder to explicitly mention the index component it refers to.
2025-04-23 20:00:25 +10:00
James Baiera
7b89f4d4a6
Add ability to redirect ingestion failures on data streams to a failure store (#126973)
Removes the feature flags and guards that prevent the new failure store functionality 
from operating in production runtimes.
2025-04-18 16:33:03 -04:00
James Baiera
d928d1a418
Add node feature for failure store, refactor capability names (#126885)
Adds a node feature that is conditionally added to the cluster state if the failure store 
feature flag is enabled. Requires all nodes in the cluster to have the node feature 
present in order to redirect failed documents to the failure store from the ingest node 
or from shard level bulk failures.
2025-04-18 13:42:48 -04:00
Martijn van Groningen
6012590929
Improve resiliency of UpdateTimeSeriesRangeService (#126637)
If updating the `index.time_series.end_time` fails for one data stream,
then UpdateTimeSeriesRangeService should continue updating this setting for other data streams.

The following error was observed in the wild:

```
[2025-04-07T08:50:39,698][WARN ][o.e.d.UpdateTimeSeriesRangeService] [node-01] failed to update tsdb data stream end times
java.lang.IllegalArgumentException: [index.time_series.end_time] requires [index.mode=time_series]
        at org.elasticsearch.index.IndexSettings$1.validate(IndexSettings.java:636) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.index.IndexSettings$1.validate(IndexSettings.java:619) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.common.settings.Setting.get(Setting.java:563) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.common.settings.Setting.get(Setting.java:535) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.datastreams.UpdateTimeSeriesRangeService.updateTimeSeriesTemporalRange(UpdateTimeSeriesRangeService.java:111) ~[?:?]
        at org.elasticsearch.datastreams.UpdateTimeSeriesRangeService$UpdateTimeSeriesExecutor.execute(UpdateTimeSeriesRangeService.java:210) ~[?:?]
        at org.elasticsearch.cluster.service.MasterService.innerExecuteTasks(MasterService.java:1075) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:1038) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.cluster.service.MasterService.executeAndPublishBatch(MasterService.java:245) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.cluster.service.MasterService$BatchingTaskQueue$Processor.lambda$run$2(MasterService.java:1691) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.action.ActionListener.run(ActionListener.java:452) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.cluster.service.MasterService$BatchingTaskQueue$Processor.run(MasterService.java:1688) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.cluster.service.MasterService$5.lambda$doRun$0(MasterService.java:1283) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.action.ActionListener.run(ActionListener.java:452) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.cluster.service.MasterService$5.doRun(MasterService.java:1262) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1023) ~[elasticsearch-8.17.3.jar:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27) ~[elasticsearch-8.17.3.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
        at java.lang.Thread.run(Thread.java:1575) ~[?:?]
```

This resulted in a situation that caused the `index.time_series.end_time` index setting not to be updated for any data stream. This then caused data loss, as metrics couldn't be indexed because no suitable backing index could be resolved:

```
the document timestamp [2025-03-26T15:26:10.000Z] is outside of ranges of currently writable indices [[2025-01-31T07:22:43.000Z,2025-02-15T07:24:06.000Z][2025-02-15T07:24:06.000Z,2025-03-02T07:34:07.000Z][2025-03-02T07:34:07.000Z,2025-03-10T12:45:37.000Z][2025-03-10T12:45:37.000Z,2025-03-10T14:30:37.000Z][2025-03-10T14:30:37.000Z,2025-03-25T12:50:40.000Z][2025-03-25T12:50:40.000Z,2025-03-25T14:35:40.000Z
```
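The routing failure quoted above follows from how TSDB data streams resolve a backing index by timestamp range; a hedged Python sketch of that resolution (names and the half-open range convention are assumptions for illustration, not the actual implementation):

```python
from datetime import datetime, timezone

def utc(*args):
    return datetime(*args, tzinfo=timezone.utc)

# [start, end) time ranges of the currently writable TSDB backing indices.
writable_ranges = [
    (utc(2025, 3, 10, 12, 45, 37), utc(2025, 3, 10, 14, 30, 37)),
    (utc(2025, 3, 10, 14, 30, 37), utc(2025, 3, 25, 12, 50, 40)),
]

def resolve_backing_index(timestamp: datetime, ranges) -> int:
    # A document routes to the index whose time range contains its @timestamp.
    for i, (start, end) in enumerate(ranges):
        if start <= timestamp < end:
            return i
    # If end_time updates stalled, newer timestamps fall outside every range.
    raise ValueError("the document timestamp is outside of ranges of currently writable indices")
```

Once `index.time_series.end_time` stops advancing, every new document's timestamp eventually lands past the last range and indexing fails.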
2025-04-11 12:58:10 +02:00
Armin Braun
dd1db5031e
Move calls to FeatureFlag.enabled to class-load time (#125885)
I noticed that we tend to create the flag instance and call this method
everywhere. This doesn't compile the same way as a real boolean constant
unless you're running with `-XX:+TrustFinalNonStaticFields`.
For most of the code spots changed here that's irrelevant but at least
the usage in the mapper parsing code is a little hot and gets a small
speedup from this potentially.
Also we're simply wasting some bytes for the static footprint of ES by
using the `FeatureFlag` indirection instead of just a boolean.
2025-04-11 01:46:28 +02:00
Mary Gouseti
78ac5d58ef
[Failure store] Support failure store for system data streams (#126585)
In this PR we add support for the failure store for system data streams.
Specifically:

- We pass the system descriptor so the failure index can be created based on that.
- We extend the tests to ensure it works
- We remove a guard we had but I wasn't able to test it because it only gets triggered if the data stream gets created right after a failure in the ingest pipeline, and I didn't see how to add one (yet).
- We extend the system data stream migration to ensure this is also working.
2025-04-11 05:14:11 +10:00
Alexey Ivanov
ecf9adfc78
[main] System data streams are not being upgraded in the feature migration API (#126409)
This commit adds support for system data streams reindexing. The system data stream migration extends the existing system indices migration task and uses the data stream reindex API.
The system index migration task starts a reindex data stream task and tracks its status every second. Only one system index or system data stream is migrated at a time. If a data stream migration fails, the entire system index migration task will also fail.

Port of #123926
2025-04-08 20:42:58 +02:00
Mary Gouseti
060a9b746a
[DLM]Use default lifecycle instance instead of default constructor (#126461)
When creating an empty lifecycle we used to use the default
constructor. Using a shared default instance instead is not just more
efficient; it will also allow us to separate the default data and
failures lifecycles in the future.
2025-04-08 23:37:30 +10:00
Ryan Ernst
991e80d56e
Remove unnecessary generic params from action classes (#126364)
Transport actions have associated request and response classes. However,
the base type restrictions are not necessary to duplicate when creating
a map of transport actions. Relatedly, the ActionHandler class doesn't
actually need strongly typed action type and classes since they are lost
when shoved into the node client map. This commit removes these type
restrictions and generic parameters.
2025-04-07 16:22:56 -07:00
Mary Gouseti
a525b3d924
Fix test to anticipate force merge failure (#126282)
This test had a copy-paste mistake. When the cluster has only one data
node the replicas cannot be assigned, so we end up with a force merge
error. In the case of the failure store this was not asserted correctly.

On the other hand, this test only checked for the existence of an error;
it did not ensure that the current error is not the rollover error
that should have recovered. We make this test a bit more explicit.

Fixes: https://github.com/elastic/elasticsearch/issues/126252
2025-04-05 05:26:58 +11:00
Sam Xiao
b6c6db9861
Add multi-project support for health indicator data_stream_lifecycle (#126056) 2025-04-03 16:26:22 -04:00
Mary Gouseti
488951edf3
Data stream lifecycle does not record error in failure store rollover (#126229)
**Issue** The data stream lifecycle does not correctly register
rollover errors for the failure store.

**Observed behaviour** When the data stream lifecycle encounters a
rollover error, it records it unless it sees that the current write
index of the data stream doesn't match the source index of the request.
However, the write index check uses the write backing index instead of
the failure write index, so the failure gets ignored.

**Desired behaviour** When the data stream lifecycle encounters a
rollover error, it will check the relevant write index before
determining whether the error should be recorded.
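The check described above can be sketched in Python as follows (all names are illustrative; the real logic lives in the data stream lifecycle service):

```python
def should_record_rollover_error(source_index: str,
                                 write_backing_index: str,
                                 write_failure_index: str,
                                 targets_failure_store: bool) -> bool:
    # The fix: compare against the write index of the component the
    # rollover actually targeted, not always the backing write index.
    relevant_write_index = write_failure_index if targets_failure_store else write_backing_index
    return source_index == relevant_write_index

# A failure-store rollover error against the current failure write index is recorded:
assert should_record_rollover_error(".fs-my-ds-000002", ".ds-my-ds-000005", ".fs-my-ds-000002", True)
```

Before the fix, a failure-store rollover error was compared against the backing write index, so it never matched and was silently dropped.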
2025-04-04 03:44:09 +11:00
Mary Gouseti
95257bbf07
Make data stream options multi-project aware (#126141) 2025-04-03 14:33:40 +03:00
Mary Gouseti
25050495b9
Data stream options convert to javaRestTests to yamlRestTests. (#126037)
In this PR we add the data stream options API to the `es-rest-api` spec
behind a feature flag. This enables us to use `yamlRestTests` instead
of `javaRestTests`.
2025-04-03 01:32:54 +11:00
Niels Bauman
a8f5db2604
Make data stream lifecycle project-aware (#125476)
Now that all actions that DLM depends on are project-aware, we can make DLM itself project-aware.
There still exists only one instance of `DataStreamLifecycleService`, it just loops over all the projects - which matches the approach we've taken for similar scenarios thus far.
2025-03-31 14:52:43 +01:00
Pete Gillin
66432fb886
ES-10037 Track the peak indexing load for each shard (#125521)
This tracks the highest value seen for the recent write load metric
any time the stats for a shard were computed, exposes this value
alongside the recent value, and persists it in index metadata too.

The new test in `IndexShardTests` is designed to more thoroughly test
the recent write load metric previously added, as well as to test the
peak metric being added here.

ES-10037 #comment Added peak load metric in https://github.com/elastic/elasticsearch/pull/125521
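The peak tracking described above amounts to keeping a running maximum of the recent metric; a minimal illustrative sketch (class and method names are assumptions, not the actual implementation):

```python
class WriteLoadTracker:
    """Running peak of the recent write load metric, updated at stats time."""

    def __init__(self) -> None:
        self.recent = 0.0
        self.peak = 0.0

    def on_stats_computed(self, recent_load: float) -> None:
        self.recent = recent_load
        self.peak = max(self.peak, recent_load)  # exposed and persisted alongside recent
```

The peak therefore only ever grows within the tracked window, even as the recent value fluctuates.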
2025-03-27 12:03:39 +02:00
Mary Gouseti
6503c1b94b
[Failure Store] Conceptually introduce the failure store lifecycle (#125258)
* Specify index component when retrieving lifecycle

* Add getters for the failure lifecycle

* Conceptually introduce the failure store lifecycle (even for now it's the same)
2025-03-26 13:21:48 +02:00
Niels Bauman
8b691db436
Fix data stream retrieval in ExplainDataStreamLifecycleIT (#125611)
These tests had the potential to fail when two consecutive GET data
streams requests would hit two different nodes, where one node already
had the cluster state that contained the new backing index and the other
node didn't yet.

Caused by #122852

Fixes #124882
Fixes #124885
2025-03-26 10:33:33 +00:00
Niels Bauman
542a3b65a9
Fix data stream retrieval in DataStreamLifecycleServiceIT (#125195)
These tests had the potential to fail when two consecutive GET data
streams requests would hit two different nodes, where one node already
had the cluster state that contained the new backing index and the other
node didn't yet.

Caused by #122852

Fixes #124846
Fixes #124950
Fixes #124999
2025-03-24 17:43:09 +02:00
Niels Bauman
f7d7ce7ccc
Run TransportGetDataStreamOptionsAction on local node (#125213)
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.

Relates #101805
2025-03-22 16:18:28 +02:00
Niels Bauman
bbc47d9cad
Run TransportGetDataStreamLifecycleAction on local node (#125214)
This action solely needs the cluster state, it can run on any node.
Additionally, it needs to be cancellable to avoid doing unnecessary work
after a client failure or timeout.

Relates #101805
2025-03-22 13:00:47 +02:00