mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-04-25 15:47:23 -04:00
In this PR we expose the global retention via the `GET _data_stream/{target}/_lifecycle` API. Since the global retention is a main feature of the data stream lifecycle we chose to expose it by default.

```
GET /_data_stream/my-data-stream/_lifecycle
{
  "global_retention": {
    "default_retention": "7d",
    "max_retention": "365d"
  },
  "data_streams": [...]
}
```
[role="xpack"]
|
|
[[tutorial-manage-data-stream-retention]]
|
|
=== Tutorial: Data stream retention
|
|
|
|
In this tutorial, we are going to go over the data stream lifecycle retention: we will define it, go over how it can be configured,
and how it gets applied. Keep in mind that the following options apply only to data streams that are managed by the data stream lifecycle.
|
|
|
|
. <<what-is-retention>>
|
|
. <<retention-configuration>>
|
|
. <<effective-retention-calculation>>
|
|
. <<effective-retention-application>>
|
|
|
|
You can verify whether a data stream is managed by the data stream lifecycle via the <<data-streams-get-lifecycle,get data stream lifecycle API>>:
|
|
|
|
////
|
|
[source,console]
|
|
----
|
|
PUT /_index_template/template
|
|
{
|
|
"index_patterns": ["my-data-stream*"],
|
|
"template": {
|
|
"lifecycle": {}
|
|
},
|
|
"data_stream": { }
|
|
}
|
|
|
|
PUT /_data_stream/my-data-stream
|
|
----
|
|
// TESTSETUP
|
|
////
|
|
|
|
////
|
|
[source,console]
|
|
----
|
|
DELETE /_data_stream/my-data-stream*
|
|
DELETE /_index_template/template
|
|
PUT /_cluster/settings
|
|
{
|
|
"persistent" : {
|
|
"data_streams.lifecycle.retention.*" : null
|
|
}
|
|
}
|
|
----
|
|
// TEARDOWN
|
|
////
|
|
|
|
[source,console]
|
|
--------------------------------------------------
|
|
GET _data_stream/my-data-stream/_lifecycle
|
|
--------------------------------------------------
|
|
|
|
The result should look like this:
|
|
|
|
[source,console-result]
|
|
--------------------------------------------------
|
|
{
|
|
"data_streams": [
|
|
{
|
|
"name": "my-data-stream", <1>
|
|
"lifecycle": {
|
|
"enabled": true <2>
|
|
}
|
|
}
|
|
]
|
|
}
|
|
--------------------------------------------------
|
|
// TESTRESPONSE[skip:the result is for illustrating purposes only]
|
|
<1> The name of your data stream.
|
|
<2> Ensure that the lifecycle is enabled, meaning this should be `true`.
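
If you consume this response from a script, the same check can be automated. The following is a minimal Python sketch, assuming the response has already been parsed into a dictionary with the shape shown above. The helper name `is_managed_by_lifecycle` is hypothetical and not part of any {es} client.

[source,python]
----
def is_managed_by_lifecycle(response: dict, name: str) -> bool:
    """Return True if the named data stream reports an enabled lifecycle.

    `response` is assumed to be the parsed JSON body of
    GET _data_stream/<name>/_lifecycle, as illustrated above.
    """
    for stream in response.get("data_streams", []):
        if stream.get("name") == name:
            # A missing or null lifecycle is treated as "not managed".
            lifecycle = stream.get("lifecycle") or {}
            return bool(lifecycle.get("enabled", False))
    return False


# Example input mirroring the illustrative response above.
example_response = {
    "data_streams": [
        {"name": "my-data-stream", "lifecycle": {"enabled": True}}
    ]
}
managed = is_managed_by_lifecycle(example_response, "my-data-stream")
----

The sketch returns `False` both when the data stream is absent from the response and when its lifecycle is disabled, so a caller that needs to tell the two cases apart would have to extend it.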
|
|
|
|
[discrete]
|
|
[[what-is-retention]]
|
|
==== What is data stream retention?
|
|
|
|
We define retention as the minimum amount of time the data of a data stream are going to be kept in {es}. After this time period
has passed, {es} is allowed to remove these data to free up space and/or manage costs.
|
|
|
|
NOTE: Retention does not define the moment at which the data will be removed, but the minimum time period they will be kept.
|
|
|
|
We define 4 different types of retention:
|
|
|
|
* The data stream retention, or `data_retention`, which is the retention configured on the data stream level. It can be
|
|
set via an <<index-templates,index template>> for future data streams or via the <<data-streams-put-lifecycle, PUT data
|
|
stream lifecycle API>> for an existing data stream. When the data stream retention is not set, it implies that the data
|
|
need to be kept forever.
|
|
* The global default retention, let's call it `default_retention`, which is a retention configured via the cluster setting
|
|
<<data-streams-lifecycle-retention-default, `data_streams.lifecycle.retention.default`>> and will be
|
|
applied to all data streams managed by data stream lifecycle that do not have `data_retention` configured. Effectively,
|
|
it ensures that there will be no data streams keeping their data forever. This can be set via the
|
|
<<cluster-update-settings, update cluster settings API>>.
|
|
* The global max retention, let's call it `max_retention`, which is a retention configured via the cluster setting
|
|
<<data-streams-lifecycle-retention-max, `data_streams.lifecycle.retention.max`>> and will be applied to
|
|
all data streams managed by data stream lifecycle. Effectively, it ensures that there will be no data streams whose retention
|
|
will exceed this time period. This can be set via the <<cluster-update-settings, update cluster settings API>>.
|
|
* The effective retention, or `effective_retention`, which is the retention applied to a data stream at a given moment.
Effective retention cannot be set; it is derived by taking into account all the configured retentions listed above and is
calculated as described <<effective-retention-calculation,here>>.
|
|
|
|
NOTE: Global default and max retention do not apply to data streams internal to Elastic. Internal data streams are recognised
|
|
either by having the `system` flag set to `true` or if their name is prefixed with a dot (`.`).
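
The rule in the note above can be expressed as a small predicate. This is only an illustrative Python sketch of the two conditions stated in the note. The function name and its parameters are assumptions made for the example, not an {es} API.

[source,python]
----
def is_internal_data_stream(name: str, system: bool = False) -> bool:
    """Return True if a data stream counts as internal per the note above.

    A data stream is treated as internal when its `system` flag is true
    or when its name starts with a dot.
    """
    return system or name.startswith(".")


# Global default and max retention would only be considered for
# data streams where this predicate is False.
applies_global_retention = not is_internal_data_stream("my-data-stream")
----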
|
|
|
|
[discrete]
|
|
[[retention-configuration]]
|
|
==== How to configure retention?
|
|
|
|
- By setting the `data_retention` on the data stream level. This retention can be configured in two ways:
|
|
+
|
|
-- For new data streams, it can be defined in the index template that would be applied during the data stream's creation.
|
|
You can use the <<indices-put-template,create index template API>>, for example:
|
|
+
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT _index_template/template
|
|
{
|
|
"index_patterns": ["my-data-stream*"],
|
|
"data_stream": { },
|
|
"priority": 500,
|
|
"template": {
|
|
"lifecycle": {
|
|
"data_retention": "7d"
|
|
}
|
|
},
|
|
"_meta": {
|
|
"description": "Template with data stream lifecycle"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
-- For an existing data stream, it can be set via the <<data-streams-put-lifecycle, PUT lifecycle API>>.
|
|
+
|
|
[source,console]
|
|
----
|
|
PUT _data_stream/my-data-stream/_lifecycle
|
|
{
|
|
"data_retention": "30d" <1>
|
|
}
|
|
----
|
|
// TEST[continued]
|
|
<1> The retention period of this data stream is set to 30 days.
|
|
|
|
- By setting the global retention via the `data_streams.lifecycle.retention.default` and/or `data_streams.lifecycle.retention.max`
cluster settings. These can be set via the <<cluster-update-settings, update cluster settings API>>. For example:
|
|
+
|
|
[source,console]
|
|
--------------------------------------------------
|
|
PUT /_cluster/settings
|
|
{
|
|
"persistent" : {
|
|
"data_streams.lifecycle.retention.default" : "7d",
|
|
"data_streams.lifecycle.retention.max" : "90d"
|
|
}
|
|
}
|
|
--------------------------------------------------
|
|
// TEST[continued]
|
|
|
|
[discrete]
|
|
[[effective-retention-calculation]]
|
|
==== How is the effective retention calculated?
|
|
The effective retention is calculated in the following way:
|
|
|
|
- The `effective_retention` is the `default_retention`, when `default_retention` is defined and the data stream does not
|
|
have `data_retention`.
|
|
- The `effective_retention` is the `data_retention`, when `data_retention` is defined and, if `max_retention` is also defined,
the `data_retention` is less than the `max_retention`.
|
|
- The `effective_retention` is the `max_retention`, when `max_retention` is defined, and the data stream has either no
|
|
`data_retention` or its `data_retention` is greater than the `max_retention`.
|
|
|
|
The above is demonstrated in the examples below:
|
|
|
|
|===
|
|
|`default_retention` |`max_retention` |`data_retention` |`effective_retention` |Retention determined by
|
|
|
|
|Not set |Not set |Not set |Infinite |N/A
|
|
|Not relevant |12 months |**30 days** |30 days |`data_retention`
|
|
|Not relevant |Not set |**30 days** |30 days |`data_retention`
|
|
|**30 days** |12 months |Not set |30 days |`default_retention`
|
|
|**30 days** |30 days |Not set |30 days |`default_retention`
|
|
|Not relevant |**30 days** |12 months |30 days |`max_retention`
|
|
|Not set |**30 days** |Not set |30 days |`max_retention`
|
|
|===
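
The rules and the table above can be sketched in a few lines of Python. This is a model written to reproduce the rows of the table, not the {es} implementation. The function name is hypothetical, `None` stands for "Not set", a `None` result stands for infinite retention, and "12 months" is approximated as 365 days.

[source,python]
----
from datetime import timedelta
from typing import Optional, Tuple


def effective_retention(
    data_retention: Optional[timedelta],
    default_retention: Optional[timedelta],
    max_retention: Optional[timedelta],
) -> Tuple[Optional[timedelta], Optional[str]]:
    """Return (effective retention, setting that determined it)."""
    # The data stream's own retention wins over the global default.
    if data_retention is not None:
        candidate, source = data_retention, "data_retention"
    elif default_retention is not None:
        candidate, source = default_retention, "default_retention"
    else:
        candidate, source = None, None  # nothing configured so far

    # The global max caps the candidate, including the "infinite" case.
    if max_retention is not None and (candidate is None or candidate > max_retention):
        return max_retention, "max_retention"
    return candidate, source


days_30 = timedelta(days=30)
months_12 = timedelta(days=365)  # approximation used for this sketch

# Row: max_retention of 30 days caps a data_retention of 12 months.
capped = effective_retention(months_12, None, days_30)
----

Note that the table labels the source as `data_retention`, while the API response shown later reports it as `data_stream_configuration`; the sketch uses the table's labels.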
|
|
|
|
Considering our example, if we retrieve the lifecycle of `my-data-stream`:
|
|
[source,console]
|
|
----
|
|
GET _data_stream/my-data-stream/_lifecycle
|
|
----
|
|
// TEST[continued]
|
|
|
|
We see that the effective retention remains the same as what the user configured:
|
|
[source,console-result]
|
|
----
|
|
{
|
|
"global_retention" : {
|
|
"max_retention" : "90d", <1>
|
|
"default_retention" : "7d" <2>
|
|
},
|
|
"data_streams": [
|
|
{
|
|
"name": "my-data-stream",
|
|
"lifecycle": {
|
|
"enabled": true,
|
|
"data_retention": "30d", <3>
|
|
"effective_retention": "30d", <4>
|
|
"retention_determined_by": "data_stream_configuration" <5>
|
|
}
|
|
}
|
|
]
|
|
}
|
|
----
|
|
<1> The maximum retention configured in the cluster.
|
|
<2> The default retention configured in the cluster.
|
|
<3> The requested retention for this data stream.
|
|
<4> The retention that is applied by the data stream lifecycle on this data stream.
|
|
<5> The configuration that determined the effective retention. In this case it's the `data_stream_configuration` because
the `data_retention` is less than the `max_retention`.
|
|
|
|
[discrete]
|
|
[[effective-retention-application]]
|
|
==== How is the effective retention applied?
|
|
|
|
Retention is applied to the remaining backing indices of a data stream as the last step of
|
|
<<data-streams-lifecycle-how-it-works, a data stream lifecycle run>>. Data stream lifecycle will retrieve the backing indices
|
|
whose `generation_time` is longer than the effective retention period and delete them. The `generation_time` is only
|
|
applicable to rolled over backing indices and it is either the time since the backing index got rolled over, or the time
|
|
optionally configured in the <<index-data-stream-lifecycle-origination-date,`index.lifecycle.origination_date`>> setting.
|
|
|
|
IMPORTANT: We use the `generation_time` instead of the creation time because this ensures that all data in the backing
|
|
index have passed the retention period. As a result, the retention period is not the exact time data get deleted, but
|
|
the minimum time data will be stored.
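
To illustrate the selection step described in this section, here is a minimal Python sketch. It models a backing index with only a name and a `generation_time`. The `BackingIndex` class, the function name, and the index names are hypothetical and exist only for this example. A `generation_time` of `None` represents an index that has not been rolled over yet.

[source,python]
----
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional


@dataclass
class BackingIndex:
    name: str
    # Time since rollover (or since the configured origination date).
    # None means the index has not been rolled over yet.
    generation_time: Optional[timedelta]


def indices_past_retention(
    indices: List[BackingIndex],
    effective_retention: Optional[timedelta],
) -> List[str]:
    """Return the names of backing indices eligible for deletion."""
    if effective_retention is None:
        return []  # infinite retention: nothing is ever eligible
    return [
        index.name
        for index in indices
        if index.generation_time is not None
        and index.generation_time > effective_retention
    ]


backing_indices = [
    BackingIndex("backing-index-1", timedelta(days=45)),
    BackingIndex("backing-index-2", timedelta(days=10)),
    BackingIndex("backing-index-3", None),  # current write index
]
eligible = indices_past_retention(backing_indices, timedelta(days=30))
----

With an effective retention of 30 days, only the first index is selected. The second has not been rolled over for long enough, and the third has no `generation_time` at all, which matches the statement that retention is a minimum and not an exact deletion time.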
|