Here we introduce a new implementation of `IndexSettingProvider` whose goal is to "inject" the
`index.mode` setting with the value `logsdb` when the cluster setting `cluster.logsdb.enabled` is `true`.
We also make sure that:
* `index.mode` is not already set
* the data stream name matches the `logs-*-*` pattern
* the `logs@settings` component template is used
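For illustration, a minimal sketch of the intended behavior; the data stream name `logs-myapp-default` is hypothetical and assumes a matching `logs-*-*` index template that uses the `logs@settings` component template:
```
PUT _cluster/settings
{
  "persistent": {
    "cluster.logsdb.enabled": true
  }
}

# Backing indices of a matching data stream created after this point should
# pick up the injected setting, which can be checked with:
GET logs-myapp-default/_settings?filter_path=**.index.mode
```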
Since we are enriching the component templates with more entries, such as
the data stream lifecycle and, in the future, the data stream options, we
add a template builder to help with the code, especially tests.
To highlight the value and prepare for the PRs that will add the data
stream options to the template, we replace calls to the all-arguments
constructor with the builder:
- when there are arguments with null values, or
- when we copy another template and change only a few fields.

This prepares the ground, so when we add data stream options we will
not need to edit all these places.
Closes https://github.com/elastic/elasticsearch/issues/110387
Having this in now means we will not have to introduce version checks in
the ES exporter later: we can simply use the same serialization logic
for metric attributes as we do for other signals. This also enables us
to properly map `*.ip` fields to the `ip` field type, as `ip` fields
containing a list of IPs are not converted to a comma-separated list.
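As a hypothetical sketch of the kind of mapping this enables (the template name and match pattern are illustrative, not the exporter's actual template), a dynamic template along these lines maps such attributes to the `ip` field type:
```
PUT _index_template/metrics-otel
{
  "index_patterns": ["metrics-otel-*"],
  "template": {
    "mappings": {
      "dynamic_templates": [
        {
          "ip_attributes": {
            "path_match": "*.ip",
            "mapping": { "type": "ip" }
          }
        }
      ]
    }
  }
}
```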
The failure store status is a flag that indicates how the failure store was used, or how it could have been used had it been enabled. The user is informed about the usage of the failure store in the following way:
When relevant, we add the optional field `failure_store`. The field is omitted when the use of the failure store is not relevant: for example, if a document was successfully indexed into a backing index of a data stream, if a failure concerns a regular index (not a data stream), or if the opType is not index or create. In more detail:
- when we have a "success" create/index response, the field `failure_store` will not be present if the document was indexed in a backing index. Otherwise, if it got stored in the failure store, it will have the value `used`.
- when we have a "rejected" create/index response, meaning the document was not persisted in Elasticsearch, we return the field `failure_store` with the value `not_enabled`, if the document could have ended up in the failure store had it been enabled, or `failed`, if something went wrong and the document could not be persisted in the failure store, for example, when the cluster is out of space and in read-only mode.
We chose to make it an optional field to reduce its impact on a bulk response. The value always exists in the Java object, but it is not always returned to the user. The only values that will be displayed are:
- `used`: meaning this document was indexed in the failure store
- `not_enabled`: meaning this document was rejected but could have been stored in the failure store had it been enabled
- `failed`: meaning this document failed and could not be stored in the failure store either
Example:
```
"errors": true,
"took": 202,
"items": [
{
"create": {
"_index": ".fs-my-ds-2024.09.04-000002",
"_id": "iRDDvJEB_J3Inuia2zgH",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 6,
"_primary_term": 1,
"status": 201,
"failure_store": "used"
}
},
{
"create": {
"_index": "ds-no-fs",
"_id": "hxDDvJEB_J3Inuia2jj3",
"status": 400,
"error": {
"type": "document_parsing_exception",
"reason": "[1:153] failed to parse field [count] of type [long] in document with id 'hxDDvJEB_J3Inuia2jj3'. Preview of field's value: 'bla'",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "For input string: \"bla\""
}
}
},
"failure_store": "not_enabled"
},
{
"create": {
"_index": ".ds-my-ds-2024.09.04-000001",
"_id": "iBDDvJEB_J3Inuia2jj3",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 1,
"failed": 0
},
"_seq_no": 7,
"_primary_term": 1,
"status": 201
}
}
]
```
Introduces the per-field param `synthetic_source_keep` that overrides the
behavior for keeping the field's source in synthetic source mode:
- `none`: no source is stored
- `arrays`: the incoming source is recorded as-is for arrays of a given field
- `all`: the incoming source is recorded as-is for both singleton and array values of a given field
Related to #112012
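A minimal mapping sketch (the index and field names are hypothetical):
```
PUT my-index
{
  "mappings": {
    "_source": { "mode": "synthetic" },
    "properties": {
      "tags": {
        "type": "keyword",
        "synthetic_source_keep": "arrays"
      }
    }
  }
}
```
With this mapping, arrays of `tags` are returned exactly as they were ingested, instead of the deduplicated, sorted form that synthetic source normally produces for keyword arrays.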
In synthetic source, storing array elements to `_ignored_source` may
hide other, regular elements from showing up during source synthesizing.
This is due to contents from `_ignored_source` taking precedence over
matching fields from regular source loading.
To avoid this, arrays are preemptively tracked and marked for source
storing if any of their elements needs to store its source. A second
doc-parsing phase is introduced that checks for fields missing values
and records their source, while skipping objects and arrays that don't
contain any such fields.
Fixes #112374
This commit adds a module that emits a deprecation warning when a
dot-prefixed index is manually or automatically created, or when a
composable index template with an index pattern that uses a dot-prefix
is created. The warning states that in the future these indices will not
be allowed. In a future breaking change (10.0.0 maybe?) the deprecation
can then be changed to an exception.
These deprecations are only displayed when a non-operator user is using
the API (one that does not set the `X-elastic-product-origin` header).
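For example, both of the following requests would now emit the deprecation warning (the names are illustrative):
```
PUT .my-dot-index

PUT _index_template/dot-prefixed-template
{
  "index_patterns": [".my-dot-indices-*"]
}
```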
[TEST] Assert DSL merge policy respects end date
Backing indices with an end date in the future may still receive writes,
so DSL should not apply the merge policy (first configuring the
settings on the index, then doing the force merge) until that time has
passed. The implementation already does this, because
`DataStreamLifecycleService.run()` calls
`timeSeriesIndicesStillWithinTimeBounds` and adds the resulting
indices to `indicesToExcludeForRemainingRun` before calling
`maybeExecuteForceMerge`. This change simply adds a unit test to
ensure that this behaviour does not regress.
Closes #109030
Fix verbose get data stream API not requiring extra privileges
When a user uses the `GET /_data_stream?verbose` API to retrieve the verbose version of the response (which includes the `maximum_timestamp`, as added in #112303), the request should be subject to the same privilege checking as the regular get-data-stream API, meaning that no extra privileges should be required to return the field.
This commit makes the transport action use an entitled client so that extra privileges are not required, and adds a test to ensure that it works.
Here we test reindexing logsdb indices, and creating and restoring
snapshots. Note that logsdb uses synthetic source, and restoring
source-only snapshots fails due to the missing `_source`.
Dropping support for pre-8.12 requests from remote nodes, and also
cleaning up some unnecessary abstraction in the request builder
hierarchy.
Relates #101815
Relates #107984 (drops some unnecessary trappy timeouts)
Replaces the somewhat-awkward API on `ClusterAdminClient` for
manipulating ingest pipelines with some test-specific utilities that are
easier to use.
Relates #107984 in that this change massively reduces the noise that
would otherwise result from removing the trappy timeouts in these APIs.
When indexing to a data stream with a failure store it's possible to get
a version conflict. The reproduction path is the following:
```
PUT /_bulk
{"create":{"_index": "my-ds-with-fs", "_id": "1"}}
{"@timestamp": "2022-01-01", "baz": "quick", "a": "brown", "b": "fox"}
{"create":{"_index": "my-ds-with-fs", "_id": "1"}}
{"@timestamp": "2022-01-01", "baz": "lazy", "a": "dog"}
```
We would like the second document not to be sent to the failure store,
and instead to return an error to the user:
```
{
  "errors" : true,
  "took" : 409,
  "items" : [
    {
      "create" : {
        "_index" : ".ds-my-ds-with-fs-xxxxx-xxxx",
        "_id" : "1",
        "_version" : 1,
        "result" : "created",
        "_shards" : {
          "total" : 2,
          "successful" : 1,
          "failed" : 0
        },
        "_seq_no" : 0,
        "_primary_term" : 1,
        "status" : 201
      }
    },
    {
      "create" : {
        "_index" : ".ds-my-ds-with-fs-xxxxx-xxxx",
        "_id" : "1",
        "status" : 409,
        "error" : {
          "type" : "version_conflict_engine_exception",
          "reason" : "[1]: version conflict, document already exists (current version [1])",
          "index_uuid" : ".....",
          "shard" : "0",
          "index" : ".ds-my-ds-with-fs-xxxxx-xxxx"
        }
      }
    }
  ]
}
```
The version conflict doc is counted as a rejected doc in APM telemetry.
Introduce an index setting that forces storing the source of leaf field
and object arrays in synthetic source mode. Nested objects are excluded
as they already preserve ordering in synthetic source.
The next step is to introduce override params at the mapper level that will
allow disabling the source, or storing the source for arrays (if not
enabled at index level), or storing the source for both arrays and
singletons. This will happen in follow-up changes, so that we can
benchmark the impact of this change in parallel.
Related to #112012
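A sketch of how the new index setting would be applied, assuming it shares the `synthetic_source_keep` naming of the related mapper param (the index name is hypothetical):
```
PUT my-index
{
  "settings": {
    "index.mapping.synthetic_source_keep": "arrays"
  },
  "mappings": {
    "_source": { "mode": "synthetic" }
  }
}
```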
Restricts the "index settings" provider that's invoked when creating new
indices to only inspect the current project's metadata (rather than the
whole global metadata).
In this PR we expose the global retention via the `GET
_data_stream/{target}/_lifecycle` API.
Since the global retention is a main feature of the data stream
lifecycle, we chose to expose it by default.
```
GET /_data_stream/my-data-stream/_lifecycle
{
  "global_retention": {
    "default_retention": "7d",
    "max_retention": "365d"
  },
  "data_streams": [...]
}
```
This commit adds support for the `verbose` querystring parameter to the
get data stream API (`GET /_data_stream/{name}`).
The flag defaults to `false`.
When set to `true`, the `maximum_timestamp` for each data stream will be
retrieved and included in the response. This is the same information
available from the data stream stats API (and internally the same
action is used to retrieve it).
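For example (the data stream name is illustrative, and the response is abbreviated):
```
GET /_data_stream/my-data-stream?verbose=true
{
  "data_streams": [
    {
      "name": "my-data-stream",
      "maximum_timestamp": 1725455400000,
      ...
    }
  ]
}
```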
With #111972 we enable users to set up global retention for data streams that are managed by the data stream lifecycle. This will allow users of Elasticsearch to have more control over their data retention and, consequently, better resource management of their clusters.
However, there is a small number of data streams that are necessary for the good operation of Elasticsearch and should not follow user-defined retention, to avoid surprises.
For this reason, we put forth the following definition of internal data streams:
A data stream is internal if it is either a system data stream (the system flag is true) or if its name starts with a dot.
This PR adds the `isInternalDataStream` param to the effective retention calculation, making it explicit that this is also used to determine the effective retention.
Search coordinator uses event.ingested in cluster state to do rewrites
Min/max range for the event.ingested timestamp field (part of Elastic Common
Schema) was added to IndexMetadata in cluster state for searchable snapshots
in #106252.
This commit modifies the search coordinator to rewrite searches to MatchNone
if the query searches a range of event.ingested that, based on the min/max range
in cluster state, is known not to overlap. This is the same behavior we currently
have for the @timestamp field.
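For example, a search like the following can now be rewritten to MatchNone on indices whose event.ingested min/max range in cluster state does not overlap the queried range (the index pattern is illustrative):
```
GET logs-*/_search
{
  "query": {
    "range": {
      "event.ingested": {
        "gte": "2024-01-01",
        "lte": "2024-01-31"
      }
    }
  }
}
```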
Sometimes initial indexing results in exactly one segment.
However, multiple segments are needed to perform the force merge that purges stored fields for the `_id` field in a later stage of the test.
This change tweaks the test such that an extra update is performed after initial indexing. This should always create an extra segment, so that the test can actually purge stored fields for the `_id` field.
Closes #112124
Add the ability to schedule SLM policies with a time-unit interval schedule rather than a cron schedule. For example, an SLM policy can be created with the argument `"schedule": "30m"`. This creates a policy that runs 30 minutes after the policy's modification_date, and then again every time another 30 minutes have passed. Every time the policy is changed, the next snapshot will be re-scheduled to run one interval after the new modification date.
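For example (the policy id, snapshot name, and repository are illustrative):
```
PUT _slm/policy/every-30-minutes
{
  "schedule": "30m",
  "name": "<snap-{now/d}>",
  "repository": "my-repository",
  "config": {
    "indices": ["*"]
  }
}
```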