RFC: Improve saved object migrations (#66056)

* RFC: Improve saved object migrations

* Added rolling upgrades as alternative

* Clarify advantage of re-using index for avoiding max shards open errors

* Add more detail about how dry run migrations and invalid document exports will be exposed to users

* Add separate section to highlight assumptions and tradeoffs

* Apply nits from code review

Co-authored-by: Josh Dover <me@joshdover.com>

* Move 'Single node migrations coordinated through a lease/lock' under alternatives

* Fix 5.2.2 document lock algorithm step 3.2

* Minor clarifications

* Minimizing data loss with mixed Kibana versions: move 7.x algorithm into alternatives and clarify 8.x algorithm

* Adds impact idempotent restrictions have on upcoming/requested migration features

* nit: clarify 4.3.1.2.3

Clarify that if documents already exist, they aren't overwritten.

Co-authored-by: Brandon Kobel <brandon.kobel@gmail.com>

* Update 4.5.2: we need only one version-specific alias, not separate read/write aliases

* Add 4.0.4 the assumption that a saved object type will only ever be registered by one plugin

* Updates 4.3.1.2: Adds mappings compatibility check, adds soft deletes to prevent lost deletes, add note about data loss with mixed plugins on the same kibana version

* Don't try to minimize data loss for unsupported configurations, update index mappings during 8.x migration

 - Move "4.5.2 Minimizing data loss with mixed Kibana versions (9.0)" to
 alternatives section. Expand this section with a more detailed algorithm as
 well as enumerating some data loss scenarios that could still occur.
 - Add the supported upgrade procedure which guarantees no undetected data loss.
 - Update "4.5.1 Migration algorithm (8.0)"
   - Added a step to update index mappings
  - If there are update version mismatch errors, complete all batches before
    restarting the migration to improve performance.

* Flesh out upgrade and rollback procedure with graceful shutdown

* New migration algorithm, upgrade procedure based on index cloning and new ES features

* Minor: clarify that index cloning algorithm solves (2.6) but (2.7) is out of scope

* Move 'Tag objects as invalid if their transformation fails' to the alternatives section

* Don't migrate when there's orphan documents so that rollback is always possible. Updates: 4.2.1.2 Migration algorithm: Cloned index per version & 4.2.1.4 Handling documents that belong to a disabled plugin

* Link to new Elasticsearch features required for the implementation

* 4.2.1.2 Clarify the purpose of the target index postfix

* Add open questions around mappings growing over time. Fix rollback index example

* minor: fix formatting

* Update assumptions and tradeoffs to match latest algorithm

* Review feedback
 - Remove 4.0.2 since the latest algorithm resolves the tradeoff
 - 4.2.1.2 should handle v6 `.kibana` indexes

* More alternatives for 4.2.1.4 Handling documents that belong to a disabled plugin. Don't introduce breaking changes in 7.x

* Handle new ES cluster without any indices. Fix clone API parameters

* Update 4.2.1.2: transform outdated documents on every startup, only fail for unknown types in 8.x

Co-authored-by: Josh Dover <me@joshdover.com>
Co-authored-by: Brandon Kobel <brandon.kobel@gmail.com>
- Start Date: 2020-05-11
- RFC PR: (leave this empty)
- Kibana Issue: (leave this empty)
---
- [1. Summary](#1-summary)
- [2. Motivation](#2-motivation)
- [3. Saved Object Migration Errors](#3-saved-object-migration-errors)
- [4. Design](#4-design)
  - [4.0 Assumptions and tradeoffs](#40-assumptions-and-tradeoffs)
  - [4.1 Discover and remedy potential failures before any downtime](#41-discover-and-remedy-potential-failures-before-any-downtime)
  - [4.2 Automatically retry failed migrations until they succeed](#42-automatically-retry-failed-migrations-until-they-succeed)
    - [4.2.1 Idempotent migrations performed without coordination](#421-idempotent-migrations-performed-without-coordination)
      - [4.2.1.1 Restrictions](#4211-restrictions)
      - [4.2.1.2 Migration algorithm: Cloned index per version](#4212-migration-algorithm-cloned-index-per-version)
      - [4.2.1.3 Upgrade and rollback procedure](#4213-upgrade-and-rollback-procedure)
      - [4.2.1.4 Handling documents that belong to a disabled plugin](#4214-handling-documents-that-belong-to-a-disabled-plugin)
- [5. Alternatives](#5-alternatives)
  - [5.1 Rolling upgrades](#51-rolling-upgrades)
  - [5.2 Single node migrations coordinated through a lease/lock](#52-single-node-migrations-coordinated-through-a-leaselock)
    - [5.2.1 Migration algorithm](#521-migration-algorithm)
    - [5.2.2 Document lock algorithm](#522-document-lock-algorithm)
    - [5.2.3 Checking for "weak lease" expiry](#523-checking-for-weak-lease-expiry)
  - [5.3 Minimize data loss with mixed Kibana versions during 7.x](#53-minimize-data-loss-with-mixed-kibana-versions-during-7x)
  - [5.4 In-place migrations that re-use the same index (8.0)](#54-in-place-migrations-that-re-use-the-same-index-80)
    - [5.4.1 Migration algorithm (8.0):](#541-migration-algorithm-80)
    - [5.4.2 Minimizing data loss with unsupported upgrade configurations (8.0)](#542-minimizing-data-loss-with-unsupported-upgrade-configurations-80)
  - [5.5 Tag objects as “invalid” if their transformation fails](#55-tag-objects-as-invalid-if-their-transformation-fails)
- [6. How we teach this](#6-how-we-teach-this)
- [7. Unresolved questions](#7-unresolved-questions)
# 1. Summary
Improve the Saved Object migration algorithm to ensure a smooth Kibana upgrade
procedure.
# 2. Motivation
Kibana version upgrades should have a minimal operational impact. To achieve
this, users should be able to rely on:
1. A predictable downtime window.
2. A small downtime window.
1. (future) Provide a small downtime window even on indices with 10k or
100k documents.
3. The ability to discover and remedy potential failures before initiating the
downtime window.
4. Quick roll-back in case of failure.
5. Detailed documentation about the impact of downtime on the features they
are using (e.g. actions, task manager, fleet, reporting).
6. Mixed Kibana versions shouldn't cause data loss.
7. (stretch goal) Maintain read-only functionality during the downtime window.
The biggest hurdle to achieving the above is Kibana's Saved Object migrations.
Migrations aren't resilient and require manual intervention anytime an error
occurs (see [3. Saved Object Migration
Errors](#3-saved-object-migration-errors)).
It is impossible to discover these failures before initiating downtime. Errors
often force users to roll-back to a previous version of Kibana or cause hours
of downtime. To retry the migration, users are asked to manually delete a
`.kibana_x` index. If done incorrectly this can lead to data loss, making it a
terrifying experience (restoring from a pre-upgrade snapshot is a safer
alternative but not mentioned in the docs or logs).
Cloud users don't have access to Kibana logs to be able to identify and remedy
the cause of the migration failure. Apart from blindly retrying migrations by
restoring a previous snapshot, cloud users are unable to remedy a failed
migration and have to escalate to support which can further delay resolution.
Taken together, version upgrades are a major operational risk and discourage
users from adopting the latest features.
# 3. Saved Object Migration Errors
Any of the following classes of errors could result in a Saved Object
migration failure which requires manual intervention to resolve:
1. A bug in a plugin's registered document transformation function causes it
to throw an exception on _valid_ data.
2. _Invalid_ data stored in Elasticsearch causes a plugin's registered
document transformation function to throw an exception.
3. Failures resulting from an unhealthy Elasticsearch cluster:
1. Maximum shards open
2. Too many scroll contexts
3. `circuit_breaking_exception` (insufficient heap memory)
4. `process_cluster_event_timeout_exception` for index-aliases, create-index, put-mappings
5. Read-only indices due to low disk space (hitting the flood_stage watermark)
6. Re-index failed: search rejected due to missing shards
7. `TooManyRequests` while doing a `count` of documents requiring a migration
8. Bulk write failed: primary shard is not active
4. The Kibana process is killed while migrations are in progress.
# 4. Design
## 4.0 Assumptions and tradeoffs
The proposed design makes several important assumptions and tradeoffs.
**Background:**
The 7.x upgrade documentation lists taking an Elasticsearch snapshot as a
required step, but we instruct users to retry migrations and perform rollbacks
by deleting the failed `.kibana_n` index and pointing the `.kibana` alias to
`.kibana_n-1`:
- [Handling errors during saved object
migrations.](https://github.com/elastic/kibana/blob/75444a9f1879c5702f9f2b8ad4a70a3a0e75871d/docs/setup/upgrade/upgrade-migrations.asciidoc#handling-errors-during-saved-object-migrations)
- [Rolling back to a previous version of Kibana.](https://github.com/elastic/kibana/blob/75444a9f1879c5702f9f2b8ad4a70a3a0e75871d/docs/setup/upgrade/upgrade-migrations.asciidoc#rolling-back-to-a-previous-version-of-kib)
- Server logs from failed migrations.
**Assumptions and tradeoffs:**
1. It is critical to maintain a backup index during 7.x to ensure that anyone
following the existing upgrade / rollback procedures doesn't end up in a
position where they can no longer recover their data.
1. This excludes us from introducing in-place migrations to support huge
indices during 7.x.
2. The simplicity of idempotent, coordination-free migrations outweighs the
restrictions this will impose on the kinds of migrations we're able to
support in the future. See (4.2.1)
3. A saved object type (and its associated migrations) will only ever be
owned by one plugin. If pluginA registers saved object type `plugin_a_type`
then pluginB must never register that same type, even if pluginA is
disabled. Although we cannot enforce it on third-party plugins, breaking
this assumption may lead to data loss.
## 4.1 Discover and remedy potential failures before any downtime
> Achieves goals: (2.3)
> Mitigates errors: (3.1), (3.2)
1. Introduce a CLI option to perform a dry run migration to allow
administrators to locate and fix potential migration failures without
taking their existing Kibana node(s) offline.
2. To have the highest chance of surfacing potential failures such as low disk
space, dry run migrations should not be mere simulations. A dry run should
perform a real migration in a way that doesn't impact the existing Kibana
cluster.
3. The CLI should generate a migration report to make it easy to create a
support request from a failed migration dry run.
1. The report would be an NDJSON export of all failed objects.
2. If support receives such a report, we could modify all the objects to
ensure the migration would pass and send this back to the client.
3. The client can then import the updated objects using the standard Saved
Objects NDJSON import and run another dry run to verify all problems
have been fixed.
4. Make running dry run migrations a required step in the upgrade procedure
documentation.
5. (Optional) Add dry run migrations to the standard cloud upgrade procedure?
## 4.2 Automatically retry failed migrations until they succeed
> Achieves goals: (2.2), (2.6)
> Mitigates errors (3.3) and (3.4)
External conditions such as failures from an unhealthy Elasticsearch cluster
(3.3) can cause the migration to fail. The Kibana cluster should be able to
recover automatically once these external conditions are resolved. There are
two broad approaches to solving this problem based on whether or not
migrations are idempotent:
| Idempotent migrations | Description |
| --------------------- | --------------------------------------------------------- |
| Yes | Idempotent migrations performed without coordination |
| No | Single node migrations coordinated through a lease / lock |
Idempotent migrations don't require coordination, which makes the algorithm
significantly less complex, and they will never require manual intervention to
retry. We therefore prefer this solution, even though it introduces
restrictions on migrations (4.2.1.1). For other alternatives that were
considered, see section [(5)](#5-alternatives).
## 4.2.1 Idempotent migrations performed without coordination
The migration system can be said to be idempotent if the same results are
produced whether the migration was run once or multiple times. This property
should hold even if new (up to date) writes occur in between migration runs
which introduces the following restrictions:
### 4.2.1.1 Restrictions
1. All document transforms need to be deterministic; that is, a document
transform will always return the same result for the same set of inputs.
2. It should always be possible to construct the exact set of inputs required
for (1) at any point during the migration process (before, during, after).
Although these restrictions require significant changes, they do not prevent
known upcoming migrations such as [sharing saved-objects in multiple spaces](https://github.com/elastic/kibana/issues/27004) or [splitting a saved
object into multiple child
documents](https://github.com/elastic/kibana/issues/26602). To ensure that
these migrations are idempotent, they will have to generate new saved object
IDs deterministically, e.g. with UUIDv5.
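For example, a transform that splits a saved object into child documents could derive the new IDs from the parent's ID. A minimal sketch, assuming the `uuid` npm package and a hypothetical namespace constant:
```js
const { v5: uuidv5 } = require('uuid');

// Hypothetical namespace UUID, used only for this illustration.
const SAVED_OBJECT_NAMESPACE = '9e3779b9-7f4a-4d5c-8a6b-123456789abc';

// Deterministic transform: the same input document always produces the same
// child document id, so re-running the migration is idempotent.
function splitIntoChild(doc) {
  const childId = uuidv5(`${doc._id}:child`, SAVED_OBJECT_NAMESPACE);
  return {
    _id: childId,
    _source: { parent: doc._id, ...doc._source.childAttributes },
  };
}
```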
### 4.2.1.2 Migration algorithm: Cloned index per version
Note:
- The description below assumes the migration algorithm is released in 7.10.0.
So < 7.10.0 will use `.kibana` and >= 7.10.0 will use `.kibana_current`.
- We refer to the alias and index that outdated nodes use as the source alias
and source index.
- Every version performs a migration even if mappings or documents aren't outdated.
1. Locate the source index by fetching aliases (including `.kibana` for
versions prior to v7.10.0)
```
GET '/_alias/.kibana_current,.kibana_7.10.0,.kibana'
```
The source index is:
1. the index the `.kibana_current` alias points to, or if it doesn't exist,
2. the index the `.kibana` alias points to, or if it doesn't exist,
3. the v6.x `.kibana` index
If none of the aliases exist, this is a new Elasticsearch cluster and no
migrations are necessary. Create the `.kibana_7.10.0_001` index with the
following aliases: `.kibana_current` and `.kibana_7.10.0`.
2. If `.kibana_current` and `.kibana_7.10.0` both exist and are pointing to the same index, this version's migration has already been completed.
1. Because the same version can have plugins enabled at any point in time,
perform the mappings update in step (6) and migrate outdated documents
with step (7).
2. Skip to step (9) to start serving traffic.
3. Fail the migration if:
1. `.kibana_current` is pointing to an index that belongs to a later version of Kibana, e.g. `.kibana_7.12.0_001`
2. (Only in 8.x) The source index contains documents that belong to an unknown Saved Object type (from a disabled plugin). Log an error explaining that the plugin that created these documents needs to be enabled again or that these objects should be deleted. See section (4.2.1.4).
4. Mark the source index as read-only and wait for all in-flight operations to drain (requires https://github.com/elastic/elasticsearch/pull/58094). This prevents any further writes from outdated nodes. Assuming this API is similar to the existing `/<index>/_close` API, we expect to receive `"acknowledged" : true` and `"shards_acknowledged" : true`. If all shards don't acknowledge within the timeout, retry the operation until it succeeds.
5. Clone the source index into a new target index which has writes enabled. All nodes on the same version will use the same fixed index name e.g. `.kibana_7.10.0_001`. The `001` postfix isn't used by Kibana, but allows for re-indexing an index should this be required by an Elasticsearch upgrade. E.g. re-index `.kibana_7.10.0_001` into `.kibana_7.10.0_002` and point the `.kibana_7.10.0` alias to `.kibana_7.10.0_002`.
1. `POST /.kibana_n/_clone/.kibana_7.10.0_001?wait_for_active_shards=all {"settings": {"index.blocks.write": false}}`. Ignore errors if the clone already exists.
2. Wait for the cloning to complete `GET /_cluster/health/.kibana_7.10.0_001?wait_for_status=green&timeout=60s` If cloning doesn't complete within the 60s timeout, log a warning for visibility and poll again.
6. Update the mappings of the target index
1. Retrieve the existing mappings including the `migrationMappingPropertyHashes` metadata.
2. Update the mappings with `PUT /.kibana_7.10.0_001/_mapping`. The API deeply merges any updates so this won't remove the mappings of any plugins that were enabled in a previous version but are now disabled.
3. Ensure that fields are correctly indexed using the target index's latest mappings `POST /.kibana_7.10.0_001/_update_by_query?conflicts=proceed`. In the future we could optimize this query by only targeting documents:
1. That belong to a known saved object type.
2. Which don't have outdated migrationVersion numbers since these will be transformed anyway.
3. That belong to a type whose mappings were changed by comparing the `migrationMappingPropertyHashes`. (Metadata, unlike the mappings, isn't commutative, so there is a small chance that the metadata hashes do not accurately reflect the latest mappings; however, this will just result in a less efficient query.)
7. Transform documents by reading batches of outdated documents from the target index then transforming and updating them with optimistic concurrency control.
1. Ignore any version conflict errors.
2. If a document transform throws an exception, add the document to a failure list and continue trying to transform all other documents. If any failures occurred, log the complete list of documents that failed to transform. Fail the migration.
8. Mark the migration as complete by doing a single atomic operation (requires https://github.com/elastic/elasticsearch/pull/58100) that:
1. Checks that the `.kibana_current` alias is still pointing to the source index
2. Points the `.kibana_7.10.0` and `.kibana_current` aliases to the target index.
3. If this fails with a "required alias [.kibana_current] does not exist" error fetch `.kibana_current` again:
1. If `.kibana_current` is _not_ pointing to our target index fail the migration.
2. If `.kibana_current` is pointing to our target index the migration has succeeded and we can proceed to step (9).
9. Start serving traffic.
Together with the restrictions in (4.2.1.1), this algorithm ensures that
migrations are idempotent. If two nodes are started simultaneously, both of
them will start transforming documents in that version's target index, but because migrations are idempotent, it doesn't matter which node's writes win. A sketch of the key Elasticsearch calls for steps (5), (6) and (8) follows at the end of this section.
<details>
<summary>In the future, this algorithm could enable (2.6) "read-only functionality during the downtime window" but this is outside of the scope of this RFC.</summary>
Although the migration algorithm guarantees there's no data loss while providing read-only access to outdated nodes, this could cause plugins to behave in unexpected ways. If we wish to pursue it in the future, enabling read-only functionality during the downtime window will be its own project and must include an audit of all plugins' behaviours.
</details>
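The following is a minimal sketch of steps (5), (6) and (8), using the `@elastic/elasticsearch` client. Index names are hard-coded to the 7.10.0 example, `expectedMappings` is assumed to be built from all registered types, and the `must_exist` option assumes the alias-swap precondition added by the Elasticsearch PR linked in step (8):
```js
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function cloneMigrateAndSwap(sourceIndex, expectedMappings) {
  // Step 5: clone the write-blocked source index into the version-specific target.
  await client.indices.clone({
    index: sourceIndex,
    target: '.kibana_7.10.0_001',
    wait_for_active_shards: 'all',
    body: { settings: { 'index.blocks.write': false } },
  });

  // Step 6: deep-merge the expected mappings into the target index and pick up
  // any new fields with an update_by_query.
  await client.indices.putMapping({ index: '.kibana_7.10.0_001', body: expectedMappings });
  await client.updateByQuery({ index: '.kibana_7.10.0_001', conflicts: 'proceed' });

  // Step 8: atomically move the aliases. `must_exist` makes the swap fail if
  // another node already moved `.kibana_current` off the source index.
  await client.indices.updateAliases({
    body: {
      actions: [
        { remove: { index: sourceIndex, alias: '.kibana_current', must_exist: true } },
        { add: { index: '.kibana_7.10.0_001', alias: '.kibana_current' } },
        { add: { index: '.kibana_7.10.0_001', alias: '.kibana_7.10.0' } },
      ],
    },
  });
}
```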
### 4.2.1.3 Upgrade and rollback procedure
When a newer Kibana starts an upgrade, it blocks all writes to the outdated index to prevent data loss. Since Kibana is not designed to gracefully handle a read-only index, this could have unintended consequences such as a task executing multiple times but never being able to write that the task was completed successfully. To prevent unintended consequences, the following procedure should be followed when upgrading Kibana:
1. Gracefully shut down outdated nodes by sending a `SIGTERM` signal (a sketch of this step follows the procedure)
1. Node starts returning `503` from its healthcheck endpoint to signal to
the load balancer that it's no longer accepting new traffic (requires https://github.com/elastic/kibana/issues/46984).
2. Allows ongoing HTTP requests to complete with a configurable timeout
before forcefully terminating any open connections.
3. Closes any keep-alive sockets by sending a `connection: close` header.
4. Shuts down all plugins and Core services.
2. (recommended) Take a snapshot of all Kibana's Saved Objects indices. This reduces a rollback to a simple snapshot restore, but is not required in order to roll back if a migration fails.
3. Start the upgraded Kibana nodes. All running Kibana nodes should be on the same version, have the same plugins enabled and use the same configuration.
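A minimal sketch of the graceful shutdown in step (1), using a bare Node.js HTTP server rather than Kibana's actual Core lifecycle; the `/api/status` path and 30s timeout are illustrative:
```js
const http = require('http');

let shuttingDown = false;

const server = http.createServer((req, res) => {
  if (req.url === '/api/status' && shuttingDown) {
    // Tell the load balancer this node no longer accepts new traffic.
    res.writeHead(503, { connection: 'close' });
    return res.end('shutting down');
  }
  res.writeHead(200);
  res.end('ok');
});

process.on('SIGTERM', () => {
  shuttingDown = true;
  // Stop accepting new connections and let in-flight requests finish,
  // then force the process to exit after a configurable timeout.
  server.close(() => process.exit(0));
  setTimeout(() => process.exit(1), 30000).unref();
});

server.listen(5601);
```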
To roll back to a previous version of Kibana with a snapshot:
1. Shut down all Kibana nodes.
2. Restore the Saved Object indices and aliases from the snapshot.
3. Start the rollback Kibana nodes. All running Kibana nodes should be on the same rollback version, have the same plugins enabled and use the same configuration.
To roll back to a previous version of Kibana without a snapshot (a sketch of the Elasticsearch calls follows these steps):
(Assumes the migration to 7.11.0 failed)
1. Shut down all Kibana nodes.
2. Remove the index created by the failed Kibana migration by using the version-specific alias e.g. `DELETE /.kibana_7.11.0`
3. Identify the rollback index:
1. If rolling back to a Kibana version < 7.10.0 use `.kibana`
2. If rolling back to a Kibana version >= 7.10.0 use the version alias of the Kibana version you wish to rollback to e.g. `.kibana_7.10.0`
4. Point the `.kibana_current` alias to the rollback index.
5. Remove the write block from the rollback index.
6. Start the rollback Kibana nodes. All running Kibana nodes should be on the same rollback version, have the same plugins enabled and use the same configuration.
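A minimal sketch of the Elasticsearch calls behind steps (2) to (5), assuming a rollback from a failed 7.11.0 migration to 7.10.0 and the `@elastic/elasticsearch` client:
```js
async function rollbackWithoutSnapshot(client) {
  // Step 2: resolve the failed version's alias and delete the concrete index behind it.
  const failed = await client.indices.getAlias({ name: '.kibana_7.11.0' });
  await client.indices.delete({ index: Object.keys(failed.body).join(',') });

  // Step 3: the rollback index is whatever `.kibana_7.10.0` points to.
  const rollback = await client.indices.getAlias({ name: '.kibana_7.10.0' });
  const rollbackIndex = Object.keys(rollback.body)[0];

  // Step 4: point `.kibana_current` at the rollback index.
  await client.indices.updateAliases({
    body: {
      actions: [
        { remove: { index: '*', alias: '.kibana_current' } },
        { add: { index: rollbackIndex, alias: '.kibana_current' } },
      ],
    },
  });

  // Step 5: remove the write block so the rollback version can write again.
  await client.indices.putSettings({
    index: rollbackIndex,
    body: { 'index.blocks.write': false },
  });
}
```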
### 4.2.1.4 Handling documents that belong to a disabled plugin
It is possible for a plugin to create documents in one version of Kibana, but then when upgrading Kibana to a newer version, that plugin is disabled. Because the plugin is disabled it cannot register its Saved Object types, including the mappings or any migration transformation functions. These "orphan" documents could cause future problems:
- A major version introduces breaking mapping changes that cannot be applied to the data in these documents.
- Two majors later, migrations will no longer be able to migrate this old schema and could fail unexpectedly when the plugin is suddenly enabled.
As a concrete example of the above, consider a user taking the following steps:
1. Installs Kibana 7.6.0 with spaces=enabled. The spaces plugin creates a default space saved object.
2. User upgrades to 7.10.0 but uses the OSS download which has spaces=disabled. Although the 7.10.0 spaces plugin includes a migration for space documents, the OSS release cannot migrate the documents or update its mappings.
3. User realizes they made a mistake and switches to Kibana 7.10.0 with X-Pack and the spaces plugin enabled. At this point we have a completed migration for 7.10.0 but there are outdated spaces documents with migrationVersion=7.6.0 instead of 7.10.0.
There are several approaches we could take to dealing with these orphan documents:
1. Start up but refuse to query on types with outdated documents until a user manually triggers a re-migration
Advantages:
- The impact is limited to a single plugin
Disadvantages:
- It might be less obvious that a plugin is in a degraded state unless you read the logs (not possible on Cloud) or view the `/status` endpoint.
- If a user doesn't care that the plugin is degraded, orphan documents are carried forward indefinitely.
- Since Kibana has started receiving traffic, users can no longer
downgrade without losing data. They have to re-migrate, but if that
fails they're stuck.
- Introduces a breaking change in the upgrade behaviour
To perform a re-migration:
- Remove the `.kibana_7.10.0` alias
- Take a snapshot OR set the configuration option `migrations.target_index_postfix: '002'` to create a new target index `.kibana_7.10.0_002` and keep the `.kibana_7.10.0_001` index to be able to perform a rollback.
- Start up Kibana
2. Refuse to start Kibana until the plugin is enabled or its data is deleted
Advantages:
- Admins are forced to deal with the problem as soon as they disable a plugin
Disadvantages:
- Cannot temporarily disable a plugin to aid in debugging or to reduce the load a Kibana plugin places on an ES cluster.
- Introduces a breaking change
3. Refuse to start a migration until the plugin is enabled or its data is deleted
Advantages:
- We force users to enable the plugin or delete its documents, which prevents these documents from causing future problems such as a mapping update that isn't compatible because some fields are assumed to have been migrated.
- We keep the index “clean”.
Disadvantages:
- Since users have to take down outdated nodes before they can start the upgrade, they have to enter the downtime window before they know about this problem. This prolongs the downtime window and in many cases might cause an operations team to have to reschedule their downtime window to give them time to investigate the documents that need to be deleted. Logging an error on every startup could warn users ahead of time to mitigate this.
- We don't expose Kibana logs on Cloud so this will have to be escalated to support and could take 48hrs to resolve (users can safely roll back, but without visibility into the logs they might not know this). Exposing Kibana logs is on the cloud team's roadmap.
- It might not be obvious, just from the saved object type, which plugin created these objects.
- Introduces a breaking change in the upgrade behaviour
4. Use a hash of enabled plugins as part of the target index name
Using a migration target index name like
`.kibana_7.10.0_${hash(enabled_plugins)}_001` we can migrate all documents
every time a plugin is enabled / disabled.
Advantages:
- Outdated documents belonging to disabled plugins will be upgraded as soon
as the plugin is enabled again.
Disadvantages:
- Disabling / enabling a plugin will cause downtime (breaking change).
- When a plugin is enabled, disabled and enabled again our target index
will be an existing outdated index which needs to be deleted and
re-cloned. Without a way to check if the index is outdated, we cannot
deterministically perform the delete and re-clone operation without
coordination.
5. Transform outdated documents (step 7) on every startup
Advantages:
- Outdated documents belonging to disabled plugins will be upgraded as soon
as the plugin is enabled again.
Disadvantages:
- Orphan documents are retained indefinitely so there's still a potential
for future problems.
- Slightly slower startup time since we have to query for outdated
documents every time.
We prefer option (3) since it provides flexibility for disabling plugins in
the same version while also protecting users' data in all cases during an
upgrade migration. However, because this is a breaking change we will
implement (5) during 7.x and only implement (3) during 8.x.
# 5. Alternatives
## 5.1 Rolling upgrades
We considered implementing rolling upgrades to provide zero downtime
migrations. However, this would introduce significant complexity for plugins:
they will need to maintain up and down migration transformations and ensure
that queries match both current and outdated documents across all
versions. Although we can afford the once-off complexity of implementing
rolling upgrades, the complexity burden of maintaining plugins that support
rolling-upgrades will slow down all development in Kibana. Since a predictable
downtime window is sufficient for our users, we decided against trying to
achieve zero downtime with rolling upgrades. See "Rolling upgrades" in
https://github.com/elastic/kibana/issues/52202 for more information.
## 5.2 Single node migrations coordinated through a lease/lock
This alternative is a proposed algorithm for coordinating migrations so that
these only happen on a single node and therefore don't have the restrictions
found in [(4.2.1.1)](#4211-restrictions). We decided against this algorithm
primarily because it is a lot more complex, but also because it could still
require manual intervention to retry from certain unlikely edge cases.
<details>
<summary>It's impossible to guarantee that a single node performs the
migration and automatically retry failed migrations.</summary>
Coordination should ensure that only one Kibana node performs the migration at
a given time, which can be achieved with a distributed lock built on top of
Elasticsearch. For the Kibana cluster to be able to retry a failed migration,
it requires a specialized lock which expires after a given amount of inactivity.
We will refer to such expiring locks as a "lease".
If a Kibana process stalls, it is possible that the process' lease has expired
but the process doesn't yet recognize this and continues the migration. To
prevent this from causing data loss each lease should be accompanied by a
"guard" that prevents all writes after the lease has expired. See
[how to do distributed
locking](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html)
for an in-depth discussion.
Elasticsearch doesn't provide any building blocks for constructing such a guard.
</details>
However, we can implement a lock (that never expires) with strong
data-consistency guarantees. Because there's no expiration, a failure between
obtaining the lock and releasing it will require manual intervention. Instead
of trying to accomplish the entire migration after obtaining a lock, we can
only perform the last step of the migration process, moving the aliases, with
a lock. A permanent failure in only this last step is not impossible, but very
unlikely.
### 5.2.1 Migration algorithm
1. Obtain a document lock (see [5.2.2 Document lock
algorithm](#522-document-lock-algorithm)). Convert the lock into a "weak
lease" by expiring locks for nodes which aren't active (see [4.2.2.4
Checking for lease expiry](#4324-checking-for-lease-expiry)). This "weak
lease" doesn't require strict guarantees since it's only used to prevent
multiple Kibana nodes from performing a migration in parallel to reduce the
load on Elasticsearch.
2. Migrate data into a new process-specific index (we could use the process
UUID that's used in the lease document like
`.kibana_3ef25ff1-090a-4335-83a0-307a47712b4e`).
3. Obtain a document lock (see [5.2.2 Document lock
algorithm](#522-document-lock-algorithm)).
4. Finish the migration by pointing `.kibana` to
`.kibana_3ef25ff1-090a-4335-83a0-307a47712b4e`. This automatically releases
the document lock (and any leases) because the new index will contain an
empty `kibana_cluster_state`.
If a process crashes or is stopped after (3) but before (4) the lock will have
to be manually removed by deleting the `kibana_cluster_state` document from
`.kibana` or restoring from a snapshot.
### 5.2.2 Document lock algorithm
To improve on the existing Saved Objects migrations lock, a locking algorithm
needs to satisfy the following requirements:
- Must guarantee that only a single node can obtain the lock. Since we can
only provide strong data-consistency guarantees on the document level in
Elasticsearch our locking mechanism needs to be based on a document.
- Manually removing the lock
- shouldn't have any risk of accidentally causing data loss.
- can be done with a single command that's always the same (shouldn't
require trying to find `n` for removing the correct `.kibana_n` index).
- Must be easy to retrieve the lock/cluster state to aid in debugging or to
provide visibility.
Algorithm:
1. Node reads `kibana_cluster_state` lease document from `.kibana`
2. It sends a heartbeat every `heartbeat_interval` seconds by sending an
update operation that adds its UUID to the `nodes` array and sets the
`lastSeen` value to the current local node time. If the update fails due to
a version conflict the update operation is retried after a random delay by
fetching the document again and attempting the update operation once more.
3. To obtain a lease, a node:
1. Fetches the `kibana_cluster_state` document
2. If all the nodes' `hasLock === false`, it sets its own `hasLock` to
true and attempts to write the document. If the update fails
(presumably because of another node's heartbeat update) it restarts the
process to obtain a lease from step (3).
3. If another node's `hasLock === true`, the node failed to acquire a
lock and waits until the active lock has expired before attempting to
obtain a lock again.
4. Once a node is done with its lock, it releases it by fetching and then
updating `hasLock = false`. The fetch + update operations are retried until
this node's `hasLock === false`. (A sketch of the lock acquisition follows the document format below.)
Each machine writes a `UUID` to a file, so a single machine may have multiple
processes sharing the same Kibana `UUID`; we should therefore generate a new
UUID just for the lifetime of this process.
`KibanaClusterState` document format:
```js
{
  nodes: {
    "852bd94e-5121-47f3-a321-e09d9db8d16e": {
      version: "7.6.0",
      lastSeen: [ 1114793, 555149266 ], // hrtime() big int timestamp
      hasLease: true,
      hasLock: false,
    },
    "8d975c5b-cbf6-4418-9afb-7aa3ea34ac90": {
      version: "7.6.0",
      lastSeen: [ 1114862, 841295591 ],
      hasLease: false,
      hasLock: false,
    },
    "3ef25ff1-090a-4335-83a0-307a47712b4e": {
      version: "7.6.0",
      lastSeen: [ 1114877, 611368546 ],
      hasLease: false,
      hasLock: false,
    },
  },
  oplog: [
    { op: 'ACQUIRE_LOCK', node: '852bd94e...', timestamp: '2020-04-20T11:58:56.176Z' }
  ]
}
```
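A minimal sketch of the lock acquisition in step (3), using optimistic concurrency control (`if_seq_no` / `if_primary_term`) on the `kibana_cluster_state` document; the document id and error handling are illustrative:
```js
async function acquireLock(client, nodeUuid) {
  // Step 3.1: fetch the cluster state document together with its sequence number.
  const { body: doc } = await client.get({ index: '.kibana', id: 'kibana_cluster_state' });

  // Step 3.2: only proceed if no other node currently holds the lock.
  const nodes = doc._source.nodes;
  if (Object.values(nodes).some((n) => n.hasLock)) return false;

  nodes[nodeUuid] = { ...nodes[nodeUuid], hasLock: true };
  try {
    // The write only succeeds if nobody modified the document since we read it.
    await client.index({
      index: '.kibana',
      id: 'kibana_cluster_state',
      if_seq_no: doc._seq_no,
      if_primary_term: doc._primary_term,
      body: doc._source,
    });
    return true;
  } catch (e) {
    // Version conflict: another node's heartbeat or lock attempt won; retry from step (3).
    if (e.meta && e.meta.statusCode === 409) return false;
    throw e;
  }
}
```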
### 5.2.3 Checking for "weak lease" expiry
The simplest way to check for lease expiry is to inspect the `lastSeen` value.
If `lastSeen + expiry_timeout < now` the lease is considered expired. If there
is clock drift or a daylight savings time adjustment, there's a risk that a
node loses its lease before `expiry_timeout` has occurred. Since losing a
lease prematurely will not lead to data loss it's not critical that the
expiry time is observed under all conditions.
A slightly safer approach is to use a monotonically increasing clock
(`process.hrtime()`) and relative time to determine expiry. Using a
monotonically increasing clock guarantees that the clock will always increase
even if the system time changes due to daylight savings time, NTP clock syncs,
or manually setting the time. To check for expiry, other nodes poll the
cluster state document. Once they see that the `lastSeen` value has increased,
they capture the current hr time `current_hr_time` and start waiting until
`process.hrtime() - current_hr_time > expiry_timeout`; if at that point
`lastSeen` hasn't been updated, the lease is considered to have expired. This
means other nodes can take up to `2*expiry_timeout` to recognize an expired
lease, but a lease will never expire prematurely.
Any node that detects an expired lease can release that lease by setting the
expired node's `hasLease = false`. It can then attempt to acquire its own lease.
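A minimal sketch of the monotonic expiry check described above; the expiry timeout, polling interval, and `getLastSeen` callback are assumptions for illustration:
```js
const EXPIRY_TIMEOUT_NS = 60n * 1000000000n; // hypothetical 60 second expiry

function watchLeaseExpiry(getLastSeen, onExpired) {
  // `getLastSeen` returns the [seconds, nanoseconds] hrtime tuple from the
  // cluster state document of the current lease holder.
  let observedLastSeen = getLastSeen();
  let observedAt = process.hrtime.bigint(); // local monotonic timestamp

  const timer = setInterval(() => {
    const lastSeen = getLastSeen();
    if (lastSeen[0] !== observedLastSeen[0] || lastSeen[1] !== observedLastSeen[1]) {
      // The lease holder is still heartbeating; restart the local countdown.
      observedLastSeen = lastSeen;
      observedAt = process.hrtime.bigint();
    } else if (process.hrtime.bigint() - observedAt > EXPIRY_TIMEOUT_NS) {
      clearInterval(timer);
      onExpired();
    }
  }, 5000);
}
```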
## 5.3 Minimize data loss with mixed Kibana versions during 7.x
When multiple versions of Kibana are running at the same time, writes from the
outdated node can end up either in the outdated Kibana index, the newly
migrated index, or both. New documents added (and some updates) into the old
index while a migration is in-progress will be lost. Writes that end up in the
new index will be in an outdated format. This could cause queries on the data
to only return a subset of the results which leads to incorrect results or
silent data loss.
Minimizing data loss from mixed 7.x versions introduces two additional steps
to roll back to a previous version without a snapshot:
1. (existing) Point the `.kibana` alias to the previous Kibana index `.kibana_n-1`
2. (existing) Delete `.kibana_n`
3. (new) Enable writes on `.kibana_n-1`
4. (new) Delete the dummy "version lock" document from `.kibana_n-1`
Since our documentation and server logs have implicitly encouraged users to
roll back without using snapshots, many users might have to rely on these
additional migration steps to perform a rollback. Since even the existing
steps are error-prone, introducing more steps will likely introduce more
problems than it solves. The proposed algorithm:
1. All future versions of Kibana 7.x will use the `.kibana_saved_objects`
alias to locate the current index. If `.kibana_saved_objects` doesn't
exist, newer versions will fallback to reading `.kibana`.
2. All future versions of Kibana will locate the index that
`.kibana_saved_objects` points to and then read and write directly from
the _index_ instead of the alias.
3. Before starting a migration:
1. Write a new dummy "version lock" document to the `.kibana` index with a
`migrationVersion` set to the current version of Kibana. If an outdated
node is started up after a migration was started it will detect that
newer documents are present in the index and refuse to start up.
2. Set the outdated index to read-only. Since `.kibana` is never advanced,
it will be pointing to a read-only index which prevents writes from
6.8+ releases which are already online. (A sketch of steps (3.1) and (3.2) follows this list.)
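A minimal sketch of steps (3.1) and (3.2), using the `@elastic/elasticsearch` client; the `version-lock` document id is illustrative:
```js
async function lockOutdatedIndex(client, currentVersion) {
  // Step 3.1: write a dummy "version lock" document; an outdated node that sees
  // a newer migrationVersion at startup refuses to start.
  await client.index({
    index: '.kibana',
    id: 'version-lock',
    body: { type: 'version-lock', migrationVersion: currentVersion },
  });

  // Step 3.2: make the outdated index read-only so 6.8+ nodes writing through
  // the `.kibana` alias are blocked.
  await client.indices.putSettings({
    index: '.kibana',
    body: { 'index.blocks.write': true },
  });
}
```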
## 5.4 In-place migrations that re-use the same index (8.0)
> We considered an algorithm that re-uses the same index for migrations and an approach to minimize data-loss if our upgrade procedures aren't followed. This is no longer our preferred approach because of several downsides:
> - It requires taking snapshots to prevent data loss so we can only release this in 8.x
> - Minimizing data loss with unsupported upgrade configurations adds significant complexity and still doesn't guarantee that data isn't lost.
### 5.4.1 Migration algorithm (8.0):
1. Exit Kibana with a fatal error if a newer node has started a migration by
checking for:
1. Documents with newer `migrationVersion` numbers.
2. If the mappings are out of date, update the mappings to the combination of
the index's current mappings and the expected mappings.
3. If there are outdated documents, migrate these in batches:
1. Read a batch of outdated documents from the index.
2. Transform documents by applying the migration transformation functions.
3. Update the document batch in the same index using optimistic concurrency
control. If a batch fails due to an update version mismatch, continue
migrating the other batches (a sketch of this step follows the algorithm).
4. If a batch fails due to other reasons, repeat the entire migration process.
4. If any of the batches in step (3.3) failed, repeat the entire migration
process. This ensures that in-progress bulk update operations from an
outdated node won't lead to unmigrated documents still being present after
the migration.
5. Once all documents are up to date, the migration is complete and Kibana can
start serving traffic.
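A minimal sketch of step (3.3), writing a transformed batch back with optimistic concurrency control through the bulk API; the shape of `transformedDocs` is an assumption:
```js
async function updateBatch(client, index, transformedDocs) {
  // Each action carries the _seq_no/_primary_term read in step (3.1), so a
  // concurrent write from an outdated node surfaces as a 409 version conflict.
  const body = transformedDocs.flatMap((doc) => [
    {
      index: {
        _index: index,
        _id: doc.id,
        if_seq_no: doc.seqNo,
        if_primary_term: doc.primaryTerm,
      },
    },
    doc.migratedSource,
  ]);

  const { body: response } = await client.bulk({ body });
  // Version conflicts are tolerated (another node raced us); any other failure
  // causes the entire migration process to be repeated, as per step (4).
  const otherFailures = response.items.filter(
    (item) => item.index.error && item.index.status !== 409
  );
  return { retryWholeMigration: otherFailures.length > 0 };
}
```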
Advantages:
- Not duplicating all documents into a new index will speed up migrations and
reduce the downtime window. This will be especially important for the future
requirement to support > 10k or > 100k documents.
- We can check the health of an existing index before starting the migration,
but we cannot detect what kind of failures might occur while creating a new
index. Whereas retrying migrations will eventually recover from the errors
in (3.3), re-using an index allows us to detect these problems before starting
the migration and to avoid errors like (3.3.1) altogether.
- Single index to back up instead of an “index pattern” that matches any
`.kibana_n`.
- Simplifies the Kibana system index Elasticsearch plugin since it only needs
to work on one index per "tenant".
- By leveraging optimistic concurrency control we can further minimize data
loss for unsupported upgrade configurations in the future.
Drawbacks:
- Cannot make breaking mapping changes (even though it was possible, we have not
introduced a breaking mapping change during 7.x).
- Rollback is only possible by restoring a snapshot which requires educating
users to ensure that they don't rely on `.kibana_n` indices as backups.
(Apart from the need to educate users, snapshot restores provide many
benefits).
- It narrows the second restriction under (4.2.1) even further: migrations
cannot rely on any state that could change as part of a migration because we
can no longer use the previous index as a snapshot of unmigrated state.
- We can't automatically perform a rollback from a half-way done migration.
- It's impossible to provide read-only functionality for outdated nodes which
means we can't achieve goal (2.7).
### 5.4.2 Minimizing data loss with unsupported upgrade configurations (8.0)
> This alternative can reduce some data loss when our upgrade procedure isn't
> followed with the algorithm in (5.4.1).
Even if (4.2.1.3) is the only supported upgrade procedure, we should try to
prevent data loss when these instructions aren't followed.
To prevent data loss we need to prevent any writes from older nodes. We use
a version-specific alias for this purpose. Each time a migration is started,
all other aliases are removed. However, aliases are stored inside
Elasticsearch's ClusterState and this state could remain inconsistent between
nodes for an unbounded amount of time. In addition, bulk operations that were
accepted before the alias was removed will continue to run even after removing
the alias.
As a result, Kibana cannot guarantee that there would be no data loss but
instead aims to minimize it as much as possible by adding the bold sections
to the migration algorithm from (5.4.1):
1. **Disable `action.auto_create_index` for the Kibana system indices.**
2. Exit Kibana with a fatal error if a newer node has started a migration by
checking for:
1. **Version-specific aliases on the `.kibana` index with a newer version.**
2. Documents with newer `migrationVersion` numbers.
3. **Remove all other aliases and create a new version-specific alias for
reading and writing to the `.kibana` index, e.g. `.kibana_8.0.1`. During and
after the migration, all saved object reads and writes use this alias
instead of reading or writing directly to the index. By using the atomic
`POST /_aliases` API we minimize the chance that an outdated node creating
new outdated documents can cause data loss.**
4. **Wait for the default bulk operation timeout of 30s. This ensures that any
bulk operations accepted before the removal of the alias have either
completed or returned a timeout error to its initiator.**
5. If the mappings are out of date, update the mappings **through the alias**
to the combination of the index's current mappings and the expected
mappings. **If this operation fails due to an index missing exception (most
likely because another node removed our version-specific alias) repeat the
entire migration process.**
6. If there are outdated documents, migrate these in batches:
1. Read a batch of outdated documents from `.kibana_n`.
2. Transform documents by applying the migration functions.
3. Update the document batch in the same index using optimistic concurrency
control. If a batch fails due to an update version mismatch, continue
migrating the other batches.
4. If a batch fails due to other reasons, repeat the entire migration process.
7. If any of the batches in step (6.3) failed, repeat the entire migration
process. This ensures that in-progress bulk update operations from an
outdated node won't lead to unmigrated documents still being present after
the migration.
8. Once all documents are up to date, the migration is complete and Kibana can
start serving traffic.
Steps (2) and (3) of the migration algorithm minimize the chances of the
following scenarios occurring but cannot guarantee it. It is therefore useful
to enumerate some scenarios and their worst-case impact:
1. An outdated node issued a bulk create to its version-specific alias.
Because a user doesn't wait for all traffic to drain, a newer node starts
its migration before the bulk create is complete. Since this bulk create
was accepted before the newer node deleted the previous version-specific
aliases, it is possible that the index now contains some outdated documents
that the new node is unaware of and doesn't migrate. Although these outdated
documents can lead to inconsistent query results and data loss, step (4)
ensures that an error will be returned to the node that created these
objects.
2. An 8.1.0 node and an 8.2.0 node start migrating an 8.0.0 index in parallel.
Even though the 8.2.0 node will remove the 8.1.0 version-specific aliases,
the 8.1.0 node could have sent a bulk update operation that got accepted
before its alias was removed. When the 8.2.0 node tries to migrate these
8.1.0 documents it gets a version conflict but cannot be sure if this was
because another node of the same version migrated this document (which can
safely be ignored) or interference from a different Kibana version. The
8.1.0 node will hit the error in step (6.3) and restart the migration but
then ultimately fail at step (2). The 8.2.0 node will repeat the entire
migration process from step (7) thus ensuring that all documents are up to
date.
3. A race condition with another Kibana node on the same version, but with
different enabled plugins, caused this node's required mappings to be
overwritten. If this causes a mapper parsing exception in step (6.3) we can
restart the migration. Because updating the mappings is additive and saved
object types are unique to a plugin, restarting the migration will allow
the node to update the mappings to be compatible with the node's plugins. Both
nodes will be able to successfully complete the migration of their plugins'
registered saved object types. However, if the migration doesn't trigger a
mapper parsing exception, the incompatible mappings would go undetected,
which can cause future problems like write failures or inconsistent query
results.
## 5.5 Tag objects as “invalid” if their transformation fails
> This alternative prevents a failed migration when there's a migration transform function bug or a document with invalid data. Although it seems preferable to not fail the entire migration because of a single saved object type's migration transform bug or a single invalid document, this has several pitfalls:
> 1. When an object fails to migrate, the data for that saved object type becomes inconsistent. This could lead to a critical feature being unavailable to a user, leaving them with no choice but to downgrade.
> 2. Because Kibana starts accepting traffic after encountering invalid objects, a rollback will lead to data loss, leaving users with no clean way to recover.
> As a result we prefer to let an upgrade fail and make it easy for users to roll back until they can resolve the root cause.
> Achieves goals: (2.2)
> Mitigates Errors (3.1), (3.2)
1. Tag objects as “invalid” if they cause an exception when being transformed,
but don't fail the entire migration.
2. Log an error message informing administrators that there are invalid
objects which require inspection. For each invalid object, provide an error
stack trace to aid in debugging.
3. Administrators should be able to generate a migration report (similar to
the one dry run migrations create) which is an NDJSON export of all objects
tagged as “invalid”.
1. Expose this as an HTTP API first
2. (later) Notify administrators and allow them to export invalid objects
from the Kibana UI.
4. When an invalid object is read, the Saved Objects repository will throw an
invalid object exception which should include a link to the documentation
to help administrators resolve migration bugs.
5. Educate Kibana developers to no longer simply write back an unmigrated
document if an exception occurred. A migration function should either
successfully transform the object or throw.
# 6. How we teach this
1. Update documentation and server logs to start educating users to depend on
snapshots for Kibana rollbacks.
2. Update developer documentation and educate developers with best practices
for writing migration functions.
# 7. Unresolved questions
1. When cloning an index we can only ever add new fields to the mappings. When
a saved object type or specific field is removed, the mappings will remain
until we re-index. Is it sufficient to only re-index every major? How do we
track the field count as it grows over every upgrade?
2. More generally, how do we deal with the growing field count approaching the
default limit of 1000?