Adds documentation and improves migrations failing on timeouts while waiting for index yellow status (#130352)

* reapply docs and doclink changes

* Updates wait_for_index_status_yellow response type on response timeout, updates create_index action and model to account for the changes

* Refactors clone_index action to account for the new return type of waitForIndexStatusYellow, updates model

* Updates README

* Updates snapshot

* Updates docs

* Fix import violations

* imports

* Extends the retry log message with an actionable item linking to the docs on every retryable migration action

* Refactor retry_state and model to allow linking to specific subsections in the docs

* Updates resolving saved objects migration failures docs

* Calls waitForIndexStatusYellow directly in actions integration tests

* Deletes comment

* Update src/core/server/saved_objects/migrations/model/retry_state.test.ts

Co-authored-by: Rudolf Meijering <skaapgif@gmail.com>

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Rudolf Meijering <skaapgif@gmail.com>
Christiane (Tina) Heiligers 2022-04-21 08:11:39 -07:00 committed by GitHub
parent 502a00b025
commit fb33187270
25 changed files with 425 additions and 33 deletions

View file

@ -46,7 +46,11 @@ Take these extra steps to ensure you are ready for migration.
[float]
==== Ensure your {es} cluster is healthy
Problems with your {es} cluster can prevent {kib} upgrades from succeeding. Ensure that your cluster has:
Problems with your {es} cluster can prevent {kib} upgrades from succeeding.
During the upgrade process, {kib} creates new indices into which updated documents are written. If a cluster is approaching the low watermark, there's a high risk that {kib} won't be able to create these indices. Reading, transforming, and writing updated documents can be memory intensive, using more of the available heap than during routine operation. Make sure that enough heap is available to prevent requests from timing out or failing with circuit breaker exceptions. You should also ensure that all shards are replicated and assigned; an example check follows the list below.
A healthy cluster has:
* Enough free disk space, at least twice the amount of storage taken up by the `.kibana` and `.kibana_task_manager` indices
* Sufficient heap size
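
One way to verify both points before upgrading is to query the cluster directly (standard {es} APIs):

[source,sh]
--------------------------------------------
GET _cat/allocation?v
GET _cluster/health?filter_path=status,unassigned_shards
--------------------------------------------

`_cat/allocation` reports per-node disk usage, and the filtered health response shows whether any shards are unassigned.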

View file

@ -99,7 +99,7 @@ object types will also log the following warning message:
[source,sh]
--------------------------------------------
CHECK_UNKNOWN_DOCUMENTS Upgrades will fail for 8.0+ because documents were found for unknown saved object types. To ensure that upgrades will succeed in the future, either re-enable plugins or delete these documents from the ".kibana_7.17.0_001" index after the current upgrade completes.
CHECK_UNKNOWN_DOCUMENTS Upgrades will fail for 8.0+ because documents were found for unknown saved object types. To ensure that future upgrades will succeed, either re-enable plugins or delete these documents from the ".kibana_7.17.0_001" index after the current upgrade completes.
--------------------------------------------
If you fail to remedy this, your upgrade to 8.0+ will fail with a message like:
@ -123,3 +123,67 @@ In {kib} 7.5.0 and earlier, when the task manager index is set to `.tasks`
with the configuration setting `xpack.tasks.index: ".tasks"`,
upgrade migrations fail. In {kib} 7.5.1 and later, the incompatible configuration
setting prevents upgrade migrations from starting.
[float]
==== Repeated time-out requests that eventually fail
Migrations can get stuck in a loop of retry attempts while waiting for an index yellow status that is never reached.
During the `CLONE_TEMP_TO_TARGET` or `CREATE_REINDEX_TEMP` steps, you might see a log entry similar to:
[source,sh]
--------------------------------------------
"Action failed with [index_not_yellow_timeout] Timeout waiting for the status of the [.kibana_8.1.0_001] index to become "yellow". Retrying attempt 1 in 2 seconds."
--------------------------------------------
The process is waiting for a yellow index status. There are two known causes:
* Cluster hits the low watermark for disk usage
* Cluster has <<routing-allocation-disabled,routing allocation disabled>>
Before retrying the migration, inspect the output of the `_cluster/allocation/explain?index=${targetIndex}` API to identify why the index isn't yellow:
[source,sh]
--------------------------------------------
GET _cluster/allocation/explain
{
"index": ".kibana_8.1.0_001",
"shard": 0,
"primary": true,
}
--------------------------------------------
If the cluster exceeded the low watermark for disk usage, the output should contain a message similar to this:
[source,sh]
--------------------------------------------
"The node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=85%], using more disk space than the maximum allowed [85.0%], actual free: [11.692661332965082%]"
--------------------------------------------
Refer to the {es} guide for how to {ref}/fix-common-cluster-issues.html#_error_disk_usage_exceeded_flood_stage_watermark_index_has_read_only_allow_delete_block[fix common cluster issues].
If routing allocation is the issue, the `_cluster/allocation/explain` API will return an entry similar to this:
[source,sh]
--------------------------------------------
"allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes"
--------------------------------------------
[float]
[[routing-allocation-disabled]]
==== Routing allocation disabled or restricted
Upgrade migrations fail because routing allocation is disabled or restricted (`cluster.routing.allocation.enable: none/primaries/new_primaries`), which causes {kib} to log errors such as:
[source,sh]
--------------------------------------------
Unable to complete saved object migrations for the [.kibana] index: The elasticsearch cluster has cluster routing allocation incorrectly set for migrations to continue. To proceed, please remove the cluster routing allocation settings with PUT /_cluster/settings {"transient": {"cluster.routing.allocation.enable": null}, "persistent": {"cluster.routing.allocation.enable": null}}
--------------------------------------------
To get around the issue, remove the transient and persistent routing allocation settings:
[source,sh]
--------------------------------------------
PUT /_cluster/settings
{
"transient": {
"cluster.routing.allocation.enable": null
},
"persistent": {
"cluster.routing.allocation.enable": null
}
}
--------------------------------------------
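
To verify the removal, you can retrieve the current cluster settings and confirm that `cluster.routing.allocation.enable` no longer appears:

[source,sh]
--------------------------------------------
GET _cluster/settings?flat_settings=true
--------------------------------------------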

View file

@ -642,5 +642,8 @@ export const getDocLinks = ({ kibanaBranch }: GetDocLinkOptions): DocLinks => {
legal: {
privacyStatement: `${ELASTIC_WEBSITE_URL}legal/privacy-statement`,
},
kibanaUpgradeSavedObjects: {
resolveMigrationFailures: `${KIBANA_DOCS}resolve-migrations-failures.html`,
},
});
};

View file

@ -398,4 +398,7 @@ export interface DocLinks {
readonly legal: {
readonly privacyStatement: string;
};
readonly kibanaUpgradeSavedObjects: {
readonly resolveMigrationFailures: string;
};
}

View file

@ -181,7 +181,11 @@ and the migration source index is the index the `.kibana` alias points to.
Create the target index. This operation is idempotent; if the index already exists, we wait until its status turns yellow. The response handling is sketched below.
### New control state
`MARK_VERSION_INDEX_READY`
1. If the action succeeds
`MARK_VERSION_INDEX_READY`
2. If the action fails with an `index_not_yellow_timeout`
`CREATE_NEW_TARGET`
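
As a rough sketch (hypothetical, simplified types; the real branching lives in `model.ts` and uses fp-ts `Either`), the response handling for this step looks like:

```ts
import * as Either from 'fp-ts/lib/Either';

// Sketch only: simplified stand-ins for the model's State and
// ResponseType<'CREATE_NEW_TARGET'>.
interface SketchState {
  controlState: string;
  retryAttempts: number;
  migrationDocLinks: Record<string, string>;
}
type CreateNewTargetResponse = Either.Either<
  { type: 'index_not_yellow_timeout'; message: string },
  'create_index_succeeded'
>;

function onCreateNewTargetResponse(
  stateP: SketchState,
  res: CreateNewTargetResponse,
  // The model's retry helper: bumps retryCount and sets an exponential retryDelay.
  delayRetryState: (s: SketchState, msg: string, maxAttempts: number) => SketchState
): SketchState {
  if (Either.isRight(res)) {
    return { ...stateP, controlState: 'MARK_VERSION_INDEX_READY' };
  }
  // Retryable timeout: stay in CREATE_NEW_TARGET and link to the docs.
  const msg = `${res.left.message} Refer to ${stateP.migrationDocLinks.resolveMigrationFailures} for information on how to resolve the issue.`;
  return delayRetryState(stateP, msg, stateP.retryAttempts);
}
```

Retrying in place instead of failing immediately covers the case of a busy cluster that is merely slow to allocate the primary shard.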
## LEGACY_SET_WRITE_BLOCK
### Next action
@ -209,8 +213,10 @@ Create a new `.kibana_pre6.5.0_001` index into which we can reindex the legacy
index. (Since the task manager index was converted from a data index into a
saved objects index in 7.4, it will be reindexed into `.kibana_pre7.4.0_001`)
### New control state
1. If the index creation succeeds
`LEGACY_REINDEX`
2. If the index creation task fails with an `index_not_yellow_timeout`
`LEGACY_REINDEX_WAIT_FOR_TASK`
## LEGACY_REINDEX
### Next action
`reindex`
@ -257,7 +263,10 @@ Wait for the Elasticsearch cluster to be in "yellow" state. It means the index's
We don't have as much data redundancy as we could have, but it's enough to start the migration. (An example of the underlying request follows the list below.)
### New control state
1. If the action succeeds
`SET_SOURCE_WRITE_BLOCK`
2. If the action fails with an `index_not_yellow_timeout`
`WAIT_FOR_YELLOW_SOURCE`
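
Concretely, waiting for yellow corresponds to a cluster health request scoped to the source index, along these lines (request shape assumed from the `waitForIndexStatusYellow` action; `<source-index>` and `30s` are placeholders):

```sh
GET _cluster/health/<source-index>?wait_for_status=yellow&timeout=30s
```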
## SET_SOURCE_WRITE_BLOCK
### Next action
@ -278,7 +287,10 @@ This operation is idempotent, if the index already exist, we wait until its stat
- (Since we never query the temporary index we can potentially disable refresh to speed up indexing performance. Profile to see if gains justify complexity)
### New control state
1. If the action succeeds
`REINDEX_SOURCE_TO_TEMP_OPEN_PIT`
2. If the action fails with an `index_not_yellow_timeout`
`CREATE_REINDEX_TEMP`
## REINDEX_SOURCE_TO_TEMP_OPEN_PIT
### Next action
@ -357,7 +369,10 @@ Ask elasticsearch to clone the temporary index into the target index. If the tar
We can't use the temporary index as our target index because one instance can complete the migration, delete a document, and then a second instance starts the reindex operation and re-creates the deleted document. By cloning the temporary index and only accepting writes/deletes from the cloned target index, we prevent lost acknowledged deletes.
### New control state
1. If the action succeeds
`OUTDATED_DOCUMENTS_SEARCH`
2. If the action fails with an `index_not_yellow_timeout`
`CLONE_TEMP_TO_TARGET`
## OUTDATED_DOCUMENTS_SEARCH
### Next action

View file

@ -32,6 +32,9 @@ Object {
},
],
"maxBatchSizeBytes": 100000000,
"migrationDocLinks": Object {
"resolveMigrationFailures": "https://www.elastic.co/guide/en/kibana/test-branch/resolve-migrations-failures.html",
},
"outdatedDocuments": Array [],
"outdatedDocumentsQuery": Object {
"bool": Object {
@ -193,6 +196,9 @@ Object {
},
],
"maxBatchSizeBytes": 100000000,
"migrationDocLinks": Object {
"resolveMigrationFailures": "https://www.elastic.co/guide/en/kibana/test-branch/resolve-migrations-failures.html",
},
"outdatedDocuments": Array [],
"outdatedDocumentsQuery": Object {
"bool": Object {
@ -358,6 +364,9 @@ Object {
},
],
"maxBatchSizeBytes": 100000000,
"migrationDocLinks": Object {
"resolveMigrationFailures": "https://www.elastic.co/guide/en/kibana/test-branch/resolve-migrations-failures.html",
},
"outdatedDocuments": Array [],
"outdatedDocumentsQuery": Object {
"bool": Object {
@ -527,6 +536,9 @@ Object {
},
],
"maxBatchSizeBytes": 100000000,
"migrationDocLinks": Object {
"resolveMigrationFailures": "https://www.elastic.co/guide/en/kibana/test-branch/resolve-migrations-failures.html",
},
"outdatedDocuments": Array [],
"outdatedDocumentsQuery": Object {
"bool": Object {
@ -722,6 +734,9 @@ Object {
},
],
"maxBatchSizeBytes": 100000000,
"migrationDocLinks": Object {
"resolveMigrationFailures": "https://www.elastic.co/guide/en/kibana/test-branch/resolve-migrations-failures.html",
},
"outdatedDocuments": Array [
Object {
"_id": "1234",
@ -894,6 +909,9 @@ Object {
},
],
"maxBatchSizeBytes": 100000000,
"migrationDocLinks": Object {
"resolveMigrationFailures": "https://www.elastic.co/guide/en/kibana/test-branch/resolve-migrations-failures.html",
},
"outdatedDocuments": Array [
Object {
"_id": "1234",

View file

@ -15,7 +15,7 @@ import {
catchRetryableEsClientErrors,
RetryableEsClientError,
} from './catch_retryable_es_client_errors';
import type { IndexNotFound, AcknowledgeResponse } from '.';
import type { IndexNotFound, AcknowledgeResponse, IndexNotYellowTimeout } from '.';
import { waitForIndexStatusYellow } from './wait_for_index_status_yellow';
import {
DEFAULT_TIMEOUT,
@ -49,7 +49,7 @@ export const cloneIndex = ({
target,
timeout = DEFAULT_TIMEOUT,
}: CloneIndexParams): TaskEither.TaskEither<
RetryableEsClientError | IndexNotFound,
RetryableEsClientError | IndexNotFound | IndexNotYellowTimeout,
CloneIndexResponse
> => {
const cloneTask: TaskEither.TaskEither<
@ -122,7 +122,7 @@ export const cloneIndex = ({
return pipe(
cloneTask,
TaskEither.chain((res) => {
TaskEither.chainW((res) => {
if (res.acknowledged && res.shardsAcknowledged) {
// If the cluster state was updated and all shards ackd we're done
return TaskEither.right(res);

View file

@ -22,7 +22,7 @@ import {
INDEX_AUTO_EXPAND_REPLICAS,
WAIT_FOR_ALL_SHARDS_TO_BE_ACTIVE,
} from './constants';
import { waitForIndexStatusYellow } from './wait_for_index_status_yellow';
import { IndexNotYellowTimeout, waitForIndexStatusYellow } from './wait_for_index_status_yellow';
function aliasArrayToRecord(aliases: string[]): Record<string, estypes.IndicesAlias> {
const result: Record<string, estypes.IndicesAlias> = {};
@ -54,7 +54,10 @@ export const createIndex = ({
indexName,
mappings,
aliases = [],
}: CreateIndexParams): TaskEither.TaskEither<RetryableEsClientError, 'create_index_succeeded'> => {
}: CreateIndexParams): TaskEither.TaskEither<
RetryableEsClientError | IndexNotYellowTimeout,
'create_index_succeeded'
> => {
const createIndexTask: TaskEither.TaskEither<
RetryableEsClientError,
AcknowledgeResponse

View file

@ -35,8 +35,11 @@ export { removeWriteBlock } from './remove_write_block';
export type { CloneIndexResponse, CloneIndexParams } from './clone_index';
export { cloneIndex } from './clone_index';
export type { WaitForIndexStatusYellowParams } from './wait_for_index_status_yellow';
import { waitForIndexStatusYellow } from './wait_for_index_status_yellow';
export type {
WaitForIndexStatusYellowParams,
IndexNotYellowTimeout,
} from './wait_for_index_status_yellow';
import { IndexNotYellowTimeout, waitForIndexStatusYellow } from './wait_for_index_status_yellow';
export type { WaitForTaskResponse, WaitForTaskCompletionTimeout } from './wait_for_task';
import { waitForTask, WaitForTaskCompletionTimeout } from './wait_for_task';
@ -149,6 +152,7 @@ export interface ActionErrorTypeMap {
request_entity_too_large_exception: RequestEntityTooLargeException;
unknown_docs_found: UnknownDocsFound;
unsupported_cluster_routing_allocation: UnsupportedClusterRoutingAllocation;
index_not_yellow_timeout: IndexNotYellowTimeout;
}
/**

View file

@ -321,8 +321,13 @@ describe('migration actions', () => {
});
describe('waitForIndexStatusYellow', () => {
afterAll(async () => {
await client.indices.delete({ index: 'red_then_yellow_index' });
afterEach(async () => {
try {
await client.indices.delete({ index: 'red_then_yellow_index' });
await client.indices.delete({ index: 'red_index' });
} catch (e) {
/** ignore */
}
});
it('resolves right after waiting for an index status to be yellow if the index already existed', async () => {
// Create a red index
@ -366,6 +371,39 @@ describe('migration actions', () => {
const yellowStatusResponse = await client.cluster.health({ index: 'red_then_yellow_index' });
expect(yellowStatusResponse.status).toBe('yellow');
});
it('resolves left with "index_not_yellow_timeout" when waiting for an index status to become yellow times out', async () => {
// Create a red index
await client.indices
.create({
index: 'red_index',
timeout: '5s',
body: {
mappings: { properties: {} },
settings: {
// Allocate no replicas so that this index stays red
number_of_replicas: '0',
// Disable all shard allocation so that the index status is red
index: { routing: { allocation: { enable: 'none' } } },
},
},
})
.catch((e) => {});
// try to wait for index status yellow:
const task = waitForIndexStatusYellow({
client,
index: 'red_index',
timeout: '1s',
});
await expect(task()).resolves.toMatchInlineSnapshot(`
Object {
"_tag": "Left",
"left": Object {
"message": "[index_not_yellow_timeout] Timeout waiting for the status of the [red_index] index to become 'yellow'",
"type": "index_not_yellow_timeout",
},
}
`);
});
});
describe('cloneIndex', () => {
@ -459,7 +497,7 @@ describe('migration actions', () => {
}
`);
});
it('resolves left with a retryable_es_client_error if clone target already exists but takes longer than the specified timeout before turning yellow', async () => {
it('resolves left with an index_not_yellow_timeout if clone target already exists but takes longer than the specified timeout before turning yellow', async () => {
// Create a red index
await client.indices
.create({
@ -489,8 +527,8 @@ describe('migration actions', () => {
Object {
"_tag": "Left",
"left": Object {
"message": "Timeout waiting for the status of the [clone_red_index] index to become 'yellow'",
"type": "retryable_es_client_error",
"message": "[index_not_yellow_timeout] Timeout waiting for the status of the [clone_red_index] index to become 'yellow'",
"type": "index_not_yellow_timeout",
},
}
`);

View file

@ -21,6 +21,11 @@ export interface WaitForIndexStatusYellowParams {
index: string;
timeout?: string;
}
export interface IndexNotYellowTimeout {
type: 'index_not_yellow_timeout';
message: string;
}
/**
* A yellow index status means the index's primary shard is allocated and the
* index is ready for searching/indexing documents, but ES wasn't able to
@ -37,7 +42,10 @@ export const waitForIndexStatusYellow =
client,
index,
timeout = DEFAULT_TIMEOUT,
}: WaitForIndexStatusYellowParams): TaskEither.TaskEither<RetryableEsClientError, {}> =>
}: WaitForIndexStatusYellowParams): TaskEither.TaskEither<
RetryableEsClientError | IndexNotYellowTimeout,
{}
> =>
() => {
return client.cluster
.health(
@ -47,14 +55,14 @@ export const waitForIndexStatusYellow =
timeout,
},
// Don't reject on status code 408 so that we can handle the timeout
// explicitly and provide more context in the error message
// explicitly with a custom response type and provide more context in the error message
{ ignore: [408] }
)
.then((res) => {
if (res.timed_out === true) {
return Either.left({
type: 'retryable_es_client_error' as const,
message: `Timeout waiting for the status of the [${index}] index to become 'yellow'`,
type: 'index_not_yellow_timeout' as const,
message: `[index_not_yellow_timeout] Timeout waiting for the status of the [${index}] index to become 'yellow'`,
});
}
return Either.right({});
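
A hedged usage sketch of the changed return type (the import path and the `client` wiring are assumptions for illustration): callers execute the returned `TaskEither` and branch on the left's `type` to tell the new retryable timeout apart from other errors:

```ts
import * as Either from 'fp-ts/lib/Either';
// Hypothetical relative path; the action is exported from the migrations actions module.
import { waitForIndexStatusYellow } from './wait_for_index_status_yellow';

async function ensureYellow(client: any /* ElasticsearchClient, assumed */): Promise<boolean> {
  const res = await waitForIndexStatusYellow({
    client,
    index: '.kibana_8.1.0_001', // example index name
    timeout: '30s', // example timeout
  })();

  if (Either.isLeft(res)) {
    if (res.left.type === 'index_not_yellow_timeout') {
      // Retryable timeout: the model re-enters the same control state with an
      // exponential delay and appends a link to the resolution docs.
    }
    return false;
  }
  return true; // right: the index reached at least yellow status
}
```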

View file

@ -8,15 +8,19 @@
import { ByteSizeValue } from '@kbn/config-schema';
import * as Option from 'fp-ts/Option';
import { DocLinksServiceSetup } from '../../doc_links';
import { docLinksServiceMock } from '../../mocks';
import { SavedObjectsMigrationConfigType } from '../saved_objects_config';
import { SavedObjectTypeRegistry } from '../saved_objects_type_registry';
import { createInitialState } from './initial_state';
describe('createInitialState', () => {
let typeRegistry: SavedObjectTypeRegistry;
let docLinks: DocLinksServiceSetup;
beforeEach(() => {
typeRegistry = new SavedObjectTypeRegistry();
docLinks = docLinksServiceMock.createSetupContract();
});
const migrationsConfig = {
@ -36,6 +40,7 @@ describe('createInitialState', () => {
indexPrefix: '.kibana_task_manager',
migrationsConfig,
typeRegistry,
docLinks,
})
).toEqual({
batchSize: 1000,
@ -108,6 +113,10 @@ describe('createInitialState', () => {
},
versionAlias: '.kibana_task_manager_8.1.0',
versionIndex: '.kibana_task_manager_8.1.0_001',
migrationDocLinks: {
resolveMigrationFailures:
'https://www.elastic.co/guide/en/kibana/test-branch/resolve-migrations-failures.html',
},
});
});
@ -135,6 +144,7 @@ describe('createInitialState', () => {
indexPrefix: '.kibana_task_manager',
migrationsConfig,
typeRegistry,
docLinks,
});
expect(initialState.knownTypes).toEqual(['foo', 'bar']);
@ -160,6 +170,7 @@ describe('createInitialState', () => {
indexPrefix: '.kibana_task_manager',
migrationsConfig,
typeRegistry,
docLinks,
});
expect(initialState.excludeFromUpgradeFilterHooks).toEqual({ foo: fooExcludeOnUpgradeHook });
@ -178,6 +189,7 @@ describe('createInitialState', () => {
indexPrefix: '.kibana_task_manager',
migrationsConfig,
typeRegistry,
docLinks,
});
expect(Option.isSome(initialState.preMigrationScript)).toEqual(true);
@ -199,6 +211,7 @@ describe('createInitialState', () => {
indexPrefix: '.kibana_task_manager',
migrationsConfig,
typeRegistry,
docLinks,
}).preMigrationScript
)
).toEqual(true);
@ -216,6 +229,7 @@ describe('createInitialState', () => {
indexPrefix: '.kibana_task_manager',
migrationsConfig,
typeRegistry,
docLinks,
}).outdatedDocumentsQuery
).toMatchInlineSnapshot(`
Object {

View file

@ -13,6 +13,7 @@ import { SavedObjectsMigrationConfigType } from '../saved_objects_config';
import type { ISavedObjectTypeRegistry } from '../saved_objects_type_registry';
import { InitState } from './state';
import { excludeUnusedTypesQuery } from './core';
import { DocLinksServiceStart } from '../../doc_links';
/**
* Construct the initial state for the model
@ -25,6 +26,7 @@ export const createInitialState = ({
indexPrefix,
migrationsConfig,
typeRegistry,
docLinks,
}: {
kibanaVersion: string;
targetMappings: IndexMapping;
@ -33,6 +35,7 @@ export const createInitialState = ({
indexPrefix: string;
migrationsConfig: SavedObjectsMigrationConfigType;
typeRegistry: ISavedObjectTypeRegistry;
docLinks: DocLinksServiceStart;
}): InitState => {
const outdatedDocumentsQuery = {
bool: {
@ -64,6 +67,8 @@ export const createInitialState = ({
.filter((type) => !!type.excludeOnUpgrade)
.map((type) => [type.name, type.excludeOnUpgrade!])
);
// Shorthand for accessing the savedObjects doc links entries directly
const migrationDocLinks = docLinks.links.kibanaUpgradeSavedObjects;
return {
controlState: 'INIT',
@ -87,5 +92,6 @@ export const createInitialState = ({
unusedTypesQuery: excludeUnusedTypesQuery,
knownTypes,
excludeFromUpgradeFilterHooks: excludeFilterHooks,
migrationDocLinks,
};
};

View file

@ -16,6 +16,7 @@ import { SavedObjectTypeRegistry } from '../saved_objects_type_registry';
import { SavedObjectsType } from '../types';
import { DocumentMigrator } from './core/document_migrator';
import { ByteSizeValue } from '@kbn/config-schema';
import { docLinksServiceMock } from '../../mocks';
import { lastValueFrom } from 'rxjs';
jest.mock('./core/document_migrator', () => {
@ -287,6 +288,7 @@ const mockOptions = () => {
retryAttempts: 20,
},
client: elasticsearchClientMock.createElasticsearchClient(),
docLinks: docLinksServiceMock.createSetupContract(),
};
return options;
};

View file

@ -29,6 +29,7 @@ import { ISavedObjectTypeRegistry } from '../saved_objects_type_registry';
import { SavedObjectsType } from '../types';
import { runResilientMigrator } from './run_resilient_migrator';
import { migrateRawDocsSafely } from './core/migrate_raw_docs';
import { DocLinksServiceStart } from '../../doc_links';
export interface KibanaMigratorOptions {
client: ElasticsearchClient;
@ -37,6 +38,7 @@ export interface KibanaMigratorOptions {
kibanaIndex: string;
kibanaVersion: string;
logger: Logger;
docLinks: DocLinksServiceStart;
}
export type IKibanaMigrator = Pick<KibanaMigrator, keyof KibanaMigrator>;
@ -65,6 +67,7 @@ export class KibanaMigrator {
private readonly activeMappings: IndexMapping;
private readonly soMigrationsConfig: SavedObjectsMigrationConfigType;
public readonly kibanaVersion: string;
private readonly docLinks: DocLinksServiceStart;
/**
* Creates an instance of KibanaMigrator.
@ -76,6 +79,7 @@ export class KibanaMigrator {
soMigrationsConfig,
kibanaVersion,
logger,
docLinks,
}: KibanaMigratorOptions) {
this.client = client;
this.kibanaIndex = kibanaIndex;
@ -93,6 +97,7 @@ export class KibanaMigrator {
// Building the active mappings (and associated md5sums) is an expensive
// operation so we cache the result
this.activeMappings = buildActiveMappings(this.mappingProperties);
this.docLinks = docLinks;
}
/**
@ -177,6 +182,7 @@ export class KibanaMigrator {
indexPrefix: index,
migrationsConfig: this.soMigrationsConfig,
typeRegistry: this.typeRegistry,
docLinks: this.docLinks,
});
},
};

View file

@ -8,7 +8,7 @@
import { cleanupMock } from './migrations_state_machine_cleanup.mocks';
import { migrationStateActionMachine } from './migrations_state_action_machine';
import { loggingSystemMock, elasticsearchServiceMock } from '../../mocks';
import { loggingSystemMock, elasticsearchServiceMock, docLinksServiceMock } from '../../mocks';
import { typeRegistryMock } from '../saved_objects_type_registry.mock';
import * as Either from 'fp-ts/lib/Either';
import * as Option from 'fp-ts/lib/Option';
@ -33,6 +33,7 @@ describe('migrationsStateActionMachine', () => {
const mockLogger = loggingSystemMock.create();
const typeRegistry = typeRegistryMock.create();
const docLinks = docLinksServiceMock.createSetupContract();
const initialState = createInitialState({
kibanaVersion: '7.11.0',
@ -48,6 +49,7 @@ describe('migrationsStateActionMachine', () => {
retryAttempts: 5,
},
typeRegistry,
docLinks,
});
const next = jest.fn((s: State) => {

View file

@ -94,6 +94,9 @@ describe('migrations v2 model', () => {
},
knownTypes: ['dashboard', 'config'],
excludeFromUpgradeFilterHooks: {},
migrationDocLinks: {
resolveMigrationFailures: 'resolveMigrationFailures',
},
};
describe('exponential retry delays for retryable_es_client_error', () => {
@ -182,7 +185,7 @@ describe('migrations v2 model', () => {
expect(newState.controlState).toEqual('FATAL');
expect(newState.reason).toMatchInlineSnapshot(
`"Unable to complete the INIT step after 15 attempts, terminating."`
`"Unable to complete the INIT step after 15 attempts, terminating. The last failure message was: snapshot_in_progress_exception"`
);
});
});
@ -560,9 +563,29 @@ describe('migrations v2 model', () => {
expect(newState.retryCount).toEqual(0);
expect(newState.retryDelay).toEqual(0);
});
// The createIndex action called by LEGACY_CREATE_REINDEX_TARGET never
// returns a left, it will always succeed or timeout. Since timeout
// failures are always retried we don't explicitly test this logic
test('LEGACY_CREATE_REINDEX_TARGET -> LEGACY_CREATE_REINDEX_TARGET if action fails with index_not_yellow_timeout', () => {
const res: ResponseType<'LEGACY_CREATE_REINDEX_TARGET'> = Either.left({
message: '[index_not_yellow_timeout] Timeout waiting for ...',
type: 'index_not_yellow_timeout',
});
const newState = model(legacyCreateReindexTargetState, res);
expect(newState.controlState).toEqual('LEGACY_CREATE_REINDEX_TARGET');
expect(newState.retryCount).toEqual(1);
expect(newState.retryDelay).toEqual(2000);
});
test('LEGACY_CREATE_REINDEX_TARGET -> LEGACY_REINDEX resets retry count and retry delay if action succeeds', () => {
const res: ResponseType<'LEGACY_CREATE_REINDEX_TARGET'> =
Either.right('create_index_succeeded');
const testState = {
...legacyCreateReindexTargetState,
retryCount: 1,
retryDelay: 2000,
};
const newState = model(testState, res);
expect(newState.controlState).toEqual('LEGACY_REINDEX');
expect(newState.retryCount).toEqual(0);
expect(newState.retryDelay).toEqual(0);
});
});
describe('LEGACY_REINDEX', () => {
@ -707,6 +730,33 @@ describe('migrations v2 model', () => {
sourceIndex: Option.some('.kibana_3'),
});
});
test('WAIT_FOR_YELLOW_SOURCE -> WAIT_FOR_YELLOW_SOURCE if action fails with index_not_yellow_timeout', () => {
const res: ResponseType<'WAIT_FOR_YELLOW_SOURCE'> = Either.left({
message: '[index_not_yellow_timeout] Timeout waiting for ...',
type: 'index_not_yellow_timeout',
});
const newState = model(waitForYellowSourceState, res);
expect(newState.controlState).toEqual('WAIT_FOR_YELLOW_SOURCE');
expect(newState.retryCount).toEqual(1);
expect(newState.retryDelay).toEqual(2000);
});
test('WAIT_FOR_YELLOW_SOURCE -> CHECK_UNKNOWN_DOCUMENTS resets retry count and delay if action succeeds', () => {
const res: ResponseType<'WAIT_FOR_YELLOW_SOURCE'> = Either.right({});
const testState = {
...waitForYellowSourceState,
retryCount: 1,
retryDelay: 2000,
};
const newState = model(testState, res);
expect(newState.controlState).toEqual('CHECK_UNKNOWN_DOCUMENTS');
expect(newState).toMatchObject({
controlState: 'CHECK_UNKNOWN_DOCUMENTS',
sourceIndex: Option.some('.kibana_3'),
});
});
});
describe('CHECK_UNKNOWN_DOCUMENTS', () => {
@ -900,6 +950,28 @@ describe('migrations v2 model', () => {
expect(newState.retryCount).toEqual(0);
expect(newState.retryDelay).toEqual(0);
});
it('CREATE_REINDEX_TEMP -> CREATE_REINDEX_TEMP if action fails with index_not_yellow_timeout', () => {
const res: ResponseType<'CREATE_REINDEX_TEMP'> = Either.left({
message: '[index_not_yellow_timeout] Timeout waiting for ...',
type: 'index_not_yellow_timeout',
});
const newState = model(state, res);
expect(newState.controlState).toEqual('CREATE_REINDEX_TEMP');
expect(newState.retryCount).toEqual(1);
expect(newState.retryDelay).toEqual(2000);
});
it('CREATE_REINDEX_TEMP -> REINDEX_SOURCE_TO_TEMP_OPEN_PIT resets retry count if action succeeds', () => {
const res: ResponseType<'CREATE_REINDEX_TEMP'> = Either.right('create_index_succeeded');
const testState = {
...state,
retryCount: 1,
retryDelay: 2000,
};
const newState = model(testState, res);
expect(newState.controlState).toEqual('REINDEX_SOURCE_TO_TEMP_OPEN_PIT');
expect(newState.retryCount).toEqual(0);
expect(newState.retryDelay).toEqual(0);
});
});
describe('REINDEX_SOURCE_TO_TEMP_OPEN_PIT', () => {
@ -1212,6 +1284,31 @@ describe('migrations v2 model', () => {
expect(newState.retryCount).toBe(0);
expect(newState.retryDelay).toBe(0);
});
it('CLONE_TEMP_TO_TARGET -> CLONE_TEMP_TO_TARGET if action fails with index_not_yellow_timeout', () => {
const res: ResponseType<'CLONE_TEMP_TO_TARGET'> = Either.left({
message: '[index_not_yellow_timeout] Timeout waiting for ...',
type: 'index_not_yellow_timeout',
});
const newState = model(state, res);
expect(newState.controlState).toEqual('CLONE_TEMP_TO_TARGET');
expect(newState.retryCount).toEqual(1);
expect(newState.retryDelay).toEqual(2000);
});
it('CLONE_TEMP_TO_TARGET -> REFRESH_TARGET resets the retry count and delay', () => {
const res: ResponseType<'CLONE_TEMP_TO_TARGET'> = Either.right({
acknowledged: true,
shardsAcknowledged: true,
});
const testState = {
...state,
retryCount: 1,
retryDelay: 2000,
};
const newState = model(testState, res);
expect(newState.controlState).toBe('REFRESH_TARGET');
expect(newState.retryCount).toBe(0);
expect(newState.retryDelay).toBe(0);
});
});
describe('OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT', () => {
@ -1698,6 +1795,29 @@ describe('migrations v2 model', () => {
expect(newState.retryCount).toEqual(0);
expect(newState.retryDelay).toEqual(0);
});
test('CREATE_NEW_TARGET -> CREATE_NEW_TARGET if action fails with index_not_yellow_timeout', () => {
const res: ResponseType<'CREATE_NEW_TARGET'> = Either.left({
message: '[index_not_yellow_timeout] Timeout waiting for ...',
type: 'index_not_yellow_timeout',
});
const newState = model(createNewTargetState, res);
expect(newState.controlState).toEqual('CREATE_NEW_TARGET');
expect(newState.retryCount).toEqual(1);
expect(newState.retryDelay).toEqual(2000);
});
test('CREATE_NEW_TARGET -> MARK_VERSION_INDEX_READY resets the retry count and delay', () => {
const res: ResponseType<'CREATE_NEW_TARGET'> = Either.right('create_index_succeeded');
const testState = {
...createNewTargetState,
retryCount: 1,
retryDelay: 2000,
};
const newState = model(testState, res);
expect(newState.controlState).toEqual('MARK_VERSION_INDEX_READY');
expect(newState.retryCount).toEqual(0);
expect(newState.retryDelay).toEqual(0);
});
});
describe('MARK_VERSION_INDEX_READY', () => {

View file

@ -235,7 +235,21 @@ export const model = (currentState: State, resW: ResponseType<AllActionStates>):
}
} else if (stateP.controlState === 'LEGACY_CREATE_REINDEX_TARGET') {
const res = resW as ExcludeRetryableEsError<ResponseType<typeof stateP.controlState>>;
if (Either.isRight(res)) {
if (Either.isLeft(res)) {
const left = res.left;
if (isLeftTypeof(left, 'index_not_yellow_timeout')) {
// `index_not_yellow_timeout` for the LEGACY_CREATE_REINDEX_TARGET target index:
// A yellow status timeout could theoretically be temporary for a busy cluster
// that takes a long time to allocate the primary, so we retry the action to see
// if we get a response.
// If the cluster hit the low watermark for disk usage, the LEGACY_CREATE_REINDEX_TARGET
// action will continue to timeout and eventually lead to a failed migration.
const retryErrorMessage = `${left.message} Refer to ${stateP.migrationDocLinks.resolveMigrationFailures} for information on how to resolve the issue.`;
return delayRetryState(stateP, retryErrorMessage, stateP.retryAttempts);
} else {
return throwBadResponse(stateP, left);
}
} else if (Either.isRight(res)) {
return {
...stateP,
controlState: 'LEGACY_REINDEX',
@ -285,7 +299,7 @@ export const model = (currentState: State, resW: ResponseType<AllActionStates>):
// After waiting for the specified timeout, the task has not yet
// completed. Retry this step to see if the task has completed after an
// exponential delay. We will basically keep polling forever until the
// Elasticeasrch task succeeds or fails.
// Elasticsearch task succeeds or fails.
return delayRetryState(stateP, left.message, Number.MAX_SAFE_INTEGER);
} else if (
isLeftTypeof(left, 'index_not_found_exception') ||
@ -344,6 +358,19 @@ export const model = (currentState: State, resW: ResponseType<AllActionStates>):
...stateP,
controlState: 'CHECK_UNKNOWN_DOCUMENTS',
};
} else if (Either.isLeft(res)) {
const left = res.left;
if (isLeftTypeof(left, 'index_not_yellow_timeout')) {
// A yellow status timeout could theoretically be temporary for a busy cluster
// that takes a long time to allocate the primary, so we retry the action to see
// if we get a response.
// If the retries run out, the error message links to the docs to help with
// diagnosing the problem.
const retryErrorMessage = `${left.message} Refer to ${stateP.migrationDocLinks.resolveMigrationFailures} for information on how to resolve the issue.`;
return delayRetryState(stateP, retryErrorMessage, stateP.retryAttempts);
} else {
return throwBadResponse(stateP, left);
}
} else {
return throwBadResponse(stateP, res);
}
@ -425,6 +452,20 @@ export const model = (currentState: State, resW: ResponseType<AllActionStates>):
const res = resW as ExcludeRetryableEsError<ResponseType<typeof stateP.controlState>>;
if (Either.isRight(res)) {
return { ...stateP, controlState: 'REINDEX_SOURCE_TO_TEMP_OPEN_PIT' };
} else if (Either.isLeft(res)) {
const left = res.left;
if (isLeftTypeof(left, 'index_not_yellow_timeout')) {
// `index_not_yellow_timeout` for the CREATE_REINDEX_TEMP target temp index:
// The index status did not go yellow within the specified timeout period.
// A yellow status timeout could theoretically be temporary for a busy cluster.
//
// If there is a problem, the CREATE_REINDEX_TEMP action will
// continue to timeout and eventually lead to a failed migration.
const retryErrorMessage = `${left.message} Refer to ${stateP.migrationDocLinks.resolveMigrationFailures} for information on how to resolve the issue.`;
return delayRetryState(stateP, retryErrorMessage, stateP.retryAttempts);
} else {
return throwBadResponse(stateP, left);
}
} else {
// If the createIndex action receives an 'resource_already_exists_exception'
// it will wait until the index status turns green so we don't have any
@ -645,6 +686,18 @@ export const model = (currentState: State, resW: ResponseType<AllActionStates>):
...stateP,
controlState: 'REFRESH_TARGET',
};
} else if (isLeftTypeof(left, 'index_not_yellow_timeout')) {
// `index_not_yellow_timeout` for the CLONE_TEMP_TO_TARGET source -> target index:
// The target index status did not go yellow within the specified timeout period.
// The cluster could just be busy and we retry the action.
// Once we run out of retries, the migration fails.
// Identifying the cause requires inspecting the output of the
// `_cluster/allocation/explain?index=${targetIndex}` API.
// Unless the root cause is identified and addressed, the request will
// continue to timeout and eventually lead to a failed migration.
const retryErrorMessage = `${left.message} Refer to ${stateP.migrationDocLinks.resolveMigrationFailures} for information on how to resolve the issue.`;
return delayRetryState(stateP, retryErrorMessage, stateP.retryAttempts);
} else {
throwBadResponse(stateP, left);
}
@ -876,7 +929,7 @@ export const model = (currentState: State, resW: ResponseType<AllActionStates>):
if (isLeftTypeof(left, 'wait_for_task_completion_timeout')) {
// After waiting for the specified timeout, the task has not yet
// completed. Retry this step to see if the task has completed after an
// exponential delay. We will basically keep polling forever until the
// exponential delay. We will basically keep polling forever until the
// Elasticsearch task succeeds or fails.
return delayRetryState(stateP, res.left.message, Number.MAX_SAFE_INTEGER);
} else {
@ -890,6 +943,19 @@ export const model = (currentState: State, resW: ResponseType<AllActionStates>):
...stateP,
controlState: 'MARK_VERSION_INDEX_READY',
};
} else if (Either.isLeft(res)) {
const left = res.left;
if (isLeftTypeof(left, 'index_not_yellow_timeout')) {
// `index_not_yellow_timeout` for the CREATE_NEW_TARGET target index:
// The cluster might just be busy, so we retry the action for a set number of times.
// If the cluster hit the low watermark for disk usage, the CREATE_NEW_TARGET action
// will continue to timeout and eventually lead to a failed migration unless the
// disk space issue is addressed.
const retryErrorMessage = `${left.message} Refer to ${stateP.migrationDocLinks.resolveMigrationFailures} for information on how to resolve the issue.`;
return delayRetryState(stateP, retryErrorMessage, stateP.retryAttempts);
} else {
return throwBadResponse(stateP, left);
}
} else {
// If the createIndex action receives an 'resource_already_exists_exception'
// it will wait until the index status turns green so we don't have any

View file

@ -109,7 +109,7 @@ describe('delayRetryState', () => {
hello: 'dolly',
retryCount: 5,
retryDelay: 64,
reason: `Unable to complete the TEST step after 5 attempts, terminating.`,
reason: `Unable to complete the TEST step after 5 attempts, terminating. The last failure message was: some-error`,
});
});
});

View file

@ -18,12 +18,11 @@ export const delayRetryState = <S extends State>(
return {
...state,
controlState: 'FATAL',
reason: `Unable to complete the ${state.controlState} step after ${maxRetryAttempts} attempts, terminating.`,
reason: `Unable to complete the ${state.controlState} step after ${maxRetryAttempts} attempts, terminating. The last failure message was: ${errorMessage}`,
};
} else {
const retryCount = state.retryCount + 1;
const retryDelay = 1000 * Math.min(Math.pow(2, retryCount), 64); // 2s, 4s, 8s, 16s, 32s, 64s, 64s, 64s ...
return {
...state,
retryCount,

View file

@ -18,6 +18,7 @@ import { createInitialState } from './initial_state';
import { migrationStateActionMachine } from './migrations_state_action_machine';
import { SavedObjectsMigrationConfigType } from '../saved_objects_config';
import type { ISavedObjectTypeRegistry } from '../saved_objects_type_registry';
import { DocLinksServiceStart } from '../../doc_links';
/**
* Migrates the provided indexPrefix index using a resilient algorithm that is
@ -35,6 +36,7 @@ export async function runResilientMigrator({
indexPrefix,
migrationsConfig,
typeRegistry,
docLinks,
}: {
client: ElasticsearchClient;
kibanaVersion: string;
@ -46,6 +48,7 @@ export async function runResilientMigrator({
indexPrefix: string;
migrationsConfig: SavedObjectsMigrationConfigType;
typeRegistry: ISavedObjectTypeRegistry;
docLinks: DocLinksServiceStart;
}): Promise<MigrationResult> {
const initialState = createInitialState({
kibanaVersion,
@ -55,6 +58,7 @@ export async function runResilientMigrator({
indexPrefix,
migrationsConfig,
typeRegistry,
docLinks,
});
return migrationStateActionMachine({
initialState,

View file

@ -122,6 +122,10 @@ export interface BaseState extends ControlState {
string,
SavedObjectTypeExcludeFromUpgradeFilterHook
>;
/**
* Doc links for saved objects, used to reference online documentation
*/
readonly migrationDocLinks: Record<string, string>;
}
export interface InitState extends BaseState {

View file

@ -36,6 +36,7 @@ import { NodesVersionCompatibility } from '../elasticsearch/version_check/ensure
import { SavedObjectsRepository } from './service/lib/repository';
import { registerCoreObjectTypes } from './object_types';
import { getSavedObjectsDeprecationsProvider } from './deprecations';
import { docLinksServiceMock } from '../doc_links/doc_links_service.mock';
jest.mock('./service/lib/repository');
jest.mock('./object_types');
@ -79,6 +80,7 @@ describe('SavedObjectsService', () => {
return {
pluginsInitialized,
elasticsearch: elasticsearchServiceMock.createInternalStart(),
docLinks: docLinksServiceMock.createStartContract(),
};
};

View file

@ -50,6 +50,7 @@ import { ServiceStatus } from '../status';
import { calculateStatus$ } from './status';
import { registerCoreObjectTypes } from './object_types';
import { getSavedObjectsDeprecationsProvider } from './deprecations';
import { DocLinksServiceStart } from '../doc_links';
const kibanaIndex = '.kibana';
@ -284,6 +285,7 @@ interface WrappedClientFactoryWrapper {
export interface SavedObjectsStartDeps {
elasticsearch: InternalElasticsearchServiceStart;
pluginsInitialized?: boolean;
docLinks: DocLinksServiceStart;
}
export class SavedObjectsService
@ -383,6 +385,7 @@ export class SavedObjectsService
public async start({
elasticsearch,
pluginsInitialized = true,
docLinks,
}: SavedObjectsStartDeps): Promise<InternalSavedObjectsServiceStart> {
if (!this.setupDeps || !this.config) {
throw new Error('#setup() needs to be run first');
@ -394,7 +397,8 @@ export class SavedObjectsService
const migrator = this.createMigrator(
this.config.migration,
elasticsearch.client.asInternalUser
elasticsearch.client.asInternalUser,
docLinks
);
this.migrator$.next(migrator);
@ -509,7 +513,8 @@ export class SavedObjectsService
private createMigrator(
soMigrationsConfig: SavedObjectsMigrationConfigType,
client: ElasticsearchClient
client: ElasticsearchClient,
docLinks: DocLinksServiceStart
): IKibanaMigrator {
return new KibanaMigrator({
typeRegistry: this.typeRegistry,
@ -518,6 +523,7 @@ export class SavedObjectsService
soMigrationsConfig,
kibanaIndex,
client,
docLinks,
});
}

View file

@ -316,6 +316,7 @@ export class Server {
const savedObjectsStart = await this.savedObjects.start({
elasticsearch: elasticsearchStart,
pluginsInitialized: this.#pluginsInitialized,
docLinks: docLinkStart,
});
await this.resolveSavedObjectsStartPromise!(savedObjectsStart);