[docs/ops] write docs about flaky tests (#139866)

Co-authored-by: Jonathan Budzenski <jon@budzenski.me>
Spencer 2022-09-01 10:23:49 -05:00 committed by GitHub
parent d1b7f02a22
commit e620280d87
8 changed files with 227 additions and 140 deletions


@ -12,4 +12,4 @@ The service implementation is available at https://github.com/elastic/kibana-ci-
The service is run on Google Cloud Run, which allows us to build a container, push it to GCR, define a memory limit, vCPU count, and concurrent request limit per container, and Google will automatically scale the container for us. It works pretty well and hides a lot of the complexity of running the service. The repo uses Buildkite CI to build and deploy the container when pushing to the main branch. All changes to the main branch must come from PRs, but at this time we don't require review for PRs.
The website at https://ci-stats.kibana.dev uses EUI and Elastic Charts, and currently has users powered by Github OAuth. When someone authenticates with ci-stats they are first redirected to Github for authentication, then their membership in the Elastic org is checked. Users in the Elastic org will be able to do things that other users can't, like [trigger flaky test runner jobs](https://ci-stats.kibana.dev/trigger_flaky_test_runner).
The website at https://ci-stats.kibana.dev uses EUI and Elastic Charts, and currently has users powered by Github OAuth. When someone authenticates with ci-stats they are first redirected to Github for authentication, then their membership in the Elastic org is checked. Users in the Elastic org will be able to do things that other users can't, like trigger the <DocLink id="kibDevDocsOpsFlakyTestRunner"/>.


@ -0,0 +1,21 @@
---
id: kibDevDocsOpsFlakyTestRunner
slug: /kibana-dev-docs/ops/flaky-test-runner
title: "Flaky Test Runner"
description: A tool which runs specific tests many times to spot flakiness
tags: ['kibana', 'dev', 'contributor', 'operations', 'ci', 'tests', 'flaky']
---
The Flaky Test Runner is a tool that can be used to gauge the flakiness of a test. It is triggered using the form at https://ci-stats.kibana.dev/trigger_flaky_test_runner. Follow the instructions in the wizard to pick a PR, the test to run, and the number of executions for that test, and finally start the job on Buildkite.
After starting the job you will be sent to Buildkite to view its progress.
## How many successful runs do I need to know if my test is flaky?
Most of the time flakiness should be resolved by starting with the error that's occurring in CI, then applying the suggestions in <DocLink id="kibDevDocsOpsFlakyTests" section="how-do-i-write-tests-that-arent-flaky" text="Flaky tests: How do I write functional tests that aren't flaky?"/> to find places where flakiness is likely being introduced. When working this way, the flaky test runner can be used to debug a failure by running it many times to trigger the flakiness, or to verify at the end that your changes haven't increased the flakiness.
If you do want to prove that your test is no longer flaky, the number of runs you need depends on how flaky the test was. We execute tests about 300 times a day, on average, so if your test was only failing a few times a week then you might need over 1000 successful test runs to prove that it's no longer flaky. If a test was failing several times a day, then you would need far fewer.
Ultimately, running the flaky test runner enough times to validate that a test isn't flaky is not an economical way to fix flakiness, so it should be a last-resort strategy.
If there is reason to believe that a test which was previously flaky is no longer flaky, then the test can just be unskipped. If the test is still flaky then that will be proven shortly enough in CI and Operations will skip the test again.


@ -0,0 +1,65 @@
---
id: kibDevDocsOpsFlakyTests
slug: /kibana-dev-docs/ops/flaky-tests
title: "Flaky Tests"
description: A description of how the operations team helps manage flaky tests in Kibana
tags: ['kibana', 'dev', 'contributor', 'operations', 'ci', 'tests', 'flaky']
---
The Kibana repository contains hundreds of thousands of test cases, and on every PR and commit to the repository we run every one of those test cases. We need to do this because so many parts of Kibana are interconnected and we don't currently have the ability to only run the tests that are necessary based on the changes in a PR/commit. This has the side effect of running each test hundreds of times a day, which means that tests need to be very reliable.
Because our reliability requirements are so high, it is common for tests to become "flaky", meaning that something changes which causes their reliability to drop and makes them fail when they should be passing. When a test becomes flaky the Operations team will skip it, meaning it will no longer run until the team owning that test investigates the flakiness and unskips it.
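For reference, a skipped functional test usually ends up looking something like the sketch below, with a comment pointing back at the failure issue so the owning team can find the history (the issue number and test names here are hypothetical):
```ts
// FLAKY: https://github.com/elastic/kibana/issues/99999 (hypothetical issue link)
describe.skip('dashboard panel actions', () => {
  it('opens the context menu', async () => {
    // the test body stays in place; it simply won't run until the describe block is unskipped
  })
})
```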
## How is test ownership determined?
In many cases ownership is relatively obvious based on the location of the test. In cases where it's more ambiguous we will normally trace back to when the test was added, or look at the PRs which edit the functionality of the tests the most (ignoring repo-wide edits like ESLint upgrades, etc). The teams labeled on those PRs, or the teams of their authors, are treated as the owners of the tests.
In the <DocLink id="kibDevDocsOpsPackages" text="IDM project" /> we will be improving this by putting all code, including functional tests, into packages which have a specific and verified owner.
## When is a test flaky enough to skip?
Identifying when tests are flaky enough that they should be skipped is the responsibility of members of the Operations team. When making that decision we consider a few different questions:
### Is this a very new test?
When a test is very new (~ 3-4 business days) and has already started failing, then the test is likely flaky enough to skip. While we don't know how flaky this test really is, we know that it didn't take long for it to start failing. Additionally, we assume that the author of the test is still familiar with the context of the change and not deep in the middle of some other complicated work. In these cases we will assign and ping the author directly and expect the test to not be skipped for long.
This is an ideal case for flaky tests, but isn't very common.
### How often is this test failing?
The frequency of failures is judged differently based on the type of test that is failing:
*Jest Unit/Integration tests:* These tests are not retried, and shouldn't be doing anything complicated enough to introduce flakiness. Any number of failures in these tests will lead to them being skipped.
*Functional/API Integration tests:* (aka FTR configs) These tests involve running full Kibana and Elasticsearch instances and usually require interacting with the product through web-browser automation. Because of this, 100% stability is essentially unattainable, so any time there is a failure in an FTR config the tests in that config are re-run automatically. If the tests pass the second time then CI succeeds, but the build is marked "unstable" and PR authors are expected to triage and interpret the failure to make sure their changes didn't cause the instability.
Our goal is to reduce this overhead for "innocent" PR authors without creating too much work for the teams who need to unskip flaky tests. To achieve this goal, one of these tests is "flaky enough to be skipped" when it _has failed at least four times in the last week or shows a consistent, recent, historical pattern of failing_.
To understand how many times a test has failed we use a [Kibana dashboard](https://ops.kibana.dev/s/ci/app/dashboards#/view/cb7380fb-67fe-5c41-873d-003d1a407dad) which records all of the failures in Kibana CI. This allows us to see the pattern of failures and investigate whether a test which has only failed once on a tracked branch is actually failing regularly in different PRs.
Since PRs don't always contain complete code, we don't immediately treat failures in PRs as evidence of flakiness; we first verify that they're spread out over time and that they aren't all on the same PR or consistently triggered by the same PR author (indications that the failures are caused by the changes, not by flakiness).
### Are the owning teams able to fix this source of flakiness?
There are cases where tests will fail because of something that is outside of the test-authors' control.
A recent example of this is issues caused by the way that the <DocLink id="kibDevDocsOpsEsArchiver" /> supported archiving Kibana saved object indices. This was a "best practice" that started failing as we worked to make the migration process more resilient, and caused flakiness that individual test owners couldn't really fix.
To fix this specific example the Operations, Core, and QA teams worked together to find a better solution, and the QA team has been assisting other teams by migrating their tests from the <DocLink id="kibDevDocsOpsEsArchiver" /> to the new `KbnArchiver`.
Now that the majority of Kibana indexes stored in ES archives have been migrated, this is no longer a major source of flakiness. When another case pops up we usually just ping the QA team and ask them to migrate the specific usage rather than skip the test.
## How do I know when my team's tests are skipped?
Every test that fails on a tracked branch is logged in a GitHub issue. On subsequent failures of the same test a comment is added to the issue, which helps us gauge flakiness and find instances of the failure in CI. These issues require `team:` labels just like all other issues, and many people work to make sure that they are all labeled correctly. Following these issues will allow you to see tests which are likely flaky.
When a test is determined to be flaky enough to be skipped, the issue will get the `blocker` and `skipped test` labels, as well as at least the version label of the first upcoming version that the test was skipped for. The goal here is to highlight that there is functionality in that upcoming version which might be broken because it is untested, so teams should look at these issues as quickly as possible to deal with the blocker.
## How do I get my tests unskipped?
Unskipping tests is the responsibility of the test owner. To unskip a test, the flaky failures should be analyzed to determine the cause of the flakiness, or the problematic tests should be scanned for the issues described in [How do I write tests that aren't flaky?](#how-do-i-write-tests-that-arent-flaky). When the cause of the flakiness has been identified, open a PR with the necessary changes and validate that the flakiness is reduced using the <DocLink id="kibDevDocsOpsFlakyTestRunner" />; see its docs for more information about how to use it.
## How do I write functional tests that aren't flaky?
Check out <DocLink id="kibDevDocsOpsWritingStableFunctionalTests"/>


@ -22,6 +22,9 @@ layout: landing
<DocRelatedArticles
sectionTitle='CI'
items={[
{ pageId: "kibDevDocsOpsFlakyTests" },
{ pageId: "kibDevDocsOpsWritingStableFunctionalTests" },
{ pageId: "kibDevDocsOpsFlakyTestRunner" },
{ pageId: "kibDevDocsOpsCiStats" },
]}
/>
@ -74,5 +77,6 @@ layout: landing
{ pageId: "kibDevDocsOpsManagedVscodeConfig" },
{ pageId: "kibDevDocsOpsManagedVscodeConfigCli" },
{ pageId: "kibDevDocsOpsTest" },
{ pageId: "kibDevDocsOpsEsArchiver" },
]}
/>


@ -0,0 +1,84 @@
---
id: kibDevDocsOpsWritingStableFunctionalTests
slug: /kibana-dev-docs/ops/writing-stable-functional-tests
title: "Writing Stable Functional Tests"
description: Some best-practices for writing stable functional tests
tags: ['kibana', 'dev', 'contributor', 'operations', 'ci', 'tests', 'flaky']
---
Consistently writing functional tests that aren't flaky is impossible. There are too many variables that can't be reproduced reliably, and those variables change over time, so instead we have to focus on how we can reduce the flakiness in our tests as much as possible, making them more resilient to changing conditions.
When you watch tests execute locally it can be tempting to think "after I click this button I can click this other button", but this assumes that the first button click will always execute its click handler immediately, or that the second button will be rendered immediately. The reality is that user interfaces are very complex and these types of assumptions are by far the most common cause of flakiness in tests.
There are all sorts of things that can delay a click handler or a React render, and we need to be prepared for those in our tests. We can do this by using appropriate timeouts for specific actions, adding retry logic, and validating our assumptions with code.
## Become familiar with the retry/timing logic of each common service method
Services like `testSubjects` or `find` usually apply some amount of retry/timeout logic depending on the intention of a specific method. Check each method's documentation to understand how it is supposed to be used. For instance:
<DocCallOut color="warning">
The intended usage/retry/timeout behavior of each method is not well documented in every service. These common services have become very stable, so if you spend the time to analyze how a method works, please help others by improving the method's description. Thank you!
</DocCallOut>
- `testSubjects.exists()`: this method is intended to quickly answer the question of whether something exists, and has a default timeout of 2.5 seconds. This is ideal for determining, based on the current state of Kibana, whether something should be done or not.
```ts
if (await testSubjects.exists('someModal')) {
  // close the modal if it is open ('someModalCloseButton' is a hypothetical test subject)
  await testSubjects.click('someModalCloseButton')
}
```
- `testSubjects.existOrFail()`: this method is intended to be used as a success-or-fail point, where a specific element is expected to be visible, and if it isn't an error will be thrown. This is ideal for when you click a button and want to make sure the success state is reached.
```ts
await testSubjects.click('mySubmitButton')
await testSubjects.existOrFail('mySuccessMessage')
```
## Use services for reusing functionality and implementing it correctly once
When you are writing functional tests for your application or feature it is probably appropriate to create a new service for the specific component of Kibana that you will be interacting with. Additionally, there are many other services which you might be able to reuse based on what you're trying to do.
Your service should define the proper way to interact with your components/app and prevent people from needing to encode specific test subjects, or success criteria, into their tests.
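As an illustration, a minimal sketch of such a service might look like the following; the service name, test subjects, and import path are hypothetical and will vary per test suite:
```ts
import { FtrService } from '../ftr_provider_context' // path varies per test suite

// Hypothetical service for a "tag picker" component: every test that needs to
// pick a tag reuses this single, correctly implemented interaction.
export class TagPickerService extends FtrService {
  private readonly testSubjects = this.ctx.getService('testSubjects')

  async selectTag(name: string) {
    await this.testSubjects.click('tagPickerButton')
    await this.testSubjects.click(`tagPickerOption-${name}`)
    // encode the success criteria here, rather than in every test
    await this.testSubjects.existOrFail(`tagPickerSelected-${name}`)
  }
}
```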
## Service methods which trigger an async process should verify success or failure
Many service methods will be used for reading the state of specific elements in the Kibana UI so that tests can understand what's going on and assert functionality, but some of the methods in our services will trigger asynchronous actions in the UI. Examples of such service methods are:
- `SettingsPage.toggleAdvancedSettingCheckbox()`
- `CommonPage.navigate()`
- `LoginPage.login()`
All of these methods interact with the UI and start an async process (a process which takes some time to complete). They should do more than just interact with the UI: they should encode validation of the success or failure of the action they started, resolving only when the action has completed successfully and rejecting when it can't be completed for some reason.
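A minimal sketch of what such a method can look like (the method name and test subjects are hypothetical):
```ts
// Hypothetical page-object method: it starts an async save and only resolves once
// the UI confirms success; existOrFail() rejects if the success state never
// appears within its timeout.
async saveAndReturnToListing() {
  await this.testSubjects.click('mySaveButton')
  await this.testSubjects.existOrFail('myListingPage')
}
```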
## Never use a "sleep" or rely on a timer to wait for something to complete
Methods like `sleep()` or `setTimeout()`, which pause test execution for some amount of time, are appropriate when used sparingly, but they should never be used to wait for some action to complete. Instead, a method like `retry.waitFor()` should be used to define the success state we are waiting for.
***Don't*** do this:
```ts
await myService.clickSave()
// wait for the save to complete and redirect to the detail page
await sleep()
```
Do this instead:
```ts
await myService.clickSave()
await testSubjects.existOrFail('savedItemDetailPage')
```
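When the success state isn't a single element, `retry.waitFor()` (mentioned above) can encode an arbitrary condition instead. A sketch, assuming a hypothetical `itemCount` test subject:
```ts
await myService.clickSave()
// poll until the condition below returns true, failing the test if the
// retry timeout is reached first
await retry.waitFor('saved item to appear in the list', async () => {
  const text = await testSubjects.getVisibleText('itemCount')
  return text === '1 item'
})
```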
## Do as little work in the UI as possible to validate your test case
Even if you are very careful, the more UI automation you do the more likely you are to make a mistake and write a flaky test. If there is any way to do setup work for your test via the Kibana or Elasticsearch APIs rather than interacting with the UI, then take advantage of that opportunity to write less UI automation.
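For example, rather than driving a "create" flow through the browser, test data can often be seeded directly with the Elasticsearch client in a `before()` hook. A sketch, where the index name and document are hypothetical and the exact client options depend on the Elasticsearch client version in use:
```ts
const es = getService('es')

before(async () => {
  // seed the data the UI test needs instead of creating it through the UI
  await es.index({
    index: 'my-test-index',
    document: { title: 'example', createdAt: new Date().toISOString() },
    refresh: 'wait_for',
  })
})
```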
## Do you really need a functional test for this?
Once you've invested a lot of time and energy into figuring out how to write functional tests well it can be tempting to use them for all sorts of things which might not justify the cost of a functional test. Make sure that your test is validating something that couldn't be validated by a series of unit tests on a component+store+API.
API integration tests can cover many integrations with Elasticsearch in a way that's far more efficient and far less likely to be flaky.
Unit tests are the cheapest tests and can even be run locally by people with a normal amount of patience!
If you could write your test using Jest Unit, Jest Integration, or API Integration tests (in that order of preference), then it is definitely best to write those instead of a functional test.
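For instance, a minimal API integration test exercises an HTTP API directly and never touches the browser; a sketch using the `supertest` service (the endpoint and file layout are just illustrative):
```ts
import type { FtrProviderContext } from '../ftr_provider_context' // path varies per suite

export default function ({ getService }: FtrProviderContext) {
  const supertest = getService('supertest')

  describe('status API', () => {
    it('responds with 200', async () => {
      await supertest.get('/api/status').expect(200)
    })
  })
}
```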


@ -25,6 +25,15 @@
}
]
},
{
"label": "Operations",
"items": [
{
"id": "kibDevDocsOpsOverview",
"label": "Overview"
}
]
},
{
"label": "Contributing",
"items": [
@ -438,145 +447,6 @@
"id": "kibDevDocsOpsTest"
}
]
},
{
"label": "Operations",
"items": [
{
"id": "kibDevDocsOpsOverview",
"label": "Overview"
},
{
"label": "Initiatives",
"items": [
{
"id": "kibDevDocsOpsPackages"
}
]
},
{
"label": "CI",
"items": [
{
"id": "kibDevDocsOpsCiStats"
}
]
},
{
"label": "Build tooling",
"items": [
{
"id": "kibDevDocsOpsKbnPm"
},
{
"id": "kibDevDocsOpsOptimizer"
},
{
"id": "kibDevDocsOpsBabelPreset"
},
{
"id": "kibDevDocsOpsTypeSummarizer"
},
{
"id": "kibDevDocsOpsBabelPluginSyntheticPackages"
},
{
"id": "kibDevDocsOpsUiSharedDepsNpm"
},
{
"id": "kibDevDocsOpsUiSharedDepsSrc"
},
{
"id": "kibDevDocsOpsPluginDiscovery"
}
]
},
{
"label": "Linting & Validation",
"items": [
{
"id": "kibDevDocsOpsEslintConfig"
},
{
"id": "kibDevDocsOpsEslintPluginEslint"
},
{
"id": "kibDevDocsOpsEslintWithTypes"
},
{
"id": "kibDevDocsOpsEslintPluginImports"
},
{
"id": "kibDevDocsOpsEslintPluginDisable"
},
{
"id": "kibDevDocsOpsKbnYarnLockValidator"
}
]
},
{
"label": "Utilities",
"items": [
{
"id": "kibDevDocsToolingLog"
},
{
"id": "kibDevDocsOpsJestSerializers"
},
{
"id": "kibDevDocsOpsExpect"
},
{
"id": "kibDevDocsOpsAmbientStorybookTypes"
},
{
"id": "kibDevDocsOpsAmbientUiTypes"
},
{
"id": "kibDevDocsOpsTestSubjSelector"
},
{
"id": "kibDevDocsOpsBazelRunner"
},
{
"id": "kibDevDocsOpsCliDevMode"
},
{
"id": "kibDevDocsOpsEs"
},
{
"id": "kibDevDocsOpsSomeDevLog"
},
{
"id": "kibDevDocsOpsDevCliRunner"
},
{
"id": "kibDevDocsOpsGetRepoFiles"
},
{
"id": "kibDevDocsOpsRepoSourceClassifier"
},
{
"id": "kibDevDocsOpsJsonc"
},
{
"id": "kibDevDocsOpsKibanaManifestParser"
},
{
"id": "kibDevDocsOpsKibanaManifestSchema"
},
{
"id": "kibDevDocsOpsManagedVscodeConfig"
},
{
"id": "kibDevDocsOpsManagedVscodeConfigCli"
},
{
"id": "kibDevDocsOpsTest"
}
]
}
]
}
]
}


@ -0,0 +1,17 @@
---
id: kibDevDocsOpsEsArchiver
slug: /kibana-dev-docs/ops/es-archiver
title: "ES Archiver"
description: A tool which helps developers capture and restore ES indexes
tags: ['kibana', 'dev', 'contributor', 'operations', 'ci']
---
The ES Archiver is a service primarily used by the Functional Tests to load ES indexes using the bulk API. This makes the archives more resilient to ES upgrades and easier to inspect and edit locally, because they are just plain-text files containing newline-delimited JSON (though they are sometimes compressed).
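In a functional test, loading and unloading an archive typically looks like the sketch below (the archive path is hypothetical):
```ts
const esArchiver = getService('esArchiver')

before(async () => {
  await esArchiver.load('test/functional/fixtures/es_archiver/my_data')
})

after(async () => {
  await esArchiver.unload('test/functional/fixtures/es_archiver/my_data')
})
```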
## CLI
This tool also has a CLI which can be used to save indexes to new archives or load additional archives into a specific ES instance.
To teach the ES Archiver how to talk to ES and Kibana, it is ideal to start ES and Kibana using `node scripts/functional_tests_server --config some/config/file.ts` and then use the same `--config` flag when running `node scripts/es_archiver`, so that it can read the location and authorization information for the ES instance from the FTR config file.
Additional information about the functionality the CLI provides can be found by running `node scripts/es_archiver --help`.


@ -30,6 +30,21 @@ export class TestSubjects extends FtrService {
public readonly TRY_TIME = this.config.get('timeouts.try');
public readonly WAIT_FOR_EXISTS_TIME = this.config.get('timeouts.waitForExists');
/**
* Get a promise that resolves with `true` when an element exists, if the element doesn't exist
* yet it will wait until the element does exist. If we wait until the timeout and the element
* still doesn't exist the promise will resolve with `false`.
*
* This method is intended to quickly answer the question "does this testSubject exist". Its
* 2.5 second timeout responds quickly, making it a good candidate for putting inside
* `retry.waitFor()` loops.
*
* When `options.timeout` is not passed the `timeouts.waitForExists` config is used as
* the timeout. The default value for that config is currently 2.5 seconds.
*
* If the element is hidden it is not treated as "existing", unless `options.allowHidden`
* is set to `true`.
*/
public async exists(selector: string, options: ExistsOptions = {}): Promise<boolean> {
const { timeout = this.WAIT_FOR_EXISTS_TIME, allowHidden = false } = options;
@ -39,6 +54,17 @@ export class TestSubjects extends FtrService {
: this.findService.existsByDisplayedByCssSelector(testSubjSelector(selector), timeout));
}
/**
* Get a promise that resolves when an element exists, if the element doesn't exist
* before the timeout is reached the promise will reject with an error.
*
* This method is intended to be used as success criteria when something is expected
* to exist. The default 2 minute timeout is not appropriate for all conditions, but
* hard-coding timeouts all over tests is also bad, so please use your best judgement.
*
* The options are equal to the options accepted by the {@link #exists} method except
* that `options.timeout` defaults to the `timeouts.try` config, or 2 minutes.
*/
public async existOrFail(selector: string, existsOptions?: ExistsOptions): Promise<void | never> {
if (!(await this.exists(selector, { timeout: this.TRY_TIME, ...existsOptions }))) {
throw new Error(`expected testSubject(${selector}) to exist`);