Remove RFCs from our repository.
|
@@ -1,108 +0,0 @@
|
|||
- Start Date: (fill me in with today's date, YYYY-MM-DD)
|
||||
- TTL: (e.g. "April 20th, 2021", time the review is expected to be completed by. Don't use relative days.)
|
||||
- Champion: (usually you, person who writes and updates the draft and incorporates feedback)
|
||||
- Main reviewer: (somebody familiar with the subject matter, who has committed to provide timely and detailed reviews for this RFC)
|
||||
- Owner team: (team who will own implementation, if it is accepted)
|
||||
- Stakeholders: (people or groups who will be affected by the proposed changes)
|
||||
- RFC PR: (leave this empty, it will be a link to PR of this RFC)
|
||||
- PoC PR: (optional, link to a PoC implementation of the feature)
|
||||
- Kibana Issue: (link to issue where the proposed feature is tracked)
|
||||
|
||||
|
||||
# Executive Summary
|
||||
|
||||
Summarize this RFC so those unfamiliar with the project and code can quickly understand
|
||||
what the problem is, why it is important,
|
||||
and the proposed solution. Below are some suggested sections for the Executive
|
||||
Summary. Tweak as you desire and try to keep it succinct.
|
||||
|
||||
## Problem statement
|
||||
|
||||
What is the problem we are trying to solve? Supply any relevant background
|
||||
context. Why is this something we should focus on _now_?
|
||||
|
||||
Focus on explaining the problem so that if this RFC is not accepted, this
|
||||
information could be used to develop alternative solutions. In other words,
|
||||
don't couple this too closely with the solution you have in mind.
|
||||
|
||||
## Goals
|
||||
|
||||
What are the goals of this project? How will we know if it was successful?
|
||||
|
||||
## Proposal
|
||||
|
||||
What are we doing to achieve the goals and solve the problem?
|
||||
|
||||
|
||||
# Who is affected and how
|
||||
|
||||
Use this section to home in on who will be affected and how. For example:
|
||||
|
||||
- Are consumers of a specific plugin affected because of a public API change?
|
||||
- Will all Kibana Contributors be affected because of a change that may affect
|
||||
the development experience?
|
||||
|
||||
|
||||
# Detailed design
|
||||
|
||||
This is the bulk of the RFC. Explain the design in enough detail for somebody
|
||||
familiar with Kibana to understand, and for somebody familiar with the
|
||||
implementation to implement. This should get into specifics and corner-cases,
|
||||
and include examples of how the feature is used. Any new terminology should be
|
||||
defined here.
|
||||
|
||||
Include architectural diagrams if you see fit; a picture is worth a thousand
|
||||
words.
|
||||
|
||||
## Terminology
|
||||
|
||||
A glossary of new terms can be very helpful.
|
||||
|
||||
|
||||
# Risks
|
||||
|
||||
Why should we *not* do this? Please consider:
|
||||
|
||||
- implementation cost, both in terms of code size and complexity
|
||||
- the impact on teaching people Kibana development
|
||||
- integration of this feature with other existing and planned features
|
||||
- cost of migrating existing Kibana plugins (is it a breaking change?)
|
||||
|
||||
There are tradeoffs to choosing any path. Attempt to identify them here.
|
||||
|
||||
|
||||
# Alternatives
|
||||
|
||||
What other designs have been considered? What is the impact of not doing this?
|
||||
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
If we implement this proposal, how will existing Kibana developers adopt it? Is
|
||||
this a breaking change? Can we write a codemod? Should we coordinate with
|
||||
other projects or libraries?
|
||||
|
||||
|
||||
# How this scales
|
||||
|
||||
Does this change affect Kibana's performance in a substantial way? Have we discovered
|
||||
the upper bounds before we see performance degradations? Will any load
|
||||
tests be added to cover these scenarios?
|
||||
|
||||
|
||||
# How we teach this
|
||||
|
||||
What names and terminology work best for these concepts and why? How is this
|
||||
idea best presented? As a continuation of existing Kibana patterns?
|
||||
|
||||
Would the acceptance of this proposal mean the Kibana documentation must be
|
||||
re-organized or altered? Does it change how Kibana is taught to new developers
|
||||
at any level?
|
||||
|
||||
How should this feature be taught to existing Kibana developers?
|
||||
|
||||
|
||||
# Unresolved questions
|
||||
|
||||
Optional, but suggested for first drafts. What parts of the design are still
|
||||
TBD?
|
rfcs/README.md
@@ -1,139 +0,0 @@
|
|||
# Kibana RFCs
|
||||
|
||||
Many changes, including small to medium features, fixes, and documentation
|
||||
improvements can be implemented and reviewed via the normal GitHub pull request
|
||||
workflow.
|
||||
|
||||
Some changes though are "substantial", and we ask that these be put
|
||||
through a bit of a design process and produce a consensus among the relevant
|
||||
Kibana team.
|
||||
|
||||
The "RFC" (request for comments) process is intended to provide a
|
||||
consistent and controlled path for new features to enter the project.
|
||||
|
||||
[Active RFC List](https://github.com/elastic/kibana/pulls?q=is%3Aopen+is%3Apr+label%3ARFC)
|
||||
|
||||
Kibana is still **actively developing** this process, and it will still change as
|
||||
more features are implemented and the community settles on specific approaches
|
||||
to feature development.
|
||||
|
||||
## Contributor License Agreement (CLA)
|
||||
|
||||
In order to accept your pull request, we need you to submit a CLA. You only need
|
||||
to do this once, so if you've done this for another Elastic open source
|
||||
project, you're good to go.
|
||||
|
||||
**[Complete your CLA here.](https://www.elastic.co/contributor-agreement)**
|
||||
|
||||
## When to follow this process
|
||||
|
||||
You should consider using this process if you intend to make "substantial"
|
||||
changes to Kibana or its documentation. Some examples that would benefit
|
||||
from an RFC are:
|
||||
|
||||
- A new feature that creates new API surface area, such as a new
|
||||
service available to plugins.
|
||||
- The removal of features that already shipped as part of a release.
|
||||
- The introduction of new idiomatic usage or conventions, even if they
|
||||
do not include code changes to Kibana itself.
|
||||
|
||||
The RFC process is a great opportunity to get more eyeballs on your proposal
|
||||
before it becomes a part of a released version of Kibana. Quite often, even
|
||||
proposals that seem "obvious" can be significantly improved once a wider
|
||||
group of interested people have a chance to weigh in.
|
||||
|
||||
The RFC process can also be helpful to encourage discussions about a proposed
|
||||
feature as it is being designed, and incorporate important constraints into
|
||||
the design while it's easier to change, before the design has been fully
|
||||
implemented.
|
||||
|
||||
Some changes do not require an RFC:
|
||||
|
||||
- Rephrasing, reorganizing or refactoring
|
||||
- Addition or removal of warnings
|
||||
- Additions that strictly improve objective, numerical quality
|
||||
criteria (speedup, better browser support)
|
||||
- Addition of features that do not impact other Kibana plugins (do not
|
||||
expose any API to other plugins)
|
||||
|
||||
## What the process is
|
||||
|
||||
In short, to get a major feature added to the Kibana codebase, one usually
|
||||
first gets the RFC merged into the RFC tree as a markdown file. At that point
|
||||
the RFC is 'active' and may be implemented with the goal of eventual inclusion
|
||||
into Kibana.
|
||||
|
||||
* Fork the Kibana repo http://github.com/elastic/kibana
|
||||
* Copy `rfcs/0000_template.md` to `rfcs/text/0001_my_feature.md` (where
|
||||
'my_feature' is descriptive. Assign a number. Check that an RFC with this
|
||||
number doesn't already exist in `master` or an open PR).
|
||||
* Fill in the RFC. Put care into the details: **RFCs that do not
|
||||
present convincing motivation, demonstrate understanding of the
|
||||
impact of the design, or are disingenuous about the drawbacks or
|
||||
alternatives tend to be poorly-received**.
|
||||
* Submit a pull request. As a pull request the RFC will receive design
|
||||
feedback from the larger community and Elastic staff. The author should
|
||||
be prepared to revise it in response.
|
||||
* Build consensus and integrate feedback. RFCs that have broad support
|
||||
are much more likely to make progress than those that don't receive any
|
||||
comments.
|
||||
* Eventually, the team will decide whether the RFC is a candidate
|
||||
for inclusion in Kibana.
|
||||
* RFCs that are candidates for inclusion in Kibana will enter a "final comment
|
||||
period" lasting at least 3 working days. The beginning of this period will be signaled with a
|
||||
comment and tag on the RFCs pull request.
|
||||
* An RFC can be modified based upon feedback from the team and community.
|
||||
Significant modifications may trigger a new final comment period.
|
||||
* An RFC may be rejected by the team after public discussion has settled
|
||||
and comments have been made summarizing the rationale for rejection. A member of
|
||||
the team should then close the RFC's associated pull request.
|
||||
* An RFC may be accepted at the close of its final comment period. A team
|
||||
member will merge the RFC's associated pull request, at which point the RFC will
|
||||
become 'active'.
|
||||
|
||||
## The RFC life-cycle
|
||||
|
||||
Once an RFC becomes active, authors may implement it and submit the
|
||||
feature as a pull request to the Kibana repo. Becoming 'active' is not a rubber
|
||||
stamp, and in particular still does not mean the feature will ultimately
|
||||
be merged; it does mean that the team in ownership of the feature has agreed to
|
||||
it in principle and is amenable to merging it.
|
||||
|
||||
Furthermore, the fact that a given RFC has been accepted and is
|
||||
'active' implies nothing about what priority is assigned to its
|
||||
implementation, nor whether anybody is currently working on it.
|
||||
|
||||
Modifications to active RFCs can be done in followup PRs. We strive
|
||||
to write each RFC in a manner that it will reflect the final design of
|
||||
the feature; but the nature of the process means that we cannot expect
|
||||
every merged RFC to actually reflect what the end result will be at
|
||||
the time of the next major release; therefore we try to keep each RFC
|
||||
document somewhat in sync with the Kibana feature as planned,
|
||||
tracking such changes via followup pull requests to the document. You
|
||||
may include updates to the RFC in the same PR that makes the code change.
|
||||
|
||||
## Implementing an RFC
|
||||
|
||||
The author of an RFC is not obligated to implement it. Of course, the
|
||||
RFC author (like any other developer) is welcome to post an
|
||||
implementation for review after the RFC has been accepted.
|
||||
|
||||
If you are interested in working on the implementation for an 'active'
|
||||
RFC, but cannot determine if someone else is already working on it,
|
||||
feel free to ask (e.g. by leaving a comment on the associated issue).
|
||||
|
||||
## Reviewing RFCs
|
||||
|
||||
Each week the team will attempt to review some set of open RFC
|
||||
pull requests.
|
||||
|
||||
Every accepted feature should have a champion from the team that will
|
||||
ultimately maintain the feature long-term. The champion will represent the
|
||||
feature and its progress.
|
||||
|
||||
**Kibana's RFC process owes its inspiration to the [React RFC process], [Yarn RFC process], [Rust RFC process], and [Ember RFC process]**
|
||||
|
||||
[React RFC process]: https://github.com/reactjs/rfcs
|
||||
[Yarn RFC process]: https://github.com/yarnpkg/rfcs
|
||||
[Rust RFC process]: https://github.com/rust-lang/rfcs
|
||||
[Ember RFC process]: https://github.com/emberjs/rfcs
|
(30 image files removed.)
|
@@ -1,141 +0,0 @@
|
|||
- Start Date: 2019-03-05
|
||||
- RFC PR: [#32507](https://github.com/elastic/kibana/pull/32507)
|
||||
- Kibana Issue: [#33045](https://github.com/elastic/kibana/issues/33045)
|
||||
|
||||
# Summary
|
||||
|
||||
The `setup` lifecycle function for core and plugins will be for one-time setup
|
||||
and configuration logic that should be completed in a finite amount of time
|
||||
rather than be available throughout the runtime of the service.
|
||||
|
||||
The existing `start` lifecycle function will continue to serve only the purpose
|
||||
of longer running code that intentionally only executes when `setup` is
|
||||
finished.
|
||||
|
||||
# Basic example
|
||||
|
||||
```ts
|
||||
class Plugin {
|
||||
public setup(core, plugins) {
|
||||
// example operation that should only happen during setup
|
||||
core.savedObjects.setRepository(/* ... */);
|
||||
}
|
||||
|
||||
public start(core, plugins) {
|
||||
// example retrieval of client with guarantee that repository was set above
|
||||
core.savedObjects.getClient();
|
||||
}
|
||||
|
||||
public stop(core, plugins) {
|
||||
// ...
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
# Motivation
|
||||
|
||||
We want services and plugins to be designed to adapt to changes in data and
|
||||
configurations over time, but there are practical limits to this philosophy,
|
||||
which we already acknowledge by having a separate `start` and `stop` handler.
|
||||
|
||||
Currently, the `start` handler is where the vast majority of business logic
|
||||
takes place because it gets fired off almost immediately on startup and then no
|
||||
new lifecycle events are encountered until it's time to shutdown.
|
||||
|
||||
This results in lifecycle-like behaviors being hardcoded into the `start`
|
||||
handler itself rather than being exposed in a systematic way that other
|
||||
services and plugins can take advantage of.
|
||||
|
||||
For example, core should not bind to a port until all HTTP handlers have been
|
||||
registered, but the service itself needs to initialize before it can expose the
|
||||
means of registering HTTP endpoints for plugins. It exposes this capability via
|
||||
its `start` handler. Port binding, however, is hardcoded to happen after the
|
||||
rest of the services are started. No other services behave this way.
|
||||
|
||||
Unlike core services which can have hacky hardcoded behaviors that don't
|
||||
completely adhere to the order of execution in a lifecycle, plugins have no way
|
||||
of saying "execute this only when all plugins have initialized". It's not
|
||||
practical for a plugin that has side effects like pushing cluster privileges to
|
||||
Elasticsearch to constantly be executing those side effects whenever an
|
||||
observable changes. Instead, they need a point in time when they can safely
|
||||
assume the necessary configurations have been made.
|
||||
|
||||
A `setup` lifecycle handler would allow core and plugins to expose contracts
|
||||
that have a reliable expiration in the context of the overall lifecycle.
|
||||
|
||||
# Detailed design
|
||||
|
||||
A new `setup` lifecycle handler will be adopted for services and plugins. The
|
||||
order in which lifecycle handlers execute will be:
|
||||
|
||||
1. `setup`
|
||||
2. `start`
|
||||
3. `stop`
|
||||
|
||||
## Core
|
||||
|
||||
The core system will have a `setup` function that will get executed prior to
`start`. A `setup` function will also be added to all core services, and will
be invoked from the core `setup` in the same spirit as `start` and `stop`.
|
||||
|
||||
Decisions on which service functionality should belong in `setup` vs `start`
|
||||
will need to be handled case-by-case and is beyond the scope of this RFC, but
|
||||
much of the existing functionality will likely be exposed through `setup`
|
||||
instead.
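
To make the ordering concrete, here is a minimal sketch of how the core system could drive `setup`, `start` and `stop` across its services. This is not the actual core implementation; `CoreService` and the service registry are purely illustrative.

```ts
// Minimal sketch (not the actual Kibana core code) of the lifecycle ordering.
interface CoreService<TSetup = unknown, TStart = unknown> {
  setup(): Promise<TSetup>;
  start(): Promise<TStart>;
  stop(): Promise<void>;
}

class CoreSystem {
  // `services` is a hypothetical registry; real core services are wired explicitly.
  constructor(private readonly services: CoreService[]) {}

  public async setup() {
    // One-time configuration for every service, before anything starts running.
    for (const service of this.services) {
      await service.setup();
    }
  }

  public async start() {
    // Longer-running behavior only begins once *all* setup work has completed.
    for (const service of this.services) {
      await service.start();
    }
  }

  public async stop() {
    // Tear down in reverse order of startup.
    for (const service of this.services.slice().reverse()) {
      await service.stop();
    }
  }
}
```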
|
||||
|
||||
## Plugins
|
||||
|
||||
Plugins will have a `setup` function that will get executed by the core plugin
|
||||
service from its own `setup`.
|
||||
|
||||
Like `start` and `stop`, the `setup` lifecycle handler will receive
|
||||
setup-specific core contracts via the first argument.
|
||||
|
||||
Also like `start` and `stop`, the `setup` lifecycle handler will receive the
|
||||
setup-specific plugin contracts from all plugins that it has a declared
|
||||
dependency on via the second argument.
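
For illustration, a plugin under this model might look like the following sketch. The contract interface names (`CoreSetup`, `MyPluginSetupDeps`, etc.) are placeholders rather than final type names.

```ts
// Illustrative shapes only; concrete contract types are defined by core and by
// each dependency plugin.
interface CoreSetup { /* setup-phase core contracts */ }
interface CoreStart { /* start-phase core contracts */ }
interface MyPluginSetupDeps { /* setup contracts of declared dependencies */ }
interface MyPluginStartDeps { /* start contracts of declared dependencies */ }

class MyPlugin {
  public setup(core: CoreSetup, plugins: MyPluginSetupDeps) {
    // one-time registration and configuration work goes here
  }

  public start(core: CoreStart, plugins: MyPluginStartDeps) {
    // runtime behavior that may assume all setup has completed
  }

  public stop() {}
}
```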
|
||||
|
||||
# Drawbacks
|
||||
|
||||
- An additional lifecycle handler adds complexity for many plugins and services
|
||||
which draw no direct benefit from it.
|
||||
- The answer to "does this belong in `setup` or `start`?" is not always clear.
|
||||
There is not a formal decision tree we can apply to all circumstances.
|
||||
- While lifecycle hooks are relatively new, there are still many services that will
|
||||
need to be updated.
|
||||
- Adopting new lifecycle hooks is a slippery slope, and the more we have in the
|
||||
system, the more complicated it is to reason about the capabilities of the
|
||||
system at any given point.
|
||||
|
||||
# Alternatives
|
||||
|
||||
When a service or plugin needs to know when initialization has finished, it can
expose a custom event or transaction system via its relevant contracts so it
can tell when downstream code has finished initializing. One significant
drawback to this approach is that it breaks down as soon as the plugin that
needs to wait for initialization depends on an upstream service that does not
implement a similar transaction capability.
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
Adoption will need to be manual. Since the bulk of the `start` logic in the
|
||||
repo today is configuration-oriented, I recommend renaming `start`->`setup` in
|
||||
all services and plugins, and then adding an empty `start` where it is
|
||||
necessary. Functionality can then be moved from `setup`->`start` on a
|
||||
case-by-case basis.
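
As a sketch of that migration, assuming hypothetical `registerThing` and `startPolling` helpers that stand in for real registration and runtime work:

```ts
// Hypothetical plugin illustrating the mechanical migration described above.
declare function registerThing(): void;
declare function startPolling(): () => void;

class MigratedPlugin {
  private stopPolling?: () => void;

  // Was previously named `start`; renamed to `setup` during the migration.
  public setup() {
    registerThing(); // one-time configuration stays in `setup`
  }

  // Newly added, initially empty; runtime behavior is moved here case by case.
  public start() {
    this.stopPolling = startPolling();
  }

  public stop() {
    this.stopPolling?.();
  }
}
```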
|
||||
|
||||
If this change doesn't happen for a while, then it might make sense to follow
|
||||
the reverse process to ensure the least impact.
|
||||
|
||||
The migration guide will be updated to reflect the `setup` and `start`
|
||||
distinction as soon as this RFC is accepted.
|
||||
|
||||
# How we teach this
|
||||
|
||||
There shouldn't need to be much knowledge sharing around this since even
|
||||
`start` and `stop` are new concepts to most people. The sooner we introduce
|
||||
this change, the better.
|
||||
|
||||
# Unresolved questions
|
||||
|
||||
None, at the moment.
|
|
@@ -1,252 +0,0 @@
|
|||
- Start Date: 2019-03-22
|
||||
- RFC PR: [#33740](https://github.com/elastic/kibana/pull/33740)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
# Summary
|
||||
|
||||
In order to support the action service, we need a way to encrypt/decrypt
attributes on saved objects that works with security and spaces filtering and
performs audit logging. It must sufficiently hide the encryption key used and
prevent encrypted attributes from being exposed through regular means.
|
||||
|
||||
# Basic example
|
||||
|
||||
Register saved object type with the `encrypted_saved_objects` plugin:
|
||||
|
||||
```typescript
|
||||
server.plugins.encrypted_saved_objects.registerType({
|
||||
type: 'server-action',
|
||||
attributesToEncrypt: new Set(['credentials', 'apiKey']),
|
||||
});
|
||||
```
|
||||
|
||||
Use the same API to create saved objects with encrypted attributes as for any other saved object type:
|
||||
|
||||
```typescript
|
||||
const savedObject = await server.savedObjects
|
||||
.getScopedSavedObjectsClient(request)
|
||||
.create('server-action', {
|
||||
name: 'my-server-action',
|
||||
data: { location: 'BBOX (100.0, ..., 0.0)', email: '<html>...</html>' },
|
||||
credentials: { username: 'some-user', password: 'some-password' },
|
||||
apiKey: 'dGhpcyBpcyBub3QgYSByZWFsIHRva2VuIGJ1dCBpdCBpcyBvb'
|
||||
});
|
||||
|
||||
// savedObject = {
|
||||
// id: 'dd9750b9-ef0a-444c-8405-4dfcc2e9d670',
|
||||
// type: 'server-action',
|
||||
// name: 'my-server-action',
|
||||
// data: { location: 'BBOX (100.0, ..., 0.0)', email: '<html>...</html>' },
|
||||
// };
|
||||
|
||||
```
|
||||
|
||||
Use dedicated method to retrieve saved object with decrypted attributes on behalf of Kibana internal user:
|
||||
|
||||
```typescript
|
||||
const savedObject = await server.plugins.encrypted_saved_objects.getDecryptedAsInternalUser(
|
||||
'server-action',
|
||||
'dd9750b9-ef0a-444c-8405-4dfcc2e9d670'
|
||||
);
|
||||
|
||||
// savedObject = {
|
||||
// id: 'dd9750b9-ef0a-444c-8405-4dfcc2e9d670',
|
||||
// type: 'server-action',
|
||||
// name: 'my-server-action',
|
||||
// data: { location: 'BBOX (100.0, ..., 0.0)', email: '<html>...</html>' },
|
||||
// credentials: { username: 'some-user', password: 'some-password' },
|
||||
// apiKey: 'dGhpcyBpcyBub3QgYSByZWFsIHRva2VuIGJ1dCBpdCBpcyBvb',
|
||||
// };
|
||||
```
|
||||
|
||||
# Motivation
|
||||
|
||||
The main motivation is the storage and usage of third-party credentials for use
with the action service to send notifications. It also enables other types of
integrations, such as calling webhooks using tokens.
|
||||
|
||||
# Detailed design
|
||||
|
||||
In order for this to be available in Basic, it needs to be implemented as a
wrapper around the saved objects client. This wrapper can be added from the
`x-pack` plugin.
|
||||
|
||||
## General
|
||||
|
||||
To be able to manage saved objects with encrypted attributes from any plugin one should
|
||||
do the following:
|
||||
|
||||
1. Define `encrypted_saved_objects` plugin as a dependency.
|
||||
2. Add attributes to be encrypted in `mappings.json` file for the respective saved object type. These attributes should
|
||||
always have a `binary` type since they'll contain encrypted content as a `Base64` encoded string and should never be
|
||||
searchable or analyzed. This makes defining of attributes that require encryption explicit and auditable, and significantly
|
||||
simplifies implementation:
|
||||
```json
|
||||
{
|
||||
"server-action": {
|
||||
"properties": {
|
||||
"name": { "type": "keyword" },
|
||||
"data": {
|
||||
"properties": {
|
||||
"location": { "type": "geo_shape" },
|
||||
"email": { "type": "text" }
|
||||
}
|
||||
},
|
||||
"credentials": { "type": "binary" },
|
||||
"apiKey": { "type": "binary" }
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
3. Register saved object type and attributes that should be encrypted with `encrypted_saved_objects` plugin:
|
||||
```typescript
|
||||
server.plugins.encrypted_saved_objects.registerType({
|
||||
type: 'server-action',
|
||||
attributesToEncrypt: new Set(['credentials', 'apiKey']),
|
||||
attributesToExcludeFromAAD: new Set(['data']),
|
||||
});
|
||||
```
|
||||
|
||||
Notice the optional `attributesToExcludeFromAAD` property; it allows one to exclude some of the saved object attributes
from the Additional authenticated data (AAD). Read more about that below in the `Encryption and decryption` section.
|
||||
|
||||
Since `encrypted_saved_objects` adds its own wrapper (`EncryptedSavedObjectsClientWrapper`) into the `SavedObjectsClient`
wrapper chain, consumers will be able to create, update, delete and retrieve saved objects using the standard Saved Objects API.
The two main responsibilities of the wrapper are:
|
||||
|
||||
* It encrypts attributes that are supposed to be encrypted during `create`, `bulkCreate` and `update` operations
|
||||
* It strips encrypted attributes from **any** saved object returned from the Saved Objects API
|
||||
|
||||
As noted above, the wrapper strips encrypted attributes from any saved object returned by the Saved Objects API methods. That means
there is no way to retrieve encrypted attributes through the standard Saved Objects API unless the `encrypted_saved_objects`
plugin is disabled. This can potentially lead to a situation where a consumer retrieves a saved object, updates its non-encrypted
properties, and passes that same object to the `update` Saved Objects API method without re-defining the encrypted attributes. In
this case only the specified attributes will be updated and the encrypted attributes will stay untouched. If the updated
attributes are included in the AAD (which is true by default for all attributes unless they are specifically excluded via
`attributesToExcludeFromAAD`), then it will no longer be possible to decrypt the encrypted attributes. At this stage we consider
this a developer mistake and don't prevent it from happening in any way apart from logging this type of event. A partial
update of only attributes that are not part of the AAD will not cause this issue.
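
To illustrate the mistake described above, here is a hypothetical consumer snippet. It assumes the standard `get`/`update` Saved Objects client methods and reuses the `server-action` type and ID from the earlier examples.

```typescript
// Hypothetical consumer code reproducing the developer mistake described above.
const client = server.savedObjects.getScopedSavedObjectsClient(request);

// Encrypted attributes (`credentials`, `apiKey`) are stripped from this result.
const action = await client.get('server-action', 'dd9750b9-ef0a-444c-8405-4dfcc2e9d670');

// Passing the stripped attributes back on update leaves the stored encrypted
// attributes untouched, but if `name` is part of the AAD, the stored ciphertext
// can no longer be decrypted afterwards.
await client.update('server-action', action.id, {
  ...action.attributes,
  name: 'renamed-server-action',
});
```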
|
||||
|
||||
The saved object ID is an essential part of the AAD used during the encryption process and hence should be as hard to guess as possible.
To fulfil this requirement the wrapper generates highly random IDs (UUIDv4) for the saved objects that contain encrypted
attributes. Consumers are therefore not allowed to specify an ID when calling the `create` or `bulkCreate` methods, and an error
will be thrown if they try to do so.
|
||||
|
||||
To reduce the risk of unintentional decryption and the consequent leaking of sensitive information, there is only one way
to retrieve a saved object and decrypt its encrypted attributes, and it's exposed only through the `encrypted_saved_objects` plugin:
|
||||
|
||||
```typescript
|
||||
const savedObject = await server.plugins.encrypted_saved_objects.getDecryptedAsInternalUser(
|
||||
'server-action',
|
||||
'dd9750b9-ef0a-444c-8405-4dfcc2e9d670'
|
||||
);
|
||||
|
||||
// savedObject = {
|
||||
// id: 'dd9750b9-ef0a-444c-8405-4dfcc2e9d670',
|
||||
// type: 'server-action',
|
||||
// name: 'my-server-action',
|
||||
// data: { location: 'BBOX (100.0, ..., 0.0)', email: '<html>...</html>' },
|
||||
// credentials: { username: 'some-user', password: 'some-password' },
|
||||
// apiKey: 'dGhpcyBpcyBub3QgYSByZWFsIHRva2VuIGJ1dCBpdCBpcyBvb',
|
||||
// };
|
||||
```
|
||||
|
||||
As can be seen from the method name, the request to retrieve a saved object and decrypt its attributes is performed on
behalf of the internal Kibana user, and hence this method isn't supposed to be called within a user request context.
|
||||
|
||||
**Note:** the fact that saved object with encrypted attributes is created using standard Saved Objects API within a
|
||||
particular user and space context, but retrieved out of any context makes it unclear how consumers are supposed to
|
||||
provide that context and retrieve saved object from a particular space. Current plan for `getDecryptedAsInternalUser`
|
||||
method is to accept a third `BaseOptions` argument that allows consumers to specify `namespace` that they can retrieve
|
||||
from the request using public `spaces` plugin API.
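
A sketch of what that could look like; the third options argument and the `spaces` API call are assumptions based on the plan above, not a final interface:

```typescript
// Hypothetical usage based on the plan described above; the exact signature of
// the third options argument is not final.
const namespace = server.plugins.spaces.getSpaceId(request); // assumed spaces API

const savedObject = await server.plugins.encrypted_saved_objects.getDecryptedAsInternalUser(
  'server-action',
  'dd9750b9-ef0a-444c-8405-4dfcc2e9d670',
  { namespace }
);
```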
|
||||
|
||||
## Encryption and decryption
|
||||
|
||||
Saved object attributes are encrypted using the [@elastic/node-crypto](https://github.com/elastic/node-crypto) library. Please
take a look at the source code of this library to see exactly how encryption is performed and which algorithm and encryption
parameters are used, but in short it is AES-256-GCM encryption with a random initialization vector and salt.
|
||||
|
||||
As with the encryption key for Kibana's session cookie, the master encryption key used by the `encrypted_saved_objects` plugin can be
defined as a configuration value (`xpack.encryptedSavedObjects.encryptionKey`) via `kibana.yml`, but it's **highly
recommended** to define this key in the [Kibana Keystore](https://www.elastic.co/guide/en/kibana/current/secure-settings.html)
instead. The master key should be cryptographically random and at least 32 bytes long.
|
||||
|
||||
To prevent attacks where the raw content of the encrypted attributes of one saved object is copied to another
saved object, which would unintentionally allow it to decrypt content that was not supposed to be decrypted, we rely on Additional
authenticated data (AAD) during encryption and decryption. AAD consists of the following components:
|
||||
|
||||
* Saved object ID
|
||||
* Saved object type
|
||||
* Saved object attributes
|
||||
|
||||
AAD does not include the encrypted attributes themselves, nor the attributes defined in the optional `attributesToExcludeFromAAD`
parameter provided during saved object type registration with the `encrypted_saved_objects` plugin. There are a number of
reasons why one would want to exclude certain attributes from AAD:
|
||||
|
||||
* if an attribute contains a large amount of data that can significantly slow down encryption and decryption, especially during
bulk operations (e.g. a large geo shape or arbitrary HTML document)
* if an attribute contains data that is supposed to be updated separately from the encrypted attributes or the attributes included
in AAD (e.g. some user-defined content associated with an email action or alert)
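
To make the AAD mechanics more concrete, here is a minimal sketch using Node's built-in `crypto` module with AES-256-GCM. It is illustrative only and is not how `@elastic/node-crypto` is implemented; key derivation, salts and decryption are omitted.

```typescript
import { randomBytes, createCipheriv } from 'crypto';

// Minimal illustration only; `key` must be 32 bytes for AES-256-GCM.
function encryptAttribute(
  key: Buffer,
  plaintext: string,
  aadParts: { id: string; type: string; attributes: Record<string, unknown> }
) {
  const iv = randomBytes(12);
  const cipher = createCipheriv('aes-256-gcm', key, iv);

  // The AAD is authenticated but not encrypted: tampering with the saved
  // object ID, type, or the non-excluded attributes makes decryption fail.
  cipher.setAAD(Buffer.from(JSON.stringify(aadParts)));

  const encrypted = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();

  // Stored as a single Base64 string in a `binary`-mapped field.
  return Buffer.concat([iv, tag, encrypted]).toString('base64');
}
```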
|
||||
|
||||
## Audit
|
||||
|
||||
Encrypted attributes will most likely contain sensitive information and any attempt to access these should be properly
|
||||
logged to allow any further audit procedures. The following events will be logged with Kibana audit log functionality:
|
||||
|
||||
* Successful attempt to encrypt attributes (incl. saved object ID, type and attributes names)
|
||||
* Failed attempt to encrypt attribute (incl. saved object ID, type and attribute name)
|
||||
* Successful attempt to decrypt attributes (incl. saved object ID, type and attributes names)
|
||||
* Failed attempt to decrypt attribute (incl. saved object ID, type and attribute name)
|
||||
|
||||
In addition to audit log events we'll issue ordinary log events for any attempts to save, update or decrypt saved objects
|
||||
with missing attributes that were supposed to be encrypted/decrypted based on the registration parameters.
|
||||
|
||||
# Benefits
|
||||
|
||||
* None of the registered types will expose their encrypted details. The saved
|
||||
objects with their unencrypted attributes could still be obtained and searched
|
||||
on. The wrapper will follow all the security and spaces filtering of saved
|
||||
objects so that only users with appropriate permissions will be able to obtain
|
||||
the scrubbed objects or _save_ objects with encrypted attributes.
|
||||
|
||||
* No explicit access to a method that takes in an encrypted string exists. If the
type was not registered, no decryption is possible. There is no need to handle the saved object
with the encrypted attributes, which reduces the risk of accidentally returning it in a
handler.
|
||||
|
||||
# Drawbacks
|
||||
|
||||
* It isn't possible to decrypt existing encrypted attributes once the encryption key changes
* Possible performance impact on Saved Objects API operations that require encryption/decryption
* Will require non-trivial tests to cover the functionality along with spaces and security
* The attributes that are encrypted have to be defined, and if they change they need to be migrated
|
||||
|
||||
# Out of scope
|
||||
|
||||
* Encryption key rotation mechanism, either regular or emergency
|
||||
* Mechanism that would detect and warn when Kibana does not use keystore to store encryption key
|
||||
|
||||
# Alternatives
|
||||
|
||||
Only allow this to be used within the Actions service itself, where the details
of the saved objects are handled directly and the saved objects are
`hidden`, but still use the security and spaces wrappers.
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
Integration should be pretty easy: depend on the plugin, register the desired saved object type
with it, and define the encrypted attributes in the `mappings.json`.
|
||||
|
||||
# How we teach this
|
||||
|
||||
Use `encrypted_saved_objects` as the name of the `thing`, so that it is seen as a separate
extension on top of the saved objects service.
|
||||
|
||||
Provide a README.md in the plugin directory with the usage examples.
|
||||
|
||||
# Unresolved questions
|
||||
|
||||
* Is it acceptable to have this plugin in Basic?
|
||||
* Are there any other use-cases that are not served with that interface?
|
||||
* How would this work with the Saved Objects Export/Import API?
|
||||
* How would this work with migrations? If the attribute names needed to be
changed, would a decryption context need to be created for the migration?
|
|
@@ -1,356 +0,0 @@
|
|||
- Start Date: 2019-05-11
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
# Summary
|
||||
|
||||
Handlers are asynchronous functions registered with core services and invoked to
respond to events like an HTTP request or mounting an application. _Handler
|
||||
context_ is a pattern that would allow APIs and values to be provided to handler
|
||||
functions by the service that owns the handler (aka service owner) or other
|
||||
services that are not necessarily known to the service owner.
|
||||
|
||||
# Basic example
|
||||
|
||||
```js
|
||||
// services can register context providers to route handlers
|
||||
http.registerContext('myApi', (context, request) => ({ getId() { return request.params.myApiId } }));
|
||||
|
||||
http.router.route({
|
||||
method: 'GET',
|
||||
path: '/saved_object/:id',
|
||||
// routeHandler implements the "handler" interface
|
||||
async routeHandler(context, request) {
|
||||
// returned value of the context registered above is exposed on the `myApi` key of context
|
||||
const objectId = context.myApi.getId();
|
||||
// core context is always present in the `context.core` key
|
||||
return context.core.savedObjects.find(objectId);
|
||||
},
|
||||
});
|
||||
```
|
||||
|
||||
# Motivation
|
||||
|
||||
The informal concept of handlers already exists today in HTTP routing, task
|
||||
management, and the designs of application mounting and alert execution.
|
||||
Examples:
|
||||
|
||||
```tsx
|
||||
// Task manager tasks
|
||||
taskManager.registerTaskDefinitions({
|
||||
myTask: {
|
||||
title: 'The task',
|
||||
timeout: '5m',
|
||||
createTaskRunner(context) {
|
||||
return {
|
||||
async run() {
|
||||
const docs = await context.core.elasticsearch.search();
|
||||
doSomethingWithDocs(docs);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
})
|
||||
|
||||
// Application mount handlers
|
||||
application.registerApp({
|
||||
id: 'myApp',
|
||||
mount(context, domElement) {
|
||||
ReactDOM.render(
|
||||
<MyApp overlaysService={context.core.overlays} />,
|
||||
domElement
|
||||
);
|
||||
return () => ReactDOM.unmountComponentAtNode(domElement);
|
||||
}
|
||||
});
|
||||
|
||||
// Alerting
|
||||
alerting.registerType({
|
||||
id: 'myAlert',
|
||||
async execute(context, params, state) {
|
||||
const indexPatterns = await context.core.savedObjects.find('indexPattern');
|
||||
// use index pattern to search
|
||||
}
|
||||
})
|
||||
```
|
||||
|
||||
Without a formal definition, each handler interface varies slightly and
|
||||
different solutions are developed per handler for managing complexity and
|
||||
enabling extensibility.
|
||||
|
||||
The official handler context convention seeks to address five key problems:
|
||||
|
||||
1. Different services and plugins should be able to expose functionality that
|
||||
is configured for the particular context where the handler is invoked, such
|
||||
as a savedObject client in an alert handler already being configured to use
|
||||
the appropriate API token.
|
||||
|
||||
2. The service owner of a handler should not need to know about the services
|
||||
or plugins that extend its handler context, such as the security plugin
|
||||
providing a currentUser function to an HTTP router handler.
|
||||
|
||||
3. Functionality in a handler should be "fixed" for the life of that
|
||||
handler's context rather than changing configuration under the hood in
|
||||
mid-execution. For example, while Elasticsearch clients can technically
|
||||
be replaced throughout the course of the Kibana process, an HTTP route
|
||||
handler should be able to depend on there being a consistent client for its
|
||||
own shorter lifespan.
|
||||
|
||||
4. Plugins should not need to pass down high level service contracts throughout
|
||||
their business logic just so they can access them within the context of a
|
||||
handler.
|
||||
|
||||
5. Functionality provided by services should not be arbitrarily used in
|
||||
unconstrained execution such as in the plugin lifecycle hooks. For example,
|
||||
it's appropriate for an Elasticsearch client to throw an error if it's used
|
||||
inside an API route and Elasticsearch isn't available, however it's not
|
||||
appropriate for a plugin to throw an error in their start function if
|
||||
Elasticsearch is not available. If the ES client was only made available
|
||||
within the handler context and not to the plugin's start contract at large,
|
||||
then this isn't an issue we'll encounter in the first place.
|
||||
|
||||
# Detailed design
|
||||
|
||||
There are two parts to this proposal. The first is the handler interface
|
||||
itself, and the second is the interface that a service owner implements to make
|
||||
their handlers extensible.
|
||||
|
||||
## Handler Context
|
||||
|
||||
```ts
|
||||
interface Context {
|
||||
core: Record<string, unknown>;
|
||||
[contextName: string]: unknown;
|
||||
}
|
||||
|
||||
type Handler = (context: Context, ...args: unknown[]) => Promise<unknown>;
|
||||
```
|
||||
|
||||
- `args` in this example is specific to the handler type, for instance in a
|
||||
http route handler, this would include the incoming request object.
|
||||
- The context object is marked as `Partial<Context>` because the contexts
|
||||
available will vary depending on which plugins are enabled.
|
||||
- This type is a convention, not a concrete type. The `core` key should have a
|
||||
known interface that is declared in the service owner's specific Context type.
|
||||
|
||||
## Registering new contexts
|
||||
|
||||
```ts
|
||||
type ContextProvider<T extends keyof Context> = (
|
||||
context: Partial<Context>,
|
||||
...args: unknown[]
|
||||
) => Promise<Context[T]>;
|
||||
|
||||
interface HandlerService {
|
||||
registerContext<T extends keyof Context>(contextName: T, provider: ContextProvider<T>): void;
|
||||
}
|
||||
```
|
||||
|
||||
- `args` in this example is specific to the handler type, for instance in a http
|
||||
route handler, this would include the incoming request object. It would not
|
||||
include the results from the other context providers in order to keep
|
||||
providers from having dependencies on one another.
|
||||
- The `HandlerService` is defined as a literal interface in this document, but
|
||||
in practice this interface is just a guide for the pattern of registering
|
||||
context values. Certain services may have multiple different types of
|
||||
handlers, so they may choose not to use the generic name `registerContext` in
|
||||
favor of something more explicit.
|
||||
|
||||
## Context creation
|
||||
|
||||
Before a handler is executed, each registered context provider will be called
|
||||
with the given arguments to construct a context object for the handler. Each
|
||||
provider must return an object of the correct type. The return values of these
|
||||
providers are merged into a single object where each key of the object is the
|
||||
name of the context provider and the value is the return value of the provider.
|
||||
Key facts about context providers:
|
||||
|
||||
- **Context providers are executed in registration order.** Providers are
|
||||
registered during the setup phase, which happens in topological dependency
|
||||
order, which will cause the context providers to execute in the same order.
|
||||
Providers can leverage this property to rely on the context of dependencies to
|
||||
be present during the execution of its own providers. All context registered
|
||||
by Core will be present during all plugin context provider executions.
|
||||
- **Context providers may be executed with different arguments from
|
||||
handlers.** Each service owner should define what arguments are available to
|
||||
context providers, however the context itself should never be an argument (see
|
||||
point above).
|
||||
- **Context providers cannot take over the handler execution.** Context providers
|
||||
cannot "intercept" handlers and return a different response. This is different
|
||||
than traditional middleware. It should be noted that throwing an exception
|
||||
will be bubbled up to the calling code and may prevent the handler from
|
||||
getting executed at all. How the service owner handles that exception is
|
||||
service-specific.
|
||||
- **Values returned by context providers are expected to be valid for the entire
|
||||
execution scope of the handler.**
|
||||
|
||||
Here's a simple example of how a service owner could construct a context and
|
||||
execute a handler:
|
||||
|
||||
```js
|
||||
const contextProviders = new Map<string, ContextProvider<unknown>>();
|
||||
|
||||
async function executeHandler(handler, request, toolkit) {
|
||||
const newContext = {};
|
||||
for (const [contextName, provider] of contextProviders.entries()) {
|
||||
newContext[contextName] = await provider(newContext, request, toolkit);
|
||||
}
|
||||
|
||||
  return handler(newContext, request, toolkit);
|
||||
}
|
||||
```
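
Building on the registration-order guarantee described above, a provider can read context that was already registered by one of its dependencies. The `security` context and its `getCurrentUser` shape below are assumptions used purely for illustration.

```js
// Hypothetical provider illustrating the registration-order guarantee: because
// the `security` plugin is a declared dependency, its context has already been
// built by the time this provider runs and can be read from `context`.
http.registerContext('myApi', async (context, request) => {
  const user = await context.security.getCurrentUser(request); // assumed security context
  return {
    getId() {
      return `${user.username}:${request.params.myApiId}`;
    },
  };
});
```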
|
||||
|
||||
## End to end example
|
||||
|
||||
```js
|
||||
http.router.registerRequestContext('elasticsearch', async (context, request) => {
|
||||
const client = await core.elasticsearch.client$.toPromise();
|
||||
return client.child({
|
||||
headers: { authorization: request.headers.authorization },
|
||||
});
|
||||
});
|
||||
|
||||
http.router.route({
|
||||
path: '/foo',
|
||||
async routeHandler(context) {
|
||||
context.core.elasticsearch.search(); // === callWithRequest(request, 'search')
|
||||
},
|
||||
});
|
||||
```
|
||||
|
||||
## Types
|
||||
|
||||
While services that implement this pattern will not be able to define a static
|
||||
type, plugins should be able to reopen a type to extend it with whatever context
|
||||
it provides. This allows the `registerContext` function to be type-safe.
|
||||
For example, if the HTTP service defined a setup type like this:
|
||||
|
||||
```ts
|
||||
// http_service.ts
|
||||
interface RequestContext {
|
||||
core: {
|
||||
elasticsearch: ScopedClusterClient;
|
||||
};
|
||||
  [contextName: string]: unknown;
|
||||
}
|
||||
|
||||
interface HttpSetup {
|
||||
// ...
|
||||
|
||||
registerRequestContext<T extends keyof RequestContext>(
|
||||
contextName: T,
|
||||
provider: (context: Partial<RequestContext>, request: Request) => RequestContext[T] | Promise<RequestContext[T]>
|
||||
): void;
|
||||
|
||||
// ...
|
||||
}
|
||||
```
|
||||
|
||||
A consuming plugin could extend the `RequestContext` to be type-safe like this:
|
||||
|
||||
```ts
|
||||
// my_plugin/server/index.ts
|
||||
import { RequestContext } from '../../core/server';
|
||||
|
||||
// The plugin *should* add a new property to the RequestContext interface from
|
||||
// core to represent whatever type its context provider returns. This will be
|
||||
// available to any module that imports this type and will ensure that the
|
||||
// registered context provider returns the expected type.
|
||||
declare module "../../core/server" {
|
||||
interface RequestContext {
|
||||
myPlugin?: { // should be optional because this plugin may be disabled.
|
||||
getFoo(): string;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
class MyPlugin {
|
||||
setup(core) {
|
||||
// This will be type-safe!
|
||||
core.http.registerRequestContext('myPlugin', (context, request) => ({
|
||||
getFoo() { return 'foo!' }
|
||||
}))
|
||||
}
|
||||
};
|
||||
```
|
||||
|
||||
# Drawbacks
|
||||
|
||||
- Since the context properties that are present change if plugins are disabled,
|
||||
they are all marked as optional properties which makes consuming the context
|
||||
type awkward. We can expose types at the core and plugin level, but consumers
|
||||
of those types might need to define which properties are present manually to
|
||||
match their required plugin dependencies. Example:
|
||||
```ts
|
||||
type RequiredDependencies = 'data' | 'timepicker';
|
||||
type OptionalDependencies = 'telemetry';
|
||||
type MyPluginContext = Pick<RequestContext, 'core'> &
|
||||
Pick<RequestContext, RequiredDependencies> &
|
||||
Pick<Partial<RequestContext>, OptionalDependencies>;
|
||||
// => { core: {}, data: Data, timepicker: Timepicker, telemetry?: Telemetry };
|
||||
```
|
||||
This could even be provided as a generic type:
|
||||
```ts
|
||||
type AvailableContext<C, Req extends keyof C = never, Opt extends keyof C = never>
|
||||
= Pick<C, 'core'> & Required<Pick<C, Req>> & Partial<Pick<C, Opt>>;
|
||||
type MyPluginContext = AvailableContext<RequestContext, RequiredDependencies, OptionalDependencies>;
|
||||
// => { core: {}, data: Data, timepicker: Timepicker, telemetry?: Telemetry };
|
||||
```
|
||||
- Extending types with `declare module` merging is not a typical pattern for
|
||||
developers and it's not immediately obvious that you need to do this to type
|
||||
the `registerContext` function. We do already use this pattern with extending
|
||||
Hapi and EUI though, so it's not completely foreign.
|
||||
- The longer we wait to implement this, the more refactoring of newer code
|
||||
we'll need to do to roll this out.
|
||||
- It's a new formal concept and set of terminology that developers will need to
|
||||
learn relative to other new platform terminology.
|
||||
- Handlers are a common pattern for HTTP route handlers, but people don't
|
||||
necessarily associate similar patterns elsewhere as the same set of problems.
|
||||
- "Chicken and egg" questions will arise around where context providers should be
|
||||
registered. For example, does the `http` service invoke its
|
||||
registerRequestContext for `elasticsearch`, or does the `elasticsearch` service
|
||||
invoke `http.registerRequestContext`, or does core itself register the
|
||||
provider so that neither service depends directly on the other?
|
||||
- The existence of plugins that a given plugin does not depend on may leak
|
||||
through the context object. This becomes a problem if a plugin uses any
|
||||
context properties provided by a plugin that it does not depend on and that
|
||||
plugin gets disabled in production. This can be solved by service owners, but
|
||||
may need to be reimplemented for each one.
|
||||
|
||||
# Alternatives
|
||||
|
||||
The obvious alternative is what we've always done: expose all functionality at
|
||||
the plugin level and then leave it up to the consumer to build a "context" for
|
||||
their particular handler. This creates a lot of inconsistency and makes
|
||||
creating simple but useful handlers more complicated. This can also lead to
|
||||
subtle but significant bugs as it's unreasonable to assume all developers
|
||||
understand the important details for constructing a context with plugins they
|
||||
don't know anything about.
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
The easiest adoption strategy is to roll this change out in the new platform
|
||||
before we expose any handlers to plugins, which means there wouldn't be any
|
||||
breaking change.
|
||||
|
||||
In the event that there's a long delay before this is implemented, its
|
||||
principles can be rolled out without altering plugin lifecycle arguments so
|
||||
existing handlers would continue to operate for a timeframe of our choosing.
|
||||
|
||||
# How we teach this
|
||||
|
||||
The handler pattern should be one we officially adopt in our developer
|
||||
documentation alongside other new platform terminology.
|
||||
|
||||
Core should be updated to follow this pattern once it is rolled out so there
|
||||
are plenty of examples in the codebase.
|
||||
|
||||
For many developers, the formalization of this interface will not have an
|
||||
obvious, immediate impact on the code they're writing since the concept is
|
||||
already widely in use in various forms.
|
||||
|
||||
# Unresolved questions
|
||||
|
||||
Is the term "handler" appropriate and sufficient? I also toyed with the phrase
|
||||
"contextual handler" to make it a little more distinct of a concept. I'm open
|
||||
to ideas here.
|
|
@@ -1,334 +0,0 @@
|
|||
- Start Date: 2019-05-10
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
# Summary
|
||||
|
||||
A front-end service to manage registration and root-level routing for
|
||||
first-class applications.
|
||||
|
||||
# Basic example
|
||||
|
||||
|
||||
```tsx
|
||||
// my_plugin/public/application.js
|
||||
|
||||
import React from 'react';
|
||||
import ReactDOM from 'react-dom';
|
||||
|
||||
import { MyApp } from './components';
|
||||
|
||||
export function renderApp(context, { element }) {
|
||||
ReactDOM.render(
|
||||
<MyApp mountContext={context} />,
|
||||
element
|
||||
);
|
||||
|
||||
return () => {
|
||||
ReactDOM.unmountComponentAtNode(element);
|
||||
};
|
||||
}
|
||||
```
|
||||
|
||||
```tsx
|
||||
// my_plugin/public/plugin.js
|
||||
|
||||
class MyPlugin {
|
||||
setup({ application }) {
|
||||
application.register({
|
||||
id: 'my-app',
|
||||
title: 'My Application',
|
||||
async mount(context, params) {
|
||||
const { renderApp } = await import('./application');
|
||||
return renderApp(context, params);
|
||||
}
|
||||
});
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
# Motivation
|
||||
|
||||
By having centralized management of applications we can have a true single page
|
||||
application. It also gives us a single place to enforce authorization and/or
|
||||
licensing constraints on application access.
|
||||
|
||||
By making the mounting interface of the ApplicationService generic, we can
|
||||
support many different rendering technologies simultaneously to avoid framework
|
||||
lock-in.
|
||||
|
||||
# Detailed design
|
||||
|
||||
## Interface
|
||||
|
||||
```ts
|
||||
/** A context type that implements the Handler Context pattern from RFC-0003 */
|
||||
export interface AppMountContext {
|
||||
/** These services serve as an example, but are subject to change. */
|
||||
core: {
|
||||
http: {
|
||||
fetch(...): Promise<any>;
|
||||
};
|
||||
i18n: {
|
||||
translate(
|
||||
id: string,
|
||||
defaultMessage: string,
|
||||
values?: Record<string, string>
|
||||
): string;
|
||||
};
|
||||
notifications: {
|
||||
toasts: {
|
||||
add(...): void;
|
||||
};
|
||||
};
|
||||
overlays: {
|
||||
showFlyout(render: (domElement) => () => void): Flyout;
|
||||
showModal(render: (domElement) => () => void): Modal;
|
||||
};
|
||||
uiSettings: { ... };
|
||||
};
|
||||
/** Other plugins can inject context by registering additional context providers */
|
||||
[contextName: string]: unknown;
|
||||
}
|
||||
|
||||
export interface AppMountParams {
|
||||
/** The base path the application is mounted on. Used to configure routers. */
|
||||
appBasePath: string;
|
||||
/** The element the application should render into */
|
||||
element: HTMLElement;
|
||||
}
|
||||
|
||||
export type Unmount = () => Promise<void> | void;
|
||||
|
||||
export interface AppSpec {
|
||||
/**
|
||||
* A unique identifier for this application. Used to build the route for this
|
||||
* application in the browser.
|
||||
*/
|
||||
id: string;
|
||||
|
||||
/**
|
||||
* The title of the application.
|
||||
*/
|
||||
title: string;
|
||||
|
||||
/**
|
||||
* A mount function called when the user navigates to this app's route.
|
||||
* @param context the `AppMountContext` generated for this app
|
||||
* @param params the `AppMountParams`
|
||||
* @returns An unmounting function that will be called to unmount the application.
|
||||
*/
|
||||
mount(context: AppMountContext, params: AppMountParams): Unmount | Promise<Unmount>;
|
||||
|
||||
/**
|
||||
* A EUI iconType that will be used for the app's icon. This icon
|
||||
* takes precedence over the `icon` property.
|
||||
*/
|
||||
euiIconType?: string;
|
||||
|
||||
/**
|
||||
* A URL to an image file used as an icon. Used as a fallback
|
||||
* if `euiIconType` is not provided.
|
||||
*/
|
||||
icon?: string;
|
||||
|
||||
/**
|
||||
* Custom capabilities defined by the app.
|
||||
*/
|
||||
capabilities?: Partial<Capabilities>;
|
||||
}
|
||||
|
||||
export interface ApplicationSetup {
|
||||
/**
|
||||
* Registers an application with the system.
|
||||
*/
|
||||
register(app: AppSpec): void;
|
||||
registerMountContext<T extends keyof AppMountContext>(
    contextName: T,
    provider: (context: Partial<AppMountContext>) => AppMountContext[T] | Promise<AppMountContext[T]>
|
||||
): void;
|
||||
}
|
||||
|
||||
export interface ApplicationStart {
|
||||
/**
|
||||
* The UI capabilities for the current user.
|
||||
*/
|
||||
capabilities: Capabilities;
|
||||
}
|
||||
```
|
||||
|
||||
## Mounting
|
||||
|
||||
When an app is registered via `register`, it must provide a `mount` function
|
||||
that will be invoked whenever the window's location has changed from another app
|
||||
to this app.
|
||||
|
||||
This function is called with an `AppMountContext` and an
|
||||
`AppMountParams` which contains a `HTMLElement` for the application to
|
||||
render itself to. The mount function must also return a function that can be
|
||||
called by the ApplicationService to unmount the application at the given DOM
|
||||
Element. The mount function may return a Promise of an unmount function in order
|
||||
to import UI code dynamically.
|
||||
|
||||
The ApplicationService's `register` method will only be available during the
|
||||
*setup* lifecycle event. This allows the system to know when all applications
|
||||
have been registered.
|
||||
|
||||
The `mount` function will also get access to the `AppMountContext` that
|
||||
has many of the same core services available during the `start` lifecycle.
|
||||
Plugins can also register additional context attributes via the
|
||||
`registerMountContext` function.
|
||||
|
||||
## Routing
|
||||
|
||||
The ApplicationService will serve as the global frontend router for Kibana,
|
||||
enabling Kibana to be a 100% single page application. However, the router will
|
||||
only manage top-level routes. Applications themselves will need to implement
|
||||
their own routing as subroutes of the top-level route.
|
||||
|
||||
An example:
|
||||
- "MyApp" is registered with `id: 'my-app'`
|
||||
- User navigates from mykibana.com/app/home to mykibana.com/app/my-app
|
||||
- ApplicationService sees the root app has changed and mounts the new
|
||||
application:
|
||||
- Calls the `Unmount` function returned by "Home"'s `mount`
|
||||
- Calls the `mount` function registered by "MyApp"
|
||||
- MyApp's internal router takes over the rest of the routing. Redirects to the initial
|
||||
"overview" page: mykibana.com/app/my-app/overview
|
||||
|
||||
When setting up a router, your application should only handle the part of the
|
||||
URL following the `params.appBasePath` provided when your application is mounted.
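
As a rough sketch of that division of responsibility, the ApplicationService's top-level router might look something like the following. The internal names (`createMountContext`, the root element id) are assumptions, not the planned implementation.

```ts
// Rough sketch with assumed internals; not the planned implementation.
declare function createMountContext(app: AppSpec): AppMountContext; // assumed helper

let currentAppId: string | undefined;
let currentUnmount: Unmount | undefined;

async function onLocationChange(apps: Map<string, AppSpec>, pathname: string) {
  // The ApplicationService only owns the top-level /app/<app-id> segment;
  // everything under params.appBasePath belongs to the app's own router.
  const match = pathname.match(/^\/app\/([^/]+)/);
  if (!match) return;

  const app = apps.get(match[1]);
  if (!app || app.id === currentAppId) return;

  if (currentUnmount) {
    await currentUnmount(); // unmount the previously active app
  }

  currentAppId = app.id;
  currentUnmount = await app.mount(createMountContext(app), {
    appBasePath: `/app/${app.id}`,
    element: document.getElementById('kibana-app-root')!, // assumed root element
  });
}
```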
|
||||
|
||||
### Legacy Applications
|
||||
|
||||
In order to introduce this service now, the ApplicationService will need to be
|
||||
able to handle "routing" to legacy applications. We will not be able to run
|
||||
multiple legacy applications on the same page load due to shared stateful
|
||||
modules in `ui/public`.
|
||||
|
||||
Instead, the ApplicationService should do a full-page refresh when rendering
|
||||
legacy applications. Internally, this will be managed by registering legacy apps
|
||||
with the ApplicationService separately and handling those top-level routes by
|
||||
starting a full-page refresh rather than a mounting cycle.
|
||||
|
||||
## Complete Example
|
||||
|
||||
Here is a complete example that demonstrates rendering a React application with
|
||||
a full-featured router and code-splitting. Note that using React or any other
|
||||
3rd party tools featured here is not required to build a Kibana Application.
|
||||
|
||||
```tsx
|
||||
// my_plugin/public/application.tsx
|
||||
|
||||
import React from 'react';
|
||||
import ReactDOM from 'react-dom';
|
||||
import { BrowserRouter, Route } from 'react-router-dom';
|
||||
import loadable from '@loadable/component';
|
||||
|
||||
// Apps can choose to load components statically in the same bundle or
|
||||
// dynamically when routes are rendered.
|
||||
import { HomePage } from './pages';
|
||||
const LazyDashboard = loadable(() => import('./pages/dashboard'));
|
||||
|
||||
const MyApp = ({ basename }) => (
|
||||
// Setup router's basename from the basename provided from MountContext
|
||||
<BrowserRouter basename={basename}>
|
||||
|
||||
{/* mykibana.com/app/my-app/ */}
|
||||
<Route path="/" exact component={HomePage} />
|
||||
|
||||
{/* mykibana.com/app/my-app/dashboard/42 */}
|
||||
<Route
|
||||
path="/dashboard/:id"
|
||||
render={({ match }) => <LazyDashboard dashboardId={match.params.id} />}
|
||||
/>
|
||||
|
||||
</BrowserRouter>
|
||||
);
|
||||
|
||||
export function renderApp(context, params) {
|
||||
ReactDOM.render(
|
||||
// `params.appBasePath` would be `/app/my-app` in this example.
|
||||
// This exact string is not guaranteed to be stable, always reference the
|
||||
// provided value at `params.appBasePath`.
|
||||
<MyApp basename={params.appBasePath} />,
|
||||
params.element
|
||||
);
|
||||
|
||||
return () => ReactDOM.unmountComponentAtNode(params.element);
|
||||
}
|
||||
```
|
||||
|
||||
```tsx
|
||||
// my_plugin/public/plugin.tsx
|
||||
|
||||
export class MyPlugin {
|
||||
setup({ application }) {
|
||||
application.register({
|
||||
id: 'my-app',
|
||||
async mount(context, params) {
|
||||
const { renderApp } = await import('./application');
|
||||
return renderApp(context, params);
|
||||
}
|
||||
});
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Core Entry Point
|
||||
|
||||
Once we can support application routing for new and legacy applications, we
|
||||
should create a new entry point bundle that only includes Core and any necessary
|
||||
uiExports (hacks for example). This should be served by the backend whenever a
|
||||
`/app/<app-id>` request is received for an app that the legacy platform does not
|
||||
have a bundle for.
|
||||
|
||||
# Drawbacks
|
||||
|
||||
- Implementing this will be significant work and requires migrating legacy code
|
||||
from `ui/chrome`
|
||||
- Making Kibana a single page application may lead to problems if applications
|
||||
do not clean themselves up properly when unmounted
|
||||
- Application `mount` functions will have access to *setup* via the closure. We
|
||||
may want to lock down these APIs from being used after *setup* to encourage
|
||||
usage of the `MountContext` instead.
|
||||
- In order to support new applications being registered in the legacy platform,
|
||||
we will need to create a new `uiExport` that is imported during the new
|
||||
platform's *setup* lifecycle event. This is necessary because app registration
|
||||
must happen prior to starting the legacy platform. This is only an issue for
|
||||
plugins that are migrating using a shim in the legacy platform.
|
||||
|
||||
# Alternatives
|
||||
|
||||
- We could provide a full featured react-router instance that plugins could
|
||||
plug directly into. The downside is this locks us more into React and makes
|
||||
code splitting a bit more challenging.
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
Adoption of the application service will have to happen as part of the migration
|
||||
of each plugin. We should be able to support legacy plugins registering new
|
||||
platform-style applications before they actually move all of their code
|
||||
over to the new platform.
|
||||
|
||||
# How we teach this
|
||||
|
||||
Introducing this service makes applications a first-class feature of the Kibana
|
||||
platform. Right now, plugins manage their own routes and can export "navlinks"
|
||||
that get rendered in the navigation UI; however, there is not a self-contained
|
||||
concept like an application to encapsulate these related responsibilities. It
|
||||
will need to be emphasized that plugins can register zero, one, or multiple
|
||||
applications.
|
||||
|
||||
Most new and existing Kibana developers will need to understand how the
|
||||
ApplicationService works and how multiple apps run in a single page application.
|
||||
This should be accomplished through thorough documentation in the
|
||||
ApplicationService's API implementation as well as in general plugin development
|
||||
tutorials and documentation.
|
||||
|
||||
# Unresolved questions
|
||||
|
||||
- Are there any major caveats to having multiple routers on the page? If so, how
|
||||
can these be prevented or worked around?
|
||||
- How should global URL state be shared across applications, such as timepicker
|
||||
state?
|
|
@ -1,185 +0,0 @@
|
|||
- Start Date: 2019-06-29
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: https://github.com/elastic/kibana/issues/33779
|
||||
|
||||
# Summary
|
||||
|
||||
The Http Service in the New platform should provide the ability to execute some logic in response to an incoming request and send the result of this operation back.
|
||||
|
||||
# Basic example
|
||||
Declaring a route handler for the `/url` endpoint:
|
||||
```typescript
|
||||
router.get(
|
||||
{ path: '/url', ...otherRouteParameters },
|
||||
(context: Context, request: KibanaRequest, t: KibanaResponseToolkit) => {
|
||||
// logic to handle request ...
|
||||
return t.ok(result);
|
||||
}
|
||||
);
|
||||
|
||||
```
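As a sketch of what `otherRouteParameters` could include — assuming a `validate` option backed by something like `@kbn/config-schema`, which is illustrative here rather than part of this proposal — a handler that needs access to `query` would declare it up front:

```typescript
import { schema } from '@kbn/config-schema';

router.get(
  {
    path: '/url',
    // Declaring the schema is what makes `request.query` available to the handler.
    validate: {
      query: schema.object({ page: schema.number({ defaultValue: 1 }) }),
    },
  },
  (context: Context, request: KibanaRequest, t: KibanaResponseToolkit) => {
    return t.ok({ page: request.query?.page });
  }
);
```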
|
||||
|
||||
# Motivation
|
||||
The new platform is built with a library-agnostic philosophy, and we cannot transfer the current Hapi-based solution for the network layer. To avoid vendor lock-in in the future, we have to define route handler logic and request/response object formats that can be implemented on top of any low-level library such as Express, Hapi, etc. This means that we are going to operate our own abstractions for Http domain entities such as Router, Route, Route Handler, Request, and Response.
|
||||
|
||||
# Detailed design
|
||||
The new platform doesn't support the Legacy platform `Route Handler` format, nor does it expose implementation details such as [Hapi.ResponseToolkit](https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/types/hapi/v17/index.d.ts#L984).
|
||||
Rather, a `Route Handler` in the New platform has the following signature:
|
||||
```typescript
|
||||
type RequestHandler = (
|
||||
context: Context,
|
||||
request: KibanaRequest,
|
||||
t: KibanaResponseToolkit
|
||||
) => KibanaResponse | Promise<KibanaResponse>;
|
||||
```
|
||||
and accepts the following Kibana-specific parameters as arguments:
|
||||
- context: [Context](https://github.com/elastic/kibana/blob/master/rfcs/text/0003_handler_interface.md#handler-context). A handler context contains core service and plugin functionality already scoped to the incoming request.
|
||||
- request: [KibanaRequest](https://github.com/elastic/kibana/blob/master/src/core/server/http/router/request.ts). An immutable representation of the incoming request details, such as body, parameters, query, url, and route information. Note: you **must** specify a route schema during route declaration to have access to `body, parameters, query` in the request object (see the schema sketch after the basic example above). You cannot extend KibanaRequest with arbitrary data nor remove any properties from it.
|
||||
```typescript
|
||||
interface KibanaRequest {
|
||||
url: url.Url;
|
||||
headers: Record<string, string | string [] | undefined>;
|
||||
params?: Record<string, any>;
|
||||
body?: Record<string, any>;
|
||||
query?: Record<string, any>;
|
||||
route: {
|
||||
path: string;
|
||||
method: 'get' | 'post' | ...
|
||||
options: {
|
||||
authRequired: boolean;
|
||||
tags: string [];
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
- t: [KibanaResponseToolkit](https://github.com/elastic/kibana/blob/master/src/core/server/http/router/response.ts#L27)
|
||||
Provides a set of pre-configured methods to respond to an incoming request. It is expected that the handler **always** returns the result of one of the `KibanaResponseToolkit` methods as its output:
|
||||
```typescript
|
||||
interface KibanaResponseToolkit {
|
||||
[method:string]: (...params: any) => KibanaResponse
|
||||
}
|
||||
router.get(...,
|
||||
(context: Context, request: KibanaRequest, t: KibanaResponseToolkit): KibanaResponse => {
|
||||
return t.ok();
|
||||
// or
|
||||
return t.redirected('/url');
|
||||
// or
|
||||
return t.badRequest(error);
|
||||
}
|
||||
);
|
||||
```
|
||||
*KibanaResponseToolkit* methods allow an end user to adjust the following response parameters:
|
||||
- Body. Supported values: `undefined | string | JSONValue | Buffer | Stream`.
|
||||
- Status code.
|
||||
- Headers. Supports adjusting [known values](https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/types/node/v10/http.d.ts#L8) and attaching [custom values as well](https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/types/node/v10/http.d.ts#L67)
|
||||
|
||||
Other response parameters, such as `etag`, `MIME-type`, and `bytes`, that are used in the Legacy platform can be adjusted via headers.
|
||||
|
||||
The route handler is not expected to throw or to return anything other than a `KibanaResponse`. If it does, the Http service will respond with a `Server error` to prevent the exposure of internal logic details.
|
||||
|
||||
#### KibanaResponseToolkit methods
|
||||
Basic primitives:
|
||||
```typescript
|
||||
type HttpResponsePayload = undefined | string | JSONValue | Buffer | Stream;
|
||||
interface HttpResponseOptions {
|
||||
headers?: {
|
||||
// list of known headers
|
||||
...
|
||||
// for custom headers:
|
||||
[header: string]: string | string[];
|
||||
}
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
##### Success
|
||||
The server indicates that the request was accepted:
|
||||
```typescript
|
||||
type SuccessResponse = <T extends HttpResponsePayload>(
|
||||
payload: T,
|
||||
options?: HttpResponseOptions
|
||||
) => KibanaResponse<T>;
|
||||
|
||||
const kibanaResponseToolkit = {
|
||||
ok: <T extends HttpResponsePayload>(payload: T, options?: HttpResponseOptions) =>
|
||||
new KibanaResponse(200, payload, options),
|
||||
accepted: <T extends HttpResponsePayload>(payload: T, options?: HttpResponseOptions) =>
|
||||
new KibanaResponse(202, payload, options),
|
||||
noContent: (options?: HttpResponseOptions) => new KibanaResponse(204, undefined, options)
|
||||
};
|
||||
```
|
||||
|
||||
##### Redirection
|
||||
The server wants a user to perform additional actions:
|
||||
```typescript
|
||||
const kibanaResponseToolkit = {
|
||||
redirected: (url: string, options?: HttpResponseOptions) => new KibanaResponse(302, url, options),
|
||||
notModified: (options?: HttpResponseOptions) => new KibanaResponse(304, undefined, options),
|
||||
};
|
||||
```
|
||||
|
||||
##### Error
|
||||
The server signals that the request cannot be handled and explains the details of the error:
|
||||
```typescript
|
||||
// Supports attaching additional data to send to the client
|
||||
interface ResponseError extends Error {
|
||||
meta?: {
|
||||
data?: JSONValue;
|
||||
errorCode?: string; // error code to simplify search, translations in i18n, etc.
|
||||
docLink?: string; // link to the docs
|
||||
}
|
||||
}
|
||||
|
||||
export const createResponseError = (error: Error | string, meta?: ResponseError['meta']) =>
|
||||
new ResponseError(error, meta)
|
||||
|
||||
const kibanaResponseToolkit = {
|
||||
// Client errors
|
||||
badRequest: <T extends ResponseError>(err: T, options?: HttpResponseOptions) =>
|
||||
new KibanaResponse(400, err, options),
|
||||
unauthorized: <T extends ResponseError>(err: T, options?: HttpResponseOptions) =>
|
||||
new KibanaResponse(401, err, options),
|
||||
|
||||
forbidden: <T extends ResponseError>(err: T, options?: HttpResponseOptions) =>
|
||||
new KibanaResponse(403, err, options),
|
||||
notFound: <T extends ResponseError>(err: T, options?: HttpResponseOptions) =>
|
||||
new KibanaResponse(404, err, options),
|
||||
conflict: <T extends ResponseError>(err: T, options?: HttpResponseOptions) =>
|
||||
new KibanaResponse(409, err, options),
|
||||
|
||||
// Server errors
|
||||
internal: <T extends ResponseError>(err: T, options?: HttpResponseOptions) =>
|
||||
new KibanaResponse(500, err, options),
|
||||
};
|
||||
```
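For illustration, a handler could surface a not-found case with these primitives (a sketch; `findObject` is a hypothetical helper, and the `validate`/`schema` usage follows the earlier sketch):

```typescript
router.get(
  {
    path: '/my-plugin/objects/{id}',
    validate: { params: schema.object({ id: schema.string() }) },
  },
  async (context: Context, request: KibanaRequest, t: KibanaResponseToolkit) => {
    // `findObject` is a hypothetical lookup helper standing in for real logic.
    const object = await findObject(request.params?.id);
    if (!object) {
      return t.notFound(createResponseError('Object not found', { errorCode: 'object_not_found' }));
    }
    return t.ok(object);
  }
);
```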
|
||||
|
||||
##### Custom
|
||||
If a custom response is required:
|
||||
```typescript
|
||||
interface CustomOptions extends HttpResponseOptions {
|
||||
statusCode: number;
|
||||
}
|
||||
export const kibanaResponseToolkit = {
|
||||
custom: <T extends HttpResponsePayload>(payload: T, {statusCode, ...options}: CustomOptions) =>
|
||||
new KibanaResponse(statusCode, payload, options),
|
||||
};
|
||||
```
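A usage sketch (the stream payload, status code, and header values are illustrative):

```typescript
return t.custom(fileStream, {
  statusCode: 206,
  headers: { 'content-range': 'bytes 0-1023/4096' },
});
```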
|
||||
# Drawbacks
|
||||
- `Handler` is not compatible with the Legacy platform implementation, where anything can be returned or thrown from a handler function and the server sends it as a valid result. Transition to the new format may require additional work in plugins.
|
||||
- `Handler` doesn't cover **all** functionality of the Legacy server at the moment. For example, we cannot render a view in the New platform yet, so in this case we have to proxy the request to a Legacy platform endpoint to perform the rendering. All such cases should be considered on an individual basis.
|
||||
- `KibanaResponseToolkit` may not cover all use cases and may require extension for specific use cases.
|
||||
- `KibanaResponseToolkit` operates on low-level Http primitives, such as headers, and it is not always convenient to work with them directly.
|
||||
- `KibanaResponse` cannot be extended with arbitrary data.
|
||||
|
||||
# Alternatives
|
||||
|
||||
- `Route Handler` may adopt the well-known Hapi-compatible format.
|
||||
- `KibanaResponseToolkit` could expose only one method that allows specifying any type of response body, headers, and status without creating additional abstractions and restrictions.
|
||||
- `KibanaResponseToolkit` may provide helpers for more granular use-cases, say `
|
||||
binary(data: Buffer, type: MimeType, size: number) => KibanaResponse`
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
Breaking changes are expected during migration to the New platform. To simplify adoption, we could provide an extended set of type definitions for primitives with a high variability of possible values (such as the content-type header, or headers in general).
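For instance, a sketch of what such a type definition could look like (purely illustrative, not part of this proposal):

```typescript
// Suggests common values in editors while still accepting any string.
type KnownContentType =
  | 'application/json'
  | 'text/plain'
  | 'application/octet-stream'
  | (string & {});
```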
|
||||
|
||||
# How we teach this
|
||||
|
||||
The `Route Handler`, `Request`, and `Response` terms are familiar to all Kibana developers. Even if their interfaces are different from the existing ones, it shouldn't be a problem to adapt the code to the new format. Adding a section to the Migration guide should be sufficient.
|
||||
|
||||
# Unresolved questions
|
||||
|
||||
Does the proposed functionality cover all the use cases of the `Route Handler` and of responding to a request?
|
|
@ -1,324 +0,0 @@
|
|||
- Start Date: 2019-08-20
|
||||
- RFC PR: TBD
|
||||
- Kibana Issue: [#43499](https://github.com/elastic/kibana/issues/43499)
|
||||
|
||||
# Summary
|
||||
Management is one of the four primary "domains" covered by @elastic/kibana-app-arch (along with Data, Embeddables, and Visualizations). There are two main purposes for this service:
|
||||
|
||||
1. Own the management "framework" -- the UI that displays the management sidebar nav, the landing page, and handles rendering each of the sections
|
||||
2. Expose a registry for other plugins to add their own sections to the UI and add nested links to them in the sidebar.
|
||||
|
||||
The purpose of this RFC is to consider item 2 above -- the service for registering sections to the nav & loading them up.
|
||||
|
||||
# Motivation
|
||||
|
||||
## Why now?
|
||||
The main driver for considering this now is that the Management API moving to the new platform is going to block other teams from completing migration, so we need to have an answer to what the new platform version of the API looks like as soon as possible in `7.x`.
|
||||
|
||||
## Why not just keep the current API and redesign later?
|
||||
The answer to that has to do with the items that are currently used in the management implementation which must be removed in order to migrate to NP: the framework currently registers a `uiExport`, and relies on `IndexedArray`, `uiRegistry`, and `ui/routes`.
|
||||
|
||||
This means that we will basically need to rebuild the service anyway in order to migrate to the new platform. So if we are going to invest that time, we might as well invest it in building the API the way we want it to be longer term, rather than creating more work for ourselves later.
|
||||
|
||||
## Technical goals
|
||||
- Remove another usage of `IndexedArray` & `uiRegistry` (required for migration)
|
||||
- Remove dependency on `ui/routes` (required for migration)
|
||||
- Remove management section `uiExport` (required for migration)
|
||||
- Simple API that is designed in keeping with new platform principles
|
||||
- This includes being rendering-framework-agnostic... You should be able to build your management section UI however you'd like
|
||||
- Clear separation of app/UI code and service code, even if both live within the same plugin
|
||||
- Flexibility to potentially support alternate layouts in the future (see mockups in [reference section](#reference) below)
|
||||
|
||||
# Basic example
|
||||
This API is influenced heavily by the [application service mounting RFC](https://github.com/elastic/kibana/blob/master/rfcs/text/0004_application_service_mounting.md). The intent is to make the experience consistent with that service; the Management section is basically one big app with a bunch of registered "subapps".
|
||||
|
||||
```ts
|
||||
// my_plugin/public/plugin.ts
|
||||
|
||||
export class MyPlugin {
|
||||
setup(core, { management }) {
|
||||
// Registering a new app to a new section
|
||||
const mySection = management.sections.register({
|
||||
id: 'my-section',
|
||||
title: 'My Main Section', // display name
|
||||
order: 10,
|
||||
euiIconType: 'iconName',
|
||||
});
|
||||
mySection.registerApp({
|
||||
id: 'my-management-app',
|
||||
title: 'My Management App', // display name
|
||||
order: 20,
|
||||
async mount(context, params) {
|
||||
const { renderApp } = await import('./my-section');
|
||||
return renderApp(context, params);
|
||||
}
|
||||
});
|
||||
|
||||
// Registering a new app to an existing section
|
||||
const kibanaSection = management.sections.get('kibana');
|
||||
kibanaSection.registerApp({ id: 'my-kibana-management-app', ... });
|
||||
}
|
||||
|
||||
start(core, { management }) {
|
||||
// access all registered sections, filtered based on capabilities
|
||||
const sections = management.sections.getAvailable();
|
||||
sections.forEach(section => console.log(`${section.id} - ${section.title}`));
|
||||
// automatically navigate to any app by id
|
||||
management.sections.navigateToApp('my-kibana-management-app');
|
||||
}
|
||||
}
|
||||
|
||||
// my_plugin/public/my-section.tsx
|
||||
|
||||
export function renderApp(context, { sectionBasePath, element }) {
|
||||
ReactDOM.render(
|
||||
// `sectionBasePath` would be `/app/management/my-section/my-management-app`
|
||||
<MyApp basename={sectionBasePath} />,
|
||||
element
|
||||
);
|
||||
|
||||
// return value must be a function that unmounts (just like Core Application Service)
|
||||
return () => ReactDOM.unmountComponentAtNode(element);
|
||||
}
|
||||
```
|
||||
|
||||
We can also create a utility in `kibana_react` to make it easy for folks to `mount` a React app:
|
||||
```ts
|
||||
// src/plugins/kibana_react/public/mount_with_react.tsx
|
||||
import { KibanaContextProvider } from './context';
|
||||
|
||||
export const mountWithReact = (
|
||||
Component: React.ComponentType<{ basename: string }>,
|
||||
context: AppMountContext,
|
||||
params: ManagementSectionMountParams,
|
||||
) => {
|
||||
ReactDOM.render(
|
||||
(
|
||||
<KibanaContextProvider services={{ ...context }}>
|
||||
<Component basename={params.sectionBasePath} />
|
||||
</KibanaContextProvider>
|
||||
),
|
||||
params.element
|
||||
);
|
||||
|
||||
return () => ReactDOM.unmountComponentAtNode(params.element);
|
||||
}
|
||||
|
||||
// my_plugin/public/plugin.ts
|
||||
import { mountWithReact } from 'src/plugins/kibana_react/public';
|
||||
|
||||
export class MyPlugin {
|
||||
setup(core, { management }) {
|
||||
const kibanaSection = management.sections.get('kibana');
|
||||
kibanaSection.registerApp({
|
||||
id: 'my-other-kibana-management-app',
|
||||
...,
|
||||
async mount(context, params) {
|
||||
const { MySection } = await import('./components/my-section');
|
||||
const unmountCallback = mountWithReact(MySection, context, params);
|
||||
return () => unmountCallback();
|
||||
}
|
||||
});
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
# Detailed design
|
||||
|
||||
```ts
|
||||
interface ManagementSetup {
|
||||
sections: SectionsServiceSetup;
|
||||
}
|
||||
|
||||
interface ManagementStart {
|
||||
sections: SectionsServiceStart;
|
||||
}
|
||||
|
||||
interface SectionsServiceSetup {
|
||||
get: (sectionId: string) => Section;
|
||||
getAvailable: () => Section[]; // filtered based on capabilities
|
||||
register: RegisterSection;
|
||||
}
|
||||
|
||||
interface SectionsServiceStart {
|
||||
getAvailable: () => Array<Omit<Section, 'registerApp'>>; // filtered based on capabilities
|
||||
// uses `core.application.navigateToApp` under the hood, automatically prepending the `path` for the link
|
||||
navigateToApp: (appId: string, options?: { path?: string; state?: any }) => void;
|
||||
}
|
||||
|
||||
type RegisterSection = (section: {
|
||||
id: string;
|
||||
title: string;
|
||||
order?: number;
|
||||
euiIconType?: string; // takes precedence over `icon` property.
|
||||
icon?: string; // URL to image file; fallback if no `euiIconType`
|
||||
}) => Section;
|
||||
|
||||
type RegisterManagementApp = (app: {
|
||||
id: string;
|
||||
title: string;
|
||||
order?: number;
|
||||
mount: ManagementSectionMount;
|
||||
}) => ManagementApp;
|
||||
|
||||
type Unmount = () => Promise<void> | void;
|
||||
|
||||
interface ManagementSectionMountParams {
|
||||
sectionBasePath: string; // base path for setting up your router
|
||||
element: HTMLElement; // element the section should render into
|
||||
}
|
||||
|
||||
type ManagementSectionMount = (
|
||||
context: AppMountContext, // provided by core.ApplicationService
|
||||
params: ManagementSectionMountParams,
|
||||
) => Unmount | Promise<Unmount>;
|
||||
|
||||
interface ManagementApp {
|
||||
id: string;
|
||||
title: string;
|
||||
basePath: string;
|
||||
sectionId: string;
|
||||
order?: number;
|
||||
}
|
||||
|
||||
interface Section {
|
||||
id: string;
|
||||
title: string;
|
||||
apps: ManagementApp[];
|
||||
registerApp: RegisterManagementApp;
|
||||
order?: number;
|
||||
euiIconType?: string;
|
||||
icon?: string;
|
||||
}
|
||||
```
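For example, deep-linking into a registered app would look like this (a sketch; the app id and path are illustrative):

```ts
// Resolves the app's base path under /app/management and appends the given path.
management.sections.navigateToApp('my-kibana-management-app', { path: '/objects/42' });
```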
|
||||
|
||||
# Legacy service (what this would be replacing)
|
||||
|
||||
Example of how this looks today:
|
||||
```js
|
||||
// myplugin/index
|
||||
new Kibana.Plugin({
|
||||
uiExports: {
|
||||
managementSections: ['myplugin/management'],
|
||||
}
|
||||
});
|
||||
|
||||
// myplugin/public/management
|
||||
import { management } from 'ui/management';
|
||||
|
||||
// completely new section
|
||||
const newSection = management.register('mypluginsection', {
|
||||
name: 'mypluginsection',
|
||||
order: 10,
|
||||
display: 'My Plugin',
|
||||
icon: 'iconName',
|
||||
});
|
||||
newSection.register('mypluginlink', {
|
||||
name: 'mypluginlink',
|
||||
order: 10,
|
||||
display: 'My sublink',
|
||||
url: `#/management/myplugin`,
|
||||
});
|
||||
|
||||
// new link in existing section
|
||||
const kibanaSection = management.getSection('kibana');
|
||||
kibanaSection.register('mypluginlink', {
|
||||
name: 'mypluginlink',
|
||||
order: 10,
|
||||
display: 'My sublink',
|
||||
url: `#/management/myplugin`,
|
||||
});
|
||||
|
||||
// use ui/routes to render component
|
||||
import routes from 'ui/routes';
|
||||
|
||||
const renderReact = (elem) => {
|
||||
render(<MyApp />, elem);
|
||||
};
|
||||
|
||||
routes.when('management/myplugin', {
|
||||
controller($scope, $http, kbnUrl) {
|
||||
$scope.$on('$destroy', () => {
|
||||
const elem = document.getElementById('usersReactRoot');
|
||||
if (elem) unmountComponentAtNode(elem);
|
||||
});
|
||||
$scope.$$postDigest(() => {
|
||||
const elem = document.getElementById('usersReactRoot');
|
||||
const changeUrl = (url) => {
|
||||
kbnUrl.change(url);
|
||||
$scope.$apply();
|
||||
};
|
||||
renderReact(elem, $http, changeUrl);
|
||||
});
|
||||
},
|
||||
});
|
||||
```
|
||||
Current public contracts owned by the legacy service:
|
||||
```js
|
||||
// ui/management/index
|
||||
interface API {
|
||||
SidebarNav: React.FC<any>;
|
||||
management: new ManagementSection();
|
||||
MANAGEMENT_BREADCRUMB: {
|
||||
text: string;
|
||||
href: string;
|
||||
};
|
||||
}
|
||||
|
||||
// ui/management/section
|
||||
class ManagementSection {
|
||||
get visibleItems,
|
||||
addListener: (fn: function) => void,
|
||||
register: (id: string, options: Options) => ManagementSection,
|
||||
deregister: (id: string) => void,
|
||||
hasItem: (id: string) => boolean,
|
||||
getSection: (id: string) => ManagementSection,
|
||||
hide: () => void,
|
||||
show: () => void,
|
||||
disable: () => void,
|
||||
enable: () => void,
|
||||
}
|
||||
|
||||
interface Options {
|
||||
order: number | null;
|
||||
display: string | null; // defaults to id
|
||||
url: string | null; // defaults to ''
|
||||
visible: boolean | null; // defaults to true
|
||||
disabled: boolean | null; // defaults to false
|
||||
tooltip: string | null; // defaults to ''
|
||||
icon: string | null; // defaults to ''
|
||||
}
|
||||
```
|
||||
|
||||
# Notes
|
||||
|
||||
- The hide/show/disable/enable options were dropped with the assumption that we will be working with uiCapabilities to determine this instead... so people shouldn't need to manage it manually as they can look up a pre-filtered list of sections.
|
||||
- This was updated to add flexibility for custom (non-EUI) icons as outlined in [#32661](https://github.com/elastic/kibana/issues/32661). Much like the Core Application Service, you either choose an EUI icon, or provide a URL to an icon.
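For example (a sketch based on the `register` call from the basic example; the asset URL is illustrative):

```ts
management.sections.register({
  id: 'my-section',
  title: 'My Section',
  order: 10,
  // No `euiIconType` is given, so the custom image URL is used as the icon.
  icon: '/plugins/my_plugin/assets/my_section.svg',
});
```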
|
||||
|
||||
# Drawbacks
|
||||
|
||||
- This removes the ability to infinitely nest sections within each other by making a distinction between a section header and a nav link.
|
||||
- So far we didn't seem to be using this feature anyway, but would like feedback on any use cases for it.
|
||||
|
||||
# Reference
|
||||
|
||||
- Issues about Global vs Spaces-based management sections: https://github.com/elastic/kibana/issues/37285 https://github.com/elastic/kibana/issues/37283
|
||||
- Mockups related to above issues: https://marvelapp.com/52b8616/screen/57582729
|
||||
|
||||
# Alternatives
|
||||
|
||||
An alternative design would be making everything React-specific and simply requiring consumers of the service to provide a React component to render when a route is hit, or giving them a react-router instance to work with.
|
||||
|
||||
This would require slightly less work for folks using the service as it would eliminate the need for a `mount` function. However, it comes at the cost of forcing folks into a specific rendering framework, which ultimately provides less flexibility.
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
Our strategy for implementing this should be to build the service entirely in the new platform in a `management` plugin, so that plugins can gradually cut over to the new service as they prepare to migrate to the new platform.
|
||||
|
||||
One thing we would need to figure out is how to bridge the gap between the new plugin and the legacy `ui/management` service. Ideally we would find a way to integrate the two, such that the management nav could display items registered via both services. This is a strategy we'd need to work out in more detail as we got closer to implementation.
|
||||
|
||||
# How we teach this
|
||||
|
||||
The hope is that this will already feel familiar to Kibana application developers, as most will have already been exposed to the Core Application Service and how it handles mounting.
|
||||
|
||||
A guide could also be added to the "Management" section of the Kibana docs (the legacy service is not even formally documented).
|
|
@ -1,373 +0,0 @@
|
|||
- Start Date: 2019-09-11
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
## Table of contents
|
||||
- [Summary](#summary)
|
||||
- [Motivation](#motivation)
|
||||
- [Detailed design](#detailed-design)
|
||||
  - [1. Synchronous lifecycle methods](#1-synchronous-lifecycle-methods)
|
||||
  - [2. Synchronous Context Provider functions](#2-synchronous-context-provider-functions)
|
||||
  - [3. Core should not expose API's as observables](#3-core-should-not-expose-apis-as-observables)
|
||||
  - [4. Complete example code](#4-complete-example-code)
|
||||
  - [5. Core should expose a status signal for Core services & plugins](#5-core-should-expose-a-status-signal-for-core-services--plugins)
|
||||
- [Drawbacks](#drawbacks)
|
||||
- [Alternatives](#alternatives)
|
||||
  - [1. Introduce a lifecycle/context provider timeout](#1-introduce-a-lifecyclecontext-provider-timeout)
|
||||
  - [2. Treat anything that blocks Kibana from starting up as a bug](#2-treat-anything-that-blocks-kibana-from-starting-up-as-a-bug)
|
||||
- [Adoption strategy](#adoption-strategy)
|
||||
- [How we teach this](#how-we-teach-this)
|
||||
- [Unresolved questions](#unresolved-questions)
|
||||
- [Footnotes](#footnotes)
|
||||
|
||||
# Summary
|
||||
|
||||
Prevent plugin lifecycle methods from blocking Kibana startup by making the
|
||||
following changes:
|
||||
1. Synchronous lifecycle methods
|
||||
2. Synchronous context provider functions
|
||||
3. Core should not expose API's as observables
|
||||
|
||||
# Motivation
|
||||
Plugin lifecycle methods and context provider functions are async
|
||||
(promise-returning) functions. Core runs these functions in series and waits
|
||||
for each plugin's lifecycle/context provider function to resolve before
|
||||
calling the next. This allows plugins to depend on the API's returned from
|
||||
other plugins.
|
||||
|
||||
With the current design, a single lifecycle method that blocks will block all
|
||||
of Kibana from starting up. Similarly, a blocking context provider will block
|
||||
all the handlers that depend on that context. Plugins (including legacy
|
||||
plugins) rely heavily on this blocking behaviour to ensure that all conditions
|
||||
required for their plugin's operation are met before their plugin is started
|
||||
and exposes its API's. This means a single plugin with a network error that
|
||||
isn't retried or a dependency on an external host that is down, could block
|
||||
all of Kibana from starting up.
|
||||
|
||||
We should make it impossible for a single plugin lifecycle function to stall
|
||||
all of Kibana.
|
||||
|
||||
# Detailed design
|
||||
|
||||
### 1. Synchronous lifecycle methods
|
||||
Lifecycle methods are synchronous functions; they can perform async operations,
|
||||
but Core doesn't wait for these to complete. This guarantees that no plugin
|
||||
lifecycle function can block other plugins or core from starting up [1].
|
||||
|
||||
Core will still expose special API's that are able to block the setup lifecycle,
|
||||
such as registering Saved Object migrations, but this will be limited to
|
||||
operations where the risk of blocking all of Kibana starting up is limited.
|
||||
|
||||
### 2. Synchronous Context Provider functions
|
||||
Making context provider functions synchronous guarantees that a context
|
||||
handler will never be blocked by registered context providers. They can expose
|
||||
async API's which could potentially have blocking behaviour.
|
||||
|
||||
```ts
|
||||
export type IContextProvider<
|
||||
THandler extends HandlerFunction<any>,
|
||||
TContextName extends keyof HandlerContextType<THandler>
|
||||
> = (
|
||||
context: Partial<HandlerContextType<THandler>>,
|
||||
...rest: HandlerParameters<THandler>
|
||||
) =>
|
||||
| HandlerContextType<THandler>[TContextName];
|
||||
```
|
||||
|
||||
### 3. Core should not expose API's as observables
|
||||
All Core API's should be reactive: when internal state changes, their behaviour
|
||||
should change accordingly. But, exposing these internal state changes as part
|
||||
of the API contract leaks internal implementation details consumers can't do
|
||||
anything useful with and don't care about.
|
||||
|
||||
For example: Core currently exposes `core.elasticsearch.adminClient$`, an
|
||||
Observable which emits a pre-configured elasticsearch client every time there's
|
||||
a configuration change. This includes changes to the logging configuration and
|
||||
might in the future include updating the authentication headers sent to
|
||||
elasticsearch https://github.com/elastic/kibana/issues/19829. As a plugin
|
||||
author who wants to make search requests against elasticsearch I shouldn't
|
||||
have to care about, react to, or keep track of, how many times the underlying
|
||||
configuration has changed. I want to use the `callAsInternalUser` method and I
|
||||
expect Core to use the most up to date configuration to send this request.
|
||||
|
||||
> Note: It would not be desirable for Core to dynamically load all
|
||||
> configuration changes. Changing the Elasticsearch `hosts` could mean Kibana
|
||||
> is pointing to a completely new Elasticsearch cluster. Since this is a risky
|
||||
> change to make and would likely require core and almost all plugins to
|
||||
> completely re-initialize, it's safer to require a complete Kibana restart.
|
||||
|
||||
This does not mean we should remove all observables from Core's API's. When an
|
||||
API consumer is interested in the *state changes itself* it absolutely makes
|
||||
sense to expose this as an Observable. A good example of this is exposing
|
||||
plugin config, since this is state that changes over time and to which a
|
||||
plugin should react directly.
|
||||
|
||||
This is important in the context of synchronous lifecycle methods and context
|
||||
handlers, since exposing convenient API's becomes very ugly:
|
||||
|
||||
*(3.1): exposing Observable-based API's through the route handler context:*
|
||||
```ts
|
||||
// Before: Using an async context provider
|
||||
coreSetup.http.registerRouteHandlerContext(coreId, 'core', async (context, req) => {
|
||||
const adminClient = await coreSetup.elasticsearch.adminClient$.pipe(take(1)).toPromise();
|
||||
const dataClient = await coreSetup.elasticsearch.dataClient$.pipe(take(1)).toPromise();
|
||||
return {
|
||||
elasticsearch: {
|
||||
adminClient: adminClient.asScoped(req),
|
||||
dataClient: dataClient.asScoped(req),
|
||||
},
|
||||
};
|
||||
});
|
||||
|
||||
// After: Using a synchronous context provider
|
||||
coreSetup.http.registerRouteHandlerContext(coreId, 'core', (context, req) => {
|
||||
return {
|
||||
elasticsearch: {
|
||||
// (3.1.1) We can expose a convenient API by doing a lot of work
|
||||
adminClient: {
|
||||
callAsInternalUser: async (...args) => {
|
||||
const adminClient = await coreSetup.elasticsearch.adminClient$.pipe(take(1)).toPromise();
|
||||
return adminClient.asScoped(req).callAsInternalUser(...args);
|
||||
},
|
||||
callAsCurrentUser: async (...args) => {
|
||||
const adminClient = await coreSetup.elasticsearch.adminClient$.pipe(take(1)).toPromise();
|
||||
return adminClient.asScoped(req).callAsCurrentUser(...args);
|
||||
}
|
||||
},
|
||||
// (3.1.2) Or a lazy approach which perpetuates the problem to consumers:
|
||||
dataClient: async () => {
|
||||
const dataClient = await coreSetup.elasticsearch.dataClient$.pipe(take(1)).toPromise();
|
||||
return dataClient.asScoped(req);
|
||||
},
|
||||
},
|
||||
};
|
||||
});
|
||||
```
|
||||
|
||||
### 4. Complete example code
|
||||
*(4.1) Doing async operations in a plugin's setup lifecycle*
|
||||
```ts
|
||||
export class Plugin {
|
||||
public setup(core: CoreSetup) {
|
||||
// Async setup is possible and any operations involving async API's
|
||||
// will still block until these API's are ready, (savedObjects find only
|
||||
// resolves once the elasticsearch client has established a connection to
|
||||
// the cluster). The difference is that these details are now internal to
|
||||
// the API.
|
||||
(async () => {
|
||||
const docs = await core.savedObjects.client.find({...});
|
||||
...
|
||||
await core.savedObjects.client.update(...);
|
||||
})();
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
*(4.2) Exposing an API from a plugin's setup lifecycle*
|
||||
```ts
|
||||
export class Plugin {
|
||||
constructor(private readonly initializerContext: PluginInitializerContext) {}
|
||||
private async initSavedConfig(core: CoreSetup) {
|
||||
// Note: pulling a config value here means our code isn't reactive to
|
||||
// changes, but this is equivalent to doing it in an async setup lifecycle.
|
||||
const config = await this.initializerContext.config
|
||||
.create<TypeOf<typeof ConfigSchema>>()
|
||||
.pipe(first())
|
||||
.toPromise();
|
||||
try {
|
||||
const savedConfig = await core.savedObjects.internalRepository.get({...});
|
||||
return Object.assign({}, config, savedConfig);
|
||||
} catch (e) {
|
||||
if (SavedObjectErrorHelpers.isNotFoundError(e)) {
|
||||
return await core.savedObjects.internalRepository.create(config, {...});
|
||||
}
|
||||
}
|
||||
}
|
||||
public setup(core: CoreSetup) {
|
||||
// savedConfigPromise resolves with the same kind of "setup state" that a
|
||||
// plugin would have constructed in an async setup lifecycle.
|
||||
const savedConfigPromise = this.initSavedConfig(core);
|
||||
return {
|
||||
ping: async () => {
|
||||
const savedConfig = await savedConfigPromise;
|
||||
if (savedConfig.allowPing === false) {
|
||||
throw new Error('ping() has been disabled');
|
||||
}
|
||||
// Note: the elasticsearch client no longer exposes an adminClient$
|
||||
// observable, improving the ergonomics of consuming the API.
|
||||
return await core.elasticsearch.adminClient.callAsInternalUser('ping', ...);
|
||||
}
|
||||
};
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
*(4.3) Exposing an observable free Elasticsearch API from the route context*
|
||||
```ts
|
||||
coreSetup.http.registerRouteHandlerContext(coreId, 'core', (context, req) => {
|
||||
return {
|
||||
elasticsearch: {
|
||||
adminClient: coreSetup.elasticsearch.adminClient.asScoped(req),
|
||||
dataClient: coreSetup.elasticsearch.dataClient.asScoped(req),
|
||||
},
|
||||
};
|
||||
});
|
||||
```
|
||||
|
||||
### 5. Core should expose a status signal for Core services & plugins
|
||||
Core should expose a global mechanism for core services and plugins to signal
|
||||
their status. This is equivalent to the legacy status API
|
||||
`kibana.Plugin.status` which allowed plugins to set their status to e.g. 'red'
|
||||
or 'green'. The exact design of this API is outside of the scope of this RFC.
|
||||
|
||||
What is important, is that there is a global mechanism to signal status
|
||||
changes which Core then makes visible to system administrators in the Kibana
|
||||
logs and the `/status` HTTP API. Plugins should be able to inspect and
|
||||
subscribe to status changes from any of their dependencies.
|
||||
|
||||
This will provide an obvious mechanism for plugins to signal that the
|
||||
conditions which are required for this plugin to operate are not currently
|
||||
present and manual intervention might be required. Status changes can happen
|
||||
in both setup and start lifecycles e.g.:
|
||||
- [setup] a required remote host is down
|
||||
- [start] a remote host which was up during setup, started returning
|
||||
connection timeout errors.
|
||||
|
||||
# Drawbacks
|
||||
Not being able to block on a lifecycle method means plugins can no longer be
|
||||
certain that all setup is "complete" before they expose their API's or reach
|
||||
the start lifecycle.
|
||||
|
||||
A plugin might want to poll an external host to ensure that the host is up in
|
||||
its setup lifecycle before making network requests to this host in its start
|
||||
lifecycle.
|
||||
|
||||
Even if Kibana was using a valid, but incorrect configuration for the remote
|
||||
host, with synchronous lifecycles Kibana would still start up. Although the
|
||||
status API and logs would indicate a problem, these might not be monitored
|
||||
leading to the error only being discovered once someone tries to use its
|
||||
functionality. This is an acceptable drawback because it buys us isolation.
|
||||
Some problems might go unnoticed, but no single plugin should affect the
|
||||
availability of all other plugins.
|
||||
|
||||
In effect, the plugin is polling the world to construct a snapshot
|
||||
of state which drives future behaviour. Modeling this with lifecycle functions
|
||||
is insufficient since it assumes that any state constructed in the setup
|
||||
lifecycle is static and won't and can't be changed in the future.
|
||||
|
||||
For example: a plugin's setup lifecycle might poll for the existence of a
|
||||
custom Elasticsearch index and if it doesn't exist, create it. Should there be
|
||||
an Elasticsearch restore which deletes the index, the plugin wouldn't be able
|
||||
to gracefully recover by simply running its setup lifecycle a second time.
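A sketch of the more resilient pattern this implies (illustrative only; the helper and index name are made up, and this is not a proposed Core API): instead of creating the index once in `setup`, the plugin re-checks whenever its functionality is used, so a deleted index is simply recreated on demand.

```ts
const ensureIndexExists = async (adminClient: {
  callAsInternalUser: (endpoint: string, params?: Record<string, any>) => Promise<any>;
}) => {
  // Lazily ensure the index exists each time the plugin's API is exercised.
  const exists = await adminClient.callAsInternalUser('indices.exists', { index: 'my-plugin-index' });
  if (!exists) {
    await adminClient.callAsInternalUser('indices.create', { index: 'my-plugin-index' });
  }
};
```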
|
||||
|
||||
The once-off nature of lifecycle methods is incompatible with the real-world
|
||||
dynamic conditions under which plugins run. Not being able to block a
|
||||
lifecycle method is, therefore, only a drawback when plugins are authored under
|
||||
the false illusion of stability.
|
||||
|
||||
# Alternatives
|
||||
## 1. Introduce a lifecycle/context provider timeout
|
||||
Lifecycle methods and context providers would time out after X seconds, and any
|
||||
API's they expose would not be available if the timeout had been reached.
|
||||
|
||||
Drawbacks:
|
||||
1. A blocking setup lifecycle makes it easy for plugin authors to fall into
|
||||
the trap of assuming that their plugin's behaviour can continue to operate
|
||||
based on the snapshot of conditions present during setup.
|
||||
|
||||
2. For lifecycle methods: there would be no way to recover from a timeout,
|
||||
once a timeout had been reached the API will remain unavailable.
|
||||
|
||||
Context providers have the benefit of being re-created for each handler
|
||||
call, so a single timeout would not permanently disable the API.
|
||||
|
||||
3. Plugins have less control over their behaviour. When an upstream server
|
||||
becomes unavailable, a plugin might prefer to keep retrying the request
|
||||
indefinitely or to time out only after more than X seconds. It also isn't able
|
||||
to expose detailed error information to downstream consumers such as
|
||||
specifying which host or service is unavailable.
|
||||
|
||||
4. (minor) Introduces an additional failure condition that needs to be handled.
|
||||
Consumers should handle the API not being available in setup, as well as,
|
||||
error responses from the API itself. Since remote hosts like Elasticsearch
|
||||
could go down even after a successful setup, this effectively means API
|
||||
consumers have to handle the same error condition in two places.
|
||||
|
||||
## 2. Treat anything that blocks Kibana from starting up as a bug
|
||||
Keep the existing New Platform blocking behaviour, but through strong
|
||||
conventions and developer awareness minimize the risk of plugins blocking
|
||||
Kibana's startup indefinitely. By logging detailed diagnostic info on any
|
||||
plugins that appear to be blocking startup, we can aid system administrators
|
||||
to recover a blocked Kibana.
|
||||
|
||||
A parallel can be drawn between Kibana's async plugin initialization and the TC39
|
||||
proposal for [top-level await](https://github.com/tc39/proposal-top-level-await).
|
||||
> enables modules to act as big async functions: With top-level await,
|
||||
> ECMAScript Modules (ESM) can await resources, causing other modules who
|
||||
> import them to wait before they start evaluating their body
|
||||
|
||||
They believe the benefits outweigh the risk of modules blocking loading since:
|
||||
- [developer education should result in correct usage](https://github.com/tc39/proposal-top-level-await#will-top-level-await-cause-developers-to-make-their-code-block-longer-than-it-should)
|
||||
- [there are existing unavoidable ways in which modules could block loading such as infinite loops or recursion](https://github.com/tc39/proposal-top-level-await#does-top-level-await-increase-the-risk-of-deadlocks)
|
||||
|
||||
|
||||
Drawbacks:
|
||||
1. A blocking setup lifecycle makes it easy for plugin authors to fall into
|
||||
the trap of assuming that their plugin's behaviour can continue to operate
|
||||
based on the snapshot of conditions present during setup.
|
||||
2. This opens up the potential for a bug in Elastic or third-party plugins to
|
||||
effectively "break" Kibana. Instead of a single plugin being disabled, all
|
||||
of Kibana would be down, requiring manual intervention by a system
|
||||
administrator.
|
||||
|
||||
# Adoption strategy
|
||||
Although the eventual goal is to have sync-only lifecycles / providers, we
|
||||
will start by deprecating async behaviour and implementing a 30s timeout as
|
||||
per alternative (1). This will immediately lower the impact of plugin bugs
|
||||
while at the same time enabling a more incremental rollout and the flexibility
|
||||
to discover use cases that would require adopting Core API's to support sync
|
||||
lifecycles / providers.
|
||||
|
||||
Adoption and implementation should be handled as follows:
|
||||
- Adopt Core API’s to make sync lifecycles easier (3)
|
||||
- Update migration guide and other documentation examples.
|
||||
- Deprecate async lifecycles / context providers with a warning. Add a
|
||||
timeout of 30s, after which a plugin and its dependencies will be disabled.
|
||||
- Refactor existing plugin lifecycles which are easily converted to sync
|
||||
- Future: remove async timeout lifecycles / context providers
|
||||
|
||||
The following New Platform plugins or shims currently rely on async lifecycle
|
||||
functions and will be impacted:
|
||||
1. [region_map](https://github.com/elastic/kibana/blob/6039709929caf0090a4130b8235f3a53bd04ed84/src/legacy/core_plugins/region_map/public/plugin.ts#L68)
|
||||
2. [tile_map](https://github.com/elastic/kibana/blob/6039709929caf0090a4130b8235f3a53bd04ed84/src/legacy/core_plugins/tile_map/public/plugin.ts#L62)
|
||||
3. [vis_type_table](https://github.com/elastic/kibana/blob/6039709929caf0090a4130b8235f3a53bd04ed84/src/legacy/core_plugins/vis_type_table/public/plugin.ts#L61)
|
||||
4. [vis_type_vega](https://github.com/elastic/kibana/blob/6039709929caf0090a4130b8235f3a53bd04ed84/src/legacy/core_plugins/vis_type_vega/public/plugin.ts#L59)
|
||||
6. [code](https://github.com/elastic/kibana/blob/5049b460b47d4ae3432e1d9219263bb4be441392/x-pack/legacy/plugins/code/server/plugin.ts#L129-L149)
|
||||
7. [spaces](https://github.com/elastic/kibana/blob/096c7ee51136327f778845c636d7c4f1188e5db2/x-pack/legacy/plugins/spaces/server/new_platform/plugin.ts#L95)
|
||||
8. [licensing](https://github.com/elastic/kibana/blob/4667c46caef26f8f47714504879197708debae32/x-pack/plugins/licensing/server/plugin.ts)
|
||||
9. [security](https://github.com/elastic/kibana/blob/0f2324e44566ce2cf083d89082841e57d2db6ef6/x-pack/plugins/security/server/plugin.ts#L96)
|
||||
|
||||
# How we teach this
|
||||
|
||||
Async Plugin lifecycle methods and async context provider functions have been
|
||||
deprecated. In the future, all lifecycle methods will be sync only. Plugins
|
||||
should treat the setup lifecycle as a place in time to register functionality
|
||||
with core or other plugins' API's and not as a mechanism to kick off and wait
|
||||
for any initialization that's required for the plugin to be able to run.
|
||||
|
||||
# Unresolved questions
|
||||
1. ~~Are the drawbacks worth the benefits or can we live with Kibana potentially
|
||||
being blocked for the sake of convenient async lifecycle stages?~~
|
||||
|
||||
2. Should core provide conventions or patterns for plugins to construct a
|
||||
snapshot of state and reactively update this state and the behaviour it
|
||||
drives as the state of the world changes?
|
||||
|
||||
3. Do plugins ever need to read config values and pass these as parameters to
|
||||
Core API’s? If so we would have to expose synchronous config values to
|
||||
support sync lifecycles.
|
||||
|
||||
# Footnotes
|
||||
[1] Synchronous lifecycles can still be blocked by e.g. an infinite `for` loop,
|
||||
but this would always be unintentional behaviour in contrast to intentional
|
||||
async behaviour like blocking until an external service becomes available.
|
|
@ -1,316 +0,0 @@
|
|||
- Start Date: 2020-02-07
|
||||
- RFC PR: [#57108](https://github.com/elastic/kibana/pull/57108)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
# Table of contents
|
||||
|
||||
- [Summary](#summary)
|
||||
- [Motivation](#motivation)
|
||||
- [Detailed design](#detailed-design)
|
||||
- [Concepts](#concepts)
|
||||
- [Architecture](#architecture)
|
||||
1. [Remote Pulse Service](#1-remote-pulse-service)
|
||||
- [Deployment](#deployment)
|
||||
- [Endpoints](#endpoints)
|
||||
- [Authenticate](#authenticate)
|
||||
- [Opt-In|Out](#opt-inout)
|
||||
- [Inject telemetry](#inject-telemetry)
|
||||
- [Retrieve instructions](#retrieve-instructions)
|
||||
- [Data model](#data-model)
|
||||
- [Access Control](#access-control)
|
||||
2. [Local Pulse Service](#2-local-pulse-service)
|
||||
- [Data storage](#data-storage)
|
||||
- [Sending telemetry](#sending-telemetry)
|
||||
- [Instruction polling](#instruction-polling)
|
||||
- [Drawbacks](#drawbacks)
|
||||
- [Alternatives](#alternatives)
|
||||
- [Adoption strategy](#adoption-strategy)
|
||||
- [How we teach this](#how-we-teach-this)
|
||||
- [Unresolved questions](#unresolved-questions)
|
||||
|
||||
# Summary
|
||||
|
||||
Evolve our telemetry to collect more diverse data, enhance our products with that data and engage with users by enabling:
|
||||
|
||||
1. _Two-way_ communication link between us and our products.
|
||||
2. Flexibility to collect diverse data and different granularity based on the type of data.
|
||||
3. Enhanced features in our products, allowing remote-driven _small tweaks_ to existing builds.
|
||||
4. All this while still maintaining transparency about what we send and making sure we don't track any of the user's data.
|
||||
|
||||
# Basic example
|
||||
|
||||
There is a POC implemented in the branch [`pulse_poc`](https://github.com/elastic/kibana/tree/pulse_poc) in this repo.
|
||||
|
||||
It covers the following scenarios:
|
||||
|
||||
- Track the behaviour of our users in the UI, reporting UI events throughout our platform.
|
||||
- Report to Elastic when an unexpected error occurs and keep track of it. When it's fixed, it lets the user know, encouraging them to update their deployment to the latest release (PR [#56724](https://github.com/elastic/kibana/pull/56724)).
|
||||
- Keep track of the notifications and news in the newsfeed to know when they are read/kept unseen. This might help us improve the way we communicate updates to the user (PR [#53596](https://github.com/elastic/kibana/pull/53596)).
|
||||
- Provide a cost estimate for running that cluster in Elastic Cloud, so the user is well-informed about our up-to-date offering and can decide accordingly (PR [#56324](https://github.com/elastic/kibana/pull/56324)).
|
||||
- Customised "upgrade guide" from your current version to the latest (PR [#56556](https://github.com/elastic/kibana/pull/56556)).
|
||||
|
||||

|
||||
_Basic example of the architecture_
|
||||
|
||||
# Motivation
|
||||
|
||||
Based on our current telemetry, we have many _lessons learned_ we want to tackle:
|
||||
|
||||
- It only supports one type of data:
|
||||
- It turns simple tasks, like reporting aggregations of usage over a number of days, into [an overengineered solution](https://github.com/elastic/kibana/issues/46599#issuecomment-545024137)
|
||||
- When reporting arrays (e.g. `ui_metrics`), the data cannot be consumed, making it useless.
|
||||
- _One index to rule them all_:
|
||||
The current unique document structure comes at a price:
|
||||
- People consuming that information find it hard to understand each element in the document ([[DISCUSS] Data dictionary for product usage data](https://github.com/elastic/telemetry/issues/211))
|
||||
- Maintaining the mappings is a tedious and risky process. It involved increasing the setting for the limit of fields in a mapping and reindexing documents (now millions of them).
|
||||
- We cannot fully control the data we insert in the documents: if we set `mappings.dynamic: 'strict'`, we'll reject all the documents containing more information than what is actually mapped, losing all the other content we do want to receive.
|
||||
- Opt-out ratio:
|
||||
We want to reduce the number of `opt-out`s by providing some valuable feedback to our users so that they want to turn telemetry ON because they do benefit from it.
|
||||
|
||||
# Detailed design
|
||||
|
||||
This design is going to be tackled by introducing some common concepts to be used by the two main components in this architecture:
|
||||
|
||||
1. Remote Pulse Service (RPS)
|
||||
2. Local Pulse Service (LPS)
|
||||
|
||||
After that, it explains how we envision the architecture and design of each of those components.
|
||||
|
||||
## Concepts
|
||||
|
||||
There are some new concepts we'd like to introduce with this new way of reporting telemetry:
|
||||
|
||||
- **Deployment Hash ID**
|
||||
This is the _anonymised_ random ID assigned for a deployment. It is used to link multiple pieces of information for further analysis like cross-referencing different bits of information from different sources.
|
||||
- **Channels**
|
||||
Each channel is a stream of data that shares common information. Typically, each channel will have a well-defined source of information, different from the rest, and will also result in a structure different from the other channels. However, all the channels will maintain a minimum piece of common schema for cross-references (like the **Deployment Hash ID** and the **timestamp**).
|
||||
- **Instructions**
|
||||
These are the messages generated in the form of feedback to the different channels.
|
||||
Typically, channels will follow a bi-directional communication process _(Local <-> Remote)_, but there might be channels that do not generate any kind of instruction _(Local -> Remote)_ and, similarly, some other channels that do not provide any telemetry at all but allow Pulse to send updates to our products _(Local <- Remote)_.
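To make these concepts concrete, a hypothetical channel document could look like the sketch below (purely illustrative; the actual data model is covered later in this RFC):

```ts
// Hypothetical document for a "ui-behaviour" channel.
interface UiBehaviourChannelDoc {
  // Common to every channel, used for cross-referencing:
  deploymentHashId: string;
  timestamp: string;
  // Channel-specific fields:
  eventName: string;
  appId: string;
  count: number;
}
```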
|
||||
|
||||
## Phased implementation
|
||||
|
||||
At the moment of writing this document, anyone can push _fake_ telemetry data to our Telemetry cluster. They only need to know the public encryption key, the endpoint, and the format of the data, all of which are easily retrievable. We take that into consideration when analysing the data we have at the moment, and it is a risk we are OK with for now.
|
||||
|
||||
But, given that we aim to provide feedback to the users and clusters in the form of instructions, the **Security and Integrity of the information** is critical. We need to come up with a solution that ensures the instructions are created based on data that was uniquely created (signed?) by the source. If we cannot ensure that, we should not allow that piece of information to be used in the generation of the instructions for that cluster and we should mark it so we know it could be maliciously injected when using it in our analysis.
|
||||
|
||||
We also want to be able to ship the benefits of Pulse on every release. That's why we are thinking of a phased release, starting with limited functionality and evolving to the final complete vision of this product. This RFC suggests the following phased implementation:
|
||||
|
||||
1. **Be able to ingest granular data**
|
||||
With the introduction of the **channels**, we can start receiving granular data that will help us all on our analysis. At this point, the same _security_ features as the current telemetry are considered: The payload is encrypted by the Kibana server so no mediator can spoof the data.
|
||||
The same risks as the current telemetry still apply at this point: anyone can _impersonate_ and send the data on behalf of another cluster, making the collected information useless.
|
||||
Because this information cannot be used to generate any instruction, we may not care about the **Deployment Hash ID** at this stage. This means no authentication is required to push data.
|
||||
    The work at this point in time will be focused on creating the initial infrastructure, receiving early data and starting the migration of the current telemetry into the new channel-based model. Finally, we will start exploring the new visualisations we can provide with this new model of data.
|
||||
|
||||
2. **Secured ingest channel**
|
||||
In this phase, our efforts will focus on securing the communications and integrity of the data. This includes:
|
||||
- **Generation of the Deployment Hash ID**:
|
||||
      We need to discuss whether it should be self-generated and accepted/rejected by the Remote Pulse Service (RPS), or generated and assigned by the RPS because it is the only component that can ensure uniqueness.
|
||||
- **Locally store the Deployment Hash ID as an encrypted saved object**:
|
||||
      This comes with a caveat: OSS versions will not be able to receive instructions. We will need to maintain a fallback mechanism to the phase 1 logic (it may even be a desired scenario: the encrypted saved objects could become unrecoverable due to an error in the deployment and we should still be able to apply that fallback).
|
||||
- **Authenticity of the information (Local -> Remote)**:
|
||||
      We need to _sign_ the data in some way so that the RPS can confirm the information reported for a _Deployment Hash ID_ comes from the right source.
|
||||
- **Authenticity of the information (Remote -> Local)**:
|
||||
      We need the Local Pulse Service (LPS) to be able to confirm that the responses from the RPS have not been altered by any mediator. It could be done via encryption using a key provided by the LPS. This key would be provided to the RPS inside an encrypted payload, in the same fashion we currently encrypt the telemetry.
|
||||
- **Integrity of the data in the channels**:
|
||||
We need to ensure an external plugin cannot push data to channels to avoid malicious corruption of the data. We could achieve this by either making this plugin only available to Kibana-shipped plugins or storing the `pluginID` that is pushing the data to have better control of the source of the data (then an ingest pipeline can reject any source of data that should not be accepted).
|
||||
|
||||
All the suggestions in this phase can be further discussed at that point (I will create another RFC to discuss those terms after this RFC is approved and merged).
|
||||
|
||||
3. **Instruction handling**
|
||||
    In this final phase, we'll implement the instruction generation and handling, at the same time as we add more **channels**.
|
||||
We can discuss at this point if we want to be able to provide _harmless_ instructions for those deployments that are not _secured_ (i.e.: Cloud cost estimations, User-profiled-based marketing updates, ...).
|
||||
|
||||
## Architecture
|
||||
|
||||
As mentioned earlier, at the beginning of this chapter, there are two main components in this architecture:
|
||||
|
||||
1. Remote Pulse Service
|
||||
2. Local Pulse Service
|
||||
|
||||
### 1. Remote Pulse Service
|
||||
|
||||
This is the service that will receive and store the telemetry from all the _opted-in_ deployments. It will also generate the messages we want to report back to each deployment (aka: instructions).
|
||||
|
||||
#### Deployment
|
||||
|
||||
- The service will be hosted by Elastic.
|
||||
- Most likely maintained by the Infra team.
|
||||
- GCP is contemplated at this moment, but we need to confirm how it would affect us regarding FedRAMP approvals (and similar).
|
||||
- Exposes an API (check [Endpoints](#endpoints) to know more) to inject the data and retrieve the _instructions_.
|
||||
- The data will be stored in an ES cluster.
|
||||
|
||||
#### Endpoints
|
||||
|
||||
The following endpoints **will send every payload** detailed below **encrypted** with a mechanism similar to the current telemetry encryption.
|
||||
|
||||
##### Authenticate
|
||||
|
||||
This Endpoint will be used to retrieve a randomised `deploymentID` and a `token` for the cluster to use in all the subsequent requests. Ideally, it will provide some sort of identifier (like `cluster_uuid` or `license.uuid`) so we can revoke its access to any of the endpoints if explicitly requested ([Blocking telemetry input](https://github.com/elastic/telemetry/pull/221) and [Delete previous telemetry data](https://github.com/elastic/telemetry/issues/209)).
|
||||
|
||||
I'd appreciate some insights here to come up with a strong handshake mechanism to avoid stealing identities.
|
||||
|
||||
In order to _dereference_ the data, we can store these mappings in a Vault or Secrets provider instead of an index in our ES.
|
||||
|
||||
_NB: Not for phase 1_
|
||||
|
||||
##### Opt-In|Out
|
||||
|
||||
Similar to the current telemetry, we want to keep track of when the user opts in to or out of telemetry. The implementation can be very similar to the current one, but we recently learned we need to add the origin to know which application has telemetry disabled (Kibana, Beats, Enterprise Search, ...). This makes me wonder whether we will ever want to provide a granular option for the user to cherry-pick which channels are sent and which ones should be disabled.
|
||||
|
||||
##### Inject telemetry
|
||||
|
||||
In order to minimise the amount of requests, this `POST` should accept bulks of data in the payload (mind the payload size limits if any). It will require authentication based on the `deploymentID` and `token` explained in the [previous endpoint](#authenticate) (_NB: Not for phase 1_).
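Purely as an illustration of the bulk shape (the wire format and field names are assumptions, not part of this RFC), the decrypted body of such a request could look like:

```typescript
// Hypothetical decrypted bulk payload for the injection endpoint (names illustrative).
const bulkPayload = {
  deploymentHashId: 'a1b2c3d4', // omitted / ignored during phase 1
  records: [
    { channel: 'legacy', timestamp: '2020-03-01T10:00:00Z', payload: { /* current telemetry document */ } },
    { channel: 'errors', timestamp: '2020-03-01T10:00:05Z', payload: { message: 'Unexpected error' } },
  ],
};
```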
|
||||
|
||||
The received payload will be pushed to a streaming technology (AWS Firehose, Google Pub/Sub, ...). This way we can maintain a buffer in cases the ingestion of data spikes or we need to stop our ES cluster for any maintenance purposes.
|
||||
|
||||
A subscriber to that stream will receive that info, split the payload into smaller documents per channel, and index them into their separate indices.
|
||||
|
||||
This indexing should also trigger some additional processes like the **generation of instructions** and _special views_ (only if needed, check the point [Access control](#access-control) for more details).
|
||||
|
||||
_NB: We might want to consider some sort of piggy-backing to include the instructions in the response. But for the purpose of this RFC, scalability and separation of concerns, I'd rather keep it for future possible improvements._
|
||||
|
||||
##### Retrieve instructions
|
||||
|
||||
_NB: Only after phase 3_
|
||||
|
||||
This `GET` endpoint should return the list of instructions generated for that deployment. To control the likely ever-growing list of instructions for each deployment, it will accept a `since` query parameter with which the requester can specify a timestamp so that only values generated after it are returned.
|
||||
|
||||
This endpoint will read the `instructions-*` indices, filtering `updated-at` by the `since` query parameter (if provided) and it will return the results, grouping them by channels.
|
||||
|
||||
Additionally, we could accept another query parameter to retrieve only specific channels, for use cases like distributed components (endpoint, APM, beats, ...) polling for instructions themselves.
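As a sketch only (path, parameters and field names are assumptions), a polling request and its grouped-by-channel response could look like:

```typescript
// Hypothetical request:
// GET /pulse/instructions?since=2020-03-01T00:00:00Z&channels=ui_behaviour_tracking
// Hypothetical response body, grouped by channel:
const exampleInstructionsResponse = {
  channels: {
    ui_behaviour_tracking: [
      {
        deploymentHashId: 'a1b2c3d4',
        updatedAt: '2020-03-02T09:30:00Z',
        payload: { reorderComponents: ['panelB', 'panelA'] },
      },
    ],
  },
};
```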
|
||||
|
||||
#### Data model
|
||||
|
||||
The storage of each of the documents will be based on monthly-rolling indices split by channels. This means we'll have indices like `pulse-raw-{CHANNEL_NAME}-YYYY.MM` and `pulse-instructions-{CHANNEL_NAME}-YYYY.MM` (final names TBD).
|
||||
|
||||
The first group will be used to index all the incoming telemetry documents, while the second one will contain the instructions to be sent to the deployments.
|
||||
|
||||
The mapping for those indices will be **`strict`** to avoid anyone storing unwanted/not-allowed info. The indexer defined in [the _Inject telemetry_ endpoint](#inject-telemetry) will need to handle the errors derived from the strict mapping accordingly.
|
||||
We'll set up a process to add new mappings and their descriptions before every new release.
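To illustrate what a channel mapping could look like (the channel and its fields are hypothetical; only `dynamic: 'strict'` is prescribed by this proposal):

```typescript
// Hypothetical mapping body for a `pulse-raw-errors-2020.03`-style index.
const errorsChannelMappings = {
  dynamic: 'strict', // documents containing unmapped fields are rejected
  properties: {
    deploymentHashId: { type: 'keyword' },
    timestamp: { type: 'date' },
    message: { type: 'text' },
  },
};
```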
|
||||
|
||||
#### Access control
|
||||
|
||||
- The access to _raw_ data indices will be very limited. Only granted to those in need of troubleshooting the service and maintaining mappings (this is the Pulse/Telemetry team at the moment).
|
||||
- Special views (as in aggregations/visualisations/snapshots of the data stored in special indices via separated indexers/aggregators/ES transform or via _BigQuery_ or similar) will be defined for different roles in the company to help them to take informed decisions based on the data.
|
||||
This way we'll be able to control "who can see what" on a very granular basis. It will also provide us with more flexibility to change the structure of the _raw_ data if needed.
|
||||
|
||||
### 2. Local Pulse Service
|
||||
|
||||
This refers to the plugin running in Kibana in each of our customers' deployments. It will be a core service in NP, available for all plugins to get the existing channels, to send pieces of data, and subscribe to instructions.
|
||||
|
||||
The channel handlers are only defined inside the pulse context and are used to normalise the data for each channel before sending it to the remote service. The CODEOWNERS should notify the Pulse team every time there's an intended change in this context.
|
||||
|
||||
#### Data storage
|
||||
|
||||
For the purpose of transparency, we want the user to be able to retrieve the telemetry we send at any point, so we should store the information we send for each channel in their own local _dot_ internal indices (similar to a copy of the `pulse-raw-*` and `pulse-instructions-*` indices in our remote service). We may want to also sync back from the remote service any updates we do to the documents: enrichment of the document, anonymisation, categorisation when it makes sense in that specific channel, ...
|
||||
|
||||
In the same effort, we could even provide some _dashboards_ in Kibana for specific roles in the cluster to understand more about their deployment.
|
||||
|
||||
Only those specific roles (admin?) should have access to these local indices, unless they grant permissions to other users they want to share this information with.
|
||||
|
||||
The users should be able to control how long they want to keep that information for via ILM. A default ILM policy will be set up during startup if it doesn't exist.
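A minimal sketch of what such a default policy could contain (the retention period is an assumption, not a decision of this RFC):

```typescript
// Hypothetical default ILM policy applied to the local pulse indices on startup.
const defaultPulseIlmPolicy = {
  policy: {
    phases: {
      hot: { actions: {} },
      delete: { min_age: '90d', actions: { delete: {} } }, // retention value is illustrative
    },
  },
};
```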
|
||||
|
||||
#### Sending telemetry
|
||||
|
||||
The telemetry will be sent, preferably, from the server, only falling back to the browser if we detect that the server is behind a firewall and cannot reach the service, or if the user explicitly sets that behaviour in the config.
|
||||
|
||||
Periodically, the process (either in the server or the browser) will retrieve the telemetry to be sent by the channels, compile it into 1 bulk payload and send it encrypted to the [ingest endpoint](#inject-telemetry) explained earlier.
|
||||
|
||||
How often the data is sent depends on the channel specifications. We will have 4 levels of periodicity:
|
||||
|
||||
- `URGENT`: The data is sent as soon as possible.
|
||||
- `HIGH`: Sent every hour.
|
||||
- `NORMAL`: Sent every 24 hours.
|
||||
- `LOW`: Sent every 3 days.
|
||||
|
||||
Some throttling policy should be applied to avoid abuse of the `URGENT` level.
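A sketch of how these levels could map to flush intervals in the LPS (the enum and constant names are illustrative; the intervals come from the list above):

```typescript
// Hypothetical mapping of channel periodicity levels to flush intervals.
enum ChannelPeriodicity {
  URGENT = 'URGENT',
  HIGH = 'HIGH',
  NORMAL = 'NORMAL',
  LOW = 'LOW',
}

const FLUSH_INTERVAL_MS: Record<ChannelPeriodicity, number> = {
  [ChannelPeriodicity.URGENT]: 0, // sent as soon as possible, subject to throttling
  [ChannelPeriodicity.HIGH]: 60 * 60 * 1000, // every hour
  [ChannelPeriodicity.NORMAL]: 24 * 60 * 60 * 1000, // every 24 hours
  [ChannelPeriodicity.LOW]: 3 * 24 * 60 * 60 * 1000, // every 3 days
};
```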
|
||||
|
||||
#### Instruction polling
|
||||
|
||||
Similarly to the sending of the telemetry, the instruction polling should happen only on one end (either the server or the browser). It will store the responses in the local index for each channel and the plugins reacting to those instructions will be able to consume that information based on their own needs (either load only the new ones or all the historic data at once).
|
||||
|
||||
Depending on the subscriptions to the channels by the plugins, the polling will happen with different periodicity, similar to the one described in the chapter above.
|
||||
|
||||
#### Exposing channels to the plugins
|
||||
|
||||
The plugins will be able to send messages and/or consume instructions for any channel by using the methods provided as part of the `coreContext` in the `setup` and `start` lifecycle methods in a fashion like (types to be properly defined when implementing it):
|
||||
|
||||
```typescript
|
||||
const coreContext: CoreSetup | CoreStart = {
|
||||
...existingCoreContext,
|
||||
pulse: {
|
||||
    sendToChannel: <C extends keyof Channels>(channelName: C, payload: Channels[C]) => Promise<void>,
|
||||
    instructionsFromChannel$: <C extends keyof ChannelInstructions>(channelName: C) => Observable<ChannelInstructions[C]>,
|
||||
},
|
||||
}
|
||||
```
|
||||
|
||||
Plugins will simply need to call `core.pulse.sendToChannel('errors', myUnexpectedErrorIWantToReport)` whenever they want to report any new data to that channel. This will call the channel's handler to store the data.
|
||||
|
||||
Similarly, they'll be able to subscribe to channels like:
|
||||
|
||||
```typescript
|
||||
core.pulse.instructionsFromChannel$('ui_behaviour_tracking')
|
||||
.pipe(filterInstructionsForMyPlugin) // Initially, we won't filter the instructions based on the plugin ID (might not be necessary in all cases)
|
||||
.subscribe(changeTheOrderOfTheComponents);
|
||||
```
|
||||
|
||||
Internally in those methods we should append the `pluginId` to know who is sending/receiving the info.
|
||||
|
||||
##### The _legacy_ collection
|
||||
|
||||
The current telemetry collection via the `UsageCollector` service will be maintained until all the current telemetry is fully migrated into their own channels. In the meantime, the current existing telemetry will be sent to Pulse as the `legacy` channel. This way we can maintain the same architecture for the old and new telemetry to come. At this stage, there is no need for any plugin to update their logic unless they want to send more granular data using other (even specific to that plugin) channels.
|
||||
|
||||
The mapping for this `legacy` channel will be kept `dynamic: false` instead of `strict` to ensure compatibility.
|
||||
|
||||
# Drawbacks
|
||||
|
||||
- Pushing data into telemetry nowadays is as simple as implementing your own `usageCollector`. For consuming it, though, the telemetry team needs to update the mappings; but as soon as they do so, the previously received data is available. Now we'll be stricter about the mapping, rejecting any data that does not comply. Changing the structure of the reported data will result in data loss in that channel.
|
||||
- Hard dependency on the Pulse team's availability to update the metrics and on the Infra team to deploy the instruction handlers.
|
||||
- Testing architecture: any dockerised way to test the local dev environment?
|
||||
- We'll increase the local usage of indices, making it more expensive for users to maintain the cluster. We need to be careful with this! Although it might not change much compared to the current implementation if a plugin decides to maintain its own index/saved objects to do aggregations afterwards. Similarly, more granularity per channel may involve more network usage.
|
||||
- It is indeed a breaking change, but it can be migrated over time as new features make use of the instructions.
|
||||
- We need to update other products already reporting telemetry from outside Kibana (like Beats, Enterprise Search, Logstash, ...) to use the new way of pushing telemetry.
|
||||
|
||||
# Alternatives
|
||||
|
||||
> What other designs have been considered?
|
||||
|
||||
We currently have the newsfeed to communicate with the user. This is actually Kibana pulling from a public API to retrieve the list of entries to be shown in the notification bar. But this is limited to notifications to the user, while the new _instructions_ can provide capabilities like self-update/self-configuration of components like endpoints, elasticsearch, ...
|
||||
|
||||
> What is the impact of not doing this?
|
||||
|
||||
Users might not see any benefit from providing telemetry and will opt-out. The quality of the telemetry will likely not be as good (or it will require a higher effort on the plugin end to provide it like in [the latest lens effort](https://github.com/elastic/kibana/issues/46599#issuecomment-545024137))
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
Initially, we'll focus on the remote service and move the current telemetry to report as a `"legacy"` channel to the new Pulse service.
|
||||
|
||||
Then, we'll focus on the client side, providing new APIs to report the data while aiming for minimal changes on the public end. For instance, the current usage collectors already report an ID; we can work on mapping those IDs to channels (only grouping them when it makes sense). Nevertheless, it will require the devs to engage with the Pulse team so that the mappings, definitions and any views are properly set up and updated.
|
||||
|
||||
Finally, the instruction handling APIs are completely new and will require development on both the _remote_ and _local_ ends for the instruction generation and handling.
|
||||
|
||||
# How we teach this
|
||||
|
||||
> What names and terminology work best for these concepts and why? How is this
|
||||
idea best presented? As a continuation of existing Kibana patterns?
|
||||
|
||||
We have 3 points of view to show here:
|
||||
|
||||
- From the users' perspective, we need to show the value they get from having telemetry activated.
|
||||
- From the devs, how to generate data and consume instructions.
|
||||
- From the PMs, how to consume the views + definitions of the fields.
|
||||
|
||||
> Would the acceptance of this proposal mean the Kibana documentation must be
|
||||
re-organized or altered? Does it change how Kibana is taught to new developers
|
||||
at any level?
|
||||
|
||||
This telemetry is supposed to be internal only. Only internal developers will be able to add to this, so the documentation will only be for internal purposes. As mentioned in the _Adoption strategy_, the idea is that devs who want to report new data to telemetry will need to engage with the Pulse team.
|
||||
|
||||
> How should this feature be taught to existing Kibana developers?
|
||||
|
||||
# Unresolved questions
|
||||
|
||||
- Pending to define a proper handshake in the authentication mechanism to reduce the chance of a man-in-the-middle attack or DDoS. => We already have some ideas thanks to @jportner and @kobelb but it will be resolved during the _Phase 2_ design.
|
||||
- Opt-in/out per channel?
|
|
@ -1,151 +0,0 @@
|
|||
- Start Date: 2020-03-02
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
# Summary
|
||||
|
||||
Currently, the applications that support screenshot reports are:
|
||||
- Dashboard
|
||||
- Visualize Editor
|
||||
- Canvas
|
||||
|
||||
Kibana UI code should be aware when the page is rendering for the purpose of
|
||||
capturing a screenshot. There should be a service to interact with low-level
|
||||
code for providing that awareness. Reporting would interact with this service
|
||||
to improve the quality of the Kibana Reporting feature for a few reasons:
|
||||
|
||||
- Fewer objects in the headless browser memory since interactive code doesn't run
|
||||
- Fewer API requests made by the headless browser for features that don't apply in a non-interactive context
|
||||
|
||||
**Screenshot mode service**
|
||||
|
||||
The Reporting-enabled applications should use the recommended practice of
|
||||
having a customized URL for Reporting. The customized URL renders without UI
|
||||
features like navigation, auto-complete, and anything else that wouldn't make
|
||||
sense for non-interactive pages.
|
||||
|
||||
However, applications are one piece of the UI code in a browser, and they have
|
||||
dependencies on other UI plugins. Apps can't control plugins and other things
|
||||
that Kibana loads in the browser.
|
||||
|
||||
This RFC proposes a Screenshot Mode Service as a low-level plugin that allows
|
||||
other plugins (UI code) to make choices when the page is rendering for a screenshot.
|
||||
|
||||
More background on how Reporting currently works, including the lifecycle of
|
||||
creating a PNG report, is here: https://github.com/elastic/kibana/issues/59396
|
||||
|
||||
# Motivation
|
||||
|
||||
The Reporting team wants all applications to support customized URLs, such as
|
||||
Canvas does with its `#/export/workpad/pdf/{workpadId}` UI route. The
|
||||
customized URL is where an app can solve any rendering issue in a PDF or PNG,
|
||||
without needing extra CSS to be injected into the page.
|
||||
|
||||
However, many low-level plugins have been added to the UI over time. These run
|
||||
on every page and an application cannot turn them off. Reporting performance
|
||||
is negatively affected by this type of code. When the Reporting team analyzes
|
||||
customer logs to figure out why a job timed out, we sometimes see requests for
|
||||
the newsfeed API and telemetry API: services that aren't needed during a
|
||||
reporting job.
|
||||
|
||||
In 7.12.0, using the customized `/export/workpad/pdf` in Canvas, the Sample
|
||||
Data Flights workpad loads 163 requests. Most of these requests don't come from
|
||||
the app itself but from the application container code that Canvas can't turn
|
||||
off.
|
||||
|
||||
# Detailed design
|
||||
|
||||
The Screenshot Mode Service is an entirely new plugin that has an API method
|
||||
that returns a Boolean. The return value tells the plugin whether or not it
|
||||
should render itself to optimize for non-interactivity.
|
||||
|
||||
The plugin is low-level as it has no dependencies of its own, so other
|
||||
low-level plugins can depend on it.
|
||||
|
||||
## Interface
|
||||
A plugin would depend on `screenshotMode` in kibana.json. That provides
|
||||
`screenshotMode` as a plugin object. The plugin's purpose is to know when the
|
||||
page is rendered for screenshot capture, and to interact with plugins through
|
||||
an API. It allows plugins to decide what to do with the screenshot mode
|
||||
information.
|
||||
|
||||
```ts
|
||||
interface IScreenshotModeServiceSetup {
|
||||
isScreenshotMode: () => boolean;
|
||||
}
|
||||
```
|
||||
|
||||
The plugin knows the screenshot mode from request headers: this interface is
|
||||
constructed from a class that refers to information sent via a custom
|
||||
proprietary header:
|
||||
|
||||
```ts
|
||||
interface HeaderData {
|
||||
'X-Screenshot-Mode': true
|
||||
}
|
||||
|
||||
class ScreenshotModeServiceSetup implements IScreenshotModeServiceSetup {
|
||||
  constructor(private readonly rawData: HeaderData) {}
|
||||
  public isScreenshotMode(): boolean {
    return this.rawData['X-Screenshot-Mode'] === true;
  }
|
||||
}
|
||||
```
|
||||
|
||||
The Reporting headless browser that opens the page can inject custom headers
|
||||
into the request. Teams should be able to test how their app renders when
|
||||
loaded with this header. They could use a web debugging proxy, or perhaps the
|
||||
new service should support a URL parameter which triggers screenshot mode to be
|
||||
enabled, for easier testing.
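As a sketch of how the header could be injected (assuming the headless browser is driven with Puppeteer, which exposes `page.setExtraHTTPHeaders`; none of this is prescribed by the RFC):

```ts
// Sketch only: open a page with the screenshot-mode header set, e.g. for local testing.
import puppeteer from 'puppeteer';

async function openForScreenshot(url: string) {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Header name taken from the HeaderData interface above.
  await page.setExtraHTTPHeaders({ 'X-Screenshot-Mode': 'true' });
  await page.goto(url, { waitUntil: 'networkidle0' });
  return { browser, page };
}
```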
|
||||
|
||||
# Basic example
|
||||
|
||||
When Kibana loads initially, there is a Newsfeed plugin in the UI that
|
||||
checks internally cached records to see if it must fetch the Elastic News
|
||||
Service for newer items. When the Screenshot Mode Service is implemented, the
|
||||
Newsfeed component has a source of information to check on whether or not it
|
||||
should load in the Kibana UI. If it can avoid loading, it avoids an unnecessary
|
||||
HTTP round trip, which weighs heavily on performance.
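A minimal sketch of how such a consumer could use the service (the plugin wiring and function names are assumptions):

```ts
// Sketch: skip the newsfeed fetch entirely when the page is rendered for a screenshot.
interface NewsfeedSetupDeps {
  screenshotMode: { isScreenshotMode: () => boolean };
}

function setupNewsfeed(deps: NewsfeedSetupDeps, fetchNewsfeedItems: () => Promise<void>) {
  if (deps.screenshotMode.isScreenshotMode()) {
    return; // no HTTP round trip, nothing to render for a non-interactive page
  }
  void fetchNewsfeedItems();
}
```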
|
||||
|
||||
# Alternatives
|
||||
|
||||
- Print media query CSS
|
||||
If application UIs supported printability using `@media print`, and Kibana
|
||||
Reporting used `page.print()` to capture the PDF, it would be easy for application
|
||||
developers to test, and prevent bugs showing up in the report.
|
||||
|
||||
However, this proposal only provides high-level customization over visual rendering, which the
|
||||
application already has if it uses a customized URL for rendering the layout for screenshots. It
|
||||
has a performance downside, as well: the headless browser still has to render the entire
|
||||
page as a "normal" render before we can call `page.print()`. No one sees the
|
||||
results of that initial render, so it is the same amount of wasted rendering cycles
|
||||
during report generation that we have today.
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
Using this service doesn't mean that anything needs to be replaced or thrown away. It's an add-on
|
||||
that any plugin or even application can use to add conditionals that previously weren't possible.
|
||||
The Reporting Services team should create an example in a developer example plugin on how to build
|
||||
a UI that is aware of Screenshot Mode Service. From there, the team would work on updating
|
||||
whichever code that would benefit from this the most, which we know from analyzing debugging logs
|
||||
of a report job. The team would work across teams to get it accepted by the owners.
|
||||
|
||||
# How we teach this
|
||||
|
||||
The Reporting Services team will continue to analyze debug logs of reporting jobs to find if there
|
||||
is UI code running during a report job that could be optimized by this service. The team would
|
||||
reach out to the code owners and determine if it makes sense to use this service to improve
|
||||
screenshot performance of their code.
|
||||
|
||||
# Further examples
|
||||
|
||||
- Applications can also use screenshot context to customize the way they load.
|
||||
An example is Toast Notifications: by default they auto-dismiss themselves
|
||||
after 30 seconds or so. That makes sense when there is a human there to
|
||||
notice the message, read it and remember it. But if the page is loaded for
|
||||
capturing a screenshot, the toast notifications should never disappear. The
|
||||
message in the toast needs to be part of the screenshot for its message to
|
||||
mean anything, so it should not force the screenshot capture tool to race
|
||||
against the toast timeout window.
|
||||
- Avoid collection and sending of telemetry from the browser when page is
|
||||
loaded for screenshot capture.
|
||||
- Turn off autocomplete features and auto-refresh features that weigh on
|
||||
performance for screenshot capture.
|
|
@ -1,373 +0,0 @@
|
|||
- Start Date: 2020-03-07
|
||||
- RFC PR: https://github.com/elastic/kibana/pull/59621
|
||||
- Kibana Issue: https://github.com/elastic/kibana/issues/41983
|
||||
|
||||
# Summary
|
||||
|
||||
A set API for describing the current status of a system (Core service or plugin)
|
||||
in Kibana.
|
||||
|
||||
# Basic example
|
||||
|
||||
```ts
|
||||
// Override default behavior and only elevate severity when elasticsearch is not available
|
||||
core.status.set(
|
||||
  core.status.core$.pipe(map((core) => core.elasticsearch))
|
||||
)
|
||||
```
|
||||
|
||||
# Motivation
|
||||
|
||||
Kibana should do as much as possible to help users keep their installation in a working state. This includes providing as much detail as possible about components that are not working, as well as ensuring that failures in one part of the application do not block using other portions of the application.
|
||||
|
||||
In order to provide the user with as much detail as possible about any systems that are not working correctly, the status mechanism should provide excellent defaults in terms of expressing relationships between services and presenting detailed information to the user.
|
||||
|
||||
# Detailed design
|
||||
|
||||
## Failure Guidelines
|
||||
|
||||
While this RFC primarily describes how status information is signaled from individual services and plugins to Core, it's first important to define how Core expects these services and plugins to behave in the face of failure more broadly.
|
||||
|
||||
Core is designed to be resilient and adaptive to change. When at all possible, Kibana should automatically recover from failure, rather than requiring any kind of intervention by the user or administrator.
|
||||
|
||||
Given this goal, Core expects the following from plugins:
|
||||
- During initialization, `setup`, and `start` plugins should only throw an exception if a truly unrecoverable issue is encountered. Examples: HTTP port is unavailable, server does not have the appropriate file permissions.
|
||||
- Temporary error conditions should always be retried automatically. A user should not have to restart Kibana in order to resolve a problem when avoidable. This means all initialization code should include error handling and automated retries. Examples: creating an Elasticsearch index, connecting to an external service.
|
||||
- It's important to note that some issues do require manual intervention in _other services_ (eg. Elasticsearch). Kibana should still recover without restarting once that external issue is resolved.
|
||||
- Unhandled promise rejections are not permitted. In the future, Node.js will crash on unhandled promise rejections. It is impossible for Core to be able to properly handle and retry these situations, so all services and plugins should handle all rejected promises and retry when necessary.
|
||||
- Plugins should only crash the Kibana server when absolutely necessary. Some features are considered "mission-critical" to customers and may need to halt Kibana if they are not functioning correctly. Example: audit logging.
|
||||
|
||||
## API Design
|
||||
|
||||
### Types
|
||||
|
||||
```ts
|
||||
/**
|
||||
* The current status of a service at a point in time.
|
||||
*
|
||||
* @typeParam Meta - JSON-serializable object. Plugins should export this type to allow other plugins to read the `meta`
|
||||
* field in a type-safe way.
|
||||
*/
|
||||
type ServiceStatus<Meta extends Record<string, any> = unknown> = {
|
||||
/**
|
||||
* The current availability level of the service.
|
||||
*/
|
||||
level: ServiceStatusLevel.available;
|
||||
/**
|
||||
* A high-level summary of the service status.
|
||||
*/
|
||||
summary?: string;
|
||||
/**
|
||||
* A more detailed description of the service status.
|
||||
*/
|
||||
detail?: string;
|
||||
/**
|
||||
* A URL to open in a new tab about how to resolve or troubleshoot the problem.
|
||||
*/
|
||||
documentationUrl?: string;
|
||||
/**
|
||||
* Any JSON-serializable data to be included in the HTTP API response. Useful for providing more fine-grained,
|
||||
* machine-readable information about the service status. May include status information for underlying features.
|
||||
*/
|
||||
meta?: Meta;
|
||||
} | {
|
||||
level: ServiceStatusLevel;
|
||||
summary: string; // required when level !== available
|
||||
detail?: string;
|
||||
documentationUrl?: string;
|
||||
meta?: Meta;
|
||||
}
|
||||
|
||||
/**
|
||||
* The current "level" of availability of a service.
|
||||
*/
|
||||
enum ServiceStatusLevel {
|
||||
/**
|
||||
* Everything is working!
|
||||
*/
|
||||
available,
|
||||
/**
|
||||
* Some features may not be working.
|
||||
*/
|
||||
degraded,
|
||||
/**
|
||||
* The service is unavailable, but other functions that do not depend on this service should work.
|
||||
*/
|
||||
unavailable,
|
||||
/**
|
||||
* Block all user functions and display the status page, reserved for Core services only.
|
||||
* Note: In the real implementation, this will be split out to a different type. Kept as a single type here to make
|
||||
* the RFC easier to follow.
|
||||
*/
|
||||
critical
|
||||
}
|
||||
|
||||
/**
|
||||
* Status of core services. Only contains entries for backend services that could have a non-available `status`.
|
||||
* For example, `context` cannot possibly be broken, so it is not included.
|
||||
*/
|
||||
interface CoreStatus {
|
||||
elasticsearch: ServiceStatus;
|
||||
http: ServiceStatus;
|
||||
savedObjects: ServiceStatus;
|
||||
uiSettings: ServiceStatus;
|
||||
metrics: ServiceStatus;
|
||||
}
|
||||
```
|
||||
|
||||
### Plugin API
|
||||
|
||||
```ts
|
||||
/**
|
||||
* The API exposed to plugins on CoreSetup.status
|
||||
*/
|
||||
interface StatusSetup {
|
||||
/**
|
||||
* Allows a plugin to specify a custom status dependent on its own criteria.
|
||||
* Completely overrides the default inherited status.
|
||||
*/
|
||||
set(status$: Observable<ServiceStatus>): void;
|
||||
|
||||
/**
|
||||
* Current status for all Core services.
|
||||
*/
|
||||
core$: Observable<CoreStatus>;
|
||||
|
||||
/**
|
||||
* Current status for all dependencies of the current plugin.
|
||||
* Each key of the `Record` is a plugin id.
|
||||
*/
|
||||
dependencies$: Observable<Record<string, ServiceStatus>>;
|
||||
|
||||
/**
|
||||
* The status of this plugin as derived from its dependencies.
|
||||
*
|
||||
* @remarks
|
||||
* By default, plugins inherit this derived status from their dependencies.
|
||||
* Calling {@link StatusSetup.set} overrides this default status.
|
||||
*/
|
||||
derivedStatus$: Observable<ServiceStatus>;
|
||||
}
|
||||
```
|
||||
|
||||
### HTTP API
|
||||
|
||||
The HTTP endpoint should return basic information about the Kibana node as well as the overall system status and the status of each individual system.
|
||||
|
||||
This API does not need to include UI-specific details like the existing API such as `uiColor` and `icon`.
|
||||
|
||||
```ts
|
||||
/**
|
||||
* Response type for the endpoint: GET /api/status
|
||||
*/
|
||||
interface StatusResponse {
|
||||
/** server.name */
|
||||
name: string;
|
||||
/** server.uuid */
|
||||
uuid: string;
|
||||
/** Currently exposed by existing status API */
|
||||
version: {
|
||||
number: string;
|
||||
build_hash: string;
|
||||
build_number: number;
|
||||
build_snapshot: boolean;
|
||||
};
|
||||
/** Similar format to existing API, but slightly different shape */
|
||||
status: {
|
||||
/** See "Overall status calculation" section below */
|
||||
overall: ServiceStatus;
|
||||
core: CoreStatus;
|
||||
plugins: Record<string, ServiceStatus>;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Behaviors
|
||||
|
||||
### Levels
|
||||
|
||||
Each member of the `ServiceStatusLevel` enum has specific behaviors associated with it:
|
||||
- **`available`**:
|
||||
- All endpoints and apps associated with the service are accessible
|
||||
- **`degraded`**:
|
||||
- All endpoints and apps are available by default
|
||||
- Some APIs may return `503 Unavailable` responses. This is not automatic, must be implemented directly by the service.
|
||||
- Some plugin contract APIs may throw errors. This is not automatic, must be implemented directly by the service.
|
||||
- **`unavailable`**:
|
||||
  - All endpoints (with some exceptions in Core) in Kibana return a `503 Unavailable` response by default. This is automatic.
|
||||
- When trying to access any app associated with the unavailable service, the user is presented with an error UI with detail about the outage.
|
||||
- Some plugin contract APIs may throw errors. This is not automatic, must be implemented directly by the service.
|
||||
- **`critical`**:
|
||||
- All endpoints (with some exceptions in Core) in Kibana return a `503 Unavailable` response by default. This is automatic.
|
||||
- All applications redirect to the system-wide status page with detail about which services are down and any relevant detail. This is automatic.
|
||||
- Some plugin contract APIs may throw errors. This is not automatic, must be implemented directly by the service.
|
||||
- This level is reserved for Core services only.
|
||||
|
||||
### Overall status calculation
|
||||
|
||||
The status level of the overall system is calculated to be the highest severity status of all core services and plugins.
|
||||
|
||||
The `summary` property is calculated as follows (a sketch of the whole calculation follows the list):
|
||||
- If the overall status level is `available`, the `summary` is `"Kibana is operating normally"`
|
||||
- If a single core service or plugin is not `available`, the `summary` is `Kibana is ${level} due to ${serviceName}. See ${statusPageUrl} for more information.`
|
||||
- If multiple core services or plugins are not `available`, the `summary` is `Kibana is ${level} due to multiple components. See ${statusPageUrl} for more information.`
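The following is a rough sketch of that calculation, reusing the types defined above; the helper name and exact wiring are illustrative, not the actual implementation:

```ts
// Sketch: derive the overall status from core services and registered plugins.
function getOverallStatus(
  core: CoreStatus,
  plugins: Record<string, ServiceStatus>,
  statusPageUrl: string
): ServiceStatus {
  const entries = Object.entries({ ...core, ...plugins });
  // Higher enum value == higher severity (available < degraded < unavailable < critical).
  const [worstName, worst] = entries.reduce((acc, entry) => (entry[1].level > acc[1].level ? entry : acc));

  if (worst.level === ServiceStatusLevel.available) {
    return { level: ServiceStatusLevel.available, summary: 'Kibana is operating normally' };
  }

  const notAvailable = entries.filter(([, status]) => status.level !== ServiceStatusLevel.available);
  const source = notAvailable.length === 1 ? worstName : 'multiple components';
  return {
    level: worst.level,
    summary: `Kibana is ${ServiceStatusLevel[worst.level]} due to ${source}. See ${statusPageUrl} for more information.`,
  };
}
```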
|
||||
|
||||
### Status inheritance
|
||||
|
||||
By default, plugins inherit their status from all Core services and their dependencies on other plugins.
|
||||
|
||||
This can be summarized by the following matrix:
|
||||
|
||||
| core | required | optional | inherited |
|
||||
|----------------|----------------|----------------|-------------|
|
||||
| critical | _any_ | _any_ | critical |
|
||||
| unavailable | <= unavailable | <= unavailable | unavailable |
|
||||
| degraded | <= degraded | <= degraded | degraded |
|
||||
| <= unavailable | unavailable | <= unavailable | unavailable |
|
||||
| <= degraded | degraded | <= degraded | degraded |
|
||||
| <= degraded | <= degraded | unavailable | degraded |
|
||||
| <= degraded | <= degraded | degraded | degraded |
|
||||
| available | available | available | available |
|
||||
|
||||
If a plugin calls the `StatusSetup#set` API, the inherited status is completely overridden: the status the plugin specifies is the source of truth. If a plugin wishes to "merge" its custom status with the inherited status calculated by Core, it may do so by using the `StatusSetup#derivedStatus$` property in its calculated status.
|
||||
|
||||
If a plugin never calls the `StatusSetup#set` API, the plugin's status defaults to the inherited status.
|
||||
|
||||
_Disabled_ plugins, that is, plugins that are explicitly disabled in Kibana's configuration, do not have any status. They are not present in any status APIs and are **not** considered `unavailable`. Disabled plugins are excluded from the status inheritance calculation, even if a plugin has an optional dependency on a disabled plugin. In summary, if a plugin has an optional dependency on a disabled plugin, the plugin will not be considered `degraded` just because that optional dependency is disabled.
|
||||
|
||||
### HTTP responses
|
||||
|
||||
As specified in the [_Levels section_](#levels), a service's HTTP endpoints will respond with `503 Unavailable` responses in some status levels.
|
||||
|
||||
In both the `critical` and `unavailable` levels, all of a service's endpoints will return 503s. However, in the `degraded` level, it is up to service authors to decide which endpoints should return a 503. This may be implemented directly in the route handler logic or by using any of the [utilities provided](#status-utilities).
|
||||
|
||||
When a 503 is returned either via the default behavior or behavior implemented using the [provided utilities](#status-utilities), the HTTP response will include the following:
|
||||
- `Retry-After` header, set to `60` seconds
|
||||
- A body with mime type `application/json` containing the status of the service the HTTP route belongs to:
|
||||
```json5
|
||||
{
|
||||
"error": "Unavailable",
|
||||
// `ServiceStatus#summary`
|
||||
"message": "Newsfeed API cannot be reached",
|
||||
"attributes": {
|
||||
"status": {
|
||||
// Human readable form of `ServiceStatus#level`
|
||||
"level": "critical",
|
||||
// `ServiceStatus#summary`
|
||||
"summary": "Newsfeed API cannot be reached",
|
||||
// `ServiceStatus#detail` or null
|
||||
"detail": null,
|
||||
// `ServiceStatus#documentationUrl` or null
|
||||
"documentationUrl": null,
|
||||
// JSON-serialized from `ServiceStatus#meta` or null
|
||||
"meta": {}
|
||||
}
|
||||
},
|
||||
"statusCode": 503
|
||||
}
|
||||
```
|
||||
|
||||
## Status Utilities
|
||||
|
||||
Though many plugins should be able to rely on the default status inheritance and associated behaviors, there are common patterns and overrides that some plugins will need. The status service should provide some utilities for these common patterns out-of-the-box.
|
||||
|
||||
```ts
|
||||
/**
|
||||
* Extension of the main Status API
|
||||
*/
|
||||
interface StatusSetup {
|
||||
/**
|
||||
* Helpers for expressing status in HTTP routes.
|
||||
*/
|
||||
http: {
|
||||
/**
|
||||
* High-order route handler function for wrapping routes with 503 logic based
|
||||
* on a predicate.
|
||||
*
|
||||
* @remarks
|
||||
* When a 503 is returned, it also includes detailed information from the service's
|
||||
* current `ServiceStatus` including `meta` information.
|
||||
*
|
||||
* @example
|
||||
* ```ts
|
||||
* router.get(
|
||||
* { path: '/my-api' }
|
||||
* unavailableWhen(
|
||||
* ServiceStatusLevel.degraded,
|
||||
* async (context, req, res) => {
|
||||
* return res.ok({ body: 'done' });
|
||||
* }
|
||||
* )
|
||||
* )
|
||||
* ```
|
||||
*
|
||||
* @param predicate When a level is specified, if the plugin's current status
|
||||
* level is >= to the severity of the specified level, route
|
||||
* returns a 503. When a function is specified, if that
|
||||
* function returns `true`, a 503 is returned.
|
||||
* @param handler The route handler to execute when a 503 is not returned.
|
||||
* @param options.retryAfter Number of seconds to set the `Retry-After`
|
||||
* header to when the endpoint is unavailable.
|
||||
* Defaults to `60`.
|
||||
*/
|
||||
unavailableWhen<P, Q, B>(
|
||||
predicate: ServiceStatusLevel |
|
||||
(self: ServiceStatus, core: CoreStatus, plugins: Record<string, ServiceStatus>) => boolean,
|
||||
handler: RouteHandler<P, Q, B>,
|
||||
options?: { retryAfter?: number }
|
||||
): RouteHandler<P, Q, B>;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Additional Examples
|
||||
|
||||
### Combine inherited status with check against external dependency
|
||||
```ts
|
||||
const getExternalDepHealth = async () => {
|
||||
const resp = await window.fetch('https://myexternaldep.com/_healthz');
|
||||
return resp.json();
|
||||
}
|
||||
|
||||
// Create an observable that checks the status of an external service every 10s
|
||||
const myExternalDependency$: Observable<ServiceStatusLevel> = interval(10000).pipe(
|
||||
  mergeMap(() => getExternalDepHealth()),
|
||||
map(health => health.ok ? ServiceStatusLevel.available : ServiceStatusLevel.unavailable),
|
||||
catchError(() => of(ServiceStatusLevel.unavailable))
|
||||
);
|
||||
|
||||
// Merge the inherited status with the external check
|
||||
core.status.set(
|
||||
combineLatest(
|
||||
    core.status.derivedStatus$,
|
||||
myExternalDependency$
|
||||
).pipe(
|
||||
map(([inherited, external]) => ({
|
||||
level: Math.max(inherited.level, external)
|
||||
}))
|
||||
)
|
||||
);
|
||||
```
|
||||
|
||||
# Drawbacks
|
||||
|
||||
1. **The default behaviors and inheritance of statuses may appear to be "magic" to developers who do not read the documentation about how this works.** Compared to the legacy status mechanism, these defaults are much more opinionated and the resulting status is less explicit in plugin code compared to the legacy `mirrorPluginStatus` mechanism.
|
||||
2. **The default behaviors and inheritance may not fit real-world status very well.** If many plugins must customize their status in order to opt-out of the defaults, this would be a step backwards from the legacy mechanism.
|
||||
|
||||
# Alternatives
|
||||
|
||||
We could somewhat reduce the complexity of the status inheritance by leveraging the dependencies between plugins to enable and disable plugins based on whether or not their upstream dependencies are available. This may simplify plugin code but would greatly complicate how Kibana fundamentally operates, requiring that plugins may get stopped and started multiple times within a single Kibana server process. We would be trading simplicity in one area for complexity in another.
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
By default, most plugins would not need to do much at all. Today, very few plugins leverage the legacy status system. The majority of ones that do, simply call the `mirrorPluginStatus` utility to follow the status of the legacy elasticsearch plugin.
|
||||
|
||||
Plugins that wish to expose more detail about their availability will easily be able to do so, including providing detailed information such as links to documentation to resolve the problem.
|
||||
|
||||
# How we teach this
|
||||
|
||||
This largely follows the same patterns we have used for other Core APIs: Observables, composable utilities, etc.
|
||||
|
||||
This should be taught using the same channels we've leveraged for other Kibana Platform APIs: API documentation, additions to the [Migration Guide](../../src/core/MIGRATION.md) and [Migration Examples](../../src/core/MIGRATION_EXAMPLES.md).
|
||||
|
||||
# Unresolved questions
|
|
@ -1,565 +0,0 @@
|
|||
- Start Date: 2020-04-19
|
||||
- RFC PR: [#64284](https://github.com/elastic/kibana/pull/64284)
|
||||
- Kibana Issue: [#61657](https://github.com/elastic/kibana/issues/61657)
|
||||
|
||||
# Summary
|
||||
|
||||
A new Kibana plugin exposing an API on both public and server side, to allow consumers to search for various objects and
|
||||
register result providers.
|
||||
|
||||
# Basic example
|
||||
|
||||
- registering a result provider:
|
||||
|
||||
```ts
|
||||
setupDeps.globalSearch.registerResultProvider({
|
||||
id: 'my_provider',
|
||||
find: (term, options, context) => {
|
||||
const resultPromise = myService.search(term, context.core.savedObjects.client);
|
||||
return from(resultPromise);
|
||||
},
|
||||
});
|
||||
```
|
||||
|
||||
- using the `find` API from the client-side:
|
||||
|
||||
```ts
|
||||
startDeps.globalSearch.find('some term').subscribe(
|
||||
({ results }) => {
|
||||
updateResults(results);
|
||||
},
|
||||
() => {},
|
||||
() => {
|
||||
showAsyncSearchIndicator(false);
|
||||
}
|
||||
);
|
||||
```
|
||||
|
||||
# Motivation
|
||||
|
||||
Kibana should do its best to assist users searching for and navigating to the various objects present on the Kibana platform.
|
||||
|
||||
We should expose an API to make it possible for plugins to search for the various objects present on a Kibana instance.
|
||||
|
||||
The first consumer of this API will be the global search bar [#57576](https://github.com/elastic/kibana/issues/57576). This API should still be generic to answer similar needs from any other consumer, either client or server side.
|
||||
|
||||
# Detailed design
|
||||
|
||||
## API Design
|
||||
|
||||
### Result provider API
|
||||
|
||||
#### common types
|
||||
|
||||
```ts
|
||||
/**
|
||||
 * Static, non-exhaustive list of the common search types.
|
||||
* Only present to allow consumers and result providers to have aliases to the most common types.
|
||||
*/
|
||||
enum GlobalSearchCommonResultTypes {
|
||||
application = 'application',
|
||||
dashboard = 'dashboard',
|
||||
visualization = 'visualization',
|
||||
search = 'search',
|
||||
}
|
||||
|
||||
/**
|
||||
* Options provided to {@link GlobalSearchResultProvider | result providers} `find` method.
|
||||
*/
|
||||
interface GlobalSearchProviderFindOptions {
|
||||
/**
|
||||
* A custom preference token associated with a search 'session' that should be used to get consistent scoring
|
||||
* when performing calls to ES. Can also be used as a 'session' token for providers returning data from elsewhere
|
||||
* than an elasticsearch cluster.
|
||||
*/
|
||||
preference: string;
|
||||
/**
|
||||
   * Observable that emits once if and when the `find` call has been aborted by the consumer, or when the timeout period has been reached.
|
||||
* When a `find` request is aborted, the service will stop emitting any new result to the consumer anyway, but
|
||||
* this can (and should) be used to cancel any pending asynchronous task and complete the result observable.
|
||||
*/
|
||||
aborted$: Observable<void>;
|
||||
/**
|
||||
* The total maximum number of results (including all batches / emissions) that should be returned by the provider for a given `find` request.
|
||||
* Any result emitted exceeding this quota will be ignored by the service and not emitted to the consumer.
|
||||
*/
|
||||
maxResults: number;
|
||||
}
|
||||
|
||||
/**
|
||||
* Representation of a result returned by a {@link GlobalSearchResultProvider | result provider}
|
||||
*/
|
||||
interface GlobalSearchProviderResult {
|
||||
/** an id that should be unique for an individual provider's results */
|
||||
id: string;
|
||||
/** the title/label of the result */
|
||||
title: string;
|
||||
/** the type of result */
|
||||
type: string;
|
||||
/** an optional EUI icon name to associate with the search result */
|
||||
icon?: string;
|
||||
/**
|
||||
* The url associated with this result.
|
||||
* This can be either an absolute url, a path relative to the basePath, or a structure specifying if the basePath should be prepended.
|
||||
*
|
||||
* @example
|
||||
* `result.url = 'https://kibana-instance:8080/base-path/app/my-app/my-result-type/id';`
|
||||
* `result.url = '/app/my-app/my-result-type/id';`
|
||||
* `result.url = { path: '/base-path/app/my-app/my-result-type/id', prependBasePath: false };`
|
||||
*/
|
||||
url: string | { path: string; prependBasePath: boolean };
|
||||
/** the score of the result, from 1 (lowest) to 100 (highest) */
|
||||
score: number;
|
||||
/** an optional record of metadata for this result */
|
||||
meta?: Record<string, Serializable>;
|
||||
}
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- The `Serializable` type should be implemented and exposed from `core`. A basic implementation could be:
|
||||
|
||||
```ts
|
||||
type Serializable = string | number | boolean | PrimitiveArray | PrimitiveRecord;
|
||||
interface PrimitiveArray extends Array<Serializable> {}
|
||||
interface PrimitiveRecord extends Record<string, Serializable> {}
|
||||
```
|
||||
|
||||
#### server
|
||||
|
||||
```ts
|
||||
/**
|
||||
* Context passed to server-side {@GlobalSearchResultProvider | result provider}'s `find` method.
|
||||
*/
|
||||
export interface GlobalSearchProviderContext {
|
||||
core: {
|
||||
savedObjects: {
|
||||
client: SavedObjectsClientContract;
|
||||
typeRegistry: ISavedObjectTypeRegistry;
|
||||
};
|
||||
elasticsearch: {
|
||||
legacy: {
|
||||
client: IScopedClusterClient;
|
||||
};
|
||||
};
|
||||
uiSettings: {
|
||||
client: IUiSettingsClient;
|
||||
};
|
||||
};
|
||||
}
|
||||
|
||||
/**
|
||||
* GlobalSearch result provider, to be registered using the {@link GlobalSearchSetup | global search API}
|
||||
*/
|
||||
type GlobalSearchResultProvider = {
|
||||
id: string;
|
||||
find(
|
||||
term: string,
|
||||
options: GlobalSearchProviderFindOptions,
|
||||
context: GlobalSearchProviderContext
|
||||
): Observable<GlobalSearchProviderResult[]>;
|
||||
};
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- The initial implementation will only provide a static / non-extensible `GlobalSearchProviderContext` context.
|
||||
It would be possible to allow plugins to register their own context providers as it's done for `RequestHandlerContext`,
|
||||
but this will not be done until the need arises.
|
||||
- The performing `request` object could also be exposed on the context to allow result providers
|
||||
  to scope their custom services if needed. However, as with the previous option, this should only be done once needed.
|
||||
|
||||
#### public
|
||||
|
||||
```ts
|
||||
/**
|
||||
* GlobalSearch result provider, to be registered using the {@link GlobalSearchSetup | global search API}
|
||||
*/
|
||||
type GlobalSearchResultProvider = {
|
||||
id: string;
|
||||
find(
|
||||
term: string,
|
||||
options: GlobalSearchProviderFindOptions
|
||||
): Observable<GlobalSearchProviderResult[]>;
|
||||
};
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- The client-side version of `GlobalSearchResultProvider` is slightly different than the
|
||||
server one, as there is no `context` parameter on the `find` signature.
|
||||
|
||||
### Plugin API
|
||||
|
||||
#### Common types
|
||||
|
||||
```ts
|
||||
/**
|
||||
* Representation of a result returned by the {@link GlobalSearchPluginStart.find | `find` API}
|
||||
*/
|
||||
type GlobalSearchResult = Omit<GlobalSearchProviderResult, 'url'> & {
|
||||
/**
|
||||
* The url associated with this result.
|
||||
* This can be either an absolute url, or a relative path including the basePath
|
||||
*/
|
||||
url: string;
|
||||
};
|
||||
|
||||
|
||||
/**
|
||||
* Response returned from the {@link GlobalSearchServiceStart | global search service}'s `find` API
|
||||
*/
|
||||
type GlobalSearchBatchedResults = {
|
||||
/**
|
||||
* Results for this batch
|
||||
*/
|
||||
results: GlobalSearchResult[];
|
||||
};
|
||||
```
|
||||
|
||||
#### server API
|
||||
|
||||
```ts
|
||||
/**
|
||||
* Options for the server-side {@link GlobalSearchServiceStart.find | find API}
|
||||
*/
|
||||
interface GlobalSearchFindOptions {
|
||||
/**
|
||||
* a custom preference token associated with a search 'session' that should be used to get consistent scoring
|
||||
* when performing calls to ES. Can also be used as a 'session' token for providers returning data from elsewhere
|
||||
* than an elasticsearch cluster.
|
||||
   * If not specified, a random token will be generated and used when calling the underlying result providers.
|
||||
*/
|
||||
preference?: string;
|
||||
/**
|
||||
* Optional observable to notify that the associated `find` call should be canceled.
|
||||
* If/when provided and emitting, the result observable will be completed and no further result emission will be performed.
|
||||
*/
|
||||
aborted$?: Observable<void>;
|
||||
}
|
||||
|
||||
/** @public */
|
||||
interface GlobalSearchPluginSetup {
|
||||
registerResultProvider(provider: GlobalSearchResultProvider);
|
||||
}
|
||||
|
||||
/** @public */
|
||||
interface GlobalSearchPluginStart {
|
||||
find(
|
||||
term: string,
|
||||
options: GlobalSearchFindOptions,
|
||||
request: KibanaRequest
|
||||
): Observable<GlobalSearchBatchedResults>;
|
||||
}
|
||||
```
|
||||
|
||||
#### public API
|
||||
|
||||
```ts
|
||||
/**
|
||||
* Options for the client-side {@link GlobalSearchServiceStart.find | find API}
|
||||
*/
|
||||
interface GlobalSearchFindOptions {
|
||||
/**
|
||||
* Optional observable to notify that the associated `find` call should be canceled.
|
||||
* If/when provided and emitting, the result observable will be completed and no further result emission will be performed.
|
||||
*/
|
||||
aborted$?: Observable<void>;
|
||||
}
|
||||
|
||||
/** @public */
|
||||
interface GlobalSearchPluginSetup {
|
||||
registerResultProvider(provider: GlobalSearchResultProvider);
|
||||
}
|
||||
|
||||
/** @public */
|
||||
interface GlobalSearchPluginStart {
|
||||
find(term: string, options: GlobalSearchFindOptions): Observable<GlobalSearchBatchedResults>;
|
||||
}
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- The public API is very similar to its server counterpart. The differences are:
|
||||
- The `registerResultProvider` setup APIs share the same signature, however the input `GlobalSearchResultProvider`
|
||||
types are different on the client and server.
|
||||
  - The `find` start API signature takes a `KibanaRequest` on the server, while this parameter is not present on the public side.
|
||||
|
||||
#### http API
|
||||
|
||||
An internal HTTP API will be exposed on `/internal/global_search/find` to allow the client-side `GlobalSearch` plugin
|
||||
to fetch results from the server-side result providers.
|
||||
|
||||
It should be very close to:
|
||||
|
||||
```ts
|
||||
router.post(
|
||||
{
|
||||
path: '/internal/global_search/find',
|
||||
validate: {
|
||||
body: schema.object({
|
||||
term: schema.string(),
|
||||
options: schema.maybe(
|
||||
schema.object({
|
||||
preference: schema.maybe(schema.string()),
|
||||
})
|
||||
),
|
||||
}),
|
||||
},
|
||||
},
|
||||
async (ctx, req, res) => {
|
||||
const { term, options } = req.body;
|
||||
const results = await ctx.globalSearch
|
||||
      .find(term, { ...options, aborted$: req.events.aborted$ })
|
||||
.pipe(reduce((acc, results) => [...acc, ...results]))
|
||||
.toPromise();
|
||||
return res.ok({
|
||||
body: {
|
||||
results,
|
||||
},
|
||||
});
|
||||
}
|
||||
);
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- This API is only for internal use and communication between the client and the server parts of the `GS` API. When
|
||||
the need arises to expose an API for external consumers, a new public API will be exposed for that purpose.
|
||||
- A new `globalSearch` context will be exposed on core's `RequestHandlerContext` to wrap a `find` call with current request.
|
||||
- The example implementation awaits all results and then returns them as a single response. Ideally, we would
|
||||
leverage the `bfetch` plugin to stream the results to the client instead.
|
||||
|
||||
## Functional behavior
|
||||
|
||||
### summary
|
||||
|
||||
- the `GlobalSearch` plugin setup contract exposes an API to be able to register result providers (`GlobalSearchResultProvider`).
|
||||
These providers can be registered from either public or server side, even if the interface for each side is not
|
||||
exactly the same.
|
||||
- the `GlobalSearch` plugin start contract exposes an API to be able to search for objects. This API is available from both public
|
||||
and server sides.
|
||||
- When using the server `find` API, only results from providers registered from the server will be returned.
|
||||
- When using the public `find` API, results from providers registered from both the server and public sides will be returned.
|
||||
- During a `find` call, the service will call all the registered result providers and collect their result observables.
|
||||
Every time a result provider emits some new results, the `globalSearch` service will:
|
||||
- process them to convert their url to the expected output format
|
||||
- emit the processed results
|
||||
|
||||
### result provider registration
|
||||
|
||||
Because some kinds of results (i.e. `application`, and maybe later `management_section`) only exist on
|
||||
the public side of Kibana and are therefore not known on the server side, the `registerResultProvider` API will be
|
||||
available both from the public and the server counterpart of the `GlobalSearchPluginSetup` contract.
|
||||
|
||||
However, as results from providers registered from the client-side will not be available from the server's `find` API,
|
||||
registering result providers from the client should only be done to address this specific use case and will be
|
||||
discouraged, by providing appropriate jsdoc and documentation explaining that it should only
|
||||
be used when it is not technically possible to register it from the server side instead.
|
||||
|
||||
### results url processing
|
||||
|
||||
When retrieving results from providers, the GS service will convert them from the provider's `GlobalSearchProviderResult`
|
||||
result type to `GlobalSearchResult`, which is the structure returned from the `GlobalSearchPluginStart.find` observable.
|
||||
|
||||
In the current specification, the only conversion step is to transform the `result.url` property following this logic (a sketch follows the list):
|
||||
|
||||
- if `url` is an absolute url, it will not be modified
|
||||
- if `url` is a relative path, the basePath will be prepended using `basePath.prepend`
|
||||
- if `url` is a `{ path: string; prependBasePath: boolean }` structure:
|
||||
- if `prependBasePath` is true, the basePath will be prepended to the given `path` using `basePath.prepend`
|
||||
- if `prependBasePath` is false, the given `path` will be returned unmodified
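
A minimal sketch of this conversion step (the `ProviderResultUrl` type alias and the narrowed `basePath` parameter are illustrative assumptions; the real service would rely on core's `IBasePath`):

```ts
type ProviderResultUrl = string | { path: string; prependBasePath: boolean };

// Converts a provider result url into the format exposed by the `find` observable.
function convertResultUrl(
  url: ProviderResultUrl,
  basePath: { prepend: (path: string) => string }
): string {
  if (typeof url === 'string') {
    // absolute urls are returned as-is, relative paths get the basePath prepended
    return /^https?:\/\//.test(url) ? url : basePath.prepend(url);
  }
  return url.prependBasePath ? basePath.prepend(url.path) : url.path;
}
```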
|
||||
|
||||
#### redirecting to a result
|
||||
|
||||
Parsing a relative or absolute result url to perform SPA navigation can be non-trivial. This is why `ApplicationService.navigateToUrl` has been introduced in the client-side core API.
|
||||
|
||||
When using `navigateToUrl` with the url of a result instance, the following logic will be executed:
|
||||
|
||||
If all these criteria are true for `url`:
|
||||
|
||||
- (only for absolute URLs) The origin of the URL matches the origin of the browser's current location
|
||||
- The pathname of the URL starts with the current basePath (eg. /mybasepath/s/my-space)
|
||||
- The pathname segment after the basePath matches any known application route (eg. /app/<id>/ or any application's `appRoute` configuration)
|
||||
|
||||
Then: match the pathname segment to the corresponding application and do the SPA navigation to that application using
|
||||
`application.navigateToApp`, passing the remaining pathname segment as the `path` option.
|
||||
|
||||
Otherwise: do a full page navigation using `window.location.assign`
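
A simplified sketch of this decision logic (the `NavDeps` shape and the `appRoutes` map are assumptions made for the example; the real implementation relies on the `ApplicationService` registry):

```ts
interface NavDeps {
  navigateToApp: (appId: string, options?: { path?: string }) => Promise<void>;
  // assumed map of appId -> route prefix, e.g. { dashboards: '/app/dashboards' }
  appRoutes: Record<string, string>;
  basePath: string; // e.g. '/mybasepath/s/my-space'
}

async function navigateToResultUrl(url: string, deps: NavDeps): Promise<void> {
  const target = new URL(url, window.location.origin);
  const sameOrigin = target.origin === window.location.origin;
  const withinBasePath = target.pathname.startsWith(deps.basePath);
  const appPath = target.pathname.slice(deps.basePath.length);
  const match = Object.entries(deps.appRoutes).find(([, route]) => appPath.startsWith(route));

  if (sameOrigin && withinBasePath && match) {
    // known application route: perform SPA navigation, forwarding the remaining path segment
    const [appId, route] = match;
    await deps.navigateToApp(appId, { path: appPath.slice(route.length) || undefined });
  } else {
    // anything else: full page navigation
    window.location.assign(url);
  }
}
```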
|
||||
|
||||
### searching from the server side
|
||||
|
||||
When calling `GlobalSearchPluginStart.find` from the server-side service:
|
||||
|
||||
- the service will call `find` on each server-side registered result provider and collect the resulting result observables
|
||||
|
||||
- then, the service will merge every result observable and trigger the next step on every emission until either
|
||||
- A predefined timeout duration is reached
|
||||
- All result observables are completed
|
||||
|
||||
- on every emission of the merged observable, the results will be processed then emitted.
|
||||
|
||||
A very naive implementation of this behavior would be:
|
||||
|
||||
```ts
|
||||
search(
|
||||
term: string,
|
||||
options: GlobalSearchFindOptions,
|
||||
request: KibanaRequest
|
||||
): Observable<GlobalSearchResponse> {
|
||||
const aborted$ = merge(timeout$, options.aborted$).pipe(first())
|
||||
const fromProviders$ = this.providers.map(p =>
|
||||
p.find(term, { ...options, aborted$ }, contextFromRequest(request))
|
||||
);
|
||||
return merge(...fromProviders$).pipe(
|
||||
takeUntil(aborted$),
|
||||
map(newResults => {
|
||||
return process(newResults);
|
||||
}),
|
||||
);
|
||||
}
|
||||
```
|
||||
|
||||
### searching from the client side
|
||||
|
||||
When calling `GlobalSearchPluginStart.find` from the public-side service:
|
||||
|
||||
- The service will call:
|
||||
|
||||
- the server-side API via an http call to fetch results from the server-side result providers
|
||||
- `find` on each client-side registered result provider and collect the resulting observables
|
||||
|
||||
- Then, the service will merge every result observable and trigger the next step on every emission until either
|
||||
|
||||
- A predefined timeout duration is reached
|
||||
- All result observables are completed
|
||||
|
||||
- on every emission of the merged observable, the results will be processed then emitted.
|
||||
|
||||
A very naive implementation of this behavior would be:
|
||||
|
||||
```ts
|
||||
search(
|
||||
term: string,
|
||||
options: GlobalSearchFindOptions,
|
||||
): Observable<GlobalSearchResponse> {
|
||||
const aborted$ = merge(timeout$, options.aborted$).pipe(first())
|
||||
const fromProviders$ = this.providers.map(p =>
|
||||
p.find(term, { ...options, aborted$ })
|
||||
);
|
||||
const fromServer$ = of(this.fetchServerResults(term, options, aborted$))
|
||||
return merge(...fromProviders$, fromServer$).pipe(
|
||||
takeUntil(aborted$),
|
||||
map(newResults => {
|
||||
return process(newResults);
|
||||
}),
|
||||
);
|
||||
}
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- The example implementation is not streaming results from the server, meaning that all results from server-side
|
||||
registered providers will be fetched and emitted in a single batch. Ideally, we would leverage the `bfetch` plugin
|
||||
to stream the results to the client instead.
|
||||
|
||||
### results sorting
|
||||
|
||||
As the GS `find` API is 'streaming' the results from the result providers by emitting the results in batches, sorting results in
|
||||
each individual batch, even if technically possible, wouldn't provide much value as the consumer will need to sort the
|
||||
aggregated results on each emission anyway. This is why the results emitted by the `find` API should be considered as
|
||||
unsorted. Consumers should implement sorting themselves, using either the `score` attribute, or any other arbitrary logic.
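
For illustration, a consumer could keep an aggregated, score-sorted view of the emitted batches (a sketch; the local `ScoredResult` shape only stands in for `GlobalSearchResult`):

```ts
import { Observable } from 'rxjs';
import { map, scan } from 'rxjs/operators';

interface ScoredResult {
  title: string;
  url: string;
  score: number;
}

// Aggregates every batch emitted by `find` and re-sorts the whole set on each emission.
function sortedResults$(
  batches$: Observable<{ results: ScoredResult[] }>
): Observable<ScoredResult[]> {
  return batches$.pipe(
    scan((acc, batch) => [...acc, ...batch.results], [] as ScoredResult[]),
    map((all) => [...all].sort((a, b) => b.score - a.score))
  );
}
```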
|
||||
|
||||
#### Note on score value
|
||||
|
||||
Due to the fact that the results will be coming from various providers, from multiple ES queries or even not from ES,
|
||||
using a centralized scoring mechanism is not possible.
|
||||
|
||||
The `GlobalSearchResult` contains a `score` field, with an expected value ranging from 1 (lowest) to 100 (highest).
|
||||
How this field is populated from each individual provider is considered an implementation detail.
|
||||
|
||||
### Search cancellation
|
||||
|
||||
Consumers can cancel a `find` call at any time by providing a cancellation observable with
|
||||
the `GlobalSearchFindOptions.aborted$` option and then emitting from it.
|
||||
|
||||
When this observable is provided and emitting, the GS service will complete the result observable.
|
||||
|
||||
This observable will also be passed down to the underlying result providers, which can leverage it to cancel any pending
|
||||
asynchronous task and perform cleanup if necessary.
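
A usage sketch of the cancellation option (assuming `GlobalSearchPluginStart` is the public start contract shown earlier):

```ts
import { Subject } from 'rxjs';

function startCancellableSearch(globalSearch: GlobalSearchPluginStart, term: string) {
  const aborted$ = new Subject<void>();

  const subscription = globalSearch
    .find(term, { aborted$ })
    .subscribe((batch) => console.log('new results batch', batch));

  // Emitting on `aborted$` completes the result observable and lets providers cancel pending work.
  return () => {
    aborted$.next();
    subscription.unsubscribe();
  };
}
```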
|
||||
|
||||
# Drawbacks
|
||||
|
||||
See alternatives.
|
||||
|
||||
# Alternatives
|
||||
|
||||
## Result providers could be only registrable from the server-side API
|
||||
|
||||
The fact that some kinds of results, and therefore some result providers, must be on the client-side makes the API more complex,
|
||||
while leaving these results unavailable from the server-side and HTTP APIs.
|
||||
|
||||
We could decide to only allow providers registration from the server-side. It would reduce API exposure, while simplifying
|
||||
the service implementation. However, to do that, we would need to find a way to implement a server-side
|
||||
result provider for the `application` (and later `management_section`) result types.
|
||||
|
||||
I will directly exclude the option to move the `application` registration (`core.application.register`) from client
|
||||
to the server side, as it's a high-impact (and breaking) change to `core` APIs that would require more reasons
|
||||
than just this RFC/API to consider.
|
||||
|
||||
### AST parsing
|
||||
|
||||
One option to make the `application` results 'visible' from the server-side would be to parse the client code at build time
|
||||
using an AST to find all usages of `application.register`, inspect the parameters, and generate a server file
|
||||
containing the applications. The server-side `application` result provider would then just read this file and use it
|
||||
to return application results.
|
||||
|
||||
However
|
||||
|
||||
- As the parsing would be done at build time, we would not be able to generate entries for any 3rd-party plugins
|
||||
- As entries for every existing application would be generated, the search provider would need to be able to know which
|
||||
applications are actually enabled/accessible at runtime to filter them, which is far from easy
|
||||
- It would also not contain test plugin apps, making it really hard to cover with FTR tests
|
||||
- AST parsing is a complex mechanism for an already unsatisfactory alternative
|
||||
|
||||
### Duplicated server-side `application.register` API
|
||||
|
||||
One other option would be to duplicate the `application.register` API on the server side, with a subset of the
|
||||
client-side metadata.
|
||||
|
||||
```ts
|
||||
core.application.register({
|
||||
id: 'app_status',
|
||||
title: 'App Status',
|
||||
euiIconType: 'snowflake',
|
||||
});
|
||||
```
|
||||
|
||||
This way, the applications could be searchable from the server using this server-side `applications` registry.
|
||||
|
||||
However
|
||||
|
||||
- It forces plugin developers to add this API call. In addition to being a very poor developer experience, it can also
|
||||
very easily be forgotten, making a given app non-searchable
|
||||
- client-side only plugins would need to add a server-side part to their plugin just to register their application on
|
||||
the server side
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
The `globalSearch` service is a new feature provided by the `core` API. Also, the base providers
|
||||
used to search for saved objects and applications will be implemented by the platform team, meaning
|
||||
that by default, plugin developers won't have to do anything.
|
||||
|
||||
Plugins that wish to expose additional result providers will easily be able to do so by using the exposed APIs and
|
||||
documentation.
|
||||
|
||||
# How we teach this
|
||||
|
||||
This follows the same patterns we have used for other Core APIs: Observables subscriptions, etc.
|
||||
|
||||
This should be taught using the same channels we've leveraged for other Kibana Platform APIs, API documentation and
|
||||
example plugins.
|
||||
|
||||
# Unresolved questions
|
||||
|
||||
N/A
|
|
@ -1,284 +0,0 @@
|
|||
- Start Date: 2020-04-23
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
# Summary
|
||||
|
||||
The reporting plugin is migrating to a purely REST API interface, deprecating page-level integrations such as Dashboard and Discover.
|
||||
|
||||
# Basic example
|
||||
|
||||
Currently, reporting does expose an API for Dashboard exports as seen below.
|
||||
|
||||
```sh
|
||||
# Massively truncated URL
|
||||
curl -X POST http://localhost:5601/api/reporting/generate/printablePdf?jobParams=%28browserTimezone%3AAmerica%2FLos_Angeles%2Clayout...
|
||||
```
|
||||
|
||||
Going forward, reporting would only offer a JSON-based REST API, deprecating older ad-hoc solutions:
|
||||
|
||||
```sh
|
||||
curl -X POST http://localhost:5601/api/reporting/pdf
|
||||
{
  "baseUrl": "/my/kibana/page/route?foo=bar&reporting=true",
  "waitUntil": {
    "event": "complete"
  },
  "viewport": {
    "width": 1920,
    "height": 1080,
    "scale": 1
  },
  "mediaType": "screen"
}
|
||||
```
|
||||
|
||||
A simple JSON response is returned, with an identifier to query for status.
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "123"
|
||||
}
|
||||
```
|
||||
|
||||
Further information can be found via GET call with the job's ID:
|
||||
|
||||
```sh
|
||||
curl http://localhost:5601/api/reporting/123/status
|
||||
```
|
||||
|
||||
# Motivation
|
||||
|
||||
The reporting functionality that currently exists in Kibana was originally purpose-built for the Discover, Dashboard and Canvas applications. Because of this, reporting's underlying technologies and infrastructure are hard to improve upon and make generally available for pages across Kibana. Currently, the team has to:
|
||||
|
||||
- Build and maintain our own Chromium binary for the 3 main operating systems we support.
|
||||
- Fix and help troubleshoot issues encountered by our users and their complex deployment topologies.
|
||||
- Ensure successful operation in smaller-sized cloud deployments.
|
||||
- Help other teams get their applications “reportable”.
|
||||
- Continue to adapt changes in Discover and Dashboard so that they can be reportable (WebGL for instance).
|
||||
|
||||
In order to ensure that reporting works in a secure manner, we also maintain complex logic that ensures nothing goes wrong during report generation. This includes:
|
||||
|
||||
- Home-rolled security role checks.
|
||||
- A custom-built network firewall via puppeteer to ensure chromium can’t be hijacked for nefarious purposes.
|
||||
- Network request interception to apply authorization contexts.
|
||||
- Configuration checks for both Elasticsearch and Kibana, to ensure the user's configuration is valid and workable.
|
||||
- CSV formula injection checks, encodings, and other challenges.
|
||||
|
||||
It's important that there be a barrier between *how* reporting works and *how* an application in Kibana is rendered. As of today, no such barrier exists.
|
||||
|
||||
While we understand that many of these requirements are similar across teams, in order to better serve the application teams that depend on reporting, the time has come to rethink reporting's role inside of Kibana and how we can scale it across our product suite.
|
||||
|
||||
# Detailed design
|
||||
|
||||
## REST API
|
||||
|
||||
Though we plan to support additional functionality longer-term (for instance a client-api or support for scheduling), the initial product will solely be a REST API involving a 4 part life-cycle:
|
||||
|
||||
1. Starting a new job.
|
||||
2. Querying a job's status.
|
||||
3. Downloading the job's results.
|
||||
4. Deleting a job
|
||||
|
||||
Reporting will return a list of HTTP codes to indicate acceptance or rejection of any HTTP interaction:
|
||||
|
||||
**Possible HTTP responses**
|
||||
|
||||
`200`: Job is accepted and is queued for execution
|
||||
|
||||
`204`: OK response, but no message returned (used in DELETE calls)
|
||||
|
||||
`400`: There was a malformation of the request, and consumers should review the returned message
|
||||
|
||||
`403`: The user is not allowed to create a job
|
||||
|
||||
`404`: Job wasn't found
|
||||
|
||||
`401`: The request isn't properly authorized
|
||||
|
||||
### 1. Starting a new job
|
||||
|
||||
The primary export type in this phase will be a PDF binary (retrieved in Step 3). Registering can be as complex as below:
|
||||
|
||||
```sh
|
||||
curl -X POST http://kibana-host:kibana-port/api/reporting/pdf
|
||||
[{
  "baseUrl": "/my/kibana/page/route?page=1&reporting=true",
  "waitUntil": {
    "event": "complete"
  },
  "viewport": {
    "width": 1920,
    "height": 1080,
    "scale": 2
  },
  "mediaType": "screen",
  "timeout": 30000
}, {
  "baseUrl": "/my/kibana/page/route?page=2&reporting=true",
  "waitUntil": {
    "event": "complete"
  },
  "viewport": {
    "width": 1920,
    "height": 1080,
    "scale": 2
  },
  "mediaType": "screen",
  "timeout": 30000
}]
|
||||
```
|
||||
|
||||
In the above example, a consumer posts an array of URLs to be exported. When doing so, the assumption is that the pages relate to each other in some fashion (workpads in Canvas, for instance), and thus the export can be optimized by re-using the page and browser objects. It should be noted that even though we're given a collection of pages to export, *they'll be rendered in series and not in parallel*.
|
||||
|
||||
`baseUrl: string`: The URL of the page you wish to export, relative to Kibana's default path. For instance, if canvas wanted to export a page at `http://localhost:5601/app/canvas#/workpad/workpad-e08b9bdb-ec14-4339-94c4-063bddfd610e/page/1`, the `baseUrl` would be `/app/canvas#/workpad/workpad-e08b9bdb-ec14-4339-94c4-063bddfd610e/page/1`. This is done to prevent our chromium process from being "hijacked" to navigate elsewhere. You're free to do whatever you'd like for the URL, including any query-string parameters or other variables, in order to properly render your page for reporting. For instance, you'll notice the `reporting=true` param listed above.
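
As an illustration of that constraint, a hypothetical server-side guard could reject anything that is not a Kibana-relative path before a job is accepted (a sketch only, not the actual validation reporting will use):

```ts
// Rejects absolute and protocol-relative URLs so chromium can only navigate within Kibana.
function assertRelativeBaseUrl(baseUrl: string): void {
  const hasScheme = /^[a-z][a-z0-9+.-]*:/i.test(baseUrl);
  if (hasScheme || baseUrl.startsWith('//')) {
    throw new Error(`baseUrl must be a Kibana-relative path, got: ${baseUrl}`);
  }
}
```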
|
||||
|
||||
`waitUntil: { event: string; selector: string }`: An object specifying a custom `DOM` event to "listen" for in our chromium process, or the presence of a DOM selector. Either option is valid; however, we won't allow both options to be set. For instance, if a page inserts a `<div class="loaded">` in its markup, then the appropriate payload would be:
|
||||
|
||||
```json
|
||||
"waitUntil": {
|
||||
"selector": "div.loaded",
|
||||
},
|
||||
```
|
||||
|
||||
`viewport: { width: number; height: number; scale: number }`: Viewport allows consumers to set rigid dimensions for the browser, so that the formatting of their pages is sized appropriately. Scale, in this context, refers roughly to pixel density when there's a need for a higher resolution. A page that needs a high-resolution PDF could set this as follows:
|
||||
|
||||
```json
|
||||
"viewport": {
  "width": 1920,
  "height": 1080,
  "scale": 2
}
|
||||
```
|
||||
|
||||
`mediaType: "screen" | "print"`: It's often the case that pages would like to use print media-queries, and this allows for opting in or out of that behavior. For example, if a page wishes to utilize its print media queries, it would send a payload with:
|
||||
|
||||
```json
|
||||
"mediaType": "print"
|
||||
```
|
||||
|
||||
`timeout: number`: When present, this allows consumers to override the default reporting timeout. This is useful if a job is known to take much longer to process, or for supporting our users without requiring them to restart their Kibana servers for a simple configuration change. The value here is in milliseconds.
|
||||
|
||||
```json
|
||||
"timeout": 60000
|
||||
```
|
||||
|
||||
**Full job creation example:**
|
||||
|
||||
```curl
|
||||
curl -X POST http://localhost:5601/api/reporting/pdf
|
||||
[{
  "baseUrl": "/my/kibana/page/route?page=1&reporting=true",
  "waitUntil": {
    "event": "complete"
  },
  "viewport": {
    "width": 1920,
    "height": 1080,
    "scale": 2
  },
  "mediaType": "screen",
  "timeout": 30000
}, {
  "baseUrl": "/my/kibana/page/route?page=2&reporting=true",
  "waitUntil": {
    "event": "complete"
  },
  "viewport": {
    "width": 1920,
    "height": 1080,
    "scale": 2
  },
  "mediaType": "screen",
  "timeout": 30000
}]
|
||||
|
||||
# Response (note the single ID of the response)
|
||||
# 200 Content-Type application/json
|
||||
{
|
||||
"id": "123"
|
||||
}
|
||||
```
|
||||
|
||||
### 2. Querying and altering a job's status.
|
||||
|
||||
Once created, a user can simply issue a GET call to see the status of the job.
|
||||
|
||||
**Get a job's status:**
|
||||
|
||||
```curl
|
||||
curl -X GET http://localhost:5601/api/reporting/123/status
|
||||
|
||||
# Response
|
||||
# 200 Content-Type application/json
|
||||
# We might provide other meta-data here as well when required
|
||||
{
|
||||
"status": "pending",
|
||||
"elapsedTime": 12345
|
||||
}
|
||||
```
|
||||
|
||||
Possible types for `status` here are: `pending`, `running`, `complete`, `complete-warnings`, `failed`, or `timedout`. We can add more detail here if needed, such as the current URL being operated on, or whatever other information is valuable to consumers.
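
For illustration, a consumer-side polling loop against this endpoint could look like the sketch below (error handling, authentication headers, and backoff are omitted):

```ts
// Polls the status endpoint until the job reaches a terminal state, then returns that state.
async function waitForReport(kibanaUrl: string, jobId: string, intervalMs = 2000): Promise<string> {
  const terminalStates = ['complete', 'complete-warnings', 'failed', 'timedout'];
  for (;;) {
    const response = await fetch(`${kibanaUrl}/api/reporting/${jobId}/status`);
    const { status } = (await response.json()) as { status: string };
    if (terminalStates.includes(status)) {
      return status;
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```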
|
||||
|
||||
### 4. Deleting a job
|
||||
|
||||
A DELETE call will remove the report. If a report is in a pending/running state, this will attempt to terminate the running job. Once a report is complete, the call to delete will permanently (hard delete) remove the job's output in Elasticsearch.
|
||||
|
||||
When successfully deleted, reporting will simply respond with a `204` HTTP code, indicating success.
|
||||
|
||||
```curl
|
||||
curl -X DELETE http://localhost:5601/api/reporting/123
|
||||
|
||||
# Response (no body, 204 indicates success)
|
||||
# 204 Content-Type text/plain;charset=UTF-8
|
||||
```
|
||||
|
||||
# Drawbacks
|
||||
|
||||
Due to the new nature of this RFC, there are definitely drawbacks to this approach in the short term. These short-term drawbacks become minuscule in the longer term, since the work being done here frees both reporting and downstream teams to operate in parallel.
|
||||
|
||||
- Initial work to build this pipeline will freeze some current efforts (scheduled reports, etc).
|
||||
- Doesn't solve complex architectural issues experienced by our customers.
|
||||
- Requires work to migrate our existing apps (Canvas, dashboard, visualizations).
|
||||
- Doesn't offer any performance characteristics over our current implementation.
|
||||
|
||||
Though there's some acute pain felt here in the shorter term, it pales in comparison to building custom ad-hoc solutions for each application inside of Kibana.
|
||||
|
||||
# Alternatives
|
||||
|
||||
Going through the process of developing this current RFC, we did entertain a few other strategies:
|
||||
|
||||
## No changes in how we operate
|
||||
|
||||
This strategy doesn't scale beyond the current two team members since we field many support issues that are application-specific, and not reporting specific. This keeps our trajectory where it currently is, short term, but hamstrings us longer term. Unfortunately, for teams to have the best experience with regards to reporting, they'll need to have ownership on the rendering aspects of their pages.
|
||||
|
||||
## A new plugin
|
||||
|
||||
We debated offering a new plugin, or having apps consume this type of service as a plugin, but ultimately it was too much overhead for the nature of what we're offering. More information on the prior RFC is here: https://github.com/elastic/kibana/pull/59084.
|
||||
|
||||
## Each page builds its own pipeline
|
||||
|
||||
This would allow teams to operate how they best see fit, but would come with a host of issues:
|
||||
|
||||
- Each team would need to ramp up on how we handle chromium and all of its sharp edges.
|
||||
- Potential for many requests to be in flight at once, causing exhaustion of resources.
|
||||
- Mixed experience across different apps, and varying degrees of success.
|
||||
- No central management of a user's general reports.
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
After work on the service is complete in its initial phase, we'll begin to migrate the Dashboard app over to the new service. This will give a clear example of:
|
||||
|
||||
- Moving a complex page over to this service.
|
||||
- Where the divisions of labor reside (who does what).
|
||||
- How to embed rendering-specific logic into your pages.
|
||||
|
||||
Since reporting only exists on a few select pages, there won't be a need for a massive migration effort. Instead, folks wanting to move over to the new rendering service can simply take a look at how Dashboard handles its exporting.
|
||||
|
||||
In short, the adoption strategy is fairly minimal due to the lack of pages being reported on.
|
||||
|
||||
# Unresolved questions
|
||||
|
||||
- How to troubleshoot complex customer environments?
|
||||
- When do we do this work?
|
||||
- Nuances in the API, are we missing other critical information?
|
|
@ -1,119 +0,0 @@
|
|||
- Start Date: 2020-07-22
|
||||
- RFC PR: [#72828](https://github.com/elastic/kibana/pull/72828)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
# Summary
|
||||
|
||||
This RFC proposes a way of the encryption key (`xpack.encryptedSavedObjects.encryptionKey`) rotation that would allow administrators to seamlessly change existing encryption key without any data loss and manual intervention.
|
||||
|
||||
# Basic example
|
||||
|
||||
When administrators decide to rotate encryption key they will have to generate a new one and move the old key(s) to the `keyRotation` section in the `kibana.yml`:
|
||||
|
||||
```yaml
|
||||
xpack.encryptedSavedObjects:
|
||||
encryptionKey: "NEW-encryption-key"
|
||||
keyRotation:
|
||||
decryptionOnlyKeys: ["OLD-encryption-key-1", "OLD-encryption-key-2"]
|
||||
```
|
||||
|
||||
Before the old decryption-only keys are disposed of, administrators may want to call a dedicated and _protected_ API endpoint that will go through all registered Saved Objects with encrypted attributes and try to re-encrypt them with the primary encryption key:
|
||||
|
||||
```http request
|
||||
POST https://localhost:5601/api/encrypted_saved_objects/rotate_key?conflicts=abort
|
||||
Content-Type: application/json
|
||||
Kbn-Xsrf: true
|
||||
```
|
||||
|
||||
# Motivation
|
||||
|
||||
Today, when the encryption key changes, we can no longer decrypt Saved Object attributes that were previously encrypted with the `EncryptedSavedObjects` plugin. We handle this case in a few different ways depending on how the decryption was triggered:
|
||||
|
||||
* If consumers explicitly request decryption via `getDecryptedAsInternalUser()` we abort operation and throw exception.
|
||||
* If consumers fetch Saved Objects with encrypted attributes that should be automatically decrypted (the ones with `dangerouslyExposeValue: true` marker) via standard Saved Objects APIs we don't abort operation, but rather strip all encrypted attributes from the response and record decryption error in the `error` Saved Object field.
|
||||
* If Kibana tries to migrate encrypted Saved Objects at the start up time we abort operation and throw exception.
|
||||
|
||||
In both of these cases we throw or record error with the specific type to allow consumers to gracefully handle this scenario and either drop Saved Objects with unrecoverable encrypted attributes or facilitate the process of re-entering and re-encryption of the new values.
|
||||
|
||||
This approach works reasonably well in some scenarios, but it may become very troublesome if we have to deal with lots of Saved Objects. Moreover, we'd like to recommend our users to periodically rotate encryption keys even if they aren't compromised. Hence, we need to provide a way of seamless migration of the existing encrypted Saved Objects to a new encryption key.
|
||||
|
||||
There are two main scenarios we'd like to cover in this RFC:
|
||||
|
||||
## Encryption key is not available
|
||||
|
||||
Administrators may lose the existing encryption key or explicitly decide not to use it if it was compromised and users can no longer trust encrypted content that may have been tampered with. In this scenario, the encrypted portion of the existing Saved Objects is considered lost, and the only way to recover from this state is the manual intervention described previously. That means `EncryptedSavedObjects` plugin consumers __should__ continue supporting this scenario even after we implement the proper encryption key rotation mechanism described in this RFC.
|
||||
|
||||
## Encryption key is available, but needs to be rotated
|
||||
|
||||
In this scenario, a new encryption key (the primary encryption key) will be generated, and we will use it to encrypt new or updated Saved Objects. We will still need to know the old encryption key to decrypt existing attributes, but we will no longer use this key to encrypt any of the new or existing Saved Objects. It should also be possible to have multiple old decryption-only keys.
|
||||
|
||||
The old decryption-only keys should eventually be disposed of, and users should have a way to make sure all existing Saved Objects are re-encrypted with the new primary encryption key.
|
||||
|
||||
__NOTE:__ users can get into a state where different Saved Objects are encrypted with different encryption keys even if they didn't intend to rotate the encryption key. We anticipate that it can happen during an initial Elastic Stack HA setup, when by mistake or intentionally different Kibana instances were using different encryption keys. The key rotation mechanism can help to fix this issue without data loss.
|
||||
|
||||
# Detailed design
|
||||
|
||||
The core idea is that when the encryption key needs to be rotated then a new key is generated and becomes a primary one, and the old one moves to the `keyRotation` section:
|
||||
|
||||
```yaml
|
||||
xpack.encryptedSavedObjects:
|
||||
encryptionKey: "NEW-encryption-key"
|
||||
keyRotation:
|
||||
decryptionOnlyKeys: ["OLD-encryption-key"]
|
||||
```
|
||||
|
||||
As the name implies, the keys from `decryptionOnlyKeys` are only used to decrypt content that we cannot decrypt with the primary encryption key. It's allowed to have multiple decryption-only keys at the same time. When a user creates a new Saved Object or updates an existing one, its content is always encrypted with the primary encryption key. The config schema won't allow having the same key in `encryptionKey` and `decryptionOnlyKeys`.
|
||||
|
||||
Having multiple decryption keys at the same time brings one problem though: we need to figure out which key to use to decrypt a specific Saved Object. If our encryption keys could have a unique ID that we would store together with the encrypted data (we cannot use an encryption key hash for that, for obvious reasons) we could know for sure which key to use, but we don't have such functionality right now and it may not be the easiest one to manage through `yml` configuration anyway.
|
||||
|
||||
Instead, this RFC proposes to try the available decryption keys one by one to decrypt a Saved Object, always starting with the primary one. This way we won't incur any penalty while decrypting Saved Objects that are already encrypted with the primary encryption key, but there will still be some cost when we have to perform multiple decryption attempts. See the [`Drawbacks`](#drawbacks) section for the details.
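
A conceptual sketch of this "try keys in order" behavior (the `decrypt` callback stands in for the actual crypto primitive used by the plugin):

```ts
// Tries the primary key first, then every decryption-only key, and returns the first success.
function decryptWithRotation<T>(
  encryptedPayload: string,
  primaryKey: string,
  decryptionOnlyKeys: string[],
  decrypt: (key: string, payload: string) => T // assumed to throw when the key doesn't match
): T {
  const failures: string[] = [];
  for (const key of [primaryKey, ...decryptionOnlyKeys]) {
    try {
      return decrypt(key, encryptedPayload);
    } catch (error) {
      failures.push((error as Error).message);
    }
  }
  throw new Error(`Unable to decrypt with any of the available keys: ${failures.join('; ')}`);
}
```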
|
||||
|
||||
Technically just having `decryptionOnlyKeys` would be enough to cover the majority of the use cases, but the old decryption-only keys should eventually be disposed of. At this point administrators would like to make sure _all_ Saved Objects are encrypted with the new primary encryption key. Another reason to re-encrypt all existing Saved Objects with the new key at once is to preventively reduce the performance impact of the multiple decryption attempts.
|
||||
|
||||
We'd like to make this process as simple as possible while meeting the following requirements:
|
||||
|
||||
* It should not be required to restart Kibana to perform this type of migration since Saved Objects encrypted with another encryption key can theoretically appear at any point in time.
|
||||
* It should be possible to integrate this operation into other operational flows our users may have and any user-friendly key management UIs we may introduce in the future.
|
||||
* Any possible failures that may happen during this operation shouldn't make Kibana nonfunctional.
|
||||
* Ordinary users should not be able to trigger this migration since it may consume a considerable amount of computing resources.
|
||||
|
||||
We think that the best option we have right now is a dedicated API endpoint that would trigger this migration:
|
||||
|
||||
```http request
|
||||
POST https://localhost:5601/api/encrypted_saved_objects/rotate_key?conflicts=abort
|
||||
Content-Type: application/json
|
||||
Kbn-Xsrf: true
|
||||
```
|
||||
|
||||
This will be a protected endpoint and only user with enough privileges will be able to use it.
|
||||
|
||||
Under the hood, we'll scroll over all Saved Objects that are registered with the `EncryptedSavedObjects` plugin and re-encrypt attributes only for those that can only be decrypted with one of the old decryption-only keys. Saved Objects that can be decrypted with the primary encryption key will be ignored. We'll also ignore the ones that cannot be decrypted with any of the available decryption keys at all, and presumably return their IDs in the response.
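
At a pseudocode level, the endpoint handler could look like the sketch below; `scrollRegisteredObjects`, `decryptWithAnyKey`, and `reencryptWithPrimaryKey` are hypothetical helpers standing in for the real plugin internals:

```ts
interface EncryptedObject {
  id: string;
  type: string;
}

async function rotateKey(deps: {
  scrollRegisteredObjects: () => AsyncIterable<EncryptedObject>;
  decryptWithAnyKey: (obj: EncryptedObject) => { usedKey: 'primary' | 'old' | 'none'; attributes?: object };
  reencryptWithPrimaryKey: (obj: EncryptedObject, attributes: object) => Promise<void>;
}) {
  const failed: string[] = [];
  let successful = 0;

  for await (const obj of deps.scrollRegisteredObjects()) {
    const { usedKey, attributes } = deps.decryptWithAnyKey(obj);
    if (usedKey === 'old' && attributes) {
      // only objects that required an old decryption-only key get re-encrypted with the primary key
      await deps.reencryptWithPrimaryKey(obj, attributes);
      successful++;
    } else if (usedKey === 'none') {
      // undecryptable objects are skipped and reported back to the caller
      failed.push(obj.id);
    }
  }
  return { successful, failed };
}
```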
|
||||
|
||||
As for any other encryption or decryption operation we'll record relevant bits in the audit logs.
|
||||
|
||||
# Benefits
|
||||
|
||||
* The concept of decryption-only keys is easy to grasp and allows Kibana to function even if it has a mix of Saved Objects encrypted with different encryption keys.
|
||||
* Support of the key rotation out of the box decreases the chances of the data loss and makes `EncryptedSavedObjects` story more secure and approachable overall.
|
||||
|
||||
# Drawbacks
|
||||
|
||||
* Multiple decryption attempts affect performance. See [the performance test results](https://github.com/elastic/kibana/pull/72420#issue-453400211) for more details, but making two decryption attempts is basically twice as slow as with a single attempt. Although it's only relevant for the encrypted Saved Objects migration performed at the start up time and batch operations that trigger automatic decryption (only for the Saved Objects registered with `dangerouslyExposeValue: true` marker that nobody is using in Kibana right now), we may have more use cases in the future.
|
||||
* Historically we supported Kibana features with either configuration or dedicated UI, but in this case we want to introduce an API endpoint that _should be_ used directly. We may have a key management UI in the future though.
|
||||
|
||||
# Alternatives
|
||||
|
||||
We cannot think of any better alternative for `decryptionOnlyKeys` at the moment, but instead of API endpoint for the batch re-encryption we could potentially use another `kibana.yml` config option. For example `keyRotation.mode: onWrite | onStart | both`, but it feels a bit hacky and cannot be really integrated with anything else.
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
Adoption strategy is pretty straightforward since the feature is an enhancement and doesn't bring any BWC concerns.
|
||||
|
||||
# How we teach this
|
||||
|
||||
Key rotation is a well-known paradigm. We'll update `README.md` of the `EncryptedSavedObjects` plugin and create a dedicated section in the public Kibana documentation.
|
||||
|
||||
# Unresolved questions
|
||||
|
||||
* Is it reasonable to have this feature in Basic?
|
||||
* Are there any other use-cases that are not covered by the proposal?
|
|
@ -1,827 +0,0 @@
|
|||
- Start Date: 2020-05-11
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
---
|
||||
- [1. Summary](#1-summary)
|
||||
- [2. Motivation](#2-motivation)
|
||||
- [3. Saved Object Migration Errors](#3-saved-object-migration-errors)
|
||||
- [4. Design](#4-design)
|
||||
- [4.0 Assumptions and tradeoffs](#40-assumptions-and-tradeoffs)
|
||||
- [4.1 Discover and remedy potential failures before any downtime](#41-discover-and-remedy-potential-failures-before-any-downtime)
|
||||
- [4.2 Automatically retry failed migrations until they succeed](#42-automatically-retry-failed-migrations-until-they-succeed)
|
||||
- [4.2.1 Idempotent migrations performed without coordination](#421-idempotent-migrations-performed-without-coordination)
|
||||
- [4.2.1.1 Restrictions](#4211-restrictions)
|
||||
- [4.2.1.2 Migration algorithm: Cloned index per version](#4212-migration-algorithm-cloned-index-per-version)
|
||||
- [Known weaknesses:](#known-weaknesses)
|
||||
- [4.2.1.3 Upgrade and rollback procedure](#4213-upgrade-and-rollback-procedure)
|
||||
- [4.2.1.4 Handling documents that belong to a disabled plugin](#4214-handling-documents-that-belong-to-a-disabled-plugin)
|
||||
- [5. Alternatives](#5-alternatives)
|
||||
- [5.1 Rolling upgrades](#51-rolling-upgrades)
|
||||
- [5.2 Single node migrations coordinated through a lease/lock](#52-single-node-migrations-coordinated-through-a-leaselock)
|
||||
- [5.2.1 Migration algorithm](#521-migration-algorithm)
|
||||
- [5.2.2 Document lock algorithm](#522-document-lock-algorithm)
|
||||
- [5.2.3 Checking for "weak lease" expiry](#523-checking-for-weak-lease-expiry)
|
||||
- [5.3 Minimize data loss with mixed Kibana versions during 7.x](#53-minimize-data-loss-with-mixed-kibana-versions-during-7x)
|
||||
- [5.4 In-place migrations that re-use the same index (8.0)](#54-in-place-migrations-that-re-use-the-same-index-80)
|
||||
- [5.4.1 Migration algorithm (8.0):](#541-migration-algorithm-80)
|
||||
- [5.4.2 Minimizing data loss with unsupported upgrade configurations (8.0)](#542-minimizing-data-loss-with-unsupported-upgrade-configurations-80)
|
||||
- [5.5 Tag objects as “invalid” if their transformation fails](#55-tag-objects-as-invalid-if-their-transformation-fails)
|
||||
- [6. How we teach this](#6-how-we-teach-this)
|
||||
- [7. Unresolved questions](#7-unresolved-questions)
|
||||
|
||||
# 1. Summary
|
||||
|
||||
Improve the Saved Object migration algorithm to ensure a smooth Kibana upgrade
|
||||
procedure.
|
||||
|
||||
# 2. Motivation
|
||||
|
||||
Kibana version upgrades should have a minimal operational impact. To achieve
|
||||
this, users should be able to rely on:
|
||||
|
||||
1. A predictable downtime window.
|
||||
2. A small downtime window.
|
||||
1. (future) provide a small downtime window on indices with 10k or even
|
||||
100k documents.
|
||||
3. The ability to discover and remedy potential failures before initiating the
|
||||
downtime window.
|
||||
4. Quick roll-back in case of failure.
|
||||
5. Detailed documentation about the impact of downtime on the features they
|
||||
are using (e.g. actions, task manager, fleet, reporting).
|
||||
6. Mixed Kibana versions shouldn’t cause data loss.
|
||||
7. (stretch goal) Maintain read-only functionality during the downtime window.
|
||||
|
||||
The biggest hurdle to achieving the above is Kibana’s Saved Object migrations.
|
||||
Migrations aren’t resilient and require manual intervention anytime an error
|
||||
occurs (see [3. Saved Object Migration
|
||||
Errors](#3-saved-object-migration-errors)).
|
||||
|
||||
It is impossible to discover these failures before initiating downtime. Errors
|
||||
often force users to roll-back to a previous version of Kibana or cause hours
|
||||
of downtime. To retry the migration, users are asked to manually delete a
|
||||
`.kibana_x` index. If done incorrectly this can lead to data loss, making it a
|
||||
terrifying experience (restoring from a pre-upgrade snapshot is a safer
|
||||
alternative but not mentioned in the docs or logs).
|
||||
|
||||
Cloud users don’t have access to Kibana logs to be able to identify and remedy
|
||||
the cause of the migration failure. Apart from blindly retrying migrations by
|
||||
restoring a previous snapshot, cloud users are unable to remedy a failed
|
||||
migration and have to escalate to support which can further delay resolution.
|
||||
|
||||
Taken together, version upgrades are a major operational risk and discourage
|
||||
users from adopting the latest features.
|
||||
|
||||
# 3. Saved Object Migration Errors
|
||||
|
||||
Any of the following classes of errors could result in a Saved Object
|
||||
migration failure which requires manual intervention to resolve:
|
||||
|
||||
1. A bug in a plugin’s registered document transformation function causes it
|
||||
to throw an exception on _valid_ data.
|
||||
2. _Invalid_ data stored in Elasticsearch causes a plugin’s registered
|
||||
document transformation function to throw an exception.
|
||||
3. Failures resulting from an unhealthy Elasticsearch cluster:
|
||||
1. Maximum shards open
|
||||
2. Too many scroll contexts
|
||||
3. `circuit_breaking_exception` (insufficient heap memory)
|
||||
4. `process_cluster_event_timeout_exception` for index-aliases, create-index, put-mappings
|
||||
5. Read-only indices due to low disk space (hitting the flood_stage watermark)
|
||||
6. Re-index failed: search rejected due to missing shards
|
||||
7. `TooManyRequests` while doing a `count` of documents requiring a migration
|
||||
8. Bulk write failed: primary shard is not active
|
||||
4. The Kibana process is killed while migrations are in progress.
|
||||
|
||||
# 4. Design
|
||||
## 4.0 Assumptions and tradeoffs
|
||||
The proposed design makes several important assumptions and tradeoffs.
|
||||
|
||||
**Background:**
|
||||
|
||||
The 7.x upgrade documentation lists taking an Elasticsearch snapshot as a
|
||||
required step, but we instruct users to retry migrations and perform rollbacks
|
||||
by deleting the failed `.kibana_n` index and pointing the `.kibana` alias to
|
||||
`.kibana_n-1`:
|
||||
- [Handling errors during saved object
|
||||
migrations.](https://github.com/elastic/kibana/blob/75444a9f1879c5702f9f2b8ad4a70a3a0e75871d/docs/setup/upgrade/upgrade-migrations.asciidoc#handling-errors-during-saved-object-migrations)
|
||||
- [Rolling back to a previous version of Kibana.](https://github.com/elastic/kibana/blob/75444a9f1879c5702f9f2b8ad4a70a3a0e75871d/docs/setup/upgrade/upgrade-migrations.asciidoc#rolling-back-to-a-previous-version-of-kib)
|
||||
- Server logs from failed migrations.
|
||||
|
||||
**Assumptions and tradeoffs:**
|
||||
1. It is critical to maintain a backup index during 7.x to ensure that anyone
|
||||
following the existing upgrade / rollback procedures don't end up in a
|
||||
position where they no longer can recover their data.
|
||||
1. This excludes us from introducing in-place migrations to support huge
|
||||
indices during 7.x.
|
||||
2. The simplicity of idempotent, coordination-free migrations outweighs the
|
||||
restrictions this will impose on the kinds of migrations we're able to
|
||||
support in the future. See (4.2.1)
|
||||
3. A saved object type (and its associated migrations) will only ever be
|
||||
owned by one plugin. If pluginA registers saved object type `plugin_a_type`
|
||||
then pluginB must never register that same type, even if pluginA is
|
||||
disabled. Although we cannot enforce it on third-party plugins, breaking
|
||||
this assumption may lead to data loss.
|
||||
|
||||
## 4.1 Discover and remedy potential failures before any downtime
|
||||
|
||||
> Achieves goals: (2.3)
|
||||
> Mitigates errors: (3.1), (3.2)
|
||||
|
||||
1. Introduce a CLI option to perform a dry run migration to allow
|
||||
administrators to locate and fix potential migration failures without
|
||||
taking their existing Kibana node(s) offline.
|
||||
2. To have the highest chance of surfacing potential failures such as low disk
|
||||
space, dry run migrations should not be mere simulations. A dry run should
|
||||
perform a real migration in a way that doesn’t impact the existing Kibana
|
||||
cluster.
|
||||
3. The CLI should generate a migration report to make it easy to create a
|
||||
support request from a failed migration dry run.
|
||||
1. The report would be an NDJSON export of all failed objects.
|
||||
2. If support receives such a report, we could modify all the objects to
|
||||
ensure the migration would pass and send this back to the client.
|
||||
3. The client can then import the updated objects using the standard Saved
|
||||
Objects NDJSON import and run another dry run to verify all problems
|
||||
have been fixed.
|
||||
4. Make running dry run migrations a required step in the upgrade procedure
|
||||
documentation.
|
||||
5. (Optional) Add dry run migrations to the standard cloud upgrade procedure?
|
||||
|
||||
## 4.2 Automatically retry failed migrations until they succeed
|
||||
|
||||
> Achieves goals: (2.2), (2.6)
|
||||
> Mitigates errors (3.3) and (3.4)
|
||||
|
||||
External conditions such as failures from an unhealthy Elasticsearch cluster
|
||||
(3.3) can cause the migration to fail. The Kibana cluster should be able to
|
||||
recover automatically once these external conditions are resolved. There are
|
||||
two broad approaches to solving this problem based on whether or not
|
||||
migrations are idempotent:
|
||||
|
||||
| Idempotent migrations |Description |
|
||||
| --------------------- | --------------------------------------------------------- |
|
||||
| Yes | Idempotent migrations performed without coordination |
|
||||
| No | Single node migrations coordinated through a lease / lock |
|
||||
|
||||
Idempotent migrations don't require coordination making the algorithm
|
||||
significantly less complex and will never require manual intervention to
|
||||
retry. We, therefore, prefer this solution, even though it introduces
|
||||
restrictions on migrations (4.2.1.1). For other alternatives that were
|
||||
considered see section [(5)](#5-alternatives).
|
||||
|
||||
## 4.2.1 Idempotent migrations performed without coordination
|
||||
|
||||
The migration system can be said to be idempotent if the same results are
|
||||
produced whether the migration was run once or multiple times. This property
|
||||
should hold even if new (up to date) writes occur in between migration runs
|
||||
which introduces the following restrictions:
|
||||
|
||||
### 4.2.1.1 Restrictions
|
||||
|
||||
1. All document transforms need to be deterministic, that is a document
|
||||
transform will always return the same result for the same set of inputs.
|
||||
2. It should always be possible to construct the exact set of inputs required
|
||||
for (1) at any point during the migration process (before, during, after).
|
||||
|
||||
Although these restrictions require significant changes, they do not prevent
|
||||
known upcoming migrations such as [sharing saved-objects in multiple spaces](https://github.com/elastic/kibana/issues/27004) or [splitting a saved
|
||||
object into multiple child
|
||||
documents](https://github.com/elastic/kibana/issues/26602). To ensure that
|
||||
these migrations are idempotent, they will have to generate new saved object
|
||||
ids deterministically with e.g. UUIDv5 (see the sketch below).
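
For example, a transform that splits a parent object into child documents could derive the new ids from the parent id (a sketch using the `uuid` package; the namespace constant is an arbitrary example value):

```ts
import { v5 as uuidv5 } from 'uuid';

// Arbitrary example namespace; in practice this would be a constant owned by the migration.
const MIGRATION_NAMESPACE = 'b3bd4c66-4ed2-4e35-9f4b-72d3c4a0a1b2';

// Deriving the child id from the parent id and a suffix yields the same id on every run,
// which keeps the document transform idempotent.
function childObjectId(parentId: string, suffix: string): string {
  return uuidv5(`${parentId}:${suffix}`, MIGRATION_NAMESPACE);
}
```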
|
||||
|
||||
|
||||
### 4.2.1.2 Migration algorithm: Cloned index per version
|
||||
Note:
|
||||
- The description below assumes the migration algorithm is released in 7.10.0.
|
||||
So >= 7.10.0 will use the new algorithm.
|
||||
- We refer to the alias and index that outdated nodes use as the source alias
|
||||
and source index.
|
||||
- Every version performs a migration even if mappings or documents aren't outdated.
|
||||
|
||||
1. Locate the source index by fetching kibana indices:
|
||||
|
||||
```
|
||||
GET '/_indices/.kibana,.kibana_7.10.0'
|
||||
```
|
||||
|
||||
The source index is:
|
||||
1. the index the `.kibana` alias points to, or if it doesn't exist,
|
||||
2. the v6.x `.kibana` index
|
||||
|
||||
If none of the aliases exists, this is a new Elasticsearch cluster and no
|
||||
migrations are necessary. Create the `.kibana_7.10.0_001` index with the
|
||||
following aliases: `.kibana` and `.kibana_7.10.0`.
|
||||
2. If the source is a < v6.5 `.kibana` index or < 7.4 `.kibana_task_manager`
|
||||
index prepare the legacy index for a migration:
|
||||
1. Mark the legacy index as read-only and wait for all in-flight operations to drain (requires https://github.com/elastic/elasticsearch/pull/58094). This prevents any further writes from outdated nodes. Assuming this API is similar to the existing `/<index>/_close` API, we expect to receive `"acknowledged" : true` and `"shards_acknowledged" : true`. If all shards don’t acknowledge within the timeout, retry the operation until it succeeds.
|
||||
2. Create a new index which will become the source index after the legacy
|
||||
pre-migration is complete. This index should have the same mappings as
|
||||
the legacy index. Use a fixed index name i.e `.kibana_pre6.5.0_001` or
|
||||
`.kibana_task_manager_pre7.4.0_001`. Ignore index already exists errors.
|
||||
3. Reindex the legacy index into the new source index with the
|
||||
`convertToAlias` script if specified. Use `wait_for_completion: false`
|
||||
to run this as a task. Ignore errors if the legacy source doesn't exist.
|
||||
4. Wait for the reindex task to complete. If the task doesn’t complete
|
||||
within the 60s timeout, log a warning for visibility and poll again.
|
||||
Ignore errors if the legacy source doesn't exist.
|
||||
5. Delete the legacy index and replace it with an alias of the same name
|
||||
```
|
||||
POST /_aliases
|
||||
{
|
||||
"actions" : [
|
||||
{ "remove_index": { "index": ".kibana" } },
|
||||
{ "add": { "index": ".kibana_pre6.5.0_001", "alias": ".kibana" } }
|
||||
]
|
||||
}
|
||||
```
|
||||
Unlike the delete index API, the `remove_index` action will fail if
|
||||
provided with an _alias_. Therefore, if another instance completed this
|
||||
step, the `.kibana` alias won't be added to `.kibana_pre6.5.0_001` a
|
||||
second time. This avoids a situation where `.kibana` could point to both
|
||||
`.kibana_pre6.5.0_001` and `.kibana_7.10.0_001`. These actions are
|
||||
applied atomically so that other Kibana instances will always see either
|
||||
a `.kibana` index or an alias, but never neither.
|
||||
|
||||
Ignore "The provided expression [.kibana] matches an alias, specify the
|
||||
corresponding concrete indices instead." or "index_not_found_exception"
|
||||
errors as this means another instance has already completed this step.
|
||||
6. Use the reindexed legacy `.kibana_pre6.5.0_001` as the source for the rest of the migration algorithm.
|
||||
3. If `.kibana` and `.kibana_7.10.0` both exist and are pointing to the same index, this version's migration has already been completed.
|
||||
1. Because the same version can have plugins enabled at any point in time,
|
||||
migrate outdated documents with step (10) and perform the mappings update in step (11).
|
||||
2. Skip to step (13) to start serving traffic.
|
||||
4. Fail the migration if:
|
||||
1. `.kibana` is pointing to an index that belongs to a later version of Kibana, e.g. `.kibana_7.12.0_001`
|
||||
2. (Only in 8.x) The source index contains documents that belong to an unknown Saved Object type (from a disabled plugin). Log an error explaining that the plugin that created these documents needs to be enabled again or that these objects should be deleted. See section (4.2.1.4).
|
||||
5. Search the source index for documents with types not registered within Kibana. Fail the migration if any document is found.
|
||||
6. Set a write block on the source index. This prevents any further writes from outdated nodes.
|
||||
7. Create a new temporary index `.kibana_7.10.0_reindex_temp` with `dynamic: false` on the top-level mappings so that any kind of document can be written to the index. This allows us to write untransformed documents to the index which might have fields which have been removed from the latest mappings defined by the plugin. Define minimal mappings for the `migrationVersion` and `type` fields so that we're still able to search for outdated documents that need to be transformed.
|
||||
1. Ignore errors if the target index already exists.
|
||||
8. Reindex the source index into the new temporary index using a 'client-side' reindex, by reading batches of documents from the source, migrating them, and indexing them into the temp index.
|
||||
1. Use `op_type=index` so that multiple instances can perform the reindex in parallel (last node running will override the documents, with no effect as the input data is the same)
|
||||
2. Ignore `version_conflict_engine_exception` exceptions as they just mean that another node was indexing the same documents
|
||||
3. If a `target_index_had_write_block` exception is encountered for all documents of a batch, assume that another node already completed the temporary index reindex, and jump to the next step
|
||||
4. If a document transform throws an exception, add the document to a failure list and continue trying to transform all other documents (without writing them to the temp index). If any failures occurred, log the complete list of documents that failed to transform, then fail the migration.
|
||||
9. Clone the temporary index into the target index `.kibana_7.10.0_001`. Since any further writes will only happen against the cloned target index, this prevents a lost delete from occurring where one instance finishes the migration and deletes a document and another instance's reindex operation re-creates the deleted document.
|
||||
1. Set a write block on the temporary index
|
||||
2. Clone the temporary index into the target index while specifying that the target index should have writes enabled.
|
||||
3. If the clone operation fails because the target index already exists, ignore the error and wait for the target index to become green before proceeding.
|
||||
4. (The `001` postfix in the target index name isn't used by Kibana, but allows for re-indexing an index should this be required by an Elasticsearch upgrade. E.g. re-index `.kibana_7.10.0_001` into `.kibana_7.10.0_002` and point the `.kibana_7.10.0` alias to `.kibana_7.10.0_002`.)
|
||||
10. Transform documents by reading batches of outdated documents from the target index then transforming and updating them with optimistic concurrency control.
|
||||
1. Ignore any version conflict errors.
|
||||
2. If a document transform throws an exception, add the document to a failure list and continue trying to transform all other documents. If any failures occurred, log the complete list of documents that failed to transform. Fail the migration.
|
||||
11. Update the mappings of the target index
|
||||
1. Retrieve the existing mappings including the `migrationMappingPropertyHashes` metadata.
|
||||
2. Update the mappings with `PUT /.kibana_7.10.0_001/_mapping`. The API deeply merges any updates so this won't remove the mappings of any plugins that are disabled on this instance but have been enabled on another instance that also migrated this index.
|
||||
3. Ensure that fields are correctly indexed using the target index's latest mappings `POST /.kibana_7.10.0_001/_update_by_query?conflicts=proceed`. In the future we could optimize this query by only targeting documents:
|
||||
1. That belong to a known saved object type.
|
||||
12. Mark the migration as complete. This is done as a single atomic
|
||||
operation (requires https://github.com/elastic/elasticsearch/pull/58100)
|
||||
to guarantee that when multiple versions of Kibana are performing the
|
||||
migration in parallel, only one version will win. E.g. if 7.11 and 7.12
|
||||
are started in parallel and migrate from a 7.9 index, either 7.11 or 7.12
|
||||
should succeed and accept writes, but not both.
|
||||
1. Check that `.kibana` alias is still pointing to the source index
|
||||
2. Point the `.kibana_7.10.0` and `.kibana` aliases to the target index (see the sketch after this list).
|
||||
3. Remove the temporary index `.kibana_7.10.0_reindex_temp`
|
||||
4. If this fails with a "required alias [.kibana] does not exist" error or "index_not_found_exception" for the temporary index, fetch `.kibana` again:
|
||||
1. If `.kibana` is _not_ pointing to our target index fail the migration.
|
||||
2. If `.kibana` is pointing to our target index the migration has succeeded and we can proceed to step (13).
|
||||
13. Start serving traffic. All saved object reads/writes happen through the
|
||||
version-specific alias `.kibana_7.10.0`.
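
To make step (12) concrete, below is a minimal sketch of the atomic alias swap, assuming a hypothetical `es` helper that sends a request to Elasticsearch and returns the parsed JSON body; the index names are the examples used above and the error handling mirrors step (12.4).

```ts
// Hypothetical helper: sends an HTTP request to Elasticsearch and returns the parsed body.
type EsRequest = (method: string, path: string, body?: unknown) => Promise<any>;

/**
 * Marks the migration as complete in one atomic `_aliases` call.
 * If `.kibana` no longer points at the source index, the whole
 * request fails, so only one Kibana version can "win".
 */
async function markMigrationComplete(es: EsRequest) {
  const source = '.kibana_7.9.0_001';          // index `.kibana` pointed to before the migration
  const target = '.kibana_7.10.0_001';         // index created by this migration
  const temp = '.kibana_7.10.0_reindex_temp';  // temporary reindex index

  try {
    await es('POST', '/_aliases', {
      actions: [
        // Fails the whole request if another version already moved `.kibana`.
        { remove: { index: source, alias: '.kibana', must_exist: true } },
        { add: { index: target, alias: '.kibana' } },
        { add: { index: target, alias: '.kibana_7.10.0' } },
        // Remove the temporary index in the same atomic operation.
        { remove_index: { index: temp } },
      ],
    });
  } catch (e) {
    // Another node may have completed the migration first (step 12.4):
    // re-check where `.kibana` points before failing the migration.
    const aliases = await es('GET', '/_alias/.kibana');
    if (!Object.keys(aliases).includes(target)) throw e;
  }
}
```

Because the `remove` action uses `must_exist`, the whole `_aliases` request fails if another version of Kibana already moved `.kibana`, which is what makes the "only one version wins" guarantee possible.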
|
||||
|
||||
Together with the limitations, this algorithm ensures that migrations are
|
||||
idempotent. If two nodes are started simultaneously, both of them will start
|
||||
transforming documents in that version's target index, but because migrations
|
||||
are idempotent, it doesn’t matter which node’s writes win.
|
||||
#### Known weaknesses:
|
||||
(Also present in our existing migration algorithm since v7.4)
|
||||
When the task manager index gets reindexed, a reindex script is applied.
|
||||
Because we delete the original task manager index there is no way to rollback
|
||||
a failed task manager migration without a snapshot. However, losing the task
|
||||
manager data has a fairly low impact.
|
||||
|
||||
(Also present in our existing migration algorithm since v6.5)
|
||||
If the outdated instance isn't shut down before starting the migration, the
|
||||
following data-loss scenario is possible:
|
||||
1. Upgrade a 7.9 index without shutting down the 7.9 nodes
|
||||
2. Kibana v7.10 performs a migration and after completing points `.kibana`
|
||||
alias to `.kibana_7.10.0_001`
|
||||
3. Kibana v7.9 writes unmigrated documents into `.kibana`.
|
||||
4. Kibana v7.10 performs a query based on the updated mappings of documents so
|
||||
results potentially don't match the acknowledged write from step (3).
|
||||
|
||||
Note:
|
||||
- Data loss won't occur if both nodes have the updated migration algorithm
|
||||
proposed in this RFC. It is only when one of the nodes uses the existing
|
||||
algorithm that data loss is possible.
|
||||
- Once v7.10 is restarted, it will transform any outdated documents, making
|
||||
these visible to queries again.
|
||||
|
||||
It is possible to work around this weakness by introducing a new alias such as
|
||||
`.kibana_current` so that after a migration the `.kibana` alias will continue
|
||||
to point to the outdated index. However, we decided to keep using the
|
||||
`.kibana` alias despite this weakness for the following reasons:
|
||||
- Users might rely on `.kibana` alias for snapshots, so if this alias no
|
||||
longer points to the latest index, their snapshots would no longer back up
|
||||
Kibana's latest data.
|
||||
- Introducing another alias introduces complexity for users and support.
|
||||
The steps to diagnose, fix or rollback a failed migration will deviate
|
||||
depending on the 7.x version of Kibana you are using.
|
||||
- The existing Kibana documentation clearly states that outdated nodes should
|
||||
be shut down; this scenario has never been supported by Kibana.
|
||||
|
||||
<details>
|
||||
<summary>In the future, this algorithm could enable (2.6) "read-only functionality during the downtime window" but this is outside of the scope of this RFC.</summary>
|
||||
|
||||
Although the migration algorithm guarantees there's no data loss while providing read-only access to outdated nodes, this could cause plugins to behave in unexpected ways. If we wish to pursue it in the future, enabling read-only functionality during the downtime window will be its own project and must include an audit of all plugins' behaviours.
|
||||
</details>
|
||||
|
||||
### 4.2.1.3 Upgrade and rollback procedure
|
||||
When a newer Kibana starts an upgrade, it blocks all writes to the outdated index to prevent data loss. Since Kibana is not designed to gracefully handle a read-only index, this could have unintended consequences, such as a task executing multiple times but never being able to record that it completed successfully. To prevent this, the following procedure should be followed when upgrading Kibana:
|
||||
|
||||
1. Gracefully shutdown outdated nodes by sending a `SIGTERM` signal
|
||||
1. Node starts returning `503` from its healthcheck endpoint to signal to
|
||||
the load balancer that it's no longer accepting new traffic (requires https://github.com/elastic/kibana/issues/46984).
|
||||
2. Allows ongoing HTTP requests to complete with a configurable timeout
|
||||
before forcefully terminating any open connections.
|
||||
3. Closes any keep-alive sockets by sending a `connection: close` header.
|
||||
4. Shuts down all plugins and Core services.
|
||||
2. (recommended) Take a snapshot of all Kibana's Saved Objects indices. This reduces a rollback to a simple snapshot restore, but a snapshot is not required in order to roll back if a migration fails.
|
||||
3. Start the upgraded Kibana nodes. All running Kibana nodes should be on the same version, have the same plugins enabled and use the same configuration.
|
||||
|
||||
To rollback to a previous version of Kibana with a snapshot
|
||||
1. Shutdown all Kibana nodes.
|
||||
2. Restore the Saved Object indices and aliases from the snapshot
|
||||
3. Start the rollback Kibana nodes. All running Kibana nodes should be on the same rollback version, have the same plugins enabled and use the same configuration.
|
||||
|
||||
To rollback to a previous version of Kibana without a snapshot:
|
||||
(Assumes the migration to 7.11.0 failed)
|
||||
1. Shutdown all Kibana nodes.
|
||||
2. Remove the index created by the failed Kibana migration by using the version-specific alias e.g. `DELETE /.kibana_7.11.0`
|
||||
3. Remove the write block from the rollback index using the `.kibana` alias
|
||||
`PUT /.kibana/_settings {"index.blocks.write": false}`
|
||||
4. Start the rollback Kibana nodes. All running Kibana nodes should be on the same rollback version, have the same plugins enabled and use the same configuration.
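
As an illustration of steps (2) and (3) above, here is a sketch using a hypothetical `es` request helper; the index and alias names are the examples from this section.

```ts
// Hypothetical helper: sends an HTTP request to Elasticsearch and returns the parsed body.
type EsRequest = (method: string, path: string, body?: unknown) => Promise<any>;

async function rollbackWithoutSnapshot(es: EsRequest) {
  // (2) Remove the index created by the failed 7.11.0 migration,
  //     addressed through its version-specific alias.
  await es('DELETE', '/.kibana_7.11.0');

  // (3) Re-enable writes on the index the `.kibana` alias still points to.
  await es('PUT', '/.kibana/_settings', { 'index.blocks.write': false });
}
```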
|
||||
|
||||
### 4.2.1.4 Handling documents that belong to a disabled plugin
|
||||
It is possible for a plugin to create documents in one version of Kibana and then be disabled when Kibana is upgraded to a newer version. Because the plugin is disabled, it cannot register its Saved Object types, including the mappings or any migration transformation functions. These "orphan" documents could cause future problems:
|
||||
- A major version introduces breaking mapping changes that cannot be applied to the data in these documents.
|
||||
- Two majors later, migrations will no longer be able to migrate this old schema and could fail unexpectedly when the plugin is suddenly enabled.
|
||||
|
||||
As a concrete example of the above, consider a user taking the following steps:
|
||||
1. Installs Kibana 7.6.0 with spaces=enabled. The spaces plugin creates a default space saved object.
|
||||
2. User upgrades to 7.10.0 but uses the OSS download, which has spaces=disabled. Although the 7.10.0 spaces plugin includes a migration for space documents, the OSS release cannot migrate the documents or update its mappings.
|
||||
3. User realizes they made a mistake and uses Kibana 7.10.0 with X-Pack and the spaces plugin enabled. At this point we have a completed migration for 7.10.0, but there are outdated spaces documents with migrationVersion=7.6.0 instead of 7.10.0.
|
||||
|
||||
There are several approaches we could take to dealing with these orphan documents:
|
||||
|
||||
1. Start up but refuse to query on types with outdated documents until a user manually triggers a re-migration
|
||||
|
||||
Advantages:
|
||||
- The impact is limited to a single plugin
|
||||
|
||||
Disadvantages:
|
||||
- It might be less obvious that a plugin is in a degraded state unless you read the logs (not possible on Cloud) or view the `/status` endpoint.
|
||||
- If a user doesn't care that the plugin is degraded, orphan documents are carried forward indefinitely.
|
||||
- Since Kibana has started receiving traffic, users can no longer
|
||||
downgrade without losing data. They have to re-migrate, but if that
|
||||
fails they're stuck.
|
||||
- Introduces a breaking change in the upgrade behaviour
|
||||
|
||||
To perform a re-migration:
|
||||
- Remove the `.kibana_7.10.0` alias
|
||||
- Take a snapshot OR set the configuration option `migrations.target_index_postfix: '002'` to create a new target index `.kibana_7.10.0_002` and keep the `.kibana_7.10.0_001` index to be able to perform a rollback.
|
||||
- Start up Kibana
|
||||
|
||||
2. Refuse to start Kibana until the plugin is enabled or its data is deleted
|
||||
|
||||
Advantages:
|
||||
- Admins are forced to deal with the problem as soon as they disable a plugin
|
||||
|
||||
Disadvantages:
|
||||
- Cannot temporarily disable a plugin to aid in debugging or to reduce the load a Kibana plugin places on an ES cluster.
|
||||
- Introduces a breaking change
|
||||
|
||||
3. Refuse to start a migration until the plugin is enabled or its data is deleted
|
||||
|
||||
Advantages:
|
||||
- We force users to enable a plugin or delete the documents which prevents these documents from creating future problems like a mapping update not being compatible because there are fields which are assumed to have been migrated.
|
||||
- We keep the index “clean”.
|
||||
|
||||
Disadvantages:
|
||||
- Since users have to take down outdated nodes before they can start the upgrade, they have to enter the downtime window before they know about this problem. This prolongs the downtime window and in many cases might cause an operations team to have to reschedule their downtime window to give them time to investigate the documents that need to be deleted. Logging an error on every startup could warn users ahead of time to mitigate this.
|
||||
- We don’t expose Kibana logs on Cloud so this will have to be escalated to support and could take 48hrs to resolve (users can safely rollback, but without visibility into the logs they might not know this). Exposing Kibana logs is on the cloud team’s roadmap.
|
||||
- It might not be obvious just from the saved object type, which plugin created these objects.
|
||||
- Introduces a breaking change in the upgrade behaviour
|
||||
|
||||
4. Use a hash of enabled plugins as part of the target index name
|
||||
Using a migration target index name like
|
||||
`.kibana_7.10.0_${hash(enabled_plugins)}_001` we can migrate all documents
|
||||
every time a plugin is enabled / disabled.
|
||||
|
||||
Advantages:
|
||||
- Outdated documents belonging to disabled plugins will be upgraded as soon
|
||||
as the plugin is enabled again.
|
||||
|
||||
Disadvantages:
|
||||
- Disabling / enabling a plugin will cause downtime (breaking change).
|
||||
- When a plugin is enabled, disabled and enabled again our target index
|
||||
will be an existing outdated index which needs to be deleted and
|
||||
re-cloned. Without a way to check if the index is outdated, we cannot
|
||||
deterministically perform the delete and re-clone operation without
|
||||
coordination.
|
||||
|
||||
5. Transform outdated documents (step 8) on every startup
|
||||
Advantages:
|
||||
- Outdated documents belonging to disabled plugins will be upgraded as soon
|
||||
as the plugin is enabled again.
|
||||
|
||||
Disadvantages:
|
||||
- Orphan documents are retained indefinitely so there's still a potential
|
||||
for future problems.
|
||||
- Slightly slower startup time since we have to query for outdated
|
||||
documents every time.
|
||||
|
||||
We prefer option (3) since it provides flexibility for disabling plugins in
|
||||
the same version while also protecting users' data in all cases during an
|
||||
upgrade migration. However, because this is a breaking change we will
|
||||
implement (5) during 7.x and only implement (3) during 8.x.
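
Option (5) depends on being able to find outdated documents at startup. The sketch below shows a simplified version of such a query; the `latestVersions` map and the exact field semantics are assumptions for the example, and the real query would need to account for every registered type.

```ts
// Builds a query matching documents of the given types whose migrationVersion
// is missing or older than the latest version registered for that type.
// `latestVersions` is a hypothetical map, e.g. { dashboard: '7.10.0' }.
function buildOutdatedDocumentsQuery(latestVersions: Record<string, string>) {
  return {
    bool: {
      should: Object.entries(latestVersions).map(([type, version]) => ({
        bool: {
          must: [{ term: { type } }],
          must_not: [{ term: { [`migrationVersion.${type}`]: version } }],
        },
      })),
      minimum_should_match: 1,
    },
  };
}
```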
|
||||
|
||||
# 5. Alternatives
|
||||
## 5.1 Rolling upgrades
|
||||
We considered implementing rolling upgrades to provide zero downtime
|
||||
migrations. However, this would introduce significant complexity for plugins:
|
||||
they will need to maintain up and down migration transformations and ensure
|
||||
that queries match both current and outdated documents across all
|
||||
versions. Although we can afford the once-off complexity of implementing
|
||||
rolling upgrades, the complexity burden of maintaining plugins that support
|
||||
rolling-upgrades will slow down all development in Kibana. Since a predictable
|
||||
downtime window is sufficient for our users, we decided against trying to
|
||||
achieve zero downtime with rolling upgrades. See "Rolling upgrades" in
|
||||
https://github.com/elastic/kibana/issues/52202 for more information.
|
||||
|
||||
## 5.2 Single node migrations coordinated through a lease/lock
|
||||
This alternative is a proposed algorithm for coordinating migrations so that
|
||||
these only happen on a single node and therefore don't have the restrictions
|
||||
found in [(4.2.1.1)](#4311-restrictions). We decided against this algorithm
|
||||
primarily because it is a lot more complex, but also because it could still
|
||||
require manual intervention to retry from certain unlikely edge cases.
|
||||
|
||||
<details>
|
||||
<summary>It's impossible to guarantee that a single node performs the
|
||||
migration and automatically retry failed migrations.</summary>
|
||||
|
||||
Coordination should ensure that only one Kibana node performs the migration at
|
||||
a given time, which can be achieved with a distributed lock built on top of
|
||||
Elasticsearch. For the Kibana cluster to be able to retry a failed migration,
|
||||
a specialized lock is required which expires after a given amount of inactivity.
|
||||
We will refer to such expiring locks as a "lease".
|
||||
|
||||
If a Kibana process stalls, it is possible that the process' lease has expired
|
||||
but the process doesn't yet recognize this and continues the migration. To
|
||||
prevent this from causing data loss each lease should be accompanied by a
|
||||
"guard" that prevents all writes after the lease has expired. See
|
||||
[how to do distributed
|
||||
locking](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html)
|
||||
for an in-depth discussion.
|
||||
|
||||
Elasticsearch doesn't provide any building blocks for constructing such a guard.
|
||||
</details>
|
||||
|
||||
However, we can implement a lock (that never expires) with strong
|
||||
data-consistency guarantees. Because there’s no expiration, a failure between
|
||||
obtaining the lock and releasing it will require manual intervention. Instead
|
||||
of trying to accomplish the entire migration after obtaining a lock, we can
|
||||
only perform the last step of the migration process, moving the aliases, with
|
||||
a lock. A permanent failure in only this last step is not impossible, but very
|
||||
unlikely.
|
||||
|
||||
### 5.2.1 Migration algorithm
|
||||
1. Obtain a document lock (see [5.2.2 Document lock
|
||||
algorithm](#522-document-lock-algorithm)). Convert the lock into a "weak
|
||||
lease" by expiring locks for nodes which aren't active (see [4.2.2.4
|
||||
Checking for lease expiry](#4324-checking-for-lease-expiry)). This "weak
|
||||
lease" doesn't require strict guarantees since it's only used to prevent
|
||||
multiple Kibana nodes from performing a migration in parallel to reduce the
|
||||
load on Elasticsearch.
|
||||
2. Migrate data into a new process specific index (we could use the process
|
||||
UUID that’s used in the lease document like
|
||||
`.kibana_3ef25ff1-090a-4335-83a0-307a47712b4e`).
|
||||
3. Obtain a document lock (see [5.2.2 Document lock
|
||||
algorithm](#522-document-lock-algorithm)).
|
||||
4. Finish the migration by pointing `.kibana` →
|
||||
`.kibana_3ef25ff1-090a-4335-83a0-307a47712b4e`. This automatically releases
|
||||
the document lock (and any leases) because the new index will contain an
|
||||
empty `kibana_cluster_state`.
|
||||
|
||||
If a process crashes or is stopped after (3) but before (4) the lock will have
|
||||
to be manually removed by deleting the `kibana_cluster_state` document from
|
||||
`.kibana` or restoring from a snapshot.
|
||||
|
||||
### 5.2.2 Document lock algorithm
|
||||
To improve on the existing Saved Objects migrations lock, a locking algorithm
|
||||
needs to satisfy the following requirements:
|
||||
- Must guarantee that only a single node can obtain the lock. Since we can
|
||||
only provide strong data-consistency guarantees on the document level in
|
||||
Elasticsearch our locking mechanism needs to be based on a document.
|
||||
- Manually removing the lock
|
||||
- shouldn't have any risk of accidentally causing data loss.
|
||||
- can be done with a single command that's always the same (shouldn’t
|
||||
require trying to find `n` for removing the correct `.kibana_n` index).
|
||||
- Must be easy to retrieve the lock/cluster state to aid in debugging or to
|
||||
provide visibility.
|
||||
|
||||
Algorithm:
|
||||
1. Node reads `kibana_cluster_state` lease document from `.kibana`
|
||||
2. It sends a heartbeat every `heartbeat_interval` seconds by sending an
|
||||
update operation that adds its UUID to the `nodes` array and sets the
|
||||
`lastSeen` value to the current local node time. If the update fails due to
|
||||
a version conflict the update operation is retried after a random delay by
|
||||
fetching the document again and attempting the update operation once more.
|
||||
3. To obtain a lease, a node:
|
||||
1. Fetches the `kibana_cluster_state` document
|
||||
2. If all the nodes’ `hasLock === false` it sets its own `hasLock` to
|
||||
true and attempts to write the document. If the update fails
|
||||
(presumably because of another node’s heartbeat update) it restarts the
|
||||
process to obtain a lease from step (3).
|
||||
3. If another node’s `hasLock === true`, the node failed to acquire a
|
||||
lock and waits until the active lock has expired before attempting to
|
||||
obtain a lock again.
|
||||
4. Once a node is done with its lock, it releases it by fetching and then
|
||||
updating `hasLock = false`. The fetch + update operations are retried until
|
||||
this node’s `hasLock === false`.
|
||||
|
||||
Each machine writes a `UUID` to a file, so a single machine may have multiple
|
||||
processes with the same Kibana `UUID`. We should therefore generate a new UUID
|
||||
just for the lifetime of this process.
|
||||
|
||||
`KibanaClusterState` document format:
|
||||
```js
|
||||
{
nodes: {
|
||||
"852bd94e-5121-47f3-a321-e09d9db8d16e": {
|
||||
version: "7.6.0",
|
||||
lastSeen: [ 1114793, 555149266 ], // process.hrtime() [seconds, nanoseconds] tuple
|
||||
hasLease: true,
|
||||
hasLock: false,
|
||||
},
|
||||
"8d975c5b-cbf6-4418-9afb-7aa3ea34ac90": {
|
||||
version: "7.6.0",
|
||||
lastSeen: [ 1114862, 841295591 ],
|
||||
hasLease: false,
|
||||
hasLock: false,
|
||||
},
|
||||
"3ef25ff1-090a-4335-83a0-307a47712b4e": {
|
||||
version: "7.6.0",
|
||||
lastSeen: [ 1114877, 611368546 ],
|
||||
hasLease: false,
|
||||
hasLock: false,
|
||||
},
|
||||
},
|
||||
oplog: [
|
||||
{op: 'ACQUIRE_LOCK', node: '852bd94e...', timestamp: '2020-04-20T11:58:56.176Z'}
|
||||
]
|
||||
}
|
||||
```
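
To illustrate how step (3) of the algorithm could claim the lock safely, the sketch below uses Elasticsearch's optimistic concurrency control (`_seq_no`/`_primary_term`) on the `kibana_cluster_state` document shown above. The `es` helper, the document id, and the process UUID handling are assumptions; the point is that a conditional write fails with a `version_conflict_engine_exception` if the document changed since it was read, so at most one node can win the race.

```ts
// Hypothetical helper: sends an HTTP request to Elasticsearch and returns the parsed body.
type EsRequest = (method: string, path: string, body?: unknown) => Promise<any>;

async function tryAcquireLock(es: EsRequest, processUuid: string): Promise<boolean> {
  // Read the current cluster state together with its seq_no / primary_term.
  const doc = await es('GET', '/.kibana/_doc/kibana_cluster_state');
  const state = doc._source;

  // (3.3) Another node already holds the lock.
  if (Object.values(state.nodes).some((n: any) => n.hasLock)) return false;

  // (3.2) Claim the lock and write the document back conditionally.
  // Assumes this process already heartbeated itself into `nodes`.
  state.nodes[processUuid].hasLock = true;
  try {
    await es(
      'PUT',
      `/.kibana/_doc/kibana_cluster_state?if_seq_no=${doc._seq_no}&if_primary_term=${doc._primary_term}`,
      state
    );
    return true;
  } catch (e) {
    // A concurrent heartbeat or lock attempt won the race; restart from step (3).
    return false;
  }
}
```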
|
||||
|
||||
### 5.2.3 Checking for "weak lease" expiry
|
||||
The simplest way to check for lease expiry is to inspect the `lastSeen` value.
|
||||
If `lastSeen + expiry_timeout < now`, the lock is considered expired. If there
|
||||
are clock drift or daylight savings time adjustments, there’s a risk that a
|
||||
node loses its lease before `expiry_timeout` has occurred. Since losing a
|
||||
lock prematurely will not lead to data loss, it's not critical that the
|
||||
expiry time is observed under all conditions.
|
||||
|
||||
A slightly safer approach is to use a monotonically increasing clock
|
||||
(`process.hrtime()`) and relative time to determine expiry. Using a
|
||||
monotonically increasing clock guarantees that the clock will always increase
|
||||
even if the system time changes due to daylight savings time, NTP clock syncs,
|
||||
or manually setting the time. To check for expiry, other nodes poll the
|
||||
cluster state document. Once they see that the `lastSeen` value has increased,
|
||||
they capture the current hr time `current_hr_time` and start waiting until
|
||||
`process.hrtime() - current_hr_time > expiry_timeout`. If at that point
|
||||
`lastSeen` hasn’t been updated, the lease is considered to have expired. This
|
||||
means other nodes can take up to `2*expiry_timeout` to recognize an expired
|
||||
lease, but a lease will never expire prematurely.
|
||||
|
||||
Any node that detects an expired lease can release that lease by setting the
|
||||
expired node’s `hasLease = false`. It can then attempt to acquire its lease.
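
A minimal sketch of the expiry check described above, using Node's `process.hrtime.bigint()` (the tuple form shown in the document format above works the same way); the timeout value is an assumption.

```ts
const EXPIRY_TIMEOUT_NS = BigInt(60e9); // hypothetical 60s expiry, in nanoseconds

interface LeaseObservation {
  lastSeen: unknown; // last observed `lastSeen` value from kibana_cluster_state
  observedAt: bigint; // local monotonic time when it was observed
}

// Returns true once `lastSeen` has not changed for at least EXPIRY_TIMEOUT_NS of
// *local* monotonic time. Other nodes may take up to 2x the timeout to notice,
// but a lease is never expired prematurely.
function hasLeaseExpired(prev: LeaseObservation, currentLastSeen: unknown): boolean {
  if (JSON.stringify(currentLastSeen) !== JSON.stringify(prev.lastSeen)) {
    // The node is still heartbeating; reset the observation window.
    prev.lastSeen = currentLastSeen;
    prev.observedAt = process.hrtime.bigint();
    return false;
  }
  return process.hrtime.bigint() - prev.observedAt > EXPIRY_TIMEOUT_NS;
}
```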
|
||||
|
||||
## 5.3 Minimize data loss with mixed Kibana versions during 7.x
|
||||
When multiple versions of Kibana are running at the same time, writes from the
|
||||
outdated node can end up either in the outdated Kibana index, the newly
|
||||
migrated index, or both. New documents added (and some updates) into the old
|
||||
index while a migration is in-progress will be lost. Writes that end up in the
|
||||
new index will be in an outdated format. This could cause queries on the data
|
||||
to only return a subset of the results which leads to incorrect results or
|
||||
silent data loss.
|
||||
|
||||
Minimizing data loss from mixed 7.x versions introduces two additional steps
|
||||
to rollback to a previous version without a snapshot:
|
||||
1. (existing) Point the `.kibana` alias to the previous Kibana index `.kibana_n-1`
|
||||
2. (existing) Delete `.kibana_n`
|
||||
3. (new) Enable writes on `.kibana_n-1`
|
||||
4. (new) Delete the dummy "version lock" document from `.kibana_n-1`
|
||||
|
||||
Since our documentation and server logs have implicitly encouraged users to
|
||||
roll back without using snapshots, many users might have to rely on these
|
||||
additional migration steps to perform a rollback. Since even the existing
|
||||
steps are error prone, introducing more steps will likely introduce more
|
||||
problems than they solve.
|
||||
|
||||
1. All future versions of Kibana 7.x will use the `.kibana_saved_objects`
|
||||
alias to locate the current index. If `.kibana_saved_objects` doesn't
|
||||
exist, newer versions will fall back to reading `.kibana`.
|
||||
2. All future versions of Kibana will locate the index that
|
||||
`.kibana_saved_objects` points to and then read and write directly from
|
||||
the _index_ instead of the alias.
|
||||
3. Before starting a migration:
|
||||
1. Write a new dummy "version lock" document to the `.kibana` index with a
|
||||
`migrationVersion` set to the current version of Kibana. If an outdated
|
||||
node is started up after a migration was started it will detect that
|
||||
newer documents are present in the index and refuse to start up.
|
||||
2. Set the outdated index to read-only. Since `.kibana` is never advanced,
|
||||
it will be pointing to a read-only index, which prevents writes from
|
||||
6.8+ releases which are already online.
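
A sketch of steps (3.1) and (3.2), assuming a hypothetical `es` request helper; the id and shape of the "version lock" document are illustrative only.

```ts
// Hypothetical helper: sends an HTTP request to Elasticsearch and returns the parsed body.
type EsRequest = (method: string, path: string, body?: unknown) => Promise<any>;

async function prepareOutdatedIndex(es: EsRequest, outdatedIndex: string, version: string) {
  // (3.1) Write a dummy "version lock" document so that an outdated node
  // started later sees newer documents and refuses to start up.
  await es('PUT', `/${outdatedIndex}/_doc/version_lock_${version}`, {
    type: 'version_lock', // hypothetical type
    migrationVersion: { version_lock: version },
  });

  // (3.2) Make the outdated index read-only so 6.8+ nodes that are still
  // online cannot write to it via the `.kibana` alias.
  await es('PUT', `/${outdatedIndex}/_settings`, { 'index.blocks.write': true });
}
```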
|
||||
|
||||
## 5.4 In-place migrations that re-use the same index (8.0)
|
||||
> We considered an algorithm that re-uses the same index for migrations and an approach to minimize data-loss if our upgrade procedures aren't followed. This is no longer our preferred approach because of several downsides:
|
||||
> - It requires taking snapshots to prevent data loss so we can only release this in 8.x
|
||||
> - Minimizing data loss with unsupported upgrade configurations adds significant complexity and still doesn't guarantee that data isn't lost.
|
||||
|
||||
### 5.4.1 Migration algorithm (8.0):
|
||||
1. Exit Kibana with a fatal error if a newer node has started a migration by
|
||||
checking for:
|
||||
1. Documents with newer `migrationVersion` numbers.
|
||||
2. If the mappings are out of date, update the mappings to the combination of
|
||||
the index's current mappings and the expected mappings.
|
||||
3. If there are outdated documents, migrate these in batches:
|
||||
1. Read a batch of outdated documents from the index.
|
||||
2. Transform documents by applying the migration transformation functions.
|
||||
3. Update the document batch in the same index using optimistic concurrency
|
||||
control. If a batch fails due to an update version mismatch, continue
|
||||
migrating the other batches (see the sketch after this list).
|
||||
4. If a batch fails due to other reasons, repeat the entire migration process.
|
||||
4. If any of the batches in step (3.3) failed, repeat the entire migration
|
||||
process. This ensures that in-progress bulk update operations from an
|
||||
outdated node won't lead to unmigrated documents still being present after
|
||||
the migration.
|
||||
5. Once all documents are up to date, the migration is complete and Kibana can
|
||||
start serving traffic.
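
The sketch below illustrates step (3) above: the transformed batch is written back with a `_bulk` request carrying each document's `_seq_no`/`_primary_term`, version conflicts are tolerated, and any other failure signals that the whole migration must be repeated. The `es` helper and document shape are assumptions.

```ts
// Hypothetical helper: sends an HTTP request to Elasticsearch; string bodies are
// assumed to be sent as-is with Content-Type: application/x-ndjson (for `_bulk`).
type EsRequest = (method: string, path: string, body?: unknown) => Promise<any>;

interface MigratedDoc {
  _id: string;
  _seq_no: number;
  _primary_term: number;
  _source: Record<string, unknown>; // already transformed by the migration functions
}

// Returns 'ok' if the batch only hit version conflicts (or no errors at all),
// and 'retry' if any other failure occurred and the migration must restart.
async function writeBatchWithOcc(es: EsRequest, index: string, docs: MigratedDoc[]) {
  const ndjson =
    docs
      .flatMap((doc) => [
        { index: { _index: index, _id: doc._id, if_seq_no: doc._seq_no, if_primary_term: doc._primary_term } },
        doc._source,
      ])
      .map((line) => JSON.stringify(line))
      .join('\n') + '\n';

  const response = await es('POST', '/_bulk', ndjson);
  const otherFailures = (response.items as any[])
    .map((item) => item.index?.error)
    .filter((error) => error && error.type !== 'version_conflict_engine_exception');

  return otherFailures.length > 0 ? 'retry' : 'ok';
}
```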
|
||||
|
||||
Advantages:
|
||||
- Not duplicating all documents into a new index will speed up migrations and
|
||||
reduce the downtime window. This will be especially important for the future
|
||||
requirement to support > 10k or > 100k documents.
|
||||
- We can check the health of an existing index before starting the migration,
|
||||
but we cannot detect what kind of failures might occur while creating a new
|
||||
index. Whereas retrying migrations will eventually recover from the errors
|
||||
in (3.3), re-using an index allows us to detect these problems before trying
|
||||
and avoid errors like (3.3.1) altogether.
|
||||
- Single index to backup instead of “index pattern” that matches any
|
||||
`.kibana_n`.
|
||||
- Simplifies Kibana system index Elasticsearch plugin since it needs to work
|
||||
on one index per "tenant".
|
||||
- By leveraging optimistic concurrency control we can further minimize data
|
||||
loss for unsupported upgrade configurations in the future.
|
||||
|
||||
Drawbacks:
|
||||
- Cannot make breaking mapping changes (even though it would have been possible, we have not
|
||||
introduced a breaking mapping change during 7.x).
|
||||
- Rollback is only possible by restoring a snapshot which requires educating
|
||||
users to ensure that they don't rely on `.kibana_n` indices as backups.
|
||||
(Apart from the need to educate users, snapshot restores provide many
|
||||
benefits).
|
||||
- It narrows the second restriction under (4.2.1) even further: migrations
|
||||
cannot rely on any state that could change as part of a migration because we
|
||||
can no longer use the previous index as a snapshot of unmigrated state.
|
||||
- We can’t automatically perform a rollback from a half-way done migration.
|
||||
- It’s impossible to provide read-only functionality for outdated nodes which
|
||||
means we can't achieve goal (2.7).
|
||||
|
||||
### 5.4.2 Minimizing data loss with unsupported upgrade configurations (8.0)
|
||||
> This alternative can reduce some data loss when our upgrade procedure isn't
|
||||
> followed with the algorithm in (5.4.1).
|
||||
|
||||
Even if (4.5.2) is the only supported upgrade procedure, we should try to
|
||||
prevent data loss when these instructions aren't followed.
|
||||
|
||||
To prevent data loss we need to prevent any writes from older nodes. We use
|
||||
a version-specific alias for this purpose. Each time a migration is started,
|
||||
all other aliases are removed. However, aliases are stored inside
|
||||
Elasticsearch's ClusterState and this state could remain inconsistent between
|
||||
nodes for an unbounded amount of time. In addition, bulk operations that were
|
||||
accepted before the alias was removed will continue to run even after removing
|
||||
the alias.
|
||||
|
||||
As a result, Kibana cannot guarantee that there will be no data loss;
|
||||
instead, it aims to minimize data loss as much as possible by adding the bold sections
|
||||
to the migration algorithm from (5.4.1)
|
||||
|
||||
1. **Disable `action.auto_create_index` for the Kibana system indices.**
|
||||
2. Exit Kibana with a fatal error if a newer node has started a migration by
|
||||
checking for:
|
||||
1. **Version-specific aliases on the `.kibana` index with a newer version.**
|
||||
2. Documents with newer `migrationVersion` numbers.
|
||||
3. **Remove all other aliases and create a new version-specific alias for
|
||||
reading and writing to the `.kibana` index .e.g `.kibana_8.0.1`. During and
|
||||
after the migration, all saved object reads and writes use this alias
|
||||
instead of reading or writing directly to the index. By using the atomic
|
||||
`POST /_aliases` API we minimize the chance that an outdated node creating
|
||||
new outdated documents can cause data loss.**
|
||||
4. **Wait for the default bulk operation timeout of 30s. This ensures that any
|
||||
bulk operations accepted before the removal of the alias have either
|
||||
completed or returned a timeout error to its initiator.**
|
||||
5. If the mappings are out of date, update the mappings **through the alias**
|
||||
to the combination of the index's current mappings and the expected
|
||||
mappings. **If this operation fails due to an index missing exception (most
|
||||
likely because another node removed our version-specific alias) repeat the
|
||||
entire migration process.**
|
||||
6. If there are outdated documents, migrate these in batches:
|
||||
1. Read a batch of outdated documents from `.kibana_n`.
|
||||
2. Transform documents by applying the migration functions.
|
||||
3. Update the document batch in the same index using optimistic concurrency
|
||||
control. If a batch fails due to an update version mismatch continue
|
||||
migrating the other batches.
|
||||
4. If a batch fails due other reasons repeat the entire migration process.
|
||||
7. If any of the batches in step (6.3) failed, repeat the entire migration
|
||||
process. This ensures that in-progress bulk update operations from an
|
||||
outdated node won't lead to unmigrated documents still being present after
|
||||
the migration.
|
||||
8. Once all documents are up to date, the migration is complete and Kibana can
|
||||
start serving traffic.
|
||||
|
||||
Steps (2) and (3) of the migration algorithm minimize the chances of the
|
||||
following scenarios occurring but cannot guarantee it. It is therefore useful
|
||||
to enumerate some scenarios and their worst-case impact:
|
||||
1. An outdated node issued a bulk create to its version-specific alias.
|
||||
Because a user doesn't wait for all traffic to drain, a newer node starts
|
||||
its migration before the bulk create was complete. Since this bulk create
|
||||
was accepted before the newer node deleted the previous version-specific
|
||||
aliases, it is possible that the index now contains some outdated documents
|
||||
that the new node is unaware of and doesn't migrate. Although these outdated
|
||||
documents can lead to inconsistent query results and data loss, step (4)
|
||||
ensures that an error will be returned to the node that created these
|
||||
objects.
|
||||
2. A 8.1.0 node and a 8.2.0 node starts migrating a 8.0.0 index in parallel.
|
||||
Even though the 8.2.0 node will remove the 8.1.0 version-specific aliases,
|
||||
the 8.1.0 node could have sent a bulk update operation that got accepted
|
||||
before its alias was removed. When the 8.2.0 node tries to migrate these
|
||||
8.1.0 documents it gets a version conflict but cannot be sure if this was
|
||||
because another node of the same version migrated this document (which can
|
||||
safely be ignored) or interference from a different Kibana version. The
|
||||
8.1.0 node will hit the error in step (6.3) and restart the migration but
|
||||
then ultimately fail at step (2). The 8.2.0 node will repeat the entire
|
||||
migration process from step (7) thus ensuring that all documents are up to
|
||||
date.
|
||||
3. A race condition with another Kibana node on the same version, but with
|
||||
different enabled plugins caused this node's required mappings to be
|
||||
overwritten. If this causes a mapper parsing exception in step (6.3) we can
|
||||
restart the migration. Because updating the mappings is additive and saved
|
||||
object types are unique to a plugin, restarting the migration will allow
|
||||
the node to update the mappings to be compatible with node's plugins. Both
|
||||
nodes will be able to successfully complete the migration of their plugins'
|
||||
registered saved object types. However, if the migration doesn't trigger a
|
||||
mapper parsing exception the incompatible mappings would go undetected
|
||||
which can cause future problems like write failures or inconsistent query
|
||||
results.
|
||||
|
||||
## 5.5 Tag objects as “invalid” if their transformation fails
|
||||
> This alternative prevents a failed migration when there's a migration transform function bug or a document with invalid data. Although it seems preferable to not fail the entire migration because of a single saved object type's migration transform bug or a single invalid document this has several pitfalls:
|
||||
> 1. When an object fails to migrate, the data for that saved object type becomes inconsistent. This could lead to a critical feature being unavailable to a user, leaving them with no choice but to downgrade.
|
||||
> 2. Because Kibana starts accepting traffic after encountering invalid objects, a rollback will lead to data loss, leaving users with no clean way to recover.
|
||||
> As a result, we prefer to let an upgrade fail and make it easy for users to roll back until they can resolve the root cause.
|
||||
|
||||
> Achieves goals: (2.2)
|
||||
> Mitigates Errors (3.1), (3.2)
|
||||
|
||||
1. Tag objects as “invalid” if they cause an exception when being transformed,
|
||||
but don’t fail the entire migration (a sketch follows this list).
|
||||
2. Log an error message informing administrators that there are invalid
|
||||
objects which require inspection. For each invalid object, provide an error
|
||||
stack trace to aid in debugging.
|
||||
3. Administrators should be able to generate a migration report (similar to
|
||||
the one dry run migrations create) which is an NDJSON export of all objects
|
||||
tagged as “invalid”.
|
||||
1. Expose this as an HTTP API first
|
||||
2. (later) Notify administrators and allow them to export invalid objects
|
||||
from the Kibana UI.
|
||||
4. When an invalid object is read, the Saved Objects repository will throw an
|
||||
invalid object exception which should include a link to the documentation
|
||||
to help administrators resolve migration bugs.
|
||||
5. Educate Kibana developers to no longer simply write back an unmigrated
|
||||
document if an exception occurred. A migration function should either
|
||||
successfully transform the object or throw.
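
For illustration only, here is a sketch of what (1) and (5) could look like in a migration pipeline: the transform either succeeds or the document is tagged as invalid and reported, and is never silently written back unmigrated. The `invalid` marker, logger, and document shape are hypothetical.

```ts
type TransformFn = (doc: Record<string, any>) => Record<string, any>;

interface InvalidDoc {
  id: string;
  error: string;
}

function transformOrTagInvalid(
  doc: Record<string, any>,
  transform: TransformFn,
  invalid: InvalidDoc[],
  log: { error: (msg: string) => void }
): Record<string, any> {
  try {
    return transform(doc);
  } catch (e) {
    const message = e instanceof Error ? e.stack ?? e.message : String(e);
    // Report the failure with a stack trace to aid debugging, but keep migrating.
    log.error(`Failed to transform saved object ${doc.id}: ${message}`);
    invalid.push({ id: doc.id, error: message });
    // Hypothetical marker; invalid objects would be rejected by the repository until re-migrated.
    return { ...doc, invalid: true };
  }
}
```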
|
||||
|
||||
# 6. How we teach this
|
||||
1. Update documentation and server logs to start educating users to depend on
|
||||
snapshots for Kibana rollbacks.
|
||||
2. Update developer documentation and educate developers with best practices
|
||||
for writing migration functions.
|
||||
|
||||
# 7. Unresolved questions
|
||||
1. When cloning an index we can only ever add new fields to the mappings. When
|
||||
a saved object type or specific field is removed, the mappings will remain
|
||||
until we re-index. Is it sufficient to only re-index every major? How do we
|
||||
track the field count as it grows over every upgrade?
|
||||
2. More generally, how do we deal with the growing field count approaching the
|
||||
default limit of 1000?
|
|
@ -1,488 +0,0 @@
|
|||
- Start Date: (fill me in with today's date, YYYY-MM-DD)
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
- Architecture diagram: https://app.lucidchart.com/documents/edit/cf35b512-616a-4734-bc72-43dde70dbd44/0_0
|
||||
- Mockups: https://www.figma.com/proto/FD2M7MUpLScJKOyYjfbmev/ES-%2F-Query-Management-v4?node-id=440%3A1&viewport=984%2C-99%2C0.09413627535104752&scaling=scale-down
|
||||
- Old issue: https://github.com/elastic/kibana/issues/53335
|
||||
- Search Sessions roadmap: https://github.com/elastic/kibana/issues/61738
|
||||
- POC: https://github.com/elastic/kibana/pull/64641
|
||||
|
||||
# Summary
|
||||
|
||||
Search Sessions will enable Kibana applications and solutions to start a group of related search requests (such as those coming from a single load of a dashboard or SIEM timeline), navigate away or close the browser, then retrieve the results when they have completed.
|
||||
|
||||
# Basic example
|
||||
|
||||
At its core, search sessions are enabled via several new APIs that:
|
||||
- Start a session, associating multiple search requests with a single entity
|
||||
- Store the session (and continue search requests in the background)
|
||||
- Restore the saved search session
|
||||
|
||||
```ts
|
||||
const searchService = dataPluginStart.search;
|
||||
|
||||
if (appState.sessionId) {
|
||||
// If we are restoring a session, set the session ID in the search service
|
||||
searchService.session.restore(appState.sessionId);
|
||||
} else {
|
||||
// Otherwise, start a new search session to associate our search requests
|
||||
appState.sessionId = searchService.session.start();
|
||||
}
|
||||
|
||||
// Search, passing in the generated session ID.
|
||||
// If this is a new session, the `search_interceptor` will associate and keep track of the async search ID with the session ID.
|
||||
// If this is a restored session, the server will immediately return saved results.
|
||||
// In the case where there is no saved result for a given request, or if the results have expired, `search` will throw an error with a meaningful error code.
|
||||
const request = buildKibanaRequest(...);
|
||||
request.sessionId = searchService.session.get();
|
||||
const response$ = await searchService.search(request);
|
||||
|
||||
// Calling `session.store()`, creates a saved object for this session, allowing the user to navigate away.
|
||||
// The session object will be saved with all async search IDs that were executed so far.
|
||||
// Any follow up searches executed with this sessionId will be saved into this object as well.
|
||||
const searchSession = await searchService.session.store();
|
||||
```
|
||||
|
||||
# Motivation
|
||||
|
||||
Kibana is great at providing fast results from large sets of "hot" data. However, there is an increasing number of use cases where users want to analyze large amounts of "colder" data (such as year-over-year reports, historical or audit data, batch queries, etc.).
|
||||
|
||||
For these cases, users run into two limitations:
|
||||
1. Kibana has a default timeout of 30s per search. This is controlled by the `elasticsearch.requestTimeout` setting (originally intended to protect clusters from unintentional overload by a single query).
|
||||
2. Kibana cancels queries upon navigating away from an application, once again, as means of protecting clusters and reducing unnecessary load.
|
||||
|
||||
In 7.7, with the introduction of the `_async_search` API in Elasticsearch, we provided Kibana users a way to bypass the timeout, but users still need to remain on-screen for the entire duration of the search requests.
|
||||
|
||||
The primary motivation of this RFC is to enable users to do the following without needing to keep Kibana open, or while moving onto other work inside Kibana:
|
||||
|
||||
- Run long search requests (beyond 30 seconds)
|
||||
- View their status (complete/incomplete)
|
||||
- Cancel incomplete search requests
|
||||
- Retrieve completed search request results
|
||||
|
||||
# Detailed design
|
||||
|
||||
Because a single view (such as a dashboard with multiple visualizations) can initiate multiple search requests, we need a way to associate the search requests together in a single entity.
|
||||
|
||||
We call this entity a `session`, and when a user decides that they want to continue running the search requests while moving onto other work, we will create a saved object corresponding with that specific `session`, persisting the *sessionId* along with a mapping of each *request's hash* to the *async ID* returned by Elasticsearch.
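
For illustration, the mapping could be keyed by a `sha-256` hash of the serialized search request, as sketched below. The serialization shown here is an assumption; whatever is used must be deterministic so that the same request produces the same hash when the session is restored.

```ts
import { createHash } from 'crypto';

// Hash the serialized search request; the same request issued when restoring the
// session must produce the same hash to find its saved async search ID.
// (Assumes a stable serialization, i.e. deterministic key order in practice.)
function hashRequest(request: object): string {
  return createHash('sha256').update(JSON.stringify(request)).digest('hex');
}

// sessionId -> (request hash -> Elasticsearch async search ID)
const idMapping = new Map<string, Map<string, string>>();

function trackSearchId(sessionId: string, request: object, searchId: string) {
  const requests = idMapping.get(sessionId) ?? new Map<string, string>();
  requests.set(hashRequest(request), searchId);
  idMapping.set(sessionId, requests);
}
```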
|
||||
|
||||
## High Level Flow Charts
|
||||
|
||||
### Client side search
|
||||
|
||||
This diagram matches any case where `data.search` is called from the front end:
|
||||
|
||||

|
||||
|
||||
### Server side search
|
||||
|
||||
This case happens if the server is the one to invoke the `data.search` endpoint, for example with TSVB.
|
||||
|
||||

|
||||
|
||||
## Data and Saved Objects
|
||||
|
||||
### Search Session Status
|
||||
|
||||
```ts
|
||||
export enum SearchSessionStatus {
|
||||
Running, // The session has at least one running search ID associated with it.
|
||||
Done, // All search IDs associated with this session have completed.
|
||||
Error, // At least one search ID associated with this session had an error.
|
||||
Expired, // The session has expired. Associated search ID data was cleared from ES.
|
||||
}
|
||||
```
|
||||
|
||||
### Saved Object Structure
|
||||
|
||||
The saved object created for a search session will be scoped to a single space, and will be a `hidden` saved object
|
||||
(so that it doesn't show in the management listings). We will provide a separate interface for users to manage their own
|
||||
saved search sessions (which will use the `list`, `expire`, and `extend` methods described below, which will be restricted
|
||||
per-user).
|
||||
|
||||
```ts
|
||||
interface SearchSessionAttributes extends SavedObjectAttributes {
|
||||
sessionId: string;
|
||||
userId: string; // Something unique to the user who generated this session, like username/realm-name/realm-type
|
||||
status: SearchSessionStatus;
|
||||
name: string;
|
||||
creation: Date;
|
||||
expiration: Date;
|
||||
idMapping: { [key: string]: string };
|
||||
url: string; // A URL relative to the Kibana root to retrieve the results of a completed search session (and/or to return to an incomplete view)
|
||||
metadata: { [key: string]: any } // Any data the specific application requires to restore a search session view
|
||||
}
|
||||
```
|
||||
|
||||
The URL that is provided will need to be generated by the specific application implementing search sessions. We
|
||||
recommend using the URL generator to ensure that URLs are backwards-compatible since search sessions may exist as
|
||||
long as a user continues to extend the expiration.
|
||||
|
||||
## Frontend Services
|
||||
|
||||
Most sessions will probably not be saved. Therefore, to avoid creating unnecessary saved objects, the browser will keep track of requests and their respective search IDs, until the user chooses to store the session. Once a session is stored, any additional searches will be immediately saved on the server side.
|
||||
|
||||
### New Session Service
|
||||
|
||||
We will expose a new frontend `session` service on the `data` plugin `search` service.
|
||||
|
||||
The service will expose the following APIs:
|
||||
|
||||
```ts
|
||||
interface ISessionService {
|
||||
/**
|
||||
* Returns the current session ID
|
||||
*/
|
||||
getActiveSessionId: () => string;
|
||||
|
||||
/**
|
||||
* Sets the current session
|
||||
* @param sessionId: The ID of the session to set
|
||||
* @param isRestored: Whether or not the session is being restored
|
||||
*/
|
||||
setActiveSessionId: (sessionId: string, isRestored: boolean) => void;
|
||||
|
||||
/**
|
||||
* Start a new session, by generating a new session ID (calls `setActiveSessionId` internally)
|
||||
*/
|
||||
start: () => string;
|
||||
|
||||
/**
|
||||
* Store a session, along with any tracked searchIds.
|
||||
* @param sessionId Session ID to store. Probably retrieved from `sessionService.get()`.
|
||||
* @param name A display name for the session.
|
||||
* @param url TODO: is the URL provided here? How?
|
||||
* @returns The stored `SearchSessionAttributes` object
|
||||
* @throws Throws an error in OSS.
|
||||
*/
|
||||
store: (sessionId: string, name: string, url: string) => Promise<SearchSessionAttributes>
|
||||
|
||||
/**
|
||||
* @returns Is the current session stored (i.e. is there a saved object corresponding with this sessionId).
|
||||
*/
|
||||
isStored: () => boolean;
|
||||
|
||||
/**
|
||||
* @returns Is the current session a restored session
|
||||
*/
|
||||
isRestored: () => boolean;
|
||||
|
||||
/**
|
||||
* Mark a session and all associated searchIds as expired.
|
||||
* Cancels active requests, if there are any.
|
||||
* @param sessionId Session ID to expire. Probably retrieved from `sessionService.get()`.
|
||||
* @returns success status
|
||||
* @throws Throws an error in OSS.
|
||||
*/
|
||||
expire: (sessionId: string) => Promise<boolean>
|
||||
|
||||
/**
|
||||
* Extend a session and all associated searchIds.
|
||||
* @param sessionId Session ID to extend. Probably retrieved from `sessionService.get()`.
|
||||
* @param extendBy Time to extend by, can be a relative or absolute string.
|
||||
* @returns success status
|
||||
* @throws Throws an error in OSS.
|
||||
*/
|
||||
extend: (sessionId: string, extendBy: string)=> Promise<boolean>
|
||||
|
||||
/**
|
||||
* @param sessionId the ID of the session to retrieve the saved object.
|
||||
* @returns the SearchSessionAttributes object for the given session.
|
||||
* @throws Throws an error in OSS.
|
||||
*/
|
||||
get: (sessionId: string) => Promise<SearchSessionAttributes>
|
||||
|
||||
/**
|
||||
* @param options The options to query for specific search session saved objects.
|
||||
* @returns a filtered list of SearchSessionAttributes objects.
|
||||
* @throws Throws an error in OSS.
|
||||
*/
|
||||
list: (options: SavedObjectsFindOptions) => Promise<SearchSessionAttributes[]>
|
||||
|
||||
/**
|
||||
* Clears out any session info as well as the current session. Called internally whenever the user navigates
|
||||
* between applications.
|
||||
* @internal
|
||||
*/
|
||||
clear: () => void;
|
||||
|
||||
/**
|
||||
* Track a search ID of a sessionId, if it exists. Called internally by the search service.
|
||||
* @param sessionId
|
||||
* @param request
|
||||
* @param searchId
|
||||
* @internal
|
||||
*/
|
||||
trackSearchId: (
|
||||
sessionId: string,
|
||||
request: IKibanaSearchRequest,
|
||||
searchId: string,
|
||||
) => Promise<boolean>
|
||||
}
|
||||
```
|
||||
|
||||
## Backend Services and Routes
|
||||
|
||||
The server side's feature implementation builds on how Elasticsearch's `async_search` endpoint works. When making an
|
||||
initial new request to Elasticsearch, it returns a search ID that can be later used to retrieve the results.
|
||||
|
||||
The server will then store that `request`, `sessionId`, and `searchId` in a mapping in memory, and periodically query
|
||||
for a saved object corresponding with that session. If the saved object is found, it will update the saved object to
|
||||
include this `request`/`searchId` combination, and remove it from memory. If, after a period of time (5 minutes?) the
|
||||
saved object has not been found, we will stop polling for that `sessionId` and remove the `request`/`searchId` from
|
||||
memory.
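
A rough sketch of that flow is shown below. `SessionStore` stands in for the saved-objects calls, and the timeout value mirrors the "5 minutes?" mentioned above; both are assumptions for the example.

```ts
interface TrackedSession {
  firstSeen: number; // Date.now() when the first searchId was tracked
  searchIds: Map<string, string>; // request hash -> async search ID
}

// Hypothetical dependencies standing in for the saved-objects client calls.
interface SessionStore {
  findSessionSavedObject(sessionId: string): Promise<unknown | undefined>;
  addSearchIdsToSession(sessionId: string, ids: Map<string, string>): Promise<void>;
}

const TRACKING_TIMEOUT_MS = 5 * 60 * 1000; // stop polling after ~5 minutes
const tracked = new Map<string, TrackedSession>();

// Called periodically on the Kibana server (e.g. from a timer).
async function flushTrackedSessions(store: SessionStore) {
  for (const [sessionId, session] of tracked) {
    const savedObject = await store.findSessionSavedObject(sessionId);
    if (savedObject) {
      // The user stored the session: persist the tracked ids and stop polling.
      await store.addSearchIdsToSession(sessionId, session.searchIds);
      tracked.delete(sessionId);
    } else if (Date.now() - session.firstSeen > TRACKING_TIMEOUT_MS) {
      // The session was never stored: drop the tracked ids.
      tracked.delete(sessionId);
    }
  }
}
```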
|
||||
|
||||
When the server receives a search request that has a `sessionId` and is marked as a `restore` request, the server will
|
||||
attempt to find the correct id within the saved object, and use it to retrieve the results previously saved.
|
||||
|
||||
### New Session Service
|
||||
|
||||
```ts
|
||||
interface ISessionService {
|
||||
/**
|
||||
* Adds a search ID to a Search Session, if it exists.
|
||||
* Also extends the expiration of the search ID to match the session's expiration.
|
||||
* @param request
|
||||
* @param sessionId
|
||||
* @param searchId
|
||||
* @returns true if id was added, false if Search Session doesn't exist or if there was an error while updating.
|
||||
* @throws an error if `searchId` already exists in the mapping for this `sessionId`
|
||||
*/
|
||||
trackSearchId: (
|
||||
request: KibanaRequest,
|
||||
sessionId: string,
|
||||
searchId: string,
|
||||
) => Promise<boolean>
|
||||
|
||||
/**
|
||||
* Get a Search Session object.
|
||||
* @param request
|
||||
* @param sessionId
|
||||
* @returns the Search Session object if exists, or undefined.
|
||||
*/
|
||||
get: (
|
||||
request: KibanaRequest,
|
||||
sessionId: string
|
||||
) => Promise<SearchSessionAttributes | undefined>
|
||||
|
||||
/**
|
||||
* Get a searchId from a Search Session object.
|
||||
* @param request
|
||||
* @param sessionId
|
||||
* @returns the searchID if exists on the Search Session, or undefined.
|
||||
*/
|
||||
getSearchId: (
|
||||
request: KibanaRequest,
|
||||
sessionId: string
|
||||
) => Promise<string | undefined>
|
||||
|
||||
/**
|
||||
* Store a session.
|
||||
* @param request
|
||||
* @param sessionId Session ID to store. Probably retrieved from `sessionService.get()`.
|
||||
* @param searchIdMap A mapping of hashed requests mapped to the corresponding searchId.
|
||||
* @param url TODO: is the URL provided here? How?
|
||||
* @returns The stored `SearchSessionAttributes` object
|
||||
* @throws Throws an error in OSS.
|
||||
*/
|
||||
store: (
|
||||
request: KibanaRequest,
|
||||
sessionId: string,
|
||||
name: string,
|
||||
url: string,
|
||||
searchIdMapping?: Record<string, string>
|
||||
) => Promise<SearchSessionAttributes>
|
||||
|
||||
/**
|
||||
* Mark a session and all associated searchIds as expired.
|
||||
* @param request
|
||||
* @param sessionId
|
||||
* @returns success status
|
||||
* @throws Throws an error in OSS.
|
||||
*/
|
||||
expire: (
|
||||
request: KibanaRequest,
|
||||
sessionId: string
|
||||
) => Promise<boolean>
|
||||
|
||||
/**
|
||||
* Extend a session and all associated searchIds.
|
||||
* @param request
|
||||
* @param sessionId
|
||||
* @param extendBy Time to extend by, can be a relative or absolute string.
|
||||
* @returns success status
|
||||
* @throws Throws an error in OSS.
|
||||
*/
|
||||
extend: (
|
||||
request: KibanaRequest,
|
||||
sessionId: string,
|
||||
extendBy: string,
|
||||
) => Promise<boolean>
|
||||
|
||||
/**
|
||||
* Get a list of Search Session objects.
|
||||
* @param request
|
||||
|
||||
* @returns a list of SearchSessionAttributes objects
|
||||
* @throws Throws an error in OSS.
|
||||
*/
|
||||
list: (
|
||||
request: KibanaRequest,
|
||||
) => Promise<SearchSessionAttributes[]>
|
||||
|
||||
/**
|
||||
* Update the status of a given session
|
||||
* @param request
|
||||
* @param sessionId
|
||||
* @param status
|
||||
* @returns success status
|
||||
* @throws Throws an error in OSS.
|
||||
*/
|
||||
updateStatus: (
|
||||
request: KibanaRequest,
|
||||
sessionId: string,
|
||||
status: SearchSessionStatus
|
||||
) => Promise<boolean>
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
### Search Service Changes
|
||||
|
||||
There are cases where search requests are issued by the server (like TSVB).
|
||||
We can simplify this flow by introducing a mechanism, similar to the frontend one, that tracks the information in memory and polls for a saved object with a corresponding sessionId to store the ids into.
|
||||
|
||||
```ts
|
||||
interface SearchService {
|
||||
/**
|
||||
* The search API will accept the option `trackId`, which will track the search ID, if available, on the server, until a corresponding saved object is created.
|
||||
**/
|
||||
search: (
|
||||
context: RequestHandlerContext,
|
||||
request: IEnhancedEsSearchRequest,
|
||||
options?: ISearchOptions
|
||||
) => Promise<ISearchResponse<any>>
|
||||
}
|
||||
```
|
||||
|
||||
### Server Routes
|
||||
|
||||
Each route exposes the corresponding method from the Session Service (used only by the client-side service, not meant to be used directly by any consumers):
|
||||
|
||||
`POST /internal/session/store`
|
||||
|
||||
`POST /internal/session/extend`
|
||||
|
||||
`POST /internal/session/expire`
|
||||
|
||||
`GET /internal/session/list`
|
||||
|
||||
### Search Strategy Integration
|
||||
|
||||
If the `EnhancedEsSearchStrategy` receives a `restore` option, it will attempt reloading data using the Search Session saved object matching the provided `sessionId`. If there are any errors during that process, the strategy will return an error response and will *not* attempt to re-run the request.
|
||||
|
||||
The strategy will track the asyncId on the server side, if the `trackId` option is provided.
|
||||
|
||||
### Monitoring Service
|
||||
|
||||
The `data` plugin will register a task with the task manager, periodically monitoring the status of incomplete search sessions.
|
||||
|
||||
It will query the list of all incomplete sessions, and check the status of each search that is executing. If the search requests are all complete, it will update the corresponding saved object to have a `status` of `complete`. If any of the searches return an error, it will update the saved object to an `error` state. If the search requests have expired, it will update the saved object to an `expired` state. Expired sessions will be purged once they are older than the time defined by the `EXPIRED_SESSION_TTL` advanced setting.
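
The status roll-up described above can be sketched as a pure function over the statuses of the individual searches in a session; the per-search status values are assumptions for the example, and the enum is re-declared here to mirror the one defined earlier.

```ts
enum SearchSessionStatus {
  Running,
  Done,
  Error,
  Expired,
}

// Hypothetical per-search statuses reported by the async search API polling.
type AsyncSearchStatus = 'running' | 'complete' | 'error' | 'expired';

function deriveSessionStatus(searchStatuses: AsyncSearchStatus[]): SearchSessionStatus {
  if (searchStatuses.some((s) => s === 'error')) return SearchSessionStatus.Error;
  if (searchStatuses.some((s) => s === 'expired')) return SearchSessionStatus.Expired;
  if (searchStatuses.every((s) => s === 'complete')) return SearchSessionStatus.Done;
  return SearchSessionStatus.Running;
}
```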
|
||||
|
||||
Once there's a notification area in Kibana, we may use that mechanism to push completion / error notifications to the client.
|
||||
|
||||
## Miscellaneous
|
||||
|
||||
#### Relative dates and restore URLs
|
||||
|
||||
Restoring a sessionId depends on each request's `sha-256` hash exactly matching one of the saved hashes, which requires special attention to relative date ranges, as these might yield ambiguous results.
|
||||
|
||||
There are two potential scenarios:
|
||||
- A relative date (for example `now-1d`) is being used in query DSL - In this case any future hash will match, but the returned data *won't match the displayed timeframe*. For example, a report might state that it shows data from yesterday, but actually show data from a week ago.
|
||||
- A relative date is being translated by the application before being set to the query DSL - In this case a different date will be sent and the hash will never match, resulting in an error restoring the dashboard.
|
||||
|
||||
Both scenarios require careful attention during the UI design and implementation.
|
||||
|
||||
The former can be resolved by clearly displaying the creation time of the restored Search Session. We could also attempt translating relative dates to absolute ones, but this might be challenging as relative dates may appear deeply nested within the DSL.
|
||||
|
||||
The latter case currently happens for the timepicker only: the relative date is translated each time into an absolute one before being sent to Elasticsearch. To avoid issues, we'll have to make sure that restore URLs are generated with absolute dates so that they are restored correctly.
|
||||
|
||||
#### Changing a restored session
|
||||
|
||||
If you have restored a Search Session, making any type of change to it (time range, filters, etc.) will trigger new (potentially long) searches. There should be a clear indication in the UI that the data is no longer stored. A user then may choose to send it to background, resulting in a new Search Session being saved.
|
||||
|
||||
#### Loading an errored / expired / canceled session
|
||||
|
||||
When trying to restore a Search Session, if any of the request hashes don't match the saved ones, or if any of the saved async search IDs have expired, a meaningful error code will be returned by the server **for those requests**. It is each application's responsibility to handle these errors appropriately.
|
||||
|
||||
In such a scenario, the session will be partially restored.
|
||||
|
||||
#### Extending Expiration
|
||||
|
||||
Sessions are given an expiration date defined in an advanced setting (5 days by default). This expiration date is measured from the time the Search Session is saved, and it includes the time it takes to generate the results.
|
||||
|
||||
A session's expiration date may be extended indefinitely. However, if a session was canceled or has already expired, it needs to be re-run.
|
||||
|
||||
# Limitations
|
||||
|
||||
In the first iteration, cases which require multiple search requests to be made serially will not be supported. The
|
||||
following are examples of such scenarios:
|
||||
|
||||
- When a visualization is configured with a terms agg with an "other" bucket
|
||||
- When using blended layers or term joins in Maps
|
||||
|
||||
Eventually, when expressions can be run on the server, they will run in the context of a specific `sessionId`, hence enabling those edge cases too.
|
||||
|
||||
# Drawbacks
|
||||
|
||||
One drawback of this approach is that we will be regularly polling Elasticsearch for saved objects, which will increase
|
||||
load on the Elasticsearch server, in addition to the Kibana server (since all server-side processes share the same event
|
||||
loop). We've opened https://github.com/elastic/kibana/issues/77293 to track this, and hopefully come up with benchmarks
|
||||
so we feel comfortable moving forward with this approach.
|
||||
|
||||
Two potential drawbacks stem from storing things in server memory. If a Kibana server is restarted, in-memory results
|
||||
will be lost. (This can be an issue if a search request has started, and the user has sent to background, but the
|
||||
search session saved object has not yet been updated with the search request ID.) In such cases, the user interface
|
||||
will need to indicate errors for requests that were not stored in the saved object.
|
||||
|
||||
There is also the consideration of the memory footprint of the Kibana server; however, since
|
||||
we are only storing a hash of the request and search request ID, and are periodically cleaning it up (see Backend
|
||||
Services and Routes), we do not anticipate the footprint to increase significantly.
|
||||
|
||||
The results of search requests that have been sent to the background will be stored in Elasticsearch for several days,
|
||||
even if they will only be retrieved once. This will be mitigated by allowing the user to manually delete a search
|
||||
session object after it has been accessed.
|
||||
|
||||
# Alternatives
|
||||
|
||||
What other designs have been considered? What is the impact of not doing this?
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
(See "Basic example" above.)
|
||||
|
||||
Any application or solution that uses the `data` plugin `search` services will be able to facilitate search sessions
|
||||
fairly simply. The public side will need to create/clear sessions when appropriate, and ensure the `sessionId` is sent
|
||||
with all search requests. It will also need to ensure that any necessary application data, as well as a `restoreUrl` is
|
||||
sent when creating the saved object.
|
||||
|
||||
The server side will just need to ensure that the `sessionId` is sent to the `search` service. If bypassing the `search`
|
||||
service, it will need to also call `trackSearchId` when the first response is received, and `getSearchId` when restoring
|
||||
the view.
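As a rough illustration of that server-side flow, the sketch below assumes a hypothetical `SearchSessionClient` shape exposing the `trackSearchId` and `getSearchId` calls mentioned above; it is not the actual `data` plugin API:

```ts
// Hypothetical client shape for illustration only.
interface SearchSessionClient {
  trackSearchId(sessionId: string, requestHash: string, searchId: string): Promise<void>;
  getSearchId(sessionId: string, requestHash: string): Promise<string | undefined>;
}

async function runOrRestoreSearch(
  sessions: SearchSessionClient,
  sessionId: string,
  requestHash: string,
  startSearch: () => Promise<{ id: string }>
): Promise<string> {
  // When restoring, reuse the stored async search ID instead of starting a new search.
  const existingId = await sessions.getSearchId(sessionId, requestHash);
  if (existingId) return existingId;

  // Otherwise start the search and record its ID against the session.
  const { id } = await startSearch();
  await sessions.trackSearchId(sessionId, requestHash, id);
  return id;
}
```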
|
||||
|
||||
# How we teach this
|
||||
|
||||
What names and terminology work best for these concepts and why? How is this
|
||||
idea best presented? As a continuation of existing Kibana patterns?
|
||||
|
||||
Would the acceptance of this proposal mean the Kibana documentation must be
|
||||
re-organized or altered? Does it change how Kibana is taught to new developers
|
||||
at any level?
|
||||
|
||||
How should this feature be taught to existing Kibana developers?
|
||||
|
||||
# Unresolved questions
|
||||
|
||||
Optional, but suggested for first drafts. What parts of the design are still
|
||||
TBD?
|
|
@ -1,442 +0,0 @@
|
|||
- Start Date: 2020-12-21
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: (leave this empty)
|
||||
- [POC PR](https://github.com/elastic/kibana/pull/86232)
|
||||
|
||||
# Goal
|
||||
|
||||
Automatically generate API documentation for every plugin that exposes a public API within Kibana in order to help Kibana plugin developers
|
||||
find and understand the services available to them. Automatic generation ensures the APIs are _always_ up to date. The system will make it easy to find
|
||||
APIs that are lacking documentation.
|
||||
|
||||
Note this does not cover REST API docs, but is targeted towards our JavaScript
|
||||
plugin APIs.
|
||||
|
||||
# Technology: ts-morph vs api-extractor
|
||||
|
||||
[Api-extractor](https://api-extractor.com/) is a utility built by Microsoft that parses TypeScript code into json files that can then be used by a custom [api-documenter](https://api-extractor.com/pages/setup/generating_docs/) in order to build documentation. This is what we [have now](https://github.com/elastic/kibana/tree/master/docs/development), except we use the default api-documenter.
|
||||
|
||||
## Limitations with the current implementation using api-extractor & api-documenter
|
||||
|
||||
The current implementation relies on the default api-documenter. It has the following limitations:
|
||||
|
||||
- One page per API item
|
||||
- Files are .md not .mdx
|
||||
- There is no entry page per plugin (just an index.md per plugin/public and plugin/server)
|
||||
- Incorrectly marks these entries as packages.
|
||||
|
||||

|
||||
|
||||
- Does not generate links to APIs exposed from other plugins, nor inside the same plugin.
|
||||
|
||||

|
||||
|
||||
## Options to improve
|
||||
|
||||
We have two options to improve on the current implementation. We can use a custom api-documenter, or use ts-morph.
|
||||
|
||||
### Custom Api-Documenter
|
||||
|
||||
- According to the current maintainer of the sample api-documenter, it's a surprising amount of work to maintain.
|
||||
- If we wish to re-use code from the sample api-documenter, we'll have to fork the rush-stack repo, or copy their code into our system.
|
||||
- No verified ability to support cross plugin links. We do have some ideas (can explore creating a package.json for every page, and/or adding source file information to every node).
|
||||
- A more limited feature set; we wouldn't get things like references and source file paths.
|
||||
- There are very few examples of other companies using custom api-documenters to drive their documentation systems (I could not find any on github).
|
||||
|
||||
### Custom implementation using ts-morph
|
||||
|
||||
[ts-morph](https://github.com/dsherret/ts-morph) is a utility built and maintained by a single person, which sits a layer above the raw TypeScript compiler.
|
||||
|
||||
- Requires manually converting the types to how we want them to be displayed in the UI. Certain types have to be handled specially to show up
|
||||
in the right way (for example, for arrow functions to be categorized as functions). This special handling is the bulk of the logic in the PR, and
|
||||
may be a maintenance burden.
|
||||
- Relies on a package maintained by a single person, albeit one who has been very responsive and has a history of keeping the library up to date with
|
||||
TypeScript upgrades.
|
||||
- Affords us flexibility to do things like extract the setup and start types, grab source file paths to create links to github, and get
|
||||
reference counts (reference counts not implemented in MVP).
|
||||
- There are some issues with type links and signatures not working correctly (see https://github.com/dsherret/ts-morph/issues/923).
|
||||
|
||||

|
||||
|
||||
## Recommendation: ts-morph for the short term, switch to api-extractor when limitations can be worked around
|
||||
|
||||
Both approaches will have a decent amount of code to maintain, but the api-extractor approach appears to be a more stable long term solution, since it's built and maintained by Microsoft and
|
||||
is likely going to grow in popularity as more TypeScript API doc systems exist.
|
||||
If we had a working example that supported cross plugin links, I would suggest continuing down that road. However, we don't, while we _do_ have a working ts-morph implementation.
|
||||
|
||||
I recommend that we move ahead with ts-morph in the short term, because we have an implementation that offers a much improved experience over the current system, but that we continually
|
||||
re-evaluate as time goes on and we learn more about the maintenance burden of the current approach, and see what happens with our priorities and the api-extractor library.
|
||||
|
||||
Progress over perfection.
|
||||
|
||||

|
||||
|
||||
If we do switch, we can re-use all of the tests that take example TypeScript files and verify the resulting ApiDeclaration shapes.
|
||||
|
||||
# Terminology
|
||||
|
||||
**API** - A plugin's public API consists of every function, class, interface, type, variable, etc., that is exported from its index.ts file, or returned from its start or setup
|
||||
contract.
|
||||
|
||||
**API Declaration** - Each function, class, interface, type, variable, etc., that is part of a plugin's public API is a "declaration". This
|
||||
terminology is motivated by [these docs](https://www.typescriptlang.org/docs/handbook/modules.html#exporting-a-declaration).
|
||||
|
||||
# MVP
|
||||
|
||||
Every plugin will have one or more API reference pages. Every exported declaration will be listed in the page. It is first split by "scope" - client, server and common. Underneath
|
||||
that, setup and start contracts are at the top, the remaining declarations are grouped by type (classes, functions, interfaces, etc).
|
||||
Plugins may opt to have their API split into "service" sections (see [proposed manifest file changes](#manifest-file-changes)). If a plugin uses service folders, the API doc system will automatically group declarations that are defined inside each service folder. This is a simple way to break down very large plugins. The start and setup contracts will
|
||||
always remain with the main plugin name.
|
||||
|
||||

|
||||
|
||||
- Cross plugin API links work inside `signature`.
|
||||
- GitHub links with source file and line number
|
||||
- Using `serviceFolders` to split large plugins
|
||||
|
||||
## Post MVP
|
||||
|
||||
- Plugin `{@link AnApi}` links work. Will need to decide if we only support per plugin links, or if we should support a way to do this across plugins.
|
||||
- Ingesting stats like number of public APIs, and number of those missing comments
|
||||
- Include and expose API references
|
||||
- Use namespaces to split large plugins
|
||||
|
||||
# Information available for each API declaration
|
||||
|
||||
We have the following pieces of information available from each declaration:
|
||||
|
||||
- Label. The name of the function, class, interface, etc.
|
||||
|
||||
- Description. Any comment that was able to be extracted. Currently it's not possible for this data to be formatted, for example if it has a code example with backticks. This
|
||||
is dependent on the elastic-docs team moving the infrastructure to NextJS instead of Gatsby, but it will eventually be supported.
|
||||
|
||||
- Tags. Any `@blahblah` tags that were extracted from comments. Known tags, like `beta`, will show help text in a tooltip when hovered over.
|
||||
|
||||
- Type. This can be thought of as the _kind_ of type (see [TypeKind](#typekind)). It allows us to group each type into a category. It can be a primitive, or a
|
||||
more complex grouping. Possibilities are: array, string, number, boolean, object, class, interface, function, compound (unions or intersections)
|
||||
|
||||
- Required or optional. (whether or not the type was written with `| undefined` or `?`). This terminology makes the most sense for function
|
||||
parameters, not as much when thinking about an exported variable that might be undefined.
|
||||
|
||||
- Signature. This is only relevant for some types: functions, objects, type, arrays and compound. Classes and interfaces would be too large.
|
||||
For primitives, this is equivalent to "type".
|
||||
|
||||
- Children. Only relevant for some types, this would include parameters for functions, class members and functions for classes, properties for
|
||||
interfaces and objects. This makes the structure recursive. Each child is a nested API component.
|
||||
|
||||
- Return comment. Only relevant for function types.
|
||||
|
||||

|
||||
|
||||
|
||||
### ApiDeclaration type
|
||||
|
||||
```ts
|
||||
interface ApiDeclaration {
|
||||
label: string;
|
||||
type: TypeKind; // string, number, boolean, class, interface, function, type, etc.
|
||||
description: TextWithLinks;
|
||||
signature: TextWithLinks;
|
||||
tags: string[]; // Declarations may be tagged as beta, or deprecated.
|
||||
children: ApiDeclaration[]; // Recursive - this could be function parameters, class members, or interface/object properties.
|
||||
returnComment?: TextWithLinks;
|
||||
lifecycle?: Lifecycle.START | Lifecycle.SETUP;
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
# Architecture design
|
||||
|
||||
## Location
|
||||
|
||||
The generated docs will reside inside the kibana repo, inside a top-level `api_docs` folder. In the long term, we could investigate having the docs system run a script to generate the mdx files, so we don’t need to store them inside the repo. Every CI run should destroy and re-create this folder so removed plugins don't have lingering documentation files.
|
||||
|
||||
They will be hosted online wherever the new docs system ends up. This can temporarily be accessed at https://elasticdocstest.netlify.app/docs/.
|
||||
|
||||
## Algorithm overview
|
||||
|
||||
The first stage is to collect the list of plugins using the existing `findPlugins` logic.
|
||||
|
||||
For every plugin, the initial list of ts-morph api node declarations are collected from three "scope" files:
|
||||
- plugin/public/index.ts
|
||||
- plugin/server/index.ts
|
||||
- plugin/common/index.ts
|
||||
|
||||
Each ts-morph declaration is then transformed into an [ApiDeclaration](#ApiDeclaration-type) type, which is recursive due to the `children` property. Each
|
||||
type of declaration is handled slightly differently, mainly in regard to whether or not a signature or return type is added, and how children are added.
|
||||
|
||||
For example:
|
||||
|
||||
```ts
|
||||
if (node.isClassDeclaration()) {
|
||||
// No signature or return.
|
||||
return {
|
||||
label,
|
||||
description,
|
||||
type: TypeKind.ClassKind,
|
||||
// The class members are captured in the children array.
|
||||
children: getApiDeclaration(node.getMembers()),
|
||||
}
|
||||
} else if (node.isFunctionDeclaration()) {
|
||||
return {
|
||||
label,
|
||||
description,
|
||||
signature: getSignature(node),
|
||||
returnComment: getReturnComment(node),
|
||||
type: TypeKind.FunctionKind,
|
||||
// The function parameters are captured in the children array. This logic is more specific because
|
||||
// the comments for a function parameter are captured in the function comment, with "@param" tags.
|
||||
children: getParameterList(node.getParameters(), getParamTagComments(node)),
|
||||
}
|
||||
} else if (...)
|
||||
....
|
||||
```
|
||||
|
||||
The handling of each specific type is what encompasses the vast majority of the logic in the PR.
|
||||
|
||||
The public and server scopes have 0-2 special interfaces indicated by "lifecycle". This is determined by using ts-morph to extract the first two generic types
|
||||
passed to `... extends Plugin<start, setup>` in the class defined inside the plugin's `plugin.ts` file.
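A hedged sketch of that extraction is shown below. It uses real ts-morph calls (`getClasses`, `getExtends`, `getTypeArguments`), but the helper name and the simple text match on `Plugin` are illustrative, not the PR's actual code:

```ts
import { Project } from 'ts-morph';

// Hypothetical helper: returns the text of the generic arguments of the class that
// extends `Plugin<...>` in a plugin.ts file.
export function getLifecycleTypeNames(pluginFilePath: string): string[] {
  const project = new Project();
  const sourceFile = project.addSourceFileAtPath(pluginFilePath);

  for (const cls of sourceFile.getClasses()) {
    const extendsClause = cls.getExtends();
    // Match `class MyPlugin extends Plugin<..., ...>`.
    if (extendsClause && extendsClause.getExpression().getText() === 'Plugin') {
      // The first two type arguments are the lifecycle (setup/start) contracts.
      return extendsClause
        .getTypeArguments()
        .slice(0, 2)
        .map((typeArg) => typeArg.getText());
    }
  }
  return [];
}
```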
|
||||
|
||||
A [PluginApi](#pluginapi) is generated for each plugin, which is used to generate the json and mdx files. One or more json/mdx file pairs
|
||||
per plugin may be created, depending on the value of `serviceFolders` inside the plugin's manifest files. This is because some plugins have such huge APIs that
|
||||
they are too large to render in a single page.
|
||||
|
||||

|
||||
|
||||
## Types
|
||||
|
||||
### TypeKind
|
||||
|
||||
TypeKind is an enum that will identify what "category" or "group" name we can call this particular export. Is it a function, an interface, a class, a variable, etc.?
|
||||
This list is likely incomplete, and we'll expand as needed.
|
||||
|
||||
```ts
|
||||
export enum TypeKind {
|
||||
ClassKind = 'Class',
|
||||
FunctionKind = 'Function',
|
||||
ObjectKind = 'Object',
|
||||
InterfaceKind = 'Interface',
|
||||
TypeKind = 'Type', // For things like `export type Foo = ...`
|
||||
UnknownKind = 'Unknown', // For the special "unknown" typescript type.
|
||||
AnyKind = 'Any', // For the "any" kind, which should almost never be used in our public API.
|
||||
UnCategorized = 'UnCategorized', // There are a lot of ts-morph types, if I encounter something not handled, I dump it in here.
|
||||
StringKind = 'string',
|
||||
NumberKind = 'number',
|
||||
BooleanKind = 'boolean',
|
||||
ArrayKind = 'Array',
|
||||
CompoundTypeKind = 'CompoundType', // Unions & intersections, to handle things like `string | number`.
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
### Text with reference links
|
||||
|
||||
Signatures, descriptions and return comments may all contain links to other API declarations. This information needs to be serializable into json. This serializable type encompasses the information needed to build the DocLink components within these fields. The logic of building
|
||||
the DocLink components currently resides inside the elastic-docs system. It's unclear if this will change.
|
||||
|
||||
```ts
|
||||
/**
|
||||
* This is used for displaying code or comments that may contain reference links. For example, a function
|
||||
* signature that is `(a: import("src/plugin_b").Bar) => void` will be parsed into the following Array:
|
||||
*
|
||||
* ```ts
|
||||
* [
|
||||
* '(a: ',
|
||||
* { docId: 'pluginB', section: 'Bar', text: 'Bar' },
|
||||
* ') => void'
|
||||
* ]
|
||||
* ```
|
||||
*
|
||||
* This is then used to render text with nested DocLinks so it looks like this:
|
||||
*
|
||||
* `(a: => <DocLink docId="pluginB" section="Bar" text="Bar"/>) => void`
|
||||
*/
|
||||
export type TextWithLinks = Array<string | Reference>;
|
||||
|
||||
/**
|
||||
 * The information necessary to build a DocLink.
|
||||
*/
|
||||
export interface Reference {
|
||||
docId: string;
|
||||
section: string;
|
||||
text: string;
|
||||
}
|
||||
```
|
||||
|
||||
### ScopeApi
|
||||
|
||||
Scope API is essentially just grouping an array of ApiDeclarations into different categories that makes building the mdx files from a
|
||||
single json file easier.
|
||||
|
||||
```ts
|
||||
export interface ScopeApi {
|
||||
setup?: ApiDeclaration;
|
||||
start?: ApiDeclaration;
|
||||
classes: ApiDeclaration[];
|
||||
functions: ApiDeclaration[];
|
||||
interfaces: ApiDeclaration[];
|
||||
objects: ApiDeclaration[];
|
||||
enums: ApiDeclaration[];
|
||||
misc: ApiDeclaration[];
|
||||
  // We may add more here as we see fit to pull out of `misc`.
|
||||
}
|
||||
```
|
||||
|
||||
With this structure, the mdx files end up looking like:
|
||||
|
||||
```
|
||||
### Start
|
||||
<DocDefinitionList data={[actionsJson.server.start]}/>
|
||||
### Functions
|
||||
<DocDefinitionList data={actionsJson.server.functions}/>
|
||||
### Interfaces
|
||||
<DocDefinitionList data={actionsJson.server.interfaces}/>
|
||||
```
|
||||
|
||||
### PluginApi
|
||||
|
||||
A plugin API is the component that is serialized into the json file. It is broken into public, server and common components. `serviceFolders` is a way for the system to
|
||||
write separate mdx files depending on where each declaration is defined. This is because certain plugins (and core)
|
||||
are huge, and can't be rendered in a single page.
|
||||
|
||||
|
||||
```ts
|
||||
export interface PluginApi {
|
||||
id: string;
|
||||
serviceFolders?: readonly string[];
|
||||
client: ScopeApi;
|
||||
server: ScopeApi;
|
||||
common: ScopeApi;
|
||||
}
|
||||
```
|
||||
|
||||
## kibana.json Manifest file changes
|
||||
|
||||
### Using a kibana.json file for core
|
||||
|
||||
For the purpose of API infrastructure, core is treated like any other plugin. This means it has to specify a `serviceFolders` section inside a manifest file to be split into sub folders. There are other ways to tackle this - like a hard-coded array just for the core folder - but I kept the logic as similar to the other plugins as possible.
|
||||
|
||||
### New parameters
|
||||
|
||||
**serviceFolders?: string[]**
|
||||
|
||||
Used by the system to group services into sub-pages. Some plugins, like data and core, have such huge APIs that they are very slow to render in a single page and are less consumable by solution developers. The addition of an optional list of service folders will cause the system to automatically create a separate page with every API that is defined within that folder. The caveat is that core will need to define a manifest file in order to define its service folders...
|
||||
|
||||
**teamOwner: string**
|
||||
|
||||
Team owner can be determined via the GitHub CODEOWNERS file, but we want to encourage single team ownership per plugin. Requiring a team owner string in the manifest file will help with this and will allow the API doc system to add a section to every page with a link to the team owner. Additional ideas are teamSlackChannel or teamEmail for further contact.
|
||||
|
||||
**summary: string**
|
||||
|
||||
|
||||
A brief description of the plugin can then be displayed in the automatically generated API documentation.
|
||||
|
||||
# Future features
|
||||
|
||||
## Indexing stats
|
||||
|
||||
Can we index statistics about our API as part of this system? For example, I'm currently dumping information about which API declarations are missing comments to the console.
|
||||
|
||||
## Longer term approach to "plugin service folders"
|
||||
|
||||
Using sub folders is a short-term plan. A long-term plan hasn't been established yet, but it should fit in with our folder structure hierarchy goals, along with
|
||||
any support we have for sharing services among a related set of plugins that are not exposed as part of the public API.
|
||||
# Recommendations for writing comments
|
||||
|
||||
## @link comments for the referenced type
|
||||
|
||||
Core has a pattern of writing comments like this:
|
||||
|
||||
```ts
|
||||
/** {@link IUiSettingsClient} */
|
||||
uiSettings: IUiSettingsClient;
|
||||
```
|
||||
|
||||
I don't see the value in this. In the IDE, I can click on the IUiSettingsClient type and get directed there, and in the API doc system, the
|
||||
type will already be clickable. This ends up with a weird looking API:
|
||||
|
||||

|
||||
|
||||
The plan is to make @link comments work like links, which means this is unnecessary information.
|
||||
|
||||
I propose we avoid this kind of pattern.
|
||||
|
||||
## Export every referenced type
|
||||
|
||||
The docs system handles broken link warnings, but to avoid breaking CI, I suggest we turn this off initially. However, this will mean
|
||||
we may miss situations where we are referencing a type that is not actually exported. This will cause a broken link in the docs
|
||||
system.
|
||||
|
||||
For example if your index.ts file has:
|
||||
```ts
|
||||
export type foo = string | AnInterface;
|
||||
```
|
||||
|
||||
and does not also export `AnInterface`, this will be a broken link in the docs system.
|
||||
|
||||
Until we have better CI tools to catch these mistakes, developers will need to export every referenced type.
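For contrast, a minimal illustration (hypothetical names, matching the example above) of the fixed version exports the referenced interface alongside the alias so the docs system has something to link to:

```ts
// Exporting the referenced type avoids the broken link.
export interface AnInterface {
  name: string;
}

export type foo = string | AnInterface;
```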
|
||||
|
||||
## Avoid `Pick` pattern
|
||||
|
||||
Connected to the above, if you use `Pick`, there are two problems. One is that it's difficult for a developer to see the functionality
|
||||
available to them at a glance, since they would have to keep flipping from the interface definition to the properties that have been picked.
|
||||
|
||||
The second potential problem is that you will have to export the referenced type, and in some situations, it's an internal type that isn't exported.
|
||||
|
||||

|
||||
|
||||
# Open questions
|
||||
|
||||
## Required attribute
|
||||
|
||||
`isRequired` is an optional parameter that can be used to display a badge next to the API.
|
||||
We can mark function parameters that do not use `?` or `| undefined` as required. Open questions:
|
||||
|
||||
1. Are we okay with a badge showing for `required` rather than `optional` when marking a parameter as optional is extra work for a developer, and `required` is the default?
|
||||
|
||||
2. Should we only mark function parameters as `required`, or also interface/class properties? Essentially, should any declaration that is not nullable
|
||||
have the `required` tag?
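For reference, a hedged snippet showing the spellings these questions distinguish (the names are illustrative):

```ts
interface Example {
  // No `?` and no `| undefined`: would get the `required` badge.
  doWork(id: string): void;
  // Written with `?`: not marked required.
  maybeWork(id?: string): void;
  // Written with `| undefined`: also not marked required.
  alsoMaybeWork(id: string | undefined): void;
}
```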
|
||||
|
||||
## Signatures on primitive types
|
||||
|
||||
1. Should we _always_ include a signature for variables and parameters, even if they are a repeat of the TypeKind? For example:
|
||||
|
||||

|
||||
|
||||
2. If no, should we include signatures when the only difference is `| undefined`? For function parameters this information is captured by
|
||||
the absence of the `required` badge. Is this obvious? What about class members/interface props?
|
||||
|
||||
## Out of scope
|
||||
|
||||
### REST API
|
||||
|
||||
This RFC does not cover REST API documentation, though it is worth considering where
|
||||
REST APIs registered by plugins should go in the docs. The docs team has a proposal for this but it is not inside the `Kibana Developer Docs` mission.
|
||||
|
||||
### Package APIs
|
||||
|
||||
Package APIs are not covered in this RFC.
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
In order to generate useful API documentation, we need to approach this from two sides.
|
||||
|
||||
1. Establish a habit of writing documentation.
|
||||
2. Establish a habit of reading documentation.
|
||||
|
||||
Currently what often happens is a developer asks another developer a question directly, and it is answered. Every time this happens, ask yourself if
|
||||
there is a link you can share instead of a direct answer. If there isn't, file an issue for that documentation to be created. When we start responding
|
||||
to questions with links, solution developers will naturally start to look in the documentation _first_, saving everyone time!
|
||||
|
||||
The APIs WILL need to be well commented or they won't be useful. We can measure the amount of missing comments and set a goal of reducing this number.
|
||||
|
||||
# External documentation system examples
|
||||
|
||||
- [Microsoft .NET](https://docs.microsoft.com/en-us/dotnet/api/microsoft.visualbasic?view=netcore-3.1)
|
||||
- [Android](https://developer.android.com/reference/androidx/packages)
|
||||
|
||||
# Architecture review
|
||||
|
||||
The primary concern coming out of the architecture review was over the technology choice of ts-morph vs api-extractor, and the potential maintenance
|
||||
burden of using ts-morph. For the short term, we've decided tech leads will own this section of code; we'll consider it experimental and
|
||||
focus on deriving value out of it. Once we are confident of the value, we can focus on stabilizing the implementation details.
|
|
@ -1,309 +0,0 @@
|
|||
- Start Date: 2021-02-24
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
# Summary
|
||||
|
||||
Adopt Bazel, an open-source build and test tool as the build system for Kibana.
|
||||
|
||||
|
||||
# What is Bazel
|
||||
|
||||
Bazel is an open-source build and test tool similar to Make, Maven, and Gradle. It uses a human-readable, high-level build language. Bazel supports projects in multiple languages and builds outputs for multiple platforms. Bazel supports large codebases across multiple repositories, and large numbers of users.
|
||||
|
||||
Bazel offers the following advantages:
|
||||
|
||||
* **High-level build language**. Bazel uses an abstract, human-readable language to describe the build properties of your project at a high semantical level. Unlike other tools, Bazel operates on the concepts of libraries, binaries, scripts, and data sets, shielding you from the complexity of writing individual calls to tools such as compilers and linkers.
|
||||
* **Bazel is fast and reliable**. Bazel caches all previously done work and tracks changes to both file content and build commands. This way, Bazel knows when something needs to be rebuilt, and rebuilds only that. To further speed up your builds, you can set up your project to build in a highly parallel and incremental fashion.
|
||||
* **Bazel is multi-platform**. Bazel runs on Linux, macOS, and Windows. Bazel can build binaries and deployable packages for multiple platforms, including desktop, server, and mobile, from the same project.
|
||||
* **Bazel scales**. Bazel maintains agility while handling builds with 100k+ source files. It works with multiple repositories and user bases in the tens of thousands.
|
||||
* **Bazel is extensible**. Many languages are supported, and you can extend Bazel to support any other language or framework.
|
||||
|
||||
For more information, please refer to the [Bazel website](https://www.bazel.build/).
|
||||
|
||||
|
||||
# Motivation
|
||||
|
||||
Kibana has grown substantially over the years and now includes more than 2,100,000 lines of code across 25,000 TypeScript and JavaScript files, excluding NPM dependencies. For someone to get Kibana up and running, they rely on five main steps:
|
||||
|
||||
### Installation of NPM dependencies
|
||||
|
||||
Yarn Package Manager handles the installation of NPM dependencies, and the migration to Bazel will not immediately affect the time this step takes.
|
||||
|
||||
### Building packages
|
||||
|
||||
The building of [packages](https://github.com/elastic/kibana/tree/master/packages) happens during the bootstrap process initiated by running `yarn kbn bootstrap` and, without any cache, takes about a minute. Currently, we maintain a single cache item per package, so drastic changes like frequently switching branches result in the worst-case scenario of no cache being usable.
|
||||
|
||||
### Building TypeScript project references
|
||||
|
||||
The size of the project and the amount of TypeScript has created scaling issues, resulting in slow project completion and IDE unresponsiveness. To combat this, we have been migrating plugins to use [project references](https://www.typescriptlang.org/docs/handbook/project-references.html) and pre-build them during bootstrap. Currently, this takes over five minutes to complete.
|
||||
|
||||
### Building client-side plugins
|
||||
|
||||
The [@kbn/optimizer](https://github.com/elastic/kibana/tree/master/packages/kbn-optimizer) package is responsible for building client-side plugins and is initiated during `yarn start`. Without any cache, it takes between three and four minutes, but is highly dependent on the amount of CPU cores available. The caching works similar to packages and requires a rebuild if any files change. Under the hood, this package is managing a set number of workers to run individual Webpack instances. When we first introduced Webpack back in [June of 2015](https://github.com/elastic/kibana/pull/4335), it was responsible for bundling all client-side code within a single process. As the Kibana project continued to grow over time, this Webpack process continued to impact the developer experience. A common theme to address these issues was through reducing the responsibilities of Webpack by separating [SCSS](https://github.com/elastic/kibana/pull/19643) and [vendor code](https://github.com/elastic/kibana/pull/22618). Knowing we would need to continue to scale, one of the new platform’s core objectives was to be able to build each plugin independently. This work paved the way for what we are proposing here and led to the [creation of @kbn/optimizer](https://github.com/elastic/kibana/pull/53976), which improved performance by separating and parallelizing Webpack builds.
|
||||
|
||||
### Compiling server-side code
|
||||
|
||||
While in development, we rely on [@babel/register](https://babel.dev/docs/en/babel-register) to transpile server-side code during runtime. The use of `@babel/register` results in the compile-time cost being paid when the code is run, mostly during startup or when initiating a unit test. When comparing development startup to production, where the code is pre-compiled, we see startup time taking about twice as long even when the Babel cache exists.
|
||||
|
||||
---
|
||||
|
||||
These steps cost developers more than ten minutes of their time when updating or changing branches. These times will continue to worsen as the project continues to grow. Instead of making small incremental changes as we have done in the past, like improving caching in a single area, we would like to leverage Bazel, where there are already solutions to many of these problems.
|
||||
|
||||
One of the primary advantages Bazel provides is that the builds are hermetic, meaning they are dependent only on a known set of inputs to ensure the builds are reproducible and cacheable. To create these assurances, builds utilize a sandbox environment with only the defined dependencies available. This not only allows for aggressive local caching but the use of remote caching as well. If Bazel determines that a package or plugin needs to be re-built and it's not in the local cache, it will check the remote cache and, if found, will persist locally for subsequent builds. Once the project has completely migrated to Bazel, a developer will only build code they have directly modified or is dependent on those changes. In building packages and plugins, the expected cost for most developers will be downloading the builds from the remote cache for anything changed since their last build.
|
||||
|
||||
Building TypeScript reference definitions and using `@babel/register` will be negated by using the TypeScript compiler directly instead of using Babel. Currently, we use Babel for code generation and `tsc` for type check and type declaration output. Additionally, the TypeScript implementation in [rules_nodejs](https://bazelbuild.github.io/rules_nodejs/TypeScript.html) for Bazel handles incremental builds, resulting in faster re-builds.
|
||||
|
||||
In addition to the benefits of building code, there are also benefits regarding running unit tests. Developers currently need to understand what unit tests to run to validate changes or rely on waiting for CI, which has a long feedback loop. Since Bazel knows the dependency tree, it will only run unit tests for a package or plugin modified or dependent on those modifications. This optimization helps developers and will significantly reduce the amount of work CI needs to do. On CI, unit tests take 35 minutes to complete, where the average single Jest project takes just twenty seconds.
|
||||
|
||||
# Detailed design
|
||||
|
||||
## Installation and configuration
|
||||
|
||||
To avoid adding Bazel as a dependency that developers need to manage, we will be using a project called Bazelisk to provide that resolution, similar to Gradle Wrapper. The bootstrap command will ensure that the `@bazel/bazelisk` package is installed globally. Two files will exist at the root of the repository, `.bazeliskversion` and `.bazelversion`, to define the required versions of those packages, similar to specifying the Node version today.
|
||||
|
||||
|
||||
## TypeScript
|
||||
|
||||
The [NodeJS](https://bazelbuild.github.io/rules_nodejs/TypeScript.html) rules for Bazel contain two different methods for handling TypeScript: `ts_library` and `ts_project`. We will be using `ts_project`, as it provides a wrapper around `tsc`, whereas `ts_library` is an open-sourced version of the rule used to compile TypeScript at Google. While there are advantages to `ts_library`, it’s very opinionated and hard to migrate an existing project to, while also locking us into a specific version of TypeScript. Over time, it’s expected that `ts_project` will catch up to `ts_library`.
|
||||
|
||||
Bazel maintains a persistent worker which `ts_project` takes advantage of by keeping the AST in memory and providing incremental updates. This should improve the time it takes for changes to be represented.
|
||||
|
||||
A Bazel [macro](https://docs.bazel.build/versions/master/skylark/macros.html) will be created to centralize the usage of `ts_project`. The macro will, at minimum, accept a TypeScript configuration file, supply the base `tsconfig.json` file as a source, and ensure incremental builds are enabled.
|
||||
|
||||
|
||||
## Webpack
|
||||
|
||||
A Bazel [macro](https://docs.bazel.build/versions/master/skylark/macros.html) will be created to centralize the usage of Webpack. The macro will, at minimum, accept a configuration file and supply a base `webpack.config.js` file. Currently, all plugins share the same Webpack configuration. Allowing a plugin to provide additional configuration will allow plugins the ability to add loaders without affecting the performance of others.
|
||||
|
||||
While running Kibana from source in development, the proxy server will ensure that client-side code for plugins is compiled and available. This is currently handled by the [basePathProxy](https://github.com/elastic/kibana/blob/master/src/core/server/http/base_path_proxy_server.ts), where server restarts and optimizer builds are observed and cause the proxy to pause requests. With Bazel, we will utilize [iBazel](https://github.com/bazelbuild/bazel-watcher) to watch for file changes and re-build the plugin targets when necessary. The watcher will emit [events](https://github.com/bazelbuild/bazel-watcher#remote-events) that we will use to block requests and provide feedback to the logs.
|
||||
|
||||
While there are a few proofs of concept for a Webpack 5 Bazel rule, none that currently exist are deemed production-ready. In the meantime, we can use the Webpack CLI directly. One of the main advantages being explored in these rules will be the support for using the Bazel worker to provide incremental builds similar to what `@kbn/optimizer` is doing today.
|
||||
|
||||
We are aware there are quite a few alternatives to Webpack, but our plan is to continue using it during the migration. Once all packages have been migrated to Bazel, it will be much easier to test alternatives by changing the targets of a single plugin's `BUILD.bazel` file.
|
||||
|
||||
|
||||
### Unit Testing
|
||||
|
||||
A Bazel macro will be created to centralize the usage of Jest unit testing. The macro will, at minimum, accept a Jest configuration file, add the [Jest preset](https://github.com/elastic/kibana/blob/master/packages/kbn-test/jest-preset.js) and its dependencies as sources, then use the Jest CLI to execute tests.
|
||||
|
||||
Developers currently use `yarn test:jest` to efficiently run tests in a given directory without remembering the command or path. This command will continue to work as it does today, but will begin running tests through Bazel for packages or plugins which have been migrated.
|
||||
|
||||
CI will have an additional job to run `bazel test //…:jest`. This will run unit tests for any package or plugin modified or dependent on modifications since the last successful CI run on that branch.
|
||||
|
||||
When migrating a package or plugin using Jest to Bazel, a `jest` target using our macro will be defined in its `BUILD.bazel` file. The project is then excluded from the root `jest.config.js` file to ensure the tests do not needlessly run multiple times. While we could still use Babel for supporting TypeScript in Jest, there would be advantages to utilizing Bazel to handle compiling TypeScript. Not only would developers immediately receive type checking, but those builds would also be shared with anything else using the target, like the Kibana server or Webpack.
|
||||
|
||||
|
||||
## Yarn & Node Version Management
|
||||
|
||||
Bazel provides the ability to define the version of Node and Yarn which are used, and once we have fully migrated to Bazel, developers will no longer need to take action when we choose to change versions. The only requirement would be to have a single version of Yarn installed so scripts defined in the `package.json` could be executed.
|
||||
|
||||
Example excerpt from `WORKSPACE.bazel`:
|
||||
```python
|
||||
node_repositories(
|
||||
node_repositories = {
|
||||
"14.15.4-darwin_amd64": ("node-v14.15.4-darwin-x64.tar.gz", "node-v14.15.4-darwin-x64", "6b0e19e5c2601ef97510f7eb4f52cc8ee261ba14cb05f31eb1a41a5043b0304e"),
|
||||
"14.15.4-linux_arm64": ("node-v14.15.4-linux-arm64.tar.xz", "node-v14.15.4-linux-arm64", "b990bd99679158c3164c55a20c2a6677c3d9e9ffdfa0d4a40afe9c9b5e97a96f"),
|
||||
"14.15.4-linux_s390x": ("node-v14.15.4-linux-s390x.tar.xz", "node-v14.15.4-linux-s390x", "29f794d492eccaf0b08e6492f91162447ad95cfefc213fc580a72e29e11501a9"),
|
||||
"14.15.4-linux_amd64": ("node-v14.15.4-linux-x64.tar.xz", "node-v14.15.4-linux-x64", "ed01043751f86bb534d8c70b16ab64c956af88fd35a9506b7e4a68f5b8243d8a"),
|
||||
"14.15.4-windows_amd64": ("node-v14.15.4-win-x64.zip", "node-v14.15.4-win-x64", "b2a0765240f8fbd3ba90a050b8c87069d81db36c9f3745aff7516e833e4d2ed6"),
|
||||
},
|
||||
node_version = "14.15.4",
|
||||
node_urls = [
|
||||
"https://nodejs.org/dist/v{version}/{filename}",
|
||||
],
|
||||
yarn_repositories = {
|
||||
"1.21.1": ("yarn-v1.21.1.tar.gz", "yarn-v1.21.1", "d1d9f4a0f16f5ed484e814afeb98f39b82d4728c6c8beaafb5abc99c02db6674"),
|
||||
},
|
||||
yarn_version = "1.21.1",
|
||||
yarn_urls = [
|
||||
"https://github.com/yarnpkg/yarn/releases/download/v{version}/{filename}",
|
||||
],
|
||||
package_json = ["//:package.json"],
|
||||
)
|
||||
```
|
||||
|
||||
## Target outputs
|
||||
|
||||
The Kibana project will contain a new `bazel` directory with symlinks to current builds and logs. This directory is not checked in and is covered by gitignore. More details can be found in the Bazel documentation for [output directory layout](https://docs.bazel.build/versions/master/output_directories.html). Keep in mind we specify a [symlink prefix](https://docs.bazel.build/versions/master/user-manual.html#flag--symlink_prefix) of “bazel” to maintain a single directory.
|
||||
|
||||
For most, this change will be welcomed as it has been a common complaint that our targets are scattered throughout the repository making it difficult to search without configuring the ignore list.
|
||||
|
||||
|
||||
## Preserve Symlinks
|
||||
|
||||
Bazel outputs are created in a folder relative to the monorepo at `./bazel`. However, that folder is just a compilation of symlinks that Bazel creates pointing to temporary folders on the local disk. During the migration, we will begin referencing packages within the `bazel/bin` directory. Internally, Yarn will handle this by creating another symlink from within the `node_modules` directory to packages. By default, any import will be based on the location of the file and not the location of the symlink. This causes issues with module resolution since the `node_modules` directory populated by other dependencies will not be within the tree. To resolve this, we will use the node flag `--preserve-symlinks` that will patch the require calls and prevent Node from expanding the symlinks into their real path during the module resolution.
|
||||
|
||||
|
||||
## Build Packaging
|
||||
|
||||
One of the additional benefits to Bazel is that it is multi-platform. While it runs on Linux, macOS, and Windows, it can build binaries across platforms.
|
||||
|
||||
Bazel provides a [pkg](https://github.com/bazelbuild/rules_pkg/tree/main/pkg) rule providing tar, deb, and rpm support. To facilitate cross-platform tar support in the distributable build, we are currently using tar through Node, which is slow. The pkg tar rule will provide an improvement in performance. For deb and RPM builds, Kibana is currently using a Ruby package called [fpm](https://github.com/jordansissel/fpm) created by a former Elastic employee.
|
||||
|
||||
For Docker, we currently create the images during the build using Docker then extract the image as a tar to provide the Release Manager which publishes it to our repository. For ARM, we only create a Docker context which Release Manager uses to create the image on ARM hardware. Bazel has a [docker](https://github.com/bazelbuild/rules_docker) rule, which should allow us to cross-build, and do so without actually using Docker.
|
||||
|
||||
The current build is fairly procedural and has little caching, so subsequent builds take almost as long as the previous one. When working on a step later in the build system, one ultimately ends up commenting out previously completed steps to save time when testing. With Bazel, each target consumes sources or dependencies which could be other targets. Conceivably, we will have a target called release, which is dependent on another target for each of the assets in the distribution (Windows zip, Linux 64-bit tar, Darwin tar, RPM 64-bit, Deb 64-bit, Deb aarch64, etc). Each one of these assets will then depend on the Kibana core and the rest of the plugins. The entire dependency tree for this will be resolved and rebuilt only when necessary.
|
||||
|
||||
|
||||
## scripts/*
|
||||
|
||||
We decided to use scripts to define and list any command-line utility for the repository. With Bazel, we can still use these entry points, but they will need to consume the code from `bazel/dist` instead of relying on `src/setup_node_env` to provide transpiling using `@babel/register`.
|
||||
|
||||
After the entire migration, we should consider using Bazel targets to execute the script, as with it, we can automatically resolve the dependencies and build anything not yet available.
|
||||
|
||||
|
||||
## Remote Cache
|
||||
|
||||
As mentioned previously, the remote cache is an essential feature of Bazel and something we plan to utilize.
|
||||
|
||||
The Node binary is platform-specific, and because it’s used as an input to build the majority of our targets, we will need to write cache for each platform we support in development. A CI job will build and test all Bazel targets for Linux, macOS, and Windows on merge to a tracked branch. It’s important that this job completes as soon as possible to ensure anyone updating with that branch will have cache available. In the future, we will consider allowing pull request jobs to also write to the cache to minimize this race condition.
|
||||
|
||||
We have created a proof of concept using persistent storage on Google Cloud and are currently in a trial with [BuildBuddy](https://www.buildbuddy.io/) which provides not only caching but an event viewer, result store, and remote execution of builds. If we decide to move forward with BuildBuddy, we will most likely use their self-hosted solution where we can provide our own GCP infrastructure.
|
||||
|
||||
|
||||
## Packages Build Outline
|
||||
|
||||
Within Bazel, the packages will have new overall rules:
|
||||
|
||||
* It cannot contain build scripts. Every package build will be written using a Bazel `BUILD.bazel` file
|
||||
* It cannot have side effects. Every package build should be cacheable and reproducible and cannot produce any side effects
|
||||
* Each package should define three major public target rules in `BUILD.bazel` files: `build`, `jest`, and a js_library target with the same name as the folder where the package lives.
|
||||
* In order to output its targets in the most Bazel-friendly way, each package will output its targets according to the following folder structure: for node targets, it will be `target_server`; for web targets, it will be `target_web`; and for types, it will be `target_types`.
|
||||
|
||||
|
||||
## package.json’s Outline
|
||||
|
||||
As a prerequisite for Bazel and for additional benefits outlined in pull-request [#76412](https://github.com/elastic/kibana/issues/76412), the Kibana repository went from using Yarn Workspaces to a single `package.json` defining all dependencies.
|
||||
|
||||
One of the benefits Bazel has over Gradle is the support for Node modules. Bazel will manage the dependencies using either the NPM or Yarn Package Manager. When doing this, a `BUILD.bazel` file will be generated for each module allowing for fine-grained control.
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
The project is broken down into four initial phases, providing improvements along the way.
|
||||
|
||||
|
||||
## Phase I - Infrastructure & Packages
|
||||
|
||||
In this phase, we set out to provide the necessary infrastructure outlined previously to begin utilizing Bazel and begin doing so by migrating the current 38 packages.
|
||||
|
||||
A `BUILD.bazel` file will be added to the root of each package defining a `build` target. This filegroup target will be what we call during the bootstrap phase to build all packages migrated to Bazel. This target is temporary to maintain similar functionality during our transition. In the future, these procedural build steps will be removed in favor of dependency, tree-driven actions where work will only be done if it’s necessary for the given task like running the Kibana server or executing a unit test.
|
||||
|
||||
The `@kbn/pm` package was updated in https://github.com/elastic/kibana/pull/89961 to run the new packages build target, invoked by calling `bazel build //packages:build`, before executing the existing legacy package builds.
|
||||
|
||||
The build targets will no longer reside within the package themselves and instead will be within the `bazel/bin` directory. To account for this, any defined dependency will need to be updated to reference the new directory (example: `link:bazel/bin/packages/elastic-datemath`). While also in this transition period, the build will need to copy over the packages from `bazel/bin` into the `node_modules` of the build target.
|
||||
|
||||
Example package BUILD.bazel for `packages/elastic-datemath`:
|
||||
|
||||
```python
|
||||
load("@build_bazel_rules_nodejs//:index.bzl", "pkg_npm")
|
||||
load("@build_bazel_rules_nodejs//internal/js_library:js_library.bzl", "js_library")
|
||||
|
||||
SRCS = [
|
||||
".npmignore",
|
||||
"index.js",
|
||||
"index.d.ts",
|
||||
"package.json",
|
||||
"readme",
|
||||
"tsconfig.json",
|
||||
]
|
||||
|
||||
filegroup(
|
||||
name = "src",
|
||||
srcs = glob(SRCS),
|
||||
)
|
||||
|
||||
js_library(
|
||||
name = "elastic-datemath",
|
||||
srcs = [ ":src" ],
|
||||
deps = [ "@npm//moment" ],
|
||||
package_name = "@elastic/datemath",
|
||||
visibility = ["//visibility:public"],
|
||||
)
|
||||
|
||||
alias(
|
||||
name = "build",
|
||||
actual = "elastic-datemath",
|
||||
visibility = ["//visibility:public"],
|
||||
)
|
||||
```
|
||||
|
||||
If the package has unit tests, they will need to be migrated which will be invoked with `bazel test` as described in the Unit Testing section.
|
||||
|
||||
|
||||
## Phase II - Docs, Developer Experience
|
||||
|
||||
Packages were a likely choice for phase 1 for a few reasons: they aren’t often updated, and the developer experience is quite lacking, making it easy to maintain parity with. In phase 2, we will bring the developer experience of packages up to what developers are accustomed to with plugins. This means re-builds will be automatic when a change occurs, and it gives us time to address any developer experience shortcomings which were not foreseen. During this time, we will work on overall Bazel documentation as it pertains to the Kibana repository.
|
||||
|
||||
|
||||
## Phase III - Core & Plugins
|
||||
|
||||
In this phase, we will be migrating each of the 135 plugins over to being built and unit tested using Bazel. During this time, the legacy systems will stay in place and run in parallel with Bazel. Once all plugins have been migrated, we can decommission the legacy systems.
|
||||
|
||||
The `BUILD.bazel` files will look similar to that of packages, there will be a target for `web`, `server`, and `jest`. Just like packages, as the Jest unit tests are migrated, they will need to be removed from the root `jest.config.js` file as described in the Unit Testing section.
|
||||
|
||||
Plugins are built in a sandbox, so they will no longer be able to use relative imports from one another. For TypeScript, relative imports will be replaced with a path reference to `bazel/bin`.
|
||||
|
||||
Static imports across plugins are a concern that would affect the developer experience due to cascading re-builds. For example, if every plugin has static imports from `src/core`, any changes to `src/core` would cause all those plugins to re-build. There are a few options to address this; the first would be to minimize or eliminate these imports. Most plugins are importing types, so we can also ensure that only type-level changes actually trigger a re-build. Additionally, these types of dependencies could be further broken down into smaller packages to reduce how often this is necessary.
|
||||
|
||||
```
|
||||
"compilerOptions": {
|
||||
"rootDirs": [
|
||||
".",
|
||||
"./bazel/out/host/bin/path/to",
|
||||
"./bazel/out/darwin-fastbuild/bin/path/to",
|
||||
"./bazel/out/k8-fastbuild/bin/path/to",
|
||||
"./bazel/out/x64_windows-fastbuild/bin/path/to",
|
||||
"./bazel/out/darwin-dbg/bin/path/to",
|
||||
"./bazel/out/k8-dbg/bin/path/to",
|
||||
"./bazel/out/x64_windows-dbg/bin/path/to",
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
## Phase IV - Build Packaging
|
||||
|
||||
In this phase, we will be replacing our current build tooling located at `src/dev/build` to use Bazel. A single target of `release` will provide all assets needed by the release manager:
|
||||
|
||||
* Windows (zip)
|
||||
* Linux 64-bit (tar)
|
||||
* Linux aarch64 (tar)
|
||||
* RPM 64-bit
|
||||
* RPM aarch64
|
||||
* Deb 64-bit
|
||||
* Deb aarch64
|
||||
* Darwin 64-bit (tar)
|
||||
* CentOS 64-bit Docker Image & Context (tar)
|
||||
* CentOS aarch64 Docker Image & Context (tar)
|
||||
* UBI Docker Image & Context (tar)
|
||||
* Ironbank Docker Context (tar)
|
||||
|
||||
There are a few rules already provided by Bazel that should be used. In some cases, like tar, they have been re-implemented to ensure the output is hermetic. `rules_pkg` has `pkg_tar`, `pkg_deb`, `pkg_rpm`, and `pkg_zip` to assist with this. `rules_docker` provides the ability to build containers without depending on Docker being installed, and the ability to build for other platforms.
|
||||
|
||||
While this phase can begin with phase 1, it can not be completed until all packages and plugins have been migrated.
|
||||
|
||||
# Drawbacks
|
||||
|
||||
With Bazel, all dependencies need to be defined on each package, which can become tedious. However, that is also how Bazel is able to provide the level of caching and performance which it does.
|
||||
|
||||
Bazel is substantially different from what people in the JavaScript community are accustomed to, so teaching might be difficult. For example, in JavaScript, when you would like to add support for TypeScript, you would probably find a package that adds the support to Jest, Webpack, or Babel. However, Bazel works on inputs and outputs. You could still do what was previously described, but it wouldn’t be efficient. Instead, you would have your TypeScript code as an input, which would use the TypeScript compiler to output JavaScript, which would be the input to Webpack or Jest. This way, that compile step would only happen once for each of those paths.
|
||||
|
||||
It’s also possible there is something better out there for our use, or, as some have suggested, splitting up our repository into smaller pieces.
|
||||
|
||||
|
||||
# Alternatives
|
||||
|
||||
Gradle is widely used at Elastic; however, it doesn’t have the NodeJS-specific support which Bazel has.
|
||||
|
||||
There are other alternatives that seem to have been created by past Google employees who wanted something like Blaze which is the internal tool used at Google before they open-sourced Bazel (an anagram of Blaze). Most of these just didn’t have a large enough community or provide the level of caching and scaling we were looking for.
|
||||
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
The migration would happen in phases, starting with packages, then the build system, then plugins. All steps in each phase can happen gradually and over time.
|
||||
|
||||
|
||||
# How we teach this
|
||||
|
||||
There will be a lot to teach here, and we have been iterating on a talk which we would give to the entire Kibana team. The Operations team would be available to assist anyone with questions about the Bazel aspects of the build system.
|
|
@ -1,323 +0,0 @@
|
|||
- Start Date: 2020-03-01
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
---
|
||||
- [1. Summary](#1-summary)
|
||||
- [2. Motivation](#2-motivation)
|
||||
- [3. Detailed design](#3-detailed-design)
|
||||
- [4. Drawbacks](#4-drawbacks)
|
||||
- [5. Alternatives](#5-alternatives)
|
||||
- [6. Adoption strategy](#6-adoption-strategy)
|
||||
- [7. How we teach this](#7-how-we-teach-this)
|
||||
- [8. Unresolved questions](#8-unresolved-questions)
|
||||
|
||||
# 1. Summary
|
||||
|
||||
Object-level security ("OLS") authorizes Saved Object CRUD operations on a per-object basis.
|
||||
This RFC focuses on [phase 1](https://github.com/elastic/kibana/issues/82725), which introduces "private" saved object types. These private types
|
||||
are owned by individual users, and are _generally_ only accessible by their owners.
|
||||
|
||||
This RFC does not address any [followup phases](https://github.com/elastic/kibana/issues/39259), which may support sharing, and ownership of "public" objects.
|
||||
|
||||
# 2. Motivation
|
||||
|
||||
OLS allows saved objects to be owned by individual users. This allows Kibana to store information that is specific
|
||||
to each user, which enables further customization and collaboration throughout our solutions.
|
||||
|
||||
The most immediate feature this unlocks is [User settings and preferences (#17888)](https://github.com/elastic/kibana/issues/17888),
|
||||
which is a very popular and long-standing request.
|
||||
|
||||
# 3. Detailed design
|
||||
|
||||
Phase 1 of OLS allows consumers to register "private" saved object types.
|
||||
These saved objects are owned by individual end users, and are subject to additional security controls.
|
||||
|
||||
Public (non-private) saved object types are not impacted by this RFC. This proposal does not allow types to transition to/from `public`/`private`; such transitions are considered out of scope for phase 1.
|
||||
|
||||
## 3.1 Saved Objects Service
|
||||
|
||||
### 3.1.1 Type registry
|
||||
The [saved objects type registry](https://github.com/elastic/kibana/blob/701697cc4a34d07c0508c3bdf01dca6f9d40a636/src/core/server/saved_objects/saved_objects_type_registry.ts) will allow consumers to register "private" saved object types via a new `accessClassification` property:
|
||||
|
||||
```ts
/**
 * The accessClassification dictates the protection level of the saved object:
 * * public (default): instances of this saved object type will be accessible to all users within the given namespace, who are authorized to act on objects of this type.
 * * private: instances of this saved object type will belong to the user who created them, and will not be accessible by other users, except for administrators.
 */
export type SavedObjectsAccessClassification = 'public' | 'private';

// Note: some existing properties have been omitted for brevity.
export interface SavedObjectsType {
  name: string;
  hidden: boolean;
  namespaceType: SavedObjectsNamespaceType;
  mappings: SavedObjectsTypeMappingDefinition;

  /**
   * The {@link SavedObjectsAccessClassification | accessClassification} for the type.
   */
  accessClassification?: SavedObjectsAccessClassification;
}

// Example consumer
class MyPlugin {
  setup(core: CoreSetup) {
    core.savedObjects.registerType({
      name: 'user-settings',
      accessClassification: 'private',
      namespaceType: 'single',
      hidden: false,
      mappings,
    });
  }
}
```
|
||||
|
||||
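Later checks (for example, the `find` filter shown later in this RFC, which calls `typeRegistry.isPrivate(type)`) only need to know whether a given type was registered as private. Below is a minimal sketch of such a helper, assuming the `SavedObjectsType` shape above; the class shown here is illustrative, not the actual registry implementation.

```ts
// Illustrative sketch of how the type registry could answer "is this type private?".
class IllustrativeTypeRegistry {
  private readonly types = new Map<string, SavedObjectsType>();

  public registerType(type: SavedObjectsType) {
    this.types.set(type.name, type);
  }

  /** True when the type was registered with `accessClassification: 'private'`. */
  public isPrivate(typeName: string): boolean {
    return this.types.get(typeName)?.accessClassification === 'private';
  }
}
```
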
### 3.1.2 Schema
|
||||
Saved object ownership will be recorded as metadata within each `private` saved object. We do so by adding a top-level `accessControl` object with a singular `owner` property. See [unresolved question 1](#81-accessControl.owner) for details on the `owner` property.
|
||||
|
||||
```ts
/**
 * Describes which users should be authorized to access this SavedObject.
 *
 * @public
 */
export interface SavedObjectAccessControl {
  /** The owner of this SavedObject. */
  owner: string;
}

// Note: some existing fields have been omitted for brevity
export interface SavedObject<T = unknown> {
  id: string;
  type: string;
  attributes: T;
  references: SavedObjectReference[];
  namespaces?: string[];
  /** Describes which users should be authorized to access this SavedObject. */
  accessControl?: SavedObjectAccessControl;
}
```
|
||||
|
||||
### 3.1.3 Saved Objects Client: Security wrapper
|
||||
|
||||
The [security wrapper](https://github.com/elastic/kibana/blob/701697cc4a34d07c0508c3bdf01dca6f9d40a636/x-pack/plugins/security/server/saved_objects/secure_saved_objects_client_wrapper.ts) authorizes and audits operations against saved objects.
|
||||
|
||||
There are two primary changes to this wrapper:
|
||||
|
||||
#### Attaching Access Controls
|
||||
|
||||
This wrapper will be responsible for attaching an access control specification to all private objects before they are created in Elasticsearch.
|
||||
It will also allow users to provide their own access control specification in order to support the import/create use cases.
|
||||
|
||||
Similar to the way we treat `namespaces`, it will not be possible to change an access control specification via the `update`/`bulk_update` functions in this first phase. We may consider adding a dedicated function to update the access control specification, similar to what we've done for sharing to spaces.
|
||||
|
||||
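As a rough illustration of that first responsibility, the wrapper could attach the owner before delegating to the underlying client. This is only a sketch under the interfaces defined above; the `accessControl` create option, the `isPrivate` helper, and `currentOwner` are assumptions rather than existing APIs.

```ts
// Sketch only: how the security wrapper might attach ownership on `create`.
async function createWithAccessControl<T>(
  baseClient: SavedObjectsClientContract,
  typeRegistry: { isPrivate(type: string): boolean },
  currentOwner: string,
  type: string,
  attributes: T,
  options: SavedObjectsCreateOptions = {}
) {
  if (!typeRegistry.isPrivate(type)) {
    return baseClient.create(type, attributes, options);
  }
  return baseClient.create(type, attributes, {
    ...options,
    // Honor a caller-supplied spec (import/create use cases); otherwise default to the current user.
    accessControl: options.accessControl ?? { owner: currentOwner },
  });
}
```
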
#### Authorization changes
|
||||
|
||||
This wrapper will be updated to ensure that access to private objects is only granted to authorized users. A user is authorized to operate on a private saved object if **all of the following** are true:
|
||||
Step 1) The user is authorized to perform the operation on saved objects of the requested type, within the requested space. (Example: `update` a `user-settings` saved object in the `marketing` space)
|
||||
Step 2) The user is authorized to access this specific instance of the saved object, as described by that object's access control specification. For this first phase, the `accessControl.owner` is allowed to perform all operations. The only other users who are allowed to access this object are administrators (see [resolved question 2](#92-authorization-for-private-objects))
|
||||
|
||||
Step 1 of this authorization check is the same check we perform today for all existing saved object types. Step 2 is a new authorization check, and **introduces additional overhead and complexity**. We explore the logic for this step in more detail later in this RFC. Alternatives to this approach are discussed in [alternatives, section 5.2](#52-re-using-the-repositorys-pre-flight-checks).
|
||||
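Roughly, the combined check could look like the following sketch. It is illustrative only; helpers such as `ensureAuthorizedAtTypeLevel`, `getCurrentOwner`, and `isAdministrator` are assumptions standing in for the wrapper's existing and proposed internals.

```ts
// Illustrative two-step authorization check for a single private object.
async function ensureAuthorizedForPrivateObject(
  object: SavedObject,
  action: 'get' | 'create' | 'update' | 'delete',
  deps: {
    ensureAuthorizedAtTypeLevel(type: string, action: string): Promise<void>; // step 1: existing check
    getCurrentOwner(): string;
    isAdministrator(): boolean;
  }
) {
  // Step 1: type/space/action level authorization, identical to public objects.
  await deps.ensureAuthorizedAtTypeLevel(object.type, action);

  // Step 2: instance-level authorization against the object's access control specification.
  const owner = object.accessControl?.owner;
  if (owner !== deps.getCurrentOwner() && !deps.isAdministrator()) {
    throw new Error(`Unable to ${action} ${object.type}: not authorized for this object`);
  }
}
```
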
|
||||

|
||||
|
||||
## 3.2 Saved Objects API
|
||||
|
||||
OLS Phase 1 does not introduce any new APIs, but rather augments the existing Saved Object APIs.
|
||||
|
||||
APIs which return saved objects are augmented to include the top-level `accessControl` property when it exists. This includes the `export` API.
|
||||
|
||||
APIs that create saved objects are augmented to accept an `accessControl` property. This includes the `import` API.
|
||||
|
||||
### `get` / `bulk_get`
|
||||
|
||||
The security wrapper will ensure the user is authorized to access private objects before returning them to the consumer.
|
||||
|
||||
#### Performance considerations
|
||||
None. The retrieved object contains all of the necessary information to authorize the current user, with no additional round trips to Elasticsearch.
|
||||
|
||||
### `create` / `bulk_create`
|
||||
|
||||
The security wrapper will ensure that an access control specification is attached to all private objects.
|
||||
|
||||
If the caller has requested to overwrite existing `private` objects, then the security wrapper must ensure that the user is authorized to do so.
|
||||
|
||||
#### Performance considerations
|
||||
When overwriting existing objects, the security wrapper must first retrieve all of the existing `private` objects to ensure that the user is authorized. This requires another round-trip to `get`/`bulk-get` all `private` objects so we can authorize the operation.
|
||||
|
||||
This overhead does not impact overwriting "public" objects. We only need to retrieve objects that are registered as `private`. As such, we do not expect any meaningful performance hit initially, but this will grow over time as the feature is used.
|
||||
|
||||
### `update` / `bulk_update`
|
||||
|
||||
The security wrapper will ensure that the user is authorized to update all existing `private` objects. It will also ensure that an access control specification is not provided, as updates to the access control specification are not permitted via `update`/`bulk_update`.
|
||||
|
||||
#### Performance considerations
|
||||
Similar to the "create / override" scenario above, the security wrapper must first retrieve all of the existing `private` objects to ensure that the user is authorized. This requires another round-trip to `get`/`bulk-get` all `private` objects so we can authorize the operation.
|
||||
|
||||
This overhead does not impact updating "public" objects. We only need to retrieve objects that are registered as `private`. As such, we do not expect any meaningful performance hit initially, but this will grow over time as the feature is used.
|
||||
|
||||
### `delete`
|
||||
|
||||
The security wrapper will first retrieve the requested `private` object to ensure the user is authorized.
|
||||
|
||||
#### Performance considerations
|
||||
The security wrapper must first retrieve the existing `private` object to ensure that the user is authorized. This requires another round-trip to `get` the `private` object so we can authorize the operation.
|
||||
|
||||
This overhead does not impact deleting "public" objects. We only need to retrieve objects that are registered as `private`. As such, we do not expect any meaningful performance hit initially, but this will grow over time as the feature is used.
|
||||
|
||||
|
||||
### `find`
|
||||
The security wrapper will supply or augment a [KQL `filter`](https://github.com/elastic/kibana/blob/701697cc4a34d07c0508c3bdf01dca6f9d40a636/src/core/server/saved_objects/types.ts#L118) which describes the objects the current user is authorized to see.
|
||||
|
||||
```ts
// Sample KQL filter
const filterClauses = typesToFind.reduce((acc, type) => {
  if (this.typeRegistry.isPrivate(type)) {
    return [
      ...acc,
      // note: this relies on specific behavior of the SO service's `filter_utils`,
      // which automatically wraps this in an `and` node to ensure the type is accounted for.
      // we have added additional safeguards there, and functional tests will ensure that changes
      // to this logic will not accidentally alter our authorization model.

      // This is equivalent to writing the following, if this syntax was allowed by the SO `filter` option:
      // esKuery.nodeTypes.function.buildNode('and', [
      //   esKuery.nodeTypes.function.buildNode('is', `accessControl.owner`, this.getOwner()),
      //   esKuery.nodeTypes.function.buildNode('is', `type`, type),
      // ])
      esKuery.nodeTypes.function.buildNode('is', `${type}.accessControl.owner`, this.getOwner()),
    ];
  }
  return acc;
}, []);

const privateObjectsFilter =
  filterClauses.length > 0 ? esKuery.nodeTypes.function.buildNode('or', filterClauses) : null;
```
|
||||
|
||||
#### Performance considerations
|
||||
We are sending a more complex query to Elasticsearch for any find request which requests a `private` saved object. This has the potential to hurt query performance, but at this point it hasn't been quantified.
|
||||
|
||||
Since we are only requesting saved objects that the user is authorized to see, there is no additional overhead for Kibana once Elasticsearch has returned the results of the query.
|
||||
|
||||
|
||||
### `addToNamespaces` / `deleteFromNamespaces`
|
||||
|
||||
The security wrapper will ensure that the user is authorized to share/unshare all existing `private` objects.
|
||||
#### Performance considerations
|
||||
Similar to the "create / override" scenario above, the security wrapper must first retrieve all of the existing `private` objects to ensure that the user is authorized. This requires another round-trip to `get`/`bulk-get` all `private` objects so we can authorize the operation.
|
||||
|
||||
This overhead does not impact sharing/unsharing "public" objects. We only need to retrieve objects that are registered as `private`. As such, we do not expect any meaningful performance hit initially, but this will grow over time as the feature is used.
|
||||
|
||||
|
||||
## 3.3 Behavior with various plugin configurations
|
||||
Kibana can run with and without security enabled. When security is disabled,
|
||||
`private` saved objects will be accessible to all users.
|
||||
|
||||
| **Plugin Configuration** | Security    | Security & Spaces | Spaces                                              |
| ------------------------ | ----------- | ----------------- | --------------------------------------------------- |
|                          | ✅ Enforced | ✅ Enforced       | 🚫 Not enforced: objects will be accessible to all  |
|
||||
|
||||
### Alternative
|
||||
If this behavior is not desired, we can prevent `private` saved objects from being accessed whenever security is disabled.
|
||||
|
||||
See [unresolved question 3](#83-behavior-when-security-is-disabled)
|
||||
|
||||
## 3.4 Impacts on telemetry
|
||||
|
||||
The proposed design does not have any impacts on telemetry collection or reporting. Telemetry collectors run in the background against an "unwrapped" saved objects client. That is to say, they run without space-awareness, and without security. Since the security enforcement for private objects exists within the security wrapper, telemetry collection can continue as it currently exists.
|
||||
|
||||
# 4. Drawbacks
|
||||
|
||||
As outlined above, this approach introduces additional overhead to many of the saved object APIs. We minimize this by denoting which saved object types require this additional authorization.
|
||||
|
||||
This first phase also does not allow a public object to become private. Search sessions may migrate to OLS in the future, but this will likely be a coordinated effort with Elasticsearch, due to the differing ownership models between OLS and async searches.
|
||||
|
||||
# 5. Alternatives
|
||||
|
||||
## 5.1 Document level security
|
||||
OLS can be thought of as a Kibana-specific implementation of [Document level security](https://www.elastic.co/guide/en/elasticsearch/reference/current/document-level-security.html) ("DLS"). As such, we could consider enhancing the existing DLS feature to fit our needs (DLS doesn't prevent writes at the moment, only reads). This would involve considerable work from the Elasticsearch security team before we could consider this, and may not scale to subsequent phases of OLS.
|
||||
|
||||
## 5.2 Re-using the repository's pre-flight checks
|
||||
The Saved Objects Repository uses pre-flight checks to ensure that operations against multi-namespace saved objects adhere to the user's current space. The currently proposed implementation has the security wrapper performing pre-flight checks for `private` objects.
|
||||
|
||||
If we have `private` multi-namespace saved objects, then we will end up performing two pre-flight requests, which is excessive. We could explore re-using the repository's pre-flight checks instead of introducing new checks.
|
||||
|
||||
The primary concern with this approach is audit logging. Currently, we audit create/update/delete events before they happen, so that we can record that the operation was attempted, even in the event of a network outage or other transient event.
|
||||
|
||||
If we re-use the repository's pre-flight checks, then the repository will need a way to signal that audit logging should occur. We have a couple of options to explore in this regard:
|
||||
|
||||
### 5.2.1 Move audit logging code into the repository
|
||||
Now that we no longer ship an OSS distribution, we could move the audit logging code directly into the repository. The implementation could still be provided by the security plugin, so we could still record information about the current user, and respect the current license.
|
||||
|
||||
If we take this approach, then we will need a way to create a repository without audit logging. Certain features rely on the fact that the repository does not perform its own audit logging (such as Alerting, and the background repair jobs for ML).
|
||||
|
||||
Core originally provided an [`audit_trail_service`](https://github.com/elastic/kibana/blob/v7.9.3/src/core/server/audit_trail/audit_trail_service.ts) for this type of functionality, with the thinking that OSS features could take advantage of this if needed. This was abandoned when we discovered that we had no such usages at the time, so we simplified the architecture. We could re-introduce this if desired, in order to support this initiative.
|
||||
|
||||
Not all saved object audit events can be recorded by the repository. When users are not authorized at the type level (e.g., user can't `create` `dashboards`), then the wrapper will record this and not allow the operation to proceed. This shared-responsibility model will likely be even more confusing to reason about, so I'm not sure it's worth the small performance optimization we would get in return.
|
||||
|
||||
### 5.2.2 Pluggable authorization
|
||||
This inverts the current model. Instead of security wrapping the saved objects client, security could instead provide an authorization module to the repository. The repository could decide when to perform authorization (including audit logging), passing along the results of any pre-flight operations as necessary.
|
||||
|
||||
This is arguably a lot of work, but it is worth consideration as we evolve both our persistence and authorization mechanisms to support our maturing solutions.
|
||||
|
||||
Similar to alternative `5.2.1`, we would need a way to create a repository without authorization/auditing to support specific use cases.
|
||||
|
||||
### 5.2.3 Repository callbacks
|
||||
|
||||
A more rudimentary approach would be to provide callbacks via each saved object operation's `options` property. This callback would be provided by the security wrapper, and called by the repository when it was "safe" to perform the audit operation.
|
||||
|
||||
This is a very simplistic approach, and probably not an architecture that we want to encourage or support long-term.
|
||||
|
||||
### 5.2.4 Pass down preflight objects
|
||||
|
||||
Any client wrapper could fetch the object(s) on its own and pass them down to the repository in an `options` field (`preflightObject`/`preflightObjects`?) so the repository can reuse that result if it is defined, instead of initiating an entire additional preflight check. That resolves our problem without much additional complexity.
|
||||
Of course, we don't want consumers (mis)using this field, so we can either mark it as `@internal` or explore creating a separate "internal SOC" interface that is only meant to be used by the SOC wrappers.
|
||||
|
||||
|
||||
# 6. Adoption strategy
|
||||
|
||||
Adoption for net-new features is hopefully straightforward. Like most saved object features, the saved objects service will transparently handle all authorization and auditing of these objects, so long as they are properly registered.
|
||||
|
||||
Adoption for existing features (public saved object types) is not addressed in this first phase.
|
||||
|
||||
# 7. How we teach this
|
||||
|
||||
Updates to the saved object service's documentation to describe the different `accessClassification`s would be required. Like other saved object security controls, we want to ensure that engineers understand that this only "works" when the security wrapper is applied. Creating a bespoke instance of the saved objects client, or using the raw repository will intentionally bypass these authorization checks.
|
||||
|
||||
# 8. Unresolved questions
|
||||
|
||||
## 8.1 `accessControl.owner`
|
||||
|
||||
The `accessControl.owner` property will uniquely identify the owner of each `private` saved object. We are still iterating with the Elasticsearch security team on what this value will ultimately look like. It is highly likely that this will not be a human-readable piece of text, but rather a GUID-style identifier.
|
||||
|
||||
## 8.2 Authorization for private objects
|
||||
|
||||
This has been [resolved](#92-authorization-for-private-objects).
|
||||
|
||||
The user identified by `accessControl.owner` will be authorized for all operations against that instance, provided they pass the existing type/space/action authorization checks.
|
||||
|
||||
In addition to the object owner, we also need to allow administrators to manage these saved objects. This is beneficial if they need to perform a bulk import/export of private objects, or if they wish to remove private objects from users that no longer exist. The open question is: **who counts as an administrator?**
|
||||
|
||||
We have historically used the `Saved Objects Management` feature for these administrative tasks. This feature grants access to all saved objects, even if you're not authorized to access the "owning" application. Do we consider this privilege sufficient to see and potentially manipulate private saved objects?
|
||||
|
||||
## 8.3 Behavior when security is disabled
|
||||
|
||||
This has been [resolved](#93-behavior-when-security-is-disabled).
|
||||
|
||||
When security is disabled, should `private` saved objects still be accessible via the Saved Objects Client?
|
||||
|
||||
|
||||
# 9. Resolved Questions
|
||||
|
||||
## 9.2 Authorization for private objects
|
||||
|
||||
Users with the `Saved Objects Management` privilege will be authorized to access private saved objects belonging to other users.
|
||||
Additionally, we will introduce a sub-feature privilege which will allow administrators to control which of their users with `Saved Objects Management` access are authorized to access these private objects.
|
||||
|
||||
## 9.3 Behavior when security is disabled
|
||||
|
||||
When security is disabled, `private` objects will still be accessible via the Saved Objects Client.
|
|
@ -1,600 +0,0 @@
|
|||
- Start Date: 2021-03-26
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
|
||||
# Summary
|
||||
|
||||
Currently in the Kibana `share` plugin we have two services that deal with URLs.
|
||||
|
||||
One is *Short URL Service*: given a long internal Kibana URL it returns an ID.
|
||||
That ID can be used to "resolve" back to the long URL and redirect the user to
|
||||
that long URL page. (The Short URL Service is now used in Dashboard, Discover,
|
||||
Visualize apps, and has a few upcoming users; for example, when sharing panels
|
||||
by Slack or e-mail we will want to use short URLs.)
|
||||
|
||||
```ts
// It does not have a plugin API, you can only use it through an HTTP request.
const shortUrl = await http.post('/api/shorten_url', {
  url: '/some/long/kibana/url/.../very?long=true#q=(rison:approved)'
});
```
|
||||
|
||||
The other is the *URL Generator Service*: it simply receives an object of
|
||||
parameters and returns a deep link within Kibana. (You can use it, for
|
||||
example, to navigate to some specific query with specific filters for a
|
||||
specific index pattern in the Discover app. As of this writing, there are
|
||||
eight registered URL generators, which are used by ten plugins.)
|
||||
|
||||
```ts
// You first register a URL generator.
const myGenerator = plugins.share.registerUrlGenerator(/* ... */);

// You can fetch it from the registry (if you don't already have it).
const myGenerator = plugins.share.getUrlGenerator(/* ... */);

// Now you can use it to generate a deep link into Kibana.
const deepLink: string = myGenerator.createUrl({ /* ... */ });
```
|
||||
|
||||
|
||||
## Goals of the project
|
||||
|
||||
The proposal is to unify both of these services (Short URL Service and URL
|
||||
Generator Service) into a single new *URL Service*. The new unified service
|
||||
will still provide all the functionality the above mentioned services provide
|
||||
and in addition will implement the following improvements:
|
||||
|
||||
1. Standardize a way for apps to deep link and navigate into other Kibana apps,
|
||||
with ability to use *location state* to specify the state of the app which is
|
||||
not part of the URL.
|
||||
2. Combine Short URL Service with URL Generator Service to allow short URLs to
|
||||
be constructed from URL generators, which will also allow us to automatically
|
||||
migrate the short URLs if the parameters of the underlying URL generator
|
||||
change and be able to store location state in every short URL.
|
||||
3. Make the short url service easier to use. (It was previously undocumented,
|
||||
and no server side plugin APIs existed, which meant consumers had to use
|
||||
REST APIs which is discouraged. Merging the two services will help achieve
|
||||
this goal by simplifying the APIs.)
|
||||
4. Support short urls being deleted (previously not possible).
|
||||
5. Support short urls being migrated (previously not possible).
|
||||
|
||||
See more detailed explanation and other small improvements in the "Motivation"
|
||||
section below.
|
||||
|
||||
|
||||
# Terminology
|
||||
|
||||
In the proposed new service we introduce "locators". This is mostly a change
|
||||
in language: we are renaming "URL generators" to "locators". The old name would
|
||||
no longer make sense as we are not returning URLs from locators.
|
||||
|
||||
|
||||
# Basic example
|
||||
|
||||
The URL Service will have a client (`UrlServiceClient`) which will have the same
|
||||
interface both on the server-side and the client-side. It will also have a
|
||||
documented public set of HTTP API endpoints for use by: (1) the client-side
|
||||
client; (2) external users, Elastic Cloud, and Support.
|
||||
|
||||
The following code examples will work both on the server-side and the
|
||||
client-side, as the base `UrlServiceClient` interface will be similar in both
|
||||
environments.
|
||||
|
||||
Below we consider four main examples of usage of the URL Service. All four
|
||||
examples are existing use cases we currently have in Kibana.
|
||||
|
||||
|
||||
## Navigating within Kibana using locators
|
||||
|
||||
In this example, let's consider a case where the Discover app creates a locator,
|
||||
then another plugin uses that locator to navigate to a deep link within the
|
||||
Discover app.
|
||||
|
||||
First, the Discover plugin creates its locator (usually one per app). It needs
|
||||
to do this on the client and server.
|
||||
|
||||
|
||||
```ts
const locator = plugins.share.locators.create({
  id: 'DISCOVER_DEEP_LINKS',
  getLocation: ({
    indexPattern,
    highlightedField,
    filters = [],
    query = {},
    fields = [],
    activeDoc = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxxx',
  }) => ({
    app: 'discover',
    route: `/${indexPattern}#_a=(${risonEncode({ filters, query, fields })})`,
    state: {
      highlightedField,
      activeDoc,
    },
  }),
});
```
|
||||
|
||||
Now, the Discover plugin exports this locator from its plugin contract.
|
||||
|
||||
```ts
class DiscoverPlugin {
  start() {
    return {
      locator,
    };
  }
}
```
|
||||
|
||||
Finally, if any other app now wants to navigate to a deep link within the
|
||||
Discover application, they use this exported locator.
|
||||
|
||||
```ts
plugins.discover.locator.navigate({
  indexPattern: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
  highlightedField: 'foo',
});
```
|
||||
|
||||
Note, in this example the `highlightedField` parameter will not appear in the
|
||||
URL bar; it will be passed to the Discover app through the [`history.pushState()`](https://developer.mozilla.org/en-US/docs/Web/API/History/pushState)
|
||||
mechanism (in Kibana's case, using the [`history`](https://www.npmjs.com/package/history) package, which is used by `core.application.navigateToApp`).
|
||||
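To make the role of location state concrete, a client-side app could read it off the history location on mount. This is a rough sketch; the `DiscoverLocationState` shape is taken from the locator example above, while the exact wiring inside Discover is an assumption.

```ts
import { createHashHistory } from 'history';

// Hypothetical shape of the state attached by the locator example above.
interface DiscoverLocationState {
  highlightedField?: string;
  activeDoc?: string;
}

const history = createHashHistory();

// State passed via history.pushState() is available on the location object.
const state = (history.location.state ?? {}) as DiscoverLocationState;
if (state.highlightedField) {
  // e.g. highlight that field in the Discover document table
}
```
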
|
||||
|
||||
## Sending a deep link to Kibana
|
||||
|
||||
We have use cases where a deep link to some Kibana app is sent out, for example,
|
||||
through e-mail or as a Slack message.
|
||||
|
||||
In this example, let's consider a plugin that gets hold of the Discover locator
|
||||
on the server-side.
|
||||
|
||||
```ts
const location = plugins.discover.locator.getRedirectPath({
  indexPattern: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
  highlightedField: 'foo',
});
```
|
||||
|
||||
This would return the location of the client-side redirect endpoint. The redirect
|
||||
endpoint could look like this:
|
||||
|
||||
```
/app/goto/_redirect/DISCOVER_DEEP_LINKS?params={"indexPattern":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","highlightedField":"foo"}&paramsVersion=7.x
```
|
||||
|
||||
This redirect client-side endpoint would find the Discover locator and
|
||||
execute the `.navigate()` method on it.
|
||||
|
||||
|
||||
## Creating a short link
|
||||
|
||||
In this example, let's create a short link using the Discover locator.
|
||||
|
||||
```ts
const shortUrl = await plugins.discover.locator.createShortUrl(
  {
    indexPattern: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
    highlightedField: 'foo',
  },
  'human-readable-slug',
);
```
|
||||
|
||||
The above example creates a short link and persists it in a saved object. The
|
||||
short URL can have a human-readable slug, which uniquely identifies that short
|
||||
URL.
|
||||
|
||||
```ts
shortUrl.slug === 'human-readable-slug'
```
|
||||
|
||||
The short URL can be used to navigate to the Discover app. The redirect
|
||||
client-side endpoint currently looks like this:
|
||||
|
||||
```
/app/goto/human-readable-slug
```
|
||||
|
||||
This persisted short URL would effectively work the same as the full version:
|
||||
|
||||
```
/app/goto/_redirect/DISCOVER_DEEP_LINKS?params={"indexPattern":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","highlightedField":"foo"}&paramsVersion=7.x
```
|
||||
|
||||
|
||||
## External users navigating to a Kibana deep link
|
||||
|
||||
Currently Elastic Cloud and Support have many links linking into Kibana. Most of
|
||||
them are deep links into the Discover and Dashboard apps where, for example, an index
|
||||
pattern is selected, or filters and time range are set.
|
||||
|
||||
The external users could use the above mentioned client-side redirect endpoint
|
||||
to navigate to their desired deep location within Kibana, for example, to the
|
||||
Discover application:
|
||||
|
||||
```
/app/goto/_redirect/DISCOVER_DEEP_LINKS?params={"indexPattern":"xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx","highlightedField":"foo"}&paramsVersion=7.x
```
|
||||
|
||||
|
||||
# Motivation
|
||||
|
||||
Our motivation to improve the URL services comes from us intending to use them
|
||||
more, for example, for panel sharing to Slack or e-mail; and we believe that the
|
||||
current state of the URL services needs an upgrade.
|
||||
|
||||
|
||||
## Limitations of the Short URL Service
|
||||
|
||||
We have identified the following limitations in the current implementation of
|
||||
the Short URL Service:
|
||||
|
||||
1. There is no migration system. If an application exposes this functionality,
|
||||
every possible URL that might be generated should be supported forever. A
|
||||
migration could be written inside the app itself, on page load, but this is a
|
||||
risky path for URLs with many possibilities.
|
||||
1. __Will do:__ Short URLs will be created using locators. We will use
|
||||
migrations provided by the locators to migrate the stored parameters
|
||||
in the short URL saved object.
|
||||
1. Short URLs store only the URL of the destination page. However, the
|
||||
destination page might have other state which affects the display of the page
|
||||
but is not present in the URL. Once the short URL is used to navigate to that
|
||||
page, any state that is kept only in memory is lost.
|
||||
1. __Will do:__ The new implementation of the short URLs will also persist
|
||||
the location state of the URL. That state would be provided to a
|
||||
Kibana app once a user navigates to that app using a short URL.
|
||||
1. It exposes only HTTP endpoint API.
|
||||
1. __Will do:__ We will also expose a URL Service client through plugin
|
||||
contract on the server and browser.
|
||||
1. It only has 3 HTTP endpoints, yet all three have different paths:
|
||||
(1) `/short_url`, (2) `/shorten_url`; and (3) `/goto`.
|
||||
1. __Will do:__ We will normalize the HTTP endpoints. We will use HTTP
|
||||
method "verbs" like POST, instead of verbs in the url like "shorten_url".
|
||||
1. There is not much documentation for developers.
|
||||
1. __Will do:__ The new service will have a much nicer API and docs.
|
||||
1. There is no way to delete short URLs once they are created.
|
||||
1. __Will do:__ The new service will provide CRUD API to manage short URLs,
|
||||
including deletion.
|
||||
1. Short URL service uses MD5 algorithm to hash long URLs. Security team
|
||||
requested to stop using that algorithm.
|
||||
1. __Will do:__ The new URL Service will not use MD5 algorithm.
|
||||
1. Short URLs are not automatically deleted when the target (say dashboard) is
|
||||
deleted. (#10450)
|
||||
1. __Could do:__ The URL Service will not provide such feature. Though the
|
||||
short URLs will keep track of saved object references used in the params
|
||||
to generate a short URL. Maybe those saved references could somehow be
|
||||
used in the future to provide such a facility.
|
||||
|
||||
Currently, there are two possible avenues for deleting a short URL when
|
||||
the underlying dashboard is deleted:
|
||||
|
||||
1. The Dashboard app could keep track of short URLs it generates for each
|
||||
dashboard. Once a dashboard is deleted, the Dashboard app also
|
||||
deletes all short URLs associated with that dashboard.
|
||||
1. Saved Objects Service could implement *cascading deletes*. Once a saved
|
||||
object is deleted, the associated saved objects are also deleted
|
||||
(#71453).
|
||||
1. Add additional metadata to each short URL.
|
||||
1. __Could do:__ Each short URL already keeps a counter of how often it was
|
||||
resolved, we could also keep track of a timestamp when it was last
|
||||
resolved, and have an ability for users to give a title to each short URL.
|
||||
1. Short URLs don't have a management UI.
|
||||
1. __Will NOT do:__ We will not create a dedicated UI for managing short
|
||||
URLs. We could improve how short URLs saved objects are presented in saved
|
||||
object management UI.
|
||||
1. Short URLs can't be created by read-only users (#18006).
|
||||
1. __Will NOT do:__ Currently short URLs are stored as saved objects of type
|
||||
`url`, we would like to keep it that way and benefit from saved object
|
||||
facilities like references, migrations, authorization etc.. The consensus
|
||||
is that we will not allow anonymous users to create short URLs. We want to
|
||||
continue using saved object for short URLs going forward and not
|
||||
compromise on their security model.
|
||||
|
||||
|
||||
## Limitations of the URL Generator Service
|
||||
|
||||
We have identified the following limitations in the current implementation of
|
||||
the URL Generator Service:
|
||||
|
||||
1. URL generators generate only the URL of the destination. However, there is
|
||||
   also the ability to use location state with the `core.application.navigateToApp`
|
||||
navigation method.
|
||||
1. __Will do:__ The new locators will also generate the location state, which
|
||||
will be used in `.navigateToApp` method.
|
||||
1. URL generators are available only on the client-side. There is no way to use
|
||||
them together with short URLs.
|
||||
1. __Will do:__ We will implement locators also on the server-side
|
||||
(they will be available in both environments) and we will combine them
|
||||
with the Short URL Service.
|
||||
1. URL generators are not exposed externally, thus Cloud and Support cannot use
|
||||
them to generate deep links into Kibana.
|
||||
1. __Will do:__ We will expose HTTP endpoints on the server-side and the
|
||||
"redirect" app on the client-side which external users will be able to use
|
||||
to deep link into Kibana using locators.
|
||||
|
||||
|
||||
## Limitations of the architecture
|
||||
|
||||
One major reason we want to "refresh" the Short URL Service and the URL
|
||||
Generator Service is their architecture.
|
||||
|
||||
Currently, the Short URL Service is implemented on top of the `url` type saved
|
||||
object on the server-side. However, it only exposes the
|
||||
HTTP endpoints; it does not expose any API on the server for server-side
|
||||
plugins to consume. On the client-side there is no plugin API either; developers
|
||||
need to manually execute HTTP requests.
|
||||
|
||||
The URL Generator Service is only available on the client-side; there is no way
|
||||
to use it on the server-side, yet we already have use cases (for example, the ML
|
||||
team) where a server-side plugin wants to use a URL generator.
|
||||
|
||||

|
||||
|
||||
The current architecture does not allow both services to be conveniently used,
|
||||
and, as they are implemented in different locations, they are disjointed:
|
||||
we cannot create a short URL using a URL generator.
|
||||
|
||||
|
||||
# Detailed design
|
||||
|
||||
In general, we will try to provide, as much as possible, the same API on the
|
||||
server-side and the client-side.
|
||||
|
||||
|
||||
## High level architecture
|
||||
|
||||
The diagram below shows the proposed architecture of the URL Service.
|
||||
|
||||

|
||||
|
||||
|
||||
## Plugin contracts
|
||||
|
||||
The aim is to provide developers the same experience on the server and browser.
|
||||
|
||||
Below are preliminary interfaces of the new URL Service. `IUrlService` will be
|
||||
a shared interface defined in `/common` folder shared across server and browser.
|
||||
This will allow us to provide users a common API interface on the server and
|
||||
browser, wherever they choose to use the URL Service:
|
||||
|
||||
```ts
/**
 * Common URL Service client interface for the server-side and the client-side.
 */
interface IUrlService {
  locators: ILocatorClient;
  shortUrls: IShortUrlClient;
}
```
|
||||
|
||||
|
||||
### Locators
|
||||
|
||||
The locator business logic will be contained in the `ILocatorClient` client and will
|
||||
provide two main functionalities:
|
||||
|
||||
1. It will provide a facility to create locators.
|
||||
1. It will also be a registry of locators; every newly created locator is
|
||||
automatically added to the registry. The registry should never be used when
|
||||
   the locator ID is known at compile time, but is reserved only for use cases
|
||||
   when we only know the ID of a locator at runtime.
|
||||
|
||||
```ts
interface ILocatorClient {
  create<P>(definition: LocatorDefinition<P>): Locator<P>;
  get<P>(id: string): Locator<P>;
}
```
|
||||
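As a small usage sketch of the registry path (the `DISCOVER_DEEP_LINKS` ID and params come from the earlier example; fetching by ID like this is only needed when the ID is not known until runtime):

```ts
// Illustrative only: resolving a locator from the registry by ID at runtime.
interface DiscoverDeepLinkParams {
  indexPattern: string;
  highlightedField?: string;
}

const discoverLocator = plugins.share.locators.get<DiscoverDeepLinkParams>('DISCOVER_DEEP_LINKS');
discoverLocator.navigate({ indexPattern: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx' });
```
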
|
||||
The `LocatorDefinition` interface is a developer-friendly interface for creating
|
||||
new locators. Mainly two things will be required from each new locator:
|
||||
|
||||
1. Implement the `getLocation()` method, which, given the locator-specific `params`
|
||||
   object, returns a Kibana location; see the description of `KibanaLocation` below.
|
||||
2. Implement the `PersistableState` interface which we use in Kibana. This will
|
||||
   allow us to migrate the locator `params`. Implementation of the `PersistableState`
|
||||
interface will replace the `.isDeprecated` and `.migrate()` properties of URL
|
||||
generators.
|
||||
|
||||
|
||||
```ts
interface LocatorDefinition<P> extends PersistableState<P> {
  id: string;
  getLocation(params: P): KibanaLocation;
}
```
|
||||
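To make the second requirement concrete, here is a rough sketch of a locator definition that also carries a persistable-state-style migration. The `migrations` member and its shape are assumptions for illustration, not a final `PersistableState` API:

```ts
// Illustrative sketch only; the `migrations` member name and shape are assumptions.
const discoverDeepLinkLocatorDefinition: LocatorDefinition<{
  indexPattern: string;
  highlightedField?: string;
}> = {
  id: 'DISCOVER_DEEP_LINKS',

  getLocation: ({ indexPattern, highlightedField }) => ({
    app: 'discover',
    route: `/${indexPattern}`,
    state: { highlightedField },
  }),

  migrations: {
    '7.14.0': (params: Record<string, unknown>) => {
      // Hypothetical migration: rename an older `field` param to `highlightedField`.
      const { field, ...rest } = params;
      return { ...rest, highlightedField: rest.highlightedField ?? field };
    },
  },
};
```
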
|
||||
Each constructed locator will have the following interface:
|
||||
|
||||
```ts
interface Locator<P> {
  /** Creates a new short URL saved object using this locator. */
  createShortUrl(params: P, slug?: string): Promise<ShortUrl>;
  /** Returns a relative URL to the client-side redirect endpoint using this locator. */
  getRedirectPath(params: P): string;
  /** Navigate using core.application.navigateToApp() using this locator. */
  navigate(params: P): void; // Only on browser.
}
```
|
||||
|
||||
|
||||
### Short URLs
|
||||
|
||||
The short URL client is `IShortUrlClient`, which will be the same on the server and
|
||||
browser. However, the server and browser might add extra utility methods for
|
||||
convenience.
|
||||
|
||||
```ts
/**
 * CRUD-like API for short URLs.
 */
interface IShortUrlClient {
  /**
   * Delete a short URL.
   *
   * @param slug The slug (ID) of the short URL.
   * @return Returns true if deletion was successful.
   */
  delete(slug: string): Promise<boolean>;

  /**
   * Fetch short URL.
   *
   * @param slug The slug (ID) of the short URL.
   */
  get(slug: string): Promise<ShortUrl>;

  /**
   * Same as `get()` but it also increments the "view" counter and the
   * "last view" timestamp of this short URL.
   *
   * @param slug The slug (ID) of the short URL.
   */
  resolve(slug: string): Promise<ShortUrl>;
}
```
|
||||
|
||||
Note that in this new service, to create a short URL, the developer will have to
|
||||
use a locator (instead of creating it directly from a long URL).
|
||||
|
||||
```ts
const shortUrl = await plugins.share.shortUrls.create(
  plugins.discover.locator,
  {
    indexPattern: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
    highlightedField: 'foo',
  },
  'optional-human-readable-slug',
);
```
|
||||
|
||||
These short URLs will be stored in saved objects of type `url` and will be
|
||||
automatically migrated using the locator. The long URL will NOT be stored in the
|
||||
saved object. The locator ID and locator params will be stored in the saved
|
||||
object, which will allow us to do the migrations for short URLs.
|
||||
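For illustration, the attributes stored in such a `url` saved object might look roughly like the shape below. This is a hypothetical sketch, not the final mapping:

```ts
// Hypothetical attributes of a short URL ("url" type) saved object.
interface ShortUrlSavedObjectAttributes {
  slug: string;
  accessCount: number; // how often the short URL was resolved
  accessDate: number; // timestamp of the last resolution
  createDate: number;
  locator: {
    id: string; // e.g. 'DISCOVER_DEEP_LINKS'
    version: string; // Kibana version the params were created with, used for migrations
    params: Record<string, unknown>; // stored instead of the long URL
  };
}
```
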
|
||||
|
||||
### `KibanaLocation` interface
|
||||
|
||||
The `KibanaLocation` interface is a simple interface to store a location in some
|
||||
Kibana application.
|
||||
|
||||
```ts
interface KibanaLocation {
  app: string;
  route: string;
  state: object;
}
```
|
||||
|
||||
It maps directly to a `.navigateToApp()` call.
|
||||
|
||||
```ts
let location: KibanaLocation;

core.application.navigateToApp(location.app, {
  route: location.route,
  state: location.state,
});
```
|
||||
|
||||
|
||||
## HTTP endpoints
|
||||
|
||||
|
||||
### Short URL CRUD+ HTTP endpoints
|
||||
|
||||
The HTTP endpoints below are designed to work specifically with short URLs:
|
||||
|
||||
| HTTP method | Path                             | Description                                                                                                                    |
|-------------|----------------------------------|--------------------------------------------------------------------------------------------------------------------------------|
| __POST__    | `/api/short_url`                 | Endpoint for creating new short URLs.                                                                                            |
| __GET__     | `/api/short_url/<slug>`          | Endpoint for retrieving information about an existing short URL.                                                                 |
| __DELETE__  | `/api/short_url/<slug>`          | Endpoint for deleting an existing short URL.                                                                                      |
| __POST__    | `/api/short_url/<slug>`          | Endpoint for updating information about an existing short URL.                                                                    |
| __POST__    | `/api/short_url/<slug>/_resolve` | Similar to `GET /api/short_url/<slug>`, but also increments the short URL access count counter and the last access timestamp.     |
|
||||
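As a usage sketch of the create endpoint (only the path comes from the table above; the request body shape shown here is an assumption):

```ts
// Hypothetical request body; the final schema may differ.
const shortUrl = await http.post('/api/short_url', {
  locatorId: 'DISCOVER_DEEP_LINKS',
  params: { indexPattern: 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx', highlightedField: 'foo' },
  slug: 'optional-human-readable-slug',
});
```
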
|
||||
|
||||
### The client-side navigate endpoint
|
||||
|
||||
__NOTE.__ We are currently investigating if we really need this endpoint. The
|
||||
main user of it was expected to be Cloud and Support to deeply link into Kibana,
|
||||
but we are now reconsidering if we want to support this endpoint and possibly
|
||||
find a different solution.
|
||||
|
||||
The `/app/goto/_redirect/<locatorId>?params=...&paramsVersion=...` client-side
|
||||
endpoint will receive the locator ID and locator params; it will use those to
|
||||
find the locator and execute its `locator.navigate(params)` method.
|
||||
|
||||
The `paramsVersion` parameter will be used to specify the version of the
|
||||
`params` parameter. If the version is behind the latest version, then the migration
|
||||
facilities of the locator will be used to migrate the `params` on the fly to the
|
||||
latest version.
|
||||
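A rough sketch of what the redirect handler could do with these two query parameters is shown below. The `migrateParams` helper and the parsing details are assumptions for illustration only:

```ts
// Assumed helper that applies the locator's migrations from one version to another.
declare function migrateParams(
  locatorId: string,
  params: Record<string, unknown>,
  fromVersion: string,
  toVersion: string
): Record<string, unknown>;

// Illustrative client-side redirect handler; not a final implementation.
function handleRedirect(locatorId: string, search: string, currentVersion: string) {
  const query = new URLSearchParams(search);
  const paramsVersion = query.get('paramsVersion') ?? currentVersion;
  let params: Record<string, unknown> = JSON.parse(query.get('params') ?? '{}');

  if (paramsVersion !== currentVersion) {
    // Migrate params produced by an older Kibana version on the fly.
    params = migrateParams(locatorId, params, paramsVersion, currentVersion);
  }

  plugins.share.locators.get(locatorId).navigate(params);
}
```
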
|
||||
|
||||
### Legacy endpoints
|
||||
|
||||
Below are the legacy HTTP endpoints implemented by the `share` plugin, with a
|
||||
plan of action for each endpoint:
|
||||
|
||||
| HTTP method | Path                    | Description                                                                                                                           |
|-------------|-------------------------|---------------------------------------------------------------------------------------------------------------------------------------|
| __ANY__     | `/goto/<slug>`          | Endpoint for redirecting short URLs; we will keep it to redirect short URLs.                                                            |
| __GET__     | `/api/short_url/<slug>` | The new `GET /api/short_url/<slug>` endpoint will return a superset of the payload that the legacy endpoint now returns.                |
| __POST__    | `/api/shorten_url`      | The legacy endpoint for creating short URLs. We will remove or deprecate this endpoint and maintain it until the 8.0 major release.     |
|
||||
|
||||
|
||||
# Drawbacks
|
||||
|
||||
Why should we *not* do this?
|
||||
|
||||
- Implementation cost will be a few weeks, but the code complexity and quality
|
||||
will improve.
|
||||
- There is a cost of migrating existing Kibana plugins to use the new API.
|
||||
|
||||
|
||||
# Alternatives
|
||||
|
||||
We haven't considered other design alternatives.
|
||||
|
||||
One alternative is to still do the short URL improvements outlined above, but
|
||||
reconsider URL generators:
|
||||
|
||||
- Do we need URL generators at all?
|
||||
- Kibana URLs are not stable and have changed in our past experience. Hence,
|
||||
the URL generators were created to make the URL generator parameters stable
|
||||
unless a migration is available.
|
||||
- Do we want to put migration support in URL generators?
|
||||
- Alternative would be for each app to support URLs forever or do the
|
||||
migrations on the fly for old URLs.
|
||||
- Should Kibana URLs be stable and break only during major releases?
|
||||
- Should the Kibana application interface be extended such that some version of
|
||||
URL generators is built in?
|
||||
|
||||
The impact of not doing this change is essentially extending technical debt.
|
||||
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
Is this a breaking change? It is a breaking change in the sense that the API
|
||||
will change. However, all the existing use cases will be supported. When
|
||||
implementing this we will also adjust all Kibana code to use the new API. From
|
||||
the perspective of developers using the existing URL services, nothing
|
||||
will change; they will simply need to review a PR which stops using the URL
|
||||
Generator Service and uses the combined URL Service instead, which will provide
|
||||
a superset of features.
|
||||
|
||||
Alternatively, we can deprecate the URL Generator Service and maintain it for a
|
||||
few minor releases.
|
||||
|
||||
|
||||
# How we teach this
|
||||
|
||||
For the existing short URL and URL generator functionality there is nothing to
|
||||
teach, as they will continue working with a largely similar API.
|
||||
|
||||
Everything else in the new URL Service will have JSDoc comments and good
|
||||
documentation on our website.
|
|
@ -1,923 +0,0 @@
|
|||
- Start Date: 2021-03-29
|
||||
- RFC PR: [#95070](https://github.com/elastic/kibana/pull/95070)
|
||||
- Kibana Issue: [#94630](https://github.com/elastic/kibana/issues/94630)
|
||||
|
||||
---
|
||||
|
||||
- [Summary](#summary)
|
||||
- [Motivation](#motivation)
|
||||
- [Required and Desired Capabilities](#required-and-desired-capabilities)
|
||||
- [Required](#required)
|
||||
- [Scalable](#scalable)
|
||||
- [Stable](#stable)
|
||||
- [Surfaces information intuitively](#surfaces-information-intuitively)
|
||||
- [Pipelines](#pipelines)
|
||||
- [Advanced Pipeline logic](#advanced-pipeline-logic)
|
||||
- [Cloud-friendly pricing model](#cloud-friendly-pricing-model)
|
||||
- [Public access](#public-access)
|
||||
- [Secrets handling](#secrets-handling)
|
||||
- [Support or Documentation](#support-or-documentation)
|
||||
- [Scheduled Builds](#scheduled-builds)
|
||||
- [Container support](#container-support)
|
||||
- [Desired](#desired)
|
||||
- [Customization](#customization)
|
||||
- [Core functionality is first-party](#core-functionality-is-first-party)
|
||||
- [First-class support for test results](#first-class-support-for-test-results)
|
||||
- [GitHub Integration](#github-integration)
|
||||
- [Buildkite - Detailed design](#buildkite---detailed-design)
|
||||
- [Overview](#overview)
|
||||
- [Required and Desired Capabilities](#required-and-desired-capabilities-1)
|
||||
- [Required](#required-1)
|
||||
- [Scalable](#scalable-1)
|
||||
- [Stable](#stable-1)
|
||||
- [Surfaces information intuitively](#surfaces-information-intuitively-1)
|
||||
- [Pipelines](#pipelines-1)
|
||||
- [Advanced Pipeline logic](#advanced-pipeline-logic-1)
|
||||
- [Cloud-friendly pricing model](#cloud-friendly-pricing-model-1)
|
||||
- [Public access](#public-access-1)
|
||||
- [Secrets handling](#secrets-handling-1)
|
||||
- [Support or Documentation](#support-or-documentation-1)
|
||||
- [Scheduled Builds](#scheduled-builds-1)
|
||||
- [Container support](#container-support-1)
|
||||
- [Desired](#desired-1)
|
||||
- [Customization](#customization-1)
|
||||
- [Core functionality is first-party](#core-functionality-is-first-party-1)
|
||||
- [First-class support for test results](#first-class-support-for-test-results-1)
|
||||
- [GitHub Integration](#github-integration-1)
|
||||
- [What we will build and manage](#what-we-will-build-and-manage)
|
||||
- [Elastic Buildkite Agent Manager](#elastic-buildkite-agent-manager)
|
||||
- [Overview](#overview-1)
|
||||
- [Design](#design)
|
||||
- [Protection against creating too many instances](#protection-against-creating-too-many-instances)
|
||||
- [Configuration](#configuration)
|
||||
- [Build / Deploy](#build--deploy)
|
||||
- [Elastic Buildkite PR Bot](#elastic-buildkite-pr-bot)
|
||||
- [Overview](#overview-2)
|
||||
- [Configuration](#configuration-1)
|
||||
- [Build / Deploy](#build--deploy-1)
|
||||
- [Infrastructure](#infrastructure)
|
||||
- [Monitoring / Alerting](#monitoring--alerting)
|
||||
- [Agent Image management](#agent-image-management)
|
||||
- [Buildkite org-level settings management](#buildkite-org-level-settings-management)
|
||||
- [IT Security Processes](#it-security-processes)
|
||||
- [Drawbacks](#drawbacks)
|
||||
- [Alternatives](#alternatives)
|
||||
- [Jenkins](#jenkins)
|
||||
- [Required](#required-2)
|
||||
- [Scalable](#scalable-2)
|
||||
- [Stable](#stable-2)
|
||||
- [Updates](#updates)
|
||||
- [Surfaces information intuitively](#surfaces-information-intuitively-2)
|
||||
- [Pipelines](#pipelines-2)
|
||||
- [Advanced Pipeline logic](#advanced-pipeline-logic-2)
|
||||
- [Cloud-friendly pricing model](#cloud-friendly-pricing-model-2)
|
||||
- [Public access](#public-access-2)
|
||||
- [Secrets handling](#secrets-handling-2)
|
||||
- [Support or Documentation](#support-or-documentation-2)
|
||||
- [Scheduled Builds](#scheduled-builds-2)
|
||||
- [Container support](#container-support-2)
|
||||
- [Desired](#desired-2)
|
||||
- [Customization](#customization-2)
|
||||
- [Core functionality is first-party](#core-functionality-is-first-party-2)
|
||||
- [First-class support for test results](#first-class-support-for-test-results-2)
|
||||
- [GitHub Integration](#github-integration-2)
|
||||
- [Other solutions](#other-solutions)
|
||||
- [CircleCI](#circleci)
|
||||
- [GitHub Actions](#github-actions)
|
||||
- [Adoption strategy](#adoption-strategy)
|
||||
- [How we teach this](#how-we-teach-this)
|
||||
|
||||
# Summary
|
||||
|
||||
Implement a CI system for Kibana teams that is highly scalable and stable, surfaces information in an intuitive way, and supports pipelines that are easy to understand and change.
|
||||
|
||||
This table provides an overview of the conclusions made throughout the rest of this document. A lot of this is subjective, but we've tried to take an honest look at each system and feature, based on a large amount of research on and/or experience with each system, our requirements, and our preferences as a team. Your team would likely come to different conclusions based on your preferences and requirements.
|
||||
|
||||
|                                       | Jenkins | Buildkite | GitHub Actions | CircleCI | TeamCity |
| ------------------------------------- | ------- | --------- | -------------- | -------- | -------- |
| Scalable                              | No      | Yes       | No             | Yes      | No       |
| Stable                                | No      | Yes       | No             | Yes      | Partial  |
| Surfaces information intuitively      | No      | Yes       | No             | Yes      | Yes      |
| Pipelines                             | Yes     | Yes       | Yes            | Yes      | Partial  |
| Advanced Pipeline logic               | Yes     | Yes       | Partial        | Partial  | No       |
| Cloud-friendly pricing model          | Yes     | Yes       | Yes            | No       | No       |
| Public access                         | Yes     | Yes       | Yes            | Partial  | Yes      |
| Secrets handling                      | Yes     | Partial   | Yes            | Partial  | Partial  |
| Support or Documentation              | No      | Yes       | Yes            | Partial  | Yes      |
| Scheduled Builds                      | Yes     | Yes       | Yes            | Yes      | Yes      |
| Container support                     | Partial | Yes       | Yes            | Yes      | Partial  |
|                                       |         |           |                |          |          |
| Customization                         | No      | Yes       | No             | No       | No       |
| Core functionality is first-party     | No      | Yes       | Mostly         | Yes      | Mostly   |
| First-class support for test results  | Buggy   | No        | No             | Yes      | Yes      |
| GitHub Integration                    | Yes     | Limited   | Yes            | Yes      | Yes      |
|
||||
|
||||
# Motivation
|
||||
|
||||
We have lived with the scalability and stability problems of our current Jenkins infrastructure for several years. We have spent a significant amount of time designing around problems, and are limited in how we can design our pipelines. Since the company-wide effort to move to a new system has been cancelled for the foreseeable future, we are faced with either re-engineering the way we use Jenkins, or exploring other solutions and potentially managing one ourselves.
|
||||
|
||||
This RFC is focused on the option of using a system other than Jenkins, and managing it ourselves (to the extent that it must be managed). If the RFC is rejected, the alternative will be to instead invest significantly into Jenkins to further stabilize and scale our usage of it.
|
||||
|
||||
## Required and Desired Capabilities
|
||||
|
||||
### Required
|
||||
|
||||
#### Scalable
|
||||
|
||||
- Able to run 100s of pipelines and 1000s of individual steps in parallel without issues.
|
||||
- If scaling agents/hosts is self-managed, dynamically scaling up and down based on usage should be supported and reasonably easy to do.
|
||||
|
||||
#### Stable
|
||||
|
||||
- Every minute of downtime can affect 100s of developers.
|
||||
- The Kibana Operations team can't have an on-call rotation, so we need to minimize our responsibilities around stability/uptime.
|
||||
- For systems provided as a service, they should not have frequent outages. This is a bit hard to define. 1-2 hours of downtime, twice a month, during peak working hours, is extremely disruptive. 10 minutes of downtime once or twice a week can also be very disruptive, as builds might need to be re-triggered, etc.
|
||||
- For self-hosted solutions, they should be reasonably easy to keep online and have a solution for high-availability. At a minimum, most upgrades should not require waiting for all currently running jobs to finish before deploying.
|
||||
- Failures are ideally handled gracefully. For example, agents may continue running tasks correctly, once the primary service becomes available again.
|
||||
|
||||
#### Surfaces information intuitively
|
||||
|
||||
- Developers should be able to easily understand what happened during their builds, and find information related to failures.
|
||||
- User interfaces should be functional and easy to use.
|
||||
- Overview and details about failures and execution time are particularly important.
|
||||
|
||||
#### Pipelines
|
||||
|
||||
- Pipelines should be defined as code.
|
||||
- Pipelines should be reasonably easy to understand and change. Kibana team members should be able to follow a simple guide and create new pipelines on their own.
|
||||
- Changes to pipelines should generally be able to be tested in Pull Requests before being merged.
|
||||
|
||||
#### Advanced Pipeline logic
|
||||
|
||||
With such a large codebase and CI pipeline, we often have complex requirements around when and how certain tasks should run, and we want the ability to handle this built into the system we use. It can be very difficult and require complex solutions for fairly simple use cases when the system does not support advanced pipeline logic out of the box.
|
||||
|
||||
For example, the flaky test suite runner that we currently have in Jenkins is fairly simple: run a given task (which might have a dependency) `N` number of times on `M` agents. This is very difficult to model in a system like TeamCity, which does not have dynamic dependencies.
|
||||
|
||||
- Retries
  - Automatic (e.g. run a test suite twice to account for flakiness) and manual (user-initiated)
  - Full (e.g. a whole pipeline) and partial (e.g. a single step)
- Dynamic pipelines
  - Conditional dependencies/steps
    - Based on user input
    - Based on external events/data (e.g. PR label)
    - Based on source code or changes (e.g. only run this for .md changes)
- Metadata and Artifacts re-usable between tasks
  - Metadata could be a docker image tag for a specific task, built from a previous step
|
||||
|
||||
#### Cloud-friendly pricing model
|
||||
|
||||
If the given system has a cost, the pricing model should be cloud-friendly and/or usage-based.
|
||||
|
||||
A per-agent or per-build model based on peak usage in a month is not a good model, because our peak build times are generally short-lived (e.g. around feature freeze).
|
||||
|
||||
A model based on build-minutes can also be bad, if it encourages running things in parallel on bigger machines to keep costs down. For example, running two tasks on a single 2-CPU machine with our own orchestration should not be cheaper than running two tasks on two 1-CPU machines using the system's built-in orchestration.
|
||||
|
||||
#### Public access
|
||||
|
||||
Kibana is a publicly-available repository with contributors from outside Elastic. CI information needs to be available publicly in some form.
|
||||
|
||||
#### Secrets handling
|
||||
|
||||
Good, first-class support for handling secrets is a must-have for any CI system. This support can take many forms.
|
||||
|
||||
- Secrets should not need to be stored in plaintext, in a repo nor on the server.
|
||||
- For systems provided as a service, it is ideal if secrets are kept mostly/entirely on our infrastructure.
|
||||
- There should be protections against accidentally leaking secrets to the console.
|
||||
- There should be programmatic ways to manage secrets.
|
||||
- Secrets are, by nature, harder to handle. However, the easier the system makes it, the more likely people are to follow best practices.
|
||||
|
||||
#### Support or Documentation
|
||||
|
||||
For paid systems, both self-hosted and as a service, good support is important. If a problem specific to Elastic is causing us downtime, we expect quick and efficient support. Again, 100s of developers are potentially affected by downtime.
|
||||
|
||||
For open source solutions, good documentation is especially important. If much of the operational knowledge of a system can only be gained by working with the system and/or reading the source code, it will be harder to solve problems quickly.
|
||||
|
||||
#### Scheduled Builds
|
||||
|
||||
We have certain pipelines (ES Snapshots) that run once daily, and `master` CI currently only runs once an hour. We need the ability to configure scheduled builds.
|
||||
|
||||
#### Container support
|
||||
|
||||
We have the desire to use containers to create fast, clean environments for CI stages that can also be used locally. We think that we can utilize [modern layer-caching options](https://github.com/moby/buildkit#cache), both local and remote, to optimize bootstrapping various CI stages, doing retries, etc.
|
||||
|
||||
For self-hosted options, containers will allow us to utilize longer-running instances (with cached layers, git repos, etc) without worrying about polluting the build environment between builds.
|
||||
|
||||
If we use containers for CI stages, when a test fails, developers can pull the image and reproduce the failure in the same environment that was used in CI.
|
||||
|
||||
So, we need a solution that at least allows us to build and run our own containers. The more features that exist for managing this, the easier it will be.
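To make the layer-caching idea above concrete, here is a rough sketch using BuildKit via `docker buildx` with a remote registry cache; the image and cache references are purely illustrative, and `GIT_COMMIT` is an assumed environment variable:

```bash
#!/usr/bin/env bash
set -euo pipefail

IMAGE="gcr.io/example-project/kibana-ci-base"            # illustrative image name
CACHE="gcr.io/example-project/kibana-ci-base:buildcache" # illustrative cache ref
COMMIT="${GIT_COMMIT:-local}"                            # illustrative tag source

# Build a CI base image, reusing layers pushed by previous CI runs.
docker buildx build \
  --cache-from "type=registry,ref=$CACHE" \
  --cache-to "type=registry,ref=$CACHE,mode=max" \
  -t "$IMAGE:$COMMIT" \
  -f .ci/Dockerfile \
  --push \
  .
```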
|
||||
|
||||
### Desired
|
||||
|
||||
#### Customization
|
||||
|
||||
We have very large CI pipelines which generate a lot of information (bundle sizes, performance numbers, etc). Being able to attach this information to builds, so that it lives with the builds in the CI system, is highly desirable. The alternative is building custom reports and UIs outside of the system.
|
||||
|
||||
#### Core functionality is first-party
|
||||
|
||||
Most core functionality that we depend on should be created and maintained by the organization maintaining the CI software. It's important for bugs to be addressed quickly, for security issues to be resolved, and for functionality to be tested before a new release of the system. In this way, there is a large amount of risk associated with relying on third-party solutions for too much core functionality.
|
||||
|
||||
#### First-class support for test results
|
||||
|
||||
One of the primary reasons we run CI is to run tests and make sure they pass. There are currently around 65,000 tests (unit, integration, and functional) that run in CI. Being able to see summaries, histories, and details of test execution directly on build pages is extremely useful. Flaky test identification is also very useful, as we deal with flaky tests on a daily basis.
|
||||
|
||||
For example, being able to easily see that a build passed but included 5,000 tests fewer than the previous build can make something like a pipeline misconfiguration more obvious. Being able to click on a failed test and see other recent builds where the same test failed can help identify what kind of failure it is and how important it is to resolve it quickly (e.g is it failing in 75% of builds or 5% of builds?).
|
||||
|
||||
For any system that doesn't have this kind of support, we will need to maintain our own solution, customize build pages to include this (if the system allows), or both.
|
||||
|
||||
#### GitHub Integration
|
||||
|
||||
- Ability to trigger jobs based on webhooks
|
||||
- Integrate GitHub-specific information into UI, e.g. a build for a PR should link back to the PR
|
||||
- Ability to set commit statuses based on job status
|
||||
- Fine-grained permission handling for pull request triggering
|
||||
|
||||
# Buildkite - Detailed design
|
||||
|
||||
For the alternative system in this RFC, we are recommending Buildkite. The UI, API, and documentation have been a joy to work with, they provide most of our desired features and functionality, the team is responsive and knowledgeable, and the pricing model does not encourage bad practices to lower cost.
|
||||
|
||||
## Overview
|
||||
|
||||
[Buildkite](https://buildkite.com/home) is a CI system where the user manages and hosts their own agents, and Buildkite manages and hosts everything else (core services, APIs, UI).
|
||||
|
||||
The [Buildkite features](https://buildkite.com/features) page is a great overview of the functionality offered.
|
||||
|
||||
For some public instances of Buildkite in action, see:
|
||||
|
||||
- [Bazel](https://buildkite.com/bazel)
|
||||
- [Rails](https://buildkite.com/rails)
|
||||
- [Chef](https://buildkite.com/chef-oss)
|
||||
|
||||
## Required and Desired Capabilities
|
||||
|
||||
How does Buildkite stack up against our required and desired capabilities?
|
||||
|
||||
### Required
|
||||
|
||||
#### Scalable
|
||||
|
||||
Buildkite claims to support up to 10,000 connected agents "without breaking a sweat."
|
||||
|
||||
We were able to connect 2,200 running agents and run a [single job with 1,800 parallel build steps](https://buildkite.com/elastic/kibana-custom/builds/8). The job ran with only about 15 seconds of total overhead (the rest of the time, the repo was being cloned, or the actual tasks were executing). We would likely never define a single job this large, but not only did it execute without any problems, the UI handles it very well.
|
||||
|
||||
2,200 agents was the maximum that we were able to test because of quotas on our GCP account that could not easily be increased.
|
||||
|
||||
We also created a job with 5 parallel steps, and triggered 300 parallel builds at once. The jobs executed and finished quickly, across ~1500 agents, with no issues and very little overhead. Interestingly, it seems that we were able to see the effects of our test in Buildkite's status page graphs (see below), but, from a user perspective, we were unable to notice any issues.
|
||||
|
||||

|
||||
|
||||
#### Stable
|
||||
|
||||
So far, we have witnessed no stability issues in our testing.
|
||||
|
||||
If Buildkite's status pages are accurate, they seem to be extremely stable, and respond quickly to issues.
|
||||
|
||||
- [Buildkite Status](https://www.buildkitestatus.com/)
|
||||
- [Historical Uptime](https://www.buildkitestatus.com/uptime)
|
||||
- [Incident History](https://www.buildkitestatus.com/history)
|
||||
|
||||
For agents, stability and availability will depend primarily on the infrastructure that we build and the availability of the cloud provider (GCP, primarily) running our agents. Since [we control our agents](#elastic-buildkite-agent-manager), we will be able to run agents across multiple zones, and possibly regions, in GCP for increased availability.
|
||||
|
||||
They have a [99.95% uptime SLA](https://buildkite.com/enterprise) for Enterprise customers.
|
||||
|
||||
#### Surfaces information intuitively
|
||||
|
||||
The Buildkite UI is very easy to use, and works as expected. Here is some of the information surfaced for each build:
|
||||
|
||||
- The overall status of the job, as well as which steps succeeded and failed.
|
||||
- Logs for each individual step
|
||||
- The timeline for each individual step, including how long it took Buildkite to schedule/handle the job on their end
|
||||
- Artifacts uploaded by each step
|
||||
- The entire agent/job configuration at the time the step executed, expressed as environment variables
|
||||
|
||||

|
||||
|
||||
Note that dependencies between steps are mostly not shown in the UI. See the screenshot below for an example. There are several layers of dependencies between all of the steps in this pipeline. The only one that is shown is the final step (`Post All`), which executes after all of the steps before it have finished. There are other strategies to help organize the steps (such as the new grouping functionality) if we need them.
|
||||
|
||||

|
||||
|
||||
Buildkite has rich build page customization via "annotations" which will let us surface custom information. See the [customization section](#customization-1).
|
||||
|
||||
#### Pipelines
|
||||
|
||||
- [Buildkite pipelines](https://buildkite.com/docs/pipelines) must be defined as code. Even if you configure them through the UI, you still have to do so using yaml.
|
||||
- This is subjective, but the yaml syntax for pipelines is friendly and straightforward. We feel that it will be easy for teams to create and modify pipelines with minimal instructions.
|
||||
- If your pipeline is configured to use yaml stored in your repo for its definition, branches and PRs will use the version in their source by default. This means that PRs that change the pipeline can be tested as part of the PR CI.
|
||||
- Top-level pipeline configurations, i.e. basically a pointer to a repo that has the real pipeline yaml in it, can be configured via the UI, API, or terraform (a minimal sketch of such a top-level step is shown below).
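In practice, a top-level pipeline definition usually contains little more than a single step that uploads the committed pipeline at runtime. A minimal sketch (the yaml path below is Buildkite's default; our actual layout may differ):

```bash
# The only command configured at the top level; it reads the committed
# definition and uploads the remaining steps to the running build.
buildkite-agent pipeline upload .buildkite/pipeline.yml
```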
|
||||
|
||||
#### Advanced Pipeline logic
|
||||
|
||||
Buildkite supports very advanced pipeline logic, and has support for generating dynamic pipeline definitions at runtime.
|
||||
|
||||
- [Conditionals](https://buildkite.com/docs/pipelines/conditionals)
|
||||
- [Dependencies](https://buildkite.com/docs/pipelines/dependencies) with lots of options, including being optional/conditional
|
||||
- [Retries](https://buildkite.com/docs/pipelines/command-step#retry-attributes), both automatic and manual, including configuring retry conditions by different exit codes
|
||||
- [Dynamic pipelines](https://buildkite.com/docs/pipelines/defining-steps#dynamic-pipelines) - pipelines can be generated by running a script at runtime
|
||||
- [Metadata](https://buildkite.com/docs/pipelines/build-meta-data) can be set in one step, and read in other steps
|
||||
- [Artifacts](https://buildkite.com/docs/pipelines/artifacts) can be uploaded from and downloaded in steps, and are visible in the UI
|
||||
- [Parallelism and Concurrency](https://buildkite.com/docs/tutorials/parallel-builds) settings
|
||||
|
||||
Here's an example of a dynamically-generated pipeline based on user input that runs a job `RUN_COUNT` times (from user input), across up to a maximum of 25 agents at once:
|
||||
|
||||
```yaml
# pipeline.yml

steps:
  - input: 'Test Suite Runner'
    fields:
      - select: 'Test Suite'
        key: 'test-suite'
        required: true
        options:
          - label: 'Default CI Group 1'
            value: 'default:cigroup:1'
          - label: 'Default CI Group 2'
            value: 'default:cigroup:2'
      - text: 'Number of Runs'
        key: 'run-count'
        required: true
        default: 75
  - wait
  - command: .buildkite/scripts/flaky-test-suite-runner.sh | buildkite-agent pipeline upload
    label: ':pipeline: Upload'
```
|
||||
|
||||
```bash
#!/usr/bin/env bash

# flaky-test-suite-runner.sh

set -euo pipefail

TEST_SUITE="$(buildkite-agent meta-data get 'test-suite')"
export TEST_SUITE

RUN_COUNT="$(buildkite-agent meta-data get 'run-count')"
export RUN_COUNT

UUID="$(cat /proc/sys/kernel/random/uuid)"
export UUID

cat << EOF
steps:
  - command: |
      echo 'Bootstrap'
    label: Bootstrap
    agents:
      queue: bootstrap
    key: bootstrap
  - command: |
      echo 'Build Default Distro'
    label: Build Default Distro
    agents:
      queue: bootstrap
    key: default-build
    depends_on: bootstrap
  - command: 'echo "Running $TEST_SUITE"; sleep 10;'
    label: 'Run $TEST_SUITE'
    agents:
      queue: ci-group
    parallelism: $RUN_COUNT
    concurrency: 25
    concurrency_group: '$UUID'
    depends_on: default-build
EOF
```
|
||||
|
||||
#### Cloud-friendly pricing model
|
||||
|
||||
Buildkite is priced using a per-user model, where a user is effectively an Elastic employee triggering builds for Kibana via PR, merging code, or through the Buildkite UI. That means that the cost essentially grows with our company size. Most importantly, we don't need to make CI pipeline design decisions based on the Buildkite pricing model.
|
||||
|
||||
However, since we manage our own agents, we will still pay for our compute usage, and will need to consider that cost when designing our pipelines.
|
||||
|
||||
#### Public access
|
||||
|
||||
Buildkite has read-only public access, configurable for each pipeline. An organization can contain a mix of both public and private pipelines.
|
||||
|
||||
There are no fine-grained settings for this, and all information in the build is publicly accessible.
|
||||
|
||||
#### Secrets handling
|
||||
|
||||
[Managing Pipeline Secrets](https://buildkite.com/docs/pipelines/secrets)
|
||||
|
||||
Because agents run on customers' infrastructure, secrets can stay completely in the customer's environment. For this reason, Buildkite doesn't provide a real mechanism for storing secrets, and instead provides recommendations for accessing secrets in pipelines in secure ways.
|
||||
|
||||
There are two recommended methods for handling secrets: using a third-party secrets service like Vault or GCP's Secret Manager, or baking them into agent images and only letting certain jobs access them. Since Elastic already uses Vault, we could utilize Vault the same way we do in Jenkins today.
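As a rough sketch of the Vault approach, a repository-level `pre-command` hook could fetch secrets before each step runs. The hook below is only illustrative; the secret path and the assumption that the agent is already authenticated to Vault are ours, not Buildkite's:

```bash
#!/usr/bin/env bash
# Hypothetical .buildkite/hooks/pre-command; the secret path is illustrative.
set -euo pipefail

# Assumes the agent image is already authenticated to Vault (e.g. via a token
# provisioned at boot) and that VAULT_ADDR is set in the agent environment.
GITHUB_TOKEN="$(vault read -field=value secret/kibana/ci/github-token)"
export GITHUB_TOKEN
```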
|
||||
|
||||
Also, a new experimental feature, [redacted environment variables](https://buildkite.com/docs/pipelines/managing-log-output#redacted-environment-variables), can automatically redact the values of environment variables whose names match configurable suffixes if they are accidentally written to the console. This only redacts environment variables that were set prior to the execution of a build step (e.g. during the `environment` or `pre-command` hooks), not variables that were created during execution (e.g. by accessing Vault in the middle of a build step).
|
||||
|
||||
#### Support or Documentation
|
||||
|
||||
[Buildkite's documentation](https://buildkite.com/docs/pipelines) is extensive and well-written, as mentioned earlier.
|
||||
|
||||
Besides this, [Enterprise](https://buildkite.com/enterprise) customers get 24/7 emergency help, prioritized support, a dedicated chat channel, and guaranteed response times. They will also consult on best practices, etc.
|
||||
|
||||
#### Scheduled Builds
|
||||
|
||||
[Buildkite has scheduled build](https://buildkite.com/docs/pipelines/scheduled-builds) support with a cron-like syntax. Schedules are defined separately from the pipeline yaml, and can be managed via the UI, API, or terraform.
|
||||
|
||||
#### Container support
|
||||
|
||||
Since we will manage our own agents with Buildkite, we have full control over the container management tools we install and use. In particular, this means that we can easily use modern container tooling, such as Docker with Buildkit, and we can pre-cache layers or other data in our agent images.
|
||||
|
||||
[Buildkite maintains](https://buildkite.com/docs/tutorials/docker-containerized-builds) two officially-supported plugins for making it easier to create pipelines using containers: [one for Docker](https://github.com/buildkite-plugins/docker-buildkite-plugin) and [one for Docker Compose](https://github.com/buildkite-plugins/docker-compose-buildkite-plugin).
|
||||
|
||||
The Docker plugin is essentially a wrapper around `docker run` that makes it easier to define steps that run in containers, while setting various flags. It also provides some logging, and provides mechanisms for automatically propagating environment variables or mounting the workspace into the container.
|
||||
|
||||
A simple, working example for running Jest tests using a container is below. The `Dockerfile` contains all dependencies for CI, and runs `yarn kbn bootstrap` so that it contains a full environment, ready to run tasks.
|
||||
|
||||
```yaml
steps:
  - command: |
      export DOCKER_BUILDKIT=1 && \
      docker build -t gcr.io/elastic-kibana-184716/buildkite/ci/base:$BUILDKITE_COMMIT -f .ci/Dockerfile . --progress plain && \
      docker push gcr.io/elastic-kibana-184716/buildkite/ci/base:$BUILDKITE_COMMIT
  - wait
  - command: node scripts/jest --ci --verbose --maxWorkers=6
    label: 'Jest'
    artifact_paths: target/junit/**/*.xml
    plugins:
      - docker#v3.8.0:
          image: 'gcr.io/elastic-kibana-184716/buildkite/ci/base:$BUILDKITE_COMMIT'
          propagate-environment: true
          mount-checkout: false
    parallelism: 2
    timeout_in_minutes: 120
```
|
||||
|
||||
### Desired
|
||||
|
||||
#### Customization
|
||||
|
||||
We have very large CI pipelines which generate a lot of information (bundle sizes, performance numbers, etc). Being able to attach this information to builds, so that it lives with the builds in the CI system, is highly desirable. The alternative is building custom reports and UIs outside of the system.
|
||||
|
||||
[Annotations](https://buildkite.com/docs/agent/v3/cli-annotate) provide a way to add rich, well-formatted, custom information to build pages using CommonMark Markdown. There are several built-in CSS classes for formatting and several visual styles. Images, emojis, and links can be embedded as well. Just for some examples: Metrics such as bundle sizes, links to the distro builds for that build, and screenshots for test failures could all be embedded directly into the build pages.
|
||||
|
||||
The structure of logs can also be easily customized by adding [collapsible groups](https://buildkite.com/docs/pipelines/managing-log-output#collapsing-output) for log messages.
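For a sense of what this looks like in practice, both annotations and log groups are driven from the job's own scripts. A small sketch, with purely illustrative values:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Lines that follow this header are collapsed into a named group in the log viewer.
echo "--- Building distributable"

# Attach custom, markdown-formatted information to the build page.
buildkite-agent annotate --style "info" --context "bundle-size" \
  "**Bundle size:** 4.2 MB (illustrative number)"
```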
|
||||
|
||||
#### Core functionality is first-party
|
||||
|
||||
There's a large number of [plugins for Buildkite](https://buildkite.com/plugins), but, so far, there are only two plugins we've been considering using (one for Docker and one for test results), and they're both maintained by Buildkite. All other functionality we've assessed that we need is either built directly into Buildkite, or [we are building it](#what-we-will-build-and-manage).
|
||||
|
||||
#### First-class support for test results
|
||||
|
||||
Buildkite doesn't really have any built-in support specifically for handling test results. Test result reports (e.g. JUnit) can be uploaded as artifacts, and test results can be rendered on the build page using annotations. They have [a plugin](https://github.com/buildkite-plugins/junit-annotate-buildkite-plugin) for automatically annotating builds with test results from JUnit reports in a simple fashion. We would likely want to build our own annotation for this.
|
||||
|
||||
This does mean that Buildkite lacks test-related features of other CI systems: tracking tests over time across build, flagging flaky tests, etc. We would likely need to ingest test results into Elasticsearch and build out Kibana dashboards/visualizations for this, or similar.
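If we build our own annotation, a first pass might be as simple as downloading the JUnit artifacts after the test steps finish and summarizing them. This is only a sketch, with illustrative paths, assuming the reports were uploaded as artifacts:

```bash
#!/usr/bin/env bash
# Hypothetical final step: summarize JUnit results as a build annotation.
set -euo pipefail

buildkite-agent artifact download "target/junit/**/*.xml" .

# Count report files that contain at least one <failure> element.
FAILED_REPORTS="$(grep -rl '<failure' target/junit 2>/dev/null | wc -l | tr -d ' ' || true)"

if [[ "$FAILED_REPORTS" -gt 0 ]]; then
  buildkite-agent annotate --style "error" --context "junit" \
    "Test failures found in $FAILED_REPORTS JUnit report file(s)."
fi
```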
|
||||
|
||||
#### GitHub Integration
|
||||
|
||||
Buildkite's [GitHub Integration](https://buildkite.com/docs/integrations/github) can trigger builds based on GitHub webhooks (e.g. on commit/push for branches and PRs), and update commit statuses. Buildkite also adds basic information to build pages, such as links to commits on GitHub and links to PRs. This should cover what we need for tracked branch builds.
|
||||
|
||||
However, for Pull Requests, because we have a lot of requirements around when builds should run and who can run them, we will need to [build a solution](#elastic-buildkite-pr-bot) for handling PRs ourselves. The work for this is already close to complete.
|
||||
|
||||
## What we will build and manage
|
||||
|
||||
### Elastic Buildkite Agent Manager
|
||||
|
||||
#### Overview
|
||||
|
||||
Currently, with Buildkite, the agent lifecycle is managed entirely by customers. Customers can run "static" workers that are online all of the time, or dynamically scale their agents up and down as needed.
|
||||
|
||||
For AWS, Buildkite maintains an auto-scaling solution called [Elastic CI Stack for AWS](https://github.com/buildkite/elastic-ci-stack-for-aws).
|
||||
|
||||
Since we primarily need support for GCP, we built our own agent manager. It's not 100% complete, but it has been working very well during our testing/evaluation of Buildkite, and it can handle 1000s of agents.
|
||||
|
||||
[Elastic Buildkite Agent Manager](https://github.com/brianseeders/buildkite-agent-manager)
|
||||
|
||||
Features:
|
||||
|
||||
- Handles many different agent configurations with one instance
|
||||
- Configures long-running agents, one-time use agents, and agents that will terminate after being idle for a configured amount of time
|
||||
- Configures both minimum and maximum agent limits - i.e. can ensure a certain number of agents are always online, even if no jobs currently require them
|
||||
- Supports overprovisioning agents by a percentage or a fixed number
|
||||
- Supports many GCE settings: zone, image/image family, machine type, disk type and size, tags, metadata, custom startup scripts
|
||||
- Agent configuration is stored in a separate repo and read at runtime
|
||||
- Agents are gracefully replaced (e.g. after they finish their current job) if they are running using an out-of-date agent configuration that can affect the underlying GCE instance
|
||||
- Detect and remove orphaned GCP instances
|
||||
- Handles 1000s of agents (tested with 2200 before we hit GCP quotas)
|
||||
- Does instance creation/deletion in large, parallel batches so that demand spikes are handled quickly
|
||||
|
||||
Also planned:
|
||||
|
||||
- Balance creating agents across numerous GCP zones for higher availability
|
||||
- Automatically gracefully replace agents if disk usage gets too high
|
||||
- Scaling idle timeouts: e.g. the first agent for a configuration might have an idle timeout of 1 hour, but the 200th might be 5 minutes
|
||||
|
||||
#### Design
|
||||
|
||||
The agent manager is primarily concerned with ensuring that, given an agent configuration, the number of online agents for that configuration is **greater than or equal to** the desired number. Buildkite then determines how to use the agents: which jobs they should execute and when they should go offline (due to being idle, done with jobs, etc). Even when stopping agents due to having an outdated configuration, Buildkite still determines the actual time that the agent should disconnect.
|
||||
|
||||
The current version of the agent manager only handles GCP-based agents, but support for other platforms could be added as well, such as AWS or Kubernetes. There's likely more complexity in managing all of the various agent images than in maintaining support in the agent manager.
|
||||
|
||||
It is also designed to itself be stateless, so that it is easy to deploy and reason about. State is effectively stored in GCP and Buildkite.
|
||||
|
||||

|
||||
|
||||
The high-level design for the agent manager is pretty straightforward. There are three primary stages during execution:
|
||||
|
||||
1. Gather Current State
   1. Data and agent configuration are gathered from various sources/APIs in parallel.
2. Create Plan
   1. Given the current state across the various services, a plan is created based on agent configurations, current Buildkite job queue sizes, and current GCE instances.
   2. Instances need to be created when there aren't enough online/in-progress agents of a particular configuration to satisfy the needs of its matching queue.
   3. Agents need to be stopped when they have been online for too long (based on their configuration) or when their configuration is out-of-date. This is a soft stop; they will terminate after finishing their current job.
   4. Instances need to be deleted if they have been stopped (which happens when their agent stops), or when they have been online past their hard stop time (based on configuration).
3. Execute Plan
   1. The different types of actions in the plan are executed in parallel. Instance creation and deletion are done in batches to handle spikes quickly.
|
||||
|
||||
An error at any step, e.g. when checking current state of GCP instances, will cause the rest of the run to abort.
|
||||
|
||||
Because the service gathers data about the total current state and creates a plan based on that state each run, it's reasonably resistant to errors and it's self-healing.
|
||||
|
||||
##### Protection against creating too many instances
|
||||
|
||||
Creating too many instances in GCP could be costly, so this failure mode is worth calling out explicitly. Since the agent manager itself is stateless, and only looks at the current, external state when determining an execution plan, there is the possibility of creating too many instances.
|
||||
|
||||
There are two primary mechanisms to protect against this:
|
||||
|
||||
One is usage of GCP quotas. Maintaining reasonable GCP quotas will ensure that we don't create too many instances in a situation where something goes catastrophically wrong during operation. It's an extra failsafe.
|
||||
|
||||
The other is built into the agent manager. The agent manager checks both the number of connected agents in Buildkite for a given configuration, as well as the number of instances currently running and being created in GCP. It uses whichever number is greater as the current number of instances.
|
||||
|
||||
This is a simple failsafe, but it means that a large number of unnecessary instances could only be created in a fairly specific scenario (keep in mind that errors abort the current agent manager run):
|
||||
|
||||
- The GCP APIs (both read and create) are returning success codes
|
||||
- The GCP API for listing instances is returning partial/missing/erroneous data, with a success code
|
||||
- GCP instances are successfully being created
|
||||
- Created GCP instances are unable to connect to Buildkite, or Buildkite Agents API is returning partial/missing/erroneous data
|
||||
|
||||
All of these things would need to be true at the same time for a large number of instances to be created. In the unlikely event that this were to happen, the GCP quotas would still be in place.
|
||||
|
||||
#### Configuration
|
||||
|
||||
Here's an example configuration, which would likely reside in the `master` branch of the kibana repository.
|
||||
|
||||
```js
{
  gcp: {
    // Configurations at this level are defaults for all configurations defined under `agents`
    project: 'elastic-kibana-184716',
    zone: 'us-central1-b',
    serviceAccount: 'elastic-buildkite-agent@elastic-kibana-184716.iam.gserviceaccount.com',
    agents: [
      {
        queue: 'default',
        name: 'kibana-buildkite',
        overprovision: 0, // percentage or flat number
        minimumAgents: 1,
        maximumAgents: 500,
        gracefulStopAfterSecs: 60 * 60 * 6,
        hardStopAfterSecs: 60 * 60 * 9,
        idleTimeoutSecs: 60 * 60,
        exitAfterOneJob: false,
        imageFamily: 'kibana-bk-dev-agents',
        machineType: 'n2-standard-1',
        diskType: 'pd-ssd',
        diskSizeGb: 75
      },
      {
        // ...
      },
    ],
  },
}
```
|
||||
|
||||
#### Build / Deploy
|
||||
|
||||
Currently, the agent manager is built and deployed using [Google Cloud Build](https://cloud.google.com/build). It is deployed to and hosted using [GKE Auto-Pilot](https://cloud.google.com/blog/products/containers-kubernetes/introducing-gke-autopilot) (Kubernetes). GKE was used, rather than Cloud Run, primarily because the agent manager runs continuously (with a 30sec pause between executions) whereas Cloud Run is for services that respond to HTTP requests.
|
||||
|
||||
It uses [Google Secret Manager](https://cloud.google.com/secret-manager) for storing/retrieving tokens for accessing Buildkite. It uses a GCP service account and [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) to manage GCP resources.
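For reference, a manual build and deploy with this setup might look roughly like the following; the file names are assumptions, not the actual layout of the agent manager repository:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Build and push the container image via Cloud Build (config file name is an assumption).
gcloud builds submit --config cloudbuild.yaml .

# Roll the new version out to the GKE Autopilot cluster (manifest path is an assumption).
kubectl apply -f k8s/
```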
|
||||
|
||||
### Elastic Buildkite PR Bot
|
||||
|
||||
#### Overview
|
||||
|
||||
For TeamCity, we built a bot that was going to handle webhooks from GitHub and trigger builds for PRs based on configuration, user permissions, etc. Since we will not be moving to TeamCity, we've repurposed this bot for Buildkite, since Buildkite does not support all of our requirements around triggering builds for PRs out-of-the-box. The bot supports everything we currently use in Jenkins, and has some additional features as well.
|
||||
|
||||
[Elastic Buildkite PR Bot](https://github.com/elastic/buildkite-pr-bot)
|
||||
|
||||
Features supported by the bot:
|
||||
|
||||
- Triggering builds on commit / when the PR is opened
|
||||
- Triggering builds on comment
|
||||
- Permissions for who can trigger builds based on: Elastic org membership, write and/or admin access to the repo, or user present in an allowed list
|
||||
- Limit builds to PRs targeting a specific branch
|
||||
- Custom regex for trigger comment, e.g. "buildkite test this"
|
||||
- Triggering builds based on labels
|
||||
- Setting labels, comment body, and other PR info as env vars on triggered build
|
||||
- Skip triggering build if a customizable label is present
|
||||
- Option to set commit status on trigger
|
||||
- Capture custom arguments from comment text using capture groups and forward them to the triggered build
|
||||
|
||||
#### Configuration
|
||||
|
||||
The configuration is stored in a `json` file (default: `.ci/pull-requests.json`) in the repo for which pull requests will be monitored. Multiple branches in the repo can store different configurations, or one configuration (e.g. in `master`) can cover the entire repo.
|
||||
|
||||
Example configuration:
|
||||
|
||||
```json
{
  "jobs": [
    {
      "repoOwner": "elastic",
      "repoName": "kibana",
      "pipelineSlug": "kibana",

      "enabled": true,
      "target_branch": "master",
      "allow_org_users": true,
      "allowed_repo_permissions": ["admin", "write"],
      "allowed_list": ["renovate[bot]"],
      "set_commit_status": true,
      "commit_status_context": "kibana-buildkite",
      "trigger_comment_regex": "^(?:(?:buildkite\\W+)?(?:build|test)\\W+(?:this|it))|^retest$"
    }
  ]
}
```
|
||||
|
||||
GitHub webhooks must also be configured to send events to the deployed bot.
|
||||
|
||||
#### Build / Deploy
|
||||
|
||||
Currently, the bot is built and deployed using [Google Cloud Build](https://cloud.google.com/build). It is deployed to and hosted on [Google Cloud Run](https://cloud.google.com/run). It uses [Google Secret Manager](https://cloud.google.com/secret-manager) for storing/retrieving tokens for accessing GitHub and Buildkite.
|
||||
|
||||
[Build/deploy configuration](https://github.com/elastic/buildkite-pr-bot/blob/main/cloudbuild.yaml)
|
||||
|
||||
### Infrastructure
|
||||
|
||||
We will need to maintain our infrastructure related to Buildkite, primarily ephemeral agents. To start, it will mean supporting infrastructure in GCP, but could later mean AWS as well.
|
||||
|
||||
- Separate GCP project for CI resources
|
||||
- Hosting for bots/services we maintain, such as the Agent Manager (GKE Auto-Pilot) and GitHub PR bot (Cloud Run)
|
||||
- Google Storage Buckets for CI artifacts
|
||||
- Networking (security, we may also need Cloud NAT)
|
||||
- IAM and Security
|
||||
- Agent images
|
||||
|
||||
We are already using Terraform to manage most resources related to Buildkite, and will continue to do so.
|
||||
|
||||
### Monitoring / Alerting
|
||||
|
||||
We will need to set up and maintain monitoring and alerting for our GCP infrastructure, as well as Buildkite metrics.
|
||||
|
||||
Some examples:
|
||||
|
||||
GCP
|
||||
|
||||
- Number of instances by type
|
||||
- Age of instances
|
||||
- Resource Quotas
|
||||
|
||||
Buildkite
|
||||
|
||||
- Agent queues
|
||||
- Job wait times
|
||||
- Build status
|
||||
|
||||
### Agent Image management
|
||||
|
||||
We will need to maintain images used to create GCP instances for our Buildkite agents. These images would need to be built on a regular basis (daily, or possibly more often).
|
||||
|
||||
We could likely maintain a single linux-based image to cover all of our current CI needs. However, in the future, if we need to maintain many images across different operating systems and architectures, this is likely to become the most complex part of the CI system that we would need to maintain. Every operating system and architecture we need to support adds another group of required images, with unique dependencies and configuration automation.
|
||||
|
||||
Another thing to note: Just because we need to run something on a specific OS or architecture, it doesn't necessarily mean we need to maintain an agent image for it. For example, we might use something like Vagrant to create a separate VM, using the default, cloud-provided images, that we run something on (e.g. for testing system packages), rather than running it on the same machine as the agent. In this case, we would potentially only be managing a small number of images, or even a single image.
|
||||
|
||||
Also, we always have the option of running a small number of jobs using Jenkins, if we need to do so to target additional OSes and architectures.
|
||||
|
||||
For our testing, we have a single GCP image, [built using Packer](https://github.com/elastic/kibana/tree/kb-bk/.buildkite/agents/packer), with the Buildkite agent installed and all of our dependencies.
|
||||
|
||||
Summary of Responsibilities
|
||||
|
||||
- An automated process for creating new images, at least daily, running automated smoke tests against them, and promoting them (a rough sketch of this flow follows this list)
|
||||
- Delete old images when creating new ones
|
||||
- Ability to roll back images easily and/or pin specific image versions
|
||||
- Manage dependencies, failures, updates, etc across all supported OSes and architectures, on a regular basis
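As a rough sketch of what the daily build/promote flow could look like (the Packer template path is illustrative, and rollback here relies on GCE image families picking the newest non-deprecated image):

```bash
#!/usr/bin/env bash
set -euo pipefail

# Build a new agent image from the Packer template in the repo (path is illustrative).
packer build .buildkite/agents/packer/gcp.json

# Instances are created from the image *family*, so the newest non-deprecated
# image in the family is picked up automatically once smoke tests pass.
gcloud compute images list --filter="family=kibana-bk-dev-agents" --sort-by=~creationTimestamp

# Rolling back: deprecate the bad image so the family falls back to the previous one.
gcloud compute images deprecate BAD_IMAGE_NAME --state DEPRECATED
```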
|
||||
|
||||
### Buildkite org-level settings management
|
||||
|
||||
There are a few settings outside of pipelines that we will need to manage.
|
||||
|
||||
- Top-level pipelines and their settings
|
||||
- Pipeline schedules / scheduled jobs
|
||||
- Public visibility of pipelines
|
||||
- Teams and Permissions
|
||||
- Single Sign On settings
|
||||
|
||||
Most of the content for our pipelines will be stored in repositories as YAML. However, a job still must exist in Buildkite that points to that repo and that YAML. For managing those top-level configurations, an official [Terraform provider](https://registry.terraform.io/providers/buildkite/buildkite/latest/docs/resources/pipeline) exists, which we will likely take advantage of.
|
||||
|
||||
Pipeline schedules can also be managed using the Terraform provider.
|
||||
|
||||
Teams can also be managed using Terraform, but it's unlikely we will need to use Teams.
|
||||
|
||||
For everything else, we will likely start off using UI and build automation (or contribute to the Terraform provider) where we see fit. Most of the other settings are easy to configure, and unlikely to change.
|
||||
|
||||
### IT Security Processes
|
||||
|
||||
There will likely be numerous IT Security processes we will need to follow, since we will be managing infrastructure. This could include regular audits, specific software and configurations that must be baked into our agents, documentation procedures, or other conditions that we will need to satisfy. There is risk here, as the processes and workload are currently unknown to us.
|
||||
|
||||
# Drawbacks
|
||||
|
||||
The biggest drawback to doing this is that we will be duplicating a large amount of work and providing/maintaining a service that is already provided to us by another team at Elastic. Jenkins is already provided to us, and there is automation for creating Jenkins worker images and managing worker instances in both AWS and GCP, and IT Security policies are already being handled for all of this. It is hard to predict what the extra workload will be for the Kibana Operations team if we move our CI processes to Buildkite, but we know we will have to maintain all of the things listed under [What we will build and manage](#what-we-will-build-and-manage).
|
||||
|
||||
Some other drawbacks:
|
||||
|
||||
- CI Pipelines and other jobs built in Jenkins will need to be re-built, which includes building support for things like CI Stats, Slack notifications, GitHub PR comments, etc.
|
||||
- Developers will need to learn a new system.
|
||||
- The service is an additional cost to the company.
|
||||
- There is a lot of Jenkins knowledge throughout the company, but likely little Buildkite knowledge.
|
||||
|
||||
# Alternatives
|
||||
|
||||
## Jenkins
|
||||
|
||||
We are not happy with the experience provided by our instance of Jenkins and our current pipelines. If we stick with Jenkins, we will need to invest a likely significant amount of time in improving the experience and making our pipelines scale given the limitations we face.
|
||||
|
||||
### Required
|
||||
|
||||
#### Scalable
|
||||
|
||||
Our current Jenkins instance only allows for 300-400 connected agents, before effectively going offline. We have struggled with this issue for several years, and completely redesigned our pipelines around this limitation. The resulting design, which involves running 20+ tasks in parallel on single, large machines, and managing all of the concurrency ourselves, is complicated and problematic.
|
||||
|
||||
Other teams at Elastic, especially over the last few months, have been experiencing this same limitation with their Jenkins instances as well. The team that manages Jenkins at Elastic is well aware of this issue, and is actively investigating. It is currently unknown whether or not it is a solvable problem (without sharding) or a limitation of Jenkins.
|
||||
|
||||
#### Stable
|
||||
|
||||
Firstly, Jenkins was not designed for high availability. If the primary/controller goes offline, CI is offline.
|
||||
|
||||
The two biggest sources of stability issues for us are currently related to scaling (see above) and updates.
|
||||
|
||||
##### Updates
|
||||
|
||||
The typical update process for Jenkins looks like this:
|
||||
|
||||
- Put Jenkins into shutdown mode, which stops any new builds from starting
|
||||
- Wait for all currently-running jobs to finish
|
||||
- Shutdown Jenkins
|
||||
- Do the update
|
||||
- Start Jenkins
|
||||
|
||||
For us, shutdown mode also means that `gobld` stops creating new agents for our jobs. This means that many running jobs will never finish executing while shutdown mode is active.
|
||||
|
||||
So, for us, the typical update process is:
|
||||
|
||||
- Put Jenkins into shutdown mode, which stops any new builds from starting, and many from finishing
|
||||
- Hard kill all of our currently running jobs
|
||||
- Shutdown Jenkins
|
||||
- Do the update
|
||||
- Start Jenkins
|
||||
- A human manually restarts CI for all PRs that were running before the update
|
||||
|
||||
This is pretty disruptive for us, as developers have to wait several hours longer before merging or seeing the status of their PRs, plus there is manual work that must be done to restart CI. If we stay with Jenkins, we'll need to fix this process, and likely build some automation for it.
|
||||
|
||||
#### Surfaces information intuitively
|
||||
|
||||
Our pipelines are very complex, mainly because of the issues mentioned above related to designing around scaling issues, and none of the UIs in Jenkins work well for us.
|
||||
|
||||
The [Stage View](https://kibana-ci.elastic.co/job/elastic+kibana+pipeline-pull-request) only works for very simple pipelines. Even if we were able to re-design our pipelines to populate this page better, there are just too many stages to display in this manner.
|
||||
|
||||
[Blue Ocean](https://kibana-ci.elastic.co/blue/organizations/jenkins/elastic%2Bkibana%2Bpipeline-pull-request/activity), which is intended to be the modern UI for Jenkins, doesn't work at all for our pipelines. We have nested parallel stages in our pipelines, which [are not supported](https://issues.jenkins.io/browse/JENKINS-54010).
|
||||
|
||||
[Pipeline Steps](https://kibana-ci.elastic.co/job/elastic+kibana+pipeline-pull-request/) (Choose a build -> Pipeline Steps) shows information fairly accurately (sometimes logs/errors are not attached to any steps, and do not show), but is very difficult to read. There are entire pages of largely irrelevant information (setting environment variables, starting a `try` block, etc.), which is difficult to read through, especially for developers who don't interact with Jenkins every day.
|
||||
|
||||

|
||||
|
||||
We push a lot of information to GitHub and Slack, and have even built custom UIs, to try to minimize how much people need to interact directly with Jenkins. In particular, when things go wrong, it is very difficult to investigate using the Jenkins UI.
|
||||
|
||||
#### Pipelines
|
||||
|
||||
Jenkins supports pipeline-as-code through [Pipelines](https://www.jenkins.io/doc/book/pipeline), which we currently use.
|
||||
|
||||
Pros:
|
||||
|
||||
- Overall pretty powerful, pipelines execute Groovy code at runtime, so pipelines can do a lot and can be pretty complex, if you're willing to write the code
|
||||
- Pipeline changes can be tested in PRs
|
||||
- Shared Libraries allow shared code to be used across pipelines easily
|
||||
|
||||
Cons:
|
||||
|
||||
- The sandbox is pretty difficult to work with. There's a [hard-coded list](https://github.com/jenkinsci/script-security-plugin/tree/e99ba9cffb0502868b05d19ef5cd205ca7e0e5bd/src/main/resources/org/jenkinsci/plugins/scriptsecurity/sandbox/whitelists) of allowed methods for pipelines. Other methods must be approved separately, or put in a separate shared repository that runs trusted code.
|
||||
- Pipeline code is serialized by Jenkins, and the serialization process leads to a lot of issues that are difficult to debug and reason about. See [JENKINS-44924](https://issues.jenkins.io/browse/JENKINS-44924) - `List.sort()` doesn't work and silently returns `-1` instead of a list
|
||||
- Reasonably complex pipelines are difficult to view in the UI ([see above](#surfaces-information-intuitively-2))
|
||||
- Using Pipelines to manage certain configurations (such as Build Parameters) requires running an outdated job once and letting it fail to update it
|
||||
- Jobs that reference a pipeline have to be managed separately. Only third-party tools exist for managing these jobs as code (JJB and Job DSL).
|
||||
- Very difficult to test code without running it live in Jenkins
|
||||
|
||||
#### Advanced Pipeline logic
|
||||
|
||||
See above section. Jenkins supports very advanced pipeline logic using scripted pipelines and Groovy.
|
||||
|
||||
#### Cloud-friendly pricing model
|
||||
|
||||
Given that Jenkins is open-source, we pay only for infrastructure and people to manage it.
|
||||
|
||||
#### Public access
|
||||
|
||||
- Fine-grained authorization settings
|
||||
- Anonymous user access
|
||||
- Per-job authorization, so some jobs can be private
|
||||
|
||||
#### Secrets handling
|
||||
|
||||
- Supports [Credentials](https://www.jenkins.io/doc/book/using/using-credentials/), which are stored encrypted on disk and have authorization settings
|
||||
- Credentials are difficult to manage in an automated way
|
||||
- Pipeline support for accessing credentials
|
||||
- Credentials masked in log output
|
||||
- Support for masking custom values in log output
|
||||
|
||||
#### Support or Documentation
|
||||
|
||||
Documentation for Jenkins is notoriously fragmented. All major functionality is provided by plugins, and documentation is spread out across the Jenkins Handbook, the CloudBees website, JIRA issues, wikis, GitHub repos, and JavaDoc pages. Many plugins have poor documentation, and source code often has to be read to understand how to configure something.
|
||||
|
||||
CloudBees offers paid support, but we're not familiar with it at this time.
|
||||
|
||||
#### Scheduled Builds
|
||||
|
||||
Jenkins supports scheduled builds via a cron-like syntax, and can spread scheduled jobs out. For example, if many jobs are scheduled to run every day at midnight, a syntax is available that will automatically spread the triggered jobs evenly across the midnight hour.
|
||||
|
||||
#### Container support
|
||||
|
||||
Jenkins has support for using Docker to [run containers for specific stages in a Pipeline](https://www.jenkins.io/doc/book/pipeline/docker/). It is effectively a wrapper around `docker run`. There are few conveniences, and figuring out how to do things like mount the workspace into the container is left up to the user. There are also gotchas that are not well-documented, such as the fact that the user running inside the container will be automatically changed using `-u`, which can cause issues.
|
||||
|
||||
Though we have control over the agents running our jobs at Elastic, and thus all of the container-related tooling, it is not currently easy for the Operations team to manage our container tooling. We are mostly dependent on another team to do this for us.
|
||||
|
||||
### Desired
|
||||
|
||||
#### Customization
|
||||
|
||||
The only way to customize information added to build pages is through custom plugins. [Creating and maintaining plugins for Jenkins](https://www.jenkins.io/doc/developer/plugin-development/) is a fairly significant investment, and we do not currently have a good way to manage plugins for Jenkins instances at Elastic. It's a pretty involved process that, at the moment, has to be done by another team.
|
||||
|
||||
Given that, we feel we would be able to build a higher-quality experience in less time by creating custom applications separate from Jenkins, which we have actually [done in the past](https://ci.kibana.dev/es-snapshots).
|
||||
|
||||
#### Core functionality is first-party
|
||||
|
||||
Jenkins is very modular, and almost all Jenkins functionality is provided by plugins.
|
||||
|
||||
It's difficult to understand which plugins are required to support which base features. For example, Pipelines support is provided by a group of many plugins, and many of them have outdated names ([Pipeline: Nodes and Processes](https://github.com/jenkinsci/workflow-durable-task-step-plugin) is actually a plugin called `workflow-durable-task-step-plugin`).
|
||||
|
||||
Many plugins are maintained by CloudBees employees, but it can be very difficult to determine which ones are without knowing the names of CloudBees employees. All Jenkins community/third-party plugins reside under the `jenkinsci` organization in GitHub, which makes finding "official" ones difficult. Given the open source nature of the Jenkins ecosystem and the way that development is handled by CloudBees, it might be incorrect to say that any plugins outside of the CloudBees plugins (for the CloudBees Jenkins distribution) are "first-party".
|
||||
|
||||
#### First-class support for test results
|
||||
|
||||
It's a bit buggy at times (for example, if you run the same test multiple times, you have to load pages in a specific order to see the correct results in the UI), but Jenkins does have support for ingesting and displaying test results, including graphs that show changes over time. We use this feature to ingest test results from JUnit files produced by unit tests, integration tests, and end-to-end/functional tests.
|
||||
|
||||
#### GitHub Integration
|
||||
|
||||
Jenkins has rich support for GitHub spread across many different plugins. It can trigger builds in response to webhook payloads, automatically create jobs for repositories in an organization, has support for self-hosted GitHub, and has many settings for triggering pull requests.
|
||||
|
||||
It's worth mentioning, however, that we've had and continue to have many issues with these integrations. For example, the GitHub Pull Request Builder plugin, which currently provides PR triggering for us and other teams, has been the source of several issues at Elastic. It's had performance issues, triggers builds erroneously, and has been mostly unmaintained for several years.
|
||||
|
||||
## Other solutions
|
||||
|
||||
### CircleCI
|
||||
|
||||
CircleCI is a mature, widely-used option that is scalable and fulfills a lot of our requirements. We felt that we could create a good CI experience with this solution, but it had several disadvantages for us compared to Buildkite:
|
||||
|
||||
- The pricing model for self-hosted runners felt punishing for breaking CI into smaller tasks
|
||||
- Public access to build pages is gated behind a login, and gives CircleCI access to your private repos by default
|
||||
- There are no customization options for adding information to build pages
|
||||
- Options for advanced pipeline logic are limited compared to other solutions
|
||||
|
||||
### GitHub Actions
|
||||
|
||||
GitHub Actions is an interesting option, but it didn't pass our initial consideration round for one main reason: scalability.
|
||||
|
||||
To ensure we're able to run the number of parallel tasks that we need to run, we'll have to use self-hosted runners. Self-hosted runners aren't subject to concurrency limits. However, managing auto-scaling runners seems to be pretty complex at the moment, and GitHub doesn't seem to have any official guidance on how to do it.
|
||||
|
||||
Also, even with self-hosted runners, there is a hard limit of 1,000 API requests per hour, though the documentation does not specify which APIs count toward it. Even assuming that one parallel step in a job is only one API request, given the large number of small tasks that we'd like to split our CI into, we would likely hit this limit pretty quickly.
|
||||
|
||||
# Adoption strategy
|
||||
|
||||
We have already done a lot of the required legwork to begin building and running pipelines in Buildkite, including getting approval from various business groups inside Elastic. After all business groups have signed off, and a deal has been signed with Buildkite, we can begin adopting Buildkite. A rough plan outline is below. It's not meant to be a full migration plan.
|
||||
|
||||
- Build minimal supporting services, automation, and pipelines to migrate a low-risk job from Jenkins to Buildkite (e.g. "Baseline" CI for tracked branches)
  - The following will need to exist (some of which has already been built)
    - New GCP project for infrastructure, with current implementations migrated
    - Agent Manager
    - Agent image build/promote
    - Slack notifications for failures (possibly utilize Buildkite's built-in solution)
    - The Buildkite pipeline and supporting code
  - Run the job in parallel with Jenkins until we have confidence that it's working well
  - Turn off the Jenkins version
- Build, test, migrate the next low-risk pipelines: ES Snapshot and/or Flaky Test Suite Runner
- Build, test, migrate tracked branch pipelines
- Build, test, migrate PR pipelines
  - Will additionally need PR comment support
  - PR pipelines are the most disruptive if there are problems, so we should have a high level of confidence before migrating
|
||||
|
||||
# How we teach this
|
||||
|
||||
The primary way that developers interact with Jenkins/CI today is through pull requests. Since we push a lot of information to pull requests via comments, developers mostly only need to interact with Jenkins when something goes wrong.
|
||||
|
||||
The Buildkite UI is simple and intuitive enough that, even without documentation, there would likely be a pretty small learning curve to navigating the build page UI that will be linked from PR comments. That's not to say we're not going to provide documentation; we just think the UI would be easy to use even without it!
|
||||
|
||||
We would also like to provide simple documentation that will guide developers through setting up new pipelines without our help. Getting a new job up and running with our current Jenkins setup is a bit complicated for someone who hasn't done it before, and there isn't good documentation for it. We'd like to change that if we move to Buildkite.
|
||||
|
||||
To teach and inform, we will likely do some subset of these things:
|
||||
|
||||
- Documentation around new CI pipelines in Buildkite
|
||||
- Documentation on how to handle PR failures using Buildkite
|
||||
- Documentation on the new infrastructure, supporting services, etc.
|
||||
- Zoom sessions with walkthrough and Q&A
|
||||
- E-mail announcement with links to documentation
|
||||
- Temporarily add an extra message to PR comments, stating the change and adding links to relevant documentation
|
|
@ -1,217 +0,0 @@
|
|||
- Start Date: 2020-04-26
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: (leave this empty)
|
||||
|
||||
---
|
||||
- [1. Summary](#1-summary)
|
||||
- [2. Detailed design](#2-detailed-design)
|
||||
- [3. Unresolved questions](#3-unresolved-questions)
|
||||
|
||||
# 1. Summary
|
||||
|
||||
A timeslider is a UI component that allows users to intuitively navigate through time-based data.
|
||||
|
||||
This RFC proposes adding a timeslider control to the Maps application.
|
||||
|
||||
It proposes a two phased roll-out in the Maps application. The [design proposal](#2-detailed-design) focuses on the first phase.
|
||||
|
||||
Since the timeslider UI is relevant for other Kibana apps, the implementation should be portable. We propose to implement the control as a React-component without implicit dependencies on Kibana-state or Maps-state.
|
||||
|
||||
The RFC details the integration of this control in Maps. It includes specific consideration to how timeslider affects data-refresh in the context of Maps.
|
||||
|
||||
This RFC also outlines a possible integration of this Timeslider-React component with an Embeddable, and the introduction of a new piece of embeddable-state `Timeslice`.
|
||||
|
||||
This RFC does not address how this component should _behave_ in apps other than the Maps-app.
|
||||
|
||||
# 2. Detailed design
|
||||
|
||||
Below outlines:
|
||||
- the two delivery phases intended for Kibana Maps
|
||||
- outline of the Timeslider UI component implementation (phase 1)
|
||||
- outline of the integration in Maps of the Timeslider UI component (phase 1)
|
||||
|
||||
## 2.1 Design phases overview
|
||||
|
||||
|
||||
|
||||
### 2.1.1 Time-range selection and stepped navigation
|
||||
|
||||
A first phase includes arbitrary time-range selection and stepped navigation.
|
||||
|
||||

|
||||
|
||||
This is the focus of this RFC.
|
||||
|
||||
Check [https://github.com/elastic/kibana/pull/96791](https://github.com/elastic/kibana/pull/96791) for a POC implementation.
|
||||
|
||||
### 2.1.2 Data distribution preview with histogram and playback
|
||||
|
||||
A second phase adds a date histogram showing the preview of the data.
|
||||
|
||||

|
||||
|
||||
Details on this phase 2 are beyond the scope of this RFC.
|
||||
|
||||
## 2.2 The timeslider UI React-component (phase 1)
|
||||
|
||||
This focuses on Phase 1. Phase 2, with date histogram preview and auto-playback, is out of scope for now.
|
||||
|
||||
### 2.2.1 Interface of the React-component
|
||||
|
||||
The core timeslider-UI is a React-component.
|
||||
|
||||
The component has no implicit dependencies on any Kibana-state or Maps-store state.
|
||||
|
||||
Its interface is fully defined by its `props`-contract.
|
||||
|
||||
```ts
|
||||
export type Timeslice = {
|
||||
from: number; // epoch timestamp
|
||||
to: number; // epoch timestamp
|
||||
};
|
||||
|
||||
export interface TimesliderProps {
|
||||
onTimesliceChanged: (timeslice: Timeslice) => void;
|
||||
  timerange: TimeRange; // The start and end time of the entire time-range. TimeRange is defined in `data/common`.
|
||||
  timeslice?: Timeslice; // The currently displayed timeslice. Needs to be set after onTimesliceChanged to be reflected back in the UI. If omitted, the control selects the first timeslice.
|
||||
}
|
||||
```
|
||||
|
||||
`timeslice` is clamped to the bounds of `timerange`.
|
||||
|
||||
Any change to `timeslice`, either by dragging the handles of the timeslider, or pressing the back or forward buttons, calls the `onTimesliceChanged` handler.
|
||||
|
||||
Since the initial use is inside Maps, the initial location of this React-component is inside the Maps plugin, `x-pack/plugins/maps/public/timeslider`.
|
||||
|
||||
Nonetheless, this UI-component should be easily "cut-and-pastable" to another location.
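To illustrate the `props`-contract, below is a minimal sketch of how a consumer could render the control as a fully controlled component. The import paths and the `TimesliderExample` wrapper are hypothetical; `TimeRange` comes from `data/common`.

```tsx
import React, { useState } from 'react';
import type { TimeRange } from 'src/plugins/data/common';
// Hypothetical import path; per this RFC the component initially lives in
// `x-pack/plugins/maps/public/timeslider`.
import { Timeslider, Timeslice } from './timeslider';

export function TimesliderExample() {
  // The parent owns the selected timeslice and reflects it back into the control.
  const [timeslice, setTimeslice] = useState<Timeslice | undefined>(undefined);
  const timerange: TimeRange = { from: 'now-7d', to: 'now' };

  return (
    <Timeslider
      timerange={timerange}
      timeslice={timeslice}
      onTimesliceChanged={(newTimeslice) => {
        // Reflect the change back into the UI and trigger data refresh elsewhere.
        setTimeslice(newTimeslice);
      }}
    />
  );
}
```

Note that the control is fully controlled: if the parent never writes the value received in `onTimesliceChanged` back into the `timeslice` prop, the selection is not reflected in the UI.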
|
||||
|
||||
### 2.2.2 Internals
|
||||
|
||||
The timeslider automatically subdivides the timerange with equal breaks that are heuristically determined.
|
||||
|
||||
It assigns about 6-10 breaks within the `timerange` and snaps the "ticks" to a natural "pretty date" using `calcAutoIntervalNear` from `data/common`.
|
||||
|
||||
For example:
|
||||
- a `timerange` of 8.5 days would be assigned 8 day-long steps, plus some padding on either end, depending on the entire `timerange`.
|
||||
- a `timerange` of 6.9 years would snap to year-long steps, plus some padding on either end, depending on the entire `timerange`.
|
||||
|
||||
The slider itself is a `<EuiDualRange>`.
|
||||
|
||||
### 2.2.3 End-user behavior
|
||||
|
||||
- the user can manipulate the double-ranged `timeslice` slider to any arbitrary range within the `timerange`.
|
||||
- the user can press the forward and back buttons for stepped navigation. The width of the current `timeslice` is preserved when there is room for it within the `timerange`.
|
||||
- when the user has _not modified the width_ of the `timeslice`, using the buttons means stepping through the pre-determined ticks (e.g. by year, by day, ...)
|
||||
- when the user has _already modified the width_ of the `timeslice`, it means stepping through the `timerange` with a stride equal to the width of the `timeslice` (see the sketch after this list)
|
||||
- the `timeslice` "snaps" to the beginning or end (depending on direction) of the `timerange`. In practice, this means the `timeslice` will collaps or reduce in width.
|
||||
|
||||
## 2.3 Maps integration of the timeslider React component
|
||||
|
||||
This control will be integrated in the Maps-UI.
|
||||
|
||||
Maps is Redux-based, so `timeslice` selection and activation/de-activation all propagate to the Redux-store.
|
||||
|
||||
#### 2.3.1 Position in the UI
|
||||
|
||||
The timeslider control is enabled/disabled by the timeslider on/off toggle-button in the main toolbar.
|
||||
|
||||

|
||||
|
||||
|
||||
#### 2.3.2 Layer interactions
|
||||
|
||||
|
||||
Enabling the Timeslider will automatically retrigger refresh of _all_ time-based layers to the currently selected `timeslice`.
|
||||
|
||||
The Layer-TOC will indicate which layer is currently "time-filtered" by the timeslider.
|
||||
|
||||
On a layer-per-layer basis, users will be able to explicitly opt a layer out of being governed by the timerange. This is an existing toggle in Maps already.
|
||||
This is relevant for having users add contextual layers that should _not_ depend on the time.
|
||||
|
||||
|
||||
#### 2.3.3 Omitting timeslider on a dashboard
|
||||
|
||||
Maps will not display the timeslider-activation button on Maps that are embedded in a Dashboard.
|
||||
|
||||
We believe that a Timeslider-embeddable would be a better vehicle to bring timeslider-functionality to Dashboards. See also the [last section](#3-unresolved-questions).
|
||||
|
||||
#### 2.3.4 Data-fetch considerations
|
||||
|
||||
---
|
||||
**NOTE**
|
||||
|
||||
The section below is very Maps-specific, although similar challenges would be present in other applications as well.
|
||||
|
||||
Some of these considerations will not generalize to all of Kibana.
|
||||
|
||||
The main ways in which data-fetch in Maps differs from other use-cases:
|
||||
- the majority of the data-fetching for layers in Maps depends on the scale and extent, i.e. different data is requested based on the current zoom-level and current extent of the map. For example, even if two views share the same time, query, and filter-state, their requests to ES will differ if their extent and/or scale is different.
|
||||
- for some layer-types, Maps will fetch individual documents, rather than the result of an aggregation.
|
||||
|
||||
---
|
||||
|
||||
Data-fetch for timeslider should be responsive and smooth. A user dragging the slider should have an immediate visual result on the map.
|
||||
|
||||
In addition, we expect the timeslider will be used a lot for "comparisons". For example, imagine a user stepping back&forth between timeslices.
|
||||
|
||||
For this reason, apps using a timeslider (such as Maps) ideally:
|
||||
- pre-fetch data when possible
|
||||
- cache data when possible
|
||||
|
||||
For Maps specifically, when introducing timeslider, layers will therefore need to implement time-based data fetch based on _two_ pieces of state (see the sketch after this list):
|
||||
- the entire `timerange` (aka. the global Kibana timerange)
|
||||
- the selected `timeslice` (aka. the `timeslice` chosen by the user using the UI-component)
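As an illustration only (these names are not an existing Maps interface), the time-related state a layer's data-fetch would receive could be shaped like this:

```ts
// Illustrative only; not an existing Maps interface.
interface LayerTimeState {
  // The entire time-range (the global Kibana timerange): drives what can be pre-fetched and cached.
  timerange: { from: number; to: number };
  // The currently selected timeslice: drives what is displayed/masked.
  timeslice?: { from: number; to: number };
}
```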
|
||||
|
||||
##### 2.3.4.1 Pre-fetching individual documents and masking of data
|
||||
|
||||
ES-document layers (which display individual documents) can prefetch all documents within the entire `timerange`, when the total number of docs is below some threshold. In the context of Maps, this threshold is the default index-search-size of the index.
|
||||
|
||||
Maps can then just mask data on the map based on some filter-expression. The evaluation of this filter-expression is done by mapbox-gl and is fast because it occurs on the GPU. There is immediate visual feedback to the user as they manipulate the timeslider, because it does not require a roundtrip to update the data.
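As a sketch of this masking approach, the snippet below sets a time-based filter on a layer; the layer id and the time property name are assumptions, while the expression syntax is standard mapbox-gl:

```ts
import type { Map as MbMap } from 'mapbox-gl';

// 'my_documents_layer' and 'timeFieldEpochMillis' are hypothetical names for this sketch.
// Updating the filter gives immediate visual feedback without a data round-trip.
function applyTimesliceFilter(mbMap: MbMap, timeslice: { from: number; to: number }) {
  mbMap.setFilter('my_documents_layer', [
    'all',
    ['>=', ['get', 'timeFieldEpochMillis'], timeslice.from],
    ['<', ['get', 'timeFieldEpochMillis'], timeslice.to],
  ]);
}
```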
|
||||
|
||||
##### 2.3.4.2 Caching of aggregated searches
|
||||
|
||||
Aggregated data can be cached on the client, so toggling between timeslices can avoid a round-trip data-fetch.
|
||||
The main technique here is for layers to use the `.mvt` data format to request data. Tile-based data can be cached client-side.
|
||||
|
||||
We do _not_ propose _pre-fetching_ of aggregated data in this initial phase of the Maps timeslider effort. There are a couple of reasons:
|
||||
- Based on the intended user-interactions for timeslider, because a user can flexibly select a `timeslice` of arbitrary width, it would be really hard to determine which timeslices to aggregate up front.
|
||||
- Maps already strains the maximum bucket sizes it can retrieve from Elasticsearch. Cluster/grid-layers often push up to 10k or more buckets, and terms-aggregations for choropleth maps also go up to 10k buckets. Prefetching this for multiple timeslices (e.g. 10 timeslices) would easily exceed the default bucket limits of Elasticsearch.
|
||||
|
||||
|
||||
##### 2.3.4.3 Decouple data-fetch from UI-effort
|
||||
|
||||
Apart from refactoring the data-fetch for layers to now use two pieces of time-based state, the implementation will decouple any data-fetch considerations from the actual timeslider-UI work.
|
||||
|
||||
The idea is that data-fetch optimizations can be dialed into Maps in a parallel work-thread, not necessarily dependent on any changes or additions to the UI.
|
||||
Any optimizations would not only affect timeslider users, but support all interaction patterns that require smooth data-updates (e.g. panning back&forth between two locations, toggling back&forth between two filters, ...)
|
||||
|
||||
The main effort to support efficient data-fetch in a maps-context is to use `.mvt` as the default data format ([https://github.com/elastic/kibana/issues/79868](https://github.com/elastic/kibana/issues/79868)). This is a stack-wide effort in collaboration with the Elasticsearch team ([https://github.com/elastic/elasticsearch/issues/58696](https://github.com/elastic/elasticsearch/issues/58696)), which will add `.mvt` as a core response format to Elasticsearch.
|
||||
|
||||
Growing the use of `mvt`([https://docs.mapbox.com/vector-tiles/reference/](https://docs.mapbox.com/vector-tiles/reference/)) in Maps will help with both pre-fetching and client-side caching:
|
||||
- `mvt` is a binary format which allows more data to be packed inside, as compared to JSON. Multiple tiles are patched together, so this introduces a form of parallelization as well. Because more data fits inside a single tile, and because of the parallelization, Maps has a pathway to increase the number of features that can be time-filtered.
|
||||
- Because vector tiles have fixed extents and scales (defined by a `{x}/{y}/{scale}`-tuple), this type of data-fetching allows tiles to be cached on the client. This cache can be the implicit browser disc-cache, or the transient in-mem cache of mapbox-gl. Using mvt thus provides a pathway for fast toggling back&forth between timeslices, without round-trips to fetch data.
|
||||
|
||||
|
||||
##### 2.3.4.4 Timeslider and async search
|
||||
|
||||
It is unclear what the practical uses for async-search would be in the context of a timeslider-control in Maps.
|
||||
|
||||
Timeslider is a highly interactive control that requires immediate visual feedback. We also do not intend to activate timeslider in Maps on a Dashboard (see above).
|
||||
|
||||
End-users who need to view a dashboard with a long-running background search will need to manipulate the _global Kibana time picker_ to select the time-range, and will not be able to use the timeslider to do so.
|
||||
|
||||
# 3. Unresolved questions
|
||||
|
||||
## Making Timeslider a Kibana Embeddable
|
||||
|
||||
This is a forward-looking section. It is a proposal for how the Timeslider-UI could be exposed as an Embeddable, when that time comes.
|
||||
|
||||
We expect a few steps:
|
||||
- This would require the extraction of the timeslider React-component out of Maps into a separate plugin. As outlined above, this migration should be fairly straightforward, a "cut and paste".
|
||||
- It would require the creation of a `TimesliderEmbeddable` which wraps this UI-component.
|
||||
- It would also require the introduction of a new piece of embeddable-state, `Timeslice`, which can be controlled by the `TimesliderEmbeddable`.
|
||||
We believe it is important to keep `timeslice` and `timerange` separate, as individual apps and other embeddables will have different mechanisms to efficiently fetch data and respond to changes in `timeslice` and/or `timerange`.
|
||||
|
||||
Having timeslider as a core Embeddable likely provides a better pathway to integrate timeslider-functionality in Dashboards or apps other than Maps.
|
||||
|
|
@ -1,261 +0,0 @@
|
|||
- Start Date: 2020-06-04
|
||||
- RFC PR: (leave this empty)
|
||||
- Kibana Issue: https://github.com/elastic/kibana/issues/89287
|
||||
|
||||
---
|
||||
- [1. Summary](#1-summary)
|
||||
- [2. Motivation](#2-motivation)
|
||||
- [3. Detailed design](#3-detailed-design)
|
||||
- [3.1 Core client-side changes](#31-core-client-side-changes)
|
||||
- [3.2 Core server-side changes](#32-core-server-side-changes)
|
||||
- [3.2.1 Plugins service](#321-plugins-service)
|
||||
- [3.2.2 HTTP service](#322-http-service)
|
||||
- [3.2.3 Elasticsearch service](#323-elasticsearch-service)
|
||||
- [3.2.4 UI Settings service](#324-ui-settings-service)
|
||||
- [3.2.5 Rendering service](#325-rendering-service)
|
||||
- [3.2.6 I18n service](#326-i18n-service)
|
||||
- [3.2.7 Environment service](#327-environment-service)
|
||||
- [3.2.8 Core app service](#328-core-app-service)
|
||||
- [3.2.9 Preboot service](#329-preboot-service)
|
||||
- [3.2.10 Bootstrap](#3210-bootstrap)
|
||||
- [4. Drawbacks](#4-drawbacks)
|
||||
- [5. Alternatives](#5-alternatives)
|
||||
- [6. Adoption strategy](#6-adoption-strategy)
|
||||
- [7. How we teach this](#7-how-we-teach-this)
|
||||
- [8. Unresolved questions](#8-unresolved-questions)
|
||||
- [8.1 Lifecycle stage name](#81-lifecycle-stage-name)
|
||||
- [8.2 Development mode and basepath proxy](#82-development-mode-and-basepath-proxy)
|
||||
- [9. Resolved questions](#9-resolved-questions)
|
||||
- [9.1 Core client-side changes](#91-core-client-side-changes)
|
||||
|
||||
# 1. Summary
|
||||
|
||||
The `preboot` stage (see [unresolved question 1](#81-lifecycle-stage-name)) is the initial Kibana lifecycle stage, at which Kibana only initializes a bare minimum of the core services and a limited set of special-purpose plugins. It's assumed that Kibana can change and reload its own configuration at this stage and may require administrator involvement before it can proceed to the `setup` and `start` stages.
|
||||
|
||||
# 2. Motivation
|
||||
|
||||
The `preboot` lifecycle stage is a prerequisite for the Kibana interactive setup mode. This is the mode Kibana enters on first launch if it detects that the user hasn't explicitly configured their own connection to Elasticsearch. In this mode, Kibana will present an interface to the user that allows them to provide Elasticsearch connection information and potentially any other configuration information. Once the information is verified, Kibana will write it to disk and allow the rest of Kibana to start.
|
||||
|
||||
The interactive setup mode will be provided through a dedicated `userSetup` plugin that will be initialized at the `preboot` stage.
|
||||
|
||||
# 3. Detailed design
|
||||
|
||||
The central part of the `preboot` stage is a dedicated HTTP server instance formerly known as the `Not Ready` server. Kibana starts this server at the `preboot` stage and shuts it down as soon as the main HTTP server is ready to start, as illustrated in the following diagram:
|
||||
|
||||

|
||||
|
||||
Currently, the preboot HTTP server only exposes a status endpoint and renders a static `Kibana server is not ready yet` string whenever users try to access Kibana before it's completely initialized. The changes proposed in this RFC should allow special-purpose plugins to define custom HTTP endpoints and serve interactive client-side applications on this server, hence making the Kibana interactive setup mode possible.
|
||||
|
||||
## 3.1 Core client-side changes
|
||||
|
||||
The RFC aims to limit the changes to only those that are absolutely required and doesn't assume any modifications in the client-side part of the Kibana Core at the moment. This may introduce a certain level of inconsistency in the client-side codebase, but we consider it insignificant. See [resolved question 1](#91-core-client-side-changes) for more details.
|
||||
|
||||
## 3.2 Core server-side changes
|
||||
|
||||
We'll only need to update several Core server-side services to support the new `preboot` lifecycle stage and preboot plugins.
|
||||
|
||||
Once none of the `preboot` plugins holds up `setup` anymore, Kibana might need to reload the configuration before it can finally proceed to `setup`. This doesn't require any special care from existing plugin developers since Kibana would instantiate plugins only after it reloads the config. We'll also make sure that none of the Core services relies on stale configuration it may have acquired during the `preboot` stage.
|
||||
|
||||
### 3.2.1 Plugins service
|
||||
|
||||
First of all, we'll introduce a new type of special-purpose plugins: preboot plugins, in contrast to standard plugins. Kibana will initialize preboot plugins at the `preboot` stage, before even instantiating standard plugins.
|
||||
|
||||
Preboot plugins have only `setup` and `stop` methods, and can only depend on other preboot plugins. Standard plugins cannot depend on preboot plugins since Kibana will stop the preboot plugins before starting the standard ones:
|
||||
|
||||
```ts
|
||||
export interface PrebootPlugin<TSetup = void, TPluginsSetup extends object = object> {
|
||||
setup(core: CorePrebootSetup, plugins: TPluginsSetup): TSetup;
|
||||
stop?(): void;
|
||||
}
|
||||
```
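For illustration, a minimal sketch of a preboot plugin class implementing this interface; the class name mirrors the `userSetup` plugin mentioned above, and the method bodies are hypothetical:

```ts
// A minimal, hypothetical preboot plugin; the real plugin would live in its own folder
// and declare `"type": "preboot"` in its manifest (see below).
export class UserSetupPlugin implements PrebootPlugin {
  public setup(core: CorePrebootSetup) {
    // Register routes on the preboot HTTP server, hold the startup sequence, etc.
    // (see the following sections for the relevant `CorePrebootSetup` APIs).
  }

  public stop() {
    // Release any resources before Kibana proceeds to the `setup` stage.
  }
}
```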
|
||||
|
||||
To differentiate preboot and standard plugins we'll introduce a new _optional_ `type` property in the plugin manifest. The property can have only two possible values: `preboot` for `preboot` plugins and `standard` for the standard ones. If `type` is omitted, the `standard` value will be assumed.
|
||||
|
||||
```json5
|
||||
// NOTE(azasypkin): all other existing properties have been omitted for brevity.
|
||||
{
|
||||
"type": "preboot", // 'preboot' | 'standard' | undefined
|
||||
}
|
||||
```
|
||||
|
||||
The Plugins service will split plugins into two separate groups during discovery to use them separately at the `preboot`, `setup`, and `start` stages. The Core contract that preboot plugins will receive during their `setup` will be different from the one standard plugins receive, and will only include the functionality that is currently required for the interactive setup mode. We'll discuss this functionality in detail in the following sections:
|
||||
|
||||
```ts
|
||||
export interface CorePrebootSetup {
|
||||
elasticsearch: ElasticsearchServicePrebootSetup;
|
||||
http: HttpServicePrebootSetup;
|
||||
preboot: PrebootServiceSetup;
|
||||
}
|
||||
```
|
||||
|
||||
### 3.2.2 HTTP service
|
||||
|
||||
We'll change the HTTP service to initialize and start the preboot HTTP server (formerly known as the `Not Ready` server) in the new `preboot` method instead of `setup`. The returned `InternalHttpServicePrebootSetup` contract will presumably be very similar to the existing `InternalHttpServiceSetup` contract, but will only include the APIs we currently need to support the interactive setup mode:
|
||||
|
||||
```ts
|
||||
// NOTE(azasypkin): some existing properties have been omitted for brevity.
|
||||
export interface InternalHttpServicePrebootSetup
|
||||
extends Pick<HttpServiceSetup, 'auth' | 'csp' | 'basePath' | 'getServerInfo'> {
|
||||
server: HttpServerSetup['server'];
|
||||
externalUrl: ExternalUrlConfig;
|
||||
registerRoutes(path: string, callback: (router: IRouter) => void): void;
|
||||
}
|
||||
```
|
||||
|
||||
The only part of this contract that will be available to the preboot plugins via `CorePrebootSetup` is the API to register HTTP routes on the already running preboot HTTP server:
|
||||
|
||||
```ts
|
||||
export interface HttpServicePrebootSetup {
|
||||
registerRoutes(path: string, callback: (router: IRouter) => void): void;
|
||||
}
|
||||
```
|
||||
|
||||
The Core HTTP context available to handlers of the routes registered on the preboot HTTP server will only expose the `uiSettings` service. As explained in the [UI Settings service section](#324-ui-settings-service), this service will only give access to the **default Core** UI settings and their overrides set through Kibana configuration, if any.
|
||||
```ts
|
||||
// NOTE(azasypkin): the fact that the client is lazily initialized has been omitted for brevity.
|
||||
export interface PrebootCoreRouteHandlerContext {
|
||||
readonly uiSettings: { client: IUiSettingsClient };
|
||||
}
|
||||
```
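A sketch of how a preboot plugin could register a route and read a default UI setting, assuming the preboot context is exposed under `context.core` as it is for standard routes; the path prefix, route path, and setting key are illustrative:

```ts
// Inside a preboot plugin's `setup(core: CorePrebootSetup)`:
core.http.registerRoutes('', (router) => {
  router.get(
    { path: '/internal/user_setup/status', validate: false },
    async (context, request, response) => {
      // Only the default Core UI settings (plus config overrides) are available here.
      const darkMode = await context.core.uiSettings.client.get('theme:darkMode');
      return response.ok({ body: { darkMode } });
    }
  );
});
```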
|
||||
|
||||
The authentication and authorization components are not available at the `preboot` stage, and hence all preboot HTTP server routes can be freely accessed by anyone with access to the network Kibana is exposed to.
|
||||
|
||||
Just as today, Kibana will shut the preboot HTTP server down as soon as it's ready to start the main HTTP server.
|
||||
|
||||
### 3.2.3 Elasticsearch service
|
||||
|
||||
As mentioned in the [Motivation section](#2-motivation), the main goal of the interactive setup mode is to give the user a hassle-free way to configure Kibana's connection to an Elasticsearch cluster. That means that users might provide certain connection information, and Kibana preboot plugins should be able to construct a new Elasticsearch client using this information to verify it and potentially call Elasticsearch APIs.
|
||||
|
||||
To support this use case we'll add a new `preboot` method to the Elasticsearch service that will return the following contract, and make it available to the preboot plugins via `CorePrebootSetup`:
|
||||
|
||||
```ts
|
||||
export interface ElasticsearchServicePrebootSetup {
|
||||
readonly createClient: (
|
||||
type: string,
|
||||
clientConfig?: Partial<ElasticsearchClientConfig>
|
||||
) => ICustomClusterClient;
|
||||
}
|
||||
```
|
||||
|
||||
The Elasticsearch clients created with `createClient` rely on the default Kibana Elasticsearch configuration and any configuration overrides specified by the consumer.
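A sketch of how a preboot plugin could use `createClient` to verify user-provided connection details; the client type name and the configuration overrides are illustrative:

```ts
// Hypothetical helper a preboot plugin could call from `setup(core: CorePrebootSetup)`.
async function canConnect(
  elasticsearch: ElasticsearchServicePrebootSetup,
  hosts: string[],
  username: string,
  password: string
): Promise<boolean> {
  // 'interactive-setup' is just an illustrative client type name; the overrides come from user input.
  const client = elasticsearch.createClient('interactive-setup', { hosts, username, password });
  try {
    await client.asInternalUser.ping();
    return true;
  } catch {
    return false;
  } finally {
    await client.close();
  }
}
```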
|
||||
|
||||
__NOTE:__ We may need to expose the full Elasticsearch config, or a portion of it, to the preboot plugins so they can check whether the user has already configured an Elasticsearch connection. There are other ways to check that without direct access to the configuration though.
|
||||
|
||||
### 3.2.4 UI Settings service
|
||||
|
||||
We'll introduce a new `preboot` method in the UI Settings service that will produce a UI Settings client instance. Since during the `preboot` stage Kibana can access neither user information nor Saved Objects, this client will only give access to the **default Core** UI settings and their overrides set through Kibana configuration, if any:
|
||||
|
||||
```ts
|
||||
export interface InternalUiSettingsServicePrebootSetup {
|
||||
defaultsClient(): IUiSettingsClient;
|
||||
}
|
||||
```
|
||||
|
||||
The UI Settings service isn't strictly necessary during the `preboot` stage, but many Kibana Core components rely on it explicitly and implicitly, which justifies this simple change.
|
||||
|
||||
### 3.2.5 Rendering service
|
||||
|
||||
We'll introduce a new `preboot` method in the Rendering service that will register the Kibana main UI bootstrap template route on the preboot HTTP server, as it does for the main HTTP server today. The main difference is that the bootstrap UI will only reference bundles of the preboot plugins and will rely on the default UI settings.
|
||||
|
||||
### 3.2.6 I18n service
|
||||
|
||||
We'll introduce a new `preboot` method in the I18n service to only include translations for the Core itself and preboot plugins in the translations bundle loaded with the preboot UI bootstrap template. This would potentially allow us to switch locale during interactive setup mode if there is such a need in the future.
|
||||
|
||||
### 3.2.7 Environment service
|
||||
|
||||
There are no changes required in the Environment service itself, but we'll expose one additional property from its `setup` contract to the plugins: the paths to the known configuration files. The interactive setup mode should be able to figure out to which configuration file Kibana should save any changes users might need to make.
|
||||
|
||||
### 3.2.8 Core app service
|
||||
|
||||
We'll introduce a new `preboot` method in the Core app service to register the routes on the preboot HTTP server that are necessary for rendering the Kibana preboot applications. Most of the routes will be the same as for the main HTTP server, but there are three notable exceptions:
|
||||
|
||||
1. JS bundle routes will only include those exposed by the preboot plugins
|
||||
|
||||
2. The default route for the preboot HTTP server will be hardcoded to the root path (`/`) since we cannot rely on the default value of the `defaultRoute` UI setting (`/app/home`)
|
||||
|
||||
3. The main application route (`/app/{id}/{any*}`) will be replaced with a catch-all route (`/{path*}`). The reason is that if the user tries to access Kibana with a legit standard application URL (e.g. `/app/discover/?parameters`) while Kibana is still at the `preboot` stage, they will end up with an `Application is not found` error. Instead, with the catch-all route, Kibana will capture the original URL in the `next` query string parameter and redirect the user to the root (e.g. `/?next=%2Fapp%2Fdiscover%2F%3Fparameters`). This will allow us to automatically redirect the user back to the original URL as soon as Kibana is ready. The main drawback and limitation of this approach is that there can be only one root-level preboot application. We can lift this limitation in the future if we have to, for example, to support a post-preboot Saved Objects migration UI or something similar.
|
||||
|
||||
Serving a proper Kibana application on the root route of the preboot HTTP server implies that we'll also have a chance to replace the static `Kibana server is not ready yet` string with a more helpful and user-friendly application. Such an application may potentially display a certain set of Kibana status information.
|
||||
|
||||
### 3.2.9 Preboot service
|
||||
|
||||
To support interactive applications at the `preboot` stage we should allow preboot plugins to pause the Kibana startup sequence. This functionality will be exposed by the new Preboot service, and will be available to the preboot plugins via `CorePrebootSetup`. Preboot plugins will be able to provide a promise to hold `setup` and/or `start` for as long as needed, and also let Kibana know if it has to reload the configuration before it enters the `setup` stage.
|
||||
|
||||
```ts
|
||||
export interface PrebootServiceSetup {
|
||||
readonly isSetupOnHold: () => boolean;
|
||||
readonly holdSetupUntilResolved: (
|
||||
reason: string,
|
||||
promise: Promise<{ shouldReloadConfig: boolean } | void>
|
||||
) => void;
|
||||
readonly isStartOnHold: () => boolean;
|
||||
readonly holdStartUntilResolved: (
|
||||
reason: string,
|
||||
promise: Promise<void>
|
||||
) => void
|
||||
}
|
||||
```
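A sketch of how a preboot plugin could use this contract to pause startup until the user completes the interactive setup; the deferred-promise wiring is illustrative:

```ts
// Inside a preboot plugin's `setup(core: CorePrebootSetup)`:
let completeSetup!: (result: { shouldReloadConfig: boolean }) => void;

core.preboot.holdSetupUntilResolved(
  'Waiting for the user to complete the interactive setup',
  new Promise<{ shouldReloadConfig: boolean }>((resolve) => {
    completeSetup = resolve;
  })
);

// Later, e.g. from a preboot HTTP route handler, once the new config has been written to disk:
// completeSetup({ shouldReloadConfig: true });
```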
|
||||
|
||||
The Preboot service will provide a pair of helper methods, `isSetupOnHold` and `isStartOnHold`, that allow consumers to check whether `setup` or `start` is on hold before they block on waiting.
|
||||
|
||||
The internal Preboot service contract will also expose `waitUntilCanSetup` and `waitUntilCanStart` methods that the bootstrap process can use to know when it can proceed to the `setup` and `start` stages. If any of these methods returns a `Promise` that is rejected, Kibana will shut down.
|
||||
|
||||
```ts
|
||||
// NOTE(azasypkin): some existing properties have been omitted for brevity.
|
||||
export interface InternalPrebootServiceSetup {
|
||||
readonly waitUntilCanSetup: () => Promise<{ shouldReloadConfig: boolean } | void>;
|
||||
readonly waitUntilCanStart: () => Promise<void>;
|
||||
}
|
||||
```
|
||||
|
||||
### 3.2.10 Bootstrap
|
||||
|
||||
We'll update the Kibana bootstrap sequence to include the `preboot` stage and to conditionally reload the configuration before proceeding to the `setup` and `start` stages:
|
||||
|
||||
```ts
|
||||
// NOTE(azasypkin): some functionality and checks have been omitted for brevity.
|
||||
const { preboot } = await root.preboot();
|
||||
|
||||
const { shouldReloadConfig } = await preboot.waitUntilCanSetup();
|
||||
if (shouldReloadConfig) {
|
||||
await reloadConfiguration('pre-boot request');
|
||||
}
|
||||
await root.setup();
|
||||
|
||||
await preboot.waitUntilCanStart();
|
||||
await root.start();
|
||||
```
|
||||
|
||||
It's not yet clear if we need to adjust the base path proxy to account for this new lifecycle stage (see [unresolved question 2](#82-development-mode-and-basepath-proxy)).
|
||||
|
||||
# 4. Drawbacks
|
||||
|
||||
The main drawback is that the proposed changes affect quite a few Kibana Core services, which may impose a risk of breaking something in critical parts of Kibana.
|
||||
|
||||
# 5. Alternatives
|
||||
|
||||
The most viable alternative for supporting the interactive setup mode in Kibana was a standalone application that would be completely separate from Kibana. We ruled out this option since we wouldn't be able to leverage existing and battle-tested Core services, UI components, and development tools. This would make the long-term maintenance burden unreasonably high.
|
||||
|
||||
# 6. Adoption strategy
|
||||
|
||||
The new `preboot` stage doesn't need an adoption strategy since it's intended for internal platform use only.
|
||||
|
||||
# 7. How we teach this
|
||||
|
||||
The new `preboot` stage shouldn't need much knowledge sharing since it's intended for internal platform use only and doesn't affect the standard plugins. All new services, methods, and contracts will be sufficiently documented in the code.
|
||||
|
||||
# 8. Unresolved questions
|
||||
|
||||
## 8.1 Lifecycle stage name
|
||||
|
||||
Is `preboot` the right name for this new lifecycle stage? Do we have a better alternative?
|
||||
|
||||
## 8.2 Development mode and basepath proxy
|
||||
|
||||
Currently, the base path proxy blocks any requests to Kibana until it receives the `SERVER_LISTENING` message. Kibana's main process sends this message only after `start`, but we should change that to support interactive preboot applications. It's not yet clear how big the impact of this change will be.
|
||||
|
||||
# 9. Resolved questions
|
||||
|
||||
## 9.1 Core client-side changes
|
||||
|
||||
The server-side part of the `preboot` plugins will follow a new `PrebootPlugin` interface that doesn't have a `start` method, but the client-side part will stay the same as for standard plugins. This significantly simplifies implementation and doesn't introduce any known technical issues, but, unfortunately, brings some inconsistency to the codebase. We agreed that it's tolerable assuming we define a dedicated client-side `PrebootPlugin` interface that would hide from `CoreStart` all services that are unavailable to the preboot plugins (e.g., Saved Objects service).
|
|
@ -1,729 +0,0 @@
|
|||
- Start Date: 2021-03-09
|
||||
- RFC PR: https://github.com/elastic/kibana/pull/94057
|
||||
- Kibana Issue: https://github.com/elastic/kibana/issues/68626
|
||||
- POC PR: https://github.com/elastic/kibana/pull/93380
|
||||
|
||||
---
|
||||
|
||||
- [1. Summary](#1-summary)
|
||||
- [2. Motivation](#2-motivation)
|
||||
- [3. Architecture](#3-architecture)
|
||||
- [4. Testing](#4-testing)
|
||||
- [5. Detailed design](#5-detailed-design)
|
||||
- [6. Technical impact](#6-technical-impact)
|
||||
- [6.1 Technical impact on Core](#6.1-technical-impact-on-core)
|
||||
- [6.2 Technical impact on Plugins](#6.2-technical-impact-on-plugins)
|
||||
- [6.3 Summary of breaking changes](#6.3-summary-of-breaking-changes)
|
||||
- [7. Drawbacks](#7-drawbacks)
|
||||
- [8. Alternatives](#8-alternatives)
|
||||
- [9. Adoption strategy](#9-adoption-strategy)
|
||||
- [10. How we teach this](#10-how-we-teach-this)
|
||||
- [11. Unresolved questions](#11-unresolved-questions)
|
||||
- [12. Resolved questions](#12-resolved-questions)
|
||||
|
||||
# 1. Summary
|
||||
|
||||
This RFC proposes a new core service which leverages the [Node.js cluster API](https://nodejs.org/api/cluster.html)
|
||||
to support multi-process Kibana instances.
|
||||
|
||||
# 2. Motivation
|
||||
|
||||
The Kibana server currently uses a single Node process to serve HTTP traffic.
|
||||
This is a byproduct of the single-threaded nature of Node's event loop.
|
||||
|
||||
As a consequence, Kibana cannot take advantage of multi-core hardware: If you run Kibana on an
|
||||
8-core machine, it will only utilize one of those cores. This makes it expensive to scale out
|
||||
Kibana, as server hardware will typically have multiple cores, so you end up paying for power
|
||||
you never use. Since Kibana is generally more CPU-intensive than memory-intensive, it would be
|
||||
advantageous to use all available cores to maximize the performance we can get out of a single
|
||||
machine.
|
||||
|
||||
Another benefit of this approach would be improving Kibana's overall performance for most users
|
||||
without requiring an operator to scale out the server, as it would allow the server to handle
|
||||
more http requests at once, making it less likely that a single bad request could delay the
|
||||
event loop and impact subsequent requests.
|
||||
|
||||
The introduction of a clustering mode would allow spawning multiple Kibana processes ('workers')
|
||||
from a single Kibana instance. (See [Alternatives](#8-alternatives) to learn more about the
|
||||
difference between clustering and worker pools). You can think of these processes as individual
|
||||
instances of the Kibana server which listen on the same port on the same machine, and serve
|
||||
incoming traffic in a round-robin fashion.
|
||||
|
||||
Our intent is to eventually make clustering the default behavior in Kibana, taking advantage of
|
||||
all available CPUs out of the box. However, this should still be an optional way to run Kibana
|
||||
since users might have use cases for single-process instances (for example, users running Kibana
|
||||
inside Docker containers might choose to rather use their container orchestration to run a
|
||||
container per host CPU with a single Kibana process per container).
|
||||
|
||||
# 3. Architecture
|
||||
|
||||
In 'classic' mode, the Kibana server is started in the main Node.js process.
|
||||
|
||||

|
||||
|
||||
In clustering mode, the main Node.js process would only start the coordinator, which would then
|
||||
fork workers using Node's `cluster` API. Node's underlying socket implementation allows multiple
|
||||
processes to listen to the same ports, effectively performing http traffic balancing between the
|
||||
workers for us.
|
||||
|
||||

|
||||
|
||||
The coordinator's primary responsibility is to orchestrate the workers. It would not be a 'super'
|
||||
worker, handling the job of a worker while also being in charge of managing the other workers.
|
||||
|
||||
In addition, the coordinator would be responsible for some specific activities that need to be
|
||||
handled in a centralized manner:
|
||||
- collecting logs from each of the workers & writing them to a single file or stdout
|
||||
- gathering basic status information from each worker for use in the `/status` and `/stats` APIs
|
||||
|
||||
Over time, it is possible that the role of the coordinator would expand to serve more purposes,
|
||||
especially if we start implementing custom routing logic to run different services on specialized
|
||||
processes.
|
||||
|
||||
# 4. Testing
|
||||
|
||||
Thorough performance testing is critical in evaluating the success of this plan. The results
|
||||
below reflect some initial testing that was performed against an experimental
|
||||
[proof-of-concept](https://github.com/elastic/kibana/pull/93380). Should we move forward with this
|
||||
RFC, one of the first tasks will be to update the POC and build out a more detailed test plan that
|
||||
covers all of the scenarios we are concerned with.
|
||||
|
||||
## 4.1 Local testing
|
||||
|
||||
These tests were performed against a local development machine, with an 8-core CPU (2.4 GHz 8-Core
|
||||
Intel Core i9 - 32 GB 2400 MHz DDR4), using the default configuration of the `kibana-load-testing` tool.
|
||||
|
||||
### 4.1.1 Raw results
|
||||
|
||||
#### Non-clustered mode
|
||||
|
||||

|
||||
|
||||
#### Clustered mode, 2 workers
|
||||
|
||||

|
||||
|
||||
#### Clustered mode, 4 workers
|
||||
|
||||

|
||||
|
||||
### 4.1.2 Analysis
|
||||
|
||||
- Between non-clustered and 2-worker cluster mode, we observe a 20-25% gain in the 50th percentile response time.
|
||||
Gains for the 75th and 95th percentiles are between 10% and 40%
|
||||
- Between 2-worker and 4-worker cluster mode, the gain on the 50th percentile is negligible, but the 75th and the 95th are
|
||||
significantly better in the 4-worker results, sometimes up to a 100% gain (a factor-2 ratio)
|
||||
|
||||
Overall, switching to 2 workers brings the most significant improvement in the 50th percentile,
|
||||
and increasing further to 4 workers reduces the highest percentiles even more significantly.
|
||||
Even if increasing the number of workers doesn't increase performance linearly
|
||||
(which makes sense, as most of our request response time is spent awaiting the ES response),
|
||||
the improvements of clustering mode on performance under heavy load are far from negligible.
|
||||
|
||||
## 4.2 Testing against cloud
|
||||
|
||||
There is currently no easy way to test the performance improvements this could provide on Cloud, as we can't
|
||||
deploy custom builds or branches on Cloud at the moment.
|
||||
|
||||
On Cloud, Kibana is running in a containerised environment using CPU CFS quota and CPU shares.
|
||||
|
||||
If we want to investigate the potential perf improvement on Cloud further, our only option would be to set up a
|
||||
similar-ish environment locally (which wasn't done during the initial investigation).
|
||||
|
||||
# 5. Detailed design
|
||||
|
||||
## 5.1 Enabling clustering mode
|
||||
|
||||
Enabling clustering mode will be done using the `node.enabled` configuration property.
|
||||
|
||||
If clustering is enabled by default, then no configuration would be required by users, and
|
||||
Kibana would automatically use all available cores. However, more detailed configuration
|
||||
would be available for users with more advanced use cases:
|
||||
```yaml
|
||||
node:
|
||||
enabled: true # enabled by default
|
||||
|
||||
coordinator:
|
||||
max_old_space_size: 1gb # optional, allows to configure memory limit for coordinator only
|
||||
|
||||
# Basic config for multiple workers with the same options
|
||||
workers: # when count is provided, all workers share the same config
|
||||
count: 2 # worker names (for logging) are generated: `worker-1`, `worker-2`
|
||||
max_old_space_size: 1gb # optional, allows to configure memory limits per-worker
|
||||
|
||||
# Alternative advanced config, allowing for worker "types" to be configured
|
||||
workers:
|
||||
foo: # the key here would be used as the worker name
|
||||
count: 2
|
||||
max_old_space_size: 1gb
|
||||
bar:
|
||||
count: 1
|
||||
max_old_space_size: 512mb
|
||||
```
|
||||
|
||||
This per-worker design would give us the flexibility to eventually provide more fine-grained configuration,
|
||||
like dedicated workers for http requests or background jobs.
|
||||
|
||||
## 5.2 Cross-worker communication
|
||||
|
||||
For some of our changes (such as the `/status` API, see below), we will need some kind of cross-worker
|
||||
communication. This will need to pass through the coordinator, which will also serve as an 'event bus',
|
||||
or IPC forwarder.
|
||||
|
||||
This IPC API will be exposed from the node service:
|
||||
|
||||
```ts
|
||||
export interface NodeServiceSetup {
|
||||
// [...]
|
||||
broadcast: (type: string, payload?: WorkerMessagePayload, options?: BroadcastOptions) => void;
|
||||
addMessageHandler: (type: string, handler: MessageHandler) => MessageHandlerUnsubscribeFn;
|
||||
}
|
||||
```
|
||||
|
||||
To preserve isolation and to avoid creating an implicit cross-plugin API, handlers registered from a
|
||||
given plugin will only be invoked for messages sent by the same plugin (see the sketch after the notes below).
|
||||
|
||||
Notes:
|
||||
- To reduce clustered and non-clustered mode divergence, in non-clustered mode, these APIs would just be no-ops.
|
||||
It will avoid forcing (most) code to check which mode Kibana is running in before calling them.
|
||||
- In the case where `sendToSelf` is true, we would still attempt to broadcast the message.
|
||||
- We could eventually use an Observable pattern instead of a handler pattern to subscribe to messages.
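As an illustration (the message type and the cache are hypothetical), a plugin could use these APIs to keep per-worker in-memory caches in sync:

```ts
// Inside a plugin's `setup`; `coreSetup.node` is the contract described in section 5.4.
const localCache = new Map<string, unknown>(); // hypothetical per-worker cache

const unsubscribe = coreSetup.node.addMessageHandler('cache-invalidated', () => {
  // Invoked in the other workers, and only for messages broadcast by this same plugin.
  localCache.clear();
});

// Whenever this worker mutates the underlying data:
coreSetup.node.broadcast('cache-invalidated');

// Call `unsubscribe()` when the handler is no longer needed.
```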
|
||||
|
||||
## 5.3 Executing code on a single worker
|
||||
|
||||
In some scenarios, we would like to have parts of the code executed only from a single process.
|
||||
|
||||
Saved object migrations would be a good example:
|
||||
we don't need to have each worker try to perform the migration, and we'd prefer to have one performing/trying
|
||||
the migration, and the others waiting for it. Due to the architecture, we can't have the coordinator perform
|
||||
such single-process jobs, as it doesn't actually run a Kibana server.
|
||||
|
||||
There are various ways to address such use-cases. What seems to be the best compromise right now would be the
|
||||
concept of 'main worker'. The coordinator would arbitrarily elect a worker as the 'main' one at startup. The
|
||||
node service would then expose an API to let workers identify themselves as main or not.
|
||||
|
||||
```ts
|
||||
export interface NodeServiceSetup {
|
||||
// [...]
|
||||
isMainWorker: () => boolean;
|
||||
}
|
||||
```
|
||||
|
||||
Notes:
|
||||
- In non-clustered mode, `isMainWorker` would always return true, to reduce the divergence between clustered and
|
||||
non-clustered modes.
|
||||
|
||||
## 5.4 The node service API
|
||||
|
||||
We propose adding a new node service to Core, which will be responsible for adding the necessary cluster APIs,
|
||||
and handling interaction with Node's `cluster` API. This service would be accessible via Core's setup and start contracts
|
||||
(`coreSetup.node` and `coreStart.node`).
|
||||
|
||||
At the moment, no need to extend Core's request handler context with node-related APIs has been identified.
|
||||
|
||||
The initial contract interface would look like this:
|
||||
|
||||
```ts
|
||||
type WorkerMessagePayload = Serializable;
|
||||
|
||||
interface BroadcastOptions {
|
||||
/**
|
||||
* If true, will also send the message to the worker that sent it.
|
||||
* Defaults to false.
|
||||
*/
|
||||
sendToSelf?: boolean;
|
||||
/**
|
||||
* If true, the message will also be sent to subscribers subscribing after the message was effectively sent.
|
||||
* Defaults to false.
|
||||
*/
|
||||
persist?: boolean;
|
||||
}
|
||||
|
||||
export interface NodeServiceSetup {
|
||||
/**
|
||||
* Return true if clustering mode is enabled, false otherwise
|
||||
*/
|
||||
isEnabled: () => boolean;
|
||||
/**
|
||||
* Return the current worker's id. In non-clustered mode, will return `1`
|
||||
*/
|
||||
getWorkerId: () => number;
|
||||
/**
|
||||
* Broadcast a message to other workers.
|
||||
* In non-clustered mode, this is a no-op.
|
||||
*/
|
||||
broadcast: (type: string, payload?: WorkerMessagePayload, options?: BroadcastOptions) => void;
|
||||
/**
|
||||
* Registers a handler for given `type` of IPC messages
|
||||
* In non-clustered mode, this is a no-op that returns a no-op unsubscription callback.
|
||||
*/
|
||||
addMessageHandler: (type: string, handler: MessageHandler) => MessageHandlerUnsubscribeFn;
|
||||
/**
|
||||
* Returns true if the current worker has been elected as the main one.
|
||||
* In non-clustered mode, will always return true
|
||||
*/
|
||||
isMainWorker: () => boolean;
|
||||
}
|
||||
```
|
||||
|
||||
### 5.4.1 Example: Saved Object Migrations
|
||||
|
||||
To take the example of SO migration, the `KibanaMigrator.runMigrations` implementation could change to
|
||||
(a naive implementation; the function is supposed to return a promise here, which is omitted for simplicity):
|
||||
|
||||
```ts
|
||||
runMigrations() {
|
||||
if (node.isMainWorker()) {
|
||||
this.runMigrationsInternal().then((result) => {
|
||||
applyMigrationState(result);
|
||||
// persist: true will send message even if subscriber subscribes after the message was actually sent
|
||||
node.broadcast('migration-complete', { payload: result }, { persist: true });
|
||||
})
|
||||
} else {
|
||||
const unsubscribe = node.addMessageHandler('migration-complete', ({ payload: result }) => {
|
||||
applyMigrationState(result);
|
||||
unsubscribe();
|
||||
});
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Notes:
|
||||
- To be sure that we do not encounter a race condition with the event subscribing / sending (workers subscribing after
|
||||
the main worker actually sent the `migration-complete` event and then waiting indefinitely), we are using the `persist`
|
||||
option of the `broadcast` API. We felt this was a better approach than the alternative of having shared state among workers.
|
||||
|
||||
## 5.5 Sharing state between workers
|
||||
|
||||
This is not identified as necessary at the moment, and IPC broadcast should be sufficient, hopefully. We prefer to avoid
|
||||
the added complexity and risk of implicit dependencies if possible.
|
||||
|
||||
If we do eventually need shared state, we would probably have to use syscall libraries to share buffers such as
|
||||
[mmap-io](https://www.npmjs.com/package/mmap-io), and expose a higher level API for that from the `node` service. More
|
||||
research would be required if this proved to be a necessity.
|
||||
|
||||
# 6. Technical impact
|
||||
|
||||
This section attempts to be an exhaustive inventory of the changes that would be required to support clustering mode.
|
||||
|
||||
## 6.1 Technical impact on Core
|
||||
|
||||
### 6.1.1 Handling multi-process logs
|
||||
|
||||
This is an example of log output in a 2-worker cluster, coming from the POC:
|
||||
|
||||
```
|
||||
[2021-03-02T10:23:41.834+01:00][INFO ][plugins-service] Plugin initialization disabled.
|
||||
[2021-03-02T10:23:41.840+01:00][INFO ][plugins-service] Plugin initialization disabled.
|
||||
[2021-03-02T10:23:41.900+01:00][WARN ][savedobjects-service] Skipping Saved Object migrations on startup. Note: Individual documents will still be migrated when read or written.
|
||||
[2021-03-02T10:23:41.903+01:00][WARN ][savedobjects-service] Skipping Saved Object migrations on startup. Note: Individual documents will still be migrated when read or written.
|
||||
```
|
||||
|
||||
The workers' logs are interleaved, and, most importantly, there is no way to see which process each log entry is coming from.
|
||||
We will need to address that.
|
||||
|
||||
#### Options we considered:
|
||||
|
||||
1. Having a distinct logging configuration (with separate log files) for each worker
|
||||
2. Centralizing log collection in the coordinator and writing all logs to a single file (or stdout)
|
||||
|
||||
#### Our recommended approach:
|
||||
|
||||
Overall we recommend keeping a single log file (option 2), and centralizing the logging system in the coordinator,
|
||||
with each worker sending the coordinator log messages via IPC. While this is a more complex implementation in terms
|
||||
of our logging system, it solves several problems:
|
||||
- Preserves backwards compatibility.
|
||||
- Avoids the issue of interleaved log messages that could occur with multiple processes writing to the same file or stdout.
|
||||
- Provides a solution for the rolling-file appender (see below), as the coordinator would handle rolling all log files
|
||||
- The changes to BaseLogger could potentially have the added benefit of paving the way for our future logging MDC.
|
||||
|
||||
We could add the process name information to the log messages, and add a new conversion to be able to display it with
|
||||
the pattern layout, such as `%worker`.
|
||||
|
||||
The default pattern could evolve to (ideally, only when clustering is enabled):
|
||||
```
|
||||
[%date][%level][%worker][%logger] %message
|
||||
```
|
||||
|
||||
The logging output would then look like:
|
||||
```
|
||||
[2021-03-02T10:23:41.834+01:00][INFO ][worker-1][plugins-service] Plugin initialization disabled.
|
||||
[2021-03-02T10:23:41.840+01:00][INFO ][worker-2][plugins-service] Plugin initialization disabled.
|
||||
```
|
||||
|
||||
Notes:
|
||||
- The coordinator will probably need to output logs too. `%worker` would be interpolated to `coordinator`
|
||||
for the coordinator process.
|
||||
- Even if we add the `%worker` pattern, we could still consider letting users configure per-worker log
|
||||
files as a future enhancement.
|
||||
|
||||
### 6.1.2 The rolling-file appender
|
||||
|
||||
The rolling process of the `rolling-file` appender is going to be problematic in clustered mode, as it will cause
|
||||
concurrency issues during the rolling. We need to find a way to make this rolling stage cluster-proof.
|
||||
|
||||
#### Options we considered:
|
||||
|
||||
1. have the rolling file appenders coordinate themselves when rolling
|
||||
|
||||
By using a broadcast-message-based mutex mechanism, the appenders could acquire a ‘lock’ to roll a specific file, and
|
||||
notify other workers when the rolling is complete (quite similar to what we want to do with SO migration for example).
|
||||
|
||||
An alternative to this option would be to only have the main worker handle the rolling logic. We will lose control
|
||||
over the exact size the file will be when rolling, as we would need to wait until the main worker receives a log message
|
||||
for the rolling appender before the rolling is effectively performed. The upside would be that it reduces the inter-worker
|
||||
communication to a notification from the main worker to the others, once the rolling is done, so they can reopen their
|
||||
file handler.
|
||||
|
||||
2. have the coordinator process perform the rolling
|
||||
|
||||
Another option would be to have the coordinator perform the rotation instead. When a rolling is required, the appender
|
||||
would send a message to the coordinator, which would perform the rolling and notify the workers once the operation is complete.
|
||||
|
||||
Note that this option is even more complicated than the previous one, as it forces us to move the rolling implementation
|
||||
outside of the appender, without any significant upsides identified.
|
||||
|
||||
3. centralize the logging system in the coordinator
|
||||
|
||||
We could go further, and change the way the logging system works in clustering mode by having the coordinator centralize
|
||||
the logging system. The worker’s logger implementation would just send messages to the coordinator. While this may be a
|
||||
correct design, the main downside is that the logging implementation would be totally different in cluster and
|
||||
non-cluster mode, and seems to be way more work than the other options.
|
||||
|
||||
#### Our recommended approach:
|
||||
Even though it's more complex, we feel that centralizing the logging system in the coordinator is the right move here,
|
||||
as it also solves the question of how the coordinator can log its own messages.
|
||||
|
||||
### 6.1.3 The status API
|
||||
|
||||
In clustering mode, the workers will all have an individual status. One could have a connectivity issue with ES
|
||||
while the other ones are green. Hitting the `/status` endpoint will reach a random (and different each time) worker,
|
||||
meaning that it would not be possible to know the status of the cluster as a whole.
|
||||
|
||||
We will need to add some centralized status state in the coordinator. Also, as the `/status` endpoint cannot be served
|
||||
from the coordinator, we will need to have the workers retrieve the global status from the coordinator to serve
|
||||
the status endpoint.
|
||||
|
||||
Ultimately, we'd need to make the following updates to the `/status` API, neither of which
|
||||
is a breaking change:
|
||||
1. The response will return the highest-severity status level for each plugin, which will be
|
||||
determined by looking at the shared global status stored in the coordinator.
|
||||
2. We will introduce an extension to the existing `/status` response to allow inspecting
|
||||
per-worker statuses.
|
||||
|
||||
### 6.1.4 The stats API & metrics service
|
||||
|
||||
The `/stats` endpoint is somewhat problematic in that it contains a handful of `process` metrics
|
||||
which will differ from worker-to-worker:
|
||||
```json
|
||||
{
|
||||
// ...
|
||||
"process": {
|
||||
"memory": {
|
||||
"heap": {
|
||||
"total_bytes": 533581824,
|
||||
"used_bytes": 296297424,
|
||||
"size_limit": 4345298944
|
||||
},
|
||||
"resident_set_size_bytes": 563625984
|
||||
},
|
||||
"pid": 52646,
|
||||
"event_loop_delay": 0.22967800498008728,
|
||||
"uptime_ms": 1706021.930404
|
||||
},
|
||||
// ...
|
||||
}
|
||||
```
|
||||
|
||||
As each request could be routed to a different worker, different results may come back each time.
|
||||
|
||||
This endpoint, registered from the `usage_collection` plugin, is getting these stats from Core's
|
||||
`metrics` service (`getOpsMetrics$`), which is also used in the `monitoring` plugin for stats
|
||||
collection.
|
||||
|
||||
Ultimately we will extend the API to provide per-worker stats, but the question remains what we
|
||||
should do with the existing `process` stats.
|
||||
|
||||
#### Options we considered:
|
||||
1. Deprecate them? (breaking change)
|
||||
2. Accept a situation where they may be round-robined to different workers? (probably no)
|
||||
3. Try to consolidate them somehow? (can't think of a good way to do this)
|
||||
4. Always return stats for one process, e.g. main or coordinator? (doesn't give us the full picture)
|
||||
|
||||
#### Our recommended approach:
|
||||
We agreed that we would go with (3) and have each worker report metrics to the coordinator for
|
||||
sharing, with the metrics aggregated as follows:
|
||||
```json
|
||||
{
|
||||
// ...
|
||||
"process": {
|
||||
"memory": {
|
||||
"heap": {
|
||||
"total_bytes": 533581824, // sum of coordinator + workers
|
||||
"used_bytes": 296297424, // sum of coordinator + workers
|
||||
"size_limit": 4345298944 // sum of coordinator + workers
|
||||
},
|
||||
"resident_set_size_bytes": 563625984 // sum of coordinator + workers
|
||||
},
|
||||
"pid": 52646, // pid of the coordinator
|
||||
"event_loop_delay": 0.22967800498008728, // max of coordinator + workers
|
||||
"uptime_ms": 1706021.930404 // uptime of the coordinator
|
||||
},
|
||||
// ...
|
||||
}
|
||||
```
|
||||
|
||||
This has its downsides (`size_limit` in particular could be confusing), but otherwise generally makes sense:
|
||||
- sum of available/used heap & node rss is straightforward
|
||||
- `event_loop_delay` max makes sense, as we are mostly only interested in that number if it is high anyway
|
||||
- `pid` and `uptime_in_millis` from the coordinator make sense, especially as long as we are killing
|
||||
all workers any time one of them dies. In the future if we respawn workers that die, this could be
|
||||
misleading, but hopefully by then we can deprecate this and move Metricbeat to using the per-worker
|
||||
stats.
|
||||
|
||||
### 6.1.5 PID file
|
||||
|
||||
Without changes, each worker is going to try to write and read the same PID file. Also, this breaks the whole pid file
|
||||
usage, as the PID stored in the file will be an arbitrary worker’s PID, instead of the coordinator (main process) PID.
|
||||
|
||||
In clustering mode, we will need to have the coordinator handle the PID file logic, and to disable PID file handling
|
||||
in the worker's environment service.
|
||||
|
||||
### 6.1.6 Saved Objects migration
|
||||
|
||||
In the current state, all workers are going to try to perform the migration. Ideally, we would have only one process
|
||||
perform the migration, and the other ones just wait for a ready signal. We can’t easily have the coordinator do it,
|
||||
so we would probably have to leverage the ‘main worker’ concept here.
|
||||
|
||||
The SO migration v2 is supposed to be resilient to concurrent attempts though, as we already support multi-instance
|
||||
Kibana, so this can probably be considered an improvement.
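
A sketch of the 'main worker' gating idea using the cluster IPC channel. The `KBN_MAIN_WORKER` flag and the
`migration-complete` message name are placeholders, not an actual Core API:

```ts
// Placeholder: in a real implementation the coordinator would designate one
// worker as the "main" worker when forking it.
const isMainWorker = process.env.KBN_MAIN_WORKER === 'true';

export async function runOrAwaitMigration(runMigration: () => Promise<void>) {
  if (isMainWorker) {
    await runMigration();
    // Tell the coordinator we are done so it can notify the other workers.
    process.send?.({ type: 'migration-complete' });
    return;
  }
  // Other workers wait for the coordinator to relay the ready signal.
  await new Promise<void>((resolve) => {
    const onMessage = (message: any) => {
      if (message?.type === 'migration-complete') {
        process.off('message', onMessage);
        resolve();
      }
    };
    process.on('message', onMessage);
  });
}
```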

### 6.1.7 Memory consumption

In clustered mode, Node.js options such as `max-old-space-size` apply to every process.

The `kibana` startup script reads this setting from the CLI or `config/node.options` and sets a
`NODE_OPTIONS` environment variable, which is passed on to the workers, possibly leading to unexpected
behavior.

For example, using `--max-old-space-size=1024` in a 2-worker cluster results in a maximum memory usage
of 3gb (1 coordinator + 2 workers).

Our plan for addressing this is to _disable clustering whenever a user has `max-old-space-size` set_,
which ensures it isn't possible to hit this unpredictable behavior. To enable clustering, the user would
simply remove their `max-old-space-size` settings, and clustering would be on by default. They could
alternatively configure memory settings for each worker individually, as shown above.
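
A sketch of the proposed guard, checking both the current process flags and `NODE_OPTIONS`. The
`resolveClusteringEnabled` plumbing is hypothetical:

```ts
// Detect an explicit heap limit coming from the CLI, config/node.options,
// or the NODE_OPTIONS environment variable.
function hasExplicitHeapLimit(): boolean {
  const sources = [...process.execArgv, process.env.NODE_OPTIONS ?? ''];
  return sources.some((value) => value.includes('--max-old-space-size'));
}

// Hypothetical decision point: clustering stays off when a heap limit is set,
// so the limit keeps applying to a single (non-clustered) process.
export function resolveClusteringEnabled(requestedWorkers: number): boolean {
  if (hasExplicitHeapLimit()) {
    return false;
  }
  return requestedWorkers > 1;
}
```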

### 6.1.8 Workers error handling

When using `cluster`, the common best practice is to have the coordinator recreate ('restart') workers
when they terminate unexpectedly. However, given Kibana's architecture, some failures are not
recoverable (workers failing because of config validation, a failed migration, and so on).

For instance, if a worker (well, all workers) terminates because of an invalid configuration property,
it doesn't make any sense to have the coordinator recreate them indefinitely, as the error requires
manual intervention.

As a first step, we plan to terminate the main Kibana process when any worker terminates unexpectedly
for any reason (after all, this is already the behavior in non-cluster mode). In the future, we will
look into distinguishing recoverable from non-recoverable errors as an enhancement, so that we can
automatically restart workers on any recoverable error.
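
A sketch of the first-step behavior in the coordinator: any unexpected worker exit takes the whole instance
down, mirroring today's single-process behavior. The `log` callback is a stand-in for the real logger:

```ts
import cluster from 'node:cluster';

export function watchWorkers(log: (msg: string) => void) {
  cluster.on('exit', (worker, code, signal) => {
    // Until we can tell recoverable from non-recoverable failures apart,
    // treat any unexpected worker exit as fatal for the whole instance.
    log(
      `worker ${worker.id} (pid ${worker.process.pid}) exited ` +
        `(code: ${code}, signal: ${signal}); shutting down Kibana`
    );
    process.exit(code || 1);
  });
}
```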

### 6.1.9 Data folder

The data folder (`path.data`) is currently the same for all workers.

We still have to confirm with the teams whether this is going to be a problem. It could be, for example,
if some plugins access files in write mode, which could result in concurrency issues between the workers.

If that were confirmed, we would plan to create and use a distinct data folder for each worker, which
would be non-breaking as we don't consider the layout of this directory to be part of our public API.
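
If a conflict is confirmed, the fix could be as simple as deriving a per-worker subdirectory under
`path.data`; in this sketch, `workerId` is a placeholder for a value the coordinator would assign when
forking the worker:

```ts
import { mkdirSync } from 'node:fs';
import { join } from 'node:path';

// Derive an isolated data directory per worker, e.g. `<path.data>/worker-2`.
export function resolveWorkerDataPath(pathData: string, workerId: number): string {
  const workerPath = join(pathData, `worker-${workerId}`);
  mkdirSync(workerPath, { recursive: true });
  return workerPath;
}
```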

### 6.1.10 instanceUUID

The same instance UUID (`server.uuid` / `{dataFolder}/uuid`) is currently used by all the workers.

So far, we have not identified any places where this will be problematic; however, we will look to other
teams to help validate this.

Note that if we did need per-worker UUIDs, this could be a breaking change, as the single `server.uuid`
configuration property would not be enough. If this change becomes necessary, one approach could be to
derive unique worker IDs as `${serverUuid}-${workerId}`.
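
If per-worker identifiers do become necessary, composing them from the existing server UUID would keep the
meaning of `server.uuid` unchanged; a small hypothetical helper:

```ts
// Hypothetical helper: derive a stable per-worker identifier without
// changing the meaning of the existing `server.uuid` setting.
export function getWorkerInstanceId(serverUuid: string, workerId: number): string {
  // e.g. '<server.uuid>-2' for worker 2
  return `${serverUuid}-${workerId}`;
}
```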

## 6.2 Technical impact on Plugins

### 6.2.1 What types of things could break?

#### Concurrent access to the same resources

Is there, for example, some part of the code that accesses and writes files from the data folder (or
anywhere else) and assumes it is the sole process writing to that file?

#### Using instanceUUID as a unique Kibana process identifier

Are there, for example, schedulers that use the instanceUUID as an identifier of a single process, as
opposed to a single Kibana instance? Are there situations where having the same instance UUID for all
the workers is going to be a problem?

#### Things needing to run only once per Kibana instance

Is there any part of the code that needs to be executed only once in a multi-worker mode, such as
initialization code or starting schedulers?

An example would be Reporting's queueFactory polling. As we want to run only a single headless browser
at a time per Kibana instance, only one worker should have polling enabled.

### 6.2.2 Identified required changes

#### Reporting

We will probably want to restrict Reporting to a single headless browser per Kibana instance. For that,
we will have to change the logic in
[createQueueFactory](https://github.com/elastic/kibana/blob/4584a8b570402aa07832cf3e5b520e5d2cfa7166/x-pack/plugins/reporting/server/lib/create_queue.ts#L60-L64)
so that only the 'main' worker polls for reporting tasks.
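
A sketch of what that gate could look like, assuming Core exposes some way to know whether the current
process is the 'main' worker; the `isMainWorker` flag below is hypothetical and does not exist today:

```ts
// `isMainWorker` would come from Core's future `node` service; it is a
// hypothetical flag here, passed in by the caller.
export function maybeStartReportingPolling(
  isMainWorker: boolean,
  startPolling: () => void
) {
  if (!isMainWorker) {
    // Other workers can still enqueue reports; only the main worker polls
    // the queue, so a single headless browser runs per Kibana instance.
    return;
  }
  startPolling();
}
```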

#### Telemetry

- Server-side fetcher

  The `telemetry/server/fetcher.ts` will attempt to send the telemetry usage once per day from each
  process. We do store state in the Saved Objects store recording the last time usage was sent, to
  prevent sending it multiple times (although race conditions might still occur).

- Tasks storing telemetry data

  We have tasks across several plugins storing data in saved objects specifically for telemetry. Under
  clustering, these tasks will be registered multiple times.

  Note that sending the data multiple times doesn't have any real consequences apart from the additional
  number of ES requests, so this should be considered non-blocking and only an improvement.

- Event-based telemetry

  Event-based telemetry may be affected as well, both the existing one in the Security Solutions team
  and the general one that is in the works. More specifically, the size of the queues will be multiplied
  per worker, which also increases the amount of network bandwidth used and could affect our customers.

  We could address this by making sure that the queues are held only in the main worker.

#### Task Manager

Currently, Task Manager "claims" jobs to run based on the server uuid. We think this could still work
with a multi-process setup: each Task Manager instance in a worker would be claiming tasks for the same
server uuid, which seems functionally equivalent to setting max_workers to
`current max_workers * number of workers`. Another alternative would be to compose an identifier like
`${server.uuid}-${worker.id}`, as Task Manager only really needs a unique identifier.

However, as a first step we can simply run Task Manager on the main worker. This doesn't completely
solve potential noisy-neighbor problems, as the main worker will still be receiving and serving http
requests, but it will at least ensure that the other worker processes are free to serve http requests
without risk of Task Manager interference. Long term, we could explore manually spawning a dedicated
child process for background tasks that can be called from workers, and think about a way for plugins
to tell Core when they need to run things in the background.

It would be ideal if we could eventually solve this with our multi-process setup, however this needs
more design work and could necessitate an RFC in its own right. The key thing to take away here is that
the work we are doing in this RFC would not prevent us from exploring this path further in a subsequent
phase. In fact, it could prove to be a helpful first step in that direction.

#### Alerting

We currently haven't identified any Alerting-specific requirements that aren't already covered by the
Task Manager requirements.

## 6.3 Summary of breaking changes

### 6.3.1 `/stats` API & metrics service

Currently, the only breaking change we have identified is for the `/stats` API.

The `process` memory usage it reports doesn't really make sense in a multi-process Kibana, and even
though we have a plan to aggregate this data as a temporary solution (see 6.1.4), this could still lead
to confusion for users as it doesn't paint a clear picture of the state of the system.

Our plan is to deprecate the `process` field, and later remove it or change its structure to better
support a multi-process Kibana.

# 7. Drawbacks

- Implementation cost is going to be significant, as this will require multiple phases, both in Core and
  in plugins. It will also have to be a collaborative effort, as we can't enable cluster mode in
  production until all of the identified breaking changes have been addressed.
- Even if it is easier to deploy, at a technical level it doesn't really provide anything more than a
  multi-instance Kibana setup.
- This will add complexity to the code, especially in Core, where some parts of the logic will
  drastically diverge between clustered and non-clustered modes (most notably our logging system).
- There is a risk of introducing subtle bugs in clustered mode, as we may overlook some breaking changes,
  or developers may neglect to ensure clustered-mode compatibility when adding new features.
- Proper testing of all the edge cases is going to be tedious, and in some cases realistically
  impossible. Proper education of developers is going to be critical to ensure we build new features
  with clustering in mind.

# 8. Alternatives

One alternative to the `cluster` module is using a worker pool via `worker_threads`. The two have
distinct use cases, though. Clustering is meant to run multiple workers with the same codebase, often
sharing a network socket to balance network traffic. Worker threads are a way to create specialized
workers in charge of executing isolated, CPU-intensive tasks on demand (e.g. encrypting or decrypting a
file). If we were to identify that under heavy load the actual bottleneck is ES, then exposing a worker
thread service and API from Core (task_manager would be a perfect example of a potential consumer)
might make more sense.

However, we believe the simplicity and broad acceptance of the `cluster` API in the Node community make
it the better approach over `worker_threads`, and we would prefer to only go down the road of a worker
pool as a last resort.

Another alternative would be to provide tooling to ease the deployment of multi-instance Kibana setups,
and to only support multi-instance mode moving forward.

# 9. Adoption strategy

Because the changes proposed in this RFC touch the lowest levels of Kibana's core, and therefore have
the potential to impact large swaths of Kibana's codebase, we propose a multi-phase strategy:

## Phase 0

In the preparatory phase, we will evolve the existing POC to validate the finer details of this RFC,
while also putting together a more detailed testing strategy that can be used to benchmark our future
work.

## Phase 1

To start implementation, we will make the required changes in Core, adding the `node.enabled`
configuration property. At first, we'll include a prominent warning in the logs to make it clear that
this shouldn't be used in production yet. This way, developers can test their features against
clustering mode and adapt their code to use the new `node` API and service. At this point we will also
aim to document any identified breaking changes and add deprecation notices where applicable, to give
developers time to prepare for 8.0.

## Phase 2

When all the required changes have been made in plugin code, we will enable the `node` configuration in
production mode as a `beta` feature. We would ideally also add telemetry collection for clustering usage
(relevant metrics TBD) to get a precise view of the feature's adoption.

## Phase 3

Once the new feature has been validated and we are comfortable considering it GA, we will enable `node`
by default. (We could alternatively enable it by default from the outset, still with a `beta` label.)

# 10. How we teach this

During Phase 1, we should create documentation on clustering mode: best practices, how to identify code
that may break in clustered mode, and so on.

We will specifically look at updating our docs around contributing to Kibana; in particular, we can add
a section to the
[best practices](https://www.elastic.co/guide/en/kibana/master/development-best-practices.html#_testing_stability)
to remind contributors that they cannot rely on a 1:1 relationship between the Kibana process and an
individual machine.

Lastly, we'll take advantage of internal communications to kibana-contributors, and make an effort to
individually check in with the teams we think are most likely to be affected by these changes.

# 11. Unresolved questions

**Are breaking changes required for the `/stats` API & metrics service?**

See 6.1.4 above.

# 12. Resolved questions

**How do we handle http requests that need to be served by a specific process?**

The Node.js cluster API is not really the right solution for this, as it doesn't allow custom scheduling
policies; a custom scheduling policy would basically mean re-implementing the cluster API on our own. At
this point we will not be solving this particular issue as part of the clustering project, however the
abstraction proposed in this RFC will not preclude us from swapping out the underlying implementation in
the future, should we choose to do so.

**How do we handle http requests that need to have knowledge of all processes?**

`/status` and `/stats` are the big issues here, as they could be reported differently from each process.
The current plan is to manage their state centrally in the coordinator and have each process report this
data at a regular interval, so that all processes can retrieve it and serve it in response to any request
against those endpoints. The exact details of the changes to those APIs still need to be determined;
`/status` will likely require breaking changes as pointed out above, however `/stats` may not.
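
A sketch of that reporting loop under these assumptions: each worker pushes its latest ops snapshot to the
coordinator over the cluster IPC channel, and the coordinator keeps the last snapshot per worker. The message
type, interval, and `collect` callback are illustrative, not an existing API:

```ts
import cluster from 'node:cluster';

const REPORT_INTERVAL_MS = 5_000; // illustrative interval

// In each worker: periodically push the latest ops snapshot to the coordinator.
export function startMetricsReporting(collect: () => unknown) {
  setInterval(() => {
    process.send?.({
      type: 'ops-metrics',
      workerId: cluster.worker?.id,
      payload: collect(),
    });
  }, REPORT_INTERVAL_MS).unref();
}

// In the coordinator: keep the most recent snapshot for each worker so any
// process can serve a consolidated `/stats` or `/status` response.
export function collectWorkerMetrics(): Map<number, unknown> {
  const latest = new Map<number, unknown>();
  cluster.on('message', (worker, message: any) => {
    if (message?.type === 'ops-metrics') {
      latest.set(worker.id, message.payload);
    }
  });
  return latest;
}
```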

**Is it okay for the workers to share the same `path.data` directory?**

We have been unable to identify any plugins which are writing to this directory. The App Services team
has confirmed that `path.data` is no longer used by the reporting plugin.

**Is using the same `server.uuid` in each worker going to cause problems?**

We have been unable to identify any plugins for which this would cause issues. The Alerting team has
confirmed that Task Manager doesn't need the server uuid, just a unique identifier, which means
something like `${server.uuid}-${worker.id}` would work.