Commit graph

120 commits

Author SHA1 Message Date
Jake Landis
bb9566a57e
Update discovery.asciidoc (#106541) (#106695)
Fix typo

(cherry picked from commit 96a46b9c5b)

Co-authored-by: Boen <13752080613@163.com>
2024-03-22 15:43:48 -04:00
David Turner
61191b880c
Link to troubleshooting docs from other disco pages (#102509)
I have several times struggled to find the docs about restoring from a
snapshot if a quorum cannot be found. That info is on the discovery
troubleshooting page, but it seems I expect it to be on somewhere like
the quorums or voting docs pages instead. This commit adds links from
those pages to the troubleshooting page.
2023-11-23 09:45:21 +00:00
David Turner
9b51d9972d
More specific cluster.initial_master_nodes instructions (#101493)
In the note on forming a single cluster we describe what to do if
inadvertently forming extra clusters, but we can be more explicit about
what to do with `cluster.initial_master_nodes` in these instructions.
This commit adds the missing details.
2023-10-30 08:25:40 +00:00
Abdon Pijpelink
af76a3a436
[DOCS] Add 'Troubleshooting an unstable cluster' to nav (#99287)
* [DOCS] Add 'Troubleshooting an unstable cluster' to nav

* Adjust docs links in code

* Revert "Adjust docs links in code"

This reverts commit f3846b1d78.

---------

Co-authored-by: David Turner <david.turner@elastic.co>
2023-09-08 13:42:50 +02:00
David Turner
0f6a217ed8
Fix admonition about initial_master_nodes (#98242)
Admonition paragraphs cannot be combined with a `+` continuation mark.
This commit fixes the formatting by using an admonition block instead.
2023-08-08 11:50:36 +01:00
David Turner
09e53f9ad9
Enhance docs around network troubleshooting (#97305)
Discovery, like cluster membership, can also be affected by network-like
issues (e.g. GC/VM pauses, dropped packets and blocked threads) so this
commit duplicates the troubleshooting info across both places.
2023-07-10 10:57:44 +01:00
debadair
777598d602
[DOCS] Remove redirect pages (#88738)
* [DOCS] Remove manual redirects

* [DOCS] Removed refs to modules-discovery-hosts-providers

* [DOCS] Fixed broken internal refs

* Fixing bad cross links in ES book, and adding redirects.asciidoc[] back into docs/reference/index.asciidoc.

* Update docs/reference/search/point-in-time-api.asciidoc

Co-authored-by: James Rodewig <james.rodewig@elastic.co>

* Update docs/reference/setup/restart-cluster.asciidoc

Co-authored-by: James Rodewig <james.rodewig@elastic.co>

* Update docs/reference/sql/endpoints/translate.asciidoc

Co-authored-by: James Rodewig <james.rodewig@elastic.co>

* Update docs/reference/snapshot-restore/restore-snapshot.asciidoc

Co-authored-by: James Rodewig <james.rodewig@elastic.co>

* Update repository-azure.asciidoc

* Update node-tool.asciidoc

* Update repository-azure.asciidoc

---------

Co-authored-by: amyjtechwriter <61687663+amyjtechwriter@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: Amy Jonsson <amy.jonsson@elastic.co>
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2023-05-24 12:32:46 +01:00
David Turner
7a517cb4a0
Add note on jstack frequency for troubleshooting (#95764)
Suggest calling `jstack` every 15s to ensure that at least one capture
shows a stuck thread. Also adds a link to this guide to the list on the
troubleshooting overview page.
2023-05-03 10:04:13 +01:00
David Turner
f0989404ab
Bootstrapping docs clarifications (#94977)
Explains why you should remove `cluster.initial_master_nodes`, and
rewords some of the other sections a little for (subjectively) improved
readability.
2023-04-03 14:43:12 +01:00
David Kilfoyle
7cd484ac95
Revert "Cross-reference disclaimer" (#94829)
* Revert "Cross-reference disclaimer (#94801)"

This reverts commit 902649be31.

* Highlight sentences about removing `cluster.initial_master_nodes` setting
2023-03-28 11:17:53 -04:00
Stef Nestor
902649be31
Cross-reference disclaimer (#94801)
👋🏼 howdy, team! Can we cross pollinate the [important banner](https://www.elastic.co/guide/en/elasticsearch/reference/master/important-settings.html#initial_master_nodes) from the `cluster.initial_master_nodes` setting page to the related [bootstrap doc](https://www.elastic.co/guide/en/elasticsearch/reference/master/modules-discovery-bootstrap-cluster.html#bootstrap-cluster-name) to avoid user's misunderstanding the latter's "This is only required the first time a cluster starts up" as saying they don't need to comment-out these settings?
2023-03-28 00:16:40 -04:00
David Turner
4c68382065
Capture thread dump on ShardLockObtainFailedException (#93458)
We sometimes see a `ShardLockObtainFailedException` when a shard failed
to shut down as fast as we expected, often because a node left and
rejoined the cluster. Sometimes this is because it was held open by
ongoing scrolls or PITs, but other times it may be because the shutdown
process itself is too slow. With this commit we add the ability to
capture and log a thread dump at the time of the failure to give us more
information about where the shutdown process might be running slowly.

Relates #93226
2023-02-02 11:17:40 -05:00
David Turner
dfab580976
Limit length of lag detector hot threads log lines (#92851)
If debug logging is enabled then the lag detector will capture and
report the hot threads of a lagging node. In some cases the resulting
log message can be very large, exceeding 10kiB, which means it is
truncated in most logging setups. The relevant thread(s) may be waiting
on I/O, which is not considered "hot" and therefore may not appear in
the first 10kiB.

This commit adjusts this logging mechanism to split the message into
chunks of size at most 2kiB (after compression and base64-encoding) to
ensure that the entire hot threads output can be faithfully
reconstructed from these logs.

Closes #88126
2023-01-13 13:11:26 +00:00
David Turner
6203560983
Fix docs for fault detection troubleshooting (#92749)
In #92742 we changed the logging around cluster membership changes but the docs
don't quite match the final version. This commit addresses that.
2023-01-09 10:17:06 +00:00
David Turner
5182748318
Improve node-{join,left} logging for troubleshooting (#92742)
Today to troubleshoot an unstable cluster we ask the users to parse the
rather complex `node-join` and `node-left` messages emitted by the
`MasterService`. These messages may refer to many nodes, may be
truncated, and are generally pretty hard to work with.

With this commit we start to emit a simplified log message about each
node added and removed. It also renames the respective executor classes:

- `JoinTaskExecutor` -> `NodeJoinExecutor`
- `NodeRemovalClusterStateTaskExecutor` -> `NodeLeftExecutor`

This brings their names in line with each other, and the messages that
they emit, whilst preserving the older `node-join` and `node-left`
terminology as reported by the `MasterService`.

Finally, it updates the troubleshooting logs to reflect these new and
simplified logs.

Relates #92741
2023-01-09 04:34:41 -05:00
Luiz Guilherme Pais dos Santos
9eec322424
Fix format for cluster.discovery_configuration_check.interval (#90452) 2022-12-22 16:11:33 +01:00
David Turner
c9d4892929
Weaken language about "low-latency" networks (#89198)
Today we say that voting-only nodes require a "low-latency" network.
This term has a specific meaning in some operating environments which is
different from our intended meaning. To avoid this confusion this commit
removes the absolute term "low-latency" in favour of describing the
requirements relative to the user's own performance goals.
2022-08-09 13:15:37 +01:00
Leaf-Lin
945cb27782
[DOCS] Adding discovery troubleshooting link in the master get help page (#87344)
* Adding discovery troubleshooting link

* Add tags to pull in discovery troubleshooting content

* Move discovery troubleshooting to separate page and add redirects

Co-authored-by: Adam Locke <adam.locke@elastic.co>
2022-07-06 15:51:43 -04:00
Iraklis Psaroudakis
50d2cf31b8
Periodic warning for 1-node cluster w/ seed hosts (#88013)
For fully-formed single-node clusters, emit a periodic warning if seed_hosts has been set to a non-empty list.

Closes #85222
2022-06-30 16:35:15 +03:00
David Turner
80f7af58f8
More detail in discovery troubleshooting docs (#86930)
In #85074 we added docs on discovery troubleshooting that really only
talked about troubleshooting master elections. There's also the case
where the master is elected fine but some other node can't join it. This
commit adds troubleshooting docs about that too.

Co-authored-by: Adam Locke <adam.locke@elastic.co>
2022-06-06 08:33:45 +01:00
David Turner
79f181d208
Reduce resource needs of join validation (#85380)
Fixes a few scalability issues around join validation:

- compresses the cluster state sent over the wire
- shares the serialized cluster state across multiple nodes
- forks the decompression/deserialization work off the transport thread

Relates #77466
Closes #83204
2022-04-26 12:15:54 +01:00
David Turner
33a553f61f Fix up whitespace error introduced in #85948 2022-04-19 07:58:10 +01:00
David Turner
ce004d49e7
More docs re. removing cluster.initial_master_nodes (#85948)
Ensures that on every page of the docs that mentions
`cluster.initial_master_nodes` also mentions that this setting must be
removed after bootstrapping completes.
2022-04-19 07:54:43 +01:00
David Turner
6a273886e9
Add technical docs on diagnosing instability etc (#85074)
Copies some internal troubleshooting docs to the reference manual for
wider use.

Co-authored-by: James Rodewig <james.rodewig@gmail.com>
2022-03-31 09:01:10 +01:00
David Turner
fd76f9c5d1
Fix auto-bootstrap docs (#85215)
Today it's no longer true that by default nodes will auto-discover other
nodes on the same host and bootstrap them all into a cluster. This
commit fixes the docs on auto-bootstrapping to recognise this.
2022-03-22 16:35:48 +00:00
Tobias Stadler
e3deacf547
[DOCS] Fix typos (#83895) 2022-02-15 12:42:17 -05:00
James Rodewig
2f03112b5b
[DOCS] Synced with 8.0 stack upgrade changes (#83489) (#83596)
This moves the bulk of the upgrade information into the consolidated upgrade guide, but leaves the primary upgrade topic in place as a cross reference.

Relates to: https://github.com/elastic/stack-docs/pull/1970

Co-authored-by: gchaps <33642766+gchaps@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
(cherry picked from commit f6473d71f9)

Co-authored-by: debadair <debadair@elastic.co>
2022-02-07 11:01:42 -05:00
David Turner
7dd32fb027
Reduce verbosity-increase timeout to 3m (#81118)
Today we increase the verbosity of discovery failures after 5 minutes
without a master. Unfortunately 5 minutes is a common orchestration
timeout, so if discovery is broken then we see nodes being shut down
just before they start to emit useful logs. This commit reduces the
default timeout to 3 minutes to address that.
2021-11-30 09:52:39 +00:00
David Turner
8cf4c7b6fb
Remove last few mentions of Zen discovery (#80410)
We have a few leftover mentions of `zen` discovery, mostly for
historical/BwC reasons, which this commit removes.

Prior to this commit the default value for `discovery.type` was `zen`
but this was not written down anywhere or officially supported: the two
options were to set it to `single-node` or to omit it entirely. This
commit changes the default to `multi-node` and documents this.

Co-authored-by: Adam Locke <adam.locke@elastic.co>
2021-11-09 09:52:06 +01:00
Adam Locke
529986e9b1
A typo error (#78987) (#79203)
* A typo error

a space between 'E' and 'cluster...'

* Update example, fix headings, change notes

Co-authored-by: Adam Locke <adam.locke@elastic.co>

Co-authored-by: Marwane Chahoud <marwane.chahoud@gmail.com>
2021-10-15 08:52:03 -04:00
Martijn van Groningen
8a1deff75a
Improve fault-detection.asciidoc (#76821)
Add section to fault-detection.asciidoc about nodes being removed from cluster
due to slow cluster state applying.
2021-08-23 14:31:06 +02:00
David Turner
eabe2d1b34
Increase PeerFinder verbosity on persistent failure (#73128)
If a node is partitioned away from the rest of the cluster then the
`ClusterFormationFailureHelper` periodically reports that it cannot
discover the expected collection of nodes, but does not indicate why. To
prove it's a connectivity problem, users must today restart the node
with `DEBUG` logging on `org.elasticsearch.discovery.PeerFinder` to see
further details.

With this commit we log messages at `WARN` level if the node remains
disconnected for longer than a configurable timeout, which defaults to 5
minutes.

Relates #72968
2021-05-17 10:52:18 +01:00
James Rodewig
693807a6d3
[DOCS] Fix double spaces (#71082) 2021-03-31 09:57:47 -04:00
James Rodewig
5c75d004fa
[DOCS] Replace put with create or update in API names (#70330)
Co-authored-by: debadair <debadair@elastic.co>
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2021-03-15 14:49:44 -04:00
David Turner
2adeb4a666
Expand and consolidate networking docs (#68051)
Today's network config docs are split into "Network", "HTTP" and
"Transport" pages, with unclear relationships between them. We often
encounter users with weird configs that indicate they don't really
understand how these settings all relate. In fact these pages are all
very interrelated, and the HTTP and Transport pages are almost all only
for advanced users. This commit brings these docs into a single page and
rewords some things to try and guide users away from the advanced
settings unless their configuration needs all the extra complexity.

It also adds a section entitled "Binding and publishing" which clarifies
the meanings of the `bind_host` and `publish_host` parameters. This is
also a common source of confusion amongst users.

It also clarifies that many of these settings accept a list of
addresses, and warns that this may not be what you want. Closes #67956.

Co-authored-by: Adam Locke <adam.locke@elastic.co>
2021-02-01 13:06:20 +00:00
David Turner
9c100cdeae
Extend default probe connect/handshake timeouts (#68059)
Today the discovery phase has a short 1-second timeout for handshaking
with a remote node after connecting, which allows it to quickly move on
and retry in the case of connecting to something that doesn't respond
straight away (e.g. it isn't an Elasticsearch node).

This short timeout was necessary when the component was first developed
because each connection attempt would block a thread. Since #42636 the
connection attempt is now nonblocking so we can apply a more relaxed
timeout.

If transport security is enabled then our handshake timeout applies to
the TLS handshake followed by the Elasticsearch handshake. If the TLS
handshake alone takes over a second then the whole handshake times out
with a `ConnectTransportException`, but this does not tell us which of
the two individual handshakes took so long.

TLS handshakes have their own 10-second timeout, which if reached yields
a `SslHandshakeTimeoutException` that allows us to distinguish a problem
at the TLS level from one at the Elasticsearch level. Therefore this
commit extends the discovery probe timeouts.
2021-01-27 16:41:44 +00:00
Adam Locke
789ee2d73e
[DOCS] Combining important config settings into a single page (#63849)
* Combining important config settings into a single page.

* Updating ids for two pages causing link errors and implementing redirects.
2020-10-19 10:02:22 -04:00
James Rodewig
dcf0c3062f
[DOCS] Document dynamic discovery settings (#61420) 2020-09-04 10:56:17 -04:00
Yannick Welsch
0b517ddca6
Provide option to allow writes when master is down (#60605)
Elasticsearch currently blocks writes by default when a master is unavailable. The cluster.no_master_block setting allows
a user to change this behavior to also block reads when a master is unavailable. This PR introduces a way to now also still
allow writes when a master is offline. Writes will continue to work as long as routing table changes are not needed (as
those require the master for consistency), or if dynamic mapping updates are not required (as again, these require the
master for consistency).

Eventually we should switch the default of cluster.no_master_block to this new mode.
2020-08-12 16:37:32 +02:00
David Turner
19eb922d9f
Remove join timeout (#60873)
There is no point in timing out a join attempt any more. Timing out and
retrying with the same master is pointless, and an in-flight join
attempt to one master no longer blocks attempts to join other masters.
This commit removes this unnecessary setting.

Relates #60872 in which this setting was deprecated.
2020-08-10 13:57:54 +01:00
James Rodewig
2774cd6938
[DOCS] Swap [float] for [discrete] (#60124)
Changes instances of `[float]` in our docs for `[discrete]`.

Asciidoctor prefers the `[discrete]` tag for floating headings:
https://asciidoctor.org/docs/asciidoc-asciidoctor-diffs/#blocks
2020-07-23 11:48:22 -04:00
David Turner
c661a40083
Add docs for filesystem health checks (#59134)
Documents the feature and settings introduced in #52680.

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2020-07-07 14:14:35 +01:00
James Rodewig
70cb519aa7
[DOCS] Relocate discovery module content (#56611)
* Moves `Discovery and cluster formation` content from `Modules` to
`Set up Elasticsearch`.

* Combines `Adding and removing nodes` with `Adding nodes to your
  cluster`. Adds related redirect.

* Removes and redirects the `Modules` page.

* Rewrites parts of `Discovery and cluster formation` to remove `module`
  references and meta references to the section.
2020-05-12 17:39:06 -04:00
David Turner
10ab397d7f
Adjust docs for voting config exclusions API (#55006)
In #50836 we deprecated the existing voting config exclusions API and added a
new one. This commit adjust the docs to match.
2020-04-20 19:47:09 +01:00
David Turner
e2cda1a279
"Adding nodes" instructions only work on localhost (#52677)
The introductory sections of the reference manual contains some simplified
instructions for adding a node to the cluster. Unfortunately they are a little
too simplified and only really work for clusters running on `localhost`. If you
try and follow these instructions for a distributed cluster then the new node
will, confusingly, auto-bootstrap itself into a distinct one-node cluster.

Multiple nodes running on localhost is a valid config, of course, but we should
spell out that these instructions are really only for experimentation and that
it takes a bit more work to add nodes to a distributed cluster. This commit
does so.

Also, the "important config" instructions for discovery say that you MUST set
`discovery.seed_hosts` whereas in fact it is fine to ignore this setting and
use a dynamic discovery mechanism instead. This commit weakens this statement
and links to the docs for dynamic discovery mechanisms.

Finally, this section is also overloaded with some technical details that are
not important for this context and are adequately covered elsewhere, and
completely fails to note that the default discovery port is 9300. This commit
addresses this.
2020-02-27 08:51:17 +00:00
David Turner
a304d9a656
Ignore timeouts with single-node discovery (#52159)
Today we use `cluster.join.timeout` to prevent nodes from waiting indefinitely
if joining a faulty master that is too slow to respond, and
`cluster.publish.timeout` to allow a faulty master to detect that it is unable
to publish its cluster state updates in a timely fashion. If these timeouts
occur then the node restarts the discovery process in an attempt to find a
healthier master.

In the special case of `discovery.type: single-node` there is no point in
looking for another healthier master since the single node in the cluster is
all we've got. This commit suppresses these timeouts and instead lets the node
wait for joins and publications to succeed no matter how long this might take.
2020-02-11 14:00:06 +00:00
David Turner
69e0b1a0f4
Drop snapshot instructions for autobootstrap fix (#49755)
The "Restore any snapshots as required" step is a trap: it's somewhere between
tricky and impossible to restore multiple clusters into a single one.

Also add a note about configuring discovery during a rolling upgrade to
proscribe any rare cases where you might accidentally autobootstrap during the
upgrade.
2019-12-02 12:43:18 +00:00
glerb
dd47cf4560 [DOCS] Correct typo in Discovery docs (#48494) 2019-11-05 08:48:20 -05:00
David Turner
9e30a57ca5
More bootstrap docs tweaks (#47809)
Clarifies not to set `cluster.initial_master_nodes` on nodes that are joining
an existing cluster.

Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
2019-10-10 10:53:27 +02:00
David Turner
a84908cebd Clarify that discovery ignores master-ineligibles (#44835)
The changes in #32006 mean that the discovery process can no longer use
master-ineligible nodes as a stepping-stone between master-eligible nodes.
This was normally an indication of a strange and possibly-fragile configuration
and was not recommended. This commit clarifies that only master-eligible nodes
are now involved with discovery.
2019-09-12 11:12:35 +01:00