Draft of new troubleshooting and FAQ topics

Fixes #9943
This commit is contained in:
Karen Metts 2018-08-22 10:38:21 -04:00 committed by karen.metts
parent 69c1928f4a
commit 6dab577a93
3 changed files with 263 additions and 0 deletions


@@ -238,6 +238,14 @@ include::static/maintainer-guide.asciidoc[]
:edit_url: https://github.com/elastic/logstash/edit/{branch}/docs/static/submitting-a-plugin.asciidoc
include::static/submitting-a-plugin.asciidoc[]
// FAQ and Troubleshooting
// :edit_url: https://github.com/elastic/logstash/edit/{branch}/docs/static/faq.asciidoc
include::static/faq.asciidoc[]
// :edit_url: https://github.com/elastic/logstash/edit/{branch}/docs/static/troubleshooting.asciidoc
include::static/troubleshooting.asciidoc[]
// Glossary of Terms
:edit_url: https://github.com/elastic/logstash/edit/{branch}/docs/static/glossary.asciidoc

docs/static/faq.asciidoc

@@ -0,0 +1,64 @@
[[faq]]
== Frequently Asked Questions (FAQ)
This is a new section. We will be adding more questions and answers, so check back soon.
Also check out the https://discuss.elastic.co/c/logstash[Logstash discussion
forum].
[float]
[[faq-kafka]]
=== Kafka
This section is a quick summary of the most common Kafka questions asked on GitHub and Slack over the last few months:
[float]
[[faq-kafka-settings]]
==== Kafka settings
[float]
[[faq-kafka-partitions]]
===== How many partitions should be used per topic?
At least: the number of Logstash nodes multiplied by the consumer threads per node.

Better yet: use a multiple of the above number. Increasing the number of
partitions for an existing topic is extremely complicated, while partitions have a
very low overhead. Using 5 to 10 times the number of partitions suggested by the
first point is generally fine, so long as the overall partition count does not
exceed 2k. Err on the side of over-partitioning by up to 10x as long as the
overall count stays below 1k partitions; over-partition less liberally if it
would push you past 1k partitions.
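For example, a hypothetical deployment of 4 Logstash nodes with 4 consumer
threads each:

[source,txt]
-----
4 LS nodes x 4 consumer threads = 16 partitions minimum
16 x 5 (to 10)                  = 80 to 160 partitions recommended (well below 1k)
-----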
[float]
[[faq-kafka-threads]]
===== How many consumer threads should I configure?
Lower values tend to be more efficient and have less memory overhead. Try a
value of `1` and then iterate your way up. In general, the value should be lower
than the number of pipeline workers. Values larger than 4 rarely result in a
performance improvement.
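A minimal sketch of where this is configured; the broker address and topic name
are hypothetical placeholders:

[source,txt]
-----
input {
  kafka {
    bootstrap_servers => "localhost:9092"  # hypothetical broker address
    topics            => ["my-topic"]      # hypothetical topic
    consumer_threads  => 1                 # start at 1, then iterate upward
  }
}
-----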
[float]
[[faq-kafka-pq-persist]]
==== Kafka input and persistent queue (PQ)
[float]
===== Does Kafka Input commit offsets only after the event has been safely persisted to the PQ?
No, we can't make that guarantee. Offsets are committed to Kafka periodically. If
writes to the PQ are slow or blocked, offsets for events that haven't yet safely
reached the PQ can be committed.
[float]
[[faq-kafka-offset-commit]]
===== Does Kafka Input commit offsets only for events that have passed the pipeline fully?
No, we can't make that guarantee. Offsets are committed to Kafka periodically; an
offset can be committed before the corresponding event has fully passed through
the pipeline, for example when writes to the PQ are slow or blocked.

docs/static/troubleshooting.asciidoc

@@ -0,0 +1,191 @@
[[troubleshooting]]
== Troubleshooting Common Problems
This is a new section. We will be adding more tips and solutions, so check back soon.
Also check out the https://discuss.elastic.co/c/logstash[Logstash discussion
forum].
[float]
[[ts-install]]
== Installation and setup
[float]
[[ts-temp-dir]]
=== Inaccessible temp directory
Certain versions of the JRuby runtime, and libraries in certain plugins (such as
the Netty network library in the TCP input), copy executable files to the temp
directory. This causes subsequent failures when `/tmp` is mounted with the
`noexec` option.
Possible solutions:
. Change the mount settings so that `/tmp` is mounted with `exec`.
. Specify an alternate directory using the `-Djava.io.tmpdir` setting in the `jvm.options` file, as shown below.
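For example, in the `jvm.options` file; the directory shown is a hypothetical
example, and any directory writable by the Logstash user and mounted with `exec`
works:

[source,txt]
-----
# jvm.options: point the JVM at a temp directory that is not mounted noexec
-Djava.io.tmpdir=/usr/share/logstash/tmp
-----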
[float]
[[ts-ingest]]
== Data ingestion
[float]
[[ts-429]]
=== Error response code 429
A `429` response code indicates that an application is too busy to handle
additional requests. For example, Elasticsearch returns a `429` code to notify
Logstash (or other indexers) that a bulk request failed because the ingest queue
is full. Any documents that weren't processed should be retried. Logstash retries
them automatically; the sample log below shows the Elasticsearch output retrying
a failed action. Persistent `429` responses usually mean that Elasticsearch
cannot keep up with the rate of indexing requests.
*Sample error*
[source,txt]
-----
[2018-08-21T20:05:36,111][INFO ][logstash.outputs.elasticsearch] retrying
failed action with response code: 429
({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of
org.elasticsearch.transport.TransportService$7@85be457 on
EsThreadPoolExecutor[bulk, queue capacity = 200,
org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@538c9d8a[Running,
pool size = 16, active threads = 16, queued tasks = 200, completed tasks =
685]]"})
-----
[float]
[[ts-kafka]]
== Common Kafka support issues and solutions
This section contains a list of the most common Kafka-related support issues
from the last few months.
[float]
[[ts-kafka-timeout]]
=== Kafka session timeout issues (input side)
This is a very common problem.
Symptoms: Throughput issues and duplicate event processing. Logstash logs
warnings such as:

[source,txt]
-----
[2017-10-18T03:37:59,302][WARN][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
Auto offset commit failed for group clap_tx1: Commit cannot be completed since
the group has already rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll() was longer than the
configured session.timeout.ms, which typically implies that the poll loop is
spending too much time message processing. You can address this either by
increasing the session timeout or by reducing the maximum size of batches
returned in poll() with max.poll.records.
-----

The Kafka consumer also logs the resulting partition rebalancing:

[source,txt]
-----
[INFO][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] Revoking
previously assigned partitions [] for group log-ronline-node09
[2018-01-29T14:54:06,485][INFO][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
Setting newly assigned partitions [elk-pmbr-9] for group log-pmbr
-----

Example: https://github.com/elastic/support-dev-help/issues/3319
Background:
Kafka tracks the individual consumers in a consumer group (i.e. a number of
Logstash instances) and tries to give each consumer one or more specific
partitions of the data in the topic they're consuming. In order to achieve this,
Kafka also has to track whether or not a consumer (a Logstash Kafka input
thread) is making any progress on its assigned partitions, and it reassigns
partitions that have not seen progress in a set timeframe. This causes a problem
when Logstash requests more events from the Kafka Broker than it can process
within the timeout, because that triggers reassignment of partitions.
Reassignment of partitions can cause duplicate processing of events and
significant throughput problems because of the time the reassignment takes.

Solution:
To fix the problem, reduce the number of records Logstash polls from the Kafka
Broker in one request, reduce the number of Kafka input threads, and/or increase
the relevant timeouts in the Kafka Consumer configuration.
The number of records to pull in one request is set by the `max_poll_records`
option. If it exceeds the default value of 500, reducing it should be the first
thing to try.

The number of input threads is given by the `consumer_threads` option. If it
exceeds the number of pipeline workers configured in `logstash.yml`, it should
certainly be reduced. If it is a large value (> 4), it likely makes sense to
reduce it to 4. If time and resources allow, it is ideal to start with a value
of 1 and then increment from there to find the optimal performance.

The relevant timeout is set via `session_timeout_ms`. It should be set to a
value that ensures the number of events in `max_poll_records` can be safely
processed within it. Example: if pipeline throughput is 10k events per second
and `max_poll_records` is set to 1k, then the value must be at least 100ms when
`consumer_threads` is set to `1`. If it is set to a higher value `n`, the
minimum session timeout increases proportionally to `n * 100ms`.

In practice the value must be set much higher than the theoretical value because
the behaviour of the outputs and filters in a pipeline follows a distribution.
It should also be higher than the maximum time you expect your outputs to stall
for. The default setting is 10s (`10000ms`). If you experience periodic problems
with an output that can stall because of load or similar effects, such as the
Elasticsearch output, there is little downside to increasing this value
significantly, to say 60s.

Note: From a performance perspective, decreasing `max_poll_records` is
preferable to increasing this timeout. Increasing the timeout is your only
option if the problems are caused by periodically stalling outputs. Check logs
for evidence of stalling outputs, such as the Elasticsearch output logging
status `429`.
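A sketch of a Kafka input tuned along these lines; the broker address and topic
are hypothetical placeholders, and the numbers are starting points to iterate
from, not definitive values:

[source,txt]
-----
input {
  kafka {
    bootstrap_servers  => "kafka-broker:9092"  # hypothetical broker address
    topics             => ["my-topic"]         # hypothetical topic
    max_poll_records   => "250"                # pull fewer records per poll (default 500)
    consumer_threads   => 1                    # start at 1 and increment while measuring
    session_timeout_ms => "60000"              # tolerate outputs stalling for up to 60s
  }
}
-----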
[float]
[[ts-kafka-many-offset-commits]]
=== Large number of offset commits (input side)
Symptoms: Logstash's Kafka Input causes a much higher number of commits to
the offset topic than expected. Often the complaint also mentions redundant
offset commits where the same offset is committed repeatedly.

Examples: https://github.com/elastic/support-dev-help/issues/3702 and
https://github.com/elastic/support-dev-help/issues/3060

Solution:
For Kafka Broker versions 0.10.2.1 to 1.0.x: the problem is caused by a bug in
Kafka, https://issues.apache.org/jira/browse/KAFKA-6362. The best option is to
upgrade the Kafka Brokers to version 1.1 or newer.

For older versions of Kafka, or if the above does not fully resolve the issue:
the problem can also be caused by setting too low a value for `poll_timeout_ms`
relative to the rate at which the Kafka Brokers receive events themselves (or
when Brokers idle between receiving bursts of events). Increasing the value set
for `poll_timeout_ms` proportionally decreases the number of offset commits in
this scenario, i.e. raising it by 10x will lead to 10x fewer offset commits.
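For example, a minimal sketch assuming the default `poll_timeout_ms` of 100,
with all connection settings kept as in your existing configuration:

[source,txt]
-----
input {
  kafka {
    # ... broker and topic settings as in your existing configuration ...
    poll_timeout_ms => 1000  # 10x the default of 100 => roughly 10x fewer offset commits
  }
}
-----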
[float]
[[ts-kafka-codec-errors-input]]
=== Codec Errors in Kafka Input (before Plugin Version 6.3.4 only)
Symptoms:
The Logstash Kafka input randomly logs errors from the configured codec and/or
reads events incorrectly (partial reads, mixing data between multiple events,
etc.).

Log example:

[source,txt]
-----
[2018-02-05T13:51:25,773][FATAL][logstash.runner ] An
unexpected error occurred! {:error=>#<TypeError: can't convert nil into String>,
:backtrace=>["org/jruby/RubyArray.java:1892:in `join'",
"org/jruby/RubyArray.java:1898:in `join'",
"/usr/share/logstash/logstash-core/lib/logstash/util/buftok.rb:87:in `extract'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-codec-line-3.0.8/lib/logstash/codecs/line.rb:38:in
`decode'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-5.1.11/lib/logstash/inputs/kafka.rb:241:in
`thread_runner'",
"file:/usr/share/logstash/vendor/jruby/lib/jruby.jar!/jruby/java/java_ext/java.lang.rb:12:in
`each'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-5.1.11/lib/logstash/inputs/kafka.rb:240:in
`thread_runner'"]}
-----

Examples: https://github.com/elastic/support-dev-help/issues/3308 and
https://github.com/elastic/support-dev-help/issues/2107

Background:
There was a bug in the way the Kafka Input plugin handled codec instances when
running on multiple threads (`consumer_threads` set to > 1):
https://github.com/logstash-plugins/logstash-input-kafka/issues/210

Solution:
Ideally: upgrade the Kafka Input plugin to v6.3.4 or later. If (and only if)
upgrading is impossible: set `consumer_threads` to `1`.
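For example, assuming a standard package installation, the plugin can be
upgraded to the latest release with the bundled `logstash-plugin` command:

[source,txt]
-----
bin/logstash-plugin update logstash-input-kafka
-----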