Draft of new troubleshooting and FAQ topics

Fixes #9943
This commit is contained in:
Karen Metts 2018-08-22 10:38:21 -04:00 committed by karen.metts
parent 69c1928f4a
commit 6dab577a93
3 changed files with 263 additions and 0 deletions


@@ -238,6 +238,14 @@ include::static/maintainer-guide.asciidoc[]
:edit_url: https://github.com/elastic/logstash/edit/{branch}/docs/static/submitting-a-plugin.asciidoc
include::static/submitting-a-plugin.asciidoc[]
// FAQ and Troubleshooting
// :edit_url: https://github.com/elastic/logstash/edit/{branch}/docs/static/faq.asciidoc
include::static/faq.asciidoc[]
// :edit_url: https://github.com/elastic/logstash/edit/{branch}/docs/static/troubleshooting.asciidoc
include::static/troubleshooting.asciidoc[]
// Glossary of Terms
:edit_url: https://github.com/elastic/logstash/edit/{branch}/docs/static/glossary.asciidoc

docs/static/faq.asciidoc

@@ -0,0 +1,64 @@
[[faq]]
== Frequently Asked Questions (FAQ)
This is a new section. We will be adding more questions and answers, so check back soon.
Also check out the https://discuss.elastic.co/c/logstash[Logstash discussion
forum].
[float]
[[faq-kafka]]
=== Kafka
This section is a quick summary of the most common Kafka questions asked on GitHub and Slack over the last few months:
[float]
[[faq-kafka-settings]]
==== Kafka settings
[float]
[[faq-kafka-partitions]]
===== How many partitions should be used per topic?
At least: the number of Logstash nodes multiplied by the consumer threads per node.

Better yet: use a multiple of the above number. Increasing the number of
partitions for an existing topic is extremely complicated, while partitions have a
very low overhead. Using 5 to 10 times the number of partitions suggested by the
first point is generally fine, so long as the overall partition count does not
exceed 2k. Err on the side of over-partitioning by up to 10x as long as the
overall count stays below 1k partitions; over-partition less liberally if it
would push you past 1k partitions.
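For example, a hypothetical deployment of 4 Logstash nodes with 4 consumer
threads each:

[source,txt]
-----
4 LS nodes x 4 consumer threads = 16 partitions minimum
16 x 5 (to 10)                  = 80 to 160 partitions recommended (well below 1k)
-----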
[float]
[[faq-kafka-threads]]
===== How many consumer threads should I configure?
Lower values tend to be more efficient and have less memory overhead. Try a
value of `1` and then iterate your way up. In general, the value should be lower
than the number of pipeline workers. Values larger than 4 rarely result in a
performance improvement.
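A minimal sketch of where this is configured; the broker address and topic name
are hypothetical placeholders:

[source,txt]
-----
input {
  kafka {
    bootstrap_servers => "localhost:9092"  # hypothetical broker address
    topics            => ["my-topic"]      # hypothetical topic
    consumer_threads  => 1                 # start at 1, then iterate upward
  }
}
-----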
[float]
[[faq-kafka-pq-persist]]
==== Kafka input and persistent queue (PQ)
[float]
===== Does Kafka Input commit offsets only after the event has been safely persisted to the PQ?
No, we can't make that guarantee. Offsets are committed to Kafka periodically. If
writes to the PQ are slow or blocked, offsets for events that haven't yet safely
reached the PQ can be committed.
[float]
[[faq-kafka-offset-commit]]
===== Does Kafka Input commit offsets only for events that have passed the pipeline fully?
No, we can't make that guarantee. Offsets are committed to Kafka periodically; an
offset can be committed before the corresponding event has fully passed through
the pipeline, for example when writes to the PQ are slow or blocked.

docs/static/troubleshooting.asciidoc

@@ -0,0 +1,191 @@
[[troubleshooting]]
== Troubleshooting Common Problems
This is a new section. We will be adding more tips and solutions, so check back soon.
Also check out the https://discuss.elastic.co/c/logstash[Logstash discussion
forum].
[float]
[[ts-install]]
== Installation and setup
[float]
[[ts-temp-dir]]
=== Inaccessible temp directory
Certain versions of the JRuby runtime, and libraries in certain plugins (such as
the Netty network library in the TCP input), copy executable files to the temp
directory. This causes subsequent failures when `/tmp` is mounted with the
`noexec` option.
Possible solutions:
. Change the mount settings so that `/tmp` is mounted with `exec`.
. Specify an alternate directory using the `-Djava.io.tmpdir` setting in the `jvm.options` file, as shown below.
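For example, in the `jvm.options` file; the directory shown is a hypothetical
example, and any directory writable by the Logstash user and mounted with `exec`
works:

[source,txt]
-----
# jvm.options: point the JVM at a temp directory that is not mounted noexec
-Djava.io.tmpdir=/usr/share/logstash/tmp
-----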
[float]
[[ts-ingest]]
== Data ingestion
[float]
[[ts-429]]
=== Error response code 429
A `429` response code indicates that an application is too busy to handle
additional requests. For example, Elasticsearch returns a `429` code to notify
Logstash (or other indexers) that a bulk request failed because the ingest queue
is full. Any documents that weren't processed should be retried. Logstash retries
them automatically; the sample log below shows the Elasticsearch output retrying
a failed action. Persistent `429` responses usually mean that Elasticsearch
cannot keep up with the rate of indexing requests.
*Sample error*
[source,txt]
-----
[2018-08-21T20:05:36,111][INFO ][logstash.outputs.elasticsearch] retrying
failed action with response code: 429
({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of
org.elasticsearch.transport.TransportService$7@85be457 on
EsThreadPoolExecutor[bulk, queue capacity = 200,
org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@538c9d8a[Running,
pool size = 16, active threads = 16, queued tasks = 200, completed tasks =
685]]"})
-----
[float]
[[ts-kafka]]
== Common Kafka support issues and solutions
This section contains a list of the most common Kafka-related support issues
from the last few months.
[float]
[[ts-kafka-timeout]]
=== Kafka session timeout issues (input side)
This is a very common problem.
Symptoms: Throughput issues and duplicate event processing. Logstash logs
warnings such as:

[source,txt]
-----
[2017-10-18T03:37:59,302][WARN][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
Auto offset commit failed for group clap_tx1: Commit cannot be completed since
the group has already rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll() was longer than the
configured session.timeout.ms, which typically implies that the poll loop is
spending too much time message processing. You can address this either by
increasing the session timeout or by reducing the maximum size of batches
returned in poll() with max.poll.records.
-----

The Kafka consumer also logs the resulting partition rebalancing:

[source,txt]
-----
[INFO][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] Revoking
previously assigned partitions [] for group log-ronline-node09
[2018-01-29T14:54:06,485][INFO][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
Setting newly assigned partitions [elk-pmbr-9] for group log-pmbr
-----

Example: https://github.com/elastic/support-dev-help/issues/3319
Background:
Kafka tracks the individual consumers in a consumer group (i.e. a number of
Logstash instances) and tries to give each consumer one or more specific
partitions of the data in the topic they're consuming. In order to achieve this,
Kafka also has to track whether or not a consumer (a Logstash Kafka input
thread) is making any progress on its assigned partitions, and it reassigns
partitions that have not seen progress in a set timeframe. This causes a problem
when Logstash requests more events from the Kafka Broker than it can process
within the timeout, because that triggers reassignment of partitions.
Reassignment of partitions can cause duplicate processing of events and
significant throughput problems because of the time the reassignment takes.

Solution:
To fix the problem, reduce the number of records Logstash polls from the Kafka
Broker in one request, reduce the number of Kafka input threads, and/or increase
the relevant timeouts in the Kafka Consumer configuration.
The number of records to pull in one request is set by the `max_poll_records`
option. If it exceeds the default value of 500, reducing it should be the first
thing to try.

The number of input threads is given by the `consumer_threads` option. If it
exceeds the number of pipeline workers configured in `logstash.yml`, it should
certainly be reduced. If it is a large value (> 4), it likely makes sense to
reduce it to 4. If time and resources allow, it is ideal to start with a value
of 1 and then increment from there to find the optimal performance.

The relevant timeout is set via `session_timeout_ms`. It should be set to a
value that ensures the number of events in `max_poll_records` can be safely
processed within it. Example: if pipeline throughput is 10k events per second
and `max_poll_records` is set to 1k, then the value must be at least 100ms when
`consumer_threads` is set to `1`. If it is set to a higher value `n`, the
minimum session timeout increases proportionally to `n * 100ms`.

In practice the value must be set much higher than the theoretical value because
the behaviour of the outputs and filters in a pipeline follows a distribution.
It should also be higher than the maximum time you expect your outputs to stall
for. The default setting is 10s (`10000ms`). If you experience periodic problems
with an output that can stall because of load or similar effects, such as the
Elasticsearch output, there is little downside to increasing this value
significantly, to say 60s.

Note: From a performance perspective, decreasing `max_poll_records` is
preferable to increasing this timeout. Increasing the timeout is your only
option if the problems are caused by periodically stalling outputs. Check logs
for evidence of stalling outputs, such as the Elasticsearch output logging
status `429`.
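A sketch of a Kafka input tuned along these lines; the broker address and topic
are hypothetical placeholders, and the numbers are starting points to iterate
from, not definitive values:

[source,txt]
-----
input {
  kafka {
    bootstrap_servers  => "kafka-broker:9092"  # hypothetical broker address
    topics             => ["my-topic"]         # hypothetical topic
    max_poll_records   => "250"                # pull fewer records per poll (default 500)
    consumer_threads   => 1                    # start at 1 and increment while measuring
    session_timeout_ms => "60000"              # tolerate outputs stalling for up to 60s
  }
}
-----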
[float]
[[ts-kafka-many-offset-commits]]
=== Large number of offset commits (input side)
Symptoms: Logstash's Kafka Input causes a much higher number of commits to
the offset topic than expected. Often the complaint also mentions redundant
offset commits where the same offset is committed repeatedly.

Examples: https://github.com/elastic/support-dev-help/issues/3702 and
https://github.com/elastic/support-dev-help/issues/3060

Solution:
For Kafka Broker versions 0.10.2.1 to 1.0.x: the problem is caused by a bug in
Kafka, https://issues.apache.org/jira/browse/KAFKA-6362. The best option is to
upgrade the Kafka Brokers to version 1.1 or newer.

For older versions of Kafka, or if the above does not fully resolve the issue:
the problem can also be caused by setting too low a value for `poll_timeout_ms`
relative to the rate at which the Kafka Brokers receive events themselves (or
when Brokers idle between receiving bursts of events). Increasing the value set
for `poll_timeout_ms` proportionally decreases the number of offset commits in
this scenario, i.e. raising it by 10x will lead to 10x fewer offset commits.
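For example, a minimal sketch assuming the default `poll_timeout_ms` of 100,
with all connection settings kept as in your existing configuration:

[source,txt]
-----
input {
  kafka {
    # ... broker and topic settings as in your existing configuration ...
    poll_timeout_ms => 1000  # 10x the default of 100 => roughly 10x fewer offset commits
  }
}
-----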
[float]
[[ts-kafka-codec-errors-input]]
=== Codec Errors in Kafka Input (before Plugin Version 6.3.4 only)
Symptoms:
The Logstash Kafka input randomly logs errors from the configured codec and/or
reads events incorrectly (partial reads, mixing data between multiple events,
etc.).

Log example:

[source,txt]
-----
[2018-02-05T13:51:25,773][FATAL][logstash.runner ] An
unexpected error occurred! {:error=>#<TypeError: can't convert nil into String>,
:backtrace=>["org/jruby/RubyArray.java:1892:in `join'",
"org/jruby/RubyArray.java:1898:in `join'",
"/usr/share/logstash/logstash-core/lib/logstash/util/buftok.rb:87:in `extract'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-codec-line-3.0.8/lib/logstash/codecs/line.rb:38:in
`decode'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-5.1.11/lib/logstash/inputs/kafka.rb:241:in
`thread_runner'",
"file:/usr/share/logstash/vendor/jruby/lib/jruby.jar!/jruby/java/java_ext/java.lang.rb:12:in
`each'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-5.1.11/lib/logstash/inputs/kafka.rb:240:in
`thread_runner'"]}
-----

Examples: https://github.com/elastic/support-dev-help/issues/3308 and
https://github.com/elastic/support-dev-help/issues/2107

Background:
There was a bug in the way the Kafka Input plugin handled codec instances when
running on multiple threads (`consumer_threads` set to > 1):
https://github.com/logstash-plugins/logstash-input-kafka/issues/210

Solution:
Ideally: upgrade the Kafka Input plugin to v6.3.4 or later. If (and only if)
upgrading is impossible: set `consumer_threads` to `1`.
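For example, assuming a standard package installation, the plugin can be
upgraded to the latest release with the bundled `logstash-plugin` command:

[source,txt]
-----
bin/logstash-plugin update logstash-input-kafka
-----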