Mirror of https://github.com/elastic/logstash.git (synced 2025-04-24 14:47:19 -04:00)
parent 6dab577a93
commit 9838f43b45
2 changed files with 164 additions and 85 deletions
docs/static/faq.asciidoc (vendored, 22 lines changed)
@@ -1,7 +1,7 @@
 [[faq]]
 == Frequently Asked Questions (FAQ)
 
-This is a new section. We will be adding more questions and answers, so check back soon.
+We will be adding more questions and answers, so please check back soon.
 
 Also check out the https://discuss.elastic.co/c/logstash[Logstash discussion
 forum].
@@ -10,25 +10,26 @@ forum].
 [[faq-kafka]]
 === Kafka
 
-This section is just a quick summary the most common Kafka questions I answered on Github and Slack over the last few months:
+This section is a summary of the most common Kafka questions from the last few months.
 
 [float]
 [[faq-kafka-settings]]
-===== Kafka settings
+==== Kafka settings
 
 [float]
 [[faq-kafka-partitions]]
 ===== How many partitions should be used per topic?
 
-At least: Number of LS nodes x consumer threads per node.
+At least: Number of {ls} nodes multiplied by consumer threads per node.
 
 Better yet: Use a multiple of the above number. Increasing the number of
 partitions for an existing topic is extremely complicated. Partitions have a
 very low overhead. Using 5 to 10 times the number of partitions suggested by the
-first point is generally fine so long as the overall partition count does not
-exceed 2k (err on the side of over-partitioning 10x when for less than 1k
-partitions overall, over-partition less liberally if it makes you exceed 1k
-partitions).
+first point is generally fine, so long as the overall partition count does not
+exceed 2000.
+
+Err on the side of over-partitioning up to a total of 1000 partitions
+overall. Over-partition less liberally if doing so takes you past 1000 partitions.
 
 [float]
 [[faq-kafka-threads]]
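As a sketch of the guidance above: with 3 {ls} nodes running 4 consumer threads each, the minimum is 12 partitions, and 5 to 10 times that suggests 60 to 120. A hypothetical topic-creation command (the topic name, partition count, and broker address are placeholders, not values from this commit):

```shell
# 3 Logstash nodes x 4 consumer threads = 12 minimum; 5x that gives 60
bin/kafka-topics.sh --create --topic logs \
  --partitions 60 --replication-factor 3 \
  --bootstrap-server localhost:9092
```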
@@ -39,7 +40,6 @@ value of `1` then iterate your way up. The value should in general be lower than
 the number of pipeline workers. Values larger than 4 rarely result in a
 performance improvement.
 
-
 [float]
 [[faq-kafka-pq-persist]]
 ==== Kafka input and persistent queue (PQ)
@@ -47,7 +47,7 @@ performance improvement.
 ===== Does Kafka Input commit offsets only after the event has been safely persisted to the PQ?
 
 No, we can’t make the guarantee. Offsets are committed to Kafka periodically. If
-writes to the PQ are slow/blocked, offsets for events that haven’t yet safely
+writes to the PQ are slow/blocked, offsets for events that haven’t safely
 reached the PQ can be committed.
 
 
@@ -55,7 +55,7 @@ reached the PQ can be committed.
 [[faq-kafka-offset-commit]]
 ===== Does Kafka Input commit offsets only for events that have passed the pipeline fully?
 No, we can’t make the guarantee. Offsets are committed to Kafka periodically. If
-writes to the PQ are slow/blocked offsets for events that haven’t yet safely
+writes to the PQ are slow/blocked, offsets for events that haven’t safely
 reached the PQ can be committed.
 
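For reference, the persistent queue discussed in the PQ questions above is enabled in `logstash.yml`. A minimal sketch (the size cap is an illustrative value, not a recommendation from this commit):

```yaml
# logstash.yml: enable the persistent queue (PQ)
queue.type: persisted
# optional: cap the on-disk queue size (default is 1024mb)
queue.max_bytes: 4gb
```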
docs/static/troubleshooting.asciidoc (vendored, 227 lines changed)
@@ -1,7 +1,7 @@
 [[troubleshooting]]
 == Troubleshooting Common Problems
 
-This is a new section. We will be adding more tips and solutions, so check back soon.
+We will be adding more tips and solutions, so please check back soon.
 
 Also check out the https://discuss.elastic.co/c/logstash[Logstash discussion
 forum].
@@ -17,13 +17,14 @@ forum].
 === Inaccessible temp directory
 
 Certain versions of the JRuby runtime and libraries
-in certain plugins (e.g., the Netty network library in the TCP input) copy
-executable files to the temp directory which causes subsequent failures when
-/tmp is mounted noexec.
+in certain plugins (the Netty network library in the TCP input, for example) copy
+executable files to the temp directory. This situation causes subsequent failures when
+`/tmp` is mounted `noexec`.
 
-Possible solutions:
-. Change setting to mount /tmp with exec.
-. Specify an alternate directory using the `-Djava.io.tmpdir` setting in the jvm.options file.
+*Possible solutions*
+
+* Change the setting to mount `/tmp` with `exec`.
+* Specify an alternate directory using the `-Djava.io.tmpdir` setting in the `jvm.options` file.
 
 
 [float]
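The second option might look like this in `config/jvm.options` (the directory path is an assumption; any `exec`-mounted location writable by the logstash user works):

```
# jvm.options: point the JVM temp directory away from a noexec /tmp
-Djava.io.tmpdir=/usr/share/logstash/tmp
```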
@@ -34,15 +35,15 @@ Possible solutions:
 [[ts-429]]
 === Error response code 429
 
-A 429 message indicates that an application is busy handling other requests. For
-example, Elasticsearch throws a 429 code to notify Logstash (or other indexers)
+A `429` message indicates that an application is busy handling other requests. For
+example, Elasticsearch throws a `429` code to notify Logstash (or other indexers)
 that the bulk failed because the ingest queue is full. Any documents that
 weren't processed should be retried.
 
 TBD: Does Logstash retry? Should the user take any action?
 
 *Sample error*
-[source,txt]
 -----
 [2018-08-21T20:05:36,111][INFO ][logstash.outputs.elasticsearch] retrying
 failed action with response code: 429
@@ -55,113 +56,150 @@ pool size = 16, active threads = 16, queued tasks = 200, completed tasks =
 -----
 
 
+[float]
+[[ts-performance]]
+== General performance tuning
+
+For general performance tuning tips and guidelines, see <<performance-tuning>>.
+
 
 
 [float]
 [[ts-kafka]]
 == Common Kafka support issues and solutions
 
-This section contains a list of the most common Kafka related support issues of
+This section contains a list of common Kafka issues from
 the last few months.
 
 [float]
 [[ts-kafka-timeout]]
 === Kafka session timeout issues (input side)
 
-This is a very common problem.
+This is a common problem.
 
-Symptoms: Throughput issues and duplicate event
-processing LS logs warnings:
+*Symptoms*
 
-`[2017-10-18T03:37:59,302][WARN][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
+Throughput issues and duplicate event processing. {ls} logs warnings:
+
+-----
+[2017-10-18T03:37:59,302][WARN][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
 Auto offset commit failed for group clap_tx1: Commit cannot be completed since
-the group has already rebalanced and assigned the partitions to another member.`
+the group has already rebalanced and assigned the partitions to another member.
+-----
 
-This means that the time between subsequent calls to poll() was longer than the
-configured session.timeout.ms, which typically implies that the poll loop is
-spending too much time message processing. You can address this either by
+The time between subsequent calls to `poll()` was longer than the
+configured `session.timeout.ms`, which typically implies that the poll loop is
+spending too much time processing messages. You can address this by
 increasing the session timeout or by reducing the maximum size of batches
-returned in poll() with max.poll.records.
+returned in `poll()` with `max.poll.records`.
 
+-----
 [INFO][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] Revoking
 previously assigned partitions [] for group log-ronline-node09
 [2018-01-29T14:54:06,485][INFO][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
 Setting newly assigned partitions [elk-pmbr-9] for group log-pmbr
+-----
 
-Example: https://github.com/elastic/support-dev-help/issues/3319
+*Background*
 
-Background:
+Kafka tracks the individual consumers in a consumer group (for example, a number
+of {ls} instances) and tries to give each consumer one or more specific
+partitions of data in the topic they’re consuming. In order to achieve this,
+Kafka tracks whether or not a consumer ({ls} Kafka input thread) is making
+progress on their assigned partition, and reassigns partitions that have not
+made progress in a set timeframe.
 
-Kafka tracks the individual consumers in a consumer group (i.e. a number of LS
-instances) and tries to give each consumer one or more specific partitions of
-the data in the topic they’re consuming. In order to achieve this, Kafka has to
-also track whether or not a consumer (LS Kafka input thread) is making any
-progress on their assigned partition and reassign partitions that have not seen
-progress in a set timeframe. This causes a problem when Logstash is requesting
-more events from the Kafka Broker than it can process within the timeout because
-it triggers reassignment of partitions. Reassignment of partitions can cause
-duplicate processing of events and significant throughput problems because of
-the time the reassignment takes. Solution:
+When {ls} requests more events from the Kafka Broker than it can process within
+the timeout, it triggers reassignment of partitions. Reassignment of partitions
+takes time, and can cause duplicate processing of events and significant
+throughput problems.
 
-Solution:
-Fixing the problem is easy by reducing the number of records per request that LS
-polls from the Kafka Broker in on request, reducing the number of Kafka input
-threads and/or increasing the relevant timeouts in the Kafka Consumer
-configuration.
+*Possible solutions*
 
-The number of records to pull in one request is set by the option
-`max_poll_records`. If it exceeds the default value of 500, reducing this
-should be the first thing to try. The number of input threads is given by the
-option `consumer_threads`. If it exceeds the number of pipeline workers
-configured in the `logstash.yml` it should certainly be reduced. If it is a
-large value (> 4), it likely makes sense to reduce it to 4 (if the client has
-the time/resources for it, it would be ideal to start with a value of 1 and then
-increment from there to find the optimal performance). The relevant timeout is
-set via `session_timeout_ms`. It should be set to a value that ensures that the
-number of events in `max_poll_records` can be safely processed within. Example:
-pipeline throughput is 10k/s and `max_poll_records` is set to 1k => the value
+* Reduce the number of records that {ls} polls from the Kafka Broker in one request,
+* Reduce the number of Kafka input threads, and/or
+* Increase the relevant timeouts in the Kafka Consumer configuration.
+
+*Details*
+
+The `max_poll_records` option sets the number of records to be pulled in one request.
+If it exceeds the default value of 500, try reducing it.
+
+The `consumer_threads` option sets the number of input threads. If the value exceeds
+the number of pipeline workers configured in the `logstash.yml` file, it should
+certainly be reduced.
+If the value is greater than 4, try reducing it to `4` or less if the client has
+the time/resources for it. Try starting with a value of `1`, and then
+incrementing from there to find the optimal performance.
+
+The `session_timeout_ms` option sets the relevant timeout. Set it to a value
+that ensures that the number of events in `max_poll_records` can be safely
+processed within the time limit.
+
+-----
+EXAMPLE
+Pipeline throughput is `10k/s` and `max_poll_records` is set to `1k`. The value
 must be at least 100ms if `consumer_threads` is set to `1`. If it is set to a
-higher value n, then the minimum session timeout increases proportionally to `n *
-100ms`. In practice the value must be set much larger than the theoretical value
-because the behaviour of the outputs and filters in a pipeline follows a
-distribution. It should also be larger than the maximum time you expect your
-outputs to stall for. The default setting is 10s == `10000ms`. If a user is
-experiencing periodic problems with an output like Elasticsearch output that
-could stall because of load or similar effects, there is little downside to
-increasing this value significantly to say 60s. Note: Decreasing the
-`max_poll_records` is preferable to increasing this timeout from the performance
-perspective. Increasing this timeout is your only option if the client’s issues
-are caused by periodically stalling outputs. Check logs for evidence of stalling
-outputs (e.g. ES output logging status `429`).
+higher value `n`, then the minimum session timeout increases proportionally to
+`n * 100ms`.
+-----
+
+In practice the value must be set much higher than the theoretical value because
+the behavior of the outputs and filters in a pipeline follows a distribution.
+The value should also be higher than the maximum time you expect your outputs to
+stall. The default setting is `10s == 10000ms`. If you are experiencing
+periodic problems with an output that can stall because of load or similar
+effects (such as the Elasticsearch output), there is little downside to
+increasing this value significantly to say `60s`.
+
+From a performance perspective, decreasing the `max_poll_records` value is preferable
+to increasing the timeout value. Increasing the timeout is your only option if the
+client’s issues are caused by periodically stalling outputs. Check logs for
+evidence of stalling outputs, such as `ES output logging status 429`.
 
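Putting the options above together, a hypothetical Kafka input tuned along these lines might look like the following sketch (the topic and broker address are placeholders, and the specific values are illustrative, not recommendations from this commit):

```
input {
  kafka {
    bootstrap_servers  => "kafka01:9092"   # placeholder broker address
    topics             => ["logs"]         # placeholder topic
    consumer_threads   => 1                # start low, increment to find the optimum
    max_poll_records   => "250"            # reduced below the default of 500
    session_timeout_ms => "60000"          # generous timeout for stalling outputs
  }
}
```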
 [float]
 [[ts-kafka-many-offset-commits]]
 === Large number of offset commits (input side)
 
-Symptoms: Logstash’s Kafka Input is causing a much higher number of commits to
+*Symptoms*
+
+Logstash’s Kafka Input is causing a much higher number of commits to
 the offset topic than expected. Often the complaint also mentions redundant
 offset commits where the same offset is committed repeatedly.
 
-Examples: https://github.com/elastic/support-dev-help/issues/3702
-https://github.com/elastic/support-dev-help/issues/3060 Solution:
+*Solution*
 
 For Kafka Broker versions 0.10.2.1 to 1.0.x: The problem is caused by a bug in
 Kafka. https://issues.apache.org/jira/browse/KAFKA-6362 The client’s best option
-is upgrading their Kafka Brokers to version 1.1 or newer. For older versions of
+is upgrading their Kafka Brokers to version 1.1 or newer.
+
+For older versions of
 Kafka or if the above does not fully resolve the issue: The problem can also be
-caused by setting too low of a value for `poll_timeout_ms` relative to the rate
+caused by setting the value for `poll_timeout_ms` too low relative to the rate
 at which the Kafka Brokers receive events themselves (or if Brokers periodically
 idle between receiving bursts of events). Increasing the value set for
-`poll_timeout_ms` will proportionally decrease the number of offsets commits in
-this scenario (i.e. raising it by 10x will lead to 10x fewer offset commits).
+`poll_timeout_ms` proportionally decreases the number of offset commits in
+this scenario. For example, raising it by 10x will lead to 10x fewer offset commits.
 
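As a sketch, the 10x adjustment described above could be applied on the Kafka input like this (the surrounding input settings are omitted, and the starting value is an assumption based on the plugin's documented default):

```
input {
  kafka {
    # default poll_timeout_ms is 100; raising it 10x should yield
    # roughly 10x fewer offset commits
    poll_timeout_ms => 1000
  }
}
```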
 [float]
 [[ts-kafka-codec-errors-input]]
 === Codec Errors in Kafka Input (before Plugin Version 6.3.4 only)
 
-Symptoms:
+*Symptoms*
 
 Logstash Kafka input randomly logs errors from the configured codec and/or reads
 events incorrectly (partial reads, mixing data between multiple events etc.).
 
+-----
 Log example: [2018-02-05T13:51:25,773][FATAL][logstash.runner ] An
 unexpected error occurred! {:error=>#<TypeError: can't convert nil into String>,
 :backtrace=>["org/jruby/RubyArray.java:1892:in `join'",
@@ -175,16 +213,57 @@ unexpected error occurred! {:error=>#<TypeError: can't convert nil into String>,
 `each'",
 "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-5.1.11/lib/logstash/inputs/kafka.rb:240:in
 `thread_runner'"]}
+-----
 
-Examples: https://github.com/elastic/support-dev-help/issues/3308
-https://github.com/elastic/support-dev-help/issues/2107 Background:
+*Background*
 
 There was a bug in the way the Kafka Input plugin was handling codec instances
 when running on multiple threads (`consumer_threads` set to > 1).
-https://github.com/logstash-plugins/logstash-input-kafka/issues/210 Solution:
+https://github.com/logstash-plugins/logstash-input-kafka/issues/210
+
+*Solution*
+
+* Upgrade Kafka Input plugin to v. 6.3.4 or later.
+* If (and only if) upgrading is impossible, set `consumer_threads` to `1`.
+
+
+[float]
+[[ts-other]]
+== Other issues
+
+[float]
+[[ts-cli]]
+=== Command line
+
+[float]
+[[ts-windows-cli]]
+==== Shell commands on Windows OS
+
+Command line examples often show single quotes.
+On Windows systems, replace a single quote (`'`) with a double quote (`"`).
+
+*Example*
+
+Instead of:
+
+-----
+bin/logstash -e 'input { stdin { } } output { stdout {} }'
+-----
+
+Use this format on Windows systems:
+
+-----
+bin/logstash -e "input { stdin { } } output { stdout {} }"
+-----
 
-Ideally: Upgrade Kafka Input plugin to v. 6.3.4 or later. If (and only if)
-upgrading is impossible: Set `consumer_threads` to `1`.