Mirror of https://github.com/elastic/logstash.git (synced 2025-04-24 14:47:19 -04:00)
parent 6dab577a93
commit 9838f43b45
2 changed files with 164 additions and 85 deletions
docs/static/faq.asciidoc (vendored, 22 lines changed)
@@ -1,7 +1,7 @@
 [[faq]]
 == Frequently Asked Questions (FAQ)
 
-This is a new section. We will be adding more questions and answers, so check back soon.
+We will be adding more questions and answers, so please check back soon.
 
 Also check out the https://discuss.elastic.co/c/logstash[Logstash discussion
 forum].
@@ -10,25 +10,26 @@ forum].
 [[faq-kafka]]
 === Kafka
 
-This section is just a quick summary the most common Kafka questions I answered on Github and Slack over the last few months:
+This section is a summary of the most common Kafka questions from the last few months.
 
 [float]
 [[faq-kafka-settings]]
-===== Kafka settings
+==== Kafka settings
 
 [float]
 [[faq-kafka-partitions]]
 ===== How many partitions should be used per topic?
 
-At least: Number of LS nodes x consumer threads per node.
+At least: Number of {ls} nodes multiplied by consumer threads per node.
 
 Better yet: Use a multiple of the above number. Increasing the number of
 partitions for an existing topic is extremely complicated. Partitions have a
 very low overhead. Using 5 to 10 times the number of partitions suggested by the
-first point is generally fine so long as the overall partition count does not
-exceed 2k (err on the side of over-partitioning 10x when for less than 1k
-partitions overall, over-partition less liberally if it makes you exceed 1k
-partitions).
+first point is generally fine, so long as the overall partition count does not
+exceed 2000.
+
+Err on the side of over-partitioning up to a total of 1000 partitions
+overall. Over-partition less liberally if doing so takes you past 1000 partitions.
 
 [float]
 [[faq-kafka-threads]]
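As a sketch of the guidance above: with 3 {ls} nodes running 4 consumer threads each, the minimum is 12 partitions, and 5 to 10 times that suggests 60 to 120. A hypothetical topic-creation command (the topic name, partition count, and broker address are placeholders, not values from this commit):

```shell
# 3 Logstash nodes x 4 consumer threads = 12 minimum; 5x that gives 60
bin/kafka-topics.sh --create --topic logs \
  --partitions 60 --replication-factor 3 \
  --bootstrap-server localhost:9092
```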
@@ -39,7 +40,6 @@ value of `1` then iterate your way up. The value should in general be lower than
 the number of pipeline workers. Values larger than 4 rarely result in a
 performance improvement.
 
-
 [float]
 [[faq-kafka-pq-persist]]
 ==== Kafka input and persistent queue (PQ)
@@ -47,7 +47,7 @@ performance improvement.
 ===== Does Kafka Input commit offsets only after the event has been safely persisted to the PQ?
 
 No, we can’t make the guarantee. Offsets are committed to Kafka periodically. If
-writes to the PQ are slow/blocked, offsets for events that haven’t yet safely
+writes to the PQ are slow/blocked, offsets for events that haven’t safely
 reached the PQ can be committed.
 
 
@@ -55,7 +55,7 @@ reached the PQ can be committed.
 [[faq-kafka-offset-commit]]
 ===== Does Kafka Input commit offsets only for events that have passed the pipeline fully?
 No, we can’t make the guarantee. Offsets are committed to Kafka periodically. If
-writes to the PQ are slow/blocked offsets for events that haven’t yet safely
+writes to the PQ are slow/blocked, offsets for events that haven’t safely
 reached the PQ can be committed.
 
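For reference, the persistent queue discussed in the PQ questions above is enabled in `logstash.yml`. A minimal sketch (the size cap is an illustrative value, not a recommendation from this commit):

```yaml
# logstash.yml: enable the persistent queue (PQ)
queue.type: persisted
# optional: cap the on-disk queue size (default is 1024mb)
queue.max_bytes: 4gb
```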
docs/static/troubleshooting.asciidoc (vendored, 227 lines changed)
@@ -1,7 +1,7 @@
 [[troubleshooting]]
 == Troubleshooting Common Problems
 
-This is a new section. We will be adding more tips and solutions, so check back soon.
+We will be adding more tips and solutions, so please check back soon.
 
 Also check out the https://discuss.elastic.co/c/logstash[Logstash discussion
 forum].
@@ -17,13 +17,14 @@ forum].
 === Inaccessible temp directory
 
 Certain versions of the JRuby runtime and libraries
-in certain plugins (e.g., the Netty network library in the TCP input) copy
-executable files to the temp directory which causes subsequent failures when
-/tmp is mounted noexec.
+in certain plugins (the Netty network library in the TCP input, for example) copy
+executable files to the temp directory. This situation causes subsequent failures when
+`/tmp` is mounted `noexec`.
 
-Possible solutions:
-. Change setting to mount /tmp with exec.
-. Specify an alternate directory using the `-Djava.io.tmpdir` setting in the jvm.options file.
+*Possible solutions*
+
+* Change the setting to mount `/tmp` with `exec`.
+* Specify an alternate directory using the `-Djava.io.tmpdir` setting in the `jvm.options` file.
 
 
 [float]
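The second option might look like this in `config/jvm.options` (the directory path is an assumption; any `exec`-mounted location writable by the logstash user works):

```
# jvm.options: point the JVM temp directory away from a noexec /tmp
-Djava.io.tmpdir=/usr/share/logstash/tmp
```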
@@ -34,15 +35,15 @@ Possible solutions:
 [[ts-429]]
 === Error response code 429
 
-A 429 message indicates that an application is busy handling other requests. For
-example, Elasticsearch throws a 429 code to notify Logstash (or other indexers)
+A `429` message indicates that an application is busy handling other requests. For
+example, Elasticsearch throws a `429` code to notify Logstash (or other indexers)
 that the bulk failed because the ingest queue is full. Any documents that
 weren't processed should be retried.
 
 TBD: Does Logstash retry? Should the user take any action?
 
 *Sample error*
-[source,txt]
 -----
 [2018-08-21T20:05:36,111][INFO ][logstash.outputs.elasticsearch] retrying
 failed action with response code: 429
@@ -55,113 +56,150 @@ pool size = 16, active threads = 16, queued tasks = 200, completed tasks =
 -----
 
 
+[float]
+[[ts-performance]]
+== General performance tuning
+
+For general performance tuning tips and guidelines, see <<performance-tuning>>.
+
 
 
 [float]
 [[ts-kafka]]
 == Common Kafka support issues and solutions
 
-This section contains a list of the most common Kafka related support issues of
+This section contains a list of common Kafka issues from
 the last few months.
 
 [float]
 [[ts-kafka-timeout]]
 === Kafka session timeout issues (input side)
 
-This is a very common problem.
+This is a common problem.
 
-Symptoms: Throughput issues and duplicate event
-processing LS logs warnings:
+*Symptoms*
 
-`[2017-10-18T03:37:59,302][WARN][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
+Throughput issues and duplicate event processing. {ls} logs warnings:
+
+-----
+[2017-10-18T03:37:59,302][WARN][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
 Auto offset commit failed for group clap_tx1: Commit cannot be completed since
-the group has already rebalanced and assigned the partitions to another member.`
+the group has already rebalanced and assigned the partitions to another member.
+-----
 
-This means that the time between subsequent calls to poll() was longer than the
-configured session.timeout.ms, which typically implies that the poll loop is
-spending too much time message processing. You can address this either by
+The time between subsequent calls to `poll()` was longer than the
+configured `session.timeout.ms`, which typically implies that the poll loop is
+spending too much time processing messages. You can address this by
 increasing the session timeout or by reducing the maximum size of batches
-returned in poll() with max.poll.records.
+returned in `poll()` with `max.poll.records`.
 
+-----
 [INFO][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] Revoking
 previously assigned partitions [] for group log-ronline-node09
 [2018-01-29T14:54:06,485][INFO][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
 Setting newly assigned partitions [elk-pmbr-9] for group log-pmbr
+-----
 
-Example: https://github.com/elastic/support-dev-help/issues/3319
+*Background*
 
-Background:
+Kafka tracks the individual consumers in a consumer group (for example, a number
+of {ls} instances) and tries to give each consumer one or more specific
+partitions of data in the topic they’re consuming. In order to achieve this,
+Kafka tracks whether or not a consumer ({ls} Kafka input thread) is making
+progress on their assigned partition, and reassigns partitions that have not
+made progress in a set timeframe.
 
-Kafka tracks the individual consumers in a consumer group (i.e. a number of LS
-instances) and tries to give each consumer one or more specific partitions of
-the data in the topic they’re consuming. In order to achieve this, Kafka has to
-also track whether or not a consumer (LS Kafka input thread) is making any
-progress on their assigned partition and reassign partitions that have not seen
-progress in a set timeframe. This causes a problem when Logstash is requesting
-more events from the Kafka Broker than it can process within the timeout because
-it triggers reassignment of partitions. Reassignment of partitions can cause
-duplicate processing of events and significant throughput problems because of
-the time the reassignment takes. Solution:
+When {ls} requests more events from the Kafka Broker than it can process within
+the timeout, it triggers reassignment of partitions. Reassignment of partitions
+takes time, and can cause duplicate processing of events and significant
+throughput problems.
 
-Solution:
-Fixing the problem is easy by reducing the number of records per request that LS
-polls from the Kafka Broker in on request, reducing the number of Kafka input
-threads and/or increasing the relevant timeouts in the Kafka Consumer
-configuration.
+*Possible solutions*
 
-The number of records to pull in one request is set by the option
-`max_poll_records`. If it exceeds the default value of 500, reducing this
-should be the first thing to try. The number of input threads is given by the
-option `consumer_threads`. If it exceeds the number of pipeline workers
-configured in the `logstash.yml` it should certainly be reduced. If it is a
-large value (> 4), it likely makes sense to reduce it to 4 (if the client has
-the time/resources for it, it would be ideal to start with a value of 1 and then
-increment from there to find the optimal performance). The relevant timeout is
-set via `session_timeout_ms`. It should be set to a value that ensures that the
-number of events in `max_poll_records` can be safely processed within. Example:
-pipeline throughput is 10k/s and `max_poll_records` is set to 1k => the value
+* Reduce the number of records that {ls} polls from the Kafka Broker in one request,
+* Reduce the number of Kafka input threads, and/or
+* Increase the relevant timeouts in the Kafka Consumer configuration.
+
+*Details*
+
+The `max_poll_records` option sets the number of records to be pulled in one request.
+If it exceeds the default value of 500, try reducing it.
+
+The `consumer_threads` option sets the number of input threads. If the value exceeds
+the number of pipeline workers configured in the `logstash.yml` file, it should
+certainly be reduced.
+If the value is greater than 4, try reducing it to `4` or less if the client has
+the time/resources for it. Try starting with a value of `1`, and then
+incrementing from there to find the optimal performance.
+
+The `session_timeout_ms` option sets the relevant timeout. Set it to a value
+that ensures that the number of events in `max_poll_records` can be safely
+processed within the time limit.
+
+-----
+EXAMPLE
+Pipeline throughput is `10k/s` and `max_poll_records` is set to `1k`. The value
 must be at least 100ms if `consumer_threads` is set to `1`. If it is set to a
-higher value n, then the minimum session timeout increases proportionally to `n *
-100ms`. In practice the value must be set much larger than the theoretical value
-because the behaviour of the outputs and filters in a pipeline follows a
-distribution. It should also be larger than the maximum time you expect your
-outputs to stall for. The default setting is 10s == `10000ms`. If a user is
-experiencing periodic problems with an output like Elasticsearch output that
-could stall because of load or similar effects, there is little downside to
-increasing this value significantly to say 60s. Note: Decreasing the
-`max_poll_records` is preferable to increasing this timeout from the performance
-perspective. Increasing this timeout is your only option if the client’s issues
-are caused by periodically stalling outputs. Check logs for evidence of stalling
-outputs (e.g. ES output logging status `429`).
+higher value `n`, then the minimum session timeout increases proportionally to
+`n * 100ms`.
+-----
+
+In practice the value must be set much higher than the theoretical value because
+the behavior of the outputs and filters in a pipeline follows a distribution.
+The value should also be higher than the maximum time you expect your outputs to
+stall. The default setting is `10s == 10000ms`. If you are experiencing
+periodic problems with an output that can stall because of load or similar
+effects (such as the Elasticsearch output), there is little downside to
+increasing this value significantly to say `60s`.
+
+From a performance perspective, decreasing the `max_poll_records` value is preferable
+to increasing the timeout value. Increasing the timeout is your only option if the
+client’s issues are caused by periodically stalling outputs. Check logs for
+evidence of stalling outputs, such as `ES output logging status 429`.
 
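Putting the options above together, a hypothetical Kafka input tuned along these lines might look like the following sketch (the topic and broker address are placeholders, and the specific values are illustrative, not recommendations from this commit):

```
input {
  kafka {
    bootstrap_servers  => "kafka01:9092"   # placeholder broker address
    topics             => ["logs"]         # placeholder topic
    consumer_threads   => 1                # start low, increment to find the optimum
    max_poll_records   => "250"            # reduced below the default of 500
    session_timeout_ms => "60000"          # generous timeout for stalling outputs
  }
}
```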
 [float]
 [[ts-kafka-many-offset-commits]]
 === Large number of offset commits (input side)
 
-Symptoms: Logstash’s Kafka Input is causing a much higher number of commits to
+*Symptoms*
+
+Logstash’s Kafka Input is causing a much higher number of commits to
 the offset topic than expected. Often the complaint also mentions redundant
 offset commits where the same offset is committed repeatedly.
 
-Examples: https://github.com/elastic/support-dev-help/issues/3702
-https://github.com/elastic/support-dev-help/issues/3060 Solution:
+*Solution*
 
 For Kafka Broker versions 0.10.2.1 to 1.0.x: The problem is caused by a bug in
 Kafka. https://issues.apache.org/jira/browse/KAFKA-6362 The client’s best option
-is upgrading their Kafka Brokers to version 1.1 or newer. For older versions of
+is upgrading their Kafka Brokers to version 1.1 or newer.
+
+For older versions of
 Kafka or if the above does not fully resolve the issue: The problem can also be
-caused by setting too low of a value for `poll_timeout_ms` relative to the rate
+caused by setting the value for `poll_timeout_ms` too low relative to the rate
 at which the Kafka Brokers receive events themselves (or if Brokers periodically
 idle between receiving bursts of events). Increasing the value set for
-`poll_timeout_ms` will proportionally decrease the number of offsets commits in
-this scenario (i.e. raising it by 10x will lead to 10x fewer offset commits).
+`poll_timeout_ms` proportionally decreases the number of offset commits in
+this scenario. For example, raising it by 10x will lead to 10x fewer offset commits.
 
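As a sketch, the 10x adjustment described above could be applied on the Kafka input like this (the surrounding input settings are omitted, and the starting value is an assumption based on the plugin's documented default):

```
input {
  kafka {
    # default poll_timeout_ms is 100; raising it 10x should yield
    # roughly 10x fewer offset commits
    poll_timeout_ms => 1000
  }
}
```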
 [float]
 [[ts-kafka-codec-errors-input]]
 === Codec Errors in Kafka Input (before Plugin Version 6.3.4 only)
 
-Symptoms:
+*Symptoms*
 
 Logstash Kafka input randomly logs errors from the configured codec and/or reads
 events incorrectly (partial reads, mixing data between multiple events etc.).
 
+-----
 Log example: [2018-02-05T13:51:25,773][FATAL][logstash.runner ] An
 unexpected error occurred! {:error=>#<TypeError: can't convert nil into String>,
 :backtrace=>["org/jruby/RubyArray.java:1892:in `join'",
@@ -175,16 +213,57 @@ unexpected error occurred! {:error=>#<TypeError: can't convert nil into String>,
 `each'",
 "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-5.1.11/lib/logstash/inputs/kafka.rb:240:in
 `thread_runner'"]}
+-----
 
-Examples: https://github.com/elastic/support-dev-help/issues/3308
-https://github.com/elastic/support-dev-help/issues/2107 Background:
+*Background*
 
 There was a bug in the way the Kafka Input plugin was handling codec instances
 when running on multiple threads (`consumer_threads` set to > 1).
-https://github.com/logstash-plugins/logstash-input-kafka/issues/210 Solution:
+https://github.com/logstash-plugins/logstash-input-kafka/issues/210
+
+*Solution*
+
+* Upgrade Kafka Input plugin to v. 6.3.4 or later.
+* If (and only if) upgrading is impossible, set `consumer_threads` to `1`.
+
+
+[float]
+[[ts-other]]
+== Other issues
+
+[float]
+[[ts-cli]]
+=== Command line
+
+[float]
+[[ts-windows-cli]]
+==== Shell commands on Windows OS
+
+Command line examples often show single quotes.
+On Windows systems, replace a single quote (`'`) with a double quote (`"`).
+
+*Example*
+
+Instead of:
+
+-----
+bin/logstash -e 'input { stdin { } } output { stdout {} }'
+-----
+
+Use this format on Windows systems:
+
+-----
+bin/logstash -e "input { stdin { } } output { stdout {} }"
+-----
 
-Ideally: Upgrade Kafka Input plugin to v. 6.3.4 or later. If (and only if)
-upgrading is impossible: Set `consumer_threads` to `1`.