[[troubleshooting]]
== Troubleshooting Common Problems

We are adding more troubleshooting tips, so please check back soon. If you
have something to add, please:

* create an issue at https://github.com/elastic/logstash/issues, or
* create a pull request with your proposed changes at https://github.com/elastic/logstash.

Also check out the https://discuss.elastic.co/c/logstash[Logstash discussion
forum].

[float]
[[ts-install]]
== Installation and setup

[float]
[[ts-temp-dir]]
=== Inaccessible temp directory

Certain versions of the JRuby runtime and libraries
in certain plugins (the Netty network library in the TCP input, for example) copy
executable files to the temp directory. This situation causes subsequent failures when
`/tmp` is mounted `noexec`.

*Sample error*

[source,sh]
-----
[2018-03-25T12:23:01,149][ERROR][org.logstash.Logstash ]
java.lang.IllegalStateException: org.jruby.exceptions.RaiseException:
(LoadError) Could not load FFI Provider: (NotImplementedError) FFI not
available: java.lang.UnsatisfiedLinkError: /tmp/jffi5534463206038012403.so:
/tmp/jffi5534463206038012403.so: failed to map segment from shared object:
Operation not permitted
-----

*Possible solutions*

* Change the mount settings so that `/tmp` is mounted with `exec`.
* Specify an alternate directory using the `-Djava.io.tmpdir` setting in the `jvm.options` file (see the sketch after this list).
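
For example, a minimal `jvm.options` entry might look like this. The directory path is an assumption; use any directory that is writable by the Logstash user and mounted `exec`.

[source,sh]
-----
# jvm.options: redirect JVM temp files away from a noexec /tmp
# /usr/share/logstash/tmp is a hypothetical path -- create it first
-Djava.io.tmpdir=/usr/share/logstash/tmp
-----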

[float]
[[ts-startup]]
== {ls} start up

[float]
[[ts-illegal-reflective-error]]
=== 'Illegal reflective access' errors

// https://github.com/elastic/logstash/issues/10496 and https://github.com/elastic/logstash/issues/10498

Running Logstash with Java 11 results in warnings similar to these:

[source,sh]
-----
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.jruby.ext.openssl.SecurityHelper (file:/{...}/jruby{...}jopenssl.jar) to field java.security.MessageDigest.provider
WARNING: Please consider reporting this to the maintainers of org.jruby.ext.openssl.SecurityHelper
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
-----

These errors appear to be related to https://github.com/jruby/jruby/issues/4834[a known issue with JRuby].

*Workaround*

Try adding these values to the `jvm.options` file.

[source,sh]
-----
--add-opens=java.base/java.security=ALL-UNNAMED
--add-opens=java.base/java.io=ALL-UNNAMED
--add-opens=java.base/java.nio.channels=org.jruby.dist
--add-opens=java.base/sun.nio.ch=org.jruby.dist
--add-opens=java.management/sun.management=org.jruby.dist
-----

*Notes:*

* These settings allow Logstash to start without warnings in Java 11, but they
prevent Logstash from starting on Java 8.
* This workaround has been tested with simple pipelines. If you have experiences
to share, please comment in the
https://github.com/elastic/logstash/issues/10496[issue].

[float]
[[ts-ingest]]
== Data ingestion

[float]
[[ts-429]]
=== Error response code 429

A `429` message indicates that an application is busy handling other requests. For
example, Elasticsearch sends a `429` code to notify Logstash (or other indexers)
that the bulk failed because the ingest queue is full. Logstash will retry sending documents.

*Possible actions*

Check {es} to see if it needs attention. The sketch after this list shows one quick check.

* {ref}/cluster-stats.html[Cluster stats API]
* {ref}/monitor-elasticsearch-cluster.html[Monitor a cluster]
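
For example, you can query the {es} node stats API and look for rejections in the `bulk` (older versions) or `write` thread pool. The host and port are assumptions; adjust them for your deployment.

[source,sh]
-----
# A high "rejected" count indicates a saturated ingest queue
curl -s 'http://localhost:9200/_nodes/stats/thread_pool?pretty'
-----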

*Sample error*

-----
[2018-08-21T20:05:36,111][INFO ][logstash.outputs.elasticsearch] retrying
failed action with response code: 429
({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of
org.elasticsearch.transport.TransportService$7@85be457 on
EsThreadPoolExecutor[bulk, queue capacity = 200,
org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@538c9d8a[Running,
pool size = 16, active threads = 16, queued tasks = 200, completed tasks =
685]]"})
-----

[float]
[[ts-performance]]
== General performance tuning

For general performance tuning tips and guidelines, see <<performance-tuning>>.

[float]
[[ts-pipeline]]
== Troubleshooting a pipeline

Pipelines, by definition, are unique. Here are some guidelines to help you get
started.

* Identify the offending pipeline.
* Start small. Create a minimum pipeline that manifests the problem.

For basic pipelines, this configuration could be enough to make the problem show itself.

[source,ruby]
-----
input {stdin{}} output {stdout{}}
-----
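
You can run this minimal pipeline directly from the command line with the `-e` flag, which avoids touching your existing configuration files:

[source,sh]
-----
bin/logstash -e 'input { stdin {} } output { stdout {} }'
-----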

{ls} can separate logs by pipeline. This feature can help you identify the offending pipeline.
Set `pipeline.separate_logs: true` in your `logstash.yml` to enable the log-per-pipeline feature.
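
For example, in `logstash.yml`:

[source,yaml]
-----
# Write each pipeline's log events to its own log file
pipeline.separate_logs: true
-----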

For more complex pipelines, the problem could be caused by a series of plugins in
a specific order. Troubleshooting these pipelines usually requires trial and error.
Start by systematically removing input and output plugins until you're left with
the minimum set that manifests the issue.

We want to expand this section to make it more helpful. If you have
troubleshooting tips to share, please:

* create an issue at https://github.com/elastic/logstash/issues, or
* create a pull request with your proposed changes at https://github.com/elastic/logstash.

[float]
[[ts-pipeline-logging-level-performance]]
=== Logging level can affect performance

*Symptoms*

Simple filters, such as `mutate` or `json`, can take several milliseconds per event to execute.
Inputs and outputs might be affected, too.

*Background*

The plugins running in Logstash can be quite verbose if the logging level is set to `debug` or `trace`.
Because the logging library used in Logstash is synchronous, heavy logging can affect performance.

*Solution*

Reset the logging level to `info`.
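
You can do this persistently by setting `log.level: info` in `logstash.yml`, or at runtime through the logging API. A sketch, assuming the Logstash HTTP API is listening on the default `localhost:9600`:

[source,sh]
-----
# Reset all logger levels to those defined in the log4j2 configuration
curl -XPUT 'http://localhost:9600/_node/logging/reset?pretty'
-----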

[float]
[[ts-kafka]]
== Common Kafka support issues and solutions

[float]
[[ts-kafka-timeout]]
=== Kafka session timeout issues (input side)

*Symptoms*

Throughput issues and duplicate event processing. {ls} logs warnings such as:

-----
[2017-10-18T03:37:59,302][WARN][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
Auto offset commit failed for group clap_tx1: Commit cannot be completed since
the group has already rebalanced and assigned the partitions to another member.
-----

The time between subsequent calls to `poll()` was longer than the
configured `session.timeout.ms`, which typically implies that the poll loop is
spending too much time processing messages. You can address this by
increasing the session timeout or by reducing the maximum size of batches
returned in `poll()` with `max.poll.records`.

-----
[INFO][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator] Revoking
previously assigned partitions [] for group log-ronline-node09
[2018-01-29T14:54:06,485][INFO][org.apache.kafka.clients.consumer.internals.ConsumerCoordinator]
Setting newly assigned partitions [elk-pmbr-9] for group log-pmbr
-----

*Background*

Kafka tracks the individual consumers in a consumer group (for example, a number
of {ls} instances) and tries to give each consumer one or more specific
partitions of the data in the topic they're consuming. In order to achieve this,
Kafka tracks whether or not a consumer ({ls} Kafka input thread) is making
progress on its assigned partition, and reassigns partitions that have not
made progress in a set timeframe.

When {ls} requests more events from the Kafka Broker than it can process within
the timeout, it triggers reassignment of partitions. Reassignment of partitions
takes time, and can cause duplicate processing of events and significant
throughput problems.

*Possible solutions*

* Reduce the number of records {ls} polls from the Kafka Broker in one request,
* Reduce the number of Kafka input threads, and/or
* Increase the relevant timeouts in the Kafka Consumer configuration.

*Details*

The `max_poll_records` option sets the number of records to be pulled in one request.
If it is set higher than the default value of 500, try reducing it.

The `consumer_threads` option sets the number of input threads. If the value exceeds
the number of pipeline workers configured in the `logstash.yml` file, it should
certainly be reduced.
If the value is greater than 4, try reducing it to `4` or less if the client has
the time/resources for it. Try starting with a value of `1`, and then
incrementing from there to find the optimal performance.

The `session_timeout_ms` option sets the relevant timeout. Set it to a value
that ensures that the number of events in `max_poll_records` can be safely
processed within the time limit.

-----
EXAMPLE
Pipeline throughput is `10k/s` and `max_poll_records` is set to `1k`. The
session timeout must then be at least 100ms if `consumer_threads` is set to `1`.
If it is set to a higher value `n`, then the minimum session timeout increases
proportionally to `n * 100ms`.
-----
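
Putting these options together, a starting point for the Kafka input might look like the following sketch. The broker address and topic are placeholders; tune the numbers against your own throughput.

[source,ruby]
-----
input {
  kafka {
    bootstrap_servers => "kafka1:9092"   # placeholder broker address
    topics => ["example-topic"]          # placeholder topic
    consumer_threads => 1                # start at 1, increment to tune
    max_poll_records => "250"            # fewer records per poll request
    session_timeout_ms => "60000"        # generous timeout for stalling outputs
  }
}
-----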

In practice, the value must be set much higher than the theoretical value because
the behavior of the outputs and filters in a pipeline follows a distribution.
The value should also be higher than the maximum time you expect your outputs to
stall. The default setting is `10s == 10000ms`. If you are experiencing
periodic problems with an output that can stall because of load or similar
effects (such as the Elasticsearch output), there is little downside to
increasing this value significantly to, say, `60s`.

From a performance perspective, decreasing the `max_poll_records` value is preferable
to increasing the timeout value. Increasing the timeout is your only option if the
client's issues are caused by periodically stalling outputs. Check logs for
evidence of stalling outputs, such as `ES output logging status 429`.

[float]
[[ts-kafka-many-offset-commits]]
=== Large number of offset commits (Kafka input side)

*Symptoms*

Logstash's Kafka Input is causing a much higher number of commits to
the offset topic than expected. Often the complaint also mentions redundant
offset commits where the same offset is committed repeatedly.

*Solution*

For Kafka Broker versions 0.10.2.1 to 1.0.x: The problem is caused by a bug in
Kafka (https://issues.apache.org/jira/browse/KAFKA-6362[KAFKA-6362]). The client's
best option is upgrading their Kafka Brokers to version 1.1 or newer.

For older versions of
Kafka or if the above does not fully resolve the issue: The problem can also be
caused by setting the value for `poll_timeout_ms` too low relative to the rate
at which the Kafka Brokers receive events themselves (or if Brokers periodically
idle between receiving bursts of events). Increasing the value set for
`poll_timeout_ms` proportionally decreases the number of offset commits in
this scenario. For example, raising it by 10x will lead to 10x fewer offset commits.
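
As a sketch, raising `poll_timeout_ms` from its default of `100` might look like this (the surrounding options are placeholders):

[source,ruby]
-----
input {
  kafka {
    bootstrap_servers => "kafka1:9092"   # placeholder broker address
    topics => ["example-topic"]          # placeholder topic
    poll_timeout_ms => 1000              # 10x the 100ms default => ~10x fewer offset commits
  }
}
-----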

[float]
[[ts-kafka-codec-errors-input]]
=== Codec Errors in Kafka Input (before Plugin Version 6.3.4 only)

*Symptoms*

Logstash Kafka input randomly logs errors from the configured codec and/or reads
events incorrectly (partial reads, mixing data between multiple events, etc.).

Log example:

-----
[2018-02-05T13:51:25,773][FATAL][logstash.runner ] An
unexpected error occurred! {:error=>#<TypeError: can't convert nil into String>,
:backtrace=>["org/jruby/RubyArray.java:1892:in `join'",
"org/jruby/RubyArray.java:1898:in `join'",
"/usr/share/logstash/logstash-core/lib/logstash/util/buftok.rb:87:in `extract'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-codec-line-3.0.8/lib/logstash/codecs/line.rb:38:in
`decode'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-5.1.11/lib/logstash/inputs/kafka.rb:241:in
`thread_runner'",
"file:/usr/share/logstash/vendor/jruby/lib/jruby.jar!/jruby/java/java_ext/java.lang.rb:12:in
`each'",
"/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-kafka-5.1.11/lib/logstash/inputs/kafka.rb:240:in
`thread_runner'"]}
-----

*Background*

There was a bug in the way the Kafka Input plugin handled codec instances
when running on multiple threads (`consumer_threads` set to > 1):
https://github.com/logstash-plugins/logstash-input-kafka/issues/210

*Solution*

* Upgrade the Kafka Input plugin to v6.3.4 or later, as shown in the sketch after this list.
* If (and only if) upgrading is not possible, set `consumer_threads` to `1`.
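
A sketch of both options, using the standard plugin manager for the upgrade:

[source,sh]
-----
# Option 1: upgrade the plugin in place
bin/logstash-plugin update logstash-input-kafka
-----

[source,ruby]
-----
# Option 2 (fallback): pin the input to a single consumer thread
input {
  kafka {
    bootstrap_servers => "kafka1:9092"   # placeholder broker address
    topics => ["example-topic"]          # placeholder topic
    consumer_threads => 1                # avoids the shared-codec bug
  }
}
-----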

[float]
[[ts-other]]
== Other issues

Coming soon