[[corruption-troubleshooting]]
== Troubleshooting corruption

{es} expects that the data it reads from disk is exactly the data it previously
wrote. If it detects that the data on disk is different from what it wrote,
it will report an exception such as:

- `org.apache.lucene.index.CorruptIndexException`
- `org.elasticsearch.gateway.CorruptStateException`
- `org.elasticsearch.index.translog.TranslogCorruptedException`

Typically these exceptions happen due to a checksum mismatch. Most of the data
that {es} writes to disk is followed by a checksum computed with a simple
algorithm known as CRC32, which is fast to compute and good at detecting the
kinds of random corruption that may happen when using faulty storage. A CRC32
checksum mismatch definitely indicates that something is faulty, although of
course a matching checksum doesn't prove the absence of corruption.

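As an illustration of why CRC32 catches this kind of fault, here is a minimal
Java sketch (an illustrative example, not {es} code) showing that even a
single flipped bit changes the checksum:

[source,java]
----
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class Crc32Demo {
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] data = "the quick brown fox".getBytes(StandardCharsets.UTF_8);
        long before = crc32(data);
        data[3] ^= 0x01; // simulate a single bit flipped by faulty storage
        long after = crc32(data);
        // CRC32 detects every single-bit error, so the two checksums cannot
        // match and the corruption is caught when the data is next verified.
        System.out.printf("before=%08x after=%08x match=%b%n",
                before, after, before == after);
    }
}
----
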
Verifying a checksum is expensive since it involves reading every byte of the
file, which takes significant effort and might evict more useful data from the
filesystem cache, so systems typically don't verify the checksum on a file very
often. This is why you tend to encounter a corruption exception only when
something unusual is happening. For instance, corruptions are often detected
during merges, shard movements, and snapshots. This does not mean that these
processes are causing corruption: they are examples of the rare times where
reading a whole file is necessary. {es} takes the opportunity to verify the
checksum at the same time, and this is when the corruption is detected and
reported. It doesn't indicate the cause of the corruption or when it happened.
Corruptions can remain undetected for many months.

The files that make up a Lucene index are written sequentially from start to
end and then never modified or overwritten. This access pattern means the
checksum computation is very simple and can happen on-the-fly as the file is
initially written, and also makes it very unlikely that an incorrect checksum
is due to a userspace bug at the time the file was written. The routine that
computes the checksum is straightforward, widely used, and very well-tested, so
you can be very confident that a checksum mismatch really does indicate that
the data read from disk is different from the data that {es} previously wrote.

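To make this concrete, here is a minimal Java sketch of the general pattern (a
simplified illustration with a hypothetical 8-byte footer, not Lucene's actual
file format): the checksum accumulates on-the-fly during the sequential write,
and verification later requires reading every byte back:

[source,java]
----
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;

public class AppendOnlyChecksums {
    // Write the payload sequentially; the CRC32 is updated on-the-fly as each
    // byte streams out, then appended as an 8-byte footer.
    static void writeWithFooter(Path path, byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        try (DataOutputStream out = new DataOutputStream(new CheckedOutputStream(
                new BufferedOutputStream(Files.newOutputStream(path)), crc))) {
            out.write(payload);
            out.writeLong(crc.getValue()); // footer holds the payload's checksum
        }
    }

    // Verifying means reading every byte back: recompute the checksum of the
    // payload and compare it with the stored footer.
    static boolean verify(Path path) throws IOException {
        byte[] all = Files.readAllBytes(path);
        if (all.length < Long.BYTES) {
            return false; // truncated file: the footer is missing entirely
        }
        CRC32 crc = new CRC32();
        crc.update(all, 0, all.length - Long.BYTES);
        long stored = ByteBuffer.wrap(all, all.length - Long.BYTES, Long.BYTES).getLong();
        return crc.getValue() == stored;
    }
}
----
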
The files that make up a Lucene index are written in full before they are used.
If a file is needed to recover an index after a restart, then your storage
system previously confirmed to {es} that this file was durably synced to disk.
On Linux this means that the `fsync()` system call returned successfully. {es}
sometimes detects that an index is corrupt because a file needed for recovery
has been truncated or is missing its footer. This indicates that your storage
system acknowledges durable writes incorrectly.

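In Java, this durability guarantee is requested with `FileChannel.force`, which
corresponds to `fsync()` on Linux. A minimal sketch of a durable write
(illustrative only, not {es} code):

[source,java]
----
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DurableWrite {
    // Write the data, then block until the storage system confirms it is
    // durable. If force() returns normally, the storage stack has promised
    // that the data will survive a power loss.
    static void writeDurably(Path path, byte[] data) throws IOException {
        try (FileChannel channel = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer buffer = ByteBuffer.wrap(data);
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            channel.force(true); // issues fsync() on Linux
        }
    }
}
----
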
There are many possible explanations for {es} detecting corruption in your
cluster. Databases like {es} generate a challenging I/O workload that may find
subtle infrastructural problems which other tests may miss. {es} is known to
expose the following problems as file corruptions:

- Filesystem bugs, especially in newer and nonstandard filesystems which might
not have seen enough real-world production usage to be confident that they
work correctly.

- https://www.elastic.co/blog/canonical-elastic-and-google-team-up-to-prevent-data-corruption-in-linux[Kernel bugs].

- Bugs in firmware running on the drive or RAID controller.

- Incorrect configuration, for instance configuring `fsync()` to report success
before all durable writes have completed.

- Faulty hardware, which may include the drive itself, the RAID controller,
your RAM or CPU.

Data corruption typically doesn't result in other evidence of problems apart
from the checksum mismatch. Do not interpret this as an indication that your
storage subsystem is working correctly and therefore that {es} itself caused
the corruption. It is rare for faulty storage to show any evidence of problems
apart from the data corruption, but data corruption itself is a very strong
indicator that your storage subsystem is not working correctly.

To rule out {es} as the source of data corruption, generate an I/O workload
using something other than {es} and look for data integrity errors. On Linux
the `fio` and `stress-ng` tools can both generate challenging I/O workloads and
verify the integrity of the data they write. Use version 0.12.01 or newer of
`stress-ng`, since earlier versions do not have strong enough integrity checks.
You can check that durable writes persist across power outages using a script
such as https://gist.github.com/bradfitz/3172656[`diskchecker.pl`].

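For example, an invocation along the following lines (the directory is a
placeholder, and options vary between `fio` versions, so check `man fio` on
your system) writes out data and then reads every block back, verifying a
checksum over each one:

[source,sh]
----
fio --name=integrity-test --directory=/path/to/test/dir \
    --rw=write --size=1g --verify=crc32c --do_verify=1
----
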
To narrow down the source of the corruptions, systematically change components
in your cluster's environment until the corruptions stop. The details will
depend on the exact configuration of your hardware, but may include the
following:

- Try a different filesystem or a different kernel.

- Try changing each hardware component in turn, ideally changing to a different
model or manufacturer.

- Try different firmware versions for each hardware component.