elasticsearch/docs/reference/troubleshooting/corruption-issues.asciidoc
David Turner 7103053f03
Add troubleshooting docs about data corruption (#88760)
Adds some docs giving more detailed background about what data
corruption really means and some suggestions about how to narrow down
the root cause.

Co-authored-by: Henning Andersen <33268011+henningandersen@users.noreply.github.com>
2022-07-28 11:23:23 +01:00

92 lines
5 KiB
Text

[[corruption-troubleshooting]]
== Troubleshooting corruption
{es} expects that the data it reads from disk is exactly the data it previously
wrote. If it detects that the data on disk is different from what it wrote then
it will report some kind of exception such as:
- `org.apache.lucene.index.CorruptIndexException`
- `org.elasticsearch.gateway.CorruptStateException`
- `org.elasticsearch.index.translog.TranslogCorruptedException`
Typically these exceptions happen due to a checksum mismatch. Most of the data
that {es} writes to disk is followed by a checksum using a simple algorithm
known as CRC32 which is fast to compute and good at detecting the kinds of
random corruption that may happen when using faulty storage. A CRC32 checksum
mismatch definitely indicates that something is faulty, although of course a
matching checksum doesn't prove the absence of corruption.
Verifying a checksum is expensive since it involves reading every byte of the
file which takes significant effort and might evict more useful data from the
filesystem cache, so systems typically don't verify the checksum on a file very
often. This is why you tend only to encounter a corruption exception when
something unusual is happening. For instance, corruptions are often detected
during merges, shard movements, and snapshots. This does not mean that these
proceses are causing corruption: they are examples of the rare times where
reading a whole file is necessary. {es} takes the opportunity to verify the
checksum at the same time, and this is when the corruption is detected and
reported. It doesn't indicate the cause of the corruption or when it happened.
Corruptions can remain undetected for many months.
The files that make up a Lucene index are written sequentially from start to
end and then never modified or overwritten. This access pattern means the
checksum computation is very simple and can happen on-the-fly as the file is
initially written, and also makes it very unlikely that an incorrect checksum
is due to a userspace bug at the time the file was written. The routine that
computes the checksum is straightforward, widely used, and very well-tested, so
you can be very confident that a checksum mismatch really does indicate that
the data read from disk is different from the data that {es} previously wrote.
The files that make up a Lucene index are written in full before they are used.
If a file is needed to recover an index after a restart then your storage
system previously confirmed to {es} that this file was durably synced to disk.
On Linux this means that the `fsync()` system call returned successfully. {es}
sometimes detects that an index is corrupt because a file needed for recovery
has been truncated or is missing its footer. This indicates that your storage
system acknowledges durable writes incorrectly.
There are many possible explanations for {es} detecting corruption in your
cluster. Databases like {es} generate a challenging I/O workload that may find
subtle infrastructural problems which other tests may miss. {es} is known to
expose the following problems as file corruptions:
- Filesystem bugs, especially in newer and nonstandard filesystems which might
not have seen enough real-world production usage to be confident that they
work correctly.
- https://www.elastic.co/blog/canonical-elastic-and-google-team-up-to-prevent-data-corruption-in-linux[Kernel bugs].
- Bugs in firmware running on the drive or RAID controller.
- Incorrect configuration, for instance configuring `fsync()` to report success
before all durable writes have completed.
- Faulty hardware, which may include the drive itself, the RAID controller,
your RAM or CPU.
Data corruption typically doesn't result in other evidence of problems apart
from the checksum mismatch. Do not interpret this as an indication that your
storage subsystem is working correctly and therefore that {es} itself caused
the corruption. It is rare for faulty storage to show any evidence of problems
apart from the data corruption, but data corruption itself is a very strong
indicator that your storage subsystem is not working correctly.
To rule out {es} as the source of data corruption, generate an I/O workload
using something other than {es} and look for data integrity errors. On Linux
the `fio` and `stress-ng` tools can both generate challenging I/O workloads and
verify the integrity of the data they write. Use version 0.12.01 or newer of
`stress-ng` since earlier versions do not have strong enough integrity checks.
You can check that durable writes persist across power outages using a script
such as https://gist.github.com/bradfitz/3172656[`diskchecker.pl`].
To narrow down the source of the corruptions, systematically change components
in your cluster's environment until the corruptions stop. The details will
depend on the exact configuration of your hardware, but may include the
following:
- Try a different filesystem or a different kernel.
- Try changing each hardware component in turn, ideally changing to a different
model or manufacturer.
- Try different firmware versions for each hardware component.