More balanced docs about NFS etc (#85060)

Today we don't really say anything about the requirements for the data
path in terms of correctness, and we specifically say to avoid NFS for
performance reasons. This isn't wholly accurate: some NFS
implementations work just fine. This commit documents a more balanced
position on local vs remote storage.
David Turner 2022-03-18 13:01:59 +00:00 committed by GitHub
parent b0ab9394c7
commit ff742fcb27
GPG key ID: 4AEE18F83AFDEB23
3 changed files with 41 additions and 21 deletions

View file

@@ -96,16 +96,11 @@ faster.
 [discrete]
 === Use faster hardware
 
-If indexing is I/O bound, you should investigate giving more memory to the
-filesystem cache (see above) or buying faster drives. In particular SSD drives
-are known to perform better than spinning disks. Always use local storage,
-remote filesystems such as `NFS` or `SMB` should be avoided. Also beware of
-virtualized storage such as Amazon's `Elastic Block Storage`. Virtualized
-storage works very well with Elasticsearch, and it is appealing since it is so
-fast and simple to set up, but it is also unfortunately inherently slower on an
-ongoing basis when compared to dedicated local storage. If you put an index on
-`EBS`, be sure to use provisioned IOPS otherwise operations could be quickly
-throttled.
+If indexing is I/O-bound, consider increasing the size of the filesystem cache
+(see above) or using faster storage. Elasticsearch generally creates individual
+files with sequential writes. However, indexing involves writing multiple files
+concurrently, and a mix of random and sequential reads too, so SSD drives tend
+to perform better than spinning disks.
 
 Stripe your index across multiple SSDs by configuring a RAID 0 array. Remember
 that it will increase the risk of failure since the failure of any one SSD
@@ -115,6 +110,14 @@ different nodes so there's redundancy for any node failures. You can also use
 <<modules-snapshots,snapshot and restore>> to backup the index for further
 insurance.
 
+Directly-attached (local) storage generally performs better than remote storage
+because it is simpler to configure well and avoids communications overheads.
+With careful tuning it is sometimes possible to achieve acceptable performance
+using remote storage too. Benchmark your system with a realistic workload to
+determine the effects of any tuning parameters. If you cannot achieve the
+performance you expect, work with the vendor of your storage system to identify
+the problem.
+
 [discrete]
 === Indexing buffer size

View file

@@ -12,18 +12,21 @@ index in physical memory.
 [discrete]
 === Use faster hardware
 
-If your search is I/O bound, you should investigate giving more memory to the
-filesystem cache (see above) or buying faster drives. In particular SSD drives
-are known to perform better than spinning disks. Always use local storage,
-remote filesystems such as `NFS` or `SMB` should be avoided. Also beware of
-virtualized storage such as Amazon's `Elastic Block Storage`. Virtualized
-storage works very well with Elasticsearch, and it is appealing since it is so
-fast and simple to set up, but it is also unfortunately inherently slower on an
-ongoing basis when compared to dedicated local storage. If you put an index on
-`EBS`, be sure to use provisioned IOPS otherwise operations could be quickly
-throttled.
+If your searches are I/O-bound, consider increasing the size of the filesystem
+cache (see above) or using faster storage. Each search involves a mix of
+sequential and random reads across multiple files, and there may be many
+searches running concurrently on each shard, so SSD drives tend to perform
+better than spinning disks.
+
+Directly-attached (local) storage generally performs better than remote storage
+because it is simpler to configure well and avoids communications overheads.
+With careful tuning it is sometimes possible to achieve acceptable performance
+using remote storage too. Benchmark your system with a realistic workload to
+determine the effects of any tuning parameters. If you cannot achieve the
+performance you expect, work with the vendor of your storage system to identify
+the problem.
 
-If your search is CPU-bound, you should investigate buying faster CPUs.
+If your searches are CPU-bound, consider using a larger number of faster CPUs.
 
 [discrete]
 === Document modeling

View file

@@ -454,6 +454,20 @@ Like all node settings, it can also be specified on the command line as:
 ./bin/elasticsearch -Epath.data=/var/elasticsearch/data
 ----
 
+The contents of the `path.data` directory must persist across restarts, because
+this is where your data is stored. {es} requires the filesystem to act as if it
+were backed by a local disk, but this means that it will work correctly on
+properly-configured remote block devices (e.g. a SAN) and remote filesystems
+(e.g. NFS) as long the remote storage behaves no differently from local
+storage. You can run multiple {es} nodes on the same filesystem, but each {es}
+node must have its own data path.
+
+The performance of an {es} cluster is often limited by the performance of the
+underlying storage, so you must ensure that your storage supports acceptable
+performance. Some remote storage performs very poorly, especially under the
+kind of load that {es} imposes, so make sure to benchmark your system carefully
+before committing to a particular storage architecture.
+
 TIP: When using the `.zip` or `.tar.gz` distributions, the `path.data` setting
 should be configured to locate the data directory outside the {es} home
 directory, so that the home directory can be deleted without deleting your data!
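
Reviewer note (not part of the commit): the new text asks readers to benchmark candidate storage with a realistic workload. Before setting up such a benchmark, a rough first probe of a candidate data directory might look like the sketch below. The function names `time_fsync_writes` and `summarize` are hypothetical and the block size and count are arbitrary assumptions. The probe measures only small synchronous writes, which is just one of the access patterns the docs describe, so it can flag obviously slow storage but cannot show that storage is fast enough.

```python
import os
import tempfile
import time


def time_fsync_writes(directory, count=50, size=4096):
    """Write `count` blocks of `size` bytes, calling fsync after each one.

    Returns a list with the elapsed seconds of each write+fsync pair.
    """
    payload = b"\0" * size
    latencies = []
    # Create the scratch file inside the directory under test so that the
    # writes land on the same filesystem as the candidate data path.
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync-probe-")
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to flush this file to the storage device
            latencies.append(time.perf_counter() - start)
    finally:
        # Always clean up the scratch file, even if a write fails.
        os.close(fd)
        os.remove(path)
    return latencies


def summarize(latencies):
    """Return (median, worst) latency in milliseconds."""
    ordered = sorted(latencies)
    median = ordered[len(ordered) // 2]
    return median * 1000.0, ordered[-1] * 1000.0
```

Calling `summarize(time_fsync_writes("/var/elasticsearch/data"))` on a local disk and on the remote mount gives two pairs of numbers to compare. A large gap is a reason to investigate the storage configuration, while a small gap still leaves the realistic benchmark to be done.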