mirror of https://github.com/elastic/elasticsearch.git synced 2025-06-27 17:10:22 -04:00

Nik Everett 10336c950c ESQL: Speed loading stored fields (#127348)	2025-04-29 23:20:15 +02:00

This speeds up loading from stored fields by opting more blocks into the "sequential" strategy. This really kicks in when loading stored fields like `text`, and when you need less than 100% of documents but more than, say, 10%. It is most useful when you need something like 99.9% of documents. Here are the perf numbers:

```
%100.0 {"took": 403 -> 401,"documents_found":1000000}
%099.9 {"took":3990 -> 436,"documents_found": 999000}
%099.0 {"took":4069 -> 440,"documents_found": 990000}
%090.0 {"took":3468 -> 421,"documents_found": 900000}
%030.0 {"took":1213 -> 152,"documents_found": 300000}
%020.0 {"took": 766 -> 104,"documents_found": 200000}
%010.0 {"took": 397 ->  55,"documents_found": 100000}
%009.0 {"took": 352 -> 375,"documents_found":  90000}
%008.0 {"took": 304 -> 317,"documents_found":  80000}
%007.0 {"took": 273 -> 287,"documents_found":  70000}
%005.0 {"took": 199 -> 204,"documents_found":  50000}
%001.0 {"took":  46 ->  46,"documents_found":  10000}
```

Let's explain this with an example. First, jump to `main` and load a million documents:

```
rm -f /tmp/bulk
for a in {1..1000}; do
    echo '{"index":{}}' >> /tmp/bulk
    echo '{"text":"text '$(printf %04d $a)'"}' >> /tmp/bulk
done
curl -s -uelastic:password -HContent-Type:application/json -XDELETE localhost:9200/test
for a in {1..1000}; do
    echo -n $a:
    curl -s -uelastic:password -HContent-Type:application/json -XPOST localhost:9200/test/_bulk?pretty --data-binary @/tmp/bulk | grep errors
done
curl -s -uelastic:password -HContent-Type:application/json -XPOST localhost:9200/test/_forcemerge?max_num_segments=1
curl -s -uelastic:password -HContent-Type:application/json -XPOST localhost:9200/test/_refresh
echo
```

Now query them all. Run this a few times until it's stable:

```
echo -n "%100.0 "
curl -s -uelastic:password -HContent-Type:application/json -XPOST 'localhost:9200/_query?pretty' -d'{
    "query": "FROM test | STATS SUM(LENGTH(text))",
    "pragma": {
        "data_partitioning": "shard"
    }
}' | jq -c '{took, documents_found}'
```

Now fetch 99.9% of documents:

```
echo -n "%099.9 "
curl -s -uelastic:password -HContent-Type:application/json -XPOST 'localhost:9200/_query?pretty' -d'{
    "query": "FROM test | WHERE NOT text.keyword IN (\"text 0998\") | STATS SUM(LENGTH(text))",
    "pragma": {
        "data_partitioning": "shard"
    }
}' | jq -c '{took, documents_found}'
```

This should spit out something like:

```
%100.0 {"took": 403,"documents_found":1000000}
%099.9 {"took":4098,"documents_found": 999000}
```

We're loading fewer documents but it's slower! What in the world?! If you dig into the profile you'll see that it's value loading:

```
$ curl -s -uelastic:password -HContent-Type:application/json -XPOST 'localhost:9200/_query?pretty' -d'{
    "query": "FROM test | STATS SUM(LENGTH(text))",
    "pragma": {
        "data_partitioning": "shard"
    },
    "profile": true
}' | jq '.profile.drivers[].operators[] | select(.operator | contains("ValuesSourceReaderOperator"))'
{
  "operator": "ValuesSourceReaderOperator[fields = [text]]",
  "status": {
    "readers_built": {
      "stored_fields[requires_source:true, fields:0, sequential: true]": 222,
      "text:column_at_a_time:null": 222,
      "text:row_stride:BlockSourceReader.Bytes": 1
    },
    "values_loaded": 1000000,
    "process_nanos": 370687157,
    "pages_processed": 222,
    "rows_received": 1000000,
    "rows_emitted": 1000000
  }
}
$ curl -s -uelastic:password -HContent-Type:application/json -XPOST 'localhost:9200/_query?pretty' -d'{
    "query": "FROM test | WHERE NOT text.keyword IN (\"text 0998\") | STATS SUM(LENGTH(text))",
    "pragma": {
        "data_partitioning": "shard"
    },
    "profile": true
}' | jq '.profile.drivers[].operators[] | select(.operator | contains("ValuesSourceReaderOperator"))'
{
  "operator": "ValuesSourceReaderOperator[fields = [text]]",
  "status": {
    "readers_built": {
      "stored_fields[requires_source:true, fields:0, sequential: false]": 222,
      "text:column_at_a_time:null": 222,
      "text:row_stride:BlockSourceReader.Bytes": 1
    },
    "values_loaded": 999000,
    "process_nanos": 3965803793,
    "pages_processed": 222,
    "rows_received": 999000,
    "rows_emitted": 999000
  }
}
```

It jumps from 370ms to almost four seconds! Loading fewer values! The second big difference is in the `stored_fields` marker. In the second one it's `sequential: false` and in the first `sequential: true`.

`sequential: true` uses Lucene's "merge" stored fields reader instead of the default one. It's much more optimized at decoding sequences of documents.

Previously we only enabled this reader when loading compact sequences of documents, that is, when the entire block looks like

```
1, 2, 3, 4, 5, ... 1230, 1231
```

If there were any gaps we wouldn't enable it. That was a very conservative choice we made long ago without doing any experiments. We knew it was faster without any gaps, but not otherwise. It turns out it's a lot faster in a lot more cases. I've measured it as faster for 99% gaps, at least on simple documents. I'm a bit worried that this is too aggressive, so I've made it configurable and made the default to use the "merge" loader with 10% gaps. So we'd use the merge loader with a block like:

```
1, 11, 21, 31, ..., 1231, 1241
```
src	ESQL: Speed loading stored fields (#127348 )	2025-04-29 23:20:15 +02:00
build.gradle	ESQL: Fix EvalBenchmark (#124736 )	2025-03-14 20:19:20 +00:00
README.md	ESQL: Speed up TO_IP (#126338 )	2025-04-07 09:34:53 -04:00
run.sh	Add benchmark script (#126596 )	2025-04-18 19:09:38 +02:00

README.md

Elasticsearch Microbenchmark Suite

This directory contains the microbenchmark suite of Elasticsearch. It relies on JMH.

Purpose

We do not want to microbenchmark everything but the kitchen sink and should typically rely on our macrobenchmarks with Rally. Microbenchmarks are intended to spot performance regressions in performance-critical components. The microbenchmark suite is also handy for ad-hoc microbenchmarks, but please remove them again before merging your PR.

Getting Started

Just run gradlew -p benchmarks run from the project root directory. It will build all microbenchmarks, execute them and print the result.

Running Microbenchmarks

Running via an IDE is not supported as the results are meaningless because we have no control over the JVM running the benchmarks.

If you want to run a specific benchmark class like, say, MemoryStatsBenchmark, you can use --args:

gradlew -p benchmarks run --args 'MemoryStatsBenchmark'

Everything inside the single quotes is passed on the command line to JMH.

You can set benchmark parameters with -p:

gradlew -p benchmarks/ run --args 'RoundingBenchmark.round -prounder=es -prange="2000-10-01 to 2000-11-01" -pzone=America/New_York -pinterval=10d -pcount=1000000'

The benchmark code defines default values for the parameters, so if you leave any out JMH will run with each default value, one after the other. This will run with interval set to calendar year then calendar hour then 10d then 5d then 1h:

gradlew -p benchmarks/ run --args 'RoundingBenchmark.round -prounder=es -prange="2000-10-01 to 2000-11-01" -pzone=America/New_York -pcount=1000000'
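The `-p` flags in the commands above follow a simple `-pname=value` pattern, so the `--args` string can be assembled mechanically. What follows is a minimal sketch, not part of the benchmark suite: `jmh_cmd` is a hypothetical helper name, and it only prints the command so you can inspect it before running it yourself.

```shell
# Hypothetical helper (not part of the repository): print, but do not run,
# a gradlew command line for one benchmark.
# Usage: jmh_cmd BENCHMARK [name=value ...]
jmh_cmd() {
    jmh_args=$1
    shift
    for jmh_param in "$@"; do
        jmh_name=${jmh_param%%=*}    # text before the first '='
        jmh_value=${jmh_param#*=}    # text after the first '='
        case $jmh_value in
            # Values containing spaces get double quotes, as in the README.
            *" "*) jmh_args="$jmh_args -p$jmh_name=\"$jmh_value\"" ;;
            *)     jmh_args="$jmh_args -p$jmh_name=$jmh_value" ;;
        esac
    done
    printf "gradlew -p benchmarks/ run --args '%s'\n" "$jmh_args"
}

# Leaves out interval, so JMH would run once per default interval value.
jmh_cmd RoundingBenchmark.round rounder=es "range=2000-10-01 to 2000-11-01" count=1000000
```

Any parameter you omit from the list is simply absent from the printed command, which is what makes JMH fall back to the defaults declared in the benchmark code.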

Adding Microbenchmarks

Before adding a new microbenchmark, make yourself familiar with the JMH API. You can check our existing microbenchmarks and also the JMH samples.

In contrast to tests, the actual name of the benchmark class is not relevant to JMH. However, stick to the naming convention and end the class name of a benchmark with Benchmark. To have JMH execute a benchmark, annotate the respective methods with @Benchmark.

Tips and Best Practices

To get realistic results, you should exercise care when running benchmarks. Here are a few tips:

Do

Ensure that the system executing your microbenchmarks has as little load as possible. Shut down every process that can cause unnecessary runtime jitter. Watch the Error column in the benchmark results to see the run-to-run variance.
Make sure to run enough warmup iterations to get the benchmark into a stable state. If you are unsure, don't change the defaults.
Avoid CPU migrations by pinning your benchmarks to specific CPU cores. On Linux you can use taskset.
Fix the CPU frequency to prevent Turbo Boost from kicking in and skewing your results. On Linux you can use cpufreq-set and the performance CPU governor.
Vary the problem input size with @Param.
Use the integrated profilers in JMH to dig deeper if benchmark results do not match your hypotheses:
- Add -prof gc to the options to check whether the garbage collector runs during a microbenchmark and skews your results. If so, try to force a GC between runs (-gc true) but watch out for the caveats.
- Add -prof perf or -prof perfasm (both only available on Linux, see Disassembling below) to see hotspots.
- Add -prof async to see hotspots.
Have your benchmarks peer-reviewed.
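
Before a run it can help to confirm that the performance governor mentioned above is actually active. The sketch below is an illustration only: `governor_status` is a hypothetical helper, and the default sysfs path is the usual Linux cpufreq location, which is an assumption and may be absent in VMs or containers.

```shell
# Hypothetical helper: report whether a cpufreq governor file says "performance".
# Usage: governor_status [FILE]
governor_status() {
    # Assumed default: the common Linux sysfs path for CPU 0.
    governor_file=${1:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor}
    if [ ! -r "$governor_file" ]; then
        echo "unknown"
        return 0
    fi
    governor=$(cat "$governor_file")
    if [ "$governor" = "performance" ]; then
        echo "ok"
    else
        echo "warn: governor is $governor"
    fi
}
```

Calling `governor_status` with no argument checks only CPU 0; other cores may be configured differently, so treat an "ok" as a hint rather than proof.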

Don't

Blindly believe the numbers that your microbenchmark produces. Verify them by measuring, e.g. with -prof perfasm.
Run more threads than your number of CPU cores (in case you run multi-threaded microbenchmarks).
Look only at the Score column and ignore Error. Instead, take countermeasures to keep Error low / variance explainable.
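
For the thread-count advice, one way to stay within the core count is to clamp the requested number before handing it to JMH's `-t` option. This is a sketch under assumptions: `max_threads` is a hypothetical helper, and it relies on `getconf _NPROCESSORS_ONLN`, which reports online processors on Linux and macOS.

```shell
# Hypothetical helper: never return more threads than online CPU cores.
# Usage: max_threads REQUESTED
max_threads() {
    requested=$1
    cores=$(getconf _NPROCESSORS_ONLN)
    if [ "$requested" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$requested"
    fi
}

# Example (printed only, not executed):
echo "gradlew -p benchmarks run --args 'MemoryStatsBenchmark -t $(max_threads 8)'"
```

Note that online processors usually count hardware threads, not physical cores, so on machines with SMT you may want an even lower number.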

Disassembling

NOTE: Linux only. Sorry Mac and Windows.

Disassembling is fun! Maybe not always useful, but always fun! Generally, you'll want to install perf and the JDK's hsdis. perf is generally available via apt-get install perf or pacman -S perf linux-tools. hsdis you'll want to compile from source, which is a little more involved. This worked on 2020-08-01:

git clone git@github.com:openjdk/jdk.git
cd jdk
git checkout jdk-24-ga
# Get a known good binutils
wget https://ftp.gnu.org/gnu/binutils/binutils-2.35.tar.gz
tar xf binutils-2.35.tar.gz
bash configure --with-hsdis=binutils --with-binutils-src=binutils-2.35 \
    --with-boot-jdk=~/.gradle/jdks/oracle_corporation-24-amd64-linux.2
make build-hsdis
cp ./build/linux-x86_64-server-release/jdk/lib/hsdis-amd64.so \
    ~/.gradle/jdks/oracle_corporation-24-amd64-linux.2/lib/hsdis.so

If you want to disassemble a single method do something like this:

gradlew -p benchmarks run --args 'MemoryStatsBenchmark -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*.yourMethodName -XX:PrintAssemblyOptions=intel"'

If you want perf to find the hot methods for you, then add -prof perfasm.

NOTE: perfasm will need more access:

sudo bash
echo -1 > /proc/sys/kernel/perf_event_paranoid
exit
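
To see whether the setting took effect without becoming root again, you can read the value back. The following is a minimal sketch: `perf_paranoid_ok` is a hypothetical helper, and it assumes that lower values of `perf_event_paranoid` are more permissive, which is how the kernel defines the setting.

```shell
# Hypothetical helper: succeed if the current perf_event_paranoid value is
# at most LIMIT (lower values are more permissive).
# Usage: perf_paranoid_ok LIMIT [FILE]
perf_paranoid_ok() {
    paranoid_limit=$1
    paranoid_file=${2:-/proc/sys/kernel/perf_event_paranoid}
    # Exit status 2 means the value could not be read at all.
    [ -r "$paranoid_file" ] || return 2
    paranoid_current=$(cat "$paranoid_file")
    [ "$paranoid_current" -le "$paranoid_limit" ]
}

# Example: perfasm in this README wants -1.
# perf_paranoid_ok -1 && echo "perfasm should have enough access"
```

The optional second argument exists only so the check can be tried against a scratch file; in normal use you would leave it out.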

If you get warnings like:

The perf event count is suspiciously low (0).

then check whether you are bumping into this problem by running:

perf stat -B dd if=/dev/zero of=/dev/null count=1000000

If you see lines like:

         765019980      cpu_atom/cycles/                 #    1.728 GHz                         (0.60%)
        2258845959      cpu_core/cycles/                 #    5.103 GHz                         (99.18%)

then perf is just not going to work for you.
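
The check above can be automated by searching the perf output for the split `cpu_atom` / `cpu_core` counters. This is a sketch only: `has_hybrid_counters` is a hypothetical helper that reads text on standard input and matches the counter names shown in the sample lines.

```shell
# Hypothetical helper: succeed if stdin contains split cpu_atom/cpu_core
# cycle counters like the ones shown above.
has_hybrid_counters() {
    grep -Eq 'cpu_(atom|core)/cycles/'
}

# Example (perf stat writes its report to stderr, hence 2>&1):
# perf stat -B dd if=/dev/zero of=/dev/null count=1000000 2>&1 \
#     | has_hybrid_counters && echo "perf is not going to work here"
```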

Async Profiler

Note: Linux and Mac only. Sorry Windows.

IMPORTANT: The 2.0 version of the profiler doesn't seem to be compatible with JMH as of 2021-04-30.

The async profiler is neat because it does not suffer from the safepoint bias problem. And because it makes pretty flame graphs!

Let user processes read performance stuff:

sudo bash
echo 0 > /proc/sys/kernel/kptr_restrict
echo 1 > /proc/sys/kernel/perf_event_paranoid
exit

Grab the async profiler from https://github.com/jvm-profiling-tools/async-profiler and run -prof async like so:

gradlew -p benchmarks/ run --args 'LongKeyedBucketOrdsBenchmark.multiBucket -prof "async:libPath=/home/nik9000/Downloads/async-profiler-3.0-29ee888-linux-x64/lib/libasyncProfiler.so;dir=/tmp/prof;output=flamegraph"'
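
Because the library path in the command above is specific to one machine, it can be convenient to build the `-prof` argument from a variable. A minimal sketch, assuming the same `libPath`, `dir` and `output` keys used above; `async_prof_arg` is a hypothetical helper and the paths in the example are placeholders.

```shell
# Hypothetical helper: print the -prof argument for the async profiler.
# Usage: async_prof_arg LIB_PATH [OUTPUT_DIR]
async_prof_arg() {
    async_lib=$1
    async_dir=${2:-/tmp/prof}
    printf '%s\n' "-prof \"async:libPath=$async_lib;dir=$async_dir;output=flamegraph\""
}

# Example with a placeholder path; substitute wherever you unpacked the profiler.
echo "gradlew -p benchmarks/ run --args 'LongKeyedBucketOrdsBenchmark.multiBucket $(async_prof_arg /path/to/libasyncProfiler.so)'"
```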

Note: As of January 2025 the latest release of async profiler doesn't work with our JDK but the nightly is fine.

If you are on Mac, this'll warn you that you downloaded the shared library from the internet. You'll need to go to settings and allow it to run.

The profiler tells you it'll be more accurate if you install debug symbols with the JVM. I didn't, and the results looked pretty good to me. (2021-02-01)