Lower contention on requests with many aggs (#66895)
This lowers the contention on the `REQUEST` circuit breaker when building
many aggregations on many threads by preallocating a chunk of the breaker
up front. This cuts down on the number of times we enter the busy loop
in `ChildMemoryCircuitBreaker.limit`. Now we hit it once when building
aggregations. We still hit the busy loop if we collect many buckets.

We let the `AggregationBuilder` pick the size of the "chunk" that we
preallocate, but it doesn't have much to go on - not even the field types.
But it is available in a convenient spot and the estimates don't have to
be particularly accurate.

The benchmarks on my 12 core desktop are interesting:
```
Benchmark         (breaker)  Mode  Cnt    Score    Error  Units
sum                    noop  avgt   10    1.672 ±  0.042  us/op
sum                    real  avgt   10    4.100 ±  0.027  us/op
sum             preallocate  avgt   10    4.230 ±  0.034  us/op
termsSixtySums         noop  avgt   10   92.658 ±  0.939  us/op
termsSixtySums         real  avgt   10  278.764 ± 39.751  us/op
termsSixtySums  preallocate  avgt   10  120.896 ± 16.097  us/op
termsSum               noop  avgt   10    4.573 ±  0.095  us/op
termsSum               real  avgt   10    9.932 ±  0.211  us/op
termsSum        preallocate  avgt   10    7.695 ±  0.313  us/op
```

They show pretty clearly that not using the circuit breaker at all is
faster. But we can't do that because we don't want to bring the node
down on bad aggs. When there are many aggs (termsSixtySums) the
preallocation claws back much of the performance. It even helps
marginally when there are two aggs (termsSum). For a single agg (sum)
we see a 130 nanosecond hit. Fine.

But these values are all pretty small. At best we're seeing a 160
microsecond savings. Not so on a 160 vCPU machine:

```
Benchmark         (breaker)  Mode  Cnt      Score       Error  Units
sum                    noop  avgt   10     44.956 ±     8.851  us/op
sum                    real  avgt   10    118.008 ±    19.505  us/op
sum             preallocate  avgt   10    241.234 ±   305.998  us/op
termsSixtySums         noop  avgt   10   1339.802 ±    51.410  us/op
termsSixtySums         real  avgt   10  12077.671 ± 12110.993  us/op
termsSixtySums  preallocate  avgt   10   3804.515 ±  1458.702  us/op
termsSum               noop  avgt   10     59.478 ±     2.261  us/op
termsSum               real  avgt   10    293.756 ±   253.854  us/op
termsSum        preallocate  avgt   10    197.963 ±    41.578  us/op
```

All of these numbers are larger because we're running all the CPUs
flat out and we're seeing more contention everywhere. Even the "noop"
breaker sees some contention, but I think it is mostly around memory
allocation. Anyway, with many, many aggs (termsSixtySums) we're looking
at 8 milliseconds of savings by preallocating, just by dodging the busy
loop as much as possible. The error in the measurements there is
substantial. Here are the runs:
```
real:
Iteration   1: 8679.417 ±(99.9%) 273.220 us/op
Iteration   2: 5849.538 ±(99.9%) 179.258 us/op
Iteration   3: 5953.935 ±(99.9%) 152.829 us/op
Iteration   4: 5763.465 ±(99.9%) 150.759 us/op
Iteration   5: 14157.592 ±(99.9%) 395.224 us/op
Iteration   1: 24857.020 ±(99.9%) 1133.847 us/op
Iteration   2: 24730.903 ±(99.9%) 1107.718 us/op
Iteration   3: 18894.383 ±(99.9%) 738.706 us/op
Iteration   4: 5493.965 ±(99.9%) 120.529 us/op
Iteration   5: 6396.493 ±(99.9%) 143.630 us/op
preallocate:
Iteration   1: 5512.590 ±(99.9%) 110.222 us/op
Iteration   2: 3087.771 ±(99.9%) 120.084 us/op
Iteration   3: 3544.282 ±(99.9%) 110.373 us/op
Iteration   4: 3477.228 ±(99.9%) 107.270 us/op
Iteration   5: 4351.820 ±(99.9%) 82.946 us/op
Iteration   1: 3185.250 ±(99.9%) 154.102 us/op
Iteration   2: 3058.000 ±(99.9%) 143.758 us/op
Iteration   3: 3199.920 ±(99.9%) 61.589 us/op
Iteration   4: 3163.735 ±(99.9%) 71.291 us/op
Iteration   5: 5464.556 ±(99.9%) 59.034 us/op
```

That variability from 5.5ms to 25ms is terrible. It makes me not
particularly trust the 8ms savings from the report. But still,
the preallocating method has much less variability between runs,
and almost all of its runs are faster than all of the non-preallocated
runs. Maybe the savings is more like 2 or 3 milliseconds, but still.
Or maybe we should think of the savings as worst vs worst? If so it's
19 milliseconds.

Anyway, it's hard to measure how much this helps. But certainly some.

Closes #58647

Elasticsearch Microbenchmark Suite

This directory contains the microbenchmark suite of Elasticsearch. It relies on JMH.

Purpose

We do not want to microbenchmark everything but the kitchen sink and should typically rely on our macrobenchmarks with Rally. Microbenchmarks are intended to spot performance regressions in performance-critical components. The microbenchmark suite is also handy for ad-hoc microbenchmarks, but please remove them again before merging your PR.

Getting Started

Just run `gradlew -p benchmarks run` from the project root directory. It will build all microbenchmarks, execute them, and print the result.

Running Microbenchmarks

Running via an IDE is not supported as the results are meaningless because we have no control over the JVM running the benchmarks.

If you want to run a specific benchmark class like, say, `MemoryStatsBenchmark`, you can use `--args`:

```
gradlew -p benchmarks run --args ' MemoryStatsBenchmark'
```

Everything inside the quotes gets sent on the command line to JMH. The leading space inside the quotes is important: without it, parameters are sometimes sent to Gradle instead.

Adding Microbenchmarks

Before adding a new microbenchmark, make yourself familiar with the JMH API. You can check our existing microbenchmarks and also the JMH samples.

In contrast to tests, the actual name of the benchmark class is not relevant to JMH. However, stick to the naming convention and end the class name of a benchmark with `Benchmark`. To have JMH execute a benchmark, annotate the respective methods with `@Benchmark`.
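
For example, a minimal benchmark class might look like this (the class
name `HexEncodeBenchmark` and the measured operation are made up for
illustration):

```
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class HexEncodeBenchmark {         // class name ends in "Benchmark" by convention
    private long value = 0x12345678L;     // non-final so the JIT can't constant fold it

    @Benchmark
    public void encode(Blackhole bh) {    // each @Benchmark method is one benchmark
        bh.consume(Long.toHexString(value)); // consume the result so it isn't dead-code eliminated
    }
}
```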

Tips and Best Practices

To get realistic results, you should exercise care when running benchmarks. Here are a few tips:

Do

  • Ensure that the system executing your microbenchmarks has as little load as possible. Shut down every process that can cause unnecessary runtime jitter. Watch the Error column in the benchmark results to see the run-to-run variance.
  • Run enough warmup iterations to get the benchmark into a stable state. If you are unsure, don't change the defaults.
  • Avoid CPU migrations by pinning your benchmarks to specific CPU cores. On Linux you can use `taskset`.
  • Fix the CPU frequency to keep Turbo Boost from kicking in and skewing your results. On Linux you can use `cpufreq-set` and the performance CPU governor.
  • Vary the problem input size with `@Param` (see the sketch after this list).
  • Use the integrated profilers in JMH to dig deeper if benchmark results do not match your hypotheses:
    • Add `-prof gc` to the options to check whether the garbage collector runs during a microbenchmark and skews your results. If so, try to force a GC between runs (`-gc true`) but watch out for the caveats.
    • Add `-prof perf` or `-prof perfasm` (both only available on Linux) to see hotspots.
  • Have your benchmarks peer-reviewed.
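
As a sketch of `@Param` in action (the class and sizes are made up for
illustration), JMH runs the benchmark once per parameter value and prints
each parameter as a parenthesized column in the results, just like the
`(breaker)` column in the tables above:

```
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Benchmark)
public class ArraySumBenchmark {
    @Param({ "16", "1024", "65536" }) // one full run per size; shows up as a (size) column
    private int size;

    private long[] data;

    @Setup
    public void setup() {             // runs before measurement, so allocation isn't timed
        data = new long[size];
        for (int i = 0; i < size; i++) {
            data[i] = i;
        }
    }

    @Benchmark
    public long sum() {               // returning the value keeps it from being optimized away
        long total = 0;
        for (long v : data) {
            total += v;
        }
        return total;
    }
}
```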

Don't

  • Blindly believe the numbers that your microbenchmark produces; instead verify them by measuring, e.g. with `-prof perfasm`.
  • Run more threads than your number of CPU cores (in case you run multi-threaded microbenchmarks).
  • Look only at the Score column and ignore Error. Instead take countermeasures to keep Error low / variance explainable.

Disassembling

Disassembling is fun! Maybe not always useful, but always fun! Generally, you'll want to install `perf` and FCML's hsdis. `perf` is generally available via `apt-get install perf` or `pacman -S perf`. FCML is a little more involved. This worked on 2020-08-01:

```
wget https://github.com/swojtasiak/fcml-lib/releases/download/v1.2.2/fcml-1.2.2.tar.gz
tar xf fcml*
cd fcml*
./configure
make
cd example/hsdis
make
sudo cp .libs/libhsdis.so.0.0.0 /usr/lib/jvm/java-14-adoptopenjdk/lib/hsdis-amd64.so
```

If you want to disassemble a single method, do something like this:

```
gradlew -p benchmarks run --args ' MemoryStatsBenchmark -jvmArgs "-XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*.yourMethodName -XX:PrintAssemblyOptions=intel"'
```

If you want `perf` to find the hot methods for you, then add `-prof perfasm`.