Commit graph

14 commits

Author SHA1 Message Date
Nik Everett
7e1e45eaa4
ESQL: Speed up TO_IP (#126338)
Speed up the TO_IP method by converting directly from utf-8 encoded
strings to the ip encoding. Previously we did:
```
utf-8 -> String -> InetAddress -> ip encoding
```

In a step towards solving #125460 this creates three IP parsing
functions: one that rejects leading zeros, one that interprets leading
zeros as decimal numbers, and one that interprets leading zeros as octal
numbers. IPs have historically been parsed in all three of those ways.

This plugs the "rejects leading zeros" parser into `TO_IP` because
that's the behavior it had before.
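
For illustration, here is a minimal sketch of the "rejects leading zeros"
flavor, assuming a hypothetical helper that parses a dotted-quad IPv4 address
straight from UTF-8 bytes into the 16-byte IPv4-mapped form used for the ip
encoding. The names and shape are illustrative, not the actual ESQL code:
```
// Illustrative sketch, not the actual ESQL implementation: parse a
// dotted-quad IPv4 address straight from UTF-8 bytes into the 16-byte
// IPv4-mapped form used for the ip encoding, rejecting leading zeros.
static byte[] parseIpv4RejectingLeadingZeros(byte[] utf8, int offset, int length) {
    byte[] ip = new byte[16];
    ip[10] = (byte) 0xff;
    ip[11] = (byte) 0xff;              // ::ffff:0:0 prefix for IPv4-mapped addresses
    int end = offset + length;
    for (int octet = 0; octet < 4; octet++) {
        int value = 0;
        int digits = 0;
        int start = offset;
        while (offset < end && utf8[offset] != '.') {
            int d = utf8[offset++] - '0';
            if (d < 0 || d > 9) {
                throw new IllegalArgumentException("not a digit");
            }
            value = value * 10 + d;
            if (value > 255) {
                throw new IllegalArgumentException("octet out of range");
            }
            digits++;
        }
        if (digits == 0) {
            throw new IllegalArgumentException("empty octet");
        }
        if (digits > 1 && utf8[start] == '0') {
            throw new IllegalArgumentException("leading zeros rejected");
        }
        ip[12 + octet] = (byte) value;
        if (octet < 3) {
            if (offset >= end || utf8[offset] != '.') {
                throw new IllegalArgumentException("expected '.'");
            }
            offset++;                  // skip the '.'
        }
    }
    if (offset != end) {
        throw new IllegalArgumentException("trailing bytes after the address");
    }
    return ip;
}
```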

Here is the performance:
```
Benchmark               Score    Error  Units
leadingZerosAreDecimal  14.007 ± 0.093  ns/op
leadingZerosAreOctal    15.020 ± 0.373  ns/op
leadingZerosRejected    14.176 ± 3.861  ns/op
original                32.950 ± 1.062  ns/op
```

So the new conversion takes less than half the time of the original.
2025-04-07 09:34:53 -04:00
Nik Everett
dc4fa26174
Speed up COALESCE significantly (#120139)
```
                      before              after
     (operation)   Score   Error       Score   Error  Units
 coalesce_2_noop  75.949 ± 3.961  ->   0.010 ±  0.001 ns/op  99.9%
coalesce_2_eager  99.299 ± 6.959  ->   4.292 ±  0.227 ns/op  95.7%
 coalesce_2_lazy 113.118 ± 5.747  ->  26.746 ±  0.954 ns/op  76.4%
```

We tend to advise folks that "COALESCE is faster than CASE", but, as of
8.16.0/https://github.com/elastic/elasticsearch/pull/112295 that wasn't true. I was working with someone a few
days ago to port a scripted_metric aggregation to ESQL and we saw
COALESCE taking ~60% of the time. That won't do.

The trouble is that CASE and COALESCE have to be *lazy*, meaning that
operations like:
```
COALESCE(a, 1 / b)
```
should never emit a warning if `a` is not `null`, even if `b` is `0`. In
8.16/https://github.com/elastic/elasticsearch/pull/112295 CASE grew an optimization where it could operate non-lazily
if it was flagged as "safe". This brings a similar optimization to
COALESCE, see it above as `coalesce_2_eager`, a 95.7% improvement.

It also brings an arguably more important optimization: entire-block
execution for COALESCE. The short version is that, if the first
parameter of COALESCE returns no nulls we can return it without doing
anything lazily. There are a few more cases, but the upshot is that
COALESCE is pretty much *free* in cases where long strings of results
are `null` or not `null`. That's the `coalesce_2_noop` line.
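
A heavily simplified sketch of the entire-block idea follows, using
hypothetical types rather than the real Block/evaluator API: a block is
modeled as a `long[]` plus a `BitSet` of null positions, the first argument's
block is handed back untouched when it has no nulls, and later arguments are
evaluated only for the rows that are null, which keeps the evaluation lazy:
```
import java.util.BitSet;
import java.util.function.IntToLongFunction;

// Hypothetical types, not the real ESQL Block API. The fast path returns the
// first argument's block untouched when it has no nulls; otherwise later
// arguments are evaluated only for the null rows. For brevity the fallback is
// assumed to always produce a value.
final class CoalesceSketch {
    record LongBlock(long[] values, BitSet nulls) {}

    static LongBlock coalesce(LongBlock first, IntToLongFunction fallback) {
        if (first.nulls().isEmpty()) {
            return first;                              // the "noop" path above
        }
        long[] out = first.values().clone();
        BitSet nulls = first.nulls();
        for (int p = nulls.nextSetBit(0); p >= 0; p = nulls.nextSetBit(p + 1)) {
            out[p] = fallback.applyAsLong(p);          // lazy: only run for null rows
        }
        return new LongBlock(out, new BitSet());
    }
}
```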

Finally, when there are mixed null and non-null values we were using a
single builder with some fairly inefficient paths. This specializes them
per type and skips some slow null-checking where possible. That's the
`coalesce_2_lazy` result, a more modest 76.4%.

NOTE: These are percentage improvements on COALESCE itself, or COALESCE with some low-overhead operators like `+`. If COALESCE isn't taking a *ton* of time in your query, don't get particularly excited about this. It's fun though.

Closes #119953
2025-01-23 17:40:09 +00:00
Nik Everett
7ebb09b9f3
Speed getIntLE from BytesReference (#90147)
This speeds up `getIntLE` from `BytesReference` which we'll be using in
the upcoming dense representations for aggregations. Here's the
performance:

```
           (type)  Mode  Cnt   Before  Error  After    Error  Units
            array  avgt    7   1.036 ± 0.062   0.261 ± 0.022  ns/op
paged_bytes_array  avgt    7   5.189 ± 0.172   5.317 ± 0.196  ns/op
  composite_256kb  avgt    7  30.792 ± 0.834  11.240 ± 0.387  ns/op
composite_262344b  avgt    7  32.503 ± 1.017  11.155 ± 0.358  ns/op
    composite_1mb  avgt    7  25.189 ± 0.449   8.379 ± 0.193  ns/op
```

The `array` method is how we'll use slices that don't span the edges of
a netty buffer. The `paged_bytes_array` method doesn't really change and
represents the default for internal stuff. I'll bet we could make it
faster too, but I don't know that we use it in the hot path. The
`composite_<size>` method is how we'll be reading large slabs from the
netty byte buffer. We could probably do better if we relied on the sizes
of the buffers being even, but we don't presently do that in the
composite bytes array. The different sizes following `composite` show
that the performance is dominated by the number of slabs in the
composite buffer. `1mb` looks like the largest buffer netty uses.
`256kb` is the smallest. The odd byte count is intentional so the int doesn't
line up on nice round boundaries. I don't think we'll use sizes like that, but
it looks like it doesn't make a huge difference to performance.
We're dominated by the buffer choice.
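
As a sketch of the contrast, assuming a plain backing `byte[]` rather than the
BytesReference code itself: a VarHandle view gives the single bounds-checked
32-bit load that makes the array case fast, while a generic implementation
that can't assume one contiguous array falls back to byte-at-a-time reads:
```
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Sketch only, not the BytesReference implementation.
final class GetIntLESketch {
    private static final VarHandle INT_LE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // One bounds-checked 32-bit little-endian load.
    static int getIntLE(byte[] bytes, int offset) {
        return (int) INT_LE.get(bytes, offset);
    }

    // Byte-at-a-time equivalent, roughly what a reader that can't assume a
    // single contiguous array has to fall back to.
    static int getIntLEByteByByte(byte[] bytes, int offset) {
        return (bytes[offset] & 0xFF)
            | (bytes[offset + 1] & 0xFF) << 8
            | (bytes[offset + 2] & 0xFF) << 16
            | (bytes[offset + 3] & 0xFF) << 24;
    }
}
```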
2022-09-22 00:47:11 +09:30
Abele Mălan
22c4a10c63
Update README.md (#84153)
- delete/update a few misplaced words;
- add some extra commas;
- fix capitalization of "Mac".
2022-03-10 14:03:07 -05:00
Quentin Pradet
5d8421744a
Fix link to benchmark page (#83887) 2022-02-15 13:00:52 +04:00
Nik Everett
fad5e44b99
update benchmark readme (#72620)
Documents that version 2.0 of the async profiler doesn't seem to work
with jmh. Fixes some syntax in another profiling example.
2021-05-03 11:30:50 -04:00
Nik Everett
a5f3787be4
It's flame graph time! (#68312)
Upgrade JMH to latest (1.26) to pick up its async profiler integration
and update the documentation to include instructions for running the
async profiler and making pretty pretty flame graphs.
2021-02-02 11:11:16 -05:00
Nik Everett
dfc45396e7
Speed up writeVInt (#62345)
This speeds up `StreamOutput#writeVInt` quite a bit which is nice
because it is *very* commonly called when serializing aggregations. Well,
when serializing anything. All "collections" serialize their size as a
vint. Anyway, I was examining the serialization speeds of `StringTerms`
and this saves about 30% of the write time for that. I expect it'll be
useful in other places.
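
For reference, here is a sketch of the vint wire format that `writeVInt`
emits, written into a plain `byte[]` rather than the stream: seven data bits
per byte, with the high bit set on every byte except the last. The format is
unchanged by this commit; this is only to show what is being encoded:
```
// Sketch of the vint encoding into a plain byte[]; not the StreamOutput code.
static int writeVInt(byte[] out, int offset, int value) {
    while ((value & ~0x7F) != 0) {
        out[offset++] = (byte) ((value & 0x7F) | 0x80);  // 7 data bits + continuation bit
        value >>>= 7;
    }
    out[offset++] = (byte) value;                        // last byte, high bit clear
    return offset;                                       // position after the encoded bytes
}
```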
2020-09-15 14:20:53 -04:00
Nik Everett
1af8d9f228
Rework checking if a year is a leap year (#60585)
This way is faster, saving about 8% on the microbenchmark that rounds to
the nearest month. That is in the hot path for `date_histogram` which is
a very popular aggregation so it seems worth it to at least try and
speed it up a little.
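
For reference, the Gregorian leap-year rule any such check has to implement,
shown with the common micro-optimization of replacing the modulo-4 test with a
bit mask. This is illustrative only, not necessarily the exact rework in this
commit:
```
// Illustrative only: standard Gregorian rule with `% 4` replaced by a bit mask.
static boolean isLeapYear(long year) {
    return (year & 3) == 0 && (year % 100 != 0 || year % 400 == 0);
}
```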
2020-08-05 16:09:51 -04:00
Nik Everett
0097a86d53
Optimize date_histograms across daylight savings time (#55559)
Rounding dates on a shard that contains a daylight savings time transition
is currently something like 1400% slower than when a shard contains dates
only on one side of the DST transition. And it makes a ton of short-lived
garbage. This replaces that implementation with one that benchmarks to
having around 30% overhead instead of the 1400%. And it doesn't generate
any garbage per search hit.

Some background:
There are two ways to round in ES:
* Round to the nearest time unit (Day/Hour/Week/Month/etc)
* Round to the nearest time *interval* (3 days/2 weeks/etc)

I'm only optimizing the first one in this change and plan to do the second
in a follow up. It turns out that rounding to the nearest unit really *is*
two problems: when the unit rounds to midnight (day/week/month/year) and
when it doesn't (hour/minute/second). Rounding to midnight is consistently
about 25% faster than rounding to individual hours or minutes.

This optimization relies on being able to *usually* figure out what the
minimum and maximum dates are on the shard. This is similar to an existing
optimization where we rewrite time zones that aren't fixed
(think America/New_York and its daylight savings time transitions) into
fixed time zones so long as there isn't a daylight savings time transition
on the shard (UTC-5 or UTC-4 for America/New_York). Once I implement
time interval rounding the time zone rewriting optimization *should* no
longer be needed.
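
As a rough illustration of that existing rewrite (hypothetical code, not the
Rounding implementation touched by this commit): if no transition falls inside
the shard's [min, max] range, the zone behaves like a fixed offset there, and
rounding down to the start of the day becomes plain arithmetic:
```
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.zone.ZoneOffsetTransition;
import java.time.zone.ZoneRules;
import java.util.concurrent.TimeUnit;

// Sketch of the rewrite-to-fixed-offset idea, not the actual implementation.
final class FixedOffsetDayRounding {
    private static final long DAY = TimeUnit.DAYS.toMillis(1);

    /** The fixed offset in millis if no transition falls inside [minMillis, maxMillis], else null. */
    static Long fixedOffsetIfNoTransition(ZoneId zone, long minMillis, long maxMillis) {
        ZoneRules rules = zone.getRules();
        Instant min = Instant.ofEpochMilli(minMillis);
        ZoneOffsetTransition next = rules.nextTransition(min);
        if (next != null && next.getInstant().toEpochMilli() <= maxMillis) {
            return null;   // a transition sits inside the window, can't treat the zone as fixed
        }
        ZoneOffset offset = rules.getOffset(min);
        return offset.getTotalSeconds() * 1000L;
    }

    /** Round down to the start of the local day; valid only while the offset is fixed. */
    static long roundToDay(long utcMillis, long fixedOffsetMillis) {
        long local = utcMillis + fixedOffsetMillis;
        return Math.floorDiv(local, DAY) * DAY - fixedOffsetMillis;
    }
}
```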

This optimization doesn't come into play for `composite` or
`auto_date_histogram` aggs because neither have been migrated to the new
`DATE` `ValuesSourceType` which is where that range lookup happens. When
they are they will be able to pick up the optimization without much work.
I expect this to be substantial for `auto_date_histogram` but less so for
`composite` because it deals with fewer values.

Note: My 30% overhead figure comes from small numbers of daylight savings
time transitions. That overhead grows logarithmically with the number of
transitions. With two thousand years' worth of transitions my algorithm ends
up being 250% slower than rounding without a time zone, but java.time is
47000% slower at that point,
allocating memory as fast as it possibly can.
2020-05-07 07:22:32 -04:00
Nik Everett
21eb9695af
Build: Remove shadowing from benchmarks (#32475)
Removes shadowing from the benchmarks. It isn't *strictly* needed. We do
have to rework the documentation on how to run the benchmark, but it
still seems to work if you run everything through gradle.
2018-07-31 17:31:13 -04:00
Daniel Mitterdorfer
889d802115 Refine wording in benchmark README and correct typos 2016-06-15 23:01:56 +02:00
Daniel Mitterdorfer
32dd813436 Fix typo in benchmark README 2016-06-15 22:45:47 +02:00
Daniel Mitterdorfer
2c467fd9c2 Add microbenchmarking infrastructure (#18891)
With this commit we add a benchmarks project that contains the necessary build
infrastructure and an example benchmark. It is added as a separate project to avoid
interfering with the regular build too much (especially sanity checks) and to keep
the microbenchmarks isolated.

Microbenchmarks are generated with `gradle :benchmarks:jmhJar` and can be run with
`gradle :benchmarks:jmh`.
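
For context, this is the general shape of a minimal JMH microbenchmark such a
project holds; the class and the measured operation below are hypothetical,
not the example benchmark added in this commit:
```
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

import java.util.concurrent.TimeUnit;

// Hypothetical benchmark class, shown only to illustrate the JMH shape.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 7)
@Fork(1)
@State(Scope.Benchmark)
public class ExampleBenchmark {
    private long value = 1234567890L;

    @Benchmark
    public long numberOfLeadingZeros() {
        return Long.numberOfLeadingZeros(value);
    }
}
```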

We intentionally do not use the
[jmh-gradle-plugin](https://github.com/melix/jmh-gradle-plugin) as it causes all
sorts of problems (dependencies are not properly excluded, not all JMH parameters
can be set) and it adds another abstraction layer that is not needed.

Closes #18242
2016-06-15 16:48:02 +02:00