Commit graph

14 commits

Author SHA1 Message Date
Nik Everett
7e1e45eaa4
ESQL: Speed up TO_IP (#126338)
Speed up the TO_IP method by converting directly from utf-8 encoded
strings to the ip encoding. Previously we did:
```
utf-8 -> String -> InetAddress -> ip encoding
```

In a step towards solving #125460 this creates three IP parsing
functions: one that rejects leading zeros, one that interprets leading
zeros as decimal numbers, and one that interprets leading zeros as octal
numbers. IPs have historically been parsed in all three of those ways.

This plugs the "rejects leading zeros" parser into `TO_IP` because
that's the behavior it had before.
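
For illustration, here is a minimal sketch of the "rejects leading zeros"
flavor, assuming a hypothetical helper that parses a dotted-quad IPv4 address
straight from UTF-8 bytes into the 16-byte IPv4-mapped form used for the ip
encoding. The names and shape are illustrative, not the actual ESQL code:
```
// Illustrative sketch, not the actual ESQL implementation: parse a
// dotted-quad IPv4 address straight from UTF-8 bytes into the 16-byte
// IPv4-mapped form used for the ip encoding, rejecting leading zeros.
static byte[] parseIpv4RejectingLeadingZeros(byte[] utf8, int offset, int length) {
    byte[] ip = new byte[16];
    ip[10] = (byte) 0xff;
    ip[11] = (byte) 0xff;              // ::ffff:0:0 prefix for IPv4-mapped addresses
    int end = offset + length;
    for (int octet = 0; octet < 4; octet++) {
        int value = 0;
        int digits = 0;
        int start = offset;
        while (offset < end && utf8[offset] != '.') {
            int d = utf8[offset++] - '0';
            if (d < 0 || d > 9) {
                throw new IllegalArgumentException("not a digit");
            }
            value = value * 10 + d;
            if (value > 255) {
                throw new IllegalArgumentException("octet out of range");
            }
            digits++;
        }
        if (digits == 0) {
            throw new IllegalArgumentException("empty octet");
        }
        if (digits > 1 && utf8[start] == '0') {
            throw new IllegalArgumentException("leading zeros rejected");
        }
        ip[12 + octet] = (byte) value;
        if (octet < 3) {
            if (offset >= end || utf8[offset] != '.') {
                throw new IllegalArgumentException("expected '.'");
            }
            offset++;                  // skip the '.'
        }
    }
    if (offset != end) {
        throw new IllegalArgumentException("trailing bytes after the address");
    }
    return ip;
}
```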

Here is the performance:
```
Benchmark               Score    Error  Units
leadingZerosAreDecimal  14.007 ± 0.093  ns/op
leadingZerosAreOctal    15.020 ± 0.373  ns/op
leadingZerosRejected    14.176 ± 3.861  ns/op
original                32.950 ± 1.062  ns/op
```

So the new conversion takes less than half the time of the original.
2025-04-07 09:34:53 -04:00
Nik Everett
dc4fa26174
Speed up COALESCE significantly (#120139)
```
                      before              after
     (operation)   Score   Error       Score   Error  Units
 coalesce_2_noop  75.949 ± 3.961  ->   0.010 ±  0.001 ns/op  99.9%
coalesce_2_eager  99.299 ± 6.959  ->   4.292 ±  0.227 ns/op  95.7%
 coalesce_2_lazy 113.118 ± 5.747  ->  26.746 ±  0.954 ns/op  76.4%
```

We tend to advise folks that "COALESCE is faster than CASE", but, as of
8.16.0/https://github.com/elastic/elasticsearch/pull/112295 that wasn't true. I was working with someone a few
days ago to port a scripted_metric aggregation to ESQL and we saw
COALESCE taking ~60% of the time. That won't do.

The trouble is that CASE and COALESCE have to be *lazy*, meaning that
operations like:
```
COALESCE(a, 1 / b)
```
should never emit a warning if `a` is not `null`, even if `b` is `0`. In
8.16/https://github.com/elastic/elasticsearch/pull/112295 CASE grew an optimization where it could operate non-lazily
if it was flagged as "safe". This brings a similar optimization to
COALESCE, see it above as `coalesce_2_eager`, a 95.7% improvement.

It also brings an arguably more important optimization: entire-block
execution for COALESCE. The short version is that, if the first
parameter of COALESCE returns no nulls we can return it without doing
anything lazily. There are a few more cases, but the upshot is that
COALESCE is pretty much *free* in cases where long strings of results
are `null` or not `null`. That's the `coalesce_2_noop` line.
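
A heavily simplified sketch of the entire-block idea follows, using
hypothetical types rather than the real Block/evaluator API: a block is
modeled as a `long[]` plus a `BitSet` of null positions, the first argument's
block is handed back untouched when it has no nulls, and later arguments are
evaluated only for the rows that are null, which keeps the evaluation lazy:
```
import java.util.BitSet;
import java.util.function.IntToLongFunction;

// Hypothetical types, not the real ESQL Block API. The fast path returns the
// first argument's block untouched when it has no nulls; otherwise later
// arguments are evaluated only for the null rows. For brevity the fallback is
// assumed to always produce a value.
final class CoalesceSketch {
    record LongBlock(long[] values, BitSet nulls) {}

    static LongBlock coalesce(LongBlock first, IntToLongFunction fallback) {
        if (first.nulls().isEmpty()) {
            return first;                              // the "noop" path above
        }
        long[] out = first.values().clone();
        BitSet nulls = first.nulls();
        for (int p = nulls.nextSetBit(0); p >= 0; p = nulls.nextSetBit(p + 1)) {
            out[p] = fallback.applyAsLong(p);          // lazy: only run for null rows
        }
        return new LongBlock(out, new BitSet());
    }
}
```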

Finally, when there are mixed null and non-null values we were using a
single builder with some fairly inefficient paths. This specializes them
per type and skips some slow null-checking where possible. That's the
`coalesce_2_lazy` result, a more modest 76.4%.

NOTE: These are percentage improvements on COALESCE itself, or COALESCE with some low-overhead operators like `+`. If COALESCE isn't taking a *ton* of time in your query, don't get particularly excited about this. It's fun though.

Closes #119953
2025-01-23 17:40:09 +00:00
Nik Everett
7ebb09b9f3
Speed getIntLE from BytesReference (#90147)
This speeds up `getIntLE` from `BytesReference` which we'll be using in
the upcoming dense representations for aggregations. Here's the
performance:

```
           (type)  Mode  Cnt   Before  Error  After    Error  Units
            array  avgt    7   1.036 ± 0.062   0.261 ± 0.022  ns/op
paged_bytes_array  avgt    7   5.189 ± 0.172   5.317 ± 0.196  ns/op
  composite_256kb  avgt    7  30.792 ± 0.834  11.240 ± 0.387  ns/op
composite_262344b  avgt    7  32.503 ± 1.017  11.155 ± 0.358  ns/op
    composite_1mb  avgt    7  25.189 ± 0.449   8.379 ± 0.193  ns/op
```

The `array` method is how we'll use slices that don't span the edges of
a netty buffer. The `paged_bytes_array` method doesn't really change and
represents the default for internal stuff. I'll bet we could make it
faster too, but I don't know that we use it in the hot path. The
`composite_<size>` method is how we'll be reading large slabs from the
netty byte buffer. We could probably do better if we relied on the sizes
of the buffers being even, but we don't presently do that in the
composite bytes array. The different sizes following `composite` show
that the performance is dominated by the number of slabs in the
composite buffer. `1mb` looks like the largest buffer netty uses.
`256kb` is the smallest. The odd byte count is intentional so the int doesn't
line up on nice round boundaries. I don't think we'll use sizes like that, but
it looks like it doesn't make a huge difference to performance.
We're dominated by the buffer choice.
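
As a sketch of the contrast, assuming a plain backing `byte[]` rather than the
BytesReference code itself: a VarHandle view gives the single bounds-checked
32-bit load that makes the array case fast, while a generic implementation
that can't assume one contiguous array falls back to byte-at-a-time reads:
```
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

// Sketch only, not the BytesReference implementation.
final class GetIntLESketch {
    private static final VarHandle INT_LE =
        MethodHandles.byteArrayViewVarHandle(int[].class, ByteOrder.LITTLE_ENDIAN);

    // One bounds-checked 32-bit little-endian load.
    static int getIntLE(byte[] bytes, int offset) {
        return (int) INT_LE.get(bytes, offset);
    }

    // Byte-at-a-time equivalent, roughly what a reader that can't assume a
    // single contiguous array has to fall back to.
    static int getIntLEByteByByte(byte[] bytes, int offset) {
        return (bytes[offset] & 0xFF)
            | (bytes[offset + 1] & 0xFF) << 8
            | (bytes[offset + 2] & 0xFF) << 16
            | (bytes[offset + 3] & 0xFF) << 24;
    }
}
```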
2022-09-22 00:47:11 +09:30
Abele Mălan
22c4a10c63
Update README.md (#84153)
- delete/update a few misplaced words;
- add some extra commas;
- fix capitalization of "Mac".
2022-03-10 14:03:07 -05:00
Quentin Pradet
5d8421744a
Fix link to benchmark page (#83887) 2022-02-15 13:00:52 +04:00
Nik Everett
fad5e44b99
update benchmark readme (#72620)
Documents that version 2.0 of the async profiler doesn't seem to work
with jmh. Fixes some syntax in another profiling example.
2021-05-03 11:30:50 -04:00
Nik Everett
a5f3787be4
It's flame graph time! (#68312)
Upgrade JMH to latest (1.26) to pick up its async profiler integration
and update the documentation to include instructions for running the
async profiler and making pretty pretty flame graphs.
2021-02-02 11:11:16 -05:00
Nik Everett
dfc45396e7
Speed up writeVInt (#62345)
This speeds up `StreamOutput#writeVInt` quite a bit which is nice
because it is *very* commonly called when serializing aggregations. Well,
when serializing anything. All "collections" serialize their size as a
vint. Anyway, I was examining the serialization speeds of `StringTerms`
and this saves about 30% of the write time for that. I expect it'll be
useful in other places.
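
For reference, here is a sketch of the vint wire format that `writeVInt`
emits, written into a plain `byte[]` rather than the stream: seven data bits
per byte, with the high bit set on every byte except the last. The format is
unchanged by this commit; this is only to show what is being encoded:
```
// Sketch of the vint encoding into a plain byte[]; not the StreamOutput code.
static int writeVInt(byte[] out, int offset, int value) {
    while ((value & ~0x7F) != 0) {
        out[offset++] = (byte) ((value & 0x7F) | 0x80);  // 7 data bits + continuation bit
        value >>>= 7;
    }
    out[offset++] = (byte) value;                        // last byte, high bit clear
    return offset;                                       // position after the encoded bytes
}
```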
2020-09-15 14:20:53 -04:00
Nik Everett
1af8d9f228
Rework checking if a year is a leap year (#60585)
This way is faster, saving about 8% on the microbenchmark that rounds to
the nearest month. That is in the hot path for `date_histogram` which is
a very popular aggregation so it seems worth it to at least try and
speed it up a little.
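
For reference, the Gregorian leap-year rule any such check has to implement,
shown with the common micro-optimization of replacing the modulo-4 test with a
bit mask. This is illustrative only, not necessarily the exact rework in this
commit:
```
// Illustrative only: standard Gregorian rule with `% 4` replaced by a bit mask.
static boolean isLeapYear(long year) {
    return (year & 3) == 0 && (year % 100 != 0 || year % 400 == 0);
}
```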
2020-08-05 16:09:51 -04:00
Nik Everett
0097a86d53
Optimize date_histograms across daylight savings time (#55559)
Rounding dates on a shard that contains a daylight savings time transition
is currently something like 1400% slower than when a shard contains dates
only on one side of the DST transition. And it makes a ton of short-lived
garbage. This replaces that implementation with one that benchmarks to
having around 30% overhead instead of the 1400%. And it doesn't generate
any garbage per search hit.

Some background:
There are two ways to round in ES:
* Round to the nearest time unit (Day/Hour/Week/Month/etc)
* Round to the nearest time *interval* (3 days/2 weeks/etc)

I'm only optimizing the first one in this change and plan to do the second
in a follow up. It turns out that rounding to the nearest unit really *is*
two problems: when the unit rounds to midnight (day/week/month/year) and
when it doesn't (hour/minute/second). Rounding to midnight is consistently
about 25% faster than rounding to individual hours or minutes.

This optimization relies on being able to *usually* figure out what the
minimum and maximum dates are on the shard. This is similar to an existing
optimization where we rewrite time zones that aren't fixed
(think America/New_York and its daylight savings time transitions) into
fixed time zones so long as there isn't a daylight savings time transition
on the shard (UTC-5 or UTC-4 for America/New_York). Once I implement
time interval rounding the time zone rewriting optimization *should* no
longer be needed.
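
As a rough illustration of that existing rewrite (hypothetical code, not the
Rounding implementation touched by this commit): if no transition falls inside
the shard's [min, max] range, the zone behaves like a fixed offset there, and
rounding down to the start of the day becomes plain arithmetic:
```
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.zone.ZoneOffsetTransition;
import java.time.zone.ZoneRules;
import java.util.concurrent.TimeUnit;

// Sketch of the rewrite-to-fixed-offset idea, not the actual implementation.
final class FixedOffsetDayRounding {
    private static final long DAY = TimeUnit.DAYS.toMillis(1);

    /** The fixed offset in millis if no transition falls inside [minMillis, maxMillis], else null. */
    static Long fixedOffsetIfNoTransition(ZoneId zone, long minMillis, long maxMillis) {
        ZoneRules rules = zone.getRules();
        Instant min = Instant.ofEpochMilli(minMillis);
        ZoneOffsetTransition next = rules.nextTransition(min);
        if (next != null && next.getInstant().toEpochMilli() <= maxMillis) {
            return null;   // a transition sits inside the window, can't treat the zone as fixed
        }
        ZoneOffset offset = rules.getOffset(min);
        return offset.getTotalSeconds() * 1000L;
    }

    /** Round down to the start of the local day; valid only while the offset is fixed. */
    static long roundToDay(long utcMillis, long fixedOffsetMillis) {
        long local = utcMillis + fixedOffsetMillis;
        return Math.floorDiv(local, DAY) * DAY - fixedOffsetMillis;
    }
}
```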

This optimization doesn't come into play for `composite` or
`auto_date_histogram` aggs because neither have been migrated to the new
`DATE` `ValuesSourceType` which is where that range lookup happens. When
they are they will be able to pick up the optimization without much work.
I expect this to be substantial for `auto_date_histogram` but less so for
`composite` because it deals with fewer values.

Note: My 30% overhead figure comes from small numbers of daylight savings
time transitions. That overhead grows logarithmically with the number of
transitions. With two thousand years' worth of transitions my algorithm ends
up being 250% slower than rounding without a time zone, but java.time is
47000% slower at that point,
allocating memory as fast as it possibly can.
2020-05-07 07:22:32 -04:00
Nik Everett
21eb9695af
Build: Remove shadowing from benchmarks (#32475)
Removes shadowing from the benchmarks. It isn't *strictly* needed. We do
have to rework the documentation on how to run the benchmark, but it
still seems to work if you run everything through gradle.
2018-07-31 17:31:13 -04:00
Daniel Mitterdorfer
889d802115 Refine wording in benchmark README and correct typos 2016-06-15 23:01:56 +02:00
Daniel Mitterdorfer
32dd813436 Fix typo in benchmark README 2016-06-15 22:45:47 +02:00
Daniel Mitterdorfer
2c467fd9c2 Add microbenchmarking infrastructure (#18891)
With this commit we add a benchmarks project that contains the necessary build
infrastructure and an example benchmark. It is added as a separate project to avoid
interfering with the regular build too much (especially sanity checks) and to keep
the microbenchmarks isolated.

Microbenchmarks are generated with `gradle :benchmarks:jmhJar` and can be run with
`gradle :benchmarks:jmh`.
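
For context, this is the general shape of a minimal JMH microbenchmark such a
project holds; the class and the measured operation below are hypothetical,
not the example benchmark added in this commit:
```
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

import java.util.concurrent.TimeUnit;

// Hypothetical benchmark class, shown only to illustrate the JMH shape.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 7)
@Fork(1)
@State(Scope.Benchmark)
public class ExampleBenchmark {
    private long value = 1234567890L;

    @Benchmark
    public long numberOfLeadingZeros() {
        return Long.numberOfLeadingZeros(value);
    }
}
```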

We intentionally do not use the
[jmh-gradle-plugin](https://github.com/melix/jmh-gradle-plugin) as it causes all
sorts of problems (dependencies are not properly excluded, not all JMH parameters
can be set) and it adds another abstraction layer that is not needed.

Closes #18242
2016-06-15 16:48:02 +02:00