* ESQL: Speed up VALUES for many buckets (#123073)
Speeds up the VALUES agg when collecting from many buckets.
Specifically, this speeds up the algorithm used to `finish` the
aggregation. Most specifically, this makes the algorithm more tolerant
of large numbers of groups being collected. The old algorithm was
`O(n^2)` in the number of groups. The new one is `O(n)`:
```
(groups)
1 219.683 ± 1.069 -> 223.477 ± 1.990 ms/op
1000 426.323 ± 75.963 -> 463.670 ± 7.275 ms/op
100000 36690.871 ± 4656.350 -> 7800.332 ± 2775.869 ms/op
200000 89422.113 ± 2972.606 -> 21920.288 ± 3427.962 ms/op
400000 timed out at 10 minutes -> 40051.524 ± 2011.706 ms/op
```
The `1` group version was not changed at all; the difference is just
noise in the measurement. The small bump in the `1000` case is real and
almost certainly worth it. The huge drop in the `100000` case is quite real.
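To make the complexity difference concrete, here's a minimal sketch (not the actual VALUES code; the names, types, and layout are illustrative) of finishing a grouped collection with a per-group scan versus a single counting pass:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: contrast an O(n^2) finish (scan everything once
// per group) with an O(n) finish (count, compute offsets, place each value once).
class ValuesFinish {
    // O(n^2): for each group, scan every collected (group, value) pair.
    static List<List<Integer>> finishQuadratic(int groups, int[] groupIds, int[] values) {
        List<List<Integer>> out = new ArrayList<>();
        for (int g = 0; g < groups; g++) {
            List<Integer> vs = new ArrayList<>();
            for (int i = 0; i < groupIds.length; i++) {
                if (groupIds[i] == g) vs.add(values[i]);
            }
            out.add(vs);
        }
        return out;
    }

    // O(n): count per group, compute start offsets, then place each value once.
    static List<List<Integer>> finishLinear(int groups, int[] groupIds, int[] values) {
        int[] counts = new int[groups];
        for (int g : groupIds) counts[g]++;
        int[] offsets = new int[groups];
        for (int g = 1; g < groups; g++) offsets[g] = offsets[g - 1] + counts[g - 1];
        int[] sorted = new int[values.length];
        int[] cursor = offsets.clone();
        for (int i = 0; i < groupIds.length; i++) {
            sorted[cursor[groupIds[i]]++] = values[i];
        }
        List<List<Integer>> out = new ArrayList<>();
        for (int g = 0; g < groups; g++) {
            List<Integer> vs = new ArrayList<>();
            for (int i = offsets[g]; i < offsets[g] + counts[g]; i++) vs.add(sorted[i]);
            out.add(vs);
        }
        return out;
    }
}
```

Both produce the same per-group value lists; only the second avoids rescanning all collected pairs once per group.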
* Fix
* Compile
This speeds up grouping by bytes valued fields (keyword, text, ip, and
wildcard) when the input is an ordinal block:
```
bytes_refs 22.213 ± 0.322 -> 19.848 ± 0.205 ns/op (*maybe* real, maybe noise. still good)
ordinal didn't exist -> 2.988 ± 0.011 ns/op
```
I see this as 20ns -> 3ns, an 85% speed up. We never had the ordinals
branch before, so I'm expecting the same performance there - about 20ns
per op.
This also speeds up grouping by a pair of byte valued fields:
```
two_bytes_refs 83.112 ± 42.348 -> 46.521 ± 0.386 ns/op
two_ordinals 83.531 ± 23.473 -> 8.617 ± 0.105 ns/op
```
The speed up is much better when the fields are ordinals because hashing
bytes is comparatively slow.
I believe the ordinals case is quite common. I've run into it in quite a
few profiles.
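A hedged sketch of why the ordinal branch is so much cheaper - hypothetical names, not the actual BlockHash code: instead of hashing every row's bytes, hash each dictionary entry once and resolve each row through the ordinal array:

```java
import java.util.Map;

// Illustrative sketch only: grouping dictionary-encoded (ordinal) input
// versus hashing raw values per row.
class OrdinalGrouping {
    // Slow path: hash every row's value.
    static int[] groupByBytes(String[] rows, Map<String, Integer> hash) {
        int[] groupIds = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            groupIds[i] = hash.computeIfAbsent(rows[i], k -> hash.size());
        }
        return groupIds;
    }

    // Fast path: the block is already dictionary-encoded; hash each distinct
    // dictionary entry once, then each row is a single array read.
    static int[] groupByOrdinals(String[] dictionary, int[] ordinals, Map<String, Integer> hash) {
        int[] dictToGroup = new int[dictionary.length];
        for (int d = 0; d < dictionary.length; d++) {
            dictToGroup[d] = hash.computeIfAbsent(dictionary[d], k -> hash.size());
        }
        int[] groupIds = new int[ordinals.length];
        for (int i = 0; i < ordinals.length; i++) {
            groupIds[i] = dictToGroup[ordinals[i]];
        }
        return groupIds;
    }
}
```

With many rows per distinct value, the hash work amortizes to near zero on the ordinal path, which matches the `two_ordinals` numbers above.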
* Add CircuitBreaker to TDigest, Step 4: Take into account shallow classes size (#113613)
* Removed muted tests from merge conflict
* Added missing empty line in muted tests
This adds a test to *every* agg for when it's entirely filtered away and
another when filtering is enabled but unused. I'll follow up with
another test later for partial filtering.
That test caught a bug where some aggs would think they'd been `seen`
when they hadn't. This fixes that too.
Part of https://github.com/elastic/elasticsearch/issues/99815
## Steps
1. Migrate TDigest classes to use a custom Array implementation. Temporarily use a simple array wrapper (https://github.com/elastic/elasticsearch/pull/112810)
2. Implement CircuitBreaking in the `MemoryTrackingTDigestArrays` class. Add `Releasable` and ensure it's always closed within TDigest (This PR)
3. Pass the CircuitBreaker as a parameter to TDigestState from wherever it's being used
4. Account remaining TDigest classes size ("SHALLOW_SIZE")
Every step should be safely mergeable to main:
- The first and second steps should have no impact.
- The third and fourth ones will start increasing the CB count partially.
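The idea behind step 2 can be sketched like this (illustrative names only, not the actual `MemoryTrackingTDigestArrays` API): an array wrapper that charges a breaker on allocation and gives the bytes back on close, so TDigest can hold it as a `Releasable`:

```java
// Illustrative sketch of circuit-breaker-tracked array allocation.
// Breaker is a stand-in for the real CircuitBreaker interface.
class TrackingDoubleArray implements AutoCloseable {
    interface Breaker { void addBytes(long bytes); }

    private final Breaker breaker;
    private final double[] data;
    private final long bytes;

    TrackingDoubleArray(Breaker breaker, int size) {
        this.breaker = breaker;
        this.bytes = (long) size * Double.BYTES;
        breaker.addBytes(bytes);   // the real breaker may throw here if over limit
        this.data = new double[size];
    }

    double get(int i) { return data[i]; }

    void set(int i, double v) { data[i] = v; }

    @Override
    public void close() {
        breaker.addBytes(-bytes);  // release the accounted bytes
    }
}
```

The key invariant is that every allocation is eventually matched by a release, which is why the PR ensures the arrays are always closed within TDigest.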
## Remarks
To simplify testing the CircuitBreaker, added a helper method + `@After` to ESTestCase.
Right now CBs are usually tested through MockBigArrays. E.g:
f7a0196b45/x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/expression/function/AbstractFunctionTestCase.java (L1263-L1265)
So I guess there was no need for this yet. But I may have missed something somewhere.
Also, I'm separating this PR from the "step 3" as integrating this (CB) in the current usages may require some refactor of external code, which may be somewhat more _dangerous_
This speeds up the `CASE` function when it has two or three arguments
and both of the arguments are constants or fields. This works because
`CASE` is lazy so it can avoid warnings in cases like
```
CASE(foo != 0, 2 / foo, 1)
```
And, in the case where the function is *very* slow, it can avoid the
computations.
But if the lhs and rhs of the `CASE` are constant then there isn't any
work to avoid.
The performance improvement is pretty substantial:
```
(operation) Before Error After Error Units
case_1_lazy 97.422 ± 1.048 101.571 ± 0.737 ns/op
case_1_eager 79.312 ± 1.190 4.601 ± 0.049 ns/op
```
The top line is a `CASE` that has to be lazy - it shouldn't change. The
4 nanos change here is noise. The eager version improves by about 94%.
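A minimal sketch of the lazy/eager distinction (illustrative, not the actual evaluator code): when both branches are already evaluated for the whole block, `CASE` degenerates to a per-row select between two arrays:

```java
import java.util.function.IntUnaryOperator;

// Illustrative sketch: lazy CASE drives a per-row evaluator for the chosen
// branch; eager CASE just selects between two precomputed blocks.
class EagerCase {
    // Lazy form: needed when a branch can warn or is expensive to compute.
    static int lazy(boolean cond, IntUnaryOperator lhs, IntUnaryOperator rhs, int row) {
        return cond ? lhs.applyAsInt(row) : rhs.applyAsInt(row);
    }

    // Eager form: both branches already evaluated for the block, so each row
    // is a single conditional select.
    static int[] eager(boolean[] cond, int[] lhs, int[] rhs) {
        int[] out = new int[cond.length];
        for (int i = 0; i < cond.length; i++) {
            out[i] = cond[i] ? lhs[i] : rhs[i];
        }
        return out;
    }
}
```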
The native platform dir can be found using a TestUtil method, but
benchmarks were trying to construct it on their own. This commit switches
to using the util method.
This adds a `Block#keepMask(BooleanVector)` method that will make a new
block, keeping all of the values where the vector is `true` and
`null`ing all of the values where the vector is `false`.
This will be useful for implementing partial aggregation application
like `| STATS MAX(a WHERE b > 1), MIN(j WHERE b > 2) BY bar`. Or however
the syntax ends up being. We already skip `null` group keys and we can
evaluate the `b > 2` bits to a mask pretty easily. It should also be
useful in optimizing `CASE(a > 2, foo)` - but only when the RHS of the
CASE is `null` and the LHS is a constant or constant-like.
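The semantics can be sketched on a boxed array standing in for a `Block` with a null bitmap (illustrative only, not the real Block API):

```java
// Illustrative sketch of keepMask semantics: keep values where the mask is
// true, null out positions where it is false.
class KeepMask {
    static Integer[] keepMask(Integer[] block, boolean[] mask) {
        Integer[] out = new Integer[block.length];
        for (int i = 0; i < block.length; i++) {
            out[i] = mask[i] ? block[i] : null;  // false -> null'ed position
        }
        return out;
    }
}
```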
This is something that's very optimize-able. I haven't really optimized
it in this PR, but it should be possible to speed this up a ton and
remove a lot of copying. Here's where the benchmarks start:
```
(dataTypeAndBlockKind) Mode Cnt Score Error Units
int/array avgt 7 3.705 ± 0.153 ns/op
int/vector avgt 7 3.234 ± 0.078 ns/op
```
That's about the same speed as reading the block. In a few of these
cases I expect we can get them to constant performance rather than
per-record performance.
Native libraries in Java are loaded by calling System.loadLibrary. This
method inspects paths in the java.library.path to find the requested
library. Elasticsearch previously used this to find libsystemd, but now
the only remaining use is to set the additional platform directory in
which Elasticsearch keeps its own native libraries.
One issue with setting java.library.path is that it's not set for the cli
process, which makes loading the native library infrastructure from clis
difficult. This commit reworks how Elasticsearch native libraries are
found in order to avoid needing to set java.library.path. There are two
cases. The simplest is production, where the working directory is the
Elasticsearch installation directory, so the platform specific directory
can be constructed. The second case is for tests where we don't have an
installation. We already pass in java.library.path there, so this change
renames the system property to a test-specific property that the new
loading infrastructure looks for.
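The two lookup cases can be sketched like this (the directory layout and property name here are illustrative assumptions, not the actual implementation):

```java
import java.nio.file.Path;

// Illustrative sketch of the two lookup cases: tests pass an explicit
// directory via a test-only system property; production derives the platform
// directory from the working (installation) directory.
class NativeLibDir {
    static Path find(String testPathProp) {
        if (testPathProp != null) {
            return Path.of(testPathProp);  // test case: explicit property wins
        }
        // production case: working dir is the installation dir, so a
        // platform-specific subdirectory can be constructed from os/arch
        String os = System.getProperty("os.name").toLowerCase().contains("win") ? "windows" : "linux";
        String arch = System.getProperty("os.arch");
        return Path.of("").toAbsolutePath().resolve("lib/platform/" + os + "-" + arch);
    }
}
```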
This change allows querying the `index.mode` setting via a new
`_index_mode` metadata field, enabling APIs such as `field_caps` or
`resolve_indices` to target indices that are either time_series or logs
only. This approach avoids adding and handling a new parameter for
`index_mode` in these APIs. Both ES|QL and the `_search` API should also
work with this new field.
* Mechanical package change in IntelliJ
* A couple of manual fixups
* Export plugins.loading to deprecation
* Put plugin-cli in a module so can export PluginsUtils to it.
This adds `hamming` distances, the pop-count of `xor` byte vectors as a
first class citizen in painless.
For byte vectors, this means that we can compute hamming distances via
script_score (aka, brute-force).
The implementation of `hamming` is the same that is available in Lucene,
and when lucene 9.11 is merged, we should update our logic where
applicable to utilize it.
NOTE: this does not yet add hamming distance as a metric for indexed
vectors. This will be a future PR after the Lucene 9.11 upgrade.
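The definition is simple enough to sketch: the pop-count of the XOR of each byte pair, which is the same definition Lucene uses:

```java
// Hamming distance over byte vectors: XOR each byte pair, count set bits.
class Hamming {
    static int distance(byte[] a, byte[] b) {
        int dist = 0;
        for (int i = 0; i < a.length; i++) {
            dist += Integer.bitCount((a[i] ^ b[i]) & 0xFF);  // pop-count of the XOR
        }
        return dist;
    }
}
```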
After #109162, all instances of DataType were in one class, as a collection of static constants. That pattern can then be simplified into an enum, merging the behavior class and the static collection class. That opens the door to future optimizations, like using enum serialization rather than string serialization to save bytes over the wire. It also makes the code easier to read, as all the behavior is now in a single file (which is still pretty short).
Most of this PR is just juggling names around to have the two references refer to the same thing.
Follow up work can merge the functions from EsqlDataTypes into this enum, but this PR is long enough already.
This moves all of the new data types declared in `EsqlDataTypes` into
`DataTypes`. It also removes `EsqlDataTypes#types` and makes
`DataTypes#types` return what that used to return. It doesn't modify any
other methods of `EsqlDataTypes`. That's a change for another time.
This commit extends the custom SIMD optimized SQ vector scorer to include search time scoring.
When run on JDK 22+, vector scoring will be done with the custom scorer. The implementation uses the JDK 22+ ALLOW_HEAP_ACCESS Linker.Option so that the native code can access the on-heap query vector directly.
This commit refactors libvec to replace custom scorer types with Lucene types.
The initial implementation created separate types to model the vector scorer with an adapter between them and the Lucene types. This was done to avoid a dependency on Lucene from the native module. This is no longer an issue, since the code is now separated from the native module already, and in fact already depends on Lucene. This PR drops the custom types in favour of the Lucene ones. This will help future refactoring, and avoid bugs by reusing the existing and known model in this area.
I also took the liberty of reflowing the code to match that of the recent change in Lucene to support off-heap scoring - this code is now very similar to that, and will become even more clean and streamlined in the lucene_snapshot branch. This refactoring is not directly dependent on the next version of Lucene, so it is done in main.
Part of https://github.com/elastic/elasticsearch/issues/106679
* Copy the `ql` project into a different project _just for esql_, call it `esql-core`.
* Make `esql` depend only on the latter.
* Fix `EsqlNodeSubclassTests`; I'm confused why this didn't bite us earlier.
* Update the warning regexes in some csv tests as the exceptions have other package names now.
**Note to reviewers:** Exclude the first commit when viewing the diff,
as that contains only the actual copying of `ql`. The remaining commits
are the actually meaningful ones. _The `build.gradle` files probably
require the most attention._
This commit updates the native vector provider to reflect that Lucene's scalar quantization is unsigned int7, with a range of values from 0 to 127 inclusive. Stride has been pushed down into native code, to allow other platforms to more easily select their own stride length.
Previously the implementation supported signed int8. We might want the more general signed int8 implementation in the future, but for now unsigned int7 is sufficient, and allows us to provide more efficient implementations on x64.
We had a TODO in our `BlockHash` implementations optimized for pairs of
columns - we wanted to use `MultiValueDedupe` inside their `add` methods
for `Block`s. This implements that TODO.
It makes a small behavior change on one of the blocks we don't yet use
in production - the `BytesRefLongBlockHash` will now properly keep
`null` bytes strings. That's a side effect of reusing other components
and doesn't actually allow us to use it in production - we're still
waiting on the joint hash tables which are blocked behind vector
instructions.
I moved a bunch of files to different places so I could reach into the
innards of the `MultivalueDedupe` subclasses to build the block hash
addition logic. It seemed like a reasonable thing to do. And it seemed
reasonable not to expose the raw arrays outside of the package.
This change moves GroupSpec from HashAggregationOperator to BlockHash,
making it available for MetricsAggregatorOperator, which will be
introduced soon.
This commit adds an optimised int8 vector distance implementation for aarch64. Additional platforms like, say, x64, will be added as a follow-up.
The vector distance implementation outperforms Lucene's Panama Vector implementation for binary comparisons by approx 5x (depending on the number of dimensions). It does so by means of compiler intrinsics built into a separate native library and linked via Panama's FFI. Comparisons are performed on off-heap mmap'ed vector data.
The implementation is currently only used during merging of scalar quantized segments, through a custom format ES814HnswScalarQuantizedVectorsFormat, but its usage will likely be expanded over time.
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Co-authored-by: Lorenzo Dematté <lorenzo.dematte@elastic.co>
Co-authored-by: Mark Vieira <portugee@gmail.com>
Co-authored-by: Ryan Ernst <ryan@iernst.net>
This makes a couple of changes to regex processing in the compute
engine:
1. Process utf-8 strings directly. This should save a ton of time.
2. Snip the `toString` output if it is too big - I chose 64kb of
strings.
3. I changed the formatting of the automaton to a slightly customized
`dot` output. Because automata are graphs. Everyone knows it. And
they are a lot easier to read as graphs. `dot` is easy to convert
into a graph.
4. I implement `EvaluatorMapper` for regex operations which is pretty
standard for the rest of our operations.
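Point 2 can be sketched like so (the truncation marker and budget handling are illustrative):

```java
// Illustrative sketch of capping a toString at a fixed budget so huge
// automata don't produce multi-megabyte debug strings.
class Snip {
    static final int MAX = 64 * 1024;  // 64kb of string, as described above

    static String snip(String s) {
        return s.length() <= MAX ? s : s.substring(0, MAX) + "... (truncated)";
    }
}
```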
We want to report that observation of document parsing has finished only upon successful indexing.
To achieve this, we need to perform reporting in only one place (not, as previously, in both IngestService and 'bulk action').
This commit splits the DocumentParsingObserver in two. One for wrapping an XContentParser and returning the observed state - the DocumentSizeObserver and a DocumentSizeReporter to perform an action when parsing has been completed and indexing successful.
To perform reporting in one place we need to pass the state from IngestService to 'bulk action'. The state is currently represented as long - normalisedBytesParsed.
In TransportShardBulkAction we get the normalisedBytesParsed information, and in the serverless plugin we check whether the value indicates that parsing already happened in IngestService (value != -1); if so, we create a DocumentSizeObserver with the fixed normalisedBytesParsed and won't increment it.
When the indexing is completed and successful we report the observed state for an index with DocumentSizeReporter
Small nit: by passing the documentParsingObserver via SourceToParse we no longer have to inject it via a complex hierarchy for DocumentParser. Hence some constructor changes.