ESQL: Add BlockHash#lookup (#107762)

Adds a `lookup` method to `BlockHash` which finds keys that are already
in the hash, without modifying it, and returns the "ordinal" that the
`BlockHash` assigned to each key when it was first passed to `add`.
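
For orientation, the method's shape is roughly the following. This is a
sketch based on the description in this PR; the exact parameter list may
differ:

```java
// Sketch of the lookup API described above. Page, IntBlock, and
// ByteSizeValue are existing Elasticsearch types; the targetBlockSize
// parameter is an assumption based on the sizing discussion below.
ReleasableIterator<IntBlock> lookup(Page page, ByteSizeValue targetBlockSize);
```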

For multi-column keys this can change the number of values per position
pretty drastically: you get a combinatorial explosion. If you have three
columns with two values each, the most values you can get is 2*2*2=8.
If you have five columns with ten values each, a single position can
produce 10*10*10*10*10=100,000 values! That's too many.

Let's do an example! This one is a two row block containing three
columns. One row has two values in each column, so it could produce at
most 2*2*2=8 values. In this case one of those combinations is missing
from the hash, so it only produces 7.

Block:
|   a  |   b  |   c  |
| ----:| ----:| ----:|
|    1 |    4 |    6 |
| 1, 2 | 3, 4 | 5, 6 |

BlockHash contents:
| a | b | c |
| -:| -:| -:|
| 1 | 3 | 5 |
| 1 | 3 | 6 |
| 1 | 4 | 5 |
| 1 | 4 | 6 |
| 2 | 3 | 5 |
| 2 | 3 | 6 |
| 2 | 4 | 6 |

Results:

|          ord        |
| -------------------:|
|                   3 |
| 0, 1, 2, 3, 4, 5, 6 |
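
To make that concrete, here is a hedged sketch of how a caller might
walk those results. `hash` and `page` are assumed to be built elsewhere,
and the one-megabyte target is arbitrary:

```java
// Iterate the ordinal blocks produced for the example page above.
static void dumpOrdinals(BlockHash hash, Page page) {
    try (ReleasableIterator<IntBlock> ordinals = hash.lookup(page, ByteSizeValue.ofMb(1))) {
        while (ordinals.hasNext()) {
            try (IntBlock block = ordinals.next()) {
                // For this example: position 0 holds the single ordinal 3,
                // and position 1 holds ordinals 0..6 because (2, 4, 5) was
                // never added to the hash.
            }
        }
    }
}
```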

The `add` method has a fairly fool-proof mechanism to work around this:
it calls its consumers with a callback that can split positions into
multiple calls, batching roughly 16,000 positions at a time. Aggregations
use that callback, so you can aggregate over five columns with ten
values each. It's slow, but the callbacks let us get through it.
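
In code, the callback shape is roughly this. The interface and method
names below are hypothetical stand-ins for the real consumer interface:

```java
// Hypothetical sketch of the callback-style add. The hash walks the
// page, and whenever it has accumulated a batch of ordinals (roughly
// 16,000 positions' worth) it hands them to the consumer, so a
// combinatorial blow-up never materializes as one giant block.
interface GroupIdConsumer {
    void accept(int positionOffset, IntBlock groupIds);
}

void add(Page page, GroupIdConsumer consumer);
```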

Unlike `add`, `lookup` can't use a callback. We need it to return an
`Iterator` of `IntBlock`s containing ordinals, because that's how we're
going to consume it. That would be fine, except we can't split a single
position across multiple `Block`s; that's just not how `Block` works.

So, instead, we fail the query if we produce more than 100,000 entries
in a single position; at four bytes per ordinal that's already a single
400kb array, which is quite big. We'd like to stop collecting and emit
a warning instead, but that's a problem for another change.

Anyway! If we're not bumping into massive rows, we emit `IntBlock`s
targeting a particular size in memory. Likely we'll also want to plug in
a target number of rows as well, but for now this'll do.
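
Putting those two policies together, a hedged sketch; the constant and
method names here are illustrative, not the PR's:

```java
// Illustrative names, not the PR's actual ones.
static final int MAX_LOOKUP_ENTRIES_PER_POSITION = 100_000;

// Fail the query rather than build one enormous block: 100,000
// ordinals is already a single 400kb int array.
static void checkPositionSize(int entriesInPosition) {
    if (entriesInPosition > MAX_LOOKUP_ENTRIES_PER_POSITION) {
        throw new IllegalArgumentException(
            "lookup produced [" + entriesInPosition + "] entries in a single position"
        );
    }
}

// Emit the current IntBlock once the buffered ordinals (4 bytes each)
// reach the target size, then start a fresh builder.
static boolean shouldEmit(int ordinalsBuffered, ByteSizeValue targetBlockSize) {
    return (long) ordinalsBuffered * Integer.BYTES >= targetBlockSize.getBytes();
}
```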

@@ -0,0 +1,49 @@
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0 and the Server Side Public License, v 1; you may not use this file except
 * in compliance with, at your election, the Elastic License 2.0 or the Server
 * Side Public License, v 1.
 */

package org.elasticsearch.core;

import java.util.Iterator;
import java.util.Objects;

/**
 * An {@link Iterator} with state that must be {@link #close() released}.
 */
public interface ReleasableIterator<T> extends Releasable, Iterator<T> {
    /**
     * Returns a single element iterator over the supplied value.
     */
    static <T extends Releasable> ReleasableIterator<T> single(T element) {
        return new ReleasableIterator<>() {
            private T value = Objects.requireNonNull(element);

            @Override
            public boolean hasNext() {
                return value != null;
            }

            @Override
            public T next() {
                final T res = value;
                value = null;
                return res;
            }

            @Override
            public void close() {
                // If next() was never called, release the held value.
                // Releasables.close is a no-op on null, so closing after
                // next() leaves ownership of the element with the caller.
                Releasables.close(value);
            }

            @Override
            public String toString() {
                return "ReleasableIterator[" + value + "]";
            }
        };
    }
}
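
For instance, a caller might use `single` like this (`Releasable` has a
single abstract method, so a lambda works):

```java
// close() releases the value only if next() was never called; once
// next() returns the element, the caller owns it.
Releasable value = () -> System.out.println("released");
try (ReleasableIterator<Releasable> it = ReleasableIterator.single(value)) {
    while (it.hasNext()) {
        Releasable element = it.next();
        // Use element; releasing it is now the caller's responsibility.
        element.close();
    }
}
```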