ESQL: Add BlockHash#lookup (#107762)

Adds a `lookup` method to `BlockHash` which finds keys that are already
in the hash, without modifying it, and returns the "ordinal" that the
`BlockHash` assigned to each key when it was first passed to `add`.
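
For orientation, the method's shape is roughly the following. This is a
sketch based on the description in this PR; the exact parameter list may
differ:

```java
// Sketch of the lookup API described above. Page, IntBlock, and
// ByteSizeValue are existing Elasticsearch types; the targetBlockSize
// parameter is an assumption based on the sizing discussion below.
ReleasableIterator<IntBlock> lookup(Page page, ByteSizeValue targetBlockSize);
```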

For multi-column keys this can change the number of values per position
pretty drastically: you get a combinatorial explosion. If you have three
columns with two values each, the most values you can get is 2*2*2=8.
If you have five columns with ten values each, a single position can
produce 10*10*10*10*10=100,000 values! That's too many.

Let's do an example! This one is a two row block containing three
columns. One row has two values in each column, so it could produce at
most 2*2*2=8 values. In this case one of those combinations is missing
from the hash, so it only produces 7.

Block:
|   a  |   b  |   c  |
| ----:| ----:| ----:|
|    1 |    4 |    6 |
| 1, 2 | 3, 4 | 5, 6 |

BlockHash contents:
| a | b | c |
| -:| -:| -:|
| 1 | 3 | 5 |
| 1 | 3 | 6 |
| 1 | 4 | 5 |
| 1 | 4 | 6 |
| 2 | 3 | 5 |
| 2 | 3 | 6 |
| 2 | 4 | 6 |

Results:

|          ord        |
| -------------------:|
|                   3 |
| 0, 1, 2, 3, 4, 5, 6 |
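
To make that concrete, here is a hedged sketch of how a caller might
walk those results. `hash` and `page` are assumed to be built elsewhere,
and the one-megabyte target is arbitrary:

```java
// Iterate the ordinal blocks produced for the example page above.
static void dumpOrdinals(BlockHash hash, Page page) {
    try (ReleasableIterator<IntBlock> ordinals = hash.lookup(page, ByteSizeValue.ofMb(1))) {
        while (ordinals.hasNext()) {
            try (IntBlock block = ordinals.next()) {
                // For this example: position 0 holds the single ordinal 3,
                // and position 1 holds ordinals 0..6 because (2, 4, 5) was
                // never added to the hash.
            }
        }
    }
}
```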

The `add` method has a fairly fool-proof mechanism to work around this:
it calls its consumers with a callback that can split positions into
multiple calls, batching roughly 16,000 positions at a time. Aggregations
use that callback, so you can aggregate over five columns with ten
values each. It's slow, but the callbacks let us get through it.
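
In code, the callback shape is roughly this. The interface and method
names below are hypothetical stand-ins for the real consumer interface:

```java
// Hypothetical sketch of the callback-style add. The hash walks the
// page, and whenever it has accumulated a batch of ordinals (roughly
// 16,000 positions' worth) it hands them to the consumer, so a
// combinatorial blow-up never materializes as one giant block.
interface GroupIdConsumer {
    void accept(int positionOffset, IntBlock groupIds);
}

void add(Page page, GroupIdConsumer consumer);
```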

Unlike `add`, `lookup` can't use a callback. We need it to return an
`Iterator` of `IntBlock`s containing ordinals, because that's how we're
going to consume it. That would be fine, except we can't split a single
position across multiple `Block`s; that's just not how `Block` works.

So, instead, we fail the query if we produce more than 100,000 entries
in a single position; at four bytes per ordinal that's already a single
400kb array, which is quite big. We'd like to stop collecting and emit
a warning instead, but that's a problem for another change.

Anyway! If we're not bumping into massive rows, we emit `IntBlock`s
targeting a particular size in memory. Likely we'll also want to plug in
a target number of rows as well, but for now this'll do.
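
Putting those two policies together, a hedged sketch; the constant and
method names here are illustrative, not the PR's:

```java
// Illustrative names, not the PR's actual ones.
static final int MAX_LOOKUP_ENTRIES_PER_POSITION = 100_000;

// Fail the query rather than build one enormous block: 100,000
// ordinals is already a single 400kb int array.
static void checkPositionSize(int entriesInPosition) {
    if (entriesInPosition > MAX_LOOKUP_ENTRIES_PER_POSITION) {
        throw new IllegalArgumentException(
            "lookup produced [" + entriesInPosition + "] entries in a single position"
        );
    }
}

// Emit the current IntBlock once the buffered ordinals (4 bytes each)
// reach the target size, then start a fresh builder.
static boolean shouldEmit(int ordinalsBuffered, ByteSizeValue targetBlockSize) {
    return (long) ordinalsBuffered * Integer.BYTES >= targetBlockSize.getBytes();
}
```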

@@ -0,0 +1,49 @@
/*
 * Copyright Elasticsearch B.V. and/or licensed to Elasticsearch B.V. under one
 * or more contributor license agreements. Licensed under the Elastic License
 * 2.0 and the Server Side Public License, v 1; you may not use this file except
 * in compliance with, at your election, the Elastic License 2.0 or the Server
 * Side Public License, v 1.
 */

package org.elasticsearch.core;

import java.util.Iterator;
import java.util.Objects;

/**
 * An {@link Iterator} with state that must be {@link #close() released}.
 */
public interface ReleasableIterator<T> extends Releasable, Iterator<T> {
    /**
     * Returns a single element iterator over the supplied value.
     */
    static <T extends Releasable> ReleasableIterator<T> single(T element) {
        return new ReleasableIterator<>() {
            private T value = Objects.requireNonNull(element);

            @Override
            public boolean hasNext() {
                return value != null;
            }

            @Override
            public T next() {
                final T res = value;
                value = null;
                return res;
            }

            @Override
            public void close() {
                // If next() was never called, release the held value.
                // Releasables.close is a no-op on null, so closing after
                // next() leaves ownership of the element with the caller.
                Releasables.close(value);
            }

            @Override
            public String toString() {
                return "ReleasableIterator[" + value + "]";
            }
        };
    }
}
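
For instance, a caller might use `single` like this (`Releasable` has a
single abstract method, so a lambda works):

```java
// close() releases the value only if next() was never called; once
// next() returns the element, the caller owns it.
Releasable value = () -> System.out.println("released");
try (ReleasableIterator<Releasable> it = ReleasableIterator.single(value)) {
    while (it.hasNext()) {
        Releasable element = it.next();
        // Use element; releasing it is now the caller's responsibility.
        element.close();
    }
}
```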