This adds comprehensive tests for `ExpressionEvaluator` making sure that it releases `Block`s. It fixes all of the `mv_*` evaluators to make sure they release as well.
This commit updates the hash grouping operator to close input pages, as well as use the block factory for internally created blocks.
Additionally:
* Adds a MockBlockFactory to help with tracking block creation
* Eagerly creates the block view of a vector, which helps with tracking since there can be only one block view instance per vector
* Resolves an issue with Filter Blocks, whereby they previously tried to emit their contents in toString
This creates `Block.Ref`, a reference to a `Block` which may or may not
be part of a `Page`. `Block.Ref` is `Releasable` and closing it is a
noop if the `Block` is part of a `Page`, but if it is "free floating"
then closing the `Block.Ref` will close the block.
It also modifies `ExpressionEvaluator` to return a `Block.Ref` instead
of a `Block`, so you tend to work with `ExpressionEvaluator`s like
this:
```
try (Block.Ref ref = eval.eval(page)) {
    return ref.block().doStuff();
}
```
This should make it *much* easier to release the memory from `Block`s
built by `ExpressionEvaluator`s.
This change is mostly mechanical, introducing the new signature for
`ExpressionEvaluator`. In a follow up change I'll modify the tests to
make sure we're correctly using it to close pages.
I did think about changing `ExpressionEvaluator` to add a method telling
you if the block that it returns must be closed or not. This would have
been more difficult to work with, and, ultimately, limiting.
Specifically, it is possible for an `ExpressionEvaluator` to *sometimes*
return a free floating block and other times return one that is
contained in a `Page`. Imagine `mv_concat` - it returns the block it
receives if the block doesn't have multivalued fields. Otherwise it
concats things. If that block happens to come directly out of the
`Page`, then `mv_concat` will sometimes produce free floating blocks and
sometimes not.
Today, we have the ability to specify whether multivalued fields are
sorted in ascending order or not. This feature allows operators like
topn to enable optimizations. However, we are currently missing the
deduplicated attribute. If multivalued fields are deduplicated at each
position, we can further optimize operators such as hash and mv_dedup.
In fact, blocks should not have the mv_ascending property alone; it always
goes together with mv_deduplicated. Additionally, mv_dedup or hash
should generate blocks that have only the mv_dedup property.
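To illustrate the constraint, here is a minimal sketch of how the orderings could be modeled as a single enum; the names are illustrative, not the actual Block API:
```
// Hypothetical enum, not the real Block API: ascending never appears
// without deduplicated, so the combinations collapse to three states.
enum MvOrdering {
    UNORDERED,
    DEDUPLICATED_UNORDERED,
    DEDUPLICATED_AND_SORTED_ASCENDING
}
```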
This commit adds a BlockFactory - an extra level of indirection when building blocks. The factory couples block building with circuit breaking, allowing the breaker to be incremented as blocks and vectors are built.
This PR adds the infrastructure to allow us to move the operators and implementations over to the factory, rather than actually moving them all over at once.
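As a rough illustration of that coupling, here is a sketch under assumed names - not the real BlockFactory or CircuitBreaker API:
```
// Sketch only: every allocation made through the factory is charged to a
// breaker first, and given back when the block is released.
interface Breaker {
    void addEstimateBytesAndMaybeBreak(long bytes, String label); // throws when over the limit
    void addWithoutBreaking(long bytes);                          // negative bytes on release
}

final class SketchBlockFactory {
    private final Breaker breaker;

    SketchBlockFactory(Breaker breaker) {
        this.breaker = breaker;
    }

    long[] newLongBlock(int positionCount) {
        breaker.addEstimateBytesAndMaybeBreak((long) positionCount * Long.BYTES, "esql-block");
        return new long[positionCount];
    }

    void release(long[] block) {
        breaker.addWithoutBreaking(-(long) block.length * Long.BYTES);
    }
}
```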
This prevents topn operations from using too much memory by hooking them
into circuit breaking framework. It builds on the work done in
https://github.com/elastic/elasticsearch/pull/99316 that moved all topn
storage to byte arrays by adding circuit breaking to the process of growing
the underlying byte array.
This commit adds DriverContext to the construction of Evaluators.
DriverContext is enriched to carry bigArrays, and will eventually carry a BlockFactory - it is the context for code that needs to create instances of blocks and big arrays.
This lowers topn's memory usage somewhat and makes it easier to track
the memory usage. That looks like:
```
"status" : {
"occupied_rows" : 10000,
"ram_bytes_used" : 255392224,
"ram_used" : "243.5mb"
}
```
In some cases the memory usage savings are significant. In an example
with many, many keys the memory usage of each row drops from `58kb` to
`25kb`. That example is a little degenerate though, and I expect the savings
to normally be on the order of 10%.
The real advantage is memory tracking. It's *easy* to track used memory.
And, in a followup, it should be fairly easy to circuit break on the
used memory.
Mostly this is done by adding new abstractions and moving existing
abstractions to top level classes with tests and stuff.
* `TopNEncoder` is now a top level class. It has grown the ability to *decode* values as well as encode them. And it has grown "unsortable" versions which don't write their values such that sorting the bytes sorts the values. We use the "unsortable" versions when writing values.
* `KeyExtractor` extracts keys from the blocks and writes them to the row's `BytesRefBuilder`. This is basically objects replacing one of the switch statements in `RowFactory`. They are more scattered but easier to test, and hopefully `TopNOperator` is more readable with this behavior factored out. Also! Most implementations are automatically generated.
* `ValueExtractor` extracts values from the blocks and writes them to the row's `BytesRefBuilder`. This replaces the other switch statement in `RowFactory` for the same reasons, except instead of writing to many arrays it writes to a `BytesRefBuilder` just like the key as compactly as it can manage.
The memory savings come from three changes:
1. Lower overhead for storing values by encoding them rather than using many primitive arrays.
2. Encode the value count as a vint rather than a whole int. Usually there are very few rows and vint encodes that quite nicely.
3. Don't write values that are in the key for single-valued fields. Instead we read them from the key. That's going to be very very common.
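For change 2, a plain varint sketch shows why this helps - a small count fits in a single byte instead of four. This is the standard 7-bits-per-byte encoding, not necessarily the exact TopN encoder:
```
// Writes `value` as a varint into `out` starting at `offset`, returning the
// new offset. Counts below 128 take one byte instead of a fixed four.
static int writeVInt(byte[] out, int offset, int value) {
    while ((value & ~0x7F) != 0) {
        out[offset++] = (byte) ((value & 0x7F) | 0x80); // low 7 bits plus a "more" flag
        value >>>= 7;
    }
    out[offset++] = (byte) value;
    return offset;
}
```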
This is unlikely to be faster than the old code. I haven't really tried
for speed. Just memory usage and accountability. Once we get good
accounting we can try and make this faster. I expect we'll have to
figure out the megamorphic invocations I've added. But, for now, they
help more than they hurt.
CompatibilityVersions now holds a map of system index names to their
mappings versions, alongside the transport version. We also add mapping
versions to the "minimum version barrier": if a node has a system index
whose version is below the cluster mappings version for that system
index, it is not allowed to join the cluster.
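Roughly, the barrier works like the sketch below; the names and types are illustrative, not the actual join validator:
```
// Rejects a joining node if any of its system index mappings versions is
// older than the version the cluster already requires for that index.
static void ensureMappingsVersionsCompatible(
    java.util.Map<String, Integer> clusterMappingsVersions,
    java.util.Map<String, Integer> nodeMappingsVersions
) {
    for (var cluster : clusterMappingsVersions.entrySet()) {
        Integer nodeVersion = nodeMappingsVersions.get(cluster.getKey());
        if (nodeVersion != null && nodeVersion < cluster.getValue()) {
            throw new IllegalStateException(
                "system index [" + cluster.getKey() + "] mappings version is below the cluster minimum"
            );
        }
    }
}
```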
* ESQL: Disable optimizations with bad null handling
We have optimizations that kick in when aggregating on the following
pairs of field types:
* `long`, `long`
* `keyword`, `long`
* `long`, `keyword`
These optimizations don't have proper support for `null` valued fields
but will grow that after #98749. In the meantime this disables them in
a way that prevents them from bit-rotting.
* Update docs/changelog/99434.yaml
Cluster state currently holds a cluster minimum transport version and a map of nodes to transport versions. However, to determine node compatibility, we will need to account for more types of versions in cluster state than just the transport version (see #99076). Here we introduce a wrapper class to cluster state and update accessors and builders to use the new method. (I would have liked to re-use org.elasticsearch.cluster.node.VersionInformation, but that one holds IndexVersion rather than TransportVersion.)
* Introduce CompatibilityVersions to cluster state class
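A minimal sketch of the wrapper shape, with stand-in types - the real class wraps TransportVersion and gets extended later with more version kinds:
```
// Stand-in for the new cluster state wrapper: today it only carries the
// transport version id, but more version types can be added without
// touching every accessor again.
record CompatibilityVersionsSketch(int transportVersionId) {
    static CompatibilityVersionsSketch minimumOf(java.util.Collection<CompatibilityVersionsSketch> nodes) {
        return nodes.stream()
            .min(java.util.Comparator.comparingInt(CompatibilityVersionsSketch::transportVersionId))
            .orElseThrow();
    }
}
```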
We did not use the cluster settings on these gigantic objects
except for the one spot in the aggregation context,
so we can just hold a reference to it on the aggregation context
and simplify things a little for tests etc.
Also, inline needless indirection via a single-use private method in
`toQuery`.
The reason for this is to wire up the ESQL evaluator interface (for
things such as the Warnings infrastructure).
In the process, rearrange the evaluator classes under expression a bit:
- introduce an evaluator package
- move EvalMapper & its family of mappers under evaluator (from the planner)
- extract the common interface from EvalMapper into its own file and rename
it from the generic Mapper (of which we have several classes) to
EvaluatorMapper
- mirror the package hierarchy from the expression package
- widen visibility from protected to public (side-effect of the above)
- move classes that only generate code from expression to evaluator
When multivalued fields are loaded from lucene they are in sorted order
but we weren't taking advantage of that fact. Now we are! It's much
faster, even for fast operations like `mv_min`
```
(operation) Mode Cnt Score Error Units
mv_min avgt 7 3.820 ± 0.070 ns/op
mv_min_ascending avgt 7 1.979 ± 0.130 ns/op
```
We still have code to run in non-sorted mode because conversion functions
and a few other things don't load in sorted order.
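A sketch of the optimization in plain Java (not the generated evaluator code): when a position's values are known to be ascending, `mv_min` can read the first value instead of scanning them all.
```
// `values[first .. first + count)` holds one position's values.
static long mvMin(long[] values, int first, int count, boolean ascending) {
    if (ascending) {
        return values[first];              // sorted ascending: the first value is the minimum
    }
    long min = values[first];              // unsorted: scan the whole position
    for (int i = first + 1; i < first + count; i++) {
        min = Math.min(min, values[i]);
    }
    return min;
}
```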
I've also expanded the parameterized tests for the `MV_` functions
because, well, I needed to expand them at least a little to test this
change. And I just kept going and improved as many tests as I could.
In serverless we would like to report on (meter and bill for) document ingestion. The metering should be agnostic to the document format (the document structure should be normalised), hence we should allow creating XContentParsers which keep track of parsed fields and values.
There are 2 places where parsing of the ingested document happens:
1. upon 'raw bulk', when a request is sent without pipelines
2. upon the 'ingest service', when a request is sent with pipelines
(parsing can occur twice when dynamic mappings are calculated; this PR takes this into account and prevents double billing)
We also want to make sure that the metering logic is not unnecessarily executed when a document was already reported. That is, if a document was reported in IngestService, there is no point wrapping the XContentParser again.
This commit introduces a `DocumentReporterPlugin`, an internal plugin that will be implemented in serverless. This plugin should return a `DocumentParsingObserver` supplier which will create a `DocumentParsingObserver`. A `DocumentParsingObserver` is used to wrap an `XContentParser` with an implementation that keeps track of parsed fields and values (performs the metering) and allows sending that information, along with the index name, to a MeteringReporter.
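A sketch of the wrapping idea with hypothetical types (the real observer wraps `XContentParser`, which has a much larger surface):
```
// Hypothetical minimal parser interface standing in for XContentParser.
interface TokenParser {
    String nextToken(); // null at the end of the document
}

// Delegating wrapper that counts what flows through it - the "metering" part.
final class CountingParser implements TokenParser {
    private final TokenParser delegate;
    private long parsedTokens;

    CountingParser(TokenParser delegate) {
        this.delegate = delegate;
    }

    @Override
    public String nextToken() {
        String token = delegate.nextToken();
        if (token != null) {
            parsedTokens++;
        }
        return token;
    }

    long parsedTokens() {
        return parsedTokens;   // reported later, along with the index name
    }
}
```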
This sizes pages produced by operators based on an estimate of the
number of bytes that'll be added to the page after it's been emitted.
At this point it only really works properly for the
`LuceneSourceOperator`, but that's pretty useful!
The `LuceneTopNSourceOperator` doesn't yet have code to cut the output
of a topn collected from lucene into multiple pages, so it ignores the
calculated value. We'll get to that in a follow up.
We feed the right value into aggregations but ungrouped aggregations
ignore it because they only ever emit one row. Grouped aggregations
don't yet have any code to cut their output into multiple pages.
TopN *does* have code to cut the output into multiple pages but the
estimates passed to it are kind of hacky. A proper estimate of TopN
would account for the size of rows flowing into it, but I never wrote
code for that. The thing is - TopN doesn't have to estimate incoming row
size - it can measure each row as it builds it and use the estimate we're
building now as an estimate of extra bytes that'll be added. Which is
what it is! But that code also needs to be written.
Relates to https://github.com/elastic/elasticsearch-internal/issues/1385
This commit disables javadocs for the benchmarks project, since the docs are not necessary or interesting, and cause warning noise in the build log output.
Today, we have a hard-coded maximum page size of 16K in Lucene operators
and other operators like TopN and HashOperator. This default value
should work well in production. However, it doesn't provide enough
randomization in our tests because we mostly emit a single page.
Additionally, some tests take a significant amount of time because they
require indexing a large number of documents, which is several times the
page size.
To address these, this PR makes the page size parameter
configurable via the query pragmas, enabling randomization in tests.
This change has already uncovered a bug in LongLongBlockHash.
When you group by more than one multivalued field we generate one ord
per unique tuple of values, one from each column. So if you group by
```
a=(1, 2, 3) b=(2, 3) c=(4, 5, 5)
```
Then you get these grouping keys:
```
1, 2, 4
1, 2, 5
1, 3, 4
1, 3, 5
2, 2, 4
2, 2, 5
2, 3, 4
2, 3, 5
3, 2, 4
3, 2, 5
3, 3, 4
3, 3, 5
```
That's as many grouping keys as the product of the set-wise cardinality
of each column. "Product" is a dangerous word! It's possible to make a
simple document containing just two fields that each are a list of
10,000 values and then send *that* into the aggregation framework. That
little baby document will spit out 100,000,000 grouping ordinals!
Without this PR we'd try to create a single `Block` that contains that
many entries. Or, rather, it'd be as big as the nearest power of two.
Gigantonormous. About 760mb! Like, possible, but a huge "slug" of heap
usage and not great.
This PR changes it so, at least for pairs of `long` keys, we'll make many
smaller blocks. We cut the emitted ordinals into blocks of no more than
16*1024 entries, the default length of a block. That means our baby
document would make 6103 full blocks and one half-full block. But each
one is going to be less than 200kb.
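The idea in miniature (an illustrative helper, not the real BlockHash code):
```
// Cut one huge run of grouping ordinals into blocks of at most maxBlockSize
// entries (16 * 1024 in the description above).
static java.util.List<int[]> chunkOrds(int[] ords, int maxBlockSize) {
    java.util.List<int[]> blocks = new java.util.ArrayList<>();
    for (int start = 0; start < ords.length; start += maxBlockSize) {
        int end = Math.min(start + maxBlockSize, ords.length);
        blocks.add(java.util.Arrays.copyOfRange(ords, start, end));
    }
    return blocks;
}
```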
Relates to ESQL-1360
This adds support for multivalued fields to the `PackedValuesHash`. It does
so by encoding batches of values for each column into bytes and then
reading the converted values row-wise. This works while also not causing
per-row megamorphic calls. There are megamorphic calls when the batch is
used up, but that should only hit a few times per block.
This implements the `MV_DEDUPE` function that removes duplicates from
multivalued fields. It wasn't strictly in our list of things we need in
the first release, but I'm grabbing this now because I realized I needed
very similar infrastructure when I was trying to build grouping by
multivalued fields. In fact, I realized that I could use our
stringtemplate code generation to generate most of the complex parts.
This generates the actual body of `MV_DEDUPE`'s implementation and the
body of the `Block` accepting `BlockHash` implementations. It'll be
useful in the final step for grouping by multivalued fields.
I also got pretty curious about whether the `O(n^2)` or `O(n*log(n))`
algorithm for deduplication is faster. I'd been assuming that for all
reasonable sized inputs the `O(n^2)` bubble sort looking selection
algorithm was faster. So I measured it. And it's mostly true - even for
`BytesRef` if you have a dozen entries the selection algorithm is
faster. Lower overhead and stuff. Anyway, to measure it I had to
implement the copy-and-sort `O(n*log(n))` algorithm. So while I was
there I plugged it in and selected it in cases where the number of
inputs is large and the selection algorithm is likely to be slower.
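For reference, here are the two strategies in plain Java (a sketch, not the generated evaluator code): the copy-free `O(n^2)` selection-style pass used for small inputs, and the copy-and-sort `O(n*log(n))` pass used when the input is large.
```
// O(n^2): compare each value against the unique ones found so far. No copy,
// low overhead, wins for small n.
static int dedupeSmall(long[] values, int count) {
    int unique = 0;
    for (int i = 0; i < count; i++) {
        boolean seen = false;
        for (int j = 0; j < unique; j++) {
            if (values[j] == values[i]) {
                seen = true;
                break;
            }
        }
        if (seen == false) {
            values[unique++] = values[i];
        }
    }
    return unique; // distinct values are now packed at the front of the array
}

// O(n*log(n)): copy, sort, then drop adjacent duplicates. Wins for large n.
static long[] dedupeLarge(long[] values, int count) {
    long[] copy = java.util.Arrays.copyOf(values, count);
    java.util.Arrays.sort(copy);
    int unique = 0;
    for (int i = 0; i < count; i++) {
        if (i == 0 || copy[i] != copy[i - 1]) {
            copy[unique++] = copy[i];
        }
    }
    return java.util.Arrays.copyOf(copy, unique);
}
```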
This adds IndexVersion to cluster state, alongside node version. This is needed so IndexVersion can be tracked across the cluster, allowing min/max supported index versions to be determined.
* Convert Avg into a SurrogateExpression and introduce a dedicated rule
for handling surrogate AggregateFunctions
* Remove the Avg implementation
* Use sum instead of avg in some planning tests
* Add a dataType case for the Div operator
Relates ESQL-747
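The rewrite itself is simple; here is a sketch with stand-in record types rather than the real ESQL expression tree:
```
// Stand-in expression tree (records), not the real ESQL classes.
sealed interface Expr permits Field, Sum, Count, Div, Avg {}
record Field(String name) implements Expr {}
record Sum(Expr field) implements Expr {}
record Count(Expr field) implements Expr {}
record Div(Expr left, Expr right) implements Expr {}
record Avg(Expr field) implements Expr {}

final class AvgSurrogate {
    // avg(x) is planned as sum(x) / count(x), so Avg needs no aggregator of its own.
    static Expr surrogate(Avg avg) {
        return new Div(new Sum(avg.field()), new Count(avg.field()));
    }
}
```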
In SQL `AVG(foo)` is `null` if there are no values for `foo`. Same for
`MIN(foo)` and `MAX(foo)`. In fact, the only functions that don't return
`null` on empty inputs seem to be `COUNT` and `COUNT(DISTINCT)`.
This flips our non-grouping aggs to have the same behavior because it's
both more expected and fits better with other things we're building.
This *is* different from Elasticsearch's aggs. But it's different in a
good way. It also lines up more closely with the way that our grouping
aggs work.
This also revives the broken `AggregatorBenchmark` so that I could get
performance figures for this change. And it's within the margin of
error:
```
(blockType) (grouping) (op) Mode Cnt Before After Units
vector_longs none sum avgt 7 0.440 ± 0.017 0.397 ± 0.003 ns/op
half_null_longs none sum avgt 7 5.785 ± 0.022 5.861 ± 0.134 ns/op
```
I expected a small slowdown on the `half_null_longs` line and see one,
but it is within the margin of error. Either way, that line is not
nearly as optimized. We'll loop back around to it eventually.
Closes ESQL-1297
* Initial import for TDigest forking.
* Fix MedianTest.
More work needed for TDigestPercentile*Tests and the TDigestTest (and
the rest of the tests) in the tdigest lib to pass.
* Fix Dist.
* Fix AVLTreeDigest.quantile to match Dist for uniform centroids.
* Update docs/changelog/96086.yaml
* Fix `MergingDigest.quantile` to match `Dist` on uniform distribution.
* Add merging to TDigestState.hashCode and .equals.
Remove wrong asserts from tests and MergingDigest.
* Fix style violations for tdigest library.
* Fix typo.
* Fix more style violations.
* Fix more style violations.
* Fix remaining style violations in tdigest library.
* Update results in docs based on the forked tdigest.
* Fix YAML tests in aggs module.
* Fix YAML tests in x-pack/plugin.
* Skip failing V7 compat tests in modules/aggregations.
* Fix TDigest library unittests.
Remove redundant serializing interfaces from the library.
* Remove YAML test versions for older releases.
These tests don't address compatibility issues in mixed cluster tests as
the latter contain a mix of older and newer nodes, so the output depends
on which node is picked as a data node since the forked TDigest library
is not backwards compatible (produces slightly different results).
* Fix test failures in docs and mixed cluster.
* Reduce buffer sizes in MergingDigest to avoid oom.
* Exclude more failing V7 compatibility tests.
* Update results for JdbcCsvSpecIT tests.
* Update results for JdbcDocCsvSpecIT tests.
* Revert unrelated change.
* More test fixes.
* Use version skips instead of blacklisting in mixed cluster tests.
* Switch TDigestState back to AVLTreeDigest.
* Update docs and tests with AVLTreeDigest output.
* Update flaky test.
* Remove dead code, esp around tracking of incoming data.
* Update docs/changelog/96086.yaml
* Delete docs/changelog/96086.yaml
* Remove explicit compression calls.
This was added to prevent concurrency tests from failing, but it leads
to reduced precision. Submit this to see if the concurrency tests are
still failing.
* Revert "Remove explicit compression calls."
This reverts commit 5352c96f65.
* Remove explicit compression calls to MedianAbsoluteDeviation input.
* Add unittests for AVL and merging digest accuracy.
* Fix spotless violations.
* Delete redundant tests and benchmarks.
* Fix spotless violation.
* Use the old implementation of AVLTreeDigest.
The latest library version is 50% slower and less accurate, as verified
by ComparisonTests.
* Update docs with latest percentile results.
* Update docs with latest percentile results.
* Remove repeated compression calls.
* Update more percentile results.
* Use approximate percentile values in integration tests.
This helps with mixed cluster tests, where some of the tests were
blocked.
* Fix expected percentile value in test.
* Revert in-place node updates in AVL tree.
Update quantile calculations between centroids and min/max values to
match v.3.2.
* Add SortingDigest and HybridDigest.
The SortingDigest tracks all samples in an ArrayList that
gets sorted for quantile calculations. This approach
provides perfectly accurate results and is the most
efficient implementation for up to millions of samples,
at the cost of bloated memory footprint.
The HybridDigest uses a SortingDigest for small sample
populations, then switches to a MergingDigest. This
approach combines the best performance and results for
small sample counts with very good performance and
acceptable accuracy for effectively unbounded sample
counts.
* Remove deps to the 3.2 library.
* Remove unused licenses for tdigest.
* Revert changes for SortingDigest and HybridDigest.
These will be submitted in a follow-up PR for enabling MergingDigest.
* Remove unused Histogram classes and unit tests.
Delete dead and commented out code, make the remaining tests run
reasonably fast. Remove unused annotations, esp. SuppressWarnings.
* Remove Comparison class, not used.
* Revert "Revert changes for SortingDigest and HybridDigest."
This reverts commit 2336b11598.
* Use HybridDigest as default tdigest implementation
Add SortingDigest as a simple structure for percentile calculations that
tracks all data points in a sorted array. This is a fast and perfectly
accurate solution that leads to bloated memory allocation.
Add HybridDigest that uses SortingDigest for small sample counts, then
switches to MergingDigest. This approach delivers extreme
performance and accuracy for small populations while scaling
indefinitely and maintaining acceptable performance and accuracy with
constant memory allocation (15kB by default).
Provide knobs to switch back to AVLTreeDigest, either per query or
through ClusterSettings.
* Small fixes.
* Add javadoc and tests.
* Add javadoc and tests.
* Remove special logic for singletons in the boundaries.
While this helps with the case where the digest contains only
singletons (perfect accuracy), it has a major issue
(non-monotonic quantile function) when the first singleton is followed
by a non-singleton centroid. It's preferable to revert to the old
version from 3.2; inaccuracies in a singleton-only digest should be
mitigated by using a sorted array for small sample counts.
* Revert changes to expected values in tests.
This is due to restoring quantile functions to match head.
* Revert changes to expected values in tests.
This is due to restoring quantile functions to match head.
* Tentatively restore percentile rank expected results.
* Use cdf version from 3.2
Update Dist.cdf to use interpolation, use the same cdf
version in AVLTreeDigest and MergingDigest.
* Revert "Tentatively restore percentile rank expected results."
This reverts commit 7718dbba59.
* Revert remaining changes compared to main.
* Revert excluded V7 compat tests.
* Exclude V7 compat tests still failing.
* Exclude V7 compat tests still failing.
* Remove ClusterSettings tentatively.
* Initial import for TDigest forking.
* Fix MedianTest.
More work needed for TDigestPercentile*Tests and the TDigestTest (and
the rest of the tests) in the tdigest lib to pass.
* Fix Dist.
* Fix AVLTreeDigest.quantile to match Dist for uniform centroids.
* Update docs/changelog/96086.yaml
* Fix `MergingDigest.quantile` to match `Dist` on uniform distribution.
* Add merging to TDigestState.hashCode and .equals.
Remove wrong asserts from tests and MergingDigest.
* Fix style violations for tdigest library.
* Fix typo.
* Fix more style violations.
* Fix more style violations.
* Fix remaining style violations in tdigest library.
* Update results in docs based on the forked tdigest.
* Fix YAML tests in aggs module.
* Fix YAML tests in x-pack/plugin.
* Skip failing V7 compat tests in modules/aggregations.
* Fix TDigest library unittests.
Remove redundant serializing interfaces from the library.
* Remove YAML test versions for older releases.
These tests don't address compatibility issues in mixed cluster tests as
the latter contain a mix of older and newer nodes, so the output depends
on which node is picked as a data node since the forked TDigest library
is not backwards compatible (produces slightly different results).
* Fix test failures in docs and mixed cluster.
* Reduce buffer sizes in MergingDigest to avoid oom.
* Exclude more failing V7 compatibility tests.
* Update results for JdbcCsvSpecIT tests.
* Update results for JdbcDocCsvSpecIT tests.
* Revert unrelated change.
* More test fixes.
* Use version skips instead of blacklisting in mixed cluster tests.
* Switch TDigestState back to AVLTreeDigest.
* Update docs and tests with AVLTreeDigest output.
* Update flaky test.
* Remove dead code, esp around tracking of incoming data.
* Remove explicit compression calls.
This was added to prevent concurrency tests from failing, but it leads
to reduced precision. Submit this to see if the concurrency tests are
still failing.
* Update docs/changelog/96086.yaml
* Delete docs/changelog/96086.yaml
* Revert "Remove explicit compression calls."
This reverts commit 5352c96f65.
* Remove explicit compression calls to MedianAbsoluteDeviation input.
* Add unittests for AVL and merging digest accuracy.
* Fix spotless violations.
* Delete redundant tests and benchmarks.
* Fix spotless violation.
* Use the old implementation of AVLTreeDigest.
The latest library version is 50% slower and less accurate, as verified
by ComparisonTests.
* Update docs with latest percentile results.
* Update docs with latest percentile results.
* Remove repeated compression calls.
* Update more percentile results.
* Use approximate percentile values in integration tests.
This helps with mixed cluster tests, where some of the tests were
blocked.
* Fix expected percentile value in test.
* Revert in-place node updates in AVL tree.
Update quantile calculations between centroids and min/max values to
match v.3.2.
* Add SortingDigest and HybridDigest.
The SortingDigest tracks all samples in an ArrayList that
gets sorted for quantile calculations. This approach
provides perfectly accurate results and is the most
efficient implementation for up to millions of samples,
at the cost of bloated memory footprint.
The HybridDigest uses a SortingDigest for small sample
populations, then switches to a MergingDigest. This
approach combines the best performance and results for
small sample counts with very good performance and
acceptable accuracy for effectively unbounded sample
counts.
* Remove deps to the 3.2 library.
* Remove unused licenses for tdigest.
* Revert changes for SortingDigest and HybridDigest.
These will be submitted in a follow-up PR for enabling MergingDigest.
* Remove unused Histogram classes and unit tests.
Delete dead and commented out code, make the remaining tests run
reasonably fast. Remove unused annotations, esp. SuppressWarnings.
* Remove Comparison class, not used.
* Revert "Revert changes for SortingDigest and HybridDigest."
This reverts commit 2336b11598.
* Use HybridDigest as default tdigest implementation
Add SortingDigest as a simple structure for percentile calculations that
tracks all data points in a sorted array. This is a fast and perfectly
accurate solution that leads to bloated memory allocation.
Add HybridDigest that uses SortingDigest for small sample counts, then
switches to MergingDigest. This approach delivers extreme
performance and accuracy for small populations while scaling
indefinitely and maintaining acceptable performance and accuracy with
constant memory allocation (15kB by default).
Provide knobs to switch back to AVLTreeDigest, either per query or
through ClusterSettings.
* Add javadoc and tests.
* Remove ClusterSettings tentatively.
* Restore bySize function in TDigest and subclasses.
* Update Dist.cdf to match the rest.
Update tests.
* Revert outdated test changes.
* Revert outdated changes.
* Small fixes.
* Update docs/changelog/96794.yaml
* Make HybridDigest the default implementation.
* Update boxplot documentation.
* Restore AVLTreeDigest as the default in TDigestState.
TDigest.createHybridDigest now returns the right type.
The switch in TDigestState will happen in a separate PR
as it requires many test updates.
* Use execution_hint in tdigest spec.
* Fix Dist.cdf for empty digest.
* Pass ClusterSettings through SearchExecutionContext.
* Bump up TransportVersion.
* Bump up TransportVersion for real.
* HybridDigest uses its final implementation during deserialization.
* Restore the right TransportVersion in TDigestState.read
* Add dummy SearchExecutionContext factory for tests.
* Use TDigestExecutionHint instead of strings.
* Remove check for null context.
* Add link to TDigest javadoc.
* Use NodeSettings directly.
* Init executionHint to null, set before using.
* Update docs/changelog/96943.yaml
* Pass initialized executionHint to createEmptyPercentileRanksAggregator.
* Initialize TDigestExecutionHint.SETTING to "DEFAULT".
* Initialize TDigestExecutionHint to null.
* Use readOptionalWriteable/writeOptionalWriteable.
Move test-only SearchExecutionContext method in helper class under
test.
* Bump up TransportVersion.
* Small fixes.
Preparation for aggs to allow for consumption of multiple input
channels, and output of more than one Block.
The salient change can be seen in difference to the AggregatorFunction
and GroupingAggregatorFunction interfaces, e.g.:
```diff
- void addIntermediateInput(Block block);
- Block evaluateIntermediate();
- Block evaluateFinal();
---
+ void addIntermediateInput(Page page);
+ void evaluateIntermediate(Block[] blocks, int offset);
+ void evaluateFinal(Block[] blocks, int offset);
```
addIntermediateInput accepts a Page (rather than a Block), to allow the
aggregator function to consume multiple channels.
evaluateXXX accepts a block array and offset, to allow the aggregator
function to populate array elements.
For now, aggs continue to just use a single input channel and output
just a single block. A follow on change will refactor this.
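A sketch of why the new `evaluateIntermediate`/`evaluateFinal` shape helps (stand-in types, not the real interfaces): several aggs can write their output blocks side by side into one shared array.
```
// Each agg fills blocks[offset .. offset + n) and the caller advances the offset.
interface AggSketch {
    int intermediateBlockCount();
    void evaluateIntermediate(Object[] blocks, int offset);
}

static void evaluateAll(AggSketch[] aggs, Object[] blocks) {
    int offset = 0;
    for (AggSketch agg : aggs) {
        agg.evaluateIntermediate(blocks, offset);
        offset += agg.intermediateBlockCount();
    }
}
```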
This flips the resolution of aggs from a tree of switch statements into
method calls on the function objects. This gives us much more control
over how we construct the aggs, making it much simpler to flow parameters
through the system and easier to make sure that only appropriate aggs
run in the right spot.
This commit changes access to the latest TransportVersion constant to
use a static method instead of a public static field. By encapsulating
the field we will be able to (in a followup) lazily determine what the
latest is, outside of clinit.