This adds comprehensive tests for `ExpressionEvaluator` making sure that it releases `Block`s. It fixes all of the `mv_*` evaluators to make sure they release as well.
This commit updates the hash grouping operator to close input pages, as well as use the block factory for internally created blocks.
Additionally:
* Adds a MockBlockFactory to help with tracking block creation
* Eagerly creates the block view of a vector, which helps with tracking since there can be only one block view instance per vector
* Resolves an issue with Filter Blocks, whereby they previously tried to emit their contents in toString
This creates `Block.Ref`, a reference to a `Block` which may or may not
be part of a `Page`. `Block.Ref` is `Releasable` and closing it is a
noop if the `Block` is part of a `Page`, but if it is "free floating"
then closing the `Block.Ref` will close the block.
It also modifies `ExpressionEvaluator` to return a `Block.Ref` instead
of a `Block`, so you tend to work with `ExpressionEvaluator`s like
this:
```
try (Block.Ref ref = eval.eval(page)) {
    return ref.block().doStuff();
}
```
This should make it *much* easier to release the memory from `Block`s
built by `ExpressionEvaluator`s.
This change is mostly mechanical, introducing the new signature for
`ExpressionEvaluator`. In a follow up change I'll modify the tests to
make sure we're correctly using it to close pages.
I did think about changing `ExpressionEvaluator` to add a method telling
you if the block that it returns must be closed or not. This would have
been more difficult to work with, and, ultimately, limiting.
Specifically, it is possible for an `ExpressionEvaluator` to *sometimes*
return a free floating block and other times return one that is
contained in a `Page`. Imagine `mv_concat` - it returns the block it
receives if the block doesn't have multivalued fields. Otherwise it
concats things. If that block happens to come directly out of the
`Page`, then `mv_concat` will sometimes produce free floating blocks and
sometimes not.
Today, we have the ability to specify whether multivalued fields are
sorted in ascending order or not. This feature allows operators like
topn to enable optimizations. However, we are currently missing the
deduplicated attribute. If multivalued fields are deduplicated at each
position, we can further optimize operators such as hash and mv_dedup.
In fact, blocks should not have the mv_ascending property alone; it always
goes together with mv_deduplicated. Additionally, mv_dedup or hash
should generate blocks that have only the mv_dedup property.
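To illustrate the constraint, here is a minimal sketch of how the orderings could be modeled as a single enum; the names are illustrative, not the actual Block API:
```
// Hypothetical enum, not the real Block API: ascending never appears
// without deduplicated, so the combinations collapse to three states.
enum MvOrdering {
    UNORDERED,
    DEDUPLICATED_UNORDERED,
    DEDUPLICATED_AND_SORTED_ASCENDING
}
```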
This commit adds a BlockFactory - an extra level of indirection when building blocks. The factory couples block building with circuit breaking, allowing the breaker to be incremented as blocks and vectors are built.
This PR adds the infrastructure to allow us to move the operators and implementations over to the factory, rather than actually moving them all over at once.
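As a rough illustration of that coupling, here is a sketch under assumed names - not the real BlockFactory or CircuitBreaker API:
```
// Sketch only: every allocation made through the factory is charged to a
// breaker first, and given back when the block is released.
interface Breaker {
    void addEstimateBytesAndMaybeBreak(long bytes, String label); // throws when over the limit
    void addWithoutBreaking(long bytes);                          // negative bytes on release
}

final class SketchBlockFactory {
    private final Breaker breaker;

    SketchBlockFactory(Breaker breaker) {
        this.breaker = breaker;
    }

    long[] newLongBlock(int positionCount) {
        breaker.addEstimateBytesAndMaybeBreak((long) positionCount * Long.BYTES, "esql-block");
        return new long[positionCount];
    }

    void release(long[] block) {
        breaker.addWithoutBreaking(-(long) block.length * Long.BYTES);
    }
}
```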
This prevents topn operations from using too much memory by hooking them
into circuit breaking framework. It builds on the work done in
https://github.com/elastic/elasticsearch/pull/99316 that moved all topn
storage to byte arrays by adding circuit breaking to the process of growing
the underlying byte array.
This commit adds DriverContext to the construction of Evaluators.
DriverContext is enriched to carry bigArrays, and will eventually carry a BlockFactory - it is the context for code that needs to create instances of blocks and big arrays.
This lowers topn's memory usage somewhat and makes it easier to track
the memory usage. That looks like:
```
"status" : {
"occupied_rows" : 10000,
"ram_bytes_used" : 255392224,
"ram_used" : "243.5mb"
}
```
In some cases the memory usage savings are significant. In an example
with many, many keys the memory usage of each row drops from `58kb` to
`25kb`. That example is a little degenerate though, and I expect the savings
to normally be on the order of 10%.
The real advantage is memory tracking. It's *easy* to track used memory.
And, in a followup, it should be fairly easy to circuit break on the
used memory.
Mostly this is done by adding new abstractions and moving existing
abstractions to top level classes with tests and stuff.
* `TopNEncoder` is now a top level class. It has grown the ability to *decode* values as well as encode them. And it has grown "unsortable" versions which don't write their values such that sorting the bytes sorts the values. We use the "unsortable" versions when writing values.
* `KeyExtractor` extracts keys from the blocks and writes them to the row's `BytesRefBuilder`. This is basically objects replacing one of the switch statements in `RowFactory`. They are more scattered but easier to test, and hopefully `TopNOperator` is more readable with this behavior factored out. Also! Most implementations are automatically generated.
* `ValueExtractor` extracts values from the blocks and writes them to the row's `BytesRefBuilder`. This replaces the other switch statement in `RowFactory` for the same reasons, except instead of writing to many arrays it writes to a `BytesRefBuilder` just like the key as compactly as it can manage.
The memory savings come from three changes:
1. Lower overhead for storing values by encoding them rather than using many primitive arrays.
2. Encode the value count as a vint rather than a whole int. Usually there are very few rows and vint encodes that quite nicely.
3. Don't write values that are in the key for single-valued fields. Instead we read them from the key. That's going to be very very common.
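For change 2, a plain varint sketch shows why this helps - a small count fits in a single byte instead of four. This is the standard 7-bits-per-byte encoding, not necessarily the exact TopN encoder:
```
// Writes `value` as a varint into `out` starting at `offset`, returning the
// new offset. Counts below 128 take one byte instead of a fixed four.
static int writeVInt(byte[] out, int offset, int value) {
    while ((value & ~0x7F) != 0) {
        out[offset++] = (byte) ((value & 0x7F) | 0x80); // low 7 bits plus a "more" flag
        value >>>= 7;
    }
    out[offset++] = (byte) value;
    return offset;
}
```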
This is unlikely to be faster than the old code. I haven't really tried
for speed. Just memory usage and accountability. Once we get good
accounting we can try and make this faster. I expect we'll have to
figure out the megamorphic invocations I've added. But, for now, they
help more than they hurt.
CompatibilityVersions now holds a map of system index names to their
mappings versions, alongside the transport version. We also add mapping
versions to the "minimum version barrier": if a node has a system index
whose version is below the cluster mappings version for that system
index, it is not allowed to join the cluster.
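Roughly, the barrier works like the sketch below; the names and types are illustrative, not the actual join validator:
```
// Rejects a joining node if any of its system index mappings versions is
// older than the version the cluster already requires for that index.
static void ensureMappingsVersionsCompatible(
    java.util.Map<String, Integer> clusterMappingsVersions,
    java.util.Map<String, Integer> nodeMappingsVersions
) {
    for (var cluster : clusterMappingsVersions.entrySet()) {
        Integer nodeVersion = nodeMappingsVersions.get(cluster.getKey());
        if (nodeVersion != null && nodeVersion < cluster.getValue()) {
            throw new IllegalStateException(
                "system index [" + cluster.getKey() + "] mappings version is below the cluster minimum"
            );
        }
    }
}
```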
* ESQL: Disable optimizations with bad null handling
We have optimizations that kick in when aggregating on the following
pairs of field types:
* `long`, `long`
* `keyword`, `long`
* `long`, `keyword`
These optimizations don't have proper support for `null` valued fields
but will grow that after #98749. In the meantime this disables them in
a way that prevents them from bit-rotting.
* Update docs/changelog/99434.yaml
Cluster state currently holds a cluster minimum transport version and a map of nodes to transport versions. However, to determine node compatibility, we will need to account for more types of versions in cluster state than just the transport version (see #99076). Here we introduce a wrapper class to cluster state and update accessors and builders to use the new method. (I would have liked to re-use org.elasticsearch.cluster.node.VersionInformation, but that one holds IndexVersion rather than TransportVersion.)
* Introduce CompatibilityVersions to cluster state class
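A minimal sketch of the wrapper shape, with stand-in types - the real class wraps TransportVersion and gets extended later with more version kinds:
```
// Stand-in for the new cluster state wrapper: today it only carries the
// transport version id, but more version types can be added without
// touching every accessor again.
record CompatibilityVersionsSketch(int transportVersionId) {
    static CompatibilityVersionsSketch minimumOf(java.util.Collection<CompatibilityVersionsSketch> nodes) {
        return nodes.stream()
            .min(java.util.Comparator.comparingInt(CompatibilityVersionsSketch::transportVersionId))
            .orElseThrow();
    }
}
```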
We did not use the cluster settings on these gigantic objects
except for the one spot in the aggregation context,
so we can just hold a reference to it on the aggregation context
and simplify things a little for tests etc.
Also, inline needless indirection via a single-use private method in
`toQuery`.
The reason for this is to wire up the ESQL evaluator interface (for
things such as the Warnings infrastructure).
In the process, rearrange the evaluator classes under expression a bit:
- introduce an evaluator package
- move EvalMapper & its family of mappers under evaluator (from the planner)
- extract the common interface from EvalMapper into its own file and rename
it from the generic Mapper (of which we have several classes) to
EvaluatorMapper
- mirror the package hierarchy from the expression package
- widen visibility from protected to public (side-effect of the above)
- move classes that only generate code from expression to evaluator
When multivalued fields are loaded from lucene they are in sorted order
but we weren't taking advantage of that fact. Now we are! It's much
faster, even for fast operations like `mv_min`
```
(operation) Mode Cnt Score Error Units
mv_min avgt 7 3.820 ± 0.070 ns/op
mv_min_ascending avgt 7 1.979 ± 0.130 ns/op
```
We still have code to run in non-sorted mode because conversion functions
and a few other things don't load in sorted order.
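A sketch of the optimization in plain Java (not the generated evaluator code): when a position's values are known to be ascending, `mv_min` can read the first value instead of scanning them all.
```
// `values[first .. first + count)` holds one position's values.
static long mvMin(long[] values, int first, int count, boolean ascending) {
    if (ascending) {
        return values[first];              // sorted ascending: the first value is the minimum
    }
    long min = values[first];              // unsorted: scan the whole position
    for (int i = first + 1; i < first + count; i++) {
        min = Math.min(min, values[i]);
    }
    return min;
}
```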
I've also expanded the parameterized tests for the `MV_` functions
because, well, I needed to expand them at least a little to test this
change. And I just kept going and improved as many tests as I could.
In serverless we would like to report on (meter and bill for) document ingestion. The metering should be agnostic to the document format (the document structure should be normalised), hence we should allow creating XContentParsers which keep track of parsed fields and values.
There are 2 places where parsing of the ingested document happens:
1. upon 'raw bulk', when a request is sent without pipelines
2. upon the 'ingest service', when a request is sent with pipelines
(parsing can occur twice when dynamic mappings are calculated; this PR takes this into account and prevents double billing)
We also want to make sure that the metering logic is not unnecessarily executed when a document was already reported. That is, if a document was reported in IngestService, there is no point wrapping the XContentParser again.
This commit introduces a `DocumentReporterPlugin`, an internal plugin that will be implemented in serverless. This plugin should return a `DocumentParsingObserver` supplier which will create a `DocumentParsingObserver`. A `DocumentParsingObserver` is used to wrap an `XContentParser` with an implementation that keeps track of parsed fields and values (performs the metering) and allows sending that information, along with the index name, to a MeteringReporter.
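A sketch of the wrapping idea with hypothetical types (the real observer wraps `XContentParser`, which has a much larger surface):
```
// Hypothetical minimal parser interface standing in for XContentParser.
interface TokenParser {
    String nextToken(); // null at the end of the document
}

// Delegating wrapper that counts what flows through it - the "metering" part.
final class CountingParser implements TokenParser {
    private final TokenParser delegate;
    private long parsedTokens;

    CountingParser(TokenParser delegate) {
        this.delegate = delegate;
    }

    @Override
    public String nextToken() {
        String token = delegate.nextToken();
        if (token != null) {
            parsedTokens++;
        }
        return token;
    }

    long parsedTokens() {
        return parsedTokens;   // reported later, along with the index name
    }
}
```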
This sizes pages produced by operators based on an estimate of the
number of bytes that'll be added to the page after it's been emitted.
At this point it only really works properly for the
`LuceneSourceOperator`, but that's pretty useful!
The `LuceneTopNSourceOperator` doesn't yet have code to cut the output
of a topn collected from lucene into multiple pages, so it ignores the
calculated value. We'll get to that in a follow up.
We feed the right value into aggregations but ungrouped aggregations
ignore it because they only ever emit one row. Grouped aggregations
don't yet have any code to cut their output into multiple pages.
TopN *does* have code to cut the output into multiple pages but the
estimates passed to it are kind of hacky. A proper estimate of TopN
would account for the size of rows flowing into it, but I never wrote
code for that. The thing is - TopN doesn't have to estimate incoming row
size - it can measure each row as it builds it and use the estimate we're
building now as an estimate of extra bytes that'll be added. Which is
what it is! But that code also needs to be written.
Relates to https://github.com/elastic/elasticsearch-internal/issues/1385
This commit disables javadocs for the benchmarks project, since the docs are not necessary or interesting, and cause warning noise in the build log output.
Today, we have a hard-coded maximum page size of 16K in Lucene operators
and other operators like TopN and HashOperator. This default value
should work well in production. However, it doesn't provide enough
randomization in our tests because we mostly emit a single page.
Additionally, some tests take a significant amount of time because they
require indexing a large number of documents, which is several times the
page size.
To address these, this PR makes the page size parameter
configurable via the query pragmas, enabling randomization in tests.
This change has already uncovered a bug in LongLongBlockHash.
When you group by more than one multivalued field we generate one ord
per unique tuple of values, one from each column. So if you group by
```
a=(1, 2, 3) b=(2, 3) c=(4, 5, 5)
```
Then you get these grouping keys:
```
1, 2, 4
1, 2, 5
1, 3, 4
1, 3, 5
2, 2, 4
2, 2, 5
2, 3, 4
2, 3, 5
3, 2, 4
3, 2, 5
3, 3, 4
3, 3, 5
```
That's as many grouping keys as the product of the set-wise cardinality
of each column. "Product" is a dangerous word! It's possible to make a
simple document containing just two fields that each are a list of
10,000 values and then send *that* into the aggregation framework. That
little baby document will spit out 100,000,000 grouping ordinals!
Without this PR we'd try to create a single `Block` that contains that
many entries. Or, rather, it'd be as big as the nearest power of two.
Gigantonormous. About 760mb! Like, possible, but a huge "slug" of heap
usage and not great.
This PR changes it so, at least for pairs of `long` keys, we'll make many
smaller blocks. We cut the emitted ordinals into blocks of no more than
16*1024 entries, the default length of a block. That means our baby
document would make 6103 full blocks and one half-full block. But each
one is going to be less than 200kb.
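The idea in miniature (an illustrative helper, not the real BlockHash code):
```
// Cut one huge run of grouping ordinals into blocks of at most maxBlockSize
// entries (16 * 1024 in the description above).
static java.util.List<int[]> chunkOrds(int[] ords, int maxBlockSize) {
    java.util.List<int[]> blocks = new java.util.ArrayList<>();
    for (int start = 0; start < ords.length; start += maxBlockSize) {
        int end = Math.min(start + maxBlockSize, ords.length);
        blocks.add(java.util.Arrays.copyOfRange(ords, start, end));
    }
    return blocks;
}
```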
Relates to ESQL-1360
This adds support for multivalued fields to the `PackedValuesHash`. It does
so by encoding batches of values for each column into bytes and then
reading the converted values row-wise. This works while also not causing
per-row megamorphic calls. There are megamorphic calls when the batch is
used up, but that should only hit a few times per block.
This implements the `MV_DEDUPE` function that removes duplicates from
multivalued fields. It wasn't strictly in our list of things we need in
the first release, but I'm grabbing this now because I realized I needed
very similar infrastructure when I was trying to build grouping by
multivalued fields. In fact, I realized that I could use our
stringtemplate code generation to generate most of the complex parts.
This generates the actual body of `MV_DEDUPE`'s implementation and the
body of the `Block` accepting `BlockHash` implementations. It'll be
useful in the final step for grouping by multivalued fields.
I also got pretty curious about whether the `O(n^2)` or `O(n*log(n))`
algorithm for deduplication is faster. I'd been assuming that for all
reasonable sized inputs the `O(n^2)` bubble sort looking selection
algorithm was faster. So I measured it. And it's mostly true - even for
`BytesRef` if you have a dozen entries the selection algorithm is
faster. Lower overhead and stuff. Anyway, to measure it I had to
implement the copy-and-sort `O(n*log(n))` algorithm. So while I was
there I plugged it in and selected it in cases where the number of
inputs is large and the selection algorithm is likely to be slower.
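For reference, here are the two strategies in plain Java (a sketch, not the generated evaluator code): the copy-free `O(n^2)` selection-style pass used for small inputs, and the copy-and-sort `O(n*log(n))` pass used when the input is large.
```
// O(n^2): compare each value against the unique ones found so far. No copy,
// low overhead, wins for small n.
static int dedupeSmall(long[] values, int count) {
    int unique = 0;
    for (int i = 0; i < count; i++) {
        boolean seen = false;
        for (int j = 0; j < unique; j++) {
            if (values[j] == values[i]) {
                seen = true;
                break;
            }
        }
        if (seen == false) {
            values[unique++] = values[i];
        }
    }
    return unique; // distinct values are now packed at the front of the array
}

// O(n*log(n)): copy, sort, then drop adjacent duplicates. Wins for large n.
static long[] dedupeLarge(long[] values, int count) {
    long[] copy = java.util.Arrays.copyOf(values, count);
    java.util.Arrays.sort(copy);
    int unique = 0;
    for (int i = 0; i < count; i++) {
        if (i == 0 || copy[i] != copy[i - 1]) {
            copy[unique++] = copy[i];
        }
    }
    return java.util.Arrays.copyOf(copy, unique);
}
```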
This adds IndexVersion to cluster state, alongside node version. This is needed so IndexVersion can be tracked across the cluster, allowing min/max supported index versions to be determined.
* Convert Avg into a SurrogateExpression and introduce a dedicated rule
for handling surrogate AggregateFunctions
* Remove the Avg implementation
* Use sum instead of avg in some planning tests
* Add a dataType case for the Div operator
Relates ESQL-747
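The rewrite itself is simple; here is a sketch with stand-in record types rather than the real ESQL expression tree:
```
// Stand-in expression tree (records), not the real ESQL classes.
sealed interface Expr permits Field, Sum, Count, Div, Avg {}
record Field(String name) implements Expr {}
record Sum(Expr field) implements Expr {}
record Count(Expr field) implements Expr {}
record Div(Expr left, Expr right) implements Expr {}
record Avg(Expr field) implements Expr {}

final class AvgSurrogate {
    // avg(x) is planned as sum(x) / count(x), so Avg needs no aggregator of its own.
    static Expr surrogate(Avg avg) {
        return new Div(new Sum(avg.field()), new Count(avg.field()));
    }
}
```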
In SQL `AVG(foo)` is `null` if there are no values for `foo`. Same for
`MIN(foo)` and `MAX(foo)`. In fact, the only functions that don't return
`null` on empty inputs seem to be `COUNT` and `COUNT(DISTINCT)`.
This flips our non-grouping aggs to have the same behavior because it's
both more expected and fits better with other things we're building.
This *is* different from Elasticsearch's aggs. But it's different in a
good way. It also lines up more closely with the way that our grouping
aggs work.
This also revives the broken `AggregatorBenchmark` so that I could get
performance figures for this change. And it's within the margin of
error:
```
(blockType) (grouping) (op) Mode Cnt Before After Units
vector_longs none sum avgt 7 0.440 ± 0.017 0.397 ± 0.003 ns/op
half_null_longs none sum avgt 7 5.785 ± 0.022 5.861 ± 0.134 ns/op
```
I expected a small slowdown on the `half_null_longs` line and see one,
but it is within the margin of error. Either way, that line is not
nearly as optimized. We'll loop back around to it eventually.
Closes ESQL-1297
* Initial import for TDigest forking.
* Fix MedianTest.
More work needed for TDigestPercentile*Tests and the TDigestTest (and
the rest of the tests) in the tdigest lib to pass.
* Fix Dist.
* Fix AVLTreeDigest.quantile to match Dist for uniform centroids.
* Update docs/changelog/96086.yaml
* Fix `MergingDigest.quantile` to match `Dist` on uniform distribution.
* Add merging to TDigestState.hashCode and .equals.
Remove wrong asserts from tests and MergingDigest.
* Fix style violations for tdigest library.
* Fix typo.
* Fix more style violations.
* Fix more style violations.
* Fix remaining style violations in tdigest library.
* Update results in docs based on the forked tdigest.
* Fix YAML tests in aggs module.
* Fix YAML tests in x-pack/plugin.
* Skip failing V7 compat tests in modules/aggregations.
* Fix TDigest library unittests.
Remove redundant serializing interfaces from the library.
* Remove YAML test versions for older releases.
These tests don't address compatibility issues in mixed cluster tests as
the latter contain a mix of older and newer nodes, so the output depends
on which node is picked as a data node since the forked TDigest library
is not backwards compatible (produces slightly different results).
* Fix test failures in docs and mixed cluster.
* Reduce buffer sizes in MergingDigest to avoid oom.
* Exclude more failing V7 compatibility tests.
* Update results for JdbcCsvSpecIT tests.
* Update results for JdbcDocCsvSpecIT tests.
* Revert unrelated change.
* More test fixes.
* Use version skips instead of blacklisting in mixed cluster tests.
* Switch TDigestState back to AVLTreeDigest.
* Update docs and tests with AVLTreeDigest output.
* Update flaky test.
* Remove dead code, esp around tracking of incoming data.
* Update docs/changelog/96086.yaml
* Delete docs/changelog/96086.yaml
* Remove explicit compression calls.
This was added to prevent concurrency tests from failing, but it leads
to reduced precision. Submit this to see if the concurrency tests are
still failing.
* Revert "Remove explicit compression calls."
This reverts commit 5352c96f65.
* Remove explicit compression calls to MedianAbsoluteDeviation input.
* Add unittests for AVL and merging digest accuracy.
* Fix spotless violations.
* Delete redundant tests and benchmarks.
* Fix spotless violation.
* Use the old implementation of AVLTreeDigest.
The latest library version is 50% slower and less accurate, as verified
by ComparisonTests.
* Update docs with latest percentile results.
* Update docs with latest percentile results.
* Remove repeated compression calls.
* Update more percentile results.
* Use approximate percentile values in integration tests.
This helps with mixed cluster tests, where some of the tests were
blocked.
* Fix expected percentile value in test.
* Revert in-place node updates in AVL tree.
Update quantile calculations between centroids and min/max values to
match v.3.2.
* Add SortingDigest and HybridDigest.
The SortingDigest tracks all samples in an ArrayList that
gets sorted for quantile calculations. This approach
provides perfectly accurate results and is the most
efficient implementation for up to millions of samples,
at the cost of bloated memory footprint.
The HybridDigest uses a SortingDigest for small sample
populations, then switches to a MergingDigest. This
approach combines the best performance and results for
small sample counts with very good performance and
acceptable accuracy for effectively unbounded sample
counts.
* Remove deps to the 3.2 library.
* Remove unused licenses for tdigest.
* Revert changes for SortingDigest and HybridDigest.
These will be submitted in a follow-up PR for enabling MergingDigest.
* Remove unused Histogram classes and unit tests.
Delete dead and commented out code, make the remaining tests run
reasonably fast. Remove unused annotations, esp. SuppressWarnings.
* Remove Comparison class, not used.
* Revert "Revert changes for SortingDigest and HybridDigest."
This reverts commit 2336b11598.
* Use HybridDigest as default tdigest implementation
Add SortingDigest as a simple structure for percentile calculations that
tracks all data points in a sorted array. This is a fast and perfectly
accurate solution that leads to bloated memory allocation.
Add HybridDigest that uses SortingDigest for small sample counts, then
switches to MergingDigest. This approach delivers extreme
performance and accuracy for small populations while scaling
indefinitely and maintaining acceptable performance and accuracy with
constant memory allocation (15kB by default).
Provide knobs to switch back to AVLTreeDigest, either per query or
through ClusterSettings.
* Small fixes.
* Add javadoc and tests.
* Add javadoc and tests.
* Remove special logic for singletons in the boundaries.
While this helps with the case where the digest contains only
singletons (perfect accuracy), it has a major issue
(non-monotonic quantile function) when the first singleton is followed
by a non-singleton centroid. It's preferable to revert to the old
version from 3.2; inaccuracies in a singleton-only digest should be
mitigated by using a sorted array for small sample counts.
* Revert changes to expected values in tests.
This is due to restoring quantile functions to match head.
* Revert changes to expected values in tests.
This is due to restoring quantile functions to match head.
* Tentatively restore percentile rank expected results.
* Use cdf version from 3.2
Update Dist.cdf to use interpolation, use the same cdf
version in AVLTreeDigest and MergingDigest.
* Revert "Tentatively restore percentile rank expected results."
This reverts commit 7718dbba59.
* Revert remaining changes compared to main.
* Revert excluded V7 compat tests.
* Exclude V7 compat tests still failing.
* Exclude V7 compat tests still failing.
* Remove ClusterSettings tentatively.
* Initial import for TDigest forking.
* Fix MedianTest.
More work needed for TDigestPercentile*Tests and the TDigestTest (and
the rest of the tests) in the tdigest lib to pass.
* Fix Dist.
* Fix AVLTreeDigest.quantile to match Dist for uniform centroids.
* Update docs/changelog/96086.yaml
* Fix `MergingDigest.quantile` to match `Dist` on uniform distribution.
* Add merging to TDigestState.hashCode and .equals.
Remove wrong asserts from tests and MergingDigest.
* Fix style violations for tdigest library.
* Fix typo.
* Fix more style violations.
* Fix more style violations.
* Fix remaining style violations in tdigest library.
* Update results in docs based on the forked tdigest.
* Fix YAML tests in aggs module.
* Fix YAML tests in x-pack/plugin.
* Skip failing V7 compat tests in modules/aggregations.
* Fix TDigest library unittests.
Remove redundant serializing interfaces from the library.
* Remove YAML test versions for older releases.
These tests don't address compatibility issues in mixed cluster tests as
the latter contain a mix of older and newer nodes, so the output depends
on which node is picked as a data node since the forked TDigest library
is not backwards compatible (produces slightly different results).
* Fix test failures in docs and mixed cluster.
* Reduce buffer sizes in MergingDigest to avoid oom.
* Exclude more failing V7 compatibility tests.
* Update results for JdbcCsvSpecIT tests.
* Update results for JdbcDocCsvSpecIT tests.
* Revert unrelated change.
* More test fixes.
* Use version skips instead of blacklisting in mixed cluster tests.
* Switch TDigestState back to AVLTreeDigest.
* Update docs and tests with AVLTreeDigest output.
* Update flaky test.
* Remove dead code, esp around tracking of incoming data.
* Remove explicit compression calls.
This was added to prevent concurrency tests from failing, but it leads
to reduced precision. Submit this to see if the concurrency tests are
still failing.
* Update docs/changelog/96086.yaml
* Delete docs/changelog/96086.yaml
* Revert "Remove explicit compression calls."
This reverts commit 5352c96f65.
* Remove explicit compression calls to MedianAbsoluteDeviation input.
* Add unittests for AVL and merging digest accuracy.
* Fix spotless violations.
* Delete redundant tests and benchmarks.
* Fix spotless violation.
* Use the old implementation of AVLTreeDigest.
The latest library version is 50% slower and less accurate, as verified
by ComparisonTests.
* Update docs with latest percentile results.
* Update docs with latest percentile results.
* Remove repeated compression calls.
* Update more percentile results.
* Use approximate percentile values in integration tests.
This helps with mixed cluster tests, where some of the tests were
blocked.
* Fix expected percentile value in test.
* Revert in-place node updates in AVL tree.
Update quantile calculations between centroids and min/max values to
match v.3.2.
* Add SortingDigest and HybridDigest.
The SortingDigest tracks all samples in an ArrayList that
gets sorted for quantile calculations. This approach
provides perfectly accurate results and is the most
efficient implementation for up to millions of samples,
at the cost of bloated memory footprint.
The HybridDigest uses a SortingDigest for small sample
populations, then switches to a MergingDigest. This
approach combines the best performance and results for
small sample counts with very good performance and
acceptable accuracy for effectively unbounded sample
counts.
* Remove deps to the 3.2 library.
* Remove unused licenses for tdigest.
* Revert changes for SortingDigest and HybridDigest.
These will be submitted in a follow-up PR for enabling MergingDigest.
* Remove unused Histogram classes and unit tests.
Delete dead and commented out code, make the remaining tests run
reasonably fast. Remove unused annotations, esp. SuppressWarnings.
* Remove Comparison class, not used.
* Revert "Revert changes for SortingDigest and HybridDigest."
This reverts commit 2336b11598.
* Use HybridDigest as default tdigest implementation
Add SortingDigest as a simple structure for percentile calculations that
tracks all data points in a sorted array. This is a fast and perfectly
accurate solution that leads to bloated memory allocation.
Add HybridDigest that uses SortingDigest for small sample counts, then
switches to MergingDigest. This approach delivers extreme
performance and accuracy for small populations while scaling
indefinitely and maintaining acceptable performance and accuracy with
constant memory allocation (15kB by default).
Provide knobs to switch back to AVLTreeDigest, either per query or
through ClusterSettings.
* Add javadoc and tests.
* Remove ClusterSettings tentatively.
* Restore bySize function in TDigest and subclasses.
* Update Dist.cdf to match the rest.
Update tests.
* Revert outdated test changes.
* Revert outdated changes.
* Small fixes.
* Update docs/changelog/96794.yaml
* Make HybridDigest the default implementation.
* Update boxplot documentation.
* Restore AVLTreeDigest as the default in TDigestState.
TDigest.createHybridDigest now returns the right type.
The switch in TDigestState will happen in a separate PR
as it requires many test updates.
* Use execution_hint in tdigest spec.
* Fix Dist.cdf for empty digest.
* Pass ClusterSettings through SearchExecutionContext.
* Bump up TransportVersion.
* Bump up TransportVersion for real.
* HybridDigest uses its final implementation during deserialization.
* Restore the right TransportVersion in TDigestState.read
* Add dummy SearchExecutionContext factory for tests.
* Use TDigestExecutionHint instead of strings.
* Remove check for null context.
* Add link to TDigest javadoc.
* Use NodeSettings directly.
* Init executionHint to null, set before using.
* Update docs/changelog/96943.yaml
* Pass initialized executionHint to createEmptyPercentileRanksAggregator.
* Initialize TDigestExecutionHint.SETTING to "DEFAULT".
* Initialize TDigestExecutionHint to null.
* Use readOptionalWriteable/writeOptionalWriteable.
Move test-only SearchExecutionContext method in helper class under
test.
* Bump up TransportVersion.
* Small fixes.
Preparation for aggs to allow for consumption of multiple input
channels, and output of more than one Block.
The salient change can be seen in difference to the AggregatorFunction
and GroupingAggregatorFunction interfaces, e.g.:
```diff
- void addIntermediateInput(Block block);
- Block evaluateIntermediate();
- Block evaluateFinal();
---
+ void addIntermediateInput(Page page);
+ void evaluateIntermediate(Block[] blocks, int offset);
+ void evaluateFinal(Block[] blocks, int offset);
```
addIntermediateInput accepts a Page (rather than a Block), to allow the
aggregator function to consume multiple channels.
evaluateXXX accepts a block array and offset, to allow the aggregator
function to populate array elements.
For now, aggs continue to just use a single input channel and output
just a single block. A follow on change will refactor this.
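A sketch of why the new `evaluateIntermediate`/`evaluateFinal` shape helps (stand-in types, not the real interfaces): several aggs can write their output blocks side by side into one shared array.
```
// Each agg fills blocks[offset .. offset + n) and the caller advances the offset.
interface AggSketch {
    int intermediateBlockCount();
    void evaluateIntermediate(Object[] blocks, int offset);
}

static void evaluateAll(AggSketch[] aggs, Object[] blocks) {
    int offset = 0;
    for (AggSketch agg : aggs) {
        agg.evaluateIntermediate(blocks, offset);
        offset += agg.intermediateBlockCount();
    }
}
```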
This flips the resolution of aggs from a tree of switch statements into
method calls on the function objects. This gives us much more control
over how we construct the aggs, making it much simpler to flow parameters
through the system and easier to make sure that only appropriate aggs
run in the right spot.
This commit changes access to the latest TransportVersion constant to
use a static method instead of a public static field. By encapsulating
the field we will be able to (in a followup) lazily determine what the
latest is, outside of clinit.