This PR removes APIs in Vectors that use the non-breaking block factory.
Some tests now explicitly use the non-breaking factory. The goal of this
PR, along with some follow-ups, is to phase out the non-breaking block
factory in production. We can gradually remove its usage in tests later.
This adds support for loading a text field from a parent keyword field.
The mapping for that looks like:
```
"properties": {
"foo": {
"type": "keyword",
"fields": {
"text": { "type": "text" }
}
}
}
```
In this case it's safe to load the `text` subfield from the doc values
for the `keyword` field above.
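For example (reusing the mapping above with a hypothetical index named `test`),
an ESQL query can now keep the subfield directly:
```
FROM test | KEEP foo.text | LIMIT 10
```
The values are read from the parent `keyword` field's doc values, as described above.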
Closes #102473
Here we export both parent and child circuit breaker trip counts as metrics so that we can collect their values using APM. We expose a counter for the trip count of the parent circuit breaker and a counter for the trip count of each child circuit breaker, including:
* field data circuit breakers
* per-request circuit breakers
* in-flight requests circuit breakers
* custom circuit breakers used by plugins (EQL and Machine Learning)
The circuit breaker metrics include:
* es.breaker.parent.trip.total
* es.breaker.field_data.trip.total
* es.breaker.request.trip.total
* es.breaker.in_flight_requests.trip.total
* es.breaker.eql_sequence.trip.total
* es.breaker.in_model_inference.trip.total
Each of the metrics is exposed at node level.
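Roughly, registering and bumping one of these counters looks something like the sketch below. The `MeterRegistry`/`LongCounter` types are from the telemetry metrics API, but the class and wiring here are illustrative assumptions, not the actual change:
```
// Hedged sketch: register a trip-count counter and bump it when the breaker trips.
import org.elasticsearch.telemetry.metric.LongCounter;
import org.elasticsearch.telemetry.metric.MeterRegistry;

class CircuitBreakerMetricsSketch {
    private final LongCounter parentTripCount;

    CircuitBreakerMetricsSketch(MeterRegistry registry) {
        // The counter name matches the metric listed above.
        this.parentTripCount = registry.registerLongCounter(
            "es.breaker.parent.trip.total",
            "Total number of times the parent circuit breaker has tripped",
            "count"
        );
    }

    void onParentBreakerTrip() {
        // Called from wherever the parent breaker decides to trip.
        parentTripCount.increment();
    }
}
```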
This adds support for loading the `_source` field using the syntax:
```
FROM test [METADATA _source]
```
The `_source` field is loaded as a special type, `_source`, which no
functions support (1). We just render it in the output, which looks
like:
```
$ curl -XDELETE -uelastic:password localhost:9200/test
$ curl -XPOST -HContent-Type:application/json -uelastic:password localhost:9200/test/_doc/1?refresh -d'{
"words": "words",
"other stuff": [
"wow",
"such",
"suff"
]
}'
$ curl -XPOST -HContent-Type:application/json -uelastic:password localhost:9200/_query?pretty -d'{
"query": "FROM test [METADATA _source] | KEEP _source | LIMIT 1"
}'
{
  "columns" : [
    {
      "name" : "_source",
      "type" : "_source"
    }
  ],
  "values" : [
    [
      {
        "words" : "words",
        "other stuff" : [
          "wow",
          "such",
          "suff"
        ]
      }
    ]
  ]
}
```
The `_source` is just a JSON object. We use the same infrastructure to
convert it to JSON as the `_search` response.
This works for both stored `_source` and synthetic `_source`, but it
runs row-by-row every time. This is *perfect* for stored `_source` but it's
less nice for synthetic `_source`. We'd be better off rebuilding synthetic
`_source` from blocks but that'd require a lot of new infrastructure.
And synthetic `_source` isn't going to be fast anyway.
(1): `IS NULL` and `IS NOT NULL` support `_source` because we get that
for free.
This modifies ESQL to load a list of fields in a single pass, which is especially
effective when loading from stored fields or `_source` because it allows
visiting the stored fields only once.
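As a sketch of why the single pass helps for stored fields, here is what loading
several fields from one stored-fields visit looks like with plain Lucene APIs.
The field names are hypothetical and this is not the actual ESQL reader code:
```
// Hedged sketch: read the stored document once and pull several fields from it,
// instead of re-reading the document once per field.
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.StoredFields;

class OnePassStoredFieldsSketch {
    static void loadRow(LeafReader reader, int docId) throws IOException {
        StoredFields storedFields = reader.storedFields();
        // One pass over the stored document for both fields (names are made up).
        Document doc = storedFields.document(docId, Set.of("message", "host.name"));
        String message = doc.get("message");
        String hostName = doc.get("host.name");
        // ... append message and hostName to the blocks being built ...
    }
}
```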
Part of #101322
* Modularize shard availability service
This commit moves the `ShardsAvailabilityHealthIndicatorService` into its own package and modularizes it
with exports so that Serverless can make use of it as a superclass.
Relates to #101394
This changes how we load values in ESQL, delegating to the
`MappedFieldType` like we do with doc values and synthetic
source. This allows a much more OO way of getting the loads
working which makes that path much easier to read. And! It
means those code paths look like doc values. So there's
symmetry. It's like it rhymes.
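Very roughly, the delegation has the shape sketched below. The names and
signatures here are simplified illustrations, not the real interfaces:
```
// Hedged sketch of the idea: each MappedFieldType hands ESQL a loader that
// knows how to fill a block for that field, mirroring how doc values and
// synthetic source already delegate to the field type.
interface BlockLoaderSketch {
    // Append the values for one document to the block being built.
    void read(int docId, BlockBuilderSketch builder);
}

interface BlockBuilderSketch {
    void appendLong(long value);
    void appendBytesRef(byte[] value);
    void appendNull();
}

abstract class MappedFieldTypeSketch {
    // A keyword field might return an ordinals-based loader, a numeric field a
    // doc-values loader, and fields without doc values a _source-based loader.
    abstract BlockLoaderSketch blockLoader();
}
```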
There are a few side effects here:
1. It's fairly simple to load from ordinals efficiently. I
wrote some block-at-a-time code for resolving ordinals
and it's about twice as fast. With more work it should
be possible to make custom ordinal-shaped blocks move
through the system to save space and speed things up.
2. Most fields can now be loaded from `_source`. Everything
that can be loaded from `_source` in scripts will load
from `_source` in ESQL.
3. We get a *lot* more tests for loading fields in
different configurations by piggybacking on the synthetic
source testing framework.
4. Loading from `_source` no longer sorts the fields. Same
for stored fields. Now we keep them in whatever order they were
stored in. This is a pretty marginal time save because
loading from `_source` is so much more time consuming
than the sort. But it's something.
This adds comprehensive tests for `ExpressionEvaluator` making sure that it releases `Block`s. It fixes all of the `mv_*` evaluators to make sure they release as well.
This commit updates the hash grouping operator to close input pages, as well as use the block factory for internally created blocks.
Additionally:
* Adds a MockBlockFactory to help with tracking block creation
* Eagerly creates the block view of a vector, which helps with tracking since there can be only one block view instance per vector
* Resolves an issue with Filter Blocks, whereby they previously tried to emit their contents in toString
This creates `Block.Ref`, a reference to a `Block` which may or may not
be part of a `Page`. `Block.Ref` is `Releasable` and closing it is a
noop if the `Block` is part of a `Page`, but if it is "free floating"
then closing the `Block.Ref` will close the block.
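A minimal, simplified sketch of those semantics (not the real class):
```
// Hedged sketch: closing the ref releases the block only when it is
// "free floating", i.e. not owned by a Page that will release it later.
import org.elasticsearch.core.Releasable;

class BlockRefSketch implements Releasable {
    private final Block block;            // the compute engine's Block
    private final boolean containedInPage;

    BlockRefSketch(Block block, boolean containedInPage) {
        this.block = block;
        this.containedInPage = containedInPage;
    }

    Block block() {
        return block;
    }

    @Override
    public void close() {
        if (containedInPage == false) {
            block.close(); // we own the free floating block, so release it
        }
        // otherwise a noop: the Page owns the block and will release it later
    }
}
```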
It also modifies `ExpressionEvaluator` to return a `Block.Ref` instead
of a `Block` - so you tend to work with `ExpressionEvaluator`s like
this:
```
try (Block.Ref ref = eval.eval(page)) {
    return ref.block().doStuff();
}
```
This should make it *much* easier to release the memory from `Block`s
built by `ExpressionEvaluator`s.
This change is mostly mechanical, introducing the new signature for
`ExpressionEvaluator`. In a follow up change I'll modify the tests to
make sure we're correctly using it to close pages.
I did think about changing `ExpressionEvaluator` to add a method telling
you if the block that it returns must be closed or not. This would have
been more difficult to work with, and, ultimately, limiting.
Specifically, it is possible for an `ExpressionEvaluator` to *sometimes*
return a free floating block and other times return one that is
contained in a `Page`. Imagine `mv_concat` - it returns the block it
receives if the block doesn't have multivalued fields. Otherwise it
concats things. If that block happens to come directly out of the
`Page`, then `mv_concat` will sometimes produce free floating blocks and
sometimes not.
Today, we have the ability to specify whether multivalued fields are
sorted in ascending order or not. This feature allows operators like
topn to enable optimizations. However, we are currently missing the
deduplicated attribute. If multivalued fields are deduplicated at each
position, we can further optimize operators such as hash and mv_dedup.
In fact, blocks should not have the mv_ascending property alone; it always
goes together with mv_deduplicated. Additionally, mv_dedup or hash
should generate blocks that have only the mv_dedup property.
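One way to picture the combined property is the hypothetical sketch below; the
names are illustrative, not the exact enum in the code:
```
// Hedged sketch: ascending order only ever appears together with deduplication,
// so modelling the two as one ordering property keeps the illegal combination
// (ascending but not deduplicated) unrepresentable.
enum MvOrderingSketch {
    UNORDERED,                         // anything goes
    DEDUPLICATED_UNORDERED,            // e.g. output of mv_dedup or hash
    DEDUPLICATED_AND_SORTED_ASCENDING  // e.g. values loaded straight from doc values
}
```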
This commit adds a BlockFactory - an extra level of indirection when building blocks. The factory couples circuit breaking when building, allowing for incrementing the breaker as blocks and Vectors are built.
This PR adds the infrastructure to allow us to move the operators and implementations over to the factory, rather than actually moving them all over at once.
This prevents topn operations from using too much memory by hooking them
into circuit breaking framework. It builds on the work done in
https://github.com/elastic/elasticsearch/pull/99316 that moved all topn
storage to byte arrays by adding circuit breaking to the process of growing
the underlying byte array.
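A rough sketch of the pattern, using the existing `CircuitBreaker` API; the
surrounding topn storage class and method below are hypothetical, not the real code:
```
// Hedged sketch: account for the extra bytes with the breaker before growing the
// backing array, so a runaway topn trips the breaker instead of running out of heap.
import org.elasticsearch.common.breaker.CircuitBreaker;

class GrowableRowStorageSketch {
    private final CircuitBreaker breaker;
    private byte[] bytes = new byte[16];

    GrowableRowStorageSketch(CircuitBreaker breaker) {
        this.breaker = breaker;
    }

    void ensureCapacity(int needed) {
        if (needed <= bytes.length) {
            return;
        }
        int newLength = Integer.highestOneBit(needed - 1) << 1; // next power of two
        // Reserve the extra bytes first; this throws CircuitBreakingException if
        // the node is already under memory pressure.
        breaker.addEstimateBytesAndMaybeBreak(newLength - bytes.length, "topn");
        byte[] newBytes = new byte[newLength];
        System.arraycopy(bytes, 0, newBytes, 0, bytes.length);
        bytes = newBytes;
    }
}
```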
This commit adds DriverContext to the construction of Evaluators.
DriverContext is enriched to carry bigArrays, and will eventually carry a BlockFactory - it's the context for code that needs to create instances of blocks and big arrays.
This lowers topn's memory usage somewhat and makes it easier to track
the memory usage. That looks like:
```
"status" : {
"occupied_rows" : 10000,
"ram_bytes_used" : 255392224,
"ram_used" : "243.5mb"
}
```
In some cases the memory usage savings is significant. In an example
with many, many keys the memory usage of each row drops from `58kb` to
`25kb`. This is a little degenerate though and I expect the savings to
normally be on the order of 10%.
The real advantage is memory tracking. It's *easy* to track used memory.
And, in a followup, it should be fairly easy to circuit break on the
used memory.
Mostly this is done by adding new abstractions and moving existing
abstractions to top level classes with tests and stuff.
* `TopNEncoder` is now a top level class. It has grown the ability to *decode* values as well as encode them. And it has grown "unsortable" versions which don't write their values such that sorting the bytes sorts the values. We use the "unsortable" versions when writing values.
* `KeyExtractor` extracts keys from the blocks and writes them to the row's `BytesRefBuilder`. This is basically objects replacing one of the switch statements in `RowFactory`. They are more scattered but easier to test, and hopefully `TopNOperator` is more readable with this behavior factored out. Also! Most implementations are automatically generated.
* `ValueExtractor` extracts values from the blocks and writes them to the row's `BytesRefBuilder`. This replaces the other switch statement in `RowFactory` for the same reasons, except instead of writing to many arrays it writes to a `BytesRefBuilder` just like the key as compactly as it can manage.
The memory savings come from three changes:
1. Lower overhead for storing values by encoding them rather than using many primitive arrays.
2. Encode the value count as a vint rather than a whole int. Usually there are very few rows and vint encodes that quite nicely.
3. Don't write values that are in the key for single-valued fields. Instead we read them from the key. That's going to be very very common.
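For change 2, the value-count encoding works like the standard vint scheme; here
is a self-contained sketch, not the exact helper the code uses:
```
// Hedged sketch of vint encoding: small counts (the common case) take one byte
// instead of the four a fixed-width int would use.
import java.io.ByteArrayOutputStream;

final class VIntSketch {
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.write(value); // final byte, continuation bit clear
    }
}
// writeVInt(out, 3)   -> 1 byte
// writeVInt(out, 300) -> 2 bytes
```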
This is unlikely to be faster than the old code. I haven't really tried
for speed. Just memory usage and accountability. Once we get good
accounting we can try and make this faster. I expect we'll have to
figure out the megamorphic invocations I've added. But, for now, they
help more than they hurt.
CompatibilityVersions now holds a map of system index names to their
mappings versions, alongside the transport version. We also add mapping
versions to the "minimum version barrier": if a node has a system index
whose version is below the cluster mappings version for that system
index, it is not allowed to join the cluster.
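A hedged sketch of that extra join-time check; the names and structure are
illustrative, not the actual validator:
```
// Hedged sketch: reject a joining node if, for any system index it knows about,
// its mappings version is older than what the cluster already requires.
import java.util.Map;

final class MappingsVersionJoinCheckSketch {
    static void ensureCanJoin(Map<String, Integer> clusterVersions, Map<String, Integer> nodeVersions) {
        for (Map.Entry<String, Integer> entry : clusterVersions.entrySet()) {
            Integer nodeVersion = nodeVersions.get(entry.getKey());
            if (nodeVersion != null && nodeVersion < entry.getValue()) {
                throw new IllegalStateException(
                    "node's mappings version [" + nodeVersion + "] for system index ["
                        + entry.getKey() + "] is below the cluster minimum [" + entry.getValue() + "]"
                );
            }
        }
    }
}
```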
* ESQL: Disable optimizations with bad null handling
We have optimizations that kick in when aggregating on the following
pairs of field types:
* `long`, `long`
* `keyword`, `long`
* `long`, `keyword`
These optimizations don't have proper support for `null` valued fields
but will grow that after #98749. In the meantime this disables them in
a way that prevents them from bit-rotting.
* Update docs/changelog/99434.yaml
Cluster state currently holds a cluster minimum transport version and a map of nodes to transport versions. However, to determine node compatibility, we will need to account for more types of versions in cluster state than just the transport version (see #99076). Here we introduce a wrapper class to cluster state and update accessors and builders to use the new method. (I would have liked to re-use org.elasticsearch.cluster.node.VersionInformation, but that one holds IndexVersion rather than TransportVersion.)
* Introduce CompatibilityVersions to cluster state class
We did not use the cluster settings on these gigantic objects
except for the one spot in the aggregation context.
=> we can just hold a reference to it on the aggregation context
and simplify things a little for tests etc.
Also, inline needless indirection via single-use private method in
`toQuery`.
The reason for this is to wire the ESQL evaluator interface into things such as
the Warning infrastructure.
In the process, rearrange the evaluator classes under expression a bit:
- introduce an evaluator package
- move EvalMapper & its family of mappers (from planner) under evaluator
- extract common interface from EvalMapper into its own file and rename
it from the generic Mapper (of which we have several classes) to
EvaluatorMapper
- mirror the package hierarchy from expression package
- widen visibility from protected to public (side-effect of the above)
- move classes that only generate code from expression to evaluator
When multivalued fields are loaded from Lucene they are in sorted order
but we weren't taking advantage of that fact. Now we are! It's much
faster, even for fast operations like `mv_min`:
```
(operation) Mode Cnt Score Error Units
mv_min avgt 7 3.820 ± 0.070 ns/op
mv_min_ascending avgt 7 1.979 ± 0.130 ns/op
```
We still have code to run in non-sorted mode because conversion functions
and a few other things don't load in sorted order.
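The win is easy to see in a sketch: when the values at a position are already
sorted ascending, `mv_min` can skip the scan entirely. This is simplified, with
made-up accessors, not the generated evaluator code:
```
// Hedged sketch: with ascending, contiguous multivalues the minimum is simply
// the first value at the position; otherwise we still scan every value.
final class MvMinSketch {
    static long mvMin(long[] values, int firstIndex, int count, boolean sortedAscending) {
        if (sortedAscending) {
            return values[firstIndex]; // fast path: first value is the minimum
        }
        long min = values[firstIndex];
        for (int i = firstIndex + 1; i < firstIndex + count; i++) {
            min = Math.min(min, values[i]);
        }
        return min;
    }
}
```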
I've also expanded the parameterized tests for the `MV_` functions
because, well, I needed to expand them at least a little to test this
change. And I just kept going and improved as many tests as I could.
In serverless we would like to report (meter and bill) upon document ingestion. The metering should be agnostic to the document format (the document structure should be normalised), hence we should allow creating XContentParsers which keep track of parsed fields and values.
There are 2 places where the parsing of the ingested document happens:
1. upon the 'raw bulk', when a request is sent without pipelines
2. upon the 'ingest service', when a request is sent with pipelines
(parsing can occur twice when dynamic mappings are calculated; this PR takes this into account and prevents double billing)
We also want to make sure that the metering logic is not unnecessarily executed when a document was already reported. That is, if a document was reported in IngestService, there is no point wrapping the XContentParser again.
This commit introduces a `DocumentReporterPlugin`, an internal plugin that will be implemented in serverless. This plugin should return a `DocumentParsingObserver` supplier which will create a `DocumentParsingObserver`. A `DocumentParsingObserver` is used to wrap an `XContentParser` with an implementation that keeps track of parsed fields and values (performs the metering) and allows sending that information, along with an index name, to a MeteringReporter.
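A hedged sketch of what such a wrapping parser could look like. The
`FilterXContentParserWrapper` base class is an assumption about the available
delegating parser, and the counting and reporting are only illustrative:
```
// Hedged sketch: count field names as the document is parsed, then report the
// total (together with the index name) to the metering backend afterwards.
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.elasticsearch.xcontent.FilterXContentParserWrapper;
import org.elasticsearch.xcontent.XContentParser;

final class CountingXContentParserSketch extends FilterXContentParserWrapper {
    private final AtomicLong parsedFields;

    CountingXContentParserSketch(XContentParser delegate, AtomicLong parsedFields) {
        super(delegate);
        this.parsedFields = parsedFields;
    }

    @Override
    public Token nextToken() throws IOException {
        Token token = super.nextToken();
        if (token == Token.FIELD_NAME) {
            parsedFields.incrementAndGet(); // hypothetical metering hook
        }
        return token;
    }
}
```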
This sizes pages produced by operators based on an estimate of the
number of bytes that'll be added to the page after it's been emitted.
At this point it only really works properly for the
`LuceneSourceOperator`, but that's pretty useful!
The `LuceneTopNSourceOperator` doesn't yet have code to cut the output
of a topn collected from lucene into multiple pages, so it ignores the
calculated value. We'll get to that in a follow up.
We feed the right value into aggregations but ungrouped aggregations
ignore it because they only ever emit one row. Grouped aggregations
don't yet have any code to cut their output into multiple pages.
TopN *does* have code to cut the output into multiple pages but the
estimates passed to it are kind of hacky. A proper estimate of TopN
would account for the size of rows flowing into it, but I never wrote
code for that. The thing is - TopN doesn't have to estimate incoming row
size - it can measure each row as it builds it and use the estimate we're
building now as an estimate of extra bytes that'll be added. Which is
what it is! But that code also needs to be written.
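The arithmetic behind the estimate is simple; this is a sketch with made-up
numbers and names, not the real estimator:
```
// Hedged sketch: pick a row count per page so the bytes added downstream stay
// near a target page size, capped at the default block length of 16 * 1024.
final class PageSizeSketch {
    static int rowsPerPage(long targetPageBytes, long estimatedBytesAddedPerRow) {
        long rows = targetPageBytes / Math.max(1, estimatedBytesAddedPerRow);
        return Math.toIntExact(Math.max(1, Math.min(rows, 16 * 1024)));
    }
}
// e.g. rowsPerPage(1_048_576, 64) == 16_384
```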
Relates to https://github.com/elastic/elasticsearch-internal/issues/1385
This commit disables javadocs for the benchmarks project, since the docs are not necessary or interesting, and cause warning noise in the build log output.
Today, we have a hard-coded maximum page size of 16K in Lucene operators
and other operators like TopN and HashOperator. This default value
should work well in production. However, it doesn't provide enough
randomization in our tests because we mostly emit a single page.
Additionally, some tests take a significant amount of time because they
require indexing a large number of documents, which is several times the
page size.
To address these issues, this PR makes the page size configurable via the
query pragmas, enabling randomization in tests.
This change has already uncovered a bug in LongLongBlockHash.
When you group by more than one multivalued field we generate one ord
per unique tuple of values, one from each column. So if you group by
```
a=(1, 2, 3) b=(2, 3) c=(4, 5, 5)
```
Then you get these grouping keys:
```
1, 2, 4
1, 2, 5
1, 3, 4
1, 3, 5
2, 2, 4
2, 2, 5
2, 3, 4
2, 3, 5
3, 2, 4
3, 2, 5
3, 3, 4
3, 3, 5
```
That's as many grouping keys as the product of the set-wise cardinality
of each column's values. "Product" is a dangerous word! It's possible to make a
simple document containing just two fields that each are a list of
10,000 values and then send *that* into the aggregation framework. That
little baby document will spit out 100,000,000 grouping ordinals!
Without this PR we'd try to create a single `Block` that contains that
many entries. Or, rather, it'd be as big as the nearest power of two.
Gigantonormous. About 760mb! Like, possible, but a huge "slug" of heap
usage and not great.
This PR changes it so, at least for pairs of `long` keys, we'll make many
smaller blocks. We cut the emitted ordinals into blocks of no more than
16*1024 entries, the default length of a block. That means our baby
document would make 6103 full blocks and one half-full block. But each
one is going to be less than 200kb.
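A simplified sketch of the chunking, using plain arrays for illustration; the
real code builds proper blocks through the block factory:
```
// Hedged sketch: emit the ordinals for a position in chunks of at most
// 16 * 1024 entries instead of one gigantic block.
import java.util.ArrayList;
import java.util.List;

final class OrdinalChunkingSketch {
    static final int MAX_BLOCK_LENGTH = 16 * 1024;

    static List<long[]> chunk(long[] ordinals) {
        List<long[]> blocks = new ArrayList<>();
        for (int from = 0; from < ordinals.length; from += MAX_BLOCK_LENGTH) {
            int to = Math.min(from + MAX_BLOCK_LENGTH, ordinals.length);
            long[] block = new long[to - from];
            System.arraycopy(ordinals, from, block, 0, block.length);
            blocks.add(block);
        }
        return blocks;
    }
}
```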
Relates to ESQL-1360
This adds support for multivalued fields to the `PackedValuesHash`. It does
so by encoding batches of values for each column into bytes and then
reading the converted values row-wise. This works while also not causing
per-row megamorphic calls. There are megamorphic calls when the batch is
used up, but those should only hit a few times per block.
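Roughly, the batching looks like the sketch below, simplified to plain arrays;
the interface and method names are illustrative, not the real implementation:
```
// Hedged sketch: encode a batch of positions column by column (one virtual call
// per column per batch), then walk the encoded rows without further virtual
// calls until the batch is exhausted.
final class BatchedEncodingSketch {
    interface ColumnEncoder {
        // Encodes values for positions [from, to) into dest and returns bytes written.
        int encode(int from, int to, byte[] dest, int offset);
    }

    static void hashBatch(ColumnEncoder[] columns, int from, int to, byte[] scratch) {
        int offset = 0;
        for (ColumnEncoder column : columns) {
            // One megamorphic call per column per batch, not per row.
            offset += column.encode(from, to, scratch, offset);
        }
        // ... then iterate the encoded rows in scratch and add each to the hash ...
    }
}
```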