This PR removes APIs in Vectors that use the non-breaking block factory.
Some tests now explicitly use the non-breaking factory. The goal of this
PR, along with some follow-ups, is to phase out the non-breaking block
factory in production. We can gradually remove its usage in tests later.
This adds support for loading a text field from a parent keyword field.
The mapping for that looks like:
```
"properties": {
"foo": {
"type": "keyword",
"fields": {
"text": { "type": "text" }
}
}
}
```
In this case it's safe to load the `text` subfield from the doc values
for the `keyword` field above.
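For example (reusing the mapping above with a hypothetical index named `test`),
an ESQL query can now keep the subfield directly:
```
FROM test | KEEP foo.text | LIMIT 10
```
The values are read from the parent `keyword` field's doc values, as described above.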
Closes #102473
Here we export both parent and child circuit breaker trip counts as metrics so that we can collect their values using APM. We expose a counter for the trip count of the parent circuit breaker and a counter for the trip count of each child circuit breaker, including:
* field data circuit breakers
* per-request circuit breakers
* in-flight requests circuit breakers
* custom circuit breakers used by plugins (EQL and Machine Learning)
The circuit breaker metrics include:
* es.breaker.parent.trip.total
* es.breaker.field_data.trip.total
* es.breaker.request.trip.total
* es.breaker.in_flight_requests.trip.total
* es.breaker.eql_sequence.trip.total
* es.breaker.in_model_inference.trip.total
Each of the metrics is exposed at node level.
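Roughly, registering and bumping one of these counters looks something like the sketch below. The `MeterRegistry`/`LongCounter` types are from the telemetry metrics API, but the class and wiring here are illustrative assumptions, not the actual change:
```
// Hedged sketch: register a trip-count counter and bump it when the breaker trips.
import org.elasticsearch.telemetry.metric.LongCounter;
import org.elasticsearch.telemetry.metric.MeterRegistry;

class CircuitBreakerMetricsSketch {
    private final LongCounter parentTripCount;

    CircuitBreakerMetricsSketch(MeterRegistry registry) {
        // The counter name matches the metric listed above.
        this.parentTripCount = registry.registerLongCounter(
            "es.breaker.parent.trip.total",
            "Total number of times the parent circuit breaker has tripped",
            "count"
        );
    }

    void onParentBreakerTrip() {
        // Called from wherever the parent breaker decides to trip.
        parentTripCount.increment();
    }
}
```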
This adds support for loading the `_source` field using the syntax:
```
FROM test [METADATA _source]
```
The `_source` field is loaded as a special type, `_source`, which no
functions support (1). We just render it in the output, which looks
like:
```
$ curl -XDELETE -uelastic:password localhost:9200/test
$ curl -XPOST -HContent-Type:application/json -uelastic:password localhost:9200/test/_doc/1?refresh -d'{
"words": "words",
"other stuff": [
"wow",
"such",
"suff"
]
}'
$ curl -XPOST -HContent-Type:application/json -uelastic:password localhost:9200/_query?pretty -d'{
"query": "FROM test [METADATA _source] | KEEP _source | LIMIT 1"
}'
{
  "columns" : [
    {
      "name" : "_source",
      "type" : "_source"
    }
  ],
  "values" : [
    [
      {
        "words" : "words",
        "other stuff" : [
          "wow",
          "such",
          "suff"
        ]
      }
    ]
  ]
}
```
The `_source` is just a JSON object. We use the same infrastructure to
convert it to JSON as the `_search` response.
This works for both stored `_source` and synthetic `_source`, but it
runs row-by-row every time. This is *perfect* for stored `_source` but it's
less nice for synthetic `_source`. We'd be better off rebuilding synthetic
`_source` from blocks but that'd require a lot of new infrastructure.
And synthetic `_source` isn't going to be fast anyway.
(1): `IS NULL` and `IS NOT NULL` support `_source` because we get that
for free.
This modifies ESQL to load a list of fields in a single pass, which is especially
effective when loading from stored fields or `_source` because it allows
visiting the stored fields only once.
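As a sketch of why the single pass helps for stored fields, here is what loading
several fields from one stored-fields visit looks like with plain Lucene APIs.
The field names are hypothetical and this is not the actual ESQL reader code:
```
// Hedged sketch: read the stored document once and pull several fields from it,
// instead of re-reading the document once per field.
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.StoredFields;

class OnePassStoredFieldsSketch {
    static void loadRow(LeafReader reader, int docId) throws IOException {
        StoredFields storedFields = reader.storedFields();
        // One pass over the stored document for both fields (names are made up).
        Document doc = storedFields.document(docId, Set.of("message", "host.name"));
        String message = doc.get("message");
        String hostName = doc.get("host.name");
        // ... append message and hostName to the blocks being built ...
    }
}
```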
Part of #101322
* Modularize shard availability service
This commit moves the `ShardsAvailabilityHealthIndicatorService` into its own package and modularizes it
with exports so that Serverless can make use of it as a superclass.
Relates to #101394
This changes how we load values in ESQL, delegating to the
`MappedFieldType` like we do with doc values and synthetic
source. This allows a much more OO way of getting the loads
working which makes that path much easier to read. And! It
means those code paths look like doc values. So there's
symmetry. It's like it rhymes.
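Very roughly, the delegation has the shape sketched below. The names and
signatures here are simplified illustrations, not the real interfaces:
```
// Hedged sketch of the idea: each MappedFieldType hands ESQL a loader that
// knows how to fill a block for that field, mirroring how doc values and
// synthetic source already delegate to the field type.
interface BlockLoaderSketch {
    // Append the values for one document to the block being built.
    void read(int docId, BlockBuilderSketch builder);
}

interface BlockBuilderSketch {
    void appendLong(long value);
    void appendBytesRef(byte[] value);
    void appendNull();
}

abstract class MappedFieldTypeSketch {
    // A keyword field might return an ordinals-based loader, a numeric field a
    // doc-values loader, and fields without doc values a _source-based loader.
    abstract BlockLoaderSketch blockLoader();
}
```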
There are a few side effects here:
1. It's fairly simple to load from ordinals efficiently. I
wrote some block-at-a-time code for resolving ordinals
and it's about twice as fast. With more work it should
be possible to make custom ordinal-shaped blocks move
through the system to save space and speed things up.
2. Most fields can now be loaded from `_source`. Everything
that can be loaded from `_source` in scripts will load
from `_source` in ESQL.
3. We get a *lot* more tests for loading fields in
different configurations by piggybacking on the synthetic
source testing framework.
4. Loading from `_source` no longer sorts the fields. Same
for stored fields. Now we keep them in whatever order they were
stored in. This is a pretty marginal time save because
loading from `_source` is so much more time consuming
than the sort. But it's something.
This adds comprehensive tests for `ExpressionEvaluator` making sure that it releases `Block`s. It fixes all of the `mv_*` evaluators to make sure they release as well.
This commit updates the hash grouping operator to close input pages, as well as use the block factory for internally created blocks.
Additionally:
* Adds a MockBlockFactory to help with tracking block creation
* Eagerly creates the block view of a vector, which helps with tracking since there can be only one block view instance per vector
* Resolves an issue with Filter Blocks, whereby they previously tried to emit their contents in toString
This creates `Block.Ref`, a reference to a `Block` which may or may not
be part of a `Page`. `Block.Ref` is `Releasable` and closing it is a
noop if the `Block` is part of a `Page`, but if it is "free floating"
then closing the `Block.Ref` will close the block.
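A minimal, simplified sketch of those semantics (not the real class):
```
// Hedged sketch: closing the ref releases the block only when it is
// "free floating", i.e. not owned by a Page that will release it later.
import org.elasticsearch.core.Releasable;

class BlockRefSketch implements Releasable {
    private final Block block;            // the compute engine's Block
    private final boolean containedInPage;

    BlockRefSketch(Block block, boolean containedInPage) {
        this.block = block;
        this.containedInPage = containedInPage;
    }

    Block block() {
        return block;
    }

    @Override
    public void close() {
        if (containedInPage == false) {
            block.close(); // we own the free floating block, so release it
        }
        // otherwise a noop: the Page owns the block and will release it later
    }
}
```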
It also modifies `ExpressionEvaluator` to return a `Block.Ref` instead
of a `Block` - so you tend to work with `ExpressionEvaluator`s like
this:
```
try (Block.Ref ref = eval.eval(page)) {
    return ref.block().doStuff();
}
```
This should make it *much* easier to release the memory from `Block`s
built by `ExpressionEvaluator`s.
This change is mostly mechanical, introducing the new signature for
`ExpressionEvaluator`. In a follow up change I'll modify the tests to
make sure we're correctly using it to close pages.
I did think about changing `ExpressionEvaluator` to add a method telling
you if the block that it returns must be closed or not. This would have
been more difficult to work with, and, ultimately, limiting.
Specifically, it is possible for an `ExpressionEvaluator` to *sometimes*
return a free floating block and other times return one that is
contained in a `Page`. Imagine `mv_concat` - it returns the block it
receives if the block doesn't have multivalued fields. Otherwise it
concats things. If that block happens to come directly out of the
`Page`, then `mv_concat` will sometimes produce free floating blocks and
sometimes not.
Today, we have the ability to specify whether multivalued fields are
sorted in ascending order or not. This feature allows operators like
topn to enable optimizations. However, we are currently missing the
deduplicated attribute. If multivalued fields are deduplicated at each
position, we can further optimize operators such as hash and mv_dedup.
In fact, blocks should not have the mv_ascending property alone; it always
goes together with mv_deduplicated. Additionally, mv_dedup or hash
should generate blocks that have only the mv_dedup property.
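One way to picture the combined property is the hypothetical sketch below; the
names are illustrative, not the exact enum in the code:
```
// Hedged sketch: ascending order only ever appears together with deduplication,
// so modelling the two as one ordering property keeps the illegal combination
// (ascending but not deduplicated) unrepresentable.
enum MvOrderingSketch {
    UNORDERED,                         // anything goes
    DEDUPLICATED_UNORDERED,            // e.g. output of mv_dedup or hash
    DEDUPLICATED_AND_SORTED_ASCENDING  // e.g. values loaded straight from doc values
}
```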
This commit adds a BlockFactory - an extra level of indirection when building blocks. The factory couples circuit breaking when building, allowing for incrementing the breaker as blocks and Vectors are built.
This PR adds the infrastructure to allow us to move the operators and implementations over to the factory, rather than actually moving them all over at once.
This prevents topn operations from using too much memory by hooking them
into circuit breaking framework. It builds on the work done in
https://github.com/elastic/elasticsearch/pull/99316 that moved all topn
storage to byte arrays by adding circuit breaking to the process of growing
the underlying byte array.
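A rough sketch of the pattern, using the existing `CircuitBreaker` API; the
surrounding topn storage class and method below are hypothetical, not the real code:
```
// Hedged sketch: account for the extra bytes with the breaker before growing the
// backing array, so a runaway topn trips the breaker instead of running out of heap.
import org.elasticsearch.common.breaker.CircuitBreaker;

class GrowableRowStorageSketch {
    private final CircuitBreaker breaker;
    private byte[] bytes = new byte[16];

    GrowableRowStorageSketch(CircuitBreaker breaker) {
        this.breaker = breaker;
    }

    void ensureCapacity(int needed) {
        if (needed <= bytes.length) {
            return;
        }
        int newLength = Integer.highestOneBit(needed - 1) << 1; // next power of two
        // Reserve the extra bytes first; this throws CircuitBreakingException if
        // the node is already under memory pressure.
        breaker.addEstimateBytesAndMaybeBreak(newLength - bytes.length, "topn");
        byte[] newBytes = new byte[newLength];
        System.arraycopy(bytes, 0, newBytes, 0, bytes.length);
        bytes = newBytes;
    }
}
```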
This commit adds DriverContext to the construction of Evaluators.
DriverContext is enriched to carry bigArrays, and will eventually carry a BlockFactory - it's the context for code that needs to create instances of blocks and big arrays.
This lowers topn's memory usage somewhat and makes it easier to track
the memory usage. That looks like:
```
"status" : {
"occupied_rows" : 10000,
"ram_bytes_used" : 255392224,
"ram_used" : "243.5mb"
}
```
In some cases the memory usage savings is significant. In an example
with many, many keys the memory usage of each row drops from `58kb` to
`25kb`. This is a little degenerate though and I expect the savings to
normally be on the order of 10%.
The real advantage is memory tracking. It's *easy* to track used memory.
And, in a followup, it should be fairly easy to circuit break on the
used memory.
Mostly this is done by adding new abstractions and moving existing
abstractions to top level classes with tests and stuff.
* `TopNEncoder` is now a top level class. It has grown the ability to *decode* values as well as encode them. And it has grown "unsortable" versions which don't write their values such that sorting the bytes sorts the values. We use the "unsortable" versions when writing values.
* `KeyExtractor` extracts keys from the blocks and writes them to the row's `BytesRefBuilder`. This is basically objects replacing one of the switch statements in `RowFactory`. They are more scattered but easier to test, and hopefully `TopNOperator` is more readable with this behavior factored out. Also! Most implementations are automatically generated.
* `ValueExtractor` extracts values from the blocks and writes them to the row's `BytesRefBuilder`. This replaces the other switch statement in `RowFactory` for the same reasons, except instead of writing to many arrays it writes to a `BytesRefBuilder` just like the key as compactly as it can manage.
The memory savings come from three changes:
1. Lower overhead for storing values by encoding them rather than using many primitive arrays.
2. Encode the value count as a vint rather than a whole int. Usually there are very few rows and vint encodes that quite nicely.
3. Don't write values that are in the key for single-valued fields. Instead we read them from the key. That's going to be very very common.
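For change 2, the value-count encoding works like the standard vint scheme; here
is a self-contained sketch, not the exact helper the code uses:
```
// Hedged sketch of vint encoding: small counts (the common case) take one byte
// instead of the four a fixed-width int would use.
import java.io.ByteArrayOutputStream;

final class VIntSketch {
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.write(value); // final byte, continuation bit clear
    }
}
// writeVInt(out, 3)   -> 1 byte
// writeVInt(out, 300) -> 2 bytes
```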
This is unlikely to be faster than the old code. I haven't really tried
for speed. Just memory usage and accountability. Once we get good
accounting we can try and make this faster. I expect we'll have to
figure out the megamorphic invocations I've added. But, for now, they
help more than they hurt.
CompatibilityVersions now holds a map of system index names to their
mappings versions, alongside the transport version. We also add mapping
versions to the "minimum version barrier": if a node has a system index
whose version is below the cluster mappings version for that system
index, it is not allowed to join the cluster.
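A hedged sketch of that extra join-time check; the names and structure are
illustrative, not the actual validator:
```
// Hedged sketch: reject a joining node if, for any system index it knows about,
// its mappings version is older than what the cluster already requires.
import java.util.Map;

final class MappingsVersionJoinCheckSketch {
    static void ensureCanJoin(Map<String, Integer> clusterVersions, Map<String, Integer> nodeVersions) {
        for (Map.Entry<String, Integer> entry : clusterVersions.entrySet()) {
            Integer nodeVersion = nodeVersions.get(entry.getKey());
            if (nodeVersion != null && nodeVersion < entry.getValue()) {
                throw new IllegalStateException(
                    "node's mappings version [" + nodeVersion + "] for system index ["
                        + entry.getKey() + "] is below the cluster minimum [" + entry.getValue() + "]"
                );
            }
        }
    }
}
```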
* ESQL: Disable optimizations with bad null handling
We have optimizations that kick in when aggregating on the following
pairs of field types:
* `long`, `long`
* `keyword`, `long`
* `long`, `keyword`
These optimizations don't have proper support for `null` valued fields
but will grow that after #98749. In the meantime this disables them in
a way that prevents them from bit-rotting.
* Update docs/changelog/99434.yaml
Cluster state currently holds a cluster minimum transport version and a map of nodes to transport versions. However, to determine node compatibility, we will need to account for more types of versions in cluster state than just the transport version (see #99076). Here we introduce a wrapper class to cluster state and update accessors and builders to use the new method. (I would have liked to re-use org.elasticsearch.cluster.node.VersionInformation, but that one holds IndexVersion rather than TransportVersion.)
* Introduce CompatibilityVersions to cluster state class
We did not use the cluster settings on these gigantic objects
except for the one spot in the aggregation context.
=> we can just hold a reference to it on the aggregation context
and simplify things a little for tests etc.
Also, inline needless indirection via single-use private method in
`toQuery`.
The reason for this is to wire the ESQL evaluator interface into things such as
the Warning infrastructure.
In the process, rearrange the evaluator classes under expression a bit:
- introduce an evaluator package
- move EvalMapper & its family of mappers (from planner) under evaluator
- extract common interface from EvalMapper into its own file and rename
it from the generic Mapper (of which we have several classes) to
EvaluatorMapper
- mirror the package hierarchy from expression package
- widen visibility from protected to public (side-effect of the above)
- move classes that only generate code from expression to evaluator
When multivalued fields are loaded from Lucene they are in sorted order
but we weren't taking advantage of that fact. Now we are! It's much
faster, even for fast operations like `mv_min`:
```
(operation) Mode Cnt Score Error Units
mv_min avgt 7 3.820 ± 0.070 ns/op
mv_min_ascending avgt 7 1.979 ± 0.130 ns/op
```
We still have code to run in non-sorted mode because conversion functions
and a few other things don't load in sorted order.
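The win is easy to see in a sketch: when the values at a position are already
sorted ascending, `mv_min` can skip the scan entirely. This is simplified, with
made-up accessors, not the generated evaluator code:
```
// Hedged sketch: with ascending, contiguous multivalues the minimum is simply
// the first value at the position; otherwise we still scan every value.
final class MvMinSketch {
    static long mvMin(long[] values, int firstIndex, int count, boolean sortedAscending) {
        if (sortedAscending) {
            return values[firstIndex]; // fast path: first value is the minimum
        }
        long min = values[firstIndex];
        for (int i = firstIndex + 1; i < firstIndex + count; i++) {
            min = Math.min(min, values[i]);
        }
        return min;
    }
}
```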
I've also expanded the parameterized tests for the `MV_` functions
because, well, I needed to expand them at least a little to test this
change. And I just kept going and improved as many tests as I could.
In serverless we would like to report (meter and bill) upon document ingestion. The metering should be agnostic to the document format (the document structure should be normalised), hence we should allow creating XContentParsers which keep track of parsed fields and values.
There are 2 places where the parsing of the ingested document happens:
1. upon the 'raw bulk', when a request is sent without pipelines
2. upon the 'ingest service', when a request is sent with pipelines
(parsing can occur twice when dynamic mappings are calculated; this PR takes this into account and prevents double billing)
We also want to make sure that the metering logic is not unnecessarily executed when a document was already reported. That is, if a document was reported in IngestService, there is no point wrapping the XContentParser again.
This commit introduces a `DocumentReporterPlugin`, an internal plugin that will be implemented in serverless. This plugin should return a `DocumentParsingObserver` supplier which will create a `DocumentParsingObserver`. A `DocumentParsingObserver` is used to wrap an `XContentParser` with an implementation that keeps track of parsed fields and values (performs the metering) and allows sending that information, along with an index name, to a MeteringReporter.
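A hedged sketch of what such a wrapping parser could look like. The
`FilterXContentParserWrapper` base class is an assumption about the available
delegating parser, and the counting and reporting are only illustrative:
```
// Hedged sketch: count field names as the document is parsed, then report the
// total (together with the index name) to the metering backend afterwards.
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

import org.elasticsearch.xcontent.FilterXContentParserWrapper;
import org.elasticsearch.xcontent.XContentParser;

final class CountingXContentParserSketch extends FilterXContentParserWrapper {
    private final AtomicLong parsedFields;

    CountingXContentParserSketch(XContentParser delegate, AtomicLong parsedFields) {
        super(delegate);
        this.parsedFields = parsedFields;
    }

    @Override
    public Token nextToken() throws IOException {
        Token token = super.nextToken();
        if (token == Token.FIELD_NAME) {
            parsedFields.incrementAndGet(); // hypothetical metering hook
        }
        return token;
    }
}
```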
This sizes pages produced by operators based on an estimate of the
number of bytes that'll be added to the page after it's been emitted.
At this point it only really works properly for the
`LuceneSourceOperator`, but that's pretty useful!
The `LuceneTopNSourceOperator` doesn't yet have code to cut the output
of a topn collected from lucene into multiple pages, so it ignores the
calculated value. We'll get to that in a follow up.
We feed the right value into aggregations but ungrouped aggregations
ignore it because they only ever emit one row. Grouped aggregations
don't yet have any code to cut their output into multiple pages.
TopN *does* have code to cut the output into multiple pages but the
estimates passed to it are kind of hacky. A proper estimate of TopN
would account for the size of rows flowing into it, but I never wrote
code for that. The thing is - TopN doesn't have to estimate incoming row
size - it can measure each row as it builds it and use the estimate we're
building now as an estimate of extra bytes that'll be added. Which is
what it is! But that code also needs to be written.
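The arithmetic behind the estimate is simple; this is a sketch with made-up
numbers and names, not the real estimator:
```
// Hedged sketch: pick a row count per page so the bytes added downstream stay
// near a target page size, capped at the default block length of 16 * 1024.
final class PageSizeSketch {
    static int rowsPerPage(long targetPageBytes, long estimatedBytesAddedPerRow) {
        long rows = targetPageBytes / Math.max(1, estimatedBytesAddedPerRow);
        return Math.toIntExact(Math.max(1, Math.min(rows, 16 * 1024)));
    }
}
// e.g. rowsPerPage(1_048_576, 64) == 16_384
```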
Relates to https://github.com/elastic/elasticsearch-internal/issues/1385
This commit disables javadocs for the benchmarks project, since the docs are not necessary or interesting, and cause warning noise in the build log output.
Today, we have a hard-coded maximum page size of 16K in Lucene operators
and other operators like TopN and HashOperator. This default value
should work well in production. However, it doesn't provide enough
randomization in our tests because we mostly emit a single page.
Additionally, some tests take a significant amount of time because they
require indexing a large number of documents, which is several times the
page size.
To address these issues, this PR makes the page size configurable via the
query pragmas, enabling randomization in tests.
This change has already uncovered a bug in LongLongBlockHash.
When you group by more than one multivalued field we generate one ord
per unique tuple of values, one from each column. So if you group by
```
a=(1, 2, 3) b=(2, 3) c=(4, 5, 5)
```
Then you get these grouping keys:
```
1, 2, 4
1, 2, 5
1, 3, 4
1, 3, 5
2, 2, 4
2, 2, 5
2, 3, 4
2, 3, 5
3, 2, 4
3, 2, 5
3, 3, 4
3, 3, 5
```
That's as many grouping keys as the product of the set-wise cardinality
of each column's values. "Product" is a dangerous word! It's possible to make a
simple document containing just two fields that each are a list of
10,000 values and then send *that* into the aggregation framework. That
little baby document will spit out 100,000,000 grouping ordinals!
Without this PR we'd try to create a single `Block` that contains that
many entries. Or, rather, it'd be as big as the nearest power of two.
Gigantonormous. About 760mb! Like, possible, but a huge "slug" of heap
usage and not great.
This PR changes it so, at least for pairs of `long` keys, we'll make many
smaller blocks. We cut the emitted ordinals into blocks of no more than
16*1024 entries, the default length of a block. That means our baby
document would make 6103 full blocks and one half-full block. But each
one is going to be less than 200kb.
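A simplified sketch of the chunking, using plain arrays for illustration; the
real code builds proper blocks through the block factory:
```
// Hedged sketch: emit the ordinals for a position in chunks of at most
// 16 * 1024 entries instead of one gigantic block.
import java.util.ArrayList;
import java.util.List;

final class OrdinalChunkingSketch {
    static final int MAX_BLOCK_LENGTH = 16 * 1024;

    static List<long[]> chunk(long[] ordinals) {
        List<long[]> blocks = new ArrayList<>();
        for (int from = 0; from < ordinals.length; from += MAX_BLOCK_LENGTH) {
            int to = Math.min(from + MAX_BLOCK_LENGTH, ordinals.length);
            long[] block = new long[to - from];
            System.arraycopy(ordinals, from, block, 0, block.length);
            blocks.add(block);
        }
        return blocks;
    }
}
```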
Relates to ESQL-1360
This adds support for multivalued fields to the `PackedValuesHash`. It does
so by encoding batches of values for each column into bytes and then
reading the converted values row-wise. This works while also not causing
per-row megamorphic calls. There are megamorphic calls when the batch is
used up, but those should only hit a few times per block.
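Roughly, the batching looks like the sketch below, simplified to plain arrays;
the interface and method names are illustrative, not the real implementation:
```
// Hedged sketch: encode a batch of positions column by column (one virtual call
// per column per batch), then walk the encoded rows without further virtual
// calls until the batch is exhausted.
final class BatchedEncodingSketch {
    interface ColumnEncoder {
        // Encodes values for positions [from, to) into dest and returns bytes written.
        int encode(int from, int to, byte[] dest, int offset);
    }

    static void hashBatch(ColumnEncoder[] columns, int from, int to, byte[] scratch) {
        int offset = 0;
        for (ColumnEncoder column : columns) {
            // One megamorphic call per column per batch, not per row.
            offset += column.encode(from, to, scratch, offset);
        }
        // ... then iterate the encoded rows in scratch and add each to the hash ...
    }
}
```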