Re-implement `CATEGORIZE` in a way that works for multi-node clusters.
This requires a two-pass approach: data is first categorized on each data node, then the categorizers from each data node are merged on the coordinator node and previously categorized rows are re-categorized.
BlockHashes, used in HashAggregations, already work in a very similar way. For example, for queries like `... | STATS ... BY field1, field2`, they map the values of `field1` and `field2` to unique integer ids that are then passed to the actual aggregate functions to identify which "bucket" a row belongs to. When passed from the data nodes to the coordinator, the BlockHashes are also merged to obtain unique ids for every value in `field1, field2` that is seen on the coordinator (not only on the local data nodes).
Therefore, we re-implement `CATEGORIZE` as a special BlockHash.
To choose the correct BlockHash when a query plan is mapped to physical operations, the `AggregateExec` query plan node needs to know that we will be categorizing the field `message` in a query containing `... | STATS ... BY c = CATEGORIZE(message)`. For this reason, _we do not extract the expression_ `c = CATEGORIZE(message)` into an `EVAL` node, in contrast to e.g. `STATS ... BY b = BUCKET(field, 10)`. The expression `c = CATEGORIZE(message)` simply remains inside the `AggregateExec`'s groupings.
**Important limitation:** For now, to use `CATEGORIZE` in a `STATS` command, there can be only one grouping overall, and it must be the `CATEGORIZE` itself.
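To make the two-pass flow concrete, here is a minimal sketch in plain Java. The class and method names (`Categorizer`, `categorize`, `mergeWith`) are hypothetical stand-ins for the real BlockHash machinery, and the categorization itself is grossly simplified:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the CATEGORIZE BlockHash: assigns category ids on
// a data node and can be merged with other nodes' categorizers on the coordinator.
class Categorizer {
    private final Map<String, Integer> categories = new HashMap<>();

    // First pass (data node): map each message to a locally-unique category id.
    int categorize(String message) {
        return categories.computeIfAbsent(keyFor(message), k -> categories.size());
    }

    // Second pass (coordinator): merge another node's categories and return a
    // remapping from that node's local ids to the merged (global) ids; this
    // remapping is what lets previously categorized rows be re-categorized.
    int[] mergeWith(Categorizer other) {
        int[] remapped = new int[other.categories.size()];
        other.categories.forEach((key, localId) ->
            remapped[localId] = categories.computeIfAbsent(key, k -> categories.size()));
        return remapped;
    }

    // Grossly simplified category key: the real categorizer tokenizes the
    // message and merges similar token sequences into patterns.
    private static String keyFor(String message) {
        return message.replaceAll("\\d+", "<num>");
    }
}
```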
I goofed on the bit * byte and bit * float comparisons. Naturally, these
should be big-endian and compare the byte and float dimensions against the
binary ones appropriately; a sketch of the intended behaviour is below.
Additionally, I added a test to ensure that this is handled correctly.
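For illustration, a hedged sketch of what the big-endian convention means for bit * byte; the method name and exact packing are assumptions, not the actual Lucene/Elasticsearch code:

```java
// Illustrative bit * byte dot product: the bit vector is packed big-endian,
// i.e. the most significant bit of bits[0] is dimension 0, and each set bit
// selects the corresponding byte dimension to add to the sum.
class BitVectorOps {
    static int bitByteDotProduct(byte[] bits, byte[] bytes) {
        int sum = 0;
        for (int dim = 0; dim < bytes.length; dim++) {
            int bit = (bits[dim / 8] >> (7 - (dim % 8))) & 1; // big-endian within each byte
            if (bit == 1) {
                sum += bytes[dim]; // bit * float works the same way, with float[] and a float sum
            }
        }
        return sum;
    }
}
```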
* Add timezone support to Cron objects
* Add timezone support to CronnableSchedule
* XContent change to support parsing and display of TimeZone fields on schedules
* Case insensitive timezone parsing
* Doc changes
* YAML REST tests
* equals, toString and hashCode now include the timezone
* Additional random testing for DST transitions
* Migrate Cron class to use wrapped LocalDateTime
The algorithm depends on some quirks of Calendar, but LocalDateTime
correctly ignores DST during calculations, so this uses a LocalDateTime
with a wrapper that emulates the Calendar behaviours the Cron algorithm
depends on (see the sketch after this list)
* Additional documentation to explain discontinuity event behaviour
* Remove redundant conversions from ZoneId to TimeZone following move to LocalDateTime
* Add documentation warning that manual clock changes will cause unpredictable watch execution
* Update docs/reference/watcher/trigger/schedule.asciidoc
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
---------
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
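To make the LocalDateTime point concrete, a minimal sketch of the conversion step, with `nextLocalMatch` as a hypothetical stand-in for the Calendar-style cron algorithm:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

class CronSketch {
    // Illustrative: the cron arithmetic runs on LocalDateTime, which has no
    // zone and therefore ignores DST; the zone rules are applied only when
    // converting the result back to an instant.
    static Instant nextTrigger(Instant after, ZoneId timezone) {
        LocalDateTime local = LocalDateTime.ofInstant(after, timezone);
        LocalDateTime nextLocal = nextLocalMatch(local); // hypothetical cron step
        // For a local time inside a spring-forward gap, atZone() shifts it
        // later by the length of the gap; for an ambiguous fall-back time it
        // picks the earlier offset. This is where discontinuity events surface.
        return nextLocal.atZone(timezone).toInstant();
    }

    // Trivial stand-in for the real algorithm, which matches the full cron
    // expression field by field: here, simply "every minute".
    static LocalDateTime nextLocalMatch(LocalDateTime t) {
        return t.plusMinutes(1).withSecond(0).withNano(0);
    }
}
```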
This enables date nanos support as tech preview. Basic operations, like reading values, binary comparisons, and functions that don't care about type, should work, but some functions are not yet supported. Most notably, `BUCKET` is not yet supported, although `DATE_TRUNC` is and can be used for grouping. See the docs for the full list of limitations.
relates to #109352
First PR for adding LOOKUP JOIN in ESQL.
Introduces grammar and wires the main building blocks to execute a query; follow-ups are required (see #116208 for more details).
Co-authored-by: Nik Everett <nik9000@users.noreply.github.com>
It has been noted that strange or incorrect error messages are returned if the ENRICH command uses incompatible data types, for example a KEYWORD with value 'foo' used in an int_range match: https://github.com/elastic/elasticsearch/issues/107357
This error is thrown at runtime and contradicts the ES|QL policy of only throwing errors at planning time, while at runtime we should instead set results to null and add a warning. However, we could make the planner stricter and block potentially mismatching types earlier.
At the same time, runtime parsing of KEYWORD fields has been a feature of ES|QL ENRICH since its inception; in particular, we even have tests asserting that KEYWORD fields containing parsable IP data can be joined to an ip_range ENRICH index.
To avoid creating a backwards-compatibility problem, we have compromised on the following:
* Strict range type checking at the planner time for incompatible range types, unless the incoming index field is KEYWORD
* For KEYWORD fields, allow runtime parsing of the fields, but when parsing fails, set the result to null and add a warning (see the sketch below)
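A hedged sketch of the lenient KEYWORD path for an ip_range policy; the names, the parser, and the warning text are illustrative, not the actual implementation:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;

class EnrichLookupSketch {
    // Illustrative: lenient parsing of a KEYWORD value at lookup time. On
    // failure the result becomes null plus a warning, instead of a
    // query-killing runtime exception.
    static InetAddress parseIpOrNull(String keyword, List<String> warnings) {
        try {
            // Stand-in parser: InetAddress.getByName also resolves hostnames;
            // the real code would use a strict IP parser.
            return InetAddress.getByName(keyword);
        } catch (UnknownHostException e) {
            warnings.add("cannot parse [" + keyword + "] as an IP, treating result as null");
            return null;
        }
    }
}
```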
Added extra tests to verify behaviour of match policies on non-keyword fields. They all behave as keywords (the enrich field is converted to keyword at policy execution time, and the input data is converted to keyword at lookup time).
Now, error fields will always have 'type' and 'reason' fields, and the information in those fields is the same regardless of whether the output is detailed or not
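For illustration, the shared shape now looks like this (values made up); detailed output adds further fields rather than changing these two:

```json
{
  "error": {
    "type": "illegal_argument_exception",
    "reason": "unknown field [foo]"
  },
  "status": 400
}
```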
Allow the new flags added in Lucene to the `HyphenationCompoundWordTokenFilter`.
Adds access to the two new flags `no_sub_matches` and `no_overlapping_matches` (see the example below).
Lucene issue: https://github.com/apache/lucene/issues/9231
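For illustration, a sketch of how the flags surface on the `hyphenation_decompounder` token filter; the patterns path and word list are placeholders:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_decompounder": {
          "type": "hyphenation_decompounder",
          "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
          "word_list": ["Kaffee", "Fee", "Tasse"],
          "no_sub_matches": true,
          "no_overlapping_matches": false
        }
      }
    }
  }
}
```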
force_source has been parsed as a no-op since 8.8. This commit removes support
for it at the REST layer, meaning a search request that provides it now gets an error back.
We have gotten more than one SDH due to customers not understanding
why restarts involving fully-mounted indices can pull a lot of data
from the snapshot tier, so it may help to be more explicit about
why this happens and how it can be avoided.
* Add total rule type counts to list calls and xpack usage
* Add feature
* Update docs/changelog/116357.yaml
* Fix docs test failure & update yaml tests
* remove additional spaces
---------
Co-authored-by: Mark J. Hoy <mark.hoy@elastic.co>