This will correct/switch "year" unit diffing from the current integer
subtraction to a crono subtraction. Consequently, two dates are (at
least) one year apart now if (at least) a full calendar year separates
them. The previous implementation simply subtracted the year part of the
dates.
Note: this parts with ES SQL's implementation of the same function,
which itself is aligned with MS SQL's implementation, which works
equivalent to an integer subtraction.
Fixes#112482.
(cherry picked from commit f7ff00f645)
When CASE hits a multivalued field it was previously either crashing on
fold or evaluating it to the first value. Since booleans are loaded in
sorted order from lucene that *usually* means `false`. This changes the
behavior to line up with the rest of ESQL - now multivalued fields are
treated as `false` with a warning.
You might say "hey wait! multivalued fields usually become `null`, not
`false`!". Yes, dear reader, you are right. Very right. But! `CASE`'s
contract is to immediatly convert its values into `true` or `false`
using the standard boolean tri-valued logic. So `null` just become
`false` immediately. This is how PostgreSQL, MySQL, and SQLite behave:
```
> SELECT CASE WHEN null THEN 1 ELSE 2 END;
2
```
They turn that `null` into a false. And we're right there with them.
Except, of course, that we're turning `[false, false]` and the like into
`null` first. See!? It's consitent. Consistently confusing, but sane at
least.
The warning message just says "treating multivalued field as false"
rather than explaining all of that.
This also fixes up a few of CASE's docs which I noticed were kind of
busted while working on CASE. I think the docs generation is having a
lot of trouble with CASE so I've manually hacked the right thing into
place, but we should figure out a better solution eventually.
Closes#112359
- Added mv_median_absolute_deviation function
- Added possibility of having a fixed param in Multivalue "ascending" functions
- Add surrogate to MedianAbsoluteDeviation
### Calculations used to avoid overflows
First, a quick recap of how the MAD is calculated:
1. Sort values, and get the median
2. Calculate the difference between each value with the median (`abs(median - value)`)
3. Sort the differences, and get their median
Calculating a MAD may overflow when calculating the differences (Step 2), given the type is a signed number, as the difference is a positive value, with potentially the same value as `POSITIVE_MAX - NEGATIVE_MIN`.
To solve this, some types are up-casted as follow:
- Int: Stored as longs, simple approach
- Long: Stored as longs, but switched to unsigned long representation when calculating the differences
- Unsigned long: No effect; the resulting range is the same
- Doubles: Nothing. If the values overflow to +/-infinity, they're left that way, as we'll just use those outliers to sort
Closes https://github.com/elastic/elasticsearch/issues/111590
This changes the generated types tables in the docs to say `date`
instead of `datetime`. That's the name of the field in Elasticsearch so
it's a lot less confusing to call it that.
Closes#111650
- Added the `mv_percentile(values, percentile)` function
- Used as a surrogate in the `percentile(column, percentile)` aggregation
- Updated docs to specify that the surrogate _should_ be implemented if possible
The same way as mv_median does, this yields exact results (Ignoring double operations error).
For that, some decisions were made, specially in the long evaluator (Check the comments in context in `MvPercentile.java`)
Closes https://github.com/elastic/elasticsearch/issues/111591
Support Version, Keyword and Text in Max an Min aggregations.
The current implementation of both max and min does:
For non-grouping:
- Store a BytesRef
- When there's a max/min, copy it to the internal array. Grow it if needed
For grouping:
- Keep an array of BytesRef (null by default: there's no "initial/default value" here, as there's no "MAX" value for a string)
- Each BytesRef stores their own array, which will be grown as needed to copy the new max/min
Some notes:
- It's not shrinking the arrays, as to avoid having to copy, and potentially grow it again
- It's using raw arrays. But maybe it should use BigArrays to compute in the circuit breaker?
Part of https://github.com/elastic/elasticsearch/issues/110346
This laxes the check on numerical spans to allow them be specified as whole numbers. So far it was required that they be provided as a double.
This also expands the tests for date ranges to include string types.
Resolves#109340, resolves#104646, resolves#105375.
This profiles additional timing information for each individual driver.
To the results from `profile` it adds the start and stop time for each
driver. That was already in the task status. To the profile and task
status it also adds the number of times the driver slept and some more
detailed history about a few of those times.
Explanation time! The compute engine splits work into some number of
`Drivers` per node. Each `Driver` is a single threaded entity - it runs
on a thread for a while then does one of three things: 1. Finishes 2.
Goes async because one of it's `Operator`s has gone async 3. Yields the
thread pool because it has run for too long
This PR measures the second two. At this point only three operators can
go async: * ENRICH * Reading from an empty exchange * Writing to a full
exchange
We're quite interested the these sleeps at the moment because they think
they may be slowing things down. Here's what it looks like when a driver
goes async because it wants to read from an empty exchange:
```
... the rest of the profile ...
"sleeps" : {
"counts" : {
"exchange empty" : 2
},
"first" : [
{
"reason" : "exchange empty",
"sleep" : "2024-08-13T19:45:57.943Z",
"sleep_millis" : 1723578357943,
"wake" : "2024-08-13T19:45:58.159Z",
"wake_millis" : 1723578358159
},
{
"reason" : "exchange empty",
"sleep" : "2024-08-13T19:45:58.164Z",
"sleep_millis" : 1723578358164,
"wake" : "2024-08-13T19:45:58.165Z",
"wake_millis" : 1723578358165
}
],
"last": [same as above]
```
Every time the driver goes async we count it in the `counts` map -
grouped by the reason the driver slept. We also record the sleep and
wake times for the first and last ten times the driver sleeps. In this
case it only slept twice, so the `first` and `last` ten times is the
same array.
This should give us a good sense about why drivers sleep while using a
limited amount of memory per driver.
This removes date_nanos from the docs generated for all of our functions
because it's still under construction. I've done so as a sort of one-off
hack. My plan is to replace this in a follow up change with a
centralized registry of "under construction" data types. So we can make
new data types under a feature flag more easilly in the future. We're
going to be doing that a fair bit.
Resolves#109987
Add initial support for the date nanos data type. At this point, almost no functions are supported, including casting. This just covers loading and returning the values. Like millisecond dates, nanosecond dates are internally modeled as long values, so we don't need a new block type to support them.
This has very patchwork function support. Ideally, I don't think I would have added any function support yet, but the five MV functions you see here declare that they accept any non-spatial type, and will error tests if not wired up for new types. There are other functions, like Values, which also claim to support all non-spatial types, but don't currently enforce that in testing, so I didn't add them yet. Finally, there are functions like == which should work for all types, but are implemented as a specific list. I've left those for a follow up ticket as well.
This finishes the migration of `null` testing from a test method, namely
`testSimpleWithNulls`. It migrates it to `anyNullIsNull` and hand rolled
null cases.
Added IP support to TOP() aggregation.
Adapted a bit the stringtemplates organization for esql/compute to
(also?) work with specific datatypes. Right now it may be a bit messy,
but we need the specific support for cases like this.
- Added SUM() agg tests (Which autogenerates docs)
- Converted non-finite doubles to nulls in aggregator
The complete set of tests depends on
https://github.com/elastic/elasticsearch/issues/110437, as commented in
code. After completion, the test can be uncommented and everything
should work fine
- Support IP in MAX() and MIN()
- Used a custom IpArrayState for it, as it's quite different from the `X-ArrayState.java.st` generated ones
- Add IP test cases for aggregation tests
- Added Percentile aggregation tests and autogen docs
- Added a new "appendix" section to FunctionInfo. Existing Percentile docs had a final, long section with info, and we need this to leep it. We have an "detailedDescription" attribute already, but it's right after the description, and it would make it harder to read the important bits of the function (types, examples...). So I'm not reusing it.
- Added a custom implementation of BooleanBucketedSort to keep the top booleans
- Added boolean aggregator to TOP
- Added tests (Boolean aggregator tests, Top tests for boolean, and added boolean fields to CSV cases)
This adds an example to the docs an example of counting the TRUE results
of an expression. You do `COUNT(a > 0 OR NULL)`. That turns the `FALSE`
into `NULL`. Which you need to do because `COUNT(false)` is `1` -
because it's a value. But `COUNT(null)` is `0` - because it's the
absence of values.
We could like to make something more intuitive for this one day. But for
now, this is what works.
- Added support for Booleans on Max and Min
- Added some helper methods to BitArray (`set(index, value)` and `fill(from, to, value)`). This way, the container is more similar to other BigArrays, and it's easier to work with
Part of https://github.com/elastic/elasticsearch/issues/110346, as Max
and Min are dependencies of Top.
`MAX()` currently doesn't work with doubles smaller than
`Double.MIN_VALUE` (Note that `Double.MIN_VALUE` returns the smallest
non-zero positive, not the smallest double).
This PR adds tests for Max and Min, and fixes the bug (Detected by the
tests).
Also, as the tests now generate the docs, replaced the old docs with the
generated ones, and updated the Max&Min examples.
Some work around aggregation tests, with AVG as an example:
- Added tests and autogenerated docs for AVG
- As AVG uses "complex" surrogates (A combination of functions), we can't trivially execute them without a complete plan. As I'm not sure it's worth it for most aggregations, I'm skipping those cases for now, as to avoid blocking other aggs tests.
The bad side effect of skipping those tests is that most tests in AvgTests are actually ignored (74 of 100)
This adds a `NOTE` to each comparison saying that pushing the comparison
to the search index requires that the field have an `index` and
`doc_values`. This is unique compared to the rest of Elasticsearch which
only requires an `index` and it's caused by our insistence that
comparisons only return true for single-valued fields. We can in future
accelerate comparisons without `doc_values`, but we just haven't written
that code yet.
- Added a new `AbstractAggregationTestCase` base class for tests, that shares most of the code of function tests, adapted for aggregations. Including both testing and docs generation.
- Reused the `AbstractFunctionTestCase` class to also let us test evaluators if the aggregation is foldable
- Added a `TopListTests` example
- This includes the docs for Top_list _(Also added a missing include of Ip_prefix docs)_
- Adapted Kibana docs to use `type: "agg"` (@drewdaemon)
The current tests are very basic: Consume a page, generate an output,
all in Single aggregation mode (No intermediates, no grouping). More
complex testing will be added in future PRs
Initial PR of https://github.com/elastic/elasticsearch/issues/109917
* WIP Started refactoring in preparation for ST_DISTANCE
* Initial evaluators for ST_DISTANCE
* Update docs/changelog/108764.yaml
* Fix invalid changelog generated by CI
* Register function and get unit tests working
* Fixed failing meta function description tests, and refined descriptions
* Added initial CsvTests and calculate Geo differently to Cartesian
* Added more csv-spec tests and changed to arcDistance for accuracy
* Added generated docs files
* Link to generated docs
* Fix examples tag for linking from generated docs
* Skip wrapper function
And note that we might want to include instead some of the related intelligence from Circle2D::HaversineDistance class
* Added ST_DWITHIN and more tests for ST_DISTANCE and ST_DWITHIN
* Code style
* Added more tests, this time for sorting on distance
* Fixes after rebase on main
* The ST_DWITHIN cannot use BinarySpatialFunction because it is ternary
So we moved the common code to a separate SpatialTypeResolver, and made a simpler TernarySpatialFunction based on a simple TernaryScalarFunction. This had additional consequences, simplifying the points-only cases.
The main reason for this change was to support StDWithinTests which need to test a lot of things that involve varying all three input types, generating expected error strings, etc. The original hack of just adding to BinarySpatialFunction worked for the actual integration tests, but clearly did not satisfy all the use cases tested by the unit tests.
We also restricted ST_DWITHIN to take only a double as the third argument, because otherwise the number of evaluators would explode, since we need a separate evaluator for each Block type, and Integer and Double use different block types.
* Fixed function count after rebasing on main
* Update docs/changelog/108764.yaml
* Added generated docs for ST_DWITHIN
* Connect docs for ST_DWITHIN
* Add back issue link
* Remove support for ST_DWITHIN
* Update docs/changelog/108764.yaml
* Bring back link to issue in changelog
* Update x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/spatial/StDistance.java
Co-authored-by: Ignacio Vera <iverase@gmail.com>
* Revert reformatting of function descriptions
We should put this into a separate PR
* Github merged commit with incorrectly formatted whitespace
---------
Co-authored-by: Ignacio Vera <iverase@gmail.com>
This adds a test that generates
`docs/reference/esql/functions/kibana/inline_cast.json` which is a json
object who's keys are the names of valid inline casts and who's values
are the resulting data types.
I also moved one of the maps we use to make the inline casts to
`DataType`, which is a place where we want it.
When you divide two integers or two longs we round towards 0. Like
Postgres or Java or Rust or C. Other systems, like MySQL or SPL or
Javascript or Python always produce a floating point number. We should
warn folks about this. It's genuinely unexpected for some folks. OTOH,
converting into a floating point number would be unexpected for other
folks. Oh well, let's document what we've got.
Fixing MvAppendTests CB exceptions by generating smaller geometries: the
test generates a lot of documents and the CB is too small for multiple
big shapes.
Fixes https://github.com/elastic/elasticsearch/issues/109409
Add support for the string manipulation function REPEAT(string, number). This function concatenates the string argument with itself the specified number of times. If number is 0 an empty string is returned. If number is less than 0, null is returned and a warning is logged. If number is less than 0 and is a constant, the query will fail without executing.
Adding `MV_APPEND(value1, value2)` function, that appends two values
creating a single multi-value. If one or both the inputs are
multi-values, the result is the concatenation of all the values, eg.
```
MV_APPEND([a, b], [c, d]) -> [a, b, c, d]
```
~I think for this specific case it makes sense to consider `null` values
as empty arrays, so that~ ~MV_APPEND(value, null) -> value~ ~It is
pretty uncommon for ESQL (all the other functions, apart from
`COALESCE`, short-circuit to `null` when one of the values is null), so
let's discuss this behavior.~
[EDIT] considering the feedback from Andrei, I changed this logic and
made it consistent with the other functions: now if one of the
parameters is null, the function returns null
Added ESQL function to get the prefix of an IP. It works now with both
IPv4 and IPv6. For users planning to use it with mixed IPs, we may need
to add a function like "is_ipv4()" first.
**About the skipped test:** There's currently a "bug" in the
evaluators//functions that return null. Evaluators can't handle them.
We'll work on support for that in another PR. It affects other
functions, like `substring()`. In this function, however, it only
affects in "wrong" cases (Like an invalid prefix), so it has no impact.
Fixes https://github.com/elastic/elasticsearch/issues/99064