Resolves #111842
This adds a conversion function that yields DATE_NANOS. Mostly this is straightforward.
It is worth noting that when converting a millisecond date into a nanosecond date, the conversion function truncates it to 0 nanoseconds (i.e. first nanosecond of that millisecond). This is, of course, a bit of an assumption, but I don't have a better assumption we can make. I'd thought about adding a second, optional, parameter to control this behavior, but it's important that TO_DATE_NANOS extend AbstractConvertFunction, which itself extends UnaryScalarFunction, so that it will work correctly with union types. Also, it's unlikely the user will have any better guess than we do for filling in the nanoseconds.
Making that assumption does, however, create some weirdness. Consider two comparisons:
TO_DATETIME("2023-03-23T12:15:03.360103847") == TO_DATETIME("2023-03-23T12:15:03.360") will return true while TO_DATE_NANOS("2023-03-23T12:15:03.360103847") == TO_DATE_NANOS("2023-03-23T12:15:03.360") will return false. This is akin to casting between longs and doubles, where things may compare equal in one type that are not equal in the other. This seems fine, and I can't think of a better way to do it, but it's worth being aware of.
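The difference between the two comparisons can be reproduced outside ES|QL. The following is a minimal sketch using `java.time.Instant`, not the code from this PR; the class and method names are hypothetical, and the timestamps carry a trailing `Z` only because `Instant.parse` requires it.

```java
import java.time.Instant;

final class DateNanosSketch {
    // Millisecond resolution: sub-millisecond digits are dropped.
    static long toEpochMillis(String iso) {
        return Instant.parse(iso).toEpochMilli();
    }

    // Nanosecond resolution: nanoseconds since the epoch.
    static long toEpochNanos(String iso) {
        Instant instant = Instant.parse(iso);
        long wholeSeconds = Math.multiplyExact(instant.getEpochSecond(), 1_000_000_000L);
        return Math.addExact(wholeSeconds, instant.getNano());
    }

    // Widening a millisecond date to nanoseconds fills the missing digits with zeros,
    // i.e. it picks the first nanosecond of that millisecond.
    static long millisToNanos(long millis) {
        return Math.multiplyExact(millis, 1_000_000L);
    }

    public static void main(String[] args) {
        String precise = "2023-03-23T12:15:03.360103847Z";
        String coarse = "2023-03-23T12:15:03.360Z";
        // Equal at millisecond resolution, different at nanosecond resolution.
        System.out.println(toEpochMillis(precise) == toEpochMillis(coarse));
        System.out.println(toEpochNanos(precise) == toEpochNanos(coarse));
    }
}
```

Once a value has passed through millisecond resolution, widening it back with `millisToNanos` can only recover the `.360000000` form, which is why the two types disagree on equality.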
---------
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
- Added mv_median_absolute_deviation function
- Added possibility of having a fixed param in Multivalue "ascending" functions
- Add surrogate to MedianAbsoluteDeviation
### Calculations used to avoid overflows
First, a quick recap of how the MAD is calculated:
1. Sort values, and get the median
2. Calculate the difference between each value with the median (`abs(median - value)`)
3. Sort the differences, and get their median
Calculating a MAD may overflow when computing the differences (Step 2) for signed types: the difference is a non-negative value that can be as large as `POSITIVE_MAX - NEGATIVE_MIN`, which does not fit in the original signed type.
To solve this, some types are upcast as follows:
- Int: Stored as longs, simple approach
- Long: Stored as longs, but switched to unsigned long representation when calculating the differences
- Unsigned long: No effect; the resulting range is the same
- Doubles: Nothing. If the values overflow to +/-infinity, they're left that way, as we'll just use those outliers to sort
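The long case can be illustrated with plain Java. This is a sketch under stated assumptions, not the evaluator from this PR: the class and method names are hypothetical, only odd-length inputs are handled to keep the median trivial, and the unsigned value is carried in a regular `long` and interpreted with the JDK's `Long.compareUnsigned` / `Long.toUnsignedString` helpers.

```java
import java.util.Arrays;

final class MadOverflowSketch {
    // |a - b| for signed longs, returned as an unsigned 64-bit value.
    // The subtraction may wrap around, but the wrapped bit pattern is
    // exactly the correct result when read as an unsigned long.
    static long unsignedAbsDiff(long a, long b) {
        return a >= b ? a - b : b - a;
    }

    // Median of an odd-length array of signed longs.
    static long median(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    // MAD of an odd-length array; the result must be read as an unsigned long.
    static long mad(long[] values) {
        long median = median(values);
        Long[] diffs = new Long[values.length];
        for (int i = 0; i < values.length; i++) {
            diffs[i] = unsignedAbsDiff(values[i], median);
        }
        // Differences are unsigned, so they must be sorted with an unsigned comparison.
        Arrays.sort(diffs, Long::compareUnsigned);
        return diffs[diffs.length / 2];
    }

    public static void main(String[] args) {
        long widest = unsignedAbsDiff(Long.MAX_VALUE, Long.MIN_VALUE);
        System.out.println(Long.toUnsignedString(widest));
        System.out.println(Long.toUnsignedString(mad(new long[] { Long.MIN_VALUE, 0, Long.MAX_VALUE })));
    }
}
```

Sorting the differences with a signed comparison would place the largest differences first, because their sign bit is set, so the unsigned comparator is what keeps Step 3 correct.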
Closes https://github.com/elastic/elasticsearch/issues/111590
This changes the generated types tables in the docs to say `date`
instead of `datetime`. That's the name of the field in Elasticsearch so
it's a lot less confusing to call it that.
Closes #111650
- Added the `mv_percentile(values, percentile)` function
- Used as a surrogate in the `percentile(column, percentile)` aggregation
- Updated docs to specify that the surrogate _should_ be implemented if possible
In the same way as mv_median, this yields exact results (ignoring floating-point error in double operations).
To achieve that, some decisions were made, especially in the long evaluator (see the comments in context in `MvPercentile.java`).
Closes https://github.com/elastic/elasticsearch/issues/111591
This profiles additional timing information for each individual driver.
To the results from `profile` it adds the start and stop time for each
driver. That was already in the task status. To the profile and task
status it also adds the number of times the driver slept and some more
detailed history about a few of those times.
Explanation time! The compute engine splits work into some number of
`Drivers` per node. Each `Driver` is a single threaded entity - it runs
on a thread for a while then does one of three things:
1. Finishes
2. Goes async because one of its `Operator`s has gone async
3. Yields the thread pool because it has run for too long
This PR measures the second two. At this point only three operators can
go async:
* ENRICH
* Reading from an empty exchange
* Writing to a full exchange
We're quite interested in these sleeps at the moment because we think
they may be slowing things down. Here's what it looks like when a driver
goes async because it wants to read from an empty exchange:
```
... the rest of the profile ...
"sleeps" : {
"counts" : {
"exchange empty" : 2
},
"first" : [
{
"reason" : "exchange empty",
"sleep" : "2024-08-13T19:45:57.943Z",
"sleep_millis" : 1723578357943,
"wake" : "2024-08-13T19:45:58.159Z",
"wake_millis" : 1723578358159
},
{
"reason" : "exchange empty",
"sleep" : "2024-08-13T19:45:58.164Z",
"sleep_millis" : 1723578358164,
"wake" : "2024-08-13T19:45:58.165Z",
"wake_millis" : 1723578358165
}
],
"last": [same as above]
```
Every time the driver goes async we count it in the `counts` map -
grouped by the reason the driver slept. We also record the sleep and
wake times for the first and last ten times the driver sleeps. In this
case it only slept twice, so the `first` and `last` arrays are the
same.
This should give us a good sense about why drivers sleep while using a
limited amount of memory per driver.
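The bookkeeping described above can be sketched roughly as follows. This is an illustration only, not the class added by this PR: `DriverSleepsSketch`, `Sleep`, and `record` are hypothetical names, and the limit of ten comes from the description above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DriverSleepsSketch {
    // One sleep/wake pair, mirroring the fields in the profile output.
    static final class Sleep {
        final String reason;
        final long sleepMillis;
        final long wakeMillis;

        Sleep(String reason, long sleepMillis, long wakeMillis) {
            this.reason = reason;
            this.sleepMillis = sleepMillis;
            this.wakeMillis = wakeMillis;
        }
    }

    static final int RECORDS = 10;

    private final Map<String, Integer> counts = new TreeMap<>();
    private final List<Sleep> first = new ArrayList<>();
    private final Deque<Sleep> last = new ArrayDeque<>();

    void record(String reason, long sleepMillis, long wakeMillis) {
        // Every sleep is counted, grouped by reason.
        counts.merge(reason, 1, Integer::sum);
        Sleep sleep = new Sleep(reason, sleepMillis, wakeMillis);
        // Only the first ten sleeps are kept in `first`.
        if (first.size() < RECORDS) {
            first.add(sleep);
        }
        // `last` is a sliding window over the most recent ten sleeps.
        if (last.size() == RECORDS) {
            last.removeFirst();
        }
        last.addLast(sleep);
    }

    Map<String, Integer> counts() {
        return counts;
    }

    List<Sleep> first() {
        return List.copyOf(first);
    }

    List<Sleep> last() {
        return List.copyOf(last);
    }
}
```

Memory stays bounded at twenty detailed records plus one counter per distinct reason, regardless of how many times the driver sleeps.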
- Added SUM() agg tests (Which autogenerates docs)
- Converted non-finite doubles to nulls in aggregator
The complete set of tests depends on
https://github.com/elastic/elasticsearch/issues/110437, as commented in
code. After completion, the test can be uncommented and everything
should work fine
- Added Percentile aggregation tests and autogen docs
- Added a new "appendix" section to FunctionInfo. Existing Percentile docs had a final, long section with info, and we need this to keep it. We have a "detailedDescription" attribute already, but it's right after the description, and it would make it harder to read the important bits of the function (types, examples...). So I'm not reusing it.
- Added support for Booleans on Max and Min
- Added some helper methods to BitArray (`set(index, value)` and `fill(from, to, value)`). This way, the container is more similar to other BigArrays, and it's easier to work with
Part of https://github.com/elastic/elasticsearch/issues/110346, as Max
and Min are dependencies of Top.
`MAX()` currently doesn't work with doubles smaller than
`Double.MIN_VALUE` (Note that `Double.MIN_VALUE` returns the smallest
non-zero positive, not the smallest double).
This PR adds tests for Max and Min, and fixes the bug (Detected by the
tests).
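To make the `Double.MIN_VALUE` pitfall concrete, here is a small illustration. It assumes a max accumulator seeded with `Double.MIN_VALUE`, which is one common way to hit this class of bug; the actual cause and fix in this PR may differ, and the class and method names are hypothetical.

```java
final class MaxSeedSketch {
    // Buggy seed: Double.MIN_VALUE is the smallest positive double (about 4.9e-324),
    // so any input that is zero or negative can never replace it.
    static double maxWithWrongSeed(double[] values) {
        double max = Double.MIN_VALUE;
        for (double v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Safe seed: every finite double is greater than negative infinity.
    static double maxWithSafeSeed(double[] values) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    public static void main(String[] args) {
        double[] negatives = { -1.0, -2.0 };
        System.out.println(maxWithWrongSeed(negatives)); // the seed leaks out as the result
        System.out.println(maxWithSafeSeed(negatives));
    }
}
```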
Also, as the tests now generate the docs, replaced the old docs with the
generated ones, and updated the Max&Min examples.
Some work around aggregation tests, with AVG as an example:
- Added tests and autogenerated docs for AVG
- As AVG uses "complex" surrogates (A combination of functions), we can't trivially execute them without a complete plan. As I'm not sure it's worth it for most aggregations, I'm skipping those cases for now, so as to avoid blocking other aggs tests.
The bad side effect of skipping those tests is that most tests in AvgTests are actually ignored (74 of 100)
- Added a new `AbstractAggregationTestCase` base class for tests, that shares most of the code of function tests, adapted for aggregations. Including both testing and docs generation.
- Reused the `AbstractFunctionTestCase` class to also let us test evaluators if the aggregation is foldable
- Added a `TopListTests` example
- This includes the docs for Top_list _(Also added a missing include of Ip_prefix docs)_
- Adapted Kibana docs to use `type: "agg"` (@drewdaemon)
The current tests are very basic: Consume a page, generate an output,
all in Single aggregation mode (No intermediates, no grouping). More
complex testing will be added in future PRs
Initial PR of https://github.com/elastic/elasticsearch/issues/109917
* WIP Started refactoring in preparation for ST_DISTANCE
* Initial evaluators for ST_DISTANCE
* Update docs/changelog/108764.yaml
* Fix invalid changelog generated by CI
* Register function and get unit tests working
* Fixed failing meta function description tests, and refined descriptions
* Added initial CsvTests and calculate Geo differently to Cartesian
* Added more csv-spec tests and changed to arcDistance for accuracy
* Added generated docs files
* Link to generated docs
* Fix examples tag for linking from generated docs
* Skip wrapper function
And note that we might want to include instead some of the related intelligence from Circle2D::HaversineDistance class
* Added ST_DWITHIN and more tests for ST_DISTANCE and ST_DWITHIN
* Code style
* Added more tests, this time for sorting on distance
* Fixes after rebase on main
* The ST_DWITHIN cannot use BinarySpatialFunction because it is ternary
So we moved the common code to a separate SpatialTypeResolver, and made a simpler TernarySpatialFunction based on a simple TernaryScalarFunction. This had additional consequences, simplifying the points-only cases.
The main reason for this change was to support StDWithinTests which need to test a lot of things that involve varying all three input types, generating expected error strings, etc. The original hack of just adding to BinarySpatialFunction worked for the actual integration tests, but clearly did not satisfy all the use cases tested by the unit tests.
We also restricted ST_DWITHIN to take only a double as the third argument, because otherwise the number of evaluators would explode, since we need a separate evaluator for each Block type, and Integer and Double use different block types.
* Fixed function count after rebasing on main
* Update docs/changelog/108764.yaml
* Added generated docs for ST_DWITHIN
* Connect docs for ST_DWITHIN
* Add back issue link
* Remove support for ST_DWITHIN
* Update docs/changelog/108764.yaml
* Bring back link to issue in changelog
* Update x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/spatial/StDistance.java
Co-authored-by: Ignacio Vera <iverase@gmail.com>
* Revert reformatting of function descriptions
We should put this into a separate PR
* Github merged commit with incorrectly formatted whitespace
---------
Co-authored-by: Ignacio Vera <iverase@gmail.com>
Add support for the string manipulation function REPEAT(string, number). This function concatenates the string argument with itself the specified number of times. If number is 0 an empty string is returned. If number is less than 0, null is returned and a warning is logged. If number is less than 0 and is a constant, the query will fail without executing.
Adding `MV_APPEND(value1, value2)` function, that appends two values
creating a single multi-value. If one or both the inputs are
multi-values, the result is the concatenation of all the values, eg.
```
MV_APPEND([a, b], [c, d]) -> [a, b, c, d]
```
~I think for this specific case it makes sense to consider `null` values
as empty arrays, so that~ ~MV_APPEND(value, null) -> value~ ~It is
pretty uncommon for ESQL (all the other functions, apart from
`COALESCE`, short-circuit to `null` when one of the values is null), so
let's discuss this behavior.~
[EDIT] considering the feedback from Andrei, I changed this logic and
made it consistent with the other functions: now if one of the
parameters is null, the function returns null
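The final semantics can be summarised with a small sketch. It models multi-values as Java lists and is not the ES|QL evaluator; `mvAppend` here is a hypothetical stand-in.

```java
import java.util.ArrayList;
import java.util.List;

final class MvAppendSketch {
    // Null in, null out: consistent with most other ES|QL functions.
    static <T> List<T> mvAppend(List<T> left, List<T> right) {
        if (left == null || right == null) {
            return null;
        }
        // Otherwise the result is the concatenation of all values, in order.
        List<T> result = new ArrayList<>(left);
        result.addAll(right);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(mvAppend(List.of("a", "b"), List.of("c", "d")));
        System.out.println(mvAppend(List.of("a"), null));
    }
}
```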
Added ESQL function to get the prefix of an IP. It works now with both
IPv4 and IPv6. For users planning to use it with mixed IPs, we may need
to add a function like "is_ipv4()" first.
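For readers unfamiliar with the operation, taking the prefix of an IP amounts to keeping the leading bits of the address and zeroing the rest. The sketch below shows that idea for both address families using only the JDK; it is not the ES|QL implementation, and the class name, method names, and single prefix-length parameter are assumptions made for illustration.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

final class IpPrefixSketch {
    // Keeps the first `prefixLength` bits of the address and zeroes the rest.
    // Works on 4-byte (IPv4) and 16-byte (IPv6) addresses alike.
    static byte[] prefix(byte[] address, int prefixLength) {
        if (prefixLength < 0 || prefixLength > address.length * 8) {
            throw new IllegalArgumentException("invalid prefix length: " + prefixLength);
        }
        byte[] result = new byte[address.length];
        int fullBytes = prefixLength / 8;
        System.arraycopy(address, 0, result, 0, fullBytes);
        int remainingBits = prefixLength % 8;
        if (remainingBits > 0) {
            // Mask for the partially covered byte, e.g. 4 bits -> 0xF0.
            int mask = 0xFF << (8 - remainingBits);
            result[fullBytes] = (byte) (address[fullBytes] & mask);
        }
        return result;
    }

    // Convenience wrapper for IP literals; literals are parsed without a DNS lookup.
    static String prefix(String ip, int prefixLength) throws UnknownHostException {
        byte[] raw = InetAddress.getByName(ip).getAddress();
        return InetAddress.getByAddress(prefix(raw, prefixLength)).getHostAddress();
    }
}
```

Because the valid range of prefix lengths differs between the families (0 to 32 versus 0 to 128), a caller with mixed data would need to know which family each address belongs to, which is the gap the `is_ipv4()` remark above points at.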
**About the skipped test:** There's currently a "bug" in the
evaluators/functions that return null. Evaluators can't handle them.
We'll work on support for that in another PR. It affects other
functions, like `substring()`. In this function, however, it only
affects "wrong" cases (like an invalid prefix), so it has no impact.
Fixes https://github.com/elastic/elasticsearch/issues/99064
- Added the cube root function to ESQL (`CBRT(x)`). Nearly identical to SQRT, but without the negative numbers exception
- Added docs generation support for Windows line endings (CRLF): within the examples, it was writing the "\r" without the "\n" (which was being converted to "\\n"), along with some other inconsistencies
- Some updates to `package-info.java` documentation on how to create functions
- Fixes https://github.com/elastic/elasticsearch/issues/108675
Functions issue: https://github.com/elastic/elasticsearch/issues/98545
This adds some clarifications on the time unit strings the function
takes as arguments, noting the differences between these and the time
span literals, as well as the abbreviations' source.
This extends `BUCKET` function to accept a two-parameters-only
invocation: the first parameter remains as is, while the second is a
span. It can be a numeric (floating point) span, if the first argument
is numeric, or a date period or time duration, if the first argument is
a date.
Also, the function can now be invoked with the alias BIN.
Additionally, the function has been turned into a grouping-only function
and thus can only be used within a `STATS` command.
This improves the tests and docs for a few functions, specifically `E`,
`FLOOR`, `PI`, `POW`, and `ROUND`. The examples and tested signatures
will get copied into the docs and kibana signatures.
Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
This renames the function AUTO_BUCKET to just BUCKET.
It also removes the experimental tagging of the function in the docs, making it generally available.
* WIP Started developing ST_DISJOINT
Initially based on ST_INTERSECTS
* Fix functions list and add spatial point integration tests
* Update docs/changelog/107007.yaml
* More tests for shapes and cartesian-multigeoms
* Some more tests to highlight issues with DISJOINT on cartesian point indices
* Disable Lucene push-down for DISJOINT on cartesian point indices
* Added docs for ST_DISJOINT
* Support DISJOINT in the lucene-pushdown code for cartesian point indexes
* Re-enable push-to-source for DISJOINT on cartesian_point indices
* Fix docs example
* Try fix internal docs links which are not being rendered
* Fixed disjoint on empty geometry
* Added tests on empty linestring, and changed lucene push-down to exception
In lucene code only LineString can be empty, but in Elasticsearch even that is not allowed, resulting in parsing errors. So we cannot get to this code in the lucene push-down and now throw an error instead. The tests now assert on the warnings.
Note that for any predicate, DISJOINT and INTERSECTS alike, the predicate fails: the parsing error results in null, the function returns null, the predicate interprets this as false, and no documents match. This null-in-null-out rule means that DISJOINT and INTERSECTS give the same answer on invalid geometries.
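The null-in-null-out behaviour can be modelled with nullable booleans. In this sketch, 1-D intervals stand in for geometries and a `null` interval stands in for a geometry that failed to parse; all names are hypothetical and none of this is the ES|QL code.

```java
final class NullPredicateSketch {
    // Intervals are {low, high}; null models a geometry that failed to parse.
    static Boolean intersects(int[] a, int[] b) {
        if (a == null || b == null) {
            return null; // null in, null out
        }
        return a[0] <= b[1] && b[0] <= a[1];
    }

    // DISJOINT is the negation of INTERSECTS, except that null stays null.
    static Boolean disjoint(int[] a, int[] b) {
        Boolean result = intersects(a, b);
        return result == null ? null : !result;
    }

    // A filter keeps a row only when the predicate is exactly true,
    // so a null result behaves like false.
    static boolean matches(Boolean predicateResult) {
        return Boolean.TRUE.equals(predicateResult);
    }

    public static void main(String[] args) {
        int[] valid = { 0, 5 };
        System.out.println(matches(intersects(valid, null)));
        System.out.println(matches(disjoint(valid, null)));
    }
}
```

With a valid pair the two predicates are complementary, but with an invalid geometry neither matches, which is why they give the same answer in that case.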
* Add ES|QL signum function
* Update docs/changelog/106866.yaml
* Skip csv tests for versions older than 8.14
* Reference layout docs file and fix instructions for adding functions
* Break csv specs by param type
* More tests
This merges all of the hand written docs for `LOG` and `LOG10` into the
annotations which updates the `META FUNCTIONS` - now it'll always be the
same as the docs. This also deletes the hand maintained docs and lets
the documentation generation process rebuild them.
* WIP Started adding ST_CONTAINS
* Add generated evaluators
* Reduced warnings and use correct evaluators
* Refactored tests to remove duplicate code, and fixed Contains/multi-components
* Gradle build disallows using getDeclaredField
* Fixed cases where rectangles cross the dateline
* Fixed meta function tests
* Added ST_WITHIN to support inverting ST_CONTAINS
If the ST_CONTAINS is called with the constant on the left, we either have to create a lot more Evaluators to cover that case, or we have to invert it to ST_WITHIN. This inversion was a much easier option.
* Simplify inversion logic
* Add comment on choice of surrogate approach
* Add unit tests and missing fold() function
* Simple code cleanup
* Add integration tests for literals
* Add more integration tests based on actual data
* Generated documentation files
* Add documentation
* Fixed failing function count test
* Add tests that push-to-source works for ST_CONTAINS and ST_WITHIN
* Test more combinations of WITHIN/CONTAINS and literal on right and left
This also verifies that the re-writing of CONTAINS to WITHIN or vice versa occurs when the literal is on the left.
* test that physical planning also handles doc-values from STATS
* Added more tests for WITHIN/CONTAINS together with CENTROID
This should test the doc-values for points.
* Add cartesian_point tests
* Add cartesian_shape tests
* Disable Lucene-push-down for CARTESIAN data
This is a limitation in Lucene, which we could address as a performance optimization in a future PR, but since it probably requires Lucene changes, it cannot be done in this work.
* Fix doc links
* Added test data and tests for cartesian multi-polygons
Testing INTERSECTS, CONTAINS and WITHIN with multi-polygon fields
* Use required features for spatial points, shapes and centroid
* 8.13.0 is not yet historical version
This needs to be reverted as soon as 8.13.0 is released
* Added st_intersects and st_contains_within 'features'
* Code review updates
* Re-enable lucene push-down
* Added more required_features
* Fix point contains non-point
* Fix point contains point
* Re-enable lucene push-down in tests too
Forgot to change the physical planner unit tests after re-enabling lucene push-down
* Generate automatic docs
* Use generated examples docs
* Generated examples use '-result' prefix (singular)
* Mark spatial functions as preview/experimental
This updates the in-code docs on the trig functions to line up with the
docs, removes the docs, and uses the now mostly identical generated
docs. This means we only need to document these functions in one place -
right next to the code.