Commit graph

229 commits

Author SHA1 Message Date
Craig Taverner
d9e52afce3
Make spatial search functions not preview (#117489) (#117497) 2024-11-26 03:28:16 +11:00
Nik Everett
85e4643895
ESQL: Add docs for MV_PERCENTILE (#117377) (#117380)
We built this a while back. Let's document it.
2024-11-23 07:04:17 +11:00
Nik Everett
c1b01bbd54
ESQL: Make WEIGHTED_AVG not preview (#117356) (#117361)
It's not PREVIEW.
2024-11-23 03:51:52 +11:00
Luigi Dell'Aquila
32eeb6b279
ES|QL: fix validation of SORT by aggregate functions (#117316) (#117325) 2024-11-22 23:21:00 +11:00
Carlos Delgado
0304b92ccf
ESQL - match operator included in non-snapshot builds (#116819) (#117227)
(cherry picked from commit ea4b41fca8)
2024-11-21 19:34:50 +11:00
Mark Tozzi
cafa440771
[8.x] Esql Enable Date Nanos (#117080) (#117161)
* Esql Enable Date Nanos (#117080)

This enables date nanos support as tech preview. Basic operations, like reading values, binary comparisons, and functions that don't care about type should work, but some functions are not yet supported. Most notably, Bucket is not yet supported, although Date_Trunc is and can be used for grouping. See the docs for the full list of limitations.

relates to #109352

* Skip CATEGORIZE tests outside snapshot

---------

Co-authored-by: Nik Everett <nik9000@gmail.com>
2024-11-21 08:16:59 +11:00
Fang Xing
df1130f4b2
[ES|QL][DOCS] Add docs for date_period and time_duration (#116368) (#117021)
* add docs for date_period and time_duration
2024-11-20 00:14:46 +11:00
Bogdan Pintea
33dfe554e7
ESQL: Docs: COUNT: add an explanation to the use of the 3VL (#116684) (#117006)
Add an explanation of why `... OR NULL` is needed with `COUNT(...)`.
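
As a rough illustration of the three-valued logic involved (a sketch only, not the ES|QL implementation): `COUNT` counts non-null values, so a plain boolean condition is counted whether it is `true` or `false`, while `condition OR NULL` turns every `false` into `null` and leaves only the `true` rows to be counted. The class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of three-valued OR and a COUNT-like aggregation.
final class CountOrNullSketch {
    /** Three-valued OR, where a Java null stands for SQL-style NULL (unknown). */
    static Boolean or3(Boolean a, Boolean b) {
        if (Boolean.TRUE.equals(a) || Boolean.TRUE.equals(b)) {
            return Boolean.TRUE;   // true OR anything is true
        }
        if (a == null || b == null) {
            return null;           // false OR null, or null OR null, is null
        }
        return Boolean.FALSE;      // false OR false is false
    }

    /** Counts the non-null entries, mimicking how COUNT treats its input. */
    static long countNonNull(List<Boolean> values) {
        long count = 0;
        for (Boolean value : values) {
            if (value != null) {
                count++;
            }
        }
        return count;
    }

    /** Applies "condition OR NULL" to every row, then counts the non-null results. */
    static long countOrNull(List<Boolean> condition) {
        List<Boolean> mapped = new ArrayList<>();
        for (Boolean c : condition) {
            mapped.add(or3(c, null));
        }
        return countNonNull(mapped);
    }

    public static void main(String[] args) {
        List<Boolean> condition = Arrays.asList(true, false, true, false, false);
        System.out.println(countNonNull(condition)); // counts every row
        System.out.println(countOrNull(condition));  // counts only the true rows
    }
}
```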

Fixes: #99954
2024-11-19 21:43:31 +11:00
Gal Lalouche
b3edb3a6a4
[ESQL] Update docs format (missing space before '=') (#116808) (#116816) 2024-11-15 02:04:30 +11:00
Gal Lalouche
dae79b5c22
[8.x] [ESQL] Add support BYTE_LENGTH scalar function (#116591) (#116731) 2024-11-14 14:40:42 +01:00
Tim Grein
b7951c5ce7
Add ES|QL bit_length function (#115792) (#116378) 2024-11-07 20:04:20 +11:00
Mark Tozzi
1224db91d5
[ESQL] clean up date trunc tests (#116111) (#116179)
While working on #110008 I discovered that the Date Trunc tests were only running in folding mode, because the interval types are marked as not representable. The correct way to test this is to set the forceLiteral flag for those fields, which will (as the name suggests) force them to be literals even in non-folding tests.

Doing that turned up errors in the evaluatorToString tests, which I fixed. There are two big changes here. First, the second parameter to the evaluator is a Rounding instance, not the actual interval. Since Rounding includes some information about the specific rounding in the toString results, I am just using a starts with matcher to validate the majority of the string, rather than trying to reconstruct the expected rounding string. Second, passing in a literal null for the interval parameter folds the whole expression to null, and thus a completely different toString. I added a clause in AnyNullIsNull to account for this.

While I was in there, I moved some specific test cases to a different file. I know moving code is something we're trying to minimize right now, but this seemed worth it. The tests in question do not depend on the parameters of the test case, but all methods in the class get run for every set of parameters. This was causing these tests to be run many times with the same values, which bloats our test run time and test count. Moving them to a distinct class means they'll only be executed once per test run. I feel like this benefit outweighs the cost of git history complexity.
2024-11-05 02:32:08 +11:00
Chris Hegarty
78fc557d3f
ES|QL Add full-text search to the functions docs page (#116024)
Now that the match and qstr functions are in Tech Preview, we should add them to the top-level functions doc page.

Co-authored-by: Craig Taverner <craig@amanzi.com>
2024-11-01 12:08:48 +00:00
Craig Taverner
3b3e7f7484
Don't return TEXT type for functions that take TEXT (#114334) (#115625)
Always return `KEYWORD` for functions that previously returned `TEXT`, because any change to the value, no matter how small, is enough to render meaningless the original analyzer associated with the `TEXT` field value. In principle, if the attribute is no longer the original `FieldAttribute`, it can no longer claim to have the type `TEXT`.

This has been done for all functions: conversion functions, aggregating functions, multi-value functions. There were several that already produced `KEYWORD` for `TEXT` input (eg. ToString, FromBase64 and ToBase64, MvZip, ToLower, ToUpper, DateFormat, Concat, Left, Repeat, Replace, Right, Split, Substring), but many others incorrectly claimed to produce `TEXT`. This PR makes that strict, and changes the functions' unit tests so that no test may expect a function's output to be `TEXT`.

One side effect of this change is that methods that take multiple parameters that require all of them to have the same type, will now treat TEXT and KEYWORD the same. This was already the case for functions like `Concat`, but is now also the case for `Greatest`, `Least`, `Case`, `Coalesce` and `MvAppend`.

An associated change is that the type casting operator `::text` has been entirely removed. It used to map onto the `ToString` function which returned type KEYWORD, and so `::text` really produced a `KEYWORD`, which is a lie, or at least a `bug`, which is now fixed. Should we ever wish to actually produce real `TEXT`, we might love the fact that this operator has been freed up for future use (although it seems likely that function will require parameters to specify the analyzer, so might never be an operator again).

### Backwards compatibility issues:

This is a change that will fail BWC tests, since we have many tests that assert on TEXT output to functions. For this reason we needed to block two scenarios:

* We used the capability `functions_never_emit_text` to prevent 7 csv-spec tests and 2 yaml tests from being run against older versions that still emit text.
* We used `skipTest` to also block those two yaml tests from being run against the latest build, but using older yaml files downloaded (as far back as 8.14).

In all cases the change observed in these tests was simply the results columns no longer having `text` type, and instead being `keyword`.

---------

Co-authored-by: Luigi Dell'Aquila <luigi.dellaquila@gmail.com>
2024-10-25 20:12:02 +11:00
Luigi Dell'Aquila
5290630bd0
ES|QL: improve docs about escaping for GROK, DISSECT, LIKE, RLIKE (#115320) (#115493) 2024-10-24 19:14:57 +11:00
Nik Everett
f38f2301bc
ESQL: Skip unsupported grapheme cluster test (#115258)
This skips the test for reversing grapheme clusters if the node doesn't
support reversing grapheme clusters. Nodes that are using a jdk before
20 won't support reversing grapheme clusters because they don't have
https://bugs.openjdk.org/browse/JDK-8292387

This reworks `EsqlCapabilities` so we can easily register it only if
we're on jdk 20:
```
FN_REVERSE_GRAPHEME_CLUSTERS(Runtime.version().feature() < 20),
```

Closes #114537
Closes #114535
Closes #114536
Closes #114558
Closes #114559
Closes #114560
2024-10-21 20:06:56 +02:00
Carlos Delgado
581894a035
Remove snapshot build restriction for match and qstr functions (#114482) (#114793) 2024-10-15 10:22:43 +02:00
Carlos Delgado
14c1c3c1cc
[8.x] Add ESQL match function (#113374) (#114695) 2024-10-14 17:14:43 +02:00
Larisa Motova
42aa343daf
[ES|QL] Add hypot function (#114382) (#114658)
Adds a hypotenuse function
2024-10-12 07:45:43 +11:00
Nik Everett
d6dfa71576
ESQL: Document MV_SLICE limitations (#114162) (#114348)
`MV_SLICE` is useful, but loading values from lucene frequently sorts
them so `MV_SLICE` is not as useful as you think it is. It's mostly for
after, say, a `SPLIT`. This documents that and adds a link to the
section on multivalues.

It also moves similar docs to a separate paragraph in the docs for
easier reading.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2024-10-10 09:01:33 +11:00
Drew Tate
2046b497fc
[ES|QL] add reverse function (#113297) (#114163)
Adds a REVERSE string function
2024-10-05 05:24:05 +10:00
Mark Tozzi
0c33257fee
[ESQL] Support datetime data type in Least and Greatest functions (#113961) (#114130)
While working on Date Nanos, I noticed that Least and Greatest didn't have support for datetime. This PR corrects that and adds tests for it.

It seems to me that resolveType() is doing the wrong thing for these functions, as it accepts types that then do not have evaluator mappings, but refactoring that seems out of scope right now.

---------

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2024-10-05 00:11:17 +10:00
Luigi Dell'Aquila
458dd4afe3
ES|QL: provide snapshot_only info for functions (Kibana) (#113544) (#113927)
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2024-10-04 22:13:19 +10:00
Mark Tozzi
d2a89968ec
[ESQL] Add TO_DATE_NANOS conversion function (#112150) (#113641)
Resolves #111842

This adds a conversion function that yields DATE_NANOS. Mostly this is straightforward.

It is worth noting that when converting a millisecond date into a nanosecond date, the conversion function truncates it to 0 nanoseconds (i.e. first nanosecond of that millisecond). This is, of course, a bit of an assumption, but I don't have a better assumption we can make. I'd thought about adding a second, optional, parameter to control this behavior, but it's important that TO_DATE_NANOS extend AbstractConvertFunction, which itself extends UnaryScalarFunction, so that it will work correctly with union types. Also, it's unlikely the user will have any better guess than we do for filling in the nanoseconds.

Making that assumption does, however, create some weirdness. Consider two comparisons:

TO_DATETIME("2023-03-23T12:15:03.360103847") == TO_DATETIME("2023-03-23T12:15:03.360") will return true while TO_DATE_NANOS("2023-03-23T12:15:03.360103847") == TO_DATE_NANOS("2023-03-23T12:15:03.360") will return false. This is akin to casting between longs and doubles, where things may compare equal in one type that are not equal in the other. This seems fine, and I can't think of a better way to do it, but it's worth being aware of.
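
The two comparisons can be mimicked with `java.time` (a sketch under assumptions, not the conversion code itself): a millisecond date drops everything below the millisecond, so both strings collapse to the same value, while a nanosecond date keeps the full fraction. The class and method names are hypothetical, and a trailing `Z` is added to the strings only because `Instant.parse` requires a zone.

```java
import java.time.Instant;

// Hypothetical sketch: millisecond vs nanosecond resolution of the same timestamps.
final class DateNanosComparisonSketch {
    /** Milliseconds since the epoch; sub-millisecond digits are truncated. */
    static long toEpochMillis(String iso) {
        return Instant.parse(iso).toEpochMilli();
    }

    /** Nanoseconds since the epoch; a missing fraction is filled with zeros. */
    static long toEpochNanos(String iso) {
        Instant instant = Instant.parse(iso);
        long secondsAsNanos = Math.multiplyExact(instant.getEpochSecond(), 1_000_000_000L);
        return Math.addExact(secondsAsNanos, instant.getNano());
    }

    public static void main(String[] args) {
        String precise = "2023-03-23T12:15:03.360103847Z";
        String coarse = "2023-03-23T12:15:03.360Z";
        // Equal at millisecond resolution, different at nanosecond resolution.
        System.out.println(toEpochMillis(precise) == toEpochMillis(coarse));
        System.out.println(toEpochNanos(precise) == toEpochNanos(coarse));
    }
}
```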

---------

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2024-09-27 22:59:55 +10:00
Nik Everett
0e6bbb0bea
ESQL: TOP support for strings (#113183) (#113408)
Adds support to the `TOP` aggregation for `keyword` and `text` field
types.

Closes #109849
2024-09-26 05:18:20 +10:00
Carlos Delgado
c3a2b19993
[8.x] ESQL QSTR function (#112590) (#113189) 2024-09-23 10:13:53 +02:00
Iraklis Psaroudakis
6f63a4e08b
fix a couple of docs typos (#112901) (#113283)
Co-authored-by: Pm Ching <41728178+pionCham@users.noreply.github.com>
2024-09-21 01:59:14 +10:00
Bogdan Pintea
6e314d6c2a
ESQL: Align year diffing to the rest of the units in DATE_DIFF: chronological (#113103) (#113258)
This will correct/switch "year" unit diffing from the current integer
subtraction to a chronological subtraction. Consequently, two dates are (at
least) one year apart now if (at least) a full calendar year separates
them. The previous implementation simply subtracted the year part of the
dates.

Note: this departs from ES SQL's implementation of the same function,
which itself is aligned with MS SQL's implementation, which works
equivalently to an integer subtraction.
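
The difference between the two approaches can be sketched with `java.time` (an illustration only; the class and method names are hypothetical, and `LocalDate` is used instead of full datetimes for brevity):

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch contrasting the two ways of diffing years.
final class YearDiffSketch {
    /** Previous behavior: subtract the year parts of the two dates. */
    static long integerYears(LocalDate start, LocalDate end) {
        return end.getYear() - start.getYear();
    }

    /** New behavior: count the full calendar years that separate the dates. */
    static long chronologicalYears(LocalDate start, LocalDate end) {
        return ChronoUnit.YEARS.between(start, end);
    }

    public static void main(String[] args) {
        LocalDate newYearsEve = LocalDate.of(2023, 12, 31);
        LocalDate newYearsDay = LocalDate.of(2024, 1, 1);
        // One day apart: the year parts differ, but no full year has elapsed.
        System.out.println(integerYears(newYearsEve, newYearsDay));
        System.out.println(chronologicalYears(newYearsEve, newYearsDay));
    }
}
```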

Fixes #112482.

(cherry picked from commit f7ff00f645)
2024-09-20 22:31:36 +10:00
Fang Xing
e8569356ea
[ES|QL] explicit cast a string literal to date_period and time_duration in arithmetic operations (#109193)
Explicitly cast to date_period and time_duration in arithmetic operations
2024-09-09 14:56:43 -04:00
Nik Everett
ef3a5a1385
ESQL: Fix CASE when conditions are multivalued (#112401)
When CASE hits a multivalued field it was previously either crashing on
fold or evaluating it to the first value. Since booleans are loaded in
sorted order from lucene that *usually* means `false`. This changes the
behavior to line up with the rest of ESQL - now multivalued fields are
treated as `false` with a warning.

You might say "hey wait! multivalued fields usually become `null`, not
`false`!". Yes, dear reader, you are right. Very right. But! `CASE`'s
contract is to immediately convert its values into `true` or `false`
using the standard boolean tri-valued logic. So `null` just becomes
`false` immediately. This is how PostgreSQL, MySQL, and SQLite behave:

```
> SELECT CASE WHEN null THEN 1 ELSE 2 END;
2
```

They turn that `null` into a false. And we're right there with them.
Except, of course, that we're turning `[false, false]` and the like into
`null` first. See!? It's consistent. Consistently confusing, but sane at
least.

The warning message just says "treating multivalued field as false"
rather than explaining all of that.
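
A minimal sketch of the rule described above, assuming a condition is modeled as a list of booleans (`null` for a null value, more than one element for a multivalued field). The names are hypothetical, and the warning text is the short form quoted in this message rather than necessarily the exact string emitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of how a CASE condition collapses to true or false.
final class CaseConditionSketch {
    static final String WARNING = "treating multivalued field as false";

    /** Returns the effective condition; null and multivalued inputs are false. */
    static boolean toCaseCondition(List<Boolean> condition, List<String> warnings) {
        if (condition == null || condition.isEmpty()) {
            return false;                 // null becomes false right away
        }
        if (condition.size() > 1) {
            warnings.add(WARNING);        // multivalued: false, plus a warning
            return false;
        }
        Boolean single = condition.get(0);
        return single != null && single;  // single value: used as-is
    }

    public static void main(String[] args) {
        List<String> warnings = new ArrayList<>();
        System.out.println(toCaseCondition(Arrays.asList(true), warnings));
        System.out.println(toCaseCondition(Arrays.asList(false, false), warnings));
        System.out.println(toCaseCondition(null, warnings));
        System.out.println(warnings);
    }
}
```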

This also fixes up a few of CASE's docs which I noticed were kind of
busted while working on CASE. I think the docs generation is having a
lot of trouble with CASE so I've manually hacked the right thing into
place, but we should figure out a better solution eventually.

Closes #112359
2024-09-10 02:32:19 +10:00
Nik Everett
cf98240950
Update docs from code 2024-09-09 11:28:31 -04:00
Chris Berkhout
fbaeb1ee61
[ESQL] Add SPACE function (#112350)
Adds the SPACE(number) function, which is equivalent to REPEAT(" ", number).
2024-09-09 21:41:35 +10:00
Iván Cea Fontenla
fc2760cfd4
ESQL: mv_median_absolute_deviation function (#112055)
- Added mv_median_absolute_deviation function
- Added possibility of having a fixed param in Multivalue "ascending" functions
- Add surrogate to MedianAbsoluteDeviation

### Calculations used to avoid overflows
First, a quick recap of how the MAD is calculated:
1. Sort values, and get the median
2. Calculate the difference between each value with the median (`abs(median - value)`)
3. Sort the differences, and get their median

Calculating a MAD may overflow when calculating the differences (Step 2), given the type is a signed number, as the difference is a positive value, with potentially the same value as `POSITIVE_MAX - NEGATIVE_MIN`.
To solve this, some types are upcast as follows:
- Int: Stored as longs, simple approach
- Long: Stored as longs, but switched to unsigned long representation when calculating the differences
- Unsigned long: No effect; the resulting range is the same
- Doubles: Nothing. If the values overflow to +/-infinity, they're left that way, as we'll just use those outliers to sort
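
The `long` case can be sketched as follows (a simplified illustration, not the code in this PR): differences are computed with wrapping arithmetic and then read as unsigned 64-bit values, which can hold any difference between two longs. The class and method names are hypothetical, and the sketch only handles odd-length inputs to sidestep averaging the two middle values.

```java
import java.util.Arrays;

// Hypothetical sketch: median absolute deviation of longs without overflow.
final class MadSketch {
    /** Median of an odd-length array, using ordinary signed ordering. */
    static long signedMedian(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    /** Naive MAD: Math.abs on a wrapped difference can yield a negative value. */
    static long madNaive(long[] values) {
        long median = signedMedian(values);
        long[] diffs = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            diffs[i] = Math.abs(median - values[i]);
        }
        Arrays.sort(diffs);
        return diffs[diffs.length / 2];
    }

    /** MAD whose result bits must be read as an unsigned 64-bit value. */
    static long madUnsigned(long[] values) {
        long median = signedMedian(values);
        long[] diffs = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            long value = values[i];
            // Larger minus smaller; the wrapped bits are correct when read as unsigned.
            long diff = value >= median ? value - median : median - value;
            // Flipping the sign bit makes a signed sort order values as unsigned.
            diffs[i] = diff ^ Long.MIN_VALUE;
        }
        Arrays.sort(diffs);
        return diffs[diffs.length / 2] ^ Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        long[] extremes = { Long.MIN_VALUE, 0L, Long.MAX_VALUE };
        System.out.println(madNaive(extremes));
        System.out.println(Long.toUnsignedString(madUnsigned(extremes)));
    }
}
```

With the extreme input above, the naive version sorts an overflowed negative difference to the front and picks the wrong median, while the unsigned version keeps the differences in their true order.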

Closes https://github.com/elastic/elasticsearch/issues/111590
2024-09-09 10:04:25 +02:00
Ioana Tagirta
90f1fb667c
[ES|QL] Document return value for locate in case substring is not found (#112202)
* Document return value for locate in case substring is not found

* Add note that string positions start from 1
2024-09-03 12:46:20 +02:00
Nik Everett
d8e705d5da
ESQL: Document date instead of datetime (#111985)
This changes the generated types tables in the docs to say `date`
instead of `datetime`. That's the name of the field in Elasticsearch so
it's a lot less confusing to call it that.

Closes #111650
2024-08-21 01:59:13 +10:00
Iván Cea Fontenla
65ce50c60a
ESQL: Added mv_percentile function (#111749)
- Added the `mv_percentile(values, percentile)` function
- Used as a surrogate in the `percentile(column, percentile)` aggregation
- Updated docs to specify that the surrogate _should_ be implemented if possible

Like mv_median, this yields exact results (ignoring floating-point error in double operations).
To achieve that, some decisions were made, especially in the long evaluator (see the comments in context in `MvPercentile.java`)

Closes https://github.com/elastic/elasticsearch/issues/111591
2024-08-20 15:29:19 +02:00
Iván Cea Fontenla
e3f378ebd2
ESQL: Strings support for MAX and MIN aggregations (#111544)
Support Version, Keyword and Text in Max and Min aggregations.

The current implementation of both max and min does:

For non-grouping:
- Store a BytesRef
- When there's a max/min, copy it to the internal array. Grow it if needed

For grouping:
- Keep an array of BytesRef (null by default: there's no "initial/default value" here, as there's no "MAX" value for a string)
- Each BytesRef stores their own array, which will be grown as needed to copy the new max/min

Some notes:
- It's not shrinking the arrays, so as to avoid having to copy, and potentially grow them again
- It's using raw arrays. But maybe it should use BigArrays to compute in the circuit breaker?

Part of https://github.com/elastic/elasticsearch/issues/110346
2024-08-20 15:24:55 +02:00
Bogdan Pintea
dd49c33479
ESQL: BUCKET: allow numerical spans as whole numbers (#111874)
This relaxes the check on numerical spans to allow them to be specified as whole numbers. So far it was required that they be provided as a double.

This also expands the tests for date ranges to include string types.

Resolves #109340, resolves #104646, resolves #105375.
2024-08-20 13:40:59 +02:00
Nik Everett
dc24003540
ESQL: Profile more timing information (#111855)
This profiles additional timing information for each individual driver.
To the results from `profile` it adds the start and stop time for each
driver. That was already in the task status. To the profile and task
status it also adds the number of times the driver slept and some more
detailed history about a few of those times.

Explanation time! The compute engine splits work into some number of
`Drivers` per node. Each `Driver` is a single threaded entity - it runs
on a thread for a while then does one of three things:
1. Finishes
2. Goes async because one of its `Operator`s has gone async
3. Yields the thread pool because it has run for too long

This PR measures the last two. At this point only three operators can
go async:
* ENRICH
* Reading from an empty exchange
* Writing to a full exchange

We're quite interested in these sleeps at the moment because we think
they may be slowing things down. Here's what it looks like when a driver
goes async because it wants to read from an empty exchange:

```
... the rest of the profile ...
        "sleeps" : {
          "counts" : {
            "exchange empty" : 2
          },
          "first" : [
            {
              "reason" : "exchange empty",
              "sleep" : "2024-08-13T19:45:57.943Z",
              "sleep_millis" : 1723578357943,
              "wake" : "2024-08-13T19:45:58.159Z",
              "wake_millis" : 1723578358159
            },
            {
              "reason" : "exchange empty",
              "sleep" : "2024-08-13T19:45:58.164Z",
              "sleep_millis" : 1723578358164,
              "wake" : "2024-08-13T19:45:58.165Z",
              "wake_millis" : 1723578358165
            }
          ],
          "last": [same as above]
```

Every time the driver goes async we count it in the `counts` map -
grouped by the reason the driver slept. We also record the sleep and
wake times for the first and last ten times the driver sleeps. In this
case it only slept twice, so the `first` and `last` ten times are the
same array.

This should give us a good sense about why drivers sleep while using a
limited amount of memory per driver.
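
The bookkeeping described above can be sketched like this (an illustration with hypothetical names, not the classes added in this PR): a per-reason counter plus two bounded collections, so memory stays fixed no matter how often a driver sleeps.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: count sleeps by reason and keep only the first and last ten.
final class DriverSleepsSketch {
    static final int LIMIT = 10;

    /** One sleep: why the driver slept, and when it slept and woke. */
    static final class Sleep {
        final String reason;
        final long sleepMillis;
        final long wakeMillis;

        Sleep(String reason, long sleepMillis, long wakeMillis) {
            this.reason = reason;
            this.sleepMillis = sleepMillis;
            this.wakeMillis = wakeMillis;
        }
    }

    private final Map<String, Integer> counts = new LinkedHashMap<>();
    private final List<Sleep> first = new ArrayList<>();
    private final Deque<Sleep> last = new ArrayDeque<>();

    /** Records one sleep; memory use is bounded by LIMIT entries per collection. */
    void record(String reason, long sleepMillis, long wakeMillis) {
        counts.merge(reason, 1, Integer::sum);
        Sleep sleep = new Sleep(reason, sleepMillis, wakeMillis);
        if (first.size() < LIMIT) {
            first.add(sleep);        // keep only the earliest LIMIT sleeps
        }
        if (last.size() == LIMIT) {
            last.removeFirst();      // drop the oldest of the recent sleeps
        }
        last.addLast(sleep);
    }

    Map<String, Integer> counts() {
        return counts;
    }

    List<Sleep> first() {
        return first;
    }

    List<Sleep> last() {
        return new ArrayList<>(last);
    }

    public static void main(String[] args) {
        DriverSleepsSketch sleeps = new DriverSleepsSketch();
        sleeps.record("exchange empty", 1723578357943L, 1723578358159L);
        sleeps.record("exchange empty", 1723578358164L, 1723578358165L);
        System.out.println(sleeps.counts());
        System.out.println(sleeps.first().size() + " " + sleeps.last().size());
    }
}
```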
2024-08-20 07:29:01 +10:00
Nik Everett
2e22e73cdf
ESQL: Remove date_nanos from generated docs (#111884)
This removes date_nanos from the docs generated for all of our functions
because it's still under construction. I've done so as a sort of one-off
hack. My plan is to replace this in a follow up change with a
centralized registry of "under construction" data types, so we can make
new data types under a feature flag more easily in the future. We're
going to be doing that a fair bit.
2024-08-15 00:22:25 +10:00
Mark Tozzi
67c69bb224
[ESQL] Date nanos type (#110205)
Resolves #109987

Add initial support for the date nanos data type. At this point, almost no functions are supported, including casting. This just covers loading and returning the values. Like millisecond dates, nanosecond dates are internally modeled as long values, so we don't need a new block type to support them.

This has very patchwork function support. Ideally, I don't think I would have added any function support yet, but the five MV functions you see here declare that they accept any non-spatial type, and will error tests if not wired up for new types. There are other functions, like Values, which also claim to support all non-spatial types, but don't currently enforce that in testing, so I didn't add them yet. Finally, there are functions like == which should work for all types, but are implemented as a specific list. I've left those for a follow up ticket as well.
2024-08-07 13:17:26 -04:00
Nik Everett
cc294a1a0f
ESQL: Finish migration of null testing (#111563)
This finishes the migration of `null` testing from a test method, namely
`testSimpleWithNulls`. It migrates it to `anyNullIsNull` and hand rolled
null cases.
2024-08-05 12:28:15 -04:00
Fang Xing
d87254369a
type = operator in kibana operator definition (#111436) 2024-07-31 11:07:18 -04:00
Pablo Machado
f79c62157d
ESQL: Add MV_PSERIES_WEIGHTED_SUM for score calculations used by security solution (#109017)
* Create MV_RIEMANN_ZETA scalar multivalue function



---------

Co-authored-by: Nik Everett <nik9000@gmail.com>
2024-07-31 12:08:28 +02:00
Iván Cea Fontenla
bc69827e1e
ESQL: WEIGHTED_AVG aggregation tests and docs (#111449) 2024-07-31 00:42:23 +10:00
Iván Cea Fontenla
735d80dffd
ESQL: Add COUNT and COUNT_DISTINCT aggregation tests (#111409) 2024-07-30 03:07:15 +10:00
Iván Cea Fontenla
826d49448b
ESQL: Added Median and MedianAbsoluteDeviation aggregations tests and kibana docs (#111231) 2024-07-26 22:11:01 +10:00
Iván Cea Fontenla
595d907f61
ESQL: SpatialCentroid aggregation tests and docs (#111236) 2024-07-26 10:41:18 +02:00
Fang Xing
66dd2687d5
[ES|QL] Generate docs for unregistered esql functions from annotations (#108749)
* render docs for operators
2024-07-22 14:58:17 -04:00
Iván Cea Fontenla
195b916e2b
ESQL: TOP aggregation IP support (#111105)
Added IP support to TOP() aggregation.

Slightly adapted the stringtemplates organization for esql/compute to
(also?) work with specific datatypes. Right now it may be a bit messy,
but we need the specific support for cases like this.
2024-07-22 22:35:48 +10:00