Make it clear that this API should be used only if the detailed shard
info is needed and only on ongoing snapshots. Remove incorrectly
mentioned `STATE` value.
This implements `INLINESTATS`. Most of the heavy lifting is done by
`LOOKUP`, with this change mostly adding a new abstraction to logical
plans, an interface I'm calling `Phased`. Implementing this interface
allows a logical plan node to cut the query into phases. `INLINESTATS`
implements it by asking for a "first phase" that's the same query, up to
`INLINESTATS`, but with `INLINESTATS` replaced with `STATS`. The next
phase replaces the `INLINESTATS` with a `LOOKUP` on the results of the
first phase.
So, this query:
```
FROM foo
| EVAL bar = a * b
| INLINESTATS m = MAX(bar) BY b
| WHERE m = bar
| LIMIT 1
```
gets split into
```
FROM foo
| EVAL bar = a * b
| STATS m = MAX(bar) BY b
```
followed by
```
FROM foo
| EVAL bar = a * b
| LOOKUP (results of m = MAX(bar) BY b) ON b
| WHERE m = bar
| LIMIT 1
```
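As a rough illustration of the two phases, here is a minimal Python sketch that runs the example query over in-memory rows. It is not the ES|QL implementation; the helper names (`eval_bar`, `stats_max_by`, `lookup`, `run_inlinestats`) and the sample data are hypothetical and only model the splitting described above.

```python
def eval_bar(rows):
    """EVAL bar = a * b"""
    return [{**row, "bar": row["a"] * row["b"]} for row in rows]


def stats_max_by(rows, value_key, group_key, out_key):
    """First phase: STATS out_key = MAX(value_key) BY group_key."""
    best = {}
    for row in rows:
        group, value = row[group_key], row[value_key]
        if group not in best or value > best[group]:
            best[group] = value
    return [{group_key: g, out_key: m} for g, m in best.items()]


def lookup(rows, table, on):
    """Second phase: append the first-phase columns to each row, joined on `on`.

    Every row finds a match here because the table was built from the same rows.
    """
    index = {entry[on]: entry for entry in table}
    return [{**row, **index[row[on]]} for row in rows]


def run_inlinestats(rows):
    evaluated = eval_bar(rows)
    first_phase = stats_max_by(evaluated, "bar", "b", "m")  # STATS m = MAX(bar) BY b
    joined = lookup(evaluated, first_phase, on="b")          # LOOKUP ... ON b
    filtered = [row for row in joined if row["m"] == row["bar"]]  # WHERE m = bar
    return filtered[:1]                                       # LIMIT 1


foo = [
    {"a": 1, "b": 2},
    {"a": 3, "b": 2},
    {"a": 5, "b": 1},
]
print(run_inlinestats(foo))
```

The sketch recomputes `EVAL` for the second phase, mirroring how the second query above starts again from `FROM foo`.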
This change returns the total number of fields at the segment level,
allowing for a more accurate estimate of the memory used by Lucene. The
new estimate is expected to be closer to the actual memory usage than
the current estimate using the index-level field count, due to the
non-trivial overhead incurred by each Lucene segment. Two new fields are
introduced: total_segment_fields, which is the total number of fields at
the segment level, and average_fields_per_segment. The overhead per
field in segments with fewer fields is larger than in segments with many
fields.
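To make the two new fields concrete, here is a small Python sketch of how they could be derived from per-segment field counts. It assumes that `average_fields_per_segment` is the segment-level total divided by the number of segments; the function name and the input shape are hypothetical.

```python
def segment_field_stats(fields_per_segment):
    """Summarize field counts across the segments of one index.

    fields_per_segment: one integer per Lucene segment (hypothetical input).
    """
    total = sum(fields_per_segment)
    segments = len(fields_per_segment)
    # Assumption: the average is the plain mean over segments.
    average = total / segments if segments else 0.0
    return {
        "total_segment_fields": total,
        "average_fields_per_segment": average,
    }


# An index whose three segments hold 10, 4 and 4 fields respectively.
print(segment_field_stats([10, 4, 4]))
```

In this example the index-level mapping might contain only 10 fields, while the segment-level total is 18, which is the difference the new estimate is meant to capture.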
Added IP support to TOP() aggregation.
Slightly reorganized the string templates for esql/compute so that they
(also?) work with specific data types. Right now it may be a bit messy,
but we need the specific support for cases like this.
- Added SUM() agg tests (which autogenerate docs)
- Converted non-finite doubles to nulls in aggregator
The complete set of tests depends on
https://github.com/elastic/elasticsearch/issues/110437, as noted in the
code comments. Once that issue is resolved, the test can be uncommented
and everything should work.
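The non-finite handling can be pictured with a short Python sketch. It assumes the conversion is applied to the final sum; where exactly the aggregator performs the check is not stated here, and `sum_or_null` is a hypothetical name.

```python
import math


def sum_or_null(values):
    """Sum doubles, returning None (null) when the result is not finite."""
    total = 0.0
    for value in values:
        total += value
    # Infinity and NaN are reported as null instead of being returned as-is.
    return total if math.isfinite(total) else None


print(sum_or_null([1.5, 2.5]))        # a finite sum is returned unchanged
print(sum_or_null([1e308, 1e308]))    # overflows to infinity, so null
```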
Clarify that the default config is the recommended one, and that users
should not normally enable `DEBUG` or `TRACE` logging without looking at
the source code. Also reorders the information a bit for easier reading.
- Support IP in MAX() and MIN()
- Used a custom IpArrayState for it, as it's quite different from the `X-ArrayState.java.st` generated ones
- Add IP test cases for aggregation tests
- Added Percentile aggregation tests and autogen docs
- Added a new "appendix" section to FunctionInfo. The existing Percentile docs had a long final section with extra info, and we need this to keep it. We already have a "detailedDescription" attribute, but it's rendered right after the description, and using it would make the important bits of the function (types, examples...) harder to read. So I'm not reusing it.
I highly value the content on this [Data Tiers](https://www.elastic.co/guide/en/elasticsearch/reference/current/data-tiers.html) page. Thanks for writing it! In my experience, some users may become slightly confused by its golden nuggets due to its brevity. This PR attempts to flesh out common questions while remaining concise.
The main changes are in the first and second-to-last sections; however, I also attempt some heading restructuring so the TOC groups ideas more clearly and is easier to scan.
The specific clarifications I'd like to push in order of appearance:
- There's the content tier (for "data category" > "content", as we've dubbed it on the higher page) and the data temperature tiers (for time series). That the temperature tiers group together is technically not stated, so users end up asking when they'd go hot>warm vs content>warm, etc. I suspect this confusion arises only because users come straight to this page instead of starting at the hierarchy-parent page, so I have linked up to it.
- (Main) Frozen being accessed/searched "rarely" should mean, well, rarely. I wrote 1% in the PR's `[TIP]` guideline section as a discussion starting point. We frequently see users who don't realize either that ≥25% of their searches are hitting the frozen tier or that this shouldn't happen. It comes up because of architecture bugs (e.g. frozen indices with future timestamps) but also through happenstance (e.g. 01605242, where most searches hit hot and ~5% hit cold, but 75% also hit frozen).
- We get a slew of "how do I check that?", "how do I change that (at creation/later)?", and "what if I set it null?" questions about `_tier_preference`, so I extended the existing section about it.
---------
Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
Today `cluster.routing.allocation.allow_rebalance` defaults to
`indices_all_active` which blocks all rebalancing moves while the
cluster is in `yellow` or `red` health. This was appropriate for the
legacy allocator which might do too many rebalancing moves otherwise.
The desired-balance allocator has better support for rebalancing a
cluster that is not in `green` health, and expects to be able to
rebalance some shards away from over-full nodes to avoid allocating
shards to undesirable locations in the first place. This commit changes
the default `allow_rebalance` setting to `always`.
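For reference, the documented values of this setting can be modelled with a small Python sketch. It only illustrates when rebalancing is permitted under each value, not the allocator's actual code; the function and parameter names are hypothetical.

```python
def rebalance_allowed(allow_rebalance, primaries_active, replicas_active):
    """Return True if rebalancing moves are permitted under the given setting."""
    if allow_rebalance == "always":
        return True
    if allow_rebalance == "indices_primaries_active":
        return primaries_active
    if allow_rebalance == "indices_all_active":
        return primaries_active and replicas_active
    raise ValueError(f"unknown allow_rebalance value: {allow_rebalance}")


# A yellow cluster: all primaries assigned, some replicas unassigned.
print(rebalance_allowed("indices_all_active", True, False))  # old default
print(rebalance_allowed("always", True, False))              # new default
```

With the old default a `yellow` cluster cannot rebalance at all, while with `always` the desired-balance allocator is free to move shards away from over-full nodes.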
This adds minimal docs on how to use the new logs index mode for data
streams (the most common use case). This is minimal because logs index mode
is still in tech preview. Minimal docs should allow any interested users
to experiment with the new logs index mode.
* Enforce an invariant in our dependency checker so that logical plans never have duplicate output attribute names or ids.
* Fix ROW to not produce columns with duplicate names.
* Fix ResolveUnionTypes to not create multiple synthetic field attributes for the same union type.
* Add tests for commands using the same column name more than once.
* Update docs w.r.t. how commands behave if they are used with duplicate column names.
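
The invariant in the first bullet amounts to a uniqueness check over a plan's output. A minimal Python sketch of such a check follows; it is illustrative only, looks at names but not ids, and its function names are hypothetical rather than taken from the dependency checker.

```python
from collections import Counter


def duplicate_names(output_names):
    """Return the output attribute names that appear more than once."""
    counts = Counter(output_names)
    return sorted(name for name, count in counts.items() if count > 1)


def check_plan_output(output_names):
    """Raise if a plan's output contains duplicate attribute names."""
    duplicates = duplicate_names(output_names)
    if duplicates:
        raise ValueError(f"plan has duplicate output names: {duplicates}")


check_plan_output(["a", "b", "c"])        # passes silently
print(duplicate_names(["a", "b", "a"]))   # reports the repeated name
```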
* Union types documentation
* Try to remove asciidoc error
* Another attempt
* Using literal block
* Nicer formatting
* Remove partintro
* Small refinements
* Edits for clarity and style
---------
Co-authored-by: Marci W <333176+marciw@users.noreply.github.com>
* (Doc+) Error "number of documents in the index can't exceed"
👋 howdy, team!
This adds a resolution outline for the error ... which induces ongoing, low-key support
```
Number of documents in the index can't exceed [2147483519]
```
* feedback
* feedback
Co-authored-by: David Turner <david.turner@elastic.co>
* feedback
Co-authored-by: David Turner <david.turner@elastic.co>
Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
* feedback
* feedback
* Test change to address docs check failure
* Revert test change
* Test docs check
---------
Co-authored-by: David Turner <david.turner@elastic.co>
Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
- Added a custom implementation of BooleanBucketedSort to keep the top booleans
- Added boolean aggregator to TOP
- Added tests (Boolean aggregator tests, Top tests for boolean, and added boolean fields to CSV cases)
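
Because a boolean has only two possible values, keeping the top N per bucket can be reduced to counting. The Python sketch below shows that idea under the assumption that nulls are skipped; it is not a description of how `BooleanBucketedSort` is actually implemented, and `top_booleans` is a hypothetical name.

```python
def top_booleans(values, limit, descending=True):
    """Return the top `limit` booleans, ignoring None (null) values."""
    trues = sum(1 for v in values if v is True)
    falses = sum(1 for v in values if v is False)
    # Descending order puts True first; ascending puts False first.
    if descending:
        first, first_count, second, second_count = True, trues, False, falses
    else:
        first, first_count, second, second_count = False, falses, True, trues
    take_first = min(first_count, limit)
    take_second = min(second_count, limit - take_first)
    return [first] * take_first + [second] * take_second


print(top_booleans([True, False, None, True, False], 3))
print(top_booleans([True, False, None, True, False], 3, descending=False))
```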
Today we return HTTP code 207 if some features successfully reset and
others failed. This is not an appropriate response code: it has a _very_
precise meaning in the HTTP specification, and we do not adhere to it.
Since this API is used only in tests we can be stricter and return a 500
unless it completely succeeds.
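The stricter rule boils down to a one-line decision. Here is a minimal Python sketch, assuming the per-feature outcomes are available as a map from feature name to success flag and that full success maps to 200 (the function name and input shape are hypothetical):

```python
def reset_features_status(results):
    """Map per-feature reset outcomes to an HTTP status code.

    results: dict of feature name -> True if that feature reset successfully.
    """
    # Any failure at all yields 500; partial success is no longer reported as 207.
    return 200 if all(results.values()) else 500


print(reset_features_status({"feature_a": True, "feature_b": True}))
print(reset_features_status({"feature_a": True, "feature_b": False}))
```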
This adds an example to the docs of counting the TRUE results of an
expression. You do `COUNT(a > 0 OR NULL)`. That turns the `FALSE`
into `NULL`, which you need to do because `COUNT(false)` is `1`,
since it's a value, while `COUNT(null)` is `0`, since it's the
absence of values.
We would like to make something more intuitive for this one day. But for
now, this is what works.
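The trick relies on three-valued logic, which the following Python sketch models with `None` standing in for `NULL`. It is a simulation of the semantics described above, not ES|QL itself; the helpers `gt`, `or3` and `count` are hypothetical.

```python
def gt(value, threshold):
    """value > threshold, where a NULL input yields NULL."""
    return None if value is None else value > threshold


def or3(x, y):
    """Three-valued OR: TRUE wins, otherwise NULL wins over FALSE."""
    if x is True or y is True:
        return True
    if x is None or y is None:
        return None
    return False


def count(values):
    """COUNT counts every non-NULL value, including FALSE."""
    return sum(1 for v in values if v is not None)


a = [3, -1, 0, None, 7]
print(count([gt(v, 0) for v in a]))             # COUNT(a > 0): FALSE still counts
print(count([or3(gt(v, 0), None) for v in a]))  # COUNT(a > 0 OR NULL): only TRUE counts
```

With this data the first call counts four non-null results, while the second counts only the two rows where `a > 0` is `TRUE`.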
As preparation for #106081, this PR adds the `size_in_bytes`
field to the enrich cache. This field is calculated by summing
the ByteReference sizes of all the search hits in the cache.
It's not a perfect representation of the size of the enrich cache
on the heap, but some experimentation showed that it's quite close.
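The calculation can be sketched in a few lines of Python. This assumes each cached entry holds a list of search hits and that each hit is represented by its serialized bytes; the cache shape and the function name are hypothetical and only illustrate the summation.

```python
def cache_size_in_bytes(cache):
    """Sum the byte lengths of every search hit stored in the cache.

    cache: dict of cache key -> list of hits, each hit given as bytes.
    """
    return sum(len(hit) for hits in cache.values() for hit in hits)


enrich_cache = {
    "key-1": [b"{}", b'{"a":1}'],
    "key-2": [],
}
print(cache_size_in_bytes(enrich_cache))
```

As noted above, this counts only the hit payloads, so object overhead such as keys and container structures is not included in the figure.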