When creating and updating transforms, it is possible for clients to provide secondary headers.
When PUT, _preview, or _update is called with secondary authorization headers, those headers are then used or stored with the transform.
closes: https://github.com/elastic/elasticsearch/issues/86731
Previously, the _start API caller was required to have the same roles as the create API caller.
This does not make sense, because when the transform is actually running (after _start) we rely solely on the roles of the caller who created the transform.
Consequently, this commit does the permission validations and various checks with the roles of the user who created the transform, not the one calling _start.
"Add" was outside the hyperlink, which I have fixed.
Earlier, line 71 was: * *Add* <<set-up-lifecycle-policy,*lifecycle policy*>>
After the fix, line 71 is: * <<set-up-lifecycle-policy,*Add lifecycle policy*>>
(cherry picked from commit 3b8d51c696)
Co-authored-by: Tapomoy Bhowmik <99604828+TapomoyBhowmik@users.noreply.github.com>
* correct way of getting node heap size
In [[shard-count-recommendation]], we explain that the number of shards should be at most 20 shards per GB of heap.
But the command to get the relevant heap size should be _cat/nodes?v=true&h=heap.max and not _cat/nodes?v=true&h=heap.current. The latter gives the current memory consumption, which is always changing. Here we need to consider the max allocated heap size (-Xmx).
* Adds heap.max to valid columns
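The 20-shards-per-GB guideline above can be sketched as a small calculation. This is only an illustration: the helper names are hypothetical, and it assumes `heap.max` is reported as a number followed by a unit such as `mb` or `gb`.

```python
import re

# Bytes per unit, assuming binary (1024-based) units in the heap.max column.
UNIT_BYTES = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def heap_to_gb(heap_max: str) -> float:
    """Convert a value such as '4gb' or '512mb' to gigabytes (hypothetical helper)."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([kmgt]?b)\s*", heap_max.lower())
    if match is None:
        raise ValueError(f"unrecognized heap size: {heap_max!r}")
    number, unit = match.groups()
    return float(number) * UNIT_BYTES[unit] / UNIT_BYTES["gb"]


def max_recommended_shards(heap_max: str, shards_per_gb: int = 20) -> int:
    """Upper bound on shards for one node, based on max heap rather than current usage."""
    return int(heap_to_gb(heap_max) * shards_per_gb)
```

For instance, a node reporting a `heap.max` of `4gb` would be capped at 80 shards under this rule.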
Co-authored-by: Adam Locke <adam.locke@elastic.co>
Today the add/clear voting config exclusions APIs route a request to the
master node but do not expose the usual `?master_timeout` parameter
that allows changing the timeout for this phase of execution. This commit
adds the missing parameter.
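A rough sketch of how a client might build the request path with the new parameter. The endpoint path, the `node_names` parameter, and the helper name are assumptions for illustration and are not taken from this commit message.

```python
from urllib.parse import urlencode


def voting_exclusions_path(node_names, master_timeout=None):
    """Build a request path for adding voting config exclusions (hypothetical helper)."""
    # Assumed endpoint and node_names parameter; only master_timeout comes from the text above.
    params = {"node_names": ",".join(node_names)}
    if master_timeout is not None:
        params["master_timeout"] = master_timeout
    # Keep commas readable instead of percent-encoding them.
    return "/_cluster/voting_config_exclusions?" + urlencode(params, safe=",")
```

When `master_timeout` is omitted, the parameter is simply left off so that the server-side default applies.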
Remove usage of deprecated elasticsearch.rest-test in DocsTestPlugin
we keep some files in src/test in docs projects as moving them would require more changes
in build-docs project outside this repository
* document cloud_id usage
* actually no cloud id used
* [source,console]
* suggested change
* Mark example as NOTCONSOLE
* Add tests
* Add comma
* Fix comma (for real this time)
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: Adam Locke <adam.locke@elastic.co>
When starting a trained model deployment the user can tweak performance
by setting the `model_threads` and `inference_threads` parameters.
These parameters are hard to understand and cause confusion.
This commit renames these as well as the fields where their values are
reported in the stats API.
- `model_threads` => `number_of_allocations`
- `inference_threads` => `threads_per_allocation`
Now the terminology is as follows.
A model deployment starts with a requested `number_of_allocations`.
Each allocation means the model gets another thread for executing
parallel inference requests. Thus, more allocations should increase
throughput. In turn, each allocation may use a number
of threads to parallelize each individual inference request.
This is the `threads_per_allocation` setting and increases inference
speed (which might also result in improved throughput).
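The rename and the resulting terminology can be sketched as follows. The helper and class names are hypothetical, and treating the total thread count as the product of the two settings is an assumption drawn from the description above.

```python
from dataclasses import dataclass

# Old parameter name -> new parameter name, as listed above.
RENAMED = {
    "model_threads": "number_of_allocations",
    "inference_threads": "threads_per_allocation",
}


def migrate_params(params: dict) -> dict:
    """Rewrite old parameter names to the new ones, leaving other keys untouched."""
    return {RENAMED.get(key, key): value for key, value in params.items()}


@dataclass(frozen=True)
class DeploymentThreads:
    number_of_allocations: int   # parallel inference requests
    threads_per_allocation: int  # threads used by each individual request

    @property
    def total_threads(self) -> int:
        # Assumption: every allocation uses its full thread budget.
        return self.number_of_allocations * self.threads_per_allocation
```

Under this model, raising `number_of_allocations` targets throughput, while raising `threads_per_allocation` targets the latency of a single request.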
This attempts to shrink the index by implementing a "synthetic _source" field.
You configure it in the mapping:
```
{
"mappings": {
"_source": {
"synthetic": true
}
}
}
```
And we just stop storing the `_source` field - kind of. When you go to access
the `_source` we regenerate it on the fly by loading doc values. Doc values
don't preserve the original structure of the source you sent so we have to
make some educated guesses. And we have a rule: the source we generate would
result in the same index if you sent it back to us. That way you can use it
for things like `_reindex`.
Fetching the `_source` from doc values does slow down loading somewhat. See
numbers further down.
## Supported fields
This only works for the following fields:
* `boolean`
* `byte`
* `date`
* `double`
* `float`
* `geo_point` (with precision loss)
* `half_float`
* `integer`
* `ip`
* `keyword`
* `long`
* `scaled_float`
* `short`
* `text` (when there is a `keyword` sub-field that is compatible with this feature)
## Educated guesses
The synthetic source generator makes `_source` fields that are:
* sorted alphabetically
* as "objecty" as possible
* pushes all arrays to the "leaf" fields
* sorts most array values
* removes duplicate text and keyword values
These are mostly artifacts of how doc values are stored.
### sorted alphabetically
```
{
"b": 1,
"c": 2,
"a": 3
}
```
becomes
```
{
"a": 3,
"b": 1,
"c": 2
}
```
### as "objecty" as possible
```
{
"a.b": "foo"
}
```
becomes
```
{
"a": {
"b": "foo"
}
}
```
### pushes all arrays to the "leaf" fields
```
{
"a": [
{
"b": "foo",
"c": "bar"
},
{
"c": "bort"
},
{
"b": "snort"
}
]
}
```
becomes
```
{
"a": {
"b": ["foo", "snort"],
"c": ["bar", "bort"]
}
}
```
### sorts most array values
```
{
"a": [2, 3, 1]
}
```
becomes
```
{
"a": [1, 2, 3]
}
```
### removes duplicate text and keyword values
```
{
"a": ["bar", "baz", "baz", "baz", "foo", "foo"]
}
```
becomes
```
{
"a": ["bar", "baz", "foo"]
}
```
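The educated guesses above can be modeled with a short sketch. This is not the implementation in this change, which rebuilds `_source` from doc values; it only imitates the observable result on plain dictionaries. It assumes every leaf holds values of a single type and that no field is both a scalar and an object. The function names are hypothetical.

```python
from collections import defaultdict


def collect_leaves(node, prefix, leaves):
    """Gather every scalar under its dotted path, walking through objects and arrays."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            collect_leaves(value, path, leaves)
    elif isinstance(node, list):
        for item in node:
            collect_leaves(item, prefix, leaves)
    else:
        leaves[prefix].append(node)


def synthesize(source: dict) -> dict:
    """Imitate the shape of a synthetic _source for a plain dictionary."""
    leaves = defaultdict(list)
    collect_leaves(source, "", leaves)
    result = {}
    # Sorting by path segments yields alphabetically ordered keys at every level.
    for path in sorted(leaves, key=lambda p: p.split(".")):
        values = sorted(leaves[path])
        # Only string values (text and keyword in this model) are de-duplicated.
        if all(isinstance(value, str) for value in values):
            values = sorted(set(values))
        *parents, leaf = path.split(".")
        target = result
        for part in parents:
            target = target.setdefault(part, {})  # as "objecty" as possible
        target[leaf] = values[0] if len(values) == 1 else values
    return result
```

Feeding the output of `synthesize` back into it returns the same dictionary, which loosely mirrors the rule that the generated source would result in the same index if sent back.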
## `_recovery_source`
Elasticsearch's shard "recovery" process needs `_source` *sometimes*. So does
cross cluster replication. If you disable source or filter it somehow we store
a `_recovery_source` field for as long as the recovery process might need it.
When everything is running smoothly that's generally a few seconds or minutes.
Then the field is removed on merge. This synthetic source feature continues
to produce `_recovery_source` and relies on it for recovery. It's *possible*
to synthesize `_source` during recovery but we don't do it.
That means that synthetic source doesn't speed up writing the index. But in the
future we might be able to turn this on to trade writing less data at index
time for slower recovery and cross cluster replication. That's an area of
future improvement.
## perf numbers
I loaded the entire tsdb data set with this change and compared the size:
```
standard -> synthetic
store size 31.0 GB -> 7.0 GB (77.5% reduction)
_source 24695.7 MB -> 47.6 MB (99.8% reduction - synthetic is in _recovery_source)
```
A second _forcemerge a few minutes after rally finishes should remove the
remaining 47.6MB of _recovery_source.
With this, fetching source for 1,000 documents seems to take about 500ms. I
spot checked a lot of different areas and haven't seen a different hit. I
*expect* this performance impact is based on the number of doc values fields
in the index and how sparse they are.
The health API has a notion of details within each health indicator that is returned. These details can sometimes be
expensive to compute or transfer. This change allows a user to specify whether the details are generated and
returned. By default now all details are generated and returned (previously this was only the case if a component
was specified in the request). This behavior can be changed with the `explain` query parameter.
Closes #86215
1. Adds a note that you can restore older snapshots (to recover from a
failed upgrade) even after newer snapshots were taken.
2. Copies the note about incompatible S3 repo implementations to the top
level to avoid misunderstandings.
This commit adds a new `_ml/trained_models/{model_id}/_infer` API. This API works for both native NLP models and supervised models trained via Data Frame analytics.
The format of the API is the same as the old `_ml/trained_models/{model_id}/deployment/_infer`: it takes a `docs` and an `inference_config` parameter.
This PR also deprecates the old experimental `_ml/trained_models/{model_id}/deployment/_infer` API.
The biggest difference is that the response now nests all results under an "inference_results" object.
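A minimal sketch of calling the new API from a client. The path and the `docs`, `inference_config`, and `inference_results` names come from the description above. The helper names and the exact shape of the response are assumptions.

```python
def build_infer_request(model_id, docs, inference_config=None):
    """Return the request path and body for the new _infer API (hypothetical helper)."""
    path = f"_ml/trained_models/{model_id}/_infer"
    body = {"docs": docs}
    if inference_config is not None:
        body["inference_config"] = inference_config
    return path, body


def extract_results(response: dict):
    """Read the results, which are now nested under "inference_results"."""
    return response["inference_results"]
```

Code written against the old deployment endpoint would need to read from the nested key rather than from the top level of the response.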
closes: https://github.com/elastic/elasticsearch/issues/86032
* [doc] Add information for how to find if compressed ordinary object pointers are in use using the REST APIs.
* Update docs/reference/setup/advanced-configuration.asciidoc
Co-authored-by: Nikola Grcevski <6207777+grcevski@users.noreply.github.com>
This commit adds tracking for desired nodes cluster membership.
When desired nodes are updated they are matched against the current
cluster members. Additionally when a node joins the cluster the
desired nodes cluster membership is updated.
In #50535 (ES v7.6) the default values for the
`DocumentSubsetBitsetCache` settings were changed. However, the docs
were not updated at that time, and still reflect the old values for
these settings.
Adds a parameter `index_names` to the get snapshots API so that users may exclude the potentially very long index name lists when listing out snapshots.
closes #82937
Users should be able to specify specific metrics/keys within a specific bucket key.
An example is `agg["bucket_foo"]._count`.
This change now allows that.
closes: https://github.com/elastic/elasticsearch/issues/76320
Fixes a few scalability issues around join validation:
- compresses the cluster state sent over the wire
- shares the serialized cluster state across multiple nodes
- forks the decompression/deserialization work off the transport thread
Relates #77466
Closes #83204
Today you cannot remove archived index settings by applying a setting
update `{"archived.*":null}` because `IndexSettings#same` incorrectly
treats such an update as a no-op. This commit fixes that.
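The intended effect of such an update can be modeled on a plain dictionary. This is a toy model, not the `IndexSettings` code: the helper name is hypothetical and glob matching via `fnmatch` is an assumption made for illustration.

```python
from fnmatch import fnmatchcase


def apply_settings_update(current: dict, update: dict) -> dict:
    """Apply an update where a null (None) value removes every matching key."""
    result = dict(current)
    for pattern, value in update.items():
        if value is None:
            # A null value removes all settings whose key matches the pattern.
            for key in [k for k in result if fnmatchcase(k, pattern)]:
                del result[key]
        else:
            result[pattern] = value
    return result
```

In this model, an update of `{"archived.*": None}` changes the settings whenever an archived key exists, so it should not be treated as a no-op.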
This adds support for partial results to SQL.
The lenient mode is controlled by a new query parameter,
`allow_partial_search_results`, false by default. On shard failures, the
errors are added as Warning headers to the response. Only a first set of
failures is sent to the client; the last header summarizes the number of
remaining suppressed ones.