This makes the use of `usesDefaultDistribution` in our test setup more explicit by requiring a reason why it's needed.
This is helpful as part of revisiting the need for all those usages in our code base.
A new query parameter `?include_source_on_error` was added for create / index, update and bulk REST APIs to control
whether the document source is included in the error response when parsing fails. The default value is `true`.
Here we introduce a new index-level setting, `ignore_above`, similar to what we have
for `ignore_malformed`. The setting will apply to all `keyword`, `wildcard` and `flattened`
fields. Each field mapping will still be allowed to override the index-level setting using a
mapping-level `ignore_above` value.
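The override rule can be sketched as follows. This is a minimal illustration rather than the Elasticsearch implementation: the helper name `effective_ignore_above` and the dictionary layout of the field mapping are hypothetical.

```python
from typing import Optional


def effective_ignore_above(index_level: Optional[int], field_mapping: dict) -> Optional[int]:
    """Return the limit that applies to one field (hypothetical helper)."""
    # A mapping-level value always wins over the index-level default.
    if "ignore_above" in field_mapping:
        return field_mapping["ignore_above"]
    # Otherwise fall back to the index-level setting, which may be unset.
    return index_level


index_setting = 256
print(effective_ignore_above(index_setting, {"type": "keyword"}))
print(effective_ignore_above(index_setting, {"type": "keyword", "ignore_above": 1024}))
```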
JDK 23 removes the COMPAT locale provider, leaving CLDR as the only option. This commit configures Elasticsearch
to use the CLDR provider when on JDK 23, but still use the existing COMPAT provider when on JDK 22 and below.
This causes some differences in locale behaviour; this also adapts various tests to still work whether run on COMPAT or CLDR.
Rename `xContent.streamSeparator()` and
`RestHandler.supportsStreamContent()` to `xContent.bulkSeparator()` and
`RestHandler.supportsBulkContent()`.
I want to reserve the name `supportsStreamContent` for current work in
the HTTP layer to [support incremental content
handling](https://github.com/elastic/elasticsearch/pull/111438) in addition to
fully aggregated byte buffers. `supportsStreamContent` would indicate
that a handler can parse chunks of HTTP content as they arrive.
Introduces new cluster settings that allow only a certain set of scripts in scripted metrics aggregations:
- search.aggs.only_allowed_metric_scripts, defaults to false
- search.aggs.allowed_inline_metric_scripts, defaults to empty list
- search.aggs.allowed_stored_metric_scripts, defaults to empty list
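One way to read how these three settings interact is sketched below. The exact matching rules are an assumption of this sketch (inline scripts compared by their source text, stored scripts by their id), and `is_script_allowed` is a hypothetical helper, not an Elasticsearch function.

```python
def is_script_allowed(settings: dict, script: dict) -> bool:
    """Decide whether a scripted metric script may run (illustrative only)."""
    # With the switch off (the default), every script is allowed.
    if not settings.get("search.aggs.only_allowed_metric_scripts", False):
        return True
    # Assumption: stored scripts are identified by "id", inline ones by "source".
    if "id" in script:
        allowed = settings.get("search.aggs.allowed_stored_metric_scripts", [])
        return script["id"] in allowed
    allowed = settings.get("search.aggs.allowed_inline_metric_scripts", [])
    return script.get("source") in allowed


settings = {
    "search.aggs.only_allowed_metric_scripts": True,
    "search.aggs.allowed_inline_metric_scripts": ["state.count = 0"],
    "search.aggs.allowed_stored_metric_scripts": ["my_init_script"],
}
print(is_script_allowed(settings, {"source": "state.count = 0"}))
print(is_script_allowed(settings, {"id": "unknown_script"}))
```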
Closes #97032
Adding the ability to set `require_data_stream` parameter (boolean) on bulk and indexing APIs.
For document indexing, this flag requires the indexing operation to either be pointed at a data stream, or match a template that will create a data stream.
* Introduce Prerequisites criteria (Predicate + factory) for modular skip decisions
- Removed accessors to specific criteria from SkipSection (used only on tests), adjusted test assertions
- Moved Features check (YAML test runner features) to SkipSection build time
* Separated check for xpack/no_xpack
Check for xpack is cluster-configuration (modules installed) dependent, while Features are meant to be "static" test-runner capabilities. We separate them so checks on one (test-runner features) can be run before and separately from the other.
* Consolidate skip() methods
- Divide require and skip predicates
- Divide requires and skip parsing (distinct sections)
- Renaming SkipSection to PrerequisiteSection and related methods/fields (e.g. skip -> evaluate)
* Refactoring tests
- moving and adding VersionRange tests
- adding specific version and os skip tests
- modified parse/validate/build to make SkipSection more unit-testable
* Adding cluster feature-based skip criteria
* Updated javadoc + renaming + better skip reason message
The open point in time API accepts a list of indices and opens a point in time view against those indices.
Like we do already for field caps, this commit allows users to provide an index_filter parameter as part of
the request body, that will be used to execute the can match phase and exclude the indices that can't possibly
match such filter.
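As a rough sketch, a client could assemble such a request like this. The index pattern, the `keep_alive` value and the range filter are illustrative assumptions; only the `index_filter` body parameter comes from this change.

```python
import json


def open_pit_request(indices, keep_alive, index_filter):
    """Build the path and JSON body for an open point in time call (sketch)."""
    path = "/" + ",".join(indices) + "/_pit?keep_alive=" + keep_alive
    # The filter travels in the request body, like it does for field caps.
    body = json.dumps({"index_filter": index_filter})
    return path, body


path, body = open_pit_request(
    ["logs-*"],
    "1m",
    {"range": {"@timestamp": {"gte": "now-1d"}}},
)
print("POST", path)
print(body)
```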
Closes #99740
This PR migrates some of the ITs that use either the
`elasticsearch.legacy-java-rest-test` or the
`elasticsearch.legacy-yaml-rest-test` gradle test plugins to the new
`elasticsearch.internal-java-rest-test` and
`elasticsearch.internal-yaml-rest-test` equivalents. This is the list of
the affected ITs:
* SamlAuthenticationIT
* OperatorPrivilegesIT
* ProfileIT
* SetSecurityUserProcessorWithWithSecurityDisabledIT
* AsyncSearchSecurityIT
* SecurityRealmSmokeTestCase
* KibanaSystemIndexIT
* KerberosAuthenticationIT
* ReindexWithSecurityIT and ReindexWithSecurityClientYamlTestSuiteIT
* ReloadSecureSettingsWithPasswordProtectedKeystoreRestIT
* PermissionsIT from slm:qa:with-security
* PermissionsIT from runtime-fields:with-security
* PermissionsIT from ilm:qa:with-security
* GraphWithSecurityIT and GraphWithSecurityInsufficientRoleIT
Related: ES-6751
Another round of automated fixes to this, marking things that can be
made static as static. Saves some JIT cycles but also turns some lambdas
from capturing to non-capturing and makes the "utilityness" of some
classes visible.
Previously we tracked max_score in collapse when requested (track_scores=true)
or when there was no sort in collapse (see PR #27122). But this feature
was lost through refactoring and other changes.
This PR restores this feature.
Closes #97653
Running the `matrix_stats_multi_value_field.yaml` test in multi node
test cluster showed a bug, see: 88758ab577
Also removes `MatrixStats` interface, removed usage of deprecated
ValueType enum and removed unused generic usage.
Relates to #90283
This commit adds a new test framework for configuring and orchestrating
test clusters for both Java and YAML REST testing. This will eventually
replace the existing "test-clusters" Gradle plugin and the build-time
cluster orchestration.
The line number style numbers prefixing the names of the aggregation
tests don't buy us anything. Worse, they've obfuscated that I forgot to
delete two files after merging their contents into the aggregations
module. I've deleted those as part of this PR.
We're going to move all aggregations to the module soon and this saves a
little time in the build by only running the tests one time - in the
aggregations module.
This adds a test for the `top_hits` aggregation using synthetic
`_source`. It works but let's be a bit paranoid here because it's a
whole new fetch phase.....
When an action is denied due to authorization error, the list of
assigned roles is shown in the error message. However, it is possible
that the effective roles are fewer or more than the assigned list:
* Fewer roles can happen when the role is not defined or the license does
  not permit it
* More roles can happen when anonymous access is enabled

This PR changes the error message to show the effective roles instead of
the assigned roles (whenever possible) to help troubleshooting. In
addition, it also reports any missing roles, i.e. roles that are
assigned but cannot be found.
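The distinction between assigned, effective and missing roles can be sketched like this. The message wording and the helper `describe_roles` are hypothetical, and the license case is left out for brevity; this is not the actual error message format.

```python
def describe_roles(assigned, defined, anonymous=()):
    """Summarise effective and missing roles for an error message (sketch)."""
    # Roles that are assigned but cannot be found are reported as missing.
    missing = sorted(r for r in assigned if r not in defined)
    # Effective roles: the assigned roles that exist, plus any anonymous roles.
    effective = sorted({r for r in assigned if r in defined} | set(anonymous))
    message = "effective roles [" + ",".join(effective) + "]"
    if missing:
        message += ", missing roles [" + ",".join(missing) + "]"
    return message


print(describe_roles(["admin", "ghost"], defined={"admin"}, anonymous=["anon"]))
```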
Plumbs through a new parameter for the cardinality aggregation, to allow configuring the execution mode. This can have significant impacts on speed and memory usage. This PR exposes three collection modes and two heuristics that we can tune going forward. All of these are treated as hints and can be silently ignored, e.g. if not applicable to the given field type. I've changed the default behavior to optimize for time, which potentially uses more memory. Users can override this to get the old behavior if needed.
Synthetic source has a habit of reordering text fields. This frustrates
highlighting because it *often* wants to use index structures to find
the offsets to values in the field. This disables the FVH highlighter
for multi-valued text fields when synthetic source is enabled and runs
the unified highlighter in "analyze" mode when synthetic source is
enabled. That's *enough* to stop them from spitting out wrong answers.
We might be leaving some performance on the table when the unified
highlighter works on a single valued text field that is indexed with
offsets or term vectors. We don't really expect that to be common at all
though because *generally* folks will enable synthetic source to save
space and adding offsets or term vectors is quite space inefficient. If
it comes up, we might be able to improve here.
This fixes references to `project` that make the plugin incompatible with the Gradle
configuration cache. We also remove the custom `xpackProject` utility:
* Using `xpackProject` in certain situations can break configuration cache compatibility, as it uses a mutable project object under the hood, which is discouraged in some use cases (e.g. at execution time).
* It always breaks compatibility with `--configure-on-demand`.
* `xpackProject` uses the project object of the `:x-pack` project. Referencing other project objects from a subproject should be avoided where possible to decouple subproject configurations. There's a good explanation of why we want to decouple our project configurations as much as possible here: https://docs.gradle.org/current/userguide/multi_project_configuration_and_execution.html#sec:decoupled_projects
* It adds little value over the default out-of-the-box Gradle API (just use `project(':x-pack:someProject')` instead of `xpackProject('someProject')`). On some occasions the plain path is even shorter, e.g. when this is used as `xpackProject('someProject').path` instead of just passing `:x-pack:someProject`.

I'll try to put a bit more context in the PR description in the future to make the motivation behind these kinds of changes clearer upfront.
Related to #57918
Painless execute allows users to validate their scripts. Some of the supported script contexts
support providing a sample document as well as an index to pull the mappings from.
The painless execute API requires cluster admin privileges today, and while that's ok for the contexts that
don't support providing an index, it is not ideal when an index is provided. In fact, users can run scripts
as part of the search API, which requires only the indices/read privilege on the indices that the user
is reading from.
This commit maps the painless execute action to an indices/read action when an index is specified, so that in
that case the same privileges as a search action will be requested to run painless execute.
Relates to #48856
Closes #86428
This attempts to shrink the index by implementing a "synthetic _source" field.
You configure it in the mapping:
```
{
"mappings": {
"_source": {
"synthetic": true
}
}
}
```
And we just stop storing the `_source` field - kind of. When you go to access
the `_source` we regenerate it on the fly by loading doc values. Doc values
don't preserve the original structure of the source you sent so we have to
make some educated guesses. And we have a rule: the source we generate would
result in the same index if you sent it back to us. That way you can use it
for things like `_reindex`.
Fetching the `_source` from doc values does slow down loading somewhat. See
numbers further down.
## Supported fields
This only works for the following fields:
* `boolean`
* `byte`
* `date`
* `double`
* `float`
* `geo_point` (with precision loss)
* `half_float`
* `integer`
* `ip`
* `keyword`
* `long`
* `scaled_float`
* `short`
* `text` (when there is a `keyword` sub-field that is compatible with this feature)
## Educated guesses
The synthetic source generator makes `_source` fields that are:
* sorted alphabetically
* as "objecty" as possible
* pushes all arrays to the "leaf" fields
* sorts most array values
* removes duplicate text and keyword values
These are mostly artifacts of how doc values are stored.
### sorted alphabetically
```
{
"b": 1,
"c": 2,
"a": 3
}
```
becomes
```
{
"a": 3,
"b": 1,
"c": 2
}
```
### as "objecty" as possible
```
{
"a.b": "foo"
}
```
becomes
```
{
"a": {
"b": "foo"
}
}
```
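The "objecty" rule can be illustrated with a small sketch that expands dotted field names into nested objects. The function `expand_dotted` is hypothetical and ignores conflicts such as a field being both a value and an object.

```python
def expand_dotted(flat: dict) -> dict:
    """Turn {"a.b": v} into {"a": {"b": v}} (illustrative sketch)."""
    nested = {}
    for path, value in flat.items():
        node = nested
        *parents, leaf = path.split(".")
        # Walk (and create) one object per path segment before the leaf.
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


print(expand_dotted({"a.b": "foo"}))
```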
### pushes all arrays to the "leaf" fields
```
{
"a": [
{
"b": "foo",
"c": "bar"
},
{
"c": "bort"
},
{
"b": "snort"
}
]
}
```
becomes
```
{
"a": {
"b": ["foo", "snort"],
"c": ["bar", "bort"]
}
}
```
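A simplified sketch of this transformation follows. It only handles arrays of objects and does not flatten nested arrays of scalars; `push_arrays_to_leaves` is a hypothetical name, not the generator's actual code.

```python
def push_arrays_to_leaves(value):
    """Merge an array of objects into one object whose leaves are arrays."""
    if isinstance(value, dict):
        return {k: push_arrays_to_leaves(v) for k, v in value.items()}
    if isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
        merged = {}
        # Collect every value seen for a key across all objects in the array.
        for obj in value:
            for k, v in obj.items():
                merged.setdefault(k, []).append(v)
        # A key seen only once stays a single value instead of a one-element array.
        return {
            k: push_arrays_to_leaves(vs if len(vs) > 1 else vs[0])
            for k, vs in merged.items()
        }
    return value


doc = {"a": [{"b": "foo", "c": "bar"}, {"c": "bort"}, {"b": "snort"}]}
print(push_arrays_to_leaves(doc))
```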
### sorts most array values
```
{
"a": [2, 3, 1]
}
```
becomes
```
{
"a": [1, 2, 3]
}
```
### removes duplicate text and keyword values
```
{
"a": ["bar", "baz", "baz", "baz", "foo", "foo"]
}
```
becomes
```
{
"a": ["bar", "baz", "foo"]
}
```
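The last two rules can be sketched together. This assumes that only text and keyword values are de-duplicated while numeric duplicates are kept, which the list above implies but does not state outright; `normalize_leaf` is a hypothetical helper.

```python
def normalize_leaf(values, dedupe):
    """Sort leaf values and optionally drop duplicates (illustrative sketch)."""
    ordered = sorted(values)
    if not dedupe:
        return ordered
    result = []
    # After sorting, duplicates are adjacent, so compare with the last kept value.
    for v in ordered:
        if not result or result[-1] != v:
            result.append(v)
    return result


print(normalize_leaf([2, 3, 1], dedupe=False))
print(normalize_leaf(["bar", "baz", "baz", "baz", "foo", "foo"], dedupe=True))
```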
## `_recovery_source`
Elasticsearch's shard "recovery" process needs `_source` *sometimes*. So does
cross cluster replication. If you disable source or filter it somehow we store
a `_recovery_source` field for as long as the recovery process might need it.
When everything is running smoothly that's generally a few seconds or minutes.
Then the field is removed on merge. This synthetic source feature continues
to produce `_recovery_source` and relies on it for recovery. It's *possible*
to synthesize `_source` during recovery but we don't do it.
That means that synthetic source doesn't speed up writing the index. But in the
future we might be able to turn this on to trade writing less data at index
time for slower recovery and cross cluster replication. That's an area of
future improvement.
## perf numbers
I loaded the entire tsdb data set with this change and compared the size:
```
standard -> synthetic
store size 31.0 GB -> 7.0 GB (77.5% reduction)
_source 24695.7 MB -> 47.6 MB (99.8% reduction - synthetic is in _recovery_source)
```
A second _forcemerge a few minutes after rally finishes should remove the
remaining 47.6MB of _recovery_source.
With this fetching source for 1,000 documents seems to take about 500ms. I
spot checked a lot of different areas and haven't seen any different hit. I
*expect* this performance impact is based on the number of doc values fields
in the index and how sparse they are.
Using a double as a return value works only if the field we are
sorting on is a number. If the field is not a value we can convert
to a double, like a non-numeric keyword, converting it to a number
returns `NaN`. Without this patch, sorting takes place on the bucket
key, if the order field points to a non-numeric value. The additional
bucket key comparator is implicitly added as a tie breaker to avoid
non-deterministic sorting of buckets.
With this change we support sorting using any subclass of SortValue.
This means the bucket key will be used just in case of equal values
on the order field.
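The resulting ordering can be sketched as a sort on the order value with the bucket key as a tie breaker. The bucket layout below is a hypothetical stand-in, not the aggregation framework's classes.

```python
def sort_buckets(buckets):
    """Order buckets by their order value, then by key to break ties."""
    # The order value may be any comparable type, not only a double.
    return sorted(buckets, key=lambda b: (b["order"], b["key"]))


buckets = [
    {"key": "b", "order": 2.0},
    {"key": "a", "order": 2.0},
    {"key": "c", "order": 1.0},
]
print([b["key"] for b in sort_buckets(buckets)])
```

Because the key only participates when the order values are equal, the result stays deterministic without overriding the requested order.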
Issue: #78506
The ESClientYamlSuiteTestCase is used to run yaml tests throughout
Elasticsearch. It utilizes the low level rest client in sniffing for
nodes, but the sniffer is not needed anywhere else in the test
framework.
This commit creates a new project, `:test:rest-runner` which is meant to
house the rest test running infrastructure. This has two purposes. First
is to remove the sniffer from the test framework dependencies, because
it transitively depends on Jackson. Second is to set up the runner for
future refactorings where it could be made to not depend on the entire
test framework, though how that could work is left for the future.
This adds a new sampling aggregation that performs a background sampling over all documents in an index.
The syntax is as follows:
```
{
"aggregations": {
"sampling": {
"random_sampler": {
"probability": 0.1
},
"aggs": {
"price_percentiles": {
"percentiles": {
"field": "taxful_total_price"
}
}
}
}
}
}
```
This aggregation provides fast random sampling over the entire document set in order to speed up costly aggregations.
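The idea behind probability-based sampling can be sketched in a few lines. This toy example is not the aggregation's implementation: it keeps each document with probability `p` and, as an assumption of this sketch, extrapolates the count by dividing by `p`.

```python
import random


def estimated_count(num_docs, probability, seed):
    """Estimate a document count from a random sample (toy sketch)."""
    rng = random.Random(seed)
    # Keep each document independently with the configured probability.
    kept = sum(1 for _ in range(num_docs) if rng.random() < probability)
    # Scale the sampled count back up to the full document set.
    return kept / probability


estimate = estimated_count(100_000, 0.1, seed=42)
print(estimate)
```

Only about a tenth of the documents are visited by the downstream computation here, which is where the speedup comes from; the price is a sampling error that shrinks as the number of sampled documents grows.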
Testing this over a variety of aggregations and data sets, the median speedup when sampling at `0.001` over millions of documents is around 70X.
Relative error rate does rely on the size of the data and the aggregation kind. Here are some typically expected numbers when sampling over 10s of millions of documents. `p` is the configured probability and `n` is the number of documents matched by your provided filter query.
This PR introduces the lookup runtime fields which are used to retrieve
data from the related indices. The below search request enriches its
search hits with the location of each IP address from the `ip_location`
index.
```
POST logs/_search
{
"runtime_mappings": {
"location": {
"type": "lookup",
"lookup_index": "ip_location",
"query_type": "term",
"query_input_field": "ip",
"query_target_field": "_id",
"fetch_fields": [
"country",
"city"
]
}
},
"fields": [
"timestamp",
"message",
"location"
]
}
```
Response:
```
{
"hits": {
"hits": [
{
"_index": "logs",
"_id": "1",
"fields": {
"location": [
{
"city": [ "Montreal" ],
"country": [ "Canada" ]
}
],
"message": [ "the first message" ]
}
}
]
}
}
```
Many consumers of the field caps API need to do some post-processing of the
results before they can use them; for instance, Kibana would like to exclude
multifields from certain field selections, or would like to display only geo_point
fields in Maps. ML and QL consumers exclude nested fields in certain
circumstances. This post-processing is possible at the moment, but can be
hacky; and in all cases it involves sending the whole (possibly very large) field
caps response over the wire and then whittling it down in the client. It is also not
guaranteed to be accurate - runtime fields may be incorrectly classified as multifields,
for example.
This commit pushes filtering into elasticsearch itself, reducing the amount of data
that needs to be transported and ensuring better accuracy. The field caps API gets
two new parameters:
* filters - a comma-delimited list that may contain any combination of: `+metadata`,
`-metadata`, `-nested`, `-parent`, `-multifield`
* types - a comma-delimited list of field types; only fields that have a type in this set
will be returned
The API will make best-effort attempts to apply the filters post-hoc to responses from
older nodes, so this should still work in a mixed-cluster or cross-cluster situation.
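A request using the two new parameters could be assembled as in the sketch below. The `fields=*` selection and the chosen filter and type values are illustrative assumptions; `field_caps_path` is a hypothetical helper.

```python
from urllib.parse import urlencode


def field_caps_path(fields, filters, types):
    """Build a field caps request path with the new parameters (sketch)."""
    query = urlencode(
        {
            "fields": ",".join(fields),
            "filters": ",".join(filters),
            "types": ",".join(types),
        },
        # Keep commas and wildcards readable; "+" is still percent-encoded.
        safe=",*",
    )
    return "/_field_caps?" + query


print(field_caps_path(["*"], ["-nested", "-multifield"], ["keyword", "geo_point"]))
```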
Fixes #82966, #72174
- Add `es.index_mode_feature_flag_registered` feature flag to data-streams module's internalClusterTest task.
- Add `es.random_sampler_feature_flag_registered` feature flag to xpack rest tests with security qa module.
Closes #83722
Runtime fields do not support `time_series_dimension` and
`time_series_metric`, which causes the tsdb test cases to fail. tsdb
indices also require the `@timestamp` field.
So I improved the `runtimeifyMappingProperties` method logic and added some
skip rules:
- skip `time_series_dimension` field.
- skip `time_series_metric` field.
- skip `@timestamp` field.
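The skip rules can be sketched as a filter over the mapping properties. This is an illustration in Python, not the actual `runtimeifyMappingProperties` code; the helper name and the choice to test for the mere presence of the two attributes are assumptions.

```python
def runtimeify_candidates(properties: dict) -> dict:
    """Return only the fields that may be turned into runtime fields (sketch)."""
    kept = {}
    for name, mapping in properties.items():
        # tsdb indices require a real @timestamp field.
        if name == "@timestamp":
            continue
        # Runtime fields support neither dimensions nor metrics.
        if "time_series_dimension" in mapping or "time_series_metric" in mapping:
            continue
        kept[name] = mapping
    return kept


properties = {
    "@timestamp": {"type": "date"},
    "host": {"type": "keyword", "time_series_dimension": True},
    "cpu": {"type": "double", "time_series_metric": "gauge"},
    "message": {"type": "keyword"},
}
print(list(runtimeify_candidates(properties)))
```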
This PR fixes the failing test in
https://github.com/elastic/elasticsearch/issues/83431
This commit changes the superuser role (as used by the "elastic"
builtin user) so that it no longer has any sort of write access to
restricted indices (system indices).
This improves the safety and security of the cluster, as it means
that there are no out-of-the-box users or roles that can write to,
delete or close the security index.
Superusers can still read from (and monitor) system indices.
Other roles (and users) can still access system indices as specified
in their descriptor. These can be custom such as the
"_es_test_root" role used in the integration test suite, or builtin
roles such as kibana_system.
[JEP 361](https://openjdk.java.net/jeps/361) added support for switch expressions
which can be much more terse and less error-prone than switch statements.
Another useful feature of switch expressions is exhaustiveness: we can make
sure that an enum switch expression covers all the cases at compile time.
The ES code base is quite JSON heavy. It uses a lot of multi-line JSON requests in tests which need to be escaped and concatenated which in turn makes them hard to read. Let's try to leverage Java 15 text blocks for representing them.
Fix the split package org.elasticsearch.common.xcontent, between server and the x-content lib. Move the x-content lib exported package from org.elasticsearch.common.xcontent to org.elasticsearch.xcontent (following the naming convention of similar libraries). Removing split packages is a prerequisite to modularization.