We have instances where BWC tests configure old-version ES nodes with
the integTest distribution. This isn't a valid configuration, and while
in reality we resolve the default distribution artifact, other
configuration logic behaves differently based on whether the integTest
distro was _requested_; specifically, what ES_JAVA_HOME should be set
to. This bug resulted in us attempting to run old nodes with the
current bundled JDK version, which may be incompatible with that older
version of Elasticsearch.
Closes #104858
If we proceed without waiting for pages, we might cancel the main
request before starting the data-node request. As a result, the exchange
sinks on the data nodes won't be removed until `inactive_timeout` elapses,
which is longer than the `assertBusy` timeout.
Closes #106443
I investigated a heap attack test failure and found that an ESQL request
was stuck. This occurred in the following scenario:
1. The ExchangeSource on the coordinator was blocked on reading because
there were no available pages.
2. Meanwhile, the ExchangeSink on the data node had pages ready for
fetching.
3. When an exchange request tried to fetch pages, it failed due to a
CircuitBreakingException. Despite the failure, no cancellation was
triggered because the status of the ExchangeSource on the coordinator
remained unchanged.
To fix this issue, this PR introduces two changes:
1. Resume the ExchangeSourceOperator and Driver on the coordinator,
eventually allowing the coordinator to trigger cancellation of the
request when it fails to fetch pages.
2. Ensure that an exchange sink on a data node fails when the data-node
request is cancelled (a sketch of this follows below). This callback was
inadvertently omitted when introducing the node-level reduction in
"Run empty reduction node level on data nodes" (#106204).
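Below is a minimal sketch of the second change, using hypothetical names (ExchangeSinkSketch, DataNodeRequestSketch) rather than the real compute-service classes: the point is simply that cancelling the data-node request must also fail its exchange sink, so the coordinator side is unblocked instead of idling until `inactive_timeout`.
```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for an exchange sink: completes normally when all
// pages are consumed, exceptionally when the sink is failed.
final class ExchangeSinkSketch {
    private final CompletableFuture<Void> completion = new CompletableFuture<>();

    void fail(Exception reason) {
        completion.completeExceptionally(reason);
    }

    CompletableFuture<Void> completion() {
        return completion;
    }
}

// Hypothetical stand-in for the data-node request/task.
final class DataNodeRequestSketch {
    private Runnable onCancelled = () -> {};

    void whenCancelled(Runnable listener) {
        this.onCancelled = listener;
    }

    void cancel() {
        onCancelled.run();
    }
}

final class WireUpSketch {
    // The callback that was missing: a cancelled data-node request now fails
    // its exchange sink instead of leaving it waiting for a fetch that never comes.
    static void wire(DataNodeRequestSketch request, ExchangeSinkSketch sink) {
        request.whenCancelled(() -> sink.fail(new RuntimeException("data-node request cancelled")));
    }
}
```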
I plan to spend some time hardening the exchange and compute services.
Closes #106262
The tests that assert that sorting on spatial types produces consistent error messages were also flaky in the non-error cases, under rare circumstances where the results were returned in a different order. We now sort those on a sortable field for deterministic behaviour.
Some index requests target specific shard IDs, which may not match the indices the request targets as reported by `IndicesRequest#indices()`. This requires a different interception strategy in order to make sure those requests are handled correctly in all cases and that any malformed messages are caught early to aid in troubleshooting.
This PR adds an interface allowing requests to report the shard IDs they target as well as the index names, and adjusts the interception of those requests as appropriate to handle those shard IDs in the cases where they are relevant (see the sketch below).
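The following is only an illustrative sketch of such an interface; the names here (ShardTargetingRequestSketch, ShardIdSketch) are hypothetical, not the ones added by this PR.
```java
import java.util.List;

// Hypothetical shard target: index name, index UUID, and shard number.
record ShardIdSketch(String indexName, String indexUuid, int shardId) {}

// Requests that target concrete shards can report them in addition to the
// index names, so interception logic no longer has to rely on indices() alone.
interface ShardTargetingRequestSketch {
    String[] indices();           // index names, as reported today

    List<ShardIdSketch> shards(); // concrete shard targets, possibly pointing
                                  // at indices not present in indices()
}
```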
* Fix error on sorting unsortable geo_point and cartesian_point
Without a LIMIT the correct error was produced, but with LIMIT it was not. This fix mimics the same error with LIMIT and adds tests for all three scenarios:
* Without limit
* With limit
* From row with limit
* Update docs/changelog/106351.yaml
* Add tests for geo_shape and cartesian_shape also
* Updated changelog
* Separate point and shape error messages
* Move error to later so we get it only if geo field is actually used in sort.
* Implemented planner check in Verifier instead
This is a much better solution (see the sketch after this list).
* Revert previous solution
* Also check non-field attributes so the same error is provided for ROW
* Changed "can't" to "cannot"
* Add unit tests for verifier error
* Added sort limitations to documentation
* Added unit tests for spatial fields in VerifierTests
* Don't run the new yaml tests on older versions
These tests mostly test the validation errors which were changed only in 8.14.0, so should not be tested in earlier versions.
* Simplify check based on code review, skip duplicate forEachDown
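As a rough illustration of the Verifier-style check (hypothetical names, not the actual ESQL Verifier code), the idea is to reject spatial sort keys during verification, so the same error is produced with or without a LIMIT, and for ROW as well as for field attributes:
```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

enum TypeSketch { GEO_POINT, CARTESIAN_POINT, GEO_SHAPE, CARTESIAN_SHAPE, LONG, KEYWORD }

record SortKeySketch(String name, TypeSketch type) {}

final class SpatialSortCheckSketch {
    private static final Set<TypeSketch> UNSORTABLE = Set.of(
        TypeSketch.GEO_POINT, TypeSketch.CARTESIAN_POINT,
        TypeSketch.GEO_SHAPE, TypeSketch.CARTESIAN_SHAPE);

    // One failure per spatial sort key; because this runs during verification,
    // the message is the same whether or not a LIMIT follows the SORT.
    static List<String> check(List<SortKeySketch> sortKeys) {
        List<String> failures = new ArrayList<>();
        for (SortKeySketch key : sortKeys) {
            if (UNSORTABLE.contains(key.type())) {
                failures.add("cannot sort on " + key.type() + " [" + key.name() + "]");
            }
        }
        return failures;
    }
}
```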
* Add regression tests for ACS and entity id mismatch, causing
us to go into the initCause branch
* Fix up exception creation: initCause is not
allowed because ElasticsearchException
already initialises the cause to `null` if
it isn't passed as a constructor param.
Signed-off-by: lloydmeta <lloydmeta@gmail.com>
Since `look_ahead_time` is set to 1 minute, the `assertBusy` loop needs to
wait longer than that to get a read-only backing index.
Note that this is only relevant when the `UpdateTimeSeriesRangeService`
kicks in to bump the end time of the head index. This is rare (it runs
every 10 minutes) but can happen.
Fixes #101428
* During ML maintenance, reset jobs in the reset state without a corresponding task.
* Update docs/changelog/106062.yaml
* Fix race condition in MlDailyMaintenanceServiceTests
* Fix log level
`elasticsearch-certutil csr` generates a private key and a certificate
signing request (CSR) file. It has always accepted the `--pass` command-line
option, but ignored it and always generated an unencrypted private key.
This commit fixes the utility so the `--pass` option is respected and the
private key is encrypted.
This computation involves parsing all the pipeline metadata on the
cluster applier thread. It's pretty expensive if there are lots of
pipelines, and seems mostly unnecessary because it's only needed for a
validation check when creating new processors.
* Reset job if existing reset fails (#106020)
* Try again to reset a job if waiting for completion of an existing reset task fails.
* Update docs/changelog/106020.yaml
* Update 106020.yaml
* Update docs/changelog/106020.yaml
* Improve code
* Trigger rebuild
Previously the `categorize_text` aggregation could throw an
exception if nested as a sub-aggregation of another aggregation
that produced empty buckets at the end of its results. This
change avoids this possibility.
Fixes #105836
* ESQL: fix single valued query tests (#105986)
In some cases the tests for our Lucene query that makes sure a field is
single-valued were asserting incorrect things about the stats that come
from the query. That was failing the test from time to time. This fixes
the assertion in those cases.
Closes #105918
* ESQL: Reenable svq tests
We fixed the test failure in #105986 but this snuck in.
Closes #105952
We seem to have a couple of checks to make sure we delete the data
stream when the last index reaches the delete step; however, these checks
seem a bit contradictory.
Namely, the first check makes use of `Index` equality (UUID included)
and the second just checks the index name. So if a data stream with just
one index (the write index) is restored from snapshot (different UUID),
we would've failed the first index equality check, gone through the
second check `dataStream.getWriteIndex().getName().equals(indexName)`,
and failed the delete step (in a non-retryable way :( ) because we don't
want to delete the write index of a data stream (but we really do if the
data stream has only one index).
This PR makes 2 changes (sketched below):
1. Use index name equality everywhere in the step (we already looked up
the index abstraction and the parent data stream, so we know for sure the
managed index is part of the data stream).
2. Do not throw an exception when we got here via a write index that is
NOT the last index in the data stream, but report the exception so we
keep retrying this step (i.e. this enables our users to simply execute a
manual rollover, and the index is eventually deleted by ILM on retry).
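A rough sketch of the adjusted step logic, with hypothetical names standing in for the real ILM step code:
```java
final class DeleteStepSketch {
    /**
     * Returns null when the index can be deleted, or a reason to report and
     * retry later.
     */
    static String checkCanDelete(String indexName, String writeIndexName, int backingIndexCount) {
        // change 1: compare by name; a snapshot-restored backing index keeps
        // its name but gets a new UUID, so Index (name + UUID) equality is too strict
        boolean isWriteIndex = writeIndexName.equals(indexName);
        if (isWriteIndex && backingIndexCount > 1) {
            // change 2: not a terminal failure; report it and retry, so a
            // manual rollover eventually lets ILM delete this index
            return "index [" + indexName + "] is the write index of its data stream; waiting for rollover";
        }
        return null; // sole backing index or not the write index: safe to delete
    }
}
```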
The heap attack tests hit OOM where the circuit breaker was
under-accounted. This was because the ProjectOperator retained
references to released blocks. Consequently, the released blocks couldn't
be GCed even though we had already decreased memory usage in the circuit breaker.
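A simplified illustration of the problem and the fix (this is not the real ProjectOperator, and a plain List stands in for a Page of Blocks): keeping a reference to the consumed input keeps every block reachable, even blocks whose memory has already been returned to the breaker.
```java
import java.util.ArrayList;
import java.util.List;

final class ProjectSketch {
    // before the fix: this field kept referencing blocks that had already been
    // released, preventing them from being garbage collected
    private List<long[]> pending;

    void addInput(List<long[]> page) {
        pending = page;
    }

    List<long[]> getOutput(int[] keep) {
        List<long[]> out = new ArrayList<>();
        for (int channel : keep) {
            out.add(pending.get(channel));
        }
        pending = null; // the fix, in spirit: drop the reference once the input is consumed
        return out;
    }
}
```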
Relates #10563
* ESQL: Fix wrong attribute shadowing in pushdown rules (#105650)
Fix https://github.com/elastic/elasticsearch/issues/105434
Fixes accidental shadowing when pushing down `GROK`/`DISSECT`, `EVAL` or
`ENRICH` past a `SORT`.
Example of how this works:
```
...
| SORT x
| EVAL x = y
...
```
Pushing this down just like that would be incorrect, as `x` is used in the `SORT`, so we turn this essentially into:
```
...
| EVAL $$x = x
| EVAL x = y
| SORT $$x
| DROP $$x
...
```
The same logic is applied to `GROK`/`DISSECT` and `ENRICH`.
This allows us to re-enable the dependency checker (after fixing a small
bug in it when handling `ENRICH`).
* Make OptimizerRules compile again
ILM transitions to `wait-for-index-color` (a step that needs a cluster
state changed event to evaluate against) but misses the cluster state
event that notifies that `partial-index` is now `GREEN`. The cluster is
then quiet, no more state changes occur, and we time out. Note that the
test is unblocked by the teardown of the IT, which triggers some cluster
state changes.
This fixes the test by issuing some empty `reroute` requests to cause
some cluster state traffic in the cluster, so that ILM notices the index
is assigned.
Note that a production cluster is busy and ILM would eventually notice
the new state and make progress.
```
[2024-02-22T06:33:01,388][INFO ][o.e.x.i.IndexLifecycleTransition] [node_t0] moving
index [index] from [{"phase":"frozen","action":"searchable_snapshot","name":"mount-snapshot"}] to
[{"phase":"frozen","action":"searchable_snapshot","name":"wait-for-index-color"}] in policy [policy]
[2024-02-22T06:33:01,490][INFO ][o.e.c.r.a.AllocationService] [node_t0] current.health="GREEN"
message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started
[[partial-index][0]]])." previous.health="YELLOW" reason="shards started [[partial-index][0]]"
```
Fixes #102405
For Unattended Transforms, if we fail to create the destination index on
the first run, we will retry the transformation iteration, but we will
not retry the destination index creation on that next iteration.
This change stops the Unattended Transform from progressing beyond the
0th checkpoint, so all retries will include the destination index
creation.
Fix #105683
Relates #104146
It seems that the changes of https://github.com/elastic/ml-cpp/pull/2585
combined with the randomness of the test could cause it to fail
very occasionally, and by a tiny percentage over the expected
upper bound. This change reenables the test by very slightly
increasing the upper bound.
Fixes #105347
Currently, there is a small chance that testStopAtCheckpoint will fail
to correctly count the number of times `doSaveState` is invoked:
```
Expected: <5>
but: was <4>
```
There are two potential issues:
1. The test thread starts the Transform thread, which starts a Search
thread. If the Search thread starts reading from the
`saveStateListeners` while the test thread writes to the
`saveStateListeners`, then there is a chance our testing logic will
not be able to count the number of times we read from
`saveStateListeners`.
2. The non-volatile integer may be read as one value and written as
another value.
Two fixes (sketched below):
1. The test thread blocks the Transform thread until after the test
thread writes all the listeners. The subsequent test will
continue to verify that we can safely interlace reading and
writing.
2. The counter is now an AtomicInteger to provide thread safety.
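A minimal sketch of the two fixes with hypothetical names (this is not the actual transform test code): a latch keeps the transform thread from reading the listeners until the test thread has finished writing them, and the counter is an AtomicInteger so increments are not lost between threads.
```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

final class StopAtCheckpointSketch {
    private final List<Runnable> saveStateListeners = new ArrayList<>();
    private final CountDownLatch listenersWritten = new CountDownLatch(1);
    private final AtomicInteger saveStateCalls = new AtomicInteger(); // fix 2: thread-safe counter

    // test thread
    void registerListeners(int count) {
        for (int i = 0; i < count; i++) {
            saveStateListeners.add(saveStateCalls::incrementAndGet);
        }
        listenersWritten.countDown(); // fix 1: only now may the transform thread proceed
    }

    // transform thread
    void drainListeners() throws InterruptedException {
        listenersWritten.await();
        for (Runnable listener : saveStateListeners) {
            listener.run();
        }
    }

    int saveStateCallCount() {
        return saveStateCalls.get();
    }
}
```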
Fixes #90549