Commit graph

7544 commits

Author SHA1 Message Date
Mary Gouseti
253f1f430f
[7.17] [ILM] More resilient when a policy is added to searchable snapshot (#102741) (#103070)
* Backport #102741
2023-12-06 20:08:15 +02:00
Ignacio Vera
33cbb8ee23
Add more logging to the real memory circuit breaker and lower minimum interval (#102396) (#102443)
It lowers the minimumInterval from 5000ms to 500ms, as we observed a high number of CBEs
(circuit-breaker exceptions) when there are bursts of allocations on newer JDKs.
2023-11-22 02:44:30 -05:00
David Turner
4992962f19
[7.17] Unwrap exception more tenaciously in testQueuedOperationsAndBrokenRepoOnMasterFailOver (#102352) (#102368)
* Unwrap exception more tenaciously in testQueuedOperationsAndBrokenRepoOnMasterFailOver (#102352)

There can be more than 10 layers of wrapping RTEs, see #102351. As a
workaround to address the test failure, this commit just manually
unwraps them all.

Closes #102348

* Fixup
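
The workaround amounts to walking the whole cause chain instead of unwrapping a fixed number of layers. A minimal sketch of the idea (the helper name is illustrative, not the actual test code):

```java
/** Walk the cause chain, stripping every wrapping RuntimeException. */
static Throwable unwrapAll(Throwable t) {
    // A fixed unwrap count (e.g. 10 layers) can be defeated by deeper
    // nesting; looping until the wrappers run out cannot.
    while (t instanceof RuntimeException && t.getCause() != null) {
        t = t.getCause();
    }
    return t;
}
```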
2023-11-20 04:08:45 -05:00
Lorenzo Dematte
1a67f09021 Bump versions after 7.17.15 release 2023-11-14 08:29:05 +01:00
David Turner
5ab8599ab1
Protect NodeConnectionsService from stale conns (#101988)
A call to `ConnectionTarget#connect` which happens strictly after all
calls that close connections should leave us connected to the target.
However concurrent calls to `ConnectionTarget#connect` can overlap, and
today this means that a connection returned from an earlier call may
overwrite one from a later call. The trouble is that the earlier
connection attempt may yield a closed connection (it was concurrent with
the disconnections) so we must not let it supersede the newer one.

With this commit we prevent concurrent connection attempts, so an
earlier attempt can no longer overwrite the connection resulting from a
later one.

Backport of #92558
When combined with #101910, closes #100493
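
The fix can be pictured as a per-target single-flight guard: while one attempt is in flight, later callers queue behind it rather than race it, so a stale connection can never overwrite a fresh one. A simplified sketch, not the actual `ConnectionTarget` code:

```java
import java.util.ArrayList;
import java.util.List;

class ConnectionTarget {
    private boolean connecting;                        // guarded by 'this'
    private final List<Runnable> pendingListeners = new ArrayList<>();

    void connect(Runnable onConnected) {
        synchronized (this) {
            pendingListeners.add(onConnected);
            if (connecting) {
                return;        // an attempt is already running: wait for it
            }                  // instead of starting a racing second attempt
            connecting = true;
        }
        doConnect();
    }

    private void doConnect() {
        // ... open the connection asynchronously; on completion:
        List<Runnable> toNotify;
        synchronized (this) {
            connecting = false;
            toNotify = new ArrayList<>(pendingListeners);
            pendingListeners.clear();
        }
        toNotify.forEach(Runnable::run);
    }
}
```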
2023-11-09 16:13:03 -05:00
David Turner
7d975abbd0
Delay Connection#onRemoved while pending (#101910)
Today we call `Transport.Connection#onRemoved`, notifying any
removed-listeners, when the connection is closed and removed from the
`connectedNodes` map. However, it's possible for the connection to be
closed while we're still adding it to the map and setting up the
listeners, so this now-dead connection will still be found in the
`pendingConnections` and may be returned to a future call to
`connectToNode` even if this call was made after all the
removed-listeners have been called.

With this commit we delay calling the removed-listeners until the
connection is closed and removed from both the `connectedNodes` and
`pendingConnections` maps.

Backport of #92546 to 7.17
Relates #100493
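
Conceptually, the change reorders teardown so removed-listeners fire only after the connection is unreachable via either map. A self-contained sketch of that ordering (map names follow the description, not the actual TransportService code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ConnectionRegistry {
    interface Connection { void onRemoved(); }

    private final Map<String, Connection> connectedNodes = new ConcurrentHashMap<>();
    private final Map<String, Connection> pendingConnections = new ConcurrentHashMap<>();

    void onConnectionClosed(String nodeId, Connection connection) {
        // Remove from *both* maps before notifying removed-listeners, so a
        // later connectToNode call can never be handed this dead connection
        // after the listeners have already run.
        connectedNodes.remove(nodeId, connection);
        pendingConnections.remove(nodeId, connection);
        connection.onRemoved();
    }
}
```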
2023-11-09 19:53:16 +00:00
David Turner
e573c1d385
Fail listener on exception in TcpTransport#openConnection (#101907) (#101955)
Today `TcpTransport#openConnection` may throw exceptions on certain
kinds of failure, but other kinds of failure are passed to the listener.
This is trappy and not all callers handle it correctly. This commit
makes sure that all exceptions are passed to the listener.

Closes #100510
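
The trap and its fix can be shown with a plain callback interface (hypothetical names; the real code uses Elasticsearch's ActionListener):

```java
class Transport {
    interface Listener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    <T> void openConnection(String node, Listener<T> listener) {
        try {
            // validation and setup here might throw synchronously ...
            doOpenConnection(node, listener);
        } catch (Exception e) {
            // ... so deliver *every* failure through the listener; callers
            // then need exactly one error-handling path, not two.
            listener.onFailure(e);
        }
    }

    private <T> void doOpenConnection(String node, Listener<T> listener) {
        // asynchronous connect that completes the listener itself
    }
}
```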
2023-11-09 08:09:11 -05:00
David Turner
4d7f28961f
Fail cancelled CS requests without redundant wait for state update (#101905)
Just fail the request right away if it got cancelled.

Backports #96869 to 7.17. Closes #100671

Co-authored-by: Armin Braun <me@obrown.io>
2023-11-08 07:00:12 -05:00
Volodymyr Krasnikov
5061954284
Fix race condition in SnapshotsService (#101652) (#101688)
* Fix race condition in SnapshotsService

* Update docs/changelog/101652.yaml
2023-11-01 13:30:24 -07:00
Ignacio Vera
56f8e477a7
Add tolerance to ExtendedStatsAggregatorTests#testSummationAccuracy (#100917) (#100939) 2023-10-17 02:47:25 -04:00
Dianna Hohensee
5a8b6fc972
[7.17] Stabilize testRerouteRecovery throttle testing (#100788) (#100858)
Refactor testRerouteRecovery, pulling out testing of shard recovery
throttling into separate targeted tests. Now there are two additional
tests, one testing source node throttling, and another testing target
node throttling. Throttling both nodes at once leads primarily to the
source node registering throttling, while the target node mostly has
no cause to instigate throttling.

(cherry picked from commit 323d9366df)
2023-10-16 09:03:15 -04:00
Yang Wang
9e7713a866
[7.17] Log a debug level message for deleting non-existing snapshot (#100479) (#100509)
* Log a debug level message for deleting non-existing snapshot (#100479)

The new message helps pair it with the "deleting snapshots" log message
at info level.

(cherry picked from commit 2cfdb7a92d)

# Conflicts:
#	server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java

* spotless

* compilation

---------

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2023-10-11 19:43:42 -04:00
Jason Bryan
d92ea26308
Bump versions after 7.17.14 release 2023-10-10 12:52:02 -04:00
Ed Savage
ccb5c4d0da
Mute failing NodeConnectionsServiceTests/testEventuallyConnectsOnlyToAppliedNodes (#100495)
Test
`NodeConnectionsServiceTests/testEventuallyConnectsOnlyToAppliedNodes`
fails with

```
java.lang.AssertionError: not connected to {node_21}{21}{Smg5SSzlSAWdhBrP63KjTQ}{0.0.0.0}{0.0.0.0:7}{dmsw}

  at __randomizedtesting.SeedInfo.seed([963D3EFA5F943C6E:6D1F74C5990D6DDD]:0)
  at org.junit.Assert.fail(Assert.java:88)
  at org.junit.Assert.assertTrue(Assert.java:41)
  at org.elasticsearch.cluster.NodeConnectionsServiceTests.assertConnected(NodeConnectionsServiceTests.java:507)
  at org.elasticsearch.cluster.NodeConnectionsServiceTests.assertConnectedExactlyToNodes(NodeConnectionsServiceTests.java:501)
  at org.elasticsearch.cluster.NodeConnectionsServiceTests.assertConnectedExactlyToNodes(NodeConnectionsServiceTests.java:497)
  at org.elasticsearch.cluster.NodeConnectionsServiceTests.lambda$testEventuallyConnectsOnlyToAppliedNodes$6(NodeConnectionsServiceTests.java:152)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1143)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1116)
```

Mute it.

Relates #100493
2023-10-09 06:49:23 -04:00
Joe Gallo
2f8fa89fe3
Refactor WriteableIngestDocument (#99324) (#100224) 2023-10-03 15:32:07 -04:00
Mark Vieira
3dbba882f4
Mute IndexRecoveryIT.testRerouteRecovery (#100209) (#100211)
Mute failing test
2023-10-03 13:46:59 -04:00
Rene Groeschke
5afd06ae57
[7.17] Update Gradle Wrapper to 8.2 (#96686) (#97484)
* Update Gradle Wrapper to 8.2 (#96686)

- Convention usage has been deprecated and was fixed in our build files
- Fix test dependencies and deprecation
2023-09-27 08:46:44 +02:00
David Turner
84f632f254 Close expired search contexts on SEARCH thread (#99660)
In a production cluster, I observed the `[scheduler]` thread stuck for a
while trying to delete index files that became unreferenced while
closing a search context. We shouldn't be doing I/O on the scheduler
thread. This commit moves it to a `SEARCH` thread instead.
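
The general pattern: the scheduler thread only detects expiry and hands the potentially IO-heavy cleanup to another pool. A generic sketch under that assumption (pool names and sizes are illustrative):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class ContextReaper {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final ExecutorService searchPool = Executors.newFixedThreadPool(4); // stand-in for SEARCH

    void start() {
        scheduler.scheduleAtFixedRate(() -> {
            // cheap bookkeeping only on the scheduler thread ...
            for (Object context : findExpiredContexts()) {
                // ... while closing, which may delete now-unreferenced index
                // files (disk IO), runs on the search pool
                searchPool.execute(() -> closeAndDeleteFiles(context));
            }
        }, 1, 1, TimeUnit.MINUTES);
    }

    private Iterable<Object> findExpiredContexts() { return List.of(); }

    private void closeAndDeleteFiles(Object context) { /* release refs, delete files */ }
}
```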
2023-09-19 15:26:26 +01:00
Simon Cooper
eb51d7b890
Fix deadlock between Cache.put and invalidateAll (#99480) (#99580)
The invalidateAll method takes out the LRU lock and segment locks in a different order than the put method does when the put is replacing an existing value. This results in a deadlock between the two methods as each tries to acquire the lock the other holds. The fix makes invalidateAll take out its locks in the same order as put.

This is difficult to test because the put needs to be replacing an existing value, and invalidateAll clears the cache, resulting in subsequent puts not hitting the deadlock condition. A test that overrides some internal implementations to expose this particular deadlock will be coming later.
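
This is the classic lock-ordering deadlock: each thread holds one lock and waits for the other. The invariant the fix restores, assuming (hypothetically) that put acquires the segment lock first:

```java
import java.util.concurrent.locks.ReentrantLock;

class CacheSketch {
    private final ReentrantLock segmentLock = new ReentrantLock(); // one per segment in reality
    private final ReentrantLock lruLock = new ReentrantLock();

    // put() replacing an existing value: segment lock, then LRU lock.
    // invalidateAll() used to take the LRU lock first -> deadlock.
    // Fixed invalidateAll(): the same order as put, on every path.
    void invalidateAll() {
        segmentLock.lock();          // 1. first lock, same as put()
        try {
            lruLock.lock();          // 2. second lock, same as put()
            try {
                // clear the segment entries and the LRU list together
            } finally {
                lruLock.unlock();
            }
        } finally {
            segmentLock.unlock();
        }
    }
}
```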
2023-09-14 11:27:04 -04:00
David Turner
ac9112b909 Reinstate testRunnableRunsAtMostOnceAfterCancellation (#99525)
This test was failing in #34004 due to a race, and although #34296 made
the failures rarer it did not actually fix the race. Then in #99201 we
fixed the race but the resulting test over-synchronizes and no longer
meaningfully verifies the concurrent behaviour we were originally trying
to check. It also fails for other reasons. This commit reverts back to
the original test showing that we might run the action at most once
after cancellation without any further synchronization, but fixes the
assertion to use the value of the counter observed immediately after the
cancellation since we cannot be sure that no extra iterations execute
before the cancellation completes.
2023-09-13 16:01:10 +01:00
Nhat Nguyen
c99902f3f8
Fix PIT when resolving with deleted indices (#99281) (#99331)
* Fix PIT when resolving with deleted indices

* Update docs/changelog/99281.yaml
2023-09-07 18:33:10 -04:00
Craig Taverner
c6a86e8269 Bump versions after 7.17.13 release 2023-09-07 17:18:38 +02:00
Tim Vernum
9d448a0f95
Introduce FilterRestHandler (#98922)
RestHandler has a number of methods that affect the behaviour of request
processing. If the handler is wrapped (e.g. SecurityRestFilter or
DeprecationRestHandler) then these methods must be delegated to the
underlying handler.

This commit introduces a new abstract base class `FilterRestHandler`
that correctly delegates these methods so that wrappers (subclasses) do
not need to implement the behaviour on a case-by-case basis

Backport of: #98861
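
The shape of the base class is the standard delegating wrapper: every behavioral accessor forwards to the wrapped handler, so a filter that only overrides request handling cannot silently change other behavior. A compact sketch with a made-up, much-reduced handler interface:

```java
interface RestHandler {
    void handleRequest(Object request) throws Exception;
    default boolean canTripCircuitBreaker() { return true; }
    default boolean supportsContentStream() { return false; }
}

abstract class FilterRestHandler implements RestHandler {
    private final RestHandler delegate;

    protected FilterRestHandler(RestHandler delegate) {
        this.delegate = delegate;
    }

    protected RestHandler getDelegate() {
        return delegate;
    }

    // Behavioral flags are always answered by the wrapped handler, so
    // subclasses only implement their wrapping logic in handleRequest.
    @Override
    public boolean canTripCircuitBreaker() { return delegate.canTripCircuitBreaker(); }

    @Override
    public boolean supportsContentStream() { return delegate.supportsContentStream(); }
}
```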
2023-08-30 07:19:47 +10:00
Iraklis Psaroudakis
f7af19baaa
Fix autoexpand during node replace (#98891)
Prior to this change, NodeReplacementAllocationDecider was unconditionally skipping both the replacement source and target nodes when calculating auto-expand replicas. This is fixed by auto-expanding to the replacement node if the source node already had shards of the index.

Backport of PR #96281 amended for 7.17.x

Closes #89527

Co-authored-by: Ievgen Degtiarenko <ievgen.degtiarenko@elastic.co>
2023-08-28 10:15:54 +03:00
Albert Zaharovits
0df52c8f67
Netty4 HTTP authn enhancements (#92220) (#96703)
This is a backport of multiple work items related to authentication enhancements for HTTP,
which were originally merged in the 8.8 and 8.9 releases.
With it, the HTTP authentication implementation (only the netty4-based default, not the
NIO one) gets a throughput boost, especially for requests failing authn.

Relates to: ES-6188 #92220 #95112
2023-08-23 18:52:38 +03:00
David Turner
fe18a67f02
Make TransportAddVotingConfigExclusionsAction retryable (#98568)
The docs for this API say the following:

> If the API fails, you can safely retry it. Only a successful response
> guarantees that the node has been removed from the voting
> configuration and will not be reinstated.

Unfortunately this isn't true today: if the request adds no exclusions
then we do not wait before responding. This commit makes the API wait
until all exclusions are really applied.

Backport of #98386, plus the test changes from #98146 and #98356.
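
The retry guarantee requires the no-op path to wait for the same condition as the mutating path: respond only once a published cluster state shows every requested exclusion in force. A schematic of that condition (generic types; the real implementation waits on cluster-state updates):

```java
import java.util.Set;
import java.util.function.Predicate;

class VotingExclusionsWaiter {
    /** True once every requested node is excluded in the given state. */
    static boolean exclusionsApplied(Set<String> requested, Set<String> applied) {
        return applied.containsAll(requested);
    }

    void handleRequest(Set<String> requested, Set<String> current, Runnable respond) {
        // Even when the request adds nothing new, do NOT respond right away:
        // only a state in which the exclusions are applied makes the
        // documented "safe to retry" guarantee true.
        if (exclusionsApplied(requested, current)) {
            respond.run();
        } else {
            waitForState(state -> exclusionsApplied(requested, state), respond);
        }
    }

    void waitForState(Predicate<Set<String>> condition, Runnable onMatch) {
        // subscribe to cluster-state updates; call onMatch once condition holds
    }
}
```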
2023-08-17 04:49:03 -04:00
Slobodan Adamović
6f66c75c99
[7.17] Enhance regex performance with duplicate wildcards (#98176) (#98277)
This change avoids unnecessary substring allocations and recursive calls
when more than two consecutive wildcards (`*`) are detected. Instead of
skipping one and calling the method recursively, we now skip all
consecutive `*` chars at once.
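
A simplified version of the optimization in a glob-style matcher, matching on indices to avoid the substring allocations (a sketch, not the actual Regex.simpleMatch source):

```java
class Glob {
    static boolean simpleMatch(String pattern, String str) {
        return match(pattern, str, 0, 0);
    }

    private static boolean match(String pattern, String str, int p, int s) {
        if (p == pattern.length()) {
            return s == str.length();
        }
        if (pattern.charAt(p) == '*') {
            // Collapse the whole run of consecutive '*' in one step, instead
            // of one recursive call (and substring) per duplicate wildcard.
            while (p < pattern.length() && pattern.charAt(p) == '*') {
                p++;
            }
            if (p == pattern.length()) {
                return true;                 // trailing '*' matches any suffix
            }
            for (int i = s; i <= str.length(); i++) {
                if (match(pattern, str, p, i)) {
                    return true;
                }
            }
            return false;
        }
        return s < str.length()
            && pattern.charAt(p) == str.charAt(s)
            && match(pattern, str, p + 1, s + 1);
    }
}
```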
2023-08-08 14:34:05 +02:00
Alan Woodward
ce1a67b8d5
Fix bug in NestedUtils.partitionByChildren() (#97970) (#97986)
If multiple fields appeared between two child scopes, the following children
would be incorrectly assigned to the parent scope.
2023-07-26 15:54:58 -04:00
Przemyslaw Gomulka
fbdb9cd6a8
Add Configuration to PatternLayout backport(97679) (#97971)
In 2.17.2 (a patch release), log4j made a refactoring that requires a Configuration to be manually passed into the created PatternLayout.
If the Configuration is not passed, the system variable lookup will not work. This results in the cluster.name field not being populated in logs.

This commit creates a PatternLayout with a DefaultConfiguration (the same as was used prior to the refactoring).

backports #97679
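
For reference, building the layout with an explicit DefaultConfiguration looks roughly like this (pattern string is illustrative; the builder methods are standard log4j2 API):

```java
import org.apache.logging.log4j.core.config.DefaultConfiguration;
import org.apache.logging.log4j.core.layout.PatternLayout;

class LayoutFactory {
    static PatternLayout build(String pattern) {
        return PatternLayout.newBuilder()
            // Without a Configuration, ${sys:...} lookups in the pattern are
            // not resolved, which is how cluster.name went missing from logs.
            .withConfiguration(new DefaultConfiguration())
            .withPattern(pattern)
            .build();
    }
}
```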
2023-07-26 17:51:43 +02:00
Ryan Ernst
fe8d4a23ae
Bump versions after 7.17.12 release 2023-07-25 11:24:15 -07:00
Alan Woodward
72eb60953a
Refactor nested field handling in FieldFetcher (#97683) (#97897)
The current recursive nested field handling implementation in FieldFetcher
can be O(n^2) in the number of nested mappings, whether or not a nested
field has been requested. For indexes with a very large number of
nested fields, this can mean it takes multiple seconds to build a FieldFetcher,
making the fetch phase of queries extremely slow, even if no nested fields
are actually asked for.

This commit reworks the logic so that building nested fetchers is only
O(n log n) in the number of nested mappers; additionally, we only pay this
cost for nested fields that have been requested.
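
The O(n log n) bound comes from sorting field paths once and carving out each child scope with range lookups instead of scanning every field per nested mapper. An illustrative partitioning routine (hypothetical signature, not the actual NestedUtils code):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class NestedPartitioner {
    /**
     * Group field paths under the child scope that owns them. The TreeMap's
     * range views make each child's slice a pair of O(log n) lookups.
     */
    static Map<String, List<String>> partitionByChildren(List<String> childScopes, TreeMap<String, String> sortedFields) {
        Map<String, List<String>> result = new HashMap<>();
        for (String child : childScopes) {
            String prefix = child + ".";
            // All fields named "child.*" sit in one contiguous key range.
            NavigableMap<String, String> slice =
                sortedFields.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
            result.put(child, List.copyOf(slice.keySet()));
        }
        return result;
    }
}
```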
2023-07-25 13:45:53 +01:00
Artem Prigoda
7d11e4163b
[7.17] Preserve context in ResultDeduplicator (#84038) (#96868)
Today the `ResultDeduplicator` may complete a collection of listeners in
contexts different from the ones in which they were submitted. This
commit makes sure that the context is preserved in the listener.

Co-authored-by: David Turner <david.turner@elastic.co>
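
The fix follows the usual capture-and-restore pattern: snapshot the submitter's context and re-install it around the listener, since the deduplicated computation may complete on a thread carrying someone else's context. A generic sketch (the real code wraps Elasticsearch's ThreadContext):

```java
import java.util.function.Consumer;

class ContextPreserving {
    interface Context extends AutoCloseable { @Override void close(); }

    interface ContextHolder {
        Context capture();           // snapshot the current thread's context
        Context restore(Context c);  // install the snapshot; close() undoes it
    }

    /** Wrap a listener so it always runs in the context it was submitted from. */
    static <T> Consumer<T> preserveContext(ContextHolder holder, Consumer<T> listener) {
        Context captured = holder.capture();
        return result -> {
            try (Context ignored = holder.restore(captured)) {
                listener.accept(result);   // runs with the submitter's context
            }
        };
    }
}
```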
2023-07-05 11:00:52 +02:00
David Turner
4ccdcce2b3
[7.17] Handle failure in TransportUpdateAction#handleUpdateFailureWithRetry (#97290) (#97326)
* Handle failure in TransportUpdateAction#handleUpdateFailureWithRetry (#97290)

Here executor(request.getShardId()) may throw, but we're already handling a failure so we cannot simply let this exception bubble up. This commit adjusts things to catch the exception, using it to fail the listener.

Closes #97286

* Fix

---------

Co-authored-by: Iraklis Psaroudakis <kingherc@gmail.com>
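
The hardening pattern: when code inside a failure handler can itself throw, catch the secondary exception and route it into the listener, since re-throwing from the handler would orphan the request. A schematic with names following the description, not the exact TransportUpdateAction code:

```java
import java.util.concurrent.Executor;

class UpdateRetryHandler {
    interface Listener { void onFailure(Exception e); }

    void handleUpdateFailureWithRetry(Exception cause, int shardId, Listener listener, Runnable retry) {
        final Executor retryExecutor;
        try {
            retryExecutor = executorForShard(shardId);  // this lookup may itself throw
        } catch (Exception e) {
            e.addSuppressed(cause);
            // Already on a failure path: fail the listener with the new
            // exception rather than letting it bubble up unhandled.
            listener.onFailure(e);
            return;
        }
        retryExecutor.execute(retry);
    }

    private Executor executorForShard(int shardId) { return Runnable::run; }
}
```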
2023-07-03 13:31:41 -04:00
Mark Tozzi
29ff32e276
Less jank in after-key parsing for unmapped fields (#86359) (#97282)
Resolves #85928

The after-key parsing is pretty weird, and there are probably more bugs there. I did not take the opportunity to refactor the whole thing, but we should. This fixes the immediate problem by treating after keys as bytes refs when we don't have a field but think we want a keyword. We were already doing that if the user asked for a missing bucket, this just extends the behavior in the case that we don't.

Long term, the terms Composite source (and probably other Composite sources) should have specializations for unmapped fields. That's the direction we want to take aggs in general.
2023-07-03 09:57:14 -04:00
Mark Vieira
6a55b8789e
Bump versions after 7.17.11 release 2023-06-29 13:30:37 -07:00
Ryan Ernst
537a1c9bda
Capture max processors in static init (#97119) (#97153)
The number of processors available to the jvm can change over time.
However, most of Elasticsearch assumes this value is constant. Although
we could rework all code relying on the number of processors to
dynamically support updates and poll the jvm, doing so has little value
since the processors changing is an edge case. Instead, this commit
changes validation of the node.processors setting (our internal number of
processors) to be based on the max processors available at launch.

closes #97088
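
Distilled, the approach reads the processor count once in static initialization and validates the setting against that captured maximum, so later fluctuations in availableProcessors() cannot retroactively invalidate an accepted value. A sketch (bounds and message are illustrative):

```java
class ProcessorsSetting {
    // Captured once at class load; the JVM may report different values
    // later, but the rest of the system assumes a constant.
    private static final int MAX_PROCESSORS_AT_LAUNCH = Runtime.getRuntime().availableProcessors();

    static int validateNodeProcessors(int requested) {
        if (requested < 1 || requested > MAX_PROCESSORS_AT_LAUNCH) {
            throw new IllegalArgumentException(
                "node.processors must be in [1, " + MAX_PROCESSORS_AT_LAUNCH + "] but was " + requested);
        }
        return requested;
    }
}
```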
2023-06-27 12:47:29 -04:00
Joe Gallo
62579b0062
Fix unhandled exception when blobstore repository contains unexpected file (#93914) (#97113)
If there is any file at the repo root with the `index-` prefix but no number after it, we
must not throw here. If we do, we will end up throwing an unexpected exception that is not
properly handled by `org.elasticsearch.snapshots.SnapshotsService#failAllListenersOnMasterFailOver`,
leading to the repository generation not getting correctly set in the cluster state down the line.
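
The lenient-parsing idea: treat `index-<N>` blobs as generation markers only when the suffix really is a number, and silently skip anything else. A small sketch (blob names in the comment are hypothetical examples):

```java
import java.util.OptionalLong;

class IndexGenParser {
    /** Parse "index-42" -> 42; anything malformed is skipped, never thrown on. */
    static OptionalLong parseGeneration(String blobName) {
        if (blobName.startsWith("index-") == false) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(blobName.substring("index-".length())));
        } catch (NumberFormatException e) {
            // e.g. a stray "index-backup" file at the repo root: ignoring it
            // beats an unexpected exception that derails the repository
            // generation bookkeeping in the cluster state.
            return OptionalLong.empty();
        }
    }
}
```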
2023-06-27 10:04:22 -04:00
David Turner
eeedb98c60
Make cluster health API cancellable (#96990)
This API can be quite heavy in large clusters, and might spam the
`MANAGEMENT` threadpool queue with work for clients that have long-since
given up. This commit adds some basic cancellability checks to reduce
the problem.

Backport of #96551 to 7.17
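
The cancellability checks follow the common shape of bailing out between expensive phases once the client has gone away. A generic sketch (hypothetical task API):

```java
import java.util.concurrent.CancellationException;

class HealthComputation {
    interface CancellableTask { boolean isCancelled(); }

    static void compute(CancellableTask task, Runnable... phases) {
        for (Runnable phase : phases) {
            if (task.isCancelled()) {
                // The client gave up long ago: stop burning MANAGEMENT-pool
                // time computing an answer nobody will read.
                throw new CancellationException("health request cancelled");
            }
            phase.run();
        }
    }
}
```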
2023-06-22 08:05:03 +01:00
Nhat Nguyen
e1995a708a
Increase concurrent request of opening point-in-time (#96959)
* Increase concurrent request of opening point-in-time (#96782) (#96957)

Today, we mistakenly throttle the opening point-in-time API to 1 request
per node. As a result, when attempting to open a point-in-time across
large clusters, it can take a significant amount of time and eventually
fails due to relocated target shards or deleted target indices managed
by ILM. Ideally, we should batch the requests per node and eliminate
this throttle completely. However, this requires all clusters to be on
the latest version.

This PR increases the number of concurrent requests from 1 to 5, which
is the default for search.

* Fix tests

* Fix tests
2023-06-20 14:43:56 -04:00
Ignacio Vera
2b42390ac2
Port lucene tessellator fix github-12352 to Elasticsearch 7.17 (#96721) 2023-06-14 10:42:13 +02:00
David Turner
969770f87c
Include targetNodeName in shutdown metadata toString() (#96765)
Reporting the `targetNodeName` was added to `main` in #78727 but omitted
from the backport in #78865. This commit adds the missing field to the
`toString()` response.
2023-06-12 15:58:19 +01:00
Ievgen Degtiarenko
3fc6eb6bdd
Fix AsyncShardFetchTests#testTwoNodesRemoveOne (#93734) (#96695)
This test was using the wrong `DiscoveryNodes`, but that mistake was
hidden by other leniency elsewhere in this test suite. This commit fixes
the test bug and also makes the test suite stricter.

Closes #93729

(cherry picked from commit 774e396ed5)

Co-authored-by: David Turner <david.turner@elastic.co>
2023-06-08 05:32:18 -04:00
David Turner
7c5d82d66b
Add one more consistency check in AsyncShardFetch (#96553) (#96557)
Relates #93632
2023-06-05 05:51:23 -04:00
David Turner
9a1ee48b3a
Fork TransportClusterHealthAction to MANAGEMENT (#96546)
This action can become fairly expensive for large states. Plus it is
called at high rates on e.g. Cloud, which blocks transport threads
needlessly in large deployments. Let's fork it to MANAGEMENT like we do
for similar CPU-bound actions.

Backport of #90621 to 7.17

Co-authored-by: Armin Braun <me@obrown.io>
2023-06-05 09:14:21 +01:00
David Turner
6a755e3b00
Streamline AsyncShardFetch#getNumberOfInFlightFetches (#96545)
Avoids an O(#nodes) iteration by tracking the number of fetches directly.

Backport of #93632 to 7.17

Co-authored-by: luyuncheng <luyuncheng@bytedance.com>
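
The classic O(1) aggregate trick: maintain the count at the points where fetches start and finish instead of recomputing it on demand. Sketch:

```java
import java.util.concurrent.atomic.AtomicInteger;

class FetchTracker {
    private final AtomicInteger inFlight = new AtomicInteger();

    void onFetchStarted()   { inFlight.incrementAndGet(); }
    void onFetchCompleted() { inFlight.decrementAndGet(); }

    /** Previously an O(#nodes) scan over per-node state; now a single read. */
    int getNumberOfInFlightFetches() {
        return inFlight.get();
    }
}
```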
2023-06-05 09:13:09 +01:00
Ievgen Degtiarenko
3445e5b5c9
Fix testFollowerBehaviour (#96257) (#96269)
The test was failing when responseDelay == leaderCheckTimeoutMillis.
This resulted in scheduling both the response handling and the timeout at the same
millisecond and executing them in random order. The fix makes it impossible for the
reply to arrive at the same time as the request times out, since the behavior is not
deterministic in that case.
2023-05-23 03:10:41 -04:00
Keith Massey
fb243a9d18
Avoiding running IO on scheduler thread in ResourceWatcherService (#96251) (#96261) 2023-05-22 13:02:42 -04:00
Andrei Dan
fc5e40b626
[ILM] Fix the migrate to tiers service and migrate action tiers configuration (#95934) (#95966)
The migrate action (although not allowed in the frozen phase) would seem
to convert `frozen` to the `data_frozen,data_cold,data_warm,data_hot` tier
configuration. As the migrate action is not allowed in the frozen phase
this would never happen; however, the code is confusing as it seems like
it could.

The migrate-to-data-tiers routing service shared the code used by the
`migrate` action, which converted `frozen` to
`data_frozen,data_cold,data_warm,data_hot` if it encountered an
index without any `_tier_preference` setting but with a custom node
attribute configured to `frozen`, e.g. `include.data: frozen`.

As part of https://github.com/elastic/elasticsearch/issues/84758 we have
seen frozen indices with the `data_frozen,data_cold,data_warm,data_hot`
tier preference however we could never reproduce it.

Relates to https://github.com/elastic/elasticsearch/issues/84758
2023-05-09 14:13:47 -04:00
Mark Vieira
90903324eb
Bump versions after 7.17.10 release 2023-05-02 09:45:33 -07:00
Armin Braun
7284dd20c4
Deduplicate Heavy CCR Repository CS Requests (#91398) (#95372)
We run the same request back to back for each put-follower call during
the restore. Also, concurrent put-follower calls will all run the same
full CS request concurrently.
In older versions, prior to https://github.com/elastic/elasticsearch/pull/87235,
the concurrency was limited by the size of the snapshot pool. With that
fix, though, these requests run at almost arbitrary concurrency when many
put-follow requests are executed concurrently.
This is fixed by using the existing deduplicator to only run a single remote
CS request at a time for each CCR repository.
Also, this removes the needless forking in the put-follower action, which
is no longer necessary now that the CCR repository is non-blocking (we do
the same for normal restores, which can safely be started from a transport
thread). This should fix some bad-UX situations where the snapshot threads
are busy on master, making the put-follower requests not go through in time.
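
The deduplicator pattern referenced here, reduced to its core: callers with an equivalent request attach their listeners to the one in-flight execution instead of each issuing their own. A compact single-flight sketch (generic; the real class is Elasticsearch's ResultDeduplicator):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

class Deduplicator<K, V> {
    private final Map<K, List<Consumer<V>>> inFlight = new HashMap<>();

    /** Run 'work' for 'key' at most once concurrently; later callers just wait. */
    void execute(K key, Consumer<V> listener, BiConsumer<K, Consumer<V>> work) {
        synchronized (inFlight) {
            List<Consumer<V>> waiters = inFlight.get(key);
            if (waiters != null) {
                waiters.add(listener);   // this key is already being fetched
                return;
            }
            List<Consumer<V>> fresh = new ArrayList<>();
            fresh.add(listener);
            inFlight.put(key, fresh);
        }
        work.accept(key, result -> {     // only the first caller sends the request
            List<Consumer<V>> toNotify;
            synchronized (inFlight) {
                toNotify = inFlight.remove(key);
            }
            toNotify.forEach(l -> l.accept(result));
        });
    }
}
```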
2023-04-19 09:10:05 -04:00