* Unwrap exception more tenaciously in testQueuedOperationsAndBrokenRepoOnMasterFailOver (#102352)
There can be more than 10 layers of wrapping RTEs, see #102351. As a
workaround to address the test failure, this commit just manually
unwraps them all.
Closes #102348
* Fixup
A call to `ConnectionTarget#connect` which happens strictly after all
calls that close connections should leave us connected to the target.
However concurrent calls to `ConnectionTarget#connect` can overlap, and
today this means that a connection returned from an earlier call may
overwrite one from a later call. The trouble is that the earlier
connection attempt may yield a closed connection (it was concurrent with
the disconnections) so we must not let it supersede the newer one.
With this commit we prevent concurrent connection attempts, which stops
earlier attempts from overwriting the connections resulting from later
attempts.
Backport of #92558
When combined with #101910, closes #100493
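A minimal sketch of the invariant this change protects, not the actual `ConnectionTarget` code: the connection produced by an older attempt must never supersede the one produced by a newer attempt. The real fix serializes the attempts; the generation counter below is only an illustrative way to express the same ordering guarantee, and all names are hypothetical.
```
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: a generation counter guarantees that a connection from an
// older attempt can never overwrite the connection from a newer attempt.
class ConnectionSlot<C> {
    private record Published<T>(long generation, T connection) {}

    private final AtomicLong attempts = new AtomicLong();
    private final AtomicReference<Published<C>> published = new AtomicReference<>();

    /** Called at the start of an attempt to reserve its place in the ordering. */
    long startAttempt() {
        return attempts.incrementAndGet();
    }

    /** Publish the attempt's result unless a newer attempt already published. */
    void publish(long generation, C connection) {
        published.updateAndGet(existing ->
            existing != null && existing.generation() > generation
                ? existing                                   // keep the newer connection
                : new Published<>(generation, connection));  // otherwise install ours
    }

    C current() {
        Published<C> p = published.get();
        return p == null ? null : p.connection();
    }
}
```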
Today we call `Transport.Connection#onRemoved`, notifying any
removed-listeners, when the connection is closed and removed from the
`connectedNodes` map. However, it's possible for the connection to be
closed while we're still adding it to the map and setting up the
listeners, so this now-dead connection will still be found in the
`pendingConnections` and may be returned to a future call to
`connectToNode` even if this call was made after all the
removed-listeners have been called.
With this commit we delay calling the removed-listeners until the
connection is closed and removed from both the `connectedNodes` and
`pendingConnections` maps.
Backport of #92546 to 7.17
Relates #100493
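A simplified sketch of the ordering this commit establishes, with stand-in types rather than the real connection-manager code: the removed-listeners only fire once the connection has been dropped from both maps, so neither map can still hand out the dead connection afterwards.
```
import java.util.concurrent.atomic.AtomicInteger;

// Simplified illustration: notify the removed-listeners only after the connection
// has been removed from both connectedNodes and pendingConnections.
class RemovalBarrier {
    private final AtomicInteger remainingRemovals = new AtomicInteger(2);
    private final Runnable notifyRemovedListeners;

    RemovalBarrier(Runnable notifyRemovedListeners) {
        this.notifyRemovedListeners = notifyRemovedListeners;
    }

    void onRemovedFromConnectedNodes() {
        countDown();
    }

    void onRemovedFromPendingConnections() {
        countDown();
    }

    private void countDown() {
        if (remainingRemovals.decrementAndGet() == 0) {
            // Safe to notify: no map can return this connection to a later connectToNode call.
            notifyRemovedListeners.run();
        }
    }
}
```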
Today `TcpTransport#openConnection` may throw exceptions on certain
kinds of failure, but other kinds of failure are passed to the listener.
This is trappy and not all callers handle it correctly. This commit
makes sure that all exceptions are passed to the listener.
Closes #100510
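A sketch of the pattern this commit applies, using stand-in types rather than the real `TcpTransport` API: the synchronous part of opening a connection is wrapped so that every failure takes exactly one path, the listener, instead of being thrown at some callers and delivered asynchronously to others.
```
// Stand-in listener and supplier types; illustrative only.
final class SafeOpen {
    interface Listener<T> {
        void onResponse(T result);
        void onFailure(Exception e);
    }

    interface CheckedSupplier<T> {
        T get() throws Exception;
    }

    static <T> void openConnection(CheckedSupplier<T> opener, Listener<T> listener) {
        final T connection;
        try {
            connection = opener.get();   // may fail synchronously
        } catch (Exception e) {
            listener.onFailure(e);       // never let the exception escape to the caller
            return;
        }
        listener.onResponse(connection);
    }
}
```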
Refactor testRerouteRecovery, pulling out testing of shard recovery
throttling into separate targeted tests. Now there are two additional
tests, one testing source node throttling, and another testing target
node throttling. Throttling both nodes at once leads primarily to the
source node registering throttling, while the target node mostly has
no cause to instigate throttling.
(cherry picked from commit 323d9366df)
* Log a debug level message for deleting non-existing snapshot (#100479)
The new message helps pairing with the "deleting snapshots" log message
at info level.
(cherry picked from commit 2cfdb7a92d)
# Conflicts:
# server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java
* spotless
* compilation
---------
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Test
`NodeConnectionsServiceTests/testEventuallyConnectsOnlyToAppliedNodes`
fails with
```
java.lang.AssertionError: not connected to {node_21}{21}{Smg5SSzlSAWdhBrP63KjTQ}{0.0.0.0}{0.0.0.0:7}{dmsw}
at __randomizedtesting.SeedInfo.seed([963D3EFA5F943C6E:6D1F74C5990D6DDD]:0)
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.elasticsearch.cluster.NodeConnectionsServiceTests.assertConnected(NodeConnectionsServiceTests.java:507)
at org.elasticsearch.cluster.NodeConnectionsServiceTests.assertConnectedExactlyToNodes(NodeConnectionsServiceTests.java:501)
at org.elasticsearch.cluster.NodeConnectionsServiceTests.assertConnectedExactlyToNodes(NodeConnectionsServiceTests.java:497)
at org.elasticsearch.cluster.NodeConnectionsServiceTests.lambda$testEventuallyConnectsOnlyToAppliedNodes$6(NodeConnectionsServiceTests.java:152)
at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1143)
at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1116)
```
Mute it.
Relates #100493
* Update Gradle Wrapper to 8.2 (#96686)
- Convention usage has been deprecated and was fixed in our build files
- Fix test dependencies and deprecation
In a production cluster, I observed the `[scheduler]` thread stuck for a
while trying to delete index files that became unreferenced while
closing a search context. We shouldn't be doing I/O on the scheduler
thread. This commit moves it to a `SEARCH` thread instead.
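A minimal sketch of the change, with a plain executor standing in for Elasticsearch's `SEARCH` thread pool (the class and pool wiring below are illustrative, not the actual code): the potentially slow file deletion is forked off the scheduler thread instead of running inline.
```
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative wiring only: the scheduler callback forks the deletion of
// now-unreferenced index files to a worker pool rather than doing I/O itself.
class SearchContextCleanup {
    private final ExecutorService searchPool = Executors.newFixedThreadPool(4);

    void onContextClosed(Runnable deleteUnreferencedFiles) {
        searchPool.execute(deleteUnreferencedFiles); // keep the scheduler thread free of I/O
    }
}
```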
The invalidateAll method is taking out the lru lock and segment locks in a different order to the put method, when the put is replacing an existing value. This results in a deadlock between the two methods as they try to swap locks. This fixes it by making sure invalidateAll takes out locks in the same order as put.
This is difficult to test because the put needs to be replacing an existing value, and invalidateAll clears the cache, resulting in subsequent puts not hitting the deadlock condition. A test that overrides some internal implementations to expose this particular deadlock will be coming later.
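A minimal sketch of the deadlock and the fix, under the simplifying assumption of one LRU lock and one segment lock: all that matters is that `put` and `invalidateAll` acquire the two locks in the same order. The particular order shown is arbitrary and the class is illustrative, not the actual cache.
```
import java.util.concurrent.locks.ReentrantLock;

// Two methods that each need both locks. If one took lruLock -> segmentLock and the
// other segmentLock -> lruLock, two threads could end up waiting on each other forever.
class LruCacheSketch {
    private final ReentrantLock segmentLock = new ReentrantLock();
    private final ReentrantLock lruLock = new ReentrantLock();

    void putReplacingExistingValue() {
        segmentLock.lock();
        try {
            lruLock.lock();
            try {
                // replace the entry in the segment and update its LRU position
            } finally {
                lruLock.unlock();
            }
        } finally {
            segmentLock.unlock();
        }
    }

    void invalidateAll() {
        // Fixed version: same acquisition order as put, so no lock inversion is possible.
        segmentLock.lock();
        try {
            lruLock.lock();
            try {
                // clear all entries
            } finally {
                lruLock.unlock();
            }
        } finally {
            segmentLock.unlock();
        }
    }
}
```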
This test was failing in #34004 due to a race, and although #34296 made
the failures rarer they did not actually fix the race. Then in #99201 we
fixed the race but the resulting test over-synchronizes and no longer
meaningfully verifies the concurrent behaviour we were originally trying
to check. It also fails for other reasons. This commit reverts back to
the original test showing that we might run the action at most once
after cancellation without any further synchronization, but fixes the
assertion to use the value of the counter observed immediately after the
cancellation since we cannot be sure that no extra iterations execute
before the cancellation completes.
RestHandler has a number of methods that affect the behaviour of request
processing. If the handler is wrapped (e.g. SecurityRestFilter or
DeprecationRestHandler) then these methods must be delegated to the
underlying handler.
This commit introduces a new abstract base class `FilterRestHandler`
that correctly delegates these methods so that wrappers (subclasses) do
not need to implement the behaviour on a case-by-case basis
Backport of: #98861
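A simplified sketch of the idea; the interface below is an illustrative stand-in, not the full `RestHandler` contract. The abstract wrapper forwards every behaviour-affecting method to the delegate, so concrete wrappers only override what they deliberately change.
```
// Illustrative RestHandler contract with a couple of behaviour-affecting flags.
interface RestHandlerSketch {
    void handleRequest(Object request) throws Exception;
    default boolean canTripCircuitBreaker() { return true; }
    default boolean allowsUnsafeBuffers() { return false; }
}

// Wrappers extend this and only override what they actually need to change.
abstract class FilterRestHandlerSketch implements RestHandlerSketch {
    private final RestHandlerSketch delegate;

    protected FilterRestHandlerSketch(RestHandlerSketch delegate) {
        this.delegate = delegate;
    }

    protected RestHandlerSketch getDelegate() {
        return delegate;
    }

    @Override
    public boolean canTripCircuitBreaker() {
        return delegate.canTripCircuitBreaker();  // forwarded, not silently defaulted
    }

    @Override
    public boolean allowsUnsafeBuffers() {
        return delegate.allowsUnsafeBuffers();
    }
}
```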
Prior to this change, NodeReplacementAllocationDecider was unconditionally skipping both the replacement source and target nodes when calculating auto-expand replicas. This is fixed by auto-expanding to the replacement target node if the source node already had shards of the index
Backport of PR #96281 amended for 7.17.x
Closes #89527
Co-authored-by: Ievgen Degtiarenko <ievgen.degtiarenko@elastic.co>
This is a backport of multiple work items related to authentication enhancements for HTTP,
which were originally merged in the 8.8 - 8.9 releases.
Hence, the HTTP authentication implementation (only the default netty4-based transport,
not the NIO one) gets a throughput boost, especially for requests that fail authentication.
Relates to: ES-6188, #92220, #95112
The docs for this API say the following:
> If the API fails, you can safely retry it. Only a successful response
> guarantees that the node has been removed from the voting
> configuration and will not be reinstated.
Unfortunately this isn't true today: if the request adds no exclusions
then we do not wait before responding. This commit makes the API wait
until all exclusions are really applied.
Backport of #98386, plus the test changes from #98146 and #98356.
This change avoids unnecessary substring allocations and recursive calls
when more than two consecutive wildcards (`*`) are detected. Instead of
skipping one `*` and calling the method recursively for each, we now skip
all consecutive `*` chars at once.
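A small self-contained matcher in the spirit of this change, not the actual `Regex.simpleMatch` code: before recursing on a `*`, the whole run of consecutive `*` characters is skipped in one step, since `a**b` matches exactly the same strings as `a*b`.
```
// Illustrative wildcard matcher; names are hypothetical.
final class WildcardMatchSketch {
    static boolean simpleMatch(String pattern, String str) {
        return match(pattern, 0, str, 0);
    }

    private static boolean match(String p, int pi, String s, int si) {
        if (pi == p.length()) {
            return si == s.length();
        }
        if (p.charAt(pi) == '*') {
            // Skip the whole run of consecutive '*' in one step instead of one
            // recursive call (and one substring in the original code) per '*'.
            while (pi < p.length() && p.charAt(pi) == '*') {
                pi++;
            }
            if (pi == p.length()) {
                return true; // trailing '*' matches the remainder
            }
            for (int i = si; i <= s.length(); i++) {
                if (match(p, pi, s, i)) {
                    return true;
                }
            }
            return false;
        }
        return si < s.length() && p.charAt(pi) == s.charAt(si) && match(p, pi + 1, s, si + 1);
    }
}
```
For example, `simpleMatch("a***b", "aXYZb")` returns `true` after skipping the run of `*` once, rather than recursing once per wildcard.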
In 2.17.2 (a patch release) Log4j made a refactoring that requires a Configuration to be manually passed into a created PatternLayout.
If the Configuration is not passed, the system variable lookup does not work, which results in the cluster.name field not being populated in logs.
This commit creates the PatternLayout with a DefaultConfiguration (the same as was used prior to the refactoring).
backports #97679
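A sketch of the fix using the Log4j 2 builder API; the surrounding class and the pattern string are illustrative, not the exact Elasticsearch layout.
```
import org.apache.logging.log4j.core.config.DefaultConfiguration;
import org.apache.logging.log4j.core.layout.PatternLayout;

final class LayoutFactory {
    // Since Log4j 2.17.2 a PatternLayout built without an explicit Configuration
    // cannot resolve lookups such as a ${sys:...} reference to the cluster name,
    // so we build it with a DefaultConfiguration.
    static PatternLayout clusterAwareLayout() {
        return PatternLayout.newBuilder()
            .withPattern("[%d{ISO8601}][%-5p] [${sys:es.logs.cluster_name}] %m%n")
            .withConfiguration(new DefaultConfiguration())
            .build();
    }
}
```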
The current recursive nested field handling implementation in FieldFetcher
can be O(n^2) in the number of nested mappings, whether or not a nested
field has been requested. For indexes with a very large number of
nested fields, this can mean it takes multiple seconds to build a FieldFetcher,
making the fetch phase of queries extremely slow, even if no nested fields
are actually asked for.
This commit reworks the logic so that building nested fetchers is only
O(n log n) in the number of nested mappers; additionally, we only pay this
cost for nested fields that have been requested.
Today the `ResultDeduplicator` may complete a collection of listeners in
contexts different from the ones in which they were submitted. This
commit makes sure that the context is preserved in the listener.
Co-authored-by: David Turner <david.turner@elastic.co>
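A simplified illustration of the fix, not the actual `ResultDeduplicator`: a `ThreadLocal<String>` stands in for Elasticsearch's `ThreadContext`. Each listener captures the submitter's context when it is registered, and that context is restored around its notification instead of completing it in whichever context finished the shared work.
```
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class DeduplicatedWork<R> {
    private record Waiter<T>(String capturedContext, Consumer<T> listener) {}

    private final ThreadLocal<String> threadContext; // stand-in for ThreadContext
    private final List<Waiter<R>> waiting = new ArrayList<>();

    DeduplicatedWork(ThreadLocal<String> threadContext) {
        this.threadContext = threadContext;
    }

    // Capture the submitting thread's context alongside the listener.
    synchronized void register(Consumer<R> listener) {
        waiting.add(new Waiter<>(threadContext.get(), listener));
    }

    // Complete every waiter in the context it was submitted with, not the completer's.
    synchronized void complete(R result) {
        for (Waiter<R> waiter : waiting) {
            String previous = threadContext.get();
            threadContext.set(waiter.capturedContext());
            try {
                waiter.listener().accept(result);
            } finally {
                threadContext.set(previous);
            }
        }
        waiting.clear();
    }
}
```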
* Handle failure in TransportUpdateAction#handleUpdateFailureWithRetry (#97290)
Here executor(request.getShardId()) may throw, but we're already handling a failure so we cannot simply let this exception bubble up. This commit adjusts things to catch the exception, using it to fail the listener.
Closes #97286
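A sketch of the pattern, with stand-in types rather than the actual `TransportUpdateAction` code: since we are already on a failure path, anything thrown while looking up the retry executor is caught and routed to the listener rather than allowed to bubble up.
```
import java.util.concurrent.Executor;

final class RetryOnFailureSketch {
    interface FailureListener {
        void onFailure(Exception e);
    }

    interface ExecutorLookup {
        Executor executorForShard() throws Exception; // may throw, e.g. if the shard is gone
    }

    static void retryOrFail(ExecutorLookup lookup, Runnable retry, FailureListener listener) {
        final Executor executor;
        try {
            executor = lookup.executorForShard();
        } catch (Exception e) {
            listener.onFailure(e); // fail the listener instead of letting this escape
            return;
        }
        executor.execute(retry);
    }
}
```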
* Fix
---------
Co-authored-by: Iraklis Psaroudakis <kingherc@gmail.com>
Resolves #85928
The after-key parsing is pretty weird, and there are probably more bugs there. I did not take the opportunity to refactor the whole thing, but we should. This fixes the immediate problem by treating after keys as bytes refs when we don't have a field but think we want a keyword. We were already doing that if the user asked for a missing bucket, this just extends the behavior in the case that we don't.
Long term, the terms Composite source (and probably other Composite sources) should have specializations for unmapped fields. That's the direction we want to take aggs in general.
The number of processors available to the jvm can change over time.
However, most of Elasticsearch assumes this value is constant. Although
we could rework all code relying on the number of processors to
dynamically support updates and poll the jvm, doing so has little value
since the processors changing is an edge case. Instead, this commit
fixes validation of the node.processors setting (our internal number of
processors) to validate based on the max processors available at launch.
closes #97088
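A sketch of the validation; the setting name comes from the commit, while the class and bounds check below are illustrative. The upper bound is captured once at launch rather than re-read from a value that can change while the node runs.
```
// Illustrative validation of node.processors against the processor count at launch.
final class NodeProcessorsValidation {
    // Captured once at startup; Runtime#availableProcessors may report a
    // different value later in the JVM's lifetime.
    private static final int PROCESSORS_AT_LAUNCH = Runtime.getRuntime().availableProcessors();

    static void validate(double nodeProcessors) {
        if (nodeProcessors <= 0 || nodeProcessors > PROCESSORS_AT_LAUNCH) {
            throw new IllegalArgumentException(
                "node.processors must be > 0 and <= " + PROCESSORS_AT_LAUNCH
                    + " (processors available at launch), but was [" + nodeProcessors + "]");
        }
    }
}
```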
If there's any file at the repo root with the `index-` prefix but no number after it, we
must not throw here. If we do, we will end up throwing an unexpected exception that is not
properly handled by `org.elasticsearch.snapshots.SnapshotsService#failAllListenersOnMasterFailOver`,
leading to the repository generation not being set correctly in the cluster state down the line.
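An illustrative sketch of the lenient parsing, not the actual repository code: a root blob whose name starts with `index-` but does not end in a number is simply skipped instead of throwing.
```
import java.util.OptionalLong;

// Illustrative lenient parsing of "index-<N>" blob names at the repository root.
final class IndexBlobNames {
    private static final String PREFIX = "index-";

    static OptionalLong parseGeneration(String blobName) {
        if (!blobName.startsWith(PREFIX)) {
            return OptionalLong.empty();
        }
        try {
            return OptionalLong.of(Long.parseLong(blobName.substring(PREFIX.length())));
        } catch (NumberFormatException e) {
            return OptionalLong.empty(); // e.g. "index-foo": ignore it rather than throw
        }
    }
}
```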
This API can be quite heavy in large clusters, and might spam the
`MANAGEMENT` threadpool queue with work for clients that have long-since
given up. This commit adds some basic cancellability checks to reduce
the problem.
Backport of #96551 to 7.17
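A simplified sketch of the kind of check added; the task interface is a stand-in for Elasticsearch's cancellable task support. Work building the response on the `MANAGEMENT` pool periodically checks for cancellation and stops early instead of computing a response nobody will read.
```
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

final class CancellableResponseBuilder {
    interface CancellableTask {
        boolean isCancelled();
    }

    static <T> List<T> buildResponse(CancellableTask task, List<Supplier<T>> perNodeChunks) {
        List<T> response = new ArrayList<>();
        for (Supplier<T> chunk : perNodeChunks) {
            if (task.isCancelled()) {
                // The client has long since given up; stop doing useless work.
                throw new IllegalStateException("task cancelled");
            }
            response.add(chunk.get());
        }
        return response;
    }
}
```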
* Increase concurrent request of opening point-in-time (#96782) (#96957)
Today, we mistakenly throttle the open point-in-time API to 1 request
per node. As a result, when attempting to open a point-in-time across
large clusters, it can take a significant amount of time and eventually
fails due to relocated target shards or deleted target indices managed
by ILM. Ideally, we should batch the requests per node and eliminate
this throttle completely. However, this requires all clusters to be on
the latest version.
This PR increases the number of concurrent requests from 1 to 5, which
is the default for search requests.
* Fix tests
* Fix tests
Reporting the `targetNodeName` was added to `main` in #78727 but omitted
from the backport in #78865. This commit adds the missing field to the
`toString()` response.
This test was using the wrong `DiscoveryNodes`, but that mistake was
hidden by other leniency elsewhere in this test suite. This commit fixes
the test bug and also makes the test suite stricter.
Closes #93729
(cherry picked from commit 774e396ed5)
Co-authored-by: David Turner <david.turner@elastic.co>
This action can become fairly expensive for large states. Plus it is
called at high rates on e.g. Cloud, which blocks transport threads
needlessly in large deployments. Let's fork it to MANAGEMENT like we do
for similar CPU-bound actions.
Backport of #90621 to 7.17
Co-authored-by: Armin Braun <me@obrown.io>
Avoids an O(#nodes) iteration by tracking the number of fetches directly.
Backport of #93632 to 7.17
Co-authored-by: luyuncheng <luyuncheng@bytedance.com>
The test was failing when responseDelay == leaderCheckTimeoutMillis.
This resulted in both the response handling and the timeout being scheduled
at the same millisecond and executed in random order. The fix makes it
impossible for the reply to arrive at the same time as the request times out,
since the behavior is not deterministic in that case.
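A sketch of the determinism fix in the test, using `ThreadLocalRandom` in place of the randomized-testing helpers; names and values are illustrative. The simulated response delay is drawn strictly below the leader-check timeout, so the response and the timeout can never be scheduled for the same millisecond.
```
import java.util.concurrent.ThreadLocalRandom;

final class LeaderCheckDelays {
    // Illustrative helper; the real test derives the timeout from its settings.
    static long chooseResponseDelay(long leaderCheckTimeoutMillis) {
        // Upper bound is exclusive, so responseDelay < leaderCheckTimeoutMillis always holds.
        return ThreadLocalRandom.current().nextLong(leaderCheckTimeoutMillis);
    }
}
```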
The migrate action (although not allowed in the frozen phase) would seem
to convert `frozen` to a `data_frozen,data_cold,data_warm,data_hot` tier
configuration. As the migrate action is not allowed in the frozen phase
this can never happen, but the code is confusing because it looks like
it could.
The migrate-to-data-tiers routing service shared the code used by the
`migrate` action, which converted `frozen` to
`data_frozen,data_cold,data_warm,data_hot` when it encountered an
index without any `_tier_preference` setting but with a custom node
attribute configured to `frozen`, e.g. `include.data: frozen`.
As part of https://github.com/elastic/elasticsearch/issues/84758 we have
seen frozen indices with the `data_frozen,data_cold,data_warm,data_hot`
tier preference however we could never reproduce it.
Relates to https://github.com/elastic/elasticsearch/issues/84758
We run the same request back to back for each put-follower call during
the restore. Also, concurrent put-follower calls will all run the same
full CS request concurrently.
In older versions, prior to https://github.com/elastic/elasticsearch/pull/87235,
the concurrency was limited by the size of the snapshot pool. With that
fix though, they run at almost arbitrary concurrency when many
put-follow requests are executed concurrently.
-> fixed by using the existing deduplicator to run only a single remote
CS request at a time for each CCR repository.
Also, this removes the needless forking in the put-follower action,
which is no longer necessary now that the CCR repository is
non-blocking (we do the same for normal restores, which can safely be
started from a transport thread). This should fix some bad-UX
situations where the snapshot threads are busy on the master, making
the put-follower requests not go through in time.