* ESQL: Speed up VALUES for many buckets (#123073)
Speeds up the VALUES agg when collecting from many buckets.
Specifically, this speeds up the algorithm used to `finish` the
aggregation. Most specifically, this makes the algorithm more tolerant
of large numbers of groups being collected. The old algorithm was
`O(n^2)` in the number of groups. The new one is `O(n)`:
```
(groups)
1 219.683 ± 1.069 -> 223.477 ± 1.990 ms/op
1000 426.323 ± 75.963 -> 463.670 ± 7.275 ms/op
100000 36690.871 ± 4656.350 -> 7800.332 ± 2775.869 ms/op
200000 89422.113 ± 2972.606 -> 21920.288 ± 3427.962 ms/op
400000 timed out at 10 minutes -> 40051.524 ± 2011.706 ms/op
```
The `1` group version was not changed at all; the difference is just
noise in the measurement. The small bump in the `1000` case is real and
almost certainly worth it. The huge drop in the `100000` case is quite real.
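To make the complexity difference concrete, here's a minimal sketch (not the actual VALUES code; the names, types, and layout are illustrative) of finishing a grouped collection with a per-group scan versus a single counting pass:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: contrast an O(n^2) finish (scan everything once
// per group) with an O(n) finish (count, compute offsets, place each value once).
class ValuesFinish {
    // O(n^2): for each group, scan every collected (group, value) pair.
    static List<List<Integer>> finishQuadratic(int groups, int[] groupIds, int[] values) {
        List<List<Integer>> out = new ArrayList<>();
        for (int g = 0; g < groups; g++) {
            List<Integer> vs = new ArrayList<>();
            for (int i = 0; i < groupIds.length; i++) {
                if (groupIds[i] == g) vs.add(values[i]);
            }
            out.add(vs);
        }
        return out;
    }

    // O(n): count per group, compute start offsets, then place each value once.
    static List<List<Integer>> finishLinear(int groups, int[] groupIds, int[] values) {
        int[] counts = new int[groups];
        for (int g : groupIds) counts[g]++;
        int[] offsets = new int[groups];
        for (int g = 1; g < groups; g++) offsets[g] = offsets[g - 1] + counts[g - 1];
        int[] sorted = new int[values.length];
        int[] cursor = offsets.clone();
        for (int i = 0; i < groupIds.length; i++) {
            sorted[cursor[groupIds[i]]++] = values[i];
        }
        List<List<Integer>> out = new ArrayList<>();
        for (int g = 0; g < groups; g++) {
            List<Integer> vs = new ArrayList<>();
            for (int i = offsets[g]; i < offsets[g] + counts[g]; i++) vs.add(sorted[i]);
            out.add(vs);
        }
        return out;
    }
}
```

Both produce the same per-group value lists; only the second avoids rescanning all collected pairs once per group.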
* Fix
* Compile
This speeds up grouping by bytes valued fields (keyword, text, ip, and
wildcard) when the input is an ordinal block:
```
bytes_refs 22.213 ± 0.322 -> 19.848 ± 0.205 ns/op (*maybe* real, maybe noise. still good)
ordinal didn't exist -> 2.988 ± 0.011 ns/op
```
I see this as 20ns -> 3ns, an 85% speed up. We never had the ordinals
branch before, so I'm expecting the same performance there - about 20ns
per op.
This also speeds up grouping by a pair of byte valued fields:
```
two_bytes_refs 83.112 ± 42.348 -> 46.521 ± 0.386 ns/op
two_ordinals 83.531 ± 23.473 -> 8.617 ± 0.105 ns/op
```
The speed up is much better when the fields are ordinals because hashing
bytes is comparatively slow.
I believe the ordinals case is quite common. I've run into it in quite a
few profiles.
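A hedged sketch of why the ordinal branch is so much cheaper - hypothetical names, not the actual BlockHash code: instead of hashing every row's bytes, hash each dictionary entry once and resolve each row through the ordinal array:

```java
import java.util.Map;

// Illustrative sketch only: grouping dictionary-encoded (ordinal) input
// versus hashing raw values per row.
class OrdinalGrouping {
    // Slow path: hash every row's value.
    static int[] groupByBytes(String[] rows, Map<String, Integer> hash) {
        int[] groupIds = new int[rows.length];
        for (int i = 0; i < rows.length; i++) {
            groupIds[i] = hash.computeIfAbsent(rows[i], k -> hash.size());
        }
        return groupIds;
    }

    // Fast path: the block is already dictionary-encoded; hash each distinct
    // dictionary entry once, then each row is a single array read.
    static int[] groupByOrdinals(String[] dictionary, int[] ordinals, Map<String, Integer> hash) {
        int[] dictToGroup = new int[dictionary.length];
        for (int d = 0; d < dictionary.length; d++) {
            dictToGroup[d] = hash.computeIfAbsent(dictionary[d], k -> hash.size());
        }
        int[] groupIds = new int[ordinals.length];
        for (int i = 0; i < ordinals.length; i++) {
            groupIds[i] = dictToGroup[ordinals[i]];
        }
        return groupIds;
    }
}
```

With many rows per distinct value, the hash work amortizes to near zero on the ordinal path, which matches the `two_ordinals` numbers above.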
* Add CircuitBreaker to TDigest, Step 4: Take into account shallow classes size (#113613)
* Removed muted tests from merge conflict
* Added missing empty line in muted tests
This adds a test to *every* agg for when it's entirely filtered away and
another when filtering is enabled but unused. I'll follow up with
another test later for partial filtering.
That test caught a bug where some aggs would think they'd been `seen`
when they hadn't. This fixes that too.
Part of https://github.com/elastic/elasticsearch/issues/99815
## Steps
1. Migrate TDigest classes to use a custom Array implementation. Temporarily use a simple array wrapper (https://github.com/elastic/elasticsearch/pull/112810)
2. Implement CircuitBreaking in the `MemoryTrackingTDigestArrays` class. Add `Releasable` and ensure it's always closed within TDigest (This PR)
3. Pass the CircuitBreaker as a parameter to TDigestState from wherever it's being used
4. Account remaining TDigest classes size ("SHALLOW_SIZE")
Every step should be safely mergeable to main:
- The first and second steps should have no impact.
- The third and fourth ones will start increasing the CB count partially.
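The idea behind step 2 can be sketched like this (illustrative names only, not the actual `MemoryTrackingTDigestArrays` API): an array wrapper that charges a breaker on allocation and gives the bytes back on close, so TDigest can hold it as a `Releasable`:

```java
// Illustrative sketch of circuit-breaker-tracked array allocation.
// Breaker is a stand-in for the real CircuitBreaker interface.
class TrackingDoubleArray implements AutoCloseable {
    interface Breaker { void addBytes(long bytes); }

    private final Breaker breaker;
    private final double[] data;
    private final long bytes;

    TrackingDoubleArray(Breaker breaker, int size) {
        this.breaker = breaker;
        this.bytes = (long) size * Double.BYTES;
        breaker.addBytes(bytes);   // the real breaker may throw here if over limit
        this.data = new double[size];
    }

    double get(int i) { return data[i]; }

    void set(int i, double v) { data[i] = v; }

    @Override
    public void close() {
        breaker.addBytes(-bytes);  // release the accounted bytes
    }
}
```

The key invariant is that every allocation is eventually matched by a release, which is why the PR ensures the arrays are always closed within TDigest.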
## Remarks
To simplify testing the CircuitBreaker, added a helper method + `@After` to ESTestCase.
Right now CBs are usually tested through MockBigArrays. E.g:
f7a0196b45/x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/expression/function/AbstractFunctionTestCase.java (L1263-L1265)
So I guess there was no need for this yet. But I may have missed something somewhere.
Also, I'm separating this PR from the "step 3" as integrating this (CB) in the current usages may require some refactor of external code, which may be somewhat more _dangerous_
This speeds up the `CASE` function when it has two or three arguments
and both of the arguments are constants or fields. This works because
`CASE` is lazy so it can avoid warnings in cases like
```
CASE(foo != 0, 2 / foo, 1)
```
And, in the case where the function is *very* slow, it can avoid the
computations.
But if the lhs and rhs of the `CASE` are constant then there isn't any
work to avoid.
The performance improvement is pretty substantial:
```
(operation) Before Error After Error Units
case_1_lazy 97.422 ± 1.048 101.571 ± 0.737 ns/op
case_1_eager 79.312 ± 1.190 4.601 ± 0.049 ns/op
```
The top line is a `CASE` that has to be lazy - it shouldn't change. The
4 nanos change here is noise. The eager version improves by about 94%.
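A minimal sketch of the lazy/eager distinction (illustrative, not the actual evaluator code): when both branches are already evaluated for the whole block, `CASE` degenerates to a per-row select between two arrays:

```java
import java.util.function.IntUnaryOperator;

// Illustrative sketch: lazy CASE drives a per-row evaluator for the chosen
// branch; eager CASE just selects between two precomputed blocks.
class EagerCase {
    // Lazy form: needed when a branch can warn or is expensive to compute.
    static int lazy(boolean cond, IntUnaryOperator lhs, IntUnaryOperator rhs, int row) {
        return cond ? lhs.applyAsInt(row) : rhs.applyAsInt(row);
    }

    // Eager form: both branches already evaluated for the block, so each row
    // is a single conditional select.
    static int[] eager(boolean[] cond, int[] lhs, int[] rhs) {
        int[] out = new int[cond.length];
        for (int i = 0; i < cond.length; i++) {
            out[i] = cond[i] ? lhs[i] : rhs[i];
        }
        return out;
    }
}
```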
The native platform dir can be found using a TestUtil method, but
benchmarks were trying to construct it on their own. This commit switches
to using the util method.
This adds a `Block#keepMask(BooleanVector)` method that will make a new
block, keeping all of the values where the vector is `true` and
`null`ing all of the values where the vector is `false`.
This will be useful for implementing partial aggregation application
like `| STATS MAX(a WHERE b > 1), MIN(j WHERE b > 2) BY bar`. Or however
the syntax ends up being. We already skip `null` group keys and we can
evaluate the `b > 2` bits to a mask pretty easily. It should also be
useful in optimizing `CASE(a > 2, foo)` - but only when the RHS of the
CASE is `null` and the LHS is a constant or constant-like.
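The semantics can be sketched on a boxed array standing in for a `Block` with a null bitmap (illustrative only, not the real Block API):

```java
// Illustrative sketch of keepMask semantics: keep values where the mask is
// true, null out positions where it is false.
class KeepMask {
    static Integer[] keepMask(Integer[] block, boolean[] mask) {
        Integer[] out = new Integer[block.length];
        for (int i = 0; i < block.length; i++) {
            out[i] = mask[i] ? block[i] : null;  // false -> null'ed position
        }
        return out;
    }
}
```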
This is something that's very optimize-able. I haven't really optimized
it in this PR, but it should be possible to speed this up a ton and
remove a lot of copying. Here's where the benchmarks start:
```
(dataTypeAndBlockKind) Mode Cnt Score Error Units
int/array avgt 7 3.705 ± 0.153 ns/op
int/vector avgt 7 3.234 ± 0.078 ns/op
```
That's about the same speed as reading the block. In a few of these
cases I expect we can get them to constant performance rather than
per-record performance.
Native libraries in Java are loaded by calling System.loadLibrary. This
method inspects paths in the java.library.path to find the requested
library. Elasticsearch previously used this to find libsystemd, but now
the only remaining use is to set the additional platform directory in
which Elasticsearch keeps its own native libraries.
One issue with setting java.library.path is that it's not set for the cli
process, which makes loading the native library infrastructure from clis
difficult. This commit reworks how Elasticsearch native libraries are
found in order to avoid needing to set java.library.path. There are two
cases. The simplest is production, where the working directory is the
Elasticsearch installation directory, so the platform specific directory
can be constructed. The second case is for tests where we don't have an
installation. We already pass in java.library.path there, so this change
renames the system property to a test-specific property that the new
loading infrastructure looks for.
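The two lookup cases can be sketched like this (the directory layout and property name here are illustrative assumptions, not the actual implementation):

```java
import java.nio.file.Path;

// Illustrative sketch of the two lookup cases: tests pass an explicit
// directory via a test-only system property; production derives the platform
// directory from the working (installation) directory.
class NativeLibDir {
    static Path find(String testPathProp) {
        if (testPathProp != null) {
            return Path.of(testPathProp);  // test case: explicit property wins
        }
        // production case: working dir is the installation dir, so a
        // platform-specific subdirectory can be constructed from os/arch
        String os = System.getProperty("os.name").toLowerCase().contains("win") ? "windows" : "linux";
        String arch = System.getProperty("os.arch");
        return Path.of("").toAbsolutePath().resolve("lib/platform/" + os + "-" + arch);
    }
}
```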
This change allows querying the `index.mode` setting via a new
`_index_mode` metadata field, enabling APIs such as `field_caps` or
`resolve_indices` to target indices that are either time_series or logs
only. This approach avoids adding and handling a new parameter for
`index_mode` in these APIs. Both ES|QL and the `_search` API should also
work with this new field.
* Mechanical package change in IntelliJ
* A couple of manual fixups
* Export plugins.loading to deprecation
* Put plugin-cli in a module so can export PluginsUtils to it.
This adds `hamming` distances, the pop-count of `xor` byte vectors as a
first class citizen in painless.
For byte vectors, this means that we can compute hamming distances via
script_score (aka, brute-force).
The implementation of `hamming` is the same that is available in Lucene,
and when lucene 9.11 is merged, we should update our logic where
applicable to utilize it.
NOTE: this does not yet add hamming distance as a metric for indexed
vectors. This will be a future PR after the Lucene 9.11 upgrade.
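The definition is simple enough to sketch: the pop-count of the XOR of each byte pair, which is the same definition Lucene uses:

```java
// Hamming distance over byte vectors: XOR each byte pair, count set bits.
class Hamming {
    static int distance(byte[] a, byte[] b) {
        int dist = 0;
        for (int i = 0; i < a.length; i++) {
            dist += Integer.bitCount((a[i] ^ b[i]) & 0xFF);  // pop-count of the XOR
        }
        return dist;
    }
}
```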
After #109162, all instances of DataType were in one class, as a collection of static constants. That pattern can then be simplified into an enum, merging the behavior class and the static collection class. That opens the door to future optimizations, like using enum serialization rather than string serialization to save bytes over the wire. It also makes the code easier to read, as all the behavior is now in a single file (which is still pretty short).
Most of this PR is just juggling names around to have the two references refer to the same thing.
Follow up work can merge the functions from EsqlDataTypes into this enum, but this PR is long enough already.
This moves all of the new data types declared in `EsqlDataTypes` into
`DataTypes`. It also removes `EsqlDataTypes#types` and makes
`DataTypes#types` return what that used to return. It doesn't modify any
other methods of `EsqlDataTypes`. That's a change for another time.
This commit extends the custom SIMD optimized SQ vector scorer to include search time scoring.
When run on JDK 22+, vector scoring will be done with the custom scorer. The implementation uses the JDK 22+ ALLOW_HEAP_ACCESS Linker.Option so that the native code can access the on-heap query vector directly.
This commit refactors libvec to replace custom scorer types with Lucene types.
The initial implementation created separate types to model the vector scorer with an adapter between them and the Lucene types. This was done to avoid a dependency on Lucene from the native module. This is no longer an issue, since the code is now separated from the native module already, and in fact already depends on Lucene. This PR drops the custom types in favour of the Lucene ones. This will help future refactoring, and avoid bugs by reusing the existing and known model in this area.
I also took the liberty of reflowing the code to match that of the recent change in Lucene to support off-heap scoring - this code is now very similar to that, and will become even more clean and streamlined in the lucene_snapshot branch. This refactoring is not directly dependent on the next version of Lucene, so it is done in main.
Part of https://github.com/elastic/elasticsearch/issues/106679
* Copy the `ql` project into a different project _just for esql_, call it `esql-core`.
* Make `esql` depend only on the latter.
* Fix `EsqlNodeSubclassTests`; I'm confused why this didn't bite us earlier.
* Update the warning regexes in some csv tests as the exceptions have other package names now.
**Note to reviewers:** Exclude the first commit when viewing the diff,
as that contains only the actual copying of `ql`. The remaining commits
are the actually meaningful ones. _The `build.gradle` files probably
require the most attention._
This commit updates the native vector provider to reflect that Lucene's scalar quantization is unsigned int7, with a range of values from 0 to 127 inclusive. Stride has been pushed down into native code, to allow other platforms to more easily select their own stride length.
Previously the implementation supported signed int8. We might want the more general signed int8 implementation in the future, but for now unsigned int7 is sufficient, and allows us to provide more efficient implementations on x64.
We had a TODO in our `BlockHash` implementations optimized for pairs of
columns - we wanted to use `MultiValueDedupe` inside their `add` methods
for `Block`s. This implements that TODO.
It makes a small behavior change on one of the blocks we don't yet use
in production - the `BytesRefLongBlockHash` will now properly keep
`null` bytes strings. That's a side effect of reusing other components
and doesn't actually allow us to use it in production - we're still
waiting on the joint hash tables which are blocked behind vector
instructions.
I moved a bunch of files to different places so I could reach into the
innards of the `MultivalueDedupe` subclasses to build the block hash
addition logic. It seemed like a reasonable thing to do. And it seemed
reasonable not to expose the raw arrays outside of the package.
This change moves GroupSpec from HashAggregationOperator to BlockHash,
making it available for MetricsAggregatorOperator, which will be
introduced soon.
This commit adds an optimised int8 vector distance implementation for aarch64. Additional platforms like, say, x64, will be added as a follow-up.
The vector distance implementation outperforms Lucene's Panama Vector implementation for binary comparisons by approx 5x (depending on the number of dimensions). It does so by means of compiler intrinsics built into a separate native library and linked via Panama's FFI. Comparisons are performed on off-heap mmap'ed vector data.
The implementation is currently only used during merging of scalar quantized segments, through a custom format ES814HnswScalarQuantizedVectorsFormat, but its usage will likely be expanded over time.
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Co-authored-by: Lorenzo Dematté <lorenzo.dematte@elastic.co>
Co-authored-by: Mark Vieira <portugee@gmail.com>
Co-authored-by: Ryan Ernst <ryan@iernst.net>
This makes a couple of changes to regex processing in the compute
engine:
1. Process utf-8 strings directly. This should save a ton of time.
2. Snip the `toString` output if it is too big - I chose 64kb of
strings.
3. I changed the formatting of the automaton to a slightly customized
`dot` output. Because automata are graphs. Everyone knows it. And
they are a lot easier to read as graphs. `dot` is easy to convert
into a graph.
4. I implement `EvaluatorMapper` for regex operations which is pretty
standard for the rest of our operations.
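Point 2 can be sketched like so (the truncation marker and budget handling are illustrative):

```java
// Illustrative sketch of capping a toString at a fixed budget so huge
// automata don't produce multi-megabyte debug strings.
class Snip {
    static final int MAX = 64 * 1024;  // 64kb of string, as described above

    static String snip(String s) {
        return s.length() <= MAX ? s : s.substring(0, MAX) + "... (truncated)";
    }
}
```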
We want to report that observation of document parsing has finished only upon successful indexing.
To achieve this, we need to perform reporting in only one place (not, as previously, in both IngestService and 'bulk action').
This commit splits the DocumentParsingObserver in two. One for wrapping an XContentParser and returning the observed state - the DocumentSizeObserver and a DocumentSizeReporter to perform an action when parsing has been completed and indexing successful.
To perform reporting in one place we need to pass the state from IngestService to 'bulk action'. The state is currently represented as long - normalisedBytesParsed.
In TransportShardBulkAction we get the normalisedBytesParsed information, and in the serverless plugin we check whether the value indicates that parsing already happened in IngestService (value != -1); if so, we create a DocumentSizeObserver with the fixed normalisedBytesParsed and won't increment it.
When the indexing is completed and successful we report the observed state for an index with DocumentSizeReporter
Small nit: by passing the documentParsingObserver via SourceToParse we no longer have to inject it via a complex hierarchy for DocumentParser. Hence some constructor changes.