Commit graph

127 commits

Author SHA1 Message Date
Kathleen DeRusso
c9a6a2c841
Add match support for semantic_text fields (#117839)
* Added query name to inference field metadata

* Fix build error

* Added query builder service

* Add query builder service to query rewrite context

* Updated match query to support querying semantic text fields

* Fix build error

* Fix NPE

* Update the POC to rewrite to a bool query when combined inference and non-inference fields

* Separate clause for each inference index (to avoid inference ID clashes)

* Simplify query builder service concept to a single default inference query

* Rename QueryBuilderService, remove query name from inference metadata

* Fix too many rewrite rounds error by injecting booleans in constructors for match query builder and semantic text

* Fix test compilation errors

* Fix tests

* Add yaml test for semantic match

* Add NodeFeature

* Fix license headers

* Spotless

* Updated getClass comparison in MatchQueryBuilder

* Cleanup

* Add Mock Inference Query Builder Service

* Spotless

* Cleanup

* Update docs/changelog/117839.yaml

* Update changelog

* Replace the default inference query builder with a query rewrite interceptor

* Cleanup

* Some more cleanup/renames

* Some more cleanup/renames

* Spotless

* Checkstyle

* Convert List<QueryRewriteInterceptor> to Map keyed on query name, error on query name collisions

* PR feedback - remove check on QueryRewriteContext class only

* PR feedback

* Remove intercept flag from MatchQueryBuilder and replace with wrapper

* Move feature to test feature

* Ensure interception happens only once

* Rename InterceptedQueryBuilderWrapper to AbstractQueryBuilderWrapper

* Add lenient field to SemanticQueryBuilder

* Clean up yaml test

* Add TODO comment

* Add comment

* Spotless

* Rename AbstractQueryBuilderWrapper back to InterceptedQueryBuilderWrapper

* Spotless

* Didn't mean to commit that

* Remove static class wrapping the InterceptedQueryBuilderWrapper

* Make InterceptedQueryBuilderWrapper part of QueryRewriteInterceptor

* Refactor the interceptor to be an internal plugin that cannot be used outside inference plugin

* Fix tests

* Spotless

* Minor cleanup

* C'mon spotless

* Test spotless

* Cleanup InternalQueryRewriter

* Change if statement to assert

* Simplify template of InterceptedQueryBuilderWrapper

* Change constructor of InterceptedQueryBuilderWrapper

* Refactor InterceptedQueryBuilderWrapper to extend QueryBuilder

* Cleanup

* Add test

* Spotless

* Rename rewrite to interceptAndRewrite in QueryRewriteInterceptor

* DOESN'T WORK - for testing

* Add comment

* Getting closer - match on single typed fields works now

* Deleted line by mistake

* Checkstyle

* Fix over-aggressive IntelliJ Refactor/Rename

* And another one

* Move SemanticMatchQueryRewriteInterceptor.SEMANTIC_MATCH_QUERY_REWRITE_INTERCEPTION_SUPPORTED to Test feature

* PR feedback

* Require query name with no default

* PR feedback & update test

* Add rewrite test

* Update server/src/main/java/org/elasticsearch/index/query/InnerHitContextBuilder.java

Co-authored-by: Mike Pellegrini <mike.pellegrini@elastic.co>

---------

Co-authored-by: Mike Pellegrini <mike.pellegrini@elastic.co>
2024-12-12 16:55:00 +01:00
Benjamin Trent
5e859d9301
Even better(er) binary quantization (#117994)
This measurably improves BBQ by adjusting the underlying algorithm to an
optimized per vector scalar quantization.

This is a brand new way to quantize vectors. Instead of there being a
global set of upper and lower quantile bands, these are optimized and
calculated per individual vector. Additionally, vectors are centered on
a common centroid. 

This allows for an almost 32x reduction in memory, and even better
recall than before at the cost of slightly increasing indexing time.

Additionally, this new approach is easily generalizable to various other
bit sizes (e.g. 2 bits, etc.). While not taken advantage of yet, we may
update our scalar quantized indices in the future to use this new
algorithm, giving significant boosts in recall.

The recall gains spread from 2% to almost 10% for certain datasets with
an additional 5-10% indexing cost when indexing with HNSW when compared
with current BBQ.
2024-12-10 03:06:27 +11:00
Niels Bauman
032b42fcf7
Make TransportLocalClusterStateAction wait for cluster to unblock (#117230)
This will make `TransportLocalClusterStateAction` wait for a new state
that is not blocked. This means we need a timeout (again). For
consistency's sake, we're reusing the REST param `master_timeout` for
this timeout as well.

The only class that was using `TransportLocalClusterStateAction` was
`TransportGetAliasesAction`, so its request needed to accept a timeout
again as well.
2024-12-04 12:17:13 +01:00
Benjamin Trent
6c2f6071b2
Refactor/bbq format (#117847)
* Refactor bbq format to be contained in a package

* fixing license headers

* fixing module

* fix style
2024-12-02 16:04:31 -05:00
John Verwolf
8350ff29ba
Extensible Completion Postings Formats (#111494)
Allows the Completion Postings Format to be extensible by providing an implementation of the CompletionsPostingsFormatExtension SPIs.
2024-11-28 13:25:02 -08:00
Martijn van Groningen
6a4b68d263
Add source mode stats to MappingStats (#117463) 2024-11-28 10:53:39 +01:00
Simon Cooper
cc35f1dc6a
Remove transport versions fixup listener and associated code (#116941) 2024-11-18 16:19:14 +00:00
Simon Cooper
c832572709
Remove some historical features (#116926)
Historical features are now trivially true on v9 - so we can remove the features, and the check.
Historical features do not affect cluster state, so this has no compatibility restrictions.
2024-11-18 14:33:05 +00:00
Alexis Charveriat
e0af1238fc
Index stats enhancement: creation date and tier_preference (#116339)
* Expose tier preference as part of the index stats
* Also expose index creation date in index stats
* Added test
2024-11-15 09:08:42 +01:00
Patrick Doyle
338c0538b7
Dynamic entitlement agent (#116125)
* Refactor: treat "maybe" JVM options uniformly

* WIP

* Get entitlement running with bridge all the way through, with qualified
exports

* Cosmetic changes to SystemJvmOptions

* Disable entitlements by default

* Bridge module comments

* Fixup forbidden APIs

* spotless

* Rename EntitlementChecker

* Fixup InstrumenterTests

* exclude recursive dep

* Fix some compliance stuff

* Rename asm-provider

* Stop using bridge in InstrumenterTests

* Generalize readme for asm-provider

* InstrumenterTests doesn't need EntitlementCheckerHandle

* Better javadoc

* Call parseBoolean

* Add entitlement to internal module list

* Docs as requested by Lorenzo

* Changes from Jack

* Rename ElasticsearchEntitlementChecker

* Remove logging javadoc

* exportInitializationToAgent should reference EntitlementInitialization, not EntitlementBootstrap.

They're currently in the same module, but if that ever changes, this code would have become wrong.

* Some suggestions from Mark

---------

Co-authored-by: Ryan Ernst <ryan@iernst.net>
2024-11-06 00:07:52 +01:00
Nhat Nguyen
fa6c5296d4
Add num docs and size to logsdb telemetry (#116128)
Follow-up on #115994 to add telemetry for the total number of documents 
and size in bytes of logsdb indices.

Relates #115994
2024-11-05 08:46:46 -08:00
Ying Mao
4ecdfbb214
[Inference API] Add API to get configuration of inference services (#114862)
* Adding API to get list of service configurations

* Update docs/changelog/114862.yaml

* Fixing some configurations

* PR feedback -> Stream.of

* PR feedback -> singleton

* Renaming ServiceConfiguration to SettingsConfiguration. Adding TaskSettingsConfiguration

* Adding task type settings configuration to response

* PR feedback
2024-10-30 13:29:58 -04:00
Luca Cavanna
8efd08b019
Upgrade to Lucene 10 (#114741)
The most relevant ES changes that upgrading to Lucene 10 requires are:

- use the appropriate IOContext
- Scorer / ScorerSupplier breaking changes
- Regex automaton are no longer determinized by default
- minimize moved to test classes
- introduce Elasticsearch900Codec
- adjust slicing code according to the added support for intra-segment concurrency
- disable intra-segment concurrency in tests
- adjust accessor methods for many Lucene classes that became a record
- adapt to breaking changes in the analysis area

Co-authored-by: Christoph Büscher <christophbuescher@posteo.de>
Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
Co-authored-by: ChrisHegarty <chegar999@gmail.com>
Co-authored-by: Brian Seeders <brian.seeders@elastic.co>
Co-authored-by: Armin Braun <me@obrown.io>
Co-authored-by: Panagiotis Bailis <pmpailis@gmail.com>
Co-authored-by: Benjamin Trent <4357155+benwtrent@users.noreply.github.com>
2024-10-21 13:38:23 +02:00
Benjamin Trent
6c752abc23
Adding new bbq index types behind a feature flag (#114439)
new index types of bbq_hnsw and bbq_flat which utilize the better binary quantization formats. A 32x reduction in memory, with nice recall properties.
2024-10-14 20:13:27 -04:00
Mary Gouseti
0d03207477
Clean up factory retention settings from elasticsearch (#114396)
This removes the possibility for a plugin to provide factory retention settings. Factory retention settings have been deprecated and completely replaced by #111972.

Note: this feature is not in use. If someone wants to set global retention they can use the cluster settings as defined in #111972.
2024-10-10 11:45:46 +03:00
David Turner
07c3acf1c0
Remove cluster state from /_cluster/reroute response (#114231)
Including the cluster state in responses to the `POST _cluster/state`
API  was deprecated in #90399 (v8.6.0) requiring callers to pass
`?metric=none` to avoid the deprecation warning. This commit adjusts the
behaviour as promised in v9 so that this API never returns the cluster
state, and deprecates the `?metric` parameter itself.

Closes #88978
2024-10-08 07:59:57 +01:00
Chris Hegarty
32dde26e49
Upgrade to Lucene 9.12.0 (#113333)
This commit upgrades to Lucene 9.12.0.

Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Armin Braun <me@obrown.io>
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Co-authored-by: Chris Hegarty <chegar999@gmail.com>
Co-authored-by: John Wagster <john.wagster@elastic.co>
Co-authored-by: Luca Cavanna <javanna@apache.org>
Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
2024-10-01 08:39:27 +01:00
David Turner
97db0c2182
Remove {Indices,}ClusterStateUpdateRequest (#113483)
These abstract classes are now unused so this commit removes them.
2024-09-25 08:17:51 +01:00
Simon Cooper
f9aa6f40cd
Always use CLDR locale on ES v9 (#113184)
Regardless of JDK version, ES should always use CLDR locale database from 9.0.0.
This also removes IsoCalendarDataProvider used to override week-date calculations for the root locale only.
2024-09-23 11:05:08 +01:00
Ryan Ernst
2ecfb397ad
Remove plugin classloader indirection (#113154)
Extensible plugins use a custom classloader for other plugin jars. When
extensible plugins were first added, the transport client still existed,
and elasticsearch plugins did not exist in the transport client (at
least not the ones that create classloaders). Yet the transport client
still created a PluginsService. An indirection was used to avoid
creating separate classloaders when the transport client had created the
PluginsService.

The transport client was removed in 8.0, but the indirection still
exists. This commit removes that indirection layer.
2024-09-20 07:45:40 -07:00
Mark Vieira
a59c182f9f
Add AGPLv3 as a supported license 2024-09-13 15:29:46 -07:00
Patrick Doyle
50871a3d28
New injector (#111722)
* Initial new injector

* Allow createComponents to return classes

* Downsample injection

* Remove more vestiges of subtype handling

* Lowercase logger

* Respond to code review comments

* Only one object per class

* Some additional cleanup incl spotless

* PR feedback

* Missed one

* Rename workQueue

* Remove Injector.addRecordContents

* TelemetryProvider requires us to inject an object using a supertype

* Address Simon's comments

* Clarify the reason for SuppressForbidden

* Make log indentation code less intrusive
2024-08-28 11:13:47 -04:00
David Turner
f150e2c11d
Add telemetry for repository usage (#112133)
Adds to the `GET _cluster/stats` endpoint information about the snapshot
repositories in use, including their types, whether they are read-only
or read-write, and for Azure repositories the kind of credentials in
use.
2024-08-27 23:34:02 +10:00
Panagiotis Bailis
b685a436ce
Adding RankDocsRetrieverBuilder and RankDocsQuery (#111709) 2024-08-26 15:18:47 +03:00
Patrick Doyle
35a375329a
Move Guice to org.elasticsearch.injection.guice (#111723)
* Move files and fix imports & module exports
* Other consequences of moving Guice
2024-08-12 10:47:46 -04:00
Keith Massey
a2814e816b
Adding mapping validation to the simulate ingest API (#110606) 2024-07-19 08:08:21 -05:00
Joe Gallo
27e7601698
Directly download commercial ip geolocation databases from providers (#110844)
Co-authored-by: Keith Massey <keith.massey@elastic.co>
2024-07-17 20:55:14 -04:00
Ryan Ernst
e6713a5c0a
Remove JNA from server dependencies (#110809)
All native methods are now bound through NativeAccess. This commit
removes the jna dependency from server.

relates #104876
2024-07-12 19:49:13 -07:00
Ryan Ernst
8417d3f141
Move preallocate functionality to native access (#110678)
This commit moves the file preallocation functionality into
NativeAccess. The code is basically the same. One small tweak is that
instead of breaking Java access boundaries in order to get an open file
handle, the new code uses posix open directly.

relates #104876
2024-07-11 09:42:44 -07:00
Mayya Sharipova
405e39660b
Support k parameter for knn query (#110233)
Introduce an optional k param for knn query

If k is not set, knn query has the previous behaviour:
- `num_candidates` docs  is collected from each shard. This `num_candidates` docs
are used for combining with results with other queries and aggregations on each shard.
- docs from all shards are merged to produce the top global `size` results

If k is set, the behaviour instead is following:
- `k` docs is collected from each shard. This `k` docs are used for
combining results with other queries and aggregations on each shard.
- similarly, docs from all shards are merged to produce the top global `size`
results.

Having `k` param makes it more intuitive for users to address their needs.
They also don't need to care and can skip `num_candidates` param for this query
as it is of more internal details to tune how knn search operates.

Closes #108473
2024-06-28 09:59:28 -04:00
Benjamin Trent
5add44d7d1
Adds new bit element_type for dense_vectors (#110059)
This commit adds `bit` vector support by adding `element_type: bit` for
vectors. This new element type works for indexed and non-indexed
vectors. Additionally, it works with `hnsw` and `flat` index types. No
quantization based codec works with this element type, this is
consistent with `byte` vectors.

`bit` vectors accept up to `32768` dimensions in size and expect vectors
that are being indexed to be encoded either as a hexidecimal string or a
`byte[]` array where each element of the `byte` array represents `8`
bits of the vector.

`bit` vectors support script usage and regular query usage. When
indexed, all comparisons done are `xor` and `popcount` summations (aka,
hamming distance), and the scores are transformed and normalized given
the vector dimensions. Note, indexed bit vectors require `l2_norm` to be
the similarity.

For scripts, `l1norm` is the same as `hamming` distance and `l2norm` is
`sqrt(l1norm)`. `dotProduct` and `cosineSimilarity` are not supported.

Note, the dimensions expected by this element_type are always to be
divisible by `8`, and the `byte[]` vectors provided for index must be
have size `dim/8` size, where each byte element represents `8` bits of
the vectors.

closes: https://github.com/elastic/elasticsearch/issues/48322
2024-06-27 04:48:41 +10:00
Patrick Doyle
43b2e877e0
Revert "Move PluginsService to its own internal package (#109872)" (#109946)
This reverts commit b9e7965184.
2024-06-19 18:10:50 -04:00
Patrick Doyle
b9e7965184
Move PluginsService to its own internal package (#109872)
* Mechanical package change in IntelliJ
* A couple of manual fixups
* Export plugins.loading to deprecation
* Put plugin-cli in a module so can export PluginsUtils to it.
2024-06-19 15:23:47 -04:00
Benjamin Trent
acc99302c6
Adding hamming distance function to painless for dense_vector fields (#109359)
This adds `hamming` distances, the pop-count of `xor` byte vectors as a
first class citizen in painless. 

For byte vectors, this means that we can compute hamming distances via
script_score (aka, brute-force).

The implementation of `hamming` is the same that is available in Lucene,
and when lucene 9.11 is merged, we should update our logic where
applicable to utilize it.

NOTE: this does not yet add hamming distance as a metric for indexed
vectors. This will be a future PR after the Lucene 9.11 upgrade.
2024-06-18 03:41:20 +10:00
Chris Hegarty
fa364bfcaf
Rename the vec module to better reflect that it provides SIMD optimized vector scorers (#109661)
This commit renames the vector module to better reflect its intent - to provide SIMD optimized vector scorer implementations.
2024-06-17 11:10:02 +01:00
Panagiotis Bailis
4dd4356c15
Adding RerankingContext classes to support global reranking (#109435) 2024-06-12 18:09:06 +03:00
Ryan Ernst
0be3c741df
Guard file settings readiness on file settings support (#109500)
Consistency of file settings is an important invariant. However, when
upgrading from Elasticsearch versions before file settings existed,
cluster state will not yet have the file settings metadata. If the first
node upgraded is not the master node, new nodes will never become ready
while they wait for file settings metadata to exist.

This commit adds a node feature for file settings to guard waiting on
file settings for readiness. Although file settings has existed since
8.4, the feature is not a historical feature because historical features
are not applied to cluster state that readiness checks. In this case it
is not needed since upgrading from 8.4+ will already contain file
settings metadata.
2024-06-11 06:55:53 +10:00
Panagiotis Bailis
4a1d7426d7
Adding RankFeature implementation (#108538) 2024-06-06 11:20:53 +03:00
Przemyslaw Gomulka
437e7db499
Refactor reporting of RA metrics to not to be done in TransportShardBulkAction (#108449)
previously DocumentSizeReporter was reporting upon indexing being completed in TransportShardBulkAction#onComplete
This commit renames the method to onIndexingCompleted and moves that reporting to IndexEngine in serverless plugin.
This will be followed up in a separate PR that will be reporting in an Engine#index subclass (serverless)
2024-05-16 13:57:06 +02:00
Simon Cooper
e7350dce29
Add a capabilities API to check node and cluster capabilities (#106820)
This adds a /_capabilities rest endpoint for checking the capabilities of a cluster - what endpoints, parameters, and endpoint capabilities the cluster supports
2024-05-08 14:44:26 +01:00
Kostas Krikellas
3183e6d6c9
Add ignored field values to synthetic source (#107567)
* Add ignored field values to synthetic source

* Update docs/changelog/107567.yaml

* initialize map

* yaml fix

* add node feature

* add comments

* small fixes

* missing cluster feature in yaml

* constants for chars, stored fields

* remove duplicate method

* throw exception on parse failure

* remove Base64 encoding

* add assert on IgnoredValuesFieldMapper::write

* changes from review

* simplify logic

* add comment

* rename classes

* rename _ignored_values to _ignored_source

* rename _ignored_values to _ignored_source
2024-04-26 15:35:31 +03:00
Panagiotis Bailis
d029d40cea
Adding new RankContext classes per different search phase/node type (#107093) 2024-04-24 14:58:16 +03:00
Mary Gouseti
119f6e71ce
[Data stream lifecycle] Introduce factory retention settings (#107741)
We introduce the plumbing so that a plugin can provide factory retention. This retention will take effect if there is no global retention provided by the user.

Without a plugin defining the factory retention, elasticsearch will have no factory retention.
2024-04-24 11:52:24 +03:00
Howard
fdbb21bba4
Support effective watermark thresholds in node stats API (#107244)
Adds to the `fs` component of the node stats API some additional values
indicating the disk watermarks that are currently in effect.

Relates #106676
2024-04-18 09:57:28 -04:00
Chris Hegarty
6b52d7837b
Add an optimised int8 vector distance function for aarch64. (#106133)
This commit adds an optimised int8 vector distance implementation for aarch64. Additional platforms like, say, x64, will be added as a follow-up.

The vector distance implementation outperforms Lucene's Pamana Vector implementation for binary comparisons by approx 5x (depending on the number of dimensions). It does so by means of compiler intrinsics built into a separate native library and link by Panama's FFI. Comparisons are performed on off-heap mmap'ed vector data.

The implementation is currently only used during merging of scalar quantized segments, through a custom format ES814HnswScalarQuantizedVectorsFormat, but its usage will likely be expanded over time.

Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
Co-authored-by: Lorenzo Dematté <lorenzo.dematte@elastic.co>
Co-authored-by: Mark Vieira <portugee@gmail.com>
Co-authored-by: Ryan Ernst <ryan@iernst.net>
2024-04-12 08:44:21 +01:00
Tim Vernum
36d5282907
Allow additional JSON log fields via SPI (#106980)
This adds a new SPI based `LoggingDataProvider` service that can be
implemented in order to add new fields to the main JSON log
2024-04-10 22:14:00 -04:00
Adrien Grand
49ffa045a6
Cut over stored fields to ZSTD for compression. (#103374)
This cuts over stored fields with `index.codec: best_speed` (default) to ZSTD with level 0 and blocks of at most 128 documents or 14kB, and `index.codec: best_compression` to ZSTD with level 3 and blocks of at most 2,048 documents or 240kB.

Compared with the current codecs, this would yield similar indexing speed, much better space efficiency and similar retrieval speed. Benchmarks on the `elastic/logs` track suggest 10% better storage efficiency and slightly faster ingestion.

The Lucene codec infrastructure records the codec on a per-segment basis and ensures that this change is backward-compatible. Segments will get progressively migrated to ZSTD as they get merged in the background.

Bindings for ZSTD are provided by the Panama FFI API on JDK21+ and JNA on older JDKs.

ZSTD support is currently behind a feature flag, so it won't be enabled immediately when this feature gets merged, this will need a follow-up change.

Co-authored-by: Mark Vieira <portugee@gmail.com>
Co-authored-by: Ryan Ernst <ryan@iernst.net>
2024-04-09 09:18:58 +02:00
Jack Conradson
68b0acac8f
Add retrievers using the parser-only approach (#105470)
This enhancement adds a new abstraction to the _search API called "retriever." A 
retriever is something that returns top hits. This adds three initial retrievers called
"standard", "knn", and "rrf". The retrievers use a parser-only approach where they
are parsed and then translated into a SearchSourceBuilder to execute the actual
search.
---------

Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
2024-03-12 10:11:55 -07:00
Andrei Dan
882b92ab60
Add service for computing the optimal number of shards for data streams (#105498)
This adds the `DataStreamAutoShardingService` that will compute the
optimal number of shards for a data stream and return a recommendation
as to when to apply it (a time interval we call cool down which is 0
when the auto sharding recommendation can be applied immediately).

This also introduces a `DataStreamAutoShardingEvent` object that will be
stored in the data stream metadata to indicate the last auto sharding
event that was applied to a data stream and its cluster state
representation looks like so:

```
"auto_sharding": {
 "trigger_index_name": ".ds-logs-nginx-2024.02.12-000002",
 "target_number_of_shards": 3,
 "event_timestamp": 1707739707954
}
```

The auto sharding service is not used in this PR, so the auto sharding
event will not be stored in the data stream metadata, but the required
infrastructure to configure it is in place.
2024-03-06 05:12:08 -05:00
Ryan Ernst
6375e9f443
Add native access library (#105100)
Elasticsearch requires access to some native functions. Historically
this has been achieved with the JNA library. However, JNA is a
complicated, magical library, and has caused various problems booting
Elasticsearch over the years. The new Java Foreign Function and Memory
API allows access to call native functions directly from Java. It also
has the advantage of tight integration with hotspot which can improve
performance of these functions (though performance of Elasticsearch's
native calls has never been much of an issue since they are mostly at
boot time).

This commit adds a new native lib that is internal to Elasticsearch. It
is built to use the foreign function api starting with Java 21, and
continue using JNA with Java versions below that.

Only one function, checking whether Elasticsearch is running as root, is
migrated. Future changes will migrate other native functions.
2024-02-07 18:27:09 -05:00