Removing the custom dependency checksum functionality in favor of Gradle's built-in dependency verification support.
- Use SHA-256 instead of SHA-1, as SHA-1 is no longer considered secure.
Closes https://github.com/elastic/elasticsearch/issues/69736
This change adds an operation parameter to FieldDataContext that allows us to specialize the field data returned from fielddataBuilder in MappedFieldType. Keyword, integer, and geo point field types now support source fallback, where we build a doc values wrapper using source if doc values don't exist for the field under the SCRIPT operation. This allows us to have source fallback in scripting for the scripting fields API.
MappedFieldType#fieldDataBuilder() currently takes two parameters, a fully qualified
index name and a supplier for a SearchLookup. We expect to add more parameters here
as we add support for loading fielddata from source. Rather than telescoping the
parameter list, this commit instead introduces a new FieldDataContext carrier object
which will allow us to add to these context parameters more easily.
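A minimal sketch of the carrier-object shape, with a stubbed `SearchLookup` and hypothetical field names (not the actual Elasticsearch types):
```
import java.util.function.Supplier;

// Stand-in for org.elasticsearch.search.lookup.SearchLookup.
interface SearchLookup {}

// Bundling the arguments in one carrier means later changes (such as the
// SCRIPT operation used for source fallback above) can add context without
// telescoping every fielddataBuilder signature.
record FieldDataContext(String fullyQualifiedIndexName, Supplier<SearchLookup> lookupSupplier) {}
```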
This changes the LoggedExec task to be configuration cache compatible. We changed the implementation
to use `ExecOperations` instead of extending the `Exec` task. As double-checked with the Gradle team, the `Exec` task
is not planned to be made configuration cache compatible out of the box anytime soon.
This is part of the effort on https://github.com/elastic/elasticsearch/issues/57918
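A minimal sketch of the injected-service pattern in a Java Gradle task (hypothetical class, not the real LoggedExec):
```
import javax.inject.Inject;
import org.gradle.api.DefaultTask;
import org.gradle.api.tasks.TaskAction;
import org.gradle.process.ExecOperations;

// Composing ExecOperations instead of extending Exec keeps the task free of
// project state at execution time, which the configuration cache requires.
public abstract class LoggedExecSketch extends DefaultTask {

    @Inject
    protected abstract ExecOperations getExecOperations();

    @TaskAction
    void run() {
        // hypothetical command; the real task also captures and logs output
        getExecOperations().exec(spec -> spec.commandLine("echo", "hello"));
    }
}
```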
This adds the generation and upload logic of Gradle dependency graphs to Snyk.
We implemented a REST-API-based Snyk plugin directly because the existing Snyk Gradle plugin
delegates to the Snyk command line tool, and that tool
uses custom Gradle logic by injecting an init file that:
a) uses deprecated build logic, which we definitely want to avoid
b) uses Gradle APIs we avoid, like eager task creation.
Shipping this as an internal Gradle plugin gives us the most flexibility. As we only want to monitor
production code for now, we apply this plugin as part of the elasticsearch.build plugin;
that usage has so far been the de-facto indicator of whether a project is considered a "production" project
that ends up in our distribution or public Maven repositories. This isn't ideal yet, and we will revisit
the distinction between production and non-production code / projects in a separate effort.
As part of this effort we added the elasticsearch.build plugin to more projects that actually end up
in the distribution. To unblock us, we have for now disabled a few check tasks that started failing once elasticsearch.build was applied.
Addresses #87620
No need for this extension; we no longer make use of the settings or deprecation logger
in production. Also, this slows down cluster state operations that require a
temporary index service, which builds quite a bit more slowly when the loggers
need to be set up via reflective calls.
The ingest attachment processor is currently available as a plugin. This
commit moves the processor to the default distribution so it is always
available.
This PR uses a Lucene-9.3 snapshot in Elasticsearch 8.4. Notable changes in this Lucene snapshot:
- Merge-on-refresh (disabled)
- No more pathological merging
- SortedSetDocValues#count for value_count aggs
The APIs that we use in azure-svc-mgmt-compute use the Apache HTTP client and the built-in Java XML parser, so they don't require Jersey JAXB bindings for databinding JSON/XML data to Java objects via old Jackson dependencies.
This PR reworks the testing conventions precommit plugin. This plugin now:
- is compatible with YAML and Java REST tests and internalClusterTest (i.e. different sourceSets per test type)
- enforces test base classes and simple naming conventions (as it did before)
- adds one check task per test sourceSet
- uses the Worker API to improve task execution parallelism and encapsulation (see the sketch below)
- is Gradle configuration cache compatible
This also ports the TestingConventions integration testing to Spock and removes the build-tools-internal/test kit folder that is not required anymore. We also add some common logic for testing Java-related Gradle plugins.
We will apply further cleanup on other tests within our test suite in a dedicated follow-up.
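A minimal sketch of a Worker-API-based check task (hypothetical names, not the actual plugin code):
```
import javax.inject.Inject;
import org.gradle.api.DefaultTask;
import org.gradle.api.tasks.TaskAction;
import org.gradle.workers.WorkAction;
import org.gradle.workers.WorkParameters;
import org.gradle.workers.WorkerExecutor;

// The check runs inside a work action, so Gradle can schedule it in parallel
// with other work and keep it encapsulated from task state.
public abstract class ConventionsCheckTask extends DefaultTask {

    @Inject
    protected abstract WorkerExecutor getWorkerExecutor();

    @TaskAction
    void check() {
        getWorkerExecutor().noIsolation().submit(CheckAction.class, params -> {});
    }

    public abstract static class CheckAction implements WorkAction<WorkParameters.None> {
        @Override
        public void execute() {
            // placeholder for base-class and naming validation
        }
    }
}
```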
This is a result of a structural search/replace in IntelliJ. This only affects log methods with the signatures
`logger.info((Supplier) () -> ParameterizedMessage)` and `logger.info((Supplier) () -> ParameterizedMessage, Throwable)`.
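For illustration, the affected shape with a hypothetical message (the exact replacement isn't shown here; one equivalent form passes the format string and argument to the logger directly, which Log4j also formats lazily):
```
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.message.ParameterizedMessage;
import org.apache.logging.log4j.util.Supplier;

class LoggingShapes {
    private static final Logger logger = LogManager.getLogger(LoggingShapes.class);

    void example(String snapshot) {
        // the affected shape: a Supplier wrapping a single-argument message
        logger.info((Supplier<?>) () -> new ParameterizedMessage("snapshot [{}] completed", snapshot));
        // one equivalent replacement: let the logger format lazily itself
        logger.info("snapshot [{}] completed", snapshot);
    }
}
```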
relates #86549
This commit fixes the license/notice files for the ingest attachment
dependencies to account for both tika-langdetect and tika-langdetect-tika
jars when building dependency info.
Tika 1.x reaches end of life later this year. This change updates the
AttachmentProcessor to use Tika 2. The goal was to keep the
functionality as close as possible, just with upgraded Tika. The tests
have been slightly modified because of a small change in Tika
functionality: as of 2.4.0 it adds an extra newline to the output
for every embedded attachment in a document. Also, as part of this I have
broken apart the tika-parsers artifact into individual dependencies. The reason
is that we are considering breaking this plugin apart, and want to know
exactly which parsers we pull in.
Boolean-only privilege checks, i.e. the ones currently used in the
"profile has privilege" API, now benefit from a performance improvement,
because the check will now stop upon first encountering a privilege that
is NOT granted over a resource (and return `false` overall). Previously,
all the privileges were always checked over all the resources in order
to assemble a comprehensive response with all the privileges that are
not granted.
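A minimal sketch of the short-circuit (hypothetical names, not the actual authorization code):
```
import java.util.List;
import java.util.function.BiPredicate;

final class BooleanPrivilegeCheck {

    // Stop and answer false at the first privilege that is not granted over a
    // resource, instead of assembling a full per-privilege response.
    static boolean hasAll(List<String> privileges, List<String> resources,
                          BiPredicate<String, String> granted) {
        for (String privilege : privileges) {
            for (String resource : resources) {
                if (granted.test(privilege, resource) == false) {
                    return false; // overall answer is already known
                }
            }
        }
        return true;
    }
}
```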
Speeding this up some more as it's now 50% of the bootstrap time of the many shards benchmarks.
Iterating an array here in all cases is quite a bit faster than iterating various kinds of lists
and doesn't complicate the code. Also removes a redundant call to `getValue()` for each parameter
during serialization.
This PR represents the initial phase of Modularizing Elasticsearch (with
Java Modules).
This initial phase modularizes the core of the Elasticsearch server
with Java Modules, which is then used to load and configure extension
components atop the server. Only a subset of extension components are
modularized at this stage (other components come in a later phase).
Components are loaded dynamically at runtime with custom class loaders
(same as is currently done). Components with a module-info.class are
defined to a module layer.
This architecture is somewhat akin to the Modular JDK, where
applications run on the classpath. In the analogy, the Elasticsearch
server modules are the platform (thus are always resolved and present),
while components without a module-info.class are non-modular code
running atop the Elasticsearch server modules. The extension components
cannot access types from non-exported packages of the server modules, in
the same way that classpath applications cannot access types from
non-exported packages of modules from the JDK. Broadly, the core
Elasticsearch Java modules simply "wrap" the existing packages and export
them. There are opportunities to export less, which is best done in more
narrowly focused follow-up PRs.
The Elasticsearch distribution startup scripts are updated to put jars
on the module path (the class path is empty), so the distribution will
run the core of the server as java modules. A number of key components
have been retrofitted with module-info.java's too, and the remaining
components can follow later. Unit and functional tests run as
non-modular (since they commonly require package-private access), while
higher-level integration tests, that run the distribution, run as
modular.
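For illustration, a hypothetical module descriptor in the spirit of the "wrap and export" approach (invented names, not the actual server descriptor):
```
// module-info.java: the module wraps existing packages and exports them;
// packages left unexported stay inaccessible to extension components.
module org.elasticsearch.example.server {
    requires org.apache.lucene.core;

    exports org.elasticsearch.example.action;
    exports org.elasticsearch.example.cluster;
    // internal packages are deliberately not exported
}
```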
Co-authored-by: Chris Hegarty <christopher.hegarty@elastic.co>
Co-authored-by: Ryan Ernst <ryan@iernst.net>
Co-authored-by: Rene Groeschke <rene@elastic.co>
Adds support for "text" fields in archive indices, with the goal of adding simple filtering support on text fields when
querying archive indices.
There are some differences to regular text fields:
- no global statistics: queries on text fields return constant scores (similar to match_only_text)
- analyzers can be updated
- if the configured analyzer is not available, the field falls back to the default analyzer
- no guarantees that analyzers are BWC
The above limitations also give us the flexibility to eventually swap out the implementation with a "runtime-text field"
variant, and hence only provide those capabilities that can be emulated via a runtime field.
Relates #81210
Final (hopefully!) snapshot before the 9.2.0 release
* Update test to expect Persian token filter - will be exposed later
* Fix KnnVectorQueryBuilderTests::doAssertLuceneQuery
Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
In the many-shards benchmarks the singleton maps storing just a single
analyzer for each keyword field mapper cost around 5% of the total heap
usage on data nodes (700MB for ~15k indices, which translates into ~16M instances
of keyword field mapper for Beats mappings).
Creating specific implementations for the zero, one or many analyzers
use cases that already have their own specialized constructors eliminates this
overhead completely.
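A sketch of the idea with hypothetical types (not the actual mapper code): the common zero- and one-analyzer cases avoid allocating a Map per mapper instance.
```
import java.util.Map;

interface FieldAnalyzers<A> {
    A get(String field);

    // zero analyzers: a shared constant would do
    static <A> FieldAnalyzers<A> none() {
        return field -> null;
    }

    // one analyzer: two references instead of a singleton Map
    static <A> FieldAnalyzers<A> single(String name, A analyzer) {
        return field -> name.equals(field) ? analyzer : null;
    }

    // many analyzers: fall back to a real Map
    static <A> FieldAnalyzers<A> many(Map<String, A> analyzers) {
        return analyzers::get;
    }
}
```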
relates #77466
This commit removes the remaining ParameterizedMessages that take a
single argument, this time where the argument contains method calls.
This was again done almost entirely through find/replace with regex in
IntelliJ.
relates #86549
This attempts to shrink the index by implementing a "synthetic _source" field.
You configure it in the mapping:
```
{
  "mappings": {
    "_source": {
      "synthetic": true
    }
  }
}
```
And we just stop storing the `_source` field - kind of. When you go to access
the `_source` we regenerate it on the fly by loading doc values. Doc values
don't preserve the original structure of the source you sent so we have to
make some educated guesses. And we have a rule: the source we generate would
result in the same index if you sent it back to us. That way you can use it
for things like `_reindex`.
Fetching the `_source` from doc values does slow down loading somewhat. See
numbers further down.
## Supported fields
This only works for the following fields:
* `boolean`
* `byte`
* `date`
* `double`
* `float`
* `geo_point` (with precision loss)
* `half_float`
* `integer`
* `ip`
* `keyword`
* `long`
* `scaled_float`
* `short`
* `text` (when there is a `keyword` sub-field that is compatible with this feature)
## Educated guesses
The synthetic source generator makes `_source` fields that are:
* sorted alphabetically
* as "objecty" as possible
* pushes all arrays to the "leaf" fields
* sorts most array values
* removes duplicate text and keyword values
These are mostly artifacts of how doc values are stored.
### sorted alphabetically
```
{
  "b": 1,
  "c": 2,
  "a": 3
}
```
becomes
```
{
  "a": 3,
  "b": 1,
  "c": 2
}
```
### as "objecty" as possible
```
{
  "a.b": "foo"
}
```
becomes
```
{
  "a": {
    "b": "foo"
  }
}
```
### pushes all arrays to the "leaf" fields
```
{
  "a": [
    {
      "b": "foo",
      "c": "bar"
    },
    {
      "c": "bort"
    },
    {
      "b": "snort"
    }
  ]
}
```
becomes
```
{
  "a": {
    "b": ["foo", "snort"],
    "c": ["bar", "bort"]
  }
}
```
### sorts most array values
```
{
  "a": [2, 3, 1]
}
```
becomes
```
{
  "a": [1, 2, 3]
}
```
### removes duplicate text and keyword values
```
{
  "a": ["bar", "baz", "baz", "baz", "foo", "foo"]
}
```
becomes
```
{
  "a": ["bar", "baz", "foo"]
}
```
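Taken together, a toy sketch of why rebuilt leaf values come back this way (illustrative only, not the actual implementation):
```
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SyntheticSourceToy {

    // Toy model: rebuilding an object from per-field doc values naturally
    // yields alphabetical keys and sorted, de-duplicated leaf arrays.
    static Map<String, List<String>> synthesize(Map<String, List<String>> docValues) {
        Map<String, List<String>> source = new TreeMap<>(); // keys sort alphabetically
        docValues.forEach((field, values) ->
            source.put(field, values.stream().sorted().distinct().toList()));
        return source;
    }
}
```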
## `_recovery_source`
Elasticsearch's shard "recovery" process needs `_source` *sometimes*. So does
cross cluster replication. If you disable source or filter it somehow we store
a `_recovery_source` field for as long as the recovery process might need it.
When everything is running smoothly that's generally a few seconds or minutes.
Then the field is removed on merge. This synthetic source feature continues
to produce `_recovery_source` and relies on it for recovery. It's *possible*
to synthesize `_source` during recovery but we don't do it.
That means that synthetic source doesn't speed up writing the index. But in the
future we might be able to turn this on to trade writing less data at index
time for slower recovery and cross cluster replication. That's an area of
future improvement.
## Perf numbers
I loaded the entire tsdb data set with this change and measured the size:
```
standard -> synthetic
store size 31.0 GB -> 7.0 GB (77.5% reduction)
_source 24695.7 MB -> 47.6 MB (99.8% reduction - synthetic is in _recovery_source)
```
A second _forcemerge a few minutes after rally finishes should remove the
remaining 47.6MB of _recovery_source.
With this, fetching source for 1,000 documents seems to take about 500ms. I
spot checked a lot of different areas and haven't seen any other performance hit. I
*expect* this performance impact depends on the number of doc values fields
in the index and how sparse they are.
Remove the inheritance here to make instances smaller and speed up many-shards benchmarks a little.
Did not remove the dead arguments from the constructors in this PR as that would have been a
very noisy change.
This introduces a new Security API `_security/profile/_has_privileges`
that can be used to verify which Users have the requested privileges,
given their associated User Profiles. Multiple profile uids can be specified
in a single has privileges request.
This is analogous to the existing Has Privileges API. It also uses the same
format for specifying the privileges to be checked, and should be used in
the same situations (i.e. to run an authorization preflight check or to verify
privileges over application resources). However, unlike the existing
Has Privileges API, this one can be used to check the privileges of multiple
users (not only of the currently authenticated one), but the users must
have an existing profile, and the response is binary only (either a user has or
does not have the requested privileges).
Calling this API requires the `manage_user_profile` cluster privilege.
Notable changes include:
- count implementations for MultiRangeQuery and IndexSortedNumericDocValuesRangeQuery, which may speed up certain aggregations
- more efficient decoding of docids in the BKD reader
The default field type is incredibly common, and with 16 fields its instances are not
trivial in size. Heap dumps from larger data nodes holding many
keyword fields with the default field type can contain hundreds of MB
of heap used for these.
The same reasoning applies to the `TextSearchInfo` deduplication.
`TextSearchInfo` was turned into a record to give us an `equals` implementation.
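A minimal sketch of the deduplication idea (hypothetical helper, not the actual mapper code); turning the value type into a record supplies the equals/hashCode this relies on:
```
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Canonicalize equal value objects through a shared map so the many mappers
// holding the default configuration all share one instance.
final class Deduplicator<T> {
    private final Map<T, T> canonical = new ConcurrentHashMap<>();

    T deduplicate(T value) {
        T existing = canonical.putIfAbsent(value, value);
        return existing == null ? value : existing;
    }
}
```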
In #86206, we closed down Authentication constructors to favour
dedicated convenient methods for instantiation. The constructor usages
in the example project were however left out (another refactor fallout).
Relates: #86206
Resolves: #86378
Most of the Jackson uses, e.g. in x-content and azure, have already been
upgraded. This commit upgrades the rest of the uses. Note that it does
not yet upgrade the aws sdk, this should also be done on its own.
Use the public static final constant org.apache.lucene.analysis.icu.NORMALIZER, rather than poking around inside Lucene resources - the Normalizer2 instance is equivalent. It would appear that this code, doing the resource lookup, predates the Lucene public field.
Most classes under elasticsearch-core had been moved to the o.e.core
package. However, a couple io related classes remained in an "internal"
package. This commit moves Streams and IOUtils to the core package, as
they are no more "internal" than the rest of the classes in core.