3053 commits

Author SHA1 Message Date
Yoann Rodière
841ac8e43a
Upgrade Apache Commons Logging to 1.2 (#85745)
* Upgrade to Apache Commons Logging 1.2 (#40305)
* Clarify that Apache HTTP/commons-* dependencies are not just for tests
2022-08-10 13:19:15 -04:00
Mark Vieira
398b0147a7
Upgrade Gradle wrapper to 7.5.1 (#88918) 2022-08-08 12:34:58 -07:00
Rene Groeschke
3909b5eaf9
Add verification metadata for dependencies (#88814)
Removing the custom dependency checksum functionality in favor of Gradle built-in dependency verification support.

- Use sha256 in favor of sha1 as sha1 is not considered safe these days.

Closes https://github.com/elastic/elasticsearch/issues/69736
2022-08-04 09:51:16 +02:00
Artem Prigoda
2a03ac35a6
Fix compilation in the rescore plugin (#89004)
Add source fallback operation when looking up the factor field added in #88735

Resolves #88985
2022-08-01 21:05:57 +02:00
Ignacio Vera
ed564f6e1d
Update to lucene-9.3.0 (#88927) 2022-08-01 07:21:13 +02:00
Jack Conradson
5e0701f026
Add source fallback for keyword fields using operation (#88735)
This change adds an operation parameter to FieldDataContext that allows us to specialize the field data that are returned from fielddataBuilder in MappedFieldType. Keyword, integer, and geo point field types now support source fallback where we build a doc values wrapper using source if doc values don't exist for this field under the operation SCRIPT. This allows us to have source fallback in scripting for the scripting fields API.
2022-07-28 10:34:05 -07:00
Alan Woodward
bc8ebbf540
Add FieldDataContext (#88779)
MappedFieldType#fieldDataBuilder() currently takes two parameters, a fully qualified
index name and a supplier for a SearchLookup. We expect to add more parameters here
as we add support for loading fielddata from source. Rather than telescoping the
parameter list, this commit instead introduces a new FieldDataContext carrier object
which will allow us to add to these context parameters more easily.
2022-07-26 14:47:50 +01:00
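The parameter-object idea described in the FieldDataContext commit above (bc8ebbf540) can be sketched roughly as follows. This is a minimal illustration, not the Elasticsearch code: `FieldDataContextSketch`, `Context`, `describeBuilder` and the `String`-typed lookup are hypothetical stand-ins. Only the idea of bundling the index name and the lookup supplier into one carrier object comes from the commit message.

```java
import java.util.function.Supplier;

// Hypothetical illustration only: not the real Elasticsearch FieldDataContext.
final class FieldDataContextSketch {

    // Carrier object bundling what used to be two separate method parameters.
    // Adding a component later does not change the signature of methods that accept a Context.
    record Context(String fullyQualifiedIndexName, Supplier<String> lookupSupplier) {}

    // Stand-in for a fielddata builder method: it takes the single carrier object.
    static String describeBuilder(Context context) {
        return context.fullyQualifiedIndexName() + ":" + context.lookupSupplier().get();
    }

    public static void main(String[] args) {
        Context context = new Context("cluster:index", () -> "lookup");
        System.out.println(describeBuilder(context));
    }
}
```

With this shape, a further piece of context would become one more record component, and callers of `describeBuilder` would keep compiling unchanged.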
Ignacio Vera
3b7f393a82
Upgrade to lucene snapshot lucene-9.3.0-snapshot-b8d1fcfd0ec (#88706) 2022-07-22 11:22:39 +02:00
Rene Groeschke
98b789c940
Update to Gradle wrapper 7.5 (#85141)
This updates the gradle wrapper to 7.5

Fixes #85123
2022-07-19 08:12:19 +02:00
Rene Groeschke
dbf39741a0
Make LoggedExec gradle task configuration cache compatible (#87621)
This changes the LoggedExec task to be configuration cache compatible. We changed the implementation
to use `ExecOperations` instead of extending `Exec` task. As double checked with the Gradle team this task
is not planned to be made configuration cache compatible out of the box anytime soon.

This is part of the effort on https://github.com/elastic/elasticsearch/issues/57918
2022-07-11 08:46:54 +02:00
Nhat Nguyen
bd69f90fff
Upgrade to Lucene-9.3.0-snapshot-2d05f5c623e (#88284)
To include LUCENE-10620 - which passes Weight to Collector
2022-07-06 16:16:03 -04:00
Chris Hegarty
453f12c72d
Upgrade to Log4J 2.18.0 (#88237) 2022-07-04 11:30:38 +01:00
Rene Groeschke
8ccae4da71
Setup elasticsearch dependency monitoring with Snyk for production code (#88036)
This adds the generation and upload logic of Gradle dependency graphs to Snyk.

We directly implemented a REST API based Snyk plugin because the existing Snyk Gradle plugin delegates to the Snyk command line tool, and the command line tool uses custom Gradle logic by injecting an init file that

a) uses deprecated build logic, which we definitely want to avoid
b) uses Gradle API we avoid, like eager task creation.

Shipping this as an internal Gradle plugin gives us the most flexibility. As we only want to monitor
production code for now, we apply this plugin as part of the elasticsearch.build plugin;
that usage has so far been the de-facto indicator of whether a project is considered a "production" project
that ends up in our distribution or public Maven repositories. This isn't yet ideal and we will revisit
the distinction between production and non-production code / projects in a separate effort.

As part of this effort we added the elasticsearch.build plugin to more projects that actually end up
in the distribution. To unblock us on this, we have for now disabled a few check tasks that started failing after applying elasticsearch.build.

Addresses  #87620
2022-06-29 13:29:14 +02:00
James Baiera
08d1c3e643
Update HDFS Repository to HDFS 3.3.3 (#88039)
This updates the HDFS repository plugin to use HDFS 3.3.3.
2022-06-28 11:02:54 -04:00
Armin Braun
eda1c511dd
Don't extend AbstractIndexComponent in AbstractCharFilterFactory (#88125)
Same as #88113 but for AbstractCharFilterFactory.
2022-06-28 14:51:51 +02:00
Armin Braun
02568210ba
Don't extend AbstractIndexComponent in AbstractTokenFilter (#88113)
No need for this extension, we don't make use of the settings or deprecation logger
in production any more. Also, this slows down CS operations that require a
temporary index service which builds quite a bit slower when the loggers
need to be set up via reflective calls.
2022-06-28 12:13:36 +02:00
Tim Vernum
6078fc3cbf
Update http client version (#87491)
Moves a few Apache HTTP client dependencies to their latest version

- httpclient -> 4.5.13
- httpasyncclient -> 4.1.5
- httpcore -> 4.4.13
2022-06-28 06:10:17 -04:00
Ryan Ernst
eed8da3919
Move the ingest attachment processor to the default distribution (#87989)
The ingest attachment processor is currently available as a plugin. This
commit moves the processor to the default distribution so it is always
available.
2022-06-28 02:10:36 -04:00
Nhat Nguyen
c2dc6e6ef4
Upgrade to new Lucene snapshot (#87932)
This PR uses Lucene-9.3 snapshot in Elasticsearch 8.4. Noticeable changes in this Lucene snapshot:

- Merge-on-refresh (disabled)
- No more pathological merging
- SortedSetDocValues#count for value_count aggs
2022-06-23 12:18:27 -04:00
Artem Prigoda
e17f805ccc
Remove redundant jackson dependencies from discovery-azure (#87898)
The APIs that we use in azure-svc-mgmt-compute use the Apache HTTP client and the built-in Java XML parser, so it doesn't require Jersey JAXB bindings for databinding JSON/XML data to Java objects via old Jackson dependencies.
2022-06-23 14:39:25 +02:00
Ryan Ernst
6084b9d321
Fix rest example plugin (#87923)
This is a followup to
https://github.com/elastic/elasticsearch/pull/87504, to fix the example
plugin that used BytesRestResponse.
2022-06-22 08:57:00 -07:00
Rene Groeschke
cdf5bd7ed0
Rework testing conventions gradle plugin (#87213)
This PR reworks the testing conventions precommit plugin. This plugin now:
- is compatible with yaml, java rest tests and internalClusterTest (aka different sourceSets per test type)
- enforces test base class and simple naming conventions (as it did before)
- adds one check task per test sourceSet
- uses the worker api to improve task execution parallelism and encapsulation
- is gradle configuration cache compatible  

This also ports the TestingConventions integration testing to Spock and removes the build-tools-internal/test kit folder that is not required anymore. We also add some common logic for testing java related gradle plugins. 
We will apply further cleanup on other tests within our test suite in a dedicated follow-up.
2022-06-20 16:26:38 +02:00
Armin Braun
0132541d60
Remove redundant BlobMetadata interface (#87705)
No need to have more than a simple record here at this point.
2022-06-18 20:41:32 +02:00
Nikola Grcevski
06d5baaba5
Add more GraalThread filtering in tests (#87571) 2022-06-09 16:06:57 -04:00
Alan Woodward
048fa422c2
Update to public lucene 9.2.0 release (#87162) 2022-06-06 10:06:41 +01:00
Armin Braun
da4577ea82
Speed up NumberFieldMapper (#85688)
No need to create an intermediary list here. Creating it and adding it
to the document tended to take more time than the parsing of the number itself.
2022-06-04 12:24:41 +02:00
Przemyslaw Gomulka
705b27ae3b
Refactor ParameterizedMessage used in lambda and casted to Supplier (#87156)
This is a result of structural search/replace in IntelliJ. This only affects log methods with a signature
logger.info((Supplier) ()-> ParameterizedMessage) or logger.info((Supplier) ()-> ParameterizedMessage, Throwable)

relates #86549
2022-05-31 08:46:35 +02:00
Ryan Ernst
e2e241ec01
Fix the ingest attachment license (#87189)
This commit fixes the license/notice files for ingest attachment
dependency to account for both tika-langdetect and tika-langdetect-tika
jars when building dependency info.
2022-05-27 06:14:30 -07:00
Keith Massey
6b34671dad
Upgrading to tika 2.4 (#86015)
Tika 1.x is end of life as of later this year. This change updates the
AttachmentProcessor to use tika 2. The goal was to keep the
functionality as close as possible, just with upgraded tika. The tests
have been slightly modified because of a small change in tika
functionality -- as of 2.4.0 it now adds an extra newline to the output
for every embedded attachment in a document. Also as part of this I have
broken apart the tika-parsers into individual dependencies. The reason
is that we are considering breaking this plugin apart, and want to know
exactly which parsers we pull in.
2022-05-24 16:34:19 -04:00
Albert Zaharovits
346abf9816
Improve "Has Privilege" performance for boolean-only response (#86685)
Boolean-only privilege checks, i.e. the ones currently used in the
"profile has privilege" API, now benefit from a performance improvement,
because the check will now stop upon first encountering a privilege that
is NOT granted over a resource (and return `false` overall). Previously,
all the privileges were always checked over all the resources in order
to assemble a comprehensive response with all the privileges that are
not granted.
2022-05-24 11:41:20 -04:00
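The early-exit behaviour described in the "Has Privilege" commit above (346abf9816) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Elasticsearch security implementation: `HasPrivilegesSketch`, `hasAllPrivileges` and the map-of-sets representation of granted privileges are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: not the Elasticsearch security implementation.
final class HasPrivilegesSketch {

    // `granted` maps a resource name to the set of privileges granted on it.
    // Returns false as soon as one requested privilege is missing on one resource,
    // instead of collecting every missing privilege for a detailed response.
    static boolean hasAllPrivileges(
        Map<String, Set<String>> granted,
        List<String> resources,
        List<String> privileges
    ) {
        for (String resource : resources) {
            Set<String> grantedOnResource = granted.getOrDefault(resource, Set.of());
            for (String privilege : privileges) {
                if (!grantedOnResource.contains(privilege)) {
                    return false; // stop early: the boolean answer is already known
                }
            }
        }
        return true;
    }
}
```

A comprehensive response would have to finish both loops to list everything that is not granted; the boolean-only variant can return on the first miss.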
Armin Braun
7a25453dec
Speed up FieldMapper construction/parsing/serialization (#86860)
Speeding this up some more as it's now 50% of the bootstrap time of the many shards benchmarks.
Iterating an array here in all cases is quite a bit faster than iterating various kinds of lists
and doesn't complicate the code. Also removes a redundant call to `getValue()` for each parameter
during serialization.
2022-05-23 12:09:00 +02:00
Chris Hegarty
3071c6a055
Modularize Elasticsearch (#81066)
This PR represents the initial phase of Modularizing Elasticsearch (with
Java Modules).

This initial phase modularizes the core of the Elasticsearch server
with Java Modules, which is then used to load and configure extension
components atop the server. Only a subset of extension components are
modularized at this stage (other components come in a later phase).
Components are loaded dynamically at runtime with custom class loaders
(same as is currently done). Components with a module-info.class are
defined to a module layer.

This architecture is somewhat akin to the Modular JDK, where
applications run on the classpath. In the analogy, the Elasticsearch
server modules are the platform (thus are always resolved and present),
while components without a module-info.class are non-modular code
running atop the Elasticsearch server modules. The extension components
cannot access types from non-exported packages of the server modules, in
the same way that classpath applications cannot access types from
non-exported packages of modules from the JDK. Broadly, the core
Elasticsearch java modules simply "wrap" the existing packages and export
them. There are opportunities to export less, which is best done in more
narrowly focused follow-up PRs.

The Elasticsearch distribution startup scripts are updated to put jars
on the module path (the class path is empty), so the distribution will
run the core of the server as java modules. A number of key components
have been retrofitted with module-info.java's too, and the remaining
components can follow later. Unit and functional tests run as
non-modular (since they commonly require package-private access), while
higher-level integration tests, that run the distribution, run as
modular.

Co-authored-by: Chris Hegarty <christopher.hegarty@elastic.co>
Co-authored-by: Ryan Ernst <ryan@iernst.net>
Co-authored-by: Rene Groeschke <rene@elastic.co>
2022-05-20 13:11:42 +01:00
Alan Woodward
205cfec52f
Upgrade to lucene 9.2.0-RC2 snapshot (#86931)
Only difference from last snapshot is a revert of a change in the behaviour
of PersianAnalyzer
2022-05-20 08:54:35 +01:00
Yannick Welsch
5aebb8ee38
Add text field support to archive indices (#86591)
Adds support for "text" fields in archive indices, with the goal of adding simple filtering support on text fields when
querying archive indices.

There are some differences to regular text fields:

- no global statistics: queries on text fields return constant score (similar to match_only_text).
- analyzer fields can be updated
- if defined analyzer is not available, falls back to default analyzer
- no guarantees that analyzers are BWC
The above limitations also give us the flexibility to eventually swap out the implementation with a "runtime-text field"
variant, and hence only provide those capabilities that can be emulated via a runtime field.

Relates #81210
2022-05-18 10:25:38 +02:00
Alan Woodward
0418e8a9d8
Upgrade to lucene snapshot 978eef5459c (#86852)
Final (hopefully!) snapshot before the 9.2.0 release

* Update test to expect persian tokenfilter - will be exposed later
* Fix KnnVectorQueryBuilderTests::doAssertLuceneQuery

Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
2022-05-17 15:27:52 -07:00
Armin Braun
82933a8599
Save redundant singleton maps in field mappers (#86785)
In the many-shards benchmarks the singleton maps storing just a single
analyzer for each keyword field mapper cost around 5% of the total heap
usage on data nodes (700MB for ~15k indices which translate into ~16M instances
of keyword field mapper for Beats mappings).
Creating specific implementations for the zero, one or many analyzers
use cases that already have their own specialized constructors eliminates this
overhead completely.

relates #77466
2022-05-16 15:13:51 +02:00
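The "zero, one or many" specialisation described in the singleton-maps commit above (82933a8599) can be sketched as follows. This is a minimal illustration, not the Elasticsearch classes: `AnalyzerLookupSketch`, `Single`, `Many` and the `String`-typed analyzers are hypothetical names chosen for the example.

```java
import java.util.Map;

// Hypothetical illustration only: names are not taken from Elasticsearch.
interface AnalyzerLookupSketch {

    // Returns the analyzer registered for the field, or null if there is none.
    String get(String field);

    // Zero analyzers: one shared instance, no map allocated at all.
    AnalyzerLookupSketch EMPTY = field -> null;

    // Picks the cheapest representation for the given number of entries.
    static AnalyzerLookupSketch of(Map<String, String> analyzers) {
        if (analyzers.isEmpty()) {
            return EMPTY;
        }
        if (analyzers.size() == 1) {
            Map.Entry<String, String> only = analyzers.entrySet().iterator().next();
            return new Single(only.getKey(), only.getValue());
        }
        return new Many(Map.copyOf(analyzers));
    }

    // One analyzer: two plain fields instead of a singleton map.
    record Single(String field, String analyzer) implements AnalyzerLookupSketch {
        public String get(String requestedField) {
            return field.equals(requestedField) ? analyzer : null;
        }
    }

    // Many analyzers: fall back to a real map.
    record Many(Map<String, String> analyzers) implements AnalyzerLookupSketch {
        public String get(String requestedField) {
            return analyzers.get(requestedField);
        }
    }
}
```

The saving comes from the common cases: with millions of mapper instances, skipping a map object for the zero- and one-entry cases adds up.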
Ryan Ernst
12b98b37b6
Remove remaining single arg ParameterizedMessages (#86715)
This commit removes the remaining ParameterizedMessages that take a
single argument, this time where the argument contains method calls.
This was again done almost entirely through find/replace with regex in
IntelliJ.

relates #86549
2022-05-12 10:09:11 +02:00
Mark Vieira
22aeebcd9f
Avoid starting test fixtures when resolving all external dependencies (#86357) 2022-05-11 07:59:34 -07:00
Nik Everett
a589456b81
Synthetic source (#85649)
This attempts to shrink the index by implementing a "synthetic _source" field.
You configure it in the mapping:
```
{
  "mappings": {
    "_source": {
      "synthetic": true
    }
  }
}
```

And we just stop storing the `_source` field - kind of. When you go to access
the `_source` we regenerate it on the fly by loading doc values. Doc values
don't preserve the original structure of the source you sent so we have to
make some educated guesses. And we have a rule: the source we generate would
result in the same index if you sent it back to us. That way you can use it
for things like `_reindex`.

Fetching the `_source` from doc values does slow down loading somewhat. See
numbers further down.

## Supported fields
This only works for the following fields:
* `boolean`
* `byte`
* `date`
* `double`
* `float`
* `geo_point` (with precision loss)
* `half_float`
* `integer`
* `ip`
* `keyword`
* `long`
* `scaled_float`
* `short`
* `text` (when there is a `keyword` sub-field that is compatible with this feature)


## Educated guesses

The synthetic source generator makes `_source` fields that are:
* sorted alphabetically
* as "objecty" as possible
* pushes all arrays to the "leaf" fields
* sorts most array values
* removes duplicate text and keyword values

These are mostly artifacts of how doc values are stored.

### sorted alphabetically
```
{
  "b": 1,
  "c": 2,
  "a": 3
}
```
becomes
```
{
  "a": 3,
  "b": 1,
  "c": 2
}
```

### as "objecty" as possible
```
{
  "a.b": "foo"
}
```
becomes
```
{
  "a": {
    "b": "foo"
  }
}
```

### pushes all arrays to the "leaf" fields
```
{
  "a": [
    {
      "b": "foo",
      "c": "bar"
    },
    {
      "c": "bort"
    },
    {
      "b": "snort"
    }
  ]
}
```
becomes
```
{
  "a": {
    "b": ["foo", "snort"],
    "c": ["bar", "bort"]
  }
}
```

### sorts most array values
```
{
  "a": [2, 3, 1]
}
```
becomes
```
{
  "a": [1, 2, 3]
}
```

### removes duplicate text and keyword values
```
{
  "a": ["bar", "baz", "baz", "baz", "foo", "foo"]
}
```
becomes
```
{
  "a": ["bar", "baz", "foo"]
}
```
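
The educated guesses above can be mimicked with a small Java sketch. This is an illustration only, not the synthetic source implementation: `SyntheticSourceSketch` and `synthesize` are hypothetical names, and the input is assumed to be a flat map from dotted field name to string values (roughly what keyword doc values would give you). It handles string values only, so it shows the sorting and de-duplication of keyword values but not numeric arrays.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only: not the Elasticsearch synthetic _source code.
final class SyntheticSourceSketch {

    // Rebuilds a nested, alphabetically sorted object from per-leaf-field string values.
    // Assumes no field is both a leaf and a parent of another field.
    @SuppressWarnings("unchecked")
    static Map<String, Object> synthesize(Map<String, List<String>> valuesByDottedField) {
        Map<String, Object> root = new TreeMap<>(); // TreeMap keeps keys sorted alphabetically
        for (Map.Entry<String, List<String>> entry : valuesByDottedField.entrySet()) {
            String[] path = entry.getKey().split("\\.");
            Map<String, Object> current = root;
            // Walk or create one nested object per path segment ("as objecty as possible").
            for (int i = 0; i < path.length - 1; i++) {
                current = (Map<String, Object>) current.computeIfAbsent(path[i], k -> new TreeMap<String, Object>());
            }
            // A TreeSet sorts the values and drops duplicates.
            List<String> values = List.copyOf(new TreeSet<>(entry.getValue()));
            // Emit a single value as a scalar and several values as an array on the leaf.
            current.put(path[path.length - 1], values.size() == 1 ? values.get(0) : values);
        }
        return root;
    }
}
```

Because every array ends up on a leaf field and keys come out sorted, feeding the result back in produces the same structure again, which is the spirit of the rule stated earlier.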
## `_recovery_source`

Elasticsearch's shard "recovery" process needs `_source` *sometimes*. So does
cross cluster replication. If you disable source or filter it somehow we store
a `_recovery_source` field for as long as the recovery process might need it.
When everything is running smoothly that's generally a few seconds or minutes.
Then the field is removed on merge. This synthetic source feature continues
to produce `_recovery_source` and relies on it for recovery. It's *possible*
to synthesize `_source` during recovery but we don't do it.

That means that synthetic source doesn't speed up writing the index. But in the
future we might be able to turn this on to trade writing less data at index
time for slower recovery and cross cluster replication. That's an area of
future improvement.

## perf numbers

I loaded the entire tsdb data set with this change and compared the size:

```
           standard -> synthetic
store size  31.0 GB ->  7.0 GB  (77.5% reduction)
_source  24695.7 MB -> 47.6 MB  (99.8% reduction - synthetic is in _recovery_source)
```

A second _forcemerge a few minutes after rally finishes should remove the
remaining 47.6MB of _recovery_source.

With this fetching source for 1,000 documents seems to take about 500ms. I
spot checked a lot of different areas and haven't seen any different hit. I
*expect* this performance impact is based on the number of doc values fields
in the index and how sparse they are.
2022-05-10 07:46:58 -04:00
Armin Braun
7b916f2678
AbstractAnalyzerProvider does not need to extend AbstractIndexComponent (#86537)
Remove the inheritance here to make instances smaller and speed up many-shards benchmarks a little.
Did not remove the dead arguments from the constructors in this PR as that would have been a
very noisy change.
2022-05-08 22:34:52 +02:00
Albert Zaharovits
3d4234e80e
Has privileges API for profiles (#85898)
This introduces a new Security API `_security/profile/_has_privileges`
that can be used to verify which Users have the requested privileges,
given their associated User Profiles. Multiple profile uids can be specified
in a single has privileges request.

This is analogous to the existing Has privileges API. It also uses the same
format for specifying the privileges to be checked, and should be used in
the same situations (ie to run an authorization preflight check or to verify
privileges over application resources). However, unlike the existing
has privilege API, this can be used to check the privileges of multiple
users (not only of the currently authenticated one), but the users must
have an existing profile, and the response is binary only (either it has or
it does not have the requested privileges).
Calling this API requires the `manage_user_profile` cluster privilege.
2022-05-06 09:54:34 +03:00
Alan Woodward
4d076eee20
Upgrade to Lucene 9.2 snapshot efa5d6f4d43 (#86227)
Notable changes include:

- count implementations for MultiRangeQuery and IndexSortedNumericDocValuesRangeQuery, which may speed up certain aggregations
- more efficient decoding of docids in BKD reader
2022-05-05 15:48:13 +01:00
Yang Wang
286cb2b26c
[Test] Replace removed User methods (#86422)
Another refactor leftover.

Relates: #86246 Resolves: #86421
2022-05-04 08:36:36 -04:00
Armin Braun
cb41ed09e3
Deduplicate default FieldType in KeywordFieldMapper (#86346)
The default type is incredibly common and instances are not trivial
in size with 16 fields. Heap dumps from larger data nodes holding many
keyword fields with the default field type can contain hundreds of MB
of heap used for these.
Same reasoning applies to the `TextSearchInfo` deduplication.
`TextSearchInfo` was turned into a record to give us an `equals` implementation.
2022-05-03 16:11:36 +02:00
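The deduplication described in the KeywordFieldMapper commit above (cb41ed09e3) relies on value equality, which a record provides for free. A minimal sketch, assuming a hypothetical `SearchInfoSketch` record with made-up components; it is not the real `TextSearchInfo` or `FieldType`:

```java
// Hypothetical illustration only: not the real TextSearchInfo or FieldType classes.
record SearchInfoSketch(boolean tokenized, boolean hasPositions, boolean hasOffsets) {

    // One shared instance for the very common default configuration.
    static final SearchInfoSketch DEFAULT = new SearchInfoSketch(false, false, false);

    // Returns the shared default when the candidate is equal to it, so that many
    // mappers can point at a single object instead of each holding its own copy.
    // The comparison uses the equals() implementation generated for the record.
    static SearchInfoSketch deduplicate(SearchInfoSketch candidate) {
        return DEFAULT.equals(candidate) ? DEFAULT : candidate;
    }
}
```

Non-default configurations are returned untouched, so only the common case is collapsed onto the shared instance.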
Yang Wang
210ce86663
[Test] Fix authentication creation in example project (#86385)
In #86206, we closed down Authentication constructors to favour
dedicated convenient methods for instantiation. The constructor usages
in the example project were however left out (another refactor fallout).

Relates: #86206
Resolves: #86378
2022-05-03 20:28:01 +10:00
Rene Groeschke
177b0fa47f
Mute failing example project (#86379)
Exclude example project to unblock PR checks till #86378 is addressed.
2022-05-03 05:15:29 -04:00
Ryan Ernst
af7525e1f0
Upgrade jackson to 2.13.2 (#86051)
Most of the Jackson uses, eg in x-content and azure, have already been
upgraded. This commit upgrades the rest of the uses. Note that it does
not yet upgrade the aws sdk, this should also be done on its own.
2022-04-22 07:21:17 -07:00
Chris Hegarty
603ca53798
Use declared constant (rather than resource lookup) (#86083)
Use the public static final constant org.apache.lucene.analysis.icu.NORMALIZER, rather than poking around inside lucene resources - the Normalizer2 instance is equivalent. It would appear that this code, doing the resource lookup, predates the lucene public field.
2022-04-22 13:55:24 +01:00
Ryan Ernst
b2c9028384
Move io utils to core package (#85954)
Most classes under elasticsearch-core had been moved to the o.e.core
package. However, a couple io related classes remained in an "internal"
package. This commit moves Streams and IOUtils to the core package, as
they are no more "internal" than the rest of the classes in core.
2022-04-19 21:26:28 -07:00
Artem Prigoda
b841b5f7d5
[discovery-gce] Fix initialisation of transport in FIPS mode (#85817)
Load the keystore with Google certificates in the JKS format
instead of the default p12 which is not compatible with FIPS.
2022-04-13 10:57:37 +02:00