* Use single-char variant of String.indexOf() where possible
indexOf(char) is more efficient than searching for the same one-character String.
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Same as #101175, shorten `client().prepareIndex(index)` and
`client().prepareIndex().setIndex(index)` via a test utility.
Saves lots of code now and sets up some follow-up simplifcations.
Another round of automated fixes to this, marking things that can be
made static as static. Saves some JIT cycles but also turns some lambdas
from capturing to non-capturing and makes the "utilityness" of some
classes visible.
Data-stream mappings require a @timestamp field to be present and configured
as a date with a specific set of parameters. The index-wide setting of
ignore_malformed can cause problems here if it is set to true, because it needs
to be false for the @timestamp field.
This commit detects if a set of mappings is configured for a datastream by checking
for the presence of a DataStreamTimestampFieldMapper metadata field, and passes
that information on during Mapper construction as part of the MapperBuilderContext.
DateFieldMapper.Builder now checks to see if it is specifically for a data stream timestamp
field, and if it is, sets ignore_malformed to false.
Relates to #96051
We mostly have a handful of `FieldType` values here across all mappers and none of them contain
attributes. There's only so many combinations here, lets deduplicate these to save some heap and set up
subsequent mapper heap savings.
The copyTo builder is really hard to reason about when it comes to
mapper merging, because the `reset` method would actually mutate an
existing mapper. That seems dangerous and the whole thing is quite
inefficient as well. -> this PR just removes it and uses a copy
constructor for copy on write, avoiding instance creation on mapper
merges here and there and leaving no doubt about these things being
immutable.
Replacing the remaining usages that I could automatically replace
and a couple that I did by hand in this PR.
Also, added the same shortcut to the single node tests to save some
duplication there.
In certain scenarios, running a MultiTerm query sets a `null` rewrite method. While `null` is usually checked, there are branches in the code where this is not adequately checked.
Additionally, `MultiTermQuery#setRewriteMethod` has been deprecated for a while. So, to correct this bug,
- Remove calls to `MultiTermQuery#setRewriteMethod` where possible
- Always check for `null` rewrite method
closes: https://github.com/elastic/elasticsearch/issues/94364
This commit adds a new test framework for configuring and orchestrating
test clusters for both Java and YAML REST testing. This will eventually
replace the existing "test-clusters" Gradle plugin and the build-time
cluster orchestration.
We currently work out whether or not a mapper should be storing additional
values for synthetic source by looking at the DocumentParserContext. However,
this value does not change for the lifetime of the mapper - it is defined by
metadata on the root mapper and is immutable - and DocumentParserContext
feels like the wrong place for this information as it holds context specific
to the document being parsed.
This commit moves synthetic source status information from DocumentParserContext
to MapperBuilderContext instead. Mappers which need this information retrieve
it at build time and hold it on final fields.
This adds support for `ignore_malformed` to numeric fields other than
`scaled_float` in synthetic `_source`. Their values are saved to a
stored field and loaded to render the `_source`.
This makes sure that all field types that support `ignore_malfored` do
so in the same way.
Production changes:
* All mapper has an `ignoreMalformed` method that must return `true` if
the field accepts the `ignore_malformed` mapping parameter was
configured. It defaults to `false` because many fields either don't
have a concept of "malformed" value or don't have the ability to
ignore malformed values.
* Fix the `scaled_float` field to store it's field name in `_ignored` if
it ignores any malfored values. This is how all other field mappers
work.
Test changes:
* `MapperTestCase` forces subclasses to declare if their
`supportIgnoreMalformed` or not.
* If `MapperTestCase` subclasses `supportIgnoreMalfored` they must
define some `exampleMalformedValues`.
* `MapperTestCase` always grows three new tests:
* One that creates a field without setting `ignore_malformed` and
verifies that all `exampleMalformedValues` throw expected errors
* On that explicitly configured `ignore_malformed` to false and, if
`supportIgnoreMalformed` it verifies the errors again. If not
`supportIgnoreMalformed` it verifies that the parameter is unknown.
* On that explicitly configured `ignore_malformed` to true and, if
`supportIgnoreMalformed` it verifies that parsing doesn't produce
errors and correctly produces `_ignored`. If not
`supportIgnoreMalformed` it verifies that the parameter is unknown.
* Moved some subclasesses of `MapperTestCase` from
`internalClusterTests` to `tests`. This isn't strictly required but
that's the right place for them.
Removing the custom dependency checksum functionality in favor of Gradle build-in dependency verification support.
- Use sha256 in favor of sha1 as sha1 is not considered safe these days.
Closes https://github.com/elastic/elasticsearch/issues/69736
MappedFieldType#fieldDataBuilder() currently takes two parameters, a fully qualified
index name and a supplier for a SearchLookup. We expect to add more parameters here
as we add support for loading fielddata from source. Rather than telescoping the
parameter list, this commit instead introduces a new FieldDataContext carrier object
which will allow us to add to these context parameters more easily.
No need for this extension, we don't make use of the settings or deprecation logger
in production any more. Also, this slows down CS operations that require a
temporary index service which builds quite a bit slower when the loggers
need to be set up via reflective calls.
This PR uses Lucene-9.3 snapshot in Elasticsearch 8.4. Noticeable changes in this Lucene snapshot:
- Merge-on-refresh (disabled)
- No more pathological merging
- SortedSetDocValues#count for value_count aggs
Speeding this up some more as it's now 50% of the bootstrap time of the many shards benchmarks.
Iterating an array here in all cases is quite a bit faster than iterating various kinds of lists
and doesn't complicate the code. Also removes a redundant call to `getValue()` for each parameter
during serialization.
This PR represents the initial phase of Modularizing Elasticsearch (with
Java Modules).
This initial phase modularizes the core of the Elasticsearch server
with Java Modules, which is then used to load and configure extension
components atop the server. Only a subset of extension components are
modularized at this stage (other components come in a later phase).
Components are loaded dynamically at runtime with custom class loaders
(same as is currently done). Components with a module-info.class are
defined to a module layer.
This architecture is somewhat akin to the Modular JDK, where
applications run on the classpath. In the analogy, the Elasticsearch
server modules are the platform (thus are always resolved and present),
while components without a module-info.class are non-modular code
running atop the Elasticsearch server modules. The extension components
cannot access types from non-exported packages of the server modules, in
the same way that classpath applications cannot access types from
non-exported packages of modules from the JDK. Broadly, the core
Elasticseach java modules simply "wrap" the existing packages and export
them. There are opportunites to export less, which is best done in more
narrowly focused follow-up PRs.
The Elasticsearch distribution startup scripts are updated to put jars
on the module path (the class path is empty), so the distribution will
run the core of the server as java modules. A number of key components
have been retrofitted with module-info.java's too, and the remaining
components can follow later. Unit and functional tests run as
non-modular (since they commonly require package-private access), while
higher-level integration tests, that run the distribution, run as
modular.
Co-authored-by: Chris Hegarty <christopher.hegarty@elastic.co>
Co-authored-by: Ryan Ernst <ryan@iernst.net>
Co-authored-by: Rene Groeschke <rene@elastic.co>
Final (hopefully!) snapshot before the 9.2.0 release
* Update test to expect persian tokenfilter - will be exposed later
* Fix KnnVectorQueryBuilderTests::doAssertLuceneQuery
Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>
In the many-shards benchmarks the singleton maps storing just a single
analyzer for each keyword field mapper cost around 5% of the total heap
usage on data nodes (700MB for ~15k indices which translate into ~16M instances
of keyword field mapper for Beats mappings).
Creating specific implementations for the zero, one or many analyzers
use cases that already have their own specialized constructors eliminates this
overhead completely.
relates #77466
This attempts to shrink the index by implementing a "synthetic _source" field.
You configure it by in the mapping:
```
{
"mappings": {
"_source": {
"synthetic": true
}
}
}
```
And we just stop storing the `_source` field - kind of. When you go to access
the `_source` we regenerate it on the fly by loading doc values. Doc values
don't preserve the original structure of the source you sent so we have to
make some educated guesses. And we have a rule: the source we generate would
result in the same index if you sent it back to us. That way you can use it
for things like `_reindex`.
Fetching the `_source` from doc values does slow down loading somewhat. See
numbers further down.
## Supported fields
This only works for the following fields:
* `boolean`
* `byte`
* `date`
* `double`
* `float`
* `geo_point` (with precision loss)
* `half_float`
* `integer`
* `ip`
* `keyword`
* `long`
* `scaled_float`
* `short`
* `text` (when there is a `keyword` sub-field that is compatible with this feature)
## Educated guesses
The synthetic source generator makes `_source` fields that are:
* sorted alphabetically
* as "objecty" as possible
* pushes all arrays to the "leaf" fields
* sorts most array values
* removes duplicate text and keyword values
These are mostly artifacts of how doc values are stored.
### sorted alphabetically
```
{
"b": 1,
"c": 2,
"a": 3
}
```
becomes
```
{
"a": 3,
"b": 1,
"c": 2
}
```
### as "objecty" as possible
```
{
"a.b": "foo"
}
```
becomes
```
{
"a": {
"b": "foo"
}
}
```
### pushes all arrays to the "leaf" fields
```
{
"a": [
{
"b": "foo",
"c": "bar"
},
{
"c": "bort"
},
{
"b": "snort"
}
}
```
becomes
```
{
"a" {
"b": ["foo", "snort"],
"c": ["bar", "bort"]
}
}
```
### sorts most array values
```
{
"a": [2, 3, 1]
}
```
becomes
```
{
"a": [1, 2, 3]
}
```
### removes duplicate text and keyword values
```
{
"a": ["bar", "baz", "baz", "baz", "foo", "foo"]
}
```
becomes
```
{
"a": ["bar", "baz", "foo"]
}
```
## `_recovery_source`
Elasticsearch's shard "recovery" process needs `_source` *sometimes*. So does
cross cluster replication. If you disable source or filter it somehow we store
a `_recovery_source` field for as long as the recovery process might need it.
When everything is running smoothly that's generally a few seconds or minutes.
Then the fields is removed on merge. This synthetic source feature continues
to produce `_recovery_source` and relies on it for recovery. It's *possible*
to synthesize `_source` during recovery but we don't do it.
That means that synethic source doesn't speed up writing the index. But in the
future we might be able to turn this on to trade writing less data at index
time for slower recovery and cross cluster replication. That's an area of
future improvement.
## perf numbers
I loaded the entire tsdb data set with this change and the size:
```
standard -> synthetic
store size 31.0 GB -> 7.0 GB (77.5% reduction)
_source 24695.7 MB -> 47.6 MB (99.8% reduction - synthetic is in _recovery_source)
```
A second _forcemerge a few minutes after rally finishes should removes the
remaining 47.6MB of _recovery_source.
With this fetching source for 1,000 documents seems to take about 500ms. I
spot checked a lot of different areas and haven't seen any different hit. I
*expect* this performance impact is based on the number of doc values fields
in the index and how sparse they are.
Remove the inheritance here to make instances smaller and speed up many-shards benchmarks a little.
Did not remove the dead arguments from the constructors in this PR as that would have been a
very noisy change.
Notable changes include:
count implementations for MultiRangeQuery and IndexSortedNumericDocValuesRangeQuery, which may speed up certain aggregations
more efficient decoding of docids in BKD reader
Use the public static final constant org.apache.lucene.analysis.icu.NORMALIZER, rather than poking around inside lucene resources - the Normalizer2 instance is equivalent. It would appear that this code, doing the resource lookup, predates the lucene public field.
This param was incredibly expensive to set up when parsing mappings and
is one of the big contributors to mapping parsing slowness on master.
Since all uses of this parameter type are statically known it seems the most
straight forward to simply statically hard code the validators so that we save
some allocations.
This PR upgrades Lucene to a newer snapshot `9.1.0-snapshot-9497524cc2d`.
Changes:
* Adapt to `LeafReader#searchNearestVectors` signature change
* Adapt checks in `GeometryIndexerTests`, `SearchServiceTests`, `FiltersAggregatorTests`, `AggregationProfilerIT`
* Address highlighting failures in `MultiPhrasePrefixQuery` and `HasChildQueryBuilder.LateParsingQuery`
Lucene issues that resulted in elasticsearch changes:
LUCENE-9820 Separate logic for reading the BKD index from logic to intersecting it.
LUCENE-10377: Replace 'sortPos' with 'enableSkipping' in SortField.getComparator()
LUCENE-10301: make the test-framework a proper module by moving all test
classes to org.apache.lucene.tests
LUCENE-10300: rewrite how resources are read in ukrainian morfologik analyzer:
LUCENE-10054 Make HnswGraph hierarchical
Follow-up from #77144 (comment) with converting id/_id to always be strings instead of integers. This makes the type value in the Elasticsearch specification be only string instead of string | number.
this change was generated using following command on ubuntu
find . -type f -name "*.yml" -print0 | xargs -0 sed -i -r 's/([^a-zA-Z0-9_\.]id|[^a-zA-Z0-9_]_id):(\s*)([0-9]+)/\1:\2"\3"/g'
Currently, composite aggregation supports only string or numeric values for the fields
in the sources part of the composite aggregation. Although _tsid is encoded and stored
as a byte array, it is formatted as map {"dim1": "value1", "dim2": value2", ...} for input/output.
This PR adds support for composite aggregation on _tsid field.
Relates to #74660
fix as the #80945 do.
register a settings update consumer for the end_time for
the tsdb index even when the end_time setting wasn't registered.
Pass the feature flag to reindex yaml tests.
Co-authored-by: Igor Motov <igor@motovs.org>
The ES code base is quite JSON heavy. It uses a lot of multi-line JSON requests in tests which need to be escaped and concatenated which in turn makes them hard to read. Let's try to leverage Java 15 text blocks for representing them.
Add plumbing for ordinal field data for the field API.
The scripting fields API needs to know the mapped type of the each field in the document. This is ensured by having a `ToScriptField` method reference passed from the `MappedField`, through the `IndexFieldData`, to the `LeafFieldData`.
Knowing the mapped type allows the API to provide relevant helper methods as well as appropriately use the fields available in the document.
Refs: #79105