elasticsearch

mirror of https://github.com/elastic/elasticsearch.git synced 2025-04-23 14:47:31 -04:00

Author	SHA1	Message	Date
Dmitry Cherniachenko	a50e58d99a	Use single-char variant of String.indexOf() where possible (#105205 ) indexOf(char) is more efficient than searching for the same one-character String. Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>	2024-02-12 14:14:32 -05:00
Armin Braun	cdc83ad29b	Add shorthand for `prepareIndex` to test infrastructure (#101187 ) Same as #101175, shorten `client().prepareIndex(index)` and `client().prepareIndex().setIndex(index)` via a test utility. Saves lots of code now and sets up some follow-up simplifications.	2023-11-23 15:47:36 +01:00
Ignacio Vera	b7b5518acc	Remove explicit SearchResponse references from plugins (#101277 ) Remove explicit SearchResponse references from plugins.	2023-11-09 22:08:18 +01:00
Armin Braun	b7eafce32c	Make some practically static methods static (#97565 ) Another round of automated fixes to this, marking things that can be made static as static. Saves some JIT cycles but also turns some lambdas from capturing to non-capturing and makes the "utilityness" of some classes visible.	2023-10-06 23:37:07 +02:00
Alan Woodward	4e1fb3fca5	Automatically disable `ignore_malformed` on datastream `@timestamp` fields (#99346 ) Data-stream mappings require a @timestamp field to be present and configured as a date with a specific set of parameters. The index-wide setting of ignore_malformed can cause problems here if it is set to true, because it needs to be false for the @timestamp field. This commit detects if a set of mappings is configured for a datastream by checking for the presence of a DataStreamTimestampFieldMapper metadata field, and passes that information on during Mapper construction as part of the MapperBuilderContext. DateFieldMapper.Builder now checks to see if it is specifically for a data stream timestamp field, and if it is, sets ignore_malformed to false. Relates to #96051	2023-09-13 15:02:22 +01:00
Armin Braun	574fb05946	Deduplicate org.apache.lucene.document.FieldType instances across mappers (#99361 ) We mostly have a handful of `FieldType` values here across all mappers and none of them contain attributes. There are only so many combinations here, so let's deduplicate these to save some heap and set up subsequent mapper heap savings.	2023-09-08 22:18:35 +02:00
Armin Braun	f1a376c317	Remove CopyTo.Builder (#99368 ) The copyTo builder is really hard to reason about when it comes to mapper merging, because the `reset` method would actually mutate an existing mapper. That seems dangerous and the whole thing is quite inefficient as well. -> this PR just removes it and uses a copy constructor for copy on write, avoiding instance creation on mapper merges here and there and leaving no doubt about these things being immutable.	2023-09-08 13:24:31 -04:00
Simon Cooper	a830787b07	Bulk migration of Version.CURRENT for index created to IndexVersion.current() (#98490 )	2023-08-15 13:47:27 +01:00
Armin Braun	3f8ee82ef8	Use indices admin client shortcut in most integration tests (#96946 ) Replacing the remaining usages that I could automatically replace and a couple that I did by hand in this PR. Also, added the same shortcut to the single node tests to save some duplication there.	2023-06-20 13:32:59 +02:00
Simon Cooper	56d53da381	Migrate LuceneDocument.getFields(String) to a List (#94830 )	2023-03-29 11:08:36 +01:00
Benjamin Trent	bc2755f0df	Fix NPE thrown by prefix and regex query in strange scenarios (#94369 ) In certain scenarios, running a MultiTerm query sets a `null` rewrite method. While `null` is usually checked, there are branches in the code where this is not adequately checked. Additionally, `MultiTermQuery#setRewriteMethod` has been deprecated for a while. So, to correct this bug, - Remove calls to `MultiTermQuery#setRewriteMethod` where possible - Always check for `null` rewrite method closes: https://github.com/elastic/elasticsearch/issues/94364	2023-03-08 09:36:17 -05:00
Mark Vieira	c2eda511de	Add JUnit rule based integration test cluster orchestration framework (#92379 ) This commit adds a new test framework for configuring and orchestrating test clusters for both Java and YAML REST testing. This will eventually replace the existing "test-clusters" Gradle plugin and the build-time cluster orchestration.	2022-12-21 15:33:46 -08:00
Alan Woodward	41ab45a5d9	Report synthetic source status in MapperBuilderContext (#91400 ) We currently work out whether or not a mapper should be storing additional values for synthetic source by looking at the DocumentParserContext. However, this value does not change for the lifetime of the mapper - it is defined by metadata on the root mapper and is immutable - and DocumentParserContext feels like the wrong place for this information as it holds context specific to the document being parsed. This commit moves synthetic source status information from DocumentParserContext to MapperBuilderContext instead. Mappers which need this information retrieve it at build time and hold it on final fields.	2022-11-08 14:55:16 +00:00
Nik Everett	bc49392bfb	Support malformed numbers in synthetic _source (#90428 ) This adds support for `ignore_malformed` to numeric fields other than `scaled_float` in synthetic `_source`. Their values are saved to a stored field and loaded to render the `_source`.	2022-10-04 12:17:30 -04:00
Nik Everett	f4fad2548f	Always support ignore_malformed in the same way (#90565 ) This makes sure that all field types that support `ignore_malformed` do so in the same way. Production changes: * All mappers have an `ignoreMalformed` method that must return `true` if the field accepts the `ignore_malformed` mapping parameter and it was configured. It defaults to `false` because many fields either don't have a concept of a "malformed" value or don't have the ability to ignore malformed values. * Fix the `scaled_float` field to store its field name in `_ignored` if it ignores any malformed values. This is how all other field mappers work. Test changes: * `MapperTestCase` forces subclasses to declare whether they `supportIgnoreMalformed` or not. * If `MapperTestCase` subclasses `supportIgnoreMalformed` they must define some `exampleMalformedValues`. * `MapperTestCase` always grows three new tests: * One that creates a field without setting `ignore_malformed` and verifies that all `exampleMalformedValues` throw expected errors. * One that explicitly configures `ignore_malformed` to false and, if `supportIgnoreMalformed`, verifies the errors again. If not `supportIgnoreMalformed`, it verifies that the parameter is unknown. * One that explicitly configures `ignore_malformed` to true and, if `supportIgnoreMalformed`, verifies that parsing doesn't produce errors and correctly produces `_ignored`. If not `supportIgnoreMalformed`, it verifies that the parameter is unknown. * Moved some subclasses of `MapperTestCase` from `internalClusterTests` to `tests`. This isn't strictly required but that's the right place for them.	2022-10-03 06:18:02 -04:00
Rene Groeschke	3909b5eaf9	Add verification metadata for dependencies (#88814 ) Removing the custom dependency checksum functionality in favor of Gradle's built-in dependency verification support. - Use sha256 in favor of sha1 as sha1 is not considered safe these days. Closes https://github.com/elastic/elasticsearch/issues/69736	2022-08-04 09:51:16 +02:00
Ignacio Vera	ed564f6e1d	Update to lucene-9.3.0 (#88927 )	2022-08-01 07:21:13 +02:00
Alan Woodward	bc8ebbf540	Add FieldDataContext (#88779 ) MappedFieldType#fieldDataBuilder() currently takes two parameters, a fully qualified index name and a supplier for a SearchLookup. We expect to add more parameters here as we add support for loading fielddata from source. Rather than telescoping the parameter list, this commit instead introduces a new FieldDataContext carrier object which will allow us to add to these context parameters more easily.	2022-07-26 14:47:50 +01:00
Ignacio Vera	3b7f393a82	Upgrade to lucene snapshot lucene-9.3.0-snapshot-b8d1fcfd0ec (#88706 )	2022-07-22 11:22:39 +02:00
Nhat Nguyen	bd69f90fff	Upgrade to Lucene-9.3.0-snapshot-2d05f5c623e (#88284 ) To include LUCENE-10620 - which passes Weight to Collector	2022-07-06 16:16:03 -04:00
Armin Braun	eda1c511dd	Don't extend AbstractIndexComponent in AbstractCharFilterFactory (#88125 ) Same as #88113 but for AbstractCharFilterFactory.	2022-06-28 14:51:51 +02:00
Armin Braun	02568210ba	Don't extend AbstractIndexComponent in AbstractTokenFilter (#88113 ) No need for this extension, we don't make use of the settings or deprecation logger in production any more. Also, this slows down CS operations that require a temporary index service which builds quite a bit slower when the loggers need to be set up via reflective calls.	2022-06-28 12:13:36 +02:00
Nhat Nguyen	c2dc6e6ef4	Upgrade to new Lucene snapshot (#87932 ) This PR uses Lucene-9.3 snapshot in Elasticsearch 8.4. Noticeable changes in this Lucene snapshot: - Merge-on-refresh (disabled) - No more pathological merging - SortedSetDocValues#count for value_count aggs	2022-06-23 12:18:27 -04:00
Nikola Grcevski	06d5baaba5	Add more GraalThread filtering in tests (#87571 )	2022-06-09 16:06:57 -04:00
Alan Woodward	048fa422c2	Update to public lucene 9.2.0 release (#87162 )	2022-06-06 10:06:41 +01:00
Armin Braun	7a25453dec	Speed up FieldMapper construction/parsing/serialization (#86860 ) Speeding this up some more as it's now 50% of the bootstrap time of the many shards benchmarks. Iterating an array here in all cases is quite a bit faster than iterating various kinds of lists and doesn't complicate the code. Also removes a redundant call to `getValue()` for each parameter during serialization.	2022-05-23 12:09:00 +02:00
Chris Hegarty	3071c6a055	Modularize Elasticsearch (#81066 ) This PR represents the initial phase of Modularizing Elasticsearch (with Java Modules). This initial phase modularizes the core of the Elasticsearch server with Java Modules, which is then used to load and configure extension components atop the server. Only a subset of extension components are modularized at this stage (other components come in a later phase). Components are loaded dynamically at runtime with custom class loaders (same as is currently done). Components with a module-info.class are defined to a module layer. This architecture is somewhat akin to the Modular JDK, where applications run on the classpath. In the analogy, the Elasticsearch server modules are the platform (thus are always resolved and present), while components without a module-info.class are non-modular code running atop the Elasticsearch server modules. The extension components cannot access types from non-exported packages of the server modules, in the same way that classpath applications cannot access types from non-exported packages of modules from the JDK. Broadly, the core Elasticsearch java modules simply "wrap" the existing packages and export them. There are opportunities to export less, which is best done in more narrowly focused follow-up PRs. The Elasticsearch distribution startup scripts are updated to put jars on the module path (the class path is empty), so the distribution will run the core of the server as java modules. A number of key components have been retrofitted with module-info.java files too, and the remaining components can follow later. Unit and functional tests run as non-modular (since they commonly require package-private access), while higher-level integration tests, that run the distribution, run as modular. Co-authored-by: Chris Hegarty <christopher.hegarty@elastic.co> Co-authored-by: Ryan Ernst <ryan@iernst.net> Co-authored-by: Rene Groeschke <rene@elastic.co>	2022-05-20 13:11:42 +01:00
Alan Woodward	205cfec52f	Upgrade to lucene 9.2.0-RC2 snapshot (#86931 ) Only difference from last snapshot is a revert of a change in the behaviour of PersianAnalyzer	2022-05-20 08:54:35 +01:00
Alan Woodward	0418e8a9d8	Upgrade to lucene snapshot 978eef5459c (#86852 ) Final (hopefully!) snapshot before the 9.2.0 release * Update test to expect persian tokenfilter - will be exposed later * Fix KnnVectorQueryBuilderTests::doAssertLuceneQuery Co-authored-by: Mayya Sharipova <mayya.sharipova@elastic.co>	2022-05-17 15:27:52 -07:00
Armin Braun	82933a8599	Save redundant singleton maps in field mappers (#86785 ) In the many-shards benchmarks the singleton maps storing just a single analyzer for each keyword field mapper cost around 5% of the total heap usage on data nodes (700MB for ~15k indices which translate into ~16M instances of keyword field mapper for Beats mappings). Creating specific implementations for the zero, one or many analyzers use cases that already have their own specialized constructors eliminates this overhead completely. relates #77466	2022-05-16 15:13:51 +02:00
Nik Everett	a589456b81	Synthetic source (#85649 ) This attempts to shrink the index by implementing a "synthetic _source" field. You configure it in the mapping: ``` { "mappings": { "_source": { "synthetic": true } } } ``` And we just stop storing the `_source` field - kind of. When you go to access the `_source` we regenerate it on the fly by loading doc values. Doc values don't preserve the original structure of the source you sent so we have to make some educated guesses. And we have a rule: the source we generate would result in the same index if you sent it back to us. That way you can use it for things like `_reindex`. Fetching the `_source` from doc values does slow down loading somewhat. See numbers further down. ## Supported fields This only works for the following fields: * `boolean` * `byte` * `date` * `double` * `float` * `geo_point` (with precision loss) * `half_float` * `integer` * `ip` * `keyword` * `long` * `scaled_float` * `short` * `text` (when there is a `keyword` sub-field that is compatible with this feature) ## Educated guesses The synthetic source generator makes `_source` fields that are: * sorted alphabetically * as "objecty" as possible * pushes all arrays to the "leaf" fields * sorts most array values * removes duplicate text and keyword values These are mostly artifacts of how doc values are stored. 
### sorted alphabetically ``` { "b": 1, "c": 2, "a": 3 } ``` becomes ``` { "a": 3, "b": 1, "c": 2 } ``` ### as "objecty" as possible ``` { "a.b": "foo" } ``` becomes ``` { "a": { "b": "foo" } } ``` ### pushes all arrays to the "leaf" fields ``` { "a": [ { "b": "foo", "c": "bar" }, { "c": "bort" }, { "b": "snort" } ] } ``` becomes ``` { "a": { "b": ["foo", "snort"], "c": ["bar", "bort"] } } ``` ### sorts most array values ``` { "a": [2, 3, 1] } ``` becomes ``` { "a": [1, 2, 3] } ``` ### removes duplicate text and keyword values ``` { "a": ["bar", "baz", "baz", "baz", "foo", "foo"] } ``` becomes ``` { "a": ["bar", "baz", "foo"] } ``` ## `_recovery_source` Elasticsearch's shard "recovery" process needs `_source` sometimes. So does cross cluster replication. If you disable source or filter it somehow we store a `_recovery_source` field for as long as the recovery process might need it. When everything is running smoothly that's generally a few seconds or minutes. Then the field is removed on merge. This synthetic source feature continues to produce `_recovery_source` and relies on it for recovery. It's possible to synthesize `_source` during recovery but we don't do it. That means that synthetic source doesn't speed up writing the index. But in the future we might be able to turn this on to trade writing less data at index time for slower recovery and cross cluster replication. That's an area of future improvement. ## perf numbers I loaded the entire tsdb data set with this change and the size: ``` standard -> synthetic store size 31.0 GB -> 7.0 GB (77.5% reduction) _source 24695.7 MB -> 47.6 MB (99.8% reduction - synthetic is in _recovery_source) ``` A second _forcemerge a few minutes after rally finishes should remove the remaining 47.6MB of _recovery_source. With this, fetching source for 1,000 documents seems to take about 500ms. I spot checked a lot of different areas and haven't seen any different hit. 
I expect this performance impact is based on the number of doc values fields in the index and how sparse they are.	2022-05-10 07:46:58 -04:00
Armin Braun	7b916f2678	AbstractAnalyzerProvider does not need to extend AbstractIndexComponent (#86537 ) Remove the inheritance here to make instances smaller and speed up many-shards benchmarks a little. Did not remove the dead arguments from the constructors in this PR as that would have been a very noisy change.	2022-05-08 22:34:52 +02:00
Alan Woodward	4d076eee20	Upgrade to Lucene 9.2 snapshot efa5d6f4d43 (#86227 ) Notable changes include: count implementations for MultiRangeQuery and IndexSortedNumericDocValuesRangeQuery, which may speed up certain aggregations; more efficient decoding of docids in BKD reader	2022-05-05 15:48:13 +01:00
Chris Hegarty	603ca53798	Use declared constant (rather than resource lookup) (#86083 ) Use the public static final constant org.apache.lucene.analysis.icu.NORMALIZER, rather than poking around inside lucene resources - the Normalizer2 instance is equivalent. It would appear that this code, doing the resource lookup, predates the lucene public field.	2022-04-22 13:55:24 +01:00
Ignacio Vera	af2fe8ee33	Upgrade Lucene to 9.1.0 release (#85211 )	2022-03-22 14:11:53 +01:00
Armin Braun	9ec646302d	Remove Restricted String Mapping Param (#85129 ) This param was incredibly expensive to set up when parsing mappings and is one of the big contributors to mapping parsing slowness on master. Since all uses of this parameter type are statically known it seems the most straight forward to simply statically hard code the validators so that we save some allocations.	2022-03-21 12:35:43 +01:00
Alan Woodward	0863fb83d5	Upgrade to lucene 9.1.0-snapshot-5b522487ba8 (#85025 ) Specifically includes LUCENE-10469 which should address a performance regression in EQL.	2022-03-16 14:58:20 +00:00
Julie Tibshirani	bba2dfac56	Upgrade Lucene to 9.1.0-snapshot-949752 (#84540 ) This PR upgrades Lucene to a newer snapshot `9.1.0-snapshot-9497524cc2d`. Changes: * Adapt to `LeafReader#searchNearestVectors` signature change * Adapt checks in `GeometryIndexerTests`, `SearchServiceTests`, `FiltersAggregatorTests`, `AggregationProfilerIT` * Address highlighting failures in `MultiPhrasePrefixQuery` and `HasChildQueryBuilder.LateParsingQuery`	2022-03-04 08:36:26 -08:00
Mayya Sharipova	26c3dd6857	Upgrade to lucene-9.1.0-snapshot-1336263051c (#83667 ) Lucene issues that resulted in elasticsearch changes: LUCENE-9820 Separate logic for reading the BKD index from logic to intersecting it. LUCENE-10377: Replace 'sortPos' with 'enableSkipping' in SortField.getComparator() LUCENE-10301: make the test-framework a proper module by moving all test classes to org.apache.lucene.tests LUCENE-10300: rewrite how resources are read in ukrainian morfologik analyzer: LUCENE-10054 Make HnswGraph hierarchical	2022-02-22 09:53:20 +01:00
Przemyslaw Gomulka	037261356e	Convert 'id' and '_id' values in REST API tests to strings (#82681 ) Follow-up from #77144 (comment) with converting id/_id to always be strings instead of integers. This makes the type value in the Elasticsearch specification be only string instead of string | number. This change was generated using the following command on Ubuntu: find . -type f -name ".yml" -print0 | xargs -0 sed -i -r 's/([^a-zA-Z0-9_\.]id|[^a-zA-Z0-9_]_id):(\s)([0-9]+)/\1:\2"\3"/g'	2022-02-10 09:14:17 +01:00
Christos Soulios	8ae978126b	TSDB: Add support for composite aggregation on `_tsid` field (#81998 ) Currently, composite aggregation supports only string or numeric values for the fields in the sources part of the composite aggregation. Although _tsid is encoded and stored as a byte array, it is formatted as a map {"dim1": "value1", "dim2": "value2", ...} for input/output. This PR adds support for composite aggregation on the _tsid field. Relates to #74660	2022-01-19 17:57:33 +02:00
Mary Gouseti	4499050341	Use pattern matching for instanceof in plugins through qa, server/internalClusterTest (#82161 )	2022-01-12 11:34:15 +01:00
weizijun	b6e8b59880	TSDB: fix reindex failed tests without feature flag (#81967 ) Fix in the same way as #80945: register a settings update consumer for the end_time for the tsdb index even when the end_time setting wasn't registered. Pass the feature flag to reindex yaml tests. Co-authored-by: Igor Motov <igor@motovs.org>	2022-01-06 14:45:08 -05:00
Artem Prigoda	763d6d510f	Use Java 15 text blocks for JSON and multiline strings (#80751 ) The ES code base is quite JSON heavy. It uses a lot of multi-line JSON requests in tests which need to be escaped and concatenated which in turn makes them hard to read. Let's try to leverage Java 15 text blocks for representing them.	2021-12-15 18:01:28 +01:00
Alan Woodward	33ef38e478	Upgrade to released lucene 9.0.0 (#81426 ) This commit makes elasticsearch depend on the released artifacts of lucene 9.0.0, rather than an internal snapshot.	2021-12-07 14:19:56 +00:00
Stuart Tettemer	5e357e7331	Script: Ordinal field data plumbing (#80970 ) Add plumbing for ordinal field data for the field API. The scripting fields API needs to know the mapped type of each field in the document. This is ensured by having a `ToScriptField` method reference passed from the `MappedField`, through the `IndexFieldData`, to the `LeafFieldData`. Knowing the mapped type allows the API to provide relevant helper methods as well as appropriately use the fields available in the document. Refs: #79105	2021-12-02 09:18:48 -06:00
Rory Hunter	add386dd00	Fix shadowed vars pt5 (#80855 ) Part of #19752. Fix more instances where local variable names were shadowing field names.	2021-11-19 10:47:26 +00:00
Armin Braun	01a144a60b	Cleanup some Dead Code in Mappers (#80526 ) Just cleaning up some unused code in the mapper package in preparation for some further deduplication/scalability improvements.	2021-11-09 12:44:08 +01:00
Mayya Sharipova	db0b4ba08a	Upgrade Lucene 9 snapshot cc2a31f2be8 (#80213 )	2021-11-02 15:50:33 -04:00
Rene Groeschke	92e8ba2e74	Check for multiple javadocs in java headers (#79603 ) We also now enforce that the license statement is at the very top of the java file, before the package declaration. Fixes #79235	2021-10-29 08:32:11 +02:00


415 commits