elasticsearch

mirror of https://github.com/elastic/elasticsearch.git synced 2025-06-28 17:34:17 -04:00

Author	SHA1	Message	Date
Armin Braun	7a25453dec	Speed up FieldMapper construction/parsing/serialization (#86860 ) Speeding this up some more as it's now 50% of the bootstrap time of the many shards benchmarks. Iterating an array here in all cases is quite a bit faster than iterating various kinds of lists and doesn't complicate the code. Also removes a redundant call to `getValue()` for each parameter during serialization.	2022-05-23 12:09:00 +02:00
Yannick Welsch	5aebb8ee38	Add text field support to archive indices (#86591 ) Adds support for "text" fields in archive indices, with the goal of adding simple filtering support on text fields when querying archive indices. There are some differences to regular text fields: - no global statistics: queries on text fields return constant score (similar to match_only_text). - analyzer fields can be updated - if defined analyzer is not available, falls back to default analyzer - no guarantees that analyzers are BWC The above limitations also give us the flexibility to eventually swap out the implementation with a "runtime-text field" variant, and hence only provide those capabilities that can be emulated via a runtime field. Relates #81210	2022-05-18 10:25:38 +02:00
Armin Braun	82933a8599	Save redundant singleton maps in field mappers (#86785 ) In the many-shards benchmarks the singleton maps storing just a single analyzer for each keyword field mapper cost around 5% of the total heap usage on data nodes (700MB for ~15k indices which translate into ~16M instances of keyword field mapper for Beats mappings). Creating specific implementations for the zero, one or many analyzers use cases that already have their own specialized constructors eliminates this overhead completely. relates #77466	2022-05-16 15:13:51 +02:00
Nik Everett	a589456b81	Synthetic source (#85649 ) This attempts to shrink the index by implementing a "synthetic _source" field. You configure it in the mapping: ``` { "mappings": { "_source": { "synthetic": true } } } ``` And we just stop storing the `_source` field - kind of. When you go to access the `_source` we regenerate it on the fly by loading doc values. Doc values don't preserve the original structure of the source you sent so we have to make some educated guesses. And we have a rule: the source we generate would result in the same index if you sent it back to us. That way you can use it for things like `_reindex`. Fetching the `_source` from doc values does slow down loading somewhat. See numbers further down. ## Supported fields This only works for the following fields: * `boolean` * `byte` * `date` * `double` * `float` * `geo_point` (with precision loss) * `half_float` * `integer` * `ip` * `keyword` * `long` * `scaled_float` * `short` * `text` (when there is a `keyword` sub-field that is compatible with this feature) ## Educated guesses The synthetic source generator makes `_source` fields that are: * sorted alphabetically * as "objecty" as possible * pushes all arrays to the "leaf" fields * sorts most array values * removes duplicate text and keyword values These are mostly artifacts of how doc values are stored. 
### sorted alphabetically ``` { "b": 1, "c": 2, "a": 3 } ``` becomes ``` { "a": 3, "b": 1, "c": 2 } ``` ### as "objecty" as possible ``` { "a.b": "foo" } ``` becomes ``` { "a": { "b": "foo" } } ``` ### pushes all arrays to the "leaf" fields ``` { "a": [ { "b": "foo", "c": "bar" }, { "c": "bort" }, { "b": "snort" } ] } ``` becomes ``` { "a": { "b": ["foo", "snort"], "c": ["bar", "bort"] } } ``` ### sorts most array values ``` { "a": [2, 3, 1] } ``` becomes ``` { "a": [1, 2, 3] } ``` ### removes duplicate text and keyword values ``` { "a": ["bar", "baz", "baz", "baz", "foo", "foo"] } ``` becomes ``` { "a": ["bar", "baz", "foo"] } ``` ## `_recovery_source` Elasticsearch's shard "recovery" process needs `_source` sometimes. So does cross cluster replication. If you disable source or filter it somehow we store a `_recovery_source` field for as long as the recovery process might need it. When everything is running smoothly that's generally a few seconds or minutes. Then the field is removed on merge. This synthetic source feature continues to produce `_recovery_source` and relies on it for recovery. It's possible to synthesize `_source` during recovery but we don't do it. That means that synthetic source doesn't speed up writing the index. But in the future we might be able to turn this on to trade writing less data at index time for slower recovery and cross cluster replication. That's an area of future improvement. ## perf numbers I loaded the entire tsdb data set with this change and compared the size: ``` standard -> synthetic store size 31.0 GB -> 7.0 GB (77.5% reduction) _source 24695.7 MB -> 47.6 MB (99.8% reduction - synthetic is in _recovery_source) ``` A second _forcemerge a few minutes after rally finishes should remove the remaining 47.6MB of _recovery_source. With this fetching source for 1,000 documents seems to take about 500ms. I spot checked a lot of different areas and haven't seen any different hit. 
I expect this performance impact is based on the number of doc values fields in the index and how sparse they are.	2022-05-10 07:46:58 -04:00
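The "educated guesses" listed in the synthetic source commit above can be sketched as a small normalization function. This is only an illustration of the described rules in Python, not the Elasticsearch implementation: the function name `synthesize` is hypothetical, it assumes the values of each field are mutually comparable, and it assumes no path is used both as a leaf and as an object.

```python
def synthesize(source):
    """Approximate the synthetic _source shape described in the commit message."""
    leaves = {}

    def collect(prefix, value):
        # Walk objects and arrays, gathering every leaf value under its dotted path.
        if isinstance(value, dict):
            for key, child in value.items():
                collect(f"{prefix}.{key}" if prefix else key, child)
        elif isinstance(value, list):
            for item in value:
                collect(prefix, item)
        else:
            leaves.setdefault(prefix, []).append(value)

    collect("", source)

    result = {}
    # Sorting by path components yields alphabetically ordered keys at every level.
    for path in sorted(leaves, key=lambda p: p.split(".")):
        values = leaves[path]
        if all(isinstance(v, str) for v in values):
            values = sorted(set(values))  # strings: sorted and de-duplicated
        else:
            values = sorted(values)       # other values: sorted, duplicates kept
        node = result
        *parents, leaf = path.split(".")
        for parent in parents:
            node = node.setdefault(parent, {})  # dotted names become nested objects
        node[leaf] = values[0] if len(values) == 1 else values
    return result
```

Calling `synthesize({"a": [{"b": "foo", "c": "bar"}, {"c": "bort"}, {"b": "snort"}]})` reproduces the "pushes all arrays to the leaf fields" example from the commit message.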
Armin Braun	cb41ed09e3	Deduplicate default FieldType in KeywordFieldMapper (#86346 ) The default type is incredibly common and instances are not trivial in size with 16 fields. Heap dumps from larger data nodes holding many keyword fields with the default field type can contain hundreds of MB of heap used for these. Same reasoning applies to the `TextSearchInfo` deduplication. `TextSearchInfo` was turned into a record to give us an `equals` implementation.	2022-05-03 16:11:36 +02:00
Ryan Ernst	f0d0c373cd	Remove uses of Charset name parsing (#85795 ) There are many places in Elasticsearch which must decode some stream of bytes into characters. Most of the time this is expected to be UTF-8 encoded data, and we hardcode that charset name. However, methods in the JDK that take a String charset name require catching UnsupportedEncodingException. Yet most of these APIs also have a variant of the same method which takes a known Charset instance, for which we can use StandardCharsets.UTF_8. This commit converts most instances of passing string charset names to use a Charset instance.	2022-04-12 12:05:32 -07:00
Armin Braun	9ec646302d	Remove Restricted String Mapping Param (#85129 ) This param was incredibly expensive to set up when parsing mappings and is one of the big contributors to mapping parsing slowness on master. Since all uses of this parameter type are statically known it seems the most straight forward to simply statically hard code the validators so that we save some allocations.	2022-03-21 12:35:43 +01:00
Mayya Sharipova	26c3dd6857	Upgrade to lucene-9.1.0-snapshot-1336263051c (#83667 ) Lucene issues that resulted in elasticsearch changes: LUCENE-9820 Separate logic for reading the BKD index from logic to intersecting it. LUCENE-10377: Replace 'sortPos' with 'enableSkipping' in SortField.getComparator() LUCENE-10301: make the test-framework a proper module by moving all test classes to org.apache.lucene.tests LUCENE-10300: rewrite how resources are read in ukrainian morfologik analyzer: LUCENE-10054 Make HnswGraph hierarchical	2022-02-22 09:53:20 +01:00
Przemyslaw Gomulka	037261356e	Convert 'id' and '_id' values in REST API tests to strings (#82681 ) Follow-up from #77144 (comment) with converting id/_id to always be strings instead of integers. This makes the type value in the Elasticsearch specification be only string instead of string | number. This change was generated using the following command on Ubuntu: find . -type f -name ".yml" -print0 | xargs -0 sed -i -r 's/([^a-zA-Z0-9_\.]id|[^a-zA-Z0-9_]_id):(\s)([0-9]+)/\1:\2"\3"/g'	2022-02-10 09:14:17 +01:00
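To see what the `sed` substitution in the commit above does to a YAML test file, here is a rough Python equivalent. It assumes the two branches of the pattern are alternatives, as they are under `sed -r`. The helper name `quote_ids` and the sample lines are made up for illustration.

```python
import re

# Same pattern as the sed expression: an "id" or "_id" key that is not part of a
# longer identifier, followed by a colon, one whitespace character, and digits.
ID_PATTERN = re.compile(r'([^a-zA-Z0-9_.]id|[^a-zA-Z0-9_]_id):(\s)([0-9]+)')


def quote_ids(text):
    """Wrap integer id/_id values in double quotes, leaving other keys untouched."""
    return ID_PATTERN.sub(r'\1:\2"\3"', text)
```

For example, `quote_ids('  - id: 1')` gives `'  - id: "1"'`, while a key such as `doc_id: 3` is left alone because the character before `id` is part of an identifier.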
Artem Prigoda	fc5a820da9	Migrate to Java 16 Records (part 1) (#82338 ) Try to represent immutable data with Java records introduced in [JEP 395](https://openjdk.java.net/jeps/395)	2022-01-18 17:53:06 +01:00
weizijun	b6e8b59880	TSDB: fix reindex failed tests without feature flag (#81967 ) Fix in the same way as #80945: register a settings update consumer for the end_time for the tsdb index even when the end_time setting wasn't registered. Pass the feature flag to reindex yaml tests. Co-authored-by: Igor Motov <igor@motovs.org>	2022-01-06 14:45:08 -05:00
Rory Hunter	add386dd00	Fix shadowed vars pt5 (#80855 ) Part of #19752. Fix more instances where local variable names were shadowing field names.	2021-11-19 10:47:26 +00:00
Mark Vieira	12ad399c48	Reformat Elasticsearch source	2021-10-27 08:19:51 -07:00
Chris Hegarty	20c9f756d2	Fix split package org.elasticsearch.common.xcontent (#78831 ) Fix the split package org.elasticsearch.common.xcontent, between server and the x-content lib. Move the x-content lib exported package from org.elasticsearch.common.xcontent to org.elasticsearch.xcontent ( following the naming convention of similar libraries ). Removing split packages is a prerequisite to modularization.	2021-10-08 17:14:26 +01:00
Ryan Ernst	0a1a7b3559	Fix split package in annotated text plugin (#78133 ) The annotated text mapper plugin reuses package names from server. This commit moves the implementation classes into an annotated text package specifically for the plugin.	2021-09-21 13:00:36 -07:00
Chris Hegarty	c1950a6d27	Relocate org.apache.lucene.search.uhighlight -> org.elasticsearch.search.uhighlight (#78099 ) Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>	2021-09-21 12:15:19 -04:00
Alan Woodward	9312eba5ed	Change Mapper.build() to take a context object (#77108 ) Mapper.build() currently takes a ContentPath object that it can use to generate field type names that will include its parent names. We would like to expand field types to include more information about their parents, and ContentPath does not hold this information. This commit replaces the ContentPath parameter with a new MapperBuilderContext, which currently holds only the content path information but can be expanded in future to hold parent relationship information. Relates to #75474	2021-09-08 16:34:14 +01:00
Armin Braun	096b8ccc26	Fix TextFieldMapper Retaining a Reference to its Builder (#77251 ) Fixes the text field mapper and the analyzers class that also retained parameter references that go really heavy. Makes `TextFieldMapper` take hundreds of bytes compared to multiple kb per instance. closes #73845	2021-09-03 18:44:11 +02:00
Rene Groeschke	35ec6f348c	Introduce simple public yaml-rest-test plugin (#76554 ) This introduces a basic public yaml rest test plugin that is supposed to be used by external elasticsearch plugin authors. This is driven by #76215 - Rename yaml-rest-test to intern-yaml-rest-test - Use public yaml plugin in example plugins Co-authored-by: Mark Vieira <portugee@gmail.com>	2021-08-31 08:45:52 +02:00
Luca Cavanna	c6641bf00c	Rename ParseContext to DocumentParserContext (#74963 ) ParseContext is used to parse documents. It was easily confused with ParserContext (now renamed to MappingParserContext) which is instead used to parse mappings. To remove any confusion, this commit renames ParseContext to DocumentParserContext and adapts its subclasses accordingly.	2021-07-06 09:15:59 -04:00
Ryan Ernst	ab1a2e4a84	Add precommit task for detecting split packages (#73784 ) Modularization of the JDK has been ongoing for several years. Recently in Java 16 the JDK began enforcing module boundaries by default. While Elasticsearch does not yet use the module system directly, there are some side effects even for those projects not modularized (eg #73517). Before we can even begin to think about how to modularize, we must Prepare The Way by enforcing packages only exist in a single jar file, since the module system does not allow packages to coexist in multiple modules. This commit adds a precommit check to the build which detects split packages. The expectation is that we will add the existing split packages to the ignore list so that any new classes will not exacerbate the problem, and the work to cleanup these split packages can be parallelized. relates #73525	2021-06-08 15:04:23 -07:00
Alan Woodward	b27eaa38dc	Remove 'external values', and replace with swapped out XContentParsers (#72203 ) The majority of field mappers read a single value from their positioned XContentParser, and do not need to call nextToken. There is a general assumption that the same holds for any multifields defined on them, and so the XContentParser is passed down to their multifields builder as-is. This assumption does not hold for mappers that accept json objects, and so we have a second mechanism for passing values around called 'external values', where a mapper can set a specific value on its context and child mappers can then check for these external values before reading from xcontent. The disadvantage of this is that every field mapper now needs to check its context for external values. Because the values are defined by their java class, we can also know that in the vast majority of cases this functionality is unused. We have only two mappers that actually make use of this, CompletionFieldMapper and GeoPointFieldMapper. This commit removes external values entirely, and replaces it with the ability to pass a modified XContentParser to multifields. FieldMappers can just check the parser attached to their context for data and don't need to worry about multiple sources. Plugins implementing field mappers will need to take the removal of external values into account. Implementations that are passing structured objects as external values should instead use ParseContext.switchParser and wrap the objects using MapXContentParser.wrapObject(). GeoPointFieldMapper passes on a fake parser that just wraps its input data formatted as a geohash; CompletionFieldMapper has a slightly more complicated parser that in general wraps its metadata, but if textOrNull() is called without the parser being advanced just returns its text input. Relates to #56063	2021-04-29 09:17:18 +01:00
Alan Woodward	e002aa809b	Make FieldNamesFieldMapper responsible for adding its own doc fields (#71929 ) The FieldNamesFieldMapper is a metadata mapper defining a field that can be used for exists queries if a mapper does not use doc values or norms. Currently, data is added to it via a special method on FieldMapper that pulls the metadata mapper from a mapping lookup, checks to see if it is enabled, and then adds the relevant value to a lucene document. This is one of only two places that pulls a metadata mapper from the MappingLookup, and it would be nice to remove this method. This commit refactors field name handling by instead storing the names of fields to index in the fieldnames field in a set on the ParseContext, and then building the field itself in FieldNamesFieldMapper.postParse(). This means that all of the responsibility for enabling indexing, etc, is handled within the metadata mapper itself.	2021-04-27 16:03:46 +01:00
Adrien Grand	25750a3696	Make intervals queries fully pluggable through field mappers. (#71429 ) `MappedFieldType` only allows configuring `match` and `prefix` queries today. This change makes it possible to configure how to create `wildcard` and `fuzzy` queries as well. This will allow making the upcoming `match_only_text` field fully support intervals queries.	2021-04-20 18:10:12 +02:00
Jake Landis	b1ef1fd800	Introduce yamlRestCompatTests for :plugins projects (#71440 )	2021-04-08 16:11:50 -05:00
Mark Vieira	6339691fe3	Consolidate REST API specifications and publish under Apache 2.0 license (#70036 )	2021-03-26 16:20:14 -07:00
Nik Everett	91c700bd99	Super randomized tests for fetch fields API (#70278 ) We've had a few bugs in the fields API where it doesn't behave like we'd expect. Typically this happens because it isn't obvious what we expect. So we'll try and use randomized testing to ferret out what we want. This adds a test for most field types that asserts that `fields` works similarly to `docvalues_fields`. We expect this to be true for most fields. It does so by forcing all subclasses of `MapperTestCase` to define a method that makes random values. It declares a few other hooks that subclasses can override to further randomize the test. We skip the test for a few field types that don't have doc values: * `annotated_text` * `completion` * `search_as_you_type` * `text` We should come up with some way to test these without doc values, even if it isn't as nice. But that is a problem for another time, I think. We skip the test for a few more types just because I wanted to cut this PR in half so we could get to reviewing it earlier. We'll get to those in a follow up change. I've filed a few bugs for things that are inconsistent with `docvalues_fields`. Typically that means that we have to limit the random values that we generate to those that do round trip properly.	2021-03-24 14:16:27 -04:00
Alan Woodward	49897be1bc	Fix position increment gap on phrase/prefix analyzers (#70096 ) Custom position increments are handled by wrapping analyzers with a NamedAnalyzer and passing the custom increment through to its constructor. However, phrase and prefix analyzers use delegating analyzer wrappers to add extra filtering to their parent analyzers, and we can't wrap analyzers multiple times because this wrecks reuse strategies, so we unwrap the parent before passing it to phrase and prefix builders. This unwrapping means that we lose the custom position increments; in particular, it means that we can end up with a position increment gap of -1, which is the sentinel value for the unset parameter - and that means exceptions at index time for backwards-moving positions on fields with multiple values. This commit removes the sentinel value and uses standard parameter defaults and the isConfigured() method instead, plus it adds some more comprehensive testing for position increments when combined with phrase/prefix index options on text fields. Fixes #70049	2021-03-09 12:10:13 +00:00
Marios Trivyzas	1e12c93a31	Fix issue with AnnotatedTextHighlighter and max_analyzed_offset (#69028 ) With the newly introduced `max_analyzed_offset` the analyzer of `AnnotatedTextHighlighter` was wrapped twice with the `LimitTokenOffsetAnalyzer` by mistake. Follows: #67325	2021-02-16 17:08:07 +01:00
Marios Trivyzas	f9af60bf69	Add query param to limit highlighting to specified length (#67325 ) Add a `max_analyzed_offset` query parameter to allow users to limit the highlighting of text fields to a value less than or equal to the `index.highlight.max_analyzed_offset`, thus avoiding an exception when the length of the text field exceeds the limit. The highlighting still takes place, but stops at the length defined by the new parameter. Closes: #52155	2021-02-16 09:25:45 +01:00
Luca Cavanna	0ca6819882	DocumentMapper to not implement ToXContent (#68653 ) DocumentMapper does not need to implement ToXContent, in fact it is its inner Mapping that needs to and already does. Consumers can switch to calling mapping() and toXContent against it.	2021-02-08 14:17:31 +01:00
Mark Vieira	a92a647b9f	Update sources with new SSPL+Elastic-2.0 license headers As per the new licensing change for Elasticsearch and Kibana this commit moves existing Apache 2.0 licensed source code to the new dual license SSPL+Elastic license 2.0. In addition, existing x-pack code now uses the new version 2.0 of the Elastic license. Full changes include: - Updating LICENSE and NOTICE files throughout the code base, as well as those packaged in our published artifacts - Update IDE integration to now use the new license header on newly created source files - Remove references to the "OSS" distribution from our documentation - Update build time verification checks to no longer allow Apache 2.0 license header in Elasticsearch source code - Replace all existing Apache 2.0 license headers for non-xpack code with updated header (vendored code with Apache 2.0 headers obviously remains the same). - Replace all Elastic license 1.0 headers with new 2.0 header in xpack.	2021-02-02 16:10:53 -08:00
Julie Tibshirani	5852fbedf5	Rename QueryShardContext -> SearchExecutionContext. (#67490 ) We decided to rename `QueryShardContext` to clarify that it supports all parts of search request execution. Before there was confusion over whether it should only be used for building queries, or maybe only used in the query phase. This PR also updates the javadocs. Closes #64740.	2021-01-14 09:11:59 -08:00
markharwood	aa01af882e	Annotated text plugin highlighter causes "array_index_out_of_bounds_exception" (#66593 ) Recent changes to the way Analyzers and field mappings are managed revealed a bug in the AnnotatedHighlighterAnalyzer class. Old sequences of calls avoided the issue but under the new scheme a counter reset was required between documents being highlighted. Closes #66535	2021-01-04 15:41:49 +00:00
Alan Woodward	1a8ce8716d	Restore use of default search and search_quote analyzers (#65491 ) In the refactoring of TextFieldMapper, we lost the ability to define a default search or search_quote analyzer in index settings. This commit restores that ability, and adds some more comprehensive testing. Fixes #65434	2020-11-26 16:57:45 +00:00
Alan Woodward	d088171a87	Use ValueFetcher when loading text snippets to highlight (#63572 ) HighlighterUtils.loadFieldValues() loads values directly from the source, and then callers have to deal with filtering out values that would have been removed by an ignore_above filter on keyword fields. Instead, we can use the ValueFetcher for the relevant field, which handles all this logic for us. Closes #59931.	2020-11-24 16:09:37 +00:00
markharwood	ef810ba76b	Added test for significant_text on annotated_text field. (#64491 )	2020-11-09 10:27:15 +00:00
Alan Woodward	0fd70ae383	Remove Mapper.BuilderContext (#64625 ) Mapper.BuilderContext is a simple wrapper around two objects, some IndexSettings and a ContentPath. The IndexSettings are the same as those provided in the ParserContext, so we can simplify things here by removing them and just passing ContentPath directly to Mapper.Builder#build()	2020-11-05 10:48:39 +00:00
Luca Cavanna	131bcf2d6a	Remove mapperService method from FetchContext (#64620 ) There was one leftover usage of FetchContext#mapperService which can be easily replaced with retrieving the field name index analyzer from QueryShardContext.	2020-11-05 11:38:35 +01:00
Alan Woodward	f010269ab7	Move index analyzer management to FieldMapper/MapperService (#63937 ) Index-time analyzers are currently specified on the MappedFieldType. This has a number of unfortunate consequences; for example, field mappers that index data into implementation sub-fields, such as prefix or phrase accelerators on text fields, need to expose these sub-fields as MappedFieldTypes, which means that they then appear in field caps, are externally searchable, etc. It also adds index-time logic to a class that should only be concerned with search-time behaviour. This commit removes references to the index analyzer from MappedFieldType. Instead, FieldMappers that use the terms index can pass either a single analyzer or a Map of fields to analyzers to their super constructor, which are then exposed via a new FieldMapper#indexAnalyzers() method; all index-time analysis is mediated through the delegating analyzer wrapper on MapperService. In a follow-up, this will make it possible to register multiple field analyzers from a single FieldMapper, removing the need for 'hidden' mapper implementations on text field, parent joins, and elsewhere.	2020-11-04 13:53:09 +00:00
Alan Woodward	a5168572d5	Collapse ParametrizedFieldMapper into FieldMapper (#64365 ) Now that all our FieldMapper implementations extend ParametrizedFieldMapper, we can collapse the two classes together, and remove a load of cruft from FieldMapper that is unused. In particular: * we no longer need the lucene FieldType field on FieldMapper * we no longer use clone() for merging, so we can remove it from all impls * the serialization code in FieldMapper that assumes we're looking at text fields can go	2020-11-02 15:07:52 +00:00
Luca Cavanna	f491422e1e	Ensure field types consistency on supporting text queries (#63487 ) Some supported field types don't support term queries, and throw exception in their termQuery method. That exception is either an IllegalArgumentException or a QueryShardException. There is logic in MatchQuery that skips the field or not depending on the exception that is thrown. Also, such field types should hold a TextSearchInfo.NONE while that is not always the case. With this commit we make the following changes: - streamline using TextSearchInfo.NONE in all field types that don't support text queries - standardize the exception being thrown when a field type does not support term queries to be IllegalArgumentException. Note that this is not a breaking change as both exceptions previously returned translated to 400 status code. - Adapt the MatchQuery logic to skip fields that don't support term queries. There is no need to call termQuery passing an empty string and catch exceptions potentially thrown. We can rather check the TextSearchInfo which tells already whether the field supports text queries or not. - add a test method to MapperTestCase that verifies the consistency of a field type by verifying that it is not searchable whenever it uses TextSearchInfo.NONE, while it is otherwise. This is what triggered all of the above changes.	2020-10-13 11:05:43 +02:00
Alan Woodward	f4c85e4562	Convert TextFieldMapper to parametrized form (#63269 ) As a result of this, we can remove a chunk of code from TypeParsers as well. Tests for search/index mode analyzers have moved into their own file. This commit also rationalises the serialization checks for parameters into a single SerializerCheck interface that takes the values includeDefaults, isConfigured and the value itself. Relates to #62988	2020-10-07 10:29:29 +01:00
Alan Woodward	ce649d07d7	Move FieldMapper#valueFetcher to MappedFieldType (#62974 ) For runtime fields, we will want to do all search-time interaction with a field definition via a MappedFieldType, rather than a FieldMapper, to avoid interfering with the logic of document parsing. Currently, fetching values for runtime scripts and for building top hits responses need to call a method on FieldMapper. This commit moves this method to MappedFieldType, incidentally simplifying the current call sites and freeing us up to implement runtime fields as pure MappedFieldType objects.	2020-10-04 10:47:04 +01:00
Luca Cavanna	daade44174	Share same existsQuery impl throughout mappers (#57607 ) Most of our field types have the same implementation for their `existsQuery` method which relies on doc_values if present, otherwise it queries norms if available or uses a term query against the _field_names meta field. This standard implementation is repeated in many different mappers. There are field types that only query doc_values, because they always have them, and field types that always query _field_names, because they never have norms nor doc_values. We could apply the same standard logic to all of these field types as `MappedFieldType` has the knowledge about what data structures are available. This commit introduces a standard implementation that does the right thing depending on the data structure that is available. With that only field types that require a different behaviour need to override the existsQuery method. At the same time, this no longer forces subclasses to override `existsQuery`, which could be forgotten when needed. To address this we introduced a new test method in `MapperTestCase` that verifies the `existsQuery` being generated and its consistency with the available data structures.	2020-09-23 08:58:09 +02:00
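The fallback order described in the `existsQuery` commit above (doc values first, then norms, then a term query on `_field_names`) can be written down as a tiny decision function. This is a sketch in Python rather than the actual `MappedFieldType` code; the function name and the tuple-based query descriptions are placeholders invented for the example.

```python
def exists_query(field, has_doc_values, has_norms):
    """Pick an exists-query strategy from the data structures the field has."""
    if has_doc_values:
        # Preferred: doc values are present, so query them directly.
        return ("doc_values_exists", field)
    if has_norms:
        # Next best: norms record which documents have the field.
        return ("norms_exists", field)
    # Fallback: a term query against the _field_names metadata field.
    return ("term", "_field_names", field)
```

Field types that need different behaviour would override this choice, which matches the commit's note that only such types still need their own `existsQuery`.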
Luca Cavanna	3a9b65733c	Move stored flag from TextSearchInfo to MappedFieldType (#62717 )	2020-09-22 15:41:24 +02:00
Nik Everett	9a127adb4b	Implement fields fetch for runtime fields (#61995 ) This implements the `fields` API in `_search` for runtime fields using doc values. Most of that implementation is stolen from the `docvalue_fields` fetch sub-phase, just moved into the same API that the `fields` API uses. At this point the `docvalue_fields` fetch phase looks like a special case of the `fields` API. While I was at it I moved the "which doc values sub-implementation should I use for fetching?" question from a bunch of `instanceof`s to a method on `LeafFieldData` so we can be much more flexible with what is returned and we're not forced to extend certain classes just to make the fetch phase happy. Relates to #59332	2020-09-15 15:57:26 -04:00
Nik Everett	ba39f46e8b	Speed up empty highlighting many fields (#61860 ) Kibana often highlights everything like this: ``` POST /_search { "query": ..., "size": 500, "highlight": { "fields": { "": { ... } } } } ``` This can get slow when there are hundreds of mapped fields. I tested this locally and unscientifically and it took a request from 20ms to 150ms when there are 100 fields. I've seen clusters with 2000 fields where simple searches go from 500ms to 1500ms just by turning on this sort of highlighting. Even when the query is just a `range` and the fields are all numbers and stuff so it won't highlight anything. This speeds up the `unified` highlighter in this case in a few ways: 1. Build the highlighting infrastructure once per field rather than once per document per field. This cuts out a ton* of work analyzing the query over and over and over again. 2. Bail out of the highlighter before loading values if we can't produce any results. Combined these take that local 150ms case down to 65ms. This is unlikely to be really useful when there are only a few fetched docs and only a few fields, but we often end up having many fields with many fetched docs.	2020-09-08 12:33:23 -04:00
Alan Woodward	e6b62930db	Merge FetchSubPhase hitsExecute and hitExecute methods (#60907 ) FetchSubPhase has two 'execute' methods, one which takes all hits to be examined, and one which takes a single HitContext. It's not obvious which one should be implemented by a given sub-phase, or if implementing both is a possibility; nor is it obvious that we first run the hitExecute methods of all subphases, and then subsequently call all the hitsExecute methods. This commit reworks FetchSubPhase to replace these two variants with a processor class, `FetchSubPhaseProcessor`, that is returned from a single `getProcessor` method. This processor class has two methods, `setNextReader()` and `process`. FetchPhase collects processors from all its subphases (if a subphase does not need to execute on the current search context, it can return `null` from `getProcessor`). It then sorts its hits by docid, and groups them by lucene leaf reader. For each reader group, it calls `setNextReader()` on all non-null processors, and then passes each doc id to `process()`. Implementations of fetch sub phases can divide their concerns into per-request, per-reader and per-document sections, and no longer need to worry about sorting docs or dealing with reader slices. FetchSubPhase now provides a FetchSubPhaseExecutor that exposes two methods, setNextReader(LeafReaderContext) and execute(HitContext). The parent FetchPhase collects all these executors together (if a phase should not be executed, then it returns null here); then it sorts hits, and groups them by reader; for each reader it calls setNextReader, and then execute for each hit in turn. Individual sub phases no longer need to concern themselves with sorting docs or keeping track of readers; global structures can be built in getExecutor(SearchContext), per-reader structures in setNextReader and per-doc in execute.	2020-09-03 10:21:39 +01:00
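The control flow described in the FetchSubPhase commit above (collect non-null processors, sort hits by doc id, group by leaf reader, call `setNextReader()` once per reader, then `process()` per hit) can be sketched as follows. This is a simplified Python model, not the Java API: leaf readers are represented only by their starting doc ids, and `run_fetch_phase` and `CountingProcessor` are hypothetical names.

```python
from bisect import bisect_right


class CountingProcessor:
    """Toy processor that records which reader each doc was processed under."""

    def __init__(self):
        self.reader_switches = 0
        self.current_leaf = None
        self.processed = []

    def set_next_reader(self, leaf_index):
        self.reader_switches += 1
        self.current_leaf = leaf_index

    def process(self, doc_id):
        self.processed.append((self.current_leaf, doc_id))


def run_fetch_phase(doc_ids, leaf_starts, sub_phases):
    # Each sub-phase is a callable returning a processor, or None to opt out.
    processors = [p for p in (phase() for phase in sub_phases) if p is not None]
    current_leaf = None
    for doc_id in sorted(doc_ids):  # hits are visited in doc id order
        # leaf_starts holds the first doc id of each leaf, in ascending order.
        leaf = bisect_right(leaf_starts, doc_id) - 1
        if leaf != current_leaf:  # entering a new reader group
            current_leaf = leaf
            for processor in processors:
                processor.set_next_reader(leaf)
        for processor in processors:
            processor.process(doc_id)
    return processors
```

With `leaf_starts=[0, 10, 20]` and hits `[25, 3, 12, 1]`, each processor sees three reader switches and the docs in the order 1, 3, 12, 25, so sub-phases never have to sort docs or track readers themselves.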
Julie Tibshirani	5457b34343	Correct how field retrieval handles multifields and copy_to. (#61309 ) Before when a value was copied to a field through a parent field or `copy_to`, we parsed it using the `FieldMapper` from the source field. Instead we should parse it using the target `FieldMapper`. This ensures that we apply the appropriate mapping type and options to the copied value. To implement the fix cleanly, this PR refactors the value parsing strategy. Now instead of looking up values directly, field mappers produce a helper object `ValueFetcher`. The value fetchers are responsible for almost all aspects of fetching, including looking up the right paths in the _source. The PR is fairly big but each commit can be reviewed individually. Fixes #61033.	2020-08-19 16:50:27 -07:00

136 commits