From 14a7b8fe67010e406a9c77bb084e593b743c42bb Mon Sep 17 00:00:00 2001
From: Kathleen DeRusso
Date: Wed, 6 Nov 2024 14:42:06 -0500
Subject: [PATCH] Add documentation for query rules retriever (#115696)

* Add initial query rules retriever docs
* Add docs tests
* Apply suggestions from code review
Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
* PR feedback
* Make query rules guide retriever-first
* Add warning to DSL doc
* Update docs/reference/search/retriever.asciidoc
Co-authored-by: Mike Pellegrini
* Update docs/reference/search/retriever.asciidoc
Co-authored-by: Mike Pellegrini
* Apply suggestions from code review
Co-authored-by: Mike Pellegrini
* Give parameters subheading an explicit id
* Fix formatting

---------

Co-authored-by: Elastic Machine
Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
Co-authored-by: Mike Pellegrini
---
 docs/reference/query-dsl/rule-query.asciidoc | 6 +
 docs/reference/search/retriever.asciidoc | 237 +++++++++++++++---
 .../retrievers-overview.asciidoc | 59 +++--
 .../search-using-query-rules.asciidoc | 76 +++++-
 4 files changed, 303 insertions(+), 75 deletions(-)

diff --git a/docs/reference/query-dsl/rule-query.asciidoc b/docs/reference/query-dsl/rule-query.asciidoc index dfedc2261bbd..43e79f656a55 100644 --- a/docs/reference/query-dsl/rule-query.asciidoc +++ b/docs/reference/query-dsl/rule-query.asciidoc @@ -12,6 +12,12 @@ The old syntax using `rule_query` and `ruleset_id` is deprecated and will be removed in a future release, so it is strongly advised to migrate existing rule queries to the new API structure. ==== +[TIP] +==== +The rule query is not supported for use alongside reranking. +If you want to use query rules in conjunction with reranking, use the <> instead. +==== + Applies <> to the query before returning results. 
Query rules can be used to promote documents in the manner of a <> based on matching defined rules, or to identify specific documents to exclude from a contextual result set. If no matching query rules are defined, the "organic" matches for the query are returned. diff --git a/docs/reference/search/retriever.asciidoc b/docs/reference/search/retriever.asciidoc index 9306d83c7913..74497c53c602 100644 --- a/docs/reference/search/retriever.asciidoc +++ b/docs/reference/search/retriever.asciidoc @@ -1,14 +1,12 @@ [[retriever]] === Retriever -A retriever is a specification to describe top documents returned from a -search. A retriever replaces other elements of the <> +A retriever is a specification to describe top documents returned from a search. +A retriever replaces other elements of the <> that also return top documents such as <> and -<>. A retriever may have child retrievers where a -retriever with two or more children is considered a compound retriever. This -allows for complex behavior to be depicted in a tree-like structure, called -the retriever tree, to better clarify the order of operations that occur -during a search. +<>. +A retriever may have child retrievers where a retriever with two or more children is considered a compound retriever. +This allows for complex behavior to be depicted in a tree-like structure, called the retriever tree, which clarifies the order of operations that occur during a search. [TIP] ==== @@ -29,6 +27,9 @@ A <> that produces top documents from <> that enhances search results by re-ranking documents based on semantic similarity to a specified inference text, using a machine learning model. +`rule`:: +A <> that applies contextual <> to pin or exclude documents for specific queries. + [[standard-retriever]] ==== Standard Retriever @@ -44,8 +45,7 @@ Defines a query to retrieve a set of top documents. 
`filter`:: (Optional, <>) + -Applies a <> to this retriever -where all documents must match this query but do not contribute to the score. +Applies a <> to this retriever, where all documents must match this query but do not contribute to the score. `search_after`:: (Optional, <>) @@ -56,14 +56,13 @@ include::{es-ref-dir}/rest-api/common-parms.asciidoc[tag=terminate_after] `sort`:: + -(Optional, <>) -A sort object that that specifies the order of matching documents. +(Optional, <>) A sort object that specifies the order of matching documents. `min_score`:: (Optional, `float`) + -Minimum <> for matching documents. Documents with a -lower `_score` are not included in the top documents. +Minimum <> for matching documents. +Documents with a lower `_score` are not included in the top documents. `collapse`:: (Optional, <>) @@ -72,8 +71,7 @@ Collapses the top documents by a specified key into a single top document per ke ===== Restrictions -When a retriever tree contains a compound retriever (a retriever with two or more child -retrievers) the <> parameter is not supported. +When a retriever tree contains a compound retriever (a retriever with two or more child retrievers) the <> parameter is not supported. 
[discrete] [[standard-retriever-example]] @@ -105,12 +103,39 @@ POST /restaurants/_bulk?refresh {"region": "Austria", "year": "2020", "vector": [10, 22, 79]} {"index":{}} {"region": "France", "year": "2020", "vector": [10, 22, 80]} + +PUT /movies + +PUT _query_rules/my-ruleset +{ + "rules": [ + { + "rule_id": "my-rule1", + "type": "pinned", + "criteria": [ + { + "type": "exact", + "metadata": "query_string", + "values": [ "pugs" ] + } + ], + "actions": { + "ids": [ + "id1" + ] + } + } + ] +} + ---- // TESTSETUP [source,console] -------------------------------------------------- DELETE /restaurants + +DELETE /movies -------------------------------------------------- // TEARDOWN //// @@ -143,11 +168,13 @@ GET /restaurants/_search } } ---- + <1> Opens the `retriever` object. <2> The `standard` retriever is used for defining traditional {es} queries. <3> The entry point for defining the search query. <4> The `bool` object allows for combining multiple query clauses logically. -<5> The `should` array indicates conditions under which a document will match. Documents matching these conditions will increase their relevancy score. +<5> The `should` array indicates conditions under which a document will match. +Documents matching these conditions will have increased relevancy scores. <6> The `match` object finds documents where the `region` field contains the word "Austria." <7> The `filter` array provides filtering conditions that must be met but do not contribute to the relevancy score. <8> The `term` object is used for exact matches, in this case, filtering documents by the `year` field. @@ -178,8 +205,8 @@ Defines a <> to build a query vector. `k`:: (Required, integer) + -Number of nearest neighbors to return as top hits. This value must be fewer than -or equal to `num_candidates`. +Number of nearest neighbors to return as top hits. +This value must be fewer than or equal to `num_candidates`. 
`num_candidates`:: (Required, integer) @@ -222,16 +249,15 @@ GET /restaurants/_search <1> Configuration for k-nearest neighbor (knn) search, which is based on vector similarity. <2> Specifies the field name that contains the vectors. <3> The query vector against which document vectors are compared in the `knn` search. -<4> The number of nearest neighbors to return as top hits. This value must be fewer than or equal to `num_candidates`. +<4> The number of nearest neighbors to return as top hits. +This value must be fewer than or equal to `num_candidates`. <5> The size of the initial candidate set from which the final `k` nearest neighbors are selected. [[rrf-retriever]] ==== RRF Retriever -An <> retriever returns top documents based on the RRF formula, -equally weighting two or more child retrievers. -Reciprocal rank fusion (RRF) is a method for combining multiple result -sets with different relevance indicators into a single result set. +An <> retriever returns top documents based on the RRF formula, equally weighting two or more child retrievers. +Reciprocal rank fusion (RRF) is a method for combining multiple result sets with different relevance indicators into a single result set. ===== Parameters @@ -357,7 +383,8 @@ Refer to <> for a high level overview of semantic re-ranking ===== Prerequisites To use `text_similarity_reranker` you must first set up a `rerank` task using the <>. -The `rerank` task should be set up with a machine learning model that can compute text similarity. Refer to {ml-docs}/ml-nlp-model-ref.html#ml-nlp-model-ref-text-similarity[the Elastic NLP model reference] for a list of third-party text similarity models supported by {es}. +The `rerank` task should be set up with a machine learning model that can compute text similarity. +Refer to {ml-docs}/ml-nlp-model-ref.html#ml-nlp-model-ref-text-similarity[the Elastic NLP model reference] for a list of third-party text similarity models supported by {es}. 
Currently you can: @@ -368,6 +395,7 @@ Currently you can: ** Refer to the <> on this page for a step-by-step guide. ===== Parameters + `retriever`:: (Required, <>) + @@ -376,7 +404,8 @@ The child retriever that generates the initial set of top documents to be re-ran `field`:: (Required, `string`) + -The document field to be used for text similarity comparisons. This field should contain the text that will be evaluated against the `inferenceText`. +The document field to be used for text similarity comparisons. +This field should contain the text that will be evaluated against the `inference_text`. `inference_id`:: (Required, `string`) @@ -391,25 +420,28 @@ The text snippet used as the basis for similarity comparison. `rank_window_size`:: (Optional, `int`) + -The number of top documents to consider in the re-ranking process. Defaults to `10`. +The number of top documents to consider in the re-ranking process. +Defaults to `10`. `min_score`:: (Optional, `float`) + -Sets a minimum threshold score for including documents in the re-ranked results. Documents with similarity scores below this threshold will be excluded. Note that score calculations vary depending on the model used. +Sets a minimum threshold score for including documents in the re-ranked results. +Documents with similarity scores below this threshold will be excluded. +Note that score calculations vary depending on the model used. `filter`:: (Optional, <>) + Applies the specified <> to the child <>. -If the child retriever already specifies any filters, then this top-level filter is applied in conjuction -with the filter defined in the child retriever. +If the child retriever already specifies any filters, then this top-level filter is applied in conjunction with the filter defined in the child retriever. [discrete] [[text-similarity-reranker-retriever-example-cohere]] ==== Example: Cohere Rerank -This example enables out-of-the-box semantic search by re-ranking top documents using the Cohere Rerank API. 
This approach eliminate the need to generate and store embeddings for all indexed documents. +This example enables out-of-the-box semantic search by re-ranking top documents using the Cohere Rerank API. +This approach eliminates the need to generate and store embeddings for all indexed documents. This requires a <> using the `rerank` task type. [source,console] @@ -459,7 +491,9 @@ Follow these steps to load the model and create a semantic re-ranker. python -m pip install eland[pytorch] ---- + -. Upload the model to {es} using Eland. This example assumes you have an Elastic Cloud deployment and an API key. Refer to the https://www.elastic.co/guide/en/elasticsearch/client/eland/current/machine-learning.html#ml-nlp-pytorch-auth[Eland documentation] for more authentication options. +. Upload the model to {es} using Eland. +This example assumes you have an Elastic Cloud deployment and an API key. +Refer to the https://www.elastic.co/guide/en/elasticsearch/client/eland/current/machine-learning.html#ml-nlp-pytorch-auth[Eland documentation] for more authentication options. + [source,sh] ---- @@ -517,14 +551,142 @@ POST movies/_search This retriever uses a standard `match` query to search the `movie` index for films tagged with the genre "drama". It then re-ranks the results based on semantic similarity to the text in the `inference_text` parameter, using the model we uploaded to {es}. +[[rule-retriever]] +==== Query Rules Retriever + +The `rule` retriever enables fine-grained control over search results by applying contextual <> to pin or exclude documents for specific queries. +This retriever has similar functionality to the <>, but works out of the box with other retrievers. + +===== Prerequisites + +To use the `rule` retriever you must first create one or more query rulesets using the <>. + +[discrete] +[[rule-retriever-parameters]] +===== Parameters + +`retriever`:: +(Required, <>) ++ +The child retriever that returns the results to apply query rules on top of. 
+This can be a standalone retriever such as the <> or <> retriever, or it can be a compound retriever. + +`ruleset_ids`:: +(Required, `array`) ++ +An array of one or more unique <> IDs with query-based rules to match and apply as applicable. +Rulesets and their associated rules are evaluated in the order in which they are specified in the query and ruleset. +The maximum number of rulesets to specify is 10. + +`match_criteria`:: +(Required, `object`) ++ +Defines the match criteria to apply to rules in the given query ruleset(s). +Match criteria should match the keys defined in the `criteria.metadata` field of the rule. + +`rank_window_size`:: +(Optional, `int`) ++ +The number of top documents to return from the `rule` retriever. +Defaults to `10`. + +[discrete] +[[rule-retriever-example]] +==== Example: Rule retriever + +This example shows the rule retriever executed without any additional retrievers. +It runs the query defined by the `retriever` and applies the rules from `my-ruleset` on top of the returned results. + +[source,console] +---- +GET movies/_search +{ + "retriever": { + "rule": { + "match_criteria": { + "query_string": "harry potter" + }, + "ruleset_ids": [ + "my-ruleset" + ], + "retriever": { + "standard": { + "query": { + "query_string": { + "query": "harry potter" + } + } + } + } + } + } +} +---- + +[discrete] +[[rule-retriever-example-rrf]] +==== Example: Rule retriever combined with RRF + +This example shows how to combine the `rule` retriever with other rerank retrievers such as <> or <>. + +[WARNING] +==== +The `rule` retriever will apply rules to any documents returned from its defined `retriever` or any of its sub-retrievers. +This means that for the best results, the `rule` retriever should be the outermost defined retriever. +Nesting a `rule` retriever as a sub-retriever under a reranker such as `rrf` or `text_similarity_reranker` may not produce the expected results. 
+==== + +[source,console] +---- +GET movies/_search +{ + "retriever": { + "rule": { <1> + "match_criteria": { + "query_string": "harry potter" + }, + "ruleset_ids": [ + "my-ruleset" + ], + "retriever": { + "rrf": { <2> + "retrievers": [ + { + "standard": { + "query": { + "query_string": { + "query": "sorcerer's stone" + } + } + } + }, + { + "standard": { + "query": { + "query_string": { + "query": "chamber of secrets" + } + } + } + } + ] + } + } + } + } +} +---- + +<1> The `rule` retriever is the outermost retriever, applying rules to the search results that were previously reranked using the `rrf` retriever. +<2> The `rrf` retriever returns results from all of its sub-retrievers, and the output of the `rrf` retriever is used as input to the `rule` retriever. + ==== Using `from` and `size` with a retriever tree The <> and <> parameters are provided globally as part of the general -<>. They are applied to all retrievers in a -retriever tree unless a specific retriever overrides the `size` parameter -using a different parameter such as `rank_window_size`. Though, the final -search hits are always limited to `size`. +<>. +They are applied to all retrievers in a retriever tree, unless a specific retriever overrides the `size` parameter using a different parameter such as `rank_window_size`. +However, the final search hits are always limited to `size`. ==== Using aggregations with a retriever tree @@ -534,8 +696,8 @@ clauses in a <>. ==== Restrictions on search parameters when specifying a retriever -When a retriever is specified as part of a search the following elements are not allowed -at the top-level and instead are only allowed as elements of specific retrievers: +When a retriever is specified as part of a search, the following elements are not allowed at the top-level. 
+Instead they are only allowed as elements of specific retrievers: * <> * <> @@ -543,3 +705,4 @@ at the top-level and instead are only allowed as elements of specific retrievers * <> * <> * <> + diff --git a/docs/reference/search/search-your-data/retrievers-overview.asciidoc b/docs/reference/search/search-your-data/retrievers-overview.asciidoc index 9df4026fc644..8e5955fc4178 100644 --- a/docs/reference/search/search-your-data/retrievers-overview.asciidoc +++ b/docs/reference/search/search-your-data/retrievers-overview.asciidoc @@ -16,22 +16,21 @@ For implementation details, including notable restrictions, check out the Retrievers come in various types, each tailored for different search operations. The following retrievers are currently available: -* <>. Returns top documents from a -traditional https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl.html[query]. -Mimics a traditional query but in the context of a retriever framework. This -ensures backward compatibility as existing `_search` requests remain supported. -That way you can transition to the new abstraction at your own pace without -mixing syntaxes. -* <>. Returns top documents from a <>, -in the context of a retriever framework. -* <>. Combines and ranks multiple first-stage retrievers using -the reciprocal rank fusion (RRF) algorithm. Allows you to combine multiple result sets -with different relevance indicators into a single result set. -An RRF retriever is a *compound retriever*, where its `filter` element is -propagated to its sub retrievers. -+ - -* <>. Used for <>. +* <>. +Returns top documents from a traditional https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl.html[query]. +Mimics a traditional query but in the context of a retriever framework. +This ensures backward compatibility as existing `_search` requests remain supported. +That way you can transition to the new abstraction at your own pace without mixing syntaxes. +* <>. 
+Returns top documents from a <>, in the context of a retriever framework. +* <>. +Combines and ranks multiple first-stage retrievers using the reciprocal rank fusion (RRF) algorithm. +Allows you to combine multiple result sets with different relevance indicators into a single result set. +An RRF retriever is a *compound retriever*, where its `filter` element is propagated to its sub retrievers. +* <>. +Applies <> to the query before returning results. +* <>. +Used for <>. Requires first creating a `rerank` task using the <>. [discrete] @@ -69,8 +68,11 @@ When using compound retrievers, only the query element is allowed, which enforce [[retrievers-overview-example]] ==== Example -The following example demonstrates the powerful queries that we can now compose, and how retrievers simplify this process. We can use any combination of retrievers we want, propagating the -results of a nested retriever to its parent. In this scenario, we'll make use of all 4 (currently) available retrievers, i.e. `standard`, `knn`, `text_similarity_reranker` and `rrf`. +The following example demonstrates the powerful queries that we can now compose, and how retrievers simplify this process. +We can use any combination of retrievers we want, propagating the results of a nested retriever to its parent. +In this scenario, we'll make use of 4 of our currently available retrievers, i.e. `standard`, `knn`, `text_similarity_reranker` and `rrf`. +See <> for the complete list of available retrievers. + We'll first combine the results of a `semantic` query using the `standard` retriever, and that of a `knn` search on a dense vector field, using `rrf` to get the top 100 results. Finally, we'll then rerank the top-50 results of `rrf` using the `text_similarity_reranker` @@ -126,15 +128,18 @@ GET example-index/_search Here are some important terms: -* *Retrieval Pipeline*. Defines the entire retrieval and ranking logic to -produce top hits. -* *Retriever Tree*. 
A hierarchical structure that defines how retrievers interact. -* *First-stage Retriever*. Returns an initial set of candidate documents. -* *Compound Retriever*. Builds on one or more retrievers, -enhancing document retrieval and ranking logic. -* *Combiners*. Compound retrievers that merge top hits -from multiple sub-retrievers. -* *Rerankers*. Special compound retrievers that reorder hits and may adjust the number of hits, with distinctions between first-stage and second-stage rerankers. +* *Retrieval Pipeline*. +Defines the entire retrieval and ranking logic to produce top hits. +* *Retriever Tree*. +A hierarchical structure that defines how retrievers interact. +* *First-stage Retriever*. +Returns an initial set of candidate documents. +* *Compound Retriever*. +Builds on one or more retrievers, enhancing document retrieval and ranking logic. +* *Combiners*. +Compound retrievers that merge top hits from multiple sub-retrievers. +* *Rerankers*. +Special compound retrievers that reorder hits and may adjust the number of hits, with distinctions between first-stage and second-stage rerankers. [discrete] [[retrievers-overview-play-in-search]] diff --git a/docs/reference/search/search-your-data/search-using-query-rules.asciidoc b/docs/reference/search/search-your-data/search-using-query-rules.asciidoc index 18be825d0237..7d9d14684bee 100644 --- a/docs/reference/search/search-your-data/search-using-query-rules.asciidoc +++ b/docs/reference/search/search-your-data/search-using-query-rules.asciidoc @@ -10,7 +10,7 @@ _Query rules_ allow customization of search results for queries that match speci This allows for more control over results, for example ensuring that promoted documents that match defined criteria are returned at the top of the result list. Metadata is defined in the query rule, and is matched against the query criteria. Query rules use metadata to match a query. 
-Metadata is provided as part of the <> as an object and can be anything that helps differentiate the query, for example: +Metadata is provided as part of the search request as an object and can be anything that helps differentiate the query, for example: * A user-entered query string * Personalized metadata about users (e.g. country, language, etc) @@ -18,13 +18,13 @@ Metadata is provided as part of the <> as an o * A referring site * etc. -Query rules define a metadata key that will be used to match the metadata provided in the <> with the criteria specified in the rule. +Query rules define a metadata key that will be used to match the metadata provided in the <> with the criteria specified in the rule. -When a query rule matches the <> metadata according to its defined criteria, the query rule action is applied to the underlying `organic` query. +When a query rule matches the rule metadata according to its defined criteria, the query rule action is applied to the underlying `organic` query. For example, a query rule could be defined to match a user-entered query string of `pugs` and a country `us` and promote adoptable shelter dogs if the rule query met both criteria. -Rules are defined using the <> and searched using the <>. +Rules are defined using the <> and searched using the <> or the <>. [discrete] [[query-rule-definition]] @@ -189,9 +189,11 @@ You can use the <> call to retrieve the ruleset you just crea [discrete] [[rule-query-search]] -==== Perform a rule query +==== Search using query rules + +Once you have defined one or more query rulesets, you can search using these rulesets using the <> or the <>. +Retrievers are the recommended way to use rule queries, as they will work out of the box with other reranking retrievers such as <>. -Once you have defined one or more query rulesets, you can search these rulesets using the <> query. Rulesets are evaluated in order, so rules in the first ruleset you specify will be applied before any subsequent rulesets. 
An example query for the `my-ruleset` defined above is: @@ -200,18 +202,22 @@ An example query for the `my-ruleset` defined above is: ---- GET /my-index-000001/_search { - "query": { + "retriever": { "rule": { - "organic": { - "query_string": { - "query": "puggles" + "retriever": { + "standard": { + "query": { + "query_string": { + "query": "puggles" + } + } } }, "match_criteria": { "query_string": "puggles", "user_country": "us" }, - "ruleset_ids": ["my-ruleset"] + "ruleset_ids": [ "my-ruleset" ] } } } @@ -227,3 +233,51 @@ In this case, the rules are applied in the following order: - Where the matching rule appears in the ruleset - If multiple documents are specified in a single rule, in the order they are specified - If a document is matched by both a `pinned` rule and an `exclude` rule, the `exclude` rule will take precedence + +You can specify reranking retrievers such as <> or <> in the rule query to apply query rules on already-reranked results. +Here is an example: + +[source,console] +---- +GET my-index-000001/_search +{ + "retriever": { + "rule": { + "match_criteria": { + "query_string": "puggles", + "user_country": "us" + }, + "ruleset_ids": [ + "my-ruleset" + ], + "retriever": { + "rrf": { + "retrievers": [ + { + "standard": { + "query": { + "query_string": { + "query": "pugs" + } + } + } + }, + { + "standard": { + "query": { + "query_string": { + "query": "puggles" + } + } + } + } + ] + } + } + } + } +} +---- +// TEST[continued] + +This will apply pinned and excluded query rules on top of the content that was reranked by RRF.