From 7b39d3db526bc48bfd85c6bcf09247806e16a64d Mon Sep 17 00:00:00 2001 From: Liam Thompson <32779855+leemthompo@users.noreply.github.com> Date: Mon, 4 Nov 2024 13:28:12 +0100 Subject: [PATCH] Term Stats documentation (#115933) (#116167) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Term Stats documentation * Update docs/reference/reranking/learning-to-rank-model-training.asciidoc Co-authored-by: István Zoltán Szabó * Fix query example. --------- Co-authored-by: István Zoltán Szabó (cherry picked from commit 0416812456af0a763a5f43f9ab6813221ea6e4d8) Co-authored-by: Aurélien FOUCRET --- .../query-dsl/script-score-query.asciidoc | 13 +++- .../learning-to-rank-model-training.asciidoc | 37 +++++++--- .../learning-to-rank-search-usage.asciidoc | 7 -- docs/reference/scripting/fields.asciidoc | 73 +++++++++++++++++++ 4 files changed, 108 insertions(+), 22 deletions(-) diff --git a/docs/reference/query-dsl/script-score-query.asciidoc b/docs/reference/query-dsl/script-score-query.asciidoc index 9291b8c15f0d..051c9c6f9c32 100644 --- a/docs/reference/query-dsl/script-score-query.asciidoc +++ b/docs/reference/query-dsl/script-score-query.asciidoc @@ -62,10 +62,17 @@ multiplied by `boost` to produce final documents' scores. Defaults to `1.0`. ===== Use relevance scores in a script Within a script, you can -{ref}/modules-scripting-fields.html#scripting-score[access] +{ref}/modules-scripting-fields.html#scripting-score[access] the `_score` variable which represents the current relevance score of a document. +[[script-score-access-term-statistics]] +===== Use term statistics in a script + +Within a script, you can +{ref}/modules-scripting-fields.html#scripting-term-statistics[access] +the `_termStats` variable which provides statistical information about the terms used in the child query of the `script_score` query. 
+ [[script-score-predefined-functions]] ===== Predefined functions You can use any of the available {painless}/painless-contexts.html[painless @@ -147,7 +154,7 @@ updated since update operations also update the value of the `_seq_no` field. [[decay-functions-numeric-fields]] ====== Decay functions for numeric fields -You can read more about decay functions +You can read more about decay functions {ref}/query-dsl-function-score-query.html#function-decay[here]. * `double decayNumericLinear(double origin, double scale, double offset, double decay, double docValue)` @@ -233,7 +240,7 @@ The `script_score` query calculates the score for every matching document, or hit. There are faster alternative query types that can efficiently skip non-competitive hits: -* If you want to boost documents on some static fields, use the +* If you want to boost documents on some static fields, use the <> query. * If you want to boost documents closer to a date or geographic point, use the <> query. diff --git a/docs/reference/reranking/learning-to-rank-model-training.asciidoc b/docs/reference/reranking/learning-to-rank-model-training.asciidoc index 0f4640ebdf34..8e0b3f9ae94c 100644 --- a/docs/reference/reranking/learning-to-rank-model-training.asciidoc +++ b/docs/reference/reranking/learning-to-rank-model-training.asciidoc @@ -38,11 +38,21 @@ Feature extractors are defined using templated queries. 
https://eland.readthedoc from eland.ml.ltr import QueryFeatureExtractor feature_extractors=[ - # We want to use the score of the match query for the title field as a feature: + # We want to use the BM25 score of the match query for the title field as a feature: QueryFeatureExtractor( feature_name="title_bm25", query={"match": {"title": "{{query}}"}} ), + # We want to use the number of matched terms in the title field as a feature: + QueryFeatureExtractor( + feature_name="title_matched_term_count", + query={ + "script_score": { + "query": {"match": {"title": "{{query}}"}}, + "script": {"source": "return _termStats.matchedTermsCount();"}, + } + }, + ), # We can use a script_score query to get the value # of the field rating directly as a feature: QueryFeatureExtractor( @@ -54,19 +64,13 @@ feature_extractors=[ } }, ), - # We can execute a script on the value of the query - # and use the return value as a feature: - QueryFeatureExtractor( - feature_name="query_length", + # We extract the number of unique terms in the query as a feature. + QueryFeatureExtractor( + feature_name="query_term_count", query={ "script_score": { - "query": {"match_all": {}}, - "script": { - "source": "return params['query'].splitOnToken(' ').length;", - "params": { - "query": "{{query}}", - } - }, + "query": {"match": {"title": "{{query}}"}}, + "script": {"source": "return _termStats.uniqueTermsCount();"}, } }, ), @@ -74,6 +78,15 @@ feature_extractors=[ ---- // NOTCONSOLE +[NOTE] +.Term statistics as features +=================================================== + +It is very common for an LTR model to leverage raw term statistics as features. +To extract this information, you can use the {ref}/modules-scripting-fields.html#scripting-term-statistics[term statistics feature] provided as part of the <> query. 
+ +=================================================== + Once the feature extractors have been defined, they are wrapped in an `eland.ml.ltr.LTRModelConfig` object for use in later training steps: [source,python] diff --git a/docs/reference/reranking/learning-to-rank-search-usage.asciidoc b/docs/reference/reranking/learning-to-rank-search-usage.asciidoc index f14219e24bc1..afb623dc2b1c 100644 --- a/docs/reference/reranking/learning-to-rank-search-usage.asciidoc +++ b/docs/reference/reranking/learning-to-rank-search-usage.asciidoc @@ -61,10 +61,3 @@ When exposing pagination to users, `window_size` should remain constant as each ====== Negative scores Depending on how your model is trained, it’s possible that the model will return negative scores for documents. While negative scores are not allowed from first-stage retrieval and ranking, it is possible to use them in the LTR rescorer. - -[discrete] -[[learning-to-rank-rescorer-limitations-term-statistics]] -====== Term statistics as features - -We do not currently support term statistics as features, however future releases will introduce this capability. - diff --git a/docs/reference/scripting/fields.asciidoc b/docs/reference/scripting/fields.asciidoc index c2a40d4519f9..8a9bb3c71278 100644 --- a/docs/reference/scripting/fields.asciidoc +++ b/docs/reference/scripting/fields.asciidoc @@ -80,6 +80,79 @@ GET my-index-000001/_search } ------------------------------------- +[discrete] +[[scripting-term-statistics]] +=== Accessing term statistics of a document within a script + +Scripts used in a <> query have access to the `_termStats` variable which provides statistical information about the terms in the child query. 
+ +In the following example, `_termStats` is used within a <> query to retrieve the average term frequency for the terms `quick`, `brown`, and `fox` in the `text` field: + +[source,console] +------------------------------------- +PUT my-index-000001/_doc/1?refresh +{ + "text": "quick brown fox" +} + +PUT my-index-000001/_doc/2?refresh +{ + "text": "quick fox" +} + +GET my-index-000001/_search +{ + "query": { + "script_score": { + "query": { <1> + "match": { + "text": "quick brown fox" + } + }, + "script": { + "source": "_termStats.termFreq().getAverage()" <2> + } + } + } +} +------------------------------------- + +<1> Child query used to infer the field and the terms considered in term statistics. + +<2> The script calculates the average term frequency in the current document for the terms in the query using `_termStats`. + +`_termStats` provides access to the following functions for working with term statistics: + +- `uniqueTermsCount`: Returns the total number of unique terms in the query. This value is the same across all documents. +- `matchedTermsCount`: Returns the count of query terms that matched within the current document. +- `docFreq`: Provides document frequency statistics for the terms in the query, indicating how many documents contain each term. This value is consistent across all documents. +- `totalTermFreq`: Provides the total frequency of terms across all documents, representing how often each term appears in the entire corpus. This value is consistent across all documents. +- `termFreq`: Returns the frequency of query terms within the current document, showing how often each term appears in that document. + +[NOTE] +.Functions returning aggregated statistics +=================================================== + +The `docFreq`, `termFreq` and `totalTermFreq` functions return objects that represent statistics across all terms of the child query. + +These statistics objects support the following methods: + +`getAverage()`: Returns the average value of the metric. 
+`getMin()`: Returns the minimum value of the metric. +`getMax()`: Returns the maximum value of the metric. +`getSum()`: Returns the sum of the metric values. +`getCount()`: Returns the count of terms included in the metric calculation. + +=================================================== + + +[NOTE] +.Painless language required +=================================================== + +The `_termStats` variable is only available when using the <> scripting language. + +=================================================== [discrete] [[modules-scripting-doc-vals]]