Term Stats documentation (#115933)

* Term Stats documentation
* Update docs/reference/reranking/learning-to-rank-model-training.asciidoc

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Fix query example.

---------

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
2025-04-25 07:37:19 -04:00 · 2024-10-30 15:31:26 +01:00 · 2024-10-30 15:31:26 +01:00 · 0416812456
commit 0416812456
parent c6f7827105
4 changed files with 108 additions and 22 deletions
--- a/docs/reference/query-dsl/script-score-query.asciidoc
+++ b/docs/reference/query-dsl/script-score-query.asciidoc
@@ -66,6 +66,13 @@ Within a script, you can
 the `_score` variable which represents the current relevance score of a
 document.
 [[script-score-access-term-statistics]]
 ===== Use term statistics in a script
 Within a script, you can
 {ref}/modules-scripting-fields.html#scripting-term-statistics[access]
 the `_termStats` variable which provides statistical information about the terms used in the child query of the `script_score` query.
 [[script-score-predefined-functions]]
 ===== Predefined functions
 You can use any of the available {painless}/painless-contexts.html[painless
--- a/docs/reference/reranking/learning-to-rank-model-training.asciidoc
+++ b/docs/reference/reranking/learning-to-rank-model-training.asciidoc
@@ -38,11 +38,21 @@ Feature extractors are defined using templated queries. https://eland.readthedoc
 from eland.ml.ltr import QueryFeatureExtractor
 feature_extractors=[
-    # We want to use the score of the match query for the title field as a feature:
+    # We want to use the BM25 score of the match query for the title field as a feature:
    QueryFeatureExtractor(
        feature_name="title_bm25",
        query={"match": {"title": "{{query}}"}}
    ),
    # We want to use the number of matched terms in the title field as a feature:
    QueryFeatureExtractor(
        feature_name="title_matched_term_count",
        query={
            "script_score": {
                "query": {"match": {"title": "{{query}}"}},
                "script": {"source": "return _termStats.matchedTermsCount();"},
            }
        },
    ),
    # We can use a script_score query to get the value
    # of the field rating directly as a feature:
    QueryFeatureExtractor(
@@ -54,19 +64,13 @@ feature_extractors=[
            }
        },
    ),
-    # We can execute a script on the value of the query
-    # and use the return value as a feature:
+    # We extract the number of terms in the query as a feature.
   QueryFeatureExtractor(
-        feature_name="query_length",
+        feature_name="query_term_count",
        query={
            "script_score": {
-                "query": {"match_all": {}},
+                "query": {"match": {"title": "{{query}}"}},
-                "script": {
-                    "source": "return params['query'].splitOnToken(' ').length;",
-                    "params": {
-                        "query": "{{query}}",
-                    }
-                },
+                "script": {"source": "return _termStats.uniqueTermsCount();"},
            }
        },
    ),
@@ -74,6 +78,15 @@ feature_extractors=[
 ----
 // NOTCONSOLE
 [NOTE]
 .Term statistics as features
 ===================================================
 LTR models commonly use raw term statistics as features.
 To extract this information, you can use the {ref}/modules-scripting-fields.html#scripting-term-statistics[term statistics feature] provided as part of the <<query-dsl-script-score-query,`script_score`>> query.
 ===================================================
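
As a rough sketch of how several term statistics could be turned into features, the hypothetical helper below assembles plain `script_score` query bodies as Python dictionaries. Each dictionary could then be passed as the `query` argument of a `QueryFeatureExtractor`. The helper name `term_stats_query`, the feature names, and the choice of statistics are illustrative assumptions and are not part of Eland or Elasticsearch.

```python
import json


def term_stats_query(field: str, expression: str) -> dict:
    """Build a script_score query body scored by a _termStats expression.

    Hypothetical helper for illustration: it only assembles a dictionary
    and does not contact Elasticsearch.
    """
    return {
        "script_score": {
            # The child query determines the field and terms covered by _termStats.
            "query": {"match": {field: "{{query}}"}},
            "script": {"source": f"return _termStats.{expression};"},
        }
    }


# Illustrative feature names mapped to _termStats expressions.
TERM_STATS_FEATURES = {
    "title_matched_term_count": "matchedTermsCount()",
    "title_avg_term_freq": "termFreq().getAverage()",
    "title_min_doc_freq": "docFreq().getMin()",
    "title_sum_total_term_freq": "totalTermFreq().getSum()",
}

feature_queries = {
    name: term_stats_query("title", expression)
    for name, expression in TERM_STATS_FEATURES.items()
}

# Inspect one of the generated query templates.
print(json.dumps(feature_queries["title_avg_term_freq"], indent=2))
```

Keeping the expressions in one mapping makes it easy to add or drop a statistic without repeating the surrounding query structure.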
 Once the feature extractors have been defined, they are wrapped in an `eland.ml.ltr.LTRModelConfig` object for use in later training steps:
 [source,python]
--- a/docs/reference/reranking/learning-to-rank-search-usage.asciidoc
+++ b/docs/reference/reranking/learning-to-rank-search-usage.asciidoc
@@ -61,10 +61,3 @@ When exposing pagination to users, `window_size` should remain constant as each
 ====== Negative scores
 Depending on how your model is trained, it’s possible that the model will return negative scores for documents. While negative scores are not allowed from first-stage retrieval and ranking, it is possible to use them in the LTR rescorer.
-[discrete]
-[[learning-to-rank-rescorer-limitations-term-statistics]]
-====== Term statistics as features
-We do not currently support term statistics as features, however future releases will introduce this capability.
--- a/docs/reference/scripting/fields.asciidoc
+++ b/docs/reference/scripting/fields.asciidoc
@@ -80,6 +80,79 @@ GET my-index-000001/_search
 }
 -------------------------------------
 [discrete]
 [[scripting-term-statistics]]
 === Accessing term statistics of a document within a script
 Scripts used in a <<query-dsl-script-score-query,`script_score`>> query have access to the `_termStats` variable, which provides statistical information about the terms in the child query.
 In the following example, `_termStats` is used within a <<query-dsl-script-score-query,`script_score`>> query to retrieve the average term frequency for the terms `quick`, `brown`, and `fox` in the `text` field:
 [source,console]
 -------------------------------------
 PUT my-index-000001/_doc/1?refresh
 {
  "text": "quick brown fox"
 }
 PUT my-index-000001/_doc/2?refresh
 {
  "text": "quick fox"
 }
 GET my-index-000001/_search
 {
  "query": {
    "script_score": {
      "query": { <1>
        "match": {
          "text": "quick brown fox"
        }
      },
      "script": {
        "source": "_termStats.termFreq().getAverage()" <2>
      }
    }
  }
 }
 -------------------------------------
 <1> Child query used to infer the field and the terms considered in term statistics.
 <2> The script calculates the average term frequency for the terms in the query using `_termStats`.
 `_termStats` provides access to the following functions for working with term statistics:
 - `uniqueTermsCount`: Returns the total number of unique terms in the query. This value is the same across all documents.
 - `matchedTermsCount`: Returns the count of query terms that matched within the current document.
 - `docFreq`: Provides document frequency statistics for the terms in the query, indicating how many documents contain each term. This value is consistent across all documents.
 - `totalTermFreq`: Provides the total frequency of terms across all documents, representing how often each term appears in the entire corpus. This value is consistent across all documents.
 - `termFreq`: Returns the frequency of query terms within the current document, showing how often each term appears in that document.
 [NOTE]
 .Functions returning aggregated statistics
 ===================================================
 The `docFreq`, `termFreq` and `totalTermFreq` functions return objects that represent statistics across all terms of the child query.
 These statistics objects support the following methods:
 - `getAverage()`: Returns the average value of the metric.
 - `getMin()`: Returns the minimum value of the metric.
 - `getMax()`: Returns the maximum value of the metric.
 - `getSum()`: Returns the sum of the metric values.
 - `getCount()`: Returns the count of terms included in the metric calculation.
 ===================================================
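
To make the definitions above concrete, the following pure-Python sketch recomputes the same kinds of statistics for the two example documents. It is a toy model, not the Elasticsearch implementation: it assumes whitespace tokenization, a single shard with no deleted documents, and that a query term missing from a document counts as a term frequency of 0. All class and function names here are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StatsSummary:
    """Toy stand-in for the statistics object; holds one value per query term."""

    values: List[int]

    def get_count(self) -> int:
        return len(self.values)

    def get_sum(self) -> int:
        return sum(self.values)

    def get_min(self) -> int:
        return min(self.values)

    def get_max(self) -> int:
        return max(self.values)

    def get_average(self) -> float:
        return sum(self.values) / len(self.values)


# The two example documents, split on whitespace (an assumption).
DOCS: Dict[str, List[str]] = {
    "1": "quick brown fox".split(),
    "2": "quick fox".split(),
}
# Unique terms of the child query, in first-seen order.
QUERY_TERMS = list(dict.fromkeys("quick brown fox".split()))


def unique_terms_count() -> int:
    return len(QUERY_TERMS)


def matched_terms_count(doc_id: str) -> int:
    return sum(1 for term in QUERY_TERMS if term in DOCS[doc_id])


def doc_freq() -> StatsSummary:
    # Number of documents that contain each query term.
    return StatsSummary(
        [sum(term in tokens for tokens in DOCS.values()) for term in QUERY_TERMS]
    )


def total_term_freq() -> StatsSummary:
    # Occurrences of each query term across the whole toy corpus.
    return StatsSummary(
        [sum(tokens.count(term) for tokens in DOCS.values()) for term in QUERY_TERMS]
    )


def term_freq(doc_id: str) -> StatsSummary:
    # Assumption: a query term absent from the document contributes 0.
    counts = Counter(DOCS[doc_id])
    return StatsSummary([counts[term] for term in QUERY_TERMS])


for doc_id in DOCS:
    print(doc_id, matched_terms_count(doc_id), term_freq(doc_id).get_average())
print(unique_terms_count(), doc_freq().get_sum(), total_term_freq().get_max())
```

The corpus-level functions take no document argument, which mirrors the statement that `uniqueTermsCount`, `docFreq` and `totalTermFreq` are the same for every document, while `matchedTermsCount` and `termFreq` depend on the current document.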
 [NOTE]
 .Painless language required
 ===================================================
 The `_termStats` variable is only available when using the <<modules-scripting-painless, Painless>> scripting language.
 ===================================================
 [discrete]
 [[modules-scripting-doc-vals]]