Term Stats documentation (#115933) (#116167)

* Term Stats documentation

* Update docs/reference/reranking/learning-to-rank-model-training.asciidoc

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Fix query example.

---------

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
(cherry picked from commit 0416812456)

Co-authored-by: Aurélien FOUCRET <aurelien.foucret@gmail.com>
Liam Thompson 2024-11-04 13:28:12 +01:00 committed by GitHub
parent 6cd1f8cbcd
commit 7b39d3db52
4 changed files with 108 additions and 22 deletions

@@ -80,6 +80,79 @@ GET my-index-000001/_search
}
-------------------------------------
[discrete]
[[scripting-term-statistics]]
=== Accessing term statistics of a document within a script
Scripts used in a <<query-dsl-script-score-query,`script_score`>> query have access to the `_termStats` variable, which provides statistical information about the terms in the child query.
In the following example, `_termStats` is used within a <<query-dsl-script-score-query,`script_score`>> query to retrieve the average term frequency for the terms `quick`, `brown`, and `fox` in the `text` field:
[source,console]
-------------------------------------
PUT my-index-000001/_doc/1?refresh
{
  "text": "quick brown fox"
}

PUT my-index-000001/_doc/2?refresh
{
  "text": "quick fox"
}

GET my-index-000001/_search
{
  "query": {
    "script_score": {
      "query": { <1>
        "match": {
          "text": "quick brown fox"
        }
      },
      "script": {
        "source": "_termStats.termFreq().getAverage()" <2>
      }
    }
  }
}
-------------------------------------
<1> Child query used to infer the field and the terms considered in term statistics.
<2> The script uses `_termStats` to calculate the average term frequency of the query terms in the current document.
`_termStats` provides access to the following functions for working with term statistics:
- `uniqueTermsCount`: Returns the total number of unique terms in the query. This value is the same across all documents.
- `matchedTermsCount`: Returns the count of query terms that matched within the current document.
- `docFreq`: Provides document frequency statistics for the terms in the query, indicating how many documents contain each term. This value is consistent across all documents.
- `totalTermFreq`: Provides the total frequency of terms across all documents, representing how often each term appears in the entire corpus. This value is consistent across all documents.
- `termFreq`: Returns the frequency of query terms within the current document, showing how often each term appears in that document.
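To make the definitions above concrete, the following is a minimal Python model of the five functions, applied to the two documents and the query from the example. It is not Elasticsearch code: the helper names are hypothetical, plain whitespace splitting stands in for the field's analyzer, and counting a query term that is absent from a document as `0` is an assumption of this sketch.

```python
from collections import Counter

# Toy corpus and query terms taken from the example above.
docs = {
    1: "quick brown fox",
    2: "quick fox",
}
query_terms = ["quick", "brown", "fox"]

# Assumption: whitespace tokenization stands in for the field's analyzer.
tokens = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}


def unique_terms_count():
    # Number of distinct terms in the query; identical for every document.
    return len(set(query_terms))


def matched_terms_count(doc_id):
    # Query terms that occur at least once in the given document.
    return sum(1 for term in set(query_terms) if tokens[doc_id][term] > 0)


def doc_freq():
    # Per query term: how many documents contain it.
    return [sum(1 for c in tokens.values() if c[term] > 0) for term in query_terms]


def total_term_freq():
    # Per query term: occurrences summed over the whole corpus.
    return [sum(c[term] for c in tokens.values()) for term in query_terms]


def term_freq(doc_id):
    # Per query term: occurrences in the given document.
    # Assumption: a term absent from the document contributes 0.
    return [tokens[doc_id][term] for term in query_terms]
```

In this model, document 2 matches two of the three query terms, and its per-term frequencies are `[1, 0, 1]`, while the corpus-wide values from `doc_freq()` and `total_term_freq()` do not depend on the document being scored.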
[NOTE]
.Functions returning aggregated statistics
===================================================
The `docFreq`, `termFreq` and `totalTermFreq` functions return objects that represent statistics across all terms of the child query.
These statistics objects support the following methods:
- `getAverage()`: Returns the average value of the metric.
- `getMin()`: Returns the minimum value of the metric.
- `getMax()`: Returns the maximum value of the metric.
- `getSum()`: Returns the sum of the metric values.
- `getCount()`: Returns the count of terms included in the metric calculation.
===================================================
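As a rough illustration of how these aggregate methods relate to the per-term values, here is a small Python sketch. The class `TermStatsSummary` and its method names are hypothetical stand-ins that mirror the methods listed above. They are not the Elasticsearch implementation, and returning `0` for an empty set of values is an assumption of this sketch.

```python
class TermStatsSummary:
    """Hypothetical stand-in for the object returned by docFreq(), termFreq(), or totalTermFreq()."""

    def __init__(self, values):
        # One numeric value per term of the child query.
        self.values = list(values)

    def get_count(self):
        # Number of terms included in the calculation.
        return len(self.values)

    def get_sum(self):
        return sum(self.values)

    def get_min(self):
        # Assumption: an empty summary reports 0.
        return min(self.values) if self.values else 0

    def get_max(self):
        # Assumption: an empty summary reports 0.
        return max(self.values) if self.values else 0

    def get_average(self):
        # Assumption: an empty summary reports 0.0.
        return self.get_sum() / self.get_count() if self.values else 0.0


# Document frequencies of "quick", "brown", and "fox" in the two-document
# example: "quick" and "fox" appear in both documents, "brown" in one.
doc_freq_summary = TermStatsSummary([2, 1, 2])
```

For these values, the summary covers three terms with a sum of 5, a minimum of 1, and a maximum of 2, so the average is 5/3.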
[NOTE]
.Painless language required
===================================================
The `_termStats` variable is only available when using the <<modules-scripting-painless, Painless>> scripting language.
===================================================
[discrete]
[[modules-scripting-doc-vals]]