Term Stats documentation (#115933)

* Term Stats documentation

* Update docs/reference/reranking/learning-to-rank-model-training.asciidoc

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Fix query example.

---------

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
Aurélien FOUCRET 2024-10-30 15:31:26 +01:00 committed by GitHub
parent c6f7827105
commit 0416812456
4 changed files with 108 additions and 22 deletions


@ -66,6 +66,13 @@ Within a script, you can
the `_score` variable which represents the current relevance score of a the `_score` variable which represents the current relevance score of a
document. document.
[[script-score-access-term-statistics]]
===== Use term statistics in a script
Within a script, you can
{ref}/modules-scripting-fields.html#scripting-term-statistics[access]
the `_termStats` variable which provides statistical information about the terms used in the child query of the `script_score` query.
[[script-score-predefined-functions]] [[script-score-predefined-functions]]
===== Predefined functions ===== Predefined functions
You can use any of the available {painless}/painless-contexts.html[painless You can use any of the available {painless}/painless-contexts.html[painless


@ -38,11 +38,21 @@ Feature extractors are defined using templated queries. https://eland.readthedoc
from eland.ml.ltr import QueryFeatureExtractor from eland.ml.ltr import QueryFeatureExtractor
feature_extractors=[ feature_extractors=[
# We want to use the score of the match query for the title field as a feature: # We want to use the BM25 score of the match query for the title field as a feature:
QueryFeatureExtractor( QueryFeatureExtractor(
feature_name="title_bm25", feature_name="title_bm25",
query={"match": {"title": "{{query}}"}} query={"match": {"title": "{{query}}"}}
), ),
# We want to use the number of matched terms in the title field as a feature:
QueryFeatureExtractor(
feature_name="title_matched_term_count",
query={
"script_score": {
"query": {"match": {"title": "{{query}}"}},
"script": {"source": "return _termStats.matchedTermsCount();"},
}
},
),
# We can use a script_score query to get the value # We can use a script_score query to get the value
# of the field rating directly as a feature: # of the field rating directly as a feature:
QueryFeatureExtractor( QueryFeatureExtractor(
@ -54,19 +64,13 @@ feature_extractors=[
} }
}, },
), ),
# We can execute a script on the value of the query # We extract the number of unique terms in the query as a feature.
# and use the return value as a feature:
QueryFeatureExtractor( QueryFeatureExtractor(
feature_name="query_length", feature_name="query_term_count",
query={ query={
"script_score": { "script_score": {
"query": {"match_all": {}}, "query": {"match": {"title": "{{query}}"}},
"script": { "script": {"source": "return _termStats.uniqueTermsCount();"},
"source": "return params['query'].splitOnToken(' ').length;",
"params": {
"query": "{{query}}",
}
},
} }
}, },
), ),
@ -74,6 +78,15 @@ feature_extractors=[
---- ----
// NOTCONSOLE // NOTCONSOLE
[NOTE]
.Term statistics as features
===================================================
It is very common for an LTR model to leverage raw term statistics as features.
To extract this information, you can use the {ref}/modules-scripting-fields.html#scripting-term-statistics[term statistics feature] provided as part of the <<query-dsl-script-score-query,`script_score`>> query.
===================================================
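As a rough sketch of how several term statistics could be turned into feature queries at once, the snippet below builds `script_score` query bodies from a Painless expression. The helper name `term_stats_query`, the feature names, and the choice of statistics are hypothetical and only for illustration; they are not part of eland or {es}.

```python
import json


def term_stats_query(field: str, expression: str) -> dict:
    """Build a script_score query body whose score is a `_termStats` expression.

    Hypothetical helper for illustration; not part of eland or Elasticsearch.
    """
    return {
        "script_score": {
            # The child query determines the field and terms covered by the statistics.
            "query": {"match": {field: "{{query}}"}},
            "script": {"source": f"return {expression};"},
        }
    }


# Illustrative mapping of feature name -> Painless expression (names are assumptions).
TITLE_TERM_STATS = {
    "title_matched_term_count": "_termStats.matchedTermsCount()",
    "title_avg_term_freq": "_termStats.termFreq().getAverage()",
    "title_min_doc_freq": "_termStats.docFreq().getMin()",
}

feature_queries = {
    name: term_stats_query("title", expression)
    for name, expression in TITLE_TERM_STATS.items()
}

# Inspect one of the generated query bodies.
print(json.dumps(feature_queries["title_avg_term_freq"], indent=2))
```

Each generated dictionary has the same shape as the `query` argument passed to `QueryFeatureExtractor` in the example above, so it could be supplied there alongside the matching feature name.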
Once the feature extractors have been defined, they are wrapped in an `eland.ml.ltr.LTRModelConfig` object for use in later training steps: Once the feature extractors have been defined, they are wrapped in an `eland.ml.ltr.LTRModelConfig` object for use in later training steps:
[source,python] [source,python]


@ -61,10 +61,3 @@ When exposing pagination to users, `window_size` should remain constant as each
====== Negative scores ====== Negative scores
Depending on how your model is trained, it's possible that the model will return negative scores for documents. While negative scores are not allowed from first-stage retrieval and ranking, it is possible to use them in the LTR rescorer. Depending on how your model is trained, it's possible that the model will return negative scores for documents. While negative scores are not allowed from first-stage retrieval and ranking, it is possible to use them in the LTR rescorer.
[discrete]
[[learning-to-rank-rescorer-limitations-term-statistics]]
====== Term statistics as features
We do not currently support term statistics as features, however future releases will introduce this capability.


@ -80,6 +80,79 @@ GET my-index-000001/_search
} }
------------------------------------- -------------------------------------
[discrete]
[[scripting-term-statistics]]
=== Accessing term statistics of a document within a script
Scripts used in a <<query-dsl-script-score-query,`script_score`>> query have access to the `_termStats` variable which provides statistical information about the terms in the child query.
In the following example, `_termStats` is used within a <<query-dsl-script-score-query,`script_score`>> query to retrieve the average term frequency for the terms `quick`, `brown`, and `fox` in the `text` field:
[source,console]
-------------------------------------
PUT my-index-000001/_doc/1?refresh
{
"text": "quick brown fox"
}
PUT my-index-000001/_doc/2?refresh
{
"text": "quick fox"
}
GET my-index-000001/_search
{
"query": {
"script_score": {
"query": { <1>
"match": {
"text": "quick brown fox"
}
},
"script": {
"source": "_termStats.termFreq().getAverage()" <2>
}
}
}
}
-------------------------------------
<1> Child query used to infer the field and the terms considered in term statistics.
<2> The script calculates the average term frequency for the terms in the query using `_termStats`.
`_termStats` provides access to the following functions for working with term statistics:
- `uniqueTermsCount`: Returns the total number of unique terms in the query. This value is the same across all documents.
- `matchedTermsCount`: Returns the count of query terms that matched within the current document.
- `docFreq`: Provides document frequency statistics for the terms in the query, indicating how many documents contain each term. This value is consistent across all documents.
- `totalTermFreq`: Provides the total frequency of terms across all documents, representing how often each term appears in the entire corpus. This value is consistent across all documents.
- `termFreq`: Returns the frequency of query terms within the current document, showing how often each term appears in that document.
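To make these definitions concrete, here is a toy Python model that computes the same five quantities over the two example documents. It is only an illustration of the definitions, not how {es} computes them: whitespace tokenization, the helper name `toy_term_stats`, and the choice to report a frequency of 0 for query terms missing from the current document are all assumptions.

```python
from collections import Counter

# Toy corpus mirroring the two example documents.
# Whitespace tokenization is an assumption made for this sketch.
DOCS = {"1": "quick brown fox", "2": "quick fox"}


def toy_term_stats(query: str, doc_id: str, docs: dict[str, str] = DOCS) -> dict:
    """Compute per-term statistics for `query` against one document (illustrative only)."""
    tokens = {d: Counter(text.split()) for d, text in docs.items()}
    terms = sorted(set(query.split()))  # unique query terms, in a stable order
    current = tokens[doc_id]
    return {
        # Number of unique terms in the query; identical for every document.
        "uniqueTermsCount": len(terms),
        # Number of query terms present in the current document.
        "matchedTermsCount": sum(1 for t in terms if current[t] > 0),
        # For each term, how many documents contain it.
        "docFreq": [sum(1 for c in tokens.values() if c[t] > 0) for t in terms],
        # For each term, how often it occurs across the whole corpus.
        "totalTermFreq": [sum(c[t] for c in tokens.values()) for t in terms],
        # For each term, how often it occurs in the current document
        # (0 for missing terms is an assumption of this sketch).
        "termFreq": [current[t] for t in terms],
    }


for doc_id in DOCS:
    print(doc_id, toy_term_stats("quick brown fox", doc_id))
```

In this toy model, `docFreq` and `totalTermFreq` are the same for both documents, while `matchedTermsCount` and `termFreq` change with the document being scored.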
[NOTE]
.Functions returning aggregated statistics
===================================================
The `docFreq`, `termFreq` and `totalTermFreq` functions return objects that represent statistics across all terms of the child query.
The returned statistics object supports the following methods:

- `getAverage()`: Returns the average value of the metric.
- `getMin()`: Returns the minimum value of the metric.
- `getMax()`: Returns the maximum value of the metric.
- `getSum()`: Returns the sum of the metric values.
- `getCount()`: Returns the count of terms included in the metric calculation.
===================================================
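The aggregation methods can be pictured with a small Python stand-in that summarizes one value per query term. The class name `ToyStatsSummary`, the snake_case method names, and the behavior on an empty input are assumptions of this sketch; it is not the Painless object itself.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStatsSummary:
    """Illustrative stand-in for the object returned by docFreq(), termFreq(), totalTermFreq()."""

    values: list[float] = field(default_factory=list)  # one value per query term

    def get_count(self) -> int:
        return len(self.values)

    def get_sum(self) -> float:
        return sum(self.values)

    def get_min(self) -> float:
        # Returning 0 for an empty summary is an assumption of this sketch.
        return min(self.values) if self.values else 0.0

    def get_max(self) -> float:
        return max(self.values) if self.values else 0.0

    def get_average(self) -> float:
        return self.get_sum() / self.get_count() if self.values else 0.0


# Example: hypothetical per-term document frequencies for a three-term query.
doc_freq = ToyStatsSummary([1, 2, 3])
print(doc_freq.get_count(), doc_freq.get_sum(), doc_freq.get_min(),
      doc_freq.get_max(), doc_freq.get_average())
```

Each method collapses the per-term values into a single number, which is what makes the result usable as a score or as an LTR feature.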
[NOTE]
.Painless language required
===================================================
The `_termStats` variable is only available when using the <<modules-scripting-painless, Painless>> scripting language.
===================================================
[discrete] [discrete]
[[modules-scripting-doc-vals]] [[modules-scripting-doc-vals]]