mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-04-24 15:17:30 -04:00
Term Stats documentation (#115933)
* Term Stats documentation
* Update docs/reference/reranking/learning-to-rank-model-training.asciidoc

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

* Fix query example.

---------

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
parent
c6f7827105
commit
0416812456
4 changed files with 108 additions and 22 deletions
|
@ -62,10 +62,17 @@ multiplied by `boost` to produce final documents' scores. Defaults to `1.0`.
|
|||
===== Use relevance scores in a script
|
||||
|
||||
Within a script, you can
|
||||
{ref}/modules-scripting-fields.html#scripting-score[access]
|
||||
{ref}/modules-scripting-fields.html#scripting-score[access]
|
||||
the `_score` variable which represents the current relevance score of a
|
||||
document.
|
||||
|
||||
[[script-score-access-term-statistics]]
|
||||
===== Use term statistics in a script
|
||||
|
||||
Within a script, you can
|
||||
{ref}/modules-scripting-fields.html#scripting-term-statistics[access]
|
||||
the `_termStats` variable, which provides statistical information about the terms used in the child query of the `script_score` query.
|
||||
|
||||
[[script-score-predefined-functions]]
|
||||
===== Predefined functions
|
||||
You can use any of the available {painless}/painless-contexts.html[painless
|
||||
|
@ -147,7 +154,7 @@ updated since update operations also update the value of the `_seq_no` field.
|
|||
|
||||
[[decay-functions-numeric-fields]]
|
||||
====== Decay functions for numeric fields
|
||||
You can read more about decay functions
|
||||
You can read more about decay functions
|
||||
{ref}/query-dsl-function-score-query.html#function-decay[here].
|
||||
|
||||
* `double decayNumericLinear(double origin, double scale, double offset, double decay, double docValue)`
|
||||
|
@ -233,7 +240,7 @@ The `script_score` query calculates the score for
|
|||
every matching document, or hit. There are faster alternative query types that
|
||||
can efficiently skip non-competitive hits:
|
||||
|
||||
* If you want to boost documents on some static fields, use the
|
||||
* If you want to boost documents on some static fields, use the
|
||||
<<query-dsl-rank-feature-query, `rank_feature`>> query.
|
||||
* If you want to boost documents closer to a date or geographic point, use the
|
||||
<<query-dsl-distance-feature-query, `distance_feature`>> query.
|
||||
|
|
|
@ -38,11 +38,21 @@ Feature extractors are defined using templated queries. https://eland.readthedoc
|
|||
from eland.ml.ltr import QueryFeatureExtractor
|
||||
|
||||
feature_extractors=[
|
||||
# We want to use the score of the match query for the title field as a feature:
|
||||
# We want to use the BM25 score of the match query for the title field as a feature:
|
||||
QueryFeatureExtractor(
|
||||
feature_name="title_bm25",
|
||||
query={"match": {"title": "{{query}}"}}
|
||||
),
|
||||
# We want to use the number of matched terms in the title field as a feature:
|
||||
QueryFeatureExtractor(
|
||||
feature_name="title_matched_term_count",
|
||||
query={
|
||||
"script_score": {
|
||||
"query": {"match": {"title": "{{query}}"}},
|
||||
"script": {"source": "return _termStats.matchedTermsCount();"},
|
||||
}
|
||||
},
|
||||
),
|
||||
# We can use a script_score query to get the value
|
||||
# of the field rating directly as a feature:
|
||||
QueryFeatureExtractor(
|
||||
|
@ -54,19 +64,13 @@ feature_extractors=[
|
|||
}
|
||||
},
|
||||
),
|
||||
# We can execute a script on the value of the query
|
||||
# and use the return value as a feature:
|
||||
QueryFeatureExtractor(
|
||||
feature_name="query_length",
|
||||
# We extract the number of terms in the query as a feature.
|
||||
QueryFeatureExtractor(
|
||||
feature_name="query_term_count",
|
||||
query={
|
||||
"script_score": {
|
||||
"query": {"match_all": {}},
|
||||
"script": {
|
||||
"source": "return params['query'].splitOnToken(' ').length;",
|
||||
"params": {
|
||||
"query": "{{query}}",
|
||||
}
|
||||
},
|
||||
"query": {"match": {"title": "{{query}}"}},
|
||||
"script": {"source": "return _termStats.uniqueTermsCount();"},
|
||||
}
|
||||
},
|
||||
),
|
||||
|
@ -74,6 +78,15 @@ feature_extractors=[
|
|||
----
|
||||
// NOTCONSOLE
|
||||
|
||||
[NOTE]
|
||||
.Term statistics as features
|
||||
===================================================
|
||||
|
||||
It is very common for an LTR model to leverage raw term statistics as features.
|
||||
To extract this information, you can use the {ref}/modules-scripting-fields.html#scripting-term-statistics[term statistics feature] provided as part of the <<query-dsl-script-score-query,`script_score`>> query.
|
||||
|
||||
===================================================
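As a minimal sketch of how several term-statistics features could be declared without repeating the query boilerplate, the snippet below builds the `script_score` query bodies in plain Python. The helper `term_stats_feature_query` and the feature names are hypothetical and not part of eland; the `_termStats` calls are the ones documented for the `script_score` query.

```python
import json
from typing import Any, Dict


def term_stats_feature_query(field: str, script_source: str) -> Dict[str, Any]:
    """Hypothetical helper: wrap a Painless script that reads `_termStats`
    in a script_score query whose child query is a templated match query."""
    return {
        "script_score": {
            # The child query determines the field and terms used for the statistics.
            "query": {"match": {field: "{{query}}"}},
            "script": {"source": script_source},
        }
    }


# Illustrative feature names mapped to their templated queries.
feature_queries = {
    "title_matched_term_count": term_stats_feature_query(
        "title", "return _termStats.matchedTermsCount();"
    ),
    "title_avg_doc_freq": term_stats_feature_query(
        "title", "return _termStats.docFreq().getAverage();"
    ),
    "title_max_term_freq": term_stats_feature_query(
        "title", "return _termStats.termFreq().getMax();"
    ),
}

print(json.dumps(feature_queries["title_avg_doc_freq"], indent=2))
```

Each dictionary could then be passed as the `query` argument of a `QueryFeatureExtractor`, in the same way as the extractors in the example above.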
|
||||
|
||||
Once the feature extractors have been defined, they are wrapped in an `eland.ml.ltr.LTRModelConfig` object for use in later training steps:
|
||||
|
||||
[source,python]
|
||||
|
|
|
@ -61,10 +61,3 @@ When exposing pagination to users, `window_size` should remain constant as each
|
|||
====== Negative scores
|
||||
|
||||
Depending on how your model is trained, it’s possible that the model will return negative scores for documents. While negative scores are not allowed from first-stage retrieval and ranking, it is possible to use them in the LTR rescorer.
|
||||
|
||||
[discrete]
|
||||
[[learning-to-rank-rescorer-limitations-term-statistics]]
|
||||
====== Term statistics as features
|
||||
|
||||
We do not currently support term statistics as features, however future releases will introduce this capability.
|
||||
|
||||
|
|
|
@ -80,6 +80,79 @@ GET my-index-000001/_search
|
|||
}
|
||||
-------------------------------------
|
||||
|
||||
[discrete]
|
||||
[[scripting-term-statistics]]
|
||||
=== Accessing term statistics of a document within a script
|
||||
|
||||
Scripts used in a <<query-dsl-script-score-query,`script_score`>> query have access to the `_termStats` variable, which provides statistical information about the terms in the child query.
|
||||
|
||||
In the following example, `_termStats` is used within a <<query-dsl-script-score-query,`script_score`>> query to retrieve the average term frequency for the terms `quick`, `brown`, and `fox` in the `text` field:
|
||||
|
||||
[source,console]
|
||||
-------------------------------------
|
||||
PUT my-index-000001/_doc/1?refresh
|
||||
{
|
||||
"text": "quick brown fox"
|
||||
}
|
||||
|
||||
PUT my-index-000001/_doc/2?refresh
|
||||
{
|
||||
"text": "quick fox"
|
||||
}
|
||||
|
||||
GET my-index-000001/_search
|
||||
{
|
||||
"query": {
|
||||
"script_score": {
|
||||
"query": { <1>
|
||||
"match": {
|
||||
"text": "quick brown fox"
|
||||
}
|
||||
},
|
||||
"script": {
|
||||
"source": "_termStats.termFreq().getAverage()" <2>
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
-------------------------------------
|
||||
|
||||
<1> Child query used to infer the field and the terms considered in term statistics.
|
||||
|
||||
<2> The script calculates the average term frequency for the terms in the query using `_termStats`.
|
||||
|
||||
`_termStats` provides access to the following functions for working with term statistics:
|
||||
|
||||
- `uniqueTermsCount`: Returns the total number of unique terms in the query. This value is the same across all documents.
|
||||
- `matchedTermsCount`: Returns the count of query terms that matched within the current document.
|
||||
- `docFreq`: Provides document frequency statistics for the terms in the query, indicating how many documents contain each term. This value is consistent across all documents.
|
||||
- `totalTermFreq`: Provides the total frequency of terms across all documents, representing how often each term appears in the entire corpus. This value is consistent across all documents.
|
||||
- `termFreq`: Returns the frequency of query terms within the current document, showing how often each term appears in that document.
|
||||
|
||||
[NOTE]
|
||||
.Functions returning aggregated statistics
|
||||
===================================================
|
||||
|
||||
The `docFreq`, `termFreq` and `totalTermFreq` functions return objects that represent statistics across all terms of the child query.
|
||||
|
||||
These statistics objects support the following methods:
|
||||
|
||||
- `getAverage()`: Returns the average value of the metric.
|
||||
- `getMin()`: Returns the minimum value of the metric.
|
||||
- `getMax()`: Returns the maximum value of the metric.
|
||||
- `getSum()`: Returns the sum of the metric values.
|
||||
- `getCount()`: Returns the count of terms included in the metric calculation.
|
||||
|
||||
===================================================
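To make the semantics of these functions concrete, here is a rough pure-Python emulation applied to the two sample documents from the example above. It is not how Elasticsearch computes the values. The `TermStats` and `StatsSummary` classes are illustrative stand-ins, the tokenizer is a plain lowercase whitespace split, and a query term that is absent from the current document is assumed to contribute a frequency of 0.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StatsSummary:
    """Illustrative stand-in for the object returned by docFreq(), termFreq() and totalTermFreq()."""

    values: List[int]

    def get_average(self) -> float:
        return sum(self.values) / len(self.values) if self.values else 0.0

    def get_min(self) -> int:
        return min(self.values, default=0)

    def get_max(self) -> int:
        return max(self.values, default=0)

    def get_sum(self) -> int:
        return sum(self.values)

    def get_count(self) -> int:
        return len(self.values)


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase + whitespace split; the real analyzer depends on the field mapping.
    return text.lower().split()


class TermStats:
    """Illustrative emulation of `_termStats` for one query and one current document."""

    def __init__(self, corpus: Dict[int, str], query: str, doc_id: int):
        # Unique query terms, in first-seen order.
        self.terms = list(dict.fromkeys(tokenize(query)))
        self.doc_counts = {d: Counter(tokenize(t)) for d, t in corpus.items()}
        self.current = self.doc_counts[doc_id]

    def unique_terms_count(self) -> int:
        return len(self.terms)

    def matched_terms_count(self) -> int:
        return sum(1 for t in self.terms if self.current[t] > 0)

    def doc_freq(self) -> StatsSummary:
        # Number of documents containing each query term.
        return StatsSummary(
            [sum(1 for c in self.doc_counts.values() if c[t] > 0) for t in self.terms]
        )

    def total_term_freq(self) -> StatsSummary:
        # Occurrences of each query term across the whole corpus.
        return StatsSummary(
            [sum(c[t] for c in self.doc_counts.values()) for t in self.terms]
        )

    def term_freq(self) -> StatsSummary:
        # Assumption: a query term missing from the current document counts as 0.
        return StatsSummary([self.current[t] for t in self.terms])


corpus = {1: "quick brown fox", 2: "quick fox"}
for doc_id in corpus:
    stats = TermStats(corpus, "quick brown fox", doc_id)
    print(doc_id, stats.matched_terms_count(), stats.term_freq().get_average())
```

Under these assumptions, document 1 matches all three query terms and document 2 matches two of them. The values Elasticsearch returns depend on the field's analyzer and on the index contents.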
|
||||
|
||||
|
||||
[NOTE]
|
||||
.Painless language required
|
||||
===================================================
|
||||
|
||||
The `_termStats` variable is only available when using the <<modules-scripting-painless, Painless>> scripting language.
|
||||
|
||||
===================================================
|
||||
|
||||
[discrete]
|
||||
[[modules-scripting-doc-vals]]
|
||||
|
|