mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-04-25 07:37:19 -04:00
* Term Stats documentation
* Update docs/reference/reranking/learning-to-rank-model-training.asciidoc
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
* Fix query example.
---------
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
(cherry picked from commit 0416812456)
Co-authored-by: Aurélien FOUCRET <aurelien.foucret@gmail.com>
This commit is contained in:
parent
6cd1f8cbcd
commit
7b39d3db52
4 changed files with 108 additions and 22 deletions
|
@ -62,10 +62,17 @@ multiplied by `boost` to produce final documents' scores. Defaults to `1.0`.
|
||||||
===== Use relevance scores in a script
|
===== Use relevance scores in a script
|
||||||
|
|
||||||
Within a script, you can
|
Within a script, you can
|
||||||
{ref}/modules-scripting-fields.html#scripting-score[access]
|
{ref}/modules-scripting-fields.html#scripting-score[access]
|
||||||
the `_score` variable which represents the current relevance score of a
|
the `_score` variable which represents the current relevance score of a
|
||||||
document.
|
document.
|
||||||
|
|
||||||
|
[[script-score-access-term-statistics]]
|
||||||
|
===== Use term statistics in a script
|
||||||
|
|
||||||
|
Within a script, you can
|
||||||
|
{ref}/modules-scripting-fields.html#scripting-term-statistics[access]
|
||||||
|
the `_termStats` variable, which provides statistical information about the terms used in the child query of the `script_score` query.
|
||||||
|
|
||||||
[[script-score-predefined-functions]]
|
[[script-score-predefined-functions]]
|
||||||
===== Predefined functions
|
===== Predefined functions
|
||||||
You can use any of the available {painless}/painless-contexts.html[painless
|
You can use any of the available {painless}/painless-contexts.html[painless
|
||||||
|
@ -147,7 +154,7 @@ updated since update operations also update the value of the `_seq_no` field.
|
||||||
|
|
||||||
[[decay-functions-numeric-fields]]
|
[[decay-functions-numeric-fields]]
|
||||||
====== Decay functions for numeric fields
|
====== Decay functions for numeric fields
|
||||||
You can read more about decay functions
|
You can read more about decay functions
|
||||||
{ref}/query-dsl-function-score-query.html#function-decay[here].
|
{ref}/query-dsl-function-score-query.html#function-decay[here].
|
||||||
|
|
||||||
* `double decayNumericLinear(double origin, double scale, double offset, double decay, double docValue)`
|
* `double decayNumericLinear(double origin, double scale, double offset, double decay, double docValue)`
|
||||||
|
@ -233,7 +240,7 @@ The `script_score` query calculates the score for
|
||||||
every matching document, or hit. There are faster alternative query types that
|
every matching document, or hit. There are faster alternative query types that
|
||||||
can efficiently skip non-competitive hits:
|
can efficiently skip non-competitive hits:
|
||||||
|
|
||||||
* If you want to boost documents on some static fields, use the
|
* If you want to boost documents on some static fields, use the
|
||||||
<<query-dsl-rank-feature-query, `rank_feature`>> query.
|
<<query-dsl-rank-feature-query, `rank_feature`>> query.
|
||||||
* If you want to boost documents closer to a date or geographic point, use the
|
* If you want to boost documents closer to a date or geographic point, use the
|
||||||
<<query-dsl-distance-feature-query, `distance_feature`>> query.
|
<<query-dsl-distance-feature-query, `distance_feature`>> query.
|
||||||
|
|
|
@ -38,11 +38,21 @@ Feature extractors are defined using templated queries. https://eland.readthedoc
|
||||||
from eland.ml.ltr import QueryFeatureExtractor
|
from eland.ml.ltr import QueryFeatureExtractor
|
||||||
|
|
||||||
feature_extractors=[
|
feature_extractors=[
|
||||||
# We want to use the score of the match query for the title field as a feature:
|
# We want to use the BM25 score of the match query for the title field as a feature:
|
||||||
QueryFeatureExtractor(
|
QueryFeatureExtractor(
|
||||||
feature_name="title_bm25",
|
feature_name="title_bm25",
|
||||||
query={"match": {"title": "{{query}}"}}
|
query={"match": {"title": "{{query}}"}}
|
||||||
),
|
),
|
||||||
|
# We want to use the number of matched terms in the title field as a feature:
|
||||||
|
QueryFeatureExtractor(
|
||||||
|
feature_name="title_matched_term_count",
|
||||||
|
query={
|
||||||
|
"script_score": {
|
||||||
|
"query": {"match": {"title": "{{query}}"}},
|
||||||
|
"script": {"source": "return _termStats.matchedTermsCount();"},
|
||||||
|
}
|
||||||
|
},
|
||||||
|
),
|
||||||
# We can use a script_score query to get the value
|
# We can use a script_score query to get the value
|
||||||
# of the field rating directly as a feature:
|
# of the field rating directly as a feature:
|
||||||
QueryFeatureExtractor(
|
QueryFeatureExtractor(
|
||||||
|
@ -54,19 +64,13 @@ feature_extractors=[
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
),
|
),
|
||||||
# We can execute a script on the value of the query
|
# We extract the number of terms in the query as a feature:
|
||||||
# and use the return value as a feature:
|
QueryFeatureExtractor(
|
||||||
QueryFeatureExtractor(
|
feature_name="query_term_count",
|
||||||
feature_name="query_length",
|
|
||||||
query={
|
query={
|
||||||
"script_score": {
|
"script_score": {
|
||||||
"query": {"match_all": {}},
|
"query": {"match": {"title": "{{query}}"}},
|
||||||
"script": {
|
"script": {"source": "return _termStats.uniqueTermsCount();"},
|
||||||
"source": "return params['query'].splitOnToken(' ').length;",
|
|
||||||
"params": {
|
|
||||||
"query": "{{query}}",
|
|
||||||
}
|
|
||||||
},
|
|
||||||
}
|
}
|
||||||
},
|
},
|
||||||
),
|
),
|
||||||
|
@ -74,6 +78,15 @@ feature_extractors=[
|
||||||
----
|
----
|
||||||
// NOTCONSOLE
|
// NOTCONSOLE
|
||||||
|
|
||||||
|
[NOTE]
|
||||||
|
.Term statistics as features
|
||||||
|
===================================================
|
||||||
|
|
||||||
|
LTR models commonly use raw term statistics as features.
|
||||||
|
To extract this information, you can use the {ref}/modules-scripting-fields.html#scripting-term-statistics[term statistics feature] provided as part of the <<query-dsl-script-score-query,`script_score`>> query.
|
||||||
|
|
||||||
|
===================================================
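As a minimal sketch of how several term-statistics features could be declared at once, the snippet below generates one `script_score` template per combination of statistics function and summary method. The helper name `term_stat_feature_specs` and the feature naming scheme are hypothetical and not part of eland; only the `_termStats` functions and the `feature_name`/`query` arguments come from the documentation in this commit.

```python
# Term-statistics functions and summary methods named in the _termStats documentation.
STAT_FUNCTIONS = ("docFreq", "totalTermFreq", "termFreq")
SUMMARY_METHODS = ("getAverage", "getMin", "getMax", "getSum")


def term_stat_feature_specs(field, functions=STAT_FUNCTIONS, methods=SUMMARY_METHODS):
    """Return one {"feature_name", "query"} dict per (function, method) pair.

    Hypothetical helper: the naming scheme is for illustration only.
    """
    specs = []
    for function in functions:
        for method in methods:
            # Turn e.g. "getAverage" into the suffix "average".
            suffix = method[len("get"):].lower()
            specs.append(
                {
                    "feature_name": f"{field}_{function}_{suffix}",
                    "query": {
                        "script_score": {
                            # "{{query}}" is the template placeholder used in the examples above.
                            "query": {"match": {field: "{{query}}"}},
                            "script": {
                                "source": f"return _termStats.{function}().{method}();"
                            },
                        }
                    },
                }
            )
    return specs


specs = term_stat_feature_specs("title")
```

Each resulting dict carries the same two keyword arguments used in the extractor examples, so it could presumably be passed on as `QueryFeatureExtractor(**spec)`; that usage is an assumption and is left out of the snippet so that it runs without eland installed.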
|
||||||
|
|
||||||
Once the feature extractors have been defined, they are wrapped in an `eland.ml.ltr.LTRModelConfig` object for use in later training steps:
|
Once the feature extractors have been defined, they are wrapped in an `eland.ml.ltr.LTRModelConfig` object for use in later training steps:
|
||||||
|
|
||||||
[source,python]
|
[source,python]
|
||||||
|
|
|
@ -61,10 +61,3 @@ When exposing pagination to users, `window_size` should remain constant as each
|
||||||
====== Negative scores
|
====== Negative scores
|
||||||
|
|
||||||
Depending on how your model is trained, it’s possible that the model will return negative scores for documents. While negative scores are not allowed from first-stage retrieval and ranking, it is possible to use them in the LTR rescorer.
|
Depending on how your model is trained, it’s possible that the model will return negative scores for documents. While negative scores are not allowed from first-stage retrieval and ranking, it is possible to use them in the LTR rescorer.
|
||||||
|
|
||||||
[discrete]
|
|
||||||
[[learning-to-rank-rescorer-limitations-term-statistics]]
|
|
||||||
====== Term statistics as features
|
|
||||||
|
|
||||||
We do not currently support term statistics as features, however future releases will introduce this capability.
|
|
||||||
|
|
||||||
|
|
|
@ -80,6 +80,79 @@ GET my-index-000001/_search
|
||||||
}
|
}
|
||||||
-------------------------------------
|
-------------------------------------
|
||||||
|
|
||||||
|
[discrete]
|
||||||
|
[[scripting-term-statistics]]
|
||||||
|
=== Accessing term statistics of a document within a script
|
||||||
|
|
||||||
|
Scripts used in a <<query-dsl-script-score-query,`script_score`>> query have access to the `_termStats` variable, which provides statistical information about the terms in the child query.
|
||||||
|
|
||||||
|
In the following example, `_termStats` is used within a <<query-dsl-script-score-query,`script_score`>> query to retrieve the average term frequency for the terms `quick`, `brown`, and `fox` in the `text` field:
|
||||||
|
|
||||||
|
[source,console]
|
||||||
|
-------------------------------------
|
||||||
|
PUT my-index-000001/_doc/1?refresh
|
||||||
|
{
|
||||||
|
"text": "quick brown fox"
|
||||||
|
}
|
||||||
|
|
||||||
|
PUT my-index-000001/_doc/2?refresh
|
||||||
|
{
|
||||||
|
"text": "quick fox"
|
||||||
|
}
|
||||||
|
|
||||||
|
GET my-index-000001/_search
|
||||||
|
{
|
||||||
|
"query": {
|
||||||
|
"script_score": {
|
||||||
|
"query": { <1>
|
||||||
|
"match": {
|
||||||
|
"text": "quick brown fox"
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"script": {
|
||||||
|
"source": "_termStats.termFreq().getAverage()" <2>
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
-------------------------------------
|
||||||
|
|
||||||
|
<1> Child query used to infer the field and the terms considered in term statistics.
|
||||||
|
|
||||||
|
<2> The script calculates the average term frequency for the terms in the query using `_termStats`.
|
||||||
|
|
||||||
|
`_termStats` provides access to the following functions for working with term statistics:
|
||||||
|
|
||||||
|
- `uniqueTermsCount`: Returns the total number of unique terms in the query. This value is the same across all documents.
|
||||||
|
- `matchedTermsCount`: Returns the count of query terms that matched within the current document.
|
||||||
|
- `docFreq`: Provides document frequency statistics for the terms in the query, indicating how many documents contain each term. This value is consistent across all documents.
|
||||||
|
- `totalTermFreq`: Provides the total frequency of terms across all documents, representing how often each term appears in the entire corpus. This value is consistent across all documents.
|
||||||
|
- `termFreq`: Returns the frequency of query terms within the current document, showing how often each term appears in that document.
|
||||||
|
|
||||||
|
[NOTE]
|
||||||
|
.Functions returning aggregated statistics
|
||||||
|
===================================================
|
||||||
|
|
||||||
|
The `docFreq`, `termFreq` and `totalTermFreq` functions return objects that represent statistics across all terms of the child query.
|
||||||
|
|
||||||
|
The returned statistics objects support the following methods:
|
||||||
|
|
||||||
|
`getAverage()`: Returns the average value of the metric.
|
||||||
|
`getMin()`: Returns the minimum value of the metric.
|
||||||
|
`getMax()`: Returns the maximum value of the metric.
|
||||||
|
`getSum()`: Returns the sum of the metric values.
|
||||||
|
`getCount()`: Returns the count of terms included in the metric calculation.
|
||||||
|
|
||||||
|
===================================================
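To make the difference between per-document and corpus-wide statistics concrete, here is a small pure-Python simulation over the two example documents. It is only an illustration of the definitions above, not how Elasticsearch computes them: whitespace splitting stands in for the analyzer, counting an absent term as frequency `0` is an assumption of this sketch, and all function names are hypothetical.

```python
from collections import Counter
from statistics import mean

# The two documents and the query terms from the console example.
docs = {1: "quick brown fox", 2: "quick fox"}
query_terms = ["quick", "brown", "fox"]

# Per-document term counts; whitespace splitting is a stand-in for real analysis.
tokens = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}


def summarize(values):
    """Mimic the summary methods listed above over one value per query term."""
    return {
        "getAverage": mean(values),
        "getMin": min(values),
        "getMax": max(values),
        "getSum": sum(values),
        "getCount": len(values),
    }


def term_freq(doc_id):
    # Depends on the current document; absent terms count as 0 (assumption).
    return summarize([tokens[doc_id][term] for term in query_terms])


def doc_freq():
    # Number of documents containing each term; same for every document.
    return summarize(
        [sum(1 for counts in tokens.values() if term in counts) for term in query_terms]
    )


def total_term_freq():
    # Occurrences of each term across the whole corpus; same for every document.
    return summarize(
        [sum(counts[term] for counts in tokens.values()) for term in query_terms]
    )


def matched_terms_count(doc_id):
    # Number of query terms that occur in the current document.
    return sum(1 for term in query_terms if tokens[doc_id][term] > 0)
```

In this simulation only `term_freq` and `matched_terms_count` take a document ID, which mirrors the statement above that `docFreq` and `totalTermFreq` are the same for every document.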
|
||||||
|
|
||||||
|
|
||||||
|
[NOTE]
|
||||||
|
.Painless language required
|
||||||
|
===================================================
|
||||||
|
|
||||||
|
The `_termStats` variable is only available when using the <<modules-scripting-painless, Painless>> scripting language.
|
||||||
|
|
||||||
|
===================================================
|
||||||
|
|
||||||
[discrete]
|
[discrete]
|
||||||
[[modules-scripting-doc-vals]]
|
[[modules-scripting-doc-vals]]
|
||||||
|
|