The cross_fields scoring type can produce negative scores when some documents
are missing fields. When blending term document frequencies, we take the maximum
document frequency across all fields. If one field appears in fewer documents
than another, this means that its IDF can become negative. This is because IDF
is calculated as `Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))`
This change adjusts the docFreq for each field to `Math.min(docCount, docFreq)`
so that the IDF can never become negative. It makes sense that the term document
frequency should never exceed the number of documents containing the field.
There are some redundant words so I just removed those words. Please accept this change.
(cherry picked from commit e1e5398051)
Co-authored-by: Adnan Ashraf <adnan.ashraff1@gmail.com>
The current docs mention that Elasticsearch indexes prefixes between 2 and 5 characters in a separate field. 2 and 5 are default values, and the size of the prefixes indexed depend on the configuration settings.
* Soft-deprecation of point/geo_point formats
Since GeoJSON and WKT are now common formats for all three types:
geo_shape, geo_point and point
We decided to soft-deprecate the other point formats by ordering:
* GeoJSON (object with keys `type` and `coordinates`)
* WKT `POINT(x y)`
* Object with keys `lat` and `lon` (or `x` and `y` for point)
* Array [lon,lat]
* String `"lat,lon"` (or `"x,y"` in point)
* String with geohash (only in `geo_point`)
The geohash is last because it is only in one field type.
The string version is second last because it is the most controversial
being the only version to reverse the coordinate order from all other
formats (for geo_point only, since the coordinates are not reversed
in point).
In addition we replaced many examples in both documentation and tests
to prioritize WKT over the plain string format.
Many remaining examples of array format or object with keys still exist
and could be replaced by, for example, GeoJSON, if we feel the need.
* Incorrect quote position
Documents the `EMPTY` and `NONE` `flag` values for the `regexp` query.
Also documents the `""` (empty string) value, which is an alias for `ALL`.
Closes#81978.
Changes:
* Notes that the query string query's `default_field` and `fields` parameters support wildcards.
* Adds an xref to the `index.query.default_field` docs to the `default_field` parameter.
The current `multi_match` docs contain an erroneous reference to the `combined_fields` query. This updates the reference to reference the correct query.
Relates to https://github.com/elastic/elasticsearch/pull/76893
Removes `testenv` annotations and related code. These annotations originally let you skip x-pack snippet tests in the docs. However, that's no longer possible.
Relates to #79309, #31619
Changes:
* Documents the `wildcard` parameter for the `wildcard` query. This parameter is an alias for the `value` parameter.
* Reorders the parameters alphabetically.
Closes#79711
As the script has only access to the nested document, this should be
documented.
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Adds additional information about how Elasticsearch uses polygon orientation. Elasticsearch only uses a polygon's orientation to determine if it crosses the international dateline. If so, Elasticsearch splits the polygon at the dateline.
Closes#74891
Changes:
* Notes that you can't use cross-cluster search to run a terms lookup on a remote index.
* Removes a redundant sentence noting `_source` is enabled by default.
Closes#61364.
Changes:
* Use "geopoint" when not referring to the literal field type
* Use "geoshape" when not referring to the literal field type or query type
* Use "GeoJSON" consistently
The current `ids` option doesn't allow pinning a specific document in a
single index when searching over multiple indices. This introduces a
`documents` option, which is an array of `_id` and `_index`
fields to allow index-specific pins.
Closes https://github.com/elastic/elasticsearch/issues/67855.
In the upcoming Lucene 9 release, `indices.query.bool.max_clause_count` is
going to apply to the entire query tree rather than per `bool` query. In order
to avoid breaks, the limit has been bumped from 1024 to 4096.
The semantics will effectively change when we upgrade to Lucene 9, this PR
is only about agreeing on a migration strategy and documenting this change.
To avoid further breaks, I am leaning towards keeping the current setting name
even though it contains `bool`. I believe that it still makes sense given that
`bool` queries are typically the main contributors to high numbers of clauses.
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
`field_masking_span` is the only span query that does not begin with
`span_`. This commit deprecates the existing name and adds a new
name `span_field_masking` to better fit with the other queries.
* Removes docs and references for the following `geo_shape` mapping parameters:
* `tree`
* `tree_levels`
* `strategy`
* `distance_error_pct`
* Updates a related breaking change.
Relates to #70850
This PR introduces a new query called `combined_fields` for searching multiple
text fields. It takes a term-centric view, first analyzing the query string
into individual terms, then searching for each term any of the fields as though
they were one combined field. It is based on Lucene's `CombinedFieldQuery`,
which takes a principled approach to scoring based on the BM25F formula.
This query provides an alternative to the `cross_fields` `multi_match` mode. It
has simpler behavior and a more robust approach to scoring.
Addresses #41106.
This adds a "note" on the docs for the script query pointing folks to
runtime fields because they are more flexible. It also translates the
request example into runtime fields.
Relates to #69291
Co-authored-by: Adam Locke <adam.locke@elastic.co>
When performing a multi_match in cross_fields mode, we group fields based on
their analyzer and create a blended query per group. Our docs claimed that the
group scores were combined through a boolean query, but they are actually
combined through a dismax that incorporates the tiebreaker parameter.
This commit updates the docs and adds a test verifying the behavior.
The doc is misleading : The following intervals search returns documents containing `my favorite food` **immediately** followed by `hot water` or `cold porridge`
max_gaps apply only to the match query and is not used for checking proximity with the other match, the example given actually`This search would match a my_text value of my favorite food is cold`
Co-authored-by: Julien Guay <guay_j@yahoo.fr>