---
navigation_title: "Semantic text"
---
Semantic text field type [semantic-text]
The `semantic_text` field type automatically generates embeddings for text content using an inference endpoint. Long passages are automatically chunked into smaller sections to enable the processing of larger corpora of text.
The `semantic_text` field type specifies an inference endpoint identifier that will be used to generate embeddings. You can create the inference endpoint by using the Create {{infer}} API.
This field type and the `semantic` query type make it simpler to perform semantic search on your data. The `semantic_text` field type may also be queried with `match`, `sparse_vector`, or `knn` queries.
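For example, assuming an index with a `semantic_text` field named `inference_field` (as in the mapping examples below), a `semantic` query looks like this:

```console
GET my-index-000001/_search
{
  "query": {
    "semantic": {
      "field": "inference_field",
      "query": "Which country is Paris in?"
    }
  }
}
```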
If you don’t specify an inference endpoint, the `inference_id` field defaults to `.elser-2-elasticsearch`, a preconfigured endpoint for the `elasticsearch` service.
Using `semantic_text`, you won’t need to specify how to generate embeddings for your data, or how to index it. The {{infer}} endpoint automatically determines the embedding generation, indexing, and query to use.

Newly created indices with `semantic_text` fields using dense embeddings will be quantized to `bbq_hnsw` automatically.
If you use the preconfigured `.elser-2-elasticsearch` endpoint, you can set up `semantic_text` with the following API request:
PUT my-index-000001
{
"mappings": {
"properties": {
"inference_field": {
"type": "semantic_text"
}
}
}
}
To use a custom {{infer}} endpoint instead of the default `.elser-2-elasticsearch`, you must create an endpoint using the Create {{infer}} API and specify its `inference_id` when setting up the `semantic_text` field type.
PUT my-index-000002
{
"mappings": {
"properties": {
"inference_field": {
"type": "semantic_text",
"inference_id": "my-openai-endpoint" <1>
}
}
}
}
- The `inference_id` of the {{infer}} endpoint to use to generate embeddings.
The recommended way to use `semantic_text` is by having dedicated {{infer}} endpoints for ingestion and search. This ensures that search speed remains unaffected by ingestion workloads, and vice versa. After creating dedicated {{infer}} endpoints for both, you can reference them using the `inference_id` and `search_inference_id` parameters when setting up the index mapping for an index that uses the `semantic_text` field.
PUT my-index-000003
{
"mappings": {
"properties": {
"inference_field": {
"type": "semantic_text",
"inference_id": "my-elser-endpoint-for-ingest",
"search_inference_id": "my-elser-endpoint-for-search"
}
}
}
}
Parameters for `semantic_text` fields [semantic-text-params]
`inference_id`
- (Optional, string) {{infer-cap}} endpoint that will be used to generate embeddings for the field. By default, `.elser-2-elasticsearch` is used. This parameter cannot be updated. Use the Create {{infer}} API to create the endpoint. If `search_inference_id` is specified, the {{infer}} endpoint will only be used at index time.

`search_inference_id`
- (Optional, string) {{infer-cap}} endpoint that will be used to generate embeddings at query time. You can update this parameter by using the Update mapping API. Use the Create {{infer}} API to create the endpoint. If not specified, the {{infer}} endpoint defined by `inference_id` will be used at both index and query time.

`index_options`
- (Optional, object) Specifies the index options to override default values for the field. Currently, `dense_vector` index options are supported. For text embeddings, `index_options` may match any allowed `dense_vector` index options.
An example of how to set `index_options` for a `semantic_text` field:
PUT my-index-000004
{
"mappings": {
"properties": {
"inference_field": {
"type": "semantic_text",
"inference_id": "my-text-embedding-endpoint",
"index_options": {
"dense_vector": {
"type": "int4_flat"
}
}
}
}
}
}
`chunking_settings`
- (Optional, object) Settings for chunking text into smaller passages. If specified, these will override the chunking settings set in the {{infer-cap}} endpoint associated with `inference_id`. If chunking settings are updated, they will not be applied to existing documents until they are reindexed. To completely disable chunking, use the `none` chunking strategy.

Valid values for `chunking_settings`:

`type`
- Indicates the type of chunking strategy to use. Valid values are `none`, `word` or `sentence`. Required.

`max_chunk_size`
- The maximum number of words in a chunk. Required for `word` and `sentence` strategies.

`overlap`
- The number of overlapping words allowed in chunks. This cannot be defined as more than half of the `max_chunk_size`. Required for `word` type chunking settings.

`sentence_overlap`
- The number of overlapping sentences allowed in chunks. Valid values are `0` or `1`. Required for `sentence` type chunking settings.
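As a sketch (the index name `my-index-000005` is illustrative, and the `strategy` key follows the `none` example later on this page), word-based chunking with `overlap` at no more than half of `max_chunk_size` can be configured like this:

```console
PUT my-index-000005
{
  "mappings": {
    "properties": {
      "inference_field": {
        "type": "semantic_text",
        "chunking_settings": {
          "strategy": "word",
          "max_chunk_size": 250,
          "overlap": 100
        }
      }
    }
  }
}
```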
::::{warning}
If the input exceeds the maximum token limit of the underlying model, some services (such as OpenAI) may return an error. In contrast, the `elastic` and `elasticsearch` services will automatically truncate the input to fit within the model's limit.
::::
{{infer-cap}} endpoint validation [infer-endpoint-validation]
The `inference_id` will not be validated when the mapping is created, but when documents are ingested into the index. When the first document is indexed, the `inference_id` will be used to generate the underlying indexing structures for the field.
::::{warning}
Removing an {{infer}} endpoint will cause ingestion of documents and semantic queries to fail on indices that define `semantic_text` fields with that {{infer}} endpoint as their `inference_id`. Trying to delete an {{infer}} endpoint that is used on a `semantic_text` field will result in an error.
::::
Text chunking [auto-text-chunking]
{{infer-cap}} endpoints have a limit on the amount of text they can process. To allow for large amounts of text to be used in semantic search, `semantic_text` automatically generates smaller passages if needed, called chunks.
Each chunk refers to a passage of the text and the corresponding embedding generated from it. When querying, the individual passages will be automatically searched for each document, and the most relevant passage will be used to compute a score.
For more details on chunking and how to configure chunking settings, see Configuring chunking in the Inference API documentation.
You can pre-chunk the input by sending it to Elasticsearch as an array of strings. Example:
PUT test-index
{
"mappings": {
"properties": {
"my_semantic_field": {
"type": "semantic_text",
"chunking_settings": {
"strategy": "none" <1>
}
}
}
}
}
- Disable chunking on `my_semantic_field`.
PUT test-index/_doc/1
{
"my_semantic_field": ["my first chunk", "my second chunk", ...] <1>
...
}
- The text is pre-chunked and provided as an array of strings. Each element in the array represents a single chunk that will be sent directly to the inference service without further chunking.
Important considerations:
- When providing pre-chunked input, ensure that you set the chunking strategy to `none` to avoid additional processing.
- Each chunk should be sized carefully, staying within the token limit of the inference service and the underlying model.
- If a chunk exceeds the model's token limit, the behavior depends on the service:
  - Some services (such as OpenAI) will return an error.
  - Others (such as `elastic` and `elasticsearch`) will automatically truncate the input.
Refer to this tutorial to learn more about semantic search using `semantic_text`.
Extracting Relevant Fragments from Semantic Text [semantic-text-highlighting]
You can extract the most relevant fragments from a semantic text field by using the highlight parameter in the Search API.
POST test-index/_search
{
"query": {
"match": {
"my_semantic_field": "Which country is Paris in?"
}
},
"highlight": {
"fields": {
"my_semantic_field": {
"number_of_fragments": 2, <1>
"order": "score" <2>
}
}
}
}
- Specifies the maximum number of fragments to return.
- Sorts highlighted fragments by score when set to `score`. By default, fragments will be output in the order they appear in the field (`order: none`).
Highlighting is supported on fields other than `semantic_text`. However, if you want to restrict highlighting to the semantic highlighter and return no fragments when the field is not of type `semantic_text`, you can explicitly enforce the `semantic` highlighter in the query:
POST test-index/_search
{
"query": {
"match": {
"my_field": "Which country is Paris in?"
}
},
"highlight": {
"fields": {
"my_field": {
"type": "semantic", <1>
"number_of_fragments": 2,
"order": "score"
}
}
}
}
- Ensures that highlighting is applied exclusively to `semantic_text` fields.
Customizing `semantic_text` indexing [custom-indexing]
`semantic_text` uses defaults for indexing data based on the {{infer}} endpoint specified. It enables you to get started quickly with semantic search by providing automatic {{infer}} and a dedicated query, so you don't need to provide further details.
In case you want to customize data indexing, use the `sparse_vector` or `dense_vector` field types and create an ingest pipeline with an {{infer}} processor to generate the embeddings. This tutorial walks you through the process. In these cases, when you use `sparse_vector` or `dense_vector` field types instead of the `semantic_text` field type to customize indexing, using the `semantic` query is not supported for querying the field data.
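As a rough sketch of that approach (the pipeline name, endpoint ID, and field names here are illustrative assumptions), an ingest pipeline with an {{infer}} processor might look like:

```console
PUT _ingest/pipeline/my-embedding-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": "my-text-embedding-endpoint",
        "input_output": {
          "input_field": "content",
          "output_field": "content_embedding"
        }
      }
    }
  ]
}
```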
Updates to `semantic_text` fields [update-script]
For indices containing `semantic_text` fields, updates that use scripts have the following behavior:
- Are supported through the Update API.
- Are not supported through the Bulk API and will fail. Even if the script targets non-`semantic_text` fields, the update will fail when the index contains a `semantic_text` field.
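For example, a scripted update through the Update API is supported on such indices (the document ID and field name here are illustrative):

```console
POST test-index/_update/1
{
  "script": {
    "source": "ctx._source.source_field = params.value",
    "params": {
      "value": "updated text"
    }
  }
}
```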
`copy_to` and multi-fields support [copy-to-support]
The semantic_text field type can serve as the target of copy_to fields, be part of a multi-field structure, or contain multi-fields internally. This means you can use a single field to collect the values of other fields for semantic search.
For example, the following mapping:
PUT test-index
{
"mappings": {
"properties": {
"source_field": {
"type": "text",
"copy_to": "infer_field"
},
"infer_field": {
"type": "semantic_text",
"inference_id": ".elser-2-elasticsearch"
}
}
}
}
can also be declared as multi-fields:
PUT test-index
{
"mappings": {
"properties": {
"source_field": {
"type": "text",
"fields": {
"infer_field": {
"type": "semantic_text",
"inference_id": ".elser-2-elasticsearch"
}
}
}
}
}
}
Limitations [limitations]
`semantic_text` field types have the following limitations:
- `semantic_text` fields are not currently supported as elements of nested fields.
- `semantic_text` fields can't currently be set as part of dynamic templates.
- `semantic_text` fields are not supported with Cross-Cluster Search (CCS) or Cross-Cluster Replication (CCR).