elasticsearch/docs/reference/ingest/enrich.asciidoc
Niels Bauman e0c1ccbc1e
Make enrich cache based on memory usage (#111412)
The max enrich cache size setting now also supports an absolute max size in bytes (of used heap space) and a percentage of the max heap space, next to the existing flat document count. The default is 1% of the max heap space.

This should prevent issues where the enrich cache takes up a lot of memory when there are large documents in the cache.
2024-08-23 09:26:55 +02:00

289 lines
10 KiB
Text

[[ingest-enriching-data]]
== Enrich your data
You can use the <<enrich-processor,enrich processor>> to add data from your
existing indices to incoming documents during ingest.
For example, you can use the enrich processor to:
* Identify web services or vendors based on known IP addresses
* Add product information to retail orders based on product IDs
* Supplement contact information based on an email address
* Add postal codes based on user coordinates
[discrete]
[[how-enrich-works]]
=== How the enrich processor works
Most processors are self-contained and only change _existing_ data in incoming
documents.
image::images/ingest/ingest-process.svg[align="center"]
The enrich processor adds _new_ data to incoming documents and requires a few
special components:
image::images/ingest/enrich/enrich-process.svg[align="center"]
[[enrich-policy]]
enrich policy::
+
--
A set of configuration options used to add the right enrich data to the right
incoming documents.
An enrich policy contains:
// tag::enrich-policy-fields[]
* A list of one or more _source indices_ which store enrich data as documents
* The _policy type_ which determines how the processor matches the enrich data
to incoming documents
* A _match field_ from the source indices used to match incoming documents
* _Enrich fields_ containing enrich data from the source indices you want to add
to incoming documents
// end::enrich-policy-fields[]
Before it can be used with an enrich processor, an enrich policy must be
<<execute-enrich-policy-api,executed>>. When executed, an enrich policy uses
enrich data from the policy's source indices to create a streamlined system
index called the _enrich index_. The processor uses this index to match and
enrich incoming documents.
--
[[source-index]]
source index::
An index which stores enrich data you'd like to add to incoming documents. You
can create and manage these indices just like a regular {es} index. You can use
multiple source indices in an enrich policy. You also can use the same source
index in multiple enrich policies.
[[enrich-index]]
enrich index::
+
--
A special system index tied to a specific enrich policy.
Directly matching incoming documents to documents in source indices could be
slow and resource intensive. To speed things up, the enrich processor uses an
enrich index.
// tag::enrich-index[]
Enrich indices contain enrich data from source indices but have a few special
properties to help streamline them:
* They are system indices, meaning they're managed internally by {es} and only
intended for use with enrich processors and the {esql} `ENRICH` command.
* They always begin with `.enrich-*`.
* They are read-only, meaning you can't directly change them.
* They are <<indices-forcemerge,force merged>> for fast retrieval.
// end::enrich-index[]
--
[[enrich-setup]]
=== Set up an enrich processor
To set up an enrich processor, follow these steps:
. Check the <<enrich-prereqs, prerequisites>>.
. <<create-enrich-source-index>>.
. <<create-enrich-policy>>.
. <<execute-enrich-policy>>.
. <<add-enrich-processor>>.
. <<ingest-enrich-docs>>.
Once you have an enrich processor set up,
you can <<update-enrich-data,update your enrich data>>
and <<update-enrich-policies, update your enrich policies>>.
[IMPORTANT]
====
The enrich processor performs several operations and may impact the speed of
your ingest pipeline.
We strongly recommend testing and benchmarking your enrich processors
before deploying them in production.
We do not recommend using the enrich processor to append real-time data.
The enrich processor works best with reference data
that doesn't change frequently.
====
[discrete]
[[enrich-prereqs]]
==== Prerequisites
include::{es-ref-dir}/ingest/apis/enrich/put-enrich-policy.asciidoc[tag=enrich-policy-api-prereqs]
[[create-enrich-source-index]]
==== Add enrich data
// tag::create-enrich-source-index[]
To begin, add documents to one or more source indices. These documents should
contain the enrich data you eventually want to add to incoming data.
You can manage source indices just like regular {es} indices using the
<<docs,document>> and <<indices,index>> APIs.
You also can set up {beats-ref}/getting-started.html[{beats}], such as a
{filebeat-ref}/filebeat-installation-configuration.html[{filebeat}], to
automatically send and index documents to your source indices. See
{beats-ref}/getting-started.html[Getting started with {beats}].
//end::create-enrich-source-index[]
[[create-enrich-policy]]
==== Create an enrich policy
// tag::create-enrich-policy[]
After adding enrich data to your source indices, use the
<<put-enrich-policy-api,create enrich policy API>> or
<<manage-enrich-policies,Index Management in {kib}>> to create an enrich policy.
[WARNING]
====
Once created, you can't update or change an enrich policy.
See <<update-enrich-policies>>.
====
// end::create-enrich-policy[]
[[execute-enrich-policy]]
==== Execute the enrich policy
// tag::execute-enrich-policy1[]
Once the enrich policy is created, you need to execute it using the
<<execute-enrich-policy-api,execute enrich policy API>> or
<<manage-enrich-policies,Index Management in {kib}>> to create an
<<enrich-index,enrich index>>.
// end::execute-enrich-policy1[]
image::images/ingest/enrich/enrich-policy-index.svg[align="center"]
// tag::execute-enrich-policy2[]
include::apis/enrich/execute-enrich-policy.asciidoc[tag=execute-enrich-policy-def]
// end::execute-enrich-policy2[]
[[add-enrich-processor]]
==== Add an enrich processor to an ingest pipeline
Once you have source indices, an enrich policy, and the related enrich index in
place, you can set up an ingest pipeline that includes an enrich processor for
your policy.
image::images/ingest/enrich/enrich-processor.svg[align="center"]
Define an <<enrich-processor,enrich processor>> and add it to an ingest
pipeline using the <<put-pipeline-api,create or update pipeline API>>.
When defining the enrich processor, you must include at least the following:
* The enrich policy to use.
* The field used to match incoming documents to the documents in your enrich index.
* The target field to add to incoming documents. This target field contains the
match and enrich fields specified in your enrich policy.
You also can use the `max_matches` option to set the number of enrich documents
an incoming document can match. If set to the default of `1`, data is added to
an incoming document's target field as a JSON object. Otherwise, the data is
added as an array.
See <<enrich-processor>> for a full list of configuration options.
You also can add other <<processors,processors>> to your ingest pipeline.
[[ingest-enrich-docs]]
==== Ingest and enrich documents
You can now use your ingest pipeline to enrich and index documents.
image::images/ingest/enrich/enrich-process.svg[align="center"]
Before implementing the pipeline in production, we recommend indexing a few test
documents first and verifying enrich data was added correctly using the
<<docs-get,get API>>.
[[update-enrich-data]]
==== Update an enrich index
include::{es-ref-dir}/ingest/apis/enrich/execute-enrich-policy.asciidoc[tag=update-enrich-index]
If wanted, you can <<docs-reindex,reindex>>
or <<docs-update-by-query,update>> any already ingested documents
using your ingest pipeline.
[[update-enrich-policies]]
==== Update an enrich policy
// tag::update-enrich-policy[]
Once created, you can't update or change an enrich policy.
Instead, you can:
. Create and <<execute-enrich-policy-api,execute>> a new enrich policy.
. Replace the previous enrich policy
with the new enrich policy
in any in-use enrich processors or {esql} queries.
. Use the <<delete-enrich-policy-api, delete enrich policy>> API or
<<manage-enrich-policies,Index Management in {kib}>>
to delete the previous enrich policy.
// end::update-enrich-policy[]
[[ingest-enrich-components]]
==== Enrich components
The enrich coordinator is a component that manages and performs the searches
required to enrich documents on each ingest node. It combines searches from all enrich
processors in all pipelines into bulk <<search-multi-search,multi-searches>>.
The enrich policy executor is a component that manages the executions of all
enrich policies. When an enrich policy is executed, this component creates
a new enrich index and removes the previous enrich index. The enrich policy
executions are managed from the elected master node. The execution of these
policies occurs on a different node.
[[ingest-enrich-settings]]
==== Node Settings
The `enrich` processor has node settings for enrich coordinator and
enrich policy executor.
The enrich coordinator supports the following node settings:
`enrich.cache_size`::
Maximum size of the cache that caches searches for enriching documents.
The size can be specified in three units: the raw number of
cached searches (e.g. `1000`), an absolute size in bytes (e.g. `100Mb`),
or a percentage of the max heap space of the node (e.g. `1%`).
Both for the absolute byte size and the percentage of heap space,
{es} does not guarantee that the enrich cache size will adhere exactly to that maximum,
as {es} uses the byte size of the serialized search response
which is is a good representation of the used space on the heap, but not an exact match.
Defaults to `1%`. There is a single cache for all enrich processors in the cluster.
`enrich.coordinator_proxy.max_concurrent_requests`::
Maximum number of concurrent <<search-multi-search,multi-search requests>> to
run when enriching documents. Defaults to `8`.
`enrich.coordinator_proxy.max_lookups_per_request`::
Maximum number of searches to include in a <<search-multi-search,multi-search
request>> when enriching documents. Defaults to `128`.
The enrich policy executor supports the following node settings:
`enrich.fetch_size`::
Maximum batch size when reindexing a source index into an enrich index. Defaults
to `10000`.
`enrich.max_force_merge_attempts`::
Maximum number of <<indices-forcemerge,force merge>> attempts allowed on an
enrich index. Defaults to `3`.
`enrich.cleanup_period`::
How often {es} checks whether unused enrich indices can be deleted. Defaults to
`15m`.
`enrich.max_concurrent_policy_executions`::
Maximum number of enrich policies to execute concurrently. Defaults to `50`.
include::geo-match-enrich-policy-type-ex.asciidoc[]
include::match-enrich-policy-type-ex.asciidoc[]
include::range-enrich-policy-type-ex.asciidoc[]