[#es-sync-rules]
=== Connector sync rules
++++
<titleabbrev>Sync rules</titleabbrev>
++++

Use connector sync rules to help control which documents are synced between the third-party data source and Elasticsearch.
Define sync rules in the Kibana UI for each connector index, under the `Sync rules` tab for the index.

Sync rules apply to <<es-native-connectors,managed connectors>> and <<es-build-connector,self-managed connectors>>.

[discrete#es-sync-rules-availability-prerequisites]
==== Availability and prerequisites

In Elastic versions *8.8.0 and later*, all connectors support _basic_ sync rules.

Some connectors support _advanced_ sync rules.
Learn more in the <<es-connectors, individual connector's reference documentation>>.

[discrete#es-sync-rules-types]
==== Types of sync rule

There are two types of sync rule:

* **Basic sync rules** - these rules are represented in a table-like view.
Basic sync rules are identical for all connectors.
* **Advanced sync rules** - these rules cover complex query-and-filter scenarios that cannot be expressed with basic sync rules.
Advanced sync rules are defined through a _source-specific_ DSL JSON snippet.

[.screenshot]
image::images/filtering-rules-zero-state.png[Sync rules tab]

[discrete#es-sync-rules-general-filtering]
==== General data filtering concepts

Before discussing sync rules, it's important to establish a basic understanding of _data filtering_ concepts.
The following diagram shows that data filtering can occur in several different processes/locations.

[.screenshot]
image::images/filtering-general-diagram.png[Filtering]

In this documentation we will focus on remote and integration filtering.
Sync rules can be used to modify both of these.

[discrete#es-sync-rules-general-filtering-remote]
===== Remote filtering

Data might be filtered at its source.
We call this *remote filtering*, as the filtering process is external to Elastic.

[discrete#es-sync-rules-general-filtering-integration]
===== Integration filtering

*Integration filtering* acts as a bridge between the original data source and Elasticsearch.
Filtering that takes place in connectors is an example of integration filtering.

[discrete#es-sync-rules-general-filtering-pipeline]
===== Pipeline filtering

Finally, Elasticsearch can filter data right _before persistence_ using {ref}/ingest-pipeline-search.html[ingest pipelines].
We will not focus on ingest pipeline filtering in this guide.
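
For illustration only, here is a minimal sketch of pipeline filtering using the Elasticsearch Python client: a hypothetical pipeline that drops documents matching a condition before they are persisted. The pipeline name and the `status` field are assumptions made up for this example, not part of the connector framework.

[source, python]
----
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")

# Hypothetical pipeline: drop any document whose `status` field is "archived"
# before it is written to the index.
client.ingest.put_pipeline(
    id="drop-archived-docs",
    description="Drop archived documents before indexing",
    processors=[
        {"drop": {"if": "ctx.status == 'archived'"}}
    ],
)
----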

[NOTE]
====
Currently, basic sync rules are the only way to control _integration filtering_ for connectors.
Remember that remote filtering extends far beyond the scope of connectors alone.
For best results, collaborate with the owners and maintainers of your data source.
Ensure the source data is well-organized and optimized for the query types made by the connectors.
====

[discrete#es-sync-rules-overview]
==== Sync rules overview

In most cases, your data lake will contain far more data than you want to expose to end users.
For example, you may want to search a product catalog, but not include vendor contact information, even if the two are co-located for business purposes.

The optimal time to filter data is _early_ in the data pipeline.
There are two main reasons:

* *Performance*:
It's more efficient to send a query to the backing data source than to obtain all the data and then filter it in the connector.
It's faster to send a smaller dataset over a network and to process it on the connector side.
* *Security*:
Query-time filtering is applied on the data source side, so the data is not sent over the network and into the connector, which limits the exposure of your data.

In a perfect world, all filtering would be done as remote filtering.

In practice, however, this is not always possible.
Some sources do not allow robust remote filtering.
Others do, but require special setup (building indexes on specific fields, tweaking settings, etc.) that may require attention from other stakeholders in your organization.

With this in mind, sync rules were designed to modify both remote filtering and integration filtering.
Your goal should be to do as much remote filtering as possible, but integration filtering is a perfectly viable fallback.
By definition, remote filtering is applied before the data is obtained from a third-party source.
Integration filtering is applied after the data is obtained from a third-party source, but before it is ingested into the Elasticsearch index.
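
To make the distinction concrete, here is an illustrative sketch (not connector framework code) that contrasts the two approaches, using a hypothetical `source` client:

[source, python]
----
def remote_filtering(source):
    # The condition is pushed down to the data source (hypothetical query API),
    # so only matching documents are ever sent over the network.
    return source.query({"category": "catalog"})


def integration_filtering(source):
    # Everything is fetched first (hypothetical fetch_all API), then filtered
    # in the connector before the documents are ingested into Elasticsearch.
    return [
        doc for doc in source.fetch_all()
        if doc.get("category") == "catalog"
    ]
----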

[NOTE]
====
All sync rules are applied to a given document _before_ any {ref}/ingest-pipeline-search.html[ingest pipelines] are run on that document.
Therefore, you can use ingest pipelines for any processing that must occur _after_ integration filtering has occurred.
====

[NOTE]
====
If a sync rule is added, edited, or removed, it will only take effect after the next full sync.
====

[discrete#es-sync-rules-basic]
==== Basic sync rules

Each basic sync rule has one of two "policies": `include` and `exclude`.
`Include` rules are used to include the documents that "match" the specified condition.
`Exclude` rules are used to exclude the documents that "match" the specified condition.

A "match" is determined based on a condition defined by a combination of "field", "rule", and "value".

The `Field` column should be used to define which field on a given document should be considered.

The following rules are available in the `Rule` column:

* `equals` - The field value is equal to the specified value.
* `starts_with` - The field value starts with the specified (string) value.
* `ends_with` - The field value ends with the specified (string) value.
* `contains` - The field value includes the specified (string) value.
* `regex` - The field value matches the specified https://en.wikipedia.org/wiki/Regular_expression[regular expression^].
* `>` - The field value is greater than the specified value.
* `<` - The field value is less than the specified value.

Finally, the `Value` column is dependent on:

* the data type in the specified "field"
* which "rule" was selected.

For example, a value of `[A-Z]{2}` might make sense for a `regex` rule, but much less so for a `>` rule.
Similarly, you probably wouldn't have a value of `espresso` when operating on an `ip_address` field, but perhaps you would for a `beverage` field.
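
As an illustration of how these conditions combine, the following sketch evaluates a single basic rule against a document. This is a simplified model for explanation, not the connector framework's actual implementation; for example, the type coercion and regex semantics shown here are assumptions.

[source, python]
----
import re


def rule_matches(document, field, rule, value):
    """Return True if the document's field satisfies the given condition."""
    actual = document.get(field)
    if actual is None:
        return False
    if rule == "equals":
        return str(actual) == str(value)
    if rule == "starts_with":
        return str(actual).startswith(str(value))
    if rule == "ends_with":
        return str(actual).endswith(str(value))
    if rule == "contains":
        return str(value) in str(actual)
    if rule == "regex":
        # Treated here as a full match; the exact semantics may differ.
        return re.fullmatch(str(value), str(actual)) is not None
    if rule == ">":
        return float(actual) > float(value)
    if rule == "<":
        return float(actual) < float(value)
    raise ValueError(f"Unknown rule: {rule}")
----

A matching document is then kept or dropped depending on the rule's policy: `include` keeps documents that match, `exclude` drops them.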

[discrete#es-sync-rules-basic-examples]
===== Basic sync rules examples

[discrete#es-sync-rules-basic-examples-1]
====== Example 1

Exclude all documents that have an `ID` field with a value greater than 1000.

[.screenshot]
image::images/simple-rule-greater.png[Simple greater than rule]

[discrete#es-sync-rules-basic-examples-2]
====== Example 2

Exclude all documents that have a `state` field that matches a specified regex.

[.screenshot]
image::images/simple-rule-regex.png[Simple regex rule]
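
To make both examples concrete, here is a small sketch that reuses the `rule_matches` helper from above on two made-up documents (the field values and the regex are invented for illustration):

[source, python]
----
docs = [
    {"ID": 500, "state": "NEW"},
    {"ID": 2000, "state": "CLOSED"},
]

# Example 1: exclude documents whose `ID` value is greater than 1000.
kept = [doc for doc in docs if not rule_matches(doc, "ID", ">", 1000)]

# Example 2: exclude documents whose `state` field matches a regex.
kept = [doc for doc in kept if not rule_matches(doc, "state", "regex", "CLOSED|ARCHIVED")]

print(kept)  # [{'ID': 500, 'state': 'NEW'}]
----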

[discrete#es-sync-rules-performance-implications]
===== Performance implications

- If you're relying solely on basic sync rules in the integration filtering phase, the connector will fetch *all* the data from the data source.
- For data sources without automatic pagination, or similar optimizations, fetching all the data can lead to memory issues.
For example, loading datasets which are too big to fit in memory at once.

[NOTE]
====
The native MongoDB connector provided by Elastic uses pagination and therefore has optimized performance.
Keep in mind that custom community-built self-managed connectors may not have these performance optimizations.
====

The following diagrams illustrate the concept of pagination.
A huge data set may not fit into a connector instance's memory.
Splitting data into smaller chunks reduces the risk of out-of-memory errors.

This diagram illustrates an entire dataset being extracted at once:

[.screenshot]
image::images/sync-rules-extract-all-at-once.png[Extract whole dataset at once]

By comparison, this diagram illustrates a paginated dataset:

[.screenshot]
image::images/sync-rules-pagination.png[Pagination]
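
The following sketch shows the paginated approach in code, using a hypothetical `source` client with offset/limit support; the page size is an arbitrary example value:

[source, python]
----
PAGE_SIZE = 500  # arbitrary example value


def fetch_paginated(source):
    """Yield documents page by page so the full dataset never sits in memory."""
    offset = 0
    while True:
        page = source.fetch(offset=offset, limit=PAGE_SIZE)  # hypothetical API
        if not page:
            return
        yield from page
        offset += PAGE_SIZE
----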

[discrete#es-sync-rules-advanced]
==== Advanced sync rules

[IMPORTANT]
====
Advanced sync rules overwrite any remote filtering query that could have been inferred from the basic sync rules.
If an advanced sync rule is defined, any defined basic sync rules will be used exclusively for integration filtering.
====

Advanced sync rules are only used in remote filtering.
You can think of advanced sync rules as a language-agnostic way to represent queries to the data source.
Therefore, these rules are highly *source-specific*.

The following connectors support advanced sync rules:

include::_connectors-list-advanced-rules.asciidoc[]

Each connector supporting advanced sync rules provides its own DSL to specify rules.
Refer to the documentation for <<es-connectors,each connector>> for details.

[discrete#es-interplay-basic-rules-advanced-rules]
==== Combining basic and advanced sync rules

You can also use basic sync rules and advanced sync rules together to filter a data set.

The following diagram provides an overview of the order in which advanced sync rules, basic sync rules, and pipeline filtering are applied to your documents:

[.screenshot]
image::images/sync-rules-time-dimension.png[Sync Rules: What is applied when?]

[discrete#es-example-interplay-basic-rules-advanced-rules]
===== Example

In the following example, we want to filter a data set of apartments down to only those with specific properties.
We'll use basic and advanced sync rules throughout the example.

A sample apartment looks like this in JSON format:

[source, js]
----
{
    "id": 1234,
    "bedrooms": 3,
    "price": 1500,
    "address": {
        "street": "Street 123",
        "government_area": "Area",
        "country_information": {
            "country_code": "PT",
            "country": "Portugal"
        }
    }
}
----
// NOTCONSOLE

The target data set should fulfill the following conditions:

. Every apartment should have at least *3 bedrooms*
. The apartments should not be more expensive than *1500 per month*
. The apartment with id '1234' should be included regardless of the first two conditions
. Each apartment should be located in either 'Portugal' or 'Spain'

The first 3 conditions can be handled by basic sync rules, but we'll need to use advanced sync rules for condition 4.

[discrete#es-example-interplay-basic-rules]
====== Basic sync rules examples

To create a new basic sync rule, navigate to the 'Sync Rules' tab and select *Draft new sync rules*:

[.screenshot]
image::images/sync-rules-draft-new-rules.png[Draft new rules]

Afterwards, press the 'Save and validate draft' button to validate these rules.
Note that when saved, the rules will be in _draft_ state. They won't be executed in the next sync unless they are _applied_.

[.screenshot]
image::images/sync-rules-save-and-validate-draft.png[Save and validate draft]

After a successful validation you can apply your rules so they'll be executed in the next sync.

The following conditions can be covered by basic sync rules:

1. The apartment with id '1234' should be included regardless of the first two conditions
2. Every apartment should have at least three bedrooms
3. The apartments should not be more expensive than 1500 per month

[.screenshot]
image::images/sync-rules-rules-fulfilling-properties.png[Rules fulfilling the conditions]
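
For reference, one way to express these three conditions as policy/field/rule/value combinations is sketched below. This mirrors the screenshot only approximately, and the evaluation order matters, as described in the note that follows:

[source, python]
----
# Illustrative only: one possible set of basic sync rules for the example.
basic_rules = [
    {"policy": "include", "field": "id", "rule": "equals", "value": "1234"},
    {"policy": "exclude", "field": "bedrooms", "rule": "<", "value": "3"},
    {"policy": "exclude", "field": "price", "rule": ">", "value": "1500"},
]
----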

[NOTE]
====
Remember that order matters for basic sync rules.
You may get different results for a different ordering.
====

[discrete#es-example-interplay-advanced-rules]
====== Advanced sync rules example

We only want to include apartments which are located in Portugal or Spain.
We need to use advanced sync rules here, because we're dealing with deeply nested objects.

Let's assume that the apartment data is stored inside a MongoDB instance.
For MongoDB, our advanced sync rules support https://www.mongodb.com/docs/manual/core/aggregation-pipeline/[aggregation pipelines^], among other things.
An aggregation pipeline to only select properties located in Portugal or Spain looks like this:

[source, js]
----
[
  {
    "$match": {
      "$or": [
        {
          "address.country_information.country": "Portugal"
        },
        {
          "address.country_information.country": "Spain"
        }
      ]
    }
  }
]
----
// NOTCONSOLE

To create these advanced sync rules, navigate to the sync rules creation dialog and select the 'Advanced rules' tab.
You can now paste your aggregation pipeline into the input field under `aggregate.pipeline`:

[.screenshot]
image::images/sync-rules-paste-aggregation-pipeline.png[Paste aggregation pipeline]
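
If you prefer to prepare the snippet outside the UI first, the structure could look roughly like this; treat the exact `aggregate.pipeline` nesting as an assumption based on the field label shown in the UI, and check the MongoDB connector reference for the authoritative format:

[source, python]
----
# Sketch of the advanced rules payload with the aggregation pipeline nested
# under `aggregate.pipeline` (assumed structure, shown as a Python dict).
advanced_rules = {
    "aggregate": {
        "pipeline": [
            {
                "$match": {
                    "$or": [
                        {"address.country_information.country": "Portugal"},
                        {"address.country_information.country": "Spain"},
                    ]
                }
            }
        ]
    }
}
----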

Once validated, apply these rules.
The following screenshot shows the applied sync rules, which will be executed in the next sync:

[.screenshot]
image::images/sync-rules-advanced-rules-appeared.png[Advanced sync rules appeared]

After a successful sync you can expand the sync details to see which rules were applied:

[.screenshot]
image::images/sync-rules-applied-rules-during-sync.png[Applied rules during sync]

[WARNING]
====
Active sync rules can become invalid when changed outside of the UI.
Sync jobs with invalid rules will fail.
One workaround is to revalidate the draft rules and override the invalid active rules.
====