mirror of
https://github.com/elastic/elasticsearch.git
synced 2025-04-25 07:37:19 -04:00
318 lines
No EOL
15 KiB
Text
318 lines
No EOL
15 KiB
Text
[[elasticsearch-intro]]
|
||
== What is {es}?
|
||
|
||
{es-repo}[{es}] is a distributed search and analytics engine, scalable data store, and vector database built on Apache Lucene.
|
||
It's optimized for speed and relevance on production-scale workloads.
|
||
Use {es} to search, index, store, and analyze data of all shapes and sizes in near real time.
|
||
|
||
[TIP]
|
||
====
|
||
{es} has a lot of features. Explore the full list on the https://www.elastic.co/elasticsearch/features[product webpage^].
|
||
====
|
||
|
||
{es} is the heart of the {estc-welcome-current}/stack-components.html[Elastic Stack] and powers the Elastic https://www.elastic.co/enterprise-search[Search], https://www.elastic.co/observability[Observability] and https://www.elastic.co/security[Security] solutions.
|
||
|
||
{es} is used for a wide and growing range of use cases. Here are a few examples:
|
||
|
||
* *Monitor log and event data*. Store logs, metrics, and event data for observability and security information and event management (SIEM).
|
||
* *Build search applications*. Add search capabilities to apps or websites, or build enterprise search engines over your organization's internal data sources.
|
||
* *Vector database*. Store and search vectorized data, and create vector embeddings with built-in and third-party natural language processing (NLP) models.
|
||
* *Retrieval augmented generation (RAG)*. Use {es} as a retrieval engine to augment Generative AI models.
|
||
* *Application and security monitoring*. Monitor and analyze application performance and security data effectively.
|
||
* *Machine learning*. Use {ml} to automatically model the behavior of your data in real-time.
|
||
|
||
This is just a sample of search, observability, and security use cases enabled by {es}.
|
||
Refer to our https://www.elastic.co/customers/success-stories[customer success stories] for concrete examples across a range of industries.
|
||
// Link to demos, search labs chatbots
|
||
|
||
[discrete]
|
||
[[elasticsearch-intro-elastic-stack]]
|
||
.What is the Elastic Stack?
|
||
*******************************
|
||
{es} is the core component of the Elastic Stack, a suite of products for collecting, storing, searching, and visualizing data.
|
||
https://www.elastic.co/guide/en/starting-with-the-elasticsearch-platform-and-its-solutions/current/stack-components.html[Learn more about the Elastic Stack].
|
||
*******************************
|
||
// TODO: Remove once we've moved Stack Overview to a subpage?
|
||
|
||
[discrete]
|
||
[[elasticsearch-intro-deploy]]
|
||
=== Deployment options
|
||
|
||
To use {es}, you need a running instance of the {es} service.
|
||
You can deploy {es} in various ways:
|
||
|
||
* <<run-elasticsearch-locally,*Local dev*>>. Get started quickly with a minimal local Docker setup.
|
||
* {cloud}/ec-getting-started-trial.html[*Elastic Cloud*]. {es} is available as part of our hosted Elastic Stack offering, deployed in the cloud with your provider of choice. Sign up for a https://cloud.elastic.co/registration[14 day free trial].
|
||
* {serverless-docs}/general/sign-up-trial[*Elastic Cloud Serverless* (technical preview)]. Create serverless projects for autoscaled and fully managed {es} deployments. Sign up for a https://cloud.elastic.co/serverless-registration[14 day free trial].
|
||
|
||
**Advanced deployment options**
|
||
|
||
* <<elasticsearch-deployment-options,*Self-managed*>>. Install, configure, and run {es} on your own premises.
|
||
* {ece-ref}/Elastic-Cloud-Enterprise-overview.html[*Elastic Cloud Enterprise*]. Deploy Elastic Cloud on public or private clouds, virtual machines, or your own premises.
|
||
* {eck-ref}/k8s-overview.html[*Elastic Cloud on Kubernetes*]. Deploy Elastic Cloud on Kubernetes.
|
||
|
||
[discrete]
|
||
[[elasticsearch-next-steps]]
|
||
=== Learn more
|
||
|
||
Some resources to help you get started:
|
||
|
||
* <<getting-started, Quickstart>>. A beginner's guide to deploying your first {es} instance, indexing data, and running queries.
|
||
* https://elastic.co/webinars/getting-started-elasticsearch[Webinar: Introduction to {es}]. Register for our live webinars to learn directly from {es} experts.
|
||
* https://www.elastic.co/search-labs[Elastic Search Labs]. Tutorials and blogs that explore AI-powered search using the latest {es} features.
|
||
** Follow our tutorial https://www.elastic.co/search-labs/tutorials/search-tutorial/welcome[to build a hybrid search solution in Python].
|
||
** Check out the https://github.com/elastic/elasticsearch-labs?tab=readme-ov-file#elasticsearch-examples--apps[`elasticsearch-labs` repository] for a range of Python notebooks and apps for various use cases.
|
||
|
||
// new html page
|
||
[[documents-indices]]
|
||
=== Indices, documents, and fields
|
||
++++
|
||
<titleabbrev>Indices and documents</titleabbrev>
|
||
++++
|
||
|
||
The index is the fundamental unit of storage in {es}, a logical namespace for storing data that share similar characteristics.
|
||
After you have {es} <<elasticsearch-intro-deploy,deployed>>, you'll get started by creating an index to store your data.
|
||
|
||
[TIP]
|
||
====
|
||
A closely related concept is a <<data-streams,data stream>>.
|
||
This index abstraction is optimized for append-only time-series data, and is made up of hidden, auto-generated backing indices.
|
||
If you're working with time-series data, we recommend the {observability-guide}[Elastic Observability] solution.
|
||
====
|
||
|
||
Some key facts about indices:
|
||
|
||
* An index is a collection of documents
|
||
* An index has a unique name
|
||
* An index can also be referred to by an alias
|
||
* An index has a mapping that defines the schema of its documents
|
||
|
||
[discrete]
|
||
[[elasticsearch-intro-documents-fields]]
|
||
==== Documents and fields
|
||
|
||
{es} serializes and stores data in the form of JSON documents.
|
||
A document is a set of fields, which are key-value pairs that contain your data.
|
||
Each document has a unique ID, which you can create or have {es} auto-generate.
|
||
|
||
A simple {es} document might look like this:
|
||
|
||
[source,js]
|
||
----
|
||
{
|
||
"_index": "my-first-elasticsearch-index",
|
||
"_id": "DyFpo5EBxE8fzbb95DOa",
|
||
"_version": 1,
|
||
"_seq_no": 0,
|
||
"_primary_term": 1,
|
||
"found": true,
|
||
"_source": {
|
||
"email": "john@smith.com",
|
||
"first_name": "John",
|
||
"last_name": "Smith",
|
||
"info": {
|
||
"bio": "Eco-warrior and defender of the weak",
|
||
"age": 25,
|
||
"interests": [
|
||
"dolphins",
|
||
"whales"
|
||
]
|
||
},
|
||
"join_date": "2024/05/01"
|
||
}
|
||
}
|
||
----
|
||
// NOTCONSOLE
|
||
|
||
[discrete]
|
||
[[elasticsearch-intro-documents-fields-data-metadata]]
|
||
==== Data and metadata
|
||
|
||
An indexed document contains data and metadata.
|
||
In {es}, metadata fields are prefixed with an underscore.
|
||
|
||
The most important metadata fields are:
|
||
|
||
* `_source`. Contains the original JSON document.
|
||
* `_index`. The name of the index where the document is stored.
|
||
* `_id`. The document's ID. IDs must be unique per index.
|
||
|
||
[discrete]
|
||
[[elasticsearch-intro-documents-fields-mappings]]
|
||
==== Mappings and data types
|
||
|
||
Each index has a <<mapping,mapping>> or schema for how the fields in your documents are indexed.
|
||
A mapping defines the <<mapping-types,data type>> for each field, how the field should be indexed,
|
||
and how it should be stored.
|
||
When adding documents to {es}, you have two options for mappings:
|
||
|
||
* <<mapping-dynamic, Dynamic mapping>>. Let {es} automatically detect the data types and create the mappings for you. This is great for getting started quickly.
|
||
* <<mapping-explicit, Explicit mapping>>. Define the mappings up front by specifying data types for each field. Recommended for production use cases.
|
||
|
||
[TIP]
|
||
====
|
||
You can use a combination of dynamic and explicit mapping on the same index.
|
||
This is useful when you have a mix of known and unknown fields in your data.
|
||
====
|
||
|
||
// New html page
|
||
[[search-analyze]]
|
||
=== Search and analyze
|
||
|
||
While you can use {es} as a document store and retrieve documents and their
|
||
metadata, the real power comes from being able to easily access the full suite
|
||
of search capabilities built on the Apache Lucene search engine library.
|
||
|
||
{es} provides a simple, coherent REST API for managing your cluster and indexing
|
||
and searching your data. For testing purposes, you can easily submit requests
|
||
directly from the command line or through the Developer Console in {kib}. From
|
||
your applications, you can use the
|
||
https://www.elastic.co/guide/en/elasticsearch/client/index.html[{es} client]
|
||
for your language of choice: Java, JavaScript, Go, .NET, PHP, Perl, Python
|
||
or Ruby.
|
||
|
||
[discrete]
|
||
[[search-data]]
|
||
==== Searching your data
|
||
|
||
The {es} REST APIs support structured queries, full text queries, and complex
|
||
queries that combine the two. Structured queries are
|
||
similar to the types of queries you can construct in SQL. For example, you
|
||
could search the `gender` and `age` fields in your `employee` index and sort the
|
||
matches by the `hire_date` field. Full-text queries find all documents that
|
||
match the query string and return them sorted by _relevance_—how good a
|
||
match they are for your search terms.
|
||
|
||
In addition to searching for individual terms, you can perform phrase searches,
|
||
similarity searches, and prefix searches, and get autocomplete suggestions.
|
||
|
||
Have geospatial or other numerical data that you want to search? {es} indexes
|
||
non-textual data in optimized data structures that support
|
||
high-performance geo and numerical queries.
|
||
|
||
You can access all of these search capabilities using {es}'s
|
||
comprehensive JSON-style query language (<<query-dsl, Query DSL>>). You can also
|
||
construct <<sql-overview, SQL-style queries>> to search and aggregate data
|
||
natively inside {es}, and JDBC and ODBC drivers enable a broad range of
|
||
third-party applications to interact with {es} via SQL.
|
||
|
||
[discrete]
|
||
[[analyze-data]]
|
||
==== Analyzing your data
|
||
|
||
{es} aggregations enable you to build complex summaries of your data and gain
|
||
insight into key metrics, patterns, and trends. Instead of just finding the
|
||
proverbial “needle in a haystack”, aggregations enable you to answer questions
|
||
like:
|
||
|
||
* How many needles are in the haystack?
|
||
* What is the average length of the needles?
|
||
* What is the median length of the needles, broken down by manufacturer?
|
||
* How many needles were added to the haystack in each of the last six months?
|
||
|
||
You can also use aggregations to answer more subtle questions, such as:
|
||
|
||
* What are your most popular needle manufacturers?
|
||
* Are there any unusual or anomalous clumps of needles?
|
||
|
||
Because aggregations leverage the same data-structures used for search, they are
|
||
also very fast. This enables you to analyze and visualize your data in real time.
|
||
Your reports and dashboards update as your data changes so you can take action
|
||
based on the latest information.
|
||
|
||
What’s more, aggregations operate alongside search requests. You can search
|
||
documents, filter results, and perform analytics at the same time, on the same
|
||
data, in a single request. And because aggregations are calculated in the
|
||
context of a particular search, you’re not just displaying a count of all
|
||
size 70 needles, you’re displaying a count of the size 70 needles
|
||
that match your users' search criteria--for example, all size 70 _non-stick
|
||
embroidery_ needles.
|
||
|
||
[[scalability]]
|
||
=== Scalability and resilience
|
||
|
||
{es} is built to be always available and to scale with your needs. It does this
|
||
by being distributed by nature. You can add servers (nodes) to a cluster to
|
||
increase capacity and {es} automatically distributes your data and query load
|
||
across all of the available nodes. No need to overhaul your application, {es}
|
||
knows how to balance multi-node clusters to provide scale and high availability.
|
||
The more nodes, the merrier.
|
||
|
||
How does this work? Under the covers, an {es} index is really just a logical
|
||
grouping of one or more physical shards, where each shard is actually a
|
||
self-contained index. By distributing the documents in an index across multiple
|
||
shards, and distributing those shards across multiple nodes, {es} can ensure
|
||
redundancy, which both protects against hardware failures and increases
|
||
query capacity as nodes are added to a cluster. As the cluster grows (or shrinks),
|
||
{es} automatically migrates shards to rebalance the cluster.
|
||
|
||
There are two types of shards: primaries and replicas. Each document in an index
|
||
belongs to one primary shard. A replica shard is a copy of a primary shard.
|
||
Replicas provide redundant copies of your data to protect against hardware
|
||
failure and increase capacity to serve read requests
|
||
like searching or retrieving a document.
|
||
|
||
The number of primary shards in an index is fixed at the time that an index is
|
||
created, but the number of replica shards can be changed at any time, without
|
||
interrupting indexing or query operations.
|
||
|
||
[discrete]
|
||
[[it-depends]]
|
||
==== Shard size and number of shards
|
||
|
||
There are a number of performance considerations and trade offs with respect
|
||
to shard size and the number of primary shards configured for an index. The more
|
||
shards, the more overhead there is simply in maintaining those indices. The
|
||
larger the shard size, the longer it takes to move shards around when {es}
|
||
needs to rebalance a cluster.
|
||
|
||
Querying lots of small shards makes the processing per shard faster, but more
|
||
queries means more overhead, so querying a smaller
|
||
number of larger shards might be faster. In short...it depends.
|
||
|
||
As a starting point:
|
||
|
||
* Aim to keep the average shard size between a few GB and a few tens of GB. For
|
||
use cases with time-based data, it is common to see shards in the 20GB to 40GB
|
||
range.
|
||
|
||
* Avoid the gazillion shards problem. The number of shards a node can hold is
|
||
proportional to the available heap space. As a general rule, the number of
|
||
shards per GB of heap space should be less than 20.
|
||
|
||
The best way to determine the optimal configuration for your use case is
|
||
through https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing[
|
||
testing with your own data and queries].
|
||
|
||
[discrete]
|
||
[[disaster-ccr]]
|
||
==== Disaster recovery
|
||
|
||
A cluster's nodes need good, reliable connections to each other. To provide
|
||
better connections, you typically co-locate the nodes in the same data center or
|
||
nearby data centers. However, to maintain high availability, you
|
||
also need to avoid any single point of failure. In the event of a major outage
|
||
in one location, servers in another location need to be able to take over. The
|
||
answer? {ccr-cap} (CCR).
|
||
|
||
CCR provides a way to automatically synchronize indices from your primary cluster
|
||
to a secondary remote cluster that can serve as a hot backup. If the primary
|
||
cluster fails, the secondary cluster can take over. You can also use CCR to
|
||
create secondary clusters to serve read requests in geo-proximity to your users.
|
||
|
||
{ccr-cap} is active-passive. The index on the primary cluster is
|
||
the active leader index and handles all write requests. Indices replicated to
|
||
secondary clusters are read-only followers.
|
||
|
||
[discrete]
|
||
[[admin]]
|
||
==== Security, management, and monitoring
|
||
|
||
As with any enterprise system, you need tools to secure, manage, and
|
||
monitor your {es} clusters. Security, monitoring, and administrative features
|
||
that are integrated into {es} enable you to use {kibana-ref}/introduction.html[{kib}]
|
||
as a control center for managing a cluster. Features like <<downsampling,
|
||
downsampling>> and <<index-lifecycle-management, index lifecycle management>>
|
||
help you intelligently manage your data over time.
|
||
|
||
Refer to <<monitor-elasticsearch-cluster>> for more information. |