[[elasticsearch-intro]]
== What is {es}?

{es-repo}[{es}] is a distributed search and analytics engine, scalable data store, and vector database built on Apache Lucene.
It's optimized for speed and relevance on production-scale workloads.
Use {es} to search, index, store, and analyze data of all shapes and sizes in near real time.

[TIP]
====
{es} has a lot of features. Explore the full list on the https://www.elastic.co/elasticsearch/features[product webpage^].
====

{es} is the heart of the {estc-welcome-current}/stack-components.html[Elastic Stack] and powers the Elastic https://www.elastic.co/enterprise-search[Search], https://www.elastic.co/observability[Observability], and https://www.elastic.co/security[Security] solutions.

{es} is used for a wide and growing range of use cases. Here are a few examples:

* *Monitor log and event data*. Store logs, metrics, and event data for observability and security information and event management (SIEM).
* *Build search applications*. Add search capabilities to apps or websites, or build enterprise search engines over your organization's internal data sources.
* *Vector database*. Store and search vectorized data, and create vector embeddings with built-in and third-party natural language processing (NLP) models.
* *Retrieval augmented generation (RAG)*. Use {es} as a retrieval engine to augment Generative AI models.
* *Application and security monitoring*. Monitor and analyze application performance and security data effectively.
* *Machine learning*. Use {ml} to automatically model the behavior of your data in real time.

This is just a sample of search, observability, and security use cases enabled by {es}.
Refer to our https://www.elastic.co/customers/success-stories[customer success stories] for concrete examples across a range of industries.

// Link to demos, search labs chatbots
[discrete]
[[elasticsearch-intro-elastic-stack]]
.What is the Elastic Stack?
*******************************
{es} is the core component of the Elastic Stack, a suite of products for collecting, storing, searching, and visualizing data.
https://www.elastic.co/guide/en/starting-with-the-elasticsearch-platform-and-its-solutions/current/stack-components.html[Learn more about the Elastic Stack].
*******************************

// TODO: Remove once we've moved Stack Overview to a subpage?
[discrete]
[[elasticsearch-intro-deploy]]
=== Deployment options

To use {es}, you need a running instance of the {es} service.
You can deploy {es} in various ways:

* <<run-elasticsearch-locally,*Local dev*>>. Get started quickly with a minimal local Docker setup.
* {cloud}/ec-getting-started-trial.html[*Elastic Cloud*]. {es} is available as part of our hosted Elastic Stack offering, deployed in the cloud with your provider of choice. Sign up for a https://cloud.elastic.co/registration[14-day free trial].
* {serverless-docs}/general/sign-up-trial[*Elastic Cloud Serverless* (technical preview)]. Create serverless projects for autoscaled and fully managed {es} deployments. Sign up for a https://cloud.elastic.co/serverless-registration[14-day free trial].

**Advanced deployment options**

* <<elasticsearch-deployment-options,*Self-managed*>>. Install, configure, and run {es} on your own premises.
* {ece-ref}/Elastic-Cloud-Enterprise-overview.html[*Elastic Cloud Enterprise*]. Deploy Elastic Cloud on public or private clouds, virtual machines, or your own premises.
* {eck-ref}/k8s-overview.html[*Elastic Cloud on Kubernetes*]. Deploy Elastic Cloud on Kubernetes.

[discrete]
[[elasticsearch-next-steps]]
=== Learn more

Some resources to help you get started:

* <<getting-started, Quickstart>>. A beginner's guide to deploying your first {es} instance, indexing data, and running queries.
* https://elastic.co/webinars/getting-started-elasticsearch[Webinar: Introduction to {es}]. Register for our live webinars to learn directly from {es} experts.
* https://www.elastic.co/search-labs[Elastic Search Labs]. Tutorials and blogs that explore AI-powered search using the latest {es} features.
** Follow our tutorial https://www.elastic.co/search-labs/tutorials/search-tutorial/welcome[to build a hybrid search solution in Python].
** Check out the https://github.com/elastic/elasticsearch-labs?tab=readme-ov-file#elasticsearch-examples--apps[`elasticsearch-labs` repository] for a range of Python notebooks and apps for various use cases.

// new html page
[[documents-indices]]
=== Indices, documents, and fields
++++
<titleabbrev>Indices and documents</titleabbrev>
++++

The index is the fundamental unit of storage in {es}, a logical namespace for storing data that share similar characteristics.
After you have {es} <<elasticsearch-intro-deploy,deployed>>, you'll get started by creating an index to store your data.

[TIP]
====
A closely related concept is a <<data-streams,data stream>>.
This index abstraction is optimized for append-only time-series data, and is made up of hidden, auto-generated backing indices.
If you're working with time-series data, we recommend the {observability-guide}[Elastic Observability] solution.
====

Some key facts about indices:

* An index is a collection of documents
* An index has a unique name
* An index can also be referred to by an alias
* An index has a mapping that defines the schema of its documents
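
For example, creating an index can be a single request. The index name below is purely illustrative:

[source,js]
----
// "my-new-index" is a hypothetical name used for illustration
PUT /my-new-index
----
// NOTCONSOLE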

[discrete]
[[elasticsearch-intro-documents-fields]]
==== Documents and fields

{es} serializes and stores data in the form of JSON documents.
A document is a set of fields, which are key-value pairs that contain your data.
Each document has a unique ID, which you can create or have {es} auto-generate.

A simple {es} document might look like this:

[source,js]
----
{
  "_index": "my-first-elasticsearch-index",
  "_id": "DyFpo5EBxE8fzbb95DOa",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "email": "john@smith.com",
    "first_name": "John",
    "last_name": "Smith",
    "info": {
      "bio": "Eco-warrior and defender of the weak",
      "age": 25,
      "interests": [
        "dolphins",
        "whales"
      ]
    },
    "join_date": "2024/05/01"
  }
}
----
// NOTCONSOLE
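
For illustration, the `_source` portion of that document could have been added with an indexing request like the one below. The index name and ID are placeholders; to have {es} auto-generate an ID, you would instead `POST` to `/my-first-elasticsearch-index/_doc` without one.

[source,js]
----
// Index name and document ID below are illustrative placeholders
PUT /my-first-elasticsearch-index/_doc/DyFpo5EBxE8fzbb95DOa
{
  "email": "john@smith.com",
  "first_name": "John",
  "last_name": "Smith",
  "info": {
    "bio": "Eco-warrior and defender of the weak",
    "age": 25,
    "interests": [ "dolphins", "whales" ]
  },
  "join_date": "2024/05/01"
}
----
// NOTCONSOLE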

[discrete]
[[elasticsearch-intro-documents-fields-data-metadata]]
==== Data and metadata

An indexed document contains data and metadata.
In {es}, metadata fields are prefixed with an underscore.
The most important metadata fields are:

* `_source`. Contains the original JSON document.
* `_index`. The name of the index where the document is stored.
* `_id`. The document's ID. IDs must be unique per index.
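
The example response shown earlier, including these metadata fields, is the kind of result returned by a document GET request such as the following (index name and ID are the same placeholders as above):

[source,js]
----
GET /my-first-elasticsearch-index/_doc/DyFpo5EBxE8fzbb95DOa
----
// NOTCONSOLE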

[discrete]
[[elasticsearch-intro-documents-fields-mappings]]
==== Mappings and data types

Each index has a <<mapping,mapping>> or schema for how the fields in your documents are indexed.
A mapping defines the <<mapping-types,data type>> for each field, how the field should be indexed,
and how it should be stored.

When adding documents to {es}, you have two options for mappings:

* <<mapping-dynamic, Dynamic mapping>>. Let {es} automatically detect the data types and create the mappings for you. This is great for getting started quickly.
* <<mapping-explicit, Explicit mapping>>. Define the mappings up front by specifying data types for each field. Recommended for production use cases.

[TIP]
====
You can use a combination of dynamic and explicit mapping on the same index.
This is useful when you have a mix of known and unknown fields in your data.
====
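
As a sketch of explicit mapping, the request below creates an index and declares types for a few fields up front. The index name, field names, and date format are illustrative and chosen to match the earlier example document:

[source,js]
----
// Hypothetical index with an explicit mapping for a few fields
PUT /my-explicit-index
{
  "mappings": {
    "properties": {
      "email":      { "type": "keyword" },
      "first_name": { "type": "text" },
      "age":        { "type": "integer" },
      "join_date":  { "type": "date", "format": "yyyy/MM/dd" }
    }
  }
}
----
// NOTCONSOLE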

// New html page
[[search-analyze]]
=== Search and analyze

While you can use {es} as a document store and retrieve documents and their
metadata, the real power comes from being able to easily access the full suite
of search capabilities built on the Apache Lucene search engine library.

{es} provides a simple, coherent REST API for managing your cluster and indexing
and searching your data. For testing purposes, you can easily submit requests
directly from the command line or through the Developer Console in {kib}. From
your applications, you can use the
https://www.elastic.co/guide/en/elasticsearch/client/index.html[{es} client]
for your language of choice: Java, JavaScript, Go, .NET, PHP, Perl, Python
or Ruby.
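
For instance, checking cluster health is a single REST call that you can issue from the command line, the Console in {kib}, or any client:

[source,js]
----
GET /_cluster/health
----
// NOTCONSOLE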

[discrete]
[[search-data]]
==== Searching your data

The {es} REST APIs support structured queries, full-text queries, and complex
queries that combine the two. Structured queries are
similar to the types of queries you can construct in SQL. For example, you
could search the `gender` and `age` fields in your `employee` index and sort the
matches by the `hire_date` field. Full-text queries find all documents that
match the query string and return them sorted by _relevance_&mdash;how good a
match they are for your search terms.

In addition to searching for individual terms, you can perform phrase searches,
similarity searches, and prefix searches, and get autocomplete suggestions.

Have geospatial or other numerical data that you want to search? {es} indexes
non-textual data in optimized data structures that support
high-performance geo and numerical queries.

You can access all of these search capabilities using {es}'s
comprehensive JSON-style query language (<<query-dsl, Query DSL>>). You can also
construct <<sql-overview, SQL-style queries>> to search and aggregate data
natively inside {es}, and JDBC and ODBC drivers enable a broad range of
third-party applications to interact with {es} via SQL.
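
To make this concrete, here is a sketch of a Query DSL request against the hypothetical `employee` index above. It combines a full-text `match` on an assumed `about` field with a structured `range` filter on `age`, and sorts the results by `hire_date`:

[source,js]
----
GET /employee/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "about": "distributed systems" }    // full-text, relevance-ranked; "about" is an assumed field
      },
      "filter": {
        "range": { "age": { "gte": 25, "lte": 40 } }    // structured, yes/no match
      }
    }
  },
  "sort": [
    { "hire_date": "desc" }
  ]
}
----
// NOTCONSOLE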

[discrete]
[[analyze-data]]
==== Analyzing your data

{es} aggregations enable you to build complex summaries of your data and gain
insight into key metrics, patterns, and trends. Instead of just finding the
proverbial “needle in a haystack”, aggregations enable you to answer questions
like:

* How many needles are in the haystack?
* What is the average length of the needles?
* What is the median length of the needles, broken down by manufacturer?
* How many needles were added to the haystack in each of the last six months?

You can also use aggregations to answer more subtle questions, such as:

* What are your most popular needle manufacturers?
* Are there any unusual or anomalous clumps of needles?

Because aggregations leverage the same data structures used for search, they are
also very fast. This enables you to analyze and visualize your data in real time.
Your reports and dashboards update as your data changes so you can take action
based on the latest information.

What's more, aggregations operate alongside search requests. You can search
documents, filter results, and perform analytics at the same time, on the same
data, in a single request. And because aggregations are calculated in the
context of a particular search, you're not just displaying a count of all
size 70 needles, you're displaying a count of the size 70 needles
that match your users' search criteria--for example, all size 70 _non-stick
embroidery_ needles.
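
Continuing the needle analogy, a single request could both filter to the needles that match a search and summarize them. The index and field names below are invented for the example:

[source,js]
----
// "needles", "description", "manufacturer", and "length_mm" are invented names
GET /needles/_search
{
  "query": {
    "match": { "description": "non-stick embroidery" }
  },
  "aggs": {
    "by_manufacturer": {
      "terms": { "field": "manufacturer" },
      "aggs": {
        "average_length": { "avg": { "field": "length_mm" } }
      }
    }
  }
}
----
// NOTCONSOLE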

[[scalability]]
=== Scalability and resilience

{es} is built to be always available and to scale with your needs. It does this
by being distributed by nature. You can add servers (nodes) to a cluster to
increase capacity, and {es} automatically distributes your data and query load
across all of the available nodes. No need to overhaul your application: {es}
knows how to balance multi-node clusters to provide scale and high availability.
The more nodes, the merrier.

How does this work? Under the covers, an {es} index is really just a logical
grouping of one or more physical shards, where each shard is actually a
self-contained index. By distributing the documents in an index across multiple
shards, and distributing those shards across multiple nodes, {es} can ensure
redundancy, which both protects against hardware failures and increases
query capacity as nodes are added to a cluster. As the cluster grows (or shrinks),
{es} automatically migrates shards to rebalance the cluster.

There are two types of shards: primaries and replicas. Each document in an index
belongs to one primary shard. A replica shard is a copy of a primary shard.
Replicas provide redundant copies of your data to protect against hardware
failure and increase capacity to serve read requests
like searching or retrieving a document.

The number of primary shards in an index is fixed at the time that an index is
created, but the number of replica shards can be changed at any time, without
interrupting indexing or query operations.
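
For example (index name and values are placeholders), you set the primary shard count when you create the index and can adjust the replica count later with the update index settings API:

[source,js]
----
// Primary shard count is fixed at index creation
PUT /my-index
{
  "settings": {
    "index.number_of_shards": 3,
    "index.number_of_replicas": 1
  }
}

// Replica count can be changed at any time
PUT /my-index/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}
----
// NOTCONSOLE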

[discrete]
[[it-depends]]
==== Shard size and number of shards

There are a number of performance considerations and trade-offs with respect
to shard size and the number of primary shards configured for an index. The more
shards, the more overhead there is simply in maintaining those indices. The
larger the shard size, the longer it takes to move shards around when {es}
needs to rebalance a cluster.

Querying lots of small shards makes the processing per shard faster, but more
queries means more overhead, so querying a smaller
number of larger shards might be faster. In short...it depends.

As a starting point:

* Aim to keep the average shard size between a few GB and a few tens of GB. For
use cases with time-based data, it is common to see shards in the 20GB to 40GB
range.
* Avoid the gazillion shards problem. The number of shards a node can hold is
proportional to the available heap space. As a general rule, the number of
shards per GB of heap space should be less than 20.

The best way to determine the optimal configuration for your use case is
through https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing[testing with your own data and queries].
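
A quick way to inspect your current shard layout and sizes while you test is the cat shards API:

[source,js]
----
GET /_cat/shards?v=true
----
// NOTCONSOLE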

[discrete]
[[disaster-ccr]]
==== Disaster recovery

A cluster's nodes need good, reliable connections to each other. To provide
better connections, you typically co-locate the nodes in the same data center or
nearby data centers. However, to maintain high availability, you
also need to avoid any single point of failure. In the event of a major outage
in one location, servers in another location need to be able to take over. The
answer? {ccr-cap} (CCR).

CCR provides a way to automatically synchronize indices from your primary cluster
to a secondary remote cluster that can serve as a hot backup. If the primary
cluster fails, the secondary cluster can take over. You can also use CCR to
create secondary clusters to serve read requests in geo-proximity to your users.

{ccr-cap} is active-passive. The index on the primary cluster is
the active leader index and handles all write requests. Indices replicated to
secondary clusters are read-only followers.
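
As a rough sketch, and assuming the remote cluster connection is already configured, a follower index on the secondary cluster is created with a request along these lines (the cluster alias and index names are placeholders):

[source,js]
----
// Run on the secondary cluster; "primary-cluster" and the index names are placeholders
PUT /my-follower-index/_ccr/follow
{
  "remote_cluster": "primary-cluster",
  "leader_index": "my-leader-index"
}
----
// NOTCONSOLE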

[discrete]
[[admin]]
==== Security, management, and monitoring

As with any enterprise system, you need tools to secure, manage, and
monitor your {es} clusters. Security, monitoring, and administrative features
that are integrated into {es} enable you to use {kibana-ref}/introduction.html[{kib}]
as a control center for managing a cluster. Features like <<downsampling, downsampling>>
and <<index-lifecycle-management, index lifecycle management>>
help you intelligently manage your data over time.

Refer to <<monitor-elasticsearch-cluster>> for more information.