[[elasticsearch-intro]]
== {es} basics

This guide covers the core concepts you need to understand to get started with {es}.
If you'd prefer to start working with {es} right away, set up a <> and jump to <>.

This guide covers the following topics:

* <>: Learn about {es} and some of its main use cases.
* <>: Understand your options for deploying {es} in different environments, including a fast local development setup.
* <>: Understand {es}'s most important primitives and how it stores data.
* <>: Understand your options for ingesting data into {es}.
* <>: Understand your options for searching and analyzing data in {es}.
* <>: Understand the basic concepts required for moving your {es} deployment to production.

[[elasticsearch-intro-what-is-es]]
=== What is {es}?

{es-repo}[{es}] is a distributed search and analytics engine, scalable data store, and vector database built on Apache Lucene.
It's optimized for speed and relevance on production-scale workloads.
Use {es} to search, index, store, and analyze data of all shapes and sizes in near real time.

{es} is the heart of the {estc-welcome-current}/stack-components.html[Elastic Stack].
Combined with https://www.elastic.co/kibana[{kib}], it powers the following Elastic solutions:

* https://www.elastic.co/observability[Observability]
* https://www.elastic.co/enterprise-search[Search]
* https://www.elastic.co/security[Security]

[TIP]
====
{es} has a lot of features. Explore the full list on the https://www.elastic.co/elasticsearch/features[product webpage^].
====

[discrete]
[[elasticsearch-intro-elastic-stack]]
.What is the Elastic Stack?
*******************************
{es} is the core component of the Elastic Stack, a suite of products for collecting, storing, searching, and visualizing data.

{estc-welcome-current}/stack-components.html[Learn more about the Elastic Stack].
*******************************

[discrete]
[[elasticsearch-intro-use-cases]]
==== Use cases

{es} is used for a wide and growing range of use cases. Here are a few examples:

**Observability**

* *Logs, metrics, and traces*: Collect, store, and analyze logs, metrics, and traces from applications, systems, and services.
* *Application performance monitoring (APM)*: Monitor and analyze the performance of business-critical software applications.
* *Real user monitoring (RUM)*: Monitor, quantify, and analyze user interactions with web applications.
* *OpenTelemetry*: Reuse your existing instrumentation to send telemetry data to the Elastic Stack using the OpenTelemetry standard.

**Search**

* *Full-text search*: Build a fast, relevant full-text search solution using inverted indexes, tokenization, and text analysis.
* *Vector database*: Store and search vectorized data, and create vector embeddings with built-in and third-party natural language processing (NLP) models.
* *Semantic search*: Understand the intent and contextual meaning behind search queries using tools like synonyms, dense vector embeddings, and learned sparse query-document expansion.
* *Hybrid search*: Combine full-text search with vector search using state-of-the-art ranking algorithms.
* *Build search experiences*: Add hybrid search capabilities to apps or websites, or build enterprise search engines over your organization's internal data sources.
* *Retrieval augmented generation (RAG)*: Use {es} as a retrieval engine to supplement generative AI models with more relevant, up-to-date, or proprietary data for a range of use cases.
* *Geospatial search*: Search for locations and calculate spatial relationships using geospatial queries.

**Security**

* *Security information and event management (SIEM)*: Collect, store, and analyze security data from applications, systems, and services.
* *Endpoint security*: Monitor and analyze endpoint security data.
* *Threat hunting*: Search and analyze data to detect and respond to security threats.

This is just a sample of search, observability, and security use cases enabled by {es}.
Refer to Elastic https://www.elastic.co/customers/success-stories[customer success stories] for concrete examples across a range of industries.

[[elasticsearch-intro-deploy]]
=== Run {es}

To use {es}, you need a running instance of the {es} service.
You can deploy {es} in various ways.

**Quick start option**

* <>: Get started quickly with a minimal local Docker setup for development and testing.

**Hosted options**

* {cloud}/ec-getting-started-trial.html[*Elastic Cloud Hosted*]: {es} is available as part of the hosted Elastic Stack offering, deployed in the cloud with your provider of choice. Sign up for a https://cloud.elastic.co/registration[14-day free trial].
* {serverless-docs}/general/sign-up-trial[*Elastic Cloud Serverless* (technical preview)]: Create serverless projects for autoscaled and fully managed {es} deployments. Sign up for a https://cloud.elastic.co/serverless-registration[14-day free trial].

**Advanced options**

* <>: Install, configure, and run {es} on your own premises.
* {ece-ref}/Elastic-Cloud-Enterprise-overview.html[*Elastic Cloud Enterprise*]: Deploy Elastic Cloud on public or private clouds, virtual machines, or your own premises.
* {eck-ref}/k8s-overview.html[*Elastic Cloud on Kubernetes*]: Deploy Elastic Cloud on Kubernetes.

// new html page
[[documents-indices]]
=== Indices, documents, and fields
++++
<titleabbrev>Indices and documents</titleabbrev>
++++

The index is the fundamental unit of storage in {es}, a logical namespace for storing data that share similar characteristics.
After you have {es} <>, you'll get started by creating an index to store your data.

An index is a collection of documents uniquely identified by a name or an <>.
This unique name is important because it's used to target the index in search queries and other operations.

[TIP]
====
A closely related concept is a <>.
This index abstraction is optimized for append-only timestamped data, and is made up of hidden, auto-generated backing indices.
If you're working with timestamped data, we recommend the {observability-guide}[Elastic Observability] solution for additional tools and optimized content.
====

[discrete]
[[elasticsearch-intro-documents-fields]]
==== Documents and fields

{es} serializes and stores data in the form of JSON documents.
A document is a set of fields, which are key-value pairs that contain your data.
Each document has a unique ID, which you can create or have {es} auto-generate.

A simple {es} document might look like this:

[source,js]
----
{
  "_index": "my-first-elasticsearch-index",
  "_id": "DyFpo5EBxE8fzbb95DOa",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "email": "john@smith.com",
    "first_name": "John",
    "last_name": "Smith",
    "info": {
      "bio": "Eco-warrior and defender of the weak",
      "age": 25,
      "interests": [
        "dolphins",
        "whales"
      ]
    },
    "join_date": "2024/05/01"
  }
}
----
// NOTCONSOLE

[discrete]
[[elasticsearch-intro-documents-fields-data-metadata]]
==== Metadata fields

An indexed document contains data and metadata.
<> are system fields that store information about the documents.
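For example, when you add a document, {es} returns that metadata along with a confirmation.
The following request (a minimal sketch, reusing the index name from the example document above) indexes one document and lets {es} auto-generate its ID; the response includes metadata fields such as `_index`, `_id`, and `_version`:

[source,console]
----
POST my-first-elasticsearch-index/_doc
{
  "email": "john@smith.com",
  "first_name": "John",
  "last_name": "Smith"
}
----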
In {es}, metadata fields are prefixed with an underscore.
For example, the following fields are metadata fields:

* `_index`: The name of the index where the document is stored.
* `_id`: The document's ID. IDs must be unique per index.

[discrete]
[[elasticsearch-intro-documents-fields-mappings]]
==== Mappings and data types

Each index has a <> or schema for how the fields in your documents are indexed.
A mapping defines the <> for each field, how the field should be indexed, and how it should be stored.
When adding documents to {es}, you have two options for mappings:

* <>: Let {es} automatically detect the data types and create the mappings for you. Dynamic mapping helps you get started quickly, but might yield suboptimal results for your specific use case due to automatic field type inference.
* <>: Define the mappings up front by specifying data types for each field. Recommended for production use cases, because you have full control over how your data is indexed to suit your specific use case.

[TIP]
====
You can use a combination of dynamic and explicit mapping on the same index.
This is useful when you have a mix of known and unknown fields in your data.
====

// New html page
[[es-ingestion-overview]]
=== Add data to {es}

There are multiple ways to ingest data into {es}.
The option that you choose depends on whether you're working with timestamped data or non-timestamped data, where the data is coming from, its complexity, and more.

[TIP]
====
You can load {kibana-ref}/connect-to-elasticsearch.html#_add_sample_data[sample data] into your {es} cluster using {kib}, to get started quickly.
====

[discrete]
[[es-ingestion-overview-general-content]]
==== General content

General content is data that does not have a timestamp.
This could be data like vector embeddings, website content, product catalogs, and more.
For general content, you have the following options for adding data to {es} indices:

* <>: Use the {es} <> to index documents directly, using the Dev Tools {kibana-ref}/console-kibana.html[Console], or cURL.
+
If you're building a website or app, then you can call Elasticsearch APIs using an https://www.elastic.co/guide/en/elasticsearch/client/index.html[{es} client] in the programming language of your choice.
If you use the Python client, then check out the `elasticsearch-labs` repo for various https://github.com/elastic/elasticsearch-labs/tree/main/notebooks/search/python-examples[example notebooks].
* {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[File upload]: Use the {kib} file uploader to index single files for one-off testing and exploration. The GUI guides you through setting up your index and field mappings.
* https://github.com/elastic/crawler[Web crawler]: Extract and index web page content into {es} documents.
* {enterprise-search-ref}/connectors.html[Connectors]: Sync data from various third-party data sources to create searchable, read-only replicas in {es}.

[discrete]
[[es-ingestion-overview-timestamped]]
==== Timestamped data

Timestamped data in {es} refers to datasets that include a timestamp field.
If you use the {ecs-ref}/ecs-reference.html[Elastic Common Schema (ECS)], this field is named `@timestamp`.
This could be data like logs, metrics, and traces.

For timestamped data, you have the following options for adding data to {es} data streams:

* {fleet-guide}/fleet-overview.html[Elastic Agent and Fleet]: The preferred way to index timestamped data.
Each Elastic Agent-based integration includes default ingestion rules, dashboards, and visualizations to start analyzing your data right away.
You can use the Fleet UI in {kib} to centrally manage Elastic Agents and their policies.
* {beats-ref}/beats-reference.html[Beats]: If your data source isn't supported by Elastic Agent, use Beats to collect and ship data to Elasticsearch. You install a separate Beat for each type of data to collect.
* {logstash-ref}/introduction.html[Logstash]: Logstash is an open source data collection engine with real-time pipelining capabilities that supports a wide variety of data sources. You might use this option because neither Elastic Agent nor Beats supports your data source. You can also use Logstash to persist incoming data, or if you need to send the data to multiple destinations.
* {cloud}/ec-ingest-guides.html[Language clients]: The linked tutorials demonstrate how to use {es} programming language clients to ingest data from an application. In these examples, {es} is running on Elastic Cloud, but the same principles apply to any {es} deployment.

[TIP]
====
If you're interested in data ingestion pipelines for timestamped data, use the decision tree in the {cloud}/ec-cloud-ingest-data.html#ec-data-ingest-pipeline[Elastic Cloud docs] to understand your options.
====

// New html page
[[search-analyze]]
=== Search and analyze data

You can use {es} as a basic document store to retrieve documents and their metadata.
However, the real power of {es} comes from its advanced search and analytics capabilities.

You'll use a combination of an API endpoint and a query language to interact with your data.

[discrete]
[[search-analyze-rest-api]]
==== REST API

Use REST APIs to manage your {es} cluster, and to index and search your data.
For testing purposes, you can submit requests directly from the command line or through the Dev Tools {kibana-ref}/console-kibana.html[Console] in {kib}.
From your applications, you can use a https://www.elastic.co/guide/en/elasticsearch/client/index.html[client] in your programming language of choice.

Refer to <> for a hands-on example of using the `_search` endpoint, adding data to {es}, and running basic searches in Query DSL syntax.

[discrete]
[[search-analyze-query-languages]]
==== Query languages

{es} provides a number of query languages for interacting with your data.

*Query DSL* is the primary query language for {es} today.

*{esql}* is a new piped query language and compute engine, first added in version *8.11*.
{esql} does not yet support all the features of Query DSL, like full-text search and semantic search.
Look forward to new {esql} features and functionalities in each release.

Refer to <> for a full overview of the query languages available in {es}.

[discrete]
[[search-analyze-query-dsl]]
===== Query DSL

<> is a full-featured JSON-style query language that enables complex searching, filtering, and aggregations.
It is the original and most powerful query language for {es} today.

The <> accepts queries written in Query DSL syntax.

[discrete]
[[search-analyze-query-dsl-search-filter]]
====== Search and filter with Query DSL

Query DSL supports a wide range of search techniques, including the following:

* <>: Search text that has been analyzed and indexed to support phrase or proximity queries, fuzzy matches, and more.
* <>: Search for exact matches using `keyword` fields.
* <>: Search `semantic_text` fields using dense or sparse vector search on embeddings generated in your {es} cluster.
* <>: Search for similar dense vectors using the kNN algorithm for embeddings generated outside of {es}.
* <>: Search for locations and calculate spatial relationships using geospatial queries.

Learn about the full range of queries supported by <>.

You can also filter data using Query DSL.
Filters enable you to include or exclude documents by retrieving documents that match specific field-level criteria.
A query that uses the `filter` parameter indicates <>.

[discrete]
[[search-analyze-data-query-dsl]]
====== Analyze with Query DSL

<> are the primary tool for analyzing {es} data using Query DSL.
Aggregations enable you to build complex summaries of your data and gain insight into key metrics, patterns, and trends.

Because aggregations leverage the same data structures used for search, they are also very fast.
This enables you to analyze and visualize your data in real time.
You can search documents, filter results, and perform analytics at the same time, on the same data, in a single request.
That means aggregations are calculated in the context of the search query.

The following aggregation types are available:

* <>: Calculate metrics, such as a sum or average, from field values.
* <>: Group documents into buckets based on field values, ranges, or other criteria.
* <>: Run aggregations on the results of other aggregations.

Run aggregations by specifying the <>'s `aggs` parameter.
Learn more in <>.

[discrete]
[[search-analyze-data-esql]]
===== {esql}

<> is a piped query language for filtering, transforming, and analyzing data.
{esql} is built on top of a new compute engine, where search, aggregation, and transformation functions are directly executed within {es} itself.
{esql} syntax can also be used within various {kib} tools.

The <> accepts queries written in {esql} syntax.

Today, it supports a subset of the features available in Query DSL, like aggregations, filters, and transformations.
It does not yet support full-text search or semantic search.

It comes with a comprehensive set of <> for working with data and has robust integration with {kib}'s Discover, dashboards, and visualizations.

Learn more in <>, or try https://www.elastic.co/training/introduction-to-esql[our training course].

[discrete]
[[search-analyze-data-query-languages-table]]
==== List of available query languages

The following table summarizes all available {es} query languages, to help you choose the right one for your use case.

[cols="1,2,2,1", options="header"]
|===
| Name | Description | Use cases | API endpoint

| <>
| The primary query language for {es}. A powerful and flexible JSON-style language that enables complex queries.
| Full-text search, semantic search, keyword search, filtering, aggregations, and more.
| <>

| <>
| Introduced in *8.11*, the Elasticsearch Query Language ({esql}) is a piped query language for filtering, transforming, and analyzing data.
| Initially tailored towards working with time series data like logs and metrics. Robust integration with {kib} for querying, visualizing, and analyzing data. Does not yet support full-text search.
| <>

| <>
| Event Query Language (EQL) is a query language for event-based time series data. Data must contain the `@timestamp` field to use EQL.
| Designed for the threat hunting security use case.
| <>

| <>
| Allows native, real-time SQL-like querying against {es} data. JDBC and ODBC drivers are available for integration with business intelligence (BI) tools.
| Enables users familiar with SQL to query {es} data using familiar syntax for BI and reporting.
| <>

| {kibana-ref}/kuery-query.html[Kibana Query Language (KQL)]
| Kibana Query Language (KQL) is a text-based query language for filtering data when you access it through the {kib} UI.
| Use KQL to filter documents where a value for a field exists, matches a given value, or is within a given range.
| N/A
|===

// New html page
// TODO: this page won't live here long term
[[scalability]]
=== Plan for production

{es} is built to be always available and to scale with your needs.
It does this by being distributed by nature.
You can add servers (nodes) to a cluster to increase capacity, and {es} automatically distributes your data and query load across all of the available nodes.

There's no need to overhaul your application: {es} knows how to balance multi-node clusters to provide scale and high availability.
The more nodes, the merrier.

How does this work?
Under the covers, an {es} index is really just a logical grouping of one or more physical shards, where each shard is actually a self-contained index.
By distributing the documents in an index across multiple shards, and distributing those shards across multiple nodes, {es} can ensure redundancy, which both protects against hardware failures and increases query capacity as nodes are added to a cluster.
As the cluster grows (or shrinks), {es} automatically migrates shards to rebalance the cluster.

There are two types of shards: primaries and replicas.
Each document in an index belongs to one primary shard.
A replica shard is a copy of a primary shard.
Replicas provide redundant copies of your data to protect against hardware failure and increase capacity to serve read requests like searching or retrieving a document.

The number of primary shards in an index is fixed at the time that an index is created, but the number of replica shards can be changed at any time, without interrupting indexing or query operations.

[discrete]
[[it-depends]]
==== Shard size and number of shards

There are a number of performance considerations and trade-offs with respect to shard size and the number of primary shards configured for an index.
The more shards, the more overhead there is simply in maintaining those indices.
The larger the shard size, the longer it takes to move shards around when {es} needs to rebalance a cluster.

Querying lots of small shards makes the processing per shard faster, but more queries means more overhead, so querying a smaller number of larger shards might be faster.
In short...it depends.

As a starting point:

* Aim to keep the average shard size between a few GB and a few tens of GB.
For use cases with time-based data, it is common to see shards in the 20GB to 40GB range.
* Avoid the gazillion shards problem.
The number of shards a node can hold is proportional to the available heap space.
As a general rule, the number of shards per GB of heap space should be less than 20.

The best way to determine the optimal configuration for your use case is through https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing[testing with your own data and queries].

[discrete]
[[disaster-ccr]]
==== Disaster recovery

A cluster's nodes need good, reliable connections to each other.
To provide better connections, you typically co-locate the nodes in the same data center or nearby data centers.
However, to maintain high availability, you also need to avoid any single point of failure.
In the event of a major outage in one location, servers in another location need to be able to take over.
The answer? {ccr-cap} (CCR).

CCR provides a way to automatically synchronize indices from your primary cluster to a secondary remote cluster that can serve as a hot backup.
If the primary cluster fails, the secondary cluster can take over.
You can also use CCR to create secondary clusters to serve read requests in geo-proximity to your users.

{ccr-cap} is active-passive.
The index on the primary cluster is the active leader index and handles all write requests.
Indices replicated to secondary clusters are read-only followers.

[discrete]
[[admin]]
==== Security, management, and monitoring

As with any enterprise system, you need tools to secure, manage, and monitor your {es} clusters.
Security, monitoring, and administrative features that are integrated into {es} enable you to use {kibana-ref}/introduction.html[{kib}] as a control center for managing a cluster.
Features like <> and <> help you intelligently manage your data over time.

Refer to <> for more information.
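As a simple illustration of the monitoring side, you can check on a cluster at any time with the cluster health API (a minimal sketch; run it from the Dev Tools Console, cURL, or any authenticated client).
The response includes an overall `status` of `green`, `yellow`, or `red`, along with node and shard counts:

[source,console]
----
GET _cluster/health
----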