[[nlp-example]]
=== Tutorial: Natural language processing (NLP)
++++
<titleabbrev>NLP tutorial</titleabbrev>
++++
This guide focuses on a concrete task: getting a machine learning trained model loaded into Elasticsearch and set up to enrich your documents.

Elasticsearch supports many different ways to use machine learning models.
In this guide, we will use a trained model to enrich documents at ingest time, using ingest pipelines configured within Kibana's *Content* UI.

We'll accomplish this using the following steps:
- *Set up a Cloud deployment*: We will use Elastic Cloud to host our deployment, as it makes it easy to scale machine learning nodes.
- *Load a model with Eland*: We will use the Eland Elasticsearch client to import our chosen model into Elasticsearch.
Once we've verified that the model is loaded, we will be able to use it in an ingest pipeline.
- *Set up an ML inference pipeline*: We will create an Elasticsearch index with a predefined mapping and add an inference pipeline.
- *Show enriched results*: We will ingest some data into our index and observe that the pipeline enriches our documents.

Follow the instructions to load a text classification model and set it up to enrich some photo comment data.
Once you're comfortable with the steps involved, use this guide as a blueprint for working with other machine learning trained models.
*Table of contents*:

* <<nlp-example-cloud-deployment>>
* <<nlp-example-clone-eland>>
* <<nlp-example-deploy-model>>
* <<nlp-example-create-index-and-define-ml-inference-pipeline>>
* <<nlp-example-index-documents>>
* <<nlp-example-summary>>
* <<nlp-example-learn-more>>
[discrete#nlp-example-cloud-deployment]
==== Create an {ecloud} deployment

Your deployment will need a machine learning instance to upload and deploy trained models.

If your team already has an Elastic Cloud deployment, make sure it has at least one machine learning instance.
If it does not, *Edit* your deployment to add capacity.
For this tutorial, we'll need at least 2GB of RAM on a single machine learning instance.

If your team does not have an Elastic Cloud deployment, start by signing up for a https://cloud.elastic.co/registration[free Elastic Cloud trial^].
After creating an account, you'll have an active subscription and you'll be prompted to create your first deployment.

Follow the steps to *Create* a new deployment.
Make sure to add capacity to the *Machine Learning instances* under the *Advanced settings* before creating the deployment.
To simplify scaling, turn on the *Autoscale this deployment* feature.
If you use autoscaling, you should increase the minimum RAM for the machine learning instance.
For this tutorial, we'll need at least 2GB of RAM.
For more details, refer to {cloud}/ec-create-deployment.html[Create a deployment^] in the Elastic Cloud documentation.

Enriching documents using machine learning was introduced in Enterprise Search *8.5.0*, so be sure to use version *8.5.0 or later*.
[discrete#nlp-example-clone-eland]
==== Clone Eland

Elastic's https://github.com/elastic/eland[Eland^] tool makes it easy to upload trained models to your deployment via Docker.

Eland is a specialized Elasticsearch client for exploring and manipulating data, which we can use to upload trained models into Elasticsearch.

To clone and build Eland using Docker, run the following commands:

[source,sh]
----
git clone git@github.com:elastic/eland.git
cd eland
docker build -t elastic/eland .
----
[discrete#nlp-example-deploy-model]
==== Deploy the trained model

Now that you have a deployment and a way to upload models, you will need to choose a trained model that fits your data.
https://huggingface.co/[Hugging Face^] has a large repository of publicly available trained models.
The model you choose will depend on your data and what you would like to do with it.

For the purposes of this guide, let's say we have a data set of photo comments.
To promote a positive atmosphere on our platform, we'd like the first few comments on each photo to be positive comments.
For this task, the https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english?text=I+like+you.+I+love+you[`distilbert-base-uncased-finetuned-sst-2-english`^] model is a good fit.

To upload this model to your deployment, you need a few pieces of data:

- The deployment URL.
You can get this via the *Copy endpoint* link next to *Elasticsearch* on the deployment management screen.
It will look like `https://ml-test.es.us-west1.gcp.cloud.es.io:443`.
Make sure to append the port if it isn't present, as Eland requires the URL to have a scheme, host, and port.
443 is the default port for HTTPS.
- The username and password for your deployment.
These are displayed one time when the deployment is created.
They will look like `elastic` and `xUjaFNTyycG34tQx5Iq9JIIA`.
- The trained model ID.
This comes from Hugging Face.
It will look like `distilbert-base-uncased-finetuned-sst-2-english`.
- The trained model task type.
This is the kind of machine learning task the model is designed to achieve.
It will be one of: `fill_mask`, `ner`, `text_classification`, `text_embedding`, or `zero_shot_classification`.
For our use case, we will use `text_classification`.
We can now upload our chosen model to Elasticsearch by providing these options to Eland.

[source,sh]
----
docker run -it --rm --network host \
    elastic/eland \
    eland_import_hub_model \
    --url https://ml-test.es.us-west1.gcp.cloud.es.io:443 \
    -u elastic -p <PASSWORD> \
    --hub-model-id distilbert-base-uncased-finetuned-sst-2-english \
    --task-type text_classification \
    --start
----

This script should take roughly 2-3 minutes to run.
Once your model has been successfully deployed, navigate to Kibana's *Trained Models* page to verify it is ready.
You can find this page under the *Machine Learning > Analytics* menu, then *Trained Models > Model Management*.
If you do not see your model in the list, you may need to click *Synchronize your jobs and trained models*.
Your model is now ready to be used.
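

If you prefer to check from the Dev Tools console instead of the UI, the trained models statistics API reports the deployment state. This sketch assumes Eland kept the Hugging Face model name unchanged as the Elasticsearch model ID; if yours differs, use the ID shown on the *Trained Models* page:

[source,js]
----
GET _ml/trained_models/distilbert-base-uncased-finetuned-sst-2-english/_stats
----
// NOTCONSOLE

A successfully started model reports a running deployment in the `deployment_stats` section of the response.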
[discrete#nlp-example-create-index-and-define-ml-inference-pipeline]
==== Create an index and define an ML inference pipeline

We are now ready to use Kibana's *Content* UI to enrich our documents with inference data.
Before we ingest photo comments into Elasticsearch, we will first create an ML inference pipeline.
The pipeline will enrich the incoming photo comments with inference data indicating if the comments are positive.

Let's say our photo comments look like this when they are indexed as documents into Elasticsearch:
[source,js]
----
{
  "photo_id": "78sdv71-8vdkjaj-knew629-vc8459p",
  "body": "your dog is so cute!",
  ...
}
----
// NOTCONSOLE
We want to run our documents through an inference processor that uses the trained model we uploaded to determine if the comments are positive.
To do this, we first need to set up an Elasticsearch index.

* From the Kibana home page, start by clicking the Search card.
* Click the button to *Create an Elasticsearch index*.
* Choose to *Use the API* and give your index a name.
It will automatically be prefixed with `search-`.
For this demo, we will name the index `search-photo-comments`.
* After clicking *Create Index*, you will be redirected to the overview page for your new index.

To configure the ML inference pipeline, we need the index to have an existing field mapping so we can choose which field to analyze.
This can be done via the <<indices-put-mapping, index mapping API>> in the Kibana Dev Tools or with a cURL command:
|
[source,js]
|
|
----
|
|
PUT search-photo-comments/_mapping
|
|
{
|
|
"properties": {
|
|
"photo_id": { "type": "keyword" },
|
|
"body": { "type": "text" }
|
|
}
|
|
}
|
|
----
|
|
// NOTCONSOLE
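

For reference, an equivalent cURL command might look like the following sketch; the endpoint URL and `<PASSWORD>` placeholder reuse the example values from the Eland step, so substitute your own deployment's details:

[source,sh]
----
curl -X PUT "https://ml-test.es.us-west1.gcp.cloud.es.io:443/search-photo-comments/_mapping" \
  -u elastic:<PASSWORD> \
  -H "Content-Type: application/json" \
  -d '{
    "properties": {
      "photo_id": { "type": "keyword" },
      "body": { "type": "text" }
    }
  }'
----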
Now it's time to create an inference pipeline.

1. From the overview page for your `search-photo-comments` index in "Search", click the *Pipelines* tab.
By default, Elasticsearch does not create any index-specific ingest pipelines.
2. Because we want to customize these pipelines, we need to *Copy and customize* the `ent-search-generic-ingestion` ingest pipeline.
Find this option above the settings for the `ent-search-generic-ingestion` ingest pipeline.
This will create two new index-specific ingest pipelines.

Next, we'll add an inference pipeline.

1. Locate the section *Machine Learning Inference Pipelines*, then select *Add inference pipeline*.
2. Give your inference pipeline a name, select the trained model we uploaded, and select the `body` field to be analyzed.
3. Optionally, choose a field name to store the output.
We'll call it `positivity_result`.

You can also run example documents through a simulator and review the pipeline before creating it.
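

Outside the UI, the ingest pipeline simulate API offers a similar preview. This is a sketch only: it assumes the generated pipeline shares the index's name, so check the *Pipelines* tab for the exact pipeline name before running it:

[source,js]
----
POST _ingest/pipeline/search-photo-comments/_simulate
{
  "docs": [
    {
      "_source": {
        "photo_id": "78sdv71-8vdkjaj-knew629-vc8459p",
        "body": "your dog is so cute!"
      }
    }
  ]
}
----
// NOTCONSOLE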
[discrete#nlp-example-index-documents]
==== Index documents

At this point, everything is ready to enrich documents at index time.

From the Kibana Dev Console, or simply using a cURL command, we can index a document.
We'll use a `_run_ml_inference` flag to tell the `search-photo-comments` pipeline to run the index-specific ML inference pipeline that we created.
This field will not be indexed in the document.
[source,js]
----
POST search-photo-comments/_doc/my-new-doc?pipeline=search-photo-comments
{
  "photo_id": "78sdv71-8vdkjaj-knew629-vc8459p",
  "body": "your dog is so cute!",
  "_run_ml_inference": true
}
----
// NOTCONSOLE
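

The same request as a cURL command might look like this sketch, again reusing the placeholder endpoint and `<PASSWORD>` from the Eland step:

[source,sh]
----
curl -X POST "https://ml-test.es.us-west1.gcp.cloud.es.io:443/search-photo-comments/_doc/my-new-doc?pipeline=search-photo-comments" \
  -u elastic:<PASSWORD> \
  -H "Content-Type: application/json" \
  -d '{
    "photo_id": "78sdv71-8vdkjaj-knew629-vc8459p",
    "body": "your dog is so cute!",
    "_run_ml_inference": true
  }'
----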
Once the document is indexed, use the API to retrieve it and view the enriched data.

[source,js]
----
GET search-photo-comments/_doc/my-new-doc
----
// NOTCONSOLE
[source,js]
----
{
  "_index": "search-photo-comments",
  "_id": "_MQggoQBKYghsSwHbDvG",
  ...
  "_source": {
    ...
    "photo_id": "78sdv71-8vdkjaj-knew629-vc8459p",
    "body": "your dog is so cute!",
    "ml": {
      "inference": {
        "positivity_result": {
          "predicted_value": "POSITIVE",
          "prediction_probability": 0.9998022925461774,
          "model_id": "distilbert-base-uncased-finetuned-sst-2-english"
        }
      }
    }
  }
}
----
// NOTCONSOLE
The document has new fields with the enriched data.
The `ml.inference.positivity_result` field is an object with the analysis from the machine learning model.
The model we used predicted with 99.98% confidence that the analyzed text is positive.

From here, we can write search queries to boost on `ml.inference.positivity_result.predicted_value`.
This field will also be stored in a top-level `positivity_result` field if the model was confident enough.
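

As a sketch of such a boosting query, the request below prefers comments the model labeled `POSITIVE`; the `match` on `body` and the `boost` value of `2.0` are illustrative assumptions, not part of the pipeline setup above:

[source,js]
----
GET search-photo-comments/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "body": "dog" } }
      ],
      "should": [
        {
          "match": {
            "ml.inference.positivity_result.predicted_value": {
              "query": "POSITIVE",
              "boost": 2.0
            }
          }
        }
      ]
    }
  }
}
----
// NOTCONSOLE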
[discrete#nlp-example-summary]
==== Summary

In this guide, we covered how to:

- Set up a deployment on Elastic Cloud with a machine learning instance.
- Deploy a machine learning trained model using the Eland Elasticsearch client.
- Configure an inference pipeline to use the trained model with Elasticsearch.
- Enrich documents with inference results from the trained model at ingest time.
- Query your search engine and sort by `positivity_result`.
[discrete#nlp-example-learn-more]
==== Learn more

* {ml-docs}/ml-nlp-model-ref.html[Compatible third party models^]
* {ml-docs}/ml-nlp-overview.html[NLP Overview^]
* https://github.com/elastic/eland#docker[Docker section of Eland readme^]
* {ml-docs}/ml-nlp-deploy-models.html[Deploying a model ML guide^]
* {ml-docs}/ml-nlp-import-model.html#ml-nlp-authentication[Eland Authentication methods^]
* <<ingest-pipeline-search-inference-add-inference-processors,Adding inference pipelines>>
// * <<elser-text-expansion,Using ELSER for text expansion>>