[ML] Machine learning data frame analytics (#43544)

This merges the initial work that adds a framework for performing
machine learning analytics on data frames. The feature is currently experimental
and requires a platinum license. Note that the original commits can be
found in the `feature-ml-data-frame-analytics` branch.

A new set of APIs is added which allows the creation of data frame analytics
jobs. Configuration allows specifying different types of analysis to be performed
on a data frame. Initially, only outlier detection is supported.

The APIs are:

- PUT _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}
- GET _ml/data_frame/analysis/{id}/_stats
- POST _ml/data_frame/analysis/{id}/_start
- POST _ml/data_frame/analysis/{id}/_stop
- DELETE _ml/data_frame/analysis/{id}
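
As a quick reference, the list above can be expressed as a small lookup from operation to HTTP method and path. This is only a sketch: the class name, the operation keys and the `my-job` id are hypothetical, and the paths are copied verbatim from this commit message rather than checked against a running cluster.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: maps each operation listed above to "METHOD path". */
final class DataFrameAnalyticsEndpoints {

    // Base path copied from the list in this commit message.
    private static final String BASE = "_ml/data_frame/analysis/";

    /** Returns the six endpoints for the given job id, keyed by a hypothetical operation name. */
    static Map<String, String> endpointsFor(String id) {
        if (id == null || id.isEmpty()) {
            throw new IllegalArgumentException("id must not be empty");
        }
        Map<String, String> endpoints = new LinkedHashMap<>();
        endpoints.put("put", "PUT " + BASE + id);
        endpoints.put("get", "GET " + BASE + id);
        endpoints.put("stats", "GET " + BASE + id + "/_stats");
        endpoints.put("start", "POST " + BASE + id + "/_start");
        endpoints.put("stop", "POST " + BASE + id + "/_stop");
        endpoints.put("delete", "DELETE " + BASE + id);
        return endpoints;
    }

    public static void main(String[] args) {
        // Prints one "METHOD path" line per operation, in insertion order.
        endpointsFor("my-job").values().forEach(System.out::println);
    }
}
```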

When a data frame analytics job is started, a persistent task is created and started.
The main steps of the task are:

1. reindex the source index into the dest index
2. analyze the data through the data_frame_analyzer C++ process
3. merge the results of the process back into the destination index
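
The three steps can be pictured with a toy, in-memory version. This is a sketch under loud assumptions: documents are plain maps, the "analysis" is a stand-in that scores each row by its absolute distance from the mean (it is not what the data_frame_analyzer process computes), and the `outlier_score` result field and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the task: reindex, analyze, merge. Not the real implementation. */
final class AnalyticsTaskSketch {

    /** Step 1: copy every source document into a separate destination list. */
    static List<Map<String, Object>> reindex(List<Map<String, Object>> source) {
        List<Map<String, Object>> dest = new ArrayList<>();
        for (Map<String, Object> doc : source) {
            dest.add(new HashMap<>(doc));
        }
        return dest;
    }

    /** Step 2 (stand-in): one score per row, here the absolute distance from the mean. */
    static double[] analyze(List<Map<String, Object>> rows, String field) {
        double sum = 0.0;
        for (Map<String, Object> row : rows) {
            sum += ((Number) row.get(field)).doubleValue();
        }
        double mean = rows.isEmpty() ? 0.0 : sum / rows.size();
        double[] scores = new double[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            double value = ((Number) rows.get(i).get(field)).doubleValue();
            scores[i] = Math.abs(value - mean);
        }
        return scores;
    }

    /** Step 3: write each result back into the matching destination document. */
    static void merge(List<Map<String, Object>> dest, double[] scores, String resultField) {
        for (int i = 0; i < dest.size(); i++) {
            dest.get(i).put(resultField, scores[i]);
        }
    }

    /** Runs the three steps in order and returns the destination documents. */
    static List<Map<String, Object>> run(List<Map<String, Object>> source, String field) {
        List<Map<String, Object>> dest = reindex(source);
        double[] scores = analyze(dest, field);
        merge(dest, scores, "outlier_score"); // hypothetical result field name
        return dest;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> source = new ArrayList<>();
        for (double v : new double[] {1.0, 1.0, 10.0}) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("value", v);
            source.add(doc);
        }
        System.out.println(run(source, "value"));
    }
}
```

The point of the sketch is the ordering: the source documents are never modified, and results only ever land in the destination copy.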

In addition, an evaluation API is added which packages commonly used metrics
for evaluating the results of different analyses:

- POST _ml/data_frame/_evaluate
Dimitris Athanasiou 2019-06-25 10:48:27 +03:00 committed by GitHub
parent b4f30cf1ed
commit 5fa36dad0b
244 changed files with 20924 additions and 1335 deletions

@ -0,0 +1,28 @@
--
:api: delete-data-frame-analytics
:request: DeleteDataFrameAnalyticsRequest
:response: AcknowledgedResponse
--
[id="{upid}-{api}"]
=== Delete Data Frame Analytics API
The Delete Data Frame Analytics API is used to delete an existing {dataframe-analytics-config}.
The API accepts a +{request}+ object and returns a +{response}+.
[id="{upid}-{api}-request"]
==== Delete Data Frame Analytics Request
A +{request}+ object requires a {dataframe-analytics-config} id.
["source","java",subs="attributes,callouts,macros"]
---------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request]
---------------------------------------------------
<1> Constructing a new request referencing an existing {dataframe-analytics-config}
include::../execution.asciidoc[]
[id="{upid}-{api}-response"]
==== Response
The returned +{response}+ object acknowledges the {dataframe-analytics-config} deletion.

@ -0,0 +1,45 @@
--
:api: evaluate-data-frame
:request: EvaluateDataFrameRequest
:response: EvaluateDataFrameResponse
--
[id="{upid}-{api}"]
=== Evaluate Data Frame API
The Evaluate Data Frame API is used to evaluate an ML algorithm that ran on a {dataframe}.
The API accepts an +{request}+ object and returns an +{response}+.
[id="{upid}-{api}-request"]
==== Evaluate Data Frame Request
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request]
--------------------------------------------------
<1> Constructing a new evaluation request
<2> Reference to an existing index
<3> Kind of evaluation to perform
<4> Name of the field in the index. Its value denotes the actual (i.e. ground truth) label for an example. Must be either true or false.
<5> Name of the field in the index. Its value denotes the probability (as per some ML algorithm) of the example being classified as positive.
<6> The remaining parameters are the metrics to be calculated based on the two fields described above.
<7> https://en.wikipedia.org/wiki/Precision_and_recall[Precision] calculated at thresholds: 0.4, 0.5 and 0.6
<8> https://en.wikipedia.org/wiki/Precision_and_recall[Recall] calculated at thresholds: 0.5 and 0.7
<9> https://en.wikipedia.org/wiki/Confusion_matrix[Confusion matrix] calculated at threshold 0.5
<10> https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve[AUC ROC] calculated and the curve points returned
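To make the threshold-based metrics in callouts 7 to 9 concrete, here is a minimal, standalone sketch of how a confusion matrix, precision and recall follow from a ground-truth field and a predicted-probability field. It does not use the client API. The class and method names are hypothetical, and treating an example as positive when its probability is greater than or equal to the threshold is an assumption of this sketch, not something this page specifies.

```java
/** Illustrative threshold metrics for a binary classifier that outputs probabilities. */
final class ThresholdMetricsSketch {

    /** Returns {tp, fp, tn, fn}, predicting positive iff probability >= threshold (assumed rule). */
    static long[] confusionMatrix(boolean[] actual, double[] probability, double threshold) {
        if (actual.length != probability.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        long tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            boolean predictedPositive = probability[i] >= threshold;
            if (predictedPositive && actual[i]) {
                tp++;
            } else if (predictedPositive) {
                fp++;
            } else if (actual[i]) {
                fn++;
            } else {
                tn++;
            }
        }
        return new long[] {tp, fp, tn, fn};
    }

    /** Precision = tp / (tp + fp); NaN when nothing is predicted positive. */
    static double precision(boolean[] actual, double[] probability, double threshold) {
        long[] m = confusionMatrix(actual, probability, threshold);
        long predictedPositives = m[0] + m[1];
        return predictedPositives == 0 ? Double.NaN : (double) m[0] / predictedPositives;
    }

    /** Recall = tp / (tp + fn); NaN when there are no actual positives. */
    static double recall(boolean[] actual, double[] probability, double threshold) {
        long[] m = confusionMatrix(actual, probability, threshold);
        long actualPositives = m[0] + m[3];
        return actualPositives == 0 ? Double.NaN : (double) m[0] / actualPositives;
    }

    public static void main(String[] args) {
        boolean[] actual = {true, true, false, false, true};
        double[] probability = {0.9, 0.55, 0.45, 0.2, 0.3};
        for (double threshold : new double[] {0.4, 0.5, 0.6}) {
            System.out.println("precision@" + threshold + " = "
                + precision(actual, probability, threshold));
        }
        System.out.println("recall@0.5 = " + recall(actual, probability, 0.5));
    }
}
```

Raising the threshold generally trades recall for precision, which is why the request lets each metric be evaluated at several thresholds.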
include::../execution.asciidoc[]
[id="{upid}-{api}-response"]
==== Response
The returned +{response}+ contains the requested evaluation metrics.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-response]
--------------------------------------------------
<1> Fetching all the calculated metrics results
<2> Fetching precision metric by name
<3> Fetching precision at a given (0.4) threshold
<4> Fetching confusion matrix metric by name
<5> Fetching confusion matrix at a given (0.5) threshold

@ -0,0 +1,34 @@
--
:api: get-data-frame-analytics-stats
:request: GetDataFrameAnalyticsStatsRequest
:response: GetDataFrameAnalyticsStatsResponse
--
[id="{upid}-{api}"]
=== Get Data Frame Analytics Stats API
The Get Data Frame Analytics Stats API is used to read the operational statistics of one or more {dataframe-analytics-config}s.
The API accepts a +{request}+ object and returns a +{response}+.
[id="{upid}-{api}-request"]
==== Get Data Frame Analytics Stats Request
A +{request}+ requires either a {dataframe-analytics-config} id, a comma-separated list of ids or
the special wildcard `_all` to get the statistics for all {dataframe-analytics-config}s.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request]
--------------------------------------------------
<1> Constructing a new GET Stats request referencing an existing {dataframe-analytics-config}
include::../execution.asciidoc[]
[id="{upid}-{api}-response"]
==== Response
The returned +{response}+ contains the requested {dataframe-analytics-config} statistics.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-response]
--------------------------------------------------

@ -0,0 +1,34 @@
--
:api: get-data-frame-analytics
:request: GetDataFrameAnalyticsRequest
:response: GetDataFrameAnalyticsResponse
--
[id="{upid}-{api}"]
=== Get Data Frame Analytics API
The Get Data Frame Analytics API is used to get one or more {dataframe-analytics-config}s.
The API accepts a +{request}+ object and returns a +{response}+.
[id="{upid}-{api}-request"]
==== Get Data Frame Analytics Request
A +{request}+ requires either a {dataframe-analytics-config} id, a comma-separated list of ids or
the special wildcard `_all` to get all {dataframe-analytics-config}s.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request]
--------------------------------------------------
<1> Constructing a new GET request referencing an existing {dataframe-analytics-config}
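The three accepted forms of the id argument (a single id, a comma-separated list, or `_all`) can be illustrated with a small standalone resolver. This is a sketch only: it models the selection against an in-memory list, the class and method names are hypothetical, and silently skipping unknown ids is an assumption of the sketch rather than a statement about how the server responds.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Illustrative resolver for "id", "id1,id2" and "_all" expressions. */
final class IdExpressionSketch {

    /** Returns the existing ids selected by the expression, in request order. */
    static List<String> resolve(String expression, List<String> existingIds) {
        if ("_all".equals(expression)) {
            return new ArrayList<>(existingIds);
        }
        // Split the comma-separated list, dropping blanks and duplicates.
        Set<String> requested = new LinkedHashSet<>();
        for (String token : expression.split(",")) {
            String id = token.trim();
            if (!id.isEmpty()) {
                requested.add(id);
            }
        }
        List<String> matched = new ArrayList<>();
        for (String id : requested) {
            if (existingIds.contains(id)) { // assumption: unknown ids are skipped
                matched.add(id);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> existing = List.of("job-a", "job-b", "job-c");
        System.out.println(resolve("job-b", existing));
        System.out.println(resolve("job-c,job-a", existing));
        System.out.println(resolve("_all", existing));
    }
}
```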
include::../execution.asciidoc[]
[id="{upid}-{api}-response"]
==== Response
The returned +{response}+ contains the requested {dataframe-analytics-config}s.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-response]
--------------------------------------------------

@ -0,0 +1,115 @@
--
:api: put-data-frame-analytics
:request: PutDataFrameAnalyticsRequest
:response: PutDataFrameAnalyticsResponse
--
[id="{upid}-{api}"]
=== Put Data Frame Analytics API
The Put Data Frame Analytics API is used to create a new {dataframe-analytics-config}.
The API accepts a +{request}+ object and returns a +{response}+.
[id="{upid}-{api}-request"]
==== Put Data Frame Analytics Request
A +{request}+ requires the following argument:
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request]
--------------------------------------------------
<1> The configuration of the {dataframe-job} to create
[id="{upid}-{api}-config"]
==== Data Frame Analytics Configuration
The `DataFrameAnalyticsConfig` object contains all the details about the {dataframe-job}
configuration and contains the following arguments:
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-config]
--------------------------------------------------
<1> The {dataframe-analytics-config} id
<2> The source index and query from which to gather data
<3> The destination index
<4> The analysis to be performed
<5> The fields to be included in / excluded from the analysis
<6> The memory limit for the model created as part of the analysis process
[id="{upid}-{api}-query-config"]
==== SourceConfig
The index and the query from which to collect data.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-source-config]
--------------------------------------------------
<1> Constructing a new DataFrameAnalyticsSource
<2> The source index
<3> The query from which to gather the data. If query is not set, a `match_all` query is used by default.
===== QueryConfig
The query with which to select data from the source.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-query-config]
--------------------------------------------------
==== DestinationConfig
The index to which data should be written by the {dataframe-job}.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-dest-config]
--------------------------------------------------
<1> Constructing a new DataFrameAnalyticsDest
<2> The destination index
==== Analysis
The analysis to be performed.
Currently, only one analysis is supported: +OutlierDetection+.
+OutlierDetection+ analysis can be created in one of two ways:
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-analysis-default]
--------------------------------------------------
<1> Constructing a new OutlierDetection object with the default strategy to determine outliers
or
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-analysis-customized]
--------------------------------------------------
<1> Constructing a new OutlierDetection object
<2> The method used to perform the analysis
<3> Number of neighbors taken into account during analysis
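To give a feel for what a neighbor count means in outlier detection, the sketch below scores each point by its distance to its k-th nearest neighbor, one common neighbor-based technique. It is standalone and illustrative: the class and method names are hypothetical, and nothing here is claimed to match what the analysis process actually computes for any given method.

```java
import java.util.Arrays;

/** Illustrative outlier score: Euclidean distance to the k-th nearest other point. */
final class KthNeighborOutlierSketch {

    /** Returns one score per point; larger means more isolated. Brute force, O(n^2 log n). */
    static double[] kthNeighborDistance(double[][] points, int k) {
        int n = points.length;
        if (k < 1 || k >= n) {
            throw new IllegalArgumentException("k must be between 1 and n - 1");
        }
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            // Collect distances from point i to every other point.
            double[] distances = new double[n - 1];
            int next = 0;
            for (int j = 0; j < n; j++) {
                if (j != i) {
                    distances[next++] = euclidean(points[i], points[j]);
                }
            }
            Arrays.sort(distances);
            scores[i] = distances[k - 1];
        }
        return scores;
    }

    /** Plain Euclidean distance between two points of equal dimension. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {10, 10}};
        // The last point is far from the cluster, so it gets the largest score.
        System.out.println(Arrays.toString(kthNeighborDistance(points, 2)));
    }
}
```

A larger neighbor count smooths the score, because a point then needs to be far from a whole group of points, not just from its single closest one, to look unusual.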
==== Analyzed fields
`FetchSourceContext` object containing the fields to be included in / excluded from the analysis.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-analyzed-fields]
--------------------------------------------------
include::../execution.asciidoc[]
[id="{upid}-{api}-response"]
==== Response
The returned +{response}+ contains the newly created {dataframe-analytics-config}.
["source","java",subs="attributes,callouts,macros"]
--------------------------------------------------
include-tagged::{doc-tests-file}[{api}-response]
--------------------------------------------------

@ -0,0 +1,28 @@
--
:api: start-data-frame-analytics
:request: StartDataFrameAnalyticsRequest
:response: AcknowledgedResponse
--
[id="{upid}-{api}"]
=== Start Data Frame Analytics API
The Start Data Frame Analytics API is used to start an existing {dataframe-analytics-config}.
It accepts a +{request}+ object and responds with a +{response}+ object.
[id="{upid}-{api}-request"]
==== Start Data Frame Analytics Request
A +{request}+ object requires a {dataframe-analytics-config} id.
["source","java",subs="attributes,callouts,macros"]
---------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request]
---------------------------------------------------
<1> Constructing a new start request referencing an existing {dataframe-analytics-config}
include::../execution.asciidoc[]
[id="{upid}-{api}-response"]
==== Response
The returned +{response}+ object acknowledges that the {dataframe-job} has started.

@ -0,0 +1,28 @@
--
:api: stop-data-frame-analytics
:request: StopDataFrameAnalyticsRequest
:response: StopDataFrameAnalyticsResponse
--
[id="{upid}-{api}"]
=== Stop Data Frame Analytics API
The Stop Data Frame Analytics API is used to stop a running {dataframe-analytics-config}.
It accepts a +{request}+ object and responds with a +{response}+ object.
[id="{upid}-{api}-request"]
==== Stop Data Frame Analytics Request
A +{request}+ object requires a {dataframe-analytics-config} id.
["source","java",subs="attributes,callouts,macros"]
---------------------------------------------------
include-tagged::{doc-tests-file}[{api}-request]
---------------------------------------------------
<1> Constructing a new stop request referencing an existing {dataframe-analytics-config}
include::../execution.asciidoc[]
[id="{upid}-{api}-response"]
==== Response
The returned +{response}+ object acknowledges that the {dataframe-job} has stopped.

@ -285,6 +285,13 @@ The Java High Level REST Client supports the following Machine Learning APIs:
* <<{upid}-put-calendar-job>>
* <<{upid}-delete-calendar-job>>
* <<{upid}-delete-calendar>>
* <<{upid}-get-data-frame-analytics>>
* <<{upid}-get-data-frame-analytics-stats>>
* <<{upid}-put-data-frame-analytics>>
* <<{upid}-delete-data-frame-analytics>>
* <<{upid}-start-data-frame-analytics>>
* <<{upid}-stop-data-frame-analytics>>
* <<{upid}-evaluate-data-frame>>
* <<{upid}-put-filter>>
* <<{upid}-get-filters>>
* <<{upid}-update-filter>>
@ -329,6 +336,13 @@ include::ml/delete-calendar-event.asciidoc[]
include::ml/put-calendar-job.asciidoc[]
include::ml/delete-calendar-job.asciidoc[]
include::ml/delete-calendar.asciidoc[]
include::ml/get-data-frame-analytics.asciidoc[]
include::ml/get-data-frame-analytics-stats.asciidoc[]
include::ml/put-data-frame-analytics.asciidoc[]
include::ml/delete-data-frame-analytics.asciidoc[]
include::ml/start-data-frame-analytics.asciidoc[]
include::ml/stop-data-frame-analytics.asciidoc[]
include::ml/evaluate-data-frame.asciidoc[]
include::ml/put-filter.asciidoc[]
include::ml/get-filters.asciidoc[]
include::ml/update-filter.asciidoc[]