* Remove `es-test-dir` book-scoped variable
* Remove `plugins-examples-dir` book-scoped variable
* Remove `:dependencies-dir:` and `:xes-repo-dir:` book-scoped variables
  - In `index.asciidoc`, the `:dependencies-dir:` and `:xes-repo-dir:` variables were removed.
  - In `sql/index.asciidoc`, the `:sql-tests:` path was updated to the fuller path.
  - In `esql/index.asciidoc`, the `:esql-tests:` path was updated in the same way.
* Replace `es-repo-dir` with `es-ref-dir`
* Move `:include-xpack: true` to the few files that use it and remove it from `index.asciidoc`
This adds a new parameter to the start trained model deployment API,
namely `priority`. The available settings are `normal` and `low`.
For `normal` priority deployments, allocations are distributed so that
node processors are never oversubscribed.
`low` priority deployments allow users to test model functionality even when there
are no node processors available. They are limited to one allocation with a single thread.
In addition, the process runs at low priority, which limits the amount of
CPU it can use when the CPU is under pressure. The intention is to
limit the impact of low priority deployments on normal priority deployments.
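As a hedged sketch of using the new parameter (the cluster address, credentials, and model id below are illustrative assumptions, not values from this change):

```python
import requests

ES = "http://localhost:9200"    # illustrative
AUTH = ("elastic", "changeme")  # illustrative
MODEL_ID = "my_model"           # illustrative

# Start a low priority deployment; such deployments are capped at
# one allocation with a single thread and run at low process priority.
resp = requests.post(
    f"{ES}/_ml/trained_models/{MODEL_ID}/deployment/_start",
    params={"priority": "low"},
    auth=AUTH,
)
resp.raise_for_status()
print(resp.json())
```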
When we rebalance model assignments we now:
1. compute a plan just for normal priority deployments
2. fix the resources used by normal deployments
3. compute a plan just for low priority deployments
4. merge the two plans
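A minimal sketch of the two-pass planning described above (the names and data structures are hypothetical simplifications, not the actual planner code):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_processors: int

@dataclass
class Plan:
    assignments: dict = field(default_factory=dict)  # deployment id -> node name

def plan(deployments, nodes, low_priority):
    """Greedy illustration: assign each deployment to the first node that fits."""
    result = Plan()
    for dep in deployments:
        for node in nodes:
            if low_priority or node.free_processors >= dep["threads"]:
                result.assignments[dep["id"]] = node.name
                if not low_priority:
                    # Normal priority reserves processors so they are never oversubscribed.
                    node.free_processors -= dep["threads"]
                break
    return result

def rebalance(normal_deployments, low_deployments, nodes):
    # 1. plan normal priority deployments; 2. their resources stay fixed because
    # the node objects keep the reduced free_processors; 3. plan low priority
    # deployments on the same nodes; 4. merge the two plans.
    normal_plan = plan(normal_deployments, nodes, low_priority=False)
    low_plan = plan(low_deployments, nodes, low_priority=True)
    return Plan({**normal_plan.assignments, **low_plan.assignments})
```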
Closes #91024
The inference node stats for deployed PyTorch inference
models now contain two new fields: `inference_cache_hit_count`
and `inference_cache_hit_count_last_minute`.
These indicate how many inferences on that node were served
from the C++-side response cache that was added in
https://github.com/elastic/ml-cpp/pull/2305. Cache hits
occur when exactly the same inference request is sent to the
same node more than once.
The `average_inference_time_ms` and
`average_inference_time_ms_last_minute` fields now refer to
the time taken to do the cache lookup, plus, if necessary,
the time to do the inference. We would expect average inference
time to be vastly reduced in situations where the cache hit
rate is high.
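A hedged sketch of reading the new fields from the stats API (the connection details and the exact response nesting are assumptions based on the description above):

```python
import requests

ES = "http://localhost:9200"    # illustrative
AUTH = ("elastic", "changeme")  # illustrative
MODEL_ID = "my_model"           # illustrative

stats = requests.get(
    f"{ES}/_ml/trained_models/{MODEL_ID}/_stats", auth=AUTH
).json()

# Per-node inference stats live under deployment_stats.nodes and now include
# the cache hit counters alongside the average inference times.
for model in stats.get("trained_model_stats", []):
    for node in model.get("deployment_stats", {}).get("nodes", []):
        print(
            node.get("inference_cache_hit_count"),
            node.get("inference_cache_hit_count_last_minute"),
            node.get("average_inference_time_ms"),
            node.get("average_inference_time_ms_last_minute"),
        )
```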
With https://github.com/elastic/ml-cpp/pull/2305 we now support caching PyTorch inference responses per node per model.
By default, the cache will be the same size as the model's on-disk size. This is because our current best estimate for memory used (for deploying) is `2 * model_size + constant_overhead`.
This is due to the model having to be loaded into memory twice when serializing to the native process.
But, once the model is in memory and accepting requests, its actual memory usage is lower than what we have "reserved" for it within the node.
Consequently, having a cache layer that takes advantage of that unused (but reserved) memory is effectively free. When used in production, especially in search scenarios, caching inference results is critical for decreasing latency.
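If the default is not wanted, the cache can be sized explicitly when starting the deployment. A minimal sketch, assuming the start deployment API's `cache_size` parameter and illustrative connection details:

```python
import requests

ES = "http://localhost:9200"    # illustrative
AUTH = ("elastic", "changeme")  # illustrative

# Override the default cache size (which matches the model size on disk)
# with an explicit byte-size value.
resp = requests.post(
    f"{ES}/_ml/trained_models/my_model/deployment/_start",
    params={"cache_size": "100mb"},
    auth=AUTH,
)
resp.raise_for_status()
```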
When starting a trained model deployment the user can tweak performance
by setting the `model_threads` and `inference_threads` parameters.
These parameters are hard to understand and cause confusion.
This commit renames these as well as the fields where their values are
reported in the stats API.
- `model_threads` => `number_of_allocations`
- `inference_threads` => `threads_per_allocation`
Now the terminology is as follows.
A model deployment starts with a requested `number_of_allocations`.
Each allocation means the model gets another thread for executing
parallel inference requests. Thus, more allocations should increase
throughput. In turn, each allocation may use a number of threads
to parallelize each individual inference request.
This is the `threads_per_allocation` setting and increases inference
speed (which might also result in improved throughput).
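With the renamed parameters, a deployment tuned for both throughput and latency might be started like this (a sketch; connection details and model id are illustrative):

```python
import requests

ES = "http://localhost:9200"    # illustrative
AUTH = ("elastic", "changeme")  # illustrative

# 2 allocations -> up to 2 inference requests handled in parallel (throughput);
# 4 threads per allocation -> each individual request is parallelized (speed).
resp = requests.post(
    f"{ES}/_ml/trained_models/my_model/deployment/_start",
    params={"number_of_allocations": 2, "threads_per_allocation": 4},
    auth=AUTH,
)
print(resp.json())
```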
Throughput is measured as the number of inference requests
processed per minute. The node-level stats `peak_throughput_per_minute`,
`throughput_last_minute`, and `average_inference_time_ms_last_minute` are
added, along with a deployment-level stat `peak_throughput_per_minute`, which
is the summed throughput across all nodes.
This improves reporting of trained model size in the response of the stats API.
In particular, it removes the `model_size_bytes` from the `deployment_stats` section and
replaces it with a top-level `model_size_stats` object that contains:
- `model_size_bytes`: the actual model size
- `required_native_memory_bytes`: the amount of memory required to load a model
In addition, these are now reported for PyTorch models regardless of their deployment state.
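A hedged sketch of the reshaped stats entry (field placement as described above; all values are made up):

```python
# Illustrative shape only; not captured from a real cluster.
trained_model_stats_entry = {
    "model_id": "my_model",
    "model_size_stats": {
        "model_size_bytes": 260_947_500,              # actual model size
        "required_native_memory_bytes": 521_895_000,  # memory required to load the model
    },
    "deployment_stats": {
        # model_size_bytes no longer appears here.
        "nodes": [],
    },
}
```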