elasticsearch/docs/reference/ml/trained-models
Benjamin Trent afa28d49b4
[ML] add new cache_size parameter to trained_model deployments API (#88450)
With: https://github.com/elastic/ml-cpp/pull/2305 we now support caching pytorch inference responses per node per model.

By default, the cache will be the same size has the model on disk size. This is because our current best estimate for memory used (for deploying) is 2*model_size + constant_overhead. 

This is due to the model having to be loaded in memory twice when serializing to the native process. 

But, once the model is in memory and accepting requests, its actual memory usage is reduced vs. what we have "reserved" for it within the node.

Consequently, having a cache layer that takes advantage of that unused (but reserved) memory is effectively free. When used in production, especially in search scenarios, caching inference results is critical for decreasing latency.
2022-07-18 09:19:01 -04:00
..
apis [ML] add new cache_size parameter to trained_model deployments API (#88450) 2022-07-18 09:19:01 -04:00