With https://github.com/elastic/ml-cpp/pull/2305 we now support caching PyTorch inference responses per node, per model. By default, the cache is the same size as the model's on-disk size. This is because our current best estimate of the memory needed to deploy a model is 2 * model_size + constant_overhead, which accounts for the model having to be held in memory twice while it is serialized to the native process. Once the model is loaded and accepting requests, its actual memory usage drops below what we have "reserved" for it on the node. Consequently, a cache layer that takes advantage of that unused (but reserved) memory is effectively free. In production, especially in search scenarios, caching inference results is critical for reducing latency.
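A minimal sketch of the memory arithmetic behind the default cache size. The function and variable names (and the overhead value) are illustrative placeholders, not the identifiers or constants used by the actual implementation:

```python
def required_deployment_memory(model_size_bytes: int, constant_overhead_bytes: int) -> int:
    """Memory reserved for a deployment: the model is held in memory twice
    while it is serialized to the native process, plus a fixed overhead."""
    return 2 * model_size_bytes + constant_overhead_bytes


def default_cache_size(model_size_bytes: int) -> int:
    """By default the inference response cache is sized to the model's on-disk size,
    i.e. roughly the memory that stays reserved but unused once loading completes."""
    return model_size_bytes


if __name__ == "__main__":
    # Illustrative numbers only: a 500 MB model with a placeholder overhead value.
    model_size = 500 * 1024 * 1024
    overhead = 240 * 1024 * 1024  # assumed placeholder, not the real constant
    reserved = required_deployment_memory(model_size, overhead)
    steady_state = model_size + overhead  # one copy of the model once loading is done
    spare = reserved - steady_state       # == model_size, room the cache can reuse
    print(f"reserved={reserved}, steady_state={steady_state}, spare_for_cache={spare}")
    assert default_cache_size(model_size) <= spare
```

Under these assumptions, the spare reserved memory after loading equals the model size, which is why sizing the cache to the on-disk model size costs nothing extra on the node.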