[ML] add new cache_size parameter to trained_model deployments API (#88450)

With https://github.com/elastic/ml-cpp/pull/2305 we now support caching PyTorch inference responses per node, per model.

By default, the cache will be the same size as the model's size on disk. This is because our current best estimate of the memory used for deployment is 2*model_size + constant_overhead.

This is because the model has to be loaded into memory twice while it is being serialized to the native process.

But once the model is in memory and accepting requests, its actual memory usage drops below what we have "reserved" for it on the node.
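A rough sketch with illustrative numbers (a hypothetical 1gb model; these are not measured figures):

    reserved  ~ 2 * model_size + constant_overhead ~ 2gb + overhead
    in use once loaded and serving ~ 1gb
    idle headroom ~ reserved - in use ~ 1gb, which is enough for the default cache of model_size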

Consequently, a cache layer that takes advantage of that unused (but reserved) memory is effectively free. In production, especially in search scenarios, caching inference results is critical for reducing latency.
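As a usage sketch against the start trained model deployment endpoint (the model ID and parameter values below are placeholders, not taken from this change):

    POST _ml/trained_models/my_model/deployment/_start?cache_size=20mb&number_of_allocations=1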
Author: Benjamin Trent, 2022-07-18 09:19:01 -04:00 (committed by GitHub)
parent 5c11a81913
commit afa28d49b4
28 changed files with 376 additions and 32 deletions

@@ -28,6 +28,11 @@
]
},
"params":{
"cache_size": {
"type": "string",
"description": "A byte-size value for configuring the inference cache size. For example, 20mb.",
"required": false
},
"number_of_allocations":{
"type":"int",
"description": "The number of model allocations on each node where the model is deployed.",