[ML] add new cache_size parameter to trained_model deployments API (#88450)

With https://github.com/elastic/ml-cpp/pull/2305 we now support caching PyTorch inference responses per node, per model.

By default, the cache will be the same size as the model's size on disk. This is because our current best estimate of the memory used for deployment is 2*model_size + constant_overhead.

This is because the model has to be loaded into memory twice while it is being serialized to the native process.

But once the model is in memory and accepting requests, its actual memory usage drops below what we have "reserved" for it on the node.
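A rough sketch with illustrative numbers (a hypothetical 1gb model; these are not measured figures):

    reserved  ~ 2 * model_size + constant_overhead ~ 2gb + overhead
    in use once loaded and serving ~ 1gb
    idle headroom ~ reserved - in use ~ 1gb, which is enough for the default cache of model_size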

Consequently, a cache layer that takes advantage of that unused (but reserved) memory is effectively free. In production, especially in search scenarios, caching inference results is critical for reducing latency.
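As a usage sketch against the start trained model deployment endpoint (the model ID and parameter values below are placeholders, not taken from this change):

    POST _ml/trained_models/my_model/deployment/_start?cache_size=20mb&number_of_allocations=1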
Author: Benjamin Trent, 2022-07-18 09:19:01 -04:00 (committed by GitHub)
parent 5c11a81913
commit afa28d49b4
28 changed files with 376 additions and 32 deletions

@@ -28,6 +28,11 @@
]
},
"params":{
"cache_size": {
"type": "string",
"description": "A byte-size value for configuring the inference cache size. For example, 20mb.",
"required": false
},
"number_of_allocations":{
"type":"int",
"description": "The number of model allocations on each node where the model is deployed.",