
[role="xpack"]
[[start-trained-model-deployment]]
= Start trained model deployment API
[subs="attributes"]
++++
<titleabbrev>Start trained model deployment</titleabbrev>
++++

Starts a new trained model deployment.

beta::[]

[[start-trained-model-deployment-request]]
== {api-request-title}

`POST _ml/trained_models/<model_id>/deployment/_start`

[[start-trained-model-deployment-prereq]]
== {api-prereq-title}
Requires the `manage_ml` cluster privilege. This privilege is included in the
`machine_learning_admin` built-in role.

[[start-trained-model-deployment-desc]]
== {api-description-title}

Currently only `pytorch` models are supported for deployment. Once deployed,
the model can be used by the <<inference-processor,{infer-cap} processor>>
in an ingest pipeline or directly in the <<infer-trained-model>> API.

To scale inference performance, set the `number_of_allocations` and
`threads_per_allocation` parameters.

Increasing `threads_per_allocation` means more threads are used when
an inference request is processed on a node. This can improve inference speed
for certain models. It may also improve throughput.

Increasing `number_of_allocations` means more threads are used to
process multiple inference requests in parallel, which improves
throughput. Each model allocation uses the number of threads defined by
`threads_per_allocation`.

Model allocations are distributed across {ml} nodes. All allocations assigned
to a node share the same copy of the model in memory. To avoid
thread oversubscription, which is detrimental to performance, model allocations
are distributed so that the total number of threads used does not
exceed the node's allocated processors.
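
As an illustrative sketch of how the two parameters combine, the following
request asks for four allocations with two threads each, so the deployment uses
eight threads in total across the {ml} nodes. The values are arbitrary
assumptions chosen to show the arithmetic, not tuning recommendations, and the
model ID is the one used in the example later on this page.

[source,console]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?number_of_allocations=4&threads_per_allocation=2
--------------------------------------------------
// TEST[skip:TBD]

Because `threads_per_allocation` must be a power of 2, a value such as `3` is
not valid in this request.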

[[start-trained-model-deployment-path-params]]
== {api-path-parms-title}

`<model_id>`::
(Required, string)
include::{es-repo-dir}/ml/ml-shared.asciidoc[tag=model-id]

[[start-trained-model-deployment-query-params]]
== {api-query-parms-title}

`cache_size`::
(Optional, <<byte-units,byte value>>)
The inference cache size (in memory outside the JVM heap) per node for the model.
The default value is the same as the model's `model_size_bytes`. To disable the
cache, specify `0b`.

`number_of_allocations`::
(Optional, integer)
The total number of allocations this model is assigned across {ml} nodes.
Increasing this value generally increases the throughput.
Defaults to 1.

`queue_capacity`::
(Optional, integer)
Controls how many inference requests are allowed in the queue at a time.
Every {ml} node in the cluster where the model can be allocated
has a queue of this size; when the number of requests exceeds the total value,
new requests are rejected with a 429 error. Defaults to 1024. The maximum
allowed value is 1000000.

`threads_per_allocation`::
(Optional, integer)
Sets the number of threads used by each model allocation during inference.
Increasing this value generally increases the speed of each inference request.
Inference is compute-bound, so `threads_per_allocation` must not exceed the
number of allocated processors available per node.
Defaults to 1. Must be a power of 2. The maximum allowed value is 32.

`timeout`::
(Optional, time)
Controls the amount of time to wait for the model to deploy. Defaults
to 20 seconds.

`wait_for`::
(Optional, string)
Specifies the allocation status to wait for before returning. Defaults to
`started`. The value `starting` indicates the deployment is starting but is not
yet on any node. The value `started` indicates the model has started on at
least one node. The value `fully_allocated` indicates the deployment has
started on all valid nodes.

[[start-trained-model-deployment-example]]
== {api-examples-title}

The following example starts a new deployment for the
`elastic__distilbert-base-uncased-finetuned-conll03-english` trained model:

[source,console]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?wait_for=started&timeout=1m
--------------------------------------------------
// TEST[skip:TBD]

The API returns the following results:

[source,console-result]
----
{
    "assignment": {
        "task_parameters": {
            "model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english",
            "model_bytes": 265632637,
            "threads_per_allocation" : 1,
            "number_of_allocations" : 1,
            "queue_capacity" : 1024
        },
        "routing_table": {
            "uckeG3R8TLe2MMNBQ6AGrw": {
                "routing_state": "started",
                "reason": ""
            }
        },
        "assignment_state": "started",
        "start_time": "2022-11-02T11:50:34.766591Z"
    }
}
----
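
As a further sketch, assuming a workload where cached inference results are not
wanted, the following request disables the inference cache, sets the queue
capacity to twice the default, and waits up to two minutes for the deployment
to start on all valid nodes. The specific values are illustrative assumptions,
not recommendations.

[source,console]
--------------------------------------------------
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?cache_size=0b&queue_capacity=2048&wait_for=fully_allocated&timeout=2m
--------------------------------------------------
// TEST[skip:TBD]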