* [ML] add documentation for get deployment stats API
* Apply suggestions from code review
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
Zero-Shot classification allows for text classification tasks without a pre-trained collection of target labels.
This is achieved through models trained on the Multi-Genre Natural Language Inference (MNLI) dataset. This dataset pairs text sequences with "entailment" clauses. For example,
"Throughout all of history, mankind has shown itself resourceful, yet astoundingly short-sighted" could be paired with the entailment clauses: ["This example is history", "This example is sociology"...].
This training set combined with the attention and semantic knowledge in modern day NLP models (BERT, BART, etc.) affords a powerful tool for ad-hoc text classification.
See https://arxiv.org/abs/1909.00161 for a deeper explanation of the MNLI training and how zero-shot works.
The zero-shot classification task is configured as follows:
```js
{
// <snip> model configuration </snip>
"inference_config" : {
"zero_shot_classification": {
"classification_labels": ["entailment", "neutral", "contradiction"], // <1>
"labels": ["sad", "glad", "mad", "rad"], // <2>
"multi_label": false, // <3>
"hypothesis_template": "This example is {}.", // <4>
"tokenization": { /*<snip> tokenization configuration </snip>*/}
}
}
}
```
* <1> All zero-shot models return three particular labels when classifying the target sequence: "entailment" is the positive case, "neutral" is the case where the sequence is neither positive nor negative, and "contradiction" is the negative case.
* <2> An optional parameter that sets the default zero-shot labels to classify against.
* <3> Whether the returned probabilities should assume there is only one true label or multiple true labels.
* <4> The hypothesis template used when tokenizing the labels. Combined with `sad`, the sequence becomes `This example is sad.`
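To make the interplay of `hypothesis_template`, `labels`, and `multi_label` more concrete, here is a minimal Python sketch. It assumes the commonly used zero-shot scoring scheme (a softmax over the entailment logits across all labels for single-label results, and a per-label softmax over entailment versus contradiction for multi-label results). The commit does not spell out the exact normalization, so treat this as an illustration only; the function names and the logits are hypothetical.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    # Substitute each label into the template's {} placeholder.
    return [template.replace("{}", label) for label in labels]


def softmax(values):
    # Numerically stable softmax over a list of floats.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def score_labels(logits_per_label, multi_label):
    """Turn per-label NLI logits into label probabilities.

    logits_per_label holds one dict per label with the keys
    "entailment", "neutral" and "contradiction" (hypothetical values).
    """
    if multi_label:
        # Assumption: each label is judged independently,
        # entailment against contradiction.
        return [
            softmax([l["entailment"], l["contradiction"]])[0]
            for l in logits_per_label
        ]
    # Assumption: exactly one label is true, so the entailment
    # logits compete against each other.
    return softmax([l["entailment"] for l in logits_per_label])


labels = ["sad", "glad", "mad", "rad"]
hypotheses = build_hypotheses(labels)

# Made-up logits, one entry per hypothesis.
example_logits = [
    {"entailment": -1.0, "neutral": 0.2, "contradiction": 2.0},
    {"entailment": 3.0, "neutral": 0.1, "contradiction": -2.0},
    {"entailment": -2.0, "neutral": 0.3, "contradiction": 2.5},
    {"entailment": 1.0, "neutral": 0.5, "contradiction": 0.0},
]
single = score_labels(example_logits, multi_label=False)
multi = score_labels(example_logits, multi_label=True)
```

With `multi_label` set to false the probabilities sum to one across the labels; with it set to true each label gets its own independent probability.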
For inference in a pipeline one may provide label updates:
```js
{
//<snip> pipeline definition </snip>
"processors": [
//<snip> other processors </snip>
{
"inference": {
// <snip> general configuration </snip>
"inference_config": {
"zero_shot_classification": {
"labels": ["humanities", "science", "mathematics", "technology"], // <1>
"multi_label": true // <2>
}
}
}
}
//<snip> other processors </snip>
]
}
```
* <1> The `labels` we care about. These replace the default labels, if any exist.
* <2> Whether the results should allow multiple true labels.
Similarly, one may provide label changes against the `_infer` endpoint:
```js
{
"docs":[{ "text_field": "This is a very happy person"}],
"inference_config":{"zero_shot_classification":{"labels": ["glad", "sad", "bad", "rad"], "multi_label": false}}
}
```
The char filter replaces the previous default of `first_non_blank_line`.
`first_non_blank_line` worked well to figure out what line had characters at all, but log lines
like the following were handled poorly:
```
--------------------------------------------------------------------------------
Alias 'foo' already exists and this prevents setting up ILM for logs
--------------------------------------------------------------------------------
```
When combined with the `ml_standard` tokenizer, the first line was used:
```
--------------------------------------------------------------------------------
```
This line contains no valid tokens, so the `ml_standard` tokenizer found none.
The new filter, `first_line_with_letters`, returns the first line with any letter character (i.e. any character for which `Character#isLetter` returns true).
Given the previously poorly handled log, combining the new filter with our `ml_standard` tokenizer yields the following, more appropriate, tokens:
```
"tokens" : ["Alias", "foo", "already", "exists", "and", "this", "prevents", "setting", "up", "ILM", "for", "logs"]
```
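The difference between the two filters can be sketched in a few lines of Python. This is an illustration of the line selection only, not the actual char filter code; `str.isalpha` stands in for `Character#isLetter`, and returning an empty string when no line qualifies is an assumption.

```python
def first_non_blank_line(text):
    # Old behaviour: first line with any non-whitespace character.
    for line in text.splitlines():
        if line.strip():
            return line
    return ""  # assumption: nothing to return when every line is blank


def first_line_with_letters(text):
    # New behaviour: first line containing at least one letter.
    for line in text.splitlines():
        if any(ch.isalpha() for ch in line):
            return line
    return ""  # assumption: nothing to return when no line has a letter


log = "\n".join([
    "-" * 80,
    "Alias 'foo' already exists and this prevents setting up ILM for logs",
    "-" * 80,
])

old_choice = first_non_blank_line(log)
new_choice = first_line_with_letters(log)
```

Here `old_choice` is the row of dashes, while `new_choice` is the line that actually carries the message.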
This commit removes the ability to set the vocabulary location in the model config.
Instead, sane defaults are set and used, and the vocabulary handling is wrapped up in an
API.
The index is now always the internally managed `.ml-inference-native` index,
and the document ID is always `<model_id>_vocabulary`.
This API only works for pytorch/nlp type models.
Previously, if a model failed to be allocated on any node, the deployment failed.
This commit allows for an allocation to be `partially_started` and indicates its
current state via a new state value in the deployment stats API.
Additionally, when starting a deployment, the user may specify `wait_for` as
`starting`, `partially_started`, or `started`, and the API will block (as long as the timeout doesn't expire) until that state is reached.
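The blocking behaviour can be pictured with a small Python sketch. It assumes the three states are ordered `starting` < `partially_started` < `started` and that reaching a later state also satisfies a wait for an earlier one; both points are assumptions, and `get_state` is a hypothetical callback standing in for whatever reports the allocation state.

```python
import time

# Assumed ordering of allocation states, earliest first.
STATE_ORDER = ["starting", "partially_started", "started"]


def has_reached(current, wait_for):
    # True once the current state is at or beyond the requested one.
    return STATE_ORDER.index(current) >= STATE_ORDER.index(wait_for)


def wait_for_state(get_state, wait_for, timeout_s, poll_interval_s=0.01):
    # Poll until the requested state is reached or the timeout expires.
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if has_reached(state, wait_for):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"state {wait_for!r} not reached, last seen {state!r}"
            )
        time.sleep(poll_interval_s)
```

A caller that only needs some allocations running would wait for `partially_started`; one that needs every allocation running would wait for `started`.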
This new boolean parameter allows
users to put in a compressed model without it having
to be inflated on the master node during the put
request.
This is useful for system/module set up, with the model
later being validated and fully parsed when it
is loaded on a node for usage.
In #75617 a new setting, system_annotations_retention_days, was
added to control how long system annotations are retained for.
We now feel that this setting is redundant and that system
annotations should be retained for the same period as results.
This is intuitive and defensible, as system annotations can be
considered a type of result.
Followup to #75617
Previously, attempting to delete a job that had a datafeed
would return an exception. However, this was unnecessarily
pedantic - the user would always want to delete both job
and datafeed together, and would react by deleting the
datafeed and then subsequently deleting the job again.
This change makes the delete job API automatically delete
a datafeed associated with the job. The same level of
force is used for this delete datafeed request as was used
on the delete job request. This means that it's possible
to force-delete an open job with a started datafeed (since
force-delete datafeed will automatically stop a started
datafeed). It's still not possible to delete an opened job
without using force.
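The resulting rules can be summarised in a short Python sketch. It only models the decision logic described above; the `Job` class and `plan_delete_job` function are hypothetical and do not correspond to actual Elasticsearch code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    job_id: str
    opened: bool
    datafeed_id: Optional[str] = None  # None when the job has no datafeed


def plan_delete_job(job: Job, force: bool = False) -> List[str]:
    # An opened job can still only be deleted with force.
    if job.opened and not force:
        raise ValueError(f"job {job.job_id} is opened; use force to delete it")
    steps = []
    if job.datafeed_id is not None:
        # The datafeed is deleted first, with the same level of force.
        steps.append(f"delete datafeed {job.datafeed_id} (force={force})")
    steps.append(f"delete job {job.job_id} (force={force})")
    return steps
```

For example, `plan_delete_job(Job("j1", opened=True, datafeed_id="df1"), force=True)` plans a forced datafeed delete followed by a forced job delete, whereas the same call without `force` raises an error.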
Changes:
* Use "geopoint" when not referring to the literal field type
* Use "geoshape" when not referring to the literal field type or query type
* Use "GeoJSON" consistently
Add configuration for pruning dead split fields in anomaly detection
jobs via the `model_prune_window` field for both the job creation and
update APIs.
Relates to ml-cpp/#1962
This is a quality of life improvement for typical users. Almost all anomaly detection jobs will receive their data through a datafeed.
The datafeed config can now be supplied in the datafeed field of the job config when creating a job, and it is returned in that field when getting jobs.