Commit Graph

572 Commits

Author SHA1 Message Date
Lisa Cawley 1751ced80a
[DOCS] Fix formatting in get anomaly job API (#81682) 2021-12-13 12:56:27 -08:00
David Kyle 3c974a1e5d
[ML][DOCS] Remove orphaned GET deployment stats doc (#81505) 2021-12-09 08:32:33 +00:00
Lisa Cawley 429bdd9afc
[DOCS] Move trained model APIs out of dataframe analytics (#81315) 2021-12-03 09:21:09 -08:00
David Kyle aba14aacfa
[ML][DOCS] Add zero shot example and setting truncation at inference (#81003)
More examples for the _infer endpoint
2021-12-01 11:44:04 +00:00
Lisa Cawley e5de9d8ad7
[DOCS] Add actual and typical values in ML alerting docs (#80571) 2021-11-25 10:06:52 -08:00
Lisa Cawley 8da1236bca
[DOCS] Clarify impact of force stop trained model deployment (#81026) 2021-11-25 09:08:46 -08:00
Lisa Cawley d1af86cfdd
[DOCS] Fixes start and stop trained model deployment APIs (#80978) 2021-11-24 10:09:45 -08:00
Lisa Cawley 38cbd116c9
[DOCS] Fixes query parameters for get buckets API (#80643) 2021-11-22 11:34:43 -08:00
Lisa Cawley f3a69ae4b1
[DOCS] Adds missing query parameters to ML APIs (#80863) 2021-11-22 09:25:01 -08:00
Lisa Cawley fffac5bd08
[DOCS] Adds missing query parameters in get influencer and get snapshot APIs (#80801) 2021-11-18 08:24:24 -08:00
Lisa Cawley d6f48dc5bd
[DOCS] Add query parameters to update datafeed API (#80777) 2021-11-17 07:40:31 -08:00
Dimitris Athanasiou c7f745b40a
[ML] Force delete trained models (#80595)
Adds a `force` parameter to the delete trained models API
which when set to `true` allows deletion of a model that
is referenced by ingest pipelines or has a started deployment.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2021-11-11 10:54:01 +02:00
Benjamin Trent 5627dc66e1
[ML] deprecate estimated_heap_memory_usage_bytes and replace with model_size_bytes (#80554)
This deprecates estimated_heap_memory_usage_bytes on model put and replaces it with model_size_bytes.

On GET, only model_size_bytes is returned unless v7 rest-api compatibility is requested.

For the ml/info API, only model_size_bytes is returned

A forward-port of: #80545
2021-11-10 10:23:25 -05:00
David Roberts a61088063e
[ML] use_auto_machine_memory_percent now defaults max_model_memory_limit (#80532)
If the xpack.ml.use_auto_machine_memory_percent setting is true,
and xpack.ml.max_model_memory_limit is not set then
xpack.ml.max_model_memory_limit is now considered to be set to
the largest size that could be assigned in the cluster.

This functionality will be crucial for Cloud once the Elasticsearch
startup code is setting the Elasticsearch JVM heap size. Then the
Cloud code will no longer be able to accurately set
xpack.ml.max_model_memory_limit, so will not set it at all.
Instead the Cloud code will just set
xpack.ml.use_auto_machine_memory_percent and the ML code will
calculate the appropriate maximum model_memory_limit that should
be permitted.
2021-11-10 08:38:02 +00:00
Lisa Cawley 6ecc495d15
[DOCS] Clarify parameters in delete expired data, forecast, and flush job APIs (#80517) 2021-11-09 14:57:35 -08:00
Lisa Cawley 1c98a23ca8
[DOCS] Edits stop and start datafeed APIs (#80461) 2021-11-09 14:39:13 -08:00
Benjamin Trent cf5f521fac
[ML] add deployment_stats to trained model stats (#80531)
This commit adds a new field deployment_stats that is optionally set for models that are deployed.

If a model does not have a deployment, it will be null.

Also, removes the get deployment stats API and makes the deployment stats action internal only.
2021-11-09 16:09:47 -05:00
Benjamin Trent c3c3f88000
[ML] validate model definition on start deployment (#80439)
When a deployment is started, we do not validate that the definition
documents are all present and not truncated. This commit adds a
validation on _start that prevents a bad state from occurring where the
deployment starts, but the model is incorrectly defined, or some unknown
error occurs to late in the deployment process.
2021-11-09 10:33:55 -05:00
Dimitris Athanasiou afe58ba6d8
[ML] Force stop deployment in use (#80431)
Implements a `force` parameter to the stop deployment API.
This allows a user to forcefully stop a deployment. Currently,
this specifically allows stopping a deployment that is in use
by ingest processors.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2021-11-08 14:35:52 +02:00
Lisa Cawley 733381bed2
[DOCS] Adds missing query parameters to datafeed APIs (#80314) 2021-11-05 16:31:04 -07:00
James Rodewig f56a0f4b66
[DOCS] Remove `testenv` annotations from doc snippet tests (#80023)
Removes `testenv` annotations and related code. These annotations originally let you skip x-pack snippet tests in the docs. However, that's no longer possible.

Relates to #79309, #31619
2021-11-05 18:38:50 -04:00
István Zoltán Szabó f72e2da221
[DOCS] Adds missing query params to GET category and GET influencer APIs (#79448) 2021-11-05 10:59:57 +01:00
David Kyle 0635f2758f
[ML] Consistently apply the default truncation option for the BERT tokenizer (#80339)
The default is Truncate.First
2021-11-05 09:10:59 +00:00
Lisa Cawley 638fe2c26a
[DOCS] Fixes typo in start trained models API (#80368) 2021-11-04 14:23:03 -07:00
Dimitris Athanasiou d13baade69
[ML] Report start_time for trained model deployments and allocations (#80188)
Adds `start_time` to the get deployment stats API for the deployment
and each allocation.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2021-11-02 17:12:46 +02:00
David Kyle 58a517309a
[ML] [DOCS] Update the model part upload URL in example (#80181) 2021-11-02 11:33:04 +00:00
Benjamin Trent 8887cfa080
[ML] updating the infer trained model deployment docs (#80083)
the infer endpoint has changed its format.

Also, the results format for the various tasks have changed. This updates the docs to match what is currently in 8.0.0.
2021-10-29 13:07:23 -04:00
Benjamin Trent f9bf4e57b9
[ML] adds new params to the start trained model deployment docs (#80016) 2021-10-28 11:23:25 -04:00
Benjamin Trent 375fc779b4
[ML] update truncation default & adding field output when input is truncated (#79942)
This commit makes the two following changes (along with some
refactoring)  - Nlp results will now indicate if the input was truncated
or not  - The default truncation is now `none` instead of `first`
2021-10-28 10:40:49 -04:00
Benjamin Trent d2b638356b
[ML] Update trained model docs for truncate parameter for bert tokenization (#79652) 2021-10-28 07:19:10 -04:00
David Roberts 6b20e8e1b0
[ML] Fixing doc test substitution bug (#79943)
The substitutions should not have a space after the field
name.

Fixes #79931
2021-10-27 19:45:15 +01:00
Mark Vieira 8f79cfacab Mute documentation test 2021-10-27 09:48:20 -07:00
Lisa Cawley 610043f100
[DOCS] Edits formatting in create trained models API (#79758)
Related to #78376

This PR fixes minor formatting issues in the create trained models API documentation
2021-10-27 07:41:11 -04:00
Lisa Cawley cadc0c3800
[DOCS] Fixes typo in preview datafeed API (#79863) 2021-10-26 16:48:06 -07:00
István Zoltán Szabó c879db98b1
[DOCS] Updates get trained models API docs (#79372)
* [DOCS] Updates get trained models API docs.

* [DOCS] Reviews get trained models related definitions in ml-shared.
2021-10-25 11:47:45 +02:00
István Zoltán Szabó 94ab204a1e
[DOCS] Fixes indentation issue in GET trained models API docs. (#79347) 2021-10-18 12:27:24 +02:00
Lisa Cawley 3d6074b76e
[DOCS] Fixes typo in calendar API example (#78867) 2021-10-07 17:51:14 -07:00
Lisa Cawley df5dde5b3c
[DOCS] Fixes ML get calendars API (#78808) 2021-10-07 12:22:11 -07:00
Lisa Cawley bcd75c3203
[DOCS] Fixes ML get scheduled events API (#78809) 2021-10-07 08:34:58 -07:00
Benjamin Trent 498e6e3d0f
[ML] adding docs for estimated heap and operations (#78376)
Add docs for optionally supplying memory and operation estimates in put model
2021-09-29 09:11:42 -04:00
Benjamin Trent b96d929af3
[ML] add documentation for get deployment stats API (#78412)
* [ML] add documentation for get deployment stats API

* Apply suggestions from code review

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
2021-09-29 07:20:25 -04:00
Benjamin Trent 408489310c
[ML] add zero_shot_classification task for BERT nlp models (#77799)
Zero-Shot classification allows for text classification tasks without a pre-trained collection of target labels.

This is achieved through models trained on the Multi-Genre Natural Language Inference (MNLI) dataset. This dataset pairs  text sequences with "entailment" clauses. An example could be:

"Throughout all of history, man kind has shown itself resourceful, yet astoundingly short-sighted" could have been paired with the entailment clauses: ["This example is history", "This example is sociology"...]. 

This training set combined with the attention and semantic knowledge in modern day NLP models (BERT, BART, etc.) affords a powerful tool for ad-hoc text classification.

See https://arxiv.org/abs/1909.00161 for a deeper explanation of the MNLI training and how zero-shot works. 

The zeroshot classification task is configured as follows:
```js
{
   // <snip> model configuration </snip>
  "inference_config" : {
    "zero_shot_classification": {
      "classification_labels": ["entailment", "neutral", "contradiction"], // <1>
      "labels": ["sad", "glad", "mad", "rad"], // <2>
      "multi_label": false, // <3>
      "hypothesis_template": "This example is {}.", // <4>
      "tokenization": { /*<snip> tokenization configuration </snip>*/}
    }
  }
}
```
* <1> For all zero_shot models, there returns 3 particular labels when classification the target sequence. "entailment" is the positive case, "neutral" the case where the sequence isn't positive or negative, and "contradiction" is the negative case
* <2> This is an optional parameter for the default zero_shot labels to attempt to classify
* <3> When returning the probabilities, should the results assume there is only one true label or multiple true labels
* <4> The hypothesis template when tokenizing the labels. When combining with `sad` the sequence looks like `This example is sad.`

For inference in a pipeline one may provide label updates:
```js
{
  //<snip> pipeline definition </snip>
  "processors": [
    //<snip> other processors </snip>
    {
      "inference": {
        // <snip> general configuration </snip>
        "inference_config": {
          "zero_shot_classification": {
             "labels": ["humanities", "science", "mathematics", "technology"], // <1>
             "multi_label": true // <2>
          }
        }
      }
    }
    //<snip> other processors </snip>
  ]
}
```
* <1> The `labels` we care about, these replace the default ones if they exist. 
* <2> Should the results allow multiple true labels

Similarly one may provide label changes against the `_infer` endpoint
```js
{
   "docs":[{ "text_field": "This is a very happy person"}],
   "inference_config":{"zero_shot_classification":{"labels": ["glad", "sad", "bad", "rad"], "multi_label": false}}
}
```
2021-09-28 09:38:23 -04:00
Benjamin Trent 00defa38a9
[ML] adding some initial document for our pytorch NLP model support (#78270)
Adding docs for:

put vocab
put model definition part
start deployment
all the new NLP configuration objects for trained model configurations
2021-09-27 12:46:13 -04:00
Benjamin Trent 281ec58b8d
[ML] add new default char filter `first_line_with_letters` for machine learning categorization (#77457)
The char filter replaces the previous default of `first_non_blank_line`.

`first_non_blank_line` worked well to figure out what line had characters at all, but log lines 
like the following were handled poorly:
```
--------------------------------------------------------------------------------

Alias 'foo' already exists and this prevents setting up ILM for logs

--------------------------------------------------------------------------------
```
When combined with the `ml_standard` tokenizer, the first line was used:
```
--------------------------------------------------------------------------------
```
This has no valid tokens for our standard tokenizer. Consequently, no tokens were found by `ml_standard` tokenizer.


The new filter, `first_line_with_letters`, returns the first line with any letter character (e.g. `Character#isLetter` returns true).

Given the previously poorly handled log, when combining with our `ml_standard` tokenizer, we get the following, more appropriate, tokens:

```
"tokens" : ["Alias", "foo", "already", "exists", "and", "this", "prevents", "setting", "up", "ILM", "for", "logs"]
```
2021-09-09 10:09:57 -04:00
Lisa Cawley b5a32678e7
[DOCS] Fixes admonition formatting (#77393) 2021-09-08 11:20:43 -07:00
Benjamin Trent a68c6acdb3
[ML] adding new PUT trained model vocabulary endpoint (#77387)
This commit removes the ability to set the vocabulary location in the model config.
This opts instead for sane defaults to be set and used. Wrapping this up in an
API.

The index is now always the internally managed .ml-inference-native index
and the document ID is always <model_id>_vocabulary

This API only works for pytorch/nlp type models.
2021-09-08 10:21:45 -04:00
Benjamin Trent 708491d0d3
[ML] add allocation state reason and support for partial model allocations (#76925)
Previously, if a model failed to be allocated on any node, the deployment failed.

This commit allows for an allocation to be partially_started and indicates its
current state via a new state value in the deployment stats API.

Additionally, when starting a deployment, the user may specify to wait_for
starting, partially_started, started and the API will block (as long as timeout doesn't expire) until that state is reached.
2021-09-07 15:23:13 -04:00
Benjamin Trent de49ff22a4
[ML] creating new PUT model definition part API (#76987)
This commit simplifies the interactions for uploading chunked model definitions and model vocabulary.
2021-09-07 08:22:52 -04:00
Benjamin Trent 02e17c3442
[ML] adding new defer_definition_decompression parameter to put trained model API (#77189)
This new parameter is a boolean parameter that allows
users to put in a compressed model without it having
to be inflated on the master node during the put
request

This is useful for system/module set up and then later
having the model validated and fully parsed when it
is being loaded on a node for usage
2021-09-03 09:07:54 -04:00
István Zoltán Szabó cdec5228e8
[DOCS] Fixes line breaks. (#77248) 2021-09-03 14:40:43 +02:00
István Zoltán Szabó 70a012b0c7
[DOCS] Fixes section IDs in start/stop trained model deployment APIs. (#77247) 2021-09-03 14:24:37 +02:00
Lisa Cawley 007469af63
[DOCS] Replaces index pattern in ML docs (#77041) 2021-09-01 10:26:06 -07:00
Benjamin Trent 0e1efa6533
[ML] generalize pytorch sentiment analysis to text classification (#77084)
* [ML] generalize pytorch sentiment analysis to text classification

* Update x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/inference/trainedmodel/TextClassificationConfig.java
2021-09-01 08:45:13 -04:00
István Zoltán Szabó ea007902ef
[DOCS] Adds anomaly job health alert type docs (#76659)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2021-08-30 16:11:34 +02:00
Lisa Cawley d36f24fbc3
[DOCS] Update datafeed details in ML docs (#76854) 2021-08-25 11:35:21 -07:00
István Zoltán Szabó 789368b38f
[DOCS] Fixes a syntax error in datafeed runtime field example. (#76917) 2021-08-25 12:04:32 +02:00
István Zoltán Szabó 8aed99fc02
[DOCS] Adds links that point to loss function to ML API docs. (#76438) 2021-08-23 13:09:37 +02:00
István Zoltán Szabó 7faec52a1e
[DOCS] Fixes model_prune_window property description. (#76711) 2021-08-19 16:16:37 +02:00
István Zoltán Szabó b9d875bf68
[DOCS] Updates description of model_prune_window property in ML shared (#76487) 2021-08-13 12:18:38 +02:00
István Zoltán Szabó 9b0417f2df
[DOCS] Comments out links that points to regression loss functions (#76435)
* [DOCS] Comments out links that points to regression loss functions.

* Update docs/reference/ml/df-analytics/apis/get-trained-models.asciidoc
2021-08-12 18:33:42 +02:00
David Roberts 7ac5ea39df
[ML] Use results retention time for deleting system annotations (#76096)
In #75617 a new setting, system_annotations_retention_days, was
added to control how long system annotations are retained for.
We now feel that this setting is redundant and that system
annotations should be retained for the same period as results.
This is intuitive and defensible, as system annotations can be
considered a type of result.

Followup to #75617
2021-08-04 17:42:31 +01:00
David Roberts 10a1d27c7b
[ML] Deleting a job now deletes the datafeed if necessary (#76010)
Previously attempting to delete a job that had a datafeed
would return an exception. However, this was unnecessarily
pedantic - the user would always want to delete both job
and datafeed together, and would react by deleting the
datafeed and then subsequently deleting the job again.

This change makes the delete job API automatically delete
a datafeed associated with the job. The same level of
force is used for this delete datafeed request as was used
on the delete job request. This means that it's possible
to force-delete an open job with a started datafeed (since
force-delete datafeed will automatically stop a started
datafeed). It's still not possible to delete an opened job
without using force.
2021-08-03 17:22:06 +01:00
James Rodewig fc0ac1923d
[DOCS] Correct spelling for geo terms (#76028)
Changes:
* Use "geopoint" when not referring to the literal field type
* Use "geoshape" when not referring to the literal field type or query type
* Use "GeoJSON" consistently
2021-08-03 09:55:48 -04:00
Ed Savage 5651215be1
[ML] Add 'model_prune_window' field to AD job config (#75741)
Add configuration for pruning dead split fields in anomaly detection
jobs via the `model_prune_window` field for both the job creation and
update APIs.

Relates to ml-cpp/#1962
2021-08-03 09:16:43 +01:00
István Zoltán Szabó ce537a33b6
[DOCS] Adds link that points to outlier detection example to GET DFA stats API docs. (#75689) 2021-08-02 18:10:03 +02:00
István Zoltán Szabó 8d4fb3aa84
[DOCS] Changes link to outlier detection docs in PUTDFA API docs. (#75933) 2021-08-02 13:45:37 +02:00
Przemysław Witek 30d9f13436
[ML] Delete expired annotations (#75617) 2021-07-29 15:27:03 +02:00
Lisa Cawley c1ba949aee
[DOCS] Fixes bulleted list in ML aggregations (#75806) 2021-07-28 11:29:48 -07:00
Lisa Cawley 02d851e50e
[DOCS] Drafts trained model deployment APIs (#75497) 2021-07-26 09:49:37 -07:00
István Zoltán Szabó 7e7a386078
[DOCS] Comments out link that points to outlier detection example (#75687) 2021-07-26 16:36:57 +02:00
Lisa Cawley 70b870ee7f
[DOCS] Fixes nesting of datafeed config in APIs (#75502) 2021-07-20 11:27:15 -07:00
István Zoltán Szabó 9ef156df9f
[DOCS] Adds peak_model_bytes and assignment_memory_basis to GET model snapshot API docs (#75413) 2021-07-16 17:12:47 +02:00
Lisa Cawley c8c7f0ef52
[DOCS] Anomaly detection: Visualize delayed data (#75098) 2021-07-13 18:06:07 -07:00
Lisa Cawley 3c76bcb3a5
[DOCS] Fixes links to machine learning concepts (#75194) 2021-07-09 13:09:03 -07:00
István Zoltán Szabó 6a4de77e11
[DOCS] Adds classification and regression links back to DFA docs. (#74930) 2021-07-08 16:37:16 +02:00
István Zoltán Szabó 841cfb9214
[DOCS] Adds outlier detection links to DFA API docs (#74748) 2021-07-06 15:10:41 +02:00
Lisa Cawley b71b7d0866
[DOCS] Fix links to anomaly detection overview (#74943) 2021-07-05 13:19:54 -07:00
Lisa Cawley 4c85852cc7
[DOCS] Update forecasting links in ML APIs (#74942) 2021-07-05 12:34:03 -07:00
Lisa Cawley 5bcd318e29
[DOCS] Move ML functions to appendix (#74802) 2021-07-05 11:53:17 -07:00
István Zoltán Szabó 483d145f78
[DOCS] Fixes an attribute in PUT DFA API docs. (#74931) 2021-07-05 17:08:11 +02:00
István Zoltán Szabó 6c6e6874ff
[DOCS] Removes link to classification and regression. (#74926) 2021-07-05 16:28:14 +02:00
István Zoltán Szabó a4f9f4fae1
[DOCS] Comments out links to outlier detection. (#74745) 2021-06-30 14:24:34 +02:00
Lisa Cawley 64af39b759
[DOCS] Add memory limit details in update job API (#74517)
Co-authored-by: David Roberts <dave.roberts@elastic.co>
2021-06-24 08:50:19 -07:00
Benjamin Trent 0303e6d733
[ML] add datafeed field to the job config (#74265)
This is a quality of life improvement for typical users. Almost all anomaly jobs will receive their data through a datafeed.

The datafeed config can now be supplied and is available in the datafeed field in the job config for creation and getting jobs.
2021-06-23 08:06:58 -04:00
David Roberts 6e9b959450
[ML] Closing an anomaly detection job now automatically stops its datafeed if necessary (#74257)
Previously it was a requirement of the close job API that if the
job had an associated datafeed that that datafeed was stopped
before the job could be closed. Experience has shown that this
is just a pedantic nuisance. If a user closes the job without
first stopping the datafeed then it's just a mistake, and they
then have to make two further calls, to stop the datafeed and
then attempt to close the job again.

This PR changes the behaviour so that if you ask to close a job
whose datafeed is running then the datafeed gets stopped first
as part of the same call. Datafeeds are stopped with the same
level of force as the job close request specified.
2021-06-22 12:56:11 +01:00
István Zoltán Szabó 2e820fcab6
[DOCS] Clarifies terminology in Performing population analysis page. (#74237) 2021-06-18 09:03:38 +02:00
ymao1 c727b40d0b
[Docs] Update cross-document links to Kibana Alerting docs (#74034)
* Updating cross-document links

* PR fixes
2021-06-14 12:23:47 -04:00
Dimitris Athanasiou dc61a72c9e
[ML] Reset anomaly detection job API (#73908)
Adds a new API that allows a user to reset
an anomaly detection job.

To use the API do:

```
POST _ml/anomaly_detectors/<job_id>_reset
```

The API removes all data associated to the job.
In particular, it deletes model state, results and stats.

However, job notifications and user annotations are not removed.

Also, the API can be called asynchronously by setting the parameter
`wait_for_completion` to `false` (defaults to `true`). When run
that way the API returns the task id for further monitoring.

In order to prevent the job from opening while it is resetting,
a new job field has been added called `blocked`. It is an object
that contains a `reason` and the `task_id`. `reason` can take
a value from ["delete", "reset", "revert"] as all these
operations should block the job from opening. The `task_id` is also
included in order to allow tracking the task if necessary.

Finally, this commit also sets the `blocked` field when
the revert snapshot API is called as a job should not be opened
while it is reverted to a different model snapshot.
2021-06-14 18:56:28 +03:00
Benjamin Trent 8d882863d7
[ML] adding running_state to datafeed stats object (#73926)
It is useful to know the following information when reading datafeed stats:

 - Is the datafeed a "real-time" datafeed, i.e. a datafeed without a configured `end` time
 - Has the datafeed processed all past data available at the time of starting.

This object is only available if the datafeed task has been created.

It has the form:

```
"running_state": {
  "is_real_time": <boolean>,
  "look_back_finished": <boolean>
}
```
2021-06-10 08:08:49 -04:00
István Zoltán Szabó 20d0dc300f
[DOCS] Updates datafeed related runtime field examples (#73725) 2021-06-08 11:27:55 +02:00
Lisa Cawley a6339918ac
[DOCS] Adds defaults to get ML results APIs (#73540)
Co-authored-by: David Roberts <dave.roberts@elastic.co>
2021-06-03 10:05:47 -07:00
István Zoltán Szabó 44c26c8bdc
[DOCS] Removes Kibana charts-related advise about agg interval and bucket span. (#73673) 2021-06-02 16:47:01 +02:00
David Roberts 0059c59e25
[ML] Make ml_standard tokenizer the default for new categorization jobs (#72805)
Categorization jobs created once the entire cluster is upgraded to
version 7.14 or higher will default to using the new ml_standard
tokenizer rather than the previous default of the ml_classic
tokenizer, and will incorporate the new first_non_blank_line char
filter so that categorization is based purely on the first non-blank
line of each message.

The difference between the ml_classic and ml_standard tokenizers
is that ml_classic splits on slashes and colons, so creates multiple
tokens from URLs and filesystem paths, whereas ml_standard attempts
to keep URLs, email addresses and filesystem paths as single tokens.

It is still possible to config the ml_classic tokenizer if you
prefer: just provide a categorization_analyzer within your
analysis_config and whichever tokenizer you choose (which could be
ml_classic or any other Elasticsearch tokenizer) will be used.

To opt out of using first_non_blank_line as a default char filter,
you must explicitly specify a categorization_analyzer that does not
include it.

If no categorization_analyzer is specified but categorization_filters
are specified then the categorization filters are converted to char
filters applied that are applied after first_non_blank_line.

Closes elastic/ml-cpp#1724
2021-06-01 15:11:32 +01:00
István Zoltán Szabó 1ce2308e2a
[DOCS] Adds max_trees hyperparameter to GET TM API docs (#72298) 2021-05-06 08:18:19 +02:00
István Zoltán Szabó d07c174aaf
[DOCS] Revises required privileges info in Anomaly Detection API docs (#72483) 2021-05-03 10:20:14 +02:00
Benjamin Trent 2ce4d175f0
[ML] increase the default value of xpack.ml.max_open_jobs from 20 to 512 for autoscaling improvements (#72487)
This commit increases the xpack.ml.max_open_jobs from 20 to 512. Additionally, it ignores nodes that cannot provide an accurate view into their native memory.

If a node does not have a view into its native memory, we ignore it for assignment.

This effectively fixes a bug with autoscaling. Autoscaling relies on jobs with adequate memory to assign jobs to nodes. If that is hampered by the xpack.ml.max_open_jobs scaling decisions are hampered.
2021-04-30 07:55:57 -04:00
István Zoltán Szabó ce9dd74cf5
[DOCS] Expands DFA and TM API docs with required privileges info (#71335) 2021-04-28 08:33:42 +02:00
Pierre Grimaud 3c44dfec60
[DOCS] Fix typos (#72227) 2021-04-26 12:40:38 -04:00
István Zoltán Szabó 2f122f03b2
[DOCS] Adds anomaly detection rule advanced settings to docs (#72072)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2021-04-26 09:55:02 +02:00
István Zoltán Szabó aca0a7ffa4
[DOCS] Alters examples in anomaly detection page to use runtime mappings (#71745) 2021-04-19 13:06:50 +02:00
Benjamin Trent 01fc8ed246
[ML] adding ability to update runtime_mappings via datafeed config update API (#71707)
Adds runtime_mappings as an updatable field via datafeed config update.

closes: #71702
2021-04-15 09:44:34 -04:00
István Zoltán Szabó ce389dff5d
[DOCS] Clarifies that custom rules are job rules in Kibana (#71678)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2021-04-15 09:33:03 +02:00
James Rodewig 693807a6d3
[DOCS] Fix double spaces (#71082) 2021-03-31 09:57:47 -04:00
Benjamin Trent c8415a7924
[ML] adding support for composite aggs in anomaly detection (#69970)
This commit allows for composite aggregations in datafeeds. 

Composite aggs provide a much better solution for having influencers, partitions, etc. on high volume data. Instead of worrying about long scrolls in the datafeed, the calculation is distributed across cluster via the aggregations. 

The restrictions for this support are as follows:

- The composite aggregation must have EXACTLY one `date_histogram` source
- The sub-aggs of the composite aggregation must have a `max` aggregation on the SAME timefield as the aforementioned `date_histogram` source
- The composite agg must be the ONLY top level agg and it cannot have a `composite` or `date_histogram` sub-agg
- If using a `date_histogram` to bucket time, it cannot have a `composite` sub-agg.
- The top-level `composite` agg cannot have a sibling pipeline agg. Pipeline aggregations are supported as a sub-agg (thus a pipeline agg INSIDE the bucket).

Some key user interaction differences:
- Speed + resources used by the cluster should be controlled by the `size` parameter in the `composite` aggregation. Previously, we said if you are using aggs, use a specific `chunking_config`. But, with composite, that is not necessary. 
- Users really shouldn't use nested `terms` aggs anylonger. While this is still a "valid" configuration and MAY be desirable for some users (only wanting the top 10 of certain terms), typically when users want influencers, partition fields, etc. they want the ENTIRE population. Previously, this really wasn't possible with aggs, with `composite` it is.
- I cannot really think of a typical usecase that SHOULD ever use a multi-bucket aggregation that is NOT supported by composite.
2021-03-30 08:25:40 -04:00
István Zoltán Szabó 1db2b85e45
[DOCS] Adds source index privileges required for Explain DFA API docs. (#70978) 2021-03-30 10:42:48 +02:00
Benjamin Trent b796632582
[ML] Allow datafeed and job configs for datafeed preview API (#70836)
Previously, a datafeed and job must already exist for the `_preview` API to work.

With this change, users can get an accurate preview of the data that will be sent to the anomaly detection job
without creating either of them. 

closes https://github.com/elastic/elasticsearch/issues/70264
2021-03-26 12:52:23 -04:00
István Zoltán Szabó 9a8c6fb66f
[DOCS] Removes beta labels from DFA related docs. (#70808) 2021-03-26 09:46:41 +01:00
István Zoltán Szabó 165c0ddaeb
[DOCS] Updates anomaly detection alert docs with the new alerting terminology (#70486)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2021-03-18 18:23:19 +01:00
Benjamin Trent 10e637d97c
[ML] allow documents to be out of order within the same time bucket (#70468)
This commit allows documents seen within the same time bucket to be out of order.

This is already supported within the native process.

Additionally, when recording the "latest" record timestamp, we were assuming that the latest seen document was truly the "latest". This is not really the case if latency is utilized or if documents come out of order within the same bucket.
2021-03-17 09:34:49 -04:00
James Rodewig 5c75d004fa
[DOCS] Replace `put` with `create or update` in API names (#70330)
Co-authored-by: debadair <debadair@elastic.co>
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
2021-03-15 14:49:44 -04:00
István Zoltán Szabó 59f6280a7b
[DOCS] Changes deprecated syntax to node.role style in datafeed docs. (#70201) 2021-03-10 15:46:01 +01:00
Lisa Cawley 2caba7b11f
[DOCS] Edits machine learning settings (#69947)
Co-authored-by: David Roberts <dave.roberts@elastic.co>
2021-03-09 10:59:12 -08:00
István Zoltán Szabó c226958947
[DOCS] Expands anomaly detection alert type docs (#70026)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
Co-authored-by: Dima Arnautov <arnautov.dima@gmail.com>
2021-03-09 12:02:16 +01:00
Lisa Cawley c537e5f38c
[DOCS] Edits delete trained model alias API (#70119) 2021-03-08 17:08:58 -08:00
István Zoltán Szabó 8a7aced8e8
[DOCS] Adds beta tag to anomaly detection alert docs. (#70013) 2021-03-08 10:46:24 +01:00
István Zoltán Szabó 2ccc81081f
[DOCS] Adds hyperparameters option to the include setting of GET trained models API. (#69959) 2021-03-04 16:43:06 +01:00
Joe Gallo 1e8b5fa7c2
Remove the _ml/find-file-structure docs (#69823) 2021-03-03 09:49:28 -05:00
Benjamin Trent 2279cafb4e
[ML] adding new _preview endpoint for data frame analytics (#69453)
This commit adds a new `_preview` endpoint for data frame analytics. 

This allows users to see the data on which their model will be trained. This is especially useful 
in the arrival of custom feature processors.

The API design is a similar to datafeed `_preview` and data frame analytics `_explain`.
2021-03-01 12:25:50 -05:00
Lisa Cawley 138224b398
[DOCS] Edits trained model alias API (#69491) 2021-02-24 08:17:49 -08:00
István Zoltán Szabó 77d0f56581
[DOCS] Adds anomaly detection alert documentation (#68923)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2021-02-23 10:29:54 +01:00
Dimitris Athanasiou 7fb98c0d3c
[ML] Add runtime mappings to data frame analytics source config (#69183)
Users can now specify runtime mappings as part of the source config
of a data frame analytics job. Those runtime mappings become part of
the mapping of the destination index. This ensures the fields are
accessible in the destination index even if the relevant data frame
analytics job gets deleted.

Closes #65056
2021-02-19 16:29:19 +02:00
Benjamin Trent 0af38bba9e
[ML] add new delete trained model aliases API (#69195)
In addition to creating and re-assigning model aliases, users should be able to delete existing and unused model aliases.
2021-02-18 13:12:07 -05:00
Lisa Cawley 55f0e32fe4
[DOCS] Clarify put data frame analytics API feature processors option (#69158) 2021-02-18 08:53:46 -08:00
Benjamin Trent 26eef892df
[ML] adds new trained model alias API to simplify trained model updates and deployments (#68922)
A `model_alias` allows trained models to be referred by a user defined moniker. 

This not only improves the readability and simplicity of numerous API calls, but it allows for simpler deployment and upgrade procedures for trained models. 

Previously, if you referenced a model ID directly within an ingest pipeline, when you have a new model that performs better than an earlier referenced model, you have to update the pipeline itself. If this model was used in numerous pipelines, ALL those pipelines would have to be updated. 

When using a `model_alias` in an ingest pipeline, only that `model_alias` needs to be updated. Then, the underlying referenced model will change in place for all ingest pipelines automatically. 

An additional benefit is that the model referenced is not changed until it is fully loaded into cache, this way throughput is not hampered by changing models.
2021-02-18 09:41:50 -05:00
James Rodewig 9b88ae92e6
[DOCS] Fix typos for duplicate words (#69125) 2021-02-17 10:34:20 -05:00
Lisa Cawley a1fb2c3606
[DOCS] Fixes n_gram_encoding in data frame analytics APIs (#69084) 2021-02-16 14:02:00 -08:00
Lisa Cawley 8b6ec07613
[DOCS] Edits ML hyperparameter descriptions (#68880) 2021-02-11 11:55:28 -08:00
Lisa Cawley 683368cc4d
[DOCS] Clarify soft_tree_depth_limit (#68787)
Co-authored-by: Tom Veasey <tveasey@users.noreply.github.com>
2021-02-10 12:51:01 -08:00
István Zoltán Szabó e45d7a942d
[DOCS] Expands feature processors property description and adds a link of conceptual docs (#68213) 2021-02-02 14:48:43 +01:00
Valeriy Khakhutskyy 78368428b3
[ML] Add early stopping DFA configuration parameter (#68099)
The PR adds early_stopping_enabled optional data frame analysis configuration parameter. The enhancement was already described in elastic/ml-cpp#1676 and so I mark it here as non-issue.
2021-02-01 11:41:28 +01:00
Dimitris Athanasiou 5c961c1c81
[ML] Expand regression/classification hyperparameters (#67950)
Expands data frame analytics regression and classification
analyses with the followin hyperparameters:

- alpha
- downsample_factor
- eta_growth_rate_per_tree
- max_optimization_rounds_per_hyperparameter
- soft_tree_depth_limit
- soft_tree_depth_tolerance
2021-01-26 12:56:41 +02:00
István Zoltán Szabó addb5cbd3a
[DOCS] Adds custom feature processors description to PUT DFA API (#67424)
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
2021-01-19 09:47:32 +01:00
Dimitris Athanasiou 7574013604
[ML] Remove DFA job states reindexing and analyzing from docs (#67658)
These states do no longer exist as of #67423
2021-01-18 17:39:22 +02:00
Benjamin Trent 35f478b618
[ML] [DOCS] adding missing fields to the get trained models API docs (#67590)
Adds missing fields description, inference_config, and input to the GET trained models API documentation
2021-01-15 13:20:53 -05:00
Benjamin Trent 24ebcc8c24
[ML] [DOCS] update find-structure reference docs (#67586)
The text structure finder API documentation had many references to the "files". While this is one use of the API, the API now has a more generic name. This commit replaces many references to the word "file" to the more generic word "text".
2021-01-15 12:19:38 -05:00
István Zoltán Szabó 085a288af5
[DOCS] Adds hyperparameter metadata property to GET trained models API docs. (#67412) 2021-01-13 13:49:51 +01:00
Lisa Cawley 401d302c69
[DOCS] Move find file structure to a new API endpoint (#67314) 2021-01-12 11:59:45 -08:00
Benjamin Trent af179ab2f5
[ML] move find file structure to a new API endpoint (#67123)
This introduces a new `text-structure` plugin. This is the new home of the find file structure API. 

The old REST URL is still available but is deprecated.

The new URL is: `_text_structure/find_structure`. All parameters and behavior are unchanged.

Changes to the high-level REST client and docs will be in separate commit.

related to: https://github.com/elastic/elasticsearch/issues/67001
2021-01-11 08:56:02 -05:00
Lisa Cawley eff9dfc3a4
[DOCS] Clarify impact of delayed data in anomaly detection (#66816)
Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>
2021-01-05 12:14:51 -08:00
István Zoltán Szabó d3ad9fe632
[DOCS] Improves inference processor linking and docs (#66119) 2021-01-05 09:42:06 +01:00
David Roberts c5bef7f9a7
[ML] Deprecate anomaly detection post data endpoint (#66347)
There is little evidence of this endpoint being used
and there is quite a lot of code complexity associated
with the various formats that can be used to upload
data and the different errors that can occur when direct
data upload is open to end users.

In a future release we can make this endpoint internal
so that only datafeeds can use it, and remove all the
options and formats that are not used by datafeeds.

End users will have to store their input data for
anomaly detection in Elasticsearch indices (which we
believe all do today) and use a datafeed to feed it
to anomaly detection jobs.
2020-12-15 18:37:20 +00:00
Dimitris Athanasiou 3bed6661de
[ML] Add log_time to AD data_counts and decide current based on it (#66343)
This commit is fixing a potential bug if we support anomaly detection
results index rollover in the future.

In particular, we determine the current `data_counts` by sorting on the
latest record time. However, this is not correct if the job reverts
to an older model snapshot. To fix this we add `log_time` to `data_counts`
(similarly to `model_size_stats`) and sort on `log_time` to figure
out the current counts for the job.
2020-12-15 19:09:13 +02:00
István Zoltán Szabó bc989e4a86
[DOCS] Adds note about data_counts values to Revert snapshot API docs. (#66085) 2020-12-09 10:47:51 +01:00
István Zoltán Szabó 3081cf4944
[DOCS] Adds empty snapshot_id description to revert snapshot API docs (#66036) 2020-12-09 10:01:26 +01:00
David Kyle 22dadfd407
[ML] Docs and HRLC for datafeed runtime mappings (#65810)
For the changes in #65606
2020-12-08 10:06:58 +00:00
David Roberts 49e492f313
[ML] Adding assignment_memory_basis to model_size_stats (#65561)
At present the Java code makes a decision on whether to
use current model memory or model memory limit to calculate
how much memory a job requires to be assigned.

The plan is to move this decision to the C++ code, which will
report it via a new field in the model size stats.  An
additional change will be that once we have made the switch
from using model memory limit to using current model memory
we will never switch back, as this causes large fluctuations
up and down in memory requirement which will be much more
noticeable when autoscaling is in use.

Although the only two options at present are model memory
limit and current model memory, the new enum includes a
third possibility, peak model memory.  To switch to this
now would be tricky, as there have been two bugs in the
implementation of peak model memory which render its value
unreliable in 7.x.  However, in 8.x it might make sense to
switch to using peak model memory instead of current model
memory and it's much easier from a BWC perspective if the
enum contains all the values from the start.

Relates #63163
2020-12-03 17:18:08 +00:00
David Roberts fc72b39a17
[ML] Adjusting soft_limit description (#65383)
This PR adds detail to the explanation of the soft_limit
memory_status in ML job stats. A consequence that was not
mentioned before is that examples are not added to category
definitions.

Relates elastic/ml-cpp#1590
2020-11-24 09:35:07 +00:00
István Zoltán Szabó a85fb5534a
[DOCS] Fixes typo in Aggregating data for faster performance. (#65354) 2020-11-23 12:44:59 +01:00
István Zoltán Szabó f1e54a63a1
[DOCS] Adds UI related limitation to configuring aggs docs (#65184)
Co-authored-by: Lisa Cawley <lcawley@elastic.co>
2020-11-20 19:03:18 +01:00
István Zoltán Szabó 1e045da339
[DOCS] Makes the screenshot larger on the custom URLs page. (#65269) 2020-11-20 09:29:39 +01:00