* [DOCS] Add Beats config example for ingest pipelines
The Elasticsearch ingest pipeline docs cover ingest pipelines for Fleet and
Elastic Agent. However, the docs don't cover Beats. This adds those docs.
Relates to https://github.com/elastic/beats/pull/28239.
* Update docs/reference/ingest.asciidoc
Co-authored-by: DeDe Morton <dede.morton@elastic.co>
The 'verbose' option to /_segments returns memory information
for each segment. However, Lucene 9 has stopped tracking this memory
information, as it is largely held off-heap and so is no longer significant.
This commit deprecates the 'verbose' parameter and makes it a no-op.
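For reference, the parameter is still accepted but no longer changes the response (the index name here is hypothetical):

```js
GET /my-index-000001/_segments?verbose=true
```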
Fixes #75955
This commit adds a new multi-bucket aggregation: `categorize_text`
The aggregation follows a similar design to significant text in that it reads from `_source`
and re-analyzes the text as it is read.
The key difference is that it does not use the indexed field's analyzer, but instead relies on
the `ml_standard` tokenizer with specialized ML token filters. The tokenizer + filters are the
same ones that machine learning categorization anomaly jobs utilize.
The high-level logical flow is as follows:
- At each shard, read in the text field with a custom analyzer using the `ml_standard` tokenizer
- Read in the resulting tokens from the analyzer
- Feed these tokens to a token tree algorithm (an adaptation of the Drain categorization algorithm)
- Gather the individual log categories (the leaf nodes), sort them by doc_count, and ship those buckets to be merged
- Merge all buckets that have the EXACT same key
- Once all buckets are merged, pass the keys + counts to a new token tree for additional merging
- That tree builds the final buckets, which are returned to the user
Algorithm explanation:
- Each log is parsed with the `ml_standard` tokenizer
- Each token is passed into a token tree
- For the first `max_match_token` tokens, each token is stored in the tree; at `max_match_token + 1` (or at `len(tokens)`, whichever comes first) a log group is created
- If another log group already exists at that leaf, the two are merged if they share `similarity_threshold` percent of their tokens
- Merging simply replaces tokens that differ between the groups with `*`. For example, merging "Connection error from host alpha" with "Connection error from host beta" yields the category "Connection error from host *"
- If a layer in the tree reaches `max_unique_tokens`, we add a `*` child and pass any new tokens through it. The catch here is that on the final merge we first attempt to merge together the subtrees with the smallest number of documents, especially if the new subtree has more documents counted.
## Aggregation configuration

Here is an example using some OpenStack logs:
```js
POST openstack/_search?size=0
{
  "aggs": {
    "categories": {
      "categorize_text": {
        "field": "message",         // the field to categorize
        "similarity_threshold": 20, // merge log groups if they are this similar
        "max_unique_tokens": 20,    // max number of children per token position
        "max_match_token": 4,       // maximum tokens used to build prefix trees
        "size": 1
      }
    }
  }
}
```
This will return buckets like
```json
"aggregations" : {
  "categories" : {
    "buckets" : [
      {
        "doc_count" : 806,
        "key" : "nova-api.log.1.2017-05-16_13 INFO nova.osapi_compute.wsgi.server * HTTP/1.1 status len time"
      }
    ]
  }
}
```
The get SLM status API will only return one of three statuses: `RUNNING`, `STOPPING`, or `STOPPED`.
This corrects the docs to remove the `STARTED` status and document the `RUNNING` status.
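For reference, a minimal status call and one possible response (a sketch, not taken from the corrected docs themselves):

```js
GET _slm/status
```

```json
{
  "operation_mode": "RUNNING"
}
```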
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
In #77686 we added a service to clean up blob store
cache docs after a searchable snapshot is no longer
used. We noticed some situations where some cache
docs could still remain in the system index: when the
system index is not available at the time the searchable
snapshot index is deleted; when the system index is
restored from a backup; or when the searchable
snapshot index was deleted on a version before #77686.
This commit introduces a maintenance task that
periodically scans and cleans up unused blob cache
docs. This task is scheduled to run every hour on the
data node that contains the blob store cache primary
shard. The periodic task works by using a point-in-time
context with search_after.
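As a rough sketch of that scan pattern using the public APIs (the index name is hypothetical; the real task operates on the blob cache system index internally):

```js
POST /my-index/_pit?keep_alive=1m

GET /_search
{
  "size": 1000,
  "pit": { "id": "<pit id returned above>", "keep_alive": "1m" },
  "sort": [ { "_shard_doc": "asc" } ],
  "search_after": [ 12345 ] // sort value of the last hit from the previous page
}
```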
For this grid type, the features on the aggregation layer are represented by a point that is computed from the
centroid of the data inside the cell.
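A minimal sketch of requesting this grid type, assuming the vector tile search API's `grid_type` parameter accepts the new value (index, field, and tile coordinates are hypothetical):

```js
GET museums/_mvt/location/13/4207/2692
{
  "grid_type": "centroid"
}
```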
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>
Documents the `runs` keyword for running the same event criteria successively in a sequence query.
Relates to #75082.
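As a quick illustration of the syntax (hypothetical index and field names):

```js
GET /my-data-stream/_eql/search
{
  "query": """
    sequence by host.id
      [ process where process.name == "cmd.exe" ] with runs=3
      [ process where process.name == "powershell.exe" ]
  """
}
```

Here `with runs=3` matches three successive events satisfying the same criteria, instead of repeating the same event block three times.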
Documents `archived.*` persistent cluster settings and index settings.
These settings are commonly produced during a major version upgrade.
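For example, archived cluster settings can be removed with a wildcard (a sketch using the standard cluster settings API):

```js
PUT _cluster/settings
{
  "persistent": {
    "archived.*": null
  }
}
```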
Closes #28027
* Add stubs for get API
* Add stub for post API
* Register new actions in ActionModule
* HLRC stubs
* Unit tests
* Add rest api spec and tests
* Add new action to non-operator actions list
This change removes JodaCompatibleZonedDateTime and replaces it with ZonedDateTime for use in
scripting.
Breaking changes:
* JodaCompatibleZonedDateTime no longer exists and cannot be cast to in Painless. Use ZonedDateTime
instead.
* The dayOfWeek method on ZonedDateTime returns the DayOfWeek enum instead of the int returned by
JodaCompatibleZonedDateTime. dayOfWeekEnum still exists on ZonedDateTime as an augmentation to
support the transition to ZonedDateTime, but is now deprecated in favor of dayOfWeek on
ZonedDateTime.
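A minimal sketch of the new behavior via the Painless execute API (the date value is arbitrary):

```js
POST /_scripts/painless/_execute
{
  "script": {
    "source": "ZonedDateTime dt = ZonedDateTime.parse('2021-09-01T00:00:00Z'); return dt.dayOfWeek.getValue();"
  }
}
```

Here `dayOfWeek` resolves to ZonedDateTime's DayOfWeek enum, so `getValue()` is needed if an integer is still required.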
* [DOCS] Always enable file and native realms by default
Adds an 8.0 breaking change for PR #69096.
The copy is based on the 7.13 deprecation notice added with PR #69320.
* reword
* Update docs/reference/migration/migrate_8_0/security.asciidoc
Co-authored-by: Yang Wang <ywangd@gmail.com>
Co-authored-by: Yang Wang <ywangd@gmail.com>
* [ML] add documentation for get deployment stats API
* Apply suggestions from code review
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
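The documented endpoint presumably has the following shape (the model ID is hypothetical, and the exact path is an assumption; see the API docs added by this PR):

```js
GET _ml/trained_models/my-model/deployment/_stats
```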
* Improve docs for pre-release version compatibility
Follow-up to #78317 clarifying a couple of points:
- a pre-release build can restore snapshots from released builds
- compatibility applies if at least one of the local or remote cluster
is a released build
* Remote cluster build date nit
Monitoring installs a number of ingest pipelines which have historically been used
to upgrade documents when mappings and document structures change between
versions. Since there aren't any changes to the document format, nor will there be
by the time the format is completely retired, we can comfortably remove these
pipelines.
Zero-Shot classification allows for text classification tasks without a pre-trained collection of target labels.
This is achieved through models trained on the Multi-Genre Natural Language Inference (MNLI) dataset. This dataset pairs text sequences with "entailment" clauses. An example could be:
"Throughout all of history, man kind has shown itself resourceful, yet astoundingly short-sighted" could have been paired with the entailment clauses: ["This example is history", "This example is sociology"...].
This training set combined with the attention and semantic knowledge in modern day NLP models (BERT, BART, etc.) affords a powerful tool for ad-hoc text classification.
See https://arxiv.org/abs/1909.00161 for a deeper explanation of the MNLI training and how zero-shot works.
The zeroshot classification task is configured as follows:
```js
{
  // <snip> model configuration </snip>
  "inference_config" : {
    "zero_shot_classification": {
      "classification_labels": ["entailment", "neutral", "contradiction"], // <1>
      "labels": ["sad", "glad", "mad", "rad"], // <2>
      "multi_label": false, // <3>
      "hypothesis_template": "This example is {}.", // <4>
      "tokenization": { /* <snip> tokenization configuration </snip> */ }
    }
  }
}
```
* <1> For all zero-shot models, these are the three particular labels returned when classifying the target sequence: "entailment" is the positive case, "neutral" the case where the sequence is neither positive nor negative, and "contradiction" the negative case
* <2> An optional parameter providing the default labels that zero-shot attempts to classify against
* <3> When returning the probabilities, should the results assume there is exactly one true label, or allow multiple true labels?
* <4> The hypothesis template used when tokenizing the labels. When combined with `sad`, the sequence looks like `This example is sad.`

For inference in a pipeline, one may provide label updates:
```js
{
  // <snip> pipeline definition </snip>
  "processors": [
    // <snip> other processors </snip>
    {
      "inference": {
        // <snip> general configuration </snip>
        "inference_config": {
          "zero_shot_classification": {
            "labels": ["humanities", "science", "mathematics", "technology"], // <1>
            "multi_label": true // <2>
          }
        }
      }
    }
    // <snip> other processors </snip>
  ]
}
```
* <1> The `labels` we care about; these replace the default ones if present.
* <2> Should the results allow multiple true labels?

Similarly, one may provide label changes against the `_infer` endpoint:
```js
{
  "docs": [{ "text_field": "This is a very happy person" }],
  "inference_config": {
    "zero_shot_classification": {
      "labels": ["glad", "sad", "bad", "rad"],
      "multi_label": false
    }
  }
}
```
We deprecated support for multiple data paths (MDP) in 7.13. However,
we won't remove support until after 8.0.
Changes:
* Reverts PR #72267, which removed MDP docs
* Removes a related item from the 8.0 breaking changes.
The reference manual includes docs on version compatibility in various
places, but it's not clear that these docs only apply to released
versions and that the rules for pre-release versions are stricter than
folks expect. This commit adds some words to the docs for unreleased
versions which explain this subtlety.
Changes:
* Documents the `time_series_metric` mapping parameter for PR #76766.
* Renames the `dimension` parameter to `time_series_dimension` for PR #78012.
* Adds support for `unsigned_long` to `time_series_dimension` for PR #78204.
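A minimal mapping sketch using the renamed parameters (index and field names are hypothetical):

```js
PUT my-tsdb-index
{
  "mappings": {
    "properties": {
      "host_id":  { "type": "keyword",       "time_series_dimension": true },
      "packets":  { "type": "unsigned_long", "time_series_dimension": true },
      "cpu_load": { "type": "double",        "time_series_metric": "gauge" }
    }
  }
}
```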
This adds a new "elasticsearch-keystore show" command that displays
the value of a single secure setting from the keystore.
An optional `-o` (or `--output`) parameter can be used to direct
output to a file.
The `-o` option is required for binary keystore values
because the CLI `Terminal` class does not support writing binary data.
Hence this command:
elasticsearch-keystore show xpack.watcher.encryption_key > watcher.key
would not produce a file with the correct contents.
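whereas directing the output with `-o` (per the description above; the option placement shown here is illustrative):

elasticsearch-keystore show -o watcher.key xpack.watcher.encryption_key

writes the raw bytes to the file directly.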
Co-authored-by: Ioannis Kakavas <ikakavas@protonmail.com>
* [DOCS] Update remote cluster docs
* Add files, rename files, write new stuff
* Plethora of changes
* Add test and update snippets
* Redirects, moved files, and test updates
* Moved file to x-pack for tests
* Remove older CCS page and add redirects
* Cleanup, link updates, and some rewrites
* Update image
* Incorporating user feedback and rewriting much of the remote clusters page
* More changes from review feedback
* Numerous updates, including request examples for CCS and Kibana
* More changes from review feedback
* Minor clarifications on security for remote clusters
* Incorporate review feedback
Co-authored-by: Yang Wang <ywangd@gmail.com>
* Some review feedback and some editorial changes
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: Yang Wang <ywangd@gmail.com>
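For context, the cross-cluster search requests documented in this update address remote indices with the `<cluster alias>:<index>` syntax (names here are hypothetical):

```js
GET /my-remote-cluster:my-index/_search
{
  "query": {
    "match_all": {}
  }
}
```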
This allows consumers of the API to know exactly whether all the features in a tile have been considered
when building the hits layer of a vector tile.
Co-authored-by: James Rodewig <40268737+jrodewig@users.noreply.github.com>