When Joni, the regex engine that powers grok emits a warning it
does so by default to System.err. System.err logs are all bucketed
together in the server log at WARN level. When Joni emits a warning,
it can be extremely verbose, logging a message for each execution
again that pattern. For ingest node that means for every document
that is run that through Grok. Fortunately, Joni provides a call
back hook to push these warnings to a custom location.
This commit implements Joni's callback hook to push the Joni warning
to the Elasticsearch server logger (logger.org.elasticsearch.ingest.common.GrokProcessor)
at debug level. Generally these warning indicate a possible issue with
the regular expression and upon creation of the Grok processor will
do a "test run" of the expression and log the result (if any) at WARN
level. This WARN level log should only occur on pipeline creation which
is a much lower frequency then every document.
Additionally, the documentation is updated with instructions for how
to set the logger to debug level.
* Changes for #52239.
* Incorporating review feedback from Julie T. Also single-sourcing nexted options in the Mapping page and referencing them in the Nested page.
* Moving tip after the introduction and clarifying limits.
* Update docs/reference/mapping.asciidoc
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
* Update docs/reference/mapping/types/nested.asciidoc
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Data frame analytics dynamically determines the classification field type. This field type then dictates the encoded JSON that is written to Elasticsearch.
Inference needs to know about this field type so that it may provide the EXACT SAME predicted values as analytics.
Here is added a new field `prediction_field_type` which indicates the desired type. Options are: `string` (DEFAULT), `number`, `boolean` (where close_to(1.0) == true, false otherwise).
Analytics provides the default `prediction_field_type` when the model is created from the process.
A new field called `inference_config` is now added to the trained model config object. This new field allows for default inference settings from analytics or some external model builder.
The inference processor can still override whatever is set as the default in the trained model config.
Restructures the 'Update an enrich policy' section to:
* Migrate the content to the section. It was previously stored in the
Put Enrich Policy API docs.
* Remove the warning tag admonition from the section content.
* Replace a reused section earlier in the "Set up an enrich processor"
page with a link.
No substantive changes were made to the content.
Adds a new `default_field_map` field to trained model config objects.
This allows the model creator to supply field map if it knows that there should be some map for inference to work directly against the training data.
The use case internally is having analytics jobs supply a field mapping for multi-field fields. This allows us to use the model "out of the box" on data where we trained on `foo.keyword` but the `_source` only references `foo`.
This adds machine learning model feature importance calculations to the inference processor.
The new flag in the configuration matches the analytics parameter name: `num_top_feature_importance_values`
Example:
```
"inference": {
"field_mappings": {},
"model_id": "my_model",
"inference_config": {
"regression": {
"num_top_feature_importance_values": 3
}
}
}
```
This will write to the document as follows:
```
"inference" : {
"feature_importance" : {
"FlightTimeMin" : -76.90955548511226,
"FlightDelayType" : 114.13514762158526,
"DistanceMiles" : 13.731580450792187
},
"predicted_value" : 108.33165831875137,
"model_id" : "my_model"
}
```
This is done through calculating the [SHAP values](https://arxiv.org/abs/1802.03888).
It requires that models have populated `number_samples` for each tree node. This is not available to models that were created before 7.7.
Additionally, if the inference config is requesting feature_importance, and not all nodes have been upgraded yet, it will not allow the pipeline to be created. This is to safe-guard in a mixed-version environment where only some ingest nodes have been upgraded.
NOTE: the algorithm is a Java port of the one laid out in ml-cpp: https://github.com/elastic/ml-cpp/blob/master/lib/maths/CTreeShapFeatureImportance.cc
usability blocked by: https://github.com/elastic/ml-cpp/pull/991
This commit updates the enrich.get_policy API to specify name
as a list, in line with other URL parts that accept a comma-separated
list of values.
In addition, update the get enrich policy API docs
to align the URL part name in the documentation with
the name used in the REST API specs.
The changes add more granularity for identiying the data ingestion user.
The ingest pipeline can now be configure to record authentication realm and
type. It can also record API key name and ID when one is in use.
This improves traceability when data are being ingested from multiple agents
and will become more relevant with the incoming support of required
pipelines (#46847)
Resolves: #49106
* Add empty_value parameter to CSV processor
This change adds `empty_value` parameter to the CSV processor.
This value is used to fill empty fields. Fields will be skipped
if this parameter is ommited. This behavior is the same for both
quoted and unquoted fields.
* docs updated
* Fix compilation problem
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
This commit adds the name of the current pipeline to ingest metadata.
This pipeline name is accessible under the following key: '_ingest.pipeline'.
Example usage in pipeline:
PUT /_ingest/pipeline/2
{
"processors": [
{
"set": {
"field": "pipeline_name",
"value": "{{_ingest.pipeline}}"
}
}
]
}
Closes#42106
* CSV Processor for Ingest
This change adds new ingest processor that breaks line from CSV file into separate fields.
By default it conforms to RFC 4180 but can be tweaked.
Closes#49113
* Allow list of IPs in geoip ingest processor
This change lets you use array of IPs in addition to string in geoip processor source field.
It will set array containing geoip data for each element in source, unless first_only parameter
option is enabled, then only first found will be returned.
Closes#46193
The documentation contained a small error, as bytes and duration was not
properly converted to a number and thus remained a string.
The documentation is now also properly tested by providing a full blown
simulate pipeline example.
When the enrich processor appends enrich data to an incoming document,
it adds a `target_field` to contain the enrich data.
This `target_field` contains both the `match_field` AND `enrich_fields`
specified in the enrich policy.
Previously, this was reflected in the documented example but not
explicitly stated. This adds several explicit statements to the docs.
In case an exception occurs inside a pipeline processor,
the pipeline stack is kept around as header in the exception.
Then in the on_failure processor the id of the pipeline the
exception occurred is made accessible via the `on_failure_pipeline`
ingest metadata.
Closes#44920
For the user agent ingest processor, custom regex files must end
with the `.yml` file extension.
This corrects the docs which said the `.yaml` extension was required.
Prior to this change the `target_field` would always be a json array
field in the document being ingested. This to take into account that
multiple enrich documents could be inserted into the `target_field`.
However the default `max_matches` is `1`. Meaning that by default
only a single enrich document would be added to `target_field` json
array field.
This commit changes this; if `max_matches` is set to `1` then the single
document would be added as a json object to the `target_field` and
if it is configured to a higher value then the enrich documents will be
added as a json array (even if a single enrich document happens to be
enriched).
Currently the policy config is placed directly in the json object
of the toplevel `policies` array field. For example:
```
{
"policies": [
{
"match": {
"name" : "my-policy",
"indices" : ["users"],
"match_field" : "email",
"enrich_fields" : [
"first_name",
"last_name",
"city",
"zip",
"state"
]
}
}
]
}
```
This change adds a `config` field in each policy json object:
```
{
"policies": [
{
"config": {
"match": {
"name" : "my-policy",
"indices" : ["users"],
"match_field" : "email",
"enrich_fields" : [
"first_name",
"last_name",
"city",
"zip",
"state"
]
}
}
}
]
}
```
This allows us in the future to add other information about policies
in the get policy api response.
The UI will consume this API to build an overview of all policies.
The UI may in the future include additional information about a policy
and the plan is to include that in the get policy api, so that this
information can be gathered in a single api call.
An example of the information that is likely to be added is:
* Last policy execution time
* The status of a policy (executing, executed, unexecuted)
* Information about the last failure if exists
This commit removes types from the ShardGetService, and propagates this API change
up through the Transport and Rest actions for Get and MultiGet
Relates to #41059
The enrich api returns enrich coordinator stats and
information about currently executing enrich policies.
The coordinator stats include per ingest node:
* The current number of search requests in the queue.
* The total number of outstanding remote requests that
have been executed since node startup. Each remote
request is likely to include multiple search requests.
This depends on how much search requests are in the
queue at the time when the remote request is performed.
* The number of current outstanding remote requests.
* The total number of search requests that `enrich`
processors have executed since node startup.
The current execution policies stats include:
* The name of policy that is executing
* A full blow task info object that is executing the policy.
Relates to #32789
This commit changes the GET REST api so it will accept an optional comma
separated list of enrich policy ids. This change also modifies the
behavior of the GET API in that it will not error if it is passed a bad
enrich id anymore, but will instead just return an empty list.
* Add enrich policy common parameter
* Add enrich APIs to REST APIs index
* Add put enrich policy API docs
* Add get enrich policy API docs
* Add delete enrich policy API docs
* Add execute enrich policy API docs
Besides a rename, this changes allows to processor to attach multiple
enrich docs to the document being ingested.
Also in order to control the maximum number of enrich docs to be
included in the document being ingested, the `max_matches` setting
is added to the enrich processor.
Relates #32789
A policy type controls how the enrich index is created and
the query executed against the match field. Currently there
is a single policy type (`exact_match`). In the near future
more policy types will be added and different policy may have
different configuration options.
For this reason type should be a json object instead of a string field:
```
{
"exact_match": {
...
}
}
```
instead of:
```
{
"type": "exact_match",
...
}
```
This will make streaming parsing of enrich policies easier as in the
new format, the parsing code can know ahead what configuration fields
to expect. In the latter format that is not possible if the type field
appears not as the first field.
Relates to #32789
Enrich processor configuration changes:
* Renamed `enrich_key` option to `field` option.
* Replaced `set_from` and `targets` options with `target_field`.
The `target_field` option behaves different to how `set_from` and
`targets` worked. The `target_field` is the field that will contain
the looked up document.
Relates to #32789
The get and list APIs are a single API in this commit. Whether
requesting one named policy or all policies, a list of policies is
returened. The list API code has all been removed and the GET api is
what remains, which contains much of the list response code.
The "Conditionals with the Pipeline Processor" incorrectly documents
how to create a pipeline of pipelines with a failure condition. The
example as-is will always execute the fail processor. The change here
updates the documentation to correct guard the fail processor with an
if condition.
If a pipeline that refrences the policy exists, we should not allow the
policy to be deleted. The user will need to remove the processor from
the pipeline before deleting the policy. This commit adds a check to
ensure that the policy cannot be deleted if it is referenced by any
pipeline in the system.
These docs were misleading for package installations of
Elasticsearch. Instead, we should refer to $ES_CONFIG/ingest-geoip as
the path to place the custom database files. For non-package
installations, this is the same as $ES_HOME/config, but for package
installations this is not the case as the config directory for package
installations is /etc/elasticsearch, and is not relative to
$ES_HOME. This commit corrects the docs.
Add an explanatory NOTE section to draw attention to the difference
between small and capital letters used for the index date patterns.
e.g.: HH vs hh, MM vs mm.
Closes: #22322
This processor uses the lucene HTMLStripCharFilter class to remove HTML
entities from a field. This adds to the char filter, so that there is
possibility to store the stripped version as well.
Note, that the characeter filter replaces tags with a newline, so that
the produced HTML will look slightly different than the incoming HTML
with regards to newlines.
This commit is a correction of a doc bug in the docs for the ingest
date-index-name processor. The correct pattern is
yyyy-MM-dd'T'HH:mm:ss.SSSXX. This is due to the transition from Joda
time to Java time where Z does not mean the same thing between the two.
Prior to this commit (and after 6.5.0), if an ingest node changes
the _index in a pipeline, the original target index would be created.
For daily indexes this could create an extra, empty index per day.
This commit changes the TransportBulkAction to execute the ingest node
pipeline before attempting to create the index. This ensures that the
only index created is the original or one set by the ingest node pipeline.
This was the execution order prior to 6.5.0 (#32786).
The execution order was changed in 6.5 to better support default pipelines.
Specifically the execution order was changed to be able to read the settings
from the index meta data. This commit also includes a change in logic such
that if the target index does not exist when ingest node pipeline runs, it
will now pull the default pipeline (if one exists) from the settings of the
best matched of the index template.
Relates #32786
Relates #32758Closes#36545
As mapping types are being removed throughout Elasticsearch, the use of
`_type` in pipeline simulation requests is deprecated. Additionally, the
default `_type` used if one is not supplied has been changed to `_doc` for
consistency with the rest of Elasticsearch.
When the ingest node user agent parses the device field, it
will result in a string value. To match the ecs schema
this commit moves the value of the parsed device to an
object with an inner field named 'name'. There are not
any passivity concerns since this modifies an unreleased change.
closes#38094
relates #37329
* Add ECS schema for user-agent ingest processor (#37727)
This switches the format of the user agent processor to use the schema from [ECS](https://github.com/elastic/ecs).
So rather than something like this:
```
{
"patch" : "3538",
"major" : "70",
"minor" : "0",
"os" : "Mac OS X 10.14.1",
"os_minor" : "14",
"os_major" : "10",
"name" : "Chrome",
"os_name" : "Mac OS X",
"device" : "Other"
}
```
The structure is now like this:
```
{
"name" : "Chrome",
"original" : "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36",
"os" : {
"name" : "Mac OS X",
"version" : "10.14.1",
"full" : "Mac OS X 10.14.1"
},
"device" : "Other",
"version" : "70.0.3538.102"
}
```
This is now the default for 7.0. The deprecated `ecs` setting in 6.x is not
supported.
Resolves#37329
* Remove `ecs` setting from docs
* Default include_type_name to false for get and put mappings.
* Default include_type_name to false for get field mappings.
* Add a constant for the default include_type_name value.
* Default include_type_name to false for get and put index templates.
* Default include_type_name to false for create index.
* Update create index calls in REST documentation to use include_type_name=true.
* Some minor clean-ups around the get index API.
* In REST tests, use include_type_name=true by default for index creation.
* Make sure to use 'expression == false'.
* Clarify the different IndexTemplateMetaData toXContent methods.
* Fix FullClusterRestartIT#testSnapshotRestore.
* Fix the ml_anomalies_default_mappings test.
* Fix GetFieldMappingsResponseTests and GetIndexTemplateResponseTests.
We make sure to specify include_type_name=true during xContent parsing,
so we continue to test the legacy typed responses. XContent generation
for the typeless responses is currently only covered by REST tests,
but we will be adding unit test coverage for these as we implement
each typeless API in the Java HLRC.
This commit also refactors GetMappingsResponse to follow the same appraoch
as the other mappings-related responses, where we read include_type_name
out of the xContent params, instead of creating a second toXContent method.
This gives better consistency in the response parsing code.
* Fix more REST tests.
* Improve some wording in the create index documentation.
* Add a note about types removal in the create index docs.
* Fix SmokeTestMonitoringWithSecurityIT#testHTTPExporterWithSSL.
* Make sure to mention include_type_name in the REST docs for affected APIs.
* Make sure to use 'expression == false' in FullClusterRestartIT.
* Mention include_type_name in the REST templates docs.
This commit fixes some cross-doc links from the old ingest plugins page
to the new ingest processor pages that arose after converting
ingest-geoip and ingest-user-agent to modules.
This commit adds a placeholder ingest-geoip plugin page as there are
other components in the Elastic Stack that still refer to these
pages. These docs would be broken without this placeholder page forcing
teams responsible for those docs to scramble to fix the build over the
weekend before a holiday period. Instead, we add a placeholder page so
the docs build continues to function, and those teams can fix their docs
without the constraint of a broken build. We also cleanup a few minor
docs issues that were missed during the initial changes to convert
ingest-geoip to a module.
This commit breaks the single ingest docs file into multiple files,
factoring out the processor docs into a documentation file per
processor. This will help make this content easier to maintain.
This commit adds the last sequence number and primary term of the last operation that have
modified a document to `GetResult` and uses it to power the Update API.
Relates #36148
Relates #10708
Sometimes users are confused about whether they can use the Convert Processor
for changing an existing fields type to other types even if the existing one is already
ingested. This confusion is from the first line of description. Changing this and also
adding a some detail to the code snippet.
* ingest: Introduce the dissect processor
The ingest node dissect processor is an alternative to Grok
to split a string based on a pattern. Dissect differs from
Grok such that regular expressions are not used to split the
string.
Dissect can be used to parse a source text field with a
simpler pattern, and is often faster the Grok for basic string
parsing. This processor uses the dissect library which
does most of the work.
* INGEST: Extend KV Processor (#31789)
Added more capabilities supported by LS to the KV processor:
* Stripping of brackets and quotes from values (`include_brackets` in corresponding LS filter)
* Adding key prefixes
* Trimming specified chars from keys and values
Refactored the way the filter is configured to avoid conditionals during execution.
Refactored Tests a little to not have to add more redundant getters for new parameters.
Relates #31786
* Add documentation
ingest: Introduction of a bytes processor
This processor allows for human readable byte values (e.g. 1kb) to be converted to value in bytes (e.g. 1024). Internally this processor re-uses "ByteSizeValue.parseBytesSizeValue" which supports conversions up to Long.MAX_VALUE and the following units: "b", "kb", "mb", "gb", "tb", pb".
This change also introduces a generic return type for the AbstractStringProcessor to allow for code reuse while supporting a String -> T conversion. (String -> Long in this case).
This adds a thread interrupter that allows us to encapsulate calls to org.joni.Matcher#search()
This method can hang forever if the regex expression is too complex.
The thread interrupter in the background checks every 3 seconds whether there are threads
execution the org.joni.Matcher#search() method for longer than 5 seconds and
if so interrupts these threads.
Joni has checks that that for every 30k iterations it checks if the current thread is interrupted and
if so returns org.joni.Matcher#INTERRUPTED
Closes#28731
Adds support for triple quoted strings to the documentation test
generator. Kibana's CONSOLE tool has supported them for a year but we
were unable to use them in Elasticsearch's docs because the process that
converts example snippets into tests couldn't handle this. This change
adds code to convert them into standard JSON so we can pass them to
Elasticsearch.