Commit Graph

6372 Commits

Author SHA1 Message Date
István Zoltán Szabó e5d512a8ed
[DOCS] Fixes classification evaluation example response. (#49905) 2019-12-06 13:24:22 +01:00
István Zoltán Szabó 37cc0b6c9e
[DOCS] Fixes attribute in transforms overview. (#49898) 2019-12-06 10:23:01 +01:00
Hendrik Muhs 25474c62f1
[Transform][DOCS]rewrite client ip example to use continuous transform (#49822)
adapt the transform example for suspicious client ips to use continuous transform
2019-12-06 08:19:21 +01:00
Orhan Toy f002cd1b6f [DOCS] Minor typo fixes in reindex.asciidoc (#49863) 2019-12-05 20:24:22 +01:00
István Zoltán Szabó f7a5b73972
[DOCS] Adds an example of preprocessing actions to the PUT DFA API docs (#49831) 2019-12-05 14:15:19 +01:00
István Zoltán Szabó c793e80d3b
[DOCS] Fixes typo in the ML anomaly detection time functions docs. (#49834) 2019-12-05 09:57:01 +01:00
James Rodewig 72dd49ddcc
[DOCS] Document `minimum_should_match` defaults for `bool` query (#48865)
Adds documentation for the `minimum_should_match` parameter to the `bool` query docs. Includes docs for the default values:

- `1` if the `bool` query includes at least one `should` clause and no `must` or `filter` clauses
- `0` otherwise
2019-12-04 12:44:13 -05:00
James Rodewig e964a97005
[DOCS] Reformat length token filter docs (#49805)
* Adds a title abbreviation
* Updates the description and adds a Lucene link
* Reformats the parameters section
* Adds analyze, custom analyzer, and custom filter snippets

Relates to #44726.
2019-12-04 09:58:19 -05:00
Alexander Reelsen 062f9f03bf
Docs: Fix & test more grok processor documentation (#49447)
The documentation contained a small error, as bytes and duration was not
properly converted to a number and thus remained a string.

The documentation is now also properly tested by providing a full blown
simulate pipeline example.
2019-12-03 11:47:27 +01:00
James Rodewig 1a574115c1
[DOCS] Document CCR compatibility requirements (#49776)
* Creates a prerequisites section in the cross-cluster replication (CCR)
  overview.
* Adds concise definitions for local and remote cluster in a CCR context.
* Documents that the ES version of the local cluster must be the same
  or a newer compatible version as the remote cluster.
2019-12-02 15:52:13 -05:00
James Rodewig 6ea54eecf0
[DOCS] Reformat keep types and keep words token filter docs (#49604)
* Adds title abbreviations
* Updates the descriptions and adds Lucene links
* Reformats parameter definitions
* Adds analyze and custom analyzer snippets
* Adds explanations of token types to keep types token filter and tokenizer docs
2019-12-02 09:22:21 -05:00
James Rodewig 37baa50815
[DOCS] Explicitly document enrich `target_field` includes `match_field` (#49407)
When the enrich processor appends enrich data to an incoming document,
it adds a `target_field` to contain the enrich data.

This `target_field` contains both the `match_field` AND `enrich_fields`
specified in the enrich policy.

Previously, this was reflected in the documented example but not
explicitly stated. This adds several explicit statements to the docs.
2019-12-02 09:12:21 -05:00
Andrei Stefan ce727615c0
SQL: handle NULL arithmetic operations with INTERVALs (#49633) 2019-12-02 16:05:05 +02:00
David Turner 69e0b1a0f4
Drop snapshot instructions for autobootstrap fix (#49755)
The "Restore any snapshots as required" step is a trap: it's somewhere between
tricky and impossible to restore multiple clusters into a single one.

Also add a note about configuring discovery during a rolling upgrade to
proscribe any rare cases where you might accidentally autobootstrap during the
upgrade.
2019-12-02 12:43:18 +00:00
Tugberk Ugurlu ac07248402 [Docs] Fix typo in templates.asciidoc (#49726) 2019-11-29 18:43:45 +01:00
Henning Andersen 5b56a990b0
Deprecate sorting in reindex (#49458)
Reindex sort never gave a guarantee about the order of documents being
indexed into the destination, though it could give a sense of locality
of source data.

It prevents us from doing resilient reindex and other optimizations and
it has therefore been deprecated.

Related to #47567
2019-11-29 17:46:44 +01:00
Dimitris Athanasiou bad07b76f7
[ML] Add optional source filtering during data frame reindexing (#49690)
This adds a `_source` setting under the `source` setting of a data
frame analytics config. The new `_source` is reusing the structure
of a `FetchSourceContext` like `analyzed_fields` does. Specifying
includes and excludes for source allows selecting which fields
will get reindexed and will be available in the destination index.

Closes #49531
2019-11-29 14:20:31 +02:00
Marios Trivyzas 8ca11f54cd
[Docs] Enhance rolling upgrade guide (#49686)
Add a couple of pointers for the user to check the
overall cluster health and the version of ES running
on every node.

Fixes: #49670
2019-11-28 17:00:41 +01:00
Ignacio Vera eade4f03f4
New Histogram field mapper that supports percentiles aggregations. (#48580)
This commit adds  a new histogram field mapper that consists in a pre-aggregated format of numerical data to be used in percentiles aggregations.
2019-11-28 13:58:20 +01:00
Ryan Ernst 6c54b38a1b
Remove legacy referene to file scripts (#49339)
This commit removes outdated documentation about a path setting for file
scripts which no longer exist.

closes #45827
2019-11-27 10:42:15 -08:00
Ryan Ernst 0042500026
Add JAVA_HOME env override location to docs (#49565)
This commit clarifies how to override JAVA_HOME from the bundled jdk for
deb and rpm installs, which each have their own file that is sourced
upon service startup.

closes #49068
2019-11-27 10:39:54 -08:00
Xiang Dai 7a7d15ba0b [DOCS] Clarify how to update max memory size in bootstrap checks (#48975) 2019-11-27 09:39:34 -05:00
bellengao 00cef95a77 [DOCS] Correct the request path for flush API docs (#49615) 2019-11-27 09:26:57 -05:00
Yannick Welsch a5f23758a1 [DOCS] Correct request path for synced flush API docs (#49631)
Fixes an incorrect request path added with #46634
2019-11-27 08:43:21 -05:00
glerb 815ea928b2 [Docs] Correct typo in log file name (#49620) 2019-11-27 14:38:19 +01:00
Martijn van Groningen 88aea2107d
Add templating support to pipeline processor. (#49030)
This commit adds templating support to the pipeline processor's `name` option.

Closes #39955
2019-11-27 13:45:11 +01:00
Jim Ferenczi c2deb287f1
Add a cluster setting to disallow loading fielddata on _id field (#49166)
This change adds a dynamic cluster setting named `indices.id_field_data.enabled`.
When set to `false` any attempt to load the fielddata for the `_id` field will fail
with an exception. The default value in this change is set to `false` in order to prevent
fielddata usage on this field for future versions but it will be set to `true` when backporting
to 7x. When the setting is set to true (manually or by default in 7x) the loading will also issue
a deprecation warning since we want to disallow fielddata entirely when https://github.com/elastic/elasticsearch/issues/26472
is implemented.

Closes #43599
2019-11-27 13:38:09 +01:00
Dimitrios Liappis 1c9efba809
Clarify gid used by docker image process and bind-mount method
Fix reference about the uid:gid that Elasticsearch runs as inside
the Docker container and add a packaging test to ensure that bind
mounting a data dir with a random uid and gid:0 works as
expected.

Relates #49529
Closes #47929
2019-11-27 10:36:30 +02:00
Martijn van Groningen 4013e814e8
Add templating support to enrich processor (#49093)
Adds support for templating to `field` and `target_field` options.
2019-11-27 07:52:42 +01:00
lcawl 3b3f3ca925 [DOCS] Fixes typo in ML resources 2019-11-26 10:28:18 -08:00
lcawl 63b944c00f [DOCS] Fixes data type formatting 2019-11-26 08:21:39 -08:00
Mayya Sharipova fa8b48deef Optimize sort on numeric long and date fields.
This rewrites long sort as a `DistanceFeatureQuery`, which can
efficiently skip non-competitive blocks and segments of documents.
Depending on the dataset, the speedups can be 2 - 10 times.

The optimization can be disabled with setting the system property
`es.search.rewrite_sort` to `false`.

Optimization is skipped when an index has 50% or more data with
the same value.

Optimization is done through:
1. Rewriting sort as `DistanceFeatureQuery` which can
efficiently skip non-competitive blocks and segments of documents.

2. Sorting segments according to the primary numeric sort field(#44021)
This allows to skip non-competitive segments.

3. Using collector manager.
When we optimize sort, we sort segments by their min/max value.
As a collector expects to have segments in order,
we can not use a single collector for sorted segments.
We use collectorManager, where for every segment a dedicated collector
will be created.

4. Using Lucene's shared TopFieldCollector manager
This collector manager is able to exchange minimum competitive
score between collectors, which allows us to efficiently skip
the whole segments that don't contain competitive scores.

5. When index is force merged to a single segment, #48533 interleaving
old and new segments allows for this optimization as well,
as blocks with non-competitive docs can be skipped.

Closes #37043

Co-authored-by: Jim Ferenczi <jim.ferenczi@elastic.co>
2019-11-26 09:24:25 -05:00
Mayya Sharipova e9ba252176 Revert "Optimize sort on long field (#48804)"
This reverts commit 79d9b365c4.
2019-11-26 09:23:27 -05:00
Mayya Sharipova 79d9b365c4
Optimize sort on long field (#48804)
* Optimize sort on numeric long and date fields (#39770)

Optimize sort on numeric long and date fields, when 
the system property `es.search.long_sort_optimized` is true.

* Skip optimization if the index has duplicate data (#43121)

Skip sort optimization if the index has 50% or more data
with the same value.
When index has a lot of docs with the same value, sort
optimization doesn't make sense, as DistanceFeatureQuery
will produce same scores for these docs, and Lucene
will use the second sort to tie-break. This could be slower
than usual sorting.

* Sort leaves on search according to the primary numeric sort field (#44021)

This change pre-sort the index reader leaves (segment) prior to search
when the primary sort is a numeric field eligible to the distance feature
optimization. It also adds a tie breaker on `_doc` to the rewritten sort
in order to bypass the fact that leaves will be collected in a random order.
I ran this patch on the http_logs benchmark and the results are very promising:

```
|                                       50th percentile latency | desc_sort_timestamp |    220.706 |      136544 |   136324 |     ms |
|                                       90th percentile latency | desc_sort_timestamp |    244.847 |      162084 |   161839 |     ms |
|                                       99th percentile latency | desc_sort_timestamp |    316.627 |      172005 |   171688 |     ms |
|                                      100th percentile latency | desc_sort_timestamp |    335.306 |      173325 |   172989 |     ms |
|                                  50th percentile service time | desc_sort_timestamp |    218.369 |     1968.11 |  1749.74 |     ms |
|                                  90th percentile service time | desc_sort_timestamp |    244.182 |      2447.2 |  2203.02 |     ms |
|                                  99th percentile service time | desc_sort_timestamp |    313.176 |     2950.85 |  2637.67 |     ms |
|                                 100th percentile service time | desc_sort_timestamp |    332.924 |     2959.38 |  2626.45 |     ms |
|                                                    error rate | desc_sort_timestamp |          0 |           0 |        0 |      % |
|                                                Min Throughput |  asc_sort_timestamp |   0.801824 |    0.800855 | -0.00097 |  ops/s |
|                                             Median Throughput |  asc_sort_timestamp |   0.802595 |    0.801104 | -0.00149 |  ops/s |
|                                                Max Throughput |  asc_sort_timestamp |   0.803282 |    0.801351 | -0.00193 |  ops/s |
|                                       50th percentile latency |  asc_sort_timestamp |    220.761 |     824.098 |  603.336 |     ms |
|                                       90th percentile latency |  asc_sort_timestamp |    251.741 |     853.984 |  602.243 |     ms |
|                                       99th percentile latency |  asc_sort_timestamp |    368.761 |     893.943 |  525.182 |     ms |
|                                      100th percentile latency |  asc_sort_timestamp |    431.042 |      908.85 |  477.808 |     ms |
|                                  50th percentile service time |  asc_sort_timestamp |    218.547 |     820.757 |  602.211 |     ms |
|                                  90th percentile service time |  asc_sort_timestamp |    249.578 |     849.886 |  600.308 |     ms |
|                                  99th percentile service time |  asc_sort_timestamp |    366.317 |     888.894 |  522.577 |     ms |
|                                 100th percentile service time |  asc_sort_timestamp |    430.952 |     908.401 |   477.45 |     ms |
|                                                    error rate |  asc_sort_timestamp |          0 |           0 |        0 |      % |
```

So roughly 10x faster for the descending sort and 2-3x faster in the ascending case. Note
that I indexed the http_logs with a single client in order to simulate real time-based indices
where document are indexed in their timestamp order.

Relates #37043

* Remove nested collector in docs response

As we don't use cancellableCollector anymore, it should be removed from
the expected docs response.

* Use collector manager for search when necessary (#45829)

When we optimize sort, we sort segments by their min/max value.
As a collector expects to have segments in order,
we can not use a single collector for sorted segments.
Thus for such a case, we use collectorManager,
where for every segment a dedicated collector will be created.

* Use shared TopFieldCollector manager

Use shared TopFieldCollector manager for sort optimization.
This collector manager is able to exchange minimum competitive
score between collectors

* Correct calculation of avg value to avoid overflow

* Optimize calculating if index has duplicate data
2019-11-26 09:07:39 -05:00
Martijn van Groningen 2ba00c8149
Introduce on_failure_pipeline ingest metadata inside on_failure block (#49076)
In case an exception occurs inside a pipeline processor,
the pipeline stack is kept around as header in the exception.
Then in the on_failure processor the id of the pipeline the
exception occurred is made accessible via the `on_failure_pipeline`
ingest metadata.

Closes #44920
2019-11-26 14:49:51 +01:00
Marios Trivyzas f2aa7f0779
SQL: Add TRUNC alias for TRUNCATE (#49571)
Add TRUNC as alias to already implemented TRUNCATE
numeric function which is the flavour supported by
Oracle and PostgreSQL.

Relates to: #41195
2019-11-26 12:30:49 +01:00
Christoph Büscher 0162662eb8
[Docs] Correct `max_doc_freq` default value (#49536)
The default is set to Integer.MAX_VALUE but is reported to be `0` in the docs.
With the current implementation a value of 0 would mean all terms are filtered
out, which is the opposite of "unbounded".

Closes #49520
2019-11-26 10:46:44 +01:00
Lisa Cawley 9cc247d929
[DOCS] Fixes security links (#49563) 2019-11-25 12:59:59 -08:00
James Rodewig 1471f34c54
[DOCS] Reformat delimited payload token filter docs (#49380)
* Adds a title abbreviation
* Relocates the older name deprecation warning
* Updates the description and adds a Lucene link
* Adds a note to explain payloads and how to store them
* Adds analyze and custom analyzer snippets
* Adds a 'Return stored payloads' example
2019-11-25 15:38:52 -05:00
James Rodewig 25ffce9391
[DOCS] Remove individual task retrieval from cat/tasks API (#49550) 2019-11-25 10:31:13 -05:00
Kelly Campbell 1542a728c9 [DOCS] Correct GET path in cat tasks API docs (#49494)
Previously, the request example included `GET _cat/_tasks`. However, the resource should be `tasks`, not `_tasks`.
2019-11-25 09:37:17 -05:00
Przemko Robakowski 04f6b6fdb2
[DOCS] IDs for doc snippets (#49008)
* Ids for docs snippets

* Ids for tests

* Ids for docs snippets

* ignoring build folder from idea

* Ignoring build-eclipse
2019-11-25 15:30:00 +01:00
David Roberts 40c951d781
[ML] Add default categorization analyzer definition to ML info (#49545)
The categorization job wizard in the ML UI will use this
information when showing the effect of the chosen categorization
analyzer on a sample of input.
2019-11-25 13:20:12 +00:00
Dimitris Athanasiou 5a6967af57
[ML][DOCS] Anomaly detection job retention days settings do not require restart (#49546) 2019-11-25 15:12:41 +02:00
James Rodewig 642390c3a7 [DOCS] Fix edge n-gram tokenizer nav
Adds a missing float tag to the edge n-gram tokenizer docs. This tag
ensures the edge n-gram tokenizer docs display on the same page.
2019-11-22 15:51:52 -05:00
Jason Tedor da20957e81
Replace required pipeline with final pipeline (#49470)
This commit enhances the required pipeline functionality by changing it
so that default/request pipelines can also be executed, but the required
pipeline is always executed last. This gives users the flexibility to
execute their own indexing pipelines, but also ensure that any required
pipelines are also executed. Since such pipelines are executed last, we
change the name of required pipelines to final pipelines.
2019-11-22 14:00:38 -05:00
Dimitris Athanasiou 0390ec3627
[ML] Explain data frame analytics API (#49455)
This commit replaces the _estimate_memory_usage API with
a new API, the _explain API.

The API consolidates information that is useful before
creating a data frame analytics job.

It includes:

- memory estimation
- field selection explanation

Memory estimation is moved here from what was previously
calculated in the _estimate_memory_usage API.

Field selection is a new feature that explains to the user
whether each available field was selected to be included or
not in the analysis. In the case it was not included, it also
explains the reason why.
2019-11-22 20:08:14 +02:00
Lisa Cawley a4efab6ab4
[DOCS] Merge rollup config details into API (#49412) 2019-11-22 08:31:30 -08:00
James Rodewig ddf5c0a76a
[DOCS] Reformat n-gram token filter docs (#49438)
Reformats the edge n-gram and n-gram token filter docs. Changes include:

* Adds title abbreviations
* Updates the descriptions and adds Lucene links
* Reformats parameter definitions
* Adds analyze and custom analyzer snippets
* Adds notes explaining differences between the edge n-gram and n-gram
  filters

Additional changes:
* Switches titles to use "n-gram" throughout.
* Fixes a typo in the edge n-gram tokenizer docs
* Adds an explicit anchor for the `index.max_ngram_diff` setting
2019-11-22 10:38:01 -05:00
István Zoltán Szabó 56888ff194
[DOCS] Removes the default size definition of thread pool types (#49442)
Co-Authored-By: James Rodewig <james.rodewig@elastic.co>
2019-11-22 11:15:35 +01:00