Initial support for Apache Arrow's streaming format as a response for ES|QL. It triggers based on the Accept header or the format request parameter.
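As a rough illustration of the two triggers, here is a minimal sketch that builds (but does not send) an ES|QL request for each. The `/_query` path, the `application/vnd.apache.arrow.stream` media type, the `format=arrow` value, and the `build_esql_request` helper are assumptions for illustration; this description does not spell them out.

```python
import json
import urllib.request

# Assumed media type for the Arrow IPC streaming format.
ARROW_STREAM = "application/vnd.apache.arrow.stream"


def build_esql_request(base_url, query, use_header=True):
    """Build an ES|QL request that asks for an Arrow stream response.

    Hypothetical helper: the endpoint path and parameter value are assumptions.
    """
    url = base_url.rstrip("/") + "/_query"
    headers = {"Content-Type": "application/json"}
    if use_header:
        # Trigger 1: content negotiation through the Accept header.
        headers["Accept"] = ARROW_STREAM
    else:
        # Trigger 2: the format request parameter.
        url += "?format=arrow"
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


by_header = build_esql_request("http://localhost:9200", "FROM my-index | LIMIT 10")
by_param = build_esql_request("http://localhost:9200", "FROM my-index | LIMIT 10", use_header=False)
```

Passing either request to `urllib.request.urlopen` against a running cluster would return the binary stream, which an Arrow client library can then read.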
Arrow has implementations in every mainstream language and is a backend of the Python Pandas library, which is extremely popular among data scientists and data analysts. Arrow's streaming format has also become the de facto standard for dataframe interchange. It is an efficient binary format that allows zero-cost deserialization by adding data access wrappers on top of memory buffers received from the network.
This PR builds on the experiment made by @nik9000 in PR #104877.
Features/limitations:
- all ES|QL data types are supported
- multi-valued fields are not supported
- fields of type _source are output as JSON text in a varchar array. In a future iteration we may want to offer the choice of the more efficient CBOR and SMILE formats.
Technical details:
Arrow comes with its own memory management to handle vectors with direct memory, reference counting, etc. We don't want to use this as it conflicts with Elasticsearch's own memory management.
We therefore use the Arrow library only for the metadata objects describing the dataframe schema and the structure of the streaming format. The Arrow vector data is produced directly from ES|QL blocks.
---------
Co-authored-by: Nik Everett <nik9000@gmail.com>
This adds a `NOTE` to each comparison saying that pushing the comparison
to the search index requires that the field have an `index` and
`doc_values`. This differs from the rest of Elasticsearch, which only
requires an `index`, and it's caused by our insistence that comparisons
only return true for single-valued fields. We can accelerate comparisons
without `doc_values` in the future, but we just haven't written that
code yet.
* ESQL: change FROM quoting from backtick to quote
For historical reasons, the source declaration inside the FROM command is
treated as an identifier, using backticks (`) for escaping the value.
This is inconsistent since the source is not an identifier (field name)
but an index name, which has different semantics:
`index` means a field named index, while "index" means a literal with
said value.
In the case of FROM, the index name/location is more like a literal (also
in unquoted form) than an identifier (that is, a reference to a value).
This PR tweaks the grammar and plugs in the quoted string logic so that
both the double quote (") and the triple double quote (""") are allowed.
* Update grammar
* Add more tests
* Add a few more tests
* Add extra test
* Update docs/changelog/108395.yaml
* Address review comments
* Add doc note
* Revert test rename
* Fix quoting with remote cluster
* Update docs/reference/esql/source-commands/from.asciidoc
Co-authored-by: marciw <333176+marciw@users.noreply.github.com>
---------
Co-authored-by: Bogdan Pintea <bogdan.pintea@elastic.co>
Co-authored-by: Bogdan Pintea <pintea@mailbox.org>
Co-authored-by: marciw <333176+marciw@users.noreply.github.com>
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
- Added a new `AbstractAggregationTestCase` base class for tests. It shares most of the code of the function tests, adapted for aggregations, including both testing and docs generation.
- Reused the `AbstractFunctionTestCase` class to also let us test evaluators if the aggregation is foldable
- Added a `TopListTests` example
- This includes the docs for Top_list _(Also added a missing include of Ip_prefix docs)_
- Adapted Kibana docs to use `type: "agg"` (@drewdaemon)
The current tests are very basic: consume a page, generate an output,
all in Single aggregation mode (no intermediates, no grouping). More
complex testing will be added in future PRs.
Initial PR of https://github.com/elastic/elasticsearch/issues/109917
* WIP Started refactoring in preparation for ST_DISTANCE
* Initial evaluators for ST_DISTANCE
* Update docs/changelog/108764.yaml
* Fix invalid changelog generated by CI
* Register function and get unit tests working
* Fixed failing meta function description tests, and refined descriptions
* Added initial CsvTests and calculate Geo differently to Cartesian
* Added more csv-spec tests and changed to arcDistance for accuracy
* Added generated docs files
* Link to generated docs
* Fix examples tag for linking from generated docs
* Skip wrapper function
And note that we might want to include instead some of the related intelligence from Circle2D::HaversineDistance class
* Added ST_DWITHIN and more tests for ST_DISTANCE and ST_DWITHIN
* Code style
* Added more tests, this time for sorting on distance
* Fixes after rebase on main
* The ST_DWITHIN cannot use BinarySpatialFunction because it is ternary
So we moved the common code to a separate SpatialTypeResolver, and made a simpler TernarySpatialFunction based on a simple TernaryScalarFunction. This had additional consequences, simplifying the points-only cases.
The main reason for this change was to support StDWithinTests which need to test a lot of things that involve varying all three input types, generating expected error strings, etc. The original hack of just adding to BinarySpatialFunction worked for the actual integration tests, but clearly did not satisfy all the use cases tested by the unit tests.
We also restricted ST_DWITHIN to take only a double as the third argument, because otherwise the number of evaluators would explode, since we need a separate evaluator for each Block type, and Integer and Double use different block types.
* Fixed function count after rebasing on main
* Update docs/changelog/108764.yaml
* Added generated docs for ST_DWITHIN
* Connect docs for ST_DWITHIN
* Add back issue link
* Remove support for ST_DWITHIN
* Update docs/changelog/108764.yaml
* Bring back link to issue in changelog
* Update x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/scalar/spatial/StDistance.java
Co-authored-by: Ignacio Vera <iverase@gmail.com>
* Revert reformatting of function descriptions
We should put this into a separate PR
* Github merged commit with incorrectly formatted whitespace
---------
Co-authored-by: Ignacio Vera <iverase@gmail.com>
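To illustrate why the ST_DISTANCE commits above calculate Geo differently from Cartesian, here is a minimal sketch of the two computations. It uses the textbook haversine formula for the arc distance; the Earth radius constant and both function names are assumptions for illustration, not the evaluator code in this PR.

```python
import math

# Assumed mean Earth radius in meters; the value used by the real
# implementation is not stated in this description.
EARTH_MEAN_RADIUS_M = 6371008.7714


def geo_arc_distance(lon1, lat1, lon2, lat2):
    """Great-circle (haversine) distance in meters between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    # Clamp to 1.0 to guard against floating point drift before asin.
    return 2 * EARTH_MEAN_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))


def cartesian_distance(x1, y1, x2, y2):
    """Plain Euclidean distance, in the same units as the coordinates."""
    return math.hypot(x2 - x1, y2 - y1)


one_degree_at_equator = geo_arc_distance(0.0, 0.0, 1.0, 0.0)
three_four_five = cartesian_distance(0.0, 0.0, 3.0, 4.0)
```

The geo result is in meters on the sphere, while the cartesian result stays in coordinate units, so the two cannot share one formula.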
This adds a test that generates
`docs/reference/esql/functions/kibana/inline_cast.json` which is a json
object whose keys are the names of valid inline casts and whose values
are the resulting data types.
I also moved one of the maps we use to make the inline casts to
`DataType`, which is a place where we want it.
When you divide two integers or two longs we round towards 0, like
Postgres or Java or Rust or C. Other systems, like MySQL or SPL or
JavaScript or Python, always produce a floating point number. We should
warn folks about this. It's genuinely unexpected for some folks. OTOH,
converting into a floating point number would be unexpected for other
folks. Oh well, let's document what we've got.
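A small sketch of the difference, written in Python because its operators show both surprises at once. `div_toward_zero` is a hypothetical helper that mimics the round-towards-0 behavior described above; handling of division by zero is outside the scope of this sketch.

```python
def div_toward_zero(a: int, b: int) -> int:
    """Divide two integers, rounding the quotient towards 0 (like Java or C)."""
    if b == 0:
        raise ZeroDivisionError("division by zero is not modeled in this sketch")
    quotient = abs(a) // abs(b)
    # The result is negative only when the operands have different signs.
    return quotient if (a >= 0) == (b > 0) else -quotient


print(div_toward_zero(7, 2))    # 3
print(div_toward_zero(-7, 2))   # -3, rounded towards 0
print(-7 // 2)                  # -4, Python's floor division rounds down instead
print(-7 / 2)                   # -3.5, Python's true division yields a float
```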
Fixing MvAppendTests circuit breaker (CB) exceptions by generating
smaller geometries: the test generates a lot of documents and the CB is
too small for multiple big shapes.
Fixes https://github.com/elastic/elasticsearch/issues/109409
Add support for the string manipulation function REPEAT(string, number). This function concatenates the string argument with itself the specified number of times. If number is 0 an empty string is returned. If number is less than 0, null is returned and a warning is logged. If number is less than 0 and is a constant, the query will fail without executing.
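The runtime rules above can be sketched as follows. This is a Python model of the described semantics, not the actual evaluator: `None` stands in for null, a Python warning stands in for the logged warning, and returning null for null inputs is an assumption. The plan-time failure for a constant negative number is not modeled.

```python
import warnings


def repeat(string, number):
    """Model of REPEAT(string, number) as described above."""
    if string is None or number is None:
        # Assumption: null inputs short-circuit to null.
        return None
    if number < 0:
        # Negative counts return null and emit a warning.
        warnings.warn(f"Number parameter cannot be negative, found [{number}]")
        return None
    # A count of 0 naturally yields the empty string.
    return string * number


print(repeat("ab", 3))  # ababab
print(repeat("ab", 0) == "")  # True
```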
Adding the `MV_APPEND(value1, value2)` function, which appends two values,
creating a single multi-value. If one or both of the inputs are
multi-values, the result is the concatenation of all the values, e.g.
```
MV_APPEND([a, b], [c, d]) -> [a, b, c, d]
```
~I think for this specific case it makes sense to consider `null` values
as empty arrays, so that~ ~MV_APPEND(value, null) -> value~ ~It is
pretty uncommon for ESQL (all the other functions, apart from
`COALESCE`, short-circuit to `null` when one of the values is null), so
let's discuss this behavior.~
[EDIT] considering the feedback from Andrei, I changed this logic and
made it consistent with the other functions: now if one of the
parameters is null, the function returns null
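The final behavior can be modeled with a short sketch. It uses Python lists for multi-values and `None` for null; this is an illustration of the semantics described above, not the evaluator code.

```python
def mv_append(value1, value2):
    """Model of MV_APPEND: concatenate two single or multi-valued inputs."""
    if value1 is None or value2 is None:
        # Consistent with the other functions: any null input yields null.
        return None
    left = value1 if isinstance(value1, list) else [value1]
    right = value2 if isinstance(value2, list) else [value2]
    return left + right


print(mv_append(["a", "b"], ["c", "d"]))  # ['a', 'b', 'c', 'd']
print(mv_append("a", ["b", "c"]))         # ['a', 'b', 'c']
print(mv_append("a", None))               # None
```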
Added an ESQL function to get the prefix of an IP. It now works with both
IPv4 and IPv6. For users planning to use it with mixed IPs, we may need
to add a function like "is_ipv4()" first.
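As a sketch of what "prefix of an IP" means for mixed inputs, the following uses Python's standard `ipaddress` module. Taking one prefix length per IP version is an assumption made for this illustration, as is the `ip_prefix` helper name; the description above does not state the function's signature.

```python
import ipaddress


def ip_prefix(ip, prefix_length_v4, prefix_length_v6):
    """Zero out the host bits of an address, choosing the length by IP version."""
    address = ipaddress.ip_address(ip)
    length = prefix_length_v4 if address.version == 4 else prefix_length_v6
    # strict=False masks the host bits instead of rejecting them.
    network = ipaddress.ip_network(f"{address}/{length}", strict=False)
    return str(network.network_address)


print(ip_prefix("1.2.3.4", 24, 112))           # 1.2.3.0
print(ip_prefix("2001:db8:abcd::1", 24, 32))   # 2001:db8::
```

An out-of-range prefix length raises `ValueError` here, which loosely corresponds to the "wrong" cases mentioned in the note on the skipped test.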
**About the skipped test:** There's currently a "bug" in the
evaluators/functions that return null. Evaluators can't handle them.
We'll work on support for that in another PR. It affects other
functions, like `substring()`. In this function, however, it only
affects "wrong" cases (like an invalid prefix), so it has no impact.
Fixes https://github.com/elastic/elasticsearch/issues/99064
Adding more unit tests for `coalesce()` function, in particular adding
tests for `ip`, `date` and spatial data types.
This also generates the right signatures for Kibana.
Related to https://github.com/elastic/elasticsearch/issues/108982
Part of https://github.com/elastic/elasticsearch/issues/106679
* Copy the `ql` project into a different project _just for esql_, call it `esql-core`.
* Make `esql` depend only on the latter.
* Fix `EsqlNodeSubclassTests`; I'm confused why this didn't bite us earlier.
* Update the warning regexes in some csv tests as the exceptions have other package names now.
**Note to reviewers:** Exclude the first commit when viewing the diff,
as that contains only the actual copying of `ql`. The remaining commits
are the actually meaningful ones. _The `build.gradle` files probably
require the most attention._
- Added the cube root function to ESQL (`CBRT(x)`). Nearly identical to SQRT, but without the exception for negative numbers
- Added docs generation support for Windows line endings (CRLF): within the examples, it was writing the "\r" without the "\n" (which was being converted to "\\n"), plus some other inconsistencies
- Some updates to the `package-info.java` documentation on how to create functions
- Fixes https://github.com/elastic/elasticsearch/issues/108675
Functions issue: https://github.com/elastic/elasticsearch/issues/98545
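To make the contrast with SQRT concrete, here is a minimal numeric sketch. Both helpers are hypothetical models of the described behavior (`None` stands in for the null that a negative square root would produce); they are not the ESQL evaluators.

```python
import math


def cbrt(x):
    """Cube root defined for all real numbers, including negatives."""
    # Take the root of the magnitude, then restore the sign.
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def sqrt_or_none(x):
    """Square root that models the negative-number exception as None."""
    return math.sqrt(x) if x >= 0 else None


negative_root = cbrt(-27.0)        # close to -3.0, up to floating point error
no_real_root = sqrt_or_none(-27.0)  # None: no real square root exists
```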
This moves examples from files marked to run in integration tests only
to the files where they belong and disables this pattern matching. We
now use supported features.
This adds `nanosecond`, `microsecond` and `quarter` to the set of
supported time spans. It also adds a few standard and common
abbreviations to some existing ones.
This adds some clarifications on the time unit strings the function
takes as arguments, noting the differences between these and the time
span literals, as well as the abbreviations' source.
This reworks the integration-test-only csv testing for `metadata` to use
the `required_feature:` syntax instead of the `-IT_tests_only`
extension. This is a little more flexible and way nicer on the eyes.
This takes the CIDR_MATCH out of the operators group and adds it to a
new `IP functions` group.
The change also rearranges the groups, grouping together the
type-specific functions and ordering them alphabetically.
This fixes the generation of the signatures for variadic functions,
except for those that take a list as the last argument. That is, it
covers functions with optional arguments (like ROUND) and functions with
overloading-like signatures (like BUCKET).
This commit adds support for numeric metrics counter fields in ES|QL.
These counter types, including counter_long, counter_integer, and
counter_double, are different from their parent types. Users will have
limited interaction with these counter types, restricted to:
- Retrieving values without any processing
- Casting to their root type (e.g., to_long(a_long_counter))
- Using them in the metrics rate aggregation
These restrictions are intentional to prevent misuse. If users want to
use them as numeric values, explicit casting to their root types is
required.
This adds the documentation for BUCKET as a grouping function and the
addition of the "direct" invocation mode providing a span (in addition
to the auto mode).
This moves TO_BASE64 and FROM_BASE64 from the type conversion functions
to the string functions (they take a string as input and output another
string).