Support for `ST_EXTENT_AGG` was added in https://github.com/elastic/elasticsearch/pull/118829, and then partially optimized in https://github.com/elastic/elasticsearch/pull/118829. This optimization worked only for cartesian_shape fields, and worked by extracting the Extent from the doc-values and re-encoding it as a WKB `BBOX` geometry. This does not work for geo_shape, where we need to retain all 6 integers stored in the doc-values so that the dateline choice can be made at reduce time, during the final phase of the aggregation.
Since both geo_shape and cartesian_shape perform the aggregations using integers, and the original Extent values in the doc-values are integers, this PR expands the previous optimization by:
* Saving all Extent values into a multi-valued field in an IntBlock for both cartesian_shape and geo_shape
* Simplifying the logic around merging intermediate states for all cases (geo/cartesian and grouped and non-grouped aggs)
* Widening test cases for testing more combinations of aggregations and types, and fixing a few bugs found
* Enhancing cartesian extent to convert from 6 ints to 4 ints at block loading time (for efficiency)
* Fixing bugs in both cartesian and geo extents for generating intermediate state with missing groups (flaky tests in serverless)
* Moving the int order to always match Rectangle for the 4-int case and Extent for the 6-int case (improved internal consistency)
Since the PR already changed the meaning of the invalid/infinite values of the intermediate state integers, it was no longer compatible with previous cluster versions, and we disabled mixed-cluster testing to prevent errors resulting from that. This leaves us the opportunity to make further mixed-cluster-incompatible changes, hence the decision to perform this consistency update now.
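For illustration, here is a minimal, hedged sketch of why the intermediate state keeps all 6 integers (the field names loosely follow Lucene's Extent and are not the actual aggregator code): merging two geo extents is a per-slot min/max, and the dateline decision is deferred until the final reduce.

```java
// Hedged sketch: illustrative record, not the actual Elasticsearch classes.
record GeoExtent(int top, int bottom, int negLeft, int negRight, int posLeft, int posRight) {

    // Merging intermediate states never commits to a dateline choice;
    // it only widens each of the six slots.
    static GeoExtent merge(GeoExtent a, GeoExtent b) {
        return new GeoExtent(
            Math.max(a.top(), b.top()),
            Math.min(a.bottom(), b.bottom()),
            Math.min(a.negLeft(), b.negLeft()),    // westernmost negative longitude
            Math.max(a.negRight(), b.negRight()),  // easternmost negative longitude
            Math.min(a.posLeft(), b.posLeft()),    // westernmost positive longitude
            Math.max(a.posRight(), b.posRight())   // easternmost positive longitude
        );
    }
}
```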
This PR adds support for ST_EXTENT_AGG aggregation, i.e., computing a bounding box over a set of points/shapes (Cartesian or geo). Note the difference between this aggregation and the already implemented scalar function ST_EXTENT.
This isn't a very efficient implementation, and future PRs will attempt to read these extents directly from the doc values.
We currently always use longitude wrapping, i.e., we may wrap around the dateline for a smaller bounding box. Future PRs will let the user control this behavior.
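As a hedged sketch of that choice (illustrative names, working in degrees rather than the encoded integers the aggregation actually uses): given the negative- and positive-longitude sub-ranges, pick whichever of the wrapped and unwrapped boxes is narrower.

```java
class DatelineChoice {
    // Assumes both hemispheres contain points; the real logic must also handle
    // one-sided cases and the encoded-integer representation.
    static double[] chooseXRange(double negLeft, double negRight, double posLeft, double posRight) {
        double unwrappedWidth = posRight - negLeft;                 // box spanning Greenwich
        double wrappedWidth = (180 - posLeft) + (negRight + 180);   // box spanning the dateline
        return wrappedWidth < unwrappedWidth
            ? new double[] { posLeft, negRight }   // minX > maxX signals a dateline crossing
            : new double[] { negLeft, posRight };
    }
}
```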
Fixes #104659.
The libs projects are configured to all begin with `elasticsearch-`.
While it is desirable for the artifacts to carry this consistent
prefix, it means the project names don't match up with their
directories. Additionally, it creates complexities for subproject naming
that must be manually adjusted.
This commit adjusts the project names for those under libs to be their
directory names. The resulting artifacts for these libs are kept the
same, all beginning with `elasticsearch-`.
* Add maximum nested depth check to WKT parser
This prevents StackOverflowErrors, replacing them with ParseException errors, which are more easily handled by a running server (a minimal sketch of such a guard follows below).
* Update docs/changelog/111843.yaml
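A minimal, hedged sketch of such a depth guard (names are illustrative, not the actual parser code): nesting is counted as geometries are entered, and a ParseException is thrown once a configured maximum is exceeded.

```java
import java.text.ParseException;

class WktDepthGuard {
    private final int maxDepth;
    private int depth = 0;

    WktDepthGuard(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    // Called when the parser descends into a nested geometry (e.g. GEOMETRYCOLLECTION).
    void enterNestedGeometry(int position) throws ParseException {
        if (++depth > maxDepth) {
            throw new ParseException("maximum nested depth of " + maxDepth + " exceeded", position);
        }
    }

    void exitNestedGeometry() {
        depth--;
    }
}
```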
The only reason this method is throwing an exception is because the
method ByteArrayOutputStream#close() is declaring it although it is a
noop. Therefore it can be safely ignored.
Thanks @romseygeek for bringing this to our attention.
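A tiny, hypothetical illustration (not the actual method touched by this change) of why the checked exception can simply be dropped:

```java
import java.io.ByteArrayOutputStream;

class InMemorySerialization {
    // No `throws IOException` needed: ByteArrayOutputStream never performs I/O, and
    // its close() is a no-op that merely declares the exception, so we skip calling it.
    static byte[] copy(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.writeBytes(data);   // Java 11+, declares no checked exception
        return out.toByteArray();
    }
}
```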
Another round of automated fixes to this, marking things that can be
made static as static. Saves some JIT cycles but also turns some lambdas
from capturing to non-capturing and makes the "utilityness" of some
classes visible.
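A small, hypothetical illustration of the lambda effect: a method reference to a static helper is non-capturing (and can be reused), whereas `this::normalize` on an instance method would capture `this`.

```java
import java.util.List;
import java.util.Locale;

class Parser {
    // Static helper: `Parser::normalize` below is a non-capturing method reference.
    private static String normalize(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }

    List<String> normalizeAll(List<String> values) {
        return values.stream().map(Parser::normalize).toList();
    }
}
```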
* WIP Started geo_line for TSDB work
Starting with YAML tests (which currently pass) and AggregatorTests
(currently failing, likely due to a mistake in the tests)
* Update docs/changelog/94954.yaml
* WIP Refactoring to prepare for TSDB geo_line
* Created TimeSeries version of GeoLineAggregator, and wired it in so that time-series aggregations use it, but current behavior is still identical to non-time-series.
* Added both yaml and unit tests for testing that geo_line works with correct results in both time-series and non-time-series cases.
* Added additional tests to verify the grouping behaviour of time-series vs. terms aggs, and the combination of the two.
* WIP Refactoring to prepare for TSDB geo_line
* Started refactoring to re-use simplifier for all buckets
* Fixed bug with leaf collector not changing per segment
* Fixed bug with leaf collector not detecting bucket changes
The bucket id can change within a segment, so we need to detect this and save the geo_line when it does (see the sketch after this commit list).
* Renamed class since it no longer extends BucketedSort
The original geo_line relied on the BucketedSort for all intelligence.
The time-series geo_line uses none of that, and does its own memory management.
* Fixed bug with geo_point leaking between geo_line buckets
And enhanced unit tests to cover multiple groups
* Code review updates
* Verify that the sort field is specifically the TS timestamp
Only activate the time-series optimizations if both:
* The aggregation is within a time-series aggregation (i.e. ordered by tsid and @timestamp)
* The geo_line sort field is @timestamp
* Allow geo_point time-series to skip sort config
Also disables the new geo_line for time-series, even when the correct
sort and point fields are used, if the point field is not explicitly
configured as a position metric.
* Support geo_centroid and geo_bounds on position metric
* Update yaml tests for multi-terms tests
* Changed to disallow alternative sort-fields in ts-geo_line
Since the primary criterion for switching to the new algorithm is that
geo_line is within a time-series aggregation, we now disallow any other sort field.
We test the negative case in the yaml tests, but changed the unit tests to
use TermsAggregation to mimic the time-series aggregation and get comparable
results.
* For non-time-series check missing sort field early
The old code only threw an error if there was data, because the check was done
inside the leaf collector just before actually reading the sort field,
and there were no tests for a missing sort field.
This commit adds the tests and performs the check early, so the error is thrown even if data is missing.
* Reviewed TODOs
* Test that behaviour is identical with or without POSITION metric
* Removed fallback code in builder (was switching to old geo_line without POSITION metric)
* Removed two TODO's that are no longer valid concerns
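A standalone, hedged sketch of the bucket-change handling mentioned above (illustrative names, not the actual aggregator classes): the owning bucket ordinal can change from document to document within a segment, so the collector flushes the current line whenever the bucket changes.

```java
class BucketAwareCollector {
    private long currentBucket = -1;

    void collect(int doc, long owningBucketOrd) {
        if (owningBucketOrd != currentBucket) {
            if (currentBucket != -1) {
                flushGeoLine(currentBucket);   // save the geo_line built so far
            }
            currentBucket = owningBucketOrd;   // start simplifying the next bucket
        }
        // ... consume the doc's point and timestamp into the per-bucket simplifier ...
    }

    void postCollect() {
        if (currentBucket != -1) {
            flushGeoLine(currentBucket);       // flush the final bucket of the segment
        }
    }

    private void flushGeoLine(long bucket) {
        // persist the simplified line for `bucket`
    }
}
```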
Support geometry and streaming simplification
There are many opportunities to enable geometry simplification in Elasticsearch, both as an explicit feature available to users, and as an internal optimization technique for reducing memory consumption for complex geometries. For the latter case, it can even be considered a bug fix. This PR provides support for constraining Line and LinearRing sizes to a fixed number of points, and thereby a fixed amount of memory usage.
Consider, for example, the geo_line aggregation. This is similar to the top-10 aggregation, but allows the top-10k (ten thousand) points to be aggregated. This is not only a lot of memory, but can still cause unwanted line truncation for very large geometries. Line simplification is a solution to this. It is likely that a much smaller limit than 10k would suffice, while at the same time not truncating the geometry at all, so we fix a bug (truncation) while improving memory usage (pulling the limit from 10k down to perhaps just 1k).
This PR provides two APIs:
Streaming:
* By using the simplifier.consume(x, y) method on a stream of points, the total memory used is limited to a linear function of k, the total number of points to retain. This algorithm is at its heart based on the Visvalingam–Whyatt algorithm, with concepts from https://bost.ocks.org/mike/simplify/ and in particular the detailed streaming discussions in the paper at https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.106.7132&rep=rep1&type=pdf
Full-geometry:
* Simplifying full geometries using the simplifier.simplify(geometry) method can work with most geometry types, even GeometryCollection, but:
- Some geometries do not get simplified because it makes no sense to do so: Point, Circle, Rectangle
- The maxPoints parameter applies as-is to the main component (the shell for polygons, the largest geometry for multi-polygons and geometry collections), while all other sub-components (holes in polygons, etc.) are simplified to a maxPoints value scaled down by the relative size of the sub-component to the main component.
* The simplification itself is done on each Line and LinearRing component using the same streaming algorithm above. Since we use the Visvalingam–Whyatt algorithm, this approach is applicable to both streaming and full-geometry simplification with essentially the same result, but with better control over memory than typical full-geometry simplifiers.
The basic algorithm for simplification on a stream of points requires maintaining two data structures:
* an array of all currently simplified points (implicitly ordered in stream order)
* a priority queue of all but the two end points with an estimated error on each that expresses the cost of removing that point from the line
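The following is a simplified, hedged sketch of that streaming idea (illustrative class, and a linear scan in place of the priority queue described above, to keep it short): at most maxPoints points are retained, and once full, each new point evicts the interior point whose removal changes the line the least (smallest triangle area).

```java
import java.util.ArrayList;
import java.util.List;

class StreamingSimplifier {
    private final int maxPoints;                       // assumed >= 3
    private final List<double[]> points = new ArrayList<>();

    StreamingSimplifier(int maxPoints) {
        this.maxPoints = maxPoints;
    }

    // Memory is bounded by maxPoints + 1 points regardless of stream length.
    void consume(double x, double y) {
        points.add(new double[] { x, y });
        if (points.size() > maxPoints) {
            points.remove(smallestAreaIndex());
        }
    }

    private int smallestAreaIndex() {
        int best = 1;
        double bestArea = Double.POSITIVE_INFINITY;
        for (int i = 1; i < points.size() - 1; i++) {  // end points are never removed
            double area = triangleArea(points.get(i - 1), points.get(i), points.get(i + 1));
            if (area < bestArea) {
                bestArea = area;
                best = i;
            }
        }
        return best;
    }

    private static double triangleArea(double[] a, double[] b, double[] c) {
        return Math.abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2;
    }

    List<double[]> simplified() {
        return points;
    }
}
```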
This commit adds support for storing geo_shape fields as a Lucene stored field. The geometry will be stored
normalised in well-known binary format (WKB).
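As a hedged illustration only (the JTS `WKBWriter` stands in for whatever encoder the mapper actually uses), storing a normalised geometry as WKB in a Lucene stored field could look like:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.StoredField;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.io.WKBWriter;

class GeoShapeStoredFieldExample {
    static void addStoredShape(Document doc, String fieldName, Geometry normalizedGeometry) {
        byte[] wkb = new WKBWriter().write(normalizedGeometry);  // serialize to well-known binary
        doc.add(new StoredField(fieldName, wkb));                // store the raw bytes with the document
    }
}
```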
This PR represents the initial phase of Modularizing Elasticsearch (with
Java Modules).
This initial phase modularizes the core of the Elasticsearch server
with Java Modules, which is then used to load and configure extension
components atop the server. Only a subset of extension components are
modularized at this stage (other components come in a later phase).
Components are loaded dynamically at runtime with custom class loaders
(same as is currently done). Components with a module-info.class are
defined to a module layer.
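For illustration, a hedged sketch (paths, names, and the single-loader choice are assumptions, not the actual Elasticsearch loader code) of defining a component's modules to their own layer atop the boot layer:

```java
import java.lang.module.Configuration;
import java.lang.module.ModuleFinder;
import java.nio.file.Path;
import java.util.List;

class ComponentLayerExample {
    static ModuleLayer defineComponentLayer(Path componentDir, ClassLoader parentLoader, String rootModule) {
        ModuleFinder finder = ModuleFinder.of(componentDir);            // finds the component's modular jars
        Configuration configuration = ModuleLayer.boot()
            .configuration()
            .resolve(finder, ModuleFinder.of(), List.of(rootModule));   // resolve against the boot layer
        return ModuleLayer.boot().defineModulesWithOneLoader(configuration, parentLoader);
    }
}
```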
This architecture is somewhat akin to the Modular JDK, where
applications run on the classpath. In the analogy, the Elasticsearch
server modules are the platform (thus are always resolved and present),
while components without a module-info.class are non-modular code
running atop the Elasticsearch server modules. The extension components
cannot access types from non-exported packages of the server modules, in
the same way that classpath applications cannot access types from
non-exported packages of modules from the JDK. Broadly, the core
Elasticsearch Java modules simply "wrap" the existing packages and export
them. There are opportunities to export less, which is best done in more
narrowly focused follow-up PRs.
The Elasticsearch distribution startup scripts are updated to put jars
on the module path (the class path is empty), so the distribution will
run the core of the server as java modules. A number of key components
have been retrofitted with module-info.java's too, and the remaining
components can follow later. Unit and functional tests run as
non-modular (since they commonly require package-private access), while
higher-level integration tests, that run the distribution, run as
modular.
Co-authored-by: Chris Hegarty <christopher.hegarty@elastic.co>
Co-authored-by: Ryan Ernst <ryan@iernst.net>
Co-authored-by: Rene Groeschke <rene@elastic.co>
JEP 361 (https://openjdk.java.net/jeps/361) added support for switch expressions,
which can be much more terse and less error-prone than switch statements.
Another useful feature of switch expressions is exhaustiveness: we can make
sure that an enum switch expression covers all the cases at compile time.
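A small, self-contained illustration (hypothetical enum, not taken from the codebase) of the exhaustiveness benefit:

```java
class SwitchExpressionExample {
    enum CardinalDirection { NORTH, SOUTH, EAST, WEST }

    // No default branch is needed: the compiler verifies every constant is covered,
    // and adding a new constant later becomes a compile error here.
    static int headingDegrees(CardinalDirection direction) {
        return switch (direction) {
            case NORTH -> 0;
            case EAST -> 90;
            case SOUTH -> 180;
            case WEST -> 270;
        };
    }
}
```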
The Checkstyle rule that bans unary negation in favour of an explicit
`== false` has a `maximumDepth` of 2 configured, which meant that it
didn't catch all violations. The `maximumDepth` isn't required (actually
it has a really high default), so this change removes the limit and
fixes the resulting violations.
Part 8.
We have an in-house rule to compare explicitly against `false` instead
of using the logical not operator (`!`). However, this hasn't
historically been enforced, meaning that there are many violations in
the source at present.
We now have a Checkstyle rule that can detect these cases, but before we
can turn it on, we need to fix the existing violations. This is being
done over a series of PRs, since there are a lot to fix.
As per the new licensing change for Elasticsearch and Kibana this commit
moves existing Apache 2.0 licensed source code to the new dual license
SSPL+Elastic license 2.0. In addition, existing x-pack code now uses
the new version 2.0 of the Elastic license. Full changes include:
- Updating LICENSE and NOTICE files throughout the code base, as well
as those packaged in our published artifacts
- Update IDE integration to now use the new license header on newly
created source files
- Remove references to the "OSS" distribution from our documentation
- Update build time verification checks to no longer allow Apache 2.0
license header in Elasticsearch source code
- Replace all existing Apache 2.0 license headers for non-xpack code
with updated header (vendored code with Apache 2.0 headers obviously
remains the same).
- Replace all Elastic license 1.0 headers with new 2.0 header in xpack.
We have an in-house rule to compare explicitly against `false` instead
of using the logical not operator (`!`). However, this hasn't
historically been enforced, meaning that there are many violations in
the source at present.
We now have a Checkstyle rule that can detect these cases, but before we
can turn it on, we need to fix the existing violations. This is being
done over a series of PRs, since there are a lot to fix.
* Remove usage of deprecated testCompile configuration
* Replace testCompile usage by testImplementation
* Make testImplementation non transitive by default (as we did for testCompile)
* Update CONTRIBUTING about using testImplementation for test dependencies
* Fail on testCompile configuration usage
This is another part of the breakup of the massive BuildPlugin. This PR
moves the code for configuring publications to a separate plugin. Most
of the time these publications are jar files, but this also supports the
zip publication we have for integ tests.
This commit adds aggregation support for the geo_shape field
type on geo*_grid aggregations.
It introduces a Tiler for both tiles and hashes that enables a new type of
ValuesSource to replace the GeoPoint's CellIdSource. This makes it possible
for the existing Aggregator to be re-used, so no new implementations of
the grid aggregators are added.
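A hedged sketch with illustrative names (not the actual Tiler or ValuesSource API): the tiler maps each shape to the grid cells it intersects at a given precision, so the resulting cell ids can flow through the same aggregator that already consumes geo_point cell ids.

```java
interface GridTiler<SHAPE> {
    /** Returns the ids of all grid cells (geotile or geohash) the shape intersects at the given precision. */
    long[] cellsIntersecting(SHAPE shape, int precision);
}
```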
Currently forbidden apis accounts for 800+ tasks in the build. These
tasks are aggressively created by the plugin. In forbidden apis 3.0, we
will get task avoidance
(https://github.com/policeman-tools/forbidden-apis/pull/162), but we
need to ourselves use the same task avoidance mechanisms to not trigger
these task creations. This commit does that for our forbidden apis
usages, in preparation for upgrading to 3.0 when it is released.
While we use `== false` as a more visible form of boolean negation
(instead of `!`), the true case is implied and the true value does not
need to be explicitly checked. This commit converts cases that have slipped
into the code checking for `== true`.
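A tiny, hypothetical illustration of the preferred forms:

```java
class BooleanStyleExample {
    // Preferred: explicit `== false` instead of `!`; plain `active` instead of `== true`.
    static String describe(boolean active) {
        if (active == false) {
            return "inactive";
        }
        return "active";
    }
}
```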
* Adds JavaDoc to `AbstractWireTestCase` and
`AbstractWireSerializingTestCase` so it is more obvious you should prefer
the latter if you have a choice
* Moves the `instanceReader` method out of `AbstractWireTestCase` because
it is no longer used.
* Marks a bunch of methods final so it is more obvious which classes are
for what.
* Cleans up the side effects of the above.
Closes #48724. Update `.editorconfig` to make the Java settings the default
for all files, and then apply a 2-space indent to all `*.gradle` files.
Then reformat all the files.
* Remove eclipse conditionals
We used to have some meta projects with a `-test` prefix because
historically eclipse could not distinguish between test and main
source-sets and could only use a single classpath.
This is no longer the case for the past few Eclipse versions.
This PR adds the necessary configuration to correctly categorize source
folders and libraries.
With this change Eclipse can import projects, and the visibility rules
are correct, e.g. auto-complete doesn't offer classes from test code or
`testCompile` dependencies when editing classes in `main`.
Unfortunately the cyclic dependency detection in Eclipse doesn't seem to
take the difference between test and non test source sets into account,
but since we are checking this in Gradle anyhow, it's safe to set to
`warning` in the settings. Unfortunately there is no setting to ignore
it.
This might cause problems when building since Eclipse will probably not
know the right order to build things in, so more work might be necessary.
Changes the order of parameters in Geometries from lat, lon to lon, lat
and moves all Geometry classes to the
org.elasticsearch.geometry package.
Closes #45048
Switches to more robust way of generating random test geometries by
reusing lucene's GeoTestUtil. Removes duplicate random geometry
generators by moving them to the test framework.
Closes #37278
By default, we don't check ranges while indexing geo_shapes. As a
result, it is possible to index geoshapes that contain
coordinates outside of the -90 +90 and -180 +180 ranges. Such geoshapes
will currently break the SQL and ML retrieval mechanism. This commit removes
this restriction from the validator that is used in SQL and ML retrieval.
Refactors the WKT and GeoJSON parsers from a utility class into
instantiable objects. This is a preliminary step in
preparation for moving coordinate validators out of the Geometry
constructors. This should allow us to make validators pluggable.
This commit refactors the GeoHashUtils class into a new Geohash utility class located in the ES geo library. The intent is not only to better control what geo methods are whitelisted for painless scripting, but also to clean up the geo utility API in general.