UNCLASSIFIED
GeoWave Geospatial Indexing
https://github.com/ngageoint/geowave
http://ngageoint.github.io/geowave/
Eric Robertson, Derek Yeager (BAH), Rich Fecher (Radiant Blue), Steven McNutt (BAH), Whitney Omeara (Radiant Blue)

Intent
Pluggable Backend: multidimensional indexing on any sorted key-value store.
Modular Design: easy feature extension as well as integration into other platforms.
Self-Describing Data: configuration, format, and other information needed to manipulate data are kept in the data store.
As found on http://ngageoint.github.io/geowave

Outline
Geographic Information Systems (GIS)
GeoWave Overview
– Features
– Components
– Data Types
The Fundamentals
– How does GeoWave organize geospatial data?
Set of problems and solutions with Big Table Architecture
– Deduplication
– Map Occlusion Culling
– Raster Data
– Statistics

GIS: Geographic Information System
Technology Explosion
– E.g. Smart Phone and GPS Applications
Data Explosion
– Satellite Imagery, Ground Based Imagery, Aerial Photography
Problems:
– Generate Maps: create a base image and add vector data (shapes) such as points of interest, roads, and boundaries
– Find Features: “restaurants near you”
– Analysis: Density, Surface Analysis, Interpolation, Pattern Discovery
[Figure: map generated by OpenStreetMap.org]

Features of GeoWave
Leverage Big Table distributed data store
– High-performance ingest
– Horizontally scalable
– Per-entry access constraints
Fast geospatial retrieval
Geo-temporal indexing
Pre-calculated statistics:
– Counts per Data Type
– Bounding Region
– Time Range
– Numeric Range
– Histograms

Integrated Components
Accumulo Data Store
Hadoop Map-Reduce input/output formats
GeoServer integration with GeoTools
Vector and Raster Data
Multi-Threaded Ingest Tools
Administrative RESTful Services
Layers and Data Stores
Analytics (on Map Reduce)
– Kernel Density
– K-means Clustering
– Sampling
– DBScan
Tested Versions
– Accumulo: 1.5.1, 1.6.x, 1.7
– Cloudera: 2.0.0-cdh4.7.0, 2.5.0-cdh5.2
– Hortonworks: HDP 2.1
– Apache Hadoop: 2.6.0.x
– GeoTools: 11.4, 12.1, 12.2
– GeoServer: 2.5.2, 2.6.1, 2.7.5, 2.8.2

Data Types
Data Structures
– Simple Feature (ISO 19125) via GeoTools (http://www.geotools.org/)
– Raster Images
– Custom
Provided Ingest Types
– Vector Data Sources (GeoTools). Examples: Shapefiles, GeoJSON, PostGIS, etc.
– Grid Formats (GeoTools). Examples: ArcGrid, GeoTIFF, etc.
– GeoLife GPS Trajectories (http://research.microsoft.com/en-us/projects/GeoLife/)
– GPX (http://www.topografix.com/gpx.asp)
– T-Drive (http://research.microsoft.com/en-us/projects/tdrive/)
– PDAL

Main Problem: Index Two Dimensions in a Single-Dimension Index
Basic Problem: Efficiently locate and retrieve vectors or tiles intersecting a polygon (e.g. a bounding box).
Big Table: Each table is organized into blocks of sorted row identifiers.
Revised Problem: Provide a two-way mapping between multiple dimensions and a single-dimension row ID, to support location-efficient storage and retrieval of vectors or tiles given constraints expressed as multi-dimensional boundaries.

Generalized Problems
Solve the general problem first. Then apply it to geospatial-specific problems.
– Multi-dimension numeric index supporting efficient data retrieval given a bounded set of constraints for each dimension.
– Indexed data includes scalars and intervals per dimension. For example, a range of time or a polygon.
– Index over a mix of bounded and unbounded dimensions.

Fundamental Approach: Space Filling Curves Traverse N-Dimensional Space
Space Filling Curve: A curve whose range contains the entire n-dimensional hypercube.
Curves are constructed iteratively. The iterations produce a sequence of piecewise linear continuous curves, each one more closely approximating the space-filling limit.
Each discrete value on the curve represents a hyper-rectangle in n-dimensional space.

Curve Selection: Sequential IO Optimization
Achieve optimal read performance through a contiguous series of values across two or more dimensions.
Reading 11 records over the contiguous range 23->33 is faster than reading a non-contiguous set such as 15, 18, 34, 56-58, 83, 99, 101-102.
Consider: a latitude and longitude range (latA, lonA) -> (latB, lonB) should map to the fewest possible ranges on the space filling curve.
Haverkort and Walderveen [1] describe three metrics to help quantify this:
– Worst Case Dilation
– Average Bounding Box
– Worst Case Bounding Box
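To make the contiguity argument concrete, the sketch below counts how many separate range scans a sorted set of row IDs would need. The class and method names are illustrative only and are not part of GeoWave.

```java
final class RangeCountSketch {
    /** Counts the runs of consecutive values in an ascending array of distinct row IDs. */
    static int countRanges(long[] sortedIds) {
        if (sortedIds.length == 0) {
            return 0;
        }
        int ranges = 1;
        for (int i = 1; i < sortedIds.length; i++) {
            // A gap of more than one between neighbours starts a new range scan.
            if (sortedIds[i] != sortedIds[i - 1] + 1) {
                ranges++;
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        long[] contiguous = {23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33};
        long[] scattered = {15, 18, 34, 56, 57, 58, 83, 99, 101, 102};
        System.out.println(countRanges(contiguous) + " range vs " + countRanges(scattered) + " ranges");
    }
}
```

The contiguous example from the slide needs a single range scan, while the scattered one needs seven.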

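Curve Selection: Locality
Metrics from [1], one column per curve compared (the curve labels were images and are not reproduced here):
Worst Case Dilation, L∞:  ∞     6     4     8      5.40   5.00
Worst Case Dilation, L2:  ∞     6     4     8      6.04   5.00
Worst Case Dilation, L1:  ∞     9     8     10.66  12.00  9.00
Worst Case Area:          ∞     2.40  3.00  2.00   3.05   2.22
Average Box Area:         2.86  1.41  1.69  1.42   1.47   1.40
[1] Haverkort, Walderveen. Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves. 2008. arXiv:0806.4787v2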

Hilbert Curve Mapping in 2D: the Global
Place a grid on the globe (dotted lines).
Connect all the points on the grid with a Hilbert SFC.
The curve provides a linear ordering over two-dimensional space.
A bounding box is defined by the set of ranges covered by the Hilbert SFC.
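As a rough illustration of that linear ordering, the sketch below converts a grid cell to its position along a Hilbert curve using the widely published iterative rotate-and-flip formulation. It is not GeoWave's implementation; the class name `HilbertSketch` and the curve orientation (starting at cell (0,0)) are assumptions made for the example.

```java
final class HilbertSketch {
    /**
     * Maps cell (x, y) on an n-by-n grid (n a power of two) to its
     * zero-based position along a Hilbert curve that starts at (0, 0).
     */
    static long xyToIndex(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            int rx = (x & s) > 0 ? 1 : 0;
            int ry = (y & s) > 0 ? 1 : 0;
            // Each level contributes one base-4 digit naming the quadrant.
            d += (long) s * s * ((3 * rx) ^ ry);
            // Rotate or flip so the sub-quadrant is traversed in the right orientation.
            if (ry == 0) {
                if (rx == 1) {
                    x = n - 1 - x;
                    y = n - 1 - y;
                }
                int t = x;
                x = y;
                y = t;
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // Print the curve position of every cell of a 4x4 grid, top row first.
        for (int y = 3; y >= 0; y--) {
            StringBuilder row = new StringBuilder();
            for (int x = 0; x < 4; x++) {
                row.append(String.format("%3d", xyToIndex(4, x, y)));
            }
            System.out.println(row);
        }
    }
}
```

Consecutive positions on the curve always land in neighbouring grid cells, which is the locality property the slides rely on.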

Hilbert Curve Precision
Precision is determined by the ‘depth’ of the curve.
In this example the globe is defined by a 16x16 grid. Resolution is 22.5 degrees longitude (360/16) and 11.25 degrees latitude (180/16) per cell.
Each elbow (discrete point) in the Hilbert SFC maps to a grid cell.
The precision of the Hilbert SFC, defined in terms of the number of bits, determines the grid. Thus, more bits equate to finer-grained cells.
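The relationship between bits and cell size can be checked with a few lines of arithmetic. In this sketch, 4 bits per dimension give the 16x16 grid of the example, so a cell spans 360/16 = 22.5 degrees of longitude and 180/16 = 11.25 degrees of latitude. The class and method names are hypothetical, and clamping the upper edge into the last cell is an assumption of the sketch.

```java
final class GridPrecisionSketch {
    /** Width of one cell in degrees of longitude for the given bits per dimension. */
    static double cellWidthDegrees(int bits) {
        return 360.0 / (1L << bits);
    }

    /** Height of one cell in degrees of latitude for the given bits per dimension. */
    static double cellHeightDegrees(int bits) {
        return 180.0 / (1L << bits);
    }

    /** Column of the cell containing the longitude (valid for bits <= 31). */
    static int lonToCell(double lon, int bits) {
        return toCell((lon + 180.0) / 360.0, bits);
    }

    /** Row of the cell containing the latitude (valid for bits <= 31). */
    static int latToCell(double lat, int bits) {
        return toCell((lat + 90.0) / 180.0, bits);
    }

    private static int toCell(double normalized, int bits) {
        long n = 1L << bits;
        long cell = (long) Math.floor(normalized * n);
        // Clamp so that the upper edge (+180 or +90) falls into the last cell.
        return (int) Math.min(Math.max(cell, 0), n - 1);
    }

    public static void main(String[] args) {
        for (int bits = 4; bits <= 12; bits += 4) {
            System.out.println(bits + " bits: " + cellWidthDegrees(bits) + " x " + cellHeightDegrees(bits) + " degrees");
        }
    }
}
```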

Recursive Decomposition: Two Dimension Example
Recursively decompose the Hilbert region to find only those covered regions that overlap the query box.
The figure depicts a third order (2^3 “buckets” per dimension) Hilbert curve in 2D.
This forms a quad-tree view over the data. Each two bits, from most significant to least, represent a “quadrant.”
[Figure: quadrants labeled with the two-bit codes 00, 01, 10, 11 at each level]

Decode the Bounding Box: Range Decomposition
Bounding box over grid cells (2,9) to (5,13), (lower left) to (upper right).
Decompose the cells intersecting the bounding box, as shown in blue.
The box decomposes to three (color coded) ranges:
– 70 -> 75
– 92 -> 99
– 116 -> 121
Note: A bounding box from a geospatial query window does not necessarily “snap” perfectly to the grid cells (e.g. 6.2, 8.8 instead of 6, 9). The bounding box is expanded to encompass all intersecting cells.

Range Decomposition Optimization
Here we see the query range fully decomposed into the underlying “quadrants.”
Decomposition stops when the query window fully contains the quad. (See segment 3 and segment 8)
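The recursive decomposition can be sketched as a quad-tree walk: skip quads that miss the query box, emit one contiguous range for a quad that lies fully inside it, and split the rest. This is a minimal sketch, not GeoWave's code; it assumes a Hilbert curve oriented to start at cell (0,0) and a grid size that is a power of two, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class HilbertRangeSketch {
    /** Position of cell (x, y) on an n-by-n Hilbert curve starting at (0, 0). */
    static long xyToIndex(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            int rx = (x & s) > 0 ? 1 : 0;
            int ry = (y & s) > 0 ? 1 : 0;
            d += (long) s * s * ((3 * rx) ^ ry);
            if (ry == 0) {
                if (rx == 1) {
                    x = n - 1 - x;
                    y = n - 1 - y;
                }
                int t = x;
                x = y;
                y = t;
            }
        }
        return d;
    }

    /** Returns sorted, merged {start, end} curve ranges covering the inclusive cell box. */
    static List<long[]> decompose(int n, int minX, int minY, int maxX, int maxY) {
        List<long[]> raw = new ArrayList<>();
        visit(n, 0, 0, n, minX, minY, maxX, maxY, raw);
        raw.sort((a, b) -> Long.compare(a[0], b[0]));
        List<long[]> merged = new ArrayList<>();
        for (long[] r : raw) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && last[1] + 1 >= r[0]) {
                last[1] = Math.max(last[1], r[1]); // adjacent ranges fuse into one scan
            } else {
                merged.add(r);
            }
        }
        return merged;
    }

    private static void visit(int n, int x0, int y0, int size,
                              int minX, int minY, int maxX, int maxY, List<long[]> out) {
        int x1 = x0 + size - 1;
        int y1 = y0 + size - 1;
        if (x1 < minX || x0 > maxX || y1 < minY || y0 > maxY) {
            return; // quad does not touch the query box
        }
        if (x0 >= minX && x1 <= maxX && y0 >= minY && y1 <= maxY) {
            // Fully contained: an aligned quad is one contiguous run on the curve.
            long cells = (long) size * size;
            long start = xyToIndex(n, x0, y0) / cells * cells;
            out.add(new long[] {start, start + cells - 1});
            return;
        }
        int h = size / 2;
        visit(n, x0, y0, h, minX, minY, maxX, maxY, out);
        visit(n, x0 + h, y0, h, minX, minY, maxX, maxY, out);
        visit(n, x0, y0 + h, h, minX, minY, maxX, maxY, out);
        visit(n, x0 + h, y0 + h, h, minX, minY, maxX, maxY, out);
    }
}
```

With this orientation, `decompose(16, 2, 9, 5, 13)` yields the ranges 70->75, 92->99 and 116->121, matching the three ranges quoted for the bounding box over cells (2,9) to (5,13).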

Intervals: Polygons and Multi-Polygon
A query is defined by a range per dimension (a bounding rectangle in 2D).
An interval such as a polygon is stored as a duplicate entry for each hyper-rectangle it intersects.
– The polygon in the example covers 66 cells, so it is stored as 66 duplicates.
Duplicate data must be removed at query time. De-duplication is applied in an Accumulo Iterator as well as client-side.
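A client-side de-duplication pass can be as simple as remembering which feature IDs have already been returned. The sketch below assumes each scanned entry is identified by a string feature ID; the class name is hypothetical, and it does not model the Accumulo iterator side.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class DedupeSketch {
    /** Keeps the first occurrence of each feature ID, preserving scan order. */
    static List<String> firstOccurrences(Iterable<String> scannedFeatureIds) {
        Set<String> seen = new LinkedHashSet<>();
        for (String id : scannedFeatureIds) {
            // add() is a no-op for IDs already seen in another cell.
            seen.add(id);
        }
        return new ArrayList<>(seen);
    }
}
```

The memory cost grows with the number of distinct features in the result, which is one reason to also limit the number of duplicates written in the first place.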

Intervals: Polygons and Multi-Polygons
High resolution curves force an excessive number of duplicates for large intervals.
– Consider a high resolution 2D curve, 2^31 x 2^31, and a large polygon such as the Pacific Ocean.
– The Pacific Ocean covers ~33% of the Earth's surface, which amplifies to ~1.5 quintillion duplicate entries.
Solution: Tiered Indexing [8]
– Each tier has a resolution of 2^n x 2^n, where n is the tier number. Thus, each lower (finer) tier doubles the resolution in each dimension.
– Polygons are stored in the lowest tier possible that minimizes the number of duplicates.
– Example: Blue polygon indexed in tier 2; Red polygon indexed in tier 3.

Tiers: Query Regions with False Positives
Balance between an acceptable number of duplicates and false positives due to the lower granularity of higher tiers.
Consider the query region in orange. It does not intersect either polygon. However, it does intersect shared quadrants at the respective tiers for both shapes. Thus, more rows are filtered during the range scan.
Without tiers, using a higher resolution, this false positive does not occur. However, consider that, for a resolution of 10 (e.g. 2^10), hundreds of duplicates occur.

Tiers: Worst Case
Cap the number of duplicates by choosing an appropriate tier.
Our analysis indicates that an optimal number of duplicates is 2^d, where d is the number of dimensions (i.e. in 2 dimensions, cap at 4).
Consider the worst case: a small square polygon centered on the inner intersecting boundary (example polygon in red). Regardless of the size of the box, there are always four duplicates at all tiers except the 2^0 tier (the orange box, representing the entire world).
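The tier choice can be sketched as picking the finest tier at which a shape's bounding box touches no more cells than the cap of 2^d. The arithmetic behind the motivating example is that a 2^31 x 2^31 grid has 2^62 (about 4.6 x 10^18) cells, and a third of that is roughly 1.5 x 10^18 entries. The code below is an illustration under assumptions, not GeoWave's tiering logic: coordinates are normalized to [0, 1), duplicates are estimated from the bounding box, and all names are hypothetical.

```java
final class TierSelectionSketch {
    /** Duplicate cap of 2^d for d dimensions, as suggested on the slide. */
    static long duplicateCap(int dimensions) {
        return 1L << dimensions;
    }

    /** Number of cells of a 2^tier x 2^tier grid touched by a box in normalized [0, 1) space. */
    static long cellsCovered(int tier, double minX, double minY, double maxX, double maxY) {
        long n = 1L << tier;
        long cols = cellIndex(maxX, n) - cellIndex(minX, n) + 1;
        long rows = cellIndex(maxY, n) - cellIndex(minY, n) + 1;
        return cols * rows;
    }

    private static long cellIndex(double v, long n) {
        long c = (long) Math.floor(v * n);
        return Math.min(Math.max(c, 0), n - 1);
    }

    /** Finest tier (largest n, up to maxTier <= 31) whose cell count stays within the cap. */
    static int chooseTier(int maxTier, long maxDuplicates,
                          double minX, double minY, double maxX, double maxY) {
        for (int tier = maxTier; tier > 0; tier--) {
            if (cellsCovered(tier, minX, minY, maxX, maxY) <= maxDuplicates) {
                return tier;
            }
        }
        return 0; // tier 0 is a single cell covering the whole space
    }
}
```

A box straddling the center of the space, for example (0.49, 0.49) to (0.51, 0.51), touches four cells at every tier from 1 up to the tier where it outgrows a single cell, which is the worst case described above.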

Unbounded Dimension: Time
To normalize real-world values to fit on a space filling curve, the sample space must be bounded.
Solution: Binning
– A bin represents a period for EACH dimension. For example, a periodicity of a year can be used for time.
– Each bin covers its own Hilbert space.
– Entries that contain ranges may span multiple bins, resulting in duplicates.
– The Bin ID is part of the row identifier.
A single bin for an unbounded dimension: [min + (period * period duration), min + ((period + 1) * period duration))
[Figure: bins labeled 1997, 1998, 1999]
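The bin formula translates directly into integer arithmetic. The sketch below assumes a fixed period duration in the same unit as the values (calendar years vary in length, so a yearly bin would need calendar-aware logic instead); the class and method names are hypothetical.

```java
final class BinningSketch {
    /** Bin (period) number containing the value; floorDiv keeps values below min in negative bins. */
    static long binId(long value, long min, long periodDuration) {
        return Math.floorDiv(value - min, periodDuration);
    }

    /** Inclusive lower bound of a bin: min + (period * period duration). */
    static long binStart(long bin, long min, long periodDuration) {
        return min + bin * periodDuration;
    }

    /** Exclusive upper bound of a bin: min + ((period + 1) * period duration). */
    static long binEndExclusive(long bin, long min, long periodDuration) {
        return min + (bin + 1) * periodDuration;
    }

    /** Position of the value inside its bin, normalized to [0, 1) for placement on the curve. */
    static double offsetInBin(long value, long min, long periodDuration) {
        long bin = binId(value, min, periodDuration);
        return (double) (value - binStart(bin, min, periodDuration)) / periodDuration;
    }

    /** Number of bins an inclusive range [start, end] touches, i.e. how many entries it needs. */
    static long binsSpanned(long start, long end, long min, long periodDuration) {
        return binId(end, min, periodDuration) - binId(start, min, periodDuration) + 1;
    }
}
```

For example, with `min = 0` and a period duration of 100, the value 250 falls in bin 2, which covers [200, 300), and a range from 150 to 320 spans three bins.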

Bin: Variability over Dimensions
Each Bin is a hyper-rectangle representing ranges of data labeled by points on a Hilbert curve.
Bounded dimensions, for example Latitude and Longitude, assume a single Bin.
[Figure axes: Time, Elevation, Velocity]

That’s Enough Theory, Let’s Apply It
Accumulo techniques you might find interesting

Vector Data Persistence Model
[Key diagram labels: SFC Curve Hierarchy; Feature Type; Feature ID; Hint to Dedupe Filter; Feature Attribute Name; From Field Visibility Handlers]
Column per feature identifier.
Column per each feature attribute. Types include: Geometry, Integer, Double, BigDecimal, Date, Time, String, Boolean, etc.

Map Occlusion Culling
At a specific zoom level, each pixel signifies a range in degrees.
When scanning the data, only one entry is needed within each pixel range. The rest of the entries can be skipped.
The block identified in red represents many data points, but is rendered by the 9 pixels.

Map Occlusion Culling: Iterators
The Accumulo iterator starts at the first pixel, scans until it hits a geometry, then skips to the next pixel.
– Scan to the first pixel.
– Seek to the beginning of the next pixel.
[Figure labels: Database Data; Displayed Pixels; points the rendering engine received; points that were skipped]
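The scan-then-seek pattern can be imitated on any sorted key set. The sketch below stands in for the server-side iterator with a `NavigableSet`: it assumes keys are non-negative positions along one sorted axis and that every pixel covers a fixed number of consecutive key values. Both assumptions, and all names, are illustrative; a real Accumulo iterator would seek within the tablet instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

final class PixelSkipScanSketch {
    /** Returns at most one key per pixel: the first key found inside each pixel's key range. */
    static List<Long> onePerPixel(NavigableSet<Long> sortedKeys, long keysPerPixel) {
        List<Long> rendered = new ArrayList<>();
        Long key = sortedKeys.isEmpty() ? null : sortedKeys.first();
        while (key != null) {
            rendered.add(key); // this pixel is now painted
            // Seek past everything else that would land in the same pixel.
            long nextPixelStart = (Math.floorDiv(key, keysPerPixel) + 1) * keysPerPixel;
            key = sortedKeys.ceiling(nextPixelStart);
        }
        return rendered;
    }
}
```

With keys {0, 1, 2, 10, 11, 25, 29, 30} and ten keys per pixel, only 0, 10, 25 and 30 reach the renderer; the other four entries are never read past the seek.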

Distributed Rendering
[Figure labels: Map Request; Map Response; Layer Style; GeoServer (GeoWave Plugin); Accumulo (GeoWave Iterators); Rendered Map]
Each scan result is an image with the data in the range.
All resultant images are composited together.

Distributed Rendering with Occlusion Culling

Raster Data Persistence Model
[Key diagram labels: SFC Curve Hierarchy; Coverage Name; Image Data Buffer + Image Metadata]
The SFC value is effectively a Tile ID.
Tiles are unique, ignore duplication.
Coverage Name: unique name for global coverage.
Image Metadata is customizable. Default is to store “no data” values, but this can be customized.

Raster Data: Grid Coverage
Tiled, each “cell” fit to boundary.
“No Data” values must be maintained.
Multi-band, more than just RGB.

Raster Data: Advanced Options
Histogram Equalization [10]
Image Pyramid [11]
Tile Merge Strategy: tiles t1, t2, t3 are merged pairwise, $f(f(t_1, t_2), t_3) = t_{\text{final}}$
Value $t_n$: custom data per tile, in scope for f(x)

Statistics: Structure
Statistics infrastructure supports summary data.
Currently, each row ID includes an adapter ID and a statistics ID.
Current statistics types include population bounding boxes, counts, and ranges.
[Figure: key layout. Labels: Key (Row ID; Column Family, Qualifier, Visibility; Time) and Value; Statistic ID (Attribute Name & Statistic Type); Adapter ID; Family “STATS”; Visibility matches represented data]

Statistics: Combiner
Statistic ID  Value  Adapter ID  Family   Visibility  Time
“Count”       30     0xA43E      “STATS”  A&B         2
“Count”       60     0xA43E      “STATS”  A&C         4
“Count”       20     0xA43E      “STATS”  A&B         7
MERGE
“Count”       50     0xA43E      “STATS”  A&B         9
BBOX: Grow Envelope to Minimum and Maximum corners.
RANGE: Minimum and Maximum
HISTOGRAM: Update bins from coverage over raster image
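The merge rules above are small associative functions, which is what lets a combiner apply them incrementally. The sketch below is plain Java and does not use the Accumulo Combiner API; it reduces the row key to its visibility string for brevity, represents an envelope as {minX, minY, maxX, maxY}, and uses hypothetical names throughout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class StatsCombinerSketch {
    /** COUNT: sums the values of rows that share a key (here, the visibility expression). */
    static Map<String, Long> combineCounts(String[] visibilities, long[] counts) {
        Map<String, Long> merged = new LinkedHashMap<>();
        for (int i = 0; i < counts.length; i++) {
            merged.merge(visibilities[i], counts[i], Long::sum);
        }
        return merged;
    }

    /** BBOX: grows an envelope {minX, minY, maxX, maxY} to include another one. */
    static double[] growEnvelope(double[] a, double[] b) {
        return new double[] {
            Math.min(a[0], b[0]), Math.min(a[1], b[1]),
            Math.max(a[2], b[2]), Math.max(a[3], b[3])
        };
    }

    /** RANGE: keeps the overall minimum and maximum of two {min, max} pairs. */
    static double[] mergeRange(double[] a, double[] b) {
        return new double[] {Math.min(a[0], b[0]), Math.max(a[1], b[1])};
    }
}
```

Feeding in the counts 30 (A&B), 60 (A&C) and 20 (A&B) gives 50 for A&B and leaves 60 for A&C, since rows with different visibilities are different keys.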

Statistics: Transformation Iterator
Query authorization may authorize multiple rows.
Query with authorization A, B & C:
Statistic ID  Value  Adapter ID  Family   Visibility  Time
“Count”       50     0xA43E      “STATS”  A&B         9
“Count”       60     0xA43E      “STATS”  A&C         4
MERGE
“Count”       110    0xA43E      “STATS”  A&B&C       9
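The query-time merge can be sketched as: keep the rows the caller is authorized to see, sum their values, and label the result with the union of their visibility labels. This sketch assumes visibility expressions are plain conjunctions such as A&B (real Accumulo expressions also allow `|` and parentheses) and does not use the Accumulo iterator API; all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class VisibilityMergeSketch {
    /** A count statistic plus the labels that must ALL be held to see it. */
    static final class Row {
        final long count;
        final Set<String> labels;

        Row(long count, Set<String> labels) {
            this.count = count;
            this.labels = labels;
        }

        static Row of(long count, String... labels) {
            return new Row(count, new TreeSet<>(Arrays.asList(labels)));
        }
    }

    /** Merges every row visible under the given authorizations into a single row. */
    static Row mergeVisible(List<Row> rows, Set<String> authorizations) {
        long total = 0;
        Set<String> combinedLabels = new TreeSet<>();
        for (Row row : rows) {
            if (authorizations.containsAll(row.labels)) {
                total += row.count;
                combinedLabels.addAll(row.labels);
            }
        }
        return new Row(total, combinedLabels);
    }
}
```

With rows 50 (A&B) and 60 (A&C), a query holding A, B and C sees a merged count of 110 labeled A&B&C, while a query holding only A and B sees 50.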

Key Concepts
[Diagram components: Data Store; Adapter; Index Writer; Primary Index; Secondary Index; Statistics; Data; Meta Data]
[Diagram relationships: Selects for each attribute; Provides attributes to fulfill model; Encode & Decodes; Creates; Writes to Index]
Primary Index contains the ‘data’.
Secondary Index has a pointer to a primary index row ID.

Primary Index
Index Strategy
– Tiered SFC
– Compound
– Hash
– Round Robin
Index Model
– Dimension: Time, Latitude, Longitude
[Diagram multiplicity labels: 2, *]
Data from multiple Adapters is stored in a single Index.
All adapters used in a single index provide attributes to fulfill the Index Model.

Flows
[Diagram components: Index Writer; Adapter; Index Strategy; Statistics Manager; Secondary Index Manager]
1. Inject (data)
2. Encode (data), Statistics, Secondary Index requirements
3. Get Row IDs with encoding
4. Calculate and Write Statistics
5. Add secondary index entries

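Ingest and Query
Query (dataStore, index, and ADAPTER are assumed to be set up elsewhere):
    // Run a CQL query: a bounding box plus an attribute filter.
    try (final CloseableIterator<SimpleFeature> iterator = dataStore.query(
            new QueryOptions(ADAPTER, index),
            new CQLQuery(
                "BBOX(geometry,-77.6167,38.6833,-76.6,38.9200) and locationName like 'W%'",
                ADAPTER))) {
        while (iterator.hasNext()) {
            System.out.println("Query match: " + iterator.next().getID());
        }
    }
Ingest (points is a collection of SimpleFeature instances built elsewhere):
    // The writer is closed automatically, flushing any buffered entries.
    try (IndexWriter indexWriter = dataStore.createIndexWriter(
            index,
            DataStoreUtils.DEFAULT_VISIBILITY)) {
        for (final SimpleFeature point : points) {
            indexWriter.write(ADAPTER, point);
        }
    }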

Kernel Density Estimate (Gaussian Kernel)

Bibliography
[1] Haverkort, Walderveen. Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves. 2008. arXiv:0806.4787v2
[2] Hamilton, Rau-Chaplin. Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. 2008. Information Processing Letters 105 (155-163)
[3] Hayes. Crinkly Curves. 2013. American Scientist 100-3 (178). DOI: 10.1511/2013.102.1
[4] Skilling. Programming the Hilbert Curve. Bayesian Inference and Maximum Entropy Methods in Science and Engineering: 23rd Workshop Proceedings. 2004. American Institute of Physics 0-7354-0182-9/04
[5] Wikipedia. Well-known_binary. http://en.Wikipedia.org/wiki/Well-known_binary 2013
[6] Wikipedia. Hilbert curve. http://en.wikipedia.org/wiki/Hilbert_curve 2013
[7] Aioanei. Uzaygezen: Compact Hilbert Index implementation in Java. http://code.google.com/p/uzaygezen/ 2008 Google Inc.
[8] Surratt, Boyd, Russelavage. Z-Value Curve Index Evaluation. 2012. Internal Presentation.
[9] Open Geospatial Consortium. Standard List. http://www.opengeospatial.org/standards/is
[10] Remote Sensed Image Processing on Grids for Training in Earth Observation. http://www.intechopen.com/source/html/6674/media/image3.jpeg
[11] OSGeo Wiki. http://wiki.osgeo.org/images/thumb/d/d0/Pyramid.jpg/286px-Pyramid.jpg
[12] WFS-T (http://www.opengeospatial.org/standards/wfs)
