GeoWave Geospatial Indexing Eric Robertson Derek Yeager.

Presentation on theme: "GeoWave Geospatial Indexing Eric Robertson Derek Yeager."— Presentation transcript:

1 GeoWave Geospatial Indexing Eric Robertson Derek Yeager

2 Outline
Geographic Information Systems (GIS). GeoWave Overview: Features, Components, Data Types. The Fundamentals: How does GeoWave organize geospatial data? Set of problems and solutions with Accumulo: Deduplication, WFS-T Transaction Isolation, Map Occlusion Culling, Raster Data, Statistics. Speaker notes: Expect velocity in the beginning. Slow down during the fundamentals, and go even slower during the Accumulo-specific material. No more than 2 minutes per slide up to the fundamentals.

3 GIS: Geographic Information System
GIS Technology Explosion, e.g. Smart Phone and GPS Applications. Data Explosion: Satellite Imagery, Ground Based Imagery, Aerial Photography. Problems: Generate Maps: create a base image and add vector data (shapes) such as points of interest, roads, and boundaries. Find Features: “restaurants near you”. Analysis: Density, Surface Analysis, Interpolation, Pattern Discovery. Map image generated by OpenStreetMap.org.

4 Features of GeoWave
Leverage Accumulo offerings as a distributed data store; high-performance ingest; horizontally scalable; per-entry access constraints; fast geospatial retrieval; geo-temporal indexing; pre-calculated statistics: counts per data type, bounding region, time range, numeric range, histograms.

5 Integrated Components
Tested Versions: Accumulo 1.5.1, 1.6.x; Cloudera cdh4.7.0, cdh5.2; Hortonworks HDP 2.1; Apache 2.6; GeoTools 11.4, 12.1, 12.2; GeoServer ,2.6.1. Components: Accumulo Data Store; Hadoop Map-Reduce input/output formats; GeoServer integration with GeoTools; Vector and Raster Data; Multi-Threaded Ingest Tools; Administrative RESTful Services; Layers and Data Stores; Analytics: Kernel Density, K-means Clustering, Sampling.

6 Data Types
Data Structures: Simple Feature (ISO 19125) via GeoTools; Raster Images; Custom. Provided Ingest Types: Vector Data Sources (GeoTools), for example Shapefiles, GeoJSON, PostGIS, etc.; Grid Formats (GeoTools), for example ArcGrid, GeoTIFF, etc.; GeoLife GPS Trajectories; GPX; T-Drive; PDAL.

7 Main Problem: Index Two Dimensions in a Single-Dimension Index
Basic Problem: Efficiently locate and retrieve vectors or tiles intersecting a polygon (e.g. a bounding box). Accumulo: Each table is organized into blocks of sorted row identifiers. Revised Problem: A two-way mapping between multiple dimensions and a single-dimension row ID, to support location-efficient storage and retrieval of vectors or tiles given constraints expressed as multi-dimensional boundaries.

8 Generalized Problems Solve the general problem first. Then apply to Geospatial specific problems. Multi-Dimension Index supporting efficient data retrieval given bounded set of constraints for each dimension. Indexed data includes scalars and intervals per dimension. For example, a range of time or a polygon. Index over a mix of bounded and unbounded dimensions.

9 Fundamental Approach: Space Filling Curves Traverse N-Dimensional Space
Space Filling Curve: A curve whose range contains the entire n-dimensional hypercube. Curves are constructed iteratively. Each iteration produces a sequence of piecewise linear continuous curves, each one more closely approximating the space-filling limit. Each discrete value on the curve represents a hyper-rectangle in n-dimensional space.
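To make the mapping concrete, the sketch below converts a 2D grid cell to its position on a Hilbert curve using the widely published iterative bit-manipulation method. It is an illustration only, not GeoWave's own implementation (the bibliography points to the Uzaygezen compact Hilbert index library for that); the function name `hilbert_index` is a hypothetical helper.

```python
def hilbert_index(order: int, x: int, y: int) -> int:
    """Map cell (x, y) on a 2**order x 2**order grid to its Hilbert curve position."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        # Each level contributes one quadrant digit, weighted by the quadrant's cell count.
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the coordinates so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


if __name__ == "__main__":
    # Print the curve positions of a 4x4 grid, top row first.
    for y in reversed(range(4)):
        print(" ".join(f"{hilbert_index(2, x, y):2d}" for x in range(4)))
```

Consecutive curve values always land in neighbouring cells, which is the locality property the following slides rely on.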

10 Curve Selection: Sequential IO Optimization
Achieve optimal read performance through a contiguous series of values across two or more dimensions. Reading 11 records over the contiguous range 23->33 is faster than reading a non-contiguous set such as 15, 18, 34, 56-58, 83, 99. Consider: a latitude and longitude window defined by (latA, lonA) -> (latB, lonB) should map to the fewest possible ranges on the space filling curve. Haverkort and van Walderveen [1] describe 3 metrics to help quantify this: Worst Case Dilation, Worst Case Bounding Box, and Average Bounding Box. Worst Case Dilation: the largest value, over points p and q on the curve, of (squared distance between p and q) / (area filled by the curve between p and q). Worst Case Bounding Box: the largest value, over sections of the curve, of (area of the minimum bounding rectangle, shown in blue) / (area filled by the curve, shown in green). Average Bounding Box: the same ratio, averaged over sections of the curve.
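The contiguity argument can be illustrated by collapsing a set of curve values into the ranges a scanner would actually have to read. This is a minimal sketch; `to_ranges` is a hypothetical helper, and the second input uses only the values that appear in the slide's example.

```python
def to_ranges(values):
    """Collapse curve index values into sorted, inclusive (start, end) ranges."""
    ranges = []
    for v in sorted(set(values)):
        if ranges and v == ranges[-1][1] + 1:
            # Extend the current range by one value.
            ranges[-1] = (ranges[-1][0], v)
        else:
            ranges.append((v, v))
    return ranges


if __name__ == "__main__":
    print(to_ranges(range(23, 34)))                        # one range scan
    print(to_ranges([15, 18, 34, 56, 57, 58, 83, 99]))     # six separate scans
```

Fewer ranges means fewer seeks, which is why a curve with good bounding-box quality is preferred.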

11 Curve Selection: Locality
Comparison table from [1] (layout lost in extraction). Columns: Z-Order, Hilbert, H-order, Peano, AR2W2. Rows: Worst Case Dilation (L∞, L2, L1), Worst Case Area, Average Box Area. Values as extracted: 6 4 8 5.40 5.00 6.04 9 10.66 12.00 9.00 2.40 3.00 2.00 3.05 2.22 2.86 1.41 1.69 1.42 1.47 1.40. [1] Haverkort, Walderveen Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves 2008 arXiv: v2

12 Hilbert Curve Mapping in 2D: the Globe
Place a grid on the globe (dotted lines). Connect all the points on the grid with a Hilbert SFC. The curve provides a linear ordering over two-dimensional space. A bounding box is defined by the set of ranges covered by the Hilbert SFC. Speaker note: shift focus to two dimensions.

13 Hilbert Curve Precision
Precision is determined by the ‘depth’ of the curve. In this example the globe is defined by a 16x16 grid. Resolution is 11.25 degrees latitude (180/16) and 22.5 degrees longitude (360/16) per cell. Each elbow (discrete point) in the Hilbert SFC maps to a grid cell. The precision of the Hilbert SFC, defined in terms of the number of bits, determines the grid. Thus, more bits equate to finer-grained cells.
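The relationship between bits and cell size is simple arithmetic. The sketch below assumes the grid spans the full 180 degrees of latitude and 360 degrees of longitude, with the same number of bits in each dimension; `cell_size_degrees` is a hypothetical helper.

```python
def cell_size_degrees(bits_per_dimension: int):
    """Return (latitude, longitude) cell size in degrees for a 2**bits grid over the globe."""
    cells = 2 ** bits_per_dimension
    return 180.0 / cells, 360.0 / cells


if __name__ == "__main__":
    for bits in (4, 8, 16, 31):
        lat, lon = cell_size_degrees(bits)
        print(f"{bits:2d} bits: {lat:.10f} deg lat x {lon:.10f} deg lon")
```

Four bits per dimension gives the 16x16 grid of this slide.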

14 Recursive Decomposition : Two Dimension Example
Hilbert Index (52) = 00 01 10 11 Recursively decompose the Hilbert region to find only those covered regions that overlap the query box. The figure depicts a third order (2^3 “buckets” per dimension) Hilbert curve in 2D, which forms a quad-tree view over the data. Each pair of bits, from most significant to least significant, represents a “quadrant.” In this example, the lowest tier is an 8x8 grid. Speaker note: Three dimensions a box, etc. Per side ~ side of box.

15 Decode the bounding Box: Range Decomposition
Bounding box over grid cells (2,9) to (5,13), given as (lower left) to (upper right). Decompose the cells intersecting the bounding box, as shown in blue. The box decomposes to three (color coded) ranges: 70 -> 75, 92 -> 99, 116 -> 121. Note: A bounding box from a geospatial query window does not necessarily “snap” perfectly to the grid cells (e.g. 6.2, 8.8 instead of 6, 9). The bounding box is expanded to encompass all intersecting cells.
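The snapping step in the note can be sketched as follows. The assumptions here are mine: coordinates are already expressed in grid units, cell i covers the half-open interval [i, i+1), and a box that merely touches a cell boundary is not counted as intersecting the next cell. `snap_to_cells` is a hypothetical helper.

```python
import math


def snap_to_cells(min_x, min_y, max_x, max_y):
    """Expand a fractional bounding box to the inclusive range of grid cells it intersects."""
    lo_x, lo_y = math.floor(min_x), math.floor(min_y)
    # ceil(v) - 1 is the last cell whose interior is reached; never go below the first cell.
    hi_x = max(lo_x, math.ceil(max_x) - 1)
    hi_y = max(lo_y, math.ceil(max_y) - 1)
    return lo_x, lo_y, hi_x, hi_y


if __name__ == "__main__":
    print(snap_to_cells(2.3, 8.8, 5.1, 12.6))
```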

16 Range Decomposition Optimization
Here we see the query range fully decomposed into the underlying “quadrants.” Decomposition stops when the query window fully contains the quad. (See segment 3 and segment 8)
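A rough sketch of this decomposition, with the same stopping rule (stop as soon as the query window fully contains a quad), is shown below. It reuses the textbook Hilbert mapping rather than GeoWave's implementation, so the curve orientation, and therefore the exact range values, may differ from the figure; `hilbert_index` and `decompose` are hypothetical names.

```python
def hilbert_index(order, x, y):
    """Textbook mapping of cell (x, y) to a Hilbert curve position."""
    n, d, s = 1 << order, 0, (1 << order) >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def decompose(order, qx0, qy0, qx1, qy1):
    """Return merged inclusive curve ranges covering cells [qx0..qx1] x [qy0..qy1]."""
    ranges = []

    def visit(x0, y0, size):
        x1, y1 = x0 + size - 1, y0 + size - 1
        if x1 < qx0 or x0 > qx1 or y1 < qy0 or y0 > qy1:
            return  # quad is disjoint from the query window
        if qx0 <= x0 and x1 <= qx1 and qy0 <= y0 and y1 <= qy1:
            # A fully contained, aligned quad is one contiguous run of size*size values.
            cells = size * size
            start = (hilbert_index(order, x0, y0) // cells) * cells
            ranges.append((start, start + cells - 1))
            return
        half = size // 2
        for dx in (0, half):
            for dy in (0, half):
                visit(x0 + dx, y0 + dy, half)

    visit(0, 0, 1 << order)
    merged = []
    for start, end in sorted(ranges):
        if merged and start == merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], end)  # join adjacent runs
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    print(decompose(4, 2, 9, 5, 13))
```

Stopping at contained quads keeps the number of recursive calls proportional to the perimeter of the query window rather than its area.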

17 Intervals: Polygons and Multi-Polygons
Duplicate entry for each intersecting hyper-rectangle over the interval. Polygon covers 66 cells in the example Remove duplicate data for each cell – 66 duplicates. De-Duplication is applied in Accumulo Iterator as well as client-side. Query is defined by a range per dimension (a bounding rectangle in 2D)
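Client-side de-duplication can be as simple as remembering which feature IDs have already been returned. The sketch below assumes a hypothetical row layout of (curve value, feature ID, payload); it does not reproduce GeoWave's actual filter or the server-side iterator.

```python
def deduplicate(rows):
    """Yield each feature once from (curve_value, feature_id, payload) rows in scan order."""
    seen = set()
    for _curve_value, feature_id, payload in rows:
        if feature_id in seen:
            continue  # same feature, stored again under another cell
        seen.add(feature_id)
        yield feature_id, payload


if __name__ == "__main__":
    scan = [(70, "poly-1", "A"), (71, "poly-1", "A"), (72, "pt-9", "B"), (92, "poly-1", "A")]
    print(list(deduplicate(scan)))
```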

18 Intervals: Polygons and Multi-Polygons
High-resolution curves force an excessive number of duplicates for large intervals. Consider a high-resolution 2D curve (2^31 x 2^31 cells) and a large polygon such as the Pacific Ocean. The Pacific Ocean covers ~33% of the Earth's surface, which amplifies to ~1.5 quintillion duplicate entries. Solution: Tiered Indexing [8]. Each tier has a resolution of 2^n x 2^n, where n is the tier number. Thus, each successive tier doubles the resolution in each dimension. Polygons are stored in the lowest tier possible that minimizes the number of duplicates. Example: Blue polygon indexed in tier 2; red polygon indexed in tier 3. Speaker note: use Polygons and Multi-Polygons as an example of intervals.
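The duplicate count quoted above follows from multiplying the covered fraction by the number of cells. A quick check, assuming one stored entry per covered cell (`duplicate_entries` is a hypothetical helper):

```python
def duplicate_entries(coverage_fraction, bits_per_dimension, dimensions=2):
    """Estimate entries stored for a shape covering a fraction of the indexed space."""
    total_cells = 2 ** (bits_per_dimension * dimensions)
    return coverage_fraction * total_cells


if __name__ == "__main__":
    # 33% of a 2^31 x 2^31 grid is on the order of 10^18 entries.
    print(f"{duplicate_entries(0.33, 31):.3e}")
    # The same shape at tier 2 (a 4 x 4 grid) touches only a handful of cells.
    print(duplicate_entries(0.33, 2))
```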

19 Tiers: Query Regions With False Positives
Balance an acceptable number of duplicates against false positives caused by the lower granularity of higher tiers. Consider the query region in orange. It does not intersect either polygon. However, it does intersect shared quadrants at the respective tiers for both shapes. Thus, more rows are filtered during the range scan. Without tiers, using a higher resolution, this false positive does not occur. However, consider that, for a resolution of 10 (e.g. 2^10), hundreds of duplicates occur.

20 Tiers: Worst Case
Cap the number of duplicates by choosing an appropriate tier. Our analysis indicates that an optimal number of duplicates is represented by 2^d, where d is the number of dimensions (i.e. in 2 dimensions, cap at 4). Consider the worst case, a small square polygon centered on the inner intersecting boundary (example polygon in red). Regardless of size, there are always four duplicates at all tiers except the 2^0 tier, the orange box, which represents the entire world.
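One way to picture the tier choice is to count, per tier, how many cells a shape's bounding box touches and keep the finest tier that stays within the cap. This is a sketch under my own assumptions (coordinates normalized to [0, 1), bounding boxes instead of exact polygon outlines, a maximum tier of 31); `cells_covered` and `choose_tier` are hypothetical names, not GeoWave API.

```python
def cells_covered(tier, min_x, min_y, max_x, max_y):
    """Count the cells of a 2**tier x 2**tier grid touched by a bounding box in [0, 1)."""
    n = 2 ** tier
    cols = int(max_x * n) - int(min_x * n) + 1
    rows = int(max_y * n) - int(min_y * n) + 1
    return cols * rows


def choose_tier(min_x, min_y, max_x, max_y, max_tier=31, max_duplicates=4):
    """Pick the highest-resolution tier whose duplicate count stays within the cap."""
    for tier in range(max_tier, -1, -1):
        if cells_covered(tier, min_x, min_y, max_x, max_y) <= max_duplicates:
            return tier
    return 0


if __name__ == "__main__":
    # Worst case: a small square centred on the middle of the world.
    box = (0.499, 0.499, 0.501, 0.501)
    print([cells_covered(t, *box) for t in range(0, 11)])
    print(choose_tier(*box))
```

For the centred square, every tier from 1 up to the point where the square outgrows a cell reports four cells, and only tier 0 reports one.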

21 Unbounded Dimension: Time
To normalize real-world values to fit on a space filling curve, the sample space must be bounded. Solution: Binning. A bin represents a period for EACH dimension. For example, a periodicity of a year can be used for time (figure labels: 1997, 1998, 1999). Each bin covers its own Hilbert space. Entries that contain ranges may span multiple bins, resulting in duplicates. The Bin ID is part of the row identifier. A single bin for an unbounded dimension covers the half-open interval [min + (period * period duration), min + ((period + 1) * period duration)). Speaker note: special focus on generality, without specifics.
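The bin interval formula can be turned around to find the bin for a value. The sketch below uses plain numbers (for example, fractional years) and hypothetical helper names; it only illustrates the arithmetic, not how GeoWave encodes the bin ID in the row.

```python
def bin_for(value, minimum, period_duration):
    """Return (bin_id, offset) where offset in [0, 1) is the position inside the bin."""
    bin_id = int((value - minimum) // period_duration)
    bin_start = minimum + bin_id * period_duration
    return bin_id, (value - bin_start) / period_duration


def bins_for_range(start, end, minimum, period_duration):
    """List every bin an interval touches; each one yields a separate (duplicate) entry."""
    first, _ = bin_for(start, minimum, period_duration)
    last, _ = bin_for(end, minimum, period_duration)
    return list(range(first, last + 1))


if __name__ == "__main__":
    print(bin_for(1998.5, 1997, 1))
    print(bins_for_range(1997.75, 1999.25, 1997, 1))
```

The offset is the bounded value that would then be placed on that bin's own curve.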

22 Bin: Variability over Dimensions
Each bin is a hyper-rectangle representing ranges of data labeled by points on a Hilbert curve. Bounded dimensions, for example latitude and longitude, assume a single bin. Figure axes: Time, Elevation, Velocity.

23 That’s Enough Theory, Let’s Apply It Accumulo techniques you might find interesting

24 Vector Data Persistence Model
Diagram labels (layout lost in extraction): SFC Curve Hierarchy; Feature ID; Hint to Dedupe Filter; Feature Type; Feature Attribute Name; visibility from Field Visibility Handlers. Column per feature identifier. Column per each feature attribute. Attribute value types include: Geometry, Integer, Double, BigDecimal, Date, Time, String, Boolean, etc.

25 Map Occlusion Culling
At a given zoom level, each pixel signifies a range in degrees. When scanning the data, only one entry is needed within each pixel range. The rest of the entries can be skipped. The block identified in red represents many data points, but is rendered by only 9 pixels.

26 Map Occlusion Culling: Iterators
The Accumulo iterator starts at the first pixel, scans until it hits a geometry, then skips to the next pixel. Figure labels: Database Data; Displayed Pixels (numbered 1 to 4). Annotations: Scan to the first pixel. Seek to the beginning of the next pixel. Points that were all skipped. The rendering engine received only these points.
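The scan-then-seek behaviour can be simulated on a sorted list of keys. This is a simplified, one-dimensional sketch: each pixel is assumed to own a fixed-width run of sorted key values, and a binary search stands in for the iterator's seek. `skip_scan` is a hypothetical helper, not an Accumulo API.

```python
from bisect import bisect_left


def skip_scan(sorted_keys, pixel_width):
    """Return one key per occupied pixel, seeking past the rest of each pixel's range."""
    result = []
    i = 0
    while i < len(sorted_keys):
        key = sorted_keys[i]
        result.append(key)  # the first geometry hit in this pixel is all the renderer needs
        next_pixel_start = (key // pixel_width + 1) * pixel_width
        # Seek: jump directly to the first key at or beyond the next pixel's start.
        i = bisect_left(sorted_keys, next_pixel_start, i + 1)
    return result


if __name__ == "__main__":
    keys = [0, 1, 2, 10, 11, 25, 26, 27, 39]
    print(skip_scan(keys, 10))
```

The work done is proportional to the number of occupied pixels rather than the number of stored points.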

27 Distributed Rendering
Flow: Map Request (Layer, Style) -> GeoServer (GeoWave Plugin) -> Accumulo (GeoWave Iterators). Each scan result is an image with the data in the range. All resultant images are composited together into the Rendered Map, which is returned as the Map Response.

28 Distributed Rendering with Occlusion Culling

29 Raster Data Persistence Model
Diagram labels: SFC Curve Hierarchy; SFC Value (effectively a Tile ID); Coverage Name (unique name for global coverage); Image Data Buffer + Image Metadata. Tiles are unique, so duplication can be ignored. Image Metadata is customizable: the default is to store “no data” values, but this can be customized.

30 Raster Data: Grid Coverage
Tiled, each “cell” fit to boundary. “No Data” values must be maintained. Multi-band, more than just RGB.

31 Raster Data: Advanced Options
Histogram Equalization [10]. Image Pyramid [11]. Tile Merge Strategy: overlapping tiles t1, t2, t3, ..., tn are merged pairwise, e.g. f(f(t1, t2), t3) = final value. Custom data per tile is in scope for f(x).
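The nested application f(f(t1, t2), t3) is a left fold over the tiles. The sketch below shows that shape with a made-up merge function in which the newer tile wins wherever it has data; the flat-list tile representation, `NO_DATA`, and both function names are assumptions for illustration only.

```python
from functools import reduce

NO_DATA = None  # hypothetical marker for cells without a value


def newest_wins(base, overlay):
    """Example merge function f: take the overlay's value unless it is NO_DATA."""
    return [b if o is NO_DATA else o for b, o in zip(base, overlay)]


def merge_tiles(tiles, f):
    """Fold f over tiles t1..tn, i.e. f(...f(f(t1, t2), t3)..., tn)."""
    return reduce(f, tiles)


if __name__ == "__main__":
    t1 = [1, NO_DATA, 3]
    t2 = [NO_DATA, 5, NO_DATA]
    t3 = [7, NO_DATA, NO_DATA]
    print(merge_tiles([t1, t2, t3], newest_wins))
```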

32 Statistics: structure
Statistics infrastructure supports summary data. Currently, each row ID includes an adapter ID and a statistics ID. Current statistics types include population bounding boxes, counts, and ranges. Key layout as shown in the diagram: Row ID = Statistic ID (attribute name and statistic type); Column Family = “STATS”; Column Qualifier = Adapter ID; Visibility (matches the represented data); Time. The Value follows the key. Speaker notes: Index ID…if added and slide not updated. Reason for Index ID: keep index/adapter statistics…all measures missing components…a like a partialed file indx. Eventually to support a cost-based optimizer that picks the best index. Non-idempotent operations such as count.

33 Statistics: Combiner
Example rows, listed as Statistic ID, Family, Qualifier (Adapter ID), Visibility, Time, Value:
“Count” “STATS” 0xA43E A&B 2 30
“Count” “STATS” 0xA43E A&C 4 60
“Count” “STATS” 0xA43E A&B 7 20
After MERGE:
“Count” “STATS” 0xA43E A&B 9 50
Statistics: Commutative Monoids. Recall: the Combiner occurs before the Versioner. A semigroup M is a nonempty set equipped with a binary operation, which is required (only!) to be associative. An element I of a semigroup M is said to be an identity if, for all x ∈ M, Ix = xI = x. A semigroup can have at most one identity. Definition: A monoid is a semigroup with an identity element. A semigroup M is commutative if x · y = y · x for all x, y ∈ M. BBOX: Grow the envelope to the minimum and maximum corners. RANGE: Minimum and maximum. HISTOGRAM: Update bins from coverage over the raster image.
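Because each merge is associative and commutative, rows with the same key can be combined in any order. The sketch below models that with plain Python functions; it is not Accumulo's Combiner class, timestamps are left out, and every name in it is hypothetical.

```python
def merge_count(a, b):
    """COUNT: add the partial counts."""
    return a + b


def merge_range(a, b):
    """RANGE: keep the overall (minimum, maximum)."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def merge_bbox(a, b):
    """BBOX: grow the envelope (min_x, min_y, max_x, max_y) to cover both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def combine(rows, merge):
    """Merge the values of rows that share a key; rows are (key, value) pairs."""
    merged = {}
    for key, value in rows:
        merged[key] = merge(merged[key], value) if key in merged else value
    return merged


if __name__ == "__main__":
    rows = [
        (("Count", "STATS", "0xA43E", "A&B"), 30),
        (("Count", "STATS", "0xA43E", "A&C"), 60),
        (("Count", "STATS", "0xA43E", "A&B"), 20),
    ]
    print(combine(rows, merge_count))
```

Rows with different visibilities keep separate keys here, so the A&C row is left untouched by the merge of the two A&B rows.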

34 Statistics: Transformation Iterator
A query's authorizations may permit access to multiple rows. Example: query with authorizations A, B and C. Rows listed as Statistic ID, Family, Qualifier (Adapter ID), Visibility, Time, Value:
“Count” “STATS” 0xA43E A&B 9 50
“Count” “STATS” 0xA43E A&C 4 60
After MERGE:
“Count” “STATS” 0xA43E A&B&C 9 110
Speaker note: the slide on the future of statistics was dropped, so mention Bloom filters, HyperLogLog, bin stats, etc.
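The merge across visibilities can be sketched by treating each visibility expression as a set of labels that must all be held. That is a simplification (real visibility expressions also allow OR and nesting), and `merge_authorized` is a hypothetical function, not the transformation iterator itself.

```python
def merge_authorized(rows, authorizations):
    """Merge count rows the caller may see; rows are (required_labels, count) pairs."""
    visible = [(labels, count) for labels, count in rows if labels <= authorizations]
    if not visible:
        return None
    # The merged row carries the union of the labels of everything folded into it.
    merged_labels = frozenset().union(*(labels for labels, _ in visible))
    return merged_labels, sum(count for _, count in visible)


if __name__ == "__main__":
    rows = [(frozenset("AB"), 50), (frozenset("AC"), 60)]
    print(merge_authorized(rows, frozenset("ABC")))
    print(merge_authorized(rows, frozenset("AB")))
```

A caller holding only A and B sees a count of 50, while a caller holding A, B and C sees the combined 110.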

35 WFS-T[12] Transactions: Isolation
Problem: Isolation of updates and new records until commit. Solution: Use a managed set of transaction identifiers as authorization tags. A single transaction places an authorization tag in all new entries. Upon commit, the authorization tag is removed using a transforming iterator. Example: visibility “Role1, role2, tx123” becomes “Role1, role2” after Commit. Speaker note: explain the WFS-T specific meaning (Sessions).
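The isolation idea can be simulated with label sets: an entry is visible only to readers holding every label on it, and commit strips the transaction label. This is a toy model with hypothetical names; it does not show how the transforming iterator rewrites keys in Accumulo.

```python
def is_visible(entry_labels, authorizations):
    """An entry is readable only if the reader holds every label on it."""
    return entry_labels <= authorizations


def commit(entries, tx_id):
    """Remove the transaction tag from each entry's labels, exposing it to regular readers."""
    return [(labels - {tx_id}, value) for labels, value in entries]


if __name__ == "__main__":
    pending = [(frozenset({"role1", "role2", "tx123"}), "new feature")]
    reader = frozenset({"role1", "role2"})
    writer = reader | {"tx123"}
    print(is_visible(pending[0][0], reader), is_visible(pending[0][0], writer))
    committed = commit(pending, "tx123")
    print(is_visible(committed[0][0], reader))
```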

36 So What? Eye-Candy You’ve Been Waiting For

37 Microsoft GeoLife
Microsoft Research has made available a trajectory data set that contains the GPS coordinates of 182 users over a period of more than five years (April 2007 to August 2012). There are 17,621 trajectories in this data set, with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours, recorded by GPS loggers and GPS phones, often sampling every 1-5 seconds or every 5-10 meters.

38 GeoLife – Just the tracks

39 Let’s bring out some detail – Kernel Density Estimate (Gaussian Kernel)

40 Let’s zoom in a bit

41 Density estimate again

42 OSM – Planet GPX dump
Every track ever uploaded to OpenStreetMap. Complete data attribution. 2.9 billion spatial entities (points).

43 Level 0 Overview (all the points!)

44 Let’s go deeper..

45

46 Let’s bring out some detail again – Kernel Density Estimate (Gaussian Kernel)

47 Let’s zoom a bit – and try some different styling options

48 Bibliography
[1] Haverkort, Walderveen Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves 2008 arXiv: v2
[2] Hamilton, Rau-Chaplin Compact Hilbert indices: Space-filling curves for domains with unequal side lengths 2008 Information Processing Letters 105 ( )
[3] Hayes Crinkly Curves 2013 American Scientist (178). DOI: /
[4] Skilling Programming the Hilbert Curve Bayesian Inference and Maximum Entropy Methods in Science and Engineering: 23rd Workshop Proceedings American Institute of Physics /04
[5] Wikipedia Well-known_binary
[6] Wikipedia Hilbert curve
[7] Aioanei Uzaygezen–Compact Hilbert Index implementation in Java Google Inc.
[8] Surratt, Boyd, Russelavage Z-Value Curve Index Evaluation 2012 Internal Presentation.
[9] Open Geospatial Consortium Standard List
[10] Remote Sensed Image Processing on Grids for Training in Earth Observation
[11] OSGeo Wiki
[12] WFS-T ( )

