UNCLASSIFIED Geospatial Indexing https://github.com/ngageoint/geowave Eric Robertson Derek Yeager (BAH) Rich Fecher.

Presentation transcript:

UNCLASSIFIED Geospatial Indexing. Eric Robertson, Derek Yeager (BAH), Rich Fecher (Radiant Blue), Steven McNutt (BAH), Whitney Omeara (Radiant Blue), Andrew Manning (Radiant Blue)

GIS: Geographic Information System
GIS data volume explosion: satellite imagery, ground-based imagery, aerial photography.
GIS technology growth: smartphone and GPS applications.
Data problems:
- "On click" retrieval of data.
- Generate maps: create a base image and add vector data (shapes) such as points of interest, roads and boundaries.
- Find features: "restaurants near you".
- Analysis: density, surface analysis, interpolation, pattern discovery.
Map image generated by OpenStreetMap.org.

BigTable Architecture: Scale GIS Data Storage, Query and Processing
Diagram components: Table Server, Data Nodes.
Distributed sorted key-value store. Value keys: Row, Region, Column, Timestamp, Visibility (Accumulo).
Example sorted rows (header as on the slide: Row Region Column Timestamp Value):
  A1 R1 C1 20
  A1 R1 C2 20
  A2 R1 C1 21
  A2 R1 C2 21
  B1 R2 C1 20
  B1 R2 C2 20
  B2 R2 C1 21
Geospatial value keys: Row = geo locality; Region = data group locality; Column = attribute set.
Distribution:
- Geo locality preserved: proximal data points are grouped together.
- Horizontally scalable.
- Per-attribute access constraints.

Basic Problem: Index Two Dimensions in a Single-Dimension Index
Problem: Efficiently locate and retrieve vectors or tiles intersecting a polygon (e.g., a bounding box).
BigTable: Each table is organized into blocks of sorted row identifiers.
Revised problem: A two-way mapping between multiple dimensions and a single-dimension row ID, to support location-efficient storage and retrieval of vectors or tiles given constraints expressed as multi-dimensional boundaries.

Fundamental Approach: Space-Filling Curves Traverse N-Dimensional Space
Space-filling curve: a curve whose range contains the entire n-dimensional hypercube.
Curves are constructed iteratively. The iterations produce a sequence of piecewise linear continuous curves, each one more closely approximating the space-filling limit.
Each discrete value on the curve represents a hyper-rectangle in n-dimensional space.

Hilbert Curve Mapping in 2D: The Globe
- Place a grid on the globe (dotted lines).
- Connect all the points on the grid with a Hilbert SFC.
- The curve provides a linear ordering over two-dimensional space.
- Each discrete point along the curve represents a grid cell and is assigned a sequential identifier.
- Points close together (within the same grid cell) share the same Hilbert identifier.
- A bounding box is defined by the set of ranges it covers on the Hilbert SFC.
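The cell-to-identifier mapping described on this slide can be sketched with the widely published iterative Hilbert conversion. This is an illustration only, not GeoWave's implementation (the bibliography points to the Uzaygezen compact Hilbert index for that); the class and method names are hypothetical, and the curve orientation may differ from the one drawn on the slides.

```java
// Illustrative sketch only; names are hypothetical and this is not GeoWave code.
final class HilbertSketch {
    /**
     * Maps cell (x, y) of an n-by-n grid (n must be a power of two) to its
     * sequential position along a Hilbert curve, in the range [0, n*n).
     */
    static long xyToIndex(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            // Which quadrant of the current square holds the cell?
            int rx = (x & s) > 0 ? 1 : 0;
            int ry = (y & s) > 0 ? 1 : 0;
            // Each quadrant contributes s*s cells; (3*rx)^ry is its visit order.
            d += (long) s * s * ((3 * rx) ^ ry);
            // Rotate/flip so the sub-square is traversed in canonical orientation.
            if (ry == 0) {
                if (rx == 1) {
                    x = n - 1 - x;
                    y = n - 1 - y;
                }
                int t = x;
                x = y;
                y = t;
            }
        }
        return d;
    }
}
```

On a 2x2 grid this visits (0,0), (0,1), (1,1), (1,0) in that order; on larger grids, consecutive identifiers always belong to cells that share an edge, which is the locality property the slide relies on.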

Curve Selection: Minimize Worst Case

Hilbert Curve Precision
Precision is determined by the "depth" of the curve. In this example the globe is defined by a 16x16 grid. Resolution is 22.5 degrees latitude and degrees longitude per cell.
Each elbow (discrete point) in the Hilbert SFC maps to a grid cell.
The precision of the Hilbert SFC, defined in terms of the number of bits, determines the grid. Thus, more bits equate to finer-grained cells.
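The relationship between bits and cell size can be worked through in a few lines. The sketch below assumes each dimension is divided into 2^bits equal cells and, as a further assumption not stated on the slide, that the grid spans the full ranges [-90, 90] for latitude and [-180, 180] for longitude. All names are hypothetical.

```java
// Illustrative sketch only; assumes equal-width cells over a fixed [min, max] extent.
final class GridPrecisionSketch {
    /** Width of one cell when [min, max] is split into 2^bits cells. */
    static double cellSize(double min, double max, int bits) {
        return (max - min) / (1L << bits);
    }

    /** Index of the cell containing value, clamped so that max falls in the last cell. */
    static long cellIndex(double value, double min, double max, int bits) {
        long cells = 1L << bits;
        long idx = (long) Math.floor((value - min) / (max - min) * cells);
        return Math.max(0, Math.min(cells - 1, idx));
    }
}
```

Under those assumptions, 4 bits per dimension (a 16x16 grid) gives `cellSize(-180, 180, 4)` = 22.5 degrees and `cellSize(-90, 90, 4)` = 11.25 degrees; each added bit halves the cell width.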

Recursive Decomposition: Two-Dimension Example
Recursively decompose the Hilbert region to find only those covered regions that overlap the query box.
The figure depicts a third-order (2^3 "buckets" per dimension) Hilbert curve in 2D.
This forms a quad-tree view over the data. Each pair of bits, from most significant to least significant, represents a "quadrant."

Bounding Box Query: Range Decomposition
- Bounding box over grid cells (2,9) to (5,13), given as (lower left) to (upper right).
- Decompose the cells intersecting the bounding box, as shown in blue.
- The range decomposes to three (color-coded) ranges: 70 -> > > 121
- Note: A bounding box from a geospatial query window does not necessarily "snap" perfectly to the grid cells (e.g., 6.2, 8.8 instead of 6, 9). The bounding box is expanded to encompass all intersecting cells.

Range Decomposition Optimization
Here we see the query range fully decomposed into the underlying "quadrants." Decomposition stops when the query window fully contains the quad (see segment 3 and segment 8).
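The decomposition and its stopping rule can be sketched as a quad-tree walk: skip quads that miss the query box, emit a whole quad as one curve range as soon as the box fully contains it, and otherwise recurse. The sketch relies on the fact that every aligned quad of side s occupies one contiguous block of s*s Hilbert identifiers. It is an illustration under assumed names, not GeoWave's code, and because the curve orientation may differ from the slides, the resulting ranges need not match the figure.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch only; names are hypothetical and this is not GeoWave code.
final class RangeDecompositionSketch {
    /** Hilbert position of cell (x, y) in an n-by-n grid (n a power of two). */
    static long hilbert(int n, int x, int y) {
        long d = 0;
        for (int s = n / 2; s > 0; s /= 2) {
            int rx = (x & s) > 0 ? 1 : 0;
            int ry = (y & s) > 0 ? 1 : 0;
            d += (long) s * s * ((3 * rx) ^ ry);
            if (ry == 0) {
                if (rx == 1) { x = n - 1 - x; y = n - 1 - y; }
                int t = x; x = y; y = t;
            }
        }
        return d;
    }

    /** Returns sorted, merged [start, end] ranges covering the inclusive cell box. */
    static List<long[]> decompose(int n, int minX, int minY, int maxX, int maxY) {
        List<long[]> ranges = new ArrayList<>();
        visit(n, 0, 0, n, minX, minY, maxX, maxY, ranges);
        ranges.sort(Comparator.comparingLong((long[] r) -> r[0]));
        List<long[]> merged = new ArrayList<>();
        for (long[] r : ranges) {
            if (!merged.isEmpty() && merged.get(merged.size() - 1)[1] + 1 == r[0]) {
                merged.get(merged.size() - 1)[1] = r[1]; // extend adjacent range
            } else {
                merged.add(r);
            }
        }
        return merged;
    }

    private static void visit(int n, int x0, int y0, int size,
            int minX, int minY, int maxX, int maxY, List<long[]> out) {
        int x1 = x0 + size - 1;
        int y1 = y0 + size - 1;
        if (x1 < minX || x0 > maxX || y1 < minY || y0 > maxY) {
            return; // quad does not touch the query box
        }
        if (x0 >= minX && x1 <= maxX && y0 >= minY && y1 <= maxY) {
            // Fully contained: stop and emit the quad's contiguous curve block.
            long cells = (long) size * size;
            long start = hilbert(n, x0, y0) / cells * cells;
            out.add(new long[] {start, start + cells - 1});
            return;
        }
        // Partially covered (size > 1 here): recurse into the four sub-quads.
        int half = size / 2;
        visit(n, x0, y0, half, minX, minY, maxX, maxY, out);
        visit(n, x0 + half, y0, half, minX, minY, maxX, maxY, out);
        visit(n, x0, y0 + half, half, minX, minY, maxX, maxY, out);
        visit(n, x0 + half, y0 + half, half, minX, minY, maxX, maxY, out);
    }
}
```

Calling `decompose(16, 2, 9, 5, 13)` returns ranges that together contain exactly the 20 cells of that box; each range then becomes one sorted-key scan.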

Intervals: Polygons and Multi-Polygons
- A query is defined by a range per dimension (a bounding rectangle in 2D).
- A duplicate entry is stored for each hyper-rectangle that the interval intersects. The polygon in the example covers 66 cells, so it produces 66 duplicates.
- Duplicate data is removed for each cell: de-duplication is applied in an Accumulo iterator as well as client-side.

Intervals: Polygons and Multi-Polygons
- High-resolution curves force an excessive number of duplicates for large intervals. Consider a high-resolution 2D curve (2^31 x 2^31) and a large polygon such as the Pacific Ocean: it covers ~33% of the Earth's surface, which amplifies to ~1.5 quintillion duplicate entries.
- Solution: tiered indexing [8]. Each tier has a resolution of 2^n x 2^n, where n is the tier number. Thus, each successive tier doubles the resolution per dimension (four times as many cells in 2D).
- Polygons are stored in the lowest tier possible that minimizes the number of duplicates.
- Example: the blue polygon is indexed in tier 2; the red polygon is indexed in tier 3.

Tiers: Query Regions with False Positives
Tiering balances an acceptable number of duplicates against false positives caused by the lower granularity of higher tiers.
Consider the query region in orange. It does not intersect either polygon. However, it does intersect the quadrants that hold the two shapes at their respective tiers. Thus, more rows are filtered during the range scan.
Without tiers, using a higher resolution, this false positive does not occur. However, consider that, for a resolution of 10 (e.g ), hundreds of duplicates occur.

Tiers: Worst Case
Cap the number of duplicates by choosing an appropriate tier. Our analysis indicates that the optimal number of duplicates is 2^d, where d is the number of dimensions (i.e., in 2 dimensions, cap at 4).
Consider the worst case: a small square polygon centered on the inner intersecting boundary (example polygon in red). Regardless of the size of the box, there are always four duplicates at every tier except the 2^0 tier (the orange box, representing the entire world).
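One way to read the tier rule is: pick the finest tier at which the shape's bounding box touches at most 2^d cells. The sketch below does this for d = 2 over coordinates normalized to [0, 1). The normalization, the search order and all names are assumptions made for illustration; GeoWave's actual tier selection may differ.

```java
// Illustrative sketch only; assumes coordinates already normalized to [0, 1).
final class TierSelectionSketch {
    /** Number of tier cells (2^tier per dimension) touched by [min, max]. */
    static long cellsSpanned(double min, double max, int tier) {
        long cells = 1L << tier;
        long lo = Math.min(cells - 1, (long) Math.floor(min * cells));
        long hi = Math.min(cells - 1, (long) Math.floor(max * cells));
        return hi - lo + 1;
    }

    /** Finest tier in [0, maxTier] whose duplicate count stays within cap. */
    static int chooseTier(double minX, double minY, double maxX, double maxY,
            int maxTier, long cap) {
        for (int tier = maxTier; tier > 0; tier--) {
            long duplicates = cellsSpanned(minX, maxX, tier) * cellsSpanned(minY, maxY, tier);
            if (duplicates <= cap) {
                return tier;
            }
        }
        return 0; // tier 0 is a single cell covering the whole space
    }
}
```

With a cap of 4, the box [0.1, 0.4] x [0.1, 0.4] lands in tier 2 (a 2x2 block of cells), while a tiny box straddling the center point (0.5, 0.5) touches four cells at every tier above 0 and is therefore stored at the finest tier, which is the worst case the slide describes.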

Focused on the Generalized Problems
Solve the general problem first.
- Multi-dimensional numeric index supporting efficient data retrieval given a bounded set of constraints for each dimension.
- Indexed data includes scalars and intervals per dimension, for example a range of time or a polygon.
- Index over a mix of bounded and unbounded dimensions, for example time.
Figure axes: Latitude / Longitude / Elevation; Latitude / Longitude / Elevation; Latitude / Time / Elevation.

Unbounded Dimension: Time
To normalize real-world values to fit on a space-filling curve, the sample space must be bounded.
Solution: binning.
- A bin represents a period for EACH dimension. For example, a periodicity of a year can be used for time.
- Each bin covers its own Hilbert space.
- Entries that contain ranges may span multiple bins, resulting in duplicates.
- The bin ID is part of the row identifier.
A single bin for an unbounded dimension: [min + (period * period duration), min + ((period + 1) * period duration))
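The half-open bin formula translates directly into integer arithmetic. The sketch below uses plain long values (for example epoch milliseconds) and a fixed period duration. Calendar-aware periods such as years need extra care and are not modeled, and all names are hypothetical.

```java
// Illustrative sketch only; assumes a fixed-length period measured in the same
// units as the values (e.g. milliseconds).
final class BinningSketch {
    /** Period number whose bin contains value; floorDiv keeps values below min correct. */
    static long binId(long value, long min, long periodDuration) {
        return Math.floorDiv(value - min, periodDuration);
    }

    /** Bounds {inclusive start, exclusive end} of a bin, matching the slide's formula. */
    static long[] binRange(long period, long min, long periodDuration) {
        return new long[] {min + period * periodDuration, min + (period + 1) * periodDuration};
    }

    /** All bins touched by the inclusive interval [start, end]; one entry is written per bin. */
    static long[] binsForInterval(long start, long end, long min, long periodDuration) {
        long first = binId(start, min, periodDuration);
        long last = binId(end, min, periodDuration);
        long[] bins = new long[(int) (last - first + 1)];
        for (int i = 0; i < bins.length; i++) {
            bins[i] = first + i;
        }
        return bins;
    }
}
```

With `min = 0` and a period duration of 100, the value 250 falls in bin 2, whose range is [200, 300), and the interval [250, 410] spans bins 2, 3 and 4, so it would be stored three times.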

BigTable Key-Value Representation
Diagram labels: SFC curve hierarchy, group name, specific data ID.
Tiles are unique; ignore duplication.

CQL Query Processing
Query: DWITHIN(POINT((35,35)),10,KM) and color = 'Red'
- Decompose the query to ranges.
- Scan all rows within those ranges on the table servers.
- Check each row prior to returning it to the client.
Figure notes: the green circle is in the query range and within distance, but is not the correct color. The second (lower right) red circle is in the decomposed query range but not within distance.

WMS Operation: Map Occlusion Culling
At a given zoom level, each pixel signifies a range in degrees. When scanning the data, only one entry is needed within each pixel range; the rest of the entries can be skipped. The block identified in red represents many data points but is rendered by only 9 pixels.

Improving WMS Performance: Skipping in Filter
The Accumulo iterator starts at the first pixel, scans until it hits a geometry, then skips to the next pixel.
- Scan to the first pixel.
- Seek to the beginning of the next pixel.
Figure labels: database data; displayed pixels; the points the rendering engine received; the points that were all skipped.
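The scan-then-seek pattern can be illustrated on a plain sorted array standing in for the sorted rows. The sketch assumes that keys are distinct, non-negative curve identifiers and that each pixel covers a fixed, aligned block of `cellsPerPixel` consecutive identifiers. Both are simplifications, and the real iterator works against Accumulo's seek API rather than an in-memory array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; a sorted array stands in for the sorted key-value store.
final class PixelSkipSketch {
    /** Returns one key per occupied pixel, seeking past the rest of each pixel's keys. */
    static List<Long> firstPerPixel(long[] sortedKeys, long cellsPerPixel) {
        List<Long> rendered = new ArrayList<>();
        int i = 0;
        while (i < sortedKeys.length) {
            long key = sortedKeys[i];
            rendered.add(key); // the first hit in this pixel is enough to draw it
            long nextPixelStart = (key / cellsPerPixel + 1) * cellsPerPixel;
            // "Seek": jump to the first key at or after the start of the next pixel.
            int pos = Arrays.binarySearch(sortedKeys, i, sortedKeys.length, nextPixelStart);
            i = pos >= 0 ? pos : -(pos + 1);
        }
        return rendered;
    }
}
```

For keys {0, 1, 2, 9, 10, 11, 12, 30} with 4 identifiers per pixel, only 0, 9, 12 and 30 are returned; the other four entries are never handed to the renderer.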

More on Distributed Rendering
Diagram components: GeoServer (GeoWave plugin); Accumulo (GeoWave iterators); map request with layer and style; rendered map; map response.
- Each scan result is an image with the data in the range.
- All resultant images are composited together.

What is GeoWave?

Intent
GeoWave is a set of services, plugins and analytics to ingest, index, retrieve and analyze data.
- Pluggable backend for multidimensional indexing on any sorted key-value store.
- Modular design for easy feature extension as well as integration into other platforms.
- Self-describing data: configuration, format, and other information needed to manipulate data in the data store.
As found on

Ingest
Diagram labels:
- Local files, distributed files, Kafka topics.
- Kafka consumers, local threads, MapReduce.
- Indices, statistics, secondary indices.
- Vector data (GeoTools): Shapefiles, GeoJSON, PostGIS, etc.
- Grid formats (GeoTools): ArcGrid, GeoTIFF, etc.
- GeoLife GPS Trajectories, GPX, T-Drive, PDAL (reader/writer), GDAL (reader/writer), MapNik.
- Simple Feature (ISO 19125) via GeoTools (
- Raster images.
- Custom.

Services
- Near neighbors, heat maps, clustering and density scan, OGC services.
- Statistics: counts per data type, bounding region, time range, numeric range, histograms, Count-Min Sketch, HyperLogLog.
- Input/output formats.
Tested versions:
  Accumulo: 1.5.1, 1.6.x, 1.7
  HBase: 1.2
  Cloudera: 2.0.0-cdh4.7.0, cdh5.2
  Hortonworks: HDP 2.1
  Apache Hadoop: 2.6.0.x
  GeoTools: 11.4, 12.1, 12.2
  GeoServer: 2.5.2, 2.6.1, 2.7.5,

Road Map
- New command-line API
- More documentation
- Simplified Java API
- Increase performance
- Field subsetting
- HBase data store adapter
- OpenStreetMap
- Distributed aggregation
- Quickstart guide
- More documentation
- Another data store (Cassandra?)
- OSM improvements
- Initial cost-based optimizer to leverage secondary indexing
- HBase performance enhancements

Developing with GeoWave

Key Concepts
Diagram components: Adapter, Primary Index, Secondary Index, Statistics, Data Store, Index Writer, Index Meta Data, Data.
Diagram annotations: selects for each attribute; provides attribute values to fulfill the model; encodes and decodes; creates; writes to.
The Primary Index contains the "data". A Secondary Index has a pointer to a primary index row ID.

Primary Index
- Index strategy: Tiered SFC, Compound, Hash, Round Robin.
- Index model: dimensions such as Time, Latitude, Longitude.
- Data from multiple adapters is stored in a single index.
- All adapters used in a single index provide attributes to fulfill the index model.

Ingest and Query
Write a set of data to a specific Primary Index:

    // The writer is closed automatically by try-with-resources.
    try (IndexWriter indexWriter = dataStore.createIndexWriter(
            index, DataStoreUtils.DEFAULT_VISIBILITY)) {
        for (final SimpleFeature point : points) {
            indexWriter.write(ADAPTER, point);
        }
    }

Query the index with a CQL filter:

    // Bounding-box coordinates are incomplete in the transcript and left as-is.
    try (final CloseableIterator<SimpleFeature> iterator = dataStore.query(
            new QueryOptions(ADAPTER, index),
            new CQLQuery("BBOX(geometry, , ,-76.6, ) and locationName like 'W%'", ADAPTER))) {
        while (iterator.hasNext()) {
            System.out.println("Query match: " + iterator.next().getID());
        }
    }

Query with Aggregation
Statistics can be aggregated, permitting the computation of statistics, such as histograms, over a selected population. Aggregation is pluggable: custom aggregators can be defined.

    QueryOptions options = new QueryOptions(ADAPTER, index);
    // Ask the data store to return a count instead of the matching features.
    options.setAggregation(new CountAggregation(), ADAPTER);
    try (final CloseableIterator iterator = dataStore.query(
            options,
            new CQLQuery("BBOX(geometry, , ,-76.6, ) and locationName like 'W%'", ADAPTER))) {
        if (iterator.hasNext()) {
            System.out.println("Found : " + iterator.next().getCount());
        }
    }

Not Geospatial: Build the Adapter

    public class CelestialDataAdapter implements DataAdapter<CelestialBody> {
        // Identifies this adapter in the data store.
        public ByteArrayId getAdapterId() {
            return new ByteArrayId("Celestial");
        }

        // Unique ID of a single entry.
        public ByteArrayId getDataId(CelestialBody data) {
            return new ByteArrayId(data.getId());
        }

        public CelestialBody decode(
                IndexedAdapterPersistenceEncoding data, PrimaryIndex index) {…}

        public AdapterPersistenceEncoding encode(
                CelestialBody entry, CommonIndexModel indexModel) {…}
    }

Decode: convert a data store image to a user object.
Encode: convert a user object to a data store image.

Not Geospatial: Build the Dimensions
Define the domain of the curve space function:

    public class RightAscensionDefinition extends BasicDimensionDefinition {
        public RightAscensionDefinition() {
            super(-50, 150);
        }
    }

    public class DeclinationDefinition extends BasicDimensionDefinition {
        public DeclinationDefinition() {
            super(0, 86400);
        }
    }

Define each field to store in the index:

    public class Declination implements NumericDimensionField {
        public ByteArrayId getFieldId() {
            return new ByteArrayId("Declination");
        }
        …
    }

    public class RightAscension implements NumericDimensionField {
        public ByteArrayId getFieldId() {
            return new ByteArrayId("RightAscension");
        }
        …
    }

Not Geospatial: Build the Index Definition

    new CustomIdIndex(
        // SFC curve strategy, with the domain definition of each dimension
        TieredSFCIndexFactory.createFullIncrementalTieredStrategy(
            new NumericDimensionDefinition[] {
                new RightAscensionDefinition(),
                new DeclinationDefinition()
            },
            new int[] {30, 30},
            SFCType.HILBERT),
        // Fields from the data to index
        new BasicIndexModel(
            new NumericDimensionField[] {
                new Declination(),
                new RightAscension()
            }),
        new ByteArrayId("CELESTIAL_INDEX"));

Bibliography
[1] Haverkort, Walderveen. Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves. 2008. arXiv: v2
[2] Hamilton, Rau-Chaplin. Compact Hilbert indices: Space-filling curves for domains with unequal side lengths. 2008. Information Processing Letters 105 ( )
[3] Hayes. Crinkly Curves. 2013. American Scientist (178). DOI: /
[4] Skilling. Programming the Hilbert Curve. Bayesian Inference and Maximum Entropy Methods in Science and Engineering: 23rd Workshop Proceedings. American Institute of Physics /04
[5] Wikipedia. Well-known binary. http://en.Wikipedia.org/wiki/Well-known_binary
[6] Wikipedia. Hilbert curve. http://en.wikipedia.org/wiki/Hilbert_curve
[7] Aioanei. Uzaygezen: Compact Hilbert Index implementation in Java. Google Inc.
[8] Surratt, Boyd, Russelavage. Z-Value Curve Index Evaluation. 2012. Internal presentation.
[9] Open Geospatial Consortium. Standard List.
[10] Remote Sensed Image Processing on Grids for Training in Earth Observation.
[11] OSGeo Wiki.
[12] WFS-T ( )

Write Flow
Diagram components: Index Writer, Data Adapter, Index Strategy, Statistics Manager, Secondary Index Manager ("callbacks").
1. Inject (data).
2. Encode (data), statistics, secondary index requirements.
3. Get row IDs with encoding.
4. Forward IDs and encoding.
5. Calculate and write statistics.
6. Add secondary index entries.

Statistics: Structure
The statistics infrastructure supports summary data. Currently, each row ID includes an adapter ID and a statistics ID. Current statistics types include population bounding boxes, counts and ranges.
Diagram labels: Key (Row ID: Adapter ID, Statistic ID; Column: Family "STATS", Qualifier, Visibility; Time) and Value.
Diagram annotations: the visibility matches the represented data; the statistic ID is the attribute name and statistic type.

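Statistics: Combiner
Rows with the same key are merged over time (MERGE).
Diagram column labels: Statistic ID, Value, Adapter ID, Family, Qualifier, Visibility.
Example rows (Statistic ID, Value, Adapter ID, Family, Visibility):
  "Count"  30  0xA43E  "STATS"  A&B
  "Count"  60  0xA43E  "STATS"  A&C
  "Count"  20  0xA43E  "STATS"  A&B
  "Count"  50  0xA43E  "STATS"  A&B
Merge rules:
- BBOX: grow the envelope to the minimum and maximum corners.
- RANGE: minimum and maximum.
- HISTOGRAM: update bins from coverage over the raster image.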

Statistics: Transformation Iterator
A query's authorizations may authorize multiple rows, which are merged at query time (MERGE).
Query with authorization A, B & C.
Diagram column labels: Statistic ID, Value, Adapter ID, Family, Qualifier, Visibility.
Example rows (Statistic ID, Value, Adapter ID, Family, Visibility):
  "Count"   50  0xA43E  "STATS"  A&B
  "Count"   60  0xA43E  "STATS"  A&C
  "Count"  110  0xA43E  "STATS"  A&B&C
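The merge rules named on the statistics slides (counts, ranges, bounding-box envelopes) are all associative combinations of two partial results, which is what lets them run inside a combiner or a query-time iterator. The sketch below shows the arithmetic only; the array layouts and names are assumptions, not GeoWave's statistics classes.

```java
// Illustrative sketch only; array layouts and names are hypothetical.
final class StatsMergeSketch {
    /** COUNT: partial counts add up. */
    static long mergeCount(long a, long b) {
        return a + b;
    }

    /** RANGE: arrays are {min, max}; keep the smallest min and largest max. */
    static double[] mergeRange(double[] a, double[] b) {
        return new double[] {Math.min(a[0], b[0]), Math.max(a[1], b[1])};
    }

    /** BBOX: arrays are {minX, minY, maxX, maxY}; grow the envelope to cover both. */
    static double[] mergeEnvelope(double[] a, double[] b) {
        return new double[] {
            Math.min(a[0], b[0]), Math.min(a[1], b[1]),
            Math.max(a[2], b[2]), Math.max(a[3], b[3])
        };
    }
}
```

Using the slide's numbers, a query authorized for A, B and C sees both count rows, and `mergeCount(50, 60)` gives the merged value 110.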

Raster Data: Grid Coverage
- Tiled, with each "cell" fit to the boundary.
- "No Data" values must be maintained.
- Multi-band, more than just RGB.

Raster Data: Advanced Options
- Histogram equalization [10]
- Image pyramid [11]
- Tile merge strategy: f(f(t1, t2), t3) = final value, applied over tiles t1, t2, t3, ..., tn. Custom data per tile is in scope for f(x).

Curve Selection: Locality
[Table comparing curves by worst-case dilation (L∞, L2, L1), average box area and worst-case area; some entries are ∞.]
Source: [1] Haverkort, Walderveen. Locality and Bounding-Box Quality of Two-Dimensional Space-Filling Curves. 2008. arXiv: v2

Curve Selection: Sequential IO Optimization
Achieve optimal read performance through a contiguous series of values across two or more dimensions. Reading 11 records over a contiguous range 23->33 is faster than reading a non-contiguous range such as 15,18,34,56-58,83,99,
Consider: latitude and longitude defined by a range (latA, lonA) -> (latB, lonB) should map to the least number of ranges on the space-filling curve.
Haverkort and Walderveen [1] describe 3 metrics to help quantify this: worst-case dilation, average bounding box, and worst-case bounding box.
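The cost being minimized here is the number of separate contiguous ranges a query turns into, since each range is one seek followed by a sequential read. A small helper, with hypothetical names, makes the slide's two examples concrete by counting the contiguous runs in a sorted list of distinct row identifiers.

```java
// Illustrative sketch only; assumes the identifiers are sorted and distinct.
final class ContiguousRangeSketch {
    /** Number of maximal runs of consecutive identifiers (one seek per run). */
    static int countRanges(long[] sortedIds) {
        int ranges = 0;
        for (int i = 0; i < sortedIds.length; i++) {
            // A new range starts at the first id and after every gap.
            if (i == 0 || sortedIds[i] != sortedIds[i - 1] + 1) {
                ranges++;
            }
        }
        return ranges;
    }
}
```

The identifiers 23 through 33 form a single range, while 15, 18, 34, 56-58, 83, 99 form six ranges even though they hold fewer records.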