SpatialHadoop: A MapReduce Framework for Spatial Data


1 SpatialHadoop: A MapReduce Framework for Spatial Data
Authors: Ahmed Eldawy, Mohamed F. Mokbel. Publication: ICDE '15

2 Contents 0. Abstract 1. Background 2. Related Work 3. Architecture
4-7. Four layers 8. Experiments

3 Abstract SpatialHadoop: a full-fledged MapReduce framework with native support for spatial data. It is a comprehensive extension to Hadoop that injects spatial data awareness into each Hadoop layer:
language: Pigeon
storage: two-level spatial index
MapReduce: SpatialFileSplitter, SpatialRecordReader
operations: range query, kNN, spatial join

4 Background Motivations:
Hadoop: a solution for scalable processing of huge datasets
The recent explosion of spatial data
Present: researchers and practitioners worldwide have started to take advantage of the MapReduce environment to support large-scale spatial data:
* Industry: GIS tools on Hadoop
* Academia: 1. Parallel-Secondo 2. MD-HBase 3. Hadoop-GIS

5 Background Drawback: these systems deal with Hadoop as a black box and are limited by the limitations of the underlying Hadoop system. Take Hadoop-GIS as an example:
Hadoop treats spatial data the same as non-spatial data, without additional support
Supports only a uniform grid index, applicable only to uniformly distributed data
MapReduce programs cannot access the constructed spatial index
Parallel-Secondo, MD-HBase and ESRI tools on Hadoop suffer from similar drawbacks.

6 Background SpatialHadoop: built inside the Hadoop base code
able to support a set of spatial index structures
users can develop a myriad of spatial functions, including range queries, kNN and spatial join

7 Background SpatialHadoop - four main layers:
language layer: Pigeon
storage layer: two-level index structure
MapReduce layer: SpatialFileSplitter, SpatialRecordReader
operations layer: encapsulates a dozen spatial operations

8 Related Work Existing work can be classified into two categories:
Specific spatial operations
Systems

9 Related Work Specific: R-tree construction Range query kNN query
All NN query Reverse NN query Spatial join kNN join

10 Related Work System: Hadoop-GIS MD-HBase Parallel-Secondo

11 Architecture Architecture: 3 types of users -Casual user -Developer
-System Admin 4 layers -language -operations -MapReduce -storage

12 Architecture The language layer
Pigeon, a high-level SQL-like language that supports OGC-compliant spatial data types (e.g., Point and Polygon) and operations (e.g., Overlaps and Touches)
The storage layer
a two-level index structure of global and local indexing, implementing three standard indexes: Grid file, R-tree and R+-tree

13 Architecture The MapReduce layer
SpatialFileSplitter: uses the global index to prune file blocks that do not contribute to the answer
SpatialRecordReader: uses the local index to retrieve a partial answer from each block
The operations layer
Encapsulates the implementation of various spatial operations that take advantage of the spatial indexes and the new components in the MapReduce layer

14 Language Layer Background: a set of declarative SQL-like languages have been proposed: HiveQL, Pig Latin, SCOPE and YSmart. Pigeon: an extension to the Pig Latin language, adding spatial data types, functions and operations that conform to the OGC standard.

15 Language Layer Data types: overrides bytearray to support spatial data types such as Point, LineString and Polygon
lakes = LOAD 'lakes' AS (id:int, area:polygon);
Spatial functions: provides spatial functions including aggregate functions (e.g., Union), predicates (e.g., Overlaps), and others (e.g., Buffer)
houses_with_distance = FOREACH houses GENERATE id, Distance(house_loc, sc_loc);
kNN query: a new KNN statement
nearest_houses = KNN houses WITH_K=100 USING Distance(house_loc, query_loc);

16 Language Layer Overrides the following two Pig Latin statements
FILTER: to accept a spatial predicate and call the corresponding procedure for range queries
houses_in_range = FILTER houses BY Overlaps(house_loc, query_range);
JOIN: to accept spatial files and forward them to the corresponding spatial join procedure
lakes_states = JOIN lakes BY lakes_boundary states BY states_boundary PREDICATE = Overlaps

17 Storage Layer Background:
Input files in Hadoop: non-indexed heap files
SpatialHadoop: index structures in HDFS
Indexing in SpatialHadoop is the key to its superior performance over Hadoop
Challenges:
Index structures are optimized for procedural programs
A file in HDFS can only be written sequentially, while traditional indexes are constructed incrementally

18 Storage Layer Existing techniques for spatial indexing in Hadoop:
build only: constructs an R-tree using a MapReduce approach, but the tree is queried outside MapReduce using other techniques
custom on-the-fly indexing: a non-standard index is created and discarded with each query execution
indexing in HDFS: supports only range queries on trajectory data, quite limited

19 Storage Layer Overview:

20 Storage Layer How the challenges are overcome:
local indexes can be processed in parallel
the small size of local indexes allows each one to be bulk loaded in memory and written to a file in an append-only manner
Generic way of building an index:
Partitioning
Local indexing
Global indexing

21 Storage Layer Partitioning
Main goals: block fit, spatial locality, load balancing
Three steps:
Calculate the number of partitions n
Decide partition boundaries
Physical partitioning

22 Storage Layer 1. Calculate the number of partitions n
n = S(1 + α) / B (rounded up)
S: input file size
B: HDFS block capacity (64 MB by default)
α: overhead ratio, set to 0.2 by default
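The formula can be sketched as a small helper; the function name and defaults are illustrative, not SpatialHadoop's actual code:

```python
import math

def num_partitions(file_size, block_capacity=64 * 1024 * 1024, alpha=0.2):
    """n = S * (1 + alpha) / B, rounded up so the data fits in n blocks."""
    return math.ceil(file_size * (1 + alpha) / block_capacity)

# A 1 GB input file with the default 64 MB blocks and alpha = 0.2
n = num_partitions(1024 ** 3)  # -> 20 partitions
```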

23 Storage Layer 2. Partition boundaries
we decide on the spatial area covered by each single partition, defined by a rectangle
boundaries are calculated differently according to the underlying index being constructed, to accommodate the data distribution
The output of this step is a set of n rectangles representing the boundaries of the n partitions

24 Storage Layer 3. Physical partitioning
Initiate a MapReduce job that physically partitions the input file The challenge here is to decide what to do with objects with spatial extents (e.g., polygons) that may overlap more than one partition At the end, for each record r assigned to a partition p, the map function writes an intermediate pair <p, r>. Such pairs are then grouped by p and sent to the reduce function for the next phase
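The map side of this step can be sketched as follows; the (x1, y1, x2, y2) rectangle layout and the replicate-to-every-overlapping-partition policy are illustrative assumptions (one of the options for handling objects with spatial extents):

```python
def rects_overlap(a, b):
    # Rectangles as (x1, y1, x2, y2) with closed intervals.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def partition_map(record, partition_boundaries):
    """Emit an intermediate <p, r> pair for every partition p that the
    record's MBR overlaps; pairs are later grouped by p for the reducers."""
    mbr, payload = record
    for p, boundary in enumerate(partition_boundaries):
        if rects_overlap(mbr, boundary):
            yield (p, payload)

boundaries = [(0, 0, 5, 10), (5, 0, 10, 10)]
# A record straddling the shared edge is sent to both partitions.
pairs = list(partition_map(((4, 1, 6, 2), "road-17"), boundaries))
```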

25 Storage Layer Local indexing
Purpose: build the requested index structure (e.g., Grid or R-tree) as a local index on the data contents of each physical partition
Building the requested index structure is realized as a reduce function that takes the records assigned to each partition and stores them in a spatial index, written to a local index file
Each local index has to fit in one HDFS block, for two reasons:
(1) it allows spatial operations written as MapReduce programs to access local indexes, where each local index is processed in one map task
(2) it ensures that the local index is treated by the Hadoop load balancer as one unit when it relocates blocks across machines

26 Storage Layer Global indexing
Build the requested structure as a global index that indexes all partitions. Process:
1. initiate an HDFS concat command to concatenate all local indexes into one file
2. the master node builds an in-memory global index that indexes all file blocks, using their rectangular boundaries as the index key

27 Storage Layer Global indexing (ctd.) The global index is:
1. built using bulk loading
2. kept in main memory at all times
3. lazily reconstructed if the master node fails and restarts

28 Storage Layer - Grid file
Definition: a simple flat index that partitions the data according to a grid, such that records overlapping each grid cell are stored in one file block as a single partition, assuming data is uniformly distributed
Partitioning:
1. calculate the number of partitions n
2. create a uniform grid of size √n × √n over the space domain and take the boundaries of the grid cells as partition boundaries
3. a record r with a spatial extent is replicated to every grid cell it overlaps
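Step 2 can be sketched as follows; the function name and the (x1, y1, x2, y2) cell layout are assumptions made for the example:

```python
import math

def grid_boundaries(n, x0, y0, x1, y1):
    """Uniform grid of ceil(sqrt(n)) x ceil(sqrt(n)) cells over the space
    domain [x0, x1] x [y0, y1]; each cell is one partition boundary."""
    g = math.ceil(math.sqrt(n))
    cw, ch = (x1 - x0) / g, (y1 - y0) / g
    return [(x0 + i * cw, y0 + j * ch, x0 + (i + 1) * cw, y0 + (j + 1) * ch)
            for i in range(g) for j in range(g)]

cells = grid_boundaries(16, 0, 0, 100, 100)  # 4 x 4 grid of 25 x 25 cells
```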

29 Storage Layer - Grid file
Local indexing: the records of each grid cell are just written to a heap file without building any local indexes Global indexing: concatenates all these files and builds the global index, which is a two dimensional directory table pointing to the corresponding blocks in the concatenated file

30 R-tree An R-tree is a height-balanced tree, similar to a B-tree, with index records in its leaf nodes containing pointers to data objects
Spatial databases: tuples (representing spatial objects) + identifiers
In an R-tree:
Leaf node: <I, identifier>
Non-leaf node: <I, child-pointer>
I: an n-dimensional rectangle

31 R-tree Properties: (M: the maximum number of entries that will fit in one node) (m: a parameter specifying the minimum number of entries in a node)
Every leaf node contains between m and M index records unless it is the root
For each index record (I, identifier) in a leaf node, I is the smallest rectangle that spatially contains the n-dimensional data object represented by the indicated tuple
Every non-leaf node has between m and M children unless it is the root

32 R-tree Properties (ctd.):
For each entry (I, child-pointer) in a non-leaf node, I is the smallest rectangle that spatially contains the rectangles in the child node
The root node has at least two children unless it is a leaf
All leaves appear on the same level

33 Storage Layer - (R-tree)
Partitioning: to compute partition boundaries, a random sample of the input file is bulk loaded into an in-memory R-tree using the Sort-Tile-Recursive (STR) algorithm
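The STR idea can be sketched on a point sample; this is an illustrative simplification (points only, assumed names), not SpatialHadoop's implementation:

```python
import math

def str_boundaries(sample_points, n):
    """Sort-Tile-Recursive sketch: sort the sample by x and cut it into
    ~sqrt(n) vertical strips, then sort each strip by y and cut it into
    runs; the MBR of each run becomes one partition boundary."""
    s = math.ceil(math.sqrt(n))               # strips per axis
    per_strip = math.ceil(len(sample_points) / s)
    pts = sorted(sample_points)               # sort by x (then y)
    boundaries = []
    for i in range(0, len(pts), per_strip):
        strip = sorted(pts[i:i + per_strip], key=lambda p: p[1])  # by y
        per_run = math.ceil(len(strip) / s)
        for j in range(0, len(strip), per_run):
            run = strip[j:j + per_run]
            xs, ys = [p[0] for p in run], [p[1] for p in run]
            boundaries.append((min(xs), min(ys), max(xs), max(ys)))
    return boundaries

sample = [(x, y) for x in range(4) for y in range(4)]
parts = str_boundaries(sample, 4)   # four quadrant-like boundaries
```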

34 Storage Layer - (R-tree)
Local indexing: the records of each partition are bulk loaded into an R-tree using the STR algorithm, then dumped into a file
Each block in a local index file is annotated with the minimum bounding rectangle (MBR) of its contents
The partitions might end up overlapping, similar to traditional R-tree nodes
Global indexing: concatenates all local index files and creates the global index by bulk loading all blocks into an R-tree, using their MBRs as the index key

35 R+-tree Differences from the R-tree:
Nodes are not guaranteed to be at least half filled
The entries of any internal node do not overlap
An object ID may be stored in more than one leaf node
Advantage: point query performance improves, since a single path is followed and fewer nodes are visited than with the R-tree

36 Storage Layer - (R+-tree)
Definition: the R+-tree is a variation of the R-tree where nodes at each level are kept disjoint, while records overlapping multiple nodes are replicated to each node to ensure efficient query answering
Similar to the R-tree except for three changes:
1. in the physical partitioning step, each record is replicated to each partition it overlaps with
2. in the local indexing phase, the records of each partition are inserted into an R+-tree, which is then dumped to a local index file
3. the global index is constructed based on the partition boundaries computed in the partitioning phase, rather than the MBRs of the partition contents, as boundaries should remain disjoint

37 MapReduce Layer Comparison: Hadoop:
1. the input file goes through a FileSplitter that divides it into n splits, where n is set by the MapReduce program based on the number of available slave nodes
2. then, each split goes through a RecordReader that extracts records as key-value pairs, which are passed to the map function
SpatialHadoop:
1. SpatialFileSplitter, an extended splitter that exploits the global index(es) on the input file(s) to early prune file blocks not contributing to the answer
2. SpatialRecordReader, which reads a split originating from spatially indexed input file(s) and exploits the local indexes to efficiently process it

38 MapReduce Layer Comparison(ctd.)

39 MapReduce Layer SpatialFileSplitter Takes: 1. one or two input files
2. a filter function
One input file: the SpatialFileSplitter applies the filter function on the global index of the input file to select the file blocks, based on their MBRs, that should be processed by the job
For example, a range query job provides a filter function that prunes file blocks with MBRs completely outside the query range. For each selected file block in the query range, the SpatialFileSplitter creates a file split, to be processed later by the SpatialRecordReader

40 MapReduce Layer SpatialFileSplitter (ctd.)
Two input files: similar to one input file, with two subtle differences:
1. the filter function is applied to two global indexes, each corresponding to one input file
2. the output of the SpatialFileSplitter is a combined split that contains a pair of file ranges (i.e., file offsets and lengths) corresponding to the two blocks selected by the filter function

41 MapReduce Layer SpatialRecordReader
The SpatialRecordReader takes either a split or combined split and parses it to generate key-value pairs to be passed to the map function. It parses the block to extract the local index that acts as an access method to all records in the block.

42 MapReduce Layer SpatialRecordReader (ctd.)
The record reader sends all the records to the map function, indexed by the local index, with two main benefits:
1. it allows the map function to process all records together, which makes it more powerful and flexible
2. the local index is harnessed when processing the block, making it more efficient than scanning all records

43 Operations Layer Spatial indexing (storage layer) + spatial functionality (MapReduce layer) = efficient realizations of a myriad of spatial operations become possible
Three basic spatial operations:
range query
k nearest neighbor (kNN)
spatial join

44 Operations Layer – Range Query
Definition: a range query takes a set of spatial records R and a query area A as input, and returns the set of records in R that overlap A
Two range query techniques, depending on whether there is replication:
No replication (R-tree)
Replication (Grid or R+-tree)

45 Operations Layer – Range Query
No replication: each record is stored in exactly one partition
Range query algorithm:
Step 1 - global filter: a range filter is passed to the SpatialFileSplitter
blocks that are completely inside the query area -> written directly to the output
blocks that partially overlap the query area -> sent for further processing in the second step

46 Operations Layer – Range Query
Step 2 - local filter:
the SpatialRecordReader reads a block that needs to be processed, extracts its local index and sends it to the map function, which exploits the local index with a traditional range query algorithm to return the matching records

47 Operations Layer – Range Query
Replication: some records are replicated across partitions
Range query algorithm: similar to the no-replication one, except that:
(1) in the global filter step, blocks that are completely contained in the query area A also have to be further processed
(2) the output of the local filter goes through an additional duplicate avoidance step to ensure that duplicates are removed from the final answer

48 Operations Layer – Range Query
Duplicate avoidance step
For each candidate record produced by the local filter step, we compute its intersection with the query area. A record is added to the final result only if the top-left corner of the intersection is inside the partition boundaries. Since partitions are disjoint, it is guaranteed that exactly one partition contains that point.
The output of the duplicate avoidance step gives the final answer of the range query; hence, no reduce function is needed
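The test above can be sketched as follows; the rectangle layout, half-open partition boundaries, and taking "top-left" as the minimum corner are assumptions made for the example:

```python
def intersect(a, b):
    # Rectangles as (x1, y1, x2, y2); None if the rectangles are disjoint.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return (x1, y1, x2, y2) if x1 <= x2 and y1 <= y2 else None

def report(record_mbr, query_area, partition):
    """Report a candidate only if the reference point (minimum corner) of
    its intersection with the query area lies inside this partition;
    half-open boundaries keep neighboring partitions disjoint."""
    inter = intersect(record_mbr, query_area)
    if inter is None:
        return False
    rx, ry = inter[0], inter[1]
    return (partition[0] <= rx < partition[2] and
            partition[1] <= ry < partition[3])

# A record spanning two partitions is reported by exactly one of them.
left, right = (0, 0, 5, 10), (5, 0, 10, 10)
query = (0, 0, 10, 10)
```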

49 Operations Layer - kNN Definition: a kNN query takes a set of spatial points P, a query point Q, and an integer k as input, and returns the k closest points in P to Q
kNN query algorithm in SpatialHadoop:
(1) Initial answer
(2) Correctness check
(3) Answer refinement

50 Operations Layer - kNN Initial answer
First locate the partition that includes Q by feeding the SpatialFileSplitter with a filter function that selects only the overlapping partition The selected partition goes through the SpatialRecordReader to exploit its local index with a traditional kNN algorithm to produce the initial k answers

51 Operations Layer - kNN Correctness check
We draw a test circle C centered at Q with a radius equal to the distance from Q to the furthest of its k initial neighbors
If C does not overlap any partition other than the one containing Q, the initial answer is considered final; otherwise, proceed to the answer refinement step.

52 Operations Layer - kNN Answer refinement
Run a range query to get all points inside the MBR of the test circle C
A scan over the range query result is executed to produce the closest k points as the final answer

53 Operations Layer – Spatial join
Definition: a spatial join takes two sets of spatial records R and S and a spatial join predicate θ (e.g., overlaps) as input, and returns the set of all pairs <r, s> where r ∈ R, s ∈ S, and θ is true for <r, s>
SJMR algorithm, the MapReduce version of the partition-based spatial-merge join (PBSM):
employs a map function that partitions input records according to a uniform grid
and a reduce function that joins the records in each partition

54 Operations Layer – Spatial join
Distributed join:
(Preprocessing, if needed)
Global join
Local join
Duplicate avoidance

55 Operations Layer – Spatial join
Global join: this step produces all pairs of file blocks with overlapping MBRs
The SpatialFileSplitter is fed with the overlapping filter function to exploit the two spatially indexed input files. Then, a traditional spatial join algorithm is applied over the two global indexes to produce the overlapping pairs of partitions. The SpatialFileSplitter finally creates a combined split for each pair of overlapping blocks

56 Operations Layer – Spatial join
Local join: this step joins the records in the two blocks of a combined split to produce pairs of overlapping records
The SpatialRecordReader reads the combined split, extracts the records and local indexes from its two blocks, and sends all of them to the map function for processing. The map function exploits the two local indexes to speed up joining the two sets of records in the combined split.
The result of the local join may contain duplicates, due to records overlapping multiple blocks
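The per-split work can be sketched with a plain nested loop standing in for the index-accelerated join; the (mbr, id) record layout is an assumption made for the example:

```python
def rects_overlap(a, b):
    # Rectangles as (x1, y1, x2, y2) with closed intervals.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def local_join(block_r, block_s):
    """Join the records of the two blocks in a combined split. Records are
    (mbr, id) pairs; a nested loop stands in for the local-index join."""
    return [(rid, sid)
            for r_mbr, rid in block_r
            for s_mbr, sid in block_s
            if rects_overlap(r_mbr, s_mbr)]

lakes = [((0, 0, 2, 2), "lake1"), ((8, 8, 9, 9), "lake2")]
states = [((1, 1, 3, 3), "stateA")]
matches = local_join(lakes, states)  # only lake1 overlaps stateA
```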

57 Operations Layer – Spatial join
Duplicate avoidance: employs the reference-point duplicate avoidance technique
For each detected overlapping pair of records, the intersection of their MBRs is first computed. Then, the overlapping pair is reported as a final answer only if the top-left corner (i.e., the reference point) of the intersection falls in the overlap of the MBRs of the two processed blocks

58 Experiments Compared against standard Hadoop
All experiments are conducted on an Amazon EC2 cluster of up to 100 nodes. The default cluster size is 20 nodes of 'small' instances
Datasets: TIGER, OSM, NASA, SYNTH

59 Experiments – Range Query
SYNTH

60 Experiments – Range Query
TIGER

61 Experiments - kNN SYNTH

62 Experiments - kNN TIGER

63 Experiments – Spatial join

64 Experiments – Index creation

65 Thank you!

