University of Minnesota CG_Hadoop: Computational Geometry in MapReduce Ahmed Eldawy* Yuan Li* Mohamed F. Mokbel*$ Ravi Janardan* * Department of Computer.

Presentation transcript:

University of Minnesota
CG_Hadoop: Computational Geometry in MapReduce
Ahmed Eldawy*, Yuan Li*, Mohamed F. Mokbel*$, Ravi Janardan*
* Department of Computer Science and Engineering, University of Minnesota
$ KACST GIS Technology Innovation Center, Umm Al-Qura University

2 Big Spatial Data
■ Satellite imagery: more than ~500 TB
■ Geotagged tweets: billions of tweets
■ Check-ins: billions of check-ins
■ Millions more every day

3 SpatialHadoop
■ A MapReduce framework tailored for spatial data
■ Free and open source
■ More than 40,000 downloads since the initial release
■ Supports spatial data types
  - E.g., Point, Rectangle, and Polygon
■ Provides spatial partitioning and indexing for spatial data
  - Grid file
  - R-tree
  - R+-tree
■ Efficient MapReduce operations for spatial queries

4 CG_Hadoop
■ Makes use of SpatialHadoop to speed up computational geometry algorithms
  - Polygon union, Skyline, Convex Hull, Farthest/Closest Pair
■ Single-machine implementation
  - E.g., Skyline of 4 billion points takes three hours
■ Straightforward implementation in Hadoop
  - Hadoop parallel execution
■ More efficient implementation in SpatialHadoop
  - Spatial indexing
  - Early pruning
■ Free and open source
[Figure: relative speedup. Single machine: 1x, Hadoop: 29x, SpatialHadoop: 260x]

5 Agenda
■ Motivation
■ Overview
■ CG_Hadoop
  - General Methodology
  - Polygon union
  - Skyline
  - Convex Hull
  - Farthest Pair
  - Closest Pair
■ Experiments
■ Conclusion

6 General Methodology
All algorithms in CG_Hadoop employ a divide-and-conquer approach:
1. Partition the input using the Hadoop/SpatialHadoop partitioner
2. (Optional) Prune partitions that do not contribute to the answer
3. Apply the algorithm locally in each partition
4. Combine the partial answers to compute the final result
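The four steps can be sketched on a single machine as follows. This is only an illustrative skeleton, not CG_Hadoop code: the uniform grid partitioner and the names `grid_partition`, `divide_and_conquer`, `local_fn`, `combine_fn`, and `prune_fn` are assumptions made for the example, and the toy query (topmost point) stands in for a real geometric operation.

```python
from collections import defaultdict

def grid_partition(points, cell_size):
    """Step 1: assign each (x, y) point to a square grid cell (hypothetical partitioner)."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return dict(cells)

def divide_and_conquer(points, cell_size, local_fn, combine_fn, prune_fn=None):
    cells = grid_partition(points, cell_size)              # 1. partition
    if prune_fn is not None:                               # 2. optional pruning
        cells = prune_fn(cells)
    partials = [local_fn(pts) for pts in cells.values()]   # 3. local step (map side)
    return combine_fn(partials)                            # 4. combine (reduce side)

# Toy query: the topmost point, computed per cell and then across cells.
points = [(1, 5), (12, 3), (7, 9), (15, 14), (3, 2)]
topmost = divide_and_conquer(
    points,
    cell_size=10,
    local_fn=lambda pts: max(pts, key=lambda p: p[1]),
    combine_fn=lambda parts: max(parts, key=lambda p: p[1]),
)
print(topmost)
```

In a real MapReduce job, step 3 would run in parallel across partitions and step 4 would run on the collected partial answers; the sketch only mirrors that data flow.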

7 Example: Partitioning of 400 GB OSM Data

8 Polygon Union
Compute the union of a set of polygons
[Figure: Input, Output]

9 Polygon Union in CG_Hadoop
[Figure panels: Hadoop, SpatialHadoop]
Steps: Partition, Local union, Global union

10 Skyline (Maximal Vectors)
Select all non-dominated points in a set of points
[Figure: Input, Output]

11 Skyline in CG_Hadoop
[Figure panels: Hadoop, SpatialHadoop]
Steps: Partition, Pruning, Local skyline, Global skyline
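A minimal single-machine sketch of these steps is given below. It assumes the max-max convention (a point dominates another if it is at least as large in both coordinates and the two points differ), a uniform grid as the partitioner, and a simple, conservative pruning rule: a cell is dropped when some non-empty cell lies strictly above and to the right of it, since every point of that cell then dominates every point of the dropped one. The function names and the pruning rule are assumptions made for illustration and may differ from what CG_Hadoop actually does.

```python
import random
from collections import defaultdict

def skyline(points):
    """Return the non-dominated points (max-max), ordered by decreasing x."""
    result, best_y = [], float("-inf")
    # Scan by decreasing x; a point survives only if it beats every y seen so far.
    for x, y in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        if y > best_y:
            result.append((x, y))
            best_y = y
    return result

def partition(points, cell_size):
    """Assign points to square grid cells keyed by (column, row)."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return dict(cells)

def prune_cells(cells):
    """Drop cells dominated by a non-empty cell strictly above and to the right."""
    keys = list(cells)
    return {
        (cx, cy): pts
        for (cx, cy), pts in cells.items()
        if not any(bx > cx and by > cy for bx, by in keys)
    }

def mapreduce_skyline(points, cell_size):
    cells = prune_cells(partition(points, cell_size))           # partition + pruning
    local = [p for pts in cells.values() for p in skyline(pts)]  # local skylines
    return skyline(local)                                        # global skyline

random.seed(1)
pts = [(random.randint(0, 99), random.randint(0, 99)) for _ in range(1000)]
print(mapreduce_skyline(pts, 20) == skyline(pts))
```

The global step is correct because any point on the overall skyline is also on the skyline of its own cell, so it always reaches the final merge.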

12 Convex Hull
Find the minimal convex polygon that contains all points
[Figure: Input, Output]

13 Convex Hull in CG_Hadoop
[Figure panels: Hadoop, SpatialHadoop]
Steps: Partition, Pruning, Local hull, Global hull
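The local-hull and global-hull steps can be illustrated with the sketch below. It uses Andrew's monotone chain as the single-machine hull algorithm and a uniform grid as the partitioner; both are choices made for this example, not something the slides specify, and the pruning step is omitted. Integer coordinates are assumed so that the orientation test is exact.

```python
import random
from collections import defaultdict

def cross(o, a, b):
    """Positive if o -> a -> b makes a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:                      # build the lower chain left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build the upper chain right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]     # drop the duplicated chain endpoints

def mapreduce_hull(points, cell_size):
    cells = defaultdict(list)
    for x, y in points:                                         # partition
        cells[(x // cell_size, y // cell_size)].append((x, y))
    local_hulls = [convex_hull(pts) for pts in cells.values()]   # local hulls
    merged = [p for hull in local_hulls for p in hull]
    return convex_hull(merged)                                   # global hull

random.seed(7)
pts = [(random.randint(0, 99), random.randint(0, 99)) for _ in range(500)]
print(mapreduce_hull(pts, 25) == convex_hull(pts))
```

Merging works because every vertex of the overall hull is also a vertex of the hull of its own partition, so no hull vertex is lost in the local step.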

14 Farthest Pair
Find the pair of points that have the largest Euclidean distance
[Figure: Input, Output]

15 Farthest Pair in CG_Hadoop
[Figure panels: Hadoop, SpatialHadoop]
Steps: Partition, Pruning, Local farthest pair, Global farthest pair
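One way to realize pruning for this query is to bound the distance between pairs of partitions, as in the sketch below. The bounds are assumptions made for illustration and are not taken from the slides: the upper bound for a pair of cells comes from the rectangle enclosing both bounding boxes, and the lower bound is the largest distance between one sample point from each cell. Any pair of cells whose upper bound falls below the lower bound cannot contain the answer and is skipped. The input is assumed to contain at least two points, and `math.dist` requires Python 3.8 or later.

```python
import random
from collections import defaultdict
from itertools import combinations_with_replacement
from math import dist

def partition(points, cell_size):
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return list(cells.values())

def mbr(pts):
    """Minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    return min(xs), min(ys), max(xs), max(ys)

def max_dist(a, b):
    """Upper bound on the distance between any point in box a and any point in box b."""
    dx = max(a[2], b[2]) - min(a[0], b[0])
    dy = max(a[3], b[3]) - min(a[1], b[1])
    return (dx * dx + dy * dy) ** 0.5

def farthest_pair(points, cell_size):
    cells = partition(points, cell_size)
    boxes = [mbr(c) for c in cells]
    pairs = list(combinations_with_replacement(range(len(cells)), 2))
    # Lower bound: a distance that is actually realized by two input points.
    lower = max(dist(cells[i][0], cells[j][0]) for i, j in pairs)
    best, best_pair = -1.0, None
    for i, j in pairs:
        if max_dist(boxes[i], boxes[j]) < lower:
            continue                       # pruned: cannot beat the lower bound
        for p in cells[i]:                 # "local" farthest pair for this cell pair
            for q in cells[j]:
                d = dist(p, q)
                if d > best:               # running "global" farthest pair
                    best, best_pair = d, (p, q)
    return best, best_pair

random.seed(3)
pts = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(300)]
print(farthest_pair(pts, 20))
```

The pair of cells that holds the true farthest pair always has an upper bound at least as large as the lower bound, so it is never pruned.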

16 Closest Pair
Find the pair of points that have the shortest Euclidean distance
[Figure: Input, Output]

17 Closest Pair in CG_Hadoop
[Figure panels: Hadoop, SpatialHadoop]
Steps: Partition, Local closest pair, Global closest pair
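The sketch below shows one way the local and global steps can fit together so that the final answer stays exact. It is an assumption made for illustration, not a transcription of CG_Hadoop: each grid cell computes its local closest pair at distance δ, then forwards that pair plus every point within δ of the cell boundary, and the global step searches only the forwarded points. Brute force is used for both steps to keep the example short; the input is assumed to contain at least two points.

```python
import random
from collections import defaultdict
from itertools import combinations
from math import dist, inf

def brute_closest(points, idx):
    """Closest pair among points[i] for i in idx; returns (distance, (i, j))."""
    best = (inf, None)
    for i, j in combinations(idx, 2):
        d = dist(points[i], points[j])
        if d < best[0]:
            best = (d, (i, j))
    return best

def closest_pair(points, cell_size):
    cells = defaultdict(list)
    for i, (x, y) in enumerate(points):                     # partition
        cells[(int(x // cell_size), int(y // cell_size))].append(i)
    keep = set()                                            # indices sent to the global step
    for (cx, cy), idx in cells.items():
        delta, pair = brute_closest(points, idx)            # local closest pair
        if pair is not None:
            keep.update(pair)
        x0, y0 = cx * cell_size, cy * cell_size
        x1, y1 = x0 + cell_size, y0 + cell_size
        for i in idx:
            x, y = points[i]
            # A point within delta of the boundary may pair with a neighbouring cell.
            if min(x - x0, x1 - x, y - y0, y1 - y) <= delta:
                keep.add(i)
    d, (i, j) = brute_closest(points, sorted(keep))         # global closest pair
    return d, (points[i], points[j])

random.seed(5)
pts = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(400)]
print(closest_pair(pts, 25))
```

Tracking indices rather than coordinates keeps genuinely duplicated input points distinct while making sure a forwarded point is never compared with itself.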

18 Experiments
■ Apache Hadoop/SpatialHadoop
■ Java 1.6
■ A cluster of 25 nodes
  - Dual core
  - 4 GB RAM
■ Single machine
  - Eight cores
  - 16 GB RAM
■ Real datasets from OpenStreetMap
  - OSM1: 164 million polygons, total size of 80 GB
  - OSM2: 1.7 billion points, total size of 52 GB
■ Synthetic dataset of up to 3.8 billion points, total size of 128 GB

19 Real Datasets
[Figure panels: OSM1 (Polygon Union), OSM2 (Others)]

20 Skyline
[Figure panels: Negatively correlated data, Positively correlated data]

21 Convex Hull
[Figure panels: Uniform data, Gaussian data]

22 Farthest/Closest Pair
[Figure panels: Farthest Pair, Closest Pair; data: Uniform, Circular]

23 Conclusion
■ CG_Hadoop is a suite of scalable MapReduce algorithms for computational geometry problems
■ Implemented in both Hadoop and SpatialHadoop
■ Distributed processing in Hadoop speeds up query answering
■ SpatialHadoop spatial partitioning makes it more efficient
■ Extensive experiments on both real and synthetic data show the scalability and efficiency of CG_Hadoop

University of Minnesota Thank you! Questions?