Multi-dimensional Range Query Processing on the GPU. Beomseok Nam, Data Intensive Computing Lab, School of Electrical and Computer Engineering, Ulsan National Institute of Science and Technology.

Presentation transcript:

Multi-dimensional Range Query Processing on the GPU
Beomseok Nam
Data Intensive Computing Lab, School of Electrical and Computer Engineering
Ulsan National Institute of Science and Technology, Korea

Multi-dimensional Indexing
– One of the core technologies in GIS, scientific databases, computer graphics, etc.
– Access pattern into scientific datasets: the multidimensional range query, which retrieves data that overlaps a given range of values.
Ex) SELECT temperature FROM dataset WHERE latitude BETWEEN 20 AND 30 AND longitude BETWEEN 50 AND 60
– Multidimensional indexing structures: KD-Trees, KDB-Trees, R-Trees, R*-Trees, and bitmap indexes.
– Multi-dimensional indexing is one of the things that do not work well in parallel.
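As a point of reference, the SQL example on this slide can be answered without any index by testing every record against the query box. The sketch below is a minimal illustration, not the author's code; the `Record` layout, the `Interval` type, and the `rangeQuery` name are assumptions made for this example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record layout for the slide's SQL example.
struct Record {
    double latitude;
    double longitude;
    double temperature;
};

// Closed interval [lo, hi]; SQL BETWEEN includes both end points.
struct Interval {
    double lo;
    double hi;
    bool contains(double v) const { return lo <= v && v <= hi; }
};

// Return the temperature of every record that falls inside the query box.
std::vector<double> rangeQuery(const std::vector<Record>& data,
                               Interval lat, Interval lon) {
    assert(lat.lo <= lat.hi && lon.lo <= lon.hi);
    std::vector<double> result;
    for (const Record& r : data) {
        if (lat.contains(r.latitude) && lon.contains(r.longitude)) {
            result.push_back(r.temperature);
        }
    }
    return result;
}
```

Calling `rangeQuery(data, {20, 30}, {50, 60})` corresponds to the WHERE clause above. The indexing structures listed on the slide exist to avoid touching every record in this way.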

Multi-dimensional Indexing Trees: R-Tree
– Proposed by Antonin Guttman (1984).
– Data are stored and indexed via nested MBRs (Minimum Bounding Rectangles).
– Resembles a height-balanced B+-tree.

Multi-dimensional Indexing Trees: R-Tree
[Figure: an example structure of an R-tree. Source: A. Guttman]
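The two MBR operations that an R-tree relies on are computing the box that encloses a set of points and testing whether two boxes overlap. The following is a minimal sketch under stated assumptions: it fixes the dimensionality at 2 (`kDims`) for brevity, and the `MBR`, `boundingBox`, and `overlaps` names are hypothetical, not taken from the talk.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kDims = 2;  // assumed dimensionality for this sketch
using Point = std::array<double, kDims>;

// Axis-aligned minimum bounding rectangle: lowest and highest corner.
struct MBR {
    Point lo;
    Point hi;
};

// Smallest axis-aligned box that encloses every point (input must be non-empty).
MBR boundingBox(const std::vector<Point>& pts) {
    assert(!pts.empty());
    MBR box{pts[0], pts[0]};
    for (const Point& p : pts) {
        for (std::size_t d = 0; d < kDims; ++d) {
            box.lo[d] = std::min(box.lo[d], p[d]);
            box.hi[d] = std::max(box.hi[d], p[d]);
        }
    }
    return box;
}

// Two boxes overlap only if their intervals intersect in every dimension.
bool overlaps(const MBR& a, const MBR& b) {
    for (std::size_t d = 0; d < kDims; ++d) {
        if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
    }
    return true;
}
```

Nesting comes from applying `boundingBox` level by level: a parent's MBR encloses the MBRs of its children, so a query that does not overlap the parent cannot overlap anything beneath it.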

Motivation
– GPGPU has emerged as a new HPC parallel computing paradigm.
– Scientific data analysis applications are major applications in the HPC market.
– A common access pattern into scientific datasets is the multi-dimensional range query.
Q: How can a multi-dimensional range query be parallelized on the GPU?

MPES (Massively Parallel Exhaustive Scan)
– This is how GPGPU is currently utilized.
– Achieves the maximum utilization of the GPU.
– Simple, BUT it must access ALL the data.
– Divide the total dataset by the number of threads: thread[0], thread[1], thread[2], thread[3], …, thread[K-1].
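The partitioning idea behind MPES can be sketched on the CPU as follows. This is an illustration only: the loop over `tid` is a sequential stand-in for K GPU threads, and the `Point2`, `Box`, and `exhaustiveScan` names are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Point2 { double x; double y; };
struct Box { double xlo; double ylo; double xhi; double yhi; };

bool inside(const Point2& p, const Box& q) {
    return q.xlo <= p.x && p.x <= q.xhi && q.ylo <= p.y && p.y <= q.yhi;
}

// Exhaustive scan split into `numThreads` contiguous chunks. The outer loop is
// a sequential stand-in for GPU threads, which would run the chunks concurrently.
std::vector<std::size_t> exhaustiveScan(const std::vector<Point2>& data,
                                        const Box& query,
                                        std::size_t numThreads) {
    assert(numThreads > 0);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + numThreads - 1) / numThreads;  // ceil(n / K)
    std::vector<std::vector<std::size_t>> perThread(numThreads);

    for (std::size_t tid = 0; tid < numThreads; ++tid) {
        const std::size_t begin = std::min(tid * chunk, n);
        const std::size_t end = std::min(begin + chunk, n);
        for (std::size_t i = begin; i < end; ++i) {
            if (inside(data[i], query)) perThread[tid].push_back(i);
        }
    }

    // Gather the per-thread results in thread order.
    std::vector<std::size_t> hits;
    for (const auto& part : perThread) {
        hits.insert(hits.end(), part.begin(), part.end());
    }
    return hits;
}
```

Every element is tested regardless of the query, which is why the work is perfectly balanced across threads but never less than a full pass over the data.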

Parallel R-Tree Search
– Basic idea: compare a given query range with multiple MBRs of child nodes in parallel.
– Each SP compares an MBB with a query.
[Figure: global memory holding Nodes A to G; an SMP holding Node E and its SPs; Q: the i-th query]

Recursive Search on GPU simply does not work
– Spatial indexing structures such as R-Trees or KDB-Trees are inherently not well suited to the CUDA environment.
– Irregular search paths and recursion make it hard to maximize GPU utilization.
– The 48 KB shared memory will overflow when the tree height is > 5.

MPTS (Massively Parallel 3 Phase Scan)
– Leftmost search: choose the leftmost child node, no matter how many child nodes overlap.
– Rightmost search: choose the rightmost child node, no matter how many child nodes overlap.
– Parallel scanning: between the two leaf nodes, perform massively parallel scanning to filter out non-overlapping data elements.
[Figure: the nodes outside the two search paths are pruned out]
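The three phases can be sketched on the CPU over an implicit tree stored as one array of MBRs per level. This is a minimal sketch under assumptions, not the author's implementation: the names are hypothetical, the scan reports overlapping leaves rather than individual data elements, and when no child overlaps the query the descent simply falls back to the first (or last) child so that the scanned range still covers every possible match.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Box { double xlo, ylo, xhi, yhi; };
using Levels = std::vector<std::vector<Box>>;  // levels[0] = leaves, back() = root

bool overlaps(const Box& a, const Box& b) {
    return a.xlo <= b.xhi && b.xlo <= a.xhi && a.ylo <= b.yhi && b.ylo <= a.yhi;
}

Box unite(const Box& a, const Box& b) {
    return {std::min(a.xlo, b.xlo), std::min(a.ylo, b.ylo),
            std::max(a.xhi, b.xhi), std::max(a.yhi, b.yhi)};
}

// Each node of levels[k + 1] covers `fanout` consecutive nodes of levels[k].
Levels buildLevels(std::vector<Box> leaves, std::size_t fanout) {
    assert(!leaves.empty() && fanout >= 2);
    Levels levels;
    levels.push_back(std::move(leaves));
    while (levels.back().size() > 1) {
        const std::vector<Box>& below = levels.back();
        std::vector<Box> above;
        for (std::size_t i = 0; i < below.size(); i += fanout) {
            Box b = below[i];
            const std::size_t end = std::min(i + fanout, below.size());
            for (std::size_t j = i + 1; j < end; ++j) b = unite(b, below[j]);
            above.push_back(b);
        }
        levels.push_back(std::move(above));
    }
    return levels;
}

// Phases 1 and 2: walk from the root to a leaf, always taking the leftmost
// (or rightmost) overlapping child. Assumed fallback: first or last child.
std::size_t descend(const Levels& levels, std::size_t fanout, const Box& q,
                    bool leftmost) {
    std::size_t node = 0;
    for (std::size_t lvl = levels.size() - 1; lvl > 0; --lvl) {
        const std::vector<Box>& kids = levels[lvl - 1];
        const std::size_t first = node * fanout;
        const std::size_t last = std::min(first + fanout, kids.size()) - 1;
        std::size_t pick = leftmost ? first : last;
        for (std::size_t k = 0; k <= last - first; ++k) {
            const std::size_t c = leftmost ? first + k : last - k;
            if (overlaps(kids[c], q)) { pick = c; break; }
        }
        node = pick;
    }
    return node;
}

// Phase 3: scan only the leaves between the two boundary leaves.
std::vector<std::size_t> mptsSearch(const Levels& levels, std::size_t fanout,
                                    const Box& q) {
    const std::size_t lo = descend(levels, fanout, q, true);
    const std::size_t hi = descend(levels, fanout, q, false);
    std::vector<std::size_t> hits;
    for (std::size_t i = lo; i <= hi; ++i) {  // one GPU thread per item
        if (overlaps(levels[0][i], q)) hits.push_back(i);
    }
    return hits;
}
```

The two descents need no stack because each follows a single path, and everything left of the leftmost leaf or right of the rightmost leaf is skipped without being read.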

MPTS improvement using Hilbert Curve
– Hilbert curve: a continuous fractal space-filling curve that maps multi-dimensional points onto a 1D curve.
– Recursively defined curve: a Hilbert curve of order n is constructed from four copies of the Hilbert curve of order n-1, properly oriented and connected.
– Spatial locality preserving method: points that are nearby in 2D tend to be close on the 1D curve as well.
[Figure: first-, second-, and third-order Hilbert curves. Image source: Wikipedia]
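For the 2D case, the mapping from a grid cell to its position on the curve can be computed with the widely used iterative bit-manipulation form shown below. This is a sketch for illustration: the talk's datasets are 4D and its actual Hilbert encoding is not given in the slides, and the `hilbertIndex` name is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Map cell (x, y) of an n-by-n grid (n a power of two) to its position along
// the Hilbert curve. This is the widely used iterative bit-manipulation form.
std::uint64_t hilbertIndex(std::uint32_t n, std::uint32_t x, std::uint32_t y) {
    assert(n > 0 && (n & (n - 1)) == 0);  // n must be a power of two
    assert(x < n && y < n);
    std::uint64_t d = 0;
    for (std::uint32_t s = n / 2; s > 0; s /= 2) {
        const std::uint32_t rx = (x & s) ? 1u : 0u;
        const std::uint32_t ry = (y & s) ? 1u : 0u;
        // Each quadrant at this scale contributes s * s curve positions.
        d += static_cast<std::uint64_t>(s) * s * ((3u * rx) ^ ry);
        // Rotate or flip the quadrant so the sub-curve is oriented correctly.
        if (ry == 0) {
            if (rx == 1) {
                x = n - 1 - x;
                y = n - 1 - y;
            }
            std::swap(x, y);
        }
    }
    return d;
}
```

With this orientation, the first-order curve on a 2-by-2 grid visits (0,0), (0,1), (1,1), (1,0) in that order. Sorting records by this value places spatially close cells near each other in the sorted array.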

MPTS improvement using Hilbert Curve
– The Hilbert curve is well known for its spatial clustering property.
– Sort the data along the Hilbert curve so that similar data are clustered nearby.
– The gap between the leftmost leaf node and the rightmost leaf node would be reduced.
– The number of visited nodes would decrease.


Drawback of MPTS
– MPTS reduces the number of leaf nodes to be accessed, but it still accesses a large number of leaf nodes that do not have the requested data.
– Hence we designed a variant of R-trees that works on the GPU without the stack problem and does not access leaf nodes that do not have the requested data: MPHR-Trees (Massively Parallel Hilbert R-Trees).

MPHR-tree (Massively Parallel Hilbert R-Tree) Bottom-up construction on the GPU 1. Sort data using Hilbert curve index

MPHR-tree (Massively Parallel Hilbert R-tree)
Bottom-up construction on the GPU
2. Build R-trees in a bottom-up fashion. Store the maximum Hilbert value along with the MBR in each node.


MPHR-tree (Massively Parallel Hilbert R-tree)
Bottom-up construction on the GPU
– Basic idea: parallel reduction to generate the MBR of a parent node and to get its maximum Hilbert value.
[Figure: SMP0, SMP1, and SMP2, each running thread[0] … thread[K-1], build the tree bottom-up in parallel between level n+1 and level n]
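The construction steps (sort by Hilbert value, then reduce groups of children into parents) can be sketched on the CPU as below. This is an illustration under assumptions, not the author's kernel: `reduceGroup` emulates a pairwise tree reduction sequentially, each entry is assumed to arrive with its Hilbert value already computed, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// A tree node: its MBR plus the largest Hilbert value stored beneath it.
struct Node {
    double xlo, ylo, xhi, yhi;
    std::uint64_t maxHilbert;
};

Node combine(const Node& a, const Node& b) {
    return {std::min(a.xlo, b.xlo), std::min(a.ylo, b.ylo),
            std::max(a.xhi, b.xhi), std::max(a.yhi, b.yhi),
            std::max(a.maxHilbert, b.maxHilbert)};
}

// Pairwise reduction: in each round slot i absorbs slot i + half. The inner
// loop iterations are independent, so a GPU could give each one to a thread.
Node reduceGroup(std::vector<Node> slots) {
    assert(!slots.empty());
    std::size_t n = slots.size();
    while (n > 1) {
        const std::size_t half = (n + 1) / 2;
        for (std::size_t i = 0; i + half < n; ++i) {
            slots[i] = combine(slots[i], slots[i + half]);
        }
        n = half;
    }
    return slots[0];
}

// One construction step: every `fanout` consecutive nodes form one parent.
std::vector<Node> buildParentLevel(const std::vector<Node>& below,
                                   std::size_t fanout) {
    assert(fanout >= 2);
    std::vector<Node> above;
    for (std::size_t i = 0; i < below.size(); i += fanout) {
        const std::size_t end = std::min(i + fanout, below.size());
        const auto first = below.begin() + static_cast<std::ptrdiff_t>(i);
        const auto last = below.begin() + static_cast<std::ptrdiff_t>(end);
        above.push_back(reduceGroup(std::vector<Node>(first, last)));
    }
    return above;
}

// Full build: sort entries by Hilbert value, then add levels up to the root.
std::vector<std::vector<Node>> buildTree(std::vector<Node> entries,
                                         std::size_t fanout) {
    assert(!entries.empty());
    std::sort(entries.begin(), entries.end(),
              [](const Node& a, const Node& b) { return a.maxHilbert < b.maxHilbert; });
    std::vector<std::vector<Node>> levels;
    levels.push_back(std::move(entries));
    while (levels.back().size() > 1) {
        std::vector<Node> parent = buildParentLevel(levels.back(), fanout);
        levels.push_back(std::move(parent));
    }
    return levels;
}
```

Because each parent depends only on its own group of children, all groups on a level can be reduced at the same time, which is what makes the bottom-up order a good fit for the GPU.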

MPHR-tree (Massively Parallel Hilbert R-tree)
Searching on the GPU
– Iterate leftmost search and parallel scan using the Hilbert curve index.
– leftmostSearch() visits the leftmost search path whose Hilbert index is greater than the given Hilbert index.
– Keep parallel scanning as long as there are overlapping leaf nodes.
[Figure: alternating left-most searches (find leaf node) and parallel scans over a tree with level 0 and level 1]

    lastHilbertIndex = 0;
    while (1) {
        // find the next overlapping leaf beyond the last scanned Hilbert index
        leftmostLeaf = leftmostSearch(lastHilbertIndex, QueryMBR);
        if (leftmostLeaf < 0) break;   // no such leaf is left: done
        // scan from that leaf and remember how far the scan got
        lastHilbertIndex = parallelScan(leftmostLeaf);
    }
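The loop on this slide can be made concrete with a CPU sketch such as the one below. Several details are assumptions rather than the author's method: leaves are assumed to have strictly increasing maximum Hilbert values, the initial bound is -1 so that a Hilbert value of 0 is not skipped, the scan reports overlapping leaves rather than data elements, and a dead end (a node with no qualifying child) is handled by raising the bound past that subtree and restarting from the root.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Box { double xlo, ylo, xhi, yhi; };
struct Node { Box mbr; std::int64_t maxHilbert; };
using Levels = std::vector<std::vector<Node>>;  // levels[0] = leaves, back() = root

bool overlaps(const Box& a, const Box& b) {
    return a.xlo <= b.xhi && b.xlo <= a.xhi && a.ylo <= b.yhi && b.ylo <= a.yhi;
}

// Leaves must already be sorted by strictly increasing maxHilbert.
Levels buildLevels(std::vector<Node> leaves, std::size_t fanout) {
    assert(!leaves.empty() && fanout >= 2);
    Levels levels;
    levels.push_back(std::move(leaves));
    while (levels.back().size() > 1) {
        const std::vector<Node>& below = levels.back();
        std::vector<Node> above;
        for (std::size_t i = 0; i < below.size(); ++i) {
            const Node& c = below[i];
            if (i % fanout == 0) { above.push_back(c); continue; }
            Node& p = above.back();  // widen the current parent
            p.mbr = {std::min(p.mbr.xlo, c.mbr.xlo), std::min(p.mbr.ylo, c.mbr.ylo),
                     std::max(p.mbr.xhi, c.mbr.xhi), std::max(p.mbr.yhi, c.mbr.yhi)};
            p.maxHilbert = std::max(p.maxHilbert, c.maxHilbert);
        }
        levels.push_back(std::move(above));
    }
    return levels;
}

// Leftmost leaf that overlaps q and has maxHilbert above `bound`, or -1.
// Assumed dead-end strategy: raise the bound past the subtree and restart.
long leftmostSearch(const Levels& levels, std::size_t fanout, const Box& q,
                    std::int64_t& bound) {
    const std::size_t top = levels.size() - 1;
    while (true) {
        const Node& root = levels[top][0];
        if (root.maxHilbert <= bound || !overlaps(root.mbr, q)) return -1;
        std::size_t node = 0;
        std::size_t lvl = top;
        while (lvl > 0) {
            const std::vector<Node>& kids = levels[lvl - 1];
            const std::size_t end = std::min((node + 1) * fanout, kids.size());
            std::size_t c = node * fanout;
            while (c < end &&
                   !(kids[c].maxHilbert > bound && overlaps(kids[c].mbr, q))) ++c;
            if (c == end) break;  // dead end: no child qualifies
            node = c;
            --lvl;
        }
        if (lvl == 0) return static_cast<long>(node);
        bound = levels[lvl][node].maxHilbert;
    }
}

// Scan rightwards from `leaf` while leaves keep overlapping the query.
std::int64_t parallelScan(const std::vector<Node>& leaves, std::size_t leaf,
                          const Box& q, std::vector<std::size_t>& hits) {
    std::size_t i = leaf;
    while (i < leaves.size() && overlaps(leaves[i].mbr, q)) hits.push_back(i++);
    return leaves[i - 1].maxHilbert;
}

std::vector<std::size_t> mphrSearch(const Levels& levels, std::size_t fanout,
                                    const Box& q) {
    std::vector<std::size_t> hits;
    std::int64_t lastHilbert = -1;
    while (true) {
        const long leaf = leftmostSearch(levels, fanout, q, lastHilbert);
        if (leaf < 0) break;
        lastHilbert = parallelScan(levels[0], static_cast<std::size_t>(leaf), q, hits);
    }
    return hits;
}
```

Unlike a single leftmost-to-rightmost scan, each restart skips directly to the next overlapping leaf, so runs of non-overlapping leaves between matches are never scanned.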

MPTS vs MPHR-Tree
– The search complexity of the MPHR-Tree is expressed in terms of k, the number of leaf nodes that have the requested data.
[Figure: nodes pruned out by MPTS compared with nodes pruned out by MPHR-Trees]

Braided Parallelism vs Data Parallelism
– Braided parallel indexing: multiple queries can be processed in parallel.
– Data parallel indexing (partitioned indexing): a single query is processed by all the CUDA SMPs, using partitioned R-trees.
[Figure: braided parallel indexing compared with data parallel indexing]

Performance Evaluation
Experimental setup (MPTS vs MPHR-tree)
– CUDA Toolkit 5.0
– Tesla Fermi M2090 GPU card: 16 SMPs, each with 32 CUDA cores, which enables 512 (16 x 32) threads to run concurrently.
– Datasets: 40 million 4D points in uniform, normal, and Zipf distributions.

Performance Evaluation
MPHR-tree construction
– 12 KB page (fanout = 256), 128 CUDA blocks x 64 threads per block.
– It takes only 4 seconds to build R-trees over 40 million data points, while the CPU takes more than 40 seconds (10x speedup).
– Excluding memory transfer time, it takes only 50 msec (800x speedup).

Performance Evaluation
MPTS search vs MPES search
– 12 KB page (fanout = 256), 128 CUDA blocks x 64 threads per block, selection ratio = 1%.
– MPTS outperforms MPES and R-trees on a Xeon E5506 (8 cores).
– In high dimensions, MPTS accesses more memory blocks, but the number of instructions executed by a warp is smaller than with MPES.

Performance Evaluation
MPHR-tree search
– 12 KB page (fanout = 256), 128 CUDA blocks x 64 threads per block.
– The MPHR-tree consistently outperforms the other indexing methods.
– In terms of throughput, braided MPHR-trees show an order of magnitude higher performance than multi-core R-trees and MPES.
– In terms of query response time, partitioned MPHR-trees are an order of magnitude faster than multi-core R-trees and MPES.

Performance Evaluation
MPHR-tree search
– In a cluster environment, MPHR-trees show an order of magnitude higher throughput than the LBNL FastQuery library.
– LBNL FastQuery is a parallel bitmap indexing library for multi-core architectures.

Summary
– Brute-force parallel methods can be refined with more sophisticated parallel algorithms.
– We proposed new parallel tree traversal algorithms and showed that they significantly outperform the traditional recursive access to hierarchical tree structures.

Q&A Thank You

MPTS improvement using Sibling Check
– When the current node doesn't have any overlapping children, check its sibling nodes.
– It's always better to prune tree nodes at an upper level.

CUDA
– GPGPU: general-purpose computing on graphics processing units.
– CUDA is a set of development tools for creating applications that execute on the GPU.
– GPUs allow the creation of a very large number of concurrently executing threads at very low system resource cost.
– CUDA also exposes fast shared memory (48 KB) that can be shared between threads.
– Tesla M2090: 16 x 32 = 512 cores.
[Figure: image source Wikipedia]

Grids and Blocks of CUDA Threads
– A kernel is executed as a grid of thread blocks; all threads share the data memory space.
– A thread block is a batch of threads that can cooperate with each other by synchronizing their execution (for hazard-free shared memory accesses) and by efficiently sharing data through a low-latency shared memory.
– Two threads from two different blocks cannot cooperate.
[Figure: the host launches Kernel 1 and Kernel 2 on the device as Grid 1 and Grid 2; Grid 1 contains Blocks (0, 0) to (2, 1), and Block (1, 1) contains Threads (0, 0) to (4, 2). Courtesy: NVIDIA]
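To make the grid and block hierarchy concrete, the sketch below reproduces on the CPU the index arithmetic that a one-dimensional CUDA kernel would typically use. It is an illustration only: `LaunchConfig`, `globalThreadId`, and `elementsFor` are hypothetical names, and the grid-stride loop is a common CUDA idiom rather than something stated in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D launch configuration, e.g. 128 blocks of 64 threads.
struct LaunchConfig {
    std::size_t blocks;
    std::size_t threadsPerBlock;
};

// Same arithmetic as blockIdx.x * blockDim.x + threadIdx.x in a 1-D kernel.
std::size_t globalThreadId(const LaunchConfig& cfg, std::size_t block,
                           std::size_t thread) {
    assert(block < cfg.blocks && thread < cfg.threadsPerBlock);
    return block * cfg.threadsPerBlock + thread;
}

// Elements one thread would visit in a grid-stride loop over n items:
// it starts at its global id and advances by the total thread count.
std::vector<std::size_t> elementsFor(const LaunchConfig& cfg, std::size_t block,
                                     std::size_t thread, std::size_t n) {
    const std::size_t stride = cfg.blocks * cfg.threadsPerBlock;
    std::vector<std::size_t> mine;
    for (std::size_t i = globalThreadId(cfg, block, thread); i < n; i += stride) {
        mine.push_back(i);
    }
    return mine;
}
```

With the 128 x 64 configuration used in the evaluation slides, the stride is 8192 threads, so each thread handles every 8192nd element of the input.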