Database Operations on GPU Changchang Wu 4/18/2007.

Presentation transcript:

Database Operations on GPU Changchang Wu 4/18/2007

Outline Database Operations on GPU Point List Generation on GPU Nearest Neighbor Searching on GPU

Database Operations on GPU

Design Issues Low bandwidth between GPU and CPU: avoid frame buffer readbacks. No arbitrary writes: avoid data rearrangements. Programmable pipeline has poor branching: evaluate branches using fixed-function tests.

Design Overview Use the depth test functionality of GPUs to perform comparisons. It implements all possible comparisons: <, <=, >, >=, ==, !=, ALWAYS, NEVER. Use the stencil test for data validation and for storing the results of comparison operations. Use occlusion queries to count the number of elements that satisfy some condition.

Basic Operations Basic SQL query: SELECT A FROM T WHERE C. A = attributes or aggregations (SUM, COUNT, MAX, etc.). T = relational table. C = Boolean combination of predicates (using operators AND, OR, NOT).

Basic Operations Predicates: a_i op constant, or a_i op a_j, where op is one of <, >, <=, >=, !=, =, TRUE, FALSE. Boolean combinations: Conjunctive Normal Form (CNF) expression evaluation. Aggregations: COUNT, SUM, MAX, MEDIAN, AVG.

Predicate Evaluation a_i op constant (d): Copy the attribute values a_i into the depth buffer. Define the comparison operation using the depth test. Draw a screen-filling quad at depth d. glDepthFunc(…) glStencilOp( fail, zfail, zpass );
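The depth-test predicate above can be emulated on the CPU to make the data flow concrete. This is only a sketch, not the presenter's code: the names `DEPTH_FUNCS`, `evaluate_predicate` and `occlusion_count` are hypothetical, the depth buffer is a plain list, and the stencil buffer is modeled as a list of 0/1 flags. In real OpenGL the incoming fragment depth is the left operand of the depth comparison, so the function passed to glDepthFunc may need to be mirrored relative to the `a_i op d` form used here.

```python
import operator

# Comparison modes named after the depth-test functions listed in the slides.
DEPTH_FUNCS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
    "ALWAYS": lambda a, d: True,
    "NEVER": lambda a, d: False,
}

def evaluate_predicate(depth_buffer, op, d):
    """Emulate drawing one screen-filling quad at depth d.

    depth_buffer holds one attribute value per pixel (record).
    Returns a stencil-like list: 1 where `a_i op d` holds, else 0.
    """
    compare = DEPTH_FUNCS[op]
    return [1 if compare(a, d) else 0 for a in depth_buffer]

def occlusion_count(stencil):
    """Emulate an occlusion query: number of pixels that passed."""
    return sum(stencil)

if __name__ == "__main__":
    ages = [10, 24, 37, 99]
    mask = evaluate_predicate(ages, ">=", 30)
    print(mask, occlusion_count(mask))
```

Each record is touched independently, which is why the GPU can evaluate all of them in a single quad draw.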

Predicate Evaluation Comparing two attributes: a_i op a_j is treated as (a_i - a_j) op 0. Semi-linear queries: easy to compute with a fragment shader.

Boolean Combinations Expression provided as a CNF. A CNF is of the form (A_1 AND A_2 AND … AND A_k), where A_i = (B_{i,1} OR B_{i,2} OR … OR B_{i,m_i}). A CNF does not have a NOT operator. If the expression has a NOT operator, invert the comparison operation to eliminate it, e.g. NOT (a_i < d) => (a_i >= d). For example, to compute a_i within [low, high], evaluate (a_i >= low) AND (a_i <= high).

Algorithm

Range Query Compute a_i within [low, high]. Evaluated as (a_i >= low) AND (a_i <= high).
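A range query is the simplest CNF: two single-predicate clauses joined by AND. The sketch below emulates CNF evaluation on the CPU, with a list of booleans standing in for the stencil buffer. The names `OPS`, `NEGATED`, `eval_cnf` and `range_query`, and the clause encoding as (attribute index, operator, constant) tuples, are assumptions made for illustration and do not come from the slides.

```python
import operator

OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}

# NOT elimination: NOT (a op d) is rewritten as (a NEGATED[op] d).
NEGATED = {"<": ">=", "<=": ">", ">": "<=", ">=": "<", "==": "!=", "!=": "=="}

def eval_cnf(records, cnf):
    """Evaluate (A_1 AND ... AND A_k), each A_i an OR of simple predicates.

    records: list of tuples of attribute values.
    cnf: list of clauses; a clause is a list of (attr_index, op, constant).
    Returns one boolean per record (the emulated stencil result).
    """
    valid = [True] * len(records)
    for clause in cnf:
        for r, rec in enumerate(records):
            # A record stays valid only if some predicate of this clause holds.
            if valid[r]:
                valid[r] = any(OPS[op](rec[attr], c) for attr, op, c in clause)
    return valid

def range_query(values, low, high):
    """a within [low, high], evaluated as (a >= low) AND (a <= high)."""
    records = [(v,) for v in values]
    return eval_cnf(records, [[(0, ">=", low)], [(0, "<=", high)]])

if __name__ == "__main__":
    print(range_query([10, 24, 37, 99, 192], 20, 100))
```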

Aggregations COUNT, MAX, MIN, SUM, AVG No data rearrangements

COUNT Use occlusion queries to get pixel pass count Syntax: Begin occlusion query Perform database operation End occlusion query Get count of number of attributes that passed database operation Involves no additional overhead!

MAX, MIN, MEDIAN We compute Kth-largest number Traditional algorithms require data rearrangements We perform no data rearrangements, no frame buffer readbacks

K-th Largest Number By comparing and counting, determine every bit in order from MSB to LSB

Example: Parallel Max S = {10, 24, 37, 99, 192, 200, 200, 232}
Step 1: Draw quad at 128 (10000000). Values that pass: {192, 200, 200, 232}
Step 2: Draw quad at 192 (11000000). Values that pass: {192, 200, 200, 232}
Step 3: Draw quad at 224 (11100000). Values that pass: {232}
Step 4: Draw quad at 240 (11110000). No values pass
Step 5: Draw quad at 232 (11101000). Values that pass: {232}
Steps 6, 7, 8: Draw quads at 236, 234, 233. No values pass, so Max is 232
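The bit-by-bit search in this example can be written as a short CPU sketch. It assumes non-negative integers that fit in `bits` bits; the function name `kth_largest` is hypothetical, and the inner count stands in for one quad draw followed by an occlusion query. With k = 1 it reproduces the parallel max example.

```python
def kth_largest(values, k, bits=8):
    """Return the k-th largest value (k=1 is the maximum).

    Determines the answer one bit at a time, MSB to LSB. Each iteration
    emulates drawing a quad at depth `candidate` and counting, via an
    occlusion query, how many values pass the test `value >= candidate`.
    """
    result = 0
    for i in reversed(range(bits)):
        candidate = result | (1 << i)
        count = sum(1 for v in values if v >= candidate)
        if count >= k:
            # At least k values are >= candidate, so bit i of the answer is 1.
            result = candidate
    return result

if __name__ == "__main__":
    s = [10, 24, 37, 99, 192, 200, 200, 232]
    print(kth_largest(s, 1))        # maximum
    print(kth_largest(s, len(s)))   # minimum
```

The data is never rearranged or read back; only one counter per pass leaves the emulated GPU, which is the property the slides emphasize.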

Accumulator, Mean Accumulator - Use sorting algorithm and add all the values Mean – Use accumulator and divide by n Interval range arithmetic Alternative algorithm Use fragment programs – requires very few renderings Use mipmaps [Harris et al. 02], fragment programs [Coombe et al. 03]

Accumulator Data representation is of the form $a_k 2^k + a_{k-1} 2^{k-1} + \dots + a_0$. Sum = $\mathrm{sum}(a_k) 2^k + \mathrm{sum}(a_{k-1}) 2^{k-1} + \dots + \mathrm{sum}(a_0)$. Current GPUs support no bit-masking operations

The Algorithm >=0.5 means i-th bit is 1
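A CPU sketch of the bitwise accumulator follows. It assumes that the ">= 0.5" test refers to the fractional part of value / 2^(i+1), which is one way to test bit i without bit-masking; that reading, and the names `bit_is_set` and `accumulate`, are assumptions rather than something the slide spells out. Values are taken to be non-negative integers.

```python
import math

def bit_is_set(value, i):
    """Test bit i without bit masks: frac(value / 2^(i+1)) >= 0.5."""
    scaled = value / float(1 << (i + 1))
    return (scaled - math.floor(scaled)) >= 0.5

def accumulate(values, bits=8):
    """Sum the values as sum_i count(bit i set) * 2^i.

    Each per-bit count emulates one rendering pass plus an occlusion query.
    """
    total = 0
    for i in range(bits):
        count_i = sum(1 for v in values if bit_is_set(v, i))
        total += count_i << i
    return total

if __name__ == "__main__":
    s = [10, 24, 37, 99, 192, 200, 200, 232]
    total = accumulate(s)
    print(total, total / len(s))  # sum and mean
```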

Implementation Algorithm CPU – Intel compiler 7.1 with hyper-threading, multi-threading, SIMD optimizations GPU – NVIDIA Cg Compiler Hardware Dell Precision Workstation with Dual 2.8GHz Xeon Processor NVIDIA GeForce FX 5900 Ultra GPU 2GB RAM

Benchmarks TCP/IP database with 1 million records and four attributes Census database with 360K records

Copy Time

Predicate Evaluation

Range Query

Multi-Attribute Query

Semi-linear Query

Kth-Largest

Kth-Largest conditional

Accumulator

Analysis: Issues Precision Copy time Integer arithmetic Depth compare masking Memory management No Branching No random writes

Analysis: Performance Relative Performance Gain High Performance – Predicate evaluation, multi-attribute queries, semi-linear queries, count Medium Performance – Kth-largest number Low Performance - Accumulator

High Performance Parallel pixel processing engines Pipelining Early Z-cull Eliminate branch mispredictions

Medium Performance Parallelism: the FX 5900 has a clock speed of 450 MHz and 8 pixel processing engines. Rendering a single 1000x1000 quad takes 0.278 ms. Rendering 19 such quads takes 5.28 ms. Observed time is 6.6 ms: 80% efficiency in parallelism!
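The figures on this slide can be checked with a few lines of arithmetic. The sketch assumes each pixel engine processes one pixel per clock cycle, which is what makes the 0.278 ms figure come out; the function name `parallel_efficiency` is hypothetical.

```python
def parallel_efficiency(clock_hz, engines, pixels_per_quad, quads, observed_ms):
    """Return (ms per quad, ideal total ms, efficiency).

    Assumes one pixel per engine per clock cycle.
    """
    ms_per_quad = pixels_per_quad / (clock_hz * engines) * 1e3
    ideal_ms = quads * ms_per_quad
    return ms_per_quad, ideal_ms, ideal_ms / observed_ms

if __name__ == "__main__":
    per_quad, ideal, eff = parallel_efficiency(450e6, 8, 1000 * 1000, 19, 6.6)
    print(round(per_quad, 3), round(ideal, 2), round(eff, 2))
```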

Low Performance No gain over SIMD based CPU implementation Two main reasons: Lack of integer-arithmetic Clock rate

Advantages Algorithms progress at GPU growth rate Offload CPU work Fast due to massive parallelism on GPUs Algorithms could be generalized to any geometric shape Eg. Max value within a triangular region Commodity hardware!

GPU Point List Generation Data compaction

Overall task

3D to 2D mapping

Current Problem

The solution

Overview, Data Compaction

Algorithm: Discriminator

Algorithm: Histogram Builder

Histogram Output

Algorithm: PointList Builder

PointList Output

Timing Reduces a highly sparse matrix with N elements to a list of its M active entries in O(N) + M (log N) steps.
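The O(N) + M log N cost follows from the two phases: building the pyramid touches every element once, and each of the M outputs is found by one top-down traversal. Below is a minimal one-dimensional sketch of the idea. The GPU version in the cited paper reduces 2D textures, so this is a simplification, and the names `build_pyramid` and `point_list` are hypothetical. The input length is assumed to be a power of two.

```python
def build_pyramid(active):
    """Histogram builder: sum pairs of cells level by level (O(N) total).

    active: list of 0/1 flags whose length is a power of two.
    Returns the levels from the base (input) up to the single top cell.
    """
    levels = [list(active)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels

def point_list(active):
    """PointList builder: for each output slot, descend the pyramid (log N)."""
    levels = build_pyramid(active)
    total = levels[-1][0]          # number of active entries, M
    out = []
    for k in range(total):
        idx, remaining = 0, k
        for level in reversed(levels[:-1]):
            left = 2 * idx
            if remaining < level[left]:
                idx = left                     # k-th entry lies in left child
            else:
                remaining -= level[left]       # skip the left child's entries
                idx = left + 1
        out.append(idx)
    return out

if __name__ == "__main__":
    print(point_list([0, 1, 0, 0, 1, 1, 0, 1]))
```

Every output slot is computed independently of the others, which is what lets the traversal run as one fragment per output element.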

Applications Image Analysis Feature Detection Volume Analysis Sparse Matrix Generation

Searching 1D Binary Search Nearest Neighbor Search for High dimension space K-NN Search

Binary Search Find a specific element in an ordered list Implement just like CPU algorithm Assuming hardware supports long enough shaders Finds the first element of a given value v If v does not exist, find next smallest element > v Search algorithm is sequential, but many searches can be executed in parallel Number of pixels drawn determines number of searches executed in parallel 1 pixel == 1 search

Binary Search Search for v0. [Figure: sorted list] Initialize: the search starts at the center of the sorted array. v2 >= v0, so search the left half of the sub-array.

Binary Search Search for v0. [Figure: sorted list] Step 1: v0 >= v0, so search the left half of the sub-array.

Binary Search Search for v0. [Figure: sorted list] Step 2: v0 >= v0, so search the left half of the sub-array.

Binary Search Search for v0. [Figure: sorted list] Step 3: at this point, we have either found v0 or are 1 element too far left. One last step to resolve.

Binary Search Search for v0. [Figure: sorted list] Step 4: done!

Binary Search Search for v0 and v2. [Figure: sorted list] Initialize: the search starts at the center of the sorted array. Both searches proceed to the left half of the array.

Binary Search Search for v0 and v2. [Figure: sorted list] Step 1: the search for v0 continues as before. The search for v2 overshot, so go back to the right.

Binary Search Search for v0 and v2. [Figure: sorted list] Step 2: we've found the proper v2, but are still looking for v0. Both searches continue.

Binary Search Search for v0 and v2. [Figure: sorted list] Step 3: now we've found the proper v0, but overshot v2. The cleanup step takes care of this.

Binary Search Search for v0 and v2. [Figure: sorted list] Step 4: done! Both v0 and v2 are located properly.

Binary Search Summary Single rendering pass Each pixel drawn performs independent search O(log n) steps
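The per-pixel search can be sketched on the CPU as a lower-bound binary search with a fixed iteration count, mirroring a shader that runs the same number of steps for every pixel. This is an illustrative variant, not the shader from the slides: it folds the cleanup step into the standard half-open interval formulation. The names `lower_bound` and `parallel_search` are hypothetical.

```python
def lower_bound(sorted_list, v):
    """Index of the first element >= v (len(sorted_list) if none).

    Runs a fixed number of steps, n.bit_length(), which is enough because
    the interval at least halves on every step.
    """
    lo, hi = 0, len(sorted_list)
    for _ in range(len(sorted_list).bit_length()):
        if lo < hi:
            mid = (lo + hi) // 2
            if sorted_list[mid] >= v:
                hi = mid          # answer is at mid or to its left
            else:
                lo = mid + 1      # answer is strictly to the right
    return lo

def parallel_search(sorted_list, queries):
    """One independent search per query, as one pixel == one search."""
    return [lower_bound(sorted_list, q) for q in queries]

if __name__ == "__main__":
    data = [0, 0, 0, 2, 2, 2, 5, 5]
    print(parallel_search(data, [0, 2]))
```

If v is absent, the returned index points at the next element greater than v, matching the behavior described on the first Binary Search slide.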

Nearest Neighbor Search A fundamental step in similarity search for data mining, retrieval… Curse of dimensionality: when dimensionality is very high, structures like k-d trees do not help. Use the GPU to improve the linear scan.

Distances N-norm distance Cosine distance acos(dot(x,y))

Data Representation Use separate textures to store different dimensions.

Distance Computation Accumulating distance component of different dimensions

Reduction in RGBA

Reduction to find NN
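The three steps above (one texture per dimension, per-dimension accumulation, reduction) can be sketched as a CPU linear scan. The sketch assumes squared Euclidean distance and at least one data point; the name `nearest_neighbor` and the list-of-lists layout standing in for textures are hypothetical.

```python
def nearest_neighbor(dimension_arrays, query):
    """Linear-scan nearest neighbor with one array ("texture") per dimension.

    dimension_arrays[d][i] is coordinate d of point i.
    Returns (index, squared distance) of the closest point.
    """
    n = len(dimension_arrays[0])
    dist2 = [0.0] * n
    # Accumulate the distance contribution of each dimension in turn.
    for coords, q in zip(dimension_arrays, query):
        for i in range(n):
            diff = coords[i] - q
            dist2[i] += diff * diff
    # Pairwise min reduction, halving the candidate list on each pass.
    candidates = [(dist2[i], i) for i in range(n)]
    while len(candidates) > 1:
        reduced = [min(candidates[j], candidates[j + 1])
                   for j in range(0, len(candidates) - 1, 2)]
        if len(candidates) % 2:
            reduced.append(candidates[-1])
        candidates = reduced
    best_dist2, best_index = candidates[0]
    return best_index, best_dist2

if __name__ == "__main__":
    xs, ys = [0, 3, 1], [0, 4, 1]      # points (0,0), (3,4), (1,1)
    print(nearest_neighbor([xs, ys], (1, 2)))
```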

Results

K-Nearest Neighbor Search Given a sample point p, find the k points nearest p within a data set On the CPU, this is easily done with a heap or priority queue Can add or reject neighbors as search progresses Don’t know how to build one efficiently on GPU kNN-grid Can only add neighbors…

kNN-grid Algorithm [Figure legend: sample point, neighbors found, candidate neighbor. Want 4 neighbors.]

kNN-grid Algorithm Candidate neighbors must be within the max search radius. Visit voxels in order of distance to the sample point.

kNN-grid Algorithm If the current number of neighbors found is less than the number requested, grow the search radius. (Neighbors found: 1)

kNN-grid Algorithm If the current number of neighbors found is less than the number requested, grow the search radius. (Neighbors found: 2)

kNN-grid Algorithm Don't add neighbors outside the maximum search radius. Don't grow the search radius when a neighbor is outside the maximum radius. (Neighbors found: 2)

kNN-grid Algorithm Add neighbors within the search radius. (Neighbors found: 3)

kNN-grid Algorithm Add neighbors within the search radius. (Neighbors found: 4)

kNN-grid Algorithm Don't expand the search radius if enough neighbors have already been found. (Neighbors found: 4)

kNN-grid Algorithm Add neighbors within the search radius. (Neighbors found: 5)

kNN-grid Algorithm Visit all other voxels accessible within the determined search radius. Add neighbors within the search radius. (Neighbors found: 6)

kNN-grid Summary Finds all neighbors within a sphere centered about the sample point. May locate more than the requested k nearest neighbors. (Neighbors found: 6, wanted 4)
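The rules walked through on the kNN-grid slides can be condensed into a short CPU sketch. It assumes the candidate points arrive already in voxel visiting order, so the voxel grid itself is not modeled; the name `knn_grid` and the tuple-based points are hypothetical. The sketch only ever adds neighbors, which is why it can return more than k.

```python
import math

def knn_grid(sample, candidates, k, max_radius):
    """Add-only neighbor gathering following the kNN-grid rules.

    candidates: points in the order their voxels would be visited
    (assumed precomputed). Returns (neighbors, final search radius).
    """
    radius = 0.0
    found = []
    for p in candidates:
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(sample, p)))
        if d > max_radius:
            continue                      # never add or grow beyond the max radius
        if len(found) < k:
            radius = max(radius, d)       # too few neighbors: grow the radius
            found.append(p)
        elif d <= radius:
            found.append(p)               # enough neighbors: add, but do not grow
    return found, radius

if __name__ == "__main__":
    pts = [(1, 0), (0, 2), (5, 5), (0, 3), (2, 0), (1, 1), (0, 2.5), (4, 0)]
    neighbors, r = knn_grid((0, 0), pts, k=4, max_radius=4.0)
    print(len(neighbors), r)
```

In this made-up data set, 4 neighbors are requested and 6 are returned, the same over-collection the summary slide describes.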

References
Naga Govindaraju, Brandon Lloyd, Wei Wang, Ming Lin, and Dinesh Manocha. Fast Computation of Database Operations using Graphics Processors.
Benjamin Bustos, Oliver Deussen, Stefan Hiller, and Daniel Keim. A Graphic Hardware Accelerated Algorithm for Nearest Neighbor Search.
Gernot Ziegler, Art Tevs, Christian Theobalt, and Hans-Peter Seidel. GPU Point List Generation through Histogram Pyramids.
Tim Purcell. Sorting and Searching (ppt).