Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.

Overview Implementation of certain SQL SELECT queries to run on an NVIDIA GPU with CUDA Built on top of the SQLite database Average query speedup of 35X Two goals: – Accelerate existing database operations – Give programmers easier access to GPU hardware 2

Outline SQL Review SQLite Implementation – Scope – Limitations Testing, Results Multicore Hardware Limitations Future Work Conclusions 3

SQL Review Widely-used database query language Declarative, thus easy to parallelize Data mining operations have been accelerated on GPUs SELECT name FROM students WHERE gpa > 3 AND age < 25 4

SQLite Most widely installed SQL database Simple design Easy to understand code Code placed in the public domain Performance comparable to other open-source databases 5

SQLite Architecture Runs as part of the client process SQL queries are parsed and transformed to a program in an intermediate language of opcodes – Process resembles compilation, output resembles assembly Intermediate language program, or query plan, executed by the SQLite virtual machine 6
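
As a side note for readers who want to see this compilation step concretely (not part of the original slides): SQLite exposes the generated opcode program through the EXPLAIN keyword. The sketch below uses the standard sqlite3 C API to print the program for the example query from the SQL Review slide; the students.db file and its schema are hypothetical.

/* Sketch: print the SQLite opcode program ("query plan") for a query.
   Assumes a hypothetical students.db containing the example table. */
#include <stdio.h>
#include <sqlite3.h>

int main(void) {
    sqlite3 *db;
    sqlite3_stmt *stmt;
    if (sqlite3_open("students.db", &db) != SQLITE_OK) return 1;

    /* EXPLAIN returns one row per virtual-machine instruction:
       addr, opcode, p1, p2, p3, p4, p5, comment */
    const char *sql =
        "EXPLAIN SELECT name FROM students WHERE gpa > 3 AND age < 25";
    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK) return 1;

    while (sqlite3_step(stmt) == SQLITE_ROW) {
        printf("%2d: %-12s %4d %4d %4d\n",
               sqlite3_column_int(stmt, 0),                /* addr   */
               (const char *)sqlite3_column_text(stmt, 1), /* opcode */
               sqlite3_column_int(stmt, 2),                /* p1     */
               sqlite3_column_int(stmt, 3),                /* p2     */
               sqlite3_column_int(stmt, 4));               /* p3     */
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}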

Implementation A re-implementation of the virtual machine as a CUDA kernel Uses SQLite query processor Many query plans execute opcodes over all rows in a table, inherently parallelizable Data selected from SQLite and transferred to GPU in initialization step Queries execute in a kernel launch Results transferred from the GPU 7
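
A minimal host-side sketch of that flow (illustrative only; the Opcode and Row types and the vm_kernel name are hypothetical, not the authors' code, and the interpreter body is sketched after the next slide):

// Host-side sketch: copy the table to the GPU once, run each query as a
// single kernel launch, and copy the result flags back.
#include <cuda_runtime.h>

struct Opcode { int op, p1, p2, p3; };       // one virtual-machine instruction
struct Row    { int id; float gpa, age; };   // hypothetical table layout

// Placeholder; a fuller per-thread interpreter is sketched after the next slide.
__global__ void vm_kernel(const Opcode *prog, int n_ops,
                          const Row *table, int n_rows, int *match) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) match[row] = 0;        // real work: interpret prog over row
}

void run_query(const Opcode *prog, int n_ops, const Row *table, int n_rows,
               int *match_host) {
    Opcode *d_prog; Row *d_table; int *d_match;

    // Initialization step: transfer the table data to GPU memory (done once,
    // then reused by every query over the same data).
    cudaMalloc(&d_table, n_rows * sizeof(Row));
    cudaMemcpy(d_table, table, n_rows * sizeof(Row), cudaMemcpyHostToDevice);

    // Per query: transfer the small opcode program and launch one kernel.
    cudaMalloc(&d_prog, n_ops * sizeof(Opcode));
    cudaMemcpy(d_prog, prog, n_ops * sizeof(Opcode), cudaMemcpyHostToDevice);
    cudaMalloc(&d_match, n_rows * sizeof(int));

    int threads = 256, blocks = (n_rows + threads - 1) / threads;
    vm_kernel<<<blocks, threads>>>(d_prog, n_ops, d_table, n_rows, d_match);

    // Results come back over PCIe; this transfer is part of the reported times.
    cudaMemcpy(match_host, d_match, n_rows * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_prog); cudaFree(d_table); cudaFree(d_match);
}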

Implementation Each thread assigned a table row Threads execute opcodes Thread divergence occurs as threads jump to different opcodes based on the data Reductions must be performed to coordinate output of results and aggregations of values 8
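
A sketch of what such a per-thread interpreter might look like, expanding the placeholder kernel from the previous sketch (types repeated so the snippet stands alone). The opcode set here is deliberately tiny and hypothetical; the real SQLite virtual machine has far more opcodes and operands. Each thread runs the whole program against its own row, and the data-dependent jumps in the comparison opcodes are the source of the thread divergence mentioned above:

// Per-thread opcode interpreter (hypothetical opcode set, illustrative only).
// Comparison opcodes jump to address p2 when the condition holds, so threads
// whose rows differ take different paths through the program (divergence).
enum { OP_INTEGER, OP_COLUMN, OP_LT, OP_GT, OP_RESULTROW, OP_HALT };

struct Opcode { int op, p1, p2, p3; };
struct Row    { int id; float gpa, age; };

__global__ void vm_kernel(const Opcode *prog, int n_ops,
                          const Row *table, int n_rows, int *match) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    float reg[8] = {0.0f};        // small per-thread register file
    match[row] = 0;
    int pc = 0;

    while (pc < n_ops) {
        Opcode o = prog[pc];
        switch (o.op) {
        case OP_INTEGER:          // reg[p2] = constant p1
            reg[o.p2] = (float)o.p1; pc++; break;
        case OP_COLUMN:           // reg[p3] = column p2 of this thread's row
            reg[o.p3] = (o.p2 == 0) ? table[row].gpa : table[row].age;
            pc++; break;
        case OP_LT:               // jump to p2 if reg[p3] < reg[p1]
            pc = (reg[o.p3] < reg[o.p1]) ? o.p2 : pc + 1; break;
        case OP_GT:               // jump to p2 if reg[p3] > reg[p1]
            pc = (reg[o.p3] > reg[o.p1]) ? o.p2 : pc + 1; break;
        case OP_RESULTROW:        // mark this row as part of the result set
            match[row] = 1; pc++; break;
        case OP_HALT:
            return;
        default:
            pc++; break;
        }
    }
}

The flag array sidesteps output coordination for the sake of a short sketch; the actual implementation performs the reductions mentioned on the slide to gather result rows and aggregate values across threads.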

Example Opcode Program [Figure: the slide shows the opcode program generated for the example query (SELECT name FROM students WHERE gpa > 3 AND age < 25) being executed against sample rows such as Alice and Bob; the listing is a short sequence of Integer, Column, Lt, Gt, ResultRow, and Next opcodes with their operands, repeated for each row, but the exact operands did not survive transcription] 9
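
Since the slide's exact listing did not survive transcription, here is a hypothetical encoding of the example query using the tiny opcode set from the interpreter sketch above; the addresses and operands are illustrative, not actual SQLite output.

// Hypothetical opcode program for the example query, in the encoding used by
// the interpreter sketch above (reuses its Opcode struct and OP_* constants).
// A row reaches ResultRow only if both predicates hold; otherwise it halts.
static const Opcode example_program[] = {
    { OP_INTEGER,   3, 1, 0 },   // 0: reg[1] = 3   (gpa threshold)
    { OP_INTEGER,  25, 2, 0 },   // 1: reg[2] = 25  (age threshold)
    { OP_COLUMN,    0, 0, 3 },   // 2: reg[3] = this row's gpa
    { OP_GT,        1, 5, 3 },   // 3: if gpa > 3 jump to 5, else fall through
    { OP_HALT,      0, 0, 0 },   // 4: predicate failed, row produces no output
    { OP_COLUMN,    0, 1, 4 },   // 5: reg[4] = this row's age
    { OP_LT,        2, 8, 4 },   // 6: if age < 25 jump to 8, else fall through
    { OP_HALT,      0, 0, 0 },   // 7: predicate failed
    { OP_RESULTROW, 0, 0, 0 },   // 8: both predicates hold, emit this row
    { OP_HALT,      0, 0, 0 }    // 9: done
};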

GPU Limitations GPU memory size – 4 GB on Tesla C1060 GPUs Data must be transferred to GPU Results must be transferred back CUDA makes certain optimizations more difficult 10

Scope Subset of possible SELECT queries Numerical data types – 32- and 64-bit integers, 32- and 64-bit IEEE 754 floating point Mid-size data sets Applications that run multiple queries over the same data – Ignore the data transfer time to the GPU 11

Testing 13 queries benchmarked on CPU and GPU Random data set of 5 million rows, varying distribution and data type Running times include the query execution time and the results transfer time Tesla C1060 GPU, CUDA 2.2 – 4 GB memory, 240 streaming processor cores, 102 GB/s bandwidth Intel Xeon X5550 host running Linux – CPU code compiled and optimized with the Intel C Compiler 12
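
For reference (not from the slides), one way to time "query execution plus results transfer" on the GPU side is with CUDA events bracketing the kernel launch and the device-to-host copy; a minimal, self-contained sketch, with the kernel launch commented out and a plain buffer standing in for the real result set:

// Timing sketch: measure the kernel plus the result transfer, which is what
// the reported GPU running times include (the initial data upload is not).
#include <cuda_runtime.h>
#include <cstdio>

int main(void) {
    const int n = 5 * 1000 * 1000;            // 5 million rows, as in the tests
    int *d_buf, *h_buf = new int[n];
    cudaMalloc(&d_buf, n * sizeof(int));
    cudaMemset(d_buf, 0, n * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // vm_kernel<<<blocks, threads>>>(...);   // the query kernel would run here
    cudaMemcpy(h_buf, d_buf, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("query + result transfer: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    delete[] h_buf;
    return 0;
}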

Running Time Comparison 13

Speedup Comparison 14

Results The average query took 2.27 seconds for CPU execution and 0.063 seconds for GPU execution – a 36X speedup Speedup varied by query, over a range of 20-70X The size of the result set significantly affected relative query speed – Synchronization among threads – Time to transfer results back from the GPU 15

Multicore SQLite does not take advantage of multiple cores; the results shown are for a single core The maximum possible speedup on an n-core machine is n times, and in practice less because of overhead The GPU is faster because of its number of SMs, its memory throughput, and its very efficient thread synchronization 16

Hardware Limitations No support for indirect jumps Dynamically indexed arrays are stored in local memory, which has global-memory latency Certain 32 and 64 bit math operations are emulated on current hardware These limitations are expected to be removed in Fermi, the next generation of NVIDIA hardware 17
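
An illustration of the local-memory point (not from the slides): when a per-thread array is indexed with a value that is not known at compile time, the compiler generally cannot keep it in registers and places it in "local" memory, which is thread-private but physically resides in device memory, so accesses pay global-memory latency. The per-thread register file in the interpreter sketch above is exactly this pattern; the kernel below is a hypothetical demo of the same behavior.

// Dynamically indexed per-thread arrays end up in local memory; compiling
// with `nvcc -Xptxas -v` reports the per-kernel local-memory usage.
__global__ void dynamic_index_demo(const int *idx, float *out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float regs[16];                       // per-thread scratch array
    for (int i = 0; i < 16; ++i)
        regs[i] = (float)(tid + i);

    // The index is data-dependent, so the compiler cannot resolve it at
    // compile time; regs[] is therefore kept in local memory, and each
    // access has global-memory latency on pre-Fermi hardware (uncached).
    out[tid] = regs[idx[tid] & 15];
}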

Future Work This is preliminary work; there are a number of topics available for future research: – More complete implementation of SQL, including JOINs, INSERTs, indices, etc. – Multi-GPU and distributed GPU implementations – Direct comparison to multi-core – Direct comparison to other databases – GPU-targeted SQL query processor 18

Conclusions SQL is a good programming paradigm for accessing GPU hardware Many database operations can be significantly accelerated on GPUs Next-generation GPU hardware will likely improve these results This is an open area for future research 19

Questions 20

SQL Queries Tested
1. SELECT id, uniformi, normali5 FROM test WHERE uniformi > 60 AND normali5 < 0
2. SELECT id, uniformf, normalf5 FROM test WHERE uniformf > 60 AND normalf5 < 0
3. SELECT id, uniformi, normali5 FROM test WHERE uniformi > -60 AND normali5 < 5
4. SELECT id, uniformf, normalf5 FROM test WHERE uniformf > -60 AND normalf5 < 5
5. SELECT id, normali5, normali20 FROM test WHERE (normali20 + 40) > (uniformi - 10)
6. SELECT id, normalf5, normalf20 FROM test WHERE (normalf20 + 40) > (uniformf - 10)
7. SELECT id, normali5, normali20 FROM test WHERE normali5 * normali20 BETWEEN -5 AND 5
8. SELECT id, normalf5, normalf20 FROM test WHERE normalf5 * normalf20 BETWEEN -5 AND 5
9. SELECT id, uniformi, normali5, normali20 FROM test WHERE NOT uniformi OR NOT normali5 OR NOT normali20
10. SELECT id, uniformf, normalf5, normalf20 FROM test WHERE NOT uniformf OR NOT normalf5 OR NOT normalf20
11. SELECT SUM(normalf20) FROM test
12. SELECT AVG(uniformi) FROM test WHERE uniformi > 0
13. SELECT MAX(normali5), MIN(normali5) FROM test
21

Rows Returned vs. Speedup 22