SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Presentation transcript:

Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission
Haicheng Wu*, Gregory Diamos#, Jin Wang*, Srihari Cadambi^, Sudhakar Yalamanchili*, Srimat Chakradhar^
*Georgia Institute of Technology   #NVIDIA Research   ^NEC Laboratories America
Sponsors: National Science Foundation, LogicBlox Inc., IBM, and NVIDIA

The General Purpose GPU
A multi-core CPU (2-10 cores, main memory ~128 GB) connects to the GPU (~1500 cores, GPU memory ~6 GB) over PCI-E. Execution proceeds as: (1) copy input data to the GPU, (2) launch the kernel, (3) execute on the GPU, (4) copy the result back.
- The GPU is a many-core co-processor: tens to hundreds of cores running thousands to tens of thousands of concurrent threads.
- CUDA and OpenCL are the dominant programming models.
- Well suited to data-parallel applications: molecular dynamics, options pricing, ray tracing, etc.
- Commodity hardware, led by NVIDIA, AMD, and Intel.

Enterprise: Amazon EC2 GPU Instance
Element   Characteristics
OS        CentOS 5.5
CPU       2x Intel Xeon X5570 (quad-core "Nehalem", 2.93 GHz)
GPU       2x NVIDIA Tesla "Fermi" M2050, with the NVIDIA GPU driver and CUDA toolkit 3.1
Memory    22 GB
Storage   1690 GB
I/O       10 GigE
Price     $2.10/hour

Data Warehousing Applications on GPUs
The good:
- Lots of potential data parallelism.
- If the data fits in GPU memory, 2x-27x speedups have been shown.*
The bad:
- Very large data sets that will not fit even in host memory.
- I/O bound, and the GPU has no disk.
- PCI data transfer takes 15-90% of the total time.*
Example table:
Order  Price  Discount
0      10     10%
1      20     20%
2      10     15%
3      51     14%
4      33     13%
5      22     10%
* B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. TODS, 2009.

This Work
Goal: demonstrate the benefits of kernel fusion/kernel fission in enabling large data warehousing applications on GPUs.
Assumptions:
- An in-memory system: data resides in host memory, not GPU memory.
- Not OLTP (online transaction processing)-style simple queries: the focus is data analysis rather than data entry/retrieval.

Two Optimizations for Data Movement
The problem is the PCI-E link (~16 GB/s) between host memory (~128 GB, attached to the 2-10 core CPU) and GPU memory (~6 GB, attached to the ~1500 GPU cores).
Our solutions:
- Kernel fusion: aggregate computation to reuse data.
- Kernel fission: overlap computation with PCI transfer.

Relational Algebra (RA) Operators
RA operators are the building blocks of database applications.
UNION: x = {(3,a), (4,a), (2,b)}, y = {(0,a), (2,b)}; union x y -> {(3,a), (4,a), (2,b), (0,a)}
INTERSECTION: x = {(3,a), (4,a), (2,b)}, y = {(0,a), (2,b)}; intersection x y -> {(2,b)}
PRODUCT: x = {(3,a), (4,a)}, y = {(True,2)}; product x y -> {(3,a,True,2), (4,a,True,2)}
DIFFERENCE: x = {(3,a), (4,a), (2,b)}, y = {(4,a), (3,a)}; difference x y -> {(2,b)}
JOIN: x = {(2,b), (3,a), (4,a)}, y = {(2,f), (3,c)}; join x y -> {(3,a,c), (2,b,f)}
PROJECTION: x = {(3,True,a), (4,True,a), (2,False,b)}; project [0,2] x -> {(3,a), (4,a), (2,b)}
SELECT: x = {(3,True,a), (4,True,a), (2,False,b)}; select [field.0==2] x -> {(2,False,b)}
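The set-based operators can be sketched with standard C++ containers (a minimal single-threaded illustration of their semantics only; on the GPU these run as data-parallel kernels, and the helper names here are invented for the example):

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <utility>

// Tuples modeled as (int, char) pairs; std::set keeps them sorted, which
// std::set_intersection / std::set_difference require.
using Tuple = std::pair<int, char>;
using Relation = std::set<Tuple>;

Relation relUnion(const Relation& x, const Relation& y) {
    Relation r = x;
    r.insert(y.begin(), y.end());   // duplicates collapse automatically
    return r;
}

Relation relIntersection(const Relation& x, const Relation& y) {
    Relation r;
    std::set_intersection(x.begin(), x.end(), y.begin(), y.end(),
                          std::inserter(r, r.end()));
    return r;
}

Relation relDifference(const Relation& x, const Relation& y) {
    Relation r;
    std::set_difference(x.begin(), x.end(), y.begin(), y.end(),
                        std::inserter(r, r.end()));
    return r;
}
```

Running these on the slide's inputs reproduces the slide's outputs (sets are unordered, so the element order shown on the slide is immaterial).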

Common RA Combinations of TPC-H

Experimental Environment
A sequence of SELECTs is used to demonstrate the benefits of kernel fusion/fission.
CPU     2 quad-core Xeons, 2.27 GHz
Memory  48 GB
GPU     1 Tesla C2070 (6 GB GDDR5 memory)
OS      Ubuntu Server
GCC     4.4.3
NVCC    4.0

PCI Bandwidth vs. GPU Computation Capacity
PCI bandwidth < GPU computation capacity for a single SELECT: the GPU can filter data faster than PCI-E can deliver it, so one SELECT on its own is transfer-bound.
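As a back-of-envelope check with illustrative numbers (the ~16 GB/s PCI-E figure comes from the earlier slide; the 100 GB/s SELECT scan rate is an assumption made up for this example):

```cpp
#include <cassert>

// Time (s) to stream `gb` gigabytes at `gbPerSec` throughput.
double seconds(double gb, double gbPerSec) { return gb / gbPerSec; }

// Fraction of a non-overlapped SELECT spent on the PCI copy.
double pciFraction(double gb, double pciGbps, double scanGbps) {
    double tPci  = seconds(gb, pciGbps);    // host -> device copy
    double tScan = seconds(gb, scanGbps);   // on-GPU filtering
    return tPci / (tPci + tScan);
}
```

For a 4 GB column at 16 GB/s over PCI-E and an assumed 100 GB/s scan rate, the copy takes 0.25 s against 0.04 s of compute, i.e. over 85% of the total: exactly the imbalance that fusion and fission attack.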

Kernel Fusion (+ and - example kernels)
(diagram: unfused, Kernel A consumes inputs A1 and A2, and its output, together with A3, feeds Kernel B, which produces the result; fused, a single Kernel A,B consumes A1, A2, and A3 directly and produces the result in one step)

Benefits of Kernel Fusion: Reduce Data Footprint (1)
- Spatial locality: the fused kernel traverses the input data (A1, A2, A3) only once.
- Temporal locality: the intermediate temporary stays on the GPU between the fused stages instead of being written to memory and read back.

Benefits of Kernel Fusion: Reduce Data Footprint (2)
- Reduced data transfer: unfused, each kernel's inputs and results (input1 -> result1, input2 -> result2) cross between CPU and GPU memory; fused, only A1, A2, and A3 are copied in and only the final result is copied out.
- Memory efficiency: the fused kernel needs no GPU-memory buffer for the temporary (Temp) produced by Kernel A and consumed by Kernel B.

Benefits of Kernel Fusion: Enlarge Optimization Scope
- Eliminates stages common to the fused kernels.
- The larger fused kernel body gives the compiler more scope for optimization: (a) instruction scheduling, (b) register assignment, (c) constant propagation, and more.

Examples of Kernel Fusion
(code comparison: the original kernel for one SELECT vs. a fused kernel performing two SELECTs)
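The slide's CUDA code is not captured in the transcript, but the transformation can be sketched on the CPU (a hedged illustration with invented helper names, not the deck's kernels): two SELECT predicates applied in two passes with a materialized temporary, versus one fused pass with no temporary at all.

```cpp
#include <cassert>
#include <vector>

// Unfused: two passes over the data, materializing a temporary between them
// (on a GPU, `temp` would be written to and re-read from device memory).
std::vector<int> selectTwice(const std::vector<int>& in, int lo, int hi) {
    std::vector<int> temp;
    for (int v : in) if (v > lo) temp.push_back(v);    // SELECT #1
    std::vector<int> out;
    for (int v : temp) if (v < hi) out.push_back(v);   // SELECT #2
    return out;
}

// Fused: one pass, both predicates evaluated per element, no temporary.
std::vector<int> selectFused(const std::vector<int>& in, int lo, int hi) {
    std::vector<int> out;
    for (int v : in) if (v > lo && v < hi) out.push_back(v);
    return out;
}
```

Both versions produce identical results; the fused one touches each input element once and allocates nothing in between, which is the footprint reduction the previous slides describe.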

Kernel Fusion: Overall Performance
(chart: a 1.80x speedup; results are shown both including and excluding PCI transfer time, with some PCI-E noise in the measurements)

Kernel Fusion: Execution Time Breakdown
(chart: with fusion, the intermediate stage is no longer needed, and the filter and gather steps run faster)

Kernel Fusion: Sensitivity
- Fusing more kernels gives larger gains.
- A lower selection rate (fewer rows selected) gives larger gains.

Kernel Fission: CUDA Streams
- Commands (kernels or memcpys) issued to different CUDA streams can run in parallel.
- Commands in the same CUDA stream run sequentially.

Kernel Fission: Stream Pool
The Stream Pool is a library that abstracts away the details of CUDA streams.
API                    Comment
getAvailableStream()   Get an available stream
setStreamCommand()     Assign a command to a specific stream
startStreams()         Start execution
selectWait()           Set up point-to-point synchronization between two specific streams
terminate()            End execution immediately
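A CPU-only sketch of how such a pool might look, with std::thread standing in for CUDA streams (the method names follow the slide's table, but the implementation is invented for illustration; the real library wraps cudaStream_t and enqueues kernels and async memcpys, and selectWait()/terminate() are omitted here for brevity):

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

class StreamPool {
public:
    explicit StreamPool(int n) : queues_(n) {}

    // Hand out stream ids round-robin (called before execution starts).
    int getAvailableStream() {
        return next_++ % static_cast<int>(queues_.size());
    }

    // Queue a command on a specific stream: commands on the same stream
    // run sequentially, commands on different streams run in parallel.
    void setStreamCommand(int stream, std::function<void()> cmd) {
        queues_[stream].push_back(std::move(cmd));
    }

    // Start execution: one worker per stream drains its queue in order,
    // then wait for all streams to finish.
    void startStreams() {
        std::vector<std::thread> workers;
        for (auto& q : queues_)
            workers.emplace_back([&q] { for (auto& cmd : q) cmd(); });
        for (auto& w : workers) w.join();
    }

private:
    std::vector<std::vector<std::function<void()>>> queues_;
    int next_ = 0;
};
```

Two commands queued on one stream execute in submission order, while a second stream's queue drains concurrently on its own worker, mirroring the CUDA stream semantics of the previous slide.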

Kernel Fission: Different Ways to Use CUDA Streams
(chart: concurrently running two kernels is not always beneficial; in this experiment the "small" kernel configuration uses half the resources of the "big" one)

Example of Kernel Fission
(timeline: overlapping PCI transfers with kernel execution yields a 1.37x speedup)
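The double-buffering idea behind fission can be sketched in plain C++ (an illustrative stand-in: transfer() plays the PCI-E copy and compute() the kernel, and the names are invented for this example; real code would issue cudaMemcpyAsync and kernel launches on alternating CUDA streams): while chunk i is being computed, chunk i+1 is already in flight.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <vector>

using Chunk = std::vector<int>;

Chunk transfer(const Chunk& host) { return host; }   // stand-in for a PCI copy
long long compute(const Chunk& dev) {                // stand-in for a GPU kernel
    return std::accumulate(dev.begin(), dev.end(), 0LL);
}

// Double-buffered pipeline: kick off the copy of chunk i+1 asynchronously,
// compute on chunk i in the meantime, then swap buffers.
long long pipelined(const std::vector<Chunk>& chunks) {
    long long total = 0;
    if (chunks.empty()) return total;
    Chunk current = transfer(chunks[0]);             // prime the first buffer
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        std::future<Chunk> next;
        if (i + 1 < chunks.size())
            next = std::async(std::launch::async, transfer,
                              std::cref(chunks[i + 1]));
        total += compute(current);                   // overlaps the async copy
        if (next.valid()) current = next.get();
    }
    return total;
}
```

With N chunks, the serial version pays N copies plus N computes back to back; the pipelined version hides all but the first copy behind computation, which is where the slide's 1.37x comes from.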

Kernel Fusion + Kernel Fission
(chart: the combined approach achieves 1.41x vs. the serial version, 1.31x vs. fusion only, and 1.10x vs. fission only)

Real Queries: Q1
(query plan shown; a 1.26x overall speedup)

Real Queries: Q21
(query plan shown; a 1.13x overall speedup)

Conclusions
- The two data-movement optimizations, kernel fusion and kernel fission, reduce memory transfer time and speed up computation for data warehousing applications.
- Kernel fusion avoids dumping intermediate temporary data and enlarges the compiler's optimization scope.
- Kernel fission works like double buffering, overlapping data transfer with GPU computation.

Thank You. Questions?