SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY Sponsors: National Science Foundation, LogicBlox Inc., and NVIDIA Kernel.

Presentation transcript:

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY
Sponsors: National Science Foundation, LogicBlox Inc., and NVIDIA
Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation
Haicheng Wu (1), Gregory Diamos (2), Srihari Cadambi (3), Sudhakar Yalamanchili (1)
(1) Georgia Institute of Technology, (2) NVIDIA Research, (3) NEC Laboratories America

Data Warehousing Applications on GPUs
The Opportunity
  Significant potential data parallelism
  If data fits in GPU memory, 2x to 27x speedup has been shown [1]
The Challenge
  Need to process 1-50 TBs of data [2]
  15-90% of the total time is spent moving data between CPU and GPU
  Fine-grained computation
[1] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. In TODS.
[2] Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey.

Relational Algebra (RA) Operators: Select
RA operators are the building blocks of DB applications: Set Intersection, Set Union, Set Difference, Cross Product, Join, Select, Project.
Example: Select [Key == 3]
  Input (Key, Value): (3, (True, a)), (3, (False, b)), (4, (True, a))
  Select keeps only the tuples whose Key equals 3: (3, (True, a)), (3, (False, b))
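The Select example can be sketched in a few lines of Python. This is only a sequential illustration of the operator's semantics, not the GPU implementation from the talk; the function name `select` and the tuple layout `(key, value)` are assumptions made for the example.

```python
def select(relation, predicate):
    """Return the tuples of `relation` that satisfy `predicate`, in input order."""
    return [row for row in relation if predicate(row)]


# Relation from the slide, stored as (key, value) tuples (layout is an assumption).
relation = [(3, (True, "a")), (3, (False, "b")), (4, (True, "a"))]

# Select [Key == 3]
selected = select(relation, lambda row: row[0] == 3)
print(selected)
```

On a GPU each thread would evaluate the predicate on its own slice of the input, which is why the later slides treat SELECT as having only thread dependence.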

Relational Algebra (RA) Operators: Join
RA operators are the building blocks of DB applications: Set Intersection, Set Union, Set Difference, Cross Product, Join, Select, Project.
Example: Join
  A (Key, Value): (3, a), (3, b), (4, a)
  B (Key, Value): (3, c), (4, d), (5, e)
  JOIN (A, B) (Key, Value): (3, (a, c)), (3, (b, c)), (4, (a, d))
New Key = Key(A) ∩ Key(B)
New Value = Value(A) ∪ Value(B)
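A minimal sketch of the Join example, written as a sequential hash join in Python. The hash-based strategy and the name `join` are assumptions chosen for brevity; the talk's GPU primitive is a multi-stage algorithm and is not reproduced here.

```python
from collections import defaultdict


def join(a, b):
    """Join two relations of (key, value) tuples on equal keys.

    Each output tuple is (key, value_from_a, value_from_b).
    """
    index = defaultdict(list)
    for key, value in b:
        index[key].append(value)
    # Keys missing from `b` contribute nothing, so only shared keys survive.
    return [(key, va, vb) for key, va in a for vb in index.get(key, [])]


A = [(3, "a"), (3, "b"), (4, "a")]
B = [(3, "c"), (4, "d"), (5, "e")]
joined = join(A, B)
print(joined)
```

Key 5 appears only in B and therefore drops out, matching New Key = Key(A) ∩ Key(B).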

Data Movement in Kernel Execution
[Figure: kernel execution in three steps, (1) Input, (2) Execute, (3) Result, with a bandwidth label of ~250 GB/s and blocks labeled M, N, and T.]

Memory Hierarchy Bottleneck

Kernel Fusion
[Figure: Before fusion, Kernel A and Kernel B run separately, and the intermediate Temp is held in GPU memory together with A1, A2, A3, and Result. After fusion, a single fused kernel A&B consumes A1, A2, and A3 and produces Result, while Temp stays on the GPU core.]
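The effect of fusion on memory traffic can be illustrated with a small counting model. The sketch below assumes Kernel A combines A1 and A2 into Temp and Kernel B combines Temp with A3 into Result; the arithmetic (`+` and `*`) and the per-element read/write counters are hypothetical stand-ins, since the slide does not say what the kernels compute.

```python
def run_unfused(a1, a2, a3):
    """Two separate 'kernels'; Temp makes a round trip through memory."""
    traffic = {"reads": 0, "writes": 0}
    temp = []
    for x, y in zip(a1, a2):          # Kernel A: reads A1, A2 and writes Temp
        traffic["reads"] += 2
        temp.append(x + y)
        traffic["writes"] += 1
    result = []
    for t, z in zip(temp, a3):        # Kernel B: reads Temp, A3 and writes Result
        traffic["reads"] += 2
        result.append(t * z)
        traffic["writes"] += 1
    return result, traffic


def run_fused(a1, a2, a3):
    """One fused 'kernel'; Temp lives in a local variable (a register stand-in)."""
    traffic = {"reads": 0, "writes": 0}
    result = []
    for x, y, z in zip(a1, a2, a3):
        traffic["reads"] += 3
        t = x + y                     # never written to memory
        result.append(t * z)
        traffic["writes"] += 1
    return result, traffic


a1, a2, a3 = [1, 2, 3, 4], [10, 20, 30, 40], [2, 2, 2, 2]
unfused_result, unfused_traffic = run_unfused(a1, a2, a3)
fused_result, fused_traffic = run_fused(a1, a2, a3)
print(unfused_traffic, fused_traffic)
```

Both versions compute the same Result, but the fused one avoids one write and one read of Temp per element, and Temp no longer needs a buffer in memory.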

Major Benefits
Reduce data footprint
  Reduction in accesses to global memory
  Access to common data across kernels improves temporal locality
  Reduction in PCIe transfers
Expand optimization scope of the compiler
  Data re-use
  Increase textual scope of optimizers
[Figure: Kernel A and Kernel B with intermediate Temp, next to the fused kernel A, B producing Result from A1, A2, and A3.]

Outline
Introduction
Current RA implementation
Kernel Fusion
  Kernel fusion criteria
  Greedy heuristic to choose candidates
  Weaving and fusion
Experiment Results
Conclusion

Red Fox Compilation Flow
[Figure: Datalog queries enter the language front-end (LogicBlox front-end), which produces a query plan; the translation layer (RA-to-PTX, using nvcc and the RA primitives library, together with Kernel Weaver) produces PTX/binary kernels; the back-end is the runtime.]
Kernel Weaver: CUDA source-to-source transformation to apply kernel fusion
PTX: Parallel Thread Execution

RA Implementation: Multi-Stage Algorithms
All primitives have the same three stages*
Each stage normally maps to 1 CUDA kernel
[Figure: Example of SELECT.]
* G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili. Relational Algorithms for Multi-Bulk-Synchronous Processors. In PPoPP.

Kernel Fusion: Three Steps
1. Opportunity: Find candidates meeting fusion criteria.
2. Feasibility: Choose kernels to fuse according to available resources.
3. Fusion: Kernel fusion.

Kernel Fusion Criteria (1)
Compatible kernel configurations (CTA and thread dimensions)
  Implementations of RA primitives are parametric
  Empirically choose configurations after fusion
[Figure: Kernel A (M1, N1, T1) and Kernel B (M2, N2, T2) become fused kernel A & B (M, N, T).]

Kernel Fusion Criteria (2)
Dependence restriction
  Thread dependence
    Input data have 2 attributes
    Operations of each thread are independent
    Use registers to communicate
[Figure: Kernel A feeding Kernel B.]

Kernel Fusion Criteria (2)
Dependence restriction
  Thread dependence
  CTA (thread block) dependence
[Figure: Kernel A feeding Kernel B.]

Kernel Fusion Criteria: CTA Dependence
Threads in the same CTA have dependence
No dependence between CTAs
Can be fused
After fusion
  Use shared memory to communicate
  Synchronization is needed
[Figure: Example of 2 back-to-back JOINs.]

Kernel Fusion Criteria (2)
Dependence restriction
  Thread dependence
  CTA (thread block) dependence
  Kernel dependence
Can be fused
[Figure: Kernel A feeding Kernel B.]

Kernel Fusion Criteria: Candidates for Fusion
Only exhibit thread or CTA dependence
Bounded by operators with kernel dependence
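One way to read this criterion is as a segmentation of the query plan. The sketch below makes several simplifying assumptions that are not in the slides: the plan is a linear chain rather than a general graph, each operator carries a hypothetical dependence label ("thread", "cta", or "kernel"), operators with kernel dependence are left out of every group, and SORT is used only as a placeholder for an operator with kernel dependence.

```python
def fusion_candidates(plan):
    """Split a linear plan into groups of operators that may be fused.

    `plan` is a list of (operator_name, dependence) pairs. Operators labeled
    "kernel" act as boundaries; groups of fewer than two operators are dropped
    because there is nothing to fuse.
    """
    groups, current = [], []
    for name, dependence in plan:
        if dependence == "kernel":
            if len(current) > 1:
                groups.append(current)
            current = []
        else:
            current.append(name)
    if len(current) > 1:
        groups.append(current)
    return groups


plan = [
    ("SELECT", "thread"),
    ("JOIN", "cta"),
    ("SORT", "kernel"),      # placeholder boundary operator (assumption)
    ("PROJECT", "thread"),
    ("SELECT", "thread"),
    ("JOIN", "cta"),
]
candidates = fusion_candidates(plan)
print(candidates)
```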

Choosing Operators to Fuse
Kernel fusion will increase resource usage, e.g., registers
Greedy heuristic to choose operators from the dependence graph:
1. Topologically sort the dependence graph
2. Incrementally add operators
3. Stop when the estimated usage is larger than the budget
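The three steps of the greedy heuristic can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: resource usage is modeled as a single additive per-operator cost (for example, an estimated register count), and the names `choose_operators`, `deps`, `cost`, and `budget` are hypothetical. `graphlib` is part of the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter


def choose_operators(deps, cost, budget):
    """Greedily pick operators to fuse without exceeding a resource budget.

    deps:   maps each operator to the set of operators it depends on
    cost:   estimated resource usage per operator (assumed additive)
    budget: maximum total usage allowed for the fused kernel
    """
    order = list(TopologicalSorter(deps).static_order())   # step 1
    chosen, used = [], 0
    for op in order:                                       # step 2
        if used + cost[op] > budget:                       # step 3
            break
        chosen.append(op)
        used += cost[op]
    return chosen, used


deps = {
    "select1": set(),
    "join": {"select1"},
    "select2": {"join"},
    "project": {"select2"},
}
cost = {"select1": 10, "join": 30, "select2": 10, "project": 8}
chosen, used = choose_operators(deps, cost, budget=55)
print(chosen, used)
```

Stopping at the first operator that would exceed the budget keeps the chosen set contiguous in dependence order, so the fused kernel never skips over an operator it depends on.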

Kernel Weaving and Fusion
Interweaving and fusing individual stages (CUDA kernels)
Use registers or shared memory to store temporary results

Fusing Thread Dependent Only Operators
Example of fusing 2 SELECTs
  Unary operators only
  No synchronization required
  Register-based communication
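Fusing two SELECTs can be pictured as evaluating both predicates for an element in a single pass, with the first result held in a local variable rather than written out as an intermediate relation. The Python below is a sequential sketch of that idea; a local variable stands in for a GPU register, and the function names and the example predicates are assumptions.

```python
def select(rows, predicate):
    """Unfused SELECT: materializes its full output."""
    return [row for row in rows if predicate(row)]


def fused_select(rows, first, second):
    """Two SELECTs in one pass; no intermediate relation is built."""
    out = []
    for row in rows:
        keep = first(row)            # stays in a local (register stand-in)
        if keep and second(row):
            out.append(row)
    return out


rows = [(1, 5), (3, 7), (3, 2), (4, 9)]


def key_is_3(row):
    return row[0] == 3


def value_gt_4(row):
    return row[1] > 4


unfused = select(select(rows, key_is_3), value_gt_4)
fused = fused_select(rows, key_is_3, value_gt_4)
print(fused)
```

Because each row is handled independently, no step needs to wait on another row, which mirrors the slide's point that no synchronization is required.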

Fusing CTA and Thread Dependent Operators
Partition multiple inputs
Synchronization necessary
Communication via shared memory
[Figure: Example pattern with stages labeled Partition, Compute, and Gather.]

Experimental Environment
  CPU:    2 quad-core Xeon 2.27GHz
  Memory: 48 GB
  GPU:    1 Tesla C2070 (6GB GDDR5 memory)
  OS:     Ubuntu Server
  GCC:    4.4.3
  NVCC:   4.0
Use micro-benchmarks derived from TPC-H
  Measure memory allocation, memory access demand, effect of optimization scope, and PCIe traffic
Full queries from TPC-H

TPC-H Benchmark Suites
A popular decision making benchmark suite
Micro-benchmarks are common patterns from TPC-H
Baseline: directly using primitive implementation without fusion
Optimized: fusing all primitives of each pattern

Fused vs. Not Fused: Small Inputs, PCIe Excluded
Average 2.89x speedup
Small inputs (64MB-1GB) fitting the GPU memory

Small Inputs: Analysis
[Figure panels: Memory Allocation; Memory Access Reduction; Compiler Optimization (Speedup of O3).]

Conclusions
Kernel fusion can reduce data transfer and speed up computation for data warehousing applications.
The work defines basic dependences and general criteria for kernel fusion that are applicable across multiple application domains.
It quantifies the impact of kernel fusion on different levels of the CPU-GPU memory hierarchy for a range of RA operators.
It proposes and demonstrates the utility of compile-time data movement optimizations based on kernel fusion.

Thank You
Questions?

Backup Slides

Kernel Fusion: A Data Movement Optimization
Increase the granularity of kernel computation
Reduce data movement throughout the hierarchy
Inspired by loop fusion
Compile-time automation
  Input is an optimized query plan

Large Inputs: PCIe Included
Average 2.22x speedup overall and 2.35x speedup in PCIe
Large inputs (1GB-1.6GB) fitting the GPU memory

Resource Usage & Occupancy
[Tables: PTX register count, shared memory (bytes), and occupancy (%) for individual primitives (PROJECT, SELECT, JOIN / Multiply) and, after kernel fusion, for patterns (a) through (e).]
Kernel fusion may increase resource usage and thus decrease occupancy
These two do not negate the other benefits

Real Queries (scale factor = 1)
TPC-H Q1: 1.25x speedup
TPC-H Q x speedup

Extensions
Different domains
  Require multi-stage algorithm
  Dependence classification still applies
Different representation
  PTX, OpenCL, LLVM
Different platform
  CPU, GPU/CPU hybrid