Presentation transcript:

Kernel Weaver: Automatically Fusing Database Primitives for Efficient GPU Computation

Haicheng Wu (1), Gregory Diamos (2), Srihari Cadambi (3), Sudhakar Yalamanchili (1)
(1) Georgia Institute of Technology   (2) NVIDIA Research   (3) NEC Laboratories America

School of Electrical and Computer Engineering, Georgia Institute of Technology
Sponsors: National Science Foundation, LogicBlox Inc., and NVIDIA

Data Warehousing Applications on GPUs

The Opportunity
- Significant potential data parallelism
- If the data fits in GPU memory, 2x–27x speedups have been shown [1]

The Challenge
- Need to process 1–50 TBs of data [2]
- 15–90% of the total time* is spent moving data between CPU and GPU
  * fine-grained computation

[1] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational Query Co-Processing on Graphics Processors. In TODS.
[2] Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey.

Relational Algebra (RA) Operators

RA operators are the building blocks of DB applications:
Set Intersection, Set Union, Set Difference, Cross Product, Join, Select, Project

Example: SELECT [Key == 3]

  Input                 Output
  Key  Value            Key  Value
  3    True, a          3    True, a
  3    False, b         3    False, b
  4    True, a

Relational Algebra (RA) Operators

RA operators are the building blocks of DB applications:
Set Intersection, Set Union, Set Difference, Cross Product, Join, Select, Project

Example: JOIN (A, B)

  A                B                JOIN (A, B)
  Key  Value       Key  Value       Key  Value
  3    a           3    c           3    a, c
  3    b           4    d           3    b, c
  4    a           5    e           4    a, d

New Key   = Key(A) ∩ Key(B)
New Value = Value(A) ∪ Value(B)

Data Movement in Kernel Execution

[Figure: kernel execution as ① Input, ② Execute, ③ Result, with an annotated bandwidth of ~250 GB/s inside the GPU.]

Kernel Fusion: A Data Movement Optimization

- Increases the granularity of kernel computation
- Reduces data movement throughout the memory hierarchy
- Inspired by loop fusion
- Automated at compile time; the input is an optimized query plan

Kernel Fusion

[Figure: before fusion, Kernel A reads inputs A1, A2, A3 from GPU memory and writes a temporary (Temp) back to GPU memory, which Kernel B then reads to produce the Result. After fusion, the fused Kernel A&B reads A1, A2, A3 once, keeps Temp on-chip in the GPU core, and writes only the Result to GPU memory.]

A sketch of this transformation in CUDA follows.
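To make the before/after concrete, here is a minimal CUDA sketch (not the Red Fox code): the kernel names and elementwise operations are illustrative, but the data-movement difference is the point: the unfused pair round-trips a temporary through global memory, while the fused kernel keeps it in a register.

```cuda
#include <cuda_runtime.h>

// Before fusion: Kernel A writes its intermediate to global memory and
// Kernel B reads it back: two launches plus a temporary buffer in GPU memory.
__global__ void kernelA(const int *in, int *temp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) temp[i] = in[i] * 2;          // illustrative operation A
}

__global__ void kernelB(const int *temp, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = temp[i] + 1;         // illustrative operation B
}

// After fusion: the intermediate lives in a register, so the temporary
// buffer and one round trip through global memory disappear.
__global__ void fusedAB(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int t = in[i] * 2;                   // former Kernel A, now a register
        out[i] = t + 1;                      // former Kernel B
    }
}
```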

Major Benefits

Reduced data footprint
- Fewer accesses to global memory
- Access to common data across kernels improves temporal locality
- Fewer PCIe transfers

Expanded optimization scope for the compiler
- Data re-use
- Larger textual scope for the optimizers

[Figure: same before/after fusion diagram as the previous slide.]

Red Fox Compilation Flow

[Figure: Datalog queries enter the LogicBlox front-end (language front-end), which emits a query plan; the translation layer (RA-to-PTX: nvcc + the RA primitives library, plus Kernel Weaver) lowers the plan to PTX/binary kernels; the back-end runtime executes them.]

- Kernel Weaver: CUDA source-to-source transformation that applies kernel fusion
- PTX: Parallel Thread Execution

RA Implementation: Multi-Stage Algorithms

- All primitives have the same three stages*
- Each stage normally maps to one CUDA kernel
- [Figure: example of SELECT implemented as the three stages.]

* G. Diamos, H. Wu, J. Wang, A. Lele, and S. Yalamanchili. Relational Algorithms for Multi-Bulk-Synchronous Processors. In PPoPP.

A simplified sketch of such a staged SELECT follows.
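The sketch below assumes one common way to stage SELECT on a GPU (a per-block count, a scan of the counts, and a gather); this is an assumed decomposition, not the library's actual code, and the predicate (key == 3), kernel names, and block size are illustrative. Unlike the real primitive, the simplified gather does not preserve input order within a block.

```cuda
#include <cuda_runtime.h>
#include <thrust/scan.h>
#include <thrust/execution_policy.h>

#define THREADS 256

// Stage 1: each CTA scans its partition of the input and counts the tuples
// that satisfy the predicate.
__global__ void selectCount(const int *keys, int n, int *blockCounts) {
    __shared__ int count;
    if (threadIdx.x == 0) count = 0;
    __syncthreads();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && keys[i] == 3) atomicAdd(&count, 1);
    __syncthreads();
    if (threadIdx.x == 0) blockCounts[blockIdx.x] = count;
}

// Stage 3: each CTA writes its matching tuples into the dense output,
// starting at the offset computed for it in stage 2.
__global__ void selectGather(const int *keys, int n,
                             const int *blockOffsets, int *out) {
    __shared__ int base;
    if (threadIdx.x == 0) base = blockOffsets[blockIdx.x];
    __syncthreads();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && keys[i] == 3) {
        int pos = atomicAdd(&base, 1);      // slot in the dense output
        out[pos] = keys[i];
    }
}

void runSelect(const int *d_keys, int n, int *d_out) {
    int blocks = (n + THREADS - 1) / THREADS;
    int *d_counts;
    cudaMalloc(&d_counts, blocks * sizeof(int));

    selectCount<<<blocks, THREADS>>>(d_keys, n, d_counts);           // stage 1
    thrust::exclusive_scan(thrust::device,                           // stage 2
                           d_counts, d_counts + blocks, d_counts);
    selectGather<<<blocks, THREADS>>>(d_keys, n, d_counts, d_out);   // stage 3
    // (Returning the output size, i.e. last offset + last count, is omitted.)

    cudaFree(d_counts);
}
```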

Kernel Fusion: Three Steps

1. Opportunity: find candidate kernels that meet the fusion criteria.
2. Feasibility: choose which kernels to fuse according to the available resources.
3. Fusion: weave and fuse the chosen kernels.

Kernel Fusion Criteria (1)

- Compatible kernel configurations (CTA and thread dimensions)
- Implementations of the RA primitives are parametric
- Configurations are chosen empirically after fusion

[Figure: Kernel A (configuration M1, N1, T1) and Kernel B (M2, N2, T2) fuse into Kernel A&B with a single common configuration (M, N, T).]

Kernel Fusion Criteria (2): Dependence Restriction

Thread dependence
- Input data have two attributes
- The operations of each thread are independent of other threads
- Fused kernels use registers to communicate

[Figure: values flow from Kernel A to Kernel B within each thread.]

Kernel Fusion Criteria (2): Dependence Restriction

- Thread dependence
- CTA (thread block) dependence

[Figure: dependence between Kernel A and Kernel B at the CTA level.]

Kernel Fusion Criteria: CTA Dependence

- Threads within the same CTA have dependences, but there is no dependence between CTAs, so the kernels can still be fused
- After fusion, shared memory is used to communicate and synchronization is needed
- Example: two back-to-back JOINs

A generic sketch of the shared-memory communication pattern follows.
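The following is a generic sketch of that pattern, not the paper's fused JOIN: the two formerly separate stages and their operations are invented for illustration. The point is that the intermediate lives in shared memory and a barrier replaces the kernel boundary.

```cuda
#define TILE 256

// Launch with TILE threads per block. Stage A writes its per-block
// intermediate to shared memory; the barrier makes it visible to the whole
// CTA; stage B then consumes it. The temporary never touches global memory.
__global__ void fusedStages(const int *in, int *out, int n) {
    __shared__ int temp[TILE];              // replaces the global temp buffer
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage A (formerly its own kernel): produce the block-local intermediate.
    temp[threadIdx.x] = (i < n) ? in[i] * 2 : 0;
    __syncthreads();                        // CTA-level dependence => barrier

    // Stage B (formerly its own kernel): read a neighbor's intermediate,
    // which is what makes the dependence CTA-level rather than thread-level.
    int j = (threadIdx.x + 1) % blockDim.x;
    if (i < n) out[i] = temp[threadIdx.x] + temp[j];
}
```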

Kernel Fusion Criteria (2): Dependence Restriction

- Thread dependence: can be fused
- CTA (thread block) dependence: can be fused
- Kernel dependence: cannot be fused

[Figure: a dependence between Kernel A and Kernel B that crosses CTA boundaries.]

Kernel Fusion Criteria: Candidates for Fusion

- Operators that exhibit only thread or CTA dependence
- Fusible regions are bounded by operators with kernel dependence

Choosing Operators to Fuse

Kernel fusion increases resource usage (e.g., registers), so a greedy heuristic over the dependence graph chooses what to fuse:
1. Topologically sort the dependence graph.
2. Incrementally add operators to the fused region.
3. Stop when the estimated resource usage exceeds the budget.

A sketch of this heuristic follows.
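A hedged host-side sketch of the greedy selection, not Red Fox's implementation: the operator costs, register budget, and dependence graph below are invented for the example, and a real pass would also enforce the dependence criteria above.

```cuda
// Host-side code (plain C++, compiles under nvcc).
#include <vector>
#include <queue>
#include <cstdio>

struct Op {
    int estRegs;                // estimated per-thread register usage
    std::vector<int> succs;     // dependence-graph successors
};

// 1. Topologically sort the dependence graph (Kahn's algorithm).
std::vector<int> topoSort(const std::vector<Op>& ops) {
    std::vector<int> indeg(ops.size(), 0), order;
    for (const Op& o : ops)
        for (int s : o.succs) indeg[s]++;
    std::queue<int> q;
    for (int i = 0; i < (int)ops.size(); ++i)
        if (indeg[i] == 0) q.push(i);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        order.push_back(u);
        for (int s : ops[u].succs)
            if (--indeg[s] == 0) q.push(s);
    }
    return order;
}

// 2./3. Greedily grow the fused region in topological order and stop as soon
// as the estimated register usage would exceed the per-thread budget.
std::vector<int> chooseFusedRegion(const std::vector<Op>& ops, int regBudget) {
    std::vector<int> region;
    int used = 0;
    for (int u : topoSort(ops)) {
        if (used + ops[u].estRegs > regBudget) break;
        used += ops[u].estRegs;
        region.push_back(u);
    }
    return region;
}

int main() {
    // Illustrative chain: SELECT(0) -> SELECT(1) -> JOIN(2) -> PROJECT(3)
    std::vector<Op> ops = { {8, {1}}, {8, {2}}, {24, {3}}, {6, {}} };
    for (int u : chooseFusedRegion(ops, 32)) printf("fuse operator %d\n", u);
    return 0;
}
```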

Kernel Weaving and Fusion

- Interweave and fuse the individual stages (CUDA kernels)
- Use registers or shared memory to store temporary results

Fusing Thread-Dependent-Only Operators

Example: fusing two SELECTs
- Unary operators only
- No synchronization required
- Register-based communication

A minimal sketch of the fused kernel follows.
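In this sketch both predicates are evaluated by the same thread on a register-resident value, so the first SELECT's result is never materialized in memory. The predicates (k > 10, k % 2 == 0) are illustrative, and the flag output stands in for the compaction that the real multi-stage primitive performs.

```cuda
__global__ void fusedSelectSelect(const int *keys, int n, int *flags) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int k = keys[i];                 // register-resident tuple
    bool pass = (k > 10);            // first SELECT
    pass = pass && (k % 2 == 0);     // second SELECT, fused into the same thread
    flags[i] = pass ? 1 : 0;         // placeholder output (no compaction here)
}
```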

Fusing CTA- and Thread-Dependent Operators

Example pattern
- Multiple inputs are partitioned
- Synchronization is necessary
- Communication via shared memory

[Figure: the fused stages Partition, Compute, and Gather.]

Experimental Environment

CPU:     2 quad-core Xeon, 2.27 GHz
Memory:  48 GB
GPU:     1 Tesla C2070 (6 GB GDDR5 memory)
OS:      Ubuntu Server
GCC:     4.4.3
NVCC:    4.0

- Micro-benchmarks derived from TPC-H: measure memory allocation, memory access demand, the effect of optimization scope, and PCIe traffic
- Full queries from TPC-H

TPC-H Benchmark Suite

- A popular decision-support benchmark suite
- The micro-benchmarks are common patterns drawn from TPC-H
- Baseline: directly using the primitive implementations, without fusion
- Optimized: fusing all primitives of each pattern

Fused vs. Not Fused: Small Inputs (PCIe excluded)

- Small inputs (64 MB–1 GB) that fit in GPU memory
- Average 2.89x speedup

Small Inputs: Analysis

[Charts: memory allocation, compiler optimization (speedup of O3), and memory access reduction.]

Fused vs. Not Fused: Large Inputs (PCIe included)

- Large inputs (1 GB–1.6 GB) that still fit in GPU memory
- Average 2.22x speedup overall and 2.35x speedup on the PCIe transfers

Resource Usage & Occupancy

[Table: PTX register count, shared memory (bytes), and occupancy (%) for the individual primitives (PROJECT, SELECT, JOIN/Multiply) and for the kernels after fusion (a)–(e).]

- Kernel fusion may increase resource usage and thus decrease occupancy
- These effects do not negate the other benefits of fusion

Real Queries (scale factor = 1)

- TPC-H Q1: 1.25x speedup
- TPC-H Q?: ?x speedup

Extensions

- Different domains: those requiring multi-stage algorithms; the dependence classification still applies
- Different representations: PTX, OpenCL, LLVM
- Different platforms: CPU, GPU/CPU hybrid

Conclusions

- Kernel fusion reduces data transfer and speeds up computation for data warehousing applications
- Defines the basic dependences and general criteria for kernel fusion, applicable across multiple application domains
- Quantifies the impact of kernel fusion on different levels of the CPU-GPU memory hierarchy for a range of RA operators
- Proposes and demonstrates the utility of compile-time data-movement optimization based on kernel fusion

Thank You. Questions?