University of Michigan Electrical Engineering and Computer Science Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke Sponge: Portable Stream Programming on Graphics Engines

University of Michigan Electrical Engineering and Computer Science 2 Why GPUs?
–Every mobile and desktop system will have one
–Affordable and high performance
–Over-provisioned
–Programmable
[figure: Sony PlayStation Phone]

University of Michigan Electrical Engineering and Computer Science 3 GPU Architecture
[figure: the CPU launches Kernel 1 and then Kernel 2 onto the GPU over time; the GPU has 30 streaming multiprocessors (SM 0 … SM 29), each with registers and shared memory, connected through an interconnection network to global memory (device memory)]

University of Michigan Electrical Engineering and Computer Science 4 GPU Programming Model
–Threads → Blocks → Grid
–All the threads run one kernel
–Registers are private to each thread; registers spill to local memory
–Shared memory is shared between the threads of a block
–Global memory is shared between all blocks
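A minimal CUDA sketch of this hierarchy (the kernel, names, and sizes here are illustrative, not from the paper):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: every thread of every block runs this same code.
// 'v' lives in a per-thread register, 'tile' is per-block shared memory,
// and 'in'/'out' point into global memory visible to all blocks.
__global__ void scale(float *out, const float *in, float k, int n) {
    __shared__ float tile[256];                      // shared by one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) {
        float v = in[i] * k;        // held in a private register
        tile[threadIdx.x] = v;      // visible to the rest of the block
        __syncthreads();
        out[i] = tile[threadIdx.x]; // written back to global memory
    }
}

int main() {
    const int n = 1024;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d_out, d_in, 2.0f, n); // grid of blocks of threads
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```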

University of Michigan Electrical Engineering and Computer Science 5 GPU Execution Model
[figure: the blocks of Grid 1 are distributed across the streaming multiprocessors (SM 0, SM 1, SM 2, SM 3, …), each with its own shared memory and registers]

University of Michigan Electrical Engineering and Computer Science 6 GPU Execution Model
[figure: Blocks 0–3 assigned to SM 0; within a block, threads indexed by ThreadId are grouped into Warp 0 and Warp 1 and share the SM's shared memory and registers]

University of Michigan Electrical Engineering and Computer Science 7 GPU Programming Challenges
[figure: performance of code optimized for a GeForce GTX 285 vs. for a GeForce 8400 GS]
–Restructuring data efficiently for the complex memory hierarchy: global memory, shared memory, registers
–Partitioning work between the CPU and the GPU
–Lack of portability between different generations of GPUs: registers, active warps, size of global memory, size of shared memory
–Targets will vary even more: newer high-performance cards (e.g., NVIDIA's Fermi) and mobile GPUs with fewer resources

University of Michigan Electrical Engineering and Computer Science 8 Nonlinear Optimization Space [Ryoo, CGO ’08]
[figure: SAD optimization space, 908 configurations]
We need a higher level of abstraction!

University of Michigan Electrical Engineering and Computer Science 9 Goals
–Write-once parallel software
–Free the programmer from low-level details
[figure: a single parallel specification maps to shared-memory processors (C + Pthreads), SIMD engines (C + intrinsics), FPGAs (Verilog/VHDL), and GPUs (CUDA/OpenCL)]

University of Michigan Electrical Engineering and Computer Science 10 Streaming
–Higher level of abstraction
–Decouples computation and memory accesses
–Coarse-grain exposed parallelism, exposed communication
–Programmers can focus on the algorithms instead of low-level details
–Streaming actors use buffers to communicate
–Much recent work extends the portability of streaming applications

University of Michigan Electrical Engineering and Computer Science 11 Sponge
–Generates optimized CUDA for a wide variety of GPU targets
–Performs an array of optimizations on stream graphs
–Optimizes and ports across different GPU generations
–Utilizes the memory hierarchy (registers, shared memory, coalescing)
–Efficiently utilizes the streaming cores
[figure: compilation flow: reorganization and classification, memory layout, graph restructuring, register optimization; the optimizations include shared/global memory selection, helper threads, bank-conflict resolution, loop unrolling, and software prefetching]

University of Michigan Electrical Engineering and Computer Science 12 GPU Performance Model
M = memory instructions, C = computation instructions
–Memory-bound kernels: total time ≈ memory time
–Computation-bound kernels: total time ≈ computation time
[figure: interleaved timelines of memory (M0–M7) and computation (C0–C7) instructions for the two cases]
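One way to read this model (our notation, not the paper's exact formulation): because the hardware overlaps memory and compute instructions across warps, a kernel's runtime is set by whichever component is larger:

```latex
T_{\text{kernel}} \approx \max\left(T_{\text{mem}},\, T_{\text{comp}}\right)
\qquad
\begin{cases}
T_{\text{mem}} \gg T_{\text{comp}} & \Rightarrow \text{memory bound},\; T_{\text{kernel}} \approx T_{\text{mem}} \\
T_{\text{comp}} \gg T_{\text{mem}} & \Rightarrow \text{computation bound},\; T_{\text{kernel}} \approx T_{\text{comp}}
\end{cases}
```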

University of Michigan Electrical Engineering and Computer Science 13 Actor Classification
High-Traffic actors (HiT):
–Large number of memory accesses per actor
–Fewer threads when using shared memory
–Using shared memory underutilizes the processors
Low-Traffic actors (LoT):
–Fewer memory accesses per actor
–More threads
–Using shared memory increases performance

University of Michigan Electrical Engineering and Computer Science 14 Global Memory Accesses
A[i, j] denotes an actor A with i pops and j pushes.
[figure: threads 0–3 of an A[4,4] actor each accessing their own four consecutive words of global memory]
–Large access latency
–Threads do not access the words in sequence
–No coalescing (see the naive sketch below)
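A hedged sketch of the naive translation (hypothetical names; assumes one thread per actor firing): thread t pops its four inputs from in[4t..4t+3], so at any instant the threads of a warp read words four elements apart and the hardware cannot coalesce them:

```cuda
// Naive A[4,4] actor: strided global-memory accesses, no coalescing.
__global__ void actorA_naive(float *out, const float *in) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per firing
    float r0 = in[4 * t + 0];   // warp neighbors read addresses 4 words apart
    float r1 = in[4 * t + 1];
    float r2 = in[4 * t + 2];
    float r3 = in[4 * t + 3];
    out[4 * t + 0] = r0;        // pushes are strided the same way
    out[4 * t + 1] = r1;
    out[4 * t + 2] = r2;
    out[4 * t + 3] = r3;
}
```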

University of Michigan Electrical Engineering and Computer Science 15 Shared Memory
[figure: threads 0–3 of an A[4,4] actor staging data through shared memory, with coalesced global-to-shared and shared-to-global copies]
First bring the data into shared memory with coalescing:
–Each filter brings data for other filters
–Satisfies the coalescing constraints
After the data is in shared memory, each filter accesses its own portion (see the sketch below).
Improves bandwidth and performance.
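A sketch of the staged version under the same assumptions (RATE, the buffer layout, and the dummy work are ours): on copy iteration k, thread t touches element k * blockDim.x + t, so adjacent threads always access adjacent words and both global-memory copies coalesce; the strided per-thread accesses now hit fast shared memory instead:

```cuda
#define RATE 4   // pops == pushes == 4 for this A[4,4] actor

__global__ void actorA_staged(float *out, const float *in) {
    extern __shared__ float buf[];   // RATE * blockDim.x floats, set at launch
    int base = blockIdx.x * blockDim.x * RATE;

    // Global -> shared, coalesced: each thread fetches data that other
    // threads of the block will consume.
    for (int k = 0; k < RATE; ++k)
        buf[k * blockDim.x + threadIdx.x] = in[base + k * blockDim.x + threadIdx.x];
    __syncthreads();

    // Each thread now works on its own 4 consecutive words in shared memory.
    float *my = &buf[threadIdx.x * RATE];
    for (int k = 0; k < RATE; ++k)
        my[k] *= 2.0f;               // stand-in for the actor's real work
    __syncthreads();

    // Shared -> global, coalesced again.
    for (int k = 0; k < RATE; ++k)
        out[base + k * blockDim.x + threadIdx.x] = buf[k * blockDim.x + threadIdx.x];
}
// Launch: actorA_staged<<<blocks, threads, RATE * threads * sizeof(float)>>>(out, in);
```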

University of Michigan Electrical Engineering and Computer Science 16 Using Shared Memory
–Shared memory is roughly 100x faster than global memory
–Coalesce all global memory accesses
–The number of threads is limited by the size of the shared memory

University of Michigan Electrical Engineering and Computer Science 17 Helper Threads
–Shared memory limits the number of threads
–Underutilized processors can fetch data instead
–All the helper threads are placed in one warp, so there is no control-flow divergence (see the sketch below)
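A hedged sketch of the idea (COMPUTE_THREADS, HELPERS, and the dummy work are our assumptions): the block is launched with one extra warp whose threads only participate in the copy loops; because the helpers form a whole warp, the branch that excludes them from the computation causes no intra-warp divergence:

```cuda
#define COMPUTE_THREADS 64   // limited by per-thread shared-memory usage
#define HELPERS 32           // one extra warp that only moves data
#define RATE 4

// Launch with COMPUTE_THREADS + HELPERS threads per block.
__global__ void actorA_helpers(float *out, const float *in) {
    __shared__ float buf[COMPUTE_THREADS * RATE];
    int base = blockIdx.x * COMPUTE_THREADS * RATE;
    int total = COMPUTE_THREADS * RATE;

    for (int j = threadIdx.x; j < total; j += blockDim.x)  // helpers fetch too
        buf[j] = in[base + j];
    __syncthreads();

    if (threadIdx.x < COMPUTE_THREADS) {   // the helper warp skips this branch
        float *my = &buf[threadIdx.x * RATE];
        for (int k = 0; k < RATE; ++k)
            my[k] *= 2.0f;                 // stand-in for the actor's work
    }
    __syncthreads();

    for (int j = threadIdx.x; j < total; j += blockDim.x)  // helpers write back too
        out[base + j] = buf[j];
}
```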

University of Michigan Electrical Engineering and Computer Science 18 Data Prefetch
–Better register utilization
–Data for iteration i+1 is moved into registers
–Data for iteration i is moved from registers to shared memory
–Allows the GPU to overlap instructions (see the sketch below)
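A minimal sketch of the pattern (the iteration structure and names are assumed): iteration i+1's load is issued into a register before iteration i's value is pushed into shared memory and consumed, so the long-latency global load overlaps with useful work:

```cuda
// Software prefetching / double buffering. Assumes blockDim.x <= 256 and
// that each thread processes 'iters' stream windows spaced 'stride' apart.
__global__ void actor_prefetch(float *out, const float *in, int iters) {
    __shared__ float cur[256];
    int stride = gridDim.x * blockDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float next = in[i];                        // prefetch iteration 0
    for (int it = 0; it < iters; ++it) {
        float v = next;                        // iteration it, now in a register
        if (it + 1 < iters)
            next = in[i + (it + 1) * stride];  // issue iteration it+1's load early
        cur[threadIdx.x] = v;                  // register -> shared memory
        __syncthreads();
        out[i + it * stride] = cur[threadIdx.x] * 2.0f;  // stand-in for work
        __syncthreads();                       // before 'cur' is overwritten
    }
}
```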

University of Michigan Electrical Engineering and Computer Science 19 Loop Unrolling
–Similar to traditional unrolling
–Allows the GPU to overlap instructions
–Better register utilization
–Less loop-control overhead
–Can also be applied to memory-transfer loops (example below)
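A small illustrative example (the kernel is ours): with a compile-time trip count, #pragma unroll replicates the loop body, removing loop-control instructions and exposing independent loads that the scheduler can overlap; the same pragma can wrap the global/shared copy loops from the earlier slides:

```cuda
__global__ void sum4_unrolled(float *out, const float *in) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    #pragma unroll                // trip count is known at compile time
    for (int k = 0; k < 4; ++k)
        acc += in[4 * t + k];     // becomes four independent loads and adds
    out[t] = acc;
}
```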

University of Michigan Electrical Engineering and Computer Science 20 Methodology
–Set of benchmarks from the StreamIt suite
–3 GHz Intel Core 2 Duo CPU with 6 GB RAM
–NVIDIA GeForce GTX 285 GPU

University of Michigan Electrical Engineering and Computer Science 21 Result (Baseline CPU)
[figure: per-benchmark speedup over the CPU baseline (chart annotations: 10, 24)]

University of Michigan Electrical Engineering and Computer Science 22 Result (Baseline GPU)
[figure: per-benchmark improvement over the baseline GPU implementation (chart annotations: 64%, 3%, 16%)]

University of Michigan Electrical Engineering and Computer Science 23 Conclusion
–Future systems will be heterogeneous
–GPUs are an important part of such systems
–Programming complexity is a significant challenge
–Sponge automatically creates optimized CUDA code for a wide variety of GPU targets
–It provides portability by performing an array of optimizations on stream graphs

University of Michigan Electrical Engineering and Computer Science 24 Questions

University of Michigan Electrical Engineering and Computer Science 25 Spatial Intermediate Representation
StreamIt main constructs:
–Filter → encapsulates computation
–Pipeline → expresses pipeline parallelism
–Splitjoin → expresses task-level parallelism
–Other constructs are not relevant here
Exposes different types of parallelism:
–Composable, hierarchical
–Stateful and stateless filters
[figure: a stream graph built from filter, pipeline, and splitjoin constructs]

University of Michigan Electrical Engineering and Computer Science 26 Nonlinear Optimization Space [Ryoo, CGO ’08]
[figure: SAD optimization space, 908 configurations]

University of Michigan Electrical Engineering and Computer Science 27 Bank Conflict
[figure: threads 0–2 of an A[8,8] actor conflicting on the same shared-memory bank]
data = buffer[BaseAddress + s * ThreadId]
When the stride s maps several threads of a warp to the same bank, their accesses serialize.

University of Michigan Electrical Engineering and Computer Science 28 Removing Bank Conflict
[figure: threads 0–2 of an A[8,8] actor accessing distinct shared-memory banks]
data = buffer[BaseAddress + s * ThreadId]
If GCD(number of banks, s) is 1, there is no bank conflict; since the number of banks is a power of two, s must be odd. One common fix is sketched below.
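A sketch of the padding fix under these assumptions (RATE, PAD, and the kernel are ours): padding each thread's window by one word turns the even stride 4 into the odd stride 5, so GCD(banks, s) = 1 and consecutive threads fall into different banks:

```cuda
#define RATE 4   // stride 4 is even: threads collide on the same banks
#define PAD  1   // one word of padding makes the effective stride 5 (odd)

__global__ void actorA_padded(float *out, const float *in) {
    __shared__ float buf[128 * (RATE + PAD)];       // assumes blockDim.x <= 128
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    float *my = &buf[threadIdx.x * (RATE + PAD)];   // s = 5: conflict-free
    for (int k = 0; k < RATE; ++k)
        my[k] = in[RATE * t + k];    // (coalescing handled as on slide 15)
    for (int k = 0; k < RATE; ++k)
        out[RATE * t + k] = my[k] * 2.0f;           // stand-in for actor work
}
```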