Introduction to Heterogeneous Parallel Computing

Slide credit: Slides adapted from © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2016

Heterogeneous Parallel Computing
Use the best match for the job (heterogeneity in a mobile SoC):
- Latency cores
- Throughput cores
- HW IPs
- DSP cores
- Configurable logic/cores
- On-chip memories

CPUs and GPUs Are Designed Very Differently
- CPU: latency-oriented cores (per-core control, local cache, registers, SIMD unit)
- GPU: throughput-oriented cores (compute units with cache/local memory, registers, threading logic, SIMD units)

CPUs: Latency-Oriented Design
- Powerful ALUs: reduced operation latency
- Large caches: convert long-latency memory accesses into short-latency cache accesses
- Sophisticated control: branch prediction for reduced branch latency; data forwarding for reduced data latency

GPUs: Throughput-Oriented Design
- Small caches: to boost memory throughput
- Simple control: no branch prediction, no data forwarding
- Energy-efficient ALUs: many, with long latency but heavily pipelined for high throughput
- Massive threading: requires a massive number of threads (threading logic, thread state) to tolerate latencies

Applications Use Both CPUs and GPUs
- CPUs for sequential parts where latency matters: CPUs can be 10x+ faster than GPUs for sequential code
- GPUs for parallel parts where throughput wins: GPUs can be 10x+ faster than CPUs for parallel code

Heterogeneous Parallel Computing in Many Disciplines
Financial analysis, scientific simulation, engineering simulation, data-intensive analytics, medical imaging, digital audio processing, digital video processing, computer vision, biomedical informatics, electronic design automation, statistical modeling, numerical methods, ray tracing, rendering, interactive physics

Portability and Scalability in Heterogeneous Parallel Computing

Software Dominates System Cost
- SW lines per chip increase at 2x per 10 months
- HW gates per chip increase at 2x per 18 months
- Future systems must minimize software redevelopment

Keys to Software Cost Control: Scalability
- The same application runs efficiently on new generations of cores (App on Core A 2.0)
- The same application runs efficiently on more of the same cores (App on Core A + Core A + Core A)

More on Scalability
Performance growth with HW generations comes from:
- Increasing number of compute units (cores)
- Increasing number of threads
- Increasing vector length
- Increasing pipeline depth
- Increasing DRAM burst size
- Increasing number of DRAM channels
- Increasing data movement latency
The programming style we use in this course supports scalability through fine-grained problem decomposition and dynamic thread scheduling.

Keys to Software Cost Control: Portability
- The same application runs efficiently on different types of cores (App on Core A, Core B, Core C)
- The same application runs efficiently on systems with different organizations and interfaces

More on Portability
Portability across many different HW types:
- Across ISAs (Instruction Set Architectures): x86 vs. ARM, etc.
- Latency-oriented CPUs vs. throughput-oriented GPUs
- Across parallelism models: VLIW vs. SIMD vs. threading
- Across memory models: shared memory vs. distributed memory