

GPUs

Performance Advantage of GPUs
A growing peak performance advantage:
– Calculation: 1 TFLOPS vs. 100 GFLOPS
– Memory bandwidth: 100-150 GB/s vs. 32-64 GB/s
– GPU in every PC and workstation – massive volume and potential impact
Courtesy: John Owens

Heart of Blue Waters: Two Chips

AMD Interlagos
– 157 GF peak performance
– Features: 8 core modules, 16 threads
– On-chip caches: L1 (I: 8x64KB; D: 16x16KB), L2 (8x2MB)
– Memory subsystem: four memory channels, 51.2 GB/s bandwidth

NVIDIA Kepler
– 1,300 GF peak performance
– Features: 15 streaming multiprocessors (SMX), each with 192 CUDA SPs, 64 DP units, and 32 special function units
– On-chip caches: L1 cache/shared memory (64KB, 48KB), L2 cache (1,536KB)
– Memory subsystem: six memory channels, 250 GB/s bandwidth

[Figure: side-by-side block diagrams – CPU with large Control and Cache blocks and a few ALUs vs. GPU with many small ALUs, each chip backed by DRAM]
CPUs and GPUs have fundamentally different design philosophies.

CPUs: Latency-Oriented Design
Large caches
– Convert long-latency memory accesses into short-latency cache accesses
Sophisticated control
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
Powerful ALUs
– Reduced operation latency

GPUs: Throughput-Oriented Design
Small caches
– To boost memory throughput
Simple control
– No branch prediction
– No data forwarding
Energy-efficient ALUs
– Many, long-latency but heavily pipelined for high throughput
Requires a massive number of threads to tolerate latencies

Winning Applications Use Both CPU and GPU
CPUs for sequential parts where latency matters
– CPUs can be 10+X faster than GPUs for sequential code
GPUs for parallel parts where throughput wins
– GPUs can be 10+X faster than CPUs for parallel code
A minimal sketch of this division of labor follows.
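To make the host/device split concrete, here is a minimal CUDA sketch (an illustration, not from the original slides; the kernel name `scale` and the sizes are hypothetical): the CPU runs the sequential setup, and the GPU runs the data-parallel loop, one element per thread.

```cuda
#include <cuda_runtime.h>

// Data-parallel part: one GPU thread per element (throughput wins).
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // Sequential part stays on the CPU (latency matters): setup, control,
    // and choosing launch parameters.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    scale<<<blocks, threads>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```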

GPU Architecture
[Figure: a single FETCH/DECODE/ISSUE front end feeding N lanes, each a register file (RF) paired with a functional unit (FU)]
Good for data-parallel code. Bad when threads diverge.

Control Divergence
[Figure: two threads traversing a branch A → (B or C) → D; under lockstep execution, each thread sits idle while the other's path (B or C) is executed]
Threads must execute in lockstep, but different threads execute different instruction sequences. A small example kernel follows.
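As a minimal sketch of how divergence arises in code (my illustration, not from the slides): within a warp, even- and odd-numbered threads below take different branches, so the hardware executes both paths one after the other, masking off the inactive threads each time.

```cuda
__global__ void divergent(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Adjacent threads in the same warp take different paths here, so the
    // warp serially executes the 'then' path, then the 'else' path.
    if (i % 2 == 0)
        out[i] = in[i] * 2.0f;   // even-numbered threads
    else
        out[i] = in[i] + 1.0f;   // odd-numbered threads
}
```

A divergence-free variant branches on something uniform within a warp, e.g. `(i / warpSize) % 2`, so every thread of a given warp takes the same path.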

Memory Divergence
[Figure: several SMs with L1 caches issuing Load 0 and Load 1 through an interconnect to multiple memory controllers, L2 slices, and DRAM channels]
Dependent instructions must stall until the longest memory request finishes. A sketch of the access patterns involved follows.
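As an illustration of why access patterns matter (an addition, not from the slides): when adjacent threads read adjacent addresses, a warp's loads coalesce into a few wide transactions; with a large stride, the same warp scatters its loads across many DRAM locations and stalls until the slowest one returns.

```cuda
// Coalesced: thread i reads element i, so a warp touches one contiguous
// span of memory – few transactions, minimal stalling.
__global__ void coalesced(float *out, const float *in, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i*stride, so a warp's 32 loads hit many
// different lines; dependent instructions stall on the longest request.
__global__ void strided(float *out, const float *in, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```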

Massive Parallelism Requires Regularity

Undesirable Patterns
– Serialization due to conflicting use of critical resources
– Oversubscription of global memory bandwidth
– Load imbalance among parallel threads

Conflicting Data Accesses Cause Serialization and Delays
Massively parallel execution cannot afford serialization.
Contention in accessing critical data causes serialization.

A Simple Example
A naïve inner-product algorithm on two vectors of one million elements each:
– All multiplications can be done in one time unit (parallel)
– Additions into a single accumulator take one million time units (serial)
[Figure: timeline – all multiplies execute in parallel, followed by a long serial chain of additions]
A CUDA sketch of this pattern follows.
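A minimal CUDA sketch of the naïve version (my illustration; the slide describes the pattern, not this code): every thread computes its product in parallel, but all additions funnel into a single accumulator through `atomicAdd`, which the hardware serializes.

```cuda
__global__ void naive_dot(const float *a, const float *b,
                          float *result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float prod = a[i] * b[i];  // all products computed in parallel
        atomicAdd(result, prod);   // single accumulator: updates serialize
    }
}
```

The standard fix is a tree reduction: threads first combine partial sums within each block (e.g. in shared memory), shrinking the serial chain from n additions to on the order of log n steps.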

How much can conflicts hurt? Amdahl's Law
– If a fraction X of a computation is serialized, the speedup cannot be more than 1/X
In the previous example, X = 50%:
– Half the calculations are serialized
– No more than 2X speedup, no matter how many computing cores are used
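Spelled out as a worked equation (the standard form of Amdahl's law, added here for completeness): with N cores and serialized fraction X,

```latex
\text{speedup}(N) \;=\; \frac{1}{X + \frac{1 - X}{N}} \;\le\; \frac{1}{X},
\qquad X = 0.5 \;\Rightarrow\; \text{speedup} \le 2 .
```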

Load Balance
The total time to complete a parallel job is limited by the thread that takes the longest to finish.
[Figure: per-thread work as bar charts – a 'good' case with evenly sized bars vs. a 'bad' case where one tall bar dominates]

How bad can it be?
Assume a job takes 100 units of time for one person to finish.
– If we break the job into 10 parts of 10 units each and have 10 people do them in parallel, we get a 10X speedup.
– If we break the job into parts of 50, 10, 5, 5, 5, 5, 5, 5, 5, 5 units, the same 10 people take 50 units to finish, with 9 of them idling most of the time. We get no more than 2X speedup (worked out below).
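In symbols (an added worked equation using the slide's numbers): with per-worker loads t_i, completion time is the maximum load, so

```latex
\text{speedup} \;=\; \frac{T_{\text{serial}}}{\max_i t_i}
\;=\; \frac{100}{\max(50, 10, 5, \dots, 5)} \;=\; \frac{100}{50} \;=\; 2 .
```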

How does imbalance come about?
Non-uniform data distributions
– Highly concentrated spatial data areas
– Astronomy, medical imaging, computer vision, rendering, …
If each thread processes the input data of a given spatial volume unit, some will do a lot more work than others.

Global Memory Bandwidth
[Figure: 'Ideal' vs. 'Reality' – idealized uniform DRAM bandwidth vs. actual bandwidth that depends on access patterns]

Global Memory Bandwidth
Many-core processors have limited off-chip memory access bandwidth compared to peak compute throughput. Fermi:
– 1.5 TFLOPS SPFP peak throughput
– 0.75 TFLOPS DPFP peak throughput
– 144 GB/s peak off-chip memory access bandwidth
   - 36 G SPFP operands per second (144 GB/s ÷ 4 bytes per operand)
   - 18 G DPFP operands per second (144 GB/s ÷ 8 bytes per operand)
– To achieve peak throughput, how many operations must a program perform for each operand?
   - 1,500/36 ≈ 42 SPFP arithmetic operations for each operand value fetched from off-chip memory
   - How about DPFP? (worked out below)
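The same arithmetic for both precisions, as a worked equation (an addition; the DPFP answer is not stated on the slide but follows directly from its numbers):

```latex
\frac{1500\ \text{GFLOP/s}}{144\ \text{GB/s} \,/\, 4\ \text{B}} \approx 42 \ \text{(SPFP)},
\qquad
\frac{750\ \text{GFLOP/s}}{144\ \text{GB/s} \,/\, 8\ \text{B}} \approx 42 \ \text{(DPFP)}.
```

So a program needs roughly 42 arithmetic operations per fetched operand in either precision to stay compute-bound on Fermi.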