GPU Architecture Overview Patrick Cozzi, University of Pennsylvania, CIS 565 – Fall 2012

Announcements Project 0 due Monday 09/17. Karl's office hours: Tuesday, 4:30–6pm; Friday, 2–5pm.

Course Overview Follow-Up Read before or after lecture? Projects vs. final project. Closed vs. open source. Feedback vs. grades. Guest lectures.

Acknowledgements CPU slides – Varun Sampath, NVIDIA. GPU slides – Kayvon Fatahalian, CMU; Mike Houston, AMD.

CPU and GPU Trends FLOPS – FLoating-point OPerations per Second. GFLOPS – one billion (10^9) FLOPS. TFLOPS – 1,000 GFLOPS.

CPU and GPU Trends (chart)

CPU and GPU Trends Compute: Intel Core i7 – 4 cores – 100 GFLOPS; NVIDIA GTX 280 – 240 cores – 1 TFLOPS. Memory Bandwidth: system memory – 60 GB/s; NVIDIA GT200 – 150 GB/s. Install Base: over 200 million NVIDIA G80s shipped.

CPU Review What are the major components on a CPU die?

CPU Review Desktop applications: lightly threaded, lots of branches, lots of memory accesses. Profiled with psrun on ENIAC. Slide from Varun Sampath.

A Simple CPU Core (image: Penn CIS501). Five stages: Fetch → Decode → Execute → Memory → Writeback. The datapath diagram shows the PC, instruction cache (I$), register file (two source ports, one destination port), data cache (D$), PC+4 adder, and control logic.

Pipelining (image: Penn CIS501). The single-cycle datapath's clock period is the sum of the stage delays (T_insn-mem + T_regfile + T_ALU + T_data-mem + T_regfile = T_singlecycle); pipelining instead lets the clock period be set by the slowest stage.

Branches (image: Penn CIS501). Pipeline diagram of a "jeq loop" branch flowing through the F/D, D/X, X/M, M/W latches: the fetch stage does not yet know the branch target (???), so nops are inserted behind the branch until it resolves.

Branch Prediction + Modern predictors achieve >90% accuracy, raising performance and energy efficiency (why?). – Area increase. – Potential fetch-stage latency increase.
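
As a concrete illustration (my example, not from the slides), the same loop body runs noticeably faster when its branch is predictable: sorting the data below turns a roughly 50%-taken branch into long runs of identical outcomes that a modern predictor handles almost perfectly, even though the function itself is unchanged.

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    // Sums all elements >= 128. The if-statement is the branch whose outcome
    // the hardware predictor has to guess on every iteration.
    long long sumAboveThreshold(const std::vector<int>& data) {
        long long sum = 0;
        for (int v : data) {
            if (v >= 128)
                sum += v;
        }
        return sum;
    }

    int main() {
        std::vector<int> data(1 << 20);
        for (int& v : data) v = std::rand() % 256;  // random values: branch taken ~50% of the time
        // With sorted data the branch is not-taken for the first half and taken
        // for the second half, so the predictor is right almost every time.
        std::sort(data.begin(), data.end());
        return sumAboveThreshold(data) > 0 ? 0 : 1;
    }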

Memory Hierarchy Memory: the larger it gets, the slower it gets. Rough numbers:
                        Latency     Bandwidth     Size
    SRAM (L1, L2, L3)   1-2 ns      200 GB/s      1-20 MB
    DRAM (memory)       70 ns       20 GB/s       1-20 GB
    Flash (disk)        70-90 µs    200 MB/s      … GB
    HDD (disk)          10 ms       1-150 MB/s    … GB

Caching Keep the data you need close. Exploit temporal locality (a chunk just used is likely to be used again soon) and spatial locality (the next chunk to use is likely close to the previous one).
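
A minimal sketch (my example, assuming ordinary row-major C++ storage) of why traversal order matters: both functions compute the same sum, but the row-major version walks consecutive addresses and makes full use of each cache line, while the column-major version jumps between rows on every access.

    #include <cstddef>
    #include <vector>

    // Row-major traversal: consecutive elements of a row sit next to each other
    // in memory, so each cache line brought in is fully used (spatial locality).
    double sumRowMajor(const std::vector<std::vector<double>>& m) {
        double sum = 0.0;
        for (std::size_t i = 0; i < m.size(); ++i)
            for (std::size_t j = 0; j < m[i].size(); ++j)
                sum += m[i][j];
        return sum;
    }

    // Column-major traversal of the same data: every access lands in a different
    // row's allocation, so cache lines are evicted before they are reused.
    // On large matrices this can be several times slower.
    double sumColMajor(const std::vector<std::vector<double>>& m) {
        double sum = 0.0;
        for (std::size_t j = 0; j < m[0].size(); ++j)   // assumes a non-empty, rectangular matrix
            for (std::size_t i = 0; i < m.size(); ++i)
                sum += m[i][j];
        return sum;
    }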

Cache Hierarchy Hardware-managed: L1 instruction/data caches, L2 unified cache, L3 unified cache. Software-managed: main memory, disk. (Diagram, not to scale: I$/D$ → L2 → L3 → main memory → disk; levels get larger going down and faster going up.)

Intel Core i7 3960X – 15 MB L3 (25% of die); 4-channel memory controller, 51.2 GB/s total.

Improving IPC IPC (instructions per cycle) is bottlenecked at 1 instruction per clock in a scalar pipeline. Superscalar: increase the pipeline width. (Image: Penn CIS371)

Scheduling Consider these instructions:
    xor  r1,r2 -> r3
    add  r3,r4 -> r4
    sub  r5,r2 -> r3
    addi r3,1  -> r1
xor and add are dependent (Read-After-Write, RAW). sub and addi are dependent (RAW). xor and sub are not truly dependent: they share only a Write-After-Write (WAW) name dependence on r3, which register renaming can remove.

Register Renaming How about this instead:
    xor  p1,p2 -> p6
    add  p6,p4 -> p7
    sub  p5,p2 -> p8
    addi p8,1  -> p9
With the WAW hazard on r3 renamed away, xor and sub can now execute in parallel.
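
A toy renamer sketched in C++ (my illustration; the data layout and names are made up, not the lecture's): every destination write is given a fresh physical register and sources read the latest mapping, which removes WAW/WAR name dependences and leaves only true RAW dependences.

    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Insn { std::string op, src1, src2, dst; };

    int main() {
        // The instruction sequence from the Scheduling slide.
        std::vector<Insn> prog = {
            {"xor",  "r1", "r2", "r3"},
            {"add",  "r3", "r4", "r4"},
            {"sub",  "r5", "r2", "r3"},
            {"addi", "r3", "1",  "r1"},
        };
        std::unordered_map<std::string, std::string> map;  // architectural -> physical register
        int nextPhys = 1;
        for (const Insn& in : prog) {
            // Sources read the most recent mapping (or the original name if none yet).
            std::string s1 = map.count(in.src1) ? map[in.src1] : in.src1;
            std::string s2 = map.count(in.src2) ? map[in.src2] : in.src2;
            // Every write gets a brand-new physical register, so later writers of
            // the same architectural register no longer conflict with this one.
            std::string d = "p" + std::to_string(nextPhys++);
            map[in.dst] = d;
            std::printf("%-4s %s,%s -> %s\n", in.op.c_str(), s1.c_str(), s2.c_str(), d.c_str());
        }
        return 0;
    }

On this input it prints xor r1,r2 -> p1, add p1,r4 -> p2, sub r5,r2 -> p3, addi p3,1 -> p4: the sub no longer writes the same register as the xor, which is exactly why the two can now run in parallel.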

Out-of-Order Execution Reordering instructions to maximize throughput. Pipeline: Fetch → Decode → Rename → Dispatch → Issue → Register-Read → Execute → Memory → Writeback → Commit. Reorder Buffer (ROB): keeps track of status for in-flight instructions. Physical Register File (PRF): holds the renamed register values. Issue Queue/Scheduler: chooses the next instruction(s) to execute.
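
To make "reordering to maximize throughput" concrete, here is a toy issue loop (my sketch, not the slide's microarchitecture): each cycle it issues every instruction whose renamed sources are ready, ignoring program order. Real hardware bounds this with an issue width, a ROB, and a finite physical register file.

    #include <cstddef>
    #include <cstdio>
    #include <set>
    #include <string>
    #include <vector>

    struct Insn { std::string name; std::vector<std::string> srcs; std::string dst; };

    int main() {
        // The renamed sequence from the Register Renaming slide.
        std::vector<Insn> prog = {
            {"xor",  {"p1", "p2"}, "p6"},
            {"add",  {"p6", "p4"}, "p7"},
            {"sub",  {"p5", "p2"}, "p8"},
            {"addi", {"p8"},       "p9"},
        };
        std::set<std::string> ready = {"p1", "p2", "p3", "p4", "p5"};  // registers with known values
        std::vector<bool> done(prog.size(), false);
        for (int cycle = 1;; ++cycle) {
            std::vector<std::size_t> issued;
            for (std::size_t i = 0; i < prog.size(); ++i) {
                if (done[i]) continue;
                bool ok = true;
                for (const std::string& s : prog[i].srcs) ok = ok && (ready.count(s) > 0);
                if (ok) issued.push_back(i);   // all sources produced: ready to issue
            }
            if (issued.empty()) break;         // nothing left to issue
            for (std::size_t i : issued) {
                std::printf("cycle %d: issue %s\n", cycle, prog[i].name.c_str());
                done[i] = true;
            }
            for (std::size_t i : issued) ready.insert(prog[i].dst);  // results visible next cycle
        }
        return 0;
    }

Running it issues xor and sub together in cycle 1 and add and addi in cycle 2, mirroring the parallelism exposed by renaming.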

Parallelism in the CPU Covered Instruction-Level (ILP) extraction  Superscalar  Out-of-order Data-Level Parallelism (DLP)  Vectors Thread-Level Parallelism (TLP)  Simultaneous Multithreading (SMT)  Multicore

Vectors – Motivation: for (int i = 0; i < N; i++) A[i] = B[i] + C[i];

CPU Data-level Parallelism Single Instruction Multiple Data (SIMD): let's make the execution unit (ALU) really wide, and the registers really wide too.
    for (int i = 0; i < N; i += 4) { // in parallel
        A[i]   = B[i]   + C[i];
        A[i+1] = B[i+1] + C[i+1];
        A[i+2] = B[i+2] + C[i+2];
        A[i+3] = B[i+3] + C[i+3];
    }

Vector Operations in x86 SSE2: 4-wide packed float and packed integer instructions (Intel Pentium 4 onwards, AMD Athlon 64 onwards). AVX: 8-wide packed float and packed integer instructions (Intel Sandy Bridge, AMD Bulldozer).
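
A sketch of the earlier A[i] = B[i] + C[i] loop written with 4-wide SSE intrinsics (my example; it assumes an SSE-capable x86 CPU and that N is a multiple of 4), instead of hoping the compiler auto-vectorizes the scalar loop.

    #include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

    void addArraysSSE(float* A, const float* B, const float* C, int N) {
        for (int i = 0; i < N; i += 4) {       // 4 floats per iteration
            __m128 b = _mm_loadu_ps(&B[i]);    // load 4 packed floats from B
            __m128 c = _mm_loadu_ps(&C[i]);    // load 4 packed floats from C
            __m128 a = _mm_add_ps(b, c);       // one instruction adds all 4 lanes
            _mm_storeu_ps(&A[i], a);           // store 4 results into A
        }
    }

The AVX version has the same shape with __m256 and _mm256_add_ps, processing 8 floats per instruction.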

Thread-Level Parallelism Thread composition: instruction streams with a private PC, registers, and stack, and shared globals and heap. Threads are created and destroyed by the programmer, and scheduled by the programmer or by the OS.
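
A sketch (my example, not the slide's code) of programmer-created threads splitting the same A[i] = B[i] + C[i] loop: each std::thread gets its own stack and program counter, while the A, B, and C arrays live in shared memory.

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Each thread computes one contiguous slice of the output array.
    static void addRange(float* A, const float* B, const float* C, int lo, int hi) {
        for (int i = lo; i < hi; ++i)
            A[i] = B[i] + C[i];
    }

    void addArraysThreaded(float* A, const float* B, const float* C, int N) {
        unsigned numThreads = std::thread::hardware_concurrency();
        if (numThreads == 0) numThreads = 4;                  // fallback if the count is unknown
        int chunk = (N + static_cast<int>(numThreads) - 1) / static_cast<int>(numThreads);
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < numThreads; ++t) {
            int lo = static_cast<int>(t) * chunk;
            int hi = std::min(N, lo + chunk);
            if (lo >= hi) break;
            workers.emplace_back(addRange, A, B, C, lo, hi);  // create the thread
        }
        for (std::thread& w : workers) w.join();              // destroy: wait for each to finish
    }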

Simultaneous Multithreading Instructions can be issued from multiple threads; requires partitioning of the ROB and other buffers. + Minimal hardware duplication. + More scheduling freedom for OoO. – Cache and execution-resource contention can reduce single-threaded performance.

Multicore Replicate the full pipeline. Sandy Bridge-E: 6 cores. + Full cores, no resource sharing other than the last-level cache. + Easier way to take advantage of Moore's Law. – Utilization.

CPU Conclusions The CPU is optimized for sequential programming: pipelines, branch prediction, superscalar, and OoO reduce execution time with high clock speeds and high utilization. Slow memory is a constant problem. Parallelism: Sandy Bridge-E is great for 6-12 active threads; how about 12,000?

How did this happen?

Graphics Workloads Triangles/vertices and pixels/fragments.

Early 90s – Pre-GPU.

Why GPUs? Graphics workloads are embarrassingly parallel: data-parallel and pipeline-parallel. The CPU and GPU execute in parallel. Hardware support for texture filtering, rasterization, etc.

Data Parallel Beyond Graphics Cloth simulation, particle systems, matrix multiply.
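
As a small illustration of why a particle system is data-parallel (my sketch; the Particle struct and constants are made up), every particle's update reads and writes only its own state, so each loop iteration is independent and could become one GPU thread.

    #include <vector>

    struct Particle { float x, y, z, vx, vy, vz; };

    // One simulation step: iterations do not touch each other's data, so they
    // can all run at the same time on a wide parallel machine.
    void stepParticles(std::vector<Particle>& particles, float dt) {
        const float gravity = -9.8f;             // m/s^2, acting along z
        for (Particle& p : particles) {
            p.vz += gravity * dt;                // integrate velocity
            p.x  += p.vx * dt;                   // integrate position
            p.y  += p.vy * dt;
            p.z  += p.vz * dt;
        }
    }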

Reminders Piazza: sign up. GitHub: create an account, change it to an edu account, and join our organization. No class Wednesday, 09/12.