Comparison of Modern CPUs and GPUs, and the Convergence of Both. Jonathan Palacios and Josh Triska.
Comparison of Modern CPUs and GPUs And the convergence of both Jonathan Palacios Josh Triska

2 Introduction and Motivation Graphics Processing Units (GPUs) have been evolving at a rapid rate in recent years. In terms of raw processing-power gains, they greatly outpace CPUs.

3 Introduction and Motivation

4 The disparity is largely due to the specific nature of the problems historically solved by the GPU: – The same operations on many primitives (SIMD) – A focus on throughput over latency – Lots of special-purpose hardware CPUs, on the other hand: – Focus on reducing latency – Are designed to handle a wider range of problems

5 Introduction and Motivation Despite these differences, GPUs and CPUs are converging in many ways: – CPUs are adding more cores – GPUs are becoming more programmable and general purpose Examples: – NVIDIA Fermi – Intel Larrabee

6 Overview Introduction History of GPU Chip Layouts Data-flow Memory Hierarchy Instruction Set Applications Conclusion

7 History of the GPU GPUs have mostly developed in the last 15 years Before that, graphics was handled by the Video Graphics Array (VGA) controller – Memory controller, DRAM, display generator – Takes image data and arranges it for the output device

8 History of the GPU Graphics acceleration hardware components were gradually added to VGA controllers – Triangle rasterization – Texture mapping – Simple shading Examples of early “graphics accelerators” – 3dfx Voodoo – ATI Rage – NVIDIA RIVA TNT2

9 History of the GPU NVIDIA GeForce 256, the “first” GPU (1999) – Non-programmable (fixed-function) – Transform and lighting – Texture/environment mapping

10 History of the GPU Fairly early on in the GPU market, there was a severe narrowing of competition Early companies: – Silicon Graphics (SGI) – 3dfx – NVIDIA – ATI – Matrox Now only AMD and NVIDIA remain

11 History of the GPU Since their inception, GPUs have gradually become more powerful, programmable, and general purpose – Programmable geometry, vertex, and pixel processors – Unified shader model – Expanding instruction sets – CUDA, OpenCL

12 History of the GPU The latest NVIDIA architecture, Fermi, offers many more general-purpose features – Real floating-point quality and performance – Error-correcting codes (ECC) – Fast context switching – A unified address space

13 GPU Chip Layouts GPU chip layouts have been moving in the direction of general-purpose computing for several years Some high-level trends: – Unification of hardware components – Large increases in functional unit counts

14 GPU Chip Layouts NVIDIA GeForce 7800

15 GPU Chip Layouts NVIDIA GeForce 8800

16 GPU Chip Layouts NVIDIA GeForce 400 (Fermi architecture) 3 billion transistors

17 GPU Chip Layouts AMD Radeon 6800 (Cayman architecture) 2.64 billion transistors

18 CPU Chip Layouts CPUs have also been increasing functional unit counts However, these units always come with all of the supporting hardware of a full single-core processor – Reorder buffers/reservation stations – Complex branch prediction This means that CPUs add raw compute power at a much slower rate.

19 CPU Chip Layouts Intel Core i7 (Nehalem architecture) 125 million transistors

20 CPU Chip Layouts Intel Core i7 (Nehalem architecture) 731 million transistors

21 CPU Chip Layouts Nehalem “core” 731 million transistors

22 CPU Chip Layouts Intel Westmere (Nehalem)

23 CPU Chip Layouts Intel 8-Core Nehalem EX 2.3 Billion transistors

24 “Hybrid” Chip Layouts Intel Larrabee project Vaporware

25 “Hybrid” Chip Layouts NVIDIA Tegra

26 Chip Layouts Summary The take-home message is that the real-estate allocation of GPUs and CPUs evolves based on very different fundamental priorities – GPUs Increase raw compute power Increase throughput Still fairly special purpose – CPUs Reduce latency Epitome of general purpose Backwards compatibility

27 The (traditional) graphics pipeline Elements of the graphics pipeline were historically fixed-function units; they have been programmable since around 2000

28 The unified shader With the introduction of the unified shader model, the GPU becomes essentially a many-core, streaming multiprocessor (NVIDIA 6800 tech brief)

29 Emphasis on throughput If your frame rate is 50 Hz, your per-frame latency budget is approximately 20 ms However, you need to do on the order of 100 million operations in that one frame Result: very deep pipelines and high FLOPS – GeForce 7 had >200 stages for the pixel shader – Fermi: 1.5 TFLOPS; AMD 5870: 2.7 TFLOPS The unified shader has cut down on the number of stages by allowing breaks from linear execution

30 Memory hierarchy The cache-size hierarchy is backwards from that of CPUs Caches serve to conserve precious memory bandwidth by intelligently prefetching

31 Memory prefetching Graphics pipelines are inherently high-latency A cache miss simply pushes another thread into the core Hit rates are ~90%, as opposed to ~100% on CPUs (figure: prefetching algorithm)

32 Memory access GPUs are all about 2D spatial locality, not linear locality GPU caches are read-only (writes go through registers) There is a growing body of research on optimizing algorithms for the 2D cache model
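To make the 2D-locality point concrete, here is a minimal sketch (in plain Python, with an arbitrary 4x4 grid and 2x2 tile size) contrasting a row-major traversal, which has only linear locality, with a tiled traversal, whose consecutive accesses stay inside a small 2D neighborhood, the pattern GPU texture caches favor:

```python
# Illustrative sketch: enumerate the visit order of a row-major scan
# versus a 2D-tiled scan. Grid size (4x4) and tile size (2x2) are
# arbitrary choices for the example.

def row_major_order(rows, cols):
    """Visit cells line by line (good 1D/linear locality)."""
    return [(i, j) for i in range(rows) for j in range(cols)]

def tiled_order(rows, cols, tile):
    """Visit cells tile by tile (good 2D spatial locality)."""
    order = []
    for ti in range(0, rows, tile):
        for tj in range(0, cols, tile):
            for i in range(ti, min(ti + tile, rows)):
                for j in range(tj, min(tj + tile, cols)):
                    order.append((i, j))
    return order

print(row_major_order(4, 4)[:4])  # first four visits: one full row
print(tiled_order(4, 4, 2)[:4])   # first four visits: one 2x2 tile
```

Both orders touch every cell exactly once; only the order, and hence the cache behavior, differs.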

33 Instruction set differences Until very recently, a scattered address space 2009 saw the introduction of modern CPU-style 64-bit addressing Block operations versus sequential:

Sequential:
for i = 1 to 4
  for j = 1 to 4
    y[i][j] = y[i][j] + 1

Block:
block = 1:4 by 1:4
for all y[i][j] within block: y[i][j] = y[i][j] + 1

Bam! SIMD: single instruction, multiple data
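The slide's pseudocode can be sketched in plain Python, with list operations standing in for hardware vector lanes; the 4x4 array size comes from the slide:

```python
# Incrementing a 4x4 array one element at a time (sequential) versus
# as a single block operation (SIMD-style). Pure Python stands in for
# hardware vector lanes here.

y = [[0] * 4 for _ in range(4)]

# Sequential: one scalar add per loop iteration.
for i in range(4):
    for j in range(4):
        y[i][j] = y[i][j] + 1

# Block: one logical operation applied to the whole block at once,
# the way a SIMD unit applies a single instruction to multiple data.
y = [[v + 1 for v in row] for row in y]

print(y[0])  # [2, 2, 2, 2]: every element incremented twice
```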

34 SIMD vs. SISD Programmable GPU shaders versus Pentium 4

35 Single Instruction, Multiple Thread (SIMT) Newer GPUs use a new scheduling model called SIMT ~32 threads are bundled together into a “warp” and executed together Warps are then executed one instruction at a time, round-robin (image: weaving cotton threads)
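The round-robin issue order described above can be sketched as a toy scheduler; the warp count and per-warp instruction counts below are invented for illustration:

```python
# Toy round-robin issue model: each cycle, the scheduler issues one
# instruction from the warp at the front of the ready queue, then
# rotates that warp to the back until its instructions run out.
from collections import deque

def issue_order(warp_lengths):
    """Return the sequence of warp ids in the order instructions issue."""
    ready = deque((wid, n) for wid, n in enumerate(warp_lengths))
    order = []
    while ready:
        wid, remaining = ready.popleft()
        order.append(wid)                       # issue one instruction
        if remaining > 1:
            ready.append((wid, remaining - 1))  # rotate to the back
    return order

# Three warps with 3, 2, and 1 instructions: issue interleaves them.
print(issue_order([3, 2, 1]))  # [0, 1, 2, 0, 1, 0]
```

The interleaving is what hides memory latency: while one warp waits on a load, the scheduler keeps issuing from the others.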

36 Instruction set differences Branch granularity If one thread within a processor cluster branches without the rest, you have a branch divergence The threads execute serially until the branches reconverge Warp scheduling mitigates, but does not eliminate, the hazards of branch divergence; an if/else may still stall threads
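A hypothetical cost model makes the divergence penalty concrete; the 32-thread warp size matches the slides, but the cycle counts are invented for illustration:

```python
# If every thread in a warp takes the same side of a branch, the warp
# executes one path; if threads diverge, both paths execute serially
# and the warp pays for both. Cycle costs here are made up.

def warp_cycles(taken, if_cost, else_cost):
    """Cycles for one warp to execute an if/else, given each thread's branch."""
    paths = set(taken)
    cycles = 0
    if True in paths:        # some thread takes the if-side
        cycles += if_cost
    if False in paths:       # some thread takes the else-side
        cycles += else_cost
    return cycles

uniform = [True] * 32                    # all 32 threads agree
divergent = [True] * 16 + [False] * 16   # warp splits down the middle

print(warp_cycles(uniform, 10, 10))      # 10: one path executes
print(warp_cycles(divergent, 10, 10))    # 20: both paths, serialized
```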

37 Instruction set differences Unified shader: all shaders (since 2006) have the same basic instruction set, layered on a (still) specialized core Cores are very simple: hardware support for things like recursion may not be available Until very recently, speed hacks – Floating-point accuracy truncated to save cycles – IEEE FP compliance is appearing on some GPUs Primitives are limited to GPU data structures – GPUs operate on textures, etc. – Computational variables must be mapped onto them

38 GPU Limitations Relatively small amount of memory, < 4 GB in current GPUs I/O directly to GPU memory has complications – Must transfer to host memory, and then back – If 10% of instructions are LD/ST and the other instructions are 10 times faster: 1/(0.1 + 0.9/10) ≈ speedup of 5.3 – If the other instructions are 100 times faster: 1/(0.1 + 0.9/100) ≈ speedup of 9.2
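The speedup arithmetic on this slide is Amdahl's law, and can be checked directly:

```python
# Amdahl's law as used on the slide: if a fraction of instructions
# (here the 10% that are loads/stores) cannot be accelerated, speeding
# up the rest by a factor s gives overall speedup 1/(f + (1-f)/s).

def overall_speedup(serial_fraction, s):
    """Overall speedup when (1 - serial_fraction) of work is s times faster."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / s)

print(round(overall_speedup(0.1, 10), 2))    # 5.26
print(round(overall_speedup(0.1, 100), 2))   # 9.17
```

Even with infinitely fast compute, the 10% of memory traffic caps the speedup at 10x, which is why host-device transfer overhead matters so much.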

39 Applications – real-time physics

Applications – protein folding 40

Applications – fluid dynamics 41

Applications – bitonic sorting 42

Applications – n-body problems 43
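As a sketch of why the n-body problem above maps so well onto GPUs: each body's acceleration depends on every other body, an O(n^2) all-pairs computation where every (i, j) pair is independent, so one thread can own one body. The 1D positions, unit masses, G = 1, and softening constant below are simplifying assumptions for illustration:

```python
# Naive all-pairs n-body acceleration in 1D with unit masses and G = 1.
# The small softening term avoids division by zero for coincident bodies.

def accelerations(pos, soft=1e-9):
    """Acceleration on each body from all the others (1D, unit masses)."""
    acc = []
    for i, xi in enumerate(pos):
        a = 0.0
        for j, xj in enumerate(pos):
            if i != j:
                d = xj - xi
                a += d / (abs(d) ** 3 + soft)  # ~ m/d^2, with sign
            # each (i, j) interaction is independent: ideal for SIMT,
            # with one GPU thread per body i
        acc.append(a)
    return acc

print(accelerations([0.0, 1.0]))  # two bodies pull on each other equally and oppositely
```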

44 Conclusion GPUs and CPUs fill different niches in the market for high-performance architecture – GPUs: large throughput; hidden latency; fairly simple, but costly, programs; special purpose – CPUs: low latency; complex programs; general purpose Both will likely always be needed; combinations of CPUs and GPUs can be much faster than either alone CPUs are becoming multi-core and parallel GPUs are adding general-purpose cores