CS427 Multicore Architecture and Parallel Computing

Lecture 6: GPU Architecture
Prof. Xiaoyao Liang
2016/10/13

GPU Scaling
A quiet revolution and potential build-up (GPU vs. CPU):
• Computation: 936 GFLOPS vs. 102 GFLOPS
• Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
• Every PC, phone, and tablet has a GPU now

GPU Speedup
GeForce 8800 GTX vs. 2.2 GHz Opteron 248
• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
• 25× to 400× speedup if the function's data requirements and control flow suit the GPU and the application is optimized

GPU Speedup

Early Graphics Hardware

Early Electronic Machine

Early Graphics Chip

Graphics Pipeline
• Sequence of operations to generate an image using object-order processing
  - Primitives are processed one at a time
• Software pipeline: e.g., RenderMan
  - High quality and efficiency for large scenes
• Hardware pipeline: e.g., graphics accelerators
• We will cover the algorithms of the modern hardware pipeline
  - But they evolve drastically every few years
  - We will only look at triangles

Graphics Pipeline
• Handles only simple primitives by design
  - Points, lines, triangles, quads (as two triangles)
  - Efficient algorithms exist for these
• Complex primitives are handled by tessellation (see the sketch below)
  - Complex curves: tessellate into line strips
  - Curved surfaces: tessellate into triangle meshes
• The "pipeline" name derives from the architecture's design
  - A sequence of stages with defined inputs/outputs
  - Easy-to-optimize, modular design
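
As a concrete illustration of tessellating a complex curve into a line strip, here is a minimal host-side sketch (not from the slides): it uniformly samples a quadratic Bézier curve. The segment count and the float2h helper type are assumptions made for the example.

```cuda
#include <cstdio>

struct float2h { float x, y; };  // simple 2D point (host-side helper)

// Evaluate a quadratic Bezier curve at parameter t in [0, 1].
float2h bezier2(float2h p0, float2h p1, float2h p2, float t) {
    float u = 1.0f - t;
    float2h p;
    p.x = u * u * p0.x + 2.0f * u * t * p1.x + t * t * p2.x;
    p.y = u * u * p0.y + 2.0f * u * t * p1.y + t * t * p2.y;
    return p;
}

int main() {
    const int N_SEGMENTS = 8;  // tessellation density (assumed for the example)
    float2h p0 = {0, 0}, p1 = {0.5f, 1}, p2 = {1, 0};
    // Uniformly sample the curve into a line strip of N_SEGMENTS segments.
    for (int i = 0; i <= N_SEGMENTS; ++i) {
        float2h p = bezier2(p0, p1, p2, (float)i / N_SEGMENTS);
        printf("vertex %d: (%f, %f)\n", i, p.x, p.y);
    }
    return 0;
}
```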

Graphics Pipeline

Pipeline Stages
• Vertex processing
  - Input: vertex data (position, normal, color, etc.)
  - Output: transformed vertices in the homogeneous canonical view volume, colors, etc.
  - Applies the transformation from object space to clip space (see the sketch below)
  - Passes along material and shading data
• Clipping and rasterization
  - Turns sets of vertices into primitives and fills them in
  - Output: a set of fragments with interpolated data
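
As an illustration of the vertex-processing stage, here is a minimal sketch of the object-space-to-clip-space transform written as a CUDA kernel; the row-major matrix layout and the kernel form are assumptions made for the example, not how fixed-function hardware implements the stage.

```cuda
// Per-vertex transformation sketch. Assumptions: positions are float4
// (x, y, z, w = 1) and mvp is a 4x4 model-view-projection matrix stored
// row-major in a 16-float array.
__global__ void transformVertices(const float4 *in, float4 *out,
                                  const float *mvp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 v = in[i];
    float4 r;
    r.x = mvp[0]*v.x  + mvp[1]*v.y  + mvp[2]*v.z  + mvp[3]*v.w;
    r.y = mvp[4]*v.x  + mvp[5]*v.y  + mvp[6]*v.z  + mvp[7]*v.w;
    r.z = mvp[8]*v.x  + mvp[9]*v.y  + mvp[10]*v.z + mvp[11]*v.w;
    r.w = mvp[12]*v.x + mvp[13]*v.y + mvp[14]*v.z + mvp[15]*v.w;
    out[i] = r;  // clip-space position; the perspective divide happens later
}
```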

Pipeline Stages
• Fragment processing
  - Output: final color and depth
  - Traditionally mostly texture lookups; lighting was computed per vertex
  - Today, lighting is computed per pixel
• Frame buffer processing (see the sketch below)
  - Output: the final picture
  - Hidden-surface elimination
  - Compositing via alpha blending
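
The frame-buffer operations can be sketched as code. This is an illustration, not the hardware's implementation: a z-buffer test for hidden-surface elimination followed by standard "over" alpha blending. The Fragment struct and flat buffer layout are assumptions, and the read-modify-write race that real hardware serializes is ignored here.

```cuda
// Per-fragment frame-buffer processing sketch: depth test, then
// "over" alpha blending: C = a*C_src + (1 - a)*C_dst.
struct Fragment { float r, g, b, a; float depth; int x, y; };

__device__ void writeFragment(Fragment f, float4 *color, float *zbuf,
                              int width) {
    int idx = f.y * width + f.x;
    if (f.depth >= zbuf[idx]) return;       // hidden-surface elimination
    zbuf[idx] = f.depth;                    // fragment is the closest so far
    float4 dst = color[idx];
    float4 out;
    out.x = f.a * f.r + (1.0f - f.a) * dst.x;   // blend each channel
    out.y = f.a * f.g + (1.0f - f.a) * dst.y;
    out.z = f.a * f.b + (1.0f - f.a) * dst.z;
    out.w = f.a + (1.0f - f.a) * dst.w;
    color[idx] = out;
}
```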

Vertex Processing

Clipping

Rasterization

Anti-Aliasing

Texture

Gouraud Shading

Phong Shading

Alpha Blending

Wireframe

SGI Reality Engine (1997)

Graphics Pipeline Characteristics
• Simple algorithms can be mapped to hardware
• High performance using on-chip parallel execution
  - Highly parallel algorithms
  - Memory access tends to be coherent

Graphics Pipeline Characteristics
• Multiple arithmetic units
  - NVIDIA GeForce 7800: 8 vertex units, 24 pixel units
• Very small caches
  - Large caches are not needed since memory accesses are very coherent
• Fast memory architecture
  - Needed for color/z-buffer traffic
• Restricted memory access patterns
  - Read-modify-write
• Easy to make fast: this is what Intel would love!

Programmable Shader

Programmable Shader

Unified Shader

GeForce 8

GT200

GPU Evolution

Moore's Law
• Computers no longer get faster, just wider
• You must re-think your algorithms to be parallel!
• Data-parallel computing is the most scalable solution

GPGPU 1.0
GPU Computing 1.0: compute pretending to be graphics
• Disguise data as textures or geometry
• Disguise the algorithm as render passes
• Trick the graphics pipeline into doing your computation!
• The term GPGPU was coined by Mark Harris

GPUs Grow Fast
• GPUs get progressively more capable
  - Fixed-function → register combiners → shaders
  - fp32 pixel hardware greatly extends their reach
• Algorithms get more sophisticated
  - Cellular automata → PDE solvers → ray tracing
  - Clever graphics tricks
• High-level shading languages emerge

GPGPU 2.0
GPU Computing 2.0: direct compute
• Program the GPU directly, with no graphics-based restrictions (see the sketch below)
• GPU computing supplants graphics-based GPGPU
• November 2006: NVIDIA introduces CUDA
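
To make "direct compute" concrete, here is a minimal CUDA sketch in the style the slides describe: an ordinary data-parallel kernel launched from plain C++ code, with no textures or render passes involved. The vector-add example and the unified-memory allocation are choices made here for brevity.

```cuda
#include <cstdio>

// Each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the sketch short; explicit cudaMemcpy also works.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```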

GPGPU 3.0
GPU Computing 3.0: an emerging ecosystem
• Hardware & product lines
• Algorithmic sophistication
• Cross-platform standards
• Education & research
• Consumer applications
• High-level languages

GPGPU Platforms

Fermi

Fermi Architecture

SM Architecture

SM Architecture
• Each thread block is divided into 32-thread warps
  - This is an implementation decision, not part of the CUDA programming model
• Warps are the scheduling units in an SM
• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 × 3 = 24 warps
  - At any point in time, only one of the 24 warps will be selected for instruction fetch and execution

SM Architecture
• SM hardware implements zero-overhead warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected
• 4 clock cycles are needed to dispatch the same instruction for all threads in a warp
  - If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate 200-cycle memory latency (see the arithmetic below)
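
The 13-warp figure follows from the numbers on this slide: a warp issues 4 instructions between global memory accesses, each taking 4 cycles to dispatch, so one warp covers 4 × 4 = 16 cycles of useful work per 200-cycle load. Fully hiding the latency therefore takes ceil(200 / 16) = ceil(12.5) = 13 warps.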

SM Architecture
• All register operands of all instructions in the instruction buffer are scoreboarded
  - An instruction becomes ready after the needed values are deposited
  - This prevents hazards
  - Cleared instructions are eligible for issue
• Decoupled memory/processor pipelines
  - Any thread can continue to issue instructions until scoreboarding prevents issue
  - Allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops

SM Architecture
• Register file (RF)
  - 32 KB (8K entries of 32 bits) for each SM
  - Single read/write port, heavily banked
• The TEX pipe can also read/write the RF
• The Load/Store pipe can also read/write the RF

SM Architecture
• Registers are dynamically partitioned across all blocks/warps assigned to the SM (see the arithmetic below)
  - This is an implementation decision, not part of CUDA
  - Once assigned to a block, a register is NOT accessible by threads in other warps
  - Each thread can only access the registers assigned to itself
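
Combining this with the earlier slides gives a concrete register budget: with 8K = 8192 register entries per SM and the 3 blocks × 256 threads = 768 threads from the warp example, dynamic partitioning leaves at most floor(8192 / 768) = 10 registers per thread. A kernel that needs more registers per thread forces fewer blocks to be resident on the SM.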

SM Architecture
• Each SM has 16 KB of shared memory
  - 16 banks of 32-bit words
• CUDA uses shared memory as shared storage visible to all threads in a thread block
  - Read and write access (see the sketch below)
• Not used explicitly for pixel shader programs
  - We dislike pixels talking to each other
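
A minimal sketch of how a CUDA kernel uses shared memory as block-visible read/write storage. The tile-reversal example and the assumption that n is a multiple of the tile size are made up for illustration.

```cuda
// Threads in a block stage data in on-chip shared memory, synchronize,
// then read slots that other threads wrote. Each block reverses its own
// 256-element tile; assumes n is a multiple of TILE.
#define TILE 256

__global__ void reverseTiles(const float *in, float *out, int n) {
    __shared__ float tile[TILE];              // visible to the whole block
    int i = blockIdx.x * TILE + threadIdx.x;
    tile[threadIdx.x] = in[i];                // each thread fills one slot
    __syncthreads();                          // wait until the tile is full
    int j = blockIdx.x * TILE + (TILE - 1 - threadIdx.x);
    out[j] = tile[threadIdx.x];               // read a slot written by another thread
}
```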

SM Architecture
• Immediate-address constants/cache
• Indexed-address constants/cache
• Constants are stored in DRAM and cached on chip
  - One L1 constant cache per SM
• A constant value can be broadcast to all threads in a warp (see the sketch below)
  - An extremely efficient way of accessing a value that is common to all threads in a block!
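
A minimal sketch of the broadcast pattern: coefficients common to all threads are placed in __constant__ memory, so each warp's read is served as a single broadcast from the constant cache. The polynomial example is an assumption made for illustration.

```cuda
// Coefficients live in constant memory: cached on chip, read-only in kernels.
__constant__ float coeff[4];

__global__ void evalPoly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = x[i];
    // Every thread reads the same coeff[k] -> broadcast, no serialization.
    y[i] = ((coeff[3] * v + coeff[2]) * v + coeff[1]) * v + coeff[0];
}

// Host side: constants are set with cudaMemcpyToSymbol before launch, e.g.
//   float h[4] = {1.f, 0.5f, 0.25f, 0.125f};
//   cudaMemcpyToSymbol(coeff, h, sizeof(h));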

Bank Conflict
• Shared memory is as fast as registers if there are no bank conflicts
• The fast case (see the sketch below):
  - If all threads access different banks, there is no bank conflict
  - If all threads access the identical address, there is no bank conflict (broadcast)
• The slow case:
  - Bank conflict: multiple threads access the same bank
  - The accesses must be serialized
  - Cost = maximum number of simultaneous accesses to a single bank
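
The fast and slow cases can be made concrete with the 16-bank layout from the earlier slide: bank = word index mod 16, so consecutive 32-bit words fall in consecutive banks. The kernel below is an illustrative sketch, assuming a 256-thread block.

```cuda
// Three shared-memory access patterns, assuming 16 banks of 32-bit words.
__global__ void bankPatterns(float *out) {
    __shared__ float data[256];
    int t = threadIdx.x;
    data[t] = (float)t;
    __syncthreads();

    float a = data[t];              // fast: thread t hits bank t % 16,
                                    //       all banks distinct
    float b = data[0];              // fast: identical address -> broadcast
    float c = data[(t * 16) % 256]; // slow: every thread hits bank 0 at
                                    //       distinct addresses, so a 16-way
                                    //       conflict serialized 16x
    out[t] = a + b + c;
}
```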

Bank Conflict

Final Thought