Dr A Sahu, Dept of Comp Sc & Engg, IIT Guwahati


Graphics System
GPU Architecture
Memory Model – Vertex Buffer, Texture Buffer
GPU Programming Model – DirectX, OpenGL, OpenCL
GPGPU Programming – Introduction to NVIDIA CUDA Programming

The graphics pipeline, from application to frame buffer: a 3D application issues 3D API commands (OpenGL or Direct3D) which cross the CPU–GPU boundary as a GPU command and data stream. The programmable vertex processor turns pretransformed vertices into transformed vertices; primitive assembly uses the vertex index stream to build polygons, lines, and points; rasterization and interpolation generate a pixel location stream of rasterized, pretransformed fragments; the programmable fragment processors produce transformed fragments; and the raster operation stage writes pixel updates to the frame buffer.

Vertices (x, y, z) enter vertex processing (the vertex shader) and pixels (R, G, B) come from pixel processing (the pixel shader); both stages access the memory system, which contains texture memory and the frame buffer.

Primitives are processed in a series of stages; each stage forwards its result on to the next. The pipeline can be drawn and implemented in different ways: some stages may be in hardware, others in software, and optimizations and additional programmability are available at some stages. Stages: Modeling Transformations → Illumination (Shading) → Viewing Transformation (Perspective / Orthographic) → Clipping → Projection (to Screen Space) → Scan Conversion (Rasterization) → Visibility / Display.

Graphics pipeline (simplified): IN → Vertex Shader (object space) → Pixel Shader (window space) → OUT (framebuffer), with textures available to both shaders.

The computing capacity of graphics processing units (GPUs) has improved exponentially over the past decade. NVIDIA released CUDA, a programming model for GPUs. The CUDA programming environment applies the parallel processing capabilities of GPUs to medical image processing research.

480 CUDA cores (Compute Unified Device Architecture), Microsoft DirectX 11 support, 3D Vision Surround ready, interactive ray tracing, 3-way SLI technology, PhysX technology, CUDA technology, 32x anti-aliasing, PureVideo HD technology, PCI Express 2.0 support, dual-link DVI support, HDMI 1.4.

This generation was the first with fully programmable graphics cards; different versions have different resource limits on fragment/vertex programs. (Pipeline: AGP → Programmable Vertex Shader → Primitive Assembly → Rasterization and Interpolation → Programmable Fragment Processor → Raster Operations → Frame Buffer.)

Writing assembly is painful, not portable, and not optimizable. High-level shading languages – Cg, HLSL – solve these problems.

CPU and GPU memory hierarchy: Disk ↔ CPU main memory (with CPU caches and CPU registers) ↔ GPU video memory (with GPU caches, GPU temporary registers, and GPU constant registers).

The GPU has much more restricted memory access – allocate/free memory only before computation; limited memory access during computation (kernel). Registers – read/write. Local memory – does not exist. Global memory – read-only during computation, write-only at the end of computation (to a pre-computed address). Disk access – does not exist.

On the CPU, at any program point you can allocate/free local or global memory and make random memory accesses. Registers – read/write. Local memory – read/write to the stack. Global memory – read/write to the heap. Disk – read/write to disk.

Where is GPU data stored? In the vertex buffer, frame buffer, and textures. (Figure: VS 3.0 GPUs – Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s), with texture memory accessible to the processors.)

Each GPU memory type supports a subset of the following operations, through a CPU interface and a GPU interface.

CPU interface – Allocate – Free – Copy CPU → GPU – Copy GPU → CPU – Copy GPU → GPU – Bind for read-only vertex stream access – Bind for read-only random access – Bind for write-only framebuffer access

GPU (shader/kernel) interface – Random-access read – Stream read

Vertex Buffers: GPU memory for vertex data. Vertex data is required to initiate a render pass.

Supported operations – CPU interface: Allocate; Free; Copy CPU → GPU; Copy GPU → GPU (render-to-vertex-array); Bind for read-only vertex stream access. GPU interface: Stream read (vertex program only).

Limitations – CPU: no copy GPU → CPU; no bind for read-only random access; no bind for write-only framebuffer access. GPU: no random-access reads; no access from fragment programs.

Textures: random-access GPU memory.

Supported operations – CPU interface: Allocate; Free; Copy CPU → GPU; Copy GPU → CPU; Copy GPU → GPU (render-to-texture); Bind for read-only random access (vertex or fragment); Bind for write-only framebuffer access. GPU interface: Random read.

Framebuffers: write-only GPU memory, written by the fragment processor.

Fixed-function pipeline – made early games look fairly similar: little freedom in rendering, "one way to do things" (e.g. glShadeModel(GL_SMOOTH)). Different render methods: triangle rasterization proved to be very efficient to implement in hardware; ray tracing and voxels produce nice results but are very slow and require large amounts of memory.

DirectX before version 8 was entirely fixed function, as was OpenGL before version 2.0 – extensions were often added for different effects, but there was no real programmability on the GPU. OpenGL is just a specification: vendors must implement it, but on whatever platform they wish. DirectX is a library, Windows only – Direct3D is its graphics component.

Direct3D 8.0 (2000), OpenGL 2.0 (2004) added support for assembly language programming of vertex and fragment shaders. – NVIDIA GeForce 3, ATI Radeon 8000 Direct3D 9.0 (2002) added HLSL (High Level Shader Language) for much easier programming of GPUs. – NVIDIA GeForce FX 5000, ATI Radeon 9000 Minor increments on this for a long time, with more capabilities being added to shaders.

Vertex data is sent in by the graphics API – mostly OpenGL or DirectX. It is processed in a vertex program ("vertex shader"), rasterized into pixels, and processed in a "fragment shader". (Vertex Data → Vertex Shader → Rasterize to Pixels → Fragment Shader → Output.)

No longer need to write shaders in assembly GLSL, HLSL, Cg, offer C style programming languages Write two main() functions, which are executed on each vertex/pixel Declare auxiliary functions, local variables Output by setting position and color

Prior to Direct3D 10/GeForce 8000/Radeon 2000, vertex and fragment shaders were executed in separate hardware. Direct3D 10 (with Vista) brought shader unification and added geometry shaders – GPUs now used the same "cores" to run geometry/vertex/fragment shader code. CUDA came out alongside the GeForce 8000 line, allowing the "cores" to run general C code rather than being restricted to graphics APIs.

(Figure: unified pipeline – 3D geometric primitives flow through the GPU's programmable unified processors, which run vertex, geometry, pixel, and compute programs alongside rasterization and hidden surface removal, backed by GPU memory (DRAM) and producing the final image.)

CUDA was the first to drop the graphics API, allowing the GPU to be treated as a coprocessor to the CPU – linear memory accesses (no more buffer objects); thousands of threads run on separate scalar cores (with limitations); high theoretical/achieved performance for data-parallel applications. ATI has the Stream SDK – closer to assembly-language programming.

Apple announced the OpenCL initiative in 2008 – officially owned by the Khronos Group, the same body that controls OpenGL. Released in 2009, with support from NVIDIA/ATI, it is another specification for parallel programming, not entirely specific to GPUs (support for CPU SSE instructions, etc.). DirectX11 (and a Direct3D10 extension) added DirectCompute shaders – a similar idea to OpenCL, just tied in with Direct3D.

DirectX11 also adds multithreaded rendering and tessellation stages to the pipeline – two new shader stages in the unified pipeline, Hull and Domain shaders – allowing high-detail geometry to be created on the GPU rather than flooding the PCI-E bus with geometry data; more programmable geometry. OpenGL 4 (specification just released) is close to feature parity with Direct3D11 – namely, it also adds tessellation.

Newest GPUs have incredible compute power – 1-3 TFlops, 100+ GB/s memory access bandwidth More parallel constructs – High speed atomic operations, more control over thread interaction/synchronization. Becoming easier to program – NVIDIA’s ‘Fermi’ architecture has support for C++ code, 64bit pointers, etc. GPU computing starting to go mainstream – Photoshop5, Video encode/decode, physics/fluid simulation, etc.

GPUs are fast… – 3.0 GHz dual-core Pentium 4: 24.6 GFLOPS – NVIDIA GeForce 7800: 165 GFLOPS – 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s – ATI Radeon X850 XT Platinum Edition: 37.8 GB/s. GPUs are getting faster, faster – CPUs: 1.4× annual growth; GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth.

Modern GPUs are deeply programmable – Programmable pixel, vertex, video engines – Solidifying high-level language support Modern GPUs support high precision – 32 bit floating point throughout the pipeline – High enough for many (not all) applications

GPUs designed for & driven by video games – Programming model unusual – Programming idioms tied to computer graphics – Programming environment tightly constrained Underlying architectures are: – Inherently parallel – Rapidly evolving (even in basic feature set!) – Largely secret Can’t simply “port” CPU code!

Application specifies geometry → rasterized. Each fragment is shaded with a SIMD program. Shading can use values from texture memory. The image can be used as a texture on future passes.

Draw a screen-sized quad → stream. Run a SIMD kernel over each fragment. "Gather" is permitted from texture memory. The resulting buffer can be treated as a texture on the next pass.

Introduced in November 2006, CUDA converts the GPU into a general-purpose processor. It required hardware changes – it is only available on GeForce 8000 series or newer GPUs. It is implemented as an extension to C/C++, which results in a lower learning curve.

16 Streaming Multiprocessors (SMs) – each one has 8 Streaming Processors (SPs); each SM can execute 32 threads simultaneously, so 512 threads execute per cycle; the SPs hide instruction latencies. 768 MB DRAM – 86.4 GB/s memory bandwidth to the GPU cores; 4 GB/s bandwidth to system memory.

(Figure: architecture overview – the host feeds an input assembler and thread execution manager; the processing cores share a parallel data cache and texture units, with load/store access to global memory.)

CUDA Execution Model: execution starts with a kernel – a function called from the host that executes on the GPU. Thread resources are abstracted into 3 levels – Grid (highest level), Block (collection of threads), Thread (execution unit).

768 MB global memory – accessible to all threads globally; 86.4 GB/s throughput. 16 KB shared memory per SM – accessible to all threads within a block; 384 GB/s throughput. 32 KB register file per SM – allocated to threads at runtime (local variables); 384 GB/s throughput; threads can only see their own registers.

(Figure: CUDA memory model – the host and the grid share global memory; each block, e.g. Block (0, 0) and Block (1, 0), has its own shared memory; each thread, e.g. Thread (0, 0) and Thread (1, 0), has its own registers.)

(From a C/C++ function:) Allocate memory on the CUDA device. Copy data to the CUDA device. Configure thread resources – grid layout (max 65536 × 65536), block layout (3-dimensional, max of 512 threads). Execute the kernel with those thread resources. Copy data out of the CUDA device. Free memory on the CUDA device.

Multiply matrices M and N to form result R. General algorithm – for each row i in matrix R, for each column j in matrix R: cell (i, j) = dot product of row i of M and column j of N. The algorithm runs in O(length³).

Each thread represents cell (i, j) and calculates the value for cell (i, j). Using a single block, this should run in O(length) – much better than O(length³).

(Figure: matrices M and N multiplied to produce P, each WIDTH × WIDTH.)

The maximum number of threads allowed per block is 512, so this approach only supports a max matrix size of 22×22 (484 threads needed; 23×23 would require 529).

Split the result matrix into smaller blocks. This utilizes more SMs than the single-block approach and gives better speed-up.

(Figure: tiled multiplication – each block (bx, by) computes one TILE_WIDTH × TILE_WIDTH sub-matrix Pd_sub of Pd from tiles of Md and Nd, with threads indexed by (tx, ty); all matrices are WIDTH × WIDTH.)

Runs 10 times as fast as the serial approach. The solution runs at 21.4 GFLOPS, while the GPU is capable of 384 GFLOPS – what gives?

Each block is assigned to an SM (8 SPs per SM). The SM switches to another batch of threads when a long-latency operation is found – this works similarly to Intel's Hyper-Threading. The SM executes a batch of 32 threads at a time; a batch of 32 threads is called a warp.

Global memory bandwidth is 86.4 GB/s; shared memory bandwidth is 384 GB/s; register file bandwidth is 384 GB/s. The key is to use shared memory and registers whenever possible.

Each SM has 16 KB of shared memory and a 32 KB register file. Local variables in a function take up registers. The register file must support all threads on the SM – if there are not enough registers, fewer blocks are scheduled; the program still executes, but less parallelism occurs.

An SM can only handle 768 threads, and can handle 8 blocks – 1 block for each SP. Each block can therefore have up to 96 threads to max out the SM's resources.

Intel's new approach to a GPU: considered to be a hybrid between a multi-core CPU and a GPU, it combines the functions of a multi-core CPU with the functions of a GPU.
