


Presentation on theme: "Dr A Sahu Dept of Comp Sc & Engg. IIT Guwahati 1."— Presentation transcript:

1 Dr A Sahu Dept of Comp Sc & Engg. IIT Guwahati

2 Graphics System – GPU Architecture – Memory Model (vertex buffer, texture buffer) – GPU Programming Model (DirectX, OpenGL, OpenCL) – GPGPU Program (introduction to NVIDIA CUDA programming)

3 [Graphics pipeline diagram] 3D application → 3D API (OpenGL, DirectX/Direct3D) → CPU–GPU boundary (GPU command & data stream) → Programmable Vertex Processor (pre-transformed vertices → transformed vertices) → Primitive Assembly (vertex index stream → assembled polygons, lines & points) → Rasterisation & Interpolation (pixel location stream → rasterized pre-transformed fragments) → Programmable Fragment Processors (transformed fragments) → Raster Operations (pixel updates) → Frame Buffer

4 [Diagram] Vertices (x, y, z) → Vertex Processing (vertex shader) → Pixel Processing (pixel shader) → Pixels (R, G, B); the memory system provides texture memory and the frame buffer

5 Primitives are processed in a series of stages; each stage forwards its result on to the next. The pipeline can be drawn and implemented in different ways: some stages may be in hardware, others in software, and optimizations & additional programmability are available at some stages. Stages: Modeling Transformations → Illumination (Shading) → Viewing Transformation (Perspective / Orthographic) → Clipping → Projection (to Screen Space) → Scan Conversion (Rasterization) → Visibility / Display

6 [Pipeline diagram] Modeling Transformations → Illumination (Shading) → Viewing Transformation (Perspective / Orthographic) → Clipping → Projection (to Screen Space) → Scan Conversion (Rasterization) → Visibility / Display

7 Graphics pipeline (simplified): object space → Vertex Shader → window space → Pixel Shader (reads textures) → framebuffer

8 The computing capacity of graphics processing units (GPUs) has improved exponentially in the last decade. NVIDIA released the CUDA programming model for its GPUs. The CUDA programming environment applies the parallel processing capabilities of GPUs to medical image processing research.

9 CUDA (Compute Unified Device Architecture) cores: 480 – Microsoft DirectX 11 support – 3D Vision Surround ready – interactive ray tracing – 3-way SLI technology – PhysX technology – CUDA technology – 32x anti-aliasing technology – PureVideo HD technology – PCI Express 2.0 support – dual-link DVI support, HDMI 1.4

10 This generation is the first generation of fully programmable graphics cards; different versions have different resource limits on fragment/vertex programs. [Pipeline diagram] AGP → Vertex Transforms (Programmable Vertex Shader) → Primitive Assembly → Rasterization and Interpolation → Programmable Fragment Processor → Raster Operations → Frame Buffer

11 Writing assembly is – painful – not portable – not optimizable. High-level shading languages solve these: Cg, HLSL

12 CPU and GPU Memory Hierarchy [diagram]: Disk → CPU main memory → CPU caches → CPU registers; GPU video memory → GPU caches → GPU temporary registers and GPU constant registers

13 Much more restricted memory access – Allocate/free memory only before computation – Limited memory access during computation (kernel) Registers – Read/write Local memory – Does not exist Global memory – Read-only during computation – Write-only at end of computation (pre-computed address) Disk access – Does not exist

14 At any program point – Allocate/free local or global memory – Random memory access Registers – Read/write Local memory – Read/write to stack Global memory – Read/write to heap Disk – Read/write to disk

15 Where is GPU data stored? – Vertex buffer – Frame buffer – Texture [Diagram (VS 3.0 GPUs): Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s), with Texture accessible along the pipeline]

16 Each GPU memory type supports a subset of the following operations – CPU interface – GPU interface

17 CPU interface – Allocate – Free – Copy CPU → GPU – Copy GPU → CPU – Copy GPU → GPU – Bind for read-only vertex stream access – Bind for read-only random access – Bind for write-only framebuffer access

18 GPU (shader/kernel) interface – Random-access read – Stream read

19 Vertex buffers: GPU memory for vertex data; vertex data is required to initiate a render pass. [Diagram (VS 3.0 GPUs): Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s)]

20 Supported operations – CPU interface: Allocate; Free; Copy CPU → GPU; Copy GPU → GPU (render-to-vertex-array); Bind for read-only vertex stream access – GPU interface: Stream read (vertex program only)

21 Limitations – CPU: No copy GPU → CPU; No bind for read-only random access; No bind for write-only framebuffer access – GPU: No random-access reads; No access from fragment programs

22 Textures: random-access GPU memory. [Diagram (VS 3.0 GPUs): Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s), with Texture readable from both processors]

23 Supported operations – CPU interface: Allocate; Free; Copy CPU → GPU; Copy GPU → CPU; Copy GPU → GPU (render-to-texture); Bind for read-only random access (vertex or fragment); Bind for write-only framebuffer access – GPU interface: Random read

24 Framebuffers: write-only GPU memory, written by the fragment processor. [Diagram (VS 3.0 GPUs): Vertex Buffer → Vertex Processor → Rasterizer → Fragment Processor → Frame Buffer(s)]

25 Fixed-function pipeline – Made early games look fairly similar – Little freedom in rendering – "One way to do things": glShadeModel(GL_SMOOTH); Different render methods – Triangle rasterization proved very efficient to implement in hardware – Ray tracing and voxels produce nice results but are very slow and require large amounts of memory

26 DirectX before version 8 was entirely fixed-function, as was OpenGL before version 2.0 – Extensions were often added for different effects, but there was no real programmability on the GPU. OpenGL is just a specification – vendors must implement the specification, but on whatever platform they wish. DirectX is a library, Windows only – Direct3D is the graphics component

27 Direct3D 8.0 (2000), OpenGL 2.0 (2004) added support for assembly language programming of vertex and fragment shaders. – NVIDIA GeForce 3, ATI Radeon 8000 Direct3D 9.0 (2002) added HLSL (High Level Shader Language) for much easier programming of GPUs. – NVIDIA GeForce FX 5000, ATI Radeon 9000 Minor increments on this for a long time, with more capabilities being added to shaders.

28 Vertex data is sent in by the graphics API – mostly OpenGL or DirectX. It is processed in a vertex program ("vertex shader"), rasterized into pixels, then processed in a "fragment shader". [Diagram: Vertex Data → Vertex Shader → Rasterize to Pixels → Fragment Shader → Output]

29 No longer any need to write shaders in assembly. GLSL, HLSL, and Cg offer C-style programming languages. Write two main() functions, which are executed on each vertex/pixel. Declare auxiliary functions and local variables. Output by setting position and color.

30 Prior to Direct3D 10 / GeForce 8000 / Radeon 2000, vertex and fragment shaders were executed on separate hardware. Direct3D 10 (with Vista) brought shader unification and added geometry shaders. – GPUs now used the same 'cores' to run geometry/vertex/fragment shader code. CUDA came out alongside the GeForce 8000 line, allowing the 'cores' to run general C code rather than being restricted to graphics APIs.

31 [Unified-architecture diagram] 3D geometric primitives → GPU programmable unified processors (vertex programs, geometry programs, pixel programs, compute programs), with rasterization and hidden-surface removal → GPU memory (DRAM) → final image

32 CUDA was the first to drop the graphics API, allowing the GPU to be treated as a coprocessor to the CPU. – Linear memory accesses (no more buffer objects) – Run thousands of threads on separate scalar cores (with limitations) – High theoretical/achieved performance for data-parallel applications. ATI has the Stream SDK – closer to assembly-language programming for Stream

33 Apple announced the OpenCL initiative in 2008 – Officially owned by the Khronos Group, the same body that controls OpenGL – Released in 2009, with support from NVIDIA/ATI – Another specification for parallel programming, not entirely specific to GPUs (support for CPU SSE instructions, etc.). DirectX 11 (and a Direct3D 10 extension) adds DirectCompute shaders – similar idea to OpenCL, just tied in with Direct3D

34 DirectX 11 also adds multithreaded rendering and tessellation stages to the pipeline – Two new shader stages in the unified pipeline: Hull and Domain shaders – Allow high-detail geometry to be created on the GPU, rather than flooding the PCI-E bus with geometry data – More programmable geometry. OpenGL 4 (specification just released) is close to feature parity with Direct3D 11 – namely, it also adds tessellation

35 Newest GPUs have incredible compute power – 1–3 TFLOPS, 100+ GB/s memory access bandwidth. More parallel constructs – High-speed atomic operations, more control over thread interaction/synchronization. Becoming easier to program – NVIDIA's 'Fermi' architecture has support for C++ code, 64-bit pointers, etc. GPU computing is starting to go mainstream – Photoshop CS5, video encode/decode, physics/fluid simulation, etc.

36 GPUs are fast… – 3.0 GHz dual-core Pentium 4: 24.6 GFLOPS – NVIDIA GeForce 7800: 165 GFLOPS – 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s – ATI Radeon X850 XT Platinum Edition: 37.8 GB/s. GPUs are getting faster, faster – CPUs: 1.4× annual growth – GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth

37 Modern GPUs are deeply programmable – Programmable pixel, vertex, video engines – Solidifying high-level language support Modern GPUs support high precision – 32 bit floating point throughout the pipeline – High enough for many (not all) applications

38 GPUs designed for & driven by video games – Programming model unusual – Programming idioms tied to computer graphics – Programming environment tightly constrained Underlying architectures are: – Inherently parallel – Rapidly evolving (even in basic feature set!) – Largely secret Can’t simply “port” CPU code!

39 Application specifies geometry → rasterized. Each fragment is shaded with a SIMD program. Shading can use values from texture memory. The image can be used as a texture on future passes.

40 Draw a screen-sized quad → stream. Run a SIMD kernel over each fragment. "Gather" is permitted from texture memory. The resulting buffer can be treated as a texture on the next pass.

41 Introduced in November 2006. Turns the GPU into a general-purpose processor. Required hardware changes – Only available on the G80 or later GPUs (GeForce 8000 series or newer). Implemented as an extension to C/C++ – results in a lower learning curve

42 16 Streaming Multiprocessors (SMs) – Each one has 8 Streaming Processors (SPs) – Each SM can execute 32 threads simultaneously – 512 threads execute per cycle – SPs hide instruction latencies. 768 MB DRAM – 86.4 GB/s memory bandwidth to the GPU cores – 4 GB/s bandwidth to system memory

43 [G80 block diagram] Host → Input Assembler → Thread Execution Manager → processor array with parallel data caches, texture units, and load/store paths → Global Memory

44 CUDA Execution Model – Starts with a kernel. A kernel is a function, called from the host, that executes on the GPU. Thread resources are abstracted into 3 levels – Grid – highest level – Block – collection of threads – Thread – execution unit
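The three-level hierarchy can be illustrated with a minimal kernel (a sketch; the kernel name, array, and launch shape are illustrative, not from the slides):

```cuda
// Each thread computes one array element. Its global index combines
// its block's position in the grid with its position within the block.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)             // guard: the grid may contain more threads than n
        data[i] += 1.0f;
}

// Launch a grid of 4 blocks, each with 256 threads:
// addOne<<<4, 256>>>(d_data, n);
```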

45 CUDA Execution Model

46 768 MB global memory – Accessible to all threads globally – 86.4 GB/s throughput. 16 KB shared memory per SM – Accessible to all threads within a block – 384 GB/s throughput. 32 KB register file per SM – Allocated to threads at runtime (local variables) – 384 GB/s throughput – Threads can only see their own registers

47 [CUDA memory model diagram] Host ↔ Global Memory, visible to the whole grid; each block – Block (0, 0), Block (1, 0) – has its own Shared Memory; each thread – Thread (0, 0), Thread (1, 0) – has its own Registers

48 (From a C/C++ function) Allocate memory on the CUDA device. Copy data to the CUDA device. Configure thread resources – Grid layout (max 65536 × 65536) – Block layout (3-dimensional, max of 512 threads). Execute the kernel with the thread resources. Copy data out of the CUDA device. Free memory on the CUDA device.
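The six host-side steps above can be sketched as follows (a minimal sketch with error checking omitted; `myKernel` stands in for any __global__ function):

```cuda
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n);            // illustrative kernel

void runOnDevice(const float *h_in, float *h_out, int n) {
    float *d_buf;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&d_buf, bytes);                               // 1. allocate on device
    cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice);  // 2. copy data in

    dim3 grid((n + 255) / 256);                              // 3. configure thread
    dim3 block(256);                                         //    resources
    myKernel<<<grid, block>>>(d_buf, n);                     // 4. execute kernel

    cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost); // 5. copy data out
    cudaFree(d_buf);                                         // 6. free device memory
}
```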

49 Multiply matrices M and N to form result R. General algorithm – For each row i in matrix R – For each column j in matrix R – Cell (i, j) = dot product of row i of M and column j of N. The algorithm runs in O(length³)
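As a reference point, the serial version of this algorithm for square matrices of side `width` (row-major storage; the function name is illustrative):

```cuda
// Serial matrix multiply: R = M * N, all width x width, row-major.
void matMulSerial(const float *M, const float *N, float *R, int width) {
    for (int i = 0; i < width; ++i)           // each row of R
        for (int j = 0; j < width; ++j) {     // each column of R
            float sum = 0.0f;                 // dot product of row i and column j
            for (int k = 0; k < width; ++k)
                sum += M[i * width + k] * N[k * width + j];
            R[i * width + j] = sum;
        }                                     // three nested loops: O(width^3)
}
```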

50 Each thread represents cell (i, j). Calculate the value for cell (i, j). Use a single block. Should run in O(length) – much better than O(length³)
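A sketch of the corresponding single-block kernel (assuming the row-major layout above; names are illustrative): the two outer loops of the serial version are replaced by the thread grid, leaving each thread one O(width) dot product:

```cuda
// One block; thread (threadIdx.y, threadIdx.x) computes cell (i, j) of R.
__global__ void matMulSimple(const float *M, const float *N,
                             float *R, int width) {
    int i = threadIdx.y;                      // row of R
    int j = threadIdx.x;                      // column of R
    float sum = 0.0f;
    for (int k = 0; k < width; ++k)
        sum += M[i * width + k] * N[k * width + j];
    R[i * width + j] = sum;
}

// Launch with a single width x width block:
// matMulSimple<<<1, dim3(width, width)>>>(d_M, d_N, d_R, width);
```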

51 [Diagram: matrices M, N, and P, each WIDTH × WIDTH]

52

53 The max threads allowed per block is 512, which only supports a max matrix size of 22×22 – 484 threads needed

54 Split the result matrix into smaller blocks. Utilizes more SMs than the single-block approach. Better speed-up.
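A sketch of the tiled kernel (assuming width is a multiple of TILE_WIDTH): each block computes one TILE_WIDTH × TILE_WIDTH sub-matrix of the result, staging tiles of M and N through shared memory so each global-memory value is loaded once per tile rather than once per thread:

```cuda
#define TILE_WIDTH 16

__global__ void matMulTiled(const float *M, const float *N,
                            float *R, int width) {
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];   // tile of M
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];   // tile of N

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < width / TILE_WIDTH; ++t) {
        // Each thread cooperatively loads one element of each tile.
        Ms[threadIdx.y][threadIdx.x] =
            M[row * width + t * TILE_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] =
            N[(t * TILE_WIDTH + threadIdx.y) * width + col];
        __syncthreads();                 // wait until the tile is fully loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                 // finish with this tile before reloading
    }
    R[row * width + col] = sum;
}
```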

55 [Diagram: tiled matrix multiply – Md, Nd, Pd, each WIDTH × WIDTH; block (bx, by) computes the TILE_WIDTH × TILE_WIDTH sub-matrix Pdsub, with thread indices tx, ty ranging over 0 … TILE_WIDTH−1]

56

57 Runs 10 times as fast as the serial approach. The solution runs at 21.4 GFLOPS – the GPU is capable of 384 GFLOPS – what gives?

58 Each block is assigned to an SM – 8 SPs per SM. The SM executes one block at a time and switches when a long-latency operation is found – works similarly to Intel's Hyper-Threading. The SM executes a batch of 32 threads at a time – a batch of 32 threads is called a warp.

59 Global memory bandwidth is 86.4 GB/s. Shared memory bandwidth is 384 GB/s. Register file bandwidth is 384 GB/s. The key is to use shared memory and registers whenever possible.

60 Each SM has 16 KB of shared memory. Each SM has a 32 KB register file. Local variables in a function take up registers. The register file must support all threads in the SM – If there are not enough registers, fewer blocks are scheduled – The program still executes, but less parallelism occurs.

61 An SM can only handle 768 threads. An SM can handle up to 8 blocks. With 8 blocks, each block can have up to 96 threads – maxing out the SM's resources.

62 Intel's new approach to a GPU. Considered to be a hybrid between a multi-core CPU and a GPU, combining the functions of a multi-core CPU with those of a GPU.

63

