
Graphics Hardware
Joshua Barczak, CMSC435, UMBC

Object-Order Rendering
Object-Space Primitives → (Modeling XForm) → World-Space Primitives → (Viewing XForm) → Camera-Space Primitives → (Projection XForm) → Clip-Space Primitives → (Clip) → Clip-Space Primitives → (Window/Viewport XForm) → Raster-Space Primitives → (Rasterize) → Pixels → Displayed Pixels

The Code

DrawTriangles( Vertex* vb, int* ib, int n_primitives )
{
    for( each primitive i )
    {
        // fetch the indices
        int indices[3] = { ib[3*i], ib[3*i+1], ib[3*i+2] };

        // fetch the vertices
        Vertex v[3] = { vb[indices[0]], vb[indices[1]], vb[indices[2]] };

        // Transform the vertices
        TransformedVertex tv[3] = { Process(v[0]), Process(v[1]), Process(v[2]) };

        // Clip/cull (may create more triangles)
        for( each triangle )
        {
            // Backface cull
            if( !Backfacing() )
            {
                // rasterization setup
                for( each rasterized pixel )
                {
                    // interpolate vertex attributes
                    // z test
                    // calculate color
                    // blend result into frame buffer
                }
            }
        }
    }
}

The Machine
Host: CPU, memory, application, device driver.
PCIe bus.
GPU: the DrawTriangles function, in ASIC form, with its own memory.

The Cheap Machine
CPU and GPU on a shared die; the application and device driver run on the CPU; one shared memory.

Our View of The Machine
API calls to:
- Manage memory
- Configure the pipeline: shader code (VS, PS) with their resource bindings, and fixed-function states (rasterizer, Z/stencil, alpha, output merger)
- Draw sets of triangles

The Rules
Rule #1: Don't be silly
- If you have 1000 triangles, do not make 1000 API calls
- glBegin/glEnd are evil

The Rules
Rule #1: Don't be silly
- Compute at the correct rate: per-vertex work is cheaper than per-pixel work, in general
- Simplify uniform expressions (a sketch follows):
  x = CONST * CONST → x = const
  x < SQRT(CONST) → x*x < const
  x = y/CONST → x = y*(1/const) = y*const
  x = pow(CONST, y) = exp(log(CONST)*y) → x = exp(const*y)
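As a concrete sketch of hoisting uniform math onto the CPU in C/OpenGL; the function, the uniform name u_invScale, and the surrounding shader are hypothetical, not from the lecture:

void SetInvScale( GLuint program, float scale )
{
    // computed once per draw on the CPU, instead of y/scale per pixel:
    float invScale = 1.0f / scale;
    glUseProgram( program );
    GLint loc = glGetUniformLocation( program, "u_invScale" );
    glUniform1f( loc, invScale );   // shader side becomes x = y * u_invScale
}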

The unstoppable march of time
Our thread turns our scene into draw calls; a driver thread (on another core) turns those into command buffers; a hardware queue feeds the GPU. While the GPU operates on frame N, the driver is building frame N+1 and our thread is already on frame N+2.

The Worst Thing Imaginable
CPU/GPU dependencies:
CPU: process draw calls → do nothing while we wait for the pixels → do whatever it is we're doing with those pixels → more draw calls → wait again…
GPU: Draw() Draw() Draw() ReadPixels() → do nothing while we wait for more draws → Draw() Draw() Draw() ReadPixels() …

The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
- Don't read the frame buffer
- If you use occlusion queries, wait a few frames before reading them (see the sketch below)
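A minimal sketch of that deferred readback with standard GL occlusion queries, assuming a query q issued a few frames ago with glBeginQuery(GL_SAMPLES_PASSED, q)/glEndQuery; objectVisible is illustrative:

GLuint available = 0;
glGetQueryObjectuiv( q, GL_QUERY_RESULT_AVAILABLE, &available );
if( available )   // poll; only fetch once the result is surely ready
{
    GLuint samples = 0;
    glGetQueryObjectuiv( q, GL_QUERY_RESULT, &samples );
    objectVisible = (samples > 0);
}
// else: reuse last frame's visibility rather than stalling the pipe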

OpenGL Drawing
Convenient, but wrong; the driver must copy the data every draw:

void DrawIndexedMesh( float* pPositions, float* pNormals,
                      GLuint* pIndices, GLuint nTriangles )
{
    glEnableClientState( GL_VERTEX_ARRAY );
    glEnableClientState( GL_NORMAL_ARRAY );
    glVertexPointer( 3, GL_FLOAT, 3*sizeof(GLfloat), pPositions );
    glNormalPointer( GL_FLOAT, 3*sizeof(GLfloat), pNormals );
    glDrawElements( GL_TRIANGLES, 3*nTriangles, GL_UNSIGNED_INT, pIndices );
}

OpenGL Drawing the Right Way
At startup:

void CreateBufferObjects( Vertex* pVB, int nVertices, int* pIB, int nTris )
{
    glGenBuffers( 1, &vbo );
    glGenBuffers( 1, &ibo );
    glBindBuffer( GL_ARRAY_BUFFER, vbo );
    glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ibo );
    glBufferData( GL_ARRAY_BUFFER, nVertices*sizeof(Vertex), pVB, GL_STATIC_DRAW );
    glBufferData( GL_ELEMENT_ARRAY_BUFFER, 3*nTris*sizeof(int), pIB, GL_STATIC_DRAW );
}

Then, to draw. The buffers must be bound before the gl*Pointer calls, and the last arguments are now byte offsets into the buffers; these are no longer pointers:

void DrawIndexedMesh( GLuint vbo, GLuint ibo, GLuint nTriangles, GLuint index_offset )
{
    glBindBuffer( GL_ARRAY_BUFFER, vbo );
    glBindBuffer( GL_ELEMENT_ARRAY_BUFFER, ibo );
    glEnableClientState( GL_VERTEX_ARRAY );
    glEnableClientState( GL_NORMAL_ARRAY );
    // assuming interleaved Vertex = { float pos[3]; float nrm[3]; }
    glVertexPointer( 3, GL_FLOAT, sizeof(Vertex), 0 );
    glNormalPointer( GL_FLOAT, sizeof(Vertex), (GLvoid*)(3*sizeof(float)) );
    glDrawElements( GL_TRIANGLES, 3*nTriangles, GL_UNSIGNED_INT,
                    (GLvoid*)(GLintptr)index_offset );
}

The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data

Dynamic Buffers
But… I NEED to move data. Do you really? Sometimes you do:
- Particles
- CPU animation
You need to double-buffer to avoid stalls.

Dynamic Buffers
Double-buffering: the CPU fills Buffer0 while the GPU draws from Buffer1, and they swap each frame. If you give them the right flags, drivers will manage this for you (read the docs).

The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: When breaking rule #3, do so correctly
- Use dynamic, write-only buffers with discard (sketch below)
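A minimal sketch of the "discard" pattern in OpenGL terms (GL 3.0+ glMapBufferRange; the function name and parameters are illustrative):

#include <string.h>

void UpdateDynamicVB( GLuint vbo, const void* src, GLsizeiptr size )
{
    glBindBuffer( GL_ARRAY_BUFFER, vbo );
    // Invalidating says we don't need the old contents, so the driver can
    // hand back fresh storage instead of stalling until the GPU finishes
    // reading the old data. (glBufferData with a NULL pointer is the older
    // way to orphan the same storage.)
    void* p = glMapBufferRange( GL_ARRAY_BUFFER, 0, size,
                                GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT );
    if( p )
    {
        memcpy( p, src, size );   // write-only: never read through p
        glUnmapBuffer( GL_ARRAY_BUFFER );
    }
}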

Pipelining
Instructions take several cycles: Fetch, Decode, Execute, Write Regs.
Nonpipelined: N instructions in 4N clocks, 4 CPI.
Pipelined: 9 instructions in 12 clocks, 1.3 CPI (1 CPI in the limit).
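To spell out the arithmetic behind those numbers: with a 4-stage pipe the first instruction completes after 4 clocks and one more retires every clock after that, so N instructions take N + 3 clocks; 9 instructions take 12 clocks, 12/9 ≈ 1.3 CPI, and (N + 3)/N approaches 1 CPI as N grows.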

Graphics Pipeline
Hardware stages: Index Fetch → Vertex Fetch → Vertex Shade → Clip/Cull → Raster → Z → Shade → Blend. Each stage corresponds to one piece of the DrawTriangles loop shown earlier.

Graphics Pipelining
(Diagram: Index Fetch, Vertex Fetch, Vertex Shade, Clip/Cull, Raster, Z, Shade, and Blend overlapped across clock cycles, one primitive behind another.)

The Physical Machine
Control registers, stuff like:
- VB/IB address
- Primitive type
- Cull mode
- Viewport size
- Z buffer: pointer, format
- Color buffer: pointer, format
- Textures: pointer, format
- Blend mode
- Z/stencil modes
These configure the vertex functional units, triangle rasterizer, and pixel blend hardware.

State Change
(Diagram: after a state change, Index Fetch must wait for the VS to drain before work refills the pipe; the white space in the diagram indicates wasted electricity, clock cycles ticking by.)

State Change (Software)
Your code: SetState(), draw triangles, SetState(), draw triangles, SetState(), draw triangles…
Driver workload, every time:
- Figure out what registers to change
- Turn texture/VBO handles into addresses/formats
- Convert GL state to register bits
- Change shader code addresses
- Put register writes into the command buffer
- Put draw commands into the command buffer
- Repeat…
This part is much more severe than the hardware bubbles.

The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
- Sort objects by state
- Pack meshes together

Computation & Bandwidth
Based on: 100 Mtri/sec (1.6M/frame @ 60Hz), 256 B vertex data, 128 B interpolated, 68 B fragment output, 5x depth complexity, 16 4-byte textures, 223 ops/vert, 1664 ops/frag, no caching, no compression.
Resulting rates: Vertex 75 GB/s and 67 GFLOPS; Triangle 13 GB/s; Fragment 335 GB/s and 1.1 TFLOPS; Texture 45 GB/s.
It is physically impossible to run a serial datapath at these rates. (Slide: Olano)

Data Parallel
Distribute → Task | Task | Task | Task → Merge. (Slide: Olano)

Parallel Graphics
Vertex, geometry, and pixel stages each run in parallel.

Barycentric Rasterization
SIMD parallelism: an NxN stamp of pixels is tested against the triangle in parallel. (Animation: the stamp stepping across the triangle.)

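A hedged C sketch of what one NxN stamp evaluates; the edge-function formulation and all names are illustrative rather than the lecture's code, and it assumes counter-clockwise triangles in screen space:

// E(x,y) >= 0 on the interior side of edge (x0,y0)->(x1,y1) for CCW winding
static float Edge( float x0, float y0, float x1, float y1, float x, float y )
{
    return (x - x0)*(y1 - y0) - (y - y0)*(x1 - x0);
}

void RasterizeStamp( const float t[3][2], int sx, int sy, int N )
{
    for( int y = 0; y < N; ++y )      // in hardware, every lane of the
    for( int x = 0; x < N; ++x )      // NxN stamp evaluates in parallel
    {
        float px = sx + x + 0.5f, py = sy + y + 0.5f;   // pixel center
        float e0 = Edge( t[0][0], t[0][1], t[1][0], t[1][1], px, py );
        float e1 = Edge( t[1][0], t[1][1], t[2][0], t[2][1], px, py );
        float e2 = Edge( t[2][0], t[2][1], t[0][0], t[0][1], px, py );
        if( e0 >= 0 && e1 >= 0 && e2 >= 0 )
        {
            // inside: e1, e2, e0 are proportional to the barycentric
            // weights of t[0], t[1], t[2], ready for interpolation
        }
    }
}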

Ordering
- Independent pixels
- Strict primitive order within a pixel, or transparency doesn't work

Parallel Raster Architectures
The "sort" is where pixel ordering gets resolved; architectures are classified by where that sort happens: first, middle, or last.
Good read: "A Sorting Classification of Parallel Rendering", Molnar et al.

Sort First
Distribute objects by screen tile: each full pipeline (vertex → triangle → fragment) owns some objects and some pixels of the screen. (Slide: Olano)

Sort Middle
Distribute objects or vertices across the vertex units, then merge and redistribute by screen location to the triangle/fragment units, each of which owns some pixels. (Slide: Olano)

Screen Subdivision
Tiled, interleaved, scan-line interleaved, or column interleaved. (Slide: Olano)

Sort Last
Distribute by object: each pipeline (vertex → triangle → fragment) renders its objects across the full screen, and a partitioned fragment-merge stage composites the results. (Slide: Olano)

Architecture
An execution core: control logic, ALU, registers. (Not to scale.)

Architecture
Multiple parallel cores: multiple-instruction multiple-data (MIMD). 16 instruction streams, 16 data streams.

Architecture
SIMD machine: single instruction, multiple data, with shared control logic. 1 instruction stream, 60 data streams.
Pro: more throughput. Con: requires coherent execution.

SIMD Branching

if( x )    // mask off threads where x is false
{
    // issue instructions
}
else       // invert the mask
{
    // issue instructions
}
// unmask

Threads agree, take if: useful. Threads agree, take else: useful. Threads disagree, take if AND else: partly useless.

SIMD Looping

while( x )   // update mask each iteration
{
    // do stuff
}

They all run 'till the last one's done…
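A scalar emulation of that behavior, to make the cost model concrete (4-wide warp, all names illustrative): the warp keeps issuing the loop body until the last lane's predicate clears, so the trip count is the maximum over lanes, not the average.

#include <stdbool.h>

void SimdLoop( int counts[4] )   // per-lane iteration counts
{
    bool active[4];
    for( int l = 0; l < 4; ++l ) active[l] = counts[l] > 0;

    bool anyActive = active[0] || active[1] || active[2] || active[3];
    while( anyActive )           // one trip per WARP, not per lane
    {
        anyActive = false;
        for( int l = 0; l < 4; ++l )              // lanes run in lockstep
        {
            if( active[l] && --counts[l] == 0 )   // masked lanes do nothing useful
                active[l] = false;
            anyActive = anyActive || active[l];
        }
    }
}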

The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
- Keep control flow simple
- Flatten branches (see the sketch below)
- Avoid 'else' branches
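As a sketch of what "flatten branches" means in practice (names illustrative; a shader compiler does the equivalent with predication/select): compute both sides and pick one, so divergent lanes never serialize.

float FlattenedShade( float x, float a, float b )
{
    float ifResult   = a * x;   // work for the 'if' side
    float elseResult = b + x;   // work for the 'else' side
    // a select, not a jump: every lane runs the same instructions
    return (x > 0.0f) ? ifResult : elseResult;
}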

DX9-Era GPU
(Diagram: index stream → primitive assembly → post-TnL cache → dedicated VS units; clip-space primitives → rasterizer → early-Z → blocks of pixels → dedicated PS units → Z → blend → pixels. Vertex cache, texture cache, and color/depth caches sit between the units and memory.)

Memory Bandwidth
Lots of things need data all at once: the index stream, vertex demand, texture demand, and alpha/Z operations, each reaching memory through its own cache (vertex $, tex $, color $, depth $).

Texture Tiling
Images are tiled in memory: nearby pixels live at nearby addresses.

Texture Tiling
The texture cache is for reuse across pixel blocks: it exists for bandwidth savings, not latency reduction like a CPU cache.

Block Compression
DXT1 (BC1): a 4x4 pixel block packed into 8 bytes; 8:1 over standard 32-bit color. Two endpoint colors plus a 2-bit-per-pixel color index selecting among four possible colors: the 2 endpoints and 2 interior points.
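For concreteness, a hedged C sketch of decoding one BC1 block with the standard layout (two RGB565 endpoints, then 16 2-bit indices); it ignores BC1's color0 <= color1 punch-through-alpha mode and does not replicate low bits when expanding 5/6-bit channels:

#include <stdint.h>

typedef struct { uint8_t r, g, b; } RGB;

static RGB Unpack565( uint16_t c )
{
    RGB o = { (uint8_t)(((c >> 11) & 31) << 3),
              (uint8_t)(((c >>  5) & 63) << 2),
              (uint8_t)(( c        & 31) << 3) };
    return o;
}

void DecodeBC1( const uint8_t block[8], RGB out[16] )
{
    RGB p[4];
    p[0] = Unpack565( (uint16_t)(block[0] | (block[1] << 8)) );  // endpoint 0
    p[1] = Unpack565( (uint16_t)(block[2] | (block[3] << 8)) );  // endpoint 1
    // the two interior points sit 1/3 and 2/3 along the endpoint segment
    p[2].r = (uint8_t)((2*p[0].r + p[1].r)/3);
    p[2].g = (uint8_t)((2*p[0].g + p[1].g)/3);
    p[2].b = (uint8_t)((2*p[0].b + p[1].b)/3);
    p[3].r = (uint8_t)((p[0].r + 2*p[1].r)/3);
    p[3].g = (uint8_t)((p[0].g + 2*p[1].g)/3);
    p[3].b = (uint8_t)((p[0].b + 2*p[1].b)/3);

    uint32_t bits = block[4] | (block[5] << 8) | (block[6] << 16)
                  | ((uint32_t)block[7] << 24);
    for( int i = 0; i < 16; ++i )
        out[i] = p[(bits >> (2*i)) & 3];   // 2 bits/pixel pick 1 of 4 colors
}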

Block Compression
DXT5 (BC3): DXT1 plus alpha; 4:1. Two endpoint alphas (2 bytes) plus a 3-bit-per-pixel alpha index selecting among 8 possible alphas.
BC4 (ATI1N): just the alpha channel; 2:1 for greyscale.
BC5 (ATI2N): 2 alphas slapped together; 2:1 for 2 channels, 4:1 for a tangent-space normal map.

Block Compression
BC6/BC7: 16-byte block, 7 different formats.

Z-Ordered Rasterization
Take 2D integer coordinates, interleave their bits, and get a 1D index: consecutive 1D indices are spatially coherent. Deinterleave a counter to walk through space. (Figure: Wikipedia)
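A standard bit-interleaving sketch in C (a well-known trick, not lecture code): Part1By1 spreads the bits of a 16-bit coordinate apart so x and y can be merged into one Morton index.

#include <stdint.h>

static uint32_t Part1By1( uint32_t v )   // abcd -> 0a0b0c0d
{
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

uint32_t MortonIndex( uint32_t x, uint32_t y )
{
    // consecutive indices stay spatially coherent in 2D;
    // run the masks in reverse to deinterleave a counter
    return Part1By1( x ) | (Part1By1( y ) << 1);
}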

The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
- Small vertex formats: 16-bit float, 8-bit fixed
- Small texture formats: compress; don't use four 8-bit channels for a greyscale image!
- See Rules 1 and 6

DX9-Era GPU (recap: separate, dedicated VS and PS pools, as in the diagram above.)

Unified Shaders
Nvidia "Technical Brief" (read: marketing).


DX10-Era GPU
(Diagram: index data → primitive assembly → post-TL$ → a unified shader (US) array running VS and GS threads; clip-space primitives → rasterizer → early-Z → the same US array running pixel threads → blend → pixels. L1 caches per shader cluster, a shared L2, then memory.)

The Geometry Shader
One primitive in (point/line/triangle), up to N primitives out.
Unpredictable data amplification. Order MUST be preserved.

The Geometry Shader
One geometry shader spits out primitives one by one to the rasterizer. Parallel geometry shaders need lots of buffering, because results must be consumed in order…

The Geometry Shader
Nvidia: buffers in on-chip memory; parallelism limited by buffer space; faster for small amplification.
AMD: buffers in DRAM; lots of latency; faster for large amplification.

The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Rule #8: Geometry shaders are slooooow

Memory Latency
Do a little math… miss the cache… wait a few hundred cycles for memory… keep going.

Memory Latency
CPU strategy: bend over backwards to avoid stalls. Sequencer, ALU, registers, plus a gigantic cache, branch predictor, out-of-order execution, memory prefetch, and more registers.

Memory Latency
GPU strategy: run lots of threads and "hide" the stalls with useful work. ALU, registers (THOUSANDS of them), tiny cache, sequencer, scheduler.

Latency
Do a little math… miss the cache… the hardware swaps in other threads, so the memory access is overlapped by useful work… keep going.

Terminology
Thread: one instance of a shader program; one pixel/vertex.
Warp/Wavefront: a SIMD-sized collection of threads running in lockstep; what hardware people call a thread. Many warps are kept in flight for latency hiding.

Occupancy
Register file: "registers" are SIMD-sized (one entry per SIMD lane) and evenly divided among warps.

Occupancy
4 registers per thread → 8 warps resident.

Occupancy
16 registers per thread → 2 warps resident.
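A worked version of those two slides, assuming (consistently with both numbers) a register file 32 registers deep per SIMD lane: resident warps = 32 / registers-per-thread, so 4 registers/thread gives 32/4 = 8 warps while 16 registers/thread gives only 32/16 = 2 warps. Fatter shaders mean fewer resident warps, and less latency hiding.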

The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Rule #8: Geometry shaders are slooooow
Rule #9: Keep the machine full

Modern GPU
(Diagram: index data → PA → post-TL$ → a unified shader array running vertex, GS/HS/DS (tessellation), and pixel threads; fixed-function Tess and GS blocks feed setup; clip-space primitives → 4 rasterizers → Z → pixel threads → blend → pixels. L1 caches, a shared L2, then memory.)

Tessellation (image: Unigine.com)

DX11 Tessellation Pipeline
Patches (control points) → Hull Shader (selects tess factors) → detail levels → Tessellation Hardware → u,v coordinates → Domain Shader (evaluation) → vertices → Geometry Shader. [Moreton 2001]

Tessellation
Pitfalls:
- Backface cull happens post-tessellation: LOTS of wasted DS work
- 2x2 quad utilization problem

Derivatives for MipMapping
2x2 quads + differencing. Each 2x2 quad is self-contained; missing pixels are extrapolated…
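A hedged sketch of the differencing itself (illustrative, not the exact hardware): given one shaded value at each pixel of a 2x2 quad, neighbor differences approximate the screen-space derivatives that drive mip selection.

#include <math.h>

// u[row][col] holds a shaded texture coordinate (in texels) at each pixel
// of the quad; pixels outside the triangle hold extrapolated values
float MipLevel( const float u[2][2] )
{
    float ddx = u[0][1] - u[0][0];                // horizontal difference
    float ddy = u[1][0] - u[0][0];                // vertical difference
    float rho = fmaxf( fabsf(ddx), fabsf(ddy) );  // footprint in texels
    return (rho > 1.0f) ? log2f( rho ) : 0.0f;    // footprint -> mip level
}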

Big Triangle

Rasterized Quads

Wasted Pixels
27 of 76 (35%).
- Drops off very fast for big triangles
- At this scale, this triangle is "small"

In the limit…
In this scenario we shade 4 times as many pixels as we need. This is essentially what happens when we over-tessellate.

The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Rule #8: Geometry shaders are slooooow
Rule #9: Keep the machine full
Rule #10: Small triangles are slow

The Rules
Rule #1: Don't be silly
Rule #2: One way traffic
Rule #3: Do not move data
Rule #4: Move data correctly
Rule #5: Avoid changing state
Rule #6: Think about coherence
Rule #7: Save bandwidth
Rule #8: Geometry shaders are slooooow
Rule #9: Keep the machine full
Rule #10: Small triangles are slow
Rule #11: The rules are subject to change at any time and without notice…

NVIDIA GeForce 6 [Kilgariff and Fernando, GPU Gems 2]

Vertex Processing
4-wide FP vector + special functions. Vertex texture fetch: unfiltered, very slow. MIMD.

Fragment Processing
Pixel pipe: two 4-wide vector pipes, dual issue, vector co-issue (3+1 or 2+2), FP16 arithmetic.
Poor flow control granularity: all in-flight threads take the same path.

AMD/ATI R600 [Tom’s Hardware]

SIMD Units
VLIW: 5 ALUs, 1 with transcendentals.
16-wide SIMD, in groups of 4: a 64-thread "wavefront"; 2 waves issue over 8 clocks.
4 texture engines; texture ops run at ¼ the ALU rate.

Dispatch

Demo

NVIDIA G80 [NVIDIA 8800 Architectural Overview, NVIDIA TB-02787-001_v01, November 2006]

Streaming Processors
Scalar architecture.
Instruction issue: a 32-wide "warp" is issued over 4 clocks; special functions take 16 clocks (2 SFUs); warp instructions are interleaved.

NVIDIA Fermi [Beyond3D NVIDIA Fermi GPU and Architecture Analysis, 2010]

Fermi Rasterization
Round-robin vertex processing. 4 rasterizers (1 per GPC), screen partitioned. [Purcell 2010]

NVIDIA Fermi SM
2 concurrent warps, 32 ALUs, 16 load/store units, 4 SFUs.
[NVIDIA, NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 2009]

AMD GCN
Vector pipes: 4 SIMDs per CU, issued round-robin; 64-wide waves.
Scalar processor: integer ops and branching; separate register set.
Different instruction types can co-issue, from different waves.