NVIDIA GeForce
Ryan Hendrixson, Ryan Schubert, Allison Walthall

What Does a GPU Actually Do?
Historically, GPUs have evolved from:
–Acting simply as a frame buffer
–To doing vertex transformations and pixel color calculations
–To now being programmable
In the simplest sense, a modern GPU implements a 3D rendering pipeline

3D Rendering Pipeline (direct illumination)
This is a pipelined sequence of operations to draw a 3D primitive into a 2D image:
3D Geometric Primitives
–Modeling Transformation: transform into 3D world coordinate system
–Lighting: illuminate according to lighting and reflectance
–Viewing Transformation: transform into 3D camera coordinate system
–Projection Transformation: transform into 2D screen coordinate system
–Clipping: clip primitives outside camera’s view
–Scan Conversion: draw pixels
Image
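
The geometric stages of this pipeline can be sketched for a single vertex. This is only a toy illustration, not how any GeForce implements it: the function names are hypothetical, the modeling transformation is reduced to a translation, the camera is assumed to look down the negative z axis with no rotation, and the lighting stage is left out entirely.

```python
# Toy, hypothetical pipeline for one vertex (lighting omitted).

def model_to_world(p, translate):
    """Modeling transformation: here just a translation into world space."""
    return tuple(a + b for a, b in zip(p, translate))

def world_to_camera(p, eye):
    """Viewing transformation: camera sits at `eye`, looks down -z, no rotation."""
    return tuple(a - b for a, b in zip(p, eye))

def project(p, focal=1.0):
    """Projection transformation: perspective divide onto the plane z = -focal."""
    x, y, z = p
    return (focal * x / -z, focal * y / -z)

def inside_view(q):
    """Clipping test: keep points whose projected coordinates lie in [-1, 1]."""
    return all(-1.0 <= c <= 1.0 for c in q)

def to_pixel(q, width, height):
    """Scan conversion for a point: map [-1, 1] to integer pixel coordinates."""
    x, y = q
    col = int((x + 1.0) * 0.5 * (width - 1) + 0.5)
    row = int((1.0 - y) * 0.5 * (height - 1) + 0.5)
    return (col, row)

def render_point(p, translate, eye, width, height):
    """Run one vertex through the stages; return a pixel or None if clipped."""
    cam = world_to_camera(model_to_world(p, translate), eye)
    if cam[2] >= 0:  # at or behind the camera plane
        return None
    ndc = project(cam)
    return to_pixel(ndc, width, height) if inside_view(ndc) else None
```

A point placed five units in front of the camera lands in the middle of a 101x101 image, while a point far off to the side or behind the camera is rejected by the clipping step.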

Modern OpenGL Pipeline
[Diagram. CPU side: Application. GPU side: Vertex Processor (programmable) -> Assembly & Rasterization -> Pixel (Fragment) Processor (programmable) -> Video Memory (Textures). Data passed between stages: Vertices (3D) -> Xformed, Lit Vertices (2D) -> Fragments (pre-pixels) -> Final pixels (Color, Depth). Also shown: Graphics State, Render-to-texture.]

OpenGL vs. DirectX
OpenGL:
–Just graphics
–Standard C interfaces
–State machine
–Multiple platforms
–Academic use
DirectX:
–Graphics, multimedia, etc.
–C++ interfaces
–Object oriented
–Windows
–PC games

Possible GPU Performance Bottlenecks
CPU/Bus Bound
–Simply not able to send enough vertices to the card to keep it busy
Vertex Bound
–Vertex processing engine is fully loaded, while the fragment engine is just waiting and grabbing data as soon as it’s ready
Pixel Bound
–The fragment engine is fully loaded, causing the vertex engine to have to wait before sending more data
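
One rough way to reason about these three cases is to treat each stage as having a rate it could sustain on its own; the slowest stage then caps the whole pipeline. The sketch below assumes exactly that simplified model, and the stage names and numbers are made up for illustration.

```python
def find_bottleneck(stage_rates):
    """Return (stage_name, rate) for the slowest stage, which caps throughput.

    `stage_rates` maps a stage name to the frames per second that stage
    could sustain if nothing else limited it (hypothetical figures).
    """
    name = min(stage_rates, key=stage_rates.get)
    return name, stage_rates[name]


# Hypothetical workload: the fragment engine is the limiting stage.
rates = {"cpu_bus": 120.0, "vertex": 90.0, "pixel": 45.0}
print(find_bottleneck(rates))
```

In this example the workload is pixel bound, so sending fewer vertices would not raise the frame rate, while lowering the resolution or simplifying the fragment program might.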

Early History
NVIDIA founded in 1993
1997: RIVA
1998: RIVA TNT
1999: GeForce 256 (NV10)

GeForce 256 (NV10)
Hardware transform and lighting (T&L)
DDR and SDR memory versions
HDTV compliant
Hardware alpha-blending
4 pixel pipelines at 120 MHz
Fill Rate: 480 Megapixels/second
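
The fill rate follows directly from the pipeline count and the clock. A minimal sketch, assuming each pipeline writes one pixel per clock (the function and parameter names are illustrative):

```python
def fill_rate_mpixels(pixel_pipelines, core_clock_mhz, pixels_per_pipe_per_clock=1):
    """Peak fill rate in megapixels per second."""
    return pixel_pipelines * core_clock_mhz * pixels_per_pipe_per_clock


# GeForce 256 figures from the slide: 4 pipelines at 120 MHz.
geforce_256 = fill_rate_mpixels(pixel_pipelines=4, core_clock_mhz=120)
print(geforce_256)  # 480
```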

GeForce2
2000: GeForce2 GTS
–Doubled the pixel fill rate
–Quadrupled the texel fill rate
–Increased clock speed
–Multi-texturing
–S3TC, MPEG-2, FSAA

Anti-Aliasing
[Figure: the same image rendered without anti-aliasing and with anti-aliasing]

GeForce2
2000: GeForce2 MX
–Halved the number of pixel pipelines, making it cost effective
–TwinView
–Compatible with Macs

GeForce2
August 2000: GeForce2 Ultra
November 2000: GeForce2 Go
December 2000: NVIDIA buys 3dfx
January 2001: Apple selects GeForce2 MX as the default high-end graphics solution for the Power Mac G4

GeForce3
2001: GeForce3 (NV20)
–240 MHz Core/500 MHz Memory
–57 million transistors
–46-76 Gigaflops
–Vertex shader technology
–Pixel shader technology
–LightSpeed Memory architecture

LightSpeed Memory Architecture

GeForce4
2002: GeForce4 Ti (NV25) and MX (NV17)
–Ti:
  –4200, 4400, 4600, and 4800 versions
  –63 million transistors
  –Chip clock 225-300 MHz
  –Memory clock 500-650 MHz
  –75-100 million vertices/second

GeForce FX
November 2002: GeForce FX (NV30)
–16 variations for different price ranges
–125 million transistors
–8 pixels/clock
–1 TMU/pipe (16 textures/unit)
–128-bit memory interface
–128 MB/256 MB memory size support

GeForce 6 series (NV40)
–6200; 6600 GT and Ultra; 6800 GT, Ultra, and Ultra Extreme
–Core clock speed 450 MHz
–Memory clock speed 600 MHz
–Vertex shader units: 6 4-wide FP32 vector MADDs per clock cycle
–Pixel shader units: 16 4-wide FP32 vector MADDs per clock cycle

GeForce 6 series
Superscalar 16-pipe architecture
CineFX 3.0 engine
All operations done in FP32 precision per component
200 Gigaflops (compare this to the Itanium’s 6.4 Gigaflops)

General Diagram (6800/NV40)

TurboCache
Uses PCI-Express bandwidth to render directly to system memory
Card needs less memory
Performance boost while lowering cost
TurboCache Manager dynamically allocates from main memory
Local memory used to cache data and to deliver peak performance when needed

TurboCache

NV40 Vertex Processor
Per clock cycle, an NV40 vertex processor can execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access

NV40 Fragment Processors
Early termination from mini z-buffer and z-buffer checks; the resulting sets of 4 pixels (quads) are passed on to the fragment units

Programmable 2D and Video Processor
Can be used for video decoding and encoding (IDCT, deinterlacing, color model transformations, etc.)

Why NV40 series was better
Massive parallelism
Scalability
–Lower end products have fewer pixel pipes and fewer vertex shader units
Computation Power
–222 million transistors
–First to comply with Microsoft’s DirectX 9.0c spec (Shader Model 3.0)
Dynamic Branching in pixel shaders

Dynamic Branching
Helps detect if pixel needs shading
Instruction flow handled in groups of pixels
Specify branch granularity (the number of consecutive pixels that take the same branch)
Better distribution of blocks of pixels between the different quad engines
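
Why branch granularity matters can be shown with a simplified cost model. The sketch below assumes that if any pixel in a group needs the expensive branch, every pixel in that group pays for it; the function name and the one-dimensional pixel layout are illustrative only and do not describe the actual NV40 scheduling.

```python
def pixels_paying_for_branch(needs_shading, granularity):
    """Count pixels that execute the expensive branch under group-wise flow.

    `needs_shading` is a list of booleans, one per consecutive pixel.
    Pixels are handled in groups of `granularity`; a group takes the
    expensive path as a whole if any of its pixels needs it.
    """
    paying = 0
    for start in range(0, len(needs_shading), granularity):
        group = needs_shading[start:start + granularity]
        if any(group):
            paying += len(group)
    return paying


# Only one of eight pixels actually needs shading.
mask = [False, False, False, False, True, False, False, False]
for g in (1, 2, 4, 8):
    print(g, pixels_paying_for_branch(mask, g))
```

With a finer granularity fewer neighbouring pixels are dragged through the expensive path, which is the benefit the slide attributes to being able to specify it.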

Dynamic Branching

GeForce 7 series
7800 GT:
–$449
–7 vertex units
–20 pixel pipelines
–Clock speed 400 MHz
–Memory clock speed 500 MHz
7800 GTX:
–$600
–8 vertex units
–24 pixel pipelines
–Clock speed 430 MHz
–Memory clock speed 600 MHz

GeForce 7800
302 million transistors
200 Gigaflops of multiply/add calculations per second
128-bit floating point precision through the entire rendering pipeline
Fill Rate: 10.3 Gigatexels/sec
860 million vertices/sec

GeForce 7800

ALU Units in Pixel Processor
Sub-unit 1:
–NV40: handles texture data and can issue a MUL vector instruction, or use its mini-ALU to issue a non-vector instruction
–G70: same, but can also issue a multiply/add
Sub-unit 2:
–NV40: can issue a multiply/add vector instruction, or use its own mini-ALU to issue a non-vector instruction
–G70: same

GeForce 6 vs. GeForce 7
ALU Units
–G70: 24 ALU Units
–NV40: 16 ALU Units
Register file: same size
Texture samplers: the same, but when fetching large textures in preparation for filtering, G70's samplers have less latency pulling those textures out of memory

GeForce 6 vs. GeForce 7 (speculative)
Increased L2 texture cache (to around 12KB)
Better cache re-use with larger textures, decompressing those larger textures into L1 faster
Possibly offering more granularity in cache access by the GPU, to reduce texture bandwidth, speeding up rendering

GeForce 6 vs. GeForce 7
33% more vertex units, each with more performance
Improved vertex fetch unit (unconfirmed by NVIDIA)
Triangle setup and rasterizer optimized via the use of a new raster pattern (again unconfirmed by NVIDIA)

General Diagram (7800/G70)

32-bit IEEE floating-point throughout pipeline (NV40)
Framebuffer
Textures
Fragment processor
Vertex processor
Interpolants
GeForce 7800 (G70) supports 128 bit through entire pipeline!

Hardware supports several other data types
Fragment processor also supports:
–16-bit “half” floating point
–12-bit fixed point
–These may be faster than 32-bit on some HW
Framebuffer/textures also support:
–Large variety of fixed-point formats
–E.g., classical 8-bit per component
–These formats use less memory bandwidth than FP32
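
The bandwidth and precision trade-off between these formats can be made concrete on the CPU. The sketch below uses Python's standard `struct` module, whose `e` format is IEEE 754 half precision; the helper names are hypothetical, and the 8-bit conversion assumes the usual mapping of [0, 1] onto 0..255.

```python
import struct

def bytes_per_pixel(bits_per_component, components=4):
    """Storage for one pixel of the given component width (RGBA by default)."""
    return bits_per_component * components // 8

def to_half(x):
    """Round-trip a float through 16-bit half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fixed8(x):
    """Quantize a value in [0, 1] to 8-bit fixed point and back."""
    return round(x * 255) / 255


for bits in (32, 16, 8):
    print(bits, "bits/component ->", bytes_per_pixel(bits), "bytes per RGBA pixel")
print(to_half(0.1), to_fixed8(0.1))
```

An RGBA pixel takes 16 bytes in FP32, 8 in half precision, and 4 with 8 bits per component, which is where the bandwidth saving comes from; the price is that a value such as 0.1 is no longer stored exactly.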

How are current GPUs different from CPUs?
GPU is a stream processor
–Multiple programmable processing units
–Connected by data flows
[Diagram: Application -> Vertex Processor -> Assembly & Rasterization -> Fragment Processor -> Framebuffer Operations -> Framebuffer; Textures feed the programmable processors]

How are current GPUs different from CPUs?
Optimized for 4-vector arithmetic
–Useful for graphics: colors, vectors, texcoords
–Easy way to get high performance/cost
–SIMD/MIMD

GPU Memory Model vs CPU’s
Much more restricted memory access
–Allocate/free memory only before computation
–Limited memory access during computation (kernel)
  –Registers: read/write
  –Local memory: does not exist
  –Global memory: read-only during computation; write-only at end of computation (pre-computed address)
  –Disk access: does not exist
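
These restrictions amount to a gather-only programming style: a kernel may read anywhere in its input, but it produces exactly one value for an output slot chosen before it runs. A minimal CPU-side sketch of that style follows; `run_kernel` and `blur3` are hypothetical names used only for illustration.

```python
def run_kernel(kernel, inputs, n_outputs):
    """Apply `kernel` once per output index.

    The kernel may read (gather) from any input element, but it only
    returns the value for its own pre-computed output slot `i`.
    """
    frozen = tuple(inputs)  # global memory is read-only during the pass
    return [kernel(i, frozen) for i in range(n_outputs)]

def blur3(i, src):
    """Example kernel: average an element with its clamped neighbours."""
    lo = max(i - 1, 0)
    hi = min(i + 1, len(src) - 1)
    return (src[lo] + src[i] + src[hi]) / 3.0


print(run_kernel(blur3, [0.0, 3.0, 6.0], 3))  # [1.0, 3.0, 5.0]
```

A scatter, where the kernel decides at run time which address to write, has no equivalent in this model and has to be restructured as a gather.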

GPU Memory Model
Where is GPU Data Stored?
–Vertex buffer
–Frame buffer
–Texture
[Diagram: Vertex Buffer -> Vertex Processor -> Rasterizer -> Fragment Processor -> Frame Buffer(s); Texture is read by the Fragment Processor and, on VS 3.0 GPUs, by the Vertex Processor]

GPGPU and Motivation
GPUs are fast…
–Itanium: 6.4 GFLOPS
–GeForce 7800: 200 GFLOPS
–GPUs are getting faster, faster
  –CPUs: annual growth ≈ 1.5×, decade growth ≈ 60×
  –GPUs: annual growth > 2.0×, decade growth > 1000×
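
The decade figures are simply the annual factors compounded over ten years, which a short calculation confirms (the function name is illustrative):

```python
def decade_growth(annual_factor, years=10):
    """Compound an annual growth factor over a number of years."""
    return annual_factor ** years


print(round(decade_growth(1.5), 1))  # 57.7, i.e. roughly 60x for CPUs
print(decade_growth(2.0))            # 1024.0, i.e. more than 1000x for GPUs
```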

Motivation: Computational Power
[Chart comparing GPU and CPU computational power; courtesy Naga Govindaraju]

GPGPU
Good for inherently parallel applications
Rapidly evolving ISA and HW architecture
–Largely secret
Can’t simply “port” code written for the CPU!

Programs are Shaders
Bound by the specific hardware profile:
–E.g. different cards have different supported hardware, OpenGL has different restrictions than DirectX, etc.
Hardware profiles change relatively drastically as new GPUs are developed
–But typically new profiles only add features, so there is generally still backwards compatibility (but not always)

Vertex processor
256 instructions per program originally (effectively higher with branching)
–Now up to 65535 instructions
Executes on all vertices
Outputs new vertices or texture coordinates, etc.

Fragment Processor Flow Chart

Fragment processor has flexible texture mapping
Memory is accessible through texture reads
Texture reads are just another instruction
Allows computed texture coordinates, nested to arbitrary depth
Allows multiple uses of a single texture unit

Additional fragment processor capabilities
Read access to window-space position
Read/write access to fragment Z
Built-in derivative instructions
–Partial derivatives w.r.t. screen-space x or y
–Useful for anti-aliasing
Conditional fragment-kill instruction
Multiple FP formats supported
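
A screen-space derivative can be approximated by differencing a value between neighbouring pixels. The sketch below assumes forward differences inside a 2x2 block of pixels; that is one plausible scheme used here for illustration, not a statement of how a particular GPU computes its derivative instructions, and the function name is hypothetical.

```python
def quad_ddx_ddy(quad):
    """Estimate (d/dx, d/dy) of a per-pixel value over a 2x2 pixel block.

    `quad` is ((top_left, top_right), (bottom_left, bottom_right)).
    """
    (top_left, top_right), (bottom_left, _bottom_right) = quad
    ddx = top_right - top_left      # change per pixel step in screen x
    ddy = bottom_left - top_left    # change per pixel step in screen y
    return ddx, ddy


# A texture coordinate that moves 2 texels per pixel in x and 3 in y.
print(quad_ddx_ddy(((1.0, 3.0), (4.0, 6.0))))  # (2.0, 3.0)
```

Large derivatives mean the value changes quickly from one pixel to the next, which is the signal a shader can use to filter more aggressively and reduce aliasing.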

Fragment processor limitations
Originally no branching
–Now support dynamic branching (but it’s still costly)
No indexed reads from registers
–Use texture reads instead
No memory writes

Branching Instruction Costs (GeForce 6800)

Fragment shaders
Originally very limited in size (only 96 instructions), now expanded to 65535+ instructions
New cards support dynamic branching (but it still incurs some performance penalty)
Now have the ability to output to multiple render targets

CineFX 4.0 Engine
A redesigned vertex shader unit reduces the time to set up and perform geometry processing.
A new pixel shader unit design can carry out twice as many floating-point operations and greatly accelerates other mathematical operations to increase throughput.
An advanced texture unit incorporates new hardware algorithms and better caching to speed filtering and blending operations.

Vertex Shaders
The 7800 has 8 vertex shaders
The Triangle Setup stage turns the vertex points into a triangle
It also determines mathematically the rasterization for each triangle
Accelerating triangle setup increases the total throughput of the 3D pipeline

Theoretical Rasterization Pattern of a Triangle

New Pixel Shader – MADD
Multiply and accumulate are commonly used math functions in 3D graphics
MADD stands for Multiply-ADD operations
The 7800 can do twice as many MADD operations as previous GPUs could
This allows developers to create much more complex visual effects
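
A MADD computes a * b + c on each component of a 4-component vector in one instruction. A minimal sketch of that operation, written as plain Python rather than shader code (the function name is illustrative):

```python
def madd4(a, b, c):
    """Component-wise a * b + c on 4-component vectors (one MADD)."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))


# Scale an RGBA color by a per-channel gain and add a bias.
color = (0.5, 0.25, 1.0, 1.0)
gain = (2.0, 2.0, 0.5, 1.0)
bias = (0.0, 0.25, 0.0, 0.0)
print(madd4(color, gain, bias))  # (1.0, 0.75, 0.5, 1.0)
```

Each MADD counts as two floating-point operations per component, which is why vendors quote MADD throughput when advertising peak Gigaflops.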

Transparency Adaptive Supersampling
Takes extra passes over thin-lined objects such as chain-link fences or trees to enhance quality
Pixels inside of a polygon are usually not touched by anti-aliasing methods
With this, a key set is devised, and those pixels are anti-aliased, creating a smoother image.

Transparency Adaptive Supersampling

Transparency Adaptive Multisampling
Higher levels of performance, because it uses one texel to determine other subpixel values
Not as high quality

Supporting the Future
The 7800 is already set up to support the new Microsoft Longhorn OS with some of the following advancements
–Video post-processing
–Real-time desktop compositing
–Seamless multiple 3D applications
–Accelerated antialiased text rendering
–Special effects and animation

Accelerated Graphics Port (AGP)
AGP is superior to PCI because it provides a dedicated pathway between the slot and the processor
Uses sideband addressing
PCI must load a texture from the hard drive into the system’s RAM, then from RAM into the GPU framebuffer
AGP can read textures directly from system RAM by “tricking” the CPU into believing the textures are in the framebuffer, when they are really in system memory

PCI Express
Based on the PCI system, allowing for backwards compatibility
Uses 1-bit, bidirectional serial lanes (PCI used a shared bus)
Each lane supports 250 MB/s in each direction (4 GB/s total for an x16 slot)
–AGP is only 2 GB/s
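
The 4 GB/s total is the per-lane figure multiplied by the 16 lanes of a graphics slot. A quick check, using the 250 MB/s per-lane figure from the slide (the names are illustrative):

```python
MB_PER_LANE_PER_DIRECTION = 250  # per-lane figure quoted on the slide

def pcie_bandwidth_mb(lanes):
    """Peak one-way bandwidth in MB/s for a link with `lanes` lanes."""
    return lanes * MB_PER_LANE_PER_DIRECTION


print(pcie_bandwidth_mb(1))   # 250
print(pcie_bandwidth_mb(16))  # 4000, i.e. 4 GB/s in each direction
```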

Scalable Link Interface (SLI)
Takes advantage of the PCI Express bus, which will allow more than one discrete graphics device on the same PCI host
Allows two of the same GeForce GPUs to run on one machine, thus “sharing” load.
There are two modes for this
–Split-frame Rendering (SFR)
–Alternate-frame Rendering (AFR)

Split-frame Rendering
Has each GPU render a portion of the screen, split horizontally
No extra latency
Not necessarily evenly split
–SFR is load shared, so it splits up the frame by the amount of work, not the size
A large amount of overhead is involved, causing a max speed up of around 1.8 times
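
Splitting by work rather than by size can be sketched as picking the horizontal line that best balances an estimated per-row cost. The cost estimates and the function name below are hypothetical; the actual driver heuristic is not described in these slides.

```python
def sfr_split_row(row_costs):
    """Pick the split row that best balances work between two GPUs.

    Rows [0, r) go to the first GPU and rows [r, n) to the second.
    `row_costs` holds an estimated rendering cost for each screen row.
    """
    total = sum(row_costs)
    best_row, best_gap = 0, total
    running = 0
    for r, cost in enumerate(row_costs, start=1):
        running += cost
        gap = abs(running - (total - running))  # imbalance if we split here
        if gap < best_gap:
            best_row, best_gap = r, gap
    return best_row


# The bottom row is as expensive as the five rows above it combined.
print(sfr_split_row([1, 1, 1, 1, 2, 6]))  # 5
print(sfr_split_row([2, 2, 2, 2]))        # 2
```

In the first case one GPU renders five of the six rows and the other renders only the last, yet both do the same amount of work.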

Alternate-frame Rendering
Avoids all the overhead problems of SFR
Many buffer swaps
Reliant on the speed of the processor
Can cause latency issues
Recommended mode by NVIDIA

GeForce Go 7800 GTX
The mobile version of the 7800 GTX
Everything from the desktop release has been carried over to this
Can switch between x1 and x16 lanes of PCI Express
Uses PowerMizer 6.0, which allows this chip to operate in the same power envelope as its predecessor, the 6800

GeForce Go 7800 – Power Issues
Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs
Dynamic clock scaling can run as slow as 16 MHz
–This is true for the engine, memory, and pixel clocks
Heavier use of clock gating than the desktop version
Runs at voltages lower than any other mobile performance part
Regardless, you won’t get much battery-based runtime for a 3D game

Questions?
