Graphic Architecture introduction and analysis

Slides:



Advertisements
Similar presentations
COMPUTER GRAPHICS CS 482 – FALL 2014 NOVEMBER 10, 2014 GRAPHICS HARDWARE GRAPHICS PROCESSING UNITS PARALLELISM.
Advertisements

Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
Understanding the graphics pipeline Lecture 2 Original Slides by: Suresh Venkatasubramanian Updates by Joseph Kider.
RealityEngine Graphics Kurt Akeley Silicon Graphics Computer Systems.
Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.
Prepared 5/24/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Graphics Hardware CMSC 435/634. Transform Shade Clip Project Rasterize Texture Z-buffer Interpolate Vertex Fragment Triangle A Graphics Pipeline.
GRAPHICS AND COMPUTING GPUS Jehan-François Pâris
Control Flow Virtualization for General-Purpose Computation on Graphics Hardware Ghulam Lashari Ondrej Lhotak University of Waterloo.
1 Shader Performance Analysis on a Modern GPU Architecture Victor Moya, Carlos González, Jordi Roca, Agustín Fernández Jordi Roca, Agustín Fernández Department.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Chapter.
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 19, 2011 Emergence of GPU systems and clusters for general purpose High Performance Computing.
Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.
GPU Simulator Victor Moya. Summary Rendering pipeline for 3D graphics. Rendering pipeline for 3D graphics. Graphic Processors. Graphic Processors. GPU.
1 A Single (Unified) Shader GPU Microarchitecture for Embedded Systems Victor Moya, Carlos González, Jordi Roca, Agustín Fernández Department of Computer.
ATI GPUs and Graphics APIs Mark Segal. ATI Hardware X1K series 8 SIMD vertex engines, 16 SIMD fragment (pixel) engines 3-component vector + scalar ALUs.
Z-Buffer Optimizations Patrick Cozzi Analytical Graphics, Inc.
Evolution of the Programmable Graphics Pipeline Patrick Cozzi University of Pennsylvania CIS Spring 2011.
3D computer graphic Basis for real-time rendering and GPU architecture 劉哲宇,Liou Jhe-Yu.
Status – Week 260 Victor Moya. Summary shSim. shSim. GPU design. GPU design. Future Work. Future Work. Rumors and News. Rumors and News. Imagine. Imagine.
Comparison of Modern CPUs and GPUs And the convergence of both Jonathan Palacios Josh Triska.
Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.
GPU Graphics Processing Unit. Graphics Pipeline Scene Transformations Lighting & Shading ViewingTransformations Rasterization GPUs evolved as hardware.
COOL Chips IV A High Performance 3D Graphics Rasterizer with Effective Memory Structure Woo-Chan Park, Kil-Whan Lee*, Seung-Gi Lee, Moon-Hee Choi, Won-Jong.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
Under the Hood: 3D Pipeline. Motherboard & Chipset PCI Express x16.
Background image by chromosphere.deviantart.com Fella in following slides by devart.deviantart.com DM2336 Programming hardware shaders Dioselin Gonzalez.
© Copyright Khronos Group, Page 1 Harnessing the Horsepower of OpenGL ES Hardware Acceleration Rob Simpson, Bitboys Oy.
REAL-TIME VOLUME GRAPHICS Christof Rezk Salama Computer Graphics and Multimedia Group, University of Siegen, Germany Eurographics 2006 Real-Time Volume.
GPU Programming Robert Hero Quick Overview (The Old Way) Graphics cards process Triangles Graphics cards process Triangles Quads.
Parallel Rendering 1. 2 Introduction In many situations, standard rendering pipeline not sufficient ­Need higher resolution display ­More primitives than.
Graphics Hardware and Graphics in Video Games COMP136: Introduction to Computer Graphics.
Computer Graphics Graphics Hardware
Chris Kerkhoff Matthew Sullivan 10/16/2009.  Shaders are simple programs that describe the traits of either a vertex or a pixel.  Shaders replace a.
Matrices from HELL Paul Taylor Basic Required Matrices PROJECTION WORLD VIEW.
The Graphics Rendering Pipeline 3D SCENE Collection of 3D primitives IMAGE Array of pixels Primitives: Basic geometric structures (points, lines, triangles,
1 Dr. Scott Schaefer Programmable Shaders. 2/30 Graphics Cards Performance Nvidia Geforce 6800 GTX 1  6.4 billion pixels/sec Nvidia Geforce 7900 GTX.
Introduction to Parallel Rendering Jian Huang, CS 594, Spring 2002.
Emergence of GPU systems and clusters for general purpose high performance computing ITCS 4145/5145 April 3, 2012 © Barry Wilkinson.
Stream Processing Main References: “Comparing Reyes and OpenGL on a Stream Architecture”, 2002 “Polygon Rendering on a Stream Architecture”, 2000 Department.
Parallel Rendering. 2 Introduction In many situations, a standard rendering pipeline might not be sufficient ­Need higher resolution display ­More primitives.
CS662 Computer Graphics Game Technologies Jim X. Chen, Ph.D. Computer Science Department George Mason University.
GPU Computation Strategies & Tricks Ian Buck NVIDIA.
Accelerated Stereoscopic Rendering using GPU François de Sorbier - Université Paris-Est France February 2008 WSCG'2008.
Xbox MB system memory IBM 3-way symmetric core processor ATI GPU with embedded EDRAM 12x DVD Optional Hard disk.
A SEMINAR ON 1 CONTENT 2  The Stream Programming Model  The Stream Programming Model-II  Advantage of Stream Processor  Imagine’s.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 Programming Massively Parallel Processors Lecture.
Copyright © Curt Hill Video Hardware Evolution.
From Turing Machine to Global Illumination Chun-Fa Chang National Taiwan Normal University.
Ray Tracing using Programmable Graphics Hardware
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign 1 GPU.
GPGPU introduction. Why is GPU in the picture Seeking exa-scale computing platform Minimize power per operation. – Power is directly correlated to the.
GPU Computing for GIS James Mower Department of Geography and Planning University at Albany.
GPU Architecture and Its Application
COMPUTER GRAPHICS CHAPTER 38 CS 482 – Fall 2017 GRAPHICS HARDWARE
CMSC 611: Advanced Computer Architecture
Graphics on GPU © David Kirk/NVIDIA and Wen-mei W. Hwu,
CS427 Multicore Architecture and Parallel Computing
Graphics Processing Unit
Chapter 6 GPU, Shaders, and Shading Languages
GPGPU Applications Introduction
The Graphics Rendering Pipeline
GRAPHICS PROCESSING UNIT
CSC 2231: Parallel Computer Architecture and Programming GPUs
Graphics Hardware CMSC 491/691.
Graphics Processing Unit
RADEON™ 9700 Architecture and 3D Performance
© 2012 Elsevier, Inc. All rights reserved.
Graphics Processing Unit
CIS 6930: Chip Multiprocessor: GPU Architecture and Programming
Presentation transcript:

Graphic Architecture introduction and analysis 劉哲宇,Liou Jhe-Yu

Outline GPU Architecture history Taxonomies for Parallel Rendering Tile-based rendering Fixed function pipeline Separated shader architecture Unified shader architecture Conclusion

Architecture history Silicon Graphics, Inc. 1981-2009 3D graphics accelerator in early age. Focus on high-performance graphics server. Release its own IRSI-GL API in 1992, The first open and industrial standard IRIS-GL -> OpenGL

Architecture history From ashes 1st generation – wireframe transform, clip, and project color interpolation Model SGI Iris 2000, 1984

Architecture history From ashes 2nd generation – shaded solid Lighting calculation Depth buffer, alpha blending. Model SGI GTX, 1988

Architecture history From ashes 3rd generation Texture mapping Model SGI RealityEngine, 1992

Architecture history in the later time(after 2000) Fixed function pipeline (- 2000) With or without T&L engine Vertex and Pixel shader (2001 - 2006) 4th generation Unified shader (2007 - ) 5th generation 3DMark 2000 3DMark 03 3DMark 11

Architecture history Fixed pipeline production Desktop NVIDIA GeForce 2 - NV15 (2000) ATI Radeon 7000 - R100 (2000) 3dfx Voodoo3 (1999) Acquired by NVIDIA in 2002 S3 savage 2000 (1999) Acquired by VIA in 2001 Imagination STG Kyro - PowerVR3 (2001) Mobile Imagination MBX - PowerVR3(2001) ATI Imageon 130 (2002) Sell to Qualcomm in 2009 and rename Imageon as Adreno which be integrated in their Snapdragon SoC. Falnax Mali 55 (2005) Acquired by ARM in 2006 NV GeForce 2

Architecture history Vertex shader & Fragment shader Desktop NVidia GeForce 7900 - G71 (2006) ATI X1900 – R520 (2006) Mobile NVidia GeForce ULP (2011) In NVidia’s Tegra 3. Imagination SGX545 (2010) ARM Mali 400 (2009) Qualcomm Adreno 220 (2010) NV GeForce 7900

Architecture history Unified shader Desktop NVidia GeForce 680 – GK104 (2012) AMD HD7970 (2012) Mobile Imagination G6400 (2012) ARM Mali T678 (2012) Qualcomm Adreno 320 (2012) NV GeForce 680

Outline GPU Architecture history Taxonomies for Parallel Rendering Tile-based rendering Fixed function pipeline Separated shader architecture Unified shader architecture Conclusion

Taxonomies for Parallel Rendering Ideal parallel rendering

Taxonomies for Parallel Rendering Sort-first Work in multi GPU (NVIDIA’s SLI, AMD crossfire). Sort-middle Tiled-based rendering Intel Larrabee, and most mobile GPU. Sort-last Immediate mode rendering Most Desktop GPUs belong to this area.

Sort-first On SLI application

Sort-first Advantage Disadvantage Graphic Pipeline will receive no interrupt. Easy to re-distribute job on existed hardware. Disadvantage Pre-transform are required Harm the system performance obviously. Enormous bandwidth requirement on frame buffer access

Sort-middle – Tiled-based rendering Sorting triangles into tiles after geometry stage. Raster engine wait until all triangles have been sorted. Raster engine process tile by tile sequentially.

Tiled-based rendering

Tiled-based rendering Case study – ARM MALI 400

Tile-based Memory access model Geometry Tile divider Raster On-chip tile buffer Primitive data Tile list Texture data Frame buffer Memory

Tiled-based rendering Advantage Natural way to re-distribute jobs. Save a lot of bandwidth for communication with frame buffer. Disadvantage One triangle may go into multiple tiles. Need a sorting buffer after triangle sorting More complex scene, more memory access. Completely divide graphic pipeline into two partition. This may harm the performance.

Sort-last Delay sorting until decomposing primitive into fragment. Pixel vertex Raster Pixel vertex Raster Pixel vertex Raster Pixel

Some resorting FIFO for order maintain Sort-last Graphic API has the strict limit on order rendering EX. The pre-sort operation for Alpha blending race condition. Some resorting FIFO for order maintain Pixel vertex Raster Pixel Pixel blend vertex Raster Pixel vertex Raster Pixel

Sort last Advantage Disadvantage Full rendering pipeline without interrupt until per-fragment operation (compare to sort-middle) Disadvantage Huge bandwidth requirement on frame buffer access, particularly in high resolution and anti-aliasing enable.

Memory access model Tile-based rendering Geometry Tile divider Raster On-chip tile buffer Primitive data Tile list Texture data Frame buffer Geometry Raster Frame buffer cache Immediate more rendering

Why does mobile GPU like sort-middle Minimize bandwidth requirement. Need system memory support. High memory bandwidth usage will hammer the whole system. Save more penguins (power). Reducing memory access means less power consumption.

Why does desktop GPU use sort-last Minimize performance issue Sorting cost is low Desktop GPU has its own dedicated memory. Graphics DRAM usually has high bandwidth with high latency.

Outline GPU Architecture history Taxonomies for Parallel Rendering Tile-based rendering Fixed function pipeline Separated shader architecture Unified shader architecture Conclusion: GPU architecture issue

Fixed function pipeline NVIDIA GeForce 256 (1999) First transformation and lighting(T&L) hardware Become the market leader due to this production. NVIDIA mark it as the world’s first GPU. 1 vertex pipeline, 4 pixel pipeline

Memory access latency? GPUs seldom care the memory latency because Usually hundreds of fragments fly in the raster pipeline(queue). Switch to the next fragment with tiny cost if texture cache/frame cache miss. No pipeline stall is needed.

Separated shader architecture NVIDIA Geforce 3 (2001) First vertex shader and pixel shader architecture on desktop GPU. 1 vertex shader, 4 pixel shader Using Assembly code to program shader by microsoft shader model 1.0 Kill other competitors except ATI. 3dfx, S3, sis, Maxtor, PowerVR

Separated shader architecture Case Study: GeForce 6 series NVIDIA GeForce 6800 (2004) 6 vertex shader 16 pixel shader, 16 fragment operation pipeline

Separated shader architecture Case Study: GeForce 6 series Pixel X-bar Interconnect Multisample AA Z comp C comp Z ROP C ROP Frame buffer Partition Memory Input shaded fragment data

Early Z-test

Early Z-Test concept Put depth test before texture mapping to avoid unnecessary texel fetching. Reduce the memory traffic (reduce texture cache miss)

Outline GPU Architecture history Taxonomies for Parallel Rendering Tile-based rendering Fixed function pipeline Separated shader architecture Unified shader architecture Conclusion

Unified shader Architecture Separated shader architecture limits the graphic application. The input data rate is obviously slow than vertex shader process speed. CPU’s processing speed is slow than GPU. Force programmer to use more texture operations and less polygon/simple T&L operation.

Unified shader Architecture

Unified shader Architecture ATI Xenos on Xbox 360 (2005) The world first unified shader architecture. 48 unified shaders

Unified shader Architecture Case study: NVIDIA GeForce 8800 128 CUDA core in 8 stream processors (shader cluster) 24 fragment pipeline(for z-test and color blend)

Unified shader Architecture NVIDIA GeForce 8800

Unified shader Architecture Case study: AMD ATI HD 2900 AMD ATI HD 2900XT (2006) 64 unified shader (320 stream processor) VLIW architecture, 5 operation one cycle. 4 Render Back-End (For Z-test and color blend)

Unified shader Architecture AMD ATI HD 2900XT

AMD ATI HD 2900XT Placement & Layout Fixed hardware : shader = 4 : 6 (maybe) Not 0 : 1

A unified shader comparison in 2010 NVIDIA GeForce GTX 480 ATI Radeon HD 5870 480 cores (128 in 8800) 177.4 GB/s memory bandwidth 1.34 TFLOPS single precision 3 billion transistors 1600 cores (320 in 2900) 153.6 GB/s memory bandwidth 2.72 TFLOPS single precision 2.15 billion transistors Over double the FLOPS for less transistors! What is going on here?

Compared stream processor usage AMD vs NVDIA(Fermi vs rv770)

Unified shader Architecture AMD 7970 (2012) VLIW architecture is good for existed application, but bad for the unknown/future application. VLIW’s compiler ability is limited. From VLIW to SIMD

Unified shader Architecture Case study: Intel Larrabee 32x simplified Pentium CPU. No out-of-order execution. Compatible with X86-based program. Sort-middle architecture

Unified shader Architecture Intel Larrabee Announce in 2008, shutdown in 2010…… Due to the performance issue. The research result become part of Intel MIC ?

Outline GPU Architecture history Taxonomies for Parallel Rendering Tile-based rendering Fixed function pipeline Separated shader architecture Unified shader architecture Conclusion

GPU architecture issue Where should sort happen? What is the purpose for Job re-distribution? Hide memory latency, get more memory bandwidth. Cull the hidden element as early as possible Object, triangle, pixel Programmable vs fixed ? Reality vs ideal.

Trend GPU Parallel CPU programmable

Programmable vs fixed ? Because of the Performance issue, tessellation become fixed hardware GeForce 580 -> 680 DirectX 10 -> 11

Future lead way Application lead hardware Hardware limit application Ray-tracing Hardware limit application For the money issue, more and more 3D game companies prefer to stay in Xbox360/PS3. Since 2007, the increasing rate of Image quality in 3D game has been slow down.

Any Question?

You can get slides in 140.116.164.239/~caslab/GPU_Present_NSYSU/

Example: use transparency texture to model a tree with some leaves Step 1, draw the trunk

Example: use transparency texture to model a tree with some leaves Step 2, draw leaves(use lots of triangles) Too slow ( )n

Example: use transparency texture to model a tree with some leaves Step 2, draw leaves(use transparency texture) Alpha = 0 Be dropped alpha test

Early depth test Because of early depth test, the fragments which shall be dropped by alpha test update the depth buffer now. So we separate Z-write and Z-test ,and put Z-write behind the alpha test. 由於depth test在alpha test前先做了,所以那些應該被alpha test踢掉的fragment在depth test的時候已經先更新了depth buffer,如果在這之後我們還要畫一個背景在這棵樹之後,在樹葉空白的地方由於先占據了depth buffer,背景的地方在進行depth test就會失敗而不會寫上去 這樣的話,即使fragment通過了深度測試,也會在寫回depth buffer之前就被透明度測試給踢掉。

Early depth test But separating z-test and z-write will cause data hazard problem. Using multi-Z test to perform depth test twice and avoid data hazard. 10 5 15 15 15 10 5