A Single (Unified) Shader GPU Microarchitecture for Embedded Systems. Victor Moya, Carlos González, Jordi Roca, Agustín Fernández, Department of Computer.


1 A Single (Unified) Shader GPU Microarchitecture for Embedded Systems
- Victor Moya, Carlos González, Jordi Roca, Agustín Fernández: Department of Computer Architecture, UPC
- Roger Espasa: Intel DEG Barcelona

2 Introduction
- Graphics, and specifically 3D graphics, have become an important element in current PDAs, mobile phones and other handheld systems
  - OpenGL ES: a simplified OpenGL specification for embedded systems
- The classic GPU architecture for the PC is not suited for embedded systems
  - Low power
  - Low area budget
- We propose a single unified shader GPU architecture for embedded systems

3 Outline: ATTILA PC, ATTILA Embedded, Triangle Setup in the Shader Unit, ATTILA Simulation Framework, Results

4 Outline: ATTILA PC, ATTILA Embedded, Triangle Setup in the Shader Unit, ATTILA Simulation Framework, Results

5 Attila Classic for PCs
- Optimized for large resolutions
  - Above 1024x768
- Optimized for high performance
- High power requirements
  - No power optimizations
  - 100+ watts on current high-end GPUs
- Large area budget
  - 300+ million transistors on current high-end GPUs
- Large amount of dedicated memory bandwidth
  - 40+ GB/s on current high-end GPUs
- Specialized shader units
  - 2 to 8 vertex shader units
  - 1 to 6 fragment shader units

6 Attila PC: Specialized Shaders
- (Block diagram. Blocks shown: Vertex Fetch, Vertex Shaders, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Fragment Shaders, ROPs, Memory Controllers)
- Four fragments processed in parallel

7 Outline: ATTILA PC, ATTILA Embedded, Triangle Setup in the Shader Unit, ATTILA Simulation Framework, Results

8 Embedded Requirements
- Optimized for small resolutions
  - 320x240 to 640x480
- Optimized for low power
  - Reduce frequency
  - Power optimizations
  - Improve efficiency
- Small area budget
  - Remove non-crucial hardware
- Low available bandwidth
- Reduced shading power
- Reduce design complexity

9 Attila Embedded
- No Hierarchical Z
- No Z compression
- Single unified shader
  - 1 SIMD ALU
  - Multithreaded
    - 16 threads of four vertex/triangle/fragment elements
    - 16 128-bit registers for temporary storage available per thread
  - Texture unit outputs 1 bilinear sample for a whole fragment quad every 4 cycles
  - 4 KB texture cache
- ROP
  - One Z and one color value updated per cycle in the framebuffer (a fragment quad every 4 cycles)
- Single 64-bit DDR channel
  - Limited by current simulator implementation
  - Modeled as equivalent to a small (1 MB) embedded DRAM
- 32-bit high-latency bus to large system memory for textures
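A rough, back-of-the-envelope sketch of what the "one Z and one color value per cycle" ROP rate implies. It combines that rate with the 200 MHz clock listed for the embedded configurations on slide 24, the 320x240 resolution, and the 20 frames per second reported on slide 26. The function names are hypothetical, and the result is only a peak bound: it ignores shading time, memory stalls and every other bottleneck.

```python
# Peak ROP throughput estimate for the embedded configuration.
# All input numbers come from the slides; the helper names are made up here.

CLOCK_HZ = 200_000_000       # embedded configurations run at 200 MHz (slide 24)
FRAGMENTS_PER_CYCLE = 1      # one Z and one color value per cycle (slide 9)
WIDTH, HEIGHT = 320, 240     # benchmark resolution
TARGET_FPS = 20              # average reported for the low-end configurations


def peak_fragments_per_second(clock_hz, fragments_per_cycle):
    """Upper bound on fragments written per second by the ROP."""
    return clock_hz * fragments_per_cycle


def cycles_per_pixel_per_frame(clock_hz, width, height, fps):
    """Clock cycles available per screen pixel in each frame."""
    return clock_hz / (width * height * fps)


if __name__ == "__main__":
    print(peak_fragments_per_second(CLOCK_HZ, FRAGMENTS_PER_CYCLE))
    print(cycles_per_pixel_per_frame(CLOCK_HZ, WIDTH, HEIGHT, TARGET_FPS))
```

At 20 frames per second this leaves roughly 130 cycles per screen pixel per frame, which has to cover vertex, triangle and fragment work on the single shader.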

10 Attila Embedded: Single Unified Shader
- (Block diagram. Blocks shown: Vertex Fetch, Primitive Assembly, Clipping, Rasterization, Scheduler, Distributor, Shader, ROP, Memory Controller. Data labels: Vertices, Triangles, Fragments)
- Single fragment per cycle pipeline

11 Outline: ATTILA PC, ATTILA Embedded, Triangle Setup in the Shader Unit, ATTILA Simulation Framework, Results

12 Triangle Setup in the Shader
- 2D Homogeneous Rasterization
  - Olano & Greer
- Triangle setup algorithm:
  - Calculate setup matrix from triangle vertex matrix
  - Calculate interpolation equation for fragment Z
  - Cull triangles based on their facing direction (area sign)
- Algorithm suited for a SIMD implementation in the Unified Shader
- Inputs:
  - Four 3-component vectors as input for the triangle vertex positions
- Outputs:
  - Three 4-component vectors as output for the triangle edge and Z interpolation equation coefficients
  - One signed triangle area register as output for the face culling stage
- 26-instruction triangle shader program
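The three setup steps can be sketched in a few lines of Python. This is not the 26-instruction shader program from the slide, which is not reproduced in the transcript. It is a minimal illustration of 2D homogeneous setup under stated assumptions: the four input vectors hold the x, y, z and w components of the three vertices, the setup matrix is the adjoint of the (x, y, w) vertex matrix built from cross products, and a positive area means front-facing. The function names and the output packing are hypothetical.

```python
def cross(a, b):
    """3-component cross product."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """3-component dot product."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def triangle_setup(xs, ys, zs, ws):
    """xs, ys, zs, ws: four 3-component vectors, one component per vertex.

    Returns three 4-component coefficient vectors (a, b, c) and the signed area.
    Component j < 3 of (a, b, c) is edge equation j; component 3 is the Z equation.
    """
    v = [(xs[i], ys[i], ws[i]) for i in range(3)]
    # Columns of the adjoint of the vertex matrix: one edge equation each.
    e = [cross(v[1], v[2]), cross(v[2], v[0]), cross(v[0], v[1])]
    # Determinant of the vertex matrix; its sign gives the facing direction.
    area = dot(v[0], e[0])
    # Z equation: dot(v[i], zc) == area * zs[i] for every vertex i.
    zc = tuple(sum(zs[j] * e[j][k] for j in range(3)) for k in range(3))
    a = (e[0][0], e[1][0], e[2][0], zc[0])
    b = (e[0][1], e[1][1], e[2][1], zc[1])
    c = (e[0][2], e[1][2], e[2][2], zc[2])
    return a, b, c, area


def evaluate(a, b, c, x, y):
    """Evaluate the three edge equations and the Z equation at (x, y, 1)."""
    return tuple(a[k] * x + b[k] * y + c[k] for k in range(4))


def is_back_facing(area):
    """Assumed convention: non-positive area is culled."""
    return area <= 0
```

Each output is a 4-wide multiply-add over the same inputs, which is why the computation maps naturally onto a SIMD ALU. The values returned by `evaluate` are scaled by the signed area, so only their signs matter for the inside test.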

13 Triangle Setup in the Shader
- Benefits
  - Reduce area
    - No specialized hardware required for triangle setup
  - Reduce design complexity
  - Improve efficiency
    - Graphics workloads in embedded applications may not fully utilize the specialized triangle setup hardware in most cases
    - Higher utilization of the shader
- Costs
  - Shader workload increases
  - Rerouting of the rasterization pipeline required

14 Outline: ATTILA PC, ATTILA Embedded, Triangle Setup in the Shader Unit, ATTILA Simulation Framework, Results

15 (Simulation framework diagram)
- Stages: Collect, Verify, Simulate, Analyze
- Components shown: OpenGL Application, GLInterceptor, Vendor OpenGL Driver, ATI R520/NVidia G70, Framebuffer, Trace, GLPlayer, ATTILA OpenGL Driver, ATTILA Simulator, Signal Visualizer, Statistics, Signal Traffic
- The framebuffers are labeled "CHECK!"

16 (Simulation framework diagram, as on slide 15)
- GLInterceptor
  - Capture a trace of OpenGL API calls from a real game

17 (Simulation framework diagram, as on slide 15)
- GLPlayer
  - Reproduce the captured trace
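The capture-and-replay idea behind GLInterceptor and GLPlayer can be illustrated with a small sketch. The real tools work on OpenGL API calls, and their interfaces are not shown on the slides. Everything below (`TraceRecorder`, `replay`, `FakeDriver` and its two methods) is hypothetical and only shows the pattern: record each call into a trace, then replay the trace against a second backend and compare the outputs.

```python
import json


class TraceRecorder:
    """Forwards every method call to a backend and logs it (hypothetical)."""

    def __init__(self, backend):
        self._backend = backend
        self.trace = []

    def __getattr__(self, name):
        # Only reached for names not found on the recorder itself.
        target = getattr(self._backend, name)

        def wrapper(*args):
            self.trace.append({"call": name, "args": list(args)})
            return target(*args)

        return wrapper


def replay(trace, backend):
    """Re-issue every recorded call, in order, against another backend."""
    for entry in trace:
        getattr(backend, entry["call"])(*entry["args"])


class FakeDriver:
    """Stand-in for a driver; its 'framebuffer' is just a list of commands."""

    def __init__(self):
        self.framebuffer = []

    def clear(self, color):
        self.framebuffer = [("clear", color)]

    def draw_triangle(self, v0, v1, v2):
        self.framebuffer.append(("triangle", v0, v1, v2))


if __name__ == "__main__":
    vendor = FakeDriver()
    app = TraceRecorder(vendor)
    app.clear("black")
    app.draw_triangle(0, 1, 2)

    stored = json.dumps(app.trace)          # the trace can be saved to disk
    simulated = FakeDriver()
    replay(json.loads(stored), simulated)
    print(vendor.framebuffer == simulated.framebuffer)
```

Comparing the two framebuffers at the end plays the role of the "CHECK!" label in the diagram.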

18 (Simulation framework diagram, as on slide 15)
- OpenGL Library
  - Transform Fixed Function API into shader code
  - 200 API calls supported
  - ARB Vertex and Fragment extensions
  - Alpha and Fog emulated via shader code
- Driver
  - Low level interface to GPU hardware
  - Attila memory management

19 (Simulation framework diagram, as on slide 15)
- ATTILA Simulator
  - Detailed cycle-by-cycle simulation of all pipeline stages
  - 20 boxes, modeling a 100-deep pipeline
  - Execute@Execute: functionality embedded at each pipeline stage
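A minimal sketch of what a cycle-by-cycle simulation built from boxes and signals can look like. The slides name boxes and signals but do not show the simulator's interfaces, so the classes below (`Signal`, `Producer`, `Consumer`, `simulate`) are hypothetical and are not the ATTILA API. The sketch assumes each signal has a fixed latency of at least one cycle and carries at most one item per cycle.

```python
from collections import deque


class Signal:
    """Fixed-latency channel between two boxes (hypothetical)."""

    def __init__(self, latency):
        self.latency = latency
        self._in_flight = deque()   # (arrival_cycle, data) in write order

    def write(self, cycle, data):
        self._in_flight.append((cycle + self.latency, data))

    def read(self, cycle):
        """Return the item arriving exactly at `cycle`, or None."""
        if self._in_flight and self._in_flight[0][0] == cycle:
            return self._in_flight.popleft()[1]
        return None


class Producer:
    """Box that emits one queued item per cycle."""

    def __init__(self, out, items):
        self.out = out
        self.items = deque(items)

    def clock(self, cycle):
        if self.items:
            self.out.write(cycle, self.items.popleft())


class Consumer:
    """Box that records what it receives and when."""

    def __init__(self, inp):
        self.inp = inp
        self.received = []

    def clock(self, cycle):
        data = self.inp.read(cycle)
        if data is not None:
            self.received.append((cycle, data))


def simulate(boxes, cycles):
    """Clock every box once per simulated cycle."""
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)


if __name__ == "__main__":
    wire = Signal(latency=3)
    source = Producer(wire, ["v0", "v1", "v2"])
    sink = Consumer(wire)
    simulate([source, sink], cycles=8)
    print(sink.received)
```

Because every signal delays data by at least one cycle, the order in which boxes are clocked within a cycle does not change the result in this sketch.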

20 Spot the differences: Attila, NVidia GeForce FX 5900XT

21 Outline: ATTILA PC, ATTILA Embedded, Triangle Setup in the Shader Unit, ATTILA Simulation Framework, Results

22 Benchmark
- Unreal Tournament 2004
  - NOT AN EMBEDDED BENCHMARK
    - Up to 300K vertices per frame!
  - Fixed function OpenGL API
    - Vertex and fragment shaders generated by our library
  - 320x240 resolution
  - 140 of 450 frames simulated
  - 100+ frames ~ 1 day simulation
    - On a Xeon P4 @ 2.0 GHz

23 Configurations
- We have evaluated
  - 3 mid-range to low-end PC GPU configurations
  - 2 configurations for GPUs integrated on the chipset and high-end PDA GPUs
  - 4 embedded low-end GPU configurations
- We tried to keep a balance between memory bandwidth and shading computing power
  - From 4 to no vertex shader units
  - From 2 quad fragment shader units to a single unified shader unit
  - From four to one 64-bit DDR memory channels
  - Store framebuffer in small (1 MB) GPU memory and textures in system memory
- Halved the frequency for embedded systems
  - Restricted design rules
  - Reduce power consumption
- Removed all optional features at the low end
  - Hierarchical Z
  - Z compression
  - Specialized Triangle Setup hardware

24 Evaluated Configurations
Conf  Res   MHz  VSh  (F)Sh  Fetch Way  Regs Thread  Setup   Buses  Cache  eDRAM  HZ  Z Compr
A     1024  400  4    2x4    2          16x32        Fixed   4      16 KB  -      Y   Y
B     320   400  4    2x4    2          16x32        Fixed   4             -      Y   Y
C     320   400  2    1x4    2          16x32        Fixed   2             -      Y   Y
D     320   400  2    1x4    2          16x32        Fixed   2      8 KB   -      N   Y
E     320   200  -    1x2    2          8x32         Fixed   1             -      N   Y
F     320   200  -    1x2    2          8x32         Fixed   1      4 KB   -      N   N
G     320   200  -    1x1    2          16x16        Fixed   1             -      N   N
H     320   200  -    1x1    1          16x16        Fixed   1             -      N   N
I     320   200  -    1x1    1          16x16        Shader  1             -      N   N
J     320   200  -    1x1    1          16x16        Shader  1             1 MB   N   N
K     320   200  -    1x1    1          16x16        Shader  1      4 KB   1 MB   Y   Y

25 Configuration Comparison

26 Performance
- Average of 20 frames per second at 320x240 for the lower-end single shader configurations

27 Efficiency
- The limiting factor for PC and high-end embedded configurations is memory bandwidth
  - Shaders underutilized for the evaluated benchmark
- The limiting factor for low-end configurations is shading processing
  - Memory bandwidth could be further reduced
- Caches seem overdimensioned for the low-end embedded configurations

28 Shaded Triangle Setup Performance
- No overhead on fragment-limited benchmarks
- 16% lower performance in vertex- and triangle-limited traces

29 Conclusion
- The Attila Embedded achieves 20 frames per second on a single unified shader architecture at a 320x240 resolution when using a year-old PC benchmark
- 1 MB of fast embedded DRAM provides more than enough bandwidth for framebuffer accesses
  - Texture data stored in system memory
- 16% performance reduction when removing the specialized Triangle Setup unit in the worst tested case

30 Questions?

31 Attila PC: Unified Shader Pool
- (Block diagram. Blocks shown: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, Shader, ROP, four Memory Controllers)


33 PowerVR SGX

