Shader Performance Analysis on a Modern GPU Architecture. Victor Moya, Carlos González, Jordi Roca, Agustín Fernández

1 Shader Performance Analysis on a Modern GPU Architecture
Victor Moya, Carlos González, Jordi Roca, Agustín Fernández
Department of Computer Architecture, UPC
Roger Espasa
Intel DEG Barcelona

2 Introduction
- Shaders in GPUs evolving towards general programming
  - Branches, generic loads, scatter
- New types of shaders: geometry in DX10
- Current specialized shaders
  - Area hungry
  - Unbalancing leads to inefficiencies
- This paper: unify all shaders
  - ~8% higher performance with less area & resources

3 Outline
- Attila – our GPU architecture
- Attila-Classic: Non-unified shaders
- Attila-Unified: Unified Shaders
- Simulation Framework
- Results

5 ATTILA
- Our implementation of current GPUs
  - Inspired by both NVIDIA and ATI
  - Not exact to either pipeline
    - Lack of detailed microarchitecture information
    - Educated guessing on our side
- Implemented Features
  - 2D Homogeneous Recursive Rasterization
  - Tiled Rasterization
  - Hierarchical Z
  - Texture compression
  - Anisotropic filtering
  - Depth compression, fast z/stencil and color clear

7 Attila Classic: Specialized Shaders
[Pipeline diagram. Blocks shown: Vertex Fetch, Vertex Shader (x4), Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Fragment Shader (x4), ROP, Memory Controller (x4)]

8 Specialized Shader Issues
- Unbalancing
  - In fragment shading limited scenarios (typical), up to 30% of the processing power remains idle (for a GPU with 8 vertex and 4 fragment shaders)
  - In vertex shading limited scenarios, up to 70% of the processing power remains idle
- Dedicated Area
  - 4 unused vertex shaders have the same processing power as 1 fragment shader
  - 4 vertex shaders require 66% of the area of a fragment shader
- Different Designs
  - Increase the complexity of the microarchitecture
  - Increase development and verification time
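
The idle-power figures on this slide can be checked with a little arithmetic. The sketch below assumes, as the slide states, that four vertex shaders match one fragment shader in processing power, and further assumes that the non-limiting shader type sits completely idle, which gives an upper bound. The function and constant names are illustrative and are not taken from the ATTILA code.

```python
# Upper bound on idle shading power in a non-unified GPU.
# Assumption (from the slide): 4 vertex shaders == 1 fragment shader in processing power.
VERTEX_PER_FRAGMENT = 4


def idle_fraction(vertex_shaders: int, fragment_shaders: int, limited_by: str) -> float:
    """Fraction of total shading power left idle when one shader type is the bottleneck."""
    # Express both shader types in "fragment shader equivalents".
    vertex_power = vertex_shaders / VERTEX_PER_FRAGMENT
    fragment_power = float(fragment_shaders)
    total = vertex_power + fragment_power
    if limited_by == "fragment":
        # Fragment shaders are saturated, so the vertex shaders are the idle part.
        return vertex_power / total
    if limited_by == "vertex":
        # Vertex shaders are saturated, so the fragment shaders are the idle part.
        return fragment_power / total
    raise ValueError("limited_by must be 'fragment' or 'vertex'")


if __name__ == "__main__":
    print(f"fragment limited: {idle_fraction(8, 4, 'fragment'):.0%} idle")
    print(f"vertex limited:   {idle_fraction(8, 4, 'vertex'):.0%} idle")
```

For 8 vertex and 4 fragment shaders this gives roughly 33% and 67%, close to the "up to 30%" and "up to 70%" quoted on the slide.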

10 Attila Unified: Unified Shader Pool
[Pipeline diagram. Blocks shown: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, Shader, ROP, Memory Controller (x4)]

11 Unified Shader Architecture Benefits
- Unified programming model
  - DX10/SM4 and OpenGL/GLSlang are already pushing for it
- The same features for all the program targets
  - Texturing, branching, outputs
- Not just vertex and fragment programs
  - DX10 => geometry shader
  - General Purpose GPU or Stream Processor
- Workload balance
  - Shading resources allocated as required at any point of the rendering

12 Unified Shader Architecture Costs
- Scheduler
  - Select which kind of workload must be processed next
  - Partly implemented with multithreading in the fragment shader to hide texture access latency
- Larger instruction memory and constant bank
- Rerouting required
  - All the paths cross the shader pool
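
The scheduler's job of selecting which kind of workload to process next can be sketched as follows. The slides do not describe the actual selection policy, so the policy here (serve the input queue with the highest relative occupancy) is purely an assumption for illustration, and `WorkQueue` and `select_next` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkQueue:
    """Pending shader inputs of one kind (hypothetical structure)."""
    kind: str
    capacity: int
    items: deque = field(default_factory=deque)

    def occupancy(self) -> float:
        """How full the queue is, as a fraction of its capacity."""
        return len(self.items) / self.capacity


def select_next(queues: List[WorkQueue]) -> Optional[str]:
    """Return the kind of workload to issue next, or None if nothing is pending.

    Assumed policy (not from the slides): serve the queue with the highest
    relative occupancy; ties go to the queue listed first.
    """
    pending = [q for q in queues if q.items]
    if not pending:
        return None
    return max(pending, key=WorkQueue.occupancy).kind


if __name__ == "__main__":
    vertex = WorkQueue("vertex", capacity=16, items=deque(range(4)))
    fragment = WorkQueue("fragment", capacity=64, items=deque(range(48)))
    print(select_next([vertex, fragment]))  # fragment queue is 75% full vs 25%
```

A real design would also have to account for thread availability and downstream back-pressure; the sketch only shows the selection step.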

14 ATTILA Framework
- OpenGL Interceptor tool
- OpenGL library for Attila
- GPU Driver for our Attila GPU
- Attila GPU simulator
- Signal Visualizer Tool

15 Collect, Verify, Simulate, Analyze
[Workflow diagram. Blocks shown: OpenGL Application, GLInterceptor, Vendor OpenGL Driver, Trace, ATI R520/NVidia G70, Framebuffer, GLPlayer, ATTILA OpenGL Driver, ATTILA Simulator, Signal Visualizer, Statistics, Signal Traffic; label "CHECK!"]

16 Collect, Verify, Simulate, Analyze
[Same workflow diagram as slide 15]
GLInterceptor
- Captures a trace of OpenGL API calls from a real game

17 Collect, Verify, Simulate, Analyze
[Same workflow diagram as slide 15]
GLPlayer
- Reproduce the captured trace

18 Collect, Verify, Simulate, Analyze
[Same workflow diagram as slide 15]
OpenGL Library
- Transforms Fixed Function into Shader code
- 200 API Calls supported
- ARB Vertex and Fragment extensions
- Alpha and Fog emulated via Shader code
Driver
- Low level access
- Attila memory management

19 Collect, Verify, Simulate, Analyze
[Same workflow diagram as slide 15]
ATTILA Simulator
- Detailed cycle-by-cycle simulation of all pipeline stages
- 20 boxes, modeling a 100-deep pipeline
- Execute@Execute: functionality embedded at each pipeline stage
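
To make the cycle-by-cycle, box-based organisation concrete, here is a toy sketch. It is not the ATTILA simulator's code: the `Signal`, `Box` and `simulate` names are hypothetical, and the reading of "Execute@Execute" as "each box performs its functional work in the cycle it is clocked" is an assumption based on the slide text.

```python
from collections import deque


class Signal:
    """A wire between two boxes that delivers data after a fixed latency in cycles."""

    def __init__(self, latency: int):
        self.latency = latency
        self.in_flight = deque()  # entries are (arrival_cycle, data), in write order

    def write(self, cycle: int, data) -> None:
        self.in_flight.append((cycle + self.latency, data))

    def read(self, cycle: int):
        """Return the oldest data item that has arrived by this cycle, or None."""
        if self.in_flight and self.in_flight[0][0] <= cycle:
            return self.in_flight.popleft()[1]
        return None


class Box:
    """One pipeline stage; its functional work happens inside clock()."""

    def __init__(self, name, func, inp: Signal, out: Signal):
        self.name, self.func, self.inp, self.out = name, func, inp, out

    def clock(self, cycle: int) -> None:
        data = self.inp.read(cycle)
        if data is not None:
            # Timing and functionality are modeled together in the same step.
            self.out.write(cycle, self.func(data))


def simulate(boxes, cycles: int) -> None:
    """Clock every box once per simulated cycle."""
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)


if __name__ == "__main__":
    s0, s1, s2 = Signal(1), Signal(2), Signal(1)
    boxes = [Box("double", lambda x: x * 2, s0, s1),
             Box("increment", lambda x: x + 1, s1, s2)]
    s0.write(0, 5)         # inject one item at cycle 0
    simulate(boxes, 4)     # run cycles 0..3
    print(s2.read(4))      # prints 11 after 1 + 2 + 1 = 4 cycles of signal latency
```

Chaining many such boxes through signals with their own latencies is one way a small number of boxes could model a much deeper pipeline.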

20 Find the differences
[Image comparison: NVIDIA GeForce FX 5900XT vs. Attila]

22 Benchmark
- Unreal Tournament 2004
  - Fixed function OpenGL API
    - Vertex and fragment shaders generated by our library
  - 1024x768 resolution
  - 8x Anisotropic Filtering
  - 160 of 450 frames simulated
  - 40 frames ~ 1 day of simulation
  - On a Xeon P4 @ 2.0 GHz

23 Baseline Configuration
- Four Vertex Shaders (only for Attila-Classic)
- Fragment and Unified shader configuration:
  - 32 threads
    - 4 fragments/vertices per thread
    - 16 128-bit FP registers available for temporary storage per thread
  - n SIMD ALUs
  - 1 scalar ALU (optional)
  - 1 Texture Unit per Shader Unit
    - 16 KB texture cache
    - Single cycle bilinear and two cycle trilinear
    - AF up to 16x
- Geometry and Rasterization pipelines limited to 1 vertex and 1 triangle per cycle
- Two ROPs: 8 z and 8 color values written per cycle
- Four 64-bit DDR buses: peak bandwidth 64 bytes/cycle
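
The peak bandwidth figure follows from the bus configuration. The sketch below assumes that "DDR" means two transfers per bus per clock cycle, which is the usual meaning of double data rate; the function name is illustrative.

```python
def peak_bytes_per_cycle(buses: int, bus_width_bits: int, transfers_per_cycle: int = 2) -> int:
    """Peak memory bandwidth in bytes per clock cycle.

    transfers_per_cycle defaults to 2 (assumed double data rate).
    """
    bytes_per_transfer = bus_width_bits // 8
    return buses * bytes_per_transfer * transfers_per_cycle


if __name__ == "__main__":
    # Four 64-bit DDR buses: 4 buses * 8 bytes * 2 transfers = 64 bytes/cycle
    print(peak_bytes_per_cycle(buses=4, bus_width_bits=64))
```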

24 "Classic" Performance
- 8% improvement for 2-way
- Near linear improvement for 4 shaders
- Sublinear improvement for 6 and 8 shaders
  - Limited by memory bandwidth and latency
[Chart. Series: 2sh, 4sh, 6sh, 8sh; annotations: ~75%, ~45%, ~40%, 7%, 8%]

25 Frame 330 – Detailed Zoom
[Chart: vertex shader and fragment shader workload for 4 vertex shader units and 2 fragment shader units; annotation: Vertex shading limited]

26 Unified Shader Performance
- Unified improvement ranges from 1% (2 shaders) to 8% (eight 1-way shaders)
[Chart. Series: 2sh, 4sh, 6sh, 8sh; annotations: Fragment shading limited, Vertex fetch limited, Geometry pipeline limited]

27 Area Estimation

                          ATI R400   ATI RV400
Transistors (millions)    160        120
Vertex Shaders            6          4
Fragment Shaders          4          2

Hardware Element          Estimated Area (Millions of Transistors)
Vertex Shader             2.5
Fragment Shader           15
Additional SIMD ALU       +15%
Additional scalar ALU     +5%

160 - 120 = 40 = 2 vertex shaders * 2.5 + 2 fragment shaders * 15 + 5 (other)
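
The area estimate can be verified numerically. This sketch reads the flattened table as 6 versus 4 vertex shaders and 4 versus 2 fragment shaders for the two chips, which is the reading that matches the slide's own equation; the constant and function names are illustrative.

```python
# Per-unit estimates from the slide, in millions of transistors.
VERTEX_SHADER_MT = 2.5
FRAGMENT_SHADER_MT = 15.0
OTHER_MT = 5.0  # the "+ 5 (other)" term in the slide's equation

# Chip data as read from the slide's table (assumed column split).
LARGE_CHIP = {"transistors": 160, "vertex": 6, "fragment": 4}
SMALL_CHIP = {"transistors": 120, "vertex": 4, "fragment": 2}


def estimated_delta(extra_vertex: int, extra_fragment: int, other: float = OTHER_MT) -> float:
    """Estimated transistor difference (millions) for the given extra shader units."""
    return extra_vertex * VERTEX_SHADER_MT + extra_fragment * FRAGMENT_SHADER_MT + other


if __name__ == "__main__":
    measured = LARGE_CHIP["transistors"] - SMALL_CHIP["transistors"]
    estimated = estimated_delta(
        LARGE_CHIP["vertex"] - SMALL_CHIP["vertex"],
        LARGE_CHIP["fragment"] - SMALL_CHIP["fragment"],
    )
    print(measured, estimated)  # 40 and 40.0
    # Four vertex shaders relative to one fragment shader: about 0.67
    print(4 * VERTEX_SHADER_MT / FRAGMENT_SHADER_MT)
```

The last line reproduces the earlier claim that four vertex shaders take about 66% of the area of one fragment shader.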

28 Shader Scaling vs Transistors
- Linear for 4 shader units, sublinear for more than 4 shader units
- Up to 30% more efficient per area for the unified architecture (two 1-way shaders)
[Chart. Series: 2sh, 4sh, 6sh, 8sh]

29 Conclusion
- Attila Unified architecture has better performance than Attila Classic with less hardware
  - Up to 8% better performance
  - 8% to 25% less area required
  - 10% to 30% better performance per area
- Up to 8% better performance for 2-way shader units
- 160% better performance from 2 to 8 fragment or unified shader units
  - Memory bandwidth limited beyond 4 shaders

30 Questions

31 Performance of Attila Unified vs Classic Attila

