Compilation, Architectural Support, and Evaluation of SIMD Graphics Pipeline Programs on a General-Purpose CPU Mauricio Breternitz Jr, Herbert Hum, Sanjeev.

1 Compilation, Architectural Support, and Evaluation of SIMD Graphics Pipeline Programs on a General-Purpose CPU
Mauricio Breternitz Jr, Herbert Hum, Sanjeev Kumar
Microprocessor Research Labs, Intel Corporation

2 Parallel Architecture and Compilation Techniques, 2003: Graphics Applications
• Computationally intensive graphics applications are becoming increasingly popular
    • Computer-aided design, from airplanes to cars
    • Visualization of massive quantities of data
    • Visual simulators, e.g. for training pilots
    • Fancier graphical user interfaces
    • And, of course, games
• This trend is continuing
    • As high-end applications become more mainstream

3 Graphics Pipeline
[Diagram labels: 3D Application; Scene; OpenGL or DirectX; Transform; Lighting; Clipping; Rasterization; Texture Mapping; Compositing; Display]
• Vertex Shaders
    • Operate on every vertex in the scene
    • Effects like blur, and diffuse and specular reflection
• Pixel Shaders
    • Operate on every pixel
    • Effects like texturing and fog blending

4 Vertex and Pixel Shaders
• Need to run millions of times a second
• Small programs
• Typically run on the graphics card
• However, most desktops do not have graphics cards that support programmable shaders
• This work focuses on running Vertex Shaders on the main CPU
    • Pixel shaders have very high computational and bandwidth requirements
    • Graphics applications are designed to adapt to the available features and performance

5 Goals
• Improve the performance of Vertex Shaders on the main CPU
    • Analyze the performance on today’s CPUs
    • Better compiler optimizations
    • Additional architectural support
• Identify three architectural and compiler enhancements
    • Significant impact on performance: roughly a factor of 2

6 Outline
• Motivation
• Baseline Compiler
• Three Enhancements
• Performance Evaluation
• Conclusions

7 Vertex Shader Programs
• Small programs (at most 256 instructions)
• SIMD instructions with xyzw components
• Mask and swizzle on each instruction
• No state saved between vertices
• Read-only memory and temporary registers
• Program cannot change control flow
[Diagram: Virtual Machine. SIMD ALU; Vertex Input, 16 x 4 registers; Vertex Output, 15 x 4 registers; Constant Memory, 256 x 4; Temporary Registers, 12 x 4; Integer Registers, 84 x 1]
Example program:
    dp4 oPos.x, v0, c[0]
    dp4 oPos.y, v0, c[1]
    dp4 oPos.z, v0, c[2]
    dp4 oPos.w, v0, c[3]
    mov oD0, c[4].wzyx
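To make the mask and swizzle semantics concrete, here is a minimal Python sketch that interprets the five-instruction example program from this slide. The helper names (`swizzle`, `write_masked`, `dp4`, `run_shader`) are hypothetical and chosen for illustration; this is not the DirectX reference interpreter, only a model of what the instructions compute.

```python
COMPONENTS = "xyzw"

def swizzle(reg, pattern="xyzw"):
    """Reorder or replicate components; pattern 'wzyx' reverses them."""
    return [reg[COMPONENTS.index(ch)] for ch in pattern]

def write_masked(dest, value, mask="xyzw"):
    """Overwrite only the components of dest that are named in mask."""
    for i, ch in enumerate(COMPONENTS):
        if ch in mask:
            dest[i] = value[i]

def dp4(a, b):
    """Four-component dot product, replicated into every component."""
    s = sum(x * y for x, y in zip(a, b))
    return [s] * 4

def run_shader(v0, c):
    """Interpret the example program: a 4x4 transform plus a swizzled move."""
    oPos = [0.0] * 4
    oD0 = [0.0] * 4
    write_masked(oPos, dp4(v0, c[0]), "x")    # dp4 oPos.x, v0, c[0]
    write_masked(oPos, dp4(v0, c[1]), "y")    # dp4 oPos.y, v0, c[1]
    write_masked(oPos, dp4(v0, c[2]), "z")    # dp4 oPos.z, v0, c[2]
    write_masked(oPos, dp4(v0, c[3]), "w")    # dp4 oPos.w, v0, c[3]
    write_masked(oD0, swizzle(c[4], "wzyx"))  # mov oD0, c[4].wzyx
    return oPos, oD0

# Assumed test data: c[0..3] hold an identity matrix, c[4] holds a color.
constants = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 2.0, 3.0, 4.0],
]
position, color = run_shader([5.0, 6.0, 7.0, 1.0], constants)
```

With the identity matrix the position passes through unchanged, and the color comes out with its components reversed by the `.wzyx` swizzle.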

8 Baseline Optimizing Compiler
• Implemented a compiler for Vertex Shaders
    • Input: Vertex Shader assembly
    • Output: optimized x86 (with SSE2)
• Started with the DirectX reference rasterizer (an interpreter) and used it as the front end
• Uses the Olive pattern-matching code-generator generator
• Graph-coloring-based register allocator
• Loop unrolling
• List scheduler
• About 70% faster than a naïve translator
    • Naïve translator: translate into C and feed it to a C compiler

9 Characteristics of Generated Code
• Mostly SIMD instructions (x86 with SSE2)
    • 83-99% of instructions
• Large basic blocks
    • Use of control flow is limited
    • Makes it easier to compile efficiently
• Vertex Shader assembly to x86 assembly
    • 10-20 times increase in the number of instructions
Example: mul r0.x_z_, v0.xyzz, v1.wwww
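One reason a single shader instruction expands into several machine instructions is that each swizzle and each partial write mask needs its own data-movement step. The Python sketch below models that for the `mul r0.x_z_, v0.xyzz, v1.wwww` example. The function `lower_componentwise` and its step labels are hypothetical, and the step count it produces is only illustrative; it does not reproduce the 10-20x expansion reported for real x86 code.

```python
import operator

COMPONENTS = "xyzw"

def lower_componentwise(op, dest, mask, src0, swz0, src1, swz1):
    """Apply a component-wise op and record the conceptual steps it needs."""
    steps = []
    a, b = src0, src1
    if swz0 != COMPONENTS:  # a non-identity swizzle costs a shuffle
        a = [src0[COMPONENTS.index(ch)] for ch in swz0]
        steps.append("shuffle src0 by ." + swz0)
    if swz1 != COMPONENTS:
        b = [src1[COMPONENTS.index(ch)] for ch in swz1]
        steps.append("shuffle src1 by ." + swz1)
    result = [op(x, y) for x, y in zip(a, b)]
    steps.append("SIMD arithmetic")
    if "_" in mask:  # a partial mask must preserve the old dest components
        result = [r if m != "_" else d for r, m, d in zip(result, mask, dest)]
        steps.append("merge under mask ." + mask)
    return result, steps

# mul r0.x_z_, v0.xyzz, v1.wwww with assumed register contents
r0, steps = lower_componentwise(
    operator.mul,
    [9, 9, 9, 9], "x_z_",      # old r0 and its write mask
    [1, 2, 3, 4], "xyzz",      # v0 and its swizzle
    [10, 20, 30, 2], "wwww",   # v1 and its swizzle (broadcast of w)
)
```

Here only the x and z components of `r0` change, and the one shader instruction needs four conceptual steps: two shuffles, the multiply, and a masked merge.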

10 Outline
• Motivation
• Baseline Compiler
• Three Enhancements
• Performance Evaluation
• Conclusions

11 1. New Instructions
• Dot products are very common in shaders
• A dot product is expensive on x86
    • A sequence of 7 instructions
    • 1 multiply, 2 add, and 4 shuffle instructions in the simple case
• New dot-product instructions
    • Compute the dot product of two source operands and store it in each word of the destination operand
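The general shape of a shuffle-and-add dot product can be sketched at the lane level in Python. The slides do not show the actual 7-instruction x86 sequence, so the version below is an idealized model with hypothetical helpers (`shuffle`, `add`, `mul`); it uses one multiply, two adds, and only two shuffles, and does not attempt to match the real instruction count.

```python
def shuffle(v, order):
    """Return a new 4-lane vector whose lane i is v[order[i]]."""
    return [v[i] for i in order]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def mul(a, b):
    return [x * y for x, y in zip(a, b)]

def dot4_shuffle_add(a, b):
    """Dot product via a lane-wise multiply followed by two shuffle+add steps."""
    p = mul(a, b)                            # [p0, p1, p2, p3]
    s = add(p, shuffle(p, (1, 0, 3, 2)))     # [p0+p1, p0+p1, p2+p3, p2+p3]
    return add(s, shuffle(s, (2, 3, 0, 1)))  # every lane holds the full sum

result = dot4_shuffle_add([1, 2, 3, 4], [5, 6, 7, 8])
```

The result is replicated into every lane, which matches the behavior described for the proposed dot-product instructions: the value lands in each word of the destination.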

12 2. Mask Analysis Optimization
• Traditional optimizers keep track of liveness information on a per-register basis
• Shaders: often only part of a SIMD register is live
    • Modify the analysis to track liveness for each word of the SIMD register
• Analysis phase
    • Annotate the IR with additional information
    • During live-variable analysis, propagate the liveness mask depending on the instructions
• Optimization phase
    • Identify dead code
    • Replace some shuffle/mask instructions with moves, which might be eliminated entirely during register allocation
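Per-component liveness can be sketched as a backward pass over a straight-line program. The instruction encoding below (tuples of op, destination, write mask, and swizzled sources) and the function `mask_liveness` are assumptions made for illustration; the compiler's actual IR and analysis are not shown in the slides.

```python
COMPONENTS = "xyzw"

def mask_liveness(program, live_out):
    """Backward per-component liveness over straight-line code.

    program:  list of (op, dest, write_mask, [(src, swizzle), ...])
    live_out: dict mapping a register to the components live at exit
    Returns the surviving instructions (write masks narrowed to the
    components actually needed) and the components live on entry.
    """
    live = {reg: set(comps) for reg, comps in live_out.items()}
    kept = []
    for op, dest, mask, sources in reversed(program):
        needed = live.get(dest, set()) & set(mask)
        if not needed:
            continue  # no written component is live afterwards: dead code
        # Kill the components this instruction defines, then add its uses.
        live[dest] = live.get(dest, set()) - set(mask)
        for src, swz in sources:
            if op == "dp4":
                used = set(swz)  # a dot product reads all four source lanes
            else:
                used = {swz[COMPONENTS.index(ch)] for ch in needed}
            live.setdefault(src, set()).update(used)
        narrowed = "".join(ch for ch in COMPONENTS if ch in needed)
        kept.append((op, dest, narrowed, sources))
    kept.reverse()
    return kept, live

# Hypothetical program: only oPos.xy is needed at the end.
program = [
    ("mul", "r0", "xyzw", [("v0", "xyzw"), ("c0", "xyzw")]),
    ("mov", "r1", "xyzw", [("v1", "xyzw")]),  # r1 is never read
    ("add", "oPos", "xy", [("r0", "xyxy"), ("c1", "xxxx")]),
]
kept, live_in = mask_liveness(program, {"oPos": "xy"})
```

In this example the `mov` into `r1` is removed as dead code, and the `mul` only needs to produce `r0.xy`, so its write mask narrows from `xyzw` to `xy`. A per-register analysis would have treated all of `r0` as live.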

13 3. Number of Registers
• Spilling registers to memory can degrade performance
• Investigate the impact of increasing the number of registers from 8 to 16
• Why not more?
    • Trickier to encode in the ISA
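A rough way to see why the register count matters is to compare peak register pressure against the number of available registers. The Python sketch below computes a lower bound on the number of values that cannot stay in registers at the busiest point. The live ranges are made-up data and the functions `max_pressure` and `min_values_in_memory` are hypothetical; the compiler described here uses a graph-coloring allocator, which this does not model.

```python
def max_pressure(live_ranges):
    """Peak number of simultaneously live values.

    live_ranges: list of (first_def, last_use) instruction indices, inclusive.
    """
    events = []
    for start, end in live_ranges:
        events.append((start, 1))      # value becomes live
        events.append((end + 1, -1))   # value is dead after its last use
    # At equal positions, -1 sorts before +1, so a freed register is reused.
    events.sort()
    pressure = peak = 0
    for _, delta in events:
        pressure += delta
        peak = max(peak, pressure)
    return peak

def min_values_in_memory(live_ranges, num_registers):
    """Lower bound on values that must live in memory at the peak."""
    return max(0, max_pressure(live_ranges) - num_registers)

# Assumed example: 12 nested live ranges that all overlap in the middle.
ranges = [(i, 30 - i) for i in range(12)]
with_8 = min_values_in_memory(ranges, 8)
with_16 = min_values_in_memory(ranges, 16)
```

For this made-up input, 8 registers leave at least 4 values in memory at the peak, while 16 registers hold all 12.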

14 Outline
• Motivation
• Baseline Compiler
• Three Enhancements
• Performance Evaluation
• Conclusions

15 Experimental Setup
• 10 Vertex Shaders
    • 8-84 instructions
    • Only 3 of them have loops (Control)
• 2.2 GHz Pentium IV processor
• Instruction counts otherwise
• Break down the instructions into categories
• Measure performance by using the generated code to process an array of vertices
    • Compute the average

16 Evaluation
• New dot-product instructions: 27.4% on average (estimate)
    • Reduces the number of instructions by 24%
• Mask optimization: 19.5% on average
• Both: 42% on average
[Chart: normalized execution time for each vertex shader]
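The combined figure is close to what the two individual figures give if they are read as reductions in execution time that compose multiplicatively. That reading is an assumption (the slide only labels the chart "Normalized Execution Time"), and the function name `combined_reduction` is hypothetical. A short Python check of the arithmetic:

```python
def combined_reduction(*reductions):
    """Combine independent fractional reductions in execution time."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r  # each optimization scales the remaining time
    return 1.0 - remaining

# 27.4% from the new dot-product instructions, 19.5% from mask optimization
both = combined_reduction(0.274, 0.195)
speedup = 1.0 / (1.0 - both)
```

Under this assumption the remaining time is 0.726 x 0.805, about 0.584, so `both` is about 0.416, in line with the reported 42%. That corresponds to a speedup of roughly 1.7x before the effect of extra registers is counted.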

17 Evaluation Cont’d
• Reduces the number of instructions by 8% on average
    • 35-100% of the spill instructions
• This understates the potential benefit
    • More registers allow more aggressive optimizations like instruction scheduling
[Chart: normalized instruction count for each vertex shader]

18 Outline
• Motivation
• Baseline Compiler
• Three Enhancements
• Performance Evaluation
• Conclusions

19 Conclusions & Future Work
• Implemented an optimizing compiler for Vertex Shaders
• Proposed and evaluated three enhancements
    • Compiler: mask optimization
    • Architectural: new instructions and more registers
    • Improve performance by roughly a factor of 2
• Shaders are evolving rapidly
    • More like general-purpose processors
    • More complex model

