Generic Software Pipelining at the Assembly Level
Markus Pister

2/23 Embedded Systems (ES)
 Embedded Systems (ES) are widely used
 Many systems of daily use: mobile phones, handhelds, …
 Safety-critical systems: airbag control, flight control systems, …
 Rapidly growing complexity of software in ES

3/23 Embedded Systems (2)
 Hard real-time scenarios:
 -Short response times
 -Flight control systems, airbag control systems
 Low power consumption and weight:
 -Mobile phones, handhelds, …
 Urgent need for fast program execution under the constraint of very limited code size

4/23 Code Generation for ES
 Program execution time is mostly spent in loops
 Modern processors offer massive instruction-level parallelism (ILP)
 VLIW architectures: e.g. Philips TriMedia TM1000
 EPIC architectures: e.g. Intel Itanium

5/23 Code Generation for ES (2)
 Many existing compilers cannot generate satisfactory code (they cannot exploit ILP)
 Enhancing them to cope with advanced ILP requires high effort
 Improve the quality of legacy compilers by starting at the assembly level and building flexible postpass optimizers that
 -Can be quickly retargeted
 -Improve generated code quality significantly

6/23 PROPAN ― Overview
 Postpass-oriented Retargetable Optimizer and Analyzer

7/23 In this talk
 Software Pipelining as a postpass optimization
 Important technique to exploit ILP while trying to keep code size low
 Static, cyclic and global instruction scheduling method
 Idea: overlap the execution of consecutive iterations of a loop
[Figure: DDG of a loop body (a → b → c) and a 4x unrolled loop in which the overlapped iterations a, b, c form the kernel]

8/23 Software Pipelining
 Computes a new (shorter) loop body
 Overlaps loop iterations
 Exploits ILP
 Modulo Scheduling
 -The initiation interval (II) divides the loop into stages
 -Schedule operations modulo II
 Iterative Modulo Scheduling
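As a worked example (hypothetical numbers): with II = 2, an operation placed at flat-schedule time t occupies its resources in kernel cycle t mod 2. A flat schedule a@0, b@1, c@2, d@3 thus folds into a two-cycle kernel with {a, c} in cycle 0 and {b, d} in cycle 1; in the running pipeline, the c and d of one iteration execute alongside the a and b of a later one.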

9/23 Minimum Initiation Interval
 Resource-based: MII_res
 -Determined by the resource requirements
 -Approximation of optimal bin packing
 Data-dependence-based: MII_dep
 -Determined by the delays imposed by cycles in the DDG
 MII = max(MII_res, MII_dep) (computation sketched below)
 Basis for the kernel (modulo) computation
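A minimal sketch of this standard lower-bound computation (the descriptors are illustrative, not PROPAN's actual data structures): per resource class, the demand of one iteration is divided by the number of available units; per DDG cycle, the accumulated latency is divided by the iteration distance the cycle spans.

    /* Hypothetical descriptors; PROPAN's real data structures differ. */
    typedef struct { int uses; int units; } Resource;      /* demand per iteration / #functional units */
    typedef struct { int delay; int distance; } DDGCycle;  /* summed latency / iteration distance      */

    static int ceil_div(int a, int b) { return (a + b - 1) / b; }

    int compute_mii(const Resource *res, int nres, const DDGCycle *cyc, int ncyc)
    {
        int mii_res = 1, mii_dep = 1;

        /* Resource-based bound: each resource class must fit into II cycles. */
        for (int i = 0; i < nres; i++)
            if (ceil_div(res[i].uses, res[i].units) > mii_res)
                mii_res = ceil_div(res[i].uses, res[i].units);

        /* Dependence-based bound: a cycle's delay must fit into II * distance cycles. */
        for (int i = 0; i < ncyc; i++)
            if (ceil_div(cyc[i].delay, cyc[i].distance) > mii_dep)
                mii_dep = ceil_div(cyc[i].delay, cyc[i].distance);

        return mii_res > mii_dep ? mii_res : mii_dep;   /* MII = max(MII_res, MII_dep) */
    }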

10/23 Scheduling Phase
 Flat Schedule: maintain a partial feasible schedule
 Algorithm (sketched below):
 -Pick the next operation
 -Compute its slot window [EStart, LStart]
 -Search for a feasible slot within [EStart, LStart]
 -On conflict: unschedule some operations and force the current operation into the partial schedule
 Kernel: schedule the operations of the Flat Schedule modulo II
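A sketch of that driver loop in the spirit of Rau's iterative modulo scheduling; Op, Sched and all helper functions are hypothetical placeholders for the corresponding PROPAN phases, not its actual API.

    /* Minimal sketch of Rau-style iterative modulo scheduling. */
    typedef struct Op Op;
    typedef struct Sched Sched;

    Op  *pick_highest_priority(Sched *s);               /* next unscheduled op, HLF order     */
    int  estart(Sched *s, Op *op, int ii);              /* earliest start w.r.t. predecessors */
    int  resource_conflict(Sched *s, Op *op, int slot); /* modulo resource table check        */
    void unschedule_conflicting(Sched *s, Op *op, int slot);
    void place(Sched *s, Op *op, int t);

    int modulo_schedule(Sched *s, int ii, int budget)
    {
        while (budget-- > 0) {
            Op *op = pick_highest_priority(s);
            if (!op) return 1;                       /* all operations placed: success  */

            int early = estart(s, op, ii);
            int late  = early + ii - 1;              /* only II distinct modulo slots   */

            int t = early;
            while (t <= late && resource_conflict(s, op, t % ii))
                t++;                                 /* probe the slot window           */

            if (t > late) {                          /* window exhausted: force the op  */
                t = early;
                unschedule_conflicting(s, op, t % ii);
            }
            place(s, op, t);
        }
        return 0;                                    /* budget used up: retry with II+1 */
    }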

11/23 Prologue / Epilogue
 The prologue „fills up“ the pipeline, the epilogue „drains it down“ (see the source-level sketch below)
[Figure: pipelined loop with stages a–e and II = 1; the leading partial iterations form the prologue, the steady state forms the kernel, the trailing partial iterations form the epilogue]
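A hypothetical source-level view of the transformation for a three-stage loop (stages a, b, c; II = 1); the actual transformation operates on TriMedia assembly, but the shape is the same:

    void a(int i); void b(int i); void c(int i);   /* the three pipeline stages */

    /* Original loop: for (i = 0; i < n; i++) { a(i); b(i); c(i); } */
    void pipelined(int n)                /* assumes n >= 2 */
    {
        a(0);                            /* prologue: fill the pipeline  */
        b(0); a(1);

        for (int i = 0; i + 2 < n; i++) {   /* kernel: steady state      */
            c(i); b(i + 1); a(i + 2);       /* three iterations overlap  */
        }

        c(n - 2); b(n - 1);              /* epilogue: drain the pipeline */
        c(n - 1);
    }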

12/23 Characteristics of the Postpass Approach
 Integration of the pipelined loop into the surrounding control flow
 -Modification of branch targets needed
 Reconstruction of the CFG is complex and difficult
 -Targets of computed branches/calls and switch tables must be resolved, e.g.:

    ld32d(20) r4 → r34    (load the jump target, e.g. from a switch table)
    ijmpt r1 r34          (guarded indirect jump to the address in r34)

13/23 Characteristics of the Postpass Approach (2)
 Register allocation is already done
 -The assignment can be changed with Modulo Variable Expansion
 -Liveness properties must be checked before register renaming
 Applicable to inline assembly and library code
 Data dependences at the assembly level are more general
 -More generality leads to a more complex DDG
 -One single array access  multiple assembler operations

14/23 Data dependences at the assembly level
 Example: a single array access in the source expands to several assembler operations

    i = 0;
    …
    i = i + 1;
    …
    j = array[i];

 compiles to:

    ld32d(8) r6 → r7      (load i)
    …
    iadd(1) r7 → r8       (i = i + 1)
    …
    ld32d(20) r4 → r10    (load the address of array)
    iadd r10 r8 → r9      (compute the address of array[i])
    ld32d r9 → r11        (j = array[i])

15/23 DDG at the assembly level
[Figure: data dependence graph of the assembly sequence from the previous slide]

16/23 TriMedia TM1000 ― Overview
 Digital signal processor for multimedia applications, designed by Philips
 100 MHz VLIW CPU (32 bit)
 128 general-purpose registers (32 bit)
 27 parallel functional units

17/23 TM1000 ― VLIW-Core
[Figure: block diagram of the TM1000 VLIW core]

18/23 TriMedia TM1000 ― Properties
 Instruction set
 -Register-based addressing modes
 -Predicated execution: register-based
 -Load/store architecture
 -Special multimedia operations
 5 issue slots, 5 write-back busses
 Irregular execution times of operations
 -The write-back bus has to be modeled independently

19/23 Experimental Results
 Benchmark files from the DSPstone and MiBench suites
 Best performance gains for chain-like DDGs (speedup of up to 3.1)

20/23 Experimental Results (2)
 Moderate code size increase (average factor: 1.42)

21/23 Experimental Results (3)
 The computed MII is mostly already feasible (in 73% of the cases)

22/23 Future Work
 Nested loops:
 -Process loops from the innermost to the outermost one
 -Treat an inner loop as one instruction (“meta-instruction”)
 Parallelize prologue and epilogue code with the surrounding code
 -Can be done by existing acyclic scheduling techniques such as list scheduling
 Delay slot filling

23/23 Conclusion
 Embedded systems create a need for fast program execution under the constraint of very limited code size
 Overcome the limitations of existing compilers with a retargetable postpass optimizer
 Fast program execution by exploiting ILP with Software Pipelining
 -Iterative Modulo Scheduling at the assembly level
 -Characteristics of the postpass approach
 Experimental results show
 -a speedup of up to 3.1 with
 -an average code size increase of 1.42

24/23 Hardware Support
 Predicated execution
 -Makes it possible to omit the prologue and epilogue (see the sketch below)
 Rotating register files
 -No Modulo Register Expansion needed
 Speculative execution
 -An arbitrary number of iterations becomes possible without keeping a copy of the original loop at the expense of code size
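A hypothetical source-level sketch of such a kernel-only loop: the stage guards play the role of predicate registers, so no separate prologue or epilogue code is emitted (stages a, b, c as before):

    void a(int i); void b(int i); void c(int i);   /* pipeline stages */

    /* Kernel-only software-pipelined loop: for 3 stages it runs
     * n + 2 kernel iterations; each guard enables its stage only
     * while the pipeline is filled, steady, or draining.         */
    void kernel_only(int n)
    {
        for (int t = 0; t < n + 2; t++) {
            if (t < n)               a(t);        /* stage 0: off while draining */
            if (t >= 1 && t - 1 < n) b(t - 1);    /* stage 1                     */
            if (t >= 2)              c(t - 2);    /* stage 2: off while filling  */
        }
    }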

25/23 Slot Window
 Early Start (EStart)
 -Earliest possible schedule time w.r.t. the data dependences
 Late Start (LStart)
 -Analogous to Early Start, the latest possible schedule time
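In Rau's formulation of iterative modulo scheduling, which this follows, the early start is derived from the already scheduled predecessors of the operation in the DDG:

    EStart(op) = max over scheduled predecessors p of
                 ( time(p) + latency(p, op) − II · distance(p, op) )

where distance(p, op) is the number of loop iterations the dependence spans. Since only II distinct modulo slots exist, the window actually searched never needs to exceed [EStart, EStart + II − 1].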

26/23 Highest-Level-First Priority
 Larger priority  smaller slack available for the operation w.r.t. the critical path
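One common concretization of this priority (a plausible reading, not necessarily PROPAN's exact function) ranks each operation by its height, the latency-weighted length of the longest DDG path from the operation to a sink:

    priority(op) = height(op) = max over successors s of ( latency(op, s) + height(s) ),  height(sink) = 0

Operations on the critical path have maximal height and zero slack (LStart − EStart = 0), so they are scheduled first, before any freedom to place them is consumed.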

27/23 Modulo Register Expansion
 Flat schedule: operations a, b, c, d occupy cycles [0]–[3]; with II = 1 the modulo schedule folds them all into the single kernel cycle [0]
 A value with a life span of 2 cycles outlives the II: the next iteration would overwrite it  true data dependence violation
 Fix: unroll the kernel and rename the registers in the copy (a´ b´ c´ d´ in kernel cycle [1])  expanded modulo schedule (sketch below)
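A hypothetical source-level illustration of the unroll-and-rename step: the value produced in one iteration is consumed one kernel cycle later (life span 2 > II = 1), so a single temporary would be clobbered; unrolling the kernel twice and alternating two copies fixes this. produce, use, t0 and t1 are illustrative names:

    void use(int v);
    int produce(int i);

    /* Each produced value is consumed one kernel cycle after the
     * next produce; a single temporary t would be overwritten
     * before its use, so t0 and t1 alternate.                    */
    void mve(int n)                  /* assumes n is even and n >= 2 */
    {
        int t0 = produce(0);
        int t1 = produce(1);
        use(t0);                     /* prologue */

        for (int i = 2; i < n; i += 2) {   /* kernel unrolled twice   */
            t0 = produce(i);
            use(t1);                 /* each value survives exactly   */
            t1 = produce(i + 1);     /* one overlapping iteration     */
            use(t0);
        }
        use(t1);                     /* epilogue */
    }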