Loop Unrolling & Predication CSE 820

Michigan State University, Computer Science and Engineering

Software Pipelining
With software pipelining, a reorganized loop contains instructions from different iterations of the original loop. It is sometimes called symbolic loop unrolling.

[Figure: Software pipelined loop]

Unrolled Loop
Select a subset of each iteration (marked "<=="):

Iteration 1:  L.D    F0, 0(R1)
              ADD.D  F4, F0, F2
              S.D    F4, 0(R1)      <==
Iteration 2:  L.D    F0, 0(R1)
              ADD.D  F4, F0, F2     <==
              S.D    F4, 0(R1)
Iteration 3:  L.D    F0, 0(R1)      <==
              ADD.D  F4, F0, F2
              S.D    F4, 0(R1)

Software Pipelining

Loop:   S.D    F4, 16(R1)     ; stores into M[i]
        ADD.D  F4, F0, F2     ; adds to M[i-1]
        L.D    F0, 0(R1)      ; loads M[i-2]
        DADDUI R1, R1, #-8
        BNE    R1, R2, Loop

Requires start-up code before the loop and clean-up code after it.
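
The start-up and clean-up code is not shown on the slide. A minimal Python sketch of what it might look like, assuming the loop adds a scalar s to every element of an array m (the function names are hypothetical, and the variables f0 and f4 only mimic the registers F0 and F4):

```python
def add_scalar_simple(m, s):
    """Reference loop: m[i] = m[i] + s, walking from the last element down."""
    out = list(m)
    for i in range(len(out) - 1, -1, -1):
        f0 = out[i]          # L.D
        f4 = f0 + s          # ADD.D
        out[i] = f4          # S.D
    return out


def add_scalar_pipelined(m, s):
    """Software-pipelined form: each kernel pass stores element i,
    adds for element i-1 and loads element i-2."""
    out = list(m)
    n = len(out)
    if n < 3:                # too short to fill the pipeline
        return add_scalar_simple(out, s)

    # Start-up: run the load and add that the first store depends on.
    f0 = out[n - 1]          # load for the first iteration
    f4 = f0 + s              # add for the first iteration
    f0 = out[n - 2]          # load for the second iteration

    # Kernel: one instruction from each of three different iterations.
    for i in range(n - 1, 1, -1):
        out[i] = f4          # S.D   stores into M[i]
        f4 = f0 + s          # ADD.D adds to M[i-1]
        f0 = out[i - 2]      # L.D   loads M[i-2]

    # Clean-up: finish the two iterations still in flight.
    out[1] = f4
    f4 = f0 + s
    out[0] = f4
    return out
```

Both functions return the same list; the pipelined one only changes the order in which the loads, adds and stores are issued.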

Symbolic Loop Unrolling
Software pipelining can be thought of as symbolic loop unrolling, but it has the advantage of generating less code.

[Figure: Software pipelining has less overhead]

Global Code Scheduling
Global code scheduling allows instructions to move across branches. Most techniques concentrate on identifying a straight-line code segment that represents the most frequently executed code.

Trace Scheduling Concept
1. Guess the likely path through branches (called the trace).
2. The trace now contains long stretches of code without (predicted) taken branches.
3. Schedule the trace, allowing movement across branches.
   – Add code off the trace to undo the effects of movement.
The increased ability to move across branches should improve scheduling.

Movement + Undo
Consider

    if (cond) then {
        x = x + 5;      // likely
    } else
        // unlikely

After movement

    x = x + 5;
    if (cond) then {
        // likely
    } else {
        x = x - 5;      // unlikely: undo
    }
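
A small Python sketch of the same transformation can be used to check that the undo code preserves the result. The function names are hypothetical, and the sketch assumes integer arithmetic, where subtracting 5 exactly reverses adding 5:

```python
def original(x, cond):
    """Update stays inside the likely branch."""
    if cond:
        x = x + 5        # likely path
    return x


def after_movement(x, cond):
    """Update hoisted above the branch, with undo code off the trace."""
    x = x + 5            # moved above the branch
    if cond:
        pass             # likely path: nothing left to do
    else:
        x = x - 5        # unlikely path: undo the hoisted update
    return x
```

The likely path now has no work after the branch, while the unlikely path pays for one extra instruction.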

[Figure: Select a trace]

[Figure: Trace showing jumps off the trace]

Superblocks
Superblocks avoid the multiple entries and exits of traces. A superblock has one entry and multiple exits, which makes scheduling easier. The one-entry, multiple-exit form is achieved by duplicating code where the unlikely path exits the trace, so that no reentry is needed.

[Figure: Superblock with one entry and multiple exits]

Predicated Instructions
Requires
– Hardware
– ISA modification
Predicated instructions eliminate branches, converting a control dependence into a data dependence.
IA-64 has predicated instructions, and many existing ISAs contain at least one (the conditional move).

Conditional Move

    if (R1 == 0) R2 = R3;

Branch:
        BNEZ  R1, L
        ADDU  R2, R3, R0
L:

Conditional move:
        CMOVZ R2, R3, R1

In a pipeline, the control dependence at the beginning of the pipeline is transformed into a data dependence at the end of the pipeline.
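
The two sequences compute the same value of R2. A Python sketch of their semantics, where with_branch, cmovz and with_cmov are hypothetical helper names (Python's conditional expression only models the behavior; it says nothing about pipeline timing):

```python
def with_branch(r1, r2, r3):
    """Branch version: the move is skipped when R1 != 0."""
    if r1 != 0:              # BNEZ R1, L
        return r2            # branch taken: R2 unchanged
    r2 = r3 + 0              # ADDU R2, R3, R0
    return r2


def cmovz(dest, src, cond):
    """CMOVZ dest, src, cond: new dest is src if cond == 0, else the old dest."""
    return src if cond == 0 else dest


def with_cmov(r1, r2, r3):
    """Conditional-move version: no branch, R2 depends on R1 as data."""
    return cmovz(r2, r3, r1)
```

In the second version the new R2 is a function of R1, R3 and the old R2, which is the data dependence the slide refers to.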

Full Predication
Every instruction has a predicate: if the predicate is false, the instruction becomes a NOP.
Full predication is particularly useful for global scheduling because it can eliminate non-loop branches, which are the harder ones to schedule.
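
As an illustration, the statement if (A==0) A=B; else A=A+4; from the Example slide later in this deck can be written with two predicated instructions and no branch. The following Python sketch is a toy model, not IA-64 syntax; run_predicated, if_converted and the predicate names p1 and p2 are all hypothetical:

```python
def run_predicated(program, regs, preds):
    """Execute (predicate, dest, fn, sources) tuples in order.
    An instruction whose predicate is False behaves like a NOP."""
    for pred, dest, fn, srcs in program:
        if pred is not None and not preds[pred]:
            continue                               # annulled
        regs[dest] = fn(*(regs[s] for s in srcs))
    return regs


def if_converted(a, b):
    """if (A == 0) A = B; else A = A + 4; with no branch in the program."""
    regs = {"A": a, "B": b}
    # Both predicates are set once, as a compare instruction would do.
    preds = {"p1": a == 0, "p2": a != 0}
    program = [
        ("p1", "A", lambda x: x, ("B",)),          # (p1) A = B
        ("p2", "A", lambda x: x + 4, ("A",)),      # (p2) A = A + 4
    ]
    return run_predicated(program, regs, preds)["A"]
```

Both instructions are always fetched; exactly one of them takes effect, depending on the predicates.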

Exceptions & Predication
A predicated instruction must not be allowed to generate an exception if its predicate is false.

Implementation
Predicated instructions can be annulled early in the pipeline, but annulling at commit delays the decision so that data hazards have an opportunity to be resolved.
The disadvantage of late annulment is that the annulled instructions still use resources such as functional units and registers (rename or other).

Predication is good for…
– Short alternative control flow
– Eliminating some unpredictable branches
– Reducing the overhead of global scheduling
But the precise rules for compilation are still being determined.

Limitations
– Annulled instructions waste resources: registers, functional units, cache and memory bandwidth.
– If the predicate condition cannot be separated from the instruction that uses it, the instruction may have to wait for the condition; a branch might have performed better if it could have been predicted accurately.

Limitations (cont.)
– Predication across multiple branches can complicate control and is undesirable unless hardware supports it (as in IA-64).
– Predicated instructions may have a speed penalty. This is not the case when all instructions are predicated.

Example

    if (A==0) A=B; else A=A+4;

        LD    R1, 0(R3)      ; load A
        BNEZ  R1, L1         ; test A
        LD    R1, 0(R2)      ; then clause
        J     L2             ; skip else
L1:     DADDI R1, R1, #4     ; else clause
L2:     SD    R1, 0(R3)      ; store A

Hoist Load

    if (A==0) A=B; else A=A+4;

        LD    R1, 0(R3)      ; load A
        LD    R14, 0(R2)     ; speculative load B
        BEQZ  R1, L3         ; other branch of if
        DADDI R14, R1, #4    ; else clause
L3:     SD    R14, 0(R3)     ; store A

What if the speculative load raises an exception?

Guard

    if (A==0) A=B; else A=A+4;

        LD     R1, 0(R3)     ; load A
        sLD    R14, 0(R2)    ; speculative load
        BNEZ   R1, L1        ; test A
        SPECCK 0(R2)         ; speculative check
        J      L2            ; skip else
L1:     DADDI  R14, R1, #4   ; else clause
L2:     SD     R14, 0(R3)    ; store A

sLD does not raise certain exceptions; it leaves them for SPECCK (IA-64).
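
The division of work between sLD and SPECCK can be modeled in a few lines of Python. This is only a behavioral sketch: memory is a dictionary, a missing key stands in for a faulting address, and MemoryFault, speculative_load, spec_check and guarded are hypothetical names rather than real IA-64 operations:

```python
class MemoryFault(Exception):
    """Stands in for a fault such as a page fault or protection violation."""


def load(memory, addr):
    """Ordinary load: faults immediately on a bad address."""
    if addr not in memory:
        raise MemoryFault(addr)
    return memory[addr]


def speculative_load(memory, addr, deferred):
    """sLD: on a fault, record it in `deferred` and return a dummy value."""
    try:
        return load(memory, addr)
    except MemoryFault:
        deferred.add(addr)
        return 0


def spec_check(addr, deferred):
    """SPECCK: raise the fault now if the earlier sLD on `addr` failed."""
    if addr in deferred:
        raise MemoryFault(addr)


def guarded(memory, addr_a, addr_b):
    """if (A == 0) A = B; else A = A + 4; with the load of B hoisted."""
    deferred = set()
    r1 = load(memory, addr_a)                          # LD     R1, 0(R3)
    r14 = speculative_load(memory, addr_b, deferred)   # sLD    R14, 0(R2)
    if r1 == 0:                                        # BNEZ   R1, L1
        spec_check(addr_b, deferred)                   # SPECCK 0(R2)
    else:
        r14 = r1 + 4                                   # DADDI  R14, R1, #4
    memory[addr_a] = r14                               # SD     R14, 0(R3)
    return memory
```

A bad address for B raises a fault only when A == 0, which is the only case in which the original program would have loaded B.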

Other exception techniques
Poison bit:
– applied to the destination register
– set upon exception
– raises the exception upon access to the poisoned register
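
A toy register file with poison bits, sketched in Python under the assumption that any Python exception stands in for a hardware exception. RegisterFile, write_speculative and PoisonedRegisterError are hypothetical names:

```python
class PoisonedRegisterError(Exception):
    """Raised when a poisoned register is read."""


class RegisterFile:
    def __init__(self):
        self.values = {}
        self.poison = set()          # names of registers whose poison bit is set

    def write(self, reg, value):
        """Normal write: stores the value and clears the poison bit."""
        self.values[reg] = value
        self.poison.discard(reg)

    def write_speculative(self, reg, compute):
        """Run a speculative operation; on an exception, poison `reg`
        instead of raising."""
        try:
            self.write(reg, compute())
        except Exception:
            self.values[reg] = None
            self.poison.add(reg)

    def read(self, reg):
        """The deferred exception surfaces when the register is used."""
        if reg in self.poison:
            raise PoisonedRegisterError(reg)
        return self.values[reg]
```

If the speculated value is never read, the exception never appears, which is the desired behavior for a mis-speculated instruction.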

Hoist Load above Store
If the memory addresses are known, a load can be hoisted above a store.
If not, add a special instruction that checks the addresses before the loaded value is used. (It is similar to the SPECCK shown earlier: IA-64.)
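
A Python sketch of the address check, assuming the fix-up is simply to redo the load when the store turns out to have written the loaded address (function names are hypothetical, and memory is modeled as a dictionary):

```python
def store_then_load(memory, store_addr, value, load_addr):
    """Original order: the store executes before the load."""
    memory[store_addr] = value
    return memory[load_addr]


def hoisted_load(memory, store_addr, value, load_addr):
    """Load hoisted above the store, followed by an address check."""
    loaded = memory[load_addr]       # load moved above the store
    memory[store_addr] = value       # the store it was hoisted past
    if store_addr == load_addr:      # check: did the store hit that address?
        loaded = memory[load_addr]   # fix-up: redo the load
    return loaded
```

When the addresses differ, the check costs one comparison and the early load result is used as is.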

Speculation: soft vs. hard
– Software must be able to disambiguate memory (to hoist loads past stores), but the information available at compile time is insufficient.
– Hardware works best when control flow is unpredictable and when hardware branch prediction is superior.
– Exception handling is easier in hardware.
– Trace techniques require compensation code.
– Compilers see further, which allows better scheduling.

IA-64