
ECE 463/521, Profs. Conte, Rotenberg and Gehringer, Dept. of ECE, NC State University

Static Scheduling Techniques
  - Local scheduling (within a basic block)
  - Loop unrolling
  - Software pipelining (modulo scheduling)
  - Trace scheduling
  - Predication

Loop Unrolling (1 of 2)
  - Unroll the loop (e.g., four times)

  Original loop (5 cycles/iteration):
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          ADDI   R1, R1, #8
          BNE    R1, xxx, Loop

  Unrolled four times (14 cycles / 4 iterations = 3.5 cycles/iteration):
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, 8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, 8(R1)
          L.D    F10, 16(R1)
          ADD.D  F12, F10, F2
          S.D    F12, 16(R1)
          L.D    F14, 24(R1)
          ADD.D  F16, F14, F2
          S.D    F16, 24(R1)
          ADDI   R1, R1, #32
          BNE    R1, xxx, Loop

  Unrolled and rescheduled (loads, adds, and stores grouped):
    Loop: L.D    F0, 0(R1)
          L.D    F6, 8(R1)
          L.D    F10, 16(R1)
          L.D    F14, 24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          S.D    F4, 0(R1)
          S.D    F8, 8(R1)
          S.D    F12, 16(R1)
          S.D    F16, 24(R1)
          ADDI   R1, R1, #32
          BNE    R1, xxx, Loop

  More registers needed!
  Benefits:
  - Less dynamic instruction overhead (ADDI/BNE)
  - Fewer branches, so a larger basic block: operations can be rescheduled more easily
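
  The same transformation can be written at the C source level. Below is a minimal sketch (not from the slides), assuming a hypothetical array a, a scalar k, and, for the unrolled version, a trip count n that is a multiple of 4:

    #include <stddef.h>

    /* Original loop: one element per iteration, so the ADDI/BNE-style
     * loop overhead is paid for every element. */
    void add_k(double *a, double k, size_t n) {
        for (size_t i = 0; i < n; i++)
            a[i] = a[i] + k;
    }

    /* Unrolled by 4 (assumes n is a multiple of 4): the loop overhead is
     * paid once per 4 elements, and the 4 independent bodies give the
     * scheduler a larger basic block to work with. */
    void add_k_unrolled(double *a, double k, size_t n) {
        for (size_t i = 0; i < n; i += 4) {
            a[i]     = a[i]     + k;
            a[i + 1] = a[i + 1] + k;
            a[i + 2] = a[i + 2] + k;
            a[i + 3] = a[i + 3] + k;
        }
    }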

Loop Unrolling (2 of 2)
  Positives/negatives:
  - (+) Larger block for scheduling
  - (+) Reduces branch frequency
  - (+) Reduces dynamic instruction count (loop overhead)
  - (–) Expands code size
  - (–) Have to handle excess iterations ("strip-mining")

Strip-mining
  - Suppose that our loop requires 10 iterations, and we can unroll only 4 of the iterations.
  - If we go through our unrolled loop 3 times, how many iterations (of the original loop) would we be performing? (12, which is too many.)
  - If we go through our unrolled loop twice, how many iterations (of the original loop) would we be performing? (Only 8, which leaves 2 iterations over.)
  - So, we need to add some prelude code that covers the leftover iterations, e.g.:

    Prel: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, 8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, 8(R1)
          ADD    R1, R1, 16
          BNE    R1, xxx, Prel
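
  At the source level, strip-mining amounts to peeling off the leftover iterations before (or after) the unrolled loop. Below is a minimal C sketch (not from the slides), assuming the same hypothetical a[i] = a[i] + k loop with an arbitrary trip count n:

    #include <stddef.h>

    /* Strip-mined version of a[i] = a[i] + k.  A short "prelude" loop
     * retires the n % 4 leftover iterations one at a time, after which
     * the remaining trip count is a multiple of 4 and the unrolled-by-4
     * main loop needs no exit test in the middle of its body. */
    void add_k_stripmined(double *a, double k, size_t n) {
        size_t i = 0;

        /* Prelude: handle the excess iterations (n % 4 of them). */
        for (; i < n % 4; i++)
            a[i] = a[i] + k;

        /* Main unrolled loop: exactly (n - n % 4) / 4 passes. */
        for (; i < n; i += 4) {
            a[i]     = a[i]     + k;
            a[i + 1] = a[i + 1] + k;
            a[i + 2] = a[i + 2] + k;
            a[i + 3] = a[i + 3] + k;
        }
    }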

Static Scheduling Techniques
  - Local scheduling (within a basic block)
  - Loop unrolling
  - Software pipelining (modulo scheduling)
  - Trace scheduling
  - Predication

Software Pipelining (1 of 5)
  - Symbolic loop unrolling: the instructions in one iteration of the software-pipelined loop are taken from different iterations of the original loop.

  [Figure: iterations 0 through 4 of the original loop shown overlapped in time; one software-pipelined iteration is formed by taking one stage from each of several different original iterations.]

  (This slide borrowed from Per Stenström, Chalmers University, Gothenburg, Sweden, Computer Architecture, Lecture 6.)

Software Pipelining (2 of 5)
  - Treat dependent operations in a loop as a pipeline: LD[i] -> ADDD[i] -> SD[i]

  Original loop:
    for (i=1; i<=N; i++) {   // a[i] = a[i] + k;
      load  a[i]
      add   a[i]
      store a[i]
    }

  Software-pipelined loop:
    START-UP-BLOCK            // prologue
    for (i=3; i<=N; i++) {
      load  a[i]
      add   a[i-1]
      store a[i-2]
    }
    FINISH-UP-BLOCK           // epilogue

  Schedule of operations (START-UP = prologue, LOOP = kernel, FINISH-UP = epilogue):

              START-UP  |            LOOP             |  FINISH-UP
    load :  a[1]  a[2]  | a[3] ... a[i]   ... a[N]    |
    add  :        a[1]  | a[2] ... a[i-1] ... a[N-1]  | a[N]
    store:              | a[1] ... a[i-2] ... a[N-2]  | a[N-1]  a[N]

  - Hide latencies by placing different stages in successive iterations:
    - LD[i] goes in the first iteration
    - ADDD[i] goes in the second iteration
    - SD[i] goes in the third iteration
  - Stated another way, LD[i], ADDD[i-1], SD[i-2] go into the same iteration.
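
  The same staging can be written out in C. Below is a minimal sketch (not from the slides) of a software-pipelined a[i] = a[i] + k, assuming an array a, a scalar k, and a trip count n; each kernel pass stores element i-2, adds element i-1, and loads element i:

    #include <stddef.h>

    void add_k_swp(double *a, double k, size_t n) {
        if (n < 2) {                      /* tiny trip counts: nothing to pipeline */
            for (size_t i = 0; i < n; i++)
                a[i] = a[i] + k;
            return;
        }

        /* Prologue (pipeline fill): load a[0] and a[1], add a[0]. */
        double loaded = a[1];             /* most recently loaded element  */
        double summed = a[0] + k;         /* most recently computed sum    */

        /* Kernel: instructions from three consecutive original iterations. */
        for (size_t i = 2; i < n; i++) {
            a[i - 2] = summed;            /* store, from iteration i-2 */
            summed   = loaded + k;        /* add,   from iteration i-1 */
            loaded   = a[i];              /* load,  from iteration i   */
        }

        /* Epilogue (pipeline drain): finish the last two elements. */
        a[n - 2] = summed;
        a[n - 1] = loaded + k;
    }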

Software Pipelining (3 of 5)
  - Assembly-code version

  Original loop:
    Loop:   L.D    F0, 0(R1)
            ADD.D  F4, F0, F2
            S.D    F4, 0(R1)
            ADDI   R1, R1, #8
            BNE    R1, xxx, Loop

  Software-pipelined version:
    START:  L.D    F0, 0(R1)        ; prologue: fill the pipeline
            ADD.D  F4, F0, F2
            L.D    F0, 8(R1)
            ADDI   R1, R1, #16
    Loop:   S.D    F4, -16(R1)      ; kernel
            ADD.D  F4, F0, F2
            L.D    F0, 0(R1)
            ADDI   R1, R1, #8
            BNE    R1, xxx, Loop
    FINISH: S.D    F4, -16(R1)      ; epilogue: drain the pipeline
            ADD.D  F4, F0, F2
            S.D    F4, -8(R1)

Software Pipelining (4 of 5)
  Instructions from 3 consecutive iterations form the loop body:

    Loop:  S.D    F4, -16(R1)   ; from iteration i
           ADD.D  F4, F0, F2    ; from iteration i+1
           L.D    F0, 0(R1)     ; from iteration i+2
           ADDI   R1, R1, #8
           BNE    R1, xxx, Loop

  - No data dependences within a loop iteration
  - The dependence distance is 2 iterations
  - WAR hazard elimination is needed
  - Requires startup and finish code

  (This slide borrowed from Per Stenström, Chalmers University, Gothenburg, Sweden, Computer Architecture, Lecture 6.)

Software Pipelining (5 of 5)
  Positives/negatives:
  - (+) No dependences in the loop body
  - (+) Same effect as loop unrolling (hides latencies), but no need to replicate iterations (smaller code size)
  - (–) Still has extra code for the prologue/epilogue (pipeline fill/drain)
  - (–) Does not reduce branch frequency

Static Scheduling Techniques
  - Local scheduling (within a basic block)
  - Loop unrolling
  - Software pipelining (modulo scheduling)
  - Trace scheduling
  - Predication

Trace Scheduling (1 of 3)
  Creates a sequence of instructions that are likely to be executed, called a trace. Two steps:
  - Trace selection: find a likely sequence of basic blocks (a trace) across statically predicted branches (e.g., if-then-else).
  - Trace compaction: schedule the trace to be as efficient as possible while preserving correctness in case the prediction is wrong.

  (This slide and the next one have been borrowed from Per Stenström, Chalmers University, Gothenburg, Sweden, Computer Architecture, Lecture 6.)

Trace Scheduling (2 of 3)
  [Flowchart: A[i] := B[i] + C[i], then a branch on "A[i] = 0?". The true path assigns B[i]; the false path executes X; both paths rejoin at the assignment to C[i].]

  - The leftmost sequence is chosen as the most likely trace. The assignment to B is control-dependent on the if statement.
  - Trace compaction has to respect data dependences.
  - The rightmost (less likely) trace has to be augmented with fixup code.

Trace Scheduling (3 of 3)
  - Select the most common path, a trace
    - Use profiling to select a trace
    - Allows global scheduling, i.e., scheduling across branches
    - This is speculation, because the schedule assumes a certain path through the region
    - If the trace is wrong (other paths taken), execute repair code
    - Efficient static branch prediction is key to success
    - Yields more instruction-level parallelism

  Original code:
    b[i] = <old value>
    a[i] = b[i] + c[i]
    if (a[i] = 0) then
        b[i] = <new value>   // common case
    else
        X
    c[i] =

  Trace to be scheduled:
    b[i] = <old value>
    a[i] = b[i] + c[i]
    b[i] = <new value>
    c[i] =
    if (a[i] != 0) goto A
    B:

  Repair code:
    A:  restore old b[i]
        X
        maybe recalculate c[i]
        goto B
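
  The control structure of the trace plus repair code can be sketched in C. This is only an illustration (the real transformation is applied by the compiler to scheduled machine code), and the helpers old_value(), new_value(), compute_c(), and compute_x() are hypothetical stand-ins for the slide's elided <old value>, <new value>, the assignment to c[i], and the block X:

    /* Hypothetical stand-ins for the slide's elided expressions. */
    static double old_value(int i)  { return (double)i; }
    static double new_value(int i)  { return (double)i + 1.0; }
    static double compute_c(int i)  { return (double)i * 2.0; }
    static void   compute_x(int i)  { (void)i; /* the "X" block */ }

    /* Original code: the then-path (b[i] = <new value>) is the common case. */
    void original(double *a, double *b, double *c, int i) {
        b[i] = old_value(i);
        a[i] = b[i] + c[i];
        if (a[i] == 0)
            b[i] = new_value(i);   /* common case */
        else
            compute_x(i);
        c[i] = compute_c(i);
    }

    /* Trace-scheduled shape: the likely path is laid out straight-line and
     * scheduled first; the unlikely path exits to repair code that undoes
     * the speculative assignment to b[i]. */
    void traced(double *a, double *b, double *c, int i) {
        b[i] = old_value(i);
        a[i] = b[i] + c[i];
        b[i] = new_value(i);       /* speculatively assume a[i] == 0 */
        c[i] = compute_c(i);
        if (a[i] != 0) {           /* trace exit: prediction was wrong */
            b[i] = old_value(i);   /* repair: restore old b[i] */
            compute_x(i);
            c[i] = compute_c(i);   /* repair: maybe recalculate c[i] */
        }
    }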

Static Scheduling Techniques
  - Local scheduling (within a basic block)
  - Loop unrolling
  - Software pipelining (modulo scheduling)
  - Trace scheduling
  - Predication

Predication (1 of 3)
  - Conditional or predicated instructions
    - Executed only if a condition is satisfied
    - Control dependences converted into data dependences
  - Useful for short if-then statements
  - More complex; might slow down cycle time

  Example:
    Normal code:          Conditional:
      BNEZ  R1, L           CMOVZ R2, R3, R1
      MOV   R2, R3
    L:

  (This slide borrowed from Per Stenström, Chalmers University, Gothenburg, Sweden, Computer Architecture, Lecture 6.)
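
  A minimal C sketch (not from the slides) of the same if-conversion; the variables r1, r2, r3 are just stand-ins for the registers in the example:

    /* Branching form: the move is control-dependent on the test. */
    int select_branch(int r1, int r2, int r3) {
        if (r1 == 0)        /* BNEZ R1, L skips the move when R1 != 0 */
            r2 = r3;        /* MOV  R2, R3 */
        return r2;
    }

    /* Branch-free form: the move is guarded by the condition itself, so
     * the control dependence becomes a data dependence, as with
     * CMOVZ R2, R3, R1. */
    int select_cmov(int r1, int r2, int r3) {
        r2 = (r1 == 0) ? r3 : r2;
        return r2;
    }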

Predication (2 of 3)
  - ISA role
    - Provide predicate registers
    - Provide predicate-setting instructions (e.g., compare)
    - A subset of opcodes can be guarded with predicates
  - Compiler role
    - Replace the branch with a predicate computation
    - Guard the alternate paths with <predicate> and <!predicate>
  - Hardware role
    - Execute all predicated code, i.e., both paths after a branch
    - Do not commit either path until the predicate is known
    - Conditionally commit or quash, depending on the predicate

  Original code:
    b[i] = <old value>
    a[i] = b[i] + c[i]
    if (a[i] = 0) then
        b[i] = <new value>   // common case
    else
        X
    c[i] =

  Predicated code:
    b[i] = <old value>
    a[i] = b[i] + c[i]
    pred1 = (a[i] = 0)
    <pred1>  : b[i] = <new value>
    <!pred1> : X
    c[i] =
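
  The predicated code above can be rendered as branch-free C. This is only a sketch: new_value() and compute_x_result() are hypothetical stand-ins for <new value> and the work done by X, and the else-path X is modeled as an assignment to a hypothetical array x; a real predicated ISA would guard the instructions themselves with predicate registers rather than select between values:

    static double new_value(int i)        { return (double)i + 1.0; }
    static double compute_x_result(int i) { return (double)i - 1.0; }

    void predicated(double *a, double *b, double *c, double *x, int i) {
        a[i] = b[i] + c[i];
        int pred1 = (a[i] == 0);            /* predicate-setting compare */

        /* <pred1>  : b[i] = <new value>  (keep old b[i] if predicate is false) */
        b[i] = pred1 ? new_value(i) : b[i];

        /* <!pred1> : X  (the else-path work, guarded by the complement) */
        x[i] = !pred1 ? compute_x_result(i) : x[i];
    }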

Predication (3 of 3)
  Positives/negatives:
  - (+) Larger scheduling scope without speculation
  - (–) ISA extensions: opcode pressure, extra register specifier
  - (–) May degrade performance if it over-commits fetch/execute resources
  - (–) Converts a control dependence into a data dependence
    - Does this really fix the branch problem?
    - Not predicting control flow delays resolving register dependences
    - Can lengthen the schedule w.r.t. trace scheduling

  The above discussion is simplified:
  - Predication issues are much, much more complex ...
  - Complicating matters:
    - Software: trace scheduling, predication, superblocks, hyperblocks, treegions, ...
    - Hardware: selective multi-path execution, control independence, ...