CS510 Computer Architectures, Lecture 8: Advanced Pipeline (Pipeline Complications)


Interrupts

Interrupts: 5 instructions executing in the 5-stage pipeline
– How to stop the pipeline?
– How to restart the pipeline?
– Who caused the interrupt?

Stage   Exceptional conditions
IF      Page fault on instruction fetch; unaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; unaligned memory access; memory-protection violation

Simultaneous Exceptions in More Than One Pipe Stage

Simultaneous exceptions can occur in more than one pipeline stage, e.g. an LD followed by an ADD:
– the LD takes a data page (DM) fault in the MEM stage
– the ADD takes an instruction page (IM) fault in the IF stage
– the ADD fault is detected BEFORE the LD fault

Solution #1
– Carry an interrupt status vector per instruction
– Defer the check until the last stage, and kill the machine-state update if an exception is flagged
– This delays updating the machine state until late in the pipeline, possibly at the completion of an instruction!

Solution #2
– Interrupt as soon as the exception appears
– Restart everything that is incomplete
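The sketch below (names and data layout are my own, not from the slides) shows the idea of Solution #1: each in-flight instruction carries an exception status vector, faults are only recorded as the instruction moves down the pipe, and the check happens once, in program order, at write-back.

    /* Hypothetical per-instruction exception status vector, checked at WB. */
    #include <stdbool.h>
    #include <stdio.h>

    enum { STAGE_IF, STAGE_ID, STAGE_EX, STAGE_MEM, NUM_FAULT_STAGES };

    typedef struct {
        unsigned pc;                        /* needed to restart after the interrupt */
        bool     fault[NUM_FAULT_STAGES];   /* one flag per stage that can fault     */
    } InFlight;

    /* A stage posts an exception without stopping the pipeline. */
    static void post_exception(InFlight *insn, int stage) {
        insn->fault[stage] = true;
    }

    /* At WB: if any stage posted a fault, suppress the machine-state update
       and take the interrupt; otherwise commit normally. */
    static bool commit(const InFlight *insn) {
        for (int s = STAGE_IF; s < NUM_FAULT_STAGES; s++) {
            if (insn->fault[s]) {
                printf("precise exception at PC=%u (posted in stage %d)\n", insn->pc, s);
                return false;               /* kill the state update */
            }
        }
        return true;                        /* safe to write back */
    }

Because the vector is only examined at WB, the LD/ADD case above is handled in program order: even though the ADD's instruction-page fault is detected earlier in time, the LD reaches WB first, so its data-page fault is taken first.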

Complex Addressing Modes and Instructions

Addressing modes: auto-increment changes a register during instruction execution, i.e. the register is written in the EX stage instead of the WB stage
– Interrupts? Need to restore the register state
– Adds WAR and WAW hazards, since the register write happens in EX, no longer in WB

Memory-memory move instructions
– Must be able to handle multiple page faults
– Long-lived instructions: partial state save on interrupt

Condition codes

Extending the DLX to Handle Multi-cycle Operations

[Figure: the DLX pipeline with IF and ID feeding four EX functional units in parallel (integer unit, FP/integer multiplier, FP/integer divider, FP adder), all rejoining at MEM and WB.]

Multicycle Operations

[Figure: the DLX pipeline with IF and ID feeding four functional units: the integer unit (EX), a seven-stage FP/integer multiplier (M1-M7), a four-stage FP adder (A1-A4), and an FP/integer divider (DIV), all rejoining at MEM and WB.]

Latency and Initiation Interval

Latency: the number of intervening cycles between an instruction that produces a result and an instruction that uses that result.
Initiation interval: the number of cycles that must elapse between issuing two operations of a given type.

Functional unit   Latency   Initiation interval
Integer ALU        0          1
Load               1          1
FP add             3          1
FP multiply        6          1
FP divide

Pipeline timing (showing when the result becomes available and when data is needed):
MULTD: IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
ADDD:  IF ID A1 A2 A3 A4 MEM WB
LD*:   IF ID EX MEM WB
SD*:   IF ID EX MEM WB

* FP LD and SD behave like the integer versions, since there is a 64-bit path to memory.
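With latency defined this way, the stall count for a dependent pair can be read off directly; the following restatement is mine, not printed on the slide:

    stall cycles = max(0, latency - number of independent instructions scheduled between producer and consumer)

For example, an ADDD whose result feeds an SD (latency 2) stalls 2 cycles if the SD immediately follows it, and 0 cycles if two independent instructions are scheduled in between. This is the rule applied in the loop-scheduling examples later in the lecture.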

Floating Point Operations

Floating-point operations have long execution times. A pipelined FP execution unit may also initiate new instructions without waiting for the full latency. Reality: the MIPS R4000.

FP instruction (MIPS R4000)   Latency (cycles before using the result)   Initiation interval (cycles before issuing an instruction of the same type)
Add, Subtract        4      3
Multiply             8      4
Divide              36     35
Square root
Negate               2      1
Absolute value       2      1
FP compare           3      2

Complications Due to FP Operations

Divide and square root take roughly 10x to 30x longer than an add
– What about exceptions?
– They add WAR and WAW hazards, since the pipelines are no longer all the same length

Summary of Pipelining Basics

Hazards limit performance
– Structural: need more hardware resources
– Data: need forwarding, compiler scheduling
– Control: early evaluation of the PC, delayed branches, prediction

Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency.
Interrupts and the FP instruction set make pipelining harder.
Compilers reduce the cost of data and control hazards
– Load delay slots
– Branch delay slots
– Branch prediction

Case Study: MIPS R4000 and Introduction to Advanced Pipelining

Case Study: MIPS R4000 (100 MHz to 200 MHz)

8-stage pipeline:
IF - First half of instruction fetch: PC selection, initiation of the instruction cache access
IS - Second half of instruction fetch: access to the instruction cache
RF - Instruction decode, register fetch, hazard checking; also instruction cache hit detection (tag check)
EX - Execution: effective address calculation, ALU operation, branch target computation and condition evaluation
DF - First half of the access to the data cache
DS - Second half of the access to the data cache
TC - Tag check for data cache hit
WB - Write back for loads and register-register operations

Cache miss exception: tens of cycles of delay.
What is the impact on load delay? Why?

The Pipeline Structure of the R4000

[Figure: the eight-stage pipeline (IF IS RF EX DF DS TC WB) drawn over the datapath of instruction memory, register file, ALU, data memory, and register write-back; labels mark where the instruction becomes available (after IS) and where the tag check completes and load data becomes available (DS/TC).]

Case Study: MIPS R4000 - LOAD Latency

2-cycle load latency: even with forwarding, load data is not available until the end of DS, but a dependent instruction issued right after the load needs it in EX, so it stalls 2 cycles.

LD  R1, X        IF IS RF EX DF DS TC WB              ; load data available after DS, with forwarding
ADD R3, R1, R2      IF IS RF stall stall EX DF DS ...  ; load data needed in EX: 2 stall cycles

Case Study: MIPS R4000 - LOAD Followed by ALU Instructions

2-cycle load latency with the forwarding circuit:

LW  R1          IF IS RF EX DF DS TC WB
ADD R2, R1         IF IS RF stall stall EX DF ...   ; forwarded from the load
SUB R3, R1            IF IS stall stall RF EX ...
OR  R4, R1               IF stall stall IS RF ...

Case Study: MIPS R4000 - Branch Latency

The branch condition is evaluated during the EX phase, so the branch target address is available only after the EX stage. The R4000 uses a predict-NOT-TAKEN strategy.
– NOT TAKEN: a one-cycle delay slot
– TAKEN: a one-cycle delay slot followed by two stall cycles, i.e. a 3-cycle branch latency

NOT TAKEN:
Branch instr      IF IS RF EX DF DS TC WB
Delay slot           IF IS RF EX DF DS TC WB
Branch instr +2         IF IS RF EX DF DS TC ...
Branch instr +3            IF IS RF EX DF DS ...
Branch instr +4               IF IS RF EX DF ...

TAKEN (delay slot plus 2 stall cycles):
Branch instr      IF IS RF EX DF DS TC WB
Delay slot           IF IS RF EX DF DS TC WB
(stall) (stall)
Branch target instr              IF IS RF EX ...

Extending DLX to Handle Floating Point Operations

[Figure: the DLX pipeline with IF and ID feeding the integer unit (EX), an FP/integer multiplier, an FP adder, and an FP divider, all rejoining at MEM and WB.]

MIPS R4000 FP Unit

FP adder, FP multiplier, FP divider; the last step of the FP multiplier/divider uses the FP adder hardware.
There are 8 kinds of stages in the FP units:

Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers

MIPS R4000 FP Pipe Stages

FP instruction     Pipe stages (a superscript gives the repetition count)
Add, Subtract      U, S+A, A+R, R+S
Multiply           U, E+M, M, M, M, N, N+A, R
Divide             U, A, D^28, D+A, D+R, D+A, D+R, A, R
Square root        U, E, (A+R)^108, A, R
Negate             U, S
Absolute value     U, S
FP compare         U, A, R

Stage key:
A  Mantissa ADD stage       M  First stage of multiplier
D  Divide pipeline stage    N  Second stage of multiplier
E  Exception test stage     R  Rounding stage
S  Operand shift stage      U  Unpack FP numbers

MIPS R4000 FP Pipe Stages

Operation and issue/stall pattern by clock cycle (the adder stages A and R are shared with the multiplier's final stages):

Multiply (issue):  U, E+M, M, M, M, N, N+A, R
Add (issue):       U, S+A, A+R, R+S

An ADD issued 4 cycles after a multiply will stall 2 cycles; an ADD issued 5 cycles after the multiply will stall 1 cycle.

R4000 Performance

Not an ideal pipeline CPI of 1:
– Load stalls: 1 or 2 clock cycles
– Branch stalls: 2 cycles for a taken branch, plus unfilled branch slots
– FP result stalls: RAW data hazards (latency)
– FP structural stalls: not enough FP hardware (parallelism)

[Chart: pipeline CPI for the integer programs eqntott, espresso, gcc, li and the floating-point programs doduc, nasa7, ora, spice2g6, su2cor, tomcatv, broken down into base CPI, load stalls, branch stalls, FP result stalls, and FP structural stalls.]
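The stacked categories in the chart correspond to the usual CPI decomposition (my restatement of the chart's legend, not a formula printed on the slide):

    Pipeline CPI = Base CPI + Load stalls + Branch stalls + FP result stalls + FP structural stalls

where each stall term is measured in stall cycles per instruction.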

Advanced Pipelining and Instruction Level Parallelism

Advanced Pipelining and Instruction Level Parallelism

In gcc, 17% of instructions are control transfers: on average about 5 instructions plus 1 branch per basic block.
– To find more instruction-level parallelism, we must look beyond a single basic block.
Loop-level parallelism is one opportunity, exploitable in software and in hardware.

[Figure: a block of code bounded by branch instructions; a basic block runs from a branch target to the next branch instruction, with only ordinary instructions in between.]

Advanced Pipelining and Instruction Level Parallelism

Technique                                      Reduces
Loop unrolling                                 Control stalls
Basic pipeline scheduling                      RAW stalls
Dynamic scheduling with scoreboarding          RAW stalls
Dynamic scheduling with register renaming      WAR and WAW stalls
Dynamic branch prediction                      Control stalls
Issuing multiple instructions per cycle        Ideal CPI
Compiler dependence analysis                   Ideal CPI and data stalls
Software pipelining and trace scheduling       Ideal CPI and data stalls
Speculation                                    All data and control stalls
Dynamic memory disambiguation                  RAW stalls involving memory

Basic Pipeline Scheduling and Loop Unrolling

FP unit latencies:

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double*                   FP ALU op                  1
Load double*                   Store double               0

* Same as an integer load, since there is a 64-bit data path from/to memory.

Assume the FP units are fully pipelined or replicated: no structural hazards, and an instruction can issue on every clock cycle.

Source loop:
for (i = 1; i <= 1000; i++)
    x[i] = x[i] + s;

FP Loop Hazards: Where Are the Stalls?

Loop: LD    F0, 0(R1)    ; R1 is the pointer to a vector
      ADDD  F4, F0, F2   ; F2 contains a scalar value
      SD    0(R1), F4    ; store back the result
      SUBI  R1, R1, 8    ; decrement the pointer by 8 bytes (DW)
      BNEZ  R1, Loop     ; branch if R1 != zero
      NOP                ; delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

FP Loop Showing Stalls

 1  Loop: LD    F0, 0(R1)    ; F0 = vector element
 2        stall
 3        ADDD  F4, F0, F2   ; add scalar in F2
 4        stall
 5        stall
 6        SD    0(R1), F4    ; store result
 7        SUBI  R1, R1, 8    ; decrement pointer by 8 bytes (DW)
 8        stall
 9        BNEZ  R1, Loop     ; branch if R1 != zero
10        stall              ; delayed branch slot

Can we rewrite the code to minimize the stalls?

Reducing Stalls

 1  Loop: LD    F0, 0(R1)
 2        stall
 3        ADDD  F4, F0, F2
 4        stall
 5        stall
 6        SD    0(R1), F4
 7        SUBI  R1, R1, #8
 8        stall
 9        BNEZ  R1, Loop
10        stall

For the load-ALU latency: can we fill the load delay slot (cycle 2)? Yes. Consider moving SUBI into that slot; this is safe because LD reads R1 before SUBI writes it, so there is no ALU-ALU ordering problem. When we do this, we must change the immediate value in the SD from 0 to 8.
For the branch: the only instruction left to move is the SD itself, so the SD fills the delayed branch slot.

Revised FP Loop to Minimize Stalls

 1  Loop: LD    F0, 0(R1)
 2        SUBI  R1, R1, #8
 3        ADDD  F4, F0, F2
 4        stall
 5        BNEZ  R1, Loop      ; delayed branch
 6        SD    8(R1), F4     ; offset altered from 0 to 8 when moved past SUBI

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

Next: unroll the loop 4 times to make the code faster.

Unroll Loop 4 Times

 1  Loop: LD    F0, 0(R1)
 2        ADDD  F4, F0, F2
 3        SD    0(R1), F4      ; drop SUBI & BNEZ
 4        LD    F6, -8(R1)
 5        ADDD  F8, F6, F2
 6        SD    -8(R1), F8     ; drop SUBI & BNEZ
 7        LD    F10, -16(R1)
 8        ADDD  F12, F10, F2
 9        SD    -16(R1), F12   ; drop SUBI & BNEZ
10        LD    F14, -24(R1)
11        ADDD  F16, F14, F2
12        SD    -24(R1), F16
13        SUBI  R1, R1, #32    ; alter to 4*8
14        BNEZ  R1, Loop
15        NOP

15 + 4 x (1* + 2+) + 1^ = 28 clock cycles, or 7 per iteration.
  1*: LD-to-ADDD stall, 1 cycle per copy
  2+: ADDD-to-SD stall, 2 cycles per copy
  1^: data dependence on R1 (SUBI to BNEZ)

Can we rewrite the loop to minimize the stalls?
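At the source level, the unrolled assembly above corresponds to the following C sketch (my illustration; it assumes the trip count is a multiple of 4, as the 1000-iteration loop is):

    /* Unroll-by-4 of:  for (i = 1; i <= 1000; i++) x[i] = x[i] + s;
       Assumes n is a multiple of 4, so no clean-up loop is needed. */
    void add_scalar_unrolled(double *x, double s, int n) {
        for (int i = 1; i <= n; i += 4) {
            x[i]     = x[i]     + s;
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;   /* one loop test and branch per 4 elements */
        }
    }

The benefit is visible here: the loop overhead (SUBI, BNEZ, and the branch delay slot) is paid once per four elements instead of once per element, and the four independent loop bodies give the scheduler instructions to interleave.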

Unrolled Loop to Minimize Stalls

 1  Loop: LD    F0, 0(R1)
 2        LD    F6, -8(R1)
 3        LD    F10, -16(R1)
 4        LD    F14, -24(R1)
 5        ADDD  F4, F0, F2
 6        ADDD  F8, F6, F2
 7        ADDD  F12, F10, F2
 8        ADDD  F16, F14, F2
 9        SD    0(R1), F4
10        SD    -8(R1), F8
11        SUBI  R1, R1, #32
12        SD    16(R1), F12    ; 16 - 32 = -16
13        BNEZ  R1, LOOP
14        SD    8(R1), F16     ; 8 - 32 = -24

14 clock cycles, or 3.5 per iteration.

Assumptions:
– It is OK to move the SD past the SUBI even though SUBI changes R1 (the SD offsets are adjusted to compensate), and no stall is needed between them:
      SUBI  IF ID EX MEM WB
      SD       IF ID EX MEM WB
      BNEZ        IF ID EX MEM WB
– It is OK to move loads before stores (we still get the right data).
– When is it safe for the compiler to make such changes?

Compiler Perspectives on Code Movement

Definitions: the compiler is concerned with dependences in the program; whether a dependence causes an actual hardware hazard depends on the given pipeline.

Data dependence (RAW if it becomes a hazard in hardware): instruction j is data dependent on instruction i if either
– instruction i produces a result used by instruction j, or
– instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.

Dependences are easy to determine for registers (fixed names), but hard for memory:
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
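The register/memory asymmetry shows up in a tiny C example (hypothetical, mine): unless the compiler can prove that the two pointers never refer to the same location, the load cannot be moved above the store, which is exactly the "does 100(R4) equal 20(R6)?" question.

    /* May the load through q be moved above the store through p?
       Only if p and q provably never alias. */
    double store_then_load(double *p, double *q) {
        *p = *p + 1.0;     /* store: plays the role of SD 100(R4) */
        return *q * 2.0;   /* load:  plays the role of LD 20(R6)  */
    }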

Compiler Perspectives on Code Movement

Name dependence: two instructions use the same name (register or memory location) but do not exchange data.

Two kinds of name dependence, where instruction i precedes instruction j:
– Antidependence (WAR if it becomes a hazard in hardware): instruction j writes a register or memory location that instruction i reads, and instruction i executes first.
– Output dependence (WAW if it becomes a hazard in hardware): instructions i and j write the same register or memory location; the ordering between the instructions must be preserved.
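A small C sketch (mine, with local variables standing in for registers) of both kinds of name dependence, and of how giving each write a fresh name removes them without changing the data flow:

    double name_deps(double a, double b) {
        double r = a + b;     /* i: reads a                                      */
        a = b * 2.0;          /* j: writes a -> antidependence (WAR) with i      */
        a = r - 1.0;          /* k: writes a -> output dependence (WAW) with j   */
        return a;
    }

    double renamed(double a, double b) {
        double r  = a + b;    /* same values are computed, but...                */
        double a1 = b * 2.0;  /* ...each write gets its own name, so the WAR     */
        double a2 = r - 1.0;  /* and WAW orderings no longer constrain movement  */
        (void)a1;             /* a1 is dead, just as j's result was overwritten  */
        return a2;
    }

This is what dynamic scheduling with register renaming (in the technique table earlier) does in hardware.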

Compiler Perspectives on Code Movement

Again, this is hard for memory accesses:
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?

Our example required the compiler to know that if R1 doesn't change, then
0(R1) != -8(R1) != -16(R1) != -24(R1)
so there were no dependences between some loads and stores, and they could be moved past each other.

Compiler Perspectives on Code Movement

Control dependence example:

if (p1) { S1; }
if (p2) { S2; }

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

Compiler Perspectives on Code Movement

Two (obvious) constraints on control dependences:
– An instruction that is control dependent on a branch cannot be moved before the branch, because then its execution would no longer be controlled by the branch.
– An instruction that is not control dependent on a branch cannot be moved after the branch, because then its execution would become controlled by the branch.

Control dependences may be relaxed in some systems to get more parallelism; we get the same effect if we preserve the order of exceptions and the data flow.
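A minimal C illustration (mine) of the two constraints above:

    /* s1 is control dependent on p: hoisting it above the branch could divide
       by zero when p == 0.  s2 is not control dependent on p: sinking it under
       the branch would wrongly skip it when p == 0. */
    void control_dep(int p, int *a, int *b) {
        *b = *b + 1;          /* s2: must execute on both paths */
        if (p) {
            *a = *a / p;      /* s1: guarded by the branch      */
        }
    }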

When Is It Safe to Unroll a Loop?

Example: when a loop is unrolled, where are the data dependences? (A, B, C are distinct, non-overlapping arrays.)

for (i = 1; i <= 100; i = i+1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}

1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a loop-carried dependence between iterations.

A loop-carried dependence implies that the iterations are dependent and cannot be executed in parallel. That was not the case for our earlier example, where each iteration was independent.

When Is It Safe to Unroll a Loop?

Example: where are the data dependences? (A, B, C, D are distinct and non-overlapping.) The following looks like it has a loop-carried dependence:

for (i = 1; i <= 100; i = i+1) {
    A[i] = A[i] + B[i];       /* S1 */
    B[i+1] = C[i] + D[i];     /* S2 */
}

However, we can rewrite it to be free of the loop-carried dependence:

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];

Summary

Instruction-level parallelism can be exploited in software or in hardware.
Loop-level parallelism is the easiest to see.
Dependences are defined for a program; they become hazards only if the hardware cannot resolve them.
The dependences in the software and the sophistication of the compiler determine whether the compiler can unroll loops.