Chapter 2: ILP and Its Exploitation


Chapter 2: ILP and Its Exploitation
- Review of the simple static pipeline
- ILP overview
- Dynamic branch prediction
- Dynamic scheduling, out-of-order execution
- Multiple issue (superscalar)
- Hardware-based speculation
- ILP limitations
- Intel P6 microarchitecture

Advanced Processor Pipelining
Focus: exploiting Instruction-Level Parallelism (ILP).
- Definition: executing multiple instructions (within a single program thread) simultaneously.
- Note that even ordinary pipelining exploits some ILP by overlapping the execution of multiple instructions.
- Focus of this chapter: increasing ILP further by
  - allowing out-of-order execution, with or without speculation, and
  - using multiple-issue datapaths to initiate multiple instructions simultaneously.
- Microarchitectures that issue multiple instructions per cycle are called superscalar. Examples: PowerPC, Pentium, etc.

Pipeline Performance
- Ideal pipeline CPI = 1: the minimum number of cycles per issued instruction if no stalls occur. It may be < 1 in superscalar machines; e.g., the ideal CPI is 1/3 in a 3-way superscalar (often expressed as IPC = 3).
- Real pipeline CPI = ideal pipeline CPI + structural stalls + data hazard stalls + control hazard stalls.
- Maximize performance by using various techniques to eliminate stalls and to reduce the ideal CPI.
- Note: real pipeline CPI must still account for cache misses (discussed later).

Advanced Pipelining Techniques

Technique                                    Reduces
Loop unrolling                               Control stalls
Basic pipeline scheduling / forwarding       RAW stalls
Dynamic scheduling with scoreboarding        RAW stalls
Dynamic scheduling with register renaming    WAR & WAW stalls
Dynamic branch prediction                    Control stalls
Issuing multiple instructions per cycle      Ideal pipeline CPI
Compiler dependence analysis (software)      Ideal CPI & data stalls
Software pipelining & trace scheduling       Ideal CPI & data stalls
Hardware speculation                         All data & control stalls
Dynamic memory disambiguation                RAW stalls involving memory

Instruction-Level Parallelism (ILP)
- The ILP within a basic block (BB) is quite small.
  - BB: a straight-line code sequence with no branches in or out, except at the entry and the exit.
  - An average dynamic branch frequency of 15% to 25% means only 4 to 7 instructions execute between two branches.
  - Moreover, the instructions in a BB are likely to depend on each other.
- To obtain significant performance enhancements, we must exploit ILP across multiple basic blocks.
- Simplest approach: loop-level parallelism, which exploits parallelism among the iterations of a loop. E.g.,
    for (i=1; i<=1000; i=i+1)
        x[i] = x[i] + y[i];

Dependences
- A dependence is a way in which one instruction can depend on (be affected by) another for scheduling purposes. Three major dependence types:
  - Data (true) dependence: RAW
  - Name dependence: WAR, WAW
  - Control dependence: branch, jump, etc.
- A dependency (or dependence) is a particular instance of one instruction depending on another.
- Dependent instructions cannot be effectively (as opposed to just syntactically) fully parallelized or reordered.

Data Dependence
Recursive definition: instruction B is data dependent on instruction A iff:
- B uses a data result produced by A, or
- there is another instruction C such that B is data dependent on C, and C is data dependent on A.
A data dependence is the potential for a RAW hazard.

Data dependences in the loop example (chains L.D -> ADD.D -> S.D and SUBI -> BNEZ):
  Loop: LD    F0,0(R1)
        ADDD  F4,F0,F2
        SD    0(R1),F4
        SUBI  R1,R1,#8
        BNEZ  R1,Loop

Name Dependence
- Occurs when two instructions both access the same data storage location because the location is reused (also called storage dependence; at least one of the accesses must be a write).
- Two sub-types (for instruction B after instruction A):
  - Antidependence: A reads, then B writes. Potential for a WAR hazard.
  - Output dependence: A writes, then B writes. Potential for a WAW hazard.
- Note: name dependences can be avoided by changing the instructions to use different locations rather than reusing one.

WAR, WAW Examples

WAR hazard: instruction J writes an operand before the earlier instruction I reads it.
  I: sub r4,r1,r3   ; reads r1
  J: add r1,r2,r3   ; writes r1
  K: mul r6,r1,r7

WAW hazard: instruction J writes an operand before the earlier instruction I writes it.
  I: sub r1,r4,r3   ; writes r1
  J: add r1,r2,r3   ; writes r1
  K: mul r6,r1,r7

Control Dependence
- Occurs when the execution of an instruction depends on a conditional branch instruction.
- Correct execution must follow the program control flow. However, only two things must really be preserved:
  - Data flow (how a given result is produced)
  - Exception behavior (exceptions must be handled in order)
Example (for exception behavior):
      DADDU R2, R3, R4
      BEQZ  R2, L1
      LW    R1, 0(R2)   ; must not move before BEQZ
  L1:
Moving the LW above the branch could raise a memory exception (R2 may be 0) and clobber R1.

Control Dependence – Another Example
Example (for data flow):
      DADDU R2, R3, R4
      BEQZ  R5, L1
      DSUBU R2, R6, R7
  L1: OR    R8, R2, R9
- OR depends on both DADDU and DSUBU; maintaining data dependences alone is not enough.
- The control flow decides where the correct R2 comes from (DADDU or DSUBU).

Relaxing Control Dependence
- Only two things must really be preserved: data flow (how a given result is produced) and exception behavior.
- Some techniques permit removing a control dependence from instruction execution by ignoring the instruction's results when the branch goes the other way, instead of delaying execution:
  - Speculation (betting on branches, e.g., to fill delay slots): make instructions unconditional if no harm is done.
  - Speculative multiple-path execution: take both paths, then invalidate the results of one.
  - Conditional / predicated instructions (used in IA-64).
- Note: instructions reordered around a branch must have no harmful side effects.

Loop Unrolling
This code adds a scalar to a vector:
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
Assume the following latencies for all examples (and ignore the delayed branch):

Instruction producing result   Instruction using result   Latency (cycles)   Stalls (cycles)
FP ALU op                      Another FP ALU op          4                  3
FP ALU op                      Store double               3                  2
Load double                    FP ALU op                  1                  1
Load double                    Store double               1                  0
Integer op                     Integer op                 1                  0

MIPS Code
First translate into MIPS code (to simplify, assume 8(R1) is the lowest address):
  Loop: L.D    F0,0(R1)    ;F0=vector element
        ADD.D  F4,F0,F2    ;add scalar from F2
        S.D    0(R1),F4    ;store result
        DADDUI R1,R1,#-8   ;decrement pointer 8 bytes (DW)
        BNEZ   R1,Loop     ;branch if R1!=zero

Execution Cycles without Instruction Scheduling
  1 Loop: L.D    F0,0(R1)   ;F0=vector element
  2       stall
  3       ADD.D  F4,F0,F2   ;add scalar in F2
  4       stall
  5       stall
  6       S.D    0(R1),F4   ;store result
  7       DADDUI R1,R1,#-8  ;decrement pointer 8 bytes (DW)
  8       stall             ;assumes result can't be forwarded to the branch
  9       BNEZ   R1,Loop    ;branch if R1!=zero
Stalls used (from the latency table): FP ALU op -> another FP ALU op: 3; FP ALU op -> store double: 2; load double -> FP ALU op: 1.
9 clock cycles per iteration. Can we rewrite the code to minimize stalls?

Apply Instruction Scheduling
  1 Loop: L.D    F0,0(R1)
  2       DADDUI R1,R1,#-8
  3       ADD.D  F4,F0,F2
  4       stall
  5       stall
  6       S.D    8(R1),F4   ;altered offset: R1 was already decremented
  7       BNEZ   R1,Loop
7 clock cycles per iteration, but only 3 do real work (L.D, ADD.D, S.D); the other 4 are loop overhead or stalls. Can we go faster?

Unroll Loop Four Times
   1 Loop: L.D    F0,0(R1)
   3       ADD.D  F4,F0,F2
   6       S.D    0(R1),F4     ;drop DSUBUI/BNEZ
   7       L.D    F6,-8(R1)
   9       ADD.D  F8,F6,F2
  12       S.D    -8(R1),F8    ;drop DSUBUI/BNEZ
  13       L.D    F10,-16(R1)
  15       ADD.D  F12,F10,F2
  18       S.D    -16(R1),F12  ;drop DSUBUI/BNEZ
  19       L.D    F14,-24(R1)
  21       ADD.D  F16,F14,F2
  24       S.D    -24(R1),F16
  25       DADDUI R1,R1,#-32   ;decrement by 4*8
  27       BNEZ   R1,LOOP
(The numbers on the left are issue cycles; the gaps are stalls.)
27 clock cycles, or 6.75 per element (assumes the iteration count is a multiple of 4).

Unrolled and Rescheduled Loop
   1 Loop: L.D    F0,0(R1)
   2       L.D    F6,-8(R1)
   3       L.D    F10,-16(R1)
   4       L.D    F14,-24(R1)
   5       ADD.D  F4,F0,F2
   6       ADD.D  F8,F6,F2
   7       ADD.D  F12,F10,F2
   8       ADD.D  F16,F14,F2
   9       S.D    0(R1),F4
  10       S.D    -8(R1),F8
  11       S.D    -16(R1),F12
  12       DSUBUI R1,R1,#32
  13       S.D    8(R1),F16    ;8-32 = -24
  14       BNEZ   R1,LOOP
14 clock cycles, or 3.5 per element: no stalls remain.

Loop Unrolling Decisions
Unrolling requires understanding how one instruction depends on another and how the instructions can be changed or reordered given those dependences:
- Determine that unrolling is useful by finding that the loop iterations are independent (except for the loop-maintenance code).
- Use different registers to avoid the unnecessary constraints forced by reusing the same registers for different computations.
- Eliminate the extra test and branch instructions and adjust the loop-termination and iteration code.
- Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same location.
- Schedule the code, preserving any dependences needed to yield the same result as the original code.

Three Unrolling Considerations
- Diminishing returns: the amount of loop overhead amortized by each extra unrolling decreases (Amdahl's Law).
- Growth in code size: for larger loops, the bigger body may increase the instruction-cache miss rate.
- Register pressure: aggressive unrolling and scheduling may create more live values than there are registers. If not all live values can be allocated to registers, the transformation may lose some or all of its advantage.
Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction.

Static Branch Prediction
- Delayed branch: to reorder code around branches, the compiler must predict each branch statically at compile time.
- Predicting every branch as taken gives an average misprediction rate equal to the untaken frequency: 34% for the SPEC benchmarks.
- More accurate static prediction: profile-based.

Dynamic Branch Prediction
Why does prediction work?
- The underlying algorithm has regularities.
- The data being operated on has regularities.
- The instruction sequence has redundancies that are artifacts of the way humans and compilers think about problems.
Is dynamic branch prediction better than static branch prediction?
- It seems to be: there are a small number of important branches in programs that have dynamic behavior.

Dynamic Branch Prediction
- As the amount of ILP exploited increases (CPI decreases), the impact of control stalls increases:
  - branches come along more often, and
  - an n-cycle delay postpones more instructions.
- Dynamic hardware branch prediction:
  - "learns" which branches are taken or not;
  - makes the right guess most of the time about whether a branch is taken;
  - the delay depends on whether the prediction is correct and whether the branch is taken.

Branch-Prediction Buffers (BPB)
- Also called a "branch history table".
- The low-order n bits of the branch address index a table of branch-history data used for prediction.
  - There may be "collisions" between distant branches; associative tables are also possible.
- Each entry stores k bits of information about the history of that branch. Common values of k: 1, 2, and larger.
- The entry is used to predict what the branch will do; the actual behavior of the branch then updates the entry.

1-bit Branch Prediction
The entry for a branch has only two states:
- Bit = 1: "The last time this branch was encountered, it was taken. I predict it will be taken next time."
- Bit = 0: "The last time this branch was encountered, it was not taken. I predict it will not be taken next time."
A 1-bit predictor makes 2 mistakes each time a loop is executed: at the end of the first and last iterations. It may always mispredict in pathological cases!

2-bit Branch Prediction
- 4 states, based on the two most recent branch outcomes.
- Only 1 misprediction per loop execution (on the last iteration), after the first time the loop is reached.
- What about an n-bit predictor?

State Transition Diagram

Implementing Branch Histories
Two options for storing the history:
- a separate "cache" (prediction table) accessed during the IF stage, or
- extra bits in the instruction cache.
Problem with this approach in MIPS: after fetch, we don't know whether the instruction is really a branch (until decode), nor the target address. By the time we know these things (in ID), we already know whether the branch is actually taken — no time has been saved! Branch-target buffers can fix this problem (later).

Misprediction Rate for 2-bit BPB

Branch-Prediction Performance
The contribution to the cycle count depends on:
- branch frequency and misprediction frequency;
- the frequencies of taken/not-taken and predicted/mispredicted branches;
- the delay in each taken/not-taken, predicted/mispredicted case.
How to reduce the misprediction frequency?
- Increase the buffer size to avoid collisions (little effect beyond ~4,096 entries).
- Increase prediction accuracy:
  - increase the number of bits per entry (little effect beyond 2), or
  - use a different prediction scheme (correlating predictors, tournament predictors).

Exhaustive Search for the Optimal 2-bit Predictor
- There are 2^20 possible state machines for 2-bit predictors.
- Some machines are uninteresting; pruning them reduces the number of state machines to 5,248.
- For each benchmark, determine the prediction accuracy of every predictor state machine.
- Optimal 2-bit predictor accuracy per application (IBM study):
  spice2g6 97.2%, doduc 94.3%, gcc 89.1%, espresso 89.1%, li 87.1%, eqntott 87.9%

Correlated Prediction - Example
Code fragment from eqntott:
  if (aa==2) aa=0;       /* if1 */
  if (bb==2) bb=0;       /* if2 */
  if (aa!=bb) { ... }    /* false if (if1 && if2) */
MIPS code (aa=R1, bb=R2):
      SUBUI R3,R1,#2   ; (aa-2)
      BNEZ  R3,L1      ; branch b1 (aa!=2)
      ADD   R1,R0,R0   ; aa=0
  L1: SUBUI R3,R2,#2   ; (bb-2)
      BNEZ  R3,L2      ; branch b2 (bb!=2)
      ADD   R2,R0,R0   ; bb=0
  L2: SUBU  R3,R1,R2   ; (aa-bb)
      BEQZ  R3,L3      ; branch b3 (aa==bb)
If b1 and b2 are both not taken, then aa==bb==0 and b3 must be taken: b3 is correlated with b1 and b2.

Even Simpler Example
C code:
  if (d==0) d=1;
  if (d==1) ...
MIPS code (d=R1):
      BNEZ  R1,L1      ; b1: d!=0
      ADDI  R1,R0,#1   ; d=1
  L1: SUBUI R3,R1,#1   ; (d-1)
      BNEZ  R3,L2      ; b2: d!=1
Is there any correlation between b1 and b2? (If b1 is not taken, d becomes 1, so b2 cannot be taken.)

Using a 1-bit Predictor
Suppose the value of d alternates between 2 and 0 and the code repeats many times. With two independent 1-bit predictors initialized to NT, NT, every execution of b1 and b2 is mispredicted!

Correlating Predictors
- Keep different predictions for the current branch depending on whether the previously executed branch was taken or not.
- Notation: X / Y (two separate prediction entries):
  - X: what to predict if the last branch was NOT taken;
  - Y: what to predict if the last branch was TAKEN.
- The prediction actually used (shown in bold on the slide) is selected by the last branch outcome.

(m,n) Correlating Predictors
- Use the behavior of the most recent m branches to select one of 2^m different branch predictors for the next branch.
- Each of these predictors records n bits of history for any given branch.
- The previous slide showed a (1,1) predictor.
- Easy to implement:
  - the behavior of the last m branches is held in an m-bit shift register;
  - the branch-prediction buffer is indexed with the low-order bits of the branch address concatenated with the shift register.

(2,2) Correlated Predictor (prediction selected by the last 2 branch outcomes)

Correlated Predictors Perform Better

Tournament Predictors
- Three predictors: a global correlating predictor, a local 2-bit predictor, and a tournament (selector) predictor.
- The tournament predictor determines which of the other two (global or local) is used for each prediction.

Performance of the Tournament Predictor

Branch-Target Buffers (BTB)
How can we know the address of the next instruction as soon as the current instruction is fetched? Normally an extra (ID) cycle is needed to:
- determine that the fetched instruction is a branch;
- determine whether the branch is taken;
- compute the target address (PC + offset).
Branch prediction alone doesn't help: we still need the next PC. What if, instead, the next instruction's address could be looked up at the same time the current instruction is fetched? That is the BTB.

BTB Schematic

Handling an Instruction with the BTB
(Flowchart: on a BTB hit, fetch from the predicted target. This flowchart is based on a 1-bit predictor.)

Branch Penalties, Branch Folding
- If the instruction is not in the BTB and the branch is not taken (a case not shown in the flowchart), the penalty is 0.
- Branch folding: store the target instructions themselves, instead of their addresses, in the BTB.
  - Saves fetch time.
  - Permits zero-cycle branches: substitute the destination instruction for the branch in the pipeline!

Return Address Predictor
- Predicts register/indirect branches, e.g., indirect function calls, switch statements, and procedure returns.
- Implemented as a CPU-internal return-address stack.

Dynamic Branch Prediction Summary
- Prediction has become an important part of execution.
- Branch history table: 2 bits per entry suffice for loop accuracy.
- Correlation: recently executed branches are correlated with the next branch;
  - either different branches (global, GA),
  - or different executions of the same branch (per-address, PA).
- Tournament predictors take this insight to the next level by using multiple predictors — usually one based on global information and one based on local information — and combining them with a selector.
- In 2006, tournament predictors using about 30K bits were in processors like the Power5 and Pentium 4.
- Branch-target buffer: includes the branch address and the prediction.