DAP.F96 1 Lecture 4: Hazards, Introduction to Compiler Techniques, Chapter 2.



DAP.F96 2 MIPS R4000 pipeline

DAP.F96 3 MIPS FP Pipe Stages
FP Instr         1   2     3     4      5   6     7     8   …
Add, Subtract    U   S+A   A+R   R+S
Multiply         U   E+M   M     M      M   N     N+A   R
Divide           U   A     R     D^28   …   D+A   D+R, D+R, D+A, D+R, A, R
Square root      U   E     (A+R)^108    …   A     R
Negate           U   S
Absolute value   U   S
FP compare       U   A     R
Stages:
  M  First stage of multiplier
  N  Second stage of multiplier
  R  Rounding stage
  S  Operand shift stage
  U  Unpack FP numbers
  A  Mantissa ADD stage
  D  Divide pipeline stage
  E  Exception test stage

DAP.F96 4 Appendix A Fig. A.32 Fig. A.33 Fig. A.34

DAP.F96 5 R4000 Performance
Not ideal CPI of 1:
– Load stalls (1 or 2 clock cycles)
– Branch stalls (2 cycles + unfilled slots)
– FP result stalls: RAW data hazard (latency)
– FP structural stalls: Not enough FP hardware (parallelism)

DAP.F96 6 FP Loop: Where are the Hazards?
Consider the following example:
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
MIPS code:
    Loop: LD   F0,0(R1)   ;F0=vector element
          ADDD F4,F0,F2   ;add scalar from F2
          SD   0(R1),F4   ;store result
          SUBI R1,R1,8    ;decrement pointer 8B (DW)
          BNEZ R1,Loop    ;branch R1!=zero
          NOP             ;delayed branch slot
Where are the stalls?

DAP.F96 7 FP Loop Hazards
Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 1
    Loop: LD   F0,0(R1)   ;F0=vector element
          ADDD F4,F0,F2   ;add scalar in F2
          SD   0(R1),F4   ;store result
          SUBI R1,R1,8    ;decrement pointer 8B (DW)
          BNEZ R1,Loop    ;branch R1!=zero
          NOP             ;delayed branch slot

DAP.F96 8 FP Loop Showing Stalls
 1 Loop: LD   F0,0(R1)   ;F0=vector element
 2       stall
 3       ADDD F4,F0,F2   ;add scalar in F2
 4       stall
 5       stall
 6       SD   0(R1),F4   ;store result
 7       SUBI R1,R1,8    ;decrement pointer 8B (DW)
 8       stall
 9       BNEZ R1,Loop    ;branch R1!=zero
10       stall           ;delayed branch slot
10 clocks: Rewrite code to minimize stalls?
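The 10-clock count can be reproduced with a small in-order, single-issue model. The sketch below is a simplification under stated assumptions, not the R4000's actual interlock logic: the `Instr` record, the `issue_cycles` function and the instruction-class names are hypothetical, the latencies are copied from the "FP Loop Hazards" table, and BNEZ is treated as an integer consumer of R1. Only the total cycle count is meaningful; where a stall lands may differ from the slides.

```python
from dataclasses import dataclass
from typing import Optional

# Stall cycles between a producer class and a consumer class,
# taken from the latency table on the "FP Loop Hazards" slide.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 1,
}

@dataclass
class Instr:
    text: str
    kind: str                 # "load", "fp_alu", "store", "int" or "nop"
    dest: Optional[str]       # register written, if any
    srcs: tuple = ()          # registers read

def issue_cycles(program):
    """Return the 1-based clock in which each instruction issues."""
    last_writer = {}          # register -> (issue clock, producer kind)
    clock = 0
    cycles = []
    for ins in program:
        earliest = clock + 1  # in order, at most one issue per clock
        for reg in ins.srcs:
            if reg in last_writer:
                produced_at, kind = last_writer[reg]
                gap = LATENCY.get((kind, ins.kind), 0)
                earliest = max(earliest, produced_at + 1 + gap)
        clock = earliest
        cycles.append(clock)
        if ins.dest is not None:
            last_writer[ins.dest] = (clock, ins.kind)
    return cycles

ORIGINAL = [
    Instr("LD   F0,0(R1)", "load", "F0", ("R1",)),
    Instr("ADDD F4,F0,F2", "fp_alu", "F4", ("F0", "F2")),
    Instr("SD   0(R1),F4", "store", None, ("R1", "F4")),
    Instr("SUBI R1,R1,8", "int", "R1", ("R1",)),
    Instr("BNEZ R1,Loop", "int", None, ("R1",)),
    Instr("NOP", "nop", None),
]

SCHEDULED = [
    Instr("LD   F0,0(R1)", "load", "F0", ("R1",)),
    Instr("SUBI R1,R1,8", "int", "R1", ("R1",)),
    Instr("ADDD F4,F0,F2", "fp_alu", "F4", ("F0", "F2")),
    Instr("BNEZ R1,Loop", "int", None, ("R1",)),
    Instr("SD   8(R1),F4", "store", None, ("R1", "F4")),
]

print(issue_cycles(ORIGINAL)[-1])   # 10
print(issue_cycles(SCHEDULED)[-1])  # 6
```

The last issue clock is the loop-body length in cycles, so the same function can be used to compare other orderings of the loop body.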

DAP.F96 9 Minimizing Stalls Technique 1: Compiler Optimization
Swap BNEZ and SD by changing address of SD
1 Loop: LD   F0,0(R1)
2       SUBI R1,R1,8
3       ADDD F4,F0,F2
4       stall
5       BNEZ R1,Loop    ;delayed branch
6       SD   8(R1),F4   ;address altered from 0(R1) to 8(R1) when moved past SUBI
6 clocks, but actual work is only 3 cycles for LD, ADDD and SD!
What assumptions were made when the code was moved?
– OK to move store past SUBI even though SUBI changes the register
– OK to move loads before stores: get right data?
– When is it safe for the compiler to make such changes?
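One way to see why the store offset must change from 0(R1) to 8(R1) is to run both instruction orders on the same data. The sketch below is a hypothetical Python model, not compiler output: memory is a list of doubles, element i lives at byte offset 8*i, R1 starts at 8*n, and n is assumed to be at least 1. The `sd_offset` parameter is an illustrative knob for showing what happens when the adjustment is forgotten.

```python
def original_loop(x, s, n):
    """LD 0(R1); ADDD; SD 0(R1); SUBI; BNEZ (store before the decrement)."""
    mem = list(x)
    r1 = 8 * n
    while r1 != 0:
        f0 = mem[r1 // 8]        # LD   F0,0(R1)
        f4 = f0 + s              # ADDD F4,F0,F2
        mem[r1 // 8] = f4        # SD   0(R1),F4
        r1 -= 8                  # SUBI R1,R1,8
    return mem

def scheduled_loop(x, s, n, sd_offset=8):
    """LD; SUBI; ADDD; BNEZ; SD in the delay slot (store after the decrement)."""
    mem = list(x)
    r1 = 8 * n
    while True:
        f0 = mem[r1 // 8]                    # LD   F0,0(R1)
        r1 -= 8                              # SUBI R1,R1,8
        f4 = f0 + s                          # ADDD F4,F0,F2
        taken = r1 != 0                      # BNEZ R1,Loop (delayed branch)
        mem[(r1 + sd_offset) // 8] = f4      # SD   8(R1),F4 in the delay slot
        if not taken:
            break
    return mem

x = [0.0] + [float(i) for i in range(1, 9)]  # x[1..8]; x[0] is unused
print(original_loop(x, 2.5, 8) == scheduled_loop(x, 2.5, 8))               # True
print(original_loop(x, 2.5, 8) == scheduled_loop(x, 2.5, 8, sd_offset=0))  # False
```

With the offset left at 0 the store lands one element too low, which both corrupts the next iteration's input and overwrites the unused x[0].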

DAP.F96 10 Compiler Technique 2: Loop Unrolling
 1 Loop: LD   F0,0(R1)
 2       ADDD F4,F0,F2      ;1 cycle delay *
 3       SD   0(R1),F4      ;drop SUBI & BNEZ – 2 cycles delay *
 4       LD   F6,-8(R1)
 5       ADDD F8,F6,F2      ;1 cycle delay
 6       SD   -8(R1),F8     ;drop SUBI & BNEZ – 2 cycles delay
 7       LD   F10,-16(R1)
 8       ADDD F12,F10,F2    ;1 cycle delay
 9       SD   -16(R1),F12   ;drop SUBI & BNEZ – 2 cycles delay
10       LD   F14,-24(R1)
11       ADDD F16,F14,F2    ;1 cycle delay
12       SD   -24(R1),F16   ;2 cycles delay
13       SUBI R1,R1,#32     ;alter to 4*8; 1 cycle delay
14       BNEZ R1,Loop       ;delayed branch
15       NOP
* 1 cycle delay for FP operation after load. 2 cycles delay for store after FP. 1 cycle after SUBI.
15 + 4 x (1+2) + 1 = 28 clock cycles, or 7 per iteration
Loop unrolling is essential for ILP processors. Why?
But it increases code memory and the number of registers needed.
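At source level, the same transformation can be sketched as follows. This is a hypothetical Python rendering of the slide's loop, assuming (as the slide does with 1000 iterations) that the trip count is a multiple of 4, so no clean-up loop is needed. The cycle arithmetic at the end restates the slide's count: 15 instructions, four load/add/store groups with 1 + 2 stall cycles each, and one stall after SUBI.

```python
def add_scalar_rolled(x, s):
    """x[i] = x[i] + s for i = n..1; x[0] is unused, as in the slide's loop."""
    x = list(x)
    i = len(x) - 1
    while i > 0:
        x[i] = x[i] + s
        i -= 1
    return x

def add_scalar_unrolled4(x, s):
    """Same computation with four copies of the body per loop test."""
    x = list(x)
    i = len(x) - 1
    assert i % 4 == 0, "sketch assumes the trip count is a multiple of 4"
    while i > 0:
        x[i] = x[i] + s            # offset   0(R1)
        x[i - 1] = x[i - 1] + s    # offset  -8(R1)
        x[i - 2] = x[i - 2] + s    # offset -16(R1)
        x[i - 3] = x[i - 3] + s    # offset -24(R1)
        i -= 4                     # SUBI R1,R1,#32
    return x

# Cycle count for the unrolled but unscheduled loop body.
instructions = 15
stalls = 4 * (1 + 2) + 1           # per-copy FP stalls plus one after SUBI
unrolled_cycles = instructions + stalls
cycles_per_iteration = unrolled_cycles / 4

print(unrolled_cycles, cycles_per_iteration)  # 28 7.0
```

Unrolling alone removes three of every four SUBI/BNEZ pairs; the remaining stalls are what scheduling the unrolled body goes after.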

DAP.F96 11 Minimize Stall + Loop Unrolling
 1 Loop: LD   F0,0(R1)
 2       LD   F6,-8(R1)
 3       LD   F10,-16(R1)
 4       LD   F14,-24(R1)
 5       ADDD F4,F0,F2
 6       ADDD F8,F6,F2
 7       ADDD F12,F10,F2
 8       ADDD F16,F14,F2
 9       SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,#32
13       BNEZ R1,Loop       ;delayed branch
14       SD   8(R1),F16     ;8-32 = -24
14 clock cycles, or 3.5 per iteration

DAP.F96 12 Steps Compiler Performed to Unroll
– Check OK to move the S.D after DSUBUI and BNEZ, and find amount to adjust S.D offset
– Determine unrolling the loop would be useful by finding that the loop iterations were independent
– Rename registers to avoid name dependencies
– Eliminate extra test and branch instructions and adjust the loop termination and iteration code
– Determine loads and stores in unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent; this requires analyzing memory addresses and finding that they do not refer to the same address
– Schedule the code, preserving any dependences needed to yield same result as the original code

DAP.F96 13 Compiler Optimization + 2-issue Superscalar
EX: One LD/ST unit and one FP ALU
Clock        LD/ST unit (also integer/branch)   FP ALU
 1 Loop:     LD   F0,0(R1)
 2           LD   F6,-8(R1)
 3           LD   F10,-16(R1)                   ADDD F4,F0,F2
 4           LD   F14,-24(R1)                   ADDD F8,F6,F2
 5           SD   0(R1),F4                      ADDD F12,F10,F2
 6           SD   -8(R1),F8                     ADDD F16,F14,F2
 7           SD   -16(R1),F12
 8           SUBI R1,R1,#32
 9           BNEZ R1,Loop                       ;delayed branch
10           SD   8(R1),F16                     ;8-32 = -24
10 clock cycles, or 2.5 per iteration

DAP.F96 14 2-Issue without Loop Unrolling
1 Loop: LD   F0,0(R1)
2       SUBI R1,R1,8       ADDD F4,F0,F2
3       stall
4       BNEZ R1,Loop
5       SD   8(R1),F4      ;branch delay slot
Only one cycle improvement even with 2-issue superscalar!
HW: What if 2-issue with no compiler optimization?

DAP.F96 15 Static Multiple Issue: Very Long Instruction Word (VLIW) Architectures
Wide-issue processor that relies on compiler to
– Pack together independent instructions to be issued in parallel
– Schedule code to minimize hazards and stalls
Very long instruction words (3 to 8 operations)
– Can be issued in parallel without checks
– If compiler cannot find independent operations, it inserts nops
Advantage: simpler HW for wide issue
– Faster clock cycle
– Lower design & verification cost
Disadvantages:
– Code size
– Requires aggressive compilation technology

DAP.F96 16 Traditional VLIW Hardware
Multiple functional units, many registers (e.g. 128)
– Large multiported register file (for N FUs need ~3N ports)
Simple instruction fetch unit
– No checks, direct correspondence between slots & FUs
Instruction format
– 16 to 24 bits per op => 5*16=80 bits to 5*24=120 bits wide
– Can share immediate fields (1 per long instruction)

DAP.F96 17 VLIW Code Example
Consider the following example:
    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;
MIPS code:
    Loop: LD   F0,0(R1)   ;F0=vector element
          ADDD F4,F0,F2   ;add scalar from F2
          SD   0(R1),F4   ;store result
          SUBI R1,R1,8    ;decrement pointer 8B (DW)
          BNEZ R1,Loop    ;branch R1!=zero
          NOP             ;delayed branch slot

DAP.F96 18 Loop Unrolling in VLIW
Memory reference 1   Memory reference 2   FP operation 1     FP op. 2           Int. op/branch     Clock
L.D F0,0(R1)         L.D F6,-8(R1)                                                                 1
L.D F10,-16(R1)      L.D F14,-24(R1)                                                               2
L.D F18,-32(R1)      L.D F22,-40(R1)      ADD.D F4,F0,F2     ADD.D F8,F6,F2                        3
L.D F26,-48(R1)                           ADD.D F12,F10,F2   ADD.D F16,F14,F2                      4
                                          ADD.D F20,F18,F2   ADD.D F24,F22,F2                      5
S.D 0(R1),F4         S.D -8(R1),F8        ADD.D F28,F26,F2                                         6
S.D -16(R1),F12      S.D -24(R1),F16                                                               7
S.D -32(R1),F20      S.D -40(R1),F24                                            DSUBUI R1,R1,#56   8
S.D 8(R1),F28                                                                   BNEZ R1,Loop       9
Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: Need more registers in VLIW (15 vs. 6 in SS)
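The summary figures follow from counting the filled slots in the schedule. The short calculation below is only a restatement of that arithmetic; the variable names are illustrative, and the operation counts (7 loads, 7 FP adds, 7 stores, 1 integer op, 1 branch) are read off the table, which has 5 slots per long instruction word.

```python
# Operation counts read off the VLIW schedule for 7 unrolled iterations.
loads, fp_adds, stores = 7, 7, 7
int_ops, branches = 1, 1

iterations = 7
clocks = 9
slots_per_word = 5          # 2 memory + 2 FP + 1 integer/branch

ops = loads + fp_adds + stores + int_ops + branches
clocks_per_iteration = clocks / iterations
ops_per_clock = ops / clocks
efficiency = ops / (clocks * slots_per_word)

print(ops)                             # 23
print(round(clocks_per_iteration, 2))  # 1.29
print(round(ops_per_clock, 2))         # 2.56
print(round(efficiency, 2))            # 0.51
```

These values round to the slide's 1.3 clocks per iteration, roughly 2.5 operations per clock and about 50% slot utilization; the remaining slots hold nops.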

DAP.F96 19 How Can the HW Help the Compiler with Discovering ILP?
Compiler’s performance is critical for VLIW processors
– Find many independent instructions & schedule them in best possible way
What limits the compiler’s ability to discover ILP
– Name dependencies (WAW & WAR)
  » Can eliminate with large number of registers
– Branches
  » Limit compiler’s ability to schedule
  » Modern VLIW processors use branch prediction too
– Dependencies through memory
  » Force the compiler to use conservative schedule
Can the HW help the compiler?
– Ideally, with techniques simpler than those for superscalar processors

DAP.F96 20 VLIW Vs. Superscalar
– Sequential stream of long instruction words
– Instructions scheduled statically by the compiler
– Number of simultaneously issued instructions is fixed at compile time
– Instruction issue is less complicated than in a superscalar processor
Disadvantage: VLIW processors cannot react to dynamic events, e.g. cache misses, with the same flexibility as superscalars.
The number of instructions in a VLIW instruction word is usually fixed. Padding VLIW instructions with no-ops is needed when the full issue bandwidth cannot be used. This increases code size. More recent VLIW architectures use a denser code format that allows the no-ops to be removed.
VLIW is an architectural technique, whereas superscalar is a microarchitecture technique.
VLIW processors take advantage of spatial parallelism.

DAP.F96 21 Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)”
IA-64: instruction set architecture; EPIC is type
– EPIC = 2nd generation VLIW?
Itanium™ is name of first implementation (2001)
– Highly parallel and deeply pipelined hardware at 800 MHz
– 6-wide, 10-stage pipeline at 800 MHz on 0.18 µ process
128 64-bit integer registers + 128 82-bit floating point registers
– Not separate register files per functional unit as in old VLIW
Hardware checks dependencies (interlocks => binary compatibility over time)
Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?

DAP.F96 22 Itanium™ EPIC Design Maximizes SW-HW Synergy (Copyright: Intel at Hotchips ’00)
Architecture features programmed by compiler:
– Branch Hints
– Explicit Parallelism
– Register Stack & Rotation
– Predication
– Data & Control Speculation
– Memory Hints
Micro-architecture features in hardware:
– Fetch: Instruction Cache & Branch Predictors
– Issue: Fast, Simple 6-Issue
– Register Handling: 128 GR & 128 FR, Register Remap & Stack Engine
– Control: Bypasses & Dependencies
– Parallel Resources: 4 Integer + 4 MMX Units, 2 FMACs (4 for SSE), 2 LD/ST units, 32 entry ALAT
– Speculation Deferral Management
– Memory Subsystem: Three levels of cache: L1, L2, L3

DAP.F96 23 10 Stage In-Order Core Pipeline (Copyright: Intel at Hotchips ’00)
Stages: IPG (instruction pointer generation), FET (fetch), ROT (rotate), EXP (expand), REN (rename), WLD (word-line decode), REG (register read), EXE (execute), DET (exception detect), WRB (write-back)
Front End
– Pre-fetch/fetch of up to 6 instructions/cycle
– Hierarchy of branch predictors
– Decoupling buffer
Instruction Delivery
– Dispersal of up to 6 instructions on 9 ports
– Reg. remapping
– Reg. stack engine
Operand Delivery
– Reg read + bypasses
– Register scoreboard
– Predicated dependencies
Execution
– 4 single cycle ALUs, 2 ld/str
– Advanced load control
– Predicate delivery & branch
– Nat/Exception/Retirement

DAP.F96 24 Itanium processor 10-stage pipeline
Front-end (stages IPG, Fetch, and Rotate): prefetches up to 32 bytes per clock (2 bundles) into a prefetch buffer, which can hold up to 8 bundles (24 instructions)
– Branch prediction is done using a multilevel adaptive predictor like the P6 microarchitecture
Instruction delivery (stages EXP and REN): distributes up to 6 instructions to the 9 functional units
– Implements register renaming for both rotation and register stacking

DAP.F96 25 Itanium processor 10-stage pipeline
Operand delivery (WLD and REG): accesses register file, performs register bypassing, accesses and updates a register scoreboard, and checks predicate dependences
– Scoreboard used to detect when individual instructions can proceed, so that a stall of 1 instruction in a bundle need not cause the entire bundle to stall
Execution (EXE, DET, and WRB): executes instructions through ALUs and load/store units, detects exceptions and posts NaTs, retires instructions and performs write-back
– Deferred exception handling for speculative instructions is supported by providing the equivalent of poison bits, called NaTs for Not a Thing, for the GPRs (which makes the GPRs effectively 65 bits wide), and NaT Val (Not a Thing Value) for FPRs (already 82 bits wide)
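The poison-bit idea can be illustrated with a toy model. The sketch below is not IA-64 semantics in detail: `Reg`, `speculative_load`, `add`, `check` and `DeferredFault` are hypothetical names, memory is a plain dictionary, and a missing address stands in for any faulting access. It only shows the three behaviours the slide describes: a speculative load defers its exception by setting the NaT bit, the bit propagates through dependent operations, and the fault is raised only when the value is checked before real use.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1

@dataclass(frozen=True)
class Reg:
    value: int = 0
    nat: bool = False   # the 65th bit: "Not a Thing"

class DeferredFault(Exception):
    """Raised when a poisoned value is finally needed."""

def speculative_load(memory, addr):
    """Load that never traps; a faulting access yields a NaT register instead."""
    if addr not in memory:
        return Reg(0, nat=True)
    return Reg(memory[addr] & MASK64)

def add(a, b):
    """Arithmetic propagates NaT from either operand to the result."""
    if a.nat or b.nat:
        return Reg(0, nat=True)
    return Reg((a.value + b.value) & MASK64)

def check(reg):
    """Speculation check: surface the deferred fault only if the value is used."""
    if reg.nat:
        raise DeferredFault("speculative load faulted; recovery code must rerun it")
    return reg.value

memory = {0x100: 7}

good = add(speculative_load(memory, 0x100), Reg(1))
print(check(good))            # 8

bad = add(speculative_load(memory, 0x200), Reg(1))
print(bad.nat)                # True
try:
    check(bad)
except DeferredFault as err:
    print("deferred:", err)
```

If the branch that guards the load turns out not to need the value, `check` is never called and the poisoned register is simply discarded, which is what lets the compiler hoist loads above branches safely.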