We think you have liked this presentation. If you wish to download it, please recommend it to your friends in any social system. Share buttons are a little bit lower. Thank you!
Presentation is loading. Please wait.
Published byAngelique Darr
Modified about 1 year ago
DAP Spr.‘98 ©UCB 1 Lecture 6: ILP Techniques Contd. Laxmi N. Bhuyan CS 162 Spring 2003
DAP Spr.‘98 ©UCB 2 Tomasulo Summary Reservations stations: renaming to larger set of registers + buffering source operands –Prevents registers as bottleneck –Avoids WAR, WAW hazards of Scoreboard –Allows loop unrolling in HW Not limited to basic blocks (integer units gets ahead, beyond branches) Helps cache misses as well Lasting Contributions –Dynamic scheduling –Register renaming –Load/store disambiguation 360/91 descendants are Pentium II; PowerPC 604; MIPS R10000; HP-PA 8000; Alpha 21264
DAP Spr.‘98 ©UCB 3 Tomasulo Drawbacks Complexity –delays of 360/91, MIPS 10000, IBM 620? Many associative stores (CDB) at high speed Performance limited by Common Data Bus –Multiple CDBs => more FU logic for parallel assoc stores
DAP Spr.‘98 ©UCB 4 HW support for More ILP Speculation: allow an instruction to issue that is dependent on branch predicted to be taken without any consequences (including exceptions) if branch is not actually taken (“HW undo”); called “boosting” Combine branch prediction with dynamic scheduling to execute before branches resolved Separate speculative bypassing of results from real bypassing of results –When instruction no longer speculative, write boosted results (instruction commit) or discard boosted results –execute out-of-order but commit in-order to prevent irrevocable action (update state or exception) until instruction commits HW support for More ILP
DAP Spr.‘98 ©UCB 5 HW support for More ILP Need HW buffer for results of uncommitted instructions: reorder buffer –3 fields: instr, destination, value –Reorder buffer can be operand source => more registers like RS –Use reorder buffer number instead of reservation station when execution completes –Supplies operands between execution complete & commit –Once operand commits, result is put into register –Instructions commit in order –As a result, its easy to undo speculated instructions on mispredicted branches or on exceptions Reorder Buffer FP Regs FP Op Queue FP Adder Res Stations
DAP Spr.‘98 ©UCB 6 Four Steps of Speculative Tomasulo Algorithm 1.Issue—get instruction from FP Op Queue If reservation station and reorder buffer slot free, issue instr & send operands & reorder buffer no. for destination (this stage sometimes called “dispatch”) 2.Execution—operate on operands (EX) When both operands ready then execute; if not ready, watch CDB for result; when both in reservation station, execute; checks RAW (sometimes called “issue”) 3.Write result—finish execution (WB) Write on Common Data Bus to all awaiting FUs & reorder buffer; mark reservation station available. 4.Commit—update register with reorder result When instr. at head of reorder buffer & result present, update register with result (or store to memory) and remove instr from reorder buffer. Mispredicted branch flushes reorder buffer (sometimes called “graduation”)
DAP Spr.‘98 ©UCB 7 Renaming Registers Common variation of speculative design Reorder buffer keeps instruction information but not the result Extend register file with extra renaming registers to hold speculative results Rename register allocated at issue; result into rename register on execution complete; rename register into real register on commit Operands read either from register file (real or speculative) or via Common Data Bus Advantage: operands are always from single source (extended register file)
DAP Spr.‘98 ©UCB 8 Very Long Instruction Word: VLIW Architectures Wide-issue processor that relies on compiler to –Packet together independent instructions to be issued in parallel –Schedule code to minimize hazards and stalls Very long instruction words (3 to 8 operations) –Can be issued in parallel without checks –If compiler cannot find independent operations, it inserts nops Advantage: simpler HW for wide issue –Faster clock cycle –Lower design & verification cost Disadvantages: –Code size –Requires aggressive compilation technology
DAP Spr.‘98 ©UCB 9 Traditional VLIW Hardware Multiple functional units, many registers (e.g. 128) –Large multiported register file (for N FUs need ~3N ports) Simple instruction fetch unit –No checks, direct correspondence between slots & FUs Instruction format –16 to 24 bits per op => 5*16=80 bits to 5*24=120 bits wide –Can share immediate fields (1 per long instruction)
DAP Spr.‘98 ©UCB 10 VLIW Code Example for (i=0; i<100; i++) X[i] = X[i] + B; MIPS code Loop:l.d f2, 0(r1) add.d f4, f0, f2 s.d f4, 0(r1) addui r1, r1, 8 bne r1, r2, Loop
DAP Spr.‘98 ©UCB 11 Unrolled Loop (7 times) on VLIW Hardware C# Mem1 Mem2 FP1 FP2 Int/Branch 1 l.d f2, 0(r1)l.d f6,8(r1) 2 l.d f10,16(r1) l.d f14,24(r1) 3 l.d f18,32(r1) l.d f22,40(r1) add.d f4,f0, f0 add.d f8,f0, f6 4 l.d f26,48(r1)add.d f12,f0,f10 add.d f16,f0,f14 daddu r1,r1,#56 5 add.d f20,f0,f18add.d f24,f0,f22 6 s.d f4,-56(r1) s.d f8,-48(r1) add.d f28,f0,f26 7 s.d f12,-40(r1) s.d f16,-32(r1) 8 s.d f20,-24(r1) s.d f24,-16(r1) 9 s.d f28, -8(r1) bneq r1,r2,Loop Assuming 2 cycle loads, 3 cycle FP adds 9 cycles for 7 iterations –1.28 cycles/iteration or 2.6 operations/cycle 21 nops (empty slots) 24 FP registers used (instead of just 3)
DAP Spr.‘98 ©UCB 12 Problems with Early VLIW Processors Characteristics –Large number of slots per long instruction –All operations in a long instructions must be independent –No interlocks (check for hazards across long instructions) –Fixed/direct correspondence between instruction slots and FUs Code size problems –Need many nops within each long instructions –Potential solution: code compression »Before of after L1-cache Binary compatibility problems –Different ISA every time number or latency of FUs changes –Must recompile for every processor generation –Potential solution: binary translation
DAP Spr.‘98 ©UCB 13 Modern VLIW Processors Hardware interlocks –HW check for dependencies across LIWs Use a special field (bit mask) to determine if operations in LIW are independent or not –E.g. stop bits in IA-64 –Can also specify dependence to following LIW Support several combinations of operation types per LIW –E.g., allow both Mem-FP-Int and Mem-Int-Int –Use an instruction field to specify the type of each operations –Requires network that will “route” operations to proper FUs
DAP Spr.‘98 ©UCB 14 How Can the HW Help the Compiler with Discovering ILP? Compiler’s performance is critical for VLIW processors –Find many independent instructions & schedule them in best possible way What limits the compiler’s ability to discover ILP –Name dependencies (WAW & WAR) »Can eliminate with large number of registers –Branches »Limit compiler’s ability to schedule »Modern VLIW processors use branch prediction too –Dependencies through memory »Force the compiler to use conservative schedule Can the HW help the compiler? –Ideally, with techniques simpler than those for superscalar processors
DAP Spr.‘98 ©UCB 15 VLIW and Superscalar sequential stream of long instruction words instructions scheduled statically by the compiler number of simultaneously issued instructions is fixed during compile-time instruction issue is less complicated than in a superscalar processor Disadvantage: VLIW processors cannot react on dynamic events, e.g. cache misses, with the same flexibility like superscalars. The number of instructions in a VLIW instruction word is usually fixed. Padding VLIW instructions with no-ops is needed in case the full issue bandwidth is not be met. This increases code size. More recent VLIW architectures use a denser code format which allows to remove the no-ops. VLIW is an architectural technique, whereas superscalar is a microarchitecture technique. VLIW processors take advantage of spatial parallelism.
Review of CS 203A Laxmi Narayan Bhuyan Lecture2.
1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.
CS203 – Advanced Computer Architecture ILP and Speculation.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 9, 2002 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)
CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
EE524/CptS561 Advanced Computer Architecture Dynamic Scheduling A scheme to overcome data hazards.
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 14, 2002 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)
1 Zvika Guz Slides modified from Prof. Dave Patterson, Prof. John Kubiatowicz, and Prof. Nancy Warter-Perez Out Of Order Execution.
CIS 629 Fall 2002 Multiple Issue/Speculation Multiple Instruction Issue: CPI < 1 To improve a pipeline’s CPI to be better [less] than one, and to utilize.
1 Chapter 2: ILP and Its Exploitation Review simple static pipeline ILP Overview Dynamic branch prediction Dynamic scheduling, out-of-order execution Hardware-based.
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Oct 19, 2005 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)
Lec18.1 Step by step for Dynamic Scheduling by reorder buffer Copyright by John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
Hardware-Based Speculation. Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP One way.
Pipelining 5. Two Approaches for Multiple Issue Superscalar –Issue a variable number of instructions per clock –Instructions are scheduled either statically.
CPSC614 Lec 5.1 Instruction Level Parallelism and Dynamic Execution #4: Based on lectures by Prof. David A. Patterson E. J. Kim.
Computer Architecture Lec 8 – Instruction Level Parallelism.
Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.
Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
COMP25212 Advanced Pipelining Out of Order Processors.
1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.
CS 5513 Computer Architecture Lecture 6 – Instruction Level Parallelism continued.
CSCE 614 Fall Hardware-Based Speculation As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing.
1 Overcoming Control Hazards with Dynamic Scheduling & Speculation.
EECC551 - Shaaban #1 lec # 8 Winter Multiple Instruction Issue: CPI < 1 To improve a pipeline’s CPI to be better [less] than one, and to.
Lecture 8: More ILP stuff Professor Alvin R. Lebeck Computer Science 220 Fall 2001.
OOO execution © Avi Mendelson, 4/ MAMAS – Computer Architecture Lecture 7 – Out Of Order (OOO) Avi Mendelson Some of the slides were taken.
1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall: Multicycle instructions lead to the requirement.
1 COMP 740: Computer Architecture and Implementation Montek Singh Tue, Mar 17, 2009 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)
1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.
1 Lecture 5: Pipeline Wrap-up, Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2) Assignment 1 due at the start of class on Thursday.
CS 211: Computer Architecture Lecture 5 Instruction Level Parallelism and Its Dynamic Exploitation Instructor: M. Lancaster Corresponding to Hennessey.
2/24; 3/1,3/11 (quiz was 2/22, QuizAns 3/8) CSE502-S11, Lec ILP 1 Tomasulo Organization FP adders Add1 Add2 Add3 FP multipliers Mult1 Mult2 From.
CMSC 611: Advanced Computer Architecture Tomasulo Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
CS136, Advanced Architecture Speculation. CS136 2 Outline Speculation Speculative Tomasulo Example Memory Aliases Exceptions VLIW Increasing instruction.
Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.
1 Lecture 6 Tomasulo Algorithm CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading:Textbook 2.4, 2.5.
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
CSE 502 Graduate Computer Architecture Lec – More Instruction Level Parallelism Via Speculation Larry Wittie Computer Science, StonyBrook University.
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 3 Instruction-Level Parallelism and Its Exploitation Computer Architecture A Quantitative.
CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 1 Compiler Techniques for ILP So far we have explored dynamic hardware techniques for.
ENGS 116 Lecture 101 ILP: Software Approaches Vincent H. Berk October 12 th Reading for today: , 4.1 Reading for Friday: 4.2 – 4.6 Homework #2:
/ Computer Architecture and Design Instructor: Dr. Michael Geiger Summer 2014 Lecture 6: Speculation.
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 3 (and Appendix C) Instruction-Level Parallelism and Its Exploitation Cont. Computer Architecture.
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Advanced Computer Architecture COE 501.
04/03/2016 slide 1 Dynamic instruction scheduling Key idea: allow subsequent independent instructions to proceed DIVDF0,F2,F4; takes long time ADDDF10,F0,F8;
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:
© 2017 SlidePlayer.com Inc. All rights reserved.