Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.

Slides:

Advertisements

Similar presentations

Hardware-Based Speculation. Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP One way.

Advertisements

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.

CSE 502 Graduate Computer Architecture Lec 11 – More Instruction Level Parallelism Via Speculation Larry Wittie Computer Science, StonyBrook University.

Computer Architecture Lec 8 – Instruction Level Parallelism.

THE MIPS R10000 SUPERSCALAR MICROPROCESSOR Kenneth C. Yeager IEEE Micro in April 1996 Presented by Nitin Gupta.

Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)

CS136, Advanced Architecture Speculation. CS136 2 Outline Speculation Speculative Tomasulo Example Memory Aliases Exceptions VLIW Increasing instruction.

CS 211: Computer Architecture Lecture 5 Instruction Level Parallelism and Its Dynamic Exploitation Instructor: M. Lancaster Corresponding to Hennessey.

CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.

CSE 502 Graduate Computer Architecture Lec – More Instruction Level Parallelism Via Speculation Larry Wittie Computer Science, StonyBrook University.

Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.

W04S1 COMP s1 Seminar 4: Branch Prediction Slides due to David A. Patterson, 2001.

Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)

EENG449b/Savvides Lec /17/04 February 17, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.

Computer Architecture 2011 – out-of-order execution (lec 7) 1 Computer Architecture Out-of-order execution By Dan Tsafrir, 11/4/2011 Presentation based.

1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )

EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.

1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.

1 Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections , )

1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )

1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )

7/2/ _23 1 Pipelining ECE-445 Computer Organization Dr. Ron Hayne Electrical and Computer Engineering.

Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

1 Lecture 7: Static ILP and branch prediction Topics: static speculation and branch prediction (Appendix G, Section 2.3)

1 Overcoming Control Hazards with Dynamic Scheduling & Speculation.

Ch4. Multiprocessors & Thread-Level Parallelism 2. SMP (Symmetric shared-memory Multiprocessors) ECE468/562 Advanced Computer Architecture Prof. Honggang.

Instruction-Level Parallelism dynamic scheduling prepared and Instructed by Shmuel Wimer Eng. Faculty, Bar-Ilan University May 2015Instruction-Level Parallelism.

1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.

CSCE 614 Fall Hardware-Based Speculation As more instruction-level parallelism is exploited, maintaining control dependences becomes an increasing.

1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.

1 CPRE 585 Term Review Performance evaluation, ISA design, dynamically scheduled pipeline, and memory hierarchy.

CSC 4250 Computer Architectures October 31, 2006 Chapter 3.Instruction-Level Parallelism & Its Dynamic Exploitation.

Adapted from Computer Organization and Design, Patterson & Hennessy, UCB ECE232: Hardware Organization and Design Part 13: Branch prediction (Chapter 4/6)

Out-of-order execution Lihu Rappoport 11/ MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.

1 Lecture 10: Memory Dependence Detection and Speculation Memory correctness, dynamic memory disambiguation, speculative disambiguation, Alpha Example.

CS 5513 Computer Architecture Lecture 6 – Instruction Level Parallelism continued.

CS203 – Advanced Computer Architecture ILP and Speculation.

Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.

Pentium 4 Deeply pipelined processor supporting multiple issue with speculation and multi-threading 2004 version: 31 clock cycles from fetch to retire,

CS 352H: Computer Systems Architecture

Dynamic Scheduling Why go out of style?

/ Computer Architecture and Design

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue

PowerPC 604 Superscalar Microprocessor

CS203 – Advanced Computer Architecture

Lecture: Out-of-order Processors

CPSC 614 Computer Architecture Lec 5 – Instruction Level Parallelism

Morgan Kaufmann Publishers The Processor

CMSC 611: Advanced Computer Architecture

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

CS 704 Advanced Computer Architecture

Lecture 11: Memory Data Flow Techniques

Lecture 8: Dynamic ILP Topics: out-of-order processors

Adapted from the slides of Prof

Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)

Advanced Computer Architecture

Lecture 10: Branch Prediction and Instruction Delivery

Larry Wittie Computer Science, StonyBrook University and ~lw

CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue

CPSC 614 Computer Architecture Lec 5 – Instruction Level Parallelism

Adapted from the slides of Prof

Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011

CSC3050 – Computer Architecture

Dynamic Hardware Prediction

Overcoming Control Hazards with Dynamic Scheduling & Speculation

Lecture 9: Dynamic ILP Topics: out-of-order processors

Conceptual execution on a processor which exploits ILP

Presentation transcript:

Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department University of Massachusetts Dartmouth 285 Old Westport Rd. North Dartmouth, MA Slides based on the PowerPoint Presentations created by David Patterson as part of the Instructor Resources for the textbook by Hennessy & Patterson Updated by Honggang Wang.

2 Review Dynamic Scheduling –Tomasulo Algorithm Speculation –Speculative Tomasulo Exam ple Memory Aliases Exceptions Increasing instruction bandwidth Register Renaming vs. Reorder Buffer Value Prediction

3 Outline Dynamic Scheduling –Tomasulo Algorithm Speculation –Speculative Tomasulo Exam ple Memory Aliases Increasing instruction bandwidth Register Renaming vs. Reorder Buffer Value Prediction

4 Avoiding Memory Hazards WAW and WAR hazards through memory are eliminated with speculation because actual updating of memory occurs in order, when a store is at head of the ROB, and hence, no earlier loads or stores can still be pending RAW hazards through memory are maintained by two restrictions: 1.not allowing a load to initiate the second step of its execution if any active ROB entry occupied by a store has a Destination field that matches the value of the A field of the load, and 2.maintaining the program order for the computation of an effective address of a load with respect to all earlier stores. these restrictions ensure that any load that accesses a memory location written to by an earlier store cannot perform the memory access until the store has written the data

5 Outline Dynamic Scheduling –Tomasulo Algorithm Speculation –Speculative Tomasulo Exam ple Memory Aliases Increasing instruction bandwidth Register Renaming vs. Reorder Buffer Value Prediction

6 Increasing Instruction Fetch Bandwidth Predicts next instruct address, sends it out before decoding instructuction PC of branch sent to BTB When match is found, Predicted PC is returned If branch predicted taken, instruction fetch continues at Predicted PC Branch Target Buffer (BTB)

The Steps in Handling An Instruction with a BTB Determine the total branch penalty for a branch-target buffer. Make the following assumption about the prediction accuracy and hit rate: Prediction accuracy is 90% ( for instruction in the buffer) Hit rate in the buffer is 90% (for branches predicated taken) 7

IF BW: Return Address Predictor 8

9 Small buffer of return addresses acts as a stack Caches most recent return addresses Call  Push a return address on stack Return  Pop an address off stack & predict as new PC

10 More Instruction Fetch Bandwidth Integrated branch prediction branch predictor is part of instruction fetch unit and is constantly predicting branches Instruction prefetch Instruction fetch units prefetch to deliver multiple instruct. per clock, integrating it with branch prediction Instruction memory access and buffering Fetching multiple instructions per cycle: –May require accessing multiple cache blocks (prefetch to hide cost of crossing cache blocks) –Provides buffering, acting as on-demand unit to provide instructions to issue stage as needed and in quantity needed

11 Outline Dynamic Scheduling –Tomasulo Algorithm Speculation –Speculative Tomasulo Exam ple Memory Aliases Exceptions Increasing instruction bandwidth Register Renaming vs. Reorder Buffer Value Prediction

12 Speculation: Register Renaming vs. ROB Alternative to ROB is a larger physical set of registers combined with register renaming –Extended registers replace function of both ROB and reservation stations Instruction issue maps names of architectural registers to physical register numbers in extended register set –On issue, allocates a new unused register for the destination (which avoids WAW and WAR hazards) –Speculation recovery easy because a physical register holding an instruction destination does not become the architectural register until the instruction commits Most Out-of-Order processors today use extended registers with renaming

13 Outline Dynamic Scheduling –Tomasulo Algorithm Speculation –Speculative Tomasulo Exam ple Memory Aliases Exceptions Increasing instruction bandwidth Register Renaming vs. Reorder Buffer Value Prediction

14 Value Prediction Attempts to predict value produced by instruction –E.g., Loads a value that changes infrequently Value prediction is useful only if it significantly increases ILP –Focus of research has been on loads; so-so results, no processor uses value prediction Related topic is address aliasing prediction –RAW for load and store or WAW for 2 stores Address alias prediction is both more stable and simpler since need not actually predict the address values, only whether such values conflict –Has been used by a few processors

Value Prediction 15 Speculative prediction of register values –Values predicted during fetch and dispatch, forwarded to dependent instructions. –Dependent instructions can be issued and executed immediately. –Before committing a dependent instruction, we must verify the predictions. If wrong: must restart dependent instruction w/ correct values. FetchDecodeIssueExecuteCommit Predict Value Verify if mispredicted

Value Prediction 16 PC Pred History Value History Classification Table (CT) Value Prediction Table (VPT) Should I predict? Predicted Value

Value Prediction 17 Predicted instruction executes normally. Dependent instruction cannot commit until predicted instruction has finished executing. Computed result compared to predicted; if ok then dependent instructions can commit. If not, dependent instructions must reissue and execute with computed value. Miss penalty = 1 cycle later than no prediction.

18 Perspective Interest in multiple-issue because wanted to improve performance without affecting uniprocessor programming model Taking advantage of ILP is conceptually simple, but design problems are amazingly complex in practice Conservative in ideas, just faster clock and bigger Processors of last 5 years (Pentium 4, IBM Power 5, AMD Opteron) have the same basic structure and similar sustained issue rates (3 to 4 instructions per clock) as the 1st dynamically scheduled, multiple- issue processors announced in 1995 –Clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many renaming registers, and 2X as many load-store units  performance 8 to 16X Peak v. delivered performance gap increasing