1 Lecture: Static ILP Topics: predication, speculation (Sections C.5, 3.2)

Slides:

Advertisements

Similar presentations

Anshul Kumar, CSE IITD CSL718 : VLIW - Software Driven ILP Hardware Support for Exposing ILP at Compile Time 3rd Apr, 2006.

Advertisements

CPE 731 Advanced Computer Architecture Instruction Level Parallelism Part I Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Advanced Computer Architecture COE 501.

1 Lecture 5: Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2)

Loop Unrolling & Predication CSE 820. Michigan State University Computer Science and Engineering Software Pipelining With software pipelining a reorganized.

Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.

1 4/20/06 Exploiting Instruction-Level Parallelism with Software Approaches Original by Prof. David A. Patterson.

Instruction Level Parallelism María Jesús Garzarán University of Illinois at Urbana-Champaign.

1 Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)

EECC551 - Shaaban #1 Fall 2003 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

1 Lecture: Static ILP Topics: loop unrolling, software pipelines (Sections C.5, 3.2)

1 COMP 740: Computer Architecture and Implementation Montek Singh Tue, Feb 24, 2009 Topic: Instruction-Level Parallelism IV (Software Approaches/Compiler.

Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.

1 Advanced Computer Architecture Limits to ILP Lecture 3.

1 Lecture 10: Static ILP Basics Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1 – 4.4)

3.13. Fallacies and Pitfalls Fallacy: Processors with lower CPIs will always be faster Fallacy: Processors with faster clock rates will always be faster.

1 Lecture: Branch Prediction Topics: branch prediction, bimodal/global/local/tournament predictors, branch target buffer (Section 3.3, notes on class webpage)

1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 3 (and Appendix C) Instruction-Level Parallelism and Its Exploitation Computer Architecture.

1 Lecture: Pipeline Wrap-Up and Static ILP Topics: multi-cycle instructions, precise exceptions, deep pipelines, compiler scheduling, loop unrolling, software.

1 Lecture 7: Static ILP, Branch prediction Topics: static ILP wrap-up, bimodal, global, local branch prediction (Sections )

1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)

EECC551 - Shaaban #1 Winter 2002 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

1 Lecture 8: Branch Prediction, Dynamic ILP Topics: branch prediction, out-of-order processors (Sections )

1 Lecture 2: System Metrics and Pipelining Today’s topics: (Sections 1.6, 1.7, 1.9, A.1)  Quantitative principles of computer design  Measuring cost.

EECC551 - Shaaban #1 Spring 2006 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

EECC551 - Shaaban #1 Fall 2005 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

1 Lecture 8: Branch Prediction, Dynamic ILP Topics: branch prediction, out-of-order processors (Sections )

1 Lecture 5: Pipeline Wrap-up, Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2) Assignment 1 due at the start of class on Thursday.

1 Lecture 6: Static ILP Topics: loop analysis, SW pipelining, predication, speculation (Section 2.2, Appendix G)

1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )

EECC551 - Shaaban #1 Winter 2011 lec# Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level.

EECC551 - Shaaban #1 Spring 2004 lec# Definition of basic instruction blocks Increasing Instruction-Level Parallelism & Size of Basic Blocks.

1 Lecture 6: Static ILP Topics: loop analysis, SW pipelining, predication, speculation (Section 2.2, Appendix G) Assignment 2 posted; due in a week.

1 Lecture 7: Branch prediction Topics: bimodal, global, local branch prediction (Sections )

1 Lecture 6: Static ILP Topics: loop analysis, SW pipelining, predication, speculation (Section 2.2, Appendix G) Please hand in Assignment 1 now Assignment.

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

1 Lecture 10: FP, Performance Metrics Today’s topics:  IEEE 754 representations  FP arithmetic  Evaluating a system Reminder: assignment 4 due in a.

1 Lecture 7: Static ILP and branch prediction Topics: static speculation and branch prediction (Appendix G, Section 2.3)

1 Lecture 12: Advanced Static ILP Topics: parallel loops, software speculation (Sections )

1 Lecture: Static ILP Topics: predication, speculation (Sections C.5, 3.2)

1 Lecture: Out-of-order Processors Topics: branch predictor wrap-up, a basic out-of-order processor with issue queue, register renaming, and reorder buffer.

1 Lecture 3: Pipelining Basics Today: chapter 1 wrap-up, basic pipelining implementation (Sections C.1 - C.4) Reminders:  Sign up for the class mailing.

1 Lecture: Pipeline Wrap-Up and Static ILP Topics: multi-cycle instructions, precise exceptions, deep pipelines, compiler scheduling, loop unrolling, software.

Lecture: Out-of-order Processors

Lecture: Branch Prediction

Approaches to exploiting Instruction Level Parallelism (ILP)

Lecture 11: Advanced Static ILP

Lecture: Static ILP Topics: predication, speculation (Sections C.5, 3.2)

Lecture 6: Static ILP, Branch prediction

Lecture: Static ILP, Branch Prediction

Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)

Lecture: Branch Prediction

Lecture: Out-of-order Processors

Lecture: Static ILP Topics: predication, speculation (Sections C.5, 3.2)

Lecture: Static ILP Topics: loop unrolling, software pipelines (Sections C.5, 3.2)

Lecture: Static ILP Topics: predication, speculation (Sections C.5, 3.2)

Lecture: Static ILP Topics: loop unrolling, software pipelines (Sections C.5, 3.2) HW3 posted, due in a week.

Lecture 20: OOO, Memory Hierarchy

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

CSC3050 – Computer Architecture

Dynamic Hardware Prediction

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Loop-Level Parallelism

Lecture 5: Pipeline Wrap-up, Static ILP

Lecture 7: Branch Prediction, Dynamic ILP

Presentation transcript:

1 Lecture: Static ILP Topics: predication, speculation (Sections C.5, 3.2)

2 Software Pipeline?! L.DADD.DS.D DADDUIBNE L.DADD.DS.D L.DADD.DS.D L.DADD.DS.D L.DADD.D L.DADD.D DADDUIBNE DADDUIBNE DADDUIBNE DADDUIBNE DADDUIBNE … … Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop

3 Software Pipeline L.DADD.DS.D L.DADD.DS.D L.DADD.DS.D L.DADD.DS.D L.DADD.DS.D L.DADD.DS.D L.DADD.D L.D Original iter 1 Original iter 2 Original iter 3 Original iter 4 New iter 1 New iter 2 New iter 3 New iter 4

4 Software Pipelining Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop Loop: S.D F4, 16(R1) ADD.D F4, F0, F2 L.D F0, 0(R1) DADDUI R1, R1,# -8 BNE R1, R2, Loop Advantages: achieves nearly the same effect as loop unrolling, but without the code expansion – an unrolled loop may have inefficiencies at the start and end of each iteration, while a sw-pipelined loop is almost always in steady state – a sw-pipelined loop can also be unrolled to reduce loop overhead Disadvantages: does not reduce loop overhead, may require more registers

5 Problem 4 for (i=1000; i>0; i--) x[i] = y[i] * s; Loop: L.D F0, 0(R1) ; F0 = array element MUL.D F4, F0, F2 ; multiply scalar S.D F4, 0(R2) ; store result DADDUI R1, R1,# -8 ; decrement address pointer DADDUI R2, R2,#-8 ; decrement address pointer BNE R1, R3, Loop ; branch if R1 != R3 NOP Source code Assembly code LD -> any : 1 stall FPMUL -> any: 5 stalls FPMUL -> ST : 4 stalls IntALU -> BR : 1 stall Show the SW pipelined version of the code and does it cause stalls?

6 Problem 4 for (i=1000; i>0; i--) x[i] = y[i] * s; Loop: L.D F0, 0(R1) ; F0 = array element MUL.D F4, F0, F2 ; multiply scalar S.D F4, 0(R2) ; store result DADDUI R1, R1,# -8 ; decrement address pointer DADDUI R2, R2,#-8 ; decrement address pointer BNE R1, R3, Loop ; branch if R1 != R3 NOP Source code Assembly code LD -> any : 1 stall FPMUL -> any: 5 stalls FPMUL -> ST : 4 stalls IntALU -> BR : 1 stall Show the SW pipelined version of the code and does it cause stalls? Loop: S.D F4, 0(R2) MUL F4, F0, F2 L.D F0, 0(R1) DADDUI R2, R2, #-8 BNE R1, R3, Loop DADDUI R1, R1, #-8 There will be no stalls

7 Predication A branch within a loop can be problematic to schedule Control dependences are a problem because of the need to re-fetch on a mispredict For short loop bodies, control dependences can be converted to data dependences by using predicated/conditional instructions

8 Predicated or Conditional Instructions if (R1 == 0) R2 = R2 + R4 else R6 = R3 + R5 R4 = R2 + R3 R7 = !R1 R8 = R2 R2 = R2 + R4 (predicated on R7) R6 = R3 + R5 (predicated on R1) R4 = R8 + R3 (predicated on R1)

9 Predicated or Conditional Instructions The instruction has an additional operand that determines whether the instr completes or gets converted into a no-op Example: lwc R1, 0(R2), R3 (load-word-conditional) will load the word at address (R2) into R1 if R3 is non-zero; if R3 is zero, the instruction becomes a no-op Replaces a control dependence with a data dependence (branches disappear) ; may need register copies for the condition or for values used by both directions if (R1 == 0) R2 = R2 + R4 else R6 = R3 + R5 R4 = R2 + R3 R7 = !R1 ; R8 = R2 ; R2 = R2 + R4 (predicated on R7) R6 = R3 + R5 (predicated on R1) R4 = R8 + R3 (predicated on R1)

10 Problem 1 Use predication to remove control hazards in this code if (R1 == 0) R2 = R5 + R4 R3 = R2 + R4 else R6 = R3 + R2

11 Problem 1 Use predication to remove control hazards in this code if (R1 == 0) R2 = R5 + R4 R3 = R2 + R4 else R6 = R3 + R2 R7 = !R1 ; R6 = R3 + R2 (predicated on R1) R2 = R5 + R4 (predicated on R7) R3 = R2 + R4 (predicated on R7)

12 Complications Each instruction has one more input operand – more register ports/bypassing If the branch condition is not known, the instruction stalls (remember, these are in-order processors) Some implementations allow the instruction to continue without the branch condition and squash/complete later in the pipeline – wasted work Increases register pressure, activity on functional units Does not help if the br-condition takes a while to evaluate

13 Support for Speculation In general, when we re-order instructions, register renaming can ensure we do not violate register data dependences However, we need hardware support  to ensure that an exception is raised at the correct point  to ensure that we do not violate memory dependences st br ld

14 Detecting Exceptions Some exceptions require that the program be terminated (memory protection violation), while other exceptions require execution to resume (page faults) For a speculative instruction, in the latter case, servicing the exception only implies potential performance loss In the former case, you want to defer servicing the exception until you are sure the instruction is not speculative Note that a speculative instruction needs a special opcode to indicate that it is speculative

15 Program-Terminate Exceptions When a speculative instruction experiences an exception, instead of servicing it, it writes a special NotAThing value (NAT) in the destination register If a non-speculative instruction reads a NAT, it flags the exception and the program terminates (it may not be desireable that the error is caused by an array access, but the segfault happens two procedures later) Alternatively, an instruction (the sentinel) in the speculative instruction’s original location checks the register value and initiates recovery

16 Memory Dependence Detection If a load is moved before a preceding store, we must ensure that the store writes to a non-conflicting address, else, the load has to re-execute When the speculative load issues, it stores its address in a table (Advanced Load Address Table in the IA-64) If a store finds its address in the ALAT, it indicates that a violation occurred for that address A special instruction (the sentinel) in the load’s original location checks to see if the address had a violation and re-executes the load if necessary

17 Problem 2 For the example code snippet below, show the code after the load is hoisted: Instr-A Instr-B ST R2  [R3] Instr-C BEZ R7, foo Instr-D LD R8  [R4] Instr-E

18 Problem 2 For the example code snippet below, show the code after the load is hoisted: LD.S R8  [R4] Instr-A Instr-A Instr-B Instr-B ST R2  [R3] ST R2  [R3] Instr-C Instr-C BEZ R7, foo BEZ R7, foo Instr-D Instr-D LD R8  [R4] LD.C R8, rec-code Instr-E Instr-E rec-code: LD R8  [R4]

19 Amdahl’s Law Architecture design is very bottleneck-driven – make the common case fast, do not waste resources on a component that has little impact on overall performance/power Amdahl’s Law: performance improvements through an enhancement is limited by the fraction of time the enhancement comes into play Example: a web server spends 40% of time in the CPU and 60% of time doing I/O – a new processor that is ten times faster results in a 36% reduction in execution time (speedup of 1.56) – Amdahl’s Law states that maximum execution time reduction is 40% (max speedup of 1.66)

20 Principle of Locality Most programs are predictable in terms of instructions executed and data accessed The Rule: a program spends 90% of its execution time in only 10% of the code Temporal locality: a program will shortly re-visit X Spatial locality: a program will shortly visit X+1

21 Problem 1 What is the storage requirement for a global predictor that uses 3-bit saturating counters and that produces an index by XOR-ing 12 bits of branch PC with 12 bits of global history?

22 Problem 1 What is the storage requirement for a global predictor that uses 3-bit saturating counters and that produces an index by XOR-ing 12 bits of branch PC with 12 bits of global history? The index is 12 bits wide, so the table has 2^12 saturating counters. Each counter is 3 bits wide. So total storage = 3 * 4096 = 12 Kb or 1.5 KB

23 Problem 2 What is the storage requirement for a tournament predictor that uses the following structures:  a “selector” that has 4K entries and 2-bit counters  a “global” predictor that XORs 14 bits of branch PC with 14 bits of global history and uses 3-bit counters  a “local” predictor that uses an 8-bit index into L1, and produces a 12-bit index into L2 by XOR-ing branch PC and local history. The L2 uses 2-bit counters.

24 Problem 2 What is the storage requirement for a tournament predictor that uses the following structures:  a “selector” that has 4K entries and 2-bit counters  a “global” predictor that XORs 14 bits of branch PC with 14 bits of global history and uses 3-bit counters  a “local” predictor that uses an 8-bit index into L1, and produces a 12-bit index into L2 by XOR-ing branch PC and local history. The L2 uses 2-bit counters. Selector = 4K * 2b = 8 Kb Global = 3b * 2^14 = 48 Kb Local = (12b * 2^8) + (2b * 2^12) = 3 Kb + 8 Kb = 11 Kb Total = 67 Kb

25 Problem 3 For the code snippet below, estimate the steady-state bpred accuracies for the default PC+4 prediction, the 1-bit bimodal, 2-bit bimodal, global, and local predictors. Assume that the global/local preds use 5-bit histories. do { for (i=0; i<4; i++) { increment something } for (j=0; j<8; j++) { increment something } k++; } while (k < some large number)

26 Problem 3 For the code snippet below, estimate the steady-state bpred accuracies for the default PC+4 prediction, the 1-bit bimodal, 2-bit bimodal, global, and local predictors. Assume that the global/local preds use 5-bit histories. do { for (i=0; i<4; i++) { increment something } for (j=0; j<8; j++) { increment something } k++; } while (k < some large number) PC+4: 2/13 = 15% 1b Bim: (2+6+1)/(4+8+1) = 9/13 = 69% 2b Bim: (3+7+1)/13 = 11/13 = 85% Global: (4+7+1)/13 = 12/13 = 92% (gets confused by unless you take branch-PC into account while indexing) Local: (4+7+1)/13 = 12/13 = 92%

27 Title Bullet