3.13. Fallacies and Pitfalls
Fallacy: Processors with lower CPIs will always be faster.
Fallacy: Processors with faster clock rates will always be faster.
–A balance must be found: e.g. a sophisticated pipeline may lower CPI but lengthen the clock cycle (CPI ↓, clock cycle ↑)
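Execution time is the product of all three terms: instruction count × CPI × clock cycle time. A minimal worked sketch in C (both machines and all numbers are invented for illustration):

    #include <stdio.h>

    /* Hypothetical machines running the same 10^9-instruction program.
     * Machine A has the lower CPI; machine B has the faster clock.
     * Neither metric alone determines which is faster. */
    int main(void) {
        double insns  = 1e9;
        double time_a = insns * 1.2 / 1.0e9;  /* CPI 1.2 at 1.0 GHz -> 1.2 s */
        double time_b = insns * 2.0 / 2.5e9;  /* CPI 2.0 at 2.5 GHz -> 0.8 s */
        printf("A: %.2f s  B: %.2f s\n", time_a, time_b);
        return 0;
    }

Here the machine with the worse CPI wins on wall-clock time; swapping the clock rates reverses the outcome, which is exactly why neither fallacy holds.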

Fallacies and Pitfalls
Pitfall: Emphasizing CPI improvement by increasing the issue rate, while sacrificing clock rate, can decrease performance.
–Again, a question of balance: SuperSPARC –vs– HP PA 7100
–Complex interactions between cycle time and organisation

Fallacies and Pitfalls
Pitfall: Improving only one aspect of a multiple-issue processor and expecting overall performance improvement.
–Amdahl's Law!
–Boosting performance in one area may only uncover a bottleneck in another
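As a reminder of the arithmetic (numbers invented for illustration): if a wider FP unit doubles the speed of the 30% of execution time it affects, Amdahl's Law limits the overall speedup to 1 / (0.7 + 0.3/2) ≈ 1.18, because the untouched 70% dominates.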

Fallacies and Pitfalls
Pitfall: Sometimes bigger and dumber is better!
–Alpha 21264: sophisticated multilevel tournament branch predictor
–Alpha 21164: simple two-bit predictor
–The 21164 performs better on a transaction-processing application: its smaller predictor can hold local predictions for twice as many branches
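The 21164-style scheme is the classic two-bit saturating counter; a minimal C sketch of one counter (the encoding is the textbook one, not taken from these slides):

    /* Two-bit saturating counter: states 0-1 predict not-taken, 2-3 predict
     * taken. One mispredict in a "strong" state does not flip the prediction. */
    typedef unsigned char ctr2;  /* holds 0..3 */

    int predict_taken2(ctr2 c) { return c >= 2; }

    ctr2 update2(ctr2 c, int taken) {
        if (taken)
            return c < 3 ? c + 1 : 3;  /* saturate at "strongly taken" */
        return c > 0 ? c - 1 : 0;      /* saturate at "strongly not-taken" */
    }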

Concluding Remarks
Lots of open questions!
–Clock speed –vs– CPI
–Power issues
–Exploiting parallelism: ILP –vs– explicit parallelism

Characteristics of Modern (2001) Processors
Figure 3.61:
–3–4-way superscalar
–4–22-stage pipelines
–Branch prediction
–Register renaming (except UltraSPARC)
–400 MHz – 1.7 GHz clock rates
–7–130 million transistors

Chapter 4 Exploiting ILP with Software

4.1. Compiler Techniques for Exposing ILP
Compilers can improve the performance of simple pipelines:
–Reduce data hazards
–Reduce control hazards

Loop Unrolling
Compiler technique to increase ILP:
–Duplicate the loop body
–Decrease the number of iterations
Example:
    for (int k = 0; k < 1000; k++) {
        x[k] = x[k] + s;
    }
–Basic code: 10 cycles per iteration
–Scheduled: 6 cycles

Loop Unrolling
Original:
    for (int k = 0; k < 1000; k++) {
        x[k] = x[k] + s;
    }
Unrolled four times:
    for (int k = 0; k < 1000; k += 4) {
        x[k]   = x[k]   + s;
        x[k+1] = x[k+1] + s;
        x[k+2] = x[k+2] + s;
        x[k+3] = x[k+3] + s;
    }
–Basic code: 7 cycles per “iteration”
–Scheduled: 3.5 cycles (no stalls!)
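A real compiler must also handle trip counts that are not a multiple of the unroll factor; a hedged C sketch of the usual cleanup-loop approach (function name and signature are ours):

    void add_scalar(double *x, double s, int n) {
        int k = 0;
        /* Main unrolled loop: four independent adds per iteration
         * give the scheduler room to hide latencies. */
        for (; k + 3 < n; k += 4) {
            x[k]   += s;
            x[k+1] += s;
            x[k+2] += s;
            x[k+3] += s;
        }
        /* Cleanup loop for the remaining 0-3 elements. */
        for (; k < n; k++)
            x[k] += s;
    }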

Loop Unrolling
Requires clever compilers:
–Analysing data dependences, name dependences and control dependences
Limitations:
–Code size
–Diminishing returns: each further unroll amortises less remaining overhead
–“Register pressure”
–Compiler limitations
Useful for any architecture.

Superscalar Performance
Two-issue MIPS (one integer + one FP instruction per cycle):
–Loop unrolled five times
–2.4 cycles per “iteration”

4.2. Static Branch Prediction
Useful:
–where behaviour can be predicted at compile-time
–to assist dynamic prediction
Architectural support:
–Delayed branches

Static Branch Prediction
Simple scheme:
–Predict taken
–Average misprediction rate of 34% (SPEC), ranging from 9% to 59%
Better in principle:
–Predict backward branches taken, forward branches not-taken
–Actually worse for SPEC!
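The backward-taken heuristic relies on backward branches mostly being loop bottoms; a minimal C sketch (the interface is invented for illustration):

    #include <stdint.h>

    /* Predict taken iff the target address precedes the branch, i.e. the
     * branch goes backwards, which is usually the bottom of a loop. */
    int btfn_predict_taken(uint64_t branch_pc, uint64_t target_pc) {
        return target_pc < branch_pc;
    }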

Static Branch Prediction
Advanced compiler analysis can do better.
Profiling is very useful; misprediction rates:
–FP: 9% ± 4%
–Int: 15% ± 5%

4.3. Static Multiple Issue: VLIW
The compiler groups instructions into “packets”, checking for dependences:
–Remove dependences, or
–Flag dependences
This simplifies the hardware.

VLIW
The first machines used a wide instruction encoding multiple operations per instruction:
–Hence Very Long Instruction Word (VLIW)
–64–128 bits
Alternative: group several conventional instructions into an issue packet.

VLIW Architectures
Multiple functional units; the compiler selects instructions for each unit to create one long instruction (an issue packet).
Example: five operations per packet
–1 × integer/branch, 2 × FP, 2 × memory access
Needs lots of parallelism:
–Use loop unrolling or global scheduling (a sketch of one packet follows)
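Such a five-slot packet can be pictured as a fixed-layout record that the compiler fills, padding unused slots with no-ops (a hedged sketch; the field layout is invented for illustration):

    #include <stdint.h>

    typedef uint32_t op_t;      /* one encoded operation; 0 = explicit no-op */

    /* One VLIW issue packet: all five slots issue together in one cycle.
     * The compiler, not the hardware, guarantees they are independent. */
    typedef struct {
        op_t int_or_branch;     /* 1 integer/branch slot */
        op_t fp[2];             /* 2 floating-point slots */
        op_t mem[2];            /* 2 memory-access slots */
    } vliw_packet;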

Example
    for (int k = 0; k < 1000; k++) {
        x[k] = x[k] + s;
    }
–Loop unrolled seven times!
–1.29 cycles per result
–60% of the available instruction “slots” filled

Summary of Improvements
    Technique            Unscheduled   Scheduled
    Basic code           10            6
    Loop unrolled (4)    7             3.5
    Superscalar (5)      –             2.4
    VLIW (7)             –             1.29
(Cycles per result; unroll factor in parentheses.)

Drawbacks of Original VLIWs
Large code size:
–Need for aggressive loop unrolling
–Wasted space for unused slots
–Mitigated by clever encoding techniques and compression
Lock-step execution:
–A stall in one unit stalls them all
No binary code compatibility:
–Variations in structure required recompilation

4.4. Compiler Support for Exploiting ILP
We will not cover this section in detail.
Loop unrolling:
–Loop-carried dependences
Software pipelining:
–Interleave instructions from different iterations (sketched below)
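A hedged C-level sketch of software pipelining: the kernel overlaps the load for iteration k+1 with the add and store of iteration k (a real compiler does this at the instruction level; the restructuring and names here are ours):

    void add_scalar_swp(double *x, double s, int n) {
        if (n <= 0) return;
        double v = x[0];              /* prologue: load for iteration 0 */
        for (int k = 0; k < n - 1; k++) {
            double next = x[k + 1];   /* load for iteration k+1 ...      */
            x[k] = v + s;             /* ... overlaps add/store of k     */
            v = next;
        }
        x[n - 1] = v + s;             /* epilogue: finish last iteration */
    }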

4.5. Hardware Support for Extracting More Parallelism
Techniques like loop unrolling work well when branch behaviour can be predicted at compile time.
If not, we need more advanced techniques:
–Conditional instructions
–Hardware support for compiler speculation

Conditional or Predicated Instructions
Instructions have associated conditions:
–If the condition is true, execution proceeds normally
–If not, the instruction becomes a no-op
Removes control hazards. Example: if (a == 0) b = c;
Branching version:
        bnez  %r8, L1         ; skip the move if a (%r8) is non-zero
        nop                   ; branch delay slot
        mov   %r1, %r2        ; b = c (unconditional move)
    L1: ...
Conditional-move version:
        cmovz %r8, %r1, %r2   ; b = c only if a (%r8) == 0
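At the source level this is the pattern a compiler can lower to a conditional move instead of a branch (a hedged note; whether it does depends on the compiler and target):

    /* Equivalent to: if (a == 0) b = c;  written as a value select,
     * a natural candidate for a single conditional-move instruction. */
    int select_if_zero(int a, int b, int c) {
        return (a == 0) ? c : b;
    }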

Conditional Instructions
Control hazards are effectively replaced by data hazards.
Can be used for speculation:
–The compiler reorders instructions depending on the likely outcome of branches

Limitations on Conditional Instructions
Annulled instructions still consume execution resources:
–But they may occupy otherwise stalled cycles
Most useful when conditions can be evaluated early.
Limited usefulness for complex conditions.
May be slower than unconditional operations.

Conditional Instructions in Practice
    Machine               Conditional instructions
    MIPS, Alpha, SPARC    Conditional move
    HP PA                 Any register-register instruction can annul the following instruction
    IA-64                 Full predication