CS 211: Computer Architecture, Lecture 6, Module 2: Exploiting Instruction Level Parallelism with Software Approaches. Instructor: Morris Lancaster

Global Code Scheduling

- Used where the body of an unrolled loop contains internal control flow
  – Will require moving instructions across branches
- Objective is to compact a code fragment with internal control structure into the shortest possible sequence that preserves the data and control dependences
  – Data dependences impose a partial ordering on the instructions
  – Control dependences dictate which code cannot be moved across control boundaries
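As a concrete illustration (my own sketch, not from the slides; function and variable names are assumptions), this is the kind of loop body global code scheduling targets: after unrolling, each if statement ends a basic block, so a purely local scheduler cannot move work across it.

    /* Hypothetical unrolled loop body with internal control flow.
     * Each if is a basic-block boundary; global code scheduling may
     * move instructions across these branches, a local scheduler cannot. */
    void scale_nonzero(int *a, int n, int k)
    {
        /* assumes n is even, matching the unroll factor of 2 */
        for (int i = 0; i < n; i += 2) {
            if (a[i] != 0)          /* branch inside the unrolled body */
                a[i] *= k;
            if (a[i + 1] != 0)      /* second copy created by unrolling */
                a[i + 1] *= k;
        }
    }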

Figure 4.8: A Common Control Flow (figure not reproduced)

Global Scheduling

- Desire is to move the assignments to B and C earlier in the execution sequence, before the test of A
  – Perform these movements while ensuring that neither the data flow nor the exception behavior is changed
- Exception behavior: compilers will not move certain classes of instructions (those that may raise exceptions) across branches
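A source-level sketch of the situation (a plausible reconstruction in the spirit of Figure 4.8, not the exact figure; the variable names are assumptions):

    /* The scheduler would like to begin the assignments to b[i] and
     * c[i] before, or in parallel with, the test of a[i], provided the
     * data flow and the exception behavior are left unchanged. */
    void fragment(int *a, int *b, int *c, int i, int t1, int t2)
    {
        a[i] = a[i] + b[i];
        if (a[i] == 0) {
            b[i] = t2;          /* infrequent path */
        } else {
            b[i] = t1;          /* frequent path: candidate for hoisting */
        }
        c[i] = t1 + t2;         /* candidate for hoisting above the test */
    }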

Global Scheduling (continued)

- Can be extremely complex
  – Movement of instructions can have impact in many areas
  – Deciding how far to move the code can be complex
  – Trade-offs occur that are difficult to account for
  – Selecting the candidate instructions to move is difficult
- Candidate methods
  – Trace scheduling: focusing on the critical path
  – Superblocks

Trace Scheduling

- Useful for processors with a large number of issues per clock, where simple loop unrolling may not be sufficient to uncover enough ILP to keep the processor busy
- Accepts the cost of moving code onto the less frequent paths in exchange for a faster frequent path
  – Best used where profile information indicates significant differences in frequency between different paths, and where the profile holds independent of the particular inputs to the program
  – Only certain classes of programs meet these criteria

Trace Scheduling (Two Steps)

- Trace selection
  – Tries to find a likely sequence of basic blocks whose operations will be put together into a smaller number of instructions; this sequence is called a trace
  – Loop unrolling generates long traces, since loop branches are taken with high probability
  – Other static (compiler) branch prediction is applied as well
  – Figure 4.9 shows the loop unrolled 4 times
- Trace compaction
  – Squeeze the trace into a smaller number of wide instructions
  – Code scheduling: move operations as early as possible in the trace

Figure 4.9: Unrolled inner loop (figure not reproduced)

Trace Scheduling (continued)

- Branches are viewed as jumps into or out of the selected trace, the selected trace being assumed to be the most probable path
- When code is moved across trace entry and exit points, additional compensation code is generated to handle the alternate conditions (see the sketch below)
- An alternate code sequence is also generated for the off-trace paths
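A minimal sketch of compensation code (my own illustration, not from the slides): an operation is hoisted above a rarely taken trace exit to shorten the frequent path, so the off-trace path must undo the hoisted write.

    /* Hoisting "c = t * 2" above the trace exit speeds the frequent
     * path; the exit path gets compensation code so off-trace code
     * still observes the original value of c. */
    int traced(int a, int b, int c, int rare_cond)
    {
        int t = a + b;
        int c_old = c;       /* save c before the hoisted write */
        c = t * 2;           /* hoisted above the trace exit */
        if (rare_cond) {     /* trace exit (infrequent) */
            c = c_old;       /* compensation on the off-trace path */
            return c + 1;    /* stand-in for the off-trace computation */
        }
        return c;            /* frequent path: rest of the trace */
    }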

Superblocks

- In trace scheduling, entries into and exits from the middle of the trace cause complications: the compiler must generate much additional code to compensate for incorrect branch speculation
- Superblocks are formed similarly to traces, but are a form of extended basic block, restricted to a single entry while allowing multiple exits
  – Only code movement across an exit generates an alternative code set
  – Generated by using tail duplication, as in Figure 4.10 (sketched below)
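A source-level sketch of tail duplication (my own illustration; names are assumptions):

    /* Before: the tail "b[i] = a[i] + k" is reached from both arms of
     * the if, so the frequent path is not a single-entry region. */
    void before(int *a, int *b, int n, int k)
    {
        for (int i = 0; i < n; i++) {
            if (a[i] != 0)
                a[i] = a[i] / 2;    /* frequent arm */
            else
                a[i] = 1;           /* infrequent arm */
            b[i] = a[i] + k;        /* tail: two entry points */
        }
    }

    /* After: the tail is duplicated into the infrequent arm, so the
     * frequent path (test, arm, tail) forms a superblock with a single
     * entry and exits only at the side and bottom. */
    void after(int *a, int *b, int n, int k)
    {
        for (int i = 0; i < n; i++) {
            if (a[i] != 0) {
                a[i] = a[i] / 2;
                b[i] = a[i] + k;    /* tail, original copy */
            } else {
                a[i] = 1;
                b[i] = a[i] + k;    /* tail, duplicated copy */
            }
        }
    }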

Figure 4.10: Superblock formation via tail duplication. The loop body executes n times, with superblock exits at n = 4, 3, 2, and 1 (figure not reproduced).

Hardware Support for Compile-Time Exposure of Parallelism

- Loop unrolling, software pipelining, and trace scheduling work well only if the behavior of branches is predictable at compile time
  – Most predictable are loops (backward branches)
  – Programming style (how if-then-else blocks are coded with respect to the if test) affects these methods
- Techniques to overcome these limitations
  – Conditional or predicated instructions
  – Compiler speculation

Conditional or Predicated Instructions

- An instruction refers to a condition, which is evaluated as part of the instruction's execution
  – If the condition is true, the instruction is executed
  – If the condition is false, the instruction continues as if it were a NOP
  – Many newer architectures have these
- Control dependence is converted to data dependence
- The number of branches can be reduced
- Difficult-to-predict branches can be eliminated
- Conditional move limitations
  – May need to introduce many conditional moves

Example (R1 holds A, R2 holds S, R3 holds T):

    if (A == 0) { S = T; }

    ; normally compiles to:
          BNEZ  R1, L          ; skip the move if A != 0
          ADDU  R2, R3, R0     ; S = T
    L:

    ; with a conditional move the branch is eliminated:
          CMOVZ R2, R3, R1     ; S = T if A (R1) == 0
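For readers more comfortable in C, a sketch of what the conditional move computes (my own illustration): the result has only a data dependence on A, which is exactly what CMOVZ expresses in hardware, and it is the pattern compilers typically lower to a conditional-move instruction.

    /* Branchless C equivalent of the slide's example. */
    int conditional_move(int A, int S, int T)
    {
        return (A == 0) ? T : S;   /* no branch to mispredict */
    }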

Conditional or Predicated Instructions

- Some architectures support full predication, whereby execution of all instructions is controlled by a predicate
  – When the predicate is false, the instruction is a no-op
  – Allows conversion of large blocks of code that are branch dependent (if-conversion, sketched below)
  – Can eliminate non-loop branches
  – Allows movement constrained only by the predicate data
- Correct behavior
  – Predicated code must not generate an exception when the predicate is false
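A sketch of if-conversion in C (my own illustration): both arms become straight-line code guarded by a predicate. On a fully predicated ISA, each statement would carry a predicate register instead of the C conditional expressions shown here.

    /* If-conversion of: if (a > b) { x = a; y = 1; } else { x = b; y = 2; } */
    void if_convert(int a, int b, int *x, int *y)
    {
        int p = (a > b);      /* compute the predicate once */
        *x = p ? a : b;       /* executes "under p" or "under !p" */
        *y = p ? 1 : 2;       /* no branch remains to mispredict */
    }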

Constraints on the Usefulness of Conditional or Predicated Instructions

- Predicated instructions that are annulled still go through fetch and, in most processors, consume some functional-unit execution time
  – Movement across branches slows the program down in cases where the branch would not have been taken
- Predicated instructions are most useful when the predicate can be evaluated early
- The use of conditional instructions is limited when the control flow involves more than a simple alternative sequence
  – Moving an instruction across multiple branches makes it conditional on both branch conditions
- Conditional instructions may have some speed penalty compared with unconditional instructions
- Many architectures include a few simple conditional instructions, but only a few (e.g., IA-64) include conditional versions of most instructions

Compiler Speculation with Hardware Support

Ambitious speculation requires:
- The compiler to find instructions that, with possible use of register renaming, can be speculatively moved without affecting the program data flow (clearly the compiler's job, as we have seen)
- The ability to ignore exceptions in speculated instructions until it is known whether or not the exceptions occur (this information is only available at run time, so it is the hardware's responsibility)
- The ability to speculatively interchange loads and stores, or stores and stores, which may have address conflicts (again only knowable at run time, so it is the hardware's responsibility)
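A sketch (my own illustration) of the first two points: the compiler hoists a guarded load above its branch into a renamed temporary. The hoisted load can fault when i >= n, which is precisely the exception problem the hardware support must handle.

    /* Original: load guarded by a bounds test. */
    long guarded(const long *a, int i, int n, long v)
    {
        if (i < n)
            v = a[i];          /* load executes only when safe */
        return v;
    }

    /* Speculated: the load moves above the test into a renamed temp. */
    long speculated(const long *a, int i, int n, long v)
    {
        long t = a[i];         /* speculative load; may fault if i >= n */
        if (i < n)
            v = t;             /* commit the speculated value */
        return v;
    }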

Hardware Support for Preserving Exception Behavior: 4 Approaches

1. Hardware and OS cooperatively ignore exceptions for speculative instructions
   – Preserves exception behavior for correct programs but not for incorrect ones
   – Generally unacceptable, but has been used; incorrect programs continue running but produce incorrect results
2. Speculative instructions that never raise exceptions are used, with checks introduced to detect when an exception should have occurred
3. A set of status bits, called poison bits, is attached to the result registers; a speculated instruction that causes an exception sets its result register's poison bit (one bit per register)
   – Note that stores are never speculated
4. A mechanism is provided to indicate that an instruction is speculative, and the hardware buffers the instruction result until it is certain that the instruction is no longer speculative
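A toy C model of the poison-bit idea (entirely my own illustration, not from the slides or any real ISA): a speculative load records a deferred exception instead of trapping, and the trap fires only when a non-speculative instruction consumes the poisoned value.

    #include <stdio.h>
    #include <stdlib.h>

    /* A register with an attached poison bit. */
    struct reg { long value; int poison; };

    /* Speculative load: on a "fault" (modeled here as a NULL pointer),
     * set the poison bit instead of raising the exception. */
    struct reg spec_load(const long *addr)
    {
        struct reg r = {0, 0};
        if (addr == NULL)          /* would have faulted */
            r.poison = 1;          /* defer the exception */
        else
            r.value = *addr;
        return r;
    }

    /* Non-speculative use: consuming a poisoned register finally
     * raises the deferred exception. */
    long use(struct reg r)
    {
        if (r.poison) {
            fprintf(stderr, "deferred exception raised\n");
            exit(1);
        }
        return r.value;
    }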

Fallacies and Pitfalls

- Fallacy: there is a simple approach to multiple-issue processors that yields high performance without a significant investment in silicon area or design complexity
  – Significant effort has gone into searching for this fountain of gold
  – What has been exhibited instead is an increasing gap between peak and sustained performance as issue rates increase
  – Compiling for processors with significant amounts of ILP has become complex

Itanium (separate attachment)

SPECint Benchmark (figure not reproduced)

SPECfp: Comparing Itanium, Alpha, and Pentium 4 (figure not reproduced)