Modern Computer Architecture
Lecturer: Prof. Zhang Gang, School of Computer Science, Tianjin University
Contact email: Homework submission email:
2012



Exploiting ILP Using Multiple Issue and Static Scheduling

Multiple-Issue Processors Come in Three Major Flavors
Statically scheduled superscalar processors
– issue varying numbers of instructions per clock
– use in-order execution
Dynamically scheduled superscalar processors
– issue varying numbers of instructions per clock
– use out-of-order execution
VLIW (very long instruction word) processors
– issue a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet

The Basic VLIW Approach
VLIWs use multiple, independent functional units. A VLIW either packages the multiple operations into one very long instruction, or requires that the instructions in the issue packet satisfy the same constraints. There is no fundamental difference between the two approaches.

Case Study: A VLIW Processor
A VLIW processor with instructions that contain five operations:
– one integer operation (or a branch)
– two floating-point operations
– two memory references
An instruction length of between 80 and 120 bits:
– 16 to 24 bits per field => 5*16 = 80 bits to 5*24 = 120 bits wide
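The width arithmetic above can be checked directly; the field count and per-field widths are taken from the slide.

```python
# Width of a 5-operation VLIW instruction for 16-bit and 24-bit fields.
fields = 5
for bits_per_field in (16, 24):
    print(fields * bits_per_field)   # -> 80, then 120
```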

Recall: Unrolled Loop that Minimizes Stalls for Scalar

 1 Loop: L.D    F0,0(R1)
 2       L.D    F6,-8(R1)
 3       L.D    F10,-16(R1)
 4       L.D    F14,-24(R1)
 5       ADD.D  F4,F0,F2
 6       ADD.D  F8,F6,F2
 7       ADD.D  F12,F10,F2
 8       ADD.D  F16,F14,F2
 9       S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,#32
13       BNEZ   R1,LOOP
14       S.D    8(R1),F16   ; 8-32 = -24

14 clock cycles, or 3.5 per iteration
Latencies: L.D to ADD.D: 1 cycle; ADD.D to S.D: 2 cycles

Loop Unrolling in VLIW

Memory ref 1     | Memory ref 2     | FP op 1          | FP op 2          | Int op / branch  | Clock
L.D F0,0(R1)     | L.D F6,-8(R1)    |                  |                  |                  | 1
L.D F10,-16(R1)  | L.D F14,-24(R1)  |                  |                  |                  | 2
L.D F18,-32(R1)  | L.D F22,-40(R1)  | ADD.D F4,F0,F2   | ADD.D F8,F6,F2   |                  | 3
L.D F26,-48(R1)  |                  | ADD.D F12,F10,F2 | ADD.D F16,F14,F2 |                  | 4
                 |                  | ADD.D F20,F18,F2 | ADD.D F24,F22,F2 |                  | 5
S.D 0(R1),F4     | S.D -8(R1),F8    | ADD.D F28,F26,F2 |                  |                  | 6
S.D -16(R1),F12  | S.D -24(R1),F16  |                  |                  |                  | 7
S.D -32(R1),F20  | S.D -40(R1),F24  |                  |                  | DSUBUI R1,R1,#48 | 8
S.D -0(R1),F28   |                  |                  |                  | BNEZ R1,LOOP     | 9

Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: VLIW needs more registers (15 vs. 6 in SS)
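The packing discipline in the table, with fixed slots for two memory references, two FP operations, and one integer/branch operation per instruction, can be sketched as a greedy slot filler. This is a minimal illustration only: the slot names and operation tags are invented here, and real VLIW compilers schedule around data dependences before packing, which this sketch ignores.

```python
# Minimal sketch of VLIW slot packing, assuming a 5-slot format:
# two memory slots, two FP slots, one integer/branch slot.
SLOTS = ("mem", "mem", "fp", "fp", "int")

def pack(ops):
    """Greedily pack (kind, text) operations into 5-slot VLIW words.
    Ignores data dependences -- a real compiler schedules first."""
    words = []
    current = {i: None for i in range(len(SLOTS))}
    for kind, text in ops:
        placed = False
        for i, slot in enumerate(SLOTS):
            if slot == kind and current[i] is None:
                current[i] = text
                placed = True
                break
        if not placed:                      # no free slot of this kind: emit word
            words.append([current[i] for i in range(len(SLOTS))])
            current = {i: None for i in range(len(SLOTS))}
            current[next(i for i, s in enumerate(SLOTS) if s == kind)] = text
    if any(v is not None for v in current.values()):
        words.append([current[i] for i in range(len(SLOTS))])
    return words

ops = [("mem", "L.D F0,0(R1)"), ("mem", "L.D F6,-8(R1)"),
       ("mem", "L.D F10,-16(R1)"),
       ("fp", "ADD.D F4,F0,F2"), ("int", "DSUBUI R1,R1,#24")]
for w in pack(ops):
    print(w)
```

The third load does not fit in the first word's two memory slots, so the packer emits a half-empty word, which is exactly the wasted-encoding problem discussed on the next slide.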

Problems with 1st Generation VLIW
Increase in code size:
– generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
– whenever VLIW instructions are not full, the unused functional units translate to wasted bits in the instruction encoding

Problems with 1st Generation VLIW
Operated in lock-step; no hazard detection hardware:
– a stall in any functional unit pipeline caused the entire processor to stall, since all functional units must be kept synchronized
– the compiler can predict functional-unit latencies, but cache behavior is hard to predict

Problems with 1st Generation VLIW
Binary code compatibility:
– pure VLIW => different numbers of functional units and different unit latencies require different versions of the code

Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
IA-64 instruction set architecture:
– 128 64-bit integer registers + 128 82-bit floating-point registers
– not separate register files per functional unit as in old VLIW
Hardware checks dependencies
Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?

Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
Itanium™ was the first implementation (2001)
– highly parallel and deeply pipelined hardware at 800 MHz
– 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
Itanium 2™ is the name of the 2nd implementation (2005)
– 6-wide, 8-stage pipeline at 1666 MHz on a 0.13 µm process
– caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3

Increasing Instruction Fetch Bandwidth
– Predict the next instruction address and send it out before decoding the instruction
– The PC of a fetched branch is sent to the BTB
– When a match is found, the predicted PC is returned
– If the branch is predicted taken, instruction fetch continues at the predicted PC

Example
On a 2-issue processor:

Loop: LW     R2,0(R1)    ; R2 = array element
      DADDIU R2,R2,#1    ; increment R2
      SW     0(R1),R2    ; store result
      DADDIU R1,R1,#4    ; increment pointer
      BNE    R2,R3,Loop  ; branch if not last element

Assume:
– separate integer functional units for effective address calculation, for ALU operations, and for branch condition evaluation
– up to two instructions of any type can commit per clock

Without speculation, control dependence is the main performance limitation

With speculation, execution can overlap between iterations

Branch Target Buffer (BTB)
To reduce the branch penalty, we must know whether the as-yet-undecoded instruction is a branch and, if so, what the next PC should be.
– This can give a branch penalty of zero.
Branch-target buffer / branch-target cache:
– a branch-prediction cache that stores the predicted address of the next instruction after a branch
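The lookup-then-update behavior described above can be sketched as a small direct-mapped table; the class name, entry count, and field layout below are illustrative, not from a specific processor.

```python
# Minimal branch-target-buffer sketch: a direct-mapped table of
# (branch PC, predicted target) entries, indexed by low PC bits.
class BTB:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries     # each entry: (pc, target)

    def predict(self, pc):
        """Return the predicted next PC, or None on a BTB miss."""
        entry = self.table[pc % self.entries]
        if entry is not None and entry[0] == pc:
            return entry[1]               # hit: fetch continues at target
        return None

    def update(self, pc, target):
        """After a taken branch resolves, record its target."""
        self.table[pc % self.entries] = (pc, target)

btb = BTB()
btb.update(0x40, 0x100)          # taken branch at 0x40 jumped to 0x100
print(hex(btb.predict(0x40)))    # hit  -> 0x100
print(btb.predict(0x44))         # miss -> None
```

The tag check (`entry[0] == pc`) is what makes a zero-penalty prediction safe: a non-branch instruction that aliases into the same slot simply misses.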

Branch Target Buffer (BTB) (figure)

Return Address Predictor
Indirect jumps:
– the destination address varies at run time
– examples: case statements, procedure returns
Procedure returns can be predicted with a branch-target buffer, but the accuracy can be low. Why?
– the procedure may be called from multiple sites
– the calls from one site are not clustered in time (e.g. nested recursion)

Return Address Predictor
How to overcome this problem? A small buffer of return addresses that acts as a stack:
– caches the most recent return addresses
– call: push the return address onto the stack
– return: pop an address off the stack and predict it as the new PC
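The push-on-call, pop-on-return discipline above can be sketched as follows; the class name, fixed depth, and overflow policy (drop the oldest entry) are assumptions for illustration, since real designs vary.

```python
# Minimal return-address-stack sketch: calls push, returns pop.
# A fixed depth models the small hardware buffer.
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:  # buffer full: drop the oldest
            self.stack.pop(0)
        self.stack.append(return_pc)

    def on_return(self):
        """Predicted PC for a return; None if the stack is empty."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x1004)               # call site A
ras.on_call(0x2008)               # nested call site B
print(hex(ras.on_return()))       # -> 0x2008 (innermost return first)
print(hex(ras.on_return()))       # -> 0x1004
```

Because the stack mirrors the program's call nesting, it predicts correctly even under recursion from a single call site, exactly the case where a BTB fails.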

Integrated Instruction Fetch Units
Multiple-issue processors demand multiple instructions per clock. How to meet the demand? An integrated instruction fetch unit is one approach; it combines:
– integrated branch prediction
– instruction prefetch
– instruction memory access and buffering

Integrated Instruction Fetch Units
Integrated branch prediction:
– the branch predictor is part of the instruction fetch unit and is constantly predicting branches
Instruction prefetch:
– the fetch unit prefetches to deliver multiple instructions per clock, integrating prefetch with branch prediction
Instruction memory access and buffering:
– fetching multiple instructions per cycle may require accessing multiple cache blocks (prefetch hides the cost of crossing cache blocks)
– buffering acts as an on-demand unit, providing instructions to the issue stage as needed and in the quantity needed

Value Prediction
Taxonomy of speculative execution (figure)

Why Can We Do Value Prediction?
Several studies have shown that there is significant result redundancy in programs: many instructions perform the same computation and hence produce the same result over and over again. These studies found that, for several benchmarks, more than 75% of the dynamic instructions produce the same result as before.

Value Prediction
Attempts to predict the value produced by an instruction:
– e.g., a load of a value that changes infrequently
Value prediction is useful only if it significantly increases ILP:
– research has focused on loads; results are so-so, and no processor uses value prediction
A related topic is address aliasing prediction:
– RAW for a load and a store, or WAW for two stores
– address alias prediction is both more stable and simpler, since it need not actually predict the address values, only whether such values conflict
– it has been used by a few processors

Pipeline with VP
The predictions are obtained from a hardware table called the Value Prediction Table (VPT). The predicted values are used as inputs by instructions, which can then execute earlier than they could have if they had to wait for their inputs to become available in the traditional way. When the correct values become available (after executing an instruction), the speculated values are verified:
– if a speculation is found to be wrong, the instructions that executed with the wrong inputs are re-executed
– if the speculation is found to be correct, nothing special needs to be done
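One simple instance of a VPT is a last-value predictor: the table remembers the last result each static instruction produced and predicts that it will recur. The indexing, table size, and 2-bit confidence scheme below are illustrative assumptions, not a specific published design.

```python
# Minimal last-value predictor sketch with a 2-bit confidence counter.
class VPT:
    def __init__(self):
        self.table = {}                       # pc -> [last_value, confidence]

    def predict(self, pc):
        """Return a value prediction only when confidence is high."""
        entry = self.table.get(pc)
        if entry and entry[1] >= 2:
            return entry[0]
        return None                           # low confidence: don't speculate

    def update(self, pc, actual):
        """Verify against the actual result once the instruction executes."""
        entry = self.table.setdefault(pc, [actual, 0])
        if entry[0] == actual:
            entry[1] = min(entry[1] + 1, 3)   # same result again: reinforce
        else:
            entry[0], entry[1] = actual, 0    # new value: reset confidence

vpt = VPT()
for _ in range(3):
    vpt.update(0x80, 42)             # instruction keeps producing 42
print(vpt.predict(0x80))             # confident -> 42
vpt.update(0x80, 7)                  # wrong value: dependents would re-execute
print(vpt.predict(0x80))             # confidence reset -> None
```

The confidence counter captures the verification step: a misprediction both forces re-execution of dependent instructions and suppresses further speculation until the value proves stable again.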

Pipeline with VP
The flow of a dependent chain of instructions (I, J, and K) through two different pipelines: (i) a base pipeline (without VP or IR); (ii) a pipeline with VP. We assume the instructions I, J, and K are fetched, decoded, and renamed together.
In the base pipeline, the instructions execute sequentially, since they are data dependent, requiring three cycles to execute;
– the chain is committed by cycle 6.
In the pipeline with VP, the dependence between the instructions is broken by predicting the outputs of I and J (equivalently, the inputs of J and K). This enables the three instructions to execute simultaneously;
– the chain is committed in cycle 4.

Homework 4: Exercise 2.2