
1 Modern Computer Architecture
Lecturer: Prof. Zhang Gang, School of Computer Science, Tianjin University
Contact email: gzhang@tju.edu.cn
Homework submission email: tju_arch@163.com
2012

2 Exploiting ILP Using Multiple Issue and Static Scheduling

3 Multiple-Issue Processors Come in Three Major Flavors
Statically scheduled superscalar processors
–issue varying numbers of instructions per clock
–use in-order execution
Dynamically scheduled superscalar processors
–issue varying numbers of instructions per clock
–use out-of-order execution
VLIW (very long instruction word) processors
–issue a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet

4 The Basic VLIW Approach
VLIWs use multiple, independent functional units
A VLIW either packages the multiple operations into one very long instruction, or requires that the instructions in the issue packet satisfy the same constraints
There is no fundamental difference between the two approaches

5 Case Study: A VLIW Processor
A VLIW processor with instructions that contain five operations
–One integer operation (or a branch)
–Two floating-point operations
–Two memory references
An instruction length of between 80 and 120 bits
–16 to 24 bits per field => 5*16 = 80 bits to 5*24 = 120 bits wide

6 Recall: Unrolled Loop that Minimizes Stalls for Scalar
 1 Loop: L.D    F0,0(R1)
 2       L.D    F6,-8(R1)
 3       L.D    F10,-16(R1)
 4       L.D    F14,-24(R1)
 5       ADD.D  F4,F0,F2
 6       ADD.D  F8,F6,F2
 7       ADD.D  F12,F10,F2
 8       ADD.D  F16,F14,F2
 9       S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,#32
13       BNEZ   R1,LOOP
14       S.D    8(R1),F16    ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
L.D to ADD.D: 1 cycle; ADD.D to S.D: 2 cycles

7 Loop Unrolling in VLIW
Memory ref 1    | Memory ref 2    | FP op 1          | FP op 2          | Int op / branch  | Clock
L.D F0,0(R1)    | L.D F6,-8(R1)   |                  |                  |                  | 1
L.D F10,-16(R1) | L.D F14,-24(R1) |                  |                  |                  | 2
L.D F18,-32(R1) | L.D F22,-40(R1) | ADD.D F4,F0,F2   | ADD.D F8,F6,F2   |                  | 3
L.D F26,-48(R1) |                 | ADD.D F12,F10,F2 | ADD.D F16,F14,F2 |                  | 4
                |                 | ADD.D F20,F18,F2 | ADD.D F24,F22,F2 |                  | 5
S.D 0(R1),F4    | S.D -8(R1),F8   | ADD.D F28,F26,F2 |                  |                  | 6
S.D -16(R1),F12 | S.D -24(R1),F16 |                  |                  |                  | 7
S.D -32(R1),F20 | S.D -40(R1),F24 |                  |                  | DSUBUI R1,R1,#48 | 8
S.D -0(R1),F28  |                 |                  |                  | BNEZ R1,LOOP     | 9
Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: need more registers in VLIW (15 vs. 6 in SS)
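The utilization figures quoted for this schedule can be checked with a few lines of arithmetic; the operation counts below are read directly off the table (7 loads, 7 adds, 7 stores, plus DSUBUI and BNEZ):

```python
# Verify the utilization numbers for the 9-cycle VLIW schedule above.
ops = 7 + 7 + 7 + 2          # 23 operations total in the schedule
cycles = 9                   # schedule length in clocks
slots = 5 * cycles           # 5 issue slots per clock => 45 slots
iterations = 7               # loop unrolled 7 times

ops_per_clock = ops / cycles           # ~2.56, quoted as "2.5 ops per clock"
efficiency = ops / slots               # ~0.51, quoted as "50% efficiency"
clocks_per_iter = cycles / iterations  # ~1.29, quoted as "1.3 clocks per iteration"

print(f"{ops_per_clock:.2f} ops/clock, "
      f"{efficiency:.0%} slot efficiency, "
      f"{clocks_per_iter:.2f} clocks/iteration")
```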

8 Problems with 1st-Generation VLIW
Increase in code size
–generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
–whenever VLIW instructions are not full, unused functional units translate to wasted bits in the instruction encoding

9 Problems with 1st-Generation VLIW
Operated in lock-step; no hazard detection hardware
–a stall in any functional unit pipeline caused the entire processor to stall, since all functional units must be kept synchronized
–the compiler might predict functional unit latencies, but cache misses are hard to predict

10 Problems with 1st-Generation VLIW
Binary code compatibility
–Pure VLIW => different numbers of functional units and unit latencies require different versions of the code

11 Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
IA-64: instruction set architecture
128 64-bit integer regs + 128 82-bit floating-point regs
–Not separate register files per functional unit, as in old VLIW
Hardware checks dependencies
Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?
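The idea behind predicated execution can be shown with a small sketch: both arms of a branch are issued, each guarded by a 1-bit predicate, and the hardware squashes the arm whose predicate is false, so no branch (and no misprediction) occurs. The register and predicate names below are made up for illustration; they are not IA-64 encodings.

```python
# Sketch of if-conversion with predicate registers (an IA-64-style idea).

def run(pred, ops):
    """Execute each (predicate_index, dest, value) op only if its
    guarding predicate bit is set; squashed ops have no effect."""
    regs = {}
    for p, dest, value in ops:
        if pred[p]:              # hardware squashes the op when pred[p] is 0
            regs[dest] = value
    return regs

# "if (a < b) r1 = a; else r1 = b;" becomes a compare that sets a
# predicate pair, followed by two predicated moves -- no branch at all:
a, b = 3, 7
pred = [a < b, not (a < b)]      # cmp writes p0 and its complement p1
regs = run(pred, [(0, "r1", a), (1, "r1", b)])
print(regs["r1"])                # -> 3
```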

12 Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
Itanium™ was the first implementation (2001)
–Highly parallel and deeply pipelined hardware
–6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
Itanium 2™ is the name of the 2nd implementation (2005)
–6-wide, 8-stage pipeline at 1666 MHz on a 0.13 µm process
–Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3

13 Increasing Instruction Fetch Bandwidth
Predict the next instruction address and send it out before decoding the instruction
The PC of the branch is sent to the BTB
When a match is found, the Predicted PC is returned
If the branch is predicted taken, instruction fetch continues at the Predicted PC

14 Example
On a 2-issue processor:
Loop: LW     R2,0(R1)    ; R2 = array element
      DADDIU R2,R2,#1    ; increment R2
      SW     0(R1),R2    ; store result
      DADDIU R1,R1,#4    ; increment pointer
      BNE    R2,R3,Loop  ; branch if not last element
Assume
–separate integer functional units for effective address calculation, ALU operations, and branch condition evaluation
–up to two instructions of any type can commit per clock

15 Without speculation, control dependency is the main performance limitation

16 With speculation, execution can overlap across loop iterations

17 Branch Target Buffer (BTB)
To reduce the branch penalty, we must know whether the as-yet-undecoded instruction is a branch and, if so, what the next PC should be
–Then we can have a branch penalty of zero
Branch-target buffer / branch-target cache
–A branch-prediction cache that stores the predicted address of the next instruction after a branch

18 Branch Target Buffer (BTB)
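A minimal sketch of the BTB lookup described above, assuming a direct-mapped table indexed by low PC bits; the table size and field layout are illustrative choices, not taken from any real design:

```python
# Branch-target buffer sketch: fetch-stage lookup, resolve-stage update.

class BTB:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = {}          # index -> (tag, predicted_target)

    def lookup(self, pc):
        """During fetch: return the predicted next PC, or None on a miss
        (a miss means the instruction is not a known taken branch)."""
        idx = pc % self.entries
        entry = self.table.get(idx)
        if entry and entry[0] == pc:     # tag match -> predicted taken
            return entry[1]
        return None                      # fall through: fetch PC + 4

    def update(self, pc, target):
        """After a taken branch resolves, record its target address."""
        self.table[pc % self.entries] = (pc, target)

btb = BTB()
btb.update(0x40, 0x100)          # a taken branch at 0x40 jumps to 0x100
print(hex(btb.lookup(0x40)))     # -> 0x100
print(btb.lookup(0x44))          # -> None (not a branch we have seen)
```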

19 Return Address Predictor
Indirect jumps
–Destination address varies at run time
–Examples: case statements, procedure returns
Procedure returns can be predicted with a branch-target buffer, but the accuracy can be low. Why?
–The procedure may be called from multiple sites
–The calls from one site are not clustered in time (e.g. nested recursion)

20 Return Address Predictor
How to overcome this problem? A small buffer of return addresses acting as a stack
–Caches the most recent return addresses
–Call: push the return address on the stack
–Return: pop an address off the stack and predict it as the new PC
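The push/pop behavior above can be sketched directly; the stack depth and the choice to discard the oldest entry on overflow are illustrative assumptions:

```python
# Return-address stack (RAS) sketch: calls push, returns pop-and-predict.

class ReturnAddressStack:
    def __init__(self, depth=8):
        self.stack = []
        self.depth = depth

    def push(self, return_addr):
        """On a call: remember where this procedure should return to."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)         # overflow: drop the oldest entry
        self.stack.append(return_addr)

    def predict_return(self):
        """On a return: pop the top entry and predict it as the new PC."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.push(0x400)                       # call from site A
ras.push(0x800)                       # nested call from site B
print(hex(ras.predict_return()))      # -> 0x800 (innermost return first)
print(hex(ras.predict_return()))      # -> 0x400
```

Because the stack mirrors the call nesting, it predicts returns correctly even when a procedure is called from many different sites, which is exactly where a BTB entry goes stale.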

21 Integrated Instruction Fetch Units
Multiple-issue processors demand multiple instructions per clock. How to meet the demand?
An integrated instruction fetch unit is one approach; it combines:
–Integrated branch prediction
–Instruction prefetch
–Instruction memory access and buffering

22 Integrated Instruction Fetch Units
Integrated branch prediction
–the branch predictor is part of the instruction fetch unit and is constantly predicting branches
Instruction prefetch
–the fetch unit prefetches to deliver multiple instructions per clock, integrating prefetch with branch prediction
Instruction memory access and buffering
–fetching multiple instructions per cycle may require accessing multiple cache blocks (prefetch hides the cost of crossing cache blocks)
–buffering lets the fetch unit act as an on-demand unit, providing instructions to the issue stage as needed and in the quantity needed
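The buffering role described above can be sketched as a simple queue that prefetch fills in cache-block-sized chunks and the issue stage drains on demand; the block size and instruction labels are illustrative:

```python
# Fetch-buffer sketch: prefetch fills ahead of demand, issue drains.
from collections import deque

class FetchBuffer:
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.buffer = deque()
        self.next_pc = 0

    def prefetch(self):
        """Fetch one cache block worth of instructions ahead of demand."""
        for _ in range(self.block_size):
            self.buffer.append(f"insn@{self.next_pc:#x}")
            self.next_pc += 4

    def supply(self, n):
        """The issue stage pulls up to n instructions per clock; the
        buffer refills itself (possibly crossing cache blocks) as needed."""
        while len(self.buffer) < n:
            self.prefetch()
        return [self.buffer.popleft() for _ in range(n)]

fb = FetchBuffer()
print(fb.supply(2))   # -> ['insn@0x0', 'insn@0x4']
print(fb.supply(3))   # -> ['insn@0x8', 'insn@0xc', 'insn@0x10']
```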

23 Value Prediction: a taxonomy of speculative execution

24 Why Can We Do Value Prediction?
Several recent studies have shown that there is significant result redundancy in programs, i.e., many instructions perform the same computation and, hence, produce the same result over and over again. These studies have found that for several benchmarks more than 75% of the dynamic instructions produce the same result as before.

25 Value Prediction
Attempts to predict the value produced by an instruction
–E.g., a load of a value that changes infrequently
Value prediction is useful only if it significantly increases ILP
–Research has focused on loads; results are so-so, and no processor uses value prediction
A related topic is address aliasing prediction
–RAW for a load and a store, or WAW for 2 stores
–Address alias prediction is both more stable and simpler, since it need not actually predict the address values, only whether such values conflict
–Has been used by a few processors

26 Pipeline with VP
The predictions are obtained from a hardware table, called the Value Prediction Table (VPT)
These predicted values are used as inputs by instructions, which can then execute earlier than they could have if they had to wait for their inputs to become available in the traditional way
When the correct values become available (after executing an instruction), the speculated values are verified
–if a speculation is found to be wrong, the instructions which executed with the wrong inputs are re-executed
–if the speculation is found to be correct, then nothing special needs to be done
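A minimal VPT sketch, assuming the simplest policy (a last-value predictor: reuse each instruction's previous result as the prediction); the indexing by PC and the table organization are illustrative:

```python
# Last-value predictor sketch for the Value Prediction Table (VPT).

class ValuePredictionTable:
    def __init__(self):
        self.last_value = {}      # instruction PC -> last produced result

    def predict(self, pc):
        """At issue: return a predicted result for this instruction,
        or None if no prediction is available yet."""
        return self.last_value.get(pc)

    def verify(self, pc, actual):
        """After execution: check the speculation and update the table.
        True  -> prediction was correct, nothing special to do.
        False -> dependent instructions must be re-executed."""
        ok = self.last_value.get(pc) == actual
        self.last_value[pc] = actual
        return ok

vpt = ValuePredictionTable()
print(vpt.verify(0x10, 42))   # -> False (no prediction existed yet)
print(vpt.predict(0x10))      # -> 42
print(vpt.verify(0x10, 42))   # -> True (instruction is result-redundant)
```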

27 Pipeline with VP
The flow of a dependent chain of instructions (I, J, and K) through two different pipelines: (i) a base pipeline (without VP or IR); (ii) a pipeline with VP
We assume the instructions I, J, and K are fetched, decoded, and renamed together
In the base pipeline, the instructions execute sequentially, since they are data dependent, requiring three cycles to execute
–the chain is committed by cycle 6
In the pipeline with VP, the dependence between instructions is broken by predicting the outputs of I and J (alternately, the inputs of J and K). This enables the three instructions to execute simultaneously
–the chain is committed in cycle 4
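The two commit times can be reproduced with a toy timing model. The stage counts below (2 front-end cycles for fetch/decode/rename, 1 cycle per execute step, 1 commit cycle) are assumptions chosen to match the cycle numbers quoted above, not a description of any real pipeline:

```python
# Toy timing model for the I/J/K dependent chain example.

FRONT_END = 2   # assumed cycles in fetch/decode/rename before execute

def commit_cycle(chain_len, value_prediction):
    """Cycle in which the last instruction of a dependent chain commits."""
    if value_prediction:
        execute_cycles = 1            # dependences broken: all run in parallel
    else:
        execute_cycles = chain_len    # chain executes one instruction per cycle
    return FRONT_END + execute_cycles + 1   # +1 cycle to commit

print(commit_cycle(3, value_prediction=False))  # -> 6 (base pipeline)
print(commit_cycle(3, value_prediction=True))   # -> 4 (pipeline with VP)
```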

28 Homework 4: Exercise 2.2

