
1 计算机系统结构 (Computer Architecture). Instructor: Prof. Zhang Gang (张钢). Assignment mailbox: tju_arch@163.com. Email: gzhang@tju.edu.cn. 2015.

2 William Stallings, Computer Organization and Architecture, 7th Edition. Chapter 14: Instruction-Level Parallelism and Superscalar Processors

3 What is Superscalar?
The term superscalar refers to a machine designed to improve the performance of the execution of scalar instructions.
A superscalar processor contains multiple independent instruction pipelines.
Each pipeline consists of multiple stages, so each pipeline can handle multiple instructions at a time.
The multiple pipelines introduce a new level of parallelism, enabling multiple streams of instructions to be processed at a time.

4 What is Superscalar?
A superscalar processor fetches multiple instructions at a time.
It attempts to find nearby instructions that are independent of one another and can be executed in parallel.
The essence of the superscalar approach is the ability to execute instructions independently in different pipelines.
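A minimal C sketch (my own illustration, not from the slides; the function name is hypothetical) of the kind of source code whose compiled instructions are independent and could be issued to different pipelines in the same cycle:

```c
/* The two statements read and write disjoint locations (restrict rules out
 * aliasing), so the corresponding load/add/store instructions carry no data
 * dependency and a superscalar processor may execute them in parallel. */
void independent_updates(int *restrict x, int *restrict y, int a, int b) {
    *x += a;   /* touches only *x and a */
    *y += b;   /* touches only *y and b: independent of the line above */
}
```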

5 What is Superscalar?
Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently in a superscalar processor
Equally applicable to RISC & CISC
In practice usually RISC

6 Why Superscalar?
Most operations are on scalar quantities (see RISC notes)
Improve these operations to get an overall improvement

7 General Superscalar Organization
Two integer, two floating-point, and one memory (either load or store) operation can be executing at the same time.

8 Superpipelined
Many pipeline stages need less than half a clock cycle.
Doubling the internal clock speed therefore gets two tasks done per external clock cycle.

9 Superscalar v Superpipeline

10 Superscalar v Superpipeline: base machine

11 Superscalar v Superpipeline: superpipelined

12 Superscalar v Superpipeline: superscalar
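As a rough, illustrative steady-state comparison of the three figures above (my own numbers, ignoring stalls and pipeline-fill time):

```latex
% Illustrative steady-state throughput, ignoring stalls and pipeline fill
\[
\begin{array}{lcl}
\text{Base machine}              & : & \approx N  \text{ instructions completed after } N \text{ cycles} \\
\text{Superpipelined (degree 2)} & : & \approx 2N \text{ (a new instruction starts every half cycle)} \\
\text{Superscalar (degree 2)}    & : & \approx 2N \text{ (two instructions issued per cycle)}
\end{array}
\]
```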

13 Limitations
Instruction-level parallelism
—the degree to which instructions can be executed in parallel
—exploited through compiler-based optimization and hardware techniques
Limited by
—True data dependency
—Procedural dependency
—Resource conflicts
—Output dependency
—Antidependency

14 True Data Dependency
ADD r1, r2   (r1 := r1 + r2)
MOVE r3, r1  (r3 := r1)
Can fetch and decode the second instruction in parallel with the first
Can NOT execute the second instruction until the first has finished
Also called flow dependency or write-read dependency
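A tiny C sketch of the same read-after-write (flow) dependency at the source level (my own illustration; the variable names simply mirror the registers above):

```c
/* r3 cannot be computed until r1 holds the sum: a true (flow) dependency. */
int flow_dependency(int r1, int r2) {
    r1 = r1 + r2;    /* ADD r1, r2  : writes r1                         */
    int r3 = r1;     /* MOVE r3, r1 : reads r1, must wait for the write */
    return r3;
}
```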

15 True Data Dependency

16 Procedural Dependency
The instructions following a branch cannot be executed until the branch is executed
Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed
This prevents simultaneous fetches
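An illustrative C fragment (my own, hypothetical names): neither assignment can be executed until the compiled branch has resolved, which is the procedural dependency described above:

```c
/* Instructions after the branch depend procedurally on its outcome. */
int procedural_dependency(int a, int b) {
    int r;
    if (a < b)         /* conditional branch                           */
        r = a + 1;     /* fetched and executed only on the taken path  */
    else
        r = b - 1;     /* fetched and executed only on the other path  */
    return r;
}
```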

17 Procedural Dependency

18 Resource Conflict
Resources
—memories, caches, buses, register-file ports, functional units
Two or more instructions requiring access to the same resource at the same time
—e.g. two arithmetic instructions needing the same arithmetic unit
Resources can be duplicated
—e.g. have two arithmetic units
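A small C sketch (illustrative only): the two products below are data-independent, yet on a machine with a single multiplier they still cannot execute in the same cycle; that is a resource conflict rather than a data dependency:

```c
/* No data dependency between the two statements, but both need a multiplier. */
void resource_conflict(int *restrict p, int *restrict q, int a, int b) {
    *p = a * a;   /* needs the multiply unit                                    */
    *q = b * b;   /* also needs the multiply unit: conflicts if only one exists */
}
```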

19 Resource Conflict

20 Effect of Dependencies

21 Design Issues
Instruction-level parallelism
—instructions in a sequence are independent
—execution can be overlapped
—governed by data and procedural dependency
Machine parallelism
—the ability of the processor to take advantage of instruction-level parallelism (a measure of how much support the processor provides for it)
—governed by the number of parallel pipelines
E.g.
—Load  R1 ← R2(23)      Add   R3 ← R3, "1"
—Add   R3 ← R3, "1"     Add   R4 ← R3, R2
—Add   R4 ← R4, R2      Store [R4] ← R0
—(left column: independent, can be executed in parallel; right column: not independent, cannot be executed in parallel)

22 Instruction Issue Policy
The order in which instructions are fetched
The order in which instructions are executed
The order in which instructions change the contents of registers and memory

23 In-Order Issue, In-Order Completion
Issue instructions in the order they occur
Not very efficient
May fetch more than one instruction
Instructions must stall if necessary

24 In-Order Issue, In-Order Completion (Diagram)

25 In-Order Issue, In-Order Completion (Diagram)
[Pipeline timing diagram: fetch stages f1/f2, decode stages d1/d2, execute units a1/a2 and e1/e2, write-back stages s1/s2]

26 In-Order Issue, Out-of-Order Completion
Output dependency
—R3 := R3 + R5;  (I1)
—R4 := R3 + 1;   (I2)
—R3 := R5 + 1;   (I3)
—I2 depends on the result of I1: a true data dependency
—If I3 completes before I1, I1 will later overwrite R3 and leave the wrong value there: an output (write-write) dependency
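The same write-after-write hazard sketched in C (my own illustration; the names mirror the registers above):

```c
/* I1 and I3 both write r3. If I3 were allowed to complete before I1,
 * I1's later write would leave a stale value in r3 (write-after-write). */
int output_dependency(int r3, int r5) {
    r3 = r3 + r5;       /* I1: writes r3                                   */
    int r4 = r3 + 1;    /* I2: true dependency on I1                       */
    r3 = r5 + 1;        /* I3: writes r3 again                             */
    return r3 + r4;     /* correct only if the writes finish in program order */
}
```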

27 In-Order Issue, Out-of-Order Completion (Diagram)

28 Out-of-Order Issue, Out-of-Order Completion
Decouple the decode pipeline from the execution pipeline
The processor can continue to fetch and decode until this buffer (the instruction window) is full
When a functional unit becomes available, an instruction can be executed
Since instructions have been decoded, the processor can look ahead
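A compact C sketch of the idea (my own simplification with hypothetical structures; renaming, memory dependencies and completion are omitted): decoded instructions wait in a window, and in each cycle any entry whose source registers are ready and whose functional unit is free may issue, regardless of program order.

```c
#include <stdbool.h>
#include <stdio.h>

#define NREGS 8

typedef struct {
    int  dst, src1, src2;   /* register numbers                 */
    int  unit;              /* functional unit this entry needs */
    bool issued;
} Instr;

/* One scheduling cycle over the instruction window: issue every not-yet-issued
 * entry whose source registers are ready and whose functional unit is still
 * free this cycle. (Completion, which would mark dst ready again, is omitted.) */
void issue_cycle(Instr win[], int n, bool reg_ready[], bool unit_busy[]) {
    for (int i = 0; i < n; i++) {
        if (win[i].issued)
            continue;
        if (reg_ready[win[i].src1] && reg_ready[win[i].src2] && !unit_busy[win[i].unit]) {
            win[i].issued = true;
            unit_busy[win[i].unit] = true;   /* unit occupied for this cycle */
            reg_ready[win[i].dst] = false;   /* result not ready yet         */
            printf("issued window entry %d\n", i);
        }
    }
}

int main(void) {
    bool reg_ready[NREGS];
    bool unit_busy[2] = { false, false };
    for (int r = 0; r < NREGS; r++)
        reg_ready[r] = true;

    Instr win[3] = {
        { 1, 2, 3, 0, false },   /* I0: r1 = r2 op r3  (unit 0)              */
        { 4, 1, 5, 0, false },   /* I1: r4 = r1 op r5  (unit 0, needs I0)    */
        { 6, 7, 2, 1, false },   /* I2: r6 = r7 op r2  (unit 1, independent) */
    };
    issue_cycle(win, 3, reg_ready, unit_busy);   /* issues I0 and I2; I1 waits */
    return 0;
}
```

Running the single cycle issues entries 0 and 2 and leaves entry 1 waiting, i.e. the independent third instruction overtakes the dependent second one.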

29 Out-of-Order Issue, Out-of-Order Completion (Diagram)

30 Out-of-Order Issue, Out-of-Order Completion (Diagram)
[Pipeline timing diagram: decode stages d1/d2/d3, a lookahead instruction window, execute units a1/a2 and e1/e2, write-back stages s1/s2]

31 Antidependency (read-write dependency)
—R3 := R3 + R5;  (I1)
—R4 := R3 + 1;   (I2)
—R3 := R5 + 1;   (I3)
—R7 := R3 + R4;  (I4)
—I3 cannot complete before I2 starts, because I2 needs the old value in R3 and I3 overwrites R3
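The corresponding write-after-read situation in C (my own illustration, mirroring the registers above):

```c
/* I2 must read r3 before I3 overwrites it: an anti (write-after-read) dependency. */
int antidependency(int r3, int r5) {
    r3 = r3 + r5;       /* I1                                              */
    int r4 = r3 + 1;    /* I2: reads r3                                    */
    r3 = r5 + 1;        /* I3: writes r3, must not finish before I2's read */
    int r7 = r3 + r4;   /* I4                                              */
    return r7;
}
```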

32 Register Renaming
Output dependencies and antidependencies occur because register contents may not reflect the correct ordering of values from the program
They may result in a pipeline stall
Solution: registers are allocated dynamically by the hardware
—i.e. registers are not specifically named; the hardware maps logical registers onto physical ones

33 Register Renaming Example
—R3b := R3a + R5a  (I1)
—R4b := R3b + 1    (I2)
—R3c := R5a + 1    (I3)
—R7b := R3c + R4b  (I4)
Without a subscript, the name refers to the logical register in the instruction
With a subscript, it refers to the hardware register that has been allocated
Note that R3a, R3b and R3c are different physical registers
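A minimal C sketch of the renaming step itself (my own simplification, hypothetical data structures, no recycling of physical registers): each source operand is translated through a map table and each destination receives a fresh physical register, which is exactly what turns R3 into R3a, R3b and R3c above.

```c
#include <stdio.h>

#define NLOGICAL 8

static int map[NLOGICAL];            /* logical register -> current physical register */
static int next_free = NLOGICAL;     /* naive allocator: always hand out a fresh one   */

/* Rename one instruction "dst := src1 op src2". Sources go through the current
 * mapping; the destination gets a brand-new physical register, so a later write
 * to the same logical register can no longer conflict with earlier uses.       */
void rename_instr(int dst, int src1, int src2) {
    int p1 = map[src1], p2 = map[src2];
    int pd = next_free++;
    map[dst] = pd;
    printf("P%d := P%d op P%d\n", pd, p1, p2);
}

int main(void) {
    for (int r = 0; r < NLOGICAL; r++) map[r] = r;   /* initial identity mapping */

    rename_instr(3, 3, 5);   /* I1: R3 := R3 + R5  ->  R3b := R3a + R5a            */
    rename_instr(4, 3, 0);   /* I2: R4 := R3 + 1   (constant modeled as R0 here)   */
    rename_instr(3, 5, 0);   /* I3: R3 := R5 + 1   ->  R3c := R5a + 1              */
    rename_instr(7, 3, 4);   /* I4: R7 := R3 + R4  ->  R7b := R3c + R4b            */
    return 0;
}
```

With the identity mapping, the four calls print P8 := P3 op P5, P9 := P8 op P0, P10 := P5 op P0 and P11 := P10 op P9, reproducing the R3a/R3b/R3c separation in the slide.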

34 Background to IA-64
Pentium 4 appears to be the last in the x86 line
Jointly developed by Intel & Hewlett-Packard (HP)
New architecture
—64-bit architecture
—not an extension of x86
—not an adaptation of the HP 64-bit RISC architecture
Exploits vast circuitry and high speeds
Systematic use of parallelism
A departure from superscalar

35 Basic Concepts
Instruction-level parallelism
—explicit in the machine instruction
—not determined at run time by the processor
Long or very long instruction words (LIW/VLIW)
Branch predication (not the same as branch prediction)
Speculative loading

36 Basic Concepts
Intel & HP call this approach Explicitly Parallel Instruction Computing (EPIC)
IA-64 is an instruction set architecture intended for implementation on EPIC processors
Merced was the first Intel product based on IA-64
—released as Itanium in June 2001
—Itanium is the official name of Intel's Merced processor

37 Superscalar v IA-64

38 Why New Architecture?
Not hardware compatible with x86
Tens of millions of transistors are now available on a chip
—in the late 1970s there were tens of thousands of transistors
—in the mid-1980s there were hundreds of thousands of transistors

39 Why New Architecture?
How to use this glut of transistors?
Could build a bigger cache
—but there is a point of diminishing returns
Could add more execution units to increase superscaling
—this hits the "complexity wall":
—more units make the processor "wider"
—more logic is needed to orchestrate them
—improved branch prediction is required
—longer pipelines are required
—so there is a greater penalty for misprediction
—a larger number of renaming registers is required
—in practice, at most four instructions per cycle

40 Explicit Parallelism
Instruction parallelism is scheduled at compile time
—and included with the machine instructions
The processor uses this information to perform parallel execution
Requires less complex circuitry
The compiler has much more time to determine possible parallel operations
The compiler sees the whole program

41 General Organization

42 IA-64 Register Set

43 Key Features
Large number of registers
—the IA-64 instruction format assumes 256
–128 × 64-bit integer, logical & general-purpose registers
–128 × 82-bit floating-point and graphics registers
—plus 64 × 1-bit predicate registers for predicated execution (see later)
—to support a high degree of parallelism
Multiple execution units
—expected to be 8 or more
—depends on the number of transistors available
—execution of parallel instructions depends on the hardware available
–e.g. 8 parallel instructions may be split into two lots of four if only four execution units are available

44 IA-64 Execution Units
I-unit
—integer arithmetic
—shift and add
—logical operations
—compare
—integer multimedia operations
M-unit
—load and store
–between register and memory
—some integer ALU operations
B-unit
—branch instructions
F-unit
—floating-point instructions

45 Instruction Format
128-bit bundle
—holds three instructions (syllables) plus a template
—can fetch one or more bundles at a time
—the template contains information on which instructions can be executed in parallel
–not confined to a single bundle; e.g. a stream of 8 instructions may be executed in parallel
–the compiler will have reordered instructions to form contiguous bundles
–can mix dependent and independent instructions in the same bundle
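A hedged C sketch of the bundle layout: the 5-bit template plus three 41-bit instruction slots is the documented IA-64 split, but the exact bit positions and the helper functions below are my own illustration rather than a reference decoder.

```c
#include <stdint.h>

/* One 128-bit IA-64 bundle modeled as two 64-bit words: a 5-bit template
 * (bits 0..4) followed by three 41-bit instruction slots.                 */
typedef struct {
    uint64_t lo;   /* bits  0..63  */
    uint64_t hi;   /* bits 64..127 */
} Bundle;

static inline uint32_t bundle_template(const Bundle *b) {
    return (uint32_t)(b->lo & 0x1F);                  /* bits 0..4 */
}

static inline uint64_t bundle_slot(const Bundle *b, int slot) {
    /* slot 0: bits 5..45, slot 1: bits 46..86, slot 2: bits 87..127 */
    unsigned start = 5u + 41u * (unsigned)slot;
    uint64_t mask  = (1ULL << 41) - 1;
    if (start + 41u <= 64u)                           /* entirely in lo */
        return (b->lo >> start) & mask;
    if (start >= 64u)                                 /* entirely in hi */
        return (b->hi >> (start - 64u)) & mask;
    /* the slot straddles the 64-bit boundary */
    unsigned low_bits = 64u - start;
    return (b->lo >> start) | ((b->hi & ((1ULL << (41u - low_bits)) - 1)) << low_bits);
}
```

A decoder would then use the template value to decide which of the three slots may be dispatched in parallel, which is the role the slide describes.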

46 Predication

47 Example of Predication

Source code:
    if (a && b)        /* a is true and b is true */
        j = j + 1;
    else if (c)
        k = k + 1;
    else
        k = k - 1;
    i = i + 1;

Assembly code:
        Beq  a, 0, L1
        Beq  b, 0, L1
        Add  j, j, 1
        Jump L3
    L1: Beq  c, 0, L2
        Add  k, k, 1
        Jump L3
    L2: Sub  k, k, 1
    L3: Add  i, i, 1

48 Example of Predication
[Diagram: control-flow graph of the assembly code above, with the branch outcomes labeled T and F on the edges]

49 Predicated Code
    P1,P2 = cmp(a == 0)
    P1,P3 = cmp(b == 0)
    add j, j, 1
    P4,P5 = cmp(c != 0)
    add k, k, 1
    sub k, k, 1
    add i, i, 1

50 Example of Predication
[Diagram: the same control-flow graph with predicates P1–P5 attached to the branch outcomes (T/F edges)]
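A small C sketch (my own illustration, hypothetical names) of what the predicated code above amounts to: the control dependence is turned into predicate values, both arms are considered, and the predicates select which updates actually take effect:

```c
#include <stdbool.h>

/* If-converted form of the slides' example: predicate flags p1..p5 replace
 * the branches and guard which of the updates are applied.                */
void predicated(int a, int b, int c, int *i, int *j, int *k) {
    bool p2 = (a != 0);              /* a is true                        */
    bool p3 = p2 && (b != 0);        /* a && b: guards the then-branch   */
    bool p1 = !p3;                   /* !(a && b): guards the else side  */
    bool p4 = p1 && (c != 0);        /* else if (c)                      */
    bool p5 = p1 && (c == 0);        /* final else                       */

    *j += (int)p3;                   /* add j,j,1 only when p3 holds     */
    *k += (int)p4 - (int)p5;         /* k+1 when p4, k-1 when p5         */
    *i += 1;                         /* unconditional add i,i,1          */
}
```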

51 Speculative Loading

52 Speculative Loading

Assembly code:
        R1 = &b[j]
        Ld   R2, (R1)
        Bne  R2, 1, L2
        R3 = &a[i+j]
        Ld   R4, (R3)
        Bne  R4, 1, L2
        R5 = &c[i-j+7]
        Ld   R6, (R5)
        Bne  R6, 1, L2
    L1:
    L2:

Speculation code:
        R1 = &b[j]
        R3 = &a[i+j]
        R5 = &c[i-j+7]
        Ld   R2, (R1)
        Ld.s R4, (R3)
        Ld.s R6, (R5)
        P1,P2 = cmp(R2==1)
        br L2
        Chk.s R4
        P3,P4 = cmp(R4==1)
        br L2
        Chk.s R6
        P5,P6 = cmp(R6==1)
        br L2
    L1:
    L2:
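A rough C model of the idea behind Ld.s / Chk.s (my own illustration with hypothetical types and names): the load is hoisted above the branch that guards it, a fault is only recorded as a deferred flag, and the check turns the deferred fault into real recovery only if the value is actually needed.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  value;
    bool deferred_fault;   /* IA-64 would mark the target register as deferred (NaT) */
} SpecReg;

/* Ld.s analogue: load early; if the access would fault, record a deferred
 * fault instead of raising an exception immediately. */
SpecReg speculative_load(const int *addr) {
    SpecReg r = { 0, false };
    if (addr == NULL)              /* stand-in for "this access would fault" */
        r.deferred_fault = true;
    else
        r.value = *addr;
    return r;
}

/* Chk.s analogue: placed on the path where the value is really used; only
 * then does a deferred fault trigger recovery (here: redo the load safely). */
int check_speculation(SpecReg r, const int *addr) {
    if (r.deferred_fault)
        return (addr != NULL) ? *addr : 0;   /* recovery code */
    return r.value;
}
```

The point, as with the Ld.s/Chk.s pair above, is that hoisting the load does not change when a faulting access is actually reported.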

53 Homework: Problems 14.5 and 14.6


