1 Computer Architecture (计算机系统结构). Lecturer: Prof. Zhang Gang (张钢). Assignment mailbox: tju_arch@163.com. Email: gzhang@tju.edu.cn. 2015
2 William Stallings, Computer Organization and Architecture, 7th Edition. Chapter 14: Instruction-Level Parallelism and Superscalar Processors
3 What is Superscalar? The term superscalar refers to a machine designed to improve the performance of scalar instruction execution. A superscalar processor contains multiple independent instruction pipelines. Each pipeline consists of multiple stages and can handle multiple instructions at a time. Multiple pipelines introduce a further level of parallelism, enabling multiple streams of instructions to be processed at a time.
4 What is Superscalar? A superscalar processor fetches multiple instructions at a time and attempts to find nearby instructions that are independent of one another and can be executed in parallel. The essence of the superscalar approach is the ability to execute instructions independently in different pipelines.
5 What is Superscalar? Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently in a superscalar processor Equally applicable to RISC & CISC In practice usually RISC
6 Why Superscalar? Most operations are on scalar quantities (see RISC notes) Improve these operations to get an overall improvement
7 General Superscalar Organization Two integer, two floating-point, and one memory (either load or store) operations can be executing at the same time.
8 Superpipelined Many pipeline stages need less than half a clock cycle Double internal clock speed gets two tasks per external clock cycle
9 Superscalar v Superpipeline
10 Superscalar v Superpipeline Base machine
11 Superscalar v Superpipeline Superpipeline
12 Superscalar v Superpipeline Superscalar
13 Limitations Instruction-level parallelism —The degree to which instructions can be executed in parallel —Compiler-based optimization —Hardware techniques Limited by —True data dependency —Procedural dependency —Resource conflicts —Output dependency —Antidependency
14 True Data Dependency ADD r1, r2 (r1 := r1 + r2) MOVE r3, r1 (r3 := r1) The second instruction can be fetched and decoded in parallel with the first, but it cannot be executed until the first has finished. Also called flow dependency or write-read (read-after-write, RAW) dependency
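The stall condition on this slide can be sketched as a small dependency check: an instruction has a true data dependency on an earlier one exactly when it reads a register the earlier instruction writes. This is an illustrative Python model (the function and tuple layout are assumptions, not part of the slides):

```python
# Minimal model of a true (RAW) data dependency check.
# An instruction is modeled as (dest, sources).
def raw_dependent(earlier, later):
    """True if `later` reads the register that `earlier` writes."""
    dest, _ = earlier
    _, sources = later
    return dest in sources

add = ("r1", ["r1", "r2"])   # ADD r1, r2   -> r1 := r1 + r2
move = ("r3", ["r1"])        # MOVE r3, r1  -> r3 := r1

# MOVE reads r1, which ADD writes: execution of MOVE must wait.
print(raw_dependent(add, move))   # -> True
```

Note that the dependency is one-directional: swapping the two instructions removes it, which is why fetch and decode can still overlap.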
15 True Data Dependency
16 Procedural Dependency The instructions following a branch cannot be executed until the branch is executed Also, if instruction length is not fixed, instructions have to be decoded to find out how many fetches are needed This prevents simultaneous fetches
17 Procedural Dependency
18 Resource Conflict Resources —Memories, caches, buses, register-file, ports, functional units Two or more instructions requiring access to the same resource at the same time —e.g. two arithmetic instructions Can duplicate resources —e.g. have two arithmetic units
19 Resource Conflict
20 Effect of Dependencies
21 Design Issues
Instruction-level parallelism —Instructions in a sequence are independent —Execution can be overlapped —Governed by data and procedural dependency
Machine parallelism —The ability to take advantage of instruction-level parallelism (a measure of the processor's support for it) —Governed by the number of parallel pipelines
E.g. an independent sequence, which can execute in parallel:
Load R1 ← R2(23)
Add R3 ← R3, "1"
Add R4 ← R4, R2
And a dependent sequence, which cannot:
Add R3 ← R3, "1"
Add R4 ← R3, R2
Store [R4] ← R0
22 Instruction Issue Policy The order in which instructions are fetched; the order in which instructions are executed; and the order in which instructions change the contents of registers and memory
23 In-Order Issue In-Order Completion Issue instructions in the order they occur Not very efficient May fetch >1 instruction Instructions must stall if necessary
24 In-Order Issue In-Order Completion (Diagram)
25 In-Order Issue In-Order Completion (Diagram: two-pipeline timing with fetch (f1, f2), decode (d1, d2), execute (a1, a2, e1, e2), and save (s1, s2) stages)
26 In-Order Issue Out-of-Order Completion Output dependency —R3 := R3 + R5; (I1) —R4 := R3 + 1; (I2) —R3 := R5 + 1; (I3) —I2 depends on the result of I1: a true data dependency —If I3 completes before I1, then I1 later overwrites R3, leaving it with the wrong final value: an output (write-write, WAW) dependency
27 In-Order Issue Out-of-Order Completion (Diagram)
28 Out-of-Order Issue Out-of-Order Completion Decouple decode pipeline from execution pipeline Can continue to fetch and decode until this pipeline is full When a functional unit becomes available an instruction can be executed Since instructions have been decoded, processor can look ahead
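Out-of-order issue from a decoded-instruction window can be modeled as list scheduling: each cycle, issue any instruction whose operands are ready, up to the number of functional units. The Python sketch below is illustrative (it models only true data dependencies, assuming renaming has removed the WAW/WAR hazards; the function names are assumptions):

```python
# Greedy out-of-order issue: each cycle, issue up to `units` instructions
# whose source registers are not written by an earlier unfinished instruction.
def schedule(program, units=2):
    done = set()
    cycles = []
    while len(done) < len(program):
        ready = []
        for i, (dest, srcs) in enumerate(program):
            if i in done:
                continue
            blocked = [j for j in range(i)
                       if j not in done and program[j][0] in srcs]
            if not blocked:
                ready.append(i)
        issue = ready[:units]        # limited by available functional units
        cycles.append(issue)
        done.update(issue)
    return cycles

# The renamed I1..I4 sequence from the dependency slides:
prog = [("R3b", ["R3a", "R5a"]),   # I1
        ("R4b", ["R3b"]),          # I2
        ("R3c", ["R5a"]),          # I3
        ("R7b", ["R3c", "R4b"])]   # I4
print(schedule(prog))   # -> [[0, 2], [1], [3]]
```

In cycle 1 the scheduler looks ahead past the stalled I2 and issues I3 alongside I1, which is exactly the benefit of decoupling decode from execution.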
29 Out-of-Order Issue Out-of-Order Completion (Diagram)
30 Out-of-Order Issue Out-of-Order Completion (Diagram: decode stages d1, d2, d3, execute stages a1, a2, e1, e2, and write stages s1, s2, with an instruction lookahead window)
31 Antidependency Read-write (write-after-read, WAR) dependency —R3 := R3 + R5; (I1) —R4 := R3 + 1; (I2) —R3 := R5 + 1; (I3) —R7 := R3 + R4; (I4) —I3 cannot complete before I2 starts, since I2 needs the old value in R3 and I3 changes R3
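All three register hazards covered so far can be classified mechanically from each instruction's read and write sets. The sketch below (an illustrative Python model; the representation is an assumption, not from the slides) labels an instruction pair with RAW, WAW, and WAR:

```python
# Classify register hazards between an earlier and a later instruction.
# Each instruction is modeled as (writes, reads) sets of register names.
def hazards(earlier, later):
    """Return the set of hazard types the ordered pair exhibits."""
    ew, er = earlier
    lw, lr = later
    found = set()
    if ew & lr:
        found.add("RAW")   # true data dependency
    if ew & lw:
        found.add("WAW")   # output dependency
    if er & lw:
        found.add("WAR")   # antidependency
    return found

i1 = ({"R3"}, {"R3", "R5"})   # R3 := R3 + R5
i2 = ({"R4"}, {"R3"})         # R4 := R3 + 1
i3 = ({"R3"}, {"R5"})         # R3 := R5 + 1

print(sorted(hazards(i1, i2)))   # -> ['RAW']
print(sorted(hazards(i2, i3)))   # -> ['WAR']
print(sorted(hazards(i1, i3)))   # -> ['WAR', 'WAW']
```

The last pair shows both hazards at once: I3 rewrites R3 (WAW with I1) and also clobbers a value I1 read (WAR); only the RAW hazard reflects a real flow of data.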
32 Register Renaming Output and antidependencies occur because register contents may not reflect the correct ordering from the program May result in a pipeline stall Registers allocated dynamically —i.e. registers are not specifically named
33 Register Renaming Example Register renaming —R3b := R3a + R5a (I1) —R4b := R3b + 1 (I2) —R3c := R5a + 1 (I3) —R7b := R3c + R4b (I4) A register name without a subscript refers to the logical register in the instruction; with a subscript, to the hardware register allocated to it. Note that R3a, R3b, and R3c are three different hardware registers standing in for the same logical register R3.
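The renamer behind this example can be sketched as a table mapping each logical register to its current physical name, allocating a fresh name on every write (an illustrative Python model under assumed names, not the slides' hardware mechanism):

```python
import itertools

# Rename logical registers to fresh physical names on each write.
# This removes WAW and WAR hazards while preserving RAW ones.
def rename(program):
    fresh = itertools.count()
    table = {}                       # logical name -> current physical name
    def phys(logical):
        if logical not in table:     # first read of a not-yet-written register
            table[logical] = f"{logical}_{next(fresh)}"
        return table[logical]
    out = []
    for dest, sources in program:
        srcs = [phys(s) for s in sources]        # read the old mappings first
        table[dest] = f"{dest}_{next(fresh)}"    # then allocate a new dest
        out.append((table[dest], srcs))
    return out

prog = [("R3", ["R3", "R5"]),   # I1: R3 := R3 + R5
        ("R4", ["R3"]),         # I2: R4 := R3 + 1
        ("R3", ["R5"]),         # I3: R3 := R5 + 1
        ("R7", ["R3", "R4"])]   # I4: R7 := R3 + R4
for dest, srcs in rename(prog):
    print(dest, "<-", srcs)
```

After renaming, I1 and I3 write different physical registers (no WAW), and I3's destination is no longer a register I2 reads (no WAR), while I2 still reads exactly the register I1 wrote (RAW preserved), matching the R3a/R3b/R3c subscripts on the slide.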
34 Background to IA-64 Pentium 4 appears to be last in x86 line Intel & Hewlett-Packard (HP) jointly developed New architecture —64 bit architecture —Not extension of x86 —Not adaptation of HP 64bit RISC architecture Exploits vast circuitry and high speeds Systematic use of parallelism Departure from superscalar
35 Basic Concepts Instruction-level parallelism —Implicit in the machine instruction —Not determined at run time by the processor Long or very long instruction words (LIW/VLIW) Branch predication (not the same as branch prediction) Speculative loading
36 Basic Concepts Intel & HP call this Explicitly Parallel Instruction Computing (EPIC) IA-64 is an instruction set architecture intended for implementation on EPIC hardware Merced was the first Intel product based on IA-64 —Itanium, released in June 2001, is the official name of Intel's Merced processor
37 Superscalar v IA-64
38 Why New Architecture? Not hardware compatible with x86 Now have tens of millions of transistors available on chip —In the late 1970s, had tens of thousands of transistors —In the mid-1980s, had hundreds of thousands of transistors
39 Why New Architecture? How to use this glut of transistors? Could build bigger cache —Have a point of diminishing returns Add more execution units —Increase superscaling —“Complexity wall” —More units makes processor “wider” —More logic needed to orchestrate —Improved branch prediction required —Longer pipelines required —Greater penalty for misprediction —Larger number of renaming registers required —At most four instructions per cycle
40 Explicit Parallelism Instruction parallelism scheduled at compile time —Included with machine instruction Processor uses this info to perform parallel execution Requires less complex circuitry Compiler has much more time to determine possible parallel operations Compiler sees whole program
41 General Organization
42 IA-64 Register Set
43 Key Features Large number of registers —The IA-64 instruction format assumes 256 –128 × 64-bit integer, logical & general-purpose registers –128 × 82-bit floating-point and graphics registers —Plus 64 × 1-bit predicate registers for predicated execution (see later) —To support a high degree of parallelism Multiple execution units —Expected to be 8 or more —Depends on the number of transistors available —Execution of parallel instructions depends on the hardware available –8 parallel instructions may be split into two lots of four if only four execution units are available
44 IA-64 Execution Units I-Unit —Integer arithmetic —Shift and add —Logical —Compare —Integer multimedia ops M-Unit —Load and store –Between register and memory —Some integer ALU B-Unit —Branch instructions F-Unit —Floating point instructions
45 Instruction Format 128-bit bundle —Holds three instructions (syllables) plus a template —One or more bundles can be fetched at a time —The template contains information on which instructions can be executed in parallel –Not confined to a single bundle: e.g. a stream of 8 instructions may be executed in parallel –The compiler will have re-ordered instructions to form contiguous bundles –Dependent and independent instructions can be mixed in the same bundle
46 Predication
47 Example of Predication
Source code:
if (a && b)      /* a and b are both true */
    j = j + 1;
else if (c)
    k = k + 1;
else
    k = k - 1;
i = i + 1;
Assembly code:
    Beq a,0,L1
    Beq b,0,L1
    Add j,j,1
    Jump L3
L1: Beq c,0,L2
    Add k,k,1
    Jump L3
L2: Sub k,k,1
L3: Add i,i,1
48 Example of Predication (Diagram: control-flow tree of the assembly code, with true/false edges at each Beq branch)
49 Predicated Code
P1,P2 = cmp(a==0)
P1,P3 = cmp(b==0)
add j,j,1
P4,P5 = cmp(c!=0)
add k,k,1
sub k,k,1
add i,i,1
50 Example of Predication (Diagram: the same control-flow tree, with predicates P1 to P5 labeling the branch outcomes)
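The effect of if-conversion can be modeled directly: compute every predicate up front, then let each guarded operation take effect only when its predicate holds. This is a simplified illustrative Python model of the example (the predicate numbering follows the slides; the function and its boolean encoding of IA-64's paired cmp predicates are assumptions):

```python
# Predicated execution of: if (a && b) j++; else if (c) k++; else k--; i++;
# Every operation is issued; the predicates decide which ones take effect.
def predicated(a, b, c, j=0, k=0, i=0):
    p1 = (a == 0) or (b == 0)    # the (a && b) test fails
    p2 = not p1                  # a and b are both true
    p4 = p1 and (c != 0)         # the else-if path is taken
    p5 = p1 and not (c != 0)     # the final else path is taken
    if p2:
        j += 1                   # guarded: add j,j,1
    if p4:
        k += 1                   # guarded: add k,k,1
    if p5:
        k -= 1                   # guarded: sub k,k,1
    i += 1                       # unconditional: add i,i,1
    return j, k, i

print(predicated(1, 1, 0))   # a && b true -> (1, 0, 1)
print(predicated(0, 1, 1))   # c true      -> (0, 1, 1)
print(predicated(0, 0, 0))   # else        -> (0, -1, 1)
```

The point of the transformation is that all three guarded additions can be fetched and issued without any branches; the hardware simply discards the results whose predicates are false.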
51 Speculative Loading
52 Speculative Loading
Assembly code:
    R1 = &b[j]
    Ld R2,(R1)
    Bne R2,1,L2
    R3 = &a[i+j]
    Ld R4,(R3)
    Bne R4,1,L2
    R5 = &c[i-j+7]
    Ld R6,(R5)
    Bne R6,1,L2
L1:
L2:
Speculation code:
    R1 = &b[j]
    R3 = &a[i+j]
    R5 = &c[i-j+7]
    Ld R2,(R1)
    Ld.s R4,(R3)
    Ld.s R6,(R5)
    P1,P2 = cmp(R2==1)
    br L2
    Chk.s R4
    P3,P4 = cmp(R4==1)
    br L2
    Chk.s R6
    P5,P6 = cmp(R6==1)
    br L2
L1:
L2:
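The contract behind Ld.s and Chk.s is that a speculative load hoisted above its branch never traps; instead it records a deferred-fault token, and the later check raises the fault only on paths that actually use the value. The toy Python model below illustrates that contract only (the names, the dict-as-memory, and the recovery callback are assumptions, not IA-64 semantics in detail):

```python
# Toy model of IA-64-style speculative loads: ld.s defers a fault,
# chk.s surfaces it only where the loaded value is really needed.
class Deferred:
    """NaT-like token marking a speculative load that faulted."""
    pass

def ld_s(memory, addr):
    """Speculative load: never traps; returns a fault token instead."""
    try:
        return memory[addr]
    except KeyError:                 # stand-in for a page fault
        return Deferred()

def chk_s(value, recovery):
    """Check: run the recovery code if the speculative load faulted."""
    return recovery() if isinstance(value, Deferred) else value

mem = {0x10: 42}                     # address 0x20 would fault
v_ok = ld_s(mem, 0x10)
v_bad = ld_s(mem, 0x20)              # hoisted above its branch; no trap yet
print(chk_s(v_ok, recovery=lambda: -1))    # -> 42
print(chk_s(v_bad, recovery=lambda: -1))   # -> -1 (recovery path taken)
```

This is why the compiler can hoist all three loads in the speculation code above their guarding branches: a load that would have faulted on a not-taken path simply carries its token to a Chk.s that is never reached.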
53 Homework: Problems 14.5 and 14.6