ECE8833 Polymorphous and Many-Core Computer Architecture Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering Lecture 1 Early ILP Processors and Performance Bound Model
ECE8833 H.-H. S. Lee 2009 2 Decoupled Access/Execute Computer Architectures James E. Smith, ACM TOCS, 1984 (an earlier version was published in ISCA 1982)
ECE8833 H.-H. S. Lee 2009 3 Background of DAE, circa 1982 Written at a time when vector machines were dominating. Example code: LV v1, mem[a1]; MULV v3, v2, v1; ADDV v5, v4, v3. [Figure: time lines of LV, MULV, and ADDV, without and with vector chaining (Cray-1); each vector register holds 64 elements (0 to 63) of 64 bits, 4096 bits in total]
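To make the chaining time line concrete, here is a minimal Python sketch (not from the lecture). It assumes a simplified timing model: each unit delivers its first element after a fixed latency and then one element per cycle; an unchained consumer waits for the producer's whole vector, while a chained consumer starts as soon as the first element appears. The latencies (11-cycle load, 7-cycle FP multiply, 6-cycle FP add) are the figures quoted later in these slides and are borrowed here only for illustration; the function name `last_result_time` is hypothetical.

```python
def last_result_time(latencies, n, chained):
    """Cycle at which the final element of the last instruction is available.

    Assumed model: first element after `lat` cycles, then one element per cycle.
    """
    start = 0
    last = 0
    for lat in latencies:
        first = start + lat      # first element leaves the functional unit
        last = first + n - 1     # remaining elements follow one per cycle
        # Chained: the consumer starts when the first element appears.
        # Unchained: the consumer waits until the whole vector is written.
        start = first if chained else last + 1
    return last


LATENCIES = [11, 7, 6]  # LV, MULV, ADDV (illustrative values)
N = 64                  # elements per vector register

unchained = last_result_time(LATENCIES, N, chained=False)
chained = last_result_time(LATENCIES, N, chained=True)
print(unchained, chained)  # 215 87
```

Under these assumptions the three dependent vector instructions finish in 87 cycles with chaining instead of 215, because the vector lengths overlap rather than add up.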
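ECE8833 H.-H. S. Lee 2009 4 Background of DAE, circa 1982 Written at a time when vector machines were dominating. Example code: LV v1, mem[a1]; MULV v3, v2, v1; ADDV v5, v4, v3. [Figure: dataflow of the sequence: Memory feeds v1; MUL combines v1 and v2 into v3; ADD combines v3 and v4 into v5] What about modern SIMD ISAs?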
ECE8833 H.-H. S. Lee 2009 5 Today’s State of the Art? Intel AVX; Intel Larrabee NI
ECE8833 H.-H. S. Lee 2009 6 DAE, circa 1982 Fine-grained parallelism: Vector vs. Superscalar What about scalar performance? –Remember what Flynn’s bottleneck is? (Page 290)
ECE8833 H.-H. S. Lee 2009 7 Flynn’s Bottleneck ILP of 1.86 –Programs on IBM 7090 –Basically, he said one cannot execute more than one instruction per cycle –ILP exploited within basic blocks [Riseman & Foster ’72] –Breaking control dependency –A perfect machine model –Benchmarks include numerical programs, an assembler, and a compiler [Figure: control-flow graph of basic blocks BB0 to BB4]
Jumps passed   0     1     2     8     32    128   ∞
Average ILP    1.72  2.72  3.62  7.21  14.8  24.2  51.2
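The idea of measuring ILP on a perfect machine can be sketched in a few lines of Python. This is an illustrative model, not the methodology of the cited studies: it assumes unit-latency instructions, unlimited functional units, and only true data dependences, so the available ILP is the instruction count divided by the length of the longest dependence chain. The function name `ideal_ilp` and the example block are hypothetical.

```python
def ideal_ilp(deps):
    """ILP of a non-empty instruction sequence on an idealized machine.

    `deps` maps each instruction index (in program order) to the list of
    earlier instruction indices whose results it needs.
    """
    depth = {}
    for i in sorted(deps):
        # An instruction can issue one cycle after its latest producer.
        depth[i] = 1 + max((depth[p] for p in deps[i]), default=0)
    return len(deps) / max(depth.values())


# Six instructions; the longest dependence chain is 0 -> 2 -> 3 (3 cycles).
block = {0: [], 1: [], 2: [0, 1], 3: [2], 4: [], 5: [4]}
print(ideal_ilp(block))  # 2.0
```

Letting the analysis window cross more branches lengthens the instruction sequence without necessarily lengthening the critical path, which is the effect the table above quantifies.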
ECE8833 H.-H. S. Lee 2009 8 DAE, circa 1982, 1984 Issues in CDC6600 & IBM 360/91 –Overlapping instructions out of order (OoO) needs complex control; the resulting slower clock offsets the benefit –Complex issue methods were abandoned by their manufacturers: less determinism, problems in HW debugging, errors may not be reproducible –Complexity can be shifted to system software
ECE8833 H.-H. S. Lee 2009 9 Decoupled Access/Execute Architecture An architecture with two instruction streams to break Flynn’s bottleneck –Access processor –eXecute processor –Hey, this was the 1980s Separate register files (A0, A1, A2, ..., An-1 and X0, X1, X2, ..., Xm-1), which can be totally incompatible –Synchronization issue?
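A minimal functional sketch of the two-stream idea, in Python, may help. It is an illustration under stated assumptions, not the paper's hardware: the loop y[i] = a*x[i] + b is split by hand into an access stream and an execute stream that communicate only through FIFO queues. The queue names XLQ (load queue to the X-processor) and XSQ (store queue from the X-processor) follow the names used later in these slides; `run_dae` and the list-based memory are hypothetical.

```python
from collections import deque


def run_dae(x, a, b):
    """Compute y[i] = a * x[i] + b with separate access and execute streams."""
    mem_x = list(x)
    mem_y = [None] * len(mem_x)
    xlq = deque()  # loaded operands travelling from memory to the X-processor
    xsq = deque()  # results travelling from the X-processor back to memory

    # Access stream: generates addresses and issues every load, "slipping"
    # ahead of the execute stream as far as the queue allows (unbounded here).
    for i in range(len(mem_x)):
        xlq.append(mem_x[i])

    # Execute stream: never computes an address; it only consumes the queue.
    for _ in range(len(mem_x)):
        xsq.append(a * xlq.popleft() + b)

    # Access stream again: pairs each store address with data from XSQ.
    for i in range(len(mem_y)):
        mem_y[i] = xsq.popleft()
    return mem_y


print(run_dae([1.0, 2.0, 3.0], 2.0, 1.0))  # [3.0, 5.0, 7.0]
```

Because the queues are FIFOs, the two streams need no explicit synchronization for data: an empty XLQ stalls the execute stream and an empty XSQ stalls the pending store.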
ECE8833 H.-H. S. Lee 2009 15 Modern Issue Considerations Although it is an ’82/’84 paper, it considers
ECE8833 H.-H. S. Lee 2009 16 Precise Exception –Simple approach: force the instructions to complete in order –In DAE, this is applied to each of the streams separately –Example of imprecise exception issues –Requires caution when coding the A and E programs
ECE8833 H.-H. S. Lee 2009 17 Requirement for Precise Exception
ECE8833 H.-H. S. Lee 2009 18 Why (and How) Does It Work? Avg. speedup = 1.58 for LFK (Livermore Fortran Kernels). Execution on the 2 processors is somewhat balanced. Why? –Works nicely as shown in LFK –X-processor’s computation is not that fast: 6-cycle FP add, 7-cycle FP multiply –A-processor takes care of memory (11-cycle load) and branch resolution
ECE8833 H.-H. S. Lee 2009 19 Disadvantages of DAE Architecture 1. Writing 2 separate programs: what high-level language? Who should do it? 2. Certain duplication in hardware: instruction memory/cache, instruction fetch unit, decoder
ECE8833 H.-H. S. Lee 2009 20 Interleaving Instruction Streams –Use a bit to tag streams –No split branch instruction. Conventions: (1) X7 is XLQ or XSQ; (2) once loaded, it is used once; (3) it must be stored after the X-processor writes to it. [Figure: instruction listing tagged with (A) and X]
ECE8833 H.-H. S. Lee 2009 21 Summary of DAE Architecture 2-wide issue per cycle. Allows a constrained type of OoO –Data accesses can be done well in advance (i.e., “slip” ahead) –Enables a certain level of data prefetching. Was novel in 1982!
ECE8833 H.-H. S. Lee 2009 22 The ZS-1 Central Processor James E. Smith, et al. in ASPLOS-II, 1987
ECE8833 H.-H. S. Lee 2009 23 Astronautics ZS-1 Central Processor A realization of DAE (by the same author). Decouples the instruction stream into –Fixed-point/memory operations –Floating-point operations. Communicates via architectural queues. Extensively pipelined. 22.5 MFLOPS, 45 MIPS
ECE8833 H.-H. S. Lee 2009 24 ZS-1 Central Processor [Figure annotations: the queues communicate with memory; 31 A (and X) registers + 1 queue entry = 32 names, hence 5-bit encoded operands; one instruction buffer holds 24 instructions, another holds 4]
ECE8833 H.-H. S. Lee 2009 25 ZS-1 Central Processor + An instruction cannot be issued until its dependencies are resolved + A load may bypass independent stores + Load-load and store-store order is maintained
ECE8833 H.-H. S. Lee 2009 26 Can Load Bypass Load? Why not? Core 1: (1) Load R1, (A); (2) Load R2, (A). Core 2: (3) Store (A), R3. Initially (A) = 100 and R3 = 25. What’s wrong with the order (2)(3)(1)?
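The question on this slide can be checked by replaying the three operations in different global orders. The Python sketch below is illustrative only; it assumes that (1) and (2) label the two loads on Core 1 in program order and (3) labels the store on Core 2, and the helper name `run` is hypothetical.

```python
def run(order):
    """Replay the slide's three memory operations in the given global order.

    Assumed labels: 1 = Load R1,(A); 2 = Load R2,(A); 3 = Store (A),R3.
    Returns the final (R1, R2) seen by Core 1.
    """
    mem = {"A": 100}
    regs = {"R1": None, "R2": None, "R3": 25}
    for step in order:
        if step == 1:
            regs["R1"] = mem["A"]
        elif step == 2:
            regs["R2"] = mem["A"]
        elif step == 3:
            mem["A"] = regs["R3"]
    return regs["R1"], regs["R2"]


print(run((1, 2, 3)))  # (100, 100): both loads see the old value
print(run((1, 3, 2)))  # (100, 25): old value, then new value
print(run((3, 1, 2)))  # (25, 25): both loads see the new value
print(run((2, 3, 1)))  # (25, 100): the older load sees the NEWER value
```

Only the last order requires the younger load to bypass the older one, and it produces an outcome no in-order interleaving allows: the first load in program order observes the store while the second does not, so the location appears to go backwards in time.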
ECE8833 H.-H. S. Lee 2009 27 ZS-1: Processing of Two Iterations Pipeline stage legend: S: splitter; B: instruction buffer read; D: decode; I: issue; E: execute
ECE8833 H.-H. S. Lee 2009 28 IBM RS/6000 and POWER Evolved from IBM ACS and 801. Foundation of the POWER architecture (Performance Optimization With Enhanced RISC) –10 discrete chips in the early POWER1 system –Single-chip solutions in RSC and in a later POWER2 version called P2SC
ECE8833 H.-H. S. Lee 2009 29 POWER2 Processor Node 8 discrete chips on MCM; 66.7 MHz, 6-issue (2 reserved for br/comp). 2 FXUs –Memory, INT, Logical –2 per cycle. 3 dual-pipe FPUs can perform –2 DP fma –2 FP loads –2 FP stores. [Figure: POWER2 block diagram: Instruction Cache Unit (32 KB I-cache, dispatch, dual branch processors); Fixed-Point Unit (FXU: instruction buffer, execution unit without mult/div, execution unit with mult/div); Floating-Point Unit (FPU: instruction buffer, arithmetic, store, and load execution units); sync between FXU and FPU; Data Cache Unit (DCU: 4 separate chips, 32 KB each); Storage Control Unit; Memory Unit (64 MB to 512 MB); optional secondary cache (1 or 2 MB)]
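ECE8833 H.-H. S. Lee 2009 30 MACS Performance Bound Model (Machine, Application, Compiler-generated code, Schedule) To analyze achievable performance (mostly FP) in scientific applications. [Figure: hierarchy of run-time bounds: M Bound, MA Bound, MAC Bound, MACS Bound, and the physically measured actual run time, separated in that order by Gap A, Gap C, Gap S, and Gap P]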
ECE8833 H.-H. S. Lee 2009 31 MACS Performance Bound Model Gap A (keeps you from attaining peak performance) –Excessive loads/stores (more than the essential ones, e.g., a[i] = b[i]) –Loop bookkeeping Gap C (reason we may want to have 432?) –Hardware restrictions (architectural registers) –Redundant instructions –Load/store overhead in function calls Gap S –Weak scheduling algorithm –Resource conflicts preventing a tighter schedule –Solution: modulo scheduling to compact the code Gap P –Cache misses, inter-core communication, system effects (e.g., context switches) –Solutions: prefetching, loop blocking, loop fusion, loop interchange, etc.
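The gap bookkeeping can be sketched as follows. This Python example is an illustration only: it assumes each gap is the simple difference between adjacent levels of the hierarchy (all in the same unit, such as cycles per flop), and the numbers passed in are made up for the example, not measurements from the lecture. The function name `macs_gaps` is hypothetical.

```python
def macs_gaps(m, ma, mac, macs, measured):
    """Return Gap A, C, S, P as differences between adjacent bound levels.

    Assumes M <= MA <= MAC <= MACS <= measured, all in the same unit.
    """
    levels = [m, ma, mac, macs, measured]
    if any(lower > upper for lower, upper in zip(levels, levels[1:])):
        raise ValueError("bounds must be non-decreasing from M to measured")
    return {
        "A": ma - m,          # application needs more than peak-rate work
        "C": mac - ma,        # compiler emitted extra instructions
        "S": macs - mac,      # schedule is looser than resources allow
        "P": measured - macs, # everything the model leaves out (caches, OS)
    }


# Hypothetical values in cycles per flop.
print(macs_gaps(0.25, 0.75, 1.0, 1.25, 2.0))
# {'A': 0.5, 'C': 0.25, 'S': 0.25, 'P': 0.75}
```

The largest gap points at where tuning effort is likely to pay off first: source restructuring for Gap A, compiler or register pressure for Gap C, scheduling for Gap S, and the memory system for Gap P.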
ECE8833 H.-H. S. Lee 2009 32 POWER2 M Bound (Ideal, Ideal) Peak = 1 fma issued to each of the 2 FPU pipelines per cycle = 4 flops per cycle = 0.25 CPF (cycles per flop). [Figure: Floating-Point Unit (FPU): dispatch into the instruction buffer feeding the arithmetic, store, and load execution units]
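The 0.25 CPF figure follows from simple arithmetic, shown here as a small Python sketch. It assumes that an fma counts as 2 floating-point operations and that both FPU pipelines start one fma every cycle; the 66.7 MHz clock is the value given on the POWER2 node slide. The function names are hypothetical.

```python
def peak_cpf(pipelines=2, flops_per_fma=2):
    """Best-case cycles per floating-point operation (the M bound)."""
    return 1 / (pipelines * flops_per_fma)


def peak_mflops(clock_mhz, cpf):
    """Peak MFLOPS implied by a clock rate and a CPF value."""
    return clock_mhz / cpf


cpf = peak_cpf()
print(cpf)                     # 0.25
print(peak_mflops(66.7, cpf))  # about 266.8 MFLOPS at 66.7 MHz
```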
ECE8833 H.-H. S. Lee 2009 33 POWER2 MA Bound (Ideal compiler and rest) 1. Given the visible workload of the high-level application, 2. calculate the essential operations that must be performed. [Equation annotations: time bound for all FP operations; essential, minimum FP operations to complete the computation] A factor of 4 for div and sqrt is a common choice to reflect their weight relative to other computations
ECE8833 H.-H. S. Lee 2009 34 POWER2 MA Bound (Ideal compiler and rest) [Equation annotations: 2 pipelines; max 4 dispatches to FPU and FXU; other fixed-point operations considered irrelevant; simplified memory model; non-pipelined FP ops]
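One plausible way to turn these annotations into a number is sketched below in Python. This is an assumption-laden illustration, not the slide's own formula: it takes the per-iteration bound to be the maximum of three resource times (FP work over 2 pipelines with div and sqrt weighted by 4, memory operations at an assumed 2 per cycle, and all counted operations over a dispatch width of 4). The function name, parameter names, and the memory rate are hypothetical.

```python
def ma_bound_cycles(fp_ops, div_sqrt, loads, stores,
                    fp_pipes=2, mem_per_cycle=2,
                    dispatch_width=4, div_weight=4):
    """Lower bound on cycles per loop iteration from essential operations.

    fp_ops   -- essential adds, multiplies, and fmas (one slot each)
    div_sqrt -- essential divides and square roots (weighted by div_weight)
    loads, stores -- essential memory operations
    All rates are assumptions for illustration.
    """
    t_fp = (fp_ops + div_weight * div_sqrt) / fp_pipes
    t_mem = (loads + stores) / mem_per_cycle
    t_dispatch = (fp_ops + div_sqrt + loads + stores) / dispatch_width
    return max(t_fp, t_mem, t_dispatch)


# Example loop body: y[i] = a * x[i] + y[i]
# Essential work: 1 fma (2 flops), 2 loads, 1 store.
cycles = ma_bound_cycles(fp_ops=1, div_sqrt=0, loads=2, stores=1)
print(cycles)      # 1.5 cycles per iteration, limited by memory operations
print(cycles / 2)  # 0.75 cycles per flop
```

In this example the bound is three times the 0.25 CPF machine peak even with an ideal compiler, because the loop needs three memory operations for every fma. Feeding the same function with counts taken from the compiler's generated code, rather than the essential counts, would give a MAC-style estimate under the same assumptions.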
ECE8833 H.-H. S. Lee 2009 35 POWER2 MAC Bound Computed like the MA Bound, but using the actual, compiler-generated instruction counts
ECE8833 H.-H. S. Lee 2009 36 POWER2 MACS Bound Computed like the MAC Bound, but the numerator comes from the actual compiler-scheduled code
ECE8833 H.-H. S. Lee 2009 37 IBM SP2 Performance Bound Later expanded to include an inter-processor communication bound