Processor Architectures and Program Mapping

Presentation on theme: "Processor Architectures and Program Mapping"— Presentation transcript:

1 Processor Architectures and Program Mapping
Exploiting ILP, part 1: VLIW architectures. TU/e course 5kk10. Henk Corporaal, Jef van Meerbergen, Bart Mesman.

2 What are we talking about?
ILP = Instruction Level Parallelism: the ability to perform multiple operations (or instructions) from a single instruction stream in parallel.
VLIW = Very Long Instruction Word architecture.
Instruction format: | operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |
9/17/2018 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

3 VLIW: Topics
Overview:
  Enhance performance: architecture methods
  Instruction Level Parallelism
  Limits on ILP
  VLIW
  Examples
  Clustering
  Code generation
  Hands-on

4 Enhance performance: 3 architecture methods
(Super)-pipelining
Powerful instructions:
  MD-technique: multiple data operands per operation
  MO-technique: multiple operations per instruction
Multiple instruction issue

5 Architecture methods Pipelined Execution of Instructions
IF: Instruction Fetch
DC: Instruction Decode
RF: Register Fetch
EX: Execute instruction
WB: Write Result Register
[Diagram: four instructions flowing through the 5-stage IF-DC-RF-EX-WB pipeline over 8 cycles]
Purpose of pipelining:
  Reduce #gate_levels in the critical path
  Reduce CPI close to one
  More efficient hardware
Problems: hazards cause pipeline stalls
  Structural hazards: add more hardware
  Control hazards, branch penalties: use branch prediction
  Data hazards: bypassing required

6 Architecture methods Pipelined Execution of Instructions
Superpipelining: split one or more of the critical pipeline stages into multiple shorter stages.

7 Architecture methods Powerful Instructions (1)
MD-technique: multiple data operands per operation
SIMD: Single Instruction Multiple Data
Vector instruction:
  for (i = 0; i < 64; i++) c[i] = a[i] + 5*b[i];   // i.e. c = a + 5*b
Assembly:
  set   vl,64
  ldv   v1,0(r2)
  mulvi v2,v1,5
  ldv   v1,0(r1)
  addv  v3,v1,v2
  stv   v3,0(r3)

8 Architecture methods Powerful Instructions (1)
SIMD computing:
  Nodes used for independent operations
  Mesh or hypercube connectivity
  Exploits data locality of e.g. image processing applications
  Dense encoding (few instruction bits needed)
[Diagram: SIMD execution method: instructions 1..n issued over time to nodes 1..K in lockstep]

9 Architecture methods Powerful Instructions (1)
Sub-word parallelism: SIMD on a restricted scale, used for multimedia instructions.
Examples: MMX, SUN VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia II
Example: sum over i = 1..4 of |a_i - b_i|

10 Architecture methods Powerful Instructions (2)
MO-technique: multiple operations per instruction
  CISC (Complex Instruction Set Computer)
  VLIW (Very Long Instruction Word)
VLIW instruction example, one field per function unit:
  FU 1: sub r8,r5,3 | FU 2: and r1,r5,12 | FU 3: mul r6,r5,r2 | FU 4: ld r3,0(r5) | FU 5: bnez r5,13

11 Architecture methods: Powerful Instructions (2) VLIW Characteristics
Only RISC-like operation support: short cycle times
Flexible: can implement any FU mixture
Extensible
Tight inter-FU connectivity required
Large instructions (up to 1000 bits)
Not binary compatible
But good compilers exist

12 Architecture methods Multiple instruction issue (per cycle)
Who guarantees semantic correctness, i.e. which instructions can be executed in parallel?
  User specifies multiple instruction streams: MIMD (Multiple Instruction Multiple Data)
  Run-time detection of ready instructions: Superscalar
  Compile into dataflow representation: Dataflow processors

13 Multiple instruction issue Three Approaches
Example code:
  a := b + 15;
  c := 3.14 * d;
  e := c / f;
Translation to DDG (Data Dependence Graph):
[Diagram: DDG with loads of b, d and f, constants 15 and 3.14, the +, * and / operations, and stores to a, c and e]

14 Generated Code
Instr.  Sequential Code       Dataflow Code
I1      ld r1,M(&b)           ld M(&b) -> I2
I2      addi r1,r1,15         addi 15 -> I3
I3      st r1,M(&a)           st M(&a)
I4      ld r1,M(&d)           ld M(&d) -> I5
I5      muli r1,r1,3.14       muli 3.14 -> I6, I8
I6      st r1,M(&c)           st M(&c)
I7      ld r2,M(&f)           ld M(&f) -> I8
I8      div r1,r1,r2          div -> I9
I9      st r1,M(&e)           st M(&e)
Notes:
  An MIMD may execute two streams: (1) I1-I3, (2) I4-I9. There are no dependencies between the streams; in practice, communication and synchronization are required between streams.
  A superscalar issues multiple instructions from the sequential stream, obeying dependencies (true and name dependencies); reverse engineering of the DDG is needed at run-time.
  Dataflow code is a direct representation of the DDG.

15 Multiple Instruction Issue: Data flow processor
[Diagram: dataflow processor organization: result tokens flow through token matching and the token store to instruction generate and the instruction store, then into reservation stations feeding FU-1 .. FU-K]

16 Instruction Pipeline Overview
[Pipeline diagrams comparing the styles:
  CISC: IF DC RF EX WB with multi-cycle stages
  RISC: IF DC/RF EX WB
  Superscalar: k parallel IF/DC/RF/EX/WB pipelines with ROB and issue logic
  Superpipelined: subdivided stages IF1..IFs, DC, RF, EX1..EXs, WB
  Dataflow: parallel RF EX WB pipelines per instruction
  VLIW: a single IF and DC feeding k parallel RF EX WB pipelines]

17 Four dimensional representation of the architecture design space <I, O, D, S>
[Diagram: architectures plotted in the four-dimensional design space <I, O, D, S>: SIMD and Vector reach 10-100 on the Data/operation 'D' axis; Superscalar, MIMD and Dataflow reach 10-100 on the Instructions/cycle 'I' axis (CISC at 0.1, RISC at 1); VLIW reaches ~10 on the Operations/instruction 'O' axis; superpipelined designs reach ~10 on the Superpipelining degree 'S' axis]

18 Architecture design space
Typical values of K (# of functional units or processor nodes) and <I, O, D, S> for different architectures:
[Table: K, I, O, D, S and Mpar for CISC, RISC, VLIW, Superscalar, Superpipelined, Vector, SIMD, MIMD and Dataflow; the numeric entries are not recoverable]
S(architecture) = sum over Op in I_set of f(Op) * lt(Op)
Mpar = I * O * D * S

19 Overview Enhance performance: architecture methods
Instruction Level Parallelism
  Limits on ILP
VLIW
Examples
Clustering
Code generation
Hands-on

20 General organization of an ILP architecture
[Diagram: CPU with instruction memory, instruction fetch unit and instruction decode unit feeding FU-1 .. FU-5, which share a register file and a bypassing network connected to data memory]

21 Motivation for ILP Increasing VLSI densities; decreasing feature size
Increasing performance requirements
New application areas, like
  multimedia (image, audio, video, 3-D)
  intelligent search and filtering engines
  neural, fuzzy, genetic computing
More functionality
Use of existing code (compatibility)
Low power: P = f*C*Vdd^2

22 Low power through parallelism
Sequential processor:
  switching capacitance C, frequency f, voltage V
  P = f*C*V^2
Parallel processor (two times the number of units):
  switching capacitance 2C, frequency f/2, voltage V' < V
  P = (f/2) * 2C * V'^2 = f*C*V'^2

23 Measuring and exploiting available ILP
How much ILP is there in applications? How to measure parallelism within applications?
  Using an existing compiler
  Using trace analysis
Track all the real data dependencies (RaW: read-after-write) of instructions from the issue window:
  register dependences
  memory dependences
Check for correct branch prediction:
  if the prediction is correct, continue
  if wrong, flush the schedule and restart in the next cycle

24 Trace analysis
Program:
  for i := 0..2  A[i] := i;
  S := X+3;
Compiled code:
        set  r1,0
        set  r2,3
        set  r3,&A
  Loop: st   r1,0(r3)
        add  r1,r1,1
        add  r3,r3,4
        brne r1,r2,Loop
        add  r1,r5,3
Trace (the executed instructions, 16 in total): set r1,0; set r2,3; set r3,&A; then three iterations of st r1,0(r3) / add r1,r1,1 / add r3,r3,4 / brne r1,r2,Loop; finally add r1,r5,3.
How parallel can this code be executed? Explain trace analysis using this trace for different models:
  1. No renaming + oracle prediction + unlimited window size
  2. Full renaming + oracle prediction + unlimited window size (shown in the next slide)
  3. Full renaming + 2-bit prediction (assuming back-edge taken) + unlimited window size
  etc.

25 Trace analysis: Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7
Parallel trace:
  set r1,0        set r2,3       set r3,&A
  st r1,0(r3)     add r1,r1,1    add r3,r3,4
  st r1,0(r3)     add r1,r1,1    add r3,r3,4
  brne r1,r2,Loop
  brne r1,r2,Loop
  add r1,r5,3
Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7
Note that with oracle prediction and renaming the last operation, add r1,r5,3, can be put in the first cycle.

26 Ideal Processor Assumptions for ideal/perfect processor:
1. Register renaming: an infinite number of virtual registers, so all register WAW & WAR hazards are avoided
2. Branch and jump prediction: perfect, so all program instructions are available for execution
3. Memory-address alias analysis: all addresses are known; a store can be moved before a load provided the addresses are not equal
Also:
  unlimited number of instructions issued per cycle (unlimited resources) and an unlimited instruction window
  perfect caches
  1-cycle latency for all instructions (including FP * and /)
Programs were compiled using the MIPS compiler at maximum optimization level

27 Upper Limit to ILP: Ideal Processor
[Chart: upper-limit IPC of the ideal processor for the integer and FP benchmarks]

28 Window Size and Branch Impact
Change from an infinite window to one that examines 2000 instructions and issues at most 64 instructions per cycle.
[Chart: IPC per benchmark under perfect, tournament, BHT(512), profile-based, and no branch prediction; the integer programs reach 6 - 12 IPC]

29 Impact of Limited Renaming Registers
Changes: 2000-instruction window, 64-instruction issue, 8K 2-level predictor (slightly better than the tournament predictor).
[Chart: IPC for the integer and FP benchmarks as the number of renaming registers varies from infinite downwards]

30 Memory Address Alias Impact
Changes: 2000-instruction window, 64-instruction issue, 8K 2-level predictor, 256 renaming registers.
[Chart: IPC under perfect, global/stack-perfect, inspection-based, and no alias analysis; the FP programs (Fortran, no heap) suffer least; the integer programs reach 4 - 9 IPC]

31 Window Size Impact
Assumptions: perfect disambiguation, 1K selective predictor, 16-entry return stack, 64 renaming registers, issue as many instructions as the window allows.
[Chart: IPC for the integer and FP benchmarks as the window size varies from infinite downwards]

32 How to Exceed ILP Limits of This Study?
WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not in memory
Unnecessary dependences: the compiler did not unroll loops, so the iteration variable creates a dependence
Overcoming the data flow limit: value prediction, i.e. predicting values and speculating on the prediction
Address value prediction and speculation: predicts addresses and speculates by reordering loads and stores; could provide better alias analysis

33 Conclusions
Amount of parallelism is limited:
  higher in multimedia applications
  higher in kernels
Trace analysis detects all types of parallelism: task, data and operation level
Detected parallelism depends on:
  quality of the compiler
  the hardware
  source-code transformations

34 Overview Enhance performance: architecture methods
Instruction Level Parallelism
VLIW
Examples:
  C6
  TM
  IA-64: Itanium, ...
  TTA
Clustering
Code generation
Hands-on

35 A VLIW architecture with 7 FUs
VLIW concept
[Diagram: a VLIW architecture with 7 function units: an instruction register fed from instruction memory; integer FUs and LD/ST units connected to the integer register file and data memory; FP FUs connected to the floating point register file]

36 VLIW characteristics Multiple operations per instruction
One instruction issued per cycle (at most)
Compiler is in control
Only RISC-like operation support:
  short cycle times
  easier to compile for
Flexible: can implement any FU mixture
Extensible / scalable
However:
  tight inter-FU connectivity required
  not binary compatible!! (new long instruction format)

37 VelociTI C6x datapath

38 VLIW example: TMS320C62 (VelociTI processor)
8 operations (of 32 bits each) per instruction (256 bits)
Two clusters:
  8 FUs, 4 FUs per cluster (2 multipliers, 6 ALUs)
  2 x 16 registers
  one bus available to write into the register file of the other cluster
Flexible addressing modes (like circular addressing)
Flexible instruction packing
All instructions conditional
5 ns cycle, 200 MHz, 0.25 um, 5-layer CMOS
128 KB on-chip RAM

39 VLIW example: Trimedia
5-issue
128 registers
27 FUs
32-bit
8-way set-associative caches
dual-ported data cache
guarded operations
[Diagram: Trimedia chip: VLIW processor with 32K I$ and 16K D$; VLD coprocessor (Huffman decoder, MPEG-1/2); SDRAM memory interface; timers; PCI interface (32 bit, 33 MHz); video in (19 Mpix/s) and video out (40 Mpix/s); audio in and audio out (208-channel digital audio); I2C and serial interfaces]

40 Intel Architecture IA-64
Explicitly Parallel Instruction Computing (EPIC)
IA-64 architecture -> Itanium, the first realization
Register model:
  128 x 64-bit integer registers (register stack, rotating)
  128 x 82-bit floating point registers (rotating)
  64 x 1-bit boolean (predicate) registers
  8 x 64-bit branch target address registers
  system control registers

41 EPIC Architecture: IA-64
Instructions are grouped in 128-bit bundles:
  3 x 41-bit instructions
  5 template bits, indicating the instruction types and stop location
Each 41-bit instruction starts with a 4-bit opcode and ends with a 6-bit guard (boolean/predicate) register id
Supports speculative loads

42 Itanium

43 Itanium 2: McKinley

44 EPIC Architecture: IA-64
EPIC allows for more binary compatibility than a plain VLIW:
  function unit assignment is performed at run-time
  lock (stall) when FU results are not yet available
See the website for more info on IA-64 (look at related material)

45 VLIW evaluation
Strong points of VLIW:
  Scalable (add more FUs)
  Flexible (an FU can be almost anything; e.g. multimedia support)
Weak points, with N FUs:
  Bypassing complexity: O(N^2)
  Register file complexity: O(N)
  Register file size: O(N^2)
  Register file design restricts FU flexibility
Solution: ?

46 VLIW evaluation
[Diagram: the CPU organization from slide 20, annotated with costs for N function units: the instruction fetch/decode path has an O(N^2) control problem, and the bypassing network and register file grow between O(N) and O(N^2)]

47 Solution Mirroring the Programming Paradigm
TTA: Transport Triggered Architecture
[Diagram: function units (+, -, >, *, st) attached to a shared transport network, mirroring the programming paradigm]

48 Transport Triggered Architecture
General organization of a TTA
[Diagram: CPU with instruction memory, instruction fetch and decode units, and FU-1 .. FU-5 plus the register file all attached to the bypassing (transport) network, connected to data memory]

49 TTA structure; datapath details
[Diagram: TTA datapath details: load/store units attached to data memory, an ALU, an immediate unit, integer, float and boolean register files, and the instruction unit attached to instruction memory, all connected through sockets to the transport buses]

50 TTA hardware characteristics
Modular: building blocks easy to reuse
Very flexible and scalable: easy inclusion of Special Function Units (SFUs)
Very low complexity:
  > 50% reduction in # register ports
  reduced bypass complexity (no associative matching)
  up to 80% reduction in bypass connectivity
  trivial decoding
  reduced register pressure
  easy register file partitioning (a single port is enough!)

51 TTA software characteristics
add r3, r1, r2   becomes:   r1 -> add.o1; r2 -> add.o2; add.r -> r3
That does not look like an improvement!?
  More difficult to schedule!
  But: extra scheduling optimizations become possible

52 Scheduling example
Operations:
  add r1,r2,r2
  sub r4,r1,95
TTA moves:
  r1 -> add.o1, r2 -> add.o2
  add.r -> sub.o1, 95 -> sub.o2
  sub.r -> r4
[Diagram: the moves scheduled on the transport buses of a TTA with a load/store unit, two integer ALUs, an integer RF and an immediate unit, next to the equivalent VLIW schedule]

53 Programming TTAs
General MOVE field:  g | i | src | dst
  g: guard specifier
  i: immediate specifier
  src: source
  dst: destination
General MOVE instructions contain multiple fields:  move 1 | move 2 | move 3 | move 4
How to use immediates?
  Small (6 bits), encoded in the source field:  g | 1 | imm | dst
  Long (32 bits), via an immediate register:  g | Ir-1 | dst, with the 32-bit imm held in the instruction

54 Programming TTAs
How to do conditional execution? Each move is guarded.
Example:
  r1 -> cmp.o1   // operand move to compare unit
  r2 -> cmp.o2   // trigger move to compare unit
  cmp.r -> g     // put result in boolean register g
  g:r3 -> r4     // guarded move takes place only when r1 = r2

55 Register file port pressure for TTAs
[Chart: register file port requirements of TTAs compared to conventional VLIWs]

56 Summary of TTA Advantages
Better usage of transport capacity:
  instead of 3 transports per dyadic operation, about 2 are needed
# register ports reduced by at least 50%
Inter-FU connectivity reduced by 50-70%; no full connectivity required
Both the transport capacity and the # register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
Flexible: FUs can incorporate arbitrary functionality
Scalable: #FUs, #register files, etc. can be changed
FU splitting results in extra exploitable concurrency
TTAs are easy to design and can have short cycle times

57 Overview Enhance performance: architecture methods
Instruction Level Parallelism
VLIW
Examples:
  C6
  TM
  TTA
Clustering
Code generation
Hands-on

58 Clustered VLIW
Clustering = splitting up the VLIW data path; the same can be done for the instruction path.
[Diagram: clusters of FUs with local register files and loop buffers, each with a level-1 instruction cache and a level-1 data cache, sharing a level-2 cache]

59 Clustered VLIW
Why clustering?
  Timing: faster clock
  Lower cost: silicon area, time-to-market (T2M)
  Lower energy

