Processor Architectures and Program Mapping
  Exploiting ILP part 2: code generation
  TU/e 5kk10
  Henk Corporaal, Jef van Meerbergen, Bart Mesman
Overview
  Enhance performance: architecture methods
  Instruction Level Parallelism
  VLIW
  Examples
    C6
    TM
    TTA
  Clustering
  Code generation
  Hands-on
1/15/2019 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman
Compiler basics
  Overview
    Compiler trajectory / structure / passes
    Control Flow Graph (CFG)
    Mapping and Scheduling
      Basic block list scheduling
      Extended scheduling scope
      Loop scheduling
Compiler basics: trajectory
  Source program -> Preprocessor -> Compiler -> Assembler -> Loader/Linker -> Object program
  The compiler also emits error messages; the loader/linker also takes in library code
Compiler basics: structure / passes
  Source code
  Lexical analyzer: token generation
  Parsing: check syntax, check semantic, parse tree generation
  Intermediate code
  Code optimization: data flow analysis, local optimizations, global optimizations
  Code generation: code selection, peephole optimizations
  Register allocation: making interference graph, graph coloring, spill code insertion, caller / callee save and restore code
  Sequential code
  Scheduling and allocation: exploiting ILP
  Object code
Compiler basics: structure
  Simple compilation example
    position := initial + rate * 60
  Lexical analyzer
    id1 := id2 + id3 * 60
  Syntax analyzer
    parse tree: := (id1, + (id2, * (id3, 60)))
  Intermediate code generator
    temp1 := intoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
  Code optimizer
    temp1 := id3 * 60.0
    id1 := id2 + temp1
  Code generator
    movf id3, r2
    mulf #60, r2, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1
Compiler basics: Control flow graph (CFG)
  C input code:
    if (a > b) { r = a % b; }
    else { r = b % a; }
  CFG:
    BB 1: sub t1, a, b
          bgz t1, 2, 3
    BB 2: rem r, a, b
          goto 4
    BB 3: rem r, b, a
          goto 4
    BB 4: ...
  A Program is a collection of Functions; each function is a collection of Basic Blocks; each BB contains a set of Instructions; each instruction consists of several Transports.
Mapping / Scheduling: placing operations in space and time
  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;
  [Figure: Data Dependence Graph (DDG) of these statements: two multiplications (a * b and 2 * b), three additions (e, f, x) and one subtraction (r)]
How to map these operations?
  Architecture constraints:
    One Function Unit
    All operations single cycle latency
  Schedule (one operation per cycle, 6 cycles):
    cycle  FU
    1      *
    2      *
    3      +
    4      +
    5      -
    6      +
How to map these operations?
  Architecture constraints:
    One Add-sub and one Mul unit
    All operations single cycle latency
  Schedule (4 cycles):
    cycle  Mul  Add-sub
    1      *    +
    2      *    +
    3           +
    4           -
There are many mapping solutions
  [Figure: Pareto curve (solution space); axes: cost versus execution time]
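As a minimal sketch of what a Pareto curve selects, assume cost is simply the number of function units and execution time is the schedule length in cycles for the five-statement example. The first two points correspond to the one-FU mapping (6 cycles) and the Mul plus Add-sub mapping (4 cycles); the other two are illustrative hand-derived values, not results from the course. The function name `pareto_front` is hypothetical.

```python
def pareto_front(points):
    """Keep the (cost, time) points that no other point beats or equals on both axes."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# (number of FUs, schedule length in cycles); assumed cost model, see lead-in.
design_points = [(1, 6), (2, 4), (3, 4), (4, 3)]
front = pareto_front(design_points)
print(front)
```

The point (3, 4) drops out because (2, 4) is at least as good on both axes; the remaining points are the trade-offs a designer would actually choose between.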
Basic Block Scheduling
  Make a dependence graph
  Determine minimal length
  Determine ASAP, ALAP, and slack of each operation
  Place each operation in first cycle with sufficient resources
  Note:
    Scheduling order is sequential
    Priority is determined by the heuristic used; e.g. slack
Basic Block Scheduling
  [Figure: DDG with LD, ADD, SUB, NEG and MUL operations on inputs A, B and C, producing X, y and z. Each operation is annotated <ASAP cycle, ALAP cycle>; the slack is the difference between the two. Annotations without slack: <1,1>, <2,2>, <3,3>, <4,4>; annotations with slack: <1,3>, <2,3>, <2,4>, <1,4>]
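The ASAP, ALAP and slack values can be computed with one forward and one backward pass over the dependence graph. The sketch below is an illustration under assumptions: it uses the five-statement example from the mapping slides (with `t` as a hypothetical name for the intermediate result 2 * b), unit latency, cycles numbered from 1, and a function name `asap_alap_slack` that is not from the course material.

```python
from collections import defaultdict


def asap_alap_slack(ops, edges, latency=1):
    """ops must be in topological order; edges are (producer, consumer) pairs."""
    preds, succs = defaultdict(list), defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # Forward pass: earliest cycle in which each operation can start.
    asap = {}
    for v in ops:
        asap[v] = max((asap[u] + latency for u in preds[v]), default=1)

    # Minimal schedule length: last cycle in which any operation is still busy.
    length = max(asap[v] + latency - 1 for v in ops)

    # Backward pass: latest start cycle that does not stretch the schedule.
    alap = {}
    for v in reversed(ops):
        alap[v] = min(
            (alap[s] - latency for s in succs[v]), default=length - latency + 1
        )

    slack = {v: alap[v] - asap[v] for v in ops}
    return asap, alap, slack, length


# d = a * b; t = 2 * b; e = a + d; f = t + d; r = f - e; x = z + y
ops = ["d", "t", "e", "f", "r", "x"]
edges = [("d", "e"), ("d", "f"), ("t", "f"), ("e", "r"), ("f", "r")]
asap, alap, slack, length = asap_alap_slack(ops, edges)
print(length, slack)
```

Only `x` has slack here (it can go in any of the three cycles), which is why a slack-based priority schedules it last when resources are scarce.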
Cycle based list scheduling
  proc Schedule(DDG = (V,E))
  beginproc
    ready = { v | ¬∃ (u,v) ∈ E }
    ready’ = ready
    sched = ∅
    current_cycle = 0
    while sched ≠ V do
      for each v ∈ ready’ do
        if ¬ResourceConfl(v, current_cycle, sched) then
          cycle(v) = current_cycle
          sched = sched ∪ {v}
        endif
      endfor
      current_cycle = current_cycle + 1
      ready = { v | v ∉ sched ∧ ∀ (u,v) ∈ E, u ∈ sched }
      ready’ = { v | v ∈ ready ∧ ∀ (u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
    endwhile
  endproc
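A runnable sketch of cycle based list scheduling, applied to the five-statement example from the mapping slides. Several things are assumptions made for illustration: resources are modelled as a count of units per function-unit type, priority is simply the order in which operations are listed, every operation has the same delay, cycles are numbered from 1, and the names `list_schedule`, `fu_of` and `capacity` are hypothetical.

```python
def list_schedule(ops, edges, fu_of, capacity, delay=1):
    """Return {operation: issue cycle}.

    ops:      operation names in priority order (highest priority first)
    edges:    (u, v) pairs, meaning v consumes the result of u
    fu_of:    operation -> function-unit type
    capacity: function-unit type -> number of units of that type
    """
    preds = {v: [] for v in ops}
    for u, v in edges:
        preds[v].append(u)

    cycle = {}
    current = 1
    while len(cycle) < len(ops):
        used = {fu: 0 for fu in capacity}  # units claimed in this cycle
        for v in ops:
            if v in cycle:
                continue
            # Data-ready test: every predecessor is scheduled and has finished.
            if any(u not in cycle or cycle[u] + delay > current for u in preds[v]):
                continue
            fu = fu_of[v]
            if used[fu] < capacity[fu]:  # no resource conflict
                cycle[v] = current
                used[fu] += 1
        current += 1
    return cycle


# d = a * b; t = 2 * b; e = a + d; f = t + d; r = f - e; x = z + y
ops = ["d", "t", "e", "f", "r", "x"]
edges = [("d", "e"), ("d", "f"), ("t", "f"), ("e", "r"), ("f", "r")]

one_fu = list_schedule(ops, edges, {v: "fu" for v in ops}, {"fu": 1})
two_fu = list_schedule(
    ops,
    edges,
    {"d": "mul", "t": "mul", "e": "addsub", "f": "addsub", "r": "addsub", "x": "addsub"},
    {"mul": 1, "addsub": 1},
)
print(max(one_fu.values()), max(two_fu.values()))
```

With one function unit the schedule takes 6 cycles; with one Mul and one Add-sub unit it takes 4, matching the two mappings shown for this example.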
Extended basic block scheduling: Code Motion
  Block A: a) add r4, r4, 4
           b) beq . . .
  Block B: c) add r1, r1, r2
  Block C: d) sub r1, r1, r2
  Block D: e) st r1, 8(r4)
  Downward code motions? a -> B, a -> C, a -> D, c -> D, d -> D
  Upward code motions? c -> A, d -> A, e -> B, e -> C, e -> A
Extended Scheduling scope
  Code:
    A;
    If cond Then B Else C;
    D;
    If cond Then E Else F;
    G;
  CFG (Control Flow Graph): A branches to B and C, which join in D; D branches to E and F, which join in G
Scheduling scopes
  Trace
  Superblock
  Decision tree
  Hyperblock/region
Code movement (upwards) within regions
  [Figure: an add operation moves upwards from its source block, through intermediate blocks (I), to a destination block. Legend: copy needed; intermediate block; check for off-liveness; code movement]
Extended basic block scheduling: Code Motion
  A dominates B <=> A is always executed before B
    Consequently: A does not dominate B => code motion from B to A requires code duplication
  B post-dominates A <=> B is always executed after A
    Consequently: B does not post-dominate A => code motion from B to A is speculative
  [Figure: CFG with blocks A, B, C, D, E, F]
  Q1: does C dominate E?
  Q2: does C dominate D?
  Q3: does F post-dominate D?
  Q4: does D post-dominate B?
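Dominator and post-dominator sets can be computed with a simple iterative data-flow pass. Because the edges of the CFG in the figure are not recoverable here, the sketch below instead uses the A..G graph from the "Extended Scheduling scope" slide (A branches to B and C, joining in D; D branches to E and F, joining in G). The function names are hypothetical, and the code assumes every block is reachable from the entry.

```python
def dominators(succ, entry):
    """Return {block: set of blocks that dominate it}, by iterating to a fixed point."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for u, vs in succ.items():
        for v in vs:
            pred[v].add(u)

    dom = {n: set(nodes) for n in nodes}  # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            # A block is dominated by itself plus whatever dominates all its predecessors.
            new = {n} | set.intersection(*(dom[p] for p in pred[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def reverse(succ):
    """Reverse every edge, so that dominators of the result are post-dominators."""
    rev = {n: [] for n in succ}
    for u, vs in succ.items():
        for v in vs:
            rev[v].append(u)
    return rev


cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E", "F"],
       "E": ["G"], "F": ["G"], "G": []}
dom = dominators(cfg, "A")
pdom = dominators(reverse(cfg), "G")
print(sorted(dom["D"]), sorted(pdom["A"]))
```

In this graph B does not dominate D, so moving an operation from D up into B also needs a copy in C; and B does not post-dominate A, so moving an operation from B up into A is speculative.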
Scheduling: Loops
  Loop Optimizations:
    Loop unrolling
    Loop peeling
  [Figure: loop CFGs with blocks A and B illustrating both optimizations]
Scheduling: Loops
  Problems with unrolling:
    Exploits only parallelism within sets of n iterations
    Iteration start-up latency
    Code expansion
  [Figure: resource utilization versus time for basic block scheduling, for basic block scheduling and unrolling, and for software pipelining]
Software pipelining
  Software pipelining a loop is:
    Scheduling the loop such that iterations start before preceding iterations have finished
    Or: Moving operations across the backedge
  Example: y = a.x, with loop body LD, ML, ST
    No overlap between iterations:        3 cycles/iteration
    Unrolling (3 iterations in 5 cycles): 5/3 cycles/iteration
    Software pipelining:                  1 cycle/iteration
Software pipelining (cont’d)
  Basic techniques:
    Modulo scheduling (Rau, Lam)
      list scheduling with modulo resource constraints
    Kernel recognition techniques
      unroll the loop
      schedule the iterations
      identify a repeating pattern
      Examples:
        Perfect pipelining (Aiken and Nicolau)
        URPR (Su, Ding and Xia)
        Petri net pipelining (Allan)
    Enhanced pipeline scheduling (Ebcioğlu)
      fill first cycle of iteration
      copy this instruction over the backedge
Software pipelining: Modulo scheduling
  Example: Modulo scheduling a loop
  (a) Example loop
    for (i = 0; i < n; i++)
      a[i+6] = 3* a[i] - 1;
  (b) Code without loop control
    ld r1,(r2)
    mul r3,r1,3
    sub r4,r3,1
    st r4,(r5)
  (c) Software pipeline (each column is one iteration, started one cycle after the previous one)
    Prologue  ld r1,(r2)
              mul r3,r1,3   ld r1,(r2)
              sub r4,r3,1   mul r3,r1,3   ld r1,(r2)
    Kernel    st r4,(r5)    sub r4,r3,1   mul r3,r1,3   ld r1,(r2)
    Epilogue                st r4,(r5)    sub r4,r3,1   mul r3,r1,3
                                          st r4,(r5)    sub r4,r3,1
                                                        st r4,(r5)
  Prologue fills the SW pipeline with iterations
  Epilogue drains the SW pipeline
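The prologue, kernel and epilogue structure can be generated mechanically by starting one iteration every II cycles. The sketch below makes simplifying assumptions that are not stated on the slide: every operation takes one cycle, II = 1, enough function units are available, and register reuse between overlapping iterations is ignored. The function name `pipeline_rows` and the `op[i]` notation for "operation op of iteration i" are hypothetical.

```python
def pipeline_rows(stages, iterations, ii=1):
    """List the operations issued in each cycle when iterations start every `ii` cycles."""
    total_cycles = (iterations - 1) * ii + len(stages)
    rows = [[] for _ in range(total_cycles)]
    for it in range(iterations):
        for s, op in enumerate(stages):
            rows[it * ii + s].append(f"{op}[{it}]")
    return rows


stages = ["ld", "mul", "sub", "st"]
rows = pipeline_rows(stages, iterations=4)
for cycle, row in enumerate(rows, start=1):
    # With II = 1, a cycle is part of the kernel when all stages are active.
    phase = "kernel" if len(row) == len(stages) else (
        "prologue" if cycle < len(stages) else "epilogue")
    print(cycle, phase, " ".join(row))
```

With four iterations only one cycle has all four stages active; with n iterations the kernel row repeats n - 3 times, which is why the kernel becomes the loop body of the pipelined code.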
Software pipelining: determine II, Initiation Interval
  Cyclic data dependences
    For (i=0;.....) A[i+6]= 3*A[i]-1
  [Figure: dependence graph of ld r1, (r2); mul r3, r1, 3; sub r4, r3, 1; st r4, (r5), with edges labelled (delay, distance): (1,0), (0,1), (1,6)]
  Constraint for every dependence edge (u,v):
    $cycle(v) \ge cycle(u) + delay(u,v) - II \cdot distance(u,v)$
Modulo scheduling constraints
  MII, the minimum initiation interval, is bounded by cyclic dependences and resources:
    MII = max{ ResMII, RecMII }
  [Formulas: the bound from resources (ResMII) and the bound from dependence cycles (RecMII)]
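One common way to compute the two bounds, given here as a sketch rather than as the formulas from the slide: ResMII is the largest value, over all function-unit types, of (operations per iteration using that type) divided by (units of that type), rounded up. For RecMII, adding up the constraint cycle(v) >= cycle(u) + delay(u,v) - II.distance(u,v) around a dependence cycle gives II >= (sum of delays) / (sum of distances), again rounded up. The resource counts and the per-edge delays in the example are assumptions for illustration, as are all function names.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def res_mii(uses, units):
    """uses: FU type -> operations per iteration; units: FU type -> available units."""
    return max(ceil_div(uses[fu], units[fu]) for fu in uses)


def rec_mii(dependence_cycles):
    """Each cycle is a list of (delay, distance) edge labels; total distance must be > 0."""
    bounds = [
        ceil_div(sum(d for d, _ in cyc), sum(k for _, k in cyc))
        for cyc in dependence_cycles
    ]
    return max(bounds, default=1)


def mii(uses, units, dependence_cycles):
    return max(res_mii(uses, units), rec_mii(dependence_cycles))


# Assumed labels for ld -> mul -> sub -> st -> ld: delay 1 on every edge, and
# distance 6 on the store-to-load edge because a[i+6] is read 6 iterations later.
loop_cycle = [(1, 0), (1, 0), (1, 0), (1, 6)]
uses = {"mem": 2, "mul": 1, "alu": 1}  # ld and st both use the memory unit

print(mii(uses, {"mem": 1, "mul": 1, "alu": 1}, [loop_cycle]))
print(mii(uses, {"mem": 2, "mul": 1, "alu": 1}, [loop_cycle]))
```

Under these assumptions the recurrence allows II = 1 (4 cycles of delay spread over a distance of 6 iterations), so a single memory unit is the bottleneck; a second memory unit brings MII down to 1.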
The Role of the Compiler
  9 steps required to translate an HLL program:
    1. Front-end compilation
    2. Determine dependencies
    3. Graph partitioning: make multiple threads (or tasks)
    4. Bind partitions to compute nodes
    5. Bind operands to locations
    6. Bind operations to time slots: Scheduling
    7. Bind operations to functional units
    8. Bind transports to buses
    9. Execute operations and perform transports
Division of responsibilities between hardware and compiler
  Steps from application to execution:
    Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports, Execute
  Last step that is the responsibility of the compiler (the hardware is responsible for the rest):
    Architecture    Last compiler step
    Superscalar     Frontend
    Dataflow        Determine Dependencies
    Multi-threaded  Binding of Operands
    Indep. Arch     Scheduling
    VLIW            Binding of Operations
    TTA             Binding of Transports
Hands-on (not this year)
  Map JPEG to a TTA processor
    see web page: http://www.ics.ele.tue.nl/~heco/courses/pam
    Install TTA tools (compiler and simulator)
    Go through all listed steps
    Perform DSE: design space exploration
    Add SFU
  1 or 2 page report in 2 weeks
Hands-on
  Let’s look at DSE: Design Space Exploration
  We will use the Imagine processor
    http://cva.stanford.edu/projects/imagine/
Mapping applications to processors
  MOVE framework
  [Figure: an optimizer, steered by user interaction, sets the architecture parameters for a parametric compiler and a hardware generator, and both return feedback to the optimizer. The compiler produces parallel object code and the hardware generator produces the chip; together they form the TTA based system. Inset: Pareto curve (solution space), cost versus exec. time]
Code generation trajectory for TTAs
  Frontend: GCC or SUIF (adapted)
  [Figure: Application (C) -> Compiler frontend -> Sequential code -> Compiler backend -> Parallel code. The backend also takes the architecture description and profiling data as input. Sequential simulation and parallel simulation each have their own Input/Output]
Exploration: TTA resource reduction
Exploration: TTA connectivity reduction
  [Figure: execution time versus number of connections removed. Annotations: reducing bus delay; FU stage constrains cycle time; critical connections disappear]
Can we do better? Yes! How?
  Transformations
  SFUs: Special Function Units
  Multiple Processors
  [Figure: cost versus execution time]
Transforming the specification
  [Figure: a chain of + operations rearranged into a tree of + operations]
  Based on associativity of + operation:
    a + (b + c) = (a + b) + c
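A small sketch of why this transformation helps: rebalancing a chain of additions shortens the longest dependence path, so more additions can run in parallel. The operand count (five) and the helper names `chain`, `balanced` and `depth` are hypothetical choices for illustration. The rewrite is exact for integer arithmetic; for floating point, reordering additions can change rounding.

```python
def chain(operands):
    """Build (((a + b) + c) + d) ... as nested tuples."""
    expr = operands[0]
    for x in operands[1:]:
        expr = ("+", expr, x)
    return expr


def balanced(operands):
    """Build a balanced tree over the same operands, using associativity of +."""
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    return ("+", balanced(operands[:mid]), balanced(operands[mid:]))


def depth(expr):
    """Number of additions on the longest dependence path."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + max(depth(expr[1]), depth(expr[2]))


operands = ["a", "b", "c", "d", "e"]
print(depth(chain(operands)), depth(balanced(operands)))
```

Both forms use four additions, but the chain needs four dependent steps while the balanced tree needs three, given at least two adders.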
Transforming the specification
  Original:
    d = a * b;
    e = a + d;
    f = 2 * b + d;
    r = f - e;
    x = z + y;
  Transformed:
    r = 2*b - a;
    x = z + y;
  [Figure: DDG of the transformed code: b << 1, then subtract a to obtain r; z + y gives x]
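The transformed code follows from substituting the definitions of e and f into r; the product d cancels, so the multiplication a * b is no longer needed. The last step assumes integer operands, where doubling equals a left shift by one.

```latex
\begin{align*}
r &= f - e \\
  &= (2b + d) - (a + d) \\
  &= 2b - a \\
  &= (b \ll 1) - a
\end{align*}
```

The original code needs two multiplications, two additions and a subtraction to produce r; the transformed code needs one shift and one subtraction.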
Changing the architecture
  adding SFUs: special function units
  [Figure: several two-input + operations replaced by a single 4-input adder]
  Why is this faster?
Changing the architecture
  adding SFUs: special function units
  In the extreme case put everything into one unit!
    Spatial mapping - no control flow
  However: no flexibility / programmability!
SFUs: fine grain patterns
  Why use fine grain SFUs:
    Code size reduction
    Register file #ports reduction
    Could be cheaper and/or faster
    Transport reduction
    Power reduction (avoid charging non-local wires)
    Supports whole application domain!
  Which patterns need support?
    Detection of recurring operation patterns needed
SFUs: covering results
Exploration: resulting architecture
  Architecture for image processing:
    9 buses
    4 RFs
    4 Addercmp FUs
    2 Multiplier FUs
    2 Diffadd FUs
    stream input and stream output
  Note the reduced connectivity
Conclusions
  Billions of embedded processing systems
    how to design these systems quickly, cheaply, correctly, at low power, ...?
    what will their processing platform look like?
  VLIWs are very powerful and flexible
    can be easily tuned to application domain
  TTAs are even more flexible, scalable, and lower power
Conclusions
  Compilation for ILP architectures is maturing and is entering the commercial area.
  However:
    Great discrepancy between available and exploitable parallelism
    Advanced code scheduling techniques needed to exploit ILP
Bottom line:
  Do not pay for hardware if you can do it in software!