1
Processor Architectures and Program Mapping
Exploiting ILP part 2: code generation
TU/e 5kk10
Henk Corporaal, Jef van Meerbergen, Bart Mesman
2
Overview
Enhance performance: architecture methods
Instruction Level Parallelism
VLIW
Examples: C6, TM, TTA
Clustering
Code generation
Hands-on
1/15/2019 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman
3
Overview
Compiler basics
  Compiler trajectory / structure / passes
  Control Flow Graph (CFG)
Mapping and Scheduling
  Basic block list scheduling
  Extended scheduling scope
  Loop scheduling
4
Compiler basics: trajectory
Figure: compilation trajectory. Stages: Source program, Preprocessor, Compiler, Assembler, Loader/Linker, Object program. Side boxes: Error messages, Library code.
5
Compiler basics: structure / passes
Source code
Lexical analyzer
  token generation
Parsing
  check syntax
  check semantic
  parse tree generation
Intermediate code
Code optimization
  data flow analysis
  local optimizations
  global optimizations
Code generation
  code selection
  peephole optimizations
Register allocation
  making interference graph
  graph coloring
  spill code insertion
  caller / callee save and restore code
Sequential code
Scheduling and allocation
  exploiting ILP
Object code
6
Compiler basics: structure
Simple compilation example
Source:
  position := initial + rate * 60
After the lexical analyzer:
  id1 := id2 + id3 * 60
After the syntax analyzer:
  parse tree for := with left child id1 and right child + (id2, * (id3, 60))
After the intermediate code generator:
  temp1 := intoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3
After the code optimizer:
  temp1 := id3 * 60.0
  id1 := id2 + temp1
After the code generator:
  movf id3, r2
  mulf #60, r2, r2
  movf id2, r1
  addf r2, r1
  movf r1, id1
7
Compiler basics: Control flow graph (CFG)
C input code:
  if (a > b) { r = a % b; }
  else { r = b % a; }
CFG:
  Block 1: sub t1, a, b
           bgz t1, 2, 3
  Block 2: rem r, a, b
           goto 4
  Block 3: rem r, b, a
           goto 4
  Block 4: .....
A program is a collection of functions; each function is a collection of basic blocks (BBs); each BB contains a set of instructions; each instruction consists of several transports.
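The CFG on this slide can be written down as a small data structure. The following is a minimal sketch, not the representation used by any particular compiler; the names `BasicBlock`, `build_if_else_cfg`, and `predecessors` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BasicBlock:
    """One CFG node: straight-line instructions plus outgoing edges."""
    name: str
    instructions: list[str]
    successors: list[str] = field(default_factory=list)


def build_if_else_cfg() -> dict[str, BasicBlock]:
    """CFG of the slide's if/else example (block 4 is left empty)."""
    blocks = [
        BasicBlock("1", ["sub t1, a, b", "bgz t1, 2, 3"], ["2", "3"]),
        BasicBlock("2", ["rem r, a, b", "goto 4"], ["4"]),
        BasicBlock("3", ["rem r, b, a", "goto 4"], ["4"]),
        BasicBlock("4", [], []),
    ]
    return {b.name: b for b in blocks}


def predecessors(cfg: dict[str, BasicBlock]) -> dict[str, list[str]]:
    """Invert the successor lists to obtain the predecessors of each block."""
    preds: dict[str, list[str]] = {name: [] for name in cfg}
    for block in cfg.values():
        for succ in block.successors:
            preds[succ].append(block.name)
    return preds


CFG = build_if_else_cfg()
PREDS = predecessors(CFG)
```

Block 4 has two predecessors (2 and 3), which is what makes it a join point in the graph.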
8
Mapping / Scheduling: placing operations in space and time
Code:
  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;
Figure: Data Dependence Graph (DDG) of this code, with inputs a, b, 2, z, y; operation nodes *, *, +, +, +, -; and results d, e, f, r, x.
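As a rough illustration of how such a graph can be derived, the sketch below extracts flow (read-after-write) dependences between the five statements. It works at statement granularity rather than per operation, so `f = 2 * b + d` is one node here although the slide's DDG shows it as two operations. The function name `build_ddg` is hypothetical, and the parser assumes simple `target = expression` statements.

```python
import re


def build_ddg(statements):
    """Return flow-dependence edges as (producer_index, consumer_index)."""
    last_writer = {}  # variable name -> index of the statement that defined it
    edges = []
    for idx, stmt in enumerate(statements):
        target, expr = [s.strip() for s in stmt.split("=")]
        # Every identifier read on the right-hand side may create an edge.
        for name in sorted(set(re.findall(r"[A-Za-z_]\w*", expr))):
            if name in last_writer:
                edges.append((last_writer[name], idx))
        last_writer[target] = idx
    return edges


STATEMENTS = ["d = a * b", "e = a + d", "f = 2 * b + d", "r = f - e", "x = z + y"]
EDGES = build_ddg(STATEMENTS)
```

The statement `x = z + y` has no incoming or outgoing edges, so it can be placed in any cycle with a free adder.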
9
How to map these operations?
Architecture constraints:
  One Function Unit
  All operations single cycle latency
Figure: the DDG (inputs a, b, 2, z, y; results d, e, f, r, x) next to the resulting schedule:
  cycle   FU
  1       *
  2       *
  3       +
  4       +
  5       -
  6       +
10
How to map these operations?
Architecture constraints:
  One Add-sub and one Mul unit
  All operations single cycle latency
Figure: the DDG (inputs a, b, 2, z, y; results d, e, f, r, x) next to the resulting schedule:
  cycle   Mul   Add-sub
  1       *     +
  2       *     +
  3             +
  4             -
  5
  6
11
There are many mapping solutions
Figure: Pareto curve (solution space), plotting execution time (T execution) against cost.
12
Basic Block Scheduling
Make a dependence graph
Determine minimal length
Determine ASAP, ALAP, and slack of each operation
Place each operation in the first cycle with sufficient resources
Note:
  Scheduling order is sequential
  Priority is determined by the heuristic used, e.g. slack
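The ASAP, ALAP, and slack computation can be sketched as follows. This is a minimal illustration under stated assumptions: every operation has the same latency, cycles are numbered from 1, and `nodes` is already in topological order. The operation names and the operation-level edges are an assumed reading of the `d`, `e`, `f`, `r`, `x` example from the mapping slides.

```python
def asap_alap_slack(nodes, edges, latency=1):
    """nodes must be topologically ordered; edges are (producer, consumer)."""
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # ASAP: earliest cycle allowed by the predecessors.
    asap = {}
    for n in nodes:
        asap[n] = max((asap[p] + latency for p in preds[n]), default=1)

    # The minimal schedule length is the longest dependence chain.
    length = max(asap.values())

    # ALAP: latest cycle that still meets the minimal length.
    alap = {}
    for n in reversed(nodes):
        alap[n] = min((alap[s] - latency for s in succs[n]), default=length)

    slack = {n: alap[n] - asap[n] for n in nodes}
    return asap, alap, slack


# Assumed operation-level DDG of: d=a*b; e=a+d; f=2*b+d; r=f-e; x=z+y
NODES = ["d_mul", "2b_mul", "e_add", "f_add", "r_sub", "x_add"]
DEPS = [("d_mul", "e_add"), ("d_mul", "f_add"), ("2b_mul", "f_add"),
        ("e_add", "r_sub"), ("f_add", "r_sub")]
ASAP, ALAP, SLACK = asap_alap_slack(NODES, DEPS)
```

Only `x_add` has non-zero slack in this example; every other operation lies on a critical path of length 3.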
13
Basic Block Scheduling
Figure: example dependence graph in which each operation is annotated <ASAP cycle, ALAP cycle>; the slack is the difference between the two. Operations shown: ADD, SUB, ADD, NEG, LD, LD, MUL, ADD, with inputs A, B, C and results z, X, y. Annotations shown: <1,1>, <2,2>, <3,3>, <1,3>, <2,3>, <4,4>, <2,4>, <1,4>.
14
Cycle based list scheduling
proc Schedule(DDG = (V,E))
beginproc
  ready = { v | ¬∃ (u,v) ∈ E }
  ready’ = ready
  sched = ∅
  current_cycle = 0
  while sched ≠ V do
    for each v ∈ ready’ do
      if ¬ResourceConfl(v, current_cycle, sched) then
        cycle(v) = current_cycle
        sched = sched ∪ {v}
      endif
    endfor
    current_cycle = current_cycle + 1
    ready = { v | v ∉ sched ∧ ∀ (u,v) ∈ E, u ∈ sched }
    ready’ = { v | v ∈ ready ∧ ∀ (u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
  endwhile
endproc
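A runnable approximation of this procedure is sketched below. It is a simplification under stated assumptions rather than the slide's exact algorithm: priority is simply the order in which operations are listed, every edge has the same delay, the DDG is acyclic, cycles are numbered from 1, and the resource check is a per-cycle count of free units of each type. The names `list_schedule`, `OPS_TWO_UNITS`, and `OPS_ONE_UNIT` are hypothetical.

```python
def list_schedule(ops, edges, units, delay=1):
    """ops: op name -> required unit type (dict order acts as priority).
    units: unit type -> number of units. Returns op name -> cycle."""
    missing = {t for t in ops.values() if units.get(t, 0) < 1}
    if missing:
        raise ValueError(f"no unit available for: {sorted(missing)}")

    preds = {v: [] for v in ops}
    for u, v in edges:
        preds[v].append(u)

    cycle = {}
    current = 1
    while len(cycle) < len(ops):
        free = dict(units)  # all units are free at the start of each cycle
        for v in ops:
            if v in cycle:
                continue
            # Ready: every predecessor is scheduled and its result is available.
            ready = all(u in cycle and cycle[u] + delay <= current
                        for u in preds[v])
            if ready and free[ops[v]] > 0:
                cycle[v] = current
                free[ops[v]] -= 1
        current += 1
    return cycle


# Assumed operation-level DDG of: d=a*b; e=a+d; f=2*b+d; r=f-e; x=z+y
DEPS = [("d_mul", "e_add"), ("d_mul", "f_add"), ("2b_mul", "f_add"),
        ("e_add", "r_sub"), ("f_add", "r_sub")]
OPS_TWO_UNITS = {"d_mul": "mul", "2b_mul": "mul", "e_add": "addsub",
                 "f_add": "addsub", "r_sub": "addsub", "x_add": "addsub"}
OPS_ONE_UNIT = {op: "fu" for op in OPS_TWO_UNITS}

ONE_FU = list_schedule(OPS_ONE_UNIT, DEPS, {"fu": 1})
TWO_FU = list_schedule(OPS_TWO_UNITS, DEPS, {"mul": 1, "addsub": 1})
```

With one function unit the schedule takes 6 cycles; with one Mul and one Add-sub unit it takes 4, matching the two mapping slides.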
15
Extended basic block scheduling: Code Motion
Block A: a) add r4, r4, 4
         b) beq . . .
Block B: c) add r1, r1, r2
Block C: d) sub r1, r1, r2
Block D: e) st r1, 8(r4)
Downward code motions? a → B, a → C, a → D, c → D, d → D
Upward code motions? c → A, d → A, e → B, e → C, e → A
16
Extended Scheduling scope
Code:
  A;
  If cond Then B Else C;
  D;
  If cond Then E Else F;
  G;
CFG (Control Flow Graph): blocks A, B, C, D, E, F, G
17
Scheduling scopes
Trace
Superblock
Decision tree
Hyperblock/region
18
Code movement (upwards) within regions
Figure: an add operation moved upwards from its source block, through intermediate blocks (I), to the destination block. Legend: copy needed; check for off-liveness; code movement.
19
Extended basic block scheduling: Code Motion
A dominates B ⇔ A is always executed before B
  Consequently: A does not dominate B ⇒ code motion from B to A requires code duplication
B post-dominates A ⇔ B is always executed after A
  Consequently: B does not post-dominate A ⇒ code motion from B to A is speculative
Figure: CFG with blocks A, B, C, D, E, F
Q1: does C dominate E?
Q2: does C dominate D?
Q3: does F post-dominate D?
Q4: does D post-dominate B?
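Dominator sets can be computed with a simple iterative data-flow pass. The sketch below is illustrative only. Because the edges of this slide's figure are not recoverable from the text, it uses a different graph: the A..G CFG implied by the two if-then-else statements on the "Extended Scheduling scope" slide, with the edge list assumed from that code. The function names are hypothetical.

```python
def dominators(succs, entry):
    """Return node -> set of nodes that dominate it (including itself)."""
    nodes = list(succs)
    preds = {n: [] for n in nodes}
    for u, targets in succs.items():
        for v in targets:
            preds[v].append(u)

    # Start from "everything dominates everything" and shrink to a fixpoint.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def reverse_graph(succs):
    """Flip every edge; post-dominators are dominators of the reversed CFG."""
    rev = {n: [] for n in succs}
    for u, targets in succs.items():
        for v in targets:
            rev[v].append(u)
    return rev


# Assumed edges for: A; if cond then B else C; D; if cond then E else F; G
SUCCS = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
         "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
DOM = dominators(SUCCS, "A")
POSTDOM = dominators(reverse_graph(SUCCS), "G")
```

In this graph B does not dominate D, so moving code from D up into B would also need a copy in C; D post-dominates B, so moving code from D up into B is not speculative.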
20
Scheduling: Loops
Loop optimizations:
  Loop unrolling
  Loop peeling
Figure: example loop with blocks A and B.
21
Scheduling: Loops
Problems with unrolling:
  Exploits only parallelism within sets of n iterations
  Iteration start-up latency
  Code expansion
Figure: resource utilization over time for basic block scheduling, basic block scheduling and unrolling, and software pipelining.
22
Software pipelining
Software pipelining a loop is:
  Scheduling the loop such that iterations start before preceding iterations have finished
Or:
  Moving operations across the backedge
Example: y = a.x (each iteration is LD, ML, ST)
  No overlap: 3 cycles/iteration
    LD
    ML
    ST
  Unrolling (3 iterations in 5 cycles): 5/3 cycles/iteration
    LD
    LD ML
    LD ML ST
    ML ST
    ST
  Software pipelining: 1 cycle/iteration
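The cycles-per-iteration figures on this slide follow from a simple count. The sketch below reproduces them under the assumption that each stage takes one cycle and that a new iteration can start every cycle within an unrolled group (enough resources, no loop-carried dependence). The helper names are hypothetical.

```python
from fractions import Fraction

STAGES = ["LD", "ML", "ST"]


def overlapped_rows(stages, iterations):
    """Operations issued in each cycle when `iterations` copies overlap."""
    rows = []
    for cycle in range(len(stages) + iterations - 1):
        row = [stages[cycle - i] for i in range(iterations)
               if 0 <= cycle - i < len(stages)]
        rows.append(row)
    return rows


def cycles_per_iteration(stages, unroll):
    """A group of `unroll` overlapped iterations needs stages + unroll - 1 cycles."""
    return Fraction(len(stages) + unroll - 1, unroll)


NO_OVERLAP = cycles_per_iteration(STAGES, 1)
UNROLL_3 = cycles_per_iteration(STAGES, 3)
ROWS_3 = overlapped_rows(STAGES, 3)
```

As the unroll factor grows, the ratio approaches 1 cycle per iteration, which is the rate software pipelining reaches in its steady state without the code expansion.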
23
Software pipelining (cont’d)
Basic techniques:
  Modulo scheduling (Rau, Lam)
    list scheduling with modulo resource constraints
  Kernel recognition techniques
    unroll the loop
    schedule the iterations
    identify a repeating pattern
    Examples:
      Perfect pipelining (Aiken and Nicolau)
      URPR (Su, Ding and Xia)
      Petri net pipelining (Allan)
  Enhanced pipeline scheduling (Ebcioğlu)
    fill first cycle of iteration
    copy this instruction over the backedge
24
Software pipelining: Modulo scheduling
Example: Modulo scheduling a loop
(a) Example loop:
  for (i = 0; i < n; i++)
    a[i+6] = 3* a[i] - 1;
(b) Code without loop control:
  ld r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st r4,(r5)
(c) Software pipeline:
  Figure: staggered copies of the loop body from (b), one per iteration, divided into Prologue, Kernel, and Epilogue.
Prologue fills the SW pipeline with iterations
Epilogue drains the SW pipeline
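The prologue / kernel / epilogue structure can be generated mechanically. The sketch below is a simplified model, not a real modulo scheduler: it assumes an initiation interval of 1, one pipeline stage per instruction, unit latencies, and at least as many iterations as stages. It also ignores register allocation, so the overlapping iterations would in practice need renamed or rotating registers. The function name `software_pipeline` is hypothetical.

```python
BODY = ["ld r1,(r2)", "mul r3,r1,3", "sub r4,r3,1", "st r4,(r5)"]


def software_pipeline(body, n_iterations):
    """Return (prologue, kernel, epilogue); each is a list of cycles,
    and each cycle is a list of (instruction, iteration_number)."""
    stages = len(body)
    if n_iterations < stages:
        raise ValueError("need at least as many iterations as stages")

    rows = []
    for t in range(n_iterations + stages - 1):
        # In cycle t, stage s works on iteration t - s.
        row = [(body[s], t - s) for s in range(stages)
               if 0 <= t - s < n_iterations]
        rows.append(row)

    prologue = rows[:stages - 1]            # pipeline filling up
    kernel = rows[stages - 1:n_iterations]  # all stages busy
    epilogue = rows[n_iterations:]          # pipeline draining
    return prologue, kernel, epilogue


PROLOGUE, KERNEL, EPILOGUE = software_pipeline(BODY, 6)
```

Every kernel cycle contains all four instructions, each working on a different iteration; only the kernel needs to be emitted as the loop body.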
25
Software pipelining: determine II, Initiation Interval
Cyclic data dependences
  For (i=0;.....) A[i+6]= 3*A[i]-1
Figure: dependence graph over the operations
  ld r1, (r2)
  mul r3, r1, 3
  sub r4, r3, 1
  st r4, (r5)
with edges labelled (delay, distance); labels shown: (0,1), (1,0), (1,6).
Constraint for every edge (u,v):
  cycle(v) ≥ cycle(u) + delay(u,v) - II · distance(u,v)
26
Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences and resources:
  MII = max{ ResMII, RecMII }
Resources:
Cycles:
Therefore:
Or:
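The formulas on this slide did not survive extraction. As a hedged illustration, the sketch below uses the commonly cited textbook definitions, which may differ in notation from the slide: ResMII is the largest ratio of operations needing a resource type to units of that type, rounded up, and RecMII is the largest ratio of total delay to total distance over the dependence cycles, rounded up. The resource counts and the assignment of (delay, distance) labels to edges are assumptions made for this example.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def res_mii(uses, available):
    """uses: resource type -> operations per iteration needing it.
    available: resource type -> number of units."""
    return max(ceil_div(uses[r], available[r]) for r in uses)


def rec_mii(cycles):
    """cycles: list of dependence cycles, each a list of (delay, distance)."""
    best = 1
    for cycle in cycles:
        total_delay = sum(delay for delay, _ in cycle)
        total_distance = sum(distance for _, distance in cycle)
        if total_distance <= 0:
            raise ValueError("a dependence cycle must span at least one iteration")
        best = max(best, ceil_div(total_delay, total_distance))
    return best


# Assumed machine: one memory unit (ld, st) and one ALU (mul, sub).
USES = {"mem": 2, "alu": 2}
AVAILABLE = {"mem": 1, "alu": 1}
# Assumed cycle ld -> mul -> sub -> st -> ld; the store feeds a load 6 iterations later.
CYCLES = [[(1, 0), (1, 0), (1, 0), (1, 6)]]

RES_MII = res_mii(USES, AVAILABLE)
REC_MII = rec_mii(CYCLES)
MII = max(RES_MII, REC_MII)
```

Under these assumptions the loop is resource bound: the recurrence alone would allow II = 1, but two memory operations on one memory unit force II = 2.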
27
The Role of the Compiler
9 steps required to translate an HLL program:
  1. Front-end compilation
  2. Determine dependencies
  3. Graph partitioning: make multiple threads (or tasks)
  4. Bind partitions to compute nodes
  5. Bind operands to locations
  6. Bind operations to time slots: Scheduling
  7. Bind operations to functional units
  8. Bind transports to buses
  9. Execute operations and perform transports
28
Division of responsibilities between hardware and compiler
Figure: for each architecture class, the steps from the application down to and including the listed step are the responsibility of the compiler; the remaining steps, down to Execute, are the responsibility of the hardware.
  Architecture      Last step done by the compiler
  Superscalar       Frontend
  Dataflow          Determine Dependencies
  Multi-threaded    Binding of Operands
  Indep. Arch       Scheduling
  VLIW              Binding of Operations
  TTA               Binding of Transports
29
Overview
Enhance performance: architecture methods
Instruction Level Parallelism
VLIW
Examples: C6, TM, TTA
Clustering
Code generation
Hands-on
30
Hands-on (not this year)
Map JPEG to a TTA processor
  see web page:
  Install TTA tools (compiler and simulator)
  Go through all listed steps
  Perform DSE: design space exploration
  Add SFU
1 or 2 page report in 2 weeks
31
Hands-on
Let’s look at DSE: Design Space Exploration
We will use the Imagine processor
32
Mapping applications to processors: MOVE framework
Figure: the MOVE framework. Components: user interaction, optimizer, architecture parameters, parametric compiler, hardware generator, and feedback paths to the optimizer. Outputs: parallel object code and a chip, together forming a TTA based system. Inset: Pareto curve (solution space) of cost versus execution time.
33
Code generation trajectory for TTAs
Frontend: GCC or SUIF (adapted)
Figure: code generation trajectory. Elements: Application (C), Compiler frontend, Sequential code, Sequential simulation (with Input/Output), Profiling data, Architecture description, Compiler backend, Parallel code, Parallel simulation (with Input/Output).
34
Exploration: TTA resource reduction
35
Exploration: TTA connectivity reduction
Figure: execution time versus number of connections removed. Annotations: reducing bus delay; FU stage constrains cycle time; critical connections disappear.
36
Can we do better? Yes! How?
  Transformations
  SFUs: Special Function Units
  Multiple Processors
Figure: cost versus execution time.
37
Transforming the specification
Figure: graph of + operations before and after the transformation.
Based on associativity of the + operation:
  a + (b + c) = (a + b) + c
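The benefit of re-associating additions is a shorter dependence chain. The sketch below compares a left-to-right chain with a balanced tree for four operands; the helper names are hypothetical, and the equivalence holds exactly for integers (floating-point addition is not exactly associative, so a compiler needs permission to apply this there).

```python
def chain(operands):
    """((a + b) + c) + d: every addition waits for the previous one."""
    tree = operands[0]
    for name in operands[1:]:
        tree = ("+", tree, name)
    return tree


def balanced(operands):
    """(a + b) + (c + d): independent halves can be added in parallel."""
    if len(operands) == 1:
        return operands[0]
    mid = len(operands) // 2
    return ("+", balanced(operands[:mid]), balanced(operands[mid:]))


def depth(tree):
    """Length of the longest dependence chain, in additions."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(depth(tree[1]), depth(tree[2]))


def evaluate(tree, env):
    """Evaluate an expression tree given operand values."""
    if not isinstance(tree, tuple):
        return env[tree]
    return evaluate(tree[1], env) + evaluate(tree[2], env)


OPERANDS = ["a", "b", "c", "d"]
ENV = {"a": 1, "b": 2, "c": 3, "d": 4}
CHAIN, TREE = chain(OPERANDS), balanced(OPERANDS)
```

Both forms use three additions, but the balanced tree has depth 2 instead of 3, so with two adders it finishes one cycle earlier.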
38
Transforming the specification
Original:
  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;
Transformed:
  r = 2*b - a;
  x = z + y;
Figure: DDG of the transformed code: b << 1, then - a, gives r; z + y gives x.
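The simplification follows from r = f - e = (2*b + d) - (a + d) = 2*b - a, since d cancels. The sketch below checks the equivalence on a small range of integers, with 2*b written as a left shift as in the figure. It assumes d, e, and f are not needed after this code; the function names are hypothetical.

```python
def original(a, b, z, y):
    """The specification as written: 2 multiplies, 3 additions, 1 subtraction."""
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x


def transformed(a, b, z, y):
    """After simplification: 1 shift, 1 subtraction, 1 addition."""
    r = (b << 1) - a
    x = z + y
    return r, x


SAMPLES = [(a, b, z, y)
           for a in range(-4, 5) for b in range(-4, 5)
           for z in (-1, 3) for y in (0, 7)]
ALL_EQUAL = all(original(*s) == transformed(*s) for s in SAMPLES)
```

The operation count drops from six to three and both multiplications disappear, so the transformed code no longer needs a multiplier at all.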
39
Changing the architecture adding SFUs: special function units
Figure: graph of + operations replaced by a single 4-input adder.
Why is this faster?
40
Changing the architecture adding SFUs: special function units
In the extreme case put everything into one unit!
Spatial mapping - no control flow
However: no flexibility / programmability !!
41
SFUs: fine grain patterns
Why use fine grain SFUs:
  Code size reduction
  Register file #ports reduction
  Could be cheaper and/or faster
  Transport reduction
  Power reduction (avoid charging non-local wires)
  Supports whole application domain !
Which patterns need support?
  Detection of recurring operation patterns needed
42
SFUs: covering results
43
Exploration: resulting architecture
Architecture for image processing:
  9 buses
  4 RFs
  4 Addercmp FUs
  2 Multiplier FUs
  2 Diffadd FUs
  stream input and stream output
Note the reduced connectivity
44
Conclusions
Billions of embedded processing systems
  how to design these systems quickly, cheaply, correctly, with low power, ...?
  what will their processing platform look like?
VLIWs are very powerful and flexible
  can be easily tuned to the application domain
TTAs are even more flexible, scalable, and lower power
45
Conclusions
Compilation for ILP architectures is maturing and is entering the commercial area.
However:
  There is a great discrepancy between available and exploitable parallelism
  Advanced code scheduling techniques are needed to exploit ILP
46
Bottom line: Do not pay for hardware if you can do it by software !!