
1 Embedded Systems in Silicon (TD5102): Compilers with emphasis on ILP compilation. Henk Corporaal, Technical University Eindhoven / DTI / NUS Singapore, 2005/2006

2 H.C. TD5102 Compiling for ILP Architectures
Overview:
–Motivation and Goals
–Measuring and exploiting available parallelism
–Compiler basics
–Scheduling for ILP architectures
–Summary and Conclusions

3 Motivation
–Performance requirements increase
–Applications may contain much instruction-level parallelism
–Processors offer lots of hardware concurrency
Problem to be solved: how to exploit this concurrency automatically?

4 Goals of code generation
–High speedup: exploit all the hardware concurrency; extract all application parallelism (obey true dependences only; resolve false dependences by renaming)
–No code rewriting: automatic parallelization (however, application tuning may be required)
–Limit code expansion

5 Overview
–Motivation and Goals
–Measuring and exploiting available parallelism
–Compiler basics
–Scheduling for ILP architectures
–Summary and Conclusions

6 Measuring and exploiting available parallelism
How to measure parallelism within applications?
–Using an existing compiler
–Using trace analysis:
 Track all real data dependences (RaWs) of instructions from the issue window: register dependences, memory dependences
 Check for correct branch prediction: if the prediction is correct, continue; if wrong, flush the schedule and restart in the next cycle

7 Trace analysis
Program:
for i := 0..2 A[i] := i;
S := X+3;
Compiled code:
 set r1,0
 set r2,3
 set r3,&A
Loop: st r1,0(r3)
 add r1,r1,1
 add r3,r3,4
 brne r1,r2,Loop
 add r1,r5,3
Execution trace:
 set r1,0 | set r2,3 | set r3,&A
 st r1,0(r3) | add r1,r1,1 | add r3,r3,4 | brne r1,r2,Loop   (3 times)
 add r1,r5,3
How parallel can this code be executed?

8 Trace analysis
Parallel trace:
 set r1,0 | set r2,3 | set r3,&A
 st r1,0(r3) | add r1,r1,1 | add r3,r3,4
 st r1,0(r3) | add r1,r1,1 | add r3,r3,4
 brne r1,r2,Loop | brne r1,r2,Loop
 add r1,r5,3
Speedup = L_serial / L_parallel = 16 / 6 = 2.7
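As a sketch (not from the slides), the trace-analysis model of slide 6 can be replayed in a few lines: track only the RaW register dependences and let every instruction issue as early as possible with 1-cycle latency and unlimited resources. Because this model also lets the branches and the final add float freely (perfect prediction, no control dependences), it finds a 5-cycle schedule, one cycle shorter than the 6-cycle parallel trace of slide 8.

```python
# Replay slide 7's execution trace: each instruction is (dst, srcs);
# dst is None for stores and branches, which write no register.
body = [(None, ["r1", "r3"]),   # st r1,0(r3)
        ("r1", ["r1"]),         # add r1,r1,1
        ("r3", ["r3"]),         # add r3,r3,4
        (None, ["r1", "r2"])]   # brne r1,r2,Loop
trace = [("r1", []), ("r2", []), ("r3", [])] + 3 * body + [("r1", ["r5"])]

ready = {}                      # register -> cycle in which its current value is ready
cycles = []
for dst, srcs in trace:
    # issue as soon as all read values are available (1-cycle latency)
    c = 1 + max([ready.get(s, 0) for s in srcs], default=0)
    cycles.append(c)
    if dst:
        ready[dst] = c          # sequential replay handles register versions

length = max(cycles)
speedup = len(trace) / length
```

With the adds of each iteration serializing on r1 and r3, the three iterations still take one cycle each, so the loop itself bounds the schedule.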

9 Ideal Processor
Assumptions for an ideal/perfect processor:
1. Register renaming: infinite number of virtual registers => all register WAW & WAR hazards avoided
2. Branch and jump prediction: perfect => all program instructions available for execution
3. Memory-address alias analysis: addresses are known; a store can be moved before a load provided the addresses are not equal
Also:
–unlimited number of instructions issued per cycle (unlimited resources)
–unlimited instruction window
–perfect caches
–1-cycle latency for all instructions (also FP multiply and divide)
Programs were compiled using the MIPS compiler with maximum optimization level.

10 Upper Limit to ILP: Ideal Processor (figure: measured IPC for integer and FP benchmarks on the ideal processor)

11 Different effects reduce the exploitable parallelism:
–Reducing window size, i.e., the number of instructions to choose from
–Non-perfect branch prediction: perfect (oracle model); dynamic predictor (e.g. 2-bit prediction table with a finite number of entries); static prediction (using profiling); no prediction
–Restricted number of registers for renaming: typical superscalars have O(100) registers
–Restricted number of other resources, like FUs

12 Non-perfect alias analysis (memory disambiguation)
Models to use:
–perfect
–inspection: no dependence in cases like the pairs
  r1 := 0(r9) and 4(r9) := r2 (same base register, different offsets)
  r1 := 0(fp) and 0(gp) := r2 (a stack access cannot alias a global access)
 A more advanced analysis may disambiguate most stack and global references, but not the heap references
–none
Important: good branch prediction, 128 registers for renaming, alias analysis on stack and global accesses, and (for floating point) a large window size.

13 Summary
–The amount of parallelism is limited: higher in multimedia; higher in kernels
–Trace analysis detects all types of parallelism: task, data and operation types
–The detected parallelism depends on: quality of the compiler; hardware; source-code transformations

14 Overview
–Motivation and Goals
–Measuring and exploiting available parallelism
–Compiler basics
–Scheduling for ILP architectures
–Source level transformations
–Compilation frameworks
–Summary and Conclusions

15 Compiler basics
Overview:
–Compiler trajectory / structure / passes
–Abstract Syntax Tree (AST)
–Control Flow Graph (CFG)
–Data Dependence Graph (DDG)
–Basic optimizations
–Register allocation
–Code selection

16 Compiler basics: trajectory
Source program -> Preprocessor -> Compiler -> Assembler -> Loader/Linker -> Object program
(each phase may emit error messages; the loader/linker brings in library code)

17 Compiler basics: structure / passes
Source code
 -> Lexical analyzer (token generation)
 -> Parsing (check syntax, check semantics, parse tree generation)
 -> Intermediate code
 -> Code optimization (data flow analysis, local optimizations, global optimizations)
 -> Code generation (code selection, peephole optimizations) -> sequential code
 -> Register allocation (making the interference graph, graph coloring, spill code insertion, caller/callee save and restore code)
 -> Scheduling and allocation (exploiting ILP)
 -> Object code

18 Compiler basics: structure
Simple compilation example for "position := initial + rate * 60":
After the lexical analyzer: id1 := id2 + id3 * 60
After the syntax analyzer: parse tree for id1 := id2 + (id3 * 60)
Intermediate code:
 temp1 := inttoreal(60)
 temp2 := id3 * temp1
 temp3 := id2 + temp2
 id1 := temp3
After the code optimizer:
 temp1 := id3 * 60.0
 id1 := id2 + temp1
Code generator output:
 movf id3, r2
 mulf #60.0, r2, r2
 movf id2, r1
 addf r2, r1
 movf r1, id1

19 Compiler basics: structure, SUIF-1 toolkit example
Front ends: C (after pre-processing), FORTRAN (via FORTRAN-to-C)
High-SUIF passes: converting non-standard structures to SUIF, constant propagation, forward propagation, induction variable identification, scalar privatization analysis, reduction analysis, locality optimization and parallelism analysis, parallel code generation, FORTRAN-specific transformations
Output converters: SUIF to text, SUIF to postscript, SUIF to C
Back end: high-SUIF to low-SUIF, constant propagation, strength reduction, dead-code elimination, register allocation, assembly code generation -> assembly code

20 Compiler basics: Abstract Syntax Tree (AST)
C input code:
if (a > b) { r = a % b; } else { r = b % a; }
Parse tree ('infinite', i.e. arbitrarily deep, nesting):
Stat IF
 cond: Cmp > (Var a, Var b)
 then: Statlist [ Stat Expr: Assign (Var r, Binop % (Var a, Var b)) ]
 else: Statlist [ Stat Expr: Assign (Var r, Binop % (Var b, Var a)) ]

21 Compiler basics: Control flow graph (CFG)
C input code:
if (a > b) { r = a % b; } else { r = b % a; }
CFG:
BB1: sub t1,a,b; bgz t1,2,3
BB2: rem r,a,b; goto 4
BB3: rem r,b,a; goto 4
BB4: ...
A Program is a collection of Functions, each Function is a collection of Basic Blocks, each BB contains a set of Instructions, each Instruction consists of several Transports, ...

22 Data Dependence Graph (DDG)
Translation to a DDG of:
a := b + 15; c := 3.14 * d; e := c / f;
DDG:
 a := b + 15: ld &b -> (+ 15) -> st &a
 c := 3.14 * d: ld &d -> (* 3.14) -> st &c
 e := c / f: the result of the multiply and ld &f -> (/) -> st &e
(the division reads the value computed for c: a flow dependence between the statements)

23 Compiler basics: Basic optimizations
–Machine-independent optimizations
–Machine-dependent optimizations
(details are in any good compiler book)

24 Machine-independent optimizations
–Common subexpression elimination
–Constant folding
–Copy propagation
–Dead-code elimination
–Induction variable elimination
–Strength reduction
–Algebraic identities: commutative expressions; associativity (tree height reduction)
–Note: not always allowed (due to limited precision)

25 Machine-dependent optimization example
What's the optimal implementation of a*34?
–Use the multiplier: mul Tb,Ta,34
 Pro: no thinking required
 Con: may take many cycles
–Alternative (34·a = 2·a + 32·a):
 SHL Tc, Ta, 1
 ADD Tb, Tc, Tzero
 SHL Tc, Tc, 4
 ADD Tb, Tb, Tc
 Pros: may take fewer cycles
 Cons: uses more registers; additional instructions (I-cache load / code size)
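The shift/add sequence of slide 25 can be transliterated directly to check that it really computes 34·a (2·a plus 32·a):

```python
# Slide 25's shift/add replacement for a*34, one statement per instruction.
def mul34(ta):
    tc = ta << 1        # SHL Tc, Ta, 1   (Tc = 2*a)
    tb = tc + 0         # ADD Tb, Tc, Tzero
    tc = tc << 4        # SHL Tc, Tc, 4   (Tc = 2*a * 16 = 32*a)
    tb = tb + tc        # ADD Tb, Tb, Tc  (Tb = 2*a + 32*a = 34*a)
    return tb
```

Note that the second shift is applied to Tc (already 2·a), so shifting by 4 rather than 5 yields the 32·a term.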

26 Compiler basics: Register allocation, register organization
Conventions are needed for parameter passing and register usage across function calls; a MIPS-style example:
 r0: hard-wired 0
 r1–r10: argument and result transfer
 r11–r20: caller-saved registers (temporaries)
 r21–r31: callee-saved registers

27 Register allocation using graph coloring
Given a set of registers, what is the most efficient mapping of registers to program variables in terms of execution time of the program?
–A variable is defined at a point in a program when a value is assigned to it.
–A variable is used at a point in a program when its value is referenced in an expression.
–The live range of a variable is the execution range between definitions and uses of the variable.

28 Register allocation using graph coloring
Example program:
 a :=
 c :=
 b :=
  := b
 d :=
  := a
  := c
  := d
Live ranges: a is live from its definition to ":= a"; likewise c to ":= c", b to ":= b", and d to ":= d". At most three variables (a, b, c; later a, c, d) are live at the same time.

29 Register allocation using graph coloring
Interference graph: nodes a, b, c, d; edges between variables whose live ranges overlap (a–b, a–c, a–d, b–c, c–d).
Coloring: a = red, b = green, c = blue, d = green
The graph needs 3 colors (chromatic number = 3) => the program needs 3 registers.
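A minimal sketch of the coloring step on slide 29's interference graph (the edges are read off the live ranges of slide 28; highest-degree-first with smallest-free-color is one common greedy heuristic, not the slides' prescribed method):

```python
# Interference graph of slide 29: an edge means the live ranges overlap.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "d")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

color = {}
for v in sorted(adj, key=lambda n: -len(adj[n])):   # 'a' and 'c' first (degree 3)
    used = {color[u] for u in adj[v] if u in color}
    c = 0
    while c in used:                                # smallest color no neighbor uses
        c += 1
    color[v] = c
```

The greedy pass uses 3 colors and gives b and d the same color, matching the slide's a = red, b = green, c = blue, d = green.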

30 Register allocation using graph coloring: spill/reload code
Spill/reload code is needed when there are not enough colors (registers) to color the interference graph.
Example: only two registers available!
 a :=
 c :=
 store c
 b :=
  := b
 d :=
  := a
 load c
  := c
  := d
(storing c right after its definition and reloading it just before its use splits c's live range, so at most two variables are live at any point)

31 Compiler basics: Code selection
CISC era:
–Code size important
–Determine the shortest code sequence; many options may exist
–Pattern matching. Example (M68020): D1 := D1 + M[ M[10+A1] + 16*D2 + 20 ] maps to the single instruction ADD ([10,A1], D2*16, 20), D1
RISC era:
–Performance important
–Only few possible code sequences
–New implementations of old architectures optimize the RISC part of the instruction set only, e.g. i486 / Pentium / M68020

32 Overview
–Motivation and Goals
–Measuring and exploiting available parallelism
–Compiler basics
–Scheduling for ILP architectures
–Source level transformations
–Compilation frameworks
–Summary and Conclusions

33 What is scheduling?
Time allocation:
–Assigning instructions or operations to time slots
–Preserve dependences: register dependences, memory dependences
–Optimize code with respect to performance / code size / power consumption / ...
Space allocation:
–Satisfy resource constraints: bind operations to FUs; bind variables to registers / register files; bind transports to buses

34 Why scheduling?
Look at the execution time:
T_execution = N_cycles x T_cycle = N_instructions x CPI x T_cycle
Scheduling may reduce T_execution:
–Reduce CPI (cycles per instruction): early scheduling of long-latency operations; avoid pipeline stalls due to structural, data and control hazards; allow N_issue > 1 and therefore CPI < 1
–Reduce N_instructions: compact many operations into each instruction (VLIW)

35 Scheduling data hazards (RaW dependences)
Avoiding RaW stalls: reordering of instructions by the compiler.
Example: avoiding a one-cycle load interlock for
 a = b + c
 d = e - f
Unscheduled code:
 Lw R1,b
 Lw R2,c
 Add R3,R1,R2   (interlock)
 Sw a,R3
 Lw R1,e
 Lw R2,f
 Sub R4,R1,R2   (interlock)
 Sw d,R4
Scheduled code:
 Lw R1,b
 Lw R2,c
 Lw R5,e        (extra register needed!)
 Add R3,R1,R2
 Lw R2,f
 Sw a,R3
 Sub R4,R5,R2
 Sw d,R4

36 Scheduling control hazards
A branch requires 3 actions:
–Compute the new address
–Determine the condition
–Perform the actual branch (if taken): PC := new address
(Pipeline diagram, stages IF ID OF EX WB: while "Branch L" moves down the pipeline, the instructions fetched after it under "predict not taken" must be squashed if the branch to L is taken.)

37 Control hazards: what's the penalty?
CPI = CPI_ideal + f_branch x P_branch
P_branch = N_delayslots x miss_rate
Superscalars tend to have a large branch penalty P_branch due to:
–many pipeline stages
–multiple instructions (or operations) per cycle
Note: the lower the CPI, the larger the effect of penalties.
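Slide 37's two formulas can be exercised with some illustrative numbers (the parameter values below are assumptions, not from the slides). Comparing a 2-issue machine (CPI_ideal = 0.5) with a scalar one (CPI_ideal = 1.0) shows the closing note: the same absolute penalty hurts the low-CPI machine more, relatively.

```python
# P_branch = N_delayslots x miss_rate
def branch_penalty(n_delayslots, miss_rate):
    return n_delayslots * miss_rate

# CPI = CPI_ideal + f_branch x P_branch
def total_cpi(cpi_ideal, f_branch, p_branch):
    return cpi_ideal + f_branch * p_branch

penalty = branch_penalty(3, 0.1)          # 3 delay slots, 10% mispredicted
wide = total_cpi(0.5, 0.2, penalty)       # 2-issue: 0.5 -> 0.56, +12% relative
narrow = total_cpi(1.0, 0.2, penalty)     # scalar:  1.0 -> 1.06, +6% relative
```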

38 What can we do about control hazards and the CPI penalty?
–Keep the penalty P_branch low: early computation of the new PC; early determination of the condition; visible delay slots filled by the compiler (MIPS)
–Branch prediction
–Reduce control dependences (control height reduction) [Schlansker and Kathail, Micro'95]
–Remove branches: if-conversion. Conditional instructions: CMOVE, conditional skip next. Guarding all instructions: TriMedia

39 Scheduling: Conditional instructions
Example: CMOVE (supported by Alpha)
 if (A == 0) S = T;   (assume r1: A, r2: S, r3: T)
Object code:
 Bnez r1, L
 Mov r2, r3
L: ...
After conversion:
 Cmovz r2, r3, r1

40 Scheduling: Conditional instructions
Conditional instructions are useful; however:
–Squashed instructions still take execution time and execution resources; consequence: long target blocks cannot be if-converted
–The condition has to be known early
–Moving operations across multiple branches requires complicated predicates
–Compatibility: change of ISA (instruction set architecture)
Practice:
–Current superscalars support a limited set of conditional instructions: CMOVE (Alpha, MIPS, PowerPC, SPARC); HP PA: any RR instruction can conditionally squash the next instruction
–Large VLIWs profit from making all instructions conditional: guarded execution (TriMedia, Intel/HP IA-64, TI C6x)

41 Guarded execution: IF-conversion
Before:
 SLT r1,r2,r3
 BEQ r1,r0,else
then: ADDI r2,r2,1
 ..X..
 j cont
else: SUBI r2,r2,1
 ..Y..
cont: MUL r4,r2
After IF-conversion:
 SLT b1,r2,r3
 b1: ADDI r2,r2,1
 !b1: SUBI r2,r2,1
 b1: ..X..
 !b1: ..Y..
 MUL r4,r2

42 Scheduling: Conditional instructions
Full guard support: if-conversion of conditional code.
Assume:
 t_branch: branch latency
 p_branch: branching probability (TRUE branch taken)
 t_true: execution time of the TRUE branch
 t_false: execution time of the FALSE branch
Execution times of the original and the if-converted code on a non-ILP architecture:
 t_original_code = (1 + p_branch) x t_branch + p_branch x t_true + (1 - p_branch) x t_false
 t_if_converted_code = t_true + t_false

43 Scheduling: Conditional instructions
Speedup of if-converted code for non-ILP architectures (figure: speedup as a function of block sizes): only interesting for short target blocks!

44 Scheduling: Conditional instructions
Speedup of if-converted code for ILP architectures with sufficient resources:
 t_if_converted = max(t_true, t_false)
Much larger area of interest!
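The formulas of slides 42-44 can be compared numerically. The parameter values below (t_branch = 2, p_branch = 0.5) are illustrative assumptions, not from the slides; they reproduce the qualitative picture: on a non-ILP machine if-conversion only pays for short target blocks, while on an ILP machine the area of interest is much larger.

```python
# t_original = (1 + p) * t_branch + p * t_true + (1 - p) * t_false  (slide 42)
def t_original(t_branch, p_branch, t_true, t_false):
    return (1 + p_branch) * t_branch + p_branch * t_true + (1 - p_branch) * t_false

# non-ILP: both branches execute sequentially after if-conversion (slide 42)
def speedup_non_ilp(t_branch, p_branch, t_true, t_false):
    return t_original(t_branch, p_branch, t_true, t_false) / (t_true + t_false)

# ILP with sufficient resources: both branches execute in parallel (slide 44)
def speedup_ilp(t_branch, p_branch, t_true, t_false):
    return t_original(t_branch, p_branch, t_true, t_false) / max(t_true, t_false)

short_blocks = speedup_non_ilp(2, 0.5, 2, 2)      # 5/4  = 1.25: if-conversion wins
long_blocks = speedup_non_ilp(2, 0.5, 10, 10)     # 13/20 = 0.65: if-conversion loses
long_blocks_ilp = speedup_ilp(2, 0.5, 10, 10)     # 13/10 = 1.3: wins again with ILP
```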

45 Scheduling: Conditional instructions
Full guard support for large ILP architectures has a number of advantages:
–Removing unpredictable branches
–Enlarging the scheduling scope
–Enabling software pipelining
–Enhancing code motion when speculation is not allowed
–Resource sharing; even when speculation is allowed, guarding may be profitable

46 Scheduling: Overview
Transforming a sequential program into a parallel program:
read sequential program
read machine description file
for each procedure do
    perform function inlining
for each procedure do
    transform an irreducible CFG into a reducible CFG
    perform control flow analysis
    perform loop unrolling
    perform data flow analysis
    perform memory reference disambiguation
    perform register allocation
    for each scheduling scope do
        perform instruction scheduling
write parallel program

47 Scheduling: Integer Linear Programming
Integer linear programming scheduling method. Introduce:
–Decision variables: x_{i,j} = 1 if operation i is scheduled in cycle j
–Constraints like:
 Limited resources: for each cycle j and operation type t, the sum of x_{i,j} over all operations i of type t must not exceed M_t, the number of resources of type t
 Data dependence constraints
 Timing constraints
Problem: too many decision variables.

48 List Scheduling
–Make a dependence graph
–Determine the minimal length
–Determine ASAP, ALAP, and slack of each operation
–Place each operation in the first cycle with sufficient resources
Notes:
–Scheduling order is sequential
–Priority is determined by the heuristic used, e.g. slack

49 Basic Block Scheduling (figure: a data dependence graph with LD, ADD, SUB, NEG and MUL nodes computing x, y and z from A, B and C; each node is annotated with its ASAP cycle, ALAP cycle and slack)

50 ASAP and ALAP formulas
asap(v) = max { asap(u) + delay(u,v) | (u,v) ∈ E }  if pred(v) ≠ ∅; 0 otherwise
alap(v) = min { alap(u) - delay(v,u) | (v,u) ∈ E }  if succ(v) ≠ ∅; L_max otherwise
slack(v) = alap(v) - asap(v)
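The ASAP/ALAP recurrences above can be evaluated by simple relaxation. The 5-node DDG below is a hypothetical stand-in (slide 49's graph is not fully recoverable from the transcript): two loads feed an add, and the add plus a third load feed a store, all with delay 1.

```python
# Hypothetical DDG: (u, v, delay) edges, acyclic.
edges = [("ld_a", "add", 1), ("ld_b", "add", 1), ("add", "st", 1), ("ld_c", "st", 1)]
nodes = {"ld_a", "ld_b", "ld_c", "add", "st"}

# asap(v) = max over predecessors u of asap(u) + delay(u,v); 0 for sources
asap = {v: 0 for v in nodes}
for _ in nodes:                            # |V| relaxation rounds suffice for a DAG
    for u, v, d in edges:
        asap[v] = max(asap[v], asap[u] + d)

L_max = max(asap.values())                 # length of the ASAP schedule

# alap(v) = min over successors u of alap(u) - delay(v,u); L_max for sinks
alap = {v: L_max for v in nodes}
for _ in nodes:
    for u, v, d in edges:
        alap[u] = min(alap[u], alap[v] - d)

slack = {v: alap[v] - asap[v] for v in nodes}
```

Here only ld_c is off the critical path: it can wait one cycle (slack 1), while the ld -> add -> st chain has slack 0.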

51 Cycle-based list scheduling
proc Schedule (DDG = (V,E))
begin
    ready = { v | there is no (u,v) ∈ E }    // all nodes without predecessors
    ready' = ready                           // all nodes schedulable in the current cycle
    sched = ∅                                // nodes scheduled so far
    current_cycle = 0
    while sched ≠ V do
        for each v ∈ ready' do
            if not ResourceConfl(v, current_cycle, sched) then
                cycle(v) = current_cycle
                sched = sched ∪ {v}
            endif
        endfor
        current_cycle = current_cycle + 1
        ready  = { v | v ∉ sched and for all (u,v) ∈ E: u ∈ sched }
        ready' = { v | v ∈ ready and for all (u,v) ∈ E: cycle(u) + delay(u,v) ≤ current_cycle }
    endwhile
endproc
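A runnable rendering of the pseudocode above, as a sketch: the ready/ready' bookkeeping is folded into one availability test per node, nodes are visited in a caller-supplied priority order (e.g. by slack), and a hypothetical issue width stands in for ResourceConfl. It assumes an acyclic DDG with delays of at least 1.

```python
def list_schedule(nodes, edges, delay, width):
    """Cycle-based list scheduling: nodes in priority order, `width` ops/cycle."""
    preds = {v: [] for v in nodes}
    for u, v in edges:
        preds[v].append(u)
    cycle = {}                               # the 'sched' set, with assigned cycles
    current = 0
    while len(cycle) < len(nodes):
        issued = 0
        for v in nodes:                      # priority order replaces ready'/heuristic
            if v in cycle or issued == width:
                continue                     # already placed, or resources exhausted
            # v is ready: all predecessors scheduled and their results available
            if all(u in cycle and cycle[u] + delay[(u, v)] <= current
                   for u in preds[v]):
                cycle[v] = current
                issued += 1
        current += 1
    return cycle

# Same hypothetical DDG as above, issue width 2, unit delays.
nodes = ["ld_a", "ld_b", "ld_c", "add", "st"]            # priority order
edges = [("ld_a", "add"), ("ld_b", "add"), ("add", "st"), ("ld_c", "st")]
delay = {e: 1 for e in edges}
sched = list_schedule(nodes, edges, delay, width=2)
```

With width 2 the third load is pushed to cycle 1 by the resource constraint, and the whole block still fits in 3 cycles.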

52 Problems with basic block scheduling
–Basic blocks contain on average only about 6 instructions
–Unrolling may help for loops
Go beyond basic blocks:
1. Extended basic block scheduling
2. Software pipelining

53 Extended basic block scheduling: scope
Partitioning a CFG into scheduling scopes (figure: CFG with basic blocks A..G):
–Trace
–Superblock (constructed by tail duplication: blocks E', D', G' are duplicates)

54 Extended basic block scheduling: scope
Partitioning a CFG into scheduling scopes (figure, continued):
–Hyperblock / region
–Decision tree (constructed by tail duplication: blocks E', F', D', G', G'' are duplicates)

55 Extended basic block scheduling: scope (table comparing the scheduling scopes: trace, superblock, hyperblock/region, decision tree)

56 Extended basic block scheduling: code motion
(figure: CFG with blocks A, B, C, D)
 A: a) add r4,r4,4  b) beq ...
 B: c) add r1,r1,r2
 C: d) sub r1,r1,r2
 D: e) st r1,8(r4)
Downward code motions: a to B, a to C, a to D, c to D, d to D
Upward code motions: c to A, d to A, e to B, e to C, e to A

57 Extended basic block scheduling: code motion
(figure: code motion between a source block and a destination block, with block labels)
Legend:
–source basic blocks
–destination basic blocks
–basic blocks between the source and destination basic blocks
–control flow edges where off-liveness checks have to be performed
–basic blocks where duplicates (b') have to be placed
SCP (single copy on a path) rule: no path may exist between 2 different D blocks.

58 Extended basic block scheduling: code motion
–A dominates B <=> A is always executed before B. Consequently: if A does not dominate B, code motion from B to A requires code duplication.
–B post-dominates A <=> B is always executed after A. Consequently: if B does not post-dominate A, code motion from B to A is speculative.
(figure: example CFG with blocks A, B, C, D, E, F)
Q1: does C dominate E? Q2: does C dominate D? Q3: does F post-dominate D? Q4: does D post-dominate B?

59 Scheduling: loops
Loop optimizations (figure: a loop with blocks A, B, C, D, where C is the loop body):
–Loop peeling: the first iteration (C') is peeled off and placed in front of the remaining loop (C'')
–Loop unrolling: the loop body is replicated (C, C', C''), so each new iteration executes several original iterations

60 Scheduling: loops
Problems with unrolling:
–Exploits only parallelism within sets of n iterations
–Iteration start-up latency
–Code expansion
(figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining)

61 Software pipelining
Software pipelining a loop is:
–Scheduling the loop such that iterations start before preceding iterations have finished, or:
–Moving operations across the backedge
Example (y = a.x, loop body LD, ML, ST):
–Sequential schedule: 3 cycles/iteration
–Unrolling (3x): 5/3 cycles/iteration
–Software pipelining: 1 cycle/iteration (in the steady state, one LD, one ML and one ST of three different iterations execute every cycle)

62 Software pipelining: modulo scheduling
Example: modulo scheduling a loop.
(a) Example loop:
for (i = 0; i < n; i++) a[i+6] = 3*a[i] - 1;
(b) Loop body without loop control:
 ld r1,(r2)
 mul r3,r1,3
 sub r4,r3,1
 st r4,(r5)
(c) Software pipeline: consecutive iterations of the 4-operation body are overlapped, one cycle apart. The prologue fills the SW pipeline with iterations; the kernel executes one ld, mul, sub and st (of four different iterations) per cycle; the epilogue drains the SW pipeline.
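The prologue/kernel/epilogue structure can be checked with a small simulation (a sketch; the slides give the schedule, not code). The four operations of slide 62's loop body run as pipeline stages one cycle apart; because the dependence distance (6 array elements) exceeds the pipeline depth (4 stages), the overlapped execution produces the same result as the sequential loop.

```python
def sequential(a, n):
    """Reference: the loop of slide 62, executed one iteration at a time."""
    b = list(a)
    for i in range(n):
        b[i + 6] = 3 * b[i] - 1
    return b

def software_pipelined(a, n):
    """Stages ld/mul/sub/st, one iteration apart; n+3 cycles in total."""
    b = list(a)
    ld, mul, sub = {}, {}, {}       # per-iteration stage latches
    for t in range(n + 3):          # first 3 cycles: prologue; last 3: epilogue
        if 0 <= t - 3 < n:
            b[(t - 3) + 6] = sub[t - 3]     # st r4,(r5)  of iteration t-3
        if 0 <= t - 2 < n:
            sub[t - 2] = mul[t - 2] - 1     # sub r4,r3,1 of iteration t-2
        if 0 <= t - 1 < n:
            mul[t - 1] = 3 * ld[t - 1]      # mul r3,r1,3 of iteration t-1
        if t < n:
            ld[t] = b[t]                    # ld r1,(r2)  of iteration t
    return b
```

The store of iteration i lands in cycle i+3 and writes a[i+6], while the load of iteration i+6 only happens in cycle i+6, so the overlap never violates the loop-carried dependence.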

63 Summary and Conclusions
Compilation for ILP architectures is getting mature and is entering the commercial arena. However:
–There is a great discrepancy between available and exploitable parallelism
What if you need more parallelism?
–Source-to-source transformations
–Use other algorithms


