Instruction Selection Presented by Huang Kuo-An, Lu Kuo-Chang Subproject 3 A. Aho, M. Lam, R. Sethi, J. Ullman, “Instruction Selection by Tree Rewriting.”


1 Instruction Selection
Presented by Huang Kuo-An, Lu Kuo-Chang (Subproject 3)
References:
▫A. Aho, M. Lam, R. Sethi, J. Ullman, “Instruction Selection by Tree Rewriting,” Compilers: Principles, Techniques & Tools, 2nd edition, Pearson Education, Inc, pp
▫“The LLVM Target-Independent Code Generator: Instruction Selection.”

2 Outline
Introducing LLVM
Instruction Selection
▫Tree Rewriting
Why we use LLVM
Progress

3 Introducing LLVM The LLVM compiler infrastructure ▫Provides modular & reusable components. ▫Reduces the time & cost to build a particular compiler. ▫Those components are shared across different compilers.

4-7 The Steps of the LLVM Compiler
C or C++ (either one) -> Language Front-end -> LLVM IR
LLVM IR is an intermediate representation:
▫Lower than the high-level language (simple instructions, no for loops, etc.)
▫Higher than the machine code (no opcodes, no registers, etc.)
▫Source-language independent
▫Target-processor independent

8-10 The Steps of the LLVM Compiler
C or C++ -> Language Front-end -> LLVM IR -> Mid-level Optimizer -> LLVM IR -> Code Generation -> .s file -> executable
Code Generation turns LLVM IR into target machine instructions in these stages:
▫Instruction Selection
▫Scheduling
▫Register Allocation
▫Machine-specific Optimizations
▫Code Emission

11-16 Instruction Selection
How does the compiler translate a C instruction like this:
    a[i] = b+1
into machine code like this:
    LD R0, #a
    ADD R0, R0, SP
    ADD R0, R0, i(SP)
    LD R1, b
    INC R1
    ST *R0, R1
First answer: break it into two steps, going through an intermediate representation (IR). The IR tree for the statement, written in prefix form:
    =(ind(+(+(Ca, Rsp), ind(+(Ci, Rsp)))), +(Mb, C1))
Here Ca, Ci and C1 are constants, Mb is the memory location of b, Rsp is the stack pointer register, and ind means "the contents of this address".
New question: how to go from the IR to machine code?
One answer: use tree rewriting.

17-43 Tree Rewriting
The IR tree for a[i] = b+1, in prefix form:
    =(ind(+(+(Ca, Rsp), ind(+(Ci, Rsp)))), +(Mb, C1))
Tree rewriting rules (tree pattern, IR instructions covered, {machine instruction emitted}):
    1. Ri <- Ca                       ld Ri, #a                         {LD Ri, #a}
    2. Ri <- Mx                       ld Ri, x                          {LD Ri, x}
    3. M  <- =(Mx, Ri)                st x, Ri                          {ST x, Ri}
    4. Ri <- ind(+(Ca, Rj))           add Rx, Rj, #a; ld Ri, Rx         {LD Ri, a(Rj)}
    5. M  <- =(ind(Ri), Rj)           st *Ri, Rj                        {ST *Ri, Rj}
    6. Ri <- +(Ri, ind(+(Ca, Rj)))    add Rx, Rj, #a; add Ri, Ri, Rx    {ADD Ri, Ri, a(Rj)}
    7. Ri <- +(Ri, Rj)                add Ri, Ri, Rj                    {ADD Ri, Ri, Rj}
    8. Ri <- +(Ri, C1)                add Ri, Ri, #1                    {INC Ri}
Applying the rules, one subtree at a time:
    Step 1 (rule 1): Ca becomes R0. Emit LD R0, #a
        =(ind(+(+(R0, Rsp), ind(+(Ci, Rsp)))), +(Mb, C1))
    Step 2 (rule 7): +(R0, Rsp) becomes R0. Emit ADD R0, R0, SP
        =(ind(+(R0, ind(+(Ci, Rsp)))), +(Mb, C1))
    Step 3 (rule 6): +(R0, ind(+(Ci, Rsp))) becomes R0. Emit ADD R0, R0, i(SP)
        =(ind(R0), +(Mb, C1))
    Step 4 (rule 2): Mb becomes R1. Emit LD R1, b
        =(ind(R0), +(R1, C1))
    Step 5 (rule 8): +(R1, C1) becomes R1. Emit INC R1
        =(ind(R0), R1)
    Step 6 (rule 5): =(ind(R0), R1) becomes M. Emit ST *R0, R1
        M
Resulting machine code:
    LD R0, #a
    ADD R0, R0, SP
    ADD R0, R0, i(SP)
    LD R1, b
    INC R1
    ST *R0, R1
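The walkthrough on the Tree Rewriting slides can be sketched in a few lines of Python. This is only an illustration, not the selector used by LLVM or by the presenters: the tuple encoding of the tree, the function name select, and the simple "allocate a fresh register for every load" policy are all assumptions made for the example. The rule comments mirror the rules on the slides.

```python
from itertools import count

def select(tree):
    """Greedy bottom-up tree rewriting for one assignment tree (hypothetical sketch)."""
    code = []
    regs = count()                       # fresh temporaries R0, R1, ...

    def is_indexed(n):                   # matches ind(+(Ca, Rj)) with Rj a register leaf
        return (n[0] == "ind" and n[1][0] == "+"
                and n[1][1][0] == "C" and n[1][2][0] == "R")

    def reduce(n):
        """Rewrite subtree n to a register, emitting code, and return the register name."""
        op = n[0]
        if op == "R":                    # already a register, e.g. Rsp
            return n[1].upper()
        if op == "C":                    # Ri <- Ca                  {LD Ri, #a}
            r = f"R{next(regs)}"
            code.append(f"LD {r}, #{n[1]}")
            return r
        if op == "M":                    # Ri <- Mx                  {LD Ri, x}
            r = f"R{next(regs)}"
            code.append(f"LD {r}, {n[1]}")
            return r
        if op == "ind" and is_indexed(n):  # Ri <- ind(+(Ca, Rj))    {LD Ri, a(Rj)}
            r = f"R{next(regs)}"
            code.append(f"LD {r}, {n[1][1][1]}({reduce(n[1][2])})")
            return r
        if op == "+":
            left, right = n[1], n[2]
            ri = reduce(left)            # note: the sketch reuses ri as the destination
            if right == ("C", 1):        # Ri <- +(Ri, C1)           {INC Ri}
                code.append(f"INC {ri}")
            elif is_indexed(right):      # Ri <- +(Ri, ind(+(Ca,Rj))) {ADD Ri, Ri, a(Rj)}
                code.append(f"ADD {ri}, {ri}, {right[1][1][1]}({reduce(right[1][2])})")
            else:                        # Ri <- +(Ri, Rj)           {ADD Ri, Ri, Rj}
                code.append(f"ADD {ri}, {ri}, {reduce(right)}")
            return ri
        raise ValueError(f"no rule matches {n!r}")

    op, lhs, rhs = tree
    if op == "=" and lhs[0] == "M":      # M <- =(Mx, Ri)            {ST x, Ri}
        code.append(f"ST {lhs[1]}, {reduce(rhs)}")
    elif op == "=" and lhs[0] == "ind":  # M <- =(ind(Ri), Rj)       {ST *Ri, Rj}
        ri = reduce(lhs[1])
        rj = reduce(rhs)
        code.append(f"ST *{ri}, {rj}")
    else:
        raise ValueError("expected an assignment tree")
    return code

# a[i] = b + 1, i.e. =(ind(+(+(Ca, Rsp), ind(+(Ci, Rsp)))), +(Mb, C1))
TREE = ("=",
        ("ind", ("+", ("+", ("C", "a"), ("R", "sp")),
                      ("ind", ("+", ("C", "i"), ("R", "sp"))))),
        ("+", ("M", "b"), ("C", 1)))

for line in select(TREE):
    print(line)
```

For this tree the sketch emits the same six instructions as the slides. The order in which patterns are tried inside the "+" case decides whether the one-instruction forms (INC, indexed ADD) are used; putting the plain ADD first would still give correct code, just more of it.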

44-50 But actually, something is missing…
The IR immediate value, #a, does not have a size limit, but the actual machine has a limited number of bits for the immediate value (let’s say, 16 bits).
So we ought to state that this tree rewriting rule only applies when the immediate value can be expressed in 16 bits (i.e., a ≤ 0xFFFF):
    Ri <- Ca    ld Ri, #a (a ≤ 0xFFFF)    {LD Ri, #a}
But what if a cannot be expressed in 16 bits? Then we need a new rule:
    Ri <- Ca    ld Ri, #a (a > 0xFFFF)    {LD Ri, low16(#a); LD Rj, high16(#a); SHL Rj, Rj, #16; ADD Ri, Ri, Rj}
The problem is that the target processor does not have an instruction for 32-bit immediates. Instead, a set of machine instructions is needed. We call this set a pattern.
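The arithmetic behind the large-immediate rule can be checked with a short Python sketch. It assumes unsigned 32-bit constants and a 16-bit immediate field, as in the slides' "let's say, 16 bits" example; the function name load_immediate and the scratch register argument are hypothetical. The high half has to be shifted left by 16 before it is added to the low half.

```python
def load_immediate(reg, tmp, a):
    """Return the instructions that load constant a into reg (hypothetical sketch)."""
    if not 0 <= a <= 0xFFFFFFFF:
        raise ValueError("sketch only handles unsigned 32-bit constants")
    if a <= 0xFFFF:                      # fits the 16-bit immediate field: one instruction
        return [f"LD {reg}, #{a:#x}"]
    low16 = a & 0xFFFF                   # bits 0..15
    high16 = a >> 16                     # bits 16..31
    # Sanity check of the recombination the emitted code performs.
    assert (high16 << 16) + low16 == a
    return [
        f"LD {reg}, #{low16:#x}",
        f"LD {tmp}, #{high16:#x}",
        f"SHL {tmp}, {tmp}, #16",
        f"ADD {reg}, {reg}, {tmp}",
    ]

print(load_immediate("R0", "R1", 0x1234))
print(load_immediate("R0", "R1", 0x12345678))
```

The guard a <= 0xFFFF plays the role of the condition attached to the rewriting rule: the same tree pattern maps to one instruction or to four, depending on the constant's value.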

51-56 Kinds of tree rewriting rules
One-to-One: add Ri,Ri,#1 -> INC Ri
Many-to-One: add Rx,Rj,#a; add Ri,Ri,Rx -> ADD Ri,Ri,a(Rj)
One-to-Many: ld Ri, #a (a>0xFFFF) -> LD Ri, low16(#a); LD Rj, high16(#a); SHL Rj, Rj, #16; ADD Ri, Ri, Rj

57-59 So, what’s the point?
To design an instruction selector, you do not need to write a program. Just define a set of rewriting rules.
Then use an existing instruction selection program to apply your set of rules. The LLVM compiler has such a selector.

60-62 Instruction Selection
Suppose you want to use the LLVM compiler to create PowerPC code. The PowerPC has a single-precision floating point add instruction:
    FADDS T1, A, B
Q: How can we allow the LLVM compiler to generate FADDS instructions?
A: We need to create a tree rewriting rule in the LLVM format and give it to the instruction selector:
    …
    def FADDS : Aform_2<59, 21,
        (outs F4RC:$FRT), (ins F4RC:$FRA, F4RC:$FRB),
        "FADDS $FRT, $FRA, $FRB",
        [(set F4RC:$FRT, (fadd F4RC:$FRA, F4RC:$FRB))]>;
    …
The rule this definition expresses:
    FRT <- +(FRA, FRB)    fadd RT, RA, RB    {FADDS FRT, FRA, FRB}

63-64 Instruction Selection
The PowerPC also has a single-precision floating point multiply instruction:
    FMULS T1, X, Y
So we need to create a tree rewriting rule for it too:
    …
    def FADDS : Aform_2<59, 21,
        (outs F4RC:$FRT), (ins F4RC:$FRA, F4RC:$FRB),
        "FADDS $FRT, $FRA, $FRB",
        [(set F4RC:$FRT, (fadd F4RC:$FRA, F4RC:$FRB))]>;
    def FMULS : Aform_3<59, 25,
        (outs F4RC:$FRT), (ins F4RC:$FRA, F4RC:$FRB),
        "FMULS $FRT, $FRA, $FRB",
        [(set F4RC:$FRT, (fmul F4RC:$FRA, F4RC:$FRB))]>;
    …
The rules these definitions express:
    FRT <- +(FRA, FRB)    fadd RT, RA, RB    {FADDS FRT, FRA, FRB}
    FRT <- *(FRA, FRB)    fmul RT, RA, RB    {FMULS FRT, FRA, FRB}

65 Instruction Selection
With these two rules, we could now generate PowerPC code for the following LLVM IR:
    %t1 = mul float %X, %Y
    %t2 = add float %t1, %Z
Rules:
    FRT <- *(FRA, FRB)    fmul RT, RA, RB    {FMULS FRT, FRA, FRB}
    FRT <- +(FRA, FRB)    fadd RT, RA, RB    {FADDS FRT, FRA, FRB}
Selection:
    fmul:f32 X, Y     ->  FMULS t1, X, Y
    fadd:f32 t1, Z    ->  FADDS t2, t1, Z

66 Instruction Selection
But wait! PowerPC has the FMADDS instruction that performs both a multiply and an add. Why didn’t the compiler choose that instruction? Because no tree rewriting rule was defined for FMADDS.
What are the consequences of not giving a rule for FMADDS?
▫Broken compiler? No. Why not? Because the FMADDS instruction’s function can also be performed by other PowerPC instructions that were defined. (But if FADDS were not defined, the compiler would be broken.)
▫Bad compiler? Yes. Why? FMADDS will never be used, and it’s faster than FMULS + FADDS.
LLVM IR and the code selected for it:
    %t1 = mul float %X, %Y    fmul:f32 X, Y     ->  FMULS t1, X, Y
    %t2 = add float %t1, %Z   fadd:f32 t1, Z    ->  FADDS t2, t1, Z


75 Instruction Selection
We can add a new rule for the PowerPC’s multiply and add instruction:
    FMADDS T1, A, B, C
The definitions given to the instruction selector become:
    …
    def FADDS : Aform_2<59, 21,
        (outs F4RC:$FRT), (ins F4RC:$FRA, F4RC:$FRB),
        "FADDS $FRT, $FRA, $FRB",
        [(set F4RC:$FRT, (fadd F4RC:$FRA, F4RC:$FRB))]>;
    def FMULS : Aform_3<59, 25,
        (outs F4RC:$FRT), (ins F4RC:$FRA, F4RC:$FRB),
        "FMULS $FRT, $FRA, $FRB",
        [(set F4RC:$FRT, (fmul F4RC:$FRA, F4RC:$FRB))]>;
    def FMADDS : Aform_1<59, 29,
        (outs F4RC:$FRT), (ins F4RC:$FRA, F4RC:$FRC, F4RC:$FRB),
        "FMADDS $FRT, $FRA, $FRC, $FRB",
        [(set F4RC:$FRT, (fadd (fmul F4RC:$FRA, F4RC:$FRC), F4RC:$FRB))]>;
    …
The rules these definitions express:
    FRT <- +(FRA, FRB)              fadd RT, RA, RB                       {FADDS FRT, FRA, FRB}
    FRT <- *(FRA, FRB)              fmul RT, RA, RB                       {FMULS FRT, FRA, FRB}
    FRT <- +(*(FRA, FRC), FRB)      fmul RT1, RA, RC; fadd RT2, RT1, RB   {FMADDS FRT, FRA, FRC, FRB}
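The effect of adding or omitting the FMADDS rule can be sketched in Python. This is a toy selector, not LLVM's: the tuple encoding, the name select_fp, and the policy of trying the larger (many-to-one) pattern before the single-node patterns are assumptions made for the illustration.

```python
from itertools import count

def select_fp(tree, have_fmadds):
    """Select PowerPC-style FP instructions for a small expression tree (hypothetical sketch)."""
    code = []
    temps = count(1)                     # result names t1, t2, ...

    def reduce(n):
        if isinstance(n, str):           # leaf: value is already in a register
            return n
        op, a, b = n
        # Many-to-one rule: fadd(fmul(A, C), B) -> FMADDS. Tried first because it
        # covers two IR nodes with a single machine instruction.
        if have_fmadds and op == "fadd" and not isinstance(a, str) and a[0] == "fmul":
            ra, rc, rb = reduce(a[1]), reduce(a[2]), reduce(b)
            t = f"t{next(temps)}"
            code.append(f"FMADDS {t}, {ra}, {rc}, {rb}")
            return t
        # One-to-one rules: fadd -> FADDS, fmul -> FMULS (any other op raises KeyError).
        ra, rb = reduce(a), reduce(b)
        t = f"t{next(temps)}"
        mnemonic = {"fadd": "FADDS", "fmul": "FMULS"}[op]
        code.append(f"{mnemonic} {t}, {ra}, {rb}")
        return t

    reduce(tree)
    return code

# %t1 = X * Y ; %t2 = %t1 + Z
EXPR = ("fadd", ("fmul", "X", "Y"), "Z")
print(select_fp(EXPR, have_fmadds=False))
print(select_fp(EXPR, have_fmadds=True))
```

Without the FMADDS rule the selector still produces correct code using FMULS and FADDS, which is the "not broken, but bad" case from the slides; with the rule, the same tree is covered by one instruction.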

76-80 3 kinds of tree rewriting rules
One-to-One: add Ri,Ri,#1 -> INC Ri
Many-to-One: add Rx,Rj,#a; add Ri,Ri,Rx -> ADD Ri,Ri,a(Rj)
One-to-Many: ld Ri, #a (a>0xFFFF) -> LD Ri, low16(#a); LD Rj, high16(#a); SHL Rj, Rj, #16; ADD Ri, Ri, Rj
▫FMADDS is a Many-to-One rule.
▫FMADDS is not needed for a basic compiler.
▫In fact, many-to-one rules can all be skipped.

81 We will use the LLVM compiler
Because:
▫It has good optimizations
▫It has good documentation
▫It is designed to be a little bit easier to retarget to a new processor
▫It was the compiler used by Subproject 3, year 1, so there is some infrastructure

82 But there are some difficulties with the LLVM compiler
Because:
▫It compiles C, not OpenGL 2.0
▫Though it has backends for several processors, none of them are SIMD
▫So, the LLVM IR is not SIMD

83 How we will use the LLVM compiler
Our work is in two parallel paths:
▫Fast track: use Subproject 2’s code to convert OpenGL to C
▫Slow track: use Subproject 3 year 1’s code to generate SIMD instructions in the LLVM IR

84 A quick reminder
OpenGL 2.0 shader code is stored in a string array. It is not compiled until the game is actually running. At some point during the running of the game, the game calls glCompileShader on a shader object whose source is that string array (attached beforehand with glShaderSource); the compiled object code is kept in the shader object.
Maybe the player entered a new level, and the new level has brick walls. But the previous level did not have brick walls, so the graphics processor does not have a rule for how to render bricks. The brick shader must be compiled, linked, and loaded to the graphics processor. This is accomplished through 3 OpenGL API calls from within the game:
    glCompileShader(…)
    glLinkProgram(…)
    glUseProgram(…)
Our current work is only on the implementation of glCompileShader. glCompileShader is a program that runs on the ARM processor, when called by the ARM’s OS. So, our compiler (which is written in C++) is compiled into an ARM executable. But when this compiler executable is run, it generates a shader executable.

85 A sample of OpenGL code
Here is some shader code (the shader string array):
    uniform vec3 LightPosition;
    const float SpecularContribution = 0.3;
    const float DiffuseContribution = SpecularContribution;
    varying float LightIntensity;
    varying vec2 MCposition;
    void main(void)
    {
        vec3 ecPosition = vec3 (gl_ModelViewMatrix * gl_Vertex);
        vec3 tnorm = normalize(gl_NormalMatrix * gl_Normal);
        vec3 lightVec = normalize(LightPosition - ecPosition);
        vec3 reflectVec = reflect(-lightVec, tnorm);
        vec3 viewVec = normalize(-ecPosition);
        float diffuse = max(dot(lightVec, tnorm), 0.0);
        float spec = 0.0;
        if (diffuse > 0.0) {
            spec = max(dot(reflectVec, viewVec), 0.0);
            spec = pow(spec, 16.0);
        }
        LightIntensity = DiffuseContribution * diffuse + SpecularContribution * spec;
        MCposition = gl_Vertex.xy;
        gl_Position = ftransform();
    }

86 A sample compilation trigger
And here is a function inside the game that compiles and loads the shader:
void AddBrickFragments(GLuint currentProgram) {
    GLuint brickFS = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(brickFS, 1, brickStringArray, NULL);
    glCompileShader(brickFS);
    glAttachShader(currentProgram, brickFS);
    glLinkProgram(currentProgram);
    glUseProgram(currentProgram);
}
[Diagram: the shader string array from the previous slide.]

87 A sample compilation trigger
[Diagram: the shader string array next to the AddBrickFragments function, which is labeled "game running on ARM".]

88 The fast track compiler process
1. So now, as the game runs, this call to glCompileShader happens.
2. Then the ARM processor calls the LLVM compiler, passing in this code for compilation.
3. The LLVM compiler then:
   1. Runs Proj2Converter to make C code
   2. Runs the LLVM front end to create IR
   3. Runs our new LLVM backend to create the shader object file
   4. Sends the object file back to the game
[Diagram: the game running on ARM (AddBrickFragments) and the shader string array it passes in.]

90 The fast track compiler process
[Diagram: the shader string array from the game running on ARM passes through the Proj2 converter to become equivalent C code.]

92 The fast track compiler process
[Diagram: the equivalent C code from the Proj2 converter passes through the LLVM frontend to become equivalent LLVM IR.]

94 The fast track compiler process
[Diagram: the equivalent LLVM IR passes through the fast track backend to become the equivalent shader object file (… MUL R1, R2, R3 MADD R4,R1,R5 …).]

95 The fast track compiler process
[Diagram: shader string array → Proj2 converter → equivalent C code → LLVM frontend → equivalent LLVM IR → fast track backend → equivalent shader object file (… MUL R1, R2, R3 MADD R4,R1,R5 …).]

97 The slow track compiler process
It is not good to use subproject 2's converter:
▫The compiler runs during game execution, so the conversion step adds overhead.
▫The conversion destroys vectors, so you can't create SIMD code.
After all, if C were a good fit for 3D shaders, we wouldn't need the OpenGL shading language!

98 The slow track compiler process
The subproject 3, year 1 team addressed this problem:
▫They modified the LLVM frontend to read OpenGL code instead of C code.
▫To handle the SIMD information expressed in the OpenGL code (such as variables declared as "vec4"), they added vectors to the LLVM IR.
The problem is that the LLVM backend was not modified, so their result is a non-standard LLVM IR that currently cannot be compiled. The gist of our slow track development process is modifying the backend to understand the augmented IR.

99 The fast track compiler process
[Diagram: shader string array → Proj2 converter → equivalent C code → LLVM frontend → equivalent LLVM IR → fast track backend → equivalent shader object file (… MUL R1, R2, R3 MADD R4,R1,R5 …).]
The slow track compiler process
[Diagram: shader string array → Proj3Y1 LLVM frontend → equivalent, augmented LLVM IR → slow track backend → equivalent shader object file (… SQRT R1, R2 RCP R4,R1 …).]
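
The difference between the two tracks can be sketched as a toy model in C. This is only an illustration under our own assumptions: the type `Shader`, its flags, and every function name below are hypothetical and are not part of LLVM or of the project code. The sketch shows just one thing: once the conversion to C drops the vector information, the backend has nothing left from which to choose SIMD instructions.

```c
#include <assert.h>
#include <stdbool.h>

/* Which representation the shader is in at a given pipeline stage. */
typedef enum { REP_GLSL, REP_C, REP_LLVM_IR, REP_SHADER_OBJ } Rep;

/* Hypothetical model of a shader travelling through the compiler. */
typedef struct {
    Rep  rep;
    bool has_vectors;  /* is vec3/vec4 information still present? */
    bool uses_simd;    /* did the backend get to emit SIMD instructions? */
} Shader;

/* Fast track, step 1: GLSL -> C. Plain C has no vector types, so the
 * vector information is lost here. */
Shader proj2_convert(Shader s) {
    assert(s.rep == REP_GLSL);
    s.rep = REP_C;
    s.has_vectors = false;
    return s;
}

/* Fast track, step 2: C -> standard LLVM IR. */
Shader llvm_frontend(Shader s) {
    assert(s.rep == REP_C);
    s.rep = REP_LLVM_IR;
    return s;
}

/* Slow track: GLSL -> augmented LLVM IR, vectors preserved. */
Shader proj3y1_frontend(Shader s) {
    assert(s.rep == REP_GLSL);
    s.rep = REP_LLVM_IR;
    return s;
}

/* Either backend: IR -> shader object. SIMD is only possible if the
 * vector information survived the earlier stages. */
Shader backend(Shader s) {
    assert(s.rep == REP_LLVM_IR);
    s.rep = REP_SHADER_OBJ;
    s.uses_simd = s.has_vectors;
    return s;
}

Shader fast_track(Shader glsl) {
    return backend(llvm_frontend(proj2_convert(glsl)));
}

Shader slow_track(Shader glsl) {
    return backend(proj3y1_frontend(glsl));
}
```

Starting from the same GLSL input with `has_vectors` set, `fast_track` ends with `uses_simd` false and `slow_track` ends with it true, which is the reason the slow track exists.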

100 Instruction selection summary
There are three steps in building our instruction selector:
1st cut: fast track selection
- Backend changed to target the shader processors
- Works but has no SIMD operation
2nd cut: slow track selection
- Merge in the second backend change to understand the augmented IR
- Update instruction selector to make SIMD choices
3rd cut: Create tree rewriting rules for the complex processor instructions, like SQRT and LOG

101 Progress
The following table shows the shader instructions. Our goal is to map LLVM instructions onto our shader instructions.
SHADER instructions (the LLVM column is still empty on this slide): MOV, LD, ST, MUL, ADD, MAD, MIN, MAX, SLT, SLE, SGT, SGE, AND, OR, XOR, DP3, DP4, RCP, RSQ, LOG, EXP, BEQ, JMP, NOP

102 Progress
The following table shows the shader instructions. Our goal is to map LLVM instructions onto our shader instructions. Some LLVM instructions map onto our shader instructions in an obvious way.
SHADER  LLVM     SHADER  LLVM
MOV              AND     and
LD               OR      or
ST               XOR     xor
MUL     mul      DP3
ADD     add      DP4
MAD              RCP
MIN              RSQ
MAX              LOG
SLT     setlt    EXP     shl
SLE     setle    BEQ     seteq
SGT     setgt    JMP
SGE     setge    NOP     nop

103 Progress
We have mapped some of them, but there are more LLVM IR instructions. If an LLVM IR instruction has no tree rewriting rule, you are not going to get a working compiler.

104 Progress
Some instructions are harder to map, so we are going to cover them one by one.

105 Progress
Some are harder to map, which means one of two things:
▫It will require a more complicated mapping
▫It can be skipped (for now); it's a many-to-one mapping
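
The one-to-one part of the table can be pictured as a simple lookup. The following C sketch is only an illustration: `DirectRule`, `kDirectRules`, and `select_direct` are hypothetical names, the table holds a subset of the pairs listed on the slides, and a real LLVM backend expresses such rules in TableGen rather than in a C array. It shows the point made above: an LLVM operation with no rule yields nothing to emit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One direct (one-to-one) rule: LLVM operation name -> shader mnemonic. */
typedef struct {
    const char *llvm_op;
    const char *shader_op;
} DirectRule;

/* A subset of the pairs from the table. Anything absent here needs a
 * hand-written multi-instruction pattern instead. */
const DirectRule kDirectRules[] = {
    {"mul", "MUL"},   {"add", "ADD"},
    {"setlt", "SLT"}, {"setle", "SLE"},
    {"setgt", "SGT"}, {"setge", "SGE"},
    {"and", "AND"},   {"or", "OR"},   {"xor", "XOR"},
    {"seteq", "BEQ"}, {"nop", "NOP"},
};

/* Returns the shader mnemonic for llvm_op, or NULL when no direct rule
 * exists for it. */
const char *select_direct(const char *llvm_op) {
    assert(llvm_op != NULL);
    size_t n = sizeof kDirectRules / sizeof kDirectRules[0];
    for (size_t i = 0; i < n; ++i) {
        if (strcmp(kDirectRules[i].llvm_op, llvm_op) == 0) {
            return kDirectRules[i].shader_op;
        }
    }
    return NULL;
}
```

Here `select_direct("mul")` finds `"MUL"`, while `select_direct("ashr")` returns NULL, which is the case the following slides deal with.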

110 Progress
For example, here is how we map the SHR instruction, which is easy. First, we took the MIPS backend as the starting point for our modifications. It defines the SHR instruction like this: def SHR : SetCC_R ; Then we turn it into the following code: def SHR : SetCC_R ; Because this is a simple mapping, we only need to change the string that appears in the assembly file. For now our target is just to get correct assembly, not executables. But some instructions are hard to map, for example the ASHR instruction.

111 First, a reminder of what arithmetic shift right is: it is a right shift that preserves the sign through sign extension. Consider a value in R0 and compare SHR R0,10 with ASHR R0,10 [the bit patterns on the slide are not preserved]. The shr was easy to make a rule for, because the shader has an SHR instruction. But it doesn't have an ASHR. Q: How then can we make a rule to deal with the LLVM ashr IR instruction? A: We'll need to use multiple shader instructions (one-to-many).
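
As a reminder of the two shifts, here is a minimal C sketch. The function names `shr32` and `ashr32` are our own, and the input value used in the note below is an example we picked; it is not the value from the original slide. The arithmetic version is written with unsigned operations only, because right-shifting a negative signed value is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Logical shift right: vacated top bits are filled with zeros. */
uint32_t shr32(uint32_t x, unsigned n) {
    assert(n >= 1 && n <= 31);
    return x >> n;
}

/* Arithmetic shift right: vacated top bits are filled with copies of
 * the sign bit (bit 31). */
uint32_t ashr32(uint32_t x, unsigned n) {
    assert(n >= 1 && n <= 31);
    uint32_t sign = x >> 31;                        /* 0 or 1 */
    uint32_t fill = sign ? ~(UINT32_MAX >> n) : 0u; /* top n bits set if negative */
    return (x >> n) | fill;
}
```

For the bit pattern 0xFFFFF000 (which is -4096 as a signed 32-bit value) and a shift of 10, `shr32` gives 0x003FFFFC while `ashr32` gives 0xFFFFFFFC (which is -4). For a non-negative input the two agree.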

112 But how do we define a pattern of shader instructions? In the example on the previous slide, compare the results of SHR R0,10 and ASHR R0,10 [bit patterns not preserved].

113-116 Split each result into a left part (the top 10 bits) and a right part (the remaining 22 bits). The left part can be different between the two results. The right part is always the same. Within the left part of the ASHR result, the bits are always the same as each other.

117-119 In the ASHR result, the left part is the number 1023 = 1111111111 in binary, which can be computed as (1<<10) - 1. So it looks like the answer here is to compute the right part with SHR and the left part as: (TopBit << ShiftAmount) - 1

120 Now we can start to build the ASHR instruction. First, we define a pattern called RED. Recall that shader registers have four 32-bit fields: Red, Green, Blue, and Alpha. Since we are not using SIMD yet, we will only deal with one 32-bit field, and that is what RED selects. Here is the LLVM pattern: def:PAT Second, we define a pattern of shader instructions for computing the left part: def:PAT<(TOPBITS Rx,Ry), (SHL(SUB(SHL(SHR(RED Rx),31),Ry),1),(SUB 32,Ry))>

121-124 Second, we define a pattern of shader instructions for computing the left part: def:PAT<(TOPBITS Rx,Ry), (SHL(SUB(SHL(SHR(RED Rx),31),Ry),1),(SUB 32,Ry))> The innermost part (blue on the slide), SHR (RED Rx),31, strips out everything but the sign bit, which is now in the bottom bit position. For example, for an input whose sign bit is 1, the result after the blue part is 1.

125-127 The next part (green), SHL …,Ry, pushes the sign bit up y places. Thus it computes 2^y if the sign bit is 1. For example, after the green part (assuming y = 10) the result is 2^10 = 1024.

128-129 The next part (purple), SUB …,1, now computes the left part. For example, after the purple part the result is 1023, that is, ten 1 bits.

130-131 Finally, the outermost part (red), SHL …,(SUB 32,Ry), shifts the sign extension bits up to where they go. For example, after the red part the ten 1 bits occupy the top ten bit positions.
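
The four nested operations of TOPBITS can be traced in C. This is a sketch under stated assumptions: `TopBitsSteps` and `topbits_steps` are hypothetical names, and there is one deliberate deviation from the pattern as written. Read literally, subtracting the constant 1 would turn a zero sign bit into all ones (0 - 1 wraps around), so the sketch subtracts the sign bit itself. That is our assumption about the intent, not something stated on the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Intermediate results of the four nested operations. */
typedef struct {
    uint32_t sign;     /* SHR x,31: the sign bit, now in bit 0 */
    uint32_t pushed;   /* SHL sign,y: 2^y if the sign bit is 1, else 0 */
    uint32_t ones;     /* y low bits set if the sign bit is 1, else 0 */
    uint32_t topbits;  /* those bits moved up to the top y positions */
} TopBitsSteps;

TopBitsSteps topbits_steps(uint32_t x, unsigned y) {
    assert(y >= 1 && y <= 31);   /* y = 0 would need a shift by 32 */
    TopBitsSteps s;
    s.sign   = x >> 31;
    s.pushed = s.sign << y;
    /* Assumption: subtract the sign bit rather than the literal 1, so a
     * non-negative input gives 0 instead of wrapping to all ones. */
    s.ones    = s.pushed - s.sign;
    s.topbits = s.ones << (32u - y);
    return s;
}
```

With x = 0xFFFFF000 and y = 10 (an input of our own choosing), the steps are 1, 1024, 1023, and 0xFFC00000. With a non-negative x every step is 0.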

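132 Third, we define a pattern of shader instructions for merging the left and right parts: def : PAT <(ASHR Rx,Ry), (OR (TOPBITS Rx,Ry), (SHR (RED Rx), Ry))> From the previous slide, the red part, TOPBITS Rx,Ry, holds the sign extension bits in the top y positions. The lavender part, SHR (RED Rx),Ry, is the logically shifted value, which supplies the right part. A bitwise OR of the two parts yields the arithmetic shift right result.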

133 So, we defined three patterns:
def:PAT
def:PAT<(TOPBITS Rx,Ry), (SHL(SUB(SHL(SHR(RED Rx),31),Ry),1),(SUB 32,Ry))>
def : PAT <(ASHR Rx,Ry), (OR (TOPBITS Rx,Ry), (SHR (RED Rx), Ry))>
As a result, there is now a rewriting rule for ashr. It's awkward, but it works.
▫Besides, it's unclear how often shaders would do an ashr.
We must similarly build patterns for every LLVM IR instruction that does not naturally map to a shader processor instruction.
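
Putting the patterns together, the whole rewrite can be checked in C against an independent definition of arithmetic shift right. This is a sketch, not the backend code: all function names are hypothetical, only shifts of 1 to 31 are covered, and `topbits` subtracts the sign bit instead of the constant 1 written in the pattern (our assumption, so that non-negative inputs yield no fill bits).

```c
#include <assert.h>
#include <stdint.h>

/* TOPBITS: sign-extension bits in the top y positions, zeros elsewhere.
 * Assumption: subtracts the sign bit (not the constant 1), so that a
 * non-negative x yields 0. */
uint32_t topbits(uint32_t x, unsigned y) {
    assert(y >= 1 && y <= 31);   /* y = 0 would need a shift by 32 */
    uint32_t sign = x >> 31;
    return ((sign << y) - sign) << (32u - y);
}

/* ASHR x,y rewritten as OR(TOPBITS x,y ; SHR x,y). */
uint32_t ashr_emulated(uint32_t x, unsigned y) {
    return topbits(x, y) | (x >> y);
}

/* Independent reference: floor division of the signed value by 2^y. */
uint32_t ashr_reference(uint32_t x, unsigned y) {
    int64_t v = (x & 0x80000000u) ? (int64_t)x - 4294967296LL : (int64_t)x;
    int64_t d = (int64_t)1 << y;
    int64_t q = v / d;               /* C division truncates toward zero */
    if (v < 0 && v % d != 0) {
        q -= 1;                      /* adjust to round toward -infinity */
    }
    return (uint32_t)q;              /* wraps modulo 2^32 */
}

/* Number of (x, y) pairs on which the two versions disagree. */
unsigned count_mismatches(void) {
    const uint32_t samples[] = {0u, 1u, 0x00001000u, 0x7FFFFFFFu,
                                0x80000000u, 0xFFFFF000u, 0xFFFFFFFFu};
    unsigned bad = 0;
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; ++i) {
        for (unsigned y = 1; y <= 31; ++y) {
            if (ashr_emulated(samples[i], y) != ashr_reference(samples[i], y)) {
                ++bad;
            }
        }
    }
    return bad;
}
```

The comparison runs over a handful of sample inputs rather than all 2^32 values, so it is a sanity check rather than a proof.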

134 Future work
All of the above is just for the first-cut compiler.
1st cut: fast track selection
- Backend changed to target the shader processors
- Works but has no SIMD operation
2nd cut: slow track selection
- Merge in the second backend change to understand the augmented IR
- Update instruction selector to make SIMD choices
3rd cut: Create tree rewriting rules for the complex processor instructions, like SQRT and LOG

