Static Optimizations (aka: the complier) Dr. Mark Brehob EECS 470.

Slides:

Advertisements

Similar presentations

Anshul Kumar, CSE IITD CSL718 : VLIW - Software Driven ILP Hardware Support for Exposing ILP at Compile Time 3rd Apr, 2006.

Advertisements

HW 2 is out! Due 9/25!. CS 6290 Static Exploitation of ILP.

Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Advanced Computer Architecture COE 501.

1 Lecture 5: Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2)

ENGS 116 Lecture 101 ILP: Software Approaches Vincent H. Berk October 12 th Reading for today: , 4.1 Reading for Friday: 4.2 – 4.6 Homework #2:

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

Instruction Level Parallelism María Jesús Garzarán University of Illinois at Urbana-Champaign.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.

1 Lecture: Static ILP Topics: compiler scheduling, loop unrolling, software pipelining (Sections C.5, 3.2)

Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.

Pipelining 5. Two Approaches for Multiple Issue Superscalar –Issue a variable number of instructions per clock –Instructions are scheduled either statically.

1 Advanced Computer Architecture Limits to ILP Lecture 3.

1 Lecture 10: Static ILP Basics Topics: loop unrolling, static branch prediction, VLIW (Sections 4.1 – 4.4)

Chapter 4 The Processor CprE 381 Computer Organization and Assembly Level Programming, Fall 2013 Zhao Zhang Iowa State University Revised from original.

Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.

Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

Chapter 14 Superscalar Processors. What is Superscalar? “Common” instructions (arithmetic, load/store, conditional branch) can be executed independently.

EECC551 - Shaaban #1 Spring 2004 lec# Static Compiler Optimization Techniques We already examined the following static compiler techniques aimed.

1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.

EECC551 - Shaaban #1 Winter 2003 lec# Static Compiler Optimization Techniques We already examined the following static compiler techniques aimed.

1 Lecture 5: Pipeline Wrap-up, Static ILP Basics Topics: loop unrolling, VLIW (Sections 2.1 – 2.2) Assignment 1 due at the start of class on Thursday.

Microprocessors Introduction to ia64 Architecture Jan 31st, 2002 General Principles.

1 Lecture 6: Static ILP Topics: loop analysis, SW pipelining, predication, speculation (Section 2.2, Appendix G) Assignment 2 posted; due in a week.

EECC551 - Shaaban #1 Winter 2002 lec# Static Compiler Optimization Techniques We already examined the following static compiler techniques aimed.

Chapter 21 IA-64 Architecture (Think Intel Itanium)

IA-64 Architecture (Think Intel Itanium) also known as (EPIC – Extremely Parallel Instruction Computing) a new kind of superscalar computer HW 5 - Due.

Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)

IA-64 ISA A Summary JinLin Yang Phil Varner Shuoqi Li.

IT253: Computer Organization Lecture 4: Instruction Set Architecture Tonga Institute of Higher Education.

Anshul Kumar, CSE IITD CS718 : VLIW - Software Driven ILP Example Architectures 6th Apr, 2006.

1 Pipelining Reconsider the data path we just did Each instruction takes from 3 to 5 clock cycles However, there are parts of hardware that are idle many.

10/27: Lecture Topics Survey results Current Architectural Trends Operating Systems Intro –What is an OS? –Issues in operating systems.

Hardware Support for Compiler Speculation

Spring 2003CSE P5481 VLIW Processors VLIW (“very long instruction word”) processors instructions are scheduled by the compiler a fixed number of operations.

1 Advanced Computer Architecture Dynamic Instruction Level Parallelism Lecture 2.

Introducing The IA-64 Architecture - Kalyan Gopavarapu - Kalyan Gopavarapu.

IA-64 Architecture RISC designed to cooperate with the compiler in order to achieve as much ILP as possible 128 GPRs, 128 FPRs 64 predicate registers of.

CIS 662 – Computer Architecture – Fall Class 16 – 11/09/04 1 Compiler Techniques for ILP  So far we have explored dynamic hardware techniques for.

1 Out-Of-Order Execution (part I) Alexander Titov 14 March 2015.

Chapter 6 Pipelined CPU Design. Spring 2005 ELEC 5200/6200 From Patterson/Hennessey Slides Pipelined operation – laundry analogy Text Fig. 6.1.

1  1998 Morgan Kaufmann Publishers Chapter Six. 2  1998 Morgan Kaufmann Publishers Pipelining Improve perfomance by increasing instruction throughput.

1 Lecture 12: Advanced Static ILP Topics: parallel loops, software speculation (Sections )

Unit II Intel IA-64 and Itanium Processor By N.R.Rejin Paul Lecturer/VIT/CSE CS2354 Advanced Computer Architecture.

IA64 Complier Optimizations Alex Bobrek Jonathan Bradbury.

Advanced Pipelining 7.1 – 7.5. Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike.

IA-64 Architecture Muammer YÜZÜGÜLDÜ CMPE /12/2004.

CS 352H: Computer Systems Architecture

Computer Organization CS224

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue

CS203 – Advanced Computer Architecture

/ Computer Architecture and Design

Henk Corporaal TUEindhoven 2009

The processor: Pipelining and Branching

CS 704 Advanced Computer Architecture

Yingmin Li Ting Yan Qi Zhao

Lecture: Static ILP Topics: predication, speculation (Sections C.5, 3.2)

Henk Corporaal TUEindhoven 2011

Control unit extension for data hazards

Sampoorani, Sivakumar and Joshua

Instruction Level Parallelism (ILP)

CC423: Advanced Computer Architecture ILP: Part V – Multiple Issue

Control unit extension for data hazards

CSC3050 – Computer Architecture

Dynamic Hardware Prediction

How to improve (decrease) CPI

Control unit extension for data hazards

Static Scheduling Techniques

Instruction Level Parallelism

Presentation transcript:

Static Optimizations (aka: the complier) Dr. Mark Brehob EECS 470

Announcements Milestone 2 due today – 1 page memo, don’t spend more than 30 minutes (sum of everyone) on it. Not graded – Meet on Friday Most already signed up – Should have large parts integrated by now Quiz on Monday – Old quizes on-line – HW4 answers posted by tomorrow night.

Today Finish up multi-processor – Use previous slides Start on static optimizations.

The big picture We’ve spent a lot of time learning about dynamic optimizations – Finding ways to improve ILP in hardware Out-of-order execution Branch prediction But what can be done statically (at compile time)? – As hardware architects it behooves us to understand this. Partly so we are aware what things software is likely to be better at. But partly so we can find ways to find hardware/software “synergy”

Some ways a compiler can help Improve locality of data Remove instructions that aren’t needed Reduce number of branches executed Many others

Improve locality of reference o Examples: o Loop interchange—flip inner and outer loops o Loop fission—split into multiple loops Some examples taken from Wikipedia for i from 0 to 10 for j from 0 to 20 a[j,i] = i + j for j from 0 to 20 for i from 0 to 10 a[j,i] = i + j

Removing code (1/2) Register optimization – Registers are fast, and doing “spills and fills” is slow. – So keep the data likely to be used next in registers. Loop invariant code motion – Move recomputed statements outside of the loop. for (int i=0; i<n; i++) { x = y+z; a[i] = 6*i+x*x; } x = y+z; for (int i=0; i<n; i++) { a[i] = 6*i+x*x; }

Removing code (2/2) Common sub-expression elimination – (a + b) - (a + b)/4 Constant folding – Replace (3+5) with 8.

Reducing number of branches executed Using predicates or CMOVs instead of short branches Loop unrolling for(i=0;i<10000;i++) { A[i]=B[i]+C[i]; } for(i=0;i<10000;i=i+2) { A[i]=B[i]+C[i]; A[i+1]=B[i+1]+C[i+1]; }

We’ll mostly focus on one thing “Hoist” loads – That is move the loads up so if there is a miss we can hide that latency. Very similar goal to our OoO processor. xxxxx LD R1=MEM[x] R2=R1+R3 LD R1=MEM[x] xxxxx R2=R1+R3

What limits our ability to hoist a load? – ________________________________________ _______________________________ – _________________________________________

Create room to move code around Loop unrolling – The idea is to take a loop (usually a short loop) and do two or more iterations in a single loop body. Initial Loop body for i=1 to { } for i=1 to 5000 { Loop body } “Glue” logic

Unroll this loop for(i=0;i<10000;i++) { A[i]=B[i]+C[i]; n+=A[i]; } Glue logic? Reduce operations?

What does unrolling buy us? Reduces number of branches –Less to (mis-)predict –If not predicting branches (say cheap embedded processor) very helpful! –If limited number of branches allowed in ROB at a time, reduces this problem. Can schedule for pipeline better –If superscalar might be best to combine certain operations. Loop unrolling adds flexibility

What does it cost us? Code space. – Mainly worried about impact on I-cache hit rate. But L2 or DRAM impact if unroll too much! If loop body has branches in it can hurt branch prediction performance. Other?

Another one to unroll. for(i=0;i<99999;i++) { A[i]=B[i]+C[i]; B[i+1]=C[i]+D[i]; }

One more to do while (B[i]!=0) { i++; A[i]=B[i]+C[i]; B[i+1]=C[i]+D[i]; }

How about this code? Loop:r1=MEM[r2+0] r1=r1*2 MEM[r2+0]=r1 r2=r2+4 bne r2 r3 Loop We’ll come back to this later…

Other ILP techniques Consider an in-order superscalar processor executing the following code: R1=16//A R2=R1+5//B R3=14//C R4=R3+5//D – Without OoO we would execute A, BC, D. – Note that A&B are independent of C&D. So ordering ACBD would let us do AC, BD. Thus, the simple action of reordering instructions can increase ILP.

So… We can expose ILP by – Unrolling loops – Reordering code To increase # of independent instructions near each other To move a load (or other high-latency instruction) from its use. – What limits reordering options?

The limits of hoisting loads (again) Moving code outside of its “basic block” is scary – In other words, moving code past branches or branch targets can give wrong execution Loads or stores might go to invalid locations Need to be sure don’t trash a needed register. Also – Moving loads past stores is scary What if store wrote to that address The problem is that we don’t have the recovery mechanisms we do in hardware – After all, the program specifies behavior! How do we know when the specified behavior is “wrong”? In hardware it is fairly easy…

Static dependency checking A superscalar processor has to do certain dependency checking at issue – Is a given set of instructions dependent on each other? – If ALU resources are shared are there enough resources? Many of these issues can be resolved at compile time. – What can’t be resolved? – Once resolved, how do you tell the CPU?

One static solution: VLIW reviewed Have a bunch of pipelines, usually with different functional units. – Each “instruction” actually contains directions for all the pipelines. – (Thus the “very long instruction”) Pipe1 – int Pipe2 – int Pipe3 – fp Pipe4 – ld/st Pipe5 – branch VLIW Instruction word

What’s good about VLIW? Compiler does all dependency checking, including structural hazards! – No dependence checking makes the hardware a lot simpler! Reduces mis-prediction penalty. Saves power May save area! Since the compiler can also reorder instructions we may be able to make good use of the pipes.

So what’s bad? Code density – If you can’t fill a given pipe, need a no-op. – To get the ILP needed to be able to fill the pipe, often need to unroll loops. When a newer processor comes out, 100% compatibility is hard – Word length may need to change – Structural dependencies may be different

Conditional execution Conditional execution (we’ve done this before!) bne r1 r2 skip r4=r5+r6 skip: r7=r4+r12 r8=cmp(r1,r2) if(r8) r4=r5 + r6 r7=r4+r12 -or- r8=cmp(r1,r2) r9=r5+r6 cmov (r8, r4  r9) r7=r4+r12

Quick review question bne r1 r2 skip r4=r5/r6 skip: r7=r4+r12 Translate using cmov r8=cmpEQ(r1,r2) cmov(!r8, r6  1) r9=r5/r6 cmov (r8, r4  r9) r7=r4+r12

Software pipelining

Code example from before Loop:r1=MEM[r2+0] r1=r1*2 MEM[r2+0]=r1 r2=r2+4 bne r2 r3 Loop And one name dependency on r2. Also the store is dependent on r2 but hidden…

ILP? r1=MEM[r2+0] //A r1=r1*2 //B MEM[r2+0]=r1 //C r2=r2+4 //D bne r2 r3 Loop //E Currently no two instructions can be executed in parallel on a statically scheduled machine. – (With branch prediction A and E could be executed in parallel) On a dynamically scheduled machine could execute instructions from different iterations at once.

What would OoO do? r1=MEM[r2+0] //A r1=r1*2 //B MEM[r2+0]=r1 //C r2=r2+4 //D bne r2 r3 Loop //E A perfect, dynamically scheduled, speculative computer would find the following: Let A1 indicate the execution of A in the first iteration of the loop. A1D1 B1A2D2E1 C1B2A3D3E2 C2B3A4D4E3 …..

r1=MEM[r2+0] //A r1=r1*2 //B MEM[r2+0]=r1 //C r2=r2+4 //D bne r2 r3 Loop //E Software Pipeline With “software pipelining” we can do the same thing in software. MEM[r2+0]=r1 //C(n) r1=r4*2 //B(n+1) r4=MEM[r2+8] //A(n+2) r2=r2+4 //D(n) bne r2 r3 Loop //E(n) What problems could arise? “Speculative load” might cause an exception. Latency of load could be too slow.

Prolog and epilog r3=r3-8 // Needed to check legal! r4=MEM[r2+0]//A(1) r1=r4*2//B(1) r4=MEM[r2+4]//A(2) Loop:MEM[r2+0]=r1 //C(n) r1=r4*2 //B(n+1) r4=MEM[r2+8] //A(n+2) r2=r2+4 //D(n) bne r2 r3 Loop //E(n) MEM[r2+0]=r1// C(x-1) r1=r4*2// B(x) MEM[r2+0]=r1// C(x) r3=r3+8// Could have used tmp var.

Software Pipelining example Execution Code Layout Action Flow Stage A Stage B Stage C Stage A Stage B Stage A Stage C Stage B Stage C Stage D Stage CStage BStage A iter1 iter0 II Prologue Kernel Epilogue itern-2 itern-1

Example, just to be sure. r4=MEM[r2+0]//A1 r1=r4*2//B1 r4=MEM[r2+4]//A2 Loop:MEM[r2+0]=r1 //C(n) r1=r4*2 //B(n+1) r4=MEM[r2+8] //A(n+2) r2=r2+4 //D(n) bne r2 r3 Loop //E(n) ADDRDATA R2=12, r3=28 R4=_______ R1= _______

Next step Parallel execution – It isn’t clear how D and E of any iteration can be executed in parallel on a statically scheduled machine What if load latency is too long? – Will be stalling a lot… – Fix by unrolling loop some. r1=MEM[r2+0] //A r1=r1*2 //B MEM[r2+0]=r1 //C r2=r2+4 //D bne r2 r3 Loop //E

NEXT… Let’s now jump from Software Pipelining to IA- 64. – We will come back to Software Pipelining in the context of IA-64…

IA bit registers – Use a register window similarish to SPARC bit fp registers 64 1 bit predicate registers 8 64-bit branch target registers

Explicit Parallelism Groups – Instructions which could be executed in parallel if hardware resources available. Bundle – Code format. 3 instructions fit into a 128-bit bundle. – 5 bits of template, 41*3 bits of instruction. Template specifies what execution units each instruction requires.

Instructions 41 bits – 4 high order specify opcode (combined with template for bundle) – 6 low order bits specify predicate register number. Every instruction is predicated! Also NaT bits are used to handle speculated exceptions.

Speculative Load TraditionalIA-64 Load instruction (ld.s) can be moved outside of a basic block even if branch target is not known Speculative loads does not produce exception - it sets the NaT Check instruction (chk.s) will jump to fix- up code if NaT is set

Propagation of NaT IF ( NaT[r3] || NaT[r4] ) THEN set NaT[r6] IF ( NaT[r6] ) THEN set NaT[r5] Require check on NaT[r5] only since the NaT is inherited Reduce number of checks Fix-up will execute the entire chain Only single check required NaT[reg] = NaT bit of reg

Advanced loads ld.a – Advanced load – Performs the load, puts it into the “ALAT” If any following store writes to the same address, this is noted with a single bit. When a ld.c is executed, if that bit is set, we refetch. When chk.a is executed, if bit is set, fix up code is run. (Useful if load result already used.) Both also cause any deferred exception to occur.

Software pipelining on IA-64 Lots of tricks – Rotating registers – Special counters Often don’t need Prologue and Epilog. – Special counters and prediction lets us only execute those instructions we need to.