HW 2 is out! Due 9/25! CS 6290 Static Exploitation of ILP.

Presentation transcript:

HW 2 is out! Due 9/25!

CS 6290 Static Exploitation of ILP

Data-Dependence Stalls w/o OOO
Single-Issue Pipeline
– When no bypassing exists
– Load-to-use
– Long-latency instructions
Multiple-Issue (Superscalar), but in-order
– Instructions executing in same cycle cannot have RAW
– Limits on WAW

Solutions: Static Exploitation of ILP
Code Transformations
– Code scheduling, loop unrolling, tree height reduction, trace scheduling
VLIW (later lecture)

Simple Loop Example
    for (i=1000; i>0; i--) x[i] = x[i] + s;
Loop:
    L.D    F0,0(R1)     ; F0 = array element
    ADD.D  F4,F0,F2     ; add scalar in F2
    S.D    F4,0(R1)     ; store result
    DADDUI R1,R1,#-8    ; decrement pointer 8 bytes (per DW)
    BNE    R1,R2,Loop   ; branch R1 != R2
Assume: Single-Issue
– FP ALU → Store: +2 cycles
– Load DW → FP ALU: +1 cycle
– Branch: +1 cycle
Loop:
    L.D    F0,0(R1)
    stall
    ADD.D  F4,F0,F2
    stall
    S.D    F4,0(R1)
    DADDUI R1,R1,#-8
    stall
    BNE    R1,R2,Loop

Scheduled Loop Body
Assume:
– FP ALU → Store: +2 cycles
– Load DW → FP ALU: +1 cycle
– Branch: +1 cycle
Before scheduling:
Loop:
    L.D    F0,0(R1)
    stall
    ADD.D  F4,F0,F2
    stall
    S.D    F4,0(R1)
    DADDUI R1,R1,#-8
    stall
    BNE    R1,R2,Loop
After scheduling (hoist the add, i.e. the DADDUI):
Loop:
    L.D    F0,0(R1)
    DADDUI R1,R1,#-8
    ADD.D  F4,F0,F2
    stall
    S.D    F4,8(R1)     ; offset is now 8 because R1 was already decremented
    BNE    R1,R2,Loop

Scheduling for Multiple-Issue
Original code:
    A: R1 = R2 + R3
    B: R4 = R1 – R5
    C: R1 = LOAD 0[R7]
    D: R2 = R1 + R6
    E: R6 = R3 + R5
    F: R5 = R6 – R4
(diagram: A B C D E F)
Renamed and reordered code:
    A:  R1 = R2 + R3
    C’: R8 = LOAD 0[R7]
    B:  R4 = R1 – R5
    E’: R9 = R3 + R5
    D’: R2 = R8 + R6
    F’: R5 = R9 – R4
(diagram: A B C’ D’ E’ F’)
Same functionality, no stalls
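The claim "same functionality" can be checked with a small sketch. The C below models the register file as an array (a hypothetical toy model, not a real ISA simulator), treats the word at 0[R7] as a single value `mem`, and compares the two sequences. The names `Machine`, `run_original`, `run_renamed`, and `same_results` are made up for illustration. Note that the renamed code leaves two values in different registers: the loaded value ends up in R8 instead of R1, and the R3 + R5 sum ends up in R9 instead of R6.

```c
/* Toy register file: r[n] stands for Rn; mem stands for the word at 0[R7]. */
typedef struct {
    long r[10];
    long mem;
} Machine;

/* Original sequence A..F, executed in program order. */
void run_original(Machine *m) {
    long *R = m->r;
    R[1] = R[2] + R[3];   /* A                                   */
    R[4] = R[1] - R[5];   /* B: RAW on R1 (from A)               */
    R[1] = m->mem;        /* C: WAW with A, WAR with B on R1     */
    R[2] = R[1] + R[6];   /* D: RAW on R1 (from C)               */
    R[6] = R[3] + R[5];   /* E: WAR with D on R6                 */
    R[5] = R[6] - R[4];   /* F: RAW on R6 (from E), R4 (from B)  */
}

/* Renamed sequence: C, D, E, F use R8/R9, removing the WAW/WAR hazards. */
void run_renamed(Machine *m) {
    long *R = m->r;
    R[1] = R[2] + R[3];   /* A  */
    R[8] = m->mem;        /* C' */
    R[4] = R[1] - R[5];   /* B  */
    R[9] = R[3] + R[5];   /* E' */
    R[2] = R[8] + R[6];   /* D' */
    R[5] = R[9] - R[4];   /* F' */
}

/* Returns 1 if both sequences compute the same values, allowing for the
   renaming (original R1 <-> renamed R8, original R6 <-> renamed R9). */
int same_results(long r2, long r3, long r5, long r6, long mem) {
    Machine a = {{0}, 0};
    Machine b;
    a.r[2] = r2; a.r[3] = r3; a.r[5] = r5; a.r[6] = r6;
    a.mem = mem;
    b = a;
    run_original(&a);
    run_renamed(&b);
    return a.r[2] == b.r[2] && a.r[4] == b.r[4] && a.r[5] == b.r[5]
        && a.r[1] == b.r[8] && a.r[6] == b.r[9];
}
```

Because later code would have to read R8 and R9 where it previously read R1 and R6, the renaming is only "free" if the compiler has spare architected registers.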

Interaction with RegAlloc and Branches
Largely limited by architected registers
– Weird interactions with register allocation … could possibly cause more spills/fills
Code motion may be limited:
    R1 = R2 + R3
    BEQZ R9
    R1 = LOAD 0[R6]   (one successor block)
    R5 = R1 – R4      (other successor block)
– Need to allocate registers differently (R8)
– Causes unnecessary execution of LOAD when branch goes left (AKA Dynamic Dead Code)

Goal of Multi-Issue Scheduling
Place as many independent instructions in sequence
– “as many” → up to execution bandwidth (don’t need 7 independent insts on a 3-wide machine)
– Avoid pipeline stalls
If compiler is really good, we should be able to get high performance on an in-order superscalar processor
– In-order superscalar provides execution B/W, compiler provides dependence scheduling

Why this Should Work
Compiler has “all the time in the world” to analyze instructions
– Hardware must do it in < 1ns
Compiler can “see” a lot more
– Compiler can do complex inter-procedural analysis, understand high-level behavior of code and programming language
– Hardware can only see a small number of instructions at a time

Why this Might not Work
Compiler has limited access to dynamic information
– Profile-based information
– Perhaps none at all, or not representative
– Ex. Branch T in 1st ½ of program, NT in 2nd ½, looks like a 50/50 branch in profile
Compiler has to generate static code
– Cannot react to dynamic events like data cache misses

Loop Unrolling
Transforms an M-iteration loop into a loop with M/N iterations
– We say that the loop has been unrolled N times
Original:
    for(i=0;i<100;i++)
        a[i]*=2;
Unrolled 4 times:
    for(i=0;i<100;i+=4){
        a[i]*=2;
        a[i+1]*=2;
        a[i+2]*=2;
        a[i+3]*=2;
    }
Some compilers can do this (gcc -funroll-loops)
Or you can do it manually (above)

Why Loop Unrolling? (1)
Less loop overhead
Original:
    for(i=0;i<100;i++)
        a[i] += 2;
Unrolled 4 times:
    for(i=0;i<100;i+=4){
        a[i] += 2;
        a[i+1] += 2;
        a[i+2] += 2;
        a[i+3] += 2;
    }
How many branches?
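One way to answer the question is to count. The sketch below assumes the compiler emits one loop-closing branch per trip through the loop body (a bottom-tested loop); the exact count on a real machine depends on the compiler and is not something the slides state. The function names and the `branches` counter are illustrative only.

```c
/* Rolled loop: one loop-closing branch per element. */
int branches_rolled(int *a, int n) {
    int branches = 0;
    for (int i = 0; i < n; i++) {
        a[i] += 2;
        branches++;          /* models the branch back to the loop top */
    }
    return branches;
}

/* Unrolled by 4: one loop-closing branch per four elements.
   Assumes n is a multiple of 4, as in the slide's n = 100 case. */
int branches_unrolled4(int *a, int n) {
    int branches = 0;
    for (int i = 0; i < n; i += 4) {
        a[i]     += 2;
        a[i + 1] += 2;
        a[i + 2] += 2;
        a[i + 3] += 2;
        branches++;          /* models the branch back to the loop top */
    }
    return branches;
}
```

Under that assumption, a 100-element array gives 100 branches for the rolled loop and 25 for the unrolled one, and the same reduction applies to the index increment and compare.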

Why Loop Unrolling? (2)
Allows better scheduling of instructions
Original loop body (executed once per element):
    R2 = R3 * #4
    R2 = R2 + #a
    R1 = LOAD 0[R2]
    R1 = R1 + #2
    STORE R1 → 0[R2]
    R3 = R3 + 1
    BLT R3, 100, #top
Unrolled 4 times:
    R2 = R3 * #4
    R2 = R2 + #a
    R1 = LOAD 0[R2]
    R1 = R1 + #2
    STORE R1 → 0[R2]
    R1 = LOAD 4[R2]
    R1 = R1 + #2
    STORE R1 → 4[R2]
    R1 = LOAD 8[R2]
    R1 = R1 + #2
    STORE R1 → 8[R2]
    R1 = LOAD 12[R2]
    R1 = R1 + #2
    STORE R1 → 12[R2]
    R3 = R3 + 4
    BLT R3, 100, #top

Why Loop Unrolling? (3)
Get rid of small loops
Loop form, with one block per iteration (for(0), for(1), for(2), for(3)):
    for(i=0;i<4;i++)
        a[i]*=2;
– Difficult to schedule/hoist insts from bottom block to top block due to branches
Fully unrolled:
    a[0]*=2;
    a[1]*=2;
    a[2]*=2;
    a[3]*=2;
– Easier: no branches in the way

Loop Unrolling: Problems
Program size is larger (code bloat)
What if M is not a multiple of N?
– Or if M is not known at compile time?
– Or if it is a while loop?
Original:
    for(i=0;i<j;i++)
        a[i]*=2;
Unrolled 4 times, with a cleanup loop for the leftover iterations:
    j1=j-j%4;
    for(i=0;i<j1;i+=4){
        a[i]*=2;
        a[i+1]*=2;
        a[i+2]*=2;
        a[i+3]*=2;
    }
    for(i=j1;i<j;i++)
        a[i]*=2;
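The cleanup-loop pattern from this slide can be tried out directly. The sketch below wraps the slide's two versions in functions and adds a small checker; `scale_simple`, `scale_unrolled4`, `unroll_matches`, and `MAX_N` are names chosen here for illustration, and the checker assumes 0 <= j <= MAX_N.

```c
#define MAX_N 64

/* Reference version: one element per iteration. */
void scale_simple(int *a, int j) {
    for (int i = 0; i < j; i++)
        a[i] *= 2;
}

/* Unrolled by 4, followed by a cleanup loop for the j % 4 leftover elements. */
void scale_unrolled4(int *a, int j) {
    int j1 = j - j % 4;            /* largest multiple of 4 not exceeding j (j >= 0) */
    int i;
    for (i = 0; i < j1; i += 4) {
        a[i]     *= 2;
        a[i + 1] *= 2;
        a[i + 2] *= 2;
        a[i + 3] *= 2;
    }
    for (i = j1; i < j; i++)       /* at most 3 iterations */
        a[i] *= 2;
}

/* Returns 1 if both versions leave identical arrays for trip count j. */
int unroll_matches(int j) {
    int x[MAX_N], y[MAX_N];
    for (int i = 0; i < MAX_N; i++)
        x[i] = y[i] = i + 1;
    scale_simple(x, j);
    scale_unrolled4(y, j);
    for (int i = 0; i < MAX_N; i++)
        if (x[i] != y[i])
            return 0;
    return 1;
}
```

The cleanup loop is what makes the transformation safe when the trip count is only known at run time, at the price of carrying both loop bodies in the binary.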

Function Inlining
Sort of like “unrolling” a function
Similar benefits to loop unrolling:
– Remove function call overhead: CALL/RETN (and possible branch mispreds); argument/ret-val passing, stack allocation, and associated spills/fills of caller/callee-save regs
– Larger block of instructions for scheduling
Similar problems
– Primarily code bloat

Tree Height Reduction
Shorten critical path(s) using associativity
Before: R8=((R2+R3)+R4)+R5 (I1 → I2 → I3, a chain of three dependent adds)
    I1: ADD R6,R2,R3
    I2: ADD R7,R6,R4
    I3: ADD R8,R7,R5
After: R8=(R2+R3)+(R4+R5) (I1 and I2 are independent, I3 depends on both)
    I1: ADD R6,R2,R3
    I2: ADD R7,R4,R5
    I3: ADD R8,R7,R6
Not all math operations are associative!
C defines L-to-R semantics for most arithmetic
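The warning about associativity is easy to demonstrate with floating point. The sketch below evaluates the same four-operand sum in the chained and the tree-shaped order; the function names are illustrative, and the behavior described assumes IEEE 754 double arithmetic and a compiler that is not allowed to reassociate (for example, no fast-math style options).

```c
/* Chained order, as the source expression is written: ((a+b)+c)+d. */
double sum_chain(double a, double b, double c, double d) {
    return ((a + b) + c) + d;
}

/* Tree order after height reduction: (a+b)+(c+d). */
double sum_tree(double a, double b, double c, double d) {
    return (a + b) + (c + d);
}
```

With small integer-valued inputs such as 1, 2, 3, 4 both orders agree. With a = 1.0, b = 1e20, c = -1e20, d = 1.0 they differ: in the chained order the two large terms cancel before d is added, giving 1.0, while in the tree order each 1.0 is absorbed by a large term first, giving 0.0. This is why a compiler may apply the transformation freely to integer adds but not, by default, to floating-point ones.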

Trace Scheduling
Works on all code, not just loops
– Take an execution trace of the common case
– Schedule code as if it had no branches
– Check branch condition when convenient
– If mispredicted, clean up the mess
How do we find the “common case”?
– Program analysis or profiling

Trace Scheduling Example
Original:
    a=log(x);
    if(b>0.01){
        c=a/b;
    }else{
        c=0;
    }
    y=sin(c);
Suppose the profile says that b>0.01 is the common case
Trace-scheduled:
    a=log(x);
    c=a/b;
    y=sin(c);
    if(b<=0.01) goto fixit;
    fixit:
    c=0;
    y=0; // sin(0)
Now we have larger basic block for our scheduling and optimizations
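A compilable sketch of the same transformation follows. To stay free of the math library, `stage1` and `stage2` are hypothetical stand-ins for the slide's log and sin (the only property used is that `stage2(0)` is 0, like sin(0)); they are not part of the original example. The check is written as `!(b > 0.01)` rather than `b <= 0.01` so that both versions also agree if b is NaN, which is a choice made here, not something the slide specifies.

```c
/* Hypothetical stand-ins for log() and sin(); stage2(0) == 0 like sin(0). */
double stage1(double x) { return x * x + 1.0; }
double stage2(double c) { return c * 0.5; }

/* Original control flow: the branch splits the code into small blocks. */
double original(double x, double b) {
    double a = stage1(x);
    double c;
    if (b > 0.01) {
        c = a / b;
    } else {
        c = 0;
    }
    return stage2(c);
}

/* Trace-scheduled: the common path is one straight-line block,
   the branch is checked late, and the rare case is repaired in fixit. */
double trace_scheduled(double x, double b) {
    double a = stage1(x);
    double c = a / b;            /* speculative: executed even when b <= 0.01 */
    double y = stage2(c);
    if (!(b > 0.01)) goto fixit; /* check the branch condition when convenient */
    return y;
fixit:
    c = 0;
    y = stage2(c);               /* the slide hard-codes this as y = 0 */
    return y;
}
```

The speculative divide is only legal because a floating-point divide by a small or zero b does not trap by default; an operation that could fault (an integer divide, a load through a possibly null pointer) could not be hoisted above the check this way.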

Pay Attention to Cost of Fixing
Assume the code for b > 0.01 accounts for 80% of the time
Optimized trace runs 15% faster
But, fix-up code may cause the remaining 20% of the time to be even slower!
Assume fixup code is 30% slower
By Amdahl’s Law:
– Ignoring the fix-up cost: Speedup = 1 / (0.2 + 0.8*0.85) = 1 / 0.88 ≈ 1.136, i.e. about 13.6% better performance
– Including the fix-up cost: Speedup = 1 / (0.2*1.3 + 0.8*0.85) = 1 / 0.94 ≈ 1.064
Over 1/3 of the benefit removed!
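The Amdahl's Law arithmetic can be packaged as a small helper. This is a sketch with made-up parameter names: `frac_trace` is the fraction of the original run time spent in the optimized trace, and `trace_factor` and `fixup_factor` are new-time over old-time ratios, so "15% faster" is taken to mean a factor of 0.85 and "30% slower" a factor of 1.3, matching the slide's use of 0.85.

```c
/* Overall speedup when frac_trace of the original time is scaled by
   trace_factor and the remaining (1 - frac_trace) by fixup_factor. */
double overall_speedup(double frac_trace, double trace_factor, double fixup_factor) {
    double new_time = (1.0 - frac_trace) * fixup_factor
                    + frac_trace * trace_factor;
    return 1.0 / new_time;
}
```

With the slide's numbers, `overall_speedup(0.8, 0.85, 1.0)` is 1 / 0.88, about 1.136, and `overall_speedup(0.8, 0.85, 1.3)` is 1 / 0.94, about 1.064. Trying other `fixup_factor` values shows how quickly an expensive fix-up path eats the gain from the optimized trace.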