CS 378 Programming for Performance
Single-Thread Performance: Compiler Scheduling for Pipelines
Siddhartha Chatterjee, Spring 2008

Review of Pipelining
- Pipelining improves the throughput of an instruction sequence; it does not reduce the latency of an individual instruction.
- Speedup due to pipelining is limited by hazards:
  - Structural hazards lead to contention for limited resources.
  - Data hazards require stalling or forwarding to maintain sequential semantics.
  - Control hazards require cancellation of instructions (to maintain sequential branch semantics) or delayed branches (to define a new branch semantics).
- Dealing with hazards:
  - Detection: interlocks in hardware
  - Elimination: renaming, branch elimination
  - Resolution: stalling, forwarding, scheduling

CPI of a Pipelined Machine

Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls

Techniques that attack the different components:
- Ideal (issue) CPI: issuing multiple instructions per cycle, compiler dependence analysis, software pipelining, trace scheduling
- RAW stalls: basic pipeline scheduling, dynamic scheduling with scoreboarding, dynamic memory disambiguation (for stalls involving memory), compiler dependence analysis, software pipelining, trace scheduling, speculation
- WAR and WAW stalls: dynamic scheduling with register renaming, compiler dependence analysis, software pipelining, trace scheduling, speculation
- Control stalls: loop unrolling, dynamic branch prediction, speculation
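As a purely illustrative instance of this equation (hypothetical numbers, not from the slides): with an ideal pipeline CPI of 1.0 and per-instruction stall contributions of 0.1 (structural), 0.5 (RAW), 0.1 (WAW), and 0.3 (control), the pipeline CPI is 1.0 + 0.1 + 0.5 + 0.1 + 0.3 = 2.0. A technique that eliminated only the RAW stalls would bring it down to 1.5, a speedup of about 1.33.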

Instruction-Level Parallelism (ILP)
- Pipelining is most effective when we have parallelism among instructions.
- Instructions u and v are parallel if neither P(u,v) nor P(v,u) holds, where P is the precedence (dependence) relation.
- Parallelism within a basic block is limited:
  - A branch frequency of 15% implies about six instructions per basic block.
  - These instructions are likely to depend on each other.
- We need to look beyond basic blocks, to loop-level parallelism: parallelism among iterations of a loop.
- To convert loop-level parallelism into ILP, we need to "unroll" the loop:
  - Statically, by the compiler
  - Dynamically, by the hardware
  - Using vector instructions

Motivating Example for Loop Unrolling

Source loop:
    for (u = 0; u < 1000; u++)
        x[u] = x[u] + s;
Run backwards (same result, since the iterations are independent):
    for (u = 999; u >= 0; u--)
        x[u] = x[u] + s;

Assumptions:
- Loop is being run backwards
- Scalar s is in register pair F2:F3
- Array x starts at memory address 0
- 1-cycle branch delay
- No structural hazards

LOOP: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BNEZ R1, LOOP
      NOP

10 cycles per iteration
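Where the 10 cycles go, assuming the usual textbook latencies (one stall between a load and a dependent FP operation, two stalls between an FP operation and a dependent store, one stall between an integer operation and a dependent branch, plus the one-cycle branch delay):
    cycle  1   LD   F0, 0(R1)
    cycle  2   (stall)
    cycle  3   ADDD F4, F0, F2
    cycle  4   (stall)
    cycle  5   (stall)
    cycle  6   SD   0(R1), F4
    cycle  7   SUBI R1, R1, 8
    cycle  8   (stall)
    cycle  9   BNEZ R1, LOOP
    cycle 10   (branch delay slot: NOP)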

How Far Can We Get With Scheduling?

Before scheduling:
LOOP: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BNEZ R1, LOOP
      NOP

After scheduling:
LOOP: LD   F0, 0(R1)
      SUBI R1, R1, 8
      ADDD F4, F0, F2
      BNEZ R1, LOOP
      SD   8(R1), F4      ; fills the branch delay slot

C equivalent of the scheduled loop:
    for (u = 999; u >= 0; ) {
        register double d = x[u];
        u--;
        d += s;
        x[u+1] = d;
    }

6 cycles per iteration.
Note the change in the SD instruction, from 0(R1) to 8(R1), to compensate for moving it past the SUBI; this is a non-trivial change.
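Under the same latency assumptions, the scheduled loop has a single remaining stall:
    cycle 1   LD   F0, 0(R1)
    cycle 2   SUBI R1, R1, 8
    cycle 3   ADDD F4, F0, F2
    cycle 4   (stall)
    cycle 5   BNEZ R1, LOOP
    cycle 6   SD   8(R1), F4    (branch delay slot)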

Observations on Scheduled Code
- 3 out of 5 instructions involve FP work (the load, add, and store).
- The other two (SUBI and BNEZ) constitute loop overhead.
- Could we improve loop performance by unrolling the loop?
- Assume the number of loop iterations is a multiple of 4, and unroll the loop body four times.
- In real life, we would need to handle the fact that the loop trip count may not be a multiple of 4 (see the sketch below).
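A minimal sketch of how the non-multiple-of-4 case is usually handled (illustrative code, not from the slides; the function name and the remainder-loop structure are our own): peel off the leftover iterations in a scalar cleanup loop, then run the unrolled body on a trip count that is a multiple of 4.
    /* Hypothetical cleanup-loop sketch: n need not be a multiple of 4. */
    void add_scalar(double *x, int n, double s)
    {
        int u = n - 1;
        /* Remainder loop: runs n % 4 times, one element per iteration. */
        for (; u >= 0 && (u + 1) % 4 != 0; u--)
            x[u] = x[u] + s;
        /* Main loop: remaining trip count is a multiple of 4, safe to unroll by 4. */
        for (; u >= 3; u -= 4) {
            x[u]     = x[u]     + s;
            x[u - 1] = x[u - 1] + s;
            x[u - 2] = x[u - 2] + s;
            x[u - 3] = x[u - 3] + s;
        }
    }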

Unrolling: Take 1

LOOP: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BNEZ R1, LOOP
      LD   F0, 0(R1)
      ...               ; the same five-instruction body is repeated four times in all

- This is not any different from the situation before unrolling:
  - The branches induce control dependences.
  - Instructions can't be moved much during scheduling.
- However, the whole point of unrolling was to guarantee that the three internal branches will fall through.
- So maybe we can delete the intermediate branches.
- There is an implicit NOP after the final branch.

Unrolling: Take 2

LOOP: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      LD   F0, 0(R1)
      ...               ; copies 2-4 of the body, intermediate branches removed
      BNEZ R1, LOOP

C view of each unrolled copy (load, add, store, then decrement u):
    for (u = 999; u >= 0; ) {
        register double d;
        d = x[u];
        d += s;
        x[u] = d;
        u--;
    }

- Even though we have gotten rid of the control dependences, we still have flow dependences through R1.
- We could remove these flow dependences by observing that R1 is decremented by 8 each time:
  - Adjust the address specifiers.
  - Delete the first three SUBIs.
  - Change the constant in the fourth SUBI to 32.
- These are non-trivial inferences for a compiler to make.

Unrolling: Take 3

LOOP: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F0, -8(R1)
      ADDD F4, F0, F2
      SD   -8(R1), F4
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   -16(R1), F4
      LD   F0, -24(R1)
      ADDD F4, F0, F2
      SD   -24(R1), F4
      SUBI R1, R1, 32
      BNEZ R1, LOOP

C equivalent:
    for (u = 999; u >= 0; u -= 4) {
        register double d;
        d = x[u];   d += s; x[u]   = d;
        d = x[u-1]; d += s; x[u-1] = d;
        d = x[u-2]; d += s; x[u-2] = d;
        d = x[u-3]; d += s; x[u-3] = d;
    }

- Performance is now limited by the anti-dependences and output dependences on F0 and F4.
- These are name dependences:
  - The instructions are not in a producer-consumer relation.
  - They are simply using the same registers, but they don't have to.
- We can use different registers in different loop iterations, subject to availability.
- Let's rename registers.

Unrolling: Take 4

LOOP: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
      BNEZ R1, LOOP

Time for execution of 4 iterations:
- 14 instruction cycles
- 4 LD-to-ADDD stalls
- 8 ADDD-to-SD stalls
- 1 SUBI-to-BNEZ stall
- 1 branch delay stall
Total: 28 cycles for 4 iterations, or 7 cycles per iteration, which is slower than the scheduled version of the original loop. Let's schedule the unrolled loop.

C equivalent:
    for (u = 999; u >= 0; u -= 4) {
        register double d0, d1, d2, d3;
        d0 = x[u];   d0 += s; x[u]   = d0;
        d1 = x[u-1]; d1 += s; x[u-1] = d1;
        d2 = x[u-2]; d2 += s; x[u-2] = d2;
        d3 = x[u-3]; d3 += s; x[u-3] = d3;
    }

Unrolling: Take 5

LOOP: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SUBI R1, R1, 32
      SD   16(R1), F12     ; offset adjusted: moved past the SUBI
      BNEZ R1, LOOP
      SD   8(R1), F16      ; offset adjusted; fills the branch delay slot

- This code runs without stalls: 14 cycles for 4 iterations, or 3.5 cycles per iteration.
- Performance is limited by loop-control overhead once every four iterations.
- Note that the original loop had three FP instructions that were not independent; loop unrolling exposed independent instructions from multiple loop iterations.
- By unrolling further, we can approach the asymptotic rate of 3 cycles per iteration, subject to the availability of registers.

C equivalent:
    for (u = 999; u >= 0; ) {
        register double d0, d1, d2, d3;
        d0 = x[u];
        d1 = x[u-1];
        d2 = x[u-2];
        d3 = x[u-3];
        d0 += s; d1 += s; d2 += s; d3 += s;
        x[u] = d0;
        x[u-1] = d1;
        u -= 4;
        x[u+2] = d2;
        x[u+1] = d3;
    }
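A quick check that the scheduled, unrolled loop really runs without stalls, under the same latency assumptions as before: issuing one instruction per cycle, each ADDD issues four cycles after the LD that feeds it (only two are required; for example, LD F0 in cycle 1 feeds ADDD F4 in cycle 5), each SD issues at least four cycles after the ADDD that feeds it (three are required; ADDD F4 in cycle 5 feeds SD in cycle 9), and the BNEZ in cycle 13 is two cycles after the SUBI in cycle 11. Every dependence distance meets the required latency, so the 14 instructions take exactly 14 cycles.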

What Did The Compiler Have To Do?
- Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount by which to adjust the SD offset.
- Determine that loop unrolling would be useful by discovering the independence of the loop iterations.
- Rename registers to avoid name dependences.
- Eliminate the extra tests and branches and adjust the loop control.
- Determine that the LDs and SDs can be interchanged, by determining that (since R1 is not being updated within the unrolled body) the address specifiers 0(R1), -8(R1), -16(R1), -24(R1) all refer to different memory locations.
- Schedule the code, preserving dependences.
Resources consumed: code space, architectural registers.

Dependences
- Three kinds of dependences:
  - Data dependence
  - Name dependence
  - Control dependence
- In the context of loop-level parallelism, a data dependence can be loop-independent or loop-carried (a C sketch of the two cases follows this slide).
- Data dependences act as a limit on how much ILP can be exploited in a compiled program.
- The compiler tries to identify and eliminate dependences; the hardware tries to prevent dependences from becoming stalls.
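A small C illustration of the two cases (the arrays, the bound N, and the loop itself are our own example, not from the slides):
    /* Illustrative only: names and N are hypothetical. */
    #define N 1000
    double a[N], b[N], c[N], d[N], e[N];

    void dependence_example(void)
    {
        for (int u = 1; u < N; u++) {
            a[u] = b[u] + c[u];      /* S1 */
            d[u] = a[u] * 2.0;       /* S2 reads a[u] written by S1 in the SAME iteration:
                                        loop-independent data dependence */
            e[u] = e[u-1] + a[u];    /* S3 reads e[u-1] written in the PREVIOUS iteration:
                                        loop-carried data dependence; iterations cannot
                                        simply run in parallel */
        }
    }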

Data and Name Dependences
- Instruction v is data-dependent on instruction u if u produces a result that v consumes.
- Instruction v is anti-dependent on instruction u if u precedes v and v writes a register or memory location that u reads.
- Instruction v is output-dependent on instruction u if u precedes v and v writes a register or memory location that u also writes.
- A data dependence cannot be removed by renaming; it corresponds to a RAW hazard.
- An anti-dependence corresponds to a WAR hazard.
- An output dependence corresponds to a WAW hazard.
(A small example of all three follows.)
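A minimal illustration of the three kinds, using scalar C variables of our own choosing (the same relations hold for registers and memory locations):
    double a, b, c, d, e, f, g;

    void dependence_kinds(void)
    {
        a = b + c;    /* u: writes a                                          */
        d = a + e;    /* v: reads a  -> v is data-dependent (RAW) on u        */
        a = f + g;    /* w: writes a -> w is anti-dependent (WAR) on v and
                         output-dependent (WAW) on u; renaming this second a
                         removes both name dependences                        */
    }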

Control Dependences
- A control dependence determines the ordering of an instruction with respect to a branch instruction, so that the non-branch instruction is executed only when it should be.
    if (p1) { s1; }
    if (p2) { s2; }
  Here s1 is control-dependent on p1, and s2 is control-dependent on p2 but not on p1.
- Control dependence constrains code motion:
  - An instruction that is control-dependent on a branch cannot be moved before the branch in a way that makes its execution no longer controlled by the branch.
  - An instruction that is not control-dependent on a branch cannot be moved after the branch in a way that makes its execution controlled by the branch.

Data Dependence in Loop Iterations

Loop 1 (body):
    A[u+1] = A[u] + C[u];
    B[u+1] = B[u] + A[u+1];
Across successive iterations:
    A[u+1] = A[u]   + C[u];      B[u+1] = B[u]   + A[u+1];
    A[u+2] = A[u+1] + C[u+1];    B[u+2] = B[u+1] + A[u+2];
Each iteration reads the A and B values produced by the previous iteration: the dependences on A and B are loop-carried.

Loop 2 (body):
    A[u] = A[u] + B[u];
    B[u+1] = C[u] + D[u];
Across successive iterations:
    A[u]   = A[u]   + B[u];      B[u+1] = C[u]   + D[u];
    A[u+1] = A[u+1] + B[u+1];    B[u+2] = C[u+1] + D[u+1];
    A[u+2] = A[u+2] + B[u+2];    B[u+3] = C[u+2] + D[u+2];
Here only the dependence through B is loop-carried: the B[u+1] written in one iteration is the B value read by the next iteration's A statement. (Full loop forms are sketched below.)
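For reference, the same two examples written as complete loops; the bounds are hypothetical, since the slide shows only the loop bodies:
    for (u = 1; u <= 100; u++) {   /* Loop 1: carried dependences on A and B */
        A[u+1] = A[u] + C[u];
        B[u+1] = B[u] + A[u+1];
    }
    for (u = 1; u <= 100; u++) {   /* Loop 2: carried dependence only through B */
        A[u]   = A[u] + B[u];
        B[u+1] = C[u] + D[u];
    }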

Loop Transformation
- Sometimes a loop-carried dependence does not prevent loop parallelization.
  - Example: the second loop of the previous slide.
- In other cases, a loop-carried dependence prohibits loop parallelization.
  - Example: the first loop of the previous slide, where each iteration needs the A and B values computed by the one before it.
- In the second loop, the carried dependence runs only through B and is not part of a dependence cycle, so the loop can be rewritten to make the dependence loop-independent (see the sketch below):
    A[u]   = A[u]   + B[u];      B[u+1] = C[u]   + D[u];
    A[u+1] = A[u+1] + B[u+1];    B[u+2] = C[u+1] + D[u+1];
    A[u+2] = A[u+2] + B[u+2];    B[u+3] = C[u+2] + D[u+2];
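A sketch of the standard transformation (following the usual textbook treatment, with the hypothetical 1..100 bounds used above): peel the first A statement and the last B statement, and pair each A[u+1] update with the B[u+1] it consumes, so the remaining dependence lies within a single iteration.
    A[1] = A[1] + B[1];                 /* peeled first A statement        */
    for (u = 1; u <= 99; u++) {
        B[u+1] = C[u] + D[u];
        A[u+1] = A[u+1] + B[u+1];       /* uses the B[u+1] just computed:
                                           dependence is now loop-independent */
    }
    B[101] = C[100] + D[100];           /* peeled last B statement         */
The iterations of the rewritten loop no longer depend on one another, so they can be overlapped or run in parallel.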

Software Pipelining
- Observation: if the iterations of a loop are independent, then we can get ILP by taking instructions from different iterations.
- Software pipelining reorganizes the loop so that each iteration of the new loop is made from instructions chosen from different iterations of the original loop.
[Figure: one software-pipelined iteration draws its instructions from several consecutive original iterations i0 through i4.]

Software Pipelining Example

Before: the loop unrolled 3 times (as in Unrolling: Take 3, but with three copies of the body instead of four):
      LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F0, -8(R1)
      ADDD F4, F0, F2
      SD   -8(R1), F4
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   -16(R1), F4
      SUBI R1, R1, #24
      BNEZ R1, LOOP

After: software pipelined
      LD   F0, 0(R1)        ; startup (prologue)
      ADDD F4, F0, F2
      LD   F0, -8(R1)
LOOP: SD   0(R1), F4        ; stores x[u]
      ADDD F4, F0, F2       ; adds to x[u-1]
      LD   F0, -16(R1)      ; loads x[u-2]
      SUBI R1, R1, #8
      BNEZ R1, LOOP
      SD   0(R1), F4        ; cleanup (epilogue)
      SD   -8(R1), F4

C equivalent of the software-pipelined loop:
    register double d, e;
    d = x[999]; e = d + s; d = x[998];
    for (u = 999; u >= 2; u--) {
        x[u] = e;
        e = d + s;
        d = x[u-2];
    }
    x[1] = e;
    e = d + s;
    x[0] = e;

[Pipeline diagram: SD, ADDD, and LD flowing through IF ID EX MEM WB, marking where F0 and F4 are read and written.]

Software Pipelining: Concept
S. Weiss and J. E. Smith, "A Study of Scalar Compilation Techniques for Pipelined Supercomputers", ISCA 1987, pages 105-109.
- Notation: L (load), E (execute), S (store); the iterations are independent.
- The loop is
      Loop: Li  Ei  Si  JC Loop
  and its execution unrolls into the sequence
      L1 E1 S1   L2 E2 S2   L3 E3 S3   ...   Ln En Sn
- In this normal sequence, Ei depends on Li, and Si depends on Ei, leading to pipeline stalls.
- Software pipelining attempts to reduce these delays by inserting other instructions between such dependent pairs, "hiding" the delay.
- The "other" instructions are L and S instructions from other iterations of the loop.
- It does this without consuming extra code space or registers.
- Performance is usually not as high as that of loop unrolling.
- How can we permute L, E, S to achieve this? (A sketch follows.)
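One possible permutation, consistent with the software-pipelined example two slides back (our own sketch, assuming at least three iterations): start iteration 1 ahead of the loop, and in the steady-state body store iteration i while executing iteration i+1 and loading iteration i+2.
    L1                                 prologue: load iteration 1
    E1   L2                            execute iteration 1, load iteration 2
    Loop: Si  Ei+1  Li+2  JC Loop      kernel, for i = 1 .. n-2
    Sn-1 En                            epilogue: drain the remaining work
    Sn
Each dependent pair (Li and Ei, Ei and Si) is now separated by instructions from neighboring iterations, which is what hides the latency.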