CS 6461: Computer Architecture. Basic Compiler Techniques for Exposing ILP. Instructor: Morris Lancaster. Corresponding to Hennessy and Patterson, Fifth Edition.


CS 6461: Computer Architecture
Basic Compiler Techniques for Exposing ILP
Instructor: Morris Lancaster
Corresponding to Hennessy and Patterson, Fifth Edition, Section 3.2

January 2013, CS 6461 Compiler Based Scheduling

Basic Compiler Techniques for Exposing ILP
Crucial for processors that use static issue, and important for processors that make dynamic issue decisions but use static scheduling

Basic Pipeline Scheduling and Loop Unrolling
Exploiting parallelism among instructions
  – Find sequences of unrelated instructions that can be overlapped in the pipeline
  – To avoid a stall, separate a dependent instruction from its source instruction by a distance in clock cycles equal to the pipeline latency of the source instruction
The compiler works with knowledge of the amount of ILP available in the program and of the latencies of the functional units within the pipeline
  – This couples the compiler, sometimes to the specific chip version, or at least requires setting the appropriate compiler flags

Assumed Latencies
Instruction producing result | Instruction using result | Latency in clock cycles (needed to avoid stall)
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
Result of the load can be bypassed without stalling the store
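The latency table can be read as a lookup: the number of stall cycles left is the table latency minus the number of independent instructions placed between producer and consumer. A minimal sketch of that reading follows; the names `LATENCY` and `stalls_needed` and the string labels for instruction kinds are illustrative assumptions, not part of the slides.

```python
# Latency table from the slide: (producer, consumer) -> cycles needed between them.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store_double"): 2,
    ("load_double", "fp_alu"): 1,
    ("load_double", "store_double"): 0,
}


def stalls_needed(producer, consumer, independent_between=0):
    """Stall cycles remaining after `independent_between` unrelated
    instructions are scheduled between the producer and its consumer."""
    return max(0, LATENCY[(producer, consumer)] - independent_between)


# An FP add feeding a store with nothing in between stalls for 2 cycles.
print(stalls_needed("fp_alu", "store_double"))
# One independent instruction after a load hides the load-to-FP-ALU latency.
print(stalls_needed("load_double", "fp_alu", independent_between=1))
```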

Basic Pipeline Scheduling and Loop Unrolling (cont)
Assume a standard 5-stage integer pipeline
  – Branches have a delay of one clock cycle
Functional units are fully pipelined or replicated (as many times as the pipeline depth)
  – An operation of any type can be issued on every clock cycle and there are no structural hazards

Basic Pipeline Scheduling and Loop Unrolling (cont)
Sample code
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;
MIPS code
    Loop: L.D     F0,0(R1)      ; F0 = array element
          ADD.D   F4,F0,F2      ; add scalar in F2
          S.D     F4,0(R1)      ; store back
          DADDUI  R1,R1,#-8     ; decrement index
          BNE     R1,R2,Loop    ; R2 is precomputed so that
                                ; 8(R2) is last value to be
                                ; computed
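To see why R2 is set so that 8(R2) addresses the last element processed, the MIPS loop can be mimicked with byte addresses. This is only a sketch: the `base` address, the dictionary standing in for memory, and the function name `run_loop` are assumptions made for illustration, and the array needs at least one element because the loop tests at the bottom.

```python
def run_loop(x, s, base=0):
    """Mimic the MIPS loop on a 1-based array x[1..n] of 8-byte doubles.

    `base` is a hypothetical byte address for the unused slot x[0].
    """
    n = len(x) - 1                       # x[0] is a placeholder
    mem = {base + 8 * i: x[i] for i in range(1, n + 1)}
    r1 = base + 8 * n                    # R1 starts at the address of x[n]
    r2 = base                            # R2 chosen so that 8(R2) is x[1]
    f2 = s                               # F2 holds the scalar
    while True:
        f0 = mem[r1]                     # L.D    F0,0(R1)
        f4 = f0 + f2                     # ADD.D  F4,F0,F2
        mem[r1] = f4                     # S.D    F4,0(R1)
        r1 -= 8                          # DADDUI R1,R1,#-8
        if r1 == r2:                     # BNE    R1,R2,Loop not taken
            break
    return [x[0]] + [mem[base + 8 * i] for i in range(1, n + 1)]


print(run_loop([None, 1.0, 2.0, 3.0], 0.5))
```

The last store happens while R1 equals R2 + 8; the following decrement makes R1 equal to R2 and the branch falls through.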

Basic Pipeline Scheduling and Loop Unrolling (cont)
MIPS code
    Loop: L.D     F0,0(R1)      ; clock cycle 1
          stall                 ; 2
          ADD.D   F4,F0,F2      ; 3
          stall                 ; 4
          stall                 ; 5
          S.D     F4,0(R1)      ; 6
          DADDUI  R1,R1,#-8     ; 7
          stall                 ; 8
          BNE     R1,R2,Loop    ; 9

Rescheduling Gives
Sample code
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;
MIPS code
    Loop: L.D     F0,0(R1)      ; 1
          DADDUI  R1,R1,#-8     ; 2
          ADD.D   F4,F0,F2 *    ; 3
          stall                 ; 4
          stall                 ; 5
          S.D     F4,8(R1) *    ; 6
          BNE     R1,R2,Loop    ; 7
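The cycle counts for the original and rescheduled loops (9 and 7) can be reproduced with a small in-order, single-issue model. This is a sketch under stated assumptions: the FP latencies are the assumed latencies from the table, the one-cycle integer-op-to-branch latency is inferred from the stall shown before BNE, and the tuple encoding and the name `issue_cycles` are invented for illustration. Memory offsets are not modeled, so the changed S.D offset does not appear here.

```python
# Cycles that must separate a producer from a dependent consumer.
LATENCY = {
    ("load", "fp_alu"): 1,
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "store"): 0,
    ("int_alu", "branch"): 1,    # assumption: implied by the stall before BNE
}


def issue_cycles(program):
    """Issue cycle of each instruction on an in-order, single-issue pipeline.

    Each program entry is (kind, destination register or None, source registers).
    """
    last_writer = {}             # register -> (issue cycle, producer kind)
    cycles, prev = [], 0
    for kind, dest, sources in program:
        cycle = prev + 1         # at most one instruction issues per cycle
        for reg in sources:
            if reg in last_writer:
                when, producer = last_writer[reg]
                cycle = max(cycle, when + 1 + LATENCY.get((producer, kind), 0))
        if dest is not None:
            last_writer[dest] = (cycle, kind)
        cycles.append(cycle)
        prev = cycle
    return cycles


LD = ("load", "F0", ["R1"])            # L.D    F0,0(R1)
ADD = ("fp_alu", "F4", ["F0", "F2"])   # ADD.D  F4,F0,F2
SD = ("store", None, ["F4", "R1"])     # S.D    F4,...(R1)
DADD = ("int_alu", "R1", ["R1"])       # DADDUI R1,R1,#-8
BNE = ("branch", None, ["R1", "R2"])   # BNE    R1,R2,Loop

unscheduled = [LD, ADD, SD, DADD, BNE]
rescheduled = [LD, DADD, ADD, SD, BNE]

print(issue_cycles(unscheduled))   # last entry is the loop length in cycles
print(issue_cycles(rescheduled))
```

Moving DADDUI into the load delay removes two stalls: the one after L.D and the one before BNE. The two stalls between ADD.D and S.D remain because nothing independent is left to fill them.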

Unrolling Summary (continued)
Simple Unroll
    Loop: L.D     F0,0(R1)
          ADD.D   F4,F0,F2
          S.D     F4,0(R1)
          L.D     F0,-8(R1)
          ADD.D   F4,F0,F2
          S.D     F4,-8(R1)
          L.D     F0,-16(R1)
          ADD.D   F4,F0,F2
          S.D     F4,-16(R1)
          L.D     F0,-24(R1)
          ADD.D   F4,F0,F2
          S.D     F4,-24(R1)
          DADDUI  R1,R1,#-32
          BNE     R1,R2,Loop
Slide legend: Name Dependences, Data Dependences

Unrolling and Renaming Gives
MIPS code
    Loop: L.D     F0,0(R1)
          ADD.D   F4,F0,F2      ; we have a stall coming
          S.D     F4,0(R1)
          L.D     F6,-8(R1)
          ADD.D   F8,F6,F2
          S.D     F8,-8(R1)
          L.D     F10,-16(R1)
          ADD.D   F12,F10,F2
          S.D     F12,-16(R1)
          L.D     F14,-24(R1)
          ADD.D   F16,F14,F2
          S.D     F16,-24(R1)
          DADDUI  R1,R1,#-32
          BNE     R1,R2,Loop
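The renamed listing follows a mechanical pattern: each unrolled copy gets its own load register and result register, its offset drops by 8 bytes per copy, and a single DADDUI and BNE close the loop. The generator below is an illustrative sketch only; the function name `unroll_with_renaming` and the way register pairs are passed in are assumptions, and the pairs are simply those used on the slide.

```python
def unroll_with_renaming(register_pairs, stride=8):
    """Emit one load/add/store group per (load_reg, result_reg) pair,
    followed by a single pointer update and branch."""
    lines = []
    for copy, (load_reg, result_reg) in enumerate(register_pairs):
        offset = -stride * copy
        lines.append(f"L.D    {load_reg},{offset}(R1)")
        lines.append(f"ADD.D  {result_reg},{load_reg},F2")
        lines.append(f"S.D    {result_reg},{offset}(R1)")
    # One decrement covers all copies, so it is scaled by the unroll factor.
    lines.append(f"DADDUI R1,R1,#-{stride * len(register_pairs)}")
    lines.append("BNE    R1,R2,Loop")
    return lines


pairs = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]
listing = unroll_with_renaming(pairs)
print("\n".join(listing))
```

Because no two copies share a destination register, only the true data dependences within each load/add/store group remain, which is what lets the scheduler interleave the copies.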

Unrolling and Removing Hazards Gives
MIPS code
    Loop: L.D     F0,0(R1)      ; total of 14 clock cycles
          L.D     F6,-8(R1)
          L.D     F10,-16(R1)
          L.D     F14,-24(R1)
          ADD.D   F4,F0,F2
          ADD.D   F8,F6,F2
          ADD.D   F12,F10,F2
          ADD.D   F16,F14,F2
          S.D     F4,0(R1)
          S.D     F8,-8(R1)
          DADDUI  R1,R1,#-32
          S.D     F12,16(R1)
          S.D     F16,8(R1)
          BNE     R1,R2,Loop
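The last two stores use offsets 16 and 8 instead of -16 and -24 because they now execute after DADDUI has subtracted 32 from R1. The adjustment keeps the effective address unchanged; a short sketch follows, where the function name `adjusted_offset` and the sample R1 value are hypothetical.

```python
def adjusted_offset(offset, pointer_delta):
    """Offset to use once a memory access is moved below an instruction
    that adds `pointer_delta` to its base register."""
    return offset - pointer_delta


# Stores moved below DADDUI R1,R1,#-32 in the unrolled loop.
print(adjusted_offset(-16, -32))
print(adjusted_offset(-24, -32))

# The effective address is the same before and after the move.
r1 = 8000                                  # hypothetical value of R1 before DADDUI
old_address = r1 + (-16)
new_address = (r1 - 32) + adjusted_offset(-16, -32)
print(old_address == new_address)
```

The same rule explains the rescheduled single-iteration loop, where S.D F4,0(R1) becomes S.D F4,8(R1) after moving below DADDUI R1,R1,#-8.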

Unrolling Summary for Above
Determine that it was legal to move the S.D after the DADDUI and BNE, and find the amount to adjust the S.D offset
Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for loop maintenance code
Use different registers to avoid unnecessary constraints that would be forced by using the same registers
Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
Determine that the loads and stores can be interchanged by determining that the loads and stores from different iterations are independent
Schedule the code, preserving any dependencies
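At the source level, the steps above amount to replacing four trips through the loop with one trip that handles four independent elements. The sketch below shows the idea in Python rather than MIPS; the function names and temporaries are illustrative, and it assumes the trip count is a multiple of 4 (as with the 1000 iterations in the sample code), so it omits any cleanup loop for leftover iterations.

```python
def add_scalar(x, s):
    """Original loop: x[i] = x[i] + s for i = n down to 1 (x[0] unused)."""
    i = len(x) - 1
    while i > 0:
        x[i] = x[i] + s
        i -= 1


def add_scalar_unrolled4(x, s):
    """Same loop unrolled four times; assumes the trip count is a multiple of 4."""
    n = len(x) - 1
    assert n % 4 == 0, "this sketch omits the cleanup loop for leftover iterations"
    i = n
    while i > 0:
        # Four independent copies of the body, each with its own temporary,
        # mirroring the use of different registers per copy.
        t0 = x[i] + s
        t1 = x[i - 1] + s
        t2 = x[i - 2] + s
        t3 = x[i - 3] + s
        x[i], x[i - 1], x[i - 2], x[i - 3] = t0, t1, t2, t3
        i -= 4                  # one index update and one test per four elements


a = [0.0] + [float(k) for k in range(1, 1001)]
b = list(a)
add_scalar(a, 2.5)
add_scalar_unrolled4(b, 2.5)
print(a == b)
```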

Unrolling Summary (continued)
Example on Page 311 shows the steps
    Loop: L.D     F0,0(R1)
          ADD.D   F4,F0,F2
          S.D     F4,0(R1)
          L.D     F0,-8(R1)
          ADD.D   F4,F0,F2
          S.D     F4,-8(R1)
          L.D     F0,-16(R1)
          ADD.D   F4,F0,F2
          S.D     F4,-16(R1)
          L.D     F0,-24(R1)
          ADD.D   F4,F0,F2
          S.D     F4,-24(R1)
          DADDUI  R1,R1,#-32
          BNE     R1,R2,Loop
Slide legend: Name Dependences, Data Dependences

Unrolling Summary (Renaming)
Example on Page 311 shows the steps
    Loop: L.D     F0,0(R1)
          ADD.D   F4,F0,F2
          S.D     F4,0(R1)
          L.D     F6,-8(R1)
          ADD.D   F8,F6,F2
          S.D     F8,-8(R1)
          L.D     F10,-16(R1)
          ADD.D   F12,F10,F2
          S.D     F12,-16(R1)
          L.D     F14,-24(R1)
          ADD.D   F16,F14,F2
          S.D     F16,-24(R1)
          DADDUI  R1,R1,#-32
          BNE     R1,R2,Loop
Slide legend: Name Dependences, Data Dependences

Unrolling Summary (continued)
Limits to Impacts of Unrolling Loops
  – As we unroll more, each additional unroll amortizes less of the remaining loop overhead, so the improvement shrinks
  – Growth in code size
  – Shortfall in available registers (register pressure)
      Scheduling the code to increase ILP causes the number of live values to increase
      This could generate a shortage of registers and negatively impact the optimization
Useful in a variety of processors today
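The three limits can be put in rough numbers for the x[i] = x[i] + s loop. The sketch below assumes the scheduled code hides every stall, as the 14-cycle schedule does for an unroll factor of 4; for other factors that is an assumption, not a result from the slides. The function name `unrolled_loop_costs` and the dictionary keys are illustrative.

```python
def unrolled_loop_costs(k):
    """Rough costs for the x[i] = x[i] + s loop unrolled k times.

    Assumes the schedule hides every stall, so cycles equal instructions issued.
    """
    body = 3 * k                 # one L.D, ADD.D and S.D per element
    overhead = 2                 # one DADDUI and one BNE per unrolled iteration
    return {
        "instructions": body + overhead,            # code size of the loop
        "cycles_per_element": (body + overhead) / k,
        "overhead_per_element": overhead / k,
        "fp_registers": 2 * k + 1,                  # two per copy, plus F2 for s
    }


for k in (2, 4, 8):
    print(k, unrolled_loop_costs(k))
```

Under this model, going from 2 to 4 copies saves half a cycle per element, while going from 4 to 8 saves only a quarter, and each doubling also doubles the loop body and the number of FP registers needed for the copies.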