\course\ELEG652-03Fall\Topic3-6521 Exploitation of Instruction-Level Parallelism (ILP)

Reading List

Slides: Topic4x
Henn & Patt: Chapter 4
Other assigned readings from homework and classes

Design Space for Processors

[Figure: processor design space plotted as cycles per instruction vs. clock rate (MHz), with regions for Scalar CISC, Scalar RISC, Superpipelined, Superscalar RISC, VLIW, Vector Supercomputer, and Multithreaded designs; the most likely future processor space lies toward low CPI at high clock rates. "Enough parallelism?" - see TheobaldGaoHen 1992, 1993, 1994.]

Pipelining - A Review

Hazards:
- Structural: resource conflicts when the hardware cannot support all possible combinations of instructions in overlapped execution.
- Data: an instruction depends on the result of a previous instruction.
- Control: due to branches and other instructions that change the PC.

A hazard causes a "stall", but in a pipeline a stall is serious - it holds up multiple instructions.

RISC Concepts: Revisit

What makes it a success?
- pipelining
- caches

What prevents CPI = 1?
- hazards and their resolution
- Def: dependence graph

Structural Hazards

- Non-pipelined FUs
- One port of the register file
- One port of data memory

Data hazards:
- For some data hazards (e.g. between ALU ops): forwarding (bypass).
- For others: pipeline interlock + pipeline stall (bypassing cannot deliver the value in time), e.g.

    LD  R1, A
    ADD R4, R1, R7   ; this may need a "stall" or bubble

Example of Structural Hazard

Instruction       | Clock cycle number
                  | 1  2  3  4    5    6   7   8   9
Load instruction  | IF ID EX MEM  WB
Instruction i+1   |    IF ID EX   MEM  WB
Instruction i+2   |       IF ID   EX   MEM WB
Instruction i+3   |          stall IF  ID  EX  MEM WB
Instruction i+4   |                IF  ID  EX  MEM

Data Hazard

Clock cycle       | 1  2  3  4   5
ADD instruction   | IF ID EX MEM WB   <- data written here
SUB instruction   |    IF ID          <- data read here
                  |       EX MEM WB

The ADD instruction writes a register that is a source operand of the SUB instruction, but the ADD does not finish writing the data into the register file until three clock cycles after the SUB begins reading it.
(1) The data hazard may cause SUB to read the wrong value.
(2) This is dangerous: the result may be non-deterministic.
(3) Solution: forwarding (bypassing).
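The timing above can be checked with a little arithmetic. The sketch below is my own (the function name is mine, not from the slides): with the write in WB (stage 4, counting IF as 0) and the read in ID (stage 1), a consumer issued d instructions after the producer would need max(0, 4 - (d + 1) + 1) bubbles without forwarding.

```python
# Minimal model of a RAW hazard in the classic 5-stage pipeline
# IF ID EX MEM WB.  A producer writes its register in WB; a consumer
# reads it in ID.  Assumes one issue per cycle and no same-cycle
# write/read forwarding through the register file.

def raw_stalls(distance, write_stage=4, read_stage=1):
    """Stall cycles needed so the read happens after the write.

    distance: how many instructions separate consumer from producer
              (1 = the immediately following instruction).
    """
    # Consumer's ID would occur at cycle distance + read_stage;
    # the producer's WB occurs at cycle write_stage.
    return max(0, write_stage - (distance + read_stage) + 1)

# The SUB right after the ADD needs 3 stall cycles without forwarding,
# matching the "three clock cycles" in the slide.
print(raw_stalls(1))  # 3
```

An instruction four or more slots behind the producer needs no stall at all, which is why only a short window of following instructions requires bypass paths.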

A set of instructions in the pipeline that need to forward results (each later instruction reads R1, written by the ADD):

ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR  R8, R1, R9
XOR R10, R1, R11

Pipeline Bypassing

[Figure: the ALU with bypass paths - the register file feeds muxes selecting between register operands (R4, R1), the ALU result buffers, and the result write bus, so a result can be forwarded to the ALU inputs before it is written back.]

A = B + C
    .
    .
E = A + D

Flow dependency (R/W conflict)

A = B + C
    .
    .
A = B - C

Output dependency (W/W conflict): leaves A in the wrong state if the order is changed.

A = A + B
    .
    .
A = C + D

Anti-dependency (W/R conflict)
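The three dependence kinds on the last few slides can be recovered mechanically from each statement's read and write sets. A small sketch (the encoding and function name are mine, not from the slides):

```python
# Classify dependences between two statements executed in order:
#   flow (RAW):   s2 reads what s1 writes
#   anti (WAR):   s2 writes what s1 reads
#   output (WAW): s2 writes what s1 writes

def dependences(s1, s2):
    """s1, s2: (writes, reads) pairs of variable-name sets."""
    w1, r1 = s1
    w2, r2 = s2
    kinds = set()
    if w1 & r2:
        kinds.add("flow")    # read-after-write conflict
    if r1 & w2:
        kinds.add("anti")    # write-after-read conflict
    if w1 & w2:
        kinds.add("output")  # write-after-write conflict
    return kinds

# A = B + C ; E = A + D  -> flow dependence on A
print(dependences(({"A"}, {"B", "C"}), ({"E"}, {"A", "D"})))  # {'flow'}
# A = A + B ; A = C + D  -> anti (A read, then written) and output (A written twice)
print(dependences(({"A"}, {"A", "B"}), ({"A"}, {"C", "D"})))
```

Anti- and output dependences are name conflicts rather than value flow, which is why register renaming (or using fresh registers during loop unrolling, as later slides do with F4, F8, F12, F16) can remove them.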

How about arrays?

A[i]  = ...
...   = A[i-1] + ...

This is a loop-carried dependence: iteration i reads the value written in iteration i-1.

"Shared datum" conflicts between two accesses i and j (DLX):

             j reads          j writes
i reads      no conflict      W/R (anti)
i writes     R/W (flow)       W/W (output)

Not all data hazards can be eliminated by bypassing:

LW  R1, 32(R6)
ADD R4, R1, R7
SUB R5, R1, R8
AND R6, R1, R7

Load latency cannot be eliminated by forwarding alone. It is often handled by a "pipeline interlock", which detects the hazard and "stalls" the pipeline; the delay cycle is called a stall or "bubble".

Any instruction  | IF ID EX MEM WB
LW  R1, 32(R6)   |    IF ID EX    MEM WB
ADD R4, R1, R7   |       IF ID    stall EX  MEM WB
SUB R5, R1, R8   |          IF    stall ID  EX  MEM WB
AND R6, R1, R7   |                stall IF  ID  EX  MEM
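The interlock's effect on issue timing can be modeled in a few lines. This is my own sketch (names and encoding are mine): each instruction is a (dest, sources, latency) triple, where latency is the number of cycles a consumer must wait after the producer issues before the forwarded result is usable (load latency 1, ALU latency 0, per the slides).

```python
# Issue-cycle model of an interlocked pipeline with forwarding:
# an instruction issues one cycle after its predecessor, unless a
# source operand is not yet ready, in which case bubbles are inserted.

def schedule(instrs):
    ready = {}          # register -> earliest cycle a consumer may issue
    cycle = 0
    issue_cycles = []
    for dest, srcs, latency in instrs:
        cycle = max([cycle] + [ready.get(s, 0) for s in srcs])
        issue_cycles.append(cycle)
        ready[dest] = cycle + 1 + latency
        cycle += 1
    return issue_cycles

code = [
    ("R1", ["R6"], 1),        # LW  R1, 32(R6)   load latency 1
    ("R4", ["R1", "R7"], 0),  # ADD R4, R1, R7   must stall one cycle
    ("R5", ["R1", "R8"], 0),  # SUB R5, R1, R8
    ("R6", ["R1", "R7"], 0),  # AND R6, R1, R7
]
print(schedule(code))  # [0, 2, 3, 4] -- one bubble between LW and ADD
```

Only the ADD stalls: by the time SUB and AND reach EX, the loaded value has already come back through the bypass paths.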

"Issue" - passing the ID stage. "Issued instructions": DLX issues an instruction only when there is no hazard. Detecting an interlock early in the pipeline has the advantage that the machine never needs to suspend an instruction and undo state changes.

Exploitation of Instruction-Level Parallelism

Static scheduling:
- simple scheduling
- loop unrolling
- loop unrolling + scheduling
- software pipelining

Dynamic scheduling:
- out-of-order execution
- dataflow computers

Constraint Graph

- Directed edges: data dependence
- Undirected edges: resource constraints

An edge (u, v) (directed or undirected) of length e represents an interlock between nodes u and v: they must be separated by e time units.

[Figure: a constraint graph on nodes S1-S6, with edges labeled by operation latencies.]

Code Scheduling for a Single Pipeline (the CSSP problem)

Input: a constraint graph G = (V, E).
Output: a sequence of operations in G, v1, v2, ..., vn, with no more than k no-ops, such that:
1. if the no-ops are deleted, the result is a topological sort of G;
2. any two nodes u, v in the sequence are separated by a distance >= d(u, v).
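The two output conditions are easy to state as a checker. The sketch below is mine (a hypothetical helper, not the course's algorithm): it verifies a candidate sequence against a constraint graph, with None standing for a no-op.

```python
# Check a candidate CSSP output against the two conditions in the
# problem statement above.

def valid_cssp(seq, edges, d):
    """seq: list of node names, with None standing for a no-op.
    edges: directed constraint-graph edges (u, v) meaning u before v.
    d: dict mapping (u, v) to the minimum separation in cycles."""
    pos = {v: i for i, v in enumerate(seq) if v is not None}
    # 1. deleting no-ops must leave a topological sort of G
    if any(pos[u] >= pos[v] for u, v in edges):
        return False
    # 2. every constrained pair must be separated by at least d(u, v)
    return all(pos[v] - pos[u] >= sep for (u, v), sep in d.items())

edges = [("s1", "s2"), ("s1", "s3")]
d = {("s1", "s2"): 3, ("s1", "s3"): 1}
print(valid_cssp(["s1", None, None, "s2", "s3"], edges, d))  # True
print(valid_cssp(["s1", "s2", "s3"], edges, d))              # False
```

A scheduler's job is then to find a valid sequence with as few no-ops as possible; in the second example, "s2" follows "s1" too closely and two bubbles (or reordering with "s3") are required.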

Advanced Pipelining

- Instruction reordering/scheduling within the loop body
- Loop unrolling: the code is not compact
- Superscalar: compact code + multiple issue of different classes of instructions
- VLIW

An Example: X + a

Loop: LD   F0, 0(R1)    ; load the vector element
      ADDD F4, F0, F2   ; add the scalar in F2
      SD   0(R1), F4    ; store the vector element
      SUB  R1, R1, #8   ; decrement the pointer by 8 bytes (per DW)
      BNEZ R1, LOOP     ; branch when it's not zero

Instruction producing result | Destination instruction | Latency in clock cycles
FP ALU op                    | Another FP ALU op       | 3
FP ALU op                    | Store double            | 2
Load double                  | FP ALU op               | 1
Load double                  | Store double            | 0

Latencies of FP operations used in this section. The first column shows the originating instruction type, the second the type of the consuming instruction, and the last the number of intervening clock cycles needed to avoid a stall. These numbers are similar to the average latencies we would see on an FP unit like the one described for DLX in the last chapter. The major change versus the DLX FP pipeline is a reduced FP-multiply latency, which keeps our examples from becoming unwieldy. The latency of a floating-point load to a store is zero, since the result of the load can be bypassed without stalling the store. We continue to assume an integer load latency of 1 and an integer ALU operation latency of 0.
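The table turns directly into a stall calculator. A sketch (the pair names and helper are mine): the table gives required intervening cycles, so the stalls for a consumer issued `distance` instructions after its producer are the shortfall.

```python
# The FP latency table above as a lookup.  Values are the intervening
# clock cycles needed between producer and consumer to avoid a stall.

FP_LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store_double"): 2,
    ("load_double", "fp_alu"): 1,
    ("load_double", "store_double"): 0,
}

def stalls(producer, consumer, distance=1):
    """Stalls when the consumer issues `distance` instructions after
    the producer (distance 1 = back-to-back)."""
    need = FP_LATENCY.get((producer, consumer), 0)
    return max(0, need - (distance - 1))

# ADDD right after LD stalls 1 cycle; SD right after ADDD stalls 2.
print(stalls("load_double", "fp_alu"))   # 1
print(stalls("fp_alu", "store_double"))  # 2
```

These two numbers are exactly the bubbles that appear in the unscheduled loop on the next slide.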

Without any scheduling the loop will execute as follows:

                        Clock cycle issued
Loop: LD   F0, 0(R1)    1
      stall             2
      ADDD F4, F0, F2   3
      stall             4
      stall             5
      SD   0(R1), F4    6
      SUB  R1, R1, #8   7
      BNEZ R1, LOOP     8
      stall             9

This requires 9 clock cycles per iteration.

We can schedule the loop to obtain:

Loop: LD   F0, 0(R1)
      stall
      ADDD F4, F0, F2
      SUB  R1, R1, #8
      BNEZ R1, LOOP     ; delayed branch
      SD   8(R1), F4    ; changed because interchanged with SUB

Average: 6 cycles per element
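Both cycle counts follow from the latencies of this section, which can be checked mechanically. The sketch below is mine (encoding and names are mine; only the latencies that matter here are modeled): a consumer must issue at least latency+1 cycles after its producer, and an unfilled branch-delay slot costs one extra cycle.

```python
# Reproduce the 9- and 6-cycle iteration counts from the FP latencies:
# LD -> ADDD needs 1 intervening cycle, ADDD -> SD needs 2.

FP_LAT = {("LD", "ADDD"): 1, ("ADDD", "SD"): 2}

def iteration_cycles(instrs, delay_slot_filled):
    """instrs: (op, dest, srcs) in program order."""
    issued = {}   # register -> (producing op, issue cycle)
    cycle = 0
    for op, dest, srcs in instrs:
        cycle += 1
        for s in srcs:
            if s in issued:
                p_op, p_cycle = issued[s]
                cycle = max(cycle, p_cycle + FP_LAT.get((p_op, op), 0) + 1)
        if dest:
            issued[dest] = (op, cycle)
    return cycle if delay_slot_filled else cycle + 1

unscheduled = [
    ("LD", "F0", []), ("ADDD", "F4", ["F0"]), ("SD", None, ["F4"]),
    ("SUB", "R1", []), ("BNEZ", None, ["R1"]),
]
scheduled = [
    ("LD", "F0", []), ("ADDD", "F4", ["F0"]), ("SUB", "R1", []),
    ("BNEZ", None, ["R1"]), ("SD", None, ["F4"]),  # SD fills the delay slot
]
print(iteration_cycles(unscheduled, False), iteration_cycles(scheduled, True))  # 9 6
```

Scheduling hides the ADDD-to-SD latency behind the SUB and BNEZ and puts the SD in the branch-delay slot; only the one-cycle LD-to-ADDD stall remains.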

Loop unrolling: here is the result after dropping the unnecessary SUB and BNEZ operations duplicated during unrolling.

Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4     ; drop SUB & BNEZ
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8    ; drop SUB & BNEZ
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12  ; drop SUB & BNEZ
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUB  R1, R1, #32
      BNEZ R1, LOOP

Average: 6.8 cycles per element

Unrolling + Scheduling

The unrolled loop from the previous example after it has been scheduled on DLX:

Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SD   -16(R1), F12
      SUB  R1, R1, #32   ; branch dependence
      BNEZ R1, LOOP
      SD   8(R1), F16    ; 8 - 32 = -24

The execution time of the unrolled loop has dropped to a total of 14 clock cycles, or 3.5 clock cycles per element, compared with 6.8 per element before scheduling.
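A quick arithmetic check of the quoted per-element figures (my own sketch; the slides round 27/4 to 6.8):

```python
# Cycles per element for the 4-way unrolled loop, before and after
# scheduling, and the resulting speedup from scheduling alone.

unrolled_unscheduled = 27 / 4   # 6.75, quoted as ~6.8 cycles/element
unrolled_scheduled = 14 / 4     # 3.5 cycles/element
speedup = unrolled_unscheduled / unrolled_scheduled
print(round(unrolled_unscheduled, 1), unrolled_scheduled)  # 6.8 3.5
```

Note that unrolling alone barely helps (6.8 vs. 6 cycles/element for the scheduled rolled loop); it is unrolling combined with scheduling that nearly halves the per-element cost, because the independent copies supply instructions to fill every stall slot.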

Simple unrolling: y[i] = X[i] + a

[Figure: four dataflow graphs, one per unrolled iteration - LD from offsets 0, -8, -16, -24 off R1 into F0, F6, F10, F14; add the scalar a to produce F4, F8, F12, F16; SD back to the same offsets.]

We have eliminated three branches and three decrements of R1, and the addresses on the loads and stores have been compensated accordingly. Without scheduling, every operation is followed by a dependent operation and thus causes a stall. This loop runs in 27 clock cycles - each LD takes 2 clock cycles, each ADDD 3, the branch 2, and all other instructions 1 - or 6.8 clock cycles for each of the four elements.

27 cycles / 4 elements = 6.8 cycles/element

Unrolling + Scheduling

[Figure: the same four dataflow graphs (LD into F0, F6, F10, F14; add a; SD from F4, F8, F12, F16), now with the loads, adds, and stores grouped so that independent operations overlap.]

14 cycles / 4 elements = 3.5 cycles/element