We think you have liked this presentation. If you wish to download it, please recommend it to your friends in any social system. Share buttons are a little bit lower. Thank you!
Presentation is loading. Please wait.
Published byRigoberto Taplin
Modified about 1 year ago
11/21/2002© 2002 Hal Perkins & UW CSEO-1 CSE 582 – Compilers Instruction Scheduling Hal Perkins Autumn 2002
11/21/2002© 2002 Hal Perkins & UW CSEO-2 Agenda Instruction scheduling issues – latencies List scheduling Credits: Adapted from slides by Keith Cooper, Rice University
11/21/2002© 2002 Hal Perkins & UW CSEO-3 Issues Many operations have non-zero latencies Modern machines can issue several operations per cycle Loads & Stores may or may not block may be slots after load/store for other work Branch costs vary Branches on modern processors typically have delay slots GOAL: Scheduler should reorder instructions to hide latencies and take advantage of delay slots
11/21/2002© 2002 Hal Perkins & UW CSEO-4 Some Idealized Latencies OperationCycles LOAD3 STORE3 ADD1 MULT2 SHIFT1 BRANCH0 TO 8
11/21/2002© 2002 Hal Perkins & UW CSEO-5 Example: w = w*2*x*y*z; Simple schedule 1 LOAD r1 <- w 4 ADD r1 <- r1,r1 5 LOADr2 <- x 8 MULTr1 <- r1,r2 9 LOAD r2 <- y 12 MULTr1 <- r1,r2 13 LOADr2 <- z 16 MULTr1 <- r1,r2 18 STORE w <- r1 21 r1 free 2 registers, 20 cycles Loads early 1LOADr1 <- w 2LOADr2 <- x 3LOADr3 <- y 4ADDr1 <- r1,r1 5MULTr1 <- r1,r2 6LOADr2 <- z 7MULTr1 <- r1,r3 9MULTr1 <- r1,r2 11 STOREw <- r1 14 r1 is free 3 registers, 13 cycles
11/21/2002© 2002 Hal Perkins & UW CSEO-6 Instruction Scheduling Problem Given a code fragment for some machine and latencies for each operation, reorder to minimize execution time Constraints Produce correct code Minimize wasted cycles Avoid spilling registers Do this efficiently
11/21/2002© 2002 Hal Perkins & UW CSEO-7 Precedence Graph Nodes n are operations Attributes of each node type – kind of operation delay – latency If node n2 uses the result of node n1, there is an edge e = (n1,n2) in the graph
11/21/2002© 2002 Hal Perkins & UW CSEO-8 Example Graph Code a LOAD r1 <- w b ADD r1 <- r1,r1 c LOADr2 <- x d MULTr1 <- r1,r2 e LOAD r2 <- y f MULTr1 <- r1,r2 g LOADr2 <- z h MULTr1 <- r1,r2 i STORE w <- r1
11/21/2002© 2002 Hal Perkins & UW CSEO-9 Schedules (1) A correct schedule S maps each node n into a non-negative integer representing its cycle number, and S (n ) >= 0 for all nodes n (obvious) If (n1,n2) is an edge, then S(n1)+delay(n1) <= S(n2) For each type t there are no more operations of type t in any cycle than the target machine can issue
11/21/2002© 2002 Hal Perkins & UW CSEO-10 Schedules (2) The length of a schedule S, denoted L(S) is L(S) = max n ( S(n )+delay(n ) ) The goal is to find the shortest possible correct schedule Other possible goals: minimize use of registers, power, space, …
11/21/2002© 2002 Hal Perkins & UW CSEO-11 Constraints Main points All operands must be available Multiple operations can be ready at any given point Moving operations can lengthen register lifetimes Moving uses near definitions can shorten register lifetimes Operations can have multiple predecessors Collectively this makes scheduling NP-complete Local scheduling is the simpler case Straight-line code Consistent, predictable latencies
11/21/2002© 2002 Hal Perkins & UW CSEO-12 Algorithm Overview Build a precedence graph P Compute a priority function over the nodes in P (typical: longest latency-weighted path) Use list scheduling to construct a schedule, one cycle at a time Use queue of operations that are ready At each cycle Chose a ready operation and schedule it Update ready queue Rename registers to avoid false dependencies and conflicts
11/21/2002© 2002 Hal Perkins & UW CSEO-13 List Scheduling Algorithm Cycle = 1; Ready = leaves of P; Active = empty; while (Ready and/or Active are not empty) if (Ready is not empty) remove an op from Ready; S(op) = Cycle; Active = Active + op; Cycle++; for each op in Active if (S(op) + delay(op) <= Cycle) remove op from Active; for each successor s of op in P if (s is ready – i.e., all operands available) add s to Ready
11/21/2002© 2002 Hal Perkins & UW CSEO-14 Example Code a LOAD r1 <- w b ADD r1 <- r1,r1 c LOADr2 <- x d MULTr1 <- r1,r2 e LOAD r2 <- y f MULTr1 <- r1,r2 g LOADr2 <- z h MULTr1 <- r1,r2 i STORE w <- r1
11/21/2002© 2002 Hal Perkins & UW CSEO-15 Variations Backward list scheduling Work from the root to the leaves Schedules instructions from end to beginning of the block In practice, try both and pick the result that minimizes costs Little extra expense since the precedence graph and other information can be reused Global scheduling and loop scheduling Extend basic idea in more aggressive compilers
Instruction Scheduling Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
Spring 2014Jim Hogg - UW - CSE - P501O-1 CSE P501 – Compiler Construction Instruction Scheduling Issues Latencies List scheduling.
Local Instruction Scheduling — A Primer for Lab 3 — Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled.
3/2/2016© Hal Perkins & UW CSES-1 CSE P 501 – Compilers Optimizing Transformations Hal Perkins Autumn 2009.
Introduction to Code Generation Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
1 1998 Morgan Kaufmann Publishers Chapter Six. 2 1998 Morgan Kaufmann Publishers Pipelining Improve perfomance by increasing instruction throughput.
10/1/2015© Hal Perkins & UW CSEG-1 CSE P 501 – Compilers Intermediate Representations Hal Perkins Autumn 2009.
Part 5 – Superscalar & Dynamic Pipelining - An Extra Kicker! 5/5/04+ Three major directions that simple pipelines of chapter 6 have been extended If you.
10/22/2002© 2002 Hal Perkins & UW CSEG-1 CSE 582 – Compilers Intermediate Representations Hal Perkins Autumn 2002.
Anshul Kumar, CSE IITD CSL718 : VLIW - Software Driven ILP Hardware Support for Exposing ILP at Compile Time 3rd Apr, 2006.
Introduction to Code Generation Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved. Students enrolled in Comp 412 at Rice.
Pipelining III Andreas Klappenecker CPSC321 Computer Architecture.
Spring 2003CSE P5481 Midterm Philosophy What the exam looks like. Definitions, comparisons, advantages & disadvantages what is it? how does it work? why.
Pipelining 5. Two Approaches for Multiple Issue Superscalar –Issue a variable number of instructions per clock –Instructions are scheduled either statically.
U NIVERSITY OF D ELAWARE C OMPUTER & I NFORMATION S CIENCES D EPARTMENT Optimizing Compilers CISC 673 Spring 2009 Instruction Scheduling John Cavazos University.
12/18/2015© Hal Perkins & UW CSEG-1 CSE P 501 – Compilers Intermediate Representations Hal Perkins Winter 2008.
Compiler Support for Superscalar Processors. Loop Unrolling Assumption: Standard five stage pipeline Empty cycles between instructions before the result.
CDA 3101 Discussion Section 09 CPU Performance. Question 1 Suppose you wish to run a program P with 7.5 * 10 9 instructions on a 5GHz machine with a CPI.
POLITECNICO DI MILANO Parallelism in wonderland: are you ready to see how deep the rabbit hole goes? ILP: VLIW Architectures Marco D. Santambrogio:
Wrapping Up Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
Csci 136 Computer Architecture II – Superscalar and Dynamic Pipelining Xiuzhen Cheng
Fall 2002 Lecture 14: Instruction Scheduling. Saman Amarasinghe ©MIT Fall 1998 Outline Modern architectures Branch delay slots Introduction to.
Architecture-dependent optimizations Functional units, delay slots and dependency analysis.
1 2004 Morgan Kaufmann Publishers Chapter Six. 2 2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.
11/19/2002© 2002 Hal Perkins & UW CSEN-1 CSE 582 – Compilers Instruction Selection Hal Perkins Autumn 2002.
6/9/2015© Hal Perkins & UW CSEU-1 CSE P 501 – Compilers SSA Hal Perkins Winter 2008.
1 Lecture 17: Basic Pipelining Today’s topics: 5-stage pipeline Hazards and instruction scheduling Mid-term exam stats: Highest: 90, Mean: 58.
CS 300 – Lecture 24 Intro to Computer Architecture / Assembly Language The LAST Lecture!
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture VLIW Steve Ko Computer Sciences and Engineering University at Buffalo.
Machine cycle. What is the machine cycle? 4 steps that processors follow to execute program instructions.
1 Pipelining Reconsider the data path we just did Each instruction takes from 3 to 5 clock cycles However, there are parts of hardware that are idle many.
Superscalar and VLIW Architectures Miodrag Bolic CEG3151.
Local Instruction Scheduling — A Primer for Lab 3 — Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in.
Chapter 6 Pipelined CPU Design. Spring 2005 ELEC 5200/6200 From Patterson/Hennessey Slides Pipelined operation – laundry analogy Text Fig. 6.1.
Lecture 9. MIPS Processor Design – Pipelined Processor Design #1 Prof. Taeweon Suh Computer Science Education Korea University 2010 R&E Computer System.
Instruction Level Parallelism María Jesús Garzarán University of Illinois at Urbana-Champaign.
Pipeline Computer Organization II 1 Hazards Situations that prevent starting the next instruction in the next cycle Structural hazards – A required resource.
Saman Amarasinghe ©MIT Fall 1998 Simple Machine Model Instructions are executed in sequence –Fetch, decode, execute, store results –One instruction.
Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.
Intro to the “c6x” VLIW processor ● Texas Instruments TMSC6000 series ● TMSC6700 subseries – include floating point ● VLIW = Very Long Instruction Word.
1 Chapter Six - 2nd Half Pipelined Processor Forwarding, Hazards, Branching EE3055 Web:
ENGS 116 Lecture 101 ILP: Software Approaches Vincent H. Berk October 12 th Reading for today: , 4.1 Reading for Friday: 4.2 – 4.6 Homework #2:
RISC. Rational Behind RISC Few of the complex instructions were used –data movement – 45% –ALU ops – 25% –branching – 30% Cheaper memory VLSI technology.
Basics and Architectures. 2 RISC and CISC Processor.
CSCE 212 Quiz 4 – 2/16/11 *Assume computes take 1 clock cycle, loads and stores take 10 cycles and branches take 4 cycles and that they are running on.
ARM7 Architecture What We Have Learned up to Now.
Anshul Kumar, CSE IITD CSL718 : Pipelined Processors Types of Pipelines Types of Hazards 16th Jan, 2006.
CMSC 611: Advanced Computer Architecture Scoreboard Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
Lecture Objectives: 1)Define pipelining 2)Calculate the speedup achieved by pipelining for a given number of instructions. 3)Define how pipelining improves.
© 2017 SlidePlayer.com Inc. All rights reserved.