University of Delaware, Computer & Information Sciences Department
Optimizing Compilers, CISC 673, Spring 2009: Instruction Scheduling
John Cavazos, University of Delaware

Instruction Scheduling
- Reordering instructions to improve performance
- Takes anticipated instruction latencies into account
- Machine-specific; performed late in the optimization pass
- Exploits instruction-level parallelism (ILP)

Modern Architecture Features
- Superscalar: multiple logic units; multiple issue (2 or more instructions issued per cycle)
- Speculative execution: branch predictors, speculative loads
- Deep pipelines

Types of Instruction Scheduling
- Local scheduling: basic block scheduling
- Global scheduling: trace scheduling, superblock scheduling
- Software pipelining

Scheduling for Different Computer Architectures
- Out-of-order issue: scheduling is useful
- In-order issue: scheduling is very important
- VLIW: scheduling is essential!

Challenges to ILP
- Structural hazards: insufficient resources to exploit parallelism
- Data hazards: an instruction depends on the result of a previous instruction still in the pipeline
- Control hazards: branches and jumps modify the PC, affecting which instructions should be in the pipeline

Recall from Architecture: the Five Pipeline Stages
- IF: instruction fetch
- ID: instruction decode
- EX: execute
- MA: memory access
- WB: write back

Structural Hazards
- Example: addf R3,R1,R2 followed by addf R3,R3,R4 stalls the second addf in EX, assuming floating-point ops take 2 execute cycles
- Instruction latency: execute takes more than 1 cycle

Data Hazards
- Example: lw R1,0(R2) followed by add R3,R1,R4 stalls the add
- Memory latency: the loaded data is not ready when the add needs it

Control Hazards
- (Pipeline diagram: on a taken branch, the instruction after the branch is squashed; fetch resumes at the branch target, then branch target + 1)

Basic Block Scheduling
For each basic block:
- Construct a directed acyclic graph (DAG) using the dependences between statements
  - Node = statement / instruction
  - Edge (a,b) = statement a must execute before b
- Schedule instructions using the DAG
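The DAG construction described above can be sketched in Python. The instruction encoding (explicit def/use register sets) is an assumption for illustration; the block shown is the a/b/c/d load-add example from the later example slide.

```python
# Sketch of dependence-DAG construction for one basic block.
# Each instruction carries the register sets it defines and uses;
# an edge (i, j) means instruction i must execute before j.

def build_dag(instrs):
    edges = set()
    for j, b in enumerate(instrs):
        for i in range(j):
            a = instrs[i]
            raw = a['defs'] & b['uses']   # j reads what i wrote
            war = a['uses'] & b['defs']   # j overwrites what i read
            waw = a['defs'] & b['defs']   # both write the same register
            if raw or war or waw:
                edges.add((i, j))
    return edges

# The four-instruction example used on the later slides:
#   a) lw R2,(R1)   b) lw R3,4(R1)   c) R4 <- R2+R3   d) R5 <- R2-1
block = [
    {'defs': {'R2'}, 'uses': {'R1'}},        # a
    {'defs': {'R3'}, 'uses': {'R1'}},        # b
    {'defs': {'R4'}, 'uses': {'R2', 'R3'}},  # c
    {'defs': {'R5'}, 'uses': {'R2'}},        # d
]
print(sorted(build_dag(block)))   # edges a->c, a->d, b->c
```

The two loads produce no edge between themselves, so a scheduler is free to reorder around them.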

Data Dependences
- If two operations access the same register and one access is a write, they are dependent
- Types of data dependences:
  - RAW (read after write):  r1 = r2 + r3;  r4 = r1 * 6
  - WAW (write after write): r1 = r2 + r3;  r1 = r4 * 6
  - WAR (write after read):  r1 = r2 + r3;  r2 = r5 * 6
- Two dependent instructions cannot be reordered

Basic Block Scheduling Example
Original code:
  a) lw R2, (R1)
  b) lw R3, 4(R1)
  c) R4 <- R2 + R3
  d) R5 <- R2 - 1
Dependence DAG: a -> c, a -> d, b -> c (the two loads are independent)
Schedule 1 (5 cycles): a, b, nop, c, d
Schedule 2 (4 cycles): a, b, d, c

List Scheduling Algorithm
- Construct the dependence DAG for the basic block
- Put the roots in the candidate set
- While the candidate set is not empty:
  - Evaluate all candidates with the scheduling heuristics (applied in order) and select the best one
  - Delete the scheduled instruction from the candidate set
  - Add newly exposed candidates
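A minimal sketch of this loop in Python, run on the DAG of the earlier four-instruction example. The priority values stand in for any of the heuristics on the next slide and were chosen here only to make the selection deterministic.

```python
# Sketch of the list-scheduling loop: keep a set of ready candidates,
# repeatedly pick the best one by a priority heuristic, and expose
# successors once all their predecessors are scheduled.

def list_schedule(nodes, edges, priority):
    preds = {n: set() for n in nodes}
    succs = {n: set() for n in nodes}
    for a, b in edges:
        succs[a].add(b)
        preds[b].add(a)
    ready = {n for n in nodes if not preds[n]}   # DAG roots
    schedule = []
    while ready:
        best = max(ready, key=lambda n: priority[n])
        ready.remove(best)
        schedule.append(best)
        for s in succs[best]:
            preds[s].discard(best)
            if not preds[s]:                     # newly exposed candidate
                ready.add(s)
    return schedule

edges = {('a', 'c'), ('a', 'd'), ('b', 'c')}
order = list_schedule(['a', 'b', 'c', 'd'], edges,
                      {'a': 4, 'b': 3, 'd': 2, 'c': 1})
print(order)   # a, b, d, c: the 4-cycle schedule from the example
```

This sketch tracks only dependence order, not cycle-by-cycle latencies; a production scheduler would also model when each result becomes available.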

Instruction Scheduling Heuristics
- Optimal scheduling is NP-complete, so we need heuristics
- Bias the scheduler to prefer instructions with:
  - the earliest execution time
  - many successors (more flexibility in later scheduling choices)
  - progress along the critical path
  - registers that can be freed (reduces register pressure)
- A scheduler can combine several heuristics

Computing Priorities
- height(n) = exec(n), if n is a leaf
- height(n) = max(height(m)) + exec(n) over all successors m of n, otherwise
- Critical path(s): the path(s) through the dependence DAG with the longest total latency

Example: Determine Height and Critical Path
Code:
  a) lw   r1, w
  b) add  r1, r1, r1
  c) lw   r2, x
  d) mult r1, r1, r2
  e) lw   r2, y
  f) mult r1, r1, r2
  g) lw   r2, z
  h) mult r1, r1, r2
  i) sw   r1, a
Assume: memory instructions = 3 cycles (to have the result in a register), mult = 2, the rest = 1 cycle
Dependence DAG: a -> b -> d -> f -> h -> i, with c -> d, e -> f, g -> h
Critical path: _______
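The height recurrence can be checked mechanically on this example. The latencies follow the slide (memory 3, mult 2, rest 1, with the store treated as a memory instruction), and the DAG edges are the reconstructed ones above; the tuple-of-successors encoding is an assumption.

```python
# Sketch of the height computation: height(n) = exec(n) for a leaf,
# else exec(n) + max height over n's successors.

from functools import lru_cache

def make_height(succs, exec_time):
    @lru_cache(maxsize=None)
    def height(n):
        if not succs[n]:
            return exec_time[n]
        return exec_time[n] + max(height(m) for m in succs[n])
    return height

# DAG: a->b->d->f->h->i, plus c->d, e->f, g->h
succs = {'a': ('b',), 'b': ('d',), 'c': ('d',), 'd': ('f',),
         'e': ('f',), 'f': ('h',), 'g': ('h',), 'h': ('i',), 'i': ()}
# Loads/stores take 3 cycles, mult 2, everything else 1.
exec_time = {'a': 3, 'b': 1, 'c': 3, 'd': 2, 'e': 3,
             'f': 2, 'g': 3, 'h': 2, 'i': 3}

height = make_height(succs, exec_time)
print({n: height(n) for n in 'abcdefghi'})
# height(a) = 13: the critical path is a-b-d-f-h-i (3+1+2+2+2+3)
```

Because height(a) is largest, a list scheduler driven by height would start with instruction a and keep the a-b-d-f-h-i chain moving while slotting c, e, g into stall cycles.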

Example: Build the Schedule
(Same code and dependence DAG as the previous slide)
Schedule: ___ cycles

Global Scheduling: Superblocks
- Definition: a single trace of contiguous, frequently executed blocks with a single entry and multiple exits
- Formation algorithm:
  - pick a trace of frequently executed basic blocks
  - eliminate side entrances (tail duplication)
- Scheduling and optimization:
  - speculate operations within the superblock
  - apply optimizations to the scope defined by the superblock

Superblock Formation
(Figure: a CFG with blocks A (100), B (90), C (10), D (0), E (90), F (100). First select the trace A, B, E, F; then tail-duplicate F, leaving F (90) on the trace and a copy F' (10) reached from the off-trace path.)

Optimizations within a Superblock
- Limiting the scope of optimization to a superblock:
  - optimizes for the frequent path
  - may enable optimizations that are not feasible otherwise (CSE, loop-invariant code motion, ...)
- Example, common subexpression elimination (CSE): a block computes r1 = r2*3, a side block computes r2 = r2+1, and both flow into a block computing r3 = r2*3. The merge blocks CSE. After trace selection and tail duplication, the on-trace copy has a single entry where r2 is unchanged, so its r3 = r2*3 becomes r3 = r1; the duplicated off-trace copy keeps r3 = r2*3.

Scheduling Algorithm Complexity
- Time complexity: O(n^2), where n is the maximum number of instructions in a basic block
- Building the dependence DAG is worst-case O(n^2): each instruction must be compared against every other instruction
- Scheduling then requires inspecting each remaining instruction at each step, which is also O(n^2)
- Average case: a small constant factor (e.g., 3)

Very Long Instruction Word (VLIW)
- The compiler determines exactly what is issued every cycle (before the program is run)
- Schedules also account for latencies
- All hardware changes result in a compiler change
- Usually embedded systems (hence simple hardware)
- Itanium is actually an EPIC-style machine (the compiler accounts for most of the parallelism, but not latencies)

Sample VLIW Code
VLIW processor: 5-issue
- 2 Add/Sub units (1 cycle)
- 1 Mul/Div unit (2 cycles, unpipelined)
- 1 Ld/St unit (2 cycles, pipelined)
- 1 Branch unit (no delay slots)

  Add/Sub    | Add/Sub    | Mul/Div    | Ld/St       | Branch
  c = a + b  | d = a - b  | e = a * b  | ld j = [x]  | nop
  g = c + d  | h = c - d  | nop        | ld k = [y]  | nop
  nop        | nop        | i = j * c  | ld f = [z]  | br g
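The compiler's job of filling wide instructions like these can be sketched with a greedy packer. This toy ignores multi-cycle latencies (unlike a real VLIW scheduler, which must honor them), and the op names, unit labels, and dependence set are assumptions mirroring the sample above.

```python
# Sketch of greedy VLIW bundle packing: each cycle, fill the wide
# instruction with ready ops, subject to per-unit slot limits.

def pack_bundles(ops, deps, units):
    remaining = dict(ops)      # op name -> unit kind it needs
    done = set()
    bundles = []
    while remaining:
        free = dict(units)     # slots left in this cycle's bundle
        bundle = []
        for op, kind in list(remaining.items()):
            ready = all(a in done for a, b in deps if b == op)
            if ready and free.get(kind, 0) > 0:
                bundle.append(op)
                free[kind] -= 1
        for op in bundle:
            del remaining[op]
        done.update(bundle)    # latencies ignored in this sketch
        bundles.append(bundle)
    return bundles

# A subset of the slide's ops: c,d,g,h are add/sub; e,i mul/div; j,k loads.
ops = {'c': 'addsub', 'd': 'addsub', 'e': 'muldiv', 'j': 'ldst',
       'g': 'addsub', 'h': 'addsub', 'k': 'ldst', 'i': 'muldiv'}
deps = {('c', 'g'), ('d', 'g'), ('c', 'h'), ('d', 'h'),
        ('j', 'i'), ('c', 'i')}
print(pack_bundles(ops, deps, {'addsub': 2, 'muldiv': 1, 'ldst': 1}))
```

The structural limits show up directly: k cannot join the first bundle because j already occupies the single Ld/St slot, just as in the slide's schedule.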

Next Time
- Phase ordering