Instruction Scheduling for Instruction-Level Parallelism CSS 548 Daniel R. Lewis November 28, 2012
Agenda Where does instruction scheduling fit into the compilation process? What is instruction-level parallelism? What are data dependencies, and how do they limit instruction-level parallelism? How should the compiler order instructions to maximize instruction-level parallelism? What is the effect on register allocation? What else must be considered in instruction scheduling?
Big Picture Instruction scheduling is an optimization that is implemented in the back-end of the compiler Operates on machine code (not IR) Tied to the characteristics of the CPU Assumes generic optimization is complete Idea: reorder instructions to increase instruction-level parallelism.
Instruction-Level Parallelism Parallelism on a single core; not multicore Pipelined processors are executing several instructions at once, at different stages Superscalar and VLIW processors can issue multiple instructions per cycle
Pipelined Parallelism (Jouppi and Wall, 1989) Ubiquitous in modern processors Superpipelining: Longer pipelines with shorter stages (Pentium 4 had 20-stage pipeline)
Superscalar Parallelism (Jouppi and Wall, 1989) Works with CPUs that have multiple functional units (e.g., ALU, multiplier, bit shifter) Since the mid-1990s, all general-purpose processors have been superscalar (the original Pentium was the first superscalar x86)
VLIW Parallelism (Jouppi and Wall, 1989) Most commonly seen in embedded DSPs Non-embedded example: Intel Itanium
Data Dependencies inc ebx ;; ebx++ mov eax, ebx ;; eax := ebx Ordering of some instructions must be preserved Three flavors of data dependence: True dependence (read after write) Antidependence (write after read) Output dependence (write after write) Data dependencies substantially reduce available parallelism Dynamically-scheduled processors detect dependencies at run-time and stall instructions until their operands are ready (most processors) Statically-scheduled processors leave dependency detection to the compiler, which must insert no-ops (simple, low-power embedded)
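The three flavors above can be detected mechanically from each instruction's read and write sets. The following is a minimal sketch (the helper name and set-based encoding are my own, not from the slides) that classifies the dependence forcing the slide's `mov eax, ebx` to follow `inc ebx`:

```python
# Hypothetical sketch: classify the dependence between two instructions,
# each described by the sets of registers it reads and writes.

def classify(first_writes, first_reads, second_writes, second_reads):
    """Return the dependence types that force `second` to run after `first`."""
    deps = []
    if first_writes & second_reads:
        deps.append("true (RAW)")    # read after write
    if first_reads & second_writes:
        deps.append("anti (WAR)")    # write after read
    if first_writes & second_writes:
        deps.append("output (WAW)")  # write after write
    return deps

# inc ebx       -> reads {ebx}, writes {ebx}
# mov eax, ebx  -> reads {ebx}, writes {eax}
print(classify({"ebx"}, {"ebx"}, {"eax"}, {"ebx"}))  # ['true (RAW)']
```

A scheduler may reorder two instructions only when this function returns an empty list for both orderings.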
Instruction Scheduling Goal: re-order instructions, accounting for data dependencies and other factors, to minimize the number of stalls/no-ops (Engineering a Compiler, p. 644)
Dependence Graphs and List Scheduling Key data structure is the dependence graph

(* List scheduling algorithm *)
Cycle := 1
Ready := [leaves of Graph]
Active := []
while (Ready + Active).size > 0
    for each op in Active
        if op.startCycle + op.length <= Cycle   (* op has completed *)
            remove op from Active
            for each successor s of op in Graph
                if s is ready
                    add s to Ready
    if Ready.size > 0
        remove an op from Ready
        op.startCycle := Cycle
        add op to Active
    Cycle++

(Engineering a Compiler, p. 645–652)
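The loop above can be turned into a small runnable sketch. This version assumes a single-issue machine and a FIFO ready queue (a real scheduler would use a priority such as critical-path length); the graph, node names, and latencies are invented for illustration:

```python
# Minimal list scheduler over a dependence graph given as successor lists.
def list_schedule(succs, latency):
    preds_left = {n: 0 for n in succs}        # unfinished predecessors per op
    for n in succs:
        for s in succs[n]:
            preds_left[s] += 1
    cycle = 1
    ready = [n for n in succs if preds_left[n] == 0]  # leaves of the graph
    active = []                                # (op, start cycle) pairs
    start = {}
    while ready or active:
        for op, s in list(active):
            if s + latency[op] <= cycle:       # op has completed
                active.remove((op, s))
                for succ in succs[op]:
                    preds_left[succ] -= 1
                    if preds_left[succ] == 0:  # all operands now available
                        ready.append(succ)
        if ready:
            op = ready.pop(0)                  # naive priority: FIFO
            start[op] = cycle
            active.append((op, cycle))
        cycle += 1
    return start

# Tiny example: two 2-cycle loads feeding a 1-cycle add (a -> c, b -> c).
succs = {"a": ["c"], "b": ["c"], "c": []}
latency = {"a": 2, "b": 2, "c": 1}
print(list_schedule(succs, latency))  # {'a': 1, 'b': 2, 'c': 4}
```

Note how the second load issues in cycle 2, overlapping the first load's latency; the add waits until both results are ready.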
Register Allocation Trade-Offs More storage locations = fewer data dependencies = more parallelism Many register allocation schemes seek to minimize the number of registers used, undermining parallelism Processors developed hardware register renaming as a workaround However, excess register usage may require spill code, which negates the benefit of parallelism Register allocation can be done either before or after instruction scheduling
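To make the trade-off concrete, consider two independent computations scheduled two ways (the register assignments below are hypothetical). Reusing one register minimizes register pressure but introduces a false dependence that serializes the pair; spending a second register lets them overlap:

```python
# Does register reuse alone force `second` after `first`?
# (anti/WAR or output/WAW dependence -- no data actually flows between them)
def has_false_dependence(first_writes, first_reads, second_writes):
    return bool((first_reads & second_writes) or (first_writes & second_writes))

# Version 1: both computations use eax (minimal registers).
#   mov eax, [x]; add eax, 1; mov [a], eax
#   mov eax, [y]; add eax, 2; mov [b], eax
print(has_false_dependence({"eax"}, {"eax"}, {"eax"}))  # True: must run in order

# Version 2: the second computation uses ebx instead.
print(has_false_dependence({"eax"}, {"eax"}, {"ebx"}))  # False: can overlap
```

Hardware register renaming removes exactly this kind of false dependence at run time, which is why minimal-register allocation became tolerable on dynamically scheduled processors.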
Advanced Topics List-scheduling algorithm operates on basic blocks Global code scheduling, code motion Software pipelining: schedule entire loop at once Branch prediction Alias analysis: determine if a pointer causes a data dependency Scheduling variable-length operations LOAD can take hundreds or thousands of cycles upon a cache miss Speculative execution
Questions?