Limits of Instruction-Level Parallelism
Presentation by: Robert Duckles, CSE 520
Paper being presented: "Limits of Instruction-Level Parallelism," David W. Wall, WRL Research Report, November 1993

What is ILP?
Instruction-level parallelism exists between instructions that have no dependencies on each other; such instructions can be executed in any order, or at the same time.

(has ILP)           (no ILP)
r1 := 0[r9]         r1 := 0[r9]
r2 := 17            r2 := r1 + 17
4[r3] := r6         4[r2] := r6

Superscalar machine: a machine that can issue multiple independent instructions in the same clock cycle.

Definition of Parallelism
Parallelism = (number of instructions) / (number of cycles it takes to execute them)

Parallelism = 3     Parallelism = 1
r1 := 0[r9]         r1 := 0[r9]
r2 := 17            r2 := r1 + 17
4[r3] := r6         4[r2] := r6

The left-hand sequence can issue all three instructions in a single cycle; the right-hand chain must spend one cycle per instruction.

How much parallelism is there?
That depends on how hard you want to look for it. Ways to increase the ILP we can exploit:
- Register renaming
- Branch prediction
- Alias analysis
- Indirect-jump prediction

Low estimate for ILP
Programs are made up of "basic blocks": straight-line sequences of instructions that control enters only at the top and leaves only at the bottom, with no branches in between. On average, in typical applications, basic blocks are ~10 instructions long. Within a single basic block, the available parallelism is only around 3.

High estimate for ILP
Looking beyond a single basic block, across the entire scope of a program, studies have shown that an "omniscient" scheduler can achieve parallelism greater than 1000 in some numerical applications. "Omniscient" scheduling can be implemented by saving a trace of a program's execution and using an oracle to schedule it: the oracle already knows what will happen, so it can construct a perfect execution schedule. Practically achievable ILP therefore lies somewhere between 3 and 1000.
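To make the oracle idea concrete, here is a small sketch in Python (an illustration, not Wall's actual experimental harness): each instruction in a recorded trace is placed in the earliest cycle after all of its inputs are ready, and parallelism is the instruction count divided by the cycle count.

# Sketch of an "oracle" scheduler over a recorded trace (illustrative only).
# Each trace entry lists the registers it reads and the register it writes;
# the oracle places every instruction in the earliest cycle after all of
# its producers have executed.

def oracle_parallelism(trace):
    ready = {}          # register -> cycle in which its value becomes available
    last_cycle = 0
    for reads, write in trace:
        # Earliest legal cycle: one past the latest cycle that produced an input.
        cycle = 1 + max((ready.get(r, 0) for r in reads), default=0)
        if write is not None:
            ready[write] = cycle
        last_cycle = max(last_cycle, cycle)
    return len(trace) / last_cycle

# The two example sequences from the earlier slides:
independent = [(["r9"], "r1"), ([], "r2"), (["r3", "r6"], None)]
dependent   = [(["r9"], "r1"), (["r1"], "r2"), (["r2", "r6"], None)]
print(oracle_parallelism(independent))  # 3.0 -- all three issue in cycle 1
print(oracle_parallelism(dependent))    # 1.0 -- a chain, one per cycle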

Types of dependencies
- True dependency: given the computation involved, the dependency must exist; the later instruction consumes a value the earlier one produces.
- False dependency: the dependency exists only as an artifact of code generation. E.g., two independent values are allocated to the same register by the compiler.

(a) true data dependency:
    r1 := 20[r4]
    r2 := r1 + r4

(b) anti-dependency:
    r2 := r1 + 1
    r1 := r4

(c) output dependency:
    r1 := r2 * r3
    ...
    r1 := r2 + r3

(d) control dependency:
    if r17 = 0 goto L
    r1 := 0[r7]
    L: ...

Anti- and output dependencies are false dependencies; the control dependency in (d) limits how instructions can be moved across the branch.

Register renaming
The compiler's register allocation algorithm can insert false dependencies by assigning unrelated values to the same register. We can undo this damage by assigning each value to a unique register so that only true dependencies remain. However, machines have a finite number of registers, so we can never guarantee perfect parallelism.
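As a minimal sketch of the idea (an illustration assuming unlimited registers, not the paper's implementation): give every write a fresh register name and rewrite later reads to match, so that anti- and output dependencies disappear and only true dependencies remain.

# Minimal register-renaming sketch (illustrative; assumes an unlimited
# supply of registers). Every write gets a fresh name, and reads are
# rewritten to the current name, so only true (read-after-write)
# dependencies survive.
import itertools

def rename(trace):
    fresh = (f"v{i}" for i in itertools.count())
    current = {}        # architectural register -> its current fresh name
    renamed = []
    for reads, write in trace:
        new_reads = [current.get(r, r) for r in reads]
        new_write = None
        if write is not None:
            new_write = next(fresh)
            current[write] = new_write
        renamed.append((new_reads, new_write))
    return renamed

# The anti-dependency example: r2 := r1 + 1 followed by r1 := r4.
# After renaming, the second write uses the fresh name v1 instead of
# reusing r1, so the two instructions can execute in either order.
print(rename([(["r1"], "r2"), (["r4"], "r1")]))
# [(['r1'], 'v0'), (['r4'], 'v1')]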

Alias analysis
Registers often hold memory addresses or memory offsets. Can two memory references point to the same place in memory? If they might, there may be a dependency between them, and the scheduler cannot safely reorder them until it knows. One approach is to inspect the pointer values at runtime and check whether the accessed regions overlap.
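A rough sketch of that runtime-inspection idea (illustrative; the function below is an assumption, not the paper's mechanism): two memory accesses can conflict only if their byte ranges overlap.

# Hedged sketch of runtime alias checking (illustrative only): two memory
# accesses may depend on each other only if their byte ranges overlap.

def may_alias(addr_a, size_a, addr_b, size_b):
    # Ranges [addr, addr + size) overlap iff neither ends before the other starts.
    return addr_a < addr_b + size_b and addr_b < addr_a + size_a

# Whether 4[r3] and 4[r2] conflict depends on the runtime values of r3
# and r2, which is exactly why a static analysis must be conservative.
r3, r2 = 1000, 1004
print(may_alias(r3 + 4, 4, r2 + 4, 4))   # False -- 4-byte accesses at 1004 and 1008
print(may_alias(r3 + 4, 4, r3 + 6, 4))   # True  -- [1004,1008) overlaps [1006,1010)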

Limitations of branch prediction
We can correctly predict roughly 90% of branches by counting which way each branch has recently gone and predicting its most common direction. The remaining mispredictions still limit how far ahead we can usefully schedule.
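One standard realization of this counting scheme, sketched below as an assumption rather than the paper's exact predictor, is a table of two-bit saturating counters: predict taken while a branch's counter is in its upper half, so a single contrary outcome does not flip a well-established prediction.

# Two-bit saturating-counter branch predictor (a textbook scheme in the
# same spirit as the counter-based predictors the paper studies).
# Counter states are 0..3; predict taken when the counter is 2 or 3.

class TwoBitPredictor:
    def __init__(self):
        self.counters = {}                 # branch PC -> counter state (0..3)

    def predict(self, pc):
        return self.counters.get(pc, 0) >= 2

    def update(self, pc, taken):
        c = self.counters.get(pc, 0)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

p = TwoBitPredictor()
for taken in [True, True, True]:           # train on a mostly-taken branch
    p.update(0x40, taken)
print(p.predict(0x40))   # True -- strongly taken
p.update(0x40, False)    # one not-taken outcome...
print(p.predict(0x40))   # still True -- the counter only dropped from 3 to 2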

Indirect-jump prediction
Sometimes we jump to an address that is not known at compile time, for example when the destination address is computed into a register at runtime. This is often the case for "return" constructs, where the calling function's address is stored on the stack. In these cases we can apply indirect-jump prediction.
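For returns in particular, a return-address stack is the standard predictor; the sketch below is an illustrative model, not a mechanism taken from the paper.

# Return-address-stack sketch (a standard way to predict the indirect
# jumps generated by call/return pairs; illustrative only).

class ReturnStack:
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:  # bounded, like a real hardware stack
            self.stack.pop(0)
        self.stack.append(return_pc)

    def predict_return(self):
        return self.stack.pop() if self.stack else None

rs = ReturnStack()
rs.on_call(0x1004)                 # call at 0x1000; the return lands at 0x1004
rs.on_call(0x2008)                 # nested call
print(hex(rs.predict_return()))    # 0x2008 -- innermost return predicted first
print(hex(rs.predict_return()))    # 0x1004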

Latency
Multi-cycle instructions can greatly decrease parallelism: every dependent instruction must wait for the full latency of its producer, which stretches each dependence chain.
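To see the effect, here is a latency-aware variant of the earlier scheduling sketch (again an illustration): giving the load a 2-cycle latency stretches the dependent three-instruction chain from three cycles to four.

# Latency-aware variant of the oracle sketch above (illustrative):
# each trace entry also carries the instruction's latency in cycles.

def parallelism_with_latency(trace):
    ready, last = {}, 0
    for reads, write, latency in trace:
        start = 1 + max((ready.get(r, 0) for r in reads), default=0)
        done = start + latency - 1         # result available after the full latency
        if write is not None:
            ready[write] = done
        last = max(last, done)
    return len(trace) / last

# The dependent chain with a 2-cycle load: 3 instructions in 4 cycles.
chain = [(["r9"], "r1", 2), (["r1"], "r2", 1), (["r2", "r6"], None, 1)]
print(parallelism_with_latency(chain))   # 0.75, down from 1.0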

Window size
The window size is the maximum number of instructions that can appear on the pending list, i.e., how many instructions the scheduler can examine at once while searching for independent work. Parallelism that lies beyond the window cannot be exploited.
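The sketch below (illustrative only, ignoring issue-width limits) models a finite window: in each cycle the machine may issue any ready instructions among the next W not-yet-executed instructions, so a small window misses independent work that lies further ahead.

# Finite-window scheduling sketch (illustrative). Here trace[i] is the
# list of earlier instruction indices that instruction i depends on.

def windowed_parallelism(trace, window):
    done = set()                  # indices of instructions already executed
    cycles = 0
    while len(done) < len(trace):
        cycles += 1
        pending = [i for i in range(len(trace)) if i not in done][:window]
        issued = [i for i in pending if all(j in done for j in trace[i])]
        done.update(issued)       # issue everything in the window that is ready
    return len(trace) / cycles

# Two independent two-instruction chains, interleaved as A0 A1 B0 B1:
trace = [[], [0], [], [2]]
print(windowed_parallelism(trace, window=4))   # 2.0 -- the chains fully overlap
print(windowed_parallelism(trace, window=2))   # about 1.33 -- the narrow window delays the second chain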

Overall results

Conclusions: the ILP Wall
Even with "perfect" techniques, most real applications hit an ILP limit of around 20. With reasonable, practical methods it is even worse: it is very difficult to get an ILP above 10.

Relationship to Term Project
Our term project is about optimization techniques for the AMD64 Opteron/Athlon processors. Maximizing ILP is essential to getting the most performance out of any processor, and branch prediction, register renaming, and the other techniques above are all particularly relevant to those optimizations.