Vishwani D. Agrawal James J. Danaher Professor

Presentation transcript:

ELEC 5200-001/6200-001 Computer Architecture and Design Spring 2015 Instruction-Level Parallelism Vishwani D. Agrawal James J. Danaher Professor Department of Electrical and Computer Engineering Auburn University, Auburn, AL 36849 http://www.eng.auburn.edu/~vagrawal vagrawal@eng.auburn.edu Spr 2015, Apr 22 . . . ELEC 5200-001/6200-001 Lecture 12

A Computer System
[Figure: block diagram. Labels: Processor, Interrupts, Cache, Memory – I/O bus, Main memory, three I/O controllers, two Disks, Graphics output, Network.]

Advanced Architectures – ILP
- Instruction level parallelism (ILP): multiple instructions fetched and executed simultaneously.
- ILP is used in addition to pipelining.
- Processors with ILP are called multiple-issue processors: multiple instructions are launched in 1 clock cycle.
- Two ways:
  - MIMD: Multiple Instructions Multiple Data
    - Superpipeline
    - Superscalar: dynamic multiple issue
    - Very long instruction word (VLIW): static multiple issue
  - SIMD: Single Instruction Multiple Data
    - Vector processor

Superpipeline and Superscalar
[Figure: timing diagrams of five-stage instructions (IF ID EX MEM WB) plotted against system clock cycles 0 to 8.]
- Pipeline: 1 instruction/cycle
- Superpipeline (pipeline clock is twice as fast as the system clock): 2 instructions per cycle
- Superscalar: 2 (or more) instructions/cycle

A Static Two-Issue MIPS Pipeline
- Read two instructions per cycle:
  - An ALU or branch instruction, and
  - A load or store instruction
- Insert one nop if the above pair is not available
- Added hardware (Figure 4.69, page 336):
  - A second instruction memory
  - Additional input/output ports in register file
  - Additional ALU in execute stage for address calculation

An Example (Page 337)
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $0, Loop

Static Two-Issue Execution
       ALU or branch instruction   Data transfer instruction   Clock cycle
Loop:  nop                         lw $t0, 0($s1)              1
       addi $s1, $s1, -4           nop                         2
       addu $t0, $t0, $s2          nop                         3
       bne $s1, $0, Loop           sw $t0, 4($s1)              4
Note code reordering and change in sw argument: sw now follows addi, so its offset becomes 4 to reach the same address.
CPI = 4/5 = 0.8 > 0.5 (ideal)
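
The reordering can be checked with a small sketch. The Python below is an illustration only, not part of the lecture: it models memory as a dict keyed by byte address (an assumption), and the names run_original, run_reordered and cpi are hypothetical. It runs the loop once in program order and once in the two-issue order, where sw executes after addi and therefore uses offset 4.

```python
def run_original(mem, s1, s2):
    """Loop body in program order: lw, addu, sw, addi, bne."""
    mem = dict(mem)               # work on a copy (illustrative memory model)
    while True:
        t0 = mem[s1]              # lw   $t0, 0($s1)
        t0 = t0 + s2              # addu $t0, $t0, $s2
        mem[s1] = t0              # sw   $t0, 0($s1)
        s1 = s1 - 4               # addi $s1, $s1, -4
        if s1 == 0:               # bne  $s1, $0, Loop (falls through at 0)
            return mem


def run_reordered(mem, s1, s2):
    """Loop body in the two-issue order: lw, addi, addu, then bne with sw."""
    mem = dict(mem)
    while True:
        t0 = mem[s1]              # cycle 1: lw   $t0, 0($s1)
        s1 = s1 - 4               # cycle 2: addi $s1, $s1, -4
        t0 = t0 + s2              # cycle 3: addu $t0, $t0, $s2
        mem[s1 + 4] = t0          # cycle 4: sw   $t0, 4($s1)
        if s1 == 0:               # cycle 4: bne  $s1, $0, Loop
            return mem


def cpi(cycles, instructions):
    """Cycles per instruction for one loop iteration."""
    return cycles / instructions


memory = {4: 10, 8: 20, 12: 30}   # three words at byte addresses 4, 8, 12
assert run_original(memory, 12, 5) == run_reordered(memory, 12, 5)
print(run_reordered(memory, 12, 5))
print(cpi(4, 5))                  # 4 cycles for 5 instructions
```

Both versions leave memory in the same state, which is why the compiler is free to move addi ahead of sw as long as it adjusts the offset.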

Loop Unrolling (Index Multiple of 4)
       ALU or branch instruction   Data transfer instruction   Clock cycle
Loop:  addi $s1, $s1, -16          lw $t0, 0($s1)              1
       nop                         lw $t1, 12($s1)             2
       addu $t0, $t0, $s2          lw $t2, 8($s1)              3
       addu $t1, $t1, $s2          lw $t3, 4($s1)              4
       addu $t2, $t2, $s2          sw $t0, 16($s1)             5
       addu $t3, $t3, $s2          sw $t1, 12($s1)             6
       nop                         sw $t2, 8($s1)              7
       bne $s1, $0, Loop           sw $t3, 4($s1)              8
CPI = 8/14 = 0.57 > 0.5 (ideal)
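
As a rough check of the unrolled schedule, the sketch below replays its 14 instructions in issue order. It is an illustration under stated assumptions: memory is a dict keyed by byte address, the first lw reads $s1 before the addi issued in the same cycle updates it, and the names run_unrolled and reference are hypothetical.

```python
def reference(mem, s2):
    """What the loop is meant to compute: add s2 to every word."""
    return {addr: value + s2 for addr, value in mem.items()}


def run_unrolled(mem, s1, s2):
    """Four elements per iteration; s1 must be a positive multiple of 16."""
    mem = dict(mem)
    while True:
        t0 = mem[s1]              # cycle 1: lw   $t0, 0($s1)  (old $s1)
        s1 = s1 - 16              # cycle 1: addi $s1, $s1, -16
        t1 = mem[s1 + 12]         # cycle 2: lw   $t1, 12($s1)
        t0 = t0 + s2              # cycle 3: addu $t0, $t0, $s2
        t2 = mem[s1 + 8]          # cycle 3: lw   $t2, 8($s1)
        t1 = t1 + s2              # cycle 4: addu $t1, $t1, $s2
        t3 = mem[s1 + 4]          # cycle 4: lw   $t3, 4($s1)
        t2 = t2 + s2              # cycle 5: addu $t2, $t2, $s2
        mem[s1 + 16] = t0         # cycle 5: sw   $t0, 16($s1)
        t3 = t3 + s2              # cycle 6: addu $t3, $t3, $s2
        mem[s1 + 12] = t1         # cycle 6: sw   $t1, 12($s1)
        mem[s1 + 8] = t2          # cycle 7: sw   $t2, 8($s1)
        mem[s1 + 4] = t3          # cycle 8: sw   $t3, 4($s1)
        if s1 == 0:               # cycle 8: bne  $s1, $0, Loop
            return mem


UNROLLED_CYCLES = 8
UNROLLED_INSTRUCTIONS = 14        # addi + 4 lw + 4 addu + 4 sw + bne

memory = {addr: addr for addr in range(4, 36, 4)}   # 8 words, addresses 4..32
assert run_unrolled(memory, 32, 100) == reference(memory, 100)
print(round(UNROLLED_CYCLES / UNROLLED_INSTRUCTIONS, 2))
```

Using four temporaries ($t0 to $t3) instead of one is what lets the four copies of the loop body overlap; with a single $t0 the copies would be serialized by dependences on that register.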

VLIW: Very Long Instruction Word
- Static multiple issue, ILP determined by compiler.
- Datapath contains multiple execution units.
- Compiler groups instructions that have no data or resource conflicts for parallel execution.
- Grouped instructions are packed in very long words of a wide instruction memory.
- Speedup benefit of VLIW is highly program dependent.
- J. A. Fisher, “Very Long Instruction Word Architecture and ELI-512,” Proc. 10th Symp. on Computer Architecture, Stockholm, June 1983, pp. 478-490.
- J. A. Fisher, P. Faraboschi and C. Young, Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools, Morgan Kaufmann.

Superscalar: Dynamic Scheduling and Out-of-Order Execution
[Figure: block diagram. Labels: Instruction fetch and decode unit; In-order issue; Out-of-order issue; four reservation stations; Functional units (integer, integer, floating point, load/store); Out-of-order execution; Commit unit; In-order commit.]

Out of Order Execution (OOE)
- A procedural programming language sequences instructions.
- Sequencing assumes an order of execution: no parallelism.
- OOE must preserve correctness of result.
- Principle: Two instructions can be executed in parallel if they do not have dependences.

RAW Dependence
- Read after write (RAW): A dependent instruction reads from a register being written to by another instruction.
- Example:
      add $s1, $s2, $s3
      sub $s2, $s1, $s3
  sub has RAW dependence on add (sub reads $s1, which add writes).

WAR Dependence
- Write after read (WAR): A dependent instruction writes to a register being read by another instruction.
- Example:
      add $s1, $s2, $s3
      sub $s2, $s1, $s3
  sub has WAR dependence on add (sub writes $s2, which add reads).

WAW Dependence
- Write after write (WAW): One instruction writes to a register being written to by another instruction.
- Example:
      add $s2, $s2, $s3
      sub $s2, $s1, $s3
  sub has WAW dependence on add (both write $s2).
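
The three definitions above can be written as a small classifier. This is a minimal sketch, not taken from the lecture: the Instr tuple and the dependences function are hypothetical names, and each instruction is assumed to have one destination register and a tuple of source registers.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str                 # register written
    srcs: Tuple[str, ...]     # registers read


def dependences(first, second):
    """Return the dependences of `second` on the earlier instruction `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")      # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")      # second writes what first reads
    if second.dest == first.dest:
        found.add("WAW")      # both write the same register
    return found


# Example from the RAW and WAR slides: the same pair has both dependences.
add1 = Instr("add", "$s1", ("$s2", "$s3"))
sub1 = Instr("sub", "$s2", ("$s1", "$s3"))
print(sorted(dependences(add1, sub1)))

# Example from the WAW slide: add also reads $s2, so WAR is reported too.
add2 = Instr("add", "$s2", ("$s2", "$s3"))
sub2 = Instr("sub", "$s2", ("$s1", "$s3"))
print(sorted(dependences(add2, sub2)))
```

Only RAW is a true data dependence; WAR and WAW arise from reusing a register name, which is why renaming can remove them.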

Superscalar Instruction Issue
- Rules:
  - RAW dependence: If any operand is being written, do not issue.
  - WAR dependence: If the result register is being read, do not issue.
  - WAW dependence: If the result register is being written, do not issue.
- Scoreboard: Cycle by cycle record of registers and execution units showing how many instructions are using them.
- Example 1: In-order issue (next 2 slides).
- Example 2: Out-of-order issue (3rd slide).
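
The issue rules can be sketched as a register scoreboard that counts pending readers and writers. This is an illustrative model only: the Scoreboard class and its method names are hypothetical, execution units are not tracked, and timing (when an instruction retires) is left to the caller.

```python
from collections import Counter


class Scoreboard:
    """Counts how many in-flight instructions read or write each register."""

    def __init__(self):
        self.reading = Counter()
        self.writing = Counter()

    def can_issue(self, dest, srcs):
        if any(self.writing[r] > 0 for r in srcs):
            return False      # RAW: an operand is being written
        if self.reading[dest] > 0:
            return False      # WAR: the result register is being read
        if self.writing[dest] > 0:
            return False      # WAW: the result register is being written
        return True

    def issue(self, dest, srcs):
        for r in srcs:
            self.reading[r] += 1
        self.writing[dest] += 1

    def retire(self, dest, srcs):
        for r in srcs:
            self.reading[r] -= 1
        self.writing[dest] -= 1


sb = Scoreboard()
sb.issue("R3", ("R0", "R1"))                 # R3 = R0 * R1
sb.issue("R4", ("R0", "R2"))                 # R4 = R0 + R2
print(sb.can_issue("R5", ("R0", "R1")))      # R5 = R0 + R1: no conflict
print(sb.can_issue("R6", ("R1", "R4")))      # R6 = R1 + R4: RAW on R4
print(sb.can_issue("R1", ("R0", "R2")))      # R1 = R0 - R2: WAR on R1
sb.retire("R4", ("R0", "R2"))
print(sb.can_issue("R6", ("R1", "R4")))      # R4 is now available
```

Counters rather than single bits are used because several in-flight instructions may read the same register at once.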

Dynamic Scheduling
Consider an example:
- First with in-order issue
- Then with out-of-order issue
Assume:
- Up to two instructions are fetched in a cycle
- Instruction register can hold two instructions
- An instruction is issued in the decode cycle, or must wait until there is no RAW, WAR or WAW dependence
- An instruction can retire two or three cycles after it is issued

In-order Issue Scoreboard
[Table: cell contents are not recoverable from the transcript. Columns: Ck cycle, Inst #, Decoded, Issue Inst#, Retire, Reg. to read, Reg. to write.]
Instructions:
  1: R3 = R0 * R1
  2: R4 = R0 + R2
  3: R5 = R0 + R1
  4: R6 = R1 + R4
  5: R7 = R1 * R2
  6: R1 = R0 - R2
  7: R3 = R3 * R1

In-order Issue Scoreboard (Continued)
[Table: cell contents are not recoverable from the transcript. Columns: Ck cycle, Inst #, Decoded, Issue Inst#, Retire, Reg. to read, Reg. to write. Clock cycles continue through 18.]
Instructions:
  8: R1 = R4 + R4
Out-of-order scoreboard (Next 2 Slides)

Questions?
- RAW dependence: Inst# 4 (R6 = R1 + R4) could not be issued until cycle 5. Should Inst# 5 (R7 = R1 * R2) wait in queue?
  Answer: No. Inst# 5 can be issued in cycle 3 as there is no register conflict (out-of-order issue).
- WAR dependence: Must the issue of Inst# 6 (R1 = R0 - R2) wait until cycle 9, when all instructions reading R1 have retired?
  Answer: No, provided the new result of Inst# 6 does not affect the R1 being used by previous instructions (register renaming).

[Table: out-of-order issue scoreboard with register renaming; cell contents are not recoverable from the transcript. Columns: Ck cycle, Inst #, Decoded, Issue Inst#, Retire, Reg. to read, Reg. to write.]
Instructions:
  R3 = R0 * R1
  R4 = R0 + R2
  R5 = R0 + R1
  R6 = R1 + R4
  R7 = R1 * R2
  S1 = R0 - R2
  R3 = R3 * S1
  S2 = R4 + R4
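
The effect of renaming can be sketched in a few lines. Note the difference from the scoreboard above: there, only the two conflicting writes to R1 are redirected to the scratch registers S1 and S2, whereas the sketch below gives every result a fresh name, a simpler and more aggressive policy chosen here for illustration. The function name rename and the names P1, P2, ... are hypothetical.

```python
def rename(program):
    """Give each result a fresh register name.

    program: list of (dest, src1, src2) tuples in program order.
    Sources are redirected to the most recent name of the register they read,
    so RAW dependences are kept while WAR and WAW conflicts disappear.
    """
    latest = {}               # architectural register -> most recent name
    renamed = []
    for number, (dest, a, b) in enumerate(program, start=1):
        a = latest.get(a, a)  # read the newest version of each source
        b = latest.get(b, b)
        fresh = f"P{number}"  # hypothetical fresh name for this result
        latest[dest] = fresh
        renamed.append((fresh, a, b))
    return renamed


program = [
    ("R3", "R0", "R1"),       # 1: R3 = R0 * R1
    ("R4", "R0", "R2"),       # 2: R4 = R0 + R2
    ("R5", "R0", "R1"),       # 3: R5 = R0 + R1
    ("R6", "R1", "R4"),       # 4: R6 = R1 + R4
    ("R7", "R1", "R2"),       # 5: R7 = R1 * R2
    ("R1", "R0", "R2"),       # 6: R1 = R0 - R2
    ("R3", "R3", "R1"),       # 7: R3 = R3 * R1
    ("R1", "R4", "R4"),       # 8: R1 = R4 + R4
]
for line in rename(program):
    print(line)
```

After renaming, instruction 6 no longer writes a register that instructions 1, 3, 4 and 5 read, so it need not wait for them; instruction 7 still reads the results of instructions 1 and 6, which is the true (RAW) dependence that must remain.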

References
- Previous example is from: A. S. Tanenbaum, Structured Computer Organization, Fifth Edition, Prentice-Hall, 2006, pp. 304-309, Section 4.5.3.
- Further reading: D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, “The IBM 360 Model 91: Processor Philosophy and Instruction Handling,” IBM J. Res. & Dev., vol. 11, no. 1, pp. 8-24, Jan. 1967.

Power Reduction by Slack Scheduling
- Application: Superscalar, out-of-order execution:
  - An instruction is executed as soon as the required data and resources become available.
  - A commit unit reorders the results.
- Delay the completion of instructions whose result is not immediately needed.
- Example of RISC instructions:
      add r0, r1, r2;   (A)
      sub r3, r4, r5;   (B)
      and r9, r1, r9;   (C)
      or  r5, r9, r10;  (D)
      xor r2, r10, r11; (E)
- J. Casmira and D. Grunwald, “Dynamic Instruction Scheduling Slack,” Proc. ACM Kool Chips Workshop, Dec. 2000.

Slack Scheduling Example
[Figure: timelines of instructions A, B, C, D, E under standard scheduling and under slack scheduling; the timing is not recoverable from the transcript.]
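
One way to estimate which of the instructions A to E could be delayed is to compare earliest and latest start times on the dependence graph. The sketch below is not the method of the cited paper or of the figure: it assumes unit latency, unlimited execution units, and that only RAW dependences constrain the order (WAR and WAW are taken to be removed by renaming). The function name slack is hypothetical.

```python
def slack(program, latency=1):
    """Cycles each instruction can be delayed without lengthening the schedule.

    program: dict mapping a label to (dest, src1, src2), in program order.
    Only RAW dependences are modeled (an assumption of this sketch).
    """
    labels = list(program)
    preds = {x: [] for x in labels}
    succs = {x: [] for x in labels}
    last_writer = {}
    for x in labels:
        dest, *srcs = program[x]
        for reg in srcs:
            if reg in last_writer:            # RAW edge: writer -> reader
                preds[x].append(last_writer[reg])
                succs[last_writer[reg]].append(x)
        last_writer[dest] = x

    earliest = {}
    for x in labels:                          # as soon as possible
        earliest[x] = max((earliest[p] + latency for p in preds[x]), default=0)
    last_start = max(earliest.values())

    latest = {}
    for x in reversed(labels):                # as late as possible
        latest[x] = min((latest[s] - latency for s in succs[x]),
                        default=last_start)
    return {x: latest[x] - earliest[x] for x in labels}


example = {
    "A": ("r0", "r1", "r2"),     # add r0, r1, r2
    "B": ("r3", "r4", "r5"),     # sub r3, r4, r5
    "C": ("r9", "r1", "r9"),     # and r9, r1, r9
    "D": ("r5", "r9", "r10"),    # or  r5, r9, r10
    "E": ("r2", "r10", "r11"),   # xor r2, r10, r11
}
print(slack(example))
```

Under these assumptions only C and D lie on the critical path (D reads r9, which C writes); A, B and E have slack and are the candidates for slower, lower-power execution.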

Slack Scheduling
[Figure: block diagram. Labels: Re-order buffer; Slack bit; Scheduling logic; Low-power execution units (reduced voltage).]

Superscalar Design of P4 (CISC)
- CISC shell:
  - Processor fetches instructions from memory in the order of the static program.
  - Each instruction is translated into one or more fixed-length RISC instructions, known as micro-operations (micro-ops).
- RISC core:
  - Micro-ops are executed out-of-order in a dynamically scheduled pipeline.
  - Processor commits the result of each micro-op execution to the register file in the order of original program flow.

Superscalars
3 or more instruction issues per clock:
- Intel P6
- AMD K5
- Sun UltraSPARC
- Alpha 21164
- MIPS R10000
- PowerPC 604/620
- HP 8000
References:
- D. W. Anderson, F. J. Sparacio and R. M. Tomasulo, “The IBM 360 Model 91: Processor Philosophy and Instruction Handling,” IBM J. Res. Dev., vol. 11, pp. 8-24, January 1967.
- T. Agerwala and J. Cocke, “Reduced Instruction Set Processors,” Technical Report RC12434 (#55845), Yorktown Heights, NY: IBM T. J. Watson Research Center, January 1987.

Topics in Computer Architecture
- Instruction set
- Program execution through register transfer
See Lectures 13-14.
- Computer arithmetic (2’s complement, IEEE 754 floating point standard, addition, multiplication)
- Datapaths (single-cycle, multicycle, pipeline)
- Control (combinational logic, FSM, microcode)
- Pipelining (throughput, hazards, forwarding, stall, branch prediction)
- Memory organization (cache, virtual memory)
- Performance (benchmarks, energy efficiency, Amdahl’s law)
- Advanced architectures (ILP, OOE, superscalar, etc.)
Not discussed in this course:
- Multiprocessors
- Compiler and software techniques: loop unrolling, trace execution, etc.
- Input and output
- Power management

One who claims to know much about computer architecture speaks from ignorance . . . because a lot is going to happen in the future, which is . . .
http://www.youtube.com/watch?v=xZbKHDPPrrc
Doris Day in Hitchcock’s 1956 Movie “The Man Who Knew Too Much”