Instruction Level Parallelism
Pipeline with data forwarding and accelerated branch
Loop Unrolling
Multiple Issue -- Multiple Functional Units
Static vs. Dynamic Optimizations


Optimization Example: C code

k = len;
do {
    k--;
    A[k] = A[k] + x;
} while (k > 0);
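
The assembly on the next slides walks a pointer through A rather than indexing by k. As a bridge, here is a minimal C sketch of that pointer-hopping rewrite; it is not from the original slides, and it assumes A is an int array:

/* Walk a pointer from just past A[len-1] down to A[0],
   instead of recomputing &A[k] on every pass. */
int *p = A + len;      /* sll/addu: $t1 = $s1 + 4*len */
do {
    p--;               /* addi $t1, $t1, -4 */
    *p = *p + x;       /* lw / add / sw     */
} while (p != A);      /* bne $t1, $s1, loop */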

Register Usage

$s0   len
$s1   base address of A
$s2   x
$t1   address of A[k]
$t0   value of A[k] (old and new)

Basic loop using pointer hopping

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -4
        lw    $t0, 0($t1)
        add   $t0, $t0, $s2
        sw    $t0, 0($t1)
        bne   $t1, $s1, loop
        xxx

Time for 1000 Iterations: Single-Cycle and Multicycle

Single cycle
– Every instruction takes 800 picoseconds (ps)
– Time = 5 x 800 x 1000 + 2 x 800 = 4,001,600 ps = 4,001.6 ns

Multicycle
– 200 ps per cycle, variable CPI per instruction
– Cycles = (1x3 + 3x4 + 1x5) x 1000 + 2x4 = 20,008
– Time = 20,008 x 200 ps = 4,001.6 ns

Pipeline Filling: stall/delay slots

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -4
        lw    $t0, 0($t1)
        nop                      # load delay slot
        add   $t0, $t0, $s2
        sw    $t0, 0($t1)
        bne   $t1, $s1, loop
        nop                      # branch delay slot
        xxx
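
The first nop exists because a lw result cannot be forwarded in time to the instruction immediately after it. A tiny C sketch of that load-use hazard test (illustrative only, not from the slides; register numbers are plain ints, with -1 meaning "no register"):

#include <stdbool.h>

/* True if the next instruction reads the register the preceding lw
   writes, in which case a stall (or nop) must separate the two. */
static bool load_use_hazard(int lw_dest, int next_src1, int next_src2)
{
    return lw_dest != -1 &&
           (lw_dest == next_src1 || lw_dest == next_src2);
}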

Time for Simple Pipeline

200 ps per cycle, 1 CPI (including nops)
Time = 7 x 200 x 1000 + 2 x 200 ps = 1,400,400 ps = 1,400.4 ns

Reordering Code to fill branch delay and stall slots

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   lw    $t0, -4($t1)       # load moved up; offset adjusted
        addi  $t1, $t1, -4       # fills the load delay slot
        add   $t0, $t0, $s2
        bne   $t1, $s1, loop
        sw    $t0, 0($t1)        # fills the branch delay slot
        xxx

Time for Pipeline with Reordered Code

200 ps per cycle, 1 CPI
Time = 5 x 200 x 1000 + 2 x 200 ps = 1,000,400 ps = 1,000.4 ns

Loop Unrolling step 1 (4 iterations)

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   lw    $t0, -4($t1)
        addi  $t1, $t1, -4
        add   $t0, $t0, $s2
        beq   $t1, $s1, loopend
        sw    $t0, 0($t1)
        lw    $t0, -4($t1)
        addi  $t1, $t1, -4
        add   $t0, $t0, $s2
        beq   $t1, $s1, loopend
        sw    $t0, 0($t1)
        lw    $t0, -4($t1)
        addi  $t1, $t1, -4
        add   $t0, $t0, $s2
        beq   $t1, $s1, loopend
        sw    $t0, 0($t1)
        lw    $t0, -4($t1)
        addi  $t1, $t1, -4
        add   $t0, $t0, $s2
        bne   $t1, $s1, loop
        sw    $t0, 0($t1)
loopend: xxx
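
In C terms, step 1 copies the loop body four times; the beq instructions give an early exit when the count is not a multiple of 4. A sketch (not from the original slides):

int *p = A + len;
do {
    p--; *p = *p + x; if (p == A) break;   /* beq $t1, $s1, loopend */
    p--; *p = *p + x; if (p == A) break;
    p--; *p = *p + x; if (p == A) break;
    p--; *p = *p + x;
} while (p != A);                          /* bne $t1, $s1, loop */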

Loop Unrolling step 2: one pointer with offsets

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -16
        lw    $t0, 12($t1)
        nop
        add   $t0, $t0, $s2
        sw    $t0, 12($t1)
        lw    $t0, 8($t1)
        nop
        add   $t0, $t0, $s2
        sw    $t0, 8($t1)
        lw    $t0, 4($t1)
        nop
        add   $t0, $t0, $s2
        sw    $t0, 4($t1)
        lw    $t0, 0($t1)
        nop
        add   $t0, $t0, $s2
        bne   $t1, $s1, loop
        sw    $t0, 0($t1)
        xxx

Loop Unrolling step 3: filling data hazard slots with register renaming

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -16
        lw    $t0, 12($t1)
        lw    $t3, 8($t1)        # fills $t0's load delay slot
        add   $t0, $t0, $s2
        sw    $t0, 12($t1)
        lw    $t0, 4($t1)
        add   $t3, $t3, $s2
        sw    $t3, 8($t1)
        lw    $t3, 0($t1)
        add   $t0, $t0, $s2
        sw    $t0, 4($t1)
        add   $t3, $t3, $s2
        bne   $t1, $s1, loop
        sw    $t3, 0($t1)        # fills the branch delay slot
        xxx
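
The renaming trick, seen from C: two independent temporaries stand in for the renamed registers, letting each load run ahead of the add/store that consumes the previous one. A sketch (not from the slides; assumes len is a multiple of 4):

int *p = A + len;
do {
    p -= 4;
    int t0 = p[3];      /* lw $t0, 12($t1) */
    int t3 = p[2];      /* lw $t3,  8($t1) -- fills $t0's delay slot */
    p[3] = t0 + x;
    t0 = p[1];
    p[2] = t3 + x;
    t3 = p[0];
    p[1] = t0 + x;
    p[0] = t3 + x;
} while (p != A);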

Time for Pipeline with Loop Unrolling

200 ps per cycle, 1 CPI
4 iterations per pass means 250 passes through the loop
Time = 14 x 200 x 250 + 2 x 200 ps = 700,400 ps = 700.4 ns

Dual Pipeline with Our MIPS
Upper path for ALU ops, lower path for lw/sw

Dual Pipeline Requirements
– Two-instruction (64-bit) instruction register
– Two additional read ports in the register file
– An additional write port in the register file
– Appropriate forwarding hardware
– Compiler must ensure no conflicts between the two simultaneous instructions
– Instruction pairing and ordering can be done statically by the compiler

Dual Pipeline
– Two-instruction pipe: one for arithmetic or branch, one for load or store
– Two instructions can be issued in the same cycle if there are no data dependencies between them
– Following instructions obey the same delay rules
– Loop unrolling for more overlap
– Register renaming to avoid data dependencies
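
A hypothetical C sketch of the static pairing rule the compiler must enforce (the type names and encodings are illustrative, not from the slides):

#include <stdbool.h>

typedef enum { ALU_OP, BRANCH_OP, LOAD_OP, STORE_OP } InstrClass;

typedef struct {
    InstrClass cls;
    int dest;         /* destination register, -1 if none */
    int src1, src2;   /* source registers, -1 if unused   */
} Instr;

/* Two instructions may issue together only if the first goes down the
   ALU/branch pipe, the second down the lw/sw pipe, and the second does
   not read or write a register the first writes. */
static bool can_pair(Instr a, Instr b)
{
    bool pipes_ok  = (a.cls == ALU_OP  || a.cls == BRANCH_OP) &&
                     (b.cls == LOAD_OP || b.cls == STORE_OP);
    bool no_hazard = (a.dest == -1) ||
                     (a.dest != b.src1 && a.dest != b.src2 &&
                      a.dest != b.dest);
    return pipes_ok && no_hazard;
}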

Dual Pipeline Code: pairing instructions
(ALU/branch pipe on the left, lw/sw pipe on the right)

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -4
                                  lw    $t0, 0($t1)
        nop
        add   $t0, $t0, $s2
        bne   $t1, $s1, loop      sw    $t0, 0($t1)
        nop
        xxx

Dual Pipeline Code: fill branch delay slot

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
        addi  $t1, $t1, -4
loop:                             lw    $t0, 0($t1)
        nop
        add   $t0, $t0, $s2
        bne   $t1, $s1, loop      sw    $t0, 0($t1)
        addi  $t1, $t1, -4        # fills the branch delay slot
        xxx

Time for Dual Pipeline (no loop unrolling)

200 ps per cycle, 1 or 2 instructions per cycle
Time = 5 x 200 x 1000 + 3 x 200 ps = 1,000,600 ps = 1,000.6 ns

Loop Unrolling step 2: one pointer with offsets

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -16
        lw    $t0, 12($t1)
        nop
        add   $t0, $t0, $s2
        sw    $t0, 12($t1)
        lw    $t0, 8($t1)
        nop
        add   $t0, $t0, $s2
        sw    $t0, 8($t1)
        lw    $t0, 4($t1)
        nop
        add   $t0, $t0, $s2
        sw    $t0, 4($t1)
        lw    $t0, 0($t1)
        nop
        add   $t0, $t0, $s2
        bne   $t1, $s1, loop
        sw    $t0, 0($t1)
        xxx

Dual Pipe Optimization with loop unrolling
Unrolled and reordered loop:

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -16
        lw    $t0, 12($t1)
        lw    $t3, 8($t1)
        add   $t0, $t0, $s2
        sw    $t0, 12($t1)
        lw    $t0, 4($t1)
        add   $t3, $t3, $s2
        sw    $t3, 8($t1)
        lw    $t3, 0($t1)
        add   $t0, $t0, $s2
        sw    $t0, 4($t1)
        add   $t3, $t3, $s2
        bne   $t1, $s1, loop
        sw    $t3, 0($t1)
loopend: xxx

step 1: use more registers (register renaming)

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -16
        lw    $t0, 12($t1)
        lw    $t3, 8($t1)
        add   $t0, $t0, $s2
        sw    $t0, 12($t1)
        lw    $t5, 4($t1)
        add   $t3, $t3, $s2
        sw    $t3, 8($t1)
        lw    $t7, 0($t1)
        add   $t5, $t5, $s2
        sw    $t5, 4($t1)
        add   $t7, $t7, $s2
        bne   $t1, $s1, loop
        sw    $t7, 0($t1)
loopend: xxx

step 2: reorder/pair instructions
(ALU/branch pipe on the left, lw/sw pipe on the right)

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
loop:   addi  $t1, $t1, -16
                                  lw    $t0, 12($t1)
                                  lw    $t3, 8($t1)
        add   $t0, $t0, $s2       lw    $t5, 4($t1)
        add   $t3, $t3, $s2       sw    $t0, 12($t1)
        add   $t5, $t5, $s2       lw    $t7, 0($t1)
                                  sw    $t3, 8($t1)
        add   $t7, $t7, $s2       sw    $t5, 4($t1)
        bne   $t1, $s1, loop      sw    $t7, 0($t1)
        nop
        xxx

step 3: fill branch delay slot

        sll   $t1, $s0, 2
        addu  $t1, $t1, $s1
        addi  $t1, $t1, -16
                                  lw    $t0, 12($t1)
loop:                             lw    $t3, 8($t1)
        add   $t0, $t0, $s2       lw    $t5, 4($t1)
        add   $t3, $t3, $s2       sw    $t0, 12($t1)
        add   $t5, $t5, $s2       lw    $t7, 0($t1)
                                  sw    $t3, 8($t1)
        add   $t7, $t7, $s2       sw    $t5, 4($t1)
        bne   $t1, $s1, loop      sw    $t7, 0($t1)
        addi  $t1, $t1, -16       lw    $t0, -4($t1)   # delay pair: -4 from old $t1 = 12 from new
        xxx

Time for Dual Pipeline (with loop unrolling)

200 ps per cycle, 1 or 2 instructions per cycle
4 iterations per pass, 250 passes through the loop
Time = 8 x 200 x 250 + 4 x 200 ps = 400,800 ps = 400.8 ns
– 10 times faster than single cycle or multicycle
– 3½ times faster than the simple pipeline
– 1¾ times faster than the single pipeline, loop unrolled
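
Every time above is cycles per pass x number of passes, plus the startup cycles, at 200 ps per cycle (800 ps per instruction for single cycle). A small C program to check the arithmetic; the constants are taken from the slides, but the program itself is not part of the original deck:

#include <stdio.h>

int main(void)
{
    const double PS_TO_NS = 1e-3;   /* 1 ps = 0.001 ns */
    /* 5 instrs/iter x 800 ps x 1000 iters + 2 setup instrs */
    printf("single cycle: %6.1f ns\n", (5 * 800 * 1000 + 2 * 800) * PS_TO_NS);
    /* 20 cycles/iter x 1000 iters + 8 setup cycles, 200 ps each */
    printf("multicycle:   %6.1f ns\n", (20 * 1000 + 8) * 200 * PS_TO_NS);
    printf("pipeline:     %6.1f ns\n", (7 * 1000 + 2) * 200 * PS_TO_NS);
    printf("reordered:    %6.1f ns\n", (5 * 1000 + 2) * 200 * PS_TO_NS);
    printf("unrolled x4:  %6.1f ns\n", (14 * 250 + 2) * 200 * PS_TO_NS);
    printf("dual pipe:    %6.1f ns\n", (8 * 250 + 4) * 200 * PS_TO_NS);
    return 0;   /* prints 4001.6, 4001.6, 1400.4, 1000.4, 700.4, 400.8 */
}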

More Parallelism?
Suppose the loop has more operations:
– Multiplication (takes longer than addition)
– Floating point (takes much longer than integer)
More parallel pipelines for different operations – all of the above techniques could give better performance
Static reordering (as above) vs. dynamic reordering
Speculation
Out-of-order execution

Multiple Issue – Multiple Functional Units

Dynamic Scheduling
– The processor determines which instructions to issue
– Reservation stations use copies of operand values from the registers (equivalent to renaming)
– When an operation completes, its result is buffered in the commit unit
– When it can be done in program order, results are written to registers or memory
– Speculation: guessing which way a branch goes; guessing that a lw does not conflict with a sw
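
A rough sketch of what one reservation-station entry has to track (Tomasulo-style bookkeeping; the struct and field names are illustrative, not from the slides):

#include <stdbool.h>

/* One reservation-station entry. */
typedef struct {
    bool busy;      /* entry in use                                    */
    int  op;        /* operation to perform                            */
    int  vj, vk;    /* copies of the operand values (renaming effect)  */
    int  qj, qk;    /* tags of producers still in flight; 0 means the
                       value is already present in vj/vk               */
    int  dest_tag;  /* tag under which the result is broadcast and
                       buffered in the commit unit                     */
} RSEntry;

/* The entry may start executing once qj == 0 && qk == 0; the commit
   unit later writes results to registers or memory in program order. */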

FIGURE 4.74 The microarchitecture of AMD Opteron X4. The extensive queues allow up to 106 RISC operations to be outstanding, including 24 integer operations, 36 floating point/SSE operations, and 44 loads and stores. The load and store units are actually separated into two parts, with the first part handling address calculation in the Integer ALU units and the second part responsible for the actual memory reference. There is an extensive bypass network among the functional units; since the pipeline is dynamic rather than static, bypassing is done by tagging results and tracking source operands, so as to allow a match when a result is produced for an instruction in one of the queues that needs the result. Copyright © 2009 Elsevier, Inc. All rights reserved.

FIGURE 4.75 The Opteron X4 pipeline showing the pipeline flow for a typical instruction and the number of clock cycles for the major steps in the 12-stage pipeline for integer RISC-operations. The floating point execution queue is 17 stages long. The major buffers where RISC-operations wait are also shown. Copyright © 2009 Elsevier, Inc. All rights reserved.