Instruction Level Parallelism

Presentation transcript:

Instruction Level Parallelism
- Pipeline with data forwarding and accelerated branch
- Loop unrolling
- Multiple issue, with multiple functional units
- Static vs. dynamic optimizations

Optimization Example

C code:

    k = len;
    do {
        k--;
        A[k] = A[k] + x;
    } while (k > 0);

Register Usage

    X19  len
    X20  base address of A
    X21  x
    X10  address of A[k]
    X9   value of A[k] (old and new)

Basic loop using pointer hopping

    lsl  X10, X19, 3        // X10 = len * 8
    add  X10, X10, X20      // X10 = address one past the end of A
loop:
    sub  X10, X10, 8        // X10 = address of A[k] after k--
    ldur X9, [X10, #0]      // X9 = A[k]
    add  X9, X9, X21        // X9 = A[k] + x
    stur X9, [X10, #0]      // A[k] = X9
    sub  X11, X10, X20      // X11 = 0 when X10 reaches the base of A
    cbnz X11, loop
    xxx                     // first instruction after the loop
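
The assembly walks a pointer down the array instead of recomputing the address of A[k] from k each iteration. The same transformation can be written at the source level; a sketch, assuming 8-byte (long) elements to match the ldur/stur and the shift by 3:

    /* Pointer-hopping form of the loop: p plays the role of X10, and
       the termination test mirrors the sub X11 / cbnz pair. */
    long *p = A + len;            /* lsl + add: p = &A[len] */
    do {
        p--;                      /* sub X10, X10, 8 */
        *p = *p + x;              /* ldur, add, stur */
    } while (p != A);             /* sub X11, X10, X20; cbnz X11, loop */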

Time for 1000 Iterations, Single-Cycle and Multicycle

Single cycle: every instruction takes 800 picoseconds (ps)
    Time = 6x800x1000 + 2x800 = 4,801,600 ps = 4,801.6 nanoseconds (ns)
(6 loop instructions per iteration, plus the 2 setup instructions)

Multicycle: 200 ps per cycle, variable CPI
    Cycles = (1x3 + 4x4 + 1x5)x1000 + 2x4 = 24,008
(per iteration the branch takes 3 cycles, the load 5, and the other four instructions 4 each; the 2 setup instructions take 4 each)
    Time = 24,008 x 200 ps = 4,801.6 ns

Pipeline, nops for data hazards

    lsl  X10, X19, 3
    add  X10, X10, X20
loop:
    sub  X10, X10, 8
    ldur X9, [X10, #0]
    nop                     // load-use hazard: X9 arrives too late for the add
    add  X9, X9, X21
    stur X9, [X10, #0]
    sub  X11, X10, X20
    nop                     // cbnz compares X11 early in the pipe, one cycle too soon
    cbnz X11, loop          // stall if taken

(8 instructions per iteration, plus one stall when the branch is taken: the 9 cycles used below.)
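
The first nop exists because a load delivers its data at the end of MEM, one cycle too late for an immediately following add even with forwarding; the second exists because the accelerated branch compares its register early in the pipe. A minimal sketch of the load-use test a hazard detection unit applies (the function and field names are illustrative, in the style of the classic five-stage design, not taken from these slides):

    /* Stall (or have the compiler insert a nop) when the instruction now
       in EX is a load whose destination is a source of the one now in ID. */
    int load_use_hazard(int ex_is_load, int ex_dest, int id_src1, int id_src2)
    {
        return ex_is_load && (ex_dest == id_src1 || ex_dest == id_src2);
    }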

Time for simple pipeline

200 ps per cycle, 1 CPI (including nops)
    Time = 9x200x1000 + 2x200 – 1x200 ps = 1,800.2 ns
(no stall on the last iteration, where the branch falls through)

Reordering code

    lsl  X10, X19, 3
    add  X10, X10, X20
loop:
    sub  X10, X10, 8
    ldur X9, [X10, #0]
    sub  X11, X10, X20      // fills the load delay slot and computes X11 early for the cbnz
    add  X9, X9, X21
    stur X9, [X10, #0]
    cbnz X11, loop          // stall if taken

Both nops are now covered by useful work.

Time for pipeline with reordered code

200 ps per cycle, 1 CPI
    Time = 7x200x1000 + 2x200 – 1x200 ps = 1,400.2 ns
(no stall on the last iteration)
With branch prediction there is no taken-branch stall: 6x200x1000 + 2x200 = 1,200.4 ns

Loop unrolling – step 1

Start from the reordered loop; its body will be replicated four times.

    lsl  X10, X19, 3
    add  X10, X10, X20
loop:
    sub  X10, X10, 8
    ldur X9, [X10, #0]
    sub  X11, X10, X20
    add  X9, X9, X21
    stur X9, [X10, #0]
    cbnz X11, loop          // stall if taken

Loop unrolling – step 2: remove extra loop control

The body appears four times (offsets #24, #16, #8, #0), one decrement of 32 replaces the four decrements of 8, and a single sub/cbnz pair closes the loop. The load-use hazards still need nops:

    lsl  X10, X19, 3
    add  X10, X10, X20
loop:
    sub  X10, X10, 32
    ldur X9, [X10, #24]
    nop
    add  X9, X9, X21
    stur X9, [X10, #24]
    ldur X9, [X10, #16]
    nop
    add  X9, X9, X21
    stur X9, [X10, #16]
    ldur X9, [X10, #8]
    nop
    add  X9, X9, X21
    stur X9, [X10, #8]
    ldur X9, [X10, #0]
    nop
    add  X9, X9, X21
    stur X9, [X10, #0]
    sub  X11, X10, X20
    cbnz X11, loop          // stall if taken
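
At the source level, unrolling by four replicates the body while keeping a single decrement and a single test, which is exactly the loop control that step 2 removes. A sketch in C, assuming len is a multiple of 4 (true for the 1000 iterations used in the timing examples):

    k = len;
    do {
        k -= 4;                   /* the one sub X10, X10, 32 */
        A[k+3] = A[k+3] + x;      /* offset #24 */
        A[k+2] = A[k+2] + x;      /* offset #16 */
        A[k+1] = A[k+1] + x;      /* offset #8  */
        A[k]   = A[k]   + x;      /* offset #0  */
    } while (k > 0);              /* the one sub/cbnz pair */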

Loop unrolling – step 3: remove data hazards with reordering and register renaming

Renaming the second load of each pair to X12 lets loads and adds interleave, and moving sub X11 ahead of the last two stores separates it from the cbnz; every nop disappears:

    lsl  X10, X19, 3
    add  X10, X10, X20
loop:
    sub  X10, X10, 32
    ldur X9, [X10, #24]
    ldur X12, [X10, #16]
    add  X9, X9, X21
    add  X12, X12, X21
    stur X9, [X10, #24]
    stur X12, [X10, #16]
    ldur X9, [X10, #8]
    ldur X12, [X10, #0]
    add  X9, X9, X21
    add  X12, X12, X21
    sub  X11, X10, X20
    stur X9, [X10, #8]
    stur X12, [X10, #0]
    cbnz X11, loop          // stall if taken

15 instructions per pass, so 16 cycles with the taken-branch stall.
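
Renaming is visible at the source level too: with one temporary every load-add-store chain reuses the same name and must finish before the next begins, while two temporaries make the chains independent, so their loads and adds can interleave. A sketch, where t0 and t1 stand in for X9 and X12 (the names are illustrative):

    long t0, t1;
    k = len;
    do {
        k -= 4;
        t0 = A[k+3];              /* the two loads issue back to back */
        t1 = A[k+2];
        t0 = t0 + x;              /* each add is two slots past its load */
        t1 = t1 + x;
        A[k+3] = t0;
        A[k+2] = t1;
        t0 = A[k+1];              /* t0/t1 reused for the second pair */
        t1 = A[k];
        t0 = t0 + x;
        t1 = t1 + x;
        A[k+1] = t0;
        A[k]   = t1;
    } while (k > 0);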

Time for pipeline with loop unrolling

200 ps per cycle, 1 CPI
4 iterations per pass means 250 passes through the loop
    Time = 16x200x250 + 2x200 – 1x200 ps = 800.2 ns
(no stall on the last iteration)
With branch prediction: 15x200x250 + 2x200 = 750.4 ns

Dual Pipeline

[Datapath figure: upper path for ALU ops, lower path for loads/stores (lw/sw).]

Dual Pipeline Requirements
- A two-instruction (64-bit) instruction register
- Two additional read ports on the register file
- An additional write port on the register file
- Appropriate forwarding hardware
- The compiler must ensure there are no conflicts between the two simultaneous instructions
- Instruction pairing and ordering can be done statically, by the compiler

Dual Pipeline
- Two-instruction issue packet: one slot for an arithmetic or branch instruction, one for a load or store
- Two instructions can issue in the same cycle if they have no data dependences; later instructions follow the same hazard rules as before (a pairing sketch follows below)
- Loop unrolling exposes more instructions to overlap
- Register renaming removes name dependences
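
A compiler can decide whether two adjacent instructions may issue together with one structural test and one dependence test. A minimal sketch of such a pairing rule under this slide's constraints (the struct and field names are illustrative):

    typedef struct {
        int is_mem;               /* load or store -> load/store slot */
        int dest;                 /* destination register, -1 if none */
        int src1, src2;           /* source registers, -1 if unused */
    } Instr;

    /* May a and b (adjacent, in program order) issue in the same cycle? */
    int can_pair(const Instr *a, const Instr *b)
    {
        if (a->is_mem == b->is_mem)          /* need one of each kind */
            return 0;
        if (a->dest != -1 &&                 /* b must not depend on a */
            (a->dest == b->src1 || a->dest == b->src2 || a->dest == b->dest))
            return 0;
        return 1;
    }

This rule is why the paired ldur on the next slide uses offset #-8: pairing the load with the sub that decrements X10 is only legal because the load reads the pre-decrement value, and the compiler rewrites the offset accordingly.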

Dual Pipeline, pairings

    ALU/branch pipe           load/store pipe
    lsl  X10, X19, 3
    add  X10, X10, X20
loop:
    sub  X10, X10, 8          ldur X9, [X10, #-8]    // ldur sees the not-yet-decremented X10, hence #-8
    sub  X11, X10, X20        nop
    add  X9, X9, X21          nop
    cbnz X11, loop            stur X9, [X10, #0]     // stall if taken

Time for dual pipeline (no loop unrolling)

200 ps per cycle, 1 or 2 instructions per cycle
    Time = 5x200x1000 + 2x200 – 1x200 ps = 1,000.2 ns
With branch prediction: 4x200x1000 + 2x200 = 800.4 ns

Loop unrolling – step 3 (repeated from above): the unrolled loop with X9/X12 renaming is the starting point for the dual-pipeline schedule.

Dual Pipeline, loop unrolling with more register renaming

With two pipes, X9 and X12 would be reused too soon; renaming the second pair of elements to X13 and X14 gives four independent chains:

    lsl  X10, X19, 3
    add  X10, X10, X20
loop:
    sub  X10, X10, 32
    ldur X9, [X10, #24]
    ldur X12, [X10, #16]
    add  X9, X9, X21
    add  X12, X12, X21
    stur X9, [X10, #24]
    stur X12, [X10, #16]
    ldur X13, [X10, #8]
    ldur X14, [X10, #0]
    add  X13, X13, X21
    add  X14, X14, X21
    sub  X11, X10, X20
    stur X13, [X10, #8]
    stur X14, [X10, #0]
    cbnz X11, loop          // stall if taken

Step 2: reorder/pair instructions

    ALU/branch pipe           load/store pipe
    lsl  X10, X19, 3
    add  X10, X10, X20
loop:
    sub  X10, X10, 32         ldur X9, [X10, #-8]    // old X10 - 8 = new X10 + 24
                              ldur X12, [X10, #16]
    add  X9, X9, X21          ldur X13, [X10, #8]
    add  X12, X12, X21        stur X9, [X10, #24]
    add  X13, X13, X21        ldur X14, [X10, #0]
    sub  X11, X10, X20        stur X12, [X10, #16]
    add  X14, X14, X21        stur X13, [X10, #8]
    cbnz X11, loop            stur X14, [X10, #0]    // stall if taken

8 issue slots per pass, so 9 cycles with the taken-branch stall.

Time for dual pipeline with loop unrolling

200 ps per cycle, 1 or 2 instructions per cycle
4 iterations per pass, 250 passes through the loop
    Time = 9x200x250 + 2x200 – 1x200 ps = 450.2 ns
With branch prediction: 8x200x250 + 2x200 = 400.4 ns
- 10 times faster than single-cycle or multicycle
- 3 times faster than the simple pipeline
- 1 ¾ times faster than the single pipeline with loop unrolling
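
Every pipelined timing above comes from one formula: cycles per pass, times 200 ps, times the number of passes, plus the two setup instructions, minus the stall that never happens on the final, not-taken branch. A small C program that reproduces the slide numbers:

    #include <stdio.h>

    /* Time in ns for a pipelined loop: cyc cycles per pass, n passes,
       2 setup cycles, minus the final-iteration stall. */
    static double t_ns(int cyc, int n)
    {
        long ps = (long)cyc * 200 * n + 2 * 200 - 1 * 200;
        return ps / 1000.0;
    }

    int main(void)
    {
        printf("single cycle:        %.1f ns\n", (6 * 800 * 1000 + 2 * 800) / 1000.0);
        printf("multicycle:          %.1f ns\n", 24008L * 200 / 1000.0);
        printf("simple pipeline:     %.1f ns\n", t_ns(9, 1000));   /* 1,800.2 */
        printf("reordered:           %.1f ns\n", t_ns(7, 1000));   /* 1,400.2 */
        printf("unrolled x4:         %.1f ns\n", t_ns(16, 250));   /*   800.2 */
        printf("dual issue:          %.1f ns\n", t_ns(5, 1000));   /* 1,000.2 */
        printf("dual issue + unroll: %.1f ns\n", t_ns(9, 250));    /*   450.2 */
        return 0;
    }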

More Parallelism?

Suppose the loop has more kinds of operations:
- Multiplication (takes longer than addition)
- Floating point (takes much longer than integer)
More parallel pipelines for the different operations, plus all of the techniques above, could give better performance:
- Static (as above) vs. dynamic reordering
- Speculation
- Out-of-order execution

ARM Cortex A53 Pipeline

[Pipeline diagram: instruction fetch and branch prediction (stages F1–F4: AGU/TLB, instruction cache, hybrid and indirect predictors) feeding a 13-entry instruction queue; early and main/late decode (D1–D3); in-order issue to the integer execute and load/store cluster (ALU pipes 0 and 1, MAC pipe, divide pipe, load pipe, store pipe) with writeback; and a floating-point/NEON cluster (F1–F5: Iss, Ex1, Ex2, Wr) with a NEON ALU pipe and a Mul/Div/Sqrt pipe.]

Multiple Issue – Multiple Functional Units

Dynamic Scheduling
- The processor determines which instructions to issue
- Reservation stations use copies of operand values from the registers (equivalent to renaming); a sketch follows below
- When an operation completes, its result is buffered in the commit unit
- When it can be done in order, results are written to the registers or memory
- Speculation: guessing which way a branch goes, or guessing that a load does not conflict with a store
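
The "copies of operand values" live in reservation-station entries: each operand field holds either a value already captured from the register file or a tag naming the unit that will produce it, which is what makes the scheme equivalent to renaming. A minimal Tomasulo-style sketch (field names follow the classic textbook description, not any particular machine):

    #include <stdbool.h>

    /* One reservation-station entry. An operand is a value when its
       q tag is 0; otherwise it waits on the unit named by the tag. */
    typedef struct {
        bool busy;
        int  op;                  /* operation to perform */
        long vj, vk;              /* operand values, valid when qj/qk == 0 */
        int  qj, qk;              /* producing-unit tags; 0 = ready */
    } RS;

    /* Broadcast a completed result: every entry waiting on `tag`
       captures the value, exactly like a forwarding/rename match. */
    void broadcast(RS rs[], int n, int tag, long value)
    {
        for (int i = 0; i < n; i++) {
            if (!rs[i].busy) continue;
            if (rs[i].qj == tag) { rs[i].vj = value; rs[i].qj = 0; }
            if (rs[i].qk == tag) { rs[i].vk = value; rs[i].qk = 0; }
        }
    }

    /* An entry may start executing once both operands are values. */
    bool ready(const RS *e) { return e->busy && e->qj == 0 && e->qk == 0; }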

FIGURE 4.74 The microarchitecture of AMD Opteron X4. The extensive queues allow up to 106 RISC operations to be outstanding, including 24 integer operations, 36 floating point/SSE operations, and 44 loads and stores. The load and store units are actually separated into two parts, with the first part handling address calculation in the Integer ALU units and the second part responsible for the actual memory reference. There is an extensive bypass network among the functional units; since the pipeline is dynamic rather than static, bypassing is done by tagging results and tracking source operands, so as to allow a match when a result is produced for an instruction in one of the queues that needs the result. Copyright © 2009 Elsevier, Inc. All rights reserved.

FIGURE 4.75 The Opteron X4 pipeline showing the pipeline flow for a typical instruction and the number of clock cycles for the major steps in the 12-stage pipeline for integer RISC-operations. The floating point execution queue is 17 stages long. The major buffers where RISC-operations wait are also shown. Copyright © 2009 Elsevier, Inc. All rights reserved.