Instruction Level Parallelism 2. Superscalar and VLIW processors.



Superscalar and VLIW Processors Scalar processors fetch and issue at most one instruction per clock cycle. Multiple-issue processors: Superscalar (issue a varying number of instructions each clock cycle). VLIW (issue a fixed number of instructions each clock cycle). Vittorio Zaccaria – ST 2001

Superscalar Processors Issue from 1 to 8 instructions each clock cycle. If an instruction depends on an earlier one in the issue group, only the instructions preceding it are issued (in-order issue). This decision is made at run time by the processor => variability in the issue rate.

Superscalar Processors Can be: Statically scheduled (instructions behind a stall are not allowed to issue), or Dynamically scheduled and speculative (instructions behind RAW hazards are allowed to proceed).

How to optimize code for Superscalar Processors (1)

Original loop:

Loop: LD    F0,0(R1)
      ADDD  F4,F0,F2
      SD    0(R1),F4
      SUBI  R1,R1,#8
      BNEZ  R1,Loop

The loop is unrolled 4 times (load/addd/store); RAW hazards have been reduced, but there are resource conflicts on the pipelines:

Loop: LD    F0,0(R1)
      LD    F6,-8(R1)
      LD    F10,-16(R1)
      LD    F14,-24(R1)
      ADDD  F4,F0,F2
      ADDD  F8,F6,F2
      ADDD  F12,F10,F2
      ADDD  F16,F14,F2
      SD    0(R1),F4
      SD    -8(R1),F8
      SD    -16(R1),F12
      SUBI  R1,R1,#32
      BNEZ  R1,Loop
      SD    8(R1),F16    ; 8-32 = -24
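As a rough illustration of why unrolling pays off, the following sketch (not from the slides; the trip count is an assumption) tallies the dynamic instructions executed by the rolled and 4x-unrolled versions:

```python
# Sketch: count dynamic instructions for N iterations of the example loop.
# Each original iteration has 3 body ops (LD/ADDD/SD); the SUBI/BNEZ
# overhead (2 ops) is paid once per loop trip.

def dynamic_count(n_iters, body_ops, overhead_ops, unroll):
    trips = n_iters // unroll          # assumes n_iters divisible by unroll
    return trips * (body_ops * unroll + overhead_ops)

N = 1000                               # hypothetical trip count
rolled = dynamic_count(N, 3, 2, 1)     # 5000 instructions
unrolled = dynamic_count(N, 3, 2, 4)   # 3500 instructions: overhead amortized
```

Unrolling 4x removes three of every four SUBI/BNEZ pairs; the remaining gain in the slide's version comes from scheduling the independent loop bodies around the latencies.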

How to optimize code for Superscalar Processors (2)

5-times unrolled loop, scheduled for dual issue (one integer and one FP instruction per cycle):

      Integer instruction    FP instruction
Loop: LD    F0,0(R1)
      LD    F6,-8(R1)
      LD    F10,-16(R1)      ADDD  F4,F0,F2
      LD    F14,-24(R1)      ADDD  F8,F6,F2
      LD    F18,-32(R1)      ADDD  F12,F10,F2
      SD    0(R1),F4         ADDD  F16,F14,F2
      SD    -8(R1),F8        ADDD  F20,F18,F2
      SD    -16(R1),F12
      SD    -24(R1),F16
      SUBI  R1,R1,#40
      BNEZ  R1,Loop
      SD    8(R1),F20        ; 8-40 = -32
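A minimal sketch of the pairing rule used above (assumed model: one integer slot and one FP slot per cycle, issued in program order, data hazards ignored):

```python
# Count cycles for an in-order 2-issue machine with one integer and one
# FP slot per cycle. An op whose slot is already taken opens a new cycle.

def dual_issue_cycles(ops):
    """ops: 'int' / 'fp' tags in program order."""
    cycles = 0
    free = set()              # slots still free in the current cycle
    for op in ops:
        if op not in free:    # slot used (or first op): open a new cycle
            cycles += 1
            free = {"int", "fp"}
        free.discard(op)
    return cycles

# dual_issue_cycles(["int", "fp", "fp", "int", "fp"]) -> 3
```

An all-integer sequence gets no benefit (one cycle per op), while a perfectly alternating int/fp sequence halves the cycle count: exactly why the slide's schedule hoists the loads early so each store can pair with an ADDD.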

The PowerPC 620 [’94] Superscalar Architecture Similar to: MIPS R10000 HP PA 8000 Fetch, issue and completion of up to 4 instructions per clock cycle. Six separate execution units buffered with reservation stations.

PowerPC functional units 2 integer units (XSU0, XSU1) with 0-cycle latency [+, -, shift, ...]. 1 complex integer function unit (MCFXU) for pipelined multiply and unpipelined divide, with latency from 3 to 20 cycles. 1 load/store unit, with latency 1 for integer loads and 2 for FP loads.

PowerPC functional units 1 FPU, fully pipelined except for divide, with latencies of 2 cycles for multiply, add, and multiply-add, and 31 cycles for DP FP divide. 1 BRU, which completes branches and informs the fetch unit of mispredictions; it includes the condition register used for conditional branches.

PowerPC Architecture Speculative Tomasulo with register renaming. An extended register file holds the speculative result of an instruction until the instruction commits. The ROB enforces only in-order commit. Advantage: operands are available from a single location (no additional complex logic is needed to access result values in the ROB).
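The renaming idea can be sketched as follows (a toy model, not the 620's actual hardware): every write is given a fresh physical register, so WAR/WAW hazards disappear and each value lives in exactly one location.

```python
# Sketch of register renaming over a hypothetical free list of physical
# registers p0, p1, ... Architectural registers are small integers.

from itertools import count

def rename(instrs, n_arch=4):
    """instrs: list of (dest, [sources]) over architectural registers."""
    free = count()                       # hypothetical free list
    table = {r: f"p{next(free)}" for r in range(n_arch)}
    renamed = []
    for dst, srcs in instrs:
        ops = [table[s] for s in srcs]   # read current mappings first
        table[dst] = f"p{next(free)}"    # fresh dest removes WAR/WAW hazards
        renamed.append((table[dst], ops))
    return renamed

# rename([(1, [2, 3]), (2, [1, 1]), (1, [2, 2])])
# -> [('p4', ['p2', 'p3']), ('p5', ['p4', 'p4']), ('p6', ['p5', 'p5'])]
```

Note how the second write to register 1 gets a new name (p6) while its read of register 2 correctly sees p5: only true (RAW) dependences survive renaming.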

PowerPC 620 architecture

PowerPC Pipeline Fetch: The fetch unit loads the decode queue with instructions from the cache. The next address is predicted through a 256-entry, two-way set-associative BTB; a BPB is used if there is a miss in the BTB.
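The BPB fallback is typically a table of 2-bit saturating counters; here is a minimal sketch of one entry (the 620's exact predictor organization is not given on the slide, so this is a generic model):

```python
# Sketch of one branch prediction buffer entry: a 2-bit saturating counter.

class TwoBitCounter:
    def __init__(self):
        self.state = 0                # 0,1 = predict not-taken; 2,3 = taken

    def predict(self):
        return self.state >= 2        # True = predict taken

    def update(self, taken):
        # saturate at 0 and 3 instead of wrapping around
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

Two bits give hysteresis: a single not-taken outcome (e.g. a loop exit) does not flip a strongly-taken prediction, so the next loop entry is still predicted correctly.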

PowerPC Pipeline Instruction decode: Instructions are decoded and inserted into an 8-entry instruction queue. Instruction Issue: 4 Instructions are taken from the 8-entry instruction queue and are issued to the RS. Allocate a rename register and a reorder buffer entry for the instruction issued. If we can’t, stall.

PowerPC Pipeline Execution: Proceeds with execution when all operands are available. At the end, the result is written on the result bus. The completion unit is notified that the instruction has completed.

PowerPC Pipeline If the instruction is a (mispredicted) branch, the IFU and IC(ompletion)U are notified. Instruction fetch restarts, and the ICU discards all the speculated instructions after the branch and frees their rename buffers. Commit: when all previous instructions have been committed, the result is committed into the RF and the rename buffer is freed. Stores commit from the store buffer to memory.

Performance results IPC ranges from under 1 to 1.8. IPC=4 is not reached because: FUs are not replicated for each instruction type (structural hazards); instruction-level parallelism is limited; buffering is limited (insufficient buffers).

P6 Processor Family: Intel Pentium II/III 3-way superscalar. Basic idea: three engines:

P6 Pipeline Fetch/Decode Unit: decodes instructions and puts them in the instruction pool in order; converts the instructions into micro-ops that represent the instruction's work. Dispatch/Execute Unit: out-of-order issue from the instruction pool through a reservation station, and out-of-order execution of micro-ops. Retire Unit: reorders the instructions and commits speculative results to the architectural state.

P6 Instruction Decode The decoder fetches 16 bytes per clock cycle from the cache. 3 parallel decoders convert most of the instructions into one or more triadic micro-ops. Some instructions need microcode (several micro-ops) to be executed. The Register Alias Table unit converts logical register references into physical register references in the ROB (register renaming).

P6 Instruction Dispatch/Execute The dispatch unit dispatches micro-ops out of order from the instruction pool through the reservation station unit. This happens when: all the operands are ready, and the resource needed is free. Maximum throughput: 5 micro-ops/cycle. If a micro-op is a branch, its execution outcome is compared with the address predicted in the fetch phase. If mispredicted, the JEU (Jump Execution Unit) changes the status of all the micro-ops behind the branch and removes them from the instruction pool.
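The operand-ready dispatch rule can be sketched like this (hypothetical micro-op format; execution-unit availability is omitted for brevity):

```python
# Sketch of out-of-order dispatch from an instruction pool: each cycle,
# up to `width` micro-ops whose source operands are ready are selected,
# regardless of program order.

def dispatch(pool, ready_regs, width=5):
    """pool: list of (name, srcs, dst) in program order.
    Returns the names of micro-ops dispatched this cycle."""
    issued = []
    for name, srcs, dst in pool:
        if len(issued) == width:         # throughput limit reached
            break
        if all(s in ready_regs for s in srcs):
            issued.append(name)          # operands ready: dispatch
    return issued

pool = [("a", ["r1"], "r2"), ("b", ["r2"], "r3"), ("c", ["r1"], "r4")]
# dispatch(pool, {"r1"}) -> ["a", "c"]   (b waits on r2, c passes it)
```

The point of the example is that "c" dispatches ahead of the stalled "b": dependences, not program order, gate execution.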

P6 Instruction Retire The retire unit looks for micro-ops that have been executed and can be removed from the pool, writing each result to its original architectural target. This is done in order, committing an instruction only if: all previous instructions have been committed, and the instruction itself has been executed. Up to 3 micro-ops can be retired each clock cycle.
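In-order retirement reduces to scanning the reorder buffer from its head (a sketch with a hypothetical ROB entry format):

```python
# Sketch of in-order retirement: commit up to `max_retire` completed
# micro-ops from the ROB head; stop at the first one still executing,
# because nothing younger may commit before it.

def retire(rob, max_retire=3):
    """rob: list of (name, done) in program order, oldest first."""
    retired = []
    for name, done in rob:
        if not done or len(retired) == max_retire:
            break
        retired.append(name)
    return retired

# retire([("a", True), ("b", True), ("c", False), ("d", True)])
# -> ["a", "b"]   ("d" is done but must wait behind "c")
```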

Pentium 4

New NetBurst micro-architecture 20 pipeline stages (hyper-pipeline). 1.4 GHz to 2 GHz. 3 prefetching mechanisms: hardware instruction prefetcher (based on the BTB); software-controlled data cache prefetching; L3->L2 data and instruction hardware prefetcher.

Pentium 4 Execution Trace Cache The TC stores decoded IA-32 instructions (micro-ops), removing decoding costs. 12K micro-ops; 3 micro-ops per cycle of fetch bandwidth. It stores traces built across predicted branches. Some instructions, however, need microcode from ROM.

Pentium 4 The branch misprediction penalty can be well over 10 cycles. Uses a BTB; in case of a miss in the BTB, static prediction is used (backward = taken, forward = not taken). Software branch hints supplied during trace construction can override the static prediction.
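The static fallback rule is simple enough to state as code (a sketch; backward branches are predicted taken because they are usually loop-closing branches):

```python
# Static prediction on a BTB miss: backward branch => taken,
# forward branch => not taken.

def static_predict(branch_pc, target_pc):
    return target_pc < branch_pc      # target behind us: likely a loop

# static_predict(0x100, 0x80)  -> True  (backward, predict taken)
# static_predict(0x100, 0x140) -> False (forward, predict not taken)
```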

Pentium 4 Execution Units and Issue Ports

Pentium 4 1 load and 1 store issued each cycle. Loads can be reordered with respect to other loads and stores. Loads can be executed speculatively. Up to 4 outstanding load misses. Load/store forwarding.

AMD Athlon K7 Nine-issue (micro-ops), super-pipelined, superscalar x86 processor. Multiple x86 instruction decoders (into triadic micro-ops). Three out-of-order, superscalar, fully pipelined floating-point execution units. Three out-of-order, superscalar, pipelined integer units. Three out-of-order, superscalar, pipelined address calculation units. 72-entry instruction control unit (ROB).

AMD Athlon K7

The Instruction Control Unit contains a reorder buffer and distributed reservation stations that hold operands while OPs wait to be scheduled. The Integer Instruction Scheduler picks OPs for execution based on their operand availability and issues them to functional units or address-generation units. The functional units perform transformations on data and return their results to the reorder buffer, while the address-generation units send calculated memory addresses to the Load/Store Unit for further processing.

Clustered VLIW

Multi-Ported Register File Limits Area of the register file grows approximately with the square of the number of ports

Multiported Register File Read access time of a register file grows approximately linearly with the number of ports: internal bit-cell loading becomes larger, and the larger register file area causes longer wire delays. What is reasonable today in terms of number of ports? It changes with technology; there is a practical maximum on the total number of ports (read ports + write ports).
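The two scaling rules (area roughly quadratic, read access time roughly linear in the port count) can be captured with illustrative constants. The coefficients below are assumptions for the sketch, not measured values:

```python
# Sketch of register file scaling with port count.

def rf_area(ports, cell_area=1.0):
    # each port adds a wordline and a bitline, so the bit cell
    # grows in both x and y => area ~ ports^2
    return cell_area * ports ** 2

def rf_access_time(ports, base_delay=1.0, per_port=0.1):
    # extra ports add bit-cell loading and wire length roughly linearly
    return base_delay + per_port * ports

# doubling the port count quadruples the area: rf_area(8) / rf_area(4) == 4
```

This quadratic area growth is exactly the bottleneck that clustering attacks: several small register files with few ports are far cheaper than one large file serving every execution unit.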

Clustered VLIW To solve the bottleneck, create partitioned register files, each connected to a small number of execution units.

Register File Communication Architecturally Invisible: Partitioned RFs appear as one large register file to the compiler. Copying between RFs is done by control hardware. Detecting when a copy is needed can be complicated, which goes against the VLIW philosophy of minimal control overhead.

Register File Communication Architecturally Visible: Remote and local versions of instructions, plus explicit copy primitives. Remote instructions have one or more operands in a non-local RF; copying remote operands to the local RF takes clock cycles. Because the copy is an 'atomic' part of the remote instruction, the execution unit is idle while the copy is done => performance loss.

Register File Communication Copy instructions: separation of copy and execution allows more flexible scheduling by the compiler:

move r1, r60          // r60 is in another RF
independent instr a   // do not waste useful
independent instr b   // clock cycles
add  r2, r1, r3

Instruction Compression Embedded processors often put a limit on code size. How to reduce size? NOPs are common: use only a few bits (2-3) to represent a NOP. Or mark explicitly the start and stop of the long instruction and do not insert NOPs at all.
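A sketch of the first scheme (the bit widths are the ones mentioned on the slide; the encoding itself is hypothetical):

```python
# Sketch of NOP compression for a VLIW word: a real operation costs a
# full 32-bit slot, an explicit NOP only a short 2-bit marker.

def compressed_bits(ops, op_bits=32, nop_bits=2):
    return sum(nop_bits if op == "NOP" else op_bits for op in ops)

word = ["ADD", "NOP", "NOP", "MUL", "NOP", "NOP", "NOP", "LD"]
# compressed_bits(word) -> 106 bits, vs. 8 * 32 = 256 bits uncompressed
```

With only 3 useful operations in an 8-slot word, compression recovers most of the wasted space, which matters when NOPs dominate poorly filled long instructions.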

Instruction Decompression On instruction cache fill: the ICache has to hold uncompressed instructions, which limits its effective size. On instruction fetch: decompression is in the critical path of the fetch stage; one or more pipeline stages may have to be added just for decompression.

VLIW Architectures: Some real-world examples

TMS320C6X CPU 8 independent execution units. Execution unit types: L: integer adder, logical, bit counting, FP adder, FP conversion. S: integer adder, logical, bit manipulation, shifting, constants, branch/control, FP compare. D: integer adder, load/store. M: integer multiplier, FP multiplier. Split into two identical datapaths, each containing the same four units (L, S, D, M).

TMS320C6X CPU (cont.) Max clock speed of 200 MHz. Each datapath has a 16 x 32-bit register file.

Instruction Encoding The internal execution path is 256 bits wide. Each operation is 32 bits wide => 8 operations per clock. A fetch packet is a group of instructions fetched simultaneously; a fetch packet has 8 instructions. An execute packet is a group of instructions beginning execution in parallel; an execute packet has up to 8 instructions.

Instruction Encoding Instructions in the ICache have an associated P-bit (parallel bit). A fetch packet is expanded into 1 to 8 execute packets during the fetch stage, depending on the P-bits.
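A sketch of that expansion, assuming the usual C6x convention that an instruction's P-bit set to 1 means the next instruction executes in parallel with it:

```python
# Sketch: split a fetch packet into execute packets. An execute packet
# ends after each instruction whose P-bit is 0 (its chain of parallel
# instructions stops there).

def expand(fetch_packet):
    """fetch_packet: list of (instr, p_bit). Returns execute packets."""
    packets, current = [], []
    for instr, p in fetch_packet:
        current.append(instr)
        if p == 0:                    # parallelism chain ends here
            packets.append(current)
            current = []
    return packets

fp = [("A", 1), ("B", 0), ("C", 0), ("D", 1), ("E", 1), ("F", 0),
      ("G", 0), ("H", 0)]
# expand(fp) -> [["A","B"], ["C"], ["D","E","F"], ["G"], ["H"]]
```

Here one 8-instruction fetch packet yields 5 execute packets, i.e. 5 cycles; with all P-bits set the same packet would issue in a single cycle.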

Fetch Packet to Execute Packet Expansion

Philips TM 1000/Multimedia Processor

Philips Trimedia Five execution units => five operations issued per clock. 15 read and 5 write ports on the register file: 15 read ports are needed for 5 execution units because each operation requires two source operands and a guard operand. The guard operand makes each operation conditional on the value of the LSB of the guard operand => predicated execution. 128 registers (r0 always 0, r1 always 1).
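Guarded (predicated) execution can be sketched as follows (hypothetical register-file model; the point is that the operation always issues but only commits when the guard's LSB is 1):

```python
# Sketch of a guarded operation: commit `value` to regs[dst] only if the
# least significant bit of the guard register is 1.

def guarded_exec(regs, guard, dst, value):
    if regs[guard] & 1:            # guard LSB selects commit / no-op
        regs[dst] = value
    return regs

regs = {"r5": 1, "r6": 0, "r2": 0}
guarded_exec(regs, "r5", "r2", 42)   # guard true: r2 becomes 42
guarded_exec(regs, "r6", "r2", 7)    # guard false: r2 unchanged
```

This is how short if/else bodies are compiled without branches: both arms are issued, each guarded by the condition or its complement.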

Philips Trimedia Instructions Multiple operation sizes: 2 bits for a NOP; 26, 34, and 44 bits otherwise.