PowerPC 604 Superscalar Microprocessor

PowerPC 604 Superscalar Microprocessor IBM, Motorola, Apple

11/13

PPC604e Overview
• Member of the RISC PowerPC family.
• PowerPC architecture: 32-bit effective (logical) addresses; integer data types of 8, 16, and 32 bits; floating-point data types of 32 and 64 bits (single- and double-precision, respectively).
• Superscalar processor: can issue up to four instructions per clock cycle.
• Up to seven instructions can execute in parallel.

Overview: 604e has 7 units
The 604e has seven independent execution units that operate in parallel:
• Floating-point unit (FPU)
• Branch processing unit (BPU)
• Condition register unit (CRU)
• Load/store unit (LSU)
• Three integer units (IUs): two single-cycle integer units (SCIUs) and one multiple-cycle integer unit (MCIU)

Three-stage pipelined floating-point unit (FPU)
• Fully IEEE 754 compliant FPU
• Supports non-IEEE mode for time-critical operations
• Fully pipelined, single-pass double-precision design
• Two-entry reservation station to minimize stalls
• Thirty-two 64-bit FPRs for single- or double-precision operands

BPU & CRU
Branch processing unit (BPU) with dynamic branch prediction:
• Two-entry reservation station
• Out-of-order execution through two branches
• 64-entry fully associative branch target address cache (BTAC)
• 512-entry branch history table (BHT) with two prediction bits per entry
Condition register unit (CRU)

Condition resolution takes time

Solution: Branch speculation

Branch History Table (BHT)
• The BHT is a table of predictors; each branch is assigned a predictor.
• A predictor can be 1 bit or more; most schemes use at least 2-bit predictors.
• The table is indexed by the PC address of the branch.
• Performance = ƒ(accuracy, cost of misprediction).
• Misprediction: flush the reorder buffer.
• In the fetch stage of a branch: use the predictor to make a prediction.
• When the branch completes: update the corresponding predictor.
[Figure: the branch PC indexes a table of predictors, Predictor 0 through Predictor 7.]
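The predict-at-fetch, update-at-completion scheme above can be sketched with 2-bit saturating counters. This is a minimal illustration, not the 604's actual logic: the class name `BranchHistoryTable`, the index function `(pc >> 2) % entries`, and the initial counter value are assumptions made for the example; only the 512-entry, 2-bit organization comes from the slides.

```python
class BranchHistoryTable:
    """Table of 2-bit saturating counters indexed by branch PC (illustrative)."""

    def __init__(self, entries=512, initial=1):
        # Counter values: 0, 1 predict not taken; 2, 3 predict taken.
        # The initial value (weakly not taken) is an assumption.
        self.entries = entries
        self.counters = [initial] * entries

    def _index(self, pc):
        # Assumed index function: drop the 2 byte-offset bits, then wrap.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Used in the fetch stage: True means 'predict taken'."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Used when the branch completes: move the counter toward the outcome."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(bht, pc, outcomes):
    """Fraction of correct predictions for one branch over a list of outcomes."""
    correct = 0
    for taken in outcomes:
        if bht.predict(pc) == taken:
            correct += 1
        bht.update(pc, taken)
    return correct / len(outcomes)


if __name__ == "__main__":
    # A loop branch: taken nine times, then falls through, executed twice.
    loop = ([True] * 9 + [False]) * 2
    print(accuracy(BranchHistoryTable(), 0x1000, loop))
```

With two bits, a single loop exit does not flip the prediction, so the branch is predicted taken again on the next entry into the loop. A 1-bit predictor would mispredict twice per loop execution instead of once.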

BTB: Branch Address at Same Time as Prediction
• Branch target buffer (BTB): the address of the branch indexes the table to get both the prediction and the branch target address (if taken).
• The PC of the instruction in FETCH is compared (=?) against the stored branch PCs; each entry holds a branch PC, a predicted PC, and prediction state bits.
• Yes (match): the instruction is a branch; use the predicted PC as the next PC.
• No (no match): the branch is not predicted; proceed normally (next PC = PC + 4).
• Only predicted-taken branches and jumps are held in the BTB.
• The next PC is determined before the branch is fetched and decoded.
• Later: check the prediction; if it was wrong, kill the instruction and update the BPb.
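The lookup above can be sketched as a small table keyed by branch PC. This is a simplified model under stated assumptions: a Python dict stands in for the associative lookup, capacity and replacement are ignored, and the names `BranchTargetBuffer`, `next_pc`, and `resolve` are hypothetical.

```python
class BranchTargetBuffer:
    """Holds predicted targets for branches predicted taken (illustrative)."""

    def __init__(self):
        # Maps branch PC -> predicted target PC. A dict models an associative
        # lookup with unlimited capacity, which is an assumption of this sketch.
        self.targets = {}

    def next_pc(self, pc):
        """Fetch-time lookup: hit -> predicted target, miss -> PC + 4."""
        return self.targets.get(pc, pc + 4)

    def resolve(self, pc, taken, target):
        """Check the earlier prediction once the branch outcome is known.

        Returns True if the fetch-time prediction was correct. Only taken
        branches are kept in the table, as on the slide.
        """
        predicted = self.next_pc(pc)
        actual = target if taken else pc + 4
        if taken:
            self.targets[pc] = target
        else:
            self.targets.pop(pc, None)
        return predicted == actual


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    print(hex(btb.next_pc(0x100)))          # miss: sequential fetch
    print(btb.resolve(0x100, True, 0x200))  # branch was taken: mispredicted
    print(hex(btb.next_pc(0x100)))          # hit: fetch from the target
```

A False return from `resolve` corresponds to the "kill instruction" case: the wrongly fetched instructions are discarded and fetch restarts at the actual next PC.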

PPC604 Pipeline

PowerPC604 Pipeline overview
• Instruction fetch (IF): loads the decode queue (DEQ) with instructions from the I-cache and determines the next instruction address.
• Instruction decode (ID): performs time-critical decoding on instructions in the dispatch queue (DISQ).
• Instruction dispatch (DS): dispatches up to four instructions per cycle, in order, at most one per functional unit; performs non-time-critical instruction decoding; determines when an instruction can be dispatched to the execution units.
• At the end of DS, instructions and their operands are latched into the execution input latches or into the unit's reservation station. Rename registers and reorder buffer entries are allocated.

Execute (E), Complete (C), Writeback (W)
• Execute (E): instruction flow is split among six execution units. Instructions enter execute from dispatch or from a reservation station. Results are written into a rename buffer entry, and the complete stage is notified.
• Complete (C): ensures that the correct machine state is maintained; monitors instructions in the complete and execute stages. Instructions are removed from the reorder buffer (ROB) when they complete. Results are written back from the rename buffers to the registers at complete or writeback.
• Writeback (W): writes back results from the rename buffers that were not written back during complete.

604 Block Diagram – Internal Data paths

Reservation Stations & Result Buses

Execution Latencies

PPC604e Unit Pipeline Stages

Example 1: Instruction timing for Cache HIT

Example 1: Instruction Timing for cache Hit
[Timing table: clock cycles 1 to 11 across, instructions down. Stages shown: Fet, DQ, DS, EX, C, C/WB. Instructions: 0 AND, 1 OR, 2 FADD, 3 FSUB, 4 ADDC, 5 SUBFC, 6 FMADD, 7 FMSUB, 8 XOR, 9 NEG, 10 FADDS, 11 FSUBS, 12 ADD, 13 SUB.]

Example 2: Branch Taken with BTAC hit
No branch penalty; 4 OR is from the target stream.
[Timing table: clock cycles 1 to 10 across, instructions down. Stages shown: Fet, DQ, DS, EX, C, C/WB. Instructions: 0 AND, 1 LD, 2 ADD, 3 BC (taken), 4 OR, 5 CMP, 6 LD, 7 MULLI. Row annotations: "waits for LD", "add", "waits bc".]
• Cycle 2: instructions 4 to 7 are fetched from the target, based on the address from the BTAC hit.
• Cycle 5: instructions 2 and 3 wait for the LD to retire (WB) and retire with it.
• Cycle 5: because the branch is taken, the OR (4) instruction, which could otherwise write back in this cycle, stays in the complete stage and completes and writes back in the next cycle. The CMP (5) instruction also enters the complete stage; ld (6) and mulli (7) enter the second stages of the LSU and MCIU pipelines, respectively.

Example 2: Branch taken with BTAC HIT. No penalty.

Example 3: Branch taken, BTAC HIT, Icache MISS

Ex 4: Branch taken, BTAC Miss, correct at Decode stage
• One-clock penalty, to fetch the target group (2, 3, 4, 5).
• Correction at decode includes branches on CR (flags) and LR.

Ex 5: Branch taken, BTAC Miss, correct at Dispatch stage: 2-clock branch penalty

Example 6: Branch taken, BTAC Miss, correct at Execute: 3-clock penalty
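The penalties in Examples 2 and 4 to 6 can be collected into a lookup table and used to estimate lost cycles for a mix of branches. The penalty values come from the slides; the function name `total_penalty_cycles` and the branch counts in the usage lines are made up for illustration, and I-cache hits are assumed throughout (Example 3 covers the miss case).

```python
# Branch penalty in clock cycles, by where the fetch address gets corrected.
# Values are taken from Examples 2, 4, 5 and 6; I-cache hits are assumed.
BRANCH_PENALTY = {
    "btac_hit": 0,   # Example 2: target fetched with no bubble
    "decode": 1,     # Example 4: BTAC miss, corrected at decode
    "dispatch": 2,   # Example 5: BTAC miss, corrected at dispatch
    "execute": 3,    # Example 6: BTAC miss, corrected at execute
}


def total_penalty_cycles(resolved_counts):
    """Sum penalty cycles given {correction point: number of taken branches}."""
    return sum(BRANCH_PENALTY[where] * n for where, n in resolved_counts.items())


if __name__ == "__main__":
    # Hypothetical mix of 100 taken branches (not measured data).
    mix = {"btac_hit": 70, "decode": 15, "dispatch": 10, "execute": 5}
    cycles = total_penalty_cycles(mix)
    print(cycles, cycles / sum(mix.values()))
```

The later a misprediction is caught, the more fetch groups have to be discarded, which is why the penalty grows by one clock per stage.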

Class Example – real dependencies
1 ADD R1, R2, R3 ; R1 = R2 + R3
2 ADD R2, R1, R4
3 OR R3, R1, R4
4 SUB R3, R2, R3
5 FMUL F7, F5, F6
6 FSUB F8, F10, F7
7 AND R4, R1, R3
[Timing table to fill in: clock cycles 1 to 10 across, instructions 1 to 7 down. Instruction 1 (ADD) passes through Fet, DQ, DS, EX, C/WB.]
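Before filling in the timing table, it helps to list the real (read-after-write) dependencies in the class example. The sketch below finds them by tracking the most recent writer of each register. It assumes the `op dest, src1, src2` operand order shown in the comment on instruction 1; the tuple encoding and the function name `raw_dependencies` are illustrative.

```python
# Class example encoded as (number, opcode, destination, sources).
# Operand order "op dest, src1, src2" follows the comment "R1 = R2 + R3".
PROGRAM = [
    (1, "ADD", "R1", ("R2", "R3")),
    (2, "ADD", "R2", ("R1", "R4")),
    (3, "OR", "R3", ("R1", "R4")),
    (4, "SUB", "R3", ("R2", "R3")),
    (5, "FMUL", "F7", ("F5", "F6")),
    (6, "FSUB", "F8", ("F10", "F7")),
    (7, "AND", "R4", ("R1", "R3")),
]


def raw_dependencies(program):
    """Map each instruction to the earlier instructions whose results it reads."""
    last_writer = {}  # register -> number of the latest instruction writing it
    deps = {}
    for num, _op, dest, srcs in program:
        # Read sources first, so an instruction never depends on itself.
        deps[num] = sorted({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = num
    return deps


if __name__ == "__main__":
    for num, producers in raw_dependencies(PROGRAM).items():
        print(num, "depends on", producers or "nothing")
```

An instruction cannot enter execute until every producer in its list has a result available, so these lists determine where stalls appear in the table. Instructions 3 and 4 both write R3, which is a name dependency that register renaming removes rather than a real one.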

PPC604 Pipeline

Pipeline Details: Fetch Stage
• Fetches instructions from the I-cache and loads the decode queue (DEQ).
• Determines the address of the next instruction to be fetched.
• Keeps the queue supplied with instructions for dispatch.
• Instructions are fetched from the I-cache in groups of four, from a single cache block.
• If only two instructions remain in the cache block, only two instructions are fetched.
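The rule that a fetch group never crosses a cache block boundary can be written as a one-line calculation. This sketch assumes a 32-byte cache block and 4-byte instructions; the block size is not stated on the slide, and the function name `fetch_group_size` is hypothetical.

```python
def fetch_group_size(fetch_addr, block_bytes=32, insn_bytes=4, max_group=4):
    """Number of instructions fetched in one cycle starting at fetch_addr.

    block_bytes=32 is an assumption of this sketch; insn_bytes=4 matches
    fixed-length 32-bit PowerPC instructions. A group never crosses a block.
    """
    offset = fetch_addr % block_bytes
    remaining = (block_bytes - offset) // insn_bytes
    return min(max_group, remaining)


if __name__ == "__main__":
    for addr in (0x1000, 0x1010, 0x1018, 0x101C):
        print(hex(addr), fetch_group_size(addr))
```

A taken branch whose target lands near the end of a block therefore delivers a short fetch group in the first cycle after the redirect.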

Next instruction fetch address
• Each stage offers a candidate address to be fetched; the latest stage has the highest priority.
• As a block is prefetched, the branch target address cache (BTAC) and the branch history table (BHT) are searched with the fetch address.
• If the address is in the BTAC, the next instruction is fetched from the target address stored there.
• DECODE may indicate, based on the BHT or on decoding an unconditional branch, that an earlier BTAC prediction was incorrect.
• The BPU can indicate that a previous branch prediction, from the BTAC or from DECODE, was incorrect.
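The "latest stage wins" rule can be sketched as a priority selection over the candidate addresses. The stage labels in `STAGE_PRIORITY` and the function name `next_fetch_address` are hypothetical; the slides state only that later stages override earlier ones.

```python
# Candidate sources ordered from earliest (lowest priority) to latest
# (highest priority). The labels are illustrative, not hardware signal names.
STAGE_PRIORITY = ["sequential", "btac", "decode", "dispatch", "execute"]


def next_fetch_address(candidates):
    """Pick the fetch address offered by the latest pipeline stage.

    candidates maps a stage label to an address, or to None if that stage
    offers no address this cycle.
    """
    for stage in reversed(STAGE_PRIORITY):
        addr = candidates.get(stage)
        if addr is not None:
            return stage, addr
    raise ValueError("no candidate fetch address offered")


if __name__ == "__main__":
    # BTAC hit overrides sequential fetch.
    print(next_fetch_address({"sequential": 0x1010, "btac": 0x2000}))
    # A correction from execute (the BPU) overrides everything earlier.
    print(next_fetch_address({"sequential": 0x2010, "decode": 0x2400,
                              "execute": 0x3000}))
```

Later stages have more information about the branch (decoded opcode, resolved condition), so their candidate corrects any guess made earlier in the pipeline.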

Decode Stage
• Handles time-critical decoding of instructions in the instruction buffer.
• Contains a four-instruction buffer (DEQ); shifts one or two pairs of instructions into the dispatch buffer as space becomes available.
• Branch correction predicts branches whose target is taken from the CTR or LR; this occurs only if no CTR or LR updates are pending.

Dispatch Stage
• Performs non-time-critical decoding of the instructions supplied by decode.
• Determines which instructions can be dispatched.
• Source operands are read from the register file and dispatched to the execution units.
• Dispatched instructions and their operands are latched into reservation stations or execution unit input latches.
• Each dispatched instruction is issued a position in the 16-entry completion buffer.
• A rename buffer is allocated to the instruction if needed.

Execute Stage
• An instruction is passed to the appropriate execution unit after fetch, decode, and dispatch.
• Execution units have different latencies.
• The floating-point unit is a fully pipelined, three-stage execution unit.
• Execution units write their results into the appropriate rename buffer and notify the complete stage.

Branch Mispredict / Exceptions
• What if a branch instruction was mispredicted in an earlier stage? Instructions from the mispredicted path are flushed, and fetching resumes at the correct address.
• If an instruction causes an exception, the execution unit reports the exception to the complete stage and continues executing instructions.

Complete Stage
• Maintains the correct architectural machine state.
• As instructions finish execution, their status is recorded in a completion buffer (FIFO) entry.
• Entries are examined in the order in which the instructions were dispatched; this retains program order and ensures that instructions complete in order.
• Four entries are examined during each cycle for writeback.
• The completion buffer is used to ensure a precise exception model.
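In-order completion on top of out-of-order execution can be sketched with a FIFO. The 16-entry size and the four entries examined per cycle come from the slides; the class name `CompletionBuffer`, its method names, and the use of integer tags are assumptions of this simplified model, which ignores exceptions and write-back port limits.

```python
from collections import deque


class CompletionBuffer:
    """FIFO of dispatched instructions that retires them in program order."""

    def __init__(self, capacity=16, retire_width=4):
        self.capacity = capacity          # 16 entries, per the dispatch slide
        self.retire_width = retire_width  # four entries examined per cycle
        self.entries = deque()

    def dispatch(self, tag):
        """Allocate an entry at dispatch; returns False if the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"tag": tag, "finished": False})
        return True

    def finish(self, tag):
        """Record that an instruction has finished execution (any order)."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["finished"] = True
                return
        raise KeyError(tag)

    def complete_cycle(self):
        """Retire finished instructions from the head, oldest first."""
        retired = []
        while (self.entries
               and len(retired) < self.retire_width
               and self.entries[0]["finished"]):
            retired.append(self.entries.popleft()["tag"])
        return retired


if __name__ == "__main__":
    cb = CompletionBuffer()
    for tag in range(6):
        cb.dispatch(tag)
    cb.finish(1)
    cb.finish(2)
    print(cb.complete_cycle())  # nothing retires: instruction 0 is not done
    cb.finish(0)
    print(cb.complete_cycle())  # 0, 1 and 2 retire together, in order
```

Because nothing younger than an unfinished instruction can retire, the architectural state at any exception corresponds to a single point in program order, which is what makes the exception model precise.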

Write-Back Stage
• Writes back results from the rename buffers that were not written back by the complete stage.
• Each rename buffer has two read ports for write-back, corresponding to the two write-back ports provided for the GPRs, FPRs, and CR.
• Two results can be copied from the write-back buffers to the registers per clock cycle.