ALPHA 21264 Introduction: I-Stream


ALPHA 21264 Introduction: I-Stream
Dharmesh Parikh

System Overview
- RISC instruction set
- 64-bit processor
- 15 million transistors
- First Alpha with out-of-order execution
- Speculative execution

Alpha 21264 Key Features
- Four-wide instruction fetch
- Line and branch predictor
- Tournament prediction using both local and global history
- Dynamic JSR/JMP prediction
- Out-of-order execution pipelines
  - Quad-speculative-issue integer pipeline
  - Dual-speculative-issue floating-point pipeline: one ADD, one MULTIPLY; also DIV (6 bits/cycle) and SQRT (2 bits/cycle)
- Memory system
  - Two out-of-order memory references per cycle
  - Up to 16 outstanding off-chip memory references (8 fills and 8 victims)

Alpha 21264 Key Features (2)
- 80 in-flight instructions
- Registers: 80 integer, 72 floating point
- Queue entries: 20 integer, 15 floating point
- 128-entry fully associative DTB
- 128-entry fully associative ITB
- 64 KB L1 on-chip instruction cache: virtual index/tag, 2-way set predict
- 64 KB L1 on-chip data cache: virtual index/physical tag, 2-way set associative
- Cache-hit prediction logic enhances speculative issue

Differences from 21164
- Out-of-order issue
- Smaller pipeline
- Increased memory bandwidth
- Memory references can be accessed in parallel to caches
- One pipeline for both floating point and integer operations
- 4x the bandwidth

Previous instruction types
- Branch
- Floating point
- Memory
- Memory/Function code
- Memory/branch
- Operate
- PALcode

New instructions
- Floating point
- Cache prefetching
- MVI (motion video instructions)

Alpha 21264 Instruction Fetch Bandwidth Enablers
- The 64 KB two-way associative instruction cache supplies four instructions every cycle.
- The next-fetch and set predictors provide the fast access of a direct-mapped cache and eliminate bubbles in non-sequential control flows.
- The instruction fetcher speculates through up to 20 branch predictions to supply a continuous stream of instructions.
- The tournament branch predictor dynamically selects between local and global history to minimize mispredicts.

Alpha 21264: 7-Stage Pipeline [1]
[Pipeline block diagram. Stage labels: FETCH, MAP, QUEUE, REG, EXEC, DCACHE (stages 0 to 6). Block labels: branch predictors; next-line address; L1 instruction cache (64 KB, 2-set), 4 instructions/cycle; integer register map; integer issue queue (20); two integer register files (80) with Exec and Addr units; FP register map; FP issue queue (15); FP register file (72) with FP ADD, Div/Sqrt, and FP MUL; L1 data cache (64 KB, 2-set); victim buffer; miss address; bus interface unit with system bus (64-bit), cache bus (128-bit), and physical address (44-bit). Note on diagram: 80 in-flight instructions plus 32 loads and 32 stores.]

Alpha 21264 Instruction Stream Improvements
[Block diagram. Callouts: learns dynamic jumps; 0-cycle penalty on branches; set-associative behavior. Block labels: PC GEN; instruction decode, branch prediction; PC; Tag S0 and Tag S1 with comparators; Icache data; next-fetch predictor; set predictor; hit/miss/set miss; next LP address + set; instructions (4).]


Instruction Fetch, Issue and Retire Unit
- Virtual program counter logic
- Branch predictor
- Instruction-stream translation buffer (ITB)
- Instruction fetch logic
- Register rename maps
- Integer and floating-point issue queues
- Exception and interrupt logic
- Retire logic

Virtual Program Counter Logic
- The virtual program counter (VPC) logic maintains the virtual addresses for instructions that are in flight.
- There can be up to 80 instructions, in 20 successive fetch slots, in flight between the register rename mappers and the end of the pipeline.
- The VPC logic contains a 20-entry table to store these fetched VPC addresses.

Alpha 21264 Tournament Branch Prediction
[Block diagram. Inputs: program counter, path history. Output: prediction.]
- Local history (1024 x 10)
- Local prediction (1024 x 3)
- Global prediction (4096 x 2)
- Choice prediction (4096 x 2)
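To make the table sizes on this slide concrete, here is a minimal Python sketch of a tournament predictor. Only the four table dimensions come from the slide. The PC indexing, the taken thresholds, the 12-bit path history, and the update policy (train the chooser only when the two components disagree) are assumptions chosen for illustration and are not claimed to match the 21264 hardware.

```python
def _bump(counter, up, max_value):
    """Saturating increment (up) or decrement (down)."""
    return min(counter + 1, max_value) if up else max(counter - 1, 0)


class TournamentPredictor:
    """Toy model: table sizes follow the slide; everything else is assumed."""

    def __init__(self):
        self.local_history = [0] * 1024    # 1024 entries x 10 bits of per-branch history
        self.local_counters = [0] * 1024   # 1024 entries x 3-bit saturating counters
        self.global_counters = [0] * 4096  # 4096 entries x 2-bit saturating counters
        self.choice_counters = [0] * 4096  # 4096 entries x 2-bit; upper half means "use global"
        self.path_history = 0              # assumed: last 12 outcomes index the 4096-entry tables

    def _components(self, pc):
        index = (pc >> 2) % 1024           # assumed: PC bits above the 4-byte instruction offset
        local_taken = self.local_counters[self.local_history[index]] >= 4
        global_taken = self.global_counters[self.path_history] >= 2
        return index, local_taken, global_taken

    def predict(self, pc):
        _, local_taken, global_taken = self._components(pc)
        use_global = self.choice_counters[self.path_history] >= 2
        return global_taken if use_global else local_taken

    def update(self, pc, taken):
        taken = int(taken)
        index, local_taken, global_taken = self._components(pc)
        hist = self.local_history[index]
        path = self.path_history
        if local_taken != global_taken:    # chooser learns only from disagreements
            self.choice_counters[path] = _bump(
                self.choice_counters[path], global_taken == bool(taken), 3)
        self.local_counters[hist] = _bump(self.local_counters[hist], taken, 7)
        self.global_counters[path] = _bump(self.global_counters[path], taken, 3)
        self.local_history[index] = ((hist << 1) | taken) & 0x3FF
        self.path_history = ((path << 1) | taken) & 0xFFF
```

Calling `update(pc, taken)` after each resolved branch and `predict(pc)` at fetch is enough to watch the local component learn a short repeating pattern that a single 2-bit counter would miss.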

Branch Prediction: Motivation
- A highly pipelined design introduces additional misprediction penalty cycles.
- Instructions typically spend 4-5 cycles in the instruction queue, which adds more delay.
- Minimum penalty: 7 cycles
- Average penalty: > 11 cycles
- The 2-cycle instruction cache adds another cycle to the branch misprediction penalty. That sounds like a lot, but it is only 10%.

Local Predictor

Global Predictor

Choice Predictor

Other Branch Prediction Stuff
- A 32-entry return-address stack predicts the targets of subroutine returns.
- Situated near the instruction cache and accessed in parallel.
- 35K bits of storage for branch history information = 2% of die area.
- 48K bits of target address information are stored in the cache.
- 7-10 mispredictions per 1000 instructions on SPECint95, at 11 cycles per misprediction, is roughly a 0.1 CPI loss.
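The CPI estimate on this slide is simple arithmetic, worked through below as a short Python check. The function name `cpi_loss` is made up for this example; the inputs (7 to 10 mispredictions per 1000 instructions, 11 cycles each) are the slide's figures.

```python
def cpi_loss(mispredicts_per_1000, penalty_cycles):
    """Cycles lost per instruction to branch mispredictions."""
    return mispredicts_per_1000 / 1000 * penalty_cycles


low = cpi_loss(7, 11)    # about 0.077 cycles per instruction
high = cpi_loss(10, 11)  # about 0.11 cycles per instruction
print(f"CPI loss between {low:.3f} and {high:.3f}")
```

The range brackets the slide's rounded figure of 0.1 CPI.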

Branch Prediction Problems
- Instructions must always be grouped in fours, with the predicted-taken branch in slot four and branch targets in slot one, in order to issue four instructions on a branch.
- Context switching: every time the CPU does a context switch, the branch prediction tables are reset.

Instruction-Stream Translation Buffer (ITB)
- 128-entry, fully associative ITB.
- Stores recently used instruction-stream (Istream) address translations and page protection information.
- The ITB is accessed only for Istream references that miss in the Icache.

Instruction Fetch Logic
- Up to four aligned instructions are fetched from the Icache, in program order.
- The branch prediction tables are also accessed in this cycle (the instruction fetch cycle, or stage 0).
- The branch predictor uses tables and a branch history algorithm to predict a branch instruction target address for one branch or memory-format instruction.
- Branch prediction and line prediction bits accompany the four instructions.

I-Cache
- Large 64 KB, two-way set-associative instruction cache.
- Each fetch block of four instructions includes a line and set prediction.
- The prediction indicates where to fetch the next block of instructions from, including which set should be used.
- This removes the bubble when the branch predictor predicts a branch as taken.

I-Cache (continued)
- On cache fills, the line predictor value at each fetch line is initialized with the index of the next sequential fetch line, and later retrained by the branch predictor if necessary.
- The line predictor does not train on every mispredict.
- The mispredict cost is typically a single-cycle bubble.
- The line predictor is also trained for jumps that use direct register addressing.

Instruction Fetch Logic (continued)
- In the slot stage (stage 1), the branch predictor compares the next Icache index that it generates to the index that was generated by the line predictor.
- If there is a mismatch, the branch predictor wins, at the cost of one bubble.
- The line predictor takes precedence over the branch predictor during memory-format calls or jumps.

Register Rename Maps
- In addition to the 64 architectural registers, up to 41 additional integer registers and 41 additional floating-point registers are available for renaming.
- Renaming eliminates write-after-read (WAR) and write-after-write (WAW) dependences while preserving true read-after-write (RAW) dependences.
- Provides a means of speculatively executing instructions.
- Each instruction is assigned a unique 8-bit number, called an inum, which is used to identify the instruction and its program order.

Register Rename Maps (continued)
- The map logic can process four instructions per cycle.
- A physical register is not returned to the free list until the instruction using it has been retired.
- If a branch mispredict or exception occurs, the map logic backs up the contents of the register rename maps to the state associated with the instruction that triggered the condition, and the prefetcher restarts at the appropriate VPC.
- The map logic can back up the contents of the maps to the state associated with any of the 80 instructions in flight in a single cycle.
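The renaming and back-up behavior described on these two slides can be illustrated with a small Python sketch. It is a software model under stated assumptions, not the 21264's CAM-based mapper: the class and method names are hypothetical, the inum is an unbounded counter rather than an 8-bit value, recovery walks an undo log instead of restoring a saved map in one cycle, and the previous mapping of a destination is freed when the overwriting instruction retires (a common scheme that the slides do not spell out).

```python
from collections import deque


class RenameMap:
    """Toy register renamer with a free list and undo-based recovery."""

    def __init__(self, num_arch=32, num_phys=80):
        self.table = list(range(num_arch))             # architectural -> physical
        self.free = deque(range(num_arch, num_phys))   # unallocated physical registers
        self.in_flight = deque()                       # (inum, dest, old_phys, new_phys), oldest first
        self.next_inum = 0                             # assumed unbounded; the hardware inum is 8 bits

    def rename(self, srcs, dest):
        # Read sources before remapping dest so true RAW dependences are preserved.
        phys_srcs = [self.table[s] for s in srcs]
        new_phys = self.free.popleft()                 # IndexError if empty; real hardware would stall
        old_phys = self.table[dest]
        self.table[dest] = new_phys                    # fresh register removes WAR and WAW hazards
        inum = self.next_inum
        self.next_inum += 1
        self.in_flight.append((inum, dest, old_phys, new_phys))
        return inum, phys_srcs, new_phys

    def retire_oldest(self):
        # Only at retirement does the superseded physical register become free.
        inum, _dest, old_phys, _new_phys = self.in_flight.popleft()
        self.free.append(old_phys)
        return inum

    def squash_younger_than(self, inum):
        # Back the map up to the state right after instruction `inum` was renamed.
        while self.in_flight and self.in_flight[-1][0] > inum:
            _, dest, old_phys, new_phys = self.in_flight.pop()
            self.table[dest] = old_phys
            self.free.appendleft(new_phys)
```

Renaming two back-to-back writes to the same architectural register gives each its own physical register, which is how the WAW hazard disappears; calling `squash_younger_than` with the inum of a mispredicted branch restores the mapping that was in effect right after that branch.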

Alpha 21264 Mapper and Queue Stages
[Block diagram. Labels: MAPPER; ISSUE QUEUE; saved map state; MAP CAMs; 80 entries; 4 instructions/cycle; register numbers; renamed registers; register scoreboard; QUEUE, 20 entries; arbiter; request; grant; instructions; issued instructions.]

Integer Issue Queue (IQ)
- The 20-entry integer issue queue, associated with the integer execution units, issues instructions at a maximum rate of four per cycle.
- Each queue entry asserts four request signals, one for each sub-cluster (U0, U1, L0, and L1).
- There are two arbiters: one for the upper sub-clusters and one for the lower sub-clusters.
- Some instructions, such as loads and stores, can go only to the lower sub-clusters, and shift instructions can go only to the upper sub-clusters.

Integer Issue Queue (continued)
- Other instructions, such as addition and logic operations, can execute in any sub-cluster and are statically assigned before being placed in the IQ.
- The IQ chooses between simultaneous requesters of a sub-cluster based on the age of the request: older requests are given priority over newer requests.
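The age-based selection described for the integer issue queue can be sketched in a few lines of Python. The names `Entry` and `arbitrate` are hypothetical. The sketch assumes each of the two arbiters grants at most two requests per cycle (consistent with the four-per-cycle maximum above), uses a smaller inum to mean "older", and ignores inum wraparound and the choice of a specific sub-cluster within the upper or lower pair.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    inum: int      # program-order tag; smaller is assumed to be older
    cluster: str   # "upper" or "lower", statically assigned before queueing
    ready: bool    # True when the entry is requesting issue


def arbitrate(queue, grants_per_arbiter=2):
    """Return the entries granted this cycle, oldest first within each arbiter."""
    issued = []
    for cluster in ("upper", "lower"):                 # one arbiter per sub-cluster pair
        requesters = [e for e in queue if e.cluster == cluster and e.ready]
        requesters.sort(key=lambda e: e.inum)          # older requests win
        issued.extend(requesters[:grants_per_arbiter])
    return issued
```

Entries that are not ready never request, so an old but stalled instruction does not block younger ready ones, which is the point of out-of-order issue.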

Floating-Point Issue Queue (FQ)
- The 15-entry FQ is associated with the floating-point execution units.
- Each queue entry has three request lines: one for the add pipeline, one for the multiply pipeline, and one for the two store pipelines.
- There are three arbiters, one for each request line.
- The add and multiply arbiters each pick one requester per cycle, while the store pipeline arbiter picks two requesters per cycle.
- The FQ arbiters pick between simultaneous requesters of a pipeline based on the age of the requests.

Alpha 21264 Register and Execute Stages
[Block diagram. Floating-point execution units: Reg, FP Mul, FP Add, FP Div, SQRT. Integer execution units: Reg, Reg, m. video, mul, shift/br, shift/br, add/logic, add/logic, add/logic/memory, add/logic/memory.]

Exception and Interrupt Logic
- There are two types of exception: faults and synchronous traps. Arithmetic exceptions are precise and are reported as synchronous traps.
- There are four types of interrupt: two are hardware interrupts, one is a software interrupt, and the last is the asynchronous system trap.

Retire Logic
- The instruction box (Ibox) fetches instructions in program order, executes them out of order, and then retires them in order.
- The Ibox retire logic maintains the architectural state of the machine by retiring an instruction only if all previous instructions have executed without generating exceptions or branch mispredictions.
- Retiring an instruction commits the machine to any changes the instruction may have made to the software-visible state.

Retire Logic (continued)
- The three software-visible states are:
  - Integer and floating-point registers
  - Memory
  - Internal processor registers
- The retire logic can sustain a maximum retire rate of 8 instructions per cycle, and can retire as many as 11 instructions in a single cycle.
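The in-order retirement rule above can be modeled with a short Python sketch. The function name `retire_cycle` and the dictionary fields `inum`, `done`, and `fault` are hypothetical names for this example; the limit of 11 per cycle is the slide's peak figure. The sketch only shows the stopping rule and leaves out what the hardware does next on a fault or mispredict.

```python
def retire_cycle(in_flight, max_retire=11):
    """Retire instructions from the head of `in_flight` (oldest first).

    Stops at the first instruction that has not finished executing or that
    raised an exception, so younger completed instructions wait behind it.
    Returns the inums retired this cycle and removes them from `in_flight`.
    """
    retired = []
    while in_flight and len(retired) < max_retire:
        head = in_flight[0]
        if not head["done"] or head["fault"]:
            break                        # everything younger must wait
        retired.append(in_flight.pop(0)["inum"])
    return retired
```

An instruction that completed early still waits in the list until every older instruction has retired, which is what keeps the software-visible state precise.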