Instruction Fetch Mechanism
  Two features:
    –Line and way prediction
    –Branch prediction
  The line-way predictor predicts the line and way of the I-cache that will be accessed in the next cycle.
  Line-way prediction takes the branch predictor out of the critical fetch loop.
  On cache fills, the line predictor value at each line points to the next sequential fetch line.
  The line predictor is later trained by the branch predictor.
  In effect, the line-way predictor behaves like a very fast BTB.
  The line prediction is verified in stage 1 (instruction slot).
  If the line-way prediction is incorrect, the slot stage is flushed and the PC generated from the branch predictor information is used to redirect fetch.
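The fill, predict, and verify steps above can be sketched as a toy model. This is a minimal illustration, not the 21264 circuit: the class name `LinePredictor`, the table size, the choice of way 0 for the sequential default, and the dictionary storage are all assumptions made for the example.

```python
class LinePredictor:
    """Toy next-line predictor: one (line, way) pointer per I-cache entry.

    Hypothetical model; sizes and the way-0 default are assumptions.
    """

    def __init__(self, num_lines: int):
        self.num_lines = num_lines
        self.next_ptr = {}  # (line, way) -> predicted next (line, way)

    def fill(self, line: int, way: int) -> None:
        # On a cache fill, point at the next sequential fetch line.
        self.next_ptr[(line, way)] = ((line + 1) % self.num_lines, 0)

    def predict(self, line: int, way: int):
        # Fast path: read the stored pointer, no branch predictor involved.
        return self.next_ptr[(line, way)]

    def verify(self, line: int, way: int, actual) -> bool:
        """Check the prediction against the branch predictor's answer.

        Returns True when the prediction was right. On a mismatch the
        entry is retrained, which models the flush-and-redirect case.
        """
        correct = self.next_ptr[(line, way)] == actual
        if not correct:
            self.next_ptr[(line, way)] = actual
        return correct


lp = LinePredictor(num_lines=512)
lp.fill(10, 0)
print(lp.predict(10, 0))          # sequential default after a fill
print(lp.verify(10, 0, (40, 1)))  # taken branch: mispredict, retrain
print(lp.predict(10, 0))          # now follows the branch target
```

The point of the structure is that `predict` is a single table read, while the slower check in `verify` happens one stage later.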
Fetch - Line and Way Prediction
  Source: IEEE Micro, March-April 1999
Branch Prediction Mechanism
  Hybrid branch predictor
  Global predictor
    –Good for inter-correlated branches
    –Indexed by the global path history register (T/NT status of the last 12 branches)
    –4K-entry table of 2-bit counters
  Local predictor
    –Good for self-correlated branches
    –10 bits of the PC index a per-address local history table, whose entry in turn indexes a 1K-entry table of 3-bit counters
    –Aliasing among branches is a problem
  Choice predictor
    –Decides which predictor to use
    –Indexed by the global path history register
    –4K-entry table of 2-bit counters
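A minimal sketch of how the three tables could interact, using the table sizes from the slide. Several details are assumptions made for illustration and are not taken from the slide: which PC bits form the local index, the initial counter values, and the rule that the chooser is trained only when the two components disagree. The names `TournamentPredictor` and `_sat` are hypothetical.

```python
def _sat(value: int, up: bool, top: int) -> int:
    """Saturating counter update in the range [0, top]."""
    return min(value + 1, top) if up else max(value - 1, 0)


class TournamentPredictor:
    def __init__(self):
        self.ghist = 0                  # 12-bit global history
        self.global_ctr = [1] * 4096    # 2-bit counters
        self.choice_ctr = [1] * 4096    # 2-bit; >= 2 means "trust global"
        self.local_hist = [0] * 1024    # 10-bit per-address histories
        self.local_ctr = [3] * 1024     # 3-bit counters

    def _components(self, pc: int):
        li = (pc >> 2) & 0x3FF          # assumed choice of PC bits
        lh = self.local_hist[li]
        g = self.global_ctr[self.ghist] >= 2
        l = self.local_ctr[lh] >= 4
        return li, lh, g, l

    def predict(self, pc: int) -> bool:
        _, _, g, l = self._components(pc)
        return g if self.choice_ctr[self.ghist] >= 2 else l

    def update(self, pc: int, taken: bool) -> None:
        li, lh, g, l = self._components(pc)
        gi = self.ghist
        if g != l:
            # Assumed policy: move the chooser toward whichever was right.
            self.choice_ctr[gi] = _sat(self.choice_ctr[gi], g == taken, 3)
        self.global_ctr[gi] = _sat(self.global_ctr[gi], taken, 3)
        self.local_ctr[lh] = _sat(self.local_ctr[lh], taken, 7)
        self.local_hist[li] = ((lh << 1) | int(taken)) & 0x3FF
        self.ghist = ((gi << 1) | int(taken)) & 0xFFF


p = TournamentPredictor()
for i in range(200):                    # alternating taken / not-taken
    p.update(0x2000, i % 2 == 0)
print(p.predict(0x2000))                # next outcome in the pattern is taken
```

An alternating branch is a self-correlated pattern: its own 10-bit history is enough to predict the next outcome, which is the case the local predictor targets.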
Branch Prediction Mechanism
  Minimum branch penalty: 7 cycles
  Typical branch penalty: 11+ cycles (issue queue delay)
  48K bits of target addresses stored in the I-cache
  32-entry return address stack
  Predictor tables are reset on a context switch
  Source: Microprocessor Report, 10/28/96
Instruction Slotting
  Check the line predictor's prediction
    –The branch predictor compares the next cache index it generates with the one generated by the line predictor
  Determine which subclusters integer instructions will go to
    –Some subclusters are specialized (resource constraints)
    –Perform load balancing across subclusters
Register Renaming
  31 integer and 31 FP architectural registers
  41 integer and 41 FP extra physical registers
  Uses a merged rename and architectural register file, one for integer and one for FP
    –The same physical register holds the result of an instruction before and after commit
    –No separate architectural register file (no data copying on commit)
  The register map table stores the current mappings of architectural registers
  A map silo contains the old mappings of up to 20 previous decode cycles (used in case of misprediction)
Register Renaming Logic
  On decoding an instruction:
    –Search the map CAMs for the source registers
    –Find the physical registers that currently hold the values of the architectural source registers
    –Access the free physical register list
    –Map the free physical register found to the architectural destination register
  Source: Presentation by R. Kessler, August 1998.
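The decode-time steps can be sketched in a few lines. This is an illustrative model only: a dictionary stands in for the map CAMs, and the class name `Renamer`, the initial identity mapping, and the stall-by-exception behavior are assumptions. The register counts follow the earlier slide (31 architectural plus 41 extra).

```python
from collections import deque


class Renamer:
    def __init__(self, num_arch: int = 31, num_extra: int = 41):
        # Assumed reset state: architectural register i lives in physical i.
        self.map_table = {a: a for a in range(num_arch)}
        self.free_list = deque(range(num_arch, num_arch + num_extra))

    def rename(self, dest: int, srcs: list):
        """Rename one instruction.

        Returns (physical sources, new physical dest, old physical dest).
        Sources are looked up before the destination is remapped, so an
        instruction such as r1 = r1 + 1 reads the previous value of r1.
        """
        phys_srcs = [self.map_table[s] for s in srcs]
        if not self.free_list:
            raise RuntimeError("no free physical register: mapper must stall")
        new_phys = self.free_list.popleft()
        old_phys = self.map_table[dest]
        self.map_table[dest] = new_phys
        return phys_srcs, new_phys, old_phys


r = Renamer()
print(r.rename(dest=1, srcs=[2, 3]))   # r1 gets a fresh physical register
print(r.rename(dest=4, srcs=[1, 1]))   # later reader of r1 sees the new mapping
```

The old physical destination is returned because it is the register that can be freed once this instruction commits.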
Register Renaming Logic
  On completing an instruction:
    –Write the result into the physical destination register
    –Mark the physical destination register as valid in the register scoreboard
    –Broadcast the result to the issue queue entries
    –The physical destination register number is broadcast as the tag
  On committing an instruction:
    –Mark the physical destination register as committed
    –Free the physical register that corresponds to the old mapping of the same architectural register
  On a misprediction/exception:
    –Roll back the map state to what it was when the exception-causing instruction was renamed
    –To make this possible, instructions must be associated with map entries
    –This is done using inums: each instruction is given a unique 8-bit identifier during register mapping
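A minimal sketch of commit and rollback driven by inums. It is a simplified stand-in for the map silo, not the actual hardware organization: the class `RenameState`, the per-instruction log, and the choice that `rollback` undoes the named instruction together with everything younger are assumptions. Wraparound of the 8-bit inum is sidestepped by keeping the log in program order.

```python
from collections import deque


class RenameState:
    def __init__(self, num_arch: int = 31, num_extra: int = 41):
        self.map_table = {a: a for a in range(num_arch)}
        self.free_list = deque(range(num_arch, num_arch + num_extra))
        # Log of in-flight instructions, oldest first:
        # (inum, arch dest, old physical reg, new physical reg)
        self.in_flight = deque()
        self.next_inum = 0

    def rename(self, dest: int) -> int:
        inum = self.next_inum
        self.next_inum = (inum + 1) & 0xFF      # 8-bit identifier
        new_phys = self.free_list.popleft()
        self.in_flight.append((inum, dest, self.map_table[dest], new_phys))
        self.map_table[dest] = new_phys
        return inum

    def commit_oldest(self) -> None:
        # The previous mapping of the same architectural register is dead.
        _, _, old_phys, _ = self.in_flight.popleft()
        self.free_list.append(old_phys)

    def rollback(self, inum: int) -> None:
        """Undo instruction `inum` and everything younger (assumed policy)."""
        if all(entry[0] != inum for entry in self.in_flight):
            raise ValueError("inum is not in flight")
        while True:
            entry_inum, dest, old_phys, new_phys = self.in_flight.pop()
            self.map_table[dest] = old_phys      # restore the old mapping
            self.free_list.appendleft(new_phys)  # squashed register is free
            if entry_inum == inum:
                return


s = RenameState()
first = s.rename(1)
bad = s.rename(1)      # suppose this instruction takes an exception
s.rename(2)
s.rollback(bad)
print(s.map_table[1], s.map_table[2])   # mappings as of just before `bad`
```

Because the result already sits in its final physical register, `commit_oldest` moves no data; it only releases the superseded register.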
Physical Register States
  4 states
  Initially, n architectural registers are in the AR state; the rest are Available
  When an instruction with a destination register is issued, one of the available registers is allocated as a rename buffer (RB)
  When the instruction finishes execution, the state is set to Valid
  On instruction commit, the state is set to AR and the old AR mapping is reclaimed
  Source: Sima, D. The Design Space of Register Renaming Techniques. IEEE Micro, September/October 2000.
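The life cycle described on this slide can be written as a small state machine. The state names follow the slide; the event names (`allocate`, `finish`, `commit`, `reclaim`) and the function `step` are hypothetical labels chosen for the example, and squash transitions are left out because the slide does not describe them.

```python
from enum import Enum


class RegState(Enum):
    AVAILABLE = "available"
    RENAME_BUFFER = "rename buffer, not yet valid"
    VALID = "rename buffer, valid"
    ARCHITECTURAL = "architectural register"


# Only the transitions described on the slide are modeled.
TRANSITIONS = {
    (RegState.AVAILABLE, "allocate"): RegState.RENAME_BUFFER,
    (RegState.RENAME_BUFFER, "finish"): RegState.VALID,
    (RegState.VALID, "commit"): RegState.ARCHITECTURAL,
    # A later writer of the same architectural register has committed.
    (RegState.ARCHITECTURAL, "reclaim"): RegState.AVAILABLE,
}


def step(state: RegState, event: str) -> RegState:
    """Apply one event; reject transitions the slide does not describe."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"illegal transition: {state.name} on {event!r}")
    return TRANSITIONS[key]


state = RegState.AVAILABLE
for event in ("allocate", "finish", "commit", "reclaim"):
    state = step(state, event)
    print(event, "->", state.name)
```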
Integer Issue Queues - Clustering
  20 entries, maximum 4 instructions issued per cycle
  Two arbiters pick the instructions that will issue (one for the upper subclusters, one for the lower subclusters)
  Each queue entry asserts a request to the arbiter when it contains an instruction that can be executed by the subcluster (that is, when its operand values are available within that subcluster)
    –4 request signals (U0, U1, L0, L1)
  Arbiters choose between simultaneous requesters of a subcluster based on the age of the request
    –Older instructions are given priority
    –Each arbiter picks 2 of the possible 20 requesters for service
  A given instruction can request only the upper or the lower subclusters (load balancing is based on the assignment done in stage 1)
  Subcluster assignment is static (stage 1)
  Cluster selection on issue is dynamic (stage 2)
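A simplified sketch of age-based selection with two arbiters. It models only the upper/lower split and the oldest-first rule; the dynamic choice between the two clusters (U0 versus U1, L0 versus L1) is omitted. The names `QueueEntry` and `select`, and the convention that a smaller `age` value means an older instruction, are assumptions.

```python
from dataclasses import dataclass


@dataclass
class QueueEntry:
    age: int      # smaller value = older instruction (assumed encoding)
    half: str     # "U" or "L", fixed at slotting time
    ready: bool   # True when the operands are available


def select(queue, per_arbiter: int = 2):
    """Each arbiter grants its oldest ready requesters, up to per_arbiter."""
    issued = {}
    for half in ("U", "L"):
        requesters = [e for e in queue if e.ready and e.half == half]
        issued[half] = sorted(requesters, key=lambda e: e.age)[:per_arbiter]
    return issued


queue = [
    QueueEntry(age=5, half="U", ready=True),
    QueueEntry(age=1, half="U", ready=True),
    QueueEntry(age=3, half="U", ready=True),
    QueueEntry(age=0, half="U", ready=False),  # oldest, but operands missing
    QueueEntry(age=2, half="L", ready=True),
]
picked = select(queue)
print([e.age for e in picked["U"]], [e.age for e in picked["L"]])
```

With two arbiters granting two instructions each, at most four instructions issue per cycle, matching the limit on the slide.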
Integer/FP Execution Pipes
  Integer cluster communication latency: 1 cycle
  Advantage of clustering:
    –Fewer read/write ports to the register file
    –The register file will not be a cycle time limiter
  FP issue queue:
    –15 entries
    –2 instructions/cycle
  Source: IEEE Micro, March-April 1999
Memory References
  Load Queue
    –Reorder buffer for loads
    –32 entries, in-order
    –Maintains the state of loads issued but not yet retired
  Store Queue
    –Reorder buffer for stores
    –32 entries, in-order
    –Maintains the state of stores issued but not yet written to the data cache
    –Holds the data associated with store instructions
    –Forwards data to younger matching loads
  Miss Address File
    –Holds the physical addresses associated with pending L1 cache misses (instruction or data)
    –Maximum of 8 misses to the off-chip memory system
Load/Store Ordering
  New memory references check their address and age against older references.
  For example, when a store issues:
    –The LDQ compares the store address to the addresses of younger loads (CAM search)
    –If the older store issues to the same memory address as a younger load that has already issued, the LDQ squashes the load and initiates recovery
  When a load is ready to issue:
    –The STQ compares the load address to the addresses of older stores
    –If a match is found:
        If the store data is available, the STQ forwards the data
        Else load issue is delayed until the store data becomes available
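Both checks can be sketched with plain lists standing in for the CAM searches. In this sketch a load looks only at stores that precede it in program order, and a store looks only at loads that follow it. The names `Store`, `load_lookup`, and `loads_to_squash`, the integer `age` (smaller means older), and the exact-address match that ignores access size are all simplifying assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Store:
    age: int              # smaller value = older in program order
    addr: int
    data: Optional[int]   # None until the store data is available


def load_lookup(load_age: int, load_addr: int, store_queue):
    """STQ check for a load that is ready to issue.

    Returns ("forward", data), ("wait", None), or ("cache", None).
    """
    older = [s for s in store_queue
             if s.age < load_age and s.addr == load_addr]
    if not older:
        return ("cache", None)
    latest = max(older, key=lambda s: s.age)   # most recent older store wins
    if latest.data is None:
        return ("wait", None)
    return ("forward", latest.data)


def loads_to_squash(store_age: int, store_addr: int, issued_loads):
    """LDQ check when a store issues: younger loads to the same address
    that already executed read stale data and must be squashed."""
    return [age for age, addr in issued_loads
            if age > store_age and addr == store_addr]


stq = [Store(1, 0x100, 7), Store(3, 0x100, 9), Store(5, 0x100, None)]
print(load_lookup(4, 0x100, stq))   # forwards from the store with age 3
print(load_lookup(6, 0x100, stq))   # must wait for the store with age 5
print(loads_to_squash(3, 0x100, [(2, 0x100), (4, 0x100), (6, 0x200)]))
```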
Load/Store Ordering
  When a load is ready to issue:
    –If an older store exists in the STQ with an unknown address:
        Predict that the ready load will not access the same memory location, unless this load was incorrectly ordered before (check the load wait table)
    –Exposes more ILP if the prediction is correct
    –In case of misprediction:
        Minimum 14-cycle penalty
        Initiate recovery: the load and all subsequent instructions are squashed and re-executed
        Mark the load in the load wait table so that it waits for all older stores to compute their addresses next time around
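The load wait table can be sketched as a PC-indexed array of single bits. The table size, the PC bits used for indexing, and the names `LoadWaitTable` and `may_issue_load` are assumptions made for the example; the slide does not give them.

```python
class LoadWaitTable:
    """One bit per entry: set when a load at this PC was once issued too
    early. Size and index function are assumptions for illustration."""

    def __init__(self, entries: int = 1024):
        self.bits = [False] * entries

    def _index(self, pc: int) -> int:
        return (pc >> 2) % len(self.bits)

    def must_wait(self, pc: int) -> bool:
        return self.bits[self._index(pc)]

    def record_violation(self, pc: int) -> None:
        self.bits[self._index(pc)] = True


def may_issue_load(pc: int, older_store_addrs_known: bool,
                   table: LoadWaitTable) -> bool:
    """Issue speculatively unless this load has misbehaved before."""
    return older_store_addrs_known or not table.must_wait(pc)


table = LoadWaitTable()
print(may_issue_load(0x400, False, table))  # first time: speculate
table.record_violation(0x400)               # the speculation was wrong
print(may_issue_load(0x400, False, table))  # next time: hold the load
print(may_issue_load(0x400, True, table))   # addresses resolved: go
```

The trade-off is a cheap one-bit memory per load against the 14-cycle recovery cost of repeating the same ordering violation.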
Load/Store Ordering Example
  Source: IEEE Micro, March-April 1999
Features of Memory System
  Data cache
    –64 KB, 2-way, virtually indexed, physically tagged (translation in parallel with access)
    –Write-back, read/write allocate
    –64-byte block size + ECC bits
    –Prevents synonyms by not allowing different virtual addresses that map to the same physical address to co-exist in the cache
    –Load hit/miss prediction to minimize load-use latency (data cache access is 3 cycles after the issue queue + 1 cycle to get the hit/miss signal to the issue queue)
  Victim buffer (victim address and data files)
    –Contains evicted L1 (data and instruction) and L2 cache lines
    –8 entries, serial access
  Off-chip L2 cache
    –Minimum data cache miss latency: 13 cycles
    –Up to 16 MB
    –Dedicated access to the L2 cache
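The data cache geometry explains why synonyms need attention at all. The short calculation below uses the cache parameters from the slide and assumes Alpha's 8 KB base page size, which the slide does not state; the variable names are chosen for the example.

```python
CACHE_BYTES = 64 * 1024
WAYS = 2
BLOCK_BYTES = 64
PAGE_BYTES = 8 * 1024   # assumption: Alpha base page size, not on the slide

sets = CACHE_BYTES // (WAYS * BLOCK_BYTES)
offset_bits = BLOCK_BYTES.bit_length() - 1       # bits selecting a byte in a block
index_bits = sets.bit_length() - 1               # bits selecting a set
page_offset_bits = PAGE_BYTES.bit_length() - 1   # bits untouched by translation

# Index bits above the page offset come from the virtual page number,
# so two virtual aliases of one physical block can land in different sets.
virtual_index_bits = max(0, offset_bits + index_bits - page_offset_bits)
alias_sets = 1 << virtual_index_bits

print(f"sets={sets}, offset bits={offset_bits}, index bits={index_bits}")
print(f"index bits taken from the virtual page number: {virtual_index_bits}")
print(f"sets one physical block could occupy: {alias_sets}")
```

Under these assumptions the cache is indexed with 15 address bits while only 13 are invariant under translation, so the two extra virtual bits are what make the synonym check necessary.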
Overall System Diagram
  Source: Microprocessor Report, 10/28/96
References
  R.E. Kessler. The Alpha 21264 Microprocessor. IEEE Micro. March/April 1999.
  D. Leibholz and R. Razdan. The Alpha 21264: A 500 MHz Out-of-order Execution Microprocessor. COMPCON97. 1997.
  Compaq Computer Corporation. Alpha 21264/EV6 Hardware Reference Manual.
  R. Kessler, E. McLellan, and D. Webb. The Alpha 21264 Microprocessor Architecture. International Conference on Computer Design. October 1998.
  B.A. Gieseke et al. A 600 MHz Superscalar RISC Microprocessor with Out-of-order Execution. International Solid State Circuits Conference. 1997.
  L. Gwennap. Digital 21264 Sets New Standard. Microprocessor Report. October 28, 1996.
  D. Sima. The Design Space of Register Renaming Techniques. IEEE Micro. September/October 2000.
  P.E. Gronowski et al. High Performance Microprocessor Design. IEEE Journal of Solid State Circuits. May 1998.