Presentation on theme: "Replicated Block Cache". Presentation transcript:

1 Replicated Block Cache [Figure: a block_id from the fill unit drives a decoder into an N = 2^n direct-mapped block cache holding copies 1-4 of each block's instructions (16 per block); b word lines feed the final collapse into the fetch buffer.] What about fragmentation?

2 Predict and Fetch Trace [Figure: a predict cycle and a fetch cycle around the block cache.] More efficient: the redundancy is in the trace table, not the block cache.

3 Next Trace Prediction [Figure: a hash function over the predicted branch path indexes the trace table; each entry holds w predicted block_ids (b_id0, b_id1, b_id2, b_id3), and the next trace_id is sent to the block cache.]
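
A minimal sketch of this lookup in Python; the table size, hash, and entry width below are assumptions for illustration, not values from the slide:

W = 4                        # predicted block_ids per trace-table entry (assumed)
TABLE_SIZE = 1024            # number of trace-table entries (assumed)
trace_table = [None] * TABLE_SIZE

def hash_branch_path(branch_history):
    # fold the taken/not-taken bits of the predicted branch path into a table index
    idx = 0
    for taken in branch_history:
        idx = ((idx << 1) | int(taken)) % TABLE_SIZE
    return idx

def predict_next_trace(branch_history):
    # returns the W block_ids to send to the block cache, or None if no trace is cached
    return trace_table[hash_branch_path(branch_history)]

def fill_trace(branch_history, block_ids):
    # the fill unit installs the constructed trace's block_ids after completion
    trace_table[hash_branch_path(branch_history)] = tuple(block_ids[:W])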

4 The Block-Based Trace Cache [Figure: overall organization, showing the history hash and trace table (producing the trace_id and branch block_ids), the block cache and I-cache, the final collapse and fetch buffer, the rename table, the execution core, and the fill unit, which updates the structures at completion using the pre-collapse history.]

5 Wide-Fetch I-cache vs. T-cache
Enhanced instruction cache, fetch: 1. multiple-branch prediction, 2. instruction cache fetch, 3. instruction alignment & collapsing; completion: 1. multiple-branch predictor update.
Proposed trace cache, fetch: 1. next trace prediction, 2. trace cache fetch; completion: 1. trace construction and fill.
Both feed the same execution core.

6 Trace Cache Trade-offs (fetch-time complexity vs. instruction storage redundancy)
Trace cache: Pros: moves complexity to the backend. Cons: inefficient (redundant) instruction storage.
Enhanced instruction cache: Pros: efficient instruction storage. Cons: complexity during fetch time.

7 As Machines Get Wider (… and Deeper) [Figure: pipeline stages Fetch, Decode, Rename, Dispatch, Execute, Retire.] Two responses: 1. Eliminate stages. 2. Relocate work to the backend.

8 Flow Path Model of Superscalars [Figure: instruction flow through the branch predictor, I-cache, FETCH, instruction buffer, and DECODE; register data flow through the EXECUTE stage (integer, floating-point, media, and memory units) and the reorder buffer (ROB) to COMMIT; memory data flow through the store queue and D-cache.]

9 CPU-Memory Bottleneck. Performance of high-speed computers is usually limited by memory performance, both bandwidth and latency. Main memory access time vs. processor cycle time: over a 100x difference. If a fraction m of instructions are loads and stores, there are on average 1+m memory references per instruction. Suppose m = 40% and IPC = 4 at 1 GHz: 1.4 references per instruction x 4 x 10^9 instructions/s x 4 bytes per reference = 22.4 GByte/sec.
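
The 22.4 GB/s figure checks out if each reference is a 4-byte word; a quick sketch of the arithmetic (the 4-byte reference size is an assumption implied by the result):

m = 0.40            # fraction of instructions that are loads/stores
ipc = 4             # instructions per cycle
freq_hz = 1e9       # 1 GHz
bytes_per_ref = 4   # assumed reference size

refs_per_inst = 1 + m   # every instruction is fetched, plus m data references
bandwidth = refs_per_inst * ipc * freq_hz * bytes_per_ref
print(bandwidth / 1e9, "GB/s")   # -> 22.4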

10 How to Incorporate Faster Memory. SRAM access time << main memory access time, and SRAM bandwidth >> main memory bandwidth, but SRAM is expensive and therefore smaller than main memory. Programs exhibit temporal locality: frequently used data can be held in the scratch pad, so the cost of the first and last memory accesses is amortized over multiple reuses. Programs must have a small working set (aka footprint). [Figure: CPU with register file and a scratch pad (SRAM) in front of main memory (DRAM).]

11 Caches: Automatic Management of Fast Storage [Figure: a single-level system (CPU, cache, main memory) and a three-level system: L1 of 16~32KB with 1~2 pclk latency, L2 of ~256KB with ~10 pclk latency, L3 of ~4MB with ~50 pclk latency, then main memory.]

12 Cache Memory Structures [Figure: three organizations, each storing tag and data.] Indexed memory: a k-bit index selects one of 2^k blocks through a decoder. Associative memory (CAM): no index; a key is matched against all entries, so the number of blocks is not limited by the index width. N-way set-associative memory: a k-bit index selects a set, giving 2^k x N blocks.

13 Direct Mapped Caches [Figure: the address is split into tag, index, and block offset; a decoder selects the block named by the index, the stored tag is compared against the address tag for a match, and a multiplexor selects the requested word.]
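
A minimal sketch of a direct-mapped lookup; the field widths below are illustrative assumptions, not values from the slide:

OFFSET_BITS = 5    # 32-byte blocks (assumed)
INDEX_BITS = 10    # 1024 blocks (assumed)

def split_address(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# one (valid, tag, data) entry per index; the index alone picks the candidate block
cache = [{"valid": False, "tag": None, "data": None} for _ in range(1 << INDEX_BITS)]

def lookup(addr):
    tag, index, offset = split_address(addr)
    line = cache[index]
    if line["valid"] and line["tag"] == tag:   # tag match => hit
        return line["data"][offset]
    return None                                # miss: go to the next level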

14 Cache Block Size. Each cache block (or cache line) has only one tag but can hold multiple "chunks" of data, which reduces tag storage overhead: in 32-bit addressing, a 1-MB direct-mapped cache has 12 bits of tag per block, so 4-byte cache blocks give 256K blocks and ~384KB of tag, while 128-byte cache blocks give 8K blocks and ~12KB of tag. The entire cache block is transferred to and from memory at once, which is good for spatial locality: if you access address i, you will probably want i+1 as well (a prefetching effect). Block size = 2^b; direct-mapped cache size = 2^(B+b). [Figure: the address is split into tag (MSB side), a B-bit block index, and a b-bit block offset (LSB side).]
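
A short check of the tag-overhead numbers (32-bit addresses, 1 MB cache, so the tag is 32 - 20 = 12 bits per block):

ADDR_BITS = 32
CACHE_BYTES = 1 << 20                 # 1 MB
TAG_BITS = ADDR_BITS - 20             # 12 tag bits per block

for block_bytes in (4, 128):
    blocks = CACHE_BYTES // block_bytes
    tag_kb = blocks * TAG_BITS / 8 / 1024
    print(block_bytes, "B blocks:", blocks, "blocks,", tag_kb, "KB of tag")
# 4 B blocks   -> 262144 blocks, 384.0 KB of tag
# 128 B blocks -> 8192 blocks, 12.0 KB of tag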

15 Large Blocks and Subblocking. Large cache blocks can take a long time to refill: refill the cache line critical word first, and restart the cache access before the refill is complete. Large cache blocks can also waste bus bandwidth if the block size is larger than the spatial locality: divide a block into subblocks and associate a separate valid bit with each subblock. [Figure: one tag shared by several subblocks, each with its own valid bit.]
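
A minimal sketch of subblock valid bits; the line geometry is an assumption:

SUBBLOCKS_PER_LINE = 4    # assumed

class SubblockedLine:
    def __init__(self):
        self.tag = None
        self.valid = [False] * SUBBLOCKS_PER_LINE   # one valid bit per subblock
        self.data = [None] * SUBBLOCKS_PER_LINE

    def hit(self, tag, subblock):
        # a hit needs a tag match and that this particular subblock has been filled
        return self.tag == tag and self.valid[subblock]

    def fill_subblock(self, tag, subblock, data):
        if self.tag != tag:                         # new line: clear all subblock valid bits
            self.tag = tag
            self.valid = [False] * SUBBLOCKS_PER_LINE
        self.valid[subblock] = True
        self.data[subblock] = data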

16 Fully Associative Cache [Figure: the address is split into tag and block offset only, with no index; the tag is associatively searched against every entry, and a multiplexor selects the word from the matching block.]

17 N-Way Set Associative Cache [Figure: the address is split into tag, index, and block offset; a decoder selects a set, the tag is associatively searched within the set, and a multiplexor selects the word.] Cache size = N x 2^(B+b).

18 N-Way Set Associative Cache [Figure: the same cache drawn as N ways (banks); each way has its own decoder and tag comparator, the index selects one set across all ways, and the matching way drives the multiplexor.] Cache size = N x 2^(B+b).
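
A minimal sketch of an N-way set-associative lookup; the parameters are assumptions chosen so the size comes out to N x 2^(B+b) = 4 x 2^13 = 32 KB:

N_WAYS = 4          # N (assumed)
OFFSET_BITS = 5     # b: 32-byte blocks (assumed)
INDEX_BITS = 8      # B: 256 sets (assumed)

sets = [[{"valid": False, "tag": None, "data": None} for _ in range(N_WAYS)]
        for _ in range(1 << INDEX_BITS)]

def lookup(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    for way in sets[index]:                 # the tag is compared in every way of the set
        if way["valid"] and way["tag"] == tag:
            return way["data"][offset]      # hit: the matching way supplies the word
    return None                             # miss in all ways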

19 Pseudo Associative Cache. Keep the simple direct-mapped cache structure; the 1st lookup is unchanged. If the 1st lookup misses, start a 2nd lookup using a hashed index (e.g. flip the MSB). If the 2nd lookup hits, swap the contents of the 1st and 2nd lookup locations; if the 2nd lookup also fails, go to the next level of the hierarchy. A cache hit on the 2nd lookup is slower, but still much faster than a miss. [Timeline: 1st lookup, 2nd lookup, memory lookup.]
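
A minimal sketch of the lookup order; the index width is an assumption, and the "hashed index" here is just the slide's example of flipping the MSB:

INDEX_BITS = 10    # assumed
cache = [{"valid": False, "tag": None, "data": None} for _ in range(1 << INDEX_BITS)]

def alternate_index(index):
    return index ^ (1 << (INDEX_BITS - 1))   # hashed index: flip the MSB

def lookup(tag, index):
    primary = cache[index]
    if primary["valid"] and primary["tag"] == tag:   # 1st lookup: ordinary direct-mapped hit
        return primary["data"]
    alt_idx = alternate_index(index)
    alt = cache[alt_idx]
    if alt["valid"] and alt["tag"] == tag:           # 2nd lookup hits (slower than the 1st)
        cache[index], cache[alt_idx] = alt, primary  # swap so the hot line is found first next time
        return alt["data"]
    return None                                      # both miss: go to the next level of the hierarchy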

20 Victim Cache: a small cache that backs up a direct-mapped cache. Lines evicted from the direct-mapped cache due to collisions are stored in the victim cache, which avoids ping-ponging when the working set contains a few addresses that collide. [Figure: the address from the CPU probes the regular direct-mapped cache and the small victim cache; each entry holds a tag, status bits, and a data block.]
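
A minimal sketch of the eviction/swap behavior; the sizes and the FIFO victim replacement are assumptions:

from collections import OrderedDict

INDEX_BITS = 8         # direct-mapped cache size (assumed)
VICTIM_ENTRIES = 4     # victim cache entries (assumed)

main = [None] * (1 << INDEX_BITS)    # each entry: (tag, data) or None
victim = OrderedDict()               # small, fully associative, keyed by (tag, index)

def access(tag, index, fill_data=None):
    line = main[index]
    if line and line[0] == tag:
        return line[1]                        # hit in the direct-mapped cache
    if (tag, index) in victim:                # hit in the victim cache: swap back
        data = victim.pop((tag, index))
        if line:
            victim[(line[0], index)] = line[1]
        main[index] = (tag, data)
        return data
    if line:                                  # miss everywhere: evicted line goes to the victim cache
        victim[(line[0], index)] = line[1]
        if len(victim) > VICTIM_ENTRIES:
            victim.popitem(last=False)        # drop the oldest victim
    main[index] = (tag, fill_data)            # refill from the next level
    return fill_data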

21 Reducing the Miss Penalty. Give priority to read misses over writes: a write can be completed by writing into a write buffer, but a read miss will stall the processor, so let the read overtake the writes. Sub-block replacement: fetch only part of a block. Early restart and critical word first. Non-blocking caches reduce stalls on cache misses by allowing new fetches to proceed out of order after a miss. Multilevel caches reduce the latency of a primary miss (see the inclusion property).

22 Principle Behind Hierarchical Storage. Each level memoizes values stored at lower levels, instead of paying the full latency of the "furthermost" level of storage on every access. Effective access time: T_i = h_i * t_i + (1 - h_i) * T_(i+1), where h_i is the hit ratio (the probability of finding the desired data memoized at level i) and t_i is the raw access time of memory at level i. Given a program with good locality of reference and a working set smaller than s_i, h_i -> 1 and T_i -> t_i. A balanced system achieves the best of both worlds: the performance of the higher-level storage and the capacity of the lower-level, low-cost storage.
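
A short worked example of the recurrence, using the latencies from the earlier cache-hierarchy slide; the hit ratios and the main-memory latency are illustrative assumptions:

levels = [
    # (raw access time t_i in processor clocks, hit ratio h_i)
    (2,  0.90),   # L1: 1~2 pclk
    (10, 0.95),   # L2: ~10 pclk
    (50, 0.99),   # L3: ~50 pclk
]
T = 200.0         # assumed main-memory access time in pclk

for t_i, h_i in reversed(levels):
    T = h_i * t_i + (1 - h_i) * T   # T_i = h_i * t_i + (1 - h_i) * T_(i+1)
print(round(T, 2), "pclk effective access time seen by the CPU")   # -> 3.01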

