CIS629 - Fall 2002 Caches 1 Caches °Why is caching needed? Technological development and Moore’s Law °Why are caches successful? Principle of locality.


1 CIS629 - Fall 2002 Caches 1 Caches
°Why is caching needed? Technological development and Moore’s Law
°Why are caches successful? Principle of locality
°Three basic models and how they work (in detail). This is review from CIS 314 and CIS 429.
°How they interact with the pipeline
°Performance analysis of caches: Average Memory Access Time (AMAT)
°Enhancements to caches

2 CIS629 - Fall 2002 Caches 2 The Five Classic Components of a Computer
°Control and Datapath (together, the Processor), Memory, Input, Output
°(Many of this topic’s slides were developed by David Patterson, UC Berkeley, for CS 252.)

3 CIS629 - Fall 2002 Caches 3 Technology Trends

          Capacity          Speed (latency)
Logic:    2x in 3 years     2x in 3 years
DRAM:     4x in 3 years     2x in 10 years
Disk:     4x in 3 years     2x in 10 years

DRAM
Year      Size       Cycle Time
1980      64 Kb      250 ns
1983      256 Kb     220 ns
1986      1 Mb       190 ns
1989      4 Mb       165 ns
1992      16 Mb      145 ns
1995      64 Mb      120 ns
Change:   1000:1!    2:1!

4 CIS629 - Fall 2002 Caches 4 Processor-DRAM Memory Gap (latency)
[Figure: Performance (log scale, 1 to 1000) versus Time (1980 to 2000), one curve for CPU and one for DRAM]
°µProc (“Moore’s Law”): 60%/yr. (2X/1.5 yr)
°DRAM: 9%/yr. (2X/10 yrs)
°Processor-Memory Performance Gap: grows 50% / year
°Cache needed to fill the performance gap
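The growth rates on this slide can be compounded to see how quickly the gap opens up. The following is a minimal sketch using only the two annual rates quoted above; the function names are made up for illustration, and the result shows that the "50% / year" figure is a rounding of roughly 47%.

```python
CPU_GROWTH = 0.60   # from the slide: processor performance grows 60% per year
DRAM_GROWTH = 0.09  # from the slide: DRAM latency improves 9% per year

def relative_performance(annual_growth, years):
    """Performance after `years`, normalized to 1.0 at year 0."""
    return (1.0 + annual_growth) ** years

def gap(years):
    """Ratio of CPU performance to DRAM performance after `years`."""
    return relative_performance(CPU_GROWTH, years) / relative_performance(DRAM_GROWTH, years)

# Year-over-year growth of the gap itself: 1.60 / 1.09 - 1, about 0.47.
yearly_gap_growth = gap(1) - 1.0

for year in (0, 5, 10, 20):
    print(f"after {year:2d} years: CPU/DRAM gap = {gap(year):8.1f}x")
```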

5 CIS629 - Fall 2002 Caches 5 The Goal: illusion of large, fast, cheap memory
°Fact: Large memories are slow, fast memories are small
°How do we create a memory that is large, cheap and fast (most of the time)?
-Hierarchy
-Parallelism

6 CIS629 - Fall 2002 Caches 6 An Expanded View of the Memory System
[Figure: Processor (Control, Datapath) followed by a chain of Memory levels forming the Memory Hierarchy]
°Speed: Fastest to Slowest
°Size: Smallest to Biggest
°Cost: Highest to Lowest

7 CIS629 - Fall 2002 Caches 7 Memory Hierarchy: How Does it Work?
°Temporal Locality (Locality in Time): => Keep most recently accessed data items closer to the processor
°Spatial Locality (Locality in Space): => Move blocks consisting of contiguous words to the upper levels
[Figure: Upper Level Memory (Blk X) exchanges data to and from the Processor; Lower Level Memory (Blk Y) sits below it]

8 CIS629 - Fall 2002 Caches 8 Memory Hierarchy: Terminology
°Hit: data appears in some block in the upper level
-Hit Rate: the fraction of memory accesses found in the upper level
-Hit Time: time to access the upper level = Time to determine hit/miss + Time to deliver the block to the processor
°Miss: data needs to be retrieved from a block in the lower level
-Miss Rate = 1 - (Hit Rate)
-Miss Time: Time to determine hit/miss + Time to replace a block in the upper level + Time to deliver the block to the processor
-Miss Penalty: extra time incurred for a miss = Time to replace a block in the upper level
°Hit Time << Miss Penalty and Miss Time
[Figure: Upper Level Memory (Blk X) exchanges data to and from the Processor; Lower Level Memory (Blk Y) sits below it]

9 CIS629 - Fall 2002 Caches 9 How is the hierarchy managed?
°Registers <-> Memory: by the compiler (programmer?)
°Cache <-> Memory: by the hardware
°Memory <-> Disks:
-by the hardware and operating system (virtual memory)
-by the programmer (files)

10 CIS629 - Fall 2002 Caches 10 Four Questions for Caches and Memory Hierarchy
°Q1: Where can a block be placed in the upper level? (Block placement)
°Q2: How is a block found if it is in the upper level? (Block identification)
°Q3: Which block should be replaced on a miss? (Block replacement)
°Q4: What happens on a write? (Write strategy)

11 CIS629 - Fall 2002 Caches 11 Q1: Where can a block be placed in the upper level?
°Block 12 placed in 8 block cache: fully associative, direct mapped, 2-way set associative
°S.A. Mapping = Block Number Modulo Number Sets
-Fully associative: block 12 can go anywhere
-Direct mapped: block 12 can go only into block 4 (12 mod 8)
-Set associative: block 12 can go anywhere in set 0 (12 mod 4)
[Figure: three 8-block caches (block no. 0 to 7; the set associative one grouped into Set 0 to Set 3) above a memory with block-frame addresses 0 to 31]
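The three placement rules are the same modulo mapping with a different number of ways. A minimal sketch, assuming (as in the slide's figure) that the frames of a set are numbered contiguously; `candidate_frames` is a hypothetical helper name, not something from the slides:

```python
def candidate_frames(block_number, num_frames, ways):
    """Return the cache frames that a memory block is allowed to occupy.

    ways == 1           -> direct mapped
    ways == num_frames  -> fully associative
    anything in between -> set associative
    """
    num_sets = num_frames // ways
    set_index = block_number % num_sets   # Block Number Modulo Number Sets
    first_frame = set_index * ways        # assumes a set's frames are contiguous
    return list(range(first_frame, first_frame + ways))

print(candidate_frames(12, 8, ways=1))  # direct mapped
print(candidate_frames(12, 8, ways=2))  # 2-way set associative
print(candidate_frames(12, 8, ways=8))  # fully associative
```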

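12 CIS629 - Fall 2002 Caches 12 Example: 1 KB Direct Mapped Cache with 32 B Blocks
°For a 2 ** N byte cache:
-The uppermost (32 - N) bits are always the Cache Tag
-The lowest M bits are the Byte Select (Block Size = 2 ** M)
°Address fields (the Cache Tag and Cache Index together form the Block address):
-Bits 31 to 10: Cache Tag (Example: 0x50)
-Bits 9 to 5: Cache Index (Ex: 0x01)
-Bits 4 to 0: Byte Select (Ex: 0x00)
°Each cache entry holds a Valid Bit and a Cache Tag (stored as part of the cache “state”) plus 32 bytes of Cache Data
[Figure: cache array indexed 0 to 31; row 0 holds Byte 0 to Byte 31, row 1 holds Byte 32 to Byte 63, ..., row 31 holds Byte 992 to Byte 1023; row 1 is shown with tag 0x50]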
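The field split for this example can be checked with a few shifts and masks. This is a sketch for the slide's 1 KB, 32 B-block, direct-mapped configuration only; the constant and function names are assumptions chosen for illustration.

```python
CACHE_BYTES = 1024   # 2 ** N with N = 10
BLOCK_BYTES = 32     # 2 ** M with M = 5

OFFSET_BITS = BLOCK_BYTES.bit_length() - 1                  # M = 5
INDEX_BITS = (CACHE_BYTES // BLOCK_BYTES).bit_length() - 1  # N - M = 5

def split_address(addr):
    """Split a 32-bit address into (cache tag, cache index, byte select)."""
    byte_select = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_select

# Rebuild the slide's example address: tag 0x50, index 0x01, byte select 0x00.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
tag, index, byte_select = split_address(addr)
print(hex(addr), hex(tag), hex(index), hex(byte_select))
```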

13 CIS629 - Fall 2002 Caches 13 Example: Fully Associative
°Fully Associative Cache
-Forget about the Cache Index
-Compare the Cache Tags of all cache entries in parallel
-Example: with 32 B blocks, we need N 27-bit comparators
°Address fields:
-Bits 31 to 5: Cache Tag (27 bits long)
-Bits 4 to 0: Byte Select (Ex: 0x01)
[Figure: every cache entry (Valid Bit, Cache Tag, Cache Data Byte 0 to Byte 31, Byte 32 to Byte 63, ...) has its own comparator, marked X]

14 CIS629 - Fall 2002 Caches 14 Example: Set Associative Cache
°N-way set associative: N entries for each Cache Index
-N direct mapped caches operate in parallel
°Example: Two-way set associative cache
-Cache Index selects a “set” from the cache
-The two tags in the set are compared to the input in parallel
-Data is selected based on the tag result
[Figure: two banks of (Valid, Cache Tag, Cache Data); the address tag (Adr Tag) feeds one Compare per bank; the compare outputs drive Sel0/Sel1 of a Mux that picks the Cache Block, and are ORed to produce Hit]

15 CIS629 - Fall 2002 Caches 15 Q2: How is a block found if it is in the upper level?
°Direct indexing (using index and block offset), tag compares, or combination
°Increasing associativity shrinks index, expands tag
[Figure: address layout, from high bits to low bits: Block Address (Tag, then Index), then Block offset]
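The trade between index and tag bits can be tabulated directly. A small sketch, assuming 32-bit addresses and power-of-two sizes, and reusing the 1 KB cache with 32 B blocks from the earlier example slides; `field_widths` is a hypothetical helper.

```python
def field_widths(cache_bytes, block_bytes, ways, address_bits=32):
    """Return (tag bits, index bits, block offset bits) for a cache geometry.

    Assumes cache size, block size and associativity are powers of two.
    """
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = (block_bytes - 1).bit_length()
    index_bits = (num_sets - 1).bit_length()   # 0 when there is a single set
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

for ways in (1, 2, 4, 32):  # 32 ways = fully associative for 32 blocks
    tag, index, offset = field_widths(1024, 32, ways)
    print(f"{ways:2d}-way: tag={tag} index={index} offset={offset}")
```

Each doubling of associativity halves the number of sets, so one bit moves from the index to the tag; in the fully associative case the index disappears and the tag is 27 bits.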

16 CIS629 - Fall 2002 Caches 16 Q3: Which block should be replaced on a miss?
°Easy for Direct Mapped
°Set Associative or Fully Associative:
-Random
-LRU (Least Recently Used)

Miss rates by cache size, associativity and replacement policy:

Associativity:    2-way               4-way               8-way
Size              LRU      Random     LRU      Random     LRU      Random
16 KB             5.2%     5.7%       4.7%     5.3%       4.4%     5.0%
64 KB             1.9%     2.0%       1.5%     1.7%       1.4%     1.5%
256 KB            1.15%    1.17%      1.13%    1.13%      1.12%    1.12%
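The two policies differ only in how the victim is picked once a set is full. Below is a toy set-associative cache model, written as an illustrative sketch: the class name, its parameters and the short access trace are all assumptions, and it is not meant to reproduce the measured miss rates in the table.

```python
import random
from collections import OrderedDict

class SetAssociativeCache:
    """Toy cache that tracks only tags, with LRU or random replacement."""

    def __init__(self, num_sets, ways, policy="lru", seed=0):
        self.num_sets = num_sets
        self.ways = ways
        self.policy = policy
        self.rng = random.Random(seed)
        # One OrderedDict per set; keys are tags, least recently used first.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, block_number):
        """Access a block; return True on a hit, False on a miss."""
        index = block_number % self.num_sets
        tag = block_number // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            self.hits += 1
            lines.move_to_end(tag)           # mark as most recently used
            return True
        self.misses += 1
        if len(lines) >= self.ways:          # set is full: choose a victim
            if self.policy == "lru":
                lines.popitem(last=False)    # evict the least recently used tag
            else:
                del lines[self.rng.choice(list(lines))]
        lines[tag] = True
        return False

    def miss_rate(self):
        total = self.hits + self.misses
        return self.misses / total if total else 0.0

cache = SetAssociativeCache(num_sets=1, ways=2, policy="lru")
results = [cache.access(block) for block in [0, 1, 0, 2, 1]]
print(results, cache.miss_rate())
```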

17 CIS629 - Fall 2002 Caches 17 Q4: What happens on a write?
°Writes occur less frequently than reads. Under MIPS:
-7% of all memory traffic are writes
-25% of all data traffic are writes
°Thus, Amdahl’s Law implies that caches should be optimized for reads. However, we cannot ignore writes.
°Problems with writes:
-Must check tag BEFORE writing into the cache
-Only a portion of the cache block is modified
-Write stalls: CPU must wait until the write completes

18 CIS629 - Fall 2002 Caches 18 Q4: What happens on a write: Design Options
°Write through: The information is written to both the block in the cache and the block in the lower-level memory.
°Write back: The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
-Is the block clean or dirty?
°Pros and Cons of each?
-WT: read misses don’t cause writes; easier to implement; copy of data always exists
-WB: write at the speed of the cache; multiple writes to cache before write to memory; less memory BW consumed
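The bandwidth argument for write back can be made concrete by counting the writes that reach the lower level. A minimal sketch for a direct-mapped cache; the trace format, the function name, and the choice to allocate on write misses under both policies are assumptions made to keep the comparison simple.

```python
def memory_writes(trace, num_lines, policy):
    """Count writes that reach the lower level for a direct-mapped cache.

    trace:  list of (op, block) pairs, op being "R" or "W"
    policy: "write_through" or "write_back"
    Assumption: a write miss allocates the block under both policies.
    """
    lines = {}   # cache index -> (resident block, dirty flag)
    writes = 0
    for op, block in trace:
        index = block % num_lines
        resident = lines.get(index)
        if resident is None or resident[0] != block:
            # Miss: a dirty victim is written back before being replaced.
            if resident is not None and resident[1]:
                writes += 1
            resident = (block, False)
        if op == "W":
            if policy == "write_through":
                writes += 1                # every store also goes to memory
            else:
                resident = (block, True)   # mark dirty, defer the memory write
        lines[index] = resident
    # Flush blocks still dirty at the end so both policies leave memory current.
    writes += sum(1 for _, dirty in lines.values() if dirty)
    return writes

trace = [("W", 0)] * 4 + [("R", 4)]   # four stores to one block, then a conflict
print(memory_writes(trace, 4, "write_through"), memory_writes(trace, 4, "write_back"))
```

Four stores to the same block cost four memory writes under write through but a single write-back when the dirty block is replaced.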

19 CIS629 - Fall 2002 Caches 19 Write Buffer for Write Through
°A Write Buffer is needed between the Cache and Memory
-Processor: writes data into the cache and the write buffer
-Memory controller: writes contents of the buffer to memory
°Write buffer is just a FIFO:
-Typical number of entries: 4
-Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
°Memory system designer’s nightmare:
-Store frequency (w.r.t. time) > 1 / DRAM write cycle
-Write buffer saturation
[Figure: Processor -> Cache, with Processor -> Write Buffer -> DRAM alongside]

20 CIS629 - Fall 2002 Caches 20 Write Buffer Saturation
°Store frequency (w.r.t. time) > 1 / DRAM write cycle
°If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
-Store buffer will overflow no matter how big you make it
-The CPU Cycle Time <= DRAM Write Cycle Time
°Solutions for write buffer saturation:
-Use a write back cache
-Install a second level (L2) cache (does this always work?)
[Figure: Processor, Cache, Write Buffer, DRAM; second diagram adds an L2 Cache between the Write Buffer and DRAM]
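Saturation can be reproduced with a very small timing model. This is a sketch under simplifying assumptions that are not in the slides: the CPU attempts one store every `store_interval` cycles, DRAM completes one write every `dram_write_cycles` cycles, and a store holds its buffer entry until its DRAM write finishes. All names are hypothetical.

```python
from collections import deque

def simulate_write_buffer(num_stores, store_interval, dram_write_cycles, capacity=4):
    """Return the CPU stall cycles caused by a full FIFO write buffer."""
    buffer = deque()      # completion times of buffered writes, oldest first
    cycle = 0
    stalls = 0
    dram_free_at = 0      # cycle at which DRAM can start its next write
    for _ in range(num_stores):
        cycle += store_interval
        # Release every entry whose DRAM write has already completed.
        while buffer and buffer[0] <= cycle:
            buffer.popleft()
        if len(buffer) == capacity:
            # Buffer full: the CPU waits until the oldest write completes.
            stalls += buffer[0] - cycle
            cycle = buffer.popleft()
        start = max(cycle, dram_free_at)
        dram_free_at = start + dram_write_cycles
        buffer.append(dram_free_at)
    return stalls

print(simulate_write_buffer(100, store_interval=10, dram_write_cycles=5))
print(simulate_write_buffer(100, store_interval=1, dram_write_cycles=10))
print(simulate_write_buffer(1000, store_interval=1, dram_write_cycles=10, capacity=64))
```

When stores arrive faster than DRAM can retire them, a larger buffer only delays the first stall; it does not remove it.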

21 CIS629 - Fall 2002 Caches 21 Write Miss Design Options °Write allocate (“fetch on write”) - block is loaded on a write miss, followed by the write. °No-write allocate (“write around”) - block is modified in the lower level, not loaded into the cache.

22 CIS629 - Fall 2002 Caches 22 Write-miss Policy: Write Allocate versus Not Allocate
°Assume a 16-bit write to memory location 0x0 causes a miss
°Do we read in the block?
-Yes: Write Allocate
-No: Write Not Allocate
[Figure: the 1 KB direct mapped cache with 32 B blocks; the address has Cache Tag (bits 31 to 10, Example: 0x00), Cache Index (bits 9 to 5, Ex: 0x00) and Byte Select (bits 4 to 0, Ex: 0x00); the array holds a Valid Bit, Cache Tag (one entry shown as 0x50) and Cache Data Byte 0 to Byte 1023]
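The effect of the choice shows up on the accesses that follow the write miss. A minimal sketch for a direct-mapped cache, counting hits under each policy; the trace format and the function name are assumptions for illustration.

```python
def count_hits(trace, num_lines, write_allocate):
    """Return the number of hits for a direct-mapped cache.

    trace: list of (op, block) pairs, op being "R" or "W".
    """
    lines = {}   # cache index -> resident block number
    hits = 0
    for op, block in trace:
        index = block % num_lines
        if lines.get(index) == block:
            hits += 1
        elif op == "R" or write_allocate:
            lines[index] = block   # read miss, or write miss with allocate
        # Otherwise: write miss without allocate, the cache is left unchanged.
    return hits

trace = [("W", 0), ("R", 0), ("R", 0)]   # write miss, then two reads of the block
print(count_hits(trace, 32, write_allocate=True))
print(count_hits(trace, 32, write_allocate=False))
```

With write allocate the block is already present for both reads; without it, the first read after the write misses as well.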

23 CIS629 - Fall 2002 Caches 23 Impact on Cycle Time
[Figure: pipeline datapath with PC, I-Cache, IR, the pipeline instruction registers IRex, IRm and IRwb, D-Cache, and miss/invalid signals]
°Example: direct mapped allows miss signal after data
°Cache Hit Time:
-directly tied to clock rate
-increases with cache size
-increases with associativity
°Average Memory Access Time (AMAT) = Hit Time + Miss Rate x Miss Penalty
°Compute Time = IC x CT x (ideal CPI + memory stalls)
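Both formulas are one-liners, which makes it easy to try different design points. In this sketch the numeric inputs (a 1-cycle hit, a 5% miss rate, a 50-cycle miss penalty, and so on) are made-up illustrative values, not figures from the slides, and memory stalls are taken per instruction, in cycles.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty."""
    return hit_time + miss_rate * miss_penalty

def compute_time(instruction_count, cycle_time, ideal_cpi, memory_stalls):
    """Compute Time = IC x CT x (ideal CPI + memory stall cycles per instruction)."""
    return instruction_count * cycle_time * (ideal_cpi + memory_stalls)

# Hypothetical design point, in cycles.
base = amat(hit_time=1, miss_rate=0.05, miss_penalty=50)

# A more associative cache might lower the miss rate but lengthen the hit time.
alternative = amat(hit_time=2, miss_rate=0.03, miss_penalty=50)

print(base, alternative)
print(compute_time(instruction_count=1_000_000, cycle_time=2, ideal_cpi=1, memory_stalls=1))
```

In this made-up case the extra hit-time cycle costs as much as the lower miss rate saves, so the two design points come out equal: a reminder that hit time is paid on every access, while the miss penalty is paid only on misses.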

24 CIS629 - Fall 2002 Caches 24 What happens on a Cache miss?
°For in-order pipeline, 2 options:
°Option 1: Freeze pipeline in Mem stage (popular early on: Sparc, R4000)
IF ID EX Mem stall stall stall … stall Mem Wr
   IF ID EX  stall stall stall … stall stall Ex Wr
°Option 2: Use Full/Empty bits in registers + MSHR queue
-MSHR = “Miss Status/Handler Registers” (Kroft). Each entry in this queue keeps track of the status of outstanding memory requests to one complete memory line.
-Per cache-line: keep info about memory address.
-For each word: register (if any) that is waiting for result.
-Used to “merge” multiple requests to one memory line.
-New load creates MSHR entry and sets destination register to “Empty”. Load is “released” from pipeline.
-Attempt to use register before result returns causes instruction to block in decode stage.
-Limited “out-of-order” execution with respect to loads. Popular with in-order superscalar architectures.
°Out-of-order pipelines already have this functionality built in… (load queues, etc).

