
1 Memory Hierarchy— Motivation, Definitions, Four Questions about Memory Hierarchy, Improving Performance Professor Alvin R. Lebeck Computer Science 220 ECE 252 Fall 2008

2 © Alvin R. Lebeck 2008 Admin Some stuff will be review…some will not. Projects… Reading: –H&P Appendix C & Chapter 5 There will be three papers to read.

3 © Alvin R. Lebeck 2008 Outline of Today's Lecture Most of today should be review…later stuff will not: The Memory Hierarchy; Direct Mapped Cache; Two-Way Set Associative Cache; Fully Associative Cache; Replacement Policies; Write Strategies; Memory Hierarchy Performance; Improving Performance.

4 © Alvin R. Lebeck 2008 Cache What is a cache? What is the motivation for a cache? Why do caches work? How do caches work?

5 © Alvin R. Lebeck 2008 The Motivation for Caches Motivation: –Large memories (DRAM) are slow –Small memories (SRAM) are fast Make the average access time small by: –Servicing most accesses from a small, fast memory. This also reduces the bandwidth required of the large memory. (Figure: processor connected to a memory system consisting of a cache backed by DRAM.)

6 © Alvin R. Lebeck 2008 Levels of the Memory Hierarchy (each level listed with capacity, access time, cost, staging/transfer unit, and who manages transfers; upper levels are smaller and faster, lower levels are larger and slower):
–Registers: 100s of bytes, <1 ns; instruction operands, 1-8 bytes; managed by program/compiler
–Cache: KBs-MBs, 1-30 ns; blocks, 8-128 bytes; managed by the cache controller
–Main Memory: GBs, 100 ns-1 us, $250 per 2 GB (10^-8 cents/bit); pages, 512-4K bytes; managed by the OS
–Disk: 40-200 GB, 3-15 ms, $110 per 80 GB (10^-10 cents/bit); files, MBs; managed by the user/operator
–Tape: 400 GB, sec-min, $80 per 400 GB (10^-11 cents/bit)

7 © Alvin R. Lebeck 2008 The Principle of Locality The Principle of Locality: –Programs access a relatively small portion of the address space at any instant of time –Example: 90% of time in 10% of the code Two Different Types of Locality: –Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon –Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (Figure: probability of reference plotted over the address space from 0 to 2^n.)
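To make the two kinds of locality concrete, here is a minimal C sketch (mine, not from the slides): the running total is re-read on every iteration (temporal locality), and the array is walked sequentially, so consecutive elements share cache blocks (spatial locality).

```c
#include <stdio.h>

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++) a[i] = i;

    int sum = 0;                       /* touched every iteration: temporal locality     */
    for (int i = 0; i < 1024; i++)     /* a[0], a[1], ... are adjacent: spatial locality */
        sum += a[i];                   /* one miss per 32-B block fetches 8 useful ints  */

    printf("%d\n", sum);
    return 0;
}
```

With 32-byte blocks and 4-byte ints, a single miss brings in eight useful elements, which is why caches fetch whole blocks rather than single words.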

8 © Alvin R. Lebeck 2008 Memory Hierarchy: Principles of Operation At any given time, data is copied between only 2 adjacent levels: –Upper Level (Cache): the one closer to the processor »Smaller, faster, and uses more expensive technology –Lower Level (Memory): the one further away from the processor »Bigger, slower, and uses less expensive technology Block: –The minimum unit of information that can either be present or not present in the two-level hierarchy (Figure: block Blk X in the upper level, block Blk Y in the lower level, with data moving to and from the processor.)

9 © Alvin R. Lebeck 2008 Memory Hierarchy: Terminology Hit: data appears in some block in the upper level (example: Block X) –Hit Rate: the fraction of memory accesses found in the upper level –Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss Miss: data needs to be retrieved from a block in the lower level (Block Y) –Miss Rate = 1 - (Hit Rate) –Miss Penalty = time to replace a block in the upper level + time to deliver the block to the processor Hit Time << Miss Penalty

10 © Alvin R. Lebeck 2008 Four Questions for Memory Hierarchy Designers Q1: Where can a block be placed in the upper level? (Block placement) Q2: How is a block found if it is in the upper level? (Block identification) Q3: Which block should be replaced on a miss? (Block replacement) Q4: What happens on a write? (Write strategy)

11 © Alvin R. Lebeck 2008 Direct Mapped Cache A direct mapped cache is an array of fixed-size frames. Each frame holds consecutive bytes (a cache block) of main memory data. The tag array holds the block memory address, and a valid bit associated with each cache block tells if the data is valid.
Cache-Index = (Address mod Cache_Size) / Block_Size
Block-Offset = Address mod Block_Size
Tag = Address / Cache_Size
Cache Index: the location of a block (and its tag) in the cache. Block Offset: the byte location in the cache block.
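Those three formulas transcribe directly into C. A minimal sketch (function names are mine; all sizes are in bytes, and the address is a plain byte address):

```c
#include <stdio.h>

/* Direct transcription of the slide's formulas; all quantities in bytes. */
static unsigned cache_index(unsigned addr, unsigned cache_size, unsigned block_size) {
    return (addr % cache_size) / block_size;   /* (Address mod Cache_Size) / Block_Size */
}
static unsigned block_offset(unsigned addr, unsigned block_size) {
    return addr % block_size;                  /* Address mod Block_Size */
}
static unsigned tag(unsigned addr, unsigned cache_size) {
    return addr / cache_size;                  /* Address / Cache_Size */
}

int main(void) {
    unsigned addr = 0x1234;                    /* arbitrary example address */
    printf("tag=0x%x index=%u offset=%u\n",
           tag(addr, 1024), cache_index(addr, 1024, 32), block_offset(addr, 32));
    return 0;
}
```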

12 © Alvin R. Lebeck 2008 The Simplest Cache: Direct Mapped Cache (Figure: a 16-location memory, addresses 0-F, mapping into a 4-byte direct mapped cache with cache indices 0-3.) Location 0 can be occupied by data from: –Memory location 0, 4, 8, ... etc. –In general: any memory location whose 2 LSBs of the address are 0s –Address => cache index Which one should we place in the cache? How can we tell which one is in the cache?

13 © Alvin R. Lebeck 2008 Direct Mapped Cache (Cont.) For a cache of 2^M bytes with a block size of 2^L bytes: –There are 2^(M-L) cache blocks –The lowest L bits of the address are the Block-Offset bits –The next (M - L) bits are the Cache-Index –The last (32 - M) bits are the Tag bits Address layout: | Tag (32-M bits) | Cache Index (M-L bits) | Block Offset (L bits) |
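Because the cache size and block size are powers of two, the mod/div arithmetic above collapses into shifts and masks on exactly these three bit fields. A sketch under that assumption (M and L as defined on the slide):

```c
#include <stdio.h>

#define L 5    /* log2(block size): 32-byte blocks */
#define M 10   /* log2(cache size): 1-KB cache     */

/* Field layout from the slide: | Tag (32-M) | Cache Index (M-L) | Block Offset (L) | */
static unsigned offset_of(unsigned a) { return a & ((1u << L) - 1); }
static unsigned index_of (unsigned a) { return (a >> L) & ((1u << (M - L)) - 1); }
static unsigned tag_of   (unsigned a) { return a >> M; }

int main(void) {
    unsigned a = 0x1234;
    /* Agrees with the mod/div version: tag 4, index 17, offset 20. */
    printf("tag=0x%x index=%u offset=%u\n", tag_of(a), index_of(a), offset_of(a));
    return 0;
}
```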

14 © Alvin R. Lebeck 2008 Direct Mapped Cache Example: 1-KB cache with 32-B blocks (1K = 2^10 = 1024, 32 = 2^5):
Cache Index = (Address mod 1024) / 32
Block-Offset = Address mod 32
Tag = Address / 1024
(Figure: 32 cache blocks, each with a valid bit, a 22-bit cache tag, and a 32-byte data block holding bytes 0-31; the address splits into a 22-bit tag, a 5-bit cache index, and a 5-bit block offset.)

15 © Alvin R. Lebeck 2008 Example: 1KB Direct Mapped Cache with 32B Blocks For a 1024 (2^10) byte cache with 32-byte blocks: –The uppermost 22 = (32 - 10) address bits are the Cache Tag –The lowest 5 address bits are the Byte Select (Block Size = 2^5) –The next 5 address bits (bit 5 - bit 9) are the Cache Index (Figure: address bits 31-10 form the cache tag, stored as part of the cache "state"; bits 9-5 form the cache index; bits 4-0 form the byte select. Example address: tag 0x50, index 0x01, byte select 0x00. The cache array holds a valid bit, a tag, and data bytes 0-1023.)

16 © Alvin R. Lebeck 2008 Example: 1K Direct Mapped Cache Access: cache tag 0x0002fe, cache index 0x00, byte select 0x00. Frame 0 is valid but its stored tag is 0x000050, so the comparison fails: Cache Miss. (Figure: the tag array also shows frame 2 valid with tag 0x004440, and an invalid frame with undefined tag bits.)

17 © Alvin R. Lebeck 2008 Example: 1K Direct Mapped Cache After the miss, the block is fetched into frame 0: its valid bit is set, its cache tag becomes 0x0002fe, and the new block of data is installed.

18 © Alvin R. Lebeck 2008 Example: 1K Direct Mapped Cache Access: cache tag 0x000050, cache index 0x01, byte select 0x08. Frame 1 is valid and its stored tag is 0x000050, so the comparison succeeds: Cache Hit.

19 © Alvin R. Lebeck 2008 Example: 1K Direct Mapped Cache Access: cache tag 0x002450, cache index 0x02, byte select 0x04. Frame 2 is valid but its stored tag is 0x004440: Cache Miss.

20 © Alvin R. Lebeck 2008 Example: 1K Direct Mapped Cache After the miss, frame 2 is refilled: its cache tag becomes 0x002450 and the new block of data is installed.
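The miss/hit/miss pattern of slides 16-20 can be replayed with a few lines of C. This is a sketch written for this transcript, not the course's code; the extracted figures are ambiguous about the exact initial valid bits, so the starting state below is an assumption chosen to reproduce the outcomes shown, and the data payload is omitted.

```c
#include <stdio.h>

#define NFRAMES 32                     /* 1-KB cache / 32-B blocks */

struct frame { int valid; unsigned tag; };
static struct frame cache[NFRAMES];

/* Look up (tag, index); on a miss, install the new block's tag (the refill step). */
static const char *lookup(unsigned tag, unsigned index) {
    struct frame *f = &cache[index];
    if (f->valid && f->tag == tag) return "hit";
    f->valid = 1;                      /* fetch the block, set the valid bit */
    f->tag   = tag;
    return "miss";
}

int main(void) {
    /* Assumed initial contents, matching the tags visible in the figures. */
    cache[0] = (struct frame){1, 0x000050};
    cache[1] = (struct frame){1, 0x000050};
    cache[2] = (struct frame){1, 0x004440};

    struct { unsigned tag, index; } refs[] = {
        {0x0002fe, 0},   /* slide 16: tag mismatch in frame 0 -> miss, then refill */
        {0x000050, 1},   /* slide 18: tag match in frame 1    -> hit               */
        {0x002450, 2},   /* slide 19: tag mismatch in frame 2 -> miss, then refill */
    };
    for (int i = 0; i < 3; i++)
        printf("tag=0x%06x index=%u -> %s\n",
               refs[i].tag, refs[i].index, lookup(refs[i].tag, refs[i].index));
    return 0;
}
```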

21 © Alvin R. Lebeck 2008 Block Size Tradeoff In general, a larger block size takes advantage of spatial locality, BUT: –Larger block size means larger miss penalty: »It takes longer to fill up the block –If block size is too big relative to cache size, miss rate will go up: »Too few cache blocks Average Memory Access Time: –Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate (Figure: as block size grows, miss penalty rises steadily; miss rate first falls as spatial locality is exploited, then rises when too few blocks compromise temporal locality; average access time is therefore minimized at an intermediate block size and climbs once the increased miss penalty and miss rate dominate.)
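The tradeoff is easy to explore numerically. Here is a sketch of the slide's average-access-time formula in C (the numbers are illustrative, not from the lecture); note that in this decomposition Miss Penalty is the full cost of a missing access, so the formula simply weights the two outcomes by their probabilities:

```c
#include <stdio.h>

/* AMAT = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate (the slide's form). */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time * (1.0 - miss_rate) + miss_penalty * miss_rate;
}

int main(void) {
    /* Illustrative: 2 ns hit, 72 ns total on a miss, miss rates 1%..5%. */
    for (int i = 1; i <= 5; i++) {
        double mr = i / 100.0;
        printf("miss rate %.2f -> AMAT %.2f ns\n", mr, amat(2.0, mr, 72.0));
    }
    return 0;
}
```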

22 © Alvin R. Lebeck 2008 An N-way Set Associative Cache N-way set associative: N entries for each cache index –N direct mapped caches operating in parallel Example: two-way set associative cache –Cache Index selects a "set" from the cache –The two tags in the set are compared in parallel –Data is selected based on the tag result (Figure: two ways, each with its own valid/tag/data arrays; the cache index selects a set, the address tag is compared against both stored tags in parallel, the comparator outputs are ORed into the Hit signal and drive the mux that selects the cache block.)
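A software sketch of the two-way lookup (hypothetical structure; the hardware checks both ways' tags in parallel, which a sequential loop can only approximate):

```c
#include <stdio.h>

#define NSETS 16
#define WAYS  2

struct way { int valid; unsigned tag; };
static struct way sets[NSETS][WAYS];

/* The cache index selects a set; both tags are compared against the address tag
   (in hardware, by two parallel comparators whose results are ORed into Hit).
   Returns the matching way, which would steer the data mux, or -1 on a miss. */
static int lookup(unsigned tag, unsigned index) {
    for (int w = 0; w < WAYS; w++)
        if (sets[index][w].valid && sets[index][w].tag == tag)
            return w;
    return -1;   /* miss: a replacement policy must now pick a victim way */
}

int main(void) {
    sets[3][1] = (struct way){1, 0x1234};
    printf("way = %d\n", lookup(0x1234, 3));   /* hits in way 1 */
    printf("way = %d\n", lookup(0x5678, 3));   /* miss: -1      */
    return 0;
}
```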

23 © Alvin R. Lebeck 2008 Advantages of Set Associative Cache Higher hit rate for the same cache size. Fewer conflict misses. Can have a larger cache while keeping the index small (e.g., no larger than the virtual page offset bits, which matters for virtually indexed caches).

24 © Alvin R. Lebeck 2008 Disadvantage of Set Associative Cache N-way set associative cache versus direct mapped cache: –N comparators vs. 1 (power penalty) –Extra MUX delay for the data (delay penalty) –Data comes AFTER the Hit/Miss decision and set selection In a direct mapped cache, the cache block is available BEFORE Hit/Miss: –Possible to assume a hit and continue, recovering later if it was a miss (Figure: same two-way structure as the previous slide.)

25 © Alvin R. Lebeck 2008 Yet Another Extreme Example: Fully Associative Cache Fully associative cache: push the set associative idea to its limit! –Forget about the Cache Index –Compare the cache tags of all cache entries in parallel –Example: with 32-B blocks, we need N 27-bit comparators (27 = 32 - 5 byte-select bits) By definition: Conflict Miss = 0 for a fully associative cache (Figure: every entry's 27-bit tag is compared against the address tag simultaneously; the 5-bit byte select, e.g. 0x01, picks the byte within the matching block.)

26 © Alvin R. Lebeck 2008 Sources of Cache Misses Compulsory (cold start or process migration, first reference): first access to a block –Misses even in an infinite cache –"Cold" fact of life: not a whole lot you can do about it –Prefetching, or a larger block (implicit prefetch), can help Capacity: –Misses in a fully associative cache with LRU replacement –The cache cannot contain all blocks accessed by the program –Solution: increase cache size Conflict (collision): –Hits in fully associative, misses in set-associative –Multiple memory locations mapped to the same cache location –Solution 1: increase cache size –Solution 2: increase associativity Invalidation: another process (e.g., I/O) updates memory

27 © Alvin R. Lebeck 2008 Sources of Cache Misses
                    Direct Mapped    N-way Set Associative    Fully Associative
Cache Size          Big              Medium                   Small
Compulsory Miss     Same             Same                     Same
Conflict Miss       High             Medium                   Zero
Capacity Miss       Low(er)          Medium                   High
Invalidation Miss   Same             Same                     Same/lower
Note: if you are going to run "billions" of instructions, compulsory misses are insignificant.

28 © Alvin R. Lebeck 2008 The Need to Make a Decision! Direct Mapped Cache: –Each memory location can map to only 1 cache location (frame) –No need to make any decision :-) »The current item replaces the previous item in that cache location N-way Set Associative Cache: –Each memory location has a choice of N cache frames Fully Associative Cache: –Each memory location can be placed in ANY cache frame Cache miss in an N-way set associative or fully associative cache: –Bring in the new block from memory –Throw out a cache block to make room for the new block –We need to make a decision on which block to throw out!

29 © Alvin R. Lebeck 2008 Cache Block Replacement Policy Random Replacement: –Hardware randomly selects a cache item and throws it out Least Recently Used: –Hardware keeps track of the access history –Replace the entry that has not been used for the longest time –A two-way set associative cache needs only one bit per set for LRU replacement Example of a simple "pseudo" least recently used implementation (sketched in code below): –Assume 64 fully associative entries –A hardware replacement pointer points to one cache entry –Whenever an access is made to the entry the pointer points to: »Move the pointer to the next entry –Otherwise: do not move the pointer (Figure: entries 0-63 with the replacement pointer.)
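The pointer scheme from the slide, in C (hypothetical code following the slide's description; what happens to the pointer after a refill is not specified on the slide, so advancing it there is my assumption):

```c
#include <stdio.h>

#define ENTRIES 64

static int ptr = 0;                 /* replacement pointer: names the next victim */

/* Hit path: called on every access to entry e. */
static void touch(int e) {
    if (e == ptr)                   /* the guarded entry was just used...       */
        ptr = (ptr + 1) % ENTRIES;  /* ...so move the pointer to the next entry */
    /* otherwise: do not move the pointer */
}

/* Miss path: evict the entry the pointer currently names. */
static int pick_victim(void) {
    int victim = ptr;
    ptr = (ptr + 1) % ENTRIES;      /* assumption: a freshly filled entry counts as used */
    return victim;
}

int main(void) {
    touch(0); touch(0); touch(5);   /* pointer moves past 0, then stays at 1 */
    printf("victim = %d\n", pick_victim());   /* 1 */
    return 0;
}
```

The pointer never lingers on a recently used entry, so the victim it names is approximately (not exactly) the least recently used one, at the cost of a single counter instead of full access-history bits.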

30 © Alvin R. Lebeck 2008 Cache Write Policy: Write Through versus Write Back Cache reads are much easier to handle than cache writes: –An instruction cache is much easier to design than a data cache Cache write: –How do we keep data in the cache and memory consistent? Two options (decision time again :-) –Write Back: write to cache only. Write the cache block to memory when that cache block is being replaced on a cache miss. »Need a "dirty bit" for each cache block »Greatly reduces the memory bandwidth requirement »Can buffer the data in a write-back buffer »Control can be complex –Write Through: write to cache and memory at the same time. »What!!! How can this be? Isn't memory too slow for this?
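A sketch of the write-back bookkeeping (hypothetical structure): a write hit only sets the dirty bit, and memory is updated later, when a dirty block is evicted; that deferral is where the bandwidth savings come from.

```c
#include <stdio.h>

struct line { int valid, dirty; unsigned tag; };

/* Write hit: update only the cached copy, and record that it diverged from memory. */
static void write_hit(struct line *l) {
    l->dirty = 1;                   /* the "dirty bit" from the slide */
}

/* Replacement: only dirty blocks cost a memory write. */
static void evict(struct line *l) {
    if (l->valid && l->dirty)
        printf("write block with tag 0x%x back to memory\n", l->tag);
    l->valid = l->dirty = 0;
}

int main(void) {
    struct line l = {1, 0, 0x2fe};
    evict(&l);                      /* clean block: no memory traffic         */
    l = (struct line){1, 0, 0x2fe};
    write_hit(&l);
    write_hit(&l);                  /* many writes to one block, still...     */
    evict(&l);                      /* ...only one write-back when it leaves  */
    return 0;
}
```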

31 © Alvin R. Lebeck 2008 Write Buffer for Write Through A write buffer is needed between the cache and memory: –Processor: writes data into the cache and the write buffer –Memory controller: writes the contents of the buffer to memory The write buffer is just a FIFO: –Typical number of entries: 4 –Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle Memory system designer's nightmare: –Store frequency (w.r.t. time) > 1 / DRAM write cycle –Write buffer saturation (Figure: processor writes go into the cache and the write buffer, which drains to DRAM.)
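A sketch of the four-entry FIFO (hypothetical code; a real buffer is drained asynchronously by the memory controller and may merge writes to the same block):

```c
#include <stdio.h>

#define WB_ENTRIES 4                          /* typical depth, per the slide */

struct wb_entry { unsigned addr, data; };
static struct wb_entry buf[WB_ENTRIES];
static int head = 0, count = 0;

/* Processor side: returns 0 if the buffer is full, i.e. the processor must stall. */
static int wb_insert(unsigned addr, unsigned data) {
    if (count == WB_ENTRIES) return 0;        /* saturation: the next slide's topic */
    buf[(head + count) % WB_ENTRIES] = (struct wb_entry){addr, data};
    count++;
    return 1;
}

/* Memory-controller side: drain the oldest pending write, if any. */
static int wb_drain(struct wb_entry *out) {
    if (count == 0) return 0;
    *out = buf[head];
    head = (head + 1) % WB_ENTRIES;
    count--;
    return 1;
}

int main(void) {
    wb_insert(0x100, 42);                     /* stores complete at cache speed... */
    wb_insert(0x104, 43);
    struct wb_entry e;
    while (wb_drain(&e))                      /* ...while DRAM catches up later    */
        printf("DRAM write: [0x%x] = %u\n", e.addr, e.data);
    return 0;
}
```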

32 © Alvin R. Lebeck 2008 Write Buffer Saturation Store frequency (w.r.t. time) -> 1 / DRAM write cycle: –If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row): »The store buffer will overflow no matter how big you make it »Because the CPU cycle time << DRAM write cycle time Solutions for write buffer saturation: –Use a write back cache –Install a second-level (L2) cache (Figure: processor, cache, write buffer, DRAM; then the same datapath with an L2 cache added in front of DRAM.)

33 © Alvin R. Lebeck 2008 Write Policies We know about write-through vs. write-back. Assume: a 16-bit write to memory location 0x00 causes a cache miss. Do we change the cache tag and update data in the block? –Yes: Write Allocate –No: Write No-Allocate Do we fetch the other data in the block? –Yes: Fetch-on-Write (usually paired with write-allocate) –No: No-Fetch-on-Write Write-around cache: –Write-through with no-write-allocate
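A sketch of how the two write-miss questions combine (flag names are mine):

```c
#include <stdbool.h>
#include <stdio.h>

/* The two independent write-miss decisions from the slide. */
static void write_miss(bool write_allocate, bool fetch_on_write) {
    if (!write_allocate) {                    /* Write No-Allocate */
        printf("send the write past the cache, straight to memory\n");
        return;                               /* with write-through: a write-around cache */
    }
    printf("claim a frame: set its tag for the written block\n");   /* Write Allocate */
    if (fetch_on_write)
        printf("fetch the rest of the block from memory\n");
    else
        printf("mark only the written part valid (needs sub-block valid bits; next slide)\n");
}

int main(void) {
    write_miss(false, false);   /* write no-allocate                 */
    write_miss(true,  true);    /* write allocate + fetch-on-write   */
    write_miss(true,  false);   /* write allocate, no-fetch-on-write */
    return 0;
}
```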

34 © Alvin R. Lebeck 2008 Sub-block Cache (Sectored) Sub-blocks: –Share one cache tag between all sub-blocks in a block –Each sub-block within a block has its own valid bit –Example: 1 KB direct mapped cache, 32-B blocks, 8-B sub-blocks »Each cache entry will have: 32/8 = 4 valid bits Miss: only the bytes in that sub-block are brought in –Reduces cache fill bandwidth (penalty) (Figure: each cache entry holds one tag, four sub-block valid bits, and four 8-byte sub-blocks.)
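A sketch of the sectored bookkeeping (hypothetical structure; with a 32-B block and 8-B sub-blocks, the four valid bits fit in one small bitmask):

```c
#include <stdio.h>

struct sector_line {
    unsigned tag;
    unsigned sub_valid;                 /* one valid bit per sub-block; 4 bits used */
};

/* A hit needs BOTH a tag match and the addressed sub-block's valid bit set. */
static int sub_hit(const struct sector_line *l, unsigned tag, int sub) {
    return l->tag == tag && ((l->sub_valid >> sub) & 1u);
}

/* On a sub-block miss with a matching tag, fetch only that sub-block (8 bytes). */
static void fill_sub(struct sector_line *l, int sub) {
    l->sub_valid |= 1u << sub;
}

int main(void) {
    struct sector_line l = { 0x2fe, 0 };        /* tag present, no data yet */
    printf("%d\n", sub_hit(&l, 0x2fe, 2));      /* 0: sub-block 2 not valid */
    fill_sub(&l, 2);                            /* fill just 8 bytes        */
    printf("%d\n", sub_hit(&l, 0x2fe, 2));      /* 1: now a hit             */
    return 0;
}
```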

35 © Alvin R. Lebeck 2008 CPS 220 Review: Four Questions for Memory Hierarchy Designers Q1: Where can a block be placed in the upper level? (Block placement) –Fully Associative, Set Associative, Direct Mapped Q2: How is a block found if it is in the upper level? (Block identification) –Tag/Block Q3: Which block should be replaced on a miss? (Block replacement) –Random, LRU Q4: What happens on a write? (Write strategy) –Write Back or Write Through (with Write Buffer)

36 © Alvin R. Lebeck 2008 CPS 220 Cache Performance CPU time = (CPU execution clock cycles + Memory stall clock cycles) x Clock cycle time Memory stall clock cycles = (Reads x Read miss rate x Read miss penalty) + (Writes x Write miss rate x Write miss penalty) Combining reads and writes: Memory stall clock cycles = Memory accesses x Miss rate x Miss penalty

37 © Alvin R. Lebeck 2008 CPS 220 Cache Performance CPUtime = IC x (CPI_execution + (Memory accesses per instruction x Miss rate x Miss penalty)) x Clock cycle time (hits are included in CPI_execution) Misses per instruction = Memory accesses per instruction x Miss rate, so equivalently: CPUtime = IC x (CPI_execution + Misses per instruction x Miss penalty) x Clock cycle time

38 © Alvin R. Lebeck 2008 Example Miss penalty: 50 clocks. Miss rate: 2%. Base CPI: 2.0. 1.33 memory references per instruction. Compute the CPUtime: CPUtime = IC x (2.0 + (1.33 x 0.02 x 50)) x Clock cycle time CPUtime = IC x 3.33 x Clock cycle time So CPI increased from 2.0 to 3.33 with a 2% miss rate.
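That arithmetic transcribes directly to C; a sketch (parameter names are mine) that reproduces the 3.33 figure:

```c
#include <stdio.h>

/* CPUtime = IC x (CPI_execution + accesses/inst x miss rate x miss penalty) x cycle time */
static double cpu_time(double ic, double cpi_exec, double acc_per_inst,
                       double miss_rate, double miss_penalty, double cycle) {
    return ic * (cpi_exec + acc_per_inst * miss_rate * miss_penalty) * cycle;
}

int main(void) {
    /* Effective CPI for the example: 2.0 + 1.33 x 0.02 x 50 */
    printf("effective CPI = %.2f\n", 2.0 + 1.33 * 0.02 * 50.0);   /* 3.33 */
    /* With IC and cycle time normalized to 1, CPU time equals the effective CPI. */
    printf("CPUtime / (IC x cycle) = %.2f\n",
           cpu_time(1.0, 2.0, 1.33, 0.02, 50.0, 1.0));
    return 0;
}
```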

39 © Alvin R. Lebeck 2008 Example 2 Two caches: both 64KB, 32-byte blocks, 70 ns miss penalty, 1.3 references per instruction, CPI 2.0 with a perfect cache. Direct mapped: –Cycle time 2 ns –Miss rate 1.4% 2-way set associative: –Cycle time increases by 10% (to 2.2 ns) –Miss rate 1.0% Which is better? –Compute average memory access time –Compute CPU time

40 © Alvin R. Lebeck 2008 Example 2 Continued Ave Mem Acc Time = Hit time + (Miss rate x Miss penalty) –1-way: 2.0 + (0.014 x 70) = 2.98 ns –2-way: 2.2 + (0.010 x 70) = 2.90 ns CPUtime = IC x CPIexec x Cycle time –CPIexec = CPIbase + ((mem acc / inst) x Miss rate x Miss penalty) –Note: Miss penalty x Cycle time = 70 ns for both designs, so stall time can be added directly in ns –1-way: IC x ((2.0 x 2.0) + (1.3 x 0.014 x 70)) = 5.27 x IC ns –2-way: IC x ((2.0 x 2.2) + (1.3 x 0.010 x 70)) = 5.31 x IC ns So the 2-way cache wins on average memory access time, but the direct mapped cache wins on CPU time: its faster clock applies to every instruction, not just to memory stalls.
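The same arithmetic as a C sketch; the 70 ns miss penalty is applied directly in nanoseconds because, as the slide notes, miss penalty x cycle time is fixed at 70 ns for both designs:

```c
#include <stdio.h>

/* Time per instruction in ns: base CPI x cycle time + memory stall time. */
static double ns_per_inst(double cpi, double cycle_ns, double refs_per_inst,
                          double miss_rate, double penalty_ns) {
    return cpi * cycle_ns + refs_per_inst * miss_rate * penalty_ns;
}

int main(void) {
    printf("AMAT  1-way: %.2f ns, 2-way: %.2f ns\n",
           2.0 + 0.014 * 70.0, 2.2 + 0.010 * 70.0);          /* 2.98 vs 2.90 */
    printf("CPU   1-way: %.2f ns/inst\n",
           ns_per_inst(2.0, 2.0, 1.3, 0.014, 70.0));         /* 5.27 */
    printf("CPU   2-way: %.2f ns/inst\n",
           ns_per_inst(2.0, 2.2, 1.3, 0.010, 70.0));         /* 5.31 */
    return 0;
}
```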

