

Presentation transcript: The Memory Hierarchy, Lecture # 30

1 The Memory Hierarchy, Lecture # 30, 15/05/2009, CA&O, Engr. Umbreen Sabir

2 Memory Systems that Support Caches
 DRAMs are designed to increase density, not to reduce access time.
 To reduce the miss penalty we need to change the memory organization so that it delivers more throughput per miss. The options are a wider memory, sequential access over a one-word-wide bus, or parallel access to all words in a block using interleaved banks.

3 Memory Systems that Support Caches
 The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways.
 One-word-wide organization: the CPU and on-chip cache connect to DRAM memory over a bus that carries a 32-bit address and 32 bits of data per cycle.
 Assume: 1 clock cycle (2 ns) to send the address, 25 clock cycles (50 ns) for the DRAM cycle time, and 1 clock cycle (2 ns) to return a word of data.
 Memory-bus-to-cache bandwidth = number of bytes accessed from memory and transferred to the cache/CPU per clock cycle.

4 One-Word-Wide Memory Organization
 If the block size is one word, then on a cache miss the pipeline must stall for the number of cycles required to return one data word from memory: 1 cycle to send the address + 25 cycles to read DRAM + 1 cycle to return the data = 27 total clock cycles of miss penalty.
 Number of bytes transferred per clock cycle (bandwidth) for a single miss: 4/27 = 0.148 bytes per clock cycle.
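
These numbers can be checked directly. Below is a minimal sketch in Python (not part of the original slides) that reproduces the 27-cycle miss penalty and the 4/27 bandwidth, assuming the cycle counts given above (1 cycle for the address, 25 per DRAM access, 1 per returned word).

```python
# Miss penalty and bandwidth for the one-word-wide bus and memory,
# one-word (4-byte) block, using the cycle counts assumed in the slide.
ADDR_CYCLES = 1    # send the address
DRAM_CYCLES = 25   # one DRAM access
XFER_CYCLES = 1    # return one data word
WORD_BYTES  = 4

miss_penalty = ADDR_CYCLES + DRAM_CYCLES + XFER_CYCLES   # 27 cycles
bandwidth = WORD_BYTES / miss_penalty                    # 4/27 ~ 0.148 bytes/cycle
print(miss_penalty, round(bandwidth, 3))                 # 27 0.148
```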

5 One-Word-Wide Memory Organization
 What if the block size were four words?
 1 cycle to send the 1st address + 100 cycles (4 x 25) to read DRAM + 1 cycle to return the last data word = 102 total clock cycles of miss penalty.
 Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/102 = 0.157 bytes per clock cycle.

6 Interleaved Memory Organization
 The CPU and cache connect over the bus to four memory banks (bank 0 through bank 3) that can be read in parallel.
 For a block size of four words: 1 cycle to send the 1st address + 25 + 3 = 28 cycles to read the DRAM banks + 1 cycle to return the last data word = 30 total clock cycles of miss penalty.
 Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/30 = 0.533 bytes per clock cycle (about 4.27 bits per clock cycle).
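
The same arithmetic extends to the four-word block. The sketch below (Python, not from the slides, using the slides' counting convention in which only the return of the last word adds a cycle) contrasts the one-word-wide organization with four-way interleaving.

```python
# Four-word (16-byte) block: one-word-wide memory vs. four interleaved banks.
# Counting convention from the slides: only the return of the last word costs
# an extra cycle; earlier returns overlap with the remaining DRAM reads.
ADDR, DRAM, XFER, WORD_BYTES = 1, 25, 1, 4

def narrow_bus(words):
    return ADDR + words * DRAM + XFER            # 1 + 4*25 + 1 = 102

def interleaved(words, banks=4):
    assert words <= banks
    return ADDR + DRAM + (words - 1) + XFER      # 1 + 25 + 3 + 1 = 30

for name, cycles in (("narrow bus", narrow_bus(4)), ("interleaved", interleaved(4))):
    print(name, cycles, "cycles,", round(4 * WORD_BYTES / cycles, 3), "bytes/cycle")
# narrow bus 102 cycles, 0.157 bytes/cycle
# interleaved 30 cycles, 0.533 bytes/cycle
```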

7 Further Improvements to Memory Organization (SDRAMs and DDR SDRAMs)
 An external clock (300 MHz) synchronizes the transfer of addresses and data.
 Example: a 4M DRAM that outputs one bit from the array uses 2048 column latches and one multiplexor.
 An SDRAM is given the starting address and a burst length (2/4/8); successive addresses need not be provided.
 DDR (double data rate) SDRAM transfers data on both the rising and falling edges of the external clock.
 In 1980, DRAMs were 64 Kbit with a 150 ns column access to an existing row; in 2004, they were 1024 Mbit with a 3 ns column access to an existing row.

8 Further Improvements: Two-Level Caches
 The figure shows the AMD Athlon and Duron processor architecture.
 A two-level cache allows the L1 cache to be smaller, which improves its hit time, since smaller caches are faster.
 The L2 cache is larger; its access time is less critical, so it can use larger block sizes.
 L2 is accessed whenever a miss occurs in L1, which reduces the L1 miss penalty dramatically.
 L2 also stores the contents of the "victim buffer": blocks evicted from the L1 cache when an L1 miss forces a replacement.
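
One way to see why the second level helps is a rough average memory access time (AMAT) calculation. The numbers below are illustrative assumptions only, not Athlon/Duron figures.

```python
# Illustrative AMAT (average memory access time) with and without an L2 cache.
# All numbers below are assumptions chosen for the sake of the example.
l1_hit_time  = 1     # cycles
l1_miss_rate = 0.05  # fraction of all accesses
l2_hit_time  = 10    # cycles
l2_miss_rate = 0.20  # fraction of L1 misses that also miss in L2
mem_penalty  = 100   # cycles to main memory

amat_l1_only = l1_hit_time + l1_miss_rate * mem_penalty
amat_with_l2 = l1_hit_time + l1_miss_rate * (l2_hit_time + l2_miss_rate * mem_penalty)
print(amat_l1_only, amat_with_l2)   # 6.0 vs. 2.5 cycles
```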

9 Reducing Cache Misses through Associativity
 Recall that a direct-mapped cache allows a memory location to map to only one block in the cache (identified by its tag); it needs only one comparator.
 In a fully associative cache, a block in memory can map to any block in the cache, so all cache entries must be searched. The search is done in parallel, with one comparator for each cache block, which is expensive in hardware; it works only for small caches.
 In between the two extremes are set-associative caches: a block in memory maps to exactly one set of blocks, but can occupy any position within that set.

10 Reducing Cache Misses through Associativity
 An n-way set-associative cache has sets with n blocks each.
 All blocks in the set have to be searched on each access, which reduces the number of comparators needed to n.
 For an eight-block cache: one-way set-associative is the same as direct mapped; two-way and four-way are intermediate; eight-way set-associative is the same as fully associative.
 As associativity increases, the miss rate decreases (1-way 10.3% vs. 8-way 8.1% data miss rate), but the hit time increases.
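
A tiny sketch makes the mapping concrete. Assuming the textbook-style eight-block cache referred to above, it shows which set a block address falls into at each associativity (block address 12 is a hypothetical example).

```python
# Which set a block maps to for each associativity of an eight-block cache
# (block address 12 is just an example).
BLOCKS = 8

def set_index(block_addr, ways):
    sets = BLOCKS // ways        # 1-way: 8 sets ... 8-way: 1 set
    return block_addr % sets

for ways in (1, 2, 4, 8):
    print(f"{ways}-way set-associative: block 12 -> set {set_index(12, ways)}")
# 1-way -> set 4, 2-way -> set 0, 4-way -> set 0, 8-way -> set 0
```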

11 A Four-Way Set-Associative Cache
 [Figure: 4-way set-associative cache with four blocks per set and four comparators; the address splits into a 22-bit tag, an 8-bit index, and a byte offset, with 32-bit data output.]
 For set-associative caches, each doubling of the associativity decreases the number of index bits by one and increases the number of tag bits by one.
 A fully associative cache has no index bits, since there is only one set.

12 Recall the Direct-Mapped Cache
 [Figure: direct-mapped cache with 1024 entries; the 32-bit address splits into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset; each entry holds a valid bit, the tag, and 32 bits of data.]
 The direct-mapped cache had 20 tag bits vs. 22 for the 4-way set-associative cache, and 10 index bits vs. 8 for the 4-way set-associative cache.
 How many tag and index bits for an 8-way set-associative cache? Answer: 23 tag bits and 7 index bits.
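
The question can be answered mechanically. Assuming the same 1024-block cache of one-word (4-byte) blocks and 32-bit byte addresses, the sketch below computes the index and tag widths; for 8-way it gives the 7 index bits and 23 tag bits quoted above.

```python
import math

ADDR_BITS   = 32    # byte address width
BLOCKS      = 1024  # one-word (4-byte) blocks, as in the figure
OFFSET_BITS = 2     # byte offset within a 4-byte block

def index_and_tag(ways):
    sets = BLOCKS // ways
    index_bits = int(math.log2(sets))
    tag_bits = ADDR_BITS - index_bits - OFFSET_BITS
    return index_bits, tag_bits

for ways in (1, 4, 8):
    print(f"{ways}-way: {index_and_tag(ways)} (index bits, tag bits)")
# 1-way: (10, 20), 4-way: (8, 22), 8-way: (7, 23)
```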

13 Which Block to Replace in an Associative Cache?
 The basic principle is "least recently used" (LRU): replace the block that has gone unused for the longest time.
 Keeping track of a block's "age" is done in hardware.
 This is practical for small set-associativity (2-way or 4-way).
 For higher associativity, LRU is either approximated or replacement is done at random.
 For a 2-way set-associative cache, random replacement has about a 10% higher miss rate than LRU.
 As caches become larger, the miss rates of both strategies fall, and the difference between them shrinks.

14 Exercise
 Associativity usually improves the miss ratio, but not always. Give a short series of address references for which a 2-way set-associative cache with LRU replacement would experience more misses than a direct-mapped cache of the same size.
 For the same total size, the 2-way cache has half the number of sets. Pick three addresses A, B, C that all map to the same set of the 2-way cache, with A mapping to a different block than B and C in the direct-mapped cache.
 In the direct-mapped cache, the sequence A, B, C, A, B, C, A, ... generates: miss, miss, miss, hit, miss, miss, hit, ... (A stays resident while B and C keep evicting each other).
 In the 2-way set-associative cache with LRU, the same sequence generates: miss, miss, miss, miss, miss, miss, ... (each reference evicts the block that will be referenced next).
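
A few lines of simulation confirm the counts. The block addresses below (A=0, B=1, C=3) are one hypothetical choice that makes A sit alone in one block of a two-block direct-mapped cache while B and C conflict; in the two-way cache of the same size, all three share the single set.

```python
from collections import OrderedDict

def simulate(refs, sets, ways):
    """Return hit/miss for each block reference; each set keeps LRU order."""
    cache = [OrderedDict() for _ in range(sets)]
    result = []
    for block in refs:
        s = cache[block % sets]
        if block in s:
            s.move_to_end(block)        # mark most recently used
            result.append("hit")
        else:
            if len(s) == ways:
                s.popitem(last=False)   # evict the least recently used block
            s[block] = True
            result.append("miss")
    return result

A, B, C = 0, 1, 3                       # B and C conflict in the direct-mapped cache
refs = [A, B, C, A, B, C, A]
print("direct-mapped:", simulate(refs, sets=2, ways=1))  # miss x3, hit, miss x2, hit
print("2-way LRU:    ", simulate(refs, sets=1, ways=2))  # all misses
```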

15 Exercise
 Suppose a computer's address size is k bits (byte addressing), the cache size is S bytes, the block size is B bytes, and the cache is A-way set-associative. Assume that B is a power of 2, so B = 2^b. Figure out the following quantities: the number of sets in the cache, the number of index bits in the address, and the number of bits needed to implement the cache.
 Given: address size = k bits; cache size = S bytes/cache; block size = B = 2^b bytes/block; associativity = A blocks/set.
 Number of sets/cache = (bytes/cache) / (bytes/set) = (bytes/cache) / ((blocks/set) x (bytes/block)) = S / (A x B).

16 Exercise (continued)
 Index bits: 2^(#index bits) = sets/cache = S / (A x B), so #index bits = log2(S / (A x B)) = log2(S / (A x 2^b)) = log2(S/A) - log2(2^b) = log2(S/A) - b.
 Tag address bits = total address bits - index bits - block offset bits = k - [log2(S/A) - b] - b = k - log2(S/A).
 Bits in tag memory/cache = (tag address bits/block) x (blocks/set) x (sets/cache) = [k - log2(S/A)] x A x S/(A x B) = (S/B) x [k - log2(S/A)].
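
The formulas translate directly into code. Below is a minimal sketch (Python), with k, S, B, A as defined in the exercise and an example configuration that is purely illustrative.

```python
import math

def cache_parameters(k, S, B, A):
    """k: address bits, S: cache bytes, B: block bytes (power of 2), A: associativity."""
    b = int(math.log2(B))              # block-offset bits
    sets = S // (A * B)                # sets per cache
    index_bits = int(math.log2(sets))  # = log2(S/A) - b
    tag_bits = k - index_bits - b      # = k - log2(S/A)
    tag_storage = tag_bits * (S // B)  # tag bits per block x blocks per cache
    return sets, index_bits, tag_bits, tag_storage

# Purely illustrative configuration: 32-bit addresses, 16 KiB cache,
# 16-byte blocks, 4-way set-associative.
print(cache_parameters(32, 16 * 1024, 16, 4))   # (256, 8, 20, 20480)
```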

