Chapter 7 Large and Fast: Exploiting Memory Hierarchy Bo Cheng.
Principle of locality: programs access a relatively small portion of their address space at any given time. Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon. Spatial locality (locality in space): if an item is referenced, items whose addresses are close by will tend to be referenced soon.
Basic Structure
Memory technology | Typical access time | $ per GB in 2004
SRAM | 0.5 - 5 ns | $4000 - $10,000
DRAM | 50 - 70 ns | $100 - $200
Magnetic disk | 5 ms - 20 ms | $0.05 - $2
The Principle By combining two concepts (locality and hierarchy): – Temporal Locality => Keep the most recently accessed data items closer to the processor – Spatial Locality => Move blocks consisting of multiple contiguous words to upper levels of the hierarchy
Memory hierarchy (II) Data is copied between adjacent levels (upper and lower). The minimum unit of information copied is a block. If the requested data appears in some block in the upper level, this is called a hit; otherwise it is a miss, and a block containing the requested data is copied from the lower level. The hit rate, or hit ratio, is the fraction of memory accesses found in the upper level. The miss rate (1.0 - hit rate) is the fraction not found in the upper level. Hit time: the time to access the upper level, including the time to determine whether the access is a hit or a miss. Miss penalty: the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the processor.
Cache A safe place for hiding or storing things. The level of the memory hierarchy between the processor and main memory. Also refers to any storage managed to take advantage of locality of access. Motivation: – high processor cycle speed – low memory cycle speed – fast access to recently used portions of a program's code and data
The Basic Cache Concept 1. The CPU is requesting data item Xn 2. The request results in a miss 3. The word Xn is brought from memory into cache
Direct Mapped Cache Each memory location is mapped to exactly one location in the cache. – address of the block modulo number of blocks in the cache. Answer two crucial questions – How do we know if a data item is in the cache? – If it is, how do we find it?
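The mapping rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the field widths in the usage lines are assumptions, not part of the slides. The index selects the one cache entry a block may occupy, and comparing the stored tag (together with the valid bit) answers whether the requested item is actually there.

```python
def direct_mapped_index(block_address: int, num_blocks: int) -> int:
    """Cache block index = block address modulo number of blocks in the cache."""
    return block_address % num_blocks


def split_address(byte_address: int, index_bits: int, offset_bits: int):
    """Split a byte address into (tag, index, offset) fields.

    Assumes the number of cache blocks is 2**index_bits and that the low
    offset_bits bits select a byte within the block.
    """
    offset = byte_address & ((1 << offset_bits) - 1)
    index = (byte_address >> offset_bits) & ((1 << index_bits) - 1)
    tag = byte_address >> (offset_bits + index_bits)
    return tag, index, offset


# Block 12 in an 8-block cache lands in entry 4.
print(direct_mapped_index(12, 8))

# Hypothetical address built from tag 5, index 3, byte offset 1.
print(split_address((5 << 12) | (3 << 2) | 1, index_bits=10, offset_bits=2))
```

When the number of blocks is a power of two, the modulo is simply the low-order bits of the block address, which is why the index can be taken straight from the address.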
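Cache Contents Tag: identifies whether a word in the cache corresponds to the requested word. Valid bit: indicates whether an entry contains a valid address. Data: the cached word. For a cache with 2^n one-word blocks and 32-bit byte addresses (n index bits, m tag bits, 2 byte-offset bits): Tag size m = 32 - n - 2; with n = 10, m = 32 - 10 - 2 = 20 bits. Total size = 2^index x (valid + tag + data) = 2^n x (1 + m + 4*8) bits.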
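Direct-Mapped Example A cache with 16 KB of data, 4-word blocks, and 32-bit addresses; each entry holds Valid, Tag, and Data fields. How many total bits are required for a direct-mapped cache? Each block holds 4 words x 4 bytes x 8 bits = 128 bits of data. The address splits into n index bits, m tag bits, and 4 offset bits (2 to select the word within the block, 2 to select the byte within the word): n + m + 4 = 32 (1). 16 KB = 4K words = 2^10 blocks → n = 10, so from (1) m = 18. Total bits = 2^10 x (1 + 18 + 4*4*8) = 2^10 x 147 = 147 Kbits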
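The arithmetic in this example can be checked with a short Python sketch. The function name and parameters are illustrative assumptions; it also assumes the block count and block size are powers of two.

```python
def direct_mapped_total_bits(data_bytes: int, words_per_block: int,
                             address_bits: int = 32, bytes_per_word: int = 4) -> int:
    """Total storage bits (valid + tag + data) for a direct-mapped cache."""
    block_bytes = words_per_block * bytes_per_word
    num_blocks = data_bytes // block_bytes
    index_bits = num_blocks.bit_length() - 1    # log2 of a power of two
    offset_bits = block_bytes.bit_length() - 1  # word-in-block + byte-in-word bits
    tag_bits = address_bits - index_bits - offset_bits
    bits_per_entry = 1 + tag_bits + block_bytes * 8
    return num_blocks * bits_per_entry


total = direct_mapped_total_bits(16 * 1024, words_per_block=4)
print(total, "bits =", total // 1024, "Kbits")
```

For the 16 KB cache this gives 1024 entries of 147 bits each, so the tag and valid bits add roughly 15% on top of the 128 Kbits of data.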
Mapping an address to a cache block Source: http://www.faculty.uaf.edu/ffdr/EE443/
Handling Cache Misses Stall the entire pipeline and fetch the requested word. Steps to handle an instruction cache miss: 1. Send the original PC value (current PC - 4) to the memory. 2. Instruct main memory to perform a read and wait for the memory to complete its access. 3. Write the cache entry, putting the data from memory in the data portion of the entry, writing the upper bits of the address (from the ALU) into the tag field, and turning the valid bit on. 4. Restart the instruction execution at the first step, which will refetch the instruction, this time finding it in the cache.
Write-Through A scheme in which writes always update both the cache and the memory, ensuring that data is always consistent between the two. Write buffer: – A queue that holds data while the data are waiting to be written to memory.
Write-Back A scheme that handles writes by updating values only in the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced. Pro: Improves performance, especially when writes are frequent (and couldn’t be handled by a write buffer) Con: More complex to implement
Cache Performance
CPU time = (CPU execution clock cycles + Memory-stall clock cycles) x Clock cycle time
Memory-stall clock cycles = Read-stall cycles + Write-stall cycles
– Read-stall cycles = (Reads/Program) x Read miss rate x Read miss penalty
– Write-stall cycles = ((Writes/Program) x Write miss rate x Write miss penalty) + Write buffer stalls
Assuming equal read and write miss penalties and negligible write buffer stalls:
Memory-stall clock cycles = (Memory accesses/Program) x Miss rate x Miss penalty
Memory-stall clock cycles = (Instructions/Program) x (Misses/Instruction) x Miss penalty
The Example Source: http://www.faculty.uaf.edu/ffdr/EE443/ CPI with memory stalls = 2 + 1.38 = 3.38 (a base CPI of 2 plus 1.38 memory-stall cycles per instruction)
What if …. What if the processor is made faster, but the memory system stays the same? – Speed up the machine by improving the CPI from 2 to 1 without increasing the clock rate. The CPI with memory stalls becomes 1 + 1.38 = 2.38, so the system with a perfect cache would be 2.38 / 1 = 2.38 times faster (compared with 3.38 / 2 = 1.69 times before the improvement). The fraction of execution time spent on memory stalls rises from 1.38/3.38 = 41% to 1.38/2.38 = 58%
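The numbers in this comparison follow directly from the stall-cycle formula. A minimal Python sketch, assuming 1.38 memory-stall cycles per instruction and base CPIs of 2 and 1 as on the slide (the function names are illustrative):

```python
def cpi_with_stalls(base_cpi: float, stall_cycles_per_instr: float) -> float:
    """Effective CPI = base CPI + memory-stall cycles per instruction."""
    return base_cpi + stall_cycles_per_instr


def stall_fraction(base_cpi: float, stall_cycles_per_instr: float) -> float:
    """Fraction of execution time spent waiting on memory stalls."""
    return stall_cycles_per_instr / cpi_with_stalls(base_cpi, stall_cycles_per_instr)


STALLS = 1.38  # memory-stall cycles per instruction, taken from the slide

for base_cpi in (2.0, 1.0):
    cpi = cpi_with_stalls(base_cpi, STALLS)
    speedup_if_perfect = cpi / base_cpi  # perfect cache means no stall cycles
    share = stall_fraction(base_cpi, STALLS)
    print(f"base CPI {base_cpi}: CPI {cpi:.2f}, "
          f"perfect-cache speedup {speedup_if_perfect:.2f}, "
          f"stall share {share:.0%}")
```

The stall cycles per instruction stay fixed, so halving the base CPI makes the same stalls a larger share of total time.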
Our Observations Relative cache penalties increase as a processor becomes faster. The lower the CPI, the more pronounced the impact of stall cycles. If the main memory system is the same, a higher CPU clock rate leads to a larger miss penalty (measured in clock cycles).
Decreasing miss ratio with associative cache direct-mapped cache: A cache structure in which each memory location is mapped to exactly one location in the cache. set-associative cache: A cache that has a fixed number of locations (at least two) where each block can be placed. fully associative cache: A cache structure in which a block can be placed in any location in the cache.
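The Example Where can memory block 12 be placed in a cache with eight blocks? Direct mapped: (12 mod 8) = 4, so only block 4. Two-way set associative (four sets): (12 mod 4) = 0, so either block of set 0. Fully associative: it can appear in any of the eight cache blocks.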
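One More Example – Direct Mapped
A cache with four blocks; block addresses referenced in the order 0, 8, 0, 6, 8.
Block address | Cache block
0 | (0 mod 4) = 0
6 | (6 mod 4) = 2
8 | (8 mod 4) = 0
Address of memory block accessed | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3
0 | miss | Memory[0] | - | - | -
8 | miss | Memory[8] | - | - | -
0 | miss | Memory[0] | - | - | -
6 | miss | Memory[0] | - | Memory[6] | -
8 | miss | Memory[8] | - | Memory[6] | -
5 misses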
Two-Way Set Associative Cache
Block address | Cache set
0 | (0 mod 2) = 0
6 | (6 mod 2) = 0
8 | (8 mod 2) = 0
Address of memory block accessed | Hit or miss | Set 0 contents | Set 1 contents
0 | miss | Memory[0] | -
8 | miss | Memory[0], Memory[8] | -
0 | hit | Memory[0], Memory[8] | -
6 | miss | Memory[0], Memory[6] | -
8 | miss | Memory[8], Memory[6] | -
4 misses
Which block to replace? The commonly used scheme is LRU. Least recently used (LRU): A replacement scheme in which the block replaced is the one that has been unused for the longest time.
The Implementation of 4-Way Set Associative Cache
Fully Associative Cache
Address of memory block accessed | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3
0 | miss | Memory[0] | - | - | -
8 | miss | Memory[0] | Memory[8] | - | -
0 | hit | Memory[0] | Memory[8] | - | -
6 | miss | Memory[0] | Memory[8] | Memory[6] | -
8 | hit | Memory[0] | Memory[8] | Memory[6] | -
3 misses
Increasing degree of associativity → decrease in miss rate
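The three traces (direct mapped, two-way set associative, fully associative) can be reproduced with one small simulator. This is a sketch under the assumptions of the examples: a four-block cache, LRU replacement within each set, and the reference sequence 0, 8, 0, 6, 8. The function name `count_misses` is illustrative.

```python
from collections import OrderedDict


def count_misses(references, num_blocks: int, ways: int) -> int:
    """Count misses for a cache with num_blocks blocks and `ways` blocks per set.

    ways == 1 is direct mapped; ways == num_blocks is fully associative.
    Each set is an OrderedDict kept in least- to most-recently-used order.
    """
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in references:
        cache_set = sets[block % num_sets]
        if block in cache_set:
            cache_set.move_to_end(block)        # hit: now most recently used
        else:
            misses += 1
            if len(cache_set) == ways:
                cache_set.popitem(last=False)   # evict least recently used
            cache_set[block] = True
    return misses


refs = [0, 8, 0, 6, 8]
for ways in (1, 2, 4):
    print(f"{ways}-way: {count_misses(refs, num_blocks=4, ways=ways)} misses")
```

With one block per set the sets are the cache blocks themselves, so the same code covers all three organizations by changing only `ways`.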
Performance of Multilevel Cache
Base CPI | 1
Clock rate | 5 GHz (0.2 ns per clock cycle)
Memory access time | 100 ns
Miss rate per instruction at the primary cache | 2%
Secondary cache access time (hit or miss) | 5 ns
Miss rate to main memory with the secondary cache | 0.5%
Original (primary cache only): the miss penalty to main memory is 100 / 0.2 = 500 clock cycles. Total CPI = 1 + Memory-stall cycles per instruction = 1 + 500 * 2% = 1 + 10 = 11
Multilevel: the miss penalty for an access to the secondary cache is 5 / 0.2 = 25 clock cycles. Total CPI = 1 + Primary-stall cycles per instruction + Secondary-stall cycles per instruction = 1 + (25 * 2%) + (500 * 0.5%) = 1 + 0.5 + 2.5 = 4.0
Speedup with the secondary cache: 11 / 4 = 2.75, or about 2.8
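The two CPI figures can be recomputed with a short Python sketch. The parameter values come from the slide; the function names are illustrative assumptions.

```python
def ns_to_cycles(time_ns: float, clock_ghz: float) -> float:
    """Convert a latency in ns to clock cycles (1 GHz = 1 cycle per ns)."""
    return time_ns * clock_ghz


def cpi_single_level(base_cpi, l1_miss_rate, memory_ns, clock_ghz):
    """CPI when every primary-cache miss goes to main memory."""
    return base_cpi + l1_miss_rate * ns_to_cycles(memory_ns, clock_ghz)


def cpi_two_level(base_cpi, l1_miss_rate, l2_ns, global_miss_rate,
                  memory_ns, clock_ghz):
    """CPI when primary misses go to a secondary cache, and only
    global_miss_rate of instructions continue on to main memory."""
    return (base_cpi
            + l1_miss_rate * ns_to_cycles(l2_ns, clock_ghz)
            + global_miss_rate * ns_to_cycles(memory_ns, clock_ghz))


original = cpi_single_level(1.0, 0.02, memory_ns=100, clock_ghz=5)
multilevel = cpi_two_level(1.0, 0.02, l2_ns=5, global_miss_rate=0.005,
                           memory_ns=100, clock_ghz=5)
print(f"original CPI {original:.2f}, multilevel CPI {multilevel:.2f}, "
      f"speedup {original / multilevel:.2f}")
```

Every primary miss pays the 25-cycle secondary access, and the 0.5% that also miss in the secondary cache pay the 500-cycle memory access on top of it.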
Designing the Memory System to Support Caches (I) Consider hypothetical memory system parameters:
– 1 memory bus clock cycle to send the address
– 15 memory bus clock cycles to initiate each DRAM access
– 1 memory bus clock cycle to transfer a word of data
– a cache block of 4 words
– a 1-word-wide bank of DRAMs
The miss penalty is: 1 + 4 × 15 + 4 × 1 = 65 clock cycles
Number of bytes transferred per clock cycle per miss: (4*4) / 65 ≈ 0.25
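The miss penalty for the one-word-wide memory can be written as a small Python function. The defaults mirror the hypothetical parameters listed on the slide; the names are illustrative.

```python
def miss_penalty_cycles(words_per_block: int, address_cycles: int = 1,
                        dram_access_cycles: int = 15,
                        transfer_cycles: int = 1) -> int:
    """Miss penalty for a one-word-wide DRAM bank.

    The address is sent once; each word then needs its own DRAM access
    and its own bus transfer, one after the other.
    """
    return (address_cycles
            + words_per_block * dram_access_cycles
            + words_per_block * transfer_cycles)


def bytes_per_cycle(words_per_block: int, bytes_per_word: int = 4) -> float:
    """Bytes delivered per bus clock cycle during one miss."""
    return words_per_block * bytes_per_word / miss_penalty_cycles(words_per_block)


print(miss_penalty_cycles(4))        # 65
print(round(bytes_per_cycle(4), 3))  # 0.246
```

The DRAM access term dominates (60 of the 65 cycles) because the four accesses are fully serialized.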
Designing the Memory System to Support Caches (II)
Virtual Memory The technique in which main memory acts as a "cache" for the secondary storage – automatically manages main memory and secondary storage Motivation – allow efficient sharing of memory among multiple programs – remove the programming burdens of a small, limited amount of main memory
Basic Concepts of Virtual Memory Virtual memory allows each program to exceed the size of primary memory. It automatically manages two levels of the memory hierarchy: – Main memory (physical memory) – Secondary storage Same concepts as in caches, different terminology: – A virtual memory block is a page – A virtual memory miss is a page fault The CPU produces a virtual address, which is translated to a physical address that is used to access main memory. This process (accomplished by a combination of HW and SW) is called memory mapping or address translation. Source: http://www.faculty.uaf.edu/ffdr/EE443/
Mapping from a Virtual to Physical Address Virtual address space: 2^32 = 4 GB. Physical address space: 2^30 = 1 GB.
High Cost of a Miss Page fault takes millions of cycles to process – E.g., main memory is 100,000 times faster than disk – This time is dominated by the time it takes to get the first word for typical page size Key decisions: – Page size large enough to amortize the high access time – Pick organization that reduces page fault rate (e.g., fully associative placement of pages) – Handle page faults in software (overhead is small compared to disk access times) and use clever algorithms for page placement – Use write-back
Page Table The table containing the virtual-to-physical address translations in a virtual memory system. – Resides in memory – Indexed with the page number from the virtual address – Contains the corresponding physical page number – Each program has its own page table – Hardware includes a register pointing to the start of the page table (page table register)
Page Table Size For example, consider 32-bit virtual addresses, a 4-KB page size, and 4 bytes per page table entry: Number of page table entries = 2^32 / 2^12 = 2^20. Size of page table = 2^20 x 4 bytes = 4 MB
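The page table size calculation generalizes easily. A minimal Python sketch, assuming a single-level page table with one entry per virtual page and a power-of-two page size (the function name is illustrative):

```python
def page_table_bytes(virtual_address_bits: int, page_bytes: int,
                     entry_bytes: int) -> int:
    """Size of a flat page table: one entry per virtual page."""
    offset_bits = page_bytes.bit_length() - 1   # log2 of the page size
    num_entries = 1 << (virtual_address_bits - offset_bits)
    return num_entries * entry_bytes


size = page_table_bytes(32, page_bytes=4 * 1024, entry_bytes=4)
print(size // (1024 * 1024), "MB")  # 4 MB
```

With 32-bit virtual addresses and 4-KB pages there are 2^32 / 2^12 = 2^20 virtual pages, and since each program has its own page table, this cost is paid per program.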
Page Faults A page fault occurs when the valid bit (V) of a page table entry is found to be 0: – Control is transferred to the operating system (using the exception mechanism) – The operating system must find the appropriate page in the next level of the hierarchy – It must then decide where to place the page in main memory Where is the page on disk? – The information can be found either in the same page table or in a separate structure. The OS creates the space on disk for all the pages of a process at the time it creates the process. At the same time, a data structure that records the location of each page is also created.
The Translation-Lookaside Buffer (TLB) Without help, each memory access by a program requires two memory accesses: – Obtain the physical address (reference the page table) – Get the data Because of the spatial and temporal locality within each page, a translation for a virtual page will likely be needed again in the near future. To speed this process up, include a special cache that keeps track of recently used translations: the TLB.
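Address translation with a TLB in front of the page table can be sketched as follows. This is a simplified model, not a description of any real hardware: the page table and the TLB are plain Python dicts mapping virtual page numbers to physical page numbers, a missing entry stands in for a valid bit of 0, the TLB has no size limit, and the 4-KB page size and all names are assumptions.

```python
PAGE_OFFSET_BITS = 12  # assumed 4-KB pages


class PageFault(Exception):
    """Raised when the page table has no valid entry for a virtual page."""


def translate(virtual_address: int, page_table: dict, tlb: dict) -> int:
    """Translate a virtual address, consulting the TLB before the page table."""
    vpn = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn in tlb:
        ppn = tlb[vpn]              # TLB hit: no page table access needed
    else:
        if vpn not in page_table:   # models valid bit == 0
            raise PageFault(f"virtual page {vpn:#x} is not in main memory")
        ppn = page_table[vpn]       # TLB miss: extra access to the page table
        tlb[vpn] = ppn              # remember the translation for next time
    return (ppn << PAGE_OFFSET_BITS) | offset


page_table = {0x12345: 0xABC}   # hypothetical mapping
tlb = {}
print(hex(translate(0x12345678, page_table, tlb)))  # 0xabc678
print(hex(translate(0x12345004, page_table, tlb)))  # 0xabc004, served by the TLB
```

The page offset passes through unchanged; only the page number is translated, which is why one TLB entry serves every access to the same page.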
What block is replaced on a miss? Which blocks are candidates for replacement: – In a fully associative cache – all blocks are candidates – In a set-associative cache – all the blocks in the set – In a direct-mapped cache – there is only one candidate In set-associative and fully associative caches, use one of two strategies: – 1. Random (use hardware assistance to make it fast) – 2. LRU (least recently used): usually too complicated to implement exactly, even for four-way associativity.
How Are Writes Handled? There are two basic options: – Write-through – The information is written to both the block in the cache and the block in the lower level of the memory hierarchy – Write-back – The modified block is written to the lower level only when it is replaced ADVANTAGES of WRITE-THROUGH – Misses are cheaper and simpler – Easier to implement (although it usually requires a write buffer) ADVANTAGES of WRITE-BACK – CPU can write at the rate that the cache can accept – Combined writes – Effective use of bandwidth (writing the entire block) Virtual memory is a special case – only write-back is practical
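The "combined writes" advantage of write-back can be illustrated by counting writes that reach the lower level. This is a deliberately simplified sketch: it models only stores, uses a direct-mapped cache that allocates a block on every store, and flushes remaining modified blocks at the end. The function name and the store sequence are assumptions for illustration.

```python
def count_lower_level_writes(store_blocks, num_blocks: int, policy: str) -> int:
    """Count writes sent to the lower level for a sequence of stores.

    store_blocks lists the block address touched by each store.
    """
    if policy == "write-through":
        return len(store_blocks)        # every store also updates memory
    if policy != "write-back":
        raise ValueError(f"unknown policy: {policy}")

    cache = {}                          # cache index -> modified block held there
    writes = 0
    for block in store_blocks:
        index = block % num_blocks
        if index in cache and cache[index] != block:
            writes += 1                 # modified block written back on replacement
        cache[index] = block
    return writes + len(cache)          # final flush of blocks still modified


stores = [0, 0, 0, 4, 4, 0]
print(count_lower_level_writes(stores, 4, "write-through"))  # 6
print(count_lower_level_writes(stores, 4, "write-back"))     # 3
```

Repeated stores to the same block are absorbed by the cache under write-back, so the lower level sees one write per replacement instead of one per store.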
The Big Picture 1. Where to place a block? – One place (direct-mapped) – A few places (set-associative) – Any place (fully associative) 2. How to find a block? – Indexing (direct-mapped) – Limited search (set-associative) – Full search (fully associative) – Separate lookup table (page table) 3. Which block should be replaced on a cache miss? – Random – LRU 4. What happens on a write? – Write-through – Write-back
The 3Cs Compulsory misses – caused by the first access to a block that has never been in the cache (cold-start misses) – INCREASE THE BLOCK SIZE (increase in miss penalty) Capacity misses – caused when the cache cannot contain all the blocks needed by the program. Blocks are being replaced and later retrieved again. – INCREASE THE SIZE (access time increases as well) Conflict misses – occur when multiple blocks compete for the same set (collision misses) – INCREASE ASSOCIATIVITY (may slow down access time)