
Chapter 7 Large and Fast: Exploiting Memory Hierarchy Bo Cheng

Principle of Locality Programs access only a relatively small portion of their address space at any given time. Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon. Spatial locality (locality in space): if an item is referenced, items whose addresses are close to it will tend to be referenced soon. (A small code sketch follows.)
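A minimal C sketch of both kinds of locality (the array size and loop structure are illustrative assumptions, not from the slides): the running total `sum` is touched on every iteration (temporal locality), and the inner loop walks consecutive addresses (spatial locality), which is why row-major traversal is cache-friendly.

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N][N];
    long sum = 0;

    /* Temporal locality: `sum` is reused on every iteration.
       Spatial locality: a[i][0], a[i][1], ... occupy consecutive
       addresses, so each block fetched on a miss serves several
       subsequent hits.                                           */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)   /* inner loop walks one row */
            sum += a[i][j];

    printf("sum = %ld\n", sum);
    return 0;
}
```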

Basic Structure

Memory technology | Typical access time | $ per GB in 2004
SRAM              | 0.5 - 5 ns          | $4,000 - $10,000
DRAM              | 50 - 70 ns          | $100 - $200
Magnetic disk     | 5 - 20 ms           | $0.50 - $2

The Principle By combining two concepts (locality and hierarchy): – Temporal locality => keep the most recently accessed data items closer to the processor – Spatial locality => move blocks consisting of multiple contiguous words to the upper levels of the hierarchy

Memory Hierarchy (I)

Memory Hierarchy (II) Data is copied only between adjacent levels. The minimum unit of information copied is a block. If the requested data appears in some block in the upper level, this is called a hit; otherwise it is a miss, and a block containing the requested data is copied from a lower level. The hit rate, or hit ratio, is the fraction of memory accesses found in the upper level. The miss rate (1.0 - hit rate) is the fraction not found in the upper level. Hit time: the time to access the upper level, including the time to determine whether the access is a hit or a miss. Miss penalty: the time to replace a block in the upper level with the corresponding block from the lower level.
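These definitions combine into the standard figure of merit, average memory access time (AMAT). A quick worked instance with assumed values (1 ns hit time, 5% miss rate, 100 ns miss penalty; these numbers are not from the slides): AMAT = hit time + miss rate x miss penalty = 1 ns + 0.05 x 100 ns = 6 ns.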

Memory Hierarchy (II)

Moore's Law

Cache A safe place for hiding or storing things. The level of the memory hierarchy between the processor and main memory. Also refers to any storage managed to take advantage of locality of access. Motivation: – high processor cycle speed – low memory cycle speed – fast access to recently used portions of a program's code and data

The Basic Cache Concept 1. The CPU requests data item Xn. 2. The request results in a miss. 3. The word Xn is brought from memory into the cache.

Direct Mapped Cache Each memory location is mapped to exactly one location in the cache: – cache index = address of the block modulo the number of blocks in the cache. This raises two crucial questions: – How do we know if a data item is in the cache? – If it is, how do we find it? (See the sketch below.)
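A minimal C sketch of how a direct-mapped cache answers both questions; the cache geometry (1024 blocks of one 4-byte word) is an assumption for illustration.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS  1024u   /* 2^10 blocks (assumed)     */
#define INDEX_BITS  10u
#define OFFSET_BITS 2u      /* one 4-byte word per block */

/* Split a 32-bit byte address into cache index and tag. */
int main(void) {
    uint32_t addr = 0x12345678u;

    uint32_t index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
    uint32_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);

    /* "Is it in the cache?" -> valid[index] && tags[index] == tag
       "How do we find it?"  -> data[index] holds the block        */
    printf("index = %u, tag = 0x%x\n", index, tag);
    return 0;
}
```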

The Example of Direct-Mapped Cache

(Figure: the lower bits of the address form the cache index; the upper bits form the tag.)

Cache Contents Tag: identifies whether a word in the cache corresponds to the requested word. Valid bit: indicates whether an entry contains a valid address. Data: the cached word itself. For a 32-bit address, one-word blocks, and 2^n blocks (n index bits, m tag bits): Tag size m = 32 - n - 2. Total size = 2^n x (valid + tag + data) = 2^n x (1 + m + 4 x 8) bits.

Direct-Mapped Example A cache holds 16 KB of data in 4-word blocks, with 32-bit addresses; each entry stores valid, tag, and data. How many total bits are required? Data per block: 4 x 4 x 8 = 128 bits. 16 KB = 4K words = 2^12 words = 2^10 blocks, so n = 10 index bits. With 2 bits of byte offset and 2 bits of block offset, the tag is m = 32 - 10 - 2 - 2 = 18 bits. Total bits = 2^10 x (1 + 18 + 4 x 4 x 8) = 2^10 x 147 = 147 Kbits.
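The same bookkeeping, generalized in a short C helper (a sketch; the parameters reproduce the 147 Kbit example above):

```c
#include <stdio.h>

/* Total direct-mapped cache bits for a 32-bit address space:
   2^n blocks of 2^w words each -> tag = 32 - n - w - 2 bits,
   and each entry stores valid + tag + data bits.             */
long cache_bits(int n, int w) {
    long blocks = 1L << n;
    int tag  = 32 - n - w - 2;   /* 2 bits of byte offset */
    int data = (1 << w) * 32;    /* block data, in bits   */
    return blocks * (1 + tag + data);
}

int main(void) {
    /* 16 KB of data, 4-word blocks: n = 10, w = 2 */
    printf("%ld Kbits\n", cache_bits(10, 2) / 1024);   /* prints 147 */
    return 0;
}
```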

Mapping an address to a cache block Source:

Block Size vs. Miss Rate

Handling Cache Misses Stall the entire pipeline and fetch the requested word. Steps to handle an instruction cache miss: 1. Send the original PC value (the current PC - 4) to the memory. 2. Instruct main memory to perform a read and wait for the memory to complete its access. 3. Write the cache entry: put the data from memory in the data portion of the entry, write the upper bits of the address (from the ALU) into the tag field, and turn the valid bit on. 4. Restart the instruction execution at the first step, which will refetch the instruction, this time finding it in the cache.

Write-Through A scheme in which writes always update both the cache and main memory, ensuring that data is always consistent between the two. Write buffer: – a queue that holds data while the data are waiting to be written to memory, so the processor need not stall on every write.

Write-Back A scheme that handles writes by updating values only in the block in the cache, then writing the modified block to the lower level of the hierarchy when the block is replaced. Pro: improves performance, especially when writes are frequent (and could not be absorbed by a write buffer). Con: more complex to implement.
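A compact C sketch contrasting the two policies on a store. The single-entry "cache" and its helper functions are hypothetical, used only to make the difference concrete; real hardware operates on whole blocks.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t data;
    bool     dirty;      /* meaningful only for write-back */
} line_t;

static uint32_t memory_word;   /* stand-in for main memory */

void store_write_through(line_t *l, uint32_t v) {
    l->data = v;
    memory_word = v;     /* memory updated on every write; a write
                            buffer would hide this latency          */
}

void store_write_back(line_t *l, uint32_t v) {
    l->data = v;
    l->dirty = true;     /* memory updated only later, on eviction */
}

void evict_write_back(line_t *l) {
    if (l->dirty) {
        memory_word = l->data;   /* one bulk write as the block leaves */
        l->dirty = false;
    }
}
```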

Cache Performance CPU time = (CPU execution clock cycles + memory-stall clock cycles) x clock cycle time. Memory-stall clock cycles = read-stall cycles + write-stall cycles – Read-stall cycles = (reads/program) x read miss rate x read miss penalty – Write-stall cycles = ((writes/program) x write miss rate x write miss penalty) + write buffer stalls. With a single miss rate and miss penalty: Memory-stall clock cycles = (memory accesses/program) x miss rate x miss penalty = (instructions/program) x (misses/instruction) x miss penalty.
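These formulas translate directly into a few lines of C (a sketch; the sample numbers below are assumptions, not the slide example):

```c
#include <stdio.h>

/* CPU time = (CPU cycles + memory-stall cycles) x clock cycle time,
   with stalls = accesses x miss rate x miss penalty.               */
int main(void) {
    double instructions   = 1e9;
    double base_cpi       = 1.0;
    double accesses_per_i = 1.33;    /* fetch + some data accesses */
    double miss_rate      = 0.02;
    double miss_penalty   = 100.0;   /* cycles */
    double cycle_time     = 0.5e-9;  /* 2 GHz  */

    double stalls = instructions * accesses_per_i
                  * miss_rate * miss_penalty;
    double cpu_time = (instructions * base_cpi + stalls) * cycle_time;

    printf("CPU time = %.3f s\n", cpu_time);
    return 0;
}
```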

The Example Assume a processor with a base CPI of 2 whose cache misses add 1.38 memory-stall cycles per instruction, giving a total CPI of 2 + 1.38 = 3.38.

What if …. What if the processor is made faster, but the memory system stays the same? – Speed up the machine by improving the CPI from 2 to 1 without increasing the clock rate. The stall cycles per instruction remain 1.38, so the total CPI becomes 2.38. A system with a perfect cache would be 2.38 / 1 = 2.38 times faster. The fraction of time spent on memory stalls rises from 1.38/3.38 = 41% to 1.38/2.38 = 58%.

What if ….

Our Observations Relative cache penalties increase as a processor becomes faster. The lower the CPI, the more pronounced the impact of stall cycles. If the main memory system stays the same, a higher CPU clock rate leads to a larger miss penalty (measured in clock cycles).

Decreasing the miss ratio with associative caches Direct-mapped cache: a cache structure in which each memory location is mapped to exactly one location in the cache. Set-associative cache: a cache that has a fixed number of locations (at least two) where each block can be placed. Fully associative cache: a cache structure in which a block can be placed in any location in the cache.

The Example Where can memory block 12 be placed in an eight-block cache? – Direct-mapped: exactly one block, (12 mod 8) = 4. – Two-way set associative (four sets): anywhere in set (12 mod 4) = 0. – Fully associative: in any of the eight cache blocks.

One More Example – Direct Mapped (four blocks; index = block address mod 4)

Block address -> cache index: 0 -> 0, 6 -> 2, 8 -> 0

Block accessed | Hit or miss | Cache contents after reference
0              | miss        | Mem[0]
8              | miss        | Mem[8]
0              | miss        | Mem[0]
6              | miss        | Mem[0], Mem[6]
8              | miss        | Mem[8], Mem[6]

Total: 5 misses.

Two-Way Set Associative Cache (two sets; set = block address mod 2)

Block address -> set: 0 -> 0, 6 -> 0, 8 -> 0

Block accessed | Hit or miss | Set 0 contents after reference
0              | miss        | Mem[0]
8              | miss        | Mem[0], Mem[8]
0              | hit         | Mem[0], Mem[8]
6              | miss        | Mem[0], Mem[6]
8              | miss        | Mem[8], Mem[6]

Total: 4 misses. Which block to replace? The commonly used scheme is LRU. Least recently used (LRU): a replacement scheme in which the block replaced is the one that has been unused for the longest time. (A simulation of this trace appears below.)
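The trace above can be replayed with a tiny two-way LRU simulator in C (a sketch under the same assumptions: two sets, block addresses as input):

```c
#include <stdio.h>

#define SETS 2
#define WAYS 2

int tags[SETS][WAYS];   /* stored block addresses; -1 = invalid */
int lru[SETS];          /* index of the least recently used way */

/* Returns 1 on hit, 0 on miss; replaces the LRU way on a miss. */
int access(int block) {
    int set = block % SETS;
    for (int w = 0; w < WAYS; w++)
        if (tags[set][w] == block) {
            lru[set] = 1 - w;        /* the other way is now LRU */
            return 1;
        }
    int victim = lru[set];
    tags[set][victim] = block;
    lru[set] = 1 - victim;
    return 0;
}

int main(void) {
    int trace[] = {0, 8, 0, 6, 8}, misses = 0;
    for (int s = 0; s < SETS; s++)
        for (int w = 0; w < WAYS; w++) tags[s][w] = -1;
    for (int i = 0; i < 5; i++)
        if (!access(trace[i])) misses++;
    printf("%d misses\n", misses);   /* prints 4, matching the table */
    return 0;
}
```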

The Implementation of 4-Way Set Associative Cache

Fully Associative Cache

Block accessed | Hit or miss | Cache contents after reference
0              | miss        | Mem[0]
8              | miss        | Mem[0], Mem[8]
0              | hit         | Mem[0], Mem[8]
6              | miss        | Mem[0], Mem[8], Mem[6]
8              | hit         | Mem[0], Mem[8], Mem[6]

Total: 3 misses. Increasing the degree of associativity decreases the miss rate.

Performance of Multilevel Cache

Base CPI                                      1
Clock rate                                    5 GHz (0.2 ns per cycle)
Main memory access time                       100 ns
Miss rate per instruction, primary cache      2%
Secondary (L2) cache access time (hit or miss) 5 ns
Miss rate to main memory with the L2 cache    0.5%

Original (single-level): miss penalty = 100 ns / 0.2 ns = 500 clock cycles. Total CPI = 1 + memory-stall cycles per instruction = 1 + 500 x 2% = 1 + 10 = 11.

Multilevel: L2 miss penalty = 5 ns / 0.2 ns = 25 clock cycles. Total CPI = 1 + primary stalls per instruction + secondary stalls per instruction = 1 + (25 x 2%) + (500 x 0.5%) = 1 + 0.5 + 2.5 = 4.0.

Speedup from the second-level cache: 11 / 4 = 2.8.
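The same arithmetic as a C sketch (values copied from the table above):

```c
#include <stdio.h>

int main(void) {
    double cycle_ns = 1.0 / 5.0;         /* 5 GHz -> 0.2 ns per cycle */
    double mem_pen  = 100.0 / cycle_ns;  /* 500 cycles to main memory */
    double l2_pen   = 5.0 / cycle_ns;    /*  25 cycles to L2          */

    double cpi_single = 1.0 + mem_pen * 0.02;                  /* 11.0 */
    double cpi_multi  = 1.0 + l2_pen * 0.02 + mem_pen * 0.005; /*  4.0 */

    printf("speedup = %.2f\n", cpi_single / cpi_multi);        /* 2.75 */
    return 0;
}
```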

Designing the Memory System to Support Caches (I) Consider hypothetical memory system parameters: 1 memory bus clock cycle to send the address; 15 memory bus clock cycles to initiate each DRAM access; 1 memory bus clock cycle to transfer a word of data; a 4-word cache block; a 1-word-wide bank of DRAMs. The miss penalty is 1 + 4 x 15 + 4 x 1 = 65 clock cycles. Number of bytes transferred per clock cycle per miss: (4 x 4) / 65 = 0.25.
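For comparison, a hedged C calculation of the miss penalty under the three standard organizations the next slide illustrates (one-word-wide, four-word-wide, and four-way interleaved memory; the wider-bus and interleaved results are the textbook's usual numbers, not stated on this slide):

```c
#include <stdio.h>

int main(void) {
    int addr = 1, dram = 15, xfer = 1, words = 4;

    int narrow      = addr + words * dram + words * xfer;  /* 65 cycles */
    int wide        = addr + dram + xfer;                  /* 17 cycles */
    int interleaved = addr + dram + words * xfer;          /* 20 cycles */

    printf("narrow=%d wide=%d interleaved=%d\n",
           narrow, wide, interleaved);
    return 0;
}
```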

Designing the Memory System to Support Caches (II)

Virtual Memory The technique in which main memory acts as a "cache" for the secondary storage – automatically manages main memory and secondary storage Motivation – allow efficient sharing of memory among multiple programs – remove the programming burdens of a small, limited amount of main memory

Basic Concepts of Virtual Memory Virtual memory allows each program to exceed the size of primary memory. It automatically manages two levels of the memory hierarchy: – main memory (physical memory) – secondary storage. Same concepts as in caches, different terminology: a virtual memory block is a page; a virtual memory miss is a page fault. The CPU produces a virtual address, which is translated to a physical address used to access main memory. This process (accomplished by a combination of HW and SW) is called memory mapping or address translation. Source:

Mapping from a Virtual to a Physical Address (figure: a 2^32-byte = 4 GB virtual address space mapped onto a 2^30-byte = 1 GB physical memory)

High Cost of a Miss Page fault takes millions of cycles to process – E.g., main memory is 100,000 times faster than disk – This time is dominated by the time it takes to get the first word for typical page size Key decisions: – Page size large enough to amortize the high access time – Pick organization that reduces page fault rate (e.g., fully associative placement of pages) – Handle page faults in software (overhead is small compared to disk access times) and use clever algorithms for page placement – Use write-back

Page Table The structure containing the virtual-to-physical address translations in a virtual memory system. – Resides in memory. – Indexed with the page number from the virtual address. – Contains the corresponding physical page number. – Each program has its own page table. – Hardware includes a register pointing to the start of the page table (the page table register).
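A minimal C sketch of the translation a page table performs (4 KB pages and a flat 2^20-entry table are assumed; the array stands in for the in-memory table that the page table register points to):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12u             /* 4 KB pages          */
#define NUM_PAGES (1u << 20)      /* 2^32 / 2^12 entries */

typedef struct {
    uint32_t ppn;    /* physical page number           */
    int      valid;  /* page resident in main memory?  */
} pte_t;

static pte_t page_table[NUM_PAGES];   /* indexed by virtual page number */

/* Translate a virtual address; a valid bit of 0 means a page fault. */
uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    if (!page_table[vpn].valid) {
        fprintf(stderr, "page fault at 0x%x\n", vaddr);
        return 0;    /* real hardware traps to the OS here */
    }
    return (page_table[vpn].ppn << PAGE_BITS) | offset;
}

int main(void) {
    page_table[5].ppn = 42;
    page_table[5].valid = 1;
    printf("0x%x\n", translate((5u << PAGE_BITS) | 0x123));  /* 0x2a123 */
    return 0;
}
```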

Page Table Size For example: consider 32-bit virtual addresses, a 4 KB page size, and 4 bytes per page table entry. Number of page table entries = 2^32 / 2^12 = 2^20. Size of the page table = 2^20 x 4 B = 4 MB.

Page Faults Occur when a valid bit (V) is found to be 0: – transfer control to the operating system (using the exception mechanism) – the operating system must find the appropriate page in the next level of the hierarchy – and decide where to place it in main memory. Where is the page on disk? – The information can be found either in the same page table or in a separate structure. The OS creates the space on disk for all the pages of a process at the time it creates the process. At the same time, a data structure that records the location of each page is also created.

The Translation-Lookaside Buffer (TLB) Each memory access by a program otherwise requires two memory accesses: – one to obtain the physical address (referencing the page table) – one to get the data. Because of the spatial and temporal locality within each page, a translation for a virtual page will likely be needed again in the near future. To speed this process up, the hardware includes a special cache that keeps track of recently used translations: the TLB.
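A small fully associative TLB in front of the page-table walk might look like this C sketch (the entry count, round-robin replacement, and the stubbed page-table lookup are assumptions for illustration):

```c
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 8

typedef struct { uint32_t vpn, ppn; int valid; } tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];
static int next_victim;   /* simple round-robin replacement */

/* Stub standing in for the full page-table walk (the slow path). */
static uint32_t page_table_lookup(uint32_t vpn) { return vpn + 100; }

/* A TLB hit removes the page-table reference from the common case,
   so most accesses cost one memory reference instead of two.      */
uint32_t tlb_translate(uint32_t vpn) {
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return tlb[i].ppn;                    /* TLB hit  */

    uint32_t ppn = page_table_lookup(vpn);        /* TLB miss */
    tlb[next_victim] = (tlb_entry_t){ vpn, ppn, 1 };
    next_victim = (next_victim + 1) % TLB_ENTRIES;
    return ppn;
}

int main(void) {
    uint32_t p1 = tlb_translate(7);   /* miss: walks the stub table */
    uint32_t p2 = tlb_translate(7);   /* hit: served from the TLB   */
    printf("%u %u\n", p1, p2);        /* 107 107 */
    return 0;
}
```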

The Translation-Lookaside Buffer (TLB)

Processing read/write requests

Where Can a Block Be Placed? An increase in the degree of associativity usually decreases the miss rate. The improvement in miss rate comes from reduced competition for the same location.

How Is a Block Found?

What block is replaced on a miss? Which blocks are candidates for replacement: – in a fully associative cache, all blocks are candidates – in a set-associative cache, all the blocks in the set – in a direct-mapped cache, there is only one candidate. In set-associative and fully associative caches, use one of two strategies: – 1. Random (use hardware assistance to make it fast). – 2. LRU (least recently used), which is usually too complicated to implement exactly even for four-way associativity.

How Are Writes Handled? There are two basic options: – Write-through: the information is written to both the block in the cache and the block in the lower level of the memory hierarchy. – Write-back: the modified block is written to the lower level only when it is replaced. ADVANTAGES of WRITE-THROUGH: – misses are cheaper and simpler – easier to implement (although it usually requires a write buffer). ADVANTAGES of WRITE-BACK: – the CPU can write at the rate the cache can accept – multiple writes to a block are combined – effective use of bandwidth (writing the entire block at once). Virtual memory is a special case: only write-back is practical.

The Big Picture 1. Where can a block be placed? – one place (direct-mapped) – a few places (set-associative) – any place (fully associative) 2. How is a block found? – indexing (direct-mapped) – limited search (set-associative) – full search (fully associative) – separate lookup table (page table) 3. Which block should be replaced on a cache miss? – random – LRU 4. What happens on a write? – write-through – write-back

The 3Cs Compulsory misses – caused by the first access to a block that has never been in the cache (also called cold-start misses). Remedy: INCREASE THE BLOCK SIZE (at the cost of a larger miss penalty). Capacity misses – caused when the cache cannot contain all the blocks needed by the program, so blocks are replaced and later retrieved again. Remedy: INCREASE THE CACHE SIZE (access time increases as well). Conflict misses – occur when multiple blocks compete for the same set (also called collision misses). Remedy: INCREASE ASSOCIATIVITY (may slow down access time).

The Design Challenges