Lecture 22: Cache Hierarchies, Memory

Lecture 22: Cache Hierarchies, Memory

Today's topics:
- Cache hierarchies
- DRAM main memory

Locality

Why do caches work?
- Temporal locality: if you used some data recently, you will likely use it again
- Spatial locality: if you used some data recently, you will likely access its neighbors

No hierarchy: average access time for data = 300 cycles
32 KB 1-cycle L1 cache with a 95% hit rate (a miss pays the L1 lookup plus
the memory latency): average access time = 0.95 x 1 + 0.05 x (1 + 300) = 16 cycles
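
As a quick check of the arithmetic, a minimal sketch of the average-access-time
computation (the 1-cycle L1, 95% hit rate, and 300-cycle memory latency are the
numbers from this slide):

    #include <stdio.h>

    int main(void) {
        double l1_latency  = 1.0;    /* L1 hit time, in cycles           */
        double mem_latency = 300.0;  /* time to fetch data from memory   */
        double hit_rate    = 0.95;   /* fraction of accesses that hit L1 */

        /* A miss pays the L1 lookup plus the full memory latency. */
        double amat = hit_rate * l1_latency
                    + (1.0 - hit_rate) * (l1_latency + mem_latency);
        printf("average access time = %.2f cycles\n", amat);  /* 16.00 */
        return 0;
    }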

Accessing the Cache

[Figure: byte address 101000 split into a 3-bit set index and a 3-bit offset,
selecting one set of the data array]

- 8-byte words (blocks); 8 words in the cache: 3 offset bits, 3 index bits
- Direct-mapped cache: each address maps to a unique cache location

The Tag Array

[Figure: byte address 101000; the index selects an entry in both the tag array
and the data array, and the stored tag is compared against the address's tag bits]

- 8-byte words; direct-mapped cache: each address maps to a unique cache
  location, so a single tag comparison determines hit or miss

Example Access Pattern

Assume that addresses are 8 bits long. How many of the following address
requests are hits/misses?

  4, 7, 10, 13, 16, 68, 73, 78, 83, 88, 4, 7, 10...

(Same direct-mapped cache as on the previous slides: 8-byte blocks, 8 sets;
a worked simulation follows below.)
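
The hit/miss sequence can be checked with a short simulation. A sketch,
assuming the cache pictured above (direct-mapped, 8 sets, 8-byte blocks,
byte addresses):

    #include <stdio.h>
    #include <stdbool.h>

    #define SETS        8   /* 3 index bits  */
    #define BLOCK_BYTES 8   /* 3 offset bits */

    int main(void) {
        int  tags[SETS];
        bool valid[SETS] = { false };
        int  addrs[] = { 4, 7, 10, 13, 16, 68, 73, 78, 83, 88, 4, 7, 10 };
        int  n = sizeof(addrs) / sizeof(addrs[0]);

        for (int i = 0; i < n; i++) {
            int  block = addrs[i] / BLOCK_BYTES;  /* strip the offset bits */
            int  index = block % SETS;            /* which set             */
            int  tag   = block / SETS;            /* remaining high bits   */
            bool hit   = valid[index] && tags[index] == tag;
            printf("addr %3d -> set %d: %s\n",
                   addrs[i], index, hit ? "hit" : "miss");
            valid[index] = true;  /* on a miss, fetch the block ... */
            tags[index]  = tag;   /* ... and install its tag        */
        }
        return 0;
    }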

Increasing Line Size

[Figure: byte address 10100000 with a 32-byte cache line (block) size; the
offset field widens and the tag field shrinks]

- A large cache line size → smaller tag array, fewer misses because of
  spatial locality

Associativity

[Figure: byte address 10100000 indexing a 2-way set-associative cache; Way-1
and Way-2 of the tag array are both compared against the address's tag bits]

- Set associativity → fewer conflicts; wasted power because multiple data
  blocks and tags are read

Associativity

How many offset/index/tag bits if the cache has 64 sets, each set has
64 bytes, and 4 ways?

Example 1

32 KB 4-way set-associative data cache array with 32-byte line size
- How many sets?
- How many index bits, offset bits, tag bits?
- How large is the tag array?

Example 1

32 KB 4-way set-associative data cache array with 32-byte line size
(cache size = #sets x #ways x block size)
- How many sets? 32 KB / (4 ways x 32 B) = 256 sets
- How many index bits, offset bits, tag bits? 8 index (2^8 = 256 sets),
  5 offset (2^5 = 32 B), and, assuming 32-bit addresses, 32 - 8 - 5 = 19 tag
- How large is the tag array?
  tag array size = #sets x #ways x tag size = 256 x 4 x 19 bits
  = 19 Kb = 2.375 KB
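
The same bookkeeping, written as code (a sketch; the 32-bit address width is
an assumption, matching the slide's 19-bit tag):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int cache_bytes = 32 * 1024;  /* 32 KB data array      */
        int ways        = 4;
        int block_bytes = 32;
        int addr_bits   = 32;         /* assumed address width */

        int sets   = cache_bytes / (ways * block_bytes);  /* 256 */
        int offset = (int)log2(block_bytes);              /*   5 */
        int index  = (int)log2(sets);                     /*   8 */
        int tag    = addr_bits - index - offset;          /*  19 */

        int tag_array_bits = sets * ways * tag;           /* 19,456 bits */
        printf("sets=%d offset=%d index=%d tag=%d\n",
               sets, offset, index, tag);
        printf("tag array = %d bits = %.3f KB\n",
               tag_array_bits, tag_array_bits / 8.0 / 1024.0);  /* 2.375 */
        return 0;
    }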

Example 2

- A pipeline has CPI 1 if all loads/stores are L1 cache hits
- 40% of all instructions are loads/stores
- 85% of all loads/stores hit in the 1-cycle L1
- 50% of all (10-cycle) L2 accesses are misses
- Memory access takes 100 cycles
- What is the CPI?

Example 2

(Same assumptions as above.) Start with 1000 instructions:

  1000 cycles (includes all 400 L1 accesses)
  + 400 (l/s) x 15% x 10 cycles (the L2 accesses)       =   600 cycles
  + 400 x 15% x 50% x 100 cycles (the memory accesses)  = 3,000 cycles

  Total = 4,600 cycles, so CPI = 4,600 / 1,000 = 4.6
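
The same accounting in code (a sketch of the per-1000-instruction bookkeeping
above):

    #include <stdio.h>

    int main(void) {
        double instrs  = 1000.0;
        double ls_frac = 0.40;  /* loads/stores per instruction  */
        double l1_miss = 0.15;  /* 85% of loads/stores hit in L1 */
        double l2_miss = 0.50;  /* half of all L2 accesses miss  */
        double l2_lat  = 10.0, mem_lat = 100.0;

        double ls     = instrs * ls_frac;  /* 400 memory operations */
        double cycles = instrs             /* base CPI of 1         */
                      + ls * l1_miss * l2_lat              /* 60 L2 accesses  */
                      + ls * l1_miss * l2_miss * mem_lat;  /* 30 memory trips */
        printf("CPI = %.1f\n", cycles / instrs);           /* 4.6 */
        return 0;
    }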

Cache Misses

- On a write miss, you may either choose to bring the block into the cache
  (write-allocate) or not (write-no-allocate)
- On a read miss, you always bring the block in (spatial and temporal
  locality) – but which block do you replace?
   - no choice for a direct-mapped cache
   - randomly pick one of the ways to replace
   - replace the way that was least-recently used (LRU) – see the sketch below
   - FIFO replacement (round-robin)
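
For a small number of ways, LRU can be tracked with a timestamp per way. A
minimal sketch of one set (not tied to any particular hardware design; real
hardware typically uses a cheaper approximation):

    #include <limits.h>

    #define WAYS 4

    /* One set of a set-associative cache; stamp[w] records the
     * (logical) time at which way w was last touched. */
    struct set {
        int  tag[WAYS];
        int  valid[WAYS];
        long stamp[WAYS];
    };

    /* Pick the victim way: a free (invalid) way if one exists,
     * otherwise the way with the oldest timestamp. */
    int lru_victim(const struct set *s) {
        long oldest = LONG_MAX;
        int  victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!s->valid[w])
                return w;              /* free slot wins outright */
            if (s->stamp[w] < oldest) {
                oldest = s->stamp[w];
                victim = w;
            }
        }
        return victim;
    }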

Writes

- When you write into a block, do you also update the copy in L2?
   - write-through: every write to L1 → a write to L2
   - write-back: mark the block as dirty; when the block gets replaced from
     L1, write it to L2
- Write-back coalesces multiple writes to an L1 block into one L2 write
- Write-through simplifies coherence protocols in a multiprocessor system,
  as the L2 always has a current copy of the data
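
A write-back L1 needs only a dirty bit per block to coalesce writes; the L2
is touched only when a dirty block is evicted. A sketch (writeback_to_L2 is
a hypothetical stand-in for the path to the next level):

    #include <stdbool.h>

    struct block {
        int  tag;
        bool valid;
        bool dirty;  /* set on any write; checked on eviction */
    };

    void writeback_to_L2(int tag);  /* hypothetical: push block to L2 */

    /* A store hit under write-back: update L1 only and mark the
     * block dirty; L2's copy is now stale. */
    void write_hit(struct block *b) {
        b->dirty = true;
    }

    /* On replacement, only dirty blocks cost an L2 write;
     * clean blocks are simply dropped. */
    void evict(struct block *b) {
        if (b->valid && b->dirty)
            writeback_to_L2(b->tag);
        b->valid = false;
        b->dirty = false;
    }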

Types of Cache Misses

- Compulsory misses: happen the first time a memory word is accessed – the
  misses for an infinite cache
- Capacity misses: happen because the program touched many other words before
  re-touching the same word – the misses for a fully-associative cache
- Conflict misses: happen because two words map to the same location in the
  cache – the misses generated when moving from a fully-associative to a
  direct-mapped cache

Off-Chip DRAM Main Memory

- Main memory is stored in DRAM cells, which have much higher storage density
- DRAM cells lose their state over time and must be refreshed periodically,
  hence the name Dynamic
- A number of DRAM chips are aggregated on a DIMM to provide high capacity –
  a DIMM is a module that plugs into a bus on the motherboard
- DRAM access suffers from long access time and high energy overhead

Memory Architecture

[Figure: processor and memory controller connected to DIMMs over address/cmd
and data buses; each DRAM bank has its own row buffer]

- DIMM: a PCB with DRAM chips on the back and front
- The memory system is organized into ranks and banks; each bank can process
  a transaction in parallel
- Each bank has a row buffer that retains the last row touched in the bank
  (it's like a cache in the memory system that exploits spatial locality);
  row buffer hits have lower latency than row buffer misses
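
The row buffer's effect on latency can be modeled as a one-entry cache per
bank. A sketch under an open-page policy (the latencies are made-up
placeholders, not datasheet numbers):

    #include <stdio.h>

    #define BANKS           8
    #define ROW_HIT_CYCLES  20  /* assumed: column access only            */
    #define ROW_MISS_CYCLES 60  /* assumed: precharge + activate + column */

    static int open_row[BANKS];  /* row currently latched in each bank's
                                    row buffer (-1 = no open row) */

    /* Latency of one access under an open-page policy. */
    int dram_access(int bank, int row) {
        if (open_row[bank] == row)
            return ROW_HIT_CYCLES;   /* row buffer hit             */
        open_row[bank] = row;        /* activate brings in new row */
        return ROW_MISS_CYCLES;      /* row buffer miss            */
    }

    int main(void) {
        for (int b = 0; b < BANKS; b++) open_row[b] = -1;  /* all closed */
        printf("%d\n", dram_access(0, 5));  /* 60: first touch misses   */
        printf("%d\n", dram_access(0, 5));  /* 20: same row, buffer hit */
        return 0;
    }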
