Lecture 20

Last lecture:
  Types of memory
  Memory hierarchy
Today's lecture:
  Cache mapping schemes
  Block replacement
  Write strategy

Four Basic Questions on Caches

Q1: Where can a block be placed in a cache? (Block placement)
    Direct Mapped, Set Associative, Fully Associative
Q2: How is a block found in a cache? (Block identification)
    Block address, tag, offset
Q3: Which block should be replaced on a miss? (Block replacement)
    FIFO, Random, LRU
Q4: What happens on a write? (Write strategy)
    Write Back or Write Through (with Write Buffer)

Block Placement: Direct Mapped

Block: the unit of data transfer between cache and memory.
Direct Mapped Cache: a block can be placed in exactly one location in the cache:
  index = (block address) modulo (#blocks in cache)

Direct-Mapped Cache

A memory address is divided into:
  Block address: identifies the block in memory
  Block offset: selects a byte within the block
The block address is further divided into:
  Index: used for direct cache access (index = block address mod #cache blocks)
  Tag: the most-significant bits of the block address
The tag must also be stored inside the cache, for block identification.
A valid bit is also required, to indicate whether a cache block is valid or not.
[Figure: the index selects a cache entry holding V, Tag, and Block Data; the stored tag is compared (=) with the address tag and ANDed with V to produce Hit; the block offset selects the requested data within the block]

Direct-Mapped Cache (cont'd)

Cache hit: the block is stored inside the cache
  The index is used to access the cache block
  The address tag is compared against the stored tag
  If they are equal and the cache block is valid, it is a hit; otherwise, a cache miss
If the number of cache blocks is 2^n, then n bits are used for the cache index
If the number of bytes in a block is 2^b (byte-addressable), then b bits are used for the block offset
If 32 bits are used for an address, then 32 - n - b bits are used for the tag
Cache data size = 2^(n+b) bytes

Mapping an Address to a Cache Block: Example

Consider a direct-mapped cache with 256 blocks, block size = 16 bytes (byte-addressable), and 32-bit physical addresses.
Compute the tag, index, and byte offset of address 0x01FFF8AC.

Solution: the 32-bit address is divided into:
  4-bit byte offset field, because block size = 2^4 = 16 bytes
  8-bit cache index, because there are 2^8 = 256 blocks in the cache
  20-bit tag field
Byte offset = 0xC = 12 (least-significant 4 bits of the address)
Cache index = 0x8A = 138 (next lower 8 bits of the address)
Tag = 0x01FFF (upper 20 bits of the address)
Address fields: Tag (20 bits) | Index (8 bits) | Offset (4 bits)
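To make the field extraction concrete, here is a minimal Python sketch (our own illustration, not from the original slides; the function name decompose and its parameters are assumptions) that splits a 32-bit address for this cache geometry:

```python
def decompose(addr, offset_bits=4, index_bits=8):
    """Split a byte address into (tag, index, offset) fields.
    Hypothetical helper for illustration; widths match the example above."""
    offset = addr & ((1 << offset_bits) - 1)                 # low 4 bits
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)  # next 8 bits
    tag = addr >> (offset_bits + index_bits)                 # upper 20 bits
    return tag, index, offset

tag, index, offset = decompose(0x01FFF8AC)
assert (tag, index, offset) == (0x01FFF, 0x8A, 0xC)
print(f"tag=0x{tag:05X}  index=0x{index:02X} ({index})  offset=0x{offset:X} ({offset})")
```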

Example on Cache Placement & Misses

Consider a small direct-mapped cache with 32 blocks; the cache is initially empty, and block size = 16 bytes.
The following memory addresses (in decimal) are referenced: 1000, 1004, 1008, 2548, 2552, 2556.
Map the addresses to cache blocks and indicate whether each access is a hit or a miss.

Solution (address fields: Tag = 23 bits, Index = 5 bits, Offset = 4 bits):
  1000 = 0x3E8  cache index = 0x1E  Miss (first access)
  1004 = 0x3EC  cache index = 0x1E  Hit
  1008 = 0x3F0  cache index = 0x1F  Miss (first access)
  2548 = 0x9F4  cache index = 0x1F  Miss (different tag)
  2552 = 0x9F8  cache index = 0x1F  Hit
  2556 = 0x9FC  cache index = 0x1F  Hit
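The miss/hit sequence can be reproduced with a small direct-mapped cache model. This is a hedged sketch (a dictionary mapping each index to its stored tag, our own construction), assuming a cold cache:

```python
def simulate(addresses, num_blocks=32, block_size=16):
    """Toy direct-mapped cache: each index remembers the tag it holds."""
    stored = {}  # index -> tag of the block currently cached there
    for addr in addresses:
        block_addr = addr // block_size
        index = block_addr % num_blocks
        tag = block_addr // num_blocks
        hit = stored.get(index) == tag
        stored[index] = tag  # on a miss, the new block replaces the old one
        print(f"{addr:4d}  index=0x{index:02X}  {'Hit' if hit else 'Miss'}")

simulate([1000, 1004, 1008, 2548, 2552, 2556])
# Prints: Miss, Hit, Miss, Miss, Hit, Hit, matching the solution above
```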

Exercise: EXAMPLE 6.2

Assume a byte-addressable memory of 2^14 bytes, a cache with 16 blocks, and 8 bytes per block.
The number of memory blocks is 2^14 / 2^3 = 2^11 = 2048.
Each main memory address requires 14 bits. Of this 14-bit address field:
  The rightmost 3 bits form the offset field (2^3 = 8 bytes per block)
  We need 4 bits to select a specific block in the cache, so the block field consists of the middle 4 bits
  The remaining 7 bits make up the tag field
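These widths follow directly from base-2 logarithms of the geometry. A minimal sketch (the helper field_widths is our own name, assuming power-of-two sizes):

```python
from math import log2

def field_widths(addr_bits, cache_blocks, block_size):
    """Return (tag, block, offset) field widths for a direct-mapped cache."""
    offset_bits = int(log2(block_size))    # 8-byte blocks   -> 3 bits
    block_bits = int(log2(cache_blocks))   # 16 cache blocks -> 4 bits
    tag_bits = addr_bits - block_bits - offset_bits
    return tag_bits, block_bits, offset_bits

assert field_widths(14, 16, 8) == (7, 4, 3)   # Example 6.2
print(2**14 // 8)                             # 2048 memory blocks
```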

Exercise: EXAMPLE 6.4

Consider 16-bit memory addresses and a cache of 64 blocks, where each block contains 8 bytes. We have:
  3 bits for the offset
  6 bits for the block
  7 bits for the tag
A memory reference to 0x0404 (binary 0000010 000000 100) maps as follows:
  Tag = 0000010 (0x02), Block = 000000 (0), Offset = 100 (4)
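The mapping can be checked with a few bit operations; a short sketch under the same assumptions (the 7/6/3 field split above):

```python
addr = 0x0404                    # 16-bit address from the example
offset = addr & 0b111            # low 3 bits
block = (addr >> 3) & 0b111111   # next 6 bits
tag = addr >> 9                  # top 7 bits
assert (tag, block, offset) == (0b0000010, 0b000000, 0b100)
print(f"tag={tag}  block={block}  offset={offset}")  # tag=2 block=0 offset=4
```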

Direct Mapped Cache Summary

In summary, a direct-mapped cache maps main memory blocks to cache blocks in a modular fashion. The mapping depends on:
  The number of bits in the main memory address (how many addresses exist in main memory)
  The number of blocks in the cache (which determines the size of the block field)
  The number of addresses (bytes or words) in a block (which determines the size of the offset field)

Fully Associative Cache

A block can be placed anywhere in the cache, so no indexing is needed.
If m blocks exist, then m comparators are needed to match the tag.
Cache data size = m × 2^b bytes.
[Figure: the address is split into tag and offset only; all m stored tags are compared (=) in parallel against the address tag, a mux selects the matching block's data, and Hit is asserted on a valid match]

Set-Associative Cache

A set is a group of blocks that can be indexed.
A block is first mapped onto a set:
  set index = block address mod number of sets in cache
If there are m blocks in a set (m-way set associative), then m tags are checked in parallel using m comparators.
If 2^n sets exist, then the set index consists of n bits.
Cache data size = m × 2^(n+b) bytes (with 2^b bytes per block), not counting tags and valid bits.
A direct-mapped cache has one block per set (m = 1).
A fully-associative cache has one set (2^n = 1, i.e., n = 0).
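A lookup in an m-way set-associative cache indexes one set and compares all m tags. A minimal sketch (a list-of-sets model of our own construction; hardware performs the m comparisons in parallel):

```python
def lookup(cache_sets, addr, block_size=16):
    """Return (set_index, tag, hit) for a toy set-associative cache."""
    num_sets = len(cache_sets)
    block_addr = addr // block_size
    set_index = block_addr % num_sets
    tag = block_addr // num_sets
    hit = tag in cache_sets[set_index]   # one tag comparison per way
    if not hit and len(cache_sets[set_index]) < 2:
        cache_sets[set_index].add(tag)   # fill an empty way (2-way, no eviction here)
    return set_index, tag, hit

cache_sets = [set() for _ in range(4)]   # 4 sets, 2-way: 8 blocks total
print(lookup(cache_sets, 0x3E8))         # first access: miss
print(lookup(cache_sets, 0x3EC))         # same block:   hit
```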

Set-Associative Cache Diagram
[Figure: the address is split into tag, set index, and offset; the set index selects one set, the m tags in the set are compared in parallel, a mux selects the data from the matching way, and Hit is asserted on a match]

Handling Writes: Write Policy

Write Through: writes update the cache and lower-level memory
  Cache control bit: only a Valid bit is needed
  Memory always has the latest data, which simplifies data coherency
  Cached data can always be discarded when a block is replaced
Write Back: writes update the cache only
  Cache control bits: Valid and Modified bits are required
  Modified cached data is written back to memory when the block is replaced
  Multiple writes to a cache block require only one write to memory
  Uses less memory bandwidth and less power than write-through
  However, it is more complex to implement than write-through
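The bandwidth difference is easy to see in a toy model. A hedged sketch (our own counters, not a real memory system) contrasting the two policies for repeated stores to one cached block:

```python
class Memory:
    def __init__(self):
        self.writes = 0   # bus writes reaching lower-level memory

def write_through(mem, num_stores):
    for _ in range(num_stores):
        mem.writes += 1   # every store is propagated to memory

def write_back(mem, num_stores):
    modified = False      # the block's Modified (dirty) bit
    for _ in range(num_stores):
        modified = True   # stores update only the cached copy
    if modified:
        mem.writes += 1   # one write-back when the block is evicted

wt, wb = Memory(), Memory()
write_through(wt, 4)
write_back(wb, 4)
print(wt.writes, wb.writes)   # 4 vs 1: write-back uses less bandwidth
```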

Write Buffer

Decouples the CPU write from the memory-bus write
Permits writes to occur without stall cycles until the buffer is full
Write-through: all stores are sent to lower-level memory
  The write buffer eliminates processor stalls on consecutive writes
Write-back: modified blocks are written back when replaced
  The write buffer holds evicted blocks that must be written back
The address and modified data are written into the buffer; the write is then finished from the CPU's perspective
The CPU continues while the write buffer prepares to write memory
If the buffer is full, the CPU stalls until the buffer has an empty entry

Replacement Policy

Which block should be replaced on a cache miss?
  No selection alternatives for direct-mapped caches
  m blocks per set to choose from for associative caches
Random replacement
  Candidate blocks are randomly selected
  One counter for all sets (0 to m - 1), incremented on every cycle
  On a cache miss, replace the block specified by the counter
First In First Out (FIFO) replacement
  Replace the oldest block in the set
  One counter per set (0 to m - 1) specifies the oldest block to replace
  The counter is incremented on a cache miss

Replacement Policy (cont'd)

Least Recently Used (LRU)
  Replace the block that has been unused for the longest time
  Order the blocks within a set from least to most recently used
  Update the ordering of blocks on each cache hit
With m blocks per set, there are m! possible permutations
  Pure LRU is too costly to implement when m > 2
  m = 2: there are only 2 permutations (a single bit is needed)
  m = 4: there are 4! = 24 possible permutations
  An LRU approximation is used in practice
For large associativity (m > 4), Random replacement can be as effective as LRU
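LRU ordering within a set can be modeled with an ordered structure. A minimal sketch (our own, using Python's OrderedDict; as noted above, real hardware uses approximations when m > 2):

```python
from collections import OrderedDict

def lru_access(set_state, tag, ways):
    """Return True on hit; keep tags ordered least- to most-recently used."""
    if tag in set_state:
        set_state.move_to_end(tag)      # hit: mark as most recently used
        return True
    if len(set_state) >= ways:
        set_state.popitem(last=False)   # miss: evict the least recently used
    set_state[tag] = None
    return False

s = OrderedDict()
print([lru_access(s, t, ways=2) for t in [1, 2, 1, 3, 2]])
# [False, False, True, False, False]: tag 2 is evicted when 3 arrives
```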

Comparing Random, FIFO, and LRU

Data cache misses per 1000 instructions; 10 SPEC2000 benchmarks on an Alpha processor; block size of 64 bytes.

              2-way                 4-way                 8-way
  Size      LRU   Rand   FIFO    LRU   Rand   FIFO    LRU   Rand   FIFO
  16 KB   114.1  117.3  115.5  111.7  115.1  113.3  109.0  111.8  110.4
  64 KB   103.4  104.3  103.9  102.4  102.3  103.1   99.7  100.5  100.3
  256 KB   92.2   92.1   92.5   92.1   92.1   92.5   92.1   92.1   92.5

LRU and FIFO outperform Random for a small cache
There is little difference between LRU and Random for a large cache
LRU is expensive for large associativity (many blocks per set)
Random is the simplest to implement in hardware