EECS 318 CAD Computer Aided Design
LECTURE 10: Improving Memory Access: Direct and Spatial Caches
Instructor: Francis G. Wolff, Case Western Reserve University

The Art of Memory System Design
A workload (benchmark program) running on the processor generates a stream of memory references (op: i-fetch, read, write) that flows through the cache ($) to main memory (MEM). The goal: optimize the memory system organization to minimize the average memory access time for typical workloads.

Pipelining and the Cache (Designing..., M. J. Quinn, '87)
Instruction pipelining is the use of pipelining to allow more than one instruction to be in some stage of execution at the same time.
Ferranti ATLAS (1963):
- Pipelining yielded a 375% improvement in the average time per instruction.
- Memory could not keep up with the CPU, so a cache was needed.
Cache memory is a small, fast memory unit used as a buffer between a processor and primary memory.

Principle of Locality
The Principle of Locality states that programs access a relatively small portion of their address space at any instant of time. There are two types of locality:
Temporal locality (locality in time): if an item is referenced, then the same item will tend to be referenced again soon ("the tendency to reuse recently accessed data items").
Spatial locality (locality in space): if an item is referenced, then nearby items will tend to be referenced soon ("the tendency to reference nearby data items").
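
As an illustration (mine, not from the slides), this small C loop exhibits both kinds of locality:

```c
#include <stddef.h>

/* Temporal locality: `sum` and `i` are reused on every iteration, so they
 * stay in registers or the cache.
 * Spatial locality: a[0], a[1], a[2], ... occupy adjacent memory locations,
 * so a cache block fetched for one element also brings in its neighbors. */
double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```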

Memory Hierarchy
CPU registers -> pipelining -> cache memory -> primary (real) memory -> virtual memory (disk, swapping).
Moving down the hierarchy: cheaper cost per bit and more capacity. Moving up: faster access.

Memory Hierarchy of a Modern Computer System
By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.

Level                              Speed (ns)                  Size (bytes)
Processor registers                1s                          100s
On-chip / second-level
  cache (SRAM)                     10s-100s                    Ks-Ms
Main memory (DRAM)                 100s                        Ms-Gs
Secondary storage (disk)           10,000,000s (10s ms)        Gs
Tertiary storage (disk)            10,000,000,000s (10s sec)   Ts

The processor (control + datapath) sits at the top of this hierarchy.

Cache Memory Technology: SRAM (figure: 1-bit cell layout)

Memory Technology and the Principle of Locality
Faster memories are more expensive per bit; slower memories are usually denser (less area per bit).

Memory technology    Typical access time    $ per MByte in 1997
SRAM                 5-25 ns                $100-$250
DRAM                 60-120 ns              $5-$10
Magnetic disk        10-20 million ns       $0.10-$0.20

Cache Memory Technology: SRAM
Why use SRAM (Static Random Access Memory)?
Speed. The primary advantage of SRAM over DRAM is speed. The fastest DRAMs on the market still require 5 to 10 processor clock cycles to access the first bit of data; SRAMs can operate at processor speeds of 250 MHz and beyond, with access and cycle times equal to the clock cycle used by the microprocessor.
Density. When 64 Mb DRAMs are rolling off the production lines, the largest SRAMs are expected to be only 16 Mb.

Cache Memory Technology: SRAM (cont'd)
Volatility. Unlike DRAMs, SRAM cells do not need to be refreshed; SRAMs are available 100% of the time for reading and writing.
Cost. If cost is the primary factor in a memory design, then DRAMs win hands down. If, on the other hand, performance is a critical factor, then a well-designed SRAM is an effective cost-performance solution.

Memory Hierarchy of a Modern Computer System
By taking advantage of the principle of locality:
- Present the user with as much memory as is available in the cheapest technology.
- Provide access at the speed offered by the fastest technology.
DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system.
SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time.

Cache Terminology
A hit occurs if the data requested by the CPU is in the upper level; a miss occurs if the data is not found in the upper level.
Hit rate (or hit ratio) is the fraction of accesses found in the upper level.
Miss rate (= 1 - hit rate) is the fraction of accesses not found in the upper level.
Hit time is the time required to access data in the upper level = RAM access time + time to determine hit/miss.
Miss penalty is the time required to access data in the lower level = time to replace a block in the upper level + time to deliver the block to the processor.

Cache Example
Data are transferred between the processor and the cache.
Hit: the data is found in the cache at Time 1.
Miss: Time 2 is spent fetching the block from the lower level into the cache, and Time 3 delivering the data to the CPU.
Hit time = Time 1. Miss penalty = Time 2 + Time 3.
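
Combining the two terms gives the standard average memory access time formula: AMAT = hit time + miss rate × miss penalty. A minimal sketch of the arithmetic, with hypothetical numbers (not from the slides):

```c
#include <stdio.h>

/* Every access pays the hit time; the fraction that misses also pays
 * the miss penalty. */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    /* Illustrative values: 1-cycle hit, 5% miss rate, 10-cycle penalty
     * -> 1 + 0.05 * 10 = 1.5 cycles per access on average. */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 10.0));
    return 0;
}
```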

Basic Cache System

Cache Memory Technology: SRAM (figure: block diagram)

Cache Memory Technology: SRAM (figure: timing diagram)

Direct Mapped Cache
Direct mapped: assign the cache location based on the address of the word in memory:
cache_address = memory_address modulo cache_size;
Observe that there is a many-to-1 memory-to-cache relationship.
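
A minimal sketch of the mapping in C (mine; it assumes a power-of-two cache size, in which case the modulo is just the low-order address bits):

```c
#include <stdint.h>

#define CACHE_BLOCKS 8u   /* hypothetical: 8 one-word blocks, 3-bit index */

/* Direct mapping: the word at this memory address can live in exactly one
 * cache location. For a power-of-two cache size, the modulo reduces to
 * masking off the low-order bits. */
static uint32_t cache_index(uint32_t memory_address)
{
    return memory_address % CACHE_BLOCKS;  /* == memory_address & (CACHE_BLOCKS - 1) */
}
```

For example, with the 8-entry cache used on the next slides, address 22 (10110 binary) maps to index 110 (6), and so do addresses 6 and 30: a many-to-1 relationship.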

Direct Mapped Cache: Data Structure
There is a many-to-1 relationship between memory and cache, so how do we know whether the data in the cache corresponds to the requested word?
Tags contain the address information required to identify whether a word in the cache corresponds to the requested word; a tag need only contain the upper portion of the memory address (often referred to as the page address).
A valid bit indicates whether an entry contains a valid address.
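
A sketch of the resulting data structure (illustrative C of my own, sized to match the 8-entry, one-word-block examples that follow; addresses are word addresses, as on the slides):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 8u                /* 3-bit index */

struct cache_line {
    bool     valid;                  /* does this entry hold a valid address? */
    uint32_t tag;                    /* upper portion of the memory address */
    uint32_t data;                   /* one cached word */
};

static struct cache_line cache[NUM_BLOCKS];

/* A hit requires both a set valid bit and a matching tag. */
static bool cache_lookup(uint32_t address, uint32_t *word_out)
{
    uint32_t index = address % NUM_BLOCKS;   /* low-order address bits */
    uint32_t tag   = address / NUM_BLOCKS;   /* remaining upper bits   */

    if (cache[index].valid && cache[index].tag == tag) {
        *word_out = cache[index].data;
        return true;                 /* hit */
    }
    return false;                    /* miss: fetch from the lower level */
}
```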

Direct Mapped Cache: Temporal Example
lw $1, 22($0)   22 = 10110 binary: Miss (valid bit clear); fetch Memory[10110] into index 110, tag 10
lw $2, 26($0)   26 = 11010 binary: Miss (valid bit clear); fetch Memory[11010] into index 010, tag 11
lw $3, 22($0)   22 = 10110 binary: Hit! (index 110 is valid and tag 10 matches)

Index   Valid   Tag   Data
000     N
001     N
010     Y       11    Memory[11010]
011     N
100     N
101     N
110     Y       10    Memory[10110]
111     N

Direct Mapped Cache: Worst Case, Always Miss!
lw $1, 22($0)   22 = 10110 binary: Miss (valid bit clear); fetch Memory[10110] into index 110, tag 10
lw $2, 30($0)   30 = 11110 binary: Miss (tag mismatch); replace index 110 with Memory[11110], tag 11
lw $3, 6($0)     6 = 00110 binary: Miss (tag mismatch); replace index 110 with Memory[00110], tag 00
All three addresses map to the same index (110), so every access evicts the previous word; the cache ends with index 110 holding tag 00 and Memory[00110], and all other entries invalid.

Direct Mapped Cache: MIPS Architecture
(Figure: the memory address is split into a tag and an index; the index selects a cache entry, the stored tag is compared with the address tag, and a match on a valid entry signals a hit and delivers the data.)

Bits in a Direct Mapped Cache
How many total bits are required for a direct mapped cache with 64 KB (= 2^16 bytes) of data and one-word (32-bit) blocks, assuming a 32-bit byte address?
Cache index width = log2(#words) = log2(2^16 / 4) = log2(2^14) = 14 bits
Block address width = memory address width - log2(bytes per word) = 32 - 2 = 30 bits
Tag size = block address width - cache index width = 30 - 14 = 16 bits
Cache block size = valid bit + tag + data = 1 bit + 16 bits + 32 bits = 49 bits
Total size = #blocks x block size = 2^14 x 49 bits = 784 x 2^10 bits = 784 Kbits = 98 KB
Overhead = 98 KB / 64 KB = 1.5 times the data size
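
The same arithmetic as a small C program (a sketch of mine, parameterized for the cache above):

```c
#include <stdio.h>

int main(void)
{
    unsigned addr_bits    = 32;               /* byte address width         */
    unsigned offset_bits  = 2;                /* byte offset within a word  */
    unsigned index_bits   = 14;               /* log2(64 KB / 4-byte words) */
    unsigned tag_bits     = addr_bits - offset_bits - index_bits;  /* 16   */
    unsigned block_bits   = 1 + tag_bits + 32;                     /* 49   */
    unsigned long blocks  = 1ul << index_bits;                     /* 2^14 */
    unsigned long total   = blocks * block_bits;                   /* bits */

    printf("tag = %u bits, block = %u bits, total = %lu Kbits = %lu KB\n",
           tag_bits, block_bits, total / 1024, total / (8 * 1024));
    return 0;
}
```

Output: tag = 16 bits, block = 49 bits, total = 784 Kbits = 98 KB.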

Split Cache: Exploiting the Harvard Architecture
The term Harvard architecture was coined to describe machines with separate instruction and data memories, each on its own bus. It is speed efficient: increased parallelism (split cache).
The von Neumann architecture uses a single memory for both instructions and data. It is area efficient, but requires higher bus bandwidth because instructions and data must compete for memory.

Modern Systems: Pentium Pro and PowerPC
RAM (main memory): von Neumann architecture.
Cache: Harvard architecture, with separate instruction and data caches.

Cache Write Schemes (trading chip area for speed)
Write-through cache: always write the data into both the cache and memory, then wait for memory.
Write-back cache: write the data into the cache block, and write to memory only when a modified block is replaced. Fast, but complex to implement in hardware.
Write buffer: write the data into the cache and a write buffer; the processor stalls only if the write buffer is full. No amount of buffering can help, however, if writes are being generated faster than the memory system can accept them.

Hits vs. Misses
Read hits: this is what we want!
Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, and restart.
Write hits:
- write-through: replace the data in both the cache and memory.
- write-buffer: write the data into the cache and the buffer.
- write-back: write the data only into the cache.
Write misses: read the entire block into the cache, then write the word.
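
A sketch of how the two write-hit policies differ (illustrative C of my own; the arrays stand in for the cache datapath and main memory and are not a real API; tag/valid checks from the earlier lookup sketch are omitted for brevity):

```c
#include <stdint.h>

#define NUM_BLOCKS 8u

/* Hypothetical stand-ins for the cache datapath and main memory. */
static uint32_t cache_data[NUM_BLOCKS];
static int      dirty[NUM_BLOCKS];
static uint32_t memory[1024];

/* Write-through: update the cache and memory together. Simple, but every
 * store pays the memory (or write-buffer) latency. */
static void store_write_through(uint32_t addr, uint32_t data)
{
    cache_data[addr % NUM_BLOCKS] = data;
    memory[addr % 1024] = data;      /* or place the write in a write buffer */
}

/* Write-back: update only the cache and mark the block modified; memory is
 * written once, when the dirty block is eventually replaced. */
static void store_write_back(uint32_t addr, uint32_t data)
{
    cache_data[addr % NUM_BLOCKS] = data;
    dirty[addr % NUM_BLOCKS] = 1;
}
```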

Example: The DECStation 3100 Cache
The DECStation 3100 uses a write-through, Harvard-architecture (split) cache:
128 KB total cache size (= 32K words) = 64 KB instruction cache (= 16K words) + 64 KB data cache (= 16K words).
Writing to memory takes 10 processor clock cycles.

The DECStation 3100 Miss Rates
A split instruction and data cache increases bandwidth.

Benchmark   Instruction   Data        Effective split   Combined
program     miss rate     miss rate   miss rate         miss rate
gcc         6.1%          2.1%        5.4%              4.8%
spice       1.2%          1.3%        1.2%              1.2%

The split cache has a slightly worse miss rate than the combined cache, but the extra bandwidth makes it worthwhile. Why does spice have a lower miss rate? Numerical programs tend to consist of many small program loops. A 1.2% miss rate also means that 98.8% of the time the data is in the cache, so using a cache pays off!
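
The effective split miss rate is a weighted average of the instruction and data miss rates, weighted by how often each kind of reference occurs. A sketch (the 82% instruction-reference fraction is my assumption, chosen because it reproduces the slide's 5.4%; the slide does not state it):

```c
#include <stdio.h>

/* Weighted average of the instruction- and data-cache miss rates. */
static double effective_miss_rate(double i_rate, double d_rate, double i_frac)
{
    return i_frac * i_rate + (1.0 - i_frac) * d_rate;
}

int main(void)
{
    /* gcc numbers from the slide: 6.1% instruction, 2.1% data.
     * 0.82 * 6.1 + 0.18 * 2.1 = 5.38, i.e. about 5.4%. */
    printf("effective = %.2f%%\n", effective_miss_rate(6.1, 2.1, 0.82));
    return 0;
}
```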

Review: Principle of Locality
The Principle of Locality states that programs access a relatively small portion of their address space at any instant of time. There are two types of locality:
Temporal locality (locality in time): if an item is referenced, then the same item will tend to be referenced again soon ("the tendency to reuse recently accessed data items").
Spatial locality (locality in space): if an item is referenced, then nearby items will tend to be referenced soon ("the tendency to reference nearby data items").

Spatial Locality
Temporal-only cache: each cache block contains only one word (no spatial locality).
Spatial-locality cache: each cache block contains multiple words; when a miss occurs, multiple words are fetched.
Advantage: the hit ratio increases, because there is a high probability that adjacent words will be needed shortly.
Disadvantage: the miss penalty increases with block size.

Spatial Locality: 64 KB Cache, 4-Word Blocks
A 64 KB cache using four-word (16-byte) blocks: 16-bit tag, 12-bit index, 2-bit block offset, 2-bit byte offset.
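
A sketch of how a 32-bit byte address splits under this organization (my own C illustration of the field widths named above; the example address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

/* 64 KB cache, four-word (16-byte) blocks:
 * [31..16] tag | [15..4] index | [3..2] block offset (which word) | [1..0] byte offset */
int main(void)
{
    uint32_t addr         = 0x1234ABCDu;          /* arbitrary example address */
    uint32_t byte_offset  = addr         & 0x3;   /* bits  1..0  */
    uint32_t block_offset = (addr >> 2)  & 0x3;   /* bits  3..2  */
    uint32_t index        = (addr >> 4)  & 0xFFF; /* bits 15..4  */
    uint32_t tag          =  addr >> 16;          /* bits 31..16 */

    printf("tag=0x%04X index=0x%03X word=%u byte=%u\n",
           tag, index, block_offset, byte_offset);
    return 0;
}
```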

Performance: Block Size
Use split caches, because there is more spatial locality in code:

Program   Block size   Instruction   Data        Effective split   Combined
                       miss rate     miss rate   miss rate         miss rate
gcc       1            6.1%          2.1%        5.4%              4.8%
gcc       4            2.0%          1.7%        1.9%
spice     1            1.2%          1.3%        1.2%
spice     4            0.3%          0.6%        0.4%

The temporal-only split cache (block size 1) has a slightly worse miss rate; the spatial split cache (block size 4) has a lower miss rate.

Cache Block Size and Performance
Increasing the block size tends to decrease the miss rate (figure: miss rate vs. block size).