
1 Computer Architecture - Cache
John Morris, Electrical & Computer Engineering / Computer Science, The University of Auckland
(Title photo: Iolanthe at 13 knots on Cockburn Sound, WA)

2 Memory Bottleneck
State-of-the-art processor: f = 3 GHz, t_clock = 330 ps, 1-2 instructions per cycle, ~25% of instructions reference memory.
Memory response: one reference every 4 instructions, so a response in 4 x 330 ps ≈ 1.3 ns is needed!
Bulk semiconductor RAM: 100 ns+ for a 'random' access on the system bus!
→ The processor will spend most of its time waiting for memory!
DDR RAM is faster when streaming data over a special bus, but random access is still slow!
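The arithmetic on this slide is worth checking mechanically. A minimal sketch (C is used for all code examples in this transcript; the figures are the slide's own, and the variable names are illustrative):

    #include <stdio.h>

    int main(void) {
        double t_clock_ps   = 330.0;   /* one cycle at 3 GHz, in picoseconds */
        double mem_fraction = 0.25;    /* ~25% of instructions touch memory  */
        double t_dram_ns    = 100.0;   /* 'random' access to bulk DRAM       */

        /* One memory reference every 1/0.25 = 4 instructions, so memory
           must answer within roughly 4 clock periods to keep up. */
        double budget_ns = (1.0 / mem_fraction) * t_clock_ps / 1000.0;

        printf("budget per reference: %.2f ns\n", budget_ns);            /* ~1.32 ns */
        printf("bulk DRAM is ~%.0fx too slow\n", t_dram_ns / budget_ns); /* ~76x     */
        return 0;
    }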

3 Cache
A small, fast memory: typically ~50 kbytes (1998); 2007 Pentium: 16 kbytes (Level 1), 2-cycle access time, on the same die as the processor.
'Off-chip' cache is also possible: a custom cache chip closely coupled to the processor, using fast static RAM (SRAM) rather than slower dynamic RAM.
Several levels are possible.
The cache is the 2nd level of the memory hierarchy: it 'caches' the most recently used memory locations 'closer' to the processor, where closer = closer in time.

4 Cache - Etymology
cacher (French) = 'to hide'.
A cache is transparent to a program: programs simply run slower without it, and modern processors rely on it.
It reduces the cost of main memory access and enables instruction-per-cycle throughput; a typical program makes ~25% memory accesses.
Reference: Wikipedia (don't try Google!)

5 Cache
A cache relies upon locality of reference: programs continually use - and re-use - the same locations.
Instructions: loops, common subroutines.
Data: look-up tables, 'working' data sets.

6 Cache - operation
Memory requests are checked in the cache first.
If the word sought is in the cache, it is read from (or updated in) the cache → a cache hit.
If not, the request is passed to main memory and the data is read (written) there → a cache miss.
[Diagram: CPU → MMU → Cache → Main Mem. The CPU issues a virtual address (VA), the MMU translates it to a physical address (PA), and data or instructions (D or I) flow back.]

7 Cache - operation
Hit rates of 95% are usual for a 16-kbyte cache.
Effective memory access time: with a 2-cycle cache and 10-cycle main memory, the average access takes 0.95 x 2 + 0.05 x 10 = 2.4 cycles.
In general, if there are n levels of memory:
    t_avg = Σ_j f_j · t_acc,j
where f_j = fraction of accesses 'hitting' at level j, and t_acc,j = access time of level j.
[Diagram: memory hierarchy - CPU, L1 cache, L2 cache, bulk memory, paging disc; faster towards the CPU, larger away from it. Small is fast!]
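A minimal sketch of that formula in C; the two-level numbers reproduce the slide's 2.4-cycle result:

    #include <stdio.h>

    int main(void) {
        /* Hit fractions f_j and access times t_j (in cycles) per level. */
        double f[] = { 0.95, 0.05 };
        double t[] = { 2.0, 10.0 };
        int n = 2;

        double t_avg = 0.0;
        for (int j = 0; j < n; j++)
            t_avg += f[j] * t[j];   /* t_avg = sum_j f_j * t_j */

        printf("average access time = %.1f cycles\n", t_avg);   /* 2.4 */
        return 0;
    }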

8 Cache - operation
The key problem: with a 16-kbyte cache and at least 2 Mbytes of memory, what goes in the cache?
'Common' instructions and data. We'd like the very common data in L1 and the less common data in L2.
How do we ensure that the very common data is in L1? And how do we find out whether the data item that we want now is in the cache?
[Diagram: memory hierarchy - CPU, L1 cache, L2 cache, bulk memory, paging disc; faster towards the CPU, larger away from it. Small is fast!]

9 Cache - organisation
Three basic types: direct-mapped, fully associative, and set associative caches.
Note that all of these organisations apply to other caches too: disc caches, web page caches (frequently accessed pages), and database caches (frequently accessed records).

10 Cache - organisation
Direct-mapped cache: each word in the cache has a tag.
Assume: cache size = 2^k words, machine words of p bits, byte-addressed memory.
The low m = log2(p/8) bits are not needed to address words: m = 2 for 32-bit machines.
Address format (p bits): | tag (p-k-m bits) | cache address (k bits) | byte address (m bits) |
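A minimal sketch of splitting an address into these three fields, using the slide's 32-bit example (m = 2, and k = 14 for a 16k-word cache); the example address is arbitrary:

    #include <stdint.h>
    #include <stdio.h>

    #define M 2    /* byte-address bits: log2(p/8) for p = 32 */
    #define K 14   /* cache-address bits: a 2^14-word cache   */

    int main(void) {
        uint32_t addr = 0x12345678;   /* arbitrary example address */

        uint32_t byte_addr  =  addr       & ((1u << M) - 1);  /* low m bits     */
        uint32_t cache_addr = (addr >> M) & ((1u << K) - 1);  /* middle k bits  */
        uint32_t tag        =  addr >> (M + K);               /* top p-k-m bits */

        printf("tag = 0x%x, cache address = 0x%x, byte address = 0x%x\n",
               tag, cache_addr, byte_addr);   /* 0x1234, 0x159e, 0x0 */
        return 0;
    }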

11 Cache - organisation
Direct-mapped cache.
[Diagram: the memory address | tag (p-k-m bits) | cache address (k bits) | byte address (m bits) | indexes one of the 2^k cache lines; each line holds a (p-k-m)-bit tag and a p-bit data word. The stored tag is compared with the tag field of the address to generate the Hit? signal; on a miss the request goes to memory. A lookup like this is sketched below.]
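A minimal C sketch of that lookup path. The line count and field widths follow the slide's example; the Line type and the dm_lookup name are illustrative, and real hardware does this with a single comparator rather than software:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define K 14   /* 2^14 lines, as in the slide's example */
    #define M 2

    typedef struct {
        bool     valid;
        uint32_t tag;    /* the upper p-k-m address bits */
        uint32_t data;   /* one p-bit word               */
    } Line;

    static Line cache[1u << K];   /* zero-initialised: all lines invalid */

    /* Returns true on a hit and places the word in *out. */
    bool dm_lookup(uint32_t addr, uint32_t *out) {
        uint32_t index = (addr >> M) & ((1u << K) - 1);  /* cache address */
        uint32_t tag   =  addr >> (M + K);               /* address tag   */

        Line *l = &cache[index];
        if (l->valid && l->tag == tag) {  /* stored tag vs address tag    */
            *out = l->data;
            return true;                  /* hit                          */
        }
        return false;   /* miss: fetch from memory and refill the line    */
    }

    int main(void) {
        uint32_t w;
        printf("%s\n", dm_lookup(0x12345678, &w) ? "hit" : "miss");  /* miss: cold cache */
        return 0;
    }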

12 Cache - Conflicts
Two addresses separated by 2^(k+m) bytes will hit the same cache location: addresses in which the k cache-address bits are the same map to the same cache line, whatever their tag bits are.

13 Cache - Direct-Mapped Conflicts
Two addresses separated by 2^(k+m) bytes will hit the same cache location. On a 32-bit machine with a 64-kbyte (16-kword) cache, m = 2 and k = 14, so any program or data set larger than 64 kbytes will generate conflicts (see the sketch below).
On a conflict, the 'old' word is flushed: an unmodified word (program, constant data) is simply overwritten by the new data from memory, but modified data must be written back to memory before being overwritten.
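A minimal sketch demonstrating the conflict: two addresses 2^(k+m) = 2^16 bytes apart produce the same line index (the starting address is arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    #define K 14   /* 16k-word cache  */
    #define M 2    /* 4-byte words    */

    /* The k cache-address bits of a byte address. */
    static uint32_t line_of(uint32_t addr) {
        return (addr >> M) & ((1u << K) - 1);
    }

    int main(void) {
        uint32_t a = 0x00001234;
        uint32_t b = a + (1u << (K + M));   /* a + 2^16 = a + 64 kbytes */

        printf("line(a) = %u, line(b) = %u -> %s\n",
               line_of(a), line_of(b),
               line_of(a) == line_of(b) ? "conflict" : "no conflict");
        return 0;
    }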

14 Cache - Conflicts
Modified or 'dirty' words: what happens when a word is modified in the cache?
→ Write-back cache: only writes data back when needed. A miss then costs two memory accesses: write the modified word back, then read the new word.
→ Write-through cache: a low-priority write to main memory is queued, so the processor is delayed by reads only. The memory write occurs in parallel with other work; instruction and necessary data fetches take priority. Both policies are sketched below.
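A minimal sketch of the two write policies, assuming a single cache line and ignoring tags. The mem_write() and queue_write() helpers are hypothetical stand-ins for a bus interface unit; only the control flow matters here:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { bool valid, dirty; uint32_t tag, data; } Line;

    /* Hypothetical bus-interface stand-ins. */
    static void mem_write(uint32_t a, uint32_t w)   { printf("bus write    %08x <- %u\n", a, w); }
    static void queue_write(uint32_t a, uint32_t w) { printf("queued write %08x <- %u\n", a, w); }

    static void write_word(Line *l, uint32_t addr, uint32_t w, bool write_back) {
        l->data = w;                /* update the cached copy either way         */
        if (write_back)
            l->dirty = true;        /* defer the memory write until eviction     */
        else
            queue_write(addr, w);   /* write-through: queue it now; the CPU is
                                       delayed by reads only                     */
    }

    /* A write-back cache must flush a dirty line before overwriting it. */
    static void evict(Line *l, uint32_t old_addr) {
        if (l->valid && l->dirty)
            mem_write(old_addr, l->data);   /* the extra memory access on a miss */
        l->valid = l->dirty = false;
    }

    int main(void) {
        Line l = { .valid = true };
        write_word(&l, 0x1000, 42, true);   /* write-back: no bus traffic yet */
        evict(&l, 0x1000);                  /* ...until the line is evicted   */
        write_word(&l, 0x2000, 7, false);   /* write-through: queued at once  */
        return 0;
    }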

15 Cache - Write-through or write-back?
Write-through allows an intelligent bus interface unit to make efficient use of a serious bottleneck: the processor-memory interface (the main memory bus).
Reads (instruction and data) need priority, since they stall the processor; writes can be delayed, at least until the location is needed again!
More on intelligent system interface units later, but...

16 Cache - Write-through or write-back?
Write-through seems a good idea, but... multiple writes to the same location waste memory bus bandwidth.
→ Typical programs run better with write-back caches; however, you can often easily predict which will be best.
→ Some processors (e.g. the PowerPC) allow you to classify memory regions as write-back or write-through.

17 Cache - more bits
Cache lines need some status bits in addition to the tag bits:
Valid - all set to false on power-up, set to true as words are loaded into the cache.
Dirty - needed by a write-back cache; a write-through cache always queues the write, so its lines are never 'dirty'.
Cache line layout: | tag (p-k-m bits) | V (1 bit) | M (1 bit) | data (p bits) |
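A minimal sketch of that line layout, using the running example (p = 32, k = 14, m = 2). The bit-fields only illustrate the widths; real hardware packs tag RAM and data RAM separately:

    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint32_t tag   : 16;   /* p-k-m = 32-14-2 = 16 tag bits   */
        uint32_t valid : 1;    /* V: true once the line is loaded */
        uint32_t dirty : 1;    /* M: write-back caches only       */
        uint32_t data;         /* one p-bit word                  */
    } Line;

    static Line cache[1u << 14];

    /* On power-up every V bit must be cleared, or stale junk in the
       tag RAM would be mistaken for cached data. */
    void cache_reset(void) { memset(cache, 0, sizeof cache); }

    int main(void) {
        cache_reset();
        return 0;
    }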

18 Cache - Improving Performance
Conflicts (addresses 2^(k+m) bytes apart) degrade cache performance: they lower the hit rate.
Murphy's Law operates here: addresses are never random! Some locations 'thrash' in the cache, continually replaced and restored.
Put another way: ideal cache performance depends on uniform access to all parts of memory - which never happens in real programs!

19 Cache - access patterns
Ideally, each cache location is 'hit' the same number of times. In reality, hot spots occur: some cache lines 'thrash', with values ping-ponging between the cache and the next level of memory.
→ The cache then provides no benefit, and could even slow the processor down (write-backs generate an uneven bus load).

20 Cache - Fully Associative
All tags are compared at the same time; words can use any cache line.

21 Cache - Fully Associative
Associative: each tag is compared at the same time; any match → hit. This avoids 'unnecessary' flushing.
Replacement: 'capacity' conflicts will still occur, since cache size << working data set size! Use the Least Recently Used (LRU) algorithm, which needs extra status bits: cycles since last accessed.
The hardware cost is high: extra comparators, and wider tags (p-m bits vs p-k-m bits). A sketch follows below.
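A minimal sketch of a fully associative lookup with true LRU, modelling the 'cycles since last access' counter the slide mentions. The sizes and the fa_lookup/fa_victim names are illustrative; the loop models the parallel comparators that hardware would provide:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define LINES 8   /* illustrative; real caches have many more */
    #define M 2

    typedef struct {
        bool     valid;
        uint32_t tag;    /* p-m bits: wider than the direct-mapped tag */
        uint32_t age;    /* cycles since last access, for LRU          */
        uint32_t data;
    } Line;

    static Line cache[LINES];

    /* Hardware compares every stored tag with the address tag at once. */
    bool fa_lookup(uint32_t addr, uint32_t *out) {
        uint32_t tag = addr >> M;
        for (int i = 0; i < LINES; i++)
            if (cache[i].valid && cache[i].tag == tag) {
                cache[i].age = 0;          /* just used         */
                *out = cache[i].data;
                return true;               /* any match -> hit  */
            }
        return false;
    }

    /* On a miss, replace an empty line if any, else the oldest one.
       (Hardware would also increment every age field each cycle.) */
    int fa_victim(void) {
        int v = 0;
        for (int i = 0; i < LINES; i++) {
            if (!cache[i].valid) return i;
            if (cache[i].age > cache[v].age) v = i;
        }
        return v;
    }

    int main(void) {
        uint32_t w;
        if (!fa_lookup(0x1000, &w))        /* cold cache: miss */
            cache[fa_victim()] = (Line){ true, 0x1000 >> M, 0, 42 };
        printf("%s\n", fa_lookup(0x1000, &w) ? "hit" : "miss");   /* hit */
        return 0;
    }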

22 Cache Organization
Direct-mapped: → simple and fast, but → conflicts ('hot spots') degrade performance.
Fully associative: → avoids the conflict problem - any number of hits on addresses a + i·2^(k+m), for any set of values of i, is possible (? - depends on the precise LRU algorithm). But the extra hardware → additional comparators (one per line!) makes the hardware cost very high!
→ So a hybrid organisation was developed: set associative caches.

23 Cache - Set Associative
Example: 2-way set associative. Each set holds two lines, so only two comparators are needed.

24 Cache - Set Associative
n-way set associative caches: n can be small (2, 4, 8) → the best performance at a reasonable hardware cost. Used by most high-performance processors.
Replacement policy: an LRU choice from the n lines of a set. A reasonable LRU approximation needs only 1 or 2 bits per line: set on access, cleared / decremented by a timer; choose a cleared line for replacement (sketched below).
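A minimal sketch of that cheap LRU approximation for one set, using a single 'recently used' bit per line. The touch/decay/victim names and the 4-way size are illustrative:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4   /* n = 4 */

    typedef struct { bool valid, used; uint32_t tag, data; } Way;
    typedef struct { Way way[WAYS]; } Set;

    /* Set the 'recently used' bit on every access... */
    void touch(Set *s, int w) { s->way[w].used = true; }

    /* ...and clear the bits periodically (the slide's timer). */
    void decay(Set *s) {
        for (int w = 0; w < WAYS; w++) s->way[w].used = false;
    }

    /* Victim: an invalid line, or any line whose bit has decayed to 0. */
    int victim(const Set *s) {
        for (int w = 0; w < WAYS; w++)
            if (!s->way[w].valid || !s->way[w].used) return w;
        return 0;   /* every line recently used: fall back to way 0 */
    }

    int main(void) {
        Set s = {0};
        for (int w = 0; w < WAYS; w++) s.way[w].valid = true;
        decay(&s);                  /* timer tick forgets old accesses */
        touch(&s, 0);               /* only way 0 used since the tick  */
        printf("victim: way %d\n", victim(&s));   /* way 1             */
        return 0;
    }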

25 Cache - Locality of Reference
Locality of reference is the fundamental principle which permits caches to work; almost all programs exhibit it to some degree!
→ Temporal locality: the same location will be referenced again soon. Access the same data again; program loops access the same instructions again. The caches described so far exploit temporal locality.
→ Spatial locality: nearby locations will be referenced soon - the next element of an array, the next instruction of a program.

26 Cache - Line Length
Spatial locality suggests using very long cache lines: fetch one datum and its neighbours are fetched too.
Spatial locality is common in all types of program.
Programs: the next instruction (branches are only about 10% of instructions).
Data: scientific and engineering programs - arrays; commercial and information processing - character strings.

27 Cache - Line Length
Spatial locality allows efficient use of the bus: blocks of data are 'burst' across it, which is more efficient and reduces bus overheads. Modern RAM (RAMbus, DDR, SDDR, etc.) relies on it.
Example: the PowerPC 601 (Motorola/Apple/IBM), the first of the single-chip Power processors. 64 sets, 8-way set associative, 32 bytes per line. 32 bytes (8 instructions) are fetched into the instruction buffer in one cycle - or, for data: a 32-byte string, 8 floats, or 4 doubles.
Total: 64 x 8 x 32 = 16 kbytes (checked below).
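The capacity arithmetic, checked in a minimal sketch (capacity = sets x ways x line size; the figures are the slide's):

    #include <stdio.h>

    int main(void) {
        int sets = 64, ways = 8, line_bytes = 32;   /* PowerPC 601 figures */
        int bytes = sets * ways * line_bytes;
        printf("capacity = %d bytes = %d kbytes\n", bytes, bytes / 1024);
        /* prints: capacity = 16384 bytes = 16 kbytes */
        return 0;
    }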

28 Cache - Separate I- and D-caches
A unified cache holds instructions and data in the same cache. Using two caches instead - one for instructions, one for data - increases the total bandwidth.
Example: MIPS R10000. 32-kbyte instruction cache and 32-kbyte data cache. The instruction cache is pre-decoded (32 → 36 bits). The data cache has 8-word (64-byte) lines and is 2-way set associative with 256 sets. Replacement policy?

