1 MAMAS – Computer Architecture 234367, Lectures 3-4: Memory Hierarchy and Cache Memories. Dr. Avi Mendelson. Some of the slides were taken from: (1) Lihu Rapoport, (2) Randy Katz and (3) Patterson.

2 Technology Trends

  DRAM generations:
    Year   Size     Cycle Time
    1980   64 Kb    250 ns
    1983   256 Kb   220 ns
    1986   1 Mb     190 ns
    1989   4 Mb     165 ns
    1992   16 Mb    145 ns
    1995   64 Mb    120 ns
    (capacity grew ~1000:1 over this period, cycle time improved only ~2:1)

    Technology   Capacity         Speed
    Logic        2x in 3 years    2x in 3 years
    DRAM         4x in 3 years    1.4x in 10 years
    Disk         2x in 3 years    1.4x in 10 years

3 Processor-DRAM Memory Gap (latency): the processor-memory performance gap grows about 50% per year. [Chart: relative performance (log scale, 1 to 1000) of CPU vs. DRAM over 1980-2000, with the CPU curve pulling steadily away from DRAM.]

4 Why can't we build memory at the same frequency as logic?
 1. It is too expensive to build a large memory with that technology.
 2. The size of the memory determines its access time: the larger the memory, the slower it is.
 => We do not aim for the best-performance solution. We aim for the best COST-EFFECTIVE solution (the best performance for a given amount of money).

5 Important observation – programs exhibit locality (and we can help it)
 • Temporal Locality (locality in time):
  – If an item is referenced, it will tend to be referenced again soon.
  – Example: code and variables in loops.
  => Keep most recently accessed data items closer to the processor.
 • Spatial Locality (locality in space):
  – If an item is referenced, nearby items tend to be referenced soon.
  – Example: scanning an array.
  => Move blocks of contiguous words closer to the processor.
 • Locality + smaller HW is faster + Amdahl's law => memory hierarchy.
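A minimal C sketch of the two kinds of locality (the function and array here are illustrative, not from the slides):

    #include <stddef.h>

    /* Scanning an array exhibits spatial locality: consecutive elements share
     * cache blocks, so after the first miss the next few accesses hit.
     * The accumulator and loop counter exhibit temporal locality: they are
     * touched on every iteration, so they stay close to the processor. */
    long sum(const int *a, size_t n)
    {
        long s = 0;                        /* reused every iteration: temporal locality */
        for (size_t i = 0; i < n; i++)     /* i is also reused: temporal locality       */
            s += a[i];                     /* sequential accesses: spatial locality     */
        return s;
    }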

6 The Goal: the illusion of a large, fast, and cheap memory
 • Fact: large memories are slow; fast memories are small.
 • How do we create a memory that is large, cheap and fast (most of the time)?
 – Hierarchy: CPU – Level 1 – Level 2 – Level 3 – Level 4, with speed from fastest to slowest, size from smallest to biggest, and cost from highest to lowest as we move away from the CPU.

7 Levels of the Memory Hierarchy

    Level         Capacity        Access Time    Cost               Staging/Xfer Unit           Managed by
    Registers     100s of bytes   < 10s ns                          instr. operands, 1-8 bytes  program/compiler
    Cache         K bytes         10-100 ns      $.01-.001/bit      blocks, 8-128 bytes         cache controller
    Main Memory   M bytes         100 ns - 1 us  $.01-.001          pages, 512-4K bytes         OS
    Disk          G bytes         ms             10^-3 - 10^-4 c    files, Mbytes               user/operator
    Backup        infinite        sec-min        10^-6 c

  Upper levels are faster; lower levels are larger.

8 Simple performance evaluation
 • Suppose we have a processor that can execute one instruction per cycle when working from the first level of the memory hierarchy (it hits the L1).
 • Example: if the information is not found in the first level, the CPU waits 10 cycles; if it is found only in the third level, it costs another 100 cycles. [Figure: CPU connected to Level 1, Level 2, Level 3.]

9 Cache Performance
 CPU time = (CPU execution cycles + Memory stall cycles) × Cycle time
 Memory stall cycles = Reads × Read miss rate × Read miss penalty + Writes × Write miss rate × Write miss penalty
 Memory stall cycles = Memory accesses × Miss rate × Miss penalty
 Misses per instruction = Memory accesses per instruction × Miss rate
 CPU time = IC × (CPI_execution + Memory accesses per instruction × Miss rate × Miss penalty) × Clock cycle time
 CPU time = IC × (CPI_execution + Misses per instruction × Miss penalty) × Clock cycle time
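A small C helper that evaluates the last formula above; the parameter names and the numbers in main are mine (they loosely echo the example on the next slide):

    #include <stdio.h>

    /* CPU time = IC * (CPI_execution + misses_per_instr * miss_penalty) * cycle_time,
     * where misses_per_instr = memory_accesses_per_instr * miss_rate.              */
    static double cpu_time(double ic, double cpi_exec,
                           double accesses_per_instr, double miss_rate,
                           double miss_penalty, double cycle_time)
    {
        double misses_per_instr = accesses_per_instr * miss_rate;
        return ic * (cpi_exec + misses_per_instr * miss_penalty) * cycle_time;
    }

    int main(void)
    {
        /* Illustrative: 10M instructions, CPI = 1, 1.5 accesses per instruction,
         * 5% miss rate, 10-cycle miss penalty, 1 ns clock cycle.                 */
        printf("CPU time = %.2f ms\n",
               cpu_time(10e6, 1.0, 1.5, 0.05, 10.0, 1e-9) * 1e3);
        return 0;
    }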

10 Example
 • Consider a program that executes 10×10^6 instructions with CPI = 1.
 • Each instruction causes (on average) 0.5 accesses to data.
 • 95% of the accesses hit L1.
 • 50% of the accesses to L2 miss and so need to be looked up in L3.
 • What is the slowdown due to the memory hierarchy?
Solution
 • The program generates 15×10^6 accesses to memory, which could be served in 10×10^6 cycles if all the information were in level 1.
 • 0.05 × 15×10^6 = 750,000 accesses go to L2, and 375,000 of those go on to L3.
 • New cycle count = 10×10^6 + 10 × 750,000 + 100 × 375,000 = 55×10^6.
 • That is a 5.5× slowdown!

11 The first level of the memory hierarchy: cache memories – main idea
 • At this point we assume only two levels of memory hierarchy: main memory and cache memory.
 • For simplicity we also assume that the entire program (data and instructions) is placed in main memory.
 • The cache memory is part of the processor:
 – Same technology.
 – Speed: the same order of magnitude as accessing registers.
 • It is relatively small and expensive.
 • It acts like a hash table: it holds parts of the program's address space.
 • It needs to achieve:
 – Fast access time
 – A fast search mechanism
 – A fast replacement mechanism
 – A high hit ratio

12 Cache – Main Idea (cont.)
 • When the processor needs an instruction or data, it first looks for it in the cache. If that fails, it brings the data from main memory into the cache and uses it from there.
 • The address space (or main memory) is partitioned into blocks:
 – Typical block size is 32, 64 or 128 bytes.
 – The block address is the address of the first byte in the block; block addresses are aligned (a multiple of the block size).
 • The cache holds lines, and each line holds a block:
 – We need to determine which line the block is mapped to (if at all).
 – A block may not exist in the cache – a cache miss.
 • If we miss the cache:
 – The entire block is fetched into a line fill buffer (this may require a few bus cycles) and then put into the cache.
 – Before putting the new block in the cache, another block may need to be evicted from the cache (to make room for the new block).

13 Memory Hierarchy: Terminology
 • For each memory level we can define the following:
 – Hit: the data appears in the memory level.
 – Hit Rate: the fraction of memory accesses that are hits.
 – Hit Time: the time to access the memory level (it also includes the time to determine hit/miss).
 – Miss: the data needs to be retrieved from the lower level.
 – Miss Rate = 1 - (Hit Rate).
 – Miss Penalty: the time to replace a block in the current level + the time to deliver the data to the processor.
 • Average memory-access time:
   t_effective = (Hit time × Hit rate) + (Miss time × Miss rate) = (Hit time × Hit rate) + (Miss time × (1 - Hit rate))
 – If the hit rate is close to 1, t_effective is close to the hit time.
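The same formula as a tiny C helper (a sketch; the parameter names are mine):

    /* t_effective = hit_time * hit_rate + miss_time * (1 - hit_rate).
     * When hit_rate is close to 1, the result is close to hit_time.  */
    double t_effective(double hit_time, double miss_time, double hit_rate)
    {
        return hit_time * hit_rate + miss_time * (1.0 - hit_rate);
    }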

14 Four Questions for Memory Hierarchy Designers
In order to increase efficiency, we move data in blocks between the different levels of memory; e.g., pages in main memory. To do that we need to answer (at least) 4 questions:
 • Q1: Where can a block be placed when it is brought in? (Block placement)
 • Q2: How is a block found when needed? (Block identification)
 • Q3: Which block should be replaced on a miss? (Block replacement)
 • Q4: What happens on a write? (Write strategy)

15 Q1-2: Where can a block be placed and how can we find it?
 • Direct mapped: each block has only one place it can appear in the cache.
 • Fully associative: each block can be placed anywhere in the cache.
 • Set associative: each block can be placed in a restricted set of places in the cache.
 – If there are n blocks in a set, the cache placement is called n-way set associative.
 • What is the associativity of a direct mapped cache?

16 Fully Associative Cache
 • An address is partitioned into:
 – an offset within the block (bits 4:0)
 – a block number (bits 31:5)
 • Each block may be mapped to any of the cache lines – we need to look the block up in all lines.
 • Each cache line has a tag:
 – The tag is compared to the block number.
 – If one of the tags matches the block number we have a hit, and the line is accessed according to the line offset.
 – This requires a comparator per line.
 [Figure: address fields (block# used as tag, line offset); the tag array is compared against the block# in parallel, and a match selects the line in the data array.]
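A C sketch of the fully associative lookup (the line size and line count are assumptions; in hardware all the tag comparisons run in parallel, one comparator per line):

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 32u                 /* bytes per block (assumed)        */
    #define NUM_LINES 128u                /* number of cache lines (assumed)  */

    struct line {
        bool     valid;
        uint32_t tag;                     /* the full block number is the tag */
        uint8_t  data[LINE_SIZE];
    };

    static struct line cache[NUM_LINES];

    /* Returns true on a hit and copies the requested byte into *out. */
    bool lookup(uint32_t addr, uint8_t *out)
    {
        uint32_t offset = addr % LINE_SIZE;      /* line offset: low address bits */
        uint32_t block  = addr / LINE_SIZE;      /* block number, used as the tag */

        for (uint32_t i = 0; i < NUM_LINES; i++) {   /* parallel in hardware      */
            if (cache[i].valid && cache[i].tag == block) {
                *out = cache[i].data[offset];
                return true;                      /* hit  */
            }
        }
        return false;                             /* miss */
    }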

17 Fully associative – cont.
 • Advantages
 + Good utilization of the area, since any block in main memory can be mapped to any cache line.
 • Disadvantages
 - A lot of hardware.
 - Complicated hardware that slows down the access time.

18 Direct Mapped Cache
 • The least-significant bits of the block number determine which cache line the block is mapped to – this is called the set number.
 – Each block is mapped to a single line in the cache.
 – If a block is mapped to the same line as another block, it replaces it.
 • The rest of the block number bits are used as a tag:
 – It is compared to the tag stored in the cache for the appropriate set.
 [Figure: address fields – tag (bits 31:14), set (bits 13:5, 2^9 = 512 sets), line offset (bits 4:0); the set selects one entry in the tag array and the cache storage.]

19 Direct Mapped Cache (cont.)
 • Memory is conceptually divided into slices whose size is the cache size.
 • The offset from the slice start indicates the position in the cache (the set).
 • Addresses with the same offset map into the same line.
 • One tag per line is kept.
 • Advantages
 + Easy hit/miss resolution
 + Easy replacement algorithm
 + Lowest power and complexity
 • Disadvantage
 - Excessive line replacement due to "conflict misses"
 [Figure: memory divided into cache-size slices; lines at the same offset (marked x) in different slices map to the same set.]

20 2-Way Set Associative Cache
 • Each set holds two lines (way 0 and way 1).
 • Each block can be mapped into one of the two lines in the appropriate set.
 • Address fields: tag (bits 31:13), set (bits 12:5), line offset (bits 4:0); each way has its own tag array and cache storage.
 • Example:
   Line size: 32 bytes; cache size: 16 KB; # of lines: 512; # of sets: 256
   Offset bits: 5; set bits: 8; tag bits: 19
   Address 0x12345678 = 0001 0010 0011 0100 0101 0110 0111 1000 (binary)
   Offset: 1 1000 = 0x18 = 24
   Set: 1011 0011 = 0xB3 = 179
   Tag: 000 1001 0001 1010 0010 = 0x091A2
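A quick C check of the address breakdown above (5 offset bits, 8 set bits, 19 tag bits):

    #include <stdio.h>

    int main(void)
    {
        unsigned addr   = 0x12345678u;
        unsigned offset = addr & 0x1Fu;           /* low 5 bits   */
        unsigned set    = (addr >> 5) & 0xFFu;    /* next 8 bits  */
        unsigned tag    = addr >> 13;             /* top 19 bits  */

        /* Prints offset=0x18, set=0xB3, tag=0x91A2 -- matching the slide. */
        printf("offset=0x%X set=0x%X tag=0x%X\n", offset, set, tag);
        return 0;
    }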

21 2-Way Cache – Hit Decision
 [Figure: the set field selects a set in both ways; each way's stored tag is compared against the address tag; the comparison results form the hit/miss signal and drive a MUX that selects the data out of the matching way.]

22 2-Way Set Associative Cache (cont.)
 • Memory is conceptually divided into slices whose size is half the cache size (the way size).
 • The offset from the slice start indicates the set number. Each set now contains two potential lines!
 • Addresses with the same offset map into the same set.
 • Two tags per set – one tag per line – are needed.
 [Figure: memory divided into way-size slices; lines at the same offset (marked x) in different slices map to the same set.]

23 What happens on a cache miss?
 • Read miss
 – Cache line fill: fetch the entire block that contains the missing data from memory.
 – The block is fetched into the cache line fill buffer.
 – It may take a few bus cycles to complete the fetch, e.g., a 64-bit (8-byte) data bus and a 32-byte cache line → 4 bus cycles.
 – Once the entire line is fetched, it is moved from the fill buffer into the cache.
 • What happens on a write miss?
 – The processor does not wait for the data → it continues its work.
 – There are 2 options: write allocate and write no-allocate.
 – Write allocate: fetch the line into the cache.
   Assumes that we may read from the line soon.
   Goes with a write-back policy (hoping that subsequent writes to the line hit the cache).
 – Write no-allocate: do not fetch the line into the cache on a write miss.
   Goes with a write-through policy (subsequent writes would update memory anyhow).
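A C-style sketch of the two write-miss policies; the helper functions are placeholders invented for illustration, not a real interface:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical helpers, declared only to make the policies concrete. */
    bool cache_hit(uint32_t addr);
    void fetch_line_into_cache(uint32_t addr);        /* line fill                 */
    void write_into_cache(uint32_t addr, uint32_t v); /* update the cached copy    */
    void write_to_memory(uint32_t addr, uint32_t v);  /* e.g., via a write buffer  */

    void store(uint32_t addr, uint32_t value, bool write_allocate)
    {
        if (cache_hit(addr)) {
            write_into_cache(addr, value); /* write-through would also send it to memory */
            return;
        }
        if (write_allocate) {
            /* Write allocate: bring the block in, then write it in the cache.
             * Usually paired with write-back.                                  */
            fetch_line_into_cache(addr);
            write_into_cache(addr, value);
        } else {
            /* Write no-allocate: update memory only, leave the cache untouched.
             * Usually paired with write-through.                               */
            write_to_memory(addr, value);
        }
    }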

24 Replacement
 • Each line contains a Valid indication.
 • Direct mapped: simple, a line can be brought to only one place.
 – The old line is evicted (written back to memory first, if needed).
 • n ways: need to choose among the ways in the set.
 • Options: FIFO, LRU, Random, Pseudo-LRU.
 • LRU is the best (on average).
 • LRU:
 – 2 ways: requires 1 bit per set to mark the latest accessed way.
 – 4 ways: need to save the full ordering.
 – Fully associative: the full ordering cannot be saved (too many bits) → approximate LRU.

25 Implementing LRU in a k-way set associative cache
 • For each set hold a k × k bit matrix.
 – Initialization (row i has 1s in its first i columns):
     row 0:   0 0 0 ... 0
     row 1:   1 0 0 ... 0
     row 2:   1 1 0 ... 0
     ...
     row k-1: 1 1 1 ... 1 0
 • When line j is accessed:
   set all bits in row j to 1 (done in parallel by hardware), THEN
   reset all bits in column j to 0 (in the same cycle).
 • Evict the line whose row is ALL 0s.
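A C sketch of the matrix scheme for one set (the associativity K = 4 and the all-zero initial state are my assumptions; with that initialization, ties are broken by the lowest way index):

    #include <stdbool.h>

    #define K 4                                /* assumed associativity */

    static bool m[K][K];                       /* per-set access matrix */

    /* On an access to way j: set all of row j, then clear all of column j.
     * In hardware both steps happen in the same cycle.                    */
    void touch(int j)
    {
        for (int c = 0; c < K; c++) m[j][c] = true;
        for (int r = 0; r < K; r++) m[r][j] = false;
    }

    /* The LRU victim is the way whose row is all zeros. */
    int lru_way(void)
    {
        for (int r = 0; r < K; r++) {
            bool all_zero = true;
            for (int c = 0; c < K; c++)
                if (m[r][c]) all_zero = false;
            if (all_zero) return r;
        }
        return 0;                              /* not reached: some row is always all zero */
    }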

26 Pseudo LRU
 • We will use a 4-way set associative cache as an example.
 • Full LRU records the full order of way accesses in each set (which way was most recently accessed, which was second, and so on).
 • Pseudo LRU (PLRU) records only a partial order, using 3 bits per set:
 – Bit 0 specifies whether the LRU way is one of {0, 1} or one of {2, 3}.
 – Bit 1 specifies which of ways 0 and 1 was least recently used.
 – Bit 2 specifies which of ways 2 and 3 was least recently used.
 • For example, if the order in which the ways were accessed is 3, 0, 2, 1, then bit0 = 1, bit1 = 1, bit2 = 1.
 [Figure: a binary tree over ways 0-3, with bit0 at the root, bit1 over ways 0/1 and bit2 over ways 2/3.]
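A C sketch of tree PLRU for one 4-way set. The bit meanings follow the slide; the specific 0/1 encoding is one convention I chose so that it reproduces the 3, 0, 2, 1 example:

    #include <stdio.h>

    /* One possible encoding:
     *   bit0 = 1 : the LRU way is in {2,3}, 0 : it is in {0,1}
     *   bit1 = 1 : within {0,1}, way 0 is LRU, 0 : way 1 is LRU
     *   bit2 = 1 : within {2,3}, way 3 is LRU, 0 : way 2 is LRU   */
    static int bit0, bit1, bit2;

    static void access_way(int w)
    {
        switch (w) {
        case 0: bit1 = 0; bit0 = 1; break;   /* way 1 becomes LRU of its pair */
        case 1: bit1 = 1; bit0 = 1; break;
        case 2: bit2 = 1; bit0 = 0; break;
        case 3: bit2 = 0; bit0 = 0; break;
        }
    }

    static int victim(void)
    {
        if (bit0) return bit2 ? 3 : 2;       /* LRU side is {2,3} */
        return bit1 ? 0 : 1;                 /* LRU side is {0,1} */
    }

    int main(void)
    {
        int order[] = {3, 0, 2, 1};          /* the slide's example */
        for (int i = 0; i < 4; i++) access_way(order[i]);
        printf("bits=%d%d%d victim=way %d\n", bit0, bit1, bit2, victim());
        return 0;                            /* prints bits=111 victim=way 3 */
    }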

27 Write Buffer for Write Through
 • A write buffer is needed between the cache and memory; the write buffer is just a FIFO: [Processor] → [Cache] → [Write Buffer] → [DRAM]
 • The processor writes data into the cache and into the write buffer.
 • The memory controller writes the contents of the buffer to memory.
 – This works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle.
 – If store frequency (w.r.t. time) > 1 / DRAM write cycle for a long period of time (the CPU cycle time is too quick and/or there are too many store instructions in a row), the store buffer will overflow no matter how big you make it.
 • Write combining: combine writes in the write buffer.
 • On a cache miss we need to look up the write buffer as well.

28 Improving Cache Performance
 • Separate the data cache from the instruction cache (will be discussed in future lectures).
 • Reduce the miss rate:
 – In order to reduce the misses, we need to understand why misses happen.
 • Reduce the miss penalty:
 – Bring the information to the processor as soon as possible.
 • Reduce the time to hit in the cache:
 – By Amdahl's law, since most of the time we hit the cache, it is important to make sure we accelerate the hit path.

29 Classifying Misses: the 3 Cs
 • Compulsory
 – The first access to a block cannot be in the cache, so the block must be brought into the cache.
 – Also called cold start misses or first reference misses.
 – These occur even in an infinite cache.
 – Solution (for a fixed cache-line size): prefetching.
 • Capacity
 – The cache cannot contain all the blocks needed during program execution (also phrased: the working set of the program is too big), so blocks are evicted and later retrieved.
 – Solution: increase cache size, stream buffers, software solutions.
 • Conflict
 – Occurs in set associative or direct mapped caches when too many blocks map to the same set.
 – Also called collision misses or interference misses.
 – Solution: increase associativity, victim cache, linker optimizations.

30 3Cs Absolute Miss Rate (SPEC92) [Chart: absolute miss rate for SPEC92 broken into compulsory, capacity and conflict components across cache sizes; the compulsory component is vanishingly small.]

31 How Can We Reduce Misses?
 • 3 Cs: Compulsory, Capacity, Conflict.
 • In all cases, assume the total cache size is not changed. What happens if we:
 1) Change the block size: which of the 3 Cs is obviously affected?
 2) Change the associativity: which of the 3 Cs is obviously affected?
 3) Change the compiler: which of the 3 Cs is obviously affected?

32 Reduce Misses via Larger Block Size

33 Reduce Misses via Higher Associativity
 • We have two conflicting trends here: higher associativity
 – improves the hit ratio, BUT
 – increases the access time,
 – slows down replacement,
 – increases complexity.
 • Most modern cache memory systems use at least 4-way set associative caches.

34 Example: Avg. Memory Access Time vs. Miss Rate
 • Example: assume the cache access time is 1.10 for 2-way, 1.12 for 4-way and 1.14 for 8-way, relative to the access time of a direct mapped cache.
 • Effective access time to the cache:

    Cache Size (KB)   1-way   2-way   4-way   8-way
    1                 2.33    2.15    2.07    2.01
    2                 1.98    1.86    1.76    1.68
    4                 1.72    1.67    1.61    1.53
    8                 1.46    1.48    1.47    1.43
    16                1.29    1.32    1.32    1.32
    32                1.20    1.24    1.25    1.27
    64                1.14    1.20    1.21    1.23
    128               1.10    1.17    1.18    1.20

 (Red in the original marks cases that are not improved by more associativity. Note this is for a specific example.)

35 Reducing the Miss Penalty: Critical Word First
 • Don't wait for the full block to be loaded before restarting the CPU:
 – Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
 – Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the words in the block are filled in. Also called wrapped fetch or requested word first.
 • Example:
 – A 64-bit (8-byte) bus and a 32-byte cache line → 4 bus cycles to fill the line.
 – Fetch data from address 95H: the chunk containing it (90H-97H) is fetched first, and the remaining chunks of the line (80H-87H, 88H-8FH, 98H-9FH) follow.
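A small C sketch that prints a typical wrap-around fetch order for the example above (miss on address 0x95, 32-byte line, 8-byte bus); the exact order of the non-critical chunks may differ between implementations:

    #include <stdio.h>

    #define LINE_SIZE 32u                  /* bytes per cache line */
    #define BUS_WIDTH 8u                   /* bytes per bus cycle  */

    int main(void)
    {
        unsigned miss_addr   = 0x95;
        unsigned line_base   = miss_addr & ~(LINE_SIZE - 1);            /* 0x80 */
        unsigned first_chunk = (miss_addr & (LINE_SIZE - 1)) / BUS_WIDTH;
        unsigned chunks      = LINE_SIZE / BUS_WIDTH;                   /* 4    */

        /* Critical word first: start with the chunk that holds 0x95 (0x90-0x97)
         * and wrap around until the whole line is filled.                      */
        for (unsigned i = 0; i < chunks; i++) {
            unsigned c = (first_chunk + i) % chunks;
            printf("bus cycle %u: 0x%02X-0x%02X\n", i + 1,
                   line_base + c * BUS_WIDTH,
                   line_base + c * BUS_WIDTH + BUS_WIDTH - 1);
        }
        return 0;
    }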

36 Prefetchers
 • In order to avoid compulsory misses, we need to bring the information in before it is requested by the program.
 • We can exploit the locality of reference in program behavior:
 – Spatial → bring in the surroundings of what was accessed.
 – Temporal → the same access "patterns" repeat themselves.
 • Prefetching relies on having extra memory bandwidth that can be used without penalty.
 • There are hardware and software prefetchers.

37 Hardware Prefetching
 • Instruction prefetching
 – The Alpha 21064 fetches 2 blocks on a miss: the extra block is placed in a stream buffer, to avoid possible cache pollution in case the prefetched instructions are not needed; on a miss, the stream buffer is checked.
 – Branch-predictor-directed prefetching: let the branch predictor run ahead.
 • Data prefetching
 – Try to predict future data accesses: next sequential, stride, general patterns.

38 Software Prefetching
 • Data prefetch
 – Load data into a register (HP PA-RISC loads).
 – Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9).
 – Special prefetch instructions cannot cause faults; this is a form of speculative execution.
 • How it is done
 – With a special prefetch intrinsic in the language.
 – Automatically by the compiler.
 • Issuing prefetch instructions takes time
 – Is the cost of issuing prefetches < the savings from reduced misses?
 – Wider superscalar processors reduce the difficulty of finding issue bandwidth for prefetches.

39 Other techniques

40 Multi-ported Cache and Banked Cache
 • An n-ported cache enables n cache accesses in parallel:
 – Parallelize cache accesses in different pipeline stages.
 – Parallelize cache accesses in a super-scalar processor.
 • However, it effectively doubles the cache die size.
 • A possible solution: a banked cache.
 – Each line is divided into n banks.
 – We can fetch data from k ≤ n different banks (in possibly different lines).

41 Separate Code / Data Caches
 • Enables parallelism between data accesses (done in the memory access stage) and instruction fetch (done in the fetch stage of a pipelined processor).
 • The code cache is a read-only cache:
 – No need to write the line back to memory when it is evicted.
 – Simpler to manage.
 • What about self-modifying code? (x86 only)
 – Whenever a memory write is executed, the code cache must be snooped.
 – If the code cache contains the written address, the line containing that address is invalidated.
 – The code cache is then accessed both in the fetch stage and in the memory access stage, so the tags need to be dual ported to avoid stalling.

42 Increasing the size with minimal latency loss – the L2 cache
 • L2 is much larger than L1 (256 KB - 1 MB compared to 32-64 KB).
 • It used to be an off-chip cache (between the cache and the memory bus). Now most implementations are on-chip (but some architectures have an off-chip level-3 cache).
 – If L2 is on-chip, why not just make L1 larger?
 • The hierarchy can be inclusive:
 – All addresses in L1 are also contained in L2.
 – Data in L1 may be more up to date than in L2.
 – L2 is unified (code / data).
 • Most architectures do not require the caches to be inclusive (although, due to the size difference, they usually are).
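A hedged C sketch of the effective access time with an L2 between L1 and memory; the latencies in main are illustrative, and the formulation (each access pays the L1 hit time, and misses add the lower level's effective time) is one common way to write it:

    #include <stdio.h>

    /* t_eff = t_L1 + missL1 * (t_L2 + missL2 * t_mem):
     * each level's miss penalty is the effective time of the level below. */
    static double two_level_amat(double t_l1, double miss_l1,
                                 double t_l2, double miss_l2, double t_mem)
    {
        return t_l1 + miss_l1 * (t_l2 + miss_l2 * t_mem);
    }

    int main(void)
    {
        /* Illustrative latencies in cycles: L1 = 1, L2 = 10, memory = 100,
         * 5% L1 miss rate, 50% local L2 miss rate.                        */
        printf("average access time = %.2f cycles\n",
               two_level_amat(1.0, 0.05, 10.0, 0.50, 100.0));
        return 0;
    }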

43 Victim Cache
 • Problem: the load per set may be non-uniform – some sets may suffer more conflict misses than others.
 • Solution: allocate ways to sets dynamically, according to the load.
 • When a line is evicted from the cache it is placed in the victim cache.
 – If the victim cache is full, its LRU line is evicted to L2 to make room for the new victim line from L1.
 • On a cache lookup, a victim cache lookup is performed as well (in parallel).
 • On a victim cache hit:
 – the line is moved back into the cache,
 – the evicted line is moved to the victim cache,
 – and the access takes the same time as a cache hit.
 • A victim cache is especially effective for a direct mapped cache: it enables combining the fast hit time of a direct mapped cache while still reducing conflict misses.

44 Stream Buffers
 • Before inserting a new line into the cache, put it in a stream buffer.
 • The line is moved from the stream buffer into the cache only if we get some indication that the line will be accessed again in the future.
 • Example:
 – Assume that we scan a very large array (much larger than the cache), and we access each item in the array just once.
 – If we insert the array into the cache it will thrash the entire cache.
 – If we detect that this is just a scan-once operation (e.g., using a hint from the software), we can avoid putting the array lines into the cache.

45 Backup

46 Compiler issues
 • Data alignment
 – A misaligned access might span several cache lines.
 – Prohibited in some architectures (Alpha, SPARC).
 – Very slow in others (x86).
 • Solution 1: add padding to data structures.
 • Solution 2: make sure memory allocations are aligned.
 • Code alignment
 – A misaligned instruction might span several cache lines.
 – x86 only. VERY slow.
 • Solution: insert NOPs to make sure instructions are aligned.
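A C illustration of the two data-alignment fixes. The 64-byte line size and the use of C11 alignas/aligned_alloc are my assumptions, not something the slides specify:

    #include <stdalign.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define CACHE_LINE 64                       /* assumed cache-line size */

    /* Solution 1: pad/align the structure so an element never straddles
     * two cache lines (here each element occupies exactly one line).     */
    struct padded_counter {
        alignas(CACHE_LINE) uint64_t value;
    };

    /* Solution 2: make sure dynamic allocations start on a line boundary. */
    void *alloc_line_aligned(size_t bytes)
    {
        /* C11 aligned_alloc requires the size to be a multiple of the alignment. */
        size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
        return aligned_alloc(CACHE_LINE, rounded);
    }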

47 Compiler issues 2
 • Overalignment
 – The alignment of an array can be a multiple of the cache size.
 – Several arrays then map to the same cache lines.
 – This causes excessive conflict misses (thrashing), e.g. in:
     for (int i = 0; i < N; i++)
         a[i] = a[i] + b[i] * c[i];
 • Solution 1: increase cache associativity.
 • Solution 2: break the alignment.
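A C sketch of solution 2 for the loop above: a pad inserted between same-size arrays shifts their start addresses so that a[i], b[i] and c[i] no longer compete for the same cache sets. The pad size and the way-size figure are assumptions:

    #include <stdlib.h>

    #define N        (1 << 20)                        /* large power-of-two length     */
    #define WAY_SIZE (16 * 1024)                      /* assumed cache way size, bytes */
    #define PAD      (WAY_SIZE / 8 / sizeof(float))   /* 2 KB worth of floats          */

    int main(void)
    {
        /* One allocation with pads between the arrays breaks the overalignment:
         * a, b and c start at different offsets modulo the way size.            */
        float *block = calloc(3 * N + 2 * PAD, sizeof(float));
        if (!block) return 1;
        float *a = block;
        float *b = a + N + PAD;
        float *c = b + N + PAD;

        for (int i = 0; i < N; i++)
            a[i] = a[i] + b[i] * c[i];                /* the loop from the slide */

        free(block);
        return 0;
    }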

