1 Lecture 5 Review of Memory Hierarchy (Appendix C in textbook)
Computer Architecture (計算機結構)
Ping-Liang Lai (賴秉樑)

2 Outline
C.0 Review From Last Lecture
C.1 Introduction
C.2 Cache Performance
C.3 Six Basic Cache Optimizations
C.4 Virtual Memory
C.5 Protection and Examples of Virtual Memory
C.6 Fallacies and Pitfalls
C.7 Concluding Remarks
C.8 Historical Perspective and References

3 Review From Last Lecture
Quantify and summarize performance: ratios, geometric mean, multiplicative standard deviation.
Fallacies & Pitfalls: benchmarks age, disks fail, a single point of failure is dangerous.
Control via state machines and microprogramming.
Pipelining just overlaps tasks, and is easy if the tasks are independent.
Speedup ≤ pipeline depth; if ideal CPI is 1, then:
  Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)
Hazards limit performance on computers:
  Structural: need more HW resources
  Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  Control: delayed branch, prediction
Exceptions and interrupts add complexity.

4 Outline
C.0 Review From Last Lecture
C.1 Introduction
C.2 Cache Performance
C.3 Six Basic Cache Optimizations
C.4 Virtual Memory
C.5 Protection and Examples of Virtual Memory
C.6 Fallacies and Pitfalls
C.7 Concluding Remarks
C.8 Historical Perspective and References

5 C.1 Introduction
The Principle of Locality
Programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
  Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse).
  Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access).
For the last 15 years, hardware has relied on locality for speed.
Speaker notes: An analogy: we all have many friends, but at any given time most of us keep in touch with only a small group of them. Temporal locality is like saying that if you just talked to a friend, you will likely talk to that friend again soon; for example, after having lunch together you might agree to go to a ball game on Sunday. Spatial locality is like saying that friends come in groups (high school, work, home); after one high school friend tells you another just won the lottery, you end up calling that other high school friend too. Locality is a property of programs that is exploited in machine design.
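To make the two kinds of locality concrete, here is a small Python sketch (mine, not from the slides): traversing a row-major matrix row by row touches consecutive addresses (spatial locality) and reuses the accumulator (temporal locality), while column-by-column traversal strides through memory. In C the gap is dramatic; in Python, interpreter overhead can mask much of it, so treat this only as an illustration.
```python
import time
import numpy as np

# NumPy arrays are row-major: row-order traversal touches consecutive
# addresses, while column-order traversal strides by a full row each step.
N = 1000
a = np.arange(N * N, dtype=np.float64).reshape(N, N)

def sum_row_major(m):
    total = 0.0
    for i in range(m.shape[0]):        # consecutive elements: good spatial locality
        for j in range(m.shape[1]):
            total += m[i, j]
    return total

def sum_col_major(m):
    total = 0.0
    for j in range(m.shape[1]):        # stride-N accesses: poor spatial locality
        for i in range(m.shape[0]):
            total += m[i, j]
    return total

for f in (sum_row_major, sum_col_major):
    t0 = time.perf_counter()
    f(a)
    print(f.__name__, round(time.perf_counter() - t0, 3), "s")
```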

6 Levels of the Memory Hierarchy
Each level is smaller, faster, and costlier per bit than the level below it; data is staged between adjacent levels in fixed-size transfer units (several values were lost with the original figure and are marked "-"):

Level         Capacity     Access time        Cost              Managed by        Transfer unit (to level above)
Registers     100s bytes   < 10s ns           -                 prog./compiler    1-8 bytes (instr. operands)
Cache         K bytes      -                  1-0.1 cents/bit   cache controller  8-128 bytes (blocks)
Main memory   M bytes      200-500 ns         -                 OS                512-4K bytes (pages)
Disk          G bytes      10 ms (10^7 ns)    -                 user/operator     M bytes (files)
Tape          infinite     sec-min            10^-8 cents/bit   -                 -

The upper level (registers) is the smallest and fastest; the lower level (tape) is the largest and slowest.

8 Since 1980, CPU has outpaced DRAM ...
Q. How do architects address this gap?
A. Put smaller, faster "cache" memories between CPU and DRAM. Create a "memory hierarchy".
[Figure: performance (1/latency) versus year, 1980-2000, log scale: CPU improves about 60% per year (2x in 1.5 years), DRAM about 9% per year (2x in 10 years), so the CPU-DRAM gap grew about 50% per year.]

9 Memory Hierarchy: Terminology
Hit: the data appears in some block in the upper level (say, Block X).
  Hit rate: the fraction of memory accesses found in the upper level.
  Hit time: the time to access the upper level, which consists of the RAM access time plus the time to determine hit/miss.
Miss: the data must be retrieved from a block in the lower level (Block Y).
  Miss rate = 1 - (hit rate).
  Miss penalty: the time to replace a block in the upper level plus the time to deliver the block to the processor.
Hit time << miss penalty (500 instructions on the Alpha 21264!)
[Figure: the processor exchanges data with the upper level (Block X); on a miss, Block Y is brought in from the lower level.]
Speaker notes: The miss penalty has two parts: the time to replace a block in the upper level (Block Y replacing Block X), and the time to deliver the new block to the processor. It is very important that the hit time be much smaller than the miss penalty; otherwise there would be no reason to build a memory hierarchy.

10 Cache Measures Review
Hit rate: fraction of accesses found in that level.
  Usually so high that we talk about the miss rate instead.
Miss-rate fallacy: just as MIPS can misrepresent CPU performance, miss rate can misrepresent memory performance; the better measure is average memory access time:
  Average memory access time = Hit time + Miss rate × Miss penalty (ns or clocks)
Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU.
  Access time: time to reach the lower level = f(latency to lower level)
  Transfer time: time to transfer the block = f(bandwidth between upper and lower levels)
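As a quick check of the formula, here is a minimal Python helper; the numbers in the call are illustrative, not taken from the lecture:
```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty.

    All times must be in the same unit (ns or clock cycles).
    """
    return hit_time + miss_rate * miss_penalty

# Hypothetical numbers: 1-cycle hit, 2% miss rate, 25-cycle miss penalty.
print(amat(1, 0.02, 25))   # 1.5 cycles per access on average
```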

11 Cache Performance Review (1/3)
Memory stall cycles: the number of cycles during which the processor is stalled waiting for a memory access.
Rewriting the CPU performance time:
  CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle time
The number of memory stall cycles depends on both the number of misses and the cost per miss, which is called the miss penalty:
  Memory stall cycles = Number of misses × Miss penalty
                      = IC × (Misses / Instruction) × Miss penalty
                      = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
The advantage of the last form is that its components can be easily measured.
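The last form maps directly onto measurable quantities. A minimal sketch in Python, with hypothetical parameter values chosen only for illustration:
```python
def memory_stall_cycles(ic, accesses_per_instr, miss_rate, miss_penalty):
    """Memory stall cycles = IC x (accesses/instr) x miss rate x miss penalty."""
    return ic * accesses_per_instr * miss_rate * miss_penalty

def cpu_time(ic, cpi, stall_cycles, clock_cycle_time):
    """CPU time = (base clock cycles + memory stall cycles) x cycle time."""
    return (ic * cpi + stall_cycles) * clock_cycle_time

# Hypothetical workload: 10^9 instructions, 1.5 accesses/instruction,
# 2% miss rate, 25-cycle miss penalty, 1 ns clock cycle.
stalls = memory_stall_cycles(ic=10**9, accesses_per_instr=1.5,
                             miss_rate=0.02, miss_penalty=25)
print(cpu_time(10**9, cpi=1.0, stall_cycles=stalls, clock_cycle_time=1e-9))
# 1.75 seconds: 0.75 stall cycles per instruction on top of CPI = 1.0
```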

12 Cache Performance Review (2/3)
Miss penalty depends on:
  Prior memory requests or memory refresh;
  The different clock rates of the processor, bus, and memory.
Thus, treating the miss penalty as a constant is a simplification.
Miss rate: the fraction of cache accesses that result in a miss (i.e., the number of accesses that miss divided by the number of accesses).
Extracting the formula for reads and writes separately:
  Memory stall cycles = IC × Reads per instruction × Read miss rate × Read miss penalty
                      + IC × Writes per instruction × Write miss rate × Write miss penalty
Combining the reads and writes simplifies the complete formula:
  Memory stall cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty

13 Example (C-5) Assume we have a computer where the clock cycles per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
Answer: First compute the performance for the computer that always hits:
  CPU execution time = (CPU clock cycles + Memory stall cycles) × Clock cycle
                     = (IC × 1.0 + 0) × Clock cycle
Now for the computer with the real cache, first we compute memory stall cycles:
  Memory stall cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty
                      = IC × (1 + 0.5) × 0.02 × 25 = IC × 0.75
The total performance is thus:
  CPU execution time (with cache) = (IC × 1.0 + IC × 0.75) × Clock cycle = 1.75 × IC × Clock cycle
The performance ratio is the inverse of the execution times:
  (1.75 × IC × Clock cycle) / (1.0 × IC × Clock cycle) = 1.75
The computer with no cache misses is 1.75 times faster.
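The same arithmetic in a few lines of Python (IC and the clock cycle time cancel in the ratio, so per-instruction quantities suffice):
```python
# Reproduce Example C-5. Every instruction fetch is one memory access;
# loads/stores add another 0.5 accesses per instruction on average.
accesses_per_instr = 1 + 0.5
miss_rate = 0.02
miss_penalty = 25            # clock cycles

stalls_per_instr = accesses_per_instr * miss_rate * miss_penalty
cpi_perfect = 1.0
cpi_real = cpi_perfect + stalls_per_instr

print(stalls_per_instr)          # 0.75
print(cpi_real / cpi_perfect)    # 1.75: speedup if every access were a hit
```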

14 Cache Performance Review (3/3)
Miss rate is often measured as misses per instruction rather than misses per memory reference:
  Misses / Instruction = Miss rate × (Memory accesses / Instruction)
For example, turning the miss rate in the previous example into misses per instruction:
  Misses / Instruction = 0.02 × 1.5 = 0.030, i.e., 30 misses per 1000 instructions
The latter formula is useful when you know the average number of memory accesses per instruction.

15 Example (C-6) To show the equivalency between the two miss rate equations, let's redo the example above, this time assuming a miss rate per 1000 instructions of 30. What is memory stall time in terms of instruction count?
Answer: Recomputing the memory stall cycles:
  Memory stall cycles = IC × (Misses / Instruction) × Miss penalty
                      = IC × (30 / 1000) × 25 = IC × 0.75
This matches the 0.75 stall cycles per instruction computed in Example C-5.
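Both miss-rate forms can be checked to give identical stall counts:
```python
# Misses-per-instruction form vs. miss-rate form give the same stalls.
miss_penalty = 25

# Form 1: miss rate per memory access times accesses per instruction.
stalls_a = 1.5 * 0.02 * miss_penalty       # 0.75 cycles/instruction

# Form 2: misses per 1000 instructions.
stalls_b = (30 / 1000) * miss_penalty      # 0.75 cycles/instruction

assert abs(stalls_a - stalls_b) < 1e-12    # tolerance for float rounding
print(stalls_a, stalls_b)
```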

16 4 Questions for Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

17 Q1: Where Can a Block Be Placed in The Upper Level?
Block placement: direct mapped, set associative, or fully associative.
  Direct mapped: (Block number) mod (Number of blocks in cache)
  Set associative: (Block number) mod (Number of sets in cache)
    n-way: n blocks per set, so number of sets = number of blocks / n
    1-way = direct mapped
  Fully associative: number of sets = 1
Example (8-block cache, block-frame address 12):
  Direct mapped: block 12 can go only into block 4 (12 mod 8).
  2-way set associative (sets 0-3): block 12 can go anywhere in set 0 (12 mod 4).
  Fully associative: block 12 can go anywhere.
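The mod arithmetic is easy to sketch; a small Python helper (the function name is mine) reproduces the slide's 8-block example:
```python
def placement(block_number, num_blocks, ways):
    """Return the set (or block) index a block maps to.

    ways=1 is direct mapped; ways=num_blocks is fully associative.
    """
    num_sets = num_blocks // ways
    return block_number % num_sets

print(placement(12, 8, ways=1))   # 4 -> direct mapped: only block 4
print(placement(12, 8, ways=2))   # 0 -> 2-way: anywhere in set 0
print(placement(12, 8, ways=8))   # 0 -> fully associative: the only set
```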

18 1 KB Direct Mapped Cache, 32B blocks
For a 2^N-byte cache with 32-bit addresses:
  The uppermost (32 - N) bits are always the cache tag, stored as part of the cache "state" along with a valid bit.
  The lowest M bits are the byte select (block size = 2^M).
  The bits in between form the cache index.
For this 1 KB (N = 10) cache with 32 B (M = 5) blocks:
  Bits 31-10: cache tag (example: 0x50)
  Bits 9-5: cache index (example: 0x01)
  Bits 4-0: byte select (example: 0x00)
[Figure: each of the 32 cache entries holds a valid bit, a tag, and a 32-byte data block (Byte 0 ... Byte 31); together the entries store Bytes 0-1023.]
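Bit-slicing a 32-bit address for exactly this geometry; a minimal Python sketch, assuming power-of-two sizes:
```python
def split_address(addr, cache_bytes=1024, block_bytes=32):
    """Split a 32-bit address into (tag, index, byte_select)."""
    m = block_bytes.bit_length() - 1          # 5 offset bits for 32 B blocks
    n = cache_bytes.bit_length() - 1          # 10 bits cover a 1 KB cache
    byte_select = addr & (block_bytes - 1)
    index = (addr >> m) & ((1 << (n - m)) - 1)
    tag = addr >> n
    return tag, index, byte_select

# Build an address with tag 0x50, index 0x01, byte select 0x00:
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print([hex(x) for x in split_address(addr)])  # ['0x50', '0x1', '0x0']
```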

19 Set Associative Cache
N-way set associative: N entries for each cache index, i.e., N direct-mapped caches operating in parallel.
Example: two-way set associative cache
  The cache index selects a "set" from the cache;
  The two tags in the set are compared to the input tag in parallel;
  Data is selected based on the tag comparison result.

20 Disadvantage of Set Associative Cache
N-way set associative cache versus direct-mapped cache:
  N comparators vs. 1;
  Extra MUX delay for the data;
  Data comes AFTER the hit/miss decision and set selection.
In a direct-mapped cache, the cache block is available BEFORE the hit/miss decision:
  Possible to assume a hit and continue; recover later if it was a miss.

21 Q2: Block Identification
Tag on each block
  No need to check the index or block offset in the tag comparison.
Increasing associativity shrinks the index and expands the tag.
Block address layout: | Tag | Index | Block offset |
  The index selects the set, the tag identifies the block within the set, and the block offset selects the data within the block.
Cache size = Associativity × 2^(index size) × 2^(offset size)
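Rearranging that identity gives each field width. A small, hypothetical helper (log2-based, assuming power-of-two sizes):
```python
import math

def field_widths(cache_bytes, block_bytes, ways, addr_bits=32):
    """Compute (tag, index, offset) widths from the identity
    cache size = associativity * 2**index_size * 2**offset_size."""
    offset = int(math.log2(block_bytes))
    index = int(math.log2(cache_bytes // (ways * block_bytes)))
    tag = addr_bits - index - offset
    return tag, index, offset

print(field_widths(1024, 32, ways=1))  # (22, 5, 5): the 1 KB direct-mapped cache
print(field_widths(1024, 32, ways=2))  # (23, 4, 5): 2-way shrinks index, grows tag
```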

22 Q3: Which block should be replaced on a miss?
Easy for direct mapped: only one candidate.
Set associative or fully associative caches must choose a policy:
  Random
  LRU (Least Recently Used)
  First in, first out (FIFO)
Data cache misses per 1000 instructions (entries lost with the original figure are marked "-"):

Size      2-way: LRU / Ran. / FIFO    4-way: LRU / Ran. / FIFO    8-way: LRU / Ran. / FIFO
16 KB     114.1 / 117.3 / 115.5       111.7 / 115.1 / 113.3       109.0 / 111.8 / 110.4
64 KB     103.4 / 104.3 / 103.9       102.4 / 102.3 / 103.1        99.7 / 100.5 / 100.3
256 KB     92.2 /  92.1 /  92.5           - /     - /     -           - /     - /     -
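LRU itself is easy to model. Below is an illustrative Python model of a single set (not code from the lecture), using an OrderedDict to track recency:
```python
from collections import OrderedDict

class LRUSet:
    """One set of an n-way cache with LRU replacement (illustrative model)."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()          # tag -> block data, oldest first

    def access(self, tag):
        if tag in self.blocks:
            self.blocks.move_to_end(tag)     # mark as most recently used
            return "hit"
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)  # evict the least recently used
        self.blocks[tag] = None              # fill with the new block
        return "miss"

s = LRUSet(ways=2)
print([s.access(t) for t in [1, 2, 1, 3, 2]])
# ['miss', 'miss', 'hit', 'miss', 'miss']: tag 2 was LRU when 3 arrived
```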

23 Q4: What Happens on a Write?

                                    Write-through                        Write-back
Policy                              Data written to the cache block      Write data only to the cache;
                                    and also to lower-level memory       update the lower level when the
                                                                         block falls out of the cache
Debugging                           Easy                                 Hard
Do read misses produce writes?      No                                   Yes
Do repeated writes make it to       Yes                                  No
the lower level?

Additional option: let writes to an un-cached address allocate a new cache line ("write-allocate").

24 Write Buffers for Write-Through Caches
Q. Why a write buffer?
A. So the CPU doesn't stall while the write reaches lower-level memory.
Q. Why a buffer, why not just one register?
A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for the write buffer?
A. Yes! Either drain the buffer before the next read, or check the write buffer and send the read first if there is no conflict.

25 Write-Miss Policy
Two options on a write miss:
  Write allocate: the block is allocated on a write miss, followed by the write-hit actions. Write misses then act like read misses.
  No-write allocate: write misses do not affect the cache; the block is modified only in the lower-level memory.
With no-write allocate, blocks stay out of the cache until the program tries to read them; with write allocate, even blocks that are only written will be in the cache.

26 Write-Miss Policy Example
Example: Assume a fully associative write-back cache with many cache entries that starts empty. Below is a sequence of five memory operations (the address is in square brackets):
  Write Mem[100];
  Write Mem[100];
  Read Mem[200];
  Write Mem[200];
  Write Mem[100].
What are the number of hits and misses (counting both reads and writes) when using no-write allocate versus write allocate?
Answer:
  No-write allocate:                  Write allocate:
  Write Mem[100];  1 write miss       Write Mem[100];  1 write miss
  Write Mem[100];  1 write miss       Write Mem[100];  1 write hit
  Read Mem[200];   1 read miss        Read Mem[200];   1 read miss
  Write Mem[200];  1 write hit        Write Mem[200];  1 write hit
  Write Mem[100];  1 write miss       Write Mem[100];  1 write hit
  4 misses; 1 hit                     2 misses; 3 hits
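The tallies can be verified with a tiny simulator; a minimal Python sketch of a fully associative cache that never needs to evict (it "starts empty" and has "many entries", as the example assumes). The write policy (write-back vs. write-through) does not change the hit/miss counts, so only allocation is modeled:
```python
def count_hits_misses(ops, write_allocate):
    """ops: list of ('R' | 'W', address). Fully associative, never evicts."""
    cache, hits, misses = set(), 0, 0
    for op, addr in ops:
        if addr in cache:
            hits += 1
        else:
            misses += 1
            if op == "R" or write_allocate:   # read misses always allocate
                cache.add(addr)
    return hits, misses

ops = [("W", 100), ("W", 100), ("R", 200), ("W", 200), ("W", 100)]
print(count_hits_misses(ops, write_allocate=False))  # (1, 4): 1 hit, 4 misses
print(count_hits_misses(ops, write_allocate=True))   # (3, 2): 3 hits, 2 misses
```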

