CSC 4250 Computer Architectures

CSC 4250 Computer Architectures December 1, 2006 Chapter 5. Memory Hierarchy

Figure 5.7. Data Cache in Alpha 21264

Data Cache Organization of the Alpha 21264
The 64KB cache is two-way set associative with 64-byte blocks.
The 9-bit index selects among 512 sets.
The circled numbers indicate the four steps of a read hit, in the order of occurrence.
Three bits of the block offset join the index to supply the RAM address that selects the proper 8 bytes. Thus, the cache holds two groups of 4096 64-bit words, one group per way (one block from each of the 512 sets).
The line from lower-level memory is used on a miss to load the cache.
The address leaving the CPU is 44 bits wide because it is a physical address.
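As a quick sanity check of these parameters, the tag/index/offset split of a physical address can be reproduced with a few lines of arithmetic. This is only a sketch using the figures on this slide (64KB, two-way set associative, 64-byte blocks, 44-bit physical address); the variable names are illustrative, not taken from the Alpha documentation.

# Sketch: deriving the address breakdown for a cache with the 21264 D-cache parameters
import math

cache_size    = 64 * 1024   # 64KB of data
block_size    = 64          # bytes per block
associativity = 2           # two-way set associative
addr_bits     = 44          # physical address width

num_blocks  = cache_size // block_size        # 1024 blocks
num_sets    = num_blocks // associativity     # 512 sets
offset_bits = int(math.log2(block_size))      # 6-bit block offset (3 of these select the 8-byte word)
index_bits  = int(math.log2(num_sets))        # 9-bit index
tag_bits    = addr_bits - index_bits - offset_bits   # 29-bit tag

print(num_sets, offset_bits, index_bits, tag_bits)   # 512 6 9 29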

Figure 5.8. Misses per 1000 instructions for instruction, data, and unified caches of different sizes. The percentage of instruction references is about 74%. The data are for two-way set-associative caches with 64-byte blocks. (Is a unified cache the worst choice?)

Size     Instruction cache   Data cache   Unified cache
8KB      8.16                44.0         63.0
16KB     3.82                40.9         51.0
32KB     1.36                38.4         43.3
64KB     0.61                36.9         39.4
128KB    0.30                35.3         36.2
256KB    0.02                32.6         32.9

Cache Performance
Which has the lower miss rate: a 16KB instruction cache together with a 16KB data cache, or a 32KB unified cache? Use the miss rates in Figure 5.8 to help calculate the correct answer, assuming 36% of the instructions are data transfer instructions. Assume a hit takes 1 clock cycle and the miss penalty is 100 clock cycles. A load or store hit takes 1 extra clock cycle on a unified cache if there is only one cache port to satisfy two simultaneous requests (a structural hazard). What is the average memory access time in each case? Assume write-through caches with a write buffer, and ignore stalls due to the write buffer.

Effective Miss Rates
Consider the 16KB instruction cache: instruction miss rate = 3.82/1000 ≈ 0.004
Consider the 16KB data cache (36% of instructions are data transfers): data miss rate = (40.9/1000)/0.36 ≈ 0.114
74% of memory accesses are instruction references, so the overall miss rate for the split caches = (74% × 0.004) + (26% × 0.114) ≈ 0.0324
Consider the 32KB unified cache: miss rate = (43.3/1000)/(1.00 + 0.36) ≈ 0.0318
Thus, the 32KB unified cache has a slightly lower effective miss rate than two 16KB caches.
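These figures can be reproduced with a short calculation. A minimal sketch, using the per-1000-instruction miss counts from Figure 5.8 and the 74%/26% access split from the example:

# Sketch: effective miss rates for split vs. unified caches (numbers from Figure 5.8)
instr_misses_per_1000   = 3.82   # 16KB instruction cache
data_misses_per_1000    = 40.9   # 16KB data cache
unified_misses_per_1000 = 43.3   # 32KB unified cache

data_refs_per_instr = 0.36       # 36% of instructions are loads/stores
instr_fraction, data_fraction = 0.74, 0.26   # fractions of all memory accesses

instr_miss_rate = instr_misses_per_1000 / 1000 / 1.00                 # ~0.0038 (one fetch per instruction)
data_miss_rate  = data_misses_per_1000 / 1000 / data_refs_per_instr   # ~0.1136

split_miss_rate   = instr_fraction * instr_miss_rate + data_fraction * data_miss_rate  # ~0.0324
unified_miss_rate = unified_misses_per_1000 / 1000 / (1.00 + data_refs_per_instr)      # ~0.0318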

Average Memory Access Time
Average memory access time = % instructions × (hit time + instruction miss rate × miss penalty) + % data × (hit time + data miss rate × miss penalty)
Two 16KB split caches: 74% × (1 + 0.0038 × 100) + 26% × (1 + 0.1136 × 100) = 74% × 1.38 + 26% × 12.36 = 1.023 + 3.214 = 4.24
32KB unified cache: 74% × (1 + 0.0318 × 100) + 26% × (1 + 1 + 0.0318 × 100) = 74% × 4.18 + 26% × 5.18 = 3.096 + 1.348 = 4.44
The split caches, which offer two memory ports per clock cycle (thus avoiding the structural hazard), have a better average memory access time, despite having a worse effective miss rate.
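The same arithmetic in code form; a minimal sketch reusing the miss rates above, with the extra cycle on the unified cache modeling the single-port structural hazard:

# Sketch: AMAT for split vs. unified caches, following the formula on this slide
hit_time     = 1     # clock cycles
miss_penalty = 100   # clock cycles

# Split caches: separate instruction and data ports, no structural hazard
amat_split = 0.74 * (hit_time + 0.0038 * miss_penalty) \
           + 0.26 * (hit_time + 0.1136 * miss_penalty)        # ~4.23 (~4.24 with unrounded miss rates)

# Unified cache: a load/store hit pays one extra cycle for the single cache port
amat_unified = 0.74 * (hit_time + 0.0318 * miss_penalty) \
             + 0.26 * (hit_time + 1 + 0.0318 * miss_penalty)  # ~4.44 cycles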

Cache Optimizations
Reducing miss penalty: multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches
Reducing miss rate: larger block size, larger cache size, higher associativity, way prediction, and compiler optimization
Reducing miss penalty or miss rate via parallelism: hardware and compiler prefetching
Reducing time to hit in cache: small and simple caches, and pipelined cache access

Reducing Cache Miss Penalty
1. Multilevel Caches
2. Critical Word First and Early Restart
3. Giving Priority to Read Misses over Writes
4. Merging Write Buffer
5. Victim Caches

1. Multilevel Caches
The performance gap between processors and memory leads to the question: should one make the cache fast enough to keep pace with the speed of the CPU, or make the cache large enough to overcome the widening gap between the CPU and main memory?

Answer: Both
Add another level of cache between the original cache and memory. The first-level cache is small enough to match the clock cycle time of the fast CPU, and the second-level cache is large enough to capture many accesses that would otherwise go to main memory.

Two-Level Cache System
Local miss rate ─ the number of misses in a cache divided by the total number of memory accesses to this cache
Global miss rate ─ the number of misses in the cache divided by the total number of memory accesses generated by the CPU
Thus, the global miss rate for the first-level cache is just its local miss rate, but the global miss rate for the second-level cache is the product of the local miss rates of the first- and second-level caches.

Example
Suppose that in 1000 memory references there are 40 misses in the first-level cache and 20 misses in the second-level cache. What are the various miss rates? Assume the miss penalty from the L2 cache to memory is 100 clock cycles, the hit time of the L2 cache is 10 clock cycles, the hit time of the L1 cache is 1 clock cycle, and there are 1.5 memory references per instruction. What is the average memory access time?

Answer
The miss rate (either local or global) of the first-level cache is 40/1000, or 4%
The local miss rate of the second-level cache is 20/40, or 50%
The global miss rate of the second-level cache is 20/1000, or 2%
Average memory access time = 1 + 4% × (10 + 50% × 100) = 1 + 4% × 60 = 3.4 clock cycles
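A minimal sketch of the same calculation, plugging the example's L1/L2 parameters into the two-level AMAT formula:

# Sketch: two-level cache miss rates and AMAT (numbers from the example above)
refs      = 1000
l1_misses = 40
l2_misses = 20

l1_miss_rate        = l1_misses / refs        # 4% (local and global are the same for L1)
l2_local_miss_rate  = l2_misses / l1_misses   # 50%
l2_global_miss_rate = l2_misses / refs        # 2%

l1_hit_time     = 1    # clock cycles
l2_hit_time     = 10
l2_miss_penalty = 100

# AMAT = L1 hit time + L1 miss rate x (L2 hit time + L2 local miss rate x L2 miss penalty)
amat = l1_hit_time + l1_miss_rate * (l2_hit_time + l2_local_miss_rate * l2_miss_penalty)
print(amat)   # ≈ 3.4 clock cycles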

Multilevel Inclusion
Multilevel inclusion is natural for a memory hierarchy: L1 data are always present in L2. Inclusion is desirable because consistency between I/O and the caches can be determined by just checking the second-level cache.
Many cache designers keep the block size the same in all levels of caches.
What if the designer can afford only an L2 cache that is slightly larger than the L1 caches? An example is the AMD Athlon: two 64KB L1 caches and only one 256KB L2 cache. Should a significant portion (50% in the Athlon) of the L2 space be used as a redundant copy of the L1 caches?

Multilevel Exclusion
Athlon: L1 data are never found in the L2 cache.
With this policy, a cache miss in L1 (followed by a cache hit in L2) results in a swap of blocks between L1 and L2 instead of a replacement of an L1 block with an L2 block.

2. Critical Word First and Early Restart
Critical word first ─ Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
Early restart ─ Fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
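To make the difference concrete, here is a toy comparison of how many bus transfers the CPU waits for under each scheme. The 8-transfers-per-block figure is an assumption for illustration (a 64-byte block moved 8 bytes at a time), not something stated on this slide.

# Toy sketch: transfers until the CPU gets the word it asked for
WORDS_PER_BLOCK = 8   # assumed: 64-byte block delivered as eight 8-byte transfers

def no_optimization(requested_word):
    # wait for the whole block before restarting the CPU
    return WORDS_PER_BLOCK

def early_restart(requested_word):
    # words arrive in normal order 0..7; restart when the requested one arrives
    return requested_word + 1

def critical_word_first(requested_word):
    # the missed word is transferred first
    return 1

# e.g. requested_word = 5: 8 transfers, 6 transfers, and 1 transfer respectively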

3. Giving Priority to Read Misses over Writes
Serve reads before writes are completed. What about write buffers? They might hold the updated value of a location needed on a read miss.
Example:
SW R3,512(R0)   ; M[512]←R3   (cache index 0)
LW R1,1024(R0)  ; R1←M[1024]  (cache index 0)
LW R2,512(R0)   ; R2←M[512]   (cache index 0)
Assume a direct-mapped, write-through cache that maps 512 and 1024 to the same block, and a four-word write buffer.
Will the value in R2 always be equal to the value in R3? What is the data hazard in memory? RAW

RAW Hazard in Memory
The data in R3 are placed into the write buffer after the store. The first load uses the same cache index and is therefore a miss. The second load tries to put the value in location 512 into R2; this also results in a miss. If the write buffer has not completed writing to location 512 in memory, the read of location 512 will put the old, wrong value into the cache block, and then into R2. Without proper precaution, R3 would not equal R2!

Solution
One way is to require the read miss to wait until the write buffer is empty.
An alternative is to check the contents of the write buffer on a read miss, as sketched below.
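A minimal sketch of the second alternative: on a read miss, the controller scans the write buffer for a matching address and forwards the pending data if it finds one. The structure and names here are illustrative only, not a real cache controller.

# Sketch: checking the write buffer on a read miss (illustrative only)
write_buffer = []   # pending stores: a list of (address, data) tuples, oldest first

def read_miss(address, memory):
    # scan from the newest entry so the most recent store to this address wins
    for buf_addr, buf_data in reversed(write_buffer):
        if buf_addr == address:
            return buf_data       # forward the pending store's value
    return memory[address]        # no conflict: safe to read memory before the buffer drains

In the SW/LW/LW sequence above, the load of location 512 would find the pending (512, R3) entry and return R3's value, so R2 equals R3 even though memory has not yet been updated.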

Cost of Writes in a Write-back Cache
How can we reduce the cost of writes in a write-back cache? Suppose a read miss replaces a dirty cache block. Instead of writing the dirty block to memory and then reading memory, we could copy the dirty block to a buffer, then read memory, and finally write memory. Now the CPU read will finish sooner. If a read miss occurs while the buffer is still full, the processor should either stall until the buffer is empty or check the addresses of the words in the buffer for conflicts (same as above for write-through).
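The ordering described above as a small pseudocode sketch. The data structures (a dict for memory, a dict for the cache line) are simplified stand-ins, not a real write-back controller:

# Sketch: read miss that evicts a dirty block in a write-back cache
def handle_read_miss(memory, cache_line, requested_addr):
    victim = None
    if cache_line["dirty"]:
        victim = (cache_line["addr"], cache_line["data"])    # 1. copy the dirty block to a buffer
    cache_line.update(addr=requested_addr,                   # 2. read memory first, so the
                      data=memory[requested_addr],           #    CPU's load finishes sooner
                      dirty=False)
    if victim is not None:
        memory[victim[0]] = victim[1]                        # 3. write the old dirty block last
    return cache_line["data"]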

4. Merging Write Buffer
Write merging: if the write buffer already holds a valid entry whose address matches a new store, the new data are combined with that entry instead of occupying a fresh buffer slot.
Without write merging, the words to the right in the upper part of the figure would only be used for instructions that write multiple words at the same time.
This optimization reduces stalls due to the write buffer being full.
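A minimal sketch of the merging check. The entry layout (one block-aligned base address covering four sequential words) is an assumption for illustration:

# Sketch: merging a store into an existing write-buffer entry
BLOCK_WORDS = 4     # assumed: each buffer entry covers 4 sequential words

write_buffer = []   # each entry: {"base": block-aligned word address, "words": {offset: data}}

def buffered_write(addr, data):
    base, offset = addr - addr % BLOCK_WORDS, addr % BLOCK_WORDS
    for entry in write_buffer:
        if entry["base"] == base:            # address matches an existing valid entry:
            entry["words"][offset] = data    # merge into it instead of taking a new slot
            return
    write_buffer.append({"base": base, "words": {offset: data}})

Four stores to sequential word addresses 100 through 103 then occupy a single buffer entry instead of four, so the buffer fills (and stalls the CPU) less often.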

5. Victim Caches
A victim cache is a small, fully associative cache that holds blocks recently evicted from the cache; it is checked on a miss before going to the next lower level.
AMD Athlon: victim cache with eight entries.
Although it reduces the miss penalty, the victim cache is aimed at reducing the damage done by conflict misses.