
1 CSCE 430/830 Computer Architecture: Introduction to Memory Hierarchy. Adapted from Professor David Patterson, Electrical Engineering and Computer Sciences, University of California, Berkeley

2 CSCE 430/830, Memory Hierarchy Introduction
Since 1980, CPU has outpaced DRAM...
– CPU: 60% per year (2X in 1.5 years)
– DRAM: 9% per year (2X in 10 years)
– Gap grew 50% per year
[Figure: Performance (1/latency) versus year for CPU and DRAM]
Q. How do architects address this gap?
A. Put smaller, faster “cache” memories between CPU and DRAM. Create a “memory hierarchy”.

3 1977: DRAM faster than microprocessors. Apple ][ (1977). [Photos: Steve Wozniak, Steve Jobs] CPU: 1000 ns; DRAM: 400 ns

4 Levels of the Memory Hierarchy
Level: capacity; access time; cost
– CPU Registers: 100s Bytes; <10s ns
– Cache: K Bytes (MB?); ns; cents/bit
– Main Memory: M Bytes (GB?); 200ns-500ns; $ cents/bit
– Disk: G Bytes (TB?); 10 ms (10,000,000 ns); cents/bit
– Tape: infinite; sec-min
Staging/transfer unit between adjacent levels (managed by; size):
– Registers <-> Cache: instructions and operands (program/compiler; 1-8 bytes)
– Cache <-> Memory: blocks (cache controller; bytes)
– Memory <-> Disk: pages (OS; 512-4K bytes)
– Disk <-> Tape: files (user/operator; Mbytes)
Upper levels are faster; lower levels are larger.

5 Memory Hierarchy: Apple iMac G5 (1.6 GHz)
Level: size; latency in cycles, time
– Reg: 1K; 1 cycle, 0.6 ns
– L1 Inst: 64K; 3 cycles, 1.9 ns
– L1 Data: 32K; 3 cycles, 1.9 ns
– L2: 512K; 11 cycles, 6.9 ns
– DRAM: 256M; 88 cycles, 55 ns
– Disk: 80G; 10^7 cycles, 12 ms
Registers are managed by the compiler; caches by hardware; DRAM and disk by the OS, hardware, and application.
Goal: illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

6 iMac’s PowerPC 970: All caches on-chip. [Die photo labels: Registers (1K), L1 (64K Instruction), L1 (32K Data), 512K L2]

7 The Principle of Locality
The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
Two different types of locality:
– Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
– Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
For the last 15 years, HW has relied on locality for speed.
Locality is a property of programs that is exploited in machine design.

8 Programs with locality cache well... [Figure: memory address versus time, one dot per access, with regions labeled Spatial Locality, Temporal Locality, and Bad locality behavior] Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): (1971)

9 Memory Hierarchy: Terminology
Hit: data appears in some block in the upper level (example: Block X)
– Hit Rate: the fraction of memory accesses found in the upper level
– Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
Miss: data needs to be retrieved from a block in the lower level (Block Y)
– Miss Rate = 1 - (Hit Rate)
– Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
Hit Time << Miss Penalty (500 instructions on 21264!)
[Figure: processor exchanging data with upper-level memory (Blk X), which exchanges blocks with lower-level memory (Blk Y)]

10 Cache Measures
Hit rate: fraction found in that level
– So high that we usually talk about miss rate instead
– Miss rate fallacy: miss rate is to average memory access time as MIPS is to CPU performance
Average memory-access time = Hit time + Miss rate x Miss penalty (ns or clocks)
Miss penalty: time to replace a block from the lower level, including time to replace in CPU
– access time: time to lower level = f(latency to lower level)
– transfer time: time to transfer block = f(BW between upper & lower levels)
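The average memory-access time formula on this slide can be sketched in a few lines of Python. The function name `amat` and the sample numbers are illustrative assumptions, not values from the lecture.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory-access time = hit time + miss rate x miss penalty.

    All times must use the same unit (ns or clock cycles).
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be a fraction between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Hypothetical cache: 1-cycle hit, 2% miss rate, 50-cycle miss penalty.
print(f"{amat(1, 0.02, 50):.2f} cycles")
```

A small change in miss rate moves the result noticeably because the miss penalty is much larger than the hit time.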

11 4 Questions for Memory Hierarchy
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

12 Q1: Where can a block be placed in the upper level?
Block 12 placed in an 8-block cache:
– Fully associative, direct mapped, 2-way set associative
– S.A. Mapping = (Block Number) Modulo (Number of Sets)
– Fully associative: anywhere in the cache
– Direct mapped: (12 mod 8) = 4
– 2-way set associative: (12 mod 4) = 0
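The three placement policies differ only in the number of ways per set, so one small function covers all of them. This is a sketch; `candidate_frames` is a hypothetical helper, and it assumes the frames of a set are numbered contiguously.

```python
def candidate_frames(block_number, num_frames, ways):
    """Return (set index, frames where the block may be placed).

    ways == 1          -> direct mapped
    ways == num_frames -> fully associative
    Assumes the frames belonging to one set are numbered contiguously.
    """
    num_sets = num_frames // ways
    set_index = block_number % num_sets      # (Block Number) mod (Number of Sets)
    first = set_index * ways
    return set_index, list(range(first, first + ways))


# Block 12 in an 8-block cache, as on the slide.
for ways in (1, 2, 8):
    print(ways, candidate_frames(12, 8, ways))
```

With `ways=1` the block has one legal frame (4), with `ways=2` it maps to set 0, and with `ways=8` any frame is allowed.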

13 Direct Mapped Block Placement: address maps to block, location = (block address MOD # blocks in cache). [Figure: memory blocks at addresses *0, *4, *8, *C mapping to cache blocks]

14 Fully Associative Block Placement: arbitrary block mapping, location = any. [Figure: memory blocks mapping to any cache block]

15 Set-Associative Block Placement: address maps to set, location = (block address MOD # sets in cache), at an arbitrary location within the set. [Figure: memory blocks at addresses *0, *4, *8, *C mapping to Sets 0-3 of the cache]

16 Q2: How is a block found if it is in the upper level?
Tag on each block
– No need to check index or block offset
Increasing associativity shrinks index, expands tag
Address layout: Block Address (Tag | Index) followed by Block Offset

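17 Direct-Mapped Cache Design. [Figure: the address is split into Tag, Cache Index, and Byte Offset; the cache SRAM is addressed by the index and each entry holds a tag in DATA[58:32], a valid bit V in DATA[59], and data in DATA[31:0]; HIT is asserted when the stored tag equals the address tag and V = 1]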

18 Set Associative Cache Design
Key idea:
– Divide cache into sets
– Allow block anywhere in a set
Advantages:
– Better hit rate
Disadvantages:
– More tag bits
– More hardware
– Higher access time
[Figure: A Four-Way Set-Associative Cache]

19 Fully Associative Cache Design
Key idea: set size of one block
– 1 comparator required for each block
– No address decoding
– Practical only for small caches due to hardware demands
[Figure: every stored tag is compared in parallel with the incoming tag (tag in); the matching entry drives data out]

20 In-Class Exercise
Given the following requirements for cache design for a 32-bit-address computer: (1) the cache contains 16KB of data, (2) each cache block contains 16 words, and (3) the placement policy is 4-way set-associative.
– What are the lengths (in bits) of the block offset field and the index field in the address?
– What are the lengths (in bits) of the index field and the tag field in the address if the placement is 1-way set-associative?
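One way to check answers to this exercise is to compute the field widths directly. The sketch below assumes 4-byte words and byte addressing, neither of which the slide states; `address_fields` is a hypothetical helper, and all sizes are assumed to be powers of two.

```python
def address_fields(cache_bytes, block_bytes, ways, address_bits=32):
    """Return (offset_bits, index_bits, tag_bits) for a set-associative cache.

    Assumes byte addressing and power-of-two sizes.
    """
    num_blocks = cache_bytes // block_bytes
    num_sets = num_blocks // ways
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    tag_bits = address_bits - index_bits - offset_bits
    return offset_bits, index_bits, tag_bits


WORD_BYTES = 4                      # assumption: 32-bit words
BLOCK_BYTES = 16 * WORD_BYTES       # 16 words per block
print(address_fields(16 * 1024, BLOCK_BYTES, ways=4))
print(address_fields(16 * 1024, BLOCK_BYTES, ways=1))
```

Under these assumptions the 4-way cache has a 6-bit offset and 6-bit index, and the 1-way cache has an 8-bit index and 18-bit tag: lowering associativity grows the index and shrinks the tag.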

21 Q3: Which block should be replaced on a miss?
Easy for direct mapped.
Set associative or fully associative:
– Random
– LRU (Least Recently Used)
Miss rates, LRU versus random (Ran) replacement:
Size      2-way LRU  2-way Ran  4-way LRU  4-way Ran  8-way LRU  8-way Ran
16 KB     5.2%       5.7%       4.7%       5.3%       4.4%       5.0%
64 KB     1.9%       2.0%       1.5%       1.7%       1.4%       1.5%
256 KB    1.15%      1.17%      1.13%      1.13%      1.12%      1.12%

22 Q3: After a cache read miss, if there are no empty cache blocks, which block should be removed from the cache?
– A randomly chosen block? Easy to implement; how well does it work?
– The Least Recently Used (LRU) block? Appealing, but hard to implement for high associativity.
– Also, try other LRU approximations.
Miss rate for a 2-way set associative cache:
Size      Random   LRU
16 KB     5.7%     5.2%
64 KB     2.0%     1.9%
256 KB    1.17%    1.15%
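To make the LRU policy concrete, here is a minimal software model of a single cache set that tracks tags only. It is an illustrative sketch, not how hardware implements LRU; the class name `LRUSet` is an assumption.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement (tags only, no data)."""

    def __init__(self, ways):
        self.ways = ways
        self.tags = OrderedDict()   # least recently used entry comes first

    def access(self, tag):
        """Return True on a hit, False on a miss (the block is then filled)."""
        if tag in self.tags:
            self.tags.move_to_end(tag)       # now the most recently used
            return True
        if len(self.tags) == self.ways:
            self.tags.popitem(last=False)    # evict the least recently used
        self.tags[tag] = None
        return False


s = LRUSet(ways=2)
print([s.access(t) for t in ["A", "B", "A", "C", "B"]])
```

In this trace the access to C evicts B (A was touched more recently), so the final access to B misses again. Keeping a full recency order per set is what becomes costly in hardware as associativity grows.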

23 Q4: What happens on a write?
Policy:
– Write-Through: data written to the cache block is also written to lower-level memory
– Write-Back: write data only to the cache; update the lower level when a block falls out of the cache
Debug: Write-Through easy; Write-Back hard
Do read misses produce writes? Write-Through: no; Write-Back: yes
Do repeated writes make it to the lower level? Write-Through: yes; Write-Back: no
Additional option (on a miss): let writes to an un-cached address allocate a new cache line (“write-allocate”).

24 Write Buffers for Write-Through Caches
[Figure: Processor -> Cache -> Write Buffer -> Lower Level Memory; the write buffer holds data awaiting write-through to lower-level memory]
Q. Why a write buffer? A. So the CPU doesn’t stall.
Q. Why a buffer, why not just one register? A. Bursts of writes are common.
Q. Are Read After Write (RAW) hazards an issue for the write buffer? A. Yes! Drain the buffer before the next read, or check the write buffer and send the read first only when no buffered write conflicts with it.
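The RAW hazard can be illustrated with a toy model of a write buffer in front of a lower-level memory. This is a sketch under simple assumptions (a FIFO buffer, a dict standing in for memory); the class and method names are hypothetical.

```python
from collections import deque


class WriteBuffer:
    """Toy FIFO write buffer in front of a lower-level memory (a dict)."""

    def __init__(self, memory):
        self.memory = memory      # address -> value
        self.pending = deque()    # buffered (address, value) writes, oldest first

    def write(self, addr, value):
        self.pending.append((addr, value))   # CPU continues without stalling

    def drain_one(self):
        """Retire the oldest buffered write to the lower level."""
        if self.pending:
            addr, value = self.pending.popleft()
            self.memory[addr] = value

    def read(self, addr):
        # Check buffered writes newest-first; forwarding the value avoids
        # returning stale data from memory (the RAW hazard).
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.memory.get(addr)


mem = {0x10: 1}
wb = WriteBuffer(mem)
wb.write(0x10, 2)
print(wb.read(0x10), mem[0x10])   # buffered value vs. still-stale memory
```

Here the read returns the buffered value while memory still holds the old one; the alternative named on the slide is to drain the buffer before issuing the read.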

25 5 Basic Cache Optimizations
Reducing miss rate (3 Cs of misses):
1. Larger block size (compulsory misses)
2. Larger cache size (capacity misses)
3. Higher associativity (conflict misses)
Reducing miss penalty:
4. Multilevel caches
Reducing hit time:
5. Giving reads priority over writes (e.g., read completes before earlier writes in write buffer)

26 In-Class Exercises
In systems with a write-through L1 cache backed by a write-back L2 cache instead of main memory, a merging write buffer can be simplified. Explain how this can be done. Are there situations where having a full write buffer (instead of the simple version you’ve just proposed) could be helpful?
– The merging buffer links the CPU to the L2 cache. Two CPU writes cannot merge if they are to different sets in L2. So, for each new entry into the buffer, a quick check on only those address bits that determine the L2 set number need be performed at first. If there is no match, then the new entry is not merged. Otherwise, all address bits can be checked for a definitive result.
– As the associativity of L2 increases, the rate of false positive matches from the simplified check will increase, reducing performance.

27 In-Class Exercises
As caches increase in size, blocks often increase in size as well.
1. If a large instruction cache has larger data blocks, is there still a need for prefetching? Explain the interaction between prefetching and increased block size in instruction caches.
– Program basic blocks are often short (<10 instructions), and thus program execution does not continue to follow sequential locations for very long. As blocks get larger, a program is more likely to branch out early rather than execute all instructions in the block, making instruction prefetching less attractive.
2. Is there a need for data prefetch instructions when data blocks get larger?
– Data structures often comprise lengthy sequences of memory addresses, and program access to a data structure often takes the form of a sequential sweep. Large data blocks work well with such access patterns, and prefetching is likely still of value for highly sequential access patterns.

28 Outline Review Memory hierarchy Locality Cache design Virtual address spaces Page table layout TLB design options Conclusion

29 The Limits of Physical Addressing
[Figure: CPU connected to Memory by address lines A0-A31 and data lines D0-D31; the CPU issues “physical addresses” of memory locations]
– All programs share one address space: the physical address space
– Machine language programs must be aware of the machine organization
– No way to prevent a program from accessing any machine resource

30 Solution: Add a Layer of Indirection
[Figure: CPU issues “Virtual Addresses” to Address Translation, which sends “Physical Addresses” (A0-A31) to Memory; data on D0-D31]
– User programs run in a standardized virtual address space
– Address translation hardware, managed by the operating system (OS), maps virtual addresses to physical memory
– Hardware supports “modern” OS features: Protection, Translation, Sharing

31 Three Advantages of Virtual Memory
Translation:
– Program can be given a consistent view of memory, even though physical memory is scrambled
– Makes multithreading reasonable (now used a lot!)
– Only the most important part of the program (“Working Set”) must be in physical memory
– Contiguous structures (like stacks) use only as much physical memory as necessary, yet can still grow later
Protection:
– Different threads (or processes) protected from each other
– Different pages can be given special behavior
  » (Read Only, Invisible to user programs, etc.)
– Kernel data protected from user programs
– Very important for protection from malicious programs
Sharing:
– Can map the same physical page to multiple users (“Shared memory”)

32 Page tables encode virtual address spaces
– A virtual address space is divided into blocks of memory called pages
– A machine usually supports pages of a few sizes (MIPS R4000)
– A page table is indexed by a virtual address
– A valid page table entry codes the physical memory “frame” address for the page
– OS manages the page table for each ASID (Address Space ID)
[Figure: virtual address indexing a Page Table whose entries point to frames in the Physical Memory Space]

33 Details of Page Table
– Page table maps virtual page numbers to physical frames (“PTE” = Page Table Entry)
– Virtual memory => treat memory as a cache for disk
– 4 fundamental questions: placement, identification, replacement, and write policy?
[Figure: the virtual address is split into a virtual page number (V page no.) and a 12-bit offset; the Page Table Base Reg plus the page number index into the page table, which is located in physical memory; each entry holds V (valid), Access Rights, and PA; the physical address is the physical page number (P page no.) followed by the same 12-bit offset]

34 Page tables may not fit in memory!
– A table for 4KB pages for a 32-bit address space has 1M entries
– Each process needs its own address space!
Two-level Page Tables:
– 32-bit virtual address = P1 index (10 bits) | P2 index (10 bits) | Page Offset (12 bits)
– Top-level table wired in main memory
– Subset of 1024 second-level tables in main memory; rest are on disk or unallocated
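A two-level lookup can be sketched as follows. The 10/10/12 bit split is an assumption that follows from the slide's 4KB pages, 32-bit addresses, and 1024 second-level tables; the nested dicts stand in for page tables, and the function names are hypothetical.

```python
def split_virtual_address(va):
    """Split a 32-bit virtual address into (P1 index, P2 index, page offset)."""
    offset = va & 0xFFF           # low 12 bits: offset within a 4KB page
    p2 = (va >> 12) & 0x3FF       # next 10 bits: index into a second-level table
    p1 = (va >> 22) & 0x3FF       # top 10 bits: index into the top-level table
    return p1, p2, offset


def translate(va, top_level):
    """Walk a two-level table stored as {p1: {p2: frame_number}}."""
    p1, p2, offset = split_virtual_address(va)
    second_level = top_level.get(p1)
    if second_level is None or p2 not in second_level:
        raise KeyError("page fault")          # table or page not resident
    return (second_level[p2] << 12) | offset


tables = {1: {3: 0x2A}}                       # hypothetical mapping
print(hex(translate(0x00403004, tables)))
```

Only second-level tables that are actually in use need to exist, which is how the scheme avoids keeping all 1M entries resident for every process.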

35 VM and Disk: Page replacement policy
– Dirty bit: page written. Used bit: set to 1 on any reference.
– Tail pointer: clear the used bit in the page table.
– Head pointer: place pages on the free list if the used bit is still clear. Schedule pages with the dirty bit set to be written to disk.
– Architect’s role: support setting dirty and used bits.
[Figure: circular set of all pages in memory with head and tail pointers; page table entries with used and dirty bits; free list of free pages]
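The two-pointer sweep described on this slide can be approximated in software as below. This is a simplified sketch, not an OS implementation: the pointer spacing, the data layout (a list of dicts), and the function name `clock_step` are all assumptions.

```python
def clock_step(pages, head, tail, free_list, writeback_queue):
    """Advance both pointers by one page and return their new positions."""
    # Head pointer: if the used bit is still clear, the page was not
    # referenced since the tail pointer cleared it, so reclaim it.
    victim = pages[head]
    if not victim["used"]:
        if victim["dirty"]:
            writeback_queue.append(head)   # schedule the write to disk
        free_list.append(head)             # simplification: a real OS would
                                           # reuse the frame only after the write
    # Tail pointer: clear the used bit; hardware sets it on any reference.
    pages[tail]["used"] = False
    n = len(pages)
    return (head + 1) % n, (tail + 1) % n


pages = [{"used": True, "dirty": False},
         {"used": True, "dirty": True},
         {"used": True, "dirty": False}]
free_list, writeback_queue = [], []
head, tail = 0, 1                          # tail runs one page ahead of head
for _ in range(2):
    head, tail = clock_step(pages, head, tail, free_list, writeback_queue)
print(free_list, writeback_queue)
```

Page 1 is reclaimed because nothing referenced it between the two pointers passing; the gap between the pointers sets how long a page has to prove it is still in use.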

36 CSCE 430/830, Memory Hierarchy Introduction TLB Design Concepts

37 MIPS Address Translation: How does it work?
– Translation Look-Aside Buffer (TLB): a small fully-associative cache of mappings from virtual to physical addresses
– TLB also contains protection bits for the virtual address
– Fast common case: virtual address is in the TLB, and the process has permission to read/write it
– What is the table of mappings that it caches?
[Figure: CPU issues “Virtual Addresses” to the TLB, which sends “Physical Addresses” (A0-A31) to Memory; data on D0-D31]

38 The TLB caches page table entries
– TLB caches page table entries for an ASID; each entry gives the physical frame address
– Physical and virtual pages must be the same size!
– V=0 pages either reside on disk or have not yet been allocated. OS handles V=0 “Page fault”.
– MIPS handles TLB misses in software (random replacement). Other machines use hardware.
[Figure: virtual address (page, offset) looked up in the TLB, backed by the Page Table, producing a physical address (frame, offset)]

39 Can TLB and caching be overlapped?
[Figure: the Page Offset supplies the cache Index and Byte Select while the Virtual Page Number goes through the TLB; the translated physical address is compared (=) with the Cache Tag to produce Hit; the cache holds Valid, Cache Tags, and Cache Data (Cache Block), and drives Data out]
This works, but...
Q. What is the downside?
A. Inflexibility. Size of cache limited by page size.

40 Problems With Overlapped TLB Access
Overlapped access only works as long as the address bits used to index into the cache do not change as the result of VA translation.
This usually limits things to small caches, large page sizes, or highly set-associative caches if you want a large cache.
Example: suppose everything is the same except that the cache is increased to 8 Kbytes instead of 4 Kbytes. One cache index bit now falls in the virtual page number: this bit is changed by VA translation, but is needed for cache lookup.
[Figure labels: virt page #, disp, cache index; 1K; set assoc cache]
Solutions: go to 8 Kbyte page sizes; go to a 2-way set associative cache; or have SW guarantee VA[13] = PA[13].
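The constraint behind this slide is that the index and block-offset bits must fit within the page offset, i.e. cache size divided by associativity must not exceed the page size. A minimal check, assuming a 4 Kbyte page in the slide's example (implied by the "go to 8K byte page sizes" fix but not stated outright); `can_overlap_tlb` is a hypothetical helper:

```python
def can_overlap_tlb(cache_bytes, ways, page_bytes):
    """True if the cache can be indexed using only untranslated page-offset bits."""
    bytes_per_way = cache_bytes // ways   # address range covered by index + offset
    return bytes_per_way <= page_bytes


KB = 1024
print(can_overlap_tlb(4 * KB, 1, 4 * KB))   # original configuration
print(can_overlap_tlb(8 * KB, 1, 4 * KB))   # cache doubled: one index bit is translated
print(can_overlap_tlb(8 * KB, 1, 8 * KB))   # fix 1: 8 Kbyte pages
print(can_overlap_tlb(8 * KB, 2, 4 * KB))   # fix 2: 2-way set associative
```

The third software fix on the slide (forcing the offending virtual and physical address bit to match) is not modeled here.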

41 Use virtual addresses for cache?
[Figure: CPU issues “Virtual Addresses” directly to a virtual Cache (data on D0-D31); the TLB translates to “Physical Addresses” (A0-A31) for Main Memory]
Only use the TLB on a cache miss!
Downside: a subtle, fatal problem. What is it?
A. Synonym problem. If two address spaces share a physical frame, data may be in the cache twice. Maintaining consistency is a nightmare.

42 In-Class Exercise
Some memory systems handle TLB misses in sw (as an exception), while others use hw.
1. What are the trade-offs between these two methods?
– Hw is faster but less flexible.
2. Will sw handling of TLB misses always be slower? Explain.
– Factors other than whether miss handling is done in hw or sw can quickly dominate handling time. E.g., is the page table itself paged? Can sw implement a more efficient page table search algorithm than hw? Hw prefetching?
3. Are there page table structures that would be difficult to handle in hw but possible in sw?
– Page table structures that change dynamically.

43 Summary #1/3: The Cache Design Space
Several interacting dimensions:
– cache size
– block size
– associativity
– replacement policy
– write-through vs write-back
– write allocation
The optimal choice is a compromise:
– depends on access characteristics
  » workload
  » use (I-cache, D-cache, TLB)
– depends on technology / cost
Simplicity often wins.
[Figure: design space with axes Associativity, Cache Size, Block Size; curves from Bad to Good as Factor A and Factor B go from Less to More]

44 Summary #2/3: Caches
The Principle of Locality:
– Programs access a relatively small portion of the address space at any instant of time.
  » Temporal Locality: Locality in Time
  » Spatial Locality: Locality in Space
Three Major Categories of Cache Misses:
– Compulsory Misses: sad facts of life. Example: cold start misses.
– Capacity Misses: increase cache size
– Conflict Misses: increase cache size and/or associativity. Nightmare scenario: ping pong effect!
Write Policy: Write Through vs. Write Back
Today CPU time is a function of (ops, cache misses) vs. just f(ops): this affects compilers, data structures, and algorithms.

45 Summary #3/3: TLB, Virtual Memory
– Page tables map virtual addresses to physical addresses
– TLBs are important for fast translation
– TLB misses are significant in processor performance
  » funny times, as most systems can’t access all of the 2nd level cache without TLB misses!
– Caches, TLBs, and Virtual Memory are all understood by examining how they deal with 4 questions: 1) Where can a block be placed? 2) How is a block found? 3) What block is replaced on a miss? 4) How are writes handled?
– Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than memory hierarchy benefits, but computers are still insecure

46 Classifying Misses: 3C
– Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an Infinite Cache)
– Capacity: if the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in Fully Associative Size X Cache)
– Conflict: if the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in N-way Associative, Size X Cache)

47 Classifying Misses: 3C. [Figure: 3Cs Absolute Miss Rate (SPEC92), showing conflict misses; compulsory misses are vanishingly small]

48 2:1 Cache Rule: miss rate of a 1-way associative cache of size X = miss rate of a 2-way associative cache of size X/2. [Figure: miss rate plot highlighting conflict misses]

49 3C Relative Miss Rate. Flaws: for fixed block size. Good: insight => invention. [Figure: relative miss rate by category, including conflict misses]

50 Improve Cache Performance
Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
Improve performance by:
1. Reducing the miss rate,
2. Reducing the miss penalty, or
3. Reducing the time to hit in the cache.
Reducing each of these! Simultaneously?

51 Reducing Cache Misses: 1. Larger Block Size
Using the principle of locality: the larger the block, the greater the chance parts of it will be used again. [Figure: miss rate versus block size for several cache sizes]

52 Increasing Block Size
One way to reduce the miss rate is to increase the block size
– Takes advantage of spatial locality
– Decreases compulsory misses
However, larger blocks have disadvantages
– May increase the miss penalty (need to get more data)
– May increase hit time (need to read more data from cache, and a larger mux)
– May increase miss rate, since fewer blocks fit in the cache and conflict misses rise
Increasing the block size can help, but don’t overdo it.

53 Block Size vs. Cache Measures
Increasing block size generally increases miss penalty and decreases miss rate.
As the block size increases, the AMAT starts to decrease, but eventually increases.
Avg. Memory Access Time = Hit Time + Miss Rate x Miss Penalty
[Figure: miss rate, miss penalty, and average memory access time, each plotted against block size]

54 Reducing Cache Misses: 2. Higher Associativity
Increasing associativity helps reduce conflict misses.
2:1 Cache Rule:
– The miss rate of a direct mapped cache of size N is about equal to the miss rate of a 2-way set associative cache of size N/2
– For example, the miss rate of a 32 Kbyte direct mapped cache is about equal to the miss rate of a 16 Kbyte 2-way set associative cache
Disadvantages of higher associativity:
– Need to do a large number of comparisons
– Need an n-to-1 multiplexor for n-way set associative
– Could increase hit time

55 AMAT vs. Associativity
[Table: A.M.A.T. by cache size (KB) and associativity (1-way, 2-way, 4-way, 8-way)]
Red means A.M.A.T. not improved by more associativity.
Does not take into account the effect of a slower clock on the rest of the program.

56 Reducing Cache Misses: 3. Victim Cache
– Data discarded from the cache is placed in an extra small buffer (victim cache).
– On a cache miss, check the victim cache for the data before going to main memory.
– Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct mapped data cache.
– Used in Alpha, HP PA-RISC CPUs.
[Figure: CPU address and data in/out connected to the cache (tag compare =?, data), a victim cache (tag compare =?), and a write buffer in front of lower-level memory]

57 Reducing Cache Misses: 4. Way Prediction and Pseudoassociative Caches
– Way prediction helps select one block among those in a set, thus requiring only one tag comparison (if hit).
  » Preserves advantages of direct mapping (why?)
  » In case of a miss, other block(s) are checked.
– Pseudoassociative (also called column associative) caches
  » Operate exactly as direct-mapped caches on a hit, thus again preserving the advantages of direct mapping
  » In case of a miss, another block is checked (as in set-associative caches), by simply inverting the most significant bit of the index field to find the other block in the “pseudoset”
  » real hit time < pseudo-hit time
  » too many pseudo hits would defeat the purpose

58 Reducing Cache Misses: 5. Compiler Optimizations

59 Reducing Cache Misses: 5. Compiler Optimizations

60 Reducing Cache Misses: 5. Compiler Optimizations

61 Reducing Cache Misses: 5. Compiler Optimizations
Blocking: improve temporal and spatial locality
– multiple arrays are accessed in both ways (i.e., row-major and column-major), namely, orthogonal accesses that cannot be helped by earlier methods
– concentrate on submatrices, or blocks
c) All N*N elements of Y and Z are accessed N times and each element of X is accessed once. Thus, there are N^3 operations and 2N^3 + N^2 reads! Capacity misses are a function of N and cache size in this case.

62 Reducing Cache Misses: 5. Compiler Optimizations (cont’d)
Blocking: improve temporal and spatial locality
– To ensure that the elements being accessed can fit in the cache, the original code is changed to compute a submatrix of size B*B, where B is called the blocking factor.
– The total number of memory words accessed is 2N^3/B + N^2
– Blocking exploits a combination of spatial (Y) and temporal (Z) locality.
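The loop nest that these slides discuss did not survive in the transcript, so the following is a sketch of the standard blocked matrix multiply X = Y * Z with blocking factor B, written in Python for readability. It is not necessarily the code shown on the original slides, and the function name `blocked_matmul` is an assumption.

```python
def blocked_matmul(y, z, n, b):
    """Compute x = y * z for n x n matrices, working on b x b submatrices."""
    x = [[0] * n for _ in range(n)]
    for jj in range(0, n, b):                 # block of columns of z and x
        for kk in range(0, n, b):             # block of rows of z
            for i in range(n):
                for j in range(jj, min(jj + b, n)):
                    r = 0
                    for k in range(kk, min(kk + b, n)):
                        r += y[i][k] * z[k][j]
                    x[i][j] += r              # accumulate partial sums per block
    return x


y = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(blocked_matmul(y, identity, 3, 2) == y)
```

The two outer loops pick a B x B block of Z that stays in the cache while every row of Y streams past it, which is where the temporal reuse of Z comes from. In Python the cache effect is not measurable; the code only shows the loop structure.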

63 Reducing Cache Miss Penalty: 1. Multi-level Cache
To keep up with the widening gap between CPU and main memory, try to:
– make the cache faster, and
– make the cache larger by adding another, larger but slower cache between the cache and main memory.

64 Adding an L2 Cache
If a direct mapped cache has a hit rate of 95%, a hit time of 4 ns, and a miss penalty of 100 ns, what is the AMAT?
If an L2 cache is added with a hit time of 20 ns and a hit rate of 50%, what is the new AMAT?

65 Adding an L2 Cache
If a direct mapped cache has a hit rate of 95%, a hit time of 4 ns, and a miss penalty of 100 ns, what is the AMAT?
AMAT = Hit time + Miss rate x Miss penalty = 4 + 0.05 x 100 = 9 ns
If an L2 cache is added with a hit time of 20 ns and a hit rate of 50%, what is the new AMAT?
AMAT = Hit Time L1 + Miss Rate L1 x (Hit Time L2 + Miss Rate L2 x Miss Penalty L2) = 4 + 0.05 x (20 + 0.5 x 100) = 7.5 ns
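The two-level calculation can be reproduced with a short script. This is a sketch; the function name `amat_two_level` is an assumption, and the inputs are the figures given in the exercise.

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, miss_rate_l2, miss_penalty_l2):
    """AMAT when an L1 miss is served by an L2 cache backed by main memory."""
    l1_miss_penalty = hit_l2 + miss_rate_l2 * miss_penalty_l2
    return hit_l1 + miss_rate_l1 * l1_miss_penalty


# L1 only: 95% hit rate, 4 ns hit time, 100 ns miss penalty.
l1_only = 4 + 0.05 * 100
# With L2: 20 ns hit time, 50% hit rate, 100 ns penalty to main memory.
with_l2 = amat_two_level(4, 0.05, 20, 0.5, 100)
print(f"L1 only: {l1_only:.1f} ns, with L2: {with_l2:.1f} ns")
```

The L2 cache replaces the 100 ns L1 miss penalty with an effective 70 ns, which is where the drop from 9 ns to 7.5 ns comes from.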

66 Reducing Cache Miss Penalty: 2. Handling Misses Judiciously
– Critical Word First and Early Restart
  » CPU needs just one word of the block at a time:
  » critical word first: fetch the required word first, and
  » early restart: as soon as the required word arrives, send it to the CPU.
– Giving Priority to Read Misses over Write Misses
  » Serves reads before writes have been completed:
  » while write buffers improve write-through performance, they complicate memory accesses by potentially delaying updates to memory;
  » instead of waiting for the write buffer to become empty before processing a read miss, the write buffer is checked for content that might satisfy the missing read;
  » in a write-back scheme, the dirty copy being replaced is first written to the write buffer instead of the memory, thus improving performance.

67 Reducing Cache Miss Penalty: 3. Compiler-Controlled Prefetching
– Compiler inserts prefetch instructions
– An example:
    for (i = 0; i < 3; i = i + 1)
        for (j = 0; j < 100; j = j + 1)
            a[i][j] = b[j][0] * b[j+1][0];
– 16-byte blocks, 8KB cache, 1-way, write back, 8-byte elements. What kind of locality, if any, exists for a and b?
  a. 3 rows of 100 elements (100 columns) visited; spatial locality: even-indexed elements miss and odd-indexed elements hit, leading to 3*100/2 = 150 misses
  b. 101 rows and 3 columns visited; no spatial locality, but there is temporal locality: the same element is used in consecutive inner-loop iterations (b[j+1][0] in one iteration is b[j][0] in the next), and the same elements are accessed again in each i iteration (outer loop). 100 misses for i = 0 and 1 miss for j = 0, for a total of 101 misses
– Assume a large miss penalty (50 cycles, so at least 7 iterations must be prefetched ahead). Splitting the loop into two, we have

68 Reducing Cache Miss Penalty: 3. Compiler-Controlled Prefetching
– An example (continued):
    for (j = 0; j < 100; j = j + 1) {
        prefetch(b[j+7][0]);
        prefetch(a[0][j+7]);
        a[0][j] = b[j][0] * b[j+1][0];
    }
    for (i = 1; i < 3; i = i + 1)
        for (j = 0; j < 100; j = j + 1) {
            prefetch(a[i][j+7]);
            a[i][j] = b[j][0] * b[j+1][0];
        }
– Assuming that each iteration of the pre-split loop consumes 7 cycles and that there are no conflict or capacity misses, it consumes a total of 7*300 + 251*50 = 14650 cycles (total iteration cycles plus total cache miss cycles).

69 Reducing Cache Miss Penalty: 3. Compiler-Controlled Prefetching (cont’d)
– An example (continued)
  » the first loop consumes 9 cycles per iteration (due to the two prefetch instructions)
  » the second loop consumes 8 cycles per iteration (due to the single prefetch instruction)
  » during the first 7 iterations of the first loop, array a incurs 4 cache misses and array b incurs 7 cache misses
  » during the first 7 iterations of the second loop, for i = 1 and i = 2, array a incurs 4 cache misses each
  » array b does not incur any cache miss in the second loop!
  » the split loop consumes a total of (1+1+7)*100 + (4+7)*50 + (1+7)*200 + (4+4)*50 = 3450 cycles
– Prefetching improves performance: 14650/3450 = 4.25, i.e., a speedup of about 4.25x
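The cycle counts in this example can be re-derived with a few lines of arithmetic. The variable names are illustrative; every number comes from the example's stated assumptions (7 cycles per original iteration, 50-cycle miss penalty, 150 misses for a and 101 for b).

```python
MISS_PENALTY = 50                     # cycles per cache miss

# Original loop: 3 * 100 iterations at 7 cycles each, plus all misses.
iterations = 3 * 100
misses = 150 + 101                    # array a + array b
original = 7 * iterations + misses * MISS_PENALTY

# Split loop with prefetching.
first_loop = (1 + 1 + 7) * 100 + (4 + 7) * MISS_PENALTY    # i = 0, two prefetches
second_loop = (1 + 7) * 200 + (4 + 4) * MISS_PENALTY       # i = 1, 2, one prefetch
split = first_loop + second_loop

print(original, split, round(original / split, 2))
```

Almost all of the saving comes from the miss term: the prefetched version pays the 50-cycle penalty 19 times instead of 251 times, at the cost of one or two extra cycles per iteration.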

70 Reducing Cache Hit Time:
– Small and simple caches
  » smaller is faster: small index, less address translation time
  » a small cache can fit on the same chip as the CPU
  » low associativity: in addition to a simpler/shorter tag check, a 1-way cache allows overlapping the tag check with transmission of data, which is not possible with any higher associativity!
– Avoid address translation during indexing
  » Make the common case fast: use virtual addresses for the cache, because most memory accesses (more than 90%) take place in the cache, resulting in a virtual cache

71 Reducing Cache Hit Time:
– Make the common case fast (continued): there are at least three important performance aspects that directly relate to virtual-to-physical translation:
  » improperly organized or insufficiently sized TLBs may create excess not-in-TLB faults, adding to program execution time
  » for a physical cache, the TLB access must occur before the cache access, extending the cache access time
  » two-line addresses (e.g., an I-line and a D-line address) may be independent of each other in the virtual address space yet collide in the real address space, when they draw pages whose lower page address bits (and upper cache address bits) are identical
– Problems with virtual caches:
  » Page-level protection must be enforced no matter what during address translation (solution: copy protection info from the TLB on a miss and hold it in a field for future virtual indexing/tagging)
  » when a process is switched in/out, the entire cache has to be flushed because the physical addresses will be different each time, i.e., the problem of context switching (solution: process identifier tag, PID)

72 Reducing Cache Hit Time:
– Avoid address translation during indexing (continued)
  » problems with virtual caches: different virtual addresses may refer to the same physical address, i.e., the problem of synonyms/aliases
  » HW solution: guarantee every cache block a unique physical address
  » SW solution: force aliases to share some address bits (e.g., page coloring)
  » Virtually indexed and physically tagged
– Pipelined cache writes
  » the solution is to reduce CCT and increase the number of stages, which increases instruction throughput
– Trace caches
  » Find a dynamic sequence of instructions, including taken branches, to load into a cache block
  » Put traces of the executed instructions into cache blocks as determined by the CPU
  » Branch prediction is folded into the cache and must be validated along with the addresses to have a valid fetch
  » Disadvantage: stores the same instructions multiple times

