
Memory Hierarchy Design


Slide 1: Memory Hierarchy Design
Motivated by a combination of programmers' desire for unlimited fast memory and economic considerations, and based on:
- the principle of locality, and
- the cost/performance ratio of memory technologies (fast -> small, large -> slow),
the goal is a memory system with cost almost as low as the cheapest level of memory and speed almost as fast as the fastest level.
Hierarchy: CPU/register file (RF) -> cache (C) -> main memory (MM) -> disk memory/I/O devices (DM)
Speed in descending order: RF > C > MM > DM
Space in ascending order: RF < C < MM < DM

Slide 2: Gaps Between Levels
The gaps in speed and space between the different levels are widening:
Level/Name | 1/RF | 2/C | 3/MM | 4/DM
Typical size | < 1 KB | < 16 MB | < 16 GB | > 100 GB
Implementation technology | Custom memory with multiple ports, CMOS | On-chip or off-chip CMOS SRAM | CMOS DRAM | Magnetic disk
Access time (ns) | 0.25-0.5 | 0.5-25 | 80-250 | 5,000,000
Bandwidth (MB/s) | 20,000-100,000 | 5000-10,000 | 1000-5000 | 20-150
Managed by | Compiler | Hardware | Operating system | OS/operator
Backed by | Cache | Main memory | Disk | CD or tape

Slide 3: Cache Performance Review and Design Issues
Cache performance review:
Memory stall cycles = Number_of_misses * Miss_penalty
                    = IC * Misses_per_instr * Miss_penalty
                    = IC * MAPI * Miss_rate * Miss_penalty
where IC is the instruction count and MAPI stands for memory accesses per instruction.
Four fundamental memory hierarchy design issues:
1. Block placement: where can a block, the atomic memory unit in cache-memory transactions, be placed in the upper level?
2. Block identification: how is a block found if it is in the upper level?
3. Block replacement: which block should be replaced on a miss?
4. Write strategy: what happens on a write?
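The stall-cycle formula can be checked with a few lines of Python. This is only a sketch: the function name and the workload numbers are illustrative assumptions, not values from the slides.

```python
def memory_stall_cycles(instruction_count, mem_accesses_per_instr,
                        miss_rate, miss_penalty):
    """Memory stall cycles = IC * MAPI * Miss_rate * Miss_penalty."""
    return (instruction_count * mem_accesses_per_instr
            * miss_rate * miss_penalty)


# Hypothetical workload: 1,000,000 instructions, 1.5 memory accesses per
# instruction, a 2% miss rate, and a 50-cycle miss penalty.
stalls = memory_stall_cycles(1_000_000, 1.5, 0.02, 50)
print(f"memory stall cycles: {stalls:.0f}")
```

Because the four factors multiply, halving either the miss rate or the miss penalty halves the stall cycles.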

Slide 4: 1. Placement
Three approaches:
1) Fully associative: any block in main memory can be placed in any block frame. It is flexible but expensive because of the associative search.
2) Direct mapping: each block in memory is placed in a fixed block frame given by the mapping function
   (Block address) MOD (Number of blocks in cache)
3) Set associative: a compromise between fully associative and direct mapping. The cache is divided into sets of block frames; each block from memory is first mapped to a fixed set, within which it can be placed in any block frame. Mapping to a set follows the function, called bit selection:
   (Block address) MOD (Number of sets in cache)
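The three placement policies differ only in how many frames per set are allowed, which a small Python sketch can show. The function name and the frame-numbering convention (set s occupies frames s*associativity onward) are assumptions made for illustration.

```python
def candidate_frames(block_address, num_frames, associativity):
    """Return the block frames in which a memory block may be placed.

    associativity == 1 models direct mapping;
    associativity == num_frames models a fully associative cache.
    """
    num_sets = num_frames // associativity
    set_index = block_address % num_sets      # bit selection
    first = set_index * associativity         # assumed frame numbering
    return list(range(first, first + associativity))


# Memory block 12 in a cache with 8 block frames:
print(candidate_frames(12, 8, 1))   # direct mapped
print(candidate_frames(12, 8, 2))   # two-way set associative
print(candidate_frames(12, 8, 8))   # fully associative
```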

Slide 5: 2. Identification
- Each block frame in the cache has an address tag indicating the block's address in memory.
- All possible tags are searched in parallel.
- A valid bit is attached to the tag to indicate whether the block contains valid information.
- An address A for a datum from the CPU is divided into a block address field and a block offset field:
  - block address = (A) / (block size)
  - block offset = (A) MOD (block size)
- The block address is further divided into tag and index:
  - the index indicates the set in which the block may reside
  - the tag is compared to determine a hit or a miss
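The address breakdown above can be sketched in Python as follows. The function name and the example geometry (64-byte blocks, 512 sets) are assumptions chosen for illustration.

```python
def split_address(address, block_size, num_sets):
    """Split a byte address into (tag, index, block offset)."""
    block_address = address // block_size
    block_offset = address % block_size
    index = block_address % num_sets    # selects the set
    tag = block_address // num_sets     # compared against the stored tags
    return tag, index, block_offset


tag, index, offset = split_address(0x12345, block_size=64, num_sets=512)
print(tag, index, offset)
```

When the block size and the number of sets are powers of two, the divisions and MOD operations reduce to selecting bit fields of the address.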

Slide 6: 3. Replacement on a Cache Miss
- The more choices for replacement, the more expensive the hardware; direct mapping is the simplest because there is no choice to make.
- Random vs. least recently used (LRU): random spreads allocation uniformly and is simple to build, while LRU takes advantage of temporal locality but can be expensive to implement (why?). First in, first out (FIFO) approximates LRU and is simpler than LRU.
Data cache misses per 1000 instructions:
Size | Two-way LRU | Two-way Random | Two-way FIFO | Four-way LRU | Four-way Random | Four-way FIFO | Eight-way LRU | Eight-way Random | Eight-way FIFO
16K | 114.1 | 117.3 | 115.5 | 111.7 | 115.1 | 113.3 | 109.0 | 111.8 | 110.4
64K | 103.4 | 104.3 | 103.9 | 102.4 | 102.3 | 103.1 | 99.7 | 100.5 | 100.3
256K | 92.2 | 92.1 | 92.5 | 92.1 | 92.1 | 92.5 | 92.1 | 92.1 | 92.5
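To see how LRU and FIFO can differ, the following Python sketch counts misses for a single cache set under each policy. The function names and the access trace are made up for illustration; real caches implement these policies in hardware, often only approximately.

```python
from collections import OrderedDict, deque


def count_misses_lru(trace, ways):
    """Misses for one set with least-recently-used replacement."""
    resident = OrderedDict()              # oldest use first
    misses = 0
    for block in trace:
        if block in resident:
            resident.move_to_end(block)   # refresh recency on a hit
        else:
            misses += 1
            if len(resident) == ways:
                resident.popitem(last=False)   # evict least recently used
            resident[block] = True
    return misses


def count_misses_fifo(trace, ways):
    """Misses for one set with first-in, first-out replacement."""
    resident = deque()                    # oldest arrival first
    misses = 0
    for block in trace:
        if block not in resident:
            misses += 1
            if len(resident) == ways:
                resident.popleft()        # evict the oldest arrival
            resident.append(block)
    return misses


trace = [1, 2, 1, 3, 1, 4]               # block 1 is reused repeatedly
print(count_misses_lru(trace, ways=2), count_misses_fifo(trace, ways=2))
```

On this trace LRU keeps the frequently reused block 1 resident, while FIFO evicts it because it arrived first.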

Slide 7: 4. Write Strategies
- Most cache accesses are reads: with 10% stores + 37% loads + 100% instruction fetches, only 10/(100 + 37 + 10), or about 7%, of all memory accesses are writes.
- Optimize reads to make the common case fast, observing that the CPU does not have to wait for writes but must wait for reads. Fortunately, a read is easy in a direct-mapped cache: reading the block and comparing the tag can be done in parallel (what about associative mapping?). A write is harder:
  a) Tag checking and block writing cannot be overlapped, because a write is destructive.
  b) The CPU specifies the write size: only 1 to 8 bytes.
  Thus write strategies often distinguish cache designs.
- On a write hit:
  i. write through (or store through):
     - ensures consistency at the cost of memory and bus bandwidth
     - write stalls may be alleviated by using write buffers
  ii. write back (or store in):
     - minimizes memory and bus traffic at the cost of weakened consistency
     - uses a dirty bit to indicate modification
     - read misses may result in writes (why?)
- On a write miss:
  - write allocate (fetch on write)
  - no-write allocate (write around)
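The question of why a read miss may result in a write under write back can be explored with a toy model. The sketch below is a direct-mapped, write-back, write-allocate cache that tracks only tags and dirty bits; the class and attribute names are illustrative assumptions.

```python
class WriteBackCache:
    """Toy direct-mapped, write-back, write-allocate cache (tags only)."""

    def __init__(self, num_frames):
        # Each frame holds [block_address, dirty] or None when empty.
        self.frames = [None] * num_frames
        self.memory_writes = 0

    def access(self, block_address, is_write):
        idx = block_address % len(self.frames)
        frame = self.frames[idx]
        if frame is None or frame[0] != block_address:
            # Miss: a dirty victim must be written back before replacement.
            if frame is not None and frame[1]:
                self.memory_writes += 1
            frame = [block_address, False]   # allocate on read and on write
            self.frames[idx] = frame
        if is_write:
            frame[1] = True                  # set the dirty bit, no memory write


cache = WriteBackCache(num_frames=4)
cache.access(0, is_write=True)    # write miss: block 0 allocated, now dirty
cache.access(4, is_write=False)   # read miss evicts dirty block 0
print(cache.memory_writes)
```

The read of block 4 maps to the same frame as the dirty block 0, so the read miss forces a write-back to memory.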

Slide 8: An Example: The Alpha 21264 Data Cache
- Cache size = 64 KB, block size = 64 B, two-way set associativity, write back, write allocate on a write miss.
- What is the index size? Number of sets = 64K/(64*2) = 2^16/2^(6+1) = 2^9, so the index is 9 bits.
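The index-size calculation generalizes to any power-of-two cache geometry. A minimal Python sketch (the function name is an assumption):

```python
def index_bits(cache_bytes, block_bytes, associativity):
    """Number of index bits, assuming the number of sets is a power of two."""
    num_sets = cache_bytes // (block_bytes * associativity)
    return num_sets.bit_length() - 1      # log2 of a power of two


# Alpha 21264 data cache: 64 KB, 64-byte blocks, two-way set associative.
print(index_bits(64 * 1024, 64, 2))
```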

Slide 9: Cache Performance
- Memory access time is an indirect measure of performance and is not a substitute for execution time:

Slide 10: Example 1: How much does a cache help performance?

Slide 11: Example 2: What is the relationship between AMAT and CPU time?

Slide 12: Improving Cache Performance
- The average memory access time (AMAT = hit time + miss rate * miss penalty) can be improved by reducing any of its three parameters:
  1. R1: reducing the miss rate;
  2. R2: reducing the miss penalty;
  3. R3: reducing the hit time.
- Four categories of cache organizations help reduce these parameters:
  1. Organizations that help reduce miss rate: larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimization;
  2. Organizations that help reduce miss penalty: multilevel caches, critical word first, read misses before write misses, merging write buffers, and victim caches;
  3. Organizations that help reduce miss penalty or miss rate via parallelism: non-blocking caches, hardware prefetching, and compiler prefetching;
  4. Organizations that help reduce hit time: small and simple caches, avoiding address translation, pipelined cache access, and trace caches.

Slide 13: Reducing Miss Rate
- There are three kinds of cache misses, classified by cause:
  - Compulsory: the very first access to a block cannot be a hit, since the block must first be brought in from main memory. Also called cold-start misses.
  - Capacity: the cache lacks the space to hold all the blocks needed during execution, so blocks are discarded and later retrieved.
  - Conflict: the mapping confines blocks to a restricted area of the cache (e.g., direct mapping, set-associative). Also called collision misses or interference misses.
- While the 3-C characterization gives insight into causes, it is at times too simplistic, and the three categories are interdependent. For example, it ignores replacement policies.

Slide 14: Roles of the 3 Cs

Slide 15: Roles of the 3 Cs

Slide 16: First Miss Rate Reduction Technique: Larger Block Size
- Takes advantage of spatial locality, which reduces compulsory misses
- Increases miss penalty (it takes longer to fetch a block)
- Increases conflict misses and/or capacity misses, because fewer blocks fit in the cache
- Finding an appropriate block size means striking a delicate balance between miss penalty (MP) and miss rate (MR) to minimize AMAT

Slide 17: First Miss Rate Reduction Technique: Larger Block Size
- Example: find the optimal block size in terms of AMAT, given that the miss penalty is 40 cycles of overhead plus 2 cycles per 16 bytes, and given the miss rates of the table below.
- Solution: AMAT = HT + MR * MP = HT + MR * (40 + block size * 2 / 16), where HT is the hit time and block size is in bytes.
- High latency and high bandwidth encourage a large block size, since each miss brings in many more bytes for a small increase in miss penalty.
- Low latency and low bandwidth encourage a small block size.
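The search for the best block size can be sketched in Python. The miss-penalty formula comes from the slide; the miss rates and the 1-cycle hit time are illustrative placeholders, not the values from the slide's table.

```python
def miss_penalty(block_bytes):
    """40 cycles of overhead plus 2 cycles per 16 bytes transferred."""
    return 40 + 2 * block_bytes / 16


def amat(hit_time, miss_rate, block_bytes):
    """AMAT = HT + MR * MP."""
    return hit_time + miss_rate * miss_penalty(block_bytes)


# Placeholder miss rates per block size (assumed, for illustration only).
ILLUSTRATIVE_MISS_RATES = {16: 0.086, 32: 0.072, 64: 0.070,
                           128: 0.078, 256: 0.095}

best_block = min(ILLUSTRATIVE_MISS_RATES,
                 key=lambda b: amat(1, ILLUSTRATIVE_MISS_RATES[b], b))
for b, mr in ILLUSTRATIVE_MISS_RATES.items():
    print(b, round(amat(1, mr, b), 3))
print("best block size:", best_block)
```

With these placeholder numbers the block size with the lowest miss rate (64 bytes) is not the one with the lowest AMAT, because its higher miss penalty outweighs the small gain in miss rate.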

Slide 18: Second and Third Miss Rate Reduction Techniques
Second Miss Rate Reduction Technique: Larger Caches
- An obvious way to reduce capacity misses.
- Drawback: longer hit time and higher cost.
- Popular in off-chip caches (2nd- and 3rd-level caches).
Third Miss Rate Reduction Technique: Higher Associativity
- Rules of thumb for miss rate:
  - 8-way associativity is almost as good as full associativity;
  - the miss rate of a direct-mapped cache of size N is almost equal to the miss rate of a 2-way cache of size 0.5N.
- The higher the associativity, the longer the hit time (why?)
- Higher miss rate rewards higher associativity.

Slide 19: Fourth Miss Rate Reduction Technique: Way Prediction and Pseudoassociative Caches
- Way prediction helps select one block among those in a set, thus requiring only one tag comparison (on a hit).
  - Preserves the advantages of direct mapping (why?);
  - On a misprediction, the other block(s) are checked.
- Pseudoassociative (also called column associative) caches:
  - Operate exactly like direct-mapped caches on a hit, again preserving the advantages of direct mapping;
  - On a miss, another block is checked (as if in a set-associative cache), found by simply inverting the most significant bit of the index field to locate the other block in the "pseudoset".
  - real hit time < pseudo-hit time
  - too many pseudo-hits would defeat the purpose
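Inverting the most significant index bit is a single XOR. A minimal Python sketch, with a hypothetical function name and a 9-bit index chosen for illustration:

```python
def pseudo_set_partner(index, index_bits):
    """Index of the second frame to probe: flip the top bit of the index."""
    return index ^ (1 << (index_bits - 1))


primary = 0b000000101
secondary = pseudo_set_partner(primary, index_bits=9)
print(format(primary, "09b"), format(secondary, "09b"))
```

Applying the function twice returns the original index, so the two frames form a fixed pair.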

Slide 20: Fifth Miss Rate Reduction Technique: Compiler Optimizations

Slide 21: Fifth Miss Rate Reduction Technique: Compiler Optimizations

Slide 22: Fifth Miss Rate Reduction Technique: Compiler Optimizations

Slide 23: Fifth Miss Rate Reduction Technique: Compiler Optimizations
- Blocking: improves temporal and spatial locality
  - applies when multiple arrays are accessed both ways (i.e., row-major and column-major), namely orthogonal accesses that cannot be helped by the earlier methods
  - concentrates on submatrices, or blocks
  - In the unblocked matrix multiply, all N*N elements of Y and Z are accessed N times and each element of X is accessed once. Thus there are N^3 operations and 2N^3 + N^2 reads! Capacity misses are a function of N and cache size in this case.

Slide 24: Fifth Miss Rate Reduction Technique: Compiler Optimizations
- To ensure that the elements being accessed fit in the cache, the original code is changed to compute on a submatrix of size B*B, where B is called the blocking factor.
- The total number of memory words accessed is 2N^3/B + N^2.
- Blocking exploits a combination of spatial (Y) and temporal (Z) locality.
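The slides' code for blocking did not survive in this transcript, so the following is a sketch of the usual blocked matrix multiply X = Y * Z, written in Python for readability. The loop structure and names (jj, kk, block) are assumptions; a real compiler would apply this transformation to compiled loops, not to Python lists.

```python
def blocked_matmul(y, z, block):
    """Compute x = y * z for square matrices, one block-wide strip at a time."""
    n = len(y)
    x = [[0] * n for _ in range(n)]
    for jj in range(0, n, block):            # strip of columns of z and x
        for kk in range(0, n, block):        # strip of rows of z
            for i in range(n):
                for j in range(jj, min(jj + block, n)):
                    r = 0
                    for k in range(kk, min(kk + block, n)):
                        r += y[i][k] * z[k][j]
                    x[i][j] += r             # accumulate partial sums
    return x


def naive_matmul(y, z):
    """Reference triple loop, used only to check the blocked version."""
    n = len(y)
    return [[sum(y[i][k] * z[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


y = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
z = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(blocked_matmul(y, z, block=2) == naive_matmul(y, z))
```

The inner loops touch only a block-by-block piece of z at a time, which is what lets that piece stay in the cache while it is reused.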

Slide 25: First Miss Penalty Reduction Technique: Multilevel Caches
- To keep up with the widening gap between CPU and main memory, try to:
  - make the cache faster, and
  - make the cache larger
  by adding another, larger but slower cache between the cache and main memory.

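Slide 26: First Miss Penalty Reduction Technique: Multilevel Caches
b) Local miss rate vs. global miss rate:
   i. Local miss rate is defined as the number of misses in a cache divided by the number of memory accesses to that cache.
   ii. Global miss rate is defined as the number of misses in a cache divided by the total number of memory accesses generated by the CPU.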
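The difference between local and global miss rates is easiest to see with numbers. The sketch below uses the standard definitions for a two-level cache; the function name and the access counts are hypothetical.

```python
def miss_rates(cpu_accesses, l1_misses, l2_misses):
    """Return (L1 miss rate, L2 local miss rate, L2 global miss rate)."""
    l1_rate = l1_misses / cpu_accesses
    l2_local = l2_misses / l1_misses      # L2 sees only the L1 misses
    l2_global = l2_misses / cpu_accesses  # fraction of all CPU accesses
    return l1_rate, l2_local, l2_global


# Hypothetical run: 1000 memory references, 40 miss in L1, 20 miss in L2.
l1_rate, l2_local, l2_global = miss_rates(1000, 40, 20)
print(l1_rate, l2_local, l2_global)
```

The L2 global miss rate equals the L1 miss rate multiplied by the L2 local miss rate, which is why a local miss rate that looks high can still correspond to few accesses reaching main memory.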

Slide 27: Second and Third Miss Penalty Reduction Techniques
Second Miss Penalty Reduction Technique: Critical Word First and Early Restart
- The CPU needs just one word of the block at a time:
  - critical word first: fetch the required word first, and
  - early restart: as soon as the required word arrives, send it to the CPU.
Third Miss Penalty Reduction Technique: Giving Priority to Read Misses over Write Misses
- Serves reads before earlier writes have completed:
  - while write buffers improve write-through performance, they complicate memory accesses by potentially delaying updates to memory;
  - instead of waiting for the write buffer to empty before processing a read miss, the write buffer is checked for content that might satisfy the missing read;
  - in a write-back scheme, the dirty copy being replaced is first written to the write buffer instead of memory, so the read can proceed first, improving performance.

Slide 28: Fourth Miss Penalty Reduction Technique: Merging Write Buffer
- Improves the efficiency of write buffers, which are used by both write-through and write-back caches:
  - Multiple single-word writes are combined into a single write buffer entry, which would otherwise be used only for a multi-word write.
  - Reduces stalls due to the write buffer being full.

Slide 29: Fifth Miss Penalty Reduction Technique: Victim Cache
- Victim caches attempt to avoid the miss penalty on a miss by:
  - adding a small fully associative cache that holds discarded blocks (victims).
- This has been shown to be effective, especially for small direct-mapped caches; e.g., a 4-entry victim cache removes 20% of misses.

Slide 30: Reducing Cache Miss Penalty or Miss Rate via Parallelism
- Nonblocking caches (lockup-free caches):
- Hardware prefetching of instructions and data:

Slide 31: Reducing Cache Miss Penalty or Miss Rate via Parallelism
- Compiler-controlled prefetching: the compiler inserts prefetch instructions

Slide 32: Reducing Cache Miss Penalty or Miss Rate via Parallelism
Compiler-Controlled Prefetching: An Example
    for (i = 0; i < 3; i = i + 1)
        for (j = 0; j < 100; j = j + 1)
            a[i][j] = b[j][0] * b[j+1][0];
- 16-byte blocks, 8 KB cache, direct mapped, write back, 8-byte elements. What kind of locality, if any, exists for a and b?
  a. Array a has 3 rows and 100 columns. It has spatial locality: even-indexed elements miss and odd-indexed elements hit, leading to 3*100/2 = 150 misses.
  b. Array b has 101 rows and 3 columns. It has no spatial locality, but it has temporal locality: the same element is used in the j-th and (j+1)-st iterations, and the same elements are accessed again in each iteration of i. There are 100 misses on b[j+1][0] when i = 0 and 1 miss on b[0][0] when j = 0, for a total of 101 misses.
- Assume a large miss penalty (50 cycles, so that data must be prefetched at least 7 iterations ahead). Splitting the loop into two, we have

Slide 33: Reducing Cache Miss Penalty or Miss Rate via Parallelism
Compiler-Controlled Prefetching: An Example (continued)
    for (j = 0; j < 100; j = j + 1) {
        prefetch(b[j+7][0]);    /* b[j][0] for 7 iterations later */
        prefetch(a[0][j+7]);    /* a[0][j] for 7 iterations later */
        a[0][j] = b[j][0] * b[j+1][0];
    }
    for (i = 1; i < 3; i = i + 1)
        for (j = 0; j < 100; j = j + 1) {
            prefetch(a[i][j+7]);    /* a[i][j] for 7 iterations later */
            a[i][j] = b[j][0] * b[j+1][0];
        }
- Assuming that each iteration of the original (pre-split) loop takes 7 cycles and that there are no conflict or capacity misses, the original loop takes 7*300 + 251*50 = 14650 cycles (total iteration cycles plus total cache miss cycles), whereas the split loops take (1+1+7)*100 + (4+7)*50 + (1+7)*200 + (4+4)*50 = 3450 cycles.

Slide 34: Reducing Cache Miss Penalty or Miss Rate via Parallelism
Compiler-Controlled Prefetching: An Example (continued)
- The first loop takes 9 cycles per iteration (because of the two prefetch instructions).
- The second loop takes 8 cycles per iteration (because of the single prefetch instruction).
- During the first 7 iterations of the first loop, array a incurs 4 cache misses and array b incurs 7 cache misses.
- During the first 7 iterations of the second loop, for i = 1 and i = 2, array a incurs 4 cache misses each.
- Array b does not incur any cache misses in the second loop.
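The cycle counts in this example are plain arithmetic and can be re-derived in a few lines of Python. The variable names are assumptions; the numbers are those given on the slides.

```python
MISS_PENALTY = 50      # cycles per cache miss
BODY_CYCLES = 7        # cycles per iteration without prefetch instructions

# Original loop: 300 iterations, 150 misses on a plus 101 misses on b.
original = BODY_CYCLES * 300 + (150 + 101) * MISS_PENALTY

# First split loop: 100 iterations with two prefetches each,
# 4 misses on a and 7 misses on b before the prefetches take effect.
first_loop = (1 + 1 + BODY_CYCLES) * 100 + (4 + 7) * MISS_PENALTY

# Second split loop: 200 iterations with one prefetch each,
# 4 misses on a for each of i = 1 and i = 2.
second_loop = (1 + BODY_CYCLES) * 200 + (4 + 4) * MISS_PENALTY

split = first_loop + second_loop
print(original, split, round(original / split, 2))
```

The prefetching version executes more instructions per iteration but removes most of the miss cycles, which dominate the original loop.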

Slide 35: First and Second Hit Time Reduction Techniques
First Hit Time Reduction Technique: Small and Simple Caches
- Smaller is faster:
  - a small index means less time spent on indexing and address translation
  - a small cache can fit on the same chip as the CPU
- Low associativity: in addition to a simpler and shorter tag check, a direct-mapped cache allows the tag check to overlap with the transmission of data, which is not possible with any higher associativity!
Second Hit Time Reduction Technique: Avoid Address Translation During Indexing
- Make the common case fast:
  - use virtual addresses for the cache, because most memory accesses (more than 90%) are satisfied in the cache; the result is a virtual cache

Slide 36: Second Hit Time Reduction Technique: Avoid Address Translation During Indexing
- Make the common case fast:
  - at least three important performance aspects relate directly to virtual-to-physical translation:
    - improperly organized or insufficiently sized TLBs may create excess not-in-TLB faults, adding to program execution time
    - for a physical cache, the TLB access must occur before the cache access, extending the cache access time
    - two line addresses (e.g., an I-line and a D-line address) may be independent of each other in the virtual address space yet collide in the real address space, when they draw pages whose lower page address bits (and upper cache address bits) are identical
- Problems with virtual caches:
  - Page-level protection must still be enforced during address translation (solution: copy the protection information from the TLB on a miss and hold it in a field for future virtual indexing/tagging)
  - When a process is switched in or out, the entire cache has to be flushed, because the same virtual addresses will refer to different physical addresses each time; this is the context-switching problem (solution: a process identifier tag, or PID)

Slide 37: Second, Third, and Fourth Hit Time Reduction Techniques
Second Hit Time Reduction Technique: Avoid Address Translation During Indexing
- Problems with virtual caches:
  - different virtual addresses may refer to the same physical address, i.e., the problem of synonyms/aliases
    - HW solution: guarantee every cache block a unique physical address
    - SW solution: force aliases to share some address bits (e.g., page coloring)
- Virtually indexed and physically tagged
Third Hit Time Reduction Technique: Pipelined Cache Writes
- The solution is to reduce the clock cycle time (CCT) and increase the number of pipeline stages, which increases instruction throughput.
Fourth Hit Time Reduction Technique: Trace Caches
- Finds a dynamic sequence of instructions, including taken branches, to load into a cache block:
  - Traces of the executed instructions are put into cache blocks as determined by the CPU.
  - Branch prediction is folded into the cache and must be validated along with the addresses to have a valid fetch.
  - Disadvantage: the same instructions may be stored multiple times.

Slide 38: Main Memory and Organizations for Improving Performance

Slide 39: Main Memory and Organizations for Improving Performance
a) Wider main memory bus
   - Cache miss penalty decreases proportionally
   - Cost:
     i. wider bus (x n) and multiplexer (x n),
     ii. expandability (x n), and
     iii. error correction is more expensive
b) Simple interleaved memory
   - Potential parallelism with multiple DRAMs
   - Sends the address and accesses multiple banks in parallel, but transmits data sequentially (4 + 24 + 4x4 = 44 cycles -> 16/44 = 0.36, or roughly 0.4, bytes/cycle)
c) Independent memory banks
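The interleaving arithmetic on this slide can be reproduced with a small Python sketch. The cycle counts (4 to send the address, 24 to access, 4 to transfer a word) and the 4-word, 16-byte block are read off the slide's formula; the function names, and the assumption that a non-interleaved one-word-wide memory repeats the full sequence for every word, are mine.

```python
def cycles_interleaved(words, addr=4, access=24, transfer=4):
    """Banks are accessed in parallel; words are transferred one by one."""
    return addr + access + words * transfer


def cycles_sequential(words, addr=4, access=24, transfer=4):
    """Assumed baseline: every word pays address, access, and transfer."""
    return words * (addr + access + transfer)


BLOCK_BYTES = 16
WORDS = 4

interleaved = cycles_interleaved(WORDS)
sequential = cycles_sequential(WORDS)
print(interleaved, round(BLOCK_BYTES / interleaved, 2))
print(sequential, round(BLOCK_BYTES / sequential, 3))
```

Overlapping the access time of the four banks is what reduces the block fetch from the assumed 128 cycles to 44.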

Slide 40: Main Memory and Organizations for Improving Performance

Slide 41: Main Memory and Organizations for Improving Performance

Slide 42: Virtual Memory

Slide 43: Virtual Memory

Slide 44: Virtual Memory

Slide 45: Virtual Memory
- Fast address translation: an example, the Alpha 21264 data TLB
  - The address space number (ASN) is used as a PID for virtual caches;
  - The TLB is not flushed on a context switch, only when ASNs are recycled;
  - Fully associative placement

Slide 46: Virtual Memory
- What is the optimal page size? It depends:
  - page table size is proportional to 1/page size, so larger pages mean smaller page tables
  - a large page size makes a virtual cache possible (avoiding the aliases problem), thus reducing cache hit time
  - transferring larger pages (over the network) is more efficient
  - a small TLB favors larger pages
  - main drawbacks of a large page size: internal fragmentation (wasted storage) and process startup time (large context-switching overhead)

Slide 47: Summarizing Virtual Memory and Caches
- A hypothetical memory hierarchy going from virtual address to L2 cache access:

Slide 48: Protection and Examples of Virtual Memory
- The invention of multiprogramming led to the need to share computer resources such as CPU, memory, and I/O among multiple programs, whose instantiations are called "processes".
- Time-sharing of computer resources requires that processes take turns using those resources, and the designers of operating systems and computers must ensure that switching among processes, also called "context switching", is done correctly:
  - The computer designer must ensure that the CPU portion of the process state can be saved and restored;
  - The operating system designer must guarantee that processes do not interfere with each other's computations.
- Protecting processes:
  - Base and bound: each process falls in a predefined portion of the address space; that is, an address is valid if Base <= Address <= Bound, where the OS defines the values of Base and Bound and keeps them in two registers.
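The base-and-bound check is a single comparison per access. The sketch below models it in Python; the function name and the register values are hypothetical, and real hardware performs this check on every memory reference in user mode.

```python
def address_is_valid(address, base, bound):
    """Base-and-bound protection: valid if Base <= Address <= Bound."""
    return base <= address <= bound


# Hypothetical process confined to addresses 0x1000 through 0x1FFF.
BASE, BOUND = 0x1000, 0x1FFF
for addr in (0x0FFF, 0x1000, 0x1ABC, 0x2000):
    print(hex(addr), address_is_valid(addr, BASE, BOUND))
```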

Slide 49: Protection and Examples of Virtual Memory
- The computer designer's responsibilities in helping the OS designer protect processes from each other:
  - Providing two modes to distinguish a user process from a kernel process (equivalently, a supervisor or executive process);
  - Providing a portion of the CPU state, including the base/bound registers, the user/kernel mode bit(s), and the exception enable/disable bit, that a user process can use but cannot write; and
  - Providing mechanisms by which the CPU can switch between user mode and supervisor mode.
- While base and bound constitutes the minimum protection system, virtual memory offers a more fine-grained alternative to this simple model:
  - Address translation provides an opportunity to check for violations: the read/write and user/kernel signals from the CPU are compared with the permission flags that virtual memory (or the OS) marks on individual pages, to detect stray memory accesses;
  - Depending on the designer's concerns, protection can be either relaxed or escalated. In escalated protection, multiple levels of access permissions can be used, much like the military classification system.

Slide 50: Protection and Examples of Virtual Memory
A Paged Virtual Memory Example: The Alpha Memory Management and the 21264 TLB (one for instructions and one for data)
- A combination of segmentation and paging, with 48-bit virtual addresses; the 64-bit address space is divided into three segments: seg0 (bits 63-47 = 0...00), kseg (bits 63-46 = 0...10), and seg1 (bits 63-46 = 1...11).
- Advantages: segmentation divides the address space and conserves page table space, while paging provides virtual memory, relocation, and protection.
- Even with segmentation, the size of the page tables for the 64-bit address space is alarming. The Alpha uses a three-level hierarchical page table, with each page table (PT) contained in one page:

Slide 51: Protection and Examples of Virtual Memory
A Paged Virtual Memory Example: The Alpha Memory Management and the 21264 TLB
- A page table entry (PTE) is 64 bits long; the first 32 bits contain the physical page number and the other half includes the following five protection fields:
  - Valid: whether the page number is valid for address translation
  - User read enable: allows user programs to read data within this page
  - Kernel read enable: allows kernel programs to read data within this page
  - User write enable: allows user programs to write data within this page
  - Kernel write enable: allows kernel programs to write data within this page
- The current design of the Alpha has 8-KB pages, thus allowing 1024 PTEs in each PT. The three page-level fields and the page offset account for 10+10+10+13 = 43 bits of the 64-bit virtual address. The 21 bits to the left of the level-1 field are all 0s for seg0 and all 1s for seg1.
- The maximum virtual and physical address sizes are tied to the page size, and the Alpha has provisions for future growth: 16-KB, 32-KB, and 64-KB page sizes.
- The following table shows the memory hierarchy parameters of the Alpha 21264 TLB:
Parameter | Description
Block size | 1 PTE (8 bytes)
Hit time | 1 clock cycle
Miss penalty (average) | 20 clock cycles
TLB size | 128 PTEs per TLB, each of which can map 1, 8, 64, or 512 pages
Block selection | Round-robin
Write strategy | (not applicable)
Block placement | Fully associative
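The 10+10+10+13 split of the virtual address can be illustrated with bit masks. In this Python sketch the function name is an assumption, and the code only extracts the three page-table indices and the page offset; it does not model segment checks or the page-table walk itself.

```python
PAGE_BYTES = 8 * 1024                    # 8-KB pages
PTE_BYTES = 8                            # 64-bit page table entries
PTES_PER_TABLE = PAGE_BYTES // PTE_BYTES

OFFSET_BITS = 13                         # log2(8 KB)
LEVEL_BITS = 10                          # log2(1024 PTEs per table)


def split_virtual_address(va):
    """Return (level1, level2, level3, page_offset) for a virtual address."""
    offset = va & ((1 << OFFSET_BITS) - 1)
    level3 = (va >> OFFSET_BITS) & ((1 << LEVEL_BITS) - 1)
    level2 = (va >> (OFFSET_BITS + LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
    level1 = (va >> (OFFSET_BITS + 2 * LEVEL_BITS)) & ((1 << LEVEL_BITS) - 1)
    return level1, level2, level3, offset


va = (1 << 33) | (2 << 23) | (3 << 13) | 5
print(PTES_PER_TABLE, split_virtual_address(va))
```

Each 10-bit field indexes one of the 1024 PTEs in a page table that itself occupies exactly one 8-KB page.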

Slide 52: Protection and Examples of Virtual Memory
A Segmented Virtual Memory Example: Protection in the Intel Pentium
- The Pentium has four protection levels, with the innermost level (0) corresponding to the Alpha's kernel mode and the outermost level (3) corresponding to the Alpha's user mode. Separate stacks are used for each level to avoid security breaches between levels.
- A user can call an OS routine and pass parameters to it while retaining full protection:
  - The OS can maintain the protection level of the called routine for the parameters that are passed to it.
  - The potential loophole in protection is closed by not allowing the user process to ask the OS to access something indirectly that it would not have been able to access itself (such security loopholes are called Trojan horses).
- Bounds checking and memory mapping in the Pentium use a descriptor table (DT, which plays the role of the PTs in the Alpha). The equivalent of a PTE in the DT is a segment descriptor containing the following fields:
  - Present bit: equivalent to the PTE valid bit, used to indicate that this is a valid translation
  - Base field: equivalent to a page frame address, containing the physical address of the first byte of the segment
  - Access bit: like the reference bit or use bit in some architectures, helpful for replacement algorithms
  - Attributes field: specifies the valid operations and protection levels for the operations that use this segment
  - Limit field: not found in paged systems, establishes the upper bound of valid offsets for this segment

Slide 53: Crosscutting Issues: The Design of Memory Hierarchies
- Superscalar CPU and Number of Ports to the Cache: the cache must provide sufficient peak bandwidth to benefit from multiple issue. Some processors increase the complexity of instruction fetch by allowing the instructions to be issued to be found on any boundary instead of, say, multiples of 4 words.
- Speculative Execution and the Memory System: speculative and conditional instructions generate exceptions (by generating invalid addresses) that would otherwise not occur, and the exception-handling overhead can in turn overwhelm the benefits of speculation. Such CPUs must be matched with non-blocking caches and speculate only on L1 misses (because of the unbearable penalty of an L2 miss).
- Combining Instruction Cache with Instruction Fetch and Decode Mechanisms: increasing demand for ILP and clock rate has led to merging the first part of instruction execution with the instruction cache, by incorporating a trace cache (which combines branch prediction with instruction fetch) and storing the internal RISC operations in the trace cache (e.g., the Pentium 4's NetBurst microarchitecture). A cache hit in the merged cache saves a portion of the instruction execution cycles.
- Embedded Computer Caches and Real-Time Performance: in real-time applications, variation in performance matters much more than average performance. Thus caches, which improve average performance, have to be used carefully. Instruction caches are often used because instruction accesses are highly predictable, whereas data caches are "locked down", forcing them to act as a small scratchpad memory under program control.
- Embedded Computer Caches and Power: it is much more power efficient to access on-chip memory than off-chip memory (which requires driving the pins and buses and activating external memory chips). Other techniques, such as way prediction, can be used to save power (by powering only half of a two-way set-associative cache).
- I/O and Consistency of Cached Data: the cache coherence problem must be addressed when I/O devices also share the same cached data.

Slide 54: An Example of the Cache Coherence Problem

Slide 55: Putting It All Together: Alpha 21264 Memory Hierarchy
- The instruction cache is virtually indexed and virtually tagged; the data cache is virtually indexed but physically tagged.
- Operations:
  1. The chip loads instructions serially from an external PROM and loads configuration information for the L2 cache.
  2. It executes the preloaded code in PAL mode to initialize, e.g., update the TLB.
  3. Once the OS is ready, it sets the PC to the appropriate address in seg0.
  4. 9 index bits + 1 way-predict bit + 2 bits selecting a 4-instruction group = 12 address bits are sent to the I-cache; 48 - 9 - 6 = 33 bits form the virtual tag.
  5. Way prediction and line prediction (11 bits for the next 16-byte group on a miss, updated by branch prediction) are used to reduce I-cache latency.
  6. On an instruction cache hit, the next way and line prediction is loaded to read the next block (step 3).
  7. An instruction cache miss leads to a check of the I-TLB and the prefetcher (steps 4-7), or to an L2 cache access if the instruction address is not found (step 8).
  8. The L2 cache is direct mapped, 1-16 MB.

Slide 56: Putting It All Together: Alpha 21264 Memory Hierarchy
  9. The instruction prefetcher does not rely on the TLB for address translation; it simply increments the physical address of the miss by 64 bytes, checking that the new address is within the same page. Prefetching is suppressed if the new address is outside the page (step 14).
  10. If the instruction is not found in L2, the physical address command is sent to the ES40 system chip set via four consecutive transfer cycles on a narrow, 15-bit outbound address bus (step 15). Address and command take 8 CPU cycles. The CPU is connected to memory via a crossbar to one of two 256-bit memory buses (step 16).
  11. The total penalty of the instruction miss is about 130 CPU cycles for the critical instructions, while the rest of the block is filled at a rate of 8 bytes per 2 CPU cycles (step 17).
  12. A "victim file" is associated with L2, a write-back cache, to store a replaced block (victim block) (step 18). The address of the victim is sent out on the system address bus following the address of the new request (step 19), and the system chip set later extracts the victim data and writes it to memory.
  13. The D-cache is a 64-KB two-way set-associative, write-back cache, which is virtually indexed but physically tagged. While the 9-bit index + 3-bit word selection is sent to index the required data (step 24), the virtual page number is being translated at the D-TLB (step 23), which is fully associative and has 128 PTEs, each of which represents a page size from 8 KB to 4 MB (step 25).

Slide 57: Putting It All Together: Alpha 21264 Memory Hierarchy
  14. A TLB miss traps to PAL (privileged architecture library) code to load the valid PTE for this address. In the worst case, a page fault happens, in which case the OS brings the page in from disk while the context is switched.
  15. The index field of the address is sent to both sets of the data cache (step 26). Assuming a TLB hit, the two tags and valid bits are compared to the physical page number (steps 27-28), with a match sending the desired 8 bytes to the CPU (step 29).
  16. A miss in the D-cache goes to the L2 cache, which proceeds similarly to an instruction miss (step 30), except that it must check the victim buffer to make sure the block is not there (step 31).
  17. A write-back victim can be produced on a data cache miss. The victim data are extracted from the data cache simultaneously with the fill of the data cache with the L2 data and are sent to the victim buffer (step 32).
  18. In case of an L2 miss, the fill data from the system is written directly into the (L1) data cache (step 33). The L2 is written only with L1 victims (step 34).

Slide 58: Another View: The Emotion Engine of the Sony Playstation 2
- The 3 Cs captured by the cache for SPEC2000 (left) and multimedia applications (right)

Slide 59: Another View: The Emotion Engine of the Sony Playstation 2

