Presentation on theme: "1 Chapter 5 Memory Hierarchy Design 5.1 Introduction 5.2 Review of the ABCs of Caches 5.3 Cache Performance 5.4 Reducing Cache Miss Penalty 5.5 Reducing."— Presentation transcript:

1 Chapter 5 Memory Hierarchy Design
5.1 Introduction
5.2 Review of the ABCs of Caches
5.3 Cache Performance
5.4 Reducing Cache Miss Penalty
5.5 Reducing Miss Rate
5.6 Reducing Cache Miss Penalty or Miss Rate via Parallelism
5.7 Reducing Hit Time
5.8 Main Memory and Organizations for Improving Performance

2 Chapter 5 Memory Hierarchy Design (continued)
5.9 Memory Technology
5.10 Virtual Memory
5.11 Protection and Examples of Virtual Memory
5.12 Crosscutting Issues: The Design of Memory Hierarchies
5.13 Putting It All Together: Alpha 21264 Memory Hierarchies
5.14 Another View: The Emotion Engine of the Sony PlayStation 2
5.15 Another View: The Sun Fire 6800 Server
5.16 Fallacies and Pitfalls
5.17 Concluding Remarks
5.18 Historical Perspective and References

3 The levels in a typical memory hierarchy in embedded, desktop, and server computers

4 Starting with 1980 performance as a baseline, the gap in performance between memory and CPUs is plotted over time

5 5.2 Review of the ABCs of Caches
Cache, virtual memory, memory stall cycles, direct mapped, valid bit, block address, write through, instruction cache, average memory access time, cache hit, page, miss penalty

6 Review of the ABCs of Caches
Fully associative, dirty bit, block offset, write back, data cache, hit time, cache miss, page fault, miss rate, n-way set associative, least-recently used, tag field

7 Review of the ABCs of Caches
Write allocate, unified cache, misses per instruction, block, locality, set, random replacement, index field, no-write allocate, write buffer, write stall

8 The typical levels in the hierarchy slow down and get larger as we move away from the CPU, for a large workstation or small server

9 Cache performance review

10 Cache performance review

11 Cache miss rate and penalty

12 Cache miss rate and penalty

13 Measuring miss rate as misses per instruction

14 Memory stall cycles

15 Four memory hierarchy questions
Q1: Where can a block be placed in the upper level? (Block placement)
Q2: How is a block found if it is in the upper level? (Block identification)
Q3: Which block should be replaced on a miss? (Block replacement)
Q4: What happens on a write? (Write strategy)

16 Q1: Where can a block be placed in the upper level? (Block placement)
Direct mapped: only one place
Fully associative: anywhere in the cache
Set associative: anywhere within one set; a set is a group of blocks in the cache, and the block is first mapped onto a set.

17 This example cache has eight block frames and memory has 32 blocks

18 The three portions of an address in a set-associative or direct-mapped cache

19 Q3: Which block should be replaced on a cache miss?
Random
Least-recently used (LRU)
First in, first out (FIFO)

20 Data cache misses per 1000 instructions comparing least-recently used, random, and first in, first out replacement for several sizes and associativities

21 Q4: What happens on a write?
Two write policies:
Write through (p. 401)
Write back
Two options on a write miss:
Write allocate - The block is allocated on a write miss, followed by the write hit actions above.
No-write allocate - In this apparently unusual alternative, write misses do not affect the cache. Instead, the block is modified only in the lower-level memory (main memory).

22 No-write allocate versus write allocate

23 The organization of the data cache in the Alpha 21264 microprocessor (2-way set associative)

24 Physical address
Physical address = 38-bit block address + 6-bit block offset
Ex. 38-bit block address = 9-bit index + 29-bit tag

25 Misses per 1000 instructions for instruction, data, and unified caches of different sizes

26 Cache performance: average memory access time

27 Unified cache versus separate caches

28 Unified cache or separate caches

29 Unified cache or separate caches

30 Average memory access time and processor performance
Average memory access time depends on many factors:
- Memory stalls
- CPU (out-of-order execution), ...

31 Performance impact by cache

32 Performance impact by cache

33 Cache misses have a double-barreled impact on a CPU
A lower CPI: the lower the CPI_execution, the higher the relative impact of a fixed number of cache miss clock cycles.
A faster clock: when calculating CPI, the cache miss penalty is measured in CPU clock cycles for a miss. Therefore, even if the memory hierarchies of two computers are identical, the CPU with the higher clock rate has a larger number of clock cycles per miss and hence a higher memory portion of CPI.

34 Direct mapped or two-way set associative

35 ex-P410-1

36 Miss penalty and out-of-order execution processors
Redefine memory stalls to lead to a new definition of miss penalty as nonoverlapped latency:
Memory stall cycles per instruction = misses per instruction x (total miss latency - overlapped miss latency)

37 Out-of-order execution with a direct-mapped cache

38 Summary of performance equations in this chapter

39 We organize 17 cache optimizations into four categories
Reducing the miss penalty (Section 5.4): multilevel caches, critical word first, read miss before write miss, merging write buffers, and victim caches
Reducing the miss rate (Section 5.5): larger block size, larger cache size, higher associativity, way prediction and pseudoassociativity, and compiler optimizations
Reducing the miss penalty or miss rate via parallelism (Section 5.6): nonblocking caches, hardware prefetching, and compiler prefetching
Reducing the time to hit in the cache (Section 5.7): small and simple caches, avoiding address translation, pipelined cache access, and trace caches

40 5.4 Reducing cache miss penalty
First miss penalty reduction technique: multilevel caches
Two miss rates:
Local miss rate - misses in a cache divided by accesses to that cache: Miss rate_L1 or Miss rate_L2
Global miss rate - misses in a cache divided by all memory accesses from the CPU: Miss rate_L1 for level 1, or Miss rate_L1 x Miss rate_L2 for level 2

41 The first-level and the second-level caches

42 The first-level and the second-level caches

43 Miss rates versus cache size for multilevel caches

44 Relative execution time by second-level cache size

45 Two major questions for the design of the second-level cache
Will it lower the average memory access time portion of the CPI, and how much does it cost?
The second-level cache should be much bigger than the first.
Multilevel inclusion is the natural policy for memory hierarchies: L1 data are always present in L2.
Multilevel exclusion: L1 data are never found in the L2 cache (prevents wasting space in the L2 cache).

46 The impact of second-level cache associativity on its miss penalty

47 Second miss penalty reduction technique: critical word first and early restart
This strategy is based on impatience: don't wait for the full block to be loaded before sending the requested word and restarting the CPU.
Critical word first - Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Critical-word-first fetch is also called wrapped fetch and requested word first.
Early restart - Fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.

48 The average miss penalty

49 Third miss penalty reduction technique: giving priority to read misses over writes
Fourth miss penalty reduction technique: merging write buffers (Fig. 5.12)
Fifth miss penalty reduction technique: victim caches (Fig. 5.13)
Summary of miss penalty reduction techniques:
The principle of locality
Retrieving the requested word early
Giving priority to read misses over writes
Victim caches

50 R2 = R3?

51 To illustrate write merging, the write buffer on top does not use it, while the write buffer on the bottom does

52 Placement of the victim cache in the memory hierarchy

53 5.5 Reducing miss rate
Three simple categories of cache misses:
Compulsory - cold-start misses or first-reference misses
Capacity - the cache cannot contain all the blocks needed
Conflict - collision misses or interference misses
(Figure: miss rate versus associativity for eight-way, four-way, two-way, and one-way caches)

54 Total miss rate for each size cache and percentage of each according to the three C's

55 Total miss rate (top) and distribution of miss rate (bottom) for each size cache according to the three C's for the data in Fig. 5.14

56 Miss rate versus block size for five different-sized caches

57 First miss rate reduction technique: larger block size
Larger blocks reduce compulsory misses by exploiting spatial locality.
Larger blocks increase the miss penalty (more words to transfer per miss).
Larger blocks can increase the miss rate (they reduce the number of blocks in the cache).

58 Actual miss rate versus block size for five different-sized caches in Fig. 5.16

59 Average memory access time

60 Average memory access time versus block size for five different-sized caches in Fig. 5.16

61 Second miss rate reduction technique: larger caches
The size of second- or third-level caches in 2001 equals the size of main memory in desktop computers in 1991.
Third miss rate reduction technique: higher associativity
Two general rules of thumb:
-- An eight-way set-associative cache is, for practical purposes, as effective as a fully associative cache.
-- A direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2.

62 1-way, 2-way, 4-way, and 8-way

63 1-way, 2-way, 4-way, and 8-way

64 Average memory access time using miss rates in Fig. 5.14 for the parameters in the example

65 Fourth miss rate reduction technique: way prediction and pseudoassociative caches
Way prediction saves pipeline stages in more than 85% of the instruction fetches.
Pseudoassociative or column associative: on a miss, a second cache entry is checked before going to the next level.
A simple way is to invert the most significant bit of the index field to find the other block in the "pseudoset".

66 Relationship between a regular hit time, pseudohit, and miss penalty

67 Fifth miss rate reduction technique: compiler optimizations
Reordering instructions can reduce miss rates by up to 50%.
Long cache blocks
Improve spatial and temporal locality: loop interchange, blocking

/* Before */
for (j = 0; j < 100; j = j + 1)
    for (i = 0; i < 5000; i = i + 1)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i + 1)
    for (j = 0; j < 100; j = j + 1)
        x[i][j] = 2 * x[i][j];

68 Blocking
Storing the array row by row (row major order) or column by column (column major order) does not solve the problem, because both rows and columns are used in every iteration of the loop. Such orthogonal accesses mean that transformations such as loop interchange are not helpful.
Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks.

/* Before */
for (i = 0; i < N; i = i + 1)
    for (j = 0; j < N; j = j + 1) {
        r = 0;
        for (k = 0; k < N; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

In the worst case, there would be 2N^3 + N^2 memory words accessed for N^3 operations.

69 Blocking factor
The original code is changed to compute on a submatrix of size B by B.

/* After */
for (jj = 0; jj < N; jj = jj + B)
    for (kk = 0; kk < N; kk = kk + B)
        for (i = 0; i < N; i = i + 1)
            for (j = jj; j < min(jj + B, N); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + B, N); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }

In the worst case, there would be 2N^3/B + N^2 memory words accessed for N^3 operations.

70 A snapshot of the three arrays x, y, and z when i = 1

71 The age of accesses to the arrays x, y, and z

72 5.6 Reducing cache miss penalty or miss rate via parallelism
First miss penalty/rate reduction technique: nonblocking caches to reduce stalls on cache misses
-- The CPU can continue fetching instructions from the instruction cache while waiting for the data cache to return the missing data; such a cache is called a nonblocking cache or lockup-free cache.
-- Hit under 1 miss, or hit under multiple misses, ...

73 Two-way set associative

74 Ratio of the average memory stall time for a blocking cache to hit-under-miss schemes as the number of outstanding misses is varied for 18 SPEC92 programs

75 Second miss penalty/rate reduction technique: hardware prefetching of instructions and data
Instruction prefetch is frequently done in hardware outside of the cache.
The processor fetches two blocks on a miss: the requested block and the consecutive block (into the instruction stream buffer).
A single stream buffer would catch 15% to 25% of the misses from a 4 KB direct-mapped instruction cache with 16-byte blocks.
Eight stream buffers can capture 50% to 70% of all misses from a processor with two 64 KB four-way set-associative caches.

76 Average access time, cache size, miss rate

77 Average memory access time

78 Third miss penalty/rate reduction technique: compiler-controlled prefetching
The compiler inserts prefetch instructions to request the data before they are needed.
Two flavors of prefetch:
* Register prefetch loads the value into a register.
* Cache prefetch loads data only into the cache and not the register.
Faulting prefetch: the address may cause an exception for virtual address faults and protection violations.
Nonfaulting prefetches simply turn into no-ops if they would normally result in an exception.

79 Compiler-controlled prefetching
The most effective prefetch is "semantically invisible" to a program: it doesn't change the contents of registers and memory, and it cannot cause virtual memory faults.
The goal of prefetching is to overlap execution with the prefetching of data.

80 Data cache misses

81 Data cache misses

82 The time saved

83 5.7 Reducing hit time
First hit time reduction technique: small and simple caches
* Smaller hardware is faster, and a small cache certainly helps the hit time.
* Keep the cache simple.

84 Access times as size and associativity vary in a CMOS cache

85 Second hit time reduction technique: avoiding address translation during indexing of the cache
Index the cache with the virtual address rather than the physical address.
Two tasks: indexing the cache and comparing addresses.
On a process switch, the same virtual addresses refer to different physical addresses, so the cache must be flushed.
One solution: use a process-identifier tag (PID); by recycling PIDs, flushing the cache is avoided.

86 Synonyms or aliases
Synonyms or aliases: two different virtual addresses map to the same physical address.
* Hardware solutions guarantee every cache block a unique physical address.
* Software solution: page coloring is simply set-associative mapping applied to virtual memory. The 4 KB (2^12) pages are mapped using 64 (2^6) sets to ensure that the physical and virtual addresses match in the last 18 bits. This restriction means a direct-mapped cache that is 2^18 (256K) bytes or smaller can never have duplicate physical addresses for blocks.

87 Miss rate versus virtually addressed cache size of a program measured three ways: without process switches (uniprocess), with process switches using a process-identifier tag (PID), and with process switches but without PIDs (purge)

88 Third hit time reduction technique: pipelined cache access
The pipeline for the Pentium takes 1 clock cycle to access the instruction cache.
Fourth hit time reduction technique: trace caches
A trace cache finds a dynamic sequence of instructions, including taken branches, to load into a cache block.

89 Summary of cache optimizations showing impact on cache performance and complexity for the techniques in Sections 5.4-5.7

90 5.8 Main memory and organizations for improving performance
Cache <- Main memory <- I/O
Memory bandwidth and memory speed motivate new organizations.
Assume the performance of the basic memory organization is
4 clock cycles to send the address
56 clock cycles for the access time per word
4 clock cycles to send a word of data
Given a cache block of 4 words, and that a word is 8 bytes, the miss penalty is 4 x (4 + 56 + 4) = 256 clock cycles, with a memory bandwidth of 1/8 byte (32/256) per clock cycle.

91 First technique for higher bandwidth: wider main memory
Doubling or quadrupling the width of the memory doubles or quadruples the memory bandwidth.
With a main memory width of 2 words, the miss penalty in our example would drop from 4 x 64 to 2 x 64 clock cycles.
A wider cache and memory raise the cost of error correction.

92 Second technique for higher bandwidth: simple interleaved memory
More memory banks allow more parallelism.
Miss penalty of Fig. 5.27(c): 4 + 56 + (4 x 4) = 76 clock cycles.
Bandwidth: 4 words x 8 bytes / 76 CC = about 0.4 bytes/CC.
The mapping of addresses to banks affects the behavior of the memory system.
The addresses of the four banks are interleaved:
Bank 0 has all words whose address modulo 4 is 0
Bank 1 has all words whose address modulo 4 is 1

93 Three examples of bus width, memory width, and memory interleaving to achieve higher memory bandwidth

94 Four-way interleaved memory

95 What can interleaving and wide memory buy?

96 What can interleaving and wide memory buy?

97 How many banks should be included?
Number of banks >= number of clock cycles to access a word in a bank.
As the capacity per memory chip increases, there are fewer chips in the same-sized memory system, making multiple banks much more expensive.
A second disadvantage of memory banks is the difficulty of main memory expansion.

98 Third technique for higher bandwidth: independent memory banks
Beyond interleaving sequential accesses: multiple independent accesses
Cache read in bank i, cache write in bank j
Multiple cache accesses: nonblocking caches allow the CPU to proceed beyond a cache miss.
Multiprocessors sharing a common memory provide further motivation for memory banks.

99 5.9 Memory technology
Memory latency: access time and cycle time
Access time is the time between when a read is requested and when the desired word arrives, while cycle time is the minimum time between requests to memory.
One reason that cycle time is greater than access time is that the memory needs the address lines to be stable between accesses.

100 DRAM technology
The growing capacity of DRAM means more address lines, so DRAMs multiplex the address lines: row access strobe (RAS) and column access strobe (CAS).
DRAMs need to refresh their contents, so they are occasionally unavailable; refresh time should be <= 5% of the total time.
Fourfold improvement in capacity every three years.
About 5% improvement in row access time per year.

101 Internal organization of a 64 Mbit DRAM

102 Times of fast and slow DRAMs with each generation

103 SRAM technology
SRAMs need no refresh; they use six transistors per bit.
SRAM designs are concerned with speed and capacity.
The capacity of DRAMs is roughly 4 to 8 times that of SRAMs.
The cycle time of SRAMs is 8 to 16 times faster than DRAMs, but they are also 8 to 16 times as expensive.

104 Embedded processor memory technology: ROM and flash
ROM holds the embedded program and constants.
Flash memory allows the embedded device to alter nonvolatile memory after the system is manufactured.
Flash memory allows reading at almost DRAM speed, but writing flash is 10 to 100 times slower.

105 Improving memory performance in a standard DRAM chip
Improving the bandwidth of DRAM:
1. Fast page mode: allows repeated accesses to the row buffer without another row access time.
2. Synchronous DRAM (SDRAM)
3. The third major DRAM innovation is to transfer data on both the rising edge and falling edge of the DRAM clock signal (double data rate).
These add little cost to the system while achieving a significant improvement in bandwidth.

106 Improving memory performance via a new DRAM interface: RAMBUS
DRAM -> a high-speed interface -> CPU
The first-generation RAMBUS interface dropped RAS/CAS, replacing it with a bus that allows other accesses over the bus between the sending of the address and the return of the data (RDRAM).
Such a bus is called a packet-switched bus or split-transaction bus.

107 The second-generation RAMBUS interface
Direct RDRAM (DRDRAM) offers up to 1.6 GB/sec of bandwidth.
This interface includes separate row- and column-command buses instead of the conventional multiplexing; an 18-bit data bus; expanding from 4 to 16 internal banks per RDRAM to reduce bank conflicts; increasing the number of row buffers from 4 to 8; increasing the clock to 400 MHz; and a much more sophisticated controller on chip.
Because of the separation of data, row, and column buses, three transactions can be performed simultaneously.

108 5.10 Virtual memory
Virtual memory divides physical memory into blocks and allocates them to different processes.
A protection scheme restricts a process to the blocks belonging only to that process.
Virtual memory was invented to relieve programmers of program block allocation; it automatically manages the two levels of the memory hierarchy represented by main memory and secondary storage.

109 5.10 Virtual memory
Relocation allows the same program to run in any location in physical memory.
Page or segment is used for block, and page fault or address fault is used for miss.
Virtual addresses are translated by a combination of hardware and software into physical addresses; this process is called memory mapping or address translation.

110 The logical program in its contiguous virtual address space is shown on the left

111 There are further differences between caches and virtual memory beyond those quantitative ones mentioned in Fig. 5.32
Replacement on cache misses is primarily controlled by hardware, while virtual memory replacement is primarily controlled by the operating system.
The size of the processor address determines the size of virtual memory, but the cache size is independent of the processor address size.
Main memory acts as the upper level for secondary storage.
Virtual memory systems use either pages, fixed-size blocks (4096 to 65,536 bytes), or segments, variable-size blocks (1 byte up to 2^16 to 2^32 bytes).
Paged addressing has a single fixed-size address divided into page number and offset within a page, analogous to cache addressing.
A segmented address requires two words because of the variable size of segments: one word for the segment number and one word for an offset within the segment.

112 Typical ranges of parameters for caches and virtual memory

113 Example of how paging and segmentation divide a program

114 Paging and segmentation

115 Four memory hierarchy questions revisited
Q1: Where can a block be placed in main memory? Operating systems allow blocks to be placed anywhere in main memory.
Q2: How is a block found if it is in main memory? (Fig. 5.35) Both paging and segmentation rely on a data structure that is indexed by the page or segment number.
Q3: Which block should be replaced on a virtual memory miss? Almost all operating systems try to replace the least-recently used (LRU) block, because the past is a good predictor of the future.
Q4: What happens on a write? The write strategy is always write back.

116 The mapping of a virtual address to a physical address via a page table

117 Techniques for fast address translation
Paging means two memory accesses: 1. one memory access to obtain the physical address; 2. a second access to get the data.
A general solution is to again rely on the principle of locality: keep address translations in a special cache, the translation lookaside buffer (TLB).
A TLB entry is like a cache entry where the tag holds portions of the virtual address and the data portion holds a physical page frame number, protection field, valid bit, and usually a use bit and dirty bit.
The OS manages the contents of the TLB.

118 Fig. 5.36: The TLB uses fully associative placement; thus, the translation begins (steps 1 and 2) by sending the virtual address to all tags, and the type of memory access is checked for a violation (also in step 2) against protection information in the TLB.
The address space number (ASN) plays the same role as the PID.
Step 3: 128:1 multiplexor. Step 4: page frame (31 or 28 bits) concatenated with the 13-bit page offset.

119 Operation of the Alpha 21264 data TLB during address translation

120 Selecting a page size
The following favor a larger page size:
1. The size of the page table is inversely proportional to the page size; memory can therefore be saved by making the pages bigger.
2. A larger page size can allow larger caches with fast cache hit times.
3. Transferring larger pages to or from secondary storage, possibly over a network, is more efficient than transferring smaller pages.
4. The number of TLB entries is restricted, so a larger page size means that more memory can be mapped efficiently, thereby reducing the number of TLB misses.
The main motivation for a smaller page size is conserving storage: a small page size results in less wasted storage when a contiguous region of virtual memory is not equal in size to a multiple of the page size.

121 The overall picture of a hypothetical memory hierarchy going from virtual address to L2 cache access

122 5.11 Protection and examples of virtual memory
Multiprogramming, in which a computer is shared by several programs running concurrently, led to new demands for protection and sharing among programs.
A process is a running program plus any state needed to continue running it; exchanging processes is called a process switch or context switch.
The operating system designer must guarantee that processes do not interfere with each other's computations.
The operating system partitions main memory so that several different processes have their state in memory at the same time.
The operating system may also allow communication between processes, or save memory by reducing the number of copies of identical information.

123 Protecting processes
The simplest protection mechanism is a pair of registers that checks every address to be sure it falls between the two limits, base and bound. An address is valid if
Base <= Address <= Bound
or, when the base is added to a relative address,
(Base + Address) <= Bound
The computer designer has three more responsibilities in helping the operating system designer protect processes from each other:
1. Provide at least two modes: user mode and supervisor (operating system) mode.
2. Provide a portion of the CPU state that a user process can read but not write.
3. Provide mechanisms whereby the CPU can go from user mode to supervisor mode and vice versa.

124 Protection mechanisms
The simplest way of doing protection is to add permission flags to each page and segment.
Since few programs today intentionally modify their own code, an operating system can detect accidental writes to code by offering read-only protection to pages.
Processes are thus protected from one another by having their own page tables, each pointing to distinct pages of memory.

125 Protection mechanisms
"Rings" added to the protection structure expand memory access protection from two levels (user and kernel) to many more.
With keys and locks, a program can't unlock access to the data unless it has the key.

126 A paged virtual memory example: the Alpha memory management and the 21264 TLB
The Alpha architecture uses a combination of segmentation and paging, providing protection while minimizing page table size.
This combination provides many advantages: segmentation divides the address space and conserves page table space, while paging provides virtual memory, relocation, and protection.
64-bit virtual address space, 41-bit physical address space.

127 The organization of seg0 and seg1 in the Alpha

128 The mapping of an Alpha virtual address

129 Five protection fields of each entry of the page table
Valid - says that the page frame number is valid for hardware translation
User read enable - allows user programs to read data within this page
Kernel read enable - allows the kernel to read data within this page
User write enable - allows user programs to write data within this page
Kernel write enable - allows the kernel to write data within this page

130 The maximum virtual address and physical address
The maximum virtual and physical addresses are tied to the page size. With page sizes from 8 KB to 64 KB, the maximum virtual address is 3 x 13 + 16 = 55 bits, and the maximum physical address is 32 + 16 = 48 bits.
Memory management in the Alpha 21264 relies on page-level address translation and correct operation of the operating system to provide safety to multiple processes sharing the computer.

131 Memory hierarchy parameters of the Alpha 21264 TLB (translation lookaside buffer) and PTE (page table entry)

132 A segmented virtual memory example: protection in the Intel Pentium
The original 8086 used segments for addressing, with no virtual memory or protection mechanism.
The successors to the 8086 (IA-32) added a larger address space and a protection scheme to avoid security loopholes.
Doubling the traditional two-level address space, the Pentium has four levels of protection.
The IA-32 divides the address space, allowing both the operating system and the user access to the full space.
The IA-32 allows the operating system to maintain the protection level of the called routine for the parameters that are passed to it.

133 IA-32 has a page table entry (PTE), or a segment descriptor
The fields found in a PTE:
1. Present bit - equivalent to the PTE valid bit, used to indicate this is a valid translation
2. Base field - equivalent to a page frame address, containing the physical address of the first byte of the segment
3. Access bit - like the reference bit or use bit in some architectures, helpful for replacement algorithms
4. Attributes field - specifies the valid operations and protection levels for operations that use this segment

134 Reference: Walter A. Triebel and Avtar Singh, "The 8088 and 8086 Microprocessors: Programming, Interfacing, Software, Hardware, and Applications", Prentice-Hall, 4th ed., 2003.
DPL: descriptor privilege level; privilege checks are performed by the CPU.
Conforming: when a service routine is initiated, the current privilege level (CPL) may change, depending on whether the interrupted software was in a code segment configured as conforming or nonconforming. If the interrupted code is in a conforming code segment, the CPL does not change when the service routine is initiated.

135 The IA-32 segment descriptors are distinguished by bits in the attributes field

136 Adding sharing and protection
Half of the address space is shared by all processes and half is unique to each process, called the global address space and local address space, respectively.
Writing data requires checking the segment descriptor before the action is allowed.
Privilege levels are compared between the caller and the called routine.
Safe calls from user to OS are added via gates, along with inheriting the protection level for parameters.
To restrict entry into others' code, the IA-32 provides a special segment descriptor, or call gate, identified by a bit in the attributes field.
The call gate prevents the user from randomly jumping anywhere into a protected or more privileged code segment.

137 5.12 Crosscutting issues: the design of memory hierarchies
Superscalar CPUs need more ports to the cache: multiple instructions can be issued within a single clock cycle.
Speculative execution interacts with the memory system.
To fulfill the demand for instruction-level parallelism and clock rate, the instruction cache and the first part of instruction execution are increasingly merging.
Embedded computers have bigger instruction caches.
Embedded computers are often placed in real-time environments where a set of tasks must be completed every time period; a portion of the cache then acts like a small scratch-pad memory under program control.

138 I/O and consistency of cached data
There is little danger in the CPU seeing an old or stale copy, but I/O devices give the opportunity for other devices to cause copies to be inconsistent, or for other devices to read stale copies: the cache coherency problem.
I/O competing with the CPU for cache access will cause the CPU to stall for I/O.
If a write-through cache were used, then memory would have an up-to-date copy of the information, and there would be no stale-data issue for output.

139 139 I/O and consistency of cached data. Input requires some extra work. The software solution is to guarantee that no blocks of the I/O buffer designated for input are in the cache. One approach: a buffer page is marked as noncacheable, and the operating system always inputs to such a page. Another approach: the operating system flushes the buffer addresses from the cache before the input occurs. A hardware solution: check the I/O addresses on input to see if they are in the cache (checking of I/O addresses in parallel with processor cache accesses). Cache coherency protocols maintain coherency for multiple processors.
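The hardware solution, checking incoming I/O addresses against the cache, can be sketched with a toy direct-mapped cache. Everything here (class name, sizes, the whole-block invalidation) is an illustrative assumption, not any real controller's design:

```python
class SnoopingCache:
    """Toy direct-mapped cache: DMA input writes are snooped, and any
    matching block is invalidated so the CPU cannot read stale data."""

    def __init__(self, num_blocks=8, block_size=64):
        self.block_size = block_size
        self.num_blocks = num_blocks
        self.tags = [None] * num_blocks        # None means invalid

    def _index_tag(self, addr):
        block = addr // self.block_size
        return block % self.num_blocks, block // self.num_blocks

    def cpu_read(self, addr):
        """Return True on a hit; fill the block on a miss."""
        idx, tag = self._index_tag(addr)
        hit = self.tags[idx] == tag
        if not hit:
            self.tags[idx] = tag               # fill on miss
        return hit

    def io_input(self, addr):
        """DMA input to memory: invalidate a matching cached block."""
        idx, tag = self._index_tag(addr)
        if self.tags[idx] == tag:
            self.tags[idx] = None
```

After an `io_input` to an address the CPU had cached, the next `cpu_read` misses and refetches the fresh data from memory, which is exactly the stale-data hazard being avoided.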

140 140 The cache coherency problem

141 141 5.13 Putting it all together: Alpha 21264 memory hierarchy. The 21264 is an out-of-order execution processor that fetches up to four instructions per clock cycle and executes up to six instructions per clock cycle. Address translation maps a 48-bit virtual address to a 44-bit physical address, or in the alternative mode a 43-bit virtual address to a 41-bit physical address. At power-on, the chip loads the instruction cache serially from an external PROM (16K instructions).

142 142 5.13 Putting it all together: Alpha 21264 memory hierarchy. The preloaded instructions execute in privileged architecture library (PAL) mode. The software executed in PAL mode is simply machine language routines with some implementation-specific extensions to allow access to low-level hardware, such as the TLB. One of the first steps is to update the instruction TLB with valid page table entries (PTEs) for this process. Kernel code updates the appropriate page table entry (in memory) for each page to be mapped. A miss in the TLB is handled by PAL code; since that code may change the TLB, it must not itself rely on the TLB.

143 143 5.13 Putting it all together: Alpha 21264 memory hierarchy. Once the operating system is ready to begin executing a user process, it sets the PC to the appropriate address in segment seg0. First, a 12-bit address is sent to the 64 KB instruction cache, along with a 35-bit page number. An 8-bit address space number (ASN) is also sent, serving the same purpose as ASNs in the TLB (step 1). The instruction cache is virtually indexed and virtually tagged, so instruction TLB translations are required only on cache misses.

144 144 5.13 Putting it all together: Alpha 21264 memory hierarchy. To reduce latency, the instruction cache includes two mechanisms to begin early access of the next block: way prediction and line prediction. Way prediction: the way-predicting cache relies on a 1-bit field for every 16 bytes to predict which of the two sets will be used next. Line prediction: predict the next sequential group on a cache miss. Step 2: the tag field of the PC is compared to the address from the tag portion of the cache, and the 8-bit process ASN to the tag ASN field.
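The way-prediction idea can be sketched as a toy two-way lookup: try the predicted way first, and pay an extra probe cycle only when the prediction is wrong. The function and data layout are illustrative assumptions, not the 21264's actual circuit:

```python
def way_predicted_lookup(sets, predictor, index, tag):
    """Two-way set-associative read with way prediction.

    `sets[index]` holds two (tag, data) ways; `predictor[index]` is the
    1-bit way guess.  Returns (data_or_None, probe_cycles): a correct
    prediction costs 1 probe, a wrong one costs 2 and retrains the bit.
    """
    first = predictor[index]
    for cycles, way in enumerate((first, 1 - first), start=1):
        stored_tag, data = sets[index][way]
        if stored_tag == tag:
            predictor[index] = way      # remember the winning way
            return data, cycles
    return None, 2                      # miss in both ways
```

After one mispredicted hit, the predictor bit points at the right way, so a repeated access to the same block succeeds on the first probe.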

145 145 5.13 Putting it all together: Alpha 21264 memory hierarchy. Step 3: the valid bit is also checked; if any field has the wrong value, it is a miss. On a hit in the instruction cache, the proper fetch block is supplied, and the next way and line prediction is loaded to read the next block. Step 4: an instruction cache miss causes a simultaneous check of the instruction TLB and the instruction prefetcher. Step 5: the fully associative TLB simultaneously searches all 128 entries to find a match between the address and a valid PTE. Step 6: the instruction prefetcher is checked to see whether it holds the desired instruction address.

146 146 5.13 Putting it all together: Alpha 21264 memory hierarchy. Step 7: if the prefetcher holds the desired instructions, they are eventually supplied directly by the prefetcher. Step 8: otherwise, if there is no TLB exception, an access to the second-level cache is started. Step 9: the 35-bit block address (41-bit physical address minus the 6-bit block offset) is divided into an 18-bit tag and a 17-bit index. Step 10: the cache controller reads the tag at that index; if it matches and is valid, it is a hit. Step 11: the cache returns the critical 16 bytes first. Step 12: at the same time, a request is made for the next sequential 64-byte block.
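Step 9's field split can be checked with a few lines of Python. The helper name is an illustrative assumption, but the widths (18-bit tag + 17-bit index + 6-bit offset = 41 bits) come from the text:

```python
def split_physical_address(paddr):
    """Split a 41-bit physical address into L2-cache fields:
    6-bit block offset, 17-bit index, 18-bit tag (6 + 17 + 18 = 41)."""
    offset = paddr & 0x3F                 # low 6 bits
    index = (paddr >> 6) & 0x1FFFF        # next 17 bits
    tag = (paddr >> 23) & 0x3FFFF         # top 18 bits
    return tag, index, offset
```

Reassembling `(tag << 23) | (index << 6) | offset` recovers the original address, confirming the three fields tile the 41 bits exactly.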

147 147 5.13 Putting it all together: Alpha 21264 memory hierarchy. Step 13: the next sequential 64-byte block is loaded into the instruction prefetcher in the next 6 clock cycles. Step 14: to save time, the prefetched instructions are passed around the CPU and then written to the instruction cache while the instructions execute in the CPU. Step 15: if the instruction is not found in the secondary cache, the physical address command is sent to the ES40 system chip set via four consecutive transfer cycles on a narrow, 15-bit outbound address bus. Step 16: the address and command use the address bus for 8 CPU cycles. The ES40 connects the microprocessor to memory via a crossbar to one of two 256-bit memory buses to service the request.

148 148 The overall picture of the Alpha 21264 memory hierarchy

149 149 Alpha 21264/21164 performance speedup versus miss rate for SPECint2000

150 150 Performance of the 21264 memory hierarchy. The 21264 uses out-of-order execution, so a memory stall for one instruction may be completely hidden by the successful completion of later instructions. The higher the miss rate, the higher the speedup of the 21264 over the 21164 (which uses in-order execution and has a higher miss rate), reflecting the 21264's ability to continue executing during cache misses that stall the 21164 but hit in the L2 cache of the 21264. The peak rate of the 21264 is a CPI of 0.25, or 4 instructions per clock cycle; on SPEC95 the 21264 completes almost 2 instructions per clock cycle. Database applications show higher miss rates and higher branch misprediction, and such server workloads place heavier demands on the memory hierarchy than the SPEC benchmarks do.
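The interaction between miss rate and out-of-order execution can be illustrated with the standard CPI model, CPI = base CPI + misses per instruction × miss penalty, extended with an `overlap` factor for the fraction of miss latency hidden by out-of-order execution. The numbers below are illustrative assumptions, not measured 21264/21164 data:

```python
def effective_cpi(base_cpi, misses_per_instr, miss_penalty, overlap=0.0):
    """Effective CPI = base CPI + exposed memory stall cycles/instruction.

    `overlap` in [0, 1] models latency hidden by out-of-order execution:
    an in-order CPU exposes the full penalty (overlap=0), while an
    out-of-order CPU hides part of it.
    """
    stalls = misses_per_instr * miss_penalty * (1.0 - overlap)
    return base_cpi + stalls

# With the same 0.02 misses/instruction and 50-cycle penalty, an in-order
# CPU pays the full cycle of stall per instruction, while an out-of-order
# CPU that hides half the latency ends up far ahead:
in_order = effective_cpi(1.0, 0.02, 50, overlap=0.0)
out_of_order = effective_cpi(0.25, 0.02, 50, overlap=0.5)
```

The gap between the two grows as the miss rate rises, which is the qualitative point the slide makes about the 21264 versus the 21164.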

151 151 CPI and misses per 1000 instructions for running a TPC-C-like database workload and the SPEC95 benchmarks on the Alpha 21264 in the Compaq ES40

152 152 5.14 Another view: The Emotion Engine of the Sony Playstation 2. In the operation of the Playstation 2, the data are often a continuous stream. The steady stream of graphics and audio demanded by electronic games leads to a different approach to memory design: high bandwidth via many dedicated independent memories. Fig. 5.46 shows that much smaller caches can capture the misses for multimedia applications.

153 153 Three C’s for MPEG3 decode

154 154 Figure 5.47: The block diagram of the Sony Playstation 2. Since the Playstation 2 is a game machine, there are interfaces for video, sound, and a DVD player, plus two standard computer I/O buses (USB and IEEE 1394), a PCMCIA slot, and a modem. It includes a 34-MHz MIPS processor that also acts as the emulation computer to run games for earlier Sony Playstations. It also connects to a standard PC audio card to provide the sound for the games.

155 155 Figure 5.47: The block diagram of the Sony Playstation 2. For comparison, the 21264 microprocessor in 0.25-micron technology is about 160 mm² and uses 15.2M transistors. The Emotion Engine in 0.25-micron technology is about 225 mm² and uses 13.5M transistors, and the Graphics Synthesizer is 279 mm². The Emotion Engine issues two instructions per clock cycle and provides 128-bit SIMD instructions for multimedia applications. Vector Unit 0 is primarily a DSP-like coprocessor for the CPU. Vector Unit 1 has similar functions to VPU0, but it normally operates independently of the CPU.

156 156 Block diagram of the Sony Playstation 2

157 157 Figure 5.47: The block diagram of the Sony Playstation 2. The PS2 uses two PC800 (400 MHz) DRDRAM chips over two channels, offering 32 MB of storage and a peak memory bandwidth of 3.2 GB/sec. The Graphics Synthesizer takes rendering commands from the Emotion Engine in what are commonly called display lists: lists of 32-bit commands that tell the renderer what shapes to use and where to place them, plus what colors and textures to fill them with. The Graphics Synthesizer contains the full video buffer and has a 2048-bit-wide interface so that pixel filling is not a bottleneck. These separate memories dedicated to individual functions inexpensively achieve greater memory bandwidth for the entire system.

158 158 Figure 5.47: The block diagram of the Sony Playstation 2. A major insight shaped the design of the Emotion Engine: generally, in a racing car game there are foreground objects that are constantly changing and background objects that change less in reaction to events, although the background can occupy most of the screen. This observation led to a split of responsibilities: CPU-VPU0 handles the foreground action, and VPU1 handles the background. The Emotion Engine thus combines (1) the traditional 64-bit MIPS architecture including a floating-point unit, (2) the MIPS architecture extended with multimedia instructions (VPU0), (3) an independent vector processor (VPU1), and (4) an image processing unit to accelerate MPEG decoding. Split function + dedicated memory (four memories).

159 159 Figure 5.47: The block diagram of the Sony Playstation 2. The programmer organizes all memories as two double buffers, one pair for incoming DMA data and one pair for outgoing DMA data. The programmer then sets up the 10 DMA channels, taking care to meet the real-time deadline for realistic animation of 15 frames per second. Fig. 5.48 shows two modes of use. Serial: CPU/VPU0 acts as a preprocessor on what to give VPU1 for it to create for the Graphics Interface, using the scratchpad memory as the buffer. Parallel: both CPU/VPU0 and VPU1 create display lists; the display lists and the Graphics Synthesizer have multiple context identifiers to distinguish the parallel display lists and produce a coherent final image. There are two dedicated buses: a 128-bit path between the CPU and VPU0 and a 128-bit path between VPU1 and the Graphics Interface. The programmer also chooses which bus to use when setting up the DMA channels.
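The double-buffering pattern described above can be sketched in Python: while one buffer drains to the renderer, the other fills from (simulated) DMA, and the two swap roles each frame. Everything here is an illustrative assumption, not PS2 code:

```python
def double_buffered_frames(frames, render):
    """Process a stream of data blocks through a pair of buffers.

    Each iteration, the DMA engine fills one buffer while the renderer
    consumes the other; the buffers then swap, so transfer and compute
    overlap instead of serializing.
    """
    out = []
    buffers = [None, None]
    fill = 0                                  # buffer being filled by DMA
    for data in frames:
        buffers[fill] = data                  # DMA writes into one buffer
        drain = 1 - fill
        if buffers[drain] is not None:
            out.append(render(buffers[drain]))  # consumer drains the other
        fill = drain                          # swap roles each frame
    if buffers[1 - fill] is not None:         # drain the last filled buffer
        out.append(render(buffers[1 - fill]))
    return out
```

Every input block is rendered exactly once and in order; the pattern simply staggers filling and draining so neither side waits for the other.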

160 160 Two modes of using emotion engine organization

161 161 5.15 Another view: The Sun Fire 6800 Server. The Sun Fire 6800 is a midrange multiprocessor server with particular attention paid to the memory system. The emphasis of this server is cost-performance for commercial computing, including database applications such as data warehousing and data mining. This server also includes special features to improve availability and maintainability. The miss rate is 0.5% for a 1 MB data cache. Multiprocessing adds coherency misses (a fourth C).

162 162 Clock cycles per instruction for memory accesses versus off-chip cache size for a four-processor server

163 163 Technical summary of the Sun Fire 6800 server and UltraSPARC III microprocessor

164 164 Sun Fire 6800 server block diagram

165 165 Sun Fire 6800 Server. The data switch yields a peak bandwidth to off-chip memory of 11 GB/sec. Error correction codes enable buses and memories both to detect and to correct errors. The idea is to calculate and store parity over different subsets of the bits in the protected word; when the parity does not match, it indicates an error. The Sun Fire ECC was also designed to detect any pair of bit errors, and to detect whether a whole DRAM chip failed, turning all the bits of an 8-bit-wide chip to 0. This is single error correcting/double error detecting (SEC/DED).
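The parity-over-subsets idea can be seen in the classic Hamming(7,4) code, which locates and corrects any single flipped bit. The Sun Fire's actual code is stronger (SEC/DED plus whole-chip failure detection); this sketch shows only the single-error-correcting principle:

```python
def hamming74_encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (list of bits).

    Parity bits sit at 1-indexed positions 1, 2, 4; each covers the
    positions whose index has that bit set, so the failing parity
    checks later add up to the position of a single flipped bit.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                              # 1-indexed positions 1..7
    code[3], code[5], code[6], code[7] = d
    for p in (1, 2, 4):                         # parity positions
        for i in (3, 5, 6, 7):                  # data positions
            if i & p:
                code[p] ^= code[i]
    return code[1:]

def hamming74_correct(bits):
    """Return (corrected_bits, error_position); position 0 = no error."""
    bits = list(bits)
    syndrome = 0
    for p in (1, 2, 4):
        parity = 0
        for i in range(1, 8):
            if i & p:
                parity ^= bits[i - 1]
        if parity:
            syndrome += p
    if syndrome:
        bits[syndrome - 1] ^= 1                 # flip the bad bit back
    return bits, syndrome
```

Flipping any one bit of a codeword yields a nonzero syndrome equal to the bit's position, and the decoder restores the original word.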

166 166 Sun Fire 6800 Server. Memory is connected directly to the processor to lower latency (a DRAM controller on the chip saves 80 ns). The UltraSPARC III includes the tags for the L2 cache on chip, saving 10 clock cycles off a miss. The on-chip caches are both four-way set associative (32 KB instruction, 64 KB data). To reduce the latency to the data cache, it combines an address adder with the word line decoder; this sum-addressed memory reduced latency from 3 to 2 clock cycles. The L1 data cache uses write through (no-write allocate) and the L2 cache uses write back (write allocate). Both caches provide parity to detect errors.
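The mixed write policy can be sketched at whole-block granularity: a store updates L1 only on a hit (write through, no-write allocate) and always lands dirty in L2 (write back, write allocate). The dict-based model is an illustrative assumption, not the UltraSPARC III's controller:

```python
def store(addr, value, l1, l2, block=64):
    """Model a store through a write-through L1 and a write-back L2.

    `l1` maps block number -> value; `l2` maps block number ->
    (value, dirty).  L1 is no-write-allocate, so a store that misses
    L1 skips it entirely; L2 is write-allocate, so the store always
    lands in L2 and marks the block dirty.  Memory itself would be
    updated only when a dirty L2 block is later evicted.
    """
    b = addr // block
    if b in l1:                    # no-write-allocate: an L1 miss skips L1
        l1[b] = value              # write-through: L1 holds no dirty state
    l2[b] = (value, True)          # write-back + write-allocate: dirty in L2
```

Because L1 is write-through, it never holds the only copy of modified data, so an L1 block can be dropped at any time without a writeback.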

167 167 Instruction and data misses per 1000 instruction as cache size varies from 4 KB to 4096 KB

168 168 5.16 Fallacies and pitfalls. Fallacy: predicting cache performance of one program from another. Pitfall: simulating enough instructions to get accurate performance measures of the memory hierarchy. Pitfall: too small an address. Pitfall: emphasizing memory bandwidth in DRAMs versus memory latency. Pitfall: delivering high memory bandwidth in a cache-based system. Pitfall: ignoring the impact of the operating system on the performance of the memory hierarchy. Pitfall: relying on the operating system to change the page size over time.

169 169 Instruction misses per 1000 references for five inputs to the perl benchmark from SPEC2000

170 170 Comparison of price SDRAM versus DRDRAM in memory modules and in systems in 2000

171 171 Top 10 in memory bandwidth as measured by the copy portion of the STREAM benchmark [McCalpin 2001]

172 172 Misses and time spent in misses for applications and operating system

173 173 5.17 Concluding Remarks. Memory speed must be balanced against CPU speed. Do 8 KB pages make sense with a terabyte of main memory? The design decisions at all these levels interact, and the architect must take a whole-system view to make wise decisions. New inventions: prefetching, cache-aware compilers, increasing page size. Design time must likewise be balanced against debug time.
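The page-size question above can be made concrete with a little arithmetic: at 8 KB per page, a terabyte of main memory holds 2^27 pages, a very large number of mappings for the operating system to manage:

```python
page_size = 8 * 1024                 # 8 KB pages
memory = 1 << 40                     # 1 TB of main memory
num_pages = memory // page_size      # 2**40 / 2**13 = 2**27
assert num_pages == 1 << 27          # 134,217,728 pages
```

A larger page size divides this count proportionally, which is one motivation for revisiting page size as memories grow.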

174 174 Desktop, embedded, and server microprocessors in 2001.


176 176 5.18 Historical Perspective and references. A two-level memory hierarchy—1962. Translation lookaside buffer—1978. IBM 360—32 bits of address (1964). Measuring program behavior as memory traffic as well as miss rate—1968. Spatial locality and temporal locality—1982. Three C's model—1987. The victim cache—1993. Nonblocking caches—1981, 1998. Multilevel inclusion property—1988. Multilevel exclusion property—1994.

177 177 5.18 Historical Perspective and references. Prefetching via stream buffers—1990. Stream buffers that work well with nonblocking loads and speculative execution for in-order processors—1995. Out-of-order execution—1997. Measurements of the SPEC2000 benchmarks collected by Cantin and Hill—2001.

