Main Memory
Assume the following performance:
4 clock cycles to send the address
56 clock cycles of access time per word
4 clock cycles to send a word of data
Choose a cache block of 4 words, with 8-byte words (32 bytes per block). Then the miss penalty is 4×(4+56+4) = 256 clock cycles, and the memory bandwidth is 32/256 = 1/8 byte per clock cycle.
Fig. 5.27. Three examples of bus width, memory width, & memory interleaving to achieve higher memory bandwidth
Techniques for Higher Bandwidth
1. Wider main memory: Quadrupling the width of the cache and the memory will quadruple the memory bandwidth. With a main memory width of 4 words, the miss penalty would drop from 256 cycles to 64 cycles.
2. Simple interleaved memory: Sending an address to four banks permits them all to read simultaneously. The miss penalty is now 4+56+(4×4) = 76 clock cycles.
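The three miss penalties quoted so far (256, 64, and 76 cycles) can be reproduced with a minimal sketch. The function name and parameters are illustrative rather than taken from the slides, and the interleaved case assumes at least as many banks as words in the block.

```python
def miss_penalty(block_words, send_addr=4, access=56, transfer=4,
                 width_words=1, interleaved=False):
    """Clock cycles to fetch one cache block from main memory."""
    if interleaved:
        # One address goes to all banks, which access in parallel;
        # the words then return over the one-word bus one at a time.
        # Assumes number of banks >= block_words.
        return send_addr + access + block_words * transfer
    # Otherwise every memory-width chunk pays the full
    # address + access + transfer cost.
    chunks = block_words // width_words
    return chunks * (send_addr + access + transfer)


basic = miss_penalty(4)                          # one-word-wide memory
wide = miss_penalty(4, width_words=4)            # four-word-wide memory
interleaved = miss_penalty(4, interleaved=True)  # four banks

# Bandwidth of the basic organization, assuming 8-byte words.
basic_bandwidth = 4 * 8 / basic

print(basic, wide, interleaved, basic_bandwidth)
```

The interleaved organization keeps the narrow bus and still gets most of the benefit of the wide memory, because only the 4-cycle transfers are serialized.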
Example
What can interleaving and wide memory buy? Consider the following machine:
Block size = 1 word
Memory bus width = 1 word
Miss rate = 3%
Memory accesses per instruction = 1.2
Cache miss penalty = 64 cycles
Average CPI (ignoring cache misses) = 2
If we change the block size to 2 words, the miss rate falls to 2%, and a 4-word block has a miss rate of 1.2%. What is the improvement in performance of interleaving two ways and four ways versus doubling the width of memory and the bus?
Solution (1)
CPI for the computer using 1-word blocks = 2 + (1.2×3%×64) = 4.30
Since the clock cycle time and instruction count won’t change in this example, we calculate performance improvement by just comparing CPI.
Increasing the block size to 2 words gives these options:
64-bit bus and memory, no interleaving = 2 + (1.2×2%×2×64) = 5.07
64-bit bus and memory, interleaving = 2 + (1.2×2%×(4+56+8)) = 3.63
128-bit bus and memory, no interleaving = 2 + (1.2×2%×1×64) = 3.54
Thus, doubling the block size slows down the straightforward implementation (5.07 versus 4.30), while interleaving or wider memory is 1.19 or 1.22 times faster, respectively.
Solution (2)
Increasing the block size to 4 words gives these options:
64-bit bus and memory, no interleaving = 2 + (1.2×1.2%×4×64) = 5.69
64-bit bus and memory, interleaving = 2 + (1.2×1.2%×(4+56+16)) = 3.09
128-bit bus and memory, no interleaving = 2 + (1.2×1.2%×2×64) = 3.84
Again, the larger block hurts performance for the simple case (5.69 versus 4.30), although the interleaved 64-bit memory is now fastest: 1.39 times faster, versus 1.12 for the wider memory and bus.
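The CPI figures in both parts of the solution follow from one formula, CPI = base CPI + accesses per instruction × miss rate × miss penalty. The sketch below recomputes them; the helper name and the dictionary labels are illustrative only.

```python
def cpi(miss_rate, penalty, base_cpi=2.0, accesses_per_instr=1.2):
    """Average CPI including memory stall cycles."""
    return base_cpi + accesses_per_instr * miss_rate * penalty


baseline = cpi(0.03, 64)  # 1-word blocks

options = {
    "2-word block, 64-bit, no interleaving": cpi(0.02, 2 * 64),
    "2-word block, 64-bit, interleaving": cpi(0.02, 4 + 56 + 8),
    "2-word block, 128-bit, no interleaving": cpi(0.02, 1 * 64),
    "4-word block, 64-bit, no interleaving": cpi(0.012, 4 * 64),
    "4-word block, 64-bit, interleaving": cpi(0.012, 4 + 56 + 16),
    "4-word block, 128-bit, no interleaving": cpi(0.012, 2 * 64),
}

# Speedup is the ratio of CPIs, because clock cycle time and
# instruction count are the same for every option.
for name, value in options.items():
    print(f"{name}: CPI = {value:.2f}, speedup = {baseline / value:.2f}")
```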
Interleaved Memory
Interleaved memory is logically a wide memory, except that accesses to banks are staged over time to share internal resources.
How many banks should be included? One metric, used in vector computers, is
Number of banks ≥ Number of clock cycles to access a word in a bank
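One way to see why the bank-count rule matters is a toy simulation of a sequential word stream issued at one address per clock. The slide does not say how addresses map to banks; the sketch assumes low-order interleaving (bank = word address mod number of banks), and the function name is hypothetical.

```python
def sequential_stalls(num_banks, access_cycles, num_words):
    """Stall cycles when one sequential word address is issued per clock."""
    free_at = [0] * num_banks  # cycle at which each bank can start a new access
    cycle = 0
    stalls = 0
    for word_addr in range(num_words):
        bank = word_addr % num_banks  # assumed low-order interleaving
        if free_at[bank] > cycle:
            # The bank is still busy with an earlier access: wait for it.
            stalls += free_at[bank] - cycle
            cycle = free_at[bank]
        free_at[bank] = cycle + access_cycles
        cycle += 1
    return stalls


enough_banks = sequential_stalls(num_banks=8, access_cycles=8, num_words=64)
too_few_banks = sequential_stalls(num_banks=4, access_cycles=8, num_words=64)
print(enough_banks, too_few_banks)
```

Under these assumptions, a stream never waits when the number of banks is at least the access time in cycles, because each bank has finished by the time the stream returns to it.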
Virtual Memory
At any instant in time computers are running multiple processes, each with its own address space. It is too expensive to dedicate a full address space worth of memory to each process, especially since many processes use only a small part of their address space. We need a way to share a smaller amount of physical memory among many processes.
One way, virtual memory, divides physical memory into blocks and allocates them to different processes. There must be a protection scheme that restricts a process to the blocks belonging only to that process.
Fig. 5.31. A program in its contiguous virtual address space
Comparison with Caches
In virtual memory, a block is called a page or segment, and a miss is called a page fault or address fault.
The CPU produces virtual addresses that are translated by a combination of hardware and software to physical addresses, which access main memory. This process is called memory mapping or address translation.
Replacement on cache misses is primarily controlled by hardware, while virtual memory replacement is primarily controlled by the operating system.
The size of the processor address determines the size of virtual memory, but the cache size is independent of the processor address size.
Figure 5.32. Typical ranges of parameters for caches and virtual memory
Parameter | L1 cache | Virtual memory
Block (page) size | 16 – 128 B | 4,096 – 65,536 B
Hit time | 1 – 3 clock cycles | 50 – 150 clock cycles
Miss penalty | 8 – 150 clock cycles | $10^{6}$ – $10^{7}$ clock cycles
(access time) | (6 – 130 clock cycles) | ($8 \times 10^{5}$ – $8 \times 10^{6}$ clock cycles)
(transfer time) | (2 – 20 clock cycles) | ($2 \times 10^{5}$ – $2 \times 10^{6}$ clock cycles)
Miss rate | 0.1 – 10% | $10^{-5}$ – $10^{-3}$ %
Address mapping | 25 – 45 bit physical address to 14 – 20 bit cache address | 32 – 64 bit virtual address to 25 – 45 bit physical address
Figure 5.33. How paging and segmentation divide a program
Fig. 5.34. Paging versus segmentation
Why two words per address for a segment?
Property | Page | Segment
Words per address | One | Two (segment and offset)
Programmer visible? | Invisible | May be visible
Replacing a block | Trivial (all blocks are the same size) | Hard (must find a contiguous, variable-sized, unused portion of main memory)
Memory use inefficiency | Internal fragmentation (unused portion of page) | External fragmentation (unused pieces of main memory)
Efficient disk traffic | Yes (adjusting page size to balance access time and transfer time) | Not always (small segments may transfer just a few bytes)
Four Questions
1. Where can a block be placed in main memory? Miss penalty is high. So, choose direct-mapped, fully associative, or set associative?
2. How is a block found if it is in main memory? Paging and segmentation (tag, index, offset fields).
3. Which block should be replaced on a virtual memory miss? Random, LRU, or FIFO.
4. What is the write policy? Write through, write back, write allocate, or no-write allocate?
Paging
Paging uses a data structure that is indexed by the page number. This structure contains the physical address of the block. The offset is concatenated to the physical page address.
The structure takes the form of a page table. Indexed by the virtual page number, the size of the table equals the number of pages in the virtual address space.
Given a 32-bit virtual address, 4 KB pages, and four bytes per page table entry, the size of the page table would be $(2^{32}/2^{12}) \times 2^{2} = 2^{22}$ bytes, or 4 MB.
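The page-table lookup and the 4 MB size calculation can be sketched as follows. This is an illustration under assumptions: the page table is modeled as a Python dictionary from virtual page number to physical page number, the toy mapping is made up, and a missing entry (a KeyError) stands in for a page fault.

```python
PAGE_BITS = 12  # 4 KB pages, so the page offset is 12 bits


def page_table_bytes(va_bits, page_bytes, pte_bytes):
    """Size of a single-level page table covering the whole virtual space."""
    entries = 2 ** va_bits // page_bytes
    return entries * pte_bytes


def translate(vaddr, page_table, page_bits=PAGE_BITS):
    """Map a virtual address to a physical address via the page table."""
    vpn = vaddr >> page_bits                 # virtual page number indexes the table
    offset = vaddr & ((1 << page_bits) - 1)  # offset passes through unchanged
    ppn = page_table[vpn]                    # KeyError here models a page fault
    return (ppn << page_bits) | offset       # concatenate page address and offset


table_size = page_table_bytes(va_bits=32, page_bytes=4096, pte_bytes=4)

toy_table = {0x00005: 0x2A}  # hypothetical mapping: virtual page 5 -> physical page 0x2A
paddr = translate(0x00005ABC, toy_table)
print(table_size, hex(paddr))
```

The same size function can be applied to other address widths and page sizes by changing its three arguments.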
Figure 5.35. Mapping of virtual address to physical address via page table
How can we reduce address translation time?
Figure 5.35. Again
Use these values: 64-bit virtual address, 8 KB page size.
What is the number of entries in the page table?
What is the size of the page table?
Alpha 21264 Memory Management 1
64-bit address space
43-bit virtual address
Three segments:
1. seg0: bits 63-43 = 00…0
2. seg1: bits 63-43 = 11…1
3. kseg
Segment kseg is reserved for the operating system
User processes use seg0
Page tables reside in seg1
Alpha 21264 Memory Management 2
PTE (page table entry) is 64 bits (8 bytes)
Each page table has 1,024 PTEs
Page size is thus 8 KB
Virtual address is 43 bits (why?)
Physical page number is 28 bits
Physical address is thus 41 bits (why?)
Possible to increase page size to 16, 32, or 64 KB
If page size = 64 KB, then virtual and physical addresses become 55 and 44 bits, resp. (why?)
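The address widths on this slide can be derived with a short calculation. The sketch assumes a three-level page table in which each table fills exactly one page. The slide does not state the number of levels, but three levels is consistent with every figure it quotes. The function name is illustrative.

```python
def alpha_address_bits(page_bytes, pte_bytes=8, levels=3, ppn_bits=28):
    """Return (virtual address bits, physical address bits).

    Assumes `levels` page-table levels (3 is an assumption, not stated
    on the slide) and that each page table occupies exactly one page.
    """
    offset_bits = page_bytes.bit_length() - 1               # log2(page size)
    index_bits = (page_bytes // pte_bytes).bit_length() - 1  # log2(PTEs per table)
    virtual_bits = levels * index_bits + offset_bits
    physical_bits = ppn_bits + offset_bits
    return virtual_bits, physical_bits


for kb in (8, 16, 32, 64):
    v, p = alpha_address_bits(kb * 1024)
    print(f"{kb} KB pages: {v}-bit virtual, {p}-bit physical")
```

With 8 KB pages, each table holds 1,024 PTEs (10 index bits), so the virtual address is 3×10 + 13 = 43 bits and the physical address is 28 + 13 = 41 bits. With 64 KB pages the same reasoning gives 3×13 + 16 = 55 and 28 + 16 = 44 bits.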
Fig. 5.43. Overview of Alpha 21264 memory hierarchy