1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2: Memory Hierarchy Design. Computer Architecture: A Quantitative Approach, Fifth Edition.

2 Introduction
- Programmers want very large memory with low latency
- Fast memory technology is more expensive per bit than slower memory
- Solution: organize the memory system into a hierarchy
  - The entire addressable memory space is available in the largest, slowest memory
  - Incrementally smaller and faster memories, each containing a subset of the memory below it, proceed in steps up toward the processor
- Temporal and spatial locality ensure that nearly all references can be found in the smaller memories
  - Gives the illusion of a large, fast memory being presented to the processor

3 Memory Hierarchy [Figure: levels from the processor down through L1 cache, L2 cache, L3 cache, main memory, and hard drive or flash; latency and capacity (KB, MB, GB, TB) grow with distance from the processor]

4 [Figure: processor with split L1 I-cache and D-cache, unified L2 cache, unified L3 cache, and main memory]
Intel Core i7:
- 2 data memory accesses per clock per core
- 3.2 GHz clock with 4 cores: 25.6e9 64-bit data accesses per second plus 12.8e9 128-bit instruction accesses per second
- Peak demand: 25.6e9 × 8 B + 12.8e9 × 16 B = 409.6 GB/s, while main memory supplies only 25 GB/s
- Multi-port, pipelined caches
- Independent caches
- Shared third-level cache
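The bandwidth figures on this slide can be re-derived with a few lines of arithmetic. The sketch below is only a check of the quoted numbers; the constant and function names are illustrative, and the one-instruction-fetch-per-clock-per-core rate is an assumption inferred from the 12.8e9 figure.

```python
# Re-derive the Core i7 peak bandwidth demand quoted on the slide.
CLOCK_HZ = 3_200_000_000      # 3.2 GHz
CORES = 4
DATA_REFS_PER_CLOCK = 2       # per core, each 64 bits (8 bytes)
INSTR_REFS_PER_CLOCK = 1      # per core, each 128 bits (16 bytes); assumed from 12.8e9

def peak_demand():
    """Return (data refs/s, instruction refs/s, total demand in GB/s)."""
    data_refs = CLOCK_HZ * CORES * DATA_REFS_PER_CLOCK
    instr_refs = CLOCK_HZ * CORES * INSTR_REFS_PER_CLOCK
    total_bytes = data_refs * 8 + instr_refs * 16
    return data_refs, instr_refs, total_bytes / 1e9

data_refs, instr_refs, demand_gb = peak_demand()
dram_fraction = 25 / demand_gb   # share of the demand that 25 GB/s of DRAM could cover
```

With these inputs the DRAM bandwidth covers roughly 6% of the peak demand, which is the gap the cache hierarchy has to absorb.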

7 Performance and Power
- High-end microprocessors have more than 10 MB of on-chip cache
  - Consumes a large share of the area and power budget
  - Static power: leakage
  - Dynamic power: reads and writes
- Caches account for 25% to 50% of power consumption in personal mobile devices (PMDs)

8 Memory Hierarchy Basics
- When a word is not found in the cache, a miss occurs:
  - Fetch the word from a lower level in the hierarchy, requiring a higher-latency reference
  - The lower level may be another cache or the main memory
  - Also fetch the other words contained within the block, which takes advantage of spatial locality
  - Place the block into the cache in any location within its set, where the set is determined by the address: (block address) MOD (number of sets)

9 Placement Problem Main Memory Cache Memory

10 Placement Policies WHERE to put a block in cache Mapping between main and cache memories. Main memory has a much larger capacity than cache memory.

11 Fully Associative Cache [Figure: memory blocks 0 to 31 and an 8-block cache] Block can be placed in any location in cache.

12 Direct Mapped Cache [Figure: memory blocks 0 to 31 and an 8-block cache] Location = (Block address) MOD (Number of blocks in cache), e.g. 12 MOD 8 = 4. Block can be placed ONLY in a single location in cache.

13 Set Associative Cache [Figure: memory blocks 0 to 31 and an 8-block cache organized as 4 sets (set no. 0 to 3) of 2 blocks] Set = (Block address) MOD (Number of sets in cache), e.g. 12 MOD 4 = 0. Block can be placed in one of n locations in n-way set associative cache.
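The three placement policies differ only in how many blocks share a set, so one small function can reproduce the slide examples. This is a minimal sketch; `candidate_frames` is a hypothetical helper name, and numbering the frames of a set consecutively is an assumption made for illustration.

```python
def candidate_frames(block_addr, num_blocks, ways):
    """Return (set index, list of cache frames) where a memory block may be placed.

    ways == 1          -> direct mapped
    ways == num_blocks -> fully associative
    otherwise          -> n-way set associative
    """
    num_sets = num_blocks // ways
    set_index = block_addr % num_sets          # (block address) MOD (number of sets)
    first_frame = set_index * ways             # frames of a set are assumed contiguous
    return set_index, list(range(first_frame, first_frame + ways))

direct = candidate_frames(12, 8, 1)    # 12 MOD 8 = 4
two_way = candidate_frames(12, 8, 2)   # 12 MOD 4 = 0
fully = candidate_frames(12, 8, 8)     # a single set holding every frame
```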

14 Memory Hierarchy Basics
- n blocks per set => n-way set associative
  - Direct-mapped cache => one block per set
  - Fully associative => one set
- Writing to cache: two strategies
  - Write-through: immediately update lower levels of hierarchy
  - Write-back: only update lower levels of hierarchy when an updated block is replaced
  - Both strategies use a write buffer to make writes asynchronous

15 Dirty bit(s) Indicates if the block has been written to. No need in I-caches. No need in write through D-cache. Write back D-cache needs it.

16 Write back [Figure: CPU, cache with dirty bit (D), and main memory]

17 Write through [Figure: CPU, cache, and main memory]
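The difference between the two write strategies shows up in how many writes reach main memory. The sketch below counts them for a deliberately tiny, hypothetical model: a cache that holds a single block, allocates on a write, and is flushed at the end. It is an illustration of the dirty-bit idea, not a model of any real cache.

```python
def memory_writes(store_blocks, policy):
    """Count writes that reach main memory for a one-block cache (toy model)."""
    if policy not in ("write-through", "write-back"):
        raise ValueError("unknown policy: " + policy)
    cached, dirty, writes = None, False, 0
    for block in store_blocks:
        if block != cached:
            # Replacing a block: write-back must first write out a dirty victim.
            if policy == "write-back" and dirty:
                writes += 1
            cached, dirty = block, False
        if policy == "write-through":
            writes += 1          # every store goes straight to memory
        else:
            dirty = True         # only mark the block; memory is updated later
    if policy == "write-back" and dirty:
        writes += 1              # final flush of the last dirty block
    return writes

stores = [5, 5, 5, 9, 9]         # block numbers written by the CPU
```

For this sequence, write-through sends all five stores to memory, while write-back sends one write per dirty block that is replaced or flushed.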

18 Memory Hierarchy Basics
- Miss rate: fraction of cache accesses that result in a miss
- Causes of misses (Three Cs)
  - Compulsory: first reference to a block. Independent of cache size
  - Capacity: blocks discarded and later retrieved because the cache cannot hold every block the program needs
  - Conflict: occurs in non-fully-associative caches. The program makes repeated references to multiple addresses from different blocks that map to the same location in the cache

20 Memory Hierarchy Basics
- Speculative and multithreaded processors may execute other instructions during a miss
  - Reduces performance impact of misses
  - The cache must service requests while handling a miss

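21 Cache organization [Figure: the CPU address is split into tag, index, and block offset fields; the index selects an entry holding a valid bit, tag, and data; a comparator (=) checks the stored tag against the address tag and a MUX selects the data]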

22 64 KB two-way set-associative cache (AMD Opteron) [Figure: 1024 blocks of 64 bytes each, with a write buffer]
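The cache parameters on this slide fix how an address is divided into tag, index, and block offset. The following sketch works that out; the function names are hypothetical, and the 40-bit physical address width is an assumption that the slide does not state.

```python
def cache_geometry(cache_bytes, block_bytes, ways, addr_bits):
    """Derive block count, set count, and address field widths (sizes must be powers of two)."""
    blocks = cache_bytes // block_bytes
    sets = blocks // ways
    offset_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = sets.bit_length() - 1           # log2(number of sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return {"blocks": blocks, "sets": sets, "offset_bits": offset_bits,
            "index_bits": index_bits, "tag_bits": tag_bits}

def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, index, block offset)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# 64 KB, 64-byte blocks, two-way; 40 address bits is an assumed width.
geometry = cache_geometry(64 * 1024, 64, 2, 40)
fields = split_address(0x12345, geometry["offset_bits"], geometry["index_bits"])
```

Under these assumptions the cache has 512 sets, so 9 index bits and 6 offset bits leave 25 bits of tag.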

23 Memory Hierarchy Basics
Six basic cache optimizations:
- Larger block size
  - Reduces compulsory misses
  - Increases capacity and conflict misses, increases miss penalty
- Larger total cache capacity to reduce miss rate
  - Increases hit time, increases power consumption
- Higher associativity
  - Reduces conflict misses
  - Increases hit time, increases power consumption
- Higher number of cache levels
  - Reduces overall memory access time
  - A design compromise between cache and block size, hit time, and power
- Giving priority to read misses over writes
  - Checking the write buffer on a read miss
  - Reduces miss penalty
- Avoiding address translation in cache indexing
  - Using part of the page offset in the virtual address to index the cache
  - Reduces hit time
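Each optimization above trades hit time, miss rate, and miss penalty against one another, and the standard average memory access time relation (hit time + miss rate × miss penalty) makes the trade visible. The sketch below applies it to one and two cache levels; every numeric input is a made-up illustration value, not a measurement from the slides.

```python
def amat_one_level(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, memory_penalty):
    """With an L2, the L1 miss penalty is itself the access time of the L2."""
    l1_miss_penalty = amat_one_level(l2_hit, l2_local_miss_rate, memory_penalty)
    return amat_one_level(l1_hit, l1_miss_rate, l1_miss_penalty)

# Hypothetical inputs: 1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit,
# 50% local L2 miss rate, 100-cycle memory access.
without_l2 = amat_one_level(1, 0.04, 100)
with_l2 = amat_two_level(1, 0.04, 10, 0.5, 100)
```

With these inputs the second level lowers the average from about 5 cycles to about 3.4 cycles, which is the sense in which more cache levels reduce overall memory access time.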

25 Ten Advanced Optimizations
Optimization categories:
- Reducing the hit time
- Increasing cache bandwidth
- Reducing miss penalty
- Reducing miss rate
- Reducing miss penalty or miss rate via parallelism

26 1) Small and simple L1 caches
- Limited size and lower associativity result in lower hit time
- Critical timing path:
  - addressing the tag memory using the index
  - comparing the tags read with the address
  - selecting the correct way through the multiplexer
- Direct-mapped caches can overlap tag compare and transmission of data
- Lower associativity reduces power because fewer cache lines are accessed

27 L1 Size and Associativity [Figure: access time vs. size and associativity]
- 2-way is 1.2 times faster than 4-way
- 4-way is 1.4 times faster than 8-way

30 Three reasons for higher associativity in L1
- Cache access already takes at least 2 clock cycles, which makes the hit time less important
- Virtual indexing limits the cache size to the page size times the associativity
- Multithreading increases conflict misses at lower associativities
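The virtual-indexing limit can be checked numerically: the index and block-offset bits must fit inside the page offset, which caps the cache at page size times associativity. The sketch below uses hypothetical helper names and an assumed 4 KB page with 64-byte blocks purely as an example.

```python
def max_vipt_cache_bytes(page_bytes, ways):
    """Largest virtually indexed cache whose index comes entirely from the page offset."""
    return page_bytes * ways

def index_fits_in_page_offset(cache_bytes, block_bytes, ways, page_bytes):
    """True if index bits plus block-offset bits do not exceed the page-offset bits."""
    sets = cache_bytes // (block_bytes * ways)
    # sets * block_bytes is the number of bytes covered by the index + offset fields
    return sets * block_bytes <= page_bytes

limit_8_way = max_vipt_cache_bytes(4096, 8)                    # assumed 4 KB pages
ok_8_way = index_fits_in_page_offset(32 * 1024, 64, 8, 4096)
ok_4_way = index_fits_in_page_offset(32 * 1024, 64, 4, 4096)
```

In this example a 32 KB cache satisfies the limit at 8 ways but not at 4 ways, which is one way higher associativity lets a virtually indexed L1 grow.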

31 2) Way Prediction
- To improve hit time, predict the way and pre-set the mux
  - A misprediction gives a longer hit time
  - Prediction accuracy: > 90% for two-way, > 80% for four-way
  - The I-cache has better accuracy than the D-cache
  - First used on the MIPS R10000 in the mid-90s
  - Used on 4-way caches in the ARM Cortex-A8
- Extension that also predicts the block to access: "way selection"
  - Reduces average energy to 0.28 of the baseline in the I-cache and 0.35 in the D-cache
  - Increases the misprediction penalty; average access time rises by a factor of 1.04 in the I-cache and 1.13 in the D-cache

33 3) Pipelining Cache
- Pipeline cache access to improve bandwidth
- Examples:
  - Pentium: 1 cycle
  - Pentium Pro to Pentium III: 2 cycles
  - Pentium 4 to Core i7: 4 cycles
- Increases branch misprediction penalty
- Makes it easier to increase associativity

34 4) Nonblocking Caches
- Allow hits before previous misses complete
  - "Hit under miss"
  - "Hit under multiple miss"
  - The Core i7 supports both, but the ARM Cortex-A8 supports only the first
- Hit under one miss reduces the miss penalty by 9% on SPECINT2006 and 12.5% on SPECFP2006
- The miss penalty reduction is hard to analyze because hits and misses overlap in time

36 5) Multibanked Caches
- Organize cache as independent banks to support simultaneous access
  - ARM Cortex-A8 supports 1-4 banks for L2
  - Intel i7 supports 4 banks for L1 and 8 banks for L2
- Interleave banks according to block address
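Interleaving by block address means consecutive blocks land in consecutive banks, so sequential accesses spread evenly. A minimal sketch of that mapping follows; the function names are illustrative, and the conflict check simply assumes each bank serves one access at a time.

```python
def bank_of(block_addr, num_banks):
    """Sequential interleaving: bank = (block address) MOD (number of banks)."""
    return block_addr % num_banks

def can_access_together(block_a, block_b, num_banks):
    """Two accesses can proceed simultaneously only if they target different banks."""
    return bank_of(block_a, num_banks) != bank_of(block_b, num_banks)

banks_for_first_eight = [bank_of(b, 4) for b in range(8)]
```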

37 6) Critical Word First, Early Restart
- Critical word first
  - Request the missed word from memory first
  - Send it to the processor as soon as it arrives
- Early restart
  - Request words in normal order
  - Send the missed word to the processor as soon as it arrives
- Effectiveness of these strategies depends on block size and on the likelihood of another access to the portion of the block that has not yet been fetched

38 7) Merging Write Buffer
- When storing a word in a block that is already pending in the write buffer, update the existing write buffer entry
- Reduces stalls due to a full write buffer
- Does not apply to I/O addresses
[Figure: write buffer contents without write merging and with write merging]
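Write merging can be sketched as a lookup before allocation: a store whose block already has a pending entry is folded into that entry instead of taking a new one. The model below is a simplified illustration with hypothetical names; entry width (4 words) and word-granular addresses are assumptions.

```python
def fill_write_buffer(word_addrs, words_per_entry=4, merging=True):
    """Return the write buffer entries produced by a sequence of word stores.

    Each entry is [block number, set of word offsets written within that block].
    """
    entries = []
    for addr in word_addrs:
        block, offset = divmod(addr, words_per_entry)
        if merging:
            for entry in entries:
                if entry[0] == block:
                    entry[1].add(offset)     # merge into the pending entry
                    break
            else:
                entries.append([block, {offset}])
        else:
            entries.append([block, {offset}])  # one entry per store
    return entries

stores = [100, 101, 102, 103]                  # four sequential word addresses
unmerged = fill_write_buffer(stores, merging=False)
merged = fill_write_buffer(stores, merging=True)
```

Four sequential stores occupy four entries without merging and a single entry with merging, which is why merging reduces stalls caused by a full buffer.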

39 8) Compiler Optimizations
- Loop Interchange
  - Swap nested loops to access memory in sequential order
  - Improves spatial locality
  - Example array: X = [5000,100], stored in row-major order
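The effect of loop interchange is easiest to see in the sequence of element offsets each loop order touches. The sketch below assumes the loop body is `x[i][j] = 2 * x[i][j]` over the 5000 × 100 row-major array and lists the first few offsets for both orders; the function name is hypothetical.

```python
def row_major_offsets(rows, cols, interchanged, count):
    """Return the first `count` element offsets touched in a row-major rows x cols array."""
    if interchanged:
        # After interchange: i is the outer loop, j the inner loop.
        order = ((i, j) for i in range(rows) for j in range(cols))
    else:
        # Before interchange: j is the outer loop, i the inner loop.
        order = ((i, j) for j in range(cols) for i in range(rows))
    offsets = []
    for i, j in order:
        if len(offsets) == count:
            break
        offsets.append(i * cols + j)   # row-major offset of x[i][j]
    return offsets

before = row_major_offsets(5000, 100, interchanged=False, count=4)
after = row_major_offsets(5000, 100, interchanged=True, count=4)
```

Before the interchange the accesses stride through memory 100 elements at a time; afterwards they are sequential, so every word of a fetched block is used before it is replaced.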

40 Blocking
- Instead of accessing entire rows or columns, subdivide matrices into blocks
- Requires fewer memory accesses by improving spatial and temporal locality
- Useful when matrices are accessed both by rows and by columns
- Memory accesses for N × N matrices: 2N³ + N² without blocking, 2N³/B + N² with blocking factor B
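A blocked matrix multiply makes the technique concrete. The sketch below follows the usual textbook shape (two outer loops stepping over B-wide strips) and includes the two access-count formulas from the slide; Python lists stand in for arrays, so it shows the loop structure rather than real cache behavior.

```python
def matmul_naive(y, z, n):
    """Reference x = y * z for n x n matrices."""
    return [[sum(y[i][k] * z[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matmul_blocked(y, z, n, b):
    """Same product computed on b-wide submatrices to improve locality."""
    x = [[0] * n for _ in range(n)]
    for jj in range(0, n, b):
        for kk in range(0, n, b):
            for i in range(n):
                for j in range(jj, min(jj + b, n)):
                    r = 0
                    for k in range(kk, min(kk + b, n)):
                        r += y[i][k] * z[k][j]
                    x[i][j] += r       # accumulate partial sums across kk strips
    return x

def accesses_unblocked(n):
    return 2 * n**3 + n**2

def accesses_blocked(n, b):
    return 2 * n**3 // b + n**2

y = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
z = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
```

The blocking factor B is normally chosen so that the submatrices in use fit in the cache together.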

41 9) Hardware Prefetching
- Fetch two blocks on a miss (include the next sequential block)
- Put the prefetched block in a stream buffer
- 8 stream buffers can capture 60% of cache misses
- Intel Core i7 supports next-block prefetching in L1 and L2
[Figure: Pentium 4 prefetching]

42 10) Compiler Prefetching
- Insert prefetch instructions before data is needed
- Non-faulting: prefetch doesn't cause exceptions
- Register prefetch: loads data into a register
- Cache prefetch: loads data into the cache
- Loops are good candidates for prefetching

44 Revised version

45 Prefetching example: execution time
- Original loop: 3 × 100 = 300 iterations at 7 cycles each = 2100 cycles
  - 251 misses × 100 cycles = 25100 cycles
  - Total execution time = 27200 cycles
- First prefetch loop: 100 iterations × 9 cycles = 900 cycles, plus 11 misses × 100 cycles = 1100 cycles, for 2000 cycles
- Second prefetch loop: 2 × 100 = 200 iterations × 8 cycles = 1600 cycles, plus 8 misses × 100 cycles = 800 cycles, for 2400 cycles
- Revised loops: 2000 + 2400 = 4400 cycles
- Speedup: 27200 / 4400 = 6.2 times faster
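The cycle counts on this slide all follow one pattern: iterations times cycles per iteration, plus misses times the 100-cycle miss penalty implied by the figures. A short sketch that recomputes them (names are illustrative):

```python
MISS_PENALTY = 100   # cycles per miss, as implied by 251 misses -> 25100 cycles

def loop_cycles(iterations, cycles_per_iteration, misses):
    """Execution time = compute cycles + miss stall cycles."""
    return iterations * cycles_per_iteration + misses * MISS_PENALTY

original = loop_cycles(3 * 100, 7, 251)
first_prefetch_loop = loop_cycles(100, 9, 11)
second_prefetch_loop = loop_cycles(2 * 100, 8, 8)
revised = first_prefetch_loop + second_prefetch_loop
speedup = original / revised
```

The speedup comes almost entirely from the miss term: the prefetching loops execute slightly more cycles per iteration but incur 19 misses instead of 251.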

46 Cache Optimization Summary Advanced Optimizations

47 Memory Technology
- Performance metrics
  - Latency is the main concern for caches
  - Bandwidth is the main concern for multiprocessors and I/O
  - With larger cache blocks, bandwidth also becomes important for caches
  - Access time: time between a read request and the arrival of the desired word
  - Cycle time: minimum time between unrelated requests to memory
- DRAM is used for main memory, SRAM for caches
- Flash memory is used in PMDs

48 Memory Technology
- SRAM
  - Access time is close to cycle time
  - Requires low power to retain a bit
  - Requires 6 transistors/bit
  - Largest L3 SRAM = 12 MB
  - 4 times faster than DRAM
- DRAM
  - Must be re-written after being read, so cycle time is longer than access time
  - Must also be periodically refreshed: every ~8 ms, taking 5% of memory time
  - All bits in a row are refreshed simultaneously
  - One transistor/bit
  - Address lines are multiplexed due to large size:
    - Upper half of address: row access strobe (RAS)
    - Lower half of address: column access strobe (CAS)

49 Internal organization of DRAM [Figure] Dual inline memory modules (DIMMs) contain 4-16 DRAM chips and are 8 bytes wide.

51 Improving DRAM performance
- Multiple accesses to the same row (using the row buffer as a cache)
- Synchronous DRAM (SDRAM)
  - Added clock to DRAM interface
  - Burst mode with critical word first
- Wider buses (16-bit bus in DDR2 and DDR3)
- Double data rate (DDR): transfer data on both edges of the clock
- Multiple banks on each DRAM device (up to 8 banks in DDR3)

52 DDR evolution
- DDR2
  - Lower power (2.5 V -> 1.8 V)
  - Higher clock rates (266 MHz, 333 MHz, 400 MHz)
- DDR3
  - 1.5 V
  - 800 MHz
- DDR4
  - 1-1.2 V
  - 1600 MHz
- GDRAM (Graphics DRAM)
  - GDDR5 (based on DDR3)
  - Wider bus: 32 bits
  - Higher clock rates through a soldered (direct) connection to the GPU
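The clock rates above translate into peak DIMM bandwidth once the double data rate and the 8-byte DIMM width mentioned earlier in the deck are factored in. The sketch below performs that multiplication; it gives theoretical peaks only, and the function name is illustrative.

```python
def dimm_peak_mb_per_s(clock_mhz, dimm_bytes=8):
    """Peak transfer rate: two transfers per clock (DDR) times the DIMM width in bytes."""
    transfers_per_us = clock_mhz * 2          # millions of transfers per second
    return transfers_per_us * dimm_bytes      # MB/s

ddr2_400 = dimm_peak_mb_per_s(400)
ddr3_800 = dimm_peak_mb_per_s(800)
ddr4_1600 = dimm_peak_mb_per_s(1600)
```

These peaks correspond to the numbers used in common DIMM names such as PC3-12800 for DDR3 at an 800 MHz clock.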

53 Reducing power in SDRAMs Lower voltage Low power mode (ignores clock, continues to refresh) Takes 200 clocks to recover from low power mode Memory Technology

54 Flash Memory
- Type of EEPROM
- Backup storage in PMDs
- Must be erased (in blocks) before being overwritten
- Nonvolatile
- Limited number of write cycles, typically 100,000
- Cheaper than SDRAM, more expensive than disk: $2/GB for flash, $30/GB for SDRAM, $0.09/GB for hard disk
- Slower than SDRAM (about 1/4 the speed), faster than disk (about 1000 times)
- May be a replacement for disks

55 Memory Dependability
- Memory is susceptible to cosmic rays
- Soft errors: dynamic errors
  - Detected by parity (1 bit per byte)
  - Detected and fixed by error correcting codes (ECC) (1 byte per 8 bytes)
- Hard errors: permanent errors
  - Use spare rows to replace defective rows
- Chipkill: a RAID-like error recovery technique
  - Distributes data and ECC information across memory chips
  - Used in very large cluster computers such as Google's
- IBM analysis of a 10,000-processor server with 4 GB of memory per processor, over 3 years:
  - Parity: 90,000 unrecoverable failures, about 1 every 17 minutes
  - ECC: 3,500 unrecoverable failures, about 1 every 7.5 hours
  - Chipkill: 6 unrecoverable failures, about 1 every 6 months

56 Protection via Virtual Memory
- Keeps processes in their own memory space
- Role of architecture:
  - Provide user mode and supervisor mode
  - Protect certain aspects of CPU state: user/supervisor mode bit, exception enable/disable, memory protection information
  - Provide mechanisms for switching between user mode and supervisor mode (system calls)
  - Provide mechanisms to limit memory access between processes
  - Provide a TLB to translate addresses and implement protection
- Protection scheme: per-page permissions for read/write and code execution

57 Protection via Virtual Machines
- Supports isolation and security
  - Motivated by the low security and reliability of standard OSes
- Sharing a computer among many unrelated users
- Enabled by the raw speed of processors, making the overhead more acceptable
- Allows different ISAs and operating systems to be presented to user programs
  - "System Virtual Machines": IBM VM/370, VMware ESX, Xen
  - SVM software is called a "virtual machine monitor" (VMM) or "hypervisor"
  - Individual virtual machines that run under the monitor are called "guest VMs"
- The VMM is much smaller than an OS, which gives better security
- The guest OS runs as a user process
- The IBM 370 architecture is virtualizable, but the 80x86 and most RISC architectures are not

58 Impact of VMs on Virtual Memory
- Each guest OS maintains its own set of page tables
- The VMM adds a level of memory between physical and virtual memory called "real memory"
- The VMM maintains a shadow page table that maps guest virtual addresses directly to physical addresses
  - The shadow page table is updated only by the VMM
  - Requires the VMM to detect the guest's changes to its own page table
  - Occurs naturally if accessing the page table pointer is a privileged operation
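The three address spaces (guest virtual, guest real, host physical) can be sketched with two dictionaries and their composition. This is a toy model under stated assumptions: page tables are flat dictionaries, the page size is 4 KB, the page numbers are invented, and the shadow table is taken to map guest virtual pages straight to physical pages, rebuilt by the VMM whenever the guest's table changes.

```python
PAGE_SIZE = 4096   # assumed page size for the example

def build_shadow(guest_page_table, real_to_physical):
    """Compose guest virtual -> guest real with guest real -> physical."""
    return {vpn: real_to_physical[rpn]
            for vpn, rpn in guest_page_table.items()
            if rpn in real_to_physical}

def translate(shadow, virtual_addr):
    """Translate a guest virtual address using only the shadow page table."""
    vpn, offset = divmod(virtual_addr, PAGE_SIZE)
    return shadow[vpn] * PAGE_SIZE + offset

guest_page_table = {0: 7, 1: 3}      # maintained by the guest OS (hypothetical values)
real_to_physical = {7: 42, 3: 19}    # maintained by the VMM (hypothetical values)

shadow = build_shadow(guest_page_table, real_to_physical)
physical = translate(shadow, 1 * PAGE_SIZE + 5)

# The guest remaps virtual page 1; the VMM must notice and refresh the shadow table.
guest_page_table[1] = 7
shadow_after_update = build_shadow(guest_page_table, real_to_physical)
```

Rebuilding the whole table on every change is a simplification; the point is only that the hardware walks the shadow table, so it must track the guest's edits.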

59 Crosscutting issues
- Protection & ISA
  - The ISA should be changed to reduce the cost of virtualization
- Coherency of cached data
  - I/O cache coherency
    - Hardware or software checking of I/O addresses
    - Invalidating the cached data if I/O is performed
  - Multi-processor cache coherency
    - Will be covered soon

60 The ARM Cortex-A8
- ISA: ARMv7
- Core technology: IP (Intellectual Property)
  - Dominant core in embedded devices and PMDs
  - Can incorporate application-specific processors (e.g. video) + I/O + memory IP cores
- 2 instructions per clock (1 GHz max.)
- Hard core: not configurable, better performance
- Soft core: reconfigurable, bigger die size

61 ARM Cortex-A8 memory organization
- First level cache: 16 KB or 32 KB
  - Separate I & D caches in the first level
  - 4-way set associative with way prediction
  - Virtually indexed and physically tagged
- Second level cache (optional): 128 KB to 1 MB
  - 8-way set associative
  - 1-4 banks
  - Physically indexed and tagged
- 64-bit or 128-bit memory bus
  - The primary input pin A64n128 of the processor determines the width
- 64 byte block size

63 ARM Cortex-A8 memory organization
- 2 TLBs (for I & D)
  - Fully associative
  - 32 entries
  - Variable page size (4 KB, 16 KB, 64 KB, 1 MB, 16 MB)
[Figure: 32 bit virtual address translation (32 KB L1, 1 MB L2, 16 KB page)]

65 Intel Core-i7
- 64 bit extension of 80x86 ISA: x86-64
- 4 cores
- 4 instructions per clock per core
- 16 stage pipeline
- 2 simultaneous threads per core
- 3.3 GHz: about 50 billion instructions per second at peak (4 cores × 4 instructions × 3.3e9 = 52.8e9)
- 3 memory channels: 25 GB/s
- 48 bit virtual address, 36 bit physical address: up to 64 GB of physical memory (2^36 bytes)

67 3rd Generation Intel Core i7 [Figure: four cores (Core 0 to Core 3), each with L1 I and D caches (I: 32 KB 4-way, D: 32 KB 4-way) and a 256 KB L2 cache, backed by an 8 MB L3 cache]

68 Core-i7 memory hierarchy

69 Core-i7 memory hierarchy
- Miss penalty for the critical instructions: 135 clocks
- Miss penalty for the entire block: 135 + 45 = 180 clocks
- Write-back mechanism
- 10-entry merging write buffers between L3 and L2 and between L2 and L1
- Prefetching of the next block is supported in L1 and L2

70 Performance of i7 L1 cache

71 Performance of i7 L2, L3 caches
