1
Memory Hierarchy Instructor: Adam C. Champion, Ph.D.
CSE 2431: Introduction to Operating Systems Reading: Chap. 6, [CSAPP] OSU CSE 2431
2
Motivation Up to this point we have relied on a simple model of a computer system: a CPU with a simple memory that holds instructions and data for the CPU. In reality, a computer system contains a hierarchy of storage devices with different costs, capacities, and access times. With a memory hierarchy, a faster storage device at one level of the hierarchy acts as a staging area for a slower storage device at the next lower level. Well-written software takes advantage of the hierarchy, accessing the faster storage device at a particular level more frequently than the storage at the next level. As a programmer, you can use an understanding of the memory hierarchy to write applications with better performance. OSU CSE 2431
3
Outline Storage Technologies Locality Memory Hierarchy Cache Memories
Writing Cache-friendly Code Impact of Caches on Program Performance OSU CSE 2431
4
Storage Technologies Random-Access Memory Disk Storage
Solid State Disks Storage Technology Trends OSU CSE 2431
5
Random-Access Memory (RAM)
Features Basic storage unit is usually a cell (one bit per cell) RAM is traditionally packaged as a chip Multiple chips form memory Static RAM (SRAM) Each cell implemented with a six-transistor circuit Holds value as long as power is maintained: volatile Insensitive to disturbances such as electrical noise, radiation, etc. Faster and more expensive than DRAM Dynamic RAM (DRAM) Each bit stored as charge on a capacitor Value must be refreshed every 10–100 msec: volatile Sensitive to disturbances Slower and cheaper than SRAM OSU CSE 2431
6
SRAM vs DRAM
RAM Type | Trans./Bit | Access Time | Needs Refresh? | Sensitive? | Cost | Applications
SRAM     | 4 or 6     | 1×          | No             | No         | 100× | Cache memories
DRAM     | 1          | 10×         | Yes            | Yes        | 1×   | Main memories, frame buffers
OSU CSE 2431
7
Conventional DRAM Organization
A d × w DRAM stores d·w total bits, organized as d supercells of w bits each. Example: a 16 × 8 DRAM chip has 16 supercells of 8 bits each, arranged as a 4 × 4 array of rows and columns (0–3); supercell (2,1) sits at row 2, column 1. The memory controller (to/from the CPU) sends row and column addresses over 2 address pins and transfers data over 8 data pins; an internal row buffer holds the currently selected row. OSU CSE 2431
8
Reading DRAM Supercell (2,1) (1)
Step 1(a): The row access strobe (RAS) request selects row 2. Step 1(b): Row 2 is copied from the DRAM array into the internal row buffer. (Figure: the memory controller sends RAS = 2 over the 2-bit address lines of the 16 × 8 DRAM chip.) OSU CSE 2431
9
Reading DRAM Supercell (2,1) (2)
Step 2(a): The column access strobe (CAS) request selects column 1. Step 2(b): Supercell (2,1) is copied from the internal row buffer to the 8-bit data lines, and eventually travels back to the CPU. (Figure: the memory controller sends CAS = 1; the chip returns supercell (2,1).) OSU CSE 2431
10
Memory Modules
DRAM chips are packaged in memory modules that plug into expansion slots on the main system board (motherboard). The example module stores a total of 64 MB using eight 64-Mbit 8M × 8 DRAM chips, numbered 0 to 7. Each supercell stores one byte of main memory, and each 64-bit doubleword at byte address A in main memory is represented by the eight supercells with the corresponding supercell address (i, j): DRAM 0 stores the first (lower-order) byte (bits 0–7), DRAM 1 stores the next byte (bits 8–15), and so on up to DRAM 7 (bits 56–63). To retrieve the 64-bit doubleword at memory address A, the memory controller converts A to a supercell address (i, j) and sends it to the memory module, which broadcasts i and j to each DRAM. Each DRAM outputs the 8-bit contents of its (i, j) supercell, and circuitry in the module collects these outputs into a 64-bit doubleword that it returns to the memory controller. OSU CSE 2431
11
Enhanced DRAMs Enhanced DRAMs have optimizations that improve the speed with which basic DRAM cells are accessed. Examples: Fast page mode DRAM (FPM DRAM) Extended data out DRAM (EDO DRAM) Synchronous DRAM (SDRAM) Double Data-Rate Synchronous DRAM (DDR SDRAM) Rambus DRAM (RDRAM) Video RAM (VRAM) OSU CSE 2431
12
Nonvolatile Memory (1) Features
Information retained if supply voltage is turned off Collectively referred to as read-only memories (ROM) although some may be written to as well as read Distinguishable by the number of times they can be reprogrammed (written to) and by the mechanism for reprogramming them Used for firmware programs (BIOS, controllers for disks, network cards, graphics accelerators, security subsystems…), solid state disks, disk caches OSU CSE 2431
13
Nonvolatile Memory (2) Read-only memory (ROM): programmed during production. Programmable ROM (PROM): a fuse associated with each cell is blown once by zapping it with current; can be programmed only once. Erasable PROM (EPROM): cells are cleared by shining ultraviolet light on them, and a special device is used to write 1s; can be erased and reprogrammed about 1,000 times. Electrically erasable PROM (EEPROM): similar to EPROM, but does not require a physically separate programming device and can be reprogrammed in place on printed circuit cards; can be reprogrammed about 100,000 times. Flash memory: based on EEPROM technology; wears out after about 100,000 repeated writes. OSU CSE 2431
14
Traditional Bus Structure Connecting CPU and Memory
A bus is a collection of parallel wires that carry address, data, and control signals. Buses are typically shared by multiple devices. The figure shows an example configuration: the CPU chip (register file, ALU, and bus interface), a chipset called the I/O bridge (which includes the memory controller), and the DRAM memory modules that make up main memory. The system bus connects the CPU's bus interface to the I/O bridge, and the memory bus connects the I/O bridge to main memory; the I/O bridge translates the electrical signals of the system bus into the electrical signals of the memory bus. OSU CSE 2431
15
Memory Read Transaction (1)
CPU places address A on the memory bus. Load operation: movl A, %eax (the contents of address A are loaded into register %eax). The bus interface initiates the read transaction on the bus. (Figure: main memory holds word x at address A.) OSU CSE 2431
16
Memory Read Transaction (2)
Main memory reads address A from the memory bus, retrieves word x, and places x on the bus. (Load operation: movl A, %eax.) OSU CSE 2431
17
Memory Read Transaction (3)
CPU reads word x from the bus and copies it into register %eax. (Load operation: movl A, %eax.) OSU CSE 2431
18
Memory Write Transaction (1)
CPU places address A on the bus. Main memory reads it and waits for the corresponding data word to arrive. Store operation: movl %eax, A (the contents of register %eax, here the word y, are written to address A). OSU CSE 2431
19
Memory Write Transaction (2)
CPU places data word y on the bus. (Store operation: movl %eax, A.) OSU CSE 2431
20
Memory Write Transaction (3)
Main memory reads data word y from the bus and stores it at address A. (Store operation: movl %eax, A.) OSU CSE 2431
21
Disk Storage Disks hold enormous amounts of data: on the order of hundreds to thousands of gigabytes, compared to hundreds to thousands of megabytes in memory. Disks are slower than RAM-based memory: it takes on the order of milliseconds to read information from a disk, roughly a hundred thousand times longer than reading from DRAM and a million times longer than reading from SRAM. OSU CSE 2431
22
Anatomy of A Disk Drive Spindle Arm Platters Actuator Electronics
(The electronics include a processor and memory!) The term disk drive refers to the entire assembly containing the disk platters and other components; such drives are sometimes referred to as rotating disks to distinguish them from solid state disks. SCSI connector. Image courtesy of Seagate Technology. OSU CSE 2431
23
Disk Geometry Disks consist of platters, each with two surfaces.
Each surface consists of concentric rings called tracks. Each track consists of sectors separated by gaps. (Figure labels: spindle, surface, track k, sectors, gaps.) OSU CSE 2431
24
Disk Geometry (Multiple-Platter View)
Aligned tracks across the surfaces form a cylinder. (Figure: cylinder k spans surfaces 0 and 1 of platter 0, surfaces 2 and 3 of platter 1, and surfaces 4 and 5 of platter 2, all rotating about the spindle.) OSU CSE 2431
25
Disk Capacity (1) Capacity is defined as the maximum number of bits that can be recorded on a disk. It is determined by the following factors: Recording density (bits/in): the number of bits on a 1-inch segment of a track. Track density (tracks/in): the number of tracks on a 1-inch segment of radius extending from the center of the platter. Areal density (bits/in²): the product of the recording density and the track density. OSU CSE 2431
26
Disk Capacity (2) Determination of areal density:
Original disks partitioned every track into the same number of sectors, determined by the innermost track; this resulted in sectors being spaced farther apart on the outer tracks. Modern disks partition the tracks into disjoint subsets called recording zones. Each track within a zone has the same number of sectors, determined by the innermost track of that zone, and each zone has a different number of sectors/track. OSU CSE 2431
27
Computing Disk Capacity
Capacity = (# bytes/sector) × (avg. # sectors/track) × (# tracks/surface) × (# surfaces/platter) × (# platters/disk). Example: 512 bytes/sector, an average of 300 sectors/track, 20,000 tracks/surface, 2 surfaces/platter, and 5 platters/disk. Capacity = 512 × 300 × 20,000 × 2 × 5 = 30,720,000,000 bytes = 30.72 GB. OSU CSE 2431
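As a quick sanity check, here is a minimal C sketch of the same computation; the numbers simply mirror the example above and are not fixed constants.

#include <stdio.h>

int main(void) {
    long long bytes_per_sector     = 512;
    long long sectors_per_track    = 300;     /* average over all zones */
    long long tracks_per_surface   = 20000;
    long long surfaces_per_platter = 2;
    long long platters_per_disk    = 5;

    long long capacity = bytes_per_sector * sectors_per_track *
                         tracks_per_surface * surfaces_per_platter *
                         platters_per_disk;

    /* Prints: Capacity = 30720000000 bytes = 30.72 GB (GB = 10^9 bytes here) */
    printf("Capacity = %lld bytes = %.2f GB\n", capacity, capacity / 1e9);
    return 0;
}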
28
Disk Operation (Single-Platter View)
The disk surface spins at a fixed rotational rate around the spindle. By moving radially, the arm can position the read/write head over any track. The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air. OSU CSE 2431
29
Disk Operation (Multi-Platter View)
Read/write heads move in unison from cylinder to cylinder. (Figure labels: arm, spindle.) OSU CSE 2431
30
Disk Structure: Top View of Single Platter
Surface organized into tracks Tracks divided into sectors OSU CSE 2431
31
Disk Access (1) Head in position above a track OSU CSE 2431
32
Disk Access (2) Rotation is counter-clockwise OSU CSE 2431
33
Disk Access: Read (1.1) About to read blue sector OSU CSE 2431
34
Disk Access: Read (1.2) After reading the blue sector
OSU CSE 2431
35
Disk Access: Read (1.3) Red request scheduled next (timeline so far: after BLUE read)
OSU CSE 2431
36
Disk Access: Seek Seek to red's track (timeline: after BLUE read, seek for RED)
OSU CSE 2431
37
Disk Access: Rotational Latency
Wait for the red sector to rotate around under the head (timeline: after BLUE read, seek for RED, rotational latency). OSU CSE 2431
38
Disk Access: Read (2.1) Complete the read of the red sector
(timeline: after BLUE read, seek for RED, rotational latency, after RED read) OSU CSE 2431
39
Disk Access: Service Time Components
The time to access a sector is the sum of the seek time, the rotational latency, and the data transfer time. (Timeline: after BLUE read, seek for RED, rotational latency, RED data transfer.) OSU CSE 2431
40
Calculating Access Time (1)
Average access time for a sector: Taccess = Tavg_seek + Tavg_rotation + Tavg_transfer. Seek time (Tavg_seek): the time to position the heads over the cylinder. Typical Tavg_seek is 3–9 ms; the maximum can be as high as 20 ms. Rotational latency (Tavg_rotation): once the head is positioned over the track, the time it takes for the first bit of the sector to pass under the head. In the worst case, the head just misses the sector and waits for the disk to make a full rotation: Tmax_rotation = (1/RPM) × (60 secs/1 min). The average case is half the worst case: Tavg_rotation = (1/2) × (1/RPM) × (60 secs/1 min). A typical rotational rate is 7,200 RPM. OSU CSE 2431
41
Calculating Access Time (2)
Transfer time (Tavg_transfer): the time to read the bits in the sector; it depends on the rotational speed and the number of sectors per track. Estimate of the average transfer time: Tavg_transfer = (1/RPM) × (1/(avg. # sectors/track)) × (60 secs/1 min). Example: rotational rate = 7,200 RPM, average seek time = 9 ms, avg. # sectors/track = 400. Tavg_rotation = (1/2) × (60 secs/7,200 RPM) × (1,000 ms/sec) = 4 ms. Tavg_transfer = (60 secs/7,200 RPM) × (1/400 sectors/track) × (1,000 ms/sec) ≈ 0.02 ms. Taccess = 9 ms + 4 ms + 0.02 ms ≈ 13 ms. OSU CSE 2431
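The same estimate, written as a small C sketch; the parameter values are just the example's and can be changed.

#include <stdio.h>

int main(void) {
    double rpm               = 7200.0;
    double avg_seek_ms       = 9.0;
    double sectors_per_track = 400.0;

    double t_rotation_ms = 0.5 * (60.0 / rpm) * 1000.0;                       /* ~4 ms    */
    double t_transfer_ms = (60.0 / rpm) * (1.0 / sectors_per_track) * 1000.0; /* ~0.02 ms */
    double t_access_ms   = avg_seek_ms + t_rotation_ms + t_transfer_ms;       /* ~13 ms   */

    printf("T_rotation = %.2f ms, T_transfer = %.3f ms, T_access = %.2f ms\n",
           t_rotation_ms, t_transfer_ms, t_access_ms);
    return 0;
}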
42
Access Time SRAM: 256 ns DRAM: 5000 ns Disk: 10 ms
The time to access the 512 bytes in a disk sector is dominated by the seek time (9 ms) and the rotational latency (4 ms); getting to the sector takes a long time, but transferring the bits is essentially free. Since the seek time and the rotational latency are roughly the same (at least the same order of magnitude), doubling the seek time is a reasonable estimate of the access time. Comparison of access times of various storage devices when reading a comparable 512-byte block: SRAM: 256 ns; DRAM: 5,000 ns; disk: 10 ms. Disk is about 40,000 times slower than SRAM and about 2,500 times slower than DRAM. OSU CSE 2431
43
Formatted Disk Capacity
Logical disk blocks: Although modern disks have complex geometries, they present a simpler abstract view as a sequence of B sector-sized logical blocks, numbered 0, 1, 2, …, B – 1. The disk controller maintains the mapping between logical blocks and actual (physical) disk sectors, converting a request for a logical block into a (surface, track, sector) triple with a fast table lookup. Formatted disk capacity: Before disks can be used for the first time, they must be formatted by the disk controller. Formatting fills the gaps between sectors with information that identifies the sectors, finds surface defects, and sets aside cylinders to be used as spares. The formatted capacity is therefore less than the maximum capacity. OSU CSE 2431
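To make the idea of the mapping concrete, here is a sketch of converting a logical block number into a (surface, track, sector) triple under an idealized uniform geometry. Real controllers use firmware table lookups over zoned geometries with spare sectors, so the function and geometry below are purely illustrative assumptions.

#include <stdio.h>

typedef struct { int surface, track, sector; } disk_addr_t;

/* Idealized mapping: blocks are laid out cylinder by cylinder,
   surface by surface within a cylinder, sector by sector within a track. */
static disk_addr_t block_to_geometry(long block, int sectors_per_track, int surfaces) {
    long sectors_per_cyl = (long)sectors_per_track * surfaces;
    disk_addr_t a;
    a.track   = (int)(block / sectors_per_cyl);              /* cylinder number */
    a.surface = (int)((block % sectors_per_cyl) / sectors_per_track);
    a.sector  = (int)(block % sectors_per_track);
    return a;
}

int main(void) {
    disk_addr_t a = block_to_geometry(123456, 300, 10);      /* made-up geometry */
    printf("surface %d, track %d, sector %d\n", a.surface, a.track, a.sector);
    return 0;
}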
44
Connecting I/O Devices
I/O devices such as disks, graphics cards, monitors, mice, and keyboards are connected to the CPU and main memory using an I/O bus. Unlike the system bus and memory bus which are CPU specific, the I/O bus is independent of the underlying CPU. The I/O bus is slower than the system and memory buses but can accommodate a wide variety of third-party I/O devices. For instance, USB, graphics card or adapter, host bus adapter (SCSI/SATA). Network adapters can be connected to the I/O bus by plugging the adapter into an empty expansion slot on the motherboard. OSU CSE 2431
45
I/O Bus
(Figure: the CPU chip, with register file, ALU, and bus interface, connects over the system bus to the I/O bridge; the memory bus connects the I/O bridge to main memory. The I/O bridge also connects to the I/O bus, which hosts a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.) OSU CSE 2431
46
Reading a Disk Sector (1)
The CPU initiates a disk read by writing a command, a logical block number, and a destination memory address to a port (address) associated with the disk controller. OSU CSE 2431
47
Reading a Disk Sector (2)
The disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory. OSU CSE 2431
48
Reading a Disk Sector (3)
When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., it asserts a special "interrupt" pin on the CPU). OSU CSE 2431
49
Solid State Disks (SSDs)
An SSD package plugs into a standard disk slot on the I/O bus (typically USB or SATA) and behaves like a disk, servicing requests to read and write logical disk blocks. It consists of one or more flash memory chips plus a flash translation layer (a hardware/firmware device) that plays the same role as a disk controller. Flash memory is organized as blocks of pages: pages are 512 B to 4 KB, and a block consists of 32 to 128 pages. Data are read and written in units of pages. A page can be written only after its entire block has been erased, and a block wears out after about 100,000 repeated writes. OSU CSE 2431
50
SSD Performance Characteristics
Sequential read throughput: 550 MB/s; sequential write throughput: 470 MB/s. Random read throughput: 365 MB/s; random write throughput: 303 MB/s. Random read access: 50 μs; random write access: 60 μs. (Source: Intel SSD 730 product specification.) Why are random writes so slow? Erasing a block is slow (around 1 ms), and a write to a page triggers a copy of all useful pages in the block: find an unused block (the new block) and erase it, write the page into the new block, and copy the other pages from the old block to the new block. OSU CSE 2431
51
Advantages of SSDs over Rotating Disks
No moving parts (semiconductor memory), so they are more rugged; much faster random access times; use less power. Disadvantages of SSDs over rotating disks: SSDs wear out with usage, and they are more expensive per byte than disks. OSU CSE 2431
52
Storage Technology Trends
SRAM
Metric      | 1985  | 1990 | 1995 | 2000 | 2005 | 2010 | 2015 | 2015:1985
$/MB        | 2,900 | 320  | 256  | 100  | 75   | 60   | 25   | 116
Access (ns) | 150   | 35   | 15   | 3    | 2    | 1.5  | 1.3  | 115

DRAM
Metric            | 1985  | 1990 | 1995 | 2000 | 2005  | 2010  | 2015   | 2015:1985
$/MB              | 880   | 100  | 30   | 1    | 0.1   | 0.06  | 0.02   | 44,000
Access (ns)       | 200   | 100  | 70   | 60   | 50    | 40    | 20     | 10
Typical size (MB) | 0.256 | 4    | 16   | 64   | 2,000 | 8,000 | 16,000 | 62,500

Disk
Metric            | 1985    | 1990  | 1995 | 2000 | 2005 | 2010  | 2015  | 2015:1985
$/GB              | 100,000 | 8,000 | 300  | 10   | 5    | 0.3   | 0.03  | 3,333,333
Access (ms)       | 75      | 28    | 10   | 8    | 5    | 3     | 3     | 25
Typical size (GB) | 0.01    | 0.16  | 1    | 20   | 160  | 1,500 | 3,000 | 300,000

Different storage technologies have different price and performance trade-offs. SRAM is somewhat faster than DRAM, and DRAM is much faster than disk; on the other hand, faster storage is always more expensive than slower storage. The price and performance properties of the different technologies are also changing at dramatically different rates. Since 1985, the cost and performance of SRAM have improved at roughly the same rate: access times and cost per megabyte have both decreased by a factor of a little over 100. The trends for DRAM and disk are much more dramatic and divergent: DRAM cost per megabyte has decreased by a factor of 44,000 (more than four orders of magnitude), while DRAM access times have decreased only by a factor of about 10; disk cost per gigabyte has decreased by a factor of more than 3,000,000 (more than six orders of magnitude), while disk access times have improved much more slowly, only by a factor of about 25. Conclusion: it is much easier to increase density (reducing cost) than to decrease access time. OSU CSE 2431
53
CPU Trends Inflection point in computer history
when designers hit the "Power Wall"
Metric                    | 1985  | 1990  | 1995    | 2000  | 2003 | 2005   | 2010        | 2015        | 2015:1985
CPU                       | 80286 | 80386 | Pentium | P-III | P-4  | Core 2 | Core i7 (N) | Core i7 (H) | —
Clock rate (MHz)          | 6     | 20    | 150     | 600   | 3,300| 2,000  | 2,500       | 3,000       | 500
Cycle time (ns)           | 166   | 50    | 6       | 1.6   | 0.30 | 0.50   | 0.40        | 0.33        | 500
Cores                     | 1     | 1     | 1       | 1     | 1    | 2      | 4           | 4           | 4
Effective cycle time (ns) | 166   | 50    | 6       | 1.6   | 0.30 | 0.25   | 0.10        | 0.08        | 2,075

DRAM and disk performance are lagging behind CPU performance. Looking at the effective cycle time (the cycle time of an individual processor divided by the number of its cores), the improvement from 1985 to 2015 is a factor of about 2,075. 2003 was the inflection point: the "power wall" means computer manufacturers discovered they could no longer increase CPU clock frequencies as quickly, because the chips would then consume too much power. The solution was to replace a single large processor with multiple smaller processor cores, each a complete processor capable of executing programs independently and in parallel with the other cores. From this point on, computers get faster not because the clock frequency increases but because the number of cores in each processor increases. Around 2003, system designers also reached a limit on the exploitation of instruction-level parallelism (ILP) in sequential programs; since then, processor clock speed has not greatly increased, and we have multicore CPUs instead. (N) indicates Intel's Nehalem architecture; (H) indicates Intel's Haswell architecture. OSU CSE 2431
54
The CPU-Memory Gap The gap widens between DRAM, disk, and CPU speeds.
(Figure: access and cycle times for disk, SSD, DRAM, and the CPU over time.) The gap between DRAM and CPU performance keeps widening. Modern computers make heavy use of SRAM-based caches to try to bridge this processor-memory gap. The approach works because of a fundamental property of application programs known as locality. OSU CSE 2431
55
Outline Storage Technologies Locality Memory Hierarchy Cache Memories
Writing Cache-friendly Code Impact of Caches on Program Performance OSU CSE 2431
56
Locality Principle of Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently Temporal locality: Recently referenced items are likely to be referenced again in the near future Spatial locality: Items with nearby addresses tend to be referenced close together in time Principle of locality has an enormous impact on the design and performance of hardware and software systems. In general, programs with good locality run faster than programs with poor locality. This principle is used by all levels of a modern computer system: hardware, OSes, and application programs. OSU CSE 2431
57
Locality Example Data references Instruction references
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;
Data references: the array elements a[i] are referenced in succession (a stride-1 reference pattern): spatial locality. The variable sum is referenced in each iteration: temporal locality. Instruction references: the instructions are referenced in sequence: spatial locality. The loop is cycled through repeatedly: temporal locality. Note that sum is a scalar, so it has no spatial locality, only good temporal locality; a has good spatial locality but poor temporal locality, since each element is accessed only once. Because each item has either good spatial or good temporal locality, we conclude that the function enjoys good locality. OSU CSE 2431
58
Locality of Reference to Program Data (1)
Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer. Question: Does this function have good locality with respect to array a? int sum_array_rows(int a[M][N]) { int i, j, sum = 0; for (i = 0; i < M; i++) for (j = 0; j < N; j++) sum += a[i][j]; return sum; } The doubly nested loop reads the elements of the array in row-major order which is the same order that the array is stored. Stride-1 reference pattern with excellent spatial locality. Sum has temporal locality. OSU CSE 2431
59
Locality of Reference to Program Data (2)
Question: Does this function have good locality with respect to array a? int sum_array_cols(int a[M][N]) { int i, j, sum = 0; for (j = 0; j < N; j++) for (i = 0; i < M; i++) sum += a[i][j]; return sum; } Computes same result as previous example but i and j loops have been interchanged. Poor spatial locality, array is referenced in column order instead of row order. Stride-N reference pattern. OSU CSE 2431
60
Locality of Instruction Fetches
Question: Does this function have good locality with respect to instructions?
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Yes: the instructions in the body of the for loop are executed in sequential memory order, so the function has good spatial locality with respect to instruction fetches, and because the loop body is executed repeatedly, it also has good temporal locality. OSU CSE 2431
61
Summary of Locality Simple rules for evaluating the locality in a program: Programs that repeatedly reference the same variables enjoy good temporal locality. For programs with stride-k reference patterns, smaller strides yield better spatial locality. Programs with stride-1 reference patterns have good spatial locality. Programs that hop around memory with large strides have poor spatial locality. Loops have good temporal and spatial locality with respect to instruction fetches. The smaller the loop body and the greater the number of loop iterations, the better the locality. OSU CSE 2431
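To make the stride rule concrete, here is a small sketch (not part of the slides) that computes the same sum with different strides; the array contents and size are arbitrary, and only the order of the accesses, and hence the spatial locality, changes.

#include <stdio.h>

/* Sum the n ints in a using accesses of the given stride; multiple passes
   (one per starting offset) cover every element exactly once. */
static int sum_stride(const int *a, int n, int stride) {
    int sum = 0;
    for (int start = 0; start < stride; start++)
        for (int i = start; i < n; i += stride)
            sum += a[i];
    return sum;
}

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++)
        a[i] = i;
    /* Same result either way; the stride-1 version has better spatial locality. */
    printf("stride 1: %d\n", sum_stride(a, 1024, 1));
    printf("stride 8: %d\n", sum_stride(a, 1024, 8));
    return 0;
}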
62
Outline Storage Technologies Locality Memory Hierarchy Cache Memories
Writing Cache-friendly Code Impact of Caches on Program Performance OSU CSE 2431
63
The Memory Hierarchy Fundamental properties of storage technology and computer software: Storage technology: different storage technologies have widely different access times; faster technologies cost more per byte than slower ones and have less capacity; and the gap between CPU and main memory speed is widening. Computer software: well-written programs tend to exhibit good locality. The complementary nature of these properties suggests an approach for organizing memory systems, known as a memory hierarchy. OSU CSE 2431
64
An Example Memory Hierarchy
Registers: CPU registers hold words retrieved from the L1 cache. L1: L1 cache (SRAM) holds cache lines retrieved from the L2 cache. L2: L2 cache (SRAM) holds cache lines retrieved from main memory. L3: Main memory (DRAM) holds disk blocks retrieved from local disks. L4: Local secondary storage (local disks) holds files retrieved from disks on remote network servers. L5: Remote secondary storage (tapes, distributed file systems, Web servers). Moving toward the top of the hierarchy, storage is smaller, faster, and costlier per byte; moving toward the bottom, it is larger, slower, and cheaper per byte. OSU CSE 2431
65
Caching in the Memory Hierarchy
A cache is a small, fast storage device that acts as a staging area for the data objects stored in a larger, slower device. The central idea is that for each level k in the memory hierarchy, the faster and smaller storage device at level k serves as a cache for the larger and slower storage device at level k + 1. If a program finds a needed data object from level k + 1 in level k, we have a cache hit; otherwise we have a cache miss, and the data must be brought into level k from level k + 1. OSU CSE 2431
66
General Cache Concepts
The cache is a smaller, faster, more expensive memory that holds copies of a subset of the blocks (in the figure, blocks 8, 9, 14, and 3, with blocks 4 and 10 in transit). The larger, slower, cheaper memory below it is viewed as partitioned into fixed-size "blocks" (numbered 0–15 in the figure). Data are copied between the two levels in block-sized transfer units. OSU CSE 2431
67
General Cache Concepts: Hit
Request: 14. The data in block b is needed, and block b (block 14) is in the cache: hit! OSU CSE 2431
68
General Cache Concepts: Miss
Request: 12. The data in block b is needed, but block b (block 12) is not in the cache: miss! Block b is fetched from memory and stored in the cache. The placement policy determines where b goes; the replacement policy determines which block gets evicted (the victim). OSU CSE 2431
69
Kinds of Cache Misses
Cold miss: the cache at level k is empty. This is a temporary situation that resolves itself as repeated accesses cause the cache to 'warm up'. Conflict miss: most caches restrict the blocks from level k + 1 to a small subset (possibly just one) of the positions at level k; for instance, block i might be restricted to position (i mod 4). A conflict miss occurs when the cache at level k is large enough but the needed blocks all map to the same position; for instance, blocks 0, 4, 8, 12, 16, … all map to position 0 under (i mod 4). Capacity miss: the set of active blocks at level k + 1 (the working set) is larger than the cache. A small illustration of the conflict case follows. OSU CSE 2431
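A minimal sketch (not part of the slides) of the (i mod 4) placement described above, showing why references that alternate between blocks 0 and 8 miss every time even though the cache has room elsewhere.

#include <stdio.h>

int main(void) {
    int positions = 4;                   /* block i is placed at position i mod 4 */
    int refs[] = {0, 8, 0, 8, 0, 8};     /* alternating references to blocks 0 and 8 */
    int n = sizeof(refs) / sizeof(refs[0]);

    /* Blocks 0 and 8 both map to position 0, so each reference evicts the other:
       every access is a conflict miss. */
    for (int i = 0; i < n; i++)
        printf("block %d -> position %d\n", refs[i], refs[i] % positions);
    return 0;
}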
70
Examples of Caching in the Hierarchy
Cache Type           | What is Cached?      | Where is it Cached?  | Latency (cycles) | Managed By
Registers            | 4–8 byte words       | CPU core             | 0                | Compiler
TLB                  | Address translations | On-chip TLB          | 0                | Hardware
L1 cache             | 64-byte blocks       | On-chip L1           | 1                | Hardware
L2 cache             | 64-byte blocks       | On/off-chip L2       | 10               | Hardware
Virtual memory       | 4-KB pages           | Main memory          | 100              | Hardware + OS
Buffer cache         | Parts of files       | Main memory          | 100              | OS
Disk cache           | Disk sectors         | Disk controller      | 100,000          | Disk firmware
Network buffer cache | Parts of files       | Local disk           | 10,000,000       | AFS/NFS client
Browser cache        | Web pages            | Local disk           | 10,000,000       | Web browser
Web cache            | Web pages            | Remote server disks  | 1,000,000,000    | Web proxy server
Acronyms: TLB: translation lookaside buffer, MMU: memory management unit, AFS: Andrew File System, NFS: Network File System OSU CSE 2431
71
Outline Storage Technologies Locality Memory Hierarchy Cache Memories
Writing Cache-friendly Code Impact of Caches on Program Performance OSU CSE 2431
72
Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. They hold frequently accessed blocks of main memory; the CPU looks first for data in the caches (e.g., L1, L2, and L3), then in main memory. Typical system structure: the cache memories sit on the CPU chip between the register file/ALU and the bus interface, which connects through the system bus, I/O bridge, and memory bus to main memory. OSU CSE 2431
73
General Cache Organization (S, E, B)
A cache is organized as S = 2^s sets; each set contains E = 2^e lines; and each line holds a block of B = 2^b data bytes, plus a valid bit and tag bits. The cache size C is stated in terms of the aggregate size of all the data blocks; the tag and valid bits are not included: C = S × E × B data bytes. The valid bit indicates whether the line contains meaningful information. The tag bits (a subset of the bits of the block's memory address) uniquely identify the block stored in the cache line. OSU CSE 2431
74
Steps of a Cache Request
Given a request for the word w, the address of w is used to determine the following: Set selection: determine the set within the cache. Line matching: determine the line within that set. Word extraction: extract the word from the cache and return it to the CPU. OSU CSE 2431
75
An ‘Aside’ : Some Terminology
Block: a fixed-size packet of information that moves back and forth between a cache and main memory (or a lower-level cache). Line: a container in a cache that stores a block, as well as other information such as the valid bit and the tag bits. Set: a collection of one or more lines. Sets in direct-mapped caches consist of a single line; sets in set-associative and fully associative caches consist of multiple lines. OSU CSE 2431
76
Cache Read Locate set Check if any line in set has matching tag
‘Yes’ + line valid: hit; then locate the data starting at the offset. Address of word (m bits total): t tag bits, then s set-index bits, then b block-offset bits. The parameters S and B induce this partitioning of the m address bits into the three fields. The s set-index bits form an index into the array of S sets (the first set is set 0, the second is set 1, and so on); they tell us which set the word must be stored in. Once we know the set, the t tag bits tell us which line (if any) in that set contains the word: a line in the set contains the word iff its valid bit is set and its tag bits match the tag bits in the address. Once we have located the line identified by the tag in the set identified by the set index, the b block-offset bits give the offset of the word within the B = 2^b-byte data block. OSU CSE 2431
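As a sketch of this partitioning (an assumed helper, not from the slides), the three fields can be extracted with shifts and masks when S and B are powers of two:

#include <stdint.h>
#include <stdio.h>

typedef struct { uint64_t tag; unsigned set; unsigned offset; } cache_addr_t;

/* Split an address into (tag, set index, block offset) given S = 2^s sets
   and B = 2^b bytes per block; s and b are the field widths in bits. */
static cache_addr_t split_address(uint64_t addr, unsigned s, unsigned b) {
    cache_addr_t f;
    f.offset = addr & ((1u << b) - 1);          /* low b bits          */
    f.set    = (addr >> b) & ((1u << s) - 1);   /* next s bits         */
    f.tag    = addr >> (s + b);                 /* remaining high bits */
    return f;
}

int main(void) {
    /* Example: 64 sets (s = 6) and 64-byte blocks (b = 6). */
    cache_addr_t f = split_address(0x12345678u, 6, 6);
    printf("tag = 0x%llx, set = %u, offset = %u\n",
           (unsigned long long)f.tag, f.set, f.offset);
    return 0;
}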
77
Example: Direct Mapped Cache (E = 1) (1)
Direct mapped: one line per set. Assume the cache block size is 8 bytes. Caches are grouped into different classes based on E, the number of cache lines per set. A cache with exactly one line per set (E = 1) is known as a direct-mapped cache; direct-mapped caches are the simplest both to implement and to understand. In this example, the set-index bits of the address (t bits | 0…01 | 100) select set 1. There is only one line in the set, so that line is checked to see whether its valid bit is set and its tag matches the tag in the address for w. Once we have a hit, we know w is somewhere in the block; the final step uses the block-offset bits to determine where the desired word starts in the block. Here, the offset 100 means the copy of w starts at byte 4 of the block (assuming words are 4 bytes long). OSU CSE 2431
78
Example: Direct Mapped Cache (E = 1) (2)
Direct mapped: one line per set. Assume the cache block size is 8 bytes. Check the line: valid? + tag match? Assume yes = hit; the block offset then selects the word within the block. See the notes on the previous slide. OSU CSE 2431
79
Example: Direct Mapped Cache (E = 1) (3)
Direct mapped: one line per set. Assume the cache block size is 8 bytes. If there is no match, the old line is evicted and replaced: the cache retrieves the requested block from the next level of memory and stores it in the cache line of the set indicated by the set-index bits. For a direct-mapped cache the replacement policy is trivial: the current line is simply replaced by the newly fetched line. On a hit, the block offset locates the int (4 bytes) within the block. OSU CSE 2431
80
Direct-Mapped Cache Simulation
M = 16 byte addresses (4-bit addresses), B = 2 bytes/block, S = 4 sets, E = 1 block/set; the address splits into 1 tag bit, 2 set-index bits, and 1 block-offset bit (b = 1). Address trace (reads, one byte per read): 0 [0000₂], 1 [0001₂], 7 [0111₂], 8 [1000₂], 0 [0000₂], giving miss, hit, miss, miss, miss. Set 0 successively holds M[0-1], then M[8-9] (address 8 evicts it), then M[0-1] again (the final access to 0 evicts M[8-9]); set 3 holds M[6-7]; sets 1 and 2 stay empty. OSU CSE 2431
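A tiny C sketch (not part of the slides) that replays this trace through such a direct-mapped cache and reproduces the miss/hit pattern:

#include <stdio.h>

int main(void) {
    int valid[4] = {0}, tag[4] = {0};     /* S = 4 sets, E = 1 line per set */
    int trace[]  = {0, 1, 7, 8, 0};       /* 4-bit byte addresses           */
    int n = sizeof(trace) / sizeof(trace[0]);

    for (int i = 0; i < n; i++) {
        int addr = trace[i];
        int set  = (addr >> 1) & 0x3;     /* skip 1 offset bit, take 2 set bits */
        int t    = addr >> 3;             /* remaining high bit is the tag      */
        int hit  = valid[set] && tag[set] == t;
        printf("addr %2d -> set %d: %s\n", addr, set, hit ? "hit" : "miss");
        valid[set] = 1;                   /* on a miss, load the block */
        tag[set]   = t;
    }
    return 0;                             /* prints miss, hit, miss, miss, miss */
}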
81
Higher-Level Example Ignore the variables sum, i, j
Assume a cold (empty) cache and 32-byte cache blocks (32 B = 4 doubles), with a[0][0] mapping to the first set.
int sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}
int sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}
(Worked out on the blackboard in class.) OSU CSE 2431
82
E-way Set Associative Cache (E = 2) (1)
E = 2: two lines per set. Assume the cache block size is 8 bytes. A set-associative cache allows more than one line per set; in this example each set has two lines. Set selection works as before: the set-index bits of the address (t bits | 0…01 | 100) identify the set. OSU CSE 2431
83
E-way Set Associative Cache (E = 2) (2)
E = 2: two lines per set. Assume the cache block size is 8 bytes. Compare both lines in the set: valid? + tag match? Yes = hit; the block offset then selects the word. Line matching is more involved than in a direct-mapped cache because the cache must check the tags and valid bits of multiple lines to determine whether the requested word is in the set. Any line in the set can contain any of the memory blocks that map to that set, so the cache must search each line in the set for a valid line whose tag matches the tag in the address. If it finds one, we have a hit, and the block offset selects the word from the block as before. OSU CSE 2431
84
E-way Set Associative Cache (E = 2) (3)
E = 2: two lines per set. Assume the cache block size is 8 bytes. On a miss, once the cache has retrieved the block, which line should it replace? If no line in the set matches, one line in the set is selected for eviction and replacement. Various replacement policies are possible: random, least recently used (LRU), and so on. On a hit, the block offset locates the short int (2 bytes) within the block. OSU CSE 2431
85
2-Way Set Associative Cache Simulation
M = 16 byte addresses (4-bit addresses), B = 2 bytes/block, S = 2 sets, E = 2 blocks/set; the address splits into 2 tag bits, 1 set-index bit, and 1 block-offset bit (b = 1). Address trace (reads, one byte per read): 0 [0000₂], 1 [0001₂], 7 [0111₂], 8 [1000₂], 0 [0000₂], giving miss, hit, miss, miss, hit. Final contents: set 0 holds M[0-1] (tag 00) and M[8-9] (tag 10); set 1 holds M[6-7] (tag 01). OSU CSE 2431
86
A Higher Level Example Ignore the variables sum, i, j
Assume a cold (empty) cache and 32-byte cache blocks (32 B = 4 doubles), with a[0][0] mapping to the first set.
int sum_array_rows(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++)
            sum += a[i][j];
    return sum;
}
int sum_array_cols(double a[16][16])
{
    int i, j;
    double sum = 0;
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++)
            sum += a[i][j];
    return sum;
}
(Worked out on the blackboard in class.) OSU CSE 2431
87
How do we handle writes? Hit
What to do on a write hit? Write-through: write immediately to memory. Write-back: wait, and write to memory only when the line is replaced (this needs a dirty bit to record whether the line differs from memory). What to do on a write miss? Write-allocate: load the block into the cache and update the line in the cache (good if more writes to the location follow). No-write-allocate: write immediately to memory, bypassing the cache. Typical combinations: write-through + no-write-allocate, or write-back + write-allocate. OSU CSE 2431
88
Intel Core i7 Cache Hierarchy
Processor package with Core 0 through Core 3. Each core has its own registers, L1 i-cache and d-cache (32 KB, 8-way, access: 4 cycles), and L2 unified cache (256 KB, 8-way, access: 11 cycles). The L3 unified cache (8 MB, 16-way, access: 30–40 cycles) is shared by all cores and sits in front of main memory. Block size: 64 bytes for all caches. OSU CSE 2431
89
Performance Metrics Cache performance is evaluated with a number of metrics: Miss rate: the fraction of memory references during the execution of a program, or part of a program, that miss (# misses / # references) = 1 – hit rate; usually 3–10% for L1 and < 1% for L2. Hit rate: the fraction of memory references that hit = 1 – miss rate. Hit time: the time to deliver a word in the cache to the CPU, including the time for set selection, line identification, and word selection; several clock cycles for L1, 5–20 cycles for L2. Miss penalty: any additional time required because of a miss; the penalty for an L1 miss served from L2 is ~10 cycles, from L3 ~40 cycles, and from main memory ~100 cycles. OSU CSE 2431
90
Some Insights… (1) 99% Hits is Twice as Good as 97%:
Consider a hit time of 1 cycle and a miss penalty of 100 cycles. Average access time: 97% hits: 1 cycle + 0.03 × 100 cycles = 4 cycles; 99% hits: 1 cycle + 0.01 × 100 cycles = 2 cycles. Impact of cache size: on one hand, a larger cache will tend to increase the hit rate; on the other hand, it is harder to make large memories run fast, so larger caches tend to increase the hit time. Impact of block size: larger blocks can help increase the hit rate by exploiting spatial locality; however, larger blocks imply fewer cache lines, which can hurt the hit rate for programs with more temporal locality than spatial locality. Blocks are usually 32–64 bytes. OSU CSE 2431
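The 97% vs. 99% comparison is just the average-access-time formula; a minimal sketch with the values taken from the example above:

#include <stdio.h>

/* Average access time = hit time + miss rate * miss penalty. */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    printf("97%% hits: %.0f cycles\n", amat(1.0, 0.03, 100.0));  /* 4 cycles */
    printf("99%% hits: %.0f cycles\n", amat(1.0, 0.01, 100.0));  /* 2 cycles */
    return 0;
}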
91
Some Insights… (2) Impact of Associativity:
Higher associativity (larger values of E) decreases the vulnerability of the cache to thrashing due to conflict misses. However, higher associativity is expensive to implement and hard to make fast: it requires more tag bits per line, additional LRU state bits per line, and additional control logic, so it can increase the hit time and increase the miss penalty because of the added complexity. There is a trade-off between hit time and miss penalty. Intel Core i7 systems: L1 and L2 are 8-way, L3 is 16-way. OSU CSE 2431
92
Outline Storage Technologies Locality Memory Hierarchy Cache Memories
Writing Cache-friendly Code Impact of Caches on Program Performance OSU CSE 2431
93
Writing Cache-Friendly Code
Programs with better locality tend to have lower miss rates and programs with lower miss rates will tend to run faster than programs with higher miss rates. Good programmers should always try to write code that is cache friendly, in the sense that it has good locality. OSU CSE 2431
94
Approach to Cache Friendly Code
Make the common case go fast. Programs often spend most of their time in a few core functions. These functions often spend most of their time in a few loops. So focus on the inner loops of the core function and ignore the rest. Minimize the number of cache misses for each inner loop. Good programmers should always try to write code that is cache friendly, in the sense that it has good locality. OSU CSE 2431
95
Outline Storage Technologies Locality Memory Hierarchy Cache Memories
Writing Cache-friendly Code Putting it Together: Impact of Caches on Program Performance OSU CSE 2431
96
Putting it Together: Impact of Caches on Program Performance
The memory mountain Rearranging loops to improve spatial locality Using blocking to improve temporal locality OSU CSE 2431
97
The Memory Mountain Every computer has a unique memory mountain that characterizes the capabilities of its memory. Read throughput (read bandwidth): Number of bytes read from memory per second (MB/s) Memory Mountain: Measured read throughput as a function of spatial and temporal locality. Compact way to characterize memory system performance. OSU CSE 2431
98
Memory Mountain Test Function
/* The test function */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;
    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);
    test(elems, stride);                     /* Warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);  /* Estimate cycles for test(elems, stride) */
    return (size / stride) / (cycles / Mhz); /* Convert cycles to MB/s */
}

The memory mountain functions measure the read throughput for a particular read sequence. The test function generates a read sequence by scanning the first elems elements of an array with a stride of stride. The run function is a wrapper that calls the test function and returns the measured read throughput: the first call to test warms the cache, and fcyc2 then calls test with arguments elems and stride and estimates its running time in CPU cycles. (Note that the size argument is in units of bytes, while the corresponding elems argument is in units of array elements.) The throughput is computed in MB/s, with 1 MB = 10^6 bytes rather than 2^20 bytes. The size and stride arguments control the degree of temporal and spatial locality in the resulting read sequence: smaller values of size yield a smaller working set and thus better temporal locality, and smaller values of stride yield better spatial locality. Calling run repeatedly with different values of size and stride recovers a fascinating two-dimensional function of read throughput versus temporal and spatial locality, called a memory mountain. OSU CSE 2431
99
The Memory Mountain Core i7 Haswell 2.1 GHz 32 KB L1 d-cache
256 KB L2 cache, 8 MB L3 cache, 64 B block size. (Figure: the measured memory mountain, with ridges of temporal locality where the working set fits in L1, L2, L3, or main memory, slopes of spatial locality as the stride grows, and a high region at small strides where aggressive prefetching keeps throughput up.) OSU CSE 2431
100
Memory Mountain Summary
The performance of the memory system is not characterized by a single number. Instead, it is a mountain of temporal and spatial locality whose elevations can vary by more than a factor of 10. Wise programmers try to structure their programs so that they run on the peaks instead of in the valleys: the aim is to exploit temporal locality so that heavily used words are fetched from the L1 cache, and to exploit spatial locality so that as many words as possible are accessed from a single L1 cache line. OSU CSE 2431
101
Programming Example: Matrix Multiplication
Consider the problem of multiplying a pair of N × N matrices: C = AB. A matrix multiplying function is usually implemented using three nested loops, which are identified with indexes i, j, and k. If we permute the loops and make some minor code changes, we can create six functionally equivalent versions. Each version is uniquely identified by the ordering of its loops. OSU CSE 2431
102
Miss Rate Analysis (Matrix Multiplication)
Assume: line size = 32 bytes (big enough for four 64-bit words); the matrix dimension N is very large, so approximate 1/N as 0.0; the cache is not even big enough to hold multiple rows. Analysis method: look at the access pattern of the inner loop. (Figure: A is accessed along row i by index k, B along column j by index k, and C at the fixed element (i, j).) OSU CSE 2431
103
Matrix Multiplication Example
Variable sum is held in a register. Description: multiply N × N matrices; O(N³) total operations; N reads per source element; N values summed per destination element (the running sum may be held in a register).
/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
OSU CSE 2431
104
Layout of C Arrays in Memory (review)
C arrays are allocated in row-major order: each row occupies contiguous memory locations. Stepping through the columns in one row, for (i = 0; i < N; i++) sum += a[0][i];, accesses successive elements; if the block size B > 4 bytes, this exploits spatial locality, and the compulsory miss rate is 4 bytes / B. Stepping through the rows in one column, for (i = 0; i < n; i++) sum += a[i][0];, accesses distant elements, so there is no spatial locality and the compulsory miss rate is 1 (i.e., 100%). OSU CSE 2431
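The row-major rule can also be read off the addresses themselves; a small sketch (the 8-column array and int element type are just an example):

#include <stdio.h>

/* In row-major order, a[i][j] lives at byte offset (i*N + j) * sizeof(int),
   so moving along a row is a 4-byte stride while moving down a column is an
   N*4-byte stride. */
#define N 8

int main(void) {
    int a[N][N];
    char *base = (char *)&a[0][0];
    printf("offset of a[0][1]: %ld bytes\n", (long)((char *)&a[0][1] - base)); /* 4  */
    printf("offset of a[1][0]: %ld bytes\n", (long)((char *)&a[1][0] - base)); /* 32 */
    return 0;
}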
105
Matrix Multiplication (ijk)
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Inner loop: A is scanned row-wise as a[i][*], B is scanned column-wise as b[*][j], and C is fixed at (i, j). Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0. OSU CSE 2431
106
Matrix Multiplication (jik)
for (j = 0; j < n; j++) {
    for (i = 0; i < n; i++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Inner loop: A is scanned row-wise as a[i][*], B is scanned column-wise as b[*][j], and C is fixed at (i, j). Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0. OSU CSE 2431
107
Matrix Multiplication (kij)
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}
Inner loop: A is fixed at (i, k), B is scanned row-wise as b[k][*], and C is scanned row-wise as c[i][*]. Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25. OSU CSE 2431
108
Matrix Multiplication (ikj)
for (i = 0; i < n; i++) {
    for (k = 0; k < n; k++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}
Inner loop: A is fixed at (i, k), B is scanned row-wise as b[k][*], and C is scanned row-wise as c[i][*]. Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25. OSU CSE 2431
109
Matrix Multiplication (jki)
/* jki */
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
Inner loop: A is scanned column-wise as a[*][k], B is fixed at (k, j), and C is scanned column-wise as c[*][j]. Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0. OSU CSE 2431
110
Matrix Multiplication (kji)
for (k = 0; k < n; k++) {
    for (j = 0; j < n; j++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
Inner loop: A is scanned column-wise as a[*][k], B is fixed at (k, j), and C is scanned column-wise as c[*][j]. Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0. OSU CSE 2431
111
Summary of Matrix Multiplication
ijk (and jik): 2 loads, 0 stores; misses/iter = 1.25.
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
kij (and ikj): 2 loads, 1 store; misses/iter = 0.5.
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}
jki (and kji): 2 loads, 1 store; misses/iter = 2.0.
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
OSU CSE 2431
112
Core i7 Matrix Multiply Performance
(Figure: cycles per inner-loop iteration versus matrix size n; the jki/kji versions are the slowest, ijk/jik are in the middle, and kij/ikj are the fastest.) OSU CSE 2431
113
Example: Matrix Multiplication
c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}
(Figure: c = a * b; row i of a times column j of b produces element (i, j) of c.) OSU CSE 2431
114
Cache Miss Analysis (1)
Assume: matrix elements are doubles; cache block = 8 doubles; cache size C ≪ n (much smaller than n). First iteration (computing one element of c): scanning a row of a costs n/8 misses and scanning a column of b costs n misses, so n/8 + n = 9n/8 misses. Afterwards in cache (schematic): the row of a and 8-wide slivers of each column of b. OSU CSE 2431
115
Cache Miss Analysis (2)
Assume: matrix elements are doubles; cache block = 8 doubles; cache size C ≪ n (much smaller than n). Second iteration: again n/8 + n = 9n/8 misses. Total misses: 9n/8 × n² = (9/8) n³. OSU CSE 2431
116
Blocked Matrix Multiplication
c = (double *) calloc(sizeof(double), n*n);

/* Multiply n x n matrices a and b */
void mmm(double *a, double *b, double *c, int n)
{
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += B)
        for (j = 0; j < n; j += B)
            for (k = 0; k < n; k += B)
                /* B x B mini matrix multiplications */
                for (i1 = i; i1 < i+B; i1++)
                    for (j1 = j; j1 < j+B; j1++)
                        for (k1 = k; k1 < k+B; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}
(Figure: a B × B block of c is updated by the product of a B × B block of a and a B × B block of b.) OSU CSE 2431
117
Cache Miss Analysis (1)
Assume: cache block = 8 doubles; cache size C ≪ n (much smaller than n); three blocks fit into the cache: 3B² < C. First (block) iteration: B²/8 misses for each block touched; 2n/B blocks of a and b are touched, so 2n/B × B²/8 = nB/4 misses (omitting matrix C). Afterwards in cache (schematic): the current B × B blocks; there are n/B blocks along each dimension. OSU CSE 2431
118
Cache Miss Analysis (2)
Assume: cache block = 8 doubles; cache size C ≪ n (much smaller than n); three blocks fit into the cache: 3B² < C. Second (block) iteration: same as the first, 2n/B × B²/8 = nB/4 misses. Total misses: nB/4 × (n/B)² = n³/(4B). OSU CSE 2431
119
Summary
No blocking: (9/8) n³ misses. Blocking: n³/(4B) misses. This suggests using the largest possible block size B, subject to the limit 3B² < C (see the sketch below). The reason for the dramatic difference: matrix multiplication has inherent temporal locality. The input data are only 3n² in size, while the computation is 2n³ operations, so every array element is used O(n) times; but the program has to be written properly to exploit this. OSU CSE 2431
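A minimal sketch (not from the slides) of picking the largest block dimension B, in doubles, that keeps three B × B blocks of doubles inside a cache of a given size; real code would typically also round B to a multiple of the cache line.

#include <stdio.h>

/* Largest B such that 3 * B * B * sizeof(double) fits in cache_bytes. */
static int pick_block_size(long cache_bytes) {
    int B = 1;
    while (3L * (B + 1) * (B + 1) * (long)sizeof(double) <= cache_bytes)
        B++;
    return B;
}

int main(void) {
    printf("32 KB cache:  B = %d\n", pick_block_size(32L * 1024));   /* 36  */
    printf("256 KB cache: B = %d\n", pick_block_size(256L * 1024));  /* 104 */
    return 0;
}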
120
Concluding Observations
Programmer can optimize for cache performance How data structures are organized How data are accessed Nested loop structure Blocking is a general technique All systems favor “cache friendly code” Getting absolute optimum performance is very platform specific Cache sizes, line sizes, associativities, etc. Can get most of the advantage with generic code Keep working set reasonably small (temporal locality) Use small strides (spatial locality) OSU CSE 2431