
1 Cache Memory

2 Outline
General concepts
Three ways to organize cache memory
Issues with writes
Writing cache-friendly code
Cache mountain
Suggested Reading: 6.4, 6.5, 6.6

3 6.4 Cache Memories

4 Cache Memory History
–At the very beginning, 3 levels: registers, main memory, disk storage
–10 years later, 4 levels: registers, SRAM cache, main DRAM memory, disk storage
–Modern processors, 4-5 levels: registers, SRAM L1, L2 (, L3) caches, main DRAM memory, disk storage
–Cache memories are small, fast SRAM-based memories that are managed automatically by hardware and can be on-chip, on-die, or off-chip

5 Cache Memory
Figure 6.24 (P488): the L1 cache sits on the CPU chip next to the register file and ALU; the L2 cache connects to the CPU chip over the cache bus; the bus interface connects to the I/O bridge over the system bus, and the I/O bridge reaches main memory over the memory bus.

6 Cache Memory
L1 cache is on-chip
L2 cache was off-chip until several years ago
L3 cache can be off-chip or on-chip
The CPU looks for data first in L1, then in L2, then in main memory
–Caches hold frequently accessed blocks of main memory

7 Inserting an L1 cache between the CPU and main memory
The tiny, very fast CPU register file has room for four 4-byte words.
The small, fast L1 cache has room for two 4-word blocks (lines 0 and 1).
The big, slow main memory has room for many 4-word blocks (in the figure, block 10 holds a b c d; other blocks hold p q r s and w x y z).
The transfer unit between the cache and main memory is a 4-word block (16 bytes).
The transfer unit between the CPU register file and the cache is a 4-byte word.

8 Generic Cache Memory Organization
Figure 6.25 (P488): a cache is an array of S = 2^s sets; each set contains E lines; each line holds a block of B = 2^b bytes of data, plus 1 valid bit and t tag bits.
Cache is an array of sets. Each set contains one or more lines. Each line holds a block of data.

9 Addressing caches
Figure 6.25 (P488): an m-bit address A is divided into t tag bits, s set index bits, and b block offset bits.
The word at address A is in the cache if the tag bits in one of the valid lines in the set selected by the set index bits match the tag bits of A.
The word contents begin at the block offset bytes from the beginning of the block.

10 Cache Memory
Fundamental parameters:
  S = 2^s       Number of sets
  E             Number of lines per set
  B = 2^b       Block size (bytes)
  m = log2(M)   Number of physical (main memory) address bits

11 Cache Memory
Derived quantities:
  M = 2^m           Maximum number of unique memory addresses
  s = log2(S)       Number of set index bits
  b = log2(B)       Number of block offset bits
  t = m - (s + b)   Number of tag bits
  C = B x E x S     Cache size (bytes), not including overhead such as the valid and tag bits
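These parameters determine how an address is partitioned (as on slide 9). The following is a minimal sketch, not from the slides, with hypothetical helper names; it extracts the tag, set index, and block offset with shifts and masks:

#include <stdio.h>

/* Extract the fields of an address, given s set-index bits and
   b block-offset bits (parameter names follow the tables above). */
unsigned tag_of(unsigned addr, int s, int b) { return addr >> (s + b); }
unsigned set_of(unsigned addr, int s, int b) { return (addr >> b) & ((1u << s) - 1); }
unsigned off_of(unsigned addr, int b)        { return addr & ((1u << b) - 1); }

int main(void)
{
    /* Using the parameters of the simulation on a later slide:
       m = 4, s = 2, b = 1, so t = 1. Address 13 = 1101 in binary. */
    printf("tag=%u set=%u offset=%u\n",
           tag_of(13, 2, 1), set_of(13, 2, 1), off_of(13, 1));
    /* prints: tag=1 set=2 offset=1 */
    return 0;
}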

12 Direct-mapped cache
Figure 6.27 (P490): the simplest kind of cache, characterized by exactly one line per set (E = 1); each line has a valid bit, a tag, and a cache block.

13 Accessing direct-mapped caches
Set selection
–Use the set index bits to determine the set of interest (Figure 6.28, P491)

14 Accessing direct-mapped caches
Line matching and word selection
–find a valid line in the selected set with a matching tag (line matching)
–then extract the word (word selection)

15 Accessing direct-mapped caches
Figure 6.29 (P491):
(1) The valid bit must be set
(2) The tag bits in the cache line must match the tag bits in the address
(3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte of the word (w0..w3 in the figure)

16 Line Replacement on Misses in Direct-Mapped Caches
If the cache misses
–Retrieve the requested block from the next level in the memory hierarchy
–Store the new block in one of the cache lines of the set indicated by the set index bits

17 Line Replacement on Misses in Direct-Mapped Caches
If the set is full of valid cache lines
–One of the existing lines must be evicted
For a direct-mapped cache
–Each set contains only one line
–The current line is replaced by the newly fetched line

18 Direct-mapped cache simulation (P492)
M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 entry/set

19 Direct-mapped cache simulation (P493)
M = 16 byte addresses, B = 2 bytes/block, S = 4 sets, E = 1 entry/set
Each 4-bit address splits as t = 1 tag bit, s = 2 set index bits, b = 1 offset bit
Address trace (reads): 0 [0000], 1 [0001], 13 [1101], 8 [1000], 0 [0000]
(1) 0 [0000] (miss): set 0 loads v=1, tag=0, data m[1] m[0]; the following read of 1 [0001] then hits
(2) 13 [1101] (miss): set 2 loads v=1, tag=1, data m[13] m[12]
(3) 8 [1000] (miss): set 0 is replaced with tag=1, data m[9] m[8]
(4) 0 [0000] (miss): set 0 is replaced again with tag=0, data m[1] m[0]
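The whole trace can be checked in software; the following is an illustrative sketch, not part of the slides, using the same parameters (S = 4, E = 1, B = 2, m = 4, so t = 1, s = 2, b = 1):

#include <stdio.h>

#define S 4   /* sets; E = 1 line per set (direct-mapped) */

int main(void)
{
    int valid[S] = {0};                     /* one valid bit per line */
    unsigned tags[S] = {0};                 /* one tag bit per line */
    unsigned trace[] = {0, 1, 13, 8, 0};    /* the read trace above */

    for (int i = 0; i < 5; i++) {
        unsigned addr = trace[i];
        unsigned set = (addr >> 1) & 0x3;   /* s = 2 index bits */
        unsigned tag = addr >> 3;           /* t = 1 tag bit */
        if (valid[set] && tags[set] == tag) {
            printf("%2u [hit]\n", addr);
        } else {                            /* miss: fetch block, replace line */
            printf("%2u [miss]\n", addr);
            valid[set] = 1;
            tags[set] = tag;
        }
    }
    return 0;   /* prints: miss, hit, miss, miss, miss */
}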

20 Direct-mapped cache simulation
Figure 6.30 (P493): table listing, for each 4-bit address, the decimal address, tag bits (t=1), index bits (s=2), offset bits (b=1), and block number (decimal).

21 Why use middle bits as index?
High-Order Bit Indexing
–Adjacent memory lines would map to the same cache entry
–Poor use of spatial locality
Middle-Order Bit Indexing
–Consecutive memory lines map to different cache lines
–Can hold a C-byte region of the address space in the cache at one time
(Figure 6.31, P497: a 4-line cache under high-order vs. middle-order bit indexing)
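To see the difference concretely, here is a small sketch, not from the slides, that maps consecutive block numbers to sets both ways for the 4-line cache of Figure 6.31:

#include <stdio.h>

/* 4-line cache => 2 index bits; block numbers are 4 bits wide here. */
int main(void)
{
    for (unsigned blk = 0; blk < 8; blk++) {
        unsigned middle = blk & 0x3;         /* low 2 bits of the block number */
        unsigned high   = (blk >> 2) & 0x3;  /* high-order bits of the block number */
        printf("block %u -> set %u (middle-order), set %u (high-order)\n",
               blk, middle, high);
    }
    return 0;
}

Under middle-order indexing, blocks 0-3 land in sets 0, 1, 2, 3; under high-order indexing they all land in set 0, so a sequential scan thrashes a single line.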

22 Set associative caches
Characterized by more than one line per set
(Figure 6.32, P498 shows E = 2 lines per set; each line has a valid bit, a tag, and a cache block)

23 Accessing set associative caches
Set selection
–identical to direct-mapped cache: the set index bits select the set of interest (Figure 6.33, P498)

24 Accessing set associative caches
Line matching and word selection
–must compare the tag in each valid line in the selected set
(Figure 6.34, P499):
(1) The valid bit must be set
(2) The tag bits in one of the cache lines must match the tag bits in the address
(3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte
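Line matching here generalizes the direct-mapped check to a loop over the E lines of the selected set. A minimal sketch, not from the slides; the struct layout and the lookup function are illustrative assumptions:

#include <stddef.h>

#define E 2   /* lines per set, as in Figure 6.32 */
#define B 8   /* assumed block size in bytes, for illustration */

struct line { int valid; unsigned tag; unsigned char block[B]; };

/* Return a pointer to the requested byte on a hit, NULL on a miss.
   A hit requires some valid line in the set whose tag matches;
   the block offset then selects the starting byte. */
unsigned char *lookup(struct line set[E], unsigned tag, unsigned offset)
{
    for (int i = 0; i < E; i++)
        if (set[i].valid && set[i].tag == tag)
            return &set[i].block[offset];
    return NULL;   /* miss: a line must be fetched and possibly evicted */
}

With E = 1 this degenerates to the direct-mapped check of Figure 6.29; with a single set of E = C/B lines it becomes the fully associative cache of the next slide.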

25 Fully associative caches
Characterized by a single set containing all of the lines (E = C/B lines in the one and only set)
No set index bits in the address: it is divided into t tag bits and b block offset bits only
(Figures 6.35 and 6.36, P500)

26 Accessing fully associative caches
Word selection
–must compare the tag in each valid line
(Figure 6.37, P500):
(1) The valid bit must be set
(2) The tag bits in one of the cache lines must match the tag bits in the address
(3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte

27 Issues with Writes
Write hits
–Write-through
  Cache updates its copy
  Immediately writes the corresponding cache block to memory
–Write-back
  Defers the memory update as long as possible
  Writes the updated block to memory only when it is evicted from the cache
  Maintains a dirty bit for each cache line

28 Issues with Writes
Write misses
–Write-allocate
  Loads the corresponding memory block into the cache
  Then updates the cache block
–No-write-allocate
  Bypasses the cache
  Writes the word directly to memory
Common combinations (see the sketch below)
–Write-through, no-write-allocate
–Write-back, write-allocate
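As a rough illustration of the two combinations, here is a hedged sketch, assumptions only: a single cache line stands in for a whole cache, and printf stands in for bus traffic:

#include <stdio.h>

struct line { int valid, dirty; unsigned tag; unsigned data; };

/* Write-back + write-allocate: on a miss, write back the old block
   only if dirty, fetch the new block, then update the cache only. */
void write_back_allocate(struct line *l, unsigned tag, unsigned v)
{
    if (!(l->valid && l->tag == tag)) {               /* write miss */
        if (l->valid && l->dirty)
            printf("write back old block (tag %u)\n", l->tag);
        printf("fetch block (tag %u)\n", tag);
        l->valid = 1; l->tag = tag;
    }
    l->data = v;
    l->dirty = 1;        /* memory update deferred until eviction */
}

/* Write-through + no-write-allocate: update the cache only on a hit,
   and always send the write to memory immediately. */
void write_through_no_allocate(struct line *l, unsigned tag, unsigned v)
{
    if (l->valid && l->tag == tag)                    /* write hit */
        l->data = v;
    printf("write word (tag %u) to memory\n", tag);   /* always */
}

int main(void)
{
    struct line l = {0, 0, 0, 0};
    write_back_allocate(&l, 5, 42);         /* miss: fetch, then cache-only write */
    write_back_allocate(&l, 5, 43);         /* hit: cache-only write */
    write_through_no_allocate(&l, 5, 44);   /* hit: cache + memory write */
    return 0;
}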

29 Multi-level caches
Figure 6.38 (P504): processor (regs, TLB, L1 I-cache, L1 D-cache), then L2 cache, main memory, and disk; each level is larger, slower, and cheaper, with larger line size, higher associativity, and more likely to write back.
  L1 caches:    8-64 KB, 3 ns, 32 B lines
  L2 cache:     1-4 MB SRAM, 6 ns, $100/MB, 32 B lines
  Main memory:  128 MB DRAM, 60 ns, $1.50/MB, 8 KB lines
  Disk:         30 GB, 8 ms, $0.05/MB
Options: separate data and instruction caches, or a unified cache

30 Cache performance metrics (P505)
Miss Rate
–fraction of memory references not found in the cache (misses/references)
–Typical numbers: 3-10% for L1
Hit Rate
–fraction of memory references found in the cache (1 - miss rate)

31 Cache performance metrics
Hit Time
–time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
–Typical numbers: 1-2 clock cycles for L1, … clock cycles for L2
Miss Penalty
–additional time required because of a miss
–Typically … cycles for main memory
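These metrics combine into an average memory access time: average access time = hit time + miss rate x miss penalty. As a worked example with purely illustrative numbers, not from the slides: a 1-cycle hit time, a 5% miss rate, and a 100-cycle miss penalty give 1 + 0.05 x 100 = 6 cycles per access, and halving the miss rate to 2.5% cuts that to 3.5 cycles. This is why reducing misses, rather than hit time, is the usual target of cache-friendly code.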

32 Cache performance metrics (P505)
1> Cache size
–Hit rate vs. hit time
2> Block size
–Spatial locality vs. temporal locality
3> Associativity
–Thrashing
–Cost
–Speed
–Miss penalty
4> Write strategy
–Simple, read misses, fewer transfers

33 6.5 Writing Cache-Friendly Code

34 Writing Cache-Friendly Code
Principles
–Programs with better locality will tend to have lower miss rates
–Programs with lower miss rates will tend to run faster than programs with higher miss rates

35 Writing Cache-Friendly Code
Basic approach
–Make the common case go fast
  Programs often spend most of their time in a few core functions, and these functions often spend most of their time in a few loops
–Minimize the number of cache misses in each inner loop
  All other things (such as the total number of loads and stores) being equal, programs with fewer misses run faster

36 Writing Cache-Friendly Code (P507)

int sumvec(int v[N])
{
    int i, sum = 0;
    for (i = 0; i < N; i++)
        sum += v[i];
    return sum;
}

Temporal locality: i and sum are referenced in every iteration, and these variables are usually put in registers by the compiler.
Access pattern for v[] (assuming 4-word blocks and a cold cache); access order with [h]it or [m]iss:
  v[i]:   i=0   i=1   i=2   i=3   i=4   i=5   i=6   i=7
  access: 1[m]  2[h]  3[h]  4[h]  5[m]  6[h]  7[h]  8[h]

37 Writing cache-friendly code
Temporal locality
–Repeated references to local variables are good because the compiler can cache them in the register file

38 Writing cache-friendly code
Spatial locality
–Stride-1 reference patterns are good because caches at all levels of the memory hierarchy store data as contiguous blocks
Spatial locality is especially important in programs that operate on multidimensional arrays

39 Writing cache-friendly code (P508)
Example (M=4, N=8, 10 cycles/iteration): row-major traversal

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

40 Writing cache-friendly code
Hit/miss pattern for a[i][j] in row-major order (access number, [h]it/[m]iss):
  a[i][j]  j=0    j=1    j=2    j=3    j=4    j=5    j=6    j=7
  i=0      1[m]   2[h]   3[h]   4[h]   5[m]   6[h]   7[h]   8[h]
  i=1      9[m]   10[h]  11[h]  12[h]  13[m]  14[h]  15[h]  16[h]
  i=2      17[m]  18[h]  19[h]  20[h]  21[m]  22[h]  23[h]  24[h]
  i=3      25[m]  26[h]  27[h]  28[h]  29[m]  30[h]  31[h]  32[h]

41 Writing cache-friendly code (P508)
Example (M=4, N=8, 20 cycles/iteration): column-major traversal

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

42 Writing cache-friendly code
Hit/miss pattern for a[i][j] in column-major order (every access misses):
  a[i][j]  j=0    j=1    j=2    j=3    j=4    j=5    j=6    j=7
  i=0      1[m]   5[m]   9[m]   13[m]  17[m]  21[m]  25[m]  29[m]
  i=1      2[m]   6[m]   10[m]  14[m]  18[m]  22[m]  26[m]  30[m]
  i=2      3[m]   7[m]   11[m]  15[m]  19[m]  23[m]  27[m]  31[m]
  i=3      4[m]   8[m]   12[m]  16[m]  20[m]  24[m]  28[m]  32[m]
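Counting the two tables quantifies the gap (assuming, as in the figures, 4-word blocks and a cold cache): the row-major loop misses on 8 of 32 accesses, a 25% miss rate, while the column-major loop misses on all 32, a 100% miss rate. That four-fold difference in misses is what doubles the measured cost from 10 to 20 cycles per iteration in these examples.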

43 Putting it Together: The Impact of Caches on Program Performance — The Memory Mountain

44 The Memory Mountain (P512)
Read throughput (read bandwidth)
–The rate at which a program reads data from the memory system
Memory mountain
–A two-dimensional function of read bandwidth versus temporal and spatial locality
–Characterizes the capabilities of the memory system for each computer

45 Memory mountain main routine (Figure 6.41, P513)

/* mountain.c - Generate the memory mountain. */
#define MINBYTES (1 << 10)  /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23)  /* ... up to 8 MB */
#define MAXSTRIDE 16        /* Strides range from 1 to 16 */
#define MAXELEMS MAXBYTES/sizeof(int)

int data[MAXELEMS];         /* The array we'll be traversing */

46 Memory mountain main routine

int main()
{
    int size;      /* Working set size (in bytes) */
    int stride;    /* Stride (in array elements) */
    double Mhz;    /* Clock frequency */

    init_data(data, MAXELEMS);  /* Initialize each element in data to 1 */
    Mhz = mhz(0);               /* Estimate the clock frequency */

47 Memory mountain main routine

    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));
        printf("\n");
    }
    exit(0);
}

48 Memory mountain test function (Figure 6.40, P512)

/* The test function */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;  /* So compiler doesn't optimize away the loop */
}

49 Memory mountain test function

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                      /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);   /* call test(elems, stride) */
    return (size / stride) / (cycles / Mhz);  /* convert cycles to MB/s */
}
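The return expression converts to MB/s as follows: the loop performs elems/stride = size/(4*stride) reads of 4 bytes each, i.e. size/stride bytes in total, and cycles/Mhz is the elapsed time in microseconds (Mhz counts clock ticks per microsecond), so (size/stride)/(cycles/Mhz) is bytes per microsecond, which equals MB/s.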

50 The Memory Mountain
Data
–Size: MAXBYTES (8 MB) bytes, or MAXELEMS (2M) words
–Partially accessed
  Working set: from 8 MB down to 1 KB
  Stride: from 1 to 16

51 The Memory Mountain Figure 6.42 P514

52 Ridges of temporal locality
Slice through the memory mountain with stride = 1
–illuminates the read throughputs of the different caches and of main memory

53 Ridges of temporal locality Figure 6.43 P515

54 A slope of spatial locality
Slice through the memory mountain with size = 256 KB
–shows the effect of the cache block size

55 A slope of spatial locality Figure 6.44 P516