Slide 1: Computer Systems – Cache characteristics
Arnoud Visser, University of Amsterdam

Slide 2: How do you know?

(ow133) cpuid
This system has a Genuine Intel(R) Pentium(R) 4 processor
Processor Family: F, Extended Family: 0, Model: 2, Stepping: 7
Pentium 4 core C1 (0.13 micron): core-speed 2 GHz (bus-speed 400/533 MHz)
Instruction TLB: 4K, 2M or 4M pages, fully associative, 128 entries
Data TLB: 4K or 4M pages, fully associative, 64 entries
1st-level data cache: 8K-bytes, 4-way set associative, sectored cache, 64-byte line size
No 2nd-level cache or, if processor contains a valid 2nd-level cache, no 3rd-level cache
Trace cache: 12K-uops, 8-way set associative
2nd-level cache: 512K-bytes, 8-way set associative, sectored cache, 64-byte line size

Slide 3: The CPUID instruction

asm("push %ebx; movl $2, %eax; cpuid;"
    "movl %eax, cache_eax; movl %ebx, cache_ebx;"
    "movl %edx, cache_edx; movl %ecx, cache_ecx;"
    "pop %ebx");
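For reference, a self-contained sketch that reads the same descriptor leaf (leaf 2) with the <cpuid.h> helper shipped by GCC and Clang, instead of raw inline assembly; decoding the descriptor bytes against Intel's table is left out. Assumes an x86/x86-64 target.

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 2 returns cache/TLB descriptor bytes in the four registers. */
    if (!__get_cpuid(2, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 2 not supported\n");
        return 1;
    }
    /* Each register packs four descriptor bytes; their meaning (cache
       size, associativity, line size, TLB entries) comes from Intel's
       descriptor table. */
    printf("eax=%08x ebx=%08x ecx=%08x edx=%08x\n", eax, ebx, ecx, edx);
    return 0;
}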

Slide 4: A limited number of answers

Slide 5: Direct-Mapped Cache
The simplest kind of cache, characterized by exactly one line per set (E = 1).
[Figure: sets 0 .. S-1, each holding a single line with a valid bit, a tag, and a cache block.]

Slide 6: Fully Associative Cache
Difficult and expensive to build one that is both large and fast.
[Figure: a single set (set 0) containing all E = C/B lines, each with a valid bit, a tag, and a cache block.]

Slide 7: E-way Set Associative Caches
Characterized by more than one line per set.
[Figure: sets 0 .. S-1 with E = 2 lines each; every line holds a valid bit, a tag, and a cache block.]
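To make the three organizations concrete, a minimal sketch of the data structures in C (the type and function names are ours, not the slides'): a cache is S sets of E lines each, so E = 1 gives a direct-mapped cache and S = 1 with E = C/B gives a fully associative one.

#include <stdint.h>
#include <stdlib.h>

#define B 64                      /* block size in bytes (assumption) */

typedef struct {                  /* one line: valid bit, tag, data block */
    int      valid;
    uint64_t tag;
    uint8_t  block[B];
} cache_line_t;

typedef struct {                  /* one set: E lines */
    cache_line_t *lines;
} cache_set_t;

typedef struct {                  /* whole cache: S sets of E lines */
    int S, E;
    cache_set_t *sets;
} cache_t;

cache_t *cache_new(int S, int E)  /* E = 1: direct-mapped; S = 1: fully associative */
{
    cache_t *c = malloc(sizeof *c);
    c->S = S;
    c->E = E;
    c->sets = calloc(S, sizeof *c->sets);
    for (int i = 0; i < S; i++)
        c->sets[i].lines = calloc(E, sizeof(cache_line_t));
    return c;
}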

Slide 8: Accessing Set Associative Caches
Set selection: use the set index bits to determine the set of interest.
[Figure: the m-bit address is split into a tag (t bits), a set index (s bits), and a block offset (b bits); the set index selects one of sets 0 .. S-1.]
s = log2(S), where S = C / (B × E)

Slide 9: Accessing Set Associative Caches
Line matching and word selection: the tag in each valid line of the selected set must be compared against the tag bits of the address.
(1) The valid bit must be set.
(2) The tag bits in one of the cache lines must match the tag bits in the address.
(3) If (1) and (2) hold, we have a cache hit, and the block offset selects the starting byte.
[Figure: the selected set i is probed with the address tag; on a hit, the block offset picks the word (w0 .. w3) within the block.]
b = log2(B)
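A hedged sketch of both steps in C, reusing the structures from the sketch above (cache_lookup and its parameter names are ours):

/* Split the address into tag / set index / block offset, then probe
   the selected set; returns a pointer to the byte on a hit, NULL on
   a miss.  s = log2(S), b = log2(B). */
uint8_t *cache_lookup(cache_t *c, uint64_t addr, int s, int b)
{
    uint64_t off = addr & ((1ULL << b) - 1);        /* block offset */
    uint64_t set = (addr >> b) & ((1ULL << s) - 1); /* set index    */
    uint64_t tag = addr >> (s + b);                 /* tag          */

    for (int i = 0; i < c->E; i++) {                /* line matching */
        cache_line_t *ln = &c->sets[set].lines[i];
        if (ln->valid && ln->tag == tag)            /* (1) and (2)   */
            return &ln->block[off];                 /* (3) hit       */
    }
    return NULL;                                    /* miss */
}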

Slide 10: Observations
A contiguous region of addresses forms a block.
Multiple address blocks share the same set index.
Blocks that map to the same set can be uniquely identified by the tag.

Slide 11: Caching in a Memory Hierarchy
The larger, slower, cheaper storage device at level k+1 is partitioned into blocks, and data is copied between levels in block-sized transfer units. The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1.
[Figure: blocks at level k+1 that share a set index compete for the same locations at level k.]

Slide 12: General Caching Concepts
The program needs object d, which is stored in some block b.
Cache hit: the program finds b in the cache at level k (e.g., block 14).
Cache miss: b is not at level k, so the level-k cache must fetch it from level k+1 (e.g., block 12). If the level-k cache is full, some current block must be replaced (evicted). Which one is the "victim"?
[Figure: requests for blocks 14 and 12; because blocks 12 and 4 map to the same set, alternating requests for them keep evicting each other (conflict misses).]
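The conflict-miss scenario above (blocks 12 and 4 mapping to the same set) can be reproduced with the sketch built earlier; assuming a direct-mapped cache with 4 sets and 64-byte blocks, every access in the sequence 12, 4, 12, 4 misses:

#include <stdio.h>

/* Blocks 12 and 4 share a set (12 mod 4 == 4 mod 4 == 0) and keep
   evicting each other in a direct-mapped cache. */
void conflict_demo(void)
{
    cache_t *c = cache_new(4, 1);          /* 4 sets, E = 1, so s = 2, b = 6 */
    uint64_t blocks[] = { 12, 4, 12, 4 };

    for (int i = 0; i < 4; i++) {
        uint64_t addr = blocks[i] * B;     /* block number -> byte address */
        if (cache_lookup(c, addr, 2, 6)) {
            puts("hit");
        } else {                           /* miss: install, evicting the victim */
            cache_line_t *ln = &c->sets[(addr >> 6) & 3].lines[0];
            ln->valid = 1;
            ln->tag   = addr >> 8;
            puts("miss");                  /* prints "miss" four times */
        }
    }
}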

Slide 13: Intel Processors Cache SRAM

Processor               Year   L1 instruction         L1 data                L2
Pentium II              1997   4-way, 32B, 128 sets   4-way, 32B, 128 sets   4-way, 32B, 4096 sets
Celeron A               1998   4-way, 32B, 128 sets   4-way, 32B, 128 sets   4-way, 32B, 1024 sets
Pentium III Coppermine  2000   4-way, 32B, 128 sets   4-way, 32B, 128 sets   4-way, 32B, 2048 sets
Pentium 4 Willamette    2001   trace cache, 8-way     4-way, 64B, 64 sets    8-way, 64B, 512 sets
Pentium 4 Northwood     2002   trace cache, 8-way     4-way, 64B, 64 sets    8-way, 64B, 1024 sets

Slide 14: Impact
Associativity: more lines per set reduce the vulnerability to thrashing, but require more tag bits and control logic, which increases the hit time (1-2 cycles for L1).
Block size: larger blocks exploit spatial locality (not temporal locality) to increase the hit rate, but larger blocks also increase the miss penalty (… cycles for L1).

Slide 15: Design of a DRAM cache
Line size? Large, since disk is better at transferring large blocks.
Associativity? High, to minimize the miss rate.
Write-through or write-back? Write-back, since we can't afford to perform small writes to disk.
The impact of these choices:
miss rate: extremely low, << 1%
hit time: must match cache/DRAM performance
miss latency: very high, ~20 ms
tag storage overhead: low, relative to the block size

Slide 16: Locating an Object in a "Cache" (SRAM)
The tag is stored with the cache line and maps the cache block to a memory block (from cached to uncached form); storing only the tag saves a few bits, and a block that is not in the cache has no tag at all. Hardware retrieves the information and can quickly match against multiple tags.
[Figure: object X is located by comparing "X" against the tags of lines 0 .. N-1; the matching line holds X's data.]

Slide 17: Locating an Object in a "Cache" (DRAM)
Each allocated page of virtual memory has an entry in the page table, mapping virtual pages to physical pages (from uncached form to cached form). The page table entry exists even if the page is not in memory; in that case it specifies the disk address.
[Figure: the page-table entry for object X points either to a location in the DRAM "cache" or to a location on disk.]

Slide 18: A System with Virtual Memory
Address translation: hardware converts virtual addresses to physical addresses via a lookup table (the page table).
[Figure: the CPU issues virtual addresses 0 .. N-1; the page table maps them to physical addresses 0 .. P-1 in memory or to locations on disk.]

Slide 19: Page Faults (like "Cache Misses")
What if an object is on disk rather than in memory? The page table entry then indicates that the virtual address is not in memory, and the OS exception handler is invoked to move the data from disk into memory: the current process suspends, others can resume, and the OS has full control over placement, etc.
[Figure: before the fault, the page-table entry points to disk; after the fault, the page is in memory and the entry points to its physical address.]

Slide 20: VM Address Translation: Hardware vs. Software
[Figure: the processor issues virtual address a; the hardware address-translation mechanism, part of the on-chip memory management unit (MMU), produces physical address a'. On a page fault, the fault handler is invoked and the OS performs the transfer from secondary memory to main memory (only on a miss).]

Slide 21: VM Address Translation
Parameters:
P = 2^p = page size (bytes)
N = 2^n = virtual address limit
M = 2^m = physical address limit
[Figure: the virtual address (bits n-1 .. 0) splits into a virtual page number (bits n-1 .. p) and a page offset (bits p-1 .. 0); translation replaces the virtual page number with a physical page number (bits m-1 .. p).]
Page offset bits don't change as a result of translation. For example, with 4 KB pages, p = 12 and the low 12 bits pass through unchanged.

Slide 22: Address Translation via Page Table
The virtual page number (VPN) acts as the table index: starting from the page table base register, the VPN selects a page table entry (PTE) holding a valid bit, a physical page number (PPN), and access bits. If valid = 0, the page is not in memory. Otherwise the PPN, concatenated with the unchanged page offset, forms the physical address.
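A minimal sketch of this walk in C (the names and the 4 KB page size are our assumptions; a real MMU does this lookup in hardware):

#include <stdint.h>

#define P_BITS 12                          /* p = 12: 4 KB pages (assumption) */

typedef struct {
    unsigned valid  : 1;
    unsigned access : 3;                   /* protection bits */
    uint64_t ppn;                          /* physical page number */
} pte_t;

/* Translate va; returns 0 and fills *pa on success, -1 on a page
   fault (valid == 0), which the OS handler would service. */
int translate(const pte_t *page_table, uint64_t va, uint64_t *pa)
{
    uint64_t vpn = va >> P_BITS;                 /* VPN = table index   */
    uint64_t off = va & ((1ULL << P_BITS) - 1);  /* offset is unchanged */
    const pte_t *pte = &page_table[vpn];         /* base register + VPN */

    if (!pte->valid)
        return -1;                               /* page fault */
    *pa = (pte->ppn << P_BITS) | off;
    return 0;
}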

Slide 23: Integrating VM and Cache
Most caches are "physically addressed":
this allows multiple processes to have blocks in the cache at the same time;
the cache doesn't need to be concerned with protection issues (access rights are checked as part of address translation).
Address translation is performed before the cache lookup, but this involves a memory access itself (of the PTE); of course, page table entries can also become cached.
[Figure: CPU → translation (VA to PA) → cache; on a cache miss, main memory supplies the data.]

Slide 24: Speeding up Translation with a Dedicated Cache
The "Translation Lookaside Buffer" (TLB) is a small hardware cache in the MMU that maps virtual page numbers to physical page numbers; it contains complete PTEs for a small number of pages.
[Figure: CPU → TLB lookup; on a TLB hit the physical address goes straight to the cache; on a TLB miss the translation falls back to the page table.]
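A sketch of the fast path in the same style: a small direct-mapped TLB consulted before the page-table walk. The size and the direct-mapped organization are our simplifications; the TLBs in the cpuid output above are fully associative.

#define TLB_SETS 16                        /* small, direct-mapped (assumption) */

typedef struct {
    int      valid;
    uint64_t vpn;                          /* the full VPN acts as the tag */
    pte_t    pte;                          /* complete PTE cached here */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_SETS];

int translate_with_tlb(const pte_t *page_table, uint64_t va, uint64_t *pa)
{
    uint64_t vpn = va >> P_BITS;
    tlb_entry_t *e = &tlb[vpn % TLB_SETS];

    if (!(e->valid && e->vpn == vpn)) {    /* TLB miss: walk the page table */
        if (!page_table[vpn].valid)
            return -1;                     /* page fault */
        e->valid = 1;
        e->vpn   = vpn;
        e->pte   = page_table[vpn];        /* refill the TLB entry */
    }
    *pa = (e->pte.ppn << P_BITS) | (va & ((1ULL << P_BITS) - 1));
    return 0;
}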

Slide 25: What do you know?

(ow133) cpuid
This system has a Genuine Intel(R) Pentium(R) 4 processor
Processor Family: F, Extended Family: 0, Model: 2, Stepping: 7
Pentium 4 core C1 (0.13 micron): core-speed 2 GHz (bus-speed 400/533 MHz)
Instruction TLB: 4K, 2M or 4M pages, fully associative, 128 entries
Data TLB: 4K or 4M pages, fully associative, 64 entries
1st-level data cache: 8K-bytes, 4-way set associative, sectored cache, 64-byte line size
No 2nd-level cache or, if processor contains a valid 2nd-level cache, no 3rd-level cache
Trace cache: 12K-uops, 8-way set associative
2nd-level cache: 512K-bytes, 8-way set associative, sectored cache, 64-byte line size

Slide 26: Matrix Multiplication Example
Major cache effects to consider:
Total cache size: exploit temporal locality and keep the working set small (e.g., by using blocking).
Block size: exploit spatial locality.
Description: multiply N × N matrices; O(N^3) total operations; N reads per source element; N values summed per destination.
Assumption: N is so large that the cache is not big enough to hold multiple rows.

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;                      /* variable sum held in register */
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Slide 27: Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed.
Misses per inner loop iteration: A = 0.25, B = 1.00, C = 0.00; total = 1.25.

Slide 28: Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A (i,*) row-wise; B (*,j) column-wise; C (i,j) fixed.
Misses per inner loop iteration: A = 0.25, B = 1.00, C = 0.00; total = 1.25.

Slide 29: Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise.
Misses per inner loop iteration: A = 0.00, B = 0.25, C = 0.25; total = 0.50.

Slide 30: Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A (i,k) fixed; B (k,*) row-wise; C (i,*) row-wise.
Misses per inner loop iteration: A = 0.00, B = 0.25, C = 0.25; total = 0.50.

Slide 31: Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise.
Misses per inner loop iteration: A = 1.00, B = 0.00, C = 1.00; total = 2.00.

Slide 32: Matrix Multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A (*,k) column-wise; B (k,j) fixed; C (*,j) column-wise.
Misses per inner loop iteration: A = 1.00, B = 0.00, C = 1.00; total = 2.00.

Slide 33: Summary of Matrix Multiplication

ijk (and jik): 2 loads, 0 stores; misses/iter = 1.25
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

kij (and ikj): 2 loads, 1 store; misses/iter = 0.5
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

jki (and kji): 2 loads, 1 store; misses/iter = 2.0
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Slide 34: Matrix Multiply Performance
Miss rates are helpful but not perfect predictors. Code scheduling matters, too.
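One way to see this on a real machine is a small timing harness; a sketch that times two of the orderings with the standard clock() function (N and the static arrays are our choices, not from the slides):

#include <stdio.h>
#include <time.h>

#define N 512

static double a[N][N], b[N][N], c[N][N];

static void mm_ijk(void)               /* 1.25 misses per inner iteration */
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

static void mm_kij(void)               /* 0.5 misses per inner iteration */
{
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            double r = a[i][k];
            for (int j = 0; j < N; j++)
                c[i][j] += r * b[k][j];
        }
}

static void bench(const char *name, void (*f)(void))
{
    clock_t t0 = clock();              /* results aren't compared; we only time */
    f();
    printf("%s: %.2f s\n", name, (double)(clock() - t0) / CLOCKS_PER_SEC);
}

int main(void)
{
    bench("ijk", mm_ijk);
    bench("kij", mm_kij);
    return 0;
}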

Slide 35: Improving Temporal Locality by Blocking
Example: blocked matrix multiplication. Here "block" does not mean "cache block"; it means a sub-block within the matrix. Example: N = 8, sub-block size = 4.

[ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
[ A21 A22 ] × [ B21 B22 ] = [ C21 C22 ]

C11 = A11 B11 + A12 B21    C12 = A11 B12 + A12 B22
C21 = A21 B11 + A22 B21    C22 = A21 B12 + A22 B22

Key idea: sub-blocks (i.e., A_xy) can be treated just like scalars.

Slide 36: Blocked Matrix Multiply (bijk)

/* min() is not standard C; assume e.g.
   #define min(a,b) (((a) < (b)) ? (a) : (b)) */
for (jj=0; jj<n; jj+=bsize) {
  for (i=0; i<n; i++)
    for (j=jj; j < min(jj+bsize,n); j++)
      c[i][j] = 0.0;
  for (kk=0; kk<n; kk+=bsize) {
    for (i=0; i<n; i++) {
      for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++)
          sum += a[i][k] * b[k][j];
        c[i][j] += sum;
      }
    }
  }
}

Slide 37: Blocked Matrix Multiply Analysis
The innermost loop pair multiplies a 1 × bsize sliver of A by a bsize × bsize block of B and accumulates into a 1 × bsize sliver of C. The loop over i steps through n row slivers of A and C, using the same block of B.
[Figure: the B block is reused n times in succession; each row sliver of A is accessed bsize times; successive elements of the C sliver are updated.]

Slide 38: Matrix Multiply Performance
Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions, and is relatively insensitive to array size.

Slide 39: Concluding Observations
The programmer can optimize for cache performance:
how data structures are organized;
how data are accessed (nested loop structure; blocking is a general technique).
All systems favor "cache friendly code":
getting absolute optimum performance is very platform specific (cache sizes, line sizes, associativities, etc.);
most of the advantage can be had with generic code: keep the working set reasonably small (temporal locality) and use small strides (spatial locality).

Slide 40: Assignment
Optimize your 'image-processing' code further with blocking. Make your code a function of the L1 size, as in the sketch below.
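As a possible starting point, a hedged sketch of a blocked image pass whose tile size is derived from the L1 data-cache size; L1_BYTES and the pixel operation are placeholders to replace with your own cpuid result and kernel:

#include <stddef.h>
#include <stdint.h>

#define L1_BYTES (8 * 1024)            /* placeholder: L1 data-cache size from cpuid */

/* Process an h x w 8-bit image tile by tile; the tile is sized so a
   tile (with headroom) fits in L1 -- a heuristic, not a hard rule. */
void process_blocked(uint8_t *img, size_t h, size_t w)
{
    size_t bsize = 1;
    while (bsize * bsize * 4 <= L1_BYTES)   /* factor 4 leaves headroom */
        bsize *= 2;
    bsize /= 2;                             /* largest power of two that fits */

    for (size_t ii = 0; ii < h; ii += bsize)
        for (size_t jj = 0; jj < w; jj += bsize)
            for (size_t i = ii; i < ii + bsize && i < h; i++)
                for (size_t j = jj; j < jj + bsize && j < w; j++)
                    img[i * w + j] = 255 - img[i * w + j];  /* placeholder kernel */
}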