University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 1 Computer Systems Cache characteristics.

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 1 Computer Systems Cache characteristics

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 2 How do you know? (ow133)cpuid This system has a Genuine Intel(R) Pentium(R) 4 processor Processor Family: F, Extended Family: 0, Model: 2, Stepping: 7 Pentium 4 core C1 (0.13 micron): core-speed 2 Ghz - 3.06 GHz (bus-speed 400/533 MHz) Instruction TLB: 4K, 2M or 4M pages, fully associative, 128 entries Data TLB: 4K or 4M pages, fully associative, 64 entries 1st-level data cache: 8K-bytes, 4-way set associative, sectored cache, 64-byte line size No 2nd-level cache or, if processor contains a valid 2nd-level cache, no3rd-level cache Trace cache: 12K-uops, 8-way set associative 2nd-level cache: 512K-bytes, 8-way set associative, sectored cache, 64-byte line size

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 3 CPUID-instruction asm("push %ebx; movl $2,%eax; CPUID; movl %eax,cache_eax; movl %ebx,cache_ebx; movl %edx,cache_edx; movl %ecx,cache_ecx; pop %ebx");

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 4 A limited number of answers http://www.sandpile.org/ia32/cpuid.htm

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 5 Direct-Mapped Cache Simplest kind of cache Characterized by exactly one line per set. E=1 lines per set valid tag set 0: set 1: set S-1: cache block

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 6 Fully Associative Cache Difficult and expensive to build one that is both large and fast Valid Tag Set 0: E = C/B lines in the one and only set ValidTag Cache block

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 7 E – way Set Associative Caches Characterized by more than one line per set validtag set 0: E=2 lines per set set 1: set S-1: cache block validtagcache block validtagcache block validtagcache block validtagcache block validtagcache block

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 8 Accessing Set Associative Caches Set selection –Use the set index bits to determine the set of interest. m-1 t bitss bits 0 0 0 0 1 0 b bits tagset indexblock offset selected set s = log 2 (S) valid tag set 0: valid tag set 1: valid tag set S-1: cache block S = C / (B x E)

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 9 Accessing Set Associative Caches Line matching and word selection –must compare the tag in each valid line in the selected set. 10110 w3w3 w0w0 w1w1 w2w2 11001 t bitss bits 100i0110 0m-1 b bits tagset indexblock offset selected set (i): =1? (1) The valid bit must be set. = ? (2) The tag bits in one of the cache lines must match the tag bits in the address (3) If (1) and (2), then cache hit, and block offset selects starting byte. 30127456 b = log 2 (B)

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 10 Observations Contiguously region of addresses form a block Multiple address blocks share the same set-index Blocks that map to the same set can be uniquely identified by the tag

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 11 Caching in a Memory Hierarchy 0123 4567 891011 12131415 Larger, slower, cheaper storage device at level k+1 is partitioned into blocks. Data is copied between levels in block-sized transfer units 8 9143 Smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 Level k: Level k+1: 4 4 4 10 Same set-index

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 12 Request 14 Request 12 General Caching Concepts Program needs object d, which is stored in some block b. Cache hit –Program finds b in the cache at level k. E.g., block 14. Cache miss –b is not at level k, so level k cache must fetch it from level k+1. E.g., block 12. –If level k cache is full, then some current block must be replaced (evicted). Which one is the “victim”? 93 0123 4567 891011 12131415 Level k: Level k+1: 14 12 14 4* 12 0123 Request 12 4* 12 14 12 Conflict miss

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 13 Intel Processors Cache SRAM L1L2 InstructionData Pentium II19974-way, 32B, 128 sets 4-way,32B, 4096 sets Celeron A1998,, 4-way,32B, 1028 sets Pentium III Coppermine 2000,, 4-way,32B, 2048 sets Pentium 4 Willamette 20008-way4-way,64B, 64 sets 8-way,64B, 512 sets Pentium 4 Northwood 2002,, 8-way,64B, 1028 sets http://www11.brinkster.com/bayup/dodownload.asp?f=37

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 14 Impact Associativity more lines decrease the vulnerability on thrashing. It requires more tag-bits and control logic, which increases the hit-time (1-2 cycles for L1) Block size larger blocks exploit spatial locality (not temporal locality) to increase the hit-rate. Larger blocks increase the miss-penalty (5- 100 cycles for L1).

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 15 Design of DRAM cache –Line size? Large, since disk better at transferring large blocks –Associativity? High, to mimimize miss rate –Write through or write back? Write back, since can’t afford to perform small writes to disk What would the impact of these choices be on: –miss rate Extremely low. << 1% –hit time Must match cache/DRAM performance –miss latency Very high. ~20ms –tag storage overhead Low, relative to block size

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 16 Locating an Object in a “Cache” SRAM Cache –Tag stored with cache line –Maps from cache block to memory blocks From cached to uncached form Save a few bits by only storing tag –No tag for block not in cache –Hardware retrieves information can quickly match against multiple tags X Object Name TagData D243 X 17 J105 0: 1: N-1: = X? “Cache”

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 17 Locating an Object in “Cache” Data 243 17 105 0: 1: N-1: X Object Name Location D: J: X: 1 0 On Disk “Cache”Page Table DRAM Cache –Each allocated page of virtual memory has entry in page table –Mapping from virtual pages to physical pages From uncached form to cached form –Page table entry even if page not in memory Specifies disk address

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 18 A System with Virtual Memory Address Translation: Hardware converts virtual addresses to physical addresses via lookup table (page table) CPU 0: 1: N-1: Memory 0: 1: P-1: Page Table Disk Virtual Addresses Physical Addresses

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 19 Page Faults (like “Cache Misses”) What if an object is on disk rather than in memory? –Page table entry indicates virtual address not in memory –OS exception handler invoked to move data from disk into memory current process suspends, others can resume OS has full control over placement, etc. Memory Page Table Disk Virtual Addresses Physical Addresses CPU Memory Page Table Disk Virtual Addresses Physical Addresses Before fault After fault CPU

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 20 VM Address Translation: Hardware vs Software Processor Hardware Addr Trans Mechanism fault handler Main Memory Secondary memory a a'  page fault physical address OS performs this transfer (only if miss) virtual addresspart of the on-chip memory mgmt unit (MMU)

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 21 virtual page numberpage offset virtual address physical page numberpage offset physical address 0p–1 address translation pm–1 n–10p–1p Page offset bits don’t change as a result of translation VM Address Translation Parameters –P = 2 p = page size (bytes). –N = 2 n = Virtual address limit –M = 2 m = Physical address limit

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 22 Address Translation via Page Table virtual page number (VPN)page offset virtual address physical page number (PPN)page offset physical address 0p–1pm–1 n–10p–1p page table base register if valid=0 then page not in memory valid physical page number (PPN) access VPN acts as table index PTEA = PTE

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 23 CPU Trans- lation Cache Main Memory VAPA miss hit data Integrating VM and Cache Most Caches “Physically Addressed” –Allows multiple processes to have blocks in cache at same time –Cache doesn’t need to be concerned with protection issues (Access rights checked as part of address translation) Perform Address Translation Before Cache –But this involves a memory access itself (of the PTE) –Of course, page table entries can also become cached

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 24 CPU TLB Lookup Cache Main Memory VAPA miss hit data Trans- lation hit miss Speeding up Translation with a dedicated cache “Translation Lookaside Buffer” (TLB) –Small hardware cache in MMU –Maps virtual page to physical page numbers –Contains complete PTEs for small number of pages

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 25 What do you know? (ow133)cpuid This system has a Genuine Intel(R) Pentium(R) 4 processor Processor Family: F, Extended Family: 0, Model: 2, Stepping: 7 Pentium 4 core C1 (0.13 micron): core-speed 2 Ghz - 3.06 GHz (bus-speed 400/533 MHz) Instruction TLB: 4K, 2M or 4M pages, fully associative, 128 entries Data TLB: 4K or 4M pages, fully associative, 64 entries 1st-level data cache: 8K-bytes, 4-way set associative, sectored cache, 64-byte line size No 2nd-level cache or, if processor contains a valid 2nd-level cache, no3rd-level cache Trace cache: 12K-uops, 8-way set associative 2nd-level cache: 512K-bytes, 8-way set associative, sectored cache, 64-byte line size

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 26 Matrix Multiplication Example Major Cache Effects to Consider –Total cache size Exploit temporal locality and keep the working set small (e.g., by using blocking) –Block size Exploit spatial locality Description: –Multiply N x N matrices –O(N3) total operations –Accesses N reads per source element N values summed per destination Assumption –No so large that cache not big enough to hold multiple rows /* ijk */ for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; } /* ijk */ for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; } Variable sum held in register

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 27 Matrix Multiplication (ijk) /* ijk */ for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; } /* ijk */ for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; } ABC (i,*) (*,j) (i,j) Inner loop: Column- wise Row-wise Fixed Misses per Inner Loop Iteration: ABC 0.251.00.0 = 1.25

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 28 Matrix Multiplication (jik) /* jik */ for (j=0; j<n; j++) { for (i=0; i<n; i++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum } /* jik */ for (j=0; j<n; j++) { for (i=0; i<n; i++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum } ABC (i,*) (*,j) (i,j) Inner loop: Row-wiseColumn- wise Fixed Misses per Inner Loop Iteration: ABC 0.251.00.0 = 1.25

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 29 Matrix Multiplication (kij) /* kij */ for (k=0; k<n; k++) { for (i=0; i<n; i++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; } /* kij */ for (k=0; k<n; k++) { for (i=0; i<n; i++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; } ABC (i,*) (i,k)(k,*) Inner loop: Row-wise Fixed Misses per Inner Loop Iteration: ABC 0.00.250.25 = 0.5

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 30 Matrix Multiplication (ikj) /* ikj */ for (i=0; i<n; i++) { for (k=0; k<n; k++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; } /* ikj */ for (i=0; i<n; i++) { for (k=0; k<n; k++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; } ABC (i,*) (i,k)(k,*) Inner loop: Row-wise Fixed Misses per Inner Loop Iteration: ABC 0.00.250.25 = 0.5

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 31 Matrix Multiplication (jki) /* jki */ for (j=0; j<n; j++) { for (k=0; k<n; k++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; } /* jki */ for (j=0; j<n; j++) { for (k=0; k<n; k++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; } ABC (*,j) (k,j) Inner loop: (*,k) Column - wise Column- wise Fixed Misses per Inner Loop Iteration: ABC 1.00.01.0 = 2.0

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 32 Matrix Multiplication (kji) /* kji */ for (k=0; k<n; k++) { for (j=0; j<n; j++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; } /* kji */ for (k=0; k<n; k++) { for (j=0; j<n; j++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; } ABC (*,j) (k,j) Inner loop: (*,k) FixedColumn- wise Column- wise Misses per Inner Loop Iteration: ABC 1.00.01.0 = 2.0

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 33 Summary of Matrix Multiplication for (i=0; i<n; i++) { for (j=0; j<n; j++) { sum = 0.0; for (k=0; k<n; k++) sum += a[i][k] * b[k][j]; c[i][j] = sum; } ijk (& jik): 2 loads, 0 stores misses/iter = 1.25 for (k=0; k<n; k++) { for (i=0; i<n; i++) { r = a[i][k]; for (j=0; j<n; j++) c[i][j] += r * b[k][j]; } for (j=0; j<n; j++) { for (k=0; k<n; k++) { r = b[k][j]; for (i=0; i<n; i++) c[i][j] += a[i][k] * r; } kij (& ikj): 2 loads, 1 store misses/iter = 0.5 jki (& kji): 2 loads, 1 store misses/iter = 2.0

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 34 Matrix Multiply Performance Miss rates are helpful but not perfect predictors. Code scheduling matters, too.

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 35 Improving Temporal Locality by Blocking Example: Blocked matrix multiplication –do not mean “cache block”. –Instead, it mean a sub-block within the matrix. –Example: N = 8; sub-block size = 4 C 11 = A 11 B 11 + A 12 B 21 C 12 = A 11 B 12 + A 12 B 22 C 21 = A 21 B 11 + A 22 B 21 C 22 = A 21 B 12 + A 22 B 22 A 11 A 12 A 21 A 22 B 11 B 12 B 21 B 22 X = C 11 C 12 C 21 C 22 Key idea: Sub-blocks (i.e., A xy ) can be treated just like scalars.

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 36 Blocked Matrix Multiply (bijk) for (jj=0; jj<n; jj+=bsize) { for (i=0; i<n; i++) for (j=jj; j < min(jj+bsize,n); j++) c[i][j] = 0.0; for (kk=0; kk<n; kk+=bsize) { for (i=0; i<n; i++) { for (j=jj; j < min(jj+bsize,n); j++) { sum = 0.0 for (k=kk; k < min(kk+bsize,n); k++) { sum += a[i][k] * b[k][j]; } c[i][j] += sum; }

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 37 Blocked Matrix Multiply Analysis Innermost loop pair multiplies –a 1 X bsize sliver of A by –a bsize X bsize block of B and accumulates into 1 X bsize sliver of C –Loop over i steps through n row slivers of A & C, using same B ABC block reused n times in succession row sliver accessed bsize times Update successive elements of sliver ii kk jj

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 38 Matrix Multiply Performance Blocking (bijk and bikj) improves performance by a factor of two over unblocked versions –relatively insensitive to array size.

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 39 Concluding Observations Programmer can optimize for cache performance –How data structures are organized –How data are accessed Nested loop structure Blocking is a general technique All systems favor “cache friendly code” –Getting absolute optimum performance is very platform specific Cache sizes, line sizes, associativities, etc. –Can get most of the advantage with generic code Keep working set reasonably small (temporal locality) Use small strides (spatial locality)

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 40 Assignment Optimize your ‘image-processing’ code further with blocking Make your code a function of the L1-size

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 1 Computer Systems Cache characteristics.

Similar presentations

Presentation on theme: "University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 1 Computer Systems Cache characteristics."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 1 Computer Systems Cache characteristics.

Similar presentations

Presentation on theme: "University of Amsterdam Computer Systems – cache characteristics Arnoud Visser 1 Computer Systems Cache characteristics."— Presentation transcript:

Similar presentations

About project

Feedback