1 Computer Systems: Cache characteristics
Arnoud Visser, University of Amsterdam

2 How do you know?
(ow133) cpuid
This system has a Genuine Intel(R) Pentium(R) 4 processor
Processor Family: F, Extended Family: 0, Model: 2, Stepping: 7
Pentium 4 core C1 (0.13 micron): core speed 2 GHz - 3.06 GHz (bus speed 400/533 MHz)
Instruction TLB: 4K, 2M or 4M pages, fully associative, 128 entries
Data TLB: 4K or 4M pages, fully associative, 64 entries
1st-level data cache: 8K-bytes, 4-way set associative, sectored cache, 64-byte line size
No 2nd-level cache or, if processor contains a valid 2nd-level cache, no 3rd-level cache
Trace cache: 12K-uops, 8-way set associative
2nd-level cache: 512K-bytes, 8-way set associative, sectored cache, 64-byte line size

3 The CPUID instruction

asm("push %ebx; movl $2,%eax; cpuid;"
    "movl %eax,cache_eax; movl %ebx,cache_ebx;"
    "movl %edx,cache_edx; movl %ecx,cache_ecx;"
    "pop %ebx");
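The raw asm above writes the result registers into global variables (cache_eax and so on) by hand. A more portable sketch of the same query, assuming GCC or Clang and their <cpuid.h> helper (this is not part of the original slide), reads the same leaf-2 cache descriptors:

#include <stdio.h>
#include <cpuid.h>

int main(void) {
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 2 returns the cache and TLB descriptor bytes decoded on slide 2. */
    if (!__get_cpuid(2, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 2 not supported\n");
        return 1;
    }
    printf("eax=%08x ebx=%08x ecx=%08x edx=%08x\n", eax, ebx, ecx, edx);
    return 0;
}

The descriptor bytes still have to be looked up in a table, such as the one referenced on the next slide.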

4 A limited number of answers
http://www.sandpile.org/ia32/cpuid.htm

5 Direct-Mapped Cache
Simplest kind of cache: characterized by exactly one line per set (E = 1).
[Figure: sets 0 through S-1, each holding a single line consisting of a valid bit, a tag, and a cache block.]

6 Fully Associative Cache
Difficult and expensive to build one that is both large and fast.
[Figure: a single set (set 0) containing E = C/B lines, each with a valid bit, a tag, and a cache block.]

7 E-way Set Associative Caches
Characterized by more than one line per set.
[Figure: sets 0 through S-1, each holding E = 2 lines, each line consisting of a valid bit, a tag, and a cache block.]

8 Accessing Set Associative Caches
Set selection: use the set index bits to determine the set of interest.
[Figure: an m-bit address split into t tag bits, s set index bits, and b block offset bits; the set index selects one of the S sets.]
s = log2(S), where S = C / (B x E)
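A minimal C sketch of this address split (the function and parameter names are illustrative assumptions chosen to match the slide's notation):

#include <stdint.h>

typedef struct { uint64_t tag, set, offset; } addr_parts_t;

/* Split an m-bit address into b block-offset bits, s set-index bits,
   and the remaining t = m - (s + b) tag bits. */
addr_parts_t split_address(uint64_t addr, unsigned b, unsigned s) {
    addr_parts_t p;
    p.offset = addr & ((1ULL << b) - 1);          /* low b bits        */
    p.set    = (addr >> b) & ((1ULL << s) - 1);   /* next s bits       */
    p.tag    = addr >> (b + s);                   /* remaining t bits  */
    return p;
}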

9 Accessing Set Associative Caches
Line matching and word selection: compare the tag in each valid line of the selected set.
(1) The valid bit must be set.
(2) The tag bits in one of the cache lines must match the tag bits in the address.
(3) If (1) and (2) hold, it is a cache hit, and the block offset selects the starting byte within the block.
[Figure: the t tag bits of the address are compared against the tags of the lines in selected set i; the b = log2(B) offset bits pick the word (w0..w3) within the matching block.]
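A minimal C sketch of that check over the E lines of the selected set (the structure and field names are assumptions for illustration, not the slide's own code); with E = 1 it degenerates into the direct-mapped case, where no search is needed:

typedef struct {
    int valid;
    unsigned long tag;
    unsigned char data[64];   /* one B-byte block; B = 64 assumed here */
} line_t;

/* Returns a pointer to the requested byte on a hit, or NULL on a miss. */
unsigned char *lookup(line_t set[], int E, unsigned long tag,
                      unsigned long block_offset) {
    for (int i = 0; i < E; i++) {
        if (set[i].valid && set[i].tag == tag)     /* checks (1) and (2) */
            return &set[i].data[block_offset];     /* (3) word selection */
    }
    return 0;                                      /* cache miss */
}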

10 Observations
A contiguous region of addresses forms a block.
Multiple address blocks share the same set index.
Blocks that map to the same set can be uniquely identified by the tag.

11 Caching in a Memory Hierarchy
The larger, slower, cheaper storage device at level k+1 is partitioned into blocks. Data is copied between levels in block-sized transfer units.
The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1.
[Figure: level k+1 holds blocks 0-15; level k caches a small subset of them (e.g., blocks 8, 9, 14, and 3), and blocks with the same set index compete for the same place at level k.]

12 General Caching Concepts
The program needs object d, which is stored in some block b.
Cache hit: the program finds b in the cache at level k (e.g., block 14).
Cache miss: b is not at level k, so the level k cache must fetch it from level k+1 (e.g., block 12). If the level k cache is full, some current block must be replaced (evicted). Which one is the "victim"?
[Figure: requests for blocks 14 and 12 against a level k cache backed by blocks 0-15 at level k+1; repeated requests that keep evicting each other illustrate a conflict miss.]
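The slide leaves the replacement policy open. One common choice is least-recently-used (LRU); a minimal sketch of victim selection under that policy (the structure and field names are assumptions for illustration):

typedef struct {
    int valid;
    unsigned long tag;
    unsigned long last_used;   /* timestamp of the most recent access */
} line_t;

/* Pick the victim line in a set of E lines: prefer an invalid line,
   otherwise evict the least recently used one. */
int choose_victim(line_t set[], int E) {
    int victim = 0;
    for (int i = 0; i < E; i++) {
        if (!set[i].valid)
            return i;                                  /* free line, nothing to evict */
        if (set[i].last_used < set[victim].last_used)
            victim = i;                                /* older than current candidate */
    }
    return victim;
}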

13 Intel Processors Cache SRAM

Processor               Year   L1 instruction   L1 data      L2
Pentium II              1997   4-way, 32B       4-way, 32B   4-way, 32B
Celeron A               1998   4-way, 32B       4-way, 32B   4-way, 32B
Pentium III Coppermine  2000   4-way, 32B       4-way, 32B   4-way, 32B
Pentium 4 Willamette    2000   8-way            4-way, 64B   8-way, 64B
Pentium 4 Northwood     2002   8-way            4-way, 64B   8-way, 64B

http://www11.brinkster.com/bayup/dodownload.asp?f=37

14 Intel Processors Cache SRAM
ftp://download.intel.com/design/Pentium4/manuals/25366514.pdf (section 2.4)

15 Impact
Associativity: more lines per set decrease the vulnerability to thrashing, but require more tag bits and control logic, which increases the hit time (1-2 cycles for L1).
Block size: larger blocks exploit spatial locality (not temporal locality) to increase the hit rate, but larger blocks also increase the miss penalty (5-100 cycles for L1).
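These two effects are usually weighed with the average memory access time: hit time + miss rate x miss penalty. With illustrative numbers (not from the slide), a 2-cycle hit time, a 5% miss rate, and a 20-cycle miss penalty give 2 + 0.05 x 20 = 3 cycles per access on average.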

16 Optimum Block size
http://www.cs.berkeley.edu/~randy/Courses/CS252.S96/Lecture15.pdf

17 Absolute Miss Rate
http://www.cs.berkeley.edu/~randy/Courses/CS252.S96/Lecture16.pdf

18 Relative Miss Rate
http://www.cs.berkeley.edu/~randy/Courses/CS252.S96/Lecture16.pdf

19 Matrix Multiplication Example
Major cache effects to consider:
- Total cache size: exploit temporal locality and keep the working set small (e.g., by using blocking)
- Block size: exploit spatial locality
Description:
- Multiply N x N matrices
- O(N^3) total operations
- Accesses: N reads per source element; N values summed per destination element
Assumption:
- N is so large that the cache is not big enough to hold multiple rows

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

The variable sum is held in a register.

20 Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A is accessed row-wise (i,*), B column-wise (*,j), C stays fixed (i,j).
Misses per inner loop iteration: A 0.25, B 1.0, C 0.0 = 1.25.
(Assuming blocks that hold four doubles, e.g. 32-byte blocks of 8-byte doubles: a row-wise sweep misses once per four elements, a column-wise sweep misses on every access, and a fixed element generates no misses.)

21 Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

Inner loop: A is accessed row-wise (i,*), B column-wise (*,j), C stays fixed (i,j).
Misses per inner loop iteration: A 0.25, B 1.0, C 0.0 = 1.25.

22 Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A stays fixed (i,k), B is accessed row-wise (k,*), C row-wise (i,*).
Misses per inner loop iteration: A 0.0, B 0.25, C 0.25 = 0.5.

23 Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

Inner loop: A stays fixed (i,k), B is accessed row-wise (k,*), C row-wise (i,*).
Misses per inner loop iteration: A 0.0, B 0.25, C 0.25 = 0.5.

24 Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A is accessed column-wise (*,k), B stays fixed (k,j), C column-wise (*,j).
Misses per inner loop iteration: A 1.0, B 0.0, C 1.0 = 2.0.

25 Matrix Multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}

Inner loop: A is accessed column-wise (*,k), B stays fixed (k,j), C column-wise (*,j).
Misses per inner loop iteration: A 1.0, B 0.0, C 1.0 = 2.0.

26 Summary of Matrix Multiplication

ijk (and jik): 2 loads, 0 stores; misses/iter = 1.25
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}

kij (and ikj): 2 loads, 1 store; misses/iter = 0.5
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}

jki (and kji): 2 loads, 1 store; misses/iter = 2.0
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}
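Miss counts alone do not determine running time (as the next slide notes), so it is worth timing the variants directly. A minimal benchmark sketch, with the matrix size, names, and timing method chosen purely for illustration (they are not from the slides):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512

static double a[N][N], b[N][N], c[N][N];

static void mm_ijk(int n) {                 /* the ijk ordering from slide 20 */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}

static void mm_kij(int n) {                 /* the kij ordering from slide 22 */
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++) {
            double r = a[i][k];
            for (int j = 0; j < n; j++)
                c[i][j] += r * b[k][j];
        }
}

static double seconds(void) { return (double)clock() / CLOCKS_PER_SEC; }

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = rand() / (double)RAND_MAX;
            b[i][j] = rand() / (double)RAND_MAX;
            c[i][j] = 0.0;
        }
    double t0 = seconds(); mm_ijk(N); double t1 = seconds();
    for (int i = 0; i < N; i++)             /* reset C before the second run */
        for (int j = 0; j < N; j++)
            c[i][j] = 0.0;
    double t2 = seconds(); mm_kij(N); double t3 = seconds();
    printf("ijk: %.3f s   kij: %.3f s\n", t1 - t0, t3 - t2);
    return 0;
}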

27 Matrix Multiply Performance
Miss rates are helpful but not perfect predictors: code scheduling matters, too.

28 Improving Temporal Locality by Blocking
Example: blocked matrix multiplication
- "Block" here does not mean "cache block"; it means a sub-block within the matrix.
- Example: N = 8; sub-block size = 4.

C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22

Key idea: sub-blocks (i.e., Axy) can be treated just like scalars.

29 Blocked Matrix Multiply (bijk)

for (jj=0; jj<n; jj+=bsize) {
  for (i=0; i<n; i++)
    for (j=jj; j < min(jj+bsize,n); j++)
      c[i][j] = 0.0;
  for (kk=0; kk<n; kk+=bsize) {
    for (i=0; i<n; i++) {
      for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++) {
          sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
      }
    }
  }
}
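The slides do not fix bsize. A rough, illustrative rule of thumb (an assumption, not from the slides) is to keep the bsize x bsize sub-block of B resident in the L1 data cache:

/* Illustrative only: for 8-byte doubles and the 8 KB L1 data cache quoted on
   slide 2, bsize * bsize * 8 <= 8192 gives bsize <= 32; a slightly smaller
   value such as 25 leaves room for the row slivers of A and C that are also
   live in the inner loops. */
enum { BSIZE = 25 };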

30 Blocked Matrix Multiply Analysis
The innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates the result into a 1 x bsize sliver of C.
The loop over i steps through n row slivers of A and C, using the same block of B each time.
[Figure: the row sliver of A is accessed bsize times, the block of B is reused n times in succession, and successive elements of the C sliver are updated.]

31 Matrix Multiply Performance
Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions, and is relatively insensitive to array size.

32 Concluding Observations
The programmer can optimize for cache performance:
- How data structures are organized
- How data are accessed: nested loop structure; blocking is a general technique
All systems favor "cache friendly code":
- Getting absolute optimum performance is very platform specific (cache sizes, line sizes, associativities, etc.)
- Most of the advantage can be had with generic code: keep the working set reasonably small (temporal locality) and use small strides (spatial locality)

33 Assignment
Homework problem 6.28: what percentage of writes will miss in the cache for the blank screen function?

