
1 CS252 Graduate Computer Architecture, Lecture 14: The 3+1 Cs of Caching and the Many Ways of Cache Optimization
John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley

2 Review: VLIW: Very Large Instruction Word
Each "instruction" has explicit coding for multiple operations
- In IA-64, grouping called a "packet"
- In Transmeta, grouping called a "molecule" (with "atoms" as ops)
Tradeoff instruction space for simple decoding
- The long instruction word has room for many operations
- By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
- E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
- 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
Need compiling technique that schedules across several branches

3 Problems with 1st Generation VLIW
Increase in code size
- Generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
- Whenever VLIW instructions are not full, unused functional units translate to wasted bits in the instruction encoding
Operated in lock-step; no hazard detection HW
- A stall in any functional unit pipeline caused the entire processor to stall, since all functional units must be kept synchronized
- Compiler might predict functional units, but caches are hard to predict
Binary code compatibility
- Pure VLIW => different numbers of functional units and unit latencies require different versions of the code

4 Discussion of two papers for today
"DAISY: Dynamic Compilation for 100% Architectural Compatibility," Kemal Ebcioglu and Erik R. Altman. Appeared in the International Symposium on Computer Architecture (ISCA), 1997
"The Transmeta Code Morphing Software: Using Speculation, Recovery, and Adaptive Retranslation to Address Real-Life Challenges," James C. Dehnert, Brian K. Grant, John P. Banning, Richard Johnson, Thomas Kistler, Alexander Klaiber, Jim Mattson. Appeared in the Proceedings of the First Annual IEEE/ACM International Symposium on Code Generation and Optimization, March 2003

5 Intel/HP IA-64 “Explicitly Parallel Instruction Computer (EPIC)”
IA-64: instruction set architecture
- 128 64-bit integer regs + 128 82-bit floating point regs
- Not separate register files per functional unit as in old VLIW
Hardware checks dependencies (interlocks => binary compatibility over time)
3 instructions in 128-bit "bundles"; a field determines if instructions are dependent or independent
- Smaller code size than old VLIW, larger than x86/RISC
- Groups can be linked to show independence of > 3 instr
Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?
Speculation support:
- Deferred exception handling with "poison bits"
- Speculative movement of loads above stores + check to see if incorrect
Itanium™ was the first implementation (2001)
- Highly parallel and deeply pipelined hardware at 800 MHz
- 6-wide, 10-stage pipeline at 800 MHz on a 0.18 µm process
Itanium 2™ is the name of the 2nd implementation (2005)
- 6-wide, 8-stage pipeline at 1666 MHz on a 0.13 µm process
- Caches: 32 KB I, 32 KB D, 128 KB L2I, 128 KB L2D, 9216 KB L3

6 Itanium™ EPIC Design Maximizes SW-HW Synergy
(Copyright: Intel at Hotchips ’00)
Architecture features programmed by compiler: Predication, Data & Control Speculation, Explicit Parallelism, Register Stack & Rotation, Branch Hints, Memory Hints
Micro-architecture features in hardware:
- Fetch: instruction cache & branch predictors
- Issue: fast, simple 6-issue
- Register handling: 128 GR & 128 FR, register remap & stack engine
- Parallel resources: 4 integer + 4 MMX units, 2 FMACs (4 for SSE), 2 LD/ST units, 32-entry ALAT
- Control: speculation deferral management, bypasses & dependencies
- Memory subsystem: three levels of cache (L1, L2, L3)

7 10 Stage In-Order Core Pipeline (Copyright: Intel at Hotchips ’00)
Front end: pre-fetch/fetch of up to 6 instructions/cycle, hierarchy of branch predictors, decoupling buffer
Instruction delivery: dispersal of up to 6 instructions on 9 ports, register remapping, register stack engine
Operand delivery: register read + bypasses, register scoreboard, predicated dependencies
Execution: 4 single-cycle ALUs, 2 ld/st, advanced load control, predicate delivery & branch, NaT/exception/retirement
Pipeline stages: IPG (instruction pointer generation), FET (fetch), ROT (rotate), EXP (expand), REN (rename), WLD (word-line decode), REG (register read), EXE (execute), DET (exception detect), WRB (write-back)

8 Why More on Memory Hierarchy?
Processor-Memory Performance Gap is growing
- Y-axis is performance, X-axis is time
- Latency cliché: note that x86 didn't have a cache on chip until 1989

9 A Typical Memory Hierarchy c.2007
Implementation close to the CPU looks like a Harvard machine... (jse)
- Multiported register file (part of CPU)
- Split instruction & data primary caches (on-chip SRAM)
- Large unified secondary cache (on-chip SRAM)
- Multiple interleaved memory banks (off-chip DRAM)

10 Itanium-2 On-Chip Caches (Intel/HP, 2002)
Level 1: 16KB, 4-way s.a., 64B line, quad-port (2 load + 2 store), single-cycle latency
Level 2: 256KB, 4-way s.a., 128B line, quad-port (4 load or 4 store), five-cycle latency
Level 3: 3MB, 12-way s.a., 128B line, single 32B port, twelve-cycle latency
If two is good, then three must be better (jse)

11 Review: Cache performance
Miss-oriented approach to memory access:
  CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x Cycle time
Separating out the memory component entirely (AMAT = Average Memory Access Time):
  CPU time = IC x (CPI_ALUOps + Memory accesses per instruction x AMAT) x Cycle time
  AMAT = Hit time + Miss rate x Miss penalty

12 What is Cache Impact on Performance?
Suppose a processor executes at
- Clock rate = 200 MHz (5 ns per cycle), ideal (no misses) CPI = 1.1
- 50% arith/logic, 30% ld/st, 20% control
Miss behavior:
- 10% of memory operations get a 50-cycle miss penalty
- 1% of instructions get the same miss penalty
CPI = ideal CPI + average stalls per instruction
    = 1.1 (cycles/ins)
      + [0.30 (DataMops/ins) x 0.10 (miss/DataMop) x 50 (cycle/miss)]
      + [1 (InstMop/ins) x 0.01 (miss/InstMop) x 50 (cycle/miss)]
    = (1.1 + 1.5 + 0.5) cycle/ins = 3.1
About 65% of the time (2.0 of the 3.1 cycles/ins) the processor is stalled waiting for memory!
AMAT = (1/1.3)x[1 + 0.01x50] + (0.3/1.3)x[1 + 0.1x50] = 2.54

13 What is impact of Harvard Architecture?
Unified vs. separate I&D (Harvard)
- Statistics (given in H&P): 16KB I & 16KB D: inst miss rate = 0.64%, data miss rate = 6.47%; 32KB unified: aggregate miss rate = 1.99%
Which is better (ignore L2 cache)?
- Assume 33% data ops => 75% of accesses are instruction fetches (1.0/1.33)
- Hit time = 1, miss time = 50
- Note that a data hit has 1 extra stall cycle for the unified cache (only one port)
AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
(Organizations compared: Proc -> I-Cache-1 / D-Cache-1 -> Cache-2, vs. Proc -> Unified Cache-1 -> Cache-2)

14 Recall: Reducing Misses
Classifying Misses: 3 Cs
- Compulsory—The first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses in even an infinite cache)
- Capacity—If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative, size X cache)
- Conflict—If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory & capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an N-way associative, size X cache)
More recent, 4th "C":
- Coherence—Misses caused by cache coherence.
(Intuitive model by Mark Hill)

15 Review: 6 Basic Cache Optimizations
Reducing hit time
1. Avoiding address translation during cache indexing (e.g., overlap TLB and cache access, virtually addressed caches)
Reducing miss penalty
2. Giving reads priority over writes (e.g., read completes before earlier writes in the write buffer)
3. Multilevel caches
Reducing miss rate
4. Larger block size (compulsory misses)
5. Larger cache size (capacity misses)
6. Higher associativity (conflict misses)

16 1. Two options for avoiding translation:
Conventional organization: CPU -> VA -> TB -> PA -> tags/$ -> PA -> MEM
Option A: overlap $ access with VA translation; requires the $ index to remain invariant across translation (CPU -> $ and TB accessed in parallel -> L2 $ -> MEM)
Option B: virtually addressed cache; translate only on a miss; synonym problem (CPU -> $ with VA tags -> TB -> MEM)

17 Virtually Addressed Caches (Details)
Send the virtual address to the cache? Called a virtually addressed cache or just virtual cache, vs. a physical cache
- Every time a process is switched, logically we must flush the cache; otherwise we get false hits
  - Cost is the time to flush + "compulsory" misses from an empty cache
- Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
- I/O must interact with the cache, so it needs virtual addresses
Solution to aliases
- HW guarantees that aliases agree in the index field (index covered by the page offset, or direct mapped), so each block must be unique; called page coloring
Solution to cache flush
- Add a process-identifier tag that identifies the process as well as the address within the process: can't get a hit if the process is wrong

18 2. Read Priority over Write on Miss
A write buffer is needed between the cache and memory (Processor -> Cache -> Write Buffer -> DRAM)
- Processor: writes data into the cache and the write buffer
- Memory controller: writes the contents of the buffer to memory
The write buffer is just a FIFO
- Typical number of entries: 4
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
- Must handle burst behavior as well!
You are right, memory is too slow. We really don't write to the memory directly; we are writing to a write buffer. Once the data is written into the write buffer, and assuming a cache hit, the CPU is done with the write. The memory controller will then move the write buffer's contents to the real memory behind the scenes. The write buffer works as long as the frequency of stores is not too high. Notice here, I am referring to the frequency with respect to time, not with respect to the number of instructions. Remember the DRAM cycle time we talked about last time: it sets the upper limit on how frequently you can write to main memory. If the stores are too close together, or the CPU is so much faster than the DRAM cycle time, you can end up overflowing the write buffer and the CPU must stop and wait.

19 RAW Hazards from Write Buffer!
Write-buffer issues: could introduce a RAW hazard with memory!
- The write buffer may contain the only copy of valid data => reads to memory may get the wrong result if we ignore the write buffer
Solutions:
- Simply wait for the write buffer to empty before servicing reads: might increase the read miss penalty (old MIPS 1000 by 50%)
- Check the write buffer contents before the read ("fully associative"); if there are no conflicts, let the memory access continue, else grab the data from the buffer
Can the write buffer help with write back?
- Read miss replacing a dirty block: copy the dirty block to the write buffer while starting the read to memory
(Figure: with the buffer, the processor's read can be issued to DRAM ahead of the buffered write instead of waiting behind it.)
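As an illustration only (not the lecture's code; the 4-entry depth is from the previous slide, everything else is an assumption), a FIFO write buffer with the associative read check might look like this:

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4                 /* typical depth from the slide */

typedef struct { bool valid; uint32_t addr; uint32_t data; } WBEntry;

typedef struct {
    WBEntry entry[WB_ENTRIES];
    int head, tail, count;           /* FIFO bookkeeping */
} WriteBuffer;

/* Processor side: enqueue a store (caller must stall if the buffer is full). */
bool wb_push(WriteBuffer *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES) return false;
    wb->entry[wb->tail] = (WBEntry){ true, addr, data };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* Read-miss side: check every entry ("fully associative" search).  If the
   address hits in the buffer, forward the newest matching data instead of
   reading stale memory -- this is how the RAW hazard is avoided. */
bool wb_forward(const WriteBuffer *wb, uint32_t addr, uint32_t *data_out) {
    bool hit = false;
    for (int i = 0, idx = wb->head; i < wb->count; i++, idx = (idx + 1) % WB_ENTRIES) {
        if (wb->entry[idx].valid && wb->entry[idx].addr == addr) {
            *data_out = wb->entry[idx].data;   /* later entries override earlier ones */
            hit = true;
        }
    }
    return hit;
}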

20 12 Advanced Cache Optimizations
Reducing hit time
- Small and simple caches
- Way prediction
- Trace caches
Increasing cache bandwidth
- Pipelined caches
- Multibanked caches
- Nonblocking caches
Reducing miss penalty
- Critical word first
- Merging write buffers
Reducing miss rate
- Victim cache
- Hardware prefetching
- Compiler prefetching
- Compiler optimizations

21 1. Fast Hit times via Small and Simple Caches
Indexing tag memory and then comparing takes time
A small cache can help hit time since a smaller memory takes less time to index
- E.g., L1 caches are the same size for 3 generations of AMD microprocessors: K6, Athlon, and Opteron
- Also, an L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip
Simple => direct mapping
- Can overlap the tag check with data transmission since there is no choice of way
Access time estimates for 90 nm using the CACTI model 4.0
- Median ratios of access time relative to the direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches

22 Recall: Set Associative Cache
N-way set associative: N entries for each cache index
- N direct-mapped caches operating in parallel
Example: two-way set associative cache
- The cache index selects a "set" from the cache
- The two tags in the set are compared to the input in parallel
- Data is selected based on the tag result
Disadvantage: time to set the mux
(Figure: two-way set-associative cache with valid bits, tags, data arrays, two comparators, an OR gate producing Hit, and a mux (Sel0/Sel1) selecting the cache block.)
This is called a 2-way set associative cache because there are two cache entries for each cache index. Essentially, you have two direct-mapped caches working in parallel. This is how it works: the cache index selects a set from the cache. The two tags in the set are compared in parallel with the upper bits of the memory address. If neither tag matches the incoming address tag, we have a cache miss. Otherwise, we have a cache hit and we will select the data on the side where the tag match occurs. This is simple enough. What is its disadvantage?

23 2. Fast Hit times via Way Prediction
How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way SA cache?
Way prediction: keep extra bits in the cache to predict the "way," or block within the set, of the next cache access
- The multiplexor is set early to select the desired block; only 1 tag comparison is performed that clock cycle, in parallel with reading the cache data
- Miss => check the other blocks for matches in the next clock cycle
- Accuracy ~= 85%
- Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
  - Used for instruction caches vs. data caches
  - Also used on the MIPS R10K for the off-chip L2 unified cache, with the way-prediction table on-chip
(Latencies: hit time, way-miss hit time, miss penalty)
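A minimal lookup sketch (not the lecture's code; set count, line layout, and the retraining policy are assumptions of mine) showing how the predicted way is tried first and the other way only on a mispredict:

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS 256
#define WAYS     2

typedef struct { bool valid; uint32_t tag; uint32_t data[16]; } Line;

static Line    cache[NUM_SETS][WAYS];
static uint8_t predicted_way[NUM_SETS];      /* the extra prediction bits per set */

/* Returns latency in cycles: 1 on a predicted-way hit, 2 on a way mispredict
   that still hits, -1 on a real miss (fill logic omitted). */
int cache_read(uint32_t set, uint32_t tag, uint32_t *data_out) {
    int w = predicted_way[set];
    if (cache[set][w].valid && cache[set][w].tag == tag) {
        *data_out = cache[set][w].data[0];   /* mux was set early: fast hit */
        return 1;
    }
    int other = 1 - w;                       /* check the other way next cycle */
    if (cache[set][other].valid && cache[set][other].tag == tag) {
        *data_out = cache[set][other].data[0];
        predicted_way[set] = (uint8_t)other; /* retrain the predictor */
        return 2;                            /* slow hit */
    }
    return -1;                               /* real miss: go to the next level */
}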

24 Way Predicting Caches (MIPS R10000 L2 cache)
Use the processor address to index into the way-prediction table
Look in the predicted way at the given index, then:
- HIT: return a copy of the data from the cache
- MISS: look in the other way
  - SLOW HIT: (change the entry in the prediction table)
  - MISS: read the block of data from the next level of cache
Prediction for data is hard, but what about the instruction cache? (jse)

25 Way Predicting Instruction Cache (Alpha 21264-like)
(Figure: primary instruction cache indexed by the PC (addr/inst ports, PC + 0x4 adder, jump target and jump control inputs); each line stores a predicted way for the sequential successor and for the branch target.)
New slide (jse). If the way prediction is wrong, we need to retrain and reaccess.

26 3. Fast (Instruction Cache) Hit times via Trace Cache
Key idea: pack multiple non-contiguous basic blocks into one contiguous trace cache line
- A single fetch brings in multiple basic blocks (the trace spans several branches: BR ... BR ... BR)
- The trace cache is indexed by the start address and the next n branch predictions

27 3. Fast Hit times via Trace Cache (Pentium 4 only; and last time?)
Find more instruction-level parallelism? How to avoid translation from x86 to micro-ops?
Trace cache in the Pentium 4:
1. Dynamic traces of the executed instructions vs. static sequences of instructions as determined by layout in memory
   - Built-in branch predictor
2. Cache the micro-ops vs. x86 instructions
   - Decode/translate from x86 to micro-ops on a trace cache miss
+ Better utilizes long blocks (don't exit in the middle of a block, don't enter at a label in the middle of a block)
- Complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size
- Instructions may appear multiple times in multiple dynamic traces due to different branch outcomes

28 4: Increasing Cache Bandwidth by Pipelining
Pipeline cache access to maintain bandwidth, but with higher latency
Instruction cache access pipeline stages:
- 1: Pentium
- 2: Pentium Pro through Pentium III
- 4: Pentium 4
=> greater penalty on mispredicted branches
=> more clock cycles between the issue of the load and the use of the data
Answer is 3 stages between branch and new instruction fetch and 2 stages between load and use (even though, if you looked at the red insertions, it would be 3 for load and 2 for branch). Reasons: 1) Load: TC just does the tag check, data is available after DS; thus supply the data and forward it, restarting the pipeline on a data cache miss. 2) The EX phase does the address calculation even though just one phase was added; the presumed reason is that, wanting a fast clock cycle, you don't want to stick the RF phase with reading registers AND testing for zero, so it was moved back one phase.

29 5. Increasing Cache Bandwidth: Non-Blocking Caches
A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
- Requires Full/Empty bits on registers or out-of-order execution
- Requires multi-bank memories
"Hit under miss" reduces the effective miss penalty by working during the miss vs. ignoring CPU requests
"Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
- Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
- Requires multiple memory banks (otherwise it cannot be supported)
- The Pentium Pro allows 4 outstanding memory misses
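The usual bookkeeping structure for this (not named on the slide, but the standard mechanism) is a set of miss-status holding registers (MSHRs). A minimal sketch, with the 4-entry limit borrowed from the Pentium Pro example and everything else assumed:

#include <stdint.h>
#include <stdbool.h>

#define MAX_OUTSTANDING 4            /* e.g., Pentium Pro allows 4 outstanding misses */

typedef struct {
    bool     valid;
    uint32_t block_addr;             /* miss in flight for this block */
    uint8_t  dest_reg;               /* where to deliver the data on fill */
} MSHR;

static MSHR mshr[MAX_OUTSTANDING];

/* On a cache miss: allocate an MSHR so the cache can keep servicing hits.
   Returns false if all MSHRs are busy, in which case the pipeline stalls. */
bool allocate_mshr(uint32_t block_addr, uint8_t dest_reg) {
    for (int i = 0; i < MAX_OUTSTANDING; i++) {
        if (!mshr[i].valid) {
            mshr[i] = (MSHR){ true, block_addr, dest_reg };
            return true;             /* miss is now outstanding; later hits proceed */
        }
    }
    return false;                    /* structural stall: too many misses in flight */
}

/* When memory returns a block, find and free the matching MSHR. */
int complete_fill(uint32_t block_addr) {
    for (int i = 0; i < MAX_OUTSTANDING; i++) {
        if (mshr[i].valid && mshr[i].block_addr == block_addr) {
            mshr[i].valid = false;
            return mshr[i].dest_reg; /* wake up the waiting instruction */
        }
    }
    return -1;
}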

30 Value of Hit Under Miss for SPEC (old data)
(Chart: "Hit under n misses" relative to the base blocking cache, for n = 1, 2, and 64 outstanding misses, per SPEC92 benchmark.)
- FP programs on average: miss penalty = 0.68 -> 0.52 -> 0.34 -> 0.26
- Int programs on average: miss penalty = 0.24 -> 0.20 -> 0.19 -> 0.19
8 KB data cache, direct mapped, 32B block, 16-cycle miss, SPEC 92

31 6: Increasing Cache Bandwidth via Multiple Banks
Rather than treat the cache as a single monolithic block, divide it into independent banks that can support simultaneous accesses
- E.g., the T1 ("Niagara") L2 has 4 banks
Banking works best when accesses naturally spread themselves across banks => the mapping of addresses to banks affects the behavior of the memory system
A simple mapping that works well is "sequential interleaving"
- Spread block addresses sequentially across banks
- E.g., if there are 4 banks, bank 0 has all blocks whose address modulo 4 is 0; bank 1 has all blocks whose address modulo 4 is 1; and so on (see the one-line mapping below)
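A sketch of that sequential-interleaving mapping (the bank count is the slide's 4-bank example; the helper names are mine):

#include <stdint.h>

#define NUM_BANKS 4                  /* e.g., the T1 ("Niagara") L2 has 4 banks */

/* Sequential interleaving: consecutive block addresses go to consecutive banks. */
static inline unsigned bank_of(uint64_t block_addr) {
    return (unsigned)(block_addr % NUM_BANKS);   /* bank = block address mod 4 */
}

/* Index of the block within its bank (the rest of the address). */
static inline uint64_t index_in_bank(uint64_t block_addr) {
    return block_addr / NUM_BANKS;
}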

32 7. Reduce Miss Penalty: Early Restart and Critical Word First
Don't wait for the full block before restarting the CPU
- Early restart—As soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - Spatial locality => we tend to want the next sequential word, so the benefit of just early restart is not clear
- Critical word first—Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block
Long blocks are more popular today => critical word first is widely used
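A toy sketch of the wrap-around fill order this implies (block size and helper names are illustrative, not from the slide):

#include <stdint.h>

#define WORDS_PER_BLOCK 8

/* Assumed to exist elsewhere in this sketch: fetch one word from the next
   memory level, and deliver a word to the CPU / write it into the cache line. */
extern uint32_t memory_fetch_word(uint32_t block_addr, unsigned word);
extern void     deliver_word(unsigned word, uint32_t value);

/* Critical word first: start with the word the CPU actually missed on,
   restart the CPU as soon as it arrives, then wrap around to fill the rest. */
void fill_block_critical_first(uint32_t block_addr, unsigned critical_word) {
    for (unsigned i = 0; i < WORDS_PER_BLOCK; i++) {
        unsigned w = (critical_word + i) % WORDS_PER_BLOCK;  /* wrap-around order */
        uint32_t v = memory_fetch_word(block_addr, w);
        deliver_word(w, v);   /* on i == 0 the CPU can already continue (early restart) */
    }
}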

33 8. Merging Write Buffer to Reduce Miss Penalty
The write buffer allows the processor to continue while waiting for writes to reach memory
If the buffer contains modified blocks, the addresses can be checked to see if the address of the new data matches the address of a valid write buffer entry
- If so, the new data are combined with that entry
- This increases the effective size of each write for a write-through cache of writes to sequential words or bytes, since multiword writes are more efficient to memory
The Sun T1 (Niagara) processor, among many others, uses write merging
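A minimal sketch of the merge check, under assumptions of my own (word-granularity valid bits, 4 entries of 4 words each):

#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES      4
#define WORDS_PER_ENTRY 4            /* one entry covers an aligned 16-byte region */

typedef struct {
    bool     valid;
    uint32_t base;                           /* aligned address of the region */
    uint32_t word[WORDS_PER_ENTRY];
    bool     word_valid[WORDS_PER_ENTRY];    /* which words have been written */
} MergeEntry;

static MergeEntry wbuf[WB_ENTRIES];

/* Try to merge a store into an existing entry; return false if a new entry
   (or a stall, if the buffer is full) is needed instead. */
bool write_merge(uint32_t addr, uint32_t data) {
    uint32_t base   = addr & ~(uint32_t)(WORDS_PER_ENTRY * 4 - 1);
    unsigned offset = (addr - base) / 4;
    for (int i = 0; i < WB_ENTRIES; i++) {
        if (wbuf[i].valid && wbuf[i].base == base) {
            wbuf[i].word[offset]       = data;    /* combine with the existing entry */
            wbuf[i].word_valid[offset] = true;
            return true;                          /* no new buffer slot consumed */
        }
    }
    return false;
}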

34 9. Reducing Misses: a “Victim Cache”
How to combine the fast hit time of direct mapped and yet still avoid conflict misses?
- Add a buffer to hold data discarded from the cache
- Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
- Used in Alpha, HP machines
(Figure: small fully associative victim cache, each entry with a tag + comparator and one cache line of data, sitting between the cache and the next lower level in the hierarchy.)
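Illustrative only (the 4-entry size is from Jouppi's example; the rest is my own sketch): on a miss in the direct-mapped cache, probe the victim cache and, on a hit, swap the two lines:

#include <stdint.h>
#include <stdbool.h>

#define VC_ENTRIES 4

typedef struct { bool valid; uint32_t tag; uint32_t data[8]; } Line;

static Line victim[VC_ENTRIES];      /* small, fully associative */

/* Probe the victim cache after a main-cache miss.  On a hit, swap the victim
   line with the line being evicted from the main cache, so a conflicting pair
   ping-pongs between the two structures instead of going out to memory. */
bool victim_probe(uint32_t tag, Line *evicted_from_main, Line *line_out) {
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (victim[i].valid && victim[i].tag == tag) {
            *line_out = victim[i];               /* give the requested line back */
            victim[i] = *evicted_from_main;      /* keep the newly evicted line */
            return true;                         /* conflict miss avoided */
        }
    }
    return false;                                /* go to the next level as usual */
}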

35 10. Reducing Misses by Hardware Prefetching of Instructions & Data
Prefetching relies on having extra memory bandwidth that can be used without penalty
Instruction prefetching
- Typically, the CPU fetches 2 blocks on a miss: the requested block and the next consecutive block
- The requested block is placed in the instruction cache when it returns, and the prefetched block is placed into the instruction stream buffer
Data prefetching
- The Pentium 4 can prefetch data into the L2 cache from up to 8 streams from 8 different 4 KB pages
- Prefetching is invoked if there are 2 successive L2 cache misses to a page and the distance between those cache blocks is < 256 bytes

36 Issues in Prefetching
- Usefulness: should produce hits
- Timeliness: not late and not too early
- Cache and bandwidth pollution
(Figure: CPU with register file, L1 instruction and L1 data caches, unified L2 cache; prefetched data placed near the L1/L2 boundary.)

37 Hardware Instruction Prefetching
Instruction prefetch in the Alpha AXP 21064
- Fetch two blocks on a miss: the requested block (i) and the next consecutive block (i+1)
- The requested block is placed in the cache, and the next block in the instruction stream buffer
- If a miss in the cache hits in the stream buffer, move the stream buffer block into the cache and prefetch the next block (i+2)
(Figure: CPU/RF, L1 instruction cache, stream buffer, unified L2 cache; paths for the requested block and the prefetched block.)
Need to check the stream buffer for whether the requested block is in there. Never more than one 32-byte block in the stream buffer.

38 Hardware Data Prefetching
Prefetch-on-miss: prefetch block b+1 upon a miss on block b
One Block Lookahead (OBL) scheme
- Initiate a prefetch for block b+1 when block b is accessed
- Why is this different from doubling the block size?
- Can extend to N-block lookahead
Strided prefetch
- If we observe a sequence of accesses to blocks b, b+N, b+2N, then prefetch b+3N, etc.
Example: IBM Power 5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access
HP PA 7200 uses OBL prefetching
Tagged prefetch is twice as effective as prefetch-on-miss in reducing miss rates.
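A sketch of a simple stride detector along these lines (the table layout, sizes, and names are my own assumptions, not the lecture's):

#include <stdint.h>

#define RPT_ENTRIES 64               /* reference prediction table, indexed by load PC */

typedef struct {
    uint64_t last_addr;              /* address of the previous access by this load */
    int64_t  stride;                 /* last observed stride */
    int      confidence;             /* how many times the stride repeated */
} RPTEntry;

static RPTEntry rpt[RPT_ENTRIES];

/* Assumed to exist elsewhere in this sketch. */
extern void issue_prefetch(uint64_t addr);

/* Called on every load: learn the stride; once it repeats, prefetch ahead. */
void stride_prefetch(uint64_t load_pc, uint64_t addr) {
    RPTEntry *e = &rpt[(load_pc >> 2) % RPT_ENTRIES];
    int64_t stride = (int64_t)addr - (int64_t)e->last_addr;

    if (stride != 0 && stride == e->stride) {
        if (e->confidence < 3) e->confidence++;
        if (e->confidence >= 2)
            issue_prefetch(addr + (uint64_t)stride);  /* b, b+N, b+2N seen => fetch b+3N */
    } else {
        e->stride     = stride;      /* retrain on a new stride */
        e->confidence = 0;
    }
    e->last_addr = addr;
}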

39 11. Reducing Misses by Software Prefetching Data
Data prefetch
- Load data into a register (HP PA-RISC loads)
- Cache prefetch: load into the cache (MIPS IV, PowerPC, SPARC v.9)
- Special prefetching instructions cannot cause faults; a form of speculative execution
Issuing prefetch instructions takes time
- Is the cost of prefetch issues < the savings in reduced misses?
- Higher superscalar width reduces the difficulty of issue bandwidth
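For a concrete flavor (not from the slide), here is how a compiler or programmer might insert non-faulting cache-prefetch hints using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance of 16 elements is an arbitrary choice:

/* Sum an array while prefetching ahead.  __builtin_prefetch(addr, rw, locality)
   is only a hint: it cannot fault, and the hardware may drop it. */
double sum_with_prefetch(const double *a, long n) {
    const long DIST = 16;            /* prefetch ~16 elements (a couple of lines) ahead */
    double s = 0.0;
    for (long i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], /*rw=*/0, /*locality=*/1);
        s += a[i];
    }
    return s;
}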

40 12. Reducing Misses by Compiler Optimizations
McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks, in software
Instructions
- Reorder procedures in memory so as to reduce conflict misses
- Profiling to look at conflicts (using tools they developed)
Data
- Merging arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
- Loop interchange: change the nesting of loops to access data in the order stored in memory
- Loop fusion: combine 2 independent loops that have the same looping and some variables overlap
- Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

41 Merging Arrays Example
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduces conflicts between val & key; improves spatial locality

42 Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];

/* After (i and j loops interchanged so the innermost loop walks along a row) */
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];

Sequential accesses instead of striding through memory every 100 words; improved spatial locality

43 Loop Fusion Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j]; }

2 misses per access to a & c vs. one miss per access; improves temporal locality

44 Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
  { r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  };

Two inner loops:
- Read all NxN elements of z[]
- Read N elements of 1 row of y[] repeatedly
- Write N elements of 1 row of x[]
Capacity misses are a function of N & cache size:
- 2N^3 + N^2 words accessed => (assuming no conflict; otherwise ...)
Idea: compute on a BxB submatrix that fits

45 Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1)
      { r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      };

B is called the blocking factor
Capacity misses drop from 2N^3 + N^2 to 2N^3/B + N^2
Conflict misses too?

46 Reducing Conflict Misses by Blocking
Conflict misses in caches that are not fully associative, vs. blocking size
- Lam et al. [1991]: a blocking factor of 24 had a fifth the misses of a factor of 48, despite both fitting in the cache

47 Summary of Compiler Optimizations to Reduce Cache Misses (by hand)

48 Impact of Hierarchy on Algorithms
Today CPU time is a function of (ops, cache misses); what does this mean to compilers, data structures, algorithms?
- Quicksort: fastest comparison-based sorting algorithm when keys fit in memory
- Radix sort: also called "linear time" sort; for keys of fixed length and fixed radix, a constant number of passes over the data is sufficient, independent of the number of keys
"The Influence of Caches on the Performance of Sorting" by A. LaMarca and R.E. Ladner. Proceedings of the Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1997.
- For an Alphastation 250: 32-byte blocks, direct-mapped 2MB L2 cache, 8-byte keys, from 4000 to ... keys
Let's do a short review of what you learned last time. Virtual memory was originally invented as another level of the memory hierarchy such that programmers, faced with main memory much smaller than their programs, do not have to manage the loading and unloading of portions of their program in and out of memory. It was a controversial proposal at the time because very few programmers believed software could manage the limited amount of memory resource as well as a human. This all changed as DRAM sizes grew exponentially over the last few decades. Nowadays, the main function of virtual memory is to allow multiple processes to share the same main memory so we don't have to swap all the non-active processes to disk. Consequently, the most important function of virtual memory these days is to provide memory protection. The most common technique, though we like to emphasize not the only technique, to translate virtual memory addresses to physical memory addresses is to use a page table. The TLB, or translation lookaside buffer, is one of the most popular hardware techniques to reduce address translation time. Since the TLB is so effective in reducing address translation time, this means that TLB misses will have a significant negative impact on processor performance.

49 Quicksort vs. Radix: Instructions
(Chart: instructions per key vs. job size in keys, for Quicksort vs. Radix sort.)

50 Quicksort vs. Radix Inst & Time
(Chart: instructions and time per key vs. job size in keys, for Quicksort vs. Radix sort.)

51 Quicksort vs. Radix: Cache misses
(Chart: cache misses per key vs. job size in keys, for Quicksort vs. Radix sort.)

52 Experimental Study (Membench)
Microbenchmark for memory system performance:
for array A of length L from 4KB to 8MB by 2x
  for stride s from 4 bytes (1 word) to L/2 by 2x
    time the following loop (repeat many times and average):
      for i from 0 to L by s
        load A[i] from memory (4 bytes)
(One experiment = one (L, s) pair.)
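A self-contained C version of this microbenchmark might look like the sketch below (the timer choice, repetition count, and the checksum trick to defeat dead-code elimination are my assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Time one (length, stride) experiment: walk `len` bytes with stride `s`. */
static double time_loop(const char *a, long len, long s, long repeats) {
    struct timespec t0, t1;
    long sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long r = 0; r < repeats; r++)
        for (long i = 0; i < len; i += s)
            sink += *(const volatile int *)(a + i);   /* the "load A[i]" (4 bytes) */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (sink == 42) printf(" ");                      /* keep the loads from being optimized away */
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)(repeats * (len / s));        /* average ns per load */
}

int main(void) {
    const long MAX = 8L << 20;                        /* 8 MB */
    char *a = malloc(MAX);
    if (!a) return 1;
    for (long i = 0; i < MAX; i++) a[i] = (char)i;
    for (long len = 4L << 10; len <= MAX; len *= 2)   /* 4 KB .. 8 MB by 2x */
        for (long s = 4; s <= len / 2; s *= 2)        /* 4 B .. L/2 by 2x */
            printf("L=%ld s=%ld ns/load=%.2f\n", len, s, time_loop(a, len, s, 100));
    free(a);
    return 0;
}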

53 Membench: What to Expect
Consider the average cost per load
- Plot one line for each array length: time vs. stride
- Small stride is best: if a cache line holds 4 words, at most 1/4 of accesses miss
- If the array is smaller than a given cache, all those accesses will hit (after the first run, which is negligible for large enough runs)
- The picture assumes only one level of cache
- Values have gotten more difficult to measure on modern processors
(Sketch: average cost per access vs. stride s; flat at the L1 hit time when the total size < L1, rising toward memory time when the size > L1.)

54 Memory Hierarchy on a Sun Ultra-2i
Sun Ultra-2i, 333 MHz
- L1: 16 KB, 16 B line, 2 cycles (6 ns)
- L2: 2 MB, 64 B line, 12 cycles (36 ns)
- Mem: 396 ns (132 cycles)
- 8 KB pages, 32 TLB entries
Each symbol is an experiment; the vertical axis is time per load
- Experiments with the same array length share a symbol and color and are joined by a line; such experiments differ only in stride in bytes (horizontal axis); note that the minimum stride = 4 bytes = 1 word
Main observation 1: complicated behavior, depending on array length and stride
Main observation 2: time/load ranges from 6 ns to 396 ns, a 66x difference - important!
Detail 1: the bottom 3 lines (4KB to 16KB) take 6 ns = 2 cycles, but 32KB and larger take longer; deduce that the L1 cache is 16 KB and takes 2 cycles/word (yes, according to the hardware manual)
Detail 2: the next 6 lines (32KB to 1MB) take 36 ns = 12 cycles for small stride (2MB a little more), but 4MB takes much longer; deduce that the L2 cache is 2MB and takes 12 cycles/word (yes)
Detail 3: the remaining lines take 396 ns = 132 cycles for medium stride; deduce that main memory takes 132 cycles/word
Detail 4: look at the 32KB to 1MB lines: speed halves from stride 4 to 8 to 16, then is constant; deduce that the L1 line size is 16 bytes
Detail 5: look at the 4MB to 64MB lines: speed halves from stride 4 to 8 to ... to 64, then is constant; deduce that the L2 line size is 64 bytes
Detail 6: look at the 32KB-256KB lines vs. the 512KB-2MB lines: the latter keep increasing up to a stride of 8KB; deduce that the page size is 8 KB and the TLB has 256KB/8KB = 32 entries

55 Memory Hierarchy on a Power3
Power3, 375 MHz
- L1: 32 KB, 128 B line, 0.5-2 cycles
- L2: 8 MB, 128 B line, 9 cycles
- Mem: 396 ns (132 cycles)
(Chart: time per load vs. stride for each array size, as on the previous slide.)

56 Compiler Optimization vs. Memory Hierarchy Search
The compiler tries to figure out memory hierarchy optimizations
New approach: "auto-tuners"
- First run variations of the program on the computer to find the best combinations of optimizations (blocking, padding, ...) and algorithms, then produce C code to be compiled for that computer
- "Auto-tuner" targeted to a numerical method, e.g., PHiPAC (BLAS), Atlas (BLAS), Sparsity (sparse linear algebra), Spiral (DSP), FFTW
Autotuners is fine. I had a slightly longer list I was working on before seeing Krste's mail.
- PHiPAC: dense linear algebra (Krste, Jim, Jeff Bilmes and others at UCB); PHiPAC = Portable High Performance ANSI C
- FFTW: Fastest Fourier Transform in the West (from Matteo Frigo and Steve Johnson at MIT; Matteo is now at IBM)
- Atlas: dense linear algebra, now the "standard" for many BLAS implementations; used in Matlab, for example (Jack Dongarra, Clint Whaley et al.)
- Sparsity: sparse linear algebra (Eun-Jin Im and Kathy at UCB)
- Spiral: DSP algorithms including FFTs and other transforms (Markus Pueschel, Jose M. F. Moura et al.)
- OSKI: sparse linear algebra (Rich Vuduc, Jim and Kathy, from the Bebop project at UCB)
In addition there are groups at Rice, USC, UIUC, Cornell, UT Austin, UCB (Titanium), LLNL and others working on compilers that include an auto-tuning (search-based) optimization phase. Both the Bebop group and the Atlas group have done work on automatic tuning of collective communication routines for supercomputers/clusters, but this is ongoing. I'll send a slide with an autotuning example later. Kathy

57 Sparse Matrix – Search for Blocking
For a finite element problem [Im, Yelick, Vuduc, 2005]: searching over register block sizes, the best is 4x2 (chart of Mflop/s for each block size vs. the reference, unblocked Mflop/s).
[NOTE: This slide has some animation in it.] Consider the following experiment, in which we implement SpMV using BCSR format for the matrix shown on the previous slide at all block sizes that divide 8x8 (16 implementations in all). These implementations fully unroll the innermost loop and use scalar replacement for the source and destination vectors. You might reasonably expect performance to increase relatively smoothly as r and c increase, but this is clearly not the case! Platform: 900 MHz Itanium-2, 3.6 Gflop/s peak speed, Intel v8.0 compiler. Good speedups (4x), but at an unexpected block size (4x2). Figure taken from the Im, Yelick, Vuduc IJHPCA 2005 paper.
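To make "register blocking" concrete, here is a small sketch of a BCSR (block compressed sparse row) SpMV kernel with a fixed 4x2 block size, a fully unrolled block multiply, and scalar replacement, in the spirit of the experiment described above (the data layout and names are my own, not the paper's code):

/* y += A*x for A stored in 4x2 BCSR: each stored block is 4 rows x 2 cols,
   laid out row-major in `val`.  brow_ptr/bcol_idx index blocks, not scalars. */
void spmv_bcsr_4x2(int n_block_rows,
                   const int *brow_ptr,      /* n_block_rows+1 entries */
                   const int *bcol_idx,      /* block column index of each block */
                   const double *val,        /* 8 doubles per block */
                   const double *x, double *y)
{
    for (int bi = 0; bi < n_block_rows; bi++) {
        /* scalar replacement: accumulate the 4 destination rows in registers */
        double y0 = 0, y1 = 0, y2 = 0, y3 = 0;
        for (int b = brow_ptr[bi]; b < brow_ptr[bi + 1]; b++) {
            const double *v = &val[8 * b];
            const double x0 = x[2 * bcol_idx[b]];
            const double x1 = x[2 * bcol_idx[b] + 1];
            /* fully unrolled 4x2 block multiply */
            y0 += v[0] * x0 + v[1] * x1;
            y1 += v[2] * x0 + v[3] * x1;
            y2 += v[4] * x0 + v[5] * x1;
            y3 += v[6] * x0 + v[7] * x1;
        }
        y[4 * bi + 0] += y0;
        y[4 * bi + 1] += y1;
        y[4 * bi + 2] += y2;
        y[4 * bi + 3] += y3;
    }
}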

58 Best Sparse Blocking for 8 Computers
(Chart: best sparse block size, row block size r x column block size c with r, c in {1, 2, 4, 8}, for 8 computers: Intel Pentium M; Sun Ultra 2; Sun Ultra 3; AMD Opteron; IBM Power 4; Intel/HP Itanium; Intel/HP Itanium 2; IBM Power 3.)
All possible column block sizes were selected across the 8 computers; how could a compiler know which to pick?

59 Summary of the 12 Advanced Cache Optimizations
(Original table columns: technique; which of hit time, bandwidth, miss penalty, miss rate it improves; HW cost/complexity; comment.)
- Small and simple caches: hit time; trivial, widely used
- Way-predicting caches: hit time; cost 1; used in Pentium 4
- Trace caches: hit time; cost 3
- Pipelined cache access: bandwidth; widely used
- Nonblocking caches: bandwidth
- Banked caches: bandwidth; used in L2 of Opteron and Niagara
- Critical word first and early restart: miss penalty; cost 2
- Merging write buffer: miss penalty; widely used with write through
- Victim caches: miss rate; fairly simple and common
- Compiler techniques to reduce cache misses: miss rate; software is a challenge, some computers have a compiler option
- Hardware prefetching of instructions and data: miss penalty and miss rate; cost 2 instr., 3 data; many prefetch instructions, AMD Opteron prefetches data
- Compiler-controlled prefetching: miss penalty and miss rate; needs a nonblocking cache, in many CPUs

60 Conclusion
The memory wall inspires optimizations since so much performance is lost there
- Reducing hit time: small and simple caches, way prediction, trace caches
- Increasing cache bandwidth: pipelined caches, multibanked caches, nonblocking caches
- Reducing miss penalty: critical word first, merging write buffers
- Reducing miss rate: compiler optimizations
- Reducing miss penalty or miss rate via parallelism: hardware prefetching, compiler prefetching
Actual performance of a simple program can be a complicated function of the architecture
- To write fast programs, we need to consider the architecture; true on a sequential or parallel processor
- We would like simple models to help us design efficient algorithms
"Auto-tuners": search replacing static compilation to explore the optimization space?

