Motivation
Want memory to appear:
- As fast as the CPU
- As large as required by all of the running applications
Storage Hierarchy
What is S(tatic)RAM vs D(ynamic)RAM?
Make the common case fast:
- Common: temporal & spatial locality
- Fast: smaller, more expensive memory
The hierarchy, from fastest/smallest to largest/cheapest:
- Registers
- Caches (SRAM)  (controlled by hardware)
- Memory (DRAM)
- [SSD? (Flash)]
- Disk (Magnetic Media)  (controlled by software, i.e., the OS)
Moving down the hierarchy: larger, cheaper, bigger transfers. Moving up: faster, more bandwidth.
Caches
An automatically managed hierarchy (Core <-> $ <-> Memory)
- Break memory into blocks (several bytes) and transfer data to/from the cache in blocks: spatial locality
- Keep recently accessed blocks: temporal locality
Cache Terminology
- block (cache line): minimum unit that may be cached
- frame: cache storage location to hold one block
- hit: block is found in the cache
- miss: block is not found in the cache
- miss ratio: fraction of references that miss
- hit time: time to access the cache
- miss penalty: time to replace a block on a miss
Cache Example
Address sequence from the core (assume 8-byte lines):
- 0x10000  Miss  (fetch line at 0x10000)
- 0x10004  Hit   (same line as 0x10000)
- 0x10120  Miss  (fetch line at 0x10120)
- 0x10008  Miss  (fetch line at 0x10008)
- 0x10124  Hit   (same line as 0x10120)
- 0x10004  Hit
Final miss ratio is 50% (3 misses out of 6 accesses).
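To see where the 50% comes from, the sequence can be replayed in code. A minimal C sketch (the helper is my own, and it assumes a cache large enough that nothing is evicted, so only compulsory misses are modeled):

```c
#include <stdint.h>
#include <stddef.h>

// Counts misses for n accesses with 8-byte lines. Lines already
// fetched are remembered in a small array; nothing is ever evicted,
// so this only models compulsory misses -- enough for the example.
int count_misses(const uint64_t *addrs, size_t n) {
    uint64_t seen[64];
    size_t nseen = 0;
    int misses = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t line = addrs[i] >> 3;   // drop 3 offset bits (8-byte line)
        int hit = 0;
        for (size_t j = 0; j < nseen && !hit; j++)
            hit = (seen[j] == line);
        if (!hit) {
            misses++;
            if (nseen < 64) seen[nseen++] = line;
        }
    }
    return misses;
}
```

Feeding it the slide's six addresses yields 3 misses, matching the 50% miss ratio.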
AMAT (1/2)
A very powerful tool to estimate performance. If...
- a cache hit is 10 cycles (core to L1 and back)
- a memory access is 100 cycles (core to mem and back)
Then...
- at 50% miss ratio, avg. access: 0.5×10 + 0.5×100 = 55
- at 10% miss ratio, avg. access: 0.9×10 + 0.1×100 = 19
- at 1% miss ratio, avg. access: 0.99×10 + 0.01×100 ≈ 11
AMAT (2/2)
Generalizes nicely to an any-depth hierarchy. If...
- an L1 cache hit is 5 cycles (core to L1 and back)
- an L2 cache hit is 20 cycles (core to L2 and back)
- a memory access is 100 cycles (core to mem and back)
Then, at 20% miss ratio in L1 and 40% miss ratio in L2...
- avg. access: 0.8×5 + 0.2×(0.6×20 + 0.4×100) ≈ 14
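The two-level formula above can be folded into one helper. A minimal C sketch (function name and parameters are mine, not from the lecture):

```c
// AMAT for a two-level hierarchy:
//   avg = (1 - m1)*t1 + m1*((1 - m2)*t2 + m2*t_mem)
// t1/t2/t_mem are round-trip access times in cycles;
// m1/m2 are the per-level miss ratios.
double amat2(double t1, double t2, double t_mem, double m1, double m2) {
    return (1.0 - m1) * t1 + m1 * ((1.0 - m2) * t2 + m2 * t_mem);
}
```

Setting m2 = 1.0 makes the L2 term collapse to the memory access time, recovering the single-level formula from the previous slide.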
SRAM Overview
Chained inverters maintain a stable state: the "6T SRAM" cell
- 2 access gates
- 2T per inverter
The access gates provide access to the cell; writing to the cell involves over-powering the storage inverters.
Fully-Associative Cache
Keep blocks in cache frames; each frame holds state (e.g., valid), an address tag, and data.
Address split: tag[63:6], block offset[5:0].
Every frame's tag is compared against the address tag in parallel; a multiplexor selects the matching frame's data (hit?).
What happens when the cache runs out of space?
The 3 C's of Cache Misses
- Compulsory: never accessed before
- Capacity: accessed long ago and already replaced
- Conflict: neither compulsory nor capacity (later today)
- Coherence: (to appear in the multi-core lecture)
Cache Size
Cache size is data capacity (don't count tag and state).
- Bigger can exploit temporal locality better
- Not always better
Too large a cache:
- Smaller is faster, bigger is slower
- Access time may hurt the critical path
Too small a cache:
- Limited temporal locality
- Useful data constantly replaced
(figure: hit rate vs. capacity; the curve flattens once capacity exceeds the working-set size)
Block Size
Block size is the data associated with an address tag; not necessarily the unit of transfer between hierarchy levels.
Too small a block:
- Doesn't exploit spatial locality well
- Excessive tag overhead
Too large a block:
- Useless data transferred
- Too few total blocks: useful data frequently replaced
(figure: hit rate vs. block size; hit rate peaks at an intermediate block size)
Direct-Mapped Cache
- Use middle bits as the index: tag[63:16], index[15:6], block offset[5:0]
- A decoder selects one frame by index; only one tag comparison per access
Why take the index bits out of the middle?
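The address split on this slide is plain bit arithmetic. A sketch in C using the slide's field boundaries (helper names are mine):

```c
#include <stdint.h>

// Direct-mapped address decomposition:
//   offset[5:0], index[15:6], tag[63:16]
static inline uint64_t dm_offset(uint64_t a) { return a & 0x3F; }          // 6 bits
static inline uint64_t dm_index(uint64_t a)  { return (a >> 6) & 0x3FF; }  // 10 bits
static inline uint64_t dm_tag(uint64_t a)    { return a >> 16; }           // the rest
```

With 10 index bits and 64-byte blocks this describes a 64 KB direct-mapped cache; two addresses with equal index but different tags map to the same frame, which is exactly the conflict case on the next slide.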
Cache Conflicts
What if two blocks alias on a frame? Same index, but different tags.
Address sequence: 0xDEADBEEF, 0xFEEDBEEF, 0xDEADBEEF
The second access to 0xDEADBEEF experiences a Conflict miss:
- Not Compulsory (seen it before)
- Not Capacity (lots of other indexes available in the cache)
Associativity (1/2)
Where does block index 12 (0b1100) go?
- Fully-associative: the block goes in any frame (all frames in 1 set)
- Set-associative: the block goes in any frame in one set (frames grouped into sets)
- Direct-mapped: the block goes in exactly one frame (1 frame per set)
Associativity (2/2)
Larger associativity:
- lower miss rate (fewer conflicts)
- higher power consumption
Smaller associativity:
- lower cost
- faster hit time
(figure: hit rate vs. associativity; the curve flattens around ~5 ways for an L1-D)
N-Way Set-Associative Cache
Address split: tag[63:15], index[14:6], block offset[5:0].
The decoder selects one set; each of the N ways in the set holds state, tag, and data. All N tags are compared in parallel, and a multiplexor selects the hitting way's data (hit?).
Note the additional bit(s) moved from the index to the tag (vs. the direct-mapped cache).
Associative Block Replacement
Which block in a set to replace on a miss?
- Ideal replacement (Belady's Algorithm): replace the block accessed farthest in the future. Trick question: how do you implement it?
- Least Recently Used (LRU): optimized for temporal locality (expensive for >2-way)
- Not Most Recently Used (NMRU): track the MRU block, randomly select among the rest
- Random: nearly as good as LRU, sometimes better (when?)
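LRU from the list above can be sketched with per-way timestamps. This is illustrative only: real hardware uses cheaper encodings, which is why true LRU is called expensive beyond 2-way. Struct and function names are mine:

```c
#include <stdint.h>

#define WAYS 4
typedef struct {
    uint64_t tag[WAYS];
    int      valid[WAYS];
    uint64_t last_used[WAYS];   // timestamp of the most recent access
} set_t;

// Looks up `tag` in one set at time `now`. On a hit, refreshes recency.
// On a miss, fills an invalid way, or else the LRU way. Returns the
// way that was used; *hit reports hit (1) or miss (0).
int access_set(set_t *s, uint64_t tag, uint64_t now, int *hit) {
    for (int w = 0; w < WAYS; w++)
        if (s->valid[w] && s->tag[w] == tag) {
            s->last_used[w] = now;
            *hit = 1;
            return w;
        }
    int victim = 0;                      // miss: pick invalid or LRU way
    for (int w = 1; w < WAYS; w++)
        if (!s->valid[w] ||
            (s->valid[victim] && s->last_used[w] < s->last_used[victim]))
            victim = w;
    s->tag[victim] = tag;
    s->valid[victim] = 1;
    s->last_used[victim] = now;
    *hit = 0;
    return victim;
}
```

Filling a 4-way set with tags 1..4, touching tag 1 again, and then inserting tag 5 evicts tag 2, the least recently used block.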
Victim Cache (1/2)
Associativity is expensive:
- Performance: extra muxes
- Power: reading and checking more tags and data
Conflicts are expensive:
- Performance: extra misses
Observation: conflicts don't occur in all sets.
Victim Cache (2/2)
Provide "extra" associativity, but not for all sets: a small fully-associative victim cache sits alongside the 4-way set-associative L1.
Example: two access streams, ABCDE and JKLMN, that map to the same sets do not "fit" in a 4-way set-associative cache, so every access is a miss. The victim cache provides a "fifth way" so long as only four sets overflow into it at the same time. It can even provide 6th or 7th ... ways.
You typically need some sort of "cast out" buffer for evictees anyway: a dirty line in a writeback cache needs to be written back to the next-level cache, and this might not be able to happen right away (e.g., bus busy, next-level cache busy). The victim cache and writeback buffer can be combined into one structure.
Parallel vs Serial Caches
Tag and Data arrays are usually separate (the tag array is smaller & faster).
- State bits are stored along with the tags: valid bit, "LRU" bit(s), ...
- Parallel access to Tag and Data reduces latency (good for L1)
- Serial access to Tag and Data reduces power (good for L2+): the tag match enables only the needed data read
Way-Prediction and Partial Tagging
Sometimes even parallel tag and data access is too slow; this usually happens in high-associativity L1 caches. If you can't wait... guess.
- Way Prediction: guess based on history, PC, etc.
- Partial Tagging: use part of the tag to pre-select the mux
Guess and verify: extra cycle(s) if the guess is not correct.
Physically-Indexed Caches
Core requests are virtual addresses (VAs):
- The VA passes through the TLB; the D-TLB is on the critical path
- Cache index is PA[15:6]; cache tag is PA[63:16]
- If the index size < page size, the index bits fall within the page offset, and the VA can be used for the index
Simple, but slow. Can we do better?
Virtually-Indexed Caches
Core requests are virtual addresses (VAs):
- Cache index is VA[15:6]; cache tag is PA[63:16]
- The index lookup and the TLB lookup proceed in parallel (one index bit overlaps the virtual page number)
- Why not tag with the VA? Cache flush on context switch; virtual aliases
- Virtual aliases: ensure they don't exist... or check all of them on a miss
Fast, but complicated.
Compute Infrastructure
- "rocks" cluster and several VMs; use your "Unix" login and password
- allv21.all.cs.sunysb.edu
- sbrocks.cewit.stonybrook.edu
- Once on the rocks head node, run "condor_status"
- Pick a machine and use ssh to log into it (e.g., compute-4-4)
- Don't use a VM for the homework
- I already sent out details regarding software, including the MARSS sources and the path to the boot image
- Go through the setup; send to the list in case you encounter problems
Homework #1 (of 1) part 2
You already have accurate & precise time measurement. Extend your program to...
- Measure how many cache levels there are in the CPU
- Measure the hit time of each cache level
Output should resemble this:
Loop: ns
Cache Latencies: 2ns 6ns 20ns 150ns
Note: You must measure it, not read it from MSRs.
Homework #1 Hints
Instructions take time: use them wisely.
- Use a high optimization level in the compiler (-O3)
- Check the assembly to see the instructions (use the -S compiler flag)
- Independent instructions run in parallel (don't count time)
- Dependent instructions run serially (precisely count time)
- Use random walks in memory (sequential walks will trick you)
Smallest linked-list walk:
void* list[100];
for (int y = 0; y < 100; ++y)
    list[y] = &list[(y + 1) % 100];
void** x = (void**)&list[0];
x = (void**)*x; // mov (%rax),%rax
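The hints above (random walk plus dependent loads) can be combined into one setup routine. A sketch, using Sattolo's algorithm to build a single random pointer cycle so the walk touches every slot exactly once per lap; the function name is mine, and timing the chase loop is the part the homework asks you to add:

```c
#include <stdlib.h>
#include <stddef.h>

// Links buf[0..n-1] into one random pointer cycle (Sattolo's algorithm
// produces a single-cycle permutation), then chases pointers from
// buf[0] and returns the number of steps until the walk returns to the
// start -- which should be exactly n. The chase step is the same
// dependent load the homework times: x = *x.
size_t build_and_walk(void **buf, size_t n, unsigned seed) {
    size_t *perm = malloc(n * sizeof *perm);
    srand(seed);
    for (size_t i = 0; i < n; i++)
        perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {   // Sattolo: swap with j < i
        size_t j = (size_t)rand() % i;
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        buf[i] = &buf[perm[i]];            // slot i points to slot perm[i]
    free(perm);

    void *p = (void *)&buf[0];
    size_t steps = 0;
    do {
        p = *(void **)p;                   // mov (%rax),%rax
        steps++;
    } while (p != (void *)&buf[0]);
    return steps;
}
```

Because the order is random, the hardware prefetcher cannot hide the miss latency, so the average time per step approximates the hit time of whichever level the buffer fits in.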
Warm-Up Project (1/3)
Build an L1-D cache array.
- Determine groups; send the member list to Connor and CC me (by Feb 20)
- Must use the SRAM<> module for storage
Grading reminder:
Warm-up Project                            Points
1 Port, Direct-Mapped, 16K                   10
2 Port, Direct-Mapped, 16K                   12
1 Port, Set-Associative, 16K                 14
2 Port, DM, 64K, Pipelined                   16
2 Port, SA, 64K, Pipelined                   18
2 Port, SA, 64K, Pipelined, Way-predicted    20
Multiple Accesses per Cycle
Need high-bandwidth access to caches:
- A core can make multiple access requests per cycle
- Multiple cores can access the LLC at the same time
Must either delay some requests, or...
- Design the SRAM with multiple ports: big and power-hungry
- Split the SRAM into multiple banks: can result in delays, but usually doesn't
Multi-Ported SRAMs
- Wordlines = 1 per port
- Bitlines = 2 per port
- Area = O(ports²)
Lots of other techniques are *not* discussed here, such as sub-banking, hierarchical wordlines/bitlines, etc.
Multi-Porting vs Banking
4 ports:
- Big (and slow)
- Guarantees concurrent access
4 banks, 1 port each:
- Each bank is small (and fast)
- Conflicts (delays) possible
How to decide which bank to go to?
Bank Conflicts
Banks are address interleaved. For a cache with block size b and N banks...
- Bank = (Address / b) % N
- More complicated than it looks: these are just the low-order bits of the index
- Modern processors perform hashed cache indexing: they may randomize bank and index, XORing some low-order tag bits with the bank/index bits (why XOR?)
Banking can provide high bandwidth, but only if all accesses are to different banks.
- For 4 banks and 2 accesses, the chance of a conflict is 25%
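Both the plain interleaving formula and a hashed variant fit in a few lines of C. The hash here is purely illustrative (the `>> 10` fold and the names are mine; real hash functions are microarchitecture-specific):

```c
#include <stdint.h>

#define BLOCK_BYTES 64
#define NUM_BANKS   4

// Plain interleaving from the slide: Bank = (Address / b) % N.
static inline unsigned bank_plain(uint64_t addr) {
    return (unsigned)((addr / BLOCK_BYTES) % NUM_BANKS);
}

// Hypothetical hashed variant: XOR some higher (tag-side) bits of the
// line number into the bank selector. XOR is used because it is cheap
// and flipping one input bit flips exactly one output bit, so it
// de-correlates power-of-two strides without losing information.
static inline unsigned bank_hashed(uint64_t addr) {
    uint64_t line = addr / BLOCK_BYTES;
    return (unsigned)((line ^ (line >> 10)) % NUM_BANKS);
}
```

With plain interleaving, a stride of BLOCK_BYTES×NUM_BANKS (here 256 bytes) hits the same bank every time; the XOR fold breaks that pattern once the higher line bits start varying.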
On-chip Interconnects (1/5)
Today, (Core + L1 + L2) = "core" and (L3 + I/O + Memory controller) = "uncore".
How to interconnect multiple "core"s to the "uncore"?
Possible topologies:
- Bus
- Crossbar
- Ring
- Mesh
- Torus
On-chip Interconnects (2/5)
Crossbar: every core connects to every LLC bank and the memory controller.
Example: Oracle UltraSPARC T5 (3.6 GHz, 16 cores, 8 threads per core)
On-chip Interconnects (3/5)
Ring:
- 3 ports per switch
- Simple and cheap
- Can be bi-directional to reduce latency
Example: Intel Sandy Bridge (3.5 GHz, 6 cores, 2 threads per core)
On-chip Interconnects (4/5)
Mesh:
- Up to 5 ports per switch
- Tiled organization combines core and cache
Example: Tilera Tile64 (866 MHz, 64 cores)
On-chip Interconnects (5/5)
Torus:
- 5 ports per switch
- Can be "folded" to avoid long links
Write Policies
Writes are more interesting:
- On reads, tag and data can be accessed in parallel
- On writes, two steps are needed (tag check, then data update)
- Is access time important for writes?
Choices of write policies:
- On write hits, update memory?
  - Yes: write-through (higher bandwidth)
  - No: write-back (uses Dirty bits to identify blocks to write back)
- On write misses, allocate a cache block frame?
  - Yes: write-allocate
  - No: no-write-allocate
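The write-back + write-allocate combination can be sketched as a tiny state update on a single cache line (the types and names are mine; write-through would instead forward every store to memory and need no dirty bit):

```c
#include <stdint.h>

// One cache line's bookkeeping for a write-back, write-allocate cache.
typedef struct { uint64_t tag; int valid, dirty; } line_t;

// Handles a store to `tag`. A hit just marks the line dirty (the
// memory update is deferred). A miss allocates the line first
// (write-allocate); if the displaced line was dirty, it must be
// written back. Returns 1 if a write-back was required.
int store(line_t *l, uint64_t tag) {
    int writeback = 0;
    if (!(l->valid && l->tag == tag)) {   // write miss: allocate
        writeback = l->valid && l->dirty; // dirty victim -> write back
        l->tag = tag;
        l->valid = 1;
        l->dirty = 0;
    }
    l->dirty = 1;                         // write-back: defer the update
    return writeback;
}
```

Two stores to the same tag cause no traffic beyond the initial fill; only when a dirty line is displaced does data flow to the next level.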
Inclusion
A core often accesses blocks not present on chip. Should the block be allocated in L3, L2, and L1?
- Yes: inclusive caches
  - Waste of space
  - Requires forced eviction (e.g., force evict from L1 on evict from L2+)
- Only allocate blocks in L1: non-inclusive caches (why not "exclusive"?)
  - Must write back clean lines
Some processors combine both:
- L3 is inclusive of L1 and L2
- L2 is non-inclusive of L1 (like a large victim cache)
Parity & ECC
Cosmic radiation can strike at any time, especially at high altitude or during solar flares. What can be done?
- Parity: 1 bit indicating whether the number of 1 bits is odd/even (detects single-bit errors)
- Error Correcting Codes (ECC): an 8-bit code per 64-bit word; generally SECDED (Single-Error-Correct, Double-Error-Detect)
Detecting errors on clean cache lines is harmless: pretend it's a cache miss.
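The parity bullet can be made concrete with an XOR fold. A minimal sketch (full SECDED needs the Hamming-code machinery and is not shown; function names are mine):

```c
#include <stdint.h>

// Even parity of a 64-bit word via XOR folding: after the folds, the
// low bit is 1 iff the word has an odd number of 1 bits.
static inline int parity64(uint64_t w) {
    w ^= w >> 32; w ^= w >> 16; w ^= w >> 8;
    w ^= w >> 4;  w ^= w >> 2;  w ^= w >> 1;
    return (int)(w & 1);
}

// Detect (not correct) an error: recompute parity and compare it
// against the bit stored alongside the word.
static inline int parity_error(uint64_t w, int stored_parity) {
    return parity64(w) != stored_parity;
}
```

Any single-bit flip changes the parity and is detected; a double-bit flip cancels out, which is why real memories use SECDED codes instead of a lone parity bit.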
Homework #1 (of 1) part 3
Extend your program to...
- Measure and report the block size of each cache
- Measure the number of sets in each cache
- Measure the associativity of each cache
- Compute the total capacity of each cache
Output should resemble this:
Loop: ns
Cache Latencies: 2ns 6ns 20ns 150ns
Caches: 64/256x4(64KB) 64/512x8(256KB) 128/8192x24(12MB)