2 Bus Structure for a computer system
[Figure: the CPU chip (register file, ALU) connects through its bus interface to main memory and to the I/O bus, which links the USB controller (mouse, keyboard), the graphics adapter (monitor), and the disk controller (disk).]
3 The CPU-Memory Gap
- The gap widens between DRAM, disk, and CPU speeds.
[Figure: speed trends over time for SSD, DRAM, and CPU.]
4 Locality to the Rescue!
- The key to bridging this CPU-Memory gap is a fundamental property of computer programs known as locality.
5 Locality of Reference
- Principle of locality: programs tend to use data and instructions with addresses near or equal to those they have used recently.
- Temporal locality: recently referenced items are likely to be referenced again in the near future.
- Spatial locality: items with nearby addresses tend to be referenced close together in time.
6 Locality Example

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

- Data references:
  - Reference array elements in succession (stride-1 reference pattern): spatial locality.
  - Reference the variable sum each iteration: temporal locality.
- Instruction references:
  - Reference instructions in sequence: spatial locality.
  - Cycle through the loop repeatedly: temporal locality.
7 Memory Hierarchies
- Some fundamental and enduring properties of hardware and software:
  - Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
  - The gap between CPU and main memory speed is widening.
  - Well-written programs tend to exhibit good locality.
- These properties suggest an approach for organizing memory and storage systems known as a memory hierarchy.
8 An Example Memory Hierarchy
(Smaller, faster, costlier per byte toward the top; larger, slower, cheaper per byte toward the bottom.)
- Registers: CPU registers hold words retrieved from the L1 cache.
- L1: L1 cache (SRAM) holds cache lines retrieved from the L2 cache.
- L2: L2 cache (SRAM) holds cache lines retrieved from main memory.
- L3: Main memory (DRAM) holds disk blocks retrieved from local disks.
- L4: Local secondary storage (local disks) holds files retrieved from disks on remote network servers.
- L5: Remote secondary storage (tapes, distributed file systems, Web servers).
9 Caches
- Cache: a smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.
- Fundamental idea of a memory hierarchy: for each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
- Why do memory hierarchies work?
  - Because of locality, programs tend to access the data at level k more often than they access the data at level k+1.
  - Thus, the storage at level k+1 can be slower, and therefore larger and cheaper per bit.
- Big idea: the memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
10 General Cache Concepts
- Cache: smaller, faster, more expensive memory that caches a subset of the blocks.
- Memory: larger, slower, cheaper memory viewed as partitioned into "blocks".
- Data is copied between the two in block-sized transfer units.
[Figure: a 4-line cache holding blocks 8, 9, 14, 3 of a 16-block memory, with blocks 4 and 10 being transferred.]
11 General Cache Concepts: Hit
- Request: 14. The data in block b is needed.
- Block b is in the cache: hit!
[Figure: the cache holds blocks 8, 9, 14, 3; block 14 is served directly from the cache.]
12 General Cache Concepts: Miss
- Request: 12. The data in block b is needed.
- Block b is not in the cache: miss!
- Block b is fetched from memory and stored in the cache.
  - Placement policy: determines where b goes.
  - Replacement policy: determines which block gets evicted (the victim).
13 General Caching Concepts: Types of Cache Misses
- Cold (compulsory) miss: occurs because the cache is empty.
- Conflict miss:
  - Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k. E.g., block i at level k+1 must be placed in block (i mod 4) at level k.
  - Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block. E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
- Capacity miss: occurs when the set of active cache blocks (the working set) is larger than the cache.
14 Examples of Caching in the Hierarchy

Cache Type           | What is Cached?      | Where is it Cached?  | Latency (cycles) | Managed By
---------------------|----------------------|----------------------|------------------|------------------
Registers            | 4-8 byte words       | CPU core             |                  | Compiler
TLB                  | Address translations | On-chip TLB          |                  | Hardware
L1 cache             | 64-byte blocks       | On-chip L1           | 1                | Hardware
L2 cache             | 64-byte blocks       | On/off-chip L2       | 10               | Hardware
Virtual memory       | 4-KB pages           | Main memory          | 100              | Hardware + OS
Buffer cache         | Parts of files       | Main memory          | 100              | OS
Disk cache           | Disk sectors         | Disk controller      | 100,000          | Disk firmware
Network buffer cache | Parts of files       | Local disk           | 10,000,000       | AFS/NFS client
Browser cache        | Web pages            | Local disk           | 10,000,000       | Web browser
Web cache            | Web pages            | Remote server disks  | 1,000,000,000    | Web proxy server
15 Putting it together
- The speed gap between CPU, memory, and mass storage continues to widen.
- Well-written programs exhibit a property called locality.
- Memory hierarchies based on caching close the gap by exploiting locality.
16 Cache/Main Memory Structure
- The CPU requests data from a memory location.
- The cache is checked for this data:
  - If present, get it from the cache (fast).
  - If not present, read the required block from main memory into the cache, then deliver it from the cache to the CPU.
- The cache includes tags to identify which block of main memory is in each cache slot.
17 Cache Design Issues
- Size
- Mapping function
- Replacement algorithm
- Write policy
- Block size
- Number of caches

Size does matter:
- Cost: more cache is expensive.
- Speed: more cache is faster (up to a point), but checking the cache for data takes time.
18 Mapping Function
- Example configuration:
  - Cache of 64 KB with cache blocks of 4 bytes, i.e., the cache has 16K (2^14) lines of 4 bytes each.
  - 16 MB of main memory, addressed with a 24-bit address (2^24 = 16M).
- Direct mapping:
  - Each block of main memory maps to only one cache line, i.e., if a block is in the cache, it must be in one specific place.
  - The address is in two parts: the least significant w bits identify a unique word; the most significant s bits specify one memory block.
  - The MSBs are split into a cache line field of r bits and a tag of s-r bits (most significant).
19 Direct Mapping: Address Structure

Tag (s-r) | Line or slot (r) | Word (w)
8 bits    | 14 bits          | 2 bits

- 24-bit address:
  - 2-bit word identifier (4-byte block).
  - 22-bit block identifier: 8-bit tag (= 22-14) and 14-bit slot or line.
- No two blocks mapping to the same line have the same tag field.
- Check the contents of the cache by finding the line and checking the tag.
20 Direct Mapping Cache Organization

Cache line | Main memory blocks held
0          | 0, m, 2m, 3m, ..., 2^s - m
1          | 1, m+1, 2m+1, ..., 2^s - m + 1
...        | ...
m-1        | m-1, 2m-1, 3m-1, ..., 2^s - 1
21 Direct Mapped Cache
- Simple and inexpensive.
- Fixed location for a given block: if a program repeatedly accesses 2 blocks that map to the same line, cache misses are very high.
- Summary:
  - Address length = (s + w) bits.
  - Number of addressable units = 2^(s+w) words or bytes.
  - Block size = line size = 2^w words or bytes.
  - Number of blocks in main memory = 2^(s+w) / 2^w = 2^s.
  - Number of lines in cache = m = 2^r.
  - Size of tag = (s - r) bits.
22 Fully Associative Mapping
- A main memory block can load into any line of the cache.
- The memory address is interpreted as a tag and a word; the tag uniquely identifies a block of memory.
- Every line's tag must be examined for a match, so cache searching gets expensive.
24 Fully Associative Mapping: Address Structure

Tag     | Word
22 bits | 2 bits

- A 22-bit tag is stored with each 32-bit block of data.
- Compare the tag field with every tag entry in the cache to check for a hit.
- The least significant 2 bits of the address identify which byte is required from the 32-bit (4-byte) data block.
25 Set Associative Mapping
- The cache is divided into a number of sets; each set contains a number of lines.
- A given block maps to any line in one given set, e.g., block B can be in any line of set i.
- E.g., with 2 lines per set (2-way associative mapping), a given block can be in one of 2 lines in only one set.
27 Two-Way Set Associative Mapping: Address Structure

Tag    | Set     | Word
9 bits | 13 bits | 2 bits

- Use the set field to determine which cache set to look in.
- Compare the tag field to see if we have a hit.
28 Set Associative Mapping Summary
- Address length = (s + w) bits.
- Number of addressable units = 2^(s+w) words or bytes.
- Block size = line size = 2^w words or bytes.
- Number of blocks in main memory = 2^s.
- Number of lines per set = k.
- Number of sets = v = 2^d.
- Number of lines in cache = kv = k × 2^d.
- Size of tag = (s - d) bits.
29 Replacement Algorithms
- There is no choice in direct mapping, because each block maps to only one line; replacement algorithms apply only to the other mapping functions.
- The algorithm is implemented in hardware (for speed).
- Least recently used (LRU): e.g., in a 2-way set associative cache, which of the 2 blocks is LRU?
- First in first out (FIFO): replace the block that has been in the cache longest.
- Least frequently used (LFU): replace the block that has had the fewest hits.
- Random.
30 Write Policy
- Why is a write policy needed?
  - A cache block must not be overwritten on eviction unless main memory is up to date.
  - Multiple CPUs may have individual caches.
  - I/O may address main memory directly.
- Write through:
  - All writes go to main memory as well as to the cache.
  - Multiple CPUs can monitor main memory traffic to keep their local caches up to date.
  - Drawbacks: lots of memory traffic, and writes are slowed down.
- Write back:
  - Updates are initially made in the cache only.
  - An update (dirty) bit for the cache slot is set when an update occurs.
  - If a block is to be replaced, it is written to main memory only if its update bit is set.
  - Drawbacks: other caches can get out of sync, and I/O must access main memory through the cache.
31 What about writes?
- Multiple copies of data exist: L1, L2, main memory, disk.
- What to do on a write-hit?
  - Write-through: write immediately to memory.
  - Write-back: defer the write to memory until the line is replaced; needs a dirty bit (is the line different from memory or not?).
- What to do on a write-miss?
  - Write-allocate: load the block into the cache, then update the line in the cache; good if more writes to the location follow.
  - No-write-allocate: write immediately to memory, bypassing the cache.
- Typical combinations:
  - Write-through + no-write-allocate.
  - Write-back + write-allocate.
32 Intel Core i7 Cache Hierarchy
- Per core (cores 0-3): registers; separate L1 i-cache and d-cache, each 32 KB, 8-way, access: 4 cycles; L2 unified cache, 256 KB, 8-way, access: 11 cycles.
- Shared by all cores: L3 unified cache, 8 MB, 16-way (access latency not given on the slide), connected to main memory.
- Block size: 64 bytes for all caches.
33 Cache Performance Metrics
- Miss rate:
  - Fraction of memory references not found in the cache (misses / accesses) = 1 - hit rate.
  - Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
- Hit time:
  - Time to deliver a line in the cache to the processor, including the time to determine whether the line is in the cache.
  - Typical numbers: 1-2 clock cycles for L1; 5-20 clock cycles for L2.
- Miss penalty:
  - Additional time required because of a miss; typically many cycles for main memory (trend: increasing!).
34 Writing Cache-Friendly Code
- Make the common case go fast: focus on the inner loops of the core functions.
- Minimize the misses in the inner loops:
  - Repeated references to variables are good (temporal locality).
  - Stride-1 reference patterns are good (spatial locality).
- Key idea: our qualitative notion of locality is quantified through our understanding of cache memories.
35 Lecture on Friday (Extra Class): 10 to 11 am in room no. 1220.
36 Memory Management
- In a multi-tasking environment, the CPU is shared among multiple processes, and CPU scheduling takes care of assigning the CPU to each ready process.
- To realize multi-tasking, multiple processes must be kept in memory, i.e., memory must be shared between multiple processes.
- This sharing of memory is taken care of by memory management techniques.
- These techniques are usually implemented in hardware.
37 Main Memory
- Main memory is divided into two parts: one for the OS, the other for multiple user processes.
- User processes must be restricted from accessing the OS memory area and the memory of other user processes.
- This protection is provided by the hardware in the form of base and limit registers (Section 8.1, p. 316).
- The base register holds the starting physical address, and the limit register holds the size of the process.
- Both registers are loaded by the OS using privileged instructions.
40 Address Binding
- An address in a user program is different from the physical address, so address binding is needed.
- The compiler usually binds symbolic addresses in an executable to relocatable addresses.
- The loader in turn binds relocatable addresses to absolute physical addresses.
- Binding can be done at:
  - Compile time: generates absolute code.
  - Load time: the compiler generates relocatable code; binding is delayed until load time.
  - Execution time: binding is delayed until run time, i.e., the process's addresses are mapped to physical memory as it executes.
41 Logical versus Physical Address Space
- The CPU generates logical addresses; the address seen by memory is the physical address.
- Logical address space: the set of all logical addresses generated by a program.
- Physical address space: the set of physical addresses corresponding to those logical addresses.
- Memory Management Unit (MMU): the run-time mapping from logical to physical addresses is done by the MMU (hardware).
42 Simple MMU Scheme: Using a Base Register (a.k.a. Relocation Register)
- The user program deals with logical addresses.
- The MMU converts each logical address to a physical address.
- The physical address is used to access main (physical) memory.
43 Dynamic Loading
- Loading an entire program into memory before execution limits the size of a program to the size of physical memory.
- Since programs are divided into modules or routines, a routine can instead be loaded dynamically when it is called.
- Advantage: an unused routine is never loaded.
44 Static Linking and Dynamic Linking
- Static linking: shared libraries are linked statically, i.e., combined into the binary program image before execution.
- Dynamic linking: linking of shared libraries is postponed until execution time.
45 Memory Allocation
- Memory can be allocated in two ways.
- Contiguous memory allocation: a process resides in a single contiguous section of memory.
  - Fixed-sized partitions: multi-programming is bounded by the number of partitions; suffers internal fragmentation.
  - Variable-sized partitions: the OS maintains a table to keep track of which memory is available and which is occupied; suffers external fragmentation.
- Non-contiguous memory allocation: a process is not stored in contiguous memory segments, e.g., paging.
46 Paging
- A memory management scheme that allows memory allocation to be non-contiguous.
- Paging avoids external fragmentation and the need for compaction.
- Traditional systems provide hardware support for paging.
- Modern systems (e.g., 64-bit microprocessors) implement paging by closely integrating hardware and the OS.
47 Basic Method of Paging
- Logical memory is divided into fixed-sized blocks called pages.
- Physical memory is divided into fixed-sized blocks called frames.
- The backing store (disk) is also divided into fixed-sized blocks of the same size as the frames.
- The CPU generates a logical address that is divided into two parts: a page number and a page offset.
- The page size is defined by the hardware and is usually a power of 2, varying between 512 bytes and 16 MB.
- Page sizes have grown over time as processes, data sets, and main memories have become larger.
- Internal fragmentation may exist.
48 Page Table
- The page table maps a virtual page to a physical frame: for each page number there is a corresponding frame number stored in the page table.
- The page table is supported in hardware.
51 Page Table: Hardware Implementation, Method 1
- Implement the page table as a dedicated set of registers.
- These registers are built from high-speed logic to make paging address translation efficient.
- The CPU dispatcher reloads these registers just as it reloads other registers.
- This method is fine if the page table is small, but for large processes it is not feasible.
52 Page Table: Hardware Implementation, Method 2
- Store the page table in main memory.
- A Page Table Base Register (PTBR) points to the page table in memory.
- But with this method, each memory reference requires two memory accesses: one for the page-table entry and one for the actual data.
- This is very slow, so another technique is needed.
53 Translation Lookaside Buffer (TLB)
- Page tables vary in size: larger processes require larger page tables, so they cannot be stored in registers and must be kept in main memory.
- Every virtual memory reference then causes two physical memory accesses — fetch the page-table entry, then fetch the data — which slows down the system.
- To overcome this problem, a special cache memory, the TLB, is used to store page-table entries.
59 Virtual Memory: Demand Paging
- Do not require all pages of a process in memory; bring in pages as required.
- Page fault: the required page is not in memory.
  - The operating system must swap in the required page.
  - It may need to swap out a page to make space, selecting the page to throw out based on recent history.
61 Thrashing
- Too many processes in too little memory: the operating system spends all its time swapping, and little or no real work is done (the disk light is on all the time).
- Solutions:
  - Good page replacement algorithms.
  - Reduce the number of processes running.
  - Fit more memory.
62 Advantage
- We do not need all of a process in memory for it to run; we can swap in pages as required.
- So we can now run processes that are bigger than the total memory available!
- Main memory is called real memory; the user/programmer sees a much bigger memory, called virtual memory.