1 TK 2123 COMPUTER ORGANISATION & ARCHITECTURE Lecture 7: CPU and Memory (3)
2 Contents
This lecture will discuss:
- Cache
- Error correcting codes
3 The Memory Hierarchy
Trade-offs among cost, capacity and access time:
- Faster access time, greater cost per bit.
- Greater capacity, smaller cost per bit.
- Greater capacity, slower access time.
Key parameters:
- Access time: the time it takes to perform a read or write operation.
- Memory cycle time: access time plus any time the memory needs to "recover" before the next access, i.e. access + recovery.
- Transfer rate: the rate at which data can be moved (worked example below).
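A quick worked example tying cycle time to transfer rate for a random-access memory; the 100 ns figure is assumed purely for illustration:

```latex
% For random-access memory, the transfer rate is the reciprocal
% of the cycle time (illustrative figures, not from the slides):
R = \frac{1}{T_{\text{cycle}}}
\qquad \text{e.g. } T_{\text{cycle}} = 100\,\text{ns}
\;\Rightarrow\; R = \frac{1}{100 \times 10^{-9}\,\text{s}} = 10^{7}\ \text{words/s}
```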
4 Memory Hierarchies
[Figure: a five-level memory hierarchy.]
5 Hierarchy List
- Registers
- L1 cache
- L2 cache
- Main memory
- Disk cache
- Disk
- Optical
- Tape
Going down the list: decreasing cost per bit, increasing capacity, and slower access time. The levels from registers through main memory are internal memory; the disk-based levels and below are external memory.
6 Hierarchy List
- It would be nice to use only the fastest memory, but because that is the most expensive memory, we trade off access time for cost by using more of the slower memory.
- The design challenge is to organise the data and programs in memory so that the accessed memory words are usually in the faster memory.
- In general, it is likely that most future accesses to main memory by the processor will be to locations recently accessed.
- So the cache automatically retains a copy of some of the recently used words from the DRAM.
- If the cache is designed properly, then most of the time the processor will request memory words that are already in the cache.
7 Hierarchy List
- No one technology is optimal in satisfying the memory requirements for a computer system.
- As a consequence, the typical computer system is equipped with a hierarchy of memory subsystems:
  - some internal to the system (directly accessible by the processor), and
  - some external (accessible by the processor via an I/O module).
8 Cache
- A small amount of fast memory.
- Sits between normal main memory and the CPU.
- May be located on the CPU chip or in a module.
- Data moves between main memory and the cache in fixed-size units, called blocks or cache lines.
9 Cache
- The cache contains a copy of portions of main memory.
- When the processor attempts to read a word of memory, a check is made to determine if the word is in the cache.
  - If so (hit), the word is delivered to the processor.
  - If not (miss), a block of main memory, consisting of some fixed number of words, is read into the cache and then the word is delivered to the processor.
- Because of the phenomenon of locality of reference, when a block of data is fetched into the cache to satisfy a single memory reference, it is likely that there will be future references to that same memory location or to other words in the block.
- The ratio of hits to the total number of requests is known as the hit ratio (a worked example follows).
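As noted above, the hit ratio determines the effective (average) access time. A minimal worked example; the timing figures are assumed for illustration only:

```latex
% Effective access time: H = hit ratio, T_c = cache access time,
% T_m = main-memory access time (a miss pays the cache check plus memory).
T_{\text{avg}} = H\,T_c + (1 - H)(T_c + T_m)
% With assumed figures T_c = 10\,\text{ns}, T_m = 100\,\text{ns}, H = 0.95:
T_{\text{avg}} = 0.95(10) + 0.05(10 + 100) = 9.5 + 5.5 = 15\ \text{ns}
```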
11 Cache operation – overview
- CPU requests the contents of a memory location.
- Check the cache for this data.
- If present, get it from the cache (fast).
- If not present, read the required block from main memory into the cache.
- Then deliver the word from the cache to the CPU.
- The cache includes tags to identify which block of main memory is in each cache slot.
13 Cache Design
- Size
- Mapping function
- Replacement algorithm
- Write policy
- Block size
- Number of caches (L1, L2, L3, etc.)
14 Size does matter
- Cost: more cache is expensive.
- Speed: more cache is faster (up to a point), but checking the cache for data takes time.
- We would like the cache to be small enough that the overall average cost per bit is close to that of main memory alone, and large enough that the overall average access time is close to that of the cache alone.
- The larger the cache, the larger the number of gates involved in addressing it. As a result, large caches tend to be slightly slower than small ones.
15 Comparison of Cache Sizes

Processor | Type | Year of Introduction | L1 cache (a) | L2 cache | L3 cache
IBM 360/85 | Mainframe | 1968 | 16 to 32 KB | — | —
PDP-11/70 | Minicomputer | 1975 | 1 KB | — | —
VAX 11/780 | Minicomputer | 1978 | 16 KB | — | —
IBM 3033 | Mainframe | 1978 | 64 KB | — | —
IBM 3090 | Mainframe | 1985 | 128 to 256 KB | — | —
Intel 80486 | PC | 1989 | 8 KB | — | —
Pentium | PC | 1993 | 8 KB/8 KB | 256 to 512 KB | —
PowerPC 601 | PC | 1993 | 32 KB | — | —
PowerPC 620 | PC | 1996 | 32 KB/32 KB | — | —
PowerPC G4 | PC/server | 1999 | 32 KB/32 KB | 256 KB to 1 MB | 2 MB
IBM S/390 G4 | Mainframe | 1997 | 32 KB | 256 KB | 2 MB
IBM S/390 G6 | Mainframe | 1999 | 256 KB | 8 MB | —
Pentium 4 | PC/server | 2000 | 8 KB/8 KB | 256 KB | —
IBM SP | High-end server/supercomputer | 2000 | 64 KB/32 KB | 8 MB | —
CRAY MTA (b) | Supercomputer | 2000 | 8 KB | 2 MB | —
Itanium | PC/server | 2001 | 16 KB/16 KB | 96 KB | 4 MB
SGI Origin 2001 | High-end server | 2001 | 32 KB/32 KB | 4 MB | —
Itanium 2 | PC/server | 2002 | 32 KB | 256 KB | 6 MB
IBM POWER5 | High-end server | 2003 | 64 KB | 1.9 MB | 36 MB
CRAY XD-1 | Supercomputer | 2004 | 64 KB/64 KB | 1 MB | —

(a) Two values separated by a slash refer to instruction and data caches.
(b) Both caches are instruction only; no data caches.
16 Cache: Mapping Function
- There are fewer cache lines than main memory blocks, so an algorithm is needed for mapping main memory blocks into cache lines.
- Three techniques:
  - Direct
  - Associative
  - Set associative
17 Direct Mapping
- Each block of main memory maps to only one cache line, i.e. if a block is in the cache, it must be in one specific place (see the address-breakdown sketch below).
- Pros and cons:
  - Simple.
  - Inexpensive.
  - Fixed location for a given block: if a program repeatedly accesses two blocks that map to the same line, the cache miss rate is very high.
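To make the direct-mapped address interpretation concrete, here is a minimal C sketch; the geometry (16-byte lines, 128 cache lines) and the example address are assumptions for illustration, not values from the slides:

```c
#include <stdint.h>
#include <stdio.h>

/* Direct-mapped address breakdown: | tag | line | word offset |
 * Assumed illustrative geometry: 16-byte lines (4 offset bits),
 * 128 lines (7 index bits); the remaining address bits are the tag. */
#define OFFSET_BITS 4
#define LINE_BITS   7

int main(void) {
    uint32_t addr   = 0x0001ABCD;                      /* example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t line   = (addr >> OFFSET_BITS) & ((1u << LINE_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + LINE_BITS);
    /* The block can live ONLY in cache line `line`; a hit requires the
     * tag stored for that line to equal `tag`. */
    printf("tag=0x%X line=%u offset=%u\n",
           (unsigned)tag, (unsigned)line, (unsigned)offset);
    return 0;
}
```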
18 Associative Mapping
- A main memory block can load into any line of the cache.
- The memory address is interpreted as a tag and a word offset.
- The tag uniquely identifies a block of memory.
- Every line's tag is examined for a match (see the lookup sketch below).
- Disadvantage: cache searching gets expensive, since complex circuitry is required to examine the tags of all cache lines in parallel.
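A minimal C sketch of a fully associative lookup; the cache size and struct layout are assumptions for illustration. The sequential loop stands in for what hardware does in parallel with one comparator per line:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_LINES 128   /* assumed cache size for illustration */

struct line { bool valid; uint32_t tag; /* data payload omitted */ };

/* Fully associative: the block may sit in ANY line, so every tag must
 * be checked. Hardware compares all tags in parallel; this loop is
 * merely a software stand-in for that comparison. */
static int lookup(const struct line cache[], uint32_t tag) {
    for (int i = 0; i < NUM_LINES; i++)
        if (cache[i].valid && cache[i].tag == tag)
            return i;   /* hit: return the line index */
    return -1;          /* miss */
}
```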
19 Set Associative Mapping
- A compromise that exhibits the strengths of both the direct and associative approaches while reducing their disadvantages.
- The cache is divided into a number of sets; each set contains a number of lines.
- A given block maps to any line in a given set, e.g. block B can be in any line of set i (sketch below).
- With fully associative mapping, the tag in a memory address is quite large and must be compared to the tag of every line in the cache.
- With k-way set associative mapping, the tag in a memory address is much smaller and is only compared to the k tags within a single set.
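A minimal C sketch of the k-way lookup; the geometry (64 sets, 4 ways) is an assumption for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed illustrative geometry: 64 sets, 4 lines per set (4-way). */
#define NUM_SETS 64
#define WAYS      4

struct line { bool valid; uint32_t tag; };

/* k-way set associative: the set index selects ONE set, and only the
 * k tags inside that set are compared, not the whole cache. */
static int lookup(const struct line cache[NUM_SETS][WAYS],
                  uint32_t block_addr) {
    uint32_t set = block_addr % NUM_SETS;   /* which set the block maps to */
    uint32_t tag = block_addr / NUM_SETS;   /* remaining bits form the tag */
    for (int way = 0; way < WAYS; way++)
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return way;                     /* hit in this way of the set */
    return -1;                              /* miss */
}
```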
20 Replacement Algorithms
- When the cache is full, some block in the cache must be selected for replacement.
- Direct mapping: no choice. Each block maps to only one line, so replace that line.
21 Replacement Algorithms (2)
Associative & set associative mapping need a replacement algorithm, implemented in hardware for speed:
- Least recently used (LRU): keeps track of the usage of each block and replaces the block that has gone unused for the longest time (see the sketch after this list).
- First in first out (FIFO): replace the block that has been in the cache longest.
- Least frequently used (LFU): replace the block that has had the fewest hits.
- Random.
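As referenced above, a minimal C sketch of LRU bookkeeping for one 4-way set; the counter-based ages are an illustrative software stand-in for the per-set LRU bits real hardware uses:

```c
#include <stdint.h>

#define WAYS 4   /* assumed 4-way set for illustration */

/* age[w] says how long ago way w was used; 0 = most recently used.
 * Start with distinct ages so they always form a permutation 0..WAYS-1. */
static uint8_t age[WAYS] = {0, 1, 2, 3};

static void touch(int way) {                 /* called on every hit */
    for (int w = 0; w < WAYS; w++)
        if (age[w] < age[way]) age[w]++;     /* younger ways get older */
    age[way] = 0;                            /* this way is now newest */
}

static int victim(void) {                    /* called on a miss */
    int oldest = 0;
    for (int w = 1; w < WAYS; w++)
        if (age[w] > age[oldest]) oldest = w;
    return oldest;                           /* replace least recently used */
}
```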
22 Write Policy
Issues:
- A cache block must not be overwritten unless main memory is up to date.
- Multiple CPUs may have individual caches.
- I/O may address main memory directly.
23 Write through
- All writes go to main memory as well as to the cache (sketch below).
- Multiple CPUs can monitor main memory traffic to keep their local cache up to date.
- Disadvantages:
  - Lots of memory traffic.
  - Slows down writes.
  - Can create a bottleneck.
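A minimal C sketch of the write-through rule; all names and sizes here are illustrative, not a real memory-controller interface:

```c
#include <stdint.h>

/* Write-through: every store updates the cached copy (if present)
 * AND main memory, so memory is always up to date. */
#define LINE_SIZE 16
struct line { int valid; uint32_t tag; uint8_t data[LINE_SIZE]; };

static uint8_t main_memory[1 << 20];        /* 1 MB backing store */

static void write_byte(struct line *l, uint32_t addr, uint8_t value) {
    uint32_t tag = addr / LINE_SIZE;        /* which block this byte is in */
    if (l->valid && l->tag == tag)
        l->data[addr % LINE_SIZE] = value;  /* update the cached copy */
    main_memory[addr] = value;              /* ALWAYS update memory too */
}
```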
24 Cache: Line Size
- As the block size increases from very small to larger sizes, the hit ratio will at first increase because of the principle of locality.
- Two issues with larger blocks:
  - Larger blocks reduce the number of blocks that fit into the cache. Because each block fetch overwrites older cache contents, a small number of blocks results in data being overwritten shortly after it is fetched.
  - As a block becomes larger, each additional word is farther from the requested word and therefore less likely to be needed in the near future.
25 Number of Caches
Multilevel caches:
- On-chip cache: a cache on the same chip as the processor. It reduces the processor's external bus activity and therefore speeds up execution times and increases overall system performance.
- External cache: is it still desirable? Yes; most contemporary designs include both on-chip and external caches, e.g. a two-level cache with an internal cache (L1) and an external cache (L2).
- Why? If there is no L2 cache and the processor makes an access request for a memory location not in the L1 cache, then the processor must access DRAM or ROM memory across the bus, giving poor performance.
26 Number of Caches
- More recently, it has become common to split the cache into two: one dedicated to instructions and one dedicated to data.
- Two potential advantages of a unified cache:
  - For a given cache size, a unified cache has a higher hit rate than split caches because it balances the load between instruction and data fetches automatically.
  - Only one cache needs to be designed and implemented.
- Nevertheless, the trend is toward split caches, as in the Pentium and PowerPC, which emphasize parallel instruction execution and the prefetching of predicted future instructions.
- Advantage of a split cache: it eliminates contention for the cache between the instruction fetch/decode unit and the execution unit.
27 Intel Cache Evolution

Problem | Solution | Processor on which feature first appears
External memory slower than the system bus. | Add external cache using faster memory technology. | 386
Increased processor speed results in the external bus becoming a bottleneck for cache access. | Move the external cache on-chip, operating at the same speed as the processor. | 486
Internal cache is rather small, due to limited space on chip. | Add external L2 cache using faster technology than main memory. | 486
Contention occurs when both the instruction prefetcher and the execution unit simultaneously require access to the cache; the prefetcher is then stalled while the execution unit's data access takes place. | Create separate data and instruction caches. | Pentium
Increased processor speed results in the external bus becoming a bottleneck for L2 cache access. | Create a separate back-side bus that runs at a higher speed than the main (front-side) external bus; the BSB is dedicated to the L2 cache. | Pentium Pro
Increased processor speed results in the external bus becoming a bottleneck for L2 cache access. | Move the L2 cache onto the processor chip. | Pentium II
Some applications deal with massive databases and must have rapid access to large amounts of data; the on-chip caches are too small. | Add external L3 cache. | Pentium III
Some applications deal with massive databases and must have rapid access to large amounts of data; the on-chip caches are too small. | Move the L3 cache on-chip. | Pentium 4
30 Memory Packaging and Types
- A group of chips, typically 8 or 16, is mounted on a tiny PCB and sold as a unit.
- SIMM (single inline memory module): has a row of connectors on one side.
- DIMM (dual inline memory module): has rows of connectors on both sides.
[Figure: a SIMM holding 256 MB; two of the chips control the SIMM.]
31 Error Correction
- Hard failure:
  - A permanent defect.
  - Caused by harsh environmental abuse, manufacturing defects, and wear.
- Soft error:
  - Random and non-destructive; no permanent damage to the memory.
  - Caused by power supply problems.
- Errors are detected using a Hamming error-correcting code.
32 Error Correction
When the stored word is read out, a new set of K code bits is generated from the M data bits and compared with the fetched code bits. Possible results:
- No errors: the fetched data bits are sent out.
- An error is detected and it is possible to correct it: the data bits plus the error-correction bits are fed into a corrector, which sends out the corrected set of M bits.
- An error is detected, but it is not possible to correct it: this condition is reported.
A sketch of such a code follows.
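A minimal C sketch of this scheme for M = 4 data bits and K = 3 check bits (a Hamming(7,4) code). The bit layout, with check bits at the power-of-two positions, is one common choice assumed for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Hamming(7,4): positions 1..7; check bits at positions 1, 2, 4;
 * data bits at positions 3, 5, 6, 7. Single-error correcting. */
static int bit(uint8_t word, int pos) { return (word >> (pos - 1)) & 1; }

static uint8_t encode(uint8_t data /* 4 bits */) {
    uint8_t code = 0;
    int dpos[4] = {3, 5, 6, 7};               /* data-bit positions */
    for (int i = 0; i < 4; i++)
        if ((data >> i) & 1) code |= 1u << (dpos[i] - 1);
    for (int p = 1; p <= 4; p <<= 1) {        /* check bits 1, 2, 4 */
        int parity = 0;                       /* parity of covered positions */
        for (int pos = 1; pos <= 7; pos++)
            if (pos & p) parity ^= bit(code, pos);
        if (parity) code |= 1u << (p - 1);    /* force even parity */
    }
    return code;
}

/* Recompute the K check bits and compare with the stored ones: the
 * 3-bit syndrome is 0 for no error, otherwise it equals the position
 * of the single flipped bit, so that bit can be corrected. */
static int syndrome(uint8_t code) {
    int s = 0;
    for (int p = 1; p <= 4; p <<= 1) {
        int parity = 0;
        for (int pos = 1; pos <= 7; pos++)
            if (pos & p) parity ^= bit(code, pos);
        if (parity) s |= p;
    }
    return s;
}

int main(void) {
    uint8_t code = encode(0xB);           /* data 1011 */
    uint8_t bad  = code ^ (1u << 4);      /* flip the bit at position 5 */
    printf("syndrome clean=%d corrupted=%d\n", syndrome(code), syndrome(bad));
    /* prints 0 and 5: flipping position 5 back restores the codeword */
    return 0;
}
```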
33 Error Correcting Code
- A function f is applied to the M data bits to produce the K check bits.
- Stored codeword: M + K bits.
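The required number of check bits K follows from a counting argument: the K bits must distinguish "no error" from a single-bit error in any of the M + K codeword positions. The M = 8 example below is illustrative:

```latex
% Standard single-error-correction requirement:
2^{K} \ge M + K + 1
% e.g. for M = 8 data bits: K = 3 fails (2^3 = 8 < 12), but
% 2^{4} = 16 \ge 8 + 4 + 1 = 13, so K = 4 check bits suffice.
```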