EE/CS/CPE 3760 - Computer Organization, Seattle Pacific University
Ch7a: Memory Hierarchy and Caches (presentation transcript)

Slide 2: Big is Slow (7.1)

The more information stored, the slower the access. (Amdahl's law?)

Consider taking an open-book exam. You might find the answer:
- In your memory
- In a sheet of notes
- In course handouts
- In the textbook

Spatial locality: you're likely to have questions on similar topics.
Temporal locality: if you need a particular formula, you're likely to need it again soon.

Slide 3: And so it is with Computers (7.1)

Our system has two kinds of memory:
- Registers: close to the CPU, small in number, fast
- Main memory: big, slow (about 15 ns), "far" from the CPU

Data moves between them via loads, stores, and instruction fetches. Assembly language programmers and compilers manage all transfers between registers and main memory.

Slide 4: The problem... (7.1)

In the pipeline (IF RF EX M WB), every instruction has to be fetched from memory, so we lose big time on every instruction fetch. We lose double big time when executing a load or store, which accesses memory again.

A DRAM memory access takes around 15 ns:
- At 100 MHz, that's 1.5 cycles
- At 1 GHz, that's 15 cycles
- Don't even get started talking about 3-4 GHz

Note: access time is faster in some memory modes, but a basic access is around 10-20 ns.

Slide 5: A hopeful thought (7.1)

Static RAMs are much faster than DRAMs: 3-4 ns is possible (instead of 15 ns). So, build memory out of SRAMs?

- SRAM costs about 20 times as much as DRAM; technology limitations cause the price difference
- Access time gets worse if larger SRAM systems are needed (small is fast...)

Nice try.

Slide 6: A more hopeful thought (7.1)

Remember the telephone directory? Do the same thing with computer memory: build a hierarchy of memories between the registers and main memory.

- Closer to the CPU: small and fast (frequently used data)
- Closer to main memory: big and slow (more rarely used data)

An SRAM cache sits between the CPU's registers and main memory (DRAM). The big question: what goes in the cache?

Slide 7: Locality (7.1)

Temporal locality: the program is very likely to access the same data again and again over time.

    i = i + 1;
    if (i < 20) {
        z = i*i + 3*i - 2;
    }
    q = A[i];

Spatial locality: the program is very likely to access data that is close together.

    p = A[i];
    q = A[i+1];
    r = A[i] * A[i+3] - A[i+2];

    name = employee.name;
    rank = employee.rank;
    salary = employee.salary;

Slide 8: The Cache (7.2)

The cache holds the most recently accessed memory locations (exploits temporal locality). For example, after accesses to addresses 1000, 1016, 1048, and 1028:

    Cache              Main Memory Fragment
    Address  Data      Address  Data
    1000     5600      1000     5600
    1016     0         1004     3223
    1048     2447      1008     23
    1028     43        1012     1122
                       1016     0
                       1020     32324
                       1024     845
                       1028     43
                       1032     976
                       1036     77554
                       1040     433
                       1044     7785
                       1048     2447
                       1052     775
                       1056     433

Issues:
- How do we know what's in the cache?
- What if the cache is full?

Slide 9: Goals for Cache Organization

- Complete: data may come from anywhere in main memory
- Fast lookup: we have to look up data in the cache on every memory access
- Exploits temporal locality: stores only the most recently accessed data
- Exploits spatial locality: stores related data

Slide 10: Direct Mapping (7.2)

A 6-bit address splits into a 2-bit tag (bits 5-4), a 2-bit index (bits 3-2), and a 2-bit byte offset (bits 1-0, always zero for word accesses). The index selects the cache entry; the tag is stored alongside the data to identify which memory word is present.

Main memory fragment (6-bit addresses, one word per row):

    Address   Data       Address   Data
    00 00 00  5600       10 00 00  976
    00 01 00  3223       10 01 00  77554
    00 10 00  23         10 10 00  433
    00 11 00  1122       10 11 00  7785
    01 00 00  0          11 00 00  2447
    01 01 00  32324      11 01 00  775
    01 10 00  845        11 10 00  433
    01 11 00  43         11 11 00  3649

Cache contents:

    Index  V  Tag  Data
    00     Y  00   5600
    01     Y  11   775
    10     Y  01   845
    11     N  00   33234   (invalid entry; contents are ignored)

In a direct-mapped cache:
- Each memory address corresponds to exactly one location in the cache
- Many different memory locations map to each cache entry (four in this case)
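A minimal C sketch of this address split. The field widths come from the slide; the variable names and the sample address are illustrative. Address 110100 should land on index 01 with tag 11, matching the cache table above:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint8_t addr   = 0x34;               /* binary 110100 */
        unsigned tag   = (addr >> 4) & 0x3;  /* bits 5-4 */
        unsigned index = (addr >> 2) & 0x3;  /* bits 3-2 */
        printf("tag=%u index=%u\n", tag, index);  /* tag=3 (11), index=1 (01) */
        return 0;
    }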

Slide 11: Hits and Misses (7.2)

When the CPU reads from memory:
- Calculate the index and tag
- Is the data in the cache? Yes: a hit, you're done!
- Data not in cache? This is a miss. Read the word from memory and give it to the CPU. Then update the cache so we won't miss again: write the data and tag for this memory location into the cache (exploits temporal locality).

The hit rate and miss rate are the fractions of memory accesses that are hits and misses. Typically, hit rates are around 95%. Instructions and data are often considered separately when calculating hit/miss rates.
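The read path above can be sketched in C for a toy 4-entry, one-word-per-block direct-mapped cache. The steps mirror the slide; all names and the stand-in main_memory array are illustrative assumptions:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_ENTRIES 4

    struct entry { bool valid; uint32_t tag; uint32_t data; };
    static struct entry cache[NUM_ENTRIES];
    static uint32_t main_memory[64];          /* stand-in for slow DRAM */

    uint32_t cache_read(uint32_t addr, bool *hit) {
        uint32_t word_addr = addr >> 2;       /* drop the byte offset */
        uint32_t index = word_addr % NUM_ENTRIES;
        uint32_t tag   = word_addr / NUM_ENTRIES;

        struct entry *e = &cache[index];
        if (e->valid && e->tag == tag) {      /* hit: data already cached */
            *hit = true;
            return e->data;
        }
        *hit = false;                         /* miss: fetch, then update cache */
        *e = (struct entry){ true, tag, main_memory[word_addr] };
        return e->data;
    }

    int main(void) {
        bool hit;
        main_memory[0x34 >> 2] = 775;
        cache_read(0x34, &hit);
        printf("first access:  %s\n", hit ? "hit" : "miss");  /* miss */
        cache_read(0x34, &hit);
        printf("second access: %s\n", hit ? "hit" : "miss");  /* hit  */
        return 0;
    }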

Slide 12: A 1024-entry Direct-mapped Cache (7.2)

[Figure: a 32-bit memory address splits into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0). The index selects one of 1024 cache entries, each holding a valid bit, a 20-bit tag, and one 32-bit word (one block). A hit is signaled when the selected entry is valid and its stored tag matches the address tag.]

Slide 13: Example - 1024-entry Direct Mapped Cache (7.2)

Byte address split: tag = bits 31-12, index = bits 11-2, byte offset = bits 1-0. Assume the cache has been used for a while, so it's not empty.

LW $t3, 0x0000E00C($0)
    address = 0000 0000 0000 0000 1110 0000 0000 1100
    tag = 14, index = 3, byte offset = 0
    Hit: the entry at index 3 is valid with tag 14; the data is 34238829.

LB $t3, 0x00003005($0)   (assume the word at mem[0x00003004] is 8764)
    address = 0000 0000 0000 0000 0011 0000 0000 0101
    tag = 3, index = 1, byte offset = 1
    Miss: load the word from mem[0x00003004] and write it into the cache at index 1 with tag 3 (data 8764).
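The two decompositions can be checked mechanically. This short C program applies the bit positions from slide 12 to both addresses (only the addresses and field widths come from the slides):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t a1 = 0x0000E00C, a2 = 0x00003005;
        /* tag = bits 31-12, index = bits 11-2, byte offset = bits 1-0 */
        printf("tag=%u index=%u offset=%u\n",            /* prints 14 3 0 */
               (unsigned)(a1 >> 12), (unsigned)((a1 >> 2) & 0x3FF),
               (unsigned)(a1 & 3));
        printf("tag=%u index=%u offset=%u\n",            /* prints 3 1 1 */
               (unsigned)(a2 >> 12), (unsigned)((a2 >> 2) & 0x3FF),
               (unsigned)(a2 & 3));
        return 0;
    }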

Slide 14: Separate I- and D-Caches (7.2)

It is common to use two separate caches, one for instructions and one for data:
- All instruction fetches use the I-cache
- All data accesses (loads and stores) use the D-cache

This allows the CPU to access the I-cache at the same time it is accessing the D-cache. On a miss, both caches still have to share a single main memory.

Slide 15: So, how'd we do? (7.2)

Miss rates for the DEC 3100 (a MIPS machine) with separate 64 KB instruction/data caches (16K 1-word blocks):

    Benchmark  Instruction miss rate  Data miss rate  Combined miss rate
    gcc        6.1%                   2.1%            5.4%
    spice      1.2%                   1.3%            1.2%

Note: the combined rate isn't just the average; it is weighted by how often instructions and data are actually accessed.

Slide 16: The issue of writes (7.2)

What to do on a store (hit or miss)? It won't do to just write it to the cache: the cache would then have a different (newer) value than main memory.

- Simple write-through: write both the cache and memory. Works correctly, but slowly.
- Buffered write-through: write the cache, and buffer a write request to main memory. One to ten buffer slots are typical.
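A hedged C sketch of buffered write-through. The slides only say that the cache is written immediately and that 1-10 buffer slots are typical; the 4-slot buffer, the names (store_word, drain_one), and the stall-on-full behavior are illustrative assumptions:

    #include <stdint.h>
    #include <stdio.h>

    #define BUF_SLOTS 4

    struct wb { uint32_t addr, data; };
    static struct wb buf[BUF_SLOTS];
    static int buf_count = 0;

    /* stand-ins for the cache array and the slow memory write */
    static void cache_update(uint32_t addr, uint32_t data) { (void)addr; (void)data; }
    static void memory_write(uint32_t addr, uint32_t data) {
        printf("memory write: [0x%x] = %u\n", (unsigned)addr, (unsigned)data);
    }

    /* drain one buffered write; hardware does this in the background */
    static void drain_one(void) {
        memory_write(buf[0].addr, buf[0].data);
        for (int i = 1; i < buf_count; i++) buf[i - 1] = buf[i];
        buf_count--;
    }

    void store_word(uint32_t addr, uint32_t data) {
        cache_update(addr, data);     /* cache sees the new value immediately */
        if (buf_count == BUF_SLOTS)   /* buffer full: the CPU would stall here */
            drain_one();
        buf[buf_count++] = (struct wb){ addr, data };
    }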

Slide 17: What about Spatial Locality? (7.2)

Spatial locality says that physically close data is likely to be accessed close together. So on a cache miss, don't just grab the word needed; grab the words nearby too:

- Organize memory in multi-word blocks
- Memory transfers between the cache and main memory are always one full block

Example of 4-word blocks: each block is 16 bytes, so the slide's 16-word memory fragment (addresses 00 00 00 through 11 11 00) groups into four blocks. On a miss, the cache copies the entire block that contains the desired word.

Slide 18: Working with Blocks

The address now splits four ways: an 18-bit tag (bits 31-14), a 10-bit index (bits 13-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0). A cache entry holds a valid bit, a tag, and one 4-word block (Word 0 through Word 3).

- All words in the same block have the same index and tag
- The block offset selects the requested word's position within the block
- The block size may be any power of 2 words: 1, 2, 4, 8, 16, ...
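A small C sketch of this four-way split, using the bit positions from the slide (the function and struct names are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    struct fields { unsigned tag, index, block_off, byte_off; };

    /* byte offset = bits 1-0, block offset = bits 3-2,
     * index = bits 13-4 (10 bits), tag = bits 31-14 (18 bits) */
    struct fields split(uint32_t addr) {
        return (struct fields){
            .byte_off  =  addr        & 0x3,
            .block_off = (addr >> 2)  & 0x3,    /* which word in the block */
            .index     = (addr >> 4)  & 0x3FF,  /* which cache entry */
            .tag       =  addr >> 14,           /* identifies the memory block */
        };
    }

    int main(void) {
        struct fields f = split(0x0000E00C);
        printf("tag=%u index=%u block=%u byte=%u\n",
               f.tag, f.index, f.block_off, f.byte_off);  /* 3 512 3 0 */
        return 0;
    }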

Slide 19: 32KByte/4-Word Block D.M. Cache (7.2)

32 KB / (4 words/block x 4 bytes/word) gives 2K blocks, so the index is 11 bits. The 32-bit address splits into a 17-bit tag (bits 31-15), an 11-bit index (bits 14-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0). On a hit, the block offset drives a 4-to-1 multiplexer that selects one of the four 32-bit words in the block.

Slide 20: How Much Change? (7.2)

Miss rates for the DEC 3100 (a MIPS machine) with separate 64 KB instruction/data caches (16K 1-word blocks or 4K 4-word blocks):

    Benchmark  Block size (words)  Instruction miss rate  Data miss rate  Combined miss rate
    gcc        1                   6.1%                   2.1%            5.4%
    gcc        4                   2.0%                   1.7%            1.9%
    spice      1                   1.2%                   1.3%            1.2%
    spice      4                   0.3%                   0.6%            0.4%

Slide 21: Choosing a block size (7.2)

Large block sizes help with spatial locality, but:
- It takes time to read the memory in, so larger block sizes increase the time for misses
- For a fixed cache size, they reduce the number of blocks in the cache (number of blocks = cache size / block size)

We need to find a middle ground: 16-64 bytes works nicely. The tabulation below shows how quickly the block count falls.
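A tiny C loop makes the tradeoff concrete for a fixed 32 KB cache (the cache size is borrowed from slide 19; the range of block sizes is illustrative):

    #include <stdio.h>

    int main(void) {
        const int cache_bytes = 32 * 1024;
        /* doubling the block size halves the number of blocks */
        for (int block_bytes = 16; block_bytes <= 256; block_bytes *= 2)
            printf("%3d-byte blocks -> %4d blocks in the cache\n",
                   block_bytes, cache_bytes / block_bytes);
        return 0;
    }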

Slide 22: Other Cache organizations (7.3)

Direct mapped: Address = Tag | Index | Block offset. Each address has only one possible location; the index selects it directly (the slide's figure shows 16 entries, indexes 0-15, each with V, Tag, and Data fields).

Fully associative: Address = Tag | Block offset. There is no index; a block may live in any entry, so every entry's tag must be compared against the address.

Slide 23: Fully Associative vs. Direct Mapped (7.3)

Fully associative caches provide much greater flexibility: nothing gets "thrown out" of the cache until it is completely full. Direct-mapped caches are more rigid: any cached data goes exactly where the index says, even if the rest of the cache is empty.

A problem, though: fully associative caches require a complete search through all the tags to see if there's a hit, while direct-mapped caches only need to look in one place.

Slide 24: A Compromise (7.3)

- 2-way set associative: Address = Tag | Index | Block offset. Each address has two possible locations (ways) with the same index. One fewer index bit: half the indexes of an equal-sized direct-mapped cache.
- 4-way set associative: Address = Tag | Index | Block offset. Each address has four possible locations with the same index. Two fewer index bits: one quarter the indexes.
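To make the two-location idea concrete, here is a minimal C sketch of a 2-way set-associative lookup. The set count, the 4-word blocks, and the names (lookup, sets) are illustrative assumptions, not from the slides:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS 4
    #define WAYS     2

    struct way { bool valid; unsigned tag; uint32_t data[4]; };
    static struct way sets[NUM_SETS][WAYS];

    /* The index selects a set of two ways; hardware compares both
     * tags in parallel (a loop here). */
    bool lookup(uint32_t addr, uint32_t *word) {
        unsigned block_off = (addr >> 2) & 0x3;   /* word within the block */
        unsigned index     = (addr >> 4) % NUM_SETS;
        unsigned tag       = (addr >> 4) / NUM_SETS;

        for (int w = 0; w < WAYS; w++) {
            struct way *way = &sets[index][w];
            if (way->valid && way->tag == tag) {
                *word = way->data[block_off];     /* hit in way w */
                return true;
            }
        }
        return false;   /* miss: the caller refills one way (see slide 27) */
    }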

Slide 25: Set Associative Example (7.3)

128-byte cache, 4-word (16-byte) blocks, 10-bit addresses, 1- to 4-way associativity. The address fields are: byte offset (2 bits), block offset (2 bits), index (1-3 bits, depending on associativity), tag (3-5 bits).

Access sequence and results:

    Address       Direct-mapped  2-way  4-way
    0100111000    Miss           Miss   Miss
    1100110100    Miss           Miss   Miss
    0100111100    Miss           Hit    Hit
    0110110000    Miss           Miss   Miss
    1100111000    Miss           Miss   Hit

In the direct-mapped cache all five addresses collide on index 011, so every access evicts the previous occupant and nothing hits. With 2 ways, the third access hits (it falls in the same block as the first), but the fourth access evicts the block needed by the fifth. With 4 ways, all three distinct blocks fit in the same set (final contents: tags 01001, 11001, 01101, one way empty), so the last access hits as well.

Slide 26: New Performance Numbers (7.3)

Miss rates for the DEC 3100 (a MIPS machine) with separate 64 KB instruction/data caches (4K 4-word blocks):

    Benchmark  Associativity  Instruction miss rate  Data miss rate  Combined miss rate
    gcc        Direct         2.0%                   1.7%            1.9%
    gcc        2-way          1.6%                   1.4%            1.5%
    gcc        4-way          1.6%                   1.4%            1.5%
    spice      Direct         0.3%                   0.6%            0.4%
    spice      2-way          0.3%                   0.6%            0.4%
    spice      4-way          0.3%                   0.6%            0.4%

Slide 27: Block Replacement Strategies (7.5)

We have to replace a block when there is a collision; collisions occur whenever the selected set is full.

- Strategy 1: Ideal (Oracle). Replace the block that won't be used again for the longest time. Drawback: requires knowledge of the future.
- Strategy 2: Least Recently Used (LRU). Replace the block that was last used (hit) the longest time ago. Drawback: requires difficult bookkeeping.
- Strategy 3: Approximate LRU. Set a use bit for each block every time it is hit, and clear all use bits periodically. Replace a block whose use bit is not set (see the sketch after this list).
- Strategy 4: Random. Pick a block at random (works almost as well as approximate LRU).
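Strategy 3 is easy to sketch in C for one set. The 4-way set, the function names, and falling back to random choice (Strategy 4) when every use bit is set are illustrative assumptions:

    #include <stdbool.h>
    #include <stdlib.h>

    #define WAYS 4

    struct way { bool valid, used; };

    void touch(struct way *set, int w) {    /* on every hit */
        set[w].used = true;
    }

    void clear_use_bits(struct way *set) {  /* run periodically */
        for (int w = 0; w < WAYS; w++) set[w].used = false;
    }

    int pick_victim(struct way *set) {
        for (int w = 0; w < WAYS; w++)
            if (!set[w].valid) return w;    /* free slot: no eviction needed */
        for (int w = 0; w < WAYS; w++)
            if (!set[w].used) return w;     /* not recently used */
        return rand() % WAYS;               /* all recently used: pick at random */
    }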

Slide 28: The Three C's of Misses (7.5)

- Compulsory misses: the first time a memory location is accessed, it is always a miss. Also known as cold-start misses. The only way to decrease this miss rate is to increase the block size.
- Capacity misses: occur when a program is using more data than can fit in the cache; some misses result simply because the cache isn't big enough. Increasing the size of the cache solves this problem.
- Conflict misses: occur when a block forces out another block with the same index. Increasing associativity reduces conflict misses; they are worst in direct-mapped caches and non-existent in fully associative ones.

Slide 29: Cache Sizing

How big should the cache be? As big as possible: hold as much data in the cache as you can. But smaller is faster: the cache must provide data within one CPU cycle to avoid stalling, so it must be on the same chip as the CPU.

Make the cache as large as possible until either:
- Access time exceeds 1 CPU cycle, or
- You run out of room on the CPU chip

Slide 30: Multi-level Caches

The difference between a cache hit (1 cycle) and a miss (30-50 cycles) is huge. Introduce a series of larger but slower caches to smooth out the difference:

- L1 cache: as big as it can be with 1-cycle access
- L2 cache: as big as it can be in 3-5 cycles
- L3 cache: as big as it can be in 5-10 cycles

The L2/L3 caches may be on or off chip, depending on CPU speeds and constraints.
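As a rough illustration of why the extra levels help, here is a back-of-the-envelope average access time calculation in C. Only the 1-cycle L1 figure and the 30-50 cycle memory figure come from the slide; the hit rates and the L2/L3 latencies are assumed purely for the example:

    #include <stdio.h>

    int main(void) {
        double l1 = 0.95, l2 = 0.80, l3 = 0.60;  /* assumed hit rates */
        /* each level is consulted only when the faster one misses */
        double t = l1 * 1 +
                   (1 - l1) * (l2 * 4 +
                   (1 - l2) * (l3 * 8 +
                   (1 - l3) * 40));
        printf("average access time = %.2f cycles\n", t);  /* about 1.32 */
        return 0;
    }

Even with these made-up numbers, the point stands: the hierarchy keeps the average close to the 1-cycle L1 time rather than the 30-50 cycle memory time.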

