
1 ECE3055 Computer Architecture and Operating Systems. Lecture 8: Memory Subsystem. Prof. Hsien-Hsin Sean Lee, School of Electrical and Computer Engineering, Georgia Institute of Technology

2 Performance matters  Consider a basic 5-stage pipeline design: the CPU runs at a 1 ns cycle time (1 GHz), while main memory has a 100 ns access time.  How is performance?  Fetch A (100 cycles)  Decode A (1 cycle) + Fetch B (100 cycles)  Ex A (1 cy) + Dec B (1 cy) + Fetch C (100 cycles)  Effectively 1 instruction per 100 cycles --> a 10 MHz CPU. Memory latency is killing the system.  The problem is only getting worse: CPU speeds grow much faster than DRAM speeds, even though DRAM bandwidth is improving well.

3 Typical Solution  With the large gap between CPU and DRAM speeds, we must have a solution: a cache ("slave memory"), an ultra-fast, small local memory.  If the cache runs at a 1 ns access time...  Fetch A (1 cycle)  Decode A (1 cycle) + Fetch B (1 cycle)  Ex A (1 cy) + Dec B (1 cy) + Fetch C (1 cy)  Effectively 1 instruction per cycle ----> a 1 GHz CPU  Special terms: hit, miss, hit/miss rates
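In practice the cache does not hit every time, so the effective speed sits between the two extremes. A minimal C sketch of the usual accounting; the 3% miss rate is an assumed illustrative number, not from the slides:

/* Average memory access time (AMAT) and effective instruction rate for the
 * 1 GHz pipeline above; the miss rate is assumed for illustration. */
#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;     /* cycles on a cache hit */
    double miss_rate    = 0.03;    /* assumed: 3% of accesses miss */
    double miss_penalty = 100.0;   /* cycles to reach DRAM */

    double amat = hit_time + miss_rate * miss_penalty;
    double cpi  = 1.0 + miss_rate * miss_penalty;   /* ~1 memory reference per instr. */

    printf("AMAT = %.1f cycles, effective CPI = %.1f\n", amat, cpi);
    printf("Effective rate at 1 GHz ~ %.0f MIPS\n", 1000.0 / cpi);
    return 0;
}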

4 Memories: Two basic types*  SRAM: value is stored on a pair of inverting gates; very fast but takes up more space than DRAM (4 to 6 transistors)  DRAM: value is stored as a charge on a capacitor (must be refreshed); very small but slower than SRAM (by a factor of 5 to 10)  * ignoring new technologies 'on the horizon'  (Figures: an SRAM cell with word line, pass transistors, and bit line / bit line bar; a DRAM cell with word line, pass transistor, capacitor, and bit line.)

5 Memory Photos  Intel Paxville (dual core), 90 nm: 8-way 2MB L2 for each core  Intel Itanium 2, 0.13 µm: 24-way 6MB L3  240-pin DDR2 DRAM

6 Exploiting Memory Hierarchy  Users want large and fast memories! For example: SRAM access times are 700 ps-1 ns (1-3 cycles); DRAM access times are 60-100 ns (100-250 cycles); disk access times are ~1 million ns (~3M cycles).  Try to give it to them anyway: build a memory hierarchy.

7 Model of Memory Hierarchy  Register file → L1 instruction cache and L1 data cache → L2 cache (SRAM) → main memory (DRAM) → disk.

8 P4: Prescott w/ 2MB L2 (90 nm)  Prescott runs very fast (3.4+ GHz)  2MB unified L2 cache  12K-µop trace cache (think "I$")  16KB data cache  Where is the cache?  What about the similar blocks?  Why the visual differences?  Why is it square?  What's with the colors?  Check this out: www.chip-architect.com  (Die photo labels: TC, L1D, L2)

9 Interfacing Processors and Peripherals  I/O design is affected by many factors (expandability, resilience)  Performance depends on: access latency, throughput, the connection between devices and the system, the memory hierarchy, and the operating system  A variety of different users (e.g., banks, supercomputers, engineers)

10 I/O Devices  Very diverse devices, differing in: behavior (i.e., input vs. output), partner (who is at the other end?), and data rate

11 I/O Example: Disk Drives  To access data: seek: position the head over the proper track (3.5-10 ms avg.); rotational latency: wait for the desired sector (on average half a rotation, i.e., 0.5 / rotation rate in RPM); transfer: grab the data (one or more sectors) at 2 to 15 MB/sec, not counting disk buffer hits (100-320 MB/s)
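The three components simply add. A small C sketch of the model; the drive parameters (8 ms average seek, 7200 RPM, 4 MB/s sustained transfer, 512-byte sector) are assumed for illustration, not taken from the slides:

/* Disk access time = seek + rotational latency + transfer. */
#include <stdio.h>

int main(void) {
    double seek_ms      = 8.0;                    /* assumed average seek */
    double rpm          = 7200.0;
    double rot_ms       = 0.5 * 60000.0 / rpm;    /* half a revolution, in ms */
    double transfer_mbs = 4.0;                    /* assumed sustained transfer rate */
    double sector_kb    = 0.5;                    /* 512-byte sector */
    double xfer_ms      = sector_kb / 1024.0 / transfer_mbs * 1000.0;

    printf("access = %.2f ms seek + %.2f ms rotation + %.3f ms transfer = %.2f ms\n",
           seek_ms, rot_ms, xfer_ms, seek_ms + rot_ms + xfer_ms);
    return 0;
}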

12 Locality  A principle that makes having a memory hierarchy a good idea.  Temporal locality: if an item is referenced, it will tend to be referenced again soon.  Spatial locality: if an item is referenced, nearby items will tend to be referenced soon.  Why does code have locality? What about data locality?  Our initial focus: two levels (upper, lower); block: minimum unit of data; hit: data requested is in the upper level; miss: data requested is not in the upper level.
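Ordinary code exhibits both kinds of locality. A purely illustrative C loop: the scalar sum and index i are reused every iteration (temporal locality), and the elements of a[] are touched at consecutive addresses (spatial locality), so every cache block that is fetched gets fully used:

#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];                 /* zero-initialized array */
    int sum = 0;
    for (int i = 0; i < N; i++) {    /* sequential walk  -> spatial locality  */
        sum += a[i];                 /* sum, i reused    -> temporal locality */
    }
    printf("sum = %d\n", sum);
    return 0;
}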

13 Cache  Two issues: How do we know if a data item is in the cache? If it is, how do we find it?  Our first example: block size is one word of data, "direct mapped": for each item of data at the lower level, there is exactly one location in the cache where it might be, so lots of items at the lower level share a location in the upper level.

14 Direct Mapped Cache  Mapping: the cache index is the block address modulo the number of blocks in the cache.

15 Direct Mapped Cache  For MIPS: what kind of locality are we taking advantage of?
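Concretely, a direct-mapped cache carves a byte address into a tag, an index, and a byte offset. A small C sketch (not the lecture's code) using the same geometry as the 8-entry, 4-byte-block example that follows:

#include <stdio.h>
#include <stdint.h>

#define BLOCK_BYTES 4   /* -> 2 byte-offset bits */
#define ENTRIES     8   /* -> 3 index bits       */

int main(void) {
    uint32_t addrs[] = {24, 28, 60, 188};              /* the accesses used below */
    for (int i = 0; i < 4; i++) {
        uint32_t a      = addrs[i];
        uint32_t offset = a % BLOCK_BYTES;              /* low 2 bits            */
        uint32_t index  = (a / BLOCK_BYTES) % ENTRIES;  /* next log2(8) = 3 bits */
        uint32_t tag    = a / (BLOCK_BYTES * ENTRIES);  /* remaining high bits   */
        printf("#%-3u -> tag %u, index %u, offset %u\n", a, tag, index, offset);
    }
    return 0;
}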

16-51 Example: DM$, 8-Entry, 4B Blocks (step-by-step walkthrough)  Setup: a direct-mapped cache with 8 entries, each holding one 4-byte block plus a Valid bit and a Tag (a Dirty bit is added later). Main memory holds word [a] at #24, [b] at #28, [c] at #60, and [d] at #188. Access sequence: lw $0, #24 [a]; lw $1, #28 [b]; sw $2, #60 [c]; sw $3, #188 [d]. All entries start invalid.  lw $0, #24: #24 is 0001 1000. With 4-byte blocks, drop the low 2 bits as the byte offset (this only matters for byte-addressable systems). The next log2(8) = 3 bits, i.e., the block address mod 8, form the index: 110 = 6. The remaining high bits are the tag: 000. Entry 6 is invalid, so this is a miss: copy [a] into entry 6, write tag 000, and set the Valid bit.  lw $1, #28: #28 is 0001 1100 → index 111 = 7, tag 000. Entry 7 is invalid: miss; copy [b] into entry 7 with tag 000 and set Valid.  sw $2, #60: #60 is 0011 1100 → index 7, tag 001. Entry 7 is valid, so how do we tell it holds the wrong address? The tags don't match (001 vs. 000), so it's not what we want to access. Evict [b], copy [c] into entry 7 with tag 001, set Valid, and perform the store into the cached copy: the cache now holds [c] NEW while memory still holds [c] OLD.  Q: What if the machine is only word-addressable? Then there are no byte-offset bits to drop: the tag is 2 bits larger, but the scheme is otherwise the same (note that the index, and hence which data lands where, changes!).  Q: What about writing back to memory? Do we update memory now (write-through) or later (write-back)? Assume later...  sw $3, #188: #188 is 1011 1100 → index 7, tag 101. Entry 7 holds modified data; how do we know we must write it back? We need extra state: the "dirty" bit, set whenever a store updates the cached copy. Entry 7 is dirty, so first write [c] NEW back to memory, then fetch [d] into entry 7 with tag 101, set Valid, perform the store (the cache holds [d] NEW, memory still holds [d] OLD), and set the Dirty bit again.
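A minimal C sketch of a direct-mapped, write-back, write-allocate cache (an illustrative simulator, not the course's reference code) that replays the four accesses from this walkthrough and reports hits, misses, and write-backs:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define ENTRIES     8
#define BLOCK_BYTES 4

struct line { bool valid, dirty; uint32_t tag; };
static struct line cache[ENTRIES];

static void cache_access(uint32_t addr, bool is_store) {
    uint32_t block = addr / BLOCK_BYTES;
    uint32_t index = block % ENTRIES;
    uint32_t tag   = block / ENTRIES;
    struct line *l = &cache[index];

    if (l->valid && l->tag == tag) {
        printf("#%-3u %s: hit  (index %u)\n", addr, is_store ? "sw" : "lw", index);
    } else {
        if (l->valid && l->dirty)        /* victim was modified: write it back first */
            printf("      write back dirty block at index %u (tag %u)\n", index, l->tag);
        printf("#%-3u %s: miss (index %u, new tag %u)\n",
               addr, is_store ? "sw" : "lw", index, tag);
        l->valid = true;                 /* write-allocate: fetch the block on a store miss */
        l->dirty = false;
        l->tag   = tag;
    }
    if (is_store) l->dirty = true;       /* write-back: only the cached copy is updated */
}

int main(void) {
    cache_access(24,  false);   /* lw $0, #24  [a] */
    cache_access(28,  false);   /* lw $1, #28  [b] */
    cache_access(60,  true);    /* sw $2, #60  [c] */
    cache_access(188, true);    /* sw $3, #188 [d] */
    return 0;
}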

52 DM$: Thoughts  Trade-offs: Write-back or write-through? Write-allocate or no-write-allocate? How does the tag change with the number of entries? How does the minimum machine word size impact the tag? What kind of locality are we taking advantage of?
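On the tag-size question: for a byte-addressable machine with 32-bit addresses, tag bits = 32 - log2(entries) - log2(block bytes), so more entries or bigger blocks mean a smaller tag. A quick C sketch with some arbitrary example geometries:

#include <stdio.h>

static int log2i(int x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

int main(void) {
    int addr_bits = 32;
    int geom[][2] = { {8, 4}, {1024, 4}, {1024, 16} };   /* {entries, block bytes} */
    for (int i = 0; i < 3; i++) {
        int entries = geom[i][0], block = geom[i][1];
        printf("%4d entries x %2dB blocks: %d index bits, %d offset bits, %d tag bits\n",
               entries, block, log2i(entries), log2i(block),
               addr_bits - log2i(entries) - log2i(block));
    }
    return 0;
}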

53 Direct Mapped Cache  Taking advantage of spatial locality with multi-word blocks.  (Figure: a direct-mapped cache with 4K entries of 4-word (128-bit) blocks. The 32-bit address splits into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0); the tag comparison produces the Hit signal, and a multiplexor selects the requested 32-bit word from the 128-bit data entry.)

54 Hits vs. Misses  Read hits: this is what we want!  Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, restart.  Write hits: either replace the data in both cache and memory (write-through), or write the data only into the cache and write the block back to memory later (write-back).  Write misses: read the entire block into the cache, then write the word... ?

55 Hardware Issues  Make reading multiple words easier by using banks of memory.  It can get a lot more complicated...

56 Performance  Increasing the block size tends to decrease the miss rate.  Use split caches because there is more spatial locality in code.  (Figure: miss rate, 0-40%, vs. block size, 4 to 256 bytes, for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB.)

57 Performance  Simplified model: execution time = (execution cycles + stall cycles) × cycle time, where stall cycles = # of instructions × miss ratio × miss penalty.  Two ways of improving performance: decreasing the miss ratio, or decreasing the miss penalty.  What happens if we increase block size?
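A hedged worked example of this simplified model in C; the program size, miss ratio, and miss penalty below are assumed illustrative numbers, not figures from the lecture:

#include <stdio.h>

int main(void) {
    double instructions = 1e9;
    double base_cpi     = 1.0;     /* execution cycles per instruction */
    double miss_ratio   = 0.02;    /* assumed: misses per instruction  */
    double miss_penalty = 100.0;   /* assumed: cycles per miss         */
    double cycle_time   = 1e-9;    /* 1 GHz clock                      */

    double exec_cycles  = instructions * base_cpi;
    double stall_cycles = instructions * miss_ratio * miss_penalty;
    double time_s       = (exec_cycles + stall_cycles) * cycle_time;

    printf("effective CPI = %.2f, execution time = %.2f s\n",
           (exec_cycles + stall_cycles) / instructions, time_s);
    return 0;
}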

58 Decreasing miss ratio with associativity  Compared to direct mapped, give a series of references that: results in a lower miss ratio using a 2-way set-associative cache; results in a higher miss ratio using a 2-way set-associative cache (assuming a "least recently used" replacement strategy).

59 An implementation

60 Set-Associative Cache  Multiple cache blocks (lines) can be allocated into the same set.  When the set is full, some block needs to be evicted from the cache.  Need to consider locality when choosing the victim.  Replacement policies: Random, First-In First-Out (FIFO), Last-In First-Out (LIFO, like a stack), Least Recently Used (LRU).

61 Least Recently Used (LRU)  Keep the blocks of a set ordered from MRU to LRU: a hit moves the referenced block to the MRU position; a miss evicts the LRU block and inserts the new block as MRU.  Example (4-way set, listed MRU → LRU), starting from A B C D: access C → C A B D; access D → D C A B; access E (miss, evict B) → E D C A; access C → C E D A; access G (miss, evict A) → G C E D.
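A small C sketch of the LRU bookkeeping for one 4-way set, kept as an array ordered MRU → LRU, replaying the trace above:

#include <stdio.h>
#include <string.h>

#define WAYS 4

static char set[WAYS] = {'A', 'B', 'C', 'D'};          /* MRU ... LRU */

static void touch(char block) {
    int pos = WAYS - 1;                                /* miss: reuse the LRU slot */
    for (int i = 0; i < WAYS; i++)
        if (set[i] == block) { pos = i; break; }       /* hit: found at position i */
    memmove(&set[1], &set[0], pos * sizeof(char));     /* more-recent entries slide down */
    set[0] = block;                                    /* referenced block becomes MRU */
}

int main(void) {
    const char *refs = "CDECG";
    for (int i = 0; refs[i]; i++) {
        touch(refs[i]);
        printf("access %c -> %.4s\n", refs[i], set);
    }
    return 0;
}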

62 Performance

63 Decreasing miss penalty with multilevel caches  Add a second-level cache: often the primary cache is on the same chip as the processor; use SRAMs to add another cache above primary memory (DRAM); the miss penalty goes down if the data is found in the 2nd-level cache.  Example: CPI of 1.0 on a 5 GHz machine with no cache misses. The same machine with a 1st-level (L1) cache, a 2% miss rate per instruction, and a 100 ns DRAM access: what is the CPI? Adding a 2nd-level cache with a 5 ns access time (including L1 access time) decreases the miss rate to main memory to 0.5% per instruction: what is the speedup over the machine with only L1?  Using multilevel caches: try to optimize the hit time on the 1st-level cache; try to optimize the miss rate on the 2nd-level cache.
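One way to work that example in C, using the usual stall-cycle accounting (at 5 GHz the cycle time is 0.2 ns, so the 100 ns DRAM access is 500 cycles and the 5 ns L2 access is 25 cycles):

#include <stdio.h>

int main(void) {
    double cycle_ns = 0.2;                 /* 5 GHz clock                            */
    double l1_miss  = 0.02;                /* misses per instruction that leave L1   */
    double l2_miss  = 0.005;               /* misses per instruction that reach DRAM */
    double dram_pen = 100.0 / cycle_ns;    /* 500 cycles                             */
    double l2_pen   = 5.0 / cycle_ns;      /* 25 cycles                              */

    double cpi_l1   = 1.0 + l1_miss * dram_pen;                      /* 11.0 */
    double cpi_l1l2 = 1.0 + l1_miss * l2_pen + l2_miss * dram_pen;   /*  4.0 */

    printf("CPI with L1 only = %.1f\n", cpi_l1);
    printf("CPI with L1 + L2 = %.1f\n", cpi_l1l2);
    printf("speedup from L2  = %.1fx\n", cpi_l1 / cpi_l1l2);
    return 0;
}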

64 Virtual Memory  Main memory can act as a cache for secondary storage (disk).  Advantages: the illusion of having more physical memory; program relocation; protection.

65 Pages: virtual memory blocks  Page faults: the data is not in memory, so retrieve it from disk. The miss penalty is huge, so pages should be fairly large (e.g., 4KB); reducing page faults is important (LRU is worth the price); faults can be handled in software instead of hardware; write-through is too expensive, so we use write-back.  (Figure: a 32-bit virtual address splits into a virtual page number (bits 31-12) and a page offset (bits 11-0); translation replaces the virtual page number with a physical page number (bits 29-12) to form the physical address.)
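A minimal C sketch of that split-and-translate step, assuming the layout in the figure (4 KB pages, 32-bit virtual addresses); the example address and mapping are made up for illustration:

#include <stdio.h>
#include <stdint.h>

#define PAGE_BITS 12u                                 /* 4 KB pages */

int main(void) {
    uint32_t va  = 0x0040312Cu;                       /* example virtual address     */
    uint32_t vpn = va >> PAGE_BITS;                   /* virtual page number         */
    uint32_t off = va & ((1u << PAGE_BITS) - 1);      /* page offset, passed through */

    uint32_t ppn = 0x2A5u;                            /* pretend the page table maps VPN -> 0x2A5 */
    uint32_t pa  = (ppn << PAGE_BITS) | off;          /* physical address            */

    printf("VA 0x%08X -> VPN 0x%05X + offset 0x%03X -> PA 0x%08X\n", va, vpn, off, pa);
    return 0;
}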

66 Page Tables  (Figure: the page table, indexed by virtual page number, holds a valid bit per entry; valid entries give the physical page number of a page resident in physical memory, while invalid entries point to the page's location in disk storage.)

67 Page Tables

68 Making Address Translation Fast  A cache for address translations: the translation-lookaside buffer (TLB).
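A toy fully-associative TLB lookup in C (an illustrative sketch, not the lecture's design): a hit turns translation into a single probe, while a miss would fall back to walking the page table and refilling an entry. The pre-loaded mapping below is made up:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 4
#define PAGE_BITS   12u

struct tlb_entry { bool valid; uint32_t vpn, ppn; };

static struct tlb_entry tlb[TLB_ENTRIES] = {
    { true, 0x00403u, 0x2A5u },                        /* assumed, pre-loaded translation */
};

static bool tlb_lookup(uint32_t va, uint32_t *pa) {
    uint32_t vpn = va >> PAGE_BITS;
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {       /* TLB hit */
            *pa = (tlb[i].ppn << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1));
            return true;
        }
    }
    return false;                                      /* TLB miss: walk the page table */
}

int main(void) {
    uint32_t pa, va = 0x0040312Cu;
    if (tlb_lookup(va, &pa))
        printf("TLB hit:  VA 0x%08X -> PA 0x%08X\n", va, pa);
    else
        printf("TLB miss: VA 0x%08X (walk the page table)\n", va);
    return 0;
}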

69 TLBs and caches

70 Modern Systems  Very complicated memory systems.

71 Some Issues  Processor speeds continue to increase very fast, much faster than either DRAM or disk access times.  Design challenge: dealing with this growing disparity.  Trends: synchronous SRAMs (provide a burst of data); redesigned DRAM chips that provide higher bandwidth or processing; restructuring code to increase locality; prefetching (making the cache visible to the ISA).

