
1. Memory (© 1998 Morgan Kaufmann Publishers)
[Figure: CPU connected to memory.]
Memory should be large and fast.
– large amounts of code and data
– time-critical applications
Fast memories are expensive, slow memories are cheap.
A fast and large memory results in an expensive system.
Memory is used to store programs and data.

2. Main Memory Types
The CPU uses the main memory on the instruction level.
RAM (Random Access Memory): we can read and write.
– Static
– Dynamic
ROM (Read-Only Memory): we can only read.
– Changing the contents is possible for most types.
Characteristics
– Access time
– Price
– Volatility

3. Random Access Memories
SRAM:
– bit value is stored on a pair of inverting gates
– very fast but takes up more space than DRAM (4 to 6 transistors)
DRAM:
– bit value is stored as a charge on a capacitor (must be refreshed)
– very small but slower than SRAM (factor of 2 to 5)
[Figure: memory cell circuits; labels include word line, bit line, access transistor, and capacitor.]

4. Exploiting Memory Hierarchy
Users want large and fast memories!
Access times and costs (2003):
– SRAM access times are 5 - 12 ns at a cost of $25 per MB.
– DRAM access times are 5 - 20 ns at a cost of $0.15 per MB.
– Disk access times are 7 - 10 ms at a cost of $0.001 per MB.
Give it to them anyway.
– build a memory hierarchy
[Figure: levels in the memory hierarchy (CPU, Level 1, Level 2, ..., Level n); the distance from the CPU in access time and the size of the memory both increase at each level.]

5. Locality
A principle that makes having a memory hierarchy a good idea.
If an item is referenced,
– temporal locality: it will tend to be referenced again soon
– spatial locality: nearby items will tend to be referenced soon
Why does code have locality?
– loops
– instructions accessed sequentially
– arrays, records

6. Memory Hierarchy
[Figure: CPU (registers), cache, main memory, secondary memory.]
Cache: levels L1, L2, … (hardware implementation, SRAMs)
Virtual memory (software implementation)
The computer uses the main memory (DRAM) on the instruction level.

7. Memory Hierarchy
A pair of levels in the memory hierarchy:
– two levels: upper and lower
– block: minimum unit of data transferred between upper and lower level
– hit: data requested is in the upper level (hit rate)
– miss: data requested is not in the upper level (miss rate)
– miss penalty depends mainly on lower level access time

8. Cache Basics
What information goes into the cache in addition to the referenced one?
How do we know if a data item is in the cache?
If it is, how do we find it?

9. Direct Mapped Cache
Simple approach: direct mapped
– block size is one word
– every main memory location can be mapped to exactly one cache location
– lots of words in the main memory share a single location in the cache
Address in the cache = (address in the main memory) modulo (number of words in the cache)
– cache address is identical to the lower bits of the main memory address
– tag (higher address bits) differentiates between competing main memory words
We are taking advantage of temporal locality.
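
The modulo rule above can be illustrated with a minimal sketch. The function name `direct_mapped_slot` and the 8-word cache size are assumptions chosen for illustration; they do not come from the slides.

```python
def direct_mapped_slot(word_address, cache_words):
    """Return (cache index, tag) for a word address.

    Assumes a direct-mapped cache with one-word blocks, as on the slide.
    """
    index = word_address % cache_words   # low-order part selects the cache location
    tag = word_address // cache_words    # high-order part distinguishes competing words
    return index, tag


# Hypothetical 8-word cache: word addresses 5, 13 and 21 compete for one location.
for address in (5, 13, 21):
    print(address, direct_mapped_slot(address, 8))
```

When the number of cache words is a power of two, the modulo and the division reduce to selecting the lower and the higher address bits.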

10. Direct Mapped Cache: Simple Example

11. A More Realistic Example
– 32-bit word length
– 32-bit address
– 1 K-word (1024-word) cache
– block size 1 word
– 10-bit cache index
– 20-bit tag
– 2-bit byte offset (word alignment assumed)
– valid bit
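
As a rough illustration of these field widths, the sketch below splits a 32-bit byte address into tag, index, and byte offset. The function name `split_address` and the sample address are hypothetical.

```python
BYTE_OFFSET_BITS = 2   # selects a byte within the 32-bit word
INDEX_BITS = 10        # selects one of the 1024 cache entries
TAG_BITS = 20          # remaining high-order address bits


def split_address(address):
    """Split a 32-bit byte address into (tag, index, byte offset)."""
    byte_offset = address & ((1 << BYTE_OFFSET_BITS) - 1)
    index = (address >> BYTE_OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (BYTE_OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_offset


tag, index, byte_offset = split_address(0x12345678)
print(hex(tag), hex(index), byte_offset)
```

The three widths add up to the 32 address bits, so every address maps to exactly one cache entry.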

12. Cache Access

13. Cache Size
Cache memory size: 1024 × 32 b = 32 kb
Tag memory size: 1024 × 20 b = 20 kb
Valid information: 1024 × 1 b = 1 kb
Efficiency: 32/53 = 60.4 %
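
The arithmetic above can be checked with a short sketch. The helper name `cache_storage_bits` is an assumption made for illustration.

```python
def cache_storage_bits(entries, data_bits, tag_bits, valid_bits=1):
    """Return (data bits, total bits) stored by a direct-mapped cache."""
    data = entries * data_bits
    total = entries * (data_bits + tag_bits + valid_bits)
    return data, total


data, total = cache_storage_bits(entries=1024, data_bits=32, tag_bits=20)
efficiency = data / total   # share of the stored bits that hold actual data
print(data, total, round(100 * efficiency, 1))
```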

14. Cache Hits and Misses
Cache hit: continue
– access the cache
Cache miss
– stall the CPU
– get information from the main memory
– write information in the cache: data, tag, set valid bit
– resume execution

15. Hits and Misses in More Detail
Read hits
– this is what we want!
Read misses
– stall the CPU, fetch block from memory, deliver to cache, restart
Write hits:
– can replace the data in the cache and memory (write-through)
– write the data only in the cache, write in the main memory later (write-back)
Write misses:
– read the entire block into the cache, then write the word

16. Write Through and Write Back
Write through
– update cache and main memory at the same time
– may result in extra main memory writes
– requires a write buffer; it stores data while it is waiting to be written to memory
Write back
– update main memory only during replacement
– replacement may be slower
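
The difference in main-memory traffic between the two policies can be sketched as follows. This is a simplified model, assuming a direct-mapped cache with one-word blocks and a stream that contains only writes; the function name `count_memory_writes` and the sample stream are hypothetical.

```python
def count_memory_writes(write_addresses, cache_words, policy):
    """Count main-memory writes caused by a stream of word writes.

    policy is "write-through" or "write-back". Dirty words still in the
    cache at the end of the stream are not counted for write-back.
    """
    if policy == "write-through":
        return len(write_addresses)      # every write also goes to memory
    if policy != "write-back":
        raise ValueError("unknown policy: " + policy)

    dirty_tag = {}                       # cache index -> tag of the dirty word held there
    memory_writes = 0
    for address in write_addresses:
        index, tag = address % cache_words, address // cache_words
        if index in dirty_tag and dirty_tag[index] != tag:
            memory_writes += 1           # a dirty word is replaced: copy it back first
        dirty_tag[index] = tag
    return memory_writes


stream = [0, 0, 0, 8, 0]                 # addresses 0 and 8 share index 0 in an 8-word cache
print(count_memory_writes(stream, 8, "write-through"))
print(count_memory_writes(stream, 8, "write-back"))
```

Repeated writes to the same word cost one memory write each under write-through, but only one write per replacement under write-back.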

17. Combined vs. Split Cache
Combined cache
– size equal to the sum of the split caches
– no rigid division between locations used by instructions and data
– usually a slightly better hit rate
– possibly stalls due to simultaneous access to instructions and data
– lower bandwidth because of sharing of resources
Split instruction and data cache
– increased bandwidth from the cache
– slightly lower hit rate
– no conflict when accessing instructions and data simultaneously

18. Taking Advantage of Spatial Locality
The cache described so far:
– simple
– block size one word
– only temporal locality exploited
Spatial locality
– block size longer than one word
– when a miss occurs, multiple adjacent words are fetched

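19. Direct Mapped Cache with Four-Word Blocks
[Figure: 32-bit address (showing bit positions) split into a 16-bit tag (bits 31-16), a 12-bit index (bits 15-4), a 2-bit block offset (bits 3-2), and a 2-bit byte offset (bits 1-0). The cache has 4K entries, each with a valid bit, a 16-bit tag, and 128 bits of data; a multiplexor selects one 32-bit word. Outputs: Hit and Data.]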

20. Miss Rate vs. Block Size
Increasing block size tends to decrease miss rate.
There is more spatial locality in code.
[Figures: miss rate vs. block size.]

21. Achieving Higher Memory Bandwidth
[Figure: three memory organizations. a. One-word-wide memory organization (CPU, cache, bus, memory). b. Wide memory organization (CPU, multiplexor, cache, bus, memory). c. Interleaved memory organization (CPU, cache, bus, memory banks 0 to 3).]
Miss penalties for a four-word block:
a. four memory latencies and four bus cycles
b. one memory latency and one bus cycle
c. one memory latency and four bus cycles
The memory latency is much longer than the bus cycle.
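
The three miss penalties can be compared numerically. The sketch below assumes a memory latency of 15 clock cycles and a bus transfer of 1 clock cycle; these numbers and the function name `miss_penalty` are illustrative assumptions, not values from the slides.

```python
def miss_penalty(organization, words_per_block, memory_latency, bus_cycle):
    """Miss penalty in clock cycles for the three memory organizations."""
    if organization == "one-word-wide":
        # each word needs its own memory access and its own bus transfer
        return words_per_block * (memory_latency + bus_cycle)
    if organization == "wide":
        # memory and bus are as wide as the block: one access, one transfer
        return memory_latency + bus_cycle
    if organization == "interleaved":
        # the banks are accessed in parallel, the words are transferred one by one
        return memory_latency + words_per_block * bus_cycle
    raise ValueError("unknown organization: " + organization)


for name in ("one-word-wide", "wide", "interleaved"):
    print(name, miss_penalty(name, words_per_block=4, memory_latency=15, bus_cycle=1))
```

Because the memory latency dominates, interleaving recovers most of the benefit of the wide organization without widening the bus.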

22. Performance
Simplified model:
execution time = (execution cycles + stall cycles) × cycle time
stall cycles = # of instructions × miss rate × miss penalty
The model is more complicated for writes than reads (write-through vs. write-back, write buffers).
Two ways of improving performance:
– decreasing the miss rate
– decreasing the miss penalty
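
The simplified model translates directly into code. In this sketch the function name `execution_time` and all input values (one million instructions, a 2 % miss rate, a 100-cycle miss penalty, a 2 ns cycle time) are assumptions for illustration.

```python
def execution_time(execution_cycles, instructions, miss_rate, miss_penalty, cycle_time):
    """Execution time in seconds according to the simplified model."""
    stall_cycles = instructions * miss_rate * miss_penalty
    return (execution_cycles + stall_cycles) * cycle_time


# Hypothetical program: 1,000,000 instructions at one execution cycle each.
with_misses = execution_time(1_000_000, 1_000_000, 0.02, 100, 2e-9)
without_misses = execution_time(1_000_000, 1_000_000, 0.0, 100, 2e-9)
print(with_misses, without_misses)
```

With these inputs the stall cycles are twice the execution cycles, which shows why reducing either the miss rate or the miss penalty pays off.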

23. Flexible Placement of Cache Blocks
Direct mapped cache
– a memory block can go in exactly one place in the cache
– use the tag to identify the referenced word
– easy to implement
Fully associative cache
– a memory block can be placed in any location in the cache
– search all entries in the cache in parallel
– expensive to implement (a comparator associated with each cache entry)
Set-associative cache
– a memory block can be placed in a fixed number of locations
– n locations: n-way set-associative cache
– a block is mapped to a set; it can be placed in any element in that set
– search the elements of the set
– implementation simpler than in a fully associative cache

24. Cache Types
[Figure: locating block 12 in an 8-block cache. Direct mapped: blocks 0 to 7, one entry searched. Set associative: sets 0 to 3, one set searched. Fully associative: all entries searched.]
We are looking at block 12 in an 8-block cache; 12 mod 8 = 4, 12 mod 4 = 0.
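
The block 12 example can be reproduced with a small sketch. It assumes that the block frames of a set are numbered consecutively (set s occupies frames s × ways to s × ways + ways - 1); this numbering and the name `candidate_locations` are illustrative assumptions.

```python
def candidate_locations(block_number, cache_blocks, ways):
    """Return the cache block frames in which a memory block may be placed."""
    sets = cache_blocks // ways
    set_index = block_number % sets          # the block maps to exactly one set
    first_frame = set_index * ways
    return list(range(first_frame, first_frame + ways))


print(candidate_locations(12, 8, ways=1))    # direct mapped: 12 mod 8
print(candidate_locations(12, 8, ways=2))    # two-way set associative: 12 mod 4
print(candidate_locations(12, 8, ways=8))    # fully associative: a single set
```

Direct mapped and fully associative caches are the two extremes of the same scheme, with one way per set and one set in total.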

25. Mapping of an Eight-Block Cache

26. Performance Improvement
Associativity reduces high miss rates.

Program  Associativity  Instruction miss rate  Data miss rate  Combined miss rate
gcc      1              2.00%                  1.70%           1.90%
gcc      2              1.60%                  1.40%           1.50%
gcc      4              1.60%                  1.40%           1.50%
spice    1              0.30%                  0.60%           0.40%
spice    2              0.30%                  0.60%           0.40%
spice    4              0.30%                  0.60%           0.40%

27. Locating a Block
Address portions (address layout: tag | index | block offset)
– Index selects the set.
– Tag chooses the block by comparison.
– Block offset is the address of the data within the block.
The costs of an associative cache
– comparators and multiplexers
– time for comparison and selection
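
A minimal sketch of the lookup follows. It assumes that each set is a list of (valid, tag, data) entries and that the field widths are passed in as parameters; the name `find_block` and the sample cache contents are hypothetical. Hardware compares all tags of a set in parallel, whereas this sketch loops over them.

```python
def find_block(address, sets, index_bits, offset_bits):
    """Return the addressed item on a hit, or None on a miss."""
    offset = address & ((1 << offset_bits) - 1)
    index = (address >> offset_bits) & ((1 << index_bits) - 1)
    tag = address >> (offset_bits + index_bits)
    for valid, stored_tag, data in sets[index]:   # the index selects the set
        if valid and stored_tag == tag:           # the tag chooses the block
            return data[offset]                   # the block offset selects the item
    return None


# Hypothetical cache: 4 sets, 2 ways, 4 items per block.
example_sets = [[] for _ in range(4)]
example_sets[1] = [(True, 9, ["p", "q", "r", "s"]), (True, 5, ["a", "b", "c", "d"])]
hit_address = (5 << 4) | (1 << 2) | 3             # tag 5, index 1, offset 3
print(find_block(hit_address, example_sets, index_bits=2, offset_bits=2))
```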

28. Four-Way Set-Associative Cache
[Figure: the address supplies a 22-bit tag and an 8-bit index; the index selects one of 256 sets (0 to 255), each holding four entries with a valid bit, a tag, and 32 bits of data; a 4-to-1 multiplexor selects the data. Outputs: Hit and Data.]

29. Replacement Strategy
Replacement is needed in associative caches.
Random
First-in-first-out
– the oldest block is replaced
First-in-not-used-first-out
– the oldest of the blocks not accessed since the previous replacement is replaced
LRU (Least Recently Used)
– the block that has been unused for the longest time is replaced
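
Two of these strategies, first-in-first-out and LRU, can be compared on a single set with a short simulation. The function name `count_misses` and the reference stream are assumptions made for illustration.

```python
from collections import OrderedDict


def count_misses(references, ways, policy):
    """Count misses in one cache set; policy is "FIFO" or "LRU"."""
    blocks = OrderedDict()                  # the first key is the next victim
    misses = 0
    for block in references:
        if block in blocks:
            if policy == "LRU":
                blocks.move_to_end(block)   # a hit makes the block most recently used
            continue                        # FIFO ignores hits
        misses += 1
        if len(blocks) == ways:
            blocks.popitem(last=False)      # evict the oldest or least recently used block
        blocks[block] = True
    return misses


stream = [1, 2, 1, 3, 1, 2]                 # hypothetical block references
print(count_misses(stream, ways=2, policy="FIFO"))
print(count_misses(stream, ways=2, policy="LRU"))
```

In this stream block 1 is reused often, so LRU keeps it in the set while FIFO evicts it.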

30. Random vs. LRU
Random
– simple to implement
– almost as good as other algorithms
LRU (Least Recently Used)
– 2-way set-associative: implementation simple (one bit)
– 4-way set-associative: approximated to make implementation reasonably simple

31. Pseudo LRU for Four-Way S-A Cache
Approximation of LRU, implemented with 3 bits per set.
[Figure: decision tree used when replacement is needed. Check B1, then check B2 (replace block 0 or block 1) or check B3 (replace block 2 or block 3).]
At every cache access two of the bits are updated to point away from the MRU block.
Always chooses the best or second best choice.
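
The three-bit scheme can be sketched as a small class. The bit encoding used here (0 points to the lower-numbered side of each decision) is an assumption, since the slide does not spell it out; the class name `PseudoLRU` is also hypothetical.

```python
class PseudoLRU:
    """Tree-based pseudo-LRU state for one four-way set (3 bits)."""

    def __init__(self):
        self.b1 = 0   # 0: the victim is among blocks 0-1, 1: among blocks 2-3
        self.b2 = 0   # 0: the victim is block 0, 1: block 1
        self.b3 = 0   # 0: the victim is block 2, 1: block 3

    def access(self, block):
        """Update two bits so that they point away from the accessed block."""
        if block not in (0, 1, 2, 3):
            raise ValueError("block must be 0, 1, 2 or 3")
        if block < 2:
            self.b1 = 1            # the next victim comes from blocks 2-3
            self.b2 = 1 - block    # within blocks 0-1, point at the other block
        else:
            self.b1 = 0            # the next victim comes from blocks 0-1
            self.b3 = 3 - block    # within blocks 2-3, point at the other block

    def victim(self):
        """Follow the bits to the block that would be replaced."""
        if self.b1 == 0:
            return self.b2
        return 2 + self.b3


state = PseudoLRU()
for block in (0, 1, 2, 3):
    state.access(block)
print(state.victim())
```

After the accesses 0, 1, 2, 3 the victim is block 0, which is also the true LRU block. One more access to block 0 makes the sketch choose block 2, although block 1 has been unused for longer; this is the second best choice mentioned on the slide.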

32. Performance (SPEC92)

33. Multilevel Caches
Usually two levels:
– L1 cache is often on the same chip as the processor
– L2 cache is usually off-chip
– miss penalty goes down if data is in L2 cache
Example:
– CPI of 1.0 on a 500 MHz machine with a 200 ns main memory access time: miss penalty 100 clock cycles
– Add a 2nd level cache with 20 ns access time: miss penalty 10 clock cycles
Using multilevel caches:
– minimise the hit time on L1
– minimise the miss rate on L2
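
The miss penalties in the example follow from the clock rate. The helper name `miss_penalty_cycles` is an assumption for illustration.

```python
def miss_penalty_cycles(access_time_ns, clock_mhz):
    """Convert an access time in nanoseconds into clock cycles."""
    cycle_time_ns = 1000 / clock_mhz     # 500 MHz corresponds to a 2 ns cycle
    return access_time_ns / cycle_time_ns


main_memory_penalty = miss_penalty_cycles(200, 500)   # miss served by main memory
l2_penalty = miss_penalty_cycles(20, 500)             # miss served by the L2 cache
print(main_memory_penalty, l2_penalty)
```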

34. Virtual Memory
Main memory can act as a “cache” for the secondary storage
– large virtual address space used in each program
– smaller main memory
Motivations
– efficient and safe sharing of memory among multiple programs
– remove programming burdens of a small main memory
Advantages:
– illusion of having more physical memory
– program relocation
– protection

35. Virtual Memory

36. Pages: virtual memory blocks
CPU produces a virtual address
– translated by a combination of hardware and software to a physical address
Virtual address: virtual page number and page offset
Physical address: physical page number and page offset
Page fault: data is not in memory, retrieve it from disk

37. Address Translation
[Figure: a 32-bit virtual address (bits 31-0) consists of a virtual page number (bits 31-12) and a page offset (bits 11-0). Translation maps the virtual page number to a physical page number; the 30-bit physical address (bits 29-0) consists of a physical page number (bits 29-12) and the unchanged page offset (bits 11-0).]
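
The translation can be sketched with a 12-bit page offset, as in the figure. Modelling the page table as a plain dictionary, the function name `translate`, and the sample mapping are assumptions for illustration.

```python
PAGE_OFFSET_BITS = 12   # 4 kB pages: the offset is copied unchanged


def translate(virtual_address, page_table):
    """Translate a virtual address; page_table maps virtual to physical page numbers."""
    virtual_page = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & ((1 << PAGE_OFFSET_BITS) - 1)
    physical_page = page_table[virtual_page]   # a missing key stands in for a page fault
    return (physical_page << PAGE_OFFSET_BITS) | offset


example_table = {0x12345: 0x00ABC}             # hypothetical mapping
print(hex(translate(0x12345678, example_table)))
```

Only the page number is translated; the low 12 bits of the virtual and the physical address are identical.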

38. Virtual Memory Design
Huge miss penalty, thus pages should be fairly large (4 - 64 kB).
Reducing page faults is important (LRU is worth the price).
Page faults can be handled in software instead of hardware
– overhead small compared to the access time to disk
Virtual memory systems use write-back.
– using write-through is too expensive
Dirty bit
– indicates whether a page needs to be copied back when it is replaced
– initially cleared, set when the page is first written

39. Page Tables
[Figure: page table indexed by virtual page number; each entry holds a valid bit and a physical page or disk address. Entries with valid bit 1 point to physical memory, entries with valid bit 0 point to disk storage.]

40. Page Table Details

41. Making Address Translation Fast
A cache for address translations: translation lookaside buffer (TLB)

42. MIPS R2000 TLB and Cache

43. TLBs and caches

44. Protection and Virtual Memory
Multiple processes and the operating system
– share a single main memory
– memory protection is provided
A user process cannot access other processes’ data.
The operating system takes care of system administration
– page tables, TLBs

45. Hardware Requirements for Protection
At least two operating modes
– user process
– operating system process (also called kernel, supervisor or executive process)
Portion of the CPU state a user process can read but not write
– user/supervisor mode bit
– page table pointer
– TLB
Mechanisms for going from user mode to supervisor mode, and vice versa
– system call exception
– return from exception

46. Handling Page Faults and TLB Misses
TLB miss
– page present in the memory → create missing TLB entry
– page not present in the memory → page fault → transfer control to the operating system
Look at the matching page table entry
– valid bit on → copy the page table entry from memory into the TLB
– valid bit off → page fault exception
Page fault
– EPC contains the virtual address of the faulting page
– find the page and move it into the memory after choosing a page to be replaced
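
The TLB miss path described above can be sketched in a few lines. The sketch assumes that the TLB is a dictionary from virtual to physical page numbers and that the page table maps a virtual page number to a (valid bit, physical page number) pair; the names `PageFault` and `lookup` are hypothetical, and loading the page from disk is left out.

```python
class PageFault(Exception):
    """Raised when the page is not in memory; the operating system would take over."""


def lookup(virtual_page, tlb, page_table):
    """Return the physical page number, refilling the TLB on a TLB miss."""
    if virtual_page in tlb:
        return tlb[virtual_page]             # TLB hit
    valid, physical_page = page_table[virtual_page]
    if not valid:
        raise PageFault(virtual_page)        # valid bit off: page fault exception
    tlb[virtual_page] = physical_page        # valid bit on: copy the entry into the TLB
    return physical_page


example_tlb = {}
example_page_table = {0: (True, 7), 1: (False, None)}   # hypothetical contents
print(lookup(0, example_tlb, example_page_table), example_tlb)
```

A real TLB has a fixed number of entries and needs its own replacement strategy, which this sketch omits.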
