
1 CS1104 Help Session I Memory Semester II 2001/02 Colin Tan, S15-04-05, Ctank@comp.nus.edu.sg

2 Memory Memory can be visualized as a stack of pigeon holes. Current computers have about 128,000,000 pigeon holes. Each pigeon hole is given a number, starting from 0; this number is called an “address”. Each pigeon hole contains either data (e.g. numbers you want to add together) or an instruction (e.g. add two numbers).

3 Memory Memory locations 0 to 3 contain instructions, locations 4 to 6 contain data. Note: In reality, instructions are also encoded into numbers!

4 Addresses As mentioned, each pigeon hole has a number identifying it, called an “address”. When the CPU requires an instruction, it will send the instruction’s “address” to memory, and the memory will return the instruction at that address. –E.g. at IF, the CPU will send “0” to memory, and the memory returns li t1, 5 –At MEM, the CPU will send “6” to memory, and memory returns “10”. –At WB, the CPU writes “10” back to t1.

5 Addressing Bits Computers work only in binary –Hence the addresses generated in the previous example are also in binary! In general, to address a maximum of n memory locations, you will need m = log2(n) bits in your address. Conversely, if you have m bits in your address, you can access a maximum of 2^m memory locations.
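As a quick sketch (not from the slides), the bit count can be computed by finding the smallest m with 2^m >= n; the function name address_bits below is made up for illustration:

```c
#include <stdio.h>

/* Smallest m such that 2^m >= n, i.e. the number of address bits
   needed to reach n locations (a ceiling of log2(n)). */
static unsigned address_bits(unsigned long long n) {
    unsigned m = 0;
    while ((1ULL << m) < n)
        m++;
    return m;
}

int main(void) {
    /* 128,000,000 pigeon holes need 27 address bits,
       since 2^26 = 67,108,864 < 128,000,000 <= 2^27 = 134,217,728. */
    printf("bits needed: %u\n", address_bits(128000000ULL));
    return 0;
}
```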

6 Memory Hierarchy Motivation –Not all memory is created equal: Cheap Memory => Slow, Fast Memory => Expensive –DRAM, 70 ns access time, $1/MByte –SRAM, 8 ns access time, $50/MByte –So you can choose either: fast but very small memory, OR large but very slow memory.

7 Memory Hierarchy Memory hierarchy gives you a third option: –Large, but very fast memory (though slower than the expensive memory mentioned earlier).

8 Locality “Locality” is a particular type of behavior exhibited by running programs: –Spatial locality: If a memory location has been accessed, it is very likely its neighbor will also be accessed. –Temporal locality: If a memory location has been accessed, it is very likely that it will be accessed again sometime soon.

9 Locality - Example Consider the following program: for(i=0; i<10; i++) a[i] = b[i] + c[i];

10 Locality - Example In memory it will look like this:

11 Locality - Example Tracing the execution of the program:

12 Locality - Example Focusing only on the addresses of the fetched instructions, we see the sequence: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 3, 4, 5, … Here we see both: –Spatial locality (e.g. after location 0 is accessed, location 1 is accessed, then 2, etc.) –Temporal locality (e.g. location 2 is accessed 10 times!)
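A small sketch (assuming, from the trace above, that locations 0–1 are fetched once as loop setup and locations 2–10 once per iteration) that makes both kinds of locality visible by counting fetches per address:

```c
#include <stdio.h>

/* Reconstruct the fetch trace from the slide and count accesses per address. */
int main(void) {
    int count[11] = {0};

    count[0]++; count[1]++;                     /* setup instructions, fetched once  */
    for (int iter = 0; iter < 10; iter++)       /* ten loop iterations               */
        for (int addr = 2; addr <= 10; addr++)  /* loop body: consecutive addresses  */
            count[addr]++;                      /* (spatial + temporal locality)     */

    for (int addr = 0; addr <= 10; addr++)
        printf("location %2d fetched %d times\n", addr, count[addr]);
    return 0;
}
```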

13 Effect of Locality Locality means that in the short run, out of all the memory you have (perhaps up to 128,000,000 pigeon holes!), only a very small number of locations are actually being accessed! –In our example, over ten iterations only memory locations 2 to 10 are accessed, out of 128,000,000 possible locations! –What if we had a tiny amount of very fast (but expensive!) memory and kept these locations in that fast memory? We can speed up access times dramatically!! –This is the idea behind caches.

14 How Do Caches Help? The average memory access time (AMAT) is given by: AMAT = hit_rate * Tcache + miss_rate * (Tmemory + Tcache) Tcache = Time to read the cache (8 ns for the SRAM cache) Tmemory = Time to read main memory (70 ns for DRAM) miss_rate = Probability of not finding what we want in the cache. Because of locality, miss_rate is very small –Typically about 3% to 5%. Here, our AMAT = 0.95 * 8 ns + 0.05 * (70 + 8) ns = 11.5 ns Our AMAT is only about 44% slower than pure SRAM memory (11.5 ns vs. 8 ns).
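The slide's formula is easy to check in code; a minimal sketch using the slide's numbers (the function name amat is made up for illustration):

```c
#include <stdio.h>

/* AMAT = hit_rate * Tcache + miss_rate * (Tmemory + Tcache), times in ns. */
static double amat(double miss_rate, double t_cache, double t_memory) {
    double hit_rate = 1.0 - miss_rate;
    return hit_rate * t_cache + miss_rate * (t_memory + t_cache);
}

int main(void) {
    /* 8 ns SRAM cache, 70 ns DRAM, 5% miss rate => 11.5 ns. */
    printf("AMAT = %.1f ns\n", amat(0.05, 8.0, 70.0));
    return 0;
}
```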

15 How Do Caches Help? What about cost? Let’s consider: A system with 32 MB of DRAM memory and 512 KB of SRAM cache. Cost is $1/MB for DRAM, and $50/MB for SRAM. If we had 32 MB of SRAM, access time is 8 ns, but the cost will be $1,600. With 32 MB of DRAM, the cost is $32, but access time is 70 ns! But with 32 MB of DRAM and 512 KB (1/2 MB) of SRAM, the cost will be: $32 + (512/1024) * $50 = $57!

16 How do Caches Help? So with pure SRAM, we can have an 8 ns average access time at $1,600. With pure DRAM, our memory will cost $32, but all accesses will take 70 ns! With DRAM memory and an SRAM cache, we can have an 11.5 ns average access time at $57. So for a performance drop of about 44%, the cost falls by a factor of about 28 (from $1,600 down to $57)! Hence caches give us a large memory size (32 MB), at close to the cost of the DRAM technology ($57 vs. $32), but at close to the speed of the expensive SRAM technology (11.5 ns vs. 8 ns).

17 Cache Architecture Caches consist of blocks (or lines). Each block stores data from memory. The block allocation problem: –Given data from an address A, how do we decide which block of the cache its data should go to?

18 The Block Allocation Problem 3 possible solutions: –Data from each address A will go to a fixed block. Direct Mapped Cache –Data from each address A may go to any block. Fully associative cache –Data from address A will go to a fixed set of blocks. Data may be put into any block within that set. Set associative cache.

19 Direct Mapped Caches The value of a portion of the memory address is used to decide which block to send the data to. Address A is divided into the fields: Tag | Block Index | Block Offset | Byte Offset. The Block Index portion is used to decide which block data from this address should go to.

20 Example The number of bits in the block index is log2(N), where N is the total number of blocks. For a 4-block cache, the block index portion of the address will be 2 bits, and these 2 bits can take on the value 00, 01, 10 or 11. The exact value of these 2 bits will determine which block the data for that address will go to.

21 Direct Mapped Addressing E.g. Show how an address generated by the MIPS CPU will be divided into byte offset, block offset, block index and tag portions for the following cases: i) Block size: 1 word, 128 blocks ii) Block size: 4 words, 64 blocks All MIPS addresses are 32-bit byte addresses (i.e. they address individual bytes in a word). A small code sketch working out both cases follows Case II below.

22 Case I

23 Case II
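The worked figures for Case I and Case II are not reproduced in this transcript. As a sketch under the slide's assumptions (32-bit byte addresses, 4-byte words, direct-mapped cache), the field widths can be computed like this; the helper field_widths is made up for illustration:

```c
#include <stdio.h>

/* Field widths for a direct-mapped cache with 32-bit byte addresses. */
static void field_widths(unsigned words_per_block, unsigned num_blocks) {
    unsigned byte_offset = 2;                 /* log2(4 bytes per word) */
    unsigned block_offset = 0, block_index = 0;
    while ((1u << block_offset) < words_per_block) block_offset++;  /* log2 */
    while ((1u << block_index)  < num_blocks)      block_index++;   /* log2 */
    unsigned tag = 32 - byte_offset - block_offset - block_index;
    printf("byte offset %u, block offset %u, block index %u, tag %u\n",
           byte_offset, block_offset, block_index, tag);
}

int main(void) {
    field_widths(1, 128);  /* Case I:  2, 0, 7, 23 bits */
    field_widths(4, 64);   /* Case II: 2, 2, 6, 22 bits */
    return 0;
}
```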

24 Example The value of the two block index bits will determine which block the data will go to: index values 00, 01, 10 and 11 select the four cache blocks respectively.

25 Solving Direct-Mapped Cache Problems Question 7.7 Basic formula: Blk_Addr = floor(word_address/words_per_block) mod N –N here is the total number of blocks in the cache –This is the mathematical version of taking the value of the Block Index bits from the address.
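That formula translates directly into code; a minimal sketch (the function name block_for is made up):

```c
/* Blk_Addr = floor(word_address / words_per_block) mod N,
   where N is the total number of blocks in the cache. */
unsigned block_for(unsigned word_address, unsigned words_per_block, unsigned N) {
    unsigned block_address = word_address / words_per_block;  /* floor */
    return block_address % N;                                 /* mod N */
}
```

For example, with 4-word blocks and 64 blocks, word address 300 maps to block (300/4) mod 64 = 75 mod 64 = 11.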

26 A Complication: Multiple Word Blocks Single word blocks do not support spatial locality –Spatial locality: Likelihood of accessing neighbor of a piece of data that was just accessed is high. –But with single word blocks, none of the neighbors are in cache! All accesses to neighbors that were not accessed before will miss!

27 An Example Question 7.8

28 Accessing Individual Words In our example, each block has 4 words. But we always access memory 1 word at a time! (e.g. lw) Use the Block Offset to specify which of the 4 words in a block we want to read. Address A fields: Tag | Block Index | Block Offset | Byte Offset.

29 The Block Offset Number of block offset bits = log2(M), where M is the number of words per block. For our example, M = 4, so the number of block offset bits is 2. These two bits can take on the values 00, 01, 10 and 11, and their value determines exactly which word within the block address A is referring to. Note that for single-word blocks, the number of block offset bits is log2(1), which is 0, i.e. there are no block offset bits for single-word blocks.
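As a sketch of how the fields are pulled out of an address with shifts and masks, assuming Case II's layout (22-bit tag, 6-bit block index, 2-bit block offset, 2-bit byte offset); the variable names and example address are made up:

```c
#include <stdio.h>

int main(void) {
    unsigned address = 0x00001234;                   /* an arbitrary example address */

    unsigned byte_offset  =  address        & 0x3;   /* lowest 2 bits     */
    unsigned block_offset = (address >> 2)  & 0x3;   /* next 2 bits       */
    unsigned block_index  = (address >> 4)  & 0x3F;  /* next 6 bits       */
    unsigned tag          =  address >> 10;          /* remaining 22 bits */

    printf("tag=%u index=%u block offset=%u byte offset=%u\n",
           tag, block_index, block_offset, byte_offset);
    return 0;
}
```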

30 Who am I? Purpose of the Tag Many different addresses may map to the same block, e.g. (the Block Index portions were highlighted in the original figure): 01000000100100000000000, 01010000100100000000000, 11011000100100000000000. All 3 addresses are different, but all map to block 00010010.

31 Disambiguation We need a way to disambiguate the situation –Otherwise how do we know that the data in block x actually comes from address A and not from another address A’ that has the same block index bit value? The portion of address A to the left of the Block Index can be used for disambiguation. This portion is called the tag, and the tag for address A is stored in the cache together with the data from address A.

32 The Tag When we access the cache, the Tag portion and Block Index portion of address A are extracted. The Block Index portion tells the cache controller which block of the cache to look at. The Tag portion is compared against the tag stored in that block. If the tags match, we have a cache hit, and the data is read from the cache. (The slide's figure shows a 4-block cache, each block holding words 00 to 11 alongside its stored Tag.)
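A minimal sketch of that lookup in code, assuming the 4-block, 4-word-per-block layout in the figure (struct and function names are made up; filling the block on a miss is omitted):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 4

struct cache_block {
    bool     valid;
    uint32_t tag;
    uint32_t words[4];                  /* one block holds 4 words */
};

static struct cache_block cache[NUM_BLOCKS];

bool lookup(uint32_t address, uint32_t *data_out) {
    uint32_t block_offset = (address >> 2) & 0x3;              /* which word in the block */
    uint32_t index        = (address >> 4) & (NUM_BLOCKS - 1); /* which block             */
    uint32_t tag          =  address >> 6;                     /* rest of the address     */

    struct cache_block *b = &cache[index];
    if (b->valid && b->tag == tag) {    /* tags match => cache hit */
        *data_out = b->words[block_offset];
        return true;
    }
    return false;                       /* miss: go to main memory */
}
```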

33 Accessing Individual Bytes MIPS addresses are byte addresses, and actually index individual bytes rather than words. Each MIPS word consists of 4 bytes. The byte offset tells us exactly which byte within a word we are referring to. Address A fields: Tag | Block Index | Block Offset | Byte Offset.

34 Advantages & Disadvantages of Direct Mapped Caches Advantages: –Simple to implement –Fast performance: less time to detect a cache hit => less time to get data from the cache => faster performance Disadvantages: –Poor temporal locality. Many addresses may map to the same block. The next time address A is accessed, it may have been replaced by the contents of address A’.

35 Improving Temporal Locality The Fully Associative Cache In the fully associative cache, data from an address A can go to any block in cache. –In practice, data will go into the first available cache block. –When the cache is full, a replacement policy is invoked to choose which block of cache to throw out.

36 Advantages and Disadvantages Fully Associative Cache Advantages –Good temporal locality properties: flexible block placement allows smart replacement policies such that blocks that are likely to be referenced again will not be replaced. E.g. LRU, LFU. Disadvantages –Complex and too expensive for large caches: each block needs a comparator to check the tag. With 8192 blocks, we need 8192 comparators!

37 A Compromise Set Associative Caches Represents a compromise between direct-mapped and fully associative caches. The cache is divided into sets of blocks. An address A is mapped directly to a set using a scheme similar to that of direct mapped caches. Once the set has been determined, the data from A may be stored in any block within that set - fully associative within a set!

38 Set Associative Cache An n-way set associative cache will have n blocks per set. For example, for a 16-block cache that is implemented as a 2-way set associative cache, each set has 2 blocks, and we have a total of 8 sets.
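A minimal sketch of the lookup for the slide's example (16 blocks arranged as 8 sets of 2 ways), assuming 1-word blocks; all names are made up for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 8
#define WAYS     2

struct block { bool valid; uint32_t tag; };
static struct block sets[NUM_SETS][WAYS];

bool lookup(uint32_t address) {
    uint32_t set = (address >> 2) & (NUM_SETS - 1);  /* 2 byte-offset bits, then 3 set-index bits */
    uint32_t tag =  address >> 5;                    /* remaining bits form the tag               */

    for (int way = 0; way < WAYS; way++)             /* search every block in the set...          */
        if (sets[set][way].valid && sets[set][way].tag == tag)
            return true;                             /* ...fully associative within the set       */
    return false;
}
```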

39 Advantages and Disadvantages Set Associative Cache Advantages –Almost as simple to build as a direct-mapped cache. –Only n comparators are needed for an n-way set associative cache. For 2-way set-associative, only 2 comparators are needed to compare tags. –Supports temporal locality by having full associativity within a set.

40 Advantages and Disadvantages Set Associative Cache Disadvantages –Not as good as a fully-associative cache in supporting temporal locality. –For LRU schemes, because of the small associativity, it is actually possible to have a 0% hit rate for temporally local data. –E.g. if our accesses are A1 A2 A3 A1 A2 A3, and if A1, A2 and A3 map to the same 2-way set, then the hit rate is 0%, as subsequent accesses replace previous accesses under the LRU scheme.
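That worst case is easy to simulate; a small sketch of a single 2-way LRU set, with the three conflicting addresses represented by made-up tag values 1, 2 and 3:

```c
#include <stdio.h>

int main(void) {
    int way_tag[2] = {-1, -1};            /* tags held by the two ways (-1 = empty) */
    int lru = 0;                          /* which way is least recently used       */
    int trace[] = {1, 2, 3, 1, 2, 3};     /* A1 A2 A3 A1 A2 A3                      */
    int hits = 0;

    for (int i = 0; i < 6; i++) {
        int tag = trace[i];
        if (way_tag[0] == tag || way_tag[1] == tag) {
            hits++;
            lru = (way_tag[0] == tag) ? 1 : 0;   /* the other way becomes LRU */
        } else {
            way_tag[lru] = tag;                  /* replace the LRU block     */
            lru = 1 - lru;
        }
    }
    printf("hits: %d out of 6 accesses\n", hits); /* prints 0: every access misses */
    return 0;
}
```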

41 Multi-level Cache Let the first level of cache (closest to the CPU) be called “L1”, and the next level “L2”. Let P_hit_L1 be the hit rate of L1, T_cache_L1 the cache access time of L1, and T_miss_L1 the miss penalty of L1. AMAT of L1 = P_hit_L1 * T_cache_L1 + (1 - P_hit_L1) * T_miss_L1 What is T_miss_L1? –If L1 misses, then we will attempt to get the data from L2. Hence T_miss_L1 is actually just the AMAT of L2! Let P_hit_L2 be the hit rate of L2, T_cache_L2 the cache access time of L2, and T_miss_L2 the miss penalty of L2.

42 Multilevel Cache T_miss_L1 = AMAT_L2 = P_hit_L2 * T_cache_L2 + (1 - P_hit_L2) * T_miss_L2 Substitute this back and we get: AMAT of L1 = P_hit_L1 * T_cache_L1 + (1 - P_hit_L1) * (P_hit_L2 * T_cache_L2 + (1 - P_hit_L2) * T_miss_L2) T_miss_L2 is of course the time taken to access the slow DRAM memory. What if we had an L3 cache?
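The substituted formula as code; a minimal sketch in which the timing and hit-rate numbers in main are made-up placeholders, not values from the slides:

```c
#include <stdio.h>

/* AMAT_L1 = P_hit_L1*T_cache_L1
           + (1 - P_hit_L1)*(P_hit_L2*T_cache_L2 + (1 - P_hit_L2)*T_miss_L2) */
static double amat_two_level(double p_hit_l1, double t_cache_l1,
                             double p_hit_l2, double t_cache_l2, double t_miss_l2) {
    double t_miss_l1 = p_hit_l2 * t_cache_l2 + (1.0 - p_hit_l2) * t_miss_l2; /* AMAT of L2 */
    return p_hit_l1 * t_cache_l1 + (1.0 - p_hit_l1) * t_miss_l1;
}

int main(void) {
    /* Placeholder numbers: 2 ns L1, 8 ns L2, 70 ns DRAM, 95% and 90% hit rates. */
    printf("AMAT = %.2f ns\n", amat_two_level(0.95, 2.0, 0.90, 8.0, 70.0));
    return 0;
}
```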

43 Other Problems Question 7.9

44 Virtual Memory Motivation Drive space is very, very cheap –Typically about 2 cents per megabyte. –It would be ideal if we could set aside a portion of drive space to be used as memory. –Unfortunately, disk drives are very slow: the fastest access time is about 10 ms, which is roughly a million times slower than SRAM and more than a hundred thousand times slower than DRAM. Idea: Use drive space as memory, and main memory to cache the drive space! –This is the idea behind virtual memory.

45 Main Idea Virtual memory (residing on disk) is cached by main memory. Main memory is cached by the system cache. All memory transfers are only between consecutive levels (e.g. VM to main memory, main memory to cache).

46 Cache vs. VM The concept behind VM is almost identical to the concept behind a cache. But different terminology! –Cache: Block / VM: Page –Cache: Cache Miss / VM: Page Fault Caches are implemented completely in hardware. VM is implemented in software, with hardware support from the CPU. The cache speeds up main memory access, while main memory speeds up VM access.

47 Technical Issues of VM Relatively cheap to remedy cache misses –Miss penalty is essentially the time taken to access the main memory (around 60-80 ns). –The pipeline freezes for about 60-80 cycles. Page Faults are EXPENSIVE! –The page fault penalty is the time taken to access the disk. –This may take 50 ms or more, depending on the speed of the disk and I/O bus. –Wastes millions of processor cycles!

48 Virtual Memory Design Because page-fault penalties are so heavy, it is not practical to implement direct-mapped or set-associative placement for virtual memory –These have poorer hit rates. Main memory caching of VM is always fully associative. –Full associativity gives only a 1% or 2% improvement in hit rate over direct-mapped or set-associative designs. –But with such heavy page-fault penalties, a 1% improvement is A LOT! Also, it is relatively cheap to implement full associativity in software.

49 Summary Memory can be thought of as pigeon holes where CPU stores instructions and data. Each pigeon hole (memory location) is given a number called its address. Memory technology can be cheap and slow (DRAM) or fast and expensive (SRAM) Locality allows us to use a small amount of fast expensive memory to store parts of the cheap and slow memory to improve performance. Caches are organized into blocks.

50 Summary Mapping between memory addresses and blocks can be accomplished by: –Directly mapping a memory location to a cache block (direct map) –Slotting a memory location to any block (fully associative) –Mapping a memory location to a set of blocks, then slotting it into any block within the set (set associative) Virtual memory attempts to use disk space as “main memory”, DRAM main memory as cache to the disk memory, and SRAM as cache to the DRAM.

