Presentation on theme: "Cache ‒ Why it’s needed: Cost-performance optimization; Why it works: The principle of locality; How it works: The architectural details" — Presentation transcript:

1 Cache
Why it’s needed: Cost-performance optimization
Why it works: The principle of locality
How it works: The architectural details
This presentation includes quite a bit of animation and is hence intended to be viewed in slideshow mode. If you are reading this text, you’re not in slideshow mode. Hit F5 to start the slideshow.

2 Von Neumann Architecture ‒ The Basic Idea of a Cache: A Cost-Effective Performance Boost
[Diagram: CPU connected to a small, fast cache, which sits in front of a large, slow, cheap main memory holding program #1, program #2, and program #3]
Cost/performance analysis is a constant theme in computer engineering ‒ which is why the proper choice of performance metric is important
Cost/performance trade-offs help explain why many architectures wind up more complicated than the theoretical simplicity of general purpose computers requires them to be, the extra complexity coming from clever and cost-effective performance enhancements to a simple (i.e., cheap) lower performance architecture
Cache is such an architectural tweak: It can significantly boost performance without an equivalent boost in cost
The classic (or Von Neumann) architecture is simple and works just fine, but main memory for most general purpose computers usually needs to be very large, since many different programs may need to be memory resident concurrently
But high speed memory ‒ fast enough to avoid making the CPU stall for a memory access ‒ is relatively expensive, so many commercial designs can’t afford to make the main memory fast enough to keep up with the CPU and still have a marketable (i.e., inexpensive) product
Consider the overall address space for some program currently in execution: Much of this address space is often not going to be used during any given run of the program ‒ when was the last time that your desktop or laptop computer executed the Y2K code included in its operating system?
The principle of locality, which we’ll examine shortly, tells us that much/most of the address space actually used by that program during any given execution could fit into a much smaller memory ‒ i.e., the cache
The cache, being relatively small, can be made fast enough to keep up with the CPU without blowing the budget completely out of sight
The addition of a small, high speed memory cache is thus a cost-effective performance enhancement ‒ i.e., compared to a simpler Von Neumann baseline, the relative boost in performance is greater than the relative increase in cost
Managing the movement of data and instructions from main memory to cache and vice versa is going to take some work

3 Cache
Why it’s needed: Cost-performance trade-offs
Why it works: The principle of locality
How it works ‒ the details

4 The Principle of Locality
Cache is based on the principle of locality, which states that memory references by the CPU are not randomly distributed throughout the entire memory but tend to be localized in both space and time

5 Spatial Locality (And a Special Case: Sequential Locality)
Data items referenced by a program tend to be close to one another in memory, since variables local to a function are allocated storage close to each other ‒ e.g., after the declaration int x,y; the variables x and y are likely to be next to each other in memory
Similarly, an array is allocated a contiguous block of memory, so as a program loops through an array, references to successive array elements should normally refer to sequential cells in memory
Sequential locality is a special case of spatial locality: Except when a jump or a branch is executed, instruction execution is sequential ‒ successive instructions are from sequential locations in memory
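To make the idea concrete, here is a minimal C sketch (the array name and size are made up for the example) showing both spatial and sequential locality in a simple loop:

```c
#include <stdio.h>

#define N 1024

int main(void) {
    int a[N];          /* an array occupies one contiguous block of memory */
    long sum = 0;

    /* Spatial locality: a[i] and a[i+1] sit in adjacent memory cells, so once
       the cache block holding a[i] has been fetched, the next several
       iterations find their data already in cache. */
    for (int i = 0; i < N; i++) {
        a[i] = i;
        sum += a[i];
    }

    /* Sequential locality: within each pass, the loop body's instructions
       are fetched from successive memory locations. */
    printf("%ld\n", sum);
    return 0;
}
```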

6 Temporal Locality
Memory locations that have been referenced in the past are more likely than a random location to be referenced again in the future
Programs often contain loops, so once an instruction has been executed, it is more likely to be executed again in the near future than some other, randomly chosen instruction
Data that has been used in the recent past is often used again in the near future ‒ e.g., variables referenced inside a loop are likely to be referenced each pass through the loop
Even if it’s not in a loop, evaluation of an expression like y = a·x³ + b·x² + c·x probably requires multiple references to the data item named x
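A minimal C sketch of the same idea; the function and variable names are illustrative, not taken from the slides:

```c
/* Temporal locality in miniature. */
double poly(double a, double b, double c, double x) {
    /* x is referenced three times in quick succession, so after the first
       reference it is almost certainly still close to the CPU (in cache or
       a register) for the next two. */
    return a * x * x * x + b * x * x + c * x;
}

double sum_poly(const double *xs, int n, double a, double b, double c) {
    double total = 0.0;
    /* a, b, c, and total are referenced again on every pass through the
       loop -- classic temporal locality, for instructions and data alike. */
    for (int i = 0; i < n; i++)
        total += poly(a, b, c, xs[i]);
    return total;
}
```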

7 An Analogy We’ll Return to Repeatedly
[Diagram: CPU, cache (the parking lot), main memory]
Let’s consider the United States as our main memory ‒ lots of storage, but time-consuming to retrieve from ‒ and the parking lot behind King as our cache: fast but small (not to scale ;-)
Suppose that the only tire pressure gauge in the US is located next to the King parking lot
When we want to check the pressure in a tire, we’ll bring the entire car, not just the tire, into the parking lot. Why?
Spatial locality says that once we check the pressure in one tire on a car, we’ll often want to check the other tires too, so we’ll be glad that we had already brought the whole car into cache; otherwise we’d have to go all the way back to New York to fetch the next tire

8 The Significance of Locality
The significance of locality is that the set of memory locations actually being used by some program over any given (small) period of time is likely to be much smaller than the overall size of the program ‒ i.e., small enough to fit into a high speed cache

9 Cache
Why it’s needed: Cost-performance trade-offs
Why it works: The principle of locality
How it works ‒ the details

10 The CPU and the Cache (And Some Jargon)
[Diagram: CPU connected through an alignment network to the cache (the parking lot), which connects to main memory]
A car is a block in memory: A block of memory is the unit of transfer between main memory and cache ‒ we can’t drive half a car from New York to the King parking lot
So even if the CPU wants only a single byte, main memory delivers an entire block to the cache in the expectation, based on the principle of spatial locality, that other nearby locations within the same block will be referenced in the near future
Note that although we often speak about moving a block from main memory into cache, in reality we copy it, we don’t move it (which would imply its removal from main memory, which we don’t do)
The parking spaces in the cache are block frames; each block frame can hold exactly one block
Main memory transfers blocks, but when discussing cache, the unit of transfer is also commonly referred to as a cache grain or a cache line ‒ different names, same concept
It’s the job of the alignment network to extract and align the bytes that the CPU is currently requesting from the full block in the cache
Despite the physical connectivity, the CPU refers to memory for loads, stores, and instruction fetches by emitting a main memory address ‒ i.e., in normal operations, the cache is invisible to the CPU
If the location addressed by the CPU is not already in the cache, it is the job of the cache to figure that out and tell main memory what block to send up, meanwhile telling the CPU to wait
The cache and the main memory together are often collectively known as the memory hierarchy ‒ technically, they’re only part of the memory hierarchy, but they’re all we’re interested in for CEC 470

11 Cache Design Issues
The cache design therefore has to address several questions, including:
1. When we copy a block from memory into cache, where do we put it?
2. How do we quickly determine if a given car is in the parking lot?
3. What happens when someone else wants to legally park in an already occupied space?
The answers to these and other questions are a function of the architecture of the cache

12 Cache Architectures
Today, there are three commonly used cache architectures:
Direct mapped ‒ simplest
Fully associative ‒ best performance, but most expensive
Set associative ‒ common compromise

13 Direct Mapped Cache
[Diagram: the King parking lot as the cache, with 16 parking slots labeled 0 through F, alongside main memory]
Suppose we numbered each parking spot in the cache ‒ for this example, let’s suppose that the King lot had exactly 16 parking places, labeled in hex with 0 through F
A block’s license plate is its memory address, which controls the memory hierarchy “behind the scenes” as far as the CPU is concerned ‒ i.e., the CPU just sends out a memory address and neither knows nor cares what the memory hierarchy does with it so long as the CPU winds up with the data it requested
For this computer architecture analogy, let’s assume that the digits of a license plate are always hex digits
Now suppose ERAU’s parking regulations stated that a car could only use the parking slot whose number matched the last digit on its license plate ‒ so a block whose license plate ends in 5 can only be placed in block frame #5
Suppose I ask you to go see if my car is in the parking lot and you don’t know the make, model, or color of my car; all you know is my license plate number
You don’t have to search the entire parking lot; you can proceed directly (direct mapped!) to slot number 5
But just because there’s a car there doesn’t mean it’s mine; lots of cars have license plate numbers that end in 5, so you’ll have to check the license number for the car in slot #5 to see if it’s mine
Now let’s further suppose that the cost of checking the digits on the license plate increases non-linearly with the number of digits ‒ the analogy is getting a bit strained, but it will have to do
Well, you don’t need to check all 7 digits (ABC2345); all you have to do is check the first 6 digits (ABC234) ‒ you don’t need to check the last digit, since the car couldn’t be there at all (slot #5) unless the last digit were a 5
In a cache, we’ll call these digits of a memory address the block’s tag ‒ i.e., the tag is a subset of the (binary) digits of the memory address

14 Interpreting the Memory Address
Here’s some 32-bit memory address emitted by the CPU: 0001 0010 0011 0100 0101 0110 0111 1000
Here’s how a cache might interpret these bits: tag (0x2468a) | block frame # (0x33) | byte offset
The block frame number (0x33, in this example) is our parking slot number
The tag is what you checked when you went to the designated slot in the parking lot and wanted to see if it was my car that was parked there
Some bits are ignored by the cache, just as we simplistically ignored the “New York” and “Empire State” characters on the license plate in our parking lot example ‒ but in the real cache, there really are some bits we’ll ignore (at least for now ;-)
Note that these byte offset bits, although not used by the cache, still tell us something about both the cache and main memory
Their purpose is to tell the alignment network the offset from the start of the block frame of the bytes to be extracted, shifted as necessary for alignment, and sent up to the CPU
But they therefore indicate the size of a block frame, also known as the memory width, cache grain, or cache line ‒ e.g., the 5 bits of offset shown here mean that there are 2^5 = 32 possible starting points or offsets (values 0 through 31) within a block, so (for a byte-addressable memory) the block size must be 32 bytes
So in this example, we see that there are 2^6 = 64 block frames of 2^5 = 32 bytes each, so the cache size is 2^11 = 2048 bytes
Here’s how main memory would interpret these same bits: block # | byte offset ‒ the byte offset bits are still ignored, and all the other bits are the block #, which identifies the block in main memory that contains the address the CPU actually wants to access
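The field extraction this slide describes can be written directly in C. This sketch assumes the exact field widths of the example (5 offset bits, 6 block frame bits) and reproduces the slide’s values:

```c
#include <stdio.h>
#include <stdint.h>

/* Field widths taken from the example on this slide:
   5 offset bits, 6 block frame bits, remaining 21 bits of tag. */
#define OFFSET_BITS 5
#define FRAME_BITS  6

int main(void) {
    uint32_t addr = 0x12345678;            /* the address shown on the slide */

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t frame  = (addr >> OFFSET_BITS) & ((1u << FRAME_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + FRAME_BITS);
    uint32_t block  = addr >> OFFSET_BITS; /* how main memory sees the address */

    printf("tag = 0x%x, frame = 0x%x, offset = %u, block # = 0x%x\n",
           (unsigned)tag, (unsigned)frame, (unsigned)offset, (unsigned)block);
    /* Prints tag = 0x2468a and frame = 0x33, matching the slide. */
    return 0;
}
```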

15 Direct Mapped Cache
[Diagram: a memory address split two ways ‒ main memory sees block # | offset, while the cache sees tag | block frame # | offset; a direct mapped cache of 8 block frames (0-7), each with a tag field; main memory shown as sequential blocks 0 through 0x19; memory width = block size]
The CPU refers to an item in memory by emitting its memory address, but that address is interpreted differently by main memory and cache
Regardless of the type of cache organization, main memory is organized as a set of sequential blocks; main memory uses the block number to find the block in memory
The block size is also referred to as the memory width, since a block of memory is the unit of transfer between main memory and cache ‒ i.e., given an address for retrieval, main memory always emits an entire block, the one containing the requested address somewhere within it
The cache consists of a set of block frames, each of which can contain a single block of memory
Attached to each block frame is a set of bits to hold the tag of the block currently in the block frame
For a direct mapped cache, the cache breaks up the bits of the block number into two fields: the tag and the block frame #
In the simplest memory hierarchies, neither the main memory nor the cache cares about the byte offset of the requested data within the block ‒ only the alignment network uses it

16 Direct Mapped Cache (cont’d)
Presented with a memory address from the CPU, the cache determines if the requested block is in cache by going to the designated block frame and comparing the tag of the requested block with the tag of the resident one
Cache hit: If the tags match, the resident block is sent to the alignment network, which uses the offset to extract the requested bytes from the block and align them properly for the destination CPU register
Cache miss: If the tags don’t match, cache tells main memory to send up the requested block, then both the cache and the CPU wait a long time (called the miss penalty), then the cache places the block in its block frame with its tag alongside it, both over-writing any previous contents, and the requested block is then (finally!) sent to the alignment network
To determine where a block of memory goes in the cache, the mapping used is b%m, where b is the block number and m is the number of block frames in the cache
m is always 2^n for some n, so that b%m can be evaluated quickly just by looking at the low order n bits of the block number
So just as many cars (all with 5 as the last digit on their license plate) could legally park in slot #5 of the King parking lot, many different blocks of memory map to the same block frame in the cache
In this example of a cache of 8 block frames, main memory block 0x5 maps to block frame 5, as do memory blocks 0xd, 0x15, 0x1d, 0x25, 0x2d, and so on
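Putting the pieces together, here is a minimal sketch of a direct mapped lookup in C. The type and function names (dm_cache_t, dm_lookup) are illustrative, the field widths are the ones from the earlier example, and a real cache does all of this in hardware, of course:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define OFFSET_BITS 5
#define FRAME_BITS  6
#define NUM_FRAMES  (1u << FRAME_BITS)   /* 64 block frames        */
#define BLOCK_SIZE  (1u << OFFSET_BITS)  /* 32-byte blocks         */

typedef struct {
    bool     valid[NUM_FRAMES];
    uint32_t tag[NUM_FRAMES];
    uint8_t  data[NUM_FRAMES][BLOCK_SIZE];
} dm_cache_t;

/* Returns true on a hit; on a miss, copies the whole block up from
   main memory (the miss penalty) and overwrites the frame's old contents. */
bool dm_lookup(dm_cache_t *c, const uint8_t *main_mem, uint32_t addr,
               uint8_t *byte_out)
{
    uint32_t block  = addr >> OFFSET_BITS;        /* block #            */
    uint32_t frame  = block & (NUM_FRAMES - 1);   /* b % m, m = 2^n     */
    uint32_t tag    = block >> FRAME_BITS;        /* the remaining bits */
    uint32_t offset = addr & (BLOCK_SIZE - 1);

    bool hit = c->valid[frame] && c->tag[frame] == tag;
    if (!hit) {                                   /* cache miss         */
        memcpy(c->data[frame], &main_mem[block * BLOCK_SIZE], BLOCK_SIZE);
        c->tag[frame]   = tag;
        c->valid[frame] = true;
    }
    *byte_out = c->data[frame][offset];           /* the alignment network's job */
    return hit;
}
```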

17 Downside to Direct Mapped Cache (And Some More Jargon Along the Way ;-)
Direct mapped cache is the simplest, and hence cheapest to implement and fastest in raw speed, but the lack of flexibility in block placement can cause other performance problems
Reminder: Choosing the correct performance metric(s) is always an important (and not always easy) engineering task, and the correct metric here is not the raw speed of the cache (hit time), for which a direct mapped cache is best, but a more complicated metric, the average memory access time, which we’ll analyze in some detail later
Suppose in a tight loop, all our instructions come from the same block in memory and its block number ends in 5, and all our data comes from a different block than our instructions, but one whose block number also ends in a 5
Here’s our first instruction fetch, from block #ABC2345: The very first time the CPU requests data or an instruction from a given block, that block can’t be in the cache already ‒ so we get what is called a compulsory (or mandatory or cold start) miss and the requested block must be fetched from main memory
Now if that instruction needs to load some data from block #AA00005 and this is the first time the CPU needed something from that block, we get another mandatory miss and then the instruction block in cache is replaced by the data block
The instruction requesting the data fetch will continue to execute properly; it’s already in the CPU’s IR − it was delivered there by the alignment network shortly after the instruction block arrived in the cache, but …
When the CPU is ready to fetch the next sequential instruction, we have to go all the way back to main memory to bring the same instruction block back into cache again, bumping the data block out of cache
This is a conflict miss ‒ we got a cache miss on a block (our instruction block, in this example) that, in fact, used to be present in the cache but got replaced because of a conflicting demand for its slot
Now if this instruction needs to load some data from block #AA00005, we get another conflict miss (the data block used to be in cache but isn’t anymore), and then when we fetch the data block, it bumps out the instruction block again ‒ and so on
We are getting absolutely no benefit from our cache whatsoever! Despite the fact that 15/16 of the cache is not being used, every memory reference generates a cache miss, which results in a long delay while we go out to main memory to bring the requested block up into cache ‒ our miss rate, the percentage of all memory references that we fail to find in the cache, is 100%
The problem, of course, is that in this example we can’t use the 15 empty block frames because our cache organization is direct mapped and there is no choice as to where a requested block must be placed
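A quick standalone check of the collision this slide describes, using the 16-slot parking lot analogy: both block numbers end in a hex 5, so they map to the same frame and keep evicting each other.

```c
#include <stdio.h>

int main(void) {
    unsigned frames      = 16;         /* the 16-slot parking lot / cache   */
    unsigned instr_block = 0xABC2345;  /* instruction block from the slide  */
    unsigned data_block  = 0xAA00005;  /* data block from the slide         */

    /* Both blocks map to frame 5, so a loop that alternates between them
       evicts one to fetch the other on every access: 100% miss rate. */
    printf("instruction block -> frame %u\n", instr_block % frames); /* 5 */
    printf("data block        -> frame %u\n", data_block  % frames); /* 5 */
    return 0;
}
```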

18 Fully Associative Cache
Instead of direct-mapping, suppose ERAU’s parking regulations stated that a faculty member could use any empty slot in the parking lot
The slots wouldn’t need to be numbered, since any car could go in any empty slot
Now to find out if some specific car is in the parking lot, I could just send a student to check every slot for the tag I wanted
But that could be slow, since the student would have to check each spot sequentially (and cache is supposed to be very fast, right?)
It would be a lot faster to send a separate student to each slot, and if the tag were found, the successful student could do a victory dance or something to indicate where the car was
For a cache, that form of parallel search is done by what is called associative (or content-addressable) logic

19 Associative Search in General
[Diagram legend: XOR gate, 1-bit cell, AND gate, OR gate, inverter (negation)]
Here’s the bit pattern we want to search for, and here are the data items we want to search to see if one of them matches the item being searched for
An XOR gate (A or B but not both) outputs a 0 if its two inputs match, a 1 otherwise:
A B A^B
0 0 0
0 1 1
1 0 1
1 1 0
So the inverter here outputs a 1 if the inputs to the XOR were the same, a 0 otherwise
This AND gate, then, outputs a 1 if and only if each bit of its item matches the corresponding bit of the pattern being searched for
This OR gate outputs a 1 iff one of the items being searched matches the pattern being searched for, giving the hit indicator
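In software terms, the gate network on this slide amounts to the following sketch. The function name and the loop are illustrative; real associative logic compares every stored item simultaneously rather than one at a time.

```c
#include <stdint.h>
#include <stdbool.h>

/* XOR each stored item with the search pattern; an all-zero result means
   "match" (the inverter + AND stage for that item), and OR-ing the per-item
   match signals gives the hit indicator. */
bool associative_search(const uint32_t *items, int n, uint32_t pattern,
                        int *match_index)
{
    bool hit = false;                 /* the final OR gate */
    for (int i = 0; i < n; i++) {
        if ((items[i] ^ pattern) == 0) {
            *match_index = i;         /* which "student" found the car */
            hit = true;
        }
    }
    return hit;
}
```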

20 Associative Cache
In an associative cache, the items being searched are the tags of the blocks in the cache, and the pattern being searched for is the tag of the block the cache is searching for
If one of the tags matches, the output of its AND gate drives its block select line, selecting its block to be gated up to the alignment network
Otherwise, if there’s a cache miss, the block number is gated out to main memory for retrieval

21 Fully Associative Cache (cont’d)
Fully associative cache provides a better average memory access time (the correct metric for memory hierarchy performance) than direct mapped, since there are no conflict misses (any car can go into any open parking slot), but it’s also more expensive ‒ think about the student wages for the 16 students I sent out to search the 16 slots in the parking lot in parallel, as opposed to the one student I used for direct mapped
The obvious question then is: Is there something in between ‒ providing better performance than direct mapped but cheaper than fully associative?

22 Set Associative Cache
The cache will still be the same size (16 block frames, in this example), but now it will be organized differently: into 4 sets of 4 block frames each
Since we have 4 sets, we’ll need 2 bits from the block number to tell us into which set a given block can go
The rightmost 2 bits of a hex 5 are 01, so the New York block must be placed in set 1
The Texas block also has to go into set 1, but there are 3 empty slots there, any one of which is fine, so unlike the direct mapped case, the New York block does not get bumped out of the cache
As for direct mapped cache, the bits in the block number not used to designate a destination are the tag for the block frame containing this block ‒ so in this example, tags will be 26 bits, since there are 7 hex digits (28 bits) in a block number and we’re using 2 of those bits to designate the set number
The sets need to be numbered, but the block frames (parking slots) in a set do not: We’re going to search associatively within a set ‒ but only within a single set, not the entire cache as we did for a fully associative cache (those extra circuits for associative search are expensive)
Since a set could eventually fill up, there must be a block replacement policy to decide which block frame a new block will be placed in, replacing some older block
Note that a fully associative cache has no conflict misses at all ‒ no block is ever bumped out (replaced) until the entire cache is full
If the cache is completely full and a block must be replaced, a subsequent miss on the replaced block is called a capacity miss, not a conflict miss, because the miss on the previously replaced block means that the cache is too small to contain the entire working space for the executing program ‒ but even a fully associative cache still needs a block replacement policy for handling capacity misses
Direct mapped and set associative caches can also become completely full, and then a replacement followed by a miss on the replaced block is a capacity miss; but unlike a fully associative cache, they can (and do) get conflict misses before they are full
If a block that was replaced by the block replacement policy is then called for again, it’s a conflict miss; but because of the greater flexibility in block placement and the possibility for the block replacement policy to be fairly smart (we’ll look at some of the alternatives later), while direct mapped block replacement has no smarts at all, conflict misses for set associative cache occur less frequently than for a direct mapped cache, so its performance is better ‒ lower average memory access time due to a lower miss rate
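Here is a minimal C sketch of a 4-way set associative lookup with 4 sets, matching this slide’s 16-block-frame example. The type and function names are illustrative, the 32-byte block size is carried over from the earlier address example, and the trivial victim choice stands in for the smarter replacement policies discussed later.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NUM_SETS    4
#define ASSOC       4          /* 4-way set associative            */
#define BLOCK_SIZE  32         /* assumed, as in the earlier slide */

typedef struct {
    bool     valid[NUM_SETS][ASSOC];
    uint32_t tag[NUM_SETS][ASSOC];
    uint8_t  data[NUM_SETS][ASSOC][BLOCK_SIZE];
} sa_cache_t;

bool sa_lookup(sa_cache_t *c, const uint8_t *main_mem, uint32_t addr,
               uint8_t *byte_out)
{
    uint32_t offset = addr % BLOCK_SIZE;
    uint32_t block  = addr / BLOCK_SIZE;
    uint32_t set    = block % NUM_SETS;   /* low bits of block # pick the set */
    uint32_t tag    = block / NUM_SETS;   /* the rest is the tag              */

    int victim = 0;
    for (int way = 0; way < ASSOC; way++) {   /* associative search, one set */
        if (c->valid[set][way] && c->tag[set][way] == tag) {
            *byte_out = c->data[set][way][offset];
            return true;                       /* cache hit */
        }
        if (!c->valid[set][way])
            victim = way;                      /* remember an empty frame */
    }

    /* Cache miss: bring the whole block up from main memory into the chosen
       frame of this set (empty frame if one exists, else frame 0 here),
       then satisfy the request. */
    memcpy(c->data[set][victim], &main_mem[block * BLOCK_SIZE], BLOCK_SIZE);
    c->tag[set][victim]   = tag;
    c->valid[set][victim] = true;
    *byte_out = c->data[set][victim][offset];
    return false;
}
```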

23 A Set Associative Cache is Cheaper than Fully Associative
Now to find a car in the cache, we only need 4 students, not 16, since a portion of the memory address tells us what set they need to go to
… and, as before, the other bits in the block number tell them what tag to look for

24 The Associativity Factor
The number of block frames in a set is called the associativity factor ‒ so this cache is 4-way set associative
Note: It is coincidental that this cache also has 4 sets ‒ the previous examples had 16 block frames, so I wanted this one to do so too, only this time, since the associativity factor is 4, the 16 block frames are organized into 4 sets
If the cache size were 32 block frames and the associativity were 4, the cache would have 8 sets
The size of a set associative cache in block frames is the product of the # of sets and the associativity factor; the size in bytes is the total number of block frames times the number of bytes in a block (a.k.a. the cache grain or cache line)
Note that a direct mapped cache can be considered a degenerate form of set associative cache ‒ merely having an associativity factor of 1 (only 1 block frame per set)
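A quick check of the arithmetic, assuming the 32-byte block size used in the earlier address example:

```c
#include <stdio.h>

int main(void) {
    unsigned assoc       = 4;    /* 4-way set associative            */
    unsigned num_sets    = 4;    /* this slide's cache               */
    unsigned block_bytes = 32;   /* assumed cache grain / line size  */

    unsigned frames = num_sets * assoc;      /* 16 block frames      */
    unsigned bytes  = frames * block_bytes;  /* 512 bytes            */
    printf("%u block frames, %u bytes\n", frames, bytes);

    /* And the other direction: 32 frames at associativity 4 -> 8 sets. */
    printf("%u sets\n", 32 / assoc);
    return 0;
}
```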

25 Interpreting the Memory Address Again
Here’s the same 32-bit memory address again (0001 0010 0011 0100 0101 0110 0111 1000), now interpreted as tag | set # | byte offset
For a set associative cache, the meaning of the tag and the offset fields are the same as for a direct mapped cache, but now we need a set # within the cache, not a block frame #
For reasons I have long since forgotten (if I ever knew), the set # is also commonly referred to as the index # (same concept, as far as I know, just a different name)
In any case, just as its block frame number specifies where a block can be placed in a direct mapped cache, its set or index # specifies where (in which set) a block can be placed in a set associative cache
Now turn it around: if the number of sets is 2^n, it takes log2(2^n), or n, bits to specify the correct destination
64 sets would take the 6 bits shown here, 8 sets would take 3 bits (log2 8 = 3), and 2 sets would take only 1 bit (log2 2 = 1)
And a fully associative cache (which by definition has only 1 set) would need 0 destination bits (log2 1 = 0)
A fully associative cache can thus be considered a special case of a set associative cache, one with only one set ‒ so if it has n block frames, its associativity factor would be n
Since there’s only 1 set, there’s no choice as to which set a block goes to, and if there’s no choice, we don’t need any bits to name (select) the destination set, so there’s no index # for a fully associative cache
Regardless of the name (block frame #, index #, or set #), the bits here select the destination, and the number of bits here tells us something about the size of the cache ‒ although the units are dependent on the cache organization
In this example, there are 6 bits here, so the size of the cache is 2^6: If the cache is direct mapped, that’s 2^6 block frames; if it’s set associative, that’s 2^6 sets ‒ and to know the cache size in block frames, we’d need to be told the associativity factor, which is not discernible from the memory address
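And a small sketch of the destination-bit arithmetic (all values illustrative):

```c
#include <stdio.h>

/* Number of index (destination) bits is log2 of the number of sets. */
static unsigned log2u(unsigned x) {   /* assumes x is a power of 2 */
    unsigned bits = 0;
    while (x > 1) { x >>= 1; bits++; }
    return bits;
}

int main(void) {
    printf("64 sets -> %u index bits\n", log2u(64)); /* 6 */
    printf(" 8 sets -> %u index bits\n", log2u(8));  /* 3 */
    printf(" 2 sets -> %u index bits\n", log2u(2));  /* 1 */
    printf(" 1 set  -> %u index bits\n", log2u(1));  /* 0: fully associative */
    return 0;
}
```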

26 Cache
Why it’s needed: Cost-performance optimization
Why it works: The principle of locality
How it works: The architectural details
C’est tout (fini) ‒ that’s all

