Presentation on theme: "Platform-based Design"— Presentation transcript:

1 Platform-based Design
Data Management Part d: Data Layout for Caches 5KK70 TU/e Henk Corporaal Bart Mesman

2 Data layout for caches
Caches are hardware controlled; therefore no explicit reuse-copy code is needed! What can we still do to improve performance?
Topics:
Cache principles
The 3 C's: Compulsory, Capacity and Conflict misses
Data layout examples reducing misses
H.C. Platform-based Design 5KK70

3 Cache operation (direct mapped cache)
[Figure: cache (higher level) and memory (lower level); the cache holds blocks or lines, each with a tag and data]

4 Why does a cache work? Principle of Locality
Temporal locality: an accessed item has a high probability of being accessed again in the near future.
Spatial locality: items close in space to a recently accessed item have a high probability of being accessed next.
Check for yourself why there is temporal and spatial locality for instruction accesses and for data accesses. Regular programs have high instruction and data locality.

5 Direct mapped cache
[Figure: direct mapped cache; the address (bit positions shown) is split into a tag, an index and a byte offset; the index selects an entry whose valid bit and tag are compared against the address tag to produce a hit, and the data is read out]

6 Direct mapped cache: larger blocks
Taking advantage of spatial locality. [Figure: address (bit positions) for a direct mapped cache with multi-word blocks; a block offset selects the word within a block]

7 Performance Increasing the block size tends to decrease miss rate:

8 Cache principles
[Figure: CPU, cache and main memory; the cache has 2^k lines (blocks) of 2^m bytes; the p-bit byte address is split into a tag (p-k-m bits), an index (k bits) and a byte offset (m bits); a tag match signals a hit]

9 Cache Architecture Fundamentals
Block placement: where in the cache will a new block be placed?
Block identification: how is a block found in the cache?
Block replacement policy: which block is evicted from the cache?
Updating policy: how is a block written from cache to memory?

10 Block placement policies
[Figure: an 8-line cache and a larger main memory; direct mapped (one-to-one): each memory block can go "here only"; fully associative (one-to-many): a block can go anywhere in the cache]
Cache memories are typically much smaller than main memory, so the cache controller must map many memory addresses onto a small number of cache locations (in this example 8 words). There are two main approaches. One possibility is to assign each memory address a fixed location in the cache; for instance, the blue memory location is always mapped to cache location 4 and the green one to cache location 3. This requires a simpler controller at the expense of less freedom in assigning memory to cache addresses; the reduced freedom translates into more potential misses and less efficient utilisation of the cache. These caches are known as direct mapped caches. The other possibility is to let the controller decide at run time where to place each block: in the extreme case a block can go into any location that is free at the moment of the mapping, but the controller must then keep track of which locations are in use. This is known as a fully associative cache, in which a placement miss can only occur once all locations are occupied. When the set of candidate cache locations is restricted to fewer than the full cache size, the cache is known as a 2-, 4- or 8-way set associative cache; these are the most commonly used approaches.

11 4-way associative cache

12 Performance
[Figure: miss rate versus associativity for cache sizes of 1 KB, 2 KB and 8 KB]

13 Cache Basics
Cache_size = Nsets x Associativity x Block_size
Block_address = Byte_address DIV Block_size_in_bytes
Index = Block_address MOD Nsets
Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently with shifts and masks.
[Figure: address bits 31 ... 2 1 0 split into tag, index and block offset; tag plus index form the block address]

14 Example 1
Assume a cache of 4K blocks, with 4-word block size and 32-bit addresses.
Direct mapped (associativity = 1): 16 bytes per block = 2^4, so 4 (2+2) bits for byte and word offsets; 32 - 4 = 28 bits remain for index and tag. #sets = #blocks / associativity = 4K, and log2(4K) = 12, so 12 bits for index. Total number of tag bits: (28-12) x 4K = 64 Kbit.
2-way set associative: #sets = #blocks / associativity = 2K sets; 1 bit less for indexing, 1 bit more for tag (compared to direct mapped). Tag bits: (28-11) x 2 x 2K = 68 Kbit.
4-way set associative: #sets = #blocks / associativity = 1K sets; 2 bits less for indexing, 2 bits more for tag (compared to direct mapped). Tag bits: (28-10) x 4 x 1K = 72 Kbit.

15 Example 2
Three caches, each consisting of 4 one-word blocks:
Cache 1: fully associative
Cache 2: two-way set associative
Cache 3: direct mapped
Suppose the following sequence of block addresses: 0, 8, 0, 6, 8.

16 Example 2: Direct Mapped
Block address to cache block: 0 mod 4 = 0; 6 mod 4 = 2; 8 mod 4 = 0.

Address  Hit or miss  Location 0  Location 1  Location 2  Location 3
0        miss         Mem[0]
8        miss         Mem[8]
0        miss         Mem[0]
6        miss         Mem[0]                  Mem[6]
8        miss         Mem[8]                  Mem[6]

Each new entry corresponds to a miss: 5 misses in total.

17 Example 2: 2-way Set Associative (4/2 = 2 sets)
Block address to set: 0 mod 2 = 0; 6 mod 2 = 0; 8 mod 2 = 0 (so all in set 0).

Address  Hit or miss  Set 0, entry 0  Set 0, entry 1
0        miss         Mem[0]
8        miss         Mem[0]          Mem[8]
0        hit          Mem[0]          Mem[8]
6        miss         Mem[0]          Mem[6]   (least recently used block replaced)
8        miss         Mem[8]          Mem[6]   (least recently used block replaced)

18 Example 2: Fully Associative (4-way assoc., 4/4 = 1 set)

Address  Hit or miss  Block 0  Block 1  Block 2  Block 3
0        miss         Mem[0]
8        miss         Mem[0]   Mem[8]
0        hit          Mem[0]   Mem[8]
6        miss         Mem[0]   Mem[8]   Mem[6]
8        hit          Mem[0]   Mem[8]   Mem[6]

19 Cache Fundamentals: The "Three C's"
Compulsory misses: the first access to a block, which was never in the cache.
Capacity misses: the cache cannot contain all the blocks; blocks are discarded and retrieved later. Avoided by increasing cache size.
Conflict misses: too many blocks mapped to the same set. Avoided by increasing associativity.
Some add a 4th C: Coherence misses.

20 Compulsory miss example
for(i=0; i<10; i++) A[i] = f(B[i]);
i=2) the cache holds A[0], B[0], A[1], B[1], A[2], B[2]
i=3) B[3] and A[3] are required: B[3] was never loaded before, so it is loaded into the cache (a compulsory miss); A[3] was never loaded before, so it allocates a new line (a compulsory write miss).

21 Capacity miss example
Cache size: 8 blocks of 1 word, fully associative.
for(i=0; i<N; i++) A[i] = B[i+3]+B[i];
[Figure: cache contents after each iteration i = 0..7; at i=7 the cache holds B[5], A[5], B[9], B[6], A[6], B[10], B[7], A[7]]
Assuming a fully associative cache (so only capacity misses can occur) of eight words, the cache is completely filled after seven iterations. In the next iteration, three new data elements from arrays A and B are required, but since no free locations are left, three existing elements must be flushed (least recently used) and the new ones filled in. An element of array B that is loaded for the first time is a compulsory miss, not a capacity miss; however, elements that were already in the cache and were flushed because of capacity limitations must be brought in again, each time creating a capacity miss.
Result: 11 compulsory misses (+8 write misses) and 5 capacity misses.

22 Conflict miss example
for(j=0; j<10; j++) for(i=0; i<4; i++) A[i] = A[i]+B[i][j];
[Figure: memory layout with A[0..3] at addresses 0..3 and B stored column by column from address 4 (B[0][0]=4, ..., B[3][0]=7, B[0][1]=8, ..., B[3][9]=31); cache contents at i=0 for an 8-word direct mapped cache]
For fully associative caches the main source of misses is capacity misses. In direct mapped caches, misses can also be due to data conflicting in the same cache location, since each main memory address is assigned a unique location in the cache. The figure shows two arrays as typically organised in memory by the compiler, and how the elements accessed in the inner loop are mapped into the cache: when j is even, the elements B[0][j] map to cache location 4, but when j is odd they map to location 0, conflicting with A[0]. After loading A[0] into the cache, it must be flushed to accommodate an element of array B. This would not be a problem if A[0] were not needed in subsequent iterations; unfortunately each A[i] is read 10 times, so the same element is loaded and flushed over and over, creating conflict misses. This is true for all elements of array A.

23 “Three C's” vs Cache size [Gee93]

24 Data layout may reduce cache misses

25 Example 1: Capacity & Compulsory miss reduction
for(i=0; i<N; i++) A[i] = B[i+3]+B[i];
Recall from the capacity miss example (slide 21): with an 8-word fully associative cache this loop causes 11 compulsory misses (+8 write misses) and 5 capacity misses. The following slides show how a better data layout removes the capacity misses.

26 Fit data in cache with in-place mapping
for(i=0; i<12; i++) A[i] = B[i+3]+B[i];
[Figure: #words alive in A[] and B[] over the iterations i = 0..12; traditional analysis: max = 27 words; detailed analysis: max = 15 words; the merged array AB[] fits the 16-word main memory / cache]
A detailed analysis of the maximum number of locations that must be present in memory to execute each loop iteration can help reduce capacity misses. Traditional analysis techniques, as implemented by conventional compilers, would allocate the two arrays separately and assume the maximum storage shown in the figure (27 words). A more detailed analysis of the storage requirements shows that far fewer locations (15 words) are sufficient to accommodate all data that is "alive" at any one time. It is possible to hint this to the compiler by changing the code structure: instead of declaring two different arrays, we declare a merged version of the two. This complicates the original index expressions, but the required storage of the new array now fits in the cache, reducing the number of capacity misses.

27 Remove capacity / compulsory misses with in-place mapping
for(i=0; i<N; i++) AB[i] = AB[i+3]+AB[i];
[Figure: cache contents after each iteration i = 0..7 for the merged array; at i=7 the cache holds AB[7], AB[8], AB[4], AB[9], AB[5], AB[10], AB[6]]
With the merged array, the element AB[i] read in iteration i was already loaded three iterations earlier (as AB[(i-3)+3]) and is still in the cache, so these re-reads hit instead of miss, and every write hits the freshly read line.
Result: 11 compulsory misses, 5 cache hits (+8 write hits).

28 Example 2: Conflict miss reduction
for(j=0; j<10; j++) for(i=0; i<4; i++) A[i] = A[i]+B[i][j];
Recall from the conflict miss example (slide 22): with the default memory layout, B[0][j] maps to cache location 0 for odd j, conflicting with A[0], so each A[i] (read 10 times) is loaded and flushed repeatedly in the direct mapped cache. The next slide shows how changing the layout of B in main memory avoids these conflicts.

29 Avoid conflict miss with main memory data layout
for(j=0; j<10; j++) for(i=0; i<4; i++) A[i] = A[i]+B[i][j];
[Figure: modified memory layout; A[0..3] at addresses 0..3, B[0][0]..B[3][0] at 4..7, then a gap of four unused words, B[0][1]..B[3][1] at 12..15, another gap, and so on up to B[3][9]]
This problem can be avoided by carefully organising the data in main memory, differently from the default approach taken by the compiler. We deliberately leave some unused memory locations (gaps in the figure) in the middle of array B, in order to change the cache locations assigned to the conflicting elements. For instance, B[0][1], which was previously allocated at memory address 8 and hence cache address 0, is now reallocated to memory address 12 and hence cache address 4. As a result, all elements B[0][j] are allocated to the same cache location (no. 4) irrespective of the value of j, for all j. A[0] is therefore no longer in conflict and does not need to be flushed, avoiding the misses due to cache address conflicts: A[i] is still read multiple times, but now without conflicts. © imec 2001

30 Data Layout Organization for Direct Mapped Caches

31 Conclusion on Data Management
In multimedia applications, data transfer and storage issues should be explored at the source code level.
The DMM method:
Reduces the number of external memory accesses
Reduces the external memory size
Trades off internal memory complexity against speed
Applies platform independent high-level transformations
Applies platform dependent transformations that exploit platform characteristics (efficient use of memory, cache, ...)
Yields substantial energy reduction
Although caches are hardware controlled, the data layout can largely influence the miss rate.

