Embedded Computer Architecture 5KK73 TU/e Henk Corporaal Bart Mesman Data Memory Management Part d: Data Layout for Caches.


Embedded Computer Architecture 5KK73 TU/e Henk Corporaal Bart Mesman Data Memory Management Part d: Data Layout for Caches

@H.C. Embedded Computer Architecture2 Data layout for caches Caches are hardware controlled. Therefore: no explicit reuse copy code is needed in your code! What can we still do to improve performance? Topics: –Cache principles –The 3 C's: Compulsory, Capacity and Conflict misses –Data layout examples reducing misses

@H.C. Embedded Computer Architecture3 Cache operation (direct mapped cache) [diagram: memory (lower level) and cache (higher level); the cache holds blocks or lines, each with a tag and data]

@H.C. Embedded Computer Architecture4 Why does a cache work? Principle of Locality –Temporal locality an accessed item has a high probability being accessed in the near future –Spatial locality items close in space to a recently accessed item have a high probability of being accessed next Check yourself why there is temporal and spatial locality for instruction accesses and for data accesses –Regular programs have high instruction and data locality

@H.C. Embedded Computer Architecture5 Direct mapped cache [diagram: address bit positions split into Tag, Index and Byte offset; each cache entry holds a Valid bit, a Tag and Data; the Index selects an entry, the Tag comparison produces Hit, and the Data field is returned]

@H.C. Embedded Computer Architecture6 Taking advantage of spatial locality: direct mapped cache with larger blocks [diagram: address bit positions]

@H.C. Embedded Computer Architecture7 Performance: effect of block size. Increasing the block (or line) size tends to decrease the miss rate. [plot: miss rate vs block size]

@H.C. Embedded Computer Architecture8 Cache principles [diagram: the CPU issues a p-bit virtual or physical byte address, split into a tag (p-k-m bits), an index (k bits) and a byte-in-block offset (m bits); the cache holds 2^k lines (blocks) of 2^m bytes each; the stored tag is compared against the address tag to produce Hit?, and on a miss the line is fetched from main memory]

@H.C. Embedded Computer Architecture9 4 Cache Architecture Fundamentals 1. Block placement –Where in the cache will a new block be placed? 2. Block identification –How is a block found in the cache? 3. Block replacement policy –Which block is evicted from the cache? 4. Updating policy –When is a block written from cache to memory? –Write-Through vs. Write-Back caches

@H.C. Embedded Computer Architecture10 Block placement policies [diagram: mapping from memory to cache; fully associative (one-to-many): a memory block can be placed anywhere in the cache; direct mapped (one-to-one): a memory block can be placed in exactly one location]

@H.C. Embedded Computer Architecture11 4-way associative cache 4 ways 256 sets

@H.C. Embedded Computer Architecture12 Performance: effect of associativity [plot: miss rate vs associativity for 1 KB, 2 KB and 8 KB caches]

@H.C. Embedded Computer Architecture13 Cache Basics Cache_size = N_sets x Associativity x Block_size Block_address = Byte_address DIV Block_size_in_bytes Index = Block_address MOD N_sets Because the block size and the number of sets are (usually) powers of two, DIV and MOD can be performed efficiently with shifts and masks. [diagram: address split into tag, index and block offset; tag plus index form the block address]

@H.C. Embedded Computer Architecture14 Example 1 Assume –Cache of 4K blocks, with 4-word block size –32-bit addresses Direct mapped (associativity = 1): –16 bytes per block = 2^4 → 4 (2+2) bits for byte and word offsets –32-bit address: 32-4 = 28 bits for index and tag –#sets = #blocks / associativity: log2 of 4K = 12 → 12 bits for index –Total number of tag bits: (28-12) * 4K = 64 Kbits 2-way associative –#sets = #blocks / associativity: 2K sets –1 bit less for indexing, 1 bit more for tag (compared to direct mapped) –Tag bits: (28-11) * 2 * 2K = 68 Kbits 4-way associative –#sets = #blocks / associativity: 1K sets –2 bits less for indexing, 2 bits more for tag (compared to direct mapped) –Tag bits: (28-10) * 4 * 1K = 72 Kbits

@H.C. Embedded Computer Architecture15 Example 2 Three caches, each consisting of 4 one-word blocks: Cache 1: fully associative Cache 2: two-way set associative Cache 3: direct mapped Suppose the following sequence of block addresses: 0, 8, 0, 6, 8

@H.C. Embedded Computer Architecture16 Example 2: Direct Mapped

Block address → cache block: 0 mod 4 = 0, 6 mod 4 = 2, 8 mod 4 = 0

Address  Hit/miss  Location 0  Location 1  Location 2  Location 3
0        miss      Mem[0]
8        miss      Mem[8]
0        miss      Mem[0]
6        miss      Mem[0]                  Mem[6]
8        miss      Mem[8]                  Mem[6]

Every access misses; each miss loads a new entry.

@H.C. Embedded Computer Architecture17 Example 2: 2-way Set Associative (4/2 = 2 sets)

Block address → set: 0 mod 2 = 0, 6 mod 2 = 0, 8 mod 2 = 0 (so all map to set 0)

Address  Hit/miss  Set 0 entry 0  Set 0 entry 1  Set 1 entry 0  Set 1 entry 1
0        miss      Mem[0]
8        miss      Mem[0]         Mem[8]
0        hit       Mem[0]         Mem[8]
6        miss      Mem[0]         Mem[6]
8        miss      Mem[8]         Mem[6]

On a miss the least recently used block in the set is replaced.

@H.C. Embedded Computer Architecture18 Example 2: Fully Associative (4-way associative, 4/4 = 1 set)

Address  Hit/miss  Block 0  Block 1  Block 2  Block 3
0        miss      Mem[0]
8        miss      Mem[0]   Mem[8]
0        hit       Mem[0]   Mem[8]
6        miss      Mem[0]   Mem[8]   Mem[6]
8        hit       Mem[0]   Mem[8]   Mem[6]

@H.C. Embedded Computer Architecture19 Cache Fundamentals The “Three C's” Compulsory Misses –1st access to a block: never in the cache Capacity Misses –Cache cannot contain all the blocks –Blocks are discarded and retrieved later –Avoided by increasing cache size Conflict Misses –Too many blocks mapped to the same set –Avoided by increasing associativity Some add a 4th C: Coherence Misses

@H.C. Embedded Computer Architecture20 Compulsory miss example

for(i=0; i<10; i++) A[i] = f(B[i]);

After i=2 the cache holds A[0], A[1], A[2] and B[0], B[1], B[2]. At i=3, B[3] and A[3] are required: B[3] was never loaded before → loaded into the cache (compulsory miss); A[3] was never loaded before → allocates a new line (compulsory miss).

@H.C. Embedded Computer Architecture21 Capacity miss example

for(i=0; i<N; i++) A[i] = B[i+3]+B[i];

Cache: 8 blocks of 1 word, fully associative.

[diagram: cache contents after each iteration i=0..7; from i=2 on the 8 blocks are full, so the LRU block is evicted on every new load and B[3]..B[7] are discarded before they are reused as B[i]]

Result: 11 compulsory misses (+8 write misses) and 5 capacity misses.

@H.C. Embedded Computer Architecture22 Conflict miss example

for(j=0; j<10; j++) for(i=0; i<4; i++) A[i] = A[i]+B[i][j];

[diagram: main-memory layout A[0..3] followed by B stored column by column: B[0][0], B[1][0], B[2][0], B[3][0], B[0][1], ...; next to each memory address the cache address it maps to in a small direct mapped cache]

Although each A[i] is read 10 times, for some values of j the elements B[i][j] map to the same cache addresses as A[i] (e.g. B[0][j] and A[0] both map to cache address 0), so A[0] is repeatedly flushed in favour of B[0][j] → conflict miss.

@H.C. Embedded Computer Architecture23 “Three C's” vs Cache size [Gee93]

Data layout may reduce cache misses

@H.C. Embedded Computer Architecture25 Example 1: Capacity & Compulsory miss reduction

Starting point: for(i=0; i<N; i++) A[i] = B[i+3]+B[i];

Cache: 8 blocks of 1 word, fully associative.

[diagram: cache contents after each iteration i=0..7, as in the capacity miss example]

With separate A[] and B[] arrays: 11 compulsory misses (+8 write misses) and 5 capacity misses.

@H.C. Embedded Computer Architecture26 Fit data in cache with in-place mapping

for(i=0; i<12; i++) A[i] = B[i+3]+B[i];

Traditional analysis: A[] (12 words) + B[] (15 words) → max = 27 words. Detailed lifetime analysis: at any moment at most 15 words are alive, so A[] can be mapped in place onto B[] in a single array AB[new] → max = 15 words, which fits a 16-word main memory. [plot: number of live words of A[] and B[] vs i]

@H.C. Embedded Computer Architecture27 Remove capacity / compulsory misses with in-place mapping

for(i=0; i<N; i++) AB[i] = AB[i+3]+AB[i];

[diagram: cache contents after each iteration i=0..7; because A[] and B[] now share the single array AB[], the line written as AB[i] is the same line later read as AB[i], and AB[i+3] is still resident when it is reused]

Result: 11 compulsory misses, 5 cache hits (+8 write hits).

@H.C. Embedded Computer Architecture28 Example 2: Conflict miss reduction

Starting point: for(j=0; j<10; j++) for(i=0; i<4; i++) A[i] = A[i]+B[i][j];

[diagram: main-memory layout A[0..3] followed by B stored column by column, with the corresponding cache addresses, as in the conflict miss example]

Although each A[i] is read 10 times, for some values of j the elements B[i][j] map to the same cache addresses as A[i], so A[0] is flushed in favour of B[0][j] → conflict miss.

@H.C. Embedded Computer Architecture29 Avoid conflict misses with main-memory data layout

for(j=0; j<10; j++) for(i=0; i<4; i++) A[i] = A[i]+B[i][j];

[diagram: main-memory layout A[0..3], then the columns of B with a gap left after each column, so that no B element maps to cache address 0..3]

By leaving gaps in main memory, the columns of B no longer map onto the cache lines holding A[]: A[0] is loaded once and each A[i] can be read many times without conflict. © imec 2001

@H.C. Embedded Computer Architecture30 Data Layout Organization for Direct Mapped Caches

@H.C. Embedded Computer Architecture31 Conclusions on Data Management In multi-media applications, exploring data transfer and storage issues should be done at the source code level. DMM method: –Reducing the number of external memory accesses –Reducing external memory size –Trade-offs between internal memory complexity and speed –Platform-independent high-level transformations –Platform-dependent transformations exploit platform characteristics (efficient use of memory, cache, …) –Substantial energy reduction Although caches are hardware controlled, data layout can largely influence the miss rate.