Chapter 5 — Large and Fast: Exploiting Memory Hierarchy

Multilevel On-Chip Caches

Supporting Multiple Issue
Both the Cortex-A53 and the Core i7 have multi-banked caches that allow multiple accesses per cycle, assuming no bank conflicts.
Cortex-A53 and Core i7 cache optimizations:
Return the requested word first
Non-blocking cache:
Hit under miss allows additional cache hits during a miss, hiding some of the miss latency with other work
Miss under miss allows multiple outstanding cache misses, overlapping the latencies of different misses
Data prefetching looks at the pattern of data misses and predicts the next address, so fetching starts before the miss occurs (see the software sketch below)
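The prefetching described on this slide is done in hardware by the A53 and i7, but the effect can be sketched in software. The fragment below is a minimal illustration using the GCC/Clang builtin __builtin_prefetch to request data a fixed distance ahead of the current access; the function name and the prefetch distance are illustrative assumptions, not details from the slides.

    #include <stddef.h>

    /* Sum an array while prefetching a fixed distance ahead.
       A hardware prefetcher achieves a similar effect automatically by
       detecting the stride-1 miss pattern; here the hint is explicit. */
    long sum_with_prefetch(const long *a, size_t n) {
        const size_t dist = 16;  /* illustrative prefetch distance */
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&a[i + dist]);  /* fetch before the miss */
            sum += a[i];
        }
        return sum;
    }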

Performance of the Cortex-A53 Memory Hierarchies
Configuration measured: 32 KiB two-way set-associative L1 instruction cache, 32 KiB four-way set-associative L1 data cache, and 1 MiB 16-way set-associative L2 cache, running the integer SPEC2006 benchmarks.

Performance of the Cortex-A53 Memory Hierarchies
The L1 miss penalty for a 1 GHz Cortex-A53 is 12 clock cycles, while the L2 miss penalty is 124 clock cycles.
Even low miss rates, multiplied by these high miss penalties, represent a significant fraction of the CPI for 5 of the 12 SPEC2006 programs.
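To see how low miss rates still eat into CPI, the sketch below computes memory-stall cycles per instruction from miss rates and the penalties quoted above. The miss rates and accesses per instruction are made-up placeholders, not measurements from the slides.

    #include <stdio.h>

    int main(void) {
        /* Penalties from the slide (1 GHz Cortex-A53). */
        double l1_penalty = 12.0;   /* cycles: L1 miss served by L2 */
        double l2_penalty = 124.0;  /* cycles: L2 miss served by DRAM */

        /* Illustrative placeholders, not measured values. */
        double accesses_per_instr = 1.35; /* instruction + data accesses */
        double l1_miss_rate = 0.02;       /* 2% of accesses miss in L1 */
        double l2_miss_rate = 0.005;      /* 0.5% also miss in L2 */

        double stall_cpi = accesses_per_instr *
            (l1_miss_rate * l1_penalty + l2_miss_rate * l2_penalty);
        printf("Memory-stall cycles per instruction: %.2f\n", stall_cpi);
        return 0;
    }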

Performance of the Core i7 Memory Hierarchies
L1 instruction cache miss rate: 0.1% to 1.8%, average 0.4%
L1 data cache miss rate: 5% to 10%, and sometimes higher
L2 average data miss rate: 4%
L3 average data miss rate: 1%

DGEMM
Combine cache blocking, subword parallelism, and instruction-level parallelism.
Blocking improves performance over the unrolled AVX code by factors of 2 to 2.5 for the larger matrices.
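For reference, the cache-blocked DGEMM loop nest looks like the sketch below: a plain-C version in the style of the textbook's example, without the AVX subword-parallel and unrolled inner loop. Matrices are n x n doubles stored in column-major order, and n is assumed to be a multiple of BLOCKSIZE.

    #define BLOCKSIZE 32

    /* Update one BLOCKSIZE x BLOCKSIZE submatrix of C: the three blocks
       touched here are small enough to stay resident in the cache. */
    static void do_block(int n, int si, int sj, int sk,
                         double *A, double *B, double *C) {
        for (int i = si; i < si + BLOCKSIZE; i++)
            for (int j = sj; j < sj + BLOCKSIZE; j++) {
                double cij = C[i + j * n];
                for (int k = sk; k < sk + BLOCKSIZE; k++)
                    cij += A[i + k * n] * B[k + j * n];
                C[i + j * n] = cij;
            }
    }

    /* Blocked DGEMM: iterate over submatrices instead of whole rows,
       so each block of A, B, and C is reused before being evicted. */
    void dgemm_blocked(int n, double *A, double *B, double *C) {
        for (int sj = 0; sj < n; sj += BLOCKSIZE)
            for (int si = 0; si < n; si += BLOCKSIZE)
                for (int sk = 0; sk < n; sk += BLOCKSIZE)
                    do_block(n, si, sj, sk, A, B, C);
    }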

Pitfalls
Ignoring memory system effects when writing or generating code
Example: iterating over rows vs. columns of arrays; large strides result in poor locality
Byte vs. word addressing
Example: 32-byte direct-mapped cache with 4-byte blocks. Byte address 36 is block address 36 / 4 = 9, and 9 mod 8 = 1, so byte 36 maps to cache block 1. Treating 36 as a word address instead gives 36 mod 8 = 4, i.e., block 4.
Example: a cache with 256 bytes and a block size of 32 bytes. Into which block does byte address 300 fall? Byte address 300 is block address 300 / 32 = 9 (integer division). The number of blocks in the cache is 256 / 32 = 8, so block address 9 falls into cache block number 9 mod 8 = 1. A sketch of this arithmetic follows.
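A minimal sketch of the mapping arithmetic for a direct-mapped cache, checking both worked examples above; the function name is mine, not from the slides.

    #include <stdio.h>

    /* Which cache block does a byte address fall into? */
    unsigned cache_block(unsigned byte_addr,
                         unsigned block_size, unsigned cache_size) {
        unsigned block_addr = byte_addr / block_size;  /* integer division */
        unsigned num_blocks = cache_size / block_size;
        return block_addr % num_blocks;
    }

    int main(void) {
        /* 256-byte cache, 32-byte blocks: byte 300 -> (300/32) mod 8 = 1 */
        printf("%u\n", cache_block(300, 32, 256));
        /* 32-byte cache, 4-byte blocks: byte 36 -> (36/4) mod 8 = 1 */
        printf("%u\n", cache_block(36, 4, 32));
        return 0;
    }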

Pitfalls
In a multiprocessor with a shared L2 or L3 cache:
Less associativity than the number of cores sharing the cache results in conflict misses
More cores means the associativity needs to increase
Using AMAT (average memory access time) to evaluate the performance of out-of-order processors:
AMAT ignores the effect of non-blocking accesses that overlap misses with other work (see the sketch below)
Instead, evaluate performance by simulation
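For context, AMAT is computed as below; the numbers are illustrative placeholders, and the comment notes why the model breaks down for out-of-order cores.

    #include <stdio.h>

    int main(void) {
        /* AMAT = hit time + miss rate * miss penalty.
           This charges every miss as a full stall; an out-of-order core
           overlaps much of the miss latency with useful work, so AMAT
           overstates the cost. */
        double hit_time = 1.0;        /* cycles, placeholder */
        double miss_rate = 0.03;      /* placeholder */
        double miss_penalty = 100.0;  /* cycles, placeholder */
        printf("AMAT = %.2f cycles\n",
               hit_time + miss_rate * miss_penalty);
        return 0;
    }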

Pitfalls
Extending the address range using segments
E.g., Intel 80286
But a segment is not always big enough
Makes address arithmetic complicated
Implementing a VMM on an ISA not designed for virtualization
E.g., non-privileged instructions accessing hardware resources
Either extend the ISA, or require the guest OS not to use the problematic instructions

Fallacies
Disk failure rates in the field match their specifications
A study of 100,000 disks with quoted MTTFs of 1,000,000 to 1,500,000 hours (an AFR of 0.6% to 0.8%) found AFRs of 2% to 4%, often 3 to 5 times higher than the specified rates
A study of more than 100,000 disks at Google, with a quoted AFR of 1.5%, found failure rates of 1.7% for drives in their first year, rising to 8.6% for drives in their third year, about 5 to 6 times the declared rate
Operating systems are the best place to schedule disk accesses
The OS sorts requests into increasing LBA (logical block address) order to improve performance
But the disk knows the actual mapping of the logical addresses onto the physical geometry of sectors, tracks, and surfaces, so it can reduce rotational and seek latencies better by rescheduling the requests itself
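The MTTF-to-AFR conversion behind those specification numbers is simple arithmetic, sketched below; the function name is mine, and the formula assumes a constant failure rate.

    #include <stdio.h>

    /* Annual failure rate (%) implied by an MTTF in hours:
       AFR ~= hours per year / MTTF, assuming a constant failure rate. */
    double mttf_to_afr(double mttf_hours) {
        return 100.0 * (365.0 * 24.0) / mttf_hours;
    }

    int main(void) {
        printf("MTTF 1,500,000 h -> AFR %.2f%%\n", mttf_to_afr(1.5e6));
        printf("MTTF 1,000,000 h -> AFR %.2f%%\n", mttf_to_afr(1.0e6));
        return 0;  /* prints roughly 0.58% and 0.88% */
    }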

Concluding Remarks
Fast memories are small; large memories are slow
We really want fast, large memories, and caching gives this illusion
Principle of locality: programs use a small part of their memory space frequently
Memory hierarchy: L1 cache → L2 cache → ... → DRAM memory → disk
Multilevel caches make it possible to use more cache optimizations more easily
Memory system design is critical for multiprocessors
Compiler enhancements, such as restructuring the loops that access arrays, can substantially improve locality and cache performance (see the sketch below)
Prefetching: a block of data is brought into the cache before it is actually referenced
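As an illustration of the loop-restructuring point, the fragment below traverses the same C array in row-major order (stride 1, good spatial locality) and in column-major order (stride N, poor locality); the array size N is an arbitrary placeholder.

    #define N 1024  /* arbitrary placeholder size */

    static double a[N][N];

    /* Row-major traversal: consecutive accesses are adjacent in memory,
       so each cache block is fully used before it is evicted. */
    double sum_row_major(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Column-major traversal of the same array: the stride is N doubles,
       so nearly every access touches a different cache block. */
    double sum_col_major(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }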