MEMORY CACHE – PERFORMANCE CONSIDERATIONS
Claire Cates, Distinguished Developer
Copyright © 2013, SAS Institute Inc. All rights reserved.

MEMORY CACHE
Used by the CPU to reduce memory latency, a cache is a section of memory closer to the CPU that stores frequently used data. Two design assumptions underlie the cache:
Data that is accessed once will more than likely be accessed again (temporal locality).
When memory is accessed, memory near that location will also be accessed (spatial locality).
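A minimal sketch of both assumptions in C (the function and sizes are illustrative): the sequential walk exploits the nearby-memory assumption, and the accumulator reused on every iteration exploits the accessed-again assumption.

#include <stddef.h>

/* Sequential traversal: a 64-byte cache line holds sixteen 4-byte
 * ints, so after one miss the next fifteen accesses are hits
 * (spatial locality). The accumulator sum is touched on every
 * iteration and stays cached (temporal locality). */
long sum_array(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}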

MEMORY CACHE
Instruction cache – used for executable instructions.
Data cache – used to speed up data fetch and store.
L1 (Level 1) – the cache closest to the CPU; fastest, but smallest.
L2 (Level 2) – searched if the data is not in the L1 cache; slower than L1 but faster than main memory, and larger than L1.
L1, L2, … caches may be shared between cores on multi-core systems.
Many systems now have an L3 cache.


MEMORY CACHE TERMS
Cache Hit – the data is found in the cache.
Cache Miss – the data is not found in the cache; the CPU must load it from a higher-level cache or main memory. You want to avoid cache misses.
Cache Line – data is copied from main memory in fixed-size units, typically 64 bytes long. Cache lines are copied in from main memory to satisfy a data request, and multiple cache lines may be copied.
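For intuition, with 64-byte lines two addresses fall in the same cache line exactly when they agree in everything above the low 6 bits. A sketch (the line size is an assumption; C itself does not expose it):

#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64   /* assumed typical line size */

int main(void)
{
    int x[2];
    /* Dividing an address by the line size yields its line number. */
    uintptr_t line0 = (uintptr_t)&x[0] / CACHE_LINE;
    uintptr_t line1 = (uintptr_t)&x[1] / CACHE_LINE;
    printf("x[0] and x[1] share a line: %s\n",
           line0 == line1 ? "yes" : "no (straddles a boundary)");
    return 0;
}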

MEMORY CACHE TERMS
Dirty Cache Line – when data is written to a cache line, it eventually needs to be written back to main memory; the line is dirty if its contents have not yet been written back.
Write-Back Policy – the policy the CPU uses to determine when to write dirty cache lines back to main memory.

MEMORY CACHE TERMS
Evicted – as the cache fills and more cache lines are loaded, an existing cache line must be evicted to make room.
Replacement Policy – the policy the CPU uses to determine which cache line to evict; LRU (least recently used) is the most commonly used policy.
Cache Coherence – multiple CPU caches may hold private copies of the same piece of memory; coherence is the process of making sure each copy has the updated "correct" content.

MEMORY CACHE TERMS
Pages – fixed-size (page-size) blocks of memory that are mapped to areas of physical memory. Applications access memory virtually, and memory is allocated in pages.
Page Table – contains the translation from the virtual address to the physical address for each page.
Translation Lookaside Buffer (TLB) – used to speed up virtual-to-physical address translation; it holds recent mappings from the page table.
Prefetching – the CPU guesses what memory will be needed next and loads it. Guessing right saves latency; guessing wrong costs bandwidth and causes cache line evictions.

MEMORY CACHE PERFORMANCE ISSUES
Performance problems occur when there are a lot of cache misses; look at the ratio of cache misses to cache hits. Accessing memory that is already in the lower-level caches is best. Accessing memory sequentially is the best pattern, because prefetching works in its favor; fully random access is the worst, because prefetching loads useless data and TLB misses add up. Cache misses cause further delay if memory bandwidth becomes saturated.

Sandy Bridge latencies for accessing memory (clk = clock cycles, ns = nanoseconds):

                 L1 Cache   L2 Cache   L3 Cache   Main memory
Sequential       4 clk      11 clk     14 clk     6 ns
In-Page Random   4 clk      11 clk     18 clk     22 ns
Full Random      4 clk      11 clk     38 clk     65.8 ns

HOW MUCH CAN LATENCY REALLY AFFECT THE SYSTEM?
From the Sandy Bridge numbers, assume a 3 GHz processor that executes 3 instructions per cycle (9 instructions per ns).
Going to the L1 cache, the processor stalls for 4 clk, time in which it could have executed 12 instructions.
If the memory is in the L2 cache (11 clk), the CPU could have executed 33 instructions.
Sequentially accessing main memory (6 ns) stalls the CPU for the equivalent of 6 × 9 = 54 instructions.
Randomly accessing main memory (65.8 ns) can stall the CPU for almost 600 instructions.

DATA LAYOUT CACHE PROBLEM (FETCH UTILIZATION)
Good utilization – all of the memory brought into the cache is used.
Poor utilization – only half of the memory brought into the cache is used; the unused memory takes up cache space and still has to be moved through the pipeline.

FETCH UTILIZATION
In your structures, put data that is used often together, so that the frequently used data is all in cache and rarely used data is not loaded into the cache. If needed, break structures that are allocated as arrays up into multiple structures.

struct Bad {
    int used;
    int used2;
    int not_used;
    int not_used2;
};

struct Good {
    int used;
    int used2;
};

struct Good2 {
    int not_used;
    int not_used2;
};
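To see why this matters for arrays, a sketch of a hypothetical hot loop over the split structure (the function and loop are illustrative): with struct Bad (16 bytes), a 64-byte cache line holds 4 elements and half of every line fetched is cold fields; with struct Good (8 bytes), a line holds 8 elements and every byte is used.

#include <stddef.h>

/* Touches only the frequently used fields; an array of struct Good
 * packs twice as many useful elements per cache line as an array
 * of struct Bad, roughly halving the misses in this loop. */
long sum_used(const struct Good *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += a[i].used + a[i].used2;
    return total;
}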

FETCH UTILIZATION
Put data that is written together. Data that changes may invalidate other caches, so this reduces the number of writebacks, especially for data that is shared across threads.
Make sure data items are sized correctly. If you only need a short, don't use an int or a long; the extra 2 bytes are wasted.
Using many small memory allocations can be very wasteful: they cause a random access pattern, and memory allocators often allocate more than the requested size for headers, … (see the sketch below).
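A minimal sketch of the allocation pattern (names and sizes are illustrative): one contiguous block keeps the nodes adjacent in memory, where many small mallocs would scatter them and pay per-allocation header overhead.

#include <stdlib.h>

struct node { int key; int value; };

/* Scattered: called N times, each call pays allocator header and
 * alignment overhead, and the nodes land unpredictably in memory. */
struct node *alloc_one(void)
{
    return malloc(sizeof(struct node));
}

/* Contiguous: one allocation; node i sits next to node i + 1, so
 * traversals are sequential and prefetch-friendly. */
struct node *alloc_all(size_t n)
{
    return malloc(n * sizeof(struct node));
}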

FETCH UTILIZATION
Account for the alignment of the data items: keep data items of similar sizes near each other in a structure.

struct Bad {
    int a;
    char b;
    int c;
};

struct Better {
    int a;
    int c;
    char b;
};

Type                Size        Alignment
char                1 byte      1-byte aligned
short               2 bytes     2-byte aligned
int / long          4 bytes     4-byte aligned
float               4 bytes     4-byte aligned
double (Windows)    8 bytes     8-byte aligned
double (Linux)      8 bytes     4-byte aligned
long double         8-12 bytes  4-8 byte aligned
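A quick way to see where the padding goes (a sketch; exact sizes are compiler- and ABI-dependent, and the comments assume a 4-byte int):

#include <stdio.h>

struct Bad    { int a; char b; int c; };   /* 3-byte hole after b */
struct Better { int a; int c; char b; };   /* hole moved to the end */

int main(void)
{
    /* Both are typically 12 bytes here, but Bad's hole sits between
     * members while Better's is trailing padding; with more
     * mixed-size members, interior holes accumulate while a single
     * trailing hole does not. */
    printf("Bad: %zu, Better: %zu\n",
           sizeof(struct Bad), sizeof(struct Better));
    return 0;
}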

DATA ACCESS PROBLEMS
Once the data is in the cache, use the cache line as much as possible before it is evicted!

DATA ACCESS – NON-TEMPORAL DATA

for (i = 0; i < num_cols; i++)
    for (j = 0; j < num_rows; j++)
        /* do something with the array element */ ;

Accessing the array in row order uses all of the memory in each cache line. Accessing it in column order runs out of cache before the lines can be reused. A non-temporal access pattern can also occur when you analyze too much memory at once, even outside a loop; break the work into smaller chunks and combine the results at the end if possible. (A sketch of the two loop orders follows.)
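A concrete sketch of the two traversal orders (the matrix dimensions are illustrative). In C, rows are contiguous in memory, so the row-order loop walks addresses sequentially while the column-order loop jumps NUM_COLS elements between accesses:

#include <stddef.h>

#define NUM_ROWS 1024   /* illustrative sizes */
#define NUM_COLS 1024

/* Row order: consecutive iterations touch consecutive addresses,
 * so every byte of each fetched cache line is used before the
 * line is evicted. */
long sum_row_order(int m[NUM_ROWS][NUM_COLS])
{
    long sum = 0;
    for (size_t i = 0; i < NUM_ROWS; ++i)
        for (size_t j = 0; j < NUM_COLS; ++j)
            sum += m[i][j];
    return sum;
}

/* Column order: consecutive iterations jump NUM_COLS ints apart,
 * so each cache line yields one element and is likely evicted
 * before the loop comes back for its neighbors. */
long sum_col_order(int m[NUM_ROWS][NUM_COLS])
{
    long sum = 0;
    for (size_t j = 0; j < NUM_COLS; ++j)
        for (size_t i = 0; i < NUM_ROWS; ++i)
            sum += m[i][j];
    return sum;
}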

DATA ACCESS – NON-TEMPORAL DATA

for (i = 0; i < 10; i++)
    for (j = 0; j < bigsize; j++)
        mem[j] += cnt[j];

If bigsize is large enough, this loads each cache line, but the line is evicted before the next iteration of the outer loop can reuse it.

for (i = 0; i < bigsize; i++)
    for (j = 0; j < 10; j++)
        mem[i] += cnt[i];

Interchanging the loops keeps each cache line resident for all 10 updates that use it.

CACHE COHERENCY AND COMMUNICATION UTILIZATION
When two or more threads share a common memory area and any data is written, cache problems can occur: when one thread writes to the area, the corresponding cache lines in the other threads' caches are invalidated. Take care to reduce the number of writes into shared memory.

FALSE SHARING
Two or more threads use data in the same cache line; when one thread writes to the line, it invalidates the copy in the other threads' caches even though the threads never touch the same bytes. This is often seen when arrays are allocated with one element per thread and shared by the threads. Avoid false sharing by placing data that can change close together (reading data does not invalidate the cache) and by aligning memory on a cache line boundary, padding structures if necessary; see the sketch below.
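A minimal sketch of the padding fix, assuming a 64-byte line size and a per-thread counter array (both assumptions, not from the slides):

#define CACHE_LINE 64    /* assumed line size */
#define NUM_THREADS 8    /* illustrative */

/* False sharing: adjacent longs share cache lines, so threads
 * incrementing only their own counter still invalidate each
 * other's copies of the line. */
long counters_bad[NUM_THREADS];

/* Fix: pad each counter out to a full cache line so no two
 * threads ever write to the same line. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};
struct padded_counter counters_good[NUM_THREADS];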

RANDOM MEMORY ACCESS
Caches work best when the memory being accessed is near an already loaded cache line. Scattered memory allocations produce random access to memory, and random access patterns can cause TLB misses, which are costly. Linked lists, hashes, and tree traversals can also produce a random memory access pattern, as the sketch below shows.
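To make the pattern concrete, a sketch of pointer chasing (the node type is hypothetical); compare it with the sequential array walk in the earlier sketches:

#include <stddef.h>

struct list_node {
    int value;
    struct list_node *next;
};

/* Each next pointer may land on a different cache line or page,
 * and the next address is unknown until the current node has been
 * loaded, so the hardware prefetcher cannot run ahead of the loop. */
long sum_list(const struct list_node *n)
{
    long sum = 0;
    for (; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}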

TOOLS
Amplifier – its general-exploration analysis reports performance information; several of the hardware counters deal with the cache, and the tool points you to the source and assembly code causing the problems.
ThreadSpotter – looks solely at memory usage and shows the areas of your program where the cache is not utilized thoroughly, where sharing between threads is hurting the cache, and where false sharing and loop-order issues occur. It gives source code locations and a good description of the issues involved.
I use both tools to get a better idea of where we are spending performance cycles.


SUMMARY
Caches were designed with the assumption that once memory is loaded it will likely be accessed again, and that memory close to recently accessed memory will likely be used. Memory caches have improved performance, but if a developer doesn't understand the principles of the cache and doesn't design with caches in mind, the application will suffer performance problems. Remember that each time the CPU has to go back to main memory, it stalls and performs no useful work. Not every issue you discover will be fixable.

Questions?????