Advanced Topics: Prefetching
ECE 454 Computer Systems Programming
Topics: UG Machine Architecture, Memory Hierarchy of Multi-Core Architecture, Software and Hardware Prefetching
Cristiana Amza

– 2 – Why Caches Work
Locality: programs tend to use data and instructions with addresses near or equal to those they have used recently.
Temporal locality: recently referenced items are likely to be referenced again in the near future.
Spatial locality: items with nearby addresses tend to be referenced close together in time.

– 3 – Example: Locality of Access

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

Data:
Temporal: sum is referenced in each iteration.
Spatial: array a[] is accessed in a stride-1 pattern.
Instructions:
Temporal: the loop body is executed repeatedly.
Spatial: instructions are referenced in sequence.
Assessing the locality of code is a crucial skill for a programmer!

– 4 – Prefetching
Bring elements into the cache ahead of the accesses that will need them.
Fetching a whole cache line at a time, rather than element by element, already does this to a limited degree.
We will look at more general prefetching techniques in the context of the UG memory hierarchy.

UG Core 2 Machine Architecture
[Diagram: a multi-chip module containing two processor chips; each chip has two cores with private L1 caches sharing a unified L2 cache.]
Per core: 32 KB, 8-way data cache and 32 KB, 8-way instruction cache
12 MB (2x 6 MB), 16-way unified L2 cache

– 6 – Core2 Architecture (2006): UG machines

– 7 – UG Machines: CPU Core Architectural Features
64-bit instructions.
Deeply pipelined: 14 stages; branches are predicted.
Superscalar: can issue multiple instructions at the same time, and can issue instructions out of order.

Core 2 Memory Hierarchy
L1 I-cache / L1 D-cache: 32 KB each, 8-way associative, ~3 cycle latency
L2 unified cache: 6 MB, 16-way associative, ~16 cycle latency
Main memory: ~4 GB, ~100 cycle latency
Disk: ~500 GB, latency in the tens of millions of cycles
L1/L2 caches use 64 B blocks.
Reminder: conflict misses are not an issue nowadays; staying within on-chip cache capacity is key.
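A hedged back-of-the-envelope example using the latencies above (the 5% miss rate is made up for illustration): average memory access time = hit time + miss rate x miss penalty. If 5% of loads miss L1 but all hit in L2, AMAT is roughly 3 + 0.05 x 16 = 3.8 cycles; if those misses instead went all the way to main memory, roughly 3 + 0.05 x 100 = 8 cycles. This is why working sets that spill out of the on-chip caches hurt so much.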

– 9 – Get Memory System Details: lstopo
Running lstopo on a UG machine gives:

    Machine (3829MB) + Socket #0
      L2 #0 (6144KB)
        L1 #0 (32KB) + Core #0 + PU #0 (phys=0)
        L1 #1 (32KB) + Core #1 + PU #1 (phys=1)
      L2 #1 (6144KB)
        L1 #2 (32KB) + Core #2 + PU #2 (phys=2)
        L1 #3 (32KB) + Core #3 + PU #3 (phys=3)

i.e., 4 GB RAM, 2x 6 MB L2 cache, 32 KB L1 cache per core, 2 cores per L2.

– 10 – Get More Cache Details: L1 dcache
ls /sys/devices/system/cpu/cpu0/cache/index0

    coherency_line_size: 64    // 64 B cache lines
    level: 1                   // L1 cache
    number_of_sets
    physical_line_partition
    shared_cpu_list
    shared_cpu_map
    size: 32K
    type: data                 // data cache
    ways_of_associativity: 8   // 8-way set associative

– 11 – Get More Cache Details: L2 cache
ls /sys/devices/system/cpu/cpu0/cache/index2

    coherency_line_size: 64    // 64 B cache lines
    level: 2                   // L2 cache
    number_of_sets
    physical_line_partition
    shared_cpu_list
    shared_cpu_map
    size: 6144K
    type: Unified              // unified cache: holds both instructions and data
    ways_of_associativity: 24  // 24-way set associative
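Not from the slides, but a quick consistency check: the sysfs numbers must satisfy size = number_of_sets x ways_of_associativity x coherency_line_size. A minimal C sketch that derives the set counts from the values above:

    #include <stdio.h>

    /* Derive number_of_sets from the sysfs geometry reported above:
       sets = size / (ways * line_size). */
    int main(void) {
        int line = 64;  /* coherency_line_size in bytes */
        printf("L1d sets: %d\n", (32 * 1024) / (8 * line));     /* 32K, 8-way   -> 64   */
        printf("L2 sets:  %d\n", (6144 * 1024) / (24 * line));  /* 6144K, 24-way -> 4096 */
        return 0;
    }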

– 12 – Access Hardware Counters: perf
The perf tool makes accessing performance counters much easier than it used to be.
To measure L1 data-cache load misses for program foo, run:

    perf stat -e L1-dcache-load-misses foo
    7803 L1-dcache-load-misses # M/sec

To see a list of all events you can measure:

    perf list

Note: you can measure multiple events at once.
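For example (foo is a placeholder program name, as above; the comma-separated event list is standard perf syntax):

    perf stat -e L1-dcache-loads,L1-dcache-load-misses foo

Measuring loads and misses in one run also lets you compute the miss rate directly.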

Prefetching
Basic idea:
Predict which data will be needed soon (the prediction might be wrong).
Initiate an early request for that data (like a load into the cache).
If effective, prefetching can be used to tolerate the latency to memory.
[Diagram: in the original code, load X misses the cache and the following instructions must wait out the full cache-miss latency; with prefetching, prefetch X is issued several instructions ahead, the miss latency overlaps with useful work, and the later load X hits in the cache with its value ready.]
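A minimal sketch of the idea in C, assuming GCC/Clang's __builtin_prefetch built-in; the lookahead PF_DIST is a made-up tuning knob that must be matched to the miss latency and the per-iteration work:

    #include <stddef.h>

    #define PF_DIST 16  /* lookahead in elements: illustrative, needs tuning */

    double sum_array(const double *a, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)
                /* Request a[i + PF_DIST] now so it is (hopefully) cached when
                   we reach it; args: address, 0 = read, 3 = keep in all levels. */
                __builtin_prefetch(&a[i + PF_DIST], 0, 3);
            sum += a[i];
        }
        return sum;
    }

A hardware stride prefetcher would likely catch this particular pattern on its own; the explicit call just makes the mechanism visible.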

Prefetching is Difficult
Prefetching is effective only if all of these are true:
There is spare memory bandwidth to begin with; otherwise prefetches could make things worse.
Prefetches are accurate: only useful if you prefetch data you will soon use.
Prefetches are timely: prefetching the right data, but not early enough, does not help.
Prefetched data doesn't displace other in-use data: e.g., it is bad if a prefetch replaces a cache block that is about to be used.
The latency hidden by prefetches outweighs their cost: the cost of many useless prefetches can be significant.
Ineffective prefetching can hurt performance!

– 15 – Hardware Prefetching
A simple hardware prefetcher:
When one block is accessed, prefetch the adjacent block, i.e., behave as if blocks were twice as big.
A more complex hardware prefetcher:
Can recognize a “stream”: addresses separated by a constant “stride”.
E.g. 1: 0x1, 0x2, 0x3, 0x4, 0x5, 0x6 ... (stride = 0x1)
E.g. 2: 0x100, 0x300, 0x500, 0x700, 0x900 ... (stride = 0x200)
Prefetch predicted future addresses, e.g., current_address + stride*4 (see the toy model below).
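A toy model of the stride-recognition logic, purely illustrative (real prefetchers track many streams at once in a hardware table, typically tagged by the address of the load instruction):

    #include <stdint.h>

    /* Toy single-stream stride detector: confirm a stride once it repeats,
       then predict the next several addresses. */
    typedef struct {
        uintptr_t last_addr;
        intptr_t  stride;
        int       confident;
    } stream_t;

    void observe(stream_t *s, uintptr_t addr) {
        intptr_t stride = (intptr_t)(addr - s->last_addr);
        if (stride != 0 && stride == s->stride) {
            s->confident = 1;
            /* would now issue prefetches for addr + stride, ..., addr + 4*stride */
        } else {
            s->confident = 0;  /* stride changed: retrain on the new stride */
            s->stride = stride;
        }
        s->last_addr = addr;
    }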

Core 2 Hardware Prefetching
[Diagram: the Core 2 memory hierarchy again (64 B blocks in L1/L2), with prefetchers at two levels: L2->L1 data and instruction prefetching, and Mem->L2 data prefetching.]
Includes next-block prefetching and multiple streaming prefetchers.
They will only prefetch within a page boundary (the details are kept vague/secret).

Software Prefetching
Hardware provides special prefetch instructions, e.g., Intel's prefetchnta instruction.
The compiler or the programmer can insert them into the code, and can prefetch patterns that hardware wouldn't recognize (e.g., non-strided ones):

    void process_list(list_t *head) {
        list_t *p = head;
        while (p) {
            process(p);
            p = p->next;
        }
    }

    void process_list_PF(list_t *head) {
        list_t *p = head;
        list_t *q;
        while (p) {
            q = p->next;
            prefetch(q);   /* start fetching the next node now...   */
            process(p);    /* ...while we work on the current one   */
            p = q;
        }
    }

This assumes process() runs long enough to hide the prefetch latency.
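prefetch() above is left abstract on the slide; one possible implementation, sketched under the assumption of an x86 compiler with SSE intrinsics, maps it to the prefetchnta instruction mentioned earlier:

    #include <xmmintrin.h>

    /* Non-temporal prefetch hint (compiles to prefetchnta on x86).
       Prefetch instructions never fault, but skipping NULL avoids a useless hint. */
    static inline void prefetch(const void *p) {
        if (p)
            _mm_prefetch((const char *)p, _MM_HINT_NTA);
    }

GCC's portable __builtin_prefetch(p) would work equally well here.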

– 18 – Memory Optimizations: Review
Caches:
Conflict misses: less of a concern due to high associativity (8-way L1, 16-way L2).
Cache capacity: the main concern; keep the working set within on-chip cache capacity. Focus on either L1 or L2, depending on the required working-set size.
Virtual memory:
Page misses: keep the “big-picture” working set within main-memory capacity.
TLB misses: you may want to keep the working set's number of pages below the number of TLB entries.
Prefetching:
Try to arrange data structures and access patterns to favor sequential/strided access.
Try compiler-inserted or manually inserted prefetch instructions.