CSE502: Computer Architecture Prefetching

Prefetching (1/3)

Fetch block ahead of demand
Targets compulsory, capacity, (& coherence) misses
– Why not conflict misses? A prefetched block maps to the same contended set, so it can itself evict useful data
Big challenges:
– Knowing “what” to fetch
  Fetching useless blocks wastes resources
– Knowing “when” to fetch
  Too early → clutters storage (or gets thrown out before use)
  Fetching too late → defeats the purpose of “pre”-fetching

Prefetching (2/3)

[Timeline figure: without prefetching, a load pays the full L1 → L2 → DRAM latency before its data returns (total load-to-use latency). With an early prefetch, the data is already in (or on its way to) the cache when the load issues, giving a much improved load-to-use latency; a later prefetch still hides part of the miss, giving a somewhat improved latency.]

Prefetching must be accurate and timely

Prefetching (3/3)

[Timeline figure: without prefetching, execution stalls on the load’s miss; with prefetching, the miss overlaps earlier execution.]

Prefetching removes loads from the critical path

Common “Types” of Prefetching

– Software
– Next-Line, Adjacent-Line
– Next-N-Line
– Stream Buffers
– Stride
– “Localized” (e.g., PC-based)
– Pointer
– Correlation

Software Prefetching (1/4)

Compiler/programmer places prefetch instructions
Put prefetched value into…
– Register (binding, also called “hoisting”)
  May prevent instructions from committing
– Cache (non-binding)
  Requires ISA support
  May get evicted from cache before the demand access

Software Prefetching (2/4)

[Figure: a control-flow graph with basic blocks A, B, and C, shown three ways; cache misses in red.]
Hoisting: move the load “R1 = [R2]” from block C up into block A, so the miss is hopefully serviced by the time execution reaches the consumer “R3 = R1+4”.
Hoisting must be aware of dependencies: an intervening write such as “R1 = R1 - 1” makes a binding hoist unsafe.
Using a non-binding prefetch instruction (“PREFETCH [R2]”) avoids these data-dependence problems.

Software Prefetching (3/4)

for (I = 1; I < rows; I++) {
  for (J = 1; J < columns; J++) {
    prefetch(&x[I+1][J]);   /* prefetch same column, next row */
    sum = sum + x[I][J];
  }
}
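
As a compiler-specific illustration (not from the slides), here is a minimal runnable sketch using the GCC/Clang __builtin_prefetch intrinsic, assuming a row-major double matrix; the one-row prefetch distance is illustrative, not tuned:

#include <stddef.h>

/* Sum a row-major matrix while prefetching the same column of the
 * next row. __builtin_prefetch(addr, rw, locality): rw=0 means the
 * data will be read, locality=3 means keep it in all cache levels. */
double sum_matrix(const double *x, size_t rows, size_t cols)
{
    double sum = 0.0;
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            if (i + 1 < rows)   /* stay in bounds on the last row */
                __builtin_prefetch(&x[(i + 1) * cols + j], 0, 3);
            sum += x[i * cols + j];
        }
    }
    return sum;
}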

Software Prefetching (4/4)

Pros:
– Gives programmer control and flexibility
– Allows time for complex (compiler) analysis
– No (major) hardware modifications needed
Cons:
– Hard to perform timely prefetches
  At IPC=2 and 100-cycle memory → move load 200 instructions earlier
  Might not even have 200 instructions in the current function
– Prefetching earlier and more often leads to low accuracy
  Program may go down a different path
– Prefetch instructions increase code footprint
  May cause more I$ misses, code alignment issues

Hardware Prefetching (1/3)

Hardware monitors memory accesses
– Looks for common patterns
Guessed addresses are placed into a prefetch queue
– Queue is checked when no demand accesses are waiting
Prefetches look like READ requests to the hierarchy
– Although they may get a special “prefetched” flag in the state bits
Prefetchers trade bandwidth for latency
– Extra bandwidth used only when guessing incorrectly
– Latency reduced only when guessing correctly
No need to change software

Hardware Prefetching (2/3)

[Figure: the memory hierarchy (registers, L1 I-cache, L1 D-cache, I-TLB, D-TLB, L2 cache, L3 cache/LLC, main memory/DRAM) annotated with potential prefetcher locations.]

Hardware Prefetching (3/3)

Real CPUs have multiple prefetchers
– Usually closer to the core (easier to detect patterns)
– Prefetching at the LLC is hard (cache is banked and hashed)

[Figure: the same hierarchy annotated with the Intel Core 2 prefetcher locations.]

Next-Line (or Adjacent-Line) Prefetching

On request for line X, prefetch X+1 (or X^0x1 for adjacent-line)
– Assumes spatial locality
  Often a good assumption
– Should stop at physical (OS) page boundaries
Can often be done efficiently
– Adjacent-line is convenient when the next-level block is bigger
– Prefetch from DRAM can use bursts and row-buffer hits
Works for I$ and D$
– Instructions execute sequentially
– Large data structures often span multiple blocks
Simple, but usually not timely (see the sketch after the Next-N-Line slide below)

Next-N-Line Prefetching

On request for line X, prefetch X+1, X+2, …, X+N
– N is called “prefetch depth” or “prefetch degree”
Must carefully tune depth N. Large N is…
– More likely to be useful (correct and timely)
– More aggressive → more likely to make a mistake
  Might evict something useful
– More expensive → needs storage for prefetched lines
  Might delay a useful request on an interconnect or port
Still simple, but more timely than Next-Line
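
A minimal hardware-model sketch in C of depth-N next-line prefetching (N = 1 gives plain next-line); the block/page sizes and the issue_prefetch() hook are assumptions, not from the slides:

#include <stdint.h>

#define BLOCK_BITS 6   /* 64-byte cache blocks (assumed) */
#define PAGE_BITS  12  /* 4 KB physical pages (assumed)  */
#define DEPTH_N    4   /* prefetch degree (assumed)      */

/* Hypothetical hook into the memory hierarchy. */
void issue_prefetch(uint64_t block_addr);

/* On a demand request for physical address paddr, prefetch the next
 * N blocks, stopping at the physical page boundary. */
void next_n_line(uint64_t paddr)
{
    uint64_t page  = paddr >> PAGE_BITS;
    uint64_t block = paddr >> BLOCK_BITS;
    for (int k = 1; k <= DEPTH_N; k++) {
        uint64_t addr = (block + k) << BLOCK_BITS;
        if ((addr >> PAGE_BITS) != page)   /* don't cross the page */
            break;
        issue_prefetch(addr);
    }
}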

Stream Buffers (1/3)

What if we have multiple intertwined streams?
– A, B, A+1, B+1, A+2, B+2, …
Can use multiple stream buffers to track the streams
– Keep the next N lines available in each buffer
– On request for line X, shift the buffer and fetch X+N+1 into it
Can extend to a “quasi-sequential” stream buffer
– On a request for Y in [X…X+N], advance by Y−X+1
– Allows the buffer to work when items are skipped
– Requires expensive (associative) comparison
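
A minimal C model of a handful of stream buffers (the buffer count, depth, and issue_prefetch() hook are assumptions; the quasi-sequential variant is omitted):

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define DEPTH 4   /* entries per stream buffer (assumed) */
#define NBUFS 2   /* number of stream buffers (assumed)  */

void issue_prefetch(uint64_t block);   /* hypothetical hierarchy hook */

/* One FIFO of prefetched block addresses; entry[0] is the head. */
struct stream_buffer {
    bool     valid;
    uint64_t entry[DEPTH];
};

static struct stream_buffer bufs[NBUFS];

/* On an L1 miss for block X: if X matches the head of a buffer,
 * consume it, shift, and fetch one more block at the tail.
 * Otherwise (re)allocate a buffer for the new stream. */
void stream_buffer_access(uint64_t x)
{
    for (int b = 0; b < NBUFS; b++) {
        if (bufs[b].valid && bufs[b].entry[0] == x) {
            memmove(&bufs[b].entry[0], &bufs[b].entry[1],
                    (DEPTH - 1) * sizeof(uint64_t));
            bufs[b].entry[DEPTH - 1] = x + DEPTH;   /* refill tail */
            issue_prefetch(x + DEPTH);
            return;
        }
    }
    /* Miss in all buffers: allocate one (round-robin, simplified). */
    static int next_alloc;
    struct stream_buffer *sb = &bufs[next_alloc++ % NBUFS];
    sb->valid = true;
    for (int i = 0; i < DEPTH; i++) {
        sb->entry[i] = x + 1 + i;
        issue_prefetch(x + 1 + i);
    }
}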

Stream Buffers (2/3)

[Figures from Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” ISCA ’90.]

Stream Buffers (3/3)

Can support multiple streams in parallel

Stride Prefetching (1/2)

Access patterns often follow a stride
– Accessing a column of elements in a matrix
– Accessing elements in an array of structs
Detect stride S, prefetch with depth N
– Prefetch X+1∙S, X+2∙S, …, X+N∙S

[Figure: strided accesses to a column in a matrix and to elements in an array of structs.]

Stride Prefetching (2/2)

Must carefully select depth N
– Same constraints as the Next-N-Line prefetcher
How to tell whether A[i] → A[i+1] is a real stride or just a coincidental X → Y?
– Wait until A[i+2] (or more) confirms the stride
– Can vary prefetch depth based on confidence
  More consecutive strided accesses → higher confidence

[Figure: stride-detection hardware with Last Addr, Stride, and Count registers; a new access to A+3N matches last address A+2N plus stride N, so the count is incremented, and once count > 2 the prefetch for A+4N is issued. A table-based sketch appears after the “Localized” Stride Prefetchers (2/2) slide below.]

“Localized” Stride Prefetchers (1/2)

What if multiple strides are interleaved?
– No clearly-discernible stride: the miss pattern looks like
  A, X, Y, A+N, X+N, Y+N, A+2N, X+2N, Y+2N, …
  with apparent “strides” (X−A), (Y−X), (A+N−Y), …
– Could track multiple strides like stream buffers
  Expensive (must detect/compare many strides on each access)
– Accesses to a structure are usually localized to one instruction
Use an array of strides, indexed by PC

[Figure: four instructions (Load R1 = [R2]; Load R3 = [R4]; Store [R6] = R5; Add R5, R1, R3), with each memory instruction producing its own regular stride.]

“Localized” Stride Prefetchers (2/2)

Store PC, last address, last stride, and a count in the RPT (Reference Prediction Table)
On access, check the RPT
– Same stride? → count++ if yes; count-- (or count = 0) if no
– If the count is high, prefetch (last address + stride∙N)

[Figure: three instructions (PCa: 0x409A34 Load R1 = [R2]; PCb: 0x409A38 Load R3 = [R4]; PCc: 0x409A40 Store [R6] = R5) index an RPT with Tag / Last Addr / Stride / Count fields, e.g., (A+3N, N, 2), (X+3N, N, 2), (Y+2N, N, 1). If confident about the stride (count > Cmin), prefetch A+4N.]
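
A minimal, direct-mapped RPT model in C (table size, confidence threshold, prefetch distance, and the issue_prefetch() hook are assumptions):

#include <stdint.h>

#define RPT_ENTRIES 256   /* direct-mapped table (assumed)  */
#define CONF_MIN    2     /* confidence threshold (assumed) */
#define DEPTH_N     4     /* prefetch distance (assumed)    */

void issue_prefetch(uint64_t addr);   /* hypothetical hierarchy hook */

struct rpt_entry {
    uint64_t tag;        /* PC of the memory instruction */
    uint64_t last_addr;  /* last data address it touched */
    int64_t  stride;     /* last observed stride         */
    int      count;      /* confidence counter           */
};

static struct rpt_entry rpt[RPT_ENTRIES];

/* Train on every load/store with its (pc, data address) pair. */
void rpt_access(uint64_t pc, uint64_t addr)
{
    struct rpt_entry *e = &rpt[(pc >> 2) % RPT_ENTRIES];

    if (e->tag != pc) {                 /* new instruction: reset entry */
        e->tag = pc;
        e->last_addr = addr;
        e->stride = 0;
        e->count = 0;
        return;
    }

    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride == e->stride && stride != 0)
        e->count++;                     /* same stride: more confident  */
    else {
        e->stride = stride;             /* new stride: restart training */
        e->count = 0;
    }
    e->last_addr = addr;

    if (e->count > CONF_MIN)            /* confident: prefetch ahead */
        issue_prefetch(addr + (uint64_t)(e->stride * DEPTH_N));
}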

Other Patterns

Sometimes accesses are regular, but there is no stride
– Linked data structures (e.g., lists or trees)

[Figure: a linked list traversed in the order A→B→C→D→E→F whose nodes sit in memory in a different order, so there is no chance to detect a stride.]

Pointer Prefetching (1/2)

When a miss fills a cache block (512 bits of data), scan the block’s words: pointers usually “look different” from integer data
– Words that look like valid addresses (“Maybe!”) are candidate pointers; others (“Nope”) are not
– Go ahead and prefetch the candidates (needs some help from the TLB)

struct bintree_node_t {
    int data1;
    int data2;
    struct bintree_node_t *left;
    struct bintree_node_t *right;
};

This allows you to walk the tree (or other pointer-based data structures, which are typically hard to prefetch)
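
A minimal software model of this content-directed scan; the “looks like a pointer” heuristic below (the candidate shares upper address bits with the miss address) and the issue_prefetch() hook are assumptions:

#include <stdint.h>
#include <stdbool.h>

#define BLOCK_BYTES 64                 /* 512-bit cache block */
#define WORDS (BLOCK_BYTES / 8)

void issue_prefetch(uint64_t addr);    /* hypothetical hierarchy hook */

/* Crude "looks like a pointer" test: the candidate shares the upper
 * address bits with the miss address (i.e., points near memory we are
 * already touching). Real hardware would also consult the TLB to
 * check that the target page is mapped. */
static bool looks_like_pointer(uint64_t word, uint64_t miss_addr)
{
    return word != 0 && (word >> 32) == (miss_addr >> 32);
}

/* On a cache fill, scan the block for candidate pointers and
 * prefetch what they point to. */
void on_fill(const uint64_t block[WORDS], uint64_t miss_addr)
{
    for (int i = 0; i < WORDS; i++)
        if (looks_like_pointer(block[i], miss_addr))
            issue_prefetch(block[i]);
}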

Pointer Prefetching (2/2)

Relatively cheap to implement
– No extra hardware needed to store patterns
Limited lookahead makes timely prefetches hard
– Can’t get the next pointer until the current data block is fetched

[Figure: a stride prefetcher can issue X, X+N, X+2N in parallel, overlapping their access latencies; a pointer prefetcher must serialize A → B → C, paying each access latency in full.]

Pair-wise Temporal Correlation (1/2)

Accesses exhibit temporal correlation
– If E followed D in the past → when we see D, prefetch E

[Figure: a correlation table maps each miss address (A…F) to its last observed successor (A→B, B→C, …); the linked-list traversal order is captured even though the actual memory layout is scattered.]

Can be used recursively to get more lookahead
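
A minimal pair-wise correlation table in C (table size and the issue_prefetch() hook are assumptions; following a prediction recursively would add lookahead):

#include <stdint.h>
#include <stdbool.h>

#define CT_ENTRIES 1024                /* table size (assumed) */

void issue_prefetch(uint64_t addr);    /* hypothetical hierarchy hook */

struct ct_entry {
    uint64_t tag;        /* miss address            */
    uint64_t next;       /* its last seen successor */
    bool     valid;
};

static struct ct_entry ct[CT_ENTRIES];
static uint64_t last_miss;
static bool have_last;

/* On each cache miss: record (last_miss -> addr), then prefetch the
 * recorded successor of addr, if any. */
void correlate_miss(uint64_t addr)
{
    if (have_last) {                   /* train: pair-wise link */
        struct ct_entry *t = &ct[last_miss % CT_ENTRIES];
        t->tag = last_miss;
        t->next = addr;
        t->valid = true;
    }
    last_miss = addr;
    have_last = true;

    struct ct_entry *e = &ct[addr % CT_ENTRIES];   /* predict */
    if (e->valid && e->tag == addr)
        issue_prefetch(e->next);
}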

Pair-wise Temporal Correlation (2/2)

Many patterns are more complex than linked lists
– Can be represented by a Markov Model
– Requires tracking multiple potential successors per address
  The number of candidates tracked is called the breadth

[Figure: a Markov model over addresses A…F and the corresponding correlation table with multiple successor columns per entry.]

Recursive breadth & depth grows exponentially

Increasing Correlation History Length

Longer history enables more complex patterns
– Use a hash of the recent miss history for lookup
– Increases training time

[Figure: a binary tree with nodes A…G; its DFS traversal A,B,D,B,E,B,A,C,F,C,G,C,A produces repeating multi-miss contexts (e.g., history “D,B” predicts E while “E,B” predicts A) that a single-predecessor table cannot distinguish.]

Much better accuracy, but exponential storage cost
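
The pair-wise table sketched earlier extends naturally to longer histories by indexing on a hash of the last two (or more) miss addresses rather than the last one alone. A tiny sketch (the mixing constant is an arbitrary assumption):

#include <stdint.h>

/* Index the correlation table by a hash of the last two misses so
 * that contexts like "D,B -> E" and "E,B -> A" in the DFS example
 * map to different entries. */
static inline uint64_t history_hash(uint64_t prev_miss, uint64_t last_miss)
{
    return (prev_miss * 0x9E3779B97F4A7C15ull) ^ last_miss;
}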

Spatial Correlation (1/2)

Irregular layout → non-strided
Sparse → too sparse to capture with (larger) cache blocks
But repetitive → predict it to improve MLP (memory-level parallelism)

[Figure: an 8 kB database page in memory, with page header, tuple data, and tuple slot index — a large-scale repetitive spatial access pattern.]

Spatial Correlation (2/2)

Logically divide memory into regions
Identify each region by its base address
Store the spatial pattern (a bit vector of accessed blocks) in a correlation table

[Figure: regions A and B each map to a bit-vector entry in the correlation table (PCx: A′ → 110…, PCy: B′ → …111); on a trigger access, the stored pattern is expanded into the to-prefetch FIFO.]
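
A minimal sketch of the pattern table in C (region/block sizes, table size, PC indexing, and the issue_prefetch() hook are assumptions; tracking of the active region generation is elided):

#include <stdint.h>
#include <stdbool.h>

#define BLOCK_BITS   6                 /* 64 B blocks (assumed)   */
#define REGION_BITS  12                /* 4 kB regions (assumed)  */
#define BLOCKS_PER_REGION (1u << (REGION_BITS - BLOCK_BITS))
#define PT_ENTRIES   512               /* pattern table (assumed) */

void issue_prefetch(uint64_t addr);    /* hypothetical hierarchy hook */

struct pattern_entry {
    uint64_t tag;       /* trigger PC                      */
    uint64_t bitvec;    /* one bit per block in the region */
    bool     valid;
};

static struct pattern_entry pt[PT_ENTRIES];

/* Prediction side: on the first access to a region, replay the
 * pattern recorded for this PC over the new region's base. */
void spatial_predict(uint64_t pc, uint64_t addr)
{
    struct pattern_entry *e = &pt[pc % PT_ENTRIES];
    if (!e->valid || e->tag != pc)
        return;
    uint64_t base = addr >> REGION_BITS << REGION_BITS;
    for (unsigned b = 0; b < BLOCKS_PER_REGION; b++)
        if (e->bitvec & (1ull << b))
            issue_prefetch(base + ((uint64_t)b << BLOCK_BITS));
}

/* Training side (simplified): as a region is used, set one bit per
 * block touched; the vector is written back to the table when the
 * region's generation ends (tracker not shown). */
void spatial_train(struct pattern_entry *e, uint64_t addr)
{
    unsigned b = (addr >> BLOCK_BITS) & (BLOCKS_PER_REGION - 1);
    e->bitvec |= 1ull << b;
}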

Evaluating Prefetchers

Compare against larger caches
– Complex prefetcher vs. simple prefetcher with a larger cache
Primary metrics
– Coverage: prefetched hits / base misses
– Accuracy: prefetched hits / total prefetches
– Timeliness: latency of prefetched blocks / hit latency
Secondary metrics
– Pollution: misses / (prefetched hits + base misses)
– Bandwidth: (total prefetches + misses) / base misses
– Power, energy, area, …
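
In code form these ratios are straightforward to compute from raw event counts; a minimal sketch (the counter names are assumptions, and callers must ensure non-zero denominators):

struct pf_stats {
    double base_misses;      /* misses without prefetching        */
    double misses;           /* misses remaining with prefetching */
    double prefetched_hits;  /* demand hits on prefetched blocks  */
    double total_prefetches; /* prefetches issued                 */
};

static double coverage(const struct pf_stats *s)
{ return s->prefetched_hits / s->base_misses; }

static double accuracy(const struct pf_stats *s)
{ return s->prefetched_hits / s->total_prefetches; }

static double pollution(const struct pf_stats *s)
{ return s->misses / (s->prefetched_hits + s->base_misses); }

static double bandwidth_overhead(const struct pf_stats *s)
{ return (s->total_prefetches + s->misses) / s->base_misses; }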

Hardware Prefetcher Design Space

What to prefetch?
– Predict regular patterns (x, x+8, x+16, …)
– Predict correlated patterns (A…B→C, B…C→J, A…C→K, …)
When to prefetch?
– On every reference → lots of lookup/prefetcher overhead
– On every miss → patterns filtered by the caches
– On hits to prefetched data (positive feedback)
Where to put prefetched data?
– Prefetch buffers
– Caches

What’s Inside Today’s Chips

Data L1
– PC-localized stride predictors
– Short-stride predictors within a block → prefetch the next block
Instruction L1
– Predict the future PC → prefetch
L2
– Stream buffers
– Adjacent-line prefetch