Data Prefetching
Smruti R. Sarangi

Data Prefetching

- Instead of instructions, let us now prefetch data
- Important distinction:
  - Instructions are fetched into the in-order part of the OOO pipeline
  - Data is fetched into the out-of-order part of the pipeline
  - As a result, the IPC is slightly more resilient to data cache misses
- Nevertheless, prefetching is useful
- On the other hand, x86 has instructions where a single instruction (with the rep prefix) can load or store many bytes from memory
  - Small i-cache footprint, yet large d-cache footprint

Outline

- Stride-based Prefetching
- Prefetching based on Runahead Execution

History-based Prefetching

Consider the following piece of code:

    for (i = 0; i < N; i++) {
        ...
        A[i] = B[i] + C[2*i];
    }

- For the arrays A and B, the addresses accessed in consecutive iterations differ by 4 bytes (assumed to be the size of an integer)
- For the array C, consecutive accesses differ by 8 bytes
- This statement translates into multiple load/store RISC instructions; each such instruction makes only one memory access
- The hardware will observe that, for the same PC, the memory address keeps getting incremented by a fixed value (4 or 8 bytes in this case)
- This fixed increment is called a stride

Reference Prediction Table (RPT)

[Figure: the RPT is indexed by the PC; each entry holds an instruction tag, the previous address, the stride, and a state field. An adder combines the memory address with the stride to produce the prefetched address.]

- For each instruction, keep track of the last address and the stride
- You can decide to prefetch N iterations in advance
- State: initial → transient (not sure about the stride) → steady
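The following is a minimal C sketch of this update logic, assuming a direct-mapped table indexed by the low-order PC bits; RPT_SIZE, rpt_access(), and issue_prefetch() are illustrative names, not taken from any specific design.

    #include <stdint.h>

    enum rpt_state { INITIAL, TRANSIENT, STEADY };

    struct rpt_entry {
        uint64_t tag;        /* PC of the load/store instruction */
        uint64_t prev_addr;  /* last memory address it accessed  */
        int64_t  stride;     /* last observed address delta      */
        enum rpt_state state;
    };

    #define RPT_SIZE 256
    static struct rpt_entry rpt[RPT_SIZE];

    void issue_prefetch(uint64_t addr);  /* hypothetical cache interface */

    /* Called on every load/store with its PC and effective address;
     * 'distance' is the number of iterations to prefetch in advance. */
    void rpt_access(uint64_t pc, uint64_t addr, int distance)
    {
        struct rpt_entry *e = &rpt[pc % RPT_SIZE];

        if (e->tag != pc) {              /* new instruction: allocate */
            e->tag = pc;
            e->prev_addr = addr;
            e->stride = 0;
            e->state = INITIAL;
            return;
        }

        int64_t delta = (int64_t)(addr - e->prev_addr);
        if (e->state != INITIAL && delta == e->stride) {
            e->state = STEADY;           /* same stride twice in a row */
        } else {
            e->stride = delta;           /* (re)learn the stride */
            e->state = TRANSIENT;
        }
        e->prev_addr = addr;

        /* Only a steady, nonzero stride triggers a prefetch. */
        if (e->state == STEADY && e->stride != 0)
            issue_prefetch(addr + (int64_t)distance * e->stride);
    }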

Extensions

- How many iterations should we prefetch in advance?
  - This depends upon the rest of the instructions in the for loop
  - It should ideally be adjusted dynamically
- For a subset of prefetches, monitor the number of cycles between when a line is prefetched and when it is actually used (in the steady state)
  - This gap should not be negative
  - It should be a small positive number
- Extension to the RPT scheme: track the next address for a linked list (if p is prefetched, prefetch *(p->next)); see the sketch below
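As a rough software analogue of the linked-list extension, the C sketch below prefetches the next node while the current one is being processed. The hardware scheme does this transparently; here we stand in for it with GCC/Clang's __builtin_prefetch intrinsic, and sum_list() is a made-up example function.

    struct node {
        int data;
        struct node *next;
    };

    long sum_list(struct node *p)
    {
        long sum = 0;
        while (p) {
            /* If p is being used, prefetch *(p->next) for a read,
             * with low expected temporal locality. */
            if (p->next)
                __builtin_prefetch(p->next, 0, 1);
            sum += p->data;
            p = p->next;
        }
        return sum;
    }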

Outline

- Stride-based Prefetching
- Prefetching based on Runahead Execution

Runahead Mode

- What happens when we do bad prefetching? We have misses.
- L1 misses can more or less be tolerated by an OOO pipeline
- L2 misses are a bigger problem
- What happens on an L2 miss?
  - We go to main memory (200-400 cycles)
- What does the OOO pipeline do?
  - The instruction window (IW) and ROB fill up
  - It pretty much stalls
- Idea: do some work during the stalled period

What can you do?

- Enter runahead mode:
  - Return a junk value for the L2 miss
  - Restart execution
  - You will produce junk values (in the forward slice of the miss)
  - That is still okay. Why? Because you will prefetch data into the caches
- Once the L2 miss returns:
  - Flush the instructions in the forward slice of the L2 miss
  - Re-execute them
- This is a very effective prefetching technique

Operation

- Before entering runahead mode:
  - Take a checkpoint of the architectural register file
  - Also checkpoint the branch history register and the return address stack
- Start the runahead mode:
  - Add an invalid (INV) bit to each register
  - The L2 miss generates an INV value
  - All the instructions in its forward slice produce INV values
  - The INV bit is the same as the poison bit (learnt earlier)
- Question: do we restrict INV values to the pipeline, or let them propagate?
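The entry sequence can be summarized with a small C sketch; every function here stands for a hardware action, and the checkpoint layout, register counts, and names are all illustrative assumptions.

    #include <stdint.h>

    #define NUM_ARCH_REGS 32   /* assumed sizes, purely illustrative */
    #define RAS_DEPTH     16

    struct checkpoint {
        uint64_t arf[NUM_ARCH_REGS];  /* architectural register file */
        uint64_t bhr;                 /* branch history register     */
        uint64_t ras[RAS_DEPTH];      /* return address stack        */
    };

    /* Hypothetical hardware actions, written as C calls. */
    void take_checkpoint(struct checkpoint *c);
    void mark_register_inv(int dest_reg);
    void begin_runahead(void);

    static struct checkpoint ckpt;

    /* Invoked when a load misses in the L2 and the ROB is about to stall. */
    void enter_runahead(int miss_dest_reg)
    {
        take_checkpoint(&ckpt);            /* ARF + BHR + RAS            */
        mark_register_inv(miss_dest_reg);  /* the miss returns a junk
                                              (INV) value                */
        begin_runahead();                  /* keep executing; INV bits
                                              taint the forward slice    */
    }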

Value Propagation

[Figure: INV values may be confined to the pipeline, allowed into the L1, or allowed into the L2]

- If confined to the pipeline → nullify all stores
- If allowed till the L1 → do not evict INV data from the L1
- Preferably do not go till the L2 (massive amount of state management)

Let us take INV values till the L1

- If we use the traditional L1 cache, there will be some problems:
  - We need to maintain both INV and non-INV data together
  - What if they are on the same line? (additional state required)
- Solution: add a separate runahead cache that contains only INV data

[Figure: the pipeline accesses the runahead L1 and the regular L1 side by side]

Runahead L1 Cache

- A store might have the INV flag set for two reasons:
  - Its address is INV → either ignore the store, or assume that the address is correct and access the runahead cache
  - Its data is INV → access the runahead cache and perform the store
- Keep some additional state per line: mark the words written by an INV store as INV

[Figure: a cache line with per-word INV bits recording which words were written by INV stores]
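Here is a hedged C sketch of this store path, assuming a runahead cache with one INV bit per word; WORDS_PER_LINE, ra_lookup_or_alloc(), and the line layout are hypothetical.

    #include <stdint.h>
    #include <stdbool.h>

    #define WORDS_PER_LINE 8

    struct ra_line {
        uint32_t word[WORDS_PER_LINE];
        bool     inv[WORDS_PER_LINE];  /* word written by an INV store? */
    };

    /* Hypothetical runahead-cache lookup (allocates on a miss). */
    struct ra_line *ra_lookup_or_alloc(uint64_t addr);

    void runahead_store(uint64_t addr, uint32_t data,
                        bool addr_is_inv, bool data_is_inv)
    {
        if (addr_is_inv)
            return;  /* simplest option from the slide: ignore the store */

        struct ra_line *line = ra_lookup_or_alloc(addr);
        unsigned w = (unsigned)(addr / sizeof(uint32_t)) % WORDS_PER_LINE;

        line->word[w] = data;          /* perform the store              */
        line->inv[w]  = data_is_inv;   /* record whether its data is INV */
    }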

Loading a Value

- First try forwarding in the LSQ
- If not possible, access the runahead cache and the L1 cache in parallel:
  - The runahead cache gets preference
  - Load the data; if it is marked as INV, mark the load's result as INV
  - Otherwise, load the data from the L1 cache
  - If there is a miss, load the data from the L2 (this acts as prefetching)
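A matching C sketch of the load path, written sequentially for clarity even though the hardware probes the runahead cache and the L1 in parallel; all the lookup functions are hypothetical hardware actions.

    #include <stdint.h>
    #include <stdbool.h>

    bool lsq_forward(uint64_t addr, uint32_t *data, bool *inv);
    bool ra_cache_lookup(uint64_t addr, uint32_t *data, bool *inv);
    bool l1_lookup(uint64_t addr, uint32_t *data);
    uint32_t l2_fetch(uint64_t addr);  /* fills the L1: acts as a prefetch */

    uint32_t runahead_load(uint64_t addr, bool *inv)
    {
        uint32_t data;

        if (lsq_forward(addr, &data, inv))      /* 1. forward from the LSQ */
            return data;
        if (ra_cache_lookup(addr, &data, inv))  /* 2. runahead cache wins
                                                      over the L1          */
            return data;

        *inv = false;                           /* regular data is not INV */
        if (l1_lookup(addr, &data))             /* 3. try the L1           */
            return data;
        return l2_fetch(addr);                  /* 4. miss: go to the L2   */
    }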

Operation (contd.)

- Keep fetching and executing instructions in runahead mode
  - All the instructions before the missed load (in program order) will retire
  - After that, all instructions are in runahead mode (they will not retire)
- As we fetch and execute more and more instructions, we in effect do more prefetching
- What about updating branch predictors?
  - Best option: treat runahead instructions as normal instructions
- Return from runahead mode: as with a branch misprediction, restore the checkpoint (see the sketch below)
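And a matching sketch of the exit sequence, reusing the checkpoint taken at entry; again, every function is a hypothetical hardware action written as a C call.

    #include <stdint.h>

    struct checkpoint;  /* taken at runahead entry (see the earlier sketch) */

    void flush_pipeline(void);
    void restore_checkpoint(const struct checkpoint *c);
    void invalidate_runahead_cache(void);
    void resume_from(uint64_t pc);

    /* Invoked when the original L2 miss finally returns. */
    void exit_runahead(const struct checkpoint *c, uint64_t miss_pc)
    {
        flush_pipeline();             /* runahead results never retire     */
        restore_checkpoint(c);        /* like branch-misprediction recovery */
        invalidate_runahead_cache();  /* INV data must not survive          */
        resume_from(miss_pc);         /* re-execute; the data is now
                                         (pre)fetched into the caches       */
    }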

THE END