Address-Value Delta (AVD) Prediction
Onur Mutlu, Hyesoon Kim, Yale N. Patt



Slide 2: What is AVD Prediction?
A new prediction technique used to break the data dependences between dependent load instructions.

Slide 3: Talk Outline
- Background on Runahead Execution
- The Problem: Dependent Cache Misses
- AVD Prediction
- Why Does It Work?
- Evaluation
- Conclusions

Slide 4: Background on Runahead Execution
A technique to obtain the memory-level parallelism benefits of a large instruction window.
- When the oldest instruction is an L2 miss: checkpoint architectural state and enter runahead mode.
- In runahead mode: instructions are speculatively pre-executed; the purpose of pre-execution is to generate prefetches; L2-miss-dependent instructions are marked INV and dropped.
- Runahead mode ends when the original L2 miss returns: the checkpoint is restored and normal execution resumes.

Slide 5: Runahead Example
[Timeline figure] With a small window, the processor computes, stalls on Miss 1, computes again, then stalls on Miss 2. With runahead, Load 2's miss is issued during runahead mode, so Load 2 hits once normal execution resumes, saving cycles. This works when Load 1 and Load 2 are independent.

Slide 6: The Problem: Dependent Cache Misses
Runahead execution cannot parallelize dependent misses.
- This limitation results in wasted opportunity to improve performance and wasted energy (useless pre-execution).
- Runahead performance would improve by 25% if this limitation were ideally overcome.
[Timeline figure] Load 2 is dependent on Load 1, so runahead cannot compute its address; Load 2 is marked INV.

Slide 7: The Goal
Enable the parallelization of dependent L2 cache misses in runahead mode with a low-cost mechanism.
How: predict the values of L2-miss address (pointer) loads.
An address load loads an address into its destination register, which is later used to calculate the address of another load (as opposed to a data load, which loads data that is not used as an address).

Slide 8: Parallelizing Dependent Misses
[Timeline figure] Without value prediction, runahead cannot compute Load 2's address, so Load 2 is INV and Miss 2 stalls normal execution. With the value of Load 1 predicted, runahead can compute Load 2's address and issue Miss 2 early, so Load 2 hits after runahead ends, saving cycles and speculative instructions.

Slide 9: A Question
How can we predict the values of address loads with low hardware cost and complexity?

Slide 10: Talk Outline
- Background on Runahead Execution
- The Problem: Dependent Cache Misses
- AVD Prediction
- Why Does It Work?
- Evaluation
- Conclusions

Slide 11: The Solution: AVD Prediction
The address-value delta (AVD) of a load instruction is defined as:
    AVD = Effective Address of Load - Data Value of Load
- For some address loads, the AVD is stable.
- An AVD predictor keeps track of the AVDs of address loads.
- When a load is an L2 miss in runahead mode, the AVD predictor is consulted.
- If the predictor returns a stable (confident) AVD for that load, the value of the load is predicted:
    Predicted Value = Effective Address - Predicted AVD

Slide 12: Identifying Address Loads in Hardware
Insight: if the AVD is too large, the loaded value is likely not an address.
- Only keep track of loads that satisfy: -MaxAVD ≤ AVD ≤ +MaxAVD
- This identification mechanism eliminates many loads from consideration, which enables the AVD predictor to be small.
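The filter on this slide amounts to one comparison per retired load. A minimal sketch in C, assuming an illustrative MaxAVD of 64 KB (the actual bound is a tuned design parameter, evaluated later in the deck):

```c
/* Sketch of the address-load identification filter: a load is tracked only
   if its address-value delta falls within [-MaxAVD, +MaxAVD]. The MAXAVD
   value below is an assumed, illustrative choice, not the paper's setting. */
#include <stdint.h>

#define MAXAVD (64 * 1024)

static int is_candidate_address_load(uint64_t eff_addr, uint64_t value) {
    int64_t avd = (int64_t)(eff_addr - value);
    return avd >= -MAXAVD && avd <= MAXAVD;
}
```

A pointer loaded from a nearby heap object passes the filter; a small integer constant loaded from a far-away address does not, so it never occupies a predictor entry.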

Slide 13: An Implementable AVD Predictor
Set-associative prediction table. Each entry consists of:
- Tag (program counter of the load)
- Last AVD seen for the load
- Confidence counter for the recorded AVD
Updated when an address load retires in normal mode. Accessed when a load misses in the L2 cache in runahead mode.
Recovery-free: no need to recover the state of the processor or the predictor on a misprediction, because runahead mode is purely speculative.
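The table described above can be sketched in software as follows. This is a hedged illustration, not the paper's hardware: it uses a direct-mapped table for brevity (the paper's table is set-associative), and the entry count, confidence threshold, and MaxAVD bound are assumed values:

```c
/* Sketch of a tagged AVD prediction table: each entry holds a PC tag, the
   last observed AVD, and a saturating confidence counter, per the slide.
   All sizes/thresholds are illustrative assumptions. */
#include <stdint.h>
#include <string.h>

#define AVD_ENTRIES 16
#define CONF_MAX    3            /* 2-bit saturating counter */
#define CONF_THRESH 2            /* assumed confidence threshold */
#define MAXAVD      (64 * 1024)  /* assumed address-load filter bound */

typedef struct {
    uint64_t tag;        /* PC of the load */
    int64_t  last_avd;   /* last AVD seen for the load */
    int      confidence; /* confidence in the recorded AVD */
    int      valid;
} avd_slot;

typedef struct { avd_slot slot[AVD_ENTRIES]; } avd_table;

/* Update: called when an address load retires in normal mode. */
void avd_update(avd_table *t, uint64_t pc, uint64_t addr, uint64_t val) {
    int64_t avd = (int64_t)(addr - val);
    if (avd < -MAXAVD || avd > MAXAVD) return;   /* likely not an address load */
    avd_slot *s = &t->slot[pc % AVD_ENTRIES];
    if (!s->valid || s->tag != pc) {             /* allocate / replace entry */
        s->valid = 1; s->tag = pc; s->last_avd = avd; s->confidence = 0;
    } else if (s->last_avd == avd) {
        if (s->confidence < CONF_MAX) s->confidence++;
    } else {                                     /* AVD changed: retrain */
        s->last_avd = avd; s->confidence = 0;
    }
}

/* Query: called when a load misses in the L2 in runahead mode.
   Returns 1 and writes the predicted value only on a confident hit. */
int avd_query(const avd_table *t, uint64_t pc, uint64_t addr, uint64_t *pred) {
    const avd_slot *s = &t->slot[pc % AVD_ENTRIES];
    if (s->valid && s->tag == pc && s->confidence >= CONF_THRESH) {
        *pred = addr - (uint64_t)s->last_avd;
        return 1;
    }
    return 0;
}
```

Note how the recovery-free property shows up here: a wrong prediction needs no rollback path in the table, since runahead results are discarded anyway.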

Slide 14: AVD Update Logic [figure]

Slide 15: AVD Prediction Logic [figure]

Slide 16: Talk Outline
- Background on Runahead Execution
- The Problem: Dependent Cache Misses
- AVD Prediction
- Why Does It Work?
- Evaluation
- Conclusions

Slide 17: Why Do Stable AVDs Occur?
Regularity in the way data structures are allocated in memory AND traversed.
Two types of loads can have stable AVDs:
- Traversal address loads: produce addresses consumed by address loads.
- Leaf address loads: produce addresses consumed by data loads.

Slide 18: Traversal Address Loads
Regularly-allocated linked list: nodes at A, A+k, A+2k, A+3k, A+4k, A+5k, ...
A traversal address load loads the pointer to the next node: node = node->next
AVD = Effective Addr - Data Value

    Effective Addr | Data Value | AVD
    A              | A+k        | -k
    A+k            | A+2k       | -k
    A+2k           | A+3k       | -k
    A+3k           | A+4k       | -k
    A+4k           | A+5k       | -k

The data value strides, while the AVD stays stable.
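The table above can be reproduced concretely. The sketch below (illustrative names; an array simulates the regular allocator, so consecutive nodes are exactly sizeof(node_t) bytes apart) computes the AVD of the `node->next` load for several nodes and shows it is the constant offsetof(next) - k:

```c
/* Demo of the traversal-load claim: with nodes allocated k bytes apart,
   the AVD of `node = node->next` is constant even though the loaded
   values stride. An array stands in for a regular memory allocator. */
#include <stdint.h>
#include <stddef.h>

typedef struct node { long payload; struct node *next; } node_t;

/* AVD of one execution of the `node->next` load:
   effective address of the load minus the value it returns. */
static int64_t traversal_avd(node_t *node) {
    uint64_t eff_addr = (uint64_t)(uintptr_t)&node->next;
    uint64_t value    = (uint64_t)(uintptr_t)node->next;
    return (int64_t)(eff_addr - value);
}
```

With nodes spaced k = sizeof(node_t) apart, every execution yields AVD = offsetof(node_t, next) - k, matching the slide's "-k"-style constant.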

Slide 19: Properties of Traversal-based AVDs
- These stable AVDs can also be captured with a stride value predictor.
- Stable AVDs disappear with the re-organization of the data structure (e.g., sorting): afterwards, the distance between nodes is not constant.
- Stability of AVDs depends on the behavior of the memory allocator: allocation of contiguous, fixed-size chunks is useful.

Slide 20: Leaf Address Loads
Sorted dictionary in parser: nodes point to strings (words), and each string and its node are allocated consecutively (string at A, node at A+k; likewise C and C+k, and so on).
The dictionary is looked up for an input word. A leaf address load loads the pointer to the string of each node:

    lookup(node, input) {
        // ...
        ptr_str = node->string;
        m = check_match(ptr_str, input);
        if (m >= 0) lookup(node->right, input);
        if (m < 0)  lookup(node->left, input);
    }

AVD = Effective Addr - Data Value

    Effective Addr | Data Value | AVD
    A+k            | A          | k
    C+k            | C          | k
    F+k            | F          | k

The AVD is stable even though the data values have no stride.
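The string-then-node layout on this slide can also be demonstrated directly. In the sketch below (illustrative names; a fixed-size string slot is an assumed simplification of parser's allocator), each string is placed immediately before its node, so the AVD of the `node->string` load is the constant k no matter where the node sits in the tree:

```c
/* Demo of the leaf-load claim: string at A, node at A+k, so the AVD of
   `ptr_str = node->string` is the constant k for every node. A flat
   chunk simulates consecutive allocation of each (string, node) pair. */
#include <stdint.h>
#include <string.h>

enum { STR_BYTES = 16 };  /* fixed-size string slot: assumed simplification */

typedef struct dnode { char *string; struct dnode *left, *right; } dnode_t;

typedef struct { char str[STR_BYTES]; dnode_t node; } chunk_t;

/* Allocate one string and its node consecutively inside a chunk. */
static dnode_t *make_node(chunk_t *c, const char *word) {
    strncpy(c->str, word, STR_BYTES - 1);
    c->str[STR_BYTES - 1] = '\0';
    c->node.string = c->str;
    c->node.left = c->node.right = 0;
    return &c->node;
}

/* AVD of the leaf load `node->string`. */
static int64_t leaf_avd(dnode_t *node) {
    uint64_t eff_addr = (uint64_t)(uintptr_t)&node->string;
    uint64_t value    = (uint64_t)(uintptr_t)node->string;
    return (int64_t)(eff_addr - value);
}
```

Here k = STR_BYTES, and leaf_avd returns the same value for every node, which is exactly why a last-AVD-plus-confidence entry suffices where a stride predictor fails.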

Slide 21: Properties of Leaf-based AVDs
- These stable AVDs cannot be captured with a stride value predictor.
- Stable AVDs do not disappear with the re-organization of the data structure (e.g., sorting): the distance between a node and its string stays constant.
- Stability of AVDs depends on the behavior of the memory allocator.

Slide 22: Talk Outline
- Background on Runahead Execution
- The Problem: Dependent Cache Misses
- AVD Prediction
- Why Does It Work?
- Evaluation
- Conclusions

Slide 23: Baseline Processor
- Execution-driven Alpha simulator
- 8-wide superscalar processor
- 128-entry instruction window, 20-stage pipeline
- 64 KB, 4-way, 2-cycle L1 data and instruction caches
- 1 MB, 32-way, 10-cycle unified L2 cache
- 500-cycle minimum main memory latency
- 32 DRAM banks, 32-byte-wide processor-memory bus (4:1 frequency ratio), 128 outstanding misses (detailed memory model)
- Pointer-intensive benchmarks from Olden and SPEC INT 2000

Slide 24: Performance of AVD Prediction
[Chart] Average performance improvement: 12.1%.

Slide 25: Effect on Executed Instructions
[Chart] Average reduction in executed instructions: 13.3%.

Slide 26: AVD Prediction vs. Stride Value Prediction
Performance:
- Both can capture traversal address loads with stable AVDs (e.g., treeadd).
- Stride VP cannot capture leaf address loads with stable AVDs (e.g., health, mst, parser).
- The AVD predictor cannot capture data loads with striding data values; predicting these can be useful for the correct resolution of mispredicted L2-miss-dependent branches (e.g., parser).
Complexity:
- The AVD predictor requires far fewer entries (it tracks only address loads).
- AVD prediction logic is simpler (no stride maintenance).

Slide 27: AVD vs. Stride VP Performance
[Chart comparing 16-entry and 4096-entry predictors; speedups shown include 5.1%, 2.7%, 6.5%, 5.5%, 4.7%, and 8.6%.]

Slide 28: Conclusions
- Runahead execution is unable to parallelize dependent L2 cache misses.
- A very simple, 16-entry (102-byte) AVD predictor reduces this limitation on pointer-intensive applications: it increases runahead execution performance by 12.1% and reduces executed instructions by 13.3%.
- AVD prediction takes advantage of the regularity in the memory allocation patterns of programs.
- Software (programs, compilers, memory allocators) can be written to take advantage of AVD prediction.

Slide 29: Backup Slides

Slide 30: The Potential: What if it Could?
[Chart; values shown include 25% and 27%.]

Slide 31: Effect of Confidence Threshold [chart]

Slide 32: Effect of MaxAVD [chart]

Slide 33: Effect of Memory Latency
[Chart; values shown include 8%, 12.1%, 13.5%, 9.3%, and 13%.]