Korea Univ B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors 컴퓨터 · 전파통신공학과 2015020802 최병준 1 Computer Engineering and Systems Group.

Slides:

Advertisements

Similar presentations

Combining Statistical and Symbolic Simulation Mark Oskin Fred Chong and Matthew Farrens Dept. of Computer Science University of California at Davis.

Advertisements

PERFORMANCE ANALYSIS OF MULTIPLE THREADS/CORES USING THE ULTRASPARC T1 (NIAGARA) Unique Chips and Systems (UCAS-4) Dimitris Kaseridis & Lizy K. John The.

UPC Microarchitectural Techniques to Exploit Repetitive Computations and Values Carlos Molina Clemente LECTURA DE TESIS, (Barcelona,14 de Diciembre de.

Sim-alpha: A Validated, Execution-Driven Alpha Simulator Rajagopalan Desikan, Doug Burger, Stephen Keckler, Todd Austin.

Wrong Path Events and Their Application to Early Misprediction Detection and Recovery David N. Armstrong Hyesoon Kim Onur Mutlu Yale N. Patt University.

CHIMAERA: A High-Performance Architecture with a Tightly-Coupled Reconfigurable Functional Unit Kynan Fraser.

Accurately Approximating Superscalar Processor Performance from Traces Kiyeon Lee, Shayne Evans, and Sangyeun Cho Dept. of Computer Science University.

Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.

Teaching Old Caches New Tricks: Predictor Virtualization Andreas Moshovos Univ. of Toronto Ioana Burcea’s Thesis work Some parts joint with Stephen Somogyi.

CS752 Decoupled Architecture for Data Prefetching Jichuan Chang Kai Xu.

Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached Bohua Kou Jing gao.

Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

Glenn Reinman, Brad Calder, Department of Computer Science and Engineering, University of California San Diego and Todd Austin Department of Electrical.

WCED: June 7, 2003 Matt Ramsay, Chris Feucht, & Mikko Lipasti University of Wisconsin-MadisonSlide 1 of 26 Exploring Efficient SMT Branch Predictor Design.

1 Improving Branch Prediction by Dynamic Dataflow-based Identification of Correlation Branches from a Larger Global History CSE 340 Project Presentation.

2/15/2006"Software-Hardware Cooperative Memory Disambiguation", Alok Garg, HPCA Software-Hardware Cooperative Memory Disambiguation Ruke Huang, Alok.

Better Branch Prediction Through Prophet/Critic Hybrids A. Falcón, J. Stark, A. Ramirez, K. Lai, M. Valero Paper Presentation and Discussion.

EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.

Address-Value Delta (AVD) Prediction Onur Mutlu Hyesoon Kim Yale N. Patt.

Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.

Adaptive Cache Compression for High-Performance Processors Alaa R. Alameldeen and David A.Wood Computer Sciences Department, University of Wisconsin- Madison.

7/2/ _23 1 Pipelining ECE-445 Computer Organization Dr. Ron Hayne Electrical and Computer Engineering.

Code Coverage Testing Using Hardware Performance Monitoring Support Alex Shye, Matthew Iyer, Vijay Janapa Reddi and Daniel A. Connors University of Colorado.

ECE 510 Brendan Crowley Paper Review October 31, 2006.

Improving the Performance of Object-Oriented Languages with Dynamic Predication of Indirect Jumps José A. Joao *‡ Onur Mutlu ‡* Hyesoon Kim § Rishi Agarwal.

Analysis of Instruction-level Vulnerability to Dynamic Voltage and Temperature Variations ‡ Computer Science and Engineering, UC San Diego variability.org.

1 Presenter: Ming-Shiun Yang Sah, A., Balakrishnan, M., Panda, P.R. Design, Automation & Test in Europe Conference & Exhibition, DATE ‘09. A Generic.

Superscalar SMIPS Processor Andy Wright Leslie Maldonado.

Conference title1 A New Methodology for Studying Realistic Processors in Computer Science Degrees Crispín Gómez, María E. Gómez y Julio Sahuquillo DISCA.

DBMSs On A Modern Processor: Where Does Time Go? by A. Ailamaki, D.J. DeWitt, M.D. Hill, and D. Wood University of Wisconsin-Madison Computer Science Dept.

Semiconductor Memory 1970 Fairchild Size of a single core –i.e. 1 bit of magnetic core storage Holds 256 bits Non-destructive read Much faster than core.

CPU Cache Prefetching Timing Evaluations of Hardware Implementation Ravikiran Channagire & Ramandeep Buttar ECE7995 : Presentation.

1 The Performance Potential for Single Application Heterogeneous Systems Henry Wong* and Tor M. Aamodt § *University of Toronto § University of British.

Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor Mark Gebhart 1,2 Stephen W. Keckler 1,2 Brucek Khailany 2 Ronny Krashinsky.

1 of 20 Phase-based Cache Reconfiguration for a Highly-Configurable Two-Level Cache Hierarchy This work was supported by the U.S. National Science Foundation.

ACSAC’04 Choice Predictor for Free Mongkol Ekpanyapong Pinar Korkmaz Hsien-Hsin S. Lee School of Electrical and Computer Engineering Georgia Institute.

Electrical and Computer Engineering University of Wisconsin - Madison Prefetching Using a Global History Buffer Kyle J. Nesbit and James E. Smith.

Super computers Parallel Processing By Lecturer: Aisha Dawood.

1 Dynamic Branch Prediction. 2 Why do we want to predict branches? MIPS based pipeline – 1 instruction issued per cycle, branch hazard of 1 cycle. –Delayed.

Sampling Dead Block Prediction for Last-Level Caches

Performance Analysis of the Compaq ES40--An Overview Paper evaluates Compaq’s ES40 system, based on the Alpha Only concern is performance: no power.

Advanced Computer Architecture Lab University of Michigan Compiler Controlled Value Prediction with Branch Predictor Based Confidence Eric Larson Compiler.

Pipelining and Parallelism Mark Staveley

1/25 June 28 th, 2006 BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control BranchTap Improving Performance With.

CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1 (Niagara) Dimitris Kaseridis and Lizy K. John The University of Texas at Austin Laboratory for Computer.

1 CPRE 585 Term Review Performance evaluation, ISA design, dynamically scheduled pipeline, and memory hierarchy.

Hybrid Multi-Core Architecture for Boosting Single-Threaded Performance Presented by: Peyman Nov 2007.

Combining Software and Hardware Monitoring for Improved Power and Performance Tuning Eric Chi, A. Michael Salem, and R. Iris Bahar Brown University Division.

Sunpyo Hong, Hyesoon Kim

On the Importance of Optimizing the Configuration of Stream Prefetches Ilya Ganusov Martin Burtscher Computer Systems Laboratory Cornell University.

Application Domains for Fixed-Length Block Structured Architectures ACSAC-2001 Gold Coast, January 30, 2001 ACSAC-2001 Gold Coast, January 30, 2001.

Prophet/Critic Hybrid Branch Prediction B B B

Energy-Efficient Hardware Data Prefetching Yao Guo, Mahmoud Abdullah Bennaser and Csaba Andras Moritz.

Lecture 1: Introduction CprE 585 Advanced Computer Architecture, Fall 2004 Zhao Zhang.

PINTOS: An Execution Phase Based Optimization and Simulation Tool) PINTOS: An Execution Phase Based Optimization and Simulation Tool) Wei Hsu, Jinpyo Kim,

Computer Sciences Department University of Wisconsin-Madison

Reza Yazdani Albert Segura José-María Arnau Antonio González

Milad Hashemi, Onur Mutlu, and Yale N. Patt

5.2 Eleven Advanced Optimizations of Cache Performance

The University of Texas at Austin

Chang Joo Lee Hyesoon Kim* Onur Mutlu** Yale N. Patt

Accelerating Dependent Cache Misses with an Enhanced Memory Controller

Address-Value Delta (AVD) Prediction

Christophe Dubach, Timothy M. Jones and Michael F.P. O’Boyle

Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt

Lecture 10: Branch Prediction and Instruction Delivery

Patrick Akl and Andreas Moshovos AENAO Research Group

The University of Adelaide, School of Computer Science

Lois Orosa, Rodolfo Azevedo and Onur Mutlu

Stream-based Memory Specialization for General Purpose Processors

Presentation transcript:

Korea Univ B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors 컴퓨터 · 전파통신공학과 최병준 1 Computer Engineering and Systems Group (CESG), Department of Electrical and Computer Engineering, Texas A&M University -> Reena Panda, Paul V. Gratz Department of Computer Science, The University of Texas at San Antonio -> Daniel A. Jim´enez

Korea Univ 1. Introduction Modern computer architecture is beset by two opposing, conflicting trends  while technology scaling and deep pipelining have led to high processor frequencies, the memory access speed has not scaled accordingly  Meanwhile, power and energy considerations have revived interest in in-order processors 2

Korea Univ 1. Introduction In this paper we present a novel data cache prefetch scheme, leveraging both execution path speculation as well as effective address speculation, to efficiently improve performance of in-order processors. Much prior work focuses on reducing the impact of the memory-wall on processor performance 3

Korea Univ 1. Introduction In this paper we propose a light-weight prefetcher ‘B-Fetch’, a combined control-flow and effective address speculating prefetching scheme. B-Fetch leverages the high prediction accuracies of current-generation branch predictors, combined with novel effective address speculation. 4

Korea Univ 2. Background To be effective at masking such high latencies, a prefetcher must anticipate misses and issue prefetches significantly ahead of actual execution This requires accurate prediction of ~ 1. the likely memory instructions to be executed 2. the likely effective addresses of these instructions 5

Korea Univ 2. Background Program execution path is determined by the direction taken by the component control instructions The memory access behavior can therefore be linked to the prior control flow behavior 6

Korea Univ 2. Background 7 The program execution path is determined by direction taken by the relevant control instruction. Memory access behavior can therefore be linked to prior control flow behavior

Korea Univ 2. Background 8

Korea Univ 3. Proposed Design Pipeline Overview 1. Branch Lookahead 2. Register-Table Lookup 3. Prefetch Issue 9

Korea Univ 3. Proposed Design System Components  Path Confidence Estimator  Branch Trace Cache  Branch-Register Table  Prefetch Filtering 10

Korea Univ 3. Proposed Design System Components  Path Confidence Estimator  Branch Trace Cache  Branch-Register Table  Prefetch Filtering 11

Korea Univ 3. Proposed Design System Components  Path Confidence Estimator  Branch Trace Cache  Branch-Register Table  Prefetch Filtering 12

Korea Univ 3. Proposed Design System Components  Path Confidence Estimator  Branch Trace Cache  Branch-Register Table  Prefetch Filtering 13

Korea Univ 3. Proposed Design Hardware Cost  The table shows B-Fetch requires ∼ 33% of the table state required by SMS 14

Korea Univ 4. Evaluation Methodology  We evaluate our prefetcher in a simulation environment based on the M5 Simulator. The simulator is used to model a 1-wide, 5-stage in-order pipeline  The test workload consists of the 18 SPEC CPU2006 benchmarks that our simulation infrastructure supports, compiled for the ALPHA ISA 15

Korea Univ 4. Evaluation Prefetcher Performance  Figure 5 contains the IPC for the simple Stride, SMS and Bfetch prefetchers normalized against baseline  B-Fetch provides performance benefits across a range of applications, both integer and floating point 16

Korea Univ 4. Evaluation Prefetcher Performance  The results show the B-Fetch prefetcher provides a mean speedup of 39% (62%) across all (prefetch sensitive) benchmarks.  As compared to a stride prefetcher, B-Fetch improves the performance by over 25% (39.4%). Compared against SMS, B-Fetch improves the performance by 2.2% (3.4%), at the cost of ∼ 1/3 the overhead in storage 17

Korea Univ 5. Conclusions B-Fetch not only predicts effective addresses which display regular-access patterns, but also can take advantage of the dynamic values of the registers at runtime to predict irregular and isolated data accesses. The focus of this paper has been improving in-order processor performance, the B-Fetch scheme should perform comparably on superscalar processors, we plan to explore this in future work. 18

Korea Univ The End 19