Power Awareness through Selective Dynamically Optimized Traces Roni Rosner, Yoav Almog, Micha Moffie, Naftali Schwartz and Avi Mendelson – Intel Labs,

Presentation transcript:

Power Awareness through Selective Dynamically Optimized Traces
Roni Rosner, Yoav Almog, Micha Moffie, Naftali Schwartz and Avi Mendelson – Intel Labs, Haifa, Israel
Presenter: Ioana Burcea

Agenda
Motivation for PARROT = Power-Aware aRchitecture Running Optimized Traces
PARROT Concept and Architecture
Performance and Energy Results
Discussion
– What makes PARROT a power-aware architecture?
– What is new about this paper? / What are the contributions of this paper?

Motivation
We pay more energy per task
– Poor scaling of performance with power consumption
PARROT tries to change the balance
– Filtering Techniques to Improve Trace-Cache Efficiency (PACT 2001)
– Selecting Long Atomic Traces for High Coverage (ICS 2003)
– Specialized Dynamic Optimizations for High-Performance Energy-Efficient Microarchitecture (CGO 2004)

PARROT Concepts – The Big Picture
Based on the well-known cold/hot (10/90) paradigm
PARROT Principles
– Reuse: trace-cache centric
– Dynamic optimizations: more performance with less energy
– Focus: invest where it pays
– Pipeline decoupling: hybrid front-end, cold and hot execution pipelines
– Transparency: no impact on software compatibility

Traces and Trace Selection
Decoded atomic traces
– Complex retirement & recovery in case of misprediction
– More aggressive optimizations
Trace Selection – deterministic criteria
– Capacity limitation: 64 uops
– Complete basic blocks
– Terminating CTIs (control-transfer instructions): indirect jumps, software exceptions, backward taken branches
– Return instructions: procedure inlining
– Trace join
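
The selection criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' hardware algorithm: the `Block` type, the CTI labels and the `select_traces` function are hypothetical names, every basic block is assumed to fit within the 64-uop limit, and return inlining and trace join are not modeled.

```python
from dataclasses import dataclass
from typing import Iterable, List

MAX_UOPS = 64  # capacity limit stated on the slide

# CTI kinds that terminate a trace (labels are illustrative, not from the paper)
TERMINATING = {"indirect_jump", "sw_exception", "backward_taken"}


@dataclass
class Block:
    uops: int      # number of decoded uops in the basic block
    end_cti: str   # kind of control transfer that ends the block


def select_traces(blocks: Iterable[Block]) -> List[List[Block]]:
    """Group an executed stream of basic blocks into traces."""
    traces: List[List[Block]] = []
    current: List[Block] = []
    size = 0
    for b in blocks:
        # Only complete basic blocks are added: if b does not fit,
        # close the current trace and start a new one with b.
        if current and size + b.uops > MAX_UOPS:
            traces.append(current)
            current, size = [], 0
        current.append(b)
        size += b.uops
        # A terminating CTI always ends the trace.
        if b.end_cti in TERMINATING:
            traces.append(current)
            current, size = [], 0
    if current:
        traces.append(current)
    return traces


stream = [Block(30, "forward"), Block(30, "forward"), Block(10, "forward"),
          Block(5, "backward_taken"), Block(8, "forward")]
trace_sizes = [sum(b.uops for b in t) for t in select_traces(stream)]
```

Because the rules depend only on the executed block sequence, the same path always yields the same trace boundaries, which is one reading of "deterministic criteria".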

Microarchitecture
Split-execution vs. unified-execution
– Foreground phase: fetch-to-execution pipeline
– Background phase (post-processing): trace selection and optimization

Microarchitecture (cont’d)
Two predictors (GHR = Global History Register)
– Branch predictor
– Trace predictor
Deterministic trace build scheme
Filtering mechanisms
– The hot filter selects frequent traces from those executed on the cold pipeline
– The blazing filter selects the hottest traces for optimization
Dynamic optimizations
– Generic and core-specific optimizations
– Gradually applied (?)
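
One way to picture the two filters is as a pair of execution counters with promotion thresholds. The sketch below is an assumption-laden illustration: the slides give neither the counter organization nor the threshold values, and `TraceFilters`, `hot_threshold` and `blazing_threshold` are hypothetical names.

```python
from collections import Counter


class TraceFilters:
    """Two-stage promotion: cold -> hot (trace cache) -> optimized."""

    def __init__(self, hot_threshold: int = 4, blazing_threshold: int = 16):
        # Both thresholds are hypothetical; the slides do not give values.
        self.hot_threshold = hot_threshold
        self.blazing_threshold = blazing_threshold
        self.cold_counts = Counter()  # executions seen on the cold pipeline
        self.hot_counts = Counter()   # executions seen on the hot pipeline
        self.hot = set()              # traces admitted by the hot filter
        self.optimized = set()        # traces admitted by the blazing filter

    def execute(self, trace_id: str) -> str:
        """Record one execution and return the state that served it."""
        if trace_id in self.optimized:
            return "optimized"
        if trace_id in self.hot:
            self.hot_counts[trace_id] += 1
            if self.hot_counts[trace_id] >= self.blazing_threshold:
                self.optimized.add(trace_id)  # hand over to the optimizer
            return "hot"
        self.cold_counts[trace_id] += 1
        if self.cold_counts[trace_id] >= self.hot_threshold:
            self.hot.add(trace_id)  # frequent enough to build and cache
        return "cold"


filters = TraceFilters(hot_threshold=2, blazing_threshold=3)
history = [filters.execute("t0") for _ in range(6)]
```

The point of the two stages is that optimization effort is spent only on traces that have already proven frequent twice, in line with the "invest where it pays" principle.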

Simulation framework
An “in-house” proprietary performance and power simulator
Optimizations applied as different passes
– Optimization delay for one trace ~ 100 cycles
Energy simulation
– Power consumption matrix for each operation on each hardware unit
– Leakage: uniform in space over the processor core and L2 cache, and uniform in time, modeling a high temperature
LE = PMAX * (0.05 * M + 0.4 * K) * CYC
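
The leakage formula can be evaluated directly, as in the sketch below. The slide does not define M and K, so they are kept as opaque parameters here; the function name and the sample values are hypothetical, and the result is in units of PMAX times cycles.

```python
def leakage_energy(p_max: float, m: float, k: float, cycles: float) -> float:
    """LE = PMAX * (0.05 * M + 0.4 * K) * CYC, as written on the slide.

    m and k are the slide's M and K; their meaning is not defined there,
    so this function treats them as plain scale factors.
    """
    return p_max * (0.05 * m + 0.4 * k) * cycles


# Sample values are made up for illustration only.
le = leakage_energy(p_max=100.0, m=1.0, k=1.0, cycles=1000)
```

Since the model is uniform in time, leakage energy grows linearly with the cycle count, so a configuration that finishes sooner also leaks less.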

Configuration Space

Experimental Evaluation
Metrics
– IPC
– Total energy
– Cubic-MIPS-per-Watt (CMPW): a measure of the design tradeoffs between power and performance
Benchmarks
– SPECint2000
– SPECfp2000
– Office
– Multimedia
– DotNet
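
As a worked illustration of the CMPW metric, the sketch below computes MIPS cubed divided by average power. The function name, its inputs (instruction count, run time, total energy) and the sample numbers are assumptions chosen for illustration, not values from the paper.

```python
def cmpw(instructions: float, seconds: float, joules: float) -> float:
    """Cubic-MIPS-per-Watt: MIPS**3 / average power in watts."""
    mips = instructions / seconds / 1e6
    watts = joules / seconds
    return mips ** 3 / watts


# Hypothetical run: 2e9 instructions in 1 s using 50 J (2000 MIPS at 50 W).
baseline = cmpw(2e9, 1.0, 50.0)
# Same work done 10% faster with the same total energy.
faster = cmpw(2e9, 1.0 / 1.1, 50.0)
```

Cubing MIPS weights performance heavily, so a design can spend somewhat more power and still score higher as long as the speedup is large enough.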

Performance and Power Awareness

Extreme Microarchitectural Alternatives

Hot Code Predictability

Trace-cache Fetch Coverage

Optimizer Capabilities

Energy Breakdown

Their Conclusions…

Our Conclusions
What makes PARROT a power-aware architecture?
What is new about this paper? / What are the contributions of this paper?
– rePLay (?)