Combining Software and Hardware Monitoring for Improved Power and Performance Tuning Eric Chi, A. Michael Salem, and R. Iris Bahar Brown University Division.

Presentation transcript:

Combining Software and Hardware Monitoring for Improved Power and Performance Tuning
Eric Chi, A. Michael Salem, and R. Iris Bahar (Brown University, Division of Engineering)
Richard Weiss (Hampshire College, School of Cognitive Science)
BROWN UNIVERSITY, BARC, January 30, 2003

Motivation
- Performance drives high-end processor design
  - Include many complex architectural features
  - Resources may not always be optimally utilized
- Resources dissipate some power regardless of utilization
  - Dynamic schemes allow processor to reconfigure resources according to program’s needs
  - Some means of monitoring program is needed to drive reconfiguration

Monitoring Options
Hardware monitoring
- Relatively easy to implement
- Can easily adjust to changing patterns
- Must first recognize pattern before reacting
- Restricted to fixed-sized sampling windows
Software profiling
- Reconfiguration occurs in anticipation of changing needs
- Sampling ranges are adaptable
- Requires instruction annotation and initial sampling overhead
- Only applicable to instructions with very deterministic behavior

Why Not Combine?
- Each has its particular benefits
- If hardware and software techniques can be combined, can we improve the control policies driving processor reconfiguration?
- Could potentially lead to better energy savings and higher overall performance

Our Goal
- Have HW and SW profiling work together to better identify program behavior
  - Allow processor to react more quickly to strongly deterministic behavior
  - Allow HW monitoring to assist with hard-to-predict cases with hints from software profiling

Low Power Configurations
We consider 2 different configurations separately:
- Reducing issue width and ALUs
  - Save power in issue queue arbitration logic
  - Save power from underutilized ALUs
- Fetch Halting
  - Triggered by a critical load missing to main memory
  - Fetching is disabled for the duration of the miss
  - Reduces occupancy rates in fetch and issue queues
  - Reduces number of wrong path instructions fetched
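The fetch halting rule on this slide (stop fetching while a critical load is missing to main memory) can be sketched as a small gate. This is only an illustration: the class name `FetchGate`, its method names, and the choice to keep fetch halted while any critical miss is still outstanding are assumptions, not details from the slides.

```python
class FetchGate:
    """Hypothetical fetch-halting gate; not the authors' implementation."""

    def __init__(self):
        # Ids of critical loads currently waiting on main memory.
        self.outstanding = set()

    def on_critical_miss(self, load_id):
        # A critical load missed to main memory: fetch halts from now on.
        self.outstanding.add(load_id)

    def on_miss_return(self, load_id):
        # Data came back; discard() tolerates ids that were never recorded.
        self.outstanding.discard(load_id)

    def fetch_enabled(self):
        # Assumption: fetch resumes only once every tracked miss has returned.
        return not self.outstanding
```

A simulator would consult `fetch_enabled()` once per cycle before fetching new instructions.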

Pipeline Organization
[Figure: pipeline block diagram. Components: Annotation Decoder, Branch Predictor, Fetch Unit, Instruction Cache, Instruction Decoder, Instruction Scheduler, Register File, Integer ALU Clusters 1 and 2, Floating Point ALU Clusters 1 and 2, Load/Store Unit, Data Cache, Low-Power State Logic. Control signals from the Low-Power State Logic: "Disable Fetch Unit" and "Disable auxiliary ALU cluster and reduce issue width".]

Adjusting Issue Width
- Adjust issue width between 8 and 4 and disable second integer ALU cluster
- SW approach profiles IPC from train dataset
  - Annotates blocks with low IPC
  - Decoding start of block triggers entry to LP mode
- HW approach uses built-in counters to monitor IPC
  - Use fixed 256-cycle window
  - If integer IPC < threshold, enter LP mode
- Combined approach
  - SW steers blocks with consistent behavior
  - HW handles remaining blocks
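The combined policy on this slide can be sketched roughly as follows. The 256-cycle window and the 8/4 issue widths come from the slide; everything else is assumed for illustration: the class name `IssueWidthController`, the default threshold of 1.0 (the slides give no value), the `sw_hint` encoding, and the rule that a software annotation takes precedence over the hardware counter.

```python
WINDOW_CYCLES = 256   # fixed sampling window (from the slide)
WIDE, NARROW = 8, 4   # issue widths (from the slide)


class IssueWidthController:
    """Hypothetical combined SW/HW issue-width policy."""

    def __init__(self, ipc_threshold=1.0):
        self.ipc_threshold = ipc_threshold  # assumed value
        self.cycles = 0
        self.int_insts = 0
        self.hw_low_power = False  # decision from the last full window
        self.sw_hint = None        # None means the block is not annotated

    def start_block(self, sw_hint):
        # sw_hint: True (profiled low IPC), False (profiled high IPC),
        # or None when software left the block to the hardware monitor.
        self.sw_hint = sw_hint

    def tick(self, int_insts_this_cycle):
        # Hardware counters: accumulate integer instructions per window.
        self.cycles += 1
        self.int_insts += int_insts_this_cycle
        if self.cycles == WINDOW_CYCLES:
            ipc = self.int_insts / WINDOW_CYCLES
            self.hw_low_power = ipc < self.ipc_threshold
            self.cycles = 0
            self.int_insts = 0

    def issue_width(self):
        # Assumption: annotated blocks follow SW; the rest follow HW.
        if self.sw_hint is not None:
            low_power = self.sw_hint
        else:
            low_power = self.hw_low_power
        return NARROW if low_power else WIDE
```

Note that the hardware decision only changes at window boundaries, which mirrors the "must first recognize pattern before reacting" drawback of hardware monitoring, while a software hint takes effect as soon as the block starts.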

Results for Reduced Issue Width
- SW and HW results are comparable
- Combined (COMB) results show that the SW and HW methods identify different opportunities for saving power

Results for Reduced Issue Width
- SW performance is more consistent because thresholds can be tuned on a per-application basis

Fetch Halting
- Requires a combination of SW and HW monitoring
- SW profiling:
  - Identify critical loads that miss to main memory
  - IPC, occupancy rates, dead cycles, “miss stride”
- HW monitoring:
  - Using annotations from SW profiling, HW tracks miss behavior only for “promising” load instructions
  - Miss stride from annotations is compared to miss counter in HW to capture dynamic miss behavior
- For now we simulate a perfect miss-predictor
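One way to picture the comparison of an annotated miss stride against a hardware miss counter is sketched below. The slides do not define "miss stride"; this sketch assumes it means the profiled number of executions of a load between successive misses to main memory. The class `MissStrideTracker` and its methods are hypothetical, and the slides state that results so far use a simulated perfect miss predictor rather than a scheme like this.

```python
class MissStrideTracker:
    """Hypothetical per-load miss predictor driven by SW annotations."""

    def __init__(self):
        self.stride = {}  # load PC -> annotated miss stride (assumed meaning)
        self.count = {}   # load PC -> executions since last predicted miss

    def annotate(self, pc, miss_stride):
        # Software profiling marks this load as "promising".
        self.stride[pc] = miss_stride
        self.count[pc] = 0

    def predict_miss(self, pc):
        # Called each time the load at `pc` executes.
        if pc not in self.stride:
            return False  # unannotated loads are not tracked in hardware
        self.count[pc] += 1
        if self.count[pc] >= self.stride[pc]:
            self.count[pc] = 0
            return True   # counter reached the stride: predict a miss
        return False
```

Restricting the table to annotated loads reflects the slide's point that hardware tracks miss behavior only for loads that profiling flagged.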

Fetch Halting Potential
Memory access rates show that the fetch halting potential varies across benchmarks.

Benchmark   % DL1 miss   % L2 miss   % mem access
mgrid       3.9%         22.8%       0.9%
vpr         4.5%         24.7%       1.1%
gcc         0.5%         12.8%       0.1%
mcf         23.8%        48.0%       11.4%
twolf       6.4%         20.1%       1.3%
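The slide does not say how the "% mem access" column is derived, but the figures are consistent with multiplying the DL1 miss rate by the L2 miss rate (treating the L2 figure as a local miss rate, which is an assumption). The short check below reproduces the column from the other two; the function name `mem_access_pct` is made up for this example.

```python
# (DL1 miss %, L2 miss %, reported mem access %) taken from the table.
RATES = {
    "mgrid": (3.9, 22.8, 0.9),
    "vpr": (4.5, 24.7, 1.1),
    "gcc": (0.5, 12.8, 0.1),
    "mcf": (23.8, 48.0, 11.4),
    "twolf": (6.4, 20.1, 1.3),
}


def mem_access_pct(dl1_miss_pct, l2_miss_pct):
    # Fraction of accesses that miss in both cache levels, as a percentage.
    return dl1_miss_pct * l2_miss_pct / 100.0


for name, (dl1, l2, reported) in RATES.items():
    computed = mem_access_pct(dl1, l2)
    print(f"{name:6s} computed={computed:.2f}% reported={reported}%")
```

Under this reading, mcf sends roughly one access in nine to main memory, far more than the other benchmarks, which is why its fetch halting potential stands out.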

Results for Fetch Halting
- Restricting fetch halting based on criticality information benefits performance

Fetch Halting and RUU Occupancy
- Perfect + crit results in an average 10% drop in RUU occupancy

Conclusions and Future Work
- HW and SW predict different low power events and can be combined, offering greater power saving potential
- Future work:
  - Improve HW/SW combination scheme
  - Improve criticality predictor
  - Currently working on HW miss predictor
  - Adjust the halt period