Bypass Aware Instruction Scheduling for Register File Power Reduction. Sanghyun Park, Aviral Shrivastava, Nikil Dutt, Alex Nicolau, Yunheung Paek, Eugene Earlie.

Presentation transcript:

Bypass Aware Instruction Scheduling for Register File Power Reduction
Sanghyun Park, Aviral Shrivastava, Nikil Dutt, Alex Nicolau, Yunheung Paek, Eugene Earlie
Published: Proceedings of the 2006 LCTES Conference
Session: Low power issues
Presented by Saleel Kudchadker

Processor Power
- Power is now a primary architectural concern
  - Processor power consumption doubles with Pentium generations
- High power consumption
  - Increases packaging/cooling cost
  - Limits achievable performance
- Important for handheld embedded devices
  - Battery life
  - Weight
[Figure: "Managing the Impact of Increasing…": cost of removing heat from a microprocessor with increasing power consumption. Source: Intel website]

Power Density
- Power density = power / area
- Silicon is a poor heat conductor
- Areas with high power density become hot
- Leakage current in transistors increases as temperature rises
- Important to distribute power over the die
- "Heat Stroke": the processor has to stop if any part of the die exceeds the critical temperature

Register File Power
- The register file (RF) is a significant source of power dissipation
  - Motorola M.CORE: approx. 16% of processor power
  - RF may consume up to 25% of processor power
- High RF power density
  - Small size causes hotspots
  - e.g., Alpha 21264, Intel Pentium
- Trend: increasing RF power due to
  - Microarchitectural enhancements to improve IPC
  - Compiler techniques to improve IPC
  - Large register files (esp. VLIW processors)

Reducing RF Power: Related Work
Three ways to reduce RF power:
1. Reduce energy per access to RF
2. Reduce number of registers in RF
3. Reduce number of accesses to RF

"On-Demand RF Read"
- Existing processors anticipatorily read the RF
  - e.g., Pentium 4, Alpha
- SpecInt95 running on MIPS II
  - 36% of operands come from bypasses
- 8-issue SimpleScalar running SpecInt2K
  - 50-70% of operands come from bypasses
- Read from the RF only if necessary
  - First find out if the value is present in the bypasses
  - If not, then read the value from the RF
  - We'll call this "On-Demand RF Read"
- When applied to an Intel XScale model
  - 58% energy reduction
  - < 3% performance loss
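The difference between anticipatory and on-demand reads can be illustrated with a small Python sketch. This is a toy model, not the XScale hardware: the `RegisterFile` class, the `bypass` dictionary, and both fetch functions are hypothetical names introduced here only to show where the RF access is saved.

```python
class RegisterFile:
    """Toy register file that counts how many times it is read."""

    def __init__(self, values):
        self.values = dict(values)
        self.reads = 0

    def read(self, reg):
        self.reads += 1
        return self.values[reg]


def fetch_operand_anticipatory(reg, bypass, rf):
    """Always read the RF, then prefer the bypassed value if one exists."""
    rf_value = rf.read(reg)  # RF access happens even when it is not needed
    return bypass.get(reg, rf_value)


def fetch_operand_on_demand(reg, bypass, rf):
    """Check the bypasses first; touch the RF only on a bypass miss."""
    if reg in bypass:
        return bypass[reg]
    return rf.read(reg)


# Usage: R1 is still in flight on a bypass, R2 lives only in the RF.
rf = RegisterFile({"R1": 5, "R2": 7})
bypass = {"R1": 9}
fetch_operand_on_demand("R1", bypass, rf)  # served by the bypass, no RF read
fetch_operand_on_demand("R2", bypass, rf)  # bypass miss, one RF read
```

In this model the on-demand path performs an RF read only for operands that no bypass can supply, which is the saving the slide describes.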

Processor Model
- Pipeline bypasses
  - Improve performance
- Full bypassing
  - Best performance, but high power, area & wiring complexity
- Partial bypassing
  - Keep only some bypasses
  - Popular in embedded processors, e.g., Intel XScale

Bypass-sensitive RF Power-Aware Scheduling
- Schedule instructions so that
  - Dependent instructions transfer operands using bypasses
  - RF usage is reduced
- The compiler needs to know
  - When does an instruction bypass its result?
  - Which operands can read the result?
  - When is the result written into the register file?
- A bypass-aware compiler is needed!

No bypass:
  ADD R1 R2 R3
  ADD R10 R11 R12
  SUB R4 R5 R1

Bypass possible:
  ADD R1 R2 R3
  SUB R4 R5 R1
  ADD R10 R11 R12
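The two orderings on this slide can be compared with a short Python sketch that counts RF reads under an on-demand read policy. It assumes a deliberately simple bypass model that is not taken from the paper: a result can be forwarded only to the instruction issued immediately after its producer (`window=1`), and the first register of each instruction is its destination. The names `Instr` and `count_rf_reads` are hypothetical.

```python
from collections import namedtuple

# dest is the register written, srcs the registers read.
Instr = namedtuple("Instr", "op dest srcs")


def count_rf_reads(schedule, window=1):
    """Count source operands that miss the bypasses and must come from the RF.

    Assumption: the results of the previous `window` instructions are
    available on a bypass; every other operand costs one RF read.
    """
    reads = 0
    for i, ins in enumerate(schedule):
        recent = {p.dest for p in schedule[max(0, i - window):i]}
        reads += sum(1 for src in ins.srcs if src not in recent)
    return reads


add1 = Instr("ADD", "R1", ("R2", "R3"))
add2 = Instr("ADD", "R10", ("R11", "R12"))
sub = Instr("SUB", "R4", ("R5", "R1"))

print(count_rf_reads([add1, add2, sub]))  # 6: R1 is read from the RF
print(count_rf_reads([add1, sub, add2]))  # 5: R1 arrives over the bypass
```

Moving SUB directly behind the ADD that produces R1 removes one RF read in this model without changing the program's result, which is the effect a bypass-aware scheduler tries to exploit.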

OT-based RF Power-Aware Scheduling
- Operation Tables (OTs) provide a mechanism
  - To accurately estimate the number of operands read from the RF
- Exploit OTs for scheduling to reduce RF usage
- Various scheduling strategies can be employed
  - Choose the scheduling heuristic with the least RF usage
- 3 basic block (BB) scheduling techniques
  1. RFPEX: Exhaustive
  2. RFPN: Greedy
  3. RFPN2: Greedy with one level of backtracking

Experimental Setup
- Intel XScale
  - 7-stage, partially bypassed
  - On-Demand RF Read architecture
- RF power model
  - RF power = # register file accesses
- MiBench benchmarks
- Scheduler
  - Operation Table-based
  - RF power-aware scheduling
  - Within basic block
  - Tried 3 strategies
- RF power results
  - Compare with On-Demand RF Read
[Figure: experimental flow with the labels Application, GCC -O3, Assembly, OT-based Scheduler, GCC linker, Executable, Runtime RF Reads]

1. RFPEX Scheduling
- Exhaustive
  - Try all legal permutations of instructions
- Compilation time
  - Hours
  - Could not schedule susan, rijndael (2 days)
- RF power reduction
  - Average 12%
- Performance improvement
  - Average 1.4%

2. RFPN Scheduling
- Greedy
  - Pick instructions one by one
  - Pick the instruction which gets the most operands from bypasses
- Compilation time
  - Seconds
- RF power reduction
  - Average 6%
- Performance improvement
  - Average: -3.5%

3. RFPN2 Scheduling
- Greedy with OP table comparison
- Compilation time
  - Minutes
- RF power reduction
  - Average 10.5%
- Performance improvement
  - Average: -2%
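The exhaustive and greedy strategies can be sketched on a toy basic block in Python. This is a simplified illustration, not the authors' Operation Table-based implementation: it assumes a result is bypassable only to the next instruction (`window=1`), models RF power as the number of RF reads, and treats any read-after-write, write-after-read, or write-after-write pair as an ordering constraint. All function names are hypothetical.

```python
from collections import namedtuple
from itertools import permutations

Instr = namedtuple("Instr", "name dest srcs")


def rf_reads(schedule, window=1):
    """RF power proxy: operands not supplied by the last `window` results."""
    reads = 0
    for i, ins in enumerate(schedule):
        recent = {p.dest for p in schedule[max(0, i - window):i]}
        reads += sum(1 for s in ins.srcs if s not in recent)
    return reads


def predecessors(block):
    """preds[j] = indices of earlier instructions that j must stay after."""
    preds = [set() for _ in block]
    for i, a in enumerate(block):
        for j in range(i + 1, len(block)):
            b = block[j]
            if a.dest in b.srcs or b.dest in a.srcs or a.dest == b.dest:
                preds[j].add(i)
    return preds


def schedule_exhaustive(block, window=1):
    """Try every legal permutation and keep the one with the fewest RF reads."""
    preds = predecessors(block)
    best = None
    for order in permutations(range(len(block))):
        pos = {idx: p for p, idx in enumerate(order)}
        if not all(pos[p] < pos[j] for j in range(len(block)) for p in preds[j]):
            continue  # violates a dependence
        cand = [block[k] for k in order]
        if best is None or rf_reads(cand, window) < rf_reads(best, window):
            best = cand
    return best


def schedule_greedy(block, window=1):
    """Pick, one by one, the ready instruction with the most bypassed operands."""
    preds = predecessors(block)
    done, sched = set(), []
    while len(sched) < len(block):
        ready = [j for j in range(len(block)) if j not in done and preds[j] <= done]
        recent = {p.dest for p in sched[max(0, len(sched) - window):]}
        # Ties go to the instruction that comes first in program order.
        pick = max(ready, key=lambda j: sum(s in recent for s in block[j].srcs))
        done.add(pick)
        sched.append(block[pick])
    return sched


block = [
    Instr("add1", "R1", ("R2", "R3")),
    Instr("add2", "R10", ("R11", "R12")),
    Instr("sub", "R4", ("R5", "R1")),
]
print([i.name for i in schedule_greedy(block)])  # ['add1', 'sub', 'add2']
```

The exhaustive search visits up to n! orderings of an n-instruction block, which is consistent with the hours-long compilation times reported above, while the greedy pass makes one choice per instruction and can miss orderings that only pay off later.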

Summary
- The register file is one of the main hotspots in processors
  - Very important to reduce RF power
  - Repeated accesses cause "Heat Stroke"
  - Up to 90% performance degradation
- On-Demand RF Read is an effective technique
  - 58% RF power reduction
  - Scope for further RF power reduction via instruction scheduling
- Contribution: instruction scheduling technique for further RF power reduction
  - Up to 26%, average 12% RF power reduction
  - 2% performance degradation
  - Over and above the On-Demand RF Read architecture
- RFPN2 is an effective heuristic for RF power reduction

Future Work
- Beyond basic block scheduling

Our Project
- Our class project focuses on reducing power consumption using power-aware instruction scheduling or the value lifetime characteristics of registers
- The paper on value lifetime characteristics will be presented by Pradyanesh