Dynamic Register File Resizing and Frequency Scaling to Improve Embedded Processor Performance and Energy-Delay Efficiency
Houman Homayoun, Sudeep Pasricha, Mohammad Makhzan, Alex Veidenbaum
Center for Embedded Computer Systems, University of California, Irvine

INTRODUCTION
Technology scaling into the ultra-deep submicron regime allows hundreds of millions of gates to be integrated onto a single chip. Designers therefore have ample silicon budget to add processor resources that exploit application parallelism and improve performance. However, the power budget and the practically achievable operating clock frequency are limiting factors. In particular, increasing the register file (RF) size increases its access time, which lowers the processor frequency. Dynamically resizing the RF in tandem with dynamic frequency scaling (DFS) can significantly improve performance.

MOTIVATION FOR INCREASING RF SIZE
After a long-latency L2 cache miss the processor executes some independent instructions but eventually stalls: one of the ROB, IQ, RF, or LQ/SQ fills up, and the processor cannot proceed until the miss is serviced. With larger resources it is less likely that these structures fill up completely during the L2 miss service time, which can improve performance (see the sketch below). The resources have to be scaled up together; otherwise the non-scaled structures become the bottleneck. [Figure: frequency of stalls due to L2 cache misses in the PowerPC 750FX architecture]
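A minimal sketch of this behavior, with hypothetical structure sizes and dispatch rates (none of these numbers come from the paper): during a miss, in-flight instructions pile up in every back-end structure until one of them is full.

```python
# Toy model (hypothetical numbers): during an L2 miss the front end keeps
# dispatching until the ROB, RF, IQ, or LQ/SQ fills, then the core stalls.
def cycles_until_stall(rob, rf, iq, lsq, dispatch_width=2, mem_ratio=0.3,
                       miss_latency=300):
    """Return how many of the miss-service cycles the core stays busy."""
    used = {"rob": 0, "rf": 0, "iq": 0, "lsq": 0}
    cap = {"rob": rob, "rf": rf, "iq": iq, "lsq": lsq}
    for cycle in range(miss_latency):
        # Instructions dependent on the miss accumulate in every structure.
        used["rob"] += dispatch_width
        used["rf"] += dispatch_width
        used["iq"] += dispatch_width
        used["lsq"] += dispatch_width * mem_ratio
        if any(used[k] >= cap[k] for k in cap):
            return cycle + 1          # a structure filled -> processor stalls
    return miss_latency               # miss latency fully hidden

print(cycles_until_stall(rob=32, rf=48, iq=16, lsq=16))   # small configuration
print(cycles_until_stall(rob=64, rf=96, iq=32, lsq=32))   # scaled-up configuration
```

The second call covers more of the miss latency only because every structure was scaled up together, matching the point above that a single non-scaled resource becomes the bottleneck.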

IMPACT OF INCREASING RF SIZE
Increasing the size of the RF (as well as the ROB, LQ, and IQ) can potentially improve performance by reducing the occurrence of idle periods, but it also has a critical impact on the achievable operating frequency. The RF access time often decides the maximum achievable clock frequency, and the bitline delay grows significantly as the RF size increases. [Figure: breakdown of RF component delay with increasing size]

ANALYSIS OF RF COMPONENT ACCESS DELAY
The equivalent capacitance on a bitline is Ceq = N × Cdiff + Cwire, where N is the total number of rows, Cdiff is the diffusion capacitance of one pass transistor, and Cwire is the wire capacitance (typically about 10% of the total diffusion capacitance). As the number of rows increases, the equivalent bitline capacitance increases with it, and therefore the propagation delay increases (see the sketch below). [Figure: reduction in clock frequency with increasing resource size]
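A back-of-the-envelope sketch of how the bitline delay scales with the number of rows; the capacitance and resistance values are illustrative placeholders, not the paper's circuit parameters.

```python
# Illustrative bitline-delay scaling (placeholder values, not the paper's data).
C_DIFF_FF = 1.5      # assumed diffusion capacitance per pass transistor (fF)
R_DRIVER_KOHM = 5.0  # assumed effective driver / pull-down resistance (kOhm)

def bitline_delay_ps(num_rows):
    """RC estimate of bitline delay: Ceq = N*Cdiff + Cwire (~10% of N*Cdiff)."""
    c_eq_ff = num_rows * C_DIFF_FF * 1.10      # add ~10% wire capacitance
    return R_DRIVER_KOHM * c_eq_ff             # kOhm * fF = ps

for n in (64, 96, 128, 192):
    print(f"{n:4d} rows -> ~{bitline_delay_ps(n):6.1f} ps bitline delay")
```

The delay grows roughly linearly with N, which is why the RF (the structure read every cycle on the critical path) ends up limiting the maximum clock frequency.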

STATIC REGISTER FILE SIZING
[Figures: IPC for different configurations; relative idle time (processor stalls due to L2 cache misses) for different configurations]
Increasing the RF size statically:
- increases IPC,
- reduces the relative idle time spent stalled on L2 cache misses,
- reduces the maximum achievable operating clock frequency.

IMPACT ON EXECUTION TIME
Despite the IPC gains, execution time increases with larger (statically sized) resources. [Figure: normalized execution time for the different configurations, at the reduced operating frequency, compared to the baseline architecture] In the trade-off between larger resources (fewer idle periods) and a lower clock frequency, the clock frequency dominates and largely decides the execution time, as the worked example below illustrates.
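A small worked example of this trade-off; the IPC and frequency numbers are illustrative, not the paper's measurements.

```python
# Execution time T = instruction_count / (IPC * f).
# Illustrative numbers only: a 10% IPC gain with a 15% frequency drop loses time.
INSTRUCTIONS = 1e9

def exec_time_s(ipc, freq_hz):
    return INSTRUCTIONS / (ipc * freq_hz)

baseline  = exec_time_s(ipc=1.00, freq_hz=500e6)   # small RF, full clock
larger_rf = exec_time_s(ipc=1.10, freq_hz=425e6)   # bigger RF, slower clock
print(f"baseline : {baseline:.3f} s")
print(f"larger RF: {larger_rf:.3f} s "
      f"({(larger_rf / baseline - 1) * 100:+.1f}% execution time)")
```

Even with 10% higher IPC, the slower clock makes the larger static configuration about 7% slower overall, which is the behavior reported for the statically resized configurations.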

DYNAMIC REGISTER FILE RESIZING
Dynamic RF scaling driven by L2 cache misses lets the processor use a small RF, with its lower access time, while no L2 miss is pending (the normal period), and a larger RF, at the cost of a higher access time, during an L2 miss period. To keep RF access within one cycle, the operating clock frequency is reduced whenever the RF is scaled up. DFS must be applied quickly, otherwise the transition overhead erodes the performance benefit, so a PLL architecture with minimal transition delay is needed. The studied processor (IBM PowerPC 750) uses a dual-PLL architecture that allows fast DFS with effectively zero latency. A sketch of the resulting control policy follows.
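A minimal sketch of the L2-miss-driven resize/DFS policy as described above; the class, method names, and frequency values are assumptions for illustration, not the paper's implementation.

```python
# Sketch of the L2-miss-driven control policy (assumed names and frequencies).
class L2MRFSController:
    def __init__(self, fast_hz=500e6, slow_hz=450e6):
        self.fast_hz, self.slow_hz = fast_hz, slow_hz
        self.pending_misses = 0
        self.rf_upsized = False

    def on_l2_miss(self):
        self.pending_misses += 1
        if not self.rf_upsized:
            self.set_frequency(self.slow_hz)   # dual PLL: switch is effectively instant
            self.enable_upper_rf_segment()     # un-gate and precharge the upper half
            self.rf_upsized = True

    def on_l2_miss_return(self, upper_segment_empty):
        self.pending_misses -= 1
        # Downsize only once no miss is pending and the upper segment has drained.
        if self.pending_misses == 0 and upper_segment_empty and self.rf_upsized:
            self.disable_upper_rf_segment()    # power-gate the upper half again
            self.set_frequency(self.fast_hz)
            self.rf_upsized = False

    # Hardware hooks, stubbed out for this sketch.
    def set_frequency(self, hz): ...
    def enable_upper_rf_segment(self): ...
    def disable_upper_rf_segment(self): ...
```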

CIRCUIT MODIFICATION
The challenge is to design the RF so that its access time can be controlled dynamically. Among all RF components, the bitline delay is responsible for the majority of the access-time increase, so the modification dynamically adjusts the bitline load. [Figure: proposed circuit modification for the RF]

L2 MISS DRIVEN RF SCALING (L2MRFS)
[Figure: proposed circuit modification for the RF]
Normal period: the upper segment is power gated and the transmission gate is turned off, isolating the lower bitline segment from the upper one; only the lower segment's bitline is pre-charged.
L2 cache miss period: the transmission gate is turned on and both segments' bitlines are pre-charged. The RF is downsized at the end of the miss period, once the upper segment is empty.
To detect emptiness, the upper segment is augmented with one extra bit per entry: the bit is set when a register is allocated and reset when the register is released, and ORing these bits indicates whether the segment is empty (see the sketch below).
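A behavioural sketch of the upper-segment occupancy tracking described above; the segment size and method names are assumed for illustration.

```python
# Behavioural model of the per-entry occupancy bits in the upper RF segment
# (segment size and method names are assumed for illustration).
class UpperSegmentTracker:
    def __init__(self, num_entries=32):
        self.occupied = [False] * num_entries  # one extra bit per entry

    def allocate(self, entry):
        self.occupied[entry] = True            # set when a register is taken

    def release(self, entry):
        self.occupied[entry] = False           # reset when the register is released

    def is_empty(self):
        # Hardware ORs all occupancy bits; the segment is empty when the OR is 0.
        return not any(self.occupied)

tracker = UpperSegmentTracker()
tracker.allocate(3)
print(tracker.is_empty())   # False: cannot downsize yet
tracker.release(3)
print(tracker.is_empty())   # True: safe to power-gate the upper segment
```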

PERFORMANCE AND ENERGY-DELAY
[Figures: (a) normalized performance improvement for L2MRFS; (b) normalized energy-delay product, compared to conf_1 and conf_2]
L2MRFS improves performance by 6% and 11%, and reduces the energy-delay product by 3.5% and 7%, relative to conf_1 and conf_2 respectively. The calculation below shows how such a normalized energy-delay figure is obtained.
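For reference, a short sketch of how a normalized energy-delay product reduction is computed; the energy and delay values below are placeholders, not the paper's measurements.

```python
# Normalized energy-delay product (EDP): EDP = energy * execution_time.
# Placeholder numbers to show the computation only, not measured data.
def edp(energy_j, time_s):
    return energy_j * time_s

baseline_edp = edp(energy_j=2.00, time_s=1.00)
l2mrfs_edp   = edp(energy_j=2.02, time_s=0.92)   # slightly more energy, less time
reduction = 1 - l2mrfs_edp / baseline_edp
print(f"EDP reduction: {reduction * 100:.1f}%")
```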

CONCLUSION
Technology scaling into the ultra-deep submicron regime allows hundreds of millions of gates to be integrated onto a single chip, but the power budget and the practically achievable clock frequency are limiting factors. Increasing the register file size statically can increase IPC, yet it lengthens execution time because of its impact on the maximum achievable operating frequency. Dynamic register file resizing lets the processor use a small RF (lower access time) while no L2 cache miss is pending and a larger RF (higher access time) during L2 miss periods, with only minimal modification to the register file to adapt its size and access time. Combined with dynamic frequency scaling, dynamic RF resizing achieves up to 11% performance improvement and 7% energy-delay reduction. A similar methodology can be applied to other timing-constrained resources such as the ROB, IQ, LQ/SQ, and caches.