Abbas Rahimi‡, Luca Benini†, and Rajesh Gupta‡ ‡CSE, UC San Diego

Procedure Hopping: a Low Overhead Solution to Mitigate Variability in Shared-L1 Processor Clusters
Abbas Rahimi‡, Luca Benini†, and Rajesh Gupta‡
‡CSE, UC San Diego; †DEIS, Università di Bologna
http://mesl.ucsd.edu http://micrel.deis.unibo.it http://variability.org
International Symposium on Low-Power Electronics and Design

Procedure Hopping to Mitigate CMOS Variability: how and why did we leverage a high-level concept like a “procedure” to solve a problem as low-level as transistor-level variability?

Sources of Device Variation
10% VCC, ~160˚C temperature, 40% VTH. Variations are more challenging in a many-core platform!
[Figure: clock period, actual circuit delay, guardband; labels: across-wafer frequency, temperature, VCC droop, other uncertainty]
1- Within-die 3σ performance variation of more than 25% at 0.8V in …
2- ITRS projects Vdd variation to be 10%, while the operating temperature can vary from -30C to 175C (e.g., in an automotive context).
3- Dynamic variations contain high-frequency and low-frequency components, which occur locally as well as globally across the die.

Agenda
  Sources of Variations
  Variation-tolerant Shared-L1 Processor Cluster
    Process Variation → Variation-aware VDD-hopping
    Dynamic Voltage Variation → Procedure hopping
  Methodology for PLV
    Design time characterization
    Compile time PLV metadata generation
    Runtime preventive compensation
  Experimental Results

Shared-L1 Processor Clusters
Each cluster consists of:
  16 32-bit in-order RISC cores
  An intra-cluster shared-L1 I$
  An on-chip multi-banked tightly coupled data memory (TCDM)
  Two single-cycle logarithmic interconnections for both instruction and data sides
  A hardware synchronization handler module (SHM) to coordinate and synchronize cores for accessing shared data on TCDM
  VDD-hopping per core
Shared-L1 TCDM cluster template: the code is easily accessible via the shared-L1 I$. The data and parameters are passed through the shared stack in TCDM.
[Figure: 4x8 cluster with 4 PEs and an 8-bank TCDM]

VDD–hopping to Compensate Process Variation
Per-core frequency (MHz); target frequency: 830MHz

Core   VDD = 0.81V   VDD = 0.99V   VA-VDD-Hopping = (0.81V, 0.99V)
f0     862           1408          862
f1     909           1389          909
f2     870           …             870
f3     847           1370          847
f4     826           …             1370
f5     855           …             855
f6     877           …             877
f7     893           …             893
f8     820           …             …
f9     …             …             …
f10    …             …             …
f11    …             …             …
f12    901           …             901
f13    917           …             917
f14    …             …             …
f15    …             …             …

VDD = 0.81V: three cores (f4, f8, f9) cannot meet the target frequency of 830MHz.
VDD = 0.99V: all cores of the same cluster meet the target frequency of 830MHz.
VA-VDD-hopping can accordingly tune the cores' voltage based on their delay reported by CPMs.

VDD–hopping to Compensate Process Variation
The process variation is compensated, but the cluster will have various voltage/temperature islands!
[Figure: per-core frequencies under VA-VDD-hopping, as on the previous slide]
Each core increases its voltage if its delay is high.
Every core has its own voltage domain.
All cores work at the same frequency.
VDD-hopping tunes the voltage of each core based on its CPM.
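The voltage selection described on the two slides above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `assign_vdd` and the simple two-level rule are assumptions, and the table only includes the cores whose 0.81V frequencies appear in the transcript.

```python
# Per-core frequency (MHz) at VDD = 0.81 V, limited to the cores whose
# values are present in the transcript.
FREQ_AT_LOW_VDD_MHZ = {
    "f0": 862, "f1": 909, "f2": 870, "f3": 847, "f4": 826, "f5": 855,
    "f6": 877, "f7": 893, "f8": 820, "f12": 901, "f13": 917,
}
TARGET_MHZ = 830
LOW_VDD, HIGH_VDD = 0.81, 0.99


def assign_vdd(freq_mhz, target_mhz=TARGET_MHZ):
    """Hypothetical policy: keep a core at the low voltage if it already
    meets the target frequency there, otherwise raise it to the high voltage."""
    return {
        core: (LOW_VDD if freq >= target_mhz else HIGH_VDD)
        for core, freq in freq_mhz.items()
    }


vdd = assign_vdd(FREQ_AT_LOW_VDD_MHZ)
raised = sorted(core for core, v in vdd.items() if v == HIGH_VDD)
print("Cores raised to 0.99 V:", raised)
```

With the values available here, only f4 and f8 fall below 830MHz at 0.81V, which matches two of the three cores the slide names (the value for f9 is not in the transcript).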

Fast Dynamic IR-drop within Cluster

(Vol., Temp.)    0.99V, 125C    0.90V, 25C     0.81V, 125C    0.81V, -40C
Power density    0.66 μW/μm²    0.21 μW/μm²    0.18 μW/μm²    0.16 μW/μm²
Max IR-drop      44 mV          < 35 mV        < 31 mV        …

IR-drop during execution of FIR on cores at various operating corners. FIR does not face any voltage emergency (IR-drop < 4%) at the corners with voltages of 0.81V-0.9V, due to their lower power densities.
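As a quick check of the 4% criterion, the IR-drop figures can be converted to a percentage of the supply voltage. The sketch below assumes the percentage is taken relative to the corner's nominal VDD; the helper name `ir_drop_pct` is hypothetical. For the corners reported only as upper bounds, the bound itself is used.

```python
EMERGENCY_THRESHOLD_PCT = 4.0  # voltage-emergency limit stated on the slide


def ir_drop_pct(vdd_v, drop_mv):
    """IR-drop as a percentage of the supply voltage (assumed definition)."""
    return 100.0 * (drop_mv / 1000.0) / vdd_v


# (VDD in volts, max IR-drop or its upper bound in mV), as listed on the slide.
corners = [(0.99, 44), (0.90, 35), (0.81, 31)]

for vdd_v, drop_mv in corners:
    pct = ir_drop_pct(vdd_v, drop_mv)
    status = "emergency" if pct > EMERGENCY_THRESHOLD_PCT else "ok"
    print(f"{vdd_v:.2f} V: {pct:.2f}% -> {status}")
```

Under this assumption, 44 mV at 0.99V is about 4.44% and exceeds the limit, while the bounds at 0.90V and 0.81V stay below 4%.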

Procedure hopping to Compensate Voltage Variation Each procedure hops from one core to another if it causes voltage variation. Procedure hopping facilitates fast and proactive migration of procedures within a cluster to prevent voltage variation thanks to shared I$ and TCDM resources.

Procedure-level Vulnerability (PLV)
The notion of PLV to fast dynamic voltage variation is defined. The design-time stage analyzes the dynamic voltage droops/rises for every ProcX across the full range of operating conditions → generating PLVx metadata.
[Figure: observe IR-drops while int ProcX (…) { … } runs on Corei at (Vi,Tj); example: PLVX @(Vi,Tj) = 0.75]

Characterization of PLV to IR-drop: Compile time + Runtime
At compile time, the PLVx metadata of ProcX is attached to the procedure. At runtime, the discretized (V,T) point indexes the corresponding characterized PLV metadata to assess the vulnerability of ProcX at the current (V,T). If PLVx ≥ PLV_threshold, ProcX is hopped from the caller core to a favorable callee core.
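The runtime decision above can be sketched as a table lookup followed by a threshold test. Everything concrete in this sketch is an assumption made for illustration: the PLV values per corner, the threshold of 0.5, the nearest-corner discretization, and the rule of picking the core with the lowest PLV as the "favorable" callee. Only the four (V,T) corners and the example value 0.75 come from the slides, and the slide does not say which corner 0.75 belongs to.

```python
# Hypothetical PLV metadata for one procedure, keyed by characterized
# (voltage in V, temperature in C) corner.
PLV_METADATA = {
    "ProcX": {
        (0.99, 125): 0.75,
        (0.90, 25): 0.40,
        (0.81, 125): 0.10,
        (0.81, -40): 0.05,
    }
}
PLV_THRESHOLD = 0.5  # hypothetical threshold


def nearest_corner(v, t, corners):
    """Discretize a measured (V, T) point: snap V to the closest characterized
    voltage, then T to the closest temperature characterized at that voltage."""
    v_snap = min({c[0] for c in corners}, key=lambda lv: abs(lv - v))
    t_snap = min((c[1] for c in corners if c[0] == v_snap),
                 key=lambda lt: abs(lt - t))
    return (v_snap, t_snap)


def plv_at(proc, v, t):
    table = PLV_METADATA[proc]
    return table[nearest_corner(v, t, table)]


def choose_core(proc, caller, core_state, threshold=PLV_THRESHOLD):
    """Return the core that should run proc: the caller if its PLV is below
    the threshold, otherwise the least vulnerable other core (if any is safe)."""
    if plv_at(proc, *core_state[caller]) < threshold:
        return caller
    candidates = {core: plv_at(proc, *vt)
                  for core, vt in core_state.items() if core != caller}
    best = min(candidates, key=candidates.get)
    return best if candidates[best] < threshold else caller


# Current (V, T) reading per core; values are made up for the example.
core_state = {"core0": (0.98, 118.0), "core1": (0.82, 110.0), "core2": (0.91, 30.0)}
print(choose_core("ProcX", "core0", core_state))  # hops away from core0
print(choose_core("ProcX", "core1", core_state))  # stays on core1
```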

Max Voltage Variation Across Corners and Procedures
Max voltage droop (%)

(Vol., Temp.)   a2tim   FIR    IFFT   bitmnp   cacheb   IDCT   matrix   pntrch   PWM    sspeed   tblook   ttsprk
0.99V, 125°C    5.39    4.46   6.34   5.03     4.62     6.26   5.89     5.36     5.23   5.05     3.84     5.41

0.90V, 25°C: 3.65 2.98 4.63 3.47 3.11 4.41 4.09 3.63 3.48 2.44 4.99
0.81V, 125°C: 3.45 2.8 3.7 3.43 2.92 3.77 3.39 3.27 3.33 2.29
0.81V, -40°C: 3.34 2.72 3.66 2.84 3.53 3.26 3.24 2.22

Most procedures running on cores at 0.99V have voltage emergencies. At 0.9V, only four procedures (IFFT, IDCT, matrix, ttsprk) face voltage emergencies. There is no voltage emergency at 0.81V. Procedure hopping avoids the voltage emergency for all procedures by hopping them from a high-voltage core to a low-voltage core.
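The claim about the 0.99V corner can be checked directly against the first row of the table. The sketch below assumes that the 4% limit quoted for IR-drop on the earlier slide is also the emergency criterion here; it uses only the 0.99V row, since that is the only row with a value for every procedure.

```python
EMERGENCY_THRESHOLD_PCT = 4.0  # assumed to be the same 4% limit as for IR-drop

PROCEDURES = ["a2tim", "FIR", "IFFT", "bitmnp", "cacheb", "IDCT",
              "matrix", "pntrch", "PWM", "sspeed", "tblook", "ttsprk"]

# Max voltage droop (%) at (0.99V, 125C), in the same order as PROCEDURES.
DROOP_AT_099V = [5.39, 4.46, 6.34, 5.03, 4.62, 6.26,
                 5.89, 5.36, 5.23, 5.05, 3.84, 5.41]

droop = dict(zip(PROCEDURES, DROOP_AT_099V))
emergencies = [p for p in PROCEDURES if droop[p] > EMERGENCY_THRESHOLD_PCT]
safe = [p for p in PROCEDURES if droop[p] <= EMERGENCY_THRESHOLD_PCT]

print(f"{len(emergencies)} of {len(PROCEDURES)} procedures exceed 4% at 0.99V")
print("Below the limit:", safe)
```

Under that assumption, every procedure except tblook exceeds the limit at 0.99V, which agrees with "most procedures" in the slide text.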

Cost of Procedure Hopping

                 Caller hopping   Caller not hopping   Callee service   Callee no service
Latency          218 cycles       88 cycles            575 cycles       342 cycles
Voltage droop    1.3%             0.6%                 2.9%             1.8%

The total roundtrip overhead of hopping a procedure from the caller core and returning the results from the callee core is less than 800 cycles. This overhead is less than 1% of the total cycles needed to execute any of the characterized procedures in the EEMBC benchmark. During procedure hopping, no voltage emergency can occur even at (0.99V,125˚C), in either the caller or the callee core.
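The figures in the table are consistent with the stated bounds if the roundtrip is read as the caller-hopping latency plus the callee-service latency. That reading is an assumption, as is the 4% emergency limit carried over from the IR-drop slide; the arithmetic is sketched below.

```python
LATENCY_CYCLES = {
    "caller_hopping": 218,
    "caller_not_hopping": 88,
    "callee_service": 575,
    "callee_no_service": 342,
}
DROOP_PCT = {
    "caller_hopping": 1.3,
    "caller_not_hopping": 0.6,
    "callee_service": 2.9,
    "callee_no_service": 1.8,
}

# Assumed composition of one roundtrip hop: the caller hands the procedure
# off and the callee services it.
roundtrip = LATENCY_CYCLES["caller_hopping"] + LATENCY_CYCLES["callee_service"]

# For the overhead to stay below 1%, a procedure must run for more than
# 100x the roundtrip cost.
min_procedure_cycles = roundtrip * 100

worst_droop = max(DROOP_PCT.values())

print(f"roundtrip = {roundtrip} cycles")
print(f"overhead < 1% for procedures longer than {min_procedure_cycles} cycles")
print(f"worst droop while hopping = {worst_droop}%")
```

The sum is 793 cycles, under the 800-cycle bound, and the worst droop of 2.9% stays below a 4% limit.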

Conclusion
The notion of procedure-level vulnerability to fast dynamic voltage variation is defined. Based on PLV metadata, a fully software, low-cost procedure hopping technique is proposed that guarantees voltage-emergency-free migration of all procedures, quickly and proactively, within a shared-L1 processor cluster. Full post-P&R results in 45nm TSMC technology confirm that procedure hopping avoids the voltage emergency across a variability-affected cluster, while imposing only an amortized cost of less than 1% latency for any of the characterized embedded procedures.

Thank you! http://mesl.ucsd.edu http://micrel.deis.unibo.it http://variability.org

HW/SW Collaborative Architecture to Support Intra-cluster Procedure Hopping The code is easily accessible via the shared-L1 I$. The data and parameters are passed through the shared stack in TCDM. A procedure hopping information table (PHIT) keeps the status for a migrated procedure.

Intra-procedure Peak Power Variation
The maximum intra-corner peak power variation is 1.28×, between the IFFT and tblook procedures at (0.81V,125C). The maximum inter-corner peak power variation is 3.5×, for FIR. The maximum peak power variation across corners and procedures is 4.1×, between a2time at (0.81V,-40C) and IFFT at (0.99V,125C).