Lizhong Chen and Timothy M. Pinkston SMART Interconnects Group

Presentation transcript:

NoRD: Node-Router Decoupling for Effective Power-gating of On-Chip Routers
Lizhong Chen and Timothy M. Pinkston
SMART Interconnects Group, University of Southern California
December 4, 2012

NoC Power Consumption
- Chip power has become a main design constraint
- High power consumption in the NoC
- Static power increasing in on-chip routers
- Various contributors to router static power
[Figure: power breakdown, chip -> NoC; canonical router at 45nm and 1.0V]

Use of Power-gating
- Applications of power-gating
  - Saves static power by cutting off the power supply to a block
  - Has been applied to cores and execution units
  - Few works apply it to on-chip routers
- Objectives of power-gating
  - Maximize net energy savings
  - Minimize performance penalty
- Proposed Node-Router Decoupling
  - Increase power-gating opportunity and effectiveness in on-chip networks

Conventional Use of Power-gating Applied to NoC Routers
- Power off the router
  - When the datapath of the router is empty, and
  - After notifying all of its neighbors (PG signal)
- Awake the router when
  - Any neighbor asserts the WU signal
  - Neighbors wait for the PG signal to clear
- Effectiveness subject to
  - Wakeup latency (~12 cycles for router)
  - Breakeven-time (BET): the minimum number of consecutive gated-off idle cycles to offset power-gating energy overhead (~10 cycles for router)
[Figure: Routers A to E exchanging PG and WU signals]
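The BET definition above can be turned into a small back-of-the-envelope check. This is only a sketch under stated assumptions: energy is counted in units of one cycle of router static energy, the power-gating overhead is taken to equal `bet_cycles` of those units (which is what the BET definition implies), and wakeup latency is ignored. The function names and the sample idle periods are hypothetical, not from the slides.

```python
def net_energy_saved(idle_cycles, bet_cycles=10):
    """Net saving from gating one idle period, in units of
    one cycle of static energy (assumption: overhead == BET cycles)."""
    return idle_cycles - bet_cycles


def fraction_worth_gating(idle_periods, bet_cycles=10):
    """Fraction of idle periods that are at least as long as the BET."""
    if not idle_periods:
        return 0.0
    worthwhile = sum(1 for n in idle_periods if n >= bet_cycles)
    return worthwhile / len(idle_periods)


# Hypothetical idle-period lengths (in cycles) for one router.
idle_periods = [3, 8, 12, 40, 5]
per_period = [net_energy_saved(n) for n in idle_periods]
total = sum(per_period)
fraction = fraction_worth_gating(idle_periods)
```

Periods shorter than the BET give a negative net saving, which is why fragmented idle time hurts conventional power-gating.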

Challenges in Conventional Use of Power-gating to NoC Routers
- BET limitation is intensified
  - Intermittent packet arrivals => fragmented idle intervals
- Cumulative wakeup latency in multi-hop NoCs
  - Worse for larger networks
- Disconnection problem
  - Idle period is upper bounded by the local node's traffic
- Full-system simulation on PARSEC shows that 61% of all idle periods are shorter than the BET
- Conventional use of power-gating in NoC routers can have limited effectiveness
[Figure: mesh with source S and destination D, illustrating a disconnected network]

Node-Router Decoupling in a Nutshell
- Break node-router dependence through decoupling bypass paths
  - Add two bypass paths to each router
  - On the chip-level: form a bypass ring connecting all nodes
  - Bypass Inport => NI ejection, NI injection => Bypass Outport
- Mitigate BET limitation
  - Use bypass paths instead of waking up routers
- Hide wakeup latency
  - Use bypass paths while routers are waking up
- Eliminate disconnection
  - All nodes are always connected by the bypass ring
NI = Network Interface
[Figure: mesh with bypass ring, source S and destination D; detail of one node]
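To make the "bypass ring connecting all nodes" concrete, here is one possible way to lay a ring over an n x n mesh using only neighbor-to-neighbor links. The actual ring layout used in NoRD is not recoverable from this transcript, so the serpentine construction below, the function names, and the `(row, col)` node labels are all assumptions for illustration. The ring is treated as unidirectional, matching the Bypass Inport => Bypass Outport direction.

```python
def bypass_ring(n):
    """One possible ring visiting every node of an n x n mesh once
    (assumption: n is even; nodes are (row, col) tuples)."""
    if n < 2 or n % 2:
        raise ValueError("this sketch needs an even mesh dimension >= 2")
    ring = [(0, c) for c in range(n)]                # top row, left to right
    for r in range(1, n):                            # snake through columns 1..n-1
        cols = range(n - 1, 0, -1) if r % 2 else range(1, n)
        ring.extend((r, c) for c in cols)
    ring.extend((r, 0) for r in range(n - 1, 0, -1))  # return up column 0
    return ring


def ring_hops(ring, src, dst):
    """Hops from src to dst when travelling one way around the ring."""
    return (ring.index(dst) - ring.index(src)) % len(ring)


ring = bypass_ring(4)
```

On a 4x4 mesh this ring has 16 hops in total, so a packet confined to it can need up to 15 hops, which is why the ring is better suited to low load or as a fallback than as the only path.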

Outline
- Introduction, motivation, basic idea
- Node-router decoupling implementation
- Evaluation methodology and results
- Related work
- Summary

Network Interface (NI)
- On-chip networks
  - NoC-based architecture
  - Canonical router architecture
- Role of NI
[Figure labels: Network Interface (NI); Core, Cache, Memory Controller]

NoRD Bypass Paths
- Add two bypass paths to each router
  - One bypass from Bypass Inport to the NI ejection
  - One bypass from the NI injection to Bypass Outport
- State-transitions
  - On -> off, when the datapath of the router is empty
  - Off -> on, when a wakeup metric exceeds a threshold
    - Wakeup metric: VC request rate at the local NI
- Low implementation cost of decoupling bypass paths and forwarding logic: 3.1% of router area
[Figure: router with bypass paths through the Network Interface]
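The two state transitions on this slide can be sketched as a tiny state machine. This is a simplified illustration, not the authors' controller: the class name, the per-cycle `step` interface, and the threshold value are hypothetical, and the wakeup latency and the PG/WU handshake are not modeled.

```python
from dataclasses import dataclass


@dataclass
class RouterPowerState:
    """Toy model of the NoRD on/off transitions for one router."""
    wakeup_threshold: float      # assumed threshold on the VC request rate
    powered_on: bool = True

    def step(self, datapath_empty, vc_request_rate):
        """Advance one cycle and return the new power state."""
        if self.powered_on:
            # On -> off: gate the router once its datapath is empty.
            if datapath_empty:
                self.powered_on = False
        elif vc_request_rate > self.wakeup_threshold:
            # Off -> on: wake up when the local NI's VC request rate
            # exceeds the threshold.
            self.powered_on = True
        return self.powered_on


router = RouterPowerState(wakeup_threshold=0.2)
trace = [
    router.step(datapath_empty=False, vc_request_rate=0.5),  # busy, stays on
    router.step(datapath_empty=True, vc_request_rate=0.0),   # empty, gated off
    router.step(datapath_empty=True, vc_request_rate=0.1),   # light load, stays off
    router.step(datapath_empty=False, vc_request_rate=0.3),  # load rises, wakes up
]
```

While the router is off, traffic to and from the node uses the bypass paths, so a low request rate never forces a wakeup.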

NoRD Routing
- Based on Duato's Protocol for fully adaptive routing
- Minimal path along gated-on routers & gated-off routers
- Limited misroutes possible only if all routers are off along the minimal path
- Bypass ring serves as "escape path"
[Figure: mesh with source S and destination D]
Speaker note: Explain DP, max hop, if 8 is on; if not, then

Increasing NoRD Efficiency
- Differentiate routers
  - Routers have different impact on performance based on their locations in the NoC
  - Performance-centric class vs. power-centric class
- Wake up early a few performance-critical routers to improve performance by adding "shortcuts" in routing
- Wake up late the rest (majority) of the routers to save more static power by allowing those routers to stay in gated-off state for a longer time
- Use an off-line program to classify the routers
- NoRD enables this trade-off
[Figure: mesh diagrams showing router classes]

Evaluation Methodology
- Simulation platform
  - Platform: Simics + GEMS (Garnet + Orion 2.0)
  - Workloads: PARSEC 2.0 + synthetic traffic
- Key parameters for simulations

  Core model            Sun UltraSPARC III+, 3GHz
  Private I/D L1$       32KB, 2-way, LRU, 1-cycle latency
  Shared L2 per bank    256KB, 16-way, LRU, 6-cycle latency
  Cache block size      64 Bytes
  Coherence protocol    MOESI
  Network topology      4x4 and 8x8 mesh
  Router                4-stage, 3GHz
  Virtual channel       4 per protocol class
  Input buffer          5-flit depth
  Link bandwidth        128 bits/cycle
  Memory controllers    4, located one at each corner
  Memory latency        128 cycles

Schemes Under Comparison
- No power-gating (No_PG)
- Conventional power-gating (Conv_PG)
  - Apply power-gating technique conventionally to routers
- Optimized conventional power-gating (Conv_PG_OPT)
  - Conv_PG + early wakeup (hide some wakeup latency)
- Node-router decoupling (NoRD)
  - Power-gate routers and enable bypass paths when load is low
  - When load becomes high, routers are powered on gradually

Static Energy Comparison
- Static energy saved
  - Conv_PG: 51.2%, Conv_PG_OPT: 47.0%, NoRD: 62.9%
- Relative improvement of NoRD: 23.9% and 29.9%
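The slide does not say how the relative improvement is computed. One reading that reproduces the quoted figures to within rounding is to compare the static energy that remains under each scheme rather than the energy saved. The sketch below assumes that interpretation; the function name is hypothetical.

```python
# Static energy saved relative to No_PG, in percent (from the slide).
saved = {"Conv_PG": 51.2, "Conv_PG_OPT": 47.0, "NoRD": 62.9}


def relative_improvement(baseline, scheme="NoRD"):
    """Percent reduction in remaining static energy versus a baseline
    (assumed interpretation of the slide's 'relative improvement')."""
    remaining_base = 100.0 - saved[baseline]
    remaining_new = 100.0 - saved[scheme]
    return 100.0 * (remaining_base - remaining_new) / remaining_base


vs_conv = relative_improvement("Conv_PG")          # about 24.0
vs_conv_opt = relative_improvement("Conv_PG_OPT")  # about 30.0
```

The small gaps from the quoted 23.9% and 29.9% are consistent with the slide's percentages having been rounded before this calculation.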

Power-gating Overhead Reduction
- NoRD reduces power-gating overhead and number of router wakeups by over 80%
[Figure panels: power-gating overhead; reduction in # of router wakeups]

Overall NoC Energy
- Overall NoC energy saved
  - Conv_PG: 9.4%, Conv_PG_OPT: 9.1%, NoRD: 20.6%
- Static energy savings exceed dynamic energy losses
Speaker note: Discuss misrouting

Performance
- Average packet latency penalty
  - Conv_PG: 63.8%, Conv_PG_OPT: 41.5%, NoRD: 15.2%
- Execution time penalty
  - Conv_PG: 11.7%, Conv_PG_OPT: 8.1%, NoRD: 3.9%
[Figure panels: average packet latency; execution time]
Speaker note: Misrouting and PG

Related Work
- Applications of power-gating in CMPs
  - Apply to cores and execution units in CMPs (Z. Hu, et al., 2004; A. Lungu, et al., 2009; N. Madan, et al., 2011; others)
  - Apply power-gating conventionally to on-chip routers (H. Matsutani, et al., 2008; S. Jafri, et al., 2010; H. Matsutani, et al., 2010)
    - Effectiveness is limited by the BET requirement, wakeup delay and disconnection problem
- Other uses of bypass
  - For fault-tolerance: work for infrequent on/off transitions (M. Koibuchi, et al., 2008; J. Kim, et al., 2006; others)
  - For express channels: improve performance and dynamic power (W. Dally, 1991; A. Kumar, et al., 2007; B. Grot, et al., 2009; others)
  - For reducing power consumption in links (E. Kim, et al., 2003; V. Soteriou, et al., 2004; B. Zafar, et al., 2010; others)
  - These techniques are either not suitable for run-time router power-gating or have different targets, and are thus orthogonal to this work

Summary
- Node-router dependence severely limits the use of power-gating in on-chip routers
  - BET limitation, wakeup delay and disconnection problem
- A novel approach, Node-Router Decoupling (NoRD), is proposed based on power-gating bypass paths
  - Significantly reduces the number of power state transitions
  - Increases the length of idle periods
  - Completely hides the wakeup latency from the critical path
  - Eliminates network disconnection problems
- NoRD increases power-gating opportunity while minimizing performance overhead

Thank you!

Power-gating Basics
- Breakeven-time (BET)
  - The minimum number of consecutive gated-off idle cycles to offset power-gating energy overhead
  - Around 10 cycles for router
- Wakeup latency
  - Around 10~15 cycles for router
[Figure: power-gating timeline; axis: time]

NoRD Routing
- Based on Duato's Protocol
  - Escape resources consist of the escape VCs of the bypass ring formed by (Bypass Inport, Bypass Outport) pairs
  - Other VCs are adaptive resources
- Packets on adaptive VCs
  - First routed minimally
  - If not possible, detoured by one hop
    - May still be routed on adaptive VCs
  - If misrouted hops reach a threshold
    - Forced to enter escape VCs
- Packets on escape VCs
  - Confined to bypass ring until destination
[Figure: mesh with source S and destination D]
Speaker note: Explain DP, max hop, if 8 is on; if not, then
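The per-hop decision described on this slide can be sketched as a single function. This is an illustrative reading, not the authors' router logic: the function name, its arguments, the returned labels, and the exact point at which the misroute counter is compared with the threshold are assumptions. Whether a minimal next hop is currently usable is passed in as a flag rather than computed.

```python
def route_step(on_escape_vc, minimal_hop_available, misroutes, misroute_threshold):
    """Return (vc_class, action, misroutes) for the next hop of a packet.

    vc_class is "adaptive" or "escape"; action is "minimal", "detour",
    or "ring" (follow the bypass ring).
    """
    if on_escape_vc:
        # Packets on escape VCs stay on the bypass ring until destination.
        return "escape", "ring", misroutes
    if minimal_hop_available:
        # Adaptive packets are routed minimally whenever possible.
        return "adaptive", "minimal", misroutes
    if misroutes >= misroute_threshold:
        # Too many detours: force the packet onto the escape VCs.
        return "escape", "ring", misroutes
    # Otherwise detour by one hop and remain on adaptive VCs.
    return "adaptive", "detour", misroutes + 1


# Hypothetical walk: threshold of 2 misroutes, no minimal hop ever available.
state = ("adaptive", "start", 0)
history = []
for _ in range(4):
    state = route_step(state[0] == "escape", False, state[2], 2)
    history.append(state)
```

Because the escape VCs follow a ring that always connects every node, a packet that falls back to them can still reach its destination even when the routers along its minimal path are gated off.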