LIBRA: Multi-mode On-Chip Network Arbitration for Locality-Oblivious Task Placement Gwangsun Kim Computer Science Department Korea Advanced Institute of.

Presentation transcript:

LIBRA: Multi-mode On-Chip Network Arbitration for Locality-Oblivious Task Placement Gwangsun Kim Computer Science Department Korea Advanced Institute of Science and Technology Master’s degree defense -

Table of Contents
Motivation
LIBRA
– Introduction to Probabilistic Distance-based Arbitration
– Virtual Contention-based Arbitration
– Hybrid Arbitration
Evaluation
Conclusions
2 / 20

Motivation
[Data collected by C. Batten, Y. Pan]
The on-chip network is an important shared resource in a CMP. Fair allocation of this shared resource is needed.
3 / 20

Motivation
Experiment: 16-core CMP. Run a SPEC benchmark and 15 copies of a memory-intensive microbenchmark to create a hotspot. The location of the SPEC benchmark is varied. A round-robin arbiter results in significant unfairness.
Why does fairness in the OCN matter?
– Performance is hard to predict (SLA).
– It complicates OS design.
– Parallel applications slow down.
This work proposes LIBRA, OCN support for locality-oblivious task placement.
[Figure labels: Hotspot, MC, "Up to 12x!"]
4 / 20

Overview of LIBRA
LIBRA: Locality-Oblivious Bandwidth Regulatory Arbiter. Libra is the zodiac constellation that symbolizes balance.
Leverages probabilistic distance-based arbitration (MICRO’10).
Consists of two mechanisms:
1. Virtual contention arbitration (VCA): addresses unfairness.
2. Hybrid arbitration: addresses the high-latency problem.
Combination of 1 and 2: multi-mode arbitration.
5 / 20

Probabilistic Distance-based Arbitration (PDBA)
Proposed to provide fairness in on-chip networks.
1. Probabilistic arbitration
2. Weight is multiplied by contention degree
[Figure labels: source queue with weight 1; multipliers x1 and x2; Router 0, Router 1, Router 2]
6 / 20
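The two PDBA steps listed on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the MICRO’10 implementation: the function name `pdba_arbitrate`, the dictionary of per-port weights, and the choice to treat the number of contending requests as the "contention degree" are all hypothetical.

```python
import random


def pdba_arbitrate(weights, rng=random):
    """Pick one winner among the input ports contending for an output.

    weights: dict mapping a (hypothetical) port name to the weight carried
             by the packet at the head of that port.
    Returns (winner, updated_weight_of_winner).
    """
    if not weights:
        raise ValueError("no requests to arbitrate")
    ports = list(weights)
    # Step 1: probabilistic arbitration, P(win) proportional to weight.
    winner = rng.choices(ports, weights=[weights[p] for p in ports], k=1)[0]
    # Step 2 (assumed form): multiply the winner's weight by the contention
    # degree, taken here as the number of requests that contended.
    contention_degree = len(ports)
    return winner, weights[winner] * contention_degree


if __name__ == "__main__":
    rng = random.Random(0)
    # No contention: the weight is multiplied by 1 and stays unchanged.
    print(pdba_arbitrate({"local": 1}, rng))
    # Two contenders: whichever packet wins leaves with twice its weight.
    print(pdba_arbitrate({"west": 1, "local": 1}, rng))
```

Under these assumptions, a packet that has already won several contended arbitrations carries a larger weight and is therefore more likely to win the next one.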

Limitation of Real Contention-based Arbitration
Real contention: when two or more requests contend.
Real contention-based arbitration (RCA):
– Non-contention is not accounted for.
– In many cases, there is no real contention → unfairness
Unfair bandwidth allocation!
7 / 20

Virtual Contention-based Arbitration (VCA)
Considers historical non-contention in future arbitration.
Two modes: real contention mode and virtual contention mode.
Virtual contention mode example:
[Figure: one port with last weight 4 and priority counter 0; another port with last weight 1 and priority counter 0; labels "4", "Virtual contention", "4"]
8 / 20

Virtual Contention-based Arbitration Cont’d
Real contention mode example:
If the priorities of all ports are the same, then do PDBA.
[Figure: one port with last weight 4 and priority counter 4; another port with last weight 1 and priority counter 0; labels "2", "3"]
Real contention: 4 > 0, so the port with priority counter 4 wins. Decrement its priority counter.
9 / 20
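The real contention mode example can be sketched as below. Only the comparison of priority counters, the decrement for the winner, and the fall-back to PDBA on equal priorities come from the slide. The names `PortState` and `real_contention_arbitrate`, the decrement step of 1, the use of `last_weight` as the PDBA weight in a tie, and leaving counters unchanged after a tie are assumptions. The update rule for virtual contention mode is not spelled out in the transcript, so it is not modeled here.

```python
import random
from dataclasses import dataclass


@dataclass
class PortState:
    last_weight: int           # weight recorded for this port (from the slide)
    priority_counter: int = 0  # priority accumulated by this port


def real_contention_arbitrate(ports, requesters, rng=random):
    """Arbitrate among two or more requesting ports (real contention mode).

    ports: dict of hypothetical port name -> PortState.
    requesters: names of the ports currently requesting the output.
    """
    if not requesters:
        raise ValueError("no requests to arbitrate")
    top = max(ports[p].priority_counter for p in requesters)
    leaders = [p for p in requesters if ports[p].priority_counter == top]
    if len(leaders) == 1:
        # Strictly highest priority counter wins and is decremented
        # (decrement step of 1 is an assumption).
        winner = leaders[0]
        ports[winner].priority_counter -= 1
        return winner
    # Equal priorities: fall back to PDBA-style weighted random choice.
    # Using last_weight as the weight is an assumption of this sketch.
    return rng.choices(
        leaders, weights=[ports[p].last_weight for p in leaders], k=1
    )[0]


if __name__ == "__main__":
    ports = {
        "A": PortState(last_weight=4, priority_counter=4),
        "B": PortState(last_weight=1, priority_counter=0),
    }
    print(real_contention_arbitrate(ports, ["A", "B"]))  # A (4 > 0)
    print(ports["A"].priority_counter)                   # 3
```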

Hybrid Arbiter
VCA lengthens the router critical path → lower clock frequency.
Observation: fairness matters only at high load.
– At low load, there is little contention → RR is fine.
– At high load, there is heavy contention and its impact is large, so VCA is needed; but packets are queued up in the buffer → more time is available for processing.
[Figure: Low load: RR has little impact on fairness. High load: VCA provides fairness; do pre-calculation.]
10 / 20

Hybrid Arbiter Cont’d If there was no chance for pre-calculation, use RR. Use VCA whenever possible. 11 / 20

LIBRA: Multi-mode Arbitration
Operates in one of multiple modes depending on contention type and load.
– Contention type: # of requests for the output port
– Load: whether pre-calculation is done or not
Mode table (rows: Contention; columns: Hybrid, Simple or Complex):
Contention | Simple | Complex
Yes | Round-robin | Virtual contention arbiter (VCA) in real contention mode
No | Virtual contention arbiter (VCA) in virtual contention mode (both columns)
12 / 20
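The mode selection on this slide can be written as a small decision function. This is a sketch under one reading of the flattened table: with contention, the simple path is round-robin and the complex path is VCA in real contention mode; without contention, VCA runs in virtual contention mode. That reading, the names `Mode` and `select_mode`, and the assumption that "contention" means two or more requests are not confirmed by the slides.

```python
from enum import Enum


class Mode(Enum):
    ROUND_ROBIN = "round-robin"
    VCA_REAL = "VCA, real contention mode"
    VCA_VIRTUAL = "VCA, virtual contention mode"


def select_mode(num_requests, precalc_done):
    """Choose the arbitration mode for one output port in one cycle.

    num_requests: number of requests for the output port (contention type).
    precalc_done: whether pre-calculation was completed (load indicator).
    """
    if num_requests < 1:
        raise ValueError("nothing to arbitrate")
    if num_requests == 1:
        # No real contention: assumed to map to virtual contention mode.
        return Mode.VCA_VIRTUAL
    # Real contention: use VCA when pre-calculation had time to finish,
    # otherwise fall back to round-robin.
    return Mode.VCA_REAL if precalc_done else Mode.ROUND_ROBIN


if __name__ == "__main__":
    for n, done in [(1, True), (2, False), (3, True)]:
        print(n, done, select_mode(n, done).value)
```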

Methodology
Synthetic traffic simulation parameters:
Parameter | Value
Network size | 64
Topology | 8x8 2D mesh
Buffers | 16 flits per VC
Virtual channels | 1
Routing | XY routing
Router latency | 3 cycles
Packet size | Bimodal (50% 1 flit and 50% 4 flit)
GEMS simulation parameters:
Parameter | Value
Processor | 16 out-of-order cores (2GHz, 4-way issue, 64 entry ROB)
L1 cache | 32KB, 2-way
L2 cache | 512KB, 32-way, block size of 64B
Memory controller | Closed-page mode, 2 controllers
Topology | 4x4 2D mesh
Buffers | 6 flits per VC
Virtual channels | 4
Flit size | 16 bytes
Area and timing evaluation: Synopsys Design Compiler and IC Compiler.
Synthetic simulation using the cycle-accurate Booksim simulator.
SPEC CPU 2006 application and microbenchmark simulation using the cycle-accurate GEMS + Booksim simulator.
13 / 20

Timing and Area
Baseline (RR): 1.4GHz and 0.07 mm²
LIBRA reduces latency significantly, while introducing low area overhead. [MICRO’10]
14 / 20

Synthetic Traffic Evaluation
Network stability and throughput
[Figure panels: Uniform random, Tornado, Bitcomp]
15 / 20

Support for Locality-oblivious Task Placement
Configuration:
– 14 copies of a memory-intensive microbenchmark.
– SPEC benchmark placement: closest to or farthest from the hotspot.
LIBRA reduces the maximum slowdown by 2.7x and 1.8x compared to RR and AGE, respectively.
16 / 20

Analysis on Unfairness of AGE
AGE can be unfair in closed-loop evaluation.
Assumptions:
– All nodes send packets to MC
– Ideal age-based arbitration
– Steady state
17 / 20

Cost Comparison of QoS Mechanisms
Area overhead comparison: additional area overhead per node (µm²)
[Figure citations: MICRO’10, ISCA’08, MICRO’10, MICRO’09]
LIBRA achieves 38% lower area overhead! (compared to PVC)
18 / 20

Conclusions
Impact of task placement on performance: up to 30x with RR.
This work proposes LIBRA, a multi-mode arbitration scheme.
– VCA for providing global fairness.
– Hybrid arbitration for reducing latency overhead.
LIBRA can support locality-oblivious task placement.
Analysis of the unfairness of age-based arbitration.
LIBRA has 38% lower area overhead compared to PVC.
19 / 20

Q&A 20 / 20

Hybrid Arbiter Cont’d
If there was no chance for pre-calculation, use RR. Use VCA whenever possible.
[Figure: datapath with X, +, and < operators, split into a pre-calculation stage (PC) and an arbitration stage (SAc)]
21 / 20