Application-Aware Prioritization Mechanisms for On-Chip Networks
Reetuparna Das, Onur Mutlu, Thomas Moscibroda, Chita Das


Application-Aware Prioritization Mechanisms for On-Chip Networks. Reetuparna Das §, Onur Mutlu †, Thomas Moscibroda ‡, Chita Das §. § Pennsylvania State University, † Carnegie Mellon University, ‡ Microsoft Research

The Problem: Packet Scheduling. [Figure: a multicore chip in which processors (P), L2 cache banks, accelerators, and memory controllers are connected by a Network-on-Chip, shared by App1, App2, ..., App N-1, App N.] The Network-on-Chip is a critical resource shared by multiple applications.

The Problem: Packet Scheduling. [Figure: a mesh of routers (R), each attached to a Processing Element (PE): cores, L2 banks, memory controllers, etc. Each router contains input ports with buffers, control logic, a Routing Unit (RC), a VC Allocator (VA), a Switch Allocator (SA), and a 5x5 crossbar connecting the East, West, North, South, and PE ports.]

The Problem: Packet Scheduling. [Figure: router detail showing virtual channels VC0, VC1, VC2 on each input port (From East, From West, From North, From South, From PE), feeding the Routing Unit (RC), VC Allocator (VA), and Switch Allocator (SA).]

The Problem: Packet Scheduling, Conceptual View. [Figure: each input port's virtual channels (VC 0, VC 1, VC 2) hold packets from different applications (App1 through App8) competing for the router's outputs.]

The Problem: Packet Scheduling. [Figure: the router's scheduler faces packets from App1 through App8 spread across the virtual channels of all five input ports.] Which packet to choose?

 Existing scheduling policies: Round Robin and Age  Problem 1: they are local to a router  This leads to contradictory decision making between routers: packets from one application may be prioritized at one router, only to be delayed at the next  Problem 2: they are application-oblivious  They treat all applications' packets equally  But applications are heterogeneous  Solution: application-aware global scheduling policies

Outline  Problem: Packet Scheduling  Motivation: Stall Time Criticality of Applications  Solution: Application-Aware Coordinated Policies  Ranking  Batching  Example  Evaluation  Conclusion

Motivation: Stall Time Criticality  Applications are not homogeneous  Applications have different criticality with respect to the network  Some applications are network-latency sensitive  Some applications are network-latency tolerant  An application's Stall Time Criticality (STC) can be measured by its average network stall time per packet (i.e., NST/packet)  Network Stall Time (NST) is the number of cycles the processor stalls waiting for network transactions to complete
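The NST/packet metric defined above can be sketched in a few lines; the function name and the zero-packet guard are illustrative, not from the original slides:

```python
def nst_per_packet(network_stall_cycles, packets_completed):
    """Average Network Stall Time (NST) per packet: cycles the processor
    stalled waiting for network transactions, divided by the number of
    packets completed. A higher value indicates a more stall-time-critical
    (network-latency-sensitive) application."""
    if packets_completed == 0:  # assumed guard; the slides do not cover this case
        return 0.0
    return network_stall_cycles / packets_completed
```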

Motivation: Stall Time Criticality  Why do applications have different network stall time criticality (STC)?  Memory Level Parallelism (MLP)  Lower MLP leads to higher STC  Shortest Job First principle (SJF)  Lower network load leads to higher STC  Average memory access time  Higher memory access time leads to higher STC

 Observation 1: Packet latency != network stall time. STC Principle 1 (MLP). [Figure: for an application with high MLP, compute overlaps with several outstanding packets; the red packet's latency is fully hidden by the others, so its contribution to stall time is 0.]

 Observation 1: Packet latency != network stall time  Observation 2: A low-MLP application's packets have higher criticality than a high-MLP application's. STC Principle 1 (MLP). [Figure: the high-MLP application overlaps packet latencies, hiding most stalls (stall of the red packet = 0); the low-MLP application stalls for nearly the full latency of each packet.]

STC Principle 2 (Shortest Job First). [Figure: compared to running alone, baseline round-robin (RR) scheduling slows the light application down 4X in the network and the heavy application 1.2X; under SJF scheduling the light application's slowdown drops to 1.3X while the heavy application's rises only to 1.6X.] Overall system throughput (weighted speedup) increases by 34%.
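Weighted speedup, the throughput metric cited on this slide, sums each application's IPC when sharing the system normalized to its IPC when running alone. A minimal sketch (names are illustrative):

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over applications of IPC_shared / IPC_alone. It equals the
    number of applications when sharing causes no slowdown, and drops
    as interference slows applications down."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))
```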

Outline  Problem: Packet Scheduling  Motivation: Stall Time Criticality of Applications  Solution: Application-Aware Coordinated Policies  Ranking  Batching  Example  Evaluation  Conclusion

Solution: Application-Aware Policies  Idea  Identify stall-time-critical applications (i.e., network-sensitive applications) and prioritize their packets in each router  Key components of the scheduling policy:  Application ranking  Packet batching  We propose a low-hardware-complexity solution

Component 1: Ranking  Ranking distinguishes applications based on Stall Time Criticality (STC)  Applications are periodically ranked based on STC  We explored many heuristics for quantifying STC (details and analysis in the paper)  A heuristic based on outermost private cache Misses Per Instruction (L1-MPI) is the most effective  Low L1-MPI => high STC => higher rank  Why Misses Per Instruction (L1-MPI)?  Easy to compute (low complexity)  Stable metric (unaffected by interference in the network)

Component 1: How to Rank?  Execution time is divided into fixed "ranking intervals"  The ranking interval is 350,000 cycles  At the end of an interval, each core calculates its L1-MPI and sends it to the Central Decision Logic (CDL)  The CDL is located in the central node of the mesh  The CDL forms a ranking order and sends each core its rank  Two control packets per core every ranking interval  The ranking order is a "partial order"  Rank formation is not on the critical path  The ranking interval is significantly longer than the rank computation time  Cores use older rank values until the new ranking is available
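The CDL step above might look as follows. This is a sketch, not the paper's exact logic: the bucketing of cores into a small number of rank levels (chosen here to match the 3-bit Rank ID introduced later) and all names are assumptions; it captures the "partial order" idea by letting several cores share one rank level.

```python
def form_ranks(l1_mpi_by_core, num_rank_levels=8):
    """Bucket cores into rank levels by L1-MPI: lower L1-MPI means a more
    network-sensitive application, so it receives a numerically lower
    (i.e., higher-priority) rank. Bucketing yields a partial order."""
    # Sort core ids by L1-MPI, lowest (most critical) first.
    ordered = sorted(l1_mpi_by_core, key=l1_mpi_by_core.get)
    bucket = -(-len(ordered) // num_rank_levels)  # ceiling division
    return {core: min(i // bucket, num_rank_levels - 1)
            for i, core in enumerate(ordered)}
```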

Component 2: Batching  Problem: starvation  Prioritizing a higher-ranked application can lead to starvation of lower-ranked applications  Solution: packet batching  Network packets are grouped into finite-sized batches  Packets of older batches are prioritized over younger batches  Alternative batching policies are explored in the paper  Time-based batching  New batches are formed in a periodic, synchronous manner across all nodes in the network, every T cycles
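Time-based batching as described can be sketched in a couple of lines; the wraparound over a small ID space mirrors the 3-bit Batch ID on the next slide, and the function name is illustrative:

```python
def batch_id(injection_cycle, T, num_batch_ids=8):
    """A new batch forms every T cycles, synchronously across all nodes;
    the batch ID wraps around a small (e.g., 3-bit) field."""
    return (injection_cycle // T) % num_batch_ids
```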

Putting it all together  Before a packet is injected into the network, it is tagged with  a Batch ID (3 bits)  a Rank ID (3 bits)  Three-tier priority structure at routers  Oldest batch first (prevents starvation)  Highest rank first (maximizes performance)  Local round-robin (final tie breaker)  Simple hardware support: priority arbiters  Globally coordinated scheduling  The ranking order and batching order are the same across all routers
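The three-tier priority structure can be expressed as a pairwise comparator. This is a minimal sketch assuming packets carry batch, rank, and input_port fields, and that batch IDs have not wrapped around (a real arbiter must compare the 3-bit IDs modularly):

```python
NUM_PORTS = 5  # East, West, North, South, PE

def higher_priority(pkt_a, pkt_b, rr_pointer):
    """Tier 1: oldest batch first (starvation freedom).
    Tier 2: highest rank first (lower rank value = higher priority).
    Tier 3: local round-robin on input port as the final tie breaker."""
    if pkt_a["batch"] != pkt_b["batch"]:
        return pkt_a if pkt_a["batch"] < pkt_b["batch"] else pkt_b
    if pkt_a["rank"] != pkt_b["rank"]:
        return pkt_a if pkt_a["rank"] < pkt_b["rank"] else pkt_b
    # Round-robin: prefer the port closest at or after the RR pointer.
    da = (pkt_a["input_port"] - rr_pointer) % NUM_PORTS
    db = (pkt_b["input_port"] - rr_pointer) % NUM_PORTS
    return pkt_a if da <= db else pkt_b
```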

Outline  Problem: Packet Scheduling  Motivation: Stall Time Criticality of Applications  Solution: Application-Aware Coordinated Policies  Ranking  Batching  Example  Evaluation  System Software Support  Conclusion

STC Scheduling Example. [Figure sequence: packets injected by Core1, Core2, and Core3 over successive injection cycles are grouped into Batch 0, Batch 1, and Batch 2 (batching interval length = 3 cycles), with a given ranking order over the cores. A router scheduler then drains the packets under three policies in turn: Round Robin, Age, and STC. A table compares the resulting average stall cycles of RR, Age, and STC; STC, which follows the ranking order, achieves the lowest.]

Outline  Problem: Packet Scheduling  Motivation: Stall Time Criticality of Applications  Solution: Application-Aware Coordinated Policies  Ranking  Batching  Example  Evaluation  Conclusion

Evaluation Methodology  64-core system  x86 processor model based on the Intel Pentium M  2 GHz processor, 128-entry instruction window  32KB private L1 caches and 1MB-per-core shared L2 caches, 32 miss buffers  4GB DRAM, 320-cycle access latency, 4 on-chip DRAM controllers  Detailed Network-on-Chip model  2-stage routers (with speculation and lookahead routing)  Wormhole switching (8-flit data packets)  Virtual channel flow control (6 VCs, 5-flit buffer depth)  8x8 mesh (128-bit bi-directional channels)  Benchmarks  Multiprogrammed scientific, server, and desktop workloads (35 applications)  96 workload combinations

Qualitative Comparison  Round Robin and Age  Local and application-oblivious  Age is biased toward heavy applications  Heavy applications flood the network  Higher likelihood of an older packet being from a heavy application  Globally Synchronized Frames (GSF) [Lee et al., ISCA 2008]  Provides bandwidth fairness at the expense of system performance  Penalizes heavy and bursty applications  Each application gets an equal, fixed quota of flits (credits) in each batch  A heavy application quickly runs out of credits after injecting into all active batches and stalls until the oldest batch completes and frees up fresh credits  Underutilization of network resources

System Performance  STC provides a 9.1% improvement in weighted speedup over the best existing policy (averaged across 96 workloads)  Detailed case studies in the paper

Enforcing Operating System Priorities  Existing policies cannot enforce operating system (OS) assigned priorities in the Network-on-Chip  The proposed framework can enforce OS-assigned priorities  Weight of applications => ranking of applications  Configurable batching interval based on application weight. [Figure: weighted speedup results with OS-assigned priorities enforced.]

Summary of Other Results  Alternative batching policies  e.g., packet-based batching  Alternative ranking heuristics  e.g., NST/packet, outstanding queue lengths, etc.  Unstable metrics lead to "positive feedback loops"  Alternative local policies within STC  RR, Age, etc.  The local policy used within STC minimally impacts speedups  Sensitivity to the ranking and batching intervals

Outline  Problem: Packet Scheduling  Motivation: Stall Time Criticality of Applications  Solution: Application-Aware Coordinated Policies  Ranking  Batching  Example  Evaluation  Conclusion

Conclusions  Packet scheduling policies critically impact the performance and fairness of NoCs  Existing packet scheduling policies are local and application-oblivious  We propose a new global, application-aware approach to packet scheduling in NoCs  Ranking: differentiates applications based on their Stall Time Criticality (STC)  Batching: avoids starvation due to rank-based prioritization  The proposed framework provides higher system speedup and fairness than all existing policies  The proposed framework can effectively enforce OS-assigned priorities in the network-on-chip

Questions? Thank you!