HAT: Heterogeneous Adaptive Throttling for On-Chip Networks
Kevin Kai-Wei Chang, Rachata Ausavarungnirun, Chris Fallin, Onur Mutlu


Executive Summary
Problem: Packets contend in on-chip networks (NoCs), causing congestion and thus reducing system performance
Approach: Source throttling (temporarily delaying packet injections) to reduce congestion
1) Which applications to throttle?
– Observation: Throttling network-intensive applications leads to higher system performance
– → Key idea 1: Application-aware source throttling
2) How much to throttle?
– Observation: There is no single throttling rate that works well for every application workload
– → Key idea 2: Dynamic throttling rate adjustment
Result: HAT improves both system performance and energy efficiency over state-of-the-art source throttling policies

Outline
Background and Motivation
Mechanism
Comparison Points
Results
Conclusions

On-Chip Networks
Connect cores, caches, memory controllers, etc.
– Buses and crossbars are not scalable
[Figure: processing elements (PE: cores, L2 banks, memory controllers, etc.) linked by an interconnect]

On-Chip Networks
Connect cores, caches, memory controllers, etc.
– Buses and crossbars are not scalable
Packet switched
2D mesh: most commonly used topology
Primarily serve cache misses and memory requests
[Figure: 2D mesh of routers (R), each attached to a processing element (PE)]

Network Congestion Reduces Performance
Network congestion degrades system performance
Limited shared resources (buffers and links) due to design constraints: power, chip area, and timing
[Figure: packets (P) queuing at congested routers in the mesh]

Goal
Improve system performance in a highly congested network
Observation: Reducing network load (the number of packets in the network) decreases network congestion, hence improves system performance
Approach: Source throttling (temporarily delaying new traffic injection) to reduce network load

Source Throttling
Long network latency when the network is congested
[Figure: packets backed up between a processing element and the congested network]

Source Throttling
Throttling makes some packets wait longer to inject
Average network throughput increases, hence higher system performance
[Figure: throttled packets held at the source while the network drains]

Key Questions of Source Throttling
Every cycle in which a node has a packet to inject, source throttling blocks the packet with probability P
– We call P the "throttling rate" (it ranges from 0 to 1)
The throttling rate can be set independently on every node
Two key questions:
1. Which applications to throttle?
2. How much to throttle?
Naïve mechanism: throttle every single node with a constant throttling rate
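To make the mechanism concrete, here is a minimal Python sketch of this per-cycle probabilistic injection check; the InjectionPort class and the network.inject() call are illustrative stand-ins, not the paper's hardware design.

```python
import random

class InjectionPort:
    """Minimal sketch of probabilistic source throttling at one node."""

    def __init__(self, throttling_rate=0.0):
        self.throttling_rate = throttling_rate  # P, in [0, 1]

    def try_inject(self, packet, network):
        # Called every cycle in which this node has a packet ready to inject.
        if random.random() < self.throttling_rate:
            return False  # blocked this cycle; the packet retries next cycle
        return network.inject(packet)  # may still stall if buffers/links are busy
```

With throttling_rate = 0 every injection attempt passes through; higher rates delay injections with probability P, which is exactly the per-node knob the two questions above tune.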

Key Observation #1
Throttling network-intensive applications leads to higher system performance
Configuration: 16-node system, 4x4 mesh network, 8 copies of gromacs (network-non-intensive) and 8 copies of mcf (network-intensive)
Throttling gromacs decreases system performance by 2% due to minimal network load reduction

Key Observation #1 (continued)
Throttling mcf increases system performance by 9% (gromacs: +14%, mcf: +5%) due to reduced congestion

Key Observation #1 (continued)
Throttling mcf reduces congestion; gromacs benefits more from the lower network latency
Throttling network-intensive applications reduces congestion and benefits both network-non-intensive and network-intensive applications

Key Observation #2
There is no single throttling rate that works well for every application workload
The network runs best at or below a certain network load
Dynamically adjust the throttling rate to avoid overload and under-utilization
[Figure: a mixed gromacs/mcf workload throttled at rate 0.8]

Outline
Background and Motivation
Mechanism
Comparison Points
Results
Conclusions

Heterogeneous Adaptive Throttling (HAT)
1. Application-aware throttling: throttle network-intensive applications that interfere with network-non-intensive applications
2. Network-load-aware throttling rate adjustment: dynamically adjust the throttling rate to adapt to different workloads and program phases


Application-Aware Throttling
1. Measure applications' network intensity
– Use L1 MPKI (misses per thousand instructions) to estimate network intensity; applications with higher L1 MPKI are more network-intensive
2. Throttle network-intensive applications
– How to select unthrottled applications? Leaving too many applications unthrottled overloads the network
– → Select unthrottled applications so that their total network intensity (Σ MPKI) stays below a threshold reflecting the total network capacity; throttle the rest
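A minimal sketch of this selection step, assuming per-application L1 MPKI measurements from the last epoch; the function name and the capacity_threshold parameter are illustrative, not taken from the paper.

```python
def classify_applications(mpki_per_app, capacity_threshold):
    """Return the set of unthrottled (network-non-intensive) applications.

    Applications are admitted in order of increasing L1 MPKI until admitting
    the next one would push the cumulative MPKI past the threshold; all
    remaining (higher-MPKI) applications are throttled.
    """
    unthrottled, total_mpki = set(), 0.0
    for app, mpki in sorted(mpki_per_app.items(), key=lambda kv: kv[1]):
        if total_mpki + mpki > capacity_threshold:
            break
        unthrottled.add(app)
        total_mpki += mpki
    return unthrottled

# Example: with an illustrative threshold of 40 MPKI, the gromacs-like
# application stays unthrottled while the mcf-like one is throttled.
# classify_applications({"gromacs": 3.0, "mcf": 60.0}, 40.0) -> {"gromacs"}
```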


Dynamic Throttling Rate Adjustment
Different workloads require different throttling rates to avoid overloading the network
But network load (the fraction of occupied buffers/links) is an accurate indicator of congestion
Key idea: measure the current network load and dynamically adjust the throttling rate based on that load
– If network load > target: increase the throttling rate (the network is congested: throttle more)
– Else: decrease the throttling rate (the network is not congested: avoid unnecessary throttling)
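The same update rule as a short Python sketch; the step size and the upper bound on the rate are illustrative assumptions (the actual step values are in the paper's "Throttling Rate Steps" backup material, which survives here only as a slide title).

```python
def adjust_throttling_rate(rate, network_load, target_load,
                           step=0.05, max_rate=0.95):
    """One network-load-aware update of the throttling rate."""
    if network_load > target_load:
        rate += step  # network is congested: throttle more
    else:
        rate -= step  # network is not congested: avoid unnecessary throttling
    return min(max(rate, 0.0), max_rate)  # keep the rate within bounds
```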


Epoch-Based Operation
Application classification and throttling rate adjustment are expensive if done every cycle
Solution: recompute at epoch granularity (epochs of 100K cycles)
During an epoch, every node: 1) measures its L1 MPKI, and 2) measures the network load
At the beginning of each epoch, all nodes send the measured info to a central controller, which: 1) classifies applications, 2) adjusts the throttling rate, and 3) sends the new classification and throttling rate to each node
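Combining the two sketches above, here is a hedged outline of the controller's per-epoch work; the node interface (measured_mpki, measured_network_load, set_policy) is hypothetical and stands in for the per-node hardware counters described on the Implementation slide.

```python
def on_epoch_boundary(nodes, state, capacity_threshold, target_load):
    """Work done by the central controller at each 100K-cycle epoch boundary."""
    # 1) Collect each node's measurements from the epoch that just ended.
    mpki = {node.id: node.measured_mpki() for node in nodes}
    loads = [node.measured_network_load() for node in nodes]
    avg_load = sum(loads) / len(loads)

    # 2) Reclassify applications and adjust the global throttling rate,
    #    reusing classify_applications() and adjust_throttling_rate() above.
    unthrottled = classify_applications(mpki, capacity_threshold)
    state.rate = adjust_throttling_rate(state.rate, avg_load, target_load)

    # 3) Send the new classification and rate back to every node.
    for node in nodes:
        node.set_policy(throttled=(node.id not in unthrottled),
                        rate=state.rate)
```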

Putting It Together: Key Contributions
1. Application-aware throttling
– Throttle network-intensive applications based on applications' network intensities
2. Network-load-aware throttling rate adjustment
– Dynamically adjust throttling rate based on network load to avoid overloading the network
HAT is the first work to combine application-aware throttling and network-load-aware rate adjustment

Outline
Background and Motivation
Mechanism
Comparison Points
Results
Conclusions

Comparison Points
Source throttling for bufferless NoCs [Nychis+ HotNets'10, SIGCOMM'12]
– Throttle network-intensive applications when other applications cannot inject
– Does not take network load into account
– We call this "Heterogeneous Throttling"
Source throttling for buffered networks [Thottethodi+ HPCA'01]
– Throttle every application when the network load exceeds a dynamically tuned threshold
– Not application-aware
– Fully blocks packet injections while throttling
– We call this "Self-Tuned Throttling"

Outline
Background and Motivation
Mechanism
Comparison Points
Results
Conclusions

Methodology
Chip multiprocessor simulator
– 64-node multi-core systems with a 2D-mesh topology
– Closed-loop core/cache/NoC cycle-level model
– 64KB L1, perfect L2 (always hits, to stress the NoC)
Router designs
– Virtual-channel buffered router: 4 VCs, 4 flits/VC [Dally+ IEEE TPDS'92]; input buffers hold contending packets
– Bufferless deflection router: BLESS [Moscibroda+ ISCA'09]; misroutes (deflects) contending packets

Methodology
Workloads
– 60 multi-core workloads of SPEC CPU2006 benchmarks
– 4 workload categories based on the network intensity of applications (Low/Medium/High)
Metrics
– System performance
– Fairness
– Energy efficiency
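The metric formulas on this slide were figures and are not recoverable from the transcript; the sketch below uses the definitions conventionally used in this literature (weighted speedup for system performance, maximum slowdown for unfairness), which is an assumption rather than a transcription of the slide. Weighted speedup (WS) is at least consistent with the sensitivity results later in the deck, which are reported in terms of WS.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    # System performance (assumed): WS = sum_i IPC_i(shared) / IPC_i(alone)
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def maximum_slowdown(ipc_shared, ipc_alone):
    # Unfairness (assumed): max_i IPC_i(alone) / IPC_i(shared); lower is better
    return max(a / s for s, a in zip(ipc_shared, ipc_alone))
```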

Performance: Bufferless NoC (BLESS)
1. HAT provides better performance improvement than state-of-the-art throttling approaches
2. Highest improvement on heterogeneous workloads
– L and M workloads are more sensitive to network latency
[Chart: performance gains by workload category; recoverable labels: +7.4%, +9.4%, +8.6%]

Performance: Buffered NoC
HAT provides better performance improvement than prior approaches
[Chart: recoverable label: +3.5%]

Application Fairness
HAT provides better fairness than prior works
[Chart: recoverable labels: -15%, -5%]

Network Energy Efficiency
HAT increases energy efficiency by reducing network load by 4.7% (BLESS) or 0.5% (buffered)
[Chart: recoverable label: +5%]

Other Results in Paper
Performance on CHIPPER [Fallin+ HPCA'11]
– HAT improves system performance
Performance on multithreaded workloads
– HAT is not designed for multithreaded workloads, but it slightly improves system performance
Parameter sensitivity sweep of HAT
– HAT provides consistent system performance improvement on different network sizes

Conclusions
Problem: Packets contend in on-chip networks (NoCs), causing congestion and thus reducing system performance
Approach: Source throttling (temporarily delaying packet injections) to reduce congestion
1) Which applications to throttle?
– Observation: Throttling network-intensive applications leads to higher system performance
– → Key idea 1: Application-aware source throttling
2) How much to throttle?
– Observation: There is no single throttling rate that works well for every application workload
– → Key idea 2: Dynamic throttling rate adjustment
Result: HAT improves both system performance and energy efficiency over state-of-the-art source throttling policies

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks
Kevin Kai-Wei Chang, Rachata Ausavarungnirun, Chris Fallin, Onur Mutlu

Throttling Rate Steps

Network Sizes

Performance on CHIPPER/BLESS
Injection rate (number of injected packets per cycle): +8.5% (BLESS) or +4.8% (CHIPPER)

Multithreaded Workloads

Motivation
[Chart: recoverable labels: 90%, 94%]

Sensitivity to Other Parameters
Network load target: WS peaks between 50% and 65% network utilization
– WS drops by more than 10% beyond that range
Epoch length: WS varies by less than 1% between 20K and 1M cycles
Low-intensity workloads: HAT does not impact system performance
Unthrottled network intensity threshold:

Implementation
L1 MPKI: one L1-miss hardware counter and one instruction hardware counter per node
Network load: one hardware counter to monitor the number of flits routed to neighboring nodes
Computation: done at one central CPU node
– Takes at most several thousand cycles (a few percent overhead for a 100K-cycle epoch)