18-742 Fall 2012 Parallel Computer Architecture, Lecture 21: Interconnects IV. Prof. Onur Mutlu, Carnegie Mellon University, 10/29/2012

New Review Assignments
Were due: Sunday, October 28, 11:59pm:
 Das et al., "Aergia: Exploiting Packet Latency Slack in On-Chip Networks," ISCA 2010.
 Dennis and Misunas, "A Preliminary Architecture for a Basic Data Flow Processor," ISCA 1974.
Due: Tuesday, October 30, 11:59pm:
 Arvind and Nikhil, "Executing a Program on the MIT Tagged-Token Dataflow Architecture," IEEE TC 1990.
Due: Thursday, November 1, 11:59pm:
 Patt et al., "HPS, a new microarchitecture: rationale and introduction," MICRO 1985.
 Patt et al., "Critical issues regarding HPS, a high performance microarchitecture," MICRO 1985.

Other Readings
Dataflow:
 Gurd et al., "The Manchester prototype dataflow computer," CACM 1985.
 Lee and Hurson, "Dataflow Architectures and Multithreading," IEEE Computer 1994.
Restricted Dataflow:
 Patt et al., "HPS, a new microarchitecture: rationale and introduction," MICRO 1985.
 Patt et al., "Critical issues regarding HPS, a high performance microarchitecture," MICRO 1985.
 Sankaralingam et al., "Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture," ISCA 2003.
 Burger et al., "Scaling to the End of Silicon with EDGE Architectures," IEEE Computer 2004.

Project Milestone I Meetings
Please come to office hours for feedback on:
 Your progress
 Your presentation

Last Lectures
 Transactional memory (brief)
 Interconnect wrap-up
 Project Milestone I presentations

Today
 More on interconnect research
 Start dataflow

Research in Interconnects

Research Topics in Interconnects
Plenty of topics in interconnection networks. Examples:
 Energy/power-efficient and proportional design
 Reducing complexity: simplified router and protocol designs
 Adaptivity: ability to adapt to different access patterns
 QoS and performance isolation: reducing and controlling interference, admission control
 Co-design of NoCs with other shared resources: end-to-end performance, QoS, power/energy optimization
 Scalable topologies for many cores and heterogeneous systems
 Fault tolerance
 Request prioritization, priority inversion, coherence, ...
 New technologies (optical, 3D)

Packet Scheduling
Which packet to choose for a given output port? The router needs to prioritize between competing flits:
 Which input port?
 Which virtual channel?
 Which application's packet?
Common strategies:
 Round robin across virtual channels
 Oldest packet first (or an approximation)
 Prioritize some virtual channels over others
Better policies in a multi-core environment use application characteristics, as sketched below.
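As a concrete illustration of the "oldest packet first" strategy above, here is a minimal C++ sketch of an output-port arbiter. The flit descriptor and its field names are hypothetical, not from the lecture or any specific router design.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical flit descriptor; field names are illustrative only.
struct FlitCandidate {
    int input_port;        // input port holding this flit
    int virtual_channel;   // VC within that input port
    uint64_t inject_cycle; // cycle the packet entered the network
};

// "Oldest packet first": among flits competing for one output port,
// grant the flit whose packet was injected earliest.
int pickOldestFirst(const std::vector<FlitCandidate>& candidates) {
    int winner = -1;
    uint64_t oldest = UINT64_MAX;
    for (int i = 0; i < static_cast<int>(candidates.size()); ++i) {
        if (candidates[i].inject_cycle < oldest) {
            oldest = candidates[i].inject_cycle;
            winner = i;
        }
    }
    return winner; // index of the granted flit, or -1 if none compete
}
```

Real routers approximate this with a few timestamp bits rather than a full cycle count, which is why the slide says "or an approximation": carrying a 64-bit timestamp per packet is impractical.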

Application-Aware Packet Scheduling Das et al., “Application-Aware Prioritization Mechanisms for On-Chip Networks,” MICRO 2009.

The Problem: Packet Scheduling
(Figure sequence: an on-chip network connects processors, L2 cache banks, memory controllers, and an accelerator through routers; the Network-on-Chip is a critical resource shared by multiple applications, App1 through AppN. Each router has buffered input ports from East, West, North, South, and the local processing element, virtual channels VC0-VC2 per port, a 5x5 crossbar, a routing unit (RC), a VC allocator (VA), and a switch allocator (SA). Conceptually, the buffered packets belong to different applications (App1-App8), and the scheduler must decide: which packet to choose?)

 Existing scheduling policies:
  Round Robin
  Age
 Problem 1: local to a router. This leads to contradictory decision-making between routers: packets from one application may be prioritized at one router, only to be delayed at the next.
 Problem 2: application-oblivious. All applications' packets are treated equally, but applications are heterogeneous.
 Solution: application-aware global scheduling policies.

Motivation: Stall-Time Criticality
 Applications are not homogeneous: they have different criticality with respect to the network.
  Some applications are network-latency sensitive.
  Some applications are network-latency tolerant.
 An application's stall-time criticality (STC) can be measured by its average network stall time per packet (i.e., NST/packet).
 Network stall time (NST) is the number of cycles the processor stalls waiting for network transactions to complete.
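Written as a formula, the metric from this slide is simply average network stall time per packet:

```latex
\mathrm{STC} \;=\; \frac{\mathrm{NST}}{\#\,\text{packets}}
\;=\; \frac{\text{cycles the processor stalls on network transactions}}{\text{number of network packets}}
```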

Motivation: Stall-Time Criticality
Why do applications have different network stall-time criticality (STC)?
 Memory-level parallelism (MLP): lower MLP leads to higher STC.
 Shortest-job-first principle (SJF): lower network load leads to higher STC.
 Average memory access time: higher memory access time leads to higher STC.

STC Principle 1 {MLP}
(Figure: with high MLP, packet latencies overlap with compute and with each other, so the stall caused by the red packet is 0; with low MLP, the application stalls for nearly the full latency of each packet.)
 Observation 1: packet latency != network stall time.
 Observation 2: a low-MLP application's packets have higher criticality than a high-MLP application's.

STC Principle 2 {Shortest-Job-First}
(Figure: a light application and a heavy application, each also shown running alone. Under baseline round-robin scheduling the light application suffers a 4X network slowdown and the heavy application 1.3X; under SJF scheduling the light application's slowdown drops to 1.2X while the heavy application's rises to 1.6X. Overall system throughput (weighted speedup) increases by 34%.)

Solution: Application-Aware Policies
 Idea: identify stall-time-critical applications (i.e., network-sensitive applications) and prioritize their packets in each router.
 Key components of the scheduling policy:
  Application ranking
  Packet batching
 Propose a low-hardware-complexity solution.

Component 1: Ranking
 Ranking distinguishes applications based on stall-time criticality (STC): periodically rank applications by STC.
 Many heuristics for quantifying STC were explored (details & analysis in the paper); a heuristic based on the outermost private cache's misses per instruction (L1-MPI) is the most effective: low L1-MPI => high STC => higher rank.
 Why misses per instruction (L1-MPI)?
  Easy to compute (low complexity)
  Stable metric (unaffected by interference in the network)

Component 1: How to Rank?
 Execution time is divided into fixed "ranking intervals" of 350,000 cycles.
 At the end of an interval, each core calculates its L1-MPI and sends it to the Central Decision Logic (CDL), located in the central node of the mesh.
 The CDL forms a ranking order and sends each core its rank: two control packets per core every ranking interval.
 The ranking order is a "partial order".
 Rank formation is not on the critical path: the ranking interval is significantly longer than the rank computation time, and cores use older rank values until the new ranking is available.
A sketch of this interval-based ranking step follows.
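The following C++ sketch shows the per-interval ranking computation under stated assumptions: hypothetical per-core counters for L1 misses and retired instructions, and a total order for simplicity, whereas the actual scheme quantizes ranks into a partial order (the 3-bit rank levels mentioned on the next slides).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-core counter snapshot sent to the CDL.
struct CoreCounters {
    int core_id;           // assumed to lie in [0, num_cores)
    uint64_t l1_misses;    // L1 misses in the last ranking interval
    uint64_t instructions; // instructions retired in the interval
};

// Order cores by L1-MPI: lower MPI => more stall-time critical =>
// higher rank (smaller rank number).
std::vector<int> computeRanks(std::vector<CoreCounters> cores) {
    std::sort(cores.begin(), cores.end(),
              [](const CoreCounters& a, const CoreCounters& b) {
                  // a.mpi < b.mpi  <=>  a.misses*b.instr < b.misses*a.instr
                  // (cross-multiplication avoids floating point; a real
                  // implementation would guard against overflow)
                  return a.l1_misses * b.instructions <
                         b.l1_misses * a.instructions;
              });
    std::vector<int> rank_of_core(cores.size());
    for (size_t r = 0; r < cores.size(); ++r)
        rank_of_core[cores[r].core_id] = static_cast<int>(r); // 0 = highest
    return rank_of_core;
}
```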

Component 2: Batching
 Problem: starvation. Prioritizing a higher-ranked application can lead to starvation of lower-ranked applications.
 Solution: packet batching.
  Network packets are grouped into finite-sized batches.
  Packets of older batches are prioritized over younger batches.
 Alternative batching policies are explored in the paper.
 Time-based batching: new batches are formed in a periodic, synchronous manner across all nodes in the network, every T cycles.

Putting It All Together
 Before a packet is injected into the network, it is tagged with
  a Batch ID (3 bits) and
  a Rank ID (3 bits).
 Three-tier priority structure at routers (sketched below):
  Oldest batch first (prevents starvation)
  Highest rank first (maximizes performance)
  Local round-robin (final tie-breaker)
 Simple hardware support: priority arbiters.
 Globally coordinated scheduling: the ranking order and batching order are the same across all routers.
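A minimal sketch of the three-tier comparison, assuming the 3-bit batch and rank IDs above. The wrap-around handling of batch IDs, which a real router needs since 3-bit IDs recycle, is deliberately simplified here.

```cpp
#include <cstdint>

// Priority tag carried in the packet header (widths from the slide).
struct PacketTag {
    uint8_t batch; // 0..7; assigned at injection, e.g. (cycle / T) % 8
    uint8_t rank;  // 0..7; rank 0 = most stall-time-critical application
};

// Returns true if packet a should be scheduled before packet b.
bool higherPriority(const PacketTag& a, const PacketTag& b) {
    if (a.batch != b.batch) return a.batch < b.batch; // oldest batch first
    if (a.rank  != b.rank)  return a.rank  < b.rank;  // highest rank first
    return false; // tie: local round-robin among equal-priority packets
}
```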

STC Scheduling Example
(Figure sequence: packets injected by Core1, Core2, and Core3 over successive cycles are grouped into Batch 0, Batch 1, and Batch 2, with a batching interval of 3 cycles and a fixed ranking order. The packets are then scheduled at a router under round-robin, age-based, and STC policies; comparing average stall cycles, STC, which follows the batch and ranking order, yields the lowest average stall.)

Qualitative Comparison
 Round Robin & Age:
  Local and application-oblivious.
  Age is biased towards heavy applications: heavy applications flood the network, so there is a higher likelihood that an older packet is from a heavy application.
 Globally Synchronized Frames (GSF) [Lee et al., ISCA 2008]:
  Provides bandwidth fairness at the expense of system performance; penalizes heavy and bursty applications.
  Each application gets an equal, fixed quota of flits (credits) in each batch.
  Heavy applications quickly run out of credits after injecting into all active batches and stall until the oldest batch completes and frees up fresh credits.
  This underutilizes network resources.

System Performance
 STC provides a 9.1% improvement in weighted speedup over the best existing policy (averaged across 96 workloads).
 Detailed case studies are in the paper.

Slack-Driven Packet Scheduling
Das et al., "Aergia: Exploiting Packet Latency Slack in On-Chip Networks," ISCA 2010.

Packet Scheduling in NoC
 Existing scheduling policies:
  Round robin
  Age
 Problem: they treat all packets equally; they are application-oblivious.
 But packets have different criticality: a packet is critical if its latency affects the application's performance, and criticality differs due to memory-level parallelism (MLP).
All packets are not the same!

MLP Principle
(Figure: with MLP, one packet's latency overlaps with compute and with other outstanding packets, so its stall is 0.)
 Packet latency != network stall time.
 Different packets have different criticality due to MLP: the packet whose latency is hidden by overlap is less critical than the one the processor stalls on.

Outline
 Introduction
 Packet Scheduling
 Memory Level Parallelism
 Aergia
 Concept of Slack
 Estimating Slack
 Evaluation
 Conclusion

What is Aergia?
 Aergia is the spirit of laziness in Greek mythology.
 Some packets can afford to slack!

Outline
 Introduction
 Packet Scheduling
 Memory Level Parallelism
 Aergia
 Concept of Slack
 Estimating Slack
 Evaluation
 Conclusion

Slack of Packets
 What is the slack of a packet? The slack of a packet is the number of cycles it can be delayed in a router without (significantly) reducing the application's performance.
  This is local network slack.
 Source of slack: memory-level parallelism (MLP). The latency of an application's packet can be hidden from the application due to overlap with the latency of pending cache-miss requests.
 Idea: prioritize packets with lower slack.

Concept of Slack
(Figure: two load misses are outstanding in the instruction window; one packet returns earlier than necessary, because compute continues until the longer-latency packet returns.)
Slack(early packet) = Latency(late packet) – Latency(early packet) = 26 – 6 = 20 hops.
The early packet can be delayed for the available slack cycles without reducing performance!

Prioritizing Using Slack
(Figure: load misses at Core A and Core B inject packets that interfere 3 hops into the network; since their slacks differ, the router prioritizes the packet with less slack.)

Slack in Applications
(Figure: slack distribution for one application: 50% of its packets have 350+ slack cycles and are non-critical, while 10% of its packets have <50 slack cycles and are critical.)

Slack in Applications
(Figure: in another application, 68% of packets have zero slack cycles.)

Diversity in Slack
(Figure: slack distributions across applications.)
 Slack varies between packets of different applications.
 Slack varies between packets of a single application.

Outline
 Introduction
 Packet Scheduling
 Memory Level Parallelism
 Aergia
 Concept of Slack
 Estimating Slack
 Evaluation
 Conclusion

Estimating Slack Priority
 Slack(P) = max(latencies of P's predecessors) – latency of P.
 Predecessors(P) are the packets of outstanding cache-miss requests when P is issued.
 Packet latencies are not known when a packet is issued, so the latency of any packet Q is predicted:
  higher latency if Q corresponds to an L2 miss;
  higher latency if Q has to travel a larger number of hops.

Estimating Slack Priority
 Slack of P = maximum predecessor latency – latency of P.
 Slack(P) is approximated with three header fields:
  PredL2 (2 bits): set if any predecessor packet is servicing an L2 miss.
  MyL2 (1 bit): set if P is NOT servicing an L2 miss.
  HopEstimate (2 bits): max(# of hops of predecessors) – hops of P.
A sketch of this encoding follows.
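A C++ sketch of this 5-bit encoding; the bit-field layout and the packing helper are illustrative assumptions, not the paper's exact hardware.

```cpp
#include <cstdint>

// Slack estimate carried per packet (field widths from the slide).
// For every field, a larger value means more slack, so a smaller
// concatenated value means a more critical packet.
struct SlackPriority {
    uint8_t pred_l2 : 2;      // set if a predecessor is servicing an L2 miss
    uint8_t my_l2 : 1;        // set if this packet is NOT an L2 miss
    uint8_t hop_estimate : 2; // max(predecessor hops) - own hops, saturated
};

// Concatenate the fields (most significant first) into a 5-bit value
// that routers can compare directly: lower value = less slack.
uint8_t slackBits(const SlackPriority& p) {
    return static_cast<uint8_t>((p.pred_l2 << 3) | (p.my_l2 << 2) |
                                p.hop_estimate);
}
```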

Estimating Slack Priority
 How to predict an L2 hit or miss at the core?
  Global-branch-predictor-based L2 miss predictor: uses a pattern history table and 2-bit saturating counters.
  Threshold-based L2 miss predictor: if the number of L2 misses in the last "M" misses >= threshold "T", the next load is predicted to be an L2 miss (sketched after this list).
 Number of miss predecessors? Tracked via a list of outstanding L2 misses.
 Hops estimate? Hops => ΔX + ΔY distance; the predecessor list is used to calculate the slack hop estimate.
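A sketch of the threshold-based predictor described above, assuming a simple sliding window; M and T are the slide's parameters, and the bookkeeping is illustrative.

```cpp
#include <cstddef>
#include <deque>

// Predict "next load is an L2 miss" if at least T of the last M
// L1 misses also missed in the L2.
class ThresholdL2Predictor {
public:
    ThresholdL2Predictor(std::size_t m, std::size_t t) : M(m), T(t) {}

    // Record the L2 outcome of an L1 miss as it resolves.
    void recordMiss(bool was_l2_miss) {
        history.push_back(was_l2_miss);
        if (was_l2_miss) ++l2_count;
        if (history.size() > M) {          // keep only the last M outcomes
            if (history.front()) --l2_count;
            history.pop_front();
        }
    }

    bool predictNextLoadMissesL2() const { return l2_count >= T; }

private:
    std::size_t M, T;
    std::size_t l2_count = 0;
    std::deque<bool> history; // outcomes of the last M L1 misses
};
```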

Starvation Avoidance
 Problem: starvation. Prioritizing packets can lead to starvation of lower-priority packets.
 Solution: time-based packet batching.
  New batches are formed every T cycles.
  Packets of older batches are prioritized over younger batches.

Putting It All Together
 The packet header is tagged with priority bits before injection:
Priority(P) = {Batch (3 bits), PredL2 (2 bits), MyL2 (1 bit), HopEstimate (2 bits)}
 Priority order:
  P's batch (highest priority)
  P's slack
  Local round-robin (final tie-breaker)
A sketch of this comparison follows.
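Putting the header fields together, here is a sketch of the router-side comparison (encoding details assumed, as before):

```cpp
#include <cstdint>

// Full Aergia header tag: batch for starvation freedom, then the
// 5 slack bits (PredL2 | MyL2 | HopEstimate) for performance.
struct AergiaTag {
    uint8_t batch : 3; // older batch = smaller value = higher priority
    uint8_t slack : 5; // smaller value = less slack = higher priority
};

bool scheduleBefore(const AergiaTag& a, const AergiaTag& b) {
    if (a.batch != b.batch) return a.batch < b.batch; // oldest batch first
    if (a.slack != b.slack) return a.slack < b.slack; // lowest slack first
    return false; // tie: local round-robin decides
}
```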

Outline
 Introduction
 Packet Scheduling
 Memory Level Parallelism
 Aergia
 Concept of Slack
 Estimating Slack
 Evaluation
 Conclusion

Evaluation Methodology
 64-core system:
  x86 processor model based on Intel Pentium M; 2 GHz processors, 128-entry instruction window.
  32 KB private L1 and 1 MB per-core shared L2 caches, 32 miss buffers.
  4 GB DRAM, 320-cycle access latency, 4 on-chip DRAM controllers.
 Detailed Network-on-Chip model:
  2-stage routers (with speculation and lookahead routing).
  Wormhole switching (8-flit data packets).
  Virtual channel flow control (6 VCs, 5-flit buffer depth).
  8x8 mesh (128-bit bidirectional channels).
 Benchmarks:
  Multiprogrammed scientific, server, and desktop workloads (35 applications).
  96 workload combinations.

Qualitative Comparison
 Round Robin & Age: local and application-oblivious; Age is biased towards heavy applications.
 Globally Synchronized Frames (GSF) [Lee et al., ISCA 2008]: provides bandwidth fairness at the expense of system performance; penalizes heavy and bursty applications.
 Application-Aware Prioritization Policies (SJF) [Das et al., MICRO 2009]: shortest-job-first principle; packet scheduling policies that prioritize network-sensitive applications, which inject lower load.

System Performance
 SJF provides an 8.9% improvement in weighted speedup.
 Aergia improves system throughput by 10.3%.
 Aergia+SJF improves system throughput by 16.1%.

Network Unfairness
 SJF does not degrade network fairness.
 Aergia improves network unfairness by 1.5X.
 SJF+Aergia improves network unfairness by 1.3X.

Conclusions & Future Directions
 Packets have different criticality, yet existing packet scheduling policies treat all packets equally.
 We propose a new approach to packet scheduling in NoCs:
  We define slack as a key measure that characterizes the relative importance of a packet.
  We propose Aergia, a novel architecture to accelerate low-slack, critical packets.
 Results:
  Improves system performance by 16.1%.
  Improves network fairness by 30.8%.

Express-Cube Topologies
Grot et al., "Express Cube Topologies for On-Chip Interconnects," HPCA 2009.

2-D Mesh
(Figure: a 2-D mesh topology.)
 Pros:
  Low design & layout complexity
  Simple, fast routers
 Cons:
  Large diameter
  Energy & latency impact

Concentration (Balfour & Dally, ICS '06)
 Pros:
  Multiple terminals attached to a router node
  Fast nearest-neighbor communication via the crossbar
  Hop-count reduction proportional to the concentration degree
 Cons:
  Benefits limited by crossbar complexity

Concentration
 Side effects:
  Fewer channels
  Greater channel width

Replication (CMesh-X2)
 Benefits:
  Restores bisection channel count
  Restores channel width
  Reduced crossbar complexity

Flattened Butterfly (Kim et al., MICRO '07)
 Objectives:
  Improve connectivity
  Exploit the wire budget

Flattened Butterfly (Kim et al., MICRO '07)
(Figure sequence: each router is connected to every other router in its row and column via express channels.)

Flattened Butterfly (Kim et al., MICRO '07)
 Pros:
  Excellent connectivity
  Low diameter: 2 hops
 Cons:
  High channel count: k^2/2 per row/column (see the note below)
  Low channel utilization
  Increased control (arbitration) complexity
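A quick way to see the channel-count figure above (my arithmetic, not from the slides): a flattened butterfly fully connects the k routers within each row and each column, so each row needs

```latex
\binom{k}{2} \;=\; \frac{k(k-1)}{2} \;\approx\; \frac{k^2}{2}
```

bidirectional channels. This grows quadratically with the per-dimension radix k, which motivates MECS's more scalable k channels per row.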

Multidrop Express Channels (MECS)
 Objectives:
  Connectivity
  More scalable channel count
  Better channel utilization

Multidrop Express Channels (MECS)
(Figure sequence: each router drives one express channel per direction that makes multiple drop-off stops at the routers ahead of it, giving one-to-many connectivity in each row and column.)

Multidrop Express Channels (MECS)
 Pros:
  One-to-many topology
  Low diameter: 2 hops
  k channels per row/column
 Cons:
  Asymmetric
  Increased control (arbitration) complexity

Partitioning: a GEC Example
(Figure: MECS, MECS-X2, Flattened Butterfly, and Partitioned MECS organizations.)

Analytical Comparison
(Table: CMesh, FBfly, and MECS compared on network size, radix (concentrated), diameter, channel count, channel width, router inputs, and router outputs.)

Experimental Methodology
 Topologies: Mesh, CMesh, CMesh-X2, FBfly, MECS, MECS-X2
 Network sizes: 64 & 256 terminals
 Routing: DOR, adaptive
 Messages: 64 & 576 bits
 Synthetic traffic: uniform random, bit complement, transpose, self-similar
 PARSEC benchmarks: Blackscholes, Bodytrack, Canneal, Ferret, Fluidanimate, Freqmine, Vip, x264
 Full-system config: M5 simulator, Alpha ISA, 64 OOO cores
 Energy evaluation: Orion + CACTI 6

Results
(Figures: 64 nodes under uniform random traffic; 256 nodes under uniform random traffic; energy for 100K packets under uniform random traffic; 64 nodes on PARSEC benchmarks.)

Summary
 MECS:
  A new one-to-many topology
  Good fit for planar substrates
  Excellent connectivity
  Effective wire utilization
 Generalized Express Cubes:
  Framework & taxonomy for NOC topologies
  Extension of the k-ary n-cube model
  Useful for understanding and exploring on-chip interconnect options
  Future: expand & formalize

Kilo-NoC: Topology-Aware QoS
Grot et al., "Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees," ISCA 2011.

Motivation
Extreme-scale chip-level integration:
 Cores
 Cache banks
 Accelerators
 I/O logic
 Network-on-chip (NOC)
Tens to hundreds of cores today; on the order of a thousand on-chip assets in the near future.

Kilo-NOC Requirements
 High efficiency: area, energy.
 Good performance.
 Strong service guarantees (QoS).

Topology-Aware QoS
 Problem: QoS support in each router is expensive (in terms of buffering, arbitration, and bookkeeping).
  E.g., Grot et al., "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip," MICRO 2009.
 Goal: provide QoS guarantees at low area and power cost.
 Idea:
  Isolate shared resources in a region of the network and support QoS within that area.
  Design the topology so that applications can access the region without interference.

Baseline QOS-Enabled CMP
(Figure: multiple VMs sharing a die, with shared resources (e.g., memory controllers), VM-private resources (cores, caches), and a QOS-enabled router at every node.)

Conventional NOC QOS
Contention scenarios:
 Shared resources: memory access.
 Intra-VM traffic: shared cache access.
 Inter-VM traffic: VM page sharing.
Challenge: network-wide guarantees without network-wide QOS support.

Kilo-NOC QOS
 Insight: leverage rich network connectivity to
  naturally reduce interference among flows, and
  limit the extent of hardware QOS support.
 Requires a low-diameter topology.
 This work: Multidrop Express Channels (MECS) [Grot et al., HPCA 2009].

Topology-Aware QOS
 Dedicated, QOS-enabled regions; the rest of the die is QOS-free.
 A richly connected topology provides traffic isolation.
 Special routing rules manage interference.

Kilo-NOC View
 Topology-aware QOS support: limits QOS complexity to a fraction of the die.
 Optimized flow control: reduces buffer requirements in QOS-free regions.

Parameters:
 Technology: 15 nm
 Vdd: 0.7 V
 System: 1024 tiles; 256 concentrated nodes (64 shared resources)
 Networks:
  MECS+PVC: VC flow control, QOS support (PVC) at each node
  MECS+TAQ: VC flow control, QOS support only in shared regions
  MECS+TAQ+EB: EB flow control outside of shared regions; separate request and reply networks
  K-MECS: proposed organization, TAQ + hybrid flow control

Kilo-NOC: a heterogeneous NOC architecture for kilo-node substrates  Topology-aware QOS  Limits QOS support to a fraction of the die  Leverages low-diameter topologies  Improves NOC area- and energy-efficiency  Provides strong guarantees 102