Circuit-Switched Coherence Natalie Enright Jerger, Li-Shiuan Peh +, Mikko Lipasti *University of Wisconsin - Madison + Princeton University 2 nd IEEE.

Slides:

Advertisements

Similar presentations

Prof. Natalie Enright Jerger

Advertisements

Quality-of-Service Routing in IP Networks Donna Ghosh, Venkatesh Sarangan, and Raj Acharya IEEE TRANSACTIONS ON MULTIMEDIA JUNE 2001.

Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison.

A Novel 3D Layer-Multiplexed On-Chip Network

To Include or Not to Include? Natalie Enright Dana Vantrease.

Managing Wire Delay in Large CMP Caches Bradford M. Beckmann David A. Wood Multifacet Project University of Wisconsin-Madison MICRO /8/04.

Multi-Level Caches Vittorio Zaccaria. Preview What you have seen: Data organization, Associativity, Cache size Policies -- how to manage the data once.

High Performing Cache Hierarchies for Server Workloads

REAL-TIME COMMUNICATION ANALYSIS FOR NOCS WITH WORMHOLE SWITCHING Presented by Sina Gholamian, 1 09/11/2011.

ECE 1749H: Interconnection Networks for Parallel Computer Architectures: Flow Control Prof. Natalie Enright Jerger.

CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips CiprianSeiculescu Stavros Volos Naser Khosro Pour Babak Falsafi Giovanni De Micheli.

Reporter: Bo-Yi Shiu Date: 2011/05/27 Virtual Point-to-Point Connections for NoCs Mehdi Modarressi, Arash Tavakkol, and Hamid Sarbazi- Azad IEEE TRANSACTIONS.

L2 to Off-Chip Memory Interconnects for CMPs Presented by Allen Lee CS258 Spring 2008 May 14, 2008.

Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim

(C) 2002 Milo MartinHPCA, Feb Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet.

Design of a High-Throughput Distributed Shared-Buffer NoC Router

1 Lecture 13: Interconnection Networks Topics: flow control, router pipelines, case studies.

1 Lecture 25: Interconnection Networks Topics: flow control, router microarchitecture Final exam:  Dec 4 th 9am – 10:40am  ~15-20% on pre-midterm  post-midterm:

(C) 2003 Milo Martin Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper,

1 Lecture 26: Interconnection Networks Topics: flow control, router microarchitecture.

1 Indirect Adaptive Routing on Large Scale Interconnection Networks Nan Jiang, William J. Dally Computer System Laboratory Stanford University John Kim.

8.1 Chapter 8 Switching Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.

1 Near-Optimal Oblivious Routing for 3D-Mesh Networks ICCD 2008 Rohit Sunkam Ramanujam Bill Lin Electrical and Computer Engineering Department University.

Dragonfly Topology and Routing

Performance and Power Efficient On-Chip Communication Using Adaptive Virtual Point-to-Point Connections M. Modarressi, H. Sarbazi-Azad, and A. Tavakkol.

Diamonds are a Memory Controller’s Best Friend* *Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core.

McRouter: Multicast within a Router for High Performance NoCs

Networks-on-Chips (NoCs) Basics

Déjà Vu Switching for Multiplane NoCs NOCS’12 University of Pittsburgh Ahmed Abousamra Rami MelhemAlex Jones.

SMART: A Single- Cycle Reconfigurable NoC for SoC Applications -Jyoti Wadhwani Chia-Hsin Owen Chen, Sunghyun Park, Tushar Krishna, Suvinay Subramaniam,

Chip Multiprocessor Coherence and Interconnect System Design Natalie Enright Jerger PhD Defense October 20, 2008.

Data and Computer Communications Chapter 10 – Circuit Switching and Packet Switching (Wide Area Networks)

Presenter: Min-Yu Lo 2015/10/19 Asit K. Mishra, N. Vijaykrishnan, Chita R. Das Computer Architecture (ISCA), th Annual International Symposium on.

Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.

In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI.

Srihari Makineni & Ravi Iyer Communications Technology Lab

Department of Computer Science and Engineering The Pennsylvania State University Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das Addressing.

CS 8501 Networks-on-Chip (NoCs) Lukasz Szafaryn 15 FEB 10.

ImanFaraji Time-based Snoop Filtering in Chip Multiprocessors Amirkabir University of Technology Tehran, Iran University of Victoria Victoria, Canada Amirali.

1 Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly Koushik Chakraborty Philip Wells Gurindar Sohi

A Study of Cyclops64 Crossbar Architecture and Performance Yingping Zhang April, 2005.

University of Michigan, Ann Arbor

CMP L2 Cache Management Presented by: Yang Liu CPS221 Spring 2008 Based on: Optimizing Replication, Communication, and Capacity Allocation in CMPs, Z.

Lecture 16: Router Design

Interconnect Networks Basics. Generic parallel/distributed system architecture On-chip interconnects (manycore processor) Off-chip interconnects (clusters.

Optimizing Replication, Communication, and Capacity Allocation in CMPs Z. Chishti, M. D. Powell, and T. N. Vijaykumar Presented by: Siddhesh Mhambrey Published.

Intel Slide 1 A Comparative Study of Arbitration Algorithms for the Alpha Pipelined Router Shubu Mukherjee*, Federico Silla !, Peter Bannon $, Joel.

컴퓨터교육과 이상욱 Published in: COMPUTER ARCHITECTURE LETTERS (VOL. 10, NO. 1) Issue Date: JANUARY-JUNE 2011 Publisher: IEEE Authors: Omer Khan (Massachusetts.

1 Lecture 22: Router Design Papers: Power-Driven Design of Router Microarchitectures in On-Chip Networks, MICRO’03, Princeton A Gracefully Degrading and.

Virtual-Channel Flow Control William J. Dally

Predictive High-Performance Architecture Research Mavens (PHARM), Department of ECE The NoX Router Mitchell Hayenga Mikko Lipasti.

1 Kyung Hee University Chapter 8 Switching. 2 Kyung Hee University Switching  Switching  Switches are devices capable of creating temporary connections.

Data Communication Networks Lec 13 and 14. Network Core- Packet Switching.

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks Kevin Kai-Wei Chang Rachata Ausavarungnirun Chris Fallin Onur Mutlu.

1 Lecture 29: Interconnection Networks Papers: Express Virtual Channels: Towards the Ideal Interconnection Fabric, ISCA’07, Princeton Interconnect Design.

An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing Liquin Cheng, John B. Carter and Donglai Dai cs.utah.edu by Evangelos Vlachos.

COMP8330/7330/7336 Advanced Parallel and Distributed Computing Communication Costs in Parallel Machines Dr. Xiao Qin Auburn University

Lecture 23: Interconnection Networks

The University of Adelaide, School of Computer Science

Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper, Dan Sorin§, Mark.

Exploring Concentration and Channel Slicing in On-chip Network Router

Interconnection Networks: Flow Control

Jason F. Cantin, Mikko H. Lipasti, and James E. Smith

Using Packet Information for Efficient Communication in NoCs

Improving Multiple-CMP Systems with Token Coherence

Natalie Enright Jerger, Li Shiuan Peh, and Mikko Lipasti

CS 6290 Many-core & Interconnect

The University of Adelaide, School of Computer Science

Lecture 25: Interconnection Networks

The University of Adelaide, School of Computer Science

Presentation transcript:

Circuit-Switched Coherence Natalie Enright Jerger*, Li-Shiuan Peh +, Mikko Lipasti* *University of Wisconsin - Madison + Princeton University 2 nd IEEE International Symposium on Networks-on-Chip

Motivation Network on Chip for general purpose multi-core Replacing dedicated global wires Efficient/scalable communication on-chip Router latency overhead can be significant Exploit application characteristics to lower latency Co-design coherence protocol to match network functionality 4/26/ Natalie Enright Jerger - University of Wisconsin

Executive Summary Hybrid Network Interleaves circuit-switched and packet- switched flits Optimize setup latency Improve throughput over traditional circuit- switching Reduce interconnect delay by up to 22% Co-design cache coherence protocol Improves performance by up to 17% 4/26/ Natalie Enright Jerger - University of Wisconsin

Switching Techniques Packet Switching Efficient bandwidth utilization Router latency overhead Circuit Switching Poor bandwidth utilization Stalled requests due to unavailable resources Low latency Avoids router overhead after circuit is established 4/26/ Natalie Enright Jerger - University of Wisconsin Best of both worlds? Efficient bandwidth utilization + low latency

Circuit-Switched Coherence Two key observations Commercial workloads are very sensitive to communication latency Significant pair-wise sharing 4/26/2015Natalie Enright Jerger - University of Wisconsin 5 Construct fast pair-wise circuits? Commercial Workloads: SpecJBB, SpecWeb, TPC-H, TPC-W Scientific Workloads: Barnes-Hut, Ocean, Radiosity, Raytrace

Traditional Circuit Switching Traditional circuit-switching hurts performance by up to ~7% 4/26/ Natalie Enright Jerger - University of Wisconsin *Data collected for 16 in-order core chip multiprocessor

Circuit Switching Redesigned Latency is critical Utilize Circuit Switching for lower latency A circuit connects resources across multiple hops to avoid router overhead Traditional circuit-switching performs poorly My contributions Novel setup mechanism Bandwidth stealing 4/26/2015Natalie Enright Jerger - University of Wisconsin 7

Outline Motivation Router Design Setup Mechanism Bandwidth Stealing Coherence Protocol Co-design Pair-wise sharing 3-hop optimization Region prediction Results Conclusions 4/26/ Natalie Enright Jerger - University of Wisconsin

Traditional Circuit Switching Path Setup (with Acknowledgement) Significant latency overhead prior to data transfer Other requests forced to wait for resources 4/26/2015 Natalie Enright Jerger - University of Wisconsin Acknowledgement Configuration Probe Data Circuit 0 5 9

Novel Circuit Setup Policy Overlap circuit setup with 1 st data transfer Reconfigure existing circuits if no unused links available Allows piggy-backed request to always achieve low latency Multiple circuit planes prevent frequent reconfiguration Configuration Packet Data Circuit A 0 5 4/26/ Natalie Enright Jerger - University of Wisconsin

Setup Network Light-weight setup network Narrow Circuit plane identifier (2 bits) + Destination (4 bits) Low Load No virtual channels  small area footprint Stores circuit configuration information Multiple narrow circuit planes prevent frequent reconfiguration Reconfiguration Buffered, traverses packet-switched pipeline 4/26/ Natalie Enright Jerger - University of Wisconsin

Packet-Switched Bandwidth Stealing Remember: problem with traditional Circuit-Switching is poor bandwidth Need to overcome this limitation Hybrid Circuit-Switched Solution: Packet- switched messages snoop incoming links When there are no circuit-switched messages on the link A waiting packet-switched message can steal idle bandwidth 4/26/ Natalie Enright Jerger - University of Wisconsin

Hybrid Circuit-Switched Router Design T T T T T Allocators Crossbar Inj N S E W W E S N Ej 4/26/2015Natalie Enright Jerger - University of Wisconsin 13

HCS Pipeline Circuit-switched messages: 1 stage Packet-switched messages: 3 stages Aggressive Speculation reduces stages Buffer Write Virtual Channel/ Switch Allocation Switch Traversal Link Traversal 4/26/2015Natalie Enright Jerger - University of Wisconsin Switch Traversal Link Traversal 14 RouterLink RouterLink Link Traversal

Outline Motivation Router Design Setup Mechanism Bandwidth Stealing Coherence Protocol Co-design Pair-wise sharing 3-hop optimization Region prediction Results Conclusions 4/26/ Natalie Enright Jerger - University of Wisconsin

Sharing Characterization Temporal sharing relationship: 67-76% of misses are serviced by 2 most recently shared with cores 4/26/ Natalie Enright Jerger - University of Wisconsin Commercial Workloads: SpecJBB, SpecWeb, TPC-H, TPC-W Scientific Workloads: Barnes-Hut, Ocean, Radiosity, Raytrace

Directory Coherence Directory AddressStateSharers AExclusive2 BShared1,2 4/26/ Natalie Enright Jerger - University of Wisconsin 12 Read A 1 Forward Read A 2 Data Response A 3 Directory AddressStateSharers AShared1,2 BShared1,2

Coherence Protocol Co-Design Goal: Better exploit circuits through coherence protocol Modifications: Allow a cache to send a request directly to another cache Notify the directory in parallel Prediction mechanism for pair-wise sharers Directory is sole ordering point 4/26/2015Natalie Enright Jerger - University of Wisconsin 18

Circuit-Switched Coherence Optimization 4/26/ Natalie Enright Jerger - University of Wisconsin Directory AddressStateSharers AExclusive2 BShared1,2 12 Update A 1 Data Response A 2 3 Directory AddressStateSharers AShared1,2 BShared1,2 Ack A Read A 1

Region Prediction Each memory region spans 1KB Takes advantage of spatial and temporal sharing 4/26/ Natalie Enright Jerger - University of Wisconsin Directory AddressStateSharers A[0]Shared2 A[1]Shared2 12 Miss A[0] 1 Forward Read A[0] 2 Data Response A[0] 3 Region Table A-- B3 Region Table A2 B3 Region A Update 4 5 Read A[1] Directory AddressStateSharers A[0]Shared1,2 A[1]Shared2

Simulation Methodology PHARMSim Full-system multi-core simulator Detailed network level model Cycle accurate router model Flit-level contention modeled More results in paper 4/26/ Natalie Enright Jerger - University of Wisconsin

Simulation Workloads 4/26/ Natalie Enright Jerger - University of Wisconsin Commercial SPECjbbJava server workload 24 warehouse, 200 requests SPECwebWeb server, 300 requests TPC-WWeb e-commerce, 40 transactions TPC-HDecision support system Scientific Barnes-Hut8k particles, full run Ocean514x514, parallel phase RadiosityParallel phase RaytraceCar input, parallel phase Synthetic Uniform RandomDestination select with uniform random distribution Permutation TrafficEach node communicates with one other node (pair-wise)

Simulation Configuration Table with config parameters 4/26/ Natalie Enright Jerger - University of Wisconsin Processors Cores16 in-order general purpose Memory System L1 I/D Caches32 KB 2-way set associative 1 cycle Private L2 caches512 KB 4-way set associative 6 cycles 64 Byte lines Shared L3 Cache16 MB (1MB bank/tile) 4-way set associative 12 cycles Main Memory Latency100 cycles Interconnect: 4x4 2-D Mesh Packet-switched baselineOptimized 1-3 router stages 4 Virtual channels with 4 Buffers each Hybrid Circuit Switching1 router stage 2 or 4 Circuit planes

Network Results Communication latency is key: shave off precious cycles in network latency 4/26/ Natalie Enright Jerger - University of Wisconsin

Flit breakdown Reduce interconnect latency for a significant fraction of messages 4/26/ Natalie Enright Jerger - University of Wisconsin

HCS + Protocol Optimization Improvement of HCS + Protocol optimization is greater than the sum of HCS or Protocol Optimization alone. Protocol Optimization drives up circuit reuse, better utilizing HCS 4/26/2015Natalie Enright Jerger - University of Wisconsin 26

Uniform Random Traffic HCS successfully overcomes bandwidth limitations associated with Circuit Switching 4/26/ Natalie Enright Jerger - University of Wisconsin

Related Work Router optimizations Express Virtual Channels [Kumar, ISCA 2007] Single-cycle router [Mullins, ISCA 2004] Many more… Hybrid Circuit-Switching Wave-switching [Duato, ICPP 1996] SoCBus [Wiklund, IPDPS 2003] Coherence Protocols Significant research in removing overhead of indirection 4/26/ Natalie Enright Jerger - University of Wisconsin

Circuit-Switched Coherence Summary Replace packet-switched mesh with hybrid circuit-switched mesh Interleave circuit and packet switched flits Reconfigurable circuits Dedicated bandwidth for frequent pair-wise sharers Low Latency and low power Avoid switching/routing Devise novel coherence mechanisms to take advantage of benefits of circuit switching 4/26/2015Natalie Enright Jerger - University of Wisconsin 29

Thank you 4/26/ Natalie Enright Jerger - University of Wisconsin

Circuit Setup Novel Setup Policy Overlap circuit setup with first data transfer Store circuit information at each router Reconfigure existing circuits if no unused links available Allows piggy-backed request to always achieve low latency Multiple narrow circuit planes prevent frequent reconfiguration Reconfiguration Buffered, traverses packet-switched pipeline 4/26/2015Natalie Enright Jerger - University of Wisconsin 31