CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips. Ciprian Seiculescu, Stavros Volos, Naser Khosro Pour, Babak Falsafi, Giovanni De Micheli.

Presentation transcript:

CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips
Ciprian Seiculescu, Stavros Volos, Naser Khosro Pour, Babak Falsafi, Giovanni De Micheli
LSI Integrated Systems Laboratory

NoCs Major Power Consumer
- Move towards manycore
  - Tiled architectures
- Network-on-Chip (NoC)
  - Significant power consumer
  - 40% MIT RAW
  - 30% Intel Tera-scale
- Cache-coherent CMP
  - Server workloads
[Figure: tiled CMP with 16 tiles, each a core (C) with a cache ($); labels: Core, $, Crossbar]

Proposals to Reduce NoC Power
- Multiple networks
  - Better area and power [Balfour & Dally, ICS 2006]
- Commercial server workloads
  - Traffic patterns are different
- Run on cache-coherent CMPs
  - Strong relation between coherence protocol and NoC
- Not optimized for commercial server workload traffic

Contributions
- Commercial server workloads
  - Optimized for reuse in L1, little sharing
  - Full-blown coherence protocol in CMPs
  - Only some transitions are frequent
- Duality in request/response message size
- CCNoC
  - Full advantage of heterogeneity
  - Same number of buffers
  - 16% less power, same performance as Mesh

Outline
- Overview
- Why CCNoC?
- Dual-router design
- Evaluation
- Conclusions

Dual Router is More Efficient
- Dual router
  - Two crossbars per routing node
- Wires are less expensive on-chip
  - Use more wires for better performance
- Area and power grow faster than connectivity [Balfour & Dally, ICS 2006]
  - Dual router: better performance, power and area
[Figure: one N-bit-wide crossbar compared with two N/2-bit-wide crossbars]
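The claim that area and power grow faster than connectivity can be illustrated with a rough cost model. The sketch below assumes the common first-order approximation that crossbar cost scales with the square of (ports x width); that model, the function names, and the 5-port router are illustrative assumptions, not figures taken from the slides.

```python
def crossbar_cost(width_bits: int, ports: int = 5) -> int:
    """First-order crossbar cost, assumed proportional to (ports * width)^2.

    ports=5 is an assumption (four mesh directions plus a local port).
    """
    return (ports * width_bits) ** 2


def dual_to_single_ratio(width_bits: int, ports: int = 5) -> float:
    """Cost of two half-width crossbars relative to one full-width crossbar."""
    single = crossbar_cost(width_bits, ports)
    dual = 2 * crossbar_cost(width_bits // 2, ports)
    return dual / single


if __name__ == "__main__":
    # Under the quadratic assumption, two N/2-bit crossbars cost half of one N-bit crossbar.
    print(dual_to_single_ratio(128))
```

Under this assumed model the ratio does not depend on N or on the port count, which is why splitting one wide router into two narrower ones can pay off even though the total wire count stays the same.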

Right Dual Router Design
- Avoid protocol-level deadlock
  - Separate requests and responses
  - Use virtual channels
- CCNoC sub-networks
  - Request / Response
  - No VCs needed
  - Same number of buffers
- Buffers are power hungry [H.S. Wang & L.S. Peh, MICRO 2003]
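A small accounting sketch can make the "same number of buffers" point concrete. It assumes a baseline mesh input port with two virtual channels (one per message class), each two entries deep; the VC count is an assumption for illustration, while the 2-entry depth and the 48/128-bit flit widths are the values listed in the evaluation setup. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputPort:
    flit_bits: int  # width of one buffer entry
    queues: int     # number of virtual channels (1 means no VCs)
    depth: int      # entries per queue

    @property
    def entries(self) -> int:
        return self.queues * self.depth

    @property
    def storage_bits(self) -> int:
        return self.entries * self.flit_bits


def total_entries(ports: list[InputPort]) -> int:
    return sum(p.entries for p in ports)


def total_bits(ports: list[InputPort]) -> int:
    return sum(p.storage_bits for p in ports)


# Assumed baseline: one 128-bit port with 2 VCs separating requests and responses.
mesh_port = [InputPort(flit_bits=128, queues=2, depth=2)]
# CCNoC: two physical sub-networks, no VCs in either.
ccnoc_port = [InputPort(flit_bits=48, queues=1, depth=2),
              InputPort(flit_bits=128, queues=1, depth=2)]

if __name__ == "__main__":
    print(total_entries(mesh_port), total_entries(ccnoc_port))
    print(total_bits(mesh_port), total_bits(ccnoc_port))
```

With these assumed parameters both designs hold the same number of buffer entries per input direction, but the request-side entries in CCNoC are narrower, so fewer buffer bits are needed in total.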

Protocol Activity
- CMPs implement a full-blown coherence protocol
- Some transitions are frequent [Hardavellas, ISCA 2009]
  - Read clean block
  - Evict clean block
  - Write to unshared block
- Other transitions are needed for correctness (infrequent)
  - Read dirty block
  - Evict dirty block
  - Write to shared block

Frequent Read Protocol Activity
[Figure: message sequence between Reader, Directory and Writer. Messages: Read Req, Read Resp, Evict Clean Req. Legend: Short Req, Short Resp, Long Resp]

Frequent Write Protocol Activity
[Figure: message sequence between Writer and Directory. Messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp. Legend: Short Req, Short Resp, Long Resp]

Infrequent Read Protocol Activity
[Figure: message sequence between Reader, Directory and Writer. Messages: Read Req, Read Resp, Downgrade Req, Downgrade Resp. Legend: Short Req, Short Resp, Long Resp]

Infrequent Write Protocol Activity
[Figure: message sequence between Writer, Directory, Reader 1 and Reader 2. Messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp, Inv Req, Inv Resp, Evict Dirty Req. Legend: Short Req, Short Resp, Long Resp]
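The message names in the four protocol diagrams can be used to sketch how a network interface might steer traffic onto the two CCNoC sub-networks. This is a minimal illustration that classifies by the Req/Resp suffix of each name; the `Network` enum and `select_network` function are hypothetical and do not describe the authors' implementation.

```python
from enum import Enum


class Network(Enum):
    REQUEST = "request"
    RESPONSE = "response"


# Message names as they appear in the protocol activity diagrams.
MESSAGES = [
    "Read Req", "Read Resp", "Evict Clean Req",
    "Fetch/Upgrade Req", "Fetch Resp", "Upgrade Resp",
    "Downgrade Req", "Downgrade Resp",
    "Inv Req", "Inv Resp", "Evict Dirty Req",
]


def select_network(message: str) -> Network:
    """Pick a sub-network from the message class encoded in its name."""
    if message.endswith("Req"):
        return Network.REQUEST
    if message.endswith("Resp"):
        return Network.RESPONSE
    raise ValueError(f"unknown message class: {message!r}")


if __name__ == "__main__":
    for name in MESSAGES:
        print(f"{name:18s} -> {select_network(name).value}")
```

Because a response never waits behind a request in the same physical network, this split gives the same protocol-level deadlock avoidance that virtual channels provide in a single network.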

Traffic Analysis
- Request: 93% short
- Response: 86% long

CCNoC Router
- Request network is narrow: optimized for short messages
- Response network is wide: optimized for long messages
[Figure: router with a request switch and a response switch, attached to the NI]
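To see why a narrow request network and a wide response network match the traffic mix, the sketch below counts flits per message. The message sizes are assumptions chosen for illustration (a 48-bit short message and a long message carrying a 64-byte block plus the same 48-bit header); the 93% and 86% fractions come from the traffic analysis, and the 48-bit and 128-bit flit widths are the ones used in the evaluation.

```python
import math

HEADER_BITS = 48      # assumed size of a short (control-only) message
BLOCK_BITS = 64 * 8   # assumed 64-byte cache block

SHORT_BITS = HEADER_BITS
LONG_BITS = HEADER_BITS + BLOCK_BITS


def flits(message_bits: int, flit_bits: int) -> int:
    """Number of flits needed to carry one message."""
    return math.ceil(message_bits / flit_bits)


def mean_flits(short_fraction: float, flit_bits: int) -> float:
    """Average flits per message for a given short/long traffic mix."""
    long_fraction = 1.0 - short_fraction
    return (short_fraction * flits(SHORT_BITS, flit_bits)
            + long_fraction * flits(LONG_BITS, flit_bits))


if __name__ == "__main__":
    # Requests are 93% short; responses are 86% long (14% short).
    print("request traffic, 48-bit flits:  ", mean_flits(0.93, 48))
    print("request traffic, 128-bit flits: ", mean_flits(0.93, 128))
    print("response traffic, 128-bit flits:", mean_flits(0.14, 128))
```

Under these assumed sizes a short request still occupies a single flit on the 48-bit network, so most request traffic loses nothing from the narrower datapath, while long responses keep the 128-bit width they need.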

Previous Work
- Balfour et al., ICS 2006
  - Better than a single large router
  - Read/write traffic
  - Same number of reads and writes
- Yoon et al., DAC 2010
  - Physical channels better than virtual channels
- Not optimized for cache-coherent CMPs running commercial server workloads

Outline
- Overview
- Why CCNoC?
- Dual-router design
- Evaluation
- Conclusions

Evaluation Methodology
- FLEXUS
  - Full-system simulation
  - 16 or 8 UltraSPARC III ISA cores
  - Split I/D, 64KB L1
  - 1 or 2 MB L2
- ORION 2.0
  - Power estimation
  - Area estimation
- Workloads
  - OLTP: TPC-C (IBM DB2 and Oracle)
  - DSS: TPC-H (IBM DB2; Q1, Q6, Q13, Q16)
  - Web: SPECweb99 (Apache and Zeus)
  - Scientific: EM3D
  - Multiprogrammed: SPEC2K (2x: gcc, twolf, art, mcf)

Evaluation NoCs
- Mesh (baseline): 128-bit flit width
- Torus (reference): 128-bit flit width
- Mesh-176 (high performance): 176-bit flit width
- CCNoC
  - Request: 48-bit flit width
  - Response: 128-bit flit width
- Switches
  - Wormhole flow control
  - Input queued
  - Transmission protocol: On/Off
  - Input buffers: 2 entries

Performance
- CCNoC performance loss: 2% relative to Torus, 8% relative to Mesh-176

Power Savings
- CCNoC power savings: 16% against Mesh-128, 22% against Torus, 38% against Mesh-176

Conclusions
- Duality in request/response traffic
  - Request: dominated by short messages
  - Response: dominated by long messages
- Proposed CCNoC
  - Narrow request network
  - Wide response network
- Showed significant power savings
  - 22% against Torus
  - 38% against Mesh-176

Thank you! Q&A