Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks Aniruddha N. Udipi with Naveen Muralimanohar*, Rajeev Balasubramonian University of Utah.

Presentation transcript:

Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks Aniruddha N. Udipi with Naveen Muralimanohar*, Rajeev Balasubramonian University of Utah and *HP Labs

Motivation - I
Future CMPs are likely to be power-limited
– On-chip networks consume 20-36% of total chip power
– Network power dominated by routers
Chip design and verification costs are tremendous
– Directory-based protocols are complicated and have the inherent problem of indirection
– Snooping-based protocols are well understood and simple to design
Metal and wiring are cheap and plentiful
We are no longer pin-limited for the interconnection network

Motivation - II
Future of multi-core computing likely to diverge into two separate tracks
– Mid-range multicore machines for home/office cores
– Many-core machines for scientific/server applications (1000s of cores)
Even machines with large core counts are likely to be virtualized, with communication localized to small chunks of approx. 64 cores
Design energy-efficient networks for moderate core counts

Executive Summary
Elimination of routers leads us back to bus-based networks
Dramatic reduction in energy consumption, little or no loss in performance, reduction in design complexity
Enhancing the life of buses for moderately sized CMPs
– Filtered segmented bus, low-swing wiring, address interleaved buses, page coloring

Outline
Overview
Proposal I - Filtered Segmented Bus
Proposal II - Low-swing Wiring
Proposal III - Address Interleaved Buses
Proposal IV - Page Coloring
Evaluation
Conclusion

Baseline Chip and Interconnect Organization
[Figure labels: Core, L1, L2, Router]
Simple mesh used for illustration here; other options are discussed in the paper
Static-NUCA shared L2; each line has a “home” slice based on its address

Where does energy go in the network?
[Figure: per-access energy, Router vs. Link: 1.39e-10 J/access vs. 1.56e-11 J/access, annotated 8X]
Energy estimates based on CACTI 6.0 and Orion 2.0

What is the solution?
We are left with... a bus!
Could we really just use a bus? Not really
– Too many links activated on every transaction
– Energy gained by eliminating routers is lost by activating more links
– Poor performance due to increased arbitration times and network contention

We can do better...
Useless snoop: the particular cache line is not present in any other core

Filtered Bus
Segment and filter snoop transactions at intermediate points
Two types of filters
– Out-filter
– In-filter
Reduces number of links activated
Allows for safe parallelism (serialization happens at the central bus if required)
[Figure legend: Bus link, Filter]

Filters
Each “filter” depicted in the figure is a combination of an “Out-filter” and an “In-filter”
Each of these is a Counting Bloom Filter
– 2 arrays of 10-bit entries
– Subsets of the address bits hashed into each of these arrays, incremented to add entries, decremented to remove entries
– To test for membership, simply check if entries in both arrays are non-zero
– Compact representation, false positives possible
[Figure legend: Bus link, In + Out Filter]
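
The counting Bloom filter described on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the authors' hardware design: the array size (`index_bits`), the 64-byte line offset, and the choice of which address bits feed each array are assumptions, since the slides only say that "subsets of the address bits" are used. Counters that reach the 10-bit maximum are left stuck there so that the filter can never report a false negative.

```python
class CountingBloomFilter:
    """Two counter arrays, each indexed by a different subset of address bits."""

    COUNTER_MAX = (1 << 10) - 1  # 10-bit entries, as stated on the slide

    def __init__(self, index_bits=12, line_offset_bits=6):
        # index_bits and line_offset_bits are assumed values, not from the slides.
        self.index_bits = index_bits
        self.line_offset_bits = line_offset_bits
        size = 1 << index_bits
        self.arrays = [[0] * size, [0] * size]

    def _indices(self, addr):
        # Assumed hashing: low line-address bits index array 0,
        # the next group of bits indexes array 1.
        line = addr >> self.line_offset_bits
        mask = (1 << self.index_bits) - 1
        return (line & mask, (line >> self.index_bits) & mask)

    def add(self, addr):
        for array, idx in zip(self.arrays, self._indices(addr)):
            if array[idx] < self.COUNTER_MAX:
                array[idx] += 1

    def remove(self, addr):
        for array, idx in zip(self.arrays, self._indices(addr)):
            # A saturated counter stays stuck so membership stays conservative.
            if 0 < array[idx] < self.COUNTER_MAX:
                array[idx] -= 1

    def may_contain(self, addr):
        # Present only if the entries in both arrays are non-zero.
        return all(
            array[idx] > 0
            for array, idx in zip(self.arrays, self._indices(addr))
        )
```

Two unrelated lines can make a third look present when it shares one index with each of them, which is the false-positive case the slide mentions.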

Out-filter - Case 1
Bloom filter in every segment keeps track of a superset of lines that call that segment “home” and have been sent “out” of that segment
If a line has never left a segment, none of its transactions need to be seen outside
Energy saved: completely localized transaction, only home segment activated
[Figure legend: Bus link, Activated bus, In-filter, Out-filter, Activated filter, Home Segment, R – Requested Address]

Out-filter – Case 2
If the line is being requested from outside its home segment, the transaction has to go out on the central bus
The out-filter of the home segment is updated appropriately
The in-filter then takes over
[Figure legend: Bus link, Activated bus, In-filter, Out-filter, Activated filter, Home Segment, Update, R – Requested Address]

In-filter
Bloom filters keep track of a superset of lines currently present in the segment
Only broadcast within the local segment if required
Energy saved
[Figure legend: Bus link, Activated bus, In-filter, Out-filter, Activated filter, R – Requested Address]
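
The interplay of out-filters and in-filters across the three slides above can be sketched as a small Python model that reports which buses one request activates. Several things here are assumptions made for illustration: exact Python sets stand in for the Bloom filters (so there are no false positives), the home segment is always activated once a request reaches the central bus, and the requester's in-filter is updated as soon as the request is issued. The names `Segment` and `snoop` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    # Home lines that have been sent out of this segment (stand-in for the out-filter).
    out_filter: set = field(default_factory=set)
    # Lines currently cached somewhere in this segment (stand-in for the in-filter).
    in_filter: set = field(default_factory=set)


def snoop(segments, requester, home, line):
    """Return the sorted names of the buses activated by one request."""
    activated = {f"segment{requester}"}

    # Case 1: request from the home segment for a line that never left it.
    if requester == home and line not in segments[home].out_filter:
        segments[requester].in_filter.add(line)
        return sorted(activated)

    # Case 2: the transaction must go out on the central bus.
    activated.add("central")
    if requester != home:
        segments[home].out_filter.add(line)

    # In-filters decide which remote segments see the broadcast.
    for sid, seg in enumerate(segments):
        if sid == requester:
            continue
        if sid == home or line in seg.in_filter:
            activated.add(f"segment{sid}")

    segments[requester].in_filter.add(line)
    return sorted(activated)
```

With four segments, a first request from segment 0 for a line homed in segment 0 stays local; once segment 2 has requested the same line, later requests also activate the central bus and segment 2, while segments 1 and 3 stay idle.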

Arbitration
Global arbitration delay is non-trivial for a single bus connecting even 16 cores
Multi-step arbitration, as required
On every request
– arbitrate for local bus and broadcast
– if filter indicates that the transaction is complete, “validate” broadcast via wired-OR
– if not, arbitrate for central bus and hold broadcast in a single-entry buffer until the central bus is available
– at the remote sub-buses, priority is given to requests originating from the central bus
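
The multi-step arbitration sequence can be written out as a short Python sketch. It only lists the steps a request goes through and models the priority rule at a remote sub-bus; it does not model timing. The function names and the `(origin, request_id)` tuple format are hypothetical.

```python
def handle_request(filter_says_local, central_bus_free):
    """Return the ordered arbitration steps for one request."""
    steps = ["arbitrate for local bus", "broadcast on local bus"]
    if filter_says_local:
        # The filter indicates the transaction is complete within the segment.
        steps.append("validate broadcast via wired-OR")
        return steps
    steps.append("arbitrate for central bus")
    if not central_bus_free:
        steps.append("hold broadcast in single-entry buffer")
    steps.append("broadcast on central bus")
    return steps


def pick_next(pending):
    """Pick the next request at a remote sub-bus.

    pending is a list of (origin, request_id) tuples in arrival order,
    where origin is "central" or "local". Central-bus requests win.
    """
    for request in pending:
        if request[0] == "central":
            return request
    return pending[0] if pending else None
```

For example, `pick_next([("local", 1), ("central", 2)])` selects the central-bus request even though the local one arrived first.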

Low-swing Wiring
Differential low-swing wiring is up to 10X more energy efficient than regular wiring
These have less impact on packet-switched networks since routers are the bottleneck anyway
– Amdahl’s law!
Slightly increased latency, more metal requirement

Address Interleaved Buses
As core counts increase, increased pressure on the bus due to contention
At 64 cores, even though bus-based networks continue to be highly energy efficient, performance begins to dip
To shore up performance, increase the number of buses
– different buses handle mutually exclusive addresses
– increased metal requirement
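
One simple way to give each bus a mutually exclusive set of addresses is to interleave on the low-order bits of the cache-line address, sketched below in Python. The slides do not say which bits are used, so the modulo mapping and the 64-byte line size are assumptions, and `bus_for_address` is a hypothetical helper.

```python
LINE_OFFSET_BITS = 6  # assumed 64-byte cache lines


def bus_for_address(addr, num_buses):
    """Map an address to one of num_buses buses by cache-line interleaving."""
    line = addr >> LINE_OFFSET_BITS
    return line % num_buses
```

Every byte of a given cache line maps to the same bus, and consecutive lines alternate between buses, so two interleaved buses each see roughly half of the address traffic.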

Page Coloring
OS-assisted page coloring for L2 cache
We use a simple first-touch approach
Improved locality helps any network, but is especially well-suited for our network because
– More flexibility in page placement
– Less negative impact by sub-optimal page placement
– Improves filter behavior
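
The first-touch policy can be illustrated with a short Python sketch: the first core to touch a page decides which L2 slice becomes the home for every line in that page. This is a simplification. A real OS would pick a physical page whose color bits map to the desired slice, whereas the sketch records the mapping in a dictionary, and it assumes one L2 slice per core. The class name `FirstTouchColoring` is hypothetical; the 4KB page size comes from the Methodology slide.

```python
PAGE_SHIFT = 12  # 4KB pages, per the Methodology slide


class FirstTouchColoring:
    """Assign each page to the L2 slice of the first core that touches it."""

    def __init__(self):
        self.page_to_slice = {}

    def home_slice(self, vaddr, core_id):
        page = vaddr >> PAGE_SHIFT
        # setdefault keeps the first toucher's slice for all later accesses.
        return self.page_to_slice.setdefault(page, core_id)
```

After core 3 first touches address 0x1000, a later access to 0x1ABC from core 7 still resolves to slice 3, because both addresses fall in the same 4KB page.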

Methodology
Virtutech SIMICS full-system simulator
– “g-cache” significantly modified to add network models
CACTI 6.0 and Orion 2.0 for router/link energy computation
16 cores for most experiments, sensitivity analysis for 32- and 64-core systems
32nm process, 3GHz clock
32K D-L1, 16K I-L1, 2MB/slice shared L2
200 cycle main memory latency
4KB page size
PARSEC, NAS, SPLASH-2 benchmark suites
– run for entire Region-Of-Interest/parallel section
Baseline routers - 4 VCs, 8 buffers/VC

Energy Consumption – Address Network
[Figure annotations: Ring – 20x, Grid – 27x, Fbfly – 31x]

Energy Consumption – Data Network
[Figure annotations: Ring – 2x, Grid – 2.5x, Fbfly – 3x]

How does energy consumption reduce?
Router : Link energy ratio is high enough to significantly impact energy characteristics
Efficient Bloom filters, at 16KB/filter
– Out-filters are 85% accurate (note that there are only false positives, no false negatives)
– In-filters are 90% accurate

Effect of Page Coloring
More locality
Better filtering
– Out-filter accuracy increases from 85% to 97%

System Performance
[Figure annotations: Ring – 7%, Grid – 3%, Fbfly – 1%]

How does performance improve?
Two basic reasons
– Inherent indirection in directory-based protocols
– Deep pipelines in routers increasing the no-load latency
Avg. latency in bus-based network is 16.4 cycles
– Arbitration (3.7 cyc) + Contention (1 cyc) + Bloom filter (1.2 cyc) + Link latency (10.5 cyc)
Even in the most connected FBFLY, average of 1.5 hops per message, bare minimum two messages per transaction
– 3 hops
– 15 cycles without contention
– Link (6 cyc) + Router (9 cyc)
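
The latency figures on this slide can be checked with a few lines of Python. The component values are taken directly from the slide; the per-hop link and router latencies (2 and 3 cycles) are not stated there and are derived here by dividing the 6-cycle and 9-cycle totals by the 3 hops.

```python
# Bus-based network: average latency components in cycles, from the slide.
bus_cycles = {
    "arbitration": 3.7,
    "contention": 1.0,
    "bloom_filter": 1.2,
    "link": 10.5,
}
bus_total = sum(bus_cycles.values())  # about 16.4 cycles

# Flattened butterfly: best case without contention.
hops_per_message = 1.5
messages_per_transaction = 2
hops = hops_per_message * messages_per_transaction  # 3 hops

link_per_hop = 6 / 3    # derived: 6 link cycles spread over 3 hops
router_per_hop = 9 / 3  # derived: 9 router cycles spread over 3 hops
fbfly_total = hops * (link_per_hop + router_per_hop)  # 15 cycles
```

The bus total of roughly 16.4 cycles is an average that already includes contention, while the 15-cycle flattened-butterfly figure is a contention-free minimum, which is why the two networks end up close in performance.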

Scaling – 32 Cores – Energy
Average energy reduction of 19X in address network, 3X in data network

32 Cores – Performance
Average 5% drop in performance

Scaling – 64 Cores – Energy
Average reduction of 13X in address network, 2.5X in data network

64 Cores – Performance
Average 39% increase in execution time compared to fbfly, only 12% increase with just two interleaved buses

Router Optimizations
For packet-switched networks to be as energy efficient as bus-based networks, the Router : Link energy ratio should be less than
– 3.5X at 16 cores
– 4.5X at 32 cores
– 7X at 64 cores
Current energy ratio is approx. 70X

Related Work
Packet Switched Networks
– Dally/Towles (DAC ’01), Kim et al. (MICRO ’07), Grot et al. (HPCA ’09), TRIPS, TILERA
Hierarchical Networks
– Muralimanohar et al. (ISCA ’07), Das et al. (HPCA ’09)
Snoop Filtering
– Moshovos et al. (HPCA ’01), Strauss et al. (ISCA ’06), Salapura et al. (HPCA ’08)
Bus applications in CMPs
– Manevich et al. (NOCS ’09)

Key Contributions
For moderate core counts, buses just work!
– Dramatic energy reduction
– Little or no loss in performance
– Simple snooping protocols, reduction in design complexity
Low-swing wiring
Multiple address interleaved buses
OS-assisted page coloring
Potential for router optimization

Thank you... Questions?