Scalable Directory Protocols for 1000s of Cores Dominic DiTomaso EE 6633.

Slides:



Advertisements
Similar presentations
L.N. Bhuyan Adapted from Patterson’s slides
Advertisements

Copyright Josep Torrellas 2003,20081 Cache Coherence Instructor: Josep Torrellas CS533 Term: Spring 2008.
Cache coherence for CMPs Miodrag Bolic. Private cache Each cache bank is private to a particular core Cache coherence is maintained at the L2 cache level.
Directory-Based Cache Coherence Marc De Melo. Outline Non-Uniform Cache Architecture (NUCA) Cache Coherence Implementation of directories in multicore.
Lucía G. Menezo Valentín Puente José Ángel Gregorio University of Cantabria (Spain) MOSAIC :
The University of Adelaide, School of Computer Science
Cache Coherence Mechanisms (Research project) CSCI-5593
Nikos Hardavellas, Northwestern University
The Locality-Aware Adaptive Cache Coherence Protocol George Kurian 1, Omer Khan 2, Srini Devadas 1 1 Massachusetts Institute of Technology 2 University.
Cache Optimization Summary
The Stanford Directory Architecture for Shared Memory (DASH)* Presented by: Michael Bauer ECE 259/CPS 221 Spring Semester 2008 Dr. Lebeck * Based on “The.
Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.
Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures Per Stenstrom, Truman Joe and Anoop Gupta Presented by Colleen Lewis.
Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA Babak Falsafi and David A. Wood University of Wisconsin, Madison, 1997 Presented by: Jie Xiao.
A Parallel Computational Model for Heterogeneous Clusters Jose Luis Bosque, Luis Pastor, IEEE TRASACTION ON PARALLEL AND DISTRIBUTED SYSTEM, VOL. 17, NO.
EECC756 - Shaaban #1 lec # 13 Spring Scalable Cache Coherent Systems Scalable distributed shared memory machines Assumptions: –Processor-Cache-Memory.
An Efficient Hardware-based Multi-hash Scheme for High Speed IP Lookup Department of Computer Science and Information Engineering National Cheng Kung University,
Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors Karin Strauss, Xiaowei Shen*, Josep Torrellas University.
CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.
ECE669 L17: Memory Systems April 1, 2004 ECE 669 Parallel Computer Architecture Lecture 17 Memory Systems.
Parallel Computer Architecture: Essentials for Both Computer Scientists and Engineers Edward F. Gehringer †* Yan.
CS 258 Parallel Computer Architecture LimitLESS Directories: A Scalable Cache Coherence Scheme David Chaiken, John Kubiatowicz, and Anant Agarwal Presented:
Cache Coherence for Large-Scale Machines Todd C. Mowry CS 495 Sept. 26 & Oct. 1, 2002 Topics Hierarchies Directory Protocols.
Cooperative Caching for Chip Multiprocessors Jichuan Chang Guri Sohi University of Wisconsin-Madison ISCA-33, June 2006.
Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture.
Tufts University Department of Electrical and Computer Engineering
Lecture 3. Directory-based Cache Coherence Prof. Taeweon Suh Computer Science Education Korea University COM503 Parallel Computer Architecture & Programming.
Building Expressive, Area-Efficient Coherence Directories Michael C. Huang Guofan Jiang Zhejiang University University of Rochester IBM 1 Lei Fang, Peng.
A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy Jason Zebchuk, Elham Safi, and Andreas Moshovos
Effects of wrong path mem. ref. in CC MP Systems Gökay Burak AKKUŞ Cmpe 511 – Computer Architecture.
1 CACM July 2012 Talk: Mark D. Hill, Cornell University, 10/2012.
Westin Harbour Castle, August 24, 2000 The NUMAchine Multiprocessor ICPP 2000.
LimitLESS Directories: A Scalable Cache Coherence Scheme By: David Chaiken, John Kubiatowicz, John Kubiatowicz, Anant Agarwal Anant Agarwal Presented by:
Computer Science and Engineering Copyright by Hesham El-Rewini Advanced Computer Architecture CSE 8383 March 20, 2008 Session 9.
Analyzing the Impact of Data Prefetching on Chip MultiProcessors Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami Kyushu University, Japan.
1 Packet Network Simulator-on-Chip Henry Wong Danyao Wang University of Toronto Connections 2009 ECE Graduate Symposium.
Dynamic Link Labels for Energy Efficient MAC Headers in Wireless Sensor Networks Sheng-Shih Wang Gautam Kulkarni, Curt Schurgers, and Mani Srivastava IEEE.
Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology Presented by: Luis Ortiz Department of Computer.
Fast Lookup for Dynamic Packet Filtering in FPGA REPORTER: HSUAN-JU LI 2014/09/18 Design and Diagnostics of Electronic Circuits & Systems, 17th International.
The University of Adelaide, School of Computer Science
REED : Robust, Efficient Filtering and Event Detection in Sensor Network Daniel J. Abadi, Samuel Madden, Wolfgang Lindner Proceedings of the 31st VLDB.
“An Evaluation of Directory Schemes for Cache Coherence” Presented by Scott Weber.
CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.
The University of Adelaide, School of Computer Science
An Adaptive Cache Coherence Protocol Optimized for Producer-Consumer Sharing Liquin Cheng, John B. Carter and Donglai Dai cs.utah.edu by Evangelos Vlachos.
COMP8330/7330/7336 Advanced Parallel and Distributed Computing Communication Costs in Parallel Machines Dr. Xiao Qin Auburn University
Siva and Osman March 7, 2000 Cache Coherence Schemes for Multiprocessors Sivakumar M Osman Unsal.
Architecture and Design of AlphaServer GS320
Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
Jason F. Cantin, Mikko H. Lipasti, and James E. Smith
Building Expressive, Area-Efficient Coherence Directories
The University of Adelaide, School of Computer Science
CMSC 611: Advanced Computer Architecture
Directory-based Protocol
The University of Adelaide, School of Computer Science
CS5102 High Performance Computer Systems Distributed Shared Memory
Interconnect with Cache Coherency Manager
E. Bilir, R. Dickson, Y. Hu, M. Plakal, D. Sorin,
Natalie Enright Jerger, Li Shiuan Peh, and Mikko Lipasti
CS 213 Lecture 11: Multiprocessor 3: Directory Organization
Leveraging Optical Technology in Future Bus-based Chip Multiprocessors
High Performance Computing
Lucía G. Menezo Valentín Puente Jose Ángel Gregorio
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 17 Multiprocessors and Thread-Level Parallelism
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
Presentation transcript:

Scalable Directory Protocols for 1000s of Cores Dominic DiTomaso EE 6633

Outline Introduction Background ATAC SPATL Cuckoo Directory SCD Conclusions

Directory Protocols Snoopy (broadcast) -> Directory (multicast) Large Directory Overhead Overhead = P*M P - # of processors M - # of memory blocks 64 nodes: 12.7% overhead 256 nodes: 50% overhead 1024 nodes: 200% overhead P M

Directory Protocols Requirements Small area, energy, and latency overheads Accurate sharer information Limited directory-induced invalidations Duplicate Tags Area-efficient High associativity -> high power Sparse Directory Power-efficient Large capacity -> large area Coarse-grain vectors, Hierarchical, etc.

ATAC Optical Broadcast Network

SPATL Tagless Bloom Filters

Cuckoo Directory N-ary Cuckoo Hash Table

SCD Variable directory tags

SCD (cont.)

Conclusions Large directory overhead at 1000s of cores Solutions Optics – ATAC Tagless – SPATL Hash Tables – Cuckoo Variable Tags – SCD

References [1] George Kurian, Jason E. Miller, James Psota, Jonathan Eastep, Jifeng Liu, Jurgen Michel, Lionel C. Kimerling, and Anant Agarwal, “ATAC: a 1000-core cache-coherent processor with on-chip optical network,” In Proceedings of the 19th international conference on Parallel architectures and compilation techniques (PACT '10), [2] Daniel Sanchez and Christos Kozyrakis, “SCD: A scalable coherence directory with flexible sharer set encoding,” In Proceedings of the 2012 IEEE 18th International Symposium on High-Performance Computer Architecture (HPCA '12), [3] H. Zhao, A. Shriraman, S. Dwarkadas, and V. Srinivasan, “SPATL: Honey, I Shrunk the Coherence Directory,” In Proceedings of the 20th international conference on Parallel architectures and compilation techniques (PACT ’11), [4] M. Ferdman, P. Lotfi-Kamran, K. Balet, B. Falsafi, "Cuckoo directory: A scalable directory for many-core systems," 2011 IEEE 17th International Symposium on High Performance Computer Architecture (HPCA), pp , Feb