Shuchang Shan † ‡, Yu Hu †, Xiaowei Li † † Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences.

Slides:

Advertisements

Similar presentations

Application-Aware Memory Channel Partitioning † Sai Prashanth Muralidhara § Lavanya Subramanian † † Onur Mutlu † Mahmut Kandemir § ‡ Thomas Moscibroda.

Advertisements

1 Lecture 17: Large Cache Design Papers: Managing Distributed, Shared L2 Caches through OS-Level Page Allocation, Cho and Jin, MICRO’06 Co-Operative Caching.

ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors Mohammad Hammoud, Sangyeun Cho, and Rami Melhem Presenter: Socrates Demetriades.

A KTEC Center of Excellence 1 Cooperative Caching for Chip Multiprocessors Jichuan Chang and Gurindar S. Sohi University of Wisconsin-Madison.

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.

Our approach! 6.9% Perfect L2 cache (hit rate 100% ) 1MB L2 cache Cholesky 47% speedup BASE: All cores are used to execute the application-threads. PB-GS(PB-LS)

Virtual Exclusion: An Architectural Approach to Reducing Leakage Energy in Multiprocessor Systems Mrinmoy Ghosh Hsien-Hsin S. Lee School of Electrical.

Zhongkai Chen 3/25/2010. Jinglei Wang; Yibo Xue; Haixia Wang; Dongsheng Wang Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China This paper.

Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.

Thoughts on Shared Caches Jeff Odom University of Maryland.

CHIMAERA: A High-Performance Architecture with a Tightly-Coupled Reconfigurable Functional Unit Kynan Fraser.

Erhan Erdinç Pehlivan Computer Architecture Support for Database Applications.

Is SC + ILP = RC? Presented by Vamshi Kadaru Chris Gniady, Babak Falsafi, and T. N. VijayKumar - Purdue University Spring 2005: CS 7968 Parallel Computer.

IVF: Characterizing the Vulnerability of Microprocessor Structures to Intermittent Faults Songjun Pan 1,2, Yu Hu 1, and Xiaowei Li 1 1 Key Laboratory of.

Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.

A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.

Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors Karin StraussAMD Advanced Architecture and Technology.

Chapter 17 Parallel Processing.

Multiscalar processors

(C) 2004 Daniel SorinDuke Architecture Using Speculation to Simplify Multiprocessor Design Daniel J. Sorin 1, Milo M. K. Martin 2, Mark D. Hill 3, David.

1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The.

The Vector-Thread Architecture Ronny Krashinsky, Chris Batten, Krste Asanović Computer Architecture Group MIT Laboratory for Computer Science

Presenter: Jyun-Yan Li Multiplexed redundant execution: A technique for efficient fault tolerance in chip multiprocessors Pramod Subramanyan, Virendra.

An Analytical Performance Model for Co-Management of Last-Level Cache and Bandwidth Sharing Taecheol Oh, Kiyeon Lee, and Sangyeun Cho Computer Science.

Cooperative Caching for Chip Multiprocessors Jichuan Chang Guri Sohi University of Wisconsin-Madison ISCA-33, June 2006.

Achieving Non-Inclusive Cache Performance with Inclusive Caches Temporal Locality Aware (TLA) Cache Management Policies Aamer Jaleel,

StimulusCache: Boosting Performance of Chip Multiprocessors with Excess Cache Hyunjin Lee Sangyeun Cho Bruce R. Childers Dept. of Computer Science University.

Amphisbaena: Modeling Two Orthogonal Ways to Hunt on Heterogeneous Many-cores an analytical performance model for boosting performance Jun Ma, Guihai Yan,

Comparing Memory Systems for Chip Multiprocessors Leverich et al. Computer Systems Laboratory at Stanford Presentation by Sarah Bird.

Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University.

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee and Margaret Martonosi.

Quantifying and Comparing the Impact of Wrong-Path Memory References in Multiple-CMP Systems Ayse Yilmazer, University of Rhode Island Resit Sendag, University.

SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,

Predicting Coherence Communication by Tracking Synchronization Points at Run Time Socrates Demetriades and Sangyeun Cho 45 th International Symposium in.

Prefetching Challenges in Distributed Memories for CMPs Martí Torrents, Raúl Martínez, and Carlos Molina Computer Architecture Department UPC – BarcelonaTech.

Computer Architecture Memory organization. Types of Memory Cache Memory Serves as a buffer for frequently accessed data Small  High Cost RAM (Main Memory)

(C) 2003 Daniel SorinDuke Architecture Dynamic Verification of End-to-End Multiprocessor Invariants Daniel J. Sorin 1, Mark D. Hill 2, David A. Wood 2.

Effects of wrong path mem. ref. in CC MP Systems Gökay Burak AKKUŞ Cmpe 511 – Computer Architecture.

SYNAR Systems Networking and Architecture Group CMPT 886: Computer Architecture Primer Dr. Alexandra Fedorova School of Computing Science SFU.

Abdullah Aldahami ( ) March 23, Introduction 2. Background 3. Simulation Techniques a.Experimental Settings b.Model Description c.Methodology.

1 A Cost-effective Substantial- impact-filter Based Method to Tolerate Voltage Emergencies Songjun Pan 1,2, Yu Hu 1, Xing Hu 1,2, and Xiaowei Li 1 1 Key.

Virtual Hierarchies to Support Server Consolidation Mike Marty Mark Hill University of Wisconsin-Madison ISCA 2007.

Transactional Coherence and Consistency Presenters: Muhammad Mohsin Butt. (g ) Coe-502 paper presentation 2.

PARALLEL PROCESSOR- TAXONOMY. CH18 Parallel Processing {Multi-processor, Multi-computer} Multiple Processor Organizations Symmetric Multiprocessors Cache.

1 Presented By: Michael Bieniek. Embedded systems are increasingly using chip multiprocessors (CMPs) due to their low power and high performance capabilities.

Computer Network Lab. Korea University Computer Networks Labs Se-Hee Whang.

© Wen-mei Hwu and S. J. Patel, 2005 ECE 511, University of Illinois Lecture 4: Microarchitecture: Overview and General Trends.

11 Online Computing and Predicting Architectural Vulnerability Factor of Microprocessor Structures Songjun Pan Yu Hu Xiaowei Li {pansongjun, huyu,

Exploiting Instruction Streams To Prevent Intrusion Milena Milenkovic.

1 CMP-MSI.07 CARES/SNU A Reusability-Aware Cache Memory Sharing Technique for High Performance CMPs with Private Caches Sungjune Youn, Hyunhee Kim and.

Shouqing Hao Institute of Computing Technology, Chinese Academy of Sciences Processes Scheduling on Heterogeneous Multi-core Architecture.

Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments Junghee Lee, Chrysostomos Nicopoulos, Yongjae Lee, Hyung Gyu Lee.

컴퓨터교육과 이상욱 Published in: COMPUTER ARCHITECTURE LETTERS (VOL. 10, NO. 1) Issue Date: JANUARY-JUNE 2011 Publisher: IEEE Authors: Omer Khan (Massachusetts.

Cache Miss-Aware Dynamic Stack Allocation Authors: S. Jang. et al. Conference: International Symposium on Circuits and Systems (ISCAS), 2007 Presenter:

Dynamic Verification of Sequential Consistency Albert Meixner Daniel J. Sorin Dept. of Computer Dept. of Electrical and Science Computer Engineering Duke.

SYNAR Systems Networking and Architecture Group CMPT 886: Computer Architecture Primer Dr. Alexandra Fedorova School of Computing Science SFU.

CS717 1 Hardware Fault Tolerance Through Simultaneous Multithreading (part 2) Jonathan Winter.

Speculative Lock Elision

ASR: Adaptive Selective Replication for CMP Caches

Cache Memory Presentation I

Jason F. Cantin, Mikko H. Lipasti, and James E. Smith

Lecture 13: Large Cache Design I

The University of Adelaide, School of Computer Science

Reducing Memory Reference Energy with Opportunistic Virtual Caching

CMPT 886: Computer Architecture Primer

Another Performance Evaluation of Memory Hierarchy in Embedded Systems

Miss Rate versus Block Size

Lecture: Cache Hierarchies

Dynamic Verification of Sequential Consistency

A Novel Cache-Utilization Based Dynamic Voltage Frequency Scaling (DVFS) Mechanism for Reliability Enhancements *Yen-Hao Chen, *Yi-Lun Tang, **Yi-Yu Liu,

Presentation transcript:

Shuchang Shan † ‡, Yu Hu †, Xiaowei Li † † Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences ‡ Graduate University of Chinese Academy of Sciences (GUCAS) Transparent Dynamic Binding with Fault-Tolerant Cache Coherence Protocol for Chip Multiprocessors

2Outline Introduction TDB execution model Experimental results Conclusion

3 Architectural level Dual Modular Redundancy Memory system L1 Instruction-level DMR Core-level DMR AR-SMT[FTCS’99], SRT[ISCA’00] Thread-level DMR DIVA[MICRO’99], SHREC[MICRO’04], EDDI[TR’02] CRTR [ISCA’03], Reunion[MICRO’06], DCC[DSN’07] Leading thread Trailing thread EX’ CHK Leading instructions Trailing instructions A A’ A’ B B’ B’ For CMP systems, to make use of abundant hardware resources, building Core-level DMR!

4 Core-level Dual Modular Redundancy (DMR) Using coupled cores to verify each other’s execution Static binding –lacks of flexibility –e.g., Reunion [MICRO’06], CRT [ISCA’02], CRTR [ ISCA’03] Dynamic binding –Lacks of scalability for parallel processing –e.g., DCC [DSN’07, WDDD’08]

5 Key issue in Core-level DMR Maintain master-slave memory consistency Master-slave memory consistency –Coupled cores must get the same memory value –External writes causes consistency violation Reunion [Smolens-MICRO’06] –Rollback and recovery for the inconsistency Dynamic Core Coupling (DCC) [LaFrieda-DSN’07] –Consistency window to stall the external writes Scalability problem Consistency violation

6 Scalability problem External writes occur earlier and more frequently as the system scales –Reunion: Unacceptable recovery overhead for consistency violation –DCC: Unacceptable stall latency caused by consistency window Scalable solution needed –Reduce the consistency maintenance overhead Probability of external writes occurring within certain slacks For 4-CMP system:  28% in 100 cycles  37% in 500 cycles For 16-CMP system:  43% in 100 cycles  55% in 500 cycles cycles #External writes within 1K cycles: 0.3 for 4-CMP  3.3 for 16-CMP

7 Basic idea the scope of the master-slave memory consistency maintenance Sphere of Consistency (SoC) –The memory hierarchy –The private caches Master L1 cache Slave Global memory Master L1 cache Slave Global memory Transparent Dynamic Binding (TDB): scalableflexible Reduce the SoC to the scale of private caches; provide scalable and flexible Core-level DMR solution!

8Outline Introduction TDB execution model Experimental results Conclusion

9 TDB principle The same program input for the pair Similar memory access behavior Program A-L1$A’-L1$ Global memory Transparent binding:  Master issues L1 miss requests for the logical pair  Slave is prevent from accessing the global memory Dynamic binding: using the system network for  data communication and result comparison

10 Transparent dynamic binding Master Global memory Slave Program Logical pair:Consumer-consumer Sphere of Consistency The private caches Transparent of slaves Passively waiting Consumer-consumer data access pattern Producer

11 Maintain Consistency under Out- of-Order Execution Out-of-Order execution brings in wrong-path effects [1]: Master Global memory Slave Program Producer MA MA2 MA3 MA4 MA5 MA1 MA3 MA6 MA1 MA5 LRU MRU [1] R. Sendaga, et al.“The impact of wrong-path memory references in cache-coherent multiprocessor systems.” In JPDC’07

12 Maintain Consistency under Out- of-Order Execution Out-of-Order execution brings in wrong-path effects: Master Global memory Slave Program Producer MA MA2 MA3 MA4 MA5 MA1 MA3 MA6 MA1 MA5 2 2 LRU MRU

13 Maintain Consistency under Out- of-Order Execution Out-of-Order execution brings in wrong-path effects: Master Global memory Slave Program Producer MA MA2 MA3 MA4 MA5 MA1 MA3 MA6 MA1 MA LRU MRU Pipeline Refresh

14 Maintain Consistency under Out- of-Order Execution Out-of-Order execution brings in wrong-path effects: Master Global memory Slave Program Producer MA MA2 MA3 MA4 MA5 MA1 MA3 MA6 MA1 MA MRU LRU 5 5

15 Maintain Consistency under Out- of-Order Execution Out-of-Order execution brings in wrong-path effects: Master Global memory Slave Program Producer MA MA2 MA3 MA4 MA5 MA1 MA3 MA6 MA1 MA MRU LRU Master-slave private cache consistency violation Invariant: in-order memory instruction retirement sequence

16 Victim Buffer Assisted Conservative Private Cache Ingress Rule Master Slave Program MA1 1 1 MA2 MA3 MA4 MA5 MA1 MA3 MA6 MA1 MA5 MRU LRU 1 1 Global memory Victim Buffer:  Filter the WP data blocks

17 Victim Buffer Assisted Conservative Private Cache Ingress Rule Master Slave Program MA1 1 1 MA2 MA3 MA4 MA5 MA1 MA3 MA6 MA1 MA5 MRU LRU Global memory

18 Victim Buffer Assisted Conservative Private Cache Ingress Rule Master Slave Program MA1 1 1 MA2 MA3 MA4 MA5 MA1 MA3 MA6 MA1 MA5 MRU LRU Global memory

19 Victim Buffer Assisted Conservative Private Cache Ingress Rule Master Slave Program MA1 1 1 MA2 MA3 MA4 MA5 MA1 MA3 MA6 MA1 MA5 MRU LRU Global memory Conservative private cache ingress rule: accept data blocks from correct path into private caches

20 Master Slave Program MA MA2 MA3 MA4 MA5 MA1 MA3 MA6 MA1 MA5 MRU LRU Global memory MA1 MA5 Invariant: in-order memory instruction retirement sequence Maintain Consistency under Out- of-Order Execution Potential master-slave consistency violation

21 update-after-retirement LRU Replacement policy (uar-LRU) Master Slave Program MA1 1 1 MA2 MA3 MA4 MA5 MA1 MA3 MA6 MA1 MA5 MRU LRU 1 1 Global memory MA1 MA5

22 update-after-retirement LRU Replacement policy (uar-LRU) Master Slave Program MA1 1 1 MA2 MA3 MA4 MA5 MA1 MA3 MA6 MA1 MA5 MRU LRU Global memory MA1 MA uar-LRU: update MRU after the instruction retirement to prevent the WP memory references from violating the consistency

23 Master-slave memory consistency violation External writes violates the master-slave memory consistency Atomicity of master-slave data access behavior Lacks of scalability as external writes become more frequent Master-slave input coherence: (a) external writes violates the consistency; (b) the master-slave consistency window in DCC

24 Transparent Input Coherence Strategy Take advantage of Transparent dynamic binding Break the atomicity of master-slave data access behavior Checker

25Outline Introduction TDB execution model Experimental results Conclusion

26 Experimental Setup Full system simulator: simics + GEMS Parallel workloads: SPLASH-2 The Baseline Dual Modular Redundancy System – N active cores and another N disabled cores –Simulate the DMR system where the slaves work without interfering the masters

27 The Performance of TDB Proposal 97.2%, 99.8%, 101.2% and 105.4% over the baseline for 4, 8, 16 and 32 cores respectively Conservative private cache ingress rule helps filter the WP effects

28 Network Traffic of TDB Proposal the total traffic is increased by 5.2%, 3.6%, 1.3% and 2.5% for 4-, 8-, 16- and 32-core CMP systems

29 Comparison against DCC [DSN’07] 9.2% 10.4% 18% 37.1% Transparent Dynamic Binding (TDB): scalableflexible scalable and flexible Core-level DMR solution!

30Conclusion Transparent Dynamic Binding –Reduce SoC to the scale of Private Caches Techniques to maintain the consistency –Consumer-consumer data access pattern –Victim Buffer assisted conservative ingress rule –uar-LRU replacement policy –Transparent input coherence policy Scalable and flexible core-level DMR solution

31Q&A?