Shuchang Shan † ‡, Yu Hu †, Xiaowei Li † † Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences.

Transparent Dynamic Binding with Fault-Tolerant Cache Coherence Protocol for Chip Multiprocessors
Shuchang Shan †‡, Yu Hu †, Xiaowei Li †
† Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences
‡ Graduate University of Chinese Academy of Sciences (GUCAS)

2 Outline
– Introduction
– TDB execution model
– Experimental results
– Conclusion

3 Architectural level Dual Modular Redundancy
Instruction-level DMR
– DIVA [MICRO’99], SHREC [MICRO’04], EDDI [TR’02]
Thread-level DMR
– AR-SMT [FTCS’99], SRT [ISCA’00]
Core-level DMR
– CRTR [ISCA’03], Reunion [MICRO’06], DCC [DSN’07]
[Figure labels: memory system, L1; leading thread and trailing thread; leading instructions and trailing instructions (A, A’, B, B’); EX’, CHK]
For CMP systems, build core-level DMR to make use of the abundant hardware resources!

4 Core-level Dual Modular Redundancy (DMR)
Using coupled cores to verify each other’s execution
Static binding
– Lacks flexibility
– e.g., Reunion [MICRO’06], CRT [ISCA’02], CRTR [ISCA’03]
Dynamic binding
– Lacks scalability for parallel processing
– e.g., DCC [DSN’07, WDDD’08]

5 Key issue in Core-level DMR: maintaining master-slave memory consistency
Master-slave memory consistency
– Coupled cores must get the same memory value
– External writes cause consistency violations
Reunion [Smolens-MICRO’06]
– Rollback and recovery on inconsistency
Dynamic Core Coupling (DCC) [LaFrieda-DSN’07]
– Consistency window to stall the external writes
[Figure callouts: scalability problem; consistency violation]

6 Scalability problem
External writes occur earlier and more frequently as the system scales
– Reunion: unacceptable recovery overhead for consistency violations
– DCC: unacceptable stall latency caused by the consistency window
Scalable solution needed
– Reduce the consistency maintenance overhead
Probability of external writes occurring within certain slacks
– For a 4-CMP system: 28% in 100 cycles, 37% in 500 cycles
– For a 16-CMP system: 43% in 100 cycles, 55% in 500 cycles
Number of external writes within 1K cycles: 0.3 for 4-CMP, 3.3 for 16-CMP

7 Basic idea
Sphere of Consistency (SoC): the scope of the master-slave memory consistency maintenance
– The memory hierarchy
– The private caches
[Figure, shown twice: master and slave, each with an L1 cache, above global memory]
Transparent Dynamic Binding (TDB): reduce the SoC to the scale of private caches; provide a scalable and flexible core-level DMR solution!

9 TDB principle
The same program input for the pair
Similar memory access behavior
[Figure labels: Program; A-L1$ and A’-L1$; Global memory]
Transparent binding
– The master issues L1 miss requests for the logical pair
– The slave is prevented from accessing the global memory
Dynamic binding
– Uses the system network for data communication and result comparison

10 Transparent dynamic binding
Logical pair: consumer-consumer
Sphere of Consistency: the private caches
Transparent of slaves: passively waiting
Consumer-consumer data access pattern
[Figure labels: Program; Master; Slave; Global memory; Producer]
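The transparent binding described on slides 9 and 10 can be sketched roughly as follows. This is only an illustrative model, not the authors' protocol: the class names `GlobalMemory` and `LogicalPair`, the request log, and the choice to forward each fill to the slave at the moment the master's miss is served are all assumptions made for the example.

```python
class GlobalMemory:
    """Hypothetical stand-in for the shared memory hierarchy."""

    def __init__(self, data):
        self.data = dict(data)
        self.requests = []  # (core, addr) pairs that actually reached memory

    def read(self, core, addr):
        self.requests.append((core, addr))
        return self.data[addr]


class LogicalPair:
    """Master and slave private caches treated as one logical pair (assumed model)."""

    def __init__(self, memory):
        self.memory = memory
        self.master_l1 = {}
        self.slave_l1 = {}

    def master_load(self, addr):
        if addr not in self.master_l1:
            # Only the master issues the L1 miss request, on behalf of the pair.
            value = self.memory.read("master", addr)
            self.master_l1[addr] = value
            # Assumption: the same fill is forwarded to the slave's private cache.
            self.slave_l1[addr] = value
        return self.master_l1[addr]

    def slave_load(self, addr):
        # The slave never accesses global memory; None means "keep waiting".
        return self.slave_l1.get(addr)


mem = GlobalMemory({0x40: 7})
pair = LogicalPair(mem)
before = pair.slave_load(0x40)       # slave has to wait passively
master_value = pair.master_load(0x40)
after = pair.slave_load(0x40)        # slave now consumes the same value
```

In this toy model both cores consume the same value while only one request reaches memory, which is the property the consumer-consumer pattern is meant to give.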

11-15 Maintain Consistency under Out-of-Order Execution
Out-of-Order execution brings in wrong-path effects [1]
[Figure, animated over slides 11 to 15. Labels: Program; Master; Slave; Global memory; Producer; memory accesses MA1 to MA6; private cache contents ordered from LRU to MRU; Pipeline Refresh]
Master-slave private cache consistency violation
Invariant: in-order memory instruction retirement sequence
[1] R. Sendag, et al. “The impact of wrong-path memory references in cache-coherent multiprocessor systems.” In JPDC’07
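A small simulation may help show why wrong-path references break private cache consistency even though both cores retire the same instructions. The trace below is hypothetical (the exact sequence in the slide animation cannot be recovered from the transcript), and the helper `run_lru` is a name introduced only for this sketch. It models one fully associative LRU cache set.

```python
from collections import OrderedDict


def run_lru(accesses, capacity):
    """Replay block accesses on a plain LRU set; return blocks from LRU to MRU."""
    cache = OrderedDict()
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)       # hit: block becomes MRU
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # miss on a full set: evict LRU
            cache[block] = True
    return list(cache)


# Both cores retire the same in-order sequence (the invariant on the slide).
retired = ["MA1", "MA2", "MA3", "MA4"]
# Assumption: only the master speculatively touches MA5 on a wrong path.
master_seen = ["MA1", "MA2", "MA5", "MA3", "MA4"]
slave_seen = list(retired)

master_cache = run_lru(master_seen, capacity=4)
slave_cache = run_lru(slave_seen, capacity=4)
diverged = master_cache != slave_cache
```

Here the master ends up without MA1 while the slave still holds it, so a later access to MA1 would miss on one core and hit on the other.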

16-19 Victim Buffer Assisted Conservative Private Cache Ingress Rule
Victim Buffer: filter the WP (wrong-path) data blocks
Conservative private cache ingress rule: accept only data blocks from the correct path into private caches
[Figure, animated over slides 16 to 19. Labels: Program; Master; Slave; Global memory; memory accesses MA1 to MA6; private cache contents ordered from MRU to LRU]
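One way to picture the conservative ingress rule is a staging buffer in front of the private cache. The sketch below is an assumption-laden illustration, not the hardware design from the talk: the class `IngressFilter`, its method names, and the decision to discard a squashed block immediately are all hypothetical.

```python
class IngressFilter:
    """Toy model: fills wait in a victim buffer until their instruction retires."""

    def __init__(self):
        self.victim_buffer = {}  # addr -> data for speculative fills
        self.l1 = {}             # private cache, correct-path blocks only

    def fill(self, addr, data):
        # Every incoming block is staged first, whatever path requested it.
        self.victim_buffer[addr] = data

    def retire(self, addr):
        # Correct path confirmed: the block may now enter the private cache.
        if addr in self.victim_buffer:
            self.l1[addr] = self.victim_buffer.pop(addr)

    def squash(self, addr):
        # Wrong-path block: filtered out (assumed here to be dropped at once).
        self.victim_buffer.pop(addr, None)


core = IngressFilter()
core.fill("MA1", "data1")
core.fill("MA5", "data5")  # assumed to come from a wrong-path reference
core.retire("MA1")
core.squash("MA5")
```

After this sequence only MA1 sits in the private cache, so a slave that never saw the wrong-path request would hold the same contents.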

20 Maintain Consistency under Out-of-Order Execution
Potential master-slave consistency violation
Invariant: in-order memory instruction retirement sequence
[Figure labels: Program; Master; Slave; Global memory; memory accesses MA1 to MA6; private cache contents ordered from MRU to LRU]

21-22 update-after-retirement LRU Replacement policy (uar-LRU)
uar-LRU: update MRU after the instruction retirement to prevent the WP memory references from violating the consistency
[Figure, animated over slides 21 and 22. Labels: Program; Master; Slave; Global memory; memory accesses MA1 to MA6; private cache contents ordered from MRU to LRU]
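The effect of uar-LRU can be illustrated with a minimal model of one cache set in which recency changes only at retirement. The class `UarLruSet` and the access trace are hypothetical and chosen only to show the idea; a real implementation would live in the cache controller.

```python
from collections import OrderedDict


class UarLruSet:
    """Toy cache set whose recency order is updated only at retirement."""

    def __init__(self, blocks):
        # Blocks are listed from LRU to MRU.
        self.order = OrderedDict.fromkeys(blocks)

    def access(self, block):
        # Speculative access: report hit or miss, leave recency untouched.
        return block in self.order

    def retire(self, block):
        # The instruction retired, so the block may now become MRU.
        if block in self.order:
            self.order.move_to_end(block)

    def recency(self):
        return list(self.order)


initial = ["MA1", "MA3", "MA5"]
master = UarLruSet(initial)
slave = UarLruSet(initial)

master.access("MA5")          # assumed wrong-path reference, never retires
for block in ["MA1", "MA3"]:  # identical in-order retirement on both cores
    master.retire(block)
    slave.retire(block)
```

Because the wrong-path access to MA5 never retires, both sets end with the same recency order, so later evictions pick the same victim on master and slave.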

23 Master-slave memory consistency violation
External writes violate the master-slave memory consistency
Atomicity of master-slave data access behavior
Lacks scalability as external writes become more frequent
[Figure: master-slave input coherence: (a) external writes violate the consistency; (b) the master-slave consistency window in DCC]

24 Transparent Input Coherence Strategy
Take advantage of transparent dynamic binding
Break the atomicity of master-slave data access behavior
[Figure label: Checker]

26 Experimental Setup
Full system simulator: Simics + GEMS
Parallel workloads: SPLASH-2
The baseline Dual Modular Redundancy system
– N active cores and another N disabled cores
– Simulates the DMR system where the slaves work without interfering with the masters

27 The Performance of TDB Proposal
97.2%, 99.8%, 101.2% and 105.4% relative to the baseline for 4, 8, 16 and 32 cores, respectively
The conservative private cache ingress rule helps filter the WP effects

28 Network Traffic of TDB Proposal
The total traffic is increased by 5.2%, 3.6%, 1.3% and 2.5% for 4-, 8-, 16- and 32-core CMP systems, respectively

29 Comparison against DCC [DSN’07]
[Figure annotations: 9.2%, 10.4%, 18%, 37.1%]
Transparent Dynamic Binding (TDB): scalable and flexible core-level DMR solution!

30 Conclusion
Transparent Dynamic Binding
– Reduce SoC to the scale of private caches
Techniques to maintain the consistency
– Consumer-consumer data access pattern
– Victim Buffer assisted conservative ingress rule
– uar-LRU replacement policy
– Transparent input coherence policy
Scalable and flexible core-level DMR solution

31 Q&A?
