Improving Multiple-CMP Systems with Token Coherence


1 Improving Multiple-CMP Systems with Token Coherence
Mike Marty (1), Jesse Bingham (2), Mark Hill (1), Alan Hu (2), Milo Martin (3), and David Wood (1)
(1) University of Wisconsin-Madison, (2) University of British Columbia, (3) University of Pennsylvania
Thanks to Intel, NSERC, NSF, and Sun

2 Summary
- Microprocessor → Chip Multiprocessor (CMP)
- Symmetric Multiprocessor (SMP) → Multiple CMPs
- Problem: Coherence with Multiple CMPs
- Old Solution: Hierarchical Protocol (Complex & Slow)
- New Solution: Apply Token Coherence
  - Developed for glueless multiprocessor [ISCA 2003]
  - Keep: Flat for Correctness
  - Exploit: Hierarchical for performance
- Less Complex & Faster than Hierarchical Directory

3 Outline
- Motivation and Background
  - Coherence in Multiple-CMP Systems
  - Example: DirectoryCMP
- Token Coherence: Flat for Correctness
- Token Coherence: Hierarchical for Performance
- Evaluation

4 Coherence in Multiple-CMP Systems
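- Chip Multiprocessors (CMPs) are emerging
- Larger systems will be built with Multiple CMPs
[Figure: CMP 1 to CMP 4 joined by an interconnect; each CMP contains processors (P) with instruction and data caches (I, D), an on-chip interconnect, and an L2.]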

5 Problem: Hierarchical Coherence
- Intra-CMP protocol for coherence within a CMP
- Inter-CMP protocol for coherence between CMPs
- Interactions between the protocols increase complexity (the state space explodes)
[Figure: CMP 1 to CMP 4 on an interconnect, labeled with Intra-CMP Coherence and Inter-CMP Coherence.]

6 Improving Multiple CMP Systems with Token Coherence
Token Coherence allows Multiple-CMP systems to be...
- Flat for correctness, but
- Hierarchical for performance
[Figure: CMP 1 to CMP 4 on an interconnect; labels: Low Complexity, Fast, Correctness Substrate, Performance Protocol.]

7 Example: DirectoryCMP
- 2-level MOESI Directory
- RACE CONDITIONS!
[Figure: CMP 0 (P0 to P3) and CMP 1 (P4 to P7), each processor with an L1 I&D cache, each CMP with a Shared L2 / directory backed by a Memory/Directory. A Store B is issued in each CMP; getx, fwd, inv, ack, data/ack, and WB messages cross both levels, with block B labeled [S O] and [M I] at the two L2 directories.]

8 Outline
- Motivation and Background
- Token Coherence: Flat for Correctness
  - Safety
  - Starvation Avoidance
- Token Coherence: Hierarchical for Performance
- Evaluation

9 Example: Token Coherence [ISCA 2003]
- Each memory block initialized with T tokens
- Tokens stored in memory, caches, & messages
- At least one token to read a block
- All tokens to write a block
[Figure: P0 to P3, each with an L1 I&D cache and an L2, connected by an interconnect to mem 0 through mem 3; Load B and Store B requests are shown at the processors.]
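The four rules on this slide can be sketched in a few lines of Python. This is an illustrative model only: the `Holder` class, the `send_tokens` helper, and the value `T = 4` are assumptions made for the example, not part of the talk.

```python
from dataclasses import dataclass

T = 4  # assumed token count per block; the slide only calls it "T"


@dataclass
class Holder:
    """Tokens for block B held by one place: memory, a cache, or a message.

    Hypothetical structure for illustration; not taken from the talk.
    """
    tokens: int = 0

    def can_read(self) -> bool:
        # At least one token is required to read the block.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # All T tokens are required to write the block.
        return self.tokens == T


def send_tokens(src: Holder, dst: Holder, n: int) -> None:
    """Move n tokens from src to dst without creating or destroying any."""
    if n < 0 or n > src.tokens:
        raise ValueError("cannot send more tokens than are held")
    src.tokens -= n
    dst.tokens += n


memory = Holder(tokens=T)      # every block starts with all T tokens in memory
p0, p1 = Holder(), Holder()
send_tokens(memory, p0, 1)     # Load B at P0
send_tokens(memory, p1, 1)     # Load B at P1: two readers, no writer
send_tokens(memory, p0, 2)     # Store B at P0 collects memory's tokens ...
send_tokens(p1, p0, 1)         # ... and P1's token, so P1 loses read permission
```

Because a writer must hold every token, no other holder can have the single token needed to read at the same time, which is the safety property the slide relies on.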

10 Extending to Multiple-CMP System
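[Figure: the four L1 I&D caches are grouped behind two on-chip interconnects, each with a Shared L2; mem 0 and mem 1 are joined by an interconnect. Labels for the four per-processor L2s from the previous slide also appear.]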

11 Extending to Multiple-CMP System
- Token counting remains flat
- Tokens to caches
- Handles shared caches and other complex hierarchies
[Figure: P0 to P3 with L1 I&D caches, two on-chip interconnects, two Shared L2s, and mem 0 and mem 1 on an interconnect; two Store B requests are shown.]
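One way to read "token counting remains flat" is that the safety check never looks at where in the hierarchy a token sits. The sketch below assumes a made-up snapshot of block B across two CMPs; the names (`L1.P0`, `L2.CMP0`, `mem`), the helper functions, and `T = 8` are all hypothetical.

```python
from typing import Dict, List

T = 8  # assumed token count per block (hypothetical value)


def tokens_conserved(holders: Dict[str, int], in_flight: List[int]) -> bool:
    """Flat safety check: tokens at every level plus those in messages sum to T."""
    return sum(holders.values()) + sum(in_flight) == T


def writers(holders: Dict[str, int]) -> List[str]:
    """Names of L1 caches that hold all T tokens and may therefore write."""
    return [name for name, n in holders.items()
            if name.startswith("L1") and n == T]


# Hypothetical snapshot for block B across two CMPs.
snapshot = {
    "L1.P0": 1, "L1.P1": 0, "L2.CMP0": 2,   # CMP 0
    "L1.P2": 1, "L1.P3": 0, "L2.CMP1": 0,   # CMP 1
    "mem": 3,
}
messages = [1]  # one token travelling between the CMPs
```

The check treats an L1, a shared L2, memory, and an in-flight message identically, so adding another cache level changes the snapshot but not the invariant.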

12 Starvation Avoidance
- Tokens move freely in the system
- Transient requests can miss in-flight tokens
- Incorrect speculation, filters, prediction, etc.
[Figure: CMP 0 and CMP 1 (P0 to P3, L1 I&D caches, on-chip interconnects, Shared L2s, mem 0 and mem 1); three processors issue Store B and send GETX requests.]

13 Starvation Avoidance
- Solution: issue Persistent Request
  - Heavyweight request guaranteed to succeed
- Methods: Centralized [2003] and Distributed (New)
[Figure: the same two-CMP system with three outstanding Store B requests.]

14 Old Scheme: Central Arbiter [2003]
- Processors issue persistent requests
[Figure: three Store B requests time out and their processors send persistent requests; arbiter 0 queues B: P0, B: P2, B: P1.]

15 Old Scheme: Central Arbiter [2003]
- Processors issue persistent requests
- Arbiter orders and broadcasts activate
[Figure: the activated request B: P0 is recorded at each L1, each Shared L2, and memory; arbiter 0 still queues B: P2 and B: P1.]

16 Old Scheme: Central Arbiter [2003]
- Processor sends deactivate to arbiter
- Arbiter broadcasts deactivate (and next activate)
- Bottom Line: handoff is 3 message latencies
[Figure: the recorded entry changes from B: P0 to B: P2 at every node; the handoff messages are numbered 1, 2, 3.]
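The central-arbiter scheme on the last three slides can be modeled as a queue with one active request. This is a minimal sketch under stated assumptions: the `CentralArbiter` class and its method names are invented for illustration, and a returned processor name stands in for the broadcast activate message.

```python
from collections import deque
from typing import Deque, Optional


class CentralArbiter:
    """Orders persistent requests for one block (hypothetical model)."""

    def __init__(self) -> None:
        self.queue: Deque[str] = deque()   # pending requests, arrival order
        self.active: Optional[str] = None  # requester currently activated

    def request(self, proc: str) -> Optional[str]:
        """A processor's persistent request arrives at the arbiter."""
        self.queue.append(proc)
        return self._activate_next()

    def deactivate(self, proc: str) -> Optional[str]:
        """The active processor reports completion; the next one is activated."""
        if proc != self.active:
            raise ValueError("only the active requester can deactivate")
        self.active = None
        return self._activate_next()

    def _activate_next(self) -> Optional[str]:
        # Returns the processor named in the broadcast activate, if any.
        if self.active is None and self.queue:
            self.active = self.queue.popleft()
            return self.active
        return None


arbiter = CentralArbiter()
arbiter.request("P0")     # arrival order taken from the slide: P0, P2, P1
arbiter.request("P2")
arbiter.request("P1")
arbiter.deactivate("P0")  # hop 1: deactivate reaches the arbiter
                          # hop 2: arbiter broadcasts deactivate plus activate for P2
                          # hop 3 (assumed): token holders respond to P2
```

The third hop in the comment is an interpretation of the numbered arrows in the figure; the slide itself only states that the handoff takes three message latencies.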

17 Improved Scheme: Distributed Arbitration [NEW]
- Processors broadcast persistent requests
[Figure: every L1, both Shared L2s, and memory hold a table with entries P0: B, P1: B, P2: B; no arbiter is present.]

18 Improved Scheme: Distributed Arbitration [NEW]
- Processors broadcast persistent requests
- Fixed priority (processor number)
[Figure: the same tables, with the P0: B entry marked in each one.]

19 Improved Scheme: Distributed Arbitration [NEW]
- Processors broadcast persistent requests
- Fixed priority (processor number)
- Processors broadcast deactivate
[Figure: a message numbered 1 is broadcast; the P1: B entry is now marked in each table.]

20 Improved Scheme: Distributed Arbitration [NEW]
- Bottom line: Handoff is a single message latency
- Subtle point: P0 and P1 must wait until next "wave"
[Figure: the tables now hold only P1: B and P2: B, with P1: B marked.]
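The distributed scheme can be sketched as a table that every node keeps locally and updates from the same broadcasts. The classes below are hypothetical. In particular, the `Requester.may_reissue` rule is only one plausible reading of the "wave" remark (a processor that has been served waits until everyone who was queued alongside it has also been served); the slides do not spell the rule out.

```python
from typing import Dict, Optional, Set


class PersistentTable:
    """One node's local table of persistent requests (hypothetical structure)."""

    def __init__(self) -> None:
        self.entries: Dict[int, str] = {}  # processor number -> requested block

    def on_request(self, proc: int, block: str) -> None:
        self.entries[proc] = block

    def on_deactivate(self, proc: int) -> None:
        self.entries.pop(proc, None)

    def waiting(self, block: str) -> Set[int]:
        return {p for p, b in self.entries.items() if b == block}

    def active(self, block: str) -> Optional[int]:
        # Fixed priority: the lowest processor number wins at every node.
        procs = self.waiting(block)
        return min(procs) if procs else None


class Requester:
    """Tracks the assumed "wave" rule for one processor."""

    def __init__(self, proc: int, table: PersistentTable) -> None:
        self.proc = proc
        self.table = table
        self.wave: Set[int] = set()

    def finish(self, block: str) -> None:
        # Remember who was still waiting, then deactivate our own entry.
        self.wave = self.table.waiting(block) - {self.proc}
        self.table.on_deactivate(self.proc)

    def may_reissue(self, block: str) -> bool:
        # Assumed rule: wait until everyone from that wave has been served.
        self.wave &= self.table.waiting(block)
        return not self.wave
```

Since each node computes `active` from its own table, a deactivate broadcast is the only message needed before the next requester is recognized everywhere, which matches the single-message handoff claimed on the slide. Without some form of wave rule, a fixed priority would let a low-numbered processor reissue immediately and starve higher-numbered ones.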

21 Outline
- Motivation and Background
- Token Coherence: Flat for Correctness
- Token Coherence: Hierarchical for Performance
- Evaluation

22 Hierarchical for Performance: TokenCMP
- Target System:
  - 2-8 CMPs
  - Private L1s, shared L2 per CMP
  - Any interconnect, but high-bandwidth
- Performance Policy Goals:
  - Aggressively acquire tokens
  - Exploit on-chip locality and bandwidth
  - Respect cache hierarchy
  - Detecting and handling missed tokens

23 Hierarchical for Performance: TokenCMP
- Approach:
  - On L1 miss, broadcast within own CMP
  - Local cache responds if possible
  - On L2 miss, broadcast to other CMPs
  - Appropriate L2 bank responds or broadcasts within its CMP
  - Optionally filter
  - Responses between CMPs carry extra tokens for future locality
- Handling missed tokens:
  - Timeout after average memory latency
  - Invoke persistent request (no retries)
- Larger systems can use filters, multicast, soft-state directories
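The escalation order described on this slide can be written as a small decision function. This is a sketch, not the TokenCMP implementation: the `Scope` enum, the `request_scope` function, and its parameters are names assumed for the example.

```python
from enum import Enum


class Scope(Enum):
    INTRA_CMP = "broadcast within own CMP"
    INTER_CMP = "broadcast to other CMPs"
    PERSISTENT = "persistent request"


def request_scope(l2_miss: bool, cycles_waited: int,
                  avg_memory_latency: int) -> Scope:
    """Choose how far a request for missing tokens should reach."""
    if cycles_waited >= avg_memory_latency:
        # Timeout: no retries, fall back to the correctness substrate.
        return Scope.PERSISTENT
    if l2_miss:
        # The on-chip L2 could not supply the tokens, so go off chip.
        return Scope.INTER_CMP
    # An L1 miss is first tried on chip, where bandwidth is plentiful.
    return Scope.INTRA_CMP
```

For instance, with a hypothetical average memory latency of 400 cycles, `request_scope(False, 0, 400)` stays on chip, `request_scope(True, 50, 400)` goes to the other CMPs, and any call with `cycles_waited` of 400 or more escalates to a persistent request.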

24 Outline
- Motivation and Background
- Token Coherence: Flat for Correctness
- Token Coherence: Hierarchical for Performance
- Evaluation
  - Model checking
  - Performance w/ commercial workloads
  - Robustness

25 TokenCMP Evaluation
- Simple? Model checking
- Fast? Full-system simulation w/ commercial workloads
- Robust? Micro-benchmarks to simulate high contention

26 Complexity Evaluation with Model Checking
- Methods:
  - TLA+ and TLC
  - DirectoryCMP omits all intra-CMP details
  - TokenCMP’s correctness substrate modeled
- Result:
  - Complexity similar between TokenCMP and non-hierarchical DirectoryCMP
  - Correctness Substrate verified to be correct and deadlock-free
  - Small configuration, varied parameters
  - All possible performance protocols correct

27 Performance Evaluation
- Target System:
  - 4 CMPs, 4 procs/CMP
  - 2 GHz OoO SPARC, 8 MB shared L2 per chip
  - Directly connected interconnect
- Methods: Multifacet GEMS simulator
  - Simics augmented with timing models
  - Released soon: ISCA 2005 Tutorial!
- Benchmarks:
  - Performance: Apache, Spec, OLTP
  - Robustness: Locking uBenchmark

28 Full-system Simulation: Runtime
TokenCMP performs 9-50% faster than DirectoryCMP

29 Full-system Simulation: Runtime
TokenCMP performs 9-50% faster than DirectoryCMP
[Figure: runtime chart; labels: DRAM Directory, Perfect L2.]

30 Full-system Simulation: Traffic
- TokenCMP traffic is reasonable (or better)
- DirectoryCMP control overhead greater than broadcast for small system

31 Performance Robustness
Locking micro-benchmark (correctness substrate only)
[Figure: chart with an axis running from more contention to less contention.]

32 Performance Robustness
Locking micro-benchmark (correctness substrate only)
[Figure: chart with an axis running from more contention to less contention.]

33 Performance Robustness
Locking micro-benchmark
[Figure: chart with an axis running from more contention to less contention.]

34 Summary
- Microprocessor → Chip Multiprocessor (CMP)
- Symmetric Multiprocessor (SMP) → Multiple CMPs
- Problem: Coherence with Multiple CMPs
- Old Solution: Hierarchical Protocol (Complex & Slow)
- New Solution: Apply Token Coherence
  - Developed for glueless multiprocessor [2003]
  - Keep: Flat for Correctness
  - Exploit: Hierarchical for performance
- Less Complex & Faster than Hierarchical Directory

36 Full-system Simulation: Traffic

37 Full-system Simulation: Intra-CMP Traffic

