Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison.


1 Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison

2 Overview
Rings a viable interconnect for future CMPs
Problem: Ring != Bus for ordering
  ▫ Bus-based snooping coherence not sufficient
Solutions:
  ▫ ORDERING-POINT: establish an ordering point
  ▫ GREEDY-ORDER: greedily order requests
  ▫ RING-ORDER: complete requests in ring order
RING-ORDER offers both stability and performance

3 Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
Conclusion

4 Future CMPs Bus? Crossbar? Packet-Switched? Ring?

5 The “Cell” Processor

6 Ring Interconnect
Why?
  ▫ Short, fast point-to-point links
  ▫ Fewer (data) ports
  ▫ Less complex than packet-switched
  ▫ Simple, distributed arbitration
  ▫ Exploitable ordering for coherence

7

8 Cache Coherence for a Ring

9 Ring is broadcast and offers ordering
Apply existing bus-based snooping protocols? NO!
Order properties of ring are different

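10 Ring Order != Bus Order
[Ring diagram with nodes P3, P6, P9, and P12 and two requests, A and B: one node observes them in the order {A, B}, another in the order {B, A}.]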
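
The effect on this slide can be reproduced with a small sketch. It assumes a unidirectional 12-node ring in which a message takes one time unit per hop; the placement of requests A and B at P9 and P3, the hop delay, and the function name are illustrative assumptions, not details taken from the talk.

```python
def arrival_order(ring_size, requests, observer):
    """Return request names in the order `observer` sees them.

    requests: list of (name, source_node, issue_time).
    Assumes a unidirectional ring with one time unit per hop (illustrative).
    """
    def hops(src, dst):
        # Number of links a message crosses going downstream from src to dst.
        return (dst - src) % ring_size

    arrivals = [(t + hops(src, observer), name) for name, src, t in requests]
    return [name for _, name in sorted(arrivals)]


# Two requests issued at the same time from opposite sides of the ring.
requests = [("A", 9, 0), ("B", 3, 0)]  # hypothetical placement: A at P9, B at P3
order_at_p12 = arrival_order(12, requests, observer=12)
order_at_p6 = arrival_order(12, requests, observer=6)
print(order_at_p12, order_at_p6)
```

On a bus every node would see the same order; here the two observers disagree, which is why a bus-based snooping protocol cannot be reused unchanged.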

11 Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
Conclusion

12 Snooping Protocols for Rings
Assumptions:
  ▫ Unidirectional ring
      Multiple rings per-address OK
  ▫ Write-back, write-invalidate caches
  ▫ Eager request forwarding
      e.g., forward message then snoop [Strauss et al. ISCA 2006]
Can total bus order be recreated? YES

13 ORDERING-POINT Example
[Ring diagram, nodes P1 to P11 plus an ordering point. Labels: O; S; Store; P9 getM (inactive).]

14 ORDERING-POINT Example
[Ring diagram, nodes P1 to P11 plus an ordering point. Labels: O → I; S; Store; P9 getM; own request ordered.]

15 ORDERING-POINT Example
[Ring diagram, nodes P1 to P11 plus an ordering point. Labels: O → I; S → I; Store; P9 getM; own request ordered.]

16 ORDERING-POINT Example
[Ring diagram, nodes P1 to P11 plus an ordering point. Labels: O → I; S → I; Store; Data to P9; own request ordered; P9 ACK.]

17 ORDERING-POINT Example
[Ring diagram, nodes P1 to P11 plus an ordering point. Labels: O → I; S → I; Store; Data to P9; own request ordered; P9 ACK; Store; P6 getM.]

18 ORDERING-POINT Example
[Ring diagram, nodes P1 to P11 plus an ordering point. Labels: Data to P6; Store; P6 getM; Store Complete.]

19 Bottom line: ORDERING-POINT
Requests totally ordered
  + Stable, predictable performance
Slow
  - Requests not active immediately
Extra control overhead
  - N + N/2 hops for request message
  - N/2 hops for Ack message
Can requests be active immediately? YES (e.g., IBM Power4/5)
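
To make the control overhead concrete, the hop counts from this slide can be evaluated for a given ring size. This is only a worked calculation of the slide's own formulas; the function name is hypothetical and N = 12 is an assumed ring size, not a figure from the evaluation.

```python
def ordering_point_hops(n_nodes):
    """Control-message hops per request, using the counts given on the slide."""
    request_hops = n_nodes + n_nodes / 2  # N + N/2 hops for the request message
    ack_hops = n_nodes / 2                # N/2 hops for the Ack message
    return request_hops, ack_hops


request_hops, ack_hops = ordering_point_hops(12)  # assumed ring size
print(request_hops, ack_hops, request_hops + ack_hops)
```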

20 GREEDY-ORDER Example
[Ring diagram, nodes P1 to P12. Labels: O; S; → I; Store (twice); P9 getM; response: (empty).]

21 GREEDY-ORDER Example
[Ring diagram, nodes P1 to P12. Labels: O; → I; Store; P9 getM; response: ACK; will send data.]

22 GREEDY-ORDER Example
[Ring diagram, nodes P1 to P12. Labels: O; → I; Store; P9 getM; response: ACK; will send data; Store; P6 getM; response: (empty).]

23 GREEDY-ORDER Example
[Ring diagram, nodes P1 to P12. Labels: O; → I; will send data; Store; Data to P9; Store; P6 getM; response: acked.]

24 GREEDY-ORDER Example
[Ring diagram, nodes P1 to P12. Labels: Store; Data to P9; M; Store; P6 getM; response: acked; RETRY.]

25 Bottom line: GREEDY-ORDER
Average case is fast
  + Request active immediately
Requires combined snoop response
  ▫ Synchronous timing of snoops for efficiency
Resorts to unbounded # of retries in conflict
  ▫ Will conditions eventually allow request completion?
  ▫ Probabilistic system (e.g., Ethernet)
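
The conflict case in the preceding example slides (P9 is acked, P6 is told to retry) can be sketched as follows. This is a deliberately simplified model, assuming a single owner that acks the first getM it snoops and answers every later one with a retry; the function name and the response strings are hypothetical.

```python
def greedy_responses(snoop_order):
    """Response each requester receives when the owner acks only the first getM.

    snoop_order: requester names in the order the owner snoops their requests.
    Simplified sketch: one owner, one block, no re-issue of retried requests.
    """
    responses = {}
    winner = None
    for requester in snoop_order:
        if winner is None:
            winner = requester            # owner acks and will send data
            responses[requester] = "ACK"
        else:
            responses[requester] = "RETRY"  # owner already acked another request
    return responses


responses = greedy_responses(["P9", "P6"])
print(responses)
```

Nothing in this scheme bounds how often a given requester loses, which is the source of the unbounded retries noted above.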

26 Recap
Existing Solutions:
1. ORDERING-POINT
     Establishes total order
     Extra latency and control message overhead
2. GREEDY-ORDER
     Fast in common case
     Unbounded retries
Ideal Solution
  ▫ Fast for average case
  ▫ Stable for worst case (no retries)

27 New Approach: RING-ORDER
+ Requests complete in order of ring position
  ▫ Fully exploits ring ordering
+ Initial requests always succeed
  ▫ No retries, no ordering point
  ▫ Fast, stable, predictable performance
Key: Use token counting
  ▫ All tokens to write, one token to read
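
The token-counting rule (all tokens to write, one token to read) can be sketched in a few lines. The class, the helper, and the total of four tokens per block are assumptions made for illustration; the slides do not specify a token count.

```python
from dataclasses import dataclass

TOTAL_TOKENS = 4  # hypothetical number of tokens per block


@dataclass
class CacheLine:
    tokens: int = 0

    def can_read(self):
        # One token is enough to read.
        return self.tokens >= 1

    def can_write(self):
        # Writing requires holding every token for the block.
        return self.tokens == TOTAL_TOKENS


def transfer(src, dst, count):
    """Move `count` tokens from src to dst; tokens are never created or destroyed."""
    if count < 0 or count > src.tokens:
        raise ValueError("cannot transfer more tokens than the source holds")
    src.tokens -= count
    dst.tokens += count


writer = CacheLine(tokens=TOTAL_TOKENS)
reader = CacheLine()
transfer(writer, reader, 1)  # the reader obtains a single token
print(writer.can_write(), reader.can_read())
```

Because the total number of tokens is conserved, a writer holding all of them knows that no other cache can be reading the block.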

28 RING-ORDER Example
[Ring diagram, nodes P1 to P12. Legend: token, priority token. Labels: P9 getM; Store.]

29 RING-ORDER Example
[Ring diagram, nodes P1 to P12. Legend: token, priority token. Labels: P9 getM; Store; FurthestDest = P9.]

30 RING-ORDER Example
[Ring diagram, nodes P1 to P12. Labels: Store; Store; FurthestDest = P9; P6 getM.]

31 RING-ORDER Example
[Ring diagram, nodes P1 to P12. Labels: Store; Store Complete; FurthestDest = P9; Store Complete.]

32 RING-ORDER Recap
Key: Exploit order of ring with token counting
  ▫ Requests never race with tokens
Furthest Destination field
  ▫ Carried in responses, tracked in MSHRs
  ▫ Determines if tokens need to keep moving
Priority token ensures liveness
Data satisfies all requestors during traversal
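
A rough sketch of how completion order follows ring position is given below. It assumes that all tokens start at one node, that every request is already outstanding, and that each requester passes the tokens downstream once its store completes; the starting node (P12), the direction of travel, and the function name are assumptions, and the real protocol's priority token and MSHR bookkeeping are not modeled.

```python
def ring_order_completion(ring_size, token_holder, requesters):
    """Order in which requesters complete as tokens travel downstream.

    Returns (completion_order, furthest_dest), where furthest_dest is the last
    requester the tokens must reach before they can stop moving.
    """
    def hops(dst):
        # Downstream distance from the token holder; a full lap if dst is the holder.
        return (dst - token_holder) % ring_size or ring_size

    order = sorted(requesters, key=hops)
    furthest_dest = order[-1] if order else None
    return order, furthest_dest


# P9 and P6 both want to store; tokens are assumed to start at P12.
order, furthest = ring_order_completion(12, token_holder=12, requesters=[9, 6])
print(order, furthest)
```

Under these assumptions P6 completes first and forwards the tokens on to P9, regardless of which request was issued first, and no request is ever retried.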

33 RING-ORDER vs. Token Coherence
                              Token Coherence                 RING-ORDER
Safety                        token counting                  token counting
Liveness                      retries + persistent requests   priority token + ring order
DRAM state (bits per block)   log2(# tokens)                  1

34 Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
Conclusion

35 Applying to Baseline CMP

36 Interfacing with Memory Controllers
Problem: When should memory respond?
Solution: 1 bit per block of memory
  ▫ Owner bit for ORDERING-POINT and GREEDY-ORDER
  ▫ Token-count bit for RING-ORDER
      All or none tokens
Cache the bits in a Memory Interface Cache
  ▫ Eliminates costly DRAM accesses
  ▫ Enables GREEDY-ORDER to meet snoop timing
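
The all-or-none token bit can be sketched as a one-bit-per-block lookup at the memory controller. The class, method names, and response string are hypothetical, and the sketch assumes that a writeback always returns every token at once; it models only the RING-ORDER token-count bit, not the owner bit used by the other two protocols.

```python
class MemoryController:
    """Hypothetical controller keeping one token-count bit per block."""

    def __init__(self):
        # Block address -> True if memory holds all tokens, False if it holds none.
        # A block that was never requested is assumed to have all tokens in memory.
        self.has_all_tokens = {}

    def snoop_request(self, addr):
        """Respond only when memory holds the tokens for this block."""
        if self.has_all_tokens.get(addr, True):
            self.has_all_tokens[addr] = False  # tokens leave memory together
            return "DATA+ALL_TOKENS"
        return None  # some cache holds the tokens; memory stays silent

    def writeback(self, addr):
        """Assumed: a writeback returns every token for the block."""
        self.has_all_tokens[addr] = True


mc = MemoryController()
first = mc.snoop_request(0x40)   # memory responds
second = mc.snoop_request(0x40)  # tokens are in a cache, so no response
mc.writeback(0x40)
third = mc.snoop_request(0x40)   # memory responds again
print(first, second, third)
```

Keeping this bit in a small on-chip cache, as the slide suggests, would let the controller decide whether to respond without first reading DRAM.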

37 Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
  ▫ Methodology
  ▫ Runtime
  ▫ Traffic
  ▫ Performance Stability
Conclusion

38 Methodology
Full-system Simulation
  ▫ Virtutech Simics
  ▫ Wisconsin GEMS
      GPL software
Workloads:
  ▫ Commercial: OLTP, Apache, SpecJBB, Zeus
  ▫ Scientific: OMPart, OMPfma3d, OMPmgrid
Protocols:
  ▫ ORDERING-POINT
  ▫ GREEDY-ORDER (called -IDEAL in paper)
  ▫ RING-ORDER

39 Simulation Parameters 1/2
[Diagram labels]
SPARC, 4 GHz
8 MB, 16-way, 25-cycle bank access
1 MB, 4-way, 15-cycle data access
64 KB I&D, 4-way, 2-cycle access

40 Simulation Parameters 2/2
Memory Interface Cache: 128 KB, 16-way, 256 bits per tag
Ring Link: 8 cycles total delay, 80 bytes per cycle
275-cycle DRAM access

41 Normalized Runtime
RING-ORDER is up to 52% faster than ORDERING-POINT

42 Ring Bandwidth
RING-ORDER uses up to 34% less bandwidth

43 GREEDY-ORDER Starvation
[Timeline figure with one column per processor (Processor 3, Processor 4, Processor 6, and a fourth processor). Several processors repeatedly issue getM requests that complete ("ack p7, send data", "ack p3, send data") while another keeps retrying. Labels include "RETRY #1402" and "+70,000 cycles".]

44 Retries
MAX Retries/Request
Workload    GREEDY-ORDER   RING-ORDER
Apache      10             0
OLTP        8              0
SpecJBB     11             0
Zeus        14             0
OMPmgrid    timed out      0
OMPart      29             0
OMPfma3d    10             0
RING-ORDER offers stable, bounded performance

45 Conclusion
Rings a viable interconnect for CMPs
Ring != Bus for ordering
RING-ORDER protocol offers best of:
  ▫ ORDERING-POINT (stable), and
  ▫ GREEDY-ORDER (fast)
P.S. RING-ORDER requires NO system-wide snoop response
  ▫ Useful for hierarchy of rings

46 BACKUP SLIDES

47 Flexible Snooping [Strauss et al. ISCA 2006]
Eager vs. Lazy forwarding
Key Differences:
  ▫ Targets coherence between bus-based CMPs
  ▫ Logical ring on message-passing interconnect
  ▫ Protocol similar to GREEDY-ORDER
      Uses a separate combined snoop response message
RING-ORDER also works with logical ring
  ▫ Possible to extend protocol to send data off the ring
Lazy vs. Eager Forwarding applies to RING-ORDER
  ▫ Synergistic fit to reduce snoop power

