March University of Utah CS 7698 Token Coherence: Decoupling Performance and Correctness. Article by: Martin, Hill & Wood. Presented by: Michael Tabet
A Tale of Two Methods
  Snooping-based: uses totally ordered broadcasts to preserve correctness, but uses lots of bandwidth. Big (large buses) = BAD!
  Directory-based: uses indirection to conserve bandwidth, but the indirection adds latency and a directory controller is needed.
Potential Workarounds
  Snooping: snooping is fast but requires a bus, and big, fast buses are complex -> use a virtual bus to perform a virtual broadcast!
  Directory: networks require lots of logic (especially big ones) -> use glueless networks!
Token Coherence: allows for both indirection and speedup through unordered broadcasts. It has two components: a correctness substrate and a performance protocol.
Correctness: speed is good, correctness is better! Reads and writes must be guaranteed to be ordered, so Token Coherence uses a correctness “substrate”.
Correctness Invariants
  1. At all times, each block has T tokens.
  2. A processor can write a block only if it holds all T tokens.
  3. A processor can read a block only if it holds at least one token.
  4. If a coherence message contains one or more tokens, it must contain data.
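The four invariants can be sketched as simple token bookkeeping for one block. This is a minimal illustration, not the paper's implementation: the `Block` class, its method names, and the node names are hypothetical, and a transfer is treated as atomic, so tokens in flight inside messages are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Token bookkeeping for one memory block (hypothetical model)."""
    total: int                                  # T, fixed for the block
    held: dict = field(default_factory=dict)    # node name -> token count

    def check_conservation(self) -> bool:
        # Invariant 1: the tokens in the system always sum to T.
        return sum(self.held.values()) == self.total

    def can_write(self, node: str) -> bool:
        # Invariant 2: writing requires all T tokens.
        return self.held.get(node, 0) == self.total

    def can_read(self, node: str) -> bool:
        # Invariant 3: reading requires at least one token.
        return self.held.get(node, 0) >= 1

    def transfer(self, src: str, dst: str, count: int, has_data: bool) -> None:
        # Invariant 4: a message that carries tokens must also carry data.
        if count >= 1 and not has_data:
            raise ValueError("tokens must travel with data")
        if count < 0 or count > self.held.get(src, 0):
            raise ValueError("cannot send more tokens than held")
        self.held[src] -= count
        self.held[dst] = self.held.get(dst, 0) + count


# Assumed starting point: memory holds all T = 4 tokens.
block = Block(total=4, held={"memory": 4})
block.transfer("memory", "P1", 1, has_data=True)
```

After the single transfer, P1 may read but not write; it would need the remaining three tokens before a write is allowed.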
Invariant 1 Implications: a fixed token count per block allows for precise control of blocks of data.
Invariant 2 Implications: enables a write-control mechanism that allows in-order writes.
Invariant 3 Implications: restricts reads to processors that hold at least one token.
Invariant 4 Implications: provides a method to ensure cache coherence, since tokens never arrive without the block's data.
Starvation: the invariants allow for ordered reads/writes, but how do we prevent starvation? With persistent requests:
  1. A processor times out on its transient requests.
  2. It raises a persistent request (only one per block).
  3. All nodes must forward the block to the requesting node.
  But repeated and persistent requests make up only 1-3% of the messages.
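The persistent-request steps can be sketched as follows. This is an assumed model for illustration only: the `PersistentRequests` class, its method names, and the block address are hypothetical, and the timeout that triggers escalation is not modeled.

```python
class PersistentRequests:
    """Tracks at most one active persistent request per block (assumed model)."""

    def __init__(self):
        self.active = {}  # block address -> starving node

    def activate(self, addr, node):
        # Only one persistent request may be active for a block at a time.
        if addr in self.active:
            return False
        self.active[addr] = node
        return True

    def forward_all(self, addr, held):
        # Every other node gives up its tokens for the block to the starving node.
        starver = self.active[addr]
        moved = sum(n for node, n in held.items() if node != starver)
        for node in held:
            if node != starver:
                held[node] = 0
        held[starver] = held.get(starver, 0) + moved

    def deactivate(self, addr):
        # Called once the starving node has completed its access.
        self.active.pop(addr, None)


# P1 has timed out on its transient requests and escalates.
requests = PersistentRequests()
held = {"P1": 0, "P2": 1, "P3": 3}   # tokens per node for one block, T = 4
requests.activate(0x40, "P1")
requests.forward_all(0x40, held)
```

Once every token has been forwarded, P1 holds all T tokens and can complete either a read or a write before the request is deactivated.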
[Figure: Persistent Request State Diagram]
Performance Protocol: if you always follow the rules, things can get slow and tedious! Tokens allow for unordered responses to requests, which opens the door to all sorts of optimizations.
TokenB: A New Contender
  Akin to an MSI snooping protocol: requests are broadcast.
  Data exists in one of three states: Modified (all tokens), Shared (some tokens), Invalid (no tokens).
  But: the performance protocol allows for better performance!
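The mapping from token count to an MSI-like state can be written as a small function. The name `tokenb_state` and the single-letter state labels are illustrative choices, not taken from the paper.

```python
def tokenb_state(tokens_held: int, total: int) -> str:
    """Map a node's token count for a block to an MSI-like state."""
    if total < 1 or not 0 <= tokens_held <= total:
        raise ValueError("token count must be between 0 and T")
    if tokens_held == total:
        return "M"   # Modified: all tokens, may read and write
    if tokens_held > 0:
        return "S"   # Shared: some tokens, may read only
    return "I"       # Invalid: no tokens, may neither read nor write
```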
TokenB: Optimized Token Counting. MSI was a bit of a lie; token counting can be optimized by altering invariants 1, 3, and 4:
  1. At all times, each block has T tokens, one of which is the owner token.
  3. A processor can read a block only if it holds at least one token for that block and has valid data.
  4. If a coherence message contains the owner token, it must contain data.
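A short sketch of what the altered invariants 3 and 4 permit: non-owner tokens may travel without data, so holding a token no longer implies holding valid data. The `Message` type and both function names are hypothetical, and the owner token is assumed to be counted as one of the tokens in a message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    tokens: int                   # total tokens carried, owner token included
    has_owner_token: bool
    data: Optional[bytes] = None  # None means the message carries no data


def message_is_legal(msg: Message) -> bool:
    # The owner token counts as one of the carried tokens (assumption).
    if msg.has_owner_token and msg.tokens < 1:
        return False
    # Altered invariant 4: only the owner token forces data to be attached.
    return not msg.has_owner_token or msg.data is not None


def can_read(tokens_held: int, has_valid_data: bool) -> bool:
    # Altered invariant 3: a token alone is not enough, valid data is required too.
    return tokens_held >= 1 and has_valid_data
```

Under these rules a data-less message with two plain tokens is legal, while a data-less message carrying the owner token is not.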
TokenB Continued: The Good Stuff
  Performance is in: tokens allow replies to be sent unordered and without a broadcast.
  This means: 15-28% faster than snooping, 17-54% faster than directory, and 21-25% less bandwidth than snooping.
An Example: P1 reads, then P2 writes, then P1 reads. Presume a 4-node system in which P1 has an invalid copy, P2 has a shared copy, and P3 is the “home/owner” node.
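One way the token counts might evolve in this scenario is sketched below. The total of T = 4 tokens, the initial split between P2 and P3, and the choice of which node supplies each read are assumptions made for illustration; the helper names `read` and `write` are hypothetical, and no migratory optimization is applied.

```python
T = 4  # assumed total number of tokens for the block


def read(held, reader, supplier):
    """The supplier hands one token (with data) to the reader."""
    held[supplier] -= 1
    held[reader] += 1


def write(held, writer):
    """All other nodes send every token they hold to the writer."""
    for node in held:
        if node != writer:
            held[writer] += held[node]
            held[node] = 0


# Assumed start: P1 invalid, P2 shared, P3 is the home/owner node.
held = {"P1": 0, "P2": 1, "P3": 3, "P4": 0}
trace = []

read(held, "P1", "P3")    # P1 reads: the owner P3 supplies one token
trace.append(dict(held))
write(held, "P2")         # P2 writes: it must collect all T tokens
trace.append(dict(held))
read(held, "P1", "P2")    # P1 reads again: P2 now holds everything
trace.append(dict(held))
```

Each snapshot in `trace` still sums to T, which is the conservation property the correctness substrate relies on.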
Example, The Snooping Way: [Figure: nodes P1, P2, P3, P4] All messages are broadcast!
Example, The Directory Way: [Figure: nodes P1, P2, P3, P4 and the directory] The directory processes messages!
Example, The Token Way: [Figure: nodes P1, P2, P3, P4 with message labels 1 (broadcast), 2, 3 (broadcast), (broadcast), 6]
Real-World Results
  Examined on a tree structure (virtual broadcast) and on a 2D torus.
  Migratory optimization: a read request that follows a write is forwarded all tokens.
  Benchmarked on OLTP, SPECjbb, and Apache.
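The migratory optimization can be sketched as a rule for how many tokens a responder sends on a read request. The function name, its parameters, and the default of one token for an ordinary read are assumptions for illustration rather than details taken from the paper.

```python
def tokens_sent_on_read(tokens_held: int, total: int, wrote_block: bool,
                        migratory: bool = True) -> int:
    """Return how many tokens a responder sends to a read requester."""
    if tokens_held == 0:
        return 0          # nothing to give
    if migratory and wrote_block and tokens_held == total:
        return total      # hand over everything: the reader will likely write next
    return 1              # ordinary read: one token (with data) is enough
```

Sending all tokens up front would save the requester a second request if it goes on to write the block, which is the access pattern the optimization targets.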
Results, Token vs. Snooping: Token wins!
Results, Directory vs. Token: Token mostly wins!
Conclusion: TokenB offers good performance for small to mid-sized parallel systems. Broadcasting limits scalability past 16 nodes, but other performance protocols could scale to larger systems!