Token Coherence: Decoupling Performance and Correctness

1 Token Coherence: Decoupling Performance and Correctness
Milo Martin, Mark Hill, and David Wood
Wisconsin Multifacet Project, University of Wisconsin—Madison

Speaker notes: Something to ponder and keep in mind during the presentation: why wasn't this done before? Technology and workload trends; a past focus on scalability; the mindset that each request must succeed the first time; nobody thought of this decoupling. Try to avoid oral pauses ("Ok?", "Right?"). Need to separate goal setting from goal achieving. Other things: previous systems have not done this (an unordered broadcast protocol); this is not about logical time; give the context; replacements are latency tolerant; tokens in memory are kept in ECC; how do we select the timeout interval?; decouple the interconnect from the protocol; Alpha and AMD moved towards directories, IBM and Sun towards snooping; remove the snooping/directory duality; implementation costs: bits in caches & memory, a starvation-prevention mechanism, token counting vs. token collection/finding.

2 We See Two Problems in Cache Coherence
1. Protocol ordering bottlenecks
- An artifact of conservatively resolving racing requests
- "Virtual bus" interconnect (snooping protocols)
- Indirection (directory protocols)
2. Protocol enhancements compound complexity
- Fragile, error-prone & difficult to reason about. Why? A distributed & concurrent system
- Enhancements are often too complicated to implement (predictive/adaptive/hybrid protocols)
- Performance and correctness are tightly intertwined

Speaker notes: Hardware shared memory has been successful, especially for cost-effective, small-scale systems. Further enhancements → more complexity: low-latency/predictive/adaptive/hybrid protocols; direct communication and highly-integrated nodes; avoiding ordered or synchronous interconnects, global snoop responses, and logical timestamping.

3 Rethinking Cache-Coherence Protocols
Goal of invalidation-based coherence
- Invariant: many readers -or- single writer
- Traditionally enforced by globally coordinated actions
Key innovation: enforce this invariant directly using tokens
- Fixed number of tokens per block
- One token to read, all tokens to write
- Guarantees safety in all cases
- A global invariant enforced with only local rules
- Independent of races, request ordering, etc.
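As a minimal sketch of that rule (illustrative Python, not code from the talk; T and the helper names are assumptions):

    # One token permits reading; all T tokens permit writing.
    T = 16  # fixed token count per block, e.g., one per processor

    def can_read(tokens_held: int) -> bool:
        # A cache may read a block while holding at least one token.
        return tokens_held >= 1

    def can_write(tokens_held: int) -> bool:
        # A cache may write only while holding all T tokens, which
        # rules out any concurrent reader elsewhere.
        return tokens_held == T

Safety then reduces to a local check: a writer holding all T tokens implies no other cache can hold the single token required to read.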

4 Token Coherence: A New Framework for Cache Coherence
Goal: decouple performance and correctness
- Fast in the common case, correct in all cases
- To remove ordering bottlenecks: ignore races (fast common case); tokens enforce safety (all cases)
- To reduce complexity: performance enhancements (fast common case) without affecting correctness (all cases), and without increased complexity
Focus of this talk: not just new enhancements, but many previous performance enhancements, ones that were previously too complicated to implement
Decouple the cache-coherence problem
- Safety (do no harm)
- Forward progress (do some good)
- Performance (policy and optimizations)

Speaker notes: We propose a new framework for coherence: decouple correctness and performance; remove bottlenecks & enable performance enhancements without concern for correctness. Create a "performance design space" in which designers can do whatever they like; it is separate from the substrate, not interwoven. As designers add performance enhancements, they do not add complexity to the correctness substrate. This enables designers to actually implement these enhancements in the performance design space. Briefly described.

5 Outline
- Overview
- Problem: ordering bottlenecks
- Solution: Token Coherence (TokenB)
- Evaluation
- Further exploiting decoupling
- Conclusions

6 Technology trends → unordered interconnects
- High-speed point-to-point links; no (multi-drop) busses
- Increasing design integration; "glueless" multiprocessors
- Improve cost & latency
Desire: a low-latency interconnect
- Avoid "virtual bus" ordering
- Enabled by directory protocols

Speaker notes: Green squares are processor/memory nodes. Point out the problem with the second interconnect. A shared bus is like Ethernet: it does not scale and is a central bottleneck.

7 Workload trends → avoid indirection, broadcast OK
[Figure: a directory protocol needs 3 hops (processor → directory → owner) where a direct response needs only 2]
- Commercial workloads: many cache-to-cache misses
- Clusters of small multiprocessors
Goals:
- Direct cache-to-cache misses (2 hops, not 3 hops)
- Moderate scalability

Speaker notes: Turn clustering from a competitor into a "feature".

8 Basic Approach
[Figure: a 2-hop request/response between processor/memory nodes]
- Low-latency protocol: broadcast with direct responses, as in snooping protocols
- Low-latency interconnect: use an unordered interconnect, as in directory protocols
Fast & works fine with no races... but what happens in the case of a race?

9 Basic approach… but not yet correct
[Figure: P0 (No Copy), P1 (No Copy), P2 (Read/Write)]
(1) P0 issues a request to write; it is delayed in the interconnect on its way to P2
(2) Ack
(3) P1 issues a request to read

10 Basic approach… but not yet correct
[Figure: P1 and P2 now both Read-only]
(4) P2 responds with data to P1

11 Basic approach… but not yet correct
[Figure: P0 (No Copy), P1 (Read-only), P2 (Read-only)]
(5) P0's delayed request arrives at P2
Note: "last writer responds with data" side-steps the ownership issues.

12 Basic approach… but not yet correct
[Figure: P2 goes from Read-only to No Copy; P0 goes from No Copy to Read/Write]
(6, 7) P2 responds to P0 with data

13 Basic approach… but not yet correct
[Figure: P0 (Read/Write) and P1 (Read-only) at the same time]
Problem: P0 and P1 are in inconsistent states. Each step was locally "correct", but the result is globally inconsistent, allowing inconsistent access to data and hence program errors.

14 Contribution #1: Token Counting
- Tokens control the reading & writing of data
- At all times, every block has exactly T tokens (e.g., one token per processor)
- One or more tokens to read; all tokens to write
- Tokens live in caches, in memory, or in transit
- Components exchange tokens & data
- Provides safety in all cases

Speaker notes: This is the "inspiration" part of the talk. We've been building multiprocessors for decades; we're the first we know of to take this approach. We're maintaining a different kind of state. Point out that there is ~one token per processor.
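To make the counting concrete, a hedged sketch (my illustration, not the paper's hardware) of token conservation as components exchange tokens and data; all names are assumed:

    from dataclasses import dataclass

    T = 16  # total tokens per block, fixed at design time

    @dataclass
    class Holder:
        """A cache, the memory, or a message in transit."""
        name: str
        tokens: int = 0
        has_data: bool = False

    def send_tokens(src: Holder, dst: Holder, n: int, with_data: bool) -> None:
        # Tokens can only move; they are never created or destroyed.
        assert 0 < n <= src.tokens, "cannot send tokens you do not hold"
        src.tokens -= n
        dst.tokens += n
        if with_data:  # simplified: the data travels with the message
            src.has_data, dst.has_data = False, True

    memory = Holder("memory", tokens=T, has_data=True)
    p0, p1 = Holder("P0"), Holder("P1")
    send_tokens(memory, p0, T, with_data=True)  # P0 holds all T: may write
    send_tokens(p0, p1, 1, with_data=True)      # P0 drops to T-1: read-only
    assert memory.tokens + p0.tokens + p1.tokens == T  # conservation holds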

15 Basic Approach (Revisited)
As before:
- Broadcast with direct responses (like snooping)
- Use an unordered interconnect (like directory)
New: track tokens for safety (avoid reading or writing data incoherently)
More refinement in a moment...

16 Token Coherence Example
[Figure: P0 (T=0), P1 (T=0), P2 (T=16, R/W); the maximum is 16 tokens]
(1, 2) P0 issues a request to write; it is delayed on its way to P2
(3) P1 issues a request to read; it is also delayed

17 Token Coherence Example
[Figure: P2 sends one token (T=1) with data to P1]
(4) P2 responds with data and one token to P1; P1 is now T=1 (read-only), P2 is T=15 (read-only)

18 Token Coherence Example
[Figure: P0 (T=0), P1 (T=1, R), P2 (T=15, R)]
(5) P0's delayed request arrives at P2

19 Token Coherence Example
[Figure: P2 goes from T=15 to T=0]
(6, 7) P2 responds to P0 with data and all 15 of its tokens

20 Token Coherence Example
[Figure: P0 now T=15 (read-only), P1 T=1 (read-only), P2 T=0 (no tokens)]

21 Token Coherence Example
[Figure: P0 T=15 (R), P1 T=1 (R)]
Now what? P0 wants all tokens, but one is still at P1, so P0 cannot write.

22 Basic Approach (Re-Revisited)
As before:
- Broadcast with direct responses (like snooping)
- Use an unordered interconnect (like directory)
- Track tokens for safety (avoid reading or writing data incoherently)
New: reissue requests as needed
- Needed due to racing requests (uncommon)
- A timeout detects failed completion: wait twice the average miss latency
- Small hardware overhead
- All races are handled in this uniform fashion
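A minimal sketch of that reissue policy (illustrative Python; the callbacks and constants stand in for hardware, and the escalation threshold of four reissues comes from slide 25):

    import time

    AVG_MISS_LATENCY_S = 1e-6            # assumed placeholder value
    TIMEOUT_S = 2 * AVG_MISS_LATENCY_S   # wait ~2x the average miss latency
    MAX_REISSUES = 4                     # then escalate (see slide 25)

    def acquire_write(block, issue_request, request_completed, issue_persistent):
        """Retry a transient request; escalate if racing requests starve it."""
        for _ in range(MAX_REISSUES):
            issue_request(block)  # broadcast: ask everyone for their tokens
            deadline = time.monotonic() + TIMEOUT_S
            while time.monotonic() < deadline:
                if request_completed(block):  # all T tokens have arrived
                    return
        # Pathological race: fall back to the slow, guaranteed path.
        issue_persistent(block)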

23 Token Coherence Example
[Figure: P0 T=15 (R), P1 T=1 (R), P2 T=0]
(8) Timeout! P0 reissues its request
(9) P1 responds with its one token (T=1)

24 Token Coherence Example
[Figure: P0 T=16 (R/W), P1 T=0, P2 T=0]
P0's request has completed.
One final issue: what about starvation?

25 Contribution #2: Guaranteeing Starvation-Freedom
Handle pathological cases
- Infrequently invoked; can be slow, inefficient, and simple
When normal requests fail to succeed (4 times)
- Use a longer timeout and issue a persistent request
- The request persists until satisfied
- A table at each processor
- "Deactivate" upon completion
Implementation
- An arbiter at memory orders persistent requests

Speaker notes: Say more about why they persist; emphasize the persistent nature, not the implementation.
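A hedged sketch of the mechanism the slide describes; the table, arbiter, and forwarding rule are modeled with illustrative names:

    class PersistentRequestTable:
        """Mirrored at each processor; while an entry is active, tokens
        for that block are forwarded to the starving requester."""
        def __init__(self):
            self.active = {}  # block -> id of the starving requester

        def activate(self, block, requester):
            # The memory-side arbiter admits one persistent request per
            # block at a time, giving a globally consistent order.
            self.active[block] = requester

        def deactivate(self, block):
            # Issued by the requester once it has enough tokens.
            self.active.pop(block, None)

        def forward_target(self, block):
            # A node holding tokens for this block must send them here.
            return self.active.get(block)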

26 Outline
- Overview
- Problem: ordering bottlenecks
- Solution: Token Coherence (TokenB)
- Evaluation
- Further exploiting decoupling
- Conclusions

27 Evaluation Goal: Four Questions
1. Are reissued requests rare? Yes
2. Can Token Coherence outperform snooping? Yes: a lower-latency unordered interconnect
3. Can Token Coherence outperform directories? Yes: direct cache-to-cache misses
4. Is broadcast overhead reasonable? Yes (for 16 processors)

Speaker notes: This is quantitative evidence for qualitative behavior; exact speedup numbers are the job of industry for their particular designs.

28 Workloads and Simulation Methods
Workloads
- OLTP: on-line transaction processing
- SPECjbb: Java middleware workload
- Apache: static web serving workload
- All workloads use Solaris 8 for SPARC
Simulation methods
- 16 processors
- Simics full-system simulator
- Out-of-order processor model
- Detailed memory system model
- Many assumptions and parameters (see paper)

29 Q1: Reissued Requests (percent of all L2 misses)
[Chart: reissue rates for OLTP, SPECjbb, and Apache]

30 Q1: Reissued Requests (percent of all L2 misses)
Outcome (OLTP / SPECjbb / Apache):
- Not reissued: 98%, 96%
- Reissued once: 2%, 3%
- Reissued more than once: 0.4%, 0.3%, 0.7%
- Persistent requests (reissued > 4): 0.2%, 0.1%
Yes; reissued requests are rare (for these workloads at 16 processors).

31 Q2: Runtime: Snooping vs. Token Coherence (Hierarchical Switch Interconnect)
[Chart: runtime on the "tree" interconnect]
Similar performance on the same interconnect.

32 Q2: Runtime: Snooping vs. Token Coherence Direct Interconnect
[Chart: runtime on the "torus" interconnect]
Snooping is not applicable on this interconnect.

33 Q2: Runtime: Snooping vs. Token Coherence
Yes; Token Coherence can outperform snooping (15-28% faster).
Why? The lower-latency interconnect.

34 Q3: Runtime: Directory vs. Token Coherence
Yes; Token Coherence can outperform directories (17-54% faster than a slow directory).
Why? Direct "2-hop" cache-to-cache misses.

35 Q4: Traffic per Miss: Directory vs. Token
Yes; broadcast overheads are reasonable for 16 processors (the directory protocol uses 21-25% less bandwidth).

Speaker notes: Why does the amount of data sent differ? A slight difference in how the protocols handle O→M upgrades.

36 Q4: Traffic per Miss: Directory vs. Token
[Chart: traffic per miss, split into requests & forwards vs. responses]
Yes; broadcast overheads are reasonable for 16 processors (the directory protocol uses 21-25% less bandwidth).
Why? Requests are smaller than data (8B vs. 64B).

Speaker notes: Why does the amount of data sent differ? A slight difference in how the protocols handle O→M upgrades.

37 Outline
- Overview
- Problem: ordering bottlenecks
- Solution: Token Coherence (TokenB)
- Evaluation
- Further exploiting decoupling
- Conclusions

38 Contribution #3: Decoupled Coherence
[Diagram: a cache coherence protocol split into a correctness substrate (all cases) and a performance protocol (common cases)]
- Correctness substrate: safety (token counting) and starvation freedom (persistent requests)
- Performance protocol: many implementation choices

39 Example Opportunities of Decoupling
Example #1: broadcast is not required
- Predict a destination set [ISCA '03], based on past history
- Predictions need not be correct (rely on persistent requests)
- Enables larger or more cost-effective systems
Example #2: predictive push
[Diagram: P0 pushes a block with all tokens (T=16) to predicted consumers P1...Pn]
- Requires no changes to the correctness substrate
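A hedged sketch of destination-set prediction in the spirit of Example #1 (all names illustrative, not from the cited paper): predict recipients from recent sharers, and let reissue and persistent requests cover any misprediction.

    from collections import defaultdict

    class DestinationSetPredictor:
        """Guess which processors should see a request, from history.
        Mispredictions are harmless: tokens still guarantee safety and
        persistent requests still guarantee forward progress."""
        def __init__(self, num_procs: int):
            self.num_procs = num_procs
            self.past_sharers = defaultdict(set)  # block -> recent sharers

        def predict(self, block, requester):
            guess = self.past_sharers[block] - {requester}
            # No history yet: fall back to a full broadcast.
            return guess or {p for p in range(self.num_procs) if p != requester}

        def update(self, block, responder):
            # Learn from whoever actually supplied tokens or data.
            self.past_sharers[block].add(responder)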

40 Conclusions
Token Coherence (broadcast version)
- Low cache-to-cache miss latency (no indirection)
- Avoids "virtual bus" interconnects
- Faster and/or cheaper
Token Coherence (in general)
- Correctness substrate: tokens for safety; persistent requests for starvation freedom
- Performance protocol for performance
- Decouples correctness from performance
- Enables further protocol innovation

42 Cache-Coherence Protocols
Goal: provide a consistent view of memory
- Permissions in each cache, per block: one read/write -or- many readers
Cache coherence protocols
- Distributed & complex
- Correctness critical
- Performance critical
Races: the main source of complexity
- Two or more requests for the same block at the same time

43 Evaluation Parameters
Processors
- SPARC ISA, 2 GHz, 11 pipe stages
- 4-wide fetch/execute, dynamically scheduled
- 128-entry ROB, 64-entry scheduler
Memory system
- 64-byte cache lines
- 128KB L1 instruction and data caches, 4-way set-associative, 2 ns (4 cycles)
- 4MB L2, 4-way set-associative, 6 ns (12 cycles)
- 2GB main memory, 80 ns (160 cycles)
Interconnect
- 15 ns link latency
- Switched tree: 4 link latencies per 2-hop round trip
- 2D torus: 2 link latencies (on average) per 2-hop round trip
- Link bandwidth: 3.2 GB/second
Coherence protocols
- Aggressive snooping
- Alpha-like directory
- 72-byte data messages, 8-byte request messages
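As a sanity check on these figures (my arithmetic, not the slide's): at 2 GHz a cycle is 0.5 ns, so the 15 ns link latency is 30 cycles, and a tree round trip of 4 link latencies is 60 ns, i.e., 120 cycles of pure link latency. The quoted 80 ns (160 cycles) memory latency is consistent with the same clock.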

44 Q3: Runtime: Directory vs. Token Coherence
Yes; Token Coherence can outperform directories (17-54% faster than a slow directory).
Why? Direct "2-hop" cache-to-cache misses.

Speaker notes: The comparison also simulates the un-buildable design point of a super-fast, perfect directory cache.

45 More Information in Paper
- Traffic optimization: transfer tokens without data
- Add an "owner" token (note: no silent read-only replacements)
- Worst case: 10% interconnect traffic overhead
- Comparison to AMD's Hammer protocol
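A hedged sketch of the owner-token idea as the slide summarizes it (heavily simplified; the precise rules are in the paper): one distinguished token carries the obligation to supply data, so holders of only plain tokens can respond without attaching the 64-byte data block.

    def response_payload(holds_owner_token: bool, tokens_to_send: int) -> dict:
        """What a token response carries (simplified illustration)."""
        return {
            "tokens": tokens_to_send,
            # Data must travel whenever the owner token moves; plain
            # tokens may be sent alone, which saves bandwidth.
            "data": holds_owner_token,
        }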

46 Verifiability & Complexity
Divide and conquer complexity
- Formal verification is work in progress
- Difficult to quantify, but promising
All races handled uniformly (reissuing)
Local invariants
- Safety is response-centric, independent of requests
- Locally enforced with tokens
- No reliance on global ordering properties
Explicit starvation avoidance
- A simple mechanism
Further innovation → no correctness worries

47 Traditional v. Token Coherence
Traditional protocols
- Sensitive to request ordering (interconnect or directory)
- Monolithic & complicated
- Intertwine correctness and performance
Token Coherence
- Track tokens (safety)
- Persistent requests (starvation avoidance)
- Requests are only "hints"
- Separates correctness and performance

48 Conceptual Interface
[Diagram: per-node controllers for P0, P1, and M; the performance protocol exchanges "hint" requests, while the correctness substrate moves data, tokens, & persistent requests]

49 Conceptual Interface (continued)
[Diagram: P1's performance protocol announces "I'm looking for block B"; the controller asks P0 "Please send block B to P1"; the correctness substrate replies "Here is block B & one token" (or sends no response)]
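Rendered as code, a hedged sketch of this interface (all names illustrative): hints are best-effort suggestions, and only the substrate may move tokens and data, so an ignored or misdirected hint can never violate safety.

    from typing import Optional

    class SubstrateController:
        """Per-node correctness substrate: sole owner of tokens & data."""
        def __init__(self, name: str, tokens: int = 0):
            self.name = name
            self.tokens = tokens

        def on_hint(self, block: int, requester: str) -> Optional[dict]:
            # Respond only if we actually hold a token; silently
            # ignoring a hint is always safe.
            if self.tokens > 0:
                self.tokens -= 1
                return {"block": block, "to": requester, "tokens": 1, "data": True}
            return None

    # Usage: the performance protocol guesses that P0 can supply block B.
    p0 = SubstrateController("P0", tokens=16)
    reply = p0.on_hint(block=0xB, requester="P1")  # a response, or None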

50 Snooping v. Directories: Which is Better?
[Figure: snooping locates data directly in 2 hops; a directory adds an indirection, 3 hops]
Snooping multiprocessors
- Use broadcast on a "virtual bus" interconnect
- Directly locate data (2 hops)
Directory-based multiprocessors
- The directory tracks the writer or the readers
- Avoid broadcast and the "virtual bus" interconnect
- Indirection for cache-to-cache misses (3 hops)
Examine workload and technology trends.

Speaker notes: This question has been debated for 20+ years; if one approach were simply better, we would only have one. Sun, Intel, IBM: snooping. Alpha, SGI: directory. Spend some time on ordering in snooping, talk about directory indirection, and define cache-to-cache misses; our workloads have many of them.

51 Workload trends → snooping protocols
- Commercial workloads: many cache-to-cache (sharing) misses
- Clusters of small- or moderate-scale multiprocessors
Goals:
- Low cache-to-cache miss latency (2 hops)
- Moderate scalability

Speaker notes: Turn clustering from a competitor into a "feature".

52 Technology trends → directory protocols
- High-speed point-to-point links; no (multi-drop) busses
- Increasing design integration; "glueless" multiprocessors
- Improve cost & latency
Desire: an unordered interconnect
- No "virtual bus" ordering
- Decouple the interconnect & the protocol

Speaker notes: Green squares are processor/memory nodes. Point out the problem with the second interconnect. A shared bus is like Ethernet: it does not scale and is a central bottleneck.

53 Multiprocessor Taxonomy
[Chart: cache-to-cache latency (high vs. low) against interconnect (virtual bus vs. unordered)]
- Workload trends favor low cache-to-cache latency: snooping
- Technology trends favor unordered interconnects: directories
- Our goal: the best of both worlds, one approach that replaces both protocols and is best in nearly all cases

Speaker notes: This is a "goal setting" slide, part of separating goal setting from goal achieving. (Mark asked me to make this an "up and to the right" graph, but I like this setup better because it matches the eye-flow, left-to-right and top-to-bottom.)

54 Overview
Two approaches to cache coherence: the problem
- Workload trends → snooping protocols
- Technology trends → directory protocols
- Want a protocol that fits both trends
A solution
- Unordered broadcast & direct responses
- Track tokens, reissue requests, prevent starvation
Generalization: a new coherence framework
- Decouple correctness from "performance" policy
- Enables further opportunities

