Presentation is loading. Please wait.

Presentation is loading. Please wait.

(C) 2002 Milo MartinHPCA, Feb. 2002 Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet.

Similar presentations


Presentation on theme: "(C) 2002 Milo MartinHPCA, Feb. 2002 Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet."— Presentation transcript:

1 (C) 2002 Milo MartinHPCA, Feb. 2002 Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet Project Computer Sciences Department University of Wisconsin—Madison

2 BASH – Milo Martin slide 2 Two classes of multiprocessors Snooping (SMP) multiprocessors –Broadcast-based  use more interconnect bandwidth +Directly locate owner  low latency cache-to-cache transfers (36% - 91% of misses are cache-to-cache transfers in our commercial workloads) Directory-based multiprocessors +Indirection  bandwidth-efficient & scalable –Indirection  higher latency cache-to-cache transfers Problem: higher performing approach varies with: –Configuration (e.g., number of processors) –Workload (e.g., cache miss rate)

3 BASH – Milo Martin slide 3 Which approach is best? Micro-benchmark 64 processors

4 BASH – Milo Martin slide 4 Bandwidth Adaptive Snooping Hybrid (BASH) Goals –Best performance aspects of both approaches High performance for many configurations & workloads Future workload properties unknown at design time –Single design Coherence logic integrated with processors One part for many systems Hybrid protocol –Snooping-like broadcast requests –Directory-like “unicast” requests Bandwidth adaptive –Estimate available bandwidth –Adjust rate of broadcast based on estimate

5 BASH – Milo Martin slide 5 Best of both protocols Micro-benchmark 64 processors

6 BASH – Milo Martin slide 6 Outline Overview Bandwidth adaptive mechanism Hybrid protocol Evaluation Conclusions

7 BASH – Milo Martin slide 7 Ordered Interconnect $ P M $ P M $ P M $ P M System model Ordered interconnect Processor/Memory nodes –Directory state –Adaptive mechanism Bandwidth Adaptive Mechanism Network Interface Caches Processor Memory Directory Controller

8 BASH – Milo Martin slide 8 Bandwidth adaptive mechanism Choose broadcast or unicast for each miss Goal: minimize latency - avoid extreme queuing delay Approach: limit average interconnect utilization –Contention dominates miss latency at high utilizations –Interconnect utilization goal (e.g., 75%) –Adjust rate of broadcast –Feedback control system

9 BASH – Milo Martin slide 9 Implementation Two counters at each processor –Utilization counter (Above or below utilization threshold?) –Policy counter (Probability of broadcast?) At each processor –Each cycle: Monitor local link & adjust utilization counter –Each sampling interval: Adjust policy counter based on utilization counter –Each miss: Compare policy counter with a random number Why random? –Steady state of mixed broadcasts and unicasts –Enables us to avoid oscillation

10 BASH – Milo Martin slide 10 Outline Overview Bandwidth adaptive mechanism Hybrid protocol –Snooping-like operation –Directory-like operation –Complexity & Scalability Evaluation Conclusions

11 BASH – Milo Martin slide 11 Ordered broadcast Marker places request in total order marker request  Snooping-like operation P2P2 Owner P1P1 Shared P3P3 Invalid P0P0 Requestor M0M0 Home Data  Low latency cache-to-cache, but requires broadcast Owner: P 1

12 BASH – Milo Martin slide 12 marker request  Add indirection Uses order to avoid acks Similar to Alpha GS320 marker  re-request Directory-like operation P2P2 Owner P1P1 Shared P3P3 Invalid P0P0 Requestor M0M0 Home Data  Avoids broadcast, but frequently adds indirection Owner: P 1, Sharers: {P 2 }

13 BASH – Milo Martin slide 13 Protocol races Choose broadcast or unicast for each miss Protocol simultaneously allows –Broadcast requests –Unicast requests –Forwarded requests –Writebacks Like all protocols, BASH has protocol races

14 BASH – Milo Martin slide 14 Protocol race example P2P2 Owner P1P1 Shared P3P3 Requestor P0P0 M0M0 Home Owner: P 1, Sharers: {P 2 } Broadcast Unicast re-request Data

15 BASH – Milo Martin slide 15 Protocol race example P2P2 Invalid P1P1 P3P3 Modified P0P0 Requestor M0M0 Home Owner: P 3, Sharers: Ø Unicast re-request

16 BASH – Milo Martin slide 16 Protocol race example P2P2 Invalid P1P1 P3P3 Modified P0P0 Requestor M0M0 Home Owner: P 3, Sharers: Ø Unicast re-request Data 2nd re-request

17 BASH – Milo Martin slide 17 Protocol races Race detection: directory audits all requests –Observes all requests –Compares request destination set with current sharers –Occasionally needs to re-issue a request Requests are processed uniformly –Processors - respond with data or invalidate –Directory - audit request, may forward data or request See paper for more information

18 BASH – Milo Martin slide 18 Complexity One “cost” of implementing BASH Quantifying complexity is difficult… –Protocol controllers are finite state machines –Similar number of states –BASH has twice as many events and transitions Moderate complexity –Additive, not multiplicative Similar to Multicast Snooping –Original proposal [Bilir et al., ISCA 1999] –Enhanced, specified & verified [Sorin et al., TPDS 2002]

19 BASH – Milo Martin slide 19 Scalability Limited by ordered interconnect –BASH eliminates broadcast-only nature of snooping Recent systems with an ordered interconnect –Compaq AlphaServer GS320 (32 processor) - directory –Sun UE15000 (106 processors) - snooping –Fujitsu PrimePower 2000 (128 processors) - snooping Potential alternative –Timestamp Snooping network [Martin et al., ASPLOS 2000]

20 BASH – Milo Martin slide 20 Outline Overview Bandwidth adaptive mechanism Hybrid protocol Evaluation Conclusions

21 BASH – Milo Martin slide 21 Workloads & methods Workloads [ CAECW ‘02 ] –OLTP: IBM’s DB2 & TPCC-like (1GB database) –Static web: Apache –Dynamic web: SlashCode –Java middleware: SpecJBB –Scientific workload: Barnes-Hut Setup and tuned for 16 processors Full system simulation –Virtutech’s Simics –Solaris 8 on SPARC V9 –Blocking processor model Memory system simulator –Captures timing, races, and all transient states

22 BASH – Milo Martin slide 22 Three Questions 1)Is our adaptive mechanism effective? 2)Does BASH adapt to multiple workloads? 3)Does BASH adapt to multiple configurations?

23 BASH – Milo Martin slide 23 (1) SpecJBB on 16 processors

24 BASH – Milo Martin slide 24 (1) SpecJBB on 16 processors, 4x broadcast cost

25 BASH – Milo Martin slide 25 (1) SpecJBB on 16 processors, 4x broadcast cost

26 BASH – Milo Martin slide 26 (2) Can BASH adapt to multiple workloads? 1600 MB/s links SimilarSnooping Directory

27 BASH – Milo Martin slide 27 (2) Can BASH adapt to multiple workloads? 1600 MB/s links

28 BASH – Milo Martin slide 28 (3) Can BASH adapt to multiple configurations? Micro-benchmark 1600 MB/s links

29 BASH – Milo Martin slide 29 (3) Can BASH adapt to multiple configurations? Micro-benchmark 1600 MB/s links

30 BASH – Milo Martin slide 30 Results Summary 1)Is our adaptive mechanism effective? Yes 2)Does BASH adapt to multiple workloads? Yes 3)Does BASH adapt to multiple configurations? Yes

31 BASH – Milo Martin slide 31 Conclusions Bandwidth Adaptive Snooping Hybrid (BASH) –Hybrid of snooping and directories –Simple bandwidth adaptive mechanism Adapts to various workloads & system configurations –Robust performance –Outperforms base protocols in some cases Future directions –Focus bandwidth on likely cache-to-cache transfers –Explore multicasts –Power-adaptive coherence

32 BASH – Milo Martin slide 32

33 BASH – Milo Martin slide 33 Queuing model motivation Knee A multiprocessor as a simple queuing model –Exponential service & think time distributions “interconnect” “ processors ” requestsresponses


Download ppt "(C) 2002 Milo MartinHPCA, Feb. 2002 Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet."

Similar presentations


Ads by Google