(C) 2004 Daniel SorinDuke Architecture Using Speculation to Simplify Multiprocessor Design Daniel J. Sorin 1, Milo M. K. Martin 2, Mark D. Hill 3, David.

Presentation transcript:

(C) 2004 Daniel Sorin, Duke Architecture
Using Speculation to Simplify Multiprocessor Design
Daniel J. Sorin 1, Milo M. K. Martin 2, Mark D. Hill 3, David A. Wood 3
1 Dept. of Electrical & Computer Engineering, Duke University
2 Dept. of Computer & Information Science, Univ. of Pennsylvania
3 Computer Sciences Dept., University of Wisconsin-Madison

IPDPS 2004 – Daniel Sorin slide 2
My Talk in One Slide
Shared memory multiprocessors are complicated
  – Difficult to design for every possible corner case
Proposal: Use speculation to target the common case
  – Speculate that corner cases won’t happen
  – Detect if they do occur and recover system
  – Ensure forward progress
Case studies
  – Simplify cache coherence protocols
  – Simplify the interconnection network

Slide 3: Speculation for Simplicity
Why we want to avoid complexity
  – Time and money for design and verification
Design for the common case
  – But we have to make ALL cases work correctly
Examples of this philosophy in uniprocessors
  – Trapping to software for infrequent/obsolescent instructions
  – Pentium 4 recovers from edge-case scheduler deadlocks
But this idea hadn’t been used for multiprocessors
  – Key: we now have efficient multiprocessor recovery

Slide 4: Framework for Speculation
Four keys to design simplification with speculation
1) Ensure that mis-speculations are rare
2) Detect all mis-speculations
3) Recover from mis-speculations
4) Ensure forward progress even for worst-case
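The four keys can be read as a control loop. The following is a minimal software sketch, not the hardware mechanism from the talk: the names `run`, `racy_step`, `checkpoint_interval`, and `slow_start_len` are hypothetical, and "state" is just a Python list standing in for architectural state. It assumes the slow-start period is at least as long as the checkpoint interval, so that replay always gets past the step that failed.

```python
class MisSpeculation(Exception):
    """Signals that a corner case we speculated away has actually occurred."""


def run(n_steps, step, checkpoint_interval=4, slow_start_len=8):
    """Run `step` for indices 0..n_steps-1 with checkpoint/recovery.

    `step(i, state, safe_mode)` mutates `state` and may raise MisSpeculation
    only when safe_mode is False. Returns (final_state, recoveries).
    """
    state = []                    # stands in for all architectural state
    checkpoint = (0, [])          # (next step index, copy of state)
    recoveries = 0
    slow_left = 0                 # remaining steps of slow-start (safe) mode
    i = 0
    while i < n_steps:
        if i % checkpoint_interval == 0:
            checkpoint = (i, list(state))      # periodic checkpoint
        safe_mode = slow_left > 0
        try:
            step(i, state, safe_mode)
        except MisSpeculation:                 # key 2: detect
            i, saved = checkpoint              # key 3: recover
            state = list(saved)
            recoveries += 1
            slow_left = slow_start_len         # key 4: forward progress
            continue
        if safe_mode:
            slow_left -= 1
        i += 1
    return state, recoveries


def racy_step(i, state, safe_mode):
    """Hypothetical workload: step 5 hits the corner case in fast mode."""
    state.append(i)               # partial effect that recovery must undo
    if i == 5 and not safe_mode:
        raise MisSpeculation("corner case at step 5")


final_state, recoveries = run(10, racy_step)
print(final_state, recoveries)
```

Key 1 (rarity) is not enforced by the loop; it is a property of the workload, and it determines how often the recovery cost is paid.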

Slide 5: SafetyNet Checkpoint/Recovery
We use SafetyNet [ISCA 2002] for system recovery
All-hardware checkpoint/recovery for shared memory multiprocessors
Periodically takes logical checkpoints of system
  – Including caches, coherence state, memory, directory state
  – Implements checkpointing with incremental logging
  – Consistent checkpoints using logical time coordination
Can recover 100,000+ cycles
Negligible performance impact
  – Incremental logging performed off critical path
Small log buffers (512 KB) at caches & memories
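To make "checkpointing with incremental logging" concrete, here is a small software sketch of an undo log. It is an illustration under stated assumptions, not SafetyNet's hardware design: the class `LoggedMemory` and its methods are hypothetical, memory is a Python dict, and logical time coordination and coherence state are ignored. The idea shown is that a checkpoint copies nothing; the old value of a location is logged only on its first overwrite after a checkpoint, and recovery replays the logs backward.

```python
_ABSENT = object()   # marks "address had no value before this write"


class LoggedMemory:
    """Memory with logical checkpoints implemented as per-interval undo logs."""

    def __init__(self):
        self.data = {}
        self.logs = [{}]          # logs[k] covers writes made since checkpoint k

    def write(self, addr, value):
        log = self.logs[-1]
        if addr not in log:       # log only the first overwrite per interval
            log[addr] = self.data.get(addr, _ABSENT)
        self.data[addr] = value

    def take_checkpoint(self):
        """Start a new interval; returns the id of the new checkpoint."""
        self.logs.append({})
        return len(self.logs) - 1

    def recover(self, checkpoint_id):
        """Restore memory to its contents at `checkpoint_id`."""
        while len(self.logs) > checkpoint_id:
            log = self.logs.pop()             # undo newest interval first
            for addr, old in log.items():
                if old is _ABSENT:
                    self.data.pop(addr, None)
                else:
                    self.data[addr] = old
        self.logs.append({})                  # fresh log for the restored interval


mem = LoggedMemory()
mem.write(0x10, 1)
ckpt = mem.take_checkpoint()
mem.write(0x10, 2)
mem.write(0x20, 3)
mem.take_checkpoint()
mem.write(0x10, 4)
mem.recover(ckpt)
print(mem.data)
```

Because each interval logs a location at most once, log size grows with the number of distinct locations written per interval rather than with the number of writes.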

Slide 6: The Need for Multiprocessor Recovery
Assumption: multiprocessors will have system-wide recovery mechanisms for purposes of availability
  – As fault rates keep increasing, recovery is crucial
Will be all-hardware (like SafetyNet) for performance
  – But many alternative designs are possible
We leverage this recovery mechanism for recovering from mis-speculations

Slide 7: Outline
A Framework for Speculation
Simplifying Cache Coherence Protocols
Simplifying the Interconnection Network
Evaluation
Conclusions

Slide 8: Directory Protocol Complexity
We want adaptive routing in interconnection network
  – Better performance and availability
  – But adaptive routing precludes point-to-point ordering
So what?
  – Point-to-point ordering simplifies protocol design
  – Eliminates several potential corner case races

Slide 9: Race Case in Directory Protocol
Example race if no point-to-point ordering in network
[Diagram: nodes P1, Dir, P2; messages RequestReadWrite, Writeback, Forwarded RequestReadWrite]
RequestReadWrite arrives first at Dir, gets forwarded to P1

Slide 10: Race Case in Directory Protocol
[Diagram: nodes P1, Dir, P2; messages RequestReadWrite, Forwarded RequestReadWrite, Writeback, Writeback Ack]
Forwarded RequestReadWrite arrives after Writeback Ack

Slide 11: Race Case in Directory Protocol
Problem: P1 sees Forwarded Request in state Invalid
[Diagram: nodes P1, Dir, P2; messages RequestReadWrite, Forwarded RequestReadWrite, Writeback, Writeback Ack]
Not possible if point-to-point order in interconnection network

Slide 12: Simplifying a Directory Protocol
Speculate that adaptive network provides ordering
1) Why is mis-speculation rare?
  – Not many re-orderings
  – Most re-orderings don’t matter!
2) How do we detect all mis-speculations?
  – If we get a Forwarded RequestReadWrite in state Invalid
3) How do we recover?
  – SafetyNet
4) How do we ensure forward progress?
  – Slow-start operation for a while after recovery
  – Guarantees that this race can’t keep recurring
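The detection rule in step 2 can be sketched as a check in P1's cache controller. This is a deliberately simplified, hypothetical model: the class `P1Cache`, the state name `WritebackPending`, and the helper `deliver` are assumptions made for illustration, and a real protocol has many more states and messages. It only shows that the same two messages are harmless in one arrival order and flagged as a mis-speculation in the other.

```python
class MisSpeculation(Exception):
    """Raised when the controller sees a message its simplified design does not handle."""


class P1Cache:
    """Hypothetical cache controller for one block that P1 is writing back."""

    def __init__(self):
        self.state = "Modified"

    def start_writeback(self):
        self.state = "WritebackPending"       # Writeback sent, waiting for the ack

    def receive(self, message):
        if message == "WritebackAck":
            self.state = "Invalid"
        elif message == "ForwardedRequestReadWrite":
            if self.state == "Invalid":
                # Only reachable if the network reordered the two messages.
                raise MisSpeculation("Forwarded RequestReadWrite in state Invalid")
            # Ordered case: P1 still holds the block and can supply the data.
        else:
            raise ValueError(f"unknown message: {message}")


def deliver(messages):
    """Deliver messages to P1 in the given order; report whether recovery is needed."""
    p1 = P1Cache()
    p1.start_writeback()
    try:
        for message in messages:
            p1.receive(message)
    except MisSpeculation:
        return "recover"                      # would trigger system recovery
    return "ok"


print(deliver(["ForwardedRequestReadWrite", "WritebackAck"]))   # ordered arrival
print(deliver(["WritebackAck", "ForwardedRequestReadWrite"]))   # reordered arrival
```

The design point is that the controller needs no extra states or transitions to resolve the race; it only needs to recognize that it happened.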

Slide 13: Simplifying a Snooping Coherence Protocol
During design, we missed a corner case
[Diagram: State M, State trans1, State trans2; labels Writeback, RequestReadWrite, "???"]
Solution: it’s rare, treat it as mis-speculation
Detect by seeing RequestReadWrite in state trans2
Recovery with SafetyNet
Forward progress with slow-start after recovery

Slide 14: Outline
A Framework for Speculation
Simplifying Cache Coherence Protocols
Simplifying the Interconnection Network
  – Deadlock
  – Avoiding deadlock
Evaluation
Conclusions

Slide 15: Two Causes of Deadlock
[Diagram, Endpoint Deadlock: P1, P2; labels Response, full of requests, Response]
[Diagram, Switch Deadlock: switch1, switch2; labels Message M1, full of messages, Message M2]

Slide 16: Avoiding Deadlock
Simple but wasteful solution: full buffering
  – But it’s rare that we ever need full buffering
More efficient solution: virtual channels (networks)
For endpoint deadlock
  – Need a virtual network per type of message
For switch deadlock
  – Need some number of virtual channels per virtual network
  – Depends on network topology and routing scheme
A major source of design complexity

Slide 17: Simplifying Deadlock Avoidance
Speculate that deadlock won’t occur, despite using less than full buffering and no virtual channels
1) Why is mis-speculation rare?
  – Can usually avoid deadlock with reasonable buffering
2) How do we detect all mis-speculations?
  – Timeout mechanism for cache coherence transactions
3) How do we recover?
  – SafetyNet
4) How do we ensure forward progress?
  – Slow-start operation for a while after recovery
  – Guarantees that deadlock can’t keep recurring
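A timeout detector for coherence transactions might look like the sketch below. The class `TransactionWatchdog`, its method names, and the 100-cycle threshold are hypothetical and chosen only for illustration; the talk does not specify the mechanism's details. The sketch assumes each transaction is stamped with its start cycle and that any transaction outstanding for at least `timeout_cycles` is treated as evidence of deadlock.

```python
class TransactionWatchdog:
    """Tracks outstanding coherence transactions and flags ones that take too long."""

    def __init__(self, timeout_cycles):
        self.timeout_cycles = timeout_cycles
        self.start_cycle = {}                 # transaction id -> cycle it was issued

    def start(self, txn_id, now):
        self.start_cycle[txn_id] = now

    def complete(self, txn_id):
        self.start_cycle.pop(txn_id, None)

    def expired(self, now):
        """Return ids of transactions outstanding for at least the timeout."""
        return sorted(
            txn_id
            for txn_id, started in self.start_cycle.items()
            if now - started >= self.timeout_cycles
        )


watchdog = TransactionWatchdog(timeout_cycles=100)
watchdog.start("load A", now=0)
watchdog.start("store B", now=50)
watchdog.complete("load A")                   # finished normally
print(watchdog.expired(now=120))              # "store B" has waited 70 cycles
print(watchdog.expired(now=150))              # "store B" has waited 100 cycles
```

A timeout that fires on a slow but live transaction costs only an unnecessary recovery, so the threshold is a performance trade-off rather than a correctness one.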

Slide 18: Outline
A Framework for Speculation
Simplifying Cache Coherence Protocols
Simplifying the Interconnection Network
Evaluation
  – Goals
  – Methodology
  – Results
Conclusions

Slide 19: Goals
Discover the point at which mis-speculation recoveries impact performance
  – Determines whether our simplified snooping protocol and our simplified interconnection network are viable
Determine whether our simplified directory protocol can usefully speculate on point-to-point ordering

Slide 20: Methodology
Full-system simulation
  – Simics provides full-system functionality
  – We added detailed timing model for memory system
Workloads
  – Online transaction processing (OLTP) with DB2
  – SPECjbb2000 Java middleware
  – Apache static web serving
  – Slashcode dynamic web serving
  – Barnes-Hut scientific simulation

Slide 21: How Rare Must Mis-speculation Be?
We can tolerate high mis-speculation rates: these rates are much higher than what our simplified designs incur

Slide 22: Adaptive Routing with Speculative Ordering
Adaptive routing can provide better performance by routing around congestion, even with mis-speculations

Slide 23: Conclusions
Simplify multiprocessor design with speculation
  – Treat corner cases as mis-speculations & recover from them
Must be able to ensure that
  – Mis-speculations are sufficiently rare
  – Can detect all mis-speculations
  – Can recover from mis-speculations
  – Can provide forward progress in all cases
Showed how to simplify
  – Cache coherence protocols
  – Interconnection network deadlock avoidance
Applicable to other complicated designs