Hardware Supported Time Synchronization in Multi-Core Architectures
林孟諭, Dept. of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C.

Outline
- Abstract
- Introduction
- Background & Motivation
- Global Synchronization Unit (GSU)
- A Simple Example
- Conclusion

Abstract
The paper presents a design for a hardware-supported global synchronization unit (GSU) that is directly accessible by all processors in a multi-core architecture and performs highly efficient, scalable time synchronization for parallel simulations.

Introduction
Recent advances in multi-core technology, which place multiple CPUs on a single chip, create greater opportunities to perform distributed simulations across those CPUs. Parallelizing such simulators introduces an additional hazard: a process may receive messages with timestamps less than its current simulation time. Preventing these so-called causality errors requires frequent time synchronization. The proposed global synchronization unit (GSU) is comprised of specialized register files that are read and updated using atomic hardware instructions, allowing efficient time synchronization actions.

Background & Motivation (1/3)
When parallelizing a simulation, it is crucial to ensure that an instance of the parallel simulator, called a logical process (LP), never processes events out of timestamp order. This property is called causality. When an LP receives a message with a timestamp less than its current simulation time, a causality error occurs. Synchronization algorithms can be grouped into two main classes according to how they handle causality: conservative and optimistic.
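As a minimal illustration (not from the paper; names such as LogicalProcess and is_causality_error are hypothetical), the check an LP performs on an incoming message amounts to a single timestamp comparison:

```c
/* Minimal sketch (not from the paper): detecting a causality error
 * when a logical process (LP) receives a message. Names are illustrative. */
#include <stdio.h>
#include <stdbool.h>

typedef struct {
    double current_sim_time;   /* the LP's local simulation clock */
} LogicalProcess;

typedef struct {
    double timestamp;          /* simulation time at which the event occurs */
} Message;

/* Returns true if accepting this message would violate causality. */
static bool is_causality_error(const LogicalProcess *lp, const Message *msg)
{
    return msg->timestamp < lp->current_sim_time;
}

int main(void)
{
    LogicalProcess lp = { .current_sim_time = 20.0 };
    Message late = { .timestamp = 13.0 };   /* arrives "in the past" */

    if (is_causality_error(&lp, &late))
        printf("causality error: message at t=%.1f, LP already at t=%.1f\n",
               late.timestamp, lp.current_sim_time);
    return 0;
}
```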

Background & Motivation (2/3)
Conservative synchronization algorithms enforce causality by prohibiting an LP from processing an event with timestamp t1 until it can be guaranteed that no message will be received with a timestamp t2 < t1. The LP processes events until it has no more events that are safe. When this occurs, the LP must compute the lower bound on the timestamps (LBTS) of all N LPs in the simulation. The LBTS for an LP i is defined as min(T_j + lookahead_{i,j}) taken over all N LPs j, where T_j is the minimum event timestamp of LP j, including any unprocessed messages sent by LP j, and lookahead_{i,j} is the minimum simulation time possible between the current time at LP j and the timestamp of an event with a destination of LP i. Lookahead is determined by the physical properties of the system.
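Restated in standard notation (a reconstruction of the flattened formula above, not an addition to it):

\[
  \mathrm{LBTS}_i \;=\; \min_{1 \le j \le N}\bigl(T_j + \mathrm{lookahead}_{i,j}\bigr)
\]

where T_j is the minimum event or message timestamp at LP j and lookahead_{i,j} is the lookahead on the path from LP j to destination LP i.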

Background & Motivation (3/3)
The LBTS computation is complicated by the existence of transient messages: messages that have been sent but not yet processed by their receiver. Transient messages are handled by having each LP keep track of the number of messages it has sent and the number it has received. Optimistic algorithms, in contrast, permit events to be processed before causality is guaranteed. If a message is received with a timestamp less than the current simulation time (a causality error), the simulation is rolled back to its state at some time prior to the timestamp of the message. In order to perform this rollback, the state of the simulation must be saved periodically.
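A minimal sketch of the bookkeeping this implies (illustrative only; NUM_LPS and the counter names are assumptions, and a real implementation would need a consistent snapshot of the counters): each LP counts the messages it has sent and received, and no transient messages remain only when the totals agree.

```c
/* Minimal sketch (illustrative, not from the paper): per-LP counters for
 * detecting transient messages. A message is "transient" while it has been
 * counted as sent somewhere but not yet counted as received anywhere. */
#include <stdbool.h>

#define NUM_LPS 4   /* hypothetical number of logical processes */

typedef struct {
    long sent;      /* messages this LP has sent     */
    long received;  /* messages this LP has received */
} LpMessageCounters;

static LpMessageCounters counters[NUM_LPS];

static void on_send(int lp)    { counters[lp].sent++; }
static void on_receive(int lp) { counters[lp].received++; }

/* True only if every message sent so far has also been received,
 * i.e. no transient messages are currently in flight. */
static bool no_transient_messages(void)
{
    long total_sent = 0, total_received = 0;
    for (int i = 0; i < NUM_LPS; i++) {
        total_sent     += counters[i].sent;
        total_received += counters[i].received;
    }
    return total_sent == total_received;
}
```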

Global Synchronization Unit (1/4)
The GSU consists of several cooperating register files and atomic read/write instructions. The GSU approach to parallel discrete event simulation eliminates the need for costly global communications and, in addition, allows each LP to calculate a new LBTS as needed, which greatly reduces the amount of time each LP spends waiting for other LPs. We first describe the functionality of each component of the GSU; N denotes the number of CPUs.
Fig. 1. The GSU is located on-chip, with one unit per multi-core chip.
Fig. 2. The Global Synchronization Unit.

Global Synchronization Unit (2/4)
Minimum Outstanding Event (MOE) register file: N floating-point values that can be read and written in just a few clock cycles. In addition, N - 1 comparators compute the minimum value in the MOE file in lg(N) stages.
Minimum Outstanding Message (MOM) register file: N floating-point values, likewise backed by N - 1 comparators that compute the minimum value in the file. Again, this register file can be read and written in just a few clock cycles.
Transient Message Count (TMC) register file: N integer values, each of which can be atomically incremented and decremented.
Minimum Timestamp (MTS) instruction: computes the smaller of the two minima in the MOE and MOM register files.
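As a plain software model of this state (an illustrative sketch under assumed names such as GlobalSyncUnit; the paper describes a hardware unit, not C code), one GSU per chip could be represented as:

```c
/* Minimal software model of the GSU register files (illustrative sketch;
 * the paper describes a hardware unit, not this C structure). */
#include <math.h>     /* INFINITY */

#define NUM_CPUS 8    /* hypothetical N: one register per CPU/LP */

typedef struct {
    double moe[NUM_CPUS];  /* Minimum Outstanding Event timestamps, one per LP   */
    double mom[NUM_CPUS];  /* Minimum Outstanding Message timestamps, one per LP */
    int    tmc[NUM_CPUS];  /* Transient Message Counts, one per LP               */
} GlobalSyncUnit;

/* Model of the Minimum Timestamp (MTS) instruction: the smaller of the two
 * minima over the MOE and MOM register files. In hardware this is computed
 * by N - 1 comparators in lg(N) stages. */
static double gsu_minimum_timestamp(const GlobalSyncUnit *gsu)
{
    double m = INFINITY;
    for (int i = 0; i < NUM_CPUS; i++) {
        if (gsu->moe[i] < m) m = gsu->moe[i];
        if (gsu->mom[i] < m) m = gsu->mom[i];
    }
    return m;
}
```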

Global Synchronization Unit (3/4)
An atomic increment/decrement instruction increments or decrements any value in the TMC register file.
An atomic write-if-less-than instruction writes a new value into either the MOE or the MOM register file, but only if the new value is less than the value already present. The instruction returns true if successful (the new value was smaller and was written).
An atomic write-if-zero instruction writes a new value into either the MOE or the MOM register file, but only if the corresponding value in the TMC register file is zero. The instruction returns true if successful (the TMC value was zero) or false otherwise.
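A software sketch of the semantics of these instructions (illustrative only; a single lock stands in for the hardware's atomicity, and the function names are assumptions):

```c
/* Software model (illustrative sketch) of the GSU's atomic instructions.
 * A single lock stands in for the hardware's atomicity guarantees. */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t gsu_lock = PTHREAD_MUTEX_INITIALIZER;

/* Atomic increment/decrement of a TMC entry (delta is +1 or -1). */
static void tmc_add(int *tmc_entry, int delta)
{
    pthread_mutex_lock(&gsu_lock);
    *tmc_entry += delta;
    pthread_mutex_unlock(&gsu_lock);
}

/* "Write if less than": store new_value into *reg (an MOE or MOM entry)
 * only if it is smaller than the value already there. True on success. */
static bool write_if_less_than(double *reg, double new_value)
{
    pthread_mutex_lock(&gsu_lock);
    bool written = new_value < *reg;
    if (written) *reg = new_value;
    pthread_mutex_unlock(&gsu_lock);
    return written;
}

/* "Write if zero": store new_value into *reg (an MOE or MOM entry) only if
 * the corresponding TMC entry is zero. True on success. */
static bool write_if_zero(double *reg, double new_value, const int *tmc_entry)
{
    pthread_mutex_lock(&gsu_lock);
    bool written = (*tmc_entry == 0);
    if (written) *reg = new_value;
    pthread_mutex_unlock(&gsu_lock);
    return written;
}
```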

Global Synchronization Unit (4/4)
The approach works by keeping track of (a) the timestamp of the minimum event in each LP's event queue and (b) the timestamp of the minimum unprocessed message destined for each LP. The action taken by an LP to determine the lower bound on timestamp is as follows (a code sketch of the full procedure is given after step 6):
1) While TMC[myLpIndex] != 0: read an event message from the pending message queue, enqueue the event in the LP's event list, and atomically decrement TMC[myLpIndex].
2) If the timestamp of any of the new events is less than the current MOE value, use the atomic write-if-less-than instruction to update the MOE register.

Global Synchronization Unit (4/4, cont.)
3) If this LP's entry in the TMC register file is zero, use the atomic write-if-zero instruction to write a value of infinity into its MOM register. If this atomic write fails, return to step 1 above: a new message has arrived from another LP after detecting that there were no more messages, but before updating the MOM register.
4) Use the Minimum Timestamp instruction to find the smallest outstanding timestamp. It returns the minimum value over all entries in both the MOE and MOM register files, which is the timestamp of the globally smallest event in the distributed simulation, including transient messages.
5) Add the constant lookahead value to this global minimum; the result is the upper bound of the safe event window for a given LP.
6) The LP can safely dequeue and process any event in its local event list with a timestamp less than or equal to the global minimum plus lookahead.
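Putting the pieces together, a sketch of steps 1-6 for one LP might look as follows. It reuses the hypothetical GlobalSyncUnit model and lock-based helpers sketched above; Event, EventQueue, dequeue, and enqueue are assumed placeholders for the LP's local queues, and the real mechanism is hardware instructions rather than C.

```c
/* Sketch of steps 1-6 for one LP (illustrative; builds on the hypothetical
 * GlobalSyncUnit model and lock-based helpers sketched above).
 * Event, EventQueue, dequeue() and enqueue() are assumed placeholders. */
typedef struct { double timestamp; } Event;
typedef struct EventQueue EventQueue;      /* opaque local queue type        */
Event dequeue(EventQueue *q);              /* assumed: remove earliest entry */
void  enqueue(EventQueue *q, Event e);     /* assumed: insert an event       */

double compute_safe_window_bound(GlobalSyncUnit *gsu, int my_lp,
                                 EventQueue *pending_msgs, EventQueue *events,
                                 double lookahead)
{
    for (;;) {
        /* Step 1: drain pending messages while this LP's TMC entry is nonzero. */
        while (gsu->tmc[my_lp] != 0) {
            Event e = dequeue(pending_msgs);        /* read an event message   */
            enqueue(events, e);                     /* add to local event list */
            tmc_add(&gsu->tmc[my_lp], -1);          /* atomic decrement        */

            /* Step 2: lower this LP's MOE entry if the new event is earlier. */
            write_if_less_than(&gsu->moe[my_lp], e.timestamp);
        }

        /* Step 3: declare "no outstanding messages" by writing infinity into
         * this LP's MOM entry, but only if no new transient message arrived. */
        if (write_if_zero(&gsu->mom[my_lp], INFINITY, &gsu->tmc[my_lp]))
            break;
        /* The write failed: a message arrived in the meantime, so restart. */
    }

    /* Steps 4-5: the global minimum over MOE and MOM, plus the constant
     * lookahead, is the upper bound of this LP's safe event window.
     * Step 6 is done by the caller: process any local event whose
     * timestamp is <= the bound returned here. */
    return gsu_minimum_timestamp(gsu) + lookahead;
}
```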

A Simple Example (1/5)
This example demonstrates how the approach accounts for transient messages and ensures the correctness of the LBTS computation at any point in the process. We use two LPs for simplicity. A timeline of the LPs' actions, together with the value an LBTS computation would return at each point, is shown in Figure 3. The state of the GSU at each step of the example is shown in Figure 4, and the state of each LP's event queue in Figure 5.

A Simple Example (2/5)
Fig. 4. Global Synchronization Unit state for the example.
Fig. 5. Event queue state for the example.

A Simple Example (3/5)
All LPs start at simulation time = 0 with lookahead = 10. Therefore, the initial LBTS is 0, and each LP's safe event window is equal to 10.
A1) LP A populates its MOE register with the timestamp of the first event in its event queue, which is 3.
B1) LP B populates its MOE register with the timestamp of the first event in its event queue, which is 19.
At this point the LBTS, if calculated, would correctly return a value of 3. Prior to this point the LBTS would be 0, because MOE[B] still contained its initial value.

A Simple Example (4/5)
A2) LP A removes the first event from its event queue and processes it. Processing this event generates a new event with timestamp 13 whose destination is LP B.
A3) LP A atomically increments the TMC register of LP B.
A4) LP A uses the atomic write-if-less-than instruction to put the timestamp of the message into the MOM register of LP B.
A5) LP A sends the message to LP B.
A6) LP A updates its MOE register to the timestamp of the next event in its event queue, which is 15.
At this point the LBTS, if calculated, would correctly return a value of 13, because the transient message is accounted for in the MOM.

A Simple Example (5/5)
B2) LP B receives the message and puts it in its event queue.
B3) LP B updates its MOE register to the timestamp of the next event in its event queue, which is 13.
B4) LP B atomically decrements its TMC register.
B5) LP B uses the atomic write-if-zero instruction to write a value of infinity into its MOM register.
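At the end of these steps, the values given in the example imply the following quick check (assuming MOM[A] and the freshly cleared MOM[B] both hold infinity; the transcript does not state MOM[A] explicitly):

Global minimum (MTS result) = min(MOE[A], MOE[B], MOM[A], MOM[B]) = min(15, 13, ∞, ∞) = 13
Safe event window upper bound = 13 + lookahead = 13 + 10 = 23

so, per step 6, LP B may now safely process the event with timestamp 13 from its local event list.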

Conclusion
This paper presented a design for hardware-assisted conservative time synchronization: a Global Synchronization Unit (GSU). In the reported results, the cost of synchronization grows more slowly with each additional CPU, indicating that the GSU has the potential to make parallel simulation much more efficient.