Efficient Optimistic Parallel Simulations Using Reverse Computation
Chris Carothers, Department of Computer Science, Rensselaer Polytechnic Institute
Kalyan Perumalla and Richard M. Fujimoto, College of Computing, Georgia Institute of Technology

Why Parallel/Distributed Simulation?
 Goal: speed up discrete-event simulation programs using multiple processors
 Enabling technology for...
 making intractable simulation models tractable
 turning off-line decision aids into on-line aids for time-critical situation analysis
 DPAT: a distributed simulation success story
 simulation model of the National Airspace
 built by MITRE using Georgia Tech Time Warp (GTW)
 simulates 50,000 flights in under 1 minute, which used to take 1.5 hours
 web-based user interface
 to be used in the FAA Command Center for on-line “what if” planning
 Parallel/distributed simulation has the potential to improve how “what if” planning strategies are evaluated

How to Synchronize Distributed Simulations?
 parallel time-stepped simulation: lock-step execution, with a barrier between steps
 parallel discrete-event simulation: must allow for sparse, irregular event computations
Problem: events arriving in the past (a “straggler” event behind already-processed events)
Solution: Time Warp
[Figure: virtual-time diagrams of PE 1–PE 3 for both schemes, marking processed events and a “straggler” event.]

Time Warp...
Local Control Mechanism: error detection and rollback
 (1) undo state Δ’s
 (2) cancel “sent” events
Global Control Mechanism: compute Global Virtual Time (GVT)
 collect versions of state / events & perform I/O operations that are < GVT
[Figure: virtual-time diagrams of LP 1–LP 3, marking processed, “straggler”, unprocessed, and “committed” events.]
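
To make the local control mechanism concrete, the following is a minimal C sketch of a straggler-triggered rollback loop; the event/LP structures and the three helper stubs are hypothetical illustrations, not the GTW internals:

    #include <stdio.h>

    typedef struct event {
        double ts;              /* virtual timestamp */
        struct event *next;     /* processed list, newest first */
    } event_t;

    typedef struct lp {
        event_t *processed;     /* events already executed */
        double now;             /* local virtual time */
    } lp_t;

    /* Stubs standing in for the real recovery mechanisms: */
    static void undo_state_deltas(event_t *e)  { (void)e; /* restore or reverse-compute state */ }
    static void cancel_sent_events(event_t *e) { (void)e; /* send anti-messages */ }
    static void reenqueue(event_t *e)          { (void)e; /* back onto the pending queue */ }

    /* On receiving a straggler with timestamp ts < lp->now,
     * roll back every processed event at or after ts. */
    static void rollback(lp_t *lp, double ts) {
        while (lp->processed && lp->processed->ts >= ts) {
            event_t *e = lp->processed;
            lp->processed = e->next;
            undo_state_deltas(e);   /* (1) undo state deltas */
            cancel_sent_events(e);  /* (2) cancel "sent" events */
            reenqueue(e);           /* re-executed after the straggler */
        }
        lp->now = ts;               /* resume forward execution from ts */
    }

    int main(void) {
        event_t e1 = { 10.0, NULL }, e2 = { 20.0, &e1 };
        lp_t lp = { &e2, 20.0 };
        rollback(&lp, 15.0);        /* straggler at t=15 undoes e2 but not e1 */
        printf("rolled back to t=%.1f\n", lp.now);
        return 0;
    }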

Challenge: Efficient Implementation?
Advantages:
 automatically finds available parallelism
 makes development easier
 outperforms conservative schemes by a factor of N
Disadvantages:
 large memory requirements to support the rollback operation
 state saving incurs high overheads for fine-grain event computations
 Time Warp is outside the “performance” envelope for many applications
[Figure: Time Warp processors connected by shared memory or a high-speed network.]
Our Solution: Reverse Computation

Outline...
 Reverse Computation
 Example: ATM Multiplexor
 Beneficial Application Properties
 Rules for Automation
 Reversible Random Number Generator
 Experimental Results
 Conclusions
 Future Work

Our Solution: Reverse Computation...
 Use Reverse Computation (RC)
 automatically generate reverse code from the model source
 undo by executing the reverse code
 Delivers better performance
 negligible overhead for forward computation
 significantly lower memory utilization

Example: ATM Multiplexor
On cell arrival (N input links, buffer of size B):

Original:
    if (qlen < B) {
        qlen++;
        delays[qlen]++;
    } else
        lost++;

Forward (instrumented with one control bit b1):
    if (qlen < B) {
        b1 = 1;
        qlen++;
        delays[qlen]++;
    } else {
        b1 = 0;
        lost++;
    }

Reverse:
    if (b1 == 1) {
        delays[qlen]--;
        qlen--;
    } else
        lost--;
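
To see the forward/reverse pair in action, here is a minimal self-contained C sketch (the buffer size and the single-arrival driver in main are made up for illustration) that executes the instrumented forward code and then undoes it with the reverse code:

    #include <stdio.h>
    #include <assert.h>

    #define B 100

    int qlen = 0, lost = 0, delays[B + 1];
    int b1;  /* one bit of control state saved by the forward code */

    void forward(void) {           /* instrumented forward event */
        if (qlen < B) { b1 = 1; qlen++; delays[qlen]++; }
        else          { b1 = 0; lost++; }
    }

    void reverse(void) {           /* generated reverse event */
        if (b1 == 1)  { delays[qlen]--; qlen--; }
        else          { lost--; }
    }

    int main(void) {
        forward();                 /* process one cell arrival */
        printf("after forward: qlen=%d lost=%d\n", qlen, lost);
        reverse();                 /* roll it back */
        printf("after reverse: qlen=%d lost=%d\n", qlen, lost);
        assert(qlen == 0 && lost == 0);  /* state fully restored */
        return 0;
    }

Note that a single b1 bit only supports undoing the most recent event; in Time Warp, each processed event keeps its own copy of the bit so a chain of events can be reversed in LIFO order.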

Gains...
 State size reduction
 from B+2 words to 1 word
 e.g., B=100 => 100x reduction!
 Negligible overhead in forward computation
 state saving is removed from the forward computation
 and moved to the rollback phase
 Result
 significant increase in speed
 significant decrease in memory
 How?...

Beneficial Application Properties
1. Majority of operations are constructive
 e.g., ++, --, etc.
2. Size of control state < size of data state
 e.g., size of b1 < size of qlen, sent, lost, etc.
3. Perfectly reversible high-level operations gleaned from irreversible smaller operations
 e.g., random number generation

Rules for Automation...
[Table: reverse-code generation rules and upper bounds on bit requirements for various statement types.]

Destructive Assignment...
 Destructive assignment (DA):
 examples: x = y; x %= y;
 requires all modified bytes to be saved
 Caveat:
 the reversing technique for DA’s can degenerate to traditional incremental state saving
 Good news:
 certain collections of DA’s are perfectly reversible! (see the swap sketch below)
 queueing network models contain collections of easily/perfectly reversible DA’s
 queue handling (swap, shift, tree insert/delete, ...)
 statistics collection (increment, decrement, ...)
 random number generation (reversible RNGs)
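
A swap illustrates the good news above: it is composed of destructive assignments, yet the composite operation is its own inverse, so it can be undone with no saved state. A minimal C sketch:

    #include <stdio.h>
    #include <assert.h>

    /* Swap of two queue slots: three destructive assignments,
     * but the composite operation is perfectly reversible. */
    static void swap(int *a, int *b) {
        int tmp = *a;  /* tmp is dead after the swap, so it costs no state */
        *a = *b;
        *b = tmp;
    }

    int main(void) {
        int x = 1, y = 2;
        swap(&x, &y);      /* forward: x=2, y=1 */
        swap(&x, &y);      /* reverse: just swap again */
        assert(x == 1 && y == 2);
        printf("x=%d y=%d (restored)\n", x, y);
        return 0;
    }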

Reversing an RNG?

    double RNGGenVal(Generator g)
    {
        long k, s;
        double u = 0.0;

        s = Cg[0][g];
        k = s / 46693;
        s = 45991 * (s - k * 46693) - k * 25884;
        if (s < 0) s = s + 2147483647;
        Cg[0][g] = s;
        u = u + 4.65661287524579692e-10 * s;

        s = Cg[1][g];
        k = s / 10339;
        s = 207707 * (s - k * 10339) - k * 870;
        if (s < 0) s = s + 2147483543;
        Cg[1][g] = s;
        u = u - 4.65661310075985993e-10 * s;
        if (u < 0) u = u + 1.0;

        s = Cg[2][g];
        k = s / 15499;
        s = 138556 * (s - k * 15499) - k * 3979;
        if (s < 0) s = s + 2147483423;
        Cg[2][g] = s;
        u = u + 4.65661336096842131e-10 * s;
        if (u >= 1.0) u = u - 1.0;

        s = Cg[3][g];
        k = s / 43218;
        s = 49689 * (s - k * 43218) - k * 24121;
        if (s < 0) s = s + 2147483323;
        Cg[3][g] = s;
        u = u - 4.65661346693906405e-10 * s;
        if (u < 0) u = u + 1.0;

        return (u);
    }

Observation: k = s / 46693 (and each similar division) is a destructive assignment.
Result: RC degrades to classic state saving... can we do better?

RNGs: A Higher Level View
The previous RNG is based on the following recurrence:

    x_{i,n} = a_i * x_{i,n-1} mod m_i

where x_{i,n} is one of the four seed values in the nth set, m_i is one of the four largest primes less than 2^31, and a_i is a primitive root of m_i.
Now, the above recurrence is in fact reversible: the inverse of a_i modulo m_i is defined, b_i = a_i^(m_i - 2) mod m_i.
Using b_i, we can generate the reverse recurrence as follows:

    x_{i,n-1} = b_i * x_{i,n} mod m_i
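
A minimal C sketch of this reverse recurrence, assuming the first component's constants as reconstructed in the code above (a = 45991, m = 2^31 - 1): since m is prime, b = a^(m-2) mod m is the inverse multiplier by Fermat's little theorem:

    #include <stdio.h>
    #include <stdint.h>

    static int64_t mul_mod(int64_t a, int64_t b, int64_t m) {
        return (a * b) % m;     /* fits in 64 bits for m < 2^31 */
    }

    /* Square-and-multiply modular exponentiation. */
    static int64_t pow_mod(int64_t base, int64_t exp, int64_t m) {
        int64_t result = 1;
        base %= m;
        while (exp > 0) {
            if (exp & 1) result = mul_mod(result, base, m);
            base = mul_mod(base, base, m);
            exp >>= 1;
        }
        return result;
    }

    int main(void) {
        const int64_t m = 2147483647;           /* 2^31 - 1, prime */
        const int64_t a = 45991;                /* forward multiplier */
        const int64_t b = pow_mod(a, m - 2, m); /* inverse multiplier */
        int64_t seed = 12345;
        int64_t next = mul_mod(a, seed, m);     /* forward step */
        int64_t prev = mul_mod(b, next, m);     /* reverse step */
        printf("seed=%lld next=%lld recovered=%lld\n",
               (long long)seed, (long long)next, (long long)prev);
        return 0;   /* recovered == seed: the step is perfectly reversible */
    }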

Reverse Code Efficiency...
 Future RNGs may result in even greater savings.
 Consider the MT19937 generator...
 has a period of 2^19937 - 1
 uses 2496 bytes for a single “generator”
 Property...
 Non-reversibility of individual steps does NOT imply that the computation as a whole is not reversible.
 Can we automatically find this “higher-level” reversibility?
 Other reversible structures include...
 circular shift operation (see the sketch below)
 insertion & deletion operations on trees (i.e., priority queues)
Reverse computation is well-suited for queueing network models!
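
For example, a circular shift is undone simply by shifting the same distance in the opposite direction, again with no saved state. A minimal C sketch (the array contents are arbitrary):

    #include <stdio.h>
    #include <assert.h>
    #include <string.h>

    #define N 5

    /* Rotate a[0..N-1] left by one position. */
    static void rotate_left(int a[N]) {
        int first = a[0];
        memmove(&a[0], &a[1], (N - 1) * sizeof a[0]);
        a[N - 1] = first;
    }

    /* Reverse: rotate right by one -- no state was saved. */
    static void rotate_right(int a[N]) {
        int last = a[N - 1];
        memmove(&a[1], &a[0], (N - 1) * sizeof a[0]);
        a[0] = last;
    }

    int main(void) {
        int a[N] = {1, 2, 3, 4, 5}, orig[N];
        memcpy(orig, a, sizeof a);
        rotate_left(a);               /* forward */
        rotate_right(a);              /* undo */
        assert(memcmp(a, orig, sizeof a) == 0);
        printf("array restored\n");
        return 0;
    }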

Performance Study
Platform: SGI Origin 2000, 16 processors (R10000), 4 GB RAM
Model: 3 levels of multiplexers, fan-in N
 N^3 sources => N^3 + N^2 + N + 1 entities in total
 e.g., N=4 => 85 entities; N=64 => 266,305 entities

[Graph: parallel performance in millions of events per second.]
Why the large increase in parallel performance?

Cache Performance...
[Table: TLB, primary-cache, and secondary-cache fault counts for state saving (SS) vs. reverse computation (RC) on 12 PEs.]

Related Work...
 Reverse computation has been used in
 low-power processors, debugging, garbage collection, database recovery, reliability, etc.
 All previous work either
 prohibits irreversible constructs, or
 uses a copy-on-write implementation for every modification (corresponding to incremental state saving)
 Many operate at a coarse, virtual page level

Contributions
We identify that
 RC makes Time Warp usable for fine-grain models!
 disproved the previous belief that “fine-grain models can’t be optimistically simulated efficiently”
 less memory consumption, more speed, without extra user effort
 RC generalizes state saving
 e.g., incremental state saving, copy state saving
 For certain data types, RC is more memory efficient than SS
 e.g., priority queues

Future Work
 Develop state minimization algorithms, by
 state compression: bit size for reversibility < bit size of data variables
 state reuse: same state bits for different statements
 based on liveness, analogous to register allocation
 Complete the RC automation algorithm design, avoiding the straightforward incremental state saving approach for:
 lossy integer and floating-point arithmetic
 jump statements
 recursive functions

Geronimo! System Architecture
[Figure: a distributed compute server of multiprocessor, rack-mounted CPUs connected by Myrinet (not in demonstration), running the Geronimo high-performance simulation application.]
Geronimo Features: (1) “risky” or “speculative” processing of object computations, (2) reverse computation to support the “undo” operation, (3) “Active Code” in a combined, heterogeneous, shared-memory / message-passing environment...

Geronimo!: “Risky” Processing...
Execution Framework: objects schedule threads / tasks at some “virtual time”
Error detection and rollback:
 (1) undo state Δ’s
 (2) cancel “scheduled” tasks
Applications:
 discrete-event simulations
 scientific computing applications
CAVEAT: good performance relies on (cost of recovery × probability of failure) being less than the cost of being “safe”!
[Figure: virtual-time diagram of Objects 1–3, marking processed, “straggler”, and unprocessed threads.]

Geronimo!: Efficient “Undo”
 Traditional approach: state saving
 save byte-copies of modified items
 high overhead for fine-granularity computations
 large memory utilization
 need an alternative for large-scale, fine-grain simulations
 Our approach: reverse computation
 automatically generate reverse code from the model source
 utilize the reverse code to do rollback
 negligible overhead for forward computation
 significantly lower memory utilization
 joint work with Kalyan Perumalla and Richard Fujimoto
Observation: “reverse” computation treats “code” as “state”. This results in a code-state duality. Can we generalize this notion?...

Geronimo!: Active Code
 Key idea: allow object methods/code to be changed dynamically at run-time.
 objects can schedule, at a future time, a new method or re-define old methods of other objects and themselves.
 objects can erase/delete methods on themselves or other objects.
 new methods can contain “Active Code” which can re-specialize itself or other objects.
 works in a heterogeneous environment.
 How is this useful?
 increases performance by allowing the program to consistently “execute the common case fast”.
 adaptive, perturbation-free monitoring of distributed systems.
 potential for increasing a language’s “expressive power”.
 Our approach?
 Java... no, we need higher performance... maybe in the future...
 a special compiler... no, it can’t keep up with changes to microprocessors.
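
As a rough illustration of run-time method redefinition (a deliberately simplified stand-in for the re-exec mechanism described on the next slide; all names here are hypothetical), an object can dispatch through a function pointer that is reassigned on the fly:

    #include <stdio.h>

    /* An object whose "method" is a reassignable function pointer. */
    typedef struct object {
        int state;
        void (*method)(struct object *self);
    } object_t;

    static void method_v1(object_t *self) {
        printf("v1: state=%d\n", self->state);
    }

    static void method_v2(object_t *self) {  /* the "new" code */
        printf("v2: state=%d (respecialized)\n", self->state);
    }

    int main(void) {
        object_t obj = { 42, method_v1 };
        obj.method(&obj);          /* runs the original method */
        obj.method = method_v2;    /* redefine the method at run-time */
        obj.method(&obj);          /* runs the new code */
        return 0;
    }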

Geronimo!: Active Code Implementation
 Runtime infrastructure
 modifies the source code tree
 starts a rebuild of the executable on another existing machine
 uses the system’s native compiler
 Re-exec system call
 reloads only the new text (code) segment of the new executable
 fixes up the old stack to reflect the new code changes
 fixes up pointers to functions
 will run in “user space” for portability across platforms
 Language preprocessor
 instruments code to support stack and function-pointer fix-up
 instruments code to support stack reconstruction and the re-start process

Research Issues
 Software architecture for the heterogeneous, shared-memory, message-passing environment.
 Development of distributed algorithms that are fully optimized for this “combination” environment.
 What language to use for development: C, C++, or both?
 Geronimo! API.
 Active Code language and systems support.
 Mapping relevant application types to this framework.
Homework Problem: Can you find specific applications/problems where we can apply Geronimo!?