Kendo: Efficient Deterministic Multithreading in Software M. Olszewski, J. Ansel, S. Amarasinghe MIT to be presented in ASPLOS 2009 slides by Evangelos.



2 Motivation
Parallel applications
  - Non-determinism is inherent in threaded applications
  - Hard to develop, debug, test, and maintain
Modify the runtime environment so that the parallel application runs deterministically
  - Make thread communication through shared memory deterministic
  - Deterministic interleaving of lock acquisitions

3 Deterministic Multithreading
Strong Determinism
  - Same output for every run on the same input – too costly
Weak Determinism
  - Same output for every input that leads to a race-free execution under the deterministic scheduler

4 Benefits of Deterministic Multithreading
Repeatability
  - Closest existing approach: record/replay systems, which provide determinism only for a single recorded run
Debugging
  - Cyclic debugging methodology
Testing
  - Test the output or intermediate states of a program to establish correctness
Multithreaded Replicas
  - Replica-based fault tolerance
  - Give the same input to all replicas and expect the same behavior

5 Deterministic Logical Time
'P' monotonically increasing clocks, one for every thread
Count arbitrary per-thread events that are repeatable across executions
  - e.g. writes performed, instructions committed
Measure of progress for every thread
Decide on the thread interleaving (lock acquisition order) based on logical time
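A minimal sketch of the idea, not the Kendo implementation: the clock below counts only stores, and the trace format and the names `LogicalClock` and `clock_at_lock_requests` are hypothetical. It illustrates that the logical time at which a thread asks for a lock depends only on that thread's own repeatable events, never on wall-clock timing.

```python
from dataclasses import dataclass


@dataclass
class LogicalClock:
    """Per-thread clock that only ever moves forward."""
    value: int = 0

    def on_event(self, kind: str) -> int:
        # Assumption for this sketch: only stores count as clock events.
        if kind == "store":
            self.value += 1
        return self.value


def clock_at_lock_requests(trace):
    """Return the logical time of each 'lock' request in one thread's trace."""
    clock = LogicalClock()
    stamps = []
    for kind in trace:
        if kind == "lock":
            stamps.append(clock.value)
        else:
            clock.on_event(kind)
    return stamps


if __name__ == "__main__":
    trace = ["store", "load", "store", "lock", "store", "lock"]
    print(clock_at_lock_requests(trace))
```

Running the same trace again yields the same timestamps, which is the property a scheduler needs in order to order lock acquisitions repeatably.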

6 Simplified Locking Algorithm
At any given point it is exactly one thread's turn to acquire a lock. It is a thread's turn when:
  - All threads with a smaller ID have greater deterministic logical clocks
  - All threads with a larger ID have greater or equal deterministic logical clocks
Turn waiting enforces a First-Come-First-Served ordering of threads in logical time

7 Pseudocode for simplified locking algorithm
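The slide's pseudocode figure is not part of this transcript. As a stand-in, here is a hedged sketch of the turn rule stated on the previous slide, written as a single-threaded simulation rather than real lock code. The function names are hypothetical, and the clock increment after each acquisition is an assumption of this sketch.

```python
def is_turn(tid, clocks):
    """True if thread `tid` holds the turn, given every thread's logical clock."""
    mine = clocks[tid]
    for other, clock in enumerate(clocks):
        if other < tid and clock <= mine:
            return False  # a smaller-ID thread must be strictly ahead
        if other > tid and clock < mine:
            return False  # a larger-ID thread must be level or ahead
    return True


def thread_with_turn(clocks):
    """Return the single thread whose turn it is (lowest clock, ties to lowest ID)."""
    holders = [tid for tid in range(len(clocks)) if is_turn(tid, clocks)]
    assert len(holders) == 1, "the turn rule selects exactly one thread"
    return holders[0]


def acquisition_order(clocks, rounds):
    """Simulate `rounds` lock acquisitions and return the order of thread IDs."""
    clocks = list(clocks)
    order = []
    for _ in range(rounds):
        tid = thread_with_turn(clocks)
        order.append(tid)
        clocks[tid] += 1  # assumption: acquiring a lock advances the clock
    return order


if __name__ == "__main__":
    print(acquisition_order([0, 0, 0], 4))  # prints [0, 1, 2, 0]
```

The rule amounts to picking the thread with the smallest logical clock and breaking ties by smallest ID, which is why the resulting order is first-come-first-served in logical time.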

8 Improved Locking Algorithm

9

10 Optimizations
Queueing for fairness
  - Queue structure in every lock
  - The thread at the head of the queue gets the lock; other threads spin, increasing their logical clocks
Deterministic logical clock fast-forwarding
  - A thread advances its clock to lock.released_logical_time instead of spinning up to it
Lock priority boosting (?)
  - If you can predict the next thread to get a lock, then decrease its clock to give it higher priority
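A small sketch of clock fast-forwarding under stated assumptions: `DetLock`, `fast_forward`, and `spins_saved` are hypothetical names, only the field name `released_logical_time` comes from the slide, and the model assumes a spinning thread would otherwise advance its clock by one per iteration.

```python
from dataclasses import dataclass


@dataclass
class DetLock:
    # Logical time at which the lock was last released (field name from the slide).
    released_logical_time: int = 0


def fast_forward(thread_clock: int, lock: DetLock) -> int:
    """Jump the clock to the lock's release time; clocks never move backward."""
    return max(thread_clock, lock.released_logical_time)


def spins_saved(thread_clock: int, lock: DetLock) -> int:
    """Spin iterations avoided, assuming each spin would add 1 to the clock."""
    return fast_forward(thread_clock, lock) - thread_clock


if __name__ == "__main__":
    lock = DetLock(released_logical_time=10)
    print(fast_forward(3, lock), spins_saved(3, lock))    # prints 10 7
    print(fast_forward(12, lock), spins_saved(12, lock))  # prints 12 0
```

Using `max` keeps the clock monotonic: a thread that is already past the release time is left unchanged.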

11 Implementation
Deterministic Logical Clocks
  - retire_stores hardware counter; on overflow, increment the software counter maintained in shared memory
  - Chunk size: number of stores needed to cause an overflow
      - A small chunk size means higher overhead due to interrupt handlers
  - Increment amount: fidelity of the logical clock
      - Can be different when the counter overflows and when trying to get a lock
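The chunk-size trade-off can be illustrated with a little arithmetic. This is a sketch only: `software_clock` is a hypothetical helper that models the hardware counter as overflowing once every `chunk_size` retired stores, and it does not touch any real performance-counter API.

```python
def software_clock(stores_retired: int, chunk_size: int, increment: int = 1):
    """Return (logical clock value, number of overflow interrupts taken)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    overflows = stores_retired // chunk_size  # one interrupt per full chunk
    return overflows * increment, overflows


if __name__ == "__main__":
    # Same work, two chunk sizes: the smaller chunk takes more interrupts.
    print(software_clock(2500, 1000))  # prints (2, 2)
    print(software_clock(2500, 100))   # prints (25, 25)
```

A smaller chunk gives a finer-grained clock but pays for more interrupt handlers, which matches the overhead noted on the slide; the increment amount scales the clock value without changing the interrupt count.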

12 Implementation
Thread Creation
  - Need to be careful when creating new threads
  - The parent thread needs to wait for its turn before creating a new thread
Lazy reads (unprotected reads)
  - Provide an API for deterministically reading unprotected data; writes are always done with a lock
  - Keep a table of all

13 Evaluation
2.66GHz Intel Core 2 Quad running Debian
SPLASH-2 benchmark suite
  - Also parallel traveling salesperson (tsp) and parallel quicksort

14 Evaluation

15 Evaluation

16 Evaluation

17 Conclusions
Software-only solution that provides weak deterministic multithreading
Controls the interleaving of lock acquisitions to make it deterministic
Low overhead (16%) for up to four threads (?) in SPLASH benchmarks