

UW-Madison Computer Sciences Multifacet Group © 2011
Karma: Scalable Deterministic Record-Replay
Arkaprava Basu, Jayaram Bobba, Mark D. Hill
Work done at University of Wisconsin-Madison

Executive summary
Applications of deterministic record-replay:
– Debugging
– Fault tolerance
– Security
Existing hardware record-replayers:
– Fast recording, but
– Slow replay, or
– Major hardware changes required
Karma: faster replay with nearly-conventional hardware
– Extends Rerun
– Records more parallelism
2

Outline
Background & Motivation
Rerun Overview
Karma Insights
Karma Implementation
Evaluation
Conclusion
3

Deterministic Record-Replay
Multi-threaded execution is non-deterministic; deterministic record-replay reincarnates a past execution.
Record:
– Log selected events during execution
Replay:
– Use the log to reincarnate the past execution
Key challenge: memory races
4
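The key challenge above can be sketched as a toy in Python (this is an illustration of the record-replay idea, not the paper's hardware mechanism): several threads race on a shared counter, the recorder logs which thread wins each race, and the log later reproduces the same interleaving deterministically.

```python
import threading

# Toy record phase: log the order in which threads win the race on
# shared state, so the same order can be reproduced at replay time.
log = []                 # recorded order of lock acquisitions
counter = 0
lock = threading.Lock()

def recorded_increment(tid):
    global counter
    with lock:           # the "memory race" is who acquires the lock first
        log.append(tid)  # record the winner for deterministic replay
        counter += 1

threads = [threading.Thread(target=recorded_increment, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

def replay(order):
    """Toy replay phase: re-apply the increments in the recorded order."""
    c = 0
    for _tid in order:
        c += 1
    return c

assert replay(log) == counter  # replay reaches the same final state
```

The point of the sketch is only that logging the *outcome* of each race, rather than every instruction, is enough to make a later run deterministic.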

Record-Replay Motivation
Debugging
– Ensures bugs faithfully reappear (no heisenbugs)
Fault tolerance
– Enables a hot backup to shadow the primary server and take over on failure
Security
– Real-time intrusion detection & attack analysis
Replay speed matters
5

Previous work
Record dependences:
– Wisconsin Flight Data Recorder [ISCA '03, etc.]: too much state
– UCSD Strata [ASPLOS '06]: log size grows rapidly with core count
Record independence:
– UIUC DeLorean [ISCA '08]: non-conventional BulkSC hardware
– Wisconsin Rerun [ISCA '08]: sequential replay
– Intel MRR [MICRO '09]: only for snoop-based systems
– Timetraveler [ISCA '10]: extends Rerun to lower log size
Our goal:
– Retain Rerun's near-conventional hardware
– Enable faster replay
6

Outline
Background & Motivation
Rerun Overview
Karma Insights
Karma Implementation
Evaluation
Conclusion
7

Rerun’s Recording
Most code executes without races
– Use race-free regions for ordering
Episodes: independent execution regions
– Defined per thread
[Figure: per-thread streams of loads and stores on threads T0–T2, partitioned into race-free episodes]
Partially adapted from the ISCA '08 talk
8

Rerun’s Recording (Contd.)
Capturing causality:
– Timestamp episodes with a Lamport scalar clock [Lamport '78]
Replay in timestamp order
– Episodes with the same timestamp can be replayed in parallel
[Figure: episodes on threads T0–T2 labeled with scalar timestamps]
9
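A minimal sketch of the Lamport-clock rule the slide describes (all names are illustrative; real Rerun piggybacks on cache-coherence messages): when one core's access conflicts with another core's open episode, the other episode ends, and the accessing core's clock advances past it, so replay in timestamp order preserves the dependence.

```python
# Sketch of Rerun-style episode timestamping with a Lamport scalar clock.
# Core, conflict(), and the field names are assumptions for illustration.

class Core:
    def __init__(self, cid):
        self.cid = cid
        self.ts = 0       # timestamp of the episode being built
        self.refs = 0     # memory references in the current episode
        self.closed = []  # (timestamp, refs) of finished episodes

    def end_episode(self):
        self.closed.append((self.ts, self.refs))
        self.ts += 1      # the next episode is ordered after this one
        self.refs = 0

def conflict(src, dst):
    """src's next access conflicts with dst's open episode:
    close dst's episode and advance src's clock past it (Lamport rule)."""
    closing_ts = dst.ts
    dst.end_episode()
    src.ts = max(src.ts, closing_ts + 1)

c0, c1 = Core(0), Core(1)
c0.refs = 3               # c0 has executed 3 references in its episode
c1.refs = 2
conflict(c1, c0)          # c1 touches data in c0's episode
assert c0.closed == [(0, 3)]
assert c1.ts == 1         # c1's episode is ordered after c0's closed one
```

Replaying closed episodes in increasing timestamp order, with ties run in parallel, then reproduces the recorded execution.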

Rerun’s Replay
[Figure: episodes on threads T0–T2 replayed sequentially in timestamp order (TS = 22, 43, 44, 45, 60, 61)]
10

Outline
Background & Motivation
Rerun Overview
Karma Insights
Karma Implementation
Evaluation
Conclusion
11

Karma’s Insight 1: Capture order with a DAG (not a scalar clock)
Recording: the DAG is captured with per-episode predecessor & successor sets
[Figure: episode DAG across threads T0–T2]
12

Karma’s Insight 1: (Contd.)
[Figure: side-by-side comparison of Rerun’s sequential replay and Karma’s parallel replay of the same episodes on threads T0–T2]
13

Karma’s Insight 1: (Contd.)
Naïve approach: DAG arcs point to episodes
– Each episode represented by an integer ID
– Too much log-size overhead!
Our approach: DAG arcs point to cores
– Recording: only one “active” episode per core
– Replay: send wakeup message(s) to the core(s) of successor episode(s)
14

Karma’s Insight 1: (Contd.)
[Figure: anatomy of a log entry, showing predecessor/successor core bit-vectors for episodes on threads T0–T2]
15

Karma’s Insight 1: (Contd.)
Each log entry contains:
– REFS count
– Predecessor set
– Successor set
16
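Putting the two ideas together, here is a hedged Python sketch of such a log entry and of a replay loop driven by per-core wakeup messages (the field names come from the slide, but the encoding and scheduler are illustrative assumptions, not the paper's exact design):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    refs: int                                # references to replay in this episode
    pred: set = field(default_factory=set)   # predecessor core IDs to wait for
    succ: set = field(default_factory=set)   # core IDs to wake up when done

def replay(logs):
    """logs: {core_id: [LogEntry, ...]} in per-core program order.
    Returns one global retirement order. Episodes with no arc between
    them are independent; this toy scheduler just picks any ready one."""
    queues = {c: deque(entries) for c, entries in logs.items()}
    wakeups = {c: [] for c in logs}          # pending wakeup messages per core
    order = []
    while any(queues.values()):
        progressed = False
        for c, q in queues.items():
            if not q:
                continue
            e = q[0]
            # ready once a wakeup has arrived from every predecessor core
            if all(p in wakeups[c] for p in e.pred):
                for p in e.pred:
                    wakeups[c].remove(p)
                q.popleft()
                order.append((c, e.refs))
                for s in e.succ:
                    wakeups[s].append(c)     # wake successor episodes' cores
                progressed = True
        assert progressed, "cycle in the episode DAG"
    return order

# Two cores; core 1's second episode must follow core 0's first episode.
logs = {
    0: [LogEntry(refs=3, succ={1})],
    1: [LogEntry(refs=2), LogEntry(refs=4, pred={0})],
}
order = replay(logs)
assert order.index((0, 3)) < order.index((1, 4))
```

Because arcs name cores rather than episode IDs, and each core has only one active episode at a time, a small predecessor/successor set per entry is enough to reconstruct the whole DAG at replay.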

Karma’s Insight 2
It is not necessary to end an episode on every conflict:
– As long as the episodes can still be ordered during replay
[Figure: per-thread streams of loads and stores on threads T0–T2, with episodes extended past conflicts]
17

Outline
Background & Motivation
Rerun Overview
Karma Insights
Karma Implementation
Evaluation
Conclusion
18

Karma’s Per-Core State
[Figure: 16-core base system — per-core pipeline, L1 I/D caches, banked L2 with data tags and directory coherence controller, interconnect, DRAM — with Rerun's L2/memory state and Karma's per-core additions highlighted]
Per-core Karma hardware:
– Address Filter (FLT)
– Reference count (REFS)
– Predecessor set (PRED)
– Successor set (SUCC)
– Timestamp (TS)
Total state: 148 bytes/core
19
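Grouped as a record, the per-core state named on the slide looks roughly like this (field widths are omitted on purpose: the slide only gives the 148-byte/core total, and the exact breakdown is in the paper):

```python
from dataclasses import dataclass, field

# Illustrative grouping of the per-core Karma state listed on the slide.
@dataclass
class KarmaCoreState:
    flt: set = field(default_factory=set)   # Address Filter: blocks touched by the open episode
    refs: int = 0                           # REFS: reference count of the open episode
    pred: set = field(default_factory=set)  # PRED: predecessor core IDs of the open episode
    succ: set = field(default_factory=set)  # SUCC: successor core IDs of the open episode
    ts: int = 0                             # TS: timestamp retained from Rerun

state = KarmaCoreState()
state.flt.add(0x1000)   # a cache block touched by the current episode
state.refs += 1
```

Note that only one such record exists per core, which is what lets Karma's DAG arcs name cores instead of episodes.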

Outline
Background & Motivation
Rerun Overview
Karma Insights
Karma Implementation
Evaluation
Conclusion
20

Evaluation: Were we able to speed up the replay?
[Figure: replay speedup results]
21

Evaluation: Were we able to speed up the replay?
[Figure: replay speedup results]
On average, ~4x improvement in replay speed over Rerun
22

Evaluation: Did we blow up the log size?
On average, Karma does not increase the log size; as we allow larger episodes, it instead shrinks the log by as much as 40%
23

Outline
Background & Motivation
Rerun Overview
Karma Insights
Karma Implementation
Evaluation
Conclusion
24

Conclusion
Applications of deterministic replay:
– Debugging
– Fault tolerance
– Security
Existing hardware record-replayers:
– Slow replay, or
– Major hardware changes required
Karma: faster replay with nearly-conventional hardware
– Extends Rerun
– Uses a DAG instead of a scalar clock
– Extends episodes past conflicts
Wider applications + lower cost → more attractive
25

Questions? 26