Troubleshooting SDN Control Software with Minimal Causal Sequences COLIN SCOTT, ANDREAS WUNDSAM, BARATH RAGHAVANAUROJIT PANDA, ANDREW OR, JEFFERSON LAI,EUGENE.

Slides:



Advertisements
Similar presentations
Computer Systems & Architecture Lesson 2 4. Achieving Qualities.
Advertisements

Toward Interactive Debugging for ISP Networks Chia-Chi Lin, Matthew Caesar, Jacobus Van der Merwe § University of Illinois at Urbana-Champaign § AT&T Labs.
Remus: High Availability via Asynchronous Virtual Machine Replication
Openflow App Testing Chao SHI, Stephen Duraski. Motivation Network is still a complex stuff ! o Distributed mechanism o Complex protocol o Large state.
Testing Concurrent/Distributed Systems Review of Final CEN 5076 Class 14 – 12/05.
HP Quality Center Overview.
CHESS: A Systematic Testing Tool for Concurrent Software CSCI6900 George.
Module 20 Troubleshooting Common SQL Server 2008 R2 Administrative Issues.
Statistical Approaches for Finding Bugs in Large-Scale Parallel Systems Leonardo R. Bachega.
Testing and Analysis of Device Drivers Supervisor: Abhik Roychoudhury Author: Pham Van Thuan 1.
EEC 688/788 Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Continuously Recording Program Execution for Deterministic Replay Debugging.
Transaction Processing Lecture ACID 2 phase commit.
CMPT 431 Dr. Alexandra Fedorova Lecture VIII: Time And Global Clocks.
CS 290C: Formal Models for Web Software Lecture 10: Language Based Modeling and Analysis of Navigation Errors Instructor: Tevfik Bultan.
EEC 688/788 Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Statistical Debugging: A Tutorial Steven C.H. Hoi Acknowledgement: Some slides in this tutorial were borrowed from Chao Liu at UIUC.
1 Principles of Reliable Distributed Systems Lecture 5: Failure Models, Fault-Tolerant Broadcasts and State-Machine Replication Spring 2005 Dr. Idit Keidar.
CS590 Z Software Defect Analysis Xiangyu Zhang. CS590F Software Reliability What is Software Defect Analysis  Given a software program, with or without.
EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering.
Colin Scott, Andreas Wundsam, Barath Raghavan, Aurojit Panda, Andrew Or, Jefferson Lai, Eugene Huang, Zhi Liu, Ahmed El-Hassany, Sam Whitlock, Hrishikesh.
16: Distributed Systems1 DISTRIBUTED SYSTEM STRUCTURES NETWORK OPERATING SYSTEMS The users are aware of the physical structure of the network. Each site.
SDLC. Information Systems Development Terms SDLC - the development method used by most organizations today for large, complex systems Systems Analysts.
1 Software Testing and Quality Assurance Lecture 5 - Software Testing Techniques.
Learning From Mistakes—A Comprehensive Study on Real World Concurrency Bug Characteristics Shan Lu, Soyeon Park, Eunsoo Seo and Yuanyuan Zhou Appeared.
Department of Computer Science 1 CSS 496 Business Process Re-engineering for BS(CS)
Replay Debugging for Distributed Systems Dennis Geels, Gautam Altekar, Ion Stoica, Scott Shenker.
Automated Diagnosis of Software Configuration Errors
Presenter: Chi-Hung Lu 1. Problems Distributed applications are hard to validate Distribution of application state across many distinct execution environments.
Computer System Lifecycle Chapter 1. Introduction Computer System users, administrators, and designers are all interested in performance evaluation. Whether.
Exceptions and Mistakes CSE788 John Eisenlohr. Big Question How can we improve the quality of concurrent software?
CS527: (Advanced) Topics in Software Engineering Overview of Software Quality Assurance Tao Xie ©D. Marinov, T. Xie.
Timing and Race Condition Verification of Real-time Systems Yann–Hang Lee, Gerald Gannod, and Karam Chatha Dept. of Computer Science and Eng. Arizona State.
Software Quality Assurance Lecture #8 By: Faraz Ahmed.
Testing. Definition From the dictionary- the means by which the presence, quality, or genuineness of anything is determined; a means of trial. For software.
VeriFlow: Verifying Network-Wide Invariants in Real Time
Peng Liu and Charles Zhang Prism Research Group Department of Computer Science and Engineering Hong Kong University of Science and Technology 1.
Scalable Analysis of Distributed Workflow Traces Daniel K. Gunter and Brian Tierney Distributed Systems Department Lawrence Berkeley National Laboratory.
SSGRR A Taxonomy of Execution Replay Systems Frank Cornelis Andy Georges Mark Christiaens Michiel Ronsse Tom Ghesquiere Koen De Bosschere Dept. ELIS.
Dynamic Analysis of Multithreaded Java Programs Dr. Abhik Roychoudhury National University of Singapore.
DEBUGGING. BUG A software bug is an error, flaw, failure, or fault in a computer program or system that causes it to produce an incorrect or unexpected.
Parallelizing Security Checks on Commodity Hardware Ed Nightingale Dan Peek, Peter Chen Jason Flinn Microsoft Research University of Michigan.
Replay Compilation: Improving Debuggability of a Just-in Time Complier Presenter: Jun Tao.
Today’s Agenda  HW #1  Finish Introduction  Input Space Partitioning Software Testing and Maintenance 1.
1 Qualitative Reasoning of Distributed Object Design Nima Kaveh & Wolfgang Emmerich Software Systems Engineering Dept. Computer Science University College.
ATLAS Grid Data Processing: system evolution and scalability D Golubkov, B Kersevan, A Klimentov, A Minaenko, P Nevski, A Vaniachine and R Walker for the.
1 Introduction to Software Testing. Reading Assignment P. Ammann and J. Offutt “Introduction to Software Testing” ◦ Chapter 1 2.
Xusheng Xiao North Carolina State University CSC 720 Project Presentation 1.
- 1 - ©2009 Jasper Design Automation ©2009 Jasper Design Automation JasperGold for Targeted ROI JasperGold solutions portfolio delivers competitive.
“Isolating Failure Causes through Test Case Generation “ Jeremias Rößler Gordon Fraser Andreas Zeller Alessandro Orso Presented by John-Paul Ore.
Design - programming Cmpe 450 Fall Dynamic Analysis Software quality Design carefully from the start Simple and clean Fewer errors Finding errors.
Course: COMS-E6125 Professor: Gail E. Kaiser Student: Shanghao Li (sl2967)
AMH001 (acmse03.ppt - 03/7/03) REMOTE++: A Script for Automatic Remote Distribution of Programs on Windows Computers Ashley Hopkins Department of Computer.
Software Quality Assurance and Testing Fazal Rehman Shamil.
A Binary Agent Technology for COTS Software Integrity Anant Agarwal Richard Schooler InCert Software.
Execution Replay and Debugging. Contents Introduction Parallel program: set of co-operating processes Co-operation using –shared variables –message passing.
CHESS Finding and Reproducing Heisenbugs in Concurrent Programs
Agenda  Quick Review  Finish Introduction  Java Threads.
Testing Overview Software Reliability Techniques Testing Concepts CEN 4010 Class 24 – 11/17.
Testing Concurrent Programs Sri Teja Basava Arpit Sud CSCI 5535: Fundamentals of Programming Languages University of Colorado at Boulder Spring 2010.
Specifying and executing behavioral requirements The play-in/play-out approach David Harel, Rami Marelly Weizmann Institute of Science, Israel An interpretation.
Introduction to Computer Programming Concepts M. Uyguroğlu R. Uyguroğlu.
On the Relation Between Simulation-based and SAT-based Diagnosis CMPE 58Q Giray Kömürcü Boğaziçi University.
1 Active Random Testing of Parallel Programs Koushik Sen University of California, Berkeley.
SDN controllers App Network elements has two components: OpenFlow client, forwarding hardware with flow tables. The SDN controller must implement the network.
Using Execution Feedback in Test Case Generation
Lazy Diagnosis of In-Production Concurrency Bugs
Yongle Zhang, Serguei Makarov, Xiang Ren, David Lion, Ding Yuan
runtime verification Brief Overview Grigore Rosu
Reference-Driven Performance Anomaly Identification
Presentation transcript:

Troubleshooting SDN Control Software with Minimal Causal Sequences COLIN SCOTT, ANDREAS WUNDSAM, BARATH RAGHAVANAUROJIT PANDA, ANDREW OR, JEFFERSON LAI,EUGENE HUANG, ZHI LIU, AHMED EL-HASSANY,SAM WHITLOCK, HRISHIKESH B. ACHARYA, KYRIAKOS ZARIFIS,ARVIND KRISHNAMURTHY, SCOTT SHENKER

Bugs are costly and time consuming  Software bugs cost US economy $59.6 Billion annually  Developers spend ~50% of their time debugging  Best developers devoted to debugging

Distributed Systems are Bug-Prone Distributed correctness faults:  Race conditions  Atomicity violations  Deadlock  Livelock

Example Bug (Floodlight, 2012)

Best Practice: Logs Human analysis of log files

Best Practice: Logs

Our Goal Allow developers to focus on fixing the underlying bug

Problem Statement Identify a minimal sequence of inputs that triggers the bug in a blackbox fashion

Why minimization? Smaller event traces are easier to understand

Minimal Causal Sequence

Where Bugs are Found  Symptoms found: On developer’s local machine (unit and integration tests) In production environment On quality assurance testbed

Approach: Delta Debugging Replay

Approach: Modify Testbed

Testbed Observables  Invariant violation detected by testbed  Event Sequence : 1.External events (link failures, host migrations,..) injected by testbed 2.Internal events (message deliveries) observed by testbed (incomplete)

Replay Definition  A replay of log L involves replaying the external events E L, possibly taking into account the occurrence of internal events I L  The output of replay is a sequence of configurations  Ideally replay(L) reproduces the original configuration sequence

Approach: Delta Debugging Replay Events (link failures, crashes, host migrations) injected by test orchestrator

Key Point Must Carefully Schedule Replay Events To Achieve Minimization!

Challenges  Asynchrony  Divergent execution  Non-determinism

Challenge: Asynchrony Asynchrony definition:  No fixed upper bound on relative speed of processors  No fixed upper bound on time for messages to be delivered

Challenge: Asynchrony Need to maintain original event order

Challenge: Asynchrony

Coping with Asynchrony Use interposition to maintain causal dependencies

Challenge: Divergence Divergence: Absent Internal Events Prune Earlier Input..

Divergence: Absent Internal Events Some Events No Longer Appear Divergence: Absent Internal Events

Solution: Peek Ahead Infer which internal events will occur

Challenge: Non-determinism Coping With Non-Determinism  Replay multiple times per subsequence  Assuming i.i.d., probability of not finding bug modeled by:  If not i.i.d., override gettimeofday(), multiplex sockets, interpose on logging statements

Divergence: Syntactic Changes

Sequence Numbers Differ!

Solution: Equivalence Classes Mask Over Extraneous Fields

Solution: Peek ahead

Divergence: Unexpected Events Prune Input..

Divergence: Unexpected Events Unexpected Events Appear

Solution: Emperical Heuristic  Theory: Divergent paths ->Exponential possibilities  Practice: Allow unexpected events through

Approach Recap  Replay events in QA testbed  Apply delta debugging to inputs  Asynchrony: interpose on messages  Divergence: infer absent events  Non-determinism: replay multiple times

Evaluation Methodology  Evaluate on 5 open source SDN controllers (Floodlight, NOX, POX, Frenetic, ONOS)  Quantify minimization for: Synthetic bugs Bugs found in the wild  Qualitatively relay experience troubleshooting with MCSes

Case Studies

Comparison to Naïve Replay  Naïve replay: ignore internal events  Naïve replay often not able to replay at all 5 / 7 discovered bugs not replayable 1 / 7 synthetic bugs not replayable  Naïve replay did better in one case 2 event MCS vs. 7 event MCS with our techniques

Qualitative Results  15 / 17 MCSes useful for debugging 1 non-replayable case (not surprising) 1 misleading MCS (expected)

Case Studies

Runtime

Scalability

Coping with Non-Determinism

Complexity

Ongoing work  Formal analysis of approach  Apply to other distributed systems (databases, consensus protocols)  Investigate effectiveness of various interposition points  Integrate STS into ONOS (ON.Lab) development workflow

Related work  Thread Schedule Minimization Isolating Failure-Inducing Thread Schedules. SIGSOFT ’02. A Trace Simplification Technique for Effective Debugging of Concurrent Programs. FSE ’10.  Program Flow Analysis Enabling Tracing of Long-Running Multithreaded Programs via Dynamic Execution Reduction. ISSTA ’07. Toward Generating Reducible Replay Logs. PLDI ’11.  Best-Effort Replay of Field Failures A Technique for Enabling and Supporting Debugging of Field Failures. ICSE ’07. Triage: Diagnosing Production Run Failures at the User’s Site.SOSP ’07.

Improvements  Parallelize delta debugging  Smarter delta debugging time splits  Apply program flow analysis to further prune  Compress time

Conclusion  Possible to automatically minimize execution traces for SDN control software  System (23K+ lines of Python) evaluated on 5 open source SDN controllers (Floodlight,NOX, POX, Frenetic, ONOS) and one proprietary controller  Currently generalizing, formalizing approach