Statistical Approaches for Finding Bugs in Large-Scale Parallel Systems Leonardo R. Bachega.

Papers
1. Problem Diagnosis in Large-Scale Computing Environments, A. Mirgorodskiy, N. Maruyama, Barton Miller, SC 2006
2. DMTracker: Finding Bugs in Large-Scale Parallel Programs by Detecting Anomaly in Data Movements, Q. Gao, F. Qin, D. Panda, SC 2007

Motivation for the Papers
Debugging is a very hard task
–About half of the development time in sequential applications
–The problem is magnified in systems with hundreds of processes
Massively parallel systems are becoming popular
–How do we make parallel debugging easier by leveraging statistical bug detection techniques?

Background
Statistical Techniques
–Explore properties likely to hold at certain program points
–Run-time information collected in traces
–Empirical execution models (profiles): built from trace information
–Find similarities (and dissimilarities) between profiles
–Classification into groups
–Outliers as suspects for buggy behavior
–Assumption: correct behavior is the common case; faulty behavior is unusual, a deviation from the common case

Paper 1: Miller’s
[Figure: Proc 1, Proc 2, Proc 3, …, Proc N-1, Proc N; labels: processes performing similar tasks, anomalous behavior]

Paper’s Main Ideas
Detect unusual process behavior by comparison with other processes
“Control flow” trace collection
–Function call information
Per-process trace analysis
–Fail-stop: processes that stop generating traces
–Distance-based outlier detection: isolate processes that behave differently (non-fail-stop)

Fault Model
Non-deterministic fail-stop failures
–A failing process stops collecting traces earlier
Infinite loops
–The process spends an unusual amount of time in a particular function
Deadlock, livelock, starvation
–Deadlocked processes stop generating traces
–Starving processes spend time in different parts of the code than processes that were granted resources
Load imbalance
–Unusually little time spent in certain parts
–Analyst identifies

Limitations of Fault Model
A problem that…
–Happens in all nodes is considered normal behavior
–Doesn’t change the control flow is not detected
–Happens too early can’t be tracked, since trace collection is limited (can’t go too far back in history)

Finding Misbehaving Host
Earliest Last Timestamp
–Identifies the host that stopped generating the trace
–Fail-stop problems: crashes, infinite blocking
–Assumes global clock synchronization
–Criterion: $|T_{min} - T_{avg}| > \text{threshold}$
Behavioral Outliers
–Identify traces different from the rest
–Distance-based outlier detection
–Pair-wise distance between traces
–Suspect score for each process
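The earliest-last-timestamp check can be sketched in a few lines. This is only an illustration under assumptions: the host names, timestamps, and threshold are made up, and `earliest_last_timestamp` is a hypothetical helper, not code from the paper.

```python
from statistics import mean

def earliest_last_timestamp(last_ts, threshold):
    """Return the host whose trace ended earliest if |T_min - T_avg| > threshold, else None.

    last_ts maps a host name to the timestamp (in seconds) of its last trace
    entry; clocks are assumed to be globally synchronized.
    """
    t_avg = mean(last_ts.values())
    host, t_min = min(last_ts.items(), key=lambda item: item[1])
    return host if abs(t_min - t_avg) > threshold else None

# Hypothetical data: node3 stops tracing roughly 500 seconds before the others.
last_ts = {"node1": 1000.0, "node2": 1002.0, "node3": 498.0, "node4": 1001.0}
suspect = earliest_last_timestamp(last_ts, threshold=100.0)
print(suspect)
```

If no host stops markedly earlier than the average, the function returns None and the analysis would fall back to the behavioral-outlier approach.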

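Profile’s distance metrics
Profile of host h: one entry per function, e.g., time spent at $f_1$ in host h
Manhattan distance between profiles: $d(g,h) = \sum_i |p_i(g) - p_i(h)|$, where $p_i(h)$ is the profile entry of host h for function $f_i$
If h and g are similar, each function will consume similar amounts of time on both hosts and d(g,h) will be low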
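A minimal sketch of the distance between two time profiles follows. Normalizing each profile by the host's total time is an assumption made here so that hosts with different run lengths stay comparable; the function names and the sample profiles are hypothetical.

```python
def normalize(profile):
    """Convert absolute per-function times into fractions of the host's total time."""
    total = sum(profile.values())
    return {func: t / total for func, t in profile.items()}

def manhattan(g, h):
    """Manhattan distance: sum of per-function absolute differences."""
    funcs = set(g) | set(h)  # a function missing from one profile counts as 0
    return sum(abs(g.get(f, 0.0) - h.get(f, 0.0)) for f in funcs)

# Hypothetical profiles: seconds spent in each function on three hosts.
host_a = normalize({"compute": 8.0, "send": 2.0})
host_b = normalize({"compute": 7.0, "send": 3.0})
host_c = normalize({"compute": 1.0, "send": 1.0, "spin_wait": 8.0})

d_ab = manhattan(host_a, host_b)  # similar hosts, small distance
d_ac = manhattan(host_a, host_c)  # dissimilar hosts, large distance
print(d_ab, d_ac)
```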

Behavioral Outliers
Consider all common behaviors as normal
Parameter k adjusts what counts as common behavior
Score: high for outliers, low for common behavior
Suspect score computed with the k-nearest neighbor algorithm
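One common way to realize such a score is to take, for each process, the distance to its k-th nearest neighbor. The sketch below assumes that definition, along with a Manhattan distance over fixed-length profile vectors; the profiles and the helper names are hypothetical.

```python
def manhattan(a, b):
    """Manhattan distance between two equal-length profile vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def suspect_scores(profiles, k):
    """Score each process by the distance to its k-th nearest neighbor."""
    scores = {}
    for name, profile in profiles.items():
        dists = sorted(
            manhattan(profile, other)
            for other_name, other in profiles.items()
            if other_name != name
        )
        scores[name] = dists[k - 1]
    return scores

# Hypothetical normalized profiles: p1..p3 behave alike, p4 does not.
profiles = {
    "p1": [0.80, 0.20, 0.00],
    "p2": [0.70, 0.30, 0.00],
    "p3": [0.75, 0.25, 0.00],
    "p4": [0.10, 0.10, 0.80],
}
scores = suspect_scores(profiles, k=2)
top_suspect = max(scores, key=scores.get)
print(top_suspect, scores)
```

Under this definition, a behavior shared by more than k processes scores low, so raising k makes the notion of "common" stricter.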

Finding Anomalies’ Causes
Last Trace Entry: function that failed
–Can be misleading
–Solution: look at sequences of calls
Max of Delta Vector: function that differs most from the normal behavior (largest contribution to suspect score)
Anomalous time interval
–Partition traces from all hosts into short intervals
–Apply outlier detection: identify the earliest fragment with an outlier
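The max-of-delta-vector idea can be illustrated as follows, assuming the suspect's profile is compared against the profile of one process taken as normal. How that reference is chosen is not stated in the slides; the data and the helper name are hypothetical.

```python
def max_delta_function(suspect, reference):
    """Return the function with the largest absolute profile difference, and that difference."""
    funcs = set(suspect) | set(reference)
    delta = {f: suspect.get(f, 0.0) - reference.get(f, 0.0) for f in funcs}
    worst = max(delta, key=lambda f: abs(delta[f]))
    return worst, delta[worst]

# Hypothetical normalized profiles (fraction of time per function).
suspect = {"compute": 0.1, "send": 0.1, "spin_wait": 0.8}
reference = {"compute": 0.8, "send": 0.2}

func, diff = max_delta_function(suspect, reference)
print(func, diff)
```

Here the function that contributes most to the distance is the one the suspect spends most of its extra time in, which is where an analyst would look first.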

Results
Network stability problem
–Fail-stop behavior
–One node stops 500 seconds earlier than others
–Earliest timestamp approach
Broadcast service
–No fail-stop behavior
–Suspect score from failed run traces

Summary and Conclusions
Trace analysis to explain failures in large-scale distributed systems
Detect anomalies rather than massive failures
Identify both fail-stop and non-fail-stop anomalous behavior

Paper 2: DMTracker
[Figure: two rows of Proc 1, Proc 2, Proc 3, …, Proc N-1, Proc N performing similar tasks; labels: anomalous behavior, spatial dissimilarity, temporal dissimilarity]

Paper’s Main Ideas
Tracks abnormal behaviors in data movements (DM)
Works on data movement chains: memory allocations, copies, sends/receives
Extracts DM-invariants and checks for violations of these invariants
Violations indicate potential bugs
Two types of invariants:
–Temporal: frequently occurring data movements (Frequent Chain, or FC)
–Spatial: clusters of data movements across processes (Chain Distribution, or CD)

Data Movement Chains
Single-processor DMs
Multi-processor DMs
Match sends/receives from processes’ traces
Concatenation of memory operations of a trace file

[Figure key: data movement chain; panels: normal execution, buggy execution]

Data Movement-Based Invariants
FC-invariants: based on temporal similarity
–Similar DM-chains occur many times during execution
–Large (frequently occurring) groups of DM-chains
CD-invariants: based on spatial similarity
–Processes perform similar or identical tasks
–Clusters of chain distributions serve as CD-invariants

DMTracker: Design Overview
Function calls
–Memory management: allocation/deallocation
–Data movement: copies/network operations
Records
–Key arguments / return values
–Call sites
–Thread IDs
–Local timestamps
Correlates each operation to its source and destination

Invariants generation
Groups are formed by chains of the same type
Chains of the same type have the same
–call sites for individual DM operations
–allocation call sites for source and destination buffers
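Grouping chains by type can be sketched as keying each chain on the call sites of its operations and on the allocation call sites of their buffers. The `Op` record, its field names, and the call-site strings below are hypothetical stand-ins, not DMTracker's actual data structures.

```python
from collections import defaultdict
from typing import NamedTuple

class Op(NamedTuple):
    call_site: str       # where the copy/send/receive was issued
    src_alloc_site: str  # where the source buffer was allocated
    dst_alloc_site: str  # where the destination buffer was allocated

def group_chains(chains):
    """Map each chain type to the indices of the chains that have that type."""
    groups = defaultdict(list)
    for index, chain in enumerate(chains):
        # Two chains share a type when every operation matches field by field.
        groups[tuple(chain)].append(index)
    return dict(groups)

# Hypothetical operations identified by made-up "file:line" call sites.
copy_a = Op("solver.c:120", "init.c:40", "init.c:52")
send_a = Op("solver.c:131", "init.c:52", "halo.c:17")
copy_b = Op("io.c:88", "io.c:60", "io.c:61")

chains = [[copy_a, send_a], [copy_b], [copy_a, send_a]]
groups = group_chains(chains)
print({len(key): members for key, members in groups.items()})
```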

FC-Invariants
Two criteria for invariants
–Chains in the group must happen frequently
–Chain type of each group must be “unique”
Uniqueness of a chain: aggregation of the uniqueness values of its memory operations
Tunable parameters
# of segments of data

FC-Invariant Anomaly Detection
Abnormality of P compared to C
–Component measures are combined using the harmonic mean
Threshold for abnormality is an adjustable parameter
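The effect of combining measures with a harmonic mean can be shown with two placeholder scores. The slides do not preserve which measures DMTracker combines, so `measure_a` and `measure_b` are purely hypothetical values in (0, 1], as is the threshold.

```python
from statistics import harmonic_mean

def combined_abnormality(measure_a, measure_b):
    """Combine two abnormality measures; the result stays close to the smaller one."""
    return harmonic_mean([measure_a, measure_b])

def is_abnormal(measure_a, measure_b, threshold):
    """Flag a chain when the combined score exceeds the adjustable threshold."""
    return combined_abnormality(measure_a, measure_b) > threshold

both_high = combined_abnormality(0.9, 0.8)  # both measures agree the chain is unusual
one_low = combined_abnormality(0.9, 0.1)    # one measure disagrees, score collapses
print(both_high, one_low)
```

Because the harmonic mean is pulled toward the smaller input, a chain is flagged only when both measures are high, which is a plausible reason to prefer it over an arithmetic mean.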

CD-Invariants
Clusters of chain distributions across processes: one profile per trace (process)
–DM chains in a particular trace
–DM chains originated in a particular trace
Profile: frequency of chains in a trace
–$\text{profile}(T) = (n_1/N, n_2/N, \dots, n_m/N)$, where $m$ is the total # of distinct chain groups, $N$ is the total # of chains in trace T, and $n_i$ is the total # of chains of group $C_i$ in trace T
K-nearest neighbor used to build invariants (clusters)
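A chain-distribution profile of this kind can be computed with a simple frequency count. The sketch assumes each chain in a trace has already been labeled with its group; the group labels and the helper name are hypothetical.

```python
from collections import Counter

def chain_distribution(chain_groups_in_trace, all_groups):
    """Fraction of the trace's chains that fall in each chain group."""
    counts = Counter(chain_groups_in_trace)
    total = len(chain_groups_in_trace)
    if total == 0:
        return [0.0] * len(all_groups)  # empty trace: no chains in any group
    return [counts[group] / total for group in all_groups]

# Hypothetical trace with four chains, drawn from four known chain groups.
all_groups = ["C1", "C2", "C3", "C4"]
trace = ["C1", "C2", "C2", "C3"]
profile = chain_distribution(trace, all_groups)
print(profile)
```

Each trace yields one vector of the same length, so the vectors of different processes can be compared directly with a distance metric.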

CD-Invariant Anomaly Detection
Abnormal trace: distance to its k-th nearest neighbor exceeds a threshold
Exactly the same procedure as in Paper 1!

DMTracker Results
FC-invariant (occurring 15,075 times) violated by similar chains 154 times
–All processes triggered the bug
CD-invariant: catches a non-deterministic bug

DMTracker Summary
Data movement chains derived from traces
Frequent Chain and Chain Distribution invariants capture temporal and spatial correlations in parallel systems
Case studies show bug detection

General Observations
Use of spatial and temporal invariants
Detection of deviant behavior as opposed to common behavior
Simple machine learning techniques applied for data classification
Bug detection in large systems using outlier detection
Very few results to support broad conclusions about the effectiveness of the techniques