Distributed Self-Propelled Instrumentation Alex Mirgorodskiy VMware, Inc. Barton P. Miller University of Wisconsin-Madison

Distributed Self-Propelled Instrumentation -2- Motivation
Problems are difficult to reproduce
– Intermittent or environment-specific (anomalies)
– "Rare but dangerous"
Systems are large collections of black boxes
– Many distributed components, different vendors
– Little support for monitoring/debugging
Collected data are difficult to analyze
– High volume
– High concurrency
Diagnosis of production systems is hard.

Distributed Self-Propelled Instrumentation -3- Common Environments: E-commerce systems
Multi-tier: clients, Web servers, DB servers, business logic
Hard to debug: vendors have SWAT teams to fix bugs
– Some companies get paid $1000/hour
[Figure: multi-tier e-commerce deployment — clients behind the Internet and a firewall, a proxy server and load balancer, and pools of Web server (W1–W3), App server (A1–A3), and Database server (D1–D2) processes]

Distributed Self-Propelled Instrumentation -4- Common Environments
Clusters and HPC systems
– Large-scale: failures happen often (MTTF: 30–150 hours)
– Complex: processing a Condor job involves 10+ processes
The Grid: beyond a single supercomputer
– Decentralized
– Heterogeneous: different schedulers, architectures
Hard to detect failures, let alone debug them
[Figure: Condor job processing — a job description submitted on the submitting host (submit, schedd, shadow), matched by the central manager (negotiator, collector), and run on the execution host (startd, starter, user job)]

Distributed Self-Propelled Instrumentation -5- Approach
User provides activation and deactivation events
Agent propagates through the system
– Collects distributed control-flow traces
Framework analyzes traces automatically
– Separates traces into flows (e.g., HTTP requests)
– Identifies anomalous flows and the causes of anomalies
[Figure: the agent propagating across the network among processes P, Q, and R on hosts A and B]

Distributed Self-Propelled Instrumentation -6- Self-Propelled Instrumentation: Overview
The agent sits inside the process
– Agent = small code fragment
The agent propagates through the code
– Receives control
– Inserts calls to itself ahead of the control flow
– Crosses process, host, and kernel boundaries
– Returns control to the application
Key features
– On-demand distributed deployment
– Application-agnostic distributed deployment

Distributed Self-Propelled Instrumentation -7- Within-process Propagation
Dynamic, low-overhead control-flow tracing: Inject, Activate, Propagate, Analyze (build call graph/CFG with Dyninst)
[Figure: x86 disassembly of a.out — the direct call to foo and the indirect call *%eax in bar are overwritten with jumps into instrumenter.so; Patch1 calls instrument(foo), calls foo, and jumps back to 0x8405, while Patch2 calls instrument(%eax), performs the indirect call, and jumps back to 0x8446]
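As a loose Python analogy (not the binary-patching mechanism shown above), a trace function that receives control at every call and return illustrates the same idea of instrumentation staying ahead of the control flow; the functions foo and bar below are made up for illustration.

    import sys

    # Loose analogy only: a tracer that receives control at each call/return
    # and records an event, much as the agent's patches do ahead of each call
    # site. It does not model binary patching or Dyninst-based analysis.
    events = []

    def tracer(frame, event, arg):
        if event in ("call", "return"):
            events.append((event, frame.f_code.co_name))
        return tracer          # keep tracing inside the new frame

    def bar():
        return 42

    def foo():
        return bar()

    sys.settrace(tracer)
    foo()
    sys.settrace(None)
    print(events)   # [('call', 'foo'), ('call', 'bar'), ('return', 'bar'), ('return', 'foo')]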

Distributed Self-Propelled Instrumentation -8- Cross-process Propagation
On-demand distributed deployment
[Figure: when process P (a.out + agent.so) on host A is about to send(msg), the agent determines the peer and contacts spDaemon on host B, which maps the port to a process (recv(portQ); pidQ = port2pid(portQ); hijack(pidQ, agent.so); send(done)); the agent then sends a mark followed by the original message and jumps back; in process Q the injected agent intercepts recv(msg), and if the mark is present it starts propagating before returning control]
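The port2pid step in the figure can be approximated on Linux with a socket-table lookup. A minimal sketch, assuming the psutil package and sufficient privileges; the actual library injection (hijack) is omitted:

    import psutil

    def port_to_pid(port):
        # Map a local TCP port to the pid of the process that owns the socket,
        # as spDaemon must do before it can inject agent.so into that process.
        for conn in psutil.net_connections(kind="tcp"):
            if conn.laddr and conn.laddr.port == port and conn.pid is not None:
                return conn.pid
        return None

    # Usage: pid = port_to_pid(8080); the daemon would then hijack that pid
    # (e.g., a ptrace-based injection of agent.so), which this sketch omits.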

Distributed Self-Propelled Instrumentation -9- PDG for a Simple Socket Program
PDG: Parallel Dynamic Program Dependence Graph
– Nodes: observed events
– Intra-process edges: link consecutive events
– Cross-process edges: link sends with matching recvs
PDGs from real systems are more complex
[Figure: client events (start, connect, send, recv, ..., stop) and server events (accept, recv, send, close) linked by intra-process edges, with cross-process edges from each send to its matching recv]
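A minimal sketch of PDG construction in Python, assuming an illustrative trace-record format (proc, seq, kind, chan) — process id, per-process sequence number, event kind, and a channel/message id that pairs a send with its matching recv — rather than the tool's actual trace format:

    from collections import defaultdict

    def build_pdg(events):
        # events: iterable of (proc, seq, kind, chan) tuples; one send per chan
        # is assumed in this sketch.
        nodes = list(events)
        edges = []

        # Intra-process edges: link consecutive events of the same process.
        per_proc = defaultdict(list)
        for ev in nodes:
            per_proc[ev[0]].append(ev)
        for evs in per_proc.values():
            evs.sort(key=lambda ev: ev[1])
            edges += list(zip(evs, evs[1:]))

        # Cross-process edges: link each send with the matching recv.
        sends = {ev[3]: ev for ev in nodes if ev[2] == "send"}
        for ev in nodes:
            if ev[2] == "recv" and ev[3] in sends:
                edges.append((sends[ev[3]], ev))
        return nodes, edges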

Distributed Self-Propelled Instrumentation -10- PDG for One Condor Job

Distributed Self-Propelled Instrumentation -11- Automated Diagnosis
Challenge for manual examination
– High volume of trace data
Automated approach: find anomalies
– Normal behavior is often repetitive
– Pathological problems are often easy to find
– Focus on anomalies: infrequent bad behavior

Distributed Self-Propelled Instrumentation -12- Overview of the Approach
Obtain a collection of control flows
– E.g., per-request traces in a Web server
Anomaly detection: find an unusual flow
– Summarize each flow as a profile
– Assign suspect scores to profiles
Root cause analysis: find why a profile is anomalous
– Function responsible for the anomaly
[Figure: flows Φ1–Φ4 summarized as profiles; the anomalous flow is singled out and traced back to its cause]

Distributed Self-Propelled Instrumentation -13- Anomaly Detection: Distributed Profiles
Each flow h is summarized as a profile vector:
– Time profile: p_t(h) = (t_1, ..., t_n), where t_i = normalized time spent in function f_i
– Communication profile: p_s(h) = (s_1, ..., s_n), where s_i = normalized number of bytes sent by f_i
– Composite profile: p_c(h) = concatenation of p_t and p_s
– Coverage profile: p_v(h) = (v_1, ..., v_n), where v_i = 1 if function f_i was called and v_i = 0 otherwise
[Figure: an example flow spanning two processes — main entry, foo entry, send, foo exit, main exit in one process and recv, bar, main exit in the other — with interval durations (1s, 2s, 3s, 1s) accumulated into t_main, t_foo, t_bar]
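A minimal sketch of profile construction, assuming the per-flow time and byte counts have already been extracted from the trace into dictionaries keyed by function name (an illustrative input format, not the tool's):

    def make_profiles(functions, times, bytes_sent):
        # functions fixes the vector order; times / bytes_sent map
        # function name -> seconds spent / bytes sent in this flow.
        def normalize(values):
            total = sum(values)
            return [v / total if total else 0.0 for v in values]

        p_t = normalize([times.get(f, 0.0) for f in functions])       # time profile
        p_s = normalize([bytes_sent.get(f, 0) for f in functions])    # communication profile
        p_c = p_t + p_s                                               # composite: concatenation
        # Coverage: in this sketch a function counts as called if it has time > 0.
        p_v = [1 if times.get(f, 0.0) > 0 else 0 for f in functions]
        return p_t, p_s, p_c, p_v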

Distributed Self-Propelled Instrumentation -14- Anomaly Detection: Suspect Scores
Suspect score σ(g) = distance from flow g to a common or known-normal node
Can detect multiple anomalies
Does not require known examples of prior runs
– Unsupervised algorithm
Can use such examples for fewer false positives
– One-class ranking algorithm
[Figure: flows g, h, and g_k in profile space; σ(g) is the distance σ(g, g_k) to the nearest common or known-normal flow g_k]
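A minimal sketch of the scoring step, assuming Euclidean distance between profile vectors and using the nearest known-normal profile (or, lacking examples, the nearest other flow) as the reference point; this is an illustrative simplification of the unsupervised and one-class ranking algorithms named above:

    import math

    def distance(p, q):
        # Euclidean distance between two equal-length profile vectors.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def suspect_scores(profiles, known_normal=()):
        # profiles: {flow_id: profile vector}; known_normal: profiles from prior
        # known-good runs (may be empty, which gives the unsupervised case).
        scores = {}
        for fid, p in profiles.items():
            refs = list(known_normal) or [q for g, q in profiles.items() if g != fid]
            scores[fid] = min((distance(p, q) for q in refs), default=0.0)
        return scores   # the flow with the largest score is the prime suspect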

Distributed Self-Propelled Instrumentation -15- Finding the Cause: Coverage Analysis
Find call paths taken only in the anomalous flow
– Δ = Anom − Norm = {main → A, main → A → B, main → A → C, main → D → E, main → D → C}
These paths are correlated with the failure
– Likely location of the problem
[Figure: call trees of the anomalous flow (Anom) and a normal flow (Norm); Δ is their difference]

Distributed Self-Propelled Instrumentation -16- Finding the Cause: Coverage Analysis
Limitation of coverage analysis: too many reports
– Noise in the trace, different input, workload
Can eliminate effects of earlier differences
– Retain the shortest prefixes in Δ
– Merge leaves
Can rank paths by time of occurrence or by length
– Puts the cause ahead of the symptoms or simplifies manual examination
[Figure: Δ = Anom − Norm reduced to Δ' — after prefix filtering and leaf merging only main → A and main → D → {E, C} remain]
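A minimal sketch of the coverage difference and shortest-prefix filtering, with call paths represented as tuples of function names (an illustrative representation; leaf merging and ranking are not shown):

    def coverage_difference(anom_paths, norm_paths):
        # Δ = call paths seen only in the anomalous flow.
        return set(anom_paths) - set(norm_paths)

    def shortest_prefixes(delta):
        # Keep only the shortest prefixes so one root difference is reported once.
        kept = []
        for path in sorted(delta, key=len):
            if not any(path[:len(p)] == p for p in kept):
                kept.append(path)
        return kept

    # Example with the paths from slide 15:
    # delta = coverage_difference(
    #     {("main","A"), ("main","A","B"), ("main","A","C"),
    #      ("main","D","E"), ("main","D","C")},
    #     set())
    # shortest_prefixes(delta) -> ("main","A") plus the two main → D → ... paths
    # (order of equal-length paths may vary); merging leaves into main → D → {E, C}
    # is not shown.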

Distributed Self-Propelled Instrumentation -17- PDG for One Condor Job

Distributed Self-Propelled Instrumentation -18- PDG for Two Condor Jobs

Distributed Self-Propelled Instrumentation -19- Separating Concurrent Flows
Concurrency produces interleaved traces
– Servers switch from one request to another
Analyzing interleaved traces is difficult
– Irrelevant details from other users
– High trace variability → everything is an anomaly
Solution: separate traces into flows

Distributed Self-Propelled Instrumentation -20- Flow-Separation Algorithm
Decide when two events are in the same flow
– (send → recv) and (local → non-recv)
Remove all other edges
Flow = events reachable from a start event
[Figure: two browsers each click, connect, send a URL, and receive and show a page; the Web server's interleaved accept/recv/send events are split into one flow per browser]
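A minimal sketch of the flow-separation rule, reusing the illustrative (proc, seq, kind, chan) event format from the PDG sketch above: keep (send → recv) cross-process edges and intra-process edges whose target is not a recv, then collect everything reachable from each start event.

    from collections import defaultdict, deque

    def separate_flows(edges, start_events):
        # edges: (src_event, dst_event) pairs, intra- and cross-process.
        keep = []
        for src, dst in edges:
            if src[0] != dst[0]:                            # cross-process edge
                if src[2] == "send" and dst[2] == "recv":   # send -> recv stays
                    keep.append((src, dst))
            elif dst[2] != "recv":                          # local -> non-recv stays
                keep.append((src, dst))

        succ = defaultdict(list)
        for src, dst in keep:
            succ[src].append(dst)

        flows = {}
        for start in start_events:                 # one flow per start event
            seen, queue = {start}, deque([start])
            while queue:
                ev = queue.popleft()
                for nxt in succ[ev]:
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
            flows[start] = seen
        return flows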

Distributed Self-Propelled Instrumentation -21- Limitation
Rules are violated for programs with queues
– enQ1 and deQ1 must belong to the same flow
– They are assigned to different flows by our application-independent algorithm
[Figure: two clients enqueue requests (enQ1, enQ2) at the server, which later dequeues them (deQ1, deQ2), so each dequeue is separated from its enqueue]

Distributed Self-Propelled Instrumentation -22- Addressing the Limitation: Directives
Pair events using custom directives
– Evt: location in the code
– Joinattr: related events have equal attr values
[Figure: the same queue example — a directive joins each enQ event with the deQ event carrying the matching attribute value, keeping them in one flow]
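The slide does not give the directive syntax; the snippet below is a purely hypothetical illustration of the idea, expressed as Python data rather than the tool's actual directive format. The code locations and attribute name are invented: two events are declared by where they occur, and a join attribute ties an enqueue to the dequeue that carries the same request id.

    # Hypothetical directive, for illustration only (not the tool's syntax):
    # pair the enqueue at server.c:enqueue_request with the dequeue at
    # server.c:dispatch_request whenever both carry the same request_id.
    queue_directive = {
        "evt1": "server.c:enqueue_request",   # hypothetical code location
        "evt2": "server.c:dispatch_request",  # hypothetical code location
        "joinattr": "request_id",             # events with equal request_id join flows
    }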

Distributed Self-Propelled Instrumentation -23- Experimental Study: Condor
[Figure: Condor job flow — the job description is handed to condor_submit and schedd on the submitting host, matched via the negotiator and collector on the central manager, and run by startd, starter, and the user job on the execution host; the shadow returns the job output]

Distributed Self-Propelled Instrumentation -24- Job-run-twice Problem
Fault handling in Condor
– Any component can fail
– Detect the failure
– Restart the component
Bug in the shadow daemon
– Symptom: user job ran twice
– Cause: intermittent crash after shadow reported successful job completion

Distributed Self-Propelled Instrumentation -25- Debugging Approach
Insert an intermittent fault into shadow
Submit a cluster of several jobs
– Start tracing condor_submit
– Propagate into schedd, shadow, collector, negotiator, startd, starter, mail, and the user job
Separate the trace into flows
– Processing each job is a separate flow
Identify the anomalous flow
– Use unsupervised and one-class algorithms
Find the cause of the anomaly

Distributed Self-Propelled Instrumentation -26- Finding the Anomalous Flow
Suspect scores for composite profiles
Without prior knowledge, Flows 1 and 5 look unusual
– Infrequent but normal activities
– Use prior known-normal traces to filter them out
Flow 3 is a true anomaly

Distributed Self-Propelled Instrumentation -27- Finding the Cause
Computed the coverage difference
– 900+ call paths
Filtered the differences
– 37 call paths left
Ranked the differences
– 14th path by time / 1st by length, as called by schedd: main → DaemonCore::Driver → DaemonCore::HandleDC_SERVICEWAITPIDS → DaemonCore::HandleProcessExit → Scheduler::child_exit → DaemonCore::GetExceptionString
– Called when shadow terminates with a signal
Last function called by shadow = failure location

Distributed Self-Propelled Instrumentation -28- Conclusion
Self-propelled instrumentation
– On-demand, low-overhead control-flow tracing
– Across process and host boundaries
Automated root cause analysis
– Finds anomalous control flows
– Finds the causes of anomalies
Separation of concurrent flows
– Little application-specific knowledge required

Distributed Self-Propelled Instrumentation -29- Related Publications
A.V. Mirgorodskiy and B.P. Miller, "Diagnosing Distributed Systems with Self-Propelled Instrumentation", under submission.
– ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy07DistDiagnosis.pdf
A.V. Mirgorodskiy, N. Maruyama, and B.P. Miller, "Problem Diagnosis in Large-Scale Computing Environments", SC'06, Tampa, FL, November 2006.
– ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy06ProblemDiagnosis.pdf
A.V. Mirgorodskiy and B.P. Miller, "Autonomous Analysis of Interactive Systems with Self-Propelled Instrumentation", 12th Multimedia Computing and Networking (MMCN 2005), San Jose, CA, January 2005.
– ftp://ftp.cs.wisc.edu/paradyn/papers/Mirgorodskiy04SelfProp.pdf