Scalable Analysis of Distributed Workflow Traces Daniel K. Gunter and Brian Tierney Distributed Systems Department Lawrence Berkeley National Laboratory.

Presentation transcript:

Outline: Motivation (why do we care?). Related work (what have others done?). NetLogger’s objective (what would we like to do?). Background (what is NetLogger?). How does NetLogger address the problems? What are the results and costs of the solution?

Motivation: Large-scale applications are widely used in science and business (astronomy, biology, weather models, etc.). Large-scale applications are complex and difficult to debug and optimize: they involve a large number of concurrent operations and distributed resources, and their bottlenecks are hard to find.

Related Work: Applications can be “tightly coupled”, “loosely coupled”, or “uncoupled”. Tools have mostly focused on tightly coupled applications, profiling and tracing code segments (TAU, Paraver, FPMPI, Intel Trace Collector). Tools extended to loosely coupled applications: SvPablo – automatic code instrumentation, with statistics collected for sections of source code. Prophesy – automatic code instrumentation and a database of performance information, with tunable granularity. Paradyn – dynamic insertion of instrumentation at runtime, designed for message-passing and pthreads programs.

End Objective: Focus on loosely coupled and uncoupled applications. We would like a tool that can combine performance information from multiple resources and application components and expose their interactions.

NetLogger Background: Log Generation – calls to logger libraries are added to the source code at critical points to create event logs. Log Management – the various logs are collected and merged based on event timestamps. Visualization and Analysis – events, system statistics, and “lifelines” are displayed.
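
The log-management step, merging per-host logs by event timestamp, can be sketched as a k-way merge. This is only an illustration under stated assumptions, not NetLogger's implementation: the (timestamp, host, event name) tuple layout and the event names are hypothetical, and each input log is assumed to be sorted by timestamp already.

```python
import heapq


def merge_logs(*logs):
    """Merge per-host event logs into one timeline ordered by timestamp.

    Each log is a list of hypothetical (timestamp, host, event_name)
    tuples that is already sorted by timestamp.
    """
    # heapq.merge performs a lazy k-way merge of already-sorted inputs.
    return list(heapq.merge(*logs, key=lambda event: event[0]))


# Two small example logs from different hosts (made-up data).
host_a = [(1.0, "a", "job.start"), (4.0, "a", "job.end")]
host_b = [(2.5, "b", "io.start"), (3.0, "b", "io.end")]

merged = merge_logs(host_a, host_b)
for timestamp, host, name in merged:
    print(timestamp, host, name)
```

Merging lazily from sorted inputs avoids loading and re-sorting every log at once, which matters once the logs come from many machines.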

Extensions to NetLogger: Scaling NetLogger to large-scale systems (hundreds of machines). Collecting distributed log files. Evaluating large log data sets. Adding workflow identifiers.

Log Collection and Management: Netlogd – a collection daemon that accepts logs across the network (UDP or TCP). Nlforward – for finer-grained instrumentation, events can be written to local disk and forwarded in batches. Nldemux – a server-side tool that scans incoming logs, splits events into separate files, and allows for log file rollovers.

Sifting Through the Data: The huge amount of log data from just 5 nodes obscures important events.

Anomalous Workflow Detection Tool: Define a linear sequence of events in a configuration file. Mark any workflow lifeline that is missing these events. Problems: We would like some context for normal behavior (solved by an option to include neighbors of anomalous lifelines). There are too many events to keep them all in memory for scanning.
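
The basic check described above, marking lifelines that are missing events from a configured sequence, might look like the following sketch. The event names, the (timestamp, workflow id, event name) tuple layout, and the in-order matching rule are all assumptions made for illustration; the actual tool's configuration format is not shown in the slides.

```python
from collections import defaultdict

# Hypothetical required sequence; in the real tool this comes from a config file.
REQUIRED = ["task.start", "task.process", "task.end"]


def contains_in_order(names, required):
    """True if every required event appears in names, in the given order."""
    remaining = iter(names)
    # 'event in remaining' advances the iterator past the match,
    # so later required events must occur after earlier ones.
    return all(event in remaining for event in required)


def find_anomalous(events, required=REQUIRED):
    """Return the sorted ids of workflow lifelines missing required events."""
    lifelines = defaultdict(list)
    for _timestamp, workflow_id, name in sorted(events):
        lifelines[workflow_id].append(name)
    return sorted(
        workflow_id
        for workflow_id, names in lifelines.items()
        if not contains_in_order(names, required)
    )


# Made-up events: workflow "wf2" never reaches "task.end".
events = [
    (1.0, "wf1", "task.start"),
    (1.5, "wf2", "task.start"),
    (2.0, "wf1", "task.process"),
    (2.5, "wf2", "task.process"),
    (3.0, "wf1", "task.end"),
]
print(find_anomalous(events))
```

Note that this sketch holds every event in memory, which is exactly the limitation listed under "Problems"; a scalable version would need to expire lifelines as the scan proceeds.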

Solutions: Solution 1: Create a histogram with 100 bins for normal workflow execution times. Time out after the 99th percentile. Runs in a fixed memory footprint. Supports additional parameters (min time, max time, etc.). Solution 2: Calculate a running mean and standard deviation of workflow runtimes. Assumes a statistically normal distribution of times.
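
Both solutions can be sketched in a few lines. The class names, the clamping of out-of-range runtimes, and the use of Welford's update for the running statistics are choices made for this example and are not taken from NetLogger itself.

```python
import math


class RuntimeHistogram:
    """Fixed-memory histogram of workflow runtimes (hypothetical helper)."""

    def __init__(self, min_time, max_time, bins=100):
        self.min_time, self.max_time, self.bins = min_time, max_time, bins
        self.counts = [0] * bins
        self.total = 0

    def add(self, runtime):
        # Runtimes outside [min_time, max_time] are clamped to the edge bins.
        frac = (runtime - self.min_time) / (self.max_time - self.min_time)
        index = min(self.bins - 1, max(0, int(frac * self.bins)))
        self.counts[index] += 1
        self.total += 1

    def percentile(self, p):
        """Upper edge of the bin where the cumulative count reaches p percent."""
        target = self.total * p / 100.0
        width = (self.max_time - self.min_time) / self.bins
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return self.min_time + (i + 1) * width
        return self.max_time


class RunningStats:
    """Running mean and sample standard deviation (Welford's method)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


# Solution 1: a timeout taken from the 99th percentile of observed runtimes.
hist = RuntimeHistogram(min_time=0.0, max_time=100.0)
for t in range(100):
    hist.add(t + 0.5)
timeout = hist.percentile(99)

# Solution 2: running statistics over a few made-up runtimes.
stats = RunningStats()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(x)
```

Both structures use constant memory regardless of how many workflows are observed. The histogram makes no assumption about the shape of the runtime distribution, whereas the running statistics are only meaningful as a timeout rule if runtimes are roughly normal.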

NetLogger Workflow-logging Architecture

New Log Visualization: The 3 incomplete events from the previous picture are shown in blue, with context events shown in red. This made it possible to detect several errors in the SNFactory workflow application.

Key Differences in NetLogger: Uses “lifelines” to trace sequences of actions. Detects workflow anomalies. Facilitates log collection from multiple locations. Requires manual instrumentation of source code, so you must have the source code and understand it.

The End. Questions? Comments?