End-to-End Monitoring and Grid Troubleshooting with the NetLogger Toolkit
Brian L. Tierney (BLTierney@lbl.gov)
Distributed Systems Department, Lawrence Berkeley National Laboratory

The Problem
Assume a Grid job is submitted to a resource broker, uses a reliable file transfer service to copy several files, and then runs. This normally takes 15 minutes to complete. But two hours have passed and the job has not yet completed. What, if anything, is wrong?
- Is the job still running, or did one of the software components crash?
- Is the network particularly congested? Is a TCP stack broken?
- Is the CPU particularly loaded?
- Is there a disk problem?
- Was a software library containing a bug installed somewhere?

The Solution: End-to-End Monitoring
All components between the application endpoints must be monitored. This includes:
- software (e.g., applications, services, middleware, operating systems)
- end-host hardware (e.g., CPUs, disks, memory, network interfaces)
- networks (e.g., routers, switches, or end-to-end paths)

Monitoring Components
A complete end-to-end monitoring framework includes:
- Instrumentation tools: facilities for precision monitoring of all software (applications, middleware, and operating systems) and hardware (host and network) resources.
- Monitoring data publication: standard schemas, discovery and publication mechanisms, and access policies for monitoring event data.
- Sensor management: the amount of monitoring data produced can quickly become overwhelming, so a mechanism for activating sensors on demand is required.
- Data analysis tools: event analysis and visualization tools.
- Event archives: historical data used to establish a baseline against which to compare current performance and to predict future performance.

Uses for Monitoring Data
- Troubleshooting and fault detection: detect failures and recovery.
- Performance analysis and tuning.
- Better program design (e.g., will better pipelining of I/O and computation help?).
- Network-aware applications (TCP buffer size tuning, number of parallel streams, etc.).
- Debugging: complex, multithreaded, distributed programs are difficult to debug without the proper monitoring data.
- Guiding scheduling decisions:
  - Grid schedulers: find the best match of CPUs and data sets for a given job.
  - Grid replica selection: find the "best" copy of a data set to use.
- Auditing and intrusion detection.

NetLogger Toolkit

NetLogger Toolkit
We have developed the NetLogger Toolkit (short for Networked Application Logger), which includes:
- Tools that make it easy for distributed applications to log interesting events at every critical point.
- A NetLogger client library (C, C++, Java, Perl, Python). It is extremely lightweight: it can generate over 900,000 events per second on current systems (9,000 events per second with 1% application perturbation).
- Tools for host and network monitoring.
- Event visualization tools that allow one to correlate application events with host and network events.
- NetLogger event archive and retrieval tools.
NetLogger combines network, host, and application-level monitoring to provide a complete view of the entire system.

NetLogger Analysis: Key Concepts
NetLogger visualization tools are based on time-correlated and object-correlated events:
- Precision timestamps (microsecond resolution by default).
- To associate a group of related events into an object "lifeline", the application assigns the same "object ID" to each of those NetLogger events; the visualization tools then render them as a single lifeline. Sample object IDs: file name, block ID, frame ID, Grid Job ID, etc.

Sample NetLogger Instrumentation

    import netlogger  # NetLogger Python client library

    log = netlogger.open("x-netlog://log.lbl.gov", "w")
    done = 0
    while not done:
        log.write(0, "EVENT_START", "TEST.SIZE=%d", size)
        # perform the task to be monitored
        done = do_something(data, size)
        log.write(0, "EVENT_END")

Sample event:

    DATE=20000330112320.957943 HOST=gridhost.lbl.gov \
        PROG=gridApp LVL=1 NL.EVNT=WriteData SEND.SZ=49332
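To make the lifeline idea from the previous slide concrete, here is a minimal sketch of instrumenting a block-transfer loop so that every event for a given block carries the same object ID. The log.write() call follows the API shown above; the event names, the BLOCK.ID field, and the helper functions are illustrative assumptions, not part of the official schema.

    import netlogger  # assumes the NetLogger Python client library is installed

    def read_block(block_id):    # illustrative stand-in for real I/O
        return b"..."

    def process_block(data):     # illustrative stand-in for real work
        pass

    log = netlogger.open("x-netlog://log.lbl.gov", "w")
    for block_id in range(16):
        # All four events for this block share BLOCK.ID, so the
        # visualization tools can draw one lifeline per block.
        log.write(0, "BLOCK.READ.START", "BLOCK.ID=%d", block_id)
        data = read_block(block_id)
        log.write(0, "BLOCK.READ.END", "BLOCK.ID=%d", block_id)
        log.write(0, "BLOCK.PROC.START", "BLOCK.ID=%d", block_id)
        process_block(data)
        log.write(0, "BLOCK.PROC.END", "BLOCK.ID=%d", block_id)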

NetLogger Activation Service
We do not want all monitoring data collected all the time; that is potentially far too much data. The level of monitoring needs to be adjusted as needed for:
- debugging
- performance tuning
- error analysis
The NetLogger Activation Service addresses this issue: NetLogger-based sensors register with the activation service, which activates them on demand. It is a very useful debugging tool for MPI and PC-cluster-based jobs.

NetLogger Filter and Activation Service
[Diagram: incoming monitoring data (application, middleware, host) flows into the NetLogger Filter and Activation Service, which multiplexes and demultiplexes the monitoring streams and delivers output to consumers according to their subscriptions. Example subscriptions:
- Subscription A: send me all monitoring data for Grid Job #23.
- Subscription B: send all level-0 monitoring data to the archive at host a.lbl.gov.
- Subscription C: change the logging level of program ftpd to level 2, and send me the results.]

NetLogger Archive Architecture
- The architecture must be scalable and capable of handling large amounts of application event data.
- None of the components can cause the pipeline to "block" while processing the data, as this could cause the application itself to block.
- For example, an instrumented FTP server could send more than 6,000 events per second to the archive (500 KB/sec, or 1.8 GB/hr, of monitoring event data).
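The non-blocking requirement can be illustrated with a short sketch (hypothetical names, not the actual archive code): events are handed to a bounded in-memory queue and shipped by a background thread, so a slow archive causes monitoring data to be dropped rather than the instrumented application to stall.

    import queue
    import threading

    # Hypothetical non-blocking forwarder: the application thread
    # must never block on the monitoring pipeline.
    event_queue = queue.Queue(maxsize=10000)

    def forwarder(send):
        # Background thread drains the queue and ships events to the
        # archive; send() is a stand-in for the real network sink.
        while True:
            send(event_queue.get())

    def log_event(event):
        try:
            event_queue.put_nowait(event)   # never blocks the caller
        except queue.Full:
            pass                            # drop rather than block

    threading.Thread(target=forwarder, args=(print,), daemon=True).start()
    log_event("NL.EVNT=WriteData SEND.SZ=49332")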

NetLogger Tools
- nlforward: log file forwarder; forwards a single NetLogger file or a directory of files to an output URL.
- netlogd: TCP socket server daemon; accepts one or more NetLogger TCP streams and writes them to one or more NetLogger output URLs.

Grid Troubleshooting Example
- Step 1: Insert instrumentation code during the development stage to ensure the program is operating as expected.
- Step 2: Establish a performance baseline for this service and store it in the monitoring event archive. Include system information such as processor type and speed, OS version, CPU load, disk load, network load, etc.
- Step 3: Put the service into production; everything works fine. Until... one day, users start complaining that service X is taking much longer than it used to.

Grid Troubleshooting Example (continued)
To collect data for analysis, one must:
- Locate relevant monitoring data and subscribe to it.
- Activate any missing sensors and subscribe to their data.
- Activate debug-level instrumentation in the service and subscribe.
- Locate monitoring data in the event archive for the baseline test from when things were last working.
Data analysis can then begin:
- Check the hardware and OS information to see if anything changed.
- Look at the application instrumentation data to see if anything looks unusual.
- Look at the system monitoring data to see if anything looks unusual (e.g., unusually high CPU load).
- Correlate the application and middleware instrumentation data with the host and network monitoring data.
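As a sketch of the comparison against the baseline, the name=value event format shown on the instrumentation slide can be parsed and the current run compared with archived data. The file names, event names, and the 3x threshold below are assumptions for illustration, not part of the real archive interface.

    from datetime import datetime

    def parse_event(line):
        # Parse a "NAME=value NAME=value ..." NetLogger event line
        # (the format shown on the sample instrumentation slide).
        return dict(field.split("=", 1) for field in line.split())

    def timestamp(event):
        # DATE format from the sample event: YYYYMMDDHHMMSS.microseconds
        return datetime.strptime(event["DATE"], "%Y%m%d%H%M%S.%f")

    def mean_duration(events, start_name, end_name):
        # Average wall-clock time between start/end events, paired in order.
        starts = [timestamp(e) for e in events if e["NL.EVNT"] == start_name]
        ends = [timestamp(e) for e in events if e["NL.EVNT"] == end_name]
        secs = [(b - a).total_seconds() for a, b in zip(starts, ends)]
        return sum(secs) / len(secs)

    # Hypothetical log files retrieved from the monitoring event archive.
    baseline = [parse_event(l) for l in open("baseline.log")]
    current = [parse_event(l) for l in open("current.log")]

    if mean_duration(current, "EVENT_START", "EVENT_END") > \
            3 * mean_duration(baseline, "EVENT_START", "EVENT_END"):
        print("service X is much slower than its baseline")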

Grid Job ID
To graphically link events from several Grid components, all monitoring events for the same job must carry the same "Grid Job ID" (GID). We have instrumented the following pyGlobus components with NetLogger and a GID:
- globus-job-run
- globus-url-copy
- the Globus gatekeeper
- the Globus job manager
globus-job-run generates the GID using uuidgen, and the GID is passed to the gatekeeper via RSL. In OGSA-based Grids, it should be easy to standardize a mechanism for passing GIDs between Grid services.
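A minimal sketch of the GID idea (the GRID.JOBID field name and the helper below are illustrative, not the actual pyGlobus instrumentation): generate one UUID per job, as globus-job-run does with uuidgen, and stamp it on every event so events from different components can be joined.

    import uuid
    import netlogger  # as on the instrumentation slide

    log = netlogger.open("x-netlog://log.lbl.gov", "w")

    # One GID per Grid job, generated once at submission time
    # (globus-job-run uses uuidgen for the same purpose).
    gid = str(uuid.uuid4())

    def write_with_gid(level, event, fmt="", *args):
        # Append the shared GID to every event so that events from
        # different Grid components can be joined into one lifeline.
        full_fmt = (fmt + " " if fmt else "") + "GRID.JOBID=%s"
        log.write(level, event, full_fmt, *(args + (gid,)))

    # usage: write_with_gid(0, "EVENT_START", "TEST.SIZE=%d", size)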

Troubleshooting Example, Step 1: Generate a Grid Job "Lifeline"
[Lifeline graph: application events plotted against time, from GlobusJobRun.start through gateKeeper.start/end, akentiAuthorization.start/end, jobManager.start, the jobManager.jobState events (pending, active, done), gridJob.start/end, and the GlobusUrlCopy get/put events. The phases visible along the lifeline are connection setup and authentication, copying the input data, waiting in the PBS queue, the job running, copying the output data, and the data transfers themselves. One lifeline shows a successful job run; another shows a job error during gridJob.]

Step 2: Add detailed application instrumentation (1st example)
[Before: I/O followed by processing, where the next I/O starts only when processing ends. After: the next I/O is started while the previous block is still being processed. Overlapping I/O and processing in this way gave almost a 2:1 speedup.]

Step 2: Add detailed application instrumentation (2nd example)

Step 2: Add detailed application instrumentation (3rd example)
E.g., an MPI synchronization barrier in AMBER, a computational chemistry application that computes molecular mechanics and molecular dynamics of biomolecular systems. [x-axis: seconds]

Step 3: Add host monitoring (e.g., CPU load or TCP retransmits) [x-axis: seconds]

Step 3b: Add more TCP monitoring

Detailed TCP Analysis: Correlation of SACKs and OtherReductionsCM
[Plot: CWND drops shown alongside SACK events and OtherReductionsCM events, which are correlated in time.]

Conclusions
- The NetLogger Activation Service allows a Grid user or developer to easily "drill down" from high-level to low-level analysis.
- A Grid Job ID is essential for correlating events.

For More Information
- DMF: http://dsd.lbl.gov/NetLogger/
- All software components are available for download under the DOE/LBNL open source license (BSD-style).
- Email: BLTierney@LBL.GOV
Other useful URLs:
- PFLDnet 2004: http://www-didc.lbl.gov/PFLDnet2004/program.htm
- TCP tuning: http://www-didc.lbl.gov/TCP-tuning/TCP-tuning.html

Extra Slides

TCP flow visualization

NetLogger Trigger API
The Trigger API is used to activate monitoring from an external configuration file, which is created by the "activation node":

    NetLoggerSetTrigger(handle, char *filename, int sec)

This checks the configuration file every sec seconds for an updated log level. The trigger file specifies which events to log and where to send them, and can specify the log/debug level for a given program.
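NetLoggerSetTrigger is part of the C API; the following is a rough Python illustration of the same polling mechanism. The trigger-file format here (a line such as "LVL=2") is invented for the example, not the real one.

    import os
    import time

    class TriggerWatcher:
        # Re-read a trigger file every `interval` seconds and expose
        # the log level it specifies.
        def __init__(self, path, interval):
            self.path, self.interval = path, interval
            self.level = 0
            self.last_check = 0.0

        def current_level(self):
            now = time.time()
            if now - self.last_check >= self.interval:
                self.last_check = now
                if os.path.exists(self.path):
                    with open(self.path) as f:
                        for line in f:
                            if line.startswith("LVL="):
                                self.level = int(line.strip().split("=", 1)[1])
            return self.level

An application would then emit an event only when the event's level is at or below current_level(), so an external tool can raise or lower the logging level of a running program just by editing the file.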

NetLogger Filtering
NetLogger filters are used to provide efficient data reduction services. They operate on one item of monitoring data at a time. A filter expression is a list of (name, operator, value) tuples; this simple filter language allows for an efficient implementation. For example, a filter matching all "Start" or "End" monitoring events for program "Athena" at a logging level <= 2 would be:

    NL.EVNT="Start" and PROG="Athena" and LVL <= 2 or \
    NL.EVNT="End" and PROG="Athena" and LVL <= 2
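A toy evaluator shows why the (name, operator, value) form is cheap to apply per event. This is a sketch; the real filter language and its and/or grouping rules may differ.

    import operator

    OPS = {"=": operator.eq, "<=": operator.le, ">=": operator.ge,
           "<": operator.lt, ">": operator.gt}

    def match(event, clauses):
        # `clauses` is a list of and-ed (name, op, value) tuples.
        return all(name in event and OPS[op](event[name], value)
                   for name, op, value in clauses)

    def match_any(event, filter_expr):
        # A filter expression is a list of clause lists, or-ed together.
        return any(match(event, clauses) for clauses in filter_expr)

    # The sample filter from this slide:
    athena_filter = [
        [("NL.EVNT", "=", "Start"), ("PROG", "=", "Athena"), ("LVL", "<=", 2)],
        [("NL.EVNT", "=", "End"), ("PROG", "=", "Athena"), ("LVL", "<=", 2)],
    ]
    event = {"NL.EVNT": "Start", "PROG": "Athena", "LVL": 1}
    print(match_any(event, athena_filter))   # True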

Performance
- Filtering: 20K-140K events per second, depending on filter complexity.
- Activation Producer scalability: performance depends on the number of producers times the number of consumers, and on filter complexity.
  - E.g., 20 producers, complex filter, 10 consumers: 8,000 events/second.
  - E.g., 500 producers, simple filter, 2 consumers: 5,000 events/second (10 events per producer per second).
- Details are in the paper.
- Note: merging multiple filters is not yet implemented; this could improve performance considerably for certain combinations of filters.