CEDPS: Center for Enabling Distributed Petascale Science Brian Tierney Lawrence Berkeley National Laboratory


1 CEDPS: Center for Enabling Distributed Petascale Science Brian Tierney Lawrence Berkeley National Laboratory http://www.cedps-scidac.org/

2 CEDPS in a Nutshell
- Center for Enabling Distributed Petascale Science (CEDPS), pronounced “seds” (silent P)
- DOE SciDAC Center for Enabling Technology
- July 1, 2006 to June 30, 2011, $2.4M/yr
- Collaboration between 5 sites:
  - Argonne National Laboratory
  - Fermi National Accelerator Laboratory
  - Lawrence Berkeley National Laboratory
  - USC Information Sciences Institute
  - University of Wisconsin-Madison
- Three focus areas:
  - Moving data to compute resources
  - Moving compute services to data sites
  - Troubleshooting and diagnosis tools

3 The Petascale Data Challenge
- DOE facilities generate many petabytes of data (2 petabytes = all U.S. academic research libraries!)
- Remote, distributed users (at labs, universities, industry) need data!
- Rapid, reliable access is key to maximizing the value of $B facilities
[Figure: massive data at DOE facilities, surrounded by remote distributed users (U)]

4 Bridging the Divide (1): Move Data to Users When & Where Needed
- “Deliver this 100 Terabytes to locations A, B, C by 9am tomorrow”
- Fast: >10,000x faster than usual Internet
- Reliable: recover from many failures
- Predictable: data arrives when scheduled
- Secure: protect expensive resources & data
- Scalable: deal with many users & much data

5 Bridging the Divide (2): Allow Users to Move Computation Near Data
- “Perform my computation F on datasets X, Y, Z”
- Science services: provide analysis functions near data source
- Flexible: easy integration of functions
- Secure: protect expensive resources & data
- Scalable: deal with many users & much data

6 Bridging the Divide (3): Troubleshoot End-to-End Problems
- “Why did my data transfer (or remote operation) fail?”
- Identify & diagnose failures & performance problems
- Instrument: include monitoring points in all system components
- Monitor: collect data in response to problems
- Diagnose: identify the source of problems

7 CEDPS Troubleshooting Work

8 The Troubleshooting Problem
- Large production Grids (OSG, TeraGrid, etc.) report a high failure rate
  - 10-30% of jobs submitted to the Grid fail, mostly due to authentication errors and disk space problems
  - Users don’t always notice, as jobs may be automatically resubmitted and may succeed the next time
- Troubleshooting in this environment is very difficult
- Current approach
  - Log into all hosts used (if possible)
  - ‘grep’ various log files looking for problems
  - Inconsistent logging levels
  - Multiple file formats
  - Often a tedious and time-consuming process

9 Our Approach
- Discover broken services before users do
  - Deploy monitoring software that can send alerts on errors
  - Run test jobs to detect problems
- When log analysis is needed
  - Use a unified approach to logging for applications and middleware
  - Collect logs centrally
  - Develop automatic analysis tools for anomaly detection

10 Logging Solution: “NetLogger” Methodology
- NetLogger (short for “Networked Application Logger”)
  - Methodology for troubleshooting distributed applications
  - Troubleshooting = detection and analysis of faults and performance issues
- Key components:
  - Common log format: name=value pairs
    http://www.cedps.net/wiki/index.php/LoggingBestPractices
  - Precise ISO-format timestamps and synchronized clocks (via NTP)
  - Wrap all “interesting” actions with ‘start’ and ‘end’ log events
    - interesting = anything that might fail or run slowly
  - Use unique event names, e.g.: event=org.globus.gridFTP.authn.start
  - Include an “event ID” in each log message
    - allows correlation of sets of events
  - Collect logs in a centralized database
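The key components above can be sketched in a few lines of Python. This is only an illustration of the methodology, not the NetLogger client API: the names `format_event` and `logged_action`, the event namespace `org.example`, and the convention that a non-zero `status` marks a failure are all assumptions made for this example.

```python
import sys
import uuid
from contextlib import contextmanager
from datetime import datetime, timezone


def format_event(event, guid, **fields):
    """Return one log line of name=value pairs with an ISO-format UTC timestamp."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    pairs = [("ts", ts), ("event", event), ("guid", guid), *fields.items()]
    # Quote values that contain whitespace so the line still splits cleanly.
    return " ".join(
        f'{k}="{v}"' if " " in str(v) else f"{k}={v}" for k, v in pairs
    )


@contextmanager
def logged_action(name, guid=None, out=sys.stdout, **fields):
    """Wrap an "interesting" action with <name>.start and <name>.end events."""
    guid = guid or str(uuid.uuid4()).upper()
    print(format_event(f"{name}.start", guid, **fields), file=out)
    status = 0
    try:
        yield guid
    except Exception:
        status = -1  # assumed convention: non-zero status marks a failure
        raise
    finally:
        # The end event is written even when the wrapped action fails.
        print(format_event(f"{name}.end", guid, status=status), file=out)


if __name__ == "__main__":
    # Hypothetical event name; a real deployment would use its own namespace.
    with logged_action("org.example.copy.read", file="/tmp/myfile"):
        pass  # the action that might fail or run slowly goes here
```

Because the end event is emitted in a `finally` clause, a failed action still leaves a matching start/end pair in the log, which is what later analysis relies on.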

11 Example Log: GridFTP
ts=2006-12-08T18:39:23.114369Z event=org.globus.gridFTP.start prog=GridFTP-4.0.3 localhost=myhost remoteHost=somehost.gov:56010 serverMode=inetd guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
ts=2006-12-08T18:39:23.114567Z event=org.globus.gridFTP.authn.start DN="/DC=org/DC=doegrids/OU=People/CN=Somebody" guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
ts=2006-12-08T18:39:25.514369Z event=org.globus.gridFTP.authn.end DN="/DC=org/DC=doegrids/OU=People/CN=Somebody" msg="123456 successfully authorized" localUser=uscmspool381 guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=0
ts=2006-12-08T18:39:25.864369Z event=org.globus.gridFTP.transfer.start file=/tmp/myfile tcpBufferSize=128KB dataBlockSize=262144 numStreams=1 numStripes=1 destHost=129.79.4.64 guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3
ts=2006-12-08T18:45:02.214369Z event=org.globus.gridFTP.transfer.end file=/tmp/myfile bytesTransferred=678433 guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=0
ts=2006-12-08T18:45:02.214386Z event=org.globus.gridFTP.end guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=226
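A log in this format is easy to analyze mechanically. The sketch below, which is not part of the NetLogger toolkit, parses name=value lines and measures the time between matching start and end events. The helper names (`parse_line`, `parse_ts`, `durations`) are made up for this example, the sample lines are shortened copies of the events on this slide, and using `shlex` to handle the quoted values is an assumption about how quoting should be interpreted.

```python
import shlex
from datetime import datetime

# Shortened copies of three events from the GridFTP example log.
SAMPLE = [
    'ts=2006-12-08T18:39:25.514369Z event=org.globus.gridFTP.authn.end '
    'msg="123456 successfully authorized" '
    'guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=0',
    'ts=2006-12-08T18:39:25.864369Z event=org.globus.gridFTP.transfer.start '
    'file=/tmp/myfile guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3',
    'ts=2006-12-08T18:45:02.214369Z event=org.globus.gridFTP.transfer.end '
    'file=/tmp/myfile bytesTransferred=678433 '
    'guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 status=0',
]


def parse_line(line):
    """Split one log line into a dict; shlex keeps quoted values together."""
    record = {}
    for token in shlex.split(line):
        name, _, value = token.partition("=")  # split on the first '=' only
        record[name] = value
    return record


def parse_ts(ts):
    """Parse the ISO-format UTC timestamp used in the log."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def durations(lines):
    """Seconds between matching .start/.end events, keyed by (action, guid)."""
    started, result = {}, {}
    for line in lines:
        rec = parse_line(line)
        action, _, phase = rec["event"].rpartition(".")
        key = (action, rec.get("guid"))
        if phase == "start":
            started[key] = parse_ts(rec["ts"])
        elif phase == "end" and key in started:
            elapsed = parse_ts(rec["ts"]) - started.pop(key)
            result[key] = elapsed.total_seconds()
    return result


if __name__ == "__main__":
    for (action, guid), seconds in durations(SAMPLE).items():
        print(f"{action} guid={guid} took {seconds:.2f} s")
```

End events with no recorded start (the `authn.end` line in the sample) are simply skipped, so partial logs do not break the analysis.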

12 Log Collection
- No need to invent something new for this
- The syslog-ng tool fills all requirements
  - Open source, runs on all major OSes
  - Fault tolerant, secure (via stunnel), scalable, easy to configure, etc.
  - Large user base
  - Can filter logs based on level and content
  - Arbitrary number of sources and destinations
  - Can act as a proxy, tunnel through firewalls
  - Execute programs: send email, load database, etc.
  - Built-in log rotation
  - Timezone support and fully qualified host names
- http://www.balabit.com/products/syslog-ng/

13 Log collection using syslog-ng

14 Sample Site Deployment For Grid Middleware

15 Troubleshooting Architecture: Site Archive

16 Log Summarization Library: New Component of the NetLogger Toolkit http://dsd.lbl.gov/NetLogger/

17 NetLogger Toolkit
- Client library that makes it easy to generate logs
  - C, Java, Perl, Python
- First released in 1994
- Summarizer: new in NetLogger 4.0 release, Nov. 2007
- Open source, available at http://dsd.lbl.gov/NetLogger/
- Visualization support
  - Scripts to convert logs to R, gnuplot, Excel format
  - Collection of sample R and gnuplot scripts
- Database tools
  - Tools to load logs into MySQL

18 New NetLogger Feature: Time-based summarization
Example: you want detailed I/O instrumentation of a simple loop that reads data, processes it, then writes out some result.
DO for i=1, N
    log("read.start", ...)
    read_data()
    log("read.end", ...)
    process_data()
    log("write.start", ...)
    write_data()
    log("write.end", ...)
DONE

19 Time-based summarization
- Full logging can produce many gigabytes of logs!
- Summarized logs report only a periodic summary (mean, sd) of times and values between “start” and “end” events (of a given type and ID).
- Orders of magnitude data reduction with, for many purposes, no loss in explanatory power.
- Since the original instrumentation is still there, you can turn the full detail on and off in a running program.
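The idea of time-based summarization can be illustrated with a small Python sketch. It does not reproduce the NetLogger summarization library, whose interface is not shown in these slides: the class name `Summarizer`, its methods, the use of plain numeric timestamps, and the choice of population standard deviation are all assumptions made for illustration.

```python
import statistics
from collections import defaultdict


class Summarizer:
    """Replace per-event logging with one (n, mean, sd) record per interval."""

    def __init__(self, interval):
        self.interval = interval              # summary period, in seconds
        self._open = {}                       # (action, guid) -> start time
        self._durations = defaultdict(list)   # action -> elapsed times
        self._window_start = None
        self.summaries = []                   # the records that get "logged"

    def log(self, ts, event, guid):
        """Accept a start/end event; emit summaries once per interval."""
        if self._window_start is None:
            self._window_start = ts
        action, _, phase = event.rpartition(".")
        if phase == "start":
            self._open[(action, guid)] = ts
        elif phase == "end":
            start = self._open.pop((action, guid), None)
            if start is not None:
                self._durations[action].append(ts - start)
        if ts - self._window_start >= self.interval:
            self.flush(ts)

    def flush(self, ts):
        """Write one summary record per action and start a new window."""
        for action, values in sorted(self._durations.items()):
            self.summaries.append({
                "event": action,
                "n": len(values),
                "mean": statistics.mean(values),
                "sd": statistics.pstdev(values),  # population sd (assumed)
            })
        self._durations.clear()
        self._window_start = ts


if __name__ == "__main__":
    s = Summarizer(interval=10.0)
    t = 0.0
    for _ in range(3):  # the read/process/write loop from the previous slide
        s.log(t, "read.start", "g1")
        s.log(t + 1, "read.end", "g1")
        s.log(t + 1, "write.start", "g1")
        s.log(t + 3, "write.end", "g1")
        t += 4
    for record in s.summaries:
        print(record)
```

In this toy run twelve raw events collapse into two summary records. The instrumentation calls in the loop are unchanged, which is what makes it possible to switch between full and summarized logging.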

20 Sample Use: GridFTP Bottleneck Detector
- GridFTP today
  - Only logs a single throughput number
  - Throughput might be limited by:
    - Sending/receiving disk (NFS/AFS mounted partition?)
    - Network
  - Currently no way to tell
- New GridFTP enhancement
  - Use log summarization library to monitor all I/O streams
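Once every I/O stream reports its own summarized throughput, spotting the limiting stage is a simple comparison. The following Python sketch shows only that last step under stated assumptions: the function name `find_bottleneck`, the stage names, and the throughput figures are all hypothetical, and the slides do not describe how the GridFTP enhancement itself makes this decision.

```python
def find_bottleneck(throughput_mb_per_s):
    """Return (stage, rate) for the slowest stage of a transfer pipeline.

    In a pipeline of disk read -> network -> disk write, the end-to-end
    rate cannot exceed the slowest stage, so that stage is the bottleneck.
    """
    if not throughput_mb_per_s:
        raise ValueError("no throughput measurements given")
    stage = min(throughput_mb_per_s, key=throughput_mb_per_s.get)
    return stage, throughput_mb_per_s[stage]


if __name__ == "__main__":
    # Made-up per-stream averages in MB/s, for illustration only.
    measured = {"disk_read": 42.0, "network": 110.0, "disk_write": 95.0}
    stage, rate = find_bottleneck(measured)
    print(f"bottleneck = {stage} ({rate} MB/s)")
```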

21 Full vs Summary Results
Bottleneck = disk read

22 More Information
- General CEDPS information: http://www.cedps.net
- Log Summarizer: http://dsd.lbl.gov/NetLogger/
- New GridFTP with NetLogger Log Summarizer: http://www.cedps.net/wiki/index.php/Gridftp-netlogger
- Contact us if you need troubleshooting help!
  - email: BLTierney@lbl.gov, DKGunter@lbl.gov

23 Extra Slides

24 NetLogger Summarizer vs ‘R’ summarized logs

25 Logging “Best Practices” Recommendations
- Practices
  - All logs should contain a unique event name and an ISO-format timestamp
  - All system operations that might fail or experience performance variations should be wrapped with start and end events
  - All logs from a given execution thread should be tagged with a globally unique ID (GUID), such as a Universally Unique Identifier (UUID)
- Log format
  - Logs should be composed of lines of ASCII name=value pairs
  - Example:
    ts=2006-12-08T18:48:27.598448Z event=org.globus.gridFTP.transfer.start prog=GridFTP-v4.2 guid=1DDF1F3D-A677-4DBC-8C4E-6A8A3B252AE3 file=filename src.host=H1 src.port=P1 dst.host=H2 dst.port=P2
- http://www.cedps.net/wiki/index.php/LoggingBestPractice

26 Event Names
- Use a '.' as a separator and go from general to specific
  - Same as Java class names
- First part of name should be used as a unique namespace (e.g.: org.globus)
- Use start/end suffixes whenever possible
  - Helps immensely with troubleshooting
- Examples (GridFTP)
  org.globus.gridFTP.start
  org.globus.gridFTP.authn.start
  org.globus.gridFTP.authn.end
  org.globus.gridFTP.transfer.start
  org.globus.gridFTP.transfer.end
  org.globus.gridFTP.end
- Examples (MDS)
  org.globus.MDS.response.start
  org.globus.MDS.query.start
  org.globus.MDS.query.end
  org.globus.MDS.write.net.start
  org.globus.MDS.write.net.end
  org.globus.MDS.response.end
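One reason the start/end suffixes help with troubleshooting is that an action with a start event but no end event is a strong hint about where something went wrong. A minimal Python sketch of that check follows. The function names and the regular expression are assumptions for this example; the slide only asks for '.'-separated names with a namespace prefix, so the exact character set allowed here is a guess.

```python
import re

# Assumed rule: at least two dot-separated parts, each starting with a letter.
EVENT_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9_]*(\.[A-Za-z][A-Za-z0-9_]*)+$")


def is_valid_event_name(name):
    """Check that a name looks like namespace.general.specific."""
    return EVENT_NAME.match(name) is not None


def unfinished_actions(event_names):
    """Return actions that logged a .start event but never a matching .end."""
    open_actions = []
    for name in event_names:
        action, _, phase = name.rpartition(".")
        if phase == "start":
            open_actions.append(action)
        elif phase == "end" and action in open_actions:
            open_actions.remove(action)
    return open_actions


if __name__ == "__main__":
    # A transfer that authenticated but never finished (hypothetical log).
    seen = [
        "org.globus.gridFTP.start",
        "org.globus.gridFTP.authn.start",
        "org.globus.gridFTP.authn.end",
        "org.globus.gridFTP.transfer.start",
    ]
    print(unfinished_actions(seen))
```

For the hypothetical log above, the result lists the overall GridFTP session and the transfer as still open, which points at the transfer step as the place to look.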

27 Globally Unique IDs
- Use the ‘guid’ or ‘id’ reserved name to allow correlation of a set of events:
  event=org.globus.gridFTP.authn.start id=27023
  event=org.globus.gridFTP.authn.end id=27023
  event=org.globus.gridFTP.transfer.start id=27023
  event=org.globus.gridFTP.transfer.end id=27023
- The standard Unix/Windows program ‘uuidgen’ can be used to generate a globally unique ID
  - e.g.: A5A563CD-D80C-4E58-9ECD-79C6B611E122
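In a program, the same kind of identifier can be generated without calling out to ‘uuidgen’, and the ID then makes it straightforward to untangle interleaved events. The sketch below uses Python's standard `uuid` module. The function names `new_guid` and `correlate` are made up for this example, and the second ID (27024) in the usage section is invented to show two interleaved operations.

```python
import uuid
from collections import defaultdict


def new_guid():
    """Generate a random UUID in the upper-case form shown on the slide."""
    return str(uuid.uuid4()).upper()


def correlate(records):
    """Group event names by their 'guid' (or 'id') field, preserving order."""
    groups = defaultdict(list)
    for rec in records:
        key = rec.get("guid") or rec.get("id")
        groups[key].append(rec["event"])
    return dict(groups)


if __name__ == "__main__":
    # Two interleaved operations; id 27024 is a made-up second client.
    records = [
        {"event": "org.globus.gridFTP.authn.start", "id": "27023"},
        {"event": "org.globus.gridFTP.authn.start", "id": "27024"},
        {"event": "org.globus.gridFTP.authn.end", "id": "27023"},
        {"event": "org.globus.gridFTP.authn.end", "id": "27024"},
    ]
    for key, events in correlate(records).items():
        print(key, events)
    print(new_guid())
```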

28 GridFTP: network vs disk performance

29 Syslog-ng Deployment for OSG

