
1 HawkEye: A Monitoring and Management Tool for Distributed Systems
Todd Tannenbaum
Department of Computer Sciences, University of Wisconsin-Madison
http://www.cs.wisc.edu/condor
condor-admin@cs.wisc.edu

2-5 What does Condor have?
› …lots of core technology for building a distributed system
› …lots of core technology for monitoring the status of a machine
› …lots of core technology for managing a workload of tasks
› …lots of really, truly skilled developers and researchers, experienced at building distributed systems. Some of the best. Standout state employees. Honest.
  - Email for Wisconsin Gov. Scott McCallum: wisgov@gov.state.wi.us

6-7 One day an avid Condor user asked:
Say, could Condor technology be used for distributed system administration?

8 Time to think…
› We gathered up our experiences with our own management tasks, looked at the mature Condor technology available to us, and the HawkEye effort was born.
› Completely separate from Condor from the end user's perspective.
  - Can install HawkEye, or Condor, or both

9 First Component: MONITORING
› Sysadmins first need information about what is happening on the machines they are responsible for.
  - Both current and past
  - Information must be consolidated and easily accessible
  - Information must be dynamic

10 Condor ClassAds
› Technology for an entity to describe itself
› Simple attribute-value pairs
[
  load_average = 1.3
  free_Swap_space_mb = 140
  number_of_processes = 92
  keyboard_idle_secs = 6
  ram = 128
  total_swap = 512
  total_memory = ram + total_swap
  busy = load_average > 1.0
]
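The ad above mixes plain values with expressions that refer to other attributes. As a rough illustration of that idea only (this is not the ClassAd library or its API), a Python dict can hold literals alongside small functions that are evaluated on demand. The `evaluate` helper and the `get` callback are hypothetical names invented for this sketch.

```python
# Hypothetical sketch: an ad modeled as a dict whose values are either
# literals or callables that compute a value from other attributes.
# This only mimics the idea on the slide; it is not the ClassAd library.

def evaluate(ad, name, _seen=None):
    """Resolve attribute `name`, evaluating expressions on demand."""
    seen = _seen or set()
    if name in seen:
        raise ValueError(f"circular reference via {name!r}")
    value = ad[name]
    if callable(value):
        # Expressions receive a lookup function for the other attributes.
        lookup = lambda other: evaluate(ad, other, seen | {name})
        return value(lookup)
    return value

machine_ad = {
    "load_average": 1.3,
    "free_Swap_space_mb": 140,
    "number_of_processes": 92,
    "keyboard_idle_secs": 6,
    "ram": 128,
    "total_swap": 512,
    # Expressions, mirroring the two on the slide.
    "total_memory": lambda get: get("ram") + get("total_swap"),
    "busy": lambda get: get("load_average") > 1.0,
}

total_memory = evaluate(machine_ad, "total_memory")
busy = evaluate(machine_ad, "busy")
```

Because expressions are resolved at lookup time, changing `ram` or `load_average` in the dict changes what `total_memory` and `busy` evaluate to on the next query.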

11 Condor ClassAds, cont.
› No fixed schema
› Attributes can contain values or expressions
› Ads can be serialized as XML
› Open source libraries in C++ and Java to:
  - Manipulate Ads and Ad attributes
  - Store Ads
  - Query collections of Ads
› Bindings for Perl and others on the way…

12 [Diagram labels: HawkEye Monitoring Agent; HawkEye Manager; ClassAd updates via secure UDP]

13 [Diagram labels: HawkEye Monitoring Agent; HawkEye Manager; HawkEye Monitoring Agent]

14 HawkEye Monitoring Agent
[Diagram labels: /proc, kstat…; Hawkeye_Startup_Agent; Hawkeye_Monitor; HawkEye Monitoring Agent; HawkEye Manager; ClassAd updates via secure UDP]

15 Monitor Agent, cont.
› Updates are sent periodically
  - Information does not get stale
› Updates also serve as a heartbeat monitor
  - Know when a machine is down
› Out of the box, the update ClassAd has many attributes about the machine that are of interest for system administration
  - Current prototype = 184 attributes
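The heartbeat idea can be sketched in a few lines: if the manager remembers when each machine last sent an update, a machine whose last update is too old can be presumed down. The function name, the `missed_allowed` threshold, and the machine names below are assumptions made for illustration; the slides do not say how HawkEye decides a machine is down.

```python
# Hypothetical sketch of heartbeat detection from periodic updates.
# `missed_allowed` is an assumed policy knob, not a HawkEye setting.

def find_down_machines(last_update, now, update_interval, missed_allowed=3):
    """Return machines whose last update is older than the allowed gap."""
    deadline = update_interval * missed_allowed
    return sorted(
        name for name, timestamp in last_update.items()
        if now - timestamp > deadline
    )

# Timestamps (in seconds) of the most recent update seen from each machine.
last_update = {"node-a": 1000.0, "node-b": 940.0, "node-c": 700.0}

down = find_down_machines(last_update, now=1010.0, update_interval=60.0)
```

Allowing a few missed updates before declaring a machine down is a common precaution when updates travel over UDP, where an individual packet may be lost.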

16 What if I want to monitor something you didn’t think about?

17 Custom Attributes
[Diagram labels: /proc, kstat…; Hawkeye_Startup_Agent; Hawkeye_Monitor; HawkEye Monitoring Agent; HawkEye Manager; data from the hawkeye_update_attribute command-line tool]
› Create your own HawkEye plugins, or share plugins with others

18 Role of HawkEye Manager
› Store all incoming ClassAds in an indexed, memory-resident data structure
  - Fast response to client tool queries about current state
  - “Show me all machines with a load average > 10”
› Periodically store ClassAd attributes in a Round Robin Database
  - Store information over time
  - “Show me a graph with the load average for this machine over the past week”
› Speak to clients via CEDAR, HTTP
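The two storage roles described above can be illustrated with a toy in-memory store: the latest ad per machine answers current-state queries, and a fixed-size buffer per attribute keeps a bounded history. Everything here (`ManagerStore`, its methods, the machine names) is hypothetical. The query is a plain linear scan rather than an index, and the history buffer only imitates the fixed-size property of a round robin database, not its consolidation of older samples.

```python
from collections import deque

class ManagerStore:
    """Toy stand-in for the manager's current-state and history storage."""

    def __init__(self, history_size=4):
        self.current = {}    # machine name -> latest ad (a dict)
        self.history = {}    # (machine, attribute) -> bounded series
        self.history_size = history_size

    def update(self, machine, ad):
        """Replace the stored ad for a machine with the newest one."""
        self.current[machine] = dict(ad)

    def record(self, machine, attribute):
        """Append the current value to a fixed-size history buffer."""
        key = (machine, attribute)
        series = self.history.setdefault(key, deque(maxlen=self.history_size))
        series.append(self.current[machine][attribute])

    def query(self, predicate):
        """Return the machines whose current ad satisfies the predicate."""
        return sorted(m for m, ad in self.current.items() if predicate(ad))

store = ManagerStore(history_size=3)
for load in (0.5, 2.0, 11.0, 12.5):
    store.update("node-a", {"load_average": load})
    store.record("node-a", "load_average")
store.update("node-b", {"load_average": 0.2})

# "Show me all machines with a load average > 10"
overloaded = store.query(lambda ad: ad["load_average"] > 10)
# The oldest sample (0.5) has been overwritten by newer ones.
recent = list(store.history[("node-a", "load_average")])
```

The bounded buffer is what keeps storage constant over time: once it is full, each new sample displaces the oldest one.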

19 Several different clients
› Command-line, GUI, Web-based

20 But sysadmins also sometimes have to do work…
› Task: copy a new library onto the local disk of each machine.
  - Just a script to copy via rcp/scp to every machine… or is it?

21 Running tasks on behalf of the sysadmin
› Submit your sysadmin tasks to HawkEye
  - Tasks are stored in a persistent queue by the Manager
  - Tasks can leave the queue upon completion, or repeat after specified intervals
  - Tasks can have complex interdependencies via DAGMan
  - Records are kept on which task ran where
› Sounds like Condor, eh?
  - Yes, but simpler…
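The queue semantics listed above (run once or repeat at an interval, and keep a record of what ran where) can be sketched as follows. This is an in-memory toy with invented names (`TaskQueue`, `submit`, `run_due`); it does not persist anything, does not model DAGMan dependencies, and only logs a task instead of executing it.

```python
import heapq

class TaskQueue:
    """Toy task queue with once-only and repeating tasks plus an audit log."""

    def __init__(self):
        self._heap = []      # entries: (run_at, sequence, name, repeat_every)
        self._counter = 0    # tie-breaker so entries never compare by name
        self.audit_log = []  # (task name, machine, scheduled time)

    def submit(self, name, run_at, repeat_every=None):
        heapq.heappush(self._heap, (run_at, self._counter, name, repeat_every))
        self._counter += 1

    def run_due(self, now, machine):
        """'Run' every task that is due and requeue the repeating ones."""
        while self._heap and self._heap[0][0] <= now:
            run_at, _, name, repeat_every = heapq.heappop(self._heap)
            self.audit_log.append((name, machine, run_at))
            if repeat_every is not None:
                self.submit(name, run_at + repeat_every, repeat_every)

queue = TaskQueue()
queue.submit("copy_library", run_at=0)                  # once-only task
queue.submit("check_disk", run_at=0, repeat_every=60)   # repeating task
queue.run_due(now=130, machine="node-a")
```

After the call, the once-only task has left the queue, while the repeating task has run at times 0, 60, and 120 and is queued again for time 180.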

22 Run Tasks in response to monitoring information
› ClassAd “Requirements” attribute
› Example: send email if a machine is low on disk space or low on swap space
  - Submit an email task with an attribute: Requirements = free_disk < 5 || free_swap < 5
› Example with task interdependency: if the load average is high, OS=Linux, and the console is idle, submit a task that runs “top”; if top sees Netscape, submit a task to kill Netscape
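Matching a task's Requirements against machine ads can be sketched by treating the requirement as a predicate over an ad. The Python function below stands in for the slide's expression `free_disk < 5 || free_swap < 5`; the task name, machine names, and sample values are invented for illustration, and real ClassAd matchmaking is more general than this.

```python
# Hypothetical sketch: a task's Requirements expressed as a predicate
# that is checked against each machine's ad.

def low_resources(ad):
    # Mirrors the slide's expression: free_disk < 5 || free_swap < 5
    return ad["free_disk"] < 5 or ad["free_swap"] < 5

def match_tasks(tasks, machine_ads):
    """Return (task, machine) pairs where the task's requirement holds."""
    return [
        (task, machine)
        for task, requirement in tasks.items()
        for machine, ad in sorted(machine_ads.items())
        if requirement(ad)
    ]

tasks = {"send_email": low_resources}
machine_ads = {
    "node-a": {"free_disk": 40, "free_swap": 3},
    "node-b": {"free_disk": 2, "free_swap": 200},
    "node-c": {"free_disk": 50, "free_swap": 50},
}

matches = match_tasks(tasks, machine_ads)
```

Re-running the match each time fresh ads arrive is what lets a task fire in response to monitoring information rather than on a fixed schedule.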

23 HawkEye Design Goals
› Monitoring
  - Reliable presence
  - Get data off the node in an extensible, consistent manner
› Run Tasks
  - In response to probe information
  - Repeat or once-only semantics
  - Audit log
› Independent and self-contained
› Cross-platform

24 Current Status
› Just beginning this project
› Initial release early summer
› Prototypes already running: stop in and see the initial HawkEye work in Rm 3385 on Weds, 9am to 12pm

25 Thank you!
I was an overworked sysadmin. Now I have more free time thanks to HawkEye!

