
Extensible Scalable Monitoring for Clusters of Computers Eric Anderson U.C. Berkeley Summer 1997 NOW Retreat.


1 Extensible Scalable Monitoring for Clusters of Computers Eric Anderson U.C. Berkeley Summer 1997 NOW Retreat

2 Overall Problem
Monitoring a cluster of cooperating computers
– Different from client-server, where only the servers matter
– Requires substantial information from all machines
– 100s to 1000s of nodes
– Client-server becomes a subset of this problem

3 Problems & Solutions
Cluster software and hardware are constantly evolving
– Monitoring software must be extensible and flexible
=> Use relational tables
Failures will occur in the cluster
– Monitoring software must detect and recover from failures
=> Use timestamps for weak synchronization
Scalability needed to hundreds of nodes
– Need to efficiently transfer data from sources to sinks
=> Use hierarchy & hybrid push-pull protocol
– Need to display statistics and information from all nodes
=> Use statistical aggregation + color and shade to minimize information loss

4 Overview
Details of solutions
– Handling evolving software
– Detecting and recovering from failures
– Scaling data management
– Scaling visualization
Implementation
– Architecture
– Programs
– Snapshot
– Experience
Conclusion & Future Work

5 Problem: Clusters Evolve
Solution: Relational tables
Increases flexibility by decoupling data users from data providers
Increases extensibility by structuring data into independent tables
Increases extensibility by allowing additional columns in tables without breaking old programs
Retains performance through transparent use of indices
Improvement over tree structures in previous systems
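The point about adding columns without breaking old programs can be illustrated with a minimal sketch. It assumes SQLite (Python's sqlite3 module) as a stand-in database, and the table name node_stats, its columns, and the function names are hypothetical, not taken from the talk.

```python
import sqlite3


def old_reader(conn):
    """An 'old' program: it names the columns it needs, so new columns are invisible to it."""
    return conn.execute(
        "SELECT host, load FROM node_stats ORDER BY host"
    ).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema for per-node statistics.
    conn.execute("CREATE TABLE node_stats (host TEXT, load REAL)")
    # The index is used transparently; readers do not mention it.
    conn.execute("CREATE INDEX idx_node_stats_host ON node_stats (host)")
    conn.executemany(
        "INSERT INTO node_stats VALUES (?, ?)", [("u1", 0.5), ("u2", 1.5)]
    )
    before = old_reader(conn)

    # A new data provider extends the table with an extra column.
    conn.execute("ALTER TABLE node_stats ADD COLUMN mem_free_kb INTEGER")
    conn.execute("UPDATE node_stats SET mem_free_kb = 2048 WHERE host = 'u1'")
    after = old_reader(conn)

    conn.close()
    return before, after
```

The old reader returns the same rows before and after the schema change because it selects columns by name rather than by position.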

6 Problem: Failures Occur
Solution: Use timestamps
1 Loss of periodic updates to timestamps allows remote nodes to detect failures
2 Timestamps allow weak synchronization between databases
– Better availability during failures, simpler recovery
3 Timestamps allow stale data to be eliminated
– Only requires that purges run every so often, rather than relying on programs to clean up after themselves
Reasons 2 & 3 are useful even in normal operation
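The three uses of timestamps can be sketched as three small functions. This is only an illustration under assumed data shapes: rows are dictionaries with a "ts" field in seconds, and the names failed_nodes, purge_stale, and merge_newest are hypothetical.

```python
def failed_nodes(last_seen, now, timeout):
    """Reason 1: a node whose timestamp stopped advancing is presumed failed."""
    return sorted(host for host, ts in last_seen.items() if now - ts > timeout)


def merge_newest(local, remote):
    """Reason 2: weak synchronization keeps, per key, the row with the newer timestamp."""
    merged = dict(local)
    for key, row in remote.items():
        if key not in merged or row["ts"] > merged[key]["ts"]:
            merged[key] = row
    return merged


def purge_stale(rows, now, max_age):
    """Reason 3: a periodic purge drops rows older than max_age seconds."""
    return [row for row in rows if now - row["ts"] <= max_age]
```

Because merge_newest only compares timestamps, it can be rerun after a failure without any extra recovery protocol, which is one reading of the "simpler recovery" claim.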

7 Problem: Scalable Data Access
Solution: Hierarchy + efficient protocol
Hierarchy allows
– Batching of data from different nodes (all data from routers)
– Specialization to particular data (all data on processes)
Efficient protocol (hybrid of push/pull)
– Sink sends (SQL select command, interval, count) to source
– Changed data is extracted via SQL every interval seconds and forwarded to the sink count times
– Sink can cancel requests at any time
– Achieves the best of pull and push protocols in terms of wasted data transfers, freshness, and network bandwidth
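The hybrid protocol can be sketched roughly as follows. This is not the forwarder's actual code: it uses SQLite in place of the real database, a caller-supplied clock instead of real timers, and a callback instead of a network connection. It also assumes that the first column of each row identifies it and that every interval consumes one of the count forwards even when nothing changed. The class and method names are hypothetical.

```python
import sqlite3


class Source:
    """Source side of a hybrid push/pull exchange (hypothetical sketch)."""

    def __init__(self, conn):
        self.conn = conn
        self.requests = {}
        self.next_id = 0

    def subscribe(self, select_sql, interval, count, deliver, now=0):
        """Pull half: the sink registers (select, interval, count) once."""
        self.next_id += 1
        self.requests[self.next_id] = {
            "sql": select_sql,
            "interval": interval,
            "remaining": count,
            "deliver": deliver,        # stands in for the connection to the sink
            "next_due": now + interval,
            "last_sent": {},           # row key -> last row forwarded
        }
        return self.next_id

    def cancel(self, request_id):
        """The sink can cancel a request at any time."""
        self.requests.pop(request_id, None)

    def tick(self, now):
        """Push half: run each due query and forward only rows that changed."""
        for request_id, req in list(self.requests.items()):
            if now < req["next_due"]:
                continue
            rows = self.conn.execute(req["sql"]).fetchall()
            changed = [r for r in rows if req["last_sent"].get(r[0]) != r]
            for r in changed:
                req["last_sent"][r[0]] = r
            if changed:
                req["deliver"](changed)
            req["remaining"] -= 1
            req["next_due"] = now + req["interval"]
            if req["remaining"] <= 0:
                del self.requests[request_id]
```

The sink sends one request and then receives only changed rows, so unchanged data is never retransmitted and the sink does not have to poll.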

8 Problem: Scalable Visualization
Solution: Statistical aggregation + use of shade & color to minimize information loss
Aggregate across similar variables (average load of 10 machines); show dispersion (std. dev.) as shade
Aggregate across variables from one node (utilization = max{disk, network, cpu})
Both forms of aggregation at the same time: hierarchical aggregation
Use color to draw attention to special things (nodes down) to limit visual overload
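The two aggregation forms and their combination can be sketched as below. The field names, the color names, and the full_shade_stddev scaling constant are assumptions made for illustration; the talk does not specify how dispersion maps to shade.

```python
from statistics import mean, pstdev


def node_utilization(node):
    """Aggregate across variables from one node: utilization = max{disk, network, cpu}."""
    return max(node["disk"], node["network"], node["cpu"])


def summarize_group(nodes, full_shade_stddev=0.5):
    """Hierarchical aggregation: per-node utilization first, then statistics across nodes."""
    utilizations = [node_utilization(n) for n in nodes if n["up"]]
    spread = pstdev(utilizations) if utilizations else 0.0
    return {
        "mean": mean(utilizations) if utilizations else 0.0,
        "stddev": spread,
        # Shade in [0, 1]: higher dispersion gives a stronger shade (assumed mapping).
        "shade": min(spread / full_shade_stddev, 1.0),
        # Color is reserved for special conditions such as a node being down.
        "color": "red" if any(not n["up"] for n in nodes) else "green",
    }
```

A display can then draw one cell per group instead of one per node, with the shade hinting at how much the average hides.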

9 Implementation Architecture
[Diagram; component labels by level]
Node level (four nodes shown): gather, node-level DB, forwarder
Middle level (two shown): joinpush, mid-level DB, forwarder
Top level: joinpush, top-level DB, javaserver, Java applet

10 Implementation Details
Databases are MiniSQL
– Freely available with source code
– Implements a subset of SQL
Forwarder implements the source part of the hybrid protocol
– Uses polling to get data from the database
Joinpush implements the merging part of the hierarchy
– Control of merge sources is external to the program
Both forwarder & joinpush are implemented in threaded C
– Simpler implementation for blocking operations
– Could be merged in with the database

11 Implementation Details, cont.
Gather implemented in Perl
– Simpler to add new data sources, but would like threading
– Somewhat inefficient, might re-implement in C
Javaserver implemented in Perl
– Easier to extend with additional aggregation forms
– Application-level proxy because Java can’t access the network
Javaclient implemented in Java
– Allows clients to run in a browser anywhere in the world
– Weak feedback to javaserver to control the information displayed

12 Implementation Snapshot

13 Experience
Configuration information should be in the database
– Had it in random files; the database collects it together
Reset-world operation very important
– Puts the system in a known state
Useful for the default destination of statistics to be a remote database
– Minimizes load on monitored nodes
– Potentially reduces fault tolerance
Browser user interface very useful
– Limitations of Java very obnoxious

14 Conclusion
Four problems & solutions important for any cluster monitoring system
– Evolution is inherent in uses of clusters
– Independent failures occur in all clusters
– Scalability of data management is needed for large clusters
– Scalability of visualization is also needed for large clusters
Implementation works and is initially useful; further deployment needed
Experience identified problems and places for improvement

15 Future Work
Automatic identification of statistics relevant to problems
– Expect to be able to use Boolean disjunction learning algorithms
Tracking of long-term trends and statistical measures
Self-tuning of specialized databases based on usage
Addition of notification and repair components
Gathering of more statistics (via SNMP, for example)
Distribution of the system to external sites

