Scalable Cluster Management: Frameworks, Tools, and Systems. David A. Evensky, Ann C. Gentile, Pete Wyckoff, Robert C. Armstrong, Robert L. Clay, Ron Brightwell.


1 Scalable Cluster Management: Frameworks, Tools, and Systems
David A. Evensky, Ann C. Gentile, Pete Wyckoff, Robert C. Armstrong, Robert L. Clay, Ron Brightwell
Sandia National Laboratories

2 Lilith: a tool framework for very large clusters
Most current cluster tools are monolithic programs, each designed to do one task well. If you need a new task, you need a new tool. Lilith is a component framework that lets users construct new tools easily.

3 Control of large distributed systems
–System administration
–Auditing & job control by users
–Interrogation of processes
–Simple applications
1 sec program on 1000 nodes: 16 min 10 sec

4 Lilith: Scalable component framework
–Lilith spans a tree of machines executing user-defined code.
–User code (Lilim/Lilly) provides component functionality on a single node
–Provides scalable distribution, result collection

5 Component Methods
MO[] distributeOnTree(MO, int[])
–data distribution down the tree
MO onTree(MO)
–component action on the node
MO collateOnTree(MO[])
–result collection and condensation
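The three component methods can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: the `MO` class, the `Component` interface, the heap-style tree layout in `childrenOf`, and the reading of the `int[]` argument as a list of child node ids are all hypothetical, since the slides give only the method signatures. The example component counts the nodes it reaches.

```java
import java.util.ArrayList;
import java.util.List;

public class LilithTreeSketch {

    /** Hypothetical message object; the slides only name the type MO. */
    static final class MO {
        final int value;
        MO(int value) { this.value = value; }
    }

    /** The three methods named on the slide, with assumed semantics. */
    interface Component {
        MO[] distributeOnTree(MO data, int[] children); // one message per child subtree
        MO onTree(MO data);                             // action on the local node
        MO collateOnTree(MO[] results);                 // condense local and child results
    }

    /** Child ids of a node in a complete tree with the given fanout (assumed layout). */
    static int[] childrenOf(int node, int fanout, int nodeCount) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= fanout; i++) {
            long child = (long) node * fanout + i;
            if (child < nodeCount) ids.add((int) child);
        }
        return ids.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Distribute down the tree, act on each node, collate on the way back up. */
    static MO run(Component c, MO data, int node, int fanout, int nodeCount) {
        int[] children = childrenOf(node, fanout, nodeCount);
        MO[] parts = c.distributeOnTree(data, children);
        MO[] results = new MO[children.length + 1];
        results[0] = c.onTree(data);
        for (int i = 0; i < children.length; i++) {
            results[i + 1] = run(c, parts[i], children[i], fanout, nodeCount);
        }
        return c.collateOnTree(results);
    }

    /** Example component: every node reports 1, and collation sums the reports. */
    static final class CountNodes implements Component {
        public MO[] distributeOnTree(MO data, int[] children) {
            MO[] parts = new MO[children.length];
            for (int i = 0; i < parts.length; i++) parts[i] = data; // same request to each child
            return parts;
        }
        public MO onTree(MO data) { return new MO(1); }
        public MO collateOnTree(MO[] results) {
            int sum = 0;
            for (MO r : results) sum += r.value;
            return new MO(sum);
        }
    }

    static int countNodes(int nodeCount, int fanout) {
        return run(new CountNodes(), new MO(0), 0, fanout, nodeCount).value;
    }

    public static void main(String[] args) {
        System.out.println(countNodes(1000, 4)); // prints 1000
    }
}
```

In this sketch the recursion runs in one process; in a real deployment each recursive call would presumably be a remote invocation on a child host, so the subtrees would proceed in parallel.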

6 Security
Uses purely Java 2 mechanisms at this time….
–User sends credential with call
–LilithHost creates ProtectionDomain from user credential
–LilithHost calls checkPermission
–Sandbox setup similarly using the User credential and PolicyFile
(diagram labels: LilithHost, Policy, Keys, Method invocation)
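The credential-to-permission flow above might look roughly like the following. This is a sketch, not Lilith's code: `InvokePermission`, the in-memory `policy` map, and the user names are hypothetical, and `ProtectionDomain.implies` stands in for the `checkPermission` call mentioned on the slide so that the example does not depend on a security manager being installed.

```java
import java.security.BasicPermission;
import java.security.Permissions;
import java.security.ProtectionDomain;
import java.util.HashMap;
import java.util.Map;

public class LilithSecuritySketch {

    /** Hypothetical permission naming a component method a user may invoke. */
    static final class InvokePermission extends BasicPermission {
        private static final long serialVersionUID = 1L;
        InvokePermission(String name) { super(name); }
    }

    /** Hypothetical stand-in for a policy file: credential name -> granted permissions. */
    static Map<String, Permissions> demoPolicy() {
        Permissions alice = new Permissions();
        alice.add(new InvokePermission("onTree"));
        alice.add(new InvokePermission("collateOnTree"));
        Map<String, Permissions> policy = new HashMap<>();
        policy.put("alice", alice);
        return policy;
    }

    /** Build a protection domain from the caller's credential; unknown users get nothing. */
    static ProtectionDomain domainFor(String credential, Map<String, Permissions> policy) {
        Permissions granted = policy.getOrDefault(credential, new Permissions());
        return new ProtectionDomain(null, granted); // no CodeSource in this sketch
    }

    /** Check whether the domain allows invoking the named method. */
    static boolean mayInvoke(ProtectionDomain domain, String method) {
        return domain.implies(new InvokePermission(method));
    }

    public static void main(String[] args) {
        Map<String, Permissions> policy = demoPolicy();
        System.out.println(mayInvoke(domainFor("alice", policy), "onTree"));   // true
        System.out.println(mayInvoke(domainFor("alice", policy), "shutdown")); // false
        System.out.println(mayInvoke(domainFor("mallory", policy), "onTree")); // false
    }
}
```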

7 Prototypical tools
–System monitoring tool to track the state of a cluster of machines
–PS-tool to get sortable process information from selected nodes of the cluster.

8 Lilith Lights tool
Snake toy app
–demo that draws a snake over front panel
–no global repository for state; all info distributed
–Snake’s movement was limited to left half of machine: a program error in the declaration of drand48() biased the results

9 Who serves who?
Programmers adapt to:
–The OS that runs on the machine
–The system configuration chosen by the admins
–Changing system environments, economically driven to heterogeneous distributed computing
Why can’t the user dictate the software environment as a resource request?

10 DASE: Dynamically Adaptive Software Environment
–Provide multi-OS/multi-environment capability
–Manage multiple SW environments
–“save” user environment for reuse later
–Integration with SW component architectures

11 DASE Service Object Model
(diagram labels: Physical system; Logical partitioning; “system” model; Partitioner; App Object with resource spec and data/map objects; Solver; Visualizer; Mesher; Scheduler; Resource Request)

12 Flexible Resource Management

13 Scalable Unit

14 System Support Hierarchy
(diagram labels)
–sss1: Admin access; Master copy of system software
–Scalable Unit (shown three times): sss0 and node; In-use copy of system software; NFS mount root from SSS0

15 Hardware Management
Discovery and Control
–Perl scripts that control individual devices (power controller, terminal server, machine, switch) build a database of configuration info (MAC and IP addresses, serial numbers, etc.)
Roles
–database is augmented with each component’s role in the system (compute, sss0, terminal server, etc.)

16 “Virtual Machines”
–Allows arbitrary grouping of scalable units that use the same system software
–Operations to update system software and boot nodes, scalable units, or machines
–Update system software on an SU in 1 min.
–Update system software on 24 SUs in 1.5 min.
–Boot an SU in 5 min. (staged for power drain)
–Boot 24 SUs in 10 min.
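To put the timings above in perspective, the short calculation below compares them with a naive serial estimate. The assumption that handling 24 SUs one after another would take 24 times the single-SU time is ours, not the slides'; the class and method names are illustrative.

```java
public class UpdateTimingSketch {

    /** Naive estimate: handle the units one at a time (an assumption for comparison). */
    static double serialEstimate(int units, double minutesPerUnit) {
        return units * minutesPerUnit;
    }

    /** Ratio of the naive serial estimate to the time reported on the slide. */
    static double speedup(int units, double minutesPerUnit, double observedMinutes) {
        return serialEstimate(units, minutesPerUnit) / observedMinutes;
    }

    public static void main(String[] args) {
        // Update: 1 min per SU, 1.5 min reported for 24 SUs.
        System.out.println(speedup(24, 1.0, 1.5));  // prints 16.0
        // Boot: 5 min per SU, 10 min reported for 24 SUs.
        System.out.println(speedup(24, 5.0, 10.0)); // prints 12.0
    }
}
```

Under that assumption, the reported figures correspond to roughly a 16-fold gain for updates and a 12-fold gain for boots over doing the work one SU at a time.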

17 “Virtual Machines”
(diagram labels)
–sss1: Uses rdist to push system software down; SU configuration database
–Scalable Unit (shown three times): sss0 and node; In-use copy of system software; NFS mount root from SSS0
–Linux 2.3; Beta; Alpha; Production

18 http://dancer.ca.sandia.gov http://www.cplant.ca.sandia.gov http://www.cs.sandia.gov/cplant

