Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx.


1 Resource Management Working Group SSS Quarterly Meeting November 28, 2001 Dallas, Tx

2 Resource Management and Accounting Working Group
- Working group scope and components
- Progress made
- Current and future issues
- Next steps

3 Working Group Scope
The Resource Management Working Group encompasses the areas of resource management, scheduling and accounting. This working group will focus on the following software components:
- Queue Manager
- Scheduler
- Allocation Manager
- Meta Scheduler
Our charter will also encompass the following capabilities:
- Accounting
- Usage Reports

4 Phase 1 Milestones
- 6 months: Contribute to checkpoint/restart report with regard to scheduling-related aspects
- 12 months: Establish and release initial resource management interface specifications
- 12 months: Establishment of the CVS repository and module structure, agreement on document conventions
- 12 months: Finalized API for system-initiated checkpoint/restart of parallel MPI jobs on Linux systems
- 18 months: Release v1.0 of the Center’s resource management system based on existing open source code and the results of the scalability testing

5 High Level Progress
- Establishing high level design covering initial component functionality and required interfaces
- Determining inter-group requirements (GUI, security, IS, process management, etc.)
- Preparing existing tools (Maui, Silver, QBank) for use within SSS
- Creating infrastructure within which to develop and test RM deliverables
- Creating infrastructure within which to develop and test intra- and inter-group interfaces

6 Proposed Component Architecture
[Diagram. Components shown: Queue Manager, Allocation Manager, Collector, Meta Scheduler, Node Manager, Process Manager, Security System, Information Service, Discovery Service. Color key by working group: Resource Management and Accounting; Execution Management and Monitoring; Node Config and Infrastructure.]

7 Component Interaction Diagram: Job submitted to Queue Manager
[Diagram. Components shown: User Interface, Collector, Meta Scheduler, Queue Manager, Allocation Manager, Scheduler, Process Manager, with interactions numbered 1 to 11.]

8 Component Interaction Trace: Job submitted to Queue Manager
1. A user submits a job to the Queue Manager
2. The Queue Manager does a sanity balance check with the Bank
3. The Queue Manager notifies the Scheduler that a new job has arrived
4. The Scheduler queries node and job status until the job can run
5. A bank reservation is made with the Allocation Manager
6. The Scheduler requests the Queue Manager to run the job
7. The Queue Manager passes job control to the Process Manager
8. The Process Manager notifies the Queue Manager of job completion
9. The Queue Manager notifies the Scheduler of job completion
10. A bank withdrawal is made with the Allocation Manager
11. The user is notified of job completion
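The eleven steps can be replayed as a small simulation. This is only a sketch: the `Bank` class, the `run_job` function, and the rule that a reservation holds funds until the withdrawal charges them are assumptions made for illustration, not behavior the slides specify for QBank or any other component.

```python
class Bank:
    """Toy allocation bank (hypothetical): a balance plus funds held by reservations."""

    def __init__(self, balance):
        self.balance = balance
        self.held = 0

    def can_afford(self, cost):
        # Step 2: sanity balance check against funds that are not already held.
        return self.balance - self.held >= cost

    def reserve(self, cost):
        # Step 5: hold the funds so that no other job can spend them.
        if not self.can_afford(cost):
            raise ValueError("insufficient allocation")
        self.held += cost

    def withdraw(self, cost):
        # Step 10: release the hold and charge the account.
        self.held -= cost
        self.balance -= cost


def run_job(bank, cost):
    """Replay the trace in order and return the list of completed steps."""
    log = ["1 user submits job to Queue Manager"]
    if not bank.can_afford(cost):
        raise ValueError("job rejected at submission")
    log.append("2 Queue Manager checks balance with the Bank")
    log.append("3 Queue Manager notifies Scheduler of new job")
    log.append("4 Scheduler queries node and job status until job can run")
    bank.reserve(cost)
    log.append("5 reservation made with Allocation Manager")
    log.append("6 Scheduler requests Queue Manager to run job")
    log.append("7 Queue Manager passes job control to Process Manager")
    log.append("8 Process Manager notifies Queue Manager of completion")
    log.append("9 Queue Manager notifies Scheduler of completion")
    bank.withdraw(cost)
    log.append("10 withdrawal made with Allocation Manager")
    log.append("11 user notified of completion")
    return log
```

Calling `run_job(Bank(100), 30)` walks all eleven steps and leaves the toy bank with a balance of 70 and nothing held; a job that costs more than the available balance is rejected at the step 2 check.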

9 Component Interaction Diagram: Job submitted to Meta Scheduler
[Diagram. Components shown: User Interface, Collector, Meta Scheduler, Queue Manager, Allocation Manager, Scheduler, Process Manager, with interactions numbered 1 to 15.]

10 Component Interaction Trace: Job submitted to Meta Scheduler
1. A user submits a job to the Meta Scheduler
2. The Meta Scheduler contacts Schedulers to determine which systems could run the job the soonest
3. The Schedulers request quotes from Allocation Banks to determine which systems would run the job for the lowest cost
4. A Scheduler reservation is created for the job on the resource providing the best service; this reservation can be moved or improved upon until the job is staged
5. The job is staged and queued at the system where it is to run
6. The Queue Manager notifies the Scheduler that a new job has arrived
7. The Scheduler queries node and job status until the job can run
8. A bank reservation is made with the Allocation Manager
9. The Scheduler requests the Queue Manager to run the job
10. The Queue Manager passes job control to the Process Manager
11. The Process Manager notifies the Queue Manager of job completion
12. The Queue Manager notifies the Scheduler of job completion
13. A bank withdrawal is made with the Allocation Manager
14. The Scheduler notifies the Meta Scheduler of job completion
15. The user is notified of job completion
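Steps 2 to 4 amount to choosing among per-system offers. The sketch below assumes one possible reading of "best service": the soonest start time wins and cost breaks ties. The slides do not say how start time and cost are weighed, and the `Offer` type and `best_offer` function are hypothetical names.

```python
from typing import NamedTuple


class Offer(NamedTuple):
    system: str       # hypothetical system name
    start_time: int   # earliest start reported by that system's Scheduler
    cost: float       # quote obtained from that system's allocation bank


def best_offer(offers):
    """Pick the soonest start; among equal start times, pick the lowest cost."""
    if not offers:
        raise ValueError("no system can run the job")
    return min(offers, key=lambda o: (o.start_time, o.cost))
```

A different policy, such as a weighted sum of wait time and cost, would only change the `key` function.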

11 Design/Interface Progress
- Initial high level RMS architecture defined
- Resource management dictionary created defining objects within resource management ‘world’
- Object ‘tokens’ declared for major objects
- Component functional interfaces identified
- Initial XML request/response syntax proposed
- Prototypes being constructed to test communication protocols
- Initial detailed extra-group component requirements document created
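The proposed XML syntax itself is not reproduced in the slides. As an illustration of what an XML request round trip between components could look like, here is a sketch using Python's standard `xml.etree.ElementTree`. The element name `Request`, the attributes `action` and `object`, and the `JobId` field are invented for this example and are not the working group's syntax.

```python
import xml.etree.ElementTree as ET


def build_request(action, obj, **fields):
    """Serialize a request; all element and attribute names are hypothetical."""
    root = ET.Element("Request", {"action": action, "object": obj})
    for name, value in fields.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def parse_request(text):
    """Parse a request back into (action, object, fields)."""
    root = ET.fromstring(text)
    fields = {child.tag: child.text for child in root}
    return root.get("action"), root.get("object"), fields
```

For example, `build_request("Query", "Job", JobId=42)` produces a single `Request` element that `parse_request` turns back into the action, the object token, and a dictionary of fields.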

12 Local Scheduler Rationale
- The local scheduler interfaces with the majority of inter- and intra-RM components
- Establish test platform from which interfaces can be tested
- Leverage existing capabilities to accelerate SSS development
- Establish infrastructure within which scheduling and metascheduling services and capabilities can be developed
- Establish ‘driver’ to evaluate other resource management components

13 Local Scheduler Progress
- Baseline scheduler established (Maui 3.2) for SSS scheduling services integrating production and development capabilities
- Prototype interface enabling XML communication with queue manager, metascheduler, and node manager
- Extended QoS infrastructure integrated
- Extended job prioritization infrastructure integrated
- Prototype created for object-oriented data access
- Advanced metascheduling interface integrated

14 Meta Scheduler Progress
- Initial distribution packaging created to allow collaborative development
- Documentation enhanced and extended
- Prototype XML scheduler-to-metascheduler query interface developed
- Initial fault tolerance framework designed

15 Queue Manager Design
- Established need for unified queue manager design common to Scheduler and Metascheduler
- Queue manager will interface directly with Process Manager
- In process of refining the queue manager tasks
- Queue manager will provide an interface to obtain information about any job regardless of job state, including completed jobs (i.e. it will maintain a job information archive)
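The job information archive in the last bullet can be pictured as a store that never drops a record when a job finishes. The following is a minimal sketch under that assumption; the class name, method names, and state strings are hypothetical, since the queue manager interface was still being refined.

```python
class JobArchive:
    """Keeps one record per job for its whole lifetime, including after completion."""

    def __init__(self):
        self._jobs = {}

    def submit(self, job_id, owner):
        # A new job starts in a hypothetical "queued" state.
        self._jobs[job_id] = {"owner": owner, "state": "queued"}

    def set_state(self, job_id, state):
        # Records are updated in place and never deleted.
        self._jobs[job_id]["state"] = state

    def info(self, job_id):
        # Return a copy so that callers cannot modify the archive.
        return dict(self._jobs[job_id])
```

Because completed jobs stay in the same store as queued and running ones, a single `info` call answers queries for any job state.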

16 Allocation Manager Progress
- QBank placed under revision control
- Java prototype created which sends requests in XML
- Experimenting with protocol frameworks (simple octet-counting, octet-stuffing, SOAP, BEEP)
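The two simplest framing options named above differ in how a receiver finds the end of a message. The sketch below shows one way each could work. The specific wire formats are assumptions chosen for illustration, not what the prototype used: a decimal length line for octet-counting, and a newline terminator with backslash escapes for octet-stuffing.

```python
END, ESC = b"\n", b"\\"


def frame_counted(payload: bytes) -> bytes:
    """Octet-counting: prefix the payload with its length in decimal and a newline."""
    return str(len(payload)).encode("ascii") + b"\n" + payload


def read_counted(stream: bytes):
    """Return (payload, remaining bytes) for the first counted frame."""
    header, _, rest = stream.partition(b"\n")
    n = int(header)
    return rest[:n], rest[n:]


def frame_stuffed(payload: bytes) -> bytes:
    """Octet-stuffing: escape ESC and END inside the payload, then append END."""
    body = payload.replace(ESC, ESC + ESC).replace(END, ESC + b"n")
    return body + END


def read_stuffed(stream: bytes):
    """Return (payload, remaining bytes) for the first stuffed frame."""
    body, _, rest = stream.partition(END)
    out = bytearray()
    i = 0
    while i < len(body):
        if body[i:i + 1] == ESC:
            # ESC followed by "n" encodes END; ESC followed by ESC encodes ESC.
            out += END if body[i + 1:i + 2] == b"n" else ESC
            i += 2
        else:
            out.append(body[i])
            i += 1
    return bytes(out), rest
```

Octet-counting needs the full length before sending, while octet-stuffing can stream a document of unknown length at the price of scanning and escaping every byte.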

17 Next Steps (In Progress)
- Software Lifecycle Infrastructure
  – Online intra-RM schedule and dependencies document
  – Detailed extra-RM working group requirements
  – Coordinate creation of component level regression test suite
  – Bug tracking systems activated (used to track internal defects and development plans)
- Interface
  – Produce validating intra-RM XML schema
  – Produce prototype RM components communicating in initial protocol
- Feature Enhancements
  – Contribution to checkpoint/restart report
  – Creation of queue manager prototype

18 Next Steps (6 Months)
- Usability
  – GUI-server interface, GUI format, security determined and prototypes created
  – Documentation of initial meta job constraints/features and specification language
- Inter-group Collaboration
  – Creation of early scheduler XML implementation for use as RM driver
  – Development of initial dynamic job scheduler-queue manager interface
  – Extension of RM specifications/requirement document
  – Extension of internal component test infrastructure
  – Determination of ‘best practices’ in documentation maintenance
  – Evaluation and adoption of web project management and collaboration tools
  – Creation of prototype queue manager with scheduler/task manager interfaces

19 Next Steps (6 Months)
- Fault Tolerance
  – Enhance metascheduler to ‘survive’ local daemon failure
  – Enhancement of threaded scheduling interface
  – Development of threaded metascheduling interface
- Resource Optimization
  – Development of local optimization features of meta workload
- Feature Enhancements
  – Creation of resource manager extension features
  – Development of direct metascheduler to queue manager staging roadmap
- Interfaces
  – Specification of ‘best guess’ security infrastructure and evaluation of impact on system internals and communication protocols

20 Next Steps (1 year)
- Software Lifecycle Infrastructure
  – Create multi-component regression tests
  – Generate ‘alpha’ package of scheduling, metascheduling, and allocation management packages
- Interfaces
  – Development of functional XML interfaces for all components
  – Early adoption of security infrastructure
  – Creation of optional information service interfaces
  – Admin and end-user GUIs proposed to enable use of new functionality
- Inter-group Collaboration
  – Enhanced suspend/resume and checkpoint/restart features with detailed roadmap specified for all remaining suspend/resume and checkpoint/restart deliverables

21 Current Issues
- Should there be an enveloping protocol framework which handles framing (where the XML document begins and ends), authentication, multiplexing, streaming data, etc.? (Should we look at something like BEEP, or start from scratch and invent something of our own?)
- The queue manager/collector to node/process manager functionality and data interface requires further refinement.
- Queue manager/collector and node/process manager development schedules must be determined and coordinated.

22 Issues
- Continued effort is required to complete an ‘intra-RM’ XML schema to handle initial RMS interaction needs. A boundary between the internal ‘intra-RM’ schema and the global XML schema needs to be defined.
- Understanding of open source requirements (i.e. can software that requires registration and usage agreements be included in the SSS distribution?)

23 Inter-Group Issues
- Need for coordination of the resource management system across working groups, so that the pieces all function together properly and no part is overlooked. Need to coordinate schedules for delivery of RMWG-dependent non-RMWG components.
- Early vendor/industry collaborations (We’d better do this while it can still influence our design. Need to talk to decision makers and develop business plans.)

24 Inter-group Issues
- Information service: should we rather be looking for something existing (e.g. MDS2)?
- Need to solidify SSS-wide standards for packaging, revision control, documentation content and format, problem tracking, … and establish mechanisms and places to host them.
- Creation of regression and integration test suite (with the Validation and Testing WG; we need this from an early stage)

25 Conclusions
- Questions…

