1 RMS and Scheduling for Future Generation Grids
Ramin Yahyapour, University Dortmund, Leader of the CoreGRID Institute on Resource Management and Scheduling
CoreGRID – Summer School, Bonn, 24 July 2006

2 European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies
Introduction
We all know what the Grid is…
–one of the many definitions: "Resource sharing & coordinated problem solving in dynamic, multi-institutional virtual organizations" (Ian Foster)
–however, the actual scope of the Grid is still quite controversial
Many people consider High Performance Computing (HPC) as the main Grid application.
–today's Grids are mostly Computational Grids or Data Grids with HPC resources as building blocks
–thus, Grid resource management is closely related to resource management on HPC resources (our starting point)
–we will return to a broader Grid scope and its implications later

3 Key Question
Which services/resources to use for an activity, when, where, how?
Typically: a particular user, business application, or component application needs one or several services/resources for an activity under given constraints:
–Trust & Security
–Timing & Economics
–Functionality & Service level
–Application-specifics & Inter-dependencies
–Scheduling and Access Policies
→ This question has to be answered in an automatic, efficient, and reliable way.
→ Part of the invisible and smart infrastructure!

4 Motivation
Resource Management for Future/Next Generation Grids! But what are Future Generation Grids? It depends on who you ask!
HPC Computing
–Parallel Computing
–Cluster Computing
–Desktop Computing
Enterprise Grids
–Business Services
–Application Server
–Web services
Ambient Intelligence, Ubiquitous Computing
–PDA, Mobile Devices

5 Resource Definition
Concluding from the different interpretations of Grid: for broad acceptance, a Grid RMS should probably cover the whole scope.
Resources:
–Compute
–Network
–Storage
–Data
–Software (components, licenses)
–Services (functionality, ability)
Management of some resources is less complex, while other resources require coordination and orchestration to be effective (e.g. HW and SW).

6 Resource Management Layer
A Grid Resource Management System consists of:
Local resource management system (Resource Layer)
–Basic resource management unit
–Provides a standard interface for using remote resources
–e.g. GRAM, etc.
Global resource management system (Collective Layer)
–Coordinates all local resource management systems within multiple or distributed Virtual Organizations (VOs)
–Provides high-level functionality to use all resources efficiently: Job Submission; Resource Discovery and Selection; Scheduling; Co-allocation; Job Monitoring, etc.
–e.g. Meta-scheduler, Resource Broker, etc.

7 [Figure: Grid RMS architecture. Labels: User/Application; Higher-Level Services; Resource Broker; Grid Resource Manager; Core Grid Infrastructure Services; Information Services; Monitoring Services; Security Services; Grid Middleware; Local Resource Management; PBS; LSF; …; Resource; Grid RMS]

8 Core Functionalities of a Grid RMS
Resource Discovery
–online, on-demand process
Access to Resource Information
–static and dynamic information
Status Monitoring
–general resource monitoring
–monitoring with respect to a job
Allocation/Scheduling
–coordination is required
SLA Management
–reliable agreements
Execution Management/Provisioning
–start of a job / use of a resource
Accounting and Billing

9 Case 1: RMS for Specialized Applications
Specialized resource management dedicated to a single application domain.
–Goal: high efficiency
–Cost: higher development effort
The RMS is adapted to:
–the application and its workflow
–the resource configuration
There is a need for specific interfaces to the resources.
Highly specialized for the application and therefore easier to handle for the user.
–The know-how has been built into the system.
Only certain types of jobs and resources are considered.

10 Case 2: RMS as Generic Grid Middleware
The Grid RMS is open for many applications. This may be less efficient than Case 1.
Generic interfaces are required that are adapted to many front- and backends.
This approach requires additional user-/application-supplied information:
–job description: workflow, objectives, requirements, constraints
Consideration of security is an integral aspect
–wide variety of security levels
An RMS for Future Generation Grids needs the flexibility to cover all kinds of jobs and resources.

11 FGG Resource Management
Need for well-defined interfaces to core services
Inherent support for different implementations, while maintaining cooperation between these implementations
Core services: Resource Discovery; Access to Resource Information; Status Monitoring; Allocation/Scheduling; SLA Management; Execution Management/Provisioning; Accounting and Billing

12 Requirements
Resource Discovery:
–scalable from cluster grids, business grids to global grids
–centralized or decentralized implementations, P2P
–unified naming scheme
Aspects: flexibility, scalability, efficiency

13 Requirements
Access to resource information:
–static and historic information
–dynamic (future) information: planned, predicted
–may be subject to privacy concerns (user and owner dependent)
Aspects: flexibility, scalability, efficiency

14 Problem: Job Submission Descriptions Differ
The deliverables of the GGF/OGF Working Group JSDL:
A specification for an abstract standard Job Submission Description Language (JSDL) that is independent of language bindings, including:
–the JSDL feature set and attribute semantics,
–the definition of the relationship between attributes,
–and the range of attribute values.
A normative XML Schema corresponding to the JSDL specification.
A document of translation tables to and from the scheduling languages of a set of popular batch systems, covering both the job requirements and the resource description attributes of those languages that are relevant to JSDL.

15 JSDL Attribute Categories
The job attribute categories include:
–Job Identity Attributes: ID, owner, group, project, type, etc.
–Job Resource Attributes: hardware, software, including applications, Web and Grid Services, etc.
–Job Environment Attributes: environment variables, argument lists, etc.
–Job Data Attributes: databases, files, data formats, and staging, replication, caching, and disk requirements, etc.
–Job Scheduling Attributes: start and end times, duration, immediate dependencies, etc.
–Job Security Attributes: authentication, authorisation, data encryption, etc.
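
To make the six attribute categories concrete, here is a minimal Python sketch of a job description grouped the same way. It is not JSDL itself (JSDL is an XML language); the class name `JobDescription`, its field names, and the example values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class JobDescription:
    # One field per attribute category from the slide (names are illustrative).
    identity: dict = field(default_factory=dict)     # ID, owner, group, project
    resources: dict = field(default_factory=dict)    # hardware, software
    environment: dict = field(default_factory=dict)  # env variables, arguments
    data: dict = field(default_factory=dict)         # files, staging
    scheduling: dict = field(default_factory=dict)   # start/end times, duration
    security: dict = field(default_factory=dict)     # authentication, encryption

    def missing_categories(self):
        # Categories the submitter left empty, in declaration order.
        return [name for name, value in vars(self).items() if not value]


job = JobDescription(
    identity={"owner": "alice", "project": "demo"},
    resources={"nodes": 4},
    scheduling={"duration_minutes": 60},
)
print(job.missing_categories())  # ['environment', 'data', 'security']
```

A generic scheduler could use such a check to decide which defaults to apply before translating the description to a local batch system's language.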

16 Requirements
Status monitoring:
–job and resource condition
–SLA status
Autonomic aspects:
–detection of unexpected changes
–allows prediction of system behavior related to an individual job and to general demand
–trigger of re-scheduling/re-allocation
Aspects: reliability, scalability

17 Requirements
Allocation/Scheduling:
–Different application scenarios: parallel and sequential jobs; co-allocation and orchestration; workflows
–Provider policies: access, cost, security
–User/application policies: scheduling objectives, cost/budget management, deadlines
–Cooperation between RM systems
–Support for different (= individual) algorithms and strategies
Aspects: flexibility, easy-to-use, support business models, person-centric, efficiency

18 Different Levels of Scheduling
Resource-level scheduler
–low-level scheduler, local scheduler, local resource manager
–scheduler close to the resource, controlling a supercomputer, cluster, or network of workstations on the same local area network
–Examples: OpenPBS, PBS Pro, LSF, SGE
Enterprise-level scheduler
–Scheduling across multiple local schedulers belonging to the same organization
–Examples: PBS Pro peer scheduling, LSF MultiCluster
Grid-level scheduler
–also known as super-scheduler, broker, community scheduler
–Discovers resources that can meet a job's requirements
–Schedules across lower-level schedulers

19 Grid-Level Scheduler
Discovers & selects the appropriate resource(s) for a job
If the selected resources are under the control of several local schedulers, a meta-scheduling action is performed
Architecture:
–Centralized: all lower-level schedulers are under the control of a single Grid scheduler (not realistic in global Grids)
–Distributed: lower-level schedulers are under the control of several Grid scheduler components; a local scheduler may receive jobs from several components of the Grid scheduler

20 Grid Scheduling
[Figure: a Grid User submits to a Grid-Scheduler, which sits above the local Schedulers of Machine 1, Machine 2 and Machine 3; each local Scheduler has its own Job-Queue and Schedule over time.]

21 Activities of a Grid Scheduler
GGF Document: 10 Actions of Super Scheduling (GFD-I.4)
Source: Jennifer Schopf

22 Select a Resource for Execution
Most systems do not provide advance information about future job execution
–user information is not accurate, as mentioned before
–new jobs arrive that may surpass current queue entries due to higher priority
A Grid scheduler might consider the current queue situation; however, this does not give reliable information about future executions:
–A job may wait long in a short queue while it would have been executed earlier on another system.
Available information:
–The Grid information service gives the state of the resources and possibly authorization information
–Prediction heuristics: estimate a job's wait time for a given resource, based on the current state and the job's requirements.
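
As a rough illustration of a prediction heuristic, the Python sketch below estimates the wait on a resource from the node-hours of the work ahead of a new job. This is a deliberately naive model, not a method from the talk: the function name `estimate_wait_hours` and the assumption that runtime estimates are available and accurate are both hypothetical. It reproduces the point that a short queue can mean a long wait.

```python
def estimate_wait_hours(total_nodes, jobs_ahead):
    """Naive wait estimate: outstanding node-hours divided by machine size.

    jobs_ahead is a list of (nodes, estimated_hours) for running and queued
    jobs. Assumes perfect packing and accurate runtime estimates, which real
    systems rarely provide.
    """
    node_hours = sum(nodes * hours for nodes, hours in jobs_ahead)
    return node_hours / total_nodes


# Two large, long jobs on a small machine: short queue, long wait.
short_queue = estimate_wait_hours(16, [(16, 10.0), (16, 6.0)])
# Twenty small, short jobs on a large machine: long queue, short wait.
long_queue = estimate_wait_hours(128, [(8, 1.0)] * 20)

print(short_queue, long_queue)  # 16.0 1.25
```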

23 Requirements (contd)
SLA management:
–reliability
–orchestration of services
–quality of service
–business models
–accountability
Aspects: persistence, support business models

24 Co-allocation
It is often requested that several resources are used for a single job.
–that is, a scheduler has to ensure that all resources are available when needed: in parallel (e.g. visualization and processing) or with time dependencies (e.g. a workflow)
The task is especially difficult if the resources belong to different administrative domains.
–The actual allocation time must be known for co-allocation
–or the different local resource management systems must synchronize with each other (wait for availability of all resources)
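
When allocation times are known, the parallel case reduces to finding a time at which every resource is free for the whole job. The Python sketch below shows one simple way to do that; the function name `earliest_common_start`, the interval representation, and the example numbers are assumptions for illustration, and real schedulers would also have to deal with stale or withheld availability data.

```python
def earliest_common_start(free_windows, duration):
    """Earliest time at which every resource is free for `duration`.

    free_windows holds one list of (start, end) free intervals per resource.
    Returns None if no common slot exists.
    """
    # The earliest feasible start always coincides with the start of some
    # free window, so only those times need to be checked.
    candidates = sorted(start for windows in free_windows for start, _ in windows)
    for t in candidates:
        if all(
            any(s <= t and t + duration <= e for s, e in windows)
            for windows in free_windows
        ):
            return t
    return None


compute = [(0, 4), (6, 12)]
visualization = [(2, 3), (7, 10)]
network = [(0, 20)]

print(earliest_common_start([compute, visualization, network], 2))  # 7
print(earliest_common_start([compute, visualization, network], 4))  # None
```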

25 Example: Multi-Site Job Execution
A job uses several resources at different sites in parallel. Network communication is an issue.
[Figure: a Grid-Scheduler places one multi-site job on Machine 1, Machine 2 and Machine 3; each machine has its own local Scheduler, Job-Queue and Schedule over time.]

26 Advance Reservation
Co-allocation and other applications require a priori information about the precise resource availability
With the concept of advance reservation, the resource provider guarantees a specified resource allocation.
–includes a two- or three-phase commit for agreeing on the reservation
Implementations:
–GARA/DUROC/SNAP provide interfaces for Globus to create advance reservations
–implementations for network QoS are available: setup of a dedicated bandwidth between endpoints
–WS-Agreement defines a protocol for agreement management
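
The two-phase commit mentioned above can be sketched in a few lines of Python. This is a toy, in-memory illustration only: the `Provider` class and `reserve_all` function are hypothetical names, not the interface of GARA, DUROC, SNAP or WS-Agreement, and a real protocol would also need timeouts and failure recovery.

```python
class Provider:
    """A resource provider that can tentatively hold and then confirm a slot."""

    def __init__(self, name, busy=()):
        self.name = name
        self.booked = set(busy)  # confirmed reservations
        self.held = set()        # tentative holds from phase 1

    def prepare(self, slot):
        # Phase 1: place a tentative hold if the slot is still free.
        if slot in self.booked or slot in self.held:
            return False
        self.held.add(slot)
        return True

    def commit(self, slot):
        # Phase 2: turn the hold into a confirmed reservation.
        self.held.discard(slot)
        self.booked.add(slot)

    def abort(self, slot):
        # Release a tentative hold.
        self.held.discard(slot)


def reserve_all(providers, slot):
    """Reserve `slot` on every provider, or on none of them."""
    prepared = []
    for provider in providers:
        if provider.prepare(slot):
            prepared.append(provider)
        else:
            for other in prepared:
                other.abort(slot)
            return False
    for provider in prepared:
        provider.commit(slot)
    return True


cluster = Provider("cluster")
network = Provider("network", busy={"10:00"})
print(reserve_all([cluster, network], "10:00"))  # False
print(reserve_all([cluster, network], "11:00"))  # True
```

The first call fails because the network is already booked, and the hold placed on the cluster is released again, so no partial reservation is left behind.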

27 Using Service Level Agreements
The mapping of jobs to resources can be abstracted using the concept of Service Level Agreements (SLAs)
SLA: contract negotiated between
–resource provider, e.g. local scheduler
–resource consumer, e.g. grid scheduler, application
SLAs provide a uniform approach for the client to
–specify resource and QoS requirements, while
–hiding from the client details about the resources, such as queue names and current workload

28 GGF/OGF – GRAAP Working Group
Goal: Defining Web-service-based protocols for negotiation and agreement management
WS-Agreement Protocol:

29 Requirements
Execution Management
–services, software, data/storage, compute, network
Aspects: persistence, support business models

30 GGF/OGF-WG DRMAA
GGF Working Group "Distributed Resource Management Application API"
From the charter:
Develop an API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems.
The scope of this specification is all the high level functionality which is necessary for an application to consign a job to a DRM system including common operations on jobs like termination or suspension.
The objective is to facilitate the direct interfacing of applications to today's DRM systems by application's builders, portal builders, and Independent Software Vendors (ISVs).

31 Requirements
Accounting and Billing
–providing economic/financial services
–foundation of business models
Aspects: persistence, support business models

32 Scheduling in Future Generation Grids Outlook on future Grid Resource Management and Scheduling

33 Limitations of Current Grid RMS
The interaction between local scheduling and higher-level Grid scheduling is currently a one-way communication
–current local schedulers are not optimized for Grid use
–limited information is available about future job execution
–a site is usually selected by a Grid scheduler and the job enters the remote queue.
The decision about job placement is inefficient.
–The actual time of job execution is usually not known
–Co-allocation is a problem as many systems do not provide advance reservation

34 Example of Grid Scheduling Decision Making
[Figure: a Grid User submits a job to the Grid-Scheduler, which must choose among Machine 1, Machine 2 and Machine 3, each with its own local Scheduler, Job-Queue and Schedule over time. Load figures shown for the three machines: 15 jobs running, 20 jobs queued; 5 jobs running, 2 jobs queued; 40 jobs running, 80 jobs queued.]
Where to put the Grid job?

35 Available Information from the Local Schedulers
Decision making is difficult for the Grid scheduler
–limited information about local schedulers is available
–available information may not be reliable
Possible information:
–queue length, running jobs
–detailed information about the queued jobs: execution length, process requirements, …
–tentative schedule of future job executions
This information is often technically not provided by the local scheduler.
In addition, this information may be subject to privacy concerns!

36 Consequence
Consider a workflow with 3 short steps (e.g. 1 minute each) that depend on each other
Assume available machines with an average queue waiting time of 1 hour.
The Grid scheduler can only submit the subsequent step once the previous job step has finished.
Result:
–The completion time of the workflow may be larger than 3 hours (compared to 3 minutes of execution time)
–Current Grids are suitable for simple jobs, but still quite inefficient in handling more complex applications
Need for better coordination of higher- and lower-level scheduling!
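
The arithmetic behind this result is small enough to write out. The Python sketch below uses the slide's numbers; the function name `workflow_completion_minutes` is an assumption, and the model simply charges every step the full average queue wait.

```python
def workflow_completion_minutes(steps, run_minutes, queue_wait_minutes):
    # Each dependent step is submitted only after its predecessor finishes,
    # so every step pays the queue wait again before it can run.
    return steps * (queue_wait_minutes + run_minutes)


total = workflow_completion_minutes(steps=3, run_minutes=1, queue_wait_minutes=60)
print(total)  # 183 minutes, i.e. just over 3 hours for 3 minutes of work
```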

37 Example Grid Scenario
Assume a data-intensive simulation that should be visualized and steered during runtime!
[Figure labels: Remote Center; Reads and Generates TB of Data; LAN/WAN Transfer; WAN Transfer; Compute Resources; Visualization]

38 Resource Request of a Simple Grid Job
A specified architecture with
–48 processing nodes,
–1 GB of available memory, and
–a specified licensed software package
–for 1 hour between 8am and 6pm of the following day
Time must be known in advance.
A specific visualization device during program execution
Minimum bandwidth between the VR device and the main computer during program execution
Input: a specified data set from a data repository
at most 4
–preference of cheaper job execution over an earlier execution.
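
Part of such a request can be written down as a small data structure that a scheduler checks offers against. The Python sketch below covers only the node, memory and time-window constraints from the slide; the class name `ResourceRequest`, its fields, and the reading that the job must finish by 6pm are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    nodes: int = 48           # processing nodes
    memory_gb: int = 1        # available memory
    duration_hours: int = 1   # requested run time
    earliest_hour: int = 8    # window opens at 8am
    latest_hour: int = 18     # window closes at 6pm (assumed: job ends by then)

    def accepts(self, offer_nodes, offer_memory_gb, start_hour):
        # An offer fits if it has enough capacity and the whole run
        # lies inside the requested time window.
        fits_resources = (
            offer_nodes >= self.nodes and offer_memory_gb >= self.memory_gb
        )
        fits_window = (
            self.earliest_hour <= start_hour
            and start_hour + self.duration_hours <= self.latest_hour
        )
        return fits_resources and fits_window


request = ResourceRequest()
print(request.accepts(offer_nodes=64, offer_memory_gb=2, start_hour=17))  # True
print(request.accepts(offer_nodes=64, offer_memory_gb=2, start_hour=18))  # False
```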

39 Example: Coordinated Simulation and Visualization
Expected output of a Grid scheduler:
[Figure: timeline of coordinated allocations. Labels: Data Access; Storing Data; Loading Data; Providing Data; Data Storage, Storage resources; Data Transfer, Network 1; Parallel Computation, Computer 1; Parallel Computation, Computer 2; Communication for Computation, Network 3; Communication for Visualization, Network 2; Visualization, VR-Cave; Software Usage, Software License; time axis]
Reservations are necessary!

40 Conclusions for Grid Scheduling
Grids ultimately require coordinated scheduling services.
Support for different scheduling instances
–different local management systems
–different scheduling algorithms/strategies
For arbitrary resources
–not only computing resources, but also data, storage, network, software, etc.
Support for co-allocation and reservation
–necessary for coordinated grid usage (see data, network, software, storage)
Different scheduling objectives
–cost, quality, other

41 Grid-Level Scheduler
Discovers & selects the appropriate resource(s) for a job
If the selected resources are under the control of several local schedulers, a meta-scheduling action is performed
Architecture:
–Centralized: all lower-level schedulers are under the control of a single Grid scheduler (not realistic in global Grids)
–Distributed: lower-level schedulers are under the control of several Grid scheduler components; a local scheduler may receive jobs from several components of the Grid scheduler

42 Grid Scheduling Scenarios – Example I

43 Grid Scheduling Scenarios – Example II

44 Grid Scheduling Scenarios – Example III

45 Towards Grid Scheduling
Grid Scheduling Methods:
–Support for individual scheduling objectives and policies
–Multi-criteria scheduling models
–Economic scheduling methods for Grids
Architectural requirements:
–Generic job description
–Negotiation interface between higher- and lower-level scheduler
–Economic management services
–Workflow management
–Integration of data and network management

46 Scheduling Objectives in the Grid
In contrast to local computing, there is no general scheduling objective anymore
–minimizing response time, minimizing cost
–tradeoff between quality, cost, response time, etc.
Cost and different service quality come into play
–the user will introduce individual objectives
–the Grid can be seen as a market where resources are competing alternatives
Similarly, the resource provider has individual scheduling policies
Problem:
–the different policies and objectives must be integrated in the scheduling process
–different objectives require different scheduling strategies
–part of the policies may not be suitable for public exposure (e.g. different pricing or quality for certain user groups)

47 Grid Scheduling Algorithms
Due to the mentioned requirements in Grids, it is not to be expected that a single scheduling algorithm or strategy is suitable for all problems.
Therefore, there is a need for an infrastructure that
–allows the integration of different scheduling algorithms
–allows the individual objectives and policies to be included
–leaves resource control with the participating service providers
Transition into a market-oriented Grid scheduling model

48 Economic Scheduling
Market-oriented approaches are a suitable way to implement the interaction of different scheduling layers
–agents in the Grid market can implement different policies and strategies
–negotiations and agreements link the different strategies together
–participating sites stay autonomous
Need for suitable scheduling algorithms and strategies for creating and selecting offers
–need for creating the Pareto-optimal scheduling solutions
Performance relies highly on the available information
–negotiation can be a hard task if many potential providers are available.
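
To illustrate what Pareto-optimal means when selecting among offers, the Python sketch below filters a list of provider offers down to those not dominated in both cost and completion time. The two criteria, the tuple layout and the name `pareto_front` are assumptions for illustration; the talk does not prescribe a particular algorithm.

```python
def pareto_front(offers):
    """Keep the offers that no other offer beats on both criteria.

    offers is a list of (provider, cost, completion_time); lower is better
    for both cost and completion time.
    """
    def dominates(a, b):
        # a is at least as good as b on both criteria and strictly better on one.
        return a[1] <= b[1] and a[2] <= b[2] and (a[1] < b[1] or a[2] < b[2])

    return [o for o in offers if not any(dominates(other, o) for other in offers)]


offers = [("A", 4, 10), ("B", 2, 20), ("C", 5, 12), ("D", 3, 20)]
print(pareto_front(offers))  # [('A', 4, 10), ('B', 2, 20)]
```

Offer C loses to A on both criteria and D loses to B on cost, so only A and B remain as genuine trade-offs between price and speed.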

49 Economic Scheduling (2)
Several possibilities for market models:
–auctions of resources/services
–auctions of jobs
Offer-request mechanisms support:
–inclusion of different cost models, price determination
–individual objective/utility functions for optimization goals
Market-oriented algorithms are considered:
–robust
–flexible in case of errors
–simple to adapt
–markets can have unforeseeable dynamics
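
An individual objective function can be as simple as a weighted sum over the terms of each offer. The Python sketch below picks the offer with the lowest weighted score; the linear form, the weights and the name `best_offer` are assumptions for illustration, and a user who prefers cheaper execution over earlier execution would simply give cost the larger weight.

```python
def best_offer(offers, cost_weight, time_weight):
    """Pick the offer with the lowest weighted sum of cost and completion time.

    offers is a list of (provider, cost, completion_time).
    """
    return min(offers, key=lambda o: cost_weight * o[1] + time_weight * o[2])


candidate_offers = [("A", 4, 10), ("B", 2, 20)]

# Cost matters most: the cheaper but slower offer wins.
print(best_offer(candidate_offers, cost_weight=1.0, time_weight=0.1))  # ('B', 2, 20)
# Time matters most: the faster but more expensive offer wins.
print(best_offer(candidate_offers, cost_weight=0.1, time_weight=1.0))  # ('A', 4, 10)
```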

50 Conclusions
Key Challenges for FGG RMS
–Cooperation: interoperability between Grid-RMS implementations and types, and between Grid-RMS and local RM systems
–Interoperability through well-defined interfaces: identification and adaptation
–Scalability: a domain-specific implementation may have limited scalability, but the general architecture should cover millions of resources.
–Fault-tolerance: resources and instances of core services
–Common security model
The RMS should be invisible to the user and provide a pervasive common architecture allowing different implementations while maintaining interoperability.

