Presentation is loading. Please wait.

Presentation is loading. Please wait.

A Computation Management Agent for Multi-Institutional Grids

Similar presentations


Presentation on theme: "A Computation Management Agent for Multi-Institutional Grids"— Presentation transcript:

1 A Computation Management Agent for Multi-Institutional Grids
FATIH UNIVERSITY Computer Engineering Condor-G A Computation Management Agent for Multi-Institutional Grids Helton MALAMBANE

2 Outline INTRODUCTION Large-scale sharing of computational resources
Grid Protocols Computation management (Condor-G Core) GlideIn mechanism

3 1. INTRODUCTION Grid user requirements
They want to be able to discover, acquire, and reliably manage computational resources dynamically, in the course of their everyday activities They do not want to be bothered with the location of these resources, the mechanisms that are required to use them, with keeping track of the status of computational tasks operating on these resources, or with reacting to failure They do care about how long their tasks are likely to run and how much these tasks will cost

4 Solution: The Condor-G
Leverages software from Globus and Condor. “allows the user to control multi-domain resources as if they all belong to one personal domain “ Globus Toolkit : inter-domain resource management protocols. Condor: intra-domain resource management methods.

5 2. Large-scale sharing of computational resources
How to build and manage a multi-site computation that uses resources that belong to different sites? DIFFICULTIES: Different sites may feature different authentication and authorization mechanisms, schedulers, hardware architectures, operating systems, file systems, etc. The user has little knowledge of the characteristics of resources at remote sites, and no easy means of obtaining this information Due to the distributed nature of the multi-site computing environment, computers, networks, and subcomputations can fail in various ways. Keeping track of the status of different elements of a computation involves tedious bookkeeping, especially in the event of failure and dependencies among subcomputations.

6 2. Large-scale sharing of computational resources
How to build and manage a multi-site computation that uses resources that belong to different sites? APPROACH: Remote resource access issues are addressed by requiring that remote resources speak standard protocols for resource discovery and management. Computation management issues are addressed via the introduction of a robust, multi-functional user computation management agent responsible for resource discovery, job submission, job management, and error recovery. From Condor Remote execution environment issues are addressed via the use of mobile sandboxing technology that allows a user to create a tailored execution environment on a remote node.

7 3. Grid Protocols - Outline
Protocols used in the Condor-G system: 3.1. GSI (Grid Security Infrastructure) 3.2. GRAM (Grid Resource Allocation and Management) 3.3. MDS-2 (Monitor and Discovery System) 3.4. GASS (Global Access to Secondary Storage)

8 The Globus Toolkit’s Grid Security Infrastructure
3.1. GIS The Globus Toolkit’s Grid Security Infrastructure Makes it possible to authenticate a user just once. Uses Public Key Infrastructure (PKI) GSI employs the user’s private key to create a proxy credential, which serves as a new private-public key pair that allows a proxy (such as the Condor-G agent) to make remote requests on behalf of the user

9 The Grid Resource Allocation and Management
3.2. GRAM protocol The Grid Resource Allocation and Management The Grid Resource Allocation and Management protocol supports remote submission monitoring and control of a computational request to a remote computational resource. Eg: “run program P”. Uses GSI for authentication/authorization. Two-phase commit (using requests sequences and commit command) . Logs details of all active jobs (useful for crash recovery).

10 Monitor and Discovery System
3.3. MDS protocols Monitor and Discovery System Allows discovering and disseminating information about the structure and state of Grid resources. Uses GSI for access control. The idea: A resource uses the Grid Resource Registration Protocol (GRRP) to notify other entities that it is part of the Grid. Those entities can then use the Grid Resource Information Protocol (GRIP) to obtain information about resource status

11 The Globus Toolkit’s Global Access to Secondary Storage
3.4. GASS service The Globus Toolkit’s Global Access to Secondary Storage Provides mechanisms for transferring data between a remote HTTP, FTP, or GASS server In the current context, we use these mechanisms to stage executables and input files to a remote computer GSI mechanisms are used for authentication

12 4. Computation management
The Condor-G agent: 4.1. User interface 4.2. Supporting remote execution 4.3. Credential management 4.4. Resource discovery and scheduling

13 4.1. User interface The Condor-G agent allows the user to treat the Grid as an entirely local resource, with an API and command line tools that allow the user to perform the following job management operations: Submit jobs, indicating an executable name, input/output files and arguments; Query a job’s status, or cancel the job; Be informed of job termination or problems, via callbacks or asynchronous mechanisms such as ; Obtain access to detailed logs, providing a complete history of their jobs’ execution.

14 4.1. User interface The innovation in Condor-G is that these capabilities are provided by a personal desktop gent and supported in a Grid environment, while guaranteeing fault tolerance and exactly-once execution semantics. providing the user with a familiar and reliable single access point to all the resources he/she is authorized to use.

15 4.2. Supporting remote execution Job Submission Process
User indicates jobs to the scheduler. Scheduler creates a GridManager daemon. For each job the GriManager creates a JobManager using two- phase commit GRAM. GASS is used to transfer job executables, input files and to provide output. JobManager submits the jobs to the local scheduling system.

16 4.2. Supporting remote execution Crash Tolerance
Condor-G is built to tolerate four types of failure: 1. Crash of the Globus JobManager: The GridManager then probes the GateKeeper. If Gatekeeper responds then a new JobManager is started.

17 4.2. Supporting remote execution Crash Tolerance
Condor-G is built to tolerate four types of failure: 2 & 3. Resource Management Machine Or Network Failure: The GridManager waits until connection is re- established. Then reconnects to the jobManager.

18 4.2. Supporting remote execution Crash Tolerance
Condor-G is built to tolerate four types of failure: 4. Job Submission Machine: The GridManager gives the jobManager its New IP and PORT.

19 4.3. Credential Management
GSI proxy credential is used to authenticate with resorces. Because Proxy credentials expire the agent periodically checks user creentials. When credentials expire the jobs are put on hold and the user is notified. Problem: long tasks will require frequent proxy updates.

20 4.3. Credential Management
Solution: MyProxy System (Long-lived proxy credentials) Remote services acting on behalf of the user can then obtain short-lived proxies (e.g. 12 hours) from the server.

21 4.4. Resource discovery and scheduling
The Simple Approach: a user-supplied list of GRAM servers. The resource broker: gathers information about available GRAM servers using the Monitor and Discovery System (MDS). User Can then choose from the list of available servers. For the case of high throughput computations “flooding” is applied.

22 5. GlideIn mechanism What happens when a job executes on a remote platform where required files are not available and local policy may not permit access to local file systems? Solution: Sandboxing

23 5. GlideIn mechanism The Idea:
Starts a daemon on the remote computer that learns about the available settings and resources. Runs each user task in a “sandbox”: where system calls are redirected to the local system.

24 THANKS FATIH UNIVERSITY QUESTIONS? Helton MALAMBANE
Computer Engineering THANKS QUESTIONS? Helton MALAMBANE


Download ppt "A Computation Management Agent for Multi-Institutional Grids"

Similar presentations


Ads by Google