Presentation is loading. Please wait.

Presentation is loading. Please wait.

Condor-G: A Computation Management Agent for Multi-Institutional Grids James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, Steven Tuecke Reporter: Fu-Jiun.

Similar presentations


Presentation on theme: "Condor-G: A Computation Management Agent for Multi-Institutional Grids James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, Steven Tuecke Reporter: Fu-Jiun."— Presentation transcript:

1 Condor-G: A Computation Management Agent for Multi-Institutional Grids James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, Steven Tuecke Reporter: Fu-Jiun Lu 2009/12/17

2 2 Abstract We present the Condor-G system, which leverages software from Globus and Condor to allow users to harness multi-domain resources as if they all belong to one personal domain. We describe the structure of Condor-G and how it handles job management, resource selection, security, and fault tolerance.

3 Outline Introduction Large-scale sharing of computational resources Grid protocol overview –GSI 、 GRAM 、 MDS 、 GASS Computation management : Condor-G Agent “GlideIn” mechanism Conclusion 3

4 Introduction ●They want to be able to discover, acquire, and reliably manage computational resources dynamically, in the course of their everyday activities. ●They do not want to be bothered with the location of these resources, the mechanisms that are required to use them, with keeping track of the status of computational tasks operating on these resources, or with reacting to failure. ●They do care about how long their tasks are likely to run and how much these tasks will cost. 4

5 Introduction (Cont.) ﻪWe combine the inter-domain resource management protocols of the Globus Toolkit and the intra-domain resource management methods of Condor to allow the user to harness multi-domain resources as if they all belong to one personal domain. 5

6 Large-scale sharing of computational resources Different sites may feature different authentication and authorization mechanisms, schedulers, operating systems, etc.. The user has little knowledge of the characteristics of resources at remote sites, and no easy means of obtaining this information. Due to the distributed nature of the multi-site computing environment, computers, networks, and subcomputations can fail in various ways. Keeping track of the status of different elements of a computation involves tedious bookkeeping, especially in the event of failure and dependencies among subcomputations. 6

7 Large-scale sharing of computational resources (Cont.) Remote resource access Computation management Remote execution environment 7

8 Grid security infrastructure (GSI) ﻪThis authentication and authorization system makes it possible to authenticate a user just once, using public key infrastructure (PKI) mechanisms to verify a user-supplied “Grid credential.” 8

9 GRAM protocol and implementation GSI security mechanisms ﻩGSI security mechanisms are used in all operations to authenticate the requestor and for authorization Two-phase commit ﻩEach request from a client is accompanied by a unique sequence number, which is also included in the associated response. Resource-side fault tolerance ﻩresource may often contain multiple processors with specialized “interface” machines running the GRAM server. 9

10 MDS protocols and implementation GRRP ﻩA resource uses the GRRP to notify other entities that it is part of the Grid. GRIP ﻩThose entities can then use the GRIP to obtain information about resource status. 10

11 GASS The GASS service provides mechanisms for transferring data between a remote HTTP, FTP, or GASS server. 11

12 Computation management: The Condor-G agent User interface ﻩsubmit jobs, indicating an executable name, input/ output files and arguments. ﻩquery a job’s status, or cancel the job. ﻩbe informed of job termination or problems, via callbacks or asynchronous mechanisms such as email. ﻩobtain access to detailed logs, providing a complete history of their jobs’ execution. 12

13 Condor-G 13

14 Supporting remote execution stages a job's standard I/O and executable using GASS. submits a job to a remote machine using the revised GRAM job request protocol. subsequently monitors job status and recovers from remote failures using the revised GRAM protocol and GRAM callbacks and status calls. authenticating all requests via GSI mechanisms. 14

15 Condor-G is built to tolerate four types of failure Crash of the Globus Job Manager Crash of the machine that manages the remote resource Crash of the machine on which the Grid Manager is executing Failures in the network connecting the two machines. 15

16 Credential management periodically analyzing the credentials for all users with currently queued jobs. To reduce user hassle in dealing with expired credentials, Condor-G could be enhanced to work with a system like “My Proxy.” 16

17 Resource discovery and scheduling Available resources can be ranked by user preferences such as allocation cost and expected start or completion time. One promising approach to constructing such a resource broker is to use the Condor Matchmaking framework to implement the brokering algorithm. 17

18 GlideIn mechanism Condor-G uses standard Condor mechanisms to match locally queued jobs with the resources advertised. It runs each user task received in a “sandbox,” using system call trapping technologies provided. It periodically checkpoints the job to another location and migrates the job to another location. 18

19 Conclusion Condor-G mechanisms complement this work by addressing issues of uniform remote access, failure, credential expiry, etc. Condor-G could potentially be used as a backend for an application level scheduling system. 19


Download ppt "Condor-G: A Computation Management Agent for Multi-Institutional Grids James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, Steven Tuecke Reporter: Fu-Jiun."

Similar presentations


Ads by Google