3 1. INTRODUCTION Grid user requirements They want to be able to discover, acquire, and reliably manage computational resources dynamically, in the course of their everyday activitiesThey do not want to be bothered with the location of these resources, the mechanisms that are required to use them, with keeping track of the status of computational tasks operating on these resources, or with reacting to failureThey do care about how long their tasks are likely to run and how much these tasks will cost
4 Solution: The Condor-G Leverages software from Globus and Condor.“allows the user to control multi-domain resources as if they all belong to one personal domain “Globus Toolkit : inter-domain resource management protocols.Condor: intra-domain resource management methods.
5 2. Large-scale sharing of computational resources How to build and manage a multi-site computation that uses resources that belong to different sites?DIFFICULTIES:Different sites may feature different authentication and authorization mechanisms, schedulers, hardware architectures, operating systems, file systems, etc.The user has little knowledge of the characteristics of resources at remote sites, and no easy means of obtaining this informationDue to the distributed nature of the multi-site computing environment, computers, networks, and subcomputations can fail in various ways.Keeping track of the status of different elements of a computation involves tedious bookkeeping, especially in the event of failure and dependencies among subcomputations.
6 2. Large-scale sharing of computational resources How to build and manage a multi-site computation that uses resources that belong to different sites?APPROACH:Remote resource access issues are addressed by requiring that remote resources speak standard protocols for resource discovery and management.Computation management issues are addressed via the introduction of a robust, multi-functional user computation management agent responsible for resource discovery, job submission, job management, and error recovery. From CondorRemote execution environment issues are addressed via the use of mobile sandboxing technology that allows a user to create a tailored execution environment on a remote node.
7 3. Grid Protocols - Outline Protocols used in the Condor-G system:3.1. GSI (Grid Security Infrastructure)3.2. GRAM (Grid Resource Allocation and Management)3.3. MDS-2 (Monitor and Discovery System)3.4. GASS (Global Access to Secondary Storage)
8 The Globus Toolkit’s Grid Security Infrastructure 3.1. GISThe Globus Toolkit’s Grid Security InfrastructureMakes it possible to authenticate a user just once.Uses Public Key Infrastructure (PKI)GSI employs the user’s private key to create a proxy credential, which serves as a new private-public key pair that allows a proxy (such as the Condor-G agent) to make remote requests on behalf of the user
9 The Grid Resource Allocation and Management 3.2. GRAM protocolThe Grid Resource Allocation and ManagementThe Grid Resource Allocation and Management protocol supports remote submission monitoring and control of a computational request to a remote computational resource. Eg: “run program P”.Uses GSI for authentication/authorization.Two-phase commit (using requests sequences and commit command) .Logs details of all active jobs (useful for crash recovery).
10 Monitor and Discovery System 3.3. MDS protocolsMonitor and Discovery SystemAllows discovering and disseminating information about the structure and state of Grid resources.Uses GSI for access control.The idea:A resource uses the Grid Resource Registration Protocol (GRRP) to notify other entities that it is part of the Grid.Those entities can then use the Grid Resource Information Protocol (GRIP) to obtain information about resource status
11 The Globus Toolkit’s Global Access to Secondary Storage 3.4. GASS serviceThe Globus Toolkit’s Global Access to Secondary StorageProvides mechanisms for transferring data between a remote HTTP, FTP, or GASS serverIn the current context, we use these mechanisms to stage executables and input files to a remote computerGSI mechanisms are used for authentication
12 4. Computation management The Condor-G agent:4.1. User interface4.2. Supporting remote execution4.3. Credential management4.4. Resource discovery and scheduling
13 4.1. User interfaceThe Condor-G agent allows the user to treat the Grid as an entirely local resource, with an API and command line tools that allow the user to perform the following job management operations:Submit jobs, indicating an executable name, input/output files and arguments;Query a job’s status, or cancel the job;Be informed of job termination or problems, via callbacks or asynchronous mechanisms such as ;Obtain access to detailed logs, providing a complete history of their jobs’ execution.
14 4.1. User interfaceThe innovation in Condor-G is that these capabilities are provided by a personal desktop gent and supported in a Grid environment, while guaranteeing fault tolerance and exactly-once execution semantics.providing the user with a familiar and reliable single access point to all the resources he/she is authorized to use.
15 4.2. Supporting remote execution Job Submission Process User indicates jobs to the scheduler.Scheduler creates a GridManager daemon.For each job the GriManager creates a JobManager using two- phase commit GRAM.GASS is used to transfer job executables, input files and to provide output.JobManager submits the jobs to the local scheduling system.
16 4.2. Supporting remote execution Crash Tolerance Condor-G is built to tolerate four types of failure:1. Crash of the Globus JobManager:The GridManager then probes the GateKeeper.If Gatekeeper responds then a new JobManager is started.
17 4.2. Supporting remote execution Crash Tolerance Condor-G is built to tolerate four types of failure:2 & 3. Resource Management Machine Or Network Failure:The GridManager waits until connection is re- established.Then reconnects to the jobManager.
18 4.2. Supporting remote execution Crash Tolerance Condor-G is built to tolerate four types of failure:4. Job Submission Machine:The GridManager gives the jobManager its New IP and PORT.
19 4.3. Credential Management GSI proxy credential is used to authenticate with resorces.Because Proxy credentials expire the agent periodically checks user creentials.When credentials expire the jobs are put on hold and the user is notified.Problem: long tasks will require frequent proxy updates.
20 4.3. Credential Management Solution: MyProxy System (Long-lived proxy credentials) Remote services acting on behalf of the user can then obtain short-lived proxies (e.g. 12 hours) from the server.
21 4.4. Resource discovery and scheduling The Simple Approach:a user-supplied list of GRAM servers.The resource broker:gathers information about available GRAM servers using the Monitor and Discovery System (MDS).User Can then choose from the list of available servers.For the case of high throughput computations “flooding” is applied.
22 5. GlideIn mechanismWhat happens when a job executes on a remote platform where required files are not available and local policy may not permit access to local file systems?Solution:Sandboxing
23 5. GlideIn mechanism The Idea: Starts a daemon on the remote computer that learns about the available settings and resources.Runs each user task in a “sandbox”: where system calls are redirected to the local system.
24 THANKS FATIH UNIVERSITY QUESTIONS? Helton MALAMBANE Computer EngineeringTHANKSQUESTIONS?Helton MALAMBANE