Presentation is loading. Please wait.

Presentation is loading. Please wait.

The Grid,Globus Tool Kit, Condor – G and What?. Outline 1. The “Grid Problem” 2. The “Grid Architecture” 3. The “Globus Toolkit” 4. The “Condor-G”

Similar presentations


Presentation on theme: "The Grid,Globus Tool Kit, Condor – G and What?. Outline 1. The “Grid Problem” 2. The “Grid Architecture” 3. The “Globus Toolkit” 4. The “Condor-G”"— Presentation transcript:

1 The Grid,Globus Tool Kit, Condor – G and What?

2 Outline 1. The “Grid Problem” 2. The “Grid Architecture” 3. The “Globus Toolkit” 4. The “Condor-G”

3 Culled From / http://www.globus.orgwww.globus.org / The Anatomy of the Grid: Enabling Scalable Virtual Organizations. I. Foster, C. Kesselman, S. Tuecke. International J. Supercomputer Applications, 15(3), 2001 / Globus: A Metacomputing Infrastructure Toolkit. I. Foster, C. Kesselman. Intl J. Supercomputer Applications, 11(2):115-128, 1997 / Condor-G: A Computation Management Agent for Multi- Institutional Grids. J.Frey, T.Tannenbaum, I.Foster, M.Livny,S.Tuecke. Proc. of HPDC10, 2001 / http://charm.cs.uiuc.edu/ppl_research/faucets/

4 "The Grid" /Coined in 1990's to denote a proposed distributed computing architecture. /"Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources" -From "The Anatomy of the Grid" /Resource Sharing  Computers,Storage,Sensors, Networks, Scientific Instruments  sharing is highly controlled -- Providers & Consumer define  What is shared  Who is allowed to share  Conditions for sharing /Coordinated problem solving  Beyond client-server: distributed data analysis, visualization,computation, collaboration /Similar to the Power Grid, Faucets (Water supply), Nationwide Phone System.

5 Virtual Organization /A set of individual/institutions defined by some set of sharing rules form a Virtual Organization (VO). /VO may contain  Application Service Providers (ASP)  Storage Service Providers (SSP)  Cycle Providers /They lack  Central Control  Central Location  Existing Trust Relationships

6 Why Grids? / Civil Engineers collaborate to design, execute & analyze shake table experiments / Climate Scientists visualize, annotate & analyze terabytes of simulation datasets / An Emergency response team couples real-time data, weather model, population data / An application service provider purchases cycles from compute cycle provider / A biomedical engineer exploits 10K computers to screen 100K compounds in a hour. / NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers with each other. (Argonne,NCSA,Michigan,UIUC,USC)

7 DOE X-ray grand challenge: ANL, USC/ISI, NIST, U.Chicago tomographic reconstruction real-time collection wide-area dissemination desktop & VR clients with shared controls Advanced Photon Source Online Access to Scientific Instruments archival storage

8 Image courtesy Harvey Newman, Caltech Data Grids for High Energy Physics Tier2 Centre ~1 TIPS Online System Offline Processor Farm ~20 TIPS CERN Computer Centre FermiLab ~4 TIPS France Regional Centre Italy Regional Centre Germany Regional Centre Institute Institute ~0.25TIPS Physicist workstations ~100 MBytes/sec ~622 Mbits/sec ~1 MBytes/sec There is a “bunch crossing” every 25 nsecs. There are 100 “triggers” per second Each triggered event is ~1 MByte in size Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server Physics data cache ~PBytes/sec ~622 Mbits/sec or Air Freight (deprecated) Tier2 Centre ~1 TIPS Caltech ~1 TIPS ~622 Mbits/sec Tier 0 Tier 1 Tier 2 Tier 4 1 TIPS is approximately 25,000 SpecInt95 equivalents

9 Is it Really NEW? / “Grid Computing” has much in common with the existing industrial thrusts  Application & Storage service providers (ASPs, SSPs)  Internet & Peer-to-Peer computing  Enterprise Computing Systems  Business-to-Business exchanges / SSPs & ASPs allow organizations to outsource storage & computing requirements to other parties (typically via a VPN) / Enterprise distributed computing technologies (CORBA, Enterprise Java) enable resource sharing within a single organization / Business-to-Business & virtual enterprise technologies exchanges focus on information sharing via central servers.

10 Is it NEW? / Sharing is not adequately addressed by these technologies.  Complicated Requirements: “run program X at site Y subject to community policy P, using data at Z according to the policy Q”  High Performance requirements of most of the applications.  Users may not care where their program may run, as long as it satisfies their QoS requirements (Faucets) / “controlled, dynamic sharing within VOs” / Current doesn’t accommodate the range of resource types or doesn’t provide the flexibility and control on sharing relationships to establish VOs

11 But Why Now??? / The internet & the increasing use of wireless devices provides the universal connectivity. / Many current research projects need teamwork, collaboration. / Network Vs. Computer Performance  Computer speed doubles every 18 months  Network speed doubles every 9 months Moore’s Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan-2001) by Cleo Vilett, source Vined Khoslan, Kleiner, Caufield and Perkins.

12 Major Grid Projects NameURL & Sponsors Focus BlueGridIBMGrid testbed linking IBM laboratories DISCOMwww.cs.sandia.gov/ discom DOE Defense Programs Create operational Grid providing access to resources at three U.S. DOE weapons laboratories DOE Science Grid sciencegrid.org DOE Office of Science Create operational Grid providing access to resources & applications at U.S. DOE science laboratories & partner universities Earth System Grid (ESG) earthsystemgrid.org DOE Office of Science Delivery and analysis of large climate model datasets for the climate research community European Union (EU) DataGrid eu-datagrid.org European Union Create & apply an operational grid for applications in high energy physics, environmental science, bioinformatics

13 Major Grid Projects NameURL/SponsorFocus EuroGrid, Grid Interoperability (GRIP) eurogrid.org European Union Create tech for remote access to supercomp resources & simulation codes; in GRIP, integrate with Globus Toolkit™ Fusion Collaboratory fusiongrid.org DOE Off. Science Create a national computational collaboratory for fusion research Globus Project globus.org DARPA, DOE, NSF, NASA, Msoft Research on Grid technologies; development and support of Globus Toolkit™; application and deployment GridPP gridpp.ac.uk U.K. eScience Create & apply an operational grid within the U.K. for particle physics research Information Power Grid Ipg.nasa.govCreate & apply a production Grid for aero sciences and other NASA missions Grid Research Integration Dev. & Support Center grids-center.org NSF Integration, deployment, support of the NSF Middleware Infrastructure for research & education

14 www.teragrid.org

15 www.ivdgl.org iVDGL: International Virtual Data Grid Laboratory Tier0/1 facility Tier2 facility 10 Gbps link 2.5 Gbps link 622 Mbps link Other link Tier3 facility

16 Grid Architecture

17 Why Do We Need It? / To structure the development of new technology  Common Vocabulary, Guidance, Perspective / To  Identify the fundamental system components  Specify the purpose and function of these components  Indicate how these components interact  Emphasizes  identification and definition of protocols and services  APIs and SDKs  A Protocol architecture – mechanism for VO users and resources to negotiate, manage sharing relationships  Facilitates  Extensibility  Interoperability

18 / Why is interoperability a fundamental concern? / Standard protocols  standard services to abstract away resource specific details. / To accelerate application development in complex and dynamic execution environments we need APIs,SDKs / The Technology + Services  middleware

19 / Analogy to Internet Architecture / High level description – places few constraints on design & implementation Application Fabric “Controlling things locally”: Access to, & control of, resources Connectivity “Talking to things”: communication (Internet protocols) & security Resource “Sharing single resources”: negotiating access, controlling use Collective “Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services Internet Transport Application Link Internet Protocol Architecture Layered Grid Architecture

20 Features / Open and Extensible  Built on Internet protocols & services  Communication, routing, name resolution, etc.  “Layering” is conceptual  No constraints on who can call what  Protocols/Services/APIs/SDKs will (ideally) be largely self-contained  Things like communication, security are very fundamental  Advantageous for higher layer functions to use common lower-level functions

21 The Hourglass Model Diverse global services Core services Local OS A p p l i c a t i o n s Resource and connectivity form the “neck in the hour glass” Designed to be implemented on top of diverse range of resource types (fabric layer) Can be used to construct wide range of global and application specific Services (collective layer)

22 What's the Status? / No “official” standards yet. / Globus Toolkit is the “unofficial” standard for several connectivity, resource, collective protocols / Global Grid Forum (GGF) has Grid Protocol Architecture group / In security some RFCs are available / In scheduling & resource management some working documents are available

23 Globus Toolkit / A software toolkit (GTK) to address key issues to pave the road / Offers a set of technologies (NCSA’s “Grid-in-a-box”) / Try to standardize the Grid protocols and APIs / Open Architecture & Open Source (Reference implementation) / Define Grid Protocols & APIs / Integrate and extend existing protocols  provides software tools that make it easier to build computational grids and grid-based applications / Learn from the experiences gained through deployment and applications

24 Key Components / Security / Communication / Information Infrastructure / Fault Detection / Resource Management / Portability / Data Management

25 Fabric Layer / What do you expect? -- diverse resource that may be shared / Can be a logical entity such as a distributed file system, computer cluster – But the Grid architecture don’t care / Components implement the local, resource-specific operations that occur on specific resources (physical or logical) / Trade-off (Rich fabric functionality vs. Easy of deployment)  Richer functionality  more sophisticated sharing operations (e.g., reservation)  Few demands  simplified Grid infrastructure deployment.  Should implement enquiry mechanisms and resource management

26 Fabric in Globus Toolkit / Is designed to use existing fabric components / If a vendor doesn’t provide the necessary behavior, GTK includes the missing piece / Resource management, is generally the domain of local resource managers. / GARA (General - purpose Architecture for Reservation and Allocation) can provide QoS for different types resources

27 Connectivity / Communication  Enable exchange of data between fabric layer resources  Include transport, routing, naming (TCP,IP,DNS)  Authentication  Build on communication protocols  Uniform authentication, authorization and message protection in multi institutional scenario  Based on existing standards whenever possible  Various Requirements  Single sign on (access to multiple Grid resources without user intervention)  Delegation (user’s program is able to access resources on which user is authorized)  Integration with various local security solutions (identity mapping)  User-based trust relationships (must not require for various resources to cooperate in configuring security environment)  stake holder should have the final control the authorization decisions

28 Security in GTK / Grid Security Infrastructure (GSI) is based on  Public key  X.509 certificates  SSL/TLS communication  GSS-API (Generic Authorization and Access)  Extensions are added for single sign-on and delegation  Stakeholder control is supported via GAA (Generic Authorization and Access).  GSI adheres to the IETF’s standard GSS-API

29 Site A (Kerberos) Site B (Unix) Site C (Kerberos) Computer User Single sign-on via “grid-id” & generation of proxy cred. Or: retrieval of proxy cred. from online repository User Proxy Proxy credential Computer Storage system Communication* GSI-enabled FTP server Authorize Map to local id Access file Remote file access request* GSI-enabled GRAM server GSI-enabled GRAM server Remote process creation requests* * With mutual authentication Process Kerberos ticket Restricted proxy Process Restricted proxy Local id Authorize Map to local id Create process Generate credentials Ditto GSI Example Scenario “Create Processes at A and B that Communicate & Access Files at C”

30 Resource Layer / Concerned entirely with individual resources / Addresses  Resource discovery  Reservation & Allocation  Monitoring & Control  Secure Negotiation / Two primary classes of protocols 1. Information protocols – to obtain information about the structure & state of a resource 2. Management Protocols – “policy application point” / “in the neck of hourglass”  the number of such protocols should be limited to small and focused set.

31 Resource Layer Components in GTK / GRIP (Grid Resource Information Protocol)  Based on LDAP  Defines standard resource information / GRRP (Grid Resource Registration Protocol)  To register resources with Grid Index Information Servers  GRAM (Grid Resource Access and Management)  HTTP based RPC  Used for allocation of computational resources  for monitoring and controlling of computation on these resources / GridFTP  allows grid applications to have secure, ubiquitous, high-performance access to data  uses the GSI for authentication  new extensions to the FTP protocol for parallel data transfer partial file transfer third-party (server-to-server) data transfer

32 Collective Layer / Coordinate multiple resources / Protocols & services (APIs,SDKs) are not associated with any resource but rather are global in nature / Capture interactions across collections of resources / Examples:  Index servers aka Metadirectory services – custom views on dynamic resource collections  Resource Brokers (e.g., Condor-G Matchmaker, AppLes, Nimrod-G, DRM broker) – Resource discovery & allocation  Replica catalogs & services  Co-reservation & Co-allocation services  Software discovery services – select best s/w implementation and platform based on the problem parameters (e.g., NetSolve, Ninf)  Community accounting and payment services  Collaboratory services (e.g., CAVERNsoft)

33 Information Infrastructure in GTK / An infrastructure that provides coherent system information spanning virtual organizations is necessary / MDS (Metacomputing Directory Service)  Uses LDAP (Lightweight Directory Access Protocol)  Provides uniform means of querying system information from a rich variety of system components  uniform namespace for resource information across a system that may involve many organizations.  GRIS (Grid Resource Information Service)  a uniform means of querying resources for their current configuration, capabilities, and status  GIIS (Grid Index Information Service)  a means of knitting together arbitrary GRIS services to provide a coherent system image.  GIISes provide a mechanism for identifying "interesting" resources

34 Resource Management Services in GTK / Three main components 1. RSL (Resource Specification Language)  to communicate resource requirements 2. GRAM (Grid Resource Allocation Management)  standardized interface to all of the various local resource management tools (e.g., Condor,LSF,PBS) 3. DUROC (Dynamically-Updated Request Online Co-allocator)  coordinates a single request that may span multiple GRAMs / Resource Broker  handle the mapping of high level application requests into requests to individual resource managers

35 GRAM LSFCondorPBS Application RSL Simple ground RSL Information Service Local resource managers RSL specialization Broker RSL Co-allocator Queries & Info Resource Management Architecture

36 GRAM Components Grid Security Infrastructure Job Manager GRAM client API calls to request resource allocation and process creation. MDS client API calls to locate resources Query current status of resource Create RSL Library Parse Request Allocate & create processes Process Monitor & control Site boundary ClientMDS: Grid Index Info Server Gatekeeper MDS: Grid Resource Info Server Local Resource Manager MDS client API calls to get resource info GRAM client API state change callbacks

37 GRAM / GRAM-1  HTTP-based RPC / GRAM-1.5 (Reliability improvements)  Once-and-only-once submission  Reliable termination detection  GRAM-2  towards integration with web-services (SOAP)  OGSA (Open Grid Services Architecture)  Gate Keeper  Single point of entry  “secure inetd”  Job Manager  Layers on top of local resource management system (e.g., PBS,LSF, etc)  Handles remote interaction with the job

38 Data Management in GTK / "Access to distributed data is typically as important as access to distributed computational resources“ - Globus / Tools for managing data in Grid systems and applications / Also called “Data Grid”. / GridFTP / Data Replication  Two tools for managing data replicas: multiple copies of data stored in different systems to improve access across geographically-distributed Grids  Replica Catalog – based on LDAP directory  Replica Manager – combines the Replica Catalog with GridFTP to manage data replication  GASS (Global Access to Secondary Storage)  Allows applications to access data stored in any remote file system by specifying a URL  Can be in HTTP, FTP

39 Condor-G A Computation Management Agent for Multi-Institutional Grid

40 What is Condor-G? / Condor enhanced with Globus Toolkit components to “harness multi-domain resources as if they all belong to one personal domain” / Example of applying the general purpose GTK components to solve a particular problem (i.e., high-throughput computing on the Grid)

41 Separation of Concern 1. Remote Resource Access  Secure remote resource discovery,allocation, management  Uses Globus Toolkit components 2. Computation Management  Via user computation management agent responsible for job submission, job management, error recovery  Taken from Condor system 3. Remote Execution  Via mobile sandboxing – to create a user tailored execution environment on a remote node  Taken from Condor system

42 Why Condor for Grid Jobs?? / Adv. Of using condor-G to manage Grid jobs  Can Query a job’s status or cancel a job  Credential management  Get informed of job termination or problems via callbacks or asynchronous mechanisms such as email  Access to detailed logs with complete history of the jobs’ execution  Fault tolerance and exactly once semantics

43 Job Execution / Stages a job’s standard I/O and executable using GASS / Submits jobs to remote machines using revised GRAM(1.5)  Job manager checkpoint & restart  Two-phase commit during job submission  Monitors job status & recovers from remote failures using revised GRAM callbacks and status calls  Condor-G handles resubmission of failed jobs, communications with the user concerning unusual & erroneous conditions, recording of computation status to support restart

44 Execution Mechanism

45 Fault Tolerance / Tolerates four types of failure  Local Crash 1. Crash of the host on which GridManager is running (or crash of the GridManager alone)  Queue state stored on disk  Reconnect to the JobManagers that were running at the time of crash  Remote Crash 2. Crash of the GlobusJobManager  Start a new JobManager 3. Crash of the machine that hosts the remote resource (GateKeeper,JobManager)  Wait until connectivity returns  Start a new JobManager  Network Failures

46 Credential Management / Authentication in is done with limited-lifetime X509 proxies / Credential may expire before the job finishes execution / Condor-G agent periodically analyzes the credentials for all users with currently queued jobs / Can put jobs on hold and e-mail user to refresh proxy / Can forward new credentials to execution sites  Using the MyProxy system, which lets a user store a long-lived proxy credential on a secure server.  Remote services acting on behalf of user can obtain short-lived proxies  Condor-G can use these to refresh the user credential automatically

47 Resource Discovery & Scheduling / Simple – user supplies list of GRAM servers / Resource broker in Condor-G agent – Condor Matchmaking / “flood” candidate resources – “Glidelns”

48 GlideIn Mechanism / Use same codor mechanism to start on a remote node not a user job, but a daemon / The deamon traps system calls made by user’s job and redirects back to the originating system / Periodically checkpoints the job and migrates job to another location if it is requested / The Condor-G GlideIn mechanism uses Grid protocols to dynamically create a personal condor pool out of Grid resources by “gliding-in” Condor daemons to remote resource / Allows to delay the binding of an application to a resource  Prevent queuing delays  Can guarantee optimal queuing times to users

49 Remote Execution via GlideIn

50 Grid Architecture in Practice Compute Resource SDK API Access Protocol Checkpoint Repository SDK API C-point Protocol High Throughput Computing System Dynamic checkpoint, job management, failover, staging Brokering, certificate authorities Access to data, access to computers, access to network performance data Communication, service discovery (DNS), authentication, authorization, delegation Storage systems, schedulers Collective (App) App Collective (Generic) Resource Connect Fabric

51 Future & Conclusions / Evolution  Past-Present: O(10 2 ) computers; Mb/s networks; local (centralized) control  Present: O(10 4 -10 6 ) data systems, computers; Gb/s networks; restricted decentralized control  Future: O(106-109) data,sensors,computer,instruments; highly flexible policy,control  “A computer (includes software) is a dynamically, often collaboratively constructed collection of processors,data sources,networks,sensors,instruments”  “Open the faucet get the water – Connect to the Grid get the compute power”  We need a powerful computational economy model (Bidding systems – new optimization algorithms)

52 Summary / The Grid Problem: Resource sharing & coordinated problem solving in dynamic multi-institutional virtual organizations / Grid Architecture: Emphasizes protocol and service definition to enable interoperability and resource sharing / Globus Toolkit: a source of protocol and APIs, reference implementation / Condor-G: applies general purpose Globus Toolkit to solve high- throughput computing on the Grid

53 Thanks for Ur Patience


Download ppt "The Grid,Globus Tool Kit, Condor – G and What?. Outline 1. The “Grid Problem” 2. The “Grid Architecture” 3. The “Globus Toolkit” 4. The “Condor-G”"

Similar presentations


Ads by Google