The Grid - Multi-Domain Distributed Computing Kai Rasmussen Paul Ruggieri.


1 The Grid - Multi-Domain Distributed Computing Kai Rasmussen Paul Ruggieri

2 Topic Overview  The Grid  Types  Virtual Organizations  Security  Real Examples  Grid Tools  Condor  Cactus  Cactus-G  Globus  OGSA

3 The Grid  What is a Grid system?  A highly heterogeneous set of resources that may or may not be maintained by multiple administrative domains  Early idea  Computational resources would be as universally available as electric power

4 “A hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities” - Ian Foster  Resources are distributed across sites and organizations with no centralized point of control  What constitutes a Grid?  Resources coordinated without being subject to centralized control  Uses standard, open protocols and interfaces  Delivers non-trivial qualities of service

5 Grid Types  Computational Grids  Resource: pure CPU  Strength: computationally intensive applications  Data Grids  Shared storage and data  Terabytes of storage space  Sharing of data among collaborators  Fault tolerance  Equipment Grids  Set of resources that surround shared equipment, such as a telescope

6 Virtual Organizations  Grids are multi-domain  Resources are administered by separate departments or institutions  All wish to maintain individual control  A cross-site grouping of collaborators shares resources  “Virtual Organization”

7 Virtual Organizations  Users of a VO share a common goal and trust  Collection of resources, users, and rules governing sharing  Highly controlled - What is shared? Who is sharing? How can resources be used?  One global domain acting over individual collaborating domains

8 Grid Security  Highly distributed nature  VOs spread over many security domains  Authentication  Proving identity  Authorization  Obtaining privileges  Confidentiality & Integrity  Identity and privileges can be trusted

9 Authentication  Certificate Authority (CA)  Entity that signs a certificate proving a user’s identity  The certificate is then used as a credential to access the system  Typically several CAs to prevent a single point of failure/attack  Globus Grid Security Infrastructure (GSI)  Globus’s authentication component  Global security credential is later mapped to a local one  Kerberos tickets or a local username and password  Typically a short-term proxy certificate is generated from the long-term certificate (sketch below)
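
To make the proxy idea concrete, here is a minimal Python sketch using the pyca/cryptography library: a fresh short-lived key pair certified by the user's long-term key. This illustrates the general pattern only, not GSI's actual proxy-certificate profile (RFC 3820); the names and lifetime are assumptions.

```python
# Illustrative sketch, not GSI itself: issue a short-lived "proxy"
# certificate signed by a user's long-term key (pyca/cryptography).
from datetime import datetime, timedelta, timezone
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa

# Long-term identity key pair (in GSI this belongs to the user certificate).
user_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
user_name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "Alice")])

# Fresh key pair for the proxy; if it leaks, it only matters briefly.
proxy_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

now = datetime.now(timezone.utc)
proxy_cert = (
    x509.CertificateBuilder()
    .subject_name(x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "Alice proxy")]))
    .issuer_name(user_name)                      # signed by the long-term identity
    .public_key(proxy_key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(now)
    .not_valid_after(now + timedelta(hours=12))  # short lifetime, as in GSI
    .sign(user_key, hashes.SHA256())
)
```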

10 Authentication  Certification Authority Coordination Group  Maintains a global infrastructure of trusted CA agents  CA must meet standards  Physically secure  Must validate identity with Registration Authorities using official documents or photographic identification  Private keys must be a minimum of 1024 bits and have a maximum one-year lifetime  28 approved CAs in the European Union

11 Security Issues  Delegation  User entrusts a separate entity to perform a task  Entity must be given credentials and trusted to behave  Limit a proxy’s strength  Endow the proxy with a specific purpose

12 Grid Projects  EGEE - Enabling Grids for E-sciencE  70 sites in over 27 countries  Mostly European  40 Virtual Organizations  The GENIUS grid portal is used for submission  Individual collaborators use their own middleware tools to group resources

13 LCG  Large Hadron Collider Computing Grid  Developed the distributed systems needed to support the computation and data needs of LHC physics experiments  EGEE collaborator  100 sites  World’s largest Grid

14

15 Grid 2003  US effort  27 National sites  28000 Processors, 13000 Simultaneous Jobs  Infrastructure for  Particle Physics Grid  Virtual Data Grid Laboratory  Develop Application Grid Laboratory - Grid3  Platform for experimental CS Research  Built on Virtual Data Toolkit  Collection of Globus, Condor and other middleware tools

16 TeraGrid  40 Teraflops of Computational Power  8 National Sites with strong backbone  Used for NSF sponsored High Performance Computing  Mapping the human arterial tree model  TeraShake - Earthquake simulation

17

18 Applications  Climate Monitoring + Simulation  Network Weather Service  Climate Data-Analysis Tool  Both run on the Earth System Grid, which runs on Globus  MEANDER nowcast meteorology  Runs on the Hungarian SuperGrid  ATLAS Challenge  Simulates high-energy proton-proton collisions  Computational Science Simulations  Biology, Fluid Dynamics

19 Grid Tools  Many middleware implementations  Globus  Condor  Condor-G  Cactus-G  OGSA  Solves common Grid problems  Resource discovery/management/allocation  Security/Authentication

20 Condor  Initially developed in 1983 at the University of Wisconsin  Pre-Grid tool  A local resource management system  Allows creation of communities with distributed resources  Communities should grow naturally  Sharing as much or as little as they care to  Sounds like Virtual Organizations

21 Condor  Responsibilities  Job management, scheduling  Resource monitoring and management  Checkpointing and migration  Utilize idle CPUs  Cycle ‘scavenging’

22 Condor Pool  Full set of users and resources in a community  Composed of three entities  Agent  Finds resources and executes jobs  Resource  Advertises itself and how it can be used in the pool  Matchmaker  Knows of all agents and resources  Puts together compatible pairs  A pool is defined by a single matchmaker

23

24 Matchmaking  Problem of centralized scheduling  Resources have multiple owners  Unique use requirements  Matchmaking finds a balance between user and resource needs  ClassAds  Agents advertise requirements  Resources advertise how they can be used

25 Matchmaking  Matchmaker scans all known ClassAds  Creates matching pairs of agents and resources  Informs both parties  The pair is then individually responsible for negotiating and initiating execution of the job  Separation of matching and claiming  Matchmaker is unaware of complicated allocation  Stale information may exist; a resource can deny a match (sketch below)
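
A minimal Python sketch of the matchmaking cycle described above. The toy "ClassAds" are plain dicts carrying a requirements predicate; real ClassAds are a richer declarative language, and all names here are illustrative.

```python
# Toy matchmaker: pair each job ad with a resource that both sides'
# advertised requirements accept.
def matchmake(agent_ads, resource_ads):
    matches, claimed = [], set()
    for job in agent_ads:
        for res in resource_ads:
            if id(res) in claimed:
                continue
            # Each party advertises a predicate over the other's attributes.
            if job["requirements"](res) and res["requirements"](job):
                matches.append((job, res))  # matchmaker only informs the pair;
                claimed.add(id(res))        # claiming/negotiation happens directly
                break
    return matches

job = {"owner": "alice", "requirements": lambda r: r["memory_mb"] >= 512}
machine = {"memory_mb": 1024, "requirements": lambda j: j["owner"] != "mallory"}
print(matchmake([job], [machine]))  # -> one (job, machine) pair
```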

26 Condor Flocking  Linking Condor pools is necessary for collaboration  Sharing of resources beyond the organizational level  Individuals belong to multiple communities  Gateway Flocking  Entire communities are linked  Direct Flocking  Individual collaborators belong to many pools

27 Gateway Flocking  Gateway entity serves as a singular point of access for cross pool communication  Matchmakers talk to Gateways  Gateways talk to Gateways  Transparent to user  Organizational level sharing  Powerful, but difficult to setup and maintain

28 Gateway Flocking

29 Direct Flocking  Agents report to multiple matchmakers  Individual collaboration  Natural idea for users  Less powerful but simpler to build and deploy  Eventually adopted in favor of Gateway Flocking

30 Direct Flocking

31 Cactus  General-purpose, open-source parallel computation framework  Developed for the numerical solution of Einstein’s equations  Two main components: flesh and thorns  Flesh – central core  Thorns – application modules (sketch below)  Provides a simple abstract API  Hides the MPI parallel driver and I/O (thorns)
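
As a rough illustration of the flesh/thorn split, a toy Python sketch; real Cactus is C/Fortran driven by a scheduling description, so every name below is hypothetical.

```python
# Toy flesh/thorn split: the "flesh" core keeps a registry of application
# modules ("thorns") and drives them; the thorns contain the physics.
THORNS = {}

def thorn(name):
    """Register an application module (a 'thorn') with the core (the 'flesh')."""
    def register(fn):
        THORNS[name] = fn
        return fn
    return register

@thorn("wavetoy")
def evolve(grid):
    # Application physics lives in the thorn; parallel drivers and I/O
    # are hidden behind the flesh's abstract API.
    return [0.99 * u for u in grid]

def flesh_run(state):
    """The flesh schedules the registered thorns over the simulation state."""
    for name, step in THORNS.items():
        state = step(state)
    return state

print(flesh_run([1.0, 2.0, 3.0]))
```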

32 Cactus-G  “Grid-enabled” Cactus  Combines Cactus and MPICH-G2 (more later)  Layered approach  Application thorns  Grid-aware infrastructure thorns  Grid-enabled communication library (MPICH-G2 in this case)

33 Globus  Condor  Pre-Grid tool applied to Grid systems  Multi-domain possible but limited  No security; focus primarily on resource management  Globus  Set of Grid-specific tools  Extensible and hierarchical

34 The Toolkit  Globus Toolkit  Components for basic security, resource management, etc  Well defined interfaces - “Hour-glass” architecture  Local services sit behind API  Global services built on top of these local services  Interfaces useful to manage heterogeneity  Information Service integral component  Information-rich environment needed

35 Globus Services

36 Resource Management  Globus Resource Allocation Manager (GRAM)  Responsible for a set of local resources  Single domain  Implemented with a set of local RM tools  Condor, NQE, Fork, Easy-LL, etc.  Resource requests expressed in the Resource Specification Language (RSL)

37 Resource Broker  Manages RSL requests  Uses Information Services to discover GRAMs  Transforms abstract RSLs into more specific requirements  Sends allocation requests to the appropriate GRAM (sketch below)
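
A small Python sketch of composing a GT2-style RSL request string. The attribute names (executable, count, maxMemory) follow classic RSL conventions, but the helper and its specifics are illustrative, not part of any Globus API.

```python
# Build a classic RSL request string: '&' introduces a conjunction of
# (attribute=value) clauses that a GRAM can act on.
def make_rsl(executable, count, max_memory_mb):
    return (f"&(executable={executable})"
            f"(count={count})"
            f"(maxMemory={max_memory_mb})")

print(make_rsl("/bin/hostname", 4, 64))
# -> &(executable=/bin/hostname)(count=4)(maxMemory=64)
```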

38

39 Information Service  The Grid is always in flux  An information-rich system produces information users find useful  Enhances flexibility and performance  A necessity for administration  Globus Metacomputing Directory Service (MDS)  Stores and makes accessible Grid information  Lightweight Directory Access Protocol (LDAP)  Extensible representation for information  Stores component information in a directory information tree (sketch below)
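
Since MDS exposes its directory information tree over LDAP, any LDAP client can browse it. Below is a hedged Python sketch using the ldap3 library; the host, port, base DN, and filter are placeholders modeled on GT2-era MDS conventions, not a real endpoint.

```python
# Query an LDAP-style information service; adjust names for a real deployment.
from ldap3 import ALL, Connection, Server

server = Server('giis.example.org', port=2135, get_info=ALL)  # 2135: classic MDS port
conn = Connection(server, auto_bind=True)

# Walk the directory information tree for resource entries.
conn.search('Mds-Vo-name=local, o=Grid', '(objectClass=*)', attributes=['*'])
for entry in conn.entries:
    print(entry)
```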

40 Security  Local Heterogeneity  Resources operated in multiple security domains  All use different authentication techniques  N-Way authentication  Job may be any number of processes on any number of resources  One logical entity. User should only authenticate once.

41 Security  Grid Security Infrastructure (GSI)  Modular design constructed on top of local services  Solves local heterogeneity  Globus identity  Mapped into local user identities by the local GSI (sketch below)  Allows for n-way authentication
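
A sketch of the global-to-local identity mapping. The grid-mapfile pattern (certificate subject DN mapped to a local account) is how GSI deployments commonly did this; the entries and helper below are made up for illustration.

```python
# Map a global Globus identity (certificate subject DN) to a local account.
GRIDMAP = {
    "/O=Grid/OU=Example/CN=Alice Researcher": "alice",
    "/O=Grid/OU=Example/CN=Bob Operator": "bob",
}

def local_identity(subject_dn):
    """Resolve a global identity to a local username, or refuse access."""
    try:
        return GRIDMAP[subject_dn]
    except KeyError:
        raise PermissionError(f"no local mapping for {subject_dn}")

print(local_identity("/O=Grid/OU=Example/CN=Alice Researcher"))  # -> alice
```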

42

43 OGSA  Open Grid Services Architecture  Defines a Grid Service  Provides a standard interface for naming, creating, and discovering Grid Services  Location transparent  Globus Toolkit  GRAM – resource allocation/management  MDS-2 – information discovery  GSI – authentication (single sign-on)  Web services  Widely used  Language/system independent

44 OGSA – Grid Service Interface

45 OGSA – VO Structure

46 Condor-G  Hybrid Condor-Globus system  Local Condor agent (Condor-G)  Communicates with Globus GRAM, MDS, GSI, etc.  Optimized Globus’s GRAM to work better with Condor

47 Specific Testbed  Grid2003  Organized into 6 VOs (one for each application)  At each VO site, middleware installed with grid certificate databases  GSI, GRAM, and GridFTP used from Globus  MDS  MonALISA  Agent-based monitoring used in conjunction with MDS

48 MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface Nicholas Karonis, Brian Toonen, Ian Foster

49 Abstract  Grid-enabled MPI implementation  Extends MPICH  Utilizes the Globus Toolkit  Authentication, authorization, resource allocation, executable staging, I/O, process creation, management, and control  Hide/expose critical aspects of the heterogeneous environment

50 The Problem  Grids difficult to program for…  heterogeneous, highly distributed  Build on existing MPI API  MPICH specifically  Can we implement MPI constructs in a highly heterogeneous environment efficiently and transparently?  Yes, use Globus!  Can we also allow users to manage heterogeneity?  Yes, existing MPI Communicator Construct!

51 MPICH-G2  Grid Security Infrastructure (GSI)  Single sign-on authentication  Monitoring and Discovery Service (MDS)  Select nodes to execute on  Resource Specification Language  Generated by mpirun  Specifies job resource requirements  Dynamically-Updated Request Online Coallocator (DUROC)

52 MPICH-G2 Flow Diagram

53 MPICH-G2 Improvements  Replaces MPICH-G  Replaces use of Nexus (Globus) for all communication with optimized code  Increased bandwidth  Cut out an extra layer (Nexus)  Reduced intra-machine vendor-MPI messaging latency  Eliminates unnecessary polling based on source rank info (for Recv)  Specified, specified-pending, multimethod (more later)  Only polls TCP (expensive) when necessary (i.e., when using TCP, not vendor MPI)

54 MPICH-G2 Improvements 2  More efficient use of sockets  Uses one socket for both directions  Multilevel topology-aware collective operations  Collective operations were originally implemented assuming equidistant processes  Unlikely in a Grid scenario

55 App Heterogeneity Management  Topology Discovery  Need a method of discovering topology to minimize expensive transfers  Inter-site communication vs. intra-machine communication  Use the existing MPI communicator construct  Associate attributes with communicators  Topology depths and colors  Allow MPI developers to create communicators which group processes topologically (sketch below)
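
A sketch with mpi4py of the communicator trick described above: group ranks by a topology "color" so that collectives stay inside one site. MPICH-G2's actual mechanism (C MPI, with communicator attributes carrying depths and colors) is richer; the two-"site" split below is a made-up stand-in.

```python
# Run with: mpiexec -n 4 python topo_split.py
from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()

# Pretend the first half of the ranks live at site 0 and the rest at site 1.
site_color = 0 if rank < world.Get_size() // 2 else 1

# Every rank with the same color lands in the same sub-communicator, so this
# allreduce never crosses the expensive wide-area links between sites.
site_comm = world.Split(color=site_color, key=rank)
site_total = site_comm.allreduce(rank, op=MPI.SUM)
print(f"rank {rank}: sum of ranks at my site = {site_total}")
```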

56 Example MPICH-G2 App

57 Performance Groupings  Specified  MPI_Recv explicitly specifies process on same machine  No outstanding asynchronous operations  Explicitly call vendor MPI  Specified-pending  MPI_Recv explicitly specifies process on same machine  Outstanding recv requests on same machine  Forced to continuously poll vendor MPI  Multimethod  MPI_Recv source rank is MPI_ANY_SOURCE  OR outstanding recv requests which may require TCP  Forced to continuously poll vendor MPI and TCP
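
A hedged mpi4py sketch of why the groupings differ in cost: with a named source and no other pending requests ("specified"), the library can go straight to the vendor MPI; MPI.ANY_SOURCE ("multimethod") forces it to keep polling every transport, including expensive TCP. The message contents are illustrative.

```python
# Run with: mpiexec -n 2 python groupings.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 1:
    comm.send("specified path", dest=0, tag=0)
    comm.send("multimethod path", dest=0, tag=1)
elif rank == 0:
    # "Specified": source named explicitly -> cheapest receive path.
    first = comm.recv(source=1, tag=0)
    # "Multimethod": any rank (and any transport) could satisfy this receive,
    # so the implementation must poll vendor MPI and TCP until a match arrives.
    second = comm.recv(source=MPI.ANY_SOURCE, tag=1)
    print(first, "/", second)
```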

58 Vendor MPI Results  Increased performance compared to MPICH-G  Relatively close performance to straight vendor MPI

59 Vendor MPI Results

60 TCP/IP Results  Similar results as Vendor MPI (less interesting)  Authors explicitly say they did not attempt to modify the TCP code

61 TCP/IP Results

62 Conclusions  Good performance  Improved performance compared to the previous version  “Good enough” performance to justify use  Eases the transition of MPI applications to the context of a Grid  Just works  Provides developers with a relatively simple means of writing “smart” apps which are aware of their topology

63 P-GRADE Portal

64 MTA SZTAKI  Computer and Automation Research Institute of the Hungarian Academy of Sciences  Laboratory of Parallel and Distributed Computing  Peter Kacsuk  Jozsef Patvarczki  HunGrid  Member of both SEE-Grid and EGEE

65 Two Grid Problems  Middleware tools are built together into a Grid  Too many complex parts  Confusing for users with little experience  Mostly research scientists  PVM and MPI allow for parallel execution  Execution within a single Globus or Condor site shows good performance  Performance decreases when executed across multiple sites

66 P-GRADE Portal  A Web-based portal for accessing the Grid  High-level tools hide the complexity of the middleware  Can be accessed anywhere  Workflow solution  Complex problems are broken into several parts treated as a single framework  Executed as an acyclic graph (sketch below)  Parallelism at two levels  Independent branches run on several grid sites  Individual nodes can be parallel programs (MPI or PVM)
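
A toy Python sketch of the workflow idea; all names are invented (the real P-GRADE editor is a Java application, not this API). Jobs form an acyclic graph, and any batch of jobs whose inputs are ready could be dispatched in parallel to different grid sites.

```python
# Workflow as a DAG: each job lists the jobs whose outputs it consumes.
workflow = {
    "preprocess": [],
    "simulate_a": ["preprocess"],   # independent branches...
    "simulate_b": ["preprocess"],   # ...eligible to run concurrently
    "compare":    ["simulate_a", "simulate_b"],
}

def runnable(done):
    """Jobs not yet run whose dependencies are all satisfied."""
    return [job for job, deps in workflow.items()
            if job not in done and all(d in done for d in deps)]

done = set()
while len(done) < len(workflow):
    batch = runnable(done)          # the whole batch could run in parallel
    print("dispatch:", batch)
    done.update(batch)
```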

67 Portal  Fully functional; built upon middleware tools  Grid Certificate management  Setting up Grid environment  Creation and modification of workflow apps  Management and parallel execution of workflow apps on grid resources  Visualization of workflow progress

68 Grid Certificate  Security done through Globus GSI  Connect to Proxy server; download Certificate  Monitor status

69 Resource Management  Use Globus tools to attach jobs to resources  Two Strategies  Static Allocation  Connect Directly to GRAM Servers  Dynamic Allocation  Connect to MDS service  Allocate through Grid resource broker

70 Workflow Creation & Monitoring  P-GRADE  Java app for creating parallel workflows  Directed input and output files

71 Parameter Study  A single job run under varying input parameters  Outputs are later compared against each other  A logical Grid application  Each job is independent and can be run in parallel (sketch below)
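
A minimal sketch of a parameter study: one job definition swept over a grid of input values. The submit() helper is a hypothetical stand-in for a real grid submission call; every run is independent, so all of them could go out in parallel.

```python
# Sweep one executable over every combination of the input parameters.
from itertools import product

pressures = [1.0, 2.0]
temperatures = [300, 350, 400]

def submit(executable, **params):
    print(f"submit {executable} with {params}")  # stand-in for real submission

for p, t in product(pressures, temperatures):
    submit("simulation.exe", pressure=p, temperature=t)
```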

72 P-GRADE Portal w/ PStudy  Adapted the Portal to create and manage parametric studies  New workflow editor  Creation of parameterized input files  Management of parameter values  Workflow management  Submit workflows by parameter ranges  Compare outputs  Monitor individual job status

73 Pstudy Manager

74 Visualization

75 PGRADE Demo  http://hgportal.hpcc.sztaki.hu:8080/gridsphere/gridsphere

