Presentation is loading. Please wait.

Presentation is loading. Please wait.

Introduction to Grids Ben Clifford University of Chicago

Similar presentations


Presentation on theme: "Introduction to Grids Ben Clifford University of Chicago"— Presentation transcript:

1 Introduction to Grids Ben Clifford University of Chicago benc@ci.uchicago.edu

2 About Me ● Programmer at University of Chicago Computation Institute ● Open Science Grid – Education, Outreach, Training ● A Developer on Swift – grid workflow system ● Application support for grid users at University of Chicago

3 Motivation ● You can run lots of science applications on your desktop PC ● Sometimes they get too big ● Too much data to process ●... or... ● Too much computation

4 Scaling up Science: Citation Network Analysis in Sociology Work of James Evans, University of Chicago, Department of Sociology 4

5 Scaling up the analysis ● Query and analysis of 25+ million citations ● Work started on desktop workstations ● Queries grew to month-long duration ● With data distributed across U of Chicago TeraPort cluster: – 50 (faster) CPUs gave 100 X speedup – Many more methods and hypotheses can be tested! ● Higher throughput and capacity enables deeper analysis and broader community access. 5

6 High Energy Physics Tier2 Centre ~1 TIPS Online System Offline Processor Farm ~20 TIPS CERN Computer Centre FermiLab ~4 TIPS France Regional Centre Italy Regional CentreGermany Regional Centre Institute Institute ~0.25TIPS Physicist workstations ~100 MBytes/sec ~622 Mbits/sec ~1 MBytes/sec There is a “bunch crossing” every 25 nsecs. There are 100 “triggers” per second Each triggered event is ~1 MByte in size Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server Physics data cache ~PBytes/sec ~622 Mbits/sec or Air Freight (deprecated) Tier2 Centre ~1 TIPS Caltech ~1 TIPS ~622 Mbits/sec Tier 0 Tier 1 Tier 2 Tier 4 1 TIPS is approximately 25,000 SpecInt95 equivalents Image courtesy Harvey Newman, Caltech 6

7 Computing “Clusters” are today’s Supercomputers Cluster Management “frontend” Tape Backup robots I/O Servers typically RAID fileserver Disk Arrays Lots of Worker Nodes A few Headnodes, gatekeepers and other service nodes 7

8 Cluster Architecture Cluster User 8 Head Node(s) Login access (ssh) Cluster Scheduler (PBS, Condor, SGE) Remote File Access (scp, GridFTP) Node 0 … … … Node N Shared Cluster Filesystem Storage (applications and data) Job execution requests & status Comp ute Node s (10 to 10,000 PC’s with local disks) … Cluster User Internet Protocols

9 Grids consist of distributed clusters Grid Client Application & User Interface Grid Client Middleware Resource, Workflow & Data Catalogs 9 Grid Site 2: Sao Paolo Grid Service Middleware Compute Cluster Grid Storage Grid Protocols Grid Site 1: Fermilab Grid Service Middleware Compute Cluster Grid Storage …Grid Site N: UWisconsin Grid Service Middleware Compute Cluster Grid Storage

10 Ian Foster’s Grid Checklist ● A Grid is a system that: – Coordinates resources that are not subject to centralized control – Uses standard, open, general-purpose protocols and interfaces – Delivers non-trivial qualities of service 10

11 Open Science Grid map

12

13 March 2007JP Navarro 13 Grid Resources in the US Origins: –National Grid (iVDGL, GriPhyN, PPDG) and LHC Software & Computing Projects Current Compute Resources: –61 Open Science Grid sites –Connected via Inet2, NLR.... from 10 Gbps – 622 Mbps –Compute & Storage Elements –All are Linux clusters –Most are shared Campus grids Local non-grid users –More than 10,000 CPUs A lot of opportunistic usage Total computing capacity difficult to estimate Same with Storage Origins: –National Grid (iVDGL, GriPhyN, PPDG) and LHC Software & Computing Projects Current Compute Resources: –61 Open Science Grid sites –Connected via Inet2, NLR.... from 10 Gbps – 622 Mbps –Compute & Storage Elements –All are Linux clusters –Most are shared Campus grids Local non-grid users –More than 10,000 CPUs A lot of opportunistic usage Total computing capacity difficult to estimate Same with Storage Origins: –National Super Computing Centers, funded by the National Science Foundation Current Compute Resources: –9 TeraGrid sites (soon 14) –Connected via dedicated multi-Gbps links –Mix of Architectures ia64, ia32 LINUX Cray XT3 Alpha: True 64 SGI SMPs – Resources are dedicated but Grid users share with local users 1000s of CPUs, > 40 TeraFlops –100s of TeraBytes Origins: –National Super Computing Centers, funded by the National Science Foundation Current Compute Resources: –9 TeraGrid sites (soon 14) –Connected via dedicated multi-Gbps links –Mix of Architectures ia64, ia32 LINUX Cray XT3 Alpha: True 64 SGI SMPs – Resources are dedicated but Grid users share with local users 1000s of CPUs, > 40 TeraFlops –100s of TeraBytes The TeraGrid The OSG

14 14 Grid Middleware glues the grid together ● A short, intuitive definition: the software that glues together different clusters into a grid, taking into consideration the socio-political side of things (such as common policies on who can use what, how much, and what for)

15 The Grid Middleware Stack Grid Security Infrastructure Job Management Data Management Grid Information Services Standard Internet Network Protocols Workflow system (explicit or ad-hoc) Grid Application 15

16 The Grid Middleware Stack Grid Security Infrastructure Job Management Data Management Grid Information Services Standard Internet Network Protocols Workflow system (explicit or ad-hoc) Grid Application 16

17 17 Local Resource Managers (LRM) Compute resources have a local resource manager (LRM) that controls: – Who is allowed to run jobs – How jobs run on a specific resource – Specifies the order and location of jobs Example policy:  Each cluster node can run one job.  If there are more jobs, then they must wait in a queue ● LRMs allow nodes in a cluster can be reserved for a specific person ● Examples: PBS, LSF, Condor

18 18 Globus Toolkit ● the de facto standard for grid middleware. ● Open source ● Adopted by different scientific communities and industries ● Conceived as an open set of architectures, services and software libraries that support grids and grid applications ● Provides services in major areas of distributed systems: – job submission – data management – security

19 19 GRAM Globus Resource Allocation Manager ● GRAM = provides a standardised interface to submit jobs to LRMs. ● Clients submit a job request to GRAM ● GRAM translates into something a(ny) LRM can understand …. Same job request can be used for many different kinds of LRM

20 20 Condor ● Condor is a specialized workload management system for compute-intensive jobs. ● is a software system that creates an HTC environment  Created at UW-MadisonUW-Madison  Detects machine availability  Harnesses available resources  Uses remote system calls to send R/W operations over the network  Provides powerful resource management by matching resource owners with consumers (broker)

21 The Grid Middleware Stack Grid Security Infrastructure Job Management Data Management Grid Information Services Standard Internet Network Protocols Workflow system (explicit or ad-hoc) Grid Application 21

22 22 High-performance tools needed to solve several data problems. ● The huge raw volume of data: – Storing it – Moving it – Measured in terabytes, petabytes, and further … ● The huge number of filenames: – 10 12 filenames is expected soon – Collection of 10 12 of anything is a lot to handle efficiently ● How to find the data

23 23 Data Questions on the Grid Where are the files I want? How to move data/files to where I want?

24 24 GridFTP high performance, secure, and reliable data transfer protocol based on the standard FTP  http://www.ogf.org/documents/GFD.20.pdf http://www.ogf.org/documents/GFD.20.pdf Extensions include  Strong authentication, encryption via Globus GSI  Multiple data channels for parallel transfers  Third-party transfers  Tunable network & I/O parameters  Authenticated reusable channels  Server side processing, command pipelining

25 25 RFT = Reliable file transfer protocol that provides reliability and fault tolerance for file transfers Part of the Globus Toolkit RFT acts as a client to GridFTP, providing management of a large number of transfer jobs (same as Condor to GRAM) RFT can keep track of the state of each job run several transfers at once deal with connection failure, network failure, failure of any of the servers involved.

26 26 RLS -Replica Location Service RLS It provides access to mapping information from logical names to physical names of items Its goal is to reduce access latency, improve data locality, improve robustness, scalability and performance for distributed applications RLS produces replica catalogs (LRCs), which represent mappings between logical and physical files scattered across the storage system. For better performance, the LRC can be indexed.

27 The Grid Middleware Stack Grid Security Infrastructure Job Management Data Management Grid Information Services Standard Internet Network Protocols Workflow system (explicit or ad-hoc) Grid Application 27

28 Grid Information ● Get useful information about grid sites to help use those sites. ● Which sites exist? ● Which sites are working? ● How do I use a site?

29 Why is this information important? ● Why is this important? ● Because the grid is always changing – sites come and go. Some work one day but not the next. New sites are often created. ● All sites look similar but different.

30 VORS map

31 March 2007JP Navarro 31 VORS entry for OSG_LIGO_PSU Gatekeeper: grid3.aset.psu.edu OSG Consortium Mtg March 2007 Quick Start Guide to the OSG

32 The Grid Middleware Stack Grid Security Infrastructure Job Management Data Management Grid Information Services Standard Internet Network Protocols Workflow system (explicit or ad-hoc) Grid Application 32

33 33 What do we need ? ● Identity ● Authentication ● Message Protection ● Authorization ● Single Sign On

34 34 Identity & Authentication ● Each entity should have an identity – entity = people, grid resources ● Authenticate: Establish identity – Is the entity who he claims he is ? – Stops masquerading imposters ● Authorisation: What is each identity allowed to do?

35 35 Single Sign-on ● Type in password once ● Get access to all authorised resources

36 The Grid Middleware Stack Grid Security Infrastructure Job Management Data Management Grid Information Services Standard Internet Network Protocols Workflow system (explicit or ad-hoc) Grid Application 36

37 Grid Application ● Application is what you want to do, to do science ● Usually cannot take an existing application and run it on the grid straight away. ● Make application parallel ● Make application distributed

38 File coupled application model ● Very common grid application model ● Each piece of your application is a separate program ● Pieces communicate through files ● Other models of grid applications are possible

39 File-coupled application Input file 1 Intermediat e file 1 Executable A Input file 3 Intermediat e file 3 Executable A Executable B Output file

40 On the grid Input file 1 Intermediat e file 1 Executable A Input file 3 Intermediat e file 3 Executable A Executabl e B Output file Input file 1 Input file 3 Intermediat e file 1 Your computer Grid site 1 Grid site 2

41 The Grid Middleware Stack Grid Security Infrastructure Job Management Data Management Grid Information Services Standard Internet Network Protocols Workflow system (explicit or ad-hoc) Grid Application 41

42 Grid workflow ● You've seen what a grid application might look like... ●... and how the various low level components like GridFTP, GRAM, Condor can be used to make your application go... ●... but lots of fiddly parts to fit together. ● eg. previous slide app had 4 file transfers and 3 job submissions

43 Grid workflow ● Make it easier to express your application in a higher-level language ● Take advantage of work other people have done for: – fault tolerance – workflow sharing – site selection – provenance tracking

44 44 DAGMan Directed Acyclic Graph Manager DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. (e.g., “Don’t run job “B” until job “A” has completed successfully.”)‏

45 Swift is… ● A language for writing scripts that: –process and produce large collections of persistent data –with large and/or complex sequences of application programs –on diverse distributed systems –with a high degree of parallelism –persisting over long periods of time –surviving infrastructure failures –and tracking the provenance of execution

46 About Open Science Grid – the organization The OSG is supported by the National Science Foundation and the U.S. Department of Energy's Office of Science. OSG has: the grid service providers – people who provide the software, compute and storage resources, technical support the grid consumers – scientists who use OSG for their applications – sometimes individual researchers, sometimes global collaborations.

47

48 March 2007JP Navarro 48 The OSG Software Stack A collection of software –Grid software: Condor, Globus and lots more –Utilities: Monitoring, Authorization, Configuration –Built for >10 flavors/versions of Linux Automated Build and Test: Integration and regression testing. An easy installation: Push a button, everything just works. Quick update processes. Responsive to user needs: process to add new components based on community needs. A support infrastructure: front line software support, triaging between users and software providers for deeper issues.

49 Where to go for more... ● OSG Education, Outreach, Training group – me + Alina Bejan + Mike Wilde + Charles Bacon – Notes available online 'forever' – http://opensciencegrid.org/OnlineGridCourse ● basics similar to this course ● get access to the real Open Science Grid to implement your applications ● start using real OSG resources in OSGEDU VO ● Me: benc@ci.uchicago.edubenc@ci.uchicago.edu


Download ppt "Introduction to Grids Ben Clifford University of Chicago"

Similar presentations


Ads by Google