
1 Introduction to CamGrid Mark Calleja Cambridge eScience Centre www.escience.cam.ac.uk

2 Why grids?
– The idea comes from electricity grids: you don't care which power station your kettle is using.
– Also, there are lots of underutilised resources around; the trick is to access them transparently.
– Not all resources need to be HPC machines with large amounts of shared memory and fast interconnects. Many research problems are "embarrassingly parallel", e.g. phase space sampling.
– We'd like to use "anything": dedicated servers or desktops.

3 What is Condor?
– Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing (HTC) facility.
– Machines in a Condor pool can submit and/or service jobs in the pool.
– Highly configurable: how, when and whose jobs can run.
– Condor has several useful mechanisms, such as:
 – Process checkpoint/restart/migration
 – MPI support (with some effort)
 – Failure resilience
 – Workflow support

4 Getting Started: Submitting Jobs to Condor
– Choose a "Universe" for your job (i.e. the sort of environment the job will run in): vanilla, standard, Java, parallel (MPI)…
– Make your job "batch-ready" (in particular, take input from stdin): it must be able to run in the background, with no windows, GUI, etc.
– Create a submit description file.
– Run condor_submit on your submit description file.

5 A Submit Description File
# Example condor_submit input file
# (Lines beginning with # are comments)
Universe     = vanilla
Executable   = job.$$(OpSys).$$(Arch)
InitialDir   = /home/mark/condor/run_$(Process)
Input        = job.stdin
Output       = job.stdout
Error        = job.stderr
Arguments    = arg1 arg2
Requirements = Arch == "X86_64" && OpSys == "Linux"
Rank         = KFlops
Queue 100
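The submit file above expects a run_$(Process) directory for each of the 100 queued jobs, each containing a job.stdin file, to exist before submission. A minimal sketch of preparing that layout (the base directory name and the helper function are hypothetical, not part of Condor):

```python
import os

def make_run_dirs(base, n_procs, stdin_text=""):
    """Create base/run_0 .. base/run_{n-1}, each with a job.stdin file,
    matching InitialDir = .../run_$(Process) and Input = job.stdin."""
    for i in range(n_procs):
        d = os.path.join(base, f"run_{i}")
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, "job.stdin"), "w") as fh:
            fh.write(stdin_text)

# Prepare 100 run directories, one per queued process
make_run_dirs("condor_runs", 100)
```

After this, condor_submit expands $(Process) to 0…99 and each job runs in its own directory.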

6 DAGMan – Condor's workflow manager
– Directed Acyclic Graph Manager.
– DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. E.g., "Don't run job B until job A has completed successfully."
– Allows complicated workflows to be built up (DAGs can be embedded within DAGs).
– Failed nodes can be automatically retried.
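The "A before B" example can be written as a DAGMan input file. A minimal sketch (the .dag and .sub file names are invented; JOB, PARENT/CHILD and RETRY are DAGMan's own keywords):

```
# example.dag (file names are hypothetical)
JOB A a.sub
JOB B b.sub
# B may only start after A completes successfully
PARENT A CHILD B
# Retry node B up to 3 times if it fails
RETRY B 3
```

This would be submitted with condor_submit_dag example.dag; DAGMan itself then runs as a Condor job and submits A and B at the right times.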

7 Condor Flocking
– Condor attempts to run a submitted job in its local pool. However, queues can be configured to try sending jobs to other pools: "flocking".
– The user-priority system is "flocking-aware": a pool's local users can have priority over remote users "flocking" in.
– This is how CamGrid works: each group/department maintains its own pool and flocks with the others.
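Flocking is configured per pool in condor_config via Condor's FLOCK_TO and FLOCK_FROM settings. A hedged sketch (the host names are invented for illustration):

```
# condor_config fragment (host names are illustrative, not real CamGrid pools)
# Pools to try, in order, when the local pool can't run a job:
FLOCK_TO   = cm.chemistry.example.cam.ac.uk, cm.physics.example.cam.ac.uk
# Remote submit machines whose flocked jobs this pool will accept:
FLOCK_FROM = cm.chemistry.example.cam.ac.uk, cm.physics.example.cam.ac.uk
```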

8

9 CamGrid
– Started in Jan 2005 by five groups (now up to eleven groups and 13 pools).
– The UCS has its own, separate Condor facility known as "PWF Condor".
– Each group sets up and runs its own pool, and flocks to/from other pools. Hence a decentralised, federated model.
– Strengths:
 – No single point of failure
 – Sysadmin tasks are shared out
– Weaknesses:
 – Debugging is complicated, especially networking issues
 – Many Linux variants: can cause library problems

10 Participating departments/groups
– Cambridge eScience Centre
– Dept. of Earth Science (2)
– High Energy Physics
– School of Biological Sciences
– National Institute for Environmental eScience (2)
– Chemical Informatics
– Semiconductors
– Astrophysics
– Dept. of Oncology
– Dept. of Materials Science and Metallurgy
– Biological and Soft Systems

11 Local details (1)
– CamGrid uses a set of RFC 1918 ("CUDN-only") IP addresses, hence each machine needs to be given an (extra) address in this space.
– A CamGrid Management Committee, with members drawn from the participating groups, maps out policy.
– Currently have ~1,000 cores/processors, mostly 4-core Dell 1950s (8 GB memory) like the HPCF. Aside: SMP/MPI works very nicely!
– Pretty much all Linux, and mostly 64-bit.
– Administrators can decide the configuration of their pool, e.g.:
 – Extra priority for local users
 – Renice Condor jobs
 – Only run jobs at certain times
 – Have a preemption policy
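Per-pool policies like those listed above are expressed as startd expressions in condor_config. An illustrative sketch, not CamGrid's actual settings (the expressions and values are assumptions; START, JOB_RENICE_INCREMENT and PREEMPT are Condor's own knobs):

```
# Illustrative condor_config policy fragment (values are made up)
# Only start jobs once the console has been idle for 15 minutes:
START = KeyboardIdle > (15 * 60)
# Run Condor jobs at a lower OS priority:
JOB_RENICE_INCREMENT = 10
# Preempt a job that has been running for more than 24 hours:
PREEMPT = (CurrentTime - JobStart) > (24 * 3600)
```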

12 Local details (2)
– It is the responsibility of individual pools to authenticate local submitters. You need to trust root on the remote machine, especially for the Standard Universe.
– There's no shared file system across CamGrid, but Parrot (from the Condor project) is a nice user-space file system tool for Linux. It means a job can mount a remote data source like a local file system (à la NFS).
– Firewalls: a submit host must be able to communicate with every possible execute node. However, Condor can be confined to a well-defined port range.
– Two mailing lists have been set up: one for users (92 currently registered) and the other for sysadmins.
– There is a nice web-based utility for viewing job files in real time on execute hosts.
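On the firewall point: Condor can pin its daemons' ephemeral ports to a fixed range via the LOWPORT/HIGHPORT configuration settings, so only that range need be opened between pools. The range below is an invented example:

```
# condor_config fragment: confine Condor traffic to one port range (example values)
LOWPORT  = 9600
HIGHPORT = 9700
```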

13

14 41 refereed publications to date (Science, Phys. Rev. Lett., PLOS, …)

15 USERS YOUR GRID GOD SAVE THE GRID

16 How you can help us help you
– Pressgang local resources: why aren't those laptops/desktops on CamGrid?
– When applying for grants, please ask for funds to put towards computational resources (~£10k?).
– Publications, publications, publications! Please remember to mention CamGrid and inform me of accepted articles.
– Evangelise locally, especially to the hierarchy.
– Tell us what you'd like to see (centralised storage, etc.).

17 We can archive your digital assets: papers, articles, preprints, eprints, working papers, scholarly conference papers, reports, books, PhD theses, research data, statistics, learning objects, manuscripts, web pages, source code, images, photos, audio, video, multimedia, text, XML, PDF and TIFF bitstreams, documents…
Elin Stangeland, Repository Manager
www.dspace.cam.ac.uk
support@repository.cam.ac.uk

18 Take home message
– It works: CamGrid has cranked out 386 years of CPU usage since Feb '06 (that's back to King James I and the Jamestown Massacre).
– Those who put the effort in and get over the initial learning curve are very happy with it:
 – "Without CamGrid this research would simply not be feasible." – Prof. Bill Amos (population geneticist)
 – "We acknowledge CamGrid for invaluable help." – Prof. Fernando Quevedo (theoretical physicist)
– Needs no outlay for new hardware, and the middleware is free (and open source).
– This is a grass-roots initiative: you need to help recruit more/newer resources.

19 Links
– CamGrid: www.escience.cam.ac.uk/projects/camgrid/
– Condor: www.cs.wisc.edu/condor/
– Email: mc321@cam.ac.uk
Questions?
