Ian C. Smith ULGrid – Experiments in providing a campus grid.

1 Ian C. Smith ULGrid – Experiments in providing a campus grid

2 Overview
- Current Liverpool systems
- PC Condor pool
- Job management in ULGrid using Condor-G
- The ULGrid portal
- Storage Resource Broker
- Future developments
- Questions

3 Current Liverpool campus systems
- ulgbc1
  - 24 dual-processor Athlon nodes, 0.5 TB storage, GigE
- ulgbc2
  - 38 single-processor nodes, 0.6 TB storage, GigE
- ulgbc3 / lv1.nw-grid.ac.uk
  - NW-GRID: 44 dual-core, dual-processor nodes, 3 TB storage, GigE
  - HCC: 35 dual-core, dual-processor nodes, 5 TB storage, InfiniPath
- ulgbc4 / lv2.nw-grid.ac.uk
  - 94 single-core nodes, 8 TB RAID storage, Myrinet
- PC Condor pool
  - ~300 Managed Windows Service PCs

5 PC Condor Pool
- allows jobs to be run remotely on Managed Windows Service (MWS) teaching centre PCs at times when they would otherwise be idle (~300 machines currently)
- provides high throughput computing rather than high performance computing (maximise the number of jobs which can be processed in a given time)
- only suitable for DOS-based applications running in batch mode
- no communication between processes possible (“pleasantly parallel” applications only)
- statically linked executables work best (although it can cope with DLLs)
- can access application files on a network mapped drive
- long running jobs need to use Condor DAGMan
- authentication of users prior to job submission via ordinary University security systems (NIS+/LDAP)
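The slide does not say how DAGMan is used for long running jobs. One plausible reading, assumed here, is that a long job is split into shorter stages that each fit within an idle period and are chained so that each stage starts only after the previous one has finished. The sketch below generates a DAGMan input file for such a chain using the standard `JOB` and `PARENT ... CHILD` keywords; the stage names and submit file names are hypothetical.

```python
def linear_dag(stage_submit_files):
    """Return DAGMan input text that runs the given submit files in order.

    Each entry is the name of a Condor submit description file for one
    stage of a long job (hypothetical names; the slides do not give any).
    """
    names = []
    lines = []
    for index, submit_file in enumerate(stage_submit_files):
        name = f"STAGE{index}"
        names.append(name)
        # JOB <node name> <submit description file>
        lines.append(f"JOB {name} {submit_file}")
    # Chain consecutive stages so a stage runs only after its predecessor.
    for parent, child in zip(names, names[1:]):
        lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(linear_dag(["stage0.sub", "stage1.sub", "stage2.sub"]), end="")
```

The resulting text would be saved to a file and handed to `condor_submit_dag`; how each stage saves and restores its state is left to the application.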

6 Condor and power saving
- power saving employed on all teaching centre PCs by default
- machines power down automatically if idle for > 30 min and no user is logged in, but will remain powered up if a Condor job is running until it completes
- NIC remains active, allowing remote wake-on-LAN
- submit host detects if no. of idle jobs > no. of idle machines and wakes up the pool as necessary
- a couple of teaching centres remain "always available" for testing etc.
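The wake-up rule above can be sketched as follows. This is a minimal illustration, not the script used at Liverpool: the policy of waking exactly as many machines as the shortfall is an assumption, as are the function names. The magic packet layout (six 0xFF bytes followed by the MAC address repeated 16 times) is the standard wake-on-LAN format.

```python
import socket


def magic_packet(mac):
    """Build a wake-on-LAN magic packet for a MAC such as '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # Six 0xFF bytes, then the 6-byte MAC repeated 16 times (102 bytes total).
    return b"\xff" * 6 + mac_bytes * 16


def machines_to_wake(idle_jobs, idle_machines, sleeping_macs):
    """Pick sleeping machines to wake when idle jobs outnumber idle machines.

    Assumed policy: wake one machine per job that has nowhere to run.
    """
    shortfall = max(0, idle_jobs - idle_machines)
    return sleeping_macs[:shortfall]


def wake(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP (port 9 is a common choice)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))
```

In practice the two counts would come from the pool itself, for example by parsing `condor_q` and `condor_status` output on the submit host.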

7 [Diagram: Condor pool. Labels: Teaching Centre 1, Teaching Centre 2, ... other centres; Condor submit host; Condor central manager; Condor view server; Condor portal; user login]

8 Condor research applications
- molecular statics and dynamics (Engineering)
- prediction of shapes and properties of molecules using quantum mechanics (Chemistry)
- modelling of avian influenza propagation in poultry flocks (Vet Science)
- modelling of E. coli propagation in dairy cattle (Vet Science)
- model parameter optimization using Genetic Algorithms (Electronic Engineering)
- computational fluid dynamics (Engineering)
- numerical simulation of ocean current circulation (Earth and Ocean Science)
- numerical simulation of geodynamo magnetic field (Earth and Ocean Science)

18 [Figure: Boundary layer fluctuations induced by freestream streamwise vortices. Label: Flow]

19 [Figure: Boundary layer ‘streaky structures’ induced by freestream streamwise vortices. Label: Flow]

24 ULGrid aims
- provide a user-friendly single point of access to cluster resources
- Globus based, with authentication through UK e-Science certificates
- job submission should be no more difficult than using a conventional batch system
- users should be able to determine easily which resources are available
- meta-scheduling of jobs
- users should be able to monitor progress of all jobs easily
- jobs can be single process or MPI
- job submission from either the command line (qsub-style script) or web

25 ULGrid implementation
- originally tried Transfer-queue-over-Globus (ToG) from EPCC for job submission but...
  - messy to integrate with SGE
  - limited reporting of job status
  - no meta-scheduling possible
- decided to switch to Condor-G
- Globus monitoring and discovery service (MDS) originally used to publish job status and resource info but...
  - very difficult to configure
  - hosts mysteriously vanish because of timeouts (processor overload? network delays? who knows)
  - all hosts occasionally disappear after a single cluster reboot
- eventually used Apache web servers to publish information in the form of Condor ClassAds
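As a rough illustration of the last point, a cluster could write its status as a plain-text ClassAd into a directory served by Apache, where a script on the central manager fetches it. The sketch below only renders the ad text. `gatekeeper_url` and `Name` appear in the submit file later in the talk; the `FreeSlots` attribute and the gatekeeper value are hypothetical, and the transport between the web server and Condor is not shown because the slides do not describe it.

```python
def render_classad(attributes):
    """Render a dict as old-style ClassAd text, one 'Attr = value' per line."""
    lines = []
    for name, value in attributes.items():
        if isinstance(value, bool):
            # Checked before int because bool is a subclass of int in Python.
            rendered = "TRUE" if value else "FALSE"
        elif isinstance(value, (int, float)):
            rendered = str(value)
        else:
            # Strings are double-quoted, with backslashes and quotes escaped.
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"{name} = {rendered}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    ad = {
        "Name": "ulgbc1.liv.ac.uk",
        # Hypothetical values for illustration only.
        "gatekeeper_url": "ulgbc1.liv.ac.uk/jobmanager-sge",
        "FreeSlots": 12,
    }
    print(render_classad(ad), end="")
```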

26 Condor-G pros
- familiar and reliable interface for job submission and monitoring
- very effective at hiding the Globus middleware layer
- meta-scheduling possible through the use of ClassAds
- automatic renewal of proxies on remote machines
- proxy expiry handled gracefully
- workflows can be implemented using DAGMan
- nice sysadmin features, e.g.
  - fair-share scheduling
  - changeable user priorities
  - accounting

27 Condor-G cons
- user interface is different from SGE, PBS etc.
- limited file staging facilities
- limited reporting of remote job status
- user still has to deal directly with Globus certificates
- matchmaking can be slow

28 Local enhancements to Condor-G
- extended resource specifications, e.g. parallel environment, queue
- extended file staging
- ‘Virtual Console’: streaming of output files from remotely running jobs
- reporting of remote job status (e.g. running, idle, error)
- modified version of LeSC SGE jobmanager runs on all clusters
- web interface
- MyProxy server for storage/retrieval of e-Science certificates
- automatic proxy certificate renewal using MyProxy server

29 Specifying extended job attributes
- without RSL schema extensions:

    globusrsl = ( environment = (transfer_input_files file1,file2,file3)\
                  (transfer_output_files file4,file5 )\
                  (parallel_environment mpi2) )

- with RSL schema extensions:

    globusrsl = (transfer_input_files = file1, file2, file3)\
                (transfer_output_files = file4,file5 )\
                (parallel_environment = mpi2)

  or...

    globusrsl = (parallel_environment = mpi2)
    transfer_input_files = file1, file2, file3
    transfer_output_files = file4, file5

  or...

    globusrsl = (parallel_environment = mpi2)
    transfer_input_files = file1, file2, file3
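When submit files are generated by a portal or script, the extended form of the `globusrsl` line can be assembled from a mapping of attribute names to values. The helper below is a hypothetical sketch of that step; it follows the `(name = value)` clause layout of the slide and is not part of Condor-G or Globus.

```python
def build_globusrsl(attributes):
    """Build a 'globusrsl = (...)(...)' submit-file line from a dict.

    List or tuple values are joined with commas, as in the file lists
    shown on the slide.
    """
    clauses = []
    for name, value in attributes.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(value)
        clauses.append(f"({name} = {value})")
    return "globusrsl = " + "".join(clauses)


if __name__ == "__main__":
    print(build_globusrsl({
        "transfer_input_files": ["file1", "file2", "file3"],
        "transfer_output_files": ["file4", "file5"],
        "parallel_environment": "mpi2",
    }))
```

Values are inserted as given, so anything containing parentheses or other RSL special characters would need quoting that this sketch does not attempt.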

30 Typical Condor-G job submission file

    universe = globus
    globusscheduler = $$(gatekeeper_url)
    x509userproxy=/opt2/condor_data/ulgrid/certs/bonarlaw.cred
    requirements = ( TARGET.gatekeeper_url =!= UNDEFINED ) && \
                   ( name == "ulgbc1.liv.ac.uk" )
    output = condori_5e_66_cart.out
    error = condori_5e_66_cart.err
    log = condori_5e_66_cart.log
    executable = condori_5e_66_cart_$$(Name)
    globusrsl = ( input_working_directory = $ENV(PWD) )\
                ( job_name = condori_5e_66_cart )( job_type = script )\
                ( stream_output_files = pcgamess.out )
    transfer_input_files=pcgamess.in
    notification = never
    queue
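In this file the `requirements` expression selects a resource ClassAd, and each `$$(attribute)` is then replaced with the value of that attribute in the matched ad. The toy model below mimics those two steps for the file above. It is a simplified sketch, not Condor's matchmaker: the requirements test is hard-coded in Python, and the `gatekeeper_url` value is hypothetical.

```python
import re


def matches(ad):
    """Mirror the requirements line: gatekeeper_url is defined and the
    resource name is ulgbc1.liv.ac.uk."""
    return ad.get("gatekeeper_url") is not None and ad.get("name") == "ulgbc1.liv.ac.uk"


def substitute(template, ad):
    """Replace each $$(attr) with the matched ad's value for attr.

    ClassAd attribute names are case-insensitive, so $$(Name) finds 'name'.
    An attribute missing from the ad raises KeyError.
    """
    lowered = {key.lower(): value for key, value in ad.items()}
    return re.sub(
        r"\$\$\((\w+)\)",
        lambda m: str(lowered[m.group(1).lower()]),
        template,
    )


if __name__ == "__main__":
    resource_ads = [
        {"name": "ulgbc2.liv.ac.uk", "gatekeeper_url": "ulgbc2.liv.ac.uk/jobmanager-sge"},
        {"name": "ulgbc1.liv.ac.uk", "gatekeeper_url": "ulgbc1.liv.ac.uk/jobmanager-sge"},
    ]
    matched = next(ad for ad in resource_ads if matches(ad))
    print(substitute("globusscheduler = $$(gatekeeper_url)", matched))
    print(substitute("executable = condori_5e_66_cart_$$(Name)", matched))
```

Suffixing the executable name with `$$(Name)` means a different script or binary can be picked up for each cluster without changing the rest of the submit file.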

31 [Diagram. Components: Condor-G portal; Condor-G submit host; Condor-G central manager; MyProxy server; CSD AMD cluster (ulgbc1); CSD-Physics cluster (ulgbc2); NW-GRID cluster (ulgbc3); NW-GRID/POL cluster (ulgp4). Connections: user login, Condor ClassAds, Globus file staging]

37 Storage Resource Broker (SRB)
- open source grid middleware developed by the San Diego Supercomputer Center, allowing distributed storage of data
- absolute filenames reflect the logical structure of data rather than its physical location (unlike NFS)
- meta-data allows annotation of files so that results can be searched easily at a later date
- high speed data movement through parallel transfers
- several interfaces available: shell (Scommands), Windows GUI (InQ), X Window GUI, web browser (MySRB); also APIs for C/C++, Java, Python
- provides most of the functionality needed to build a data grid
- many other features
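The separation between logical names, physical locations and searchable meta-data can be illustrated with a toy in-memory catalogue. This is only a conceptual model of what a metadata catalogue does; it is not the SRB API, and every class, method, path and host name below is hypothetical.

```python
class LogicalCatalogue:
    """Toy catalogue mapping logical paths to physical locations and meta-data."""

    def __init__(self):
        self._entries = {}

    def register(self, logical_path, physical_location, **metadata):
        """Record where a logical file really lives, with optional annotations."""
        self._entries[logical_path] = {
            "location": physical_location,
            "metadata": dict(metadata),
        }

    def locate(self, logical_path):
        """Resolve a logical path to its current physical location."""
        return self._entries[logical_path]["location"]

    def search(self, **criteria):
        """Return logical paths whose meta-data matches every criterion."""
        return sorted(
            path
            for path, entry in self._entries.items()
            if all(entry["metadata"].get(k) == v for k, v in criteria.items())
        )


if __name__ == "__main__":
    catalogue = LogicalCatalogue()
    catalogue.register("/ulgrid/cfd/run1.out", "vault-a:/data/0001",
                       project="cfd", status="done")
    catalogue.register("/ulgrid/cfd/run2.out", "vault-b:/data/0042",
                       project="cfd", status="failed")
    print(catalogue.search(project="cfd", status="done"))
    print(catalogue.locate("/ulgrid/cfd/run1.out"))
```

Moving a file to another vault changes only the stored location, so users and scripts keep referring to the same logical path.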

40 [Diagram. Components: Condor-G central manager/submit host; SRB MCAT server (meta-data); SRB data vaults (distributed storage, ‘real’ data); CSD AMD cluster (ulgbc1); CSD-Physics cluster (ulgbc2); NW-GRID cluster (ulgbc3); NW-GRID/POL cluster (ulgp4). Connections: Globus file staging]

41 Future developments
- make increased use of SRB for file staging and archiving of results in ULGrid
- expand job submission to other NW-GRID sites (and NGS?)
- encourage use of Condor-G for job submission on ULGrid/NW-GRID
- incorporate more applications into the portal
- publish more information in Condor-G ClassAds
- provide better support for long running jobs via the portal and improved reporting of job status

42 Further Information http://www.liv.ac.uk/e-science/ulgrid

