Ian C. Smith ULGrid – Experiments in providing a campus grid

Overview
- Current Liverpool systems
- PC Condor pool
- Job management in ULGrid using Condor-G
- The ULGrid portal
- Storage Resource Broker
- Future developments
- Questions

Current Liverpool campus systems
- ulgbc1
  - 24 dual-processor Athlon nodes, 0.5 TB storage, GigE
- ulgbc2
  - 38 single-processor nodes, 0.6 TB storage, GigE
- ulgbc3 / lv1.nw-grid.ac.uk
  - NW-GRID - 44 dual-core, dual-processor nodes, 3 TB storage, GigE
  - HCC - 35 dual-core, dual-processor nodes, 5 TB storage, InfiniPath
- ulgbc4 / lv2.nw-grid.ac.uk
  - 94 single-core nodes, 8 TB RAID storage, Myrinet
- PC Condor pool
  - ~300 Managed Windows Service PCs

PC Condor Pool
- allows jobs to be run remotely on MWS teaching centre PCs at times when they would otherwise be idle (~300 machines currently)
- provides high-throughput rather than high-performance computing (maximises the number of jobs that can be processed in a given time)
- only suitable for DOS-based applications running in batch mode (a minimal submit sketch follows this list)
- no communication between processes possible (“pleasantly parallel” applications only)
- statically linked executables work best (although it can cope with DLLs)
- can access application files on a network-mapped drive
- long-running jobs need to use Condor DAGMan
- authentication of users prior to job submission via ordinary University security systems (NIS+/LDAP)
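To make the batch-mode requirement concrete, here is a minimal sketch of the kind of vanilla-universe submit description a pool user might write. The executable, input file names and the Windows OpSys value are assumptions for illustration, not taken from the slides.

    # Hypothetical submit description for the Windows MWS pool.
    # Executable and file names are illustrative only.
    universe     = vanilla
    executable   = myapp.exe
    arguments    = input_$(Process).dat
    # target the Windows XP-era teaching-centre machines (OpSys value assumed)
    requirements = ( OpSys == "WINNT51" ) && ( Arch == "INTEL" )
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = input_$(Process).dat
    output = myapp_$(Process).out
    error  = myapp_$(Process).err
    log    = myapp.log
    notification = never
    queue 10

The $(Process) macro gives each of the ten queued jobs its own input and output files, which is how parameter sweeps are typically farmed out to the pool.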

Condor and power saving
- power saving is employed on all teaching centre PCs by default
- machines power down automatically if idle for > 30 min and no user is logged in, but...
- ...they will remain powered up until any running Condor job completes
- the NIC remains active, allowing remote wake-on-LAN
- the submit host detects when the number of idle jobs exceeds the number of idle machines and wakes up the pool as necessary
- a couple of teaching centres remain "always available" for testing etc.

[Diagram: Condor pool architecture - teaching centre PCs (Teaching Centre 1, Teaching Centre 2, ... other centres) make up the Condor pool; users log in to the Condor submit host, which works with the Condor central manager, Condor view server and Condor portal]

Condor research applications
- molecular statics and dynamics (Engineering)
- prediction of shapes and properties of molecules using quantum mechanics (Chemistry)
- modelling of avian influenza propagation in poultry flocks (Vet Science)
- modelling of E. coli propagation in dairy cattle (Vet Science)
- model parameter optimization using Genetic Algorithms (Electronic Engineering)
- computational fluid dynamics (Engineering)
- numerical simulation of ocean current circulation (Earth and Ocean Science)
- numerical simulation of the geodynamo magnetic field (Earth and Ocean Science)

[Figure: Boundary layer fluctuations induced by freestream streamwise vortices]

[Figure: Boundary layer ‘streaky structures’ induced by freestream streamwise vortices]

ULGrid aims
- provide a user-friendly single point of access to cluster resources
- Globus-based, with authentication through UK e-Science certificates
- job submission should be no more difficult than using a conventional batch system
- users should be able to determine easily which resources are available
- meta-scheduling of jobs
- users should be able to monitor the progress of all jobs easily
- jobs can be single-process or MPI
- job submission from either the command line (qsub-style script) or the web

ULGrid implementation
- originally tried Transfer-queue-over-Globus (ToG) from EPCC for job submission but...
  - messy to integrate with SGE
  - limited reporting of job status
  - no meta-scheduling possible
- decided to switch to Condor-G
- the Globus Monitoring and Discovery Service (MDS) was originally used to publish job status and resource info but...
  - very difficult to configure
  - hosts mysteriously vanish because of timeouts (processor overload? network delays? who knows)
  - all hosts occasionally disappear after a single cluster reboot
- eventually used Apache web servers to publish the information in the form of Condor ClassAds (a sketch of such a ClassAd follows this list)
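As a rough illustration of the ClassAd approach, the fragment below shows the sort of old-syntax machine ClassAd one of these web servers might publish for a cluster. Only the name and gatekeeper_url attributes are taken from the example submission file later in the slides; every other attribute and value here is an assumption.

    # Illustrative ClassAd for one cluster, as it might be published via Apache.
    # Only "name" and "gatekeeper_url" come from the slides; the rest is assumed.
    MyType         = "Machine"
    name           = "ulgbc1.liv.ac.uk"
    gatekeeper_url = "ulgbc1.liv.ac.uk:2119/jobmanager-sge"
    free_cpus      = 12
    total_cpus     = 48
    parallel_environments = "mpi2"

Condor-G's matchmaker can then treat each cluster much like an ordinary pool machine, matching job requirements against the advertised attributes.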

Condor-G pros
- familiar and reliable interface for job submission and monitoring
- very effective at hiding the Globus middleware layer
- meta-scheduling possible through the use of ClassAds (see the sketch after this list)
- automatic renewal of proxies on remote machines
- proxy expiry handled gracefully
- workflows can be implemented using DAGMan
- nice sysadmin features, e.g.
  - fair-share scheduling
  - changeable user priorities
  - accounting
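As a sketch of what ClassAd-based meta-scheduling looks like from the user's side, the fragment below leaves the choice of cluster to the matchmaker instead of naming one, and expresses a preference via rank. The free_cpus attribute is the same assumed attribute used in the ClassAd sketch above; only gatekeeper_url appears in the actual example submission file.

    # Hypothetical Condor-G fragment: let matchmaking choose any advertised
    # cluster and prefer the one with the most free CPUs ("free_cpus" is an
    # assumed attribute, used purely for illustration).
    universe        = globus
    globusscheduler = $$(gatekeeper_url)
    requirements    = ( TARGET.gatekeeper_url =!= UNDEFINED )
    rank            = TARGET.free_cpus

Compare this with the full submission file shown later, where the requirements expression pins the job to ulgbc1 instead.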

Condor-G cons
- user interface is different from SGE, PBS, etc.
- limited file staging facilities
- limited reporting of remote job status
- user still has to deal directly with Globus certificates
- matchmaking can be slow

Local enhancements to Condor-G
- extended resource specifications – e.g. parallel environment, queue
- extended file staging
- ‘Virtual Console’ - streaming of output files from remotely running jobs
- reporting of remote job status (e.g. running, idle, error)
- modified version of the LeSC SGE jobmanager runs on all clusters
- web interface
- MyProxy server for storage/retrieval of e-Science certificates
- automatic proxy certificate renewal using the MyProxy server (see the sketch after this list)
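Condor-G has submit-file commands for talking to a MyProxy server, which is presumably what the automatic renewal relies on; the fragment below is a sketch only, with the server name, credential name, proxy path and lifetimes invented for illustration.

    # Hypothetical fragment showing Condor-G's MyProxy support for proxy renewal.
    # Server name, credential name, proxy path and lifetimes are illustrative only.
    x509userproxy           = /tmp/x509up_u1234
    MyProxyHost             = myproxy.liv.ac.uk:7512
    MyProxyCredentialName   = ulgrid
    # refresh the proxy when less than an hour (in seconds) remains
    MyProxyRefreshThreshold = 3600
    # lifetime, in minutes, requested for each renewed proxy
    MyProxyNewProxyLifetime = 720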

Specifying extended job attributes

- without RSL schema extensions:

    globusrsl = ( environment = (transfer_input_files file1,file2,file3)\
                                (transfer_output_files file4,file5)\
                                (parallel_environment mpi2) )

- with RSL schema extensions:

    globusrsl = (transfer_input_files = file1, file2, file3)\
                (transfer_output_files = file4, file5)\
                (parallel_environment = mpi2)

  or...

    globusrsl = (parallel_environment = mpi2)
    transfer_input_files = file1, file2, file3
    transfer_output_files = file4, file5

  or...

    globusrsl = (parallel_environment = mpi2)
    transfer_input_files = file1, file2, file3

Typical Condor-G job submission file

    universe = globus
    globusscheduler = $$(gatekeeper_url)
    x509userproxy=/opt2/condor_data/ulgrid/certs/bonarlaw.cred
    requirements = ( TARGET.gatekeeper_url =!= UNDEFINED ) && \
                   ( name == "ulgbc1.liv.ac.uk" )
    output = condori_5e_66_cart.out
    error = condori_5e_66_cart.err
    log = condori_5e_66_cart.log
    executable = condori_5e_66_cart_$$(Name)
    globusrsl = ( input_working_directory = $ENV(PWD) )\
                ( job_name = condori_5e_66_cart )( job_type = script )\
                ( stream_output_files = pcgamess.out )
    transfer_input_files=pcgamess.in
    notification = never
    queue

[Diagram: ULGrid job management - users log in via the Condor-G portal or submit host; the Condor-G central manager, MyProxy server and published Condor ClassAds handle matchmaking, with Globus file staging to the clusters: CSD AMD cluster (ulgbc1), CSD-Physics cluster (ulgbc2), NW-GRID cluster (ulgbc3) and NW-GRID/POL cluster (ulgp4)]

Storage Resource Broker (SRB)
- open source grid middleware developed by the San Diego Supercomputer Center allowing distributed storage of data
- absolute filenames reflect the logical structure of the data rather than its physical location (unlike NFS)
- meta-data allows annotation of files so that results can be searched easily at a later date
- high-speed data movement through parallel transfers
- several interfaces available: shell (Scommands), Windows GUI (InQ), X Windows GUI, web browser (MySRB); also APIs for C/C++, Java, Python
- provides most of the functionality needed to build a data grid
- many other features

[Diagram: SRB data grid - the Condor-G central manager/submit host and the clusters (CSD AMD cluster ulgbc1, CSD-Physics cluster ulgbc2, NW-GRID cluster ulgbc3, NW-GRID/POL cluster ulgp4) stage files via Globus; the SRB MCAT server holds the meta-data while the ‘real’ data lives in distributed SRB data vaults]

Future developments
- make increased use of SRB for file staging and archiving of results in ULGrid
- expand job submission to other NW-GRID sites (and NGS?)
- encourage use of Condor-G for job submission on ULGrid/NW-GRID
- incorporate more applications into the portal
- publish more information in Condor-G ClassAds
- provide better support for long-running jobs via the portal and improved reporting of job status

Further Information