
1 Grid scheduling issues in the Sun Data and Compute Grids Project
NeSC Grid Scheduling Workshop, Edinburgh, UK, 21st October 2003
Terry Sloan, EPCC
Telephone:

2 Overview
 Project Background, Goal, Functional Aims
 TOG – Transfer-queue Over Globus
 JOSH – JOb Scheduling Hierarchically
 Issues arising

3 Background
 EPCC – Edinburgh Parallel Computing Centre:
  – Part of the University of Edinburgh in Scotland
  – "A technology-transfer centre for high-performance computing"
  – Representing NeSC (UK National e-Science Centre)
 A collaborative project between EPCC and Sun
  – Often referred to as 'Sungrid' or 'Sun DCG' within EPCC
  – Project web site: www.epcc.ed.ac.uk/sungrid
  – 57 person-months of EPCC effort over 2 years
  – Started February 2002, will finish January 2004
  – Sun approves project deliverables
  – Funded by the UK DTI/EPSRC e-Science Grid Core Programme

4 Project Goal
"Develop a fully Globus-enabled compute and data scheduler based around Grid Engine, Globus and a wide variety of data technologies"
 Grid Engine – over downloads (Sep '03)
  – Open source Distributed Resource Management (DRM) tool
  – Schedules activities across networked resources
  – Allows collaboration amongst teams within a single enterprise
  – Integration with the Globus Toolkit enables collaboration amongst a number of enterprises
 Globus Toolkit:

5 Project Goal (cont)
 Given the project goal
  – "Develop a fully Globus-enabled compute and data scheduler based around Grid Engine, Globus and a wide variety of data technologies"
 What does that mean in practice?
 Identify five key functional aims
  – 1. Job scheduling across Globus to remote Grid Engines
  – 2. File transfer between local client site and remote jobs
  – 3. File transfer between any site and remote jobs
  – 4. Allow 'datagrid aware' jobs to work remotely
  – 5. Data-aware job scheduling

6 Functional Aim
 1. Job scheduling across Globus to remote GEs
  – Schedule jobs securely onto remote machines
  – Allow collaborating enterprises to share compute resources
  – Efficiency benefits of using lightly loaded remote machines
[Diagram: schedulers at Site A and Site B submit jobs to each other's Grid Engines across Globus]

7 Functional Aim
 2. File transfer between local site and remote jobs
  – Data staging, executable staging
  – Allow jobs expecting local file I/O to function properly
[Diagram: client site and compute site – 1. job submitted, 2. input files transferred (PRE), 3. job runs, 4. output files transferred (POST)]

8 Functional Aim
 3. File transfer between any site and remote jobs
  – Allow access to shared data files anywhere: HTTP sites, FTP sites, GridFTP sites
[Diagram: client site, compute site and data site – 1. job submitted, 2. input files transferred (PRE), 3. job runs, 4. output files transferred (POST)]

9 Functional Aim
 4. Allow 'datagrid aware' jobs to work remotely
  – Ensure that attempts by a job to access Globus-enabled data sources are not hindered by running the job remotely
  – Different style of job – dynamic data access, not just pre and post
  – Data sources: GridFTP sites, OGSA-DAI databases, SRB
[Diagram: client site, compute site and two data sites – 1. job submitted, 2. job starts running, 3. job reads from database, 4. job does some processing, 5. job writes to database, 6. go to 3 until done]
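To make aim 4 a little more concrete, here is a minimal sketch (not taken from the project) of what a 'datagrid aware' job script could look like when it moves data mid-run over GridFTP rather than relying only on pre/post staging. The host names, paths and job name are hypothetical, and it assumes a Globus proxy certificate is available in the remote job environment (as TOG/JOSH arrange via GRAM).

    #!/bin/sh
    #$ -N datagrid_demo
    # Hypothetical hosts and paths, for illustration only.
    # Pull the working dataset from a GridFTP server at run time.
    globus-url-copy gsiftp://datasite1.example.org/run42/input.dat \
                    file://$TMPDIR/input.dat
    # Process the staged file (user application).
    ./analyse $TMPDIR/input.dat > $TMPDIR/result.dat
    # Push the result to a (possibly different) GridFTP server.
    globus-url-copy file://$TMPDIR/result.dat \
                    gsiftp://datasite2.example.org/run42/result.dat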

10 Functional Aim
 5. Data-aware job scheduling
  – Schedule data-intensive jobs 'close' to the data they require
  – In this example Compute Site 2 has better access to the Data Site
[Diagram: client site, data site and two compute sites – 1. job submitted to Compute Site 2, 2. data transferred quickly]

11 TOG (Transfer-queue Over Globus)
 – No new meta-scheduler – the solution uses Grid Engine at two levels
 – Integrates GE and Globus 2.2.x
 – Supplies GE execution methods (starter method etc.) to implement a 'transfer queue' which sends jobs over Globus to a remote GE
 – GE complexes used for configuration
 – Globus GSI for security, GRAM for interaction with the remote GE
 – GASS for small data transfers, GridFTP for large datasets
 – Written in Java – Globus functionality accessed through the Java CoG Kit
[Diagram: a 'transfer queue' in User A's Grid Engine at Site A forwards jobs over Globus 2.2.x to User B's Grid Engine at Site B]
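For flavour, here is a minimal sketch of the kind of Grid Engine queue configuration a transfer queue relies on, shown in the style of 'qconf -sq transfer_q' output. The queue name and script paths are hypothetical placeholders; the actual execution-method scripts are supplied by the TOG distribution.

    qname               transfer_q
    slots               10
    # Execution methods redirected to wrapper scripts that forward the job
    # over Globus GRAM to the remote Grid Engine instead of running it
    # locally (paths are placeholders, not the real TOG layout).
    starter_method      /opt/tog/bin/starter.sh
    suspend_method      /opt/tog/bin/suspend.sh
    resume_method       /opt/tog/bin/resume.sh
    terminate_method    /opt/tog/bin/terminate.sh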

12 TOG Software
 Functionality
  – 1. Job scheduling across Globus to remote Grid Engines
  – 2. File transfer between local client site and remote jobs
    ● Add special comments to the job script to specify the set of files to transfer between the local site and the remote site (sketched below)
  – 4. Allow 'datagrid aware' jobs to work remotely
    ● Use of Globus GRAM ensures a proxy certificate is present in the remote environment
 Absent
  – 3. File transfer between any site and remote jobs
    ● Files are transferred between the remote site and the local site only
  – 5. Data-aware job scheduling
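A sketch of what such a job script might look like. The '#%' directive names below are hypothetical stand-ins; the actual comment syntax is defined in the TOG documentation.

    #!/bin/sh
    #$ -N tog_demo
    # Hypothetical TOG-style transfer directives (placeholder syntax):
    # files to copy from the local client site before the job runs,
    # and files to copy back once it completes.
    #% transfer_input  data/config.dat
    #% transfer_output data/results.dat
    ./simulate data/config.dat > data/results.dat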

13 TOG Software
 Pros
  – Simple approach
  – Usability
    ● Existing Grid Engine interface
    ● Users only need to learn Globus certificates
  – Remote administrators still have full control over their resources
 Cons
  – Low-quality scheduling decisions
    ● State of the remote resource – is it fully loaded?
    ● Ignores data transfer costs
  – Scales poorly – one local transfer queue for each remote queue
  – Manual set-up
    ● Configuring the transfer queue with the same properties as the remote queue
  – A Java virtual machine invocation per job submission

14 TOG Software
 Software and docs available from the Grid Engine community web site – html
 TOG usage
  – ODDGenes
  – Raytheon, USA
  – Curtin Business School, Australia
  – NeSC, Glasgow
  – …

15 JOSH (JOb Scheduling Hierarchically)
 Developing JOSH software
  – Address the shortcomings of TOG
  – Incorporate Globus 3 and grid services
 Adds a new 'hierarchical' scheduler above Grid Engine
  – Command line interface
  – hiersched submit_ge
    ● Takes a GE job script as input, embellished with data requirements (see the sketch after this slide)
    ● Queries grid services at each compute site to find the best match and submits the job
    ● Job controlled through the resulting 'job locator'
[Diagram: the user's job spec goes via the hiersched user interface to the hierarchical scheduler, which talks to a grid service layer above each compute site's Grid Engine and to the input and output data sites]
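As an illustration of that input, a GE job script 'embellished with data requirements' might look something like the sketch below; the '#%josh' comment syntax and the URLs are hypothetical, standing in for the markup defined in the JOSH documentation.

    #!/bin/sh
    #$ -N josh_demo
    # Hypothetical data-requirement hints (placeholder syntax), used both
    # for data-proximity scoring and for the PRE/POST transfer jobs:
    #%josh input  gsiftp://datasite.example.org/run42/input.dat
    #%josh output gsiftp://resultsite.example.org/run42/output.dat
    ./simulate input.dat > output.dat

It would then be submitted from the client site with the hiersched submit_ge command and controlled afterwards through the returned job locator, as shown on the JOSH Usage slide.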

16 JOSH Sites
 Compute sites
  – One Grid Engine per compute site
  – A persistent JOSH 'site service' runs in a Grid Service Container at each compute site
 Configuration file at client site (shape illustrated below)
  – Maps compute site names to site service handles
    'edi' ->
    'glasgow' ->
    'london' ->
  – Defines a 'pool' of candidate sites for the hierarchical scheduler to choose from
[Diagram: the hiersched user interface and config file at the client site; each compute site runs a JOSH site service in a grid service container in front of its Grid Engine]
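The slide omits the actual site service handles, so the sketch below is a purely hypothetical illustration of the mapping's shape; the URLs are placeholders, not real endpoints.

    # Client-side site pool configuration (hypothetical handles)
    'edi'     -> http://gridserv.edi.example.org:8080/ogsa/services/JoshSiteService
    'glasgow' -> http://gridserv.gla.example.org:8080/ogsa/services/JoshSiteService
    'london'  -> http://gridserv.lon.example.org:8080/ogsa/services/JoshSiteService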

17 JOSH Usage
  geoffc% hiersched sitestatus
  Hierarchical Scheduler Site Status
  ==================================
  Site Name   Status   Site Handle
  edi         Up
  glasgow     Down
  london      Up
  geoffc% hiersched submit-ge myjob.sh
  edi:4354
  geoffc% hiersched jobstatus edi:4354
  Pending
  geoffc% hiersched terminate edi:4354
  SUCCESS
  geoffc%

18 Scheduling Algorithm
 The hierarchical scheduler chooses a site for a job according to the following criteria:
  – Physical capability
    ● Sites which have no queues that can satisfy the job are rejected
  – Load score
    ● A site's load score is defined as the minimum load of its capable queues for a given job
    ● Sites with lower load are favoured
  – Data proximity
    ● Sites 'closer' to their data sources are favoured
 Weighting factors
  – Multipliers for load and data proximity can be supplied with the job (see the sketch below)
    ● E.g. so data proximity can be ignored for compute-intensive jobs
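The slides do not give the exact combination rule, so the following is only an assumed sketch of how the two scores and their job-supplied multipliers might be combined into a single per-site ranking (lower taken as better).

    # Illustrative only -- the real JOSH formula may differ.
    # load_score and data_proximity_score come from the site's grid services;
    # w_load and w_data are the multipliers supplied with the job (default 1).
    site_score=$(echo "$w_load * $load_score + $w_data * $data_proximity_score" | bc -l)
    # Incapable sites have already been rejected; the remaining sites are
    # ranked by site_score and the lowest-scoring site gets the job.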

19 JOSH Implementation
 One hiersched job submission is implemented as three Grid Engine jobs at the compute site (sketched below)
  – 1. PRE job – pulls input data files to the compute site
  – 2. MAIN job – the user's original job script
  – 3. POST job – pushes output data files to their destinations and cleans up
 The separate-jobs approach solves various issues
  – Don't want a high-performance queue clogged up with a big data transfer task
    ● PRE and POST jobs can be run in a dedicated data transfer queue
  – Do want a terminated job to be cleaned up and partial output data returned
    ● Terminating the MAIN job releases the POST job
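As a rough sketch (not the actual JOSH implementation), the three-job chain could be expressed with standard Grid Engine job dependencies; 'transfer_q' and 'compute_q' are hypothetical queue names, and the job id is parsed from qsub's "Your job N ... has been submitted" message.

    # PRE job in a dedicated data transfer queue
    pre_id=$(qsub -q transfer_q pre_stage.sh | awk '{print $3}')
    # MAIN job waits for the PRE job to finish
    main_id=$(qsub -q compute_q -hold_jid $pre_id myjob.sh | awk '{print $3}')
    # POST job waits on MAIN; if MAIN is terminated the hold is released,
    # so partial output is still pushed back and the site is cleaned up
    qsub -q transfer_q -hold_jid $main_id post_stage.sh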

20 Compute Site Implementation
 Grid service operations (see the sketch after this slide)
  – Call out to GE (and other) executables
  – Parse output (fiddly)
  – Some more-specific GE commands would be handy
[Diagram: the JoshSiteService in the compute site's grid services container exposes canRun(...), jobStatus(...), loadScore(...) and dataProximityScore(...), implemented over ping, qsub -w v and qstat/qacct]
 Globus 3 operations always run as the container owner
  – But job submission, termination etc. must run under the client user's remote account
  – Ensure correct privileges, prevent killing someone else's job etc.
  – Workaround is to use the GT3 Managed Job Service – a bit ugly
  – Allows a script to run under the client user's account
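To illustrate the 'call out and parse' flavour (this is not the JoshSiteService code): 'qsub -w v' verifies whether any queue could run a job without actually submitting it, and load information has to be scraped from qstat output, which is exactly the fiddly part the slide complains about.

    # Hypothetical helper: succeeds if some queue could run the job.
    # The exact verification message text varies between GE versions.
    can_run() {
        qsub -w v "$1" > verify.out 2>&1
        grep -qi "verification" verify.out && ! grep -qi "error" verify.out
    }
    # Scrape per-queue load averages from 'qstat -f'; the column layout
    # differs between GE versions, hence the defensive pattern match.
    qstat -f | awk 'NF >= 4 && $4 ~ /^[0-9.]+$/ {print $1, $4}'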

21 JOSH
 Pros
  – Satisfies the five functionality goals
  – Needs only minor additions to existing GE job scripts
    ● Data requirement comments
  – Remote administrators still have full control over their GEs
  – Tries to make use of existing GE functionality, e.g. 'can run'
 Cons
  – Latency in decision making
  – Not so much 'scheduling' as 'choosing'
  – Grid Engine-specific solution

22 JOSH Usage
 JOSH development finishes January 2004
 Possible future work includes
  – A LoadLeveler version

23 Issues arising
 DRMs have different job specifications and policies
  – Higher-level tools need to query everything about a site
  – DRMAA is not enough
  – JSDL-WG and WS-Agreement are good, but when …
  – This is the reason for the GE-specific solution in TOG/JOSH
 Resource control
  – Priorities for local users
  – Security, opening of firewalls
 GT3 Managed Job Service overhead
  – A whole new grid services container is launched for the mapped user
  – Too specific – only for job submission
  – Unrelated services that need to run as a mapped user have to pretend to be a job

24 Issues arising
 Gridmap file
  – Users need to have an account, or access to an account, on all resources
  – The Community Authorisation Service solves this, but its beta release comes after Sun DCG finishes
 Platform Community Scheduler Framework (CSF)
  – Extensible framework for building meta-schedulers
  – WS-Agreement based; to be included in future GT releases
    ● Targeted for fall 2003
  – A beta release is currently available from Platform

25 Wrap-Up
 Questions??