Building Campus HTC Sharing Infrastructures Derek Weitzel University of Nebraska – Lincoln (Open Science Grid Hat)

Motivation: We have 3 clusters in 2 cities, and our largest (4400 cores) is always full.

Motivation: Workflows may require more computing power than a single cluster can provide, certainly more than an already full cluster can. Offload single-core jobs to idle resources elsewhere, making room for specialized (MPI) jobs.

HCC Campus Grid Framework Goals. Encompass: the campus grid should reach every cluster on campus. Transparent execution environment: the user interface should be identical on all resources, whether jobs run locally or remotely. Decentralization: a user should still be able to use the local resource if it becomes disconnected from the rest of the campus, and an error on one cluster should affect only that cluster.

HCC Campus Grid Framework Goals (revisited): Condor satisfies all three goals. It can encompass every campus cluster, present a transparent execution environment, and remain decentralized.

Encompass Challenges: Clusters run different job schedulers (PBS and Condor). Each cluster has its own policies: user priorities and allowed users. We may also need to expand outside the campus.

HCC Model for a Campus Grid: "me, my friends, and everyone else", expanding outward from local resources to the campus and then to the grid.

Preferences/Observations: Prefer not to install Condor on every worker node when PBS is already there; that is less intrusive for sysadmins. PBS and Condor should coordinate job scheduling: running Condor jobs look like idle cores to PBS, and we don't want PBS to kill Condor jobs if it doesn't have to.

Problem: PBS & Condor Coordination. Initial state: Condor is running a job on the node.

Problem: PBS & Condor Coordination. PBS then starts a job on the same node, and the Condor job is evicted and has to restart.

Problem: PBS & Condor Coordination. The real problem: PBS doesn't know about Condor, so it sees the nodes as idle.

Campus Grid Goals and Technologies. Encompassed: BLAHP, glideins (see the earlier talk by Igor/Jeff), Campus Grid Factory. Transparent execution environment: Condor flocking, glideins. Decentralized: Campus Grid Factory, Condor flocking.

Encompassed: BLAHP. Written for the European Grid Initiative, the BLAHP translates a Condor job into a PBS job and is distributed with Condor. With the BLAHP, Condor can provide a single interface for all jobs, whether they run under Condor or PBS.
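
(A rough sketch, not from the slides, with hypothetical file names: with the BLAHP in place, a grid-universe submit description can hand a job to a PBS cluster while the user keeps working entirely through Condor.)

    # Submit a job to the local PBS cluster through the BLAHP
    universe      = grid
    grid_resource = batch pbs
    executable    = analyze.sh
    arguments     = input.dat
    output        = analyze.out
    error         = analyze.err
    log           = analyze.log
    queue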

Putting it all Together: the Campus Grid Factory (sfactory/wiki).

Putting it all Together: the Campus Grid Factory provides an on-demand Condor pool that unmodified clients can reach with flocking.
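
(An illustrative configuration sketch with hypothetical host names: flocking is enabled by pointing the submit host's FLOCK_TO at the factory's central manager and listing that submit host in the factory pool's FLOCK_FROM, together with whatever authorization settings the pool already requires.)

    # On the user's submit host
    FLOCK_TO = cgf.unl.example.edu

    # On the Campus Grid Factory's central manager
    FLOCK_FROM = submit01.unl.example.edu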

Putting it all Together: the factory creates an on-demand Condor cluster out of Condor + glideins + BLAHP + GlideinWMS + glue.

Campus Grid Factory: glideins on the worker nodes create an on-demand overlay cluster.

Advantages for the Local Scheduler: PBS knows about and accounts for outside jobs, can co-schedule them against local user priorities, and can preempt grid jobs in favor of local jobs.

Advantages of the Campus Factory: the user is presented with a uniform Condor interface to all resources (see the sketch below). It can create the overlay on any resource Condor (via the BLAHP) can submit to: PBS, LSF, and others. It uses well-established technologies: Condor, BLAHP, and glideins.
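
(For illustration only, with made-up file names: the job the user writes is an ordinary vanilla-universe submit file; whether it runs on the local pool, on glideins overlaid on a PBS cluster, or on a flocked remote pool is decided by the infrastructure, not by anything in the file.)

    # Ordinary Condor job; nothing campus-grid specific
    universe   = vanilla
    executable = run_blast.sh
    arguments  = sequences_$(Process).fasta
    output     = blast_$(Process).out
    error      = blast_$(Process).err
    log        = blast.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue 100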

Problem with Pilot Job Submission. The Campus Factory's problem: if it sees idle jobs, it assumes they will run on its glideins. But jobs may require specific software or RAM sizes, so the Campus Factory can waste cycles submitting glideins that then sit idle. Past solutions were filters, albeit sophisticated ones.

Advanced Pilot Scheduling. What if we equated a completed glidein with an offline node?

Advanced Scheduling: OfflineAds. OfflineAds were added to Condor for power management: when nodes are not needed, Condor can turn them off, but it still needs to keep track of which nodes it has turned off and their (possibly special) capabilities. An OfflineAd describes a turned-off computer.

Advanced Scheduling: OfflineAds. Treat a submitted glidein like an offline node: when a glidein is no longer needed, it shuts down, but its description is kept as an OfflineAd. When a job matches the OfflineAd, the factory submits an actual glidein, on the reasonable expectation that submitting to the local scheduler (via the BLAHP) will produce a similar glidein. A sketch of such an ad is shown below.
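
(A rough sketch of the idea, with made-up attribute values: the factory keeps a machine ClassAd for a glidein that has exited, marked Offline, so jobs can still be matched against the capabilities that glidein had.)

    MyType  = "Machine"
    Name    = "glidein_12345@node042.cluster.example.edu"
    Offline = True
    State   = "Unclaimed"
    OpSys   = "LINUX"
    Arch    = "X86_64"
    Memory  = 16000

When a job matches this remembered ad, the factory submits a fresh glidein through the BLAHP and expects a slot with similar properties to come back.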

Extending Beyond the Campus: Nebraska does not have idle resources; Firefly (~4300 cores) is busy running jobs.

Extending Beyond the Campus: Options. To extend the transparent-execution goal, we need to send Condor jobs outside the campus. Options for getting outside the campus: flocking to external Condor clusters, or a grid workload manager such as GlideinWMS.

Extending Beyond the Campus: GlideinWMS. Expand further onto the OSG production grid with GlideinWMS, which creates an on-demand Condor cluster on grid resources. The campus grid can flock to this on-demand cluster just as it would to another local cluster.

Campus Grid at Nebraska: Prairiefire runs PBS and Condor side by side (like Purdue); Firefly runs only PBS; a GlideinWMS interface reaches the OSG; and we flock to Purdue.

HCC Campus Grid: 8 million hours delivered.

Questions? (Recap of the model: "me, my friends, and everyone else": local, campus, grid.)