Campus Grid Technology
Derek Weitzel, University of Nebraska – Lincoln
Holland Computing Center (HCC), Home of the 2012 OSG AHM!
March 8, 2011


Motivation for a Campus Grid
- Increase available resources by distributing jobs around the campus.
- Offload single-core jobs to idle resources, making room for specialized (MPI) jobs.
- Independence from failures of any single resource.

Campus Grids you may know
- GLOW
  - Condor is everywhere!
  - AFS for data
- Purdue
  - Condor & PBS on the nodes
  - Shared home directories, large clusters with shared file systems, Condor file transfer between departmental clusters

What changes do we want?
- Prefer not to install Condor on every worker node; it is less intrusive for sysadmins.
- PBS and Condor should coordinate job scheduling: we don't want PBS to kill Condor jobs if it doesn't have to.

Campus Grid Technology: Condor + Glideins
- Condor handles job submission, job execution, and file transfer.
- Introducing the Campus Grid Factory to handle on-demand glidein submission.
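
From the user's perspective everything stays an ordinary Condor vanilla-universe job. A minimal sketch of such a submit file (the executable and file names here are hypothetical, not from the talk):

    universe   = vanilla
    executable = analyze.sh
    arguments  = input.dat
    transfer_input_files    = input.dat
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    output = job.out
    error  = job.err
    log    = job.log
    queue

The same submit file runs unchanged whether the job lands on a local slot, a campus glidein, or a flocked pool.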

Campus Grid Factory
- Submits Condor worker nodes (glideins) as jobs to PBS.
- Uses GlideinWMS libraries to manage the glideins on the cluster.
- Condor + GlideinWMS + BLAHp + custom glue.
- Provides a virtual Condor pool that outside clients can flock to.
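
Under the hood, handing a glidein to PBS goes through Condor's BLAHp-backed "batch" grid type. A hedged sketch of what such a submission can look like (the wrapper script name is illustrative, not the factory's actual file):

    universe      = grid
    # The batch grid type routes the job through the BLAHp to the local PBS
    grid_resource = batch pbs
    executable    = glidein_startup.sh   # hypothetical glidein wrapper
    output        = glidein.out
    error         = glidein.err
    log           = glidein.log
    queue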

Campus Factory Operation
1. The factory queries the user queue for idle jobs.
2. It submits glideins to PBS (through the Condor BLAHp).
3. PBS starts the glidein job.

Campus Factory Operation (continued)
4. The glidein reports to the factory's collector.
5. The user queue requests idle slots from the factory.
6. Condor negotiates and starts the job.
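
Once it starts under PBS, the glidein is just a personal Condor startd pointed back at the factory. A minimal sketch of the glidein-side configuration, with an assumed factory host name and shutdown timeout (both illustrative):

    # Report to the campus factory's collector instead of a site-wide pool
    CONDOR_HOST = cgf.unl.edu            # hypothetical factory host
    DAEMON_LIST = MASTER, STARTD
    # If no job claims the slot, exit so the PBS slot is returned promptly
    STARTD_NOCLAIM_SHUTDOWN = 1200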

Campus Grid Internals

Advantages of the Campus Factory
- The user is presented with a uniform Condor interface to resources.
- Can create an overlay network on any resource Condor (via the BLAHp) can submit to: PBS, LSF, …
- Uses well-established technologies (Condor, the BLAHp) and ideas (flocking).

Advantages for the Local Scheduler
- Allows PBS to know about, and account for, the outside jobs.
- PBS can co-schedule them against local user priorities.
- PBS can preempt grid jobs in favor of local jobs.

Getting outside the Campus
- Campus resources are busy. (Plot: running jobs on Firefly, roughly 4300 cores.)

Getting outside the Campus
- Users will use whatever resources are available, usually to the max. Have you had a user submit 50,000 core jobs to PBS? We have!
- There are resources available out there; we just need to get to them.
- Users want a smooth transition to those other resources.

Getting outside the Campus: Flocking
- The user's schedd advertises its idle jobs to the remote pool.
- The remote pool matches the idle jobs with job slots.
- The user's schedd transfers data directly to the worker node.

Getting outside the Campus: Flocking (continued)
- Need to have Condor on the other side.
- Need to have a mutual trust relationship, though not necessarily x509.
- Nebraska is part of Diagrid, a large Condor flock led by Purdue.
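
Flocking is switched on with a pair of Condor configuration knobs, one on each side of the relationship. A minimal sketch, with hypothetical host names standing in for the real pools:

    # On the campus submit host: pools that idle jobs may flock to
    FLOCK_TO   = collector.diagrid.example.edu   # hypothetical remote collector
    # On the remote pool's central manager: submitters allowed to flock in
    FLOCK_FROM = submit.unl.edu                  # hypothetical campus schedd
    ALLOW_WRITE = $(ALLOW_WRITE), submit.unl.edu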

Getting outside the Campus: OSG Interface
- Expand further, to the OSG Production Grid, with GlideinWMS.
- The GlideinWMS frontend provides a virtual Condor pool of grid worker nodes.
- Campus submit hosts can flock to the GlideinWMS frontend; submission at the frontend can also flock back to campus resources.
- We don't have to submit Globus jobs; everything is Condor vanilla universe. The factory at UCSD submits the Globus jobs for us.

Ever-increasing circle of resources
- Me, my friends, and everyone else. (Diagram: concentric rings of Local, Campus, and Grid resources.)

Campus Grid at Nebraska
- Prairiefire: PBS and Condor on the nodes (like Purdue).
- Firefly: PBS only.
- GlideinWMS interface to the OSG.
- Flocking to Purdue.

Usage of the Campus Grid
- Bridging to Purdue provides a lot of compute time, as do OSG resources (through GlideinWMS).

Future Work
- Accounting
  - Submission-side accounting is easy; we already do it in the OSG.
  - To get accurate data, we need to account on the execute side (the worker nodes).
- Data
  - Is there a better way than Condor file transfer?
  - What about the cache models being proposed by CMS/ATLAS?

Bonus: HTPC on Campus
- With the right requirements specified, HTPC jobs look exactly the same as regular jobs (see the sketch below).
- Flocking HTPC jobs to the campus factory is exactly the same; just specify the requirements.
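
A hedged sketch of the whole-machine idiom commonly used for HTPC on the OSG at the time; the attribute names varied by site, so treat them as illustrative rather than as the talk's exact recipe:

    universe   = vanilla
    executable = run_mpi_job.sh          # hypothetical whole-node MPI wrapper
    # Advertise that this job wants an entire machine...
    +RequiresWholeMachine = True
    # ...and match only slots that advertise they can provide one
    Requirements = (CAN_RUN_WHOLE_MACHINE =?= TRUE)
    queue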

Acknowledgements
- Brian Bockelman (technical consulting)
- Dan Fraser (Campus Grids lead)
- David Swanson (advisor)
- Igor Sfiligoi (GlideinWMS)
- Jeff Dost (UCSD factory support)