High Throughput Urgent Computing
Jason Cope, Condor Week 2008

2 Project Collaborators
- Argonne National Laboratory / University of Chicago
  - Pete Beckman
  - Suman Nadella
  - Nick Trebon
- University of Wisconsin-Madison
  - Ian Alderman
  - Miron Livny

3 Urgent Computing Use Cases

4 High Throughput Urgent Computing
- Urgent computing provides immediate, cohesive access to computing resources for emergency computations
- Support for urgent high throughput computing environments is necessary
  - Support for high throughput emergency computing applications
  - Urgent cycle scavenging

5 Resources for Urgent Computing Environments
- Dedicated Resources
  - Pros: immediate access
  - Cons: wasted cycles, cost
- Shared Resources
  - Pros: reuse of existing resources, increased utilization
  - Cons: resource contention, scheduling and authorization

6 SPRUCE
- Special PRiority and Urgent Computing Environment (SPRUCE)
- TeraGrid Science Gateway
- GOAL: provide a cohesive urgent computing infrastructure for emergency computations
  - Authorization
  - Resource Selection
  - Resource Allocation

7 SPRUCE Architecture Overview (1/2)
[Diagram: (1) an event is detected by an automated or human trigger; (2) a first responder activates a right-of-way token through the SPRUCE gateway / Web services]
Source: Pete Beckman, "SPRUCE: An Infrastructure for Urgent Computing"

8 SPRUCE Architecture Overview (2/2)
[Diagram: the user team authenticates and chooses a resource; conventional job submission parameters are combined with urgent computing parameters into an urgent computing job submission, which the SPRUCE job manager places in the supercomputer resource's priority job queue according to local site policies]
Source: Pete Beckman, "SPRUCE: An Infrastructure for Urgent Computing"

9 SPRUCE Resources
- Deployed on TeraGrid resources at IU, NCSA, NCAR, Purdue, TACC, SDSC, and UC/ANL
- Supported resource managers
  - PBS
  - PBS Pro
  - LSF
  - SGE
  - LoadLeveler
  - Cobalt
- Both local and Grid resource managers are supported

10 SPRUCE and Condor
[Diagram: the workflow from slide 8 with a Condor pool in place of the supercomputer resource; the user team authenticates and chooses a resource, and the urgent computing job submission (conventional plus urgent computing parameters) goes to the SPRUCE job manager, which dispatches it to the Condor pool under local site policies]
Adapted from Pete Beckman, "SPRUCE: An Infrastructure for Urgent Computing"

11 SPRUCE / Condor Integration
- Added support for urgent computing ClassAd attributes (see the submit-file sketch below)
  - SPRUCE_URGENCY
  - SPRUCE_TOKEN_VALID
  - SPRUCE_TOKEN_VALID_CHECK_TIME
- Modifications to the Condor schedd to identify SPRUCE jobs
- SPRUCE Grid ASCII Helper Protocol (GAHP) server
  - Asynchronously invokes SPRUCE Web service operations
  - GAHP calls integrated into the Condor schedd
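The slides list the attribute names but not how a job carries them. As a minimal sketch, a Condor submit description file could attach them as custom job ClassAd attributes using the "+" prefix; the executable name, urgency value, and token values below are hypothetical, and in the actual prototype the modified schedd and the SPRUCE GAHP server may manage the token-related attributes themselves.

    # urgent_job.sub: hypothetical submit file for a SPRUCE-enabled Condor pool
    universe   = vanilla
    # placeholder application and file names
    executable = run_forecast.sh
    output     = forecast.out
    error      = forecast.err
    log        = forecast.log

    # Custom job ClassAd attributes; the names are from the slide above,
    # the values are illustrative only.
    +SPRUCE_URGENCY                = "high"
    +SPRUCE_TOKEN_VALID            = True
    +SPRUCE_TOKEN_VALID_CHECK_TIME = 0

    queue

Submitting this with condor_submit would expose SPRUCE_URGENCY and the token attributes in the job ad, where the modified schedd and the site policies on the following slides can act on them.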

12 SPRUCE / Condor Integration

13 SPRUCE / Condor Integration
- SPRUCE provides an authorization mechanism for access to Condor resources
  - "Right-of-Way" access to Condor resources
  - The same authorization infrastructure used for supercomputer and Grid resource access
- Leverage existing Condor features to enhance scheduling policies (see the configuration sketch below)
  - Job ranking / suspension / preemption
  - Site administrators define local scheduling policies
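The slides do not give a concrete policy. The fragment below is a sketch, assuming the SPRUCE_URGENCY job attribute from slide 11 and RANK-based preemption, of how a site administrator might express an urgent-computing preference in the startd section of condor_config; the value "high" and the weight 1000 are assumptions.

    # Sketch of a startd policy fragment (illustrative only).
    # Prefer jobs that carry an urgent SPRUCE classification; with RANK-based
    # preemption enabled, a higher-RANK urgent job can displace a running
    # non-urgent job on this machine.
    RANK = 1000 * (TARGET.SPRUCE_URGENCY =?= "high")

    # Do not grant displaced jobs a retirement window, so urgent work starts
    # as soon as possible (0 is also the default; shown here for emphasis).
    MAXJOBRETIREMENTTIME = 0

Whether displaced work is suspended, evicted, or allowed to retire remains a local decision, which matches the slide's point that site administrators define their own scheduling policies.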

14 SPRUCE / Condor Status
- Prototype completed in August 2007
  - Demonstrated urgent authorization and scheduling capabilities
  - Deployed and tested on equipment at the University of Colorado
- Currently revising the prototype for a stable software release
  - Condor 7.0 support
  - Final software development iteration before the official release
  - Evaluation of SPRUCE-related software integrated into larger Condor pools

15 Future Work
- High throughput support for urgent computing applications
  - SURA SCOOP CH3D Grid Appliance
- Many additional evaluation tasks
  - Application requirements
  - Security
  - Deadline scheduling / response time
  - Reliability / fault tolerance analysis
  - Data management

16 High Throughput Urgent Computing Questions?