Tutorial on Distributed High Performance Computing 14:30 – 19:00 (2:30 pm – 7:00 pm) Wednesday November 17, 2010 Jornadas Chilenas de Computación 2010.


Tutorial on Distributed High Performance Computing 14:30 – 19:00 (2:30 pm – 7:00 pm) Wednesday November 17, 2010 Jornadas Chilenas de Computación 2010 INFONOR-CHILE 2010 November 15th - 19th, 2010 Antofagasta, Chile Dr. Barry Wilkinson University of North Carolina Charlotte Nov 3, 2010 © Barry Wilkinson

Part 2a Job schedulers, grid-enabling applications, higher-level interfaces

Job Schedulers Assign work (jobs) to compute resources to meet specified job requirements within the constraints of the available resources and their characteristics. An optimization problem: the objective is usually to maximize the throughput of jobs.

Scheduler with automatic data placement components (input/output staging), e.g. Stork. Fig 3.4

Advance reservation Term used for requesting actions at times in the future; in this context, requesting a job to start at some time in the future. Both computing resources and network resources may be involved, although the network connection (usually the Internet) is not normally reserved. Found in recent schedulers.

Some reasons one might want advance reservation in Grid computing:
- Reserved time chosen to reduce network or resource contention.
- Resources not physically available except at certain times.
- Jobs require access to a collection of resources simultaneously, e.g. data generated by experimental equipment.
- A deadline for the results of the work.
- Parallel programming jobs in which processes must communicate between themselves during execution.
- Workflow tasks in which jobs must pass results between themselves during execution.
Without advance reservation, schedulers schedule jobs from a queue with no guarantee of when the jobs will actually run.

Scheduler Examples Sun Grid Engine Condor/Condor-G

Grid Engine job submission GUI interface Fig. 3.8

Submitting a job through GRAM and through an SGE scheduler Fig. 3.10

Running Globus job with SGE scheduler using globusrun-ws command Scheduler selected by name using the -Ft option (factory type). The name for Sun Grid Engine is (obviously) SGE. Hence:

globusrun-ws -submit -Ft SGE -f prog1.xml

submits the job described in the job description file called prog1.xml. (A sketch of a possible prog1.xml is given below.)
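For reference, a minimal sketch of what a job description file such as prog1.xml might contain, using the GT4 job description schema; the executable, argument, and file names here are assumptions, not taken from the course:

<job>
    <executable>/bin/echo</executable>
    <argument>Hello Grid</argument>
    <stdout>${GLOBUS_USER_HOME}/prog1.out</stdout>
    <stderr>${GLOBUS_USER_HOME}/prog1.err</stderr>
</job>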

Output

Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:d23a7be0-f87c-11d9-a53b aae1f
Termination time: 07/20/ :44 GMT
Current job state: Active
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.

Note: the user credentials have to be delegated.

Actual machine running job The scheduler will choose the machine that the job is run on, which can vary for each job. Hence

globusrun-ws -submit -s -Ft SGE -c /bin/hostname

submits the executable hostname to the SGE scheduler in streaming mode, redirecting output to the console together with the usual Globus output. The hostname displayed will be that of the machine running the job and may vary from run to run.

Specifying Submit Host The submit host and the location of the factory service can be specified using the -F option, e.g.:

globusrun-ws -submit -s -F <factory contact> -Ft SGE -c /bin/hostname

(The factory contact argument to -F was missing in the original slide; a placeholder is shown here.)

Condor Developed at the University of Wisconsin-Madison in the mid-1980s to convert a collection of distributed workstations and clusters into a high-throughput computing facility. Key concept: using the wasted compute power of idle workstations. Hugely successful; many institutions now operate Condor clusters.

Condor Essentially a job scheduler: jobs are scheduled in the background on distributed computers, without the user needing an account on the individual computers. Users compile their programs for the computers Condor is going to use and link in the Condor libraries, which, among other things, handle input and capture output. The job is described in a job description file. Condor then ships the job off to appropriate computers.

Example job submission Condor has its own job description language to describe a job in a “submit description file”. Not in XML, as it predates XML. Simple submit description file example for a prog1 job:

# This is a comment; condor submit file for prog1 job
Universe = vanilla
Executable = prog1
Output = prog1.out
Error = prog1.error
Log = prog1.log
Queue

The vanilla universe is one of 9 environments and only requires an executable. (Checkpointing and remote system calls are not allowed in it.)

Submitting job Done with the condor_submit command:

condor_submit prog1.sdl

where prog1.sdl is the submit description file. Without any other specification, Condor will attempt to find a suitable executable machine from all those available. Condor works with and without a shared file system. Most local clusters are set up with a shared file system, in which case Condor does not need to explicitly transfer files.

Submitting Multiple Jobs Done by adding a number after the Queue command, i.e.:

# condor submit file for program prog1
Universe = vanilla
Executable = prog1
Queue 500

will submit 500 identical prog1 jobs at once. Multiple Queue commands can be used, with Arguments for each instance, as in the sketch below.
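A minimal sketch of multiple Queue commands, each instance with its own argument (the argument values here are hypothetical):

# hypothetical sketch: one Queue per instance, each with its own argument
Universe = vanilla
Executable = prog1
Arguments = 10
Queue
Arguments = 20
Queue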

Grid universe Condor can be used as the environment for Grid computing either stand-alone, without Grid middleware such as Globus, or integrated with the Globus toolkit.

Condor’s matchmaking mechanism To choose the best computer to run the job. Condor ClassAd: based upon the notion that jobs and resources advertise themselves in “classified advertisements”, which include their characteristics and requirements. A job ClassAd is matched against resource ClassAds.

Condor’s ClassAd Matchmaking Mechanism Fig 3.14

Machine ClassAd Set up during system configuration. Some attributes are provided by Condor, but their values can be dynamic and alter during system operation. Machine attributes can describe such things as: machine name, architecture, operating system, main memory available for the job, disk memory available for the job, processor performance, current load, etc. A sketch of typical attributes is shown below.
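A minimal sketch of what some machine ClassAd attributes might look like (the host name and values are hypothetical; attribute names follow Condor's conventions, with Memory in MB and Disk in KB):

Machine = "node01.example.edu"
Arch = "X86_64"
OpSys = "LINUX"
Memory = 4096
Disk = 1000000
LoadAvg = 0.15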

Job ClassAd A job is typically characterized by its resource requirements and preferences. May include: what the job requires, what the job desires, what the job prefers, and what the job will accept, expressed using Boolean expressions. These details are put in the submit description file.

Matchmaking commands Requirements and Rank Available for both the job ClassAd and the machine ClassAd: Requirements specifies the machine requirements. Rank is used to differentiate between multiple machines that satisfy the requirements, identifying a preference based upon user criteria. The Rank expression computes to a floating-point number; the resource with the highest rank is chosen. A sketch is given below.
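A minimal sketch of Requirements and Rank in a submit description file (the thresholds are hypothetical; KFlops is a machine attribute reflecting processor performance):

# hypothetical sketch: require a 64-bit Linux machine with at least 2 GB
# of memory, and prefer the fastest machine among those that qualify
Universe = vanilla
Executable = prog1
Requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Memory >= 2048)
Rank = KFlops
Queue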

Condor’s Directed Acyclic Graph Manager (DAGMan) Meta-scheduler Allows one to specify dependencies between Condor jobs. Example: “Do not run Job B until Job A has completed successfully.” Especially important for jobs working together (as in Grid computing).

Condor’s Directed Acyclic Graph (DAG) File Example (a “diamond” DAG: Job A at the top, Jobs B and C as its children, and Job D as the child of B and C):

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D

Running DAG Use the condor_submit_dag command. To start a DAG with the dag file diamond.dag:

condor_submit_dag diamond.dag

This submits a Scheduler Universe job with DAGMan as the executable.

Meta-schedulers Schedule jobs across distributed sites. Highly desirable in a Grid computing environment. For a Globus installation, a meta-scheduler interfaces to the local Globus GRAM installation, which in turn interfaces with the local job scheduler. Uses whatever local scheduler is present at each site.

Meta-scheduler interfacing to Globus GRAM

Condor-G A version of Condor that interfaces to the Globus environment. Jobs are submitted to Condor through the Grid universe and directed to the Globus job manager (GRAM). Fig 3-18

Communication between user, myProxy server, and Condor-G for long-running jobs Fig 3.19

Gridway A meta-scheduler designed specifically for a Grid computing environment. Interfaces to Globus components. An open-source project, it became part of the Globus distribution in June 2007.

Globus components used with Gridway Fig 3-20

Distributed Resource Management Application (DRMAA) (pronounced “drama”) A standard set of APIs for the submission and control of jobs to distributed resource managers (DRMs). Bindings exist in C/C++, Java, Perl, Python, and Ruby for a range of DRMs, including (Sun) Grid Engine, Condor, PBS/Torque, LSF, and Gridway. A sketch of its use from Java is given below.
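A minimal sketch, using the DRMAA Java binding, of submitting a single job and waiting for it to finish (the executable path is an assumption; error handling is omitted):

import org.ggf.drmaa.*;

public class DrmaaExample {
    public static void main(String[] args) throws DrmaaException {
        // Obtain a session for the locally installed DRM (e.g. Grid Engine)
        Session session = SessionFactory.getFactory().getSession();
        session.init(null);                      // default DRM contact

        JobTemplate jt = session.createJobTemplate();
        jt.setRemoteCommand("/bin/hostname");    // hypothetical executable

        String jobId = session.runJob(jt);       // submit the job
        session.wait(jobId, Session.TIMEOUT_WAIT_FOREVER); // block until done

        session.deleteJobTemplate(jt);
        session.exit();
    }
}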

Scheduler with DRMAA interface Fig 3.21

Example of the use of DRMAA Fig 3.22

Grid-enabling an application A poorly defined and understood term. It does NOT mean simply executing a job on a Grid platform! Almost all computer batch programs can be shipped to a remote Grid site and executed with little more than a remote ssh connection. This is a model we have had since computers were first connected (via telnet). Grid-enabling should include utilizing the unique distributed nature of the Grid platform.

Grid-enabling an application With that in mind, a simple definition is: Being able to execute an application on a Grid platform, using the distributed resources available on that platform. However, even that simple definition is not agreed upon by everyone!

A broad definition that matches our view of Grid-enabling applications is: “Grid Enabling refers to the adaptation or development of a program to provide the capability of interfacing with a grid middleware in order to schedule and utilize resources from a dynamic and distributed pool of ‘grid resources’ in a manner that effectively meets the program’s needs” [2]

[2] Nolan, K., “Approaching the Challenge of Grid-Enabling Applications,” Open Source Grid & Cluster Conf., Oakland, CA, 2008.

How does one do “Grid-enabling”? Still an open question and in the research domain without a standard approach. Here we will describe various approaches.

We can divide the use of the computing resources in a Grid into two types: Using multiple computers separately to solve multiple problems Using multiple computers collectively to solve a single problem

Using Multiple Computers Separately Parameter Sweep Applications In some domains, scientists need to run the same program many times but with different input data: a “sweep” across a parameter space with different values of the input parameters in search of a solution. In many cases the answer cannot easily be computed directly, and human intervention is required to search the parameter or design space.

Implementing Parameter Sweep Can be achieved simply by submitting multiple job description files, one for each set of parameters, but that is not very efficient. Parameter sweep applications are so important that research projects have been devoted to making them efficient on a Grid. Parameter sweeps appear explicitly in some job description languages, as in the sketch below. (More details in UNC-C course notes.)
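For example, a minimal sketch of a parameter sweep expressed in a Condor submit description file, using the $(Process) macro (which runs from 0 to 99 here) as the varying input parameter; the program and file names are hypothetical:

# hypothetical sketch: 100-point parameter sweep
Universe = vanilla
Executable = prog1
Arguments = $(Process)
Output = prog1.$(Process).out
Queue 100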

Exposing an Application as a Service “Wrap” the application code to produce a Web service. “Wrapping” means the application is not accessed directly but through a service interface. Grid computing has embraced Web service technology, so it is natural to consider its use for accessing applications.

Web service invoking a program If the Web service is written in Java, the service could issue a command in a separate process using the exec method of the current Runtime object with the construction:

Runtime runtime = Runtime.getRuntime();
Process process = runtime.exec(command);

where command is the command to issue. The command’s output is captured by reading from the process’s input stream:

InputStream stdout = process.getInputStream();
...

A fuller self-contained sketch of this pattern is given below.
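A minimal sketch of the wrapping pattern (the class name and the /bin/hostname command are hypothetical; a real service would return the string from a service method rather than from main):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class CommandWrapper {
    // Run a command and return its standard output as a string
    public static String run(String command) throws Exception {
        Process process = Runtime.getRuntime().exec(command);
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');   // collect stdout
            }
        }
        process.waitFor();                          // wait for completion
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(run("/bin/hostname"));
    }
}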

Portlet acting as a front-end to a wrapped application Fig 9.6

Application with physically distributed components Fig 9.7

Using Grid Middleware APIs Could use Grid middleware APIs in the application code for operations such as: file input/output, starting and monitoring jobs, and monitoring and discovery of Grid resources.

Using Globus APIs Globus provides a suite of services that have APIs (C and Java interfaces) that can be called from the application. Extremely steep learning curve!! Literally hundreds, if not thousands, of C and Java routines are listed at the Globus site, with little tutorial help or sample usage.

Code using Globus APIs to copy a file (C++) Directly from (van Nieuwpoort); also in (Kaiser 2004), (Kaiser 2005).

Using CoG Kit APIs Using the CoG Kit APIs is at a slightly higher level. Not too difficult, but still requires setting up the Globus context.

CoG Kit program to transfer files
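As an illustration only, a minimal sketch of a file transfer using the Java CoG Kit's UrlCopy class; the host names and paths are hypothetical, and a valid proxy credential is assumed to be in place:

import org.globus.io.urlcopy.UrlCopy;
import org.globus.util.GlobusURL;

public class CopyFile {
    public static void main(String[] args) throws Exception {
        UrlCopy copy = new UrlCopy();
        // hypothetical GridFTP source and destination URLs
        copy.setSourceUrl(new GlobusURL("gsiftp://host1.example.edu/~/data.in"));
        copy.setDestinationUrl(new GlobusURL("gsiftp://host2.example.edu/~/data.in"));
        copy.copy();    // perform the transfer
    }
}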

Higher-Level Middleware-Independent APIs A higher level of abstraction than the Globus middleware APIs is desirable because of: the complexity of the Globus routines; Grid middleware changing very often; and Globus not being the only Grid middleware.

Grid Application Toolkit (GAT) APIs for developing portable Grid applications independent of the underlying Grid infrastructure and services. Developed in the early-to-mid-2000s time frame. Copy a file in GAT/C++ (Kaiser, H. 2005).

SAGA (Simple API for Grid Applications) A subsequent effort by the Grid community to standardize higher-level APIs. SAGA: reading a file (C++) (Kielmann 2006).

GUI workflow editor For constructing workflows of jobs or web services. Our own workflow editor: GridNexus (see workshop for more details).

High-level Grid programming interfaces UNC-Charlotte “Seeds” Framework Program by implementing defined patterns (through Java interfaces); see workshop for more details.

Questions