VGrADS and GridSolve
Asim YarKhan, Jack Dongarra, Zhiao Shi, Fengguang Song
Innovative Computing Laboratory, University of Tennessee
VGrADS Workshop, September 2005

GridSolve Architecture
Client call: [x,y,z,info] = gridsolve('dgesv', A, B)
—Client sends a request to the Agent and gets back a server list
—Client sends the input data to the chosen server; the server returns the result
The Agent provides resource discovery, scheduling, load balancing, and fault tolerance (flow sketched below)
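
To make those flows concrete, the client-side sequence behind a gridsolve() call might look like the following C sketch. agent_get_servers and server_solve are hypothetical stand-ins for the client library internals, not the actual GridSolve API.

/* Hypothetical client-side flow behind gridsolve(); the helper
 * functions are illustrative stand-ins, not the GridSolve API. */
typedef struct server {
    struct server *next;   /* next candidate, in scheduled order */
    /* host, port, ... */
} server_t;

server_t *agent_get_servers(const char *service);  /* request -> server list */
int server_solve(server_t *s, const char *service,
                 double *A, double *B, int n);     /* send data, get result */

int gridsolve_dgesv(double *A, double *B, int n)
{
    /* The Agent returns candidates already ordered by its scheduling
     * and load-balancing policy; try the best server first. */
    for (server_t *s = agent_get_servers("dgesv"); s != NULL; s = s->next) {
        if (server_solve(s, "dgesv", A, B, n) == 0)
            return 0;   /* result received */
        /* Fault tolerance: on failure, fall through to the next server. */
    }
    return -1;          /* no server could service the request */
}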

GridSolve Philosophy
Simple-to-use access to complicated software libraries
Access to better hardware and software
Selection of the best machine in your grid to service each user request
Heterogeneous computing
Portability
—Non-portable code can be run from a client on any architecture, as long as there is a server provisioned with the code
Legacy codes easily wrapped into services

GridSolve Usage with VGrADS
Simple-to-use access to complicated software libraries, with no knowledge of grid-based computing
Selection of the best machines in your grid to service each user request
Portability
—Non-portable code can be run from a client on any architecture, as long as there is a server provisioned with the code
Legacy codes easily wrapped into services
Plug into the VGrADS framework, using vgES for resource selection and application launch:
—Integrated performance information
—Integrated monitoring
—Fault prediction
—Integrating the software and resource information repositories

VGrADS/GridSolve Architecture
Client call: [x,y,z,info] = gridsolve('foo', A, B)
—Client sends a request to the Agent
—Agent queries the Service Catalog for service information, then submits a vgDL spec to vgES and receives a resource list
—Selected resources obtain the software location, transfer the binary from the Software Repository, start the server, and register the server info with the Agent
—Client sends data to the server and receives the result

Agent
The Agent is specific to the client
—Initially the Agent contains no resource information; it is obtained from vgES
The Agent requests information from the Service Catalog about the available services and their complexity, in order to estimate the resources required (vgDL)
For each service request:
—Estimate the resources required from the problem complexity (sketched below)
–vgDL spec: vgdl = Clusterof [N]; node = {node.memory > 500MB, node.speed > 2000};
–vgid = vgCreateVG(vgserver, vgdl, 1000, ns-server-script)
—Return the set of resources to the client
—The ns-server-script launched by vgES fetches and deploys the needed services on the selected VGrADS resources
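
A minimal sketch of this estimation step in C, assuming the vgCreateVG call and vgDL syntax shown above. The prototype for vgCreateVG and the parameterization are assumptions, not the documented vgES API.

#include <stdio.h>

/* Assumed prototype for the vgCreateVG call shown on the slide. */
extern int vgCreateVG(void *vgserver, const char *vgdl,
                      int timeout, const char *server_script);

/* Sketch: turn a per-request resource estimate into a vgDL spec
 * and hand it to vgES. */
int request_resources(void *vgserver, int nprocs, long mem_mb,
                      long min_clock, const char *ns_server_script)
{
    char vgdl[256];

    snprintf(vgdl, sizeof vgdl,
             "Clusterof [%d]; node = {node.memory > %ldMB, "
             "node.speed > %ld};",
             nprocs, mem_mb, min_clock);

    /* vgES selects matching resources and launches the script,
     * which fetches and deploys the service on each node. */
    return vgCreateVG(vgserver, vgdl, 1000, ns_server_script);
}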

Service Provider
Compile a service
—Use standard GridSolve
—Write the service interface description, compile, get a binary
Send the binary to a repository (e.g. IBP)
Inform a service notification system or update the catalog (pipeline sketched below)
—This catalog can then be used by the agent to fetch and install the actual service binaries
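
As a rough sketch, the provider-side pipeline could be wired up as follows; repo_store and catalog_register are hypothetical stand-ins for the repository upload (e.g. IBP) and catalog update, since the slide does not name concrete APIs.

/* Hypothetical provider-side pipeline; both helpers are illustrative
 * stand-ins for the repository and service-catalog interfaces. */
int repo_store(const char *binary_path, char *url, int url_len);
int catalog_register(const char *service, const char *url);

int publish_service(const char *service, const char *binary_path)
{
    char url[1024];

    /* Push the compiled service binary into the repository and
     * record where it landed. */
    if (repo_store(binary_path, url, sizeof url) != 0)
        return -1;

    /* Update the catalog so the agent can later fetch and install
     * the binary on selected resources. */
    return catalog_register(service, url);
}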

Fault Tolerant GridSolve/VGrADS Demo
Demonstrates a fault-tolerant PCG application running on dynamic grid resources
—Diskless checkpointing of changing data (handled by the application)
—To survive k failures, you need k additional checkpoint processes (see the sketch after this slide)
—Very low overhead for checkpoints
End-to-end demonstration of the VGrADS approach
—Easy Matlab frontend
—GridSolve provides the service description and argument marshalling
—VGrADS provides resource location, allocation, and management
—FT-MPI provides fault notification and rebuilds the MPI environment
—The fault-tolerant PCG implementation provides a checkpoint/restart version of parallel PCG
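
The "k checkpoint processes survive k failures" property comes from encoding checkpoints rather than replicating them; for k = 1 a single checksum process suffices. Here is a minimal sketch of that idea with one checkpoint process. The rank layout and function name are illustrative, not the demo's actual FT-MPI PCG code.

#include <mpi.h>
#include <string.h>

/* Diskless checkpointing sketch with a single checkpoint process
 * (survives k = 1 failure; k failures need k such processes).
 * Ranks 0..nprocs-2 compute; rank nprocs-1 stores the element-wise
 * sum of their local state. */
void take_checkpoint(double *local_state, double *checksum, int n,
                     int rank, int nprocs, MPI_Comm comm)
{
    int ckpt_rank = nprocs - 1;

    if (rank == ckpt_rank) {
        /* Checkpoint process contributes zeros and keeps the sum. */
        memset(checksum, 0, n * sizeof(double));
        MPI_Reduce(MPI_IN_PLACE, checksum, n, MPI_DOUBLE, MPI_SUM,
                   ckpt_rank, comm);
    } else {
        MPI_Reduce(local_state, NULL, n, MPI_DOUBLE, MPI_SUM,
                   ckpt_rank, comm);
    }
    /* Recovery after compute rank r fails: r's state equals the
     * checksum minus the sum of the surviving compute processes'
     * states (FT-MPI rebuilds the communicator first). */
}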

PCG with Different MPI Implementations
[Table: PCG run times by number of processors (NProcs) under LAM, MPICH, FT-MPI, FT-MPI with a checkpoint every 2000 iterations, and FT-MPI with one process exiting; the numeric entries are not recoverable from this transcript]
Platform: dual-processor 2.4 GHz AMD Opteron nodes connected by Gigabit Ethernet
Test matrix bcsstk17: 10974 x 10974, 428,650 non-zeros, i.e. 39 non-zeros per row on average
Source: linear equation from an elevated pressure vessel

Question: vgDL for the PCG Application
There are many ways to express this in vgDL
—"Give me the best set of processors within some cluster that together have at least M free memory, for an application that will perform C floating-point operations"
—Use a time constraint to help
–The problem must be solved within time T
–Still needs an nproc_req
Example template (the three %d fields are filled with nproc_req, (input_size+output_size)*2, and total_mflop/time_constraint/nproc_req; worked arithmetic below):
–sample_cluster = ClusterOf(node)[%d] { node = [ ( (Memory_Avail>%d) && (Clock>%d) && (Processor==Pentium) && (OS==Linux) ) ] [ Rank=Memory_Avail ] }
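
For concreteness, the arithmetic behind those three fields, with made-up numbers (every value below is illustrative):

/* Worked example of the format-string arguments above; all numbers
 * are made up for illustration. */
void example_constraints(void)
{
    long input_size      = 200;      /* MB */
    long output_size     = 50;       /* MB */
    long total_mflop     = 480000;   /* from the service's complexity model */
    long time_constraint = 60;       /* seconds */
    long nproc_req       = 4;

    long mem_avail_min = (input_size + output_size) * 2;
    /* = 500 MB, matching node.memory > 500MB on the Agent slide */
    long clock_min = total_mflop / time_constraint / nproc_req;
    /* = 480000 / 60 / 4 = 2000: each node must sustain about
     * 2000 Mflop/s to meet the deadline */
    (void)mem_avail_min; (void)clock_min;
}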

Work in Fault Tolerance
Determine appropriate checkpoint intervals and the number of checkpoint processes
—From historical information and statistical prediction (see the sketch below)
—From monitoring
—Use for task migration if problems emerge
Other fault tolerance techniques
—Checkpoint processes participate in the computation
—Algorithm-based fault tolerance
Generalize to a library of fault-tolerant routines
—Conjugate gradient solver
—Matrix multiply
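
The slides do not say how the interval would be chosen; one standard estimate from historical failure data is Young's approximation, which balances checkpoint overhead against expected rework. A sketch (the formula is Young's; the helper function is illustrative):

#include <math.h>

/* Young's approximation for the optimal checkpoint interval:
 * T_opt = sqrt(2 * C * MTBF), where C is the cost of one checkpoint
 * and MTBF is the mean time between failures, both in seconds. */
double optimal_checkpoint_interval(double ckpt_cost, double mtbf)
{
    return sqrt(2.0 * ckpt_cost * mtbf);
}
/* e.g. ckpt_cost = 5 s, MTBF = 24 h = 86400 s
 *      => T_opt = sqrt(2 * 5 * 86400) ~ 930 s, about 15 minutes */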

Work in GridSolve/VGrADS
Better construction of vgDL from service complexity information
GridSolve monitor not yet integrated; vgES monitor to be integrated
Need to add more sites
—Currently working on adding the UCSD CSAG cluster

The End
Questions?