GriPhyN Virtual Data System: Grid Execution of Virtual Data Workflows
Mike Wilde, Argonne National Laboratory, Mathematics and Computer Science Division

Grid3 – The Laboratory
Supported by the National Science Foundation and the Department of Energy

Grid3 – Cumulative CPU Days to ~25 Nov 2003

Grid2003: ~100 TB data processed to ~25 Nov 2003

Functional View of Virtual Data Management

(diagram; the labeled components are:)
- Application
- Metadata Service: location based on metadata attributes
- Replica Location Service: location of one or more physical replicas
- Information Services: state of grid resources, performance measurements and predictions
- Planner: data location, replica selection, selection of compute and storage resources
- Executor: initiates data transfers and computations
- Security and Policy, Data Movement, Data Access, Compute Resources, Storage Resources

Outline

- Pegasus Introduction
- Pegasus and Other Globus Components
- Pegasus' Concrete Planner
- Future Improvements

Grid Applications

- Increasing in their level of complexity
- Use of individual application components
- Reuse of individual intermediate data products
- Description of data products using metadata attributes
- The execution environment is complex and very dynamic:
  – Resources come and go
  – Data is replicated
  – Components can be found at various locations or staged in on demand
- Separation between:
  – the application description
  – the actual execution description

(diagram: Abstract Workflow Generation; Concrete Workflow Generation)

Pegasus: Planning for Execution in Grids

- Maps from abstract to concrete workflow
  – Algorithmic and AI-based techniques
- Automatically locates physical locations for both components (transformations and data)
  – Uses the Globus Replica Location Service and the Transformation Catalog
- Finds appropriate resources to execute the jobs
  – Via the Globus Monitoring and Discovery Service
- Reuses existing data products where applicable
- Publishes newly derived data products
  – To the RLS and the Chimera virtual data catalog
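
To make the mapping concrete, here is a minimal Python sketch of the bind-one-job step, with hypothetical dictionary stand-ins for the Transformation Catalog and RLS (the real planner queries Globus services and ranks candidate sites rather than taking the first):

    from dataclasses import dataclass

    # Hypothetical stand-ins for the catalogs Pegasus consults.
    TC = {("preprocess", "isi"): "/usr/vds/bin/preprocess"}    # (logical tr, pool) -> physical tr
    RLS = {"f.input": ["gsiftp://host.isi.edu/data/f.input"]}  # lfn -> physical URLs

    @dataclass
    class AbstractJob:
        transformation: str
        inputs: list

    def map_job(job: AbstractJob, pools: list) -> dict:
        """Bind one abstract job to a concrete site and executable."""
        # pools that carry the executable, per the Transformation Catalog
        candidates = [p for p in pools if (job.transformation, p) in TC]
        site = candidates[0]  # a real planner would rank candidates via MDS state
        return {
            "site": site,
            "executable": TC[(job.transformation, site)],
            "input_replicas": {lfn: RLS.get(lfn, []) for lfn in job.inputs},
        }

    print(map_job(AbstractJob("preprocess", ["f.input"]), ["isi", "anl"]))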


Replica Location Service

- Pegasus uses the RLS to find input data
- Pegasus uses the RLS to register new data products
(diagram labels: LRC, RLI, Computation)
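
The split between per-site Local Replica Catalogs (LRCs) and the global Replica Location Index (RLI) can be modeled in a few lines; this is a toy illustration of the lookup path, not the Globus RLS API:

    # A toy model of the two-level RLS lookup; class names are illustrative.
    class ReplicaLocationIndex:
        """RLI: records which Local Replica Catalogs know a logical file."""
        def __init__(self):
            self.lrcs_for_lfn = {}

    class LocalReplicaCatalog:
        """LRC: maps logical file names to physical URLs at one site."""
        def __init__(self, rli):
            self.pfns, self.rli = {}, rli
        def register(self, lfn, pfn):
            # record locally and advertise the mapping up to the index
            self.pfns.setdefault(lfn, []).append(pfn)
            self.rli.lrcs_for_lfn.setdefault(lfn, []).append(self)
        def lookup(self, lfn):
            return self.pfns.get(lfn, [])

    def find_replicas(rli, lfn):
        """What Pegasus does for each input: ask the RLI, then the LRCs."""
        return [pfn for lrc in rli.lrcs_for_lfn.get(lfn, []) for pfn in lrc.lookup(lfn)]

    rli = ReplicaLocationIndex()
    lrc = LocalReplicaCatalog(rli)
    lrc.register("f.a", "gsiftp://host.isi.edu/data/f.a")   # a new data product
    print(find_replicas(rli, "f.a"))                        # -> one physical URL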

Use of MDS in Pegasus

- MDS provides up-to-date Grid state information:
  – Total and idle job queue lengths on a pool of resources (Condor)
  – Total and available memory on the pool
  – Disk space on the pools
  – Number of jobs running on a job manager
- Can be used for resource discovery and selection:
  – Developing various task-to-resource mapping heuristics
- Can be used to publish information necessary for replica selection:
  – Developing replica selection components
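
As a sketch of what a task-to-resource mapping heuristic over these attributes could look like (the attribute names follow the slide, but the scoring rule is an assumption, not the heuristic Pegasus ships):

    def rank_pools(pools):
        """Prefer short idle queues, then more free memory, then more disk."""
        return sorted(pools, key=lambda p: (p["idle_jobs"],
                                            -p["free_memory_mb"],
                                            -p["free_disk_mb"]))

    pools = [
        {"name": "isi", "idle_jobs": 4, "free_memory_mb": 2048, "free_disk_mb": 500_000},
        {"name": "anl", "idle_jobs": 0, "free_memory_mb": 1024, "free_disk_mb": 800_000},
    ]
    print(rank_pools(pools)[0]["name"])   # -> anl: the empty queue wins here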

Abstract Workflow Reduction

(diagram: a DAG of jobs a through i; key: the original node, input transfer node, registration node, output transfer node, node deleted by the reduction algorithm)
- The output jobs of the DAG are all the leaf nodes, i.e. f, h, i
- Each job requires 2 input files and generates 2 output files
- The user specifies the output location

Optimizing from the Point of View of Virtual Data

(diagram: the same DAG, with jobs a through f marked as deleted)
- Jobs d, e, and f have output files that have been found in the Replica Location Service, so they do not need to run
- Their ancestors are deleted as well, since no remaining job consumes their outputs
- In all, jobs a, b, c, d, e, and f are removed from the DAG
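
The reduction itself is a small fixed-point computation. The sketch below assumes the planner can ask the RLS whether a file already has a replica; it is illustrative, not the VDS implementation:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Job:
        name: str
        outputs: tuple = ()
        parents: tuple = ()   # upstream Jobs this one consumes from

    def reduce_dag(jobs, exists_in_rls):
        """Drop jobs whose outputs are already replicated, then keep
        dropping ancestors that no surviving job depends on."""
        deleted = {j for j in jobs
                   if j.outputs and all(exists_in_rls(f) for f in j.outputs)}
        changed = True
        while changed:
            changed = False
            for j in jobs:
                if j in deleted:
                    continue
                children = [c for c in jobs if j in c.parents]
                if children and all(c in deleted for c in children):
                    deleted.add(j)      # nobody left needs j's outputs
                    changed = True
        return [j for j in jobs if j not in deleted]

    a = Job("a", outputs=("f.a",))
    d = Job("d", outputs=("f.d",), parents=(a,))
    g = Job("g", outputs=("f.g",), parents=(d,))
    kept = reduce_dag([a, d, g], exists_in_rls=lambda f: f == "f.d")
    print([j.name for j in kept])   # -> ['g']: d is reduced away, then a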

Plans for Staging Data In

(diagram: the reduced DAG with input transfer nodes added)
- Transfer nodes are added for the input files of the root nodes
- The planner picks execution and replica locations

Staging Data Out and Registering New Derived Products in the RLS

(diagram: the DAG with staging and registration nodes added)
- Staging and registration nodes are added for each job that materializes data (g, h, i)
- The output files of the leaf job (f) are transferred to the output location
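
A rough sketch of this decoration step, using plain tuples for the transfer and registration nodes from the slide's key (the data structures are hypothetical, not VDS classes):

    def add_data_nodes(jobs, final_outputs, output_pool):
        """jobs: dicts with 'name', 'inputs', 'outputs'. Returns the extra
        transfer/registration nodes to splice around the compute jobs."""
        produced = {f for j in jobs for f in j["outputs"]}
        extra = []
        for j in jobs:
            for f in j["inputs"]:
                if f not in produced:  # input comes from an existing replica
                    extra.append(("transfer-in", f, "before", j["name"]))
            for f in j["outputs"]:
                extra.append(("register", f, "after", j["name"]))   # into the RLS
                if f in final_outputs:
                    extra.append(("transfer-out", f, "to", output_pool))
        return extra

    jobs = [{"name": "g", "inputs": ["f.d"], "outputs": ["f.g"]}]
    print(add_data_nodes(jobs, final_outputs={"f.g"}, output_pool="isi"))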

The Final Executable DAG

(diagram: the input DAG shown alongside the final executable DAG, in which jobs g, h, and i carry transfer and registration nodes)

Pegasus Components

- Concrete Planner and submit-file generator (gencdag)
  – The Concrete Planner of the VDS makes the logical-to-physical mapping of the DAX, taking into account the pool where the jobs are to be executed (execution pool) and the final output location (output pool)
- Java Replica Location Service clients (rls-client and rls-query-client)
  – Used to populate and query the Globus Replica Location Service

Pegasus Components (cont'd)

- XML Pool Config generator (genpoolconfig)
  – The Pool Config generator queries the MDS as well as local pool config files to generate an XML pool config, which is used by Pegasus
  – MDS is preferred for generating the pool configuration, as it provides much richer information about the pool, including queue statistics, available memory, etc.
- The following catalogs are consulted to make the translation:
  – Transformation Catalog (tc.data)
  – Pool Config File
  – Replica Location Services
  – Monitoring and Discovery Services

Transformation Catalog (Demo)

- Consists of a simple text file containing mappings of logical transformations to physical transformations
- Format of the tc.data file:

    #poolid  logical tr  physical tr              env
    isi      preprocess  /usr/vds/bin/preprocess  VDS_HOME=/usr/vds/;

- All physical transformations are absolute path names
- The environment string contains all the environment variables required for the transformation to run on the execution pool
- A DB-based TC is in the testing phase
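
Given that format, a tc.data reader is only a few lines; this sketch assumes whitespace-separated columns and ';'-separated environment entries, as in the sample record above:

    def parse_tc(lines):
        catalog = {}
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue   # skip blanks and the header comment
            pool, logical, physical, *env = line.split()
            env_str = " ".join(env).rstrip(";")
            env_vars = dict(e.split("=", 1) for e in env_str.split(";") if e)
            catalog[(pool, logical)] = (physical, env_vars)
        return catalog

    tc = parse_tc(["isi preprocess /usr/vds/bin/preprocess VDS_HOME=/usr/vds/;"])
    print(tc[("isi", "preprocess")])
    # -> ('/usr/vds/bin/preprocess', {'VDS_HOME': '/usr/vds/'})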

Pool Config (Demo)

- Pool Config is an XML file which contains information about the various pools on which DAGs may execute
- Some of the information contained in the Pool Config file:
  – The various job managers available on the pool for the different types of Condor universes
  – The GridFTP storage servers associated with each pool
  – The Local Replica Catalogs where data residing in the pool has to be cataloged
  – Profiles, such as environment hints, which are common site-wide
  – The working and storage directories to be used on the pool

Pool Config

- Two ways to construct the Pool Config file:
  – Monitoring and Discovery Service
  – Local pool config file (text based)
- Client tool to generate the Pool Config file:
  – The tool genpoolconfig queries the MDS and/or the local pool config file(s) to generate the XML Pool Config file

gvds.pool.config (Demo)

- This file is read by the information provider and published into MDS
- Format (one key : value record per line; values omitted here, and the gridftp and profile keys may repeat):

    gvds.pool.id :
    gvds.pool.lrc :
    gvds.pool.gridftp :
    gvds.pool.gridftp :
    gvds.pool.universe :
    gvds.pool.gridlaunch :
    gvds.pool.workdir :
    gvds.pool.profile :
    gvds.pool.profile :
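
Since the file is a flat list of key : value pairs, a reader for it can be very small; the sketch below is an assumption about the layout (repeated keys, such as gvds.pool.gridftp, accumulate into lists; the file name is hypothetical):

    def read_pool_config(path):
        config = {}
        with open(path) as f:
            for raw in f:
                key, sep, value = raw.partition(":")
                if sep:   # ignore lines without a "key : value" shape
                    config.setdefault(key.strip(), []).append(value.strip())
        return config

    # e.g. read_pool_config("gvds.pool.config").get("gvds.pool.gridftp", [])
    # would return every GridFTP server the pool advertises.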

Properties (Demo)

- Properties files define and modify the behavior of Pegasus
- Properties set in $VDS_HOME/properties can be overridden by defining them either in $HOME/.chimerarc or on the command line of any executable
  – e.g. gendax -Dvds.home=<path to vds home>
- Some examples follow; for more details please read the sample.properties file in the $VDS_HOME/etc directory
- Basic required properties:
  – vds.home: auto-set by the clients from the environment variable $VDS_HOME
  – vds.properties: path to the default properties file (default: ${vds.home}/etc/properties)
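
The precedence order described above (command line over $HOME/.chimerarc over $VDS_HOME/etc/properties) amounts to a first-match lookup; a minimal sketch, with hypothetical values:

    def resolve(key, cli_props, user_props, system_props):
        # earlier layers win: command line, then user file, then system file
        for layer in (cli_props, user_props, system_props):
            if key in layer:
                return layer[key]
        return None

    print(resolve("vds.home",
                  {"vds.home": "/opt/vds"},       # -D on the command line wins
                  {"vds.home": "/home/me/vds"},   # $HOME/.chimerarc
                  {"vds.home": "/usr/vds"}))      # $VDS_HOME/etc/properties
    # -> /opt/vds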

Concrete Planner gencdag (Demo)

- The Concrete Planner takes the DAX produced by Chimera and converts it into a set of Condor DAG and submit files
- Usage: gencdag --dax <dax file> --p <execution pool(s)> [--dir <directory>] [--o <output pool>] [--force]
- You can specify more than one execution pool. Execution will take place on the pools where the executable exists; if the executable exists on more than one pool, the pool on which it runs is selected randomly
- The output pool is the pool to which all output products are transferred. If not specified, the materialized data stays on the execution pool
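
For instance, a run against a DAX named montage.dax, planned for the isi pool with outputs shipped back to the same pool, might look like this (file and pool names are hypothetical; only the flags documented above are used):

    gencdag --dax montage.dax --p isi --dir dags --o isi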

Future Improvements

- A sophisticated concrete planner with AI technology
- A sophisticated transformation catalog with a DB backend
- Smarter scheduling of workflows, by deciding whether a workflow is compute intensive or data intensive
- In-time planning
- Using resource queue information and network bandwidth information to make a smarter choice of resources
- Reservation of disk space on remote machines

Pegasus Portal (screenshot)

Tutorial Outline

- Introduction: Grids, GriPhyN, Virtual Data (5 minutes)
- The Chimera system (25 minutes)
- The Pegasus system (25 minutes)
- Summary (5 minutes)

Summary: GriPhyN Virtual Data System

- Using virtual data helps reduce the time and cost of computation
- Services in the Virtual Data Toolkit:
  – Chimera: constructs a virtual plan
  – Pegasus: constructs a concrete grid plan from this virtual plan
- Some current applications of the Virtual Data Toolkit follow

Astronomy

- Montage (NASA and NVO) (B. Berriman, J. Good, G. Singh, M. Su)
  – Delivers science-grade custom mosaics on demand
  – Produces mosaics from a wide range of data sources (possibly in different spectra)
  – User-specified parameters of projection, coordinates, size, rotation and spatial sampling
(image: mosaic created by Pegasus-based Montage from a run of the M101 galaxy images on the TeraGrid)

Montage Workflow (diagram: 1,202 nodes)

BLAST

- BLAST: a set of sequence comparison algorithms used to search sequence databases for optimal local alignments to a query
- Led by Veronika Nefedova (ANL) as part of the PACI Data Quest Expedition program
- Two major runs were performed using Chimera and Pegasus:
  1) 60 genomes (4,000 sequences each) processed in 24 hours
     – Genomes selected from DOE-sponsored sequencing projects
     – 67 CPU-days of processing time delivered
     – ~10,000 Grid jobs
     – >200,000 BLAST executions
     – 50 GB of data generated
  2) 450 genomes processed
     – Speedups of 5 to 20 times were achieved because the compute nodes were used efficiently, by keeping the submission of jobs to the compute cluster constant

For Further Information

- Globus Project:
- Chimera:
- Pegasus: pegasus.isi.edu
- MCS: