
1 GriPhyN Virtual Data System Mike Wilde Argonne National Laboratory Mathematics and Computer Science Division LISHEP 2004, UERJ, Rio De Janeiro 13 Feb 2004

2 A Large Team Effort!
LISHEP2004/UERJ www.griphyn.org/chimera 13 Feb 04
The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao.
The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi.
Applications described are the work of many people, including James Annis, Rick Cavanaugh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, and their wonderful teams.

3 Acknowledgements
GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation.
The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM.

4 Tutorial Objectives
• Provide a detailed introduction to existing services for virtual data management in grids
• Provide descriptions and interactive demonstrations of:
  – the Chimera system for managing virtual data products
  – the Pegasus system for planning and execution in grids
• Intended for those interested in creating and running huge workflows on the grid

5 Tutorial Outline
• Introduction: Grids, GriPhyN, Virtual Data (5 minutes)
• The Chimera system (25 minutes)
• The Pegasus system (25 minutes)
• Summary (5 minutes)

6 GriPhyN – Grid Physics Network Mission
Enhance scientific productivity through:
• Discovery and application of datasets
• Enabling use of a worldwide data grid as a scientific workstation
Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance.

7 Virtual Data System Approach
Producing data from transformations with uniform, precise data interface descriptions enables…
• Discovery: finding and understanding datasets and transformations
• Workflow: structured paradigm for organizing, locating, specifying, & producing scientific datasets
  – Forming new workflow
  – Building new workflow from existing patterns
  – Managing change
• Planning: automated to make the Grid transparent
• Audit: explanation and validation via provenance

8 Virtual Data Scenario
• On-demand data generation
• Update workflow following changes
• Manage workflow
• Explain provenance, e.g. for file8
Command lines shown on the slide:
    psearch -t 10 -i file3 file4 file5 -o file8
    summarize -t 10 -i file6 -o file7
    reformat -f fz -i file2 -o file3 file4 file5
    conv -l esd -o aod -i file2 -o file6
    simulate -t 10 -o file1 file2
(Diagram: simulate produces file1 and file2; reformat produces file3, file4 and file5; conv produces file6; summarize produces file7; psearch produces file8.)
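The provenance question on this slide ("how was file8 made?") amounts to walking the recorded derivations backwards from the target file. The following is a minimal Python sketch of that idea, not Chimera code: the `DERIVATIONS` table layout and the `explain` helper are illustrative assumptions, and the command strings are abbreviated from the slide.

```python
# Each recorded derivation: (command, input files, output files).
# Commands are abbreviated from the slide; the table layout is an assumption.
DERIVATIONS = [
    ("simulate -t 10", [], ["file1", "file2"]),
    ("reformat -f fz", ["file2"], ["file3", "file4", "file5"]),
    ("conv -l esd -o aod", ["file2"], ["file6"]),
    ("summarize -t 10", ["file6"], ["file7"]),
    ("psearch -t 10", ["file3", "file4", "file5"], ["file8"]),
]


def explain(target, derivations=DERIVATIONS):
    """Return the commands needed to produce `target`, producers first."""
    # Map every output file to the derivation that produces it.
    producer = {out: d for d in derivations for out in d[2]}
    ordered, seen = [], set()

    def visit(name):
        d = producer.get(name)
        if d is None or d[0] in seen:
            return  # raw input, unknown file, or already handled
        seen.add(d[0])
        for parent in d[1]:
            visit(parent)  # explain the inputs before the step itself
        ordered.append(d[0])

    visit(target)
    return ordered
```

Calling `explain("file8")` lists simulate, reformat and psearch in that order; conv and summarize do not appear because file8 does not depend on them.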

9 The Grid
• Emerging computational, networking, and storage infrastructure
  – Pervasive, uniform, and reliable access to remote data, computational, sensor, and human resources
• Enable new approaches to applications and problem solving
  – Remote resources the rule, not the exception
• Challenges
  – Heterogeneous components
  – Component failures common
  – Different administrative domains
  – Local policies for security and resource usage

10 Grid3 – The Laboratory
Supported by the National Science Foundation and the Department of Energy.

11 Grid3 – Cumulative CPU Days to ~ 25 Nov 2003

12 Grid2003: ~100TB data processed to ~ 25 Nov 2003

13 Requirements for Virtual Data Management
• Terabytes or petabytes of data
  – Often read-only data, “published” by experiments
  – Other systems need to maintain data consistency
• Large data storage and computational resources shared by researchers around the world
  – Distinct administrative domains
  – Respect local and global policies governing how resources may be used
• Access raw experimental data
• Run simulations and analysis to create “derived” data products

14 Requirements for Virtual Data Management (Cont.)
• Locate existing data
  – Record and query for existence of data
• Data access based on metadata
  – High-level attributes of data
• Support high-speed, reliable data movement
  – E.g., for efficient movement of large experimental data sets
• Planning, scheduling and monitoring execution of data requests and computations
• Management of data replication
  – Register and query for replicas
  – Select the best replica for a data transfer
• Virtual data
  – Desired data may be stored on a storage system (“materialized”) or created on demand

15 Tutorial Content
• The Chimera system for managing virtual data products
  – Virtual data: materialize data on-demand
  – Virtual data language, catalog and interpreter
• The Pegasus system for planning and execution in grids
  – Pegasus is a configurable system that can map and execute complex workflows on grid resources

16 Tutorial Outline
• Introduction: Grids, GriPhyN, Virtual Data (5 minutes)
• The Chimera system (25 minutes)
• The Pegasus system (25 minutes)
• Summary (5 minutes)

17 Chimera Virtual Data System

18 Chimera Virtual Data System Outline
• Virtual data concept and vision
• VDL – the Virtual Data Language
• Simple virtual data examples
• Virtual data applications in High Energy Physics and Astronomy
• Use of virtual data tools

19 Virtual Data System Capabilities
Producing data from transformations with uniform, precise data interface descriptions enables…
• Discovery: finding and understanding datasets and transformations
• Workflow: structured paradigm for organizing, locating, specifying, & producing scientific datasets
  – Forming new workflow
  – Building new workflow from existing patterns
  – Managing change
• Planning: automated to make the Grid transparent
• Audit: explanation and validation via provenance

20 VDL: Virtual Data Language Describes Data Transformations
• Transformation
  – Abstract template of program invocation
  – Similar to "function definition"
• Derivation
  – “Function call” to a Transformation
  – Store past and future:
    > A record of how data products were generated
    > A recipe of how data products can be generated
• Invocation
  – Record of a Derivation execution

21 Example Transformation
    TR t1( out a2, in a1, none pa = "500", none env = "100000" ) {
      argument = "-p "${pa};
      argument = "-f "${a1};
      argument = "-x -y";
      argument stdout = ${a2};
      profile env.MAXMEM = ${env};
    }
(Diagram: $a1 -> t1 -> $a2)

22 Example Derivations
    DV d1->t1 (
      env="20000", pa="600",
      a2=@{out:run1.exp15.T1932.summary},
      a1=@{in:run1.exp15.T1932.raw}
    );
    DV d2->t1 (
      a1=@{in:run1.exp16.T1918.raw},
      a2=@{out:run1.exp16.T1918.summary}
    );
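The "function definition" and "function call" analogy for TR and DV can be made concrete with a small Python model. This is only a sketch under stated assumptions: the `Transformation` and `Derivation` classes and the `argv_template` field are hypothetical names, the stdout redirection and the `env.MAXMEM` profile are left out, and nothing here reflects how Chimera itself stores or expands VDL.

```python
from dataclasses import dataclass, field


@dataclass
class Transformation:
    """Abstract template of a program invocation (like a function definition)."""
    name: str
    defaults: dict       # formal parameter -> default value (None means required)
    argv_template: list  # strings with {param} placeholders


@dataclass
class Derivation:
    """Actual arguments bound to a transformation (like a function call)."""
    name: str
    tr: Transformation
    actuals: dict = field(default_factory=dict)

    def command_line(self):
        # Actual arguments override the defaults declared by the transformation.
        bound = {**self.tr.defaults, **self.actuals}
        missing = [k for k, v in bound.items() if v is None]
        if missing:
            raise ValueError(f"{self.name}: unbound parameters {missing}")
        return [part.format(**bound) for part in self.tr.argv_template]


# Mirrors TR t1 and DV d1/d2 from the slides (argument list only).
t1 = Transformation(
    name="t1",
    defaults={"a1": None, "a2": None, "pa": "500", "env": "100000"},
    argv_template=["-p", "{pa}", "-f", "{a1}", "-x", "-y"],
)
d1 = Derivation("d1", t1, {"pa": "600", "env": "20000",
                           "a1": "run1.exp15.T1932.raw",
                           "a2": "run1.exp15.T1932.summary"})
d2 = Derivation("d2", t1, {"a1": "run1.exp16.T1918.raw",
                           "a2": "run1.exp16.T1918.summary"})
```

Here `d1` overrides `pa`, while `d2` falls back to the default "500", which is the behaviour the two derivations on the slide illustrate.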

23 Workflow from File Dependencies
    TR tr1(in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2};
    }
    TR tr2(in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2};
    }
    DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
    DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});
(Diagram: file1 -> x1 -> file2 -> x2 -> file3)
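The workflow graph is never written down explicitly: it follows from matching each derivation's input files against other derivations' output files. A minimal Python sketch of that inference is shown below; the `infer_edges` helper and the dictionary layout are assumptions made for illustration.

```python
def infer_edges(derivations):
    """Derive (parent, child) job pairs from shared logical file names.

    derivations: name -> (list of input files, list of output files)
    """
    # Which derivation produces each file?
    producer = {}
    for name, (_, outputs) in derivations.items():
        for lfn in outputs:
            producer[lfn] = name
    # A derivation depends on whoever produces one of its inputs.
    edges = set()
    for name, (inputs, _) in derivations.items():
        for lfn in inputs:
            if lfn in producer and producer[lfn] != name:
                edges.add((producer[lfn], name))
    return sorted(edges)


# The two derivations from the slide: x2 reads what x1 writes.
chain = {
    "x1": (["file1"], ["file2"]),
    "x2": (["file2"], ["file3"]),
}
```

For `chain`, the only inferred edge is from x1 to x2; file1 has no producer, so it is treated as an existing input.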

24 Example Invocation
(Figure: an invocation record, annotated with)
• Completion status and resource usage
• Attributes of executable transformation
• Attributes of input and output files

25 Example Workflow
• Complex structure
  – Fan-in
  – Fan-out
  – "left" and "right" can run in parallel
• Uses input file
  – Register with RC
• Complex file dependencies
  – Glues workflow
(Diagram nodes: preprocess, findrange, analyze)

26 Workflow step "preprocess"
• TR preprocess turns f.a into f.b1 and f.b2
    TR preprocess( output b[], input a ) {
      argument = "-a top";
      argument = " -i "${input:a};
      argument = " -o " ${output:b};
    }
• Makes use of the "list" feature of VDL
  – Generates 0..N output files
  – The number of files depends on the caller

27 Workflow step "findrange"
• Turns two inputs into one output
    TR findrange( output b, input a1, input a2,
                  none name="findrange", none p="0.0" ) {
      argument = "-a "${name};
      argument = " -i " ${a1} " " ${a2};
      argument = " -o " ${b};
      argument = " -p " ${p};
    }
• Uses the default argument feature

28 Can also use list[] parameters
    TR findrange( output b, input a[],
                  none name="findrange", none p="0.0" ) {
      argument = "-a "${name};
      argument = " -i " ${" "|a};
      argument = " -o " ${b};
      argument = " -p " ${p};
    }

29 Workflow step "analyze"
• Combines intermediary results
    TR analyze( output b, input a[] ) {
      argument = "-a bottom";
      argument = " -i " ${a};
      argument = " -o " ${b};
    }

30 Complete VDL workflow
• Generate appropriate derivations
    DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );
    DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left", p="0.5" );
    DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );
    DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );

31 Compound Transformations
• Using compound TR
  – Permits composition of complex TRs from basic ones
  – Calls are independent
    > unless linked through LFN
  – A Call is effectively an anonymous derivation
    > Late instantiation at workflow generation time
  – Permits bundling of repetitive workflows
  – Model: Function calls nested within a function definition

32 Compound Transformations (cont)
• TR diamond bundles black-diamonds
    TR diamond( out fd, io fc1, io fc2, io fb1, io fb2, in fa, p1, p2 ) {
      call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
      call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
      call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
      call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
    }

33 Compound Transformations (cont)
• Multiple DVs allow easy generator scripts:
    DV d1->diamond( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"},
                    fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"},
                    p2="100", p1="0" );
    DV d2->diamond( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"},
                    fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"},
                    p2="141.42135623731", p1="0" );
    ...
    DV d70->diamond( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"},
                     fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"},
                     p2="800", p1="18" );
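A generator script of the kind this slide alludes to could look like the Python sketch below. It assumes what the three visible derivations suggest: each DV consumes six consecutive logical file names, numbered in five-digit uppercase hexadecimal. The `diamond_dv` function is a hypothetical helper, and `p1` and `p2` are passed in by the caller because the slide does not say how they are computed.

```python
def diamond_dv(index, p1, p2):
    """Emit one DV statement for TR diamond; `index` starts at 1."""
    base = (index - 1) * 6  # six logical files per diamond (assumed from the slide)

    def lfn(offset):
        # Five-digit uppercase hex, e.g. 414 -> "f.0019E"
        return f"f.{base + offset:05X}"

    return (
        f'DV d{index}->diamond( fd=@{{out:"{lfn(5)}"}}, '
        f'fc1=@{{io:"{lfn(4)}"}}, fc2=@{{io:"{lfn(3)}"}}, '
        f'fb1=@{{io:"{lfn(2)}"}}, fb2=@{{io:"{lfn(1)}"}}, '
        f'fa=@{{io:"{lfn(0)}"}}, p2="{p2}", p1="{p1}" );'
    )
```

With this numbering, `diamond_dv(70, "18", "800")` uses f.0019E through f.001A3, which matches the file names of d70 on the slide.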

34 Virtual Data Example: Galaxy Cluster Search
(Figure labels: Sloan Data; DAG; Galaxy cluster size distribution)
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago

35 Cluster Search Workflow Graph and Execution Trace
(Figure: workflow jobs vs time)

36 Virtual Data Application: High Energy Physics Data Analysis
(Figure: graph of virtual data products, each labelled by its parameters)
  mass = 200, decay = WW
  mass = 200, decay = ZZ
  mass = 200, decay = bb
  mass = 200, plot = 1
  mass = 200, event = 8
  mass = 200, decay = WW, stability = 1
  mass = 200, decay = WW, stability = 3
  mass = 200, decay = WW, plot = 1
  mass = 200, decay = WW, event = 8
  mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000
  mass = 200, decay = WW, stability = 1, event = 8
  mass = 200, decay = WW, stability = 1, plot = 1
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida

37 Observations
• A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
• Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation

38 Virtual Data Grid Vision

39 Vision for Provenance in the Large
• Universal knowledge management and production systems
• Vendors integrate the provenance tracking protocol into data processing products
• Ability to run anywhere “in the Grid”

40 Functional View of Virtual Data Management
(Diagram components and their labels)
• Application
• Metadata Service: location based on metadata attributes
• Replica Location Service: location of one or more physical replicas
• Information Services: state of grid resources, performance measurements and predictions
• Planner: data location, replica selection, selection of compute and storage resources
• Security and Policy
• Executor: initiates data transfers and computations
• Data Movement, Data Access
• Compute Resources, Storage Resources

41 GriPhyN/PPDG Data Grid Architecture
(Diagram components: Application; Planner; Executor; Catalog Services; Info Services; Policy/Security; Monitoring; Repl. Mgmt.; Reliable Transfer Service; Compute Resource; Storage Resource)
(Diagram flow labels: DAG (abstract); DAG (concrete))
(Technology labels: DAGMAN, Kangaroo; GRAM; GridFTP, GRAM, SRM; GSI, CAS; MDS; MCAT, GriPhyN catalogs; GDMP; MDS; Globus)

42 Executor Example: Condor DAGMan
• Directed Acyclic Graph Manager
• Specify the dependencies between Condor jobs using a DAG data structure
• Manage dependencies automatically
  – e.g., “Don’t run job B until job A has completed successfully.”
• Each job is a “node” in the DAG
• Any number of parent or children nodes
• No loops
(Diagram nodes: Job A, Job B, Job C, Job D)
Slide courtesy Miron Livny, U. Wisconsin

43 Executor Example: Condor DAGMan (Cont.)
• DAGMan acts as a “meta-scheduler”
  – holds & submits jobs to the Condor queue at the appropriate times based on DAG dependencies
• If a job fails, DAGMan continues until it can no longer make progress and then creates a “rescue” file with the current state of the DAG
  – When the failed job is ready to be re-run, the rescue file is used to restore the prior state of the DAG
(Diagram: DAGMan submitting jobs A, B, C, D to the Condor job queue)
Slide courtesy Miron Livny, U. Wisconsin
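The behaviour described on these two slides (run a job only after its parents succeed, keep going after a failure until nothing more can run, then resume from saved state) can be sketched in a few lines of Python. This is an illustration of the idea only, not DAGMan's implementation or file format; `run_dag`, the `DIAMOND` table and its A to D edges are assumptions chosen for the example.

```python
def run_dag(parents, run_job, done=None):
    """Run every job whose parents have all succeeded.

    parents: job -> list of parent jobs
    run_job: callable(job) -> True on success, False on failure
    done:    jobs already completed, e.g. restored from a rescue state
    Returns (done, failed); `done` plays the role of the rescue state.
    """
    done = set(done or ())
    failed = set()
    progress = True
    while progress:  # stop once a full pass starts no new job
        progress = False
        for job in sorted(parents):
            if job in done or job in failed:
                continue
            if all(p in done for p in parents[job]):
                (done if run_job(job) else failed).add(job)
                progress = True
    return done, failed


# Hypothetical diamond: A feeds B and C, which both feed D.
DIAMOND = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
```

If B fails on the first pass, C still runs but D does not; calling `run_dag` again with the returned `done` set re-runs only B and then D.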

44 Virtual Data in CMS
Virtual Data Long Term Vision of CMS: CMS Note 2001/047, GRIPHYN 2001-16

45 CMS Data Analysis
(Figure: dependency graph for Event 1, Event 2 and Event 3. Nodes: Raw data (simulated or real); Calibration data; Reconstruction Algorithm; Jet finder 1; Jet finder 2; Tag 1; Tag 2; Reconstructed data (produced by physics analysis jobs). Object sizes shown: 100b, 200b, 5K, 7K, 50K, 100K, 200K, 300K. Legend: Uploaded data; Virtual data; Algorithms.)
Dominant use of Virtual Data in the Future

46 Production Pipeline: GriPhyN-CMS Demo
SC2001 Demo Version; 1 run = 500 events
    Stage        CPU       Output file   Data
    pythia       2 min     truth.ntpl    0.5 MB
    cmsim        8 hours   hits.fz       175 MB
    writeHits    5 min     hits.DB       275 MB
    writeDigis   45 min    digis.DB      105 MB
(Other labels on the slide: 1 run; 1 event)

47 Pegasus: Planning for Execution in Grids
• Maps from abstract to concrete workflow
  – Algorithmic and AI based techniques
• Automatically locates physical locations for both components (transformations and data)
  – Uses the Globus Replica Location Service and the Transformation Catalog
• Finds appropriate resources to execute
  – Via the Globus Monitoring and Discovery Service
• Reuses existing data products where applicable
• Publishes newly derived data products
  – RLS, Chimera virtual data catalog

49 Replica Location Service
• Pegasus uses the RLS to find input data
• Pegasus uses the RLS to register new data products
(Diagram labels: LRC; RLI; Computation)

50 Use of MDS in Pegasus
• MDS provides up-to-date Grid state information
  – Total and idle job queue lengths on a pool of resources (Condor)
  – Total and available memory on the pool
  – Disk space on the pools
  – Number of jobs running on a job manager
• Can be used for resource discovery and selection
  – Developing various task-to-resource mapping heuristics
• Can be used to publish information necessary for replica selection
  – Developing replica selection components

51 Abstract Workflow Reduction
• The output jobs for the DAG are all the leaf nodes
  – i.e. f, h, i
• Each job requires 2 input files and generates 2 output files.
• The user specifies the output location.
(Diagram: jobs a to i. Key: the original node; input transfer node; registration node; output transfer node; node deleted by reduction algorithm)

52 Optimizing from the point of view of Virtual Data
• Jobs d, e, f have output files that have been found in the Replica Location Service.
• Additional jobs are deleted.
• All jobs (a, b, c, d, e, f) are removed from the DAG.
(Diagram: jobs a to i. Key: the original node; input transfer node; registration node; output transfer node; node deleted by reduction algorithm)
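The reduction step can be sketched as a backward walk from the requested leaf jobs: a job whose outputs are already registered is dropped, and its ancestors are kept only if some other remaining job still needs them. The Python below is a minimal sketch of that idea, not the Pegasus algorithm itself. The edge list in `PARENTS` is hypothetical (the slide's diagram did not survive), as are the `x.out1`/`x.out2` file names; only the job names, the leaves f, h, i, and the two-outputs-per-job rule come from the slides.

```python
def reduce_workflow(parents, outputs, leaves, replica_catalog):
    """Return the set of jobs that still have to run.

    parents:         job -> list of parent jobs
    outputs:         job -> logical file names the job produces
    leaves:          jobs whose outputs the user asked for
    replica_catalog: set of logical file names already materialized
    """
    needed = set()
    stack = list(leaves)
    while stack:
        job = stack.pop()
        if job in needed:
            continue
        if all(lfn in replica_catalog for lfn in outputs[job]):
            continue  # outputs exist: prune the job and do not look upstream
        needed.add(job)
        stack.extend(parents[job])
    return needed


# Hypothetical edges for jobs a..i; f, h and i are the leaves.
PARENTS = {
    "a": [], "b": ["a"], "c": ["a"],
    "d": ["b"], "e": ["b", "c"], "f": ["c"],
    "g": ["d", "e"], "h": ["g"], "i": ["g"],
}
# Two output files per job, as stated on the previous slide.
OUTPUTS = {job: [f"{job}.out1", f"{job}.out2"] for job in PARENTS}
```

With the outputs of d, e and f registered, only g, h and i remain, which agrees with the slide's statement that jobs a through f are removed.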

53 Plans for staging data in
• Planner picks execution and replica locations
• Adding transfer nodes for the input files for the root nodes
(Diagram: jobs a to i. Key: the original node; input transfer node; registration node; output transfer node; node deleted by reduction algorithm)

54 Staging data out and registering new derived products in the RLS
• Staging and registering for each job that materializes data (g, h, i)
• Transferring the output files of the leaf job (f) to the output location
(Diagram: jobs a to i. Key: the original node; input transfer node; registration node; output transfer node; node deleted by reduction algorithm)

55 Input DAG and the final executable DAG
(Diagram: the input DAG with jobs a to i, next to the final executable DAG with jobs g, h and i. Key: the original node; input transfer node; registration node; output transfer node)

56 Pegasus Components
• Concrete Planner and Submit file generator (gencdag)
  – The Concrete Planner of the VDS makes the logical-to-physical mapping of the DAX, taking into account the pool where the jobs are to be executed (execution pool) and the final output location (output pool).
• Java Replica Location Service Client (rls-client & rls-query-client)
  – Used to populate and query the Globus Replica Location Service.

57 Pegasus Components (cont’d)
• XML Pool Config generator (genpoolconfig)
  – The Pool Config generator queries the MDS as well as local pool config files to generate an XML pool config, which is used by Pegasus.
  – MDS is preferred for generating the pool configuration, as it provides much richer information about the pool, including queue statistics, available memory, etc.
• The following catalogs are looked up to make the translation
  – Transformation Catalog (tc.data)
  – Pool Config File
  – Replica Location Services
  – Monitoring and Discovery Services

58 Transformation Catalog (Demo)
• Consists of a simple text file.
  – Contains mappings of logical transformations to physical transformations.
• Format of the tc.data file
    #poolid  logical tr   physical tr               env
    isi      preprocess   /usr/vds/bin/preprocess   VDS_HOME=/usr/vds/;
• All the physical transformations are absolute path names.
• The environment string contains all the environment variables required for the transformation to run on the execution pool.
• DB-based TC in testing phase.
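Reading an entry in this format takes only a few lines. The Python sketch below assumes, based solely on the example row, that columns are separated by whitespace and that the environment column is a list of `KEY=value` pairs terminated by semicolons; `parse_tc_line` is a hypothetical helper, not part of the VDS tools.

```python
def parse_tc_line(line):
    """Parse one tc.data entry into a dict; return None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Raises ValueError if fewer than three columns are present.
    pool, logical, physical, *rest = line.split()
    if not physical.startswith("/"):
        # The slide states that physical transformations are absolute paths.
        raise ValueError(f"physical transformation is not absolute: {physical}")
    env = {}
    for pair in " ".join(rest).split(";"):
        pair = pair.strip()
        if pair:
            key, _, value = pair.partition("=")
            env[key] = value
    return {"pool": pool, "logical": logical, "physical": physical, "env": env}
```

Applied to the example row, this yields pool `isi`, logical name `preprocess`, the absolute path, and an environment containing `VDS_HOME`.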

59 Pool Config (Demo)
• Pool Config is an XML file which contains information about the various pools on which DAGs may execute.
• The Pool Config file:
  – Specifies the various job-managers that are available on the pool for the different types of Condor universes.
  – Specifies the GridFTP storage servers associated with each pool.
  – Specifies the Local Replica Catalogs where data residing in the pool has to be cataloged.
  – Contains profiles, such as environment hints, which are common site-wide.
  – Contains the working and storage directories to be used on the pool.

60 Pool config
• Two ways to construct the Pool Config file:
  – Monitoring and Discovery Service
  – Local Pool Config File (text based)
• Client tool to generate the Pool Config file
  – The tool genpoolconfig is used to query the MDS and/or the local pool config file(s) to generate the XML Pool Config file.

61 Gvds.Pool.Config (Demo)
• This file is read by the information provider and published into MDS.
• Format
    gvds.pool.id :
    gvds.pool.lrc :
    gvds.pool.gridftp : @
    gvds.pool.gridftp : gsiftp://sukhna.isi.edu/nfs/asd2/gmehta@2.4.0
    gvds.pool.universe : @ @
    gvds.pool.universe : transfer@columbus.isi.edu/jobmanager-fork@2.2.4
    gvds.pool.gridlaunch :
    gvds.pool.workdir :
    gvds.pool.profile : @ @
    gvds.pool.profile : env@GLOBUS_LOCATION@/smarty/gt2.2.4
    gvds.pool.profile : vds@VDS_HOME@/nfs/asd2/gmehta/vds

62 Properties (Demo)
• The properties file defines and modifies the behavior of Pegasus.
• Properties set in $VDS_HOME/properties can be overridden by defining them either in $HOME/.chimerarc or on the command line of any executable.
  – e.g. gendax -Dvds.home=path to vds home …
• Some examples follow; for more details, read the sample.properties file in the $VDS_HOME/etc directory.
• Basic Required Properties
  – vds.home : set automatically by the clients from the environment variable $VDS_HOME
  – vds.properties : path to the default properties file
    > Default : ${vds.home}/etc/properties
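The override rule can be pictured as three layers of key/value pairs merged in order. The Python sketch below assumes that command-line definitions take precedence over `$HOME/.chimerarc`; the slide says both override the default file but does not state their relative order. `resolve_properties` and `parse_defines` are hypothetical helpers, and only the `-Dkey=value` option style comes from the slide.

```python
def parse_defines(argv):
    """Collect -Dkey=value options in the style shown on the slide."""
    props = {}
    for arg in argv:
        if arg.startswith("-D") and "=" in arg:
            key, _, value = arg[2:].partition("=")
            props[key] = value
    return props


def resolve_properties(defaults, user_rc, command_line):
    """Merge three property layers; later layers override earlier ones.

    defaults:     values from the default properties file
    user_rc:      values from $HOME/.chimerarc
    command_line: values from -D options (assumed to win over user_rc)
    """
    merged = dict(defaults)
    merged.update(user_rc)
    merged.update(command_line)
    return merged
```

For example, a `vds.home` given with `-D` replaces the value from the default file, while properties that appear only in the lower layers are kept unchanged.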

63 Concrete Planner Gencdag (Demo)
• The Concrete Planner takes the DAX produced by Chimera and converts it into a set of Condor DAG and submit files.
• Usage : gencdag --dax --p [--dir ] [--o ] [--force]
• You can specify more than one execution pool. Execution will take place on the pools on which the executable exists. If the executable exists on more than one pool, the pool on which it will run is selected randomly.
• The output pool is the pool to which you want all the output products to be transferred. If it is not specified, the materialized data stays on the execution pool.
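The pool selection rule stated here (only pools where the executable exists are eligible, and the choice among them is random) is simple enough to sketch directly. The Python below is an illustration, not gencdag's code; `choose_pool`, the `catalog` set of (pool, logical transformation) pairs, and the pool name `uc` are assumptions made for the example.

```python
import random


def choose_pool(logical_tr, candidate_pools, catalog, rng=random):
    """Pick an execution pool for a logical transformation.

    catalog: set of (pool, logical transformation) pairs, e.g. built from tc.data
    rng:     object with a choice() method; pass random.Random(seed) for repeatability
    """
    # Only pools where the executable is installed are eligible.
    eligible = [p for p in candidate_pools if (p, logical_tr) in catalog]
    if not eligible:
        raise LookupError(f"{logical_tr} is not installed on any of {candidate_pools}")
    # Several eligible pools: pick one at random, as the slide describes.
    return rng.choice(eligible)


# Hypothetical catalog: preprocess is installed on two pools, analyze on one.
CATALOG = {("isi", "preprocess"), ("uc", "preprocess"), ("isi", "analyze")}
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which is convenient when testing a planner.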

64 Future Improvements
• A sophisticated concrete planner with AI technology
• A sophisticated transformation catalog with a DB backend
• Smarter scheduling of workflows by deciding whether the workflow is compute intensive or data intensive
• In-time planning
• Using resource queue information and network bandwidth information to make a smarter choice of resources
• Reservation of disk space on remote machines

65 Pegasus Portal

66 Tutorial Outline
• Introduction: Grids, GriPhyN, Virtual Data (5 minutes)
• The Chimera system (25 minutes)
• The Pegasus system (25 minutes)
• Summary (5 minutes)

67 Summary: GriPhyN Virtual Data System
• Using Virtual Data helps reduce the time and cost of computation.
• Services in the Virtual Data Toolkit
  – Chimera: constructs a virtual plan
  – Pegasus: constructs a concrete grid plan from this virtual plan
• Some current applications of the Virtual Data Toolkit follow.

68 Astronomy
• Montage (NASA and NVO) (B. Berriman, J. Good, G. Singh, M. Su)
  – Deliver science-grade custom mosaics on demand
  – Produce mosaics from a wide range of data sources (possibly in different spectra)
  – User-specified parameters of projection, coordinates, size, rotation and spatial sampling
(Figure: mosaic created by Pegasus-based Montage from a run of the M101 galaxy images on the TeraGrid)

69 Montage Workflow
1202 nodes

70 BLAST
A set of sequence comparison algorithms used to search sequence databases for optimal local alignments to a query.
Led by Veronika Nefedova (ANL) as part of the PACI Data Quest Expedition program.
Two major runs were performed using Chimera and Pegasus:
1) 60 genomes (4,000 sequences each), processed in 24 hours
  – Genomes selected from DOE-sponsored sequencing projects
  – 67 CPU-days of processing time delivered
  – ~10,000 Grid jobs
  – >200,000 BLAST executions
  – 50 GB of data generated
2) 450 genomes processed
Speedups of 5-20 times were achieved because the compute nodes were used efficiently by keeping the submission of jobs to the compute cluster constant.

71 For further information
• Globus Project: www.globus.org
• Chimera: www.griphyn.org/chimera
• Pegasus: pegasus.isi.edu
• MCS: www.isi.edu/~deelman/MCS

