The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration Summer Grid 2004 UT Brownsville South Padre Island Center 24 June.


1 The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration
Summer Grid 2004, UT Brownsville South Padre Island Center, 24 June 2004
Mike Wilde, Argonne National Laboratory, Mathematics and Computer Science Division

2 GriPhyN: Grid Physics Network Mission
Summer Grid 2004 www.griphyn.org/chimera 24 June, UTB/SPI
Enhance scientific productivity through discovery and processing of datasets, using the grid as a scientific workstation.
Virtual Data enables this approach by creating datasets from workflow “recipes” and recording their provenance.
GriPhyN works to “cross the chasm”: application and computer scientists create and field-test paradigms and toolkits together.

3 Acknowledgements: Virtual Data is a Large Team Effort
The Chimera Virtual Data System is the work of Ian Foster, Jens Voeckler, Mike Wilde and Yong Zhao.
The Pegasus Planner is the work of Ewa Deelman, Gaurang Mehta, and Karan Vahi.
Applications described are the work of many people, including James Annis, Rick Cavanaugh, Dan Engh, Rob Gardner, Albert Lazzarini, Natalia Maltsev, Marge Bardeen, and their wonderful teams.

4 Virtual Data Scenario
[Diagram: simulate writes file1 and file2; reformat writes file3, file4, file5; conv writes file6; summarize writes file7; psearch writes file8]
On-demand data generation
Update workflow following changes
Manage workflow
Explain provenance, e.g. for file8:
    psearch -t 10 -i file3 file4 file5 -o file8
    summarize -t 10 -i file6 -o file7
    reformat -f fz -i file2 -o file3 file4 file5
    conv -l esd -o aod -i file2 -o file6
    simulate -t 10 -o file1 file2

5 Virtual Data Describes analysis workflow
• The recorded virtual data “recipe” here is:
  – Files: 8 < (1,3,4,5,7), 7 < 6, (3,4,5,6) < 2
  – Programs: 8 < psearch, 7 < summarize, (3,4,5) < reformat, 6 < conv, (1,2) < simulate
[Diagram: the slide 4 workflow graph, with file8 marked as the requested dataset]

6 Virtual Data Describes analysis workflow
• To re-create file8: step 1
  – simulate > file1, file2
[Diagram: the slide 4 workflow graph, with file8 marked as the requested file]

7 Virtual Data Describes analysis workflow
• To re-create file8: step 2
  – Files 3, 4, 5, 6 are derived from file2
  – reformat > file3, file4, file5
  – conv > file6
[Diagram: the slide 4 workflow graph, with file8 marked as the requested file]

8 Virtual Data Describes analysis workflow
• To re-create file8: step 3
  – File 7 depends on file 6
  – summarize > file7
[Diagram: the slide 4 workflow graph, with file8 marked as the requested file]

9 Virtual Data Describes analysis workflow
• To re-create file8: final step
  – File 8 depends on files 1, 3, 4, 5, 7
  – psearch > file8
[Diagram: the slide 4 workflow graph, with file8 marked as the requested file]
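The re-creation walk on slides 5 to 9 can be sketched in Python. This is only an illustration: the `RECIPE` list and the `plan` function are hypothetical names, not part of Chimera, and the inputs of each step are taken from the "Files" line of the recipe on slide 5.

```python
# Hypothetical in-memory form of the slide 5 recipe: each record names a
# program, the files it reads, and the files it writes.
RECIPE = [
    {"program": "simulate", "inputs": [], "outputs": ["file1", "file2"]},
    {"program": "reformat", "inputs": ["file2"],
     "outputs": ["file3", "file4", "file5"]},
    {"program": "conv", "inputs": ["file2"], "outputs": ["file6"]},
    {"program": "summarize", "inputs": ["file6"], "outputs": ["file7"]},
    {"program": "psearch",
     "inputs": ["file1", "file3", "file4", "file5", "file7"],
     "outputs": ["file8"]},
]


def plan(target, recipe=RECIPE):
    """Return the programs to run, in order, to (re)create `target`."""
    producer = {out: step for step in recipe for out in step["outputs"]}
    ordered, seen = [], set()

    def visit(filename):
        step = producer.get(filename)
        if step is None:
            raise KeyError(f"no derivation produces {filename}")
        if step["program"] in seen:
            return
        seen.add(step["program"])
        for needed in step["inputs"]:  # inputs are planned before the step
            visit(needed)
        ordered.append(step["program"])

    visit(target)
    return ordered


print(plan("file8"))
```

Asking for an intermediate file such as `plan("file7")` yields only the steps that file depends on, which is the on-demand behaviour the scenario describes.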

10 Grid3 – The Laboratory
Supported by the National Science Foundation and the Department of Energy.

11 VDL: Virtual Data Language Describes Data Transformations
• Transformation
  – Abstract template of program invocation
  – Similar to "function definition"
• Derivation
  – “Function call” to a Transformation
  – Store past and future:
    > A record of how data products were generated
    > A recipe of how data products can be generated
• Invocation
  – Record of a Derivation execution
• These XML documents reside in a “virtual data catalog” (VDC), a relational database
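One way to picture the three VDL object kinds is as plain Python records. This is a sketch under assumptions: the class and field names (`Catalog`, `exit_code`, `wall_seconds`, and so on) are invented for illustration and do not reflect the actual VDC schema.

```python
from dataclasses import dataclass


@dataclass
class Transformation:
    """Abstract template of a program invocation (a 'function definition')."""
    name: str
    inputs: tuple    # formal input argument names
    outputs: tuple   # formal output argument names


@dataclass
class Derivation:
    """A 'function call': binds a transformation's arguments to logical files."""
    name: str
    transformation: Transformation
    bindings: dict   # formal argument name -> logical file name

    def output_files(self):
        return [self.bindings[arg] for arg in self.transformation.outputs]


@dataclass
class Invocation:
    """Record of one execution of a derivation (fields are hypothetical)."""
    derivation: Derivation
    exit_code: int
    wall_seconds: float


class Catalog:
    """Toy stand-in for a virtual data catalog."""

    def __init__(self):
        self.derivations = []
        self.invocations = []

    def recipe_for(self, logical_file):
        # The "future" view: how the file can be generated.
        return [dv for dv in self.derivations
                if logical_file in dv.output_files()]

    def history_of(self, logical_file):
        # The "past" view: how the file was generated.
        return [iv for iv in self.invocations
                if logical_file in iv.derivation.output_files()]


tr1 = Transformation("tr1", inputs=("a1",), outputs=("a2",))
x1 = Derivation("x1", tr1, {"a1": "file1", "a2": "file2"})
catalog = Catalog()
catalog.derivations.append(x1)
```

Before any run, `catalog.recipe_for("file2")` already returns `x1` while `catalog.history_of("file2")` is empty; appending an `Invocation` fills in the history.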

12 VDL Describes Workflow via Data Dependencies
    TR tr1(in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2};
    }
    TR tr2(in a1, out a2) {
      argument stdin = ${a1};
      argument stdout = ${a2};
    }
    DV x1->tr1(a1=@{in:file1}, a2=@{out:file2});
    DV x2->tr2(a1=@{in:file2}, a2=@{out:file3});
[Diagram: file1 → x1 → file2 → x2 → file3]
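The point of this slide is that no edge between x1 and x2 is ever declared; it follows from x2 reading the file that x1 writes. A small Python sketch of that inference, assuming the two derivations are held in a hypothetical dictionary (this is not how Chimera stores them):

```python
# Hypothetical representation of the two DV statements on the slide.
DERIVATIONS = {
    "x1": {"tr": "tr1", "in": ["file1"], "out": ["file2"]},
    "x2": {"tr": "tr2", "in": ["file2"], "out": ["file3"]},
}


def infer_edges(derivations):
    """Return (producer, consumer, file) triples implied by shared files."""
    producer = {f: dv for dv, spec in derivations.items() for f in spec["out"]}
    edges = set()
    for dv, spec in derivations.items():
        for f in spec["in"]:
            if f in producer:
                edges.add((producer[f], dv, f))
    return sorted(edges)


def external_inputs(derivations):
    """Files that are read but produced by no derivation in the set."""
    produced = {f for spec in derivations.values() for f in spec["out"]}
    consumed = {f for spec in derivations.values() for f in spec["in"]}
    return sorted(consumed - produced)


print(infer_edges(DERIVATIONS))
print(external_inputs(DERIVATIONS))
```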

13 Workflow example
• Graph structure
  – Fan-in
  – Fan-out
  – "left" and "right" can run in parallel
• Needs external input file
  – Located via replica catalog
• Data file dependencies
  – Form graph structure
[Diagram labels: preprocess, findrange, analyze]

14 Complete VDL workflow
• Generate appropriate derivations
    DV top->preprocess( b=[ @{out:"f.b1"}, @{out:"f.b2"} ], a=@{in:"f.a"} );
    DV left->findrange( b=@{out:"f.c1"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="left", p="0.5" );
    DV right->findrange( b=@{out:"f.c2"}, a2=@{in:"f.b2"}, a1=@{in:"f.b1"}, name="right" );
    DV bottom->analyze( b=@{out:"f.d"}, a=[ @{in:"f.c1"}, @{in:"f.c2"} ] );
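To see why "left" and "right" can run in parallel, the four derivations can be grouped into levels where every member of a level depends only on earlier levels. The sketch below is illustrative Python, not Chimera code; the `DIAMOND` dictionary simply restates the in/out files of the DV statements above.

```python
# Inputs and outputs copied from the four DV statements on this slide.
DIAMOND = {
    "top": {"in": ["f.a"], "out": ["f.b1", "f.b2"]},
    "left": {"in": ["f.b1", "f.b2"], "out": ["f.c1"]},
    "right": {"in": ["f.b1", "f.b2"], "out": ["f.c2"]},
    "bottom": {"in": ["f.c1", "f.c2"], "out": ["f.d"]},
}


def parallel_levels(derivations):
    """Group derivations into levels that could each run concurrently."""
    producer = {f: dv for dv, spec in derivations.items() for f in spec["out"]}
    # parents[dv] = the derivations whose outputs dv reads
    parents = {
        dv: {producer[f] for f in spec["in"] if f in producer}
        for dv, spec in derivations.items()
    }
    done, levels = set(), []
    while len(done) < len(derivations):
        ready = sorted(dv for dv in derivations
                       if dv not in done and parents[dv] <= done)
        if not ready:
            raise ValueError("cycle in data dependencies")
        levels.append(ready)
        done.update(ready)
    return levels


print(parallel_levels(DIAMOND))
```

The file `f.a` has no producer in the set, so it is the external input that would be located via the replica catalog.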

15 Compound Transformations Enable Functional Abstractions
• Compound TR encapsulates an entire sub-graph:
    TR rangeAnalysis (in fa, p1, p2, out fd, io fc1, io fc2, io fb1, io fb2) {
      call preprocess( a=${fa}, b=[ ${out:fb1}, ${out:fb2} ] );
      call findrange( a1=${in:fb1}, a2=${in:fb2}, name="LEFT", p=${p1}, b=${out:fc1} );
      call findrange( a1=${in:fb1}, a2=${in:fb2}, name="RIGHT", p=${p2}, b=${out:fc2} );
      call analyze( a=[ ${in:fc1}, ${in:fc2} ], b=${fd} );
    }

16 Derivation scripts
• Representation of virtual data provenance:
    DV d1->diamond( fd=@{out:"f.00005"}, fc1=@{io:"f.00004"}, fc2=@{io:"f.00003"},
      fb1=@{io:"f.00002"}, fb2=@{io:"f.00001"}, fa=@{io:"f.00000"}, p2="100", p1="0" );
    DV d2->diamond( fd=@{out:"f.0000B"}, fc1=@{io:"f.0000A"}, fc2=@{io:"f.00009"},
      fb1=@{io:"f.00008"}, fb2=@{io:"f.00007"}, fa=@{io:"f.00006"}, p2="141.42135623731", p1="0" );
    ...
    DV d70->diamond( fd=@{out:"f.001A3"}, fc1=@{io:"f.001A2"}, fc2=@{io:"f.001A1"},
      fb1=@{io:"f.001A0"}, fb2=@{io:"f.0019F"}, fa=@{io:"f.0019E"}, p2="800", p1="18" );
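Derivation scripts like this are typically generated rather than typed. The file names on the slide follow a visible pattern: six consecutive hexadecimal names per derivation, starting at `f.00000`. A hedged Python sketch that reproduces that naming follows; the function names are made up, and the `p1` and `p2` values are passed in because the slide does not say how they were chosen.

```python
def diamond_files(index, files_per_dv=6):
    """Logical file names for derivation d<index> (1-based), as on the slide."""
    base = (index - 1) * files_per_dv
    roles = ["fa", "fb2", "fb1", "fc2", "fc1", "fd"]
    return {role: f"f.{base + offset:05X}" for offset, role in enumerate(roles)}


def ref(direction, name):
    """Format a VDL file reference such as @{out:"f.00005"}."""
    return '@{%s:"%s"}' % (direction, name)


def render_dv(index, p1, p2):
    """Render one DV statement in the layout used on the slide."""
    f = diamond_files(index)
    args = [
        "fd=" + ref("out", f["fd"]),
        "fc1=" + ref("io", f["fc1"]),
        "fc2=" + ref("io", f["fc2"]),
        "fb1=" + ref("io", f["fb1"]),
        "fb2=" + ref("io", f["fb2"]),
        "fa=" + ref("io", f["fa"]),
        f'p2="{p2}"',
        f'p1="{p1}"',
    ]
    return f"DV d{index}->diamond( " + ", ".join(args) + " );"


print(render_dv(1, p1="0", p2="100"))
print(render_dv(70, p1="18", p2="800"))
```

With this numbering, derivation 70 starts at file index 414, which is `0019E` in hexadecimal, matching the last statement on the slide.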

17 Invocation Provenance
Completion status and resource usage
Attributes of executable transformation
Attributes of input and output files

18 Executing VDL Workflows
[Diagram labels: Abstract workflow; local planner; global planner “Pegasus”; “jit” planner (research); Grid Info; Concrete DAG; DAGMan / Condor-G]
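The step from abstract workflow to concrete DAG can be illustrated with a toy planner in Python. Everything here is an assumption made for illustration: the job records, the `concretize` function, the rule that a job is skipped when all of its outputs are already registered, the site name, and the `gsiftp://example.org` URLs. It is not the Pegasus algorithm or its interface.

```python
def concretize(abstract_jobs, replicas, pick_site):
    """Turn abstract jobs into concrete ones (toy rules, see lead-in).

    abstract_jobs: list of {"name", "in", "out"} records using logical names
    replicas:      logical file name -> physical location (toy replica catalog)
    pick_site:     callable choosing an execution site for a job
    """
    concrete = []
    for job in abstract_jobs:
        if all(out in replicas for out in job["out"]):
            continue  # assumed rule: data already exists, nothing to run
        concrete.append({
            "name": job["name"],
            "site": pick_site(job),
            # inputs with a known physical copy must be staged in; the others
            # are expected to be produced by earlier jobs in the same DAG
            "stage_in": {f: replicas[f] for f in job["in"] if f in replicas},
        })
    return concrete


abstract = [
    {"name": "top", "in": ["f.a"], "out": ["f.b1", "f.b2"]},
    {"name": "left", "in": ["f.b1", "f.b2"], "out": ["f.c1"]},
    {"name": "right", "in": ["f.b1", "f.b2"], "out": ["f.c2"]},
    {"name": "bottom", "in": ["f.c1", "f.c2"], "out": ["f.d"]},
]
replicas = {
    "f.a": "gsiftp://example.org/data/f.a",    # hypothetical locations
    "f.b1": "gsiftp://example.org/data/f.b1",
    "f.b2": "gsiftp://example.org/data/f.b2",
}
dag = concretize(abstract, replicas, pick_site=lambda job: "site-A")
print([job["name"] for job in dag])
```

In this example "top" is dropped because both of its outputs already have replicas, and the remaining three jobs carry a site and a stage-in list.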

19 GriPhyN-iVDGL Applications to date
• ATLAS, BTeV, CMS – HEP event simulation
• Argonne Computational Biology – sequence comparison and result capture
• LIGO – pulsar search
• Sloan Digital Sky Survey – cluster finding; near-earth object search planned
• QuarkNet – science education – cosmic rays, HEP analysis

20 Genome Analysis Database Update
Application work by Alex Rodriguez, Dina Sulakhe, Natalia Maltsev, Argonne MCS. Described in GGF10 workshop paper.

21 Virtual Data Example: Galaxy Cluster Search
[Figure labels: Sloan Data; DAG; Galaxy cluster size distribution]
Jim Annis, Steve Kent, Vijay Sehkri, Fermilab; Michael Milligan, Yong Zhao, University of Chicago. Described in SC2002 paper.

22 Cluster Search Workflow Graph and Execution Trace
[Figure: workflow jobs vs time]

23 Virtual Data Application: High Energy Physics Data Analysis
[Diagram: derived datasets labelled by their parameter sets]
  – mass = 200, decay = WW, stability = 1, LowPt = 20, HighPt = 10000
  – mass = 200, decay = WW, stability = 1, event = 8
  – mass = 200, decay = WW, stability = 1, plot = 1
  – mass = 200, decay = WW, plot = 1
  – mass = 200, decay = WW, event = 8
  – mass = 200, decay = WW, stability = 1
  – mass = 200, decay = WW, stability = 3
  – mass = 200, decay = WW
  – mass = 200, decay = ZZ
  – mass = 200, decay = bb
  – mass = 200, plot = 1
  – mass = 200, event = 8
Work and slide by Rick Cavanaugh and Dimitri Bourilkov, University of Florida. Ref: CHEP 2002 paper.

24 Using Virtual Data for Science Education
• The QuarkNet-Trillium collaboration is using Grid virtual data tools and methods to enrich science education
• It's an experiment to give students the means to:
  – discover and apply datasets, algorithms, and data analysis methods
  – collaborate by developing new ones and sharing results and observations
  – learn data analysis methods that will ready and excite them for a scientific career
• And in later steps, we may actually use the Grid!

25 QuarkNet Virtual Data Project
[Diagram: student/teacher teams, each with a cosmic ray detector and locally collected data, reach the portal through standard Web access]
  – Central High School, Reston, Virginia
  – Yale / Middletown High Collaboration, Hartford, Connecticut
  – Foothills High School, Great Falls, Montana
[Diagram labels: QuarkNet Virtual Data Portal (student data, algorithms, results, notes, and communications); Virtual Data Toolkit; Virtual Data Catalog]
Student/teacher teams sharing data, methods, programs, and knowledge
Enabling collaboration-intensive science discovery with virtual data tools and methods

26 Detector Performance Study

27 Example: BTeV Event Simulation

28 Support for Search and Discovery
• Goal: make it as easy to use as Google
• More advanced capabilities lie below the surface (as with Google)
• Understand the structure and meaning of the datasets and their fields
• Advanced search, using SQL-like queries
• Find both DATA and TRANSFORMATIONS
• Create datasets from queries
• Perform calculations on datasets, filtering results to look for patterns
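Since the virtual data catalog is a relational database, a metadata search can be pictured as an ordinary SQL query. The Python sketch below uses an in-memory SQLite table; the `annotation` table, its columns, and the dataset names are hypothetical and do not reflect the real catalog schema. The attribute keys are borrowed from the HEP example on slide 23.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (dataset, attribute key, attribute value).
conn.execute("CREATE TABLE annotation (name TEXT, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?)",
    [
        ("run_ww", "mass", "200"), ("run_ww", "decay", "WW"),
        ("run_zz", "mass", "200"), ("run_zz", "decay", "ZZ"),
        ("run_ww_s1", "mass", "200"), ("run_ww_s1", "decay", "WW"),
        ("run_ww_s1", "stability", "1"),
    ],
)


def search(conn, **criteria):
    """Return names of datasets whose annotations match every criterion."""
    if not criteria:
        rows = conn.execute(
            "SELECT DISTINCT name FROM annotation ORDER BY name")
        return [row[0] for row in rows]
    clauses = " OR ".join("(key = ? AND value = ?)" for _ in criteria)
    params = [str(x) for pair in criteria.items() for x in pair]
    sql = (
        f"SELECT name FROM annotation WHERE {clauses} "
        "GROUP BY name HAVING COUNT(DISTINCT key) = ? ORDER BY name"
    )
    return [row[0] for row in conn.execute(sql, params + [len(criteria)])]


print(search(conn, mass=200, decay="WW"))
```

The query keeps a dataset only when it matched as many distinct keys as were asked for, so adding a criterion narrows the result.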

29 Search by Metadata

30 Deriving a new dataset
… to find mass of “z” particle:

31 Workflow for missing energy calculations

32 Virtual Provenance: list of derivations and files
    <job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="5"
         dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum">
    <job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum" level="7"
         dv-namespace="Quarknet.HEPSRCH" …
    …
    <job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy" level="3" …
    …
    (excerpted for display)
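A job list of this kind is straightforward to read back with a standard XML parser. The Python sketch below is illustrative only: the `<adag>` wrapper element and the self-closing `<job/>` form are assumptions added to make the sample well-formed, and attributes that are elided on the slide are simply left out rather than guessed.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample built from the attributes visible on the slide. The <adag> wrapper
# and the self-closing tags are assumptions; elided attributes are omitted.
SAMPLE = """
<adag>
  <job id="ID000001" namespace="Quarknet.HEPSRCH" name="ECalEnergySum"
       level="5" dv-namespace="Quarknet.HEPSRCH" dv-name="run1aesum"/>
  <job id="ID000002" namespace="Quarknet.HEPSRCH" name="ECalEnergySum"
       level="7" dv-namespace="Quarknet.HEPSRCH"/>
  <job id="ID000014" namespace="Quarknet.HEPSRCH" name="ReconTotalEnergy"
       level="3"/>
</adag>
"""


def jobs_by_transformation(xml_text):
    """Group jobs by 'namespace::name' of the transformation they run."""
    root = ET.fromstring(xml_text.strip())
    grouped = defaultdict(list)
    for job in root.iter("job"):
        tr = f'{job.get("namespace")}::{job.get("name")}'
        # dv-name is None when the attribute is absent from the excerpt
        grouped[tr].append(
            (job.get("id"), int(job.get("level")), job.get("dv-name")))
    return dict(grouped)


for tr, jobs in jobs_by_transformation(SAMPLE).items():
    print(tr, jobs)
```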

33 Virtual Provenance in XML: control flow graph
… (excerpted for display)

34 And writing the results up in a “poster”

35 Poster describing analysis

36 Using active data from Web Services


40 Levels of Interaction
• “Skins” – use it like a calculator, experiment with scenarios and settings, use virtual data like a log book to document, assess, and share parameter values
• “Blocks” – re-assemble workflow pipelines using existing ones as patterns and pre-developed transforms as building blocks
• “Code” – write new transforms in a variety of languages and data models

41 Observations
• A provenance approach based on interface definition and data flow declaration fits well with Grid requirements for code and data transportability and heterogeneity
• Working in a provenance-managed system has many fringe benefits: uniformity, precision, structure, communication, documentation
• The real world is messy – finding the right abstractions is hard, and handling “legacy” applications is even harder

42 Vision for Provenance in the Large
• Universal knowledge management and production systems
• Vendors integrate the provenance tracking protocol into data processing products
• Ability to run anywhere “in the Grid”

43 Virtual Data Grid Vision

44 Planned Dataset Model
[Diagram labels: File; Set of files; Relational query or spreadsheet range; XML Element; <FORM /FORM>; Set of files with relational index; Object closure; New user-defined dataset type]
Speculative model described in CIDR 2003 paper by Foster, Voeckler, Wilde and Zhao

45 Planned Dataset Type Model
[Diagram labels: FileDataset; FileFileSet; MultiFileSet; TarFileSet; EventCollection; RawEventSet; SimulatedEventSet; MonteCarlo Simulation; DiscreteEvent Simulation; Representational; Logical]
(Nonleaf types are superclasses)

46 Provenance Server Plans
• OGSA-based Grid services
  – Discovery, security, resource management
• Supports code and data discovery and workflow management
• Object names (TR, DS, TY, DV, IV) can be used as global cross-server links
• Derivations can reference remote transformations and datasets
• Structured object namespaces and object-level access control enable large VO collaboration
• Generalize transforms to describe service calls, database queries and language interpreters

47 Provenance Hyperlinks

48 Indexing Servers to Support Discovery

49 For Information and Software
• Virtual Data System
  – www.griphyn.org/chimera – Chimera Virtual Data System: overview, papers, software
• Grids and Grid Software
  – www.ivdgl.org/grid2003 – Using Grid3
  – www.griphyn.org/vdt – Virtual Data Toolkit
  – www.globus.org – The Globus Toolkit
  – www.cs.wisc.edu/condor – The Condor Project
  – www.ppdg.net – Particle Physics Data Grid

50 Acknowledgements
GriPhyN, iVDGL, and QuarkNet (in part) are supported by the National Science Foundation.
The Globus Alliance, PPDG, and QuarkNet are supported in part by the US Department of Energy, Office of Science; by the NASA Information Power Grid program; and by IBM.

