Metadata, Ontologies, and Provenance: Towards Extended Forms of Data Management Beth Plale, Yogesh Simmhan Computer Science Dept.


1 Metadata, Ontologies, and Provenance: Towards Extended Forms of Data Management Beth Plale, Yogesh Simmhan Computer Science Dept.

2 2005-03-07T18:00-05:00 Networks & Complex Systems Seminar Talk
The Data Deluge
Computational science is increasingly data intensive and getting more so. Why?
More complex computations:
– Nested model runs
– Linked models
– Finer resolution
More sources of data products
– Observational data products streaming continuously from hundreds of sensor and network sources, scaling to thousands
Large archives
– Annotations
– Model configuration parameters
– Output results
– Model data
– Statistical data (e.g., data mining)

3 Problem
Computational scientists are reaching the limit of their ability to manage the data products associated with investigations
– A scientist can touch hundreds to thousands of data products in a single investigation

4 Seeds of solution in Internet?
The Internet has proven the utility of a user-oriented view towards information space management
– Search, tag: browser, bookmarks
– Publish: blogs, web page tools
But the web is not completely appropriate. The web is
– Single-writer, multiple-reader, and
– Search-and-download.
Apply the concept of a user-oriented view to managing the data space
Want the ability to work locally.
– myLEAD: tool to help an investigator make sense of, and operate in, the vast information space that is computational science (e.g., mesoscale meteorology).

5 Personal metadata catalog requirements
Scientists have the following needs:
Want to share products but retain control over what gets shared and with whom
– Data not made public until results appear in a journal
Want rich search criteria over a vast data space but don’t necessarily want to write SQL queries
Need help managing products generated over an extended period of time (i.e., years)
Want a high level of reliability: data must always be accessible

6 Distributed and replicated personal metadata catalogues
Sites: IU, NCSA, UA Huntsville, Millersville, UCAR Unidata, Okla Univ
[Figure labels: Master myLEAD catalog, Satellite myLEAD catalog]
– Distribution: users partitioned over 6 sites in LEAD testbed
– Replication: master is replica site for all satellites

7 [Figure: User Bob’s workspace in 1998, in 2002, and in 2003. Workspace labels: Hurricane Ivan SE quadrant, Voltice study 1998, Voltice study 2002, Voltice study 2003, Workflow template, Collection, Input parameter. Other labels: Metadata Catalog (Table of collection, Table of file, Table of User), Physical data storage, ftp://fileserver.org/file1998o768]

8 Ontologies aid in querying
[Figure: data products placed along three axes: structure, sharing, preservation. Labels: Depth 2: searchable; Depth 3: browsable; Does not know existence; Flat structure; Temporary data product; Non-published; Data products of other users; Non-preserved data product; Non-structured data products]
Ontologies provide
– transparent structure
– controlled vocabulary

9 LEAD (http://lead.ou.edu)
Each year, mesoscale weather (floods, tornadoes, hail, strong winds, lightning, and winter storms) causes hundreds of deaths, routinely disrupts transportation and commerce, and results in annual economic losses > $13B.

10 Conventional Numerical Weather Prediction
OBSERVATIONS: Radar Data, Mobile Mesonets, Surface Observations, Upper-Air Balloons, Commercial Aircraft, Geostationary and Polar Orbiting Satellite, Wind Profilers, GPS Satellites

11 Conventional Numerical Weather Prediction
OBSERVATIONS: Radar Data, Mobile Mesonets, Surface Observations, Upper-Air Balloons, Commercial Aircraft, Geostationary and Polar Orbiting Satellite, Wind Profilers, GPS Satellites
Analysis/Assimilation: Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields

12 Conventional Numerical Weather Prediction
OBSERVATIONS: Radar Data, Mobile Mesonets, Surface Observations, Upper-Air Balloons, Commercial Aircraft, Geostationary and Polar Orbiting Satellite, Wind Profilers, GPS Satellites
Analysis/Assimilation: Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields
Prediction: PCs to Teraflop Systems

13 Conventional Numerical Weather Prediction
OBSERVATIONS: Radar Data, Mobile Mesonets, Surface Observations, Upper-Air Balloons, Commercial Aircraft, Geostationary and Polar Orbiting Satellite, Wind Profilers, GPS Satellites
Analysis/Assimilation: Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields
Prediction: PCs to Teraflop Systems
Product Generation, Display, Dissemination

14 Conventional Numerical Weather Prediction
OBSERVATIONS: Radar Data, Mobile Mesonets, Surface Observations, Upper-Air Balloons, Commercial Aircraft, Geostationary and Polar Orbiting Satellite, Wind Profilers, GPS Satellites
Analysis/Assimilation: Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields
Prediction: PCs to Teraflop Systems
Product Generation, Display, Dissemination
End Users: NWS, Private Companies, Students

15 Conventional Numerical Weather Prediction
OBSERVATIONS: Radar Data, Mobile Mesonets, Surface Observations, Upper-Air Balloons, Commercial Aircraft, Geostationary and Polar Orbiting Satellite, Wind Profilers, GPS Satellites
Analysis/Assimilation: Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields
Prediction: PCs to Teraflop Systems
Product Generation, Display, Dissemination
End Users: NWS, Private Companies, Students
The process is entirely serial and pre-scheduled: no response to weather!

16 The LEAD Vision: No Longer Serial or Static
OBSERVATIONS: Radar Data, Mobile Mesonets, Surface Observations, Upper-Air Balloons, Commercial Aircraft, Geostationary and Polar Orbiting Satellite, Wind Profilers, GPS Satellites
Analysis/Assimilation: Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields
Prediction: PCs to Teraflop Systems
Product Generation, Display, Dissemination
End Users: NWS, Private Companies, Students

17 The LEAD Vision: No Longer Serial or Static
OBSERVATIONS: Radar Data, Mobile Mesonets, Surface Observations, Upper-Air Balloons, Commercial Aircraft, Geostationary and Polar Orbiting Satellite, Wind Profilers, GPS Satellites
Analysis/Assimilation: Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields
Prediction: PCs to Teraflop Systems
Product Generation, Display, Dissemination
End Users: NWS, Private Companies, Students

19 Objective discussed in this talk: grow the value of the data holdings.
Can do so through provenance: process, time, causality
[Figure labels: workflow, myLEAD, time]

20 Exploiting Provenance Metadata

21 Contents of Talk
Importance of Provenance
Techniques for Provenance Management
Data Quality and Provenance
Conclusion

22 Data Provenance
Derivation history of data, starting from its original sources
Data: files, tables, tuples, virtual collections
Derivation: process that transforms data
– Script, Web service, Queries, Commands
Also called: Lineage, Pedigree, Genealogy, Filiation, Parentage, …

23 A Simple Provenance DAG
[Figure: DAG with data nodes D0, D1, D2, D3, D4, D0’, D2’ and process nodes P1, P2, P3]
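A provenance DAG like the one on this slide can be held in a small mapping from each process to its inputs and output. The following is a minimal Python sketch, not the speakers' implementation. The edge set is an assumption read off the formulas on slide 38 (the figure's edges did not survive in the transcript), the primed nodes D0’ and D2’ are left out because their role is not recoverable, and the names `DERIVATIONS`, `producer_of`, and `derivation_history` are hypothetical.

```python
# Assumed edges (inferred from the quality formulas on slide 38):
# each process consumes some datasets and produces one dataset.
DERIVATIONS = {
    "P1": {"inputs": ["D1", "D2", "D4"], "output": "D0"},
    "P2": {"inputs": ["D3"], "output": "D1"},
    "P3": {"inputs": ["D4"], "output": "D2"},
}


def producer_of(dataset):
    """Return the process that produced `dataset`, or None for a source dataset."""
    for process, io in DERIVATIONS.items():
        if io["output"] == dataset:
            return process
    return None


def derivation_history(dataset):
    """List (process, inputs) steps from `dataset` back to its original sources.

    Steps are returned nearest-first (breadth-first over the DAG).
    """
    history = []
    frontier = [dataset]
    seen = set()
    while frontier:
        current = frontier.pop(0)
        process = producer_of(current)
        if process is None or process in seen:
            continue  # source dataset, or a process already recorded
        seen.add(process)
        inputs = DERIVATIONS[process]["inputs"]
        history.append((process, inputs))
        frontier.extend(inputs)
    return history


for step in derivation_history("D0"):
    print(step)
```

Under these assumed edges, D3 and D4 are the original sources: no process produces them, so their derivation history is empty.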

24 Importance of Provenance
Scientific Domain
– Publications are provenance!
– Many scientific datasets are available online: Biology, Astronomy (SDSS)
– Standard metadata describes datasets in well-known repositories
– Lineage information is usually missing, but vital
– GIS: Fitness for use
– Material Engineering: Pedigree, Auditing
– Biology: Citation & copyright, trust
– Astronomy: Context information

25 Importance of Provenance
Business Domain
– Data warehousing: integrated view over historical data from multiple sources
– Complex transformations to generate normalized view (ETL)
– Business analytics and intelligence (OLAP queries)
– Lineage allows “drill-down” from view to source table
– Allows tracing back sources of errors
– “View deletion” problem
[Figure labels: V0, V1, V2, T1, T2, P1, Q2, Q3; Extract, Transform, Load; View Data; Source Tables]

26 Application of Provenance
Data Quality
– Evaluate quality of data
– Trust in the source of data
– Use provenance and metadata information to estimate data quality for a user
– Assertions and signatures for provenance guarantee
Audit Trail
– Error detection
– Usage log

27 Application of Provenance
Replication Recipe
– Provenance can be a recipe for generating a dataset
– Repeat to verify/compare
– Recreate/replicate
– Partial updates
Attribution
– Copyright, citation, check data users
Informational
– Discover datasets
– Browse provenance
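The "replication recipe" use can be illustrated with a short Python sketch: if provenance records which operations ran and with which parameters, the recipe can be replayed on the source data and the result compared with the original. Everything named here is an assumption for illustration and not from the talk: the operations `scale` and `clip`, the layout of `RECIPE`, and the helpers `run_recipe` and `fingerprint`.

```python
import hashlib
import json

# Hypothetical registry of named transformations (illustrative only).
OPERATIONS = {
    "scale": lambda values, factor: [v * factor for v in values],
    "clip": lambda values, limit: [min(v, limit) for v in values],
}

# Provenance recorded as an ordered recipe: operation name plus parameters.
RECIPE = [
    {"op": "scale", "params": {"factor": 2}},
    {"op": "clip", "params": {"limit": 5}},
]


def run_recipe(source, recipe):
    """Re-derive a dataset by replaying each recorded step in order."""
    data = list(source)
    for step in recipe:
        data = OPERATIONS[step["op"]](data, **step["params"])
    return data


def fingerprint(data):
    """Stable digest of a dataset, used to compare a replay with the original."""
    return hashlib.sha256(json.dumps(data).encode("utf-8")).hexdigest()


original = run_recipe([1, 2, 3], RECIPE)
replayed = run_recipe([1, 2, 3], RECIPE)
print(original)                                        # [2, 4, 5]
print(fingerprint(original) == fingerprint(replayed))  # True
```

Replaying only a prefix of the recipe (for example `RECIPE[:1]`) gives the intermediate dataset, which is one simple way to think about partial updates.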

28 Subject of Provenance
What is provenance about?
Granularity
– Attribute, tables, files, data collections
Fine-grained vs. Coarse-grained
– Trade-off with cost of collecting, storing, querying
Data vs. Process Provenance
– Provenance can be a graph of data & processes
– Which of them is provenance focused upon?
– Hybrid where all are grouped together

29 Process vs. Data Oriented
[Figure: the provenance DAG (data nodes D0, D1, D2, D3, D4, D0’, D2’; process nodes P1, P2, P3) shown twice]

30 Data Processing Architectures
Service Oriented Architecture
– Grid & Web services
– Workflow & Service invocations
– Data as parameters, references
Databases
– Update/View Queries, Stored Procedure Calls
– Views, Tables, tuples, attributes
Scripting, Command-line, etc.

31 Scheme for Representing Provenance
Scheme for representing provenance
– Annotations vs. Inversion
Annotation
– Annotate data with ancestral data & the steps used to derive it, e.g. a DAG
– Annotation requires more storage; “Eager”
– Annotation can be as rich as user decides
Inversion
– Store function (query) used to generate data and invert it
– Not all functions are invertible; auxiliary data required; JIT computation; query optimization
– Minimal information provided (“Where”, “Why”)
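The contrast between the two schemes can be sketched in Python on a toy query (maximum temperature per station). This is only an illustration under assumptions: the table `SOURCE`, its columns, and the function names are hypothetical, and real inversion over database queries is considerably more involved. The annotated version stores lineage eagerly alongside each result. The inversion version stores nothing extra and recomputes lineage on demand, which requires that the source rows still be available.

```python
# Hypothetical source table (illustrative values only).
SOURCE = [
    {"id": 1, "station": "A", "temp": 10},
    {"id": 2, "station": "B", "temp": 20},
    {"id": 3, "station": "A", "temp": 30},
]


def max_temp_annotated(rows):
    """Annotation ("eager"): every result carries the ids of its source rows."""
    result = {}
    for row in rows:
        entry = result.setdefault(
            row["station"], {"max_temp": row["temp"], "lineage": []}
        )
        entry["max_temp"] = max(entry["max_temp"], row["temp"])
        entry["lineage"].append(row["id"])
    return result


def max_temp_plain(rows):
    """The same query with no lineage stored alongside the result."""
    result = {}
    for row in rows:
        station = row["station"]
        result[station] = max(result.get(station, row["temp"]), row["temp"])
    return result


def invert_max_temp(station, rows):
    """Inversion ("lazy"): re-scan the source to find the rows behind one result."""
    return [row["id"] for row in rows if row["station"] == station]


print(max_temp_annotated(SOURCE)["A"])  # {'max_temp': 30, 'lineage': [1, 3]}
print(max_temp_plain(SOURCE))           # {'A': 30, 'B': 20}
print(invert_max_temp("A", SOURCE))     # [1, 3]
```

Both routes report rows 1 and 3 for station A. The difference is where the cost falls: annotation pays in storage at derivation time, while inversion pays in computation at query time and fails if the source table is gone.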

32 Syntactic vs. Semantic Representation of Provenance
Syntactic Structure
– XML for Annotations
– Implementation-specific for Inversion
Semantic Knowledge
– Semantic language used to define lineage metadata: RDF, OWL
– Advantages: provides context, enhances searches, lineage proofs
– Ontologies used as a framework for semantic knowledge
– Community effort needed!

33 Provenance Storage
Stored with or separate from data?
– Integrity, accessibility
Maintenance
– Mutability, versioning
– Who is responsible: data creator or central?
Scalability
– # of datasets, depth of lineage, granularity, geographical distribution, # of users
– Inversion vs. Annotation; Distributed vs. Centralized
Overhead
– Collection & storage
– Automation

34 Provenance Dissemination
Browsing provenance as a DAG
– Go back and forward in lineage through GUI
Query based on lineage
– By source data, or generating process
– Enhanced by semantic information
– Drill down during data mining
Verify how data was created by reenactment or by presenting proof statements
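Going "back and forward in lineage" and querying by source data or generating process amount to graph traversals over provenance records. Here is a hedged Python sketch: the `RECORDS` list reuses edges inferred from the formulas on slide 38 (an assumption, since the figure itself is not in the transcript), and the function names are hypothetical.

```python
from collections import defaultdict

# Assumed provenance records: (process, input datasets, output dataset).
RECORDS = [
    ("P1", ["D1", "D2", "D4"], "D0"),
    ("P2", ["D3"], "D1"),
    ("P3", ["D4"], "D2"),
]

parents = defaultdict(set)   # dataset -> datasets it was directly derived from
children = defaultdict(set)  # dataset -> datasets directly derived from it
for _process, inputs, output in RECORDS:
    for source in inputs:
        parents[output].add(source)
        children[source].add(output)


def reachable(start, edges):
    """Collect every node reachable from `start` by following `edges`."""
    found = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return found


def ancestors(dataset):
    """Go back in lineage: all datasets this one was derived from."""
    return reachable(dataset, parents)


def descendants(dataset):
    """Go forward in lineage: all datasets derived from this source."""
    return reachable(dataset, children)


def produced_by(process):
    """Query by generating process."""
    return [output for name, _inputs, output in RECORDS if name == process]


print(sorted(ancestors("D0")))    # ['D1', 'D2', 'D3', 'D4']
print(sorted(descendants("D4")))  # ['D0', 'D2']
print(produced_by("P3"))          # ['D2']
```

The forward query is the one that supports error tracing: if D4 turns out to be faulty, `descendants("D4")` lists every derived dataset that may need to be regenerated.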

35 Taxonomy in Brief
Application of Provenance: Data quality, Audit trail, Attribution, Replication Recipe, Informational
Subject of Provenance: Data vs. Process, Granularity
Representation of Provenance: Annotation vs. Inversion, Contents, Syntactic vs. Semantic
Provenance Storage: Scalability, Overhead
Provenance Dissemination

36 Data Quality for Scientific Data
Fitness for use
Subjective & Objective Parameters
– believability, reputation, reliability
– precision, timeliness, accuracy
Intrinsic Quality of data vs. Quality of data service
– Correctness, consistency
– accessibility, throughput, availability
Good quality for one application may not be good for another (user driven)

37 Estimating Data Quality from Provenance
Hypothesis: For derived datasets, quality depends not just on the dataset but also on its provenance: ancestral processes and data
Quality of a dataset could be a function of:
– Attributes of dataset
– Attributes of generating process
– Ancestral datasets used to derive this dataset
– And so on recursively …

38 Weighted DAG?
[Figure: provenance DAG with data nodes D0, D1, D2, D3, D4, D0’, D2’ and process nodes P1, P2, P3]
D0_q = f(D0, P1_q)
P1_q = f(P1, D1_q, D2_q, D4_q)
D1_q = f(D1, P2_q)
D2_q = f(D2, P3_q)
P2_q = f(P2, D3_q)
P3_q = f(P3, D4_q)
D4_q = f(D4)
D3_q = f(D3)
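The recursive formulas on this slide can be evaluated with a memoized walk over the DAG. The Python sketch below is illustrative only. The talk leaves the function f open, so `combine` (own score multiplied by the weakest upstream score) is a placeholder assumption, the scores in `OWN_SCORE` are made-up values chosen to keep the arithmetic exact, and D3 is treated as a source dataset with D3_q = f(D3).

```python
# Upstream dependencies, one entry per formula on the slide.
UPSTREAM = {
    "D0": ["P1"],
    "P1": ["D1", "D2", "D4"],
    "D1": ["P2"],
    "D2": ["P3"],
    "P2": ["D3"],
    "P3": ["D4"],
    "D3": [],  # source dataset (assumed)
    "D4": [],  # source dataset
}

# Made-up intrinsic scores in [0, 1]; not measurements.
OWN_SCORE = {
    "D0": 1.0, "P1": 0.5, "D1": 1.0, "D2": 1.0,
    "P2": 1.0, "P3": 1.0, "D3": 0.5, "D4": 1.0,
}


def combine(own, upstream):
    """Placeholder for f: own score scaled by the weakest upstream score."""
    return own * min(upstream, default=1.0)


def quality(node, memo=None):
    """Evaluate node_q = f(node, upstream qualities...) recursively."""
    if memo is None:
        memo = {}
    if node not in memo:
        upstream = [quality(dep, memo) for dep in UPSTREAM[node]]
        memo[node] = combine(OWN_SCORE[node], upstream)
    return memo[node]


print(quality("D0"))  # 0.25
print(quality("D2"))  # 1.0
```

With these values the low score of source D3 propagates through P2 and D1 into P1, so D0 ends at 0.25 even though its own score is 1.0. A different choice of `combine`, such as a weighted average, would change the result but not the recursive structure.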

39 Challenges for Quality Metrics
Some processes may produce better quality data than their input datasets
Subsetting or aggregation of data may change the overall quality estimate
Quality of a transformation may be parameter dependent
Multiple user profiles for different applications
Missing lineage information can short-circuit measurement

40 Uses of Data Quality Measurement
Compare and rank datasets uniformly
– Google Personalized
Reduce search space to datasets matching user quality requirement
Build community-wide quality feedback mechanism
– Leverage knowledge of domain experts
– Promote publication of better quality data
– Amazon reviews?

41 Research Questions
What are the metrics for estimating the quality of data using provenance?
How do we optimize user-centric searches based on quality?
How can we recover information from incomplete lineage?

42 Thank you! Questions | Comments

