Metadata, Ontologies, and Provenance: Towards Extended Forms of Data Management
Beth Plale, Yogesh Simmhan
Computer Science Dept.

Presentation transcript:

T18:00-05:00 Networks & Complex Systems Seminar Talk
Slide 2: The Data Deluge
Computational science is increasingly data-intensive, and getting more so. Why?
More complex computations:
  – Nested model runs
  – Linked models
  – Finer resolution
More sources of data products
  – Observational data products, streaming continuously from hundreds of sensor and network sources, scaling to thousands
Large archives
  – Annotations
  – Model configuration parameters
  – Output results
  – Model data
  – Statistical data (e.g., data mining)

Slide 3: Problem
Computational scientists are reaching the limit of their ability to manage the data products associated with their investigations
  – A scientist can touch hundreds to thousands of data products in a single investigation

Slide 4: Seeds of solution in Internet?
Internet has proven the utility of a user-oriented view of information space management
  – Search, tag: browser, bookmarks
  – Publish: blogs, web page tools
But the web is not completely appropriate. The web is
  – Single-writer, multiple-reader, and
  – Search-and-download
Apply the concept of a user-oriented view to managing the data space
Want ability to work locally
  – myLEAD: tool to help an investigator make sense of, and operate in, the vast information space that is computational science (e.g., mesoscale meteorology)

Slide 5: Personal metadata catalog requirements
Scientists have the following needs:
Want to share products but retain control over what gets shared and with whom
  – Data not made public until results appear in journal
Want rich search criteria over vast data space but don’t necessarily want to write SQL queries
Need help managing products generated over an extended period of time (i.e., years)
Want high level of reliability: data must always be accessible

Slide 6: Distributed and replicated personal metadata catalogues
[Figure: a master myLEAD catalog and satellite myLEAD catalogs across the LEAD testbed sites: IU, NCSA, UA Huntsville, Millersville, UCAR Unidata, Okla Univ]
  – Distribution: users partitioned over 6 sites in LEAD testbed
  – Replication: master is replica site for all satellites

Slide 7
[Figure: User Bob’s workspace in 1998, in 2002, and in 2003. Workspace labels: Hurricane Ivan SE quadrant; Voltice study 1998; Voltice study 2002; Voltice study 2003; Workflow template; Collection; Input parameter. The workspaces are backed by the Metadata Catalog (Table of collection, Table of file, Table of User) and by physical data storage, e.g. ftp://fileserver.org/file1998o768]

Slide 8: Ontologies aid in querying
[Figure: data products placed along three axes: structure, sharing, preservation. Labels: Depth 2: searchable; Depth 3: browsable; Does not know existence; Flat structure; Temporary data product; Non-published; Data products of other users; Non-preserved data product; Non-structured data products]
Ontologies provide
  – transparent structure
  – controlled vocabulary

Slide 9: LEAD (
Each year, mesoscale weather – floods, tornadoes, hail, strong winds, lightning, and winter storms – causes hundreds of deaths, routinely disrupts transportation and commerce, and results in annual economic losses > $13B.

Slides 10 to 15: Conventional Numerical Weather Prediction (built up one stage per slide)
OBSERVATIONS
  – Radar Data
  – Mobile Mesonets
  – Surface Observations
  – Upper-Air Balloons
  – Commercial Aircraft
  – Geostationary and Polar Orbiting Satellite
  – Wind Profilers
  – GPS Satellites
Analysis/Assimilation
  – Quality Control
  – Retrieval of Unobserved Quantities
  – Creation of Gridded Fields
Prediction
  – PCs to Teraflop Systems
Product Generation, Display, Dissemination
End Users
  – NWS
  – Private Companies
  – Students
The process is entirely serial and pre-scheduled: no response to weather!

Slides 16 to 17: The LEAD Vision: No Longer Serial or Static
[Figure: the stages of conventional numerical weather prediction, redrawn for the LEAD vision]
OBSERVATIONS: Radar Data, Mobile Mesonets, Surface Observations, Upper-Air Balloons, Commercial Aircraft, Geostationary and Polar Orbiting Satellite, Wind Profilers, GPS Satellites
Analysis/Assimilation: Quality Control, Retrieval of Unobserved Quantities, Creation of Gridded Fields
Prediction: PCs to Teraflop Systems
Product Generation, Display, Dissemination
End Users: NWS, Private Companies, Students

Slide 19: Objective discussed in this talk
Grow the value of the data holdings. Can do so through provenance: process, time, causality
[Figure labels: workflow, myLEAD, time]

Exploiting Provenance Metadata

Slide 21: Contents of Talk
Importance of Provenance
Techniques for Provenance Management
Data Quality and Provenance
Conclusion

Slide 22: Data Provenance
Derivation history of data, starting from its original sources
Data: files, tables, tuples, virtual collections
Derivation: process that transforms data (script, Web service, queries, commands)
Also known as: lineage, pedigree, genealogy, filiation, parentage, …

Slide 23: A Simple Provenance DAG
[Figure: a DAG over data products D0, D1, D2, D3, D4, D0’, D2’ and processes P1, P2, P3]
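A provenance DAG of this kind can be held in a very small data structure. The sketch below is illustrative only: the class and method names are hypothetical, and because the figure did not survive in this transcript, the edges are an assumption inferred from the quality recurrences on the later "Weighted DAG?" slide (P1 consumes D1, D2, D4 and produces D0, and so on). D0’ and D2’ are left out because their place in the graph is not recoverable.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceGraph:
    """Bipartite DAG: processes consume data products and produce new ones."""
    produced_by: dict = field(default_factory=dict)  # data id -> process id
    inputs_of: dict = field(default_factory=dict)    # process id -> list of data ids

    def record(self, process, inputs, output):
        """Record that `process` consumed `inputs` and produced `output`."""
        self.inputs_of[process] = list(inputs)
        self.produced_by[output] = process

    def parents(self, data_id):
        """Direct ancestors of a data product (empty for an original source)."""
        process = self.produced_by.get(data_id)
        return [] if process is None else list(self.inputs_of[process])


def build_example_graph():
    # Assumed edges, inferred from the recurrences on the "Weighted DAG?" slide.
    graph = ProvenanceGraph()
    graph.record("P2", ["D3"], "D1")
    graph.record("P3", ["D4"], "D2")
    graph.record("P1", ["D1", "D2", "D4"], "D0")
    return graph


graph = build_example_graph()
print(graph.parents("D0"))  # data products P1 consumed to produce D0
print(graph.parents("D3"))  # D3 is an original source, so no parents
```

Keeping data-to-process and process-to-inputs as two separate maps makes it easy to answer either a data-oriented or a process-oriented question from the same record.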

Slide 24: Importance of Provenance
Scientific Domain
  – Publications are Provenance!
  – Many scientific datasets available online: Biology, Astronomy (SDSS)
  – Standard metadata describes datasets in well-known repositories
  – Lineage information usually missing, but vital
  – GIS: Fitness for use
  – Material Engineering: Pedigree, Auditing
  – Biology: Citation & copyright, trust
  – Astronomy: Context information

Slide 25: Importance of Provenance
Business Domain
  – Data warehousing: Integrated view over historical data from multiple sources
  – Complex transformations to generate normalized view (ETL)
  – Business analytics and intelligence (OLAP queries)
  – Lineage allows “drill-down” from view to source table
  – Allows tracing back sources of errors
  – “View deletion” problem
[Figure labels: views V0, V1, V2; source tables T1, T2; P1, Q2, Q3; Extract, Transform, Load; View Data; Source Tables]
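The "drill-down" from a view tuple to its source rows can be sketched with Python's built-in sqlite3 module. Everything below is a made-up illustration: the `sales` table, the `sales_by_region` view, and the sample rows are hypothetical, and a real warehouse would need a lineage rule for each kind of view rather than this single group-by case.

```python
import sqlite3


def build_warehouse():
    """Create a tiny in-memory warehouse: one source table and one aggregate view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
        INSERT INTO sales (region, amount) VALUES
            ('east', 10.0), ('east', 15.0), ('west', 7.0);
        CREATE VIEW sales_by_region AS
            SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
    """)
    return conn


def drill_down(conn, region):
    """Return the source rows that contributed to one tuple of the view.

    For a GROUP BY view, the lineage of a view tuple is every source row
    that shares its grouping key.
    """
    cursor = conn.execute(
        "SELECT id, region, amount FROM sales WHERE region = ? ORDER BY id",
        (region,),
    )
    return cursor.fetchall()


conn = build_warehouse()
view_row = conn.execute(
    "SELECT region, total FROM sales_by_region WHERE region = 'east'"
).fetchone()
source_rows = drill_down(conn, view_row[0])
print(view_row, source_rows)
```

If the total for a region looks wrong, the same lookup points at the candidate source rows, which is the error-tracing use the slide mentions.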

Slide 26: Application of Provenance
Data Quality
  – Evaluate quality of data
  – Trust in the source of data
  – Use provenance and metadata information to estimate data quality for a user
  – Assertions and Signatures for provenance guarantee
Audit Trail
  – Error detection
  – Usage log

Slide 27: Application of Provenance
Replication Recipe
  – Provenance can be recipe for generating a dataset
  – Repeat to verify/compare
  – Recreate/replicate
  – Partial updates
Attribution
  – Copyright, citation, check data users
Informational
  – Discover datasets
  – Browse provenance

Slide 28: Subject of Provenance
What is provenance about?
Granularity
  – Attribute, tables, files, data collections
Fine-grained vs. Coarse-grained
  – Trade-off with cost of collecting, storing, querying
Data vs. Process Provenance
  – Provenance can be a graph of data & processes
  – Which of them is provenance focused upon?
  – Hybrid where all grouped together

Slide 29: Process vs. Data Oriented
[Figure: two copies of the provenance DAG over data products D0, D1, D2, D3, D4, D0’, D2’ and processes P1, P2, P3, one for each orientation]

Slide 30: Data Processing Architectures
Service Oriented Architecture
  – Grid & Web services
  – Workflow & Service invocations
  – Data as parameters, references
Databases
  – Update/View Queries, Stored Procedure Calls
  – Views, Tables, tuples, attributes
Scripting, Command-line, etc.

Slide 31: Scheme for Representing Provenance
Scheme for representing provenance
  – Annotations vs. Inversion
Annotation
  – Annotate data with ancestral data & the steps used to derive it, e.g. a DAG
  – Annotation requires more storage; “Eager”
  – Annotation can be as rich as user decides
Inversion
  – Store function (query) used to generate data and invert it
  – Not all functions are invertible; auxiliary data required; JIT computation; query optimization
  – Minimal information provided (“Where”, “Why”)
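The contrast between the two schemes can be shown on one tiny derivation step. This is a minimal sketch under stated assumptions: `select_above`, the sensor keys, and the annotation fields are all hypothetical, and the "inverse" here is simply the stored query re-applied to the retained source data, which is the auxiliary data the inversion approach depends on.

```python
def select_above(readings, threshold):
    """Derivation step: keep readings whose value exceeds the threshold."""
    return {key: value for key, value in readings.items() if value > threshold}


def derive_annotated(readings, threshold):
    """Eager annotation: lineage is computed and stored at derivation time."""
    output = select_above(readings, threshold)
    annotation = {
        "process": "select_above",
        "parameters": {"threshold": threshold},
        "inputs": sorted(output),  # source keys that contributed to the output
    }
    return output, annotation


def derive_invertible(readings, threshold):
    """Inversion: store only the query; lineage is computed when it is asked for."""
    output = select_above(readings, threshold)
    stored_query = ("select_above", threshold)
    return output, stored_query


def invert(stored_query, readings):
    """Lazy lineage: re-apply the stored query to the retained source data."""
    _, threshold = stored_query
    return sorted(key for key, value in readings.items() if value > threshold)


readings = {"s1": 3.0, "s2": 9.5, "s3": 12.0}
_, annotation = derive_annotated(readings, threshold=5.0)
_, query = derive_invertible(readings, threshold=5.0)
print(annotation["inputs"], invert(query, readings))
```

Both paths name the same contributing sources. Annotation pays in storage per derived item, while inversion pays at query time and fails if the source data is no longer available.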

Slide 32: Syntactic vs. Semantic Representation of Provenance
Syntactic Structure
  – XML for Annotations
  – Implementation-specific for Inversion
Semantic Knowledge
  – Semantic language used to define lineage metadata: RDF, OWL
  – Advantages: provides context, enhances searches, lineage proofs
  – Ontologies used as a framework for semantic knowledge
  – Community effort needed!
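To make the semantic option concrete, lineage can be written as subject-predicate-object statements in the style of RDF. The sketch below uses plain Python tuples rather than an RDF library, and the `ex:` vocabulary (`ex:generatedBy`, `ex:used`, `ex:DataProduct`, `ex:Process`) is invented for illustration; it is not a vocabulary named in the talk.

```python
# Lineage statements as (subject, predicate, object) triples.
# The "ex:" terms are a hypothetical vocabulary for this example only.
TRIPLES = {
    ("ex:D0", "rdf:type", "ex:DataProduct"),
    ("ex:P1", "rdf:type", "ex:Process"),
    ("ex:D0", "ex:generatedBy", "ex:P1"),
    ("ex:P1", "ex:used", "ex:D1"),
    ("ex:P1", "ex:used", "ex:D2"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


def inputs_of_product(triples, product):
    """Join two patterns: product -> generating process -> data it used."""
    found = []
    for _, _, process in match(triples, subject=product, predicate="ex:generatedBy"):
        found.extend(o for _, _, o in match(triples, subject=process, predicate="ex:used"))
    return sorted(found)


print(inputs_of_product(TRIPLES, "ex:D0"))
```

Because the predicates carry meaning that a shared ontology can define, the same pattern matching works across catalogs that agree on the vocabulary, which is where the community effort comes in.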

Slide 33: Provenance Storage
Stored with or separate from data?
  – Integrity, accessibility
Maintenance
  – Mutability, versioning
  – Who is responsible: data creator or central?
Scalability
  – # of datasets, depth of lineage, granularity, geographical distribution, # of users
  – Inversion vs. Annotation; Distributed vs. Centralized
Overhead
  – Collection & storage
  – Automation

Slide 34: Provenance Dissemination
Browsing Provenance as a DAG
  – Go back and forward in lineage through GUI
Query based on lineage
  – By source data, or generating process
  – Enhanced by semantic information
  – Drill down during data mining
Verify how data was created by reenactment or present proof statements
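Going "back and forward in lineage" amounts to graph traversal in the two edge directions. The following sketch assumes a small edge table whose contents are hypothetical (they follow the recurrences on the "Weighted DAG?" slide); the function names are likewise illustrative.

```python
from collections import deque

# item -> items it was directly derived from (assumed example edges)
DERIVED_FROM = {
    "D0": ["P1"],
    "P1": ["D1", "D2", "D4"],
    "D1": ["P2"],
    "P2": ["D3"],
    "D2": ["P3"],
    "P3": ["D4"],
}


def reachable(start, edges):
    """Breadth-first search returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert_edges(edges):
    """Reverse every edge so the graph can be walked forward in lineage."""
    forward = {}
    for child, parents in edges.items():
        for parent in parents:
            forward.setdefault(parent, []).append(child)
    return forward


def ancestors(item):
    """Walk back: everything this item was derived from."""
    return reachable(item, DERIVED_FROM)


def descendants(item):
    """Walk forward: everything derived from this item."""
    return reachable(item, invert_edges(DERIVED_FROM))


def derived_from_source(source):
    """Query by source data: data products (ids starting with 'D') downstream of it."""
    return sorted(n for n in descendants(source) if n.startswith("D"))


print(sorted(ancestors("D1")), derived_from_source("D4"))
```

A query by generating process works the same way: `descendants("P3")` lists what depends on that process.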

Slide 35: Taxonomy in Brief
Application of Provenance
  – Data quality
  – Audit trail
  – Attribution
  – Replication Recipe
  – Informational
Subject of Provenance
  – Data vs. Process
  – Granularity
Representation of Provenance
  – Annotation vs. Inversion
  – Contents
  – Syntactic vs. Semantic
Provenance Storage
  – Scalability
  – Overhead
Provenance Dissemination

Slide 36: Data Quality for Scientific Data
Fitness for use
Subjective & Objective Parameters
  – believability, reputation, reliability
  – precision, timeliness, accuracy
Intrinsic Quality of data vs. Quality of data service
  – Correctness, consistency
  – accessibility, throughput, availability
Good quality for one application may not be good for another (user driven)

Slide 37: Estimating Data Quality from Provenance
Hypothesis: For derived datasets, quality depends not just on the dataset but also on its provenance: ancestral processes and data
Quality of a dataset could be a function of:
  – Attributes of dataset
  – Attributes of generating process
  – Ancestral Datasets used to derive this dataset
  – And so on recursively …

Slide 38: Weighted DAG?
[Figure: the provenance DAG over data products D0, D1, D2, D3, D4, D0’, D2’ and processes P1, P2, P3, annotated with quality estimates]
D0_q = f(D0, P1_q)
P1_q = f(P1, D1_q, D2_q, D4_q)
D1_q = f(D1, P2_q)
D2_q = f(D2, P3_q)
P2_q = f(P2, D3_q)
P3_q = f(P3, D4_q)
D4_q = f(D4)
D3_q = f(D3)
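These recurrences can be evaluated with a memoized recursion over the DAG. The talk leaves the function f open, so the sketch below picks one purely for illustration: a node's quality is its own score multiplied by the lowest quality among the nodes it depends on. That choice, the scores in `OWN_SCORE`, and all names are assumptions, not the authors' metric.

```python
from functools import lru_cache

# node -> nodes whose quality it depends on, following the recurrences above
DEPENDS_ON = {
    "D0": ("P1",), "P1": ("D1", "D2", "D4"),
    "D1": ("P2",), "D2": ("P3",),
    "P2": ("D3",), "P3": ("D4",),
    "D3": (), "D4": (),
}

# Hypothetical intrinsic scores in [0, 1] for each dataset and process.
OWN_SCORE = {
    "D0": 1.0, "P1": 0.9, "D1": 1.0, "D2": 1.0,
    "P2": 0.8, "P3": 1.0, "D3": 0.5, "D4": 1.0,
}


@lru_cache(maxsize=None)
def quality(node):
    """Assumed f: own score times the weakest upstream quality.

    Leaves (original sources) have no upstream nodes, so their quality
    is just their own score, matching D3_q = f(D3) and D4_q = f(D4).
    """
    own = OWN_SCORE[node]
    upstream = [quality(parent) for parent in DEPENDS_ON[node]]
    return own if not upstream else own * min(upstream)


for name in ("D3", "P2", "D1", "P1", "D0"):
    print(name, round(quality(name), 3))
```

With this f, one weak source (D3) caps the quality of everything derived from it. A multiplicative f can never let a process raise quality above its inputs, so a different f would be needed wherever a process improves on its input data.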

Slide 39: Challenges for Quality Metrics
Some processes may produce better quality data than their input datasets
Subsetting, aggregation of data may change overall quality estimate
Quality of transformation may be parameter dependent
Multiple user profiles for different applications
Missing lineage information can short-circuit measurement

Slide 40: Uses of Data Quality Measurement
Compare and rank datasets uniformly
  – Google Personalized
Reduce search space to datasets matching user quality requirement
Build community-wide quality feedback mechanism
  – Leverage knowledge of domain expert
  – Promote publication of better quality data
  – Amazon reviews?
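Once each dataset carries a quality estimate, the first two uses reduce to a filter and a sort. The sketch below assumes a single numeric score per dataset; the dataset names and scores are made up, and `search` is a hypothetical helper rather than part of any catalog API.

```python
def search(datasets, min_quality):
    """Keep datasets meeting the user's quality requirement, best first.

    Ties on quality are broken by name so the ranking is deterministic.
    """
    matching = [d for d in datasets if d["quality"] >= min_quality]
    return sorted(matching, key=lambda d: (-d["quality"], d["name"]))


# Hypothetical catalog entries with precomputed quality estimates.
catalog = [
    {"name": "radar_2003", "quality": 0.92},
    {"name": "mesonet_2002", "quality": 0.40},
    {"name": "profiler_1998", "quality": 0.75},
]

results = [d["name"] for d in search(catalog, min_quality=0.7)]
print(results)
```

Because good quality for one application may not be good for another, `min_quality` would in practice come from a per-user or per-application profile.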

Slide 41: Research Questions
What are the metrics for estimating the quality of data using provenance?
How do we optimize user-centric searches based on quality?
How can we recover information from incomplete lineage?

Thank you! Questions | Comments