Quality views: capturing and exploiting the user perspective on data quality Paolo Missier, Suzanne Embury, Mark Greenwood School of Computer Science University.


1 Quality views: capturing and exploiting the user perspective on data quality
Paolo Missier, Suzanne Embury, Mark Greenwood (School of Computer Science, University of Manchester, UK)
Alun Preece, Binling Jin (Department of Computing Science, University of Aberdeen, UK)
http://www.qurator.org

2 Combining the strengths of UMIST and The Victoria University of Manchester
Integration of public data (in biology)
GenBank, UniProt, EnsEMBL, Entrez, dbSNP
- Large volumes of data in many public repositories
- Increasingly creative uses for this data
- Their quality is largely unknown

3 Quality of e-science data
A data consumer's view on quality: criteria for data acceptability within a specific data processing context
Defining quality can be challenging:
- In-silico experiments express cutting-edge research
  - Experimental data liable to change rapidly
  - Definitions of quality are themselves experimental
- Scientists' quality requirements often just a hunch
  - Quality tests missing or based on experimental heuristics
  - Often implicit and embedded in the experiment → not reusable

4 Example: protein identification
Pipeline (diagram): "Wet lab" experiment → data output → protein identification algorithm (with reference databases) → protein hitlist → quality filtering → protein function prediction
Quality filtering: remove likely false positives → improve prediction accuracy
Support evidence: provenance metadata
Goal: to explicitly define and automatically add the additional filtering step in a principled way

5 Our goals
Offer e-scientists a principled way to:
- Discover quality definitions for specific data domains
- Make them explicit using a formal model
- Implement them in their data processing environment
- Test them on their data
… in an incremental refinement cycle
Benefits:
- Automated processing
- Reusability
- "Plug-in" quality components

6 Approach
Research hypothesis: adding quality to data can be made cost-effective
- By separating out generic quality processing from domain-specific definitions
Qurator architectural framework:
- Define abstract quality views on the data
- Map quality view to an executable process
- Execute quality views
  - runtime environment
  - data-specific quality services

7 Abstract quality view model
Diagram elements:
- Data
- Data annotation: quality metadata (evidence e1, e2, e3, e.g. Coverage, PeptidesCount)
- Assertions: Classification 1 into class space 1 (C11, C12, …), Classification 2 into class space 2 (C21, C22, …)
- Conditions: regions specification
- Actions on regions
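The elements of the model above can be sketched as plain data structures. This is only an illustrative sketch, not the Qurator implementation: the class names (DataItem, Assertion, Region), the "CoverageClass" tag, the two-label class space and the threshold are all assumptions. Only the evidence names Coverage and PeptidesCount come from the slide.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Evidence = Dict[str, float]  # evidence name -> value (quality metadata)


@dataclass
class DataItem:
    """A data item with its evidence annotations and assertion tags."""
    item_id: str
    evidence: Evidence = field(default_factory=dict)
    tags: Dict[str, str] = field(default_factory=dict)


@dataclass
class Assertion:
    """Classifies an item into one label of a class space, based on evidence."""
    tag_name: str                        # where the label is stored on the item
    class_space: List[str]               # allowed labels (C11, C12, ... on the slide)
    classify: Callable[[Evidence], str]  # evidence -> label

    def apply(self, item: DataItem) -> None:
        label = self.classify(item.evidence)
        if label not in self.class_space:
            raise ValueError(f"{label!r} is outside the class space")
        item.tags[self.tag_name] = label


@dataclass
class Region:
    """A condition that specifies a region, plus the action for items in it."""
    condition: Callable[[DataItem], bool]
    action: str


# Hypothetical assertion: the labels and the threshold are illustrative only.
coverage_class = Assertion(
    tag_name="CoverageClass",
    class_space=["low", "high"],
    classify=lambda ev: "high" if ev["Coverage"] > 12 else "low",
)
keep_high = Region(condition=lambda d: d.tags["CoverageClass"] == "high",
                   action="accept")

item = DataItem("P1", evidence={"Coverage": 20.0, "PeptidesCount": 5.0})
coverage_class.apply(item)
selected = keep_high.action if keep_high.condition(item) else "reject"
```

Keeping assertions (which only tag data) separate from regions (which decide what to do with tagged data) mirrors the split between classifications and conditions/actions in the diagram.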

8 Semantic model for quality concepts
- Quality "upper ontology" (OWL)
- Quality evidence types
- Evidence meta-data model (RDF)
- Evidence annotations are class instances

9 Quality hypotheses discovery and testing
Testing cycle (diagram): abstract quality view → compilation → execution on test data → performance assessment
Deployment (diagram): abstract quality view → targeted compilation → target-specific quality component → quality-enhanced user environment
Multiple target environments: workflow, query processor

10 Generic quality process pattern
Collect evidence
- Fetch persistent annotations (persistent evidence)
- Compute on-the-fly annotations
  <variables>
    <var variableName="Coverage" evidence="q:Coverage"/>
    <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
  </variables>
Compute assertions (classifier)
  <QualityAssertion serviceName="PIScoreClassifier" serviceType="q:PIScoreClassifier" tagSemType="q:PIScoreClassification" tagName="ScoreClass" ...
Evaluate conditions
  ScoreClass in {"q:high", "q:mid"} and Coverage > 12
Execute actions
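The four steps of the pattern can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the in-memory PERSISTENT table stands in for persistent evidence, the classifier thresholds are invented, and the only action shown is "keep the item". The condition itself is the one given on the slide.

```python
from typing import Dict, List

# Hypothetical stand-in for the store of persistent evidence annotations.
PERSISTENT: Dict[str, Dict[str, float]] = {
    "P1": {"Coverage": 20.0, "PeptidesCount": 6.0},
    "P2": {"Coverage": 8.0, "PeptidesCount": 2.0},
    "P3": {"Coverage": 15.0, "PeptidesCount": 1.0},
}


def collect_evidence(item_id: str) -> Dict[str, float]:
    """Step 1: fetch the persistent annotations for one data item."""
    return dict(PERSISTENT[item_id])


def score_class(evidence: Dict[str, float]) -> str:
    """Step 2: toy classifier standing in for PIScoreClassifier.

    The thresholds are invented for illustration.
    """
    count = evidence["PeptidesCount"]
    if count >= 5:
        return "q:high"
    if count >= 2:
        return "q:mid"
    return "q:low"


def condition(tag: str, evidence: Dict[str, float]) -> bool:
    """Step 3: ScoreClass in {"q:high", "q:mid"} and Coverage > 12."""
    return tag in {"q:high", "q:mid"} and evidence["Coverage"] > 12


def run_quality_view(item_ids: List[str]) -> List[str]:
    accepted = []
    for item_id in item_ids:
        evidence = collect_evidence(item_id)
        tag = score_class(evidence)
        if condition(tag, evidence):
            accepted.append(item_id)  # Step 4: the action here is "keep"
    return accepted


accepted = run_quality_view(["P1", "P2", "P3"])
```

With this toy data only P1 passes: P2 fails the Coverage test and P3 is classified below "q:mid".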

11 Bindings: assertion → service
- Service class → Web service endpoint (service registry), e.g. PIScoreClassifier → http://localhost/axis/services/PIScoreClassifierSvc
- All services implement the same WSDL interface
  - Makes concrete assertion functions homogeneous
  - Facilitates compilation
- Uniform input / output messages
  - Input: D = {(d_i, evidence(d_i))}
  - Output: {class(d_i)} or {score(d_i)}
- Example services behind the common WSDL interface: PIScoreClassifierSvc, PI_Top_k_svc
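The binding idea can be sketched with a registry that maps a service class to an endpoint, and local functions that share one call signature. This is an illustration only: the functions are local stand-ins for real Web service calls, their logic is invented, and the "q:PITopK" service class and the PI_Top_k_svc URL are hypothetical. Only the PIScoreClassifierSvc endpoint appears on the slide.

```python
from typing import Callable, Dict, List, Tuple

Evidence = Dict[str, float]
Dataset = List[Tuple[str, Evidence]]              # D = {(d_i, evidence(d_i))}
Service = Callable[[Dataset], Dict[str, object]]  # one uniform signature


def pi_score_classifier_svc(data: Dataset) -> Dict[str, object]:
    """Returns {class(d_i)}; the classification rule is a placeholder."""
    return {d: ("q:high" if ev["Coverage"] > 12 else "q:low") for d, ev in data}


def pi_top_k_svc(data: Dataset) -> Dict[str, object]:
    """Returns {score(d_i)}; using PeptidesCount as the score is a placeholder."""
    return {d: ev["PeptidesCount"] for d, ev in data}


# Service registry: service class -> endpoint.
SERVICE_REGISTRY: Dict[str, str] = {
    "q:PIScoreClassifier": "http://localhost/axis/services/PIScoreClassifierSvc",
    "q:PITopK": "http://localhost/axis/services/PI_Top_k_svc",  # hypothetical
}

# Local stand-ins for the services reachable at each endpoint.
ENDPOINTS: Dict[str, Service] = {
    "http://localhost/axis/services/PIScoreClassifierSvc": pi_score_classifier_svc,
    "http://localhost/axis/services/PI_Top_k_svc": pi_top_k_svc,
}


def invoke(service_type: str, data: Dataset) -> Dict[str, object]:
    """Resolve the binding, then call the service through the common interface."""
    return ENDPOINTS[SERVICE_REGISTRY[service_type]](data)


D: Dataset = [
    ("P1", {"Coverage": 20.0, "PeptidesCount": 6.0}),
    ("P2", {"Coverage": 8.0, "PeptidesCount": 2.0}),
]
classes = invoke("q:PIScoreClassifier", D)
scores = invoke("q:PITopK", D)
```

Because both services accept the same input shape, the caller does not need service-specific glue, which is the property the slide credits with making compilation easier.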

12 Execution model for Quality views
Binding → compilation → executable component
- Sub-flow of an existing workflow
- Query processing interceptor
Diagram: the QV compiler takes the abstract quality view and produces an embedded quality workflow inside the host workflow (host workflow: D → D'), giving a quality view on D'. The Qurator quality framework provides the services registry and the services implementation.
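Embedding a compiled quality view as a sub-flow of a host workflow can be sketched as below. The sketch assumes a workflow is just an ordered list of functions; compile_quality_view, embed, the two host steps and the coverage threshold are hypothetical names and values chosen for illustration, not the actual QV compiler or Taverna API.

```python
from typing import Callable, List

Step = Callable[[list], list]


def compile_quality_view(condition: Callable[[dict], bool]) -> Step:
    """Toy "QV compiler": turn a condition into an executable filtering step."""
    def quality_step(items: list) -> list:
        return [item for item in items if condition(item)]
    return quality_step


def embed(host: List[Step], quality_step: Step, position: int) -> List[Step]:
    """Insert the quality step at the embedding point; the host list is not modified."""
    return host[:position] + [quality_step] + host[position:]


def run(workflow: List[Step], data: list) -> list:
    for step in workflow:
        data = step(data)
    return data


# Hypothetical two-step host workflow (D -> D').
def identify(raw: list) -> list:
    return [{"protein": name, "coverage": cov} for name, cov in raw]


def predict(hits: list) -> list:
    return [hit["protein"] for hit in hits]


host = [identify, predict]
quality = compile_quality_view(lambda hit: hit["coverage"] > 12)
enhanced = embed(host, quality, position=1)

raw_data = [("P1", 20), ("P2", 8)]
original_out = run(host, raw_data)
enhanced_out = run(enhanced, raw_data)
```

The host workflow itself is left unchanged; the quality-aware variant is a new list with one extra step.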

13 Example: original proteomics workflow
Taverna (*): workflow language and enactment engine for e-science applications
(*) part of the myGrid project, University of Manchester - taverna.sourceforge.net
[Figure: workflow diagram with the quality flow embedding point marked]

14 Example: embedded quality workflow

15 Interactive conditions / actions

16 Quality views for queries
Actions: filtering, dump to DB / file
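A quality view applied to a query can be sketched as an interceptor around query execution. The sketch below is an assumption-laden illustration: it uses an in-memory SQLite table with invented columns, applies a filtering action to the result rows, and dumps the rejected rows to a CSV file-like object. The slide does not say which rows are dumped, so that choice is an assumption.

```python
import csv
import io
import sqlite3
from typing import Callable, List, TextIO


def run_with_quality_view(conn: sqlite3.Connection, sql: str,
                          condition: Callable[[dict], bool],
                          dump: TextIO) -> List[dict]:
    """Run the query, keep rows that satisfy the condition, dump the rest as CSV."""
    cursor = conn.execute(sql)
    columns = [col[0] for col in cursor.description]
    writer = csv.writer(dump)
    writer.writerow(columns)
    kept = []
    for row in cursor:
        record = dict(zip(columns, row))
        if condition(record):
            kept.append(record)   # action: filtering
        else:
            writer.writerow(row)  # action: dump to file
    return kept


# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (protein TEXT, coverage REAL)")
conn.executemany("INSERT INTO hits VALUES (?, ?)", [("P1", 20.0), ("P2", 8.0)])

dump = io.StringIO()
kept = run_with_quality_view(
    conn,
    "SELECT protein, coverage FROM hits ORDER BY protein",
    lambda r: r["coverage"] > 12,
    dump,
)
```

A real file opened with open(path, "w", newline="") could be passed in place of the StringIO object.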

17 Qurator architecture

18 Summary
- For complex data types, often no single "correct" and agreed-upon definition of quality of data
- Qurator provides an environment for fast prototyping of quality hypotheses
  - Based on the notion of "evidence" supporting a quality hypothesis
  - With support for an incremental learning cycle
- Quality views offer an abstract model for making data processing environments quality-aware
  - To be compiled into executable components and embedded
  - Qurator provides an invocation framework for Quality Views
More info and papers: http://www.qurator.org
Live demos (informal) available

