Quality-driven Integration of Heterogeneous Information Systems by Felix Naumann, et al. (VLDB 1999). 17 Feb 2006. Presented by Heasoo Hwang.


1 Quality-driven Integration of Heterogeneous Information Systems by Felix Naumann, et al. (VLDB 1999). 17 Feb 2006. Presented by Heasoo Hwang.

2 Introduction. Motivation. Observation: the main user criterion for selecting sources by hand is NOT just response time, BUT the expected quality of the data. The sources have varying information quality: results become outdated quickly, and many experimental techniques are intrinsically imprecise. Contribution: integration of classical query planning with the assessment and consideration of information quality (IQ).

3 Correctness and Completeness. For a given user query, UQ, against the mediator schema: a “correct plan” is a combination of QCAs (query correspondence assertions) that is semantically contained in the UQ; such plans compute only correct results. The “complete answer” to a UQ w.r.t. the given QCAs is the union over the answers of all correct plans. Problem: too many correct plans!

4 Example (1/2). Global tables: sequence and gene. A user query: the sequence of a specific gene. The mediator detects from the QCAs that S5 and two other sources can be used for the gene part, and S1, S2, and S3 for the sequence part. We can generate 9 correct plans. Question: DO WE HAVE TO EXECUTE ALL 9 CORRECT PLANS?
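
The count of 9 follows from pairing each of the three gene sources with each of the three sequence sources. A minimal sketch of that enumeration is below; the slide names only S5 for the gene part, so `G_a` and `G_b` are hypothetical placeholders for the two unnamed sources, and a plan is modeled simply as a pair of source names.

```python
from itertools import product

# "G_a" and "G_b" are hypothetical names for the two gene sources
# that the slide does not name.
gene_sources = ["S5", "G_a", "G_b"]
sequence_sources = ["S1", "S2", "S3"]

# A plan joins one gene source with one sequence source.
plans = list(product(gene_sources, sequence_sources))

for gene_src, seq_src in plans:
    print(f"join({gene_src}, {seq_src})")

print(len(plans))  # 9
```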

5 Example (2/2). Assuming that IQ scores are available. Sequence data on S1: S1 copies infrequently from other sites, sometimes introducing parsing errors. Sequence data on S3: highly up-to-date, but few annotations are provided. Reducing the number of correct plans to be executed: we may consider 3 correct plans instead of 9. Case 1: if the user is particularly interested in complete annotation → we conclude that plans using S3 are not very promising. Case 2: if highly up-to-date data is required → S1 can probably be ignored.

6 “Completeness of integrated information sources” by Felix Naumann, et al. (Information Systems 2004). Implicit assumption of most information integration projects: “The mediator should always compute the complete answer.” In many cases, this assumption is wrong! “Computing the complete answer is not always necessary”: for example, a meta-search engine does not need to download all hits from all search engines it uses; instead, taking the top ten hits usually suffices. “Computing the complete answer may be too expensive or may take too long.” Another assumption they make: “The most complete response to the user is the best, given some cost limits.”

7 IQ classification. Source-specific criteria: determine the overall quality of a data source, e.g., reputation. QCA-specific criteria: determine the quality aspects of the specific queries that a source can compute, e.g., response time. Attribute-specific criteria: assess the quality of a source in terms of its ability to provide the attributes of a specific user query, e.g., the completeness of the annotation attribute on a source. The classification may vary depending on the application domain and the structure of the available sources. Problem: assigning IQ scores in an objective manner is difficult, and some IQ criteria are highly subjective (e.g., reputation) → use user profiles, i.e., sets of IQ scores for all subjective criteria.

8 Source-specific criteria. Ease of understanding: user ranking. Reputation: user ranking. Reliability: ranking of experimental method (intrinsic error rate). Timeliness: average age of data.

9 QCA-specific criteria. Availability: percentage of time the source is accessible. Price: monetary price of a query. Representational consistency: wrapper workload; e.g., a wrapper with a relational export schema is always consistent with the global schema. Response time: average waiting time for a response. Accuracy: percentage of objects without errors; errors are usually introduced during data input. Relevancy: percentage of real-world objects represented; usually highly user-dependent.

10 Attribute-specific criteria. Completeness: fullness of the relation in each attribute (horizontal fitness); e.g., an attribute with 90% null values. Amount: number of unwanted attributes (vertical fitness).
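
As an illustration of the completeness criterion, the sketch below measures the fullness of one attribute as the fraction of rows with a non-null value. The row layout (a list of dicts with `None` for null) and the function name `completeness` are assumptions made for this example, not something the slides prescribe.

```python
def completeness(rows, attribute):
    """Fraction of rows whose value for `attribute` is not null (None)."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(attribute) is not None)
    return filled / len(rows)

# Illustrative data: 10 rows, 9 of which have a null annotation (90% null).
rows = [{"gene": "g1", "annotation": "kinase"}]
rows += [{"gene": f"g{i}", "annotation": None} for i in range(2, 11)]

print(completeness(rows, "annotation"))  # 0.1
print(completeness(rows, "gene"))        # 1.0
```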

11 Algorithm (three phases). Input: user query; sources with QCAs and IQ scores. Phase 1: source selection with source-specific criteria → best sources. Phase 2: planning with QCAs → all correct plans. Phase 3: plan selection with QCA- and attribute-specific criteria → best plans.


13 Phase 1: Source selection. Goal: use the source-specific IQ criteria to “weed out” sources that are qualitatively not as good as others (non-good sources); non-good sources are completely disregarded in further planning. Method used: Data Envelopment Analysis (DEA), developed by Charnes et al., a general method to classify a population of observations that avoids the problems of scaling and weighting. Exceptions: do not remove a source S with low IQ if S is the only source providing a certain attribute of the global schema, or if S exclusively provides certain extensions of an attribute.
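
The first exception can be sketched as a guard around whatever classification marks sources as good. The sketch below does not implement DEA: it takes the good/non-good verdict as an input (`is_good`, a hypothetical dict) and only shows how a non-good source survives when it is the sole provider of some attribute. The second exception (exclusive extensions of an attribute) is not modeled here.

```python
def select_sources(provides, is_good):
    """Keep good sources, plus any non-good source that is the only
    provider of at least one attribute of the global schema.

    provides: dict mapping source name -> set of attributes it provides
    is_good:  dict mapping source name -> bool (verdict of the IQ
              classification, e.g. DEA; supplied as input, not computed)
    """
    kept = set()
    for name, attrs in provides.items():
        # Attributes available from every other source combined.
        others = set()
        for other, other_attrs in provides.items():
            if other != name:
                others |= other_attrs
        sole_provider = bool(attrs - others)
        if is_good[name] or sole_provider:
            kept.add(name)
    return kept

# Illustrative data: S2 is non-good and redundant; S4 is non-good but
# is the only source for "annotation".
provides = {"S1": {"sequence"}, "S2": {"sequence"}, "S4": {"annotation"}}
is_good = {"S1": True, "S2": False, "S4": False}

print(sorted(select_sources(provides, is_good)))  # ['S1', 'S4']
```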

14 Phase 2: Plan creation. Input: UQ with the user weightings for each attribute. Output: plans, each possibly producing a different set of correct tuples for UQ.

15 Phase 3: Plan selection. Goal: qualitatively rank the plans of the previous phase and restrict plan execution to meet stop conditions. Stop condition 1: execute some best percentage of plans. Stop condition 2: execute as many plans as necessary to meet certain cost or quality criteria. Three steps: a) QCA quality: the IQ scores of the QCAs are determined. b) Plan quality: b1) the quality model (tree-structured) aggregates these scores along tree paths; b2) an overall score is obtained at the root of the tree, which forms the score of the entire plan. c) Plan ranking: rank all plans using the IQ score of each plan.
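
The two stop conditions can be sketched as simple cut-offs over an already ranked plan list. This is only one possible reading: the function names, the per-plan `cost_of` table, and the choice of a cost budget as the criterion for the second condition are assumptions made for illustration.

```python
import math

def top_fraction(ranked_plans, fraction):
    """Stop condition 1: keep the best `fraction` of the ranked plans."""
    count = math.ceil(len(ranked_plans) * fraction)
    return ranked_plans[:count]

def within_budget(ranked_plans, cost_of, budget):
    """Stop condition 2 (cost variant, assumed): take plans in rank order
    and stop before the first plan that would exceed the budget."""
    chosen, spent = [], 0.0
    for plan in ranked_plans:
        if spent + cost_of[plan] > budget:
            break
        chosen.append(plan)
        spent += cost_of[plan]
    return chosen

# Illustrative data: six plans, already ranked best-first.
ranked = ["P1", "P2", "P3", "P4", "P5", "P6"]
cost_of = {"P1": 4, "P2": 5, "P3": 3, "P4": 1, "P5": 2, "P6": 6}

print(top_fraction(ranked, 0.5))           # ['P1', 'P2', 'P3']
print(within_budget(ranked, cost_of, 10))  # ['P1', 'P2']
```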

16 3a) QCA quality – determine IQ vectors for the QCAs The general IQ vector for QCAs The IQ vectors for QCAs participating in the six correct plans

17 3b) Plan quality. The six plans have aggregated IQ vectors. Merging IQ vectors in join nodes: the IQ vector for an inner join node. Up to this point, the scores are neither scaled nor weighted, making a comparison or ranking of plans impossible.
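
To make the idea of merging IQ vectors at a join node concrete, here is a minimal sketch over four of the criteria. The transcript does not reproduce the actual merge functions, so the ones used here (product for availability and accuracy, sum for price, maximum for response time) are assumptions chosen for illustration only, as are the input values.

```python
def merge_join(left, right):
    """Combine the IQ vectors of the two inputs of a join node.
    The per-criterion merge functions are assumed, not taken from the slides."""
    return {
        # Both inputs must be reachable for the join to succeed.
        "availability": left["availability"] * right["availability"],
        # Both queries have to be paid for.
        "price": left["price"] + right["price"],
        # Inputs are assumed to be fetched in parallel.
        "response_time": max(left["response_time"], right["response_time"]),
        # A joined object is assumed correct only if both parts are.
        "accuracy": left["accuracy"] * right["accuracy"],
    }

# Illustrative IQ vectors for two QCAs.
qca_left = {"availability": 0.99, "price": 0, "response_time": 2, "accuracy": 0.9}
qca_right = {"availability": 0.9, "price": 5, "response_time": 3, "accuracy": 0.8}

print(merge_join(qca_left, qca_right))
```

Because a plan is a tree of joins, applying such a function bottom-up yields one aggregated vector at the root.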

18 3c) Plan ranking. Method used: the Simple Additive Weighting (SAW) method. Scaling: positive criteria are availability, accuracy, relevancy, and completeness; negative criteria are price, representational consistency, response time, and amount. Computing the weighted sum: needs a user-specific weight vector, which reflects the importance of the individual criteria to the user and is stored in the user profile. IQ scores of plans obtained with the indifferent weight vector (each weight value is 1/8).
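
The scaling and weighted-sum steps of Simple Additive Weighting can be sketched as follows. Min-max scaling is the usual choice for this method and is assumed here: positive criteria are scaled so that the highest value maps to 1, negative criteria so that the lowest value maps to 1. The two plan vectors are invented for illustration; only the criteria names and the indifferent weight of 1/8 come from the slide.

```python
POSITIVE = {"availability", "accuracy", "relevancy", "completeness"}
NEGATIVE = {"price", "representational_consistency", "response_time", "amount"}

def saw_scores(plans, weights):
    """Return one score per plan: the weighted sum of min-max scaled criteria."""
    scores = {name: 0.0 for name in plans}
    for criterion, weight in weights.items():
        values = [vector[criterion] for vector in plans.values()]
        lowest, highest = min(values), max(values)
        span = highest - lowest
        for name, vector in plans.items():
            if span == 0:
                scaled = 1.0  # all plans tie on this criterion
            elif criterion in POSITIVE:
                scaled = (vector[criterion] - lowest) / span
            else:
                scaled = (highest - vector[criterion]) / span
            scores[name] += weight * scaled
    return scores

# Illustrative IQ vectors for two plans (values are made up).
plans = {
    "A": {"availability": 0.99, "price": 0, "representational_consistency": 1,
          "response_time": 0.2, "accuracy": 0.999, "relevancy": 0.6,
          "completeness": 0.95, "amount": 60},
    "B": {"availability": 0.95, "price": 5, "representational_consistency": 2,
          "response_time": 0.5, "accuracy": 0.99, "relevancy": 0.8,
          "completeness": 0.9, "amount": 40},
}

# Indifferent weight vector: every criterion weighs 1/8.
weights = {criterion: 1 / 8 for criterion in POSITIVE | NEGATIVE}

scores = saw_scores(plans, weights)
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)  # ['A', 'B']
```

Plan A wins on six of the eight criteria and plan B on two, so with equal weights A scores 0.75 and B scores 0.25.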

