Making Holistic Schema Matching Robust: An Ensemble Approach Bin He Joint work with: Kevin Chen-Chuan Chang Univ. Illinois at Urbana-Champaign.


Presentation transcript:

1 Making Holistic Schema Matching Robust: An Ensemble Approach Bin He Joint work with: Kevin Chen-Chuan Chang Univ. Illinois at Urbana-Champaign

2 Background: MetaQuerier: large-scale integration of the deep Web
[Figure labels: MetaQuerier, Query, Result, The Deep Web]

3 MetaQuerier: System architecture [CIDR’05]
Back-end: Semantics Discovery
- Database Crawler
- Interface Extraction
- Source Organization
- Schema Matching
Front-end: Query Execution
- Query Translation
- Source Selection
- Result Compilation
Deep Web Repository
- Query Interfaces
- Query Capabilities
- Subject Domains
- Unified Interfaces
[Other figure labels: The Deep Web, MetaQuerier, Find Web databases, Query Web databases]

4 Matching query interfaces (QIs)
[Figure labels: Book Domain, Music Domain, 1:1 simple matching, m:n complex matching]

5 Traditional approaches of schema matching: Pairwise attribute correspondence
Typical matching approaches
- Cupid [VLDB’01]
- LSD [SIGMOD’01]
Scale is a challenge
- Only small scale
- Large-scale is a must for our task
Scale is an opportunity
- Context information is not exploited: similar attributes across multiple schemas, co-occurrence patterns among attributes
Pairwise Matching
S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Pairwise Attribute Correspondence
S1.author ↔ S3.name
S1.subject ↔ S2.category

6 Emerging paradigm: Holistic schema matching approach
Match many schemas at the same time and find all the matchings at once
Input: a set of schemas
S1: author, title, subject, ISBN
S2: writer, title, category, format
S3: name, title, keyword, binding
Output: a ranked list of matchings
author = writer = name
subject = category
format = binding

7 Various techniques to realize holistic matching
Matching as hidden model discovery: Model the generative behavior of schemas from attributes and their semantic relationships
- The MGS framework [SIGMOD’03]
Matching as correlation mining: The correlation of attributes across sources reflects complex relationships
- The DCM framework [KDD’04]
Matching as clustering: Attributes in two schemas may be similar through attributes in other schemas
- Interactive clustering based matcher [SIGMOD’04]
- WISE-Integrator [VLDB’03]

8 Holistic matching is, in essence, data mining to discover semantics for information integration
[Figure labels: Semantics (semantic correspondences), Observations (attribute occurrences), Hidden Regularities, Generation, Statistical Analysis, Hypothesis]
Holistic matching approach: hidden model discovery, correlation mining, clustering

9 The baseline holistic matching architecture with matching as correlation mining
Sources: AA.com, United.com, Expedia.com, Delta.com
Matcher: The DCM matcher
Matchings: {adult, child, senior} = passenger; departure date = depart

10 The challenge in holistic input: Noisy data quality
Because it is a mining technique, holistic matching suffers from the inherent problem of noisy data quality!
Noisy input is inevitable
- extraction of QIs may contain errors
- organization of QIs may not be fully accurate
Pipeline: The Deep Web → Database Crawler → Interface Extraction → Source Organization → Holistic Schema Matching

11 Example of errors in interface extraction
The correlation between (adult, children) and passenger is affected by a single extraction error!
[Figure: the AA.com query interface and the result of its extraction]

12 The impact of noise: Error cascade
Q: Errors are often a minority, so why do they cascade?
A: The technique of a semantics-related task, e.g., data integration, is often context-sensitive: constraints, heuristics, measures, parameters, procedures
Error cascade: a task with accuracy A_i (e.g., Interface Extraction) feeds a task with accuracy A_j (e.g., Holistic Schema Matching). Is the resulting accuracy A_j? Is it A_i * A_j?
A general solution
Sampling and voting techniques: The ensemble framework

13 The intuition of the ensemble idea
Sampling: a way to reduce noise in the input. A sample should
1) contain sufficient good schemas to mine matchings
2) contain less noise, to have more chance to sustain the holistic matcher
Voting: a single sampling may be biased, so let us repeat it multiple times and then vote
It is likely that the holistic matcher can be sustained in most samples

14 The ensemble framework for holistic schema matching
Input schemas
S1: author, title, subject, ISBN
S2: name, title, keyword, binding
S3: writer, title, category, format
Sampling: Multiple Sampling, from the 1st trial to the T-th trial
Holistic Schema Matching in each trial, e.g., author = name = writer, subject = category
Voting: Rank Aggregation over the results of all trials
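
The sampling and voting loop on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: `ensemble_match`, `matcher`, `toy_matcher`, `sample_size` and `num_trials` are hypothetical names, sampling without replacement is an assumption (the slides do not say how samples are drawn), and a plain majority vote ordered by vote count stands in for the rank aggregation step.

```python
import random
from collections import Counter


def ensemble_match(schemas, matcher, sample_size, num_trials, seed=0):
    """Run `matcher` on `num_trials` random samples and keep majority matchings.

    `matcher` is any callable that maps a list of schemas to a ranked list of
    matchings (strings here); it stands in for a holistic matcher such as DCM.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    trials = []
    for _ in range(num_trials):
        # Assumption: each trial samples schemas without replacement.
        sample = rng.sample(schemas, sample_size)
        trials.append(matcher(sample))

    # Count in how many trials each matching appears (at most once per trial).
    votes = Counter(m for ranked in trials for m in set(ranked))
    # Keep only matchings found by a strict majority of trials.
    majority = [m for m, v in votes.items() if v > num_trials / 2]
    # Simplification: order by vote count, then by name for determinism.
    return sorted(majority, key=lambda m: (-votes[m], m)), trials


def toy_matcher(sample):
    """Hypothetical stand-in matcher: one noisy schema spoils the whole sample."""
    if any(schema.get("noisy") for schema in sample):
        return ["author = ISBN"]
    return ["author = name", "subject = category"]


# Toy input: 10 schemas, one of which is flagged as noisy.
schemas = [{"id": i, "noisy": i == 0} for i in range(10)]
result, trials = ensemble_match(schemas, toy_matcher, sample_size=3, num_trials=25)
print(result)
```

Passing the matcher in as a callable keeps the sketch independent of any particular holistic matching technique; only the sampling and the vote are shown.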

15 How the ensemble framework works: An example
Results of Holistic Schema Matching in three trials
- 1. author = name 2. subject = category 3. author = ISBN
- 1. subject = category 2. author = ISBN 3. author = name
- 1. author = name 2. publisher = category 3. author = ISBN
Result after voting
- 1. author = name 2. subject = category 3. author = ISBN
Result of a single Holistic Schema Matching run
- 1. author = ISBN 2. publisher = category 3. author = name
Please refer to our paper for more formal analysis

16 The ensemble idea is inspired by bagging predictors
Bagging is used in machine learning to maintain the accuracy of a classifier in the presence of a biased distribution of input data
We are essentially applying bagging techniques in a new scenario of schema matching
However, we are different in
- setting: supervised vs. unsupervised
- technique: sampling and voting techniques
- analytic model: our modeling is specific to matching

17 Configuration of multiple sampling
The configuration dilemma
- Sample size S
  If S is too small, the sampled data may not be sufficiently representative
  If S is too large, the sampled data may contain too much noise
- Number of trials T
  If T is too small, the voting result may not be sufficiently convincing
  If T is too large, more execution time is needed
Two ways to choose S and T
- S → T: first choose an S, then derive an appropriate T
- T → S: first choose a T, then derive an appropriate S
- T → S is better than S → T, since the accuracy is very sensitive to S, not T
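
The dilemma can be made concrete with a deliberately simple probability model. This is an illustrative sketch only and not the analysis in the paper: it assumes that a trial succeeds exactly when its sample contains no noisy schema, that samples are drawn without replacement, and that trials are independent. The names `p_clean_sample` and `p_majority` and the numbers (50 schemas, 3 of them noisy, T = 21) are hypothetical. The model captures only the "S too large" side of the trade-off; it says nothing about a small sample being unrepresentative.

```python
from math import comb


def p_clean_sample(n_schemas, n_noisy, sample_size):
    """Probability that a sample drawn without replacement has no noisy schema."""
    return comb(n_schemas - n_noisy, sample_size) / comb(n_schemas, sample_size)


def p_majority(p_trial, num_trials):
    """Probability that more than half of independent trials succeed."""
    need = num_trials // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(num_trials, k) * p_trial**k * (1 - p_trial) ** (num_trials - k)
        for k in range(need, num_trials + 1)
    )


# Hypothetical setting: 50 schemas, 3 noisy, T fixed first (the T -> S order).
T = 21
for S in (5, 10, 20, 30):
    p = p_clean_sample(50, 3, S)
    print(S, round(p, 3), round(p_majority(p, T), 3))
```

Under these assumptions the per-trial success probability falls as S grows, and the vote amplifies whichever side of one half it lands on, which is one way to read the remark that accuracy is very sensitive to S.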

18 Aggregating matchings from all trials: Enforcing the majority matching results
Each trial outputs a ranked list of matchings
Voting thus aggregates a set of ranked lists into a single ranked list R, which reflects the ranking results of the majority
- Candidate selection: If the majority of trials do not find a matching M, M is not considered a correct matching and thus does not appear in R
- Ranking aggregation: If the majority of trials rank M1 higher than M2, we would also like to rank M1 higher than M2 in R

19 An example of voting
T1: 1. author = name 2. subject = category 3. author = ISBN
T2: 1. subject = category 2. author = ISBN 3. author = name
T3: 1. author = name 2. publisher = category 3. author = ISBN
All Matchings: M1. author = name, M2. subject = category, M3. author = ISBN, M4. publisher = category
Candidate Selection: M1. author = name, M2. subject = category, M3. author = ISBN (M4. publisher = category is found in only one of the three trials and is dropped)
Rank Aggregation: Borda’s aggregation: B(Mi) = Σ_j (rank of Mi in Tj)
B(M1) = 1 + 3 + 1 = 5, B(M2) = 2 + 1 + 3 = 6, B(M3) = 3 + 2 + 2 = 7
Rank matchings according to B(Mi): M1. author = name, M2. subject = category, M3. author = ISBN
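
The voting example can be reproduced with a short Python sketch. The function names `select_candidates` and `borda_aggregate` are hypothetical, and two details are assumptions inferred from the arithmetic on the slide rather than stated there: ranks are recomputed in each trial after the non-candidate matchings are dropped, and a candidate missing from a trial receives the rank just after that trial's last remaining matching (this is what yields the 3 that T3 contributes to B(M2) and the 2 it contributes to B(M3)).

```python
from collections import Counter


def select_candidates(trials):
    """Keep the matchings found by a strict majority of the trials."""
    counts = Counter(m for ranked in trials for m in set(ranked))
    return {m for m, c in counts.items() if c > len(trials) / 2}


def borda_aggregate(trials):
    """Return (ranking, scores), where a lower Borda score means a better rank."""
    candidates = select_candidates(trials)
    scores = {m: 0 for m in candidates}
    for ranked in trials:
        # Assumption: re-rank each trial after dropping non-candidates.
        kept = [m for m in ranked if m in candidates]
        # Assumption: a candidate absent from this trial gets the next rank.
        missing_rank = len(kept) + 1
        for m in candidates:
            scores[m] += kept.index(m) + 1 if m in kept else missing_rank
    # Sort by score; break ties by name so the output is deterministic.
    ranking = sorted(candidates, key=lambda m: (scores[m], m))
    return ranking, scores


# The three trials from the slide.
t1 = ["author = name", "subject = category", "author = ISBN"]
t2 = ["subject = category", "author = ISBN", "author = name"]
t3 = ["author = name", "publisher = category", "author = ISBN"]

ranking, scores = borda_aggregate([t1, t2, t3])
print(ranking)  # author = name, subject = category, author = ISBN
print(scores)   # Borda scores 5, 6 and 7, as on the slide
```

Other conventions for missing matchings are possible and would change the scores; the one used here was chosen only because it matches the slide's numbers.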

20 Experimental setup
Subsystems integration scenario
- Interface Extraction + Holistic Schema Matching
- Interface Extractor [SIGMOD’04]
- The DCM Matcher [KDD’04]
Datasets
- Two representative domains in the TEL-8 dataset in UIUC Web Integration Repository: Books and Airfares
- http://metaquerier.cs.uiuc.edu/repository/

21 Experimental result: Baseline vs. Ensemble
[Figure panels: (a) Precision of Books, (b) Precision of Airfares, (c) Recall of Books, (d) Recall of Airfares]
Baseline approach, noisy input (P, R)
- Books: 0.73, 0.75
- Airfares: 0.67, 0.68
Ensemble approach, average accuracy (P, R)
- Books: 0.83, 0.89
- Airfares: 0.79
Ensemble approach, most frequent accuracy (P, R)
- Books: 0.9, 1.0
- Airfares: 0.71, 0.82

22 Experimental result: Outliers vs. Missing Data
[Figure panels: (a) Precision of Books, (b) Precision of Airfares, (c) Recall of Books, (d) Recall of Airfares]
Upper bound exists
Two types of data quality problems
- Outliers (noises)
- Missing data
Outliers
- data ideally should not be observed, but observed
- can be solved by the ensemble approach
Missing data
- data ideally should be observed, but not
- cannot be solved by the ensemble approach

23 Contributions
Problem
- noisy data quality is an inherent challenge for large scale schema matching
- critical for sustaining holistic schema matching as a practical and viable technique
Solution
- an ensemble framework with sampling and voting techniques, inspired by bagging predictors
- we are essentially applying bagging techniques in a new scenario of schema matching

24 Thank You!

