
1 Putting Context into Schema Matching Philip Bohannon* Yahoo! Research Eiman Elnahrawy* Rutgers University Wenfei Fan Univ of Edinburgh and Bell Labs Michael Flaster* Google *Work performed at Lucent Technologies -- Bell Laboratories.

2 Slide 2 Overview 1. Motivation 2. Background 3. Strawman 4. Framework 5. Experimental Evaluation 6. Related Work 7. Conclusions

3 Slide 3 Schema Matching vs. Schema Mapping. Source schema R_S: Person(First, Last, City); target schema R_T: Student(Name, Address, City). Arrows between schema elements are inferred based on meta-data or sample instance data, each with an associated confidence score. Meaning (a variant of): R_S.Person.City → R_T.Student.City. Schema matching means computer-suggested arrows.

4 Slide 4 Schema Mapping: From Arrows to Queries. Given a set of arrows and user input, produce a query Q: R_S -> R_T that maps instances of R_S into instances of R_T. Transformations and joins [Miller, Haas, Hernandez, VLDB 2002] are added by, or with help from, the user. Example: select concat(First, ' ', Last) as Name, City as City from R_S.Person, R_S.Education, … where … Most of this talk is about matching; some implications for mapping come later.

5 Slide 5 Motivation: inventory mapping example R S.inv id: integer name: string code: string type: integer instock: string descr: string arrival: date R T.book title: string isbn: string price: float format: string R T.music title: string asin: string price: float label: string sale: float Consider integrating two inventory schemas Books, music in separate tables in R T Run some nice schema match software

6 Slide 6 Inventory where clause. The lines are helpful (schema matching is a best-effort affair), but… the lines are semantically correct only in the context of a selection condition: where type = 1, or where type = 2.

7 Slide 7 Definition and Goals. Contextual schema match: an arrow between source and target schema elements, annotated with a selection condition –In a standard schema match, the condition true is always used. Goal: adapt instance-driven schema matching techniques to infer semantically valid contextual schema matches, and create schema maps from those matches. Diagram: a standard match R_S.aa → R_T.bb [true] versus a contextual match R_S.aa → R_T.bb [R_S.c=3].
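To make the annotation concrete, a contextual match can be represented as a small record. This is an illustrative sketch in Python, not the paper's implementation; the field names are my own:

```python
from dataclasses import dataclass

@dataclass
class ContextualMatch:
    source: str      # source schema element, e.g. "RS.aa"
    target: str      # target schema element, e.g. "RT.bb"
    condition: str   # selection condition; "true" for a standard match
    score: float     # matcher confidence

# A standard match is the special case whose condition is the constant true:
standard = ContextualMatch("RS.aa", "RT.bb", "true", 0.90)
contextual = ContextualMatch("RS.aa", "RT.bb", "RS.c=3", 0.95)
```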

8 Slide 8 Attribute promotion example. Consider integrating data about grade assignments [Fletcher, Wyss, SIGMOD 2005 demo]. Again context is needed, but the semantics are slightly different: attribute promotion. Figure: a source table (Name, Assgn, Grade) is pivoted into a target table (Name, Grade1, Grade2, Grade3, …); Grade maps to Grade1 where Assgn=1, to Grade2 where Assgn=2, to Grade3 where Assgn=3, and so on.

9 Slide 9 Overview 1. Motivation 2. Background 3. Strawman 4. Framework 5. Experimental Evaluation 6. Related Work 7. Conclusion

10 Slide 10 Background: Instance-level matching. Two candidate arrows, illustrated with sample instances. R_S.ac → R_T1.bb: city names on both sides (San Jose, Cupertino, Palo Alto, Gilroy, Pleasanton against Sunnyvale, Los Angeles, Cupertino, Gilroy, San Diego). Nice match! R_S.ac → R_T1.ac: area codes ((408), (212), (408), (408)) against city names (Sunnyvale, Los Angeles, Cupertino, Gilroy, San Diego). Dubious, at best!

11 Slide 11 Background: Instance-level matching (cont'd). One arrow looks like a perfect match; the other is dubious, at best. Matchers include Bayesian tri-gram, type expert, string edit distance, cosine similarity, and whatever else. Coming up with a good score is far from simple! Scores must be made comparable across sample sizes, data types, etc.

12 Slide 12 StandardMatch(R_S, R_T, τ). 1. Consider all |R_S| × |R_T| candidate matches (R_S.ac → R_T1.ac, R_S.ba → R_T1.cd, R_S.db → R_T1.ar, …), score them, and normalize the scores. 2. Rank by normalized score. 3. Apply τ as a cutoff, and return.

13 Slide 13 Background: Categorical Attributes. What attributes are candidates for the where clause? We focus on categorical attributes (leaving non-categorical attributes as future work). If not identified by the schema, infer from sample data, as any attribute with –more than 1 value –most values associated with more than one tuple. Example: R_S.inv(id: integer, name: string, code: string, type: integer, instock: string, descr: string, arrival: date).
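The inference rule above can be sketched as follows; the "most values" cutoff (here 0.5) is an assumed parameter, since the slides do not give a specific threshold:

```python
from collections import Counter

def is_categorical(values, share_threshold=0.5):
    """True if the attribute has more than one distinct value and most
    values are associated with more than one tuple (the rule above)."""
    counts = Counter(values)
    if len(counts) <= 1:
        return False                     # must take more than one value
    shared = sum(1 for c in counts.values() if c > 1)
    return shared / len(counts) > share_threshold

# 'type' in the inventory sample: few values, each repeated
print(is_categorical([1, 2, 1, 1, 2]))                           # True
# 'name': every value unique, so not categorical
print(is_categorical(["leaves of grass", "wasteland", "dune"]))  # False
```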

14 Slide 14 Overview 1. Motivation 2. Background 3. Strawman 4. Framework 5. Experimental Evaluation 6. Related Work 7. Conclusion

15 Slide 15 Strawman Algorithm. 1. Use an instance-based matching algorithm to compute a set of matches L = M_1..M_n, along with associated scores. 2. For each M_i in L, of the form (R_S.s, R_T.t, true): for each categorical attribute c in the source (or target), and for each value v taken by c in the sample, restrict the sample of R_S to tuples where c = v and re-compute the match score on the new sample. 3. For the (c, v) that most improves the score, replace M_i with (R_S.s, R_T.t, c=v).
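The three steps above can be sketched as a loop; score_match stands in for whatever instance-based scorer is plugged in, and nothing here is the paper's actual code:

```python
def strawman(matches, sample, categorical_attrs, score_match):
    """matches: (src, tgt, base_score) triples; sample: list of dict rows."""
    refined = []
    for src, tgt, base_score in matches:
        best = (src, tgt, "true", base_score)
        for c in categorical_attrs:
            for v in sorted({row[c] for row in sample}):
                subset = [row for row in sample if row[c] == v]  # restrict sample
                s = score_match(src, tgt, subset)                # re-compute score
                if s > best[3]:
                    best = (src, tgt, f"{c}={v}", s)             # contextual match
        refined.append(best)
    return refined
```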

16 Slide 16 ContextMatch(R_S, R_T, τ). Steps 1-3 are StandardMatch: score, rank by normalized score, apply τ as a cutoff. 4. Try each context condition, e.g. R_S.c = 2, R_S.d = 'open', R_S.c = 2 or R_S.c = 3, R_S.t = 0, R_S.t = 1. 5. Evaluate the quality of the resulting match, e.g. R_S.ba → R_T1.cd [R_S.t=1]. 6. Keep the best!

17 Slide 17 Problems with Strawman. False positives –the increase in the score may not be meaningful, since some random subsets of the corpus will match better than the whole (even with size-adjusted metrics). False negatives –the original matching algorithm only returned matches with quality above some threshold to be in L, but a match that didn't make the cut may improve greatly with contextual matching. Time –with disjuncts, there are too many expressions to test.

18 Slide 18 Strawman 2.0 Like Strawman, but require an improvement threshold, w, to cut down on false positives Not much better Setting w is problematic, as matcher scores are not perfect

19 Slide 19 Overview 1. Motivation 2. Background 3. Strawman 4. Framework 5. Experimental Evaluation 6. Related Work 7. Conclusion

20 Slide 20 Our approach: R S.inv id: integer name: string code: string type: integer instock: string descr: string arrival: date R T.book title: string isbn: string price: float format: string R T.music title: string asin: string price: float label: string sale: float 1. Pre-filter conditions based on classification 2. Find conditions that improve several matches from the same table

21 Slide 21 View-oriented contextual mapping (cont'd). Figure: the source table R_S.inv (id, name, code, type, instock, descr, arrival) is split into two selection views, R_S.inv where type = 1 and R_S.inv where type = 2, which are then matched against the targets R_T.book and R_T.music.

22 Slide 22 Algorithm ContextMatch(R_S, R_T, τ)
  L := ∅;
  M := StandardMatch(R_S, R_T, τ);
  C := InferCandidateViews(R_S, M, EarlyDisjuncts);
  for c ∈ C do
    V_c := select * from R_S where c;
    for m ∈ M do
      m' := m with R_S replaced by V_c;
      s := ScoreMatch(m');
      L := L ∪ {(m', s)};
  return SelectContextualMatches(M, L, EarlyDisjuncts)
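A hedged Python rendering of the pseudocode above; the plugged-in components (the candidate conditions from InferCandidateViews, score_match, select_contextual_matches) are caller-supplied stand-ins for the black boxes in the algorithm:

```python
def context_match(RS_sample, matches, candidate_conditions,
                  score_match, select_contextual_matches,
                  early_disjuncts=True):
    """RS_sample: list of dict rows; matches: output of StandardMatch;
    candidate_conditions: (name, predicate) pairs from InferCandidateViews."""
    L = []                                               # L := empty set
    for cond_name, cond in candidate_conditions:
        view = [row for row in RS_sample if cond(row)]   # select * from RS where cond
        for src, tgt, _cond, _score in matches:
            s = score_match(src, tgt, view)              # re-score m against the view
            L.append(((src, tgt, cond_name), s))
    return select_contextual_matches(matches, L, early_disjuncts)
```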

23 Slide 23 ContextMatch(R_S, R_T, τ), continued. After StandardMatch (rank by normalized score, apply τ as a cutoff), InferCandidateViews proposes conditions such as R_S.c = 2, R_S.d = 'open', R_S.c = 2 or R_S.c = 3, R_S.t = 0, R_S.t = 1. 4. For each candidate view V, re-compute summaries for V as: select * from R_S where R_S.t = 1. 5. Evaluate the quality of the matches, e.g. R_S.ba → R_T1.cd [R_S.t=1].

24 Slide 24 How to Filter Candidate Views Naïve –Any Boolean condition involving a categorical attribute (strawman approach) SourceClassifier, TargetClassifier –Check for categorical attributes that do a good job categorizing other attributes Disjunct Handling (early or late) Conjunct Handling

25 Slide 25 Source Classifier Intuition: how well do the categorical attributes serve as classifier labels for the other attributes? Sample of R_S.inv:
id | name              | type | instock | code    | descr
0  | leaves of grass   | 1    | y       |         | hardcover
1  | the white album   | 2    | y       | B002UAX | audio cd
2  | heart of darkness | 1    | n       |         | paperback
3  | wasteland         | 1    | y       | 039995  | paperback
4  | hotel california  | 2    | n       | B002GVO | electra

26 Slide 26 Source Classifier Intuition: type. How about type? (Same sample table as the previous slide.)

27 Slide 27 Source Classifier Intuition: instock. How about instock? (Same sample table as the previous slide.)

28 Slide 28 What do we really mean by a good job? Split the sample into a training set and a testing set (randomly). For each categorical attribute C and non-categorical attribute A: –Train a classifier H by treating the value of A as the document and the value of C as the label –Test H against the test set, determining precision p and recall r –Score(C) w.r.t. A is based on a combination of precision and recall (F = 2pr/(p+r)) –Compare Score(C) to Score(NC), where NC is a naïve classifier that always chooses the most frequent label –C does a good job with H if H's improvement over the naïve classifier is statistically significant with 95% confidence
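The scoring step can be sketched directly from the formula above; the 95%-confidence significance test is elided, so beats_naive below only shows the F-measure comparison:

```python
def f_measure(precision, recall):
    # F = 2pr / (p + r), as on the slide
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def beats_naive(p, r, p_naive, r_naive):
    # Real criterion: improvement must be statistically significant at
    # 95% confidence; that test is elided in this sketch.
    return f_measure(p, r) > f_measure(p_naive, r_naive)

print(round(f_measure(0.8, 0.6), 3))    # 0.686
print(beats_naive(0.8, 0.6, 0.5, 0.5))  # True
```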

29 Slide 29 Target Classifier Intuition. Train a new classifier, T, treating each target schema attribute (e.g. Book.comment, Music.label) as a class of documents. Check source values against this classifier, label each value with the best-guess label, and use labels instead of values in the same framework. (Same sample table as before.)

30 Slide 30 Handling Disjunctive Conditions Why Disjuncts? What if type field had separate categories for hardback and paperback? Two approaches to handling disjunctive conditions, early and late Early Disjuncts –InferCandidateViews is responsible for identifying interesting disjuncts –Each interesting disjunct is evaluated separately, no overlapping conditions are output Late Disjuncts – InferCandidateViews returns no disjuncts –All high-scoring conditions are unioned together (Clio semantics), effectively creating a disjunct

31 Slide 31 Early Disjuncts: A Heuristic Approach. When evaluating the trained classifier on the test set for some categorical attribute C, make note of misclassifications of the form "should be A, but guessed B". Consider merging the (A, B) pair that would repair the most errors –by merge, we mean replace A and B values with (A,B). Re-evaluate, and repeat. Keep all alternatives formed this way that score well, but only accept 1 view that mentions attribute C (don't union).
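One round of the merge heuristic can be sketched as counting confusion pairs; representing the merged pair as a sorted tuple is my own choice, not the paper's:

```python
from collections import Counter

def best_merge(misclassifications):
    """misclassifications: (should-be, guessed) label pairs from the test set.
    Returns the unordered (A, B) pair whose merge would repair the most errors,
    or None if there were no errors."""
    pairs = Counter(frozenset(p) for p in misclassifications if p[0] != p[1])
    if not pairs:
        return None
    merged, _count = pairs.most_common(1)[0]   # most frequent confusion pair
    return tuple(sorted(merged))

# e.g. a type split into cd1/cd2: cd1 rows keep being guessed as cd2
print(best_merge([("cd1", "cd2"), ("cd1", "cd2"), ("book1", "cd2")]))  # ('cd1', 'cd2')
```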

32 Slide 32 Handling Conjuncts. Proposed approach: –Assume that a good conjunctive view has a good disjunctive view as one of the terms in the conjunct –Run ContextMatch repeatedly: at stage i, consider the views V_C identified by the previous, (i-1)-th, run as the input base tables, where C was the selection condition defining the view –When considering candidate attributes for a run, only consider categorical attributes not in C. (Conjunct handling is not in the current experiments.)

33 Slide 33 Selecting Contextual Matches. Each view V based on condition c is evaluated, rather than each match. Compute the overall confidence of matches from V, and compare to the overall confidence from the base table. If the overall confidence improves by more than w, use V instead of the base table. If more than one view qualifies: –If EarlyDisjunct, choose the best –Else, take all that are over w

34 Slide 34 Comments on Schema Mapping Seek to apply the Clio ( [Popa et al, VLDB 2002] ) approach to mapping construction Create logical tables based on key-foreign key constraints Two challenges –Extend notion of foreign-key constraints in context of selection views, undecidability result –Extend join rules of [Popa et al, VLDB 2002] to handle the selection views See paper for details

35 Slide 35 Overview 1. Motivation 2. Background 3. Strawman 4. Framework 5. Experimental Evaluation 6. Related Work 7. Conclusion

36 Slide 36 Experimental Study. Used schemas from the retail domain –schemas created by students at UW (Aaron, Ryan, Barrett) –Populated code and descr info by scraping web sites; used some name data from the Illinois Semantic Integration Archive. ItemType is split, so that instead of just CD, BOOK there are e.g. CD1, CD2, BOOK1, BOOK2 (4 categories). Compare matched edges to correct edges –Accuracy: how many of the BOOK_i edges go to the book target table? –Precision: of the BOOK_i edges, how many go to the book target? –Fmeas: 2(Accuracy * Precision)/(Accuracy + Precision)

37 Slide 37 View improvement threshold w. Figure: results on the Aaron, Barrett, and Ryan schemas. How sensitive is the technique to w? It depends on the disjunct strategy; it is easier to pick w with EarlyDisjunct.

38 Slide 38 Strawman Strawman means –Late disjunct (EarlyDisjunct=false) –Pick best arrow from each source attribute on per-attribute basis (MultiTable)

39 Slide 39 Sensitivity to Decoy Categorical Attributes. Figure panels: EarlyDisjunct and LateDisjunct. Add 3 extra categorical attributes and vary their correlation with ItemType (higher correlation makes it harder). Naïve is not only slow, it is overly confusing to the quality metrics; the EarlyDisjunct heuristic based on classification helps with quality.

40 Slide 40 Varying schema size Add n non-categorical attributes to every table, all taken from same domain Add n/4 categorical attributes to tables with categorical attributes Early dip is before non-categorical attributes match each other

41 Slide 41 Runtime as schema gets larger Same experiment, compare runtimes TgtClass is somewhat higher quality (not shown), but takes much longer for large schemas

42 Slide 42 Grades Example. Create an experiment based on the grades example, with artificial data: the mean of assignment i is (i-1) (as grades improve), and the standard deviation is varied. Figure: the (Name, Assgn, Grade) table is again pivoted into (Name, Grade1, Grade2, Grade3, …) under conditions where Assgn=1, Assgn=2, Assgn=3, …

43 Slide 43 Grades accuracy as std. dev increases

44 Slide 44 Overview 1. Motivation 2. Background 3. Strawman 4. Framework 5. Experimental Evaluation 6. Related Work 7. Conclusion

45 Slide 45 Related Work. Instance-level schema matching –Survey [Rahm, Bernstein, VLDB Journal 2001], Coma [Do, Rahm, VLDB 02], Coma++ [SIGMOD 05], iMAP [Doan et al, SIGMOD 01], Cupid [Madhavan, Bernstein, Rahm, VLDB 01], etc. Schema mapping –Clio [Popa et al, VLDB 02], [Haas et al, SIGMOD 2005], etc. –Model Management (many papers). Overcoming heterogeneity during the match process –Schema Mapping as Query Discovery [Miller, Haas, Hernandez, VLDB 2000]: present the user with examples to derive join conditions –MIQIS [Fletcher, Wyss, (demo) SIGMOD 2005]: searches through a large space of schema transformations (beyond what is given here), but requires the same data to appear in both source and target. We focus on inferring selection views only, but are very compatible with existing schema match work.

46 Slide 46 Conclusions. Contributions –Introduced contextual matching as an important extension to schema matching –Defined a general framework in which the instance-level match technique is treated as a black box –Identified two techniques based on classification to find good conditions –Identified filtering criteria for contextual matches –Defined contextual foreign keys and new join rules to extend a Clio-style schema mapper to better handle contextual matches –Experimental study illustrating time/quality tradeoffs. Future Work –More complex view conditioning (horizontal partitioning + attribute promotion) –Consider taking constraints on the target into account in quality functions

47 The End Thank you, any questions?


49 Slide 49 Standard Match Algorithm. StandardMatch(R_S, R_T, τ) –Evaluate the quality of the match between all pairs of (source, target) attributes, ignoring complex (multi-attribute) matches for simplicity –Return matches between the source table R_S and target schema R_T that have confidence >= τ


51 Slide 51 Background: Instance-level matching. Instance-level schema matching requires sample data for the source and target schemas. Train a variety of matchers by treating each (source, target) column as a set of documents labeled by the column name –e.g. text matchers based on string similarity, token similarity, format similarity, number of tokens, etc., or –numeric matchers based on value distribution, etc. Apply source matchers to sample target data, and vice versa. Combine the resulting scores (with machine-learned weightings [Doan, Domingos, Halevy, SIGMOD 2001]) to score each arrow, e.g. R_S.ac → R_T1.bb.
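The score-combination step can be sketched as a weighted sum; the weights below are illustrative stand-ins for machine-learned values, not numbers from any real training run:

```python
def combined_score(matcher_scores, weights):
    """Weighted combination of per-matcher scores for one candidate arrow."""
    return sum(w * s for w, s in zip(weights, matcher_scores))

# e.g. tri-gram, type expert, edit distance, cosine similarity matchers:
print(combined_score([0.9, 0.7, 0.4, 0.8], [0.4, 0.2, 0.1, 0.3]))  # ≈ 0.78
```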


