
1 Pedro Domingos
Joint work with AnHai Doan & Alon Levy
Department of Computer Science & Engineering, University of Washington
Data Integration: A “Killer App” for Multi-Strategy Learning

2 Overview
Data integration & XML
Schema matching
Multi-strategy learning
Prototype system & experiments
Related work
Future work
Summary

3 Data Integration
Query: Find houses with four bathrooms and price under $500,000
[Diagram: the query is posed against a mediated schema; the sources superhomes.com, realestate.com and homeseekers.com each have their own source schema and are reached through a wrapper]

4 Why Data Integration Matters
Very active area in database & AI
–research / workshops
–start-ups
Large organizations
–multiple databases with differing schemas
Data warehousing
The Web: HTML sources
The Web: XML sources

5 XML
Extensible Markup Language
–introduced in 1996
The standard for data publishing & exchange
–replaces HTML & proprietary formats
–embraced by database/web/e-commerce communities
XML versus HTML
–both use tags to mark up data elements
–HTML tags specify format
–XML tags define meaning
–relationships among elements provided via nesting

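6 Example
HTML (markup lost in transcript; the labels are part of the page text):
Residential Listings
House For Sale
location: Seattle, WA, USA
agent-phone: (206) 729 0831
listed-price: $250,000
comments: Fantastic house...
House For Sale
...
XML (tags lost in transcript; only the element values remain):
Seattle WA USA
(206) 729 0831
$250,000
Fantastic house...
...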
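
To illustrate the point that XML tags define meaning, here is a minimal Python sketch. The element names location, agent-phone, listed-price and comments appear in the slides; the wrapper element house-listing and the flat nesting are assumptions, since the original markup did not survive in the transcript.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing: tag names follow the slides, but the wrapper
# element <house-listing> and the flat nesting are assumptions.
LISTING = """
<house-listing>
  <location>Seattle, WA, USA</location>
  <agent-phone>(206) 729 0831</agent-phone>
  <listed-price>$250,000</listed-price>
  <comments>Fantastic house</comments>
</house-listing>
"""

def listing_to_dict(xml_text):
    """Map each child tag (the meaning) to its text (the data value)."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: child.text.strip() for child in root}

record = listing_to_dict(LISTING)
print(record["listed-price"])
```

Because each value is labeled by its tag, a program can pick out the price without guessing from layout, which is what an HTML page would force it to do.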

7 XML DTD
Document Type Definition
–BNF grammar
–constraints on element structure: type, order, # of times
A DTD can be visualized as a tree
[Figure: a real-estate DTD]

8 Semantic Mappings between Schemas
Mediated & source schemas = XML DTDs
[Figure: two schema trees rooted at house; element names shown: location, contact-info, address, agent-name, agent-phone, num-baths, amenities, full-baths, half-baths, handicap-equipped, contact, name, phone]

9 Map of the Problem
[Diagram labels: source descriptions; schema matching, data translation, scope, completeness, reliability, query capability; leaf elements, higher-level elements; 1-1 mappings, complex mappings]

10 Current State of Affairs
Largely done by hand
–labor intensive & error prone
–key bottleneck in building applications
Will only be exacerbated
–data sharing & XML become pervasive
–proliferation of DTDs
–translation of legacy data
Need automatic approaches to scale up!

11 Our Approach
Use machine learning to match schemas
Basic idea
1. create training data
–manually map a set of sources to mediated schema
2. train system on training data
–learns from: name of schema elements; format of values; frequency of words & symbols; characteristics of value distribution; proximity, position, structure, ...
3. system proposes mappings for subsequent sources

12 Example
Source: realestate.com
Mediated schema: address, phone, price, description
Sample listing: Seattle, WA | (206) 729 0831 | $250,000 | Fantastic house...
Values by source element:
location: Seattle, WA; Dallas, TX; ...
listed-price: $250,000; $162,000; $180,000; ...
agent-phone: (206) 729 0831; (206) 321 4571; (214) 722 4035; ...
comments: Fantastic house...; Great...; Hurry!...; ...

13 Multi-Strategy Learning
Use a set of base learners
–each exploits certain types of information
Match schema elements of a new source
–apply the learners
–combine their predictions using a meta-learner
Meta-learner
–measures base learner accuracy on training data
–weighs each learner based on its accuracy
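
As a rough illustration of "weighs each learner based on its accuracy", the sketch below scores each base learner by the fraction of training examples it labels correctly. The function name, the learner names and the guesses are hypothetical; the slides do not give the actual weighting formula.

```python
def accuracy_weights(guesses_by_learner, true_labels):
    """Weight each learner by its accuracy on labeled training examples."""
    weights = {}
    for learner, guesses in guesses_by_learner.items():
        correct = sum(g == t for g, t in zip(guesses, true_labels))
        weights[learner] = correct / len(true_labels)
    return weights

# Hypothetical training outcome: four examples with known mediated-schema labels.
true_labels = ["address", "phone", "price", "description"]
guesses_by_learner = {
    "name_matcher": ["address", "phone", "price", "address"],
    "naive_bayes": ["description", "phone", "description", "description"],
}

weights = accuracy_weights(guesses_by_learner, true_labels)
print(weights)
```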

14 Learners
Input
–schema information: name, proximity, structure, ...
–data information: value, format, ...
Output
–prediction weighted by confidence score
Example learners
–name matcher: agent-name => (name, 0.7), (phone, 0.3)
–Naive Bayes: “Seattle, WA” => (address, 0.8), (name, 0.2); “Great location...” => (description, 0.9), (address, 0.1)

15 Training the Learners
Source: realestate.com (elements: location, listed-price, agent-phone, comments)
Mediated schema: address, phone, price, description
Sample listing: Seattle, WA | (206) 729 0831 | $ 250,000 | Fantastic house...
Name Matcher training examples: (location, address), (agent-phone, phone), (listed-price, price), (comments, description), ...
Naive Bayes training examples: (“Seattle, WA”, address), (“(206) 729 0831”, phone), (“$ 250,000”, price), (“Fantastic house...”, description), ...
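
The training pairs on this slide can be derived mechanically from one manually mapped source. The sketch below assumes the listings are available as dictionaries keyed by source element; that data layout and the function name are assumptions, not part of the slides.

```python
# Manual mapping for realestate.com, as shown on the slide:
# source element -> mediated-schema element.
mapping = {
    "location": "address",
    "agent-phone": "phone",
    "listed-price": "price",
    "comments": "description",
}

# One extracted listing (values from the slide), keyed by source element.
listings = [
    {
        "location": "Seattle, WA",
        "agent-phone": "(206) 729 0831",
        "listed-price": "$ 250,000",
        "comments": "Fantastic house...",
    },
]

def make_training_sets(listings, mapping):
    """Build (name, label) pairs for the name matcher and
    (value, label) pairs for content-based learners such as Naive Bayes."""
    name_examples = list(mapping.items())
    value_examples = [
        (value, mapping[tag]) for row in listings for tag, value in row.items()
    ]
    return name_examples, value_examples

name_examples, value_examples = make_training_sets(listings, mapping)
print(name_examples)
print(value_examples)
```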

16 Applying the Learned Models
Source: homes.com
Mediated schema: address, phone, price, description
Source element area with values: Seattle, WA; Kent, WA; Austin, TX; Seattle, WA
[Diagram: for each value, Name Matcher and Naive Bayes predictions feed the Meta-learner, which outputs address, description, address; the Combiner then outputs address for the element]
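
The final Combiner step turns per-value predictions into one label for the whole source element. The slide does not say how; the sketch below assumes simple averaging of per-value scores, and the scores themselves are made up for illustration.

```python
from collections import defaultdict

def combine_instances(instance_predictions):
    """Average per-value score dictionaries and pick the best label.
    Averaging is an assumption; the slide only names a 'Combiner'."""
    totals = defaultdict(float)
    for prediction in instance_predictions:
        for label, score in prediction.items():
            totals[label] += score
    n = len(instance_predictions)
    averaged = {label: total / n for label, total in totals.items()}
    best = max(averaged, key=averaged.get)
    return best, averaged

# Hypothetical meta-learner outputs for three values of the element "area".
predictions = [
    {"address": 0.8, "description": 0.2},
    {"address": 0.4, "description": 0.6},
    {"address": 0.7, "description": 0.3},
]

label, scores = combine_instances(predictions)
print(label, scores)
```

Here the second value on its own would be labeled description, but the element as a whole is still matched to address, mirroring the address, description, address pattern in the diagram.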

17 The LSD System
Base learners/modules
–name matcher
–Naive Bayes
–WHIRL nearest-neighbor classifier [Cohen&Hirsh-KDD98]
–county-name recognizer
Meta-learner
–stacking [Ting&Witten99, Wolpert92]

18 Name Matcher
Matches based on names
–including all names on path from root to current node
–allowing synonyms
Good for...
–specific, descriptive names: agent-phone, listed-price
Bad for...
–vacuous names: item, listings
–partially specified, ambiguous names: office (for “office phone”)
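
A name matcher along these lines could be sketched as follows. Token overlap (Jaccard similarity) is a stand-in technique, not necessarily what LSD uses, and the synonym table and example paths are hypothetical.

```python
import re

# Hypothetical synonym table mapping source vocabulary to mediated vocabulary.
SYNONYMS = {"location": "address", "comments": "description"}

def tokens(path):
    """Split a root-to-node path such as 'house/agent-phone' into canonical tokens."""
    raw = re.split(r"[^a-z0-9]+", path.lower())
    return {SYNONYMS.get(t, t) for t in raw if t}

def name_score(source_path, mediated_name):
    """Jaccard overlap between the two token sets."""
    a, b = tokens(source_path), tokens(mediated_name)
    return len(a & b) / len(a | b) if a | b else 0.0

def match_name(source_path, mediated_names):
    """Return confidence scores, normalized to sum to 1 when any name matches."""
    scores = {m: name_score(source_path, m) for m in mediated_names}
    total = sum(scores.values())
    if total == 0:
        return scores
    return {m: s / total for m, s in scores.items()}

MEDIATED = ["address", "phone", "price", "description"]
descriptive = match_name("house/agent-phone", MEDIATED)
vacuous = match_name("listings/item", MEDIATED)
print(descriptive)
print(vacuous)
```

The descriptive name agent-phone gets a clear winner, while the vacuous path listings/item matches nothing, which is the weakness the slide points out.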

19 Naive Bayes Learner
Exploits frequencies of words & symbols
Good for...
–elements with words/symbols that are strongly indicative
–examples: “fantastic” & “great” in house descriptions; $ in prices, parentheses in phone numbers
Bad for...
–short, numeric elements: num-baths, num-bedrooms
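
To show how word and symbol frequencies can drive a prediction, here is a small multinomial Naive Bayes sketch with add-one smoothing. The tokenizer, the class design and the tiny training set (values taken from the example slides) are assumptions for illustration only.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Words, digit runs, and each punctuation symbol become separate tokens."""
    return re.findall(r"[a-z]+|\d+|[^\w\s]", text.lower())

class NaiveBayes:
    def __init__(self):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.label_counts[label] += 1
            for tok in tokenize(text):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        """Return a normalized confidence score for every known label."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            # Add-one (Laplace) smoothing over the shared vocabulary.
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            logp = math.log(count / total)
            for tok in tokenize(text):
                logp += math.log((self.token_counts[label][tok] + 1) / denom)
            log_scores[label] = logp
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}

nb = NaiveBayes()
nb.train([
    ("Seattle, WA", "address"), ("Dallas, TX", "address"),
    ("(206) 729 0831", "phone"), ("(214) 722 4035", "phone"),
    ("$250,000", "price"), ("$162,000", "price"),
    ("Fantastic house", "description"), ("Great location", "description"),
])
price_scores = nb.predict("$180,000")
phone_scores = nb.predict("(206) 321 4571")
print(max(price_scores, key=price_scores.get))
print(max(phone_scores, key=phone_scores.get))
```

The unseen price is recognized through the $ symbol and the unseen phone number through its parentheses, in line with the "strongly indicative" symbols named on the slide.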

20 WHIRL Nearest-Neighbor Classifier
Similarity-based
–stores all examples seen so far
–classifies a new example based on similarity to training examples
–IR document similarity metric
Good for...
–long, textual elements: house description, names
–limited, descriptive set of values: color (blue, red, ...)
Bad for...
–short, numeric elements: num-baths, num-bedrooms
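
The similarity-based idea can be sketched with TF-IDF vectors and cosine similarity, a common IR document similarity metric. This is a simplified nearest-neighbor stand-in, not the WHIRL algorithm itself; the class, the labels and the training texts are hypothetical.

```python
import math
from collections import Counter

def words(text):
    return text.lower().split()

class NearestNeighbor:
    """Stores all training examples and scores labels by best cosine similarity."""

    def __init__(self, examples):
        n = len(examples)
        doc_freq = Counter(w for text, _ in examples for w in set(words(text)))
        self.idf = {w: math.log(1 + n / c) for w, c in doc_freq.items()}
        self.vectors = [(self._vector(text), label) for text, label in examples]

    def _vector(self, text):
        """Unit-length TF-IDF vector; words unseen in training are ignored."""
        tf = Counter(words(text))
        vec = {w: c * self.idf[w] for w, c in tf.items() if w in self.idf}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {w: v / norm for w, v in vec.items()} if norm else {}

    def predict(self, text):
        query = self._vector(text)
        best = {}
        for vec, label in self.vectors:
            sim = sum(weight * vec.get(w, 0.0) for w, weight in query.items())
            best[label] = max(best.get(label, 0.0), sim)
        return best

knn = NearestNeighbor([
    ("fantastic house with great view", "description"),
    ("great location close to schools", "description"),
    ("blue", "color"),
    ("red", "color"),
])
text_scores = knn.predict("great house")
color_scores = knn.predict("blue")
print(text_scores)
print(color_scores)
```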

21 County-Name Recognizer
Stores all county names, obtained from the Web
Verifies if the input name is a county name
Essential to matching a county-name element
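
A recognizer of this kind is essentially a dictionary lookup. In the sketch below a tiny hard-coded set stands in for the full list obtained from the Web, and the suffix handling and the fraction-based confidence are assumptions.

```python
# Tiny stand-in for the full county list; a real recognizer would load it from a file.
COUNTY_NAMES = {"king", "pierce", "snohomish", "travis", "dallas"}

def is_county(value):
    """Check one value against the stored names, ignoring case and a ' County' suffix."""
    name = value.strip().lower()
    if name.endswith(" county"):
        name = name[: -len(" county")]
    return name in COUNTY_NAMES

def county_confidence(values):
    """Fraction of an element's values that are recognized county names."""
    if not values:
        return 0.0
    return sum(is_county(v) for v in values) / len(values)

confidence = county_confidence(["King County", "Pierce", "Seattle, WA"])
print(confidence)
```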

22 Meta-Learner: Stacking
Training
–uses training data to learn weights
–one for each (base learner, mediated-schema element)
Combining predictions
–for each mediated-schema element, computes weighted sum of base-learner confidence scores
–picks mediated-schema element with highest sum
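
The combining step can be sketched directly from the description above. The weights and confidence scores are invented for illustration, and how the weights are learned from training data is not shown here.

```python
def stack(predictions, weights):
    """Weighted sum of base-learner confidence scores per mediated-schema element.
    weights is keyed by (learner, element); missing pairs count as weight 0."""
    combined = {}
    for learner, scores in predictions.items():
        for element, score in scores.items():
            w = weights.get((learner, element), 0.0)
            combined[element] = combined.get(element, 0.0) + w * score
    best = max(combined, key=combined.get)
    return best, combined

# Hypothetical base-learner outputs for one source element.
predictions = {
    "name_matcher": {"address": 0.3, "description": 0.7},
    "naive_bayes": {"address": 0.8, "description": 0.2},
}
# Hypothetical learned weights, one per (base learner, mediated-schema element).
weights = {
    ("name_matcher", "address"): 0.4,
    ("name_matcher", "description"): 0.3,
    ("naive_bayes", "address"): 0.6,
    ("naive_bayes", "description"): 0.7,
}

best, combined = stack(predictions, weights)
print(best, combined)
```

With these numbers address scores 0.4 × 0.3 + 0.6 × 0.8 = 0.60 and description scores 0.3 × 0.7 + 0.7 × 0.2 = 0.35, so address is picked even though the name matcher alone preferred description.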

23 Experiments

24 Reasons for Incorrect Matchings
Unfamiliarity
–suburb
–solution: add a suburb-name recognizer
Insufficient information
–correctly identified the general type
–failed to pinpoint the exact type
–Richard Smith (206) 234 5412
–solution: add a proximity learner

25 Experiments: Summary
Multi-strategy learning
–better performance than any single learner
Accuracy of 100% unlikely to be reached
–difficult even for humans
Lots of room for improvement
–more learners
–better learning algorithms

26 Related Work
Rule-based approaches
–TRANSCM [Milo&Zohar98], ARTEMIS [Castano&Antonellis99], [Palopoli et al. 98]
–utilize only schema information
Learner-based approaches
–SEMINT [Li&Clifton94], ILA [Perkowitz&Etzioni95]
–employ a single learner, limited applicability

27 Future Work
[Diagram labels: source descriptions; schema matching, data translation, scope, completeness, reliability, query capability; leaf elements, higher-level elements; 1-1 mappings, complex mappings]

28 Future Work
Improve matching accuracy
–more learners, more domains
Incorporate domain knowledge
–semantic integrity constraints
–concept hierarchy of mediated-schema elements
Learn with structured data

29 Learning with Structured Data
Each example with >1 level of structure
Generative model for XML
XML classifier
XML: “killer app” for relational learning

30 Summary
Schema matching
–automated by learning
Multi-strategy learning is essential
–handles different types of data
–incorporates different types of domain knowledge
–easy to incorporate new learners
–alleviates effects of noise & dirty data
Implemented LSD
–promising results with initial experiments

