Source Discovery and Schema Mapping for Data Integration. Li Xu, Brigham Young University. BYU Data Extraction Group, funded by NSF.


1 Source Discovery and Schema Mapping for Data Integration
Li Xu, Brigham Young University
BYU Data Extraction Group, Funded by NSF

2 Data Integration
Find houses with four bedrooms priced under $200,000
[Figure: a Mediator with a global schema sits above wrappers for source schema 1, source schema 2, and source schema 3; the sources shown are homes.com, realestate.com, and homeseekers.com.]

3 Problems
How to Recognize Applicable Information Sources for an Application?
How to Specify Mappings between the Source Schemas and the Global Schema?
How to Reformulate User Queries?
How to Merge Data from Heterogeneous Sources?
…

4 Recognizing Ontology-Applicable HTML Documents

5 Application Ontology
How to specify an application?

6 Applicable HTML Documents
Multiple-Record Documents
Single-Record Documents
HTML Forms
How to distinguish an applicable HTML document?

7 Multiple-Record Documents
Document 1: Car Ads
Document 2: Items for Sale or Rent

8 Single-Record Document

9 HTML Forms
Information hidden under the HTML form

10 Recognition Heuristics
h1+: Densities
h2: Expected Values
h3: Grouping
How to measure the applicability of an HTML document for an application?

11 h1+: Densities
Document 1: Car Ads
Document 2: Items for Sale or Rent

12 h2: Expected Values
Object Set   Document 1: Car Ads   Document 2: Items for Sale or Rent
Year         3                     1
Make         2                     0
Model        3                     0
Mileage      1                     1
Price        1                     0
Feature      15                    0
PhoneNr      3                     4

13 h3: Grouping (of 1-Max Object Sets)
Document 1: Car Ads (three bracketed groups in the figure): Year, Make, Model, Price, Year, Model, Year, Make, Model, Mileage, …
Document 2: Items for Sale or Rent (two bracketed groups in the figure): Year, Mileage, …, Mileage, Year, Price, …

14 Classification Problem
Subtasks
–Multiple Records
–Singleton Record
–Application Form
Learning Algorithm: Decision Tree C4.5
–(h1+0, h1+1, …, h2, h3, Positive)
–(h1+0, h1+1, …, h2, h3, Negative)
How to construct recognition rules for an application?

15 Experiments
Car Ads and Obituaries
Training Sets
–Car Ads (Yes | No): 143 | 363, 614 | 636, 50 | 69
–Obituaries (Yes | No): 68 | 135, 50 | 69, 62 | 135
Test Sets
–Car Ads (40 | 40): Precision 95%, Recall 98%, F-measure 96%
–Obituaries (40 | 40): Precision 95%, Recall 95%, F-measure 95%
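The reported F-measures agree with the standard balanced F-measure (the harmonic mean of precision and recall). The slides do not state which formula was used, so the worked check below is an assumption, and it starts from the rounded percentages shown on the slide:

```latex
F = \frac{2PR}{P + R}, \qquad
F_{\text{Car Ads}} = \frac{2(0.95)(0.98)}{0.95 + 0.98} \approx 0.9648, \qquad
F_{\text{Obituaries}} = \frac{2(0.95)(0.95)}{0.95 + 0.95} = 0.95
```

Rounded to whole percentages, these give 96% and 95%, matching the test-set figures.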

16 Link Analysis

17 Form Filling

18 Form Filling (Cont.)

19 Incorrect Positive Response
Motorcycle: Year, Make, Price, Mileage, PhoneNr, Feature

20 Historical Figure
Deceased Name, Death Date, Birth Date, Age, Relationship, Relative Name

21 Automating Schema Mapping for Data Integration

22 Schema Mapping
[Figure: a Source Car schema and a Target Car schema. Labels shown: Car, Year, Cost, Style, Feature, Phone, Miles, Mileage, Model, Make & Model, Color, Body Type.]

23 Schema Mapping for Populated Schemas
Central Idea: Exploit All Data & Metadata
Matching Possibilities (Facets)
–Attribute Names
–Data-Value Characteristics
–Expected Data Values
–Data-Dictionary Information
–Structural Properties

24 The Approach
Input:
–Two Graphs, S and T
–Data Instances for S and T
–Lightweight Domain Ontology
Output:
–A Source-to-Target Mapping between S and T
 Should enable translating data instances from S to T.
–Direct and Many Indirect Matches (t, s) (t, s’ <=  )
Framework
–Individual Facet Matching
–Combination of Individual Matchers

25 Attribute Names
Target and Source Attributes
–T : A
–S : B
WordNet
C4.5 Decision Tree: feature selection, trained on schemas in DB books
–f0: same word
–f1: synonym
–f2: sum of distances to a common hypernym root
–f3: number of different common hypernym roots
–f4: sum of the number of senses of A and B

26 WordNet Rule
The number of different common hypernym roots of A and B
The sum of distances of A and B to a common hypernym
The sum of the number of senses of A and B
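To make features f0 and f2 concrete, here is a minimal Python sketch. It does not use WordNet: `TOY_HYPERNYMS` is a hypothetical hand-made, single-parent hierarchy, and the function names are illustrative only. It also assumes the distance is measured to the nearest shared hypernym, which the slides do not specify.

```python
from typing import Dict, Optional

# Hypothetical single-parent hypernym links (child -> parent); not WordNet data.
TOY_HYPERNYMS: Dict[str, str] = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "artifact",
    "phone": "artifact",
}


def same_word(a: str, b: str) -> bool:
    """f0-style feature: the two attribute names are the same word."""
    return a.strip().lower() == b.strip().lower()


def hypernym_distances(word: str, links: Dict[str, str]) -> Dict[str, int]:
    """Map the word and each of its hypernyms to the number of links away."""
    distances = {word: 0}
    current, steps = word, 0
    # Stop at the root, or if a link would revisit a node (guards against cycles).
    while current in links and links[current] not in distances:
        current = links[current]
        steps += 1
        distances[current] = steps
    return distances


def distance_sum_to_common_hypernym(
    a: str, b: str, links: Dict[str, str]
) -> Optional[int]:
    """f2-style feature: smallest summed distance from a and b to a shared hypernym."""
    dist_a = hypernym_distances(a, links)
    dist_b = hypernym_distances(b, links)
    common = set(dist_a) & set(dist_b)
    if not common:
        return None  # no shared hypernym in the toy hierarchy
    return min(dist_a[h] + dist_b[h] for h in common)
```

In real WordNet a word has several senses and a sense can have several hypernyms, so a faithful version would search a graph rather than follow a single parent chain.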

27 Data-Value Characteristics
C4.5 Decision Tree
Features
–Numeric data (Mean, variation, standard deviation, …)
–Alphanumeric data (String length, numeric ratio, space ratio)
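The features listed above can be computed per attribute from its data values. The sketch below is one plausible reading, not the authors' implementation: it takes "variation" to mean population variance, and it defines numeric ratio and space ratio as the share of digit and whitespace characters across all values. The function and key names are hypothetical.

```python
import statistics
from typing import Dict, List


def numeric_features(values: List[float]) -> Dict[str, float]:
    """Summary statistics for a numeric attribute (population variance assumed)."""
    return {
        "mean": statistics.mean(values),
        "variance": statistics.pvariance(values),
        "stdev": statistics.pstdev(values),
    }


def alphanumeric_features(values: List[str]) -> Dict[str, float]:
    """Character-level statistics for a string attribute."""
    total_chars = sum(len(v) for v in values)
    digits = sum(ch.isdigit() for v in values for ch in v)
    spaces = sum(ch.isspace() for v in values for ch in v)
    return {
        "avg_length": total_chars / len(values),
        # Guard against an attribute whose values are all empty strings.
        "numeric_ratio": digits / total_chars if total_chars else 0.0,
        "space_ratio": spaces / total_chars if total_chars else 0.0,
    }
```

Feature vectors like these, computed for a source attribute and a target attribute, are the kind of input a decision tree could compare when deciding whether the two attributes match.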

28 Expected Data Values
Concepts and Relationships
Data Recognizers
–CarMake “ford” “honda” …
–CarModel “accord” “mustang” “taurus” …
[Figure: a Make & Model attribute with values Ford Mustang, Ford Taurus, Ford F150, … (recognized as CarMake . CarModel) is matched against a Brand attribute with values Acura, Audi, BMW, … (CarMake) and a Model attribute with values Legend, Mustang, A4, … (CarModel); one schema is the Target and the other the Source.]
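A data recognizer of this kind can be sketched as a small pattern over expected values. The example below is an illustration only: it uses just the sample makes and models quoted on the slide, and `split_make_model` and `recognized_fraction` are hypothetical helper names.

```python
import re
from typing import List, Optional, Pattern, Tuple

# Recognizers built only from the example values quoted on the slide.
CAR_MAKE = re.compile(r"\b(ford|honda)\b", re.IGNORECASE)
CAR_MODEL = re.compile(r"\b(accord|mustang|taurus)\b", re.IGNORECASE)


def split_make_model(value: str) -> Tuple[Optional[str], Optional[str]]:
    """Pull a make and a model out of a combined 'Make & Model' value."""
    make = CAR_MAKE.search(value)
    model = CAR_MODEL.search(value)
    return (
        make.group(0) if make else None,
        model.group(0) if model else None,
    )


def recognized_fraction(values: List[str], recognizer: Pattern[str]) -> float:
    """Share of values in which the recognizer finds an expected value."""
    hits = sum(1 for v in values if recognizer.search(v))
    return hits / len(values)
```

With these tiny value lists, `split_make_model("Ford F150")` finds the make but no model, which shows why a recognizer's coverage of the domain matters when it is used as matching evidence.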

29 Structure Matching
[Figure: Target and Source schema graphs. Labels shown: House, Agent, Golf course, Water front, Name, Fax, Address, Street, City, State, Basic_features, beds, SQFT, MLS, agent, location_description, name, fax, phone, location, Bedrooms.]

30–34 Structure Matching (Cont.)
[Figure: the same Target and Source graphs and labels as slide 29, repeated on each of these five slides.]

35 {House, MLS} vs. {MLS}
[Figure: Target and Source subgraphs. Labels shown: House, Golf course, Water front, Address, Street, City, State, Basic_features, beds, SQFT, MLS, location_description, location, Bedrooms.]

36 {House, MLS} vs. {MLS}
[Figure: same labels as slide 35.]

37 {House, MLS} vs. {MLS}
[Figure: same labels as slide 35, plus the primed elements House’ and Address1’.]

38 {House, MLS} vs. {MLS}
[Figure: same labels as slide 35, plus the primed elements House’, Golf course’, Water front’, Address1’, Street1’, City1’, State1’.]

39 {Agent} vs. {agent}
[Figure: Target and Source subgraphs. Labels shown: Agent, Name, Fax, Address, Street, City, State, agent, name, fax, phone, address.]

40 {Agent} vs. {agent}
[Figure: same labels as slide 39, plus the primed elements Address2’, Street2’, City2’, State2’.]

41 Inter-Relationship Set
[Figure: Target and Source graphs. Labels shown: House, Agent, Golf course, Water front, Name, Fax, Address, Street, City, State, MLS, agent, Bedrooms, House’.]

42 Example: Source-To-Target Mapping
[Figure: labels shown: House’, Golf course’, Water front’, MLS, beds, agent, name, fax, Address1’, Address2’, Address’, Street’, City’, State’.]

43 Target-based Integration and Query System (TIQS)
Definition: I = (T, {Si}, {Mi})
Phases
–Design (Source-to-Target Mappings {Mi})
–Query Processing (Rule Unfolding)

44 Query Reformulation
Query
–House-Bedrooms(x, 4) :- House-Bedrooms(x, 4), House-Golf_course(x, “Yes”), House-Water_front(x, “Yes”)
[Figure: labels shown: House’, Golf course’, Water front’, MLS, beds, agent, name, fax, Address1’, Address2’, Address’, Street’, City’, State’.]

45 Query Reformulation
Query
–House-Bedrooms(x, 4) :- House-Bedrooms(x, 4), House-Golf_Course(x, “Yes”), House-Water_Front(x, “Yes”)
[Figure: same labels as slide 44.]

46 TIQS (Cont.)
User Queries
–Logic Rules
–Maximal and Sound Query Answers
Advantages
–Rule Unfolding
–Scalability
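Rule unfolding, named above as the query-processing step, replaces each target predicate in a query body with the body of the mapping rule that defines it over the sources. The sketch below shows the idea on the query from the Query Reformulation slides. The rule set `MAPPING_RULES` and the source predicate names (`source_beds`, `source_golf_course`, `source_water_front`) are hypothetical, and the sketch assumes every variable in a rule body also appears in the rule head, so no fresh-variable renaming is needed.

```python
from typing import Dict, List, Tuple

Atom = Tuple[str, Tuple[str, ...]]           # (predicate, arguments)
Rule = Tuple[Tuple[str, ...], List[Atom]]    # (head variables, body atoms)

# Hypothetical source-to-target mapping rules: target predicate -> definition over sources.
MAPPING_RULES: Dict[str, Rule] = {
    "House-Bedrooms": (("h", "n"), [("source_beds", ("h", "n"))]),
    "House-Golf_course": (("h", "v"), [("source_golf_course", ("h", "v"))]),
    "House-Water_front": (("h", "v"), [("source_water_front", ("h", "v"))]),
}

# Body of the example query from the slides.
QUERY_BODY: List[Atom] = [
    ("House-Bedrooms", ("x", "4")),
    ("House-Golf_course", ("x", "Yes")),
    ("House-Water_front", ("x", "Yes")),
]


def unfold(body: List[Atom], rules: Dict[str, Rule]) -> List[Atom]:
    """Replace each target atom with its rule body, binding head variables to the atom's arguments."""
    unfolded: List[Atom] = []
    for predicate, args in body:
        if predicate not in rules:
            unfolded.append((predicate, args))  # already a source predicate
            continue
        head_vars, rule_body = rules[predicate]
        binding = dict(zip(head_vars, args))
        for body_pred, body_args in rule_body:
            unfolded.append(
                (body_pred, tuple(binding.get(a, a) for a in body_args))
            )
    return unfolded
```

Calling `unfold(QUERY_BODY, MAPPING_RULES)` yields a query body expressed only over the hypothetical source predicates, which a mediator could then send to the wrappers.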

47 Experimental Results
Application (Number of Schemes)   Precision (%)   Recall (%)   F (%)   Number Matches   Number Correct   Number Incorrect
Faculty Member (5)                100             100          100     540              540              0
Course Schedule (5)               99              93           96      490              454              6
Real Estate (5)                   90              94           92      876              820              92
Data borrowed from Univ. of Washington [DDH, SIGMOD01]
Indirect Matches: precision 87%, recall 94%, F-measure 90%
Rough Comparison with U of W Results
* Course Schedule – Accuracy: ~71%
* Real Estate (2 tests) – Accuracy: ~75%
* Faculty Member – Accuracy: ~92%
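The percentages in the results are consistent with the usual definitions computed from the match counts, reading the Course Schedule row as 490 matches, 454 correct, and 6 incorrect, and the Real Estate row as 876, 820, and 92. The sketch below is an assumption about how the figures were derived, not something the slides state: it treats "Number Matches" as the number of true matches, precision as correct / (correct + incorrect), and recall as correct / matches. The function name is hypothetical.

```python
from typing import Tuple


def match_metrics(matches: int, correct: int, incorrect: int) -> Tuple[int, int, int]:
    """Return (precision, recall, F-measure) as whole percentages."""
    precision = correct / (correct + incorrect)  # share of proposed matches that are right
    recall = correct / matches                   # share of true matches that were found
    f_measure = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * x) for x in (precision, recall, f_measure))


print(match_metrics(490, 454, 6))    # Course Schedule -> (99, 93, 96)
print(match_metrics(876, 820, 92))   # Real Estate     -> (90, 94, 92)
```

Both rows reproduce the reported precision, recall, and F values, which supports this reading of the flattened table.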

48 Conclusion
A Robust and Flexible Approach to Check Applicability of HTML Documents
A Composite Approach to Automate Schema Mapping
–Direct Matches
–Indirect Matches
An Approach that Combines Advantages of Basic Approaches to Data Integration

49 Future Work
Test More Applications and Data to Evaluate the Approaches
Extend Training Classifiers for Applicability Checking
Further Automate Schema Mapping
Automate Ontology Mapping on the Semantic Web
Automate Mapping between XML Documents
…

50 Thanks! Questions?

