Automating Schema Matching for Data Integration


1 Automating Schema Matching for Data Integration
David W. Embley, Brigham Young University. Funded by NSF.

2 Information Exchange
[Diagram: information flows from a Source schema to a Target schema. Leverage information extraction (this) to do schema matching (that).]

3 Presentation Outline
- Information Extraction
- Schema Matching for HTML Tables
- Direct Schema Matching
- Indirect Schema Matching
- Conclusions and Future Work

4 Information Extraction

5 Extracting Pertinent Information from Documents

6 A Conceptual Modeling Solution
[Conceptual-model diagram: Car has Year, Make, Model, Mileage, Price, and Feature; PhoneNr is for Car; PhoneNr has Extension; relationship sets carry participation constraints such as 0..1, 1..*, and 0..*.]

7 Car-Ads Ontology
    Car [-> object];
    Car [0..1] has Year [1..*];
    Car [0..1] has Make [1..*];
    Car [0..1] has Model [1..*];
    Car [0..1] has Mileage [1..*];
    Car [0..*] has Feature [1..*];
    Car [0..1] has Price [1..*];
    PhoneNr [1..*] is for Car [0..*];
    PhoneNr [0..1] has Extension [1..*];
    Year matches [4]
        constant {
            extract "\d{2}";
            context "([^\$\d]|^)[4-9]\d,[^\d]";
            substitute "^" -> "19";
        },
    End;
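
As a rough illustration of how such a data-frame rule behaves, here is a minimal Python sketch (not the actual ontology engine; the context pattern is simplified from the rule above):

    import re

    # A simplified stand-in for the Year data-frame rule above: find a
    # two-digit year in an acceptable context, then expand it to four
    # digits (substitute "^" -> "19").
    CONTEXT = re.compile(r"([^$\d]|^)([4-9]\d)[^\d]")

    def extract_years(text):
        years = []
        for match in CONTEXT.finditer(text):
            two_digit = match.group(2)      # the "\d{2}" the rule extracts
            years.append("19" + two_digit)
        return years

    print(extract_years("'97 HONDA ACCORD EX, 100K miles"))  # ['1997']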

8 Recognition and Extraction
Extracted into the target tables (values partially shown):

    Car   | Year | Make   | Model     | Mileage | Price | PhoneNr
    0001  |      | Subaru | SW        |         | $1900 | (363) …
    0002  |      |        | Elandra   |         |       | (336) …
    0003  |      | HONDA  | ACCORD EX | 100K    |       | (336) …

    Car   | Feature
    0001  | Auto
    0001  | AC
    0002  | Black door
    0002  | tinted windows
    0002  | Auto
    0002  | pb
    0002  | ps
    0002  | cruise
    0002  | am/fm
    0002  | cassette stero
    0002  | a/c
    0003  | Auto
    0003  | jade green
    0003  | gold

9 Schema Matching for HTML Tables

10 Table-Schema Matching (Basic Idea)
There are many tables on the Web. Ontology-based extraction works well for unstructured or semistructured data, but what about structured data, i.e., tables?
Method:
1. Form attribute-value pairs
2. Do extraction
3. Infer mappings from extraction patterns

11 Problem: Different Schemas
Target database schema:
    {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Different source table schemas:
    {Run #, Yr, Make, Model, Tran, Color, Dr}
    {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
    {Vehicle, Distance, Price, Mileage}
    {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}

12 Problem: Attribute is Value

13 Problem: Attribute-Value is Value

14 Problem: Value is not Value

15 Problem: Implied Values

16 Problem: Missing Attributes

17 Problem: Compound Attributes

18 Problem: Merged Values

19 Problem: Values not of Interest

20 Problem: Factored Values

21 Problem: Split Values

22 Problem: Information Behind Links
- Table extending over several pages
- Single-column table (formatted as a list)

23 Solution
1. Form attribute-value pairs (adjust if necessary)
2. Do extraction
3. Infer mappings from extraction patterns

24 Solution: Remove Internal Factoring
Discover the nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Unnest twice, first on (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*), then on (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*, so that each row carries its own Make (e.g., ACURA) and Model.

25 Solution: Replace Boolean Values
Replace each "Yes" in a Boolean feature column with the column's attribute name: "Yes" under Auto becomes "Auto", "Yes" under Air Cond. becomes "Air Cond.", "Yes" under AM/FM becomes "AM/FM", and "Yes" under CD becomes "CD".

26 Solution: Form Attribute-Value Pairs
From each table row, pair every value with its column's attribute:
    <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>

27 Solution: Adjust Attribute-Value Pairs
Where attribute and value coincide (the Boolean feature columns), keep just the value:
    <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>
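
The forming and adjusting steps together, as a small Python sketch over the slide's Honda Civic row:

    header = ["Make", "Model", "Year", "Colour", "Price",
              "Auto", "Air Cond.", "AM/FM"]
    row    = ["Honda", "Civic EX", "1995", "White", "$6300",
              "Auto", "Air Cond.", "AM/FM"]

    pairs = []
    for attribute, value in zip(header, row):
        if not value:
            continue                    # nothing to pair in an empty cell
        if attribute == value:
            pairs.append((value,))      # adjusted pair: attribute == value
        else:
            pairs.append((attribute, value))
    print(pairs)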

28 Solution: Do Extraction
Run the extraction ontology over the adjusted pairs:
    <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>

29 Solution: Infer Mappings
From the extraction patterns, infer mappings from the (unnested) source table into the target schema {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}. Each source row is a car.
Note: the mappings produce sets for attributes. Joining the sets to form records is trivial because we have OIDs for table rows (e.g., one per Car).

30 Solution: Do Extraction
[Diagram: the mapping for Model, obtained by projecting the Model column of the unnested table into the target's Model attribute.]

31 Solution: Do Extraction
[Diagram: the mapping for Price, obtained by projecting the source's Price column directly into the target's Price attribute.]

32 Solution: Do Extraction
[Diagram: the mapping for Feature, a union over the source's Colour, Auto, Air Cond., AM/FM, and CD columns, with each "Yes" already replaced by its column name.]

33 Experiment
Tables from 60 sites: 10 “training” tables and 50 test tables.
357 mappings (from all 60 sites):
- 172 direct mappings (same attribute and meaning)
- 185 indirect mappings (29 attribute synonyms, 5 “Yes/No” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)

34 Results
10 “training” tables:
- 100% of the 57 mappings (no false mappings)
- 94.6% of the values in linked pages (5.4% false declarations)
50 test tables:
- 94.7% of the 300 mappings (no false mappings)
- on the basis of sampling 3,000 values in linked pages: 97% recall and 86% precision
16 missed mappings:
- 4 partial (not all unions included)
- 6 non-U.S. car ads (unrecognized makes and models)
- 2 U.S. unrecognized makes and models
- 3 prices (missing $, or found MSRP instead)
- 1 mileage (mileages less than 1,000)

35 Direct Schema Matching

36 Attribute Matching for Populated Schemas
Central idea: exploit all data and metadata matching possibilities (facets):
- Attribute names
- Data-value characteristics
- Expected data values
- Data-dictionary information
- Structural properties

37 Approach
Given a target schema T and a source schema S, the framework proceeds by:
- Individual facet matching
- Combining facets
- Best-first match iteration

38 Example
[Diagram: a target schema T (Car with Year, Make, Model, Mileage, Feature, …) beside a source schema S (Car with Year, Make, Model, Miles, Cost, Style, Phone), with relationship sets annotated 0:1 and 0:*.]

39 Individual Facet Matching
- Attribute names
- Data-value characteristics
- Expected data values

40 Attribute Names
Match target and source attribute names via WordNet. A C4.5 decision tree (with feature selection, trained on schemas in DB books) classifies a target/source name pair (A, B) using these features (sketched in code after the list):
- f0: same word
- f1: synonym
- f2: sum of distances to a common hypernym root
- f3: number of different common hypernym roots
- f4: sum of the number of senses of A and B
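
A rough sketch of features f0 through f4 using NLTK's WordNet interface. This is an assumption for illustration: the original work used its own WordNet access and a C4.5 tree, not this code.

    # Requires the WordNet data: nltk.download("wordnet")
    from nltk.corpus import wordnet as wn

    def name_features(a, b):
        syns_a, syns_b = wn.synsets(a), wn.synsets(b)
        f0 = int(a.lower() == b.lower())              # f0: same word
        f1 = int(bool(set(syns_a) & set(syns_b)))     # f1: share a synset
        roots_a = {r for s in syns_a for r in s.root_hypernyms()}
        roots_b = {r for s in syns_b for r in s.root_hypernyms()}
        common = roots_a & roots_b
        f3 = len(common)                  # f3: different common hypernym roots
        f2 = 0
        if common:
            # f2: shortest hypernym-path length to any common root,
            # summed over both words
            def dist(syns):
                return min(len(p) for s in syns
                           for p in s.hypernym_paths() if p[0] in common)
            f2 = dist(syns_a) + dist(syns_b)
        f4 = len(syns_a) + len(syns_b)    # f4: senses of A plus senses of B
        return f0, f1, f2, f3, f4

    print(name_features("Mileage", "Miles"))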

41 WordNet Rule
The learned rule tests:
- the number of different common hypernym roots of A and B
- the sum of the distances of A and B to a common hypernym
- the sum of the number of senses of A and B

42 Confidence Measures

43 Data-Value Characteristics
C4.5 decision tree over features of the data values (sketched below):
- numeric data: mean, variation, standard deviation, …
- alphanumeric data: string length, numeric ratio, space ratio
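
A minimal sketch (assumed; not the original feature extractor) of such value characteristics for one column:

    import statistics

    def value_features(values):
        numeric = [float(v) for v in values
                   if v.replace(".", "", 1).isdigit()]
        if numeric and len(numeric) == len(values):
            # numeric column: mean and standard deviation
            return {"mean": statistics.mean(numeric),
                    "stdev": statistics.pstdev(numeric)}
        # alphanumeric column: length and character-class ratios
        total = sum(len(v) for v in values) or 1
        return {"avg_len": total / len(values),
                "numeric_ratio": sum(c.isdigit() for v in values for c in v) / total,
                "space_ratio": sum(c.isspace() for v in values for c in v) / total}

    print(value_features(["117000", "24000", "35000"]))   # numeric features
    print(value_features(["Civic EX", "Accord LX"]))      # alphanumeric features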

44 Confidence Measures

45 Expected Data Values
Apply the regular-expression recognizer for attribute A in target schema T to the data instances of attribute B in source schema S.
Hit ratio for an (A, B) match: N′/N, where N′ is the number of B data instances recognized by A's regular expressions and N is the total number of B data instances (see the sketch below).
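
The hit ratio itself is a one-liner; a minimal sketch with an assumed Year recognizer:

    import re

    def hit_ratio(recognizer, instances):
        n = len(instances)
        n_hit = sum(1 for v in instances if recognizer.fullmatch(v))
        return n_hit / n if n else 0.0      # N'/N

    year = re.compile(r"(19|20)\d{2}")      # assumed recognizer for Year
    print(hit_ratio(year, ["1995", "1997", "Red"]))   # 2/3 ≈ 0.67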

46 Confidence Measures

47 Combined Measures
[Table of combined facet measures omitted; a match is kept when its combined measure clears the 0.5 threshold.]
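
The actual combination rule lives in the omitted table; as a hedged placeholder, a simple average of the per-facet confidences with the slide's 0.5 threshold might look like:

    # Assumed combination: average the facet confidences (attribute
    # names, value characteristics, expected values) and threshold.
    def combine(confidences, threshold=0.5):
        score = sum(confidences) / len(confidences)
        return score, score > threshold

    print(combine([0.9, 0.6, 0.8]))   # (0.766..., True)  -> keep match
    print(combine([0.4, 0.3, 0.5]))   # (0.4, False)      -> reject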

48 Final Confidence Measures

49 Experimental Results
The example schema above, plus 6 other schemas: 32 matched attributes and 376 unmatched attribute pairs.
Matched: 100%. Unmatched: 99.5%; the two false positives were “Feature”–“Color” and “Feature”–“Body Type”.
Individual facets (matched / unmatched accuracy):
- F1: WordNet: 93.8% / 98.9%
- F2: Value Characteristics: 84% / 97.9%
- F3: Expected Values: 92% / 98.4%

50 Indirect Schema Matching

51 Schema Matching
[Diagram: the target car schema (Car with Year, Make, Model, Feature, Mileage, Cost, Phone) beside the source schema (Year, Make & Model, Color, Body Type, Miles, Cost, Style), with candidate matches drawn between them.]

52 Mapping Generation
Direct matches, as described earlier:
- attribute names, based on WordNet
- value characteristics, based on value lengths, averages, …
- expected values, based on regular-expression recognizers
Indirect matches build on direct matches plus structure evaluation and four operations (two of them sketched below): union, selection, decomposition, and composition.
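
A hedged Python sketch of two of these operations on the car schemas, with made-up column values; union builds the target's Feature, and decomposition splits the source's "Make & Model":

    source = {
        "Color": ["White", "Black"],
        "Body Type": ["Sedan", "Coupe"],
        "Make & Model": ["Honda Civic EX", "Acura Integra GS"],
    }

    # Union over columns: Feature <- Color union Body Type
    feature = source["Color"] + source["Body Type"]

    # Decomposition: a make lexicon (assumed) splits "Make & Model".
    KNOWN_MAKES = {"Honda", "Acura"}
    make, model = [], []
    for value in source["Make & Model"]:
        head, _, tail = value.partition(" ")
        if head in KNOWN_MAKES:
            make.append(head)
            model.append(tail)

    print(feature)        # ['White', 'Black', 'Sedan', 'Coupe']
    print(make, model)    # ['Honda', 'Acura'] ['Civic EX', 'Integra GS']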

53 Union and Selection
[Diagram: the same target and source schemas; the target's Feature is matched by a union over the source's Color and Body Type columns, with selection picking out the qualifying values.]

54 Decomposition and Composition
[Diagram: the same schemas; the source's "Make & Model" decomposes into the target's Make and Model, and composition combines attributes in the opposite direction.]

55 Structure (Example Taken From [MBR, VLDB’01])
[Diagram: a target schema PO (POShipTo, POBillTo, POLines, with Count, Item, City, Street, Line, Qty, UoM) beside a source schema PurchaseOrder (DeliverTo, InvoiceTo, Items, with Address, ItemCount, ItemNumber, City, Street, Quantity, UnitOfMeasure).]

56 Structure (Nonlexical Matches)
[Diagram: nonlexical object sets matched across the two schemas, e.g., PO with PurchaseOrder, POShipTo with DeliverTo, POBillTo with InvoiceTo, and POLines with Items.]

57 Structure (Join over FD Relationship Sets, …)
[Diagram: City and Street, reached through functional relationship sets (e.g., via the source's Address), are joined upward so they attach directly to the object sets being matched.]

58 Structure (Lexical Matches)
[Diagram: lexical matches across the schemas, e.g., City with City, Street with Street, Count with ItemCount, Item with ItemNumber, Qty with Quantity, and UoM with UnitOfMeasure.]

59 Experiments
- Methodology
- Measures: precision, recall, F measure (computed as in the sketch below)
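
These measures are computed in the usual way; a minimal sketch, checked against the Course Schedule row of the results table on the next slide:

    def precision_recall_f(correct, false_pos, false_neg):
        precision = correct / (correct + false_pos)
        recall = correct / (correct + false_neg)
        f = 2 * precision * recall / (precision + recall)
        return precision, recall, f

    # Course Schedule: 119 correct, 2 false positives, 9 false negatives
    print(precision_recall_f(119, 2, 9))   # ~ (0.98, 0.93, 0.96)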

60 Results

    Application (Number of Schemas)  Precision (%)  Recall (%)  F (%)  Correct  False Pos.  False Neg.
    Course Schedule (5)              98             93          96     119      2           9
    Faculty Member (5)               100            100         100    140      0           0
    Real Estate (5)                  92             94          93     235      20          10

Indirect matches: 94% (precision, recall, and F measure).
Data borrowed from Univ. of Washington [DDH, SIGMOD01].
Rough comparison with U of W results (direct matches only):
- Course Schedule: accuracy ~71%
- Faculty Members: accuracy ~92%
- Real Estate (2 tests): accuracy ~75%

61 Conclusions and Future Work

62 Conclusions
Table mappings:
- tables: 94.7% recall, 100% precision
- linked text: ~97% recall, ~86% precision
Direct attribute matching:
- matched 32 of 32: 100% recall
- 2 false positives: 94% precision
Direct and indirect attribute matching:
- matched 494 of 513: 96% recall
- 22 false positives: 96% precision

63 Current & Future Work: Improve and Extend Indirect Matching
- Improve object-set matching (e.g., lexical/nonlexical)
- Add relationship-set matching computations

64 Current & Future Work: Tables Behind Forms
- Crawling the hidden Web
- Filling in forms from global queries

65 Current & Future Work: Developing Extraction Ontologies
- Creation from knowledge sources and sample application pages (ontology + data frames, lexicons, …)
- RDF ontologies
- User creation by example

66 Current & Future Work: and Much More …
- Table understanding
- Microfilm census records
- Generating ontologies by reading tables

