

1 Schema Matching and Data Extraction over HTML Tables
Cui Tao
Data Extraction Research Group, Department of Computer Science, Brigham Young University
Supported by NSF

2 Introduction
- Many tables on the Web
- Ontology-based extraction works well for unstructured or semi-structured data
- What about structured data (tables)?
- How can data stored in different tables be integrated?
  - Detect the table of interest
  - Form attribute-value pairs (adjust if necessary)
  - Do extraction
  - Infer mappings from extraction patterns

3 Problem: Detecting the Table of Interest

4 Problem: Different Schemas
Different source table schemas:
- {Run #, Yr, Make, Model, Tran, Color, Dr}
- {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
- {Vehicle, Distance, Price, Mileage}
- {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
Target database schema:
- {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

5 Problem: Attribute is Value

6 Problem: Attribute-Value is Value

7 Problem: Value is not Value

8 Problem: Factored Values

9 Problem: Split Values

10 Problem: Merged Values

11 Problem: Information Behind Links
- List
- Table extending over several pages

12 Solution
- Detect the table of interest
- Form attribute-value pairs (adjust if necessary)
- Do extraction
- Infer mappings from extraction patterns

13 Solution: Detect the Table of Interest
Top-level tables:
- Table size: at least 3 rows and columns
- Grid layout: same # of values
- Attributes
- Value density: (# of ontology-extracted values) / (total # of values in the table)
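The top-level table checks could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the table is already parsed into a list of rows with the attributes in the first row, `is_ontology_value` is a hypothetical stand-in for the extraction ontology's value recognizers, and the 0.5 density threshold is an assumed value that does not come from the slides.

```python
from typing import Callable, List

Table = List[List[str]]  # first row holds the attributes, the rest hold values


def value_density(table: Table, is_ontology_value: Callable[[str], bool]) -> float:
    """Fraction of non-empty value cells that the ontology recognizes."""
    values = [cell for row in table[1:] for cell in row if cell.strip()]
    if not values:
        return 0.0
    return sum(1 for v in values if is_ontology_value(v)) / len(values)


def is_table_of_interest(
    table: Table,
    is_ontology_value: Callable[[str], bool],
    min_density: float = 0.5,  # assumed threshold, not from the slides
) -> bool:
    # Table size: at least 3 rows and 3 columns.
    if len(table) < 3 or len(table[0]) < 3:
        return False
    # Grid layout: every row has the same number of cells.
    if any(len(row) != len(table[0]) for row in table):
        return False
    # Value density: recognized values / total values.
    return value_density(table, is_ontology_value) >= min_density
```

With a toy recognizer such as `lambda v: v in {"1999", "2001", "Ford", "Honda"}`, a 3x3 car table in which four of six values are recognized has a density of about 0.67 and passes the check.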

14 Solution: Detect the Table of Interest
Linked-page tables:
- Table size: at least 2 rows and columns
- Attributes
- Attribute-value-pair pattern
Page-spanning tables

15 Solution: Remove Factoring
Factored values in the example: 2001, 2000, 1999
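One way to picture factoring removal is to copy each factored value down into the rows it covers. The sketch below assumes that a factored value (for example a year shown once for a group of cars) appears in the parsed table as a filled cell followed by empty cells in the same column; that representation and the function name `remove_factoring` are assumptions made for illustration.

```python
def remove_factoring(rows, column=0):
    """Copy a factored value down into the rows it covers.

    Assumes a factored value shows up as one filled cell followed by
    empty cells in the same column (an assumption about the parsed table).
    """
    filled, current = [], None
    for row in rows:
        row = list(row)  # do not modify the caller's rows
        if row[column].strip():
            current = row[column]      # a new factored value starts here
        elif current is not None:
            row[column] = current      # inherit the value from above
        filled.append(row)
    return filled
```

For rows such as `["2001", "Honda", "Civic"]` followed by `["", "Ford", "Focus"]`, the second row receives the year 2001.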

16 Solution: Replace Boolean Values
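The slide does not say what Boolean values are replaced with. A plausible reading, given source attributes such as Auto, Air Cond., AM/FM, and CD, is that a "yes" under a feature column becomes the attribute name itself so that it can be recognized as a feature. The sketch below follows that assumption; the sets of "true" and "false" spellings are also assumptions.

```python
TRUE_MARKS = {"yes", "y", "x", "true"}    # assumed spellings of "true"
FALSE_MARKS = {"no", "n", "", "false"}    # assumed spellings of "false"


def replace_boolean_values(attributes, row):
    """Replace Boolean cells: true -> attribute name, false -> empty."""
    out = []
    for attr, value in zip(attributes, row):
        mark = value.strip().lower()
        if mark in TRUE_MARKS:
            out.append(attr)    # e.g. "Yes" under "Auto" becomes "Auto"
        elif mark in FALSE_MARKS:
            out.append("")      # the feature is absent
        else:
            out.append(value)   # not a Boolean cell; keep as-is
    return out
```

For example, `replace_boolean_values(["Make", "Auto", "CD"], ["Honda", "Yes", "No"])` yields `["Honda", "Auto", ""]`.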

17 Solution: Form Attribute-Value Pairs
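Forming attribute-value pairs amounts to pairing each cell with the attribute of its column. A minimal sketch, assuming the attributes come from the table's header row and that empty cells produce no pair (the function name is hypothetical):

```python
def form_attribute_value_pairs(attributes, row):
    """Pair each non-empty cell with the attribute of its column."""
    return [(attr, value.strip())
            for attr, value in zip(attributes, row)
            if value.strip()]
```

For the attributes `["Make", "Model", "Year"]` and the row `["Honda", "Civic", ""]`, this gives `[("Make", "Honda"), ("Model", "Civic")]`.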

18 Solution: Adjust Attribute-Value Pairs

19 Solution: Add Information Hidden Behind Links
- Unstructured and semi-structured: concatenate
- Single attribute-value pairs: pair them together
- List: mark the beginning and the end (< >)

20-27 Solution: Inferred Mapping Creation
Target database schema: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Each row is a car.
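The idea of inferring mappings from extraction patterns can be illustrated with a toy example: if most values in a source column are recognized as a given target attribute, the source attribute is mapped to that target attribute. The regular expressions in `RECOGNIZERS` are hypothetical stand-ins for the extraction ontology, and the majority-vote rule is an assumption made for this sketch, not a description of the authors' algorithm.

```python
import re
from collections import Counter, defaultdict

# Toy recognizers standing in for the extraction ontology (assumptions).
RECOGNIZERS = {
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "Price": re.compile(r"^\$\d[\d,]*$"),
    "Make": re.compile(r"^(Ford|Honda|Toyota)$"),
}


def infer_mappings(attributes, rows):
    """Map each source attribute to the target attribute that
    recognizes the most values in its column."""
    votes = defaultdict(Counter)
    for row in rows:
        for attr, value in zip(attributes, row):
            for target, pattern in RECOGNIZERS.items():
                if pattern.match(value.strip()):
                    votes[attr][target] += 1
    # Columns with no recognized values get no mapping.
    return {attr: counts.most_common(1)[0][0] for attr, counts in votes.items()}
```

Applied to a source table with attributes `["Yr", "Make", "Price"]`, the sketch maps `Yr` to the target attribute `Year` because the values in that column match the year pattern.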

28 Experimental Results: Table Location
Car advertisement application domain
- Training set: 7; testing set: 53
- Top table location: 100% (7) on the training set, 87% (46) on the testing set
- Top table location precision: 100%; recall: 87%
- Structured linked page location precision: 86%; recall: 92%
- Linked pages (diagram values): 46, 100% (7), 28, 13, 15, 12, 2

29 Experimental Results: Mapping
Car advertisement application domain
- 46 recognized tables in the testing set
- Total 319 mappings
- Precision: 95.8%; recall: 92.8%
- Of the 296 correct mappings: top-level tables 77%, linked tables 19.6%, both 3.4%
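The recall figure can be checked directly from the counts on the slide, assuming that "Total 319 mappings" is the number of mappings that should have been found and that 296 is the number found correctly. The number of mappings the system proposed is not given on the slide, so precision is not recomputed here.

```python
correct = 296   # correct mappings reported on the slide
total = 319     # total mappings reported on the slide (assumed to be the expected count)

recall = correct / total
print(f"recall = {recall:.1%}")  # prints "recall = 92.8%"
```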

30 Experimental Results: Table Location
Cell-phone sales application domain
- Training set: 5; testing set: 12
- Top table location: 100% (5) on the training set, 92% (11) on the testing set
- Top table location precision: 100%; recall: 92%
- Linked pages (diagram values): 11, 100% (5), 3

31 Experimental Results: Mapping
Cell-phone sales application domain
- 11 recognized tables in the testing set
- Total 97 mappings
- Precision: 90.1%; recall: 85.4%
- Of the 88 correct mappings: top-level tables 85.4%, linked tables 50.5%, both 35.9%

32 Contribution
- Provides an approach to extract information automatically from HTML tables
- Suggests a different way to solve the problem of schema matching

