
1 Schema Matching and Data Extraction over HTML Tables Cui Tao Data Extraction Research Group Department of Computer Science Brigham Young University supported by NSF

2 Introduction Many tables on the Web How to integrate data stored in different tables? Detect the table of interest Form attribute-value pairs (adjust if necessary) Do extraction Infer mappings from extraction patterns

3 Problem Detecting The Table of Interest ?

4 Problem Different schemas. Source table schemas: {Run #, Yr, Make, Model, Tran, Color, Dr}; {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}; {Vehicle, Distance, Price, Mileage}; {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}. Target database schema: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

5 Problem Attribute is Value

6 Problem Attribute-Value is Value ??

7 Problem Value is not Value

8 Problem Factored Values

9 Problem Split Values

10 Problem Merged Values

11 Problem Information Behind Links Single-Column Table (formatted as list) Table extending over several pages

12 Solution Detect the table of interest Form attribute-value pairs (adjust if necessary) Do extraction Infer mappings from extraction patterns

13 Solution Detect The Table of Interest: ‘Real’ table test; Same number of values; Table size; Attribute test; Density measure test: (# of ontology-extracted values) / (total # of values in the table)
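The density measure on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the three regular expressions stand in for the ontology's extraction patterns, and the `threshold` value and the exact form of the "real table" check are hypothetical choices that do not come from the slides.

```python
import re

# Hypothetical recognizers standing in for the ontology's extraction patterns.
RECOGNIZERS = {
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "Price": re.compile(r"^\$\d{1,3}(,\d{3})*$"),
    "Make": re.compile(r"^(Ford|Honda|Toyota)$", re.IGNORECASE),
}


def density(rows):
    """Return (# of recognized values) / (total # of values in the table)."""
    values = [cell.strip() for row in rows for cell in row if cell.strip()]
    if not values:
        return 0.0
    hits = sum(
        1 for v in values if any(p.match(v) for p in RECOGNIZERS.values())
    )
    return hits / len(values)


def is_table_of_interest(rows, threshold=0.5):
    """Assumed checks: at least two rows, equal row lengths, dense enough."""
    if len(rows) < 2 or len({len(row) for row in rows}) != 1:
        return False
    return density(rows) >= threshold
```

A layout table holding navigation links would score a density near zero under these patterns, while a car listing would score high.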

14 Solution Remove Factoring
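One way to picture removing factoring: a value that appears once for a group of rows is copied into every row of that group. The sketch below assumes a simplified input in which a factored value shows up as a single-cell row (for example a cell spanning the table width); that representation and the function name are illustrative assumptions, not the presenter's implementation.

```python
def remove_factoring(rows):
    """Prepend the most recent factored value to each data row below it."""
    current = None  # factored value that applies to the rows that follow
    flat = []
    for row in rows:
        if len(row) == 1:
            # Assumption: a single-cell row carries a factored value.
            current = row[0]
        else:
            flat.append([current] + list(row))
    return flat
```

With the input `[["Honda"], ["Accord", "1999"], ["Civic", "2001"]]`, each data row comes back with "Honda" in front.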

15 Solution Replace Boolean Values
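A plausible reading of this step, given the earlier "Attribute is Value" and "Value is not Value" slides, is that a cell such as "Yes" under the column "Air Cond." really means the car has the feature "Air Cond.". The sketch below replaces affirmative marks with the attribute name and negative marks with an empty value. The sets of marks are assumptions made for illustration.

```python
# Assumed spellings of Boolean marks; real tables may use others.
TRUE_MARKS = {"yes", "y", "x", "true"}
FALSE_MARKS = {"no", "n", "false"}


def replace_boolean_values(header, row):
    """Swap Boolean cells for the attribute name (true) or empty (false)."""
    replaced = []
    for attribute, value in zip(header, row):
        mark = value.strip().lower()
        if mark in TRUE_MARKS:
            replaced.append(attribute)
        elif mark in FALSE_MARKS:
            replaced.append("")
        else:
            replaced.append(value)  # ordinary value, keep unchanged
    return replaced
```

After this step the feature columns hold values that an extraction pattern for Feature could recognize directly.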

16 Solution Form Attribute-Value Pairs
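Forming attribute-value pairs amounts to pairing each cell with the attribute heading its column. A minimal sketch, assuming the header row has already been identified and that empty cells are simply skipped (both are assumptions for illustration):

```python
def form_attribute_value_pairs(header, rows):
    """Pair every non-empty cell with the attribute of its column."""
    records = []
    for row in rows:
        pairs = [(a, v.strip()) for a, v in zip(header, row) if v.strip()]
        records.append(pairs)
    return records
```

Each row becomes one record of (attribute, value) pairs, which is the form the extraction step can then consume.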

17 Solution Adjust Attribute-Value Pairs

18 Solution Add Information Hidden Behind Links. Unstructured and semi-structured: concatenate. Single attribute-value pairs: pair them together. List: mark the beginning and the end < >

19 Solution Inferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

20 Solution Inferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature} Each row is a car.

21 Solution Inferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

22 Solution Inferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

23 Solution Inferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

24 Solution Inferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

25 Solution Inferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
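The idea of inferring mappings from extraction patterns can be sketched as follows: for each source column, count which target attribute's pattern recognizes its values, and map the column to that attribute when enough values match. The patterns, the `min_share` cutoff, and the function name are hypothetical and only illustrate the idea; they are not the presenter's implementation.

```python
import re
from collections import Counter

# Hypothetical extraction patterns for three target attributes.
PATTERNS = {
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "Price": re.compile(r"^\$\d[\d,]*$"),
    "Mileage": re.compile(r"^\d[\d,]*\s*(miles|mi)$", re.IGNORECASE),
}


def infer_mappings(header, rows, min_share=0.5):
    """Map each source attribute to the target whose pattern fits its column."""
    mappings = {}
    for i, attribute in enumerate(header):
        column = [row[i].strip() for row in rows]
        counts = Counter(
            target
            for value in column
            for target, pattern in PATTERNS.items()
            if pattern.match(value)
        )
        if not counts:
            continue  # no pattern recognized anything in this column
        target, hits = counts.most_common(1)[0]
        if hits / len(column) >= min_share:
            mappings[attribute] = target
    return mappings
```

Because the mapping is driven by the values rather than the column names, a source column labelled "Yr" or "Cost" can still be matched to Year or Price in the target schema.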

26 Experimental Results Car Advertisement Application domain. 10 “training” tables: 100% of the 57 mappings (no false mappings); 94.6% precision of the values in linked pages (5.4% false declarations). 50 test tables: 94.7% of the 300 mappings (no false mappings); on the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision

27 Other Applications Cell Phone Plan Application domain Soccer Player Application domain

28 Contribution Provides an approach to extract information automatically from HTML tables Suggests a different way to solve the problem of schema matching

