Record-Boundary Discovery in Web Documents D.W. Embley, Y. Jiang, Y.-K. Ng Data-Extraction Group* Department of Computer Science Brigham Young University.


1 Record-Boundary Discovery in Web Documents
D.W. Embley, Y. Jiang, Y.-K. Ng
Data-Extraction Group*
Department of Computer Science, Brigham Young University, Provo, UT, USA
*Funded in part by Novell, Inc., Ancestry.com, Inc., and Faneuil Research.

2 Record-Boundary Discovery
Larger Goal: Information Extraction
The Salt Lake Tribune … Domestic Cars …
‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888
‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 …
#####
Year  Make  Model  PhoneNr

3 Desired Objective: Query the Web Like a Database
Example: Get the year, make, model, and price for 1987 or later cars that are red or white.

Year  Make   Model     Price
------------------------------
97    CHEVY  Cavalier  11,995
94    DODGE             4,995
94    DODGE  Intrepid  10,000
91    FORD   Taurus     3,500
90    FORD   Probe
88    FORD   Escort     1,000

4 Approach and Limitations
Automatic Ontology-Based Wrapper Generation for a page of unstructured records, rich in data and narrow in ontological breadth
[Architecture diagram]
Application Ontology -> Ontology Parser -> Constant/Keyword Matching Rules; Record-Level Objects, Relationships, and Constraints; Database Scheme
Web Page -> Record Extractor -> Unstructured Records
Unstructured Records + Constant/Keyword Matching Rules -> Constant/Keyword Recognizer -> Data-Record Table
Data-Record Table + Record-Level Objects, Relationships, and Constraints + Database Scheme -> Database-Instance Generator -> Populated Database

5 Application Ontology: Object-Relationship Model Instance
Car [-> object];
Car [0..1] has Model [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Year [1..*];
Car [0..1] has Price [1..*];
Car [0..1] has Mileage [1..*];
PhoneNr [1..*] is for Car [0..1];
PhoneNr [0..1] has Extension [1..*];
Car [0..*] has Feature [1..*];
[Diagram: Car linked to Year, Price, Make, Mileage, Model, Feature, and PhoneNr, and PhoneNr linked to Extension, by "has" and "is for" relationships labeled with the participation constraints above]

6 Application Ontology: Data Frames
Make matches [10] case insensitive
  constant
    { extract “chev”; },
    { extract “chevy”; },
    { extract “dodge”; },
    …
end;
Model matches [16] case insensitive
  constant
    { extract “88”; context “\bolds\S*\s*88\b”; },
    …
end;
Mileage matches [7] case insensitive
  constant
    { extract “[1-9]\d{0,2}k”; substitute “k” -> “,000”; },
    …
  keyword “\bmiles\b”, “\bmi\b”, “\bmi.\b”;
end;
...
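The Mileage frame pairs an extraction pattern with a substitution, and the Model frame accepts "88" only in a given context. The following is a minimal Python sketch of how such rules might be applied. The regular expressions are copied from the slide, but the function names `extract_mileage` and `extract_model_88` are hypothetical, and the data-frame language itself is declarative rather than Python.

```python
import re

# Patterns copied from the slide's Mileage and Model data frames.
MILEAGE_PATTERN = re.compile(r"[1-9]\d{0,2}k", re.IGNORECASE)
MODEL_88_CONTEXT = re.compile(r"\bolds\S*\s*88\b", re.IGNORECASE)


def extract_mileage(text):
    """Return mileage constants with the substitution "k" -> ",000" applied."""
    return [
        re.sub(r"k$", ",000", match.group(0), flags=re.IGNORECASE)
        for match in MILEAGE_PATTERN.finditer(text)
    ]


def extract_model_88(text):
    """Return "88" once for each place it occurs in the required context."""
    return ["88" for _ in MODEL_88_CONTEXT.finditer(text)]
```

Under this reading, "95K" in an ad becomes the mileage "95,000", while a bare "88" (for example a model year) is not taken as a Model unless it follows a word beginning with "olds".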

7 Ontology Parser
Input: Application Ontology
Output: Constant/Keyword Matching Rules
  Make : chevy
  …
  KEYWORD(Mileage) : \bmiles\b
  ...
Output: Database Scheme
  create table Car ( Car integer, Year varchar(2), … );
  create table CarFeature ( Car integer, Feature varchar(10) );
  ...
Output: Record-Level Objects, Relationships, and Constraints
  Object: Car;
  ...
  Car: Year [0..1];
  Car: Make [0..1];
  …
  CarFeature: Car [0..*] has Feature [1..*];

8 Record Extractor
Input: Web Page
  … ‘97 CHEVY Cavalier, Red, 5 spd, … ‘85 DODGE Daytona, needs paint, … …
Output: Unstructured Records
  … ##### ‘97 CHEVY Cavalier, Red, 5 spd, … ##### ‘85 DODGE Daytona, needs paint, … ##### ...

9 Record Extractor: High Fan-Out Heuristic
The Salt Lake Tribune … Domestic Cars … ‘97 CHEVY Cavalier, Red, … ‘85 DODGE Daytona, needs …
[Tag tree: html; head; title; body; … hr h4 b hr h4 ... h1]
Candidate Separator Tags
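As a rough illustration of the high fan-out idea, the sketch below builds a tag tree with Python's standard `html.parser` and returns the node with the most direct children; the tags of those children would then be the candidate separators. The class and function names are hypothetical, the list of void tags is a simplifying assumption, and the slides do not say how the original system handles malformed HTML.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (assumed subset, enough for this sketch).
VOID_TAGS = {"hr", "br", "img", "meta", "link", "input"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Build a simple tag tree; assumes reasonably well-nested HTML."""

    def __init__(self):
        super().__init__()
        self.tree_root = Node("#document")
        self.open_node = self.tree_root
        self.all_nodes = [self.tree_root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.open_node)
        self.open_node.children.append(node)
        self.all_nodes.append(node)
        if tag not in VOID_TAGS:
            self.open_node = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this name, if there is one.
        node = self.open_node
        while node is not self.tree_root and node.tag != tag:
            node = node.parent
        if node is not self.tree_root:
            self.open_node = node.parent


def highest_fan_out(html):
    """Return the node with the largest number of direct children."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return max(builder.all_nodes, key=lambda node: len(node.children))
```

For a page whose `body` holds a heading followed by alternating `hr` and `h4` elements, `highest_fan_out` returns the `body` node, and `[child.tag for child in node.children]` lists the candidate tags in document order.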

10 Record Extractor: Record-Separator Heuristics
IT: Identifiable “html separator” Tags
HT: Highest-count Tags
SD: Standard Deviation
OM: Ontological Match
RP: Repeating-tag Patterns

11 IT: Identifiable “html separator” Tags Domestic Cars ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888 ‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 ‘96 FORD Taurus GL Only $8900 WOW! Lowbook Sales 474-3335 hr tr td a table p br h4 h1 strong b i

12 HT: Highest-count Tags
Domestic Cars ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888 ‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 ‘96 FORD Taurus GL Only $8900 WOW! Lowbook Sales 474-3335

Tag  Count
----------
hr   4
h4   3
b    1
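The HT ranking amounts to counting how often each candidate tag occurs and sorting by count. A minimal Python sketch follows; `rank_by_count` is a hypothetical name, and the input list is an assumed ordering of the candidate tags chosen only so that the counts match the slide (hr 4, h4 3, b 1).

```python
from collections import Counter


def rank_by_count(candidate_tags):
    """Rank candidate separator tags from most to least frequent."""
    return [tag for tag, _count in Counter(candidate_tags).most_common()]


# Assumed ordering; only the per-tag counts come from the slide.
tags = ["hr", "h4", "b", "hr", "h4", "hr", "h4", "hr"]
ranking = rank_by_count(tags)
```

With these counts, `ranking` puts hr first, h4 second, and b third.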

13 SD: Standard Deviation
Domestic Cars ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888 ‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 ‘96 FORD Taurus GL Only $8900 WOW! Lowbook Sales 474-3335

hr (σ = 45.5): 159 characters, 63 characters, 62 characters
h4 (σ = 48.0): 159 characters, 63 characters
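The two figures on this slide can be reproduced with the population standard deviation of the character counts listed for each tag. The sketch below uses Python's `statistics.pstdev`; the function name `rank_by_std_dev` is hypothetical, and ranking the smallest spread first is an assumption consistent with hr being ranked ahead of h4 by SD in the consensus example.

```python
from statistics import pstdev


def rank_by_std_dev(interval_lengths):
    """Rank tags by the spread of the text lengths between their occurrences.

    interval_lengths maps a tag name to the list of character counts.
    A smaller spread suggests more regularly sized records.
    """
    spread = {tag: pstdev(lengths) for tag, lengths in interval_lengths.items()}
    return sorted(spread.items(), key=lambda item: item[1])


# Character counts taken from the slide.
ranked = rank_by_std_dev({"hr": [159, 63, 62], "h4": [159, 63]})
```

Here `ranked` lists hr first with a deviation of about 45.5 and h4 second with 48.0.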

14 OM: Ontological Match Domestic Cars ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888 ‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 ‘96 FORD Taurus GL Only $8900 WOW! Lowbook Sales 474-3335 Record Estimator: average of count of Year, Make, and Model = 3. Closest candidate separator count: h4 = 3, hr = 4, b = 1.
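The OM step can be sketched as estimating the number of records from ontology field counts and ranking tags by how close their counts are to that estimate. In the Python sketch below, `rank_by_ontological_match` is a hypothetical name, the tag counts come from the slide, and the per-field counts of 3 are read off the three sample ads rather than stated on the slide.

```python
def rank_by_ontological_match(field_counts, tag_counts):
    """Rank tags by closeness of their count to the estimated record count."""
    estimate = sum(field_counts.values()) / len(field_counts)
    return sorted(tag_counts, key=lambda tag: abs(tag_counts[tag] - estimate))


om_ranking = rank_by_ontological_match(
    {"Year": 3, "Make": 3, "Model": 3},  # assumed from the three sample ads
    {"hr": 4, "h4": 3, "b": 1},          # counts from the slide
)
```

With an estimate of 3 records, h4 ranks first, hr second, and b third.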

15 RP: Repeating-tag Patterns Domestic Cars ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888 ‘85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 ‘96 FORD Taurus GL Only $8900 WOW! Lowbook Sales 474-3335 3 pairs: Of the tags in the repeating pattern, h4 is closest with 3, then hr with 4.

16 Record Extractor: Consensus Heuristic
Certainty is a generalization of: C(E1) + C(E2) - C(E1)C(E2).
C denotes certainty and Ei is the evidence for an observation.
Our certainties are based on observations from 10 different sites for 2 different applications (car ads and obituaries).

            Correct Tag Rank
Heuristic   1       2       3       4
IT          96.0%   4.0%    0%      0%
HT          49.0%   32.5%   16.5%   2.0%
SD          65.5%   22.5%   12.0%   0%
OM          84.5%   12.5%   2.0%    1.0%
RP          77.5%   12.5%   9.0%    1.0%

17 Record Extractor: Example Consensus Heuristic

      Rank Computed
Tag   IT  HT  SD  OM  RP   Certainty Factor
hr    1   1   1   2   2    .994
h4    2   2   2   1   1    .983
b     3   3   -   3   -    .182

e.g., b: 0 + .165 + .02 - 0 × .165 - 0 × .02 - .165 × .02 + 0 × .165 × .02 = .1817

            Correct Tag Rank
Heuristic   1       2       3       4
IT          96.0%   4.0%    0%      0%
HT          49.0%   32.5%   16.5%   2.0%
SD          65.5%   22.5%   12.0%   0%
OM          84.5%   12.5%   2.0%    1.0%
RP          77.5%   12.5%   9.0%    1.0%
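Because C(E1) + C(E2) - C(E1)C(E2) equals 1 - (1 - C(E1))(1 - C(E2)), one natural generalization to any number of heuristics is one minus the product of the individual doubts. The Python sketch below applies that form to the certainty table; `combined_certainty` is a hypothetical name, and treating a heuristic that gives no rank as contributing nothing is an assumption.

```python
# Certainty that the tag a heuristic ranks 1st..4th is the correct separator
# (values from the slide's table).
CERTAINTY = {
    "IT": [0.960, 0.040, 0.000, 0.000],
    "HT": [0.490, 0.325, 0.165, 0.020],
    "SD": [0.655, 0.225, 0.120, 0.000],
    "OM": [0.845, 0.125, 0.020, 0.010],
    "RP": [0.775, 0.125, 0.090, 0.010],
}


def combined_certainty(ranks):
    """Combine per-heuristic certainties as 1 - product of (1 - C_i).

    ranks maps a heuristic name to the rank (1-4) it gave the tag;
    heuristics that produced no rank for the tag are simply omitted.
    """
    doubt = 1.0
    for heuristic, rank in ranks.items():
        doubt *= 1.0 - CERTAINTY[heuristic][rank - 1]
    return 1.0 - doubt


hr = combined_certainty({"IT": 1, "HT": 1, "SD": 1, "OM": 2, "RP": 2})
h4 = combined_certainty({"IT": 2, "HT": 2, "SD": 2, "OM": 1, "RP": 1})
b = combined_certainty({"IT": 3, "HT": 3, "OM": 3})
```

This reproduces the worked value of .1817 for b; the values for hr and h4 come out within about 0.001 of the slide's figures, and the ordering hr, h4, b is the same.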

18 Record Extractor: Results
4 different applications (car ads, job ads, obituaries, university courses) with 5 new/different sites for each application

Heuristic   Success Rate
IT          95%
HT          45%
SD          65%
OM          80%
RP          75%
Consensus   100%

19 Constant/Keyword Recognizer
Inputs: Unstructured Records; Constant/Keyword Matching Rules
Output: Data-Record Table
‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888

Descriptor/String/Position(start/end)
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
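The table rows use 1-based, inclusive character positions. A minimal Python sketch of a recognizer that emits rows in that shape is given below; the `RULES` dictionary holds only a few patterns quoted from the data-frame slide, and both `RULES` and `recognize` are hypothetical names rather than the system's actual interface.

```python
import re

# A small subset of matching rules, using patterns quoted on the data-frame slide.
RULES = {
    "Make": [r"chev", r"chevy", r"dodge"],
    "KEYWORD(Mileage)": [r"\bmiles\b", r"\bmi\b"],
}


def recognize(record):
    """Return (descriptor, string, start, end) rows with 1-based inclusive positions."""
    rows = []
    for descriptor, patterns in RULES.items():
        for pattern in patterns:
            for match in re.finditer(pattern, record, re.IGNORECASE):
                # match.start() is 0-based; match.end() is exclusive, so it
                # already equals the 1-based inclusive end position.
                rows.append((descriptor, match.group(0), match.start() + 1, match.end()))
    return sorted(rows, key=lambda row: (row[2], row[3]))
```

Applied to the start of the sample ad, this yields Make|CHEV|5|8, Make|CHEVY|5|9, and KEYWORD(Mileage)|miles|44|48, matching the corresponding rows of the table. Both CHEV and CHEVY are reported here; choosing between them is left to the next stage.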

20 Database Instance Generator
Inputs: Data-Record Table; Record-Level Objects, Relationships, and Constraints
Heuristics
  Keyword proximity
  Subsumed and overlapping constants
  Functional relationships
  Nonfunctional relationships
  First occurrence without constraint violation

Descriptor/String/Position(start/end)
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
[Figure annotations: distance to the “miles” keyword = 2 for Mileage 7,000 and = 52 for Mileage 11,995]
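Two of the listed heuristics can be illustrated on the table rows: dropping a constant subsumed by a longer one of the same descriptor, and preferring the candidate closest to its keyword. The Python sketch below is one plausible reading, not the system's documented behavior; all function names are hypothetical. The gaps it computes for the two Mileage candidates, 2 and 52, match the two values that survive on the slide.

```python
def drop_subsumed(rows):
    """Drop a constant whose span lies inside a longer span of the same descriptor."""
    kept = []
    for row in rows:
        descriptor, _text, start, end = row
        subsumed = any(
            other is not row
            and other[0] == descriptor
            and other[2] <= start
            and end <= other[3]
            and (other[3] - other[2]) > (end - start)
            for other in rows
        )
        if not subsumed:
            kept.append(row)
    return kept


def keyword_distance(constant, keyword):
    """Gap in characters between two non-overlapping spans."""
    _, _, c_start, c_end = constant
    _, _, k_start, k_end = keyword
    return k_start - c_end if c_end < k_start else c_start - k_end


def pick_by_keyword(candidates, keyword):
    """Choose the candidate constant nearest to the keyword."""
    return min(candidates, key=lambda row: keyword_distance(row, keyword))


# Rows taken from the data-record table.
makes = [("Make", "CHEV", 5, 8), ("Make", "CHEVY", 5, 9)]
mileages = [("Mileage", "7,000", 38, 42), ("Mileage", "11,995", 100, 105)]
miles = ("KEYWORD(Mileage)", "miles", 44, 48)

best_make = drop_subsumed(makes)
best_mileage = pick_by_keyword(mileages, miles)
```

Under these assumptions CHEVY is kept over CHEV, and 7,000 is chosen as the Mileage, which leaves 11,995 available as the Price.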

21 Database-Instance Generator
Inputs: Data-Record Table; Record-Level Objects, Relationships, and Constraints; Database Scheme
Output: Populated Database

Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155

insert into Car values(1001, “97”, “CHEVY”, “Cavalier”, “7,000”, “11,995”, “566-3800”)
insert into CarFeature values(1001, “Red”)
insert into CarFeature values(1001, “5 spd”)

22 Recall & Precision
N = number of facts in source
C = number of facts declared correctly
I = number of facts declared incorrectly
Recall = C / N (of facts available, how many did we find?)
Precision = C / (C + I) (of facts retrieved, how many were relevant?)

23 Results: Car Ads
Salt Lake Tribune
Training set for tuning ontology: 100
Test set: 116

Field      Recall %  Precision %
Year       100       100
Make       97        100
Model      82        100
Mileage    90        100
Price      100       100
PhoneNr    94        100
Extension  50        100
Feature    91        99

24 Car Ads: Comments
Unbounded sets
  – missed: MERC, Town Car, 98 Royale
  – could use lexicon of makes and models
Unspecified variation in lexical patterns
  – missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
  – could adjust lexical patterns
Misidentification of attributes
  – classified AUTO in AUTO SALES as automatic transmission
  – could adjust exceptions in lexical patterns
Typographical errors
  – “Chrystler”, “DODG ENeon”, “I-15566-2441”
  – could look for spelling variations and common typos

25 Results: Computer Job Ads
Los Angeles Times
Training set for tuning ontology: 50
Test set: 50

Field   Recall %  Precision %
Degree  100       100
Skill   74        100
Email   91        83
Fax     100       100
Voice   79        92

26 Results: Obituaries
Arizona Daily Star
Training set for tuning ontology: ~ 24
Test set: 90

Field           Recall %  Precision %
DeceasedName*   100       100
Age             86        98
BirthDate       96        96
DeathDate       84        99
FuneralDate     96        93
FuneralAddress  82        82
FuneralTime     92        87
…
Relationship    92        97
RelativeName*   95        74
*partial or full name

27 Cautions
Ontology Creation and Tuning
  – Regular expressions (tool for experimentation)
  – Category specialization and cultural localization
Record Separation
  – Web page has multiple records satisfying an ontology
  – (HTML) record separator exists
Attribute-Value Pair Generation
  – Context-sensitive recognizable/categorizable constants
  – Topic switches within records

28 Conclusions
Given an ontology and a Web page with multiple records, it is possible to extract and structure the data automatically.
Record Separation Results: 100%
Recall and Precision Results
  – Car Ads: ~ 94% recall and ~ 99% precision
  – Job Ads: ~ 84% recall and ~ 98% precision
  – Obituaries: ~ 90% recall and ~ 95% precision (except names: ~ 73% precision)
Future Work
  – Find and categorize pages of interest.
  – Relax restrictions for record separation.
  – Strengthen heuristics for extraction.
  – Add richer conversions and additional constraints to data frames.
http://www.deg.byu.edu/

