Extracting and Structuring Web Data D.W. Embley*, D.M Campbell †, Y.S. Jiang, Y.-K. Ng, R.D. Smith Department of Computer Science S.W. Liddle ‡, D.W.

Slides:



Advertisements
Similar presentations
1 Initial Results on Wrapping Semistructured Web Pages with Finite-State Transducers and Contextual Rules Chun-Nan Hsu Arizona State University.
Advertisements

Record-Boundary Discovery in Web Documents by Yuan Jiang December 1, 1998.
Schema Matching and Data Extraction over HTML Tables Cui Tao Data Extraction Research Group Department of Computer Science Brigham Young University supported.
Semiautomatic Generation of Data-Extraction Ontologies Master’s Thesis Proposal Yihong Ding.
1 What Do You Want— Semantic Understanding? (You’ve Got to be Kidding) David W. Embley Brigham Young University Funded in part by the National Science.
Ontology-Based Free-Form Query Processing for the Semantic Web by Mark Vickers Supported by:
Domain-Independent Data Extraction: Person Names Carl Christensen and Deryle Lonsdale Brigham Young University
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Annotation Free Information Extraction Chia-Hui Chang Department of Computer Science & Information Engineering National Central University
Extracting and Structuring Web Data D.W. Embley*, D.M Campbell †, Y.S. Jiang, Y.-K. Ng, R.D. Smith, Li Xu Department of Computer Science S.W. Liddle ‡
Schema Matching and Data Extraction over HTML Tables Cui Tao Data Extraction Research Group Department of Computer Science Brigham Young University supported.
6/11/20151 A Binary-Categorization Approach for Classifying Multiple-Record Web Documents Using a Probabilistic Retrieval Model Department of Computer.
IR & Metadata. Metadata Didn’t we already talk about this? We discussed what metadata is and its types –Data about data –Descriptive metadata is external.
Data Frames Version 3 Proposal. Data Frames Version 2 Year matches [2] constant { extract "\d{2}"; context "([^\$\d]|^)\d{2}[^,\dkK]"; } 0.5, { extract.
Visual Web Information Extraction With Lixto Robert Baumgartner Sergio Flesca Georg Gottlob.
Data Extraction From HTML Tables Cui Tao Department of Computer Science Brigham Young University.
A Fully Automated Object Extraction System for the World Wide Web a paper by David Buttler, Ling Liu and Calton Pu, Georgia Tech.
Toward Tomorrow’s Semantic Web An Approach Based on Information Extraction Ontologies David W. Embley Brigham Young University Funded in part by the National.
Recognizing Ontology-Applicable Multiple-Record Web Documents David W. Embley Dennis Ng Li Xu Brigham Young University.
6/17/20151 Table Structure Understanding by Sibling Page Comparison Cui Tao Data Extraction Group Department of Computer Science Brigham Young University.
BYU 2003BYU Data Extraction Group Automating Schema Matching David W. Embley, Cui Tao, Li Xu Brigham Young University Funded by NSF.
Extracting Data Behind Web Forms Stephen W. Liddle David W. Embley Del T. Scott, Sai Ho Yau Brigham Young University Presented by: Helen Chen.
DLLS Ontologically-based Searching for Jobs in Linguistics Deryle Lonsdale Funded by:
Semiautomatic Generation of Resilient Data-Extraction Ontologies Yihong Ding Data Extraction Group Brigham Young University Sponsored by NSF.
ER 2002BYU Data Extraction Group Automatically Extracting Ontologically Specified Data from HTML Tables with Unknown Structure David W. Embley, Cui Tao,
Ontology-Based Information Extraction and Structuring Stephen W. Liddle † School of Accountancy and Information Systems Brigham Young University Douglas.
Query Rewriting for Extracting Data Behind HTML Forms Xueqi Chen, 1 David W. Embley 1 Stephen W. Liddle 2 1 Department of Computer Science 2 Rollins Center.
Extracting and Structuring Web Data David W. Embley Department of Computer Science Brigham Young University D.M. Campbell, Y.S. Jiang, Y.-K. Ng, R.D. Smith.
From OSM-L to JAVA Cui Tao Yihong Ding. Overview of OSM.
DASFAA 2003BYU Data Extraction Group Discovering Direct and Indirect Matches for Schema Elements Li Xu and David W. Embley Brigham Young University Funded.
UFMG, June 2002BYU Data Extraction Group Automating Schema Matching for Data Integration David W. Embley Brigham Young University Funded by NSF.
Filtering Multiple-Record Web Documents Based on Application Ontologies Presenter: L. Xu Advisor: D.W.Embley.
Scheme Matching and Data Extraction over HTML Tables from Heterogeneous Sources Cui Tao March, 2002 Founded by NSF.
Semi-Automatically Generating Data-Extraction Ontology Yihong Ding March 6, 2001.
Toward Making Online Biological Data Machine Understandable Cui Tao Data Extraction Research Group Department of Computer Science, Brigham Young University,
Ontos Project n Ontology Parser n Data Frame/Ontology Definition n Relevance Detection n Coarse Structure Detection n Constant/Keyword Matching n Database.
BYU Data Extraction Group Automating Schema Matching David W. Embley, Cui Tao, Li Xu Brigham Young University Funded by NSF.
January 2004 ADC’ What Do You Want— Semantic Understanding? (You’ve Got to be Kidding) David W. Embley Brigham Young University Funded in part by.
Semantic Understanding An Approach Based on Information-Extraction Ontologies David W. Embley Brigham Young University.
Semantic Understanding An Approach Based on Information-Extraction Ontologies David W. Embley Brigham Young University.
Record-Boundary Discovery in Web Documents D.W. Embley, Y. Jiang, Y.-K. Ng Data-Extraction Group* Department of Computer Science Brigham Young University.
Generating Data-Extraction Ontologies By Example Joe Zhou Data Extraction Group Brigham Young University.
An Abstract Framework for Extraction Plans and Heuristics in a Data Extraction System Alan Wessman Brigham Young University Based on research supported.
BYU Data Extraction Group Funded by NSF1 Brigham Young University Li Xu Source Discovery and Schema Mapping for Data Integration.
Automatic Creation and Simplified Querying of Semantic Web Content An Approach Based on Information-Extraction Ontologies Yihong Ding, David W. Embley,
Query Rewriting for Extracting Data Behind HTML Forms Xueqi Chen Department of Computer Science Brigham Young University March 31, 2004 Funded by National.
Finding Advertising Keywords on Web Pages Scott Wen-tau YihJoshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University.
Ground-truthing Obituaries. Project Overview Untapped sources – Obituaries: hundreds of millions – Problem: how to cost-effectively extract Extraction.
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Microsoft Research Asia Yunhua Hu, Guomao Xin, Ruihua Song, Guoping.
Stephen W. Liddle, PhD Academic Director, Rollins Center for Entrepreneurship & Technology Professor, Information Systems Department Marriott School, Brigham.
Cross-Language Hybrid Keyword and Semantic Search David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Joseph S. Park, Andrew Zitzelberger Brigham Young.
Processing of large document collections Part 10 (Information extraction: multilingual IE, IE from web, IE from semi-structured data) Helena Ahonen-Myka.
Open Information Extraction using Wikipedia
An Aspect of the NSF CDI InitiativeNSF CDI: Cyber-Enabled Discovery and Innovation.
Talk Schedule Question Answering from Bryan Klimt July 28, 2005.
LOGO 1 Corroborate and Learn Facts from the Web Advisor : Dr. Koh Jia-Ling Speaker : Tu Yi-Lang Date : Shubin Zhao, Jonathan Betz (KDD '07 )
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
Feature Assignment LBSC 878 February 22, 1999 Douglas W. Oard and Dagobert Soergel.
David W. Embley Brigham Young University Provo, Utah, USA.
Chapter 4 Attribute Data.
Extracting and Structuring Web Data
Cross-language Information Retrieval
What Do You Want—Semantic Understanding?
Introduction The amount of Web data has increased dramatically
David W. Embley Brigham Young University Provo, Utah, USA
Extracting and Structuring Web Data
Automating Schema Matching for Data Integration
Family History Technology Workshop
Introduction to Information Retrieval
Presentation transcript:

Extracting and Structuring Web Data D.W. Embley*, D.M Campbell †, Y.S. Jiang, Y.-K. Ng, R.D. Smith Department of Computer Science S.W. Liddle ‡, D.W. Quass ‡ School of Accountancy and Information Systems Marriott School of Management Brigham Young University Provo, UT, USA Funded in part by *Novell, Inc., † Ancestry.com, Inc., and ‡ Faneuil Research.

GOAL Query the Web like we query a database Example: Get the year, make, model, and price for 1987 or later cars that are red or white. YearMakeModelPrice CHEVYCavalier11,995 94DODGE 4,995 94DODGEIntrepid10,000 91FORDTaurus 3,500 90FORDProbe 88FORDEscort 1,000

PROBLEM The Web is not structured like a database. Example: The Salt Lake Tribune Classified’s … ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, or …

Making the Web Look Like a Database Web Query Languages –Treat the Web as a graph (pages = nodes, links = edges). –Query the graph (e.g., Find all pages within one hop of pages with the words “Cars for Sale”). Wrappers –Find page of interest. –Parse page to extract attribute-value pairs and insert them into a database. Write parser by hand. Use syntactic clues to generate parser semi-automatically. –Query the database.

for a page of unstructured documents, rich in data and narrow in ontological breadth Automatic Wrapper Generation Application Ontology Parser Constant/Keyword Recognizer Database-Instance Generator Unstructured Record Documents Constant/Keyword Matching Rules Data-Record Table Record-Level Objects, Relationships, and Constraints Database Scheme Populated Database Record Extractor Web Page

Application Ontology: Object-Relationship Model Instance Car [-> object]; Car [0..1] has Model [1..*]; Car [0..1] has Make [1..*]; Car [0..1] has Year [1..*]; Car [0..1] has Price [1..*]; Car [0..1] has Mileage [1..*]; PhoneNr [1..*] is for Car [0..1]; PhoneNr [0..1] has Extension [1..*]; Car [0..*] has Feature [1..*]; YearPrice Make Mileage Model Feature PhoneNr Extension Car has is for has 1..* * * 1..*

Application Ontology: Data Frames Make matches [10] case insensitive constant { extract “chev”; }, { extract “chevy”; }, { extract “dodge”; }, … end; Model matches [16] case insensitive constant { extract “88”; context “\bolds\S*\s*88\b”; }, … end; Mileage matches [7] case insensitive constant { extract “[1-9]\d{0,2}k”; substitute “k” -> “,000”; }, … keyword “\bmiles\b”, “\bmi\b “\bmi.\b”; end;...

Ontology Parser Make : chevy … KEYWORD(Mileage) : \bmiles\b... create table Car ( Car integer, Year varchar(2), … ); create table CarFeature ( Car integer, Feature varchar(10));... Object: Car;... Car: Year [0..1]; Car: Make [0..1]; … CarFeature: Car [0..*] has Feature [1..*]; Application Ontology Parser Constant/Keyword Matching Rules Record-Level Objects, Relationships, and Constraints Database Scheme

Record Extractor … ‘97 CHEVY Cavalier, Red, 5 spd, … ‘89 CHEVY Corsica Sdn teal, auto, … …. … ##### ‘97 CHEVY Cavalier, Red, 5 spd, … ##### ‘89 CHEVY Corsica Sdn teal, auto, … #####... Unstructured Record Documents Record Extractor Web Page

Record Extractor: High Fan-Out Heuristic The Salt Lake Tribune … Domestic Cars … ‘97 CHEVY Cavalier, Red, … ‘89 CHEVY Corsica Sdn … … html head title body … hr h4 hr h4 hr...h1

Record Extractor: Record-Separator Heuristics … ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Asking only $11,995. … ‘89 CHEV Corsica Sdn teal, auto, air, trouble free. Only $8,995 …... Identifiable “separator” tags Highest-count tag(s) Interval standard deviation Ontological match Repeating tag patterns Example:

Record Extractor: Consensus Heuristic Certainty is a generalization of: C(E 1 ) + C(E 2 ) - C(E 1 )C(E 2 ). C denotes certainty and E i is the evidence for an observation. Our certainties are based on observations from 10 different sites for 2 different applications (car ads and obituaries) Correct Tag Rank Heuristic IT 96% 4% HT 49% 33% 16% 2% SD 66% 22% 12% OM 85% 12% 2% 1% RP 78% 12% 9% 1%

Record Extractor: Results 4 different applications (car ads, job ads, obituaries, university courses) with 5 new/different sites for each application HeuristicSuccess Rate IT 95% HT 45% SD 65% OM 80% RP 75% Consensus 100%

Constant/Keyword Recognizer Descriptor/String/Position(start/end) ‘97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, or Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155 Constant/Keyword Recognizer Unstructured Record Documents Constant/Keyword Matching Rules Data-Record Table

Heuristics Keyword proximity Subsumed and overlapping constants Functional relationships Nonfunctional relationships First occurrence without constraint violation

Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155 Keyword Proximity   '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or

Subsumed/Overlapping Constants '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155

Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155 Functional Relationships '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or

Nonfunctional Relationships '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155

First Occurrence without Constraint Violation '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, or Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155

Year|97|2|3 Make|CHEV|5|8 Make|CHEVY|5|9 Model|Cavalier|11|18 Feature|Red|21|23 Feature|5 spd|26|30 Mileage|7,000|38|42 KEYWORD(Mileage)|miles|44|48 Price|11,995|100|105 Mileage|11,995|100|105 PhoneNr| |136|143 PhoneNr| |148|155 Database-Instance Generator insert into Car values(1001, “97”, “CHEVY”, “Cavalier”, “7,000”, “11,995”, “ ”) insert into CarFeature values(1001, “Red”) insert into CarFeature values(1001, “5 spd”) Database-Instance Generator Data-Record Table Record-Level Objects, Relationships, and Constraints Database Scheme Populated Database

Recall & Precision N = number of facts in source C = number of facts declared correctly I = number of facts declared incorrectly (of facts available, how many did we find?) (of facts retrieved, how many were relevant?)

Results: Car Ads Training set for tuning ontology: 100 Test set: 116 Salt Lake Tribune Recall %Precision % Year Make Model Mileage Price PhoneNr Extension Feature 91 99

Car Ads: Comments Unbounded sets –missed: MERC, Town Car, 98 Royale –could use lexicon of makes and models Unspecified variation in lexical patterns –missed: 5 speed (instead of 5 spd), p.l (instead of p.l.) –could adjust lexical patterns Misidentification of attributes –classified AUTO in AUTO SALES as automatic transmission –could adjust exceptions in lexical patterns Typographical errors –“Chrystler”, “DODG ENeon”, “I ” –could look for spelling variations and common typos

Results: Computer Job Ads Training set for tuning ontology: 50 Test set: 50 Los Angeles Times Recall %Precision % Degree Skill Fax Voice 79 92

Obituaries (A More Demanding Application) Our beloved Brian Fielding Frost, age 41, passed away Saturday morning, March 7, 1998, due to injuries sustained in an automobile accident. He was born August 4, 1956 in Salt Lake City, to Donald Fielding and Helen Glade Frost. He married Susan Fox on June 1, He is survived by Susan; sons Jord- dan (9), Travis (8), Bryce (6); parents, three brothers, Donald Glade (Lynne), Kenneth Wesley (Ellen), … Funeral services will be held at 12 noon Friday, March 13, 1998 in the Howard Stake Center, 350 South 1600 East. Friends may call 5-7 p.m. Thurs- day at Wasatch Lawn Mortuary, 3401 S. Highland Drive, and at the Stake Center from 10:45-11:45 a.m. Names Addresses Family Relationships Multiple Dates Multiple Viewings

Obituary Ontology (partial)

Data Frames Lexicons & Specializations Name matches [80] case sensitive constant { extract First, “\s+”, Last; }, … { extract “[A-Z][a-zA-Z]*\s+([A-Z]\.\s+)?”, Last; }, … lexicon { First case insensitive; filename “first.dict”; }, { Last case insensitive; filename “last.dict”; }; end; Relative Name matches [80] case sensitive constant { extract First, “\s+\(“, First, “\)\s+”, Last; substitute “\s*\([^)]*\)” -> “”; } … end;...

Keyword Heuristics Singleton Items RelativeName|Brian Fielding Frost|16|35 DeceasedName|Brian Fielding Frost|16|35 KEYWORD(Age)|age|38|40 Age|41|42|43 KEYWORD(DeceasedName)|passed away|46|56 KEYWORD(DeathDate)|passed away|46|56 BirthDate|March 7, 1998|76|88 DeathDate|March 7, 1998|76|88 IntermentDate|March 7, 1998|76|98 FuneralDate|March 7, 1998|76|98 ViewingDate|March 7, 1998|76|98...

Keyword Heuristics Multiple Items … KEYWORD(Relationship)|born … to|152|192 Relationship|parent|152|192 KEYWORD(BirthDate)|born|152|156 BirthDate|August 4, 1956|157|170 DeathDate|August 4, 1956|157|170 IntermentDate|August 4, 1956|157|170 FuneralDate|August 4, 1956|157|170 ViewingDate|August 4, 1956|157|170 BirthDate|August 4, 1956|157|170 RelativeName|Donald Fielding|194|208 DeceasedName|Donald Fielding|194|208 RelativeName|Helen Glade Frost|214|230 DeceasedName|Helen Glade Frost|214|230 KEYWORD(Relationship)|married|237|243...

Results: Obituaries Training set for tuning ontology: ~ 24 Test set: 90 Arizona Daily Star Recall %Precision % DeceasedName* Age BirthDate DeathDate FuneralDate FuneralAddress FuneralTime … Relationship RelativeName* *partial or full name

Results: Obituaries Training set for tuning ontology: ~ 12 Test set: 38 Salt Lake Tribune Recall %Precision % DeceasedName* Age BirthDate DeathDate FuneralDate FuneralAddress FuneralTime … Relationship RelativeName* *partial or full name

Conclusions Given an ontology and a Web page with multiple records, it is possible to extract and structure the data automatically. Recall and Precision results are encouraging. –Car Ads: ~ 94% recall and ~ 99% precision –Job Ads: ~ 84% recall and ~ 98% precision –Obituaries: ~ 90% recall and ~ 95% precision (except on names: ~ 73% precision) Future Work –Find and categorize pages of interest. –Strengthen heuristics for separation, extraction, and construction. –Add richer conversions and additional constraints to data frames.