Automating the Extraction of Domain Specific Information from the Web: A Case Study for the Genealogical Domain. Troy Walker. Thesis Proposal, January 2004.


1 Automating the Extraction of Domain Specific Information from the Web: A Case Study for the Genealogical Domain
Troy Walker
Thesis Proposal, January 2004
Research funded by NSF

2 Genealogical Information on the Web
- Hundreds of thousands of sites
  - Some professional (Ancestry.com, Familysearch.org)
  - Mostly hobbyist (203,200 indexed by Cyndislist.com)
- Search engines
  - “Walker genealogy” on Google: 199,000 results
  - 1 page/minute = 5 months to go through
- Why not enlist the help of a computer?
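The "5 months" figure on this slide can be checked with a quick calculation. The sketch below assumes nonstop reading (24 hours a day) and a 30-day month; neither assumption is stated on the slide.

```python
# Rough check of the reading-time estimate from the slide.
# Assumptions (not from the slide): reading around the clock, 30-day months.
RESULTS = 199_000        # Google results for "Walker genealogy"
MINUTES_PER_PAGE = 1     # reading rate given on the slide

total_minutes = RESULTS * MINUTES_PER_PAGE
days = total_minutes / (60 * 24)
months = days / 30

print(f"{days:.0f} days, about {months:.1f} months")
```

Under these assumptions the result is about 138 days, or roughly 4.6 months, which is consistent with the rounded figure on the slide.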

3 Problems
- No standard way of presenting data
  - Text formatted with HTML tags
  - Tables
  - Forms to access information
- Sites have differing schemas

4 Proposed Solution
- Based on Ontos and other work done by the BYU Data Extraction Group (DEG)
- Able to extract from:
  - Single-Record or Multiple-Record Documents
  - Tables
  - Forms
- Scalable and robust to changes in pages
- Easily adaptable to other domains

5 Text

6 Tables

7 Forms

8 Forms

9 System Overview
(Architecture diagram. Components shown: URL Selector, Form Engine, Table Engine, Single- or Multiple-Record Engine, URL List, User Query, Result Filter, Document Retriever and Structure Recognizer, Data Constrainer, Ontology, Result Presenter.)

10 User Query
- Generated from ontology
- Generated once per application domain

11 User Query

12 URL List and URL Selector
- Contains Genealogy URLs
- Searching each URL takes too much time
- Select likely URLs
- Distribute document processing using DOGMA

13 URL List and Document Retriever
(Two-column table: URL, Filter)

URL: http://www.ancestry.com/search/main.htm?lfl=adv
Filter: (none listed)

URL: http://userdb.rootsweb.com/deaths/cgi-bin/deaths.cgi
Filter: Death Date > 1880

URL: http://www.camcomp.com/users/jwalker/johngene/johngenes.htm
Filter: Name: Bates, Boyle, Damon, Eliot, … Walker, Woodsworth

URL: http://www.rootsweb.com/~gaupson/cedarcem.htm
Filter: Burial Location: Thomaston, GA

URL: http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html
Filter: Name: Adams

URL: http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html
Filter: Name: Walker

URL: http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Warley.html
Filter: Name: Warley

URL: http://homepages.rootsweb.com/~gemmell/walkdesc.htm
Filter: Name: Walker

URL: http://www.smartnouveau.com/jbplace/Kemp/f0000425.html
Filter: Name: Anderson, Burt, Summers, Walker
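One way to picture how a URL Selector might use these filters is sketched below. It is an illustration only, not the proposal's implementation: the `UrlEntry` class, the `select_urls` function, and the rule that an entry with no name filter is always kept are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class UrlEntry:
    """Hypothetical URL-list entry: a URL plus the surnames it is known to cover."""
    url: str
    names: Optional[FrozenSet[str]] = None  # None: no name filter recorded


# A few rows from the slide's table, reduced to their name filters.
URL_LIST = [
    UrlEntry("http://www.ancestry.com/search/main.htm?lfl=adv"),
    UrlEntry("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html",
             frozenset({"Adams"})),
    UrlEntry("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html",
             frozenset({"Walker"})),
    UrlEntry("http://homepages.rootsweb.com/~gemmell/walkdesc.htm",
             frozenset({"Walker"})),
]


def select_urls(entries: List[UrlEntry], surname: str) -> List[str]:
    """Keep URLs whose filter does not rule out the queried surname.

    Assumption: an entry without a name filter cannot be ruled out,
    so it is always selected.
    """
    return [e.url for e in entries if e.names is None or surname in e.names]


selected = select_urls(URL_LIST, "Walker")
for url in selected:
    print(url)
```

For a query on "Walker", this keeps the unfiltered search form and the two Walker pages, and skips the Adams page without retrieving it.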

14 Document Structure Recognizer
- Requests analysis from each Data Extraction Engine
- Selects appropriate method
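The "ask every engine, then pick one" step could look roughly like the sketch below. The slide does not say what form each engine's analysis takes; here it is assumed to be a numeric confidence score, and the three scoring functions are deliberately crude, hypothetical stand-ins for the real engines.

```python
from typing import Callable, Dict


def form_score(html: str) -> float:
    """Hypothetical form-engine analysis: high confidence if a form is present."""
    return 1.0 if "<form" in html.lower() else 0.0


def table_score(html: str) -> float:
    """Hypothetical table-engine analysis: confident if a table is present."""
    return 0.8 if "<table" in html.lower() else 0.0


def text_score(html: str) -> float:
    """Hypothetical text-engine analysis: a constant fallback confidence."""
    return 0.5


# Assumed registry of engines; names and scores are illustrative only.
ENGINES: Dict[str, Callable[[str], float]] = {
    "form": form_score,
    "table": table_score,
    "text": text_score,
}


def choose_engine(html: str) -> str:
    """Request an analysis from each engine and return the most confident one."""
    scores = {name: analyze(html) for name, analyze in ENGINES.items()}
    return max(scores, key=scores.get)


print(choose_engine("<html><table><tr><td>Walker</td></tr></table></html>"))
```

Keeping the engines behind a common scoring interface means a new engine can be added to the registry without changing the selection logic.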

15 Data Extraction Engines
- Text
  - Improved record-separation
  - Ability to handle single-record pages
- Table
- Forms

16 Data Constrainer
- Selects attribute/value pairs
- Fits data to ontology

17 Result Filter
- Fits data to query
- Returns to central Result Presenter
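"Fits data to query" can be read as keeping only the extracted records that satisfy every constraint in the user's query. The sketch below illustrates that reading. The record layout, the field names, and the sample records are invented for the example; only the "Death Date > 1880" style of constraint comes from the slides.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Query = Dict[str, Callable[[Any], bool]]

# Invented sample records standing in for extracted data.
records: List[Record] = [
    {"Name": "Walker", "DeathYear": 1902},
    {"Name": "Walker", "DeathYear": 1875},
    {"Name": "Adams", "DeathYear": 1890},
]

# Hypothetical query: surname Walker with a death date after 1880.
query: Query = {
    "Name": lambda value: value == "Walker",
    "DeathYear": lambda value: value > 1880,
}


def matches(record: Record, query: Query) -> bool:
    """True if the record has every queried field and passes each constraint."""
    return all(
        field in record and test(record[field])
        for field, test in query.items()
    )


results = [r for r in records if matches(r, query)]
print(results)
```

Treating a missing field as a failed match is a design choice made here for simplicity; a real filter might instead keep such records as possible matches.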

18 Result Presenter
- Creates XML Schema from Ontology
- Presents results to user

19 Result Presenter
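As a rough picture of generating an XML Schema from an ontology, the sketch below turns a flat list of attribute names into a schema with one optional string element per attribute. This is a simplification: the real ontology is richer than a list of names, and the `schema_for` function and the `Person` record name are hypothetical. The attribute names echo the filters shown earlier (Name, Death Date, Burial Location) with spaces removed so they are valid element names.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)  # serialize with the conventional xs: prefix


def schema_for(record_name: str, fields: list) -> str:
    """Build a minimal XML Schema: one record element with optional string children.

    Assumption: the ontology is reduced to a flat list of attribute names.
    """
    schema = ET.Element(f"{{{XS}}}schema")
    record = ET.SubElement(schema, f"{{{XS}}}element", name=record_name)
    complex_type = ET.SubElement(record, f"{{{XS}}}complexType")
    sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
    for field in fields:
        # minOccurs="0" because a page may not supply every attribute.
        ET.SubElement(sequence, f"{{{XS}}}element",
                      name=field, type="xs:string", minOccurs="0")
    return ET.tostring(schema, encoding="unicode")


schema_xml = schema_for("Person", ["Name", "DeathDate", "BurialLocation"])
print(schema_xml)
```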

20 Evaluation
- Scalability
  - Query on large URL list
  - Experiment on number of PCs
- Precision and recall
  - Recall difficult to determine
  - Query on small URL list
- Adaptability
  - Car ontology
  - Small URL list
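For reference, precision and recall follow their standard definitions: precision is the fraction of extracted items that are correct, and recall is the fraction of correct items that were extracted. The sketch below computes both from two sets; the function name and the sample sets are made up for illustration. It also shows why recall is hard to determine on the open web: it needs the full set of relevant items, which is only practical to build by hand for a small URL list.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Standard precision and recall over sets of item identifiers."""
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: four items extracted, three items actually relevant.
extracted_items = {"a", "b", "c", "d"}
relevant_items = {"a", "b", "e"}

precision, recall = precision_recall(extracted_items, relevant_items)
print(f"precision={precision:.2f} recall={recall:.2f}")
```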

21 Conclusion
- Integrates, builds on previous DEG work
- Extracts from:
  - Single- or Multiple-Record Documents
  - Tables
  - Forms
- Scalable
  - Only searches probable pages
  - Distributed with DOGMA
- Robust to changes in pages
- Ontology based: easily adapted to other domains

