Querying the Web for Genealogical Information. Troy Walker, Spring Research Conference 2003. Research funded by NSF.


1 Querying the Web for Genealogical Information
Troy Walker
Spring Research Conference 2003
Research funded by NSF

2 Genealogical Information on the Web
- Hundreds of thousands of sites
  - Some professional (Ancestry.com, Familysearch.org)
  - Mostly hobbyist (Cyndislist.com)
- Search engines
  - “Walker genealogy” on Google: 199,000 results
  - 1 page/minute = 5 months to go through
- Why not enlist the help of a computer?
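The "5 months" estimate on this slide can be checked with a quick calculation. The sketch below assumes pages are reviewed around the clock unless told otherwise and that a month has 30 days; neither assumption is stated on the slide, and the function name is only illustrative.

```python
# Rough check of the review-time estimate from the slide.
RESULTS = 199_000        # Google hits for "Walker genealogy" (from the slide)
MINUTES_PER_PAGE = 1     # review rate (from the slide)
DAYS_PER_MONTH = 30      # assumption: a 30-day month


def months_to_review(results: int, hours_per_day: float = 24.0) -> float:
    """Months needed to read every result at the given daily effort."""
    total_minutes = results * MINUTES_PER_PAGE
    days = total_minutes / (60 * hours_per_day)
    return days / DAYS_PER_MONTH


nonstop = months_to_review(RESULTS)              # reading 24 hours a day
workday = months_to_review(RESULTS, 8.0)         # reading 8 hours a day
print(f"nonstop: {nonstop:.1f} months, 8 h/day: {workday:.1f} months")
```

Reading nonstop comes to about 4.6 months, which is consistent with the slide's rounded figure; at fewer hours per day the estimate grows proportionally.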

3 Problems
- No standard way of presenting data
  - Text formatted with HTML tags
  - Tables
  - Forms to access information
- Each site has its own idea of what genealogical information is: differing schemas

4 Proposed solution
- Based on Ontos and other work done at the BYU Data Extraction Group
- Able to extract from:
  - Semi-structured or unstructured text
  - Tables
  - Forms
- Scalable and robust to changes in pages
- Built for genealogy but easily adaptable to other domains

5 Text

6 Tables

7 Forms

8 Forms

9 System Overview
[Diagram: system architecture]
Components: Document Retriever, Form Engine, Table Engine, Unstructured or Semi-Structured Text Engine, URL Database, User Query, Result Filter, Document Structure Recognizer, Data Extraction Engine, Mapping Information
Legend: To be implemented, To be improved, To be integrated

10 User Query
- Form generated from ontology
- Query by example

11 URL Database and Document Retriever
- Contains genealogy URLs
- Search each URL: too much time
- Filter likely URLs

URL | Filter
http://www.ancestry.com/search/main.htm?lfl=adv | (none shown)
http://userdb.rootsweb.com/deaths/cgi-bin/deaths.cgi | Death Date > 1880
http://www.camcomp.com/users/jwalker/johngene/johngenes.htm | Name: Bates, Boyle, Damon, Eliot, … Walker, Woodsworth
http://www.rootsweb.com/~gaupson/cedarcem.htm | Burial Location: Thomaston, GA
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html | Name: Adams
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html | Name: Walker
http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Warley.html | Name: Warley
http://homepages.rootsweb.com/~gemmell/walkdesc.htm | Name: Walker
http://www.smartnouveau.com/jbplace/Kemp/f0000425.html | Name: Anderson, Burt, Summers, Walker
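One way to read the URL/filter table is that each stored URL carries constraints describing what the page can contain, and a URL is retrieved only if the query does not contradict them. The following is a minimal sketch under that assumption. The `UrlEntry` and `Query` classes, their field names, and the matching rules are hypothetical and are not taken from the presentation; only the URLs and filter values come from the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UrlEntry:
    """Hypothetical URL database record: a URL plus optional filters."""
    url: str
    names: Optional[frozenset] = None        # surnames covered; None = unrestricted
    death_after: Optional[int] = None        # page only lists deaths after this year
    burial_location: Optional[str] = None    # page only covers this burial place


@dataclass(frozen=True)
class Query:
    """Hypothetical user query; unset fields place no constraint."""
    surname: Optional[str] = None
    death_year: Optional[int] = None
    burial_location: Optional[str] = None


def may_match(entry: UrlEntry, query: Query) -> bool:
    """Keep a URL unless one of its filters rules the query out."""
    if entry.names is not None and query.surname is not None:
        if query.surname not in entry.names:
            return False
    if entry.death_after is not None and query.death_year is not None:
        if query.death_year <= entry.death_after:
            return False
    if entry.burial_location is not None and query.burial_location is not None:
        if query.burial_location != entry.burial_location:
            return False
    return True


def likely_urls(db, query):
    """Return the URLs worth retrieving for this query, in database order."""
    return [entry.url for entry in db if may_match(entry, query)]


# A few rows from the slide's table, encoded in the hypothetical structure.
URL_DB = [
    UrlEntry("http://www.ancestry.com/search/main.htm?lfl=adv"),
    UrlEntry("http://userdb.rootsweb.com/deaths/cgi-bin/deaths.cgi", death_after=1880),
    UrlEntry("http://www.rootsweb.com/~gaupson/cedarcem.htm",
             burial_location="Thomaston, GA"),
    UrlEntry("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Adams.html",
             names=frozenset({"Adams"})),
    UrlEntry("http://www.cs.utk.edu/~dwalker/genealogy/LISTS/Walker.html",
             names=frozenset({"Walker"})),
]

walker_urls = likely_urls(URL_DB, Query(surname="Walker", death_year=1952))
```

In this sketch a URL with no filters is always retrieved, and a filter is ignored when the query leaves the corresponding field unset, so the filter only ever removes pages that cannot contain an answer.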

12 Method Selector
- Analyze page
- Select appropriate method
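The slide does not say how a page is analyzed. As an illustration only, the sketch below counts HTML tags with Python's standard `html.parser` module and routes the page to a form, table, or text engine. The thresholds, the `select_method` name, and the returned labels are assumptions, not the system's actual rules.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each start tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.counts[tag] = self.counts.get(tag, 0) + 1


def select_method(html: str, min_table_rows: int = 3) -> str:
    """Pick an extraction method from simple tag counts (assumed heuristic)."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    counts = counter.counts
    # A form with at least one field suggests the data sits behind a query form.
    if counts.get("form", 0) and (counts.get("input", 0) or counts.get("select", 0)):
        return "form"
    # A table with several rows suggests tabular records.
    if counts.get("table", 0) and counts.get("tr", 0) >= min_table_rows:
        return "table"
    # Otherwise fall back to the semi-structured or unstructured text engine.
    return "text"
```

For example, `select_method('<form><input name="surname"></form>')` selects the form engine, while a page with only paragraphs falls through to the text engine.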

13 Preprocessing Engines
- Text
  - Improved record-separation
  - Ability to handle single-record pages
- Table
- Forms

14 Extraction Engine
- Ontos
- Cache schema matches
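The slide gives no detail on how schema matches are cached. One plausible reading, sketched below, is that the mapping from a page's column headers to ontology fields is computed once and reused whenever the same headers appear again. The `SchemaMatchCache` class, the `SYNONYMS` table, and the field names are all hypothetical and do not describe Ontos itself.

```python
# Hypothetical synonym table: ontology field -> header spellings seen on pages.
SYNONYMS = {
    "Name": {"name", "full name", "person"},
    "BirthDate": {"born", "birth", "birth date", "date of birth"},
    "DeathDate": {"died", "death", "death date", "date of death"},
}


class SchemaMatchCache:
    """Memoizes header-to-ontology mappings, keyed by the normalized headers."""

    def __init__(self):
        self._cache = {}
        self.misses = 0  # number of times a mapping had to be computed

    def match(self, headers):
        key = tuple(h.strip().lower() for h in headers)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._compute(key)
        return self._cache[key]

    @staticmethod
    def _compute(key):
        """Map each recognized header to its ontology field; skip unknown ones."""
        mapping = {}
        for header in key:
            for field_name, spellings in SYNONYMS.items():
                if header in spellings:
                    mapping[header] = field_name
        return mapping


cache = SchemaMatchCache()
first = cache.match(["Name", "Born", "Died"])
again = cache.match(["name ", "BORN", "died"])  # same headers after normalization
```

Because the key is the normalized header tuple, the second call is served from the cache and `cache.misses` stays at 1.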

15 Result Filter
- Filters objects relevant to query
- Presents to user

Person | Name | Gender
1 | Ezra Erastus Walker | M

Person | Event | Date | Location
1 | Birth | 27 Sep 1885 | Taylor, Apache, AZ
1 | Death | 19 Sep 1952 |
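The filtering step can be pictured as keeping only the extracted person objects that satisfy the query. The sketch below is an assumption about how that might look: the `Person` and `Event` classes and the matching rules are hypothetical, the first record is the one shown on the slide, and the second record is made up purely so the filter has something to reject.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    kind: str            # e.g. "Birth" or "Death"
    date: str
    location: str = ""


@dataclass
class Person:
    name: str
    gender: str
    events: list = field(default_factory=list)


def matches(person, surname=None, birth_location=None):
    """True if the person satisfies every criterion that was supplied."""
    if surname is not None:
        if surname.lower() not in person.name.lower().split():
            return False
    if birth_location is not None:
        births = [e for e in person.events if e.kind == "Birth"]
        if not any(birth_location.lower() in e.location.lower() for e in births):
            return False
    return True


def filter_results(people, **criteria):
    """Return the extracted objects relevant to the query."""
    return [p for p in people if matches(p, **criteria)]


EXTRACTED = [
    # Record shown on the slide.
    Person("Ezra Erastus Walker", "M", [
        Event("Birth", "27 Sep 1885", "Taylor, Apache, AZ"),
        Event("Death", "19 Sep 1952"),
    ]),
    # Made-up record, included only for illustration.
    Person("Jane Example", "F", [Event("Birth", "1 Jan 1900", "Nowhere, XX")]),
]

walkers = filter_results(EXTRACTED, surname="Walker")
```

Criteria that are left out place no constraint, so calling `filter_results(EXTRACTED)` returns every extracted person.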

16 Conclusion
- Integrates and builds on previous DEG work
- Extracts from:
  - Semi-structured or unstructured text
  - Tables
  - Forms
- Scalable: only searches probable pages
- Robust to changes in pages
- Ontology based: easily adapted to other domains

17 [System overview diagram, repeated]

