1 Automating the Extraction of Genealogical Information from the Web GeneTIQS Troy Walker & David W. Embley Family History Technology Conference March.


1 Automating the Extraction of Genealogical Information from the Web
GeneTIQS
Troy Walker & David W. Embley
Family History Technology Conference, March 25, 2004
Research funded by NSF grant #IIS-0083127

2 Genealogical Information on the Web
- Hundreds of thousands of sites
- Some professional (Ancestry.com, Familysearch.org)
- Mostly hobbyist (203,200 indexed by Cyndislist.com)
- Search engines
- “Walker genealogy” on Google: 199,000 results
- 1 page/minute = 5 months to go through
- Why not enlist the help of a computer?

3 Problems
- No standard way of presenting data
- Sites have differing schemas
- Web pages change
- New pages continuously come on line

4 GeneTIQS
- Based on work done by BYU DEG
- Able to extract from:
- Single-record documents
- Simple multiple-record documents
- Complex multiple-record documents
- Robust to changes in pages
- Immediately works for new pages

5 Person Ontology

6 Value Matchers

7 Record Separation
- Separating data related to each person
- Previous technique
- Combines many heuristics
- Has problems
- Assumes multiple records
- Must be simple separation

8 Single-Record Document

9 Simple Multiple-Record Document

10 Complex Multiple-Record Document

11 Vector Space Modeling
- Ontology vector
- Compare to candidate records
- Cosine measure
- Magnitude measure

12 Ontology Vector
{0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0}

13 Vector Space Modeling
[Figure: an HTML document (<!DOCTYPE…><html> … header … ) annotated with candidate-record vectors]
Labels on the figure: Gender, Christening, Burial, Marriage, Relation, Birth, Name, Death
{0, 0, 0, 0, 0, 0, 0, 0}
{0, 141, 89, 76, 0, 0, 48, 23}
{0, 1, 0, 0, 0, 0, 0, 0}
{0, 140, 89, 76, 0, 0, 48, 23}
{0, 0, 0, 0, 0, 0, 0, 0}
{0, 138, 88, 76, 0, 0, 48, 23}
…
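Note: the two measures named on slide 11 can be sketched in a few lines of Python, using the ontology vector from slide 12 and one of the candidate vectors from slide 13. The slides do not define the magnitude measure, so the ratio of vector lengths used here is an assumption, and the function names are hypothetical rather than taken from GeneTIQS.

```python
import math

# Ontology vector as given on slide 12.
ONTOLOGY = [0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0]

# One candidate-record vector as shown on slide 13.
CANDIDATE = [0, 141, 89, 76, 0, 0, 48, 23]


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine_measure(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    na, nb = norm(a), norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def magnitude_measure(candidate, ontology):
    """ASSUMPTION: magnitude measure taken as |candidate| / |ontology|."""
    return norm(candidate) / norm(ontology)


if __name__ == "__main__":
    print("cosine:   ", cosine_measure(CANDIDATE, ONTOLOGY))
    print("magnitude:", magnitude_measure(CANDIDATE, ONTOLOGY))
```

Under this reading, a cosine near 1 means the candidate has roughly the same mix of fields as the ontology expects, while the magnitude indicates how much data the candidate holds relative to a single expected record.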

14 Improvements
- Differing schemas
- Low cosine measures
- Discarded data
- Prune dimensions
  {0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0}
  {0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0}
- Richness of data in single-record documents
- High magnitude measure
- Higher magnitude to split documents
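Note: a minimal sketch of how pruning dimensions could raise a low cosine measure, using the two vectors shown on this slide. The slide does not say which dimensions are pruned; dropping every dimension where the candidate vector is zero is an assumption made here for illustration, and `prune_dimensions` is a hypothetical name.

```python
import math

ONTOLOGY = [0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0]
CANDIDATE = [0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0]


def cosine(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def prune_dimensions(ontology, candidate):
    """ASSUMPTION: keep only the dimensions where the candidate is non-zero."""
    kept = [(o, c) for o, c in zip(ontology, candidate) if c != 0]
    return [o for o, _ in kept], [c for _, c in kept]


if __name__ == "__main__":
    pruned_ontology, pruned_candidate = prune_dimensions(ONTOLOGY, CANDIDATE)
    print("before pruning:", cosine(ONTOLOGY, CANDIDATE))
    print("after pruning: ", cosine(pruned_ontology, pruned_candidate))
```

With these vectors the cosine is higher after pruning, because fields that the site's schema never provides no longer count against the match.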

15 Demonstration

16 Presenting Results

17 Preliminary Results
- Semi-structured text
- 10 single-record documents
- 3 simple documents containing 268 records
- 3 complex documents containing 266 records
- Precision and recall for record separation

18 Record Separation
           Recall    Precision
Single     100%      94.1%
Simple     94.7%     97.3%
Complex    88.3%     93.6%

19 Conclusion
- Integrate, build on previous DEG work
- Accurate record separation
- Average recall: 94.3%
- Average precision: 95.0%
- Ontology based
- Robust to changes in pages
- Immediately works with new pages
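Note: the average recall and precision quoted in the conclusion match the unweighted mean of the three rows in the record-separation table on slide 18. The short check below reproduces that arithmetic; the variable and function names are illustrative only.

```python
# (recall %, precision %) per document type, from the slide 18 table.
RESULTS = {
    "Single": (100.0, 94.1),
    "Simple": (94.7, 97.3),
    "Complex": (88.3, 93.6),
}


def unweighted_average(results):
    """Mean recall and mean precision across document types, each weighted equally."""
    recalls = [recall for recall, _ in results.values()]
    precisions = [precision for _, precision in results.values()]
    return sum(recalls) / len(recalls), sum(precisions) / len(precisions)


avg_recall, avg_precision = unweighted_average(RESULTS)
print(f"Average recall: {avg_recall:.1f}%")        # Average recall: 94.3%
print(f"Average precision: {avg_precision:.1f}%")  # Average precision: 95.0%
```

Each document type counts equally in this average, regardless of how many records it contains.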

