1 Automating the Extraction of Domain-Specific Information from the Web A Case Study for the Genealogical Domain Troy Walker Thesis Defense November 19,


1 Automating the Extraction of Domain-Specific Information from the Web: A Case Study for the Genealogical Domain
Troy Walker
Thesis Defense, November 19, 2004
Research funded by NSF grant #IIS-0083127

2 Genealogical Information on the Web
- Hundreds of thousands of sites
  - Some professional (Ancestry.com, Familysearch.org)
  - Mostly hobbyist (240,400 indexed by Cyndislist.com)
- Search engines
  - “Walker genealogy” on Google: 523,000 results
  - 1 page/minute = 1 year to go through
- Why not enlist the help of a computer?

3 Problems
- No standard way of presenting data
- Sites have differing schemas
- Web pages change
- New pages continuously come on line

4 GeneTIQS
- Based on work done by BYU DEG
- Able to extract from:
  - Single-record documents
  - Simple multiple-record documents
  - Complex multiple-record documents
- Robust to changes in pages
- Immediately works for new pages

5 Person Ontology

6

7 Value Matchers

8 Record Separation
- Separating data related to each person
- Previous technique
  - Combines many heuristics
  - Has problems
    - Assumes multiple records
    - Must be simple separation

9 Single-Record Document

10 Simple Multiple-Record Document

11 Complex Multiple-Record Document

12 Vector Space Modeling
- Ontology Vector
- Compare to candidate records
  - Cosine measure
  - Magnitude measure

13 Ontology Vector
{0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0}
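The slides name a cosine measure and a magnitude measure but do not define them. The following is a minimal sketch, assuming the cosine measure is the standard cosine similarity and the magnitude measure is the candidate vector's Euclidean length relative to the ontology vector's length. The function names and the magnitude definition are illustrative assumptions, not taken from the thesis. The candidate vector is one of the vectors shown on slide 14.

```python
import math

# Ontology vector as given on slide 13.
ONTOLOGY_VECTOR = [0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))


def magnitude(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def cosine_measure(candidate, ontology=ONTOLOGY_VECTOR):
    """Standard cosine similarity; returns 0.0 for an all-zero vector."""
    denom = magnitude(candidate) * magnitude(ontology)
    return dot(candidate, ontology) / denom if denom else 0.0


def magnitude_measure(candidate, ontology=ONTOLOGY_VECTOR):
    """ASSUMED definition: candidate length divided by ontology length."""
    return magnitude(candidate) / magnitude(ontology)


# One candidate-record vector from slide 14.
candidate = [0, 1, 1, 1, 0, 0, 0, 0, 0]
print(f"cosine:    {cosine_measure(candidate):.4f}")
print(f"magnitude: {magnitude_measure(candidate):.4f}")
```

Under these assumptions, a candidate whose counts are proportional to the ontology vector scores a cosine of 1, while the magnitude measure grows with the amount of matched data, which is why a whole document's vector scores far higher than a single record's.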

14 Vector Space Modeling
[Figure: HTML document tree (<!DOCTYPE…><html> … header …) annotated with vectors]
{0, 0, 0, 0, 0, 0, 0, 0, 0}
{0, 149, 89, 76, 0, 0, 48, 23, 23}
{0, 1, 0, 0, 0, 0, 0, 0, 0, 0}
{0, 148, 89, 76, 0, 0, 48, 23, 23}
{0, 0, 0, 0, 0, 0, 0, 0, 0}
{0, 146, 88, 76, 0, 0, 48, 23, 23}
…
{0, 1, 1, 1, 0, 0, 0, 0, 0}
Dimension labels (order as extracted): Gender Christening Burial Marriage Relation Name Birth Name Death Relationship

15 Problems and Improvements
- Differing schemas
  - Low cosine measures
  - Discarded data
  - Prune dimensions
    {0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0}
    {0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0, 23.0}
- Richness of data in single-record documents
  - High magnitude measure
  - Higher magnitude to split documents
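The slide does not spell out how dimensions are pruned. One plausible reading, sketched here as an assumption, is that any dimension with a zero count in the whole-document vector is dropped from both the ontology vector and each candidate vector before the cosine is computed. The helper names are hypothetical. The two vectors are the ones shown on this slide, and the candidate is taken from slide 14.

```python
import math


def cosine(a, b):
    """Standard cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune_dimensions(document_vector, *vectors):
    """Drop every dimension whose document-wide count is zero (assumed rule)."""
    keep = [i for i, count in enumerate(document_vector) if count != 0]
    return [[v[i] for i in keep] for v in vectors]


ontology = [0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0]
document = [0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0, 23.0]
candidate = [0, 1, 1, 1, 0, 0, 0, 0, 0]

pruned_ontology, pruned_candidate = prune_dimensions(document, ontology, candidate)

print("kept dimensions:", len(pruned_ontology))
print(f"cosine before pruning: {cosine(candidate, ontology):.4f}")
print(f"cosine after pruning:  {cosine(pruned_candidate, pruned_ontology):.4f}")
```

Because a candidate record is part of the document, it is zero wherever the document is zero, so pruning leaves the dot product unchanged and can only shrink the ontology vector's length. Under this reading the cosine never decreases, which matches the slide's stated aim of countering low cosine measures on sites whose schema omits some fields.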

16 Problems and Improvements
- Missed Simple Patterns
  - More than 3 records
  - Valid Records:Total Records > 2:3
  - Keep all
- Discard header and footer
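The "keep all" rule can be read as follows: when a split yields more than 3 candidate records and more than two thirds of them are valid, every candidate is kept instead of only the valid ones. The sketch below encodes that reading. The function name, the `is_valid` predicate, and the fallback of keeping only valid records are assumptions made for illustration and are not stated on the slide.

```python
def select_records(candidates, is_valid):
    """Apply the assumed 'keep all' rule to a list of candidate records.

    candidates: list of candidate records (any type).
    is_valid:   hypothetical predicate deciding whether a record is valid.
    """
    valid = [record for record in candidates if is_valid(record)]
    total = len(candidates)
    # Valid:Total > 2:3, written with integers to avoid float rounding.
    ratio_ok = 3 * len(valid) > 2 * total
    if total > 3 and ratio_ok:
        return list(candidates)  # keep all, including the "invalid" ones
    return valid  # assumed fallback: keep only the valid records


# Toy usage: a record counts as valid here if it is a non-empty string.
records = ["John Walker b. 1820", "Mary Walker b. 1822", "", "Ann Walker b. 1825"]
print(select_records(records, is_valid=bool))
```

With 3 valid records out of 4 the ratio exceeds 2:3, so the empty candidate is retained as well. With 4 valid out of 6 the ratio equals 2:3 exactly and only the valid records would be returned.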

17 Demonstration

18 Presenting Results

19 Evaluation
- Semi-structured Text
  - 21 single-record documents
  - 10 simple documents containing 130 records
  - 20 complex documents with 238 records
- Precision and recall for record separation
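For reference, precision and recall for record separation can be computed from three counts per document: records actually present, records returned by the separator, and returned records that are correct. The sketch below uses the usual definitions (precision = correct / returned, recall = correct / records). The function name is illustrative, and the sample counts are the single-record totals reported on slide 20.

```python
def precision_recall(records, returned, correct):
    """Return (precision, recall) as fractions in the range 0.0 to 1.0.

    records:  number of records actually in the document(s)
    returned: number of records the separator produced
    correct:  number of returned records that are correct
    """
    precision = correct / returned if returned else 0.0
    recall = correct / records if records else 0.0
    return precision, recall


# Totals for the single-record documents: 21 records, 26 returned, 19 correct.
precision, recall = precision_recall(records=21, returned=26, correct=19)
print(f"precision: {precision:.2%}")
print(f"recall:    {recall:.2%}")
```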

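20 Single-Record Documents

Document  Records  Returned  Correct  Precision   Recall
single1         1         1        1    100.00%  100.00%
single2         1         1        1    100.00%  100.00%
single3         1         1        1    100.00%  100.00%
single4         1         1        1    100.00%  100.00%
single5         1         1        1    100.00%  100.00%
single6         1         1        1    100.00%  100.00%
single7         1         1        1    100.00%  100.00%
single8         1         4        0      0.00%    0.00%
single9         1         1        1    100.00%  100.00%
single10        1         1        1    100.00%  100.00%
single11        1         1        1    100.00%  100.00%
single12        1         3        0      0.00%    0.00%
single13        1         1        1    100.00%  100.00%
single14        1         1        1    100.00%  100.00%
single15        1         1        1    100.00%  100.00%
single16        1         1        1    100.00%  100.00%
single17        1         1        1    100.00%  100.00%
single18        1         1        1    100.00%  100.00%
single19        1         1        1    100.00%  100.00%
single20        1         1        1    100.00%  100.00%
single21        1         1        1    100.00%  100.00%
Total          21        26       19     73.08%   90.48%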

21 Simple Multiple-Record Documents

VSM Separator

Document  Records  Returned  Correct  Precision   Recall
simple1        19        20       19     95.00%  100.00%
simple2        19        17       17    100.00%   89.47%
simple3        11        11       11    100.00%  100.00%
simple4         9         9        9    100.00%  100.00%
simple5        12        13       11     84.62%   91.67%
simple6        12        11       10     90.91%   83.33%
simple7        14        10       10    100.00%   71.43%
simple8         5         7        5     71.43%  100.00%
simple9        14        14       14    100.00%  100.00%
simple10       15        15       15    100.00%  100.00%
Total         130       127      121     95.28%   93.08%

Highest-Fanout Separator

Document  Records  Returned  Correct  Precision   Recall
simple1        19        22       19     86.36%  100.00%
simple2        19        20        0      0.00%    0.00%
simple3        11        14       11     78.57%  100.00%
simple4         9        10        9     90.00%  100.00%
simple5        12        16       12     75.00%  100.00%
simple6        12        23        9     39.13%   75.00%
simple7        14        22       13     59.09%   92.86%
simple8         5        10        0      0.00%    0.00%
simple9        14        16       14     87.50%  100.00%
simple10       15        16        0      0.00%    0.00%
Total         130       169       87     51.48%   66.92%

22 Complex Multiple-Record Documents

Document   Records  Returned  Missed  Extra  Correct  Precision   Recall
complex1        10        10       0      0       10    100.00%  100.00%
complex2        15        15       0      0       15    100.00%  100.00%
complex3        12        12       0      0       12    100.00%  100.00%
complex4         7         9       1      3        6     66.67%   85.71%
complex5        16        15       1      0       15    100.00%   93.75%
complex6        15        16       2      3       13     81.25%   86.67%
complex7        13        12       1      0       12    100.00%   92.31%
complex8        10        10       0      0       10    100.00%  100.00%
complex9        19        20       1      2       18     90.00%   94.74%
complex10       10        10       1      1        9     90.00%   90.00%
complex11       15        11       4      0       11    100.00%   73.33%
complex12       15        15       0      0       15    100.00%  100.00%
complex13       11        11       0      0       11    100.00%  100.00%
complex14       16        18       1      3       15     83.33%   93.75%
complex15        8         8       2      2        6     75.00%   75.00%
complex16        8         9       0      1        8     88.89%  100.00%
complex17       10        11       0      0       11    100.00%  110.00%
complex18        4         1       3      0        1    100.00%   25.00%
complex19        8        11       0      3        8     72.73%  100.00%
complex20       16        13       4      1       12     92.31%   75.00%
Total          238       237      21     19      218     91.98%   91.60%

23 Conclusion
- Integrate, build on previous DEG work
- Accurate record separation
  - Average recall: 92%
  - Average precision: 93%
- Ontology based
  - Robust to changes in pages
  - Immediately works with new pages

24 Future Work
- Scale
  - Distribute computation
  - Intelligent URL selector
- More Sources
  - Tables
  - Forms and dynamic pages
  - Obtain more information behind links

25 Future Work
- Improve VSM record separation
  - Weight object importance
  - Disambiguate before record separation
  - Recognize patterns
- Improve detection of single-record documents

