1 Automating the Extraction of Domain-Specific Information from the Web: A Case Study for the Genealogical Domain
Troy Walker
Thesis Defense, November 19, 2004
Research funded by NSF grant #IIS
2 Genealogical Information on the Web
- Hundreds of thousands of sites
  - Some professional (Ancestry.com, Familysearch.org)
  - Mostly hobbyist (240,400 indexed by Cyndislist.com)
- Search engines
  - “Walker genealogy” on Google: 523,000 results
  - 1 page/minute = 1 year to go through
- Why not enlist the help of a computer?
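The one-year estimate can be checked with a line of arithmetic. This is a minimal sketch that assumes nonstop reading (24 hours a day) at one result page per minute; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the slide's estimate:
# 523,000 search results at 1 page per minute, read around the clock.
results = 523_000
minutes_per_day = 60 * 24

days = results / minutes_per_day  # total reading time in days
print(f"{days:.1f} days of nonstop reading")
```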
3 Problems
- No standard way of presenting data
- Sites have differing schemas
- Web pages change
- New pages continuously come on line
4 GeneTIQS
- Based on work done by BYU DEG
- Able to extract from:
  - Single-record documents
  - Simple multiple-record documents
  - Complex multiple-record documents
- Robust to changes in pages
- Immediately works for new pages
5 Person Ontology
6
7 Value Matchers
8 Record Separation
- Separating data related to each person
- Previous technique
  - Combines many heuristics
  - Has problems
    - Assumes multiple records
    - Must be simple separation
9 Single-Record Document
10 Simple Multiple-Record Document
11 Complex Multiple-Record Document
12 Vector Space Modeling
- Ontology vector
- Compare to candidate records
  - Cosine measure
  - Magnitude measure
13 Ontology Vector { 0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0}
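The two measures can be sketched against the ontology vector above. This is a minimal illustration, not the GeneTIQS implementation: the function names are hypothetical, and the magnitude measure is assumed here to be the ratio of the candidate vector's length to the ontology vector's length.

```python
import math
from typing import Sequence

# Ontology vector as shown on the slide (nine dimensions).
ONTOLOGY = [0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0]


def norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine_measure(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    na, nb = norm(a), norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def magnitude_measure(candidate: Sequence[float], ontology: Sequence[float]) -> float:
    """Assumed definition: candidate length relative to ontology length."""
    return norm(candidate) / norm(ontology)


# A candidate record with one match in each of three dimensions.
candidate = [0, 1, 1, 1, 0, 0, 0, 0, 0]
cos = cosine_measure(candidate, ONTOLOGY)
mag = magnitude_measure(candidate, ONTOLOGY)
print(f"cosine={cos:.3f} magnitude={mag:.3f}")
```

Under this reading, the cosine measure captures how closely a candidate's mix of fields resembles a typical record, while the magnitude measure roughly indicates how much data the candidate holds relative to one record.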
14 Vector Space Modeling
Figure: HTML document (<!DOCTYPE…><html> … header …) with a vector for each subtree:
{0, 0, 0, 0, 0, 0, 0, 0, 0}
{0, 149, 89, 76, 0, 0, 48, 23, 23}
{0, 1, 0, 0, 0, 0, 0, 0, 0, 0}
{0, 148, 89, 76, 0, 0, 48, 23, 23}
{0, 0, 0, 0, 0, 0, 0, 0, 0}
{0, 146, 88, 76, 0, 0, 48, 23, 23}
…
{0, 1, 1, 1, 0, 0, 0, 0, 0}
Dimension labels: Gender, Christening, Burial, Marriage, Relation Name, Birth, Name, Death, Relationship
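In the vectors on this slide, a parent's counts equal the sum of its children's counts (for example, 1 + 148 = 149 in the second dimension). A minimal sketch of how such per-subtree vectors might be computed follows. The `Node` class, the three regular expressions, and the sample text are all hypothetical stand-ins; the real system's value matchers and ontology dimensions are not reproduced here.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical value matchers for three illustrative dimensions.
MATCHERS = [
    ("Name", re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")),
    ("Birth", re.compile(r"\bb\. *\d{4}\b")),
    ("Death", re.compile(r"\bd\. *\d{4}\b")),
]


@dataclass
class Node:
    """A simplified document-tree node: its own text plus child nodes."""
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def subtree_vector(node: Node) -> List[int]:
    """Count matches per dimension in this node's text and all descendants."""
    vec = [len(pattern.findall(node.text)) for _, pattern in MATCHERS]
    for child in node.children:
        child_vec = subtree_vector(child)
        vec = [a + b for a, b in zip(vec, child_vec)]
    return vec


header = Node("Walker Family")
body = Node(children=[
    Node("John Walker b. 1820 d. 1890"),
    Node("Mary Smith b. 1825"),
])
root = Node(children=[header, body])

for label, node in [("root", root), ("header", header), ("body", body)]:
    print(label, subtree_vector(node))
```

Note that the header yields a spurious name match here, much as the header subtree on the slide carries a single count.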
15 Problems and Improvements
- Differing schemas
  - Low cosine measures
  - Discarded data
  - Prune dimensions
    {0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0}
    {0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0, 23.0}
- Richness of data in single-record documents
  - High magnitude measure
  - Higher magnitude to split documents
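The pruning idea can be sketched with the two vectors on this slide. The sketch assumes that "prune dimensions" means dropping every dimension in which the document vector is zero before taking the cosine; the helper names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def prune(ontology: Sequence[float],
          document: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Assumed rule: keep only dimensions where the document has matches."""
    kept = [i for i, count in enumerate(document) if count != 0]
    return [ontology[i] for i in kept], [document[i] for i in kept]


ontology = [0.8, 0.99, 0.95, 0.9, 0.6, 0.5, 0.6, 3.0, 3.0]
document = [0.0, 141.0, 89.0, 76.0, 0.0, 0.0, 48.0, 23.0, 23.0]

before = cosine(ontology, document)
pruned_ontology, pruned_document = prune(ontology, document)
after = cosine(pruned_ontology, pruned_document)
print(f"cosine before pruning: {before:.3f}, after pruning: {after:.3f}")
```

Dropping zero-count dimensions leaves the dot product and the document's length unchanged and can only shorten the ontology vector, so the cosine never decreases under this rule.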
16 Problems and Improvements
- Missed simple patterns
  - More than 3 records
  - Valid Records:Total Records > 2:3
  - Keep all
  - Discard header and footer
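One possible reading of this rule, as a sketch: when a document yields more than 3 candidate records and more than two thirds of them are valid, keep every candidate (including the ones that failed validation) except a leading header and trailing footer. The function name, the `is_valid` callback, and the choice to treat leading and trailing invalid candidates as header and footer are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def apply_simple_pattern(candidates: Sequence[str],
                         is_valid: Callable[[str], bool],
                         min_records: int = 3) -> List[str]:
    """Keep all candidates when the valid:total ratio exceeds 2:3."""
    flags = [is_valid(c) for c in candidates]
    total, valid = len(flags), sum(flags)

    # Integer form of valid / total > 2 / 3, avoiding float rounding.
    if total > min_records and valid * 3 > total * 2:
        # Assumed header/footer handling: trim invalid candidates at both ends.
        start = 0
        while start < total and not flags[start]:
            start += 1
        end = total
        while end > start and not flags[end - 1]:
            end -= 1
        return list(candidates[start:end])

    # Otherwise fall back to keeping only the valid candidates.
    return [c for c, ok in zip(candidates, flags) if ok]


def looks_like_record(text: str) -> bool:
    """Hypothetical validity test used only for this example."""
    return text.startswith("r")


page = ["header", "r1", "r2", "odd", "r3", "r4", "r5", "r6", "r7", "footer"]
print(apply_simple_pattern(page, looks_like_record))
```

In the example, 7 of 10 candidates are valid, so the interior "odd" candidate is kept while the header and footer are dropped.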
17 Demonstration
18 Presenting Results
19 Evaluation
- Semi-structured text
  - 21 single-record documents
  - 10 simple documents containing 130 records
  - 20 complex documents with 238 records
- Precision and recall for record separation
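For reference, precision and recall for record separation can be computed from three counts: records actually present, records returned, and returned records that are correct. The sketch below uses the standard definitions; the counts in the example are made up for illustration and are not taken from the evaluation.

```python
def precision(correct: int, returned: int) -> float:
    """Fraction of returned records that are correct."""
    return correct / returned if returned else 0.0


def recall(correct: int, records: int) -> float:
    """Fraction of actual records that were correctly returned."""
    return correct / records if records else 0.0


# Illustrative counts only: 10 records present, 9 returned, 8 of those correct.
records, returned, correct = 10, 9, 8
print(f"precision={precision(correct, returned):.2%} "
      f"recall={recall(correct, records):.2%}")
```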
20 Single-Record Documents
Table columns: records, returned, correct, precision, recall
Rows: 21 documents labeled "single"
Total: precision …, recall 90.48%
21 Simple Multiple-Record Documents
Table columns: records, returned, correct, precision, recall

VSM Separator
document   precision   recall
simple     …           100.00%
simple     …           89.47%
simple     …           …
simple     …           …
simple     …           91.67%
simple     …           83.33%
simple     …           71.43%
simple     …           100.00%
simple     …           …
simple     …           …
Total      …           93.08%

Highest-Fanout Separator
document   precision   recall
simple     …           100.00%
simple     …           …
simple     …           100.00%
simple     …           100.00%
simple     …           100.00%
simple     …           75.00%
simple     …           92.86%
simple     …           …
simple     …           100.00%
simple     …           …
Total      …           66.92%
22 Complex Multiple-Record Documents
Table columns: records, returned, missed, extra, correct, precision, recall

document   precision   recall
complex    …           …
complex    …           …
complex    …           …
complex    …           85.71%
complex    …           93.75%
complex    …           86.67%
complex    …           92.31%
complex    …           …
complex    …           94.74%
complex    …           …
complex    …           73.33%
complex    …           …
complex    …           …
complex    …           93.75%
complex    …           …
complex    …           100.00%
complex    …           110.00%
complex    …           25.00%
complex    …           100.00%
complex    …           75.00%
Total      …           91.60%
23 Conclusion
- Integrate, build on previous DEG work
- Accurate record separation
  - Average recall: 92%
  - Average precision: 93%
- Ontology based
  - Robust to changes in pages
  - Immediately works with new pages
24 Future Work
- Scale
  - Distribute computation
  - Intelligent URL selector
- More sources
  - Tables
  - Forms and dynamic pages
  - Obtain more information behind links
25 Future Work
- Improve VSM record separation
  - Weight object importance
  - Disambiguate before record separation
  - Recognize patterns
  - Improve detection of single-record documents