Presentation is loading. Please wait.

Presentation is loading. Please wait.

Automatic Extraction From and Reasoning About Genealogical Records: A Prototype By Charla J. Woodbury A thesis submitted to the faculty of Brigham Young.

Similar presentations


Presentation on theme: "Automatic Extraction From and Reasoning About Genealogical Records: A Prototype By Charla J. Woodbury A thesis submitted to the faculty of Brigham Young."— Presentation transcript:

1 Automatic Extraction From and Reasoning About Genealogical Records: A Prototype By Charla J. Woodbury A thesis submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of Master of Science

2 2 2 Digital Images – Human Index Large number of competing family history websites Digital images Human indexes Researchers hunting through records and indexes to put families together

3 3 3 Problem Large amounts of primary genealogical data Large amounts of primary genealogical data Big projects to index and extract records Big projects to index and extract records Two independent indexers and adjudication Two independent indexers and adjudication Millions of human hours used to index or match records for names and families Millions of human hours used to index or match records for names and families

4 4 4 Automated Extraction Solution Create a specialized extraction ontology to interpret and label genealogical data Create a specialized extraction ontology to interpret and label genealogical data Add rules and logic that Add rules and logic that Label family roles - husband, daughter, etc. Label family roles - husband, daughter, etc. Link family relationships Link family relationships HUSBAND – WIFE HUSBAND – WIFE PARENT – CHILD PARENT – CHILD

5 5 Outline 1. Data Preparation 2. Ontology Extraction System (OntoES) 3. OWL File and SWRL Rules 4. SPARQL Queries 5. Experimental Results 6. Conclusions 5

6 6 6 1. Data Preparation Collect machine-readable records from three different countries Collect machine-readable records from three different countries Format in HTML format for extraction Format in HTML format for extraction Prepare lexicons for names, places, etc. Prepare lexicons for names, places, etc.

7 7 7 New England Vital Records – Beverly, Massachusetts 1668-1849

8 8 8 Danish Parish – Maglebye, Praesto 1646-1813

9 9 9 English Parish – South Petherton, Somersetshire 1574-1901

10 10 same day 1576 Nicholas Patch and Christian Denman 26 Jan 1605 Richard Patch and Joan Lavor 25-Sep 1613 John Elliott and Joan Woodbery 7-Aug 1615 Thomas Prime and Maria Parry 29-Jan 1616 William Woodbery and Elizabeth Patch 2-May 1620 William Hillerd and Fortu: Patch 17-Sep 1622 Nicholas Patch and Elizabeth Owsley 22-Jan 1627 Richard Patch and Mary White 15-Jan 1630 Andrew Elliott and Joan Patch 12-Feb 1639 Andrew Elliott and Joan Pitts SOUTH PETHERTON MARRIAGES (from genuki)

11 11 Transcript of the original record 1576/1577 eodem die Nicholaus Patch Christinam Denman 26 Jan 1605 Richard Patch et Joanna Lavor 1613 Septembris 26 Johannes Elliott et Joanna Woodbery matrimonis cominguntur 1615 Augusti 7 Thoms Prime et Maria Patch matrimonio cominguntur 1616/1617 Januarij 29 Wilhelmus Woodbery et Elizabetha Patch matrimonio cominguntur 1620 Maij 2 : Wilhelmus Hillerd et Fortu: Patch 1622 Septembris 17 Nicholas Patch et Elizabetha Owsley matrimonio cominguntur 1627/1628 Januarij 22 : Richardus Patch et Maria White matrimonio cominguntur 1630/1631 Januarij 15 Andreas Elliott et Joanna Patch matrimonio cominguntur 1639/1640 Februarij 12 Andreas Elliott et Joanna Pittes matrimonio cominguntur

12 12 2. Ontology Extraction System OntoES : automatically interpret and correctly label genealogical data using Data frames Data frames Regular expressions Regular expressions Lexicons Lexicons Date conversion methods Date conversion methods

13 13 Marriage Ontology

14 14 Data Frame Editor

15 15 Regular expressions MARDATE Value expression EXAMPLE type 25-Sep 1613 (0\d|1\d|2\d|30|31|\d)-{Month}\.?\s*(\d\d\d\d) Keyword expression (\b(md\.?|marry|marriage|married|maried|wed|we dding)\b)

16 16 Sample MONTH LEXICON 1Ober 1Ober 7ber 7ber 8ber 8ber 9ber 9ber apr apr april april aprilis aprilis aug aug august august augusti augusti augustus augustus avr avr avril avril avrilis avrilis dec dec december december decembr decembre decembri feb febr februari february jan januarij january jul juli julius july jun june

17 17 Object Level

18 18 CANONICALIZATION METHODS inside the ontology Regularize date (Julian format: YYYYddd) Regularize date (Julian format: YYYYddd) 1620 2-May 1620093 Display stored Julian format as DD MMM YYYY 1620093 2 MAY 1620

19 19 Feast Dates Dates expressed as a holy day Fixed Dates Fixed Dates Christmas 1720 25 DEC 1720 Moveable Dates around Easter Moveable Dates around Easter (36 possible Easter dates with leap year variation) (36 possible Easter dates with leap year variation) 1723 Dnica Septuagesima 24 JAN 1723 1723 Dnica Septuagesima 24 JAN 1723 Same day as previous entry Same day as previous entry

20 20 Run Ontology Run Ontology Input Input Ontology (Created with OntoES) Ontology (Created with OntoES) HTML data (Hypertext Markup Language) HTML data (Hypertext Markup Language) Output Output RDF database (Resource Description Format) RDF database (Resource Description Format) OWL file (Ontology Web Language) OWL file (Ontology Web Language)

21 21 Ontology Workbench

22 22 Extracted Marriages Bet Date MarDateNameMNameFNameU same day 1576Nicholas Patch Christian Denman 26 JAN 1605Richard PatchJoan Lavor 26 SEP 1613John ElliottJoan Woodbery 7 AUG 1615Thomas PrimeMaria Parry 29 JAN 1616William WoodberyElizabeth Patch 2 MAY 1620William HillerdFortu: Patch 17 SEP 1622Nicholas PatchElizabeth Owlsey 22 JAN 1627Richard PatchMary White 16 JAN 1630Andrew ElliottJoan Patch 12 FEB 1639Andrew ElliottJoan Pitts

23 23 3. OWL File and SWRL Rules OWL HEADER PERSON - NAMEU

24 24 Sample RDF Triples Person_10| sameAs| Person_10 Person_10| type| Thing Person_10| type| Person NameU_0| NameUValue | Christian Denman NameU_0| sameAs| NameU_0 NameU_0| type| Thing NameU_0|type| NameU NameM_4| NameMValue | Nicholas Patch NameM_4| sameAs| NameM_4 NameM_4| type| Thing NameM_4|type| NameM

25 25 SWRL Rules Define OWL Class Define OWL Class Example – Husband Example – Husband Define Rule Define Rule Example – Person with male name is a Husband Example – Person with male name is a Husband Person-NameM(?x,?y) -> Husband(?x) Person-NameM(?x,?y) -> Husband(?x) ?x ?y

26 26 Marriage – Person_10 to Person_4 Person_10 MarriageRecord_7 NameM_4 Nicholas Patch Person_4

27 27 Rule HEAD in OWL file

28 28 Rule BODY in OWL file

29 29 Related Rules NameF is populated then value in NameU is Husband NameF is populated then value in NameU is Husband Person-NameF(?w,?v) MarriageRecord-Person(?z,?w) Person-NameF(?w,?v) MarriageRecord-Person(?z,?w) MarriageRecord-Person(?z,?x) Person-NameU(?x,?y) -> Husband(?x) -> Husband(?x) ?x ?z ?w ?v ?y

30 30 HusbandOf Rule Husband(?x) Wife(?y) MarriageRecord-Person(?z,?x) MarriageRecord-Person(?z,?y) MarriageRecord-Person(?z,?y) -> HusbandOf(?x,?y) -> HusbandOf(?x,?y)

31 31 Auxiliary Name Rules 31 NameM(?x) -> Name(?x) NameF(?x) -> Name(?x) NameU(?x) -> Name(?x) NameMValue(?x) -> NameValue(?x) NameFValue(?x) -> NameValue(?x) NameUValue(?x) -> NameValue(?x) Person-NameM(?x,?y) -> Person-Name(?x,?y) Person-NameF(?x,?y) -> Person-Name(?x,?y) Person-NameU(?x,?y) -> Person-Name(?x,?y)

32 32 4.SPARQL Query Who is Husband of Christian Denman? PREFIX : http://www.deg.byu.edu/ontology/Marriage# http://www.deg.byu.edu/ontology/Marriage# SELECT ?Husband WHERE{ ?X :NameValue "Christian Denman". ?X :NameValue "Christian Denman". ?Y :Person-Name ?X. ?Y :Person-Name ?X. ?W :HusbandOf ?Y. ?W :HusbandOf ?Y. ?W :Person-Name ?V. ?W :Person-Name ?V. ?V :NameValue ?Husband ?V :NameValue ?Husband}

33 33 Query Results Husband======================================= "Nicholas Patch"^^http://www.w3.org/2001/XMLSchema#string

34 34 Query Results Husband======================================= "Nicholas Patch"^^http://www.w3.org/2001/XMLSchema#string South Petherton Marriages same day 1576 Nicholas Patch and Christian Denman 26 Jan 1605 Richard Patch and Joan Lavor 25-Sep 1613 John Elliott and Joan Woodbery 7-Aug 1615 Thomas Prime and Maria Parry 29-Jan 1616 William Woodbery and Elizabeth Patch 2-May 1620 William Hillerd and Fortu: Patch 17-Sep 1622 Nicholas Patch and Elizabeth Owsley 22-Jan 1627 Richard Patch and Mary White 15-Jan 1630 Andrew Elliott and Joan Patch 12-Feb 1639 Andrew Elliott and Joan Pitts Nicholas Patch because: NameValue(Nicholas Patch) and Name-NameValue(n1, Nicholas Patch) and Name(n1) is NameM(n1) and Person-NameM(p1, n1) NameValue(Christian Denman) and Name-NameValue(n2, Christian Denman) and Name(n2) is NameU(n2) and Person-NameU(p2, n2) Husband(p1) because: Person-NameM(p1, n1) Wife(p2) because: Person-NameU(p2, n2) and Person-MarriageRecord(p2, r1) and MarriageRecord-Person(r1, p1) and Person-NameM(p1, n1) HusbandOf(p1, p2) because: Husband(p1) and Wife(p1) and MarriageRecord-Person(r1, p1) and MarriageRecord-Person(r1, p1)

35 35 5. Experimental Results Extraction Results Extraction Results American Extraction Problem American Extraction Problem Rule Results Rule Results

36 36 Extraction Results MARRIAGESENTITIESRECALL%ERRORSPRECISION English18859458899.0%898.7% American6081824163089.4%3498.0% Danish17154353899.1%1098.2% BIRTHS English31539489939499.0%6199.4% American6752055180988.0%3398.2% Danish6772061204299.1%1599.3% DEATHS English34588675858999.0%8399.0% American5101305114888.0%2897.6% Danish8332113209399.1%1999.1%

37 37 American Difficulty BIRTH WOODBURY, Charles Henry [Charles William, P. R. 4.], s. Henry [housewright. dup.] and Henrietta (Galloup), Dec. 4, 1845. Extra information inside brackets & parentheses Extra information inside brackets & parentheses Charles William – twin of Charles Henry Charles William – twin of Charles Henry Henry [housewright – identified as NAME Henry [housewright – identified as NAME Henrietta (Galloup) –identified as NAME Henrietta (Galloup) –identified as NAME

38 38 Rules Results 100% Precision and Recall 100% Precision and Recall (Once rules are well-defined, the results are perfect.) Database Size Database Size (The RDF database is much larger when rule triples are added.) NEW PROPERTIES – husband, wife, parent, child NEW PROPERTIES – husband, wife, parent, child NEW LINKS NEW LINKS

39 39 Size Impact of Adding Rules MARRIAGE21 rulesEVENT30 rules TriplesOWL (# lines) OWL File (kilobytes) TriplesOWL (# lines) OWL File (kilobytes) OWL File 814498142232140515 W/Rules 1009785312983187375 Difference 1952871775146860 Increase 23.96%57.63%121.43%33.65%33.31%400.00%

40 40 6. Conclusions Speed up data indexing Speed up data indexing Make production of a full index easier Make production of a full index easier Ground the index in original documents Ground the index in original documents Provide for inferred facts Provide for inferred facts Simplify as well as augment record search Simplify as well as augment record search Help link records and form family groups and ancestral lines Help link records and form family groups and ancestral lines


Download ppt "Automatic Extraction From and Reasoning About Genealogical Records: A Prototype By Charla J. Woodbury A thesis submitted to the faculty of Brigham Young."

Similar presentations


Ads by Google