Automatic Extraction of Individual and Family Information from Primary Genealogical Records By Charla Woodbury October 17, 2006.

1 Automatic Extraction of Individual and Family Information from Primary Genealogical Records By Charla Woodbury October 17, 2006

2 Digital Images – Human Index: large number of competing family history websites; digital images; human indexes (double entry); researchers hunting through records and indexes to put families together.

3 Problem: large amounts of primary genealogical data; big projects to index and extract records; two independent indexers and adjudication; millions of human hours used to index or match records for names and families.

4 Automated Extraction Solution: create a specialized extraction ontology to interpret and label genealogical data; develop expert logic and rules that match and merge individuals and group them into families.

5 Methods: (1) prepare for the records extraction; (2) run a 1st PASS to extract the information; (3) run a 2nd PASS to match individuals and link families; (4) evaluate and optimize the results.

6 Prepare for Records Extraction
Build an ontology, using BYU ontology software Ontos, to interpret and correctly label genealogical data using dataframes, regular expressions, lexicons, and conversion functions.
“encapsulates knowledge about the appearance, behavior, and context of a collection of data elements” (Dr. David Embley)
Collect machine-readable records.

7 Ontology – Entity Level

8 Danish GIVEN NAME LEXICON
MALE: Anders, And.; Andreas; Christen, Kristen; Christian, Kristian; Erik, Eric; Gregers; Hans; Ib, Jep, Jeppe; Jacob; Jens; Johan, Johannes, Joh.; Jorgen, Jørgen; Knud; Lars, Laurs, Laurids, Lauritz; Mads, Mats
FEMALE: Ane, Anna, Anne; Birthe, Birte; Bodil; Caroline; Dorthe, Dorte; Ellen, Helene, Elene; Elisabeth, Elsbeth, Lisbeth; Else, Ilse; Ingeborg; Inger; Karen; Kirsten, Christen, Kirstine, Christine, Chirstine; Malene; Maren
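A lexicon of this kind can be treated as a lookup from every recorded spelling to one standard form. The following is a minimal Python sketch, not Ontos syntax: the dictionary layout and the function names are hypothetical, only a few male entries are included, and it assumes the first spelling in each group on the slide is the standard one.

```python
# Illustrative subset of the male lexicon from the slide.
# Assumption: the first spelling listed in each group is the standard form.
MALE_GIVEN_NAMES = {
    "Anders": ["And."],
    "Christen": ["Kristen"],
    "Ib": ["Jep", "Jeppe"],
    "Johan": ["Johannes", "Joh."],
    "Lars": ["Laurs", "Laurids", "Lauritz"],
    "Mads": ["Mats"],
}


def build_variant_index(lexicon):
    """Map every spelling (case-insensitively) to its standard form."""
    index = {}
    for standard, variants in lexicon.items():
        for spelling in [standard, *variants]:
            index[spelling.casefold()] = standard
    return index


MALE_INDEX = build_variant_index(MALE_GIVEN_NAMES)


def standardize_given_name(name, index=MALE_INDEX):
    """Return the standard form of a given name, or None if it is not in the lexicon."""
    return index.get(name.strip().casefold())


print(standardize_given_name("Jeppe"))
print(standardize_given_name("Lauritz"))
```

Note that the slide lists Christen both as a male name and as a variant of the female Kirsten, so a fuller version would need a separate index per sex or some other way to resolve the overlap.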

9 DATE Lexicon Adds Thesaurus of Synonyms
MONTHS: January, Jan, Januar, 11br; February, Feb, Februar, 12br; March, Mar, Marts; April, Apr, Apl; May, Mai; June, Jun, Juni; July, Jul, Juli, 5br; August, Aug, Augst, 6br; September, Sep, Sept, 7br, Septembre; October, Oct, 8br, Octobre; November, Nov, 9br, Novembre; December, Dec, 10br
TIME: Year, yr, aar, år; Month, mo, maaned, m.; Week, uge, ug.; Day, dag, d.; Hour, h., hr.
FEAST DATES: Easter, Paaske, Påske, Paasche, Påsche, P.; Pentecost, Pent, Pinse, Pin; Trinity, Tr, Trin, Trinitatis
DAYS OF WEEK: Sunday, Sun, Dominico, Dom.; Monday, Mon, Mondag, Mond.; Tuesday, Tue, Tirsdag, Tirsd.; Wednesday, Wed, Onsdag, Onsd.; Thursday, Thur, Tørsdag, Tørsd.; Friday, Fri, Fredag, Fred.; Saturday, Sat, Lørsdag, Lørs
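To show how a lexicon and a regular expression can work together, here is a small Python sketch that builds a date recognizer from a few of the month synonyms above. It is only an illustration under stated assumptions: the names `MONTH_SYNONYMS`, `DATE_PATTERN`, and `find_dates` are hypothetical, the "day month year" layout is one assumed format among many, and none of this is taken from the Ontos dataframes themselves.

```python
import re

# A few month rows from the slide: month number -> recorded spellings.
MONTH_SYNONYMS = {
    1: ["January", "Jan", "Januar", "11br"],
    2: ["February", "Feb", "Februar", "12br"],
    3: ["March", "Mar", "Marts"],
    9: ["September", "Sep", "Sept", "7br", "Septembre"],
    10: ["October", "Oct", "8br", "Octobre"],
    11: ["November", "Nov", "9br", "Novembre"],
    12: ["December", "Dec", "10br"],
}

# Flatten to a case-insensitive lookup from spelling to month number.
MONTH_NUMBER = {
    spelling.casefold(): number
    for number, spellings in MONTH_SYNONYMS.items()
    for spelling in spellings
}

# Longest spellings first, so "Sept" is tried before "Sep".
_alternatives = "|".join(
    sorted((re.escape(s) for s in MONTH_NUMBER), key=len, reverse=True)
)

# Assumed layout: day, month word (optional trailing period), four-digit year.
DATE_PATTERN = re.compile(
    rf"\b(\d{{1,2}})\s+({_alternatives})\.?\s+(\d{{4}})\b", re.IGNORECASE
)


def find_dates(text):
    """Return (day, month_number, year) for each recognized date in the text."""
    return [
        (int(day), MONTH_NUMBER[month.casefold()], int(year))
        for day, month, year in DATE_PATTERN.findall(text)
    ]


print(find_dates("d. 22 Mar 1743"))
print(find_dates("bapt. 14 9br 1751 and 3 Sept. 1750"))
```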

10 CONVERSION FUNCTIONS inside the ontology
Compute birth date from age at death: death date 22 Mar 1743, age 23 yr 2 m -> BIRTH Jan 1720
Compute dates from feast dates: Sunday 23rd after Trinity 1751 -> 14 Nov 1751
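Both conversions on this slide can be reproduced with a few lines of date arithmetic. The Python sketch below is one possible way to do it, not the code inside the ontology: the function names are hypothetical, and the feast-date function assumes Easter is computed by the standard Gregorian rule (the anonymous Gregorian algorithm), with Trinity Sunday 56 days after Easter. Records dated under a different Easter reckoning would need a different function.

```python
from datetime import date, timedelta


def birth_month_from_age(death_year, death_month, age_years, age_months):
    """Subtract an age in years and months from a death month.

    Returns (year, month); the day is not estimated.
    """
    months = death_year * 12 + (death_month - 1) - (age_years * 12 + age_months)
    return months // 12, months % 12 + 1


def gregorian_easter(year):
    """Easter Sunday by the anonymous Gregorian algorithm (assumed calendar)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def sunday_after_trinity(year, n):
    """Date of the n-th Sunday after Trinity Sunday in the given year."""
    trinity = gregorian_easter(year) + timedelta(days=56)
    return trinity + timedelta(weeks=n)


# The two examples from the slide.
print(birth_month_from_age(1743, 3, 23, 2))   # (1720, 1), i.e. Jan 1720
print(sunday_after_trinity(1751, 23))         # 1751-11-14
```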

11 Collect Machine-Readable Records

12 English Parish – Wirksworth, Derby 1608-1813

13 Danish Parish – Maglebye 1646-1813

14 Sample Danish marriages

15 New England – Beverly, Mass. 1668-1849

16 Step 2: Run a 1st PASS to extract the information: annotate the genealogical record with the ontology; populate RDF data file.

17 Annotated Town Record
SOURCE: Beverly town records
[PAGE HEADER] Births page 391
[BODY] WOODBURY, Benjamin, s. Nickolas and Anne, bp. 26 : 2 m : 1668.
Annotation labels: NAME, DATE, PLACE, RELATIONSHIP, OCCUPATION, RECORD_TYPE, SOURCE
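As a rough illustration of what annotating this entry involves, the Python sketch below labels the sample body line with a single regular expression. It is not how Ontos works internally. The pattern, the `annotate` function, and the abbreviation tables are hypothetical, and the readings "s." as son, "d." as daughter, "bp." as baptism, and "b." as birth are assumptions. The month is kept as the number written in the record, because the slide does not say which calendar convention applies.

```python
import re

# Assumed readings of the abbreviations in the town record.
RELATIONSHIPS = {"s": "son", "d": "daughter"}
RECORD_TYPES = {"bp": "baptism", "b": "birth"}

# Hypothetical pattern for entries shaped like the sample line:
# SURNAME, Given, s. Parent and Parent, bp. DD : M m : YYYY.
ENTRY_PATTERN = re.compile(
    r"(?P<surname>[A-Z]+),\s*(?P<given>\w+),\s*"
    r"(?P<relationship>s|d)\.\s*(?P<parent1>\w+)\s+and\s+(?P<parent2>\w+),\s*"
    r"(?P<event>bp|b)\.\s*"
    r"(?P<day>\d{1,2})\s*:\s*(?P<month>\d{1,2})\s*m\s*:\s*(?P<year>\d{4})\."
)


def annotate(entry):
    """Return a dict of labeled fields for one entry, or None if it does not match."""
    match = ENTRY_PATTERN.search(entry)
    if match is None:
        return None
    f = match.groupdict()
    return {
        "NAME": f"{f['given']} {f['surname'].title()}",
        "RELATIONSHIP": (RELATIONSHIPS[f["relationship"]], [f["parent1"], f["parent2"]]),
        "RECORD_TYPE": RECORD_TYPES[f["event"]],
        # Month number as written; no calendar conversion is attempted.
        "DATE": {"day": int(f["day"]), "month_number": int(f["month"]), "year": int(f["year"])},
    }


line = "WOODBURY, Benjamin, s. Nickolas and Anne, bp. 26 : 2 m : 1668."
print(annotate(line))
```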

18 Annotated Danish Parish
SOURCE: Tvilum Parish Register
[PAGE HEADER] Fødde 1751 page 3
[BODY] Truust Dom. 23 p: Trinit: laest over Niels Baches SØREN fadd. Johannes Michelsens og Niels Mollers hustruer af Søebyevad, Peder Rasmussen af Søebyevad, Jens Bachis søn Peder og Niels Thylkes s. Peder af Truust

19 Populate RDF-data file (Hilton Campbell’s design): PERSON; EVENT; LINKS from PERSON(S) to EVENT.

20 EVENT: birth of Rachel. PERSONs: Sarah and Rachel.
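The PERSON, EVENT, and LINKS structure can be pictured as a handful of RDF triples. The Python sketch below writes them in an N-Triples-like form using only the standard library. It does not reproduce Hilton Campbell's design: the namespace, the predicate names (`type`, `name`, `eventType`, `involves`), and the identifiers are all hypothetical, and because the slide does not state Sarah's role in the event, the link is left as a generic "involves".

```python
# Hypothetical namespace; the real design's vocabulary is not shown on the slides.
NS = "http://example.org/genealogy#"


def uri(local_name):
    """Format a local name as a full URI reference."""
    return f"<{NS}{local_name}>"


def literal(text):
    """Format a plain string literal, escaping backslashes and quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def person(person_id, name):
    """Triples describing one PERSON."""
    return [
        (uri(person_id), uri("type"), uri("Person")),
        (uri(person_id), uri("name"), literal(name)),
    ]


def event(event_id, event_type, person_ids):
    """Triples describing one EVENT plus a LINK to each person involved."""
    triples = [
        (uri(event_id), uri("type"), uri("Event")),
        (uri(event_id), uri("eventType"), literal(event_type)),
    ]
    triples += [(uri(event_id), uri("involves"), uri(pid)) for pid in person_ids]
    return triples


def to_ntriples(triples):
    """Serialize triples one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = (
    person("person1", "Sarah")
    + person("person2", "Rachel")
    + event("event1", "birth", ["person1", "person2"])
)
print(to_ntriples(triples))
```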

21 Step 3: Run a SECOND PASS to match individuals and to link families. FORMULATE RULES in Rule Engine language for RDF-data file: match individuals; check family data; link families up. APPLY RULES through the Java Rules API.

22 Step 4: Evaluate and Optimize Results: evaluate the preliminary results; optimize the rules; improve the whole process.

23 VALIDATION I, Classification by Record Type
RECALL = .769 = (240 entries CORRECTLY LABELED ‘BIRTH’) / (312 entries ACTUAL BIRTHS)
PRECISION = .976 = (240 entries CORRECTLY LABELED ‘BIRTH’) / (246 entries TOTAL LABELED ‘BIRTH’)
The higher the number, the better.

24 VALIDATION II, Correctness of the Extraction
RECALL = .95 = (950 entries CORRECTLY LABELED ‘NAME’) / (1000 entries ACTUAL NAMES)
PRECISION = .969 = (950 entries CORRECTLY LABELED ‘NAME’) / (980 entries TOTAL LABELED ‘NAME’)
The higher the number, the better.
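The recall and precision figures on the two validation slides follow directly from the counts shown. A short Python check, with a hypothetical helper name, reproduces them:

```python
def precision_recall(correct, total_labeled, total_actual):
    """Precision = correct / everything labeled; recall = correct / everything that should be labeled."""
    precision = correct / total_labeled
    recall = correct / total_actual
    return precision, recall


# Validation I: classification by record type ('BIRTH').
p1, r1 = precision_recall(correct=240, total_labeled=246, total_actual=312)
print(f"BIRTH precision={p1:.3f} recall={r1:.3f}")

# Validation II: correctness of the extraction ('NAME').
p2, r2 = precision_recall(correct=950, total_labeled=980, total_actual=1000)
print(f"NAME  precision={p2:.3f} recall={r2:.3f}")
```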

25 Isaac WOODBURY
Children: 1. Robert 4 Jul 1672; 2. Mary 6 Oct 1674; 3. Christian 3 Mar 1677/8; 4. Isaac 6 Apr 1680; 5. Deliverance 1 Feb 1682/3; 6. Joshua 1 Jan 1684/5; 7. Elizabeth 17 Jan 1688; 8. Nickolas 12 Aug 1688; 9. Ann 29 Jun 1689; 10. Lidia 1 Feb 1691/2; 11. Elisabeth about 1694; 12. Isaac 20 Jul 1697; 13. Benjamin 20 Aug 1699

26 Isaac WOODBURY, SON of HUMPHREY, and Mary WILKES. MARRIAGE 9 Oct 1671.
Children: 1. Robert 4 Jul 1672; 2. Mary 6 Oct 1674; 3. Christian 3 Mar 1677/8; 4. Isaac 6 Apr 1680; 5. Deliverance 1 Feb 1682/3; 6. Joshua 1 Jan 1684/5; 7. Elizabeth 17 Jan 1688
Isaac WOODBURY, SON of NICHOLAS, and Elizabeth. MARRIAGE ________
Children: 1. Nickolas 12 Aug 1688; 2. Ann 29 Jun 1689; 3. Lidia 1 Feb 1691/2; 4. Elisabeth about 1694; 5. Isaac 20 Jul 1697; 6. Benjamin 20 Aug 1699
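One kind of family-data check that could separate the two Isaac Woodbury families is a sibling-spacing rule: Elizabeth (17 Jan 1688) and Nickolas (12 Aug 1688) are listed less than nine months apart, so they are unlikely to share a mother. The Python sketch below illustrates that idea only. It is not one of the author's rules and does not use the Java Rules API; the 270-day threshold, the twin allowance, and the function name are assumptions, and only four of the children are included, with dates taken as printed on the slide.

```python
from datetime import date

MIN_SPACING_DAYS = 270   # assumption: roughly nine months between births
TWIN_WINDOW_DAYS = 1     # assumption: births within a day are treated as twins

# Subset of the children from the slide; 1684/5 is read as 1685.
children = [
    ("Joshua", date(1685, 1, 1)),
    ("Elizabeth", date(1688, 1, 17)),
    ("Nickolas", date(1688, 8, 12)),
    ("Ann", date(1689, 6, 29)),
]


def split_on_spacing_conflicts(children):
    """Sort children by date and start a new group wherever two consecutive
    births are too close together to belong to the same mother."""
    ordered = sorted(children, key=lambda child: child[1])
    groups = [[ordered[0]]] if ordered else []
    for previous, current in zip(ordered, ordered[1:]):
        gap = (current[1] - previous[1]).days
        if TWIN_WINDOW_DAYS < gap < MIN_SPACING_DAYS:
            groups.append([current])      # conflict: open a new candidate family
        else:
            groups[-1].append(current)    # plausible sibling: keep in the same family
    return groups


for number, group in enumerate(split_on_spacing_conflicts(children), start=1):
    print(number, [name for name, _ in group])
```

On this subset the rule places Joshua and Elizabeth in one group and Nickolas and Ann in another, which matches the boundary between the two families shown on the slide. A real rule set would need further evidence, such as the parents' names, before committing to a split.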

27 VALIDATION III, Grouping by FAMILY
(total # merges + splits to correct families after 2nd PASS) / (total # merges + splits to correct families after 1st PASS)
The lower the number, the better.

28 Optimize the Rules: add; remove; fine-tune; change the order. Improve the whole process, until the metrics no longer improve.

29 AUTOMATIC EXTRACTION: unstructured genealogical data; searchable annotated genealogical data; families in RDF-data file.

30 Questions?

