Automatic Extraction of Individual and Family Information from Primary Genealogical Records By Charla Woodbury October 17, 2006
2 Digital Images – Human Index
  Large number of competing family history websites
  Digital images
  Human indexes – double entry
  Researchers hunting through records and indexes to put families together
3 Problem
  Large amounts of primary genealogical data
  Big projects to index and extract records
  Two independent indexers and adjudication
  Millions of human hours used to index or match records for names and families
4 Automated Extraction Solution
  Create a specialized extraction ontology to interpret and label genealogical data
  Develop expert logic and rules that
    Match and merge individuals
    Group them into families
5 Methods
  1. Prepare for the records extraction
  2. Run a 1st PASS to extract the information
  3. Run a 2nd PASS to match individuals and link families
  4. Evaluate and optimize the results
6 Prepare for Records Extraction
  Build an Ontology
    Using BYU ontology software Ontos to interpret and correctly label genealogical data using
      Dataframes
      Regular expressions
      Lexicons
      Conversion functions
    “encapsulates knowledge about the appearance, behavior, and context of a collection of data elements” (Dr. David Embley)
  Collect machine-readable records
7 Ontology – Entity Level
8 Danish GIVEN NAME LEXICON
  MALE
    Anders – And.
    Andreas
    Christen – Kristen
    Christian – Kristian
    Erik – Eric
    Gregers
    Hans
    Ib – Jep – Jeppe
    Jacob
    Jens
    Johan – Johannes – Joh.
    Jorgen – Jørgen
    Knud
    Lars – Laurs – Laurids – Lauritz
    Mads – Mats
  FEMALE
    Ane – Anna – Anne
    Birthe – Birte
    Bodil
    Caroline
    Dorthe – Dorte
    Ellen – Helene – Elene
    Elisabeth – Elsbeth – Lisbeth
    Else – Ilse
    Ingeborg
    Inger
    Karen
    Kirsten – Christen – Kirstine – Christine – Chirstine
    Malene
    Maren
9 DATE Lexicon Adds Thesaurus of Synonyms
  MONTHS
    January – Jan – Januar – 11br
    February – Feb – Februar – 12br
    March – Mar – Marts
    April – Apr – Apl
    May – Mai
    June – Jun – Juni
    July – Jul – Juli – 5br
    August – Aug – Augst – 6br
    September – Sep – Sept – 7br – Septembre
    October – Oct – 8br – Octobre
    November – Nov – 9br – Novembre
    December – Dec – 10br
  TIME
    Year – yr – aar – år
    Month – mo – maaned – m.
    Week – uge – ug.
    Day – dag – d.
    Hour – h. – hr.
  FEAST DATES
    Easter – Paaske – Påske – Paasche – Påsche – P.
    Pentecost – Pent – Pinse – Pin
    Trinity – Tr – Trin – Trinitatis
  DAYS OF WEEK
    Sunday – Sun – Dominico – Dom.
    Monday – Mon – Mondag – Mond.
    Tuesday – Tue – Tirsdag – Tirsd.
    Wednesday – Wed – Onsdag – Onsd.
    Thursday – Thur – Tørsdag – Tørsd.
    Friday – Fri – Fredag – Fred.
    Saturday – Sat – Lørsdag – Lørs
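A lexicon with a thesaurus of synonyms can be read as a lookup from every spelling variant to one canonical form. The following is a minimal Python sketch of that idea, not the Ontos implementation. The names `LEXICON`, `build_index`, and `canonicalize` are hypothetical, and the entries are a small sample copied from the name and date lexicons.

```python
# Hypothetical sample lexicon: canonical form -> variants, copied from the slides.
LEXICON = {
    "Anders": ["And."],
    "Lars": ["Laurs", "Laurids", "Lauritz"],
    "Ellen": ["Helene", "Elene"],
    "September": ["Sep", "Sept", "7br", "Septembre"],
    "October": ["Oct", "8br", "Octobre"],
    "Easter": ["Paaske", "Påske", "Paasche", "Påsche", "P."],
}


def build_index(lexicon):
    """Invert the lexicon so every variant (and the canonical form itself)
    points to its canonical form. Matching is case-insensitive."""
    index = {}
    for canonical, variants in lexicon.items():
        for form in [canonical, *variants]:
            index[form.casefold()] = canonical
    return index


INDEX = build_index(LEXICON)


def canonicalize(token):
    """Return the canonical form of a token, or None if it is not in the lexicon."""
    return INDEX.get(token.strip().casefold())


print(canonicalize("Laurids"))  # Lars
print(canonicalize("7br"))      # September
```

A flat index like this cannot hold ambiguous variants: on the name slide, Christen appears both as a male name and as a variant of Kirsten, so a fuller version would need context (for example, the MALE or FEMALE section) to choose between them.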
10 CONVERSION FUNCTIONS inside the ontology
  Compute birth date from age at death
    Death date: 22 Mar 1743
    Age: 23 yr 2 m
    -> BIRTH Jan 1720
  Compute dates from feast dates
    Sunday 23rd after Trinity -> 14 Nov 1751
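Both conversions on this slide can be reproduced with ordinary date arithmetic. The sketch below is an illustration in Python, not the ontology's own conversion functions. The function names are hypothetical. It assumes the Gregorian calendar (which Denmark had adopted by 1751), computes Easter with the standard anonymous Gregorian computus, and places Trinity Sunday 56 days after Easter.

```python
from datetime import date, timedelta


def birth_from_age(death_year, death_month, age_years, age_months):
    """Subtract an age in years and months from a death date.
    Returns (year, month), since the result is only month-precise."""
    total = death_year * 12 + (death_month - 1) - (age_years * 12 + age_months)
    return total // 12, total % 12 + 1


def gregorian_easter(year):
    """Easter Sunday by the anonymous Gregorian (Meeus/Jones/Butcher) algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def sunday_after_trinity(year, n):
    """Trinity Sunday is 56 days after Easter; the n-th Sunday after it is n weeks later."""
    return gregorian_easter(year) + timedelta(days=56 + 7 * n)


print(birth_from_age(1743, 3, 23, 2))   # (1720, 1), i.e. Jan 1720
print(sunday_after_trinity(1751, 23))   # 1751-11-14
```

For records kept under the Julian calendar, the Easter computation would have to be swapped for the Julian one; the offsets from Easter stay the same.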
11 Collect Machine-Readable Records
12 English Parish – Wirksworth, Derby
13 Danish Parish – Maglebye
14 Sample Danish marriages
15 New England – Beverly, Mass
16 2. Run a 1st pass to extract the information
  Annotate the genealogical record with the ontology
  Populate RDF data file
17 Annotated Town Record
  SOURCE – Beverly town records
  [PAGE HEADER] Births page 391
  [BODY] WOODBURY, Benjamin, s. Nickolas and Anne, bp. 26 : 2 m :
  Labels: NAME, DATE, PLACE, RELATIONSHIP, OCCUPATION, RECORD_TYPE, SOURCE
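As an illustration of how a regular expression could attach labels to the [BODY] entry above, here is a minimal Python sketch. It is not the ontology's actual recognizer. The pattern `ENTRY` and the function `annotate` are hypothetical, and reading "s." as a relationship marker and "bp." as an event marker is an assumption about the record's abbreviations. The date text is kept exactly as it appears on the slide, truncation included.

```python
import re

# Hypothetical pattern for entries shaped like:
#   SURNAME, Given, s. Father and Mother, bp. <date text>
ENTRY = re.compile(
    r"(?P<surname>[A-Z]+),\s*"
    r"(?P<given>[A-Z][a-z]+),\s*"
    r"(?P<relationship>s|d)\.\s*"          # assumed: s. / d. mark the child's relationship
    r"(?P<father>[A-Z][a-z]+)\s+and\s+"
    r"(?P<mother>[A-Z][a-z]+),\s*"
    r"(?P<event>bp|b)\.\s*"                # assumed: bp. / b. mark the event type
    r"(?P<date>.*)"                        # raw date text, left uninterpreted
)


def annotate(line):
    """Return a dict of labeled fields, or None if the line does not fit the pattern."""
    match = ENTRY.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["date"] = fields["date"].strip()
    return fields


record = annotate("WOODBURY, Benjamin, s. Nickolas and Anne, bp. 26 : 2 m :")
print(record["surname"], record["given"], record["father"], record["mother"])
# WOODBURY Benjamin Nickolas Anne
```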
18 Annotated Danish Parish
  SOURCE – Tvilum Parish Register
  [PAGE HEADER] Fødde 1751 page 3
  [BODY] Truust Dom. 23 p: Trinit: laest over Niels Baches SØREN fadd. Johannes Michelsens og Niels Mollers hustruer af Søebyevad, Peder Rasmussen af Søebyevad, Jens Bachis søn Peder og Niels Thylkes s. Peder af Truust
19 Populate RDF data file
  Hilton Campbell’s design
    PERSON
    EVENT
    LINKS – PERSON(S) to EVENT
20 EVENT – birth of Rachel; PERSONs – Sarah and Rachel
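The PERSON, EVENT, and LINK structure can be pictured as subject-predicate-object triples. The Python sketch below is only an illustration of that shape. It is not Hilton Campbell's schema, and every identifier and predicate in it (`Triple`, `birth_event_triples`, `principalIn`, `participantIn`, `event1`) is hypothetical. The slide does not state Sarah's role, so she is linked with a generic participant role.

```python
from collections import namedtuple

# A bare triple, standing in for an RDF statement.
Triple = namedtuple("Triple", "subject predicate obj")


def birth_event_triples(event_id, principal, others):
    """Build triples for one birth EVENT, one PERSON per name,
    and one LINK from each PERSON to the EVENT."""
    triples = [
        Triple(event_id, "type", "EVENT"),
        Triple(event_id, "eventType", "birth"),
    ]
    for idx, name in enumerate([principal, *others], start=1):
        person_id = f"{event_id}/person{idx}"
        role = "principalIn" if idx == 1 else "participantIn"
        triples.append(Triple(person_id, "type", "PERSON"))
        triples.append(Triple(person_id, "name", name))
        triples.append(Triple(person_id, role, event_id))  # the PERSON-to-EVENT link
    return triples


triples = birth_event_triples("event1", "Rachel", ["Sarah"])
for t in triples:
    print(t.subject, t.predicate, t.obj)
```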
21 3. Run a SECOND PASS to match individuals and to link families
  FORMULATE RULES in Rule Engine language for RDF data file
    Match individuals
    Check family data
    Link families up
  APPLY RULES through the Java Rules API
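To give a feel for what a family-linking rule might do, here is a Python stand-in. It is not written in the rule engine language and does not use the Java Rules API. The rule itself (group children whose records name the same father and mother) is an assumption chosen for illustration, and `link_families` is a hypothetical name. The sample tuples take child and parent names from the Woodbury example in this deck.

```python
from collections import defaultdict


def link_families(birth_records):
    """Group children into families keyed by (father, mother).
    Each record is a (child, father, mother) tuple."""
    families = defaultdict(list)
    for child, father, mother in birth_records:
        families[(father, mother)].append(child)
    return dict(families)


# Two men named Isaac WOODBURY are told apart here by the mother named in each record.
records = [
    ("Robert", "Isaac WOODBURY", "Mary"),
    ("Christian", "Isaac WOODBURY", "Mary"),
    ("Lidia", "Isaac WOODBURY", "Elizabeth"),
    ("Benjamin", "Isaac WOODBURY", "Elizabeth"),
]

families = link_families(records)
for (father, mother), children in families.items():
    print(father, "+", mother, "->", children)
```

Grouping on the father's name alone would merge both households into one family, which is the kind of error the second pass is meant to catch.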
22 4. Evaluate and Optimize Results
  Evaluate the preliminary results
  Optimize the rules
  Improve the whole process
23 VALIDATION I
  Classification by Record Type:
    RECALL = (entries correctly labeled ‘BIRTH’) / (312 entries that are actual births) = .769
    PRECISION = (entries correctly labeled ‘BIRTH’) / (246 entries in total labeled ‘BIRTH’) = .976
  The higher the number, the better
24 VALIDATION II
  Correctness of the Extraction:
    RECALL = (entries correctly labeled ‘NAME’) / (1000 entries that are actual names) = .95
    PRECISION = (entries correctly labeled ‘NAME’) / (980 entries in total labeled ‘NAME’) = .969
  The higher the number, the better
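The two validation slides apply the usual precision and recall ratios. The Python sketch below reproduces the reported figures. The slides give only the denominators and the results, so the counts of correctly labeled entries used here (240 for BIRTH, 950 for NAME) are inferred: they are the only whole numbers consistent with both reported values in each case. The function name `precision_recall` is hypothetical.

```python
def precision_recall(correct, labeled_total, actual_total):
    """precision = correct / everything given the label;
    recall = correct / everything that truly has the label."""
    return correct / labeled_total, correct / actual_total


# Inferred numerators (assumption): 240 correct BIRTH labels, 950 correct NAME labels.
p1, r1 = precision_recall(correct=240, labeled_total=246, actual_total=312)
p2, r2 = precision_recall(correct=950, labeled_total=980, actual_total=1000)

print(round(p1, 3), round(r1, 3))  # 0.976 0.769
print(round(p2, 3), round(r2, 3))  # 0.969 0.95
```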
25 Isaac WOODBURY
  Children
    1. Robert 4 Jul
    2. Mary 6 Oct
    3. Christian 3 Mar 1677/8
    4. Isaac 6 Apr
    5. Deliverance 1 Feb 1682/3
    6. Joshua 1 Jan 1684/5
    7. Elizabeth 17 Jan
    8. Nickolas 12 Aug
    9. Ann 29 Jun
    10. Lidia 1 Feb 1691/2
    11. Elisabeth about
    12. Isaac 20 Jul
    13. Benjamin 20 Aug 1699
26 Isaac WOODBURY, SON of HUMPHREY
  Mary WILKES
  MARRIAGE 9 Oct 1671
    1. Robert 4 Jul
    2. Mary 6 Oct
    3. Christian 3 Mar 1677/8
    4. Isaac 6 Apr
    5. Deliverance 1 Feb 1682/3
    6. Joshua 1 Jan 1684/5
    7. Elizabeth 17 Jan 1688
   Isaac WOODBURY, SON of NICHOLAS
  Elizabeth
  MARRIAGE ________
    1. Nickolas 12 Aug
    2. Ann 29 Jun
    3. Lidia 1 Feb 1691/2
    4. Elisabeth about
    5. Isaac 20 Jul
    6. Benjamin 20 Aug 1699
27 VALIDATION III
  Grouping by FAMILY:
    (total # merges + splits to correct families after 2nd PASS) / (total # merges + splits to correct families after 1st PASS)
  The lower the number, the better
28 Optimize the Rules
  Add
  Remove
  Fine-tune
  Change the order
  Improve the whole process
  Until the metrics no longer improve
29 AUTOMATIC EXTRACTION
  Unstructured genealogical data
  Searchable annotated genealogical data
  Families in RDF data file
Questions?