Automatic Extraction of Individual and Family Information from Primary Genealogical Records By Charla Woodbury October 17, 2006.

Presentation transcript:

Automatic Extraction of Individual and Family Information from Primary Genealogical Records By Charla Woodbury October 17, 2006

2 Digital Images – Human Index
- Large number of competing family history websites
- Digital images
- Human indexes – double entry
- Researchers hunting through records and indexes to put families together

3 Problem
- Large amounts of primary genealogical data
- Big projects to index and extract records
- Two independent indexers and adjudication
- Millions of human hours used to index or match records for names and families

4 Automated Extraction Solution
- Create a specialized extraction ontology to interpret and label genealogical data
- Develop expert logic and rules that
  - Match and merge individuals
  - Group them into families

5 Methods
- Prepare for the records extraction
- Run a 1st PASS to extract the information
- Run a 2nd PASS to match individuals and link families
- Evaluate and optimize the results

6 Prepare for Records Extraction
- Build an Ontology
  - Use the BYU ontology software Ontos to interpret and correctly label genealogical data using
    - Dataframes
    - Regular expressions
    - Lexicons
    - Conversion functions
  - “encapsulates knowledge about the appearance, behavior, and context of a collection of data elements” (Dr. David Embley)
- Collect machine-readable records

7 Ontology – Entity Level

8 Danish GIVEN NAME LEXICON
MALE
- Anders – And.
- Andreas
- Christen – Kristen
- Christian – Kristian
- Erik – Eric
- Gregers
- Hans
- Ib – Jep – Jeppe
- Jacob
- Jens
- Johan – Johannes – Joh.
- Jorgen – Jørgen
- Knud
- Lars – Laurs – Laurids – Lauritz
- Mads – Mats
FEMALE
- Ane – Anna – Anne
- Birthe – Birte
- Bodil
- Caroline
- Dorthe – Dorte
- Ellen – Helene – Elene
- Elisabeth – Elsbeth – Lisbeth
- Else – Ilse
- Ingeborg
- Inger
- Karen
- Kirsten – Christen – Kirstine – Christine – Chirstine
- Malene
- Maren

9 DATE Lexicon Adds Thesaurus of Synonyms
MONTHS
- January – Jan – Januar – 11br
- February – Feb – Februar – 12br
- March – Mar – Marts
- April – Apr – Apl
- May – Mai
- June – Jun – Juni
- July – Jul – Juli – 5br
- August – Aug – Augst – 6br
- September – Sep – Sept – 7br – Septembre
- October – Oct – 8br – Octobre
- November – Nov – 9br – Novembre
- December – Dec – 10br
TIME
- Year – yr – aar – år
- Month – mo – maaned – m.
- Week – uge – ug.
- Day – dag – d.
- Hour – h. – hr.
FEAST DATES
- Easter – Paaske – Påske – Paasche – Påsche – P.
- Pentecost – Pent – Pinse – Pin
- Trinity – Tr – Trin – Trinitatis
DAYS OF WEEK
- Sunday – Sun – Dominico – Dom.
- Monday – Mon – Mondag – Mond.
- Tuesday – Tue – Tirsdag – Tirsd.
- Wednesday – Wed – Onsdag – Onsd.
- Thursday – Thur – Tørsdag – Tørsd.
- Friday – Fri – Fredag – Fred.
- Saturday – Sat – Lørsdag – Lørs
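
A lexicon with a thesaurus of synonyms amounts to a lookup from every variant spelling to one canonical form. The following is a minimal sketch of that idea, not the Ontos implementation: the table layout and the names `build_index` and `normalize` are assumptions for illustration, and only a few entries from the slides are included.

```python
# Hypothetical lexicon tables: canonical form -> variants seen in the records.
# Only a handful of the entries from the slides are reproduced here.
GIVEN_NAMES = {
    "Anders": ["And."],
    "Christen": ["Kristen"],
    "Jorgen": ["Jørgen"],
    "Lars": ["Laurs", "Laurids", "Lauritz"],
}
MONTHS = {
    "July": ["Jul", "Juli", "5br"],
    "September": ["Sep", "Sept", "7br", "Septembre"],
    "December": ["Dec", "10br"],
}


def _key(token):
    """Case-insensitive key that ignores a trailing abbreviation period."""
    return token.casefold().rstrip(".")


def build_index(lexicon):
    """Map every canonical form and every variant to its canonical form."""
    index = {}
    for canonical, variants in lexicon.items():
        for form in [canonical, *variants]:
            index[_key(form)] = canonical
    return index


def normalize(token, index):
    """Return the canonical form for a token, or None if it is not in the lexicon."""
    return index.get(_key(token))


NAME_INDEX = build_index(GIVEN_NAMES)
MONTH_INDEX = build_index(MONTHS)

print(normalize("Kristen", NAME_INDEX))  # Christen
print(normalize("7br", MONTH_INDEX))     # September
```

A flat dictionary keeps lookups simple, but it cannot resolve a variant that belongs to more than one canonical name (for example "Christen", which the slides list under both a male and a female entry); a real lexicon would need context such as the gender of the record to disambiguate.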

10 CONVERSION FUNCTIONS inside the ontology
- Compute birth date from age at death
  - Death date – 22 Mar 1743
  - Age – 23 yr 2 m
  - -> BIRTH Jan 1720
- Compute dates from feast dates
  - Sunday 23rd after Trinity -> 14 Nov 1751
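
Both conversions on this slide can be reproduced with a few lines of date arithmetic. The sketch below is illustrative only: the function names are hypothetical, the birth date is resolved to the month, and the feast-date conversion assumes that the standard Gregorian Easter computation applies to the record and that Trinity Sunday falls eight weeks after Easter.

```python
from datetime import date, timedelta


def birth_from_age(death_year, death_month, age_years, age_months=0):
    """Approximate birth (year, month) by subtracting the age from the death date."""
    months = death_year * 12 + (death_month - 1) - (age_years * 12 + age_months)
    year, month_index = divmod(months, 12)
    return year, month_index + 1


def gregorian_easter(year):
    """Easter Sunday by the anonymous Gregorian algorithm (assumed applicable)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def sunday_after_trinity(year, n):
    """Date of the n-th Sunday after Trinity (Trinity Sunday = Easter + 8 weeks)."""
    trinity = gregorian_easter(year) + timedelta(weeks=8)
    return trinity + timedelta(weeks=n)


print(birth_from_age(1743, 3, 23, 2))    # (1720, 1), i.e. Jan 1720
print(sunday_after_trinity(1751, 23))    # 1751-11-14
```

Both results agree with the values on the slide: a death on 22 Mar 1743 at age 23 yr 2 m gives a birth in Jan 1720, and the 23rd Sunday after Trinity in 1751 falls on 14 Nov 1751.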

11 Collect Machine-Readable Records

12 English Parish – Wirksworth, Derby

13 Danish Parish – Maglebye

14 Sample Danish marriages

15 New England – Beverly, Mass

16 Step 2: Run a 1st pass to extract the information
- Annotate the genealogical record with the ontology
- Populate RDF data file

17 Annotated Town Record
SOURCE – Beverly town records
[PAGE HEADER] Births page 391
[BODY] WOODBURY, Benjamin, s. Nickolas and Anne, bp. 26 : 2 m :
Annotation labels: NAME, DATE, PLACE, RELATIONSHIP, OCCUPATION, RECORD_TYPE, SOURCE
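
The annotation step can be pictured as a regular expression that labels the parts of one entry. This is a rough sketch under stated assumptions, not the ontology's actual recognizers: the pattern, the abbreviation table (`s.` for son, `d.` for daughter, `bp.` for baptism, `b.` for birth) and the function name `annotate` are all hypothetical, and the date is kept as raw text because the year is cut off in the transcript.

```python
import re

# Hypothetical expansions for the abbreviations in the sample entry.
RELATIONSHIPS = {"s": "son", "d": "daughter"}
RECORD_TYPES = {"bp": "baptism", "b": "birth"}

# Assumed entry layout: SURNAME, Given, s./d. Father and Mother, bp./b. date
ENTRY = re.compile(
    r"^(?P<surname>[A-Z]+),\s*(?P<given>[A-Za-z]+),\s*"
    r"(?P<rel>s|d)\.\s*(?P<father>[A-Za-z]+)\s+and\s+(?P<mother>[A-Za-z]+),\s*"
    r"(?P<event>bp|b)\.\s*(?P<date>.*)$"
)


def annotate(entry, source):
    """Label the parts of one entry, or return None if the layout does not match."""
    match = ENTRY.match(entry.strip())
    if match is None:
        return None
    return {
        "NAME": f"{match['given']} {match['surname']}",
        "RELATIONSHIP": (
            f"{RELATIONSHIPS[match['rel']]} of {match['father']} and {match['mother']}"
        ),
        "RECORD_TYPE": RECORD_TYPES[match["event"]],
        "DATE": match["date"].strip(),  # raw text; not parsed further here
        "SOURCE": source,
    }


record = annotate(
    "WOODBURY, Benjamin, s. Nickolas and Anne, bp. 26 : 2 m :",
    "Beverly town records",
)
print(record["NAME"])          # Benjamin WOODBURY
print(record["RELATIONSHIP"])  # son of Nickolas and Anne
print(record["RECORD_TYPE"])   # baptism
```

A single fixed pattern only covers entries that follow this one layout; free-form entries such as the Danish parish register would need the lexicons and context rules described earlier.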

18 Annotated Danish Parish
SOURCE – Tvilum Parish Register
[PAGE HEADER] Fødde 1751 page 3
[BODY] Truust Dom. 23 p: Trinit: laest over Niels Baches SØREN fadd. Johannes Michelsens og Niels Mollers hustruer af Søebyevad, Peder Rasmussen af Søebyevad, Jens Bachis søn Peder og Niels Thylkes s. Peder af Truust

19 Populate RDF-data file
- Hilton Campbell’s design
  - PERSON
  - EVENT
  - LINKS – PERSON(S) to EVENT

20 EVENT – birth of Rachel
PERSONs – Sarah and Rachel
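
The PERSON, EVENT and LINKS structure can be sketched as subject-predicate-object triples. The example below is an assumption-laden illustration rather than Hilton Campbell's actual schema: the namespace, the predicate names (`type`, `name`, `eventType`, `hasParticipant`) and the helper functions are hypothetical, and plain tuples stand in for a real RDF library.

```python
BASE = "http://example.org/genealogy/"  # hypothetical namespace


def person(person_id, name):
    """Triples describing one PERSON."""
    subject = BASE + person_id
    return [(subject, "type", "Person"), (subject, "name", name)]


def event(event_id, event_type):
    """Triples describing one EVENT."""
    subject = BASE + event_id
    return [(subject, "type", "Event"), (subject, "eventType", event_type)]


def link(event_id, person_id):
    """A LINK triple connecting a PERSON to an EVENT."""
    return [(BASE + event_id, "hasParticipant", BASE + person_id)]


# The slide's example: the birth of Rachel, involving Sarah and Rachel.
triples = (
    person("p1", "Sarah")
    + person("p2", "Rachel")
    + event("e1", "birth")
    + link("e1", "p1")
    + link("e1", "p2")
)

for subject, predicate, obj in triples:
    print(subject, predicate, obj)
```

Keeping persons and events as separate resources joined by links means the same person can later be attached to further events (marriage, burial) without duplicating the person record, which is what the second pass relies on when it merges individuals.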

21 Step 3: Run a SECOND PASS to match individuals and to link families
- FORMULATE RULES in Rule Engine language for RDF-data file
  - Match individuals
  - Check family data
  - Link families up
- APPLY RULES through the Java Rules API
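
One family-linking rule can be illustrated by grouping birth records on the parents' names. This is only a sketch of the logic in Python; the presentation formulates its rules in a rule engine language applied through the Java Rules API, and the record layout, the field names and the rule itself are assumptions made for the example. The sample data uses a few of the children that the later family slides assign to the two Isaac WOODBURY couples.

```python
from collections import defaultdict


def link_families(birth_records):
    """Group children into families keyed by (surname, father, mother)."""
    families = defaultdict(list)
    for rec in birth_records:
        key = (rec["surname"], rec["father"], rec["mother"])
        families[key].append(rec["child"])
    return dict(families)


# Hypothetical extracted records; the field layout is an assumption.
records = [
    {"surname": "WOODBURY", "father": "Isaac", "mother": "Mary", "child": "Robert"},
    {"surname": "WOODBURY", "father": "Isaac", "mother": "Mary", "child": "Joshua"},
    {"surname": "WOODBURY", "father": "Isaac", "mother": "Elizabeth", "child": "Nickolas"},
    {"surname": "WOODBURY", "father": "Isaac", "mother": "Elizabeth", "child": "Benjamin"},
]

families = link_families(records)
for (surname, father, mother), children in families.items():
    print(f"{father} {surname} and {mother}: {', '.join(children)}")
```

Grouping on the father's name alone would merge both couples into one family, which is the kind of error the second pass is meant to correct; adding the mother's name to the key is the simplest rule that separates them.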

22 Step 4: Evaluate and Optimize Results
- Evaluate the preliminary results
- Optimize the rules
- Improve the whole process

23 VALIDATION I
Classification by Record Type:
RECALL = (entries correctly labeled ‘BIRTH’) / (312 entries that are actual births) = .769
PRECISION = (entries correctly labeled ‘BIRTH’) / (246 entries in total labeled ‘BIRTH’) = .976
The higher the number, the better

24 VALIDATION II
Correctness of the Extraction:
RECALL = (entries correctly labeled ‘NAME’) / (1000 entries that are actual names) = .95
PRECISION = (entries correctly labeled ‘NAME’) / (980 entries in total labeled ‘NAME’) = .969
The higher the number, the better
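
Recall and precision for both validations follow the same two ratios. In the sketch below the counts of correctly labeled entries (240 for ‘BIRTH’, 950 for ‘NAME’) are not stated in the transcript; they are inferred here as the values consistent with the reported denominators and scores, and should be read as assumptions. The function names are illustrative.

```python
def recall(correctly_labeled, actual_total):
    """Share of the entries that truly belong to the class that were labeled correctly."""
    return correctly_labeled / actual_total


def precision(correctly_labeled, labeled_total):
    """Share of the entries given the label that truly belong to the class."""
    return correctly_labeled / labeled_total


# Validation I: record type 'BIRTH' (240 correct is an inferred count).
print(round(recall(240, 312), 3), round(precision(240, 246), 3))

# Validation II: 'NAME' extraction (950 correct is an inferred count).
print(round(recall(950, 1000), 3), round(precision(950, 980), 3))
```

Reporting the two numbers together matters: the record-type classifier rarely mislabels an entry as a birth (high precision) but misses almost a quarter of the actual births (lower recall), which points to where the rules need work.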

25 Isaac WOODBURY
Children
1. Robert 4 Jul
2. Mary 6 Oct
3. Christian 3 Mar 1677/8
4. Isaac 6 Apr
5. Deliverance 1 Feb 1682/3
6. Joshua 1 Jan 1684/5
7. Elizabeth 17 Jan
8. Nickolas 12 Aug
9. Ann 29 Jun
10. Lidia 1 Feb 1691/2
11. Elisabeth about
12. Isaac 20 Jul
13. Benjamin 20 Aug 1699

26 Isaac WOODBURY, SON of HUMPHREY
Mary WILKES
MARRIAGE 9 Oct 1671
1. Robert 4 Jul
2. Mary 6 Oct
3. Christian 3 Mar 1677/8
4. Isaac 6 Apr
5. Deliverance 1 Feb 1682/3
6. Joshua 1 Jan 1684/5
7. Elizabeth 17 Jan 1688

Isaac WOODBURY, SON of NICHOLAS
Elizabeth
MARRIAGE ________
1. Nickolas 12 Aug
2. Ann 29 Jun
3. Lidia 1 Feb 1691/2
4. Elisabeth about
5. Isaac 20 Jul
6. Benjamin 20 Aug 1699

27 VALIDATION III
Grouping by FAMILY:
(total # merges + splits to correct families after 2nd PASS) / (total # merges + splits to correct families after 1st PASS)
The lower the number, the better
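
The family-grouping metric is a ratio of manual corrections still needed after each pass. A minimal sketch, with a hypothetical function name and made-up counts used purely to show the arithmetic:

```python
def grouping_ratio(corrections_after_second_pass, corrections_after_first_pass):
    """Merges + splits still needed after the 2nd pass, relative to the 1st pass."""
    if corrections_after_first_pass == 0:
        raise ValueError("no corrections were needed after the 1st pass")
    return corrections_after_second_pass / corrections_after_first_pass


# Hypothetical counts: 8 merges and splits needed after the 1st pass, 2 after the 2nd.
print(grouping_ratio(2, 8))  # 0.25
```

A value below 1 means the second pass reduced the manual work needed to get the families right, and 0 would mean the rules produced the correct families on their own.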

28 Optimize the Rules
- Add
- Remove
- Fine-tune
- Change the order
- Improve the whole process
- Until the metrics no longer improve

29 AUTOMATIC EXTRACTION
- Unstructured genealogical data
- Searchable annotated genealogical data
- Families in RDF-data file

Questions?