Recognizing Records from the Extracted Cells of Genealogical Microfilm Tables Kenneth Martin Tubbs Jr. A Thesis Submitted to the Faculty of Brigham Young.

1 Recognizing Records from the Extracted Cells of Genealogical Microfilm Tables Kenneth Martin Tubbs Jr. A Thesis Submitted to the Faculty of Brigham Young University

2 Motivation
– Millions of people want genealogical information
– Acquiring microfilm is expensive and time consuming

3 Extraction Problem
– Searching microfilm by hand is slow, error prone, and tedious
– Extraction by hand requires enormous amounts of time and manpower

4 Difficulties
– Tables have different layouts and styles
– Tables contain different records
– Tables do not use a uniform schema
– Tables lack information and are ambiguous

5 Related Work
– Current work exploits the geometric properties of tables
– Regular expressions, grammars, probabilistic models, and templates
– They ignore the ontological constraints of this information

6 Contributions
– Exploit both ontological and geometric constraints
– Identify complex records
– Work with tables with hand-written values

7 Algorithm
Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology
Method: Generate Confidences, Enforce Constraints, Verify Results
Output: SQL Insert Statements

8 Training Set
– 25 Tables from 5 different microfilm rolls
– Used to:
  – Identify relationships between table cells
  – Create genealogical ontology
  – Define features to extract
  – Generate rules (constraints)

9 Input: Microfilm Table

10

11 Input Features
1. Coordinates of each cell.
2. Printed text for label cells.
3. Whether or not each value cell is empty.
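The three input features could be held in a small record type. The following is a minimal sketch, not the thesis tool's actual data model: the class name `Cell`, its field names, and the left, top, right, bottom coordinate order are all assumptions, since the slides do not describe the XML schema. The value cell coordinates are taken from the SQL example on slide 38; the label cell is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cell:
    """One extracted table cell (hypothetical structure)."""
    left: int
    top: int
    right: int
    bottom: int
    text: Optional[str] = None  # printed text, present only for label cells
    is_empty: bool = False      # only meaningful for value cells

    @property
    def is_label(self) -> bool:
        # Assumption: a cell with printed text is a label cell.
        return self.text is not None

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


# Illustrative label cell placed directly above a value cell.
label = Cell(335, 60, 521, 114, text="Full Name")
value = Cell(335, 114, 521, 172)
```

Keeping the record immutable makes cells safe to use as dictionary keys when confidences are stored per pair of cells.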

12 Input: Microfilm Table

13 Genealogical Ontology

14

15

16 Generate Confidences
– Confidence of relationships between pairs of cells
– Generate confidence values between 0 and 1

17 Relationships
– A label cell describes a value cell
– Value cells in same row or column
– Label cells form a multi-level label
– A label cell maps to an object set
– Identify factoring

18 Label Cell and Value Cell
A continuous path between a label cell and a value cell
Confidence = 1 if a path exists, 0 if no path exists

19 Label Cell and Value Cell
Preferences for label – value orientations
Label Orientation | Confidence
Above | 1
Left | .75
Right | .5
Below | .25
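The orientation preferences amount to a lookup table. Here is a minimal sketch using the four confidence values from the slide; the function names, the box format (left, top, right, bottom, with y increasing downward), the order in which orientations are tested, and the 0.0 returned for overlapping boxes are assumptions made for illustration.

```python
# Confidence values as listed on the slide.
ORIENTATION_CONFIDENCE = {"above": 1.0, "left": 0.75, "right": 0.5, "below": 0.25}


def orientation(label, value):
    """Return where the label box lies relative to the value box, or None.

    Boxes are (left, top, right, bottom) tuples; y grows downward (assumed).
    """
    l_left, l_top, l_right, l_bottom = label
    v_left, v_top, v_right, v_bottom = value
    if l_bottom <= v_top:
        return "above"
    if l_right <= v_left:
        return "left"
    if l_left >= v_right:
        return "right"
    if l_top >= v_bottom:
        return "below"
    return None  # boxes overlap


def orientation_confidence(label, value):
    # Overlapping boxes get 0.0 here; the slides do not cover that case.
    return ORIENTATION_CONFIDENCE.get(orientation(label, value), 0.0)


value_box = (335, 114, 521, 172)
print(orientation_confidence((335, 60, 521, 114), value_box))   # label above
print(orientation_confidence((100, 114, 335, 172), value_box))  # label to the left
```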

20 Label Cell and Value Cell
Compare the height or width of each label cell with each value cell
Confidence scale between 0 and 1, from Not Similar to Similar

21 Value Cell and Value Cell (Same Row)
A continuous, horizontal path exists between a pair of value cells
Confidence = 1 if a path exists, 0 if no path exists

22 Value Cell and Value Cell (Same Column)
A continuous, vertical path exists between a pair of value cells
Confidence = 1 if a path exists, 0 if no path exists

23 Value Cell and Value Cell (Geometrically Similar)
Compare height and width
Confidence scale between 0 and 1, from Not Similar to Similar
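The slides say only that height and width are compared to give a confidence between 0 and 1. One way this could be done is with a ratio-based score, sketched below. The min/max ratio and the choice to take the weaker of the two dimensions are assumptions, not the formula used in the thesis.

```python
def dimension_similarity(a, b):
    """Ratio of the smaller to the larger length: 1.0 means identical."""
    if a <= 0 or b <= 0:
        return 0.0
    return min(a, b) / max(a, b)


def geometric_similarity(cell_a, cell_b):
    """Similarity in [0, 1] of two (left, top, right, bottom) boxes.

    Assumption: the overall score is the weaker of the width and height scores.
    """
    width_a, height_a = cell_a[2] - cell_a[0], cell_a[3] - cell_a[1]
    width_b, height_b = cell_b[2] - cell_b[0], cell_b[3] - cell_b[1]
    return min(dimension_similarity(width_a, width_b),
               dimension_similarity(height_a, height_b))


# Two value cells with the coordinates shown in the SQL example on slide 38.
print(geometric_similarity((335, 114, 521, 172), (335, 173, 521, 231)))
```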

24 Multi-level Labels
– Distance between the midpoints
– A line through the midpoints
– Share a common border
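Two of the cues listed for multi-level labels, midpoint distance and a shared border, can be computed directly from cell coordinates. This is a minimal sketch under assumed conventions: boxes are (left, top, right, bottom) tuples, and the pixel `tolerance` for touching edges is a hypothetical parameter. How the cues are combined into a single confidence is not stated in the slides and is not attempted here.

```python
import math


def midpoint(cell):
    """Center point of a (left, top, right, bottom) box."""
    return ((cell[0] + cell[2]) / 2, (cell[1] + cell[3]) / 2)


def midpoint_distance(cell_a, cell_b):
    return math.dist(midpoint(cell_a), midpoint(cell_b))


def shares_horizontal_border(upper, lower, tolerance=2):
    """True if upper's bottom edge meets lower's top edge and they overlap in x."""
    touching = abs(upper[3] - lower[1]) <= tolerance
    overlap = min(upper[2], lower[2]) - max(upper[0], lower[0])
    return touching and overlap > 0


# Illustrative boxes: a wide top-level label spanning a narrower sub-label.
top_label = (0, 0, 200, 40)
sub_label = (0, 40, 100, 80)
print(shares_horizontal_border(top_label, sub_label))
print(midpoint_distance(top_label, sub_label))
```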

25 Match Label Cells to Object Sets
– Match synonyms of object sets to words in a label
  – Location of matched words
  – Order that object sets match words
Object Sets: Full Name, Location, Day, Family
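Synonym matching of this kind can be sketched as a dictionary lookup over the words of a label. The object set names below come from the slide; the synonym lists, the function name, and the tokenization are hypothetical. The function returns the position of each matched word, so both the location of matches and the order of the matched object sets are available to later steps.

```python
# Object set names from the slide; the synonym sets are invented examples.
SYNONYMS = {
    "Full Name": {"name"},
    "Location": {"place", "residence"},
    "Day": {"day"},
    "Family": {"family"},
}


def match_object_sets(label_text):
    """Return (word_position, object_set) pairs in the order words appear."""
    matches = []
    for position, word in enumerate(label_text.lower().split()):
        token = word.strip(".,:;()")
        for object_set, synonyms in SYNONYMS.items():
            if token in synonyms:
                matches.append((position, object_set))
    return matches


print(match_object_sets("Place of Residence"))
```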

26 Enforce Constraints
– A set of rules describes geometric and ontological constraints.
– For example:
  – Value cells of the same type have the same dimensions
  – A family can’t have 100 members
– The algorithm iterates over the rules

27 1. Similar Value Cells

28 1. Similar Value Cells: Lower Confidence

29 1. Similar Value Cells

30 2. Combine Aggregations

31 3. Multi-level Labels

32 4. Factoring
Check Cardinality Constraints
– Observed cardinality:
  – microfilm table
– Expected cardinality:
  – genealogy ontology

33 Observed Cardinality
[First Name] per [Family] = 45 / 9 = 4.67...

34 Expected Cardinality
[First Name] per [Family] = 4.8 * 1 * 1 = 4.8
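The cardinality check compares a ratio observed in the table with the ratio the ontology expects. A minimal sketch follows, assuming a relative tolerance; the function names, the `tolerance` value, and the example counts are hypothetical. Only the expected value of 4.8 first names per family comes from the slide.

```python
def observed_cardinality(child_count, parent_count):
    """Average number of child values per parent record seen in the table."""
    return child_count / parent_count


def cardinality_is_plausible(observed, expected, tolerance=0.25):
    """True if observed lies within a relative tolerance of expected (assumed rule)."""
    return abs(observed - expected) <= tolerance * expected


EXPECTED_FIRST_NAMES_PER_FAMILY = 4.8  # value shown on the slide

# Illustrative counts: 24 first-name cells grouped into 5 families.
observed = observed_cardinality(24, 5)
print(cardinality_is_plausible(observed, EXPECTED_FIRST_NAMES_PER_FAMILY))

# A grouping that implies 100 members per family is rejected.
print(cardinality_is_plausible(100.0, EXPECTED_FIRST_NAMES_PER_FAMILY))
```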

35 5. Ontological Similarity
Increase Confidence of Label to Object Set Mappings

36 6. Same Microfilm Roll
– Microfilm tables from the same roll have the same structure and relationships
– Generate the confidence values for multiple tables from the same roll
– Take the average of the respective confidence values
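Averaging the respective confidence values across tables could look like the sketch below. It assumes each table's confidences are stored in a dictionary keyed by a relationship identifier that lines up across tables of the same roll; the slides do not say how cells in different tables are aligned, so the keys here are hypothetical.

```python
from statistics import mean


def average_confidences(tables):
    """Average each relationship's confidence over all tables that share it.

    `tables` is a list of dicts mapping a relationship key to a confidence.
    """
    if not tables:
        return {}
    shared_keys = set.intersection(*(set(table) for table in tables))
    return {key: mean(table[key] for table in tables) for key in shared_keys}


# Two tables from the same roll, with hypothetical (label, value) keys.
table_1 = {("label_0", "value_0"): 1.0, ("label_1", "value_0"): 0.5}
table_2 = {("label_0", "value_0"): 0.5, ("label_1", "value_0"): 0.0}
print(average_confidences([table_1, table_2]))
```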

37 Verify Results

38 Database
Create SQL Insert statements to store value cell coordinates
…
INSERT INTO Person (Full_Name) VALUES ('335,114,521,172')
INSERT INTO Person (Full_Name) VALUES ('335,173,521,231')
…
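Statements of this shape can be produced from cell coordinates with simple string formatting. This is a minimal sketch: the function name is hypothetical, and string interpolation is used only to show the output format. A tool writing to a live database would normally pass values as query parameters instead.

```python
def insert_statement(table, column, cell):
    """Build an INSERT that stores a cell's coordinates as a comma-separated string."""
    coordinates = ",".join(str(value) for value in cell)
    return f"INSERT INTO {table} ({column}) VALUES ('{coordinates}')"


# Coordinates of the two value cells shown on the slide.
for cell in [(335, 114, 521, 172), (335, 173, 521, 231)]:
    print(insert_statement("Person", "Full_Name", cell))
```

Storing coordinates rather than text matches the stated contribution of working with hand-written values: the tool records where each value is, not what it says.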

39 Algorithm
Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology
Method: Generate Confidences, Enforce Constraints, Verify Results
Output: SQL Insert Statements

40 Training Set Results
Relationship | Precision | Recall | Accuracy
Label Cell Describes Value Cell | 100% | 100% | 100%
Value Cells in Same Row or Column | 100% | 100% | 100%
Multilevel Labels | 100% | 100% | 100%
Label Cells – Object Set Matches | 74.45% | 100% | 84.65%
Factoring | 100% | 100% | 100%
SQL Fields | 99.42% | 100% | 99.71%

41 Ambiguous Factoring

42 Experiments
– 75 Tables from 15 different microfilm rolls
– Precision, recall, and accuracy
  – Populated SQL fields
  – Each relationship
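Precision and recall for a set of recognized relationships can be computed as follows. This sketch uses the standard definitions; the function name and the convention of returning 1.0 when a set is empty are assumptions. Accuracy is left out because the slides do not say how it was calculated.

```python
def precision_recall(predicted, actual):
    """Standard precision and recall over two collections of items."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Assumed convention: an empty set counts as perfect rather than undefined.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


# Illustrative data: four label-to-object-set matches proposed, three correct.
predicted = {"a", "b", "c", "d"}
actual = {"a", "b", "c"}
print(precision_recall(predicted, actual))
```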

43 Test Set Results
Relationship | Precision | Recall | Accuracy
Label Cell Describes Value Cell | 100% | 98.12%
Value Cells in Same Row or Column | 100% | 100% | 100%
Multilevel Labels | 100% | 99.67% | 99.82%
Label Cells – Object Set Matches | 84.98% | 92.76% | 88.18%
Factoring | 100% | 93.40% | 93.47%
SQL Fields | 93.20% | 92.41% | 92.15%

44 3 Success Examples
1. Specialized Record
2. Ontology Constraints
3. Factoring

45 1. Specialized Records

46
INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Gender, Occupation, Race, Family_Identifier, Birth_Identifier) VALUES (1, '109,455,267,478', '314,456,336,479', '291,456,314,478', '505,457,637,480', '267,456,291,478', 1, 1)
INSERT INTO PERSON (Person_Identifier, Birth_Identifier) VALUES (2, 2)
INSERT INTO PERSON (Person_Identifier, Birth_Identifier) VALUES (3, 3)
INSERT INTO MOTHER_CHILD (Mother_Identifier, Child_Identifier) VALUES (3, 1)
INSERT INTO FATHER_CHILD (Father_Identifier, Child_Identifier) VALUES (2, 1)
INSERT INTO EVENT (Event_Identifier, Location) VALUES (1, '894,460,997,483')
INSERT INTO EVENT (Event_Identifier, Location) VALUES (2, '997,460,1076,483')
INSERT INTO EVENT (Event_Identifier, Location) VALUES (3, '1076,461,1153,484')

47 2. Ontology Constraints

48
INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Family_Identifier, Burial_Identifier) VALUES (1, '70,243,331,373', '620,243,687,370', 1, 1)
INSERT INTO FAMILY (Family_Identifier, Location) VALUES (1, '331,243,508,372')
INSERT INTO EVENT (Event_Identifier, Date) VALUES (1, '508,243,620,371')
INSERT INTO PERSON (Person_Identifier, Full_Name) VALUES (2, '687,241,861,372')

49 3. Factoring

50 3 Types of Errors
1. Ambiguous Factoring
2. Long Label Names
3. Ambiguous Columns

51 2. Long Label Names

52 3. Ambiguous Columns

53 Artifacts
– Tool in the Java programming language
– http://www.rdhd.byu.edu/
– Executable Jar File
– Source Code
– Input Files
– Documentation

54 Future Work
– Advanced natural language processing
– Hand-written values
– Machine learning

55 Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm Kenneth Martin Tubbs Jr. A Thesis Presented to the Department of Computer Science Brigham Young University

