Automatically Identifying Records from the Extracted Data Fields of Genealogical Microfilm Kenneth Tubbs.


1 Automatically Identifying Records from the Extracted Data Fields of Genealogical Microfilm Kenneth Tubbs

2 Microfilm Image

3 Input: the coordinates of each table cell; the printed text in ASCII for each cell, if any; whether or not each cell is empty. [Diagram label: Table Zones]

4 Algorithm [Diagram: Genealogical Ontology, Table Zones, Match Attributes, Identify Structure, Check Constraints, Record Patterns]

5 Identify Structure 1. Identify Table Primitives 2. Aggregate Table Primitives 3. Sort Candidates

6 Identify Structure 1. Identify Table Primitives Name Column: [[table_label width] [table_value width]+] {below}

7 Identify Structure 1. Identify Table Primitives Name Row: [[table_label height] [table_value height]+] {left}
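
The column pattern above (a label followed by one or more values of matching width, each below the last) can be sketched in Python. This is only an illustrative reading of the pattern, not the presenter's implementation: the `Cell` type, the `column_primitive` function, the pixel tolerance `tol`, and the assumption that value cells carry no printed text are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    x: int
    y: int
    width: int
    height: int
    printed_text: str  # "" when the cell has no printed text (assumed to be a value cell)

def column_primitive(cells, label, tol=2):
    """Return [label, value, ...] when value cells of matching width sit directly below label."""
    below = sorted(
        (c for c in cells
         if c is not label
         and abs(c.x - label.x) <= tol
         and abs(c.width - label.width) <= tol
         and c.y > label.y),
        key=lambda c: c.y,
    )
    column = [label]
    for c in below:
        prev = column[-1]
        # Stop at another printed label or at a vertical gap in the column.
        if c.printed_text or abs(c.y - (prev.y + prev.height)) > tol:
            break
        column.append(c)
    # The pattern needs at least one value ("+"), so a lone label is not a primitive.
    return column if len(column) > 1 else None

cells = [
    Cell(0, 0, 50, 10, "Name"), Cell(0, 10, 50, 10, ""), Cell(0, 20, 50, 10, ""),
    Cell(50, 0, 30, 10, "Age"), Cell(50, 10, 30, 10, ""),
    Cell(100, 0, 30, 10, "Sex"),
]
name_column = column_primitive(cells, cells[0])
age_column = column_primitive(cells, cells[3])
lone_label = column_primitive(cells, cells[5])
```

A row primitive would follow the same shape with heights compared instead of widths and the scan running horizontally.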

8 Identify Structure 1. Identify Table Primitives [Figure labels: Printed Text, Hand-written Text, Row Primitive, Column Primitive]

9 Identify Structure 2. Identify Table Primitives Probabilistic rules are associated with each primitive type. Examples: 1. Column primitives should be factored left to right. (.9) 2. Row primitives factor the Column primitives below them. (.7)

10 Identify Structure 2. Aggregate Table Primitives [Figure: table whose primitives are labeled A through L]

11 Identify Structure 2. Aggregate Table Primitives G H I J K L: [G H I J K L] or [G] [H I J K L] or [K] [G H I J L] or [G] [H I J [K][L]] or others

12 Identify Structure 3. Sort Candidates The candidates are evaluated based on: 1. The confidence of the table primitive matches. 2. The probability that the rules used are correct.

13 Identify Structure 3. Sort Candidates 1. [G] [H I J K L] 2. [G H I J K L] 3. [G] [H I J [K][L]] 4. [K] [G H I J L] 5. Others
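
One way to picture the ranking is to combine the two listed criteria into a single score. The slides do not say how they are combined; the sketch below assumes a plain product of primitive-match confidences and rule probabilities. The function name and all numeric values other than the .9 and .7 rule probabilities are made up for illustration.

```python
from math import prod

def candidate_score(primitive_confidences, rule_probabilities):
    # Assumed combination: multiply every confidence and every rule probability.
    return prod(primitive_confidences) * prod(rule_probabilities)

# Hypothetical inputs: (primitive-match confidences, probabilities of the rules used).
candidates = {
    "[G H I J K L]": ([0.8], [0.7]),
    "[G] [H I J K L]": ([0.9, 0.9], [0.9]),
}

ranked = sorted(
    candidates,
    key=lambda name: candidate_score(*candidates[name]),
    reverse=True,
)
```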

14 Match Attributes 1. Identify Possible Mappings 2. Sort Candidates

15 Match Attributes 1. Identify Possible Mappings Mapping types: 1. Identical Matches 2. Synonym Matches 3. Composite Matches 4. Human-Aided Matches [Figure: example mappings between printed text and the Genealogical Ontology, involving Name; Sex and Gender; and Female Age and Female, Age]
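
The four mapping types can be illustrated with a small classifier. This is a sketch under assumptions: the synonym table, the direction of the Sex/Gender mapping, the treatment of "Female" as an ontology attribute, and the whitespace-based composite test are all hypothetical choices rather than anything stated in the slides.

```python
SYNONYMS = {"sex": "gender"}  # assumed synonym table: printed text -> ontology attribute

def match_attribute(printed, ontology_attrs):
    """Classify how a printed column label maps onto ontology attributes."""
    text = printed.strip().lower()
    attrs = {a.lower(): a for a in ontology_attrs}
    if text in attrs:
        return ("identical", [attrs[text]])
    if SYNONYMS.get(text) in attrs:
        return ("synonym", [attrs[SYNONYMS[text]]])
    parts = text.split()
    if len(parts) > 1 and all(p in attrs for p in parts):
        return ("composite", [attrs[p] for p in parts])
    # Nothing matched automatically, so a person would have to supply the mapping.
    return ("human-aided", [])

ontology = ["Name", "Gender", "Age", "Female"]
identical = match_attribute("Name", ontology)
synonym = match_attribute("Sex", ontology)
composite = match_attribute("Female Age", ontology)
unknown = match_attribute("Residence", ontology)
```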

16 Match Attributes 2. Sort Candidates The candidates are evaluated based on: 1. The type of the match. 2. The confidence of the match.
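
A sort over these two criteria might look like the following. The preference order among match types is an assumption (it simply follows the order in which the types are listed), and the tuple layout and confidence values are hypothetical.

```python
# Assumed preference order: earlier-listed match types rank first.
TYPE_RANK = {"identical": 0, "synonym": 1, "composite": 2, "human-aided": 3}

def sort_mappings(mappings):
    """Sort (match_type, confidence, label) tuples by type, then by descending confidence."""
    return sorted(mappings, key=lambda m: (TYPE_RANK[m[0]], -m[1]))

sorted_mappings = sort_mappings([
    ("synonym", 0.95, "Sex"),
    ("identical", 0.80, "Name"),
    ("identical", 0.90, "Age"),
])
```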

17 Check Constraints 1. Identify the individual records. 2. Evaluate the records with the Genealogical Ontology.

18 Check Constraints [Figure: graph relating the table attributes Name, Gender, Address, and Age, with edge labels 1, 1, 1, 4.1, 3.9, and 4.2] Table(Address, Age) = 4.1

19 Check Constraints [Figure: ontology graph relating Family, Person, Name, Address, Age, and Gender, with weighted edges] Ontology(Address, Age) = 1.5 * 4.3 * .9 = 5.805
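
The Ontology(Address, Age) value is a product of three weights. A minimal check of that arithmetic, assuming the three factors are the weights met along the path between the two attributes (the function name is hypothetical):

```python
from math import prod

def path_weight(edge_weights):
    # Multiply the weights encountered between two attributes.
    return prod(edge_weights)

ontology_address_age = path_weight([1.5, 4.3, 0.9])
```

Floating-point rounding means the result should be compared with a tolerance rather than tested for exact equality with 5.805.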

20 Check Constraints Constraint_Score = 1 2 (1/(2n)) * Σ | Ontology(i, j) - Table(i, j) |^2 The variables "i" and "j" are attributes. The sum is over all combinations of "i" and "j". The variable "n" is the number of attributes.
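
The legible part of the formula, a sum of squared differences between ontology and table values scaled by 1/(2n), can be sketched as follows. The sketch covers only that term and makes no claim about the leading part of the expression. The function name and dictionary layout are hypothetical. The single pair uses the Address and Age values from the preceding slides, and n = 4 assumes the four table attributes Name, Gender, Address, and Age.

```python
def squared_difference_term(ontology, table, n):
    """Compute (1 / (2n)) * sum over attribute pairs of |Ontology(i, j) - Table(i, j)|^2."""
    pairs = ontology.keys() & table.keys()  # only pairs scored on both sides
    total = sum(abs(ontology[p] - table[p]) ** 2 for p in pairs)
    return total / (2 * n)

ontology_values = {("Address", "Age"): 5.805}
table_values = {("Address", "Age"): 4.1}
term = squared_difference_term(ontology_values, table_values, n=4)
```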

21 Check Constraints The algorithm creates rules to prevent the factoring of the attributes that receive low constraint scores. The algorithm sorts the candidates by their constraint score.

22 Algorithm [Diagram: Genealogical Ontology, Table Zones, Match Attributes, Identify Structure, Check Constraints, Record Patterns]

23 Final Remarks The algorithm produces: 1. Record patterns: the attributes for each record and the geometry for each record. 2. Attribute mappings from the table to the ontology.

24 Final Remarks Given extracted values for the information written by hand, the process can extract the records into an XML file. Individuals can then query the XML files and index back into the original microfilm images.
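
As a rough picture of that output, a record could be serialized with each field carrying both its extracted value and the cell geometry needed to index back into the image. The element and attribute names here (`record`, `field`, `x`, `y`, `width`, `height`) and the sample values are hypothetical; the slides do not specify an XML schema.

```python
import xml.etree.ElementTree as ET

def record_to_xml(record_id, fields):
    """Build a <record> element from {attribute: (value, (x, y, width, height))}."""
    rec = ET.Element("record", id=str(record_id))
    for name, (value, (x, y, w, h)) in fields.items():
        field = ET.SubElement(
            rec, "field",
            name=name, x=str(x), y=str(y), width=str(w), height=str(h),
        )
        field.text = value
    return rec

record = record_to_xml(1, {
    "Name": ("Mary Smith", (0, 10, 50, 10)),
    "Age": ("34", (50, 10, 30, 10)),
})
xml_text = ET.tostring(record, encoding="unicode")
name_field = record.find("field[@name='Name']")
```

A query for a name would then return the matching field together with the coordinates of its cell on the microfilm image.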

