Automatically Identifying Record Patterns from the Extracted Data Fields of Genealogical Microfilm Kenneth Tubbs David W. Embley.


1 Automatically Identifying Record Patterns from the Extracted Data Fields of Genealogical Microfilm Kenneth Tubbs David W. Embley

2 Problem Searching through microfilm by hand is tedious. Extraction by hand requires large amounts of time and manpower.

3 Algorithm Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology. Method: Identify Structure, Match Attributes, Check Constraints, Evaluate Candidates. Output: Record Patterns.

4 External Preprocessing Input Features: 1. Coordinates of each zone. 2. Printed text of each zone. 3. Whether or not each zone is empty. XML Input File: <zone rectangle="66,55,223,11" printed_text="NAME and Surname of each Person" empty="0" />
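The zone element above can be read with a few lines of Python. This is only a sketch: the `Zone` class and `parse_zone` function are hypothetical names, the four rectangle numbers are kept in file order because the slide does not say what each one means, and treating `empty="0"` as "not empty" is an assumption.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Zone:
    rectangle: tuple   # four integers in file order; their meaning is not given on the slide
    printed_text: str  # printed text recognized inside the zone
    empty: bool        # assumption: "0" means the zone is not empty


def parse_zone(xml_text: str) -> Zone:
    """Parse one <zone .../> element into a Zone record."""
    elem = ET.fromstring(xml_text)
    rect = tuple(int(v) for v in elem.attrib["rectangle"].split(","))
    return Zone(rect, elem.attrib["printed_text"], elem.attrib["empty"] != "0")


sample = ('<zone rectangle="66,55,223,11" '
          'printed_text="NAME and Surname of each Person" empty="0" />')
zone = parse_zone(sample)
```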

5 Identify Structure Identify Table Primitives. Evaluate Primitives. Factor Table Primitives.

6 Identify Table Primitives Row: [label:value+] right, height (example label: Name)

7 Identify Table Primitives Column: [label:value+] down, width (example label: Name)

8 Identify Table Primitives Row: [label:value+] right, height

9 Evaluate Primitives Primitive Confidence Level = (formula image not captured in the transcript)

10 Evaluate Primitives Confidence(Label i, Value j) = (formula image not captured in the transcript)

11 Factor Table Primitives Primitives A B C D E F can be factored as [A B C D E F], or [A] [B C D E F], or [E] [A B C D F], or others.
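One way to picture the candidate factorings is to enumerate them. The sketch below only covers the two kinds shown on the slide (no factoring, and one primitive factored out of the rest); the function name `single_factorings` is hypothetical, and the "others" the slide mentions are not generated.

```python
def single_factorings(primitives):
    """Yield the unfactored grouping, then each grouping with one primitive factored out."""
    yield [list(primitives)]
    for i, factored in enumerate(primitives):
        rest = primitives[:i] + primitives[i + 1:]
        yield [[factored], rest]


# Six primitives give the unfactored grouping plus six single factorings.
factorings = list(single_factorings(list("ABCDEF")))
```

Here `factorings[1]` corresponds to [A] [B C D E F] and `factorings[5]` to [E] [A B C D F].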

12 Factor Table Primitives An expert user assigns probabilities to types of factorings. Example: [column:column+] left, .90; [row:column+] below, .85

13 Match Attributes Identify possible mappings from the microfilm table to the genealogical ontology.

14 Identify Possible Mappings Mapping types: 1. Identical Matches 2. Synonym Matches 3. Composite Matches. Examples pairing printed text with the genealogical ontology: Name (identical match); Sex and Gender (synonym match); Female Age and Female, Age (composite match).
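The three mapping types can be illustrated with a small classifier. Everything in it is an assumption made for illustration: the `ONTOLOGY` set uses attribute names that appear elsewhere in the slides, the `SYNONYMS` table is hypothetical, and treating any multi-word label with at least one recognized word as a composite is a simplification, not the authors' rule.

```python
ONTOLOGY = {"name", "age", "gender", "address"}  # assumed attribute names
SYNONYMS = {"sex": "gender"}                     # hypothetical synonym table


def mapping_type(printed_text):
    """Return 'identical', 'synonym', 'composite', or None for a printed label."""
    words = [w.strip(",").lower() for w in printed_text.split()]
    if len(words) == 1:
        word = words[0]
        if word in ONTOLOGY:
            return "identical"
        if SYNONYMS.get(word) in ONTOLOGY:
            return "synonym"
        return None
    # Multi-word label: composite if any word matches directly or via a synonym.
    if any(w in ONTOLOGY or SYNONYMS.get(w) in ONTOLOGY for w in words):
        return "composite"
    return None
```

With these assumed tables, `mapping_type("Name")` is an identical match, `mapping_type("Sex")` a synonym match, and `mapping_type("Female Age")` a composite match.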

15 Evaluate Mapping Edit distance between words.
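The slide does not say which edit distance is used. As a sketch, the standard Levenshtein distance is shown below, together with a normalized `similarity` score; the normalization by the longer word's length and the case folding are assumptions added for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed confidence in [0, 1]: 1 minus distance over the longer length."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

A label that differs from an ontology attribute only by case or a stray character then scores close to 1, while unrelated words score near 0.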

16 Check Constraints The algorithm evaluates the factoring of each record against a genealogical ontology.

17 Identify Records Label (Number of Values): Address (3), Name (14), Age (13), Gender (14). Table (Address, Name) = 14 / 3 = 4.67
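The ratio on this slide is a single division of value counts. A minimal sketch, with `table_ratio` as a hypothetical helper name:

```python
# Value counts per label, as listed on the slide.
value_counts = {"Address": 3, "Name": 14, "Age": 13, "Gender": 14}


def table_ratio(counts, factored, dependent):
    """Average number of dependent values per factored value in the table."""
    return counts[dependent] / counts[factored]


ratio = table_ratio(value_counts, "Address", "Name")  # 14 / 3
```

Rounded to two decimals, `ratio` is 4.67, matching the slide.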

18 Genealogical Ontology The genealogical ontology is created by an expert user. The cardinalities are assigned to the ontology by recording the cardinalities of a corpus of microfilm.

19 Genealogical Ontology Ontology (Address, Name) = 1 * 4.3 * 1.1 = 4.73 (Figure: ontology diagram with the object sets Family, Address, Person, Name, Age, and Gender; each relationship is annotated with average cardinalities such as 1, 1.1, 1.2, 1.3, and 4.3.)
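The ontology value is a product of cardinalities along a path between two object sets. In the sketch below, the three factors are taken from the slide, but which relationship each factor belongs to is not recoverable from the transcript, so the list is simply named `path_cardinalities`.

```python
import math

# Factors from the slide: 1 * 4.3 * 1.1. The edge each one annotates is an
# open question here; only the product is taken from the source.
path_cardinalities = [1.0, 4.3, 1.1]

# Expected number of Name values per Address value according to the ontology.
expected = math.prod(path_cardinalities)
```

Rounded to two decimals, `expected` is 4.73.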

20 Evaluate Factoring Ontology (Address, Name) = 1.0 * 4.3 * 1.1 = 4.73. Table (Address, Name) = 14 / 3 = 4.67. Distance Classifier: Distance_From_Ontology = 1 / (4.73 - 4.67)^2 ≈ 277.8. Distance_From_No_Factoring = 1 / (1 - 4.67)^2 ≈ 0.0742. Each score is an inverse squared distance, so the larger value marks the closer match.
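The comparison on this slide can be reproduced as follows. The function name `inverse_square_score` and the handling of an exact match (returning infinity) are assumptions; the slide only gives the two worked values.

```python
def inverse_square_score(observed, reference):
    """Return 1 / (reference - observed)^2; a larger score means a closer match."""
    diff = reference - observed
    if diff == 0:
        return float("inf")  # assumption: an exact match gets the highest score
    return 1.0 / diff ** 2


observed = 4.67                                          # Table (Address, Name)
score_ontology = inverse_square_score(observed, 4.73)    # against the ontology value
score_no_factoring = inverse_square_score(observed, 1.0) # against a ratio of 1

best = "ontology factoring" if score_ontology > score_no_factoring else "no factoring"
```

The ontology score is several orders of magnitude larger than the no-factoring score, so `best` selects the ontology factoring.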

21 Evaluate Candidates For every combination of primitives, attribute mappings, and factorings, compute the product of their confidences. Select the most confident combination.
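A brute-force version of this selection step can be sketched with `itertools.product`. All candidate names and confidence values below are invented for illustration; only the rule (multiply the confidences, keep the maximum) comes from the slide.

```python
from itertools import product
from math import prod

# Hypothetical candidates as (name, confidence) pairs.
primitives = [("row", 0.8), ("column", 0.9)]
mappings = [("Sex->Gender", 0.7), ("Sex->Age", 0.2)]
factorings = [("[Address] [Name Age Gender]", 0.95), ("no factoring", 0.05)]


def best_combination(*candidate_lists):
    """Score every combination by the product of its confidences; return the best."""
    scored = (
        (prod(conf for _, conf in combo), tuple(name for name, _ in combo))
        for combo in product(*candidate_lists)
    )
    return max(scored)


score, choice = best_combination(primitives, mappings, factorings)
```

With these invented numbers the winner is the column primitive, the Sex->Gender mapping, and the Address factoring. Exhaustive enumeration grows multiplicatively with the number of candidates, which is acceptable for a sketch but may need pruning on large tables.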

22-27 Evaluate Candidates (Figure sequence: a grid of Primitive 1, Primitive 2, and Primitive 3 against Attribute Confidence, with candidate cells marked F; the number of marked cells varies from slide to slide.)

28 Algorithm Input: XML Input File (Preprocessed Microfilm Image), Genealogical Ontology. Method: Identify Structure, Match Attributes, Check Constraints, Evaluate Candidates. Output: Record Patterns.

29 Output Record Patterns: attributes of each record and geometry of each record. Attribute mappings from the table to the ontology.

30 Microfilm Queries A web form provides the interface to query the microfilm database. Individuals can enter keywords (such as first and last name), and the system locates the appropriate records among the indexed documents.

31 Web Query Example keywords: Eyre, John

32 Query Results Click an image to select a result document.

33 Query Results Relevant region of the document is displayed.

34 Automatically Identifying Record Patterns from the Extracted Data Fields of Genealogical Microfilm Kenneth Tubbs David W. Embley

