Extracting Names Using Layout Clues in Genealogical Books. Aaron Stewart, David W. Embley. March 20, 2010
Overview Problem Current solutions Our solution –Preprocessing (briefly, images only) –Pattern approach Future work
Finding Names Name recognition in genealogical texts Focus: Lists, Directories
Finding Names It’s easy for us to spot names… But how does a computer do it? Which side was easier?
Finding Names Stanford Named Entity Recognizer Apache UIMA Framework CRF MEMM Natural Language Processing ?
BYU OntoES Ontology Extraction System Dictionary Regular Expressions
Part 1: Preprocessing
Ancestry.com Data Word text Word bounding boxes Genres: –Genealogical Books –City Directories –Yearbooks –Newspapers
Ancestry.com Data Inconsistent punctuation –Commas and periods –Present in some books, absent in others Word ordering issue –Only some books are affected –Bug in OCR/layout analysis
Word Order The Standing Committee -The Rev William Berrian D ident ; the Rev John McVickar D D D the Rev Pres- I Haioht D D Rev Samuel R Johnson Hoffman Secretary ; the the Hon Gulian C Verplanck D Benjamin D the Hon Mur- ray M Esq Floyd Smith Gouverneur Ogden The to the Esq General Convention -The Rev Edward Y bee D Deputies D the Rev William D Rev Francis Hig- L Hawks D D LL D the Creighton Rev D Vinton the D Hon Murray Hoffman the Hon John A Francis Dix Hon D the Luther Bradish the Hon Nathaniel S Benton
Word Order - Corrected The Standing Committee -The Rev William Berrian D D Pres- ident ; the Rev John McVickar D D the Rev Benjamin I Haioht D D Secretary ; the Rev Samuel R Johnson D D the Hon Mur- ray Hoffman the Hon Gulian C Verplanck Gouverneur M Ogden Esq Floyd Smith Esq The Deputies to the General Convention -The Rev Edward Y Hig- bee D D the Rev William Creighton D D the Rev Francis L Hawks D D LL D the Rev Francis Vinton D D the Hon Murray Hoffman the Hon John A Dix the Hon Luther Bradish the Hon Nathaniel S Benton
Word Order Notice the imaginary green line… Some tokens extend below it: –These are pushed down to the next line! –This is a bug –Clearly, we can do better The Standing Committee -The Rev William Berrian D D Pres- ident ; the Rev John McVickar D D D the Rev Pres- I Haioht
DEG/Ancestry OCR Reformatting TLP original reordering code Page separator Line segment identifier Line ordering RANSAC margin finder
Page Separator Looks for any place where a vertical line can cleanly separate the text Not robust to skew
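A minimal sketch of the page-separator idea, assuming each word arrives as a (left, top, right, bottom) bounding box. The function name `find_page_split` and the `min_gap` threshold are hypothetical and not taken from the original code. It looks for the widest x-interval that no box crosses, which also shows why skew hurts: a single box that leans across the gutter closes the gap.

```python
from typing import List, Optional, Tuple

# Hypothetical box layout: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def find_page_split(boxes: List[Box], min_gap: float = 20.0) -> Optional[float]:
    """Return the x-coordinate in the middle of the widest vertical gap
    that no bounding box crosses, or None if no gap is at least min_gap wide."""
    if not boxes:
        return None
    # Project every box onto the x-axis and sweep from left to right.
    spans = sorted((left, right) for left, _, right, _ in boxes)
    best_gap, best_x = 0.0, None
    covered_right = spans[0][1]
    for left, right in spans[1:]:
        gap = left - covered_right
        if gap >= min_gap and gap > best_gap:
            best_gap, best_x = gap, (left + covered_right) / 2
        covered_right = max(covered_right, right)
    return best_x


# Two facing pages: words end at x=100 on the left page and start at x=130 on the right.
pages = [(0, 0, 40, 10), (50, 0, 100, 10), (130, 0, 180, 10), (190, 0, 230, 10)]
split_x = find_page_split(pages)  # 115.0, the middle of the gutter
```

Because the test is a strict projection onto the x-axis, any word that straddles the gutter makes the function return None, matching the "not robust to skew" caveat.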
Line Segment Identifier Combines words within about 2 spaces Handles skew reasonably well
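The grouping rule can be sketched as follows, assuming word bounding boxes and an externally supplied estimate of the width of one space. The `Word` type, `group_into_segments`, and the parameters `space_width` and `max_spaces` are illustrative names, not the original implementation. Each word is compared only with the last word of a segment, so a gradually drifting baseline is tolerated.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    left: float
    top: float
    right: float
    bottom: float


def group_into_segments(words: List[Word], space_width: float,
                        max_spaces: float = 2.0) -> List[List[Word]]:
    """Chain words into line segments when the horizontal gap to the
    previous word is at most max_spaces * space_width and the two boxes
    overlap vertically."""
    segments: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.left):
        for seg in segments:
            last = seg[-1]
            gap = word.left - last.right
            overlap = min(word.bottom, last.bottom) - max(word.top, last.top)
            # Compare with the neighboring word only, not a global line,
            # so slight skew along the line does not break the chain.
            if gap <= max_spaces * space_width and overlap > 0:
                seg.append(word)
                break
        else:
            segments.append([word])
    return segments


words = [
    Word("The", 0, 0, 30, 10),
    Word("Rev", 35, 1, 65, 11),    # drifts down by one pixel per word
    Word("John", 70, 2, 110, 12),
    Word("Esq", 200, 0, 230, 10),  # far to the right: separate segment
    Word("Next", 0, 20, 40, 30),   # a different text line
]
segments = group_into_segments(words, space_width=5)
```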
Line Ordering Works well in most cases Excessive skew or overlap is harder
RANSAC Margin Finder Random Sample Consensus (RANSAC) Finds a line in the presence of noise Effective for finding left-aligned margins, tab stops, table columns
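A small RANSAC sketch for margin finding, assuming the input is the (x, y) position of the left edge of each line segment. The function `ransac_margin`, its tolerance, and its iteration count are illustrative choices rather than the authors' settings. The margin is modeled as x = slope * y + intercept so that a perfectly vertical margin has slope 0, and the residual is measured horizontally, which is a reasonable simplification only for near-vertical lines.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) of a line segment's left edge


def ransac_margin(points: List[Point], tol: float = 2.0,
                  iterations: int = 200, seed: int = 0
                  ) -> Tuple[float, float, List[Point]]:
    """Fit x = slope * y + intercept by RANSAC.

    Returns (slope, intercept, inliers) for the candidate line that has
    the most points within tol pixels horizontally."""
    rng = random.Random(seed)
    best: Tuple[float, float, List[Point]] = (0.0, 0.0, [])
    for _ in range(iterations):
        # Minimal sample: two points define a candidate line.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if y1 == y2:
            continue  # same height: no near-vertical line through them
        slope = (x2 - x1) / (y2 - y1)
        intercept = x1 - slope * y1
        # Consensus set: points close to the candidate line.
        inliers = [(x, y) for x, y in points
                   if abs(x - (slope * y + intercept)) <= tol]
        if len(inliers) > len(best[2]):
            best = (slope, intercept, inliers)
    return best


# Ten left edges on a margin at x = 50, plus four indented or noisy edges.
edges = [(50.0, float(y)) for y in range(0, 100, 10)]
edges += [(80.0, 15.0), (120.0, 35.0), (65.0, 55.0), (200.0, 75.0)]
slope, intercept, inliers = ransac_margin(edges)
```

Running the same procedure again on the points left over after removing the inliers is one way to pick up further tab stops or table columns.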
Margin Finder – Future Work Left Center Right Key
Margin Finder – Future Work Line Wrap?
Margin Finder – Future Work ABBYY FineReader handles: –Paragraphs –Newspaper columns But has trouble with: –Hanging indents –Outline indentation (possibly)
Part 2: Pattern Finding
Pattern Finding 1. Apply baseline name extractor (OntoES) 2. Apply margin finder and insert markers 3. Find left and right context for each name 4. Apply common contexts to extract more names
Pattern Finding 1. Apply baseline name extractor (OntoES)
Pattern Finding 2. Apply margin finder and insert markers (figure labels: LEVEL 1, LEVEL 2)
Pattern Finding 3. Find left and right context for each name (figure labels: LEVEL 1, LEVEL 2)
Pattern Finding 4. Apply common context patterns to extract more names (figure labels: LEVEL 1, LEVEL 2)
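Steps 3 and 4 can be sketched as below, under several assumptions: the page is already a flat token list, the margin finder has inserted a marker token (the hypothetical `<L1>` here), and the baseline extractor has returned name spans as token index ranges. The functions `learn_contexts` and `apply_contexts`, the single-token context window, and the `min_count` and `max_len` parameters are illustrative, not the authors' design.

```python
from collections import Counter
from typing import List, Set, Tuple

Span = Tuple[int, int]       # [start, end) token indices
Context = Tuple[str, str]    # (token to the left, token to the right)


def learn_contexts(tokens: List[str], seed_spans: List[Span],
                   min_count: int = 2) -> Set[Context]:
    """Collect the left/right neighbor tokens of each seed name and keep
    the contexts seen at least min_count times."""
    counts: Counter = Counter()
    for start, end in seed_spans:
        left = tokens[start - 1] if start > 0 else "<BOS>"
        right = tokens[end] if end < len(tokens) else "<EOS>"
        counts[(left, right)] += 1
    return {ctx for ctx, n in counts.items() if n >= min_count}


def apply_contexts(tokens: List[str], contexts: Set[Context],
                   max_len: int = 4) -> List[Span]:
    """Return the shortest span (up to max_len tokens) at each position
    whose neighbors match a learned context."""
    padded = ["<BOS>"] + tokens + ["<EOS>"]
    found: List[Span] = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            # padded[start] is the token left of the span,
            # padded[end + 1] is the token right of it.
            if (padded[start], padded[end + 1]) in contexts:
                found.append((start, end))
                break
    return found


tokens = ("<L1> William Berrian , rector <L1> John McVickar , professor "
          "<L1> Samuel Johnson , clerk").split()
seeds = [(1, 3), (6, 8)]  # names the baseline extractor is assumed to find
contexts = learn_contexts(tokens, seeds)  # {("<L1>", ",")}
names = apply_contexts(tokens, contexts)  # [(1, 3), (6, 8), (11, 13)]
```

In this toy input the learned context recovers the third name, which the assumed baseline missed, because it sits between the same margin marker and comma as the two seed names.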
Pattern Finding – Sample Results
                                  Precision   Recall    F1
Baseline Results                  40%         31.25%    35.09%
Results of Most Salient Pattern   51.52%      53.12%    52.31%
Not all results are this good!
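The F1 values reported here are consistent with the usual definition of F1 as the harmonic mean of precision and recall, F1 = 2PR / (P + R). A short check, using only the precision and recall figures from the slide (the helper name `f1_score` is illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


baseline_f1 = f1_score(0.40, 0.3125)    # baseline extractor
pattern_f1 = f1_score(0.5152, 0.5312)   # most salient pattern

print(f"{baseline_f1:.2%}")  # 35.09%
print(f"{pattern_f1:.2%}")   # 52.31%
```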