Sifter for classifying books dense (or not) in family history information & for selecting 3-page sequences for evaluation Deryle Lonsdale 1 Oct. 2013.


1 Sifter for classifying books dense (or not) in family history information & for selecting 3-page sequences for evaluation
Deryle Lonsdale
1 Oct. 2013

2 The task
Develop a data-rich family history text range recognizer
– Perl
– Machine learning
– Mostly off-the-shelf (OTS) components
– Fully automatic
– Arbitrary text chunk size
Evaluate performance

3 Method
Document features
– Language identifier (and confidence)
    We only want English (for now)
    Used a pre-existing Perl module (Simões)
– Type/token ratio
    We want narrow-domain
– % FH lexical items
    We want to prefer FH vocabulary
    Hand-coded, 49 words (died, married, cremation, etc.)
– % integer words, % person words, % date words, % organization words, % location words
    We want it to be data-rich
    Used Stanford named entity engine
– Average sentence length
    Maybe sentences are shorter in FH text?
One vector (floating-point features) per text chunk (e.g. document)
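A minimal sketch of how a few of these features could be computed for one text chunk. It is written in Python for illustration, not the Perl used in the project. The tokenizer, the sentence splitter, and the three-word `FH_LEXICON` are assumptions (the slide names only three of the 49 hand-coded words), and the language-identifier and named-entity features are left out because they depend on external tools.

```python
import re

# Hypothetical stand-in for the 49-word hand-coded family-history lexicon;
# only these three words are named on the slide.
FH_LEXICON = {"died", "married", "cremation"}


def tokenize(text):
    """Split text into lowercase word and integer tokens (simplifying assumption)."""
    return re.findall(r"[a-z]+|\d+", text.lower())


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_features(text):
    """Return one floating-point feature vector (as a dict) for a text chunk."""
    tokens = tokenize(text)
    sentences = split_sentences(text)
    n = len(tokens) or 1  # avoid division by zero on empty chunks
    return {
        "type_token_ratio": len(set(tokens)) / n,
        "pct_fh_lexical": sum(t in FH_LEXICON for t in tokens) / n,
        "pct_integer": sum(t.isdigit() for t in tokens) / n,
        "avg_sentence_length": len(tokens) / (len(sentences) or 1),
    }


features = chunk_features("John Smith died in 1913. He married Mary in 1890.")
print(features)
```

Each chunk yields one such vector; the person, date, organization, and location percentages would be added the same way from the named-entity tagger's output.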

4 Evaluation
Gigaword corpus newswire
– Associated Press Worldstream articles (Nov. 1994 - May 1995)
– 585 obituaries (192,000 words)
– 649 non-obituaries (221,000 words, randomly selected from 85,000 articles)
TiMBL machine learning
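TiMBL is a memory-based (nearest-neighbor) learner. The sketch below illustrates that general idea in Python and does not call TiMBL itself. The Euclidean distance, the two-feature vectors, and the toy training examples are all assumptions made for illustration; the slides do not say which TiMBL metric or settings were used, apart from noting that no tuning was done.

```python
import math
from collections import Counter


def knn_classify(train, query, k=1):
    """Label `query` by majority vote among its k nearest training examples.

    `train` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy vectors: (% FH lexical items, % integers).
train = [
    ((0.05, 0.10), "obit"),
    ((0.06, 0.12), "obit"),
    ((0.00, 0.02), "nonobit"),
    ((0.01, 0.03), "nonobit"),
]

print(knn_classify(train, (0.04, 0.09)))
print(knn_classify(train, (0.00, 0.01), k=3))
```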

5 Results
F-Score beta=1, microav: 0.939263
F-Score beta=1, macroav: 0.939184
AUC, microav: 0.940449
AUC, macroav: 0.940449
overall accuracy: 0.939222 (1159/1234), of which 128 exact matches
Confusion Matrix:
          nonobit   obit
          --------------
nonobit |     595     54
   obit |      21    564
    -*- |       0      0
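The summary figures can be recomputed from the confusion matrix, as sketched below in Python. Three assumptions are made here. Rows are taken as gold labels and columns as predictions, which matches the corpus sizes of 649 non-obituaries and 585 obituaries. The "microav" F-score is taken to be the class-frequency-weighted mean of the per-class F-scores. The AUC is taken to be the mean of the per-class recalls. Under these assumptions the reported values are reproduced to six decimal places.

```python
# Confusion matrix from the slide: outer key = gold label, inner key = prediction
# (assumed orientation; row sums match the 649 / 585 corpus sizes).
matrix = {
    "nonobit": {"nonobit": 595, "obit": 54},
    "obit": {"nonobit": 21, "obit": 564},
}
labels = list(matrix)
total = sum(sum(row.values()) for row in matrix.values())


def support(label):
    """Number of gold examples of this class."""
    return sum(matrix[label].values())


def recall(label):
    return matrix[label][label] / support(label)


def f1(label):
    tp = matrix[label][label]
    fn = support(label) - tp
    fp = sum(matrix[gold][label] for gold in labels) - tp
    return 2 * tp / (2 * tp + fp + fn)


accuracy = sum(matrix[l][l] for l in labels) / total
macro_f = sum(f1(l) for l in labels) / len(labels)
# Assumed definition: per-class F-scores weighted by class frequency.
micro_f = sum(f1(l) * support(l) for l in labels) / total
# Assumed definition: single-threshold AUC as the mean of per-class recalls.
auc = sum(recall(l) for l in labels) / len(labels)

print(f"accuracy {accuracy:.6f}")  # 0.939222
print(f"macro F  {macro_f:.6f}")   # 0.939184
print(f"micro F  {micro_f:.6f}")   # 0.939263
print(f"AUC      {auc:.6f}")       # 0.940449
```

Read this way, the matrix gives 54 false positives (non-obituaries labeled as obituaries) and 21 false negatives.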

6 Feature ranking
    % FH lexical items
    % integers
    % person names
    % dates
    Average sentence length
    Type/token ratio
    % locations
    % organizations

7 Errors
False positives
    Articles about people perishing in concentration camps
    Crime stories (murders, serial killers, murder trial, terrorist acts)
    Accident stories
False negatives
    Lists of creative works
        Credits from George Abbott's stage career, compiled by his office and from theater reference books: The Misleading Lady, 1913, actor. Yeoman of the Guard, 1915, actor. The Queens Enemies, 1916, actor. Lightnin', 1918, rewrote scenes. …
    Tagging errors
        EDITORS: Two versions of Yugoslavia-Obit-Djilas moved on circuits. Please disregard the second, shorter, unbylined version. The AP

8 Caveats
    Obituaries, not FH data per se
    Newswire, not books
    One source
    Will it scale?
    Can it port to FSL?
    Didn't do any ML tuning
    Binary acceptor; continuous values possible?
    Effect of OCR errors?

