
1 Text Correction using Domain Dependent Bigram Models from Web Crawls
Christoph Ringlstetter, Max Hadersbeck, Klaus U. Schulz, and Stoyan Mihov

2-6 Two recent goals of text correction
Use of powerful language models: word frequencies, n-gram models, HMMs, probabilistic grammars, etc. (Keenan et al. 91, Srihari 93, Hong & Hull 95, Golding & Schabes 96, ...)
Document centric and adaptive text correction: prefer words of the text as correction suggestions for unknown tokens. (Taghva & Stofsky 2001, Nartker et al. 2003, Rong Jin 2003, ...)
Here: use of document centric language models (bigrams).

7-14 Use of document centric bigram models
Idea: In a text T = ... W_{k-1} W_k W_{k+1} ..., let W_k be an ill-formed token with correction candidates V_1, V_2, ..., V_n (each candidate V_i is considered as a substitute for W_k). Prefer those correction candidates V where the bigrams W_{k-1} V and V W_{k+1} "are natural, given the text T".
Problem: How to measure "naturalness of a bigram, given a text"?
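To make the idea concrete, here is a minimal sketch (not the authors' implementation) of ranking correction candidates for an ill-formed token by how natural the bigrams with its left and right neighbors are. The score function s and the candidate list are assumed to be given; the numbers in the toy example are purely illustrative.

```python
# Minimal sketch: rank correction candidates V for an ill-formed token W_k
# by the "naturalness" of the bigrams (W_{k-1}, V) and (V, W_{k+1}).
# s(u, v) is assumed to return the frequency of the bigram "u v" in some corpus.

def rank_candidates(left, right, candidates, s):
    """Order candidates V so that the bigrams (left, V) and (V, right) are frequent."""
    return sorted(candidates, key=lambda v: s(left, v) + s(v, right), reverse=True)

# Toy example with a hand-made bigram table (purely illustrative numbers).
bigram_freq = {("nerve", "cells"): 12, ("cells", "fire"): 7,
               ("nerve", "calls"): 0, ("calls", "fire"): 1}
s = lambda u, v: bigram_freq.get((u, v), 0)

print(rank_candidates("nerve", "fire", ["cells", "calls"], s))  # ['cells', 'calls']
```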

15-20 How to derive "natural" bigram models for a text?
Counting bigram frequencies in the text T itself? Sparseness of bigrams: low chance to find bigrams repeated in T.
Using a fixed background corpus (British National Corpus, Brown Corpus)? Sparseness problem partially solved, but the models are not document centric.
Our suggestion: Using domain dependent terms from T, crawl a corpus C in the web that reflects the domain and vocabulary of T. Count bigram frequencies in C.
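As a rough illustration of the last step (an assumption about the mechanics, not the paper's actual crawler or tokenizer), counting bigram frequencies in a crawled corpus C takes one pass over the tokenized documents:

```python
from collections import Counter
import re

def tokenize(text):
    # Very simple word tokenizer; the paper's preprocessing may differ.
    return re.findall(r"[a-zA-Z]+", text.lower())

def bigram_counts(documents):
    """Count adjacent word pairs over all crawled documents."""
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Example with two tiny "crawled" documents.
corpus_C = ["Nerve cells fire rapidly.", "Damaged nerve cells regenerate slowly."]
print(bigram_counts(corpus_C)[("nerve", "cells")])  # 2
```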

21-27 Correction Experiments
Starting from the text T:
1. Extract domain specific terms (compounds) from T.
2. Crawl a corpus C that reflects the domain and vocabulary of T.
3. For each pair of dictionary words UV (dictionary D), store the frequency of UV in C as a score s(U,V).
First experiment ("in isolation"): What correction accuracy is reached when s(U,V) is used as the single information for ranking correction suggestions?
Second experiment ("in combination"): Which gain is obtained when s(U,V) is added as a new parameter to a sophisticated correction system that uses other scores as well?
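A minimal sketch of step 3, assuming the bigram counts from the crawl (as sketched above) and a dictionary D are already available; the helper names are illustrative, not taken from the paper:

```python
from collections import Counter

def build_scores(bigram_counts, dictionary):
    """Keep s(U, V) = frequency of the bigram "U V" in the crawled corpus C,
    restricted to pairs of dictionary words; all other pairs default to 0."""
    dictionary = set(dictionary)
    return {(u, v): f for (u, v), f in bigram_counts.items()
            if u in dictionary and v in dictionary}

def s(scores, u, v):
    return scores.get((u, v), 0)

# Illustrative use with hand-made counts of the kind produced by the sketch above.
counts = Counter({("nerve", "cells"): 2, ("cells", "xqzt"): 1})
scores = build_scores(counts, {"nerve", "cells", "fire"})
print(s(scores, "nerve", "cells"), s(scores, "cells", "xqzt"))  # 2 0
```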

28-29 Experiment 1: bigram scores "in isolation"
Setup: a set of ill-formed output tokens of a commercial OCR system. Candidate sets for ill-formed tokens: dictionary entries with edit distance < 3. s(U,V) is used as the single information for ranking correction suggestions. We measure the percentage of correctly top-ranked correction suggestions, comparing bigram scores from web crawls, from the BNC, and from the Brown Corpus.

Correctly top-ranked suggestions on texts from 6 domains:

         Neurol.   Fish     Mushr.   Holoc.   Rom.     Botany
Crawl    64.5%     43.6%    54.8%    59.5%    48.2%    56.5%
BNC      46.8%     34.7%    41.8%    40.9%    37.5%    28.5%
Brown    38.2%     30.5%    36.4%    40.2%    37.0%    25.5%

Conclusion: crawled bigram frequencies are clearly better than those from static corpora.
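A schematic version of this evaluation (the real system uses large dictionaries and efficient Levenshtein filtering; the sketch below only mirrors the logic): generate dictionary words within edit distance < 3 of each ill-formed token, rank them by the bigram score alone, and count how often the top-ranked suggestion equals the intended word.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def top1_accuracy(errors, dictionary, s):
    """errors: list of (left_word, ill_formed_token, right_word, intended_word)."""
    correct = 0
    for left, token, right, truth in errors:
        candidates = [w for w in dictionary if edit_distance(token, w) < 3]
        if candidates:
            best = max(candidates, key=lambda v: s(left, v) + s(v, right))
            correct += (best == truth)
    return correct / len(errors)
```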

30 Experiment 2: adding bigram scores to a fully-fledged correction system
Baseline: correction with a length-sensitive Levenshtein distance and crawled word frequencies as two scores; bigram frequencies are then added as a third score. We measure the correction accuracy (percentage of correct tokens) reached with fully automated correction (optimized parameters), on the output of a commercial OCR system (OCR 1) and an open source OCR system (OCR 2).
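The slides only say the parameters are optimized; how the three scores are combined is not shown. As a hedged illustration (the weights and the exact form of the length-sensitive distance are assumptions, not the paper's model), a linear combination could look as follows, reusing edit_distance from the previous sketch:

```python
import math

def combined_score(token, candidate, left, right, word_freq, bigram_score,
                   w_dist=1.0, w_freq=0.5, w_bigram=0.5):
    """Higher is better: penalize edit distance relative to token length,
    reward frequent candidates and frequent neighboring bigrams.
    edit_distance is the function from the previous sketch."""
    dist = edit_distance(token, candidate) / max(len(token), 1)   # length-sensitive
    freq = math.log1p(word_freq.get(candidate, 0))                # crawled word frequency
    bigr = math.log1p(bigram_score(left, candidate) + bigram_score(candidate, right))
    return -w_dist * dist + w_freq * freq + w_bigram * bigr
```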

31-34 Experiment 2, OCR 1 (accuracy in %):

                OCR 1 output   Baseline correction   Adding bigram score   Additional gain
Neurology       98.74          99.39                 99.44                 0.05
Fish            99.23          99.47                 99.57                 0.10
Mushroom        99.01          99.50                 99.55                 0.05
Holocaust       98.86          99.03                 99.15                 0.12
Roman Empire    98.73          98.90                 99.00                 0.10
Botany          97.19          97.67                 97.89                 0.22

Observations: the OCR 1 output is already highly accurate; the baseline correction adds a significant improvement; adding the bigram score yields a small additional gain.

35-38 Experiment 2, OCR 2 (accuracy in %):

                OCR 2 output   Baseline correction   Adding bigram score   Additional gain
Neurology       90.13          96.29                 96.71                 0.42
Fish            93.36          96.71                 98.02                 1.31
Mushroom        89.26          95.51                 96.00                 0.49
Holocaust       88.77          94.23                 94.61                 0.38
Roman Empire    93.11          96.12                 96.91                 0.79
Botany          91.71          95.41                 96.09                 0.68

Observations: the OCR 2 output accuracy is lower; the baseline correction adds a drastic improvement; adding the bigram score yields a considerable additional gain.

39 Additional experiments: comparing language models
Compare word frequencies in the input text with
1. word frequencies retrieved from "general" standard corpora,
2. word frequencies retrieved from crawled domain dependent corpora.
Result: Using the same large word list (dictionary) D, the top-k segments w.r.t. the ordering by frequencies of type 2 cover many more tokens of the input text than the top-k segments w.r.t. the ordering by frequencies of type 1.

40 Additional experiments: comparing language models
[Chart: coverage of the input text's tokens and types by the top-k dictionary segments, for crawled frequencies vs. standard frequencies]
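The coverage measurement behind this comparison can be written down in a few lines; a sketch under the assumption that the dictionary is simply ranked by the respective frequency counts:

```python
def coverage_of_top_k(text_tokens, dictionary_freq, k):
    """Fraction of the input text's tokens whose word is among the k
    highest-ranked dictionary entries under the given frequency ordering."""
    ranked = sorted(dictionary_freq, key=dictionary_freq.get, reverse=True)
    top_k = set(ranked[:k])
    hits = sum(1 for t in text_tokens if t in top_k)
    return hits / len(text_tokens) if text_tokens else 0.0
```

The same computation over distinct words instead of running tokens gives the type coverage shown in the chart.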

41-46 Summing up
Bigram scores represent a useful additional score for correction systems.
Bigram scores obtained from text-centered, domain dependent crawled corpora are more valuable than uniform bigram scores from general corpora.
Sophisticated crawling strategies were developed, along with special techniques for keeping arbitrary bigram scores in main memory (see paper).
The additional gain in accuracy reached with bigram scores depends on the baseline.
Language models obtained from text-centered, domain dependent corpora retrieved in the web reflect the language of the input document much more closely than those obtained from general corpora.
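The paper's memory techniques are not detailed on the slides; purely as an illustration of the general idea (an assumption, not the authors' data structure), word forms can be mapped to integer IDs and each bigram packed into a single 64-bit key, which avoids storing pairs of strings:

```python
class CompactBigramScores:
    """Illustrative compact store: word forms -> integer IDs, bigram -> one packed key."""

    def __init__(self):
        self.word_ids = {}   # word form -> integer ID
        self.scores = {}     # (id_u << 32) | id_v  ->  bigram frequency

    def _id(self, word):
        return self.word_ids.setdefault(word, len(self.word_ids))

    def add(self, u, v, freq):
        self.scores[(self._id(u) << 32) | self._id(v)] = freq

    def s(self, u, v):
        iu, iv = self.word_ids.get(u), self.word_ids.get(v)
        if iu is None or iv is None:
            return 0
        return self.scores.get((iu << 32) | iv, 0)

store = CompactBigramScores()
store.add("nerve", "cells", 2)
print(store.s("nerve", "cells"), store.s("nerve", "calls"))  # 2 0
```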

47 Thanks for your attention!

