A large list of confusion sets for spellchecking assessed against a corpus of real-word errors Jenny Pedler, Roger Mitton LREC 2010.


1 A large list of confusion sets for spellchecking assessed against a corpus of real-word errors Jenny Pedler, Roger Mitton LREC 2010

2 Some real-word errors The sand-eel is the principle food for many birds and animals. Our teacher tort us to spell. Henley Regatta comes near the top of the English social calender.

3 Spellchecker-induced real-word errors The Wine Bar Company is opening a chain of brassieres. The nightwatchman threw the switch and eliminated the backyard.

4 Cupertino, California

5 ... to encourage cooperation and...

6

7 Cupertino co-operation....

8 The original Cupertinos "reinforcing bilateral and multilateral Cupertino" "South Asian Association for regional Cupertino"

9 Confusion sets {cite, sight, site} {form, from} {passed, past} {peace, piece} {principal, principle} {quiet, quite, quit} {their, there, they're} {weather, whether} {you're, your}

10 He had quiet a young girl staying with him of 17 named Ethel Monticue.

11 He had quiet [quite? quit?] a young girl staying with him of 17 named Ethel Monticue.
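The lookup step behind slides 9 to 11 can be sketched roughly as follows. This is only an illustration, assuming a handful of sets copied from slide 9; the names `CONFUSION_SETS`, `build_index` and `flag_confusables` are hypothetical and not taken from the authors' software. It flags every token that belongs to a confusion set and lists the alternatives a later stage would have to choose between.

```python
# Toy confusion sets copied from slide 9; the list described in the
# talk has about 6000 sets.
CONFUSION_SETS = [
    {"cite", "sight", "site"},
    {"form", "from"},
    {"quiet", "quite", "quit"},
    {"their", "there", "they're"},
]


def build_index(sets):
    """Map each word to the other members of its confusion set(s)."""
    index = {}
    for members in sets:
        for word in members:
            index.setdefault(word, set()).update(members - {word})
    return index


def flag_confusables(sentence, index):
    """Return (position, word, alternatives) for each confusable token."""
    flagged = []
    for pos, token in enumerate(sentence.lower().split()):
        word = token.strip(".,;:!?\"'")  # drop surrounding punctuation only
        if word in index:
            flagged.append((pos, word, sorted(index[word])))
    return flagged


index = build_index(CONFUSION_SETS)
print(flag_confusables(
    "He had quiet a young girl staying with him of 17 named Ethel Monticue.",
    index,
))
```

Flagging is only the first half of the approach: deciding whether "quiet" should really be "quite" or "quit" needs the context-based rules discussed on the later slides.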

12 The confusion-set approach has been demonstrated to work with (a) a short list of confusion sets, (b) artificial test data.

13 To assess its potential for real, unrestricted text, we need: (1) a realistically-sized list of confusion sets, (2) a corpus of running text containing genuine real-word errors.

14 A list of confusion sets
Tuned string-to-string edit-distance
~ 6000 sets
Headword (confusables)
–wright (right, write)
–right (rite, write)
–write (right, rite, writ)
Inflected forms
Proper nouns
Usage errors – e.g.
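As a rough illustration of how confusables might be proposed from a dictionary, the sketch below uses a plain, untuned edit distance (optimal string alignment, which counts an adjacent transposition as one edit). This is a swapped-in stand-in, not the tuned string-to-string distance named on the slide, whose weights are not given here. The function names, the tiny lexicon and the `max_dist` threshold are all assumptions.

```python
def osa_distance(a, b):
    """Optimal-string-alignment distance: insert, delete, substitute,
    or transpose two adjacent letters, each at cost 1 (untuned)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def confusables(headword, lexicon, max_dist=1):
    """Words in the lexicon within max_dist edits of the headword."""
    return sorted(
        w for w in lexicon
        if w != headword and osa_distance(headword, w) <= max_dist
    )


# Hypothetical mini-lexicon, for illustration only.
LEXICON = {"quiet", "quite", "quit", "form", "from", "peace", "piece"}
print(confusables("quiet", LEXICON))     # neighbours at distance 1
print(confusables("peace", LEXICON, 2))  # needs a looser threshold
```

A uniform threshold is crude: "peace" and "piece" are two edits apart, so a cut-off of 1 misses them while a cut-off of 2 would admit many unwanted pairs in a full dictionary. That is one motivation for tuning the distance rather than using it raw.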

15 A corpus of real-word errors
Sentences                     675
Words                         12024
Total errors (tokens)         833
Distinct errors (types)       428
Distinct error/target pairs   495
quit → quiet
quit → quite

16 Corpus mark-up example
The collation of the information was relay quit easy to do.

17 Corpus profile: Frequent errors
Error|target pair    Frequency
there|their          35
form|from            20
to|too               19
their|there          19
a|an                 18
its|it's             17
your|you're          15
weather|whether      12
cant|can't           10
collage|college      9

18 Corpus profile: Homophone errors
Homophone set            N. Occs
there, their, they're    38
to, too, two             23
its, it's                17
your, you're             15
weather, whether         12
herd, heard              5
witch, which             4
hear, here               3
wile, while              3
14% of distinct error/target pairs

19 Corpus profile: Simple errors
Error Type                          N. Errors   % Errors
Omission (e.g. ether, either)       142         29%
Substitution (e.g. vary, very)      104         21%
Insertion (e.g. bellow, below)      56          11%
Transposition (e.g. dose, does)     12          2%
All simple                          314         63%
All error pairs                     495         100%
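The four simple error types can be told apart mechanically. The sketch below is one possible classifier, assuming that "simple" means the error differs from the target by exactly one letter-level edit; the function name `simple_error_type` is hypothetical. It is checked against the example pairs given on the slide.

```python
def simple_error_type(error, target):
    """Classify an error/target pair as one of the four simple types,
    or return None if more than one edit separates the two words."""
    if len(error) + 1 == len(target):
        # Omission: the error is the target with one letter left out.
        if any(target[:i] + target[i + 1:] == error
               for i in range(len(target))):
            return "omission"
    elif len(error) == len(target) + 1:
        # Insertion: the error is the target with one extra letter.
        if any(error[:i] + error[i + 1:] == target
               for i in range(len(error))):
            return "insertion"
    elif len(error) == len(target):
        diffs = [i for i in range(len(error)) if error[i] != target[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and error[diffs[0]] == target[diffs[1]]
                and error[diffs[1]] == target[diffs[0]]):
            return "transposition"
    return None


for error, target in [("ether", "either"), ("vary", "very"),
                      ("bellow", "below"), ("dose", "does"),
                      ("pads", "passed")]:
    print(error, target, simple_error_type(error, target))
```

The percentages in the table are taken over the 495 distinct error/target pairs, so "All simple" is 314 / 495, or about 63%.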

20 How would our list cope with our corpus?
Category                             Example                          Types   Tokens
Detectable and correctable           shod (should)                    44%     58%
Detectable but not correctable       martial (material)               16%     12%
Not detectable (inflection error)    friend (friends), take (taken)   23%     17%
Not detectable (other)               pads (passed)                    17%     13%
Total (100%)                                                          495     833
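The three outcomes in this table follow from two membership tests, which the next slide spells out: is the error a headword, and is the target among its confusables? A minimal sketch, assuming the list is stored as a mapping from headword to confusables; the entries in `TOY_LIST` are invented for illustration and are not taken from the real list.

```python
def classify(error, target, confusion_list):
    """Say whether the list could detect and correct a given error."""
    if error not in confusion_list:
        return "not detectable"  # the error is not a headword
    if target in confusion_list[error]:
        return "detectable and correctable"
    return "detectable but not correctable"  # target is not a candidate


# Hypothetical entries; only the three test pairs come from the slide.
TOY_LIST = {
    "shod": {"should", "shed"},
    "martial": {"marital", "marshal"},
}

for error, target in [("shod", "should"),
                      ("martial", "material"),
                      ("pads", "passed")]:
    print(error, target, classify(error, target, TOY_LIST))
```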

21 Non-detectable/non-correctable
Error not a headword (“non-detectable”) | Target not a candidate (“non-correctable”)
Pair, Frequency | Pair, Frequency
a, an 17 | an, a 4
the, they 4 | cause, because 3
is, his 2 | as, has 2
is, it 2 | easy, easily 2
i, it 2 | for, from 2
u, your 2 | in, is 2
mouths, months 2
none, non 2
no, know 2

22 Using the list for spellchecking
Rules based on surrounding context
May be unreliable
–25% errors have another error within 2 words
–9% are another real-word error
Syntax-based methods
–Easiest to implement
–Shown to have good performance

23 Syntax-based rules: potential
Tagsets          Example                                       Types   Tokens
Distinct         bellow (NN1, VVB, VVI), below (AV0, PRP)      58%     68%
? Overlapping    pray (VVB, VVI, AV0), prey (NN1, VVB, VVI)    31%     25%
Matching         confirm (VVI, VVB), conform (VVI, VVB)        11%     7%
Total errors (=100%)                                           299     580
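The three tagset categories amount to a set comparison between the part-of-speech tags an error can take and those its target can take. A minimal sketch, using the tagsets printed on the slide; the function name `tagset_relation` is hypothetical.

```python
def tagset_relation(error_tags, target_tags):
    """Compare the possible part-of-speech tags of an error and its target."""
    error_set, target_set = set(error_tags), set(target_tags)
    if error_set == target_set:
        return "matching"     # syntax alone cannot separate the two words
    if error_set & target_set:
        return "overlapping"  # syntax helps only in some contexts
    return "distinct"         # a tag mismatch always signals the error


print(tagset_relation(["NN1", "VVB", "VVI"], ["AV0", "PRP"]))         # bellow, below
print(tagset_relation(["VVB", "VVI", "AV0"], ["NN1", "VVB", "VVI"]))  # pray, prey
print(tagset_relation(["VVI", "VVB"], ["VVI", "VVB"]))                # confirm, conform
```

On this reading, the "Distinct" row (58% of types, 68% of tokens) is the share of detectable errors for which a purely syntactic rule has the best chance of working.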

24 Resources available for download

