LING/C SC/PSYC 438/538 Lecture 19 Sandiway Fong



Administrivia
Next Monday: guest lecture from Dr. Jerry Ball of the Air Force Research Labs, to be continued
– formalism described on Monday
– implementation of the Double R grammar
– complete with demo

Today’s Topics
Chapter 3 of JM
– 3.10 Spelling and Correction of Spelling Errors
– Also discussed in section 5.9
– 3.11 Minimum Edit Distance
Reading
– Chapter 4: N-grams
– Pre-requisite: an understanding of simple probability concepts

But first
From last time…
– Revisit Finite State Transducers (FST)

e-insertion FST
Context-sensitive rule: [figure: rule with left context and right context labeled]
Corresponding FST: [figure: transducer diagram]
– other = any feasible pair not mentioned explicitly in this transducer
– ^ = morpheme boundary
– # = word boundary

e-insertion FST
State transition table and FST: [figures]
Input: [missing]
Output: foxes#

e-insertion FST
State transition table and FST: [figures]
Input: zip^s#
Output: zips#
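The input/output behaviour on these two slides can be sketched in a few lines of Python. This is not the transducer itself: it is a regular-expression stand-in, written under the assumption that the rule inserts e after x, s or z at a morpheme boundary when s# follows, and that the lexical form behind foxes# is fox^s#. The function name `e_insertion` is made up for illustration.

```python
import re


def e_insertion(lexical: str) -> str:
    """Map a lexical form such as 'fox^s#' to its surface form.

    Assumed rule: insert e after x, s or z at a morpheme boundary (^)
    when the next morpheme is s followed by the word boundary (#).
    """
    # Insert e only where the assumed left and right contexts both hold.
    with_e = re.sub(r"([xsz])\^(?=s#)", r"\1^e", lexical)
    # The morpheme boundary symbol does not appear on the surface.
    return with_e.replace("^", "")


print(e_insertion("fox^s#"))
print(e_insertion("zip^s#"))
```

A real FST would process the string one symbol pair at a time using the state transition table; the regular expression only reproduces the end result for these examples.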

Spelling Errors

Textbook cites (Kukich, 1992):
– Non-word detection (easiest)
  graffe (giraffe)
– Isolated-word (context-free) error correction
  graffe (giraffe, …)
  graffed (gaffed, …)
  by definition cannot correct when the error word is a valid word
– Context-dependent error detection and correction (hardest)
  your an idiot → you’re an idiot (Microsoft Word corrects this by default)

Spelling Errors
OCR
– visual similarity
  h → b, e → c, jump → jurnps
Typing
– keyboard distance
  small → smsll, spell → spel
Graffiti (many HCI studies)
– stroke similarity
  Common error characters are: V, T, 4, L, E, Q, K, N, Y, 9, P, G, X
  Two-stroke characters: B, D, P (error: two characters)
Cognitive Errors
– bad spellers
  separate → seperate

correct
– textbook section 5.9
Kernighan et al. (correct)
– take typo t (not a word)
  mutate t minimally by deleting, inserting, substituting or transposing (swapping) a letter
  look up “mutated t” in a dictionary
  candidates are “mutated t” that are real words
– example (5.2)
  t = acress
  C = {actress, cress, caress, access, across, acres, acres}
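The candidate-generation step can be sketched as follows. This is a minimal illustration, not the correct program: the function names and the tiny dictionary are hypothetical, and a set is used, so acres appears once even though the slide lists it twice (two different single edits lead to it).

```python
import string


def single_edits(typo):
    """All strings one deletion, insertion, substitution or transposition away."""
    letters = string.ascii_lowercase
    splits = [(typo[:i], typo[i:]) for i in range(len(typo) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    inserts = [left + ch + right for left, right in splits for ch in letters]
    substitutes = [left + ch + right[1:]
                   for left, right in splits if right for ch in letters]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    return set(deletes + inserts + substitutes + transposes)


def candidates(typo, dictionary):
    """Keep only the mutated strings that are real words."""
    return sorted(single_edits(typo) & dictionary)


# Hypothetical toy dictionary; a real system would load a full word list.
toy_dictionary = {"actress", "cress", "caress", "access", "across",
                  "acres", "stress", "address"}
print(candidates("acress", toy_dictionary))
# ['access', 'acres', 'across', 'actress', 'caress', 'cress']
```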

correct
formula
– $\hat{c} = \arg\max_{c \in C} P(t \mid c)\, P(c)$ (Bayesian inference)
– C = {actress, cress, caress, access, across, acres, acres}
Prior: P(c)
– estimated using frequency information over a large corpus (N words)
– P(c) = freq(c)/N
– P(c) = (freq(c) + 0.5)/(N + 0.5V)
  avoids zero counts (non-occurrences) by adding a fractional count of 0.5
  add-one smoothing, with 0.5 in place of 1
  V is the vocabulary size of the corpus
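A small sketch of the smoothed prior, to show why the 0.5 matters. The counts below are invented for illustration and do not come from any corpus; `smoothed_prior` is a hypothetical helper name.

```python
def smoothed_prior(freq, n_words, vocab_size):
    """P(c) = (freq(c) + 0.5) / (N + 0.5 V)."""
    return (freq + 0.5) / (n_words + 0.5 * vocab_size)


# Invented counts: a three-word vocabulary observed over N = 4 tokens.
counts = {"acres": 3, "actress": 1, "cress": 0}
n_words = sum(counts.values())
vocab_size = len(counts)

for word, freq in counts.items():
    print(word, smoothed_prior(freq, n_words, vocab_size))
```

The unseen word (cress) still receives a non-zero prior, and the priors over the whole vocabulary still sum to 1.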

correct
Likelihood: P(t|c) (probability of typo t given candidate word c)
– using some corpus of errors
– compute the following 4 confusion matrices (each a 26 x 26 matrix, a–z)
– del[x,y] = freq(correct xy mistyped as x)
– ins[x,y] = freq(correct x mistyped as xy)
– sub[x,y] = freq(correct x mistyped as y)
– trans[x,y] = freq(correct xy mistyped as yx)
– P(t|c) = del[x,y]/f(xy) if c related to t by deletion of y
– P(t|c) = ins[x,y]/f(x) if c related to t by insertion of y
– etc…
Very hard to collect this data
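How the likelihood and prior combine can be sketched as below. Every number is invented purely to make the arithmetic visible; none of them comes from Kernighan et al. or the textbook, and the function names are hypothetical. Only the deletion case is computed from a (toy) confusion matrix; the other likelihood is simply given.

```python
def deletion_likelihood(del_counts, bigram_counts, x, y):
    """P(t|c) = del[x,y] / f(xy): correct xy was mistyped as x."""
    return del_counts.get((x, y), 0) / bigram_counts[x + y]


def rank(likelihood, prior):
    """Score each candidate by P(t|c) P(c), normalise, and sort best first."""
    scores = {c: likelihood[c] * prior[c] for c in likelihood}
    total = sum(scores.values())
    ranked = [(c, score / total) for c, score in scores.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Invented toy counts: "ct" seen 1000 times, typed as "c" twice.
del_counts = {("c", "t"): 2}
bigram_counts = {"ct": 1000}

likelihood = {
    # acress from actress: the t after c was deleted.
    "actress": deletion_likelihood(del_counts, bigram_counts, "c", "t"),
    # Invented value standing in for the insertion case.
    "acres": 0.004,
}
prior = {"actress": 3e-5, "acres": 1e-5}  # invented priors

for candidate, share in rank(likelihood, prior):
    print(f"{candidate}: {share:.0%}")
```

With different counts the ranking can flip, which is why the quality of the confusion matrices and the corpus matters so much.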

correct
example
– t = acress
– $\hat{c}$ = acres (44%), the highest-scoring candidate
despite all the math, wrong result for
– was called a stellar and versatile acress
what does Microsoft Word use?
– was called a stellar and versatile acress

Microsoft Word

Google

Minimum Edit Distance
– textbook section 3.11
general string comparison
– edit operations are insertion, deletion and substitution
– not limited to the distance defined by a single operation away (as in correct)
– we can ask how different string a is from string b by the minimum edit distance

Minimum Edit Distance
applications
– could be used for multi-typo correction
– used in Machine Translation Evaluation (MTEval)
– example
  Source: 生産工程改善について
  Translations:
  (Standard) For improvement of the production process
  (MT-A) About a production process betterment
  (MT-B) About the production process improvement method
– compute edit distance between MT-A and Standard and between MT-B and Standard in terms of word insertion/substitution etc.
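A word-level comparison of this kind can be sketched as follows. The sketch assumes unit cost for word insertion, deletion and substitution, whitespace tokenisation, and case-sensitive matching; none of these choices is specified on the slide, and `word_edit_distance` is a hypothetical name.

```python
def word_edit_distance(hypothesis, reference):
    """Unit-cost edit distance between two lists of words."""
    previous = list(range(len(reference) + 1))  # distances from the empty hypothesis
    for i, hyp_word in enumerate(hypothesis, start=1):
        current = [i]  # deleting the first i hypothesis words
        for j, ref_word in enumerate(reference, start=1):
            current.append(min(
                previous[j] + 1,                           # delete hyp_word
                current[j - 1] + 1,                        # insert ref_word
                previous[j - 1] + (hyp_word != ref_word),  # substitute or match
            ))
        previous = current
    return previous[-1]


standard = "For improvement of the production process".split()
mt_a = "About a production process betterment".split()
mt_b = "About the production process improvement method".split()

print("MT-A vs Standard:", word_edit_distance(mt_a, standard))
print("MT-B vs Standard:", word_edit_distance(mt_b, standard))
```

Only two rows of the table are kept at a time, which is enough when the distance alone is needed and not the alignment.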

Minimum Edit Distance
cost models
– Levenshtein
  insertion, deletion and substitution all have unit cost
– Levenshtein (alternate)
  insertion, deletion have unit cost
  substitution is twice as expensive
  substitution = one insert followed by one delete
– Typewriter
  insertion, deletion and substitution all have unit cost, modified by key proximity

Minimum Edit Distance
Dynamic Programming
– divide-and-conquer
  to solve a problem we divide it into sub-problems
– sub-problems may be repeated
  don’t want to re-solve a sub-problem the 2nd time around
– idea: put solutions to sub-problems in a table
  and just look up the solution the 2nd time around, thereby saving time
  (memoization)

Fibonacci Series
definition
– F(0) = 0
– F(1) = 1
– F(n) = F(n-1) + F(n-2) for n ≥ 2
sequence
– 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, …
computation
– F(5)
– = F(4) + F(3)
– = F(3) + F(2) + F(3)
– = F(2) + F(1) + F(2) + F(2) + F(1)
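The repeated sub-problems in the F(5) expansion, and the table lookup that avoids them, can be sketched like this. The call counter is only there to make the saving visible; the function names are illustrative.

```python
def fib_naive(n, counter):
    """Plain recursion: re-solves the same sub-problems many times."""
    counter[0] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


def fib_memo(n, table=None):
    """Memoized recursion: each F(k) is computed once and then looked up."""
    if table is None:
        table = {0: 0, 1: 1}
    if n not in table:
        table[n] = fib_memo(n - 1, table) + fib_memo(n - 2, table)
    return table[n]


calls = [0]
print(fib_naive(5, calls), "computed with", calls[0], "calls")  # 5 computed with 15 calls
print([fib_memo(n) for n in range(13)])
```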

Minimum Edit Distance
can be defined incrementally
example:
– intention
– execution
suppose we have the following minimum costs (insertion, deletion have unit cost; substitution is twice as expensive)
– intent → execu (9)
– inten → execut (9)
– inten → execu (8)
Minimum cost for intent → execut (8) is given by computing the possibilities
– intent → execu (9) → execut (insert t) (10)
– intent (delete t) → inten → execut (9) (10)
– inten → execu (8): substitute t for t (zero cost) (8)
[Figure: intent/execut, intent/execu, inten/execut and inten/execu, each one edit operation away]

Minimum Edit Distance
Generally
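A sketch of the usual dynamic-programming recurrence, written under the assumption that it follows the cost model of the intent/execut example (unit-cost insertion and deletion, substitution at cost 2, zero cost when the letters are identical). The symbols $D$, $s$, $t$ and $\mathrm{sub}$ are introduced here for illustration: $D(i,j)$ is the minimum cost of turning the first $i$ letters of $s$ into the first $j$ letters of $t$.

```latex
\begin{align*}
D(i,0) &= i, \qquad D(0,j) = j \\
D(i,j) &= \min
\begin{cases}
D(i-1,j) + 1 & \text{(delete } s_i\text{)} \\
D(i,j-1) + 1 & \text{(insert } t_j\text{)} \\
D(i-1,j-1) + \mathrm{sub}(s_i,t_j) & \text{(substitute)}
\end{cases} \\
\mathrm{sub}(s_i,t_j) &=
\begin{cases}
0 & \text{if } s_i = t_j \\
2 & \text{otherwise}
\end{cases}
\end{align*}
```

For intent → execut this gives $\min(9+1,\ 9+1,\ 8+0) = 8$, matching the three possibilities worked through in the example.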

Minimum Edit Distance
Programming Practice: could be easily implemented in Perl
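A minimal sketch of such an implementation, in Python rather than Perl. The function name and the `sub_cost` parameter are choices made here for illustration; the default of 2 matches the "substitution is twice as expensive" cost model, and `sub_cost=1` gives the unit-cost Levenshtein model.

```python
def min_edit_distance(source, target, sub_cost=2):
    """Fill the full dynamic-programming table and return the final cell."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i  # delete every letter of the source prefix
    for j in range(1, cols):
        table[0][j] = j  # insert every letter of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            table[i][j] = min(
                table[i - 1][j] + 1,                             # deletion
                table[i][j - 1] + 1,                             # insertion
                table[i - 1][j - 1] + (0 if same else sub_cost), # substitution
            )
    return table[-1][-1]


print(min_edit_distance("intent", "execu"))    # 9
print(min_edit_distance("inten", "execut"))    # 9
print(min_edit_distance("inten", "execu"))     # 8
print(min_edit_distance("intent", "execut"))   # 8
print(min_edit_distance("intention", "execution", sub_cost=1))
```

Each cell depends only on its left, upper and upper-left neighbours, which is the same dependency a spreadsheet formula copied across a grid would express.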


Minimum Edit Distance
Computation
Or in Microsoft Excel, file: eds.xls (on website)
– $ in a cell reference means don’t change when copied from cell to cell
– e.g. in C$1, 1 stays the same
– in $A3, A stays the same