LING 438/538 Computational Linguistics Sandiway Fong Lecture 17: 10/25.


2 Last Time
Prolog implementation of Finite State Transducers (FSTs) for morphological processing
Let's revisit the implementation of the (context-sensitive) spelling rule (3.5):
– ε → e / { x, s, z } ^ __ s #
(the left context precedes __, the right context follows it)

3 Stage 2: Intermediate ⇒ Surface Levels
the context-sensitive spelling rule can also be implemented using an FST
– q0 to q2: left context
– q3 to q0: right context

4 FST 3.14 in Prolog
% To run: ?- q0([f,o,x,^,s,#],L).
%         L = [f,o,x,e,s,#]

% q0
q0([],[]).                          % final state case
q0([z|L1],[z|L2]) :- !, q1(L1,L2).  % q0 - z -> q1
q0([s|L1],[s|L2]) :- !, q1(L1,L2).  % q0 - s -> q1
q0([x|L1],[x|L2]) :- !, q1(L1,L2).  % q0 - x -> q1
q0([#|L1],[#|L2]) :- !, q0(L1,L2).  % q0 - # -> q0
q0([^|L1],L2)     :- !, q0(L1,L2).  % q0 - ^:ep -> q0
q0([X|L1],[X|L2]) :- q0(L1,L2).     % q0 - other -> q0

5 FST 3.14 in Prolog
% q1
q1([],[]).                          % final state case
q1([z|L1],[z|L2]) :- !, q1(L1,L2).  % q1 - z -> q1
q1([s|L1],[s|L2]) :- !, q1(L1,L2).  % q1 - s -> q1
q1([x|L1],[x|L2]) :- !, q1(L1,L2).  % q1 - x -> q1
q1([^|L1],L2)     :- !, q2(L1,L2).  % q1 - ^:ep -> q2
q1([#|L1],[#|L2]) :- !, q0(L1,L2).  % q1 - # -> q0
q1([X|L1],[X|L2]) :- q0(L1,L2).     % q1 - other -> q0

6 FST 3.14 in Prolog
% q2
q2([],[]).                          % final state case
q2(L1,[e|L2])     :- q3(L1,L2).     % q2 - ep:e -> q3
q2([s|L1],[s|L2]) :- !, q5(L1,L2).  % q2 - s -> q5
q2([z|L1],[z|L2]) :- !, q1(L1,L2).  % q2 - z -> q1
q2([x|L1],[x|L2]) :- !, q1(L1,L2).  % q2 - x -> q1
q2([#|L1],[#|L2]) :- !, q0(L1,L2).  % q2 - # -> q0
q2([X|L1],[X|L2]) :- \+ X = ^, q0(L1,L2). % q2 - other -> q0

7 FST 3.14 in Prolog
% q3
q3([s|L1],[s|L2]) :- q4(L1,L2).     % q3 - s -> q4

% q4
q4([#|L1],[#|L2]) :- q0(L1,L2).     % q4 - # -> q0

8 FST 3.14 in Prolog
% q5
q5([z|L1],[z|L2]) :- !, q1(L1,L2).  % q5 - z -> q1
q5([s|L1],[s|L2]) :- !, q1(L1,L2).  % q5 - s -> q1
q5([x|L1],[x|L2]) :- !, q1(L1,L2).  % q5 - x -> q1
q5([^|L1],L2)     :- !, q2(L1,L2).  % q5 - ^:ep -> q2
q5([X|L1],[X|L2]) :- \+ X = #, q0(L1,L2). % q5 - other -> q0

9 Stage 2: Intermediate ⇒ Surface Levels
Prolog queries, Intermediate ⇒ Surface Level:
| ?- q0([f,o,x,^,s,#],L).
L = [f,o,x,e,s,#] ?
| ?- q0([c,a,t,^,s,#],L).
L = [c,a,t,s,#] ?
| ?- q0([s,i,s,t,e,r,^,s,#],L).
L = [s,i,s,t,e,r,s,#] ?

Prolog queries, Surface ⇒ Intermediate Level:
| ?- q0(X,[f,o,x,e,s,#]).
! Resource error: insufficient memory
(infinite loop) Fix it!

10 Stage 2: Intermediate ⇒ Surface Levels
Use the Prolog debugger:
?- trace.
?- notrace.

11 Today's Topic
Chapter 5
– spelling errors and correction
– error correction: the correct program
– Bayesian probability
– minimum edit distance computation (dynamic programming)

12 Spelling Errors
Textbook cites (Kukich, 1992):
– non-word detection (easiest): graffe (giraffe)
– isolated-word (context-free) error correction: graffe (giraffe, …), graffed (gaffed, …); by definition this cannot correct an error word that is itself a valid word
– context-dependent error detection and correction (hardest): your an idiot → you're an idiot (Microsoft Word corrects this by default)

13 Spelling Errors
OCR
– visual similarity: h → b, e → c, jump → jurnps
Typing
– keyboard distance: small → smsll, spell → spel;
Graffiti (many HCI studies)
– stroke similarity
– common error characters: V, T, 4, L, E, Q, K, N, Y, 9, P, G, X
– two-stroke characters: B, D, P (error: two characters)
Cognitive errors
– bad spellers: separate → seperate

14 correct
textbook section 5.5
(we will go through this again in more detail when we do chapter 6)
Kernighan et al.'s correct program:
– take a typo t (not a word)
– mutate t minimally by deleting, inserting, substituting or transposing (swapping) a letter
– look up each mutated t in a dictionary
– the candidates are those mutated forms of t that are real words
example (5.2):
– t = acress
– C = {actress, cress, caress, access, across, acres, acres} (acres appears twice because two different edits produce it)
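The mutation step can be sketched in Python (a sketch of the idea, not Kernighan et al.'s actual code; the tiny dictionary below is illustrative):

```python
# Generate every string one edit away from the typo (deletion, insertion,
# substitution, transposition), then keep those found in a dictionary.
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(t):
    splits = [(t[:i], t[i:]) for i in range(len(t) + 1)]
    deletes     = {a + b[1:] for a, b in splits if b}
    transposes  = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + c + b[1:] for a, b in splits if b for c in LETTERS}
    inserts     = {a + c + b for a, b in splits for c in LETTERS}
    return deletes | transposes | substitutes | inserts

# illustrative mini-dictionary containing just the candidates from (5.2)
dictionary = {"actress", "cress", "caress", "access", "across", "acres"}
candidates = edits1("acress") & dictionary
```

Note that a set-based sketch loses the fact that acres is derivable by two different edits; Kernighan et al. keep each derivation separately, since the likelihood term depends on which edit produced the candidate.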

15 correct
formula
– ĉ = argmax_{c ∈ C} P(t|c) P(c)   (Bayesian inference)
– C = {actress, cress, caress, access, across, acres, acres}
prior: P(c)
– estimated using frequency information over a large corpus (N words)
– P(c) = freq(c)/N
– smoothed: P(c) = (freq(c) + 0.5)/(N + 0.5V)
  avoids zero counts (non-occurrences) by adding 0.5 to each count (a variant of add-one smoothing; see chapter 6)
  V is the vocabulary size of the corpus
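A minimal sketch of the smoothed prior (the candidate frequencies follow the textbook's acress example; N and V are assumed round numbers, purely illustrative):

```python
# Hedged sketch: prior P(c) with add-0.5 smoothing, as on the slide.
def prior(c, freq, N, V):
    # P(c) = (freq(c) + 0.5) / (N + 0.5 * V)  -- note the parentheses
    return (freq.get(c, 0) + 0.5) / (N + 0.5 * V)

# frequencies for the acress candidates (illustrative, per the textbook)
freq = {"actress": 1343, "cress": 0, "caress": 4,
        "access": 2280, "across": 8436, "acres": 2879}
N = 44_000_000          # corpus size in words (assumed)
V = 100_000             # vocabulary size (assumed)
best_by_prior = max(freq, key=lambda c: prior(c, freq, N, V))
```

On the prior alone the most frequent candidate wins; the likelihood term below is what lets a less frequent word like acres come out ahead.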

16 correct
likelihood: P(t|c), the probability of typo t given candidate word c
– using some corpus of errors, compute the following 4 confusion matrices (each 26 × 26, over a–z):
– del[x,y] = freq(correct xy mistyped as x)
– ins[x,y] = freq(correct x mistyped as xy)
– sub[x,y] = freq(correct x mistyped as y)
– trans[x,y] = freq(correct xy mistyped as yx)
then
– P(t|c) = del[x,y]/f(xy) if c is related to t by deletion of y
– P(t|c) = ins[x,y]/f(x) if c is related to t by insertion of y, etc.
(Figure 5.3)
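The likelihood lookup can be sketched as follows (hypothetical function name and toy counts; only the deletion and insertion cases from the slide are shown):

```python
# Hedged sketch: P(t|c) read off the confusion matrices, following the
# slide's formulas. `counts` maps edit kind -> matrix (a dict keyed by
# character pairs); `f` returns corpus frequencies of letters/bigrams.
def p_typo_given_cand(edit, counts, f):
    kind, x, y = edit
    if kind == "del":                     # correct xy mistyped as x
        return counts["del"][x, y] / f(x + y)
    if kind == "ins":                     # correct x mistyped as xy
        return counts["ins"][x, y] / f(x)
    raise NotImplementedError("sub and trans are analogous")

# toy numbers, purely illustrative
counts = {"del": {("a", "c"): 5}, "ins": {("a", "c"): 3}}
f = {"ac": 100, "a": 150}
p_del = p_typo_given_cand(("del", "a", "c"), counts, f.get)   # 5/100
p_ins = p_typo_given_cand(("ins", "a", "c"), counts, f.get)   # 3/150
```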

17 correct
example
– t = acress
– ĉ = acres (44%)
despite all the math, this is the wrong result for:
– was called a stellar and versatile acress
what does Microsoft Word use for
– was called a stellar and versatile acress?

18 Minimum Edit Distance
textbook section 5.7
general string comparison
– edit operations are insertion, deletion and substitution
– not limited to strings a single operation apart (as in correct)
– we can ask how different string a is from string b in terms of the minimum edit distance

19 Minimum Edit Distance
applications
– could be used for multi-typo correction
– used in Machine Translation Evaluation (MTEval)
example
– Source: 生産工程改善について
– Translations:
  (Standard) For improvement of the production process
  (MT-A) About a production process betterment
  (MT-B) About the production process improvement method
– compute the edit distance between MT-A and the Standard, and between MT-B and the Standard, in terms of word insertion/substitution etc.

20 Minimum Edit Distance
cost models
– Levenshtein: insertion, deletion and substitution all have unit cost
– Levenshtein (alternate): insertion and deletion have unit cost; substitution is twice as expensive (substitution = one insert followed by one delete)
– Typewriter: insertion, deletion and substitution all have unit cost, modified by key proximity

21 Minimum Edit Distance
dynamic programming
– divide-and-conquer: to solve a problem we divide it into sub-problems
– sub-problems may be repeated: we don't want to re-solve a sub-problem the 2nd time around
– idea: put solutions to sub-problems in a table and just look up the solution the 2nd time around, thereby saving time (memoization)

22 Fibonacci Series
definition
– F(0) = 0
– F(1) = 1
– F(n) = F(n-1) + F(n-2) for n ≥ 2
sequence
– 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, …
computation
– F(5)
– = F(4) + F(3)
– = (F(3) + F(2)) + F(3)
– = ((F(2) + F(1)) + F(2)) + (F(2) + F(1))
note the repeated sub-problems: F(3) appears twice, F(2) three times
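The table-lookup idea can be shown directly on the Fibonacci example (Python sketch; `lru_cache` plays the role of the table):

```python
from functools import lru_cache

def fib_naive(n):
    # re-solves the same sub-problems repeatedly: exponential time
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)   # memoization: cache each F(n) the first time
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

With the cache, each F(n) is computed only once, so `fib` runs in linear time; `fib(20)` returns 6765, matching the sequence above.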

23 Minimum Edit Distance
can be defined incrementally
example:
– intention
– execution
suppose we have the following minimum costs (insertion and deletion have unit cost, substitution is twice as expensive):
– intent → execu (9)
– inten → execut (9)
– inten → execu (8)
the minimum cost for intent → execut is then given by computing the possibilities, each one edit operation away:
– intent → execu (9), then insert t (10)
– delete t to get inten, then inten → execut (9) (10)
– inten → execu (8), then substitute t for t (zero cost, since they match) (8)
so the minimum cost for intent → execut is 8

24 Minimum Edit Distance
generally:
– D(i,0) = i × del-cost,  D(0,j) = j × ins-cost
– D(i,j) = min( D(i-1,j) + del-cost,
                D(i,j-1) + ins-cost,
                D(i-1,j-1) + sub-cost(a_i, b_j) )
  where sub-cost(a_i, b_j) = 0 if a_i = b_j
the formula is right, but the table (figure 5.6) is wrong!

25 Minimum Edit Distance
generally:
– example: for inte → e the minimum cost should be 3 (three deletions)
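The full table computation can be sketched in Python with pluggable costs, covering both Levenshtein variants from the cost-models slide (function and parameter names are mine):

```python
def min_edit(s, t, ins=1, dele=1, sub=1):
    # d[i][j] = minimum cost of editing s[:i] into t[:j]
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + dele        # delete everything
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins         # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0 if s[i - 1] == t[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,        # deletion
                          d[i][j - 1] + ins,         # insertion
                          d[i - 1][j - 1] + match)   # substitution/match
    return d[m][n]
```

`min_edit("inte", "e")` gives 3, as the slide says, and with `sub=2` `min_edit("intent", "execut")` gives 8, matching the incremental example on slide 23.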

26 Minimum Edit Distance Computation
Microsoft Excel implementation of the formula (from the course webpage): try it!
$ in a cell reference means "don't change when copied from cell to cell", e.g.:
– in C$1, the 1 stays the same
– in $A3, the A stays the same