CPSC 503 Computational Linguistics

Presentation transcript:

CPSC 503 Computational Linguistics
Lecture 6: Probabilistic Models for Spelling
Giuseppe Carenini
CPSC 503, Spring 2004

Today (30/1)
- Dealing with spelling errors (error rates range from 0.05% to 3% to 38%, depending on the text and application)
- Noisy channel model
- Bayes rule applied to the noisy channel model (single and multiple errors)
- Minimum edit distance (multiple errors) via dynamic programming

Notes: A very related problem is modeling pronunciation variation for automatic speech recognition and text-to-speech systems. Applications: word processing, corpus clean-up, on-line handwriting recognition. This lecture also introduces the need for more sophisticated probabilistic language models.

Background knowledge
- Morphological analysis
- P(x) (either a probability distribution or a probability mass function)
- Joint P(x, y)
- Conditional P(x | y)
- Bayes rule
- Word length, word class

Notes: For convenience, let's call all of them probability distributions.

Spelling: the problem(s)

Non-word, isolated
- Detection: check if the word is in the lexicon.
- Correction: find the most likely correct word (funn -> funny, fun, ...).

Non-word, in context
- Correction: find candidates and select the most likely one in this context ("trust funn" vs. "a lot of funn"; "it was funne").

Real-word, isolated
- Detection / Correction: ?!

Real-word, in context
- Detection: is it an impossible (or very unlikely) word in this context? (".. a wild big.")
- Correction: find the most likely substitution word in this context ("trust fun").

Spelling: Data
- 80% of misspelled words contain a single error.
- Types of errors: insertion (toy -> tony), deletion (tuna -> tua), substitution (tone -> tony), transposition (length -> legnth).
- Sources of errors: typographic (e.g., substituting one character with the one next to it on the keyboard) and cognitive (e.g., homophone errors: piece / peace).

Noisy Channel
An influential metaphor in language processing is the noisy channel model: a signal passes through a noisy channel and comes out distorted.

Notes: The noisy channel model also underlies speech recognition and machine translation; you'll find it, in one way or another, in many NLP papers after 1990. In spelling, the noise is introduced by the processes that cause people to misspell words. We want to classify the noisy word as the most likely word that generated it, which is a special case of Bayesian classification.

Bayes and the Noisy Channel: Spelling (non-word, isolated)
Goal: find the most likely word given some observed (misspelled) word.

Notes: Memorize this.
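The formula displayed on this slide is not preserved in the transcript; the standard statement of the goal, with O the observed string and V the lexicon, is:

    \hat{w} = \arg\max_{w \in V} P(w \mid O)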

Problem
P(w | O) is hard or impossible to estimate directly (why?). So we (1) apply Bayes rule and (2) simplify, hoping that what we are left with, the prior and the likelihood, can be estimated more easily.

Notes: Refer back to the joint and conditional distributions. Even with a large corpus, collecting the pairs needed to compute P(O | w) for all possible misspellings of each word in the lexicon seems unlikely.
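The equations on this slide are likewise not preserved; applying Bayes rule and dropping the denominator (which does not depend on w) gives the standard decomposition:

    \hat{w} = \arg\max_{w} P(w \mid O)
            = \arg\max_{w} \frac{P(O \mid w)\, P(w)}{P(O)}
            = \arg\max_{w} \underbrace{P(O \mid w)}_{\text{likelihood}} \; \underbrace{P(w)}_{\text{prior}}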

Estimate of the prior P(w) (easy)
P(w) is just the prior probability of the word given some corpus (that we hope is similar to the text being corrected); apply smoothing for words with low or zero counts.

Notes: Always verify ...
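A minimal sketch of how such a smoothed prior might be computed. The function name, the add-one smoothing choice, and the toy corpus are illustrative assumptions, not from the slides:

```python
from collections import Counter

def build_prior(tokens, extra_vocab=()):
    """Estimate P(w) from a list of corpus tokens with add-one (Laplace)
    smoothing, so that candidate words that never occur in the corpus
    (e.g. 'cress') still get a small non-zero probability."""
    counts = Counter(tokens)
    vocab = set(counts) | set(extra_vocab)
    total = sum(counts.values()) + len(vocab)   # add-one smoothing denominator
    return {w: (counts[w] + 1) / total for w in vocab}

# Toy usage: corpus and vocabulary are illustrative only.
corpus = "the actress walked across the stage".split()
prior = build_prior(corpus, extra_vocab={"cress", "acres"})
print(prior["actress"], prior["cress"])
```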

Estimating P(O|w) is feasible (Kernighan et al. '90)
For a one-error misspelling:
- Estimate the probability of each possible error type, e.g., "insert a after c", "substitute f with h".
- P(O|w) is then equal to the probability of the error that generated O from w, e.g., P(cbat | cat) = P(insert b after c).

Notes: What about P(O|w)? That is the probability that this string would have appeared given that the right word was w.

Estimate P(error type)
From a large corpus, compute confusion matrices (e.g., for substitution: sub[x, y]) and a count matrix (e.g., Count(a) = number of a's in the corpus).

Illustrative fragment of the substitution confusion matrix, where an entry in row x and column y is the number of times x was incorrectly used instead of y (e.g., b was incorrectly used instead of a 5 times):

        a    b    c   ...
   a    -   ...  ...  ...
   b    5    -   ...  ...
   c    8   15    -   ...
   d   ...  ...  ...  ...

Notes: We still have to build these tables. Reading the matrix answers questions like "how many b's in the corpus were actually meant to be a's?"
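A small sketch of how a substitution confusion matrix and the corresponding error probability could be estimated from (typed, intended) word pairs. The pair list, function names, and the restriction to single substitutions are illustrative assumptions:

```python
from collections import defaultdict

def build_sub_confusion(error_pairs):
    """Build a substitution confusion matrix from (typed, intended) word
    pairs that differ by exactly one character substitution.
    sub[(x, y)] = number of times x was incorrectly typed in place of y.
    (A real system would also build insertion, deletion and transposition
    matrices; only substitution is shown here.)"""
    sub = defaultdict(int)
    for typed, intended in error_pairs:
        if len(typed) == len(intended):
            diffs = [(t, c) for t, c in zip(typed, intended) if t != c]
            if len(diffs) == 1:
                sub[diffs[0]] += 1
    return sub

def p_substitution(sub, char_counts, x, y):
    """Estimate P(x typed | y intended) as sub[x, y] / Count(y)."""
    return sub[(x, y)] / char_counts[y] if char_counts.get(y) else 0.0

# Toy usage: the error pairs and character counts are made up.
pairs = [("sais", "says"), ("tha", "that")]      # only the first is a pure substitution
sub = build_sub_confusion(pairs)
char_counts = {"y": 10000}
print(p_substitution(sub, char_counts, "i", "y"))  # 1 / 10000
```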

Corpus: Example
"... On 16 January, he sais [sub[i,y] 3] that because of astronaut safety tha [del[a,t] 4] would be no more space shuttle missions to miantain [tran[a,i] 2] and upgrade the orbiting telescope ..."

Each bracketed annotation records the error behind the misspelling (e.g., sais: i substituted for y; tha: a-t deletion; miantain: a-i transposition) together with a count.

Final Method (single error)
(1) Given O, collect all the words wi that could have generated O by one error, e.g., O = acress => w1 = actress (t deletion), w2 = across (substitution of o with e), ...
(2) For each wi, compute P(wi) * P(O|wi): the word prior multiplied by the probability of the particular error that results in O (the estimate of P(O|wi)).
(3) Sort and display the top n candidates to the user.
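A self-contained sketch of steps (1)-(3) under the single-error assumption. The lexicon, priors, and error probabilities below are made-up toy values, and the function names are not from the original slides:

```python
def candidate_edits(observed, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Yield (candidate, error) pairs: `candidate` is one edit away from
    `observed`, and `error` names the single typing operation that would
    turn the candidate into `observed`."""
    pairs = []
    for i in range(len(observed) + 1):
        left, right = observed[:i], observed[i:]
        if right:
            # candidate lacks observed[i], so the typist must have inserted it
            pairs.append((left + right[1:], ("ins", right[0])))
        if len(right) > 1:
            # candidate has these two characters swapped back: a transposition error
            pairs.append((left + right[1] + right[0] + right[2:], ("trans", right[1], right[0])))
        for c in alphabet:
            if right and c != right[0]:
                # candidate has c where observed has right[0]: typist typed right[0] for c
                pairs.append((left + c + right[1:], ("sub", right[0], c)))
            # candidate has an extra c that the typist deleted
            pairs.append((left + c + right, ("del", c)))
    return pairs

def rank_corrections(observed, lexicon, prior, p_error, top_n=3):
    """Score each in-lexicon candidate with P(w) * P(error), keep the best."""
    scores = {}
    for cand, error in candidate_edits(observed):
        if cand in lexicon:
            score = prior.get(cand, 0.0) * p_error.get(error, 1e-9)
            scores[cand] = max(scores.get(cand, 0.0), score)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy usage: all numbers below are invented for illustration only.
lexicon = {"actress", "across", "acres", "access", "caress"}
prior = {"actress": 2.7e-5, "across": 2.3e-4, "acres": 2.0e-5,
         "access": 9.0e-5, "caress": 4.1e-6}
p_error = {("del", "t"): 1.2e-4, ("sub", "e", "o"): 9.3e-5}
print(rank_corrections("acress", lexicon, prior, p_error))
```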

Example: O = acress
Observed in context: "... stellar and versatile acress whose ..."
Corpus size: N = 44 million words.

[The slide's table of candidate corrections, with their priors and error probabilities normalized as percentages, is not preserved in this transcript.]

Evaluation of the "correct" system
[The evaluation details shown on the slide are not preserved in this transcript.]

Notes: Neither was just proposing the first word that could have generated O by one error.

Corpora: issues to remember
- Zero counts in the corpus: just because an event didn't happen in the corpus doesn't mean it won't happen (e.g., cress does not really have zero probability).
- Getting a corpus that matches the actual use (e.g., kids don't misspell the same way that adults do).

Today (30/1), continued: minimum edit distance (dynamic programming)

Recap of the agenda:
- Dealing with spelling errors (0.05% / 3% / 38%)
- Noisy channel model
- Bayes rule applied to the noisy channel model (single error)
- Minimum edit distance (dynamic programming)

Notes: A very related problem is modeling pronunciation variation for automatic speech recognition and text-to-speech systems. Applications: word processing, corpus clean-up, on-line handwriting recognition.

Multiple Spelling Errors
- BEFORE: given O, collect all the wi that could have generated O by one error.
- NOW: given O, collect all the wi that could have generated O by 1..k errors.
General solution: how do we compute the number and type of errors "between" O and wi?

Notes: Distance measures how alike two strings are.

Minimum Edit Distance
Definition: the minimum number of edit operations (insertion, deletion and substitution) needed to transform one string into another.
Example: gumbo -> gumb (delete o) -> gum (delete b) -> gam (substitute u with a).

Notes: We compute the string minimum edit distance between O and each wi. There are lots of applications of Levenshtein distance: it is used in biology to find similar sequences of nucleic acids in DNA or amino acids in proteins; in some spell checkers to guess which dictionary word is meant when an unknown word is encountered; in Wilbert Heeringa's dialectology project to estimate the proximity of dialect pronunciations; and in translation assistance projects, which have used the alignment capability of the algorithm to discover (the approximate location of) good translation equivalents. This last application, using potentially large texts, requires optimisations to run effectively.

Minimum Edit Distance Algorithm
Dynamic programming (a very common technique in NLP). High-level description:
- Fill in a matrix of partial comparisons.
- The value of each cell is computed as a "simple" function of the surrounding cells.
- Output: not only the number of edit operations but also the sequence of operations.

Minimum Edit Distance Algorithm: Details
Costs: del-cost = 1, ins-cost = 1, sub-cost = 2 (0 if the two characters are equal).
ed[i, j] = minimum distance between the first i characters of the source and the first j characters of the target.

Update rule (each cell is the minimum over a deletion, an insertion, and a substitution or match from its three neighbouring cells):
ed[i, j] = MIN( ed[i-1, j] + 1,  ed[i, j-1] + 1,  ed[i-1, j-1] + (2 if source[i] != target[j], else 0) )

Notes: The book's presentation is a bit confusing because its matrix indexes start from the bottom-left corner.
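A compact implementation of this dynamic program with the slide's costs (deletion 1, insertion 1, substitution 2, 0 when the characters match); the function name and usage example are illustrative:

```python
def min_edit_distance(source, target, del_cost=1, ins_cost=1, sub_cost=2):
    """Minimum edit distance with the costs from the slide
    (deletion 1, insertion 1, substitution 2, 0 if the characters match)."""
    n, m = len(source), len(target)
    # ed[i][j] = distance between first i chars of source and first j of target
    ed = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        ed[i][0] = ed[i - 1][0] + del_cost            # delete every source char
    for j in range(1, m + 1):
        ed[0][j] = ed[0][j - 1] + ins_cost            # insert every target char
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            ed[i][j] = min(
                ed[i - 1][j] + del_cost,                       # deletion
                ed[i][j - 1] + ins_cost,                       # insertion
                ed[i - 1][j - 1] + (0 if same else sub_cost),  # substitution / match
            )
    return ed[n][m]

# Usage: "gumbo" -> "gam" needs two deletions and one substitution.
print(min_edit_distance("gumbo", "gam"))  # 1 + 1 + 2 = 4 with these costs
```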

Final Method (multiple errors)
(1) Given O, for each wi compute mei = min-edit-distance(wi, O); if mei < k, save the corresponding edit operations in EdOpi.
(2) For each surviving wi, compute P(wi) * P(O|wi): the word prior multiplied by the probability of the errors in EdOpi that generate O from wi (the estimate of P(O|wi)).
(3) Sort and display the top n candidates to the user.
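A minimal sketch of the multiple-error variant under two stated assumptions: it reuses min_edit_distance, lexicon, and prior from the sketches above, and it approximates the channel model with a single per-edit probability rather than the per-operation estimates (from EdOpi) that the slide calls for:

```python
def correct_multi(observed, lexicon, prior, k=3, per_edit_prob=1e-3):
    """Rank lexicon words that could have generated `observed` by fewer
    than k errors, reusing min_edit_distance (with unit substitution cost
    so the distance counts edit operations).  Simplification: P(O|w) is
    approximated by per_edit_prob ** num_errors instead of the product of
    per-operation probabilities from the saved edit operations EdOp_i."""
    scored = []
    for w in lexicon:
        num_errors = min_edit_distance(w, observed, sub_cost=1)
        if 0 < num_errors < k:
            scored.append((w, prior.get(w, 0.0) * per_edit_prob ** num_errors))
    return sorted(scored, key=lambda kv: kv[1], reverse=True)

# Toy usage, reusing the lexicon and prior from the single-error sketch.
print(correct_multi("acress", lexicon, prior, k=3))
```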

Spelling: the problem(s) (recap)

Non-word, isolated
- Detection: check if the word is in the lexicon.
- Correction: find the most likely correct word (funn -> funny, funnel, ...).

Non-word, in context
- Correction: find candidates and select the most likely one in this context ("trust funn" vs. "a lot of funn"; "it was funne").

Real-word, isolated
- Detection / Correction: ?!

Real-word, in context
- Detection: is it an impossible (or very unlikely) word in this context? (".. a wild big.")
- Correction: find the most likely substitution word in this context ("trust fun").

Next Time
- Brief intro to the key ideas behind Perl (no syntax; I'll point you to a good, short primer).
- Start with Chapter 6: N-grams (assigning probabilities to sequences of words).