CPSC 503 Computational Linguistics


CPSC 503 Computational Linguistics, Lecture 4. Giuseppe Carenini.

Knowledge-Formalisms Map (including probabilistic formalisms)
- Morphology: state machines and their probabilistic versions (finite state automata, finite state transducers, Markov models)
- Syntax: rule systems and their probabilistic versions (e.g., (probabilistic) context-free grammars)
- Semantics, Pragmatics, Discourse and Dialogue: logical formalisms (first-order logics), AI planners
Spelling is a very simple NLP task (it requires only morphological recognition), but it shows the need for probabilistic approaches; moving beyond single words leads to the probability of a sentence.

Today (Sep 20): dealing with spelling errors: the noisy channel model, Bayes' rule applied to the noisy channel model (single and multiple spelling errors), and applications (word processing, cleaning up a corpus, on-line handwriting recognition). Then start n-gram models: introducing the need for (more sophisticated) probabilistic language models.

Background knowledge: morphological analysis; probability distributions P(x), the joint p(x,y), and the conditional p(x|y) (for convenience we will call all of them probability distributions); Bayes' rule; the chain rule. Relevant word features: word length, word class.

Spelling: the problem(s)
- Non-word, isolated. Detection: check whether the word is in the lexicon. Correction: find the most likely correct word (funn -> funny, fun, ...).
- Non-word, in context. Detection: as above. Correction: find the candidates and select the most likely one for this context (e.g., it was funne, trust funn, a lot of funn).
- Real-word, isolated. Detection and correction: ?! (essentially impossible).
- Real-word, in context. Detection: is this an impossible (or very unlikely) word in this context? (e.g., .. a wild big.). Correction: find the most likely substitution word for this context.

Spelling: Data. Misspelling rates range from 0.05% (carefully edited newswire) to about 3% (normal human typewritten text) to 38% (telephone directory lookup). 80% of misspelled words contain a single error: insertion (toy -> tony), deletion (tuna -> tua), substitution (tone -> tony), or transposition (length -> legnth). Types of errors: typographic (more common; the user knows the correct spelling, e.g., the -> rhe, and the error is usually related to the keyboard, such as substituting a character with one next to it) and cognitive (the user does not know the spelling, e.g., piece -> peace, including homophone errors). A very related problem: modeling pronunciation variation for automatic speech recognition and text-to-speech systems.

Noisy Channel. An influential metaphor in language processing is the noisy channel model: an original signal passes through a noisy channel and comes out corrupted. The metaphor underlies work in speech recognition and machine translation, and you will find it in one form or another in many NLP papers after 1990. In spelling, the noise is introduced by the processes that cause people to misspell words; we want to classify the noisy (observed) word as the most likely word that generated it. This is a special case of Bayesian classification.

Bayes and the Noisy Channel: Spelling (non-word, isolated case). Goal: find the most likely word given some observed (misspelled) word. The resulting formula is the one to memorize.

Problem: P(w|O) is hard or impossible to estimate directly. Why? You would need a corpus pairing every word in the lexicon with all of its possible misspellings, which seems unlikely. So we (1) apply Bayes' rule and (2) simplify, hoping that what we are left with, a prior and a likelihood, can be estimated more easily.
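The equation itself does not survive in this transcript; what the slide presumably shows is the standard noisy channel / Bayes decomposition (a reconstruction, not the original slide graphic):

```latex
\hat{w} \;=\; \arg\max_{w \in V} P(w \mid O)
       \;=\; \arg\max_{w \in V} \frac{P(O \mid w)\, P(w)}{P(O)}
       \;=\; \arg\max_{w \in V} \underbrace{P(O \mid w)}_{\text{likelihood}}\;\underbrace{P(w)}_{\text{prior}}
```

P(O) can be dropped because it is the same for every candidate word w.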

Estimate of the prior P(w): easy. It is just the (smoothed) relative frequency of the word in some corpus that we hope is similar to the text being corrected. Always verify that assumption.
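As an illustration only (the slide's own formula and smoothing choice are not preserved), one common smoothed estimate is relative frequency with add-one (Laplace) smoothing:

```latex
P(w) \;\approx\; \frac{C(w) + 1}{N + |V|}
```

where C(w) is the count of w in the corpus, N the number of word tokens, and |V| the vocabulary size, so that unseen words do not receive zero probability.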

Estimating P(O|w), i.e., the probability that string O would be typed when the intended word was w, is feasible (Kernighan et al., 1990). For a one-error misspelling: estimate the probability of each possible error type (e.g., insert a after c, substitute f with h); then P(O|w) equals the probability of the single error that generated O from w, e.g., P(cbat|cat) = P(insert b after c).

Estimate P(error type): from a large corpus of errors, compute confusion matrices and counts. For substitution, for example, sub[x,y] = the number of times y was incorrectly typed when x was intended, and count(x) = the number of times x appears in the corpus; analogous matrices are built for insertion, deletion, and transposition. These tables still have to be built.
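The slide's formulas are not preserved; the following is a reconstruction in the spirit of Kernighan et al. (1990), writing o for the typed characters and w for the intended word (index conventions vary across presentations, so treat this as a sketch rather than the slide's exact notation):

```latex
P(O \mid w) \;\approx\;
\begin{cases}
\dfrac{\mathrm{del}[w_{i-1}, w_i]}{\mathrm{count}(w_{i-1} w_i)} & \text{if } w_i \text{ was deleted after } w_{i-1},\\[1.2ex]
\dfrac{\mathrm{ins}[w_{i-1}, o_i]}{\mathrm{count}(w_{i-1})} & \text{if } o_i \text{ was inserted after } w_{i-1},\\[1.2ex]
\dfrac{\mathrm{sub}[o_i, w_i]}{\mathrm{count}(w_i)} & \text{if } o_i \text{ was typed instead of } w_i,\\[1.2ex]
\dfrac{\mathrm{trans}[w_i, w_{i+1}]}{\mathrm{count}(w_i w_{i+1})} & \text{if } w_i \text{ and } w_{i+1} \text{ were transposed.}
\end{cases}
```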

Corpus: Example. "… On 16 January, he sais [sub[i,y] 3] that because of astronaut safety tha [del[a,t] 4] would be no more space shuttle missions to miantain [tran[a,i] 2] and upgrade the orbiting telescope…" The bracketed annotations record the error type (and a count) used to fill the confusion matrices.

Final Method (single error). (1) Given O, collect all the words wi that could have generated O by one error; e.g., for O = acress: w1 = actress (t deletion), w2 = across (substitution of o with e), ... In practice: apply every single transformation to O and keep the results that are words in the lexicon. (2) For each wi compute P(wi) * P(O|wi): the word's prior multiplied by the probability of the particular error that turns wi into O. (3) Sort and display the top-n candidates to the user.
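As a concrete illustration of steps (1)-(3), here is a minimal Python sketch (not the lecture's own code); `lexicon`, `prior` and `channel_prob` are placeholders for the corpus prior and the confusion-matrix channel model described on the previous slides:

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings reachable from `word` by one insertion, deletion,
    substitution, or transposition."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    substitutes = [left + c + right[1:]
                   for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + substitutes + inserts)

def correct(observed, lexicon, prior, channel_prob, n=5):
    """(1) candidate words one edit away from the observation,
       (2) score each candidate by P(w) * P(O | w),
       (3) return the top-n candidates."""
    candidates = [w for w in edits1(observed) if w in lexicon]
    ranked = sorted(candidates,
                    key=lambda w: prior(w) * channel_prob(observed, w),
                    reverse=True)
    return ranked[:n]
```

For O = acress this would recover candidates such as actress, across, acres, access, caress and cress, each scored by its prior times the probability of the specific single edit.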

Example: O = acress ("…stellar and versatile acress whose…"). Priors are estimated from the 1988 AP newswire corpus, N = 44 million words; the slide's table of candidate words, raw frequencies, and normalized percentages is not reproduced here. Note that acres can generate acress in two ways: (1) inserting s after e, (2) inserting s after s.

Evaluation of the "correct" system (Kernighan et al.). correct agrees with the majority of three human judges in 87% of the 329 cases of interest. To calibrate this result, three inferior methods were also evaluated: no-prior (ignores the prior probability), no-channel (ignores the channel probability), and neither (ignores both and simply proposes the first word that could have generated O by one error). correct is significantly better than all three: both the channel and the prior probabilities make a significant contribution, and the combination is significantly better than either in isolation. The second half of the original table evaluates the judges against one another and shows that they significantly outperform correct, indicating plenty of room for further improvement. All three judges found the task more difficult and time-consuming than expected, spending about half a day each grading the 564 triples. (Judges were scored only on triples for which they selected "1" or "2" and for which the other two judges agreed on "1" or "2"; a triple was scored correct for a judge if that judge agreed with the other two, incorrect otherwise.)

Corpora: issues to remember. Zero counts: just because an event did not occur in the corpus does not mean it cannot occur (e.g., cress does not really have zero probability). Corpus match: the corpus should match the actual use (e.g., kids do not misspell the same way adults do).

Multiple Spelling Errors. Before: given O, collect all the wi that could have generated O by one error. Now: given O, collect all the wi that could have generated O by 1..k errors. General solution: compute the number and type of errors "between" O and each wi, i.e., a distance that measures how alike the two strings are.

Minimum Edit Distance. Definition: the minimum number of edit operations (insertion, deletion, and substitution) needed to transform one string into another, e.g., gumbo -> gumb (delete o) -> gum (delete b) -> gam (substitute u with a). We compute this string distance between O and each wi. Levenshtein distance has many other applications: finding similar sequences of nucleic acids in DNA or amino acids in proteins, spell checkers guessing which dictionary word was meant, estimating the proximity of dialect pronunciations (Wilbert Heeringa's dialectology project), and aligning texts to discover the approximate location of good translation equivalents (which, for large texts, requires optimizations to run effectively).

Minimum Edit Distance Algorithm: dynamic programming (a very common technique in NLP). High-level description: fill in a matrix of partial comparisons; the value of each cell is computed as a "simple" function of the surrounding cells; the output is not only the number of edit operations but also the sequence of operations.

Minimum Edit Distance Algorithm: details, with del-cost = 1, ins-cost = 1, sub-cost = 2. Let ed[i,j] be the minimum distance between the first i characters of the source and the first j characters of the target. Each cell is updated from its three neighbours: ed[i,j] = MIN( ed[i-1,j] + del-cost, ed[i,j-1] + ins-cost, ed[i-1,j-1] + (sub-cost if the two characters differ, 0 if they are equal) ). (The book's presentation can be confusing because its matrix indexes start from the bottom-left corner.)
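For concreteness, a Python sketch of the algorithm with the costs on this slide (not the lecture's own code); it returns only the distance, whereas the full algorithm would also keep backpointers to recover the sequence of operations:

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    n, m = len(source), len(target)
    # d[i][j] = min cost of turning the first i chars of source
    #           into the first j chars of target
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost      # delete everything
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost      # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost,       # delete source[i-1]
                d[i][j - 1] + ins_cost,       # insert target[j-1]
                d[i - 1][j - 1]               # substitute, or match for free
                + (0 if source[i - 1] == target[j - 1] else sub_cost),
            )
    return d[n][m]

# Slide example: min_edit_distance("gumbo", "gam") == 4
# (delete 'o', delete 'b', substitute 'u' with 'a' at cost 2)
```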

Final Method (multiple errors). (1) Given O, for each word wi compute mei = min-edit-distance(wi, O); if mei < k, save the corresponding edit operations in EdOpi. (2) For each surviving wi compute P(wi) * P(O|wi): the word's prior multiplied by the probability of the errors in EdOpi that turn wi into O. (3) Sort and display the top-n candidates to the user.

Spelling: the problem(s) (revisited). For non-word errors: detection is checking whether the word is in the lexicon; correction is finding the most likely correct word (funn -> funny, funnel, ...), or the most likely candidate for the context (it was funne, trust funn, a lot of funn). What remains is the real-word case: detecting a word that is valid but impossible (or very unlikely) in its context (.. a wild big.) and finding the most likely substitution word for that context.

Real Word Spelling Errors. Collect a set of common confusion sets C = {C1 .. Cn}, e.g., {their / they're / there}, {to / too / two}, {weather / whether}, {lave / have}, ... These come from mental confusions (their/they're/there, to/too/two, weather/whether) and from typos that result in real words (lave for have). Whenever some c' in Ci is encountered: compute the probability of the sentence in which it appears; substitute each c in Ci (c ≠ c') and compute the probability of the resulting sentence; choose the alternative with the higher probability. A similar process can be applied to non-word errors. A sketch follows.
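A minimal Python sketch of this procedure (hypothetical code, not from the lecture); `sentence_prob` stands for whatever language model assigns probabilities to word sequences, e.g., the n-gram models introduced next:

```python
# Example confusion sets from the slide; case is ignored in this sketch.
CONFUSION_SETS = [
    {"their", "they're", "there"},
    {"to", "too", "two"},
    {"weather", "whether"},
    {"lave", "have"},
]

def correct_real_word_errors(words, sentence_prob):
    """For each word that belongs to a confusion set, try every member of
    the set in that position and keep whichever makes the whole sentence
    most likely (including the original word itself)."""
    words = list(words)
    for i, w in enumerate(words):
        for conf_set in CONFUSION_SETS:
            if w.lower() in conf_set:
                best = max(conf_set,
                           key=lambda c: sentence_prob(words[:i] + [c] + words[i + 1:]))
                words[i] = best
    return words
```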

Key Transition. Up to this point we have mostly been discussing words in isolation. Now we switch to sequences of words, and we will worry about assigning probabilities to sequences of words.

Knowledge-Formalisms Map (revisited): the same map as at the start of the lecture. Spelling exercised only the morphology/state-machine layer and showed the need for probabilistic approaches; moving beyond single words, to the probability of a sentence, takes us to the next topic.

Only Spelling? AB Assign a probability to a sentence Part-of-speech tagging Word-sense disambiguation Probabilistic Parsing Predict the next word Speech recognition Hand-writing recognition Augmentative communication for the disabled AB Why would you want to assign a probability to a sentence or… Why would you want to predict the next word… Impossible to estimate  4/7/2019 CPSC503 Spring 2005

Decompose: apply the chain rule. Most sentences/sequences will not appear in a corpus, or will appear only once, so the standard solution is to decompose the sentence probability into a set of probabilities that are easier to estimate. Applying the chain rule to a word sequence from position 1 to n expresses the probability of the sequence as a product of conditional probabilities, as shown below.
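The chain-rule formula, which the transcript does not preserve, is the standard one:

```latex
P(w_1^n) \;=\; P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2)\cdots P(w_n \mid w_1^{n-1})
        \;=\; \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})
```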

Example Sequence: "The big red dog barks". P(The big red dog barks) = P(The) * P(big | the) * P(red | the big) * P(dog | the big red) * P(barks | the big red dog). Note: P(The) is better expressed as P(The | <beginning of sentence>), written P(The | <S>).

Not a satisfying solution: even for small n (e.g., 6) we would need far too large a corpus to estimate these conditional probabilities, since it is unlikely we will ever gather the right statistics for long prefixes. Markov assumption: the entire prefix history is not necessary; an event does not depend on all of its history, just on a fixed-length recent history. So each component in the product is replaced by an approximation that conditions on only the last N-1 words: unigram, bigram, trigram.
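The approximations the slide refers to (reconstructed, since the formulas are not preserved in the transcript):

```latex
P(w_n \mid w_1^{n-1}) \approx P(w_n) \quad \text{(unigram)} \qquad
P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1}) \quad \text{(bigram)} \qquad
P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-2}\,w_{n-1}) \quad \text{(trigram)}
```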

Probability of a sentence with N-Grams: unigram, bigram, and trigram approximations (see below).
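Reconstructed sentence-probability formulas under each approximation (the slide's own formulas are not preserved):

```latex
P(w_1^n) \approx \prod_{k=1}^{n} P(w_k) \quad \text{(unigram)} \qquad
P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1}) \quad \text{(bigram)} \qquad
P(w_1^n) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-2}\,w_{k-1}) \quad \text{(trigram)}
```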

Bigram example: <S> The big red dog barks. P(The big red dog barks) ≈ P(The | <S>) * P(big | The) * P(red | big) * P(dog | red) * P(barks | dog). What would the trigram version look like?
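For reference (an answer sketch, not on the slide), the trigram version conditions each word on the two previous tokens, padding with two start symbols:

```latex
P(\text{The big red dog barks}) \approx
P(\text{The} \mid \langle S\rangle\,\langle S\rangle)\,
P(\text{big} \mid \langle S\rangle\,\text{The})\,
P(\text{red} \mid \text{The big})\,
P(\text{dog} \mid \text{big red})\,
P(\text{barks} \mid \text{red dog})
```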

Estimates for N-Grams: the bigram (and, in general, the N-gram) probabilities are estimated from corpus counts, as shown below. A convenient fact: the number of word pairs in a corpus is essentially equal to the number of words, so the count of bigrams starting with a given word equals that word's unigram count and can serve as the denominator.
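The standard maximum-likelihood estimates the slide refers to (reconstructed):

```latex
P(w_n \mid w_{n-1}) \;=\; \frac{C(w_{n-1}\,w_n)}{\sum_{w} C(w_{n-1}\,w)} \;=\; \frac{C(w_{n-1}\,w_n)}{C(w_{n-1})}
\qquad
P(w_n \mid w_{n-N+1}^{\,n-1}) \;=\; \frac{C(w_{n-N+1}^{\,n-1}\,w_n)}{C(w_{n-N+1}^{\,n-1})}
```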

Next Time: finish N-Grams (Chapter 4), including model evaluation (Sec. 4.4) but not smoothing (Secs. 4.5-4.7); start hidden Markov models.