CS 4705 Probabilistic Approaches to Pronunciation and Spelling

Spoken and Written Word (Lexical) Errors
Variation vs. error
Word formation errors:
– I go to Columbia Universary.
– Easy enoughly.
– words of rule formation
Lexical access problems:
– Turn to the right (left)
– "I called my mother on the television and did not understand the door. It was too breakfast, but they came from far to near. My mother is not too old for me to be young." (Wernicke's aphasia)

Can humans understand 'what is meant' as opposed to 'what is said/written'?
– Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteers are at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.
How do we do this?

Detecting and Correcting Spelling Errors
Applications:
– Spell checking in M$ Word
– OCR scanning errors
– Handwriting recognition of zip codes, signatures, Graffiti
Issues:
– Correcting non-words (dg for dog, but frplc?)
– Correcting "wrong" words in context (their for there, words of rule formation)

Patterns of Error
Human typists make different types of errors from OCR systems -- why?
Error classification I: performance-based (for target cat):
– Insertion: catt
– Deletion: ct
– Substitution: car
– Transposition: cta
Error classification II: cognitive:
– People don't know how to spell (nucular/nuclear)
– Homonymous errors (their/there)

How do we decide if a (legal) word is an error? How likely is a word to occur?
– They met there friends in Mozambique.
The Noisy Channel Model:
– Input to channel: true (typed or spoken) word w
– Output from channel: an observation O
– Decoding task: find ŵ = argmax_w P(w|O)
[Diagram: Source → Noisy Channel → Decoder]

Bayesian Inference
Population: 10 Columbia students (4 vegetarians, 3 CS majors)
– What is the probability that a randomly chosen student (rcs) is a vegetarian? p(v) = .4
– That a rcs is a CS major? p(c) = .3
– That a rcs is a vegetarian CS major? p(c,v) = .2

Bayesian Inference
Population: 10 Columbia students (4 vegetarians, 3 CS majors)
– Probability that a rcs is a vegetarian? p(v) = .4
– That a rcs who is a vegetarian is also a CS major? p(c|v) = .5
– That a rcs is a vegetarian (and) CS major? p(c,v) = .2

Bayesian Inference
Population: 10 Columbia students (4 vegetarians, 3 CS majors)
– Probability that a rcs is a vegetarian? p(v) = .4
– That a rcs who is a vegetarian is also a CS major? p(c|v) = .5
– That a rcs is a vegetarian and a CS major? p(c,v) = .2 = p(v) p(c|v) = (.4 * .5)

Bayesian Inference
Population: 10 Columbia students (4 vegetarians, 3 CS majors)
– Probability that a rcs is a CS major? p(c) = .3
– That a rcs who is a CS major is also a vegetarian? p(v|c) ≈ .66
– That a rcs is a vegetarian CS major? p(c,v) = .2 = p(c) p(v|c) = (.3 * .66)

Bayes Rule
So, we know the joint probabilities:
– p(c,v) = p(c) p(v|c)
– p(v,c) = p(v) p(c|v)
– p(c,v) = p(v,c)
Using these equations, we can define the conditional probability p(c|v) in terms of the prior probabilities p(c) and p(v) and the likelihood p(v|c):
– p(v) p(c|v) = p(c) p(v|c), so p(c|v) = p(c) p(v|c) / p(v)
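As a quick numeric check of Bayes rule on the population above (a minimal Python sketch; the counts are the ones given on these slides):

    # 10 students: 4 vegetarians, 3 CS majors, 2 vegetarian CS majors
    p_v  = 4 / 10   # prior p(vegetarian)
    p_c  = 3 / 10   # prior p(CS major)
    p_cv = 2 / 10   # joint p(CS major, vegetarian)

    p_c_given_v = p_cv / p_v   # 0.5
    p_v_given_c = p_cv / p_c   # 0.666...

    # Bayes rule: p(c|v) = p(c) * p(v|c) / p(v)
    assert abs(p_c_given_v - p_c * p_v_given_c / p_v) < 1e-12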

Returning to Spelling...
– Channel input: w; output: O
– Decoding: hypothesis ŵ = argmax_w P(w|O)
– or, by Bayes Rule: ŵ = argmax_w P(O|w) P(w) / P(O)
– and, since P(O) doesn't change for any of the lexicon entries we consider, we can ignore it as a constant, so...
– ŵ = argmax_w P(O|w) P(w) (given that w was intended, how likely are we to see O?)
[Diagram: Source → Noisy Channel → Decoder]
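A minimal sketch of this decoding step in Python, assuming a candidate list plus estimated prior and likelihood tables are already available (the function and table names here are illustrative, not from the lecture):

    # Noisy channel decoding: choose the candidate w maximizing P(O|w) * P(w).
    # `prior` maps w -> P(w); `likelihood` maps (O, w) -> P(O|w).
    def correct(observation, candidates, prior, likelihood):
        return max(candidates,
                   key=lambda w: likelihood.get((observation, w), 0.0)
                                 * prior.get(w, 0.0))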

How could we use this model to correct spelling errors?
Simplifying assumptions:
– We only have to correct non-word errors
– Each non-word (O) differs from its correct word (w) by one step (insertion, deletion, substitution, transposition)
From O, generate a list of candidates differing by one step and appearing in the lexicon, e.g.:

Error  Corr   Corr letter  Error letter  Pos  Type
caat   cat    -            a             2    ins
caat   carat  r            -             3    del
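One way to generate such candidates is to enumerate every string one edit step away from O and keep only those found in the lexicon, in the style of Norvig's well-known spelling corrector (a sketch; the lexicon is assumed to be a plain set of words):

    # All lexicon words reachable from `word` by one insertion, deletion,
    # substitution, or transposition.
    def candidates(word, lexicon):
        letters = 'abcdefghijklmnopqrstuvwxyz'
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes     = [L + R[1:]               for L, R in splits if R]
        transposes  = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        substitutes = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts     = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + substitutes + inserts) & lexicon

    print(candidates('caat', {'cat', 'carat', 'coat'}))  # all three are one step away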

How do we decide which correction is most likely?
We want to find the lexicon entry w that maximizes P(typo|w) P(w)
How do we estimate the likelihood P(typo|w) and the prior P(w)?
First, find some corpora:
– Different corpora are needed for different purposes
– Some need to be labeled; others do not
– For spelling correction, what do we need?
  - Word occurrence information (unlabeled)
  - A corpus of labeled spelling errors

Cat vs. Carat
Suppose we look at the occurrences of cat and carat in a large (50M word) AP news corpus:
– cat occurs 6,500 times, so p(cat) = .00013
– carat occurs 3,000 times, so p(carat) = .00006
Now we need to find out whether inserting an 'a' after an 'a' is more likely than deleting an 'r' after an 'a', using a corrections corpus of 50K corrections (→ p(typo|word)):
– suppose 'a' insertion after 'a' occurs 5,000 times (p(+a) = .1) and 'r' deletion occurs 7,500 times (p(-r) = .15)

Then p(word|typo) ∝ p(typo|word) * p(word):
– p(cat|caat) ∝ p(+a) * p(cat) = .1 * .00013 = .000013
– p(carat|caat) ∝ p(-r) * p(carat) = .15 * .00006 = .000009
– so cat is the more likely correction of caat
Issues:
– What if there are no instances of carat in the corpus? → Smoothing algorithms
– The estimate of P(typo|word) may not be accurate → Training probabilities on typo/word pairs
– What if there is more than one error per word?
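The same comparison worked through as a short Python sketch, using the counts above (the variable names are mine):

    CORPUS_SIZE = 50_000_000        # AP news corpus (words)
    CORRECTIONS = 50_000            # corrections corpus (entries)

    p_cat   = 6_500 / CORPUS_SIZE   # prior p(cat)   = .00013
    p_carat = 3_000 / CORPUS_SIZE   # prior p(carat) = .00006
    p_ins_a = 5_000 / CORRECTIONS   # p('a' inserted after 'a') = .1
    p_del_r = 7_500 / CORRECTIONS   # p('r' deleted)            = .15

    score_cat   = p_ins_a * p_cat   # .000013
    score_carat = p_del_r * p_carat # .000009
    print(max([('cat', score_cat), ('carat', score_carat)], key=lambda x: x[1]))
    # -> ('cat', 1.3e-05): cat wins as the correction for 'caat'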

A More General Approach: Minimum Edit Distance
How can we measure how different one word is from another?
– How many operations will it take to transform one word into the other? caat --> cat, fplc --> fireplace (*treat abbreviations as typos??)
– Levenshtein distance: the smallest number of insertion, deletion, or substitution operations that transforms one string into another (ins = del = subst = 1)
– Alternative: weight each operation by training on a corpus of spelling errors to see which are most frequent

– Alternative: count substitutions as 2 (1 insertion and 1 deletion)
– Alternative: Damerau-Levenshtein distance includes transpositions as a single operation (e.g. cta → cat)
Code and demo for simple Levenshtein MED (a sketch appears below)
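A minimal Python sketch of the simple Levenshtein MED (ins = del = subst = 1; this is not the demo code linked from the original slide):

    # Minimum edit distance with ins = del = subst = 1.
    def med(source, target):
        m, n = len(source), len(target)
        # d[i][j] = distance between source[:i] and target[:j]
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i              # delete all i chars of the source prefix
        for j in range(n + 1):
            d[0][j] = j              # insert all j chars of the target prefix
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                subst = 0 if source[i - 1] == target[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,          # deletion
                              d[i][j - 1] + 1,          # insertion
                              d[i - 1][j - 1] + subst)  # substitution
        return d[m][n]

    print(med('intention', 'execution'))  # -> 5 under this cost model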

MED Calculation is an Example of Dynamic Programming
Decompose a problem into its subproblems:
– e.g. fp --> firep is a subproblem of fplc --> fireplace
– Intuition: an optimal solution to the subproblem will be part of an optimal solution to the problem
– Solve any subproblem only once: store all solutions
– Recursive algorithm
– Often: work backwards from the desired goal state to the initial state

For MED, create an edit-distance matrix:
– each cell c[x,y] represents the distance between the first x chars of the target t and the first y chars of the source s (i.e. the x-length prefix of t compared to the y-length prefix of s)
– this distance is the minimum cost of the insertion, deletion, and substitution operations on the previously considered substrings of the source and target
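Spelled out, the recurrence that fills each cell looks like this (a standard formulation, written with the subst = 2 cost model of the next slide, where substituting a character for itself costs 0):

    c[x, y] = min( c[x-1, y]   + 1,            (insert/delete one char)
                   c[x, y-1]   + 1,            (insert/delete one char)
                   c[x-1, y-1] + subst-cost )  where subst-cost = 0 if t[x] = s[y], else 2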

Edit Distance Matrix, Subst = 2 -- Wrong
[Matrix figure omitted; NB: the matrix on this slide contains errors]

NB: Substituting x for x costs 0, not 2
[Corrected edit-distance matrix figure omitted: source #execution vs. target #intention]

Summary
We can apply probabilistic modeling to NL problems like spell-checking:
– Noisy channel model, Bayesian method
– Training priors and likelihoods on a corpus
Dynamic programming approaches allow us to solve large problems that can be decomposed into subproblems:
– e.g. the MED algorithm
We can apply similar methods to modeling pronunciation variation:

– Allophonic variation + register/style (lexical) variation: butter/tub, going to/gonna
– Pronunciation phenomena can be seen as insertions/deletions/substitutions too, with somewhat different ways of computing the likelihoods
– Measuring ASR accuracy over words (WER)
Next time: Ch. 6