Chomsky’s Spell Checker
Cohan Sujay Carlos, Aiaioo Labs, Bangalore


How would you build a spell checker for Chomsky? Use spelling rules to correct spelling? Use statistics and probabilities?

Chomsky derided researchers in machine learning who use purely statistical methods to produce behaviour that mimics something in the world, but who don't try to understand the meaning of that behaviour.

Chomsky compared such researchers to scientists who might study the dance made by a bee returning to the hive, and who could produce a statistically based simulation of such a dance without attempting to understand why the bee behaved that way.

So, don't misread Chomsky. As you can see, Chomsky has a problem with machine learning, or rather with what some users of machine learning fail to do … but no problem with statistics.

Uses of Statistics in Linguistics How can statistics be useful? Can probabilities be useful?

Statistical <> Probabilistic. Statistical: 1. Get an estimate of a parameter value. 2. Show that it is significant. Example: F = G m1 m2 / r^2. Used in establishing the validity of experimental claims.

Statistical <> Probabilistic. Probabilistic: Assign probabilities to the outcomes of an experiment. In linguistics, the outcome is a category.

Linguistic categories can be uncertain. Say you have two categories: 1. Sentences that try to make you act. 2. Sentences that supply information. "Buy an Apple iPod" is of category 1. "An iPod costs $400" is of category 2. Good so far … but … "I recommend an Apple iPod" is what? Maybe a bit of both?

What is a Probability? A number between 0 and 1. The sum of the probabilities on all outcomes must be 1.

What is a Probability? A number between 0 and 1 assigned to an outcome such that the sum of the probabilities on all outcomes is 1. Example: outcomes heads and tails, with P(heads) = 0.5 and P(tails) = 0.5.

Categorization Problem: Spelling. 1. Field 2. Wield 3. Shield 4. Deceive 5. Receive 6. Ceiling

Rule-based Approach “I before E except after C” -- an example of a linguistic insight

Probabilistic Statistical Model: Count the occurrences of ‘ie’ and ‘ei’ and ‘cie’ and ‘cei’ in a large corpus P(IE) = P(EI) = P(CIE) = P(CEI) =
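The slide leaves the actual values blank, so here is a minimal sketch of how one might compute them, assuming a plain-text corpus in a hypothetical file called corpus.txt:

```python
# Count the patterns in a corpus and turn the counts into probabilities.
text = open("corpus.txt", encoding="utf-8").read().lower()

counts = {p: text.count(p) for p in ("ie", "ei", "cie", "cei")}

# One simple normalization: divide by the number of two-letter windows,
# so each value is the chance of seeing the pattern at a random position.
n_positions = max(len(text) - 1, 1)
for pattern, count in counts.items():
    print(f"P({pattern.upper()}) = {count / n_positions:.6f}")
```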

Words where 'ie' occurs after 'c': science, society, ancient, species.

Courtesy Peter Norvig’s article on Chomsky

Chomsky, in 1969, wrote: "But it must be recognized that the notion of 'probability of a sentence' is an entirely useless one, under any known interpretation of this term."

Chomsky’s Argument A completely new sentence must have a probability of 0, since it is an outcome that has not been seen. Since novel sentences are in fact generated all the time, there is a contradiction.

Probabilistic language model. Probabilities should broadly indicate the likelihood of sentences: P(I saw a van) >> P(eyes awe of an). Courtesy Dan Klein.

AXIOMS of Probability. Let's start!
Definitions: Outcome; Sample Space S; Event E.
Axiom 1: 0 <= P(E) <= 1
Axiom 2: P(S) = 1
Axiom 3: P(E1 ∪ E2 ∪ ...) = P(E1) + P(E2) + ... for mutually exclusive events

CONDITIONAL PROBABILITY. P(EF) = P(E|F) * P(F). Example: tossing 2 coins. This is a table of P(EF):

  Coin 1 (E)   Coin 2 (F)   P(EF)
  0 (tails)    0 (tails)    0.25
  0 (tails)    1 (heads)    0.25
  1 (heads)    0 (tails)    0.25
  1 (heads)    1 (heads)    0.25

You find that no matter what E or F is, P(EF) = 0.25.

MARGINAL PROBABILITIES. From the previous table you can calculate that:

  F           P(F)        E           P(E)
  0 (tails)   0.5         0 (tails)   0.5
  1 (heads)   0.5         1 (heads)   0.5

CONDITIONAL PROBABILITY. P(EF) = P(E|F) * P(F). So, now, applying the above equation:

  Coin 1 (E)   Coin 2 (F)   P(E|F)
  0 (tails)    0 (tails)    0.5
  0 (tails)    1 (heads)    0.5
  1 (heads)    0 (tails)    0.5
  1 (heads)    1 (heads)    0.5

You see that in all cases, P(E|F) = P(E). This means E and F are independent.
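The three tables above can be reproduced mechanically. Here is a small sketch (names are illustrative) that starts from the joint distribution, derives the marginals and conditionals, and checks independence:

```python
from itertools import product

outcomes = (0, 1)  # 0 = tails, 1 = heads

# Joint distribution P(E, F) for the two coins, as in the first table.
joint = {(e, f): 0.25 for e, f in product(outcomes, outcomes)}

# Marginals: sum the joint over the other variable.
p_e = {e: sum(joint[(e, f)] for f in outcomes) for e in outcomes}
p_f = {f: sum(joint[(e, f)] for e in outcomes) for f in outcomes}

# Conditionals: P(E|F) = P(EF) / P(F).
cond = {(e, f): joint[(e, f)] / p_f[f] for e, f in product(outcomes, outcomes)}

# E and F are independent when P(E|F) = P(E) for every pair.
independent = all(abs(cond[(e, f)] - p_e[e]) < 1e-12
                  for e, f in product(outcomes, outcomes))
print(p_e, p_f, independent)  # {0: 0.5, 1: 0.5} {0: 0.5, 1: 0.5} True
```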

CONDITIONAL PROBABILITY. P(DEF) = P(D|EF) * P(EF). This is very similar to P(EF) = P(E|F) * P(F): I've just replaced E with D, and F with EF.

CONDITIONAL PROBABILITY. Using P(EF) = P(E|F) * P(F) to expand the P(EF) term: P(DEF) = P(D|EF) * [ P(E|F) * P(F) ].

CONDITIONAL PROBABILITY. P(DEF) = P(D|EF) * P(E|F) * P(F). P("I am human") = P("human" | "I am") * P("am" | "I") * P("I").

CONDITIONAL PROBABILITY. P(DEF) = P(D|EF) * P(E|F) * P(F). P("I am human") = P("human" | "I am") * P("am" | "I") * P("I") = C("I am human")/C("I am") * C("I am")/C("I") * C("I")/C("*"), where C(x) is the number of times x occurs in a corpus and C("*") is the total word count.

Markov Assumption You can’t remember too much.

CONDITIONAL PROBABILITY. P("I am human") = P("human" | "I am") * P("am" | "I") * P("I") = C("I am human")/C("I am") * C("I am")/C("I") * C("I")/C("*"). So, you 'forget' part of the history (the words marked in green on the slide): the condition "I am" shrinks to just "am".

CONDITIONAL PROBABILITY. With the Markov assumption, P(DEF) ≈ P(D|E) * P(E|F) * P(F). P("I am human") = P("human" | "am") * P("am" | "I") * P("I") = C("am human")/C("am") * C("I am")/C("I") * C("I")/C("*"). Now this forgetful model can deal with unknown sentences. I'll give you one.

◦ P(“Cohan is short”) ◦ = P(“short” | “is”) * P(“is” | “Cohan”) * P(“Cohan”) ◦ = C(“is short”)/C(“is”) * C(“Cohan is”)/C(“Cohan”) * C(“Cohan”)/C(“*”) UNKNOWN SENTENCE I never heard anyone say this before! I am 6 ft and some. It’s a hitherto unknown sentence!
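As a sketch of the forgetful (bigram) model scoring this unseen sentence: the three-sentence corpus below is made up for illustration, and a real model would also need smoothing for unseen bigrams.

```python
from collections import Counter

corpus = ["Cohan is here", "he is short", "Cohan said so"]  # toy corpus

unigrams, bigrams = Counter(), Counter()
total = 0
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))
    total += len(words)

def bigram_prob(sentence):
    """P(w1) * P(w2|w1) * ... estimated from counts, as on the slide."""
    words = sentence.split()
    p = unigrams[words[0]] / total  # C("Cohan") / C("*")
    for prev, word in zip(words, words[1:]):
        p *= bigrams[(prev, word)] / unigrams[prev]  # e.g. C("is short") / C("is")
    return p

# The sentence never appears in the corpus, yet gets a non-zero probability.
print(bigram_prob("Cohan is short"))  # 2/9 * 1/2 * 1/2 ≈ 0.056
```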

Solved! But there’s another way to solve it … and you’ll see how this has a lot to do with spell checking!

Spell Checking. Let's say the word 'hand' is mistyped: hand → *hamd. There you have an unknown word!

Spell Checking Out-of-Vocabulary Error *Hamd

Spell Checking. *Hamd → Hand?

Spell Checking. *Hamd → Hand? Hard?

What we need to find … P(hard|hamd) P(hand|hamd) Whichever is greater is the right one!

What we need to find … But how do you find the probability P(hard|hamd) when ‘hamd’ is an *unknown word*? This is related to Chomsky’s conundrum of the unknown sentence, isn’t it? The second approach uses a little additional probability theory.

BAYES RULE. Conditional probability gives us P(FE) = P(F|E) * P(E) and P(EF) = P(E|F) * P(F). Since P(FE) = P(EF), equating the two yields P(E|F) = P(F|E) * P(E) / P(F).

BAYES RULE. P(E|F) = P(F|E) * P(E) / P(F). P(hand|hamd) ∝ P(hamd|hand) * P(hand). (The denominator P(hamd) is the same for every candidate word, so we can drop it when comparing candidates.)

BAYES RULE. P(E|F) = P(F|E) * P(E) / P(F). P(hand|hamd) ∝ P(hamd|hand) * P(hand). P(hamd|hand) = P(h|h) P(a|a) P(m|n) P(d|d), treating each letter as corrupted independently.

Kernighan, Church and Gale's paper on spelling correction (1990).

BAYES RULE. P(E|F) = P(F|E) * P(E) / P(F). P(hand|hamd) ∝ P(hamd|hand) * P(hand), and P(hamd|hand) = P(h|h) P(a|a) P(m|n) P(d|d). The matching-letter terms P(h|h), P(a|a) and P(d|d) (the stuff in green on the slide) are approximately 1, so ignore them: P(hand|hamd) = P(m|n) * P(hand).

So we have what we needed to find … P(hand|hamd) = P(m|n) * P(hand). P(hard|hamd) = P(m|r) * P(hard). Whichever is greater is the right one!
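A minimal sketch of this comparison in code. The channel probabilities and word priors below are invented numbers for illustration, not values from Kernighan's paper:

```python
candidates = {
    # candidate: (channel probability, prior probability) -- assumed values
    "hand": (0.005, 3e-5),  # channel P(m|n), prior P(hand)
    "hard": (0.002, 2e-5),  # channel P(m|r), prior P(hard)
}

def score(word):
    channel, prior = candidates[word]
    return channel * prior  # P(word | hamd), up to the constant P(hamd)

print(max(candidates, key=score))  # the candidate with the higher channel * prior
```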

Note: Use of Unigram Probabilities. The P(m|n) part is called the channel probability. The P(hand) part is called the prior probability. Kernighan's experiment showed that the prior P(hand) is very important (it causes a 7% improvement)!

Side Note: We may be able to use a dictionary that also contains wrongly spelt words, with P(hand) >> P(hazd) in the dictionary. It is still possible that P(m|n) * P(hand) > P(m|z) * P(hazd). But more of what would have been OOV errors will now be in-vocabulary errors.

Solved! Chomsky’s Spell Checker

But there’s another interesting problem in Spell Checking … When is an unknown word not an error?

Spell Checking. How do you tell whether an unknown word is a new word or a spelling mistake? How can you tell that "Shivaswamy" is not a wrong spelling of "Shuddering"? You can use statistics.

Uses of Statistics: Papers, Spell-Checkers, Collocation Discovery.

Clustering: [research paper]. Clustering Algorithm 1: o ooooooo. Clustering Algorithm 2: ooo ooo oo. The paper claims Algorithm 2 is better than Algorithm 1. Is the claim valid?

Clustering. Clustering Algorithm 1: o ooooooo. Clustering Algorithm 2: ooo ooo oo. Now, what do you think?

Purity of Clusters (estimates). Algorithm 1: 1.00. Algorithm 2: 0.55. Now, can I make the reverse claim, that Algorithm 1 is better?

Purity of Clusters (variance). Algorithm 1: … Algorithm 2: … Now, is the improvement because of randomness? Or is it significant?

Coin Toss Experiment Analogy. O = heads, O = tails. Say 4 tosses come up with 3 heads and 1 tail. Can I now claim that P(H) = 0.75? Can I make the claim that I have a coin whose P(H) is 0.75 and not 0.5?

Check significance with a t-test. The t-test will show that 3 heads in 4 tosses could very easily have occurred by chance on a coin with P(H) = 0.5.
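We can quantify "very easily" with an exact binomial tail probability instead of a t-table, as a quick sketch:

```python
from math import comb

# Probability of 3 or more heads in 4 tosses of a fair coin.
n, p = 4, 0.5
p_at_least_3 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in (3, 4))
print(p_at_least_3)  # 0.3125 -- far too likely to be evidence against P(H) = 0.5
```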

T-test I’ll need 50 tosses to claim that P(H) = 0.75 with a 5% probability of being wrong. Image courtesy of a website I no longer remember

See how Kernighan established statistical significance in his paper (estimate, variance, significance).

T-test You can also use the T-test in spell checking! Image courtesy of a website I no longer remember

Spell Checking. How do you tell whether an unknown word is a new word or a spelling mistake? How can you tell that "Shivaswamy" is not a wrong spelling of "Shuddering"?

T-test. Calculate the probability of "Shivaswamy" having come about through errors made on "Shuddering" … that is, 8 errors happening in 10 letters: P(error) = 0.8. Say the average letter error rate is 0.1. Verify using the t-test that "Shivaswamy" is not likely to have come about from "Shuddering" by a 0.1 error-per-letter process. Use the word length (10) as n.
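Here is a sketch of that check, using an exact binomial tail probability in place of the t-test named on the slide; the 0.1 error rate is the assumed value from above:

```python
from math import comb

n, errors, rate = 10, 8, 0.1  # word length, observed errors, per-letter error rate

# Probability that a 0.1 error-per-letter process corrupts 8 or more of 10 letters.
p_tail = sum(comb(n, k) * rate**k * (1 - rate)**(n - k)
             for k in range(errors, n + 1))
print(p_tail)  # ~3.7e-7: "Shivaswamy" is almost surely not a typo of "Shuddering"
```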

Statistical Spell Checker. So now you have a spell checker that can not only correct OOV errors, but also knows when not to do so!

Recap. What you have learnt in Probabilities: Axioms of Probability; Conditional and Joint Probabilities; Independence; Bayesian Inversion. That is all you will need to start reading statistical NLP papers on: Language Models; Spell Checking; POS Tagging; Machine Translation.

That's all about the spell checker! Yeah, I brought in some controversial Chomsky quotes to motivate the linguistics students to sit through the lecture, but seriously, don't you think the chances are high that he uses a spell checker like this one?