1 BASIC NOTIONS OF PROBABILITY THEORY. NLE 2 What probability theory is for Suppose that we have a fair dice, with six faces, and that we keep throwing.

1 BASIC NOTIONS OF PROBABILITY THEORY

NLE 2 What probability theory is for Suppose that we have a fair dice, with six faces, and that we keep throwing (‘casting’) it. For what proportion of the throws will we get – a particular value (say, 4)? – an even value? – a value greater than or equal to 3? PROBABILITY THEORY was developed to give us a vocabulary to talk about the LIKELIHOOD of certain EVENTS Where an EVENT is any result of a TRIAL (‘experiment’) – Getting the value 4 when casting a die, – Getting a value greater than 3, – But also: winning a race, getting a ‘tail’ result when flipping a coin, encountering a certain word

NLE 3 What probability theory is for Suppose you’ve already texted the characters “There in a minu” You’d like your mobile phone to guess the most likely completion of “minu” rather than MINUET or MINUS or MINUSCULE In other words, you’d like your mobile phone to know that given what you’ve texted so far, MINUTE is more likely than those other alternatives PROBABILITY THEORY was developed to formalize the notion of LIKELIHOOD

NLE 4 TRIALS (or EXPERIMENTS) Anything that may have a certain OUTCOME (on which you can make a bet, say) Classic examples: – Throwing a die – A horse race In NLE: – Looking at the next word in a text – Having your NL system perform a certain task

NLE 5 (ELEMENTARY) OUTCOMES The results of an experiment: – In a coin toss, HEAD or TAILS – In a race, the names of the horses involved Or if we are only interested in whether a particular horse wins: WIN and LOSE In NLE: – When looking at the next word: the possible words – In the case of a system: RIGHT or WRONG

NLE 6 EVENTS Often, we want to talk about the likelihood of getting one of several outcomes: – E.g., with dice, the likelihood of getting an even number, or a number greater than 3 An EVENT is a set of possible OUTCOMES (possibly just a single elementary outcome): – E1 = {4} – E2 = {2,4,6} – E3 = {3,4,5,6}

NLE 7 SAMPLE SPACES The SAMPLE SPACE is the set of all possible outcomes: – For the case of a dice, sample space S = {1,2,3,4,5,6} For the texting case: – Texting a word is a TRIAL, – The word texted is an OUTCOME, – EVENTS which result from this trial are: texting the word “minute”, texting a word that begins with “minu”, etc – The set of all possible words is the SAMPLE SPACE (NB: the sample space may be very large, or even infinite)

NLE 8 Probability Functions The likelihood of an event is indicated using a PROBABILITY FUNCTION P The probability of an event E is specified by a function P(E), with values between 0 and 1 – P(E) = 1: the event is CERTAIN to occur – P(E) = 0: the event is certain NOT to occur Example: in the case of die casting, – P(E’ = ‘getting as a result a number between 1 and 6’) = P({1,2,3,4,5,6}) = 1 – P(E’’ = ‘getting as a result 7’) = 0 The sum of the probabilities of all elementary outcomes = 1

NLE 9 Probabilities and relative frequencies In the case of a die, we know all of the possible outcomes ahead of time, and we also know a priori what the likelihood of a certain outcome is. But in many other situations in which we would like to estimate the likelihood of an event, this is not the case. For example, suppose that we would like to bet on horses rather than on dice. Harry is a race horse: we do not know ahead of time how likely it is for Harry to win. The best we can do is to ESTIMATE P(WIN) using the RELATIVE FREQUENCY of the outcome `Harry wins’ Suppose Harry raced 100 times, and won 20 races overall. Then – P(WIN) = WIN/TOTAL NUMBER OF RACES =.2 – P(LOSE) =.8 The use of probabilities we are interested in (estimate the probability of certain sequences of words) is of this type

NLE 10 Joint probabilities We are often interested in the probability of TWO events happening: – When throwing a die TWICE, the probability of getting a 6 both times – The probability of finding a sequence of two words: `the’ and `car’ We use the notation A&B to indicate the conjunction of two events, and P(A&B) to indicate the probability of such conjunction – Because events are SETS, the probability is often also written as We use the same notation with WORDS: P(‘the’ & ‘car’)

NLE 11 Other combinations of events A  B: either event A or event B happens – P(A  B) = P(A) + P(B) – P(A  B)  A: event A does not happen – P(  A) = 1 –P(A)

NLE 12 Prior probability vs. conditional probability The prior probability P(WIN) is the likelihood of an event occurring irrespective of anything else we know about the world Often however we DO have additional information, that can help us making a more informed guess about the likelihood of a certain event E.g, take again the case of Harry the horse. Suppose we know that it was raining during 30 of the races that Harry raced, and that Harry won 15 of these races. Intuitively, the probability of Harry winning when it’s raining is.5 - HIGHER than the probability of Harry winning overall – We can make a more informed guess We indicate the probability of an event A happening given that we know that event B happened as well – the CONDITIONAL PROBABILITY of A given B – as P(A|B)

NLE 13 Conditional probability Conditional probability is DEFINED as follows: Intuitively, you RESTRICT the range of trials in consideration to those in which event B took place, as well (most easily seen when thinking in terms of relative frequency)

NLE 14 Conditional probability

NLE 15 Example Consider the case of Harry the horse again: Where: – P(WIN&RAIN) = 15/100 =.15 – P(RAIN) = 30/100 =.30 This gives: (in agreement with our intuitions)

NLE 16 The chain rule The definition of conditional probability can we rewritten as: – P(A&B) = P(A|B) P(B) – P(A&B) = P(B|A) P(A) These equation generalize to the so-called CHAIN RULE: – P(w 1,w 2,w 3,….w n ) = P(w 1 ) P(w 2 |w 1 ) P(w 3 |w 1,w 2 ) …. P(w n |w 1 …. w n-1 ) The chain rule plays an important role in statistical NLE: – P(the big dog) = P(the) P(big|the) P(dog|the big)

NLE 17 Independence Additional information does not always help. For example, knowing the color of a dice usually doesn’t help us predicting the result of a throw; knowing the name of the jockey’s girlfriend doesn’t help predicting how well the horse he rides will do in a race; etc. When this is the case, we say that two events are INDEPENDENT The notion of independence is defined in probability theory using the definition of conditional probability Consider again the basic form of the chain rule: – P(A&B) = P(A|B) P(B) We say that two events are INDEPENDENT if: – P(A&B) = P(A) P(B) – P(A|B) = P(A)

NLE 18 Bayes’ theorem Suppose you’ve developed an IR system for searching a big database (say, the Web) Given any search, about 1/100,000 documents is relevant (REL) Suppose your system is pretty good: – P(YES|REL) =.95 – P(YES|  REL) =.005 What is the probability that the document is relevant, when the system says YES? – P(YES|REL)?

NLE 19 Bayes’ Theorem Bayes’ Theorem is a pretty trivial consequence of the definition of conditional probability, but it is very useful in that it allows us to use one conditional probability to compute another We already saw that the definition of conditional probability can be rewritten equivalently as: – P(A&B) = P(A|B) P(B) – P(A&B) = P(B|A) P(A) If we equate the two left sides, we get Bayes’ theorem

NLE 20 Application of Bayes’ theorem

NLE 21 Statistical NLE What’s the connection between this and natural language? A number of NL interpretation (and generation) tasks can be formulated in terms of CHOICE BETWEEN ALTERNATIVES: choosing the most likely – continuation of a certain sentence – POS tag or meaning for a word – Parse for a sentence In all of these cases, we can formalize `likelihood’ using probabilities, and choose the alternative with THE HIGHEST PROBABILITY Tomorrow we will see the first (and simplest) example of this: choosing the most likely next word This task can be viewed as the task of choosing the w that maximizes: P(w | W 1 …. W N-1 )

NLE 22 Using corpora to estimate probabilities But where do we get these probabilities? Idea: estimate them by RELATIVE FREQUENCY. The simplest method: Maximum Likelihood Estimate (MLE). Count the number of words in a corpus, then count how many times a given sequence is encountered.

NLE 23 Readings Krenn and Samuelsson, The Linguist’s Guide to Statistics (on the Web site) The Statistics GlossaryStatistics Glossary Further reading: Manning and Schuetze, chapter 2.1

1 BASIC NOTIONS OF PROBABILITY THEORY. NLE 2 What probability theory is for Suppose that we have a fair dice, with six faces, and that we keep throwing.

Similar presentations

Presentation on theme: "1 BASIC NOTIONS OF PROBABILITY THEORY. NLE 2 What probability theory is for Suppose that we have a fair dice, with six faces, and that we keep throwing."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 BASIC NOTIONS OF PROBABILITY THEORY. NLE 2 What probability theory is for Suppose that we have a fair dice, with six faces, and that we keep throwing.

Similar presentations

Presentation on theme: "1 BASIC NOTIONS OF PROBABILITY THEORY. NLE 2 What probability theory is for Suppose that we have a fair dice, with six faces, and that we keep throwing."— Presentation transcript:

Similar presentations

About project

Feedback