1
Probabilistic reasoning
Dr. C. Lee Giles David Reese Professor, College of Information Sciences and Technology The Pennsylvania State University, University Park, PA, USA Special thanks to J. Lafferty, J. Latombe, T. Cover, R.V. Jones
2
Probability vs all the others
Probability theory: the branch of mathematics concerned with the analysis of random phenomena.
Randomness: non-order or non-coherence in a sequence of symbols or steps, such that there is no intelligible pattern or combination.
The central objects of probability theory are random variables, stochastic processes, and events: mathematical abstractions of non-deterministic events or measured quantities that may either be single occurrences or evolve over time in an apparently random fashion.
Uncertainty: a lack of knowledge about an event.
Can be represented by a probability (e.g., roll a die, draw a card)
Can be represented as an error
Statistic (a measure in statistics): probability can be used in determining that measure
3
Founders of Probability Theory
Blaise Pascal (1623-1662, France) and Pierre Fermat (1601-1665, France) laid the foundations of probability theory in a correspondence on a dice game posed by a French nobleman.
4
Sample Spaces – measures of events
Collection (list) of all possible outcomes.
EXPERIMENT: ROLL A DIE, e.g.: all six faces of a die
EXPERIMENT: DRAW A CARD, e.g.: all 52 cards in a deck
5
Types of Events Random event Simple event Joint event
Different likelihoods of occurrence.
Simple event: an outcome from a sample space with one characteristic, in simplest form, e.g.: king of clubs from a deck of cards.
Joint event: conjunction (AND, ∧, ",") or disjunction (OR, v); contains several simple events, e.g.: a red ace from a deck of cards (ace of hearts OR ace of diamonds).
6
Visualizing Events Excellent ways of determining probabilities:
Contingency tables (a neat way to look at events): rows Ace / Not Ace, columns Black / Red, with row and column totals.
Tree diagrams: Full Deck of Cards branches into Red Cards and Black Cards; each of these branches into Ace and Not an Ace.
7
Review of Probability Rules
Given 2 events G, H:
P(G OR H) = P(G) + P(H) - P(G AND H); for mutually exclusive events, P(G AND H) = 0
P(G AND H) = P(G) P(H|G), also written as P(H|G) = P(G AND H) / P(G)
If G and H are independent, P(H|G) = P(H), thus P(G AND H) = P(G) P(H)
P(G) ≥ P(G AND H); P(H) ≥ P(G AND H)
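The rules above can be checked by brute-force enumeration over a deck of cards. This is a minimal sketch (the helper `prob` and the choice of events G = "ace" and H = "red" are ours, not from the slides):

```python
# Verify the probability rules by enumerating a standard 52-card deck.
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

def prob(event):
    """P(event) as the exact fraction of cards satisfying the predicate."""
    return Fraction(sum(1 for c in deck if event(c)), len(deck))

G = lambda c: c[0] == "A"                     # card is an ace
H = lambda c: c[1] in ("hearts", "diamonds")  # card is red

p_g, p_h = prob(G), prob(H)
p_g_and_h = prob(lambda c: G(c) and H(c))
p_g_or_h = prob(lambda c: G(c) or H(c))

# Addition rule: P(G OR H) = P(G) + P(H) - P(G AND H)
assert p_g_or_h == p_g + p_h - p_g_and_h
# Rank and suit are independent here, so P(G AND H) = P(G) P(H)
assert p_g_and_h == p_g * p_h
# Conjunctions are never more probable than their parts
assert p_g >= p_g_and_h and p_h >= p_g_and_h
```

Exact `Fraction` arithmetic avoids floating-point surprises when comparing probabilities for equality.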
8
Odds Another way to express probability is in terms of odds, d
d = p/(1-p), where p is the probability of an outcome. Example: what are the odds of getting a six on a die throw? We know that p = 1/6, so d = (1/6)/(1 - 1/6) = (1/6)/(5/6) = 1/5. Gamblers often turn it around and say that the odds against getting a six on a die roll are 5 to 1.
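The conversion above is a one-liner in each direction. A minimal sketch (the helper names `odds_for` and `prob_from_odds` are ours):

```python
# Convert between probability p and odds d = p / (1 - p).
def odds_for(p):
    """Odds in favor of an outcome with probability p."""
    return p / (1 - p)

def prob_from_odds(d):
    """Invert the conversion: p = d / (1 + d)."""
    return d / (1 + d)

d_six = odds_for(1 / 6)  # odds of rolling a six: 1/5, i.e. "5 to 1 against"
```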
9
Probabilistic Reasoning
Types of probabilistic reasoning Reasoning using probabilistic methods Reasoning with uncertainty Rigorous reasoning vs heuristics or biases
10
Heuristics and Biases in Reasoning
Tversky & Kahneman showed that people often do not follow the rules of probability. Instead, decision making may be based on heuristics (heuristic decision making), which lower cognitive load but may lead to systematic errors and biases. Example heuristics: representativeness, availability, the conjunction fallacy.
11
Gambling/Predictions
A fair coin is flipped (H = heads, T = tails). Which is the more likely sequence? A) H T H T T H  B) H H H H H H. What result is more likely to follow A)? To follow B)?
12
Decision Tree
13
Representativeness Heuristic
The sequence "H T H T T H" is seen as more representative of, or similar to, a prototypical coin sequence, even though each sequence has the same probability of occurring. The likelihood of the next flip is the same after both A and B: ½ H, ½ T. A T after B) is no more likely; the events are independent. Believing otherwise is the Gambler's Fallacy. When is this not the case?
14
Please choose the most likely alternative:
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. (a) Linda is a bank teller (b) Linda is a bank teller and is active in the feminist movement
15
Conjunction Fallacy
Nearly 90% choose the second alternative (bank teller and active in the feminist movement), even though it is logically incorrect (the conjunction fallacy).
Venn diagram: bank tellers who are not feminists, feminists who are not bank tellers, and their intersection, feminist bank tellers; the intersection can never be larger than either set.
P(A) ≥ P(A,B); P(B) ≥ P(A,B)
Kahneman and Tversky (1982)
16
How to avoid these mistakes
Such mistakes can cause bad decisions and loss of Profits Lives Health Justice (prosecutor’s fallacy) Etc. Instead, use probabilistic methods
17
Example Reasoning with an Uncertain Agent
[Diagram: an agent connected to its environment through sensors and actuators, with an internal model; question marks indicate the uncertainty at each connection]
18
An Old Problem … Getting Somewhere
19
Types of Uncertainty
For example, to drive your car in the morning:
It must not have been stolen during the night
It must not have flat tires
There must be gas in the tank
The battery must not be dead
The ignition must work
You must not have lost the car keys
No truck should obstruct the driveway
You must not have suddenly become blind or paralytic
Etc.
Not only would it not be possible to list all of them, but would trying to do so even be efficient?
Uncertainty in prior knowledge: e.g., some causes of a disease are unknown and are not represented in the background knowledge of a medical-assistant agent.
Uncertainty in actions: e.g., actions are represented with relatively short lists of preconditions, while these lists are in fact arbitrarily long.
20
Questions
How to represent uncertainty in knowledge?
How to reason (make inferences) with uncertain knowledge?
Which action to choose under uncertainty?
21
Handling Uncertainty Possible Approaches: Default reasoning
Worst-case reasoning Probabilistic reasoning
22
Default Reasoning Creed: The world is fairly normal. Abnormalities are rare. So, an agent assumes normality until there is evidence to the contrary. E.g., if an agent sees a bird x, it assumes that x can fly, unless it has evidence that x is a penguin, an ostrich, a dead bird, a bird with broken wings, …
23
Worst-Case Reasoning Creed: Just the opposite! The world is ruled by Murphy's Law. Uncertainty is defined by sets, e.g., the set of possible outcomes of an action, or the set of possible positions of a robot. The agent assumes the worst case, and chooses the action that maximizes a utility function in this case. Example: adversarial search.
24
Probabilistic Reasoning
Creed: The world is not divided between “normal” and “abnormal”, nor is it adversarial. Possible situations have various likelihoods (probabilities) The agent has probabilistic beliefs – pieces of knowledge with associated probabilities (strengths) – and chooses its actions to maximize the expected value of some utility function
25
Notion of Probability Axioms of probability
You drive on Atherton often, and you notice that 70% of the time there is a traffic slowdown at the exit to Park. The next time you plan to drive on Atherton, you will believe that the proposition "there is a slowdown at the exit to Park" is True with probability 0.7.
Axioms of probability:
The probability of a proposition A is a real number P(A) between 0 and 1
P(True) = 1 and P(False) = 0
P(A v B) = P(A) + P(B) - P(A ∧ B)
Since P(A v ¬A) = P(True) = 1 = P(A) + P(¬A), we get P(¬A) = 1 - P(A)
P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
26
Interpretations of Probability
Frequency Subjective
27
Frequency Interpretation
Draw a ball from a bag containing n balls of the same size, r red and s yellow. The probability that the proposition A = “the ball is red” is true corresponds to the relative frequency with which we expect to draw a red ball P(A) = r/n
28
Subjective Interpretation
There are many situations in which there is no objective frequency interpretation: On a windy day, just before paragliding from the top of El Capitan, you say “there is probability 0.05 that I am going to die” You have worked hard in this class and you believe that the probability that you will get an A is 0.9
29
Random Variables A proposition that takes the value True with probability p and False with probability 1-p is a random variable with distribution (p, 1-p). If a bag contains balls having 3 possible colors (red, yellow, and blue) the color of a ball picked at random from the bag is a random variable with 3 possible values. The (probability) distribution of a random variable X with n values x1, x2, …, xn is (p1, p2, …, pn), with P(X = xi) = pi and Σi=1..n pi = 1.
30
Expected Value Random variable X with n values x1, …, xn and distribution (p1, …, pn). E.g.: X is the state reached after doing an action A under uncertainty. Function U of X, e.g., U is the utility of a state. The expected value of U after doing A is E[U] = Σi=1..n pi U(xi).
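The formula above can be sketched directly. The states and utilities below are made-up numbers for illustration, not from the slides:

```python
# Expected utility E[U] = sum_i p_i * U(x_i) of an uncertain action A.
def expected_value(dist, utility):
    """dist: {state: probability}; utility: {state: U(state)}."""
    assert abs(sum(dist.values()) - 1.0) < 1e-9  # distribution must sum to 1
    return sum(p * utility[x] for x, p in dist.items())

# Hypothetical outcomes of action A and their utilities.
dist = {"goal": 0.7, "stuck": 0.2, "crash": 0.1}
utility = {"goal": 100.0, "stuck": 10.0, "crash": -50.0}

eu = expected_value(dist, utility)  # 0.7*100 + 0.2*10 + 0.1*(-50) = 67.0
```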
31
Toothache Example A certain dentist is only interested in two things about any patient: whether he has a toothache and whether he has a cavity. Over years of practice, she has constructed the following joint distribution:

          Toothache   ¬Toothache
Cavity      0.04        0.06
¬Cavity     0.01        0.89
32
Joint Probability Distribution
k random variables X1, …, Xk. The joint distribution of these variables is a table in which each entry gives the probability of one combination of values of X1, …, Xk. Example:

          Toothache   ¬Toothache
Cavity      0.04        0.06
¬Cavity     0.01        0.89

The top-left entry is P(Cavity ∧ Toothache); the bottom-left entry is P(¬Cavity ∧ Toothache).
33
Joint Distribution Says It All
          Toothache   ¬Toothache
Cavity      0.04        0.06
¬Cavity     0.01        0.89

P(Toothache) = P((Toothache ∧ Cavity) v (Toothache ∧ ¬Cavity)) = P(Toothache ∧ Cavity) + P(Toothache ∧ ¬Cavity) = 0.04 + 0.01 = 0.05
P(Toothache v Cavity) = P((Toothache ∧ Cavity) v (Toothache ∧ ¬Cavity) v (¬Toothache ∧ Cavity)) = 0.04 + 0.01 + 0.06 = 0.11
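The marginal and disjunction probabilities above can be read off the joint table by summation. A minimal sketch (the dict encoding of the table is ours):

```python
# Joint distribution over (cavity, toothache) truth values, from the slides.
joint = {
    (True, True): 0.04, (True, False): 0.06,
    (False, True): 0.01, (False, False): 0.89,
}

# Marginalize: sum the entries consistent with the event.
p_toothache = sum(p for (c, t), p in joint.items() if t)                 # 0.05
p_toothache_or_cavity = sum(p for (c, t), p in joint.items() if c or t)  # 0.11
```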
34
Conditional Probability
Definition: P(A ∧ B) = P(A|B) P(B). Read P(A|B) as: the probability of A given that we know B. P(A) is called the prior probability of A; P(A|B) is called the posterior or conditional probability of A given B.
35
Example
P(Cavity ∧ Toothache) = P(Cavity|Toothache) P(Toothache)
P(Cavity) = 0.1
P(Cavity|Toothache) = P(Cavity ∧ Toothache) / P(Toothache) = 0.04 / 0.05 = 0.8
36
Generalization P(A ∧ B ∧ C) = P(A|B,C) P(B|C) P(C)
37
Conditional Independence
Propositions A and B are independent iff P(A|B) = P(A), equivalently P(A ∧ B) = P(A) P(B). A and B are independent given C iff P(A|B,C) = P(A|C), equivalently P(A ∧ B|C) = P(A|C) P(B|C).
38
Conditional Independence
Let A and B be independent, i.e.: P(A|B) = P(A) and P(A ∧ B) = P(A) P(B). What about A and ¬B?
P(A|¬B) = P(A ∧ ¬B) / P(¬B)
A = (A ∧ B) v (A ∧ ¬B), so P(A) = P(A ∧ B) + P(A ∧ ¬B)
P(A ∧ ¬B) = P(A) - P(A) P(B) = P(A) (1 - P(B))
P(¬B) = 1 - P(B)
Hence P(A|¬B) = P(A): A and ¬B are independent too.
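The derivation above can be checked numerically. A minimal sketch, using arbitrary made-up values for P(A) and P(B):

```python
# If A and B are independent, then A and not-B are independent too.
p_a, p_b = 0.3, 0.6                 # arbitrary illustration values
p_a_and_b = p_a * p_b               # independence assumption
p_a_and_notb = p_a - p_a_and_b      # A = (A and B) or (A and not-B)
p_notb = 1 - p_b

p_a_given_notb = p_a_and_notb / p_notb
assert abs(p_a_given_notb - p_a) < 1e-12  # P(A | not-B) = P(A)
```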
43
Bayes' Rule
P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
P(B|A) = P(A|B) P(B) / P(A)
44
Example
Given: P(Cavity) = 0.1, P(Toothache) = 0.05, P(Cavity|Toothache) = 0.8
Bayes' rule tells us: P(Toothache|Cavity) = P(Cavity|Toothache) P(Toothache) / P(Cavity) = (0.8 × 0.05) / 0.1 = 0.4
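The computation above is a direct application of the rule. A minimal sketch (the helper name `bayes` is ours):

```python
# Bayes' rule: P(B|A) = P(A|B) P(B) / P(A).
def bayes(p_a_given_b, p_b, p_a):
    return p_a_given_b * p_b / p_a

# Slide values: reverse the conditional to get P(Toothache|Cavity).
p_toothache_given_cavity = bayes(p_a_given_b=0.8,  # P(Cavity|Toothache)
                                 p_b=0.05,         # P(Toothache)
                                 p_a=0.1)          # P(Cavity)
```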
45
Generalization
P(A ∧ B ∧ C) = P(A ∧ B|C) P(C) = P(A|B,C) P(B|C) P(C)
P(A ∧ B ∧ C) = P(A ∧ B|C) P(C) = P(B|A,C) P(A|C) P(C)
Hence P(B|A,C) = P(A|B,C) P(B|C) / P(A|C)
46
Web Size Estimation - Capture/Recapture Analysis
Consider the web page coverage of search engines a and b:
pa = probability that engine a has indexed a given page; pb likewise for engine b; pa,b the joint probability
sa = number of unique pages indexed by engine a; N = number of web pages
nb = number of documents returned by b for a query; na,b = number of documents returned by both engines a and b for that query
Under the random sampling assumption, pa can be estimated by the overlap fraction na,b / nb, giving the lower-bound estimate of the size of the Web: N ≈ sa nb / na,b
Extensions: Bayesian estimates, more engines (Bharat & Broder, WWW7 '98), etc.
Lawrence & Giles, Science '98
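The capture/recapture estimate above can be sketched in a few lines. The input numbers below are invented for illustration; they are not figures from the study:

```python
# Capture/recapture estimate of the Web's size:
# P(a) is estimated by the overlap fraction n_ab / n_b, so N ~= s_a / P(a).
def web_size_estimate(s_a, n_b, n_ab):
    """s_a: pages indexed by a; n_b: sample from b; n_ab: overlap with a."""
    return s_a * n_b / n_ab

# Hypothetical: engine a indexes 110M pages; of 1000 pages sampled from
# engine b's results, 300 are also indexed by a.
N = web_size_estimate(110e6, 1000, 300)  # roughly 367M pages
```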
47
The most common use of the term “information theory”
Shannon founded information theory with a landmark paper published in 1948. He had already founded digital computer and digital circuit design theory in 1937: as a 21-year-old master's student at MIT, he wrote a thesis demonstrating that electrical applications of Boolean algebra could construct and resolve any logical, numerical relationship. It has been claimed that this was the most important master's thesis of all time. Shannon also contributed to basic work on code breaking, and coined the term "bit".
48
Information Theory (in classical sense)
A model of the innate information content of something: documents, images, messages, DNA. Other models?
Information: that which reduces uncertainty.
Entropy: a measure of information content.
Conditional entropy: information content given a context or other information.
Formal limitations on what can be compressed, communicated, represented.
49
Claude Shannon 1948 Shannon defined information in a way not previously used. He noted that information content can depend on the probabilities of the events, not just on the number of outcomes. Uncertainty is the lack of knowledge about an outcome; entropy is a measure of that uncertainty (or randomness) in the information in a system.
50
Information Theory – another definition
Defines the amount of information in a message or document as the minimum number of bits needed to encode all possible meanings of that message, assuming all messages are equally likely. What would be the minimum message to encode the day-of-week field in a database? (3 bits, since 2^3 = 8 ≥ 7.) A type of compression!
51
Fundamental Questions Addressed by Entropy and Information Theory
What is the ultimate data compression for an information source? How much data can be sent reliably over a noisy communications channel? How accurately can we represent an object (e.g. image, etc.) as a function of the number of bits used. Good feature selection for data mining and machine learning
52
Basic Concepts in Information Theory
Entropy: Measuring uncertainty of a random variable Kullback-Leibler divergence: comparing two distributions Mutual Information: measuring the correlation of two random variables
53
Information Content I(x)
Define the amount of information gained after observing an event x with probability p(x) as I(x), where I(x) = log2(1/p(x)) = -log2 p(x). Examples: flip a coin, x = heads: p(heads) = 1/2, so I(heads) = 1 bit. Roll a die, x = 6: p(6) = 1/6, so I(6) = log2 6 ≈ 2.58 bits. More information is gained from observing a die roll than a coin flip. Why? There are more possible outcomes.
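The two examples above can be computed directly. A minimal sketch (the helper name `info_content` is ours):

```python
# Information content of an event: I(x) = -log2 p(x).
import math

def info_content(p):
    return -math.log2(p)

i_heads = info_content(1 / 2)  # coin flip lands heads: 1 bit
i_six = info_content(1 / 6)    # die roll shows a six: log2(6) ~ 2.58 bits
```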
54
Properties of Information I(x)
p(x) = 1 implies I(x) = 0: if we know with certainty the outcome of an event, there is no information gained by its occurrence.
I(x) ≥ 0: the occurrence of an event provides some or no information, but it never results in a loss of information.
I(x) > I(y) for p(x) < p(y): the less probable an event is, the more information we gain from its occurrence.
I(xy) = I(x) + I(y) for independent events x and y: information is additive.
55
Entropy H(X) Entropy H(X) of a random variable is the expectation (average) of the amount of information gained over all possible outcomes. Entropy is the average amount of uncertainty in an event; equivalently, the amount of information in a message or document. A message in which everything is known (p(x) = 1) has zero entropy.
56
Entropy: Motivation Feature selection: Text compression:
If we use only a few words to classify docs, what kind of words should we use? P(Topic| “computer”=1) vs p(Topic | “the”=1): which is more random? Text compression: Some documents (less random) can be compressed more than others (more random) Can we quantify the “compressibility”? In general, given a random variable X following distribution p(X), How do we measure the “randomness” of X? How do we design optimal coding for X?
57
Entropy: Definition
Entropy H(X) measures the uncertainty/randomness of random variable X:
H(X) = -Σx p(x) log2 p(x)
[Plot: H(X) against P(Head) for a coin toss, rising from 0 at P(Head) = 0 to a maximum of 1.0 at P(Head) = 0.5 and falling back to 0 at P(Head) = 1]
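The definition above, evaluated at a few points of the coin-toss curve, is a minimal sketch (the helper name `entropy` is ours; 0 log 0 is taken as 0 by convention):

```python
# Shannon entropy H(X) = -sum_x p(x) log2 p(x), in bits.
import math

def entropy(dist):
    """dist: list of probabilities summing to 1; terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

h_fair = entropy([0.5, 0.5])     # fair coin: 1 bit, the curve's maximum
h_biased = entropy([0.9, 0.1])   # biased coin: less than 1 bit
h_certain = entropy([1.0, 0.0])  # certain outcome: 0 bits
```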
58
Entropy: Properties Minimum value of H(X): 0
What kind of X has the minimum entropy? Maximum value of H(X): log M, where M is the number of possible values for X What kind of X has the maximum entropy? Related to coding
59
Interpretations of H(X)
Measures the "amount of information" in X. Think of each value of X as a "message"; think of X as a random experiment (20 questions). The minimum average number of bits needed to compress values of X: the more random X is, the harder it is to compress. A fair coin has the maximum information, and is hardest to compress. A biased coin has some information, and can be compressed to less than 1 bit on average. A completely biased coin has no information, and needs 0 bits.
61
Entropy as a function of probability
[Plot: H versus p(x). Maximum entropy occurs when all p(x)'s are equal!]
62
Entropy Definition
63
Entropy Facts
64
Examples of Entropy Average over all possible outcomes to calculate the entropy. If all events are equally likely, there is more entropy when more events can occur. More possible outcomes for a die than a coin, so entropy of a die roll > entropy of a coin flip.
65
Joint Entropy H(X,Y) of two random variables; note that H(X,X) = H(X).
66
Mutual Information I(x;y)
I(x;y) = H(y) - H(y|x). I(x;y) is how much information x gives about y on average.
67
Mutual Information I(x;y)
Entropy is a special case: H(x) = I(x;x). Symmetric: I(x;y) = I(y;x); the uncertainty about x removed by seeing y is the same as the uncertainty about y removed by seeing x. Nonnegative: I(x;y) ≥ 0.
68
Conditional Entropy The conditional entropy of a random variable Y given another variable X expresses how much extra information one still needs to supply on average to communicate Y, given that the other party knows X. H(Topic | "computer") vs. H(Topic | "the")?
69
Cross Entropy H(p,q) What if we encode X with a code optimized for a wrong distribution q? The expected number of bits is the cross entropy H(p,q) = -Σx p(x) log2 q(x). Intuitively H(p,q) ≥ H(p), and this holds mathematically.
70
Kullback-Leibler Divergence D(p||q)
What if we encode X with a code optimized for a wrong distribution q? How many bits would we waste? D(p||q) = Σx p(x) log2(p(x)/q(x)) = H(p,q) - H(p), also called relative entropy.
Properties:
D(p||q) ≥ 0
D(p||q) ≠ D(q||p) in general
D(p||q) = 0 iff p = q
KL-divergence is often used to measure the distance between two distributions.
Interpretation: for fixed p, D(p||q) and H(p,q) vary in the same way; if p is an empirical distribution, minimizing D(p||q) or H(p,q) is equivalent to maximizing likelihood.
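The relationship D(p||q) = H(p,q) - H(p) and the listed properties can be checked numerically. A minimal sketch over two made-up coin distributions (helper names are ours):

```python
# Cross entropy H(p,q) and KL divergence D(p||q) = H(p,q) - H(p), in bits.
import math

def cross_entropy(p, q):
    """Expected bits when p-distributed data is coded optimally for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return cross_entropy(p, q) - cross_entropy(p, p)  # H(p,q) - H(p)

p = [0.5, 0.5]  # fair coin
q = [0.9, 0.1]  # biased coin

assert kl_divergence(p, q) >= 0                      # D(p||q) >= 0
assert kl_divergence(p, p) == 0                      # 0 iff p = q
assert kl_divergence(p, q) != kl_divergence(q, p)    # not symmetric
```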
71
Cross Entropy, KL-Div, and Likelihood
Example: X ∈ {"H","T"}, observed sample Y = (H, H, T, T, H). These give equivalent criteria for selecting/evaluating a model. Perplexity(p) = 2^H(p).
72
Mutual Information I(X;Y)
Comparing two distributions: p(x,y) vs p(x)p(y). I(X;Y) = Σx,y p(x,y) log2( p(x,y) / (p(x) p(y)) ).
Properties: I(X;Y) ≥ 0; I(X;Y) = I(Y;X); I(X;Y) = 0 iff X and Y are independent.
Interpretations: measures how much the uncertainty of X is reduced given information about Y; measures the correlation between X and Y; related to the "channel capacity" in information theory.
Examples: I(Topic; "computer") vs. I(Topic; "the")? I("computer"; "program") vs. I("computer"; "baseball")?
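The definition above can be evaluated on the toothache joint distribution from the earlier slides. A minimal sketch (the dict encoding is ours):

```python
# Mutual information I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ).
import math

# Joint distribution over (Cavity, Toothache) from the earlier slides.
joint = {("cav", "ache"): 0.04, ("cav", "ok"): 0.06,
         ("nocav", "ache"): 0.01, ("nocav", "ok"): 0.89}

# Marginals p(x) and p(y).
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
assert mi > 0  # Cavity and Toothache are dependent, so I(X;Y) > 0
```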