PROBABILISTIC MODELS David Kauchak CS451 – Fall 2013

Admin Assignment 6 - L2 normalization constant should be 2 (not 1) - Just a handful of changes to the Perceptron code!

Probabilistic Modeling (diagram: training data → train → probabilistic model) Model the data with a probabilistic model; specifically, learn p(features, label). p(features, label) tells us how likely this combination of features and label is.

An example: classifying fruit. Training data (examples and labels):
red, round, leaf, 3oz, …          apple
green, round, no leaf, 4oz, …     apple
yellow, curved, no leaf, 4oz, …   banana
green, curved, no leaf, 5oz, …    banana
train → probabilistic model: p(features, label)

Probabilistic models Probabilistic models define a probability distribution over features and labels, p(features, label). For example: p(yellow, curved, no leaf, 6oz, banana) = 0.004

Probabilistic model vs. classifier. Probabilistic model: yellow, curved, no leaf, 6oz, banana → p(features, label) → a probability. Classifier: yellow, curved, no leaf, 6oz → p(features, label) → banana

Probabilistic models: classification. Probabilistic models define a probability distribution over features and labels, p(features, label), e.g. p(yellow, curved, no leaf, 6oz, banana). How do we use a probabilistic model for classification/prediction? Given an unlabeled example, yellow, curved, no leaf, 6oz, predict the label.

Probabilistic models Probabilistic models define a probability distribution over features and labels, p(features, label). For each label, ask for the probability under the model, e.g. p(yellow, curved, no leaf, 6oz, banana) and p(yellow, curved, no leaf, 6oz, apple). Pick the label with the highest probability.
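To make the prediction rule concrete, here is a minimal Python sketch (not from the slides): a hypothetical probabilistic model stored as a lookup table over (features, label) pairs, with prediction picking the label that gives the highest joint probability. All probability values below are made up for illustration.

# Toy "probabilistic model": a lookup table mapping (features, label) -> probability.
toy_model = {
    (("yellow", "curved", "no leaf", "6oz"), "banana"): 0.004,
    (("yellow", "curved", "no leaf", "6oz"), "apple"): 0.00002,
}

def predict(model, features, labels):
    """Return the label that makes p(features, label) largest under the model."""
    return max(labels, key=lambda label: model.get((features, label), 0.0))

print(predict(toy_model, ("yellow", "curved", "no leaf", "6oz"), ["apple", "banana"]))
# -> banana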

Probabilistic model vs. classifier. Probabilistic model: yellow, curved, no leaf, 6oz, banana → p(features, label) → a probability. Classifier: yellow, curved, no leaf, 6oz → p(features, label) → banana. Why probabilistic models?

Probabilistic models
Probabilities are nice to work with:
- range between 0 and 1
- can combine them in a well understood way
- lots of mathematical background/theory
- an aside: to get the benefit of probabilistic output, you can sometimes calibrate the confidence output of a non-probabilistic classifier
Provide a strong, well-founded groundwork:
- allow us to make clear decisions about things like regularization
- tend to be much less “heuristic” than the models we’ve seen
- different models have very clear meanings

Probabilistic models: big questions. Which model do we use, i.e. how do we calculate p(features, label)? How do we train the model, i.e. how do we estimate the probabilities for the model? How do we deal with overfitting?

Same problems we’ve been dealing with so far. Probabilistic models: Which model do we use, i.e. how do we calculate p(features, label)? How do we train the model, i.e. how do we estimate the probabilities for the model? How do we deal with overfitting? ML in general: Which model do we use (decision tree, linear model, non-parametric)? How do we train the model? How do we deal with overfitting?

Basic steps for probabilistic modeling. The big questions: Which model do we use, i.e. how do we calculate p(features, label)? How do we train the model, i.e. how do we estimate the probabilities for the model? How do we deal with overfitting? The corresponding steps: Step 1: pick a model. Step 2: figure out how to estimate the probabilities for the model. Step 3 (optional): deal with overfitting.

What was the data generating distribution? (diagram: data generating distribution → training data, test set)

Step 1: picking a model. What we’re really trying to do is model the data generating distribution, that is, how likely the feature/label combinations are.

Some maths: p(features, label) = p(x1, x2, …, xm, y) = p(y) p(x1, x2, …, xm | y). What rule?

Some maths: p(y) p(x1, x2, …, xm | y) = p(y) p(x1 | y) p(x2 | y, x1) … p(xm | y, x1, …, xm-1) (the chain rule)
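The chain rule above can be checked numerically. The sketch below (a Python illustration, not from the slides) builds a small random joint distribution over two binary features and a binary label and verifies p(x1, x2, y) = p(y) p(x1|y) p(x2|y, x1) for every entry.

# Numerical check of the chain rule on a tiny, made-up joint distribution
# over two binary features (x1, x2) and a binary label y.
import itertools
import random

random.seed(0)
weights = {combo: random.random() for combo in itertools.product([0, 1], repeat=3)}
total = sum(weights.values())
p = {combo: w / total for combo, w in weights.items()}  # joint p(x1, x2, y)

def marginal(fixed):
    """Sum the joint over all entries consistent with the fixed positions."""
    return sum(prob for combo, prob in p.items()
               if all(combo[i] == v for i, v in fixed.items()))

for (x1, x2, y), joint in p.items():
    p_y = marginal({2: y})
    p_x1_given_y = marginal({0: x1, 2: y}) / p_y
    p_x2_given_x1_y = joint / marginal({0: x1, 2: y})
    # chain rule: p(x1, x2, y) = p(y) p(x1|y) p(x2|x1, y)
    assert abs(joint - p_y * p_x1_given_y * p_x2_given_x1_y) < 1e-12
print("chain rule holds for every entry")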

Step 1: pick a model. So far, we have made NO assumptions about the data. How many entries would the probability distribution table have if we tried to represent all possible values (e.g. for the wine data set)?

Full distribution tables
x1  x2  x3  …  y   p(·)
0   0   0   …  0   *
0   0   0   …  1   *
1   0   0   …  0   *
1   0   0   …  1   *
0   1   0   …  0   *
0   1   0   …  1   *
…
Wine problem: all possible combinations of features, ~7000 binary features. Sample space size: 2^7000 = ?

Any problems with this?

Full distribution tables
x1  x2  x3  …  y   p(·)
0   0   0   …  0   *
0   0   0   …  1   *
1   0   0   …  0   *
1   0   0   …  1   *
0   1   0   …  0   *
0   1   0   …  1   *
…
- Storing a table of that size is impossible
- How are we supposed to learn/estimate each entry in the table?
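For a sense of scale, a quick back-of-the-envelope computation (a Python sketch, assuming ~7000 binary features plus a binary label as in the table above):

# Rough size of the full joint table: one entry per feature/label combination,
# assuming ~7000 binary features plus a binary label.
num_features = 7000                   # assumed, per the wine example above
num_entries = 2 ** (num_features + 1)
print(len(str(num_entries)))          # number of decimal digits: 2108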

Step 1: pick a model. So far, we have made NO assumptions about the data. Model selection involves making assumptions about the data. We did this before, e.g. assuming the data is linearly separable. These assumptions allow us to represent the data more compactly and to estimate the parameters of the model.

An aside: independence. Two variables are independent if one has nothing whatever to do with the other. For two independent variables, knowing the value of one does not change the probability distribution of the other variable (or the probability of any individual event): - the result of a coin toss is independent of a roll of a die - the price of tea in England is independent of whether or not you pass AI

Independent or dependent? Catching a cold and having a cat allergy. Miles per gallon and driving habits. Height and length of life.

Independent variables. How does independence affect our probability equations/properties? If A and B are independent (written A ⊥ B): P(A,B) = ? P(A|B) = ? P(B|A) = ?

Independent variables. How does independence affect our probability equations/properties? If A and B are independent (written A ⊥ B): P(A,B) = P(A)P(B), P(A|B) = P(A), P(B|A) = P(B). How does independence help us?

Independent variables. If A and B are independent: P(A,B) = P(A)P(B), P(A|B) = P(A), P(B|A) = P(B). This reduces the storage requirement for the distributions, reduces the complexity of the distribution, and reduces the number of probabilities we need to estimate.
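A small sketch of the savings (Python, with arbitrary variable counts): a full joint over n binary variables needs 2^n - 1 probabilities, while n mutually independent binary variables need only n.

# Parameters needed to store a distribution over n binary variables.
def full_joint_params(n):
    return 2 ** n - 1   # one probability per outcome, minus 1 (they sum to 1)

def independent_params(n):
    return n            # one p(X_i = 1) per variable

for n in [3, 10, 30]:
    print(n, full_joint_params(n), independent_params(n))
# 3 7 3
# 10 1023 10
# 30 1073741823 30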

Conditional Independence. Dependent events can become independent given certain other events. Examples: height and length of life; “correlation” studies, e.g. size of your lawn and length of life. If A and B are conditionally independent given C: P(A,B|C) = P(A|C)P(B|C), P(A|B,C) = P(A|C), P(B|A,C) = P(B|C), but P(A,B) ≠ P(A)P(B).
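The sketch below (a Python illustration with made-up numbers) constructs a distribution in which A and B are conditionally independent given C, and checks that P(A,B|C) = P(A|C)P(B|C) holds while P(A,B) ≠ P(A)P(B).

# A constructed example: A and B are independent given C, but NOT marginally.
import itertools

p_c = {0: 0.5, 1: 0.5}
p_a_given_c = {0: 0.9, 1: 0.1}   # p(A=1 | C=c), made-up numbers
p_b_given_c = {0: 0.8, 1: 0.2}   # p(B=1 | C=c), made-up numbers

def bern(p1, v):
    return p1 if v == 1 else 1 - p1

# Build the joint p(a, b, c) = p(c) p(a|c) p(b|c)
joint = {(a, b, c): p_c[c] * bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)
         for a, b, c in itertools.product([0, 1], repeat=3)}

p_ab = lambda a, b: sum(joint[(a, b, c)] for c in [0, 1])
p_a = lambda a: sum(p_ab(a, b) for b in [0, 1])
p_b = lambda b: sum(p_ab(a, b) for a in [0, 1])

# Conditional independence holds by construction: p(a,b|c) = p(a|c) p(b|c)
for a, b, c in itertools.product([0, 1], repeat=3):
    assert abs(joint[(a, b, c)] / p_c[c]
               - bern(p_a_given_c[c], a) * bern(p_b_given_c[c], b)) < 1e-12

# ...but marginally A and B are dependent: p(a,b) != p(a) p(b)
print(p_ab(1, 1), p_a(1) * p_b(1))   # roughly 0.37 vs 0.25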

Naïve Bayes assumption: p(xi | y, x1, x2, …, xi-1) = p(xi | y). What does this assume?

Naïve Bayes assumption. Assumes feature i is independent of the other features given the label. For the wine problem?

Naïve Bayes assumption. Assumes feature i is independent of the other features given the label. Assumes the probability of a word occurring in a review is independent of the other words given the label. For example, the probability of “pinot” occurring is independent of whether or not “wine” occurs, given that the review is about “chardonnay”. Is this assumption true?

Naïve Bayes assumption. For most applications, this is not true! For example, the fact that “pinot” occurs will probably make it more likely that “noir” occurs (or take a compound phrase like “San Francisco”). However, this is often a reasonable approximation: p(xi | y, x1, …, xi-1) ≈ p(xi | y).

Naïve Bayes model: p(features, label) = p(y) p(x1|y) p(x2|y) … p(xm|y) (naïve Bayes assumption). p(xi|y) is the probability of a particular feature value given the label. How do we model this? - for binary features - for discrete features, i.e. counts - for real valued features

p(x|y). Binary features: p(xi = 1 | y) = θi and p(xi = 0 | y) = 1 - θi (a biased coin toss!). Other features: could use a lookup table for each value, but that doesn’t generalize well. Better: model as a distribution: - gaussian (i.e. normal) distribution - poisson distribution - multinomial distribution (more on this later) - …
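As an illustration of two of these modeling choices, here is a minimal Python sketch (not from the slides, parameter values made up): a biased coin for a binary feature and a Gaussian for a real-valued feature.

# Sketch of p(x_i | y) under two modeling choices; the parameter values are
# made up for illustration.
import math

# Binary feature: a biased coin with theta = p(x_i = 1 | y)
def p_binary(x, theta):
    return theta if x == 1 else 1 - theta

# Real-valued feature: a Gaussian with a per-label mean and variance
def p_gaussian(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# e.g. p("curved" = 1 | banana) with an assumed theta of 0.9,
# and p(weight = 6.0 | banana) with an assumed mean of 5.5oz and variance 1.0
print(p_binary(1, 0.9))          # 0.9
print(p_gaussian(6.0, 5.5, 1.0)) # roughly 0.35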

Basic steps for probabilistic modeling. The big questions: Which model do we use, i.e. how do we calculate p(features, label)? How do we train the model, i.e. how do we estimate the probabilities for the model? How do we deal with overfitting? The corresponding steps: Step 1: pick a model. Step 2: figure out how to estimate the probabilities for the model. Step 3 (optional): deal with overfitting.

Obtaining probabilities
We’ve talked a lot about probabilities, but not where they come from.
- How do we calculate p(xi|y) from training data?
- What is the probability of surviving the Titanic?
- What is the probability that any review is about Pinot Noir?
- What is the probability that a particular review is about Pinot Noir?

Obtaining probabilities (diagram: training data → train → probabilistic model)

Estimating probabilities. What is the probability of a pinot noir review? We don’t know! We can estimate it based on data, though: p(pinot noir) ≈ (number of reviews labeled pinot noir) / (total number of reviews). This is called the maximum likelihood estimate. Why?
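A tiny Python sketch of this count-based estimate, using made-up review labels:

# Maximum likelihood estimate of p(label) from a toy list of review labels
# (the labels themselves are made up for illustration).
from collections import Counter

labels = ["pinot noir", "chardonnay", "pinot noir", "zinfandel",
          "pinot noir", "chardonnay"]

counts = Counter(labels)
p_pinot = counts["pinot noir"] / len(labels)   # reviews labeled pinot noir / total reviews
print(p_pinot)   # 0.5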

Maximum Likelihood Estimation (MLE) Maximum likelihood estimation picks the values for the model parameters that maximize the likelihood of the training data You flip a coin 100 times. 60 times you get heads. What is the MLE for heads? p(head) = 0.60

Maximum Likelihood Estimation (MLE) Maximum likelihood estimation picks the values for the model parameters that maximize the likelihood of the training data You flip a coin 100 times. 60 times you get heads. What is the likelihood of the data under this model (each coin flip is a data point)?

MLE example You flip a coin 100 times. 60 times you get heads. MLE for heads: p(head) = 0.60 What is the likelihood of the data under this model (each coin flip is a data point)? log(0.6^60 * 0.4^40) = -67.3

MLE example Can we do any better? p(heads) = 0.5: log(0.5^60 * 0.5^40) = -69.3 p(heads) = 0.7: log(0.7^60 * 0.3^40) = -69.5
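These numbers can be reproduced with a short Python sketch (natural log), which also confirms that θ = 0.6 maximizes the likelihood over a fine grid of candidate values:

# Log-likelihood of 60 heads and 40 tails under a coin with bias theta.
import math

def log_likelihood(theta, heads=60, tails=40):
    return heads * math.log(theta) + tails * math.log(1 - theta)

for theta in [0.5, 0.6, 0.7]:
    print(theta, round(log_likelihood(theta), 2))
# 0.5 -69.31
# 0.6 -67.3
# 0.7 -69.56

# Grid search over theta in (0, 1): the maximizer is the MLE, 60/100 = 0.6
best = max((t / 1000 for t in range(1, 1000)), key=log_likelihood)
print(best)   # 0.6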