PRIORS David Kauchak CS159 Fall 2014

Admin: Assignment 7 due Friday at 5pm. Project proposals due Thursday at 11:59pm. Grading update.

Maximum likelihood estimation: intuitive; sets the probabilities so as to maximize the probability of the training data. Problems? Overfitting! The amount of data is particularly problematic for rare events. Is our training data representative?

Basic steps for probabilistic modeling: Which model do we use, i.e. how do we calculate p(feature, label)? How do we train the model, i.e. how do we estimate the probabilities for the model? How do we deal with overfitting? Probabilistic models: Step 1: pick a model. Step 2: figure out how to estimate the probabilities for the model. Step 3 (optional): deal with overfitting.

Priors. Coin 1 data: 3 heads and 1 tail. Coin 2 data: 30 heads and 10 tails. Coin 3 data: 2 tails. Coin 4 data: 497 heads and 503 tails. If someone asked you what the probability of heads was for each of these coins, what would you say?
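
For concreteness, here is a small sketch (not part of the slides) that computes the MLE estimate of θ = P(head) for each of these coins; the counts come from the slide, everything else is illustrative.

# Minimal sketch: MLE estimate of theta = P(head) for each coin on the slide.
coins = {
    "Coin1": (3, 1),      # 3 heads, 1 tail
    "Coin2": (30, 10),    # 30 heads, 10 tails
    "Coin3": (0, 2),      # 0 heads, 2 tails
    "Coin4": (497, 503),  # 497 heads, 503 tails
}

for name, (heads, tails) in coins.items():
    theta_mle = heads / (heads + tails)   # MLE: relative frequency of heads
    print(f"{name}: theta_MLE = {theta_mle:.3f}")

# Coin1 -> 0.750, Coin2 -> 0.750, Coin3 -> 0.000, Coin4 -> 0.497
# Coin1 and Coin3 rest on very little data, yet the MLE is just as "confident"
# about them as it is about Coin4.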

Training revisited. From a probability standpoint, MLE training is selecting the Θ that maximizes the probability of the training data: Θ_MLE = argmax_Θ p(data | Θ). In other words, we pick the model parameters that make the data most likely.

Estimating revisited. We can incorporate a prior belief in what the probabilities might be! To do this, we instead maximize p(Θ | data), and we need to break this probability down (hint: Bayes rule).

Estimating revisited. Bayes rule gives p(Θ | data) = p(data | Θ) p(Θ) / p(data). What are each of these probabilities?

Priors. In this decomposition, p(data | Θ) is the likelihood of the data under the model, p(Θ) is the probability of different parameters, called the prior, and p(data) is the probability of seeing the data (regardless of model).

Priors Does p(data) matter for the argmax?
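
The answer the slide is pointing toward: no. p(data) does not depend on Θ, so it drops out of the maximization:

argmax_Θ p(Θ | data) = argmax_Θ [ p(data | Θ) p(Θ) / p(data) ] = argmax_Θ p(data | Θ) p(Θ)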

Priors. p(data | Θ) is the likelihood of the data under the model; p(Θ) is the probability of different parameters, called the prior. What does MLE assume for a prior on the model parameters?

Priors. p(data | Θ) is the likelihood of the data under the model; p(Θ) is the probability of different parameters, called the prior. MLE assumes a uniform prior, i.e. all Θ are equally likely, and so relies solely on the likelihood.
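
As a concrete illustration (a sketch, not from the slides), a Beta(α, β) prior on θ = P(head), the two-outcome case of the Dirichlet prior mentioned later, turns the MLE estimate heads / (heads + tails) into the MAP estimate (heads + α - 1) / (heads + tails + α + β - 2). With α = β = 1 (a uniform prior) this reduces to the MLE; larger α = β pulls the small-sample coins toward 0.5.

# Minimal sketch: MAP estimate of theta = P(head) under a Beta(alpha, beta) prior.
# alpha = beta = 1 is the uniform prior (recovers the MLE); larger values pull
# the estimate toward 0.5, which mostly affects the coins with little data.
def map_estimate(heads, tails, alpha=2.0, beta=2.0):
    return (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

coins = {"Coin1": (3, 1), "Coin2": (30, 10), "Coin3": (0, 2), "Coin4": (497, 503)}

for name, (heads, tails) in coins.items():
    mle = heads / (heads + tails)
    mapped = map_estimate(heads, tails, alpha=2.0, beta=2.0)
    print(f"{name}: MLE = {mle:.3f}  MAP(Beta(2,2)) = {mapped:.3f}")

# Coin1: MLE = 0.750  MAP = 0.667  (pulled toward 0.5)
# Coin3: MLE = 0.000  MAP = 0.250  (no longer exactly 0)
# Coin4: MLE = 0.497  MAP = 0.497  (lots of data, the prior barely matters)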

A better approach: for the prior p(Θ), we can use any distribution we'd like. This allows us to impart additional bias into the model.

Another view on the prior. Remember, the max is the same if we take the log: argmax_Θ p(data | Θ) p(Θ) = argmax_Θ [ log p(data | Θ) + log p(Θ) ]. We can use any distribution we'd like for the prior; this allows us to impart additional bias into the model.
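
A small numeric check (illustrative only): over a grid of θ values, the θ that maximizes the product p(data | θ) p(θ) is exactly the θ that maximizes log p(data | θ) + log p(θ), because the log is monotonic. The counts and the Beta(2, 2) prior below just reuse the running coin example.

import math

heads, tails = 3, 1          # Coin1 from the earlier slide
alpha, beta = 2.0, 2.0       # Beta(2, 2) prior, peaked at 0.5

def likelihood(theta):
    return theta ** heads * (1 - theta) ** tails

def prior(theta):
    # Beta(alpha, beta) density up to a normalizing constant,
    # which does not affect the argmax
    return theta ** (alpha - 1) * (1 - theta) ** (beta - 1)

grid = [i / 1000 for i in range(1, 1000)]   # avoid theta = 0 and theta = 1

best_product = max(grid, key=lambda t: likelihood(t) * prior(t))
best_logsum = max(grid, key=lambda t: math.log(likelihood(t)) + math.log(prior(t)))

print(best_product, best_logsum)   # both ~0.667: same argmax either way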

Prior for NB: a uniform prior corresponds to λ = 0; a Dirichlet prior corresponds to increasing λ.

Prior: another view. What happens to our likelihood if, for one of the labels, we never saw a particular feature? Under MLE, it goes to 0!

Prior: another view Adding a prior avoids this!

Smoothing: in the training data, for each label, pretend we've seen each feature/word occur in λ additional examples. This is sometimes called smoothing because it can be seen as smoothing or interpolating between the MLE and some other distribution.
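
To make this concrete, here is a minimal sketch of add-λ smoothed Naive Bayes estimates. The labels, vocabulary, and counts are made up for illustration; the point is that every feature gets λ pretend occurrences, so no estimate is ever exactly 0, and λ = 0 recovers the MLE.

from collections import Counter

# Toy word counts per label (made up for illustration).
counts = {
    "sports":   Counter({"ball": 10, "game": 7, "election": 0}),
    "politics": Counter({"ball": 1, "game": 2, "election": 9}),
}
vocab = ["ball", "game", "election"]

def p_word_given_label(word, label, lam=1.0):
    # Add-lambda smoothed estimate of p(word | label):
    # pretend we saw each vocabulary word lam extra times under this label.
    word_count = counts[label][word]
    total = sum(counts[label][w] for w in vocab)
    return (word_count + lam) / (total + lam * len(vocab))

# MLE (lam = 0): p("election" | "sports") = 0, which zeroes out the likelihood
# of any sports document that happens to contain "election".
print(p_word_given_label("election", "sports", lam=0.0))   # 0.0
# With lam = 1 the estimate is small but nonzero: (0 + 1) / (17 + 3) = 0.05
print(p_word_given_label("election", "sports", lam=1.0))   # 0.05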