ECE 5984: Introduction to Machine Learning
Dhruv Batra, Virginia Tech
Topics:
–Statistical Estimation (MLE, MAP, Bayesian)
Readings: Barber 8.6, 8.7

Administrivia
HW0
–Solutions available
HW1
–Due on Sun 02/15, 11:55pm
Project Proposal
–Due: Tue 02/24, 11:55pm
–<= 2 pages, NIPS format

Recap from last time

Procedural View
Training Stage:
–Raw Data → x (Feature Extraction)
–Training Data {(x, y)} → f (Learning)
Testing Stage:
–Raw Data → x (Feature Extraction)
–Test Data x → f(x) (Apply function, evaluate error)

Statistical Estimation View
Probabilities to the rescue:
–x and y are random variables
–D = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)} ~ P(X, Y)
IID: Independent and Identically Distributed
–Both training & testing data sampled IID from P(X, Y)
–Learn on the training set
–Have some hope of generalizing to the test set

Interpreting Probabilities
What does P(A) mean?
Frequentist View
–lim_{N → ∞} #(A is true)/N
–limiting frequency of a repeating non-deterministic event
Bayesian View
–P(A) is your "belief" about A
Market Design View
–P(A) tells you how much you would bet

Concepts
–Marginal distributions / Marginalization
–Conditional distributions / Chain Rule
–Bayes Rule
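In symbols, the three identities (standard forms, added here for reference; discrete random variables):

P(x) = \sum_y P(x, y)    (marginalization)
P(x, y) = P(x \mid y) P(y)    (chain rule)
P(y \mid x) = P(x \mid y) P(y) / \sum_{y'} P(x \mid y') P(y')    (Bayes rule)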

Concepts
Likelihood
–How well does a certain hypothesis explain the data?
Prior
–What do you believe before seeing any data?
Posterior
–What do you believe after seeing the data?
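These three combine via Bayes rule applied to parameters θ and data D (standard form, added for reference):

P(\theta \mid D) = P(D \mid \theta) P(\theta) / P(D), i.e., posterior ∝ likelihood × prior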

KL-Divergence / Relative Entropy
Slide Credit: Sam Roweis
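The slide's equations did not survive the transcript; the standard definition for discrete distributions p and q is:

\mathrm{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0, with equality iff p = q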

Plan for Today
Statistical Learning
–Frequentist Tool: Maximum Likelihood
–Bayesian Tools: Maximum A Posteriori, Bayesian Estimation
Simple examples (like coin toss)
–But the SAME concepts will apply to sophisticated problems.

Your first probabilistic learning algorithm
After taking this ML class, you drop out of VT and join an illegal betting company. Your new boss asks you:
–If Novak Djokovic & Rafael Nadal play tomorrow, will Nadal win (W) or lose (L)?
You say: what happened in the past?
–W, L, L, W, W
You say: P(Nadal Wins) = …
Why?

(Figure slide; content not captured in transcript.) Slide Credit: Yaser Abu-Mostafa

Maximum Likelihood Estimation
Goal: Find a good θ
What's a good θ?
–One that makes it likely for us to have seen this data
–Quality of θ = Likelihood(θ; D) = P(data | θ)
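For the match example with n_W wins and n_L losses, the Bernoulli likelihood and its maximizer are (standard derivation, added for reference):

L(\theta; D) = \theta^{n_W} (1 - \theta)^{n_L}, \qquad \log L = n_W \log\theta + n_L \log(1 - \theta)

Setting d(\log L)/d\theta = n_W/\theta - n_L/(1 - \theta) = 0 gives \hat{\theta}_{\mathrm{MLE}} = n_W / (n_W + n_L), i.e., 3/5 for the W, L, L, W, W record above.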

Sufficient Statistic
D_1 = {1,1,1,0,0,0}
D_2 = {1,0,1,0,1,0}
A function of the data φ(D) is a sufficient statistic if the likelihood P(D | θ) depends on D only through φ(D)
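Concretely, for Bernoulli data the counts are sufficient (a standard fact): both datasets above contain three 1s and three 0s, so they induce the same likelihood:

\phi(D) = (n_1, n_0), \qquad P(D \mid \theta) = \theta^{n_1} (1 - \theta)^{n_0} = \theta^{3} (1 - \theta)^{3} for both D_1 and D_2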

Why Max-Likelihood?
Leads to "natural" estimators
MLE is optimal if the model class is correct
–Log-likelihood is the same as cross-entropy
–Relate cross-entropy to KL
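Spelling out that connection (standard identities; \hat{p} is the empirical distribution of the data, q_\theta the model):

-\frac{1}{N} \sum_i \log q_\theta(x_i) = H(\hat{p}, q_\theta) = H(\hat{p}) + \mathrm{KL}(\hat{p} \,\|\, q_\theta)

Since H(\hat{p}) does not depend on θ, maximizing likelihood = minimizing cross-entropy = minimizing KL(\hat{p} || q_\theta).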

How many flips do I need?
Joke Credit: Carlos Guestrin
Boss says: Last year:
–3 heads/wins-for-Nadal
–2 tails/losses-for-Nadal
You say: θ = 3/5, I can prove it!
He says: What if
–30 heads/wins-for-Nadal
–20 tails/losses-for-Nadal
You say: Same answer, I can prove it!
He says: What's better?
You say: Hmm… The more the merrier???
He says: Is this why I am paying you the big bucks???
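One way to make "the more the merrier" precise is a concentration bound such as Hoeffding's inequality (a standard result, added here for context): for N IID flips,

P\big( |\hat{\theta}_{\mathrm{MLE}} - \theta^{*}| \ge \varepsilon \big) \le 2 e^{-2 N \varepsilon^{2}}

So 30/20 yields the same point estimate as 3/2, but a much tighter guarantee around the true θ*.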

Bayesian Estimation
Boss says: What if I know Nadal is a better player on clay courts?
You say: Bayesian it is, then…

Priors
What are priors?
–Express beliefs before experiments are conducted
–Computational ease: lead to "good" posteriors
–Help deal with unseen data
–Regularizers: more about this in later lectures
Conjugate Priors
–A prior is conjugate to a likelihood if the resulting posterior is in the same family as the prior
–Closed-form representation of the posterior

Beta prior distribution – P(θ)
Demo: –
Slide Credit: Carlos Guestrin
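The Beta density itself (standard form; B(α, β) is the beta function):

P(\theta) = \mathrm{Beta}(\theta; \alpha, \beta) = \frac{\theta^{\alpha-1} (1-\theta)^{\beta-1}}{B(\alpha, \beta)}, \qquad \theta \in [0, 1]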

Benefits of conjugate priors
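The slide's derivation is not in the transcript; for the Beta–Bernoulli pair the key benefit is the closed-form posterior update (a standard result):

\mathrm{Beta}(\theta; \alpha, \beta) \times \theta^{n_H} (1-\theta)^{n_T} \;\propto\; \theta^{\alpha + n_H - 1} (1-\theta)^{\beta + n_T - 1} = \mathrm{Beta}(\theta; \alpha + n_H, \beta + n_T)

The posterior stays a Beta, so conditioning on new data just increments the counts.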

MAP for Beta distribution
MAP: use the most likely parameter (see the formula below)
–Beta prior is equivalent to extra W/L matches
–As N → ∞, the prior is "forgotten"
–But for small sample sizes, the prior is important!
Slide Credit: Carlos Guestrin
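The formula the slide points to is the mode of the Beta posterior (standard result; valid when α + n_H > 1 and β + n_T > 1):

\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid D) = \frac{\alpha + n_H - 1}{\alpha + \beta + n_H + n_T - 2}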

Effect of Prior
Prior = Beta(2,2)
–θ_prior = 0.5
Dataset = {H}
–L(θ) = θ
–θ_MLE = 1
Posterior = Beta(3,2)
–θ_MAP = (3-1)/(3+2-2) = 2/3
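A minimal Python sketch of this example (assuming only NumPy and SciPy; the prior, data, and numbers match the slide):

```python
import numpy as np
from scipy import stats

# Prior: Beta(alpha=2, beta=2), centered at theta = 0.5
alpha, beta = 2.0, 2.0

# Dataset: a single heads/win, so L(theta) = theta
n_heads, n_tails = 1, 0

# MLE: empirical fraction of heads -> 1/1 = 1
theta_mle = n_heads / (n_heads + n_tails)

# Conjugate update: posterior = Beta(alpha + n_heads, beta + n_tails) = Beta(3, 2)
post_a, post_b = alpha + n_heads, beta + n_tails

# MAP: mode of the Beta posterior, (a - 1) / (a + b - 2) -> 2/3
theta_map = (post_a - 1) / (post_a + post_b - 2)

# Bayesian point summary: posterior mean, a / (a + b) -> 3/5
theta_mean = post_a / (post_a + post_b)

print(f"MLE            = {theta_mle:.3f}")    # 1.000
print(f"MAP            = {theta_map:.3f}")    # 0.667
print(f"Posterior mean = {theta_mean:.3f}")   # 0.600

# Sanity check against scipy's Beta distribution
assert np.isclose(stats.beta(post_a, post_b).mean(), theta_mean)
```

The MAP estimate sits between the MLE (1) and the prior mode (0.5), which is exactly the "prior as regularizer" effect from the Priors slide.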

What you need to know
Statistical Learning:
–Maximum likelihood; why MLE?
–Sufficient statistics
–Maximum a posteriori
–Bayesian estimation (returns an entire distribution)
–Priors, posteriors, conjugate priors
–Beta distribution (conjugate of the Bernoulli)