Machine Learning 5. Parametric Methods.


Machine Learning 5. Parametric Methods

Parametric Methods We need probabilities (prior, evidence, likelihood) to make decisions, and each probability is a function of the input (the observables). We represent this function by selecting its general form (a model) with a few unknown parameters, and then find (estimate) those parameters from data so that they optimize a chosen criterion, e.g. minimize the generalization error. (Lecture notes based on E. Alpaydın, Introduction to Machine Learning, © 2004 The MIT Press, V1.1.)

Parametric Estimation Assume the sample comes from a distribution that is known up to its parameters. Sufficient statistics: parameters that completely define the distribution (e.g. mean and variance). X = {x^t}_t where x^t ~ p(x). Parametric estimation: assume a form for p(x | θ) and estimate θ, its sufficient statistics, using X; e.g. N(μ, σ²) where θ = {μ, σ²}.

Estimation Assume a form for the distribution, estimate its sufficient statistics (parameters) from a sample, and then use the fitted distribution for classification or regression.

Maximum Likelihood Estimation X consists of independent and identically distributed (iid) samples. Likelihood of θ given the sample X: l(θ | X) = p(X | θ) = ∏_t p(x^t | θ). Log likelihood: L(θ | X) = log l(θ | X) = ∑_t log p(x^t | θ). Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ | X).

Why use the log likelihood? The log is an increasing function (a larger input gives a larger output), so maximizing the log of a function is equivalent to maximizing the function itself. The log also converts products into sums, log(abc) = log(a) + log(b) + log(c), which makes analysis and computation simpler.
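To see the numerical benefit, here is a minimal NumPy sketch (the sample, parameter values and variable names are made up for illustration): the product of many small densities underflows to zero, while the sum of their logs stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=2000)      # iid sample, assumed drawn from N(1, 2^2)

mu, sigma = 1.0, 2.0                               # candidate parameters theta = (mu, sigma)
densities = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

likelihood = np.prod(densities)                    # product of 2000 small numbers: underflows to 0.0
log_likelihood = np.sum(np.log(densities))         # sum of logs: finite and easy to maximize
print(likelihood, log_likelihood)
```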

Examples: Bernoulli/Multinomial Bernoulli: two states, failure/success, x ∈ {0, 1}, with P(x) = p_o^x (1 − p_o)^(1−x). Setting dL(p_o | X)/dp_o = 0 gives the MLE p̂_o = ∑_t x^t / N, the fraction of successes in the sample.
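A minimal sketch of this estimator on simulated coin flips (the true probability 0.3 and the sample size are arbitrary choices): the MLE is simply the proportion of successes.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.binomial(n=1, p=0.3, size=1000)   # iid Bernoulli sample with (assumed) true p_o = 0.3

p_hat = x.sum() / len(x)                  # MLE: sum_t x^t / N, the fraction of successes
print(p_hat)                              # should be close to 0.3
```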

Examples: Multinomial The multinomial generalizes the Bernoulli: the outcome is one of K mutually exclusive states, state i occurring with probability p_i. P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i), and L(p_1, p_2, ..., p_K | X) = log ∏_t ∏_i p_i^(x_i^t). MLE: p̂_i = ∑_t x_i^t / N.
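A sketch of the multinomial MLE on synthetic categorical data (K, N and the true probabilities are invented for the example): each observation is one-hot encoded and the estimate is the vector of state frequencies.

```python
import numpy as np

rng = np.random.default_rng(2)
K, N = 3, 1000
true_p = np.array([0.2, 0.5, 0.3])            # assumed true state probabilities

states = rng.choice(K, size=N, p=true_p)      # N categorical outcomes
X = np.eye(K)[states]                         # one-hot: x_i^t = 1 if state i occurred at trial t

p_hat = X.sum(axis=0) / N                     # MLE: p_i = sum_t x_i^t / N
print(p_hat)                                  # close to [0.2, 0.5, 0.3]
```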

Gaussian (Normal) Distribution p(x) = N(μ, σ²), i.e. p(x) = (1 / (√(2π) σ)) exp[−(x − μ)² / (2σ²)]. Given a sample X = {x^t}, the maximum likelihood estimates are m = ∑_t x^t / N for the mean and s² = ∑_t (x^t − m)² / N for the variance.
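A small sketch of the Gaussian MLE (the true mean 5 and standard deviation 1.5 are made-up values): m is the sample mean and s² is the average squared deviation from m.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=1.5, size=500)   # sample assumed drawn from N(5, 1.5^2)

m = x.mean()                                    # MLE of the mean:     m   = sum_t x^t / N
s2 = ((x - m) ** 2).mean()                      # MLE of the variance: s^2 = sum_t (x^t - m)^2 / N
print(m, s2)                                    # close to 5 and 1.5^2 = 2.25
```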

Bias and Variance of an Estimator The actual value of an estimator depends on the data: different samples of size N drawn from the true distribution give different values of the estimator, so the value of an estimator is a random variable and we can ask about its mean and variance. The difference between the true value of the parameter and the mean of the estimator is the bias of the estimator. We usually look for a formula that gives an unbiased estimator with small variance.

Bias and Variance Unknown parameter θ; estimator d_i = d(X_i) on sample X_i. Bias: b_θ(d) = E[d] − θ. Variance: E[(d − E[d])²]. Mean squared error: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = bias² + variance.
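These definitions can be checked with a small simulation sketch, here using the sample mean as the estimator d(X) and made-up values for θ, σ and N: bias, variance and mean squared error are approximated by repeating the experiment many times.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, N, runs = 2.0, 1.0, 20, 10_000      # true parameter, noise std, sample size, repetitions

# value of the estimator d(X_i) = mean of sample X_i, for many independent samples
d = np.array([rng.normal(theta, sigma, N).mean() for _ in range(runs)])

bias = d.mean() - theta                           # b_theta(d) = E[d] - theta
variance = ((d - d.mean()) ** 2).mean()           # E[(d - E[d])^2]
mse = ((d - theta) ** 2).mean()                   # r(d, theta) = E[(d - theta)^2]
print(bias, variance, mse, bias ** 2 + variance)  # mse ≈ bias^2 + variance
```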

Mean Estimator The sample mean m = ∑_t x^t / N has E[m] = μ, so it is an unbiased estimator of the mean, and its variance Var(m) = σ²/N goes to zero as N grows.

Variance Estimator The maximum likelihood variance s² = ∑_t (x^t − m)² / N has E[s²] = ((N − 1)/N) σ² ≠ σ², so it is a biased estimator; multiplying by N/(N − 1) (i.e. dividing by N − 1 instead of N) gives an unbiased estimate. The bias vanishes as N grows, so s² is asymptotically unbiased.
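A quick simulation sketch of this bias (σ², N and the number of repetitions are arbitrary): averaged over many samples, s² comes out close to ((N − 1)/N)σ², and the corrected version close to σ².

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2, N, runs = 4.0, 10, 20_000

s2 = np.empty(runs)
for i in range(runs):
    x = rng.normal(0.0, np.sqrt(sigma2), N)
    s2[i] = ((x - x.mean()) ** 2).mean()      # maximum likelihood variance (divides by N)

print(s2.mean())                              # ≈ (N - 1)/N * sigma2 = 3.6, i.e. biased low
print(N / (N - 1) * s2.mean())                # ≈ 4.0 after the correction
```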

Optimal Estimators Estimators are random variables because they depend on a random sample from the distribution. We look for estimators (functions of the sample) with minimum expected squared error, which decomposes into bias and variance. Sometimes we accept a larger error but require the estimator to be unbiased.

Bayes’ Estimator Treat θ as a random variable with prior p(θ). Bayes’ rule: p(θ | X) = p(X | θ) p(θ) / p(X). Full density: p(x | X) = ∫ p(x | θ) p(θ | X) dθ. Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ | X). Maximum likelihood (ML): θ_ML = argmax_θ p(X | θ). Bayes’ estimator: θ_Bayes = E[θ | X] = ∫ θ p(θ | X) dθ.

Bayes Estimator Sometimes we have prior information on the range of values the parameter may take. We cannot give its exact value, but we can assign a probability to each value (a density function): the probability of the parameter before looking at the data.

Bayes Estimator Assume the posterior density has a sharp peak around the true value of the parameter. Then using the maximum a posteriori (MAP) estimate, the value at which the posterior peaks, keeps the error small.

Bayes’ Estimator: Example x^t ~ N(θ, σ_o²) and the prior is θ ~ N(μ, σ²). Then θ_ML = m, and θ_MAP = θ_Bayes = [(N/σ_o²) / (N/σ_o² + 1/σ²)] m + [(1/σ²) / (N/σ_o² + 1/σ²)] μ, a weighted average of the sample mean m and the prior mean μ; as N grows, the estimate moves toward m.
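A sketch of this example with invented numbers for θ, σ_o, the prior (μ, σ) and N: the MAP/Bayes estimate is a precision-weighted average of the sample mean and the prior mean.

```python
import numpy as np

rng = np.random.default_rng(6)
theta_true, sigma_o = 3.0, 2.0     # data model:  x^t ~ N(theta, sigma_o^2)
mu, sigma = 0.0, 1.0               # prior:       theta ~ N(mu, sigma^2)
N = 25

x = rng.normal(theta_true, sigma_o, N)
m = x.mean()                                                  # theta_ML

w = (N / sigma_o ** 2) / (N / sigma_o ** 2 + 1 / sigma ** 2)  # weight on the data
theta_map = w * m + (1 - w) * mu                              # theta_MAP = theta_Bayes
print(m, theta_map)                                           # MAP is pulled toward the prior mean
```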

Parametric Classification By Bayes’ rule, the posterior is P(C_i | x) = p(x | C_i) P(C_i) / p(x), so we can use the discriminant g_i(x) = p(x | C_i) P(C_i), or equivalently g_i(x) = log p(x | C_i) + log P(C_i).

Assumptions Assume the class-conditional densities are Gaussian, p(x | C_i) ~ N(μ_i, σ_i²). The discriminant then becomes g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i).

Example Input: customer income. Output: which of K car models the customer will prefer. Assume that each model appeals to a group of customers around a certain income level, and that income within a single class is normally distributed.

Example: True distributions (figure).

Example: Estimate parameters From the labelled sample, estimate the prior, mean and variance of each class: P̂(C_i) = ∑_t r_i^t / N, m_i = ∑_t x^t r_i^t / ∑_t r_i^t, s_i² = ∑_t (x^t − m_i)² r_i^t / ∑_t r_i^t, where r_i^t = 1 if x^t belongs to class C_i and 0 otherwise.

Example: Discriminant Evaluate the discriminant of each class on a new input and select the class with the largest value. If the priors and variances are assumed equal, the discriminant reduces to g_i(x) = −(x − m_i)², i.e. choose the class whose mean is nearest to the input.
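The whole example can be sketched on synthetic one-dimensional data (the class means, variances, priors and sample sizes are all invented): estimate the per-class parameters from a labelled sample and pick the class whose Gaussian log-discriminant is largest.

```python
import numpy as np

rng = np.random.default_rng(7)
# synthetic labelled sample: income x^t with class label 0 or 1
x = np.concatenate([rng.normal(20, 5, 300), rng.normal(45, 10, 200)])
y = np.concatenate([np.zeros(300, dtype=int), np.ones(200, dtype=int)])

classes = np.unique(y)
priors = np.array([(y == c).mean() for c in classes])   # P(C_i)
means  = np.array([x[y == c].mean() for c in classes])  # m_i
vars_  = np.array([x[y == c].var()  for c in classes])  # s_i^2 (MLE variance)

def discriminants(x_new):
    # g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i), constant terms dropped
    return -0.5 * np.log(vars_) - (x_new - means) ** 2 / (2 * vars_) + np.log(priors)

x_new = 30.0
print(discriminants(x_new), discriminants(x_new).argmax())   # choose the class with largest g_i
```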

Boundary between classes (figure).

Equal variances: a single boundary, halfway between the two means.

Variances are different: two boundaries.

Two approaches Likelihood-based approach (used so far): estimate the probabilities using Bayes’ rule and compute the discriminant function from them. Discriminant-based approach (covered later): estimate the discriminant, i.e. the boundary between classes, directly, bypassing the estimation of probabilities.

Regression x is the independent variable and r is the dependent variable. The underlying function f is unknown; we want to approximate it in order to predict future values. Parametric approach: assume a model with a small number of parameters, y = g(x | θ), and find the best parameters from data. We also have to make an assumption about the noise.

Regression We have training data (x, r) and find the parameters that maximize the likelihood, in other words the parameters that make the data most probable.

Regression The log likelihood of the sample splits as L(θ | X) = log ∏_t p(x^t, r^t) = ∑_t log p(r^t | x^t) + ∑_t log p(x^t). We can ignore the last term, since it does not depend on the parameters.

Regression Assuming Gaussian noise, the remaining part becomes L(θ | X) = −N log(√(2π) σ) − (1/(2σ²)) ∑_t [r^t − g(x^t | θ)]², so maximizing the likelihood amounts to minimizing the last term, the sum of squared errors.

Least Squares Estimate Minimize the squared error E(θ | X) = (1/2) ∑_t [r^t − g(x^t | θ)]².

Linear Regression Assume a linear model, g(x | w_1, w_0) = w_1 x + w_0. To minimize the squared error, set the derivatives with respect to w_0 and w_1 to zero; this gives two linear equations in two unknowns, which can be solved easily.
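A sketch of the closed-form solution (the true slope 3, intercept 1 and noise level are made up): write the two normal equations as a 2 × 2 linear system and solve it.

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 50)
r = 3.0 * x + 1.0 + rng.normal(0, 2.0, 50)       # data from an assumed line plus Gaussian noise

N = len(x)
# normal equations from setting the derivatives of the squared error to zero:
# [ N       sum x   ] [w0]   [ sum r   ]
# [ sum x   sum x^2 ] [w1] = [ sum x*r ]
A = np.array([[N, x.sum()], [x.sum(), (x ** 2).sum()]])
b = np.array([r.sum(), (x * r).sum()])
w0, w1 = np.linalg.solve(A, b)
print(w1, w0)                                     # close to the true slope 3 and intercept 1
```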

Polynomial Regression Generalize to a polynomial of degree k: g(x | w_k, ..., w_1, w_0) = w_k x^k + ... + w_1 x + w_0. Writing the powers of the inputs as a matrix D, the least squares solution is w = (DᵀD)⁻¹ Dᵀ r.
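A sketch of a degree-k fit on synthetic data (the sine target, noise level and degree are arbitrary): build the matrix of input powers and solve the least squares problem.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(-3, 3, 60)
r = np.sin(x) + rng.normal(0, 0.2, 60)        # an assumed underlying function plus noise

k = 3                                         # polynomial degree
D = np.vander(x, k + 1)                       # columns x^k, ..., x, 1
w, *_ = np.linalg.lstsq(D, r, rcond=None)     # least squares: w = (D^T D)^-1 D^T r
print(w)                                      # coefficients w_k, ..., w_1, w_0
```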

Bias and Variance The expected squared error at x decomposes as E[(r − g(x))² | x] = E[(r − E[r | x])² | x] + (E[r | x] − g(x))², i.e. noise plus squared error of the estimate. Taking the expectation over training samples X, the second part further splits: E_X[(E[r | x] − g(x))²] = (E[r | x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²], i.e. bias² + variance.

Bias and Variance Given a single example (x, r), what is the expected error? Variation comes from two sources: the noise and the particular training sample. The first term is due to noise; it does not depend on the estimator and cannot be removed.

Variance The second term is the deviation of the estimator from the regression function; it depends on the estimator and on the training set, so we average it over all possible training samples.

Example: polynomial regression As we increase the degree of the polynomial, the bias decreases, since the model can fit the points better, while the variance increases, since a small change in the training sample can produce a large change in the fitted parameters. This bias/variance dilemma holds for any machine learning system, so we need a way to find the model complexity that best balances bias and variance.
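The trade-off can be made visible with a simulation sketch (the sine target, noise level, sample size and degrees are all arbitrary choices): fit the same model on many random samples and measure the squared bias and the variance of the fits.

```python
import numpy as np

rng = np.random.default_rng(10)
f = np.sin                                      # assumed true regression function
x_eval = np.linspace(-3, 3, 50)                 # points where bias and variance are measured
runs, N, noise = 200, 25, 0.3

for degree in (1, 3, 9):
    fits = np.empty((runs, len(x_eval)))
    for i in range(runs):                       # refit on many independent training samples
        x = rng.uniform(-3, 3, N)
        r = f(x) + rng.normal(0, noise, N)
        fits[i] = np.polyval(np.polyfit(x, r, degree), x_eval)
    g_bar = fits.mean(axis=0)                   # E_X[g(x)]
    bias2 = ((g_bar - f(x_eval)) ** 2).mean()   # squared bias, averaged over x
    variance = fits.var(axis=0).mean()          # variance of the fit, averaged over x
    print(degree, bias2, variance)              # bias falls and variance grows with the degree
```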

Bias/Variance Dilemma Example: g_i(x) = 2 has no variance and high bias; g_i(x) = ∑_t r_i^t / N has lower bias but some variance. As we increase complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data): the bias/variance dilemma (Geman et al., 1992).

Bias/Variance Dilemma (figure showing the fitted functions g_i).

Polynomial Regression Best fit: “min error” (figure).

Model Selection How do we select the right model complexity? This is a different problem from estimating the model parameters, and there are several procedures for it.

Cross-Validation We cannot calculate the bias and variance, since we do not know the true model, but we can estimate the total generalization error. Set aside a portion of the data as a validation set. For increasing model complexity, fit the parameters on the training data and compute the error on the validation set; stop when the validation error ceases to decrease or starts to increase.
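A sketch of this procedure for polynomial degree selection (the data, split sizes and degree range are invented): fit on the training portion only, evaluate on the held-out portion, and keep the complexity with the lowest validation error.

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.uniform(-3, 3, 120)
r = np.sin(x) + rng.normal(0, 0.3, 120)           # synthetic data; the true model is unknown to the learner

x_tr, r_tr = x[:80], r[:80]                        # training set
x_va, r_va = x[80:], r[80:]                        # held-out validation set

errors = {}
for degree in range(1, 11):                        # increase model complexity
    w = np.polyfit(x_tr, r_tr, degree)             # estimate parameters on the training set only
    pred = np.polyval(w, x_va)
    errors[degree] = ((pred - r_va) ** 2).mean()   # generalization error estimated on the validation set

best = min(errors, key=errors.get)                 # complexity where validation error is lowest
print(errors, best)
```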

Cross-Validation Best fit at the “elbow” of the validation error curve (figure).

Regularization Introduce a penalty for model complexity into the error function: augmented error = error on data + λ × model complexity. Then find the model complexity (e.g. the degree of the polynomial) and the parameters (coefficients) that together minimize this augmented error. λ is the weight of the complexity penalty; if λ is too large, only very simple models will be admitted.
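One common concrete choice for the complexity penalty is a ridge-style (L2) penalty on the coefficient magnitudes, which is related to, but not identical with, penalizing the degree itself; here is a minimal sketch with made-up data, degree and λ.

```python
import numpy as np

rng = np.random.default_rng(12)
x = rng.uniform(-3, 3, 60)
r = np.sin(x) + rng.normal(0, 0.3, 60)            # synthetic data

degree, lam = 9, 1.0                              # flexible model, penalty weight lambda
D = np.vander(x, degree + 1)

# minimize  sum_t (r^t - (D w)_t)^2  +  lambda * ||w||^2   (closed-form ridge solution)
w = np.linalg.solve(D.T @ D + lam * np.eye(degree + 1), D.T @ r)
print(w)                                          # larger lambda shrinks the coefficients, giving simpler fits
```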