INTRODUCTION TO Machine Learning: Parametric Methods


Parametric Estimation
$X = \{x^t\}_{t=1}^{N}$ where $x^t \sim p(x)$
Parametric estimation: assume a form for $p(x|\theta)$ and estimate $\theta$, its sufficient statistics, using $X$.
Example: $p(x) = \mathcal{N}(\mu, \sigma^2)$, where $\theta = \{\mu, \sigma^2\}$.

Maximum Likelihood Estimation
Likelihood of $\theta$ given the sample $X$: $l(\theta|X) = p(X|\theta) = \prod_t p(x^t|\theta)$
Log likelihood: $\mathcal{L}(\theta|X) = \log l(\theta|X) = \sum_t \log p(x^t|\theta)$
Maximum likelihood estimator: $\theta^* = \arg\max_\theta \mathcal{L}(\theta|X)$
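
As a concrete illustration of these definitions, here is a minimal numerical sketch (not from the slides; the data, the choice of an exponential model, and all names are my own): it evaluates the log likelihood $\mathcal{L}(\theta|X)$ on a grid, takes the maximizing $\theta$, and compares it with the known closed-form MLE for that model.

```python
import numpy as np

# Toy sample from an exponential density p(x|theta) = theta * exp(-theta * x),
# chosen only for illustration; the true parameter is theta = 0.5.
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=500)

def log_likelihood(theta, X):
    """L(theta|X) = sum_t log p(x^t | theta) for the exponential density."""
    return np.sum(np.log(theta) - theta * X)

thetas = np.linspace(0.01, 2.0, 2000)                  # candidate parameter values
L = np.array([log_likelihood(th, X) for th in thetas])
theta_star = thetas[np.argmax(L)]                      # theta* = argmax_theta L(theta|X)

print("numerical MLE:", theta_star)
print("closed form  :", 1.0 / X.mean())                # exponential MLE is 1 / sample mean
```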

Examples: Bernoulli/Multinomial
Bernoulli: two states, failure/success, $x \in \{0,1\}$
$P(x) = p_o^x (1-p_o)^{1-x}$
$\mathcal{L}(p_o|X) = \log \prod_t p_o^{x^t} (1-p_o)^{1-x^t}$
MLE: $p_o = \sum_t x^t / N$
Multinomial: $K > 2$ states, $x_i \in \{0,1\}$
$P(x_1, x_2, \ldots, x_K) = \prod_i p_i^{x_i}$
$\mathcal{L}(p_1, p_2, \ldots, p_K|X) = \log \prod_t \prod_i p_i^{x_i^t}$
MLE: $p_i = \sum_t x_i^t / N$
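
A small sketch of both estimates (the data and names are my own illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli: the MLE p_o is just the fraction of successes in the sample.
x = rng.binomial(n=1, p=0.3, size=1000)        # 0/1 sample with true p_o = 0.3
p_o = x.sum() / len(x)                         # p_o = sum_t x^t / N
print("Bernoulli MLE:", p_o)

# Multinomial: each x^t is a one-hot vector over K states; p_i = sum_t x_i^t / N.
K, N = 4, 1000
labels = rng.choice(K, size=N, p=[0.1, 0.2, 0.3, 0.4])
X = np.eye(K)[labels]                          # N x K one-hot encoding
p = X.sum(axis=0) / N
print("Multinomial MLE:", p)
```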

Gaussian (Normal) Distribution
$p(x) = \mathcal{N}(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
MLE for $\mu$ and $\sigma^2$: $m = \frac{1}{N}\sum_t x^t, \qquad s^2 = \frac{1}{N}\sum_t (x^t - m)^2$
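
The sample mean and the (biased, divide-by-$N$) sample variance realize these estimators directly; a minimal check with synthetic data of my own:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # true mu = 5, sigma^2 = 4

m  = x.mean()                                  # m   = (1/N) sum_t x^t
s2 = np.mean((x - m) ** 2)                     # s^2 = (1/N) sum_t (x^t - m)^2

print("m =", m, " s^2 =", s2)
print("np.var with ddof=0 gives the same MLE:", np.var(x))
```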

Bias and Variance
Unknown parameter $\theta$; estimator $d_i = d(X_i)$ on sample $X_i$
Bias: $b_\theta(d) = E[d] - \theta$
Variance: $E[(d - E[d])^2]$
Mean square error: $r(d, \theta) = E[(d - \theta)^2] = (E[d] - \theta)^2 + E[(d - E[d])^2] = \text{Bias}^2 + \text{Variance}$
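
The decomposition can be checked by simulation. The sketch below (my own toy setup) draws many small samples, applies the biased MLE variance estimator $s^2$ to each, and verifies numerically that the mean square error equals bias squared plus variance:

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 4.0                                    # true variance of N(0, 2^2)
N, M = 10, 200_000                             # small samples, many repetitions

X = rng.normal(0.0, 2.0, size=(M, N))
d = X.var(axis=1)                              # d(X_i) = s^2 on each sample (biased MLE)

bias     = d.mean() - theta                    # b_theta(d) = E[d] - theta
variance = np.mean((d - d.mean()) ** 2)        # E[(d - E[d])^2]
mse      = np.mean((d - theta) ** 2)           # r(d, theta) = E[(d - theta)^2]

print("bias^2 + variance =", bias**2 + variance)
print("mean square error =", mse)              # the two agree up to simulation noise
```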

Bayes' Estimator
Treat $\theta$ as a random variable with prior $p(\theta)$
Bayes' rule: $p(\theta|X) = \frac{p(X|\theta)\,p(\theta)}{p(X)}$
Full Bayes: $p(x|X) = \int p(x|\theta)\,p(\theta|X)\,d\theta$
Maximum a posteriori (MAP): $\theta_{MAP} = \arg\max_\theta p(\theta|X)$
Maximum likelihood (ML): $\theta_{ML} = \arg\max_\theta p(X|\theta)$
Bayes estimator: $\theta_{Bayes} = E[\theta|X] = \int \theta\, p(\theta|X)\,d\theta$
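
A minimal conjugate sketch, assuming a Gaussian likelihood with known variance and a Gaussian prior on the mean (the numbers and names below are my own). In this case the posterior is Gaussian, so the MAP estimate and the posterior mean (the Bayes estimator) coincide, and both shrink the ML estimate toward the prior mean:

```python
import numpy as np

rng = np.random.default_rng(4)

sigma, mu0, sigma0 = 1.0, 0.0, 0.5             # known noise std; prior N(mu0, sigma0^2)
x = rng.normal(2.0, sigma, size=20)            # sample with true mean 2
N, m = len(x), x.mean()

theta_ML = m                                   # argmax_theta p(X|theta)

post_prec = N / sigma**2 + 1 / sigma0**2       # posterior precision
post_mean = (N * m / sigma**2 + mu0 / sigma0**2) / post_prec

theta_MAP = theta_Bayes = post_mean            # identical for a Gaussian posterior
print("ML:", theta_ML, "  MAP = Bayes:", theta_MAP)
```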

Parametric Classification

Parametric Classification
Given the sample, the ML estimates are
$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}, \qquad m_i = \frac{\sum_t x^t r_i^t}{\sum_t r_i^t}, \qquad s_i^2 = \frac{\sum_t (x^t - m_i)^2 r_i^t}{\sum_t r_i^t}$
and the discriminant becomes
$g_i(x) = -\frac{1}{2}\log 2\pi - \log s_i - \frac{(x - m_i)^2}{2 s_i^2} + \log \hat{P}(C_i)$
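
A compact sketch of this classifier on synthetic two-class, one-dimensional data (the data and all names are my own, chosen only to illustrate the estimates and the discriminant above):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy two-class sample: class labels y in {0, 1}, one-dimensional inputs x.
x = np.concatenate([rng.normal(-1.0, 1.0, 60), rng.normal(2.0, 1.5, 40)])
y = np.concatenate([np.zeros(60, dtype=int), np.ones(40, dtype=int)])
N = len(x)

priors, means, variances = [], [], []
for c in (0, 1):
    xc = x[y == c]
    priors.append(len(xc) / N)                 # P(C_i) = sum_t r_i^t / N
    means.append(xc.mean())                    # m_i
    variances.append(xc.var())                 # s_i^2

def g(xq, i):
    """Discriminant g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)."""
    return (-0.5 * np.log(2 * np.pi * variances[i])
            - (xq - means[i]) ** 2 / (2 * variances[i])
            + np.log(priors[i]))

xq = 0.7
print("predicted class for x =", xq, ":", int(g(xq, 1) > g(xq, 0)))
```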

Figure: (a) likelihoods and (b) posteriors for two classes when the input is one-dimensional. Variances are equal and the posteriors intersect at one point, which is the threshold of decision.

Parametric Classification
Figure: (a) likelihoods and (b) posteriors for two classes when the input is one-dimensional. Variances are unequal and the posteriors intersect at two points. In (c), the expected risks are shown for the two classes and for the reject option.

Regression
$r = f(x) + \varepsilon$, with noise $\varepsilon \sim \mathcal{N}(0, \sigma^2)$; the estimator is $g(x|\theta)$, so $p(r|x) \sim \mathcal{N}(g(x|\theta), \sigma^2)$

Regression: From Log Likelihood to Error
Under the Gaussian noise model, maximizing the log likelihood of the sample is equivalent to minimizing the sum of squared errors, $E(\theta|X) = \frac{1}{2}\sum_t \left[r^t - g(x^t|\theta)\right]^2$

Linear Regression
$g(x^t|w_1, w_0) = w_1 x^t + w_0$

Polynomial Regression
$g(x^t|w_k, \ldots, w_2, w_1, w_0) = w_k (x^t)^k + \cdots + w_2 (x^t)^2 + w_1 x^t + w_0$
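
Both models can be fit by least squares. A short sketch (synthetic data of my own; NumPy's polyfit minimizes the squared error over the polynomial coefficients):

```python
import numpy as np

rng = np.random.default_rng(6)

# Noisy data from a smooth function, chosen only for illustration.
x = np.sort(rng.uniform(0, 5, 30))
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, size=x.shape)

# Order 1 is linear regression; higher orders are polynomial regression.
# np.polyfit minimizes  sum_t [r^t - g(x^t | w)]^2  over the coefficients w.
for order in (1, 3):
    w = np.polyfit(x, r, deg=order)
    sse = np.sum((r - np.polyval(w, x)) ** 2)
    print(f"order {order}: coefficients {np.round(w, 3)}, squared error {sse:.2f}")
```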

Other Error Measures
Square error: $E(\theta|X) = \frac{1}{N}\sum_t \left[r^t - g(x^t|\theta)\right]^2$
Relative square error: $E(\theta|X) = \frac{\sum_t \left[r^t - g(x^t|\theta)\right]^2}{\sum_t \left[r^t - \bar{r}\right]^2}$
Absolute error: $E(\theta|X) = \sum_t \left|r^t - g(x^t|\theta)\right|$
$\varepsilon$-sensitive error: $E(\theta|X) = \sum_t \mathbf{1}\!\left(\left|r^t - g(x^t|\theta)\right| > \varepsilon\right)\left(\left|r^t - g(x^t|\theta)\right| - \varepsilon\right)$
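
These measures are straightforward to implement; a small sketch with made-up targets and predictions (all values are illustrative):

```python
import numpy as np

def square_error(r, g):
    return np.mean((r - g) ** 2)

def relative_square_error(r, g):
    return np.sum((r - g) ** 2) / np.sum((r - r.mean()) ** 2)

def absolute_error(r, g):
    return np.sum(np.abs(r - g))

def eps_sensitive_error(r, g, eps=0.5):
    resid = np.abs(r - g)
    return np.sum((resid > eps) * (resid - eps))   # residuals smaller than eps cost nothing

r = np.array([1.0, 2.0, 3.5, 4.0])                 # targets
g = np.array([1.2, 1.8, 3.0, 4.6])                 # predictions
print(square_error(r, g), relative_square_error(r, g),
      absolute_error(r, g), eps_sensitive_error(r, g))
```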

Bias and Variance
$E\!\left[(r - g(x))^2 \mid x\right] = \underbrace{E\!\left[(r - E[r|x])^2 \mid x\right]}_{\text{noise}} + \underbrace{\left(E[r|x] - g(x)\right)^2}_{\text{squared error}}$
$E_X\!\left[\left(E[r|x] - g(x)\right)^2 \mid x\right] = \underbrace{\left(E[r|x] - E_X[g(x)]\right)^2}_{\text{bias}^2} + \underbrace{E_X\!\left[\left(g(x) - E_X[g(x)]\right)^2\right]}_{\text{variance}}$

Estimating Bias and Variance
$M$ samples $X_i = \{x^t_i, r^t_i\}$, $i = 1, \ldots, M$, are used to fit $g_i(x)$, $i = 1, \ldots, M$
$\bar{g}(x) = \frac{1}{M}\sum_i g_i(x)$
$\widehat{\text{Bias}}^2(g) = \frac{1}{N}\sum_t \left[\bar{g}(x^t) - f(x^t)\right]^2, \qquad \widehat{\text{Var}}(g) = \frac{1}{NM}\sum_t \sum_i \left[g_i(x^t) - \bar{g}(x^t)\right]^2$
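
A minimal simulation of this procedure (the target function, noise level, and sample sizes are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)
f = lambda x: 2 * np.sin(1.5 * x)              # underlying function, known here by construction

M, N, order = 10, 20, 3                        # M datasets of N instances, cubic fits
x_eval = np.linspace(0, 5, 100)                # points at which bias/variance are estimated

fits = []
for _ in range(M):                             # fit g_i on each sample X_i = {x^t_i, r^t_i}
    x = rng.uniform(0, 5, N)
    r = f(x) + rng.normal(0, 1, N)
    fits.append(np.polyval(np.polyfit(x, r, order), x_eval))
fits = np.array(fits)                          # M x len(x_eval) matrix of g_i(x)

g_bar    = fits.mean(axis=0)                   # g_bar(x) = (1/M) sum_i g_i(x)
bias2    = np.mean((g_bar - f(x_eval)) ** 2)   # (1/N) sum_t [g_bar(x^t) - f(x^t)]^2
variance = np.mean((fits - g_bar) ** 2)        # (1/NM) sum_t sum_i [g_i(x^t) - g_bar(x^t)]^2

print("estimated bias^2  :", bias2)
print("estimated variance:", variance)
```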

Bias/Variance Dilemma
Example: $g_i(x) = 2$ has no variance and high bias; $g_i(x) = \frac{1}{N}\sum_t r^t_i$ has lower bias, but some variance
As we increase complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data)

Bias/Variance Dilemma
Figure: (a) the function $f(x) = 2\sin(1.5x)$ and one noisy ($\mathcal{N}(0,1)$) dataset sampled from it. Five samples are taken, each containing twenty instances. (b), (c), and (d) show five polynomial fits $g_i(\cdot)$ of order 1, 3, and 5, respectively; in each case, the dotted line is the average of the five fits, $\bar{g}(\cdot)$.

Polynomial Regression
Figure: in the same setting as the previous figure, but using one hundred models instead of five; bias, variance, and error for polynomials of order 1 to 5. The best fit is the order with minimum total error.
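
The earlier bias/variance sketch extends directly to this experiment by sweeping the polynomial order (again with my own illustrative settings); the trade-off between decreasing bias and increasing variance then shows up in the printed numbers:

```python
import numpy as np

rng = np.random.default_rng(8)
f = lambda x: 2 * np.sin(1.5 * x)
M, N = 100, 20                                 # one hundred datasets of twenty instances each
x_eval = np.linspace(0, 5, 200)

for order in range(1, 6):
    fits = []
    for _ in range(M):
        x = rng.uniform(0, 5, N)
        r = f(x) + rng.normal(0, 1, N)
        fits.append(np.polyval(np.polyfit(x, r, order), x_eval))
    fits = np.array(fits)
    g_bar = fits.mean(axis=0)
    bias2 = np.mean((g_bar - f(x_eval)) ** 2)
    var   = np.mean((fits - g_bar) ** 2)
    print(f"order {order}: bias^2 = {bias2:.3f}  variance = {var:.3f}  "
          f"error = {bias2 + var:.3f}")
```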

Model Selection
Cross-validation: measure generalization accuracy by testing on data unused during training
Regularization: penalize complex models, $E = \text{error on data} + \lambda \cdot \text{model complexity}$
Akaike's information criterion (AIC), Bayesian information criterion (BIC)
Minimum description length (MDL): Kolmogorov complexity, shortest description of the data
Structural risk minimization (SRM)
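
Cross-validation is easy to sketch for the polynomial-order question used throughout this chapter (the data, fold count, and helper names below are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(0, 5, 60)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 60)

def cv_error(x, r, order, k=5):
    """k-fold cross-validation: mean squared error on the held-out folds."""
    idx = rng.permutation(len(x))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        w = np.polyfit(x[train], r[train], order)      # fit on the training folds only
        errors.append(np.mean((r[fold] - np.polyval(w, x[fold])) ** 2))
    return np.mean(errors)

scores = {order: cv_error(x, r, order) for order in range(1, 8)}
print({order: round(err, 3) for order, err in scores.items()})
print("selected order:", min(scores, key=scores.get))  # smallest validation error
```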

Model Selection
Figure: training and validation error as a function of model complexity; the best fit is at the "elbow", where the validation error stops decreasing.

Bayesian Model Selection
Prior on models, $p(\text{model})$
$p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})}$
Regularization, when the prior favors simpler models
Bayes: MAP of the posterior, $p(\text{model} \mid \text{data})$
Average over a number of models with high posterior probability

Regression Example
Coefficients increase in magnitude as the polynomial order increases:
Order 1: [ , ]
Order 2: [0.1682, , ]
Order 3: [0.4238, , , ]
Order 4: [ , , , , ]
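
This effect is easy to reproduce: as the order grows, unregularized least-squares fits tend to use ever larger coefficients to chase the noise. A short sketch with synthetic data of my own (the numbers will not match the slide's coefficients):

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.uniform(0, 5, 20)
r = 2 * np.sin(1.5 * x) + rng.normal(0, 1, 20)

# Track how large the fitted coefficients tend to become as the polynomial order
# increases; this growth is what a regularization term lambda * ||w||^2 penalizes.
for order in range(1, 6):
    w = np.polyfit(x, r, order)
    print(f"order {order}: max |coefficient| = {np.max(np.abs(w)):.2f}")
```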