
Part 3: Estimation of Parameters

Estimation of Parameters. Most of the time we have random samples, but the densities are not given. If the parametric form of the densities is given or assumed, then, using the labeled samples, the parameters can be estimated (supervised learning). Maximum Likelihood Estimation of Parameters. Assume we have a sample set D = {X_1, ..., X_K} belonging to a given class, drawn independently from an identically distributed random variable (i.i.d. samples).

The density function p(x|θ) is assumed to be of known form, with θ the unknown parameter vector; for a Gaussian, θ = (M, Σ). So our problem is to estimate θ using the sample set D. (From now on we drop the class label and assume a single density function.) θ̂ denotes the estimate of θ. Anything can be an estimate; what is a good estimate? It should converge to the actual values, be unbiased, etc. Due to statistical independence the joint density of the samples factors as p(D|θ) = p(X_1|θ) p(X_2|θ) ... p(X_K|θ); viewed as a function of θ this is called the likelihood function. The maximum likelihood estimate θ̂ is the value of θ that maximizes p(D|θ), i.e. the one that best agrees with the drawn samples.

If θ is a single parameter (a scalar), find θ̂ such that dp(D|θ)/dθ = 0 and solve for θ. When θ is a vector, set ∇_θ L = 0, where ∇_θ is the gradient of L with respect to θ = (θ_1, ..., θ_p). Since the logarithm is monotonic, it is equivalent, and usually easier, to work with the log-likelihood l(θ) = ln p(D|θ) = Σ_k ln p(X_k|θ) and solve ∇_θ l(θ) = 0. (Be careful: the derivative also vanishes at minima, so check that the solution found is indeed a maximum.)
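As a rough illustration of this procedure, here is a minimal sketch in Python with NumPy/SciPy (the data, the function name neg_log_likelihood and the parameter values are my own, not from the lecture): it finds a two-parameter maximum-likelihood estimate numerically by minimizing the negative log-likelihood instead of solving the gradient equations by hand, and compares it with the closed-form answer obtained from ∇_θ l(θ) = 0.

```python
# Hedged sketch: numerical MLE for a 1-d Gaussian with theta = (mu, sigma).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)      # i.i.d. samples x_1..x_K

def neg_log_likelihood(theta):
    mu, sigma = theta
    if sigma <= 0:
        return np.inf                              # outside the valid parameter range
    # l(theta) = sum_k ln p(x_k | theta) for a Gaussian density
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu) ** 2 / (2 * sigma**2))

theta_hat = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead").x
print("numerical MLE :", theta_hat)                # close to (3.0, 2.0)
print("closed form   :", x.mean(), x.std())        # same result from grad l = 0
```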

Example 1: Consider an exponential distribution (single feature, single parameter): p(x|θ) = θ e^(-θx) for x ≥ 0, and 0 otherwise. With a random sample x_1, ..., x_K, the log-likelihood is l(θ) = K ln θ - θ Σ_k x_k. Setting dl/dθ = K/θ - Σ_k x_k = 0 gives θ̂ = K / Σ_k x_k, valid for θ > 0 (the inverse of the sample average).
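A minimal numerical check of this result, assuming the same exponential density p(x|θ) = θ e^(-θx); the sample data and variable names are only illustrative:

```python
# Sketch for Example 1: the ML estimate of the exponential rate is the
# inverse of the sample average, theta_hat = K / sum_k x_k.
import numpy as np

rng = np.random.default_rng(1)
true_theta = 0.5
x = rng.exponential(scale=1.0 / true_theta, size=2000)   # p(x|theta) = theta * exp(-theta*x)

theta_hat = 1.0 / x.mean()            # same as len(x) / x.sum()
print(theta_hat)                      # close to 0.5 for a large sample
```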

Example 2: Multivariate Gaussian with unknown mean vector M; assume the covariance matrix Σ is known. With K i.i.d. samples X_1, ..., X_K from the same distribution, setting the gradient of the log-likelihood to zero gives Σ^(-1)[(X_1 - M) + (X_2 - M) + ... + (X_K - M)] = 0 (linear algebra).

Solving this yields M̂ = (1/K)(X_1 + ... + X_K), the sample average or sample mean. Estimation of Σ when it is unknown (do it yourself: not so simple) gives the sample covariance Σ̂ = (1/K) Σ_{k=1..K} (X_k - M̂)(X_k - M̂)^T, where M̂ is the same as above. This is a biased estimate; use 1/(K-1) in place of 1/K for an unbiased estimate.
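A small sketch of both estimates for a multivariate Gaussian on synthetic data (the variable names and parameter values are mine, not from the lecture):

```python
# Sketch for Example 2: ML estimates of the mean vector and covariance of a
# multivariate Gaussian from K i.i.d. samples, plus the unbiased 1/(K-1) variant.
import numpy as np

rng = np.random.default_rng(2)
M_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(M_true, Sigma_true, size=500)   # K x n sample matrix
K = X.shape[0]

M_hat = X.mean(axis=0)                       # sample mean
diff = X - M_hat
Sigma_mle = diff.T @ diff / K                # biased ML estimate (1/K)
Sigma_unbiased = diff.T @ diff / (K - 1)     # unbiased estimate (1/(K-1))

print(M_hat)
print(Sigma_mle)
print(Sigma_unbiased)                        # equals np.cov(X, rowvar=False)
```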

Example 3: Binary variables x_1, ..., x_n with unknown parameters p_i = P(x_i = 1), i = 1, ..., n (n parameters), so that P(x|θ) = Π_i p_i^(x_i) (1 - p_i)^(1 - x_i). With K samples, the log-likelihood is l(θ) = Σ_k Σ_i [x_ik ln p_i + (1 - x_ik) ln(1 - p_i)], where x_ik here is the i-th element of the k-th sample.

So,  is the sample average of the feature.

Since x_i is binary, summing x_ik over the samples is the same as counting the occurrences of '1'. Consider a character recognition problem with binary matrices (e.g. templates for the letters A and B). For each pixel, count the number of 1's over the training samples of a class and divide by K; this is the estimate of p_i for that pixel.
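A hedged sketch of this per-pixel counting estimate on made-up binary data (the 5x5 "image" setup and names are only illustrative):

```python
# Sketch of Example 3: for binary pixel vectors, the ML estimate of
# p_i = P(x_i = 1) is the fraction of training images in which pixel i is 1.
import numpy as np

rng = np.random.default_rng(3)
# Pretend we have K binarized 5x5 images of one character class, flattened to n = 25 bits.
K, n = 200, 25
p_true = rng.uniform(0.05, 0.95, size=n)
X = (rng.random((K, n)) < p_true).astype(int)   # K samples, each a binary vector

p_hat = X.mean(axis=0)                 # per-pixel count of 1's divided by K
print(np.abs(p_hat - p_true).max())    # small for large K
```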

Part 4: Features and Feature Extraction

Problems of Dimensionality and Feature Selection. Are all features independent? Especially with binary features we might have more than 100. How does classification accuracy behave vs. the size of the feature set? Consider the Gaussian case with the same covariance Σ for both categories and equal a priori probabilities. Then the probability of error is P(e) = (1/√(2π)) ∫_{r/2}^{∞} e^(-u²/2) du, where r² = (M_1 - M_2)^T Σ^(-1) (M_1 - M_2) is the square of the Mahalanobis distance between M_1 and M_2 (the class means).

P(e) decreases as r increases (the distance between the means). If Σ = diag(σ_1², ..., σ_n²) (all features statistically independent), then r² = Σ_{i=1..n} ((μ_i1 - μ_i2)/σ_i)². We conclude from this that 1) the most useful features are the ones with a large distance between the means and a small variance, and 2) each feature contributes to reducing the probability of error.
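A short sketch of these quantities, assuming equal-covariance Gaussian classes with equal priors and made-up class parameters; it computes the Mahalanobis distance r between the means, the same r from the diagonal-covariance formula, and the resulting error probability:

```python
# Hedged sketch: Bayes error P(e) = Phi(-r/2) for two equal-covariance Gaussian
# classes with equal priors, r = Mahalanobis distance between the class means.
import numpy as np
from scipy.stats import norm

M1 = np.array([0.0, 0.0, 0.0])
M2 = np.array([1.0, 2.0, 0.5])
Sigma = np.diag([1.0, 4.0, 0.25])            # statistically independent features

d = M1 - M2
r = np.sqrt(d @ np.linalg.inv(Sigma) @ d)    # Mahalanobis distance between means
P_e = norm.sf(r / 2)                         # = (1/sqrt(2*pi)) * integral_{r/2}^inf e^{-u^2/2} du

# With a diagonal covariance this reduces to r^2 = sum_i ((mu_i1 - mu_i2)/sigma_i)^2
r_diag = np.sqrt(np.sum((d / np.sqrt(np.diag(Sigma))) ** 2))
print(r, r_diag, P_e)
```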

When r increases, the probability of error decreases; the best features are the ones with distant means and small variances.  So add new features if the ones we already have are not adequate (more features, smaller probability of error).  But it has been shown in practice that adding new features beyond some point leads to worse performance.

What do we want from features? Find statistically independent features, find discriminating features, and keep the features computationally feasible. Principal Component Analysis (PCA) (Karhunen-Loeve Transform) reduces the feature set to uncorrelated (in the Gaussian case, statistically independent) features. Given the sample vectors X_1, ..., X_K, first find a single representative vector using a squared-error criterion.

Eliminating Redundant Features. A smaller feature set is to be found using a larger set. If two features x_1 and x_2 are linearly dependent (redundant), we either  throw one away,  generate a new feature using x_1 and x_2 (e.g. projections of the points onto a line), giving a new sample in 1-d space, or  form a linear combination of the features.

A linear transformation W? It can be found by the K-L expansion / Principal Component Analysis so that the goals above are satisfied (class discrimination and independent x_1, x_2, ...). Zero-degree representation: the whole sample set is represented with a single vector. Find a vector X_0 so that the sum of the squared distances from the samples to X_0 is minimum. Then move to linear functions.

Squared-error criterion: J_0(X_0) = Σ_{k=1..K} ||X_0 - X_k||². Find the X_0 that minimizes J_0(X_0); the solution is given by the sample mean M = (1/K) Σ_k X_k.

With X_0 = M, this expression is minimized (the remaining term does not depend on X_0). Consider now a 1-d representation obtained from 2-d data: the line should pass through the sample mean. Write X_k ≈ M + a_k e, where e is a unit vector in the direction of the line and a_k is the (signed) distance of X_k from M along that line.

Now, how do we find the best e, the one that minimizes the squared-error criterion? It turns out that, given the scatter matrix S = Σ_{k=1..K} (X_k - M)(X_k - M)^T, e must be the eigenvector of the scatter matrix with the largest eigenvalue λ. That is, we project the data onto a line through the sample mean in the direction of the eigenvector of the scatter matrix with the largest eigenvalue. Now consider a d-dimensional projection: X ≈ M + Σ_{i=1..d} a_i e_i, where e_1, ..., e_d are the d eigenvectors of the scatter matrix having the largest eigenvalues.

The coefficients a_i are called the principal components. So each m-dimensional feature vector X is transformed into a d-dimensional space, since the components are given as a_i = e_i^T (X - M). The a_i now form the elements of our new feature vector, so Y = (a_1, ..., a_d)^T.
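A compact PCA sketch along the lines described above (plain NumPy; the function name pca_project and the synthetic data are mine, not code from the lecture): it forms the scatter matrix, takes the d eigenvectors with the largest eigenvalues, and projects the centered data onto them.

```python
# Hedged PCA sketch: project m-dimensional data onto the d eigenvectors of the
# scatter matrix with the largest eigenvalues.
import numpy as np

def pca_project(X, d):
    """X: K x m data matrix; returns the K x d principal components a_i = e_i^T (X - M)."""
    M = X.mean(axis=0)                        # sample mean
    centered = X - M
    S = centered.T @ centered                 # scatter matrix (m x m)
    eigvals, eigvecs = np.linalg.eigh(S)      # ascending eigenvalues, orthonormal e_i
    E = eigvecs[:, ::-1][:, :d]               # d eigenvectors with the largest eigenvalues
    return centered @ E, E, M

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0, 0, 0], [[5, 2, 0], [2, 3, 0], [0, 0, 0.1]], size=300)
Y, E, M = pca_project(X, d=2)                 # 3-d features reduced to 2-d
print(Y.shape)                                # (300, 2)
```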

FISHER’S LINEAR DISCRIMINANT. Curse of dimensionality: more features require more samples. We would like to choose features with more discriminating ability. In its simplest form, Fisher’s linear discriminant reduces the dimension of the problem to one while separating the samples from different categories. Consider now samples from 2 different categories.

Find a line so that the projection onto it separates the samples best. Equivalently: apply a transformation y = W^T X to the samples X, producing a scalar y, such that Fisher’s criterion function J(W) = (m̃_1 - m̃_2)² / (s̃_1² + s̃_2²) is maximized, where m̃_i and s̃_i² are the mean and the scatter of the projected samples of class i. [Figure: Line 1 (bad solution) vs. Line 2 (good solution); blue = original data, red = projection onto Line 1, yellow = projection onto Line 2.]

This reduces the problem to 1-d while keeping the projected classes as distant from each other as possible. But if we write m̃_1, m̃_2 and s̃_1², s̃_2² in terms of W and the original samples, then:

J(W) = (W^T S_B W) / (W^T S_W W), where S_W = S_1 + S_2 is the within-class scatter matrix (S_i being the scatter matrix of class i) and S_B = (M_1 - M_2)(M_1 - M_2)^T is the between-class scatter matrix. Then maximize this generalized Rayleigh quotient.

It can be shown that the W that maximizes J can be found by solving an eigenvalue problem again, S_B W = λ S_W W, and the solution is given by W = S_W^(-1)(M_1 - M_2). This is optimal if the densities are Gaussians with equal covariance matrices; in that case reducing the dimension does not cause any loss. Multiple Discriminant Analysis: the c-category problem, a generalization of the 2-category problem.
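A minimal two-class Fisher discriminant sketch with made-up Gaussian data (names and parameters are illustrative, not from the lecture): it builds the within-class and between-class scatter matrices and computes the direction W = S_W^(-1)(M_1 - M_2).

```python
# Hedged sketch of Fisher's linear discriminant for two classes.
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], size=100)   # class 1
X2 = rng.multivariate_normal([2, 2], [[1.0, 0.3], [0.3, 1.0]], size=100)   # class 2

M1, M2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - M1).T @ (X1 - M1)
S2 = (X2 - M2).T @ (X2 - M2)
S_W = S1 + S2                                  # within-class scatter matrix
S_B = np.outer(M1 - M2, M1 - M2)               # between-class scatter matrix

W = np.linalg.solve(S_W, M1 - M2)              # direction maximizing J(W)
y1, y2 = X1 @ W, X2 @ W                        # 1-d projections of the two classes
print(W, y1.mean(), y2.mean())                 # projected class means are well separated
```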

Non-Parametric Techniques: density estimation, or using the samples directly for classification - Nearest Neighbor Rule - 1-NN - k-NN. Linear Discriminant Functions: g_i(X) is linear. A small 1-NN sketch follows.
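As a preview of the non-parametric approach, here is a minimal 1-NN classifier sketch (the function nn_classify and the toy data are my own, not from the lecture):

```python
# Hedged sketch: 1-NN assigns a test point the label of its closest training sample.
import numpy as np

def nn_classify(x, X_train, labels):
    """Return the label of the nearest training sample (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x, axis=1)
    return labels[np.argmin(distances)]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]])
labels = np.array([0, 0, 1, 1])
print(nn_classify(np.array([1.8, 2.2]), X_train, labels))   # -> 1
```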