Document Analysis: Parameter Estimation for Pattern Recognition
Prof. Rolf Ingold, University of Fribourg, Master course, spring semester 2008

© Prof. Rolf Ingold 2 Outline
- Introduction
- Parameter estimation
- Non-parametric classifiers: kNN
- Neural networks
- Hidden Markov Models
- Other approaches

© Prof. Rolf Ingold 3 Introduction
- Bayesian decision theory provides a theoretical framework for statistical pattern recognition
- It supposes the following probabilistic information to be available:
  - n, the number of classes
  - P(ωi), the a priori probability (prior) of each class ωi
  - p(x|ωi), the distribution of the feature vector x, depending on the class ωi
- How can these values and functions be estimated?
  - especially the class-dependent distribution (or density) functions
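The framework above amounts to the Bayes decision rule: assign x to the class with the largest posterior, i.e. the largest p(x|ωi)·P(ωi). Below is a minimal Python sketch of that rule; the priors and the Gaussian class-conditional densities are made up for illustration and do not come from the course data.

```python
import numpy as np

# Hypothetical priors P(omega_i) for a two-class problem
priors = np.array([0.6, 0.4])

def class_conditional(x, i):
    """Gaussian class-conditional densities p(x|omega_i) with made-up parameters."""
    mu = [0.0, 2.0][i]
    sigma = [1.0, 0.5][i]
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def bayes_decision(x):
    """Pick the class maximizing p(x|omega_i) * P(omega_i)."""
    scores = [class_conditional(x, i) * priors[i] for i in range(len(priors))]
    return int(np.argmax(scores))

print(bayes_decision(1.2))   # 0 or 1, whichever posterior dominates at x = 1.2
```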

© Prof. Rolf Ingold 4 Approaches for statistical pattern recognition
- Several approaches try to overcome the difficulty of obtaining the class-dependent feature distributions (or densities):
  - Parameter estimation: the form of the distributions is supposed to be known; only some parameters have to be estimated from training samples
  - Parzen windows: densities are estimated from training samples by "smoothing" them with a window function
  - K-nearest neighbors (KNN) rule: the decision is associated with the dominant class among the K nearest neighbors taken from the training samples
  - Functional discrimination: the decision consists in minimizing an objective function within an augmented feature space

© Prof. Rolf Ingold 5 Parameter Estimation
- By hypothesis, the following information is supposed to be known:
  - n, the number of classes
  - for each class ωi:
    - the a priori probability P(ωi)
    - the functional form of the class-conditional feature densities p(x|ωi, θi), with unknown parameters θi
    - a labeled set of training data Di = {xi1, xi2, ..., xiNi}, supposed to be drawn randomly from ωi
- In fact, parameter estimation can be performed class by class

© Prof. Rolf Ingold 6 Maximum likelihood criteria
- Maximum likelihood estimation consists in determining the θi that maximizes the likelihood of Di, i.e.
  p(D_i | \theta_i) = \prod_{k=1}^{N_i} p(x_{ik} | \theta_i)
- For some distributions, the problem can be solved analytically by the equations
  \nabla_{\theta_i} \ln p(D_i | \theta_i) = 0
  - is it really a maximum?
- If the solution cannot be found analytically, it can be computed iteratively by a gradient ascent method (see the sketch below)
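When no closed form is available, the log-likelihood can be maximized numerically. The sketch below uses scipy's general-purpose optimizer on a univariate Gaussian, a case where a closed form actually exists and which serves here only as an illustration; the training data are synthetic.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
D = rng.normal(loc=3.0, scale=2.0, size=200)   # synthetic training set D_i

def neg_log_likelihood(theta, data):
    """Negative log-likelihood of a univariate Gaussian N(mu, sigma^2)."""
    mu, log_sigma = theta                      # optimize log(sigma) so that sigma stays > 0
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * ((data - mu) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(D,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # close to the sample mean and standard deviation
```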

© Prof. Rolf Ingold 7 Univariate Gaussian distribution
- In one dimension, the normal distribution N(μ, σ²) is defined by the expression
  p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
  - μ represents the mean
  - σ² represents the variance
  - the maximum of the curve corresponds to x = μ

© Prof. Rolf Ingold 8 Multivariate Gaussian distribution
- In d dimensions, the generalized normal distribution N(μ, Σ) is defined by
  p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)
  where
  - μ represents the mean vector
  - Σ represents the covariance matrix
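A small numpy sketch of evaluating this density directly from the formula; the mean vector and covariance matrix below are hypothetical.

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the multivariate normal density N(mu, Sigma) at point x."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(Sigma)
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm * np.exp(-0.5 * diff @ inv @ diff)

mu = np.array([1.0, 2.0])              # hypothetical mean vector
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])         # hypothetical covariance matrix
print(gaussian_density(np.array([1.5, 1.5]), mu, Sigma))
```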

© Prof. Rolf Ingold 9 Interpretation of the parameters
- The mean vector μ represents the center of the distribution
- The covariance matrix Σ describes the scatter:
  - it is symmetrical: σij = σji
  - it is positive semidefinite (usually positive definite): σii = σi² ≥ 0
  - the principal axes of the hyperellipsoids are given by the eigenvectors of Σ
  - the lengths of the axes are given by the eigenvalues
  - if two features xi and xj are statistically independent, then σij = σji = 0

© Prof. Rolf Ingold 10 Mahalanobis distance
- Regions of constant density are hyperellipsoids centered at μ and characterized by the equation
  (x-\mu)^T \Sigma^{-1} (x-\mu) = C
  where C is a positive constant
- The Mahalanobis distance from x to μ is defined as
  r = \sqrt{(x-\mu)^T \Sigma^{-1} (x-\mu)}
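A short numpy sketch of the Mahalanobis distance as defined above; the parameter values are hypothetical.

```python
import numpy as np

def mahalanobis(x, mu, Sigma):
    """Mahalanobis distance r = sqrt((x - mu)^T Sigma^-1 (x - mu))."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(Sigma) @ diff))

mu = np.array([0.0, 0.0])
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])
print(mahalanobis(np.array([2.0, 1.0]), mu, Sigma))   # sqrt(2) ≈ 1.414
```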

© Prof. Rolf Ingold 11 Estimation of μ and Σ of normal distributions
- In the one-dimensional case, the maximum likelihood criterion leads to the following equations (the derivatives of the log-likelihood with respect to μ and σ² set to zero)
- In the one-dimensional case the solution is
  \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k , \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2
- Generalized to the multi-dimensional case, we obtain
  \hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k , \quad \hat{\Sigma} = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})(x_k-\hat{\mu})^T
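These estimates translate directly into code. A minimal sketch with synthetic samples; the true parameters used to generate the data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[1.0, 0.4], [0.4, 2.0]],
                            size=500)          # synthetic training samples, shape (n, d)

n = X.shape[0]
mu_hat = X.mean(axis=0)                        # sample mean vector
diff = X - mu_hat
Sigma_hat = (diff.T @ diff) / n                # ML covariance estimate (divides by n)

print(mu_hat)
print(Sigma_hat)
```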

© Prof. Rolf Ingold 12 Bias Problem
- The estimation of σ² (resp. Σ) is biased; its expected value over all sets of size n differs from the true variance:
  E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2 \neq \sigma^2
- An unbiased estimation would be
  s^2 = \frac{1}{n-1}\sum_{k=1}^{n}(x_k-\hat{\mu})^2
- Both estimators converge asymptotically
- Which estimator is correct?
  - they are neither right nor wrong!
  - neither one has all desirable properties
  - Bayesian learning theory can give an answer
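The two estimators can be compared directly with numpy, whose `ddof` argument selects the divisor; a sketch on a small synthetic sample.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.0, scale=1.0, size=10)    # small synthetic sample

var_ml = np.var(x, ddof=0)        # ML estimate, divides by n (biased)
var_unbiased = np.var(x, ddof=1)  # unbiased estimate, divides by n - 1

print(var_ml, var_unbiased)       # the two values converge as the sample size grows
```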

© Prof. Rolf Ingold 13 Discriminant functions for normal distributions (1)
- For normal distributions, the following discriminant functions may be stated:
  g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)
- In the case where all classes share the same covariance matrix Σ, the decision boundaries are linear (see the sketch below)
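A minimal sketch of such a discriminant function for Gaussian class-conditional densities, with hypothetical parameters. When both classes share the same covariance matrix, the quadratic term is identical for all classes and cancels out of the comparison, which is why the resulting boundary is linear.

```python
import numpy as np

def discriminant(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^-1 (x-mu) - 1/2 ln|Sigma| + ln P(omega_i); constant term dropped."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Two classes sharing the same covariance matrix -> linear decision boundary
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
priors = [0.5, 0.5]

x = np.array([1.4, 0.5])
g = [discriminant(x, mus[i], Sigma, priors[i]) for i in range(2)]
print(int(np.argmax(g)))   # index of the predicted class
```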

© Prof. Rolf Ingold 14 Linear decision boundaries for normal distributions

© Prof. Rolf Ingold 15 Discriminant functions for normal distributions (2)
- In the case of arbitrary covariance matrices, the decision boundaries become quadratic

© Prof. Rolf Ingold 16 (figure only)

© Prof. Rolf Ingold 17 Font Recognition: 1D-Gaussian estimation (1)
- Font style discrimination (■ roman ■ italic) using hpd-stdev
  - estimated models fit the observed distributions
  - decision boundary is accurate
  - recognition accuracy (96.3%) is confirmed by the experimental confusion matrix

© Prof. Rolf Ingold 18 Font Recognition: 1D-Gaussian estimation (2)
- Font boldness discrimination (■ normal ■ bold) using hr-mean
  - estimated models do not fit the real distributions
  - decision boundary is surprisingly well adapted
  - recognition accuracy (97.6%) is high, as observed from the experimental confusion matrix

© Prof. Rolf Ingold 19 Font Recognition: 1D-Gaussian estimation (3)
- Boldness is generally dependent on the font family
- hr-mean can perfectly discriminate ■ normal and ■ bold fonts if the font family is known (recognition rate > 99.9%)
  (figure panels: Times, Courier, Arial, all)

© Prof. Rolf Ingold 20 Font Recognition: 1D-Gaussian estimation (4)
- Font family discrimination (■ Arial, ■ Courier, ■ Times) using hr-mean
  - estimated models do not fit the real distributions at all
  - decision boundaries are inadequate
  - recognition accuracy is poor (41.9%)

© Prof. Rolf Ingold 21 Font Recognition: 1D-Multi-Gaussian estimation
- Font family discrimination (■ Arial, ■ Courier, ■ Times) using hr-mean, supposing the font style to be known for learning
  - estimated models fit the real distributions
  - decision boundaries are adequate
  - recognition accuracy is nearly optimal for the given feature (89.6%)

© Prof. Rolf Ingold 22 Font Recognition: 2D-Gaussian estimation
- Font family discrimination (■ Arial, ■ Courier, ■ Times) using two features: hr-stdev and vr-mean
  - models fit approximately two classes but not the third one
  - decision boundary is surprisingly well adapted
  - recognition accuracy (93.5%) is reasonable

© Prof. Rolf Ingold 23 Font Recognition: General Gaussian estimation
- Performance of font family discrimination (■ Arial, ■ Courier, ■ Times) depends on the feature set used (a sketch of such a comparison follows below):
  - hr-stdev: recognition rate 72.7%
  - hr-stdev, vr-mean: recognition rate 93.5%
  - hp-mean, hr-mean, vr-mean: recognition rate 98.0%
  - hp-mean, hpd-stdev, hr-mean, vr-mean, hr-stdev, vr-stdev: recognition rate 99.7%
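A hedged sketch of how such a feature-subset comparison could be set up with scikit-learn's quadratic discriminant analysis. The feature names follow the slide, but the data here are random placeholders, so the printed rates will not match the figures above.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Hypothetical data: rows are text-line samples, columns named after the slide's features
features = ["hp-mean", "hpd-stdev", "hr-mean", "hr-stdev", "vr-mean", "vr-stdev"]
rng = np.random.default_rng(3)
X = rng.normal(size=(600, len(features)))          # placeholder feature matrix
y = rng.integers(0, 3, size=600)                   # placeholder labels: Arial / Courier / Times

subsets = [["hr-stdev"], ["hr-stdev", "vr-mean"], features]
for subset in subsets:
    cols = [features.index(f) for f in subset]
    X_tr, X_te, y_tr, y_te = train_test_split(X[:, cols], y, random_state=0)
    clf = QuadraticDiscriminantAnalysis().fit(X_tr, y_tr)
    print(subset, clf.score(X_te, y_te))           # recognition rate for this feature subset
```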

© Prof. Rolf Ingold 24 Font recognition: classifier for all 12 classes
- Discrimination of all fonts using all features: hp-mean, hpd-stdev, hr-mean, hr-stdev, vr-mean, vr-stdev
  - overall recognition rate of 99.6%
  - most errors are due to roman/italic confusion

© Prof. Rolf Ingold 25 Error types
- In a Bayesian classifier using parameter estimation, several error types occur:
  - Indistinguishability errors, due to the overlapping of distributions, which are inherent to the problem
    - cannot be reduced
  - Modeling errors, due to a bad choice of the parametric density functions (models)
    - can be avoided by changing the models
  - Estimation errors, due to the imprecision of the training data
    - can be reduced by increasing the amount of training data

© Prof. Rolf Ingold 26 Influence of the size of training data
- Evolution of the error rate as a function of the size of the training sets (experiment with 4 training sets and 2 test sets, ■ average)