Pattern Recognition PhD Course.

Automatic Letter Recognition
Steps of the letter recognition:
1. Creating a training set:
   - separating the characters from the text
   - creating a feature vector for each separated character
2. Classifying unknown characters using the training set
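A minimal sketch of the feature-extraction step above, assuming the characters have already been separated into binarized image arrays; the grid-density feature and all names are illustrative choices, not taken from the slides.

```python
import numpy as np

def grid_density_features(char_img, grid=(4, 4)):
    """Illustrative feature vector: ink density in each cell of a coarse grid.
    char_img is assumed to be a 2D binary array (1 = ink, 0 = background)."""
    feats = []
    for row_block in np.array_split(char_img, grid[0], axis=0):
        for cell in np.array_split(row_block, grid[1], axis=1):
            feats.append(cell.mean())        # fraction of ink pixels in this cell
    return np.array(feats)                   # p = grid[0] * grid[1] dimensional feature vector

# Training set: one (feature vector, label) pair per separated character, e.g.
# training_X = np.stack([grid_density_features(img) for img in separated_chars])
# training_Y = np.array(labels)
```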

The training set?

Pattern Recognition
$X$: $p$-dimensional random vector, the feature vector.
$Y$: discrete random variable, the class label, with range $R_Y = \{1, 2, \dots, M\}$.
Decision function: $g: \mathbb{R}^p \to R_Y$.
If $g(X) \neq Y$, then the decision makes an error.

In the formulation of the Bayes decision problem, introduce a cost function $C(y, y') \ge 0$, which is the cost incurred if the label is $Y = y$ and the decision is $g(X) = y'$. For a decision function $g$, the risk is the expectation of the cost:
$$R(g) = \mathbf{E}\big[C(Y, g(X))\big].$$
In the Bayes decision problem, the aim is to minimize the risk, i.e., the goal is to find a function $g^*: \mathbb{R}^p \to \{1, 2, \dots, M\}$ such that
$$R(g^*) = \min_{g: \mathbb{R}^p \to \{1, 2, \dots, M\}} R(g),$$
where $g^*$ is called the Bayes decision function and $R^* = R(g^*)$ is the Bayes risk.

For the a posteriori probabilities, introduce the notation
$$P_y(X) = \mathbf{P}(Y = y \mid X).$$
Let the decision function $g^*$ be defined by
$$g^*(X) = \arg\min_{y'} \sum_{y=1}^{M} C(y, y')\, P_y(X).$$
If the arg min is not unique, choose the smallest $y'$ that minimizes the sum. This definition implies that for any decision function $g$,
$$\sum_{y=1}^{M} C(y, g^*(X))\, P_y(X) \le \sum_{y=1}^{M} C(y, g(X))\, P_y(X).$$
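As a hedged illustration of this rule: given the posterior vector $(P_1(x), \dots, P_M(x))$ and the cost matrix $C(y, y')$, the Bayes decision is the $y'$ minimizing the expected cost, with ties broken toward the smallest index. The function name bayes_decision and all numbers below are made up for the example.

```python
import numpy as np

def bayes_decision(posteriors, cost):
    """posteriors[y] = P_y(x) for y = 0, ..., M-1;
    cost[y, yp] = C(y, y'), the cost of deciding y' when the true label is y.
    Returns the y' minimizing sum_y C(y, y') P_y(x); ties go to the smallest y'."""
    expected_cost = posteriors @ cost      # entry y' equals sum_y P_y(x) C(y, y')
    return int(np.argmin(expected_cost))   # argmin returns the first (smallest) index on ties

# Toy example with M = 3 classes and an asymmetric, illustrative cost matrix
posteriors = np.array([0.5, 0.3, 0.2])
cost = np.array([[0, 1, 4],
                 [1, 0, 1],
                 [2, 1, 0]])
print(bayes_decision(posteriors, cost))    # expected costs are (0.7, 0.7, 2.3) -> decision 0
```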

Theorem. For any decision function $g$, we have $R(g^*) \le R(g)$.
Proof. For a decision function $g$, let us calculate the risk:
$$R(g) = \mathbf{E}\big[C(Y, g(X))\big] = \mathbf{E}\Big[\mathbf{E}\big[C(Y, g(X)) \mid X\big]\Big] = \mathbf{E}\left[\sum_{y=1}^{M} \sum_{y'=1}^{M} C(y, y')\, \mathbf{P}\big(Y = y,\, g(X) = y' \mid X\big)\right]$$
$$= \mathbf{E}\left[\sum_{y=1}^{M} \sum_{y'=1}^{M} C(y, y')\, I_{\{g(X) = y'\}}\, \mathbf{P}(Y = y \mid X)\right] = \mathbf{E}\left[\sum_{y=1}^{M} C(y, g(X))\, P_y(X)\right].$$
This implies that
$$R(g) = \mathbf{E}\left[\sum_{y=1}^{M} C(y, g(X))\, P_y(X)\right] \ge \mathbf{E}\left[\sum_{y=1}^{M} C(y, g^*(X))\, P_y(X)\right] = R(g^*). \qquad\blacksquare$$

Concerning the cost function, the most frequently studied example is the so-called 0-1 loss:
$$C(y, y') = \begin{cases} 1 & \text{if } y \ne y' \\ 0 & \text{if } y = y' \end{cases}$$
For the 0-1 loss, the corresponding risk is the error probability:
$$R(g) = \mathbf{E}\big[C(Y, g(X))\big] = \mathbf{E}\big[I_{\{Y \ne g(X)\}}\big] = \mathbf{P}\big(Y \ne g(X)\big),$$
and the Bayes decision is of the form
$$g^*(X) = \arg\min_{y'} \sum_{y \ne y'} P_y(X) = \arg\max_{y'} P_{y'}(X),$$
which is also called the maximum a posteriori (MAP) decision.
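A quick self-contained check, with illustrative posterior values, that the 0-1 cost matrix turns the expected-cost minimization into the arg max of the posterior:

```python
import numpy as np

posteriors = np.array([0.5, 0.3, 0.2])            # illustrative values of P_y(x)
zero_one_cost = 1 - np.eye(3)                     # C(y, y') = 1 if y != y', 0 if y = y'
expected_cost = posteriors @ zero_one_cost        # entry y' equals 1 - P_{y'}(x)
print(np.argmin(expected_cost) == np.argmax(posteriors))   # True: the MAP decision
```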

If the distribution of the observation vector $X$ has a density, then the Bayes decision has an equivalent formulation. Introduce the notation for the density of $X$,
$$\mathbf{P}(X \in B) = \int_B f(x)\, dx,$$
for the class-conditional densities,
$$\mathbf{P}(X \in B \mid Y = y) = \int_B f_y(x)\, dx,$$
and for the a priori probabilities,
$$q_y = \mathbf{P}(Y = y).$$
Then it is easy to check that
$$P_y(x) = \mathbf{P}(Y = y \mid X = x) = \frac{q_y f_y(x)}{f(x)}, \qquad \text{where } f(x) = \sum_{y=1}^{M} q_y f_y(x).$$
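A minimal sketch of this formula with two one-dimensional Gaussian class-conditional densities; the priors, means, and the helper name posteriors are illustrative assumptions, not part of the slides.

```python
import numpy as np
from scipy.stats import norm

q = np.array([0.6, 0.4])                     # a priori probabilities q_y (illustrative)
densities = [norm(loc=0.0, scale=1.0),       # f_1: class-conditional density of class 1
             norm(loc=2.0, scale=1.0)]       # f_2: class-conditional density of class 2

def posteriors(x):
    """P_y(x) = q_y f_y(x) / f(x), with f(x) = sum_y q_y f_y(x)."""
    joint = np.array([q[y] * densities[y].pdf(x) for y in range(len(q))])
    return joint / joint.sum()

print(posteriors(1.3))                       # posterior probabilities at x = 1.3
```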

and therefore
$$g^*(x) = \arg\min_{y'} \sum_{y=1}^{M} C(y, y')\, q_y f_y(x).$$
From the proof of the Theorem we may derive a formula for the optimal risk:
$$R(g^*) = \mathbf{E}\left[\min_{y'} \sum_{y=1}^{M} C(y, y')\, P_y(X)\right].$$

If $X$ has a density, then
$$R(g^*) = \int \min_{y'} \sum_{y=1}^{M} C(y, y')\, q_y f_y(x)\, dx.$$
For the 0-1 loss, we get that
$$R(g^*) = \mathbf{E}\Big[1 - \max_{y} P_y(X)\Big],$$
which has the form, for densities,
$$R(g^*) = \int \Big(1 - \max_{y} \frac{q_y f_y(x)}{f(x)}\Big) f(x)\, dx = 1 - \int \max_{y} q_y f_y(x)\, dx.$$
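The last expression can be checked numerically on a grid for an illustrative two-class, one-dimensional Gaussian example (all parameters below are made up); this is only a rough sketch of $R^* = 1 - \int \max_y q_y f_y(x)\, dx$.

```python
import numpy as np
from scipy.stats import norm

q = np.array([0.6, 0.4])                          # illustrative a priori probabilities
densities = [norm(0.0, 1.0), norm(2.0, 1.0)]      # illustrative class-conditional densities

x = np.linspace(-10.0, 12.0, 20001)               # fine grid covering both densities
dx = x[1] - x[0]
f_y = np.stack([d.pdf(x) for d in densities])     # f_y(x) evaluated on the grid
bayes_risk = 1.0 - np.sum(np.max(q[:, None] * f_y, axis=0)) * dx
print(bayes_risk)                                 # approximate Bayes risk R*
```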

Multivariate Normal Distribution

Linear Combinations

MVN Properties

Discriminant Analysis (DA)

That is, in the multivariate normal case, we can reach the minimal risk!
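The claim can be illustrated with a plug-in Gaussian classifier: estimate the priors, class means, and class covariances from the training set, plug the fitted multivariate normal densities into the maximum a posteriori rule, and the decision approximates the Bayes decision. A minimal sketch on synthetic data; every parameter and name below is an illustrative assumption.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Synthetic 2-class, 2-dimensional Gaussian training data (illustrative parameters)
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=200)
X1 = rng.multivariate_normal([2.0, 1.0], [[1.0, -0.2], [-0.2, 1.0]], size=300)
X = np.vstack([X0, X1])
Y = np.array([0] * 200 + [1] * 300)

# Plug-in estimates: priors q_y, class means, class covariances
classes = np.unique(Y)
q_hat = np.array([(Y == c).mean() for c in classes])
dists = [multivariate_normal(mean=X[Y == c].mean(axis=0),
                             cov=np.cov(X[Y == c], rowvar=False)) for c in classes]

def g_star(x):
    """MAP decision with the estimated Gaussian class-conditional densities."""
    scores = [q_hat[k] * dists[k].pdf(x) for k in range(len(classes))]
    return classes[int(np.argmax(scores))]

print(g_star(np.array([1.0, 0.5])))              # decision for one test point
```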

A goodness-of-fit parameter, Wilks' lambda, is defined as follows:
$$\Lambda = \prod_{j=1}^{m} \frac{1}{1 + \lambda_j},$$
where $\lambda_j$ is the $j$th eigenvalue corresponding to the eigenvector described above and $m$ is the minimum of $C - 1$ and $p$.
Wilks' Lambda Test
Wilks' lambda test is used to assess which variables contribute significantly to the discriminant function. The closer Wilks' lambda is to 0, the more the variable contributes to the discriminant function. The table also provides a chi-square statistic to test the significance of Wilks' lambda. If the p-value is less than 0.05, we can conclude that the corresponding function explains group membership well.
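A hedged sketch of how Wilks' lambda and an accompanying chi-square statistic can be computed from the within-class and between-class scatter matrices. Bartlett's approximation is used for the chi-square statistic (a common textbook form); the function name is illustrative, and the output should be checked against your statistics package.

```python
import numpy as np
from scipy.stats import chi2

def wilks_lambda_test(X, Y):
    """X: (n, p) data matrix, Y: (n,) class labels with C classes.
    Returns Wilks' lambda, Bartlett's chi-square statistic, and its p-value."""
    n, p = X.shape
    classes = np.unique(Y)
    C = len(classes)
    grand_mean = X.mean(axis=0)
    W = np.zeros((p, p))                          # within-class scatter matrix
    B = np.zeros((p, p))                          # between-class scatter matrix
    for c in classes:
        Xc = X[Y == c]
        mc = Xc.mean(axis=0)
        W += (Xc - mc).T @ (Xc - mc)
        B += len(Xc) * np.outer(mc - grand_mean, mc - grand_mean)
    eigvals = np.linalg.eigvals(np.linalg.solve(W, B)).real   # eigenvalues of W^{-1} B
    lam = np.prod(1.0 / (1.0 + eigvals))                      # Wilks' lambda
    stat = -(n - 1 - (p + C) / 2.0) * np.log(lam)             # Bartlett's chi-square approximation
    df = p * (C - 1)
    return lam, stat, chi2.sf(stat, df)
```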