Pattern Recognition PhD Course.

Automatic Letter Recognition
Steps of the letter recognition:
1. Creating a training set:
   - separating the characters from the text
   - creating a feature vector for each separated character
2. Classifying unknown characters using the training set
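A minimal sketch of the feature-extraction step above, assuming the characters have already been separated into binarized image arrays; the grid-density feature and all names are illustrative choices, not taken from the slides.

```python
import numpy as np

def grid_density_features(char_img, grid=(4, 4)):
    """Illustrative feature vector: ink density in each cell of a coarse grid.
    char_img is assumed to be a 2D binary array (1 = ink, 0 = background)."""
    feats = []
    for row_block in np.array_split(char_img, grid[0], axis=0):
        for cell in np.array_split(row_block, grid[1], axis=1):
            feats.append(cell.mean())        # fraction of ink pixels in this cell
    return np.array(feats)                   # p = grid[0] * grid[1] dimensional feature vector

# Training set: one (feature vector, label) pair per separated character, e.g.
# training_X = np.stack([grid_density_features(img) for img in separated_chars])
# training_Y = np.array(labels)
```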

The training set?

Pattern Recognition
$X$: $p$-dimensional random vector, the feature vector.
$Y$: discrete random variable, the class label, with range $R_Y = \{1, 2, \dots, M\}$.
Decision function: $g: \mathbb{R}^p \to R_Y$.
If $g(X) \neq Y$, then the decision makes an error.

In the formulation of the Bayes decision problem, introduce a cost function $C(y, y') \ge 0$, which is the cost incurred if the label is $Y = y$ and the decision is $g(X) = y'$. For a decision function $g$, the risk is the expectation of the cost:
$$R(g) = \mathbf{E}\big[C(Y, g(X))\big].$$
In the Bayes decision problem, the aim is to minimize the risk, i.e., the goal is to find a function $g^*: \mathbb{R}^p \to \{1, 2, \dots, M\}$ such that
$$R(g^*) = \min_{g: \mathbb{R}^p \to \{1, 2, \dots, M\}} R(g),$$
where $g^*$ is called the Bayes decision function and $R^* = R(g^*)$ is the Bayes risk.

For the a posteriori probabilities, introduce the notation
$$P_y(X) = \mathbf{P}(Y = y \mid X).$$
Let the decision function $g^*$ be defined by
$$g^*(X) = \arg\min_{y'} \sum_{y=1}^{M} C(y, y')\, P_y(X).$$
If the arg min is not unique, choose the smallest $y'$ that minimizes the sum. This definition implies that for any decision function $g$,
$$\sum_{y=1}^{M} C(y, g^*(X))\, P_y(X) \le \sum_{y=1}^{M} C(y, g(X))\, P_y(X).$$
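As a hedged illustration of this rule: given the posterior vector $(P_1(x), \dots, P_M(x))$ and the cost matrix $C(y, y')$, the Bayes decision is the $y'$ minimizing the expected cost, with ties broken toward the smallest index. The function name bayes_decision and all numbers below are made up for the example.

```python
import numpy as np

def bayes_decision(posteriors, cost):
    """posteriors[y] = P_y(x) for y = 0, ..., M-1;
    cost[y, yp] = C(y, y'), the cost of deciding y' when the true label is y.
    Returns the y' minimizing sum_y C(y, y') P_y(x); ties go to the smallest y'."""
    expected_cost = posteriors @ cost      # entry y' equals sum_y P_y(x) C(y, y')
    return int(np.argmin(expected_cost))   # argmin returns the first (smallest) index on ties

# Toy example with M = 3 classes and an asymmetric, illustrative cost matrix
posteriors = np.array([0.5, 0.3, 0.2])
cost = np.array([[0, 1, 4],
                 [1, 0, 1],
                 [2, 1, 0]])
print(bayes_decision(posteriors, cost))    # expected costs are (0.7, 0.7, 2.3) -> decision 0
```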

Theorem. For any decision function $g$, we have $R(g^*) \le R(g)$.
Proof. For a decision function $g$, let us calculate the risk:
$$R(g) = \mathbf{E}\big[C(Y, g(X))\big] = \mathbf{E}\Big[\mathbf{E}\big[C(Y, g(X)) \mid X\big]\Big] = \mathbf{E}\left[\sum_{y=1}^{M} \sum_{y'=1}^{M} C(y, y')\, \mathbf{P}\big(Y = y,\, g(X) = y' \mid X\big)\right]$$
$$= \mathbf{E}\left[\sum_{y=1}^{M} \sum_{y'=1}^{M} C(y, y')\, I_{\{g(X) = y'\}}\, \mathbf{P}(Y = y \mid X)\right] = \mathbf{E}\left[\sum_{y=1}^{M} C(y, g(X))\, P_y(X)\right].$$
This implies that
$$R(g) = \mathbf{E}\left[\sum_{y=1}^{M} C(y, g(X))\, P_y(X)\right] \ge \mathbf{E}\left[\sum_{y=1}^{M} C(y, g^*(X))\, P_y(X)\right] = R(g^*). \qquad\blacksquare$$

Concerning the cost function, the most frequently studied example is the so-called 0-1 loss:
$$C(y, y') = \begin{cases} 1 & \text{if } y \ne y' \\ 0 & \text{if } y = y' \end{cases}$$
For the 0-1 loss, the corresponding risk is the error probability:
$$R(g) = \mathbf{E}\big[C(Y, g(X))\big] = \mathbf{E}\big[I_{\{Y \ne g(X)\}}\big] = \mathbf{P}\big(Y \ne g(X)\big),$$
and the Bayes decision is of the form
$$g^*(X) = \arg\min_{y'} \sum_{y \ne y'} P_y(X) = \arg\max_{y'} P_{y'}(X),$$
which is also called the maximum a posteriori (MAP) decision.
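A quick self-contained check, with illustrative posterior values, that the 0-1 cost matrix turns the expected-cost minimization into the arg max of the posterior:

```python
import numpy as np

posteriors = np.array([0.5, 0.3, 0.2])            # illustrative values of P_y(x)
zero_one_cost = 1 - np.eye(3)                     # C(y, y') = 1 if y != y', 0 if y = y'
expected_cost = posteriors @ zero_one_cost        # entry y' equals 1 - P_{y'}(x)
print(np.argmin(expected_cost) == np.argmax(posteriors))   # True: the MAP decision
```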

If the distribution of the observation vector $X$ has a density, then the Bayes decision has an equivalent formulation. Introduce the notation for the density of $X$,
$$\mathbf{P}(X \in B) = \int_B f(x)\, dx,$$
for the class-conditional densities,
$$\mathbf{P}(X \in B \mid Y = y) = \int_B f_y(x)\, dx,$$
and for the a priori probabilities,
$$q_y = \mathbf{P}(Y = y).$$
Then it is easy to check that
$$P_y(x) = \mathbf{P}(Y = y \mid X = x) = \frac{q_y f_y(x)}{f(x)}, \qquad \text{where } f(x) = \sum_{y=1}^{M} q_y f_y(x).$$
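A minimal sketch of this formula with two one-dimensional Gaussian class-conditional densities; the priors, means, and the helper name posteriors are illustrative assumptions, not part of the slides.

```python
import numpy as np
from scipy.stats import norm

q = np.array([0.6, 0.4])                     # a priori probabilities q_y (illustrative)
densities = [norm(loc=0.0, scale=1.0),       # f_1: class-conditional density of class 1
             norm(loc=2.0, scale=1.0)]       # f_2: class-conditional density of class 2

def posteriors(x):
    """P_y(x) = q_y f_y(x) / f(x), with f(x) = sum_y q_y f_y(x)."""
    joint = np.array([q[y] * densities[y].pdf(x) for y in range(len(q))])
    return joint / joint.sum()

print(posteriors(1.3))                       # posterior probabilities at x = 1.3
```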

and therefore
$$g^*(x) = \arg\min_{y'} \sum_{y=1}^{M} C(y, y')\, q_y f_y(x).$$
From the proof of the Theorem we may derive a formula for the optimal risk:
$$R(g^*) = \mathbf{E}\left[\min_{y'} \sum_{y=1}^{M} C(y, y')\, P_y(X)\right].$$

If $X$ has a density, then
$$R(g^*) = \int \min_{y'} \sum_{y=1}^{M} C(y, y')\, q_y f_y(x)\, dx.$$
For the 0-1 loss, we get that
$$R(g^*) = \mathbf{E}\Big[1 - \max_{y} P_y(X)\Big],$$
which has the form, for densities,
$$R(g^*) = \int \Big(1 - \max_{y} \frac{q_y f_y(x)}{f(x)}\Big) f(x)\, dx = 1 - \int \max_{y} q_y f_y(x)\, dx.$$
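The last expression can be checked numerically on a grid for an illustrative two-class, one-dimensional Gaussian example (all parameters below are made up); this is only a rough sketch of $R^* = 1 - \int \max_y q_y f_y(x)\, dx$.

```python
import numpy as np
from scipy.stats import norm

q = np.array([0.6, 0.4])                          # illustrative a priori probabilities
densities = [norm(0.0, 1.0), norm(2.0, 1.0)]      # illustrative class-conditional densities

x = np.linspace(-10.0, 12.0, 20001)               # fine grid covering both densities
dx = x[1] - x[0]
f_y = np.stack([d.pdf(x) for d in densities])     # f_y(x) evaluated on the grid
bayes_risk = 1.0 - np.sum(np.max(q[:, None] * f_y, axis=0)) * dx
print(bayes_risk)                                 # approximate Bayes risk R*
```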

Multivariate Normal Distribution

Linear Combinations

MVN Properties

Discriminant Analysis (DA)

That is, in the multivariate normal case, we can reach the minimal risk!
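The claim can be illustrated with a plug-in Gaussian classifier: estimate the priors, class means, and class covariances from the training set, plug the fitted multivariate normal densities into the maximum a posteriori rule, and the decision approximates the Bayes decision. A minimal sketch on synthetic data; every parameter and name below is an illustrative assumption.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Synthetic 2-class, 2-dimensional Gaussian training data (illustrative parameters)
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=200)
X1 = rng.multivariate_normal([2.0, 1.0], [[1.0, -0.2], [-0.2, 1.0]], size=300)
X = np.vstack([X0, X1])
Y = np.array([0] * 200 + [1] * 300)

# Plug-in estimates: priors q_y, class means, class covariances
classes = np.unique(Y)
q_hat = np.array([(Y == c).mean() for c in classes])
dists = [multivariate_normal(mean=X[Y == c].mean(axis=0),
                             cov=np.cov(X[Y == c], rowvar=False)) for c in classes]

def g_star(x):
    """MAP decision with the estimated Gaussian class-conditional densities."""
    scores = [q_hat[k] * dists[k].pdf(x) for k in range(len(classes))]
    return classes[int(np.argmax(scores))]

print(g_star(np.array([1.0, 0.5])))              # decision for one test point
```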

A goodness-of-fit parameter, Wilks' lambda, is defined as follows:
$$\Lambda = \prod_{j=1}^{m} \frac{1}{1 + \lambda_j},$$
where $\lambda_j$ is the $j$th eigenvalue corresponding to the eigenvector described above and $m$ is the minimum of $C - 1$ and $p$.
Wilks' Lambda Test
Wilks' lambda test is used to assess which variables contribute significantly to the discriminant function. The closer Wilks' lambda is to 0, the more the variable contributes to the discriminant function. The table also provides a chi-square statistic to test the significance of Wilks' lambda. If the p-value is less than 0.05, we can conclude that the corresponding function explains group membership well.
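A hedged sketch of how Wilks' lambda and an accompanying chi-square statistic can be computed from the within-class and between-class scatter matrices. Bartlett's approximation is used for the chi-square statistic (a common textbook form); the function name is illustrative, and the output should be checked against your statistics package.

```python
import numpy as np
from scipy.stats import chi2

def wilks_lambda_test(X, Y):
    """X: (n, p) data matrix, Y: (n,) class labels with C classes.
    Returns Wilks' lambda, Bartlett's chi-square statistic, and its p-value."""
    n, p = X.shape
    classes = np.unique(Y)
    C = len(classes)
    grand_mean = X.mean(axis=0)
    W = np.zeros((p, p))                          # within-class scatter matrix
    B = np.zeros((p, p))                          # between-class scatter matrix
    for c in classes:
        Xc = X[Y == c]
        mc = Xc.mean(axis=0)
        W += (Xc - mc).T @ (Xc - mc)
        B += len(Xc) * np.outer(mc - grand_mean, mc - grand_mean)
    eigvals = np.linalg.eigvals(np.linalg.solve(W, B)).real   # eigenvalues of W^{-1} B
    lam = np.prod(1.0 / (1.0 + eigvals))                      # Wilks' lambda
    stat = -(n - 1 - (p + C) / 2.0) * np.log(lam)             # Bartlett's chi-square approximation
    df = p * (C - 1)
    return lam, stat, chi2.sf(stat, df)
```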