LECTURE 01: COURSE OVERVIEW

LECTURE 01: COURSE OVERVIEW
Objectives: Terminology, The Design Cycle, Generalization and Risk, The Bayesian Approach
Resources: Syllabus, Internet Books and Notes, D.H.S: Chapter 1, Glossary

Terminology
Pattern Recognition: "the act of taking in raw data and taking an action based on the category of the pattern."
Common Applications: speech recognition, fingerprint identification (biometrics), DNA sequence identification.
Related Terminology:
Machine Learning: the ability of a machine to improve its performance based on previous results.
Machine Understanding: acting on the intentions of the user generating the data.
Related Fields: artificial intelligence, signal processing, and discipline-specific research (e.g., target recognition, speech recognition, natural language processing).

Recognition or Understanding? Which of these images is most scenic? How can we develop a system to automatically determine scenic beauty? (Hint: feature combination.) Solutions to such problems require good feature extraction and good decision theory.

Features Are Confusable In real problems, features are confusable and reflect actual variation in the data. Regions of overlap between class distributions represent the classification error. Error rates can be computed with knowledge of the joint probability distributions (see OCW-MIT-6-450 Fall 2006). Context can be used to reduce overlap. The traditional role of the signal processing engineer has been to develop better features (e.g., "invariants").
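To make the overlap idea concrete, here is a minimal numerical sketch (not from the lecture) that estimates the Bayes error for two hypothetical one-dimensional Gaussian class-conditional densities; the means, variances, and priors are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

# Hypothetical class-conditional densities with equal priors (illustrative values).
x = np.linspace(-10.0, 10.0, 100001)
p1 = 0.5 * gaussian_pdf(x, mu=-1.0, sigma=1.0)   # P(w1) * p(x|w1)
p2 = 0.5 * gaussian_pdf(x, mu=+1.0, sigma=1.0)   # P(w2) * p(x|w2)

# The Bayes error integrates the smaller joint density over the overlap region.
dx = x[1] - x[0]
bayes_error = np.sum(np.minimum(p1, p2)) * dx
print(f"Bayes error ~ {bayes_error:.4f}")        # about 0.1587 for these values
```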

Decomposition A pattern recognition system decomposes into a chain of stages: Input -> Sensing -> Segmentation -> Feature Extraction -> Classification -> Post-Processing -> Decision.
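The decomposition can be read as a simple processing chain. The sketch below is a toy illustration of that flow; every stage is a made-up placeholder, not an implementation from the course.

```python
# Toy stages: each maps its input to the next representation in the chain.
def sensing(raw):
    return [float(v) for v in raw]                              # digitize the signal

def segmentation(signal):
    return [signal[i:i + 2] for i in range(0, len(signal), 2)]  # fixed-length windows

def feature_extraction(segments):
    return [sum(seg) / len(seg) for seg in segments]            # mean of each segment

def classification(features):
    return ["A" if f > 0.5 else "B" for f in features]          # toy threshold rule

def post_processing(labels):
    return max(set(labels), key=labels.count)                   # majority-vote decision

def decide(raw):
    """Sensing -> Segmentation -> Feature Extraction -> Classification -> Post-Processing."""
    return post_processing(classification(feature_extraction(segmentation(sensing(raw)))))

print(decide([0.9, 0.8, 0.1, 0.2, 0.7, 0.6]))                   # -> "A"
```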

The Design Cycle
Start -> Collect Data -> Choose Features -> Choose Model -> Train Classifier -> Evaluate Classifier -> End.
Key issues: "There is no data like more data." Perceptually meaningful features? How do we find the best model? How do we estimate parameters? How do we evaluate performance?
Goal of the course: introduce you to mathematically rigorous ways to train and evaluate models.

Common Mistakes
"I got 100% accuracy on..." Almost any algorithm works some of the time, but few real-world problems have ever been completely solved. Training on the evaluation data is forbidden: once you use evaluation data, you should discard it.
"My algorithm is better because..." Statistical significance and experimental design play a big role in determining the validity of a result. There is always some probability that a random choice of an algorithm will produce a better result. Hence, in this course, we will also learn how to evaluate algorithms.
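As a concrete illustration of the significance point, here is a minimal permutation-test sketch (my own toy example, not course material) for judging whether an observed accuracy difference between two classifiers could plausibly arise by chance; the per-example correctness vectors are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Simulated per-example correctness (True = correct) for two classifiers.
correct_a = rng.random(n) < 0.82    # classifier A, roughly 82% accurate
correct_b = rng.random(n) < 0.78    # classifier B, roughly 78% accurate
observed = correct_a.mean() - correct_b.mean()

# Null hypothesis: the two systems are interchangeable, so randomly swapping
# their outcomes per example should often produce differences this large.
count = 0
for _ in range(10_000):
    swap = rng.random(n) < 0.5
    a = np.where(swap, correct_b, correct_a)
    b = np.where(swap, correct_a, correct_b)
    count += abs(a.mean() - b.mean()) >= abs(observed)
print(f"observed difference = {observed:.3f}, p ~ {count / 10_000:.3f}")
```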

Image Processing Example
Sorting fish: incoming fish are sorted according to species (sea bass or salmon?) using optical sensing.
Problem analysis: set up a camera and take some sample images to extract features. Consider features such as length, lightness, width, number and shape of fins, position of mouth, etc.
The example exercises the first stages of the decomposition: sensing, segmentation, and feature extraction.

Length As A Discriminator Conclusion: length is a poor discriminator.

Add Another Feature Lightness is a better feature than length because it reduces the misclassification error. Can we combine features in such a way that we improve performance? (Hint: correlation)

Width And Lightness Treat the features as an N-tuple (here, a two-dimensional vector). Create a scatter plot. Draw a line (regression) separating the two classes.
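A minimal sketch of this idea: fit a separating line to two-dimensional (width, lightness) feature vectors by least squares. The data below are synthetic stand-ins for the fish example, not measurements from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic (width, lightness) features for the two classes (illustrative only).
salmon = rng.normal(loc=[3.0, 6.0], scale=0.6, size=(50, 2))
sea_bass = rng.normal(loc=[5.0, 3.5], scale=0.6, size=(50, 2))
X = np.vstack([salmon, sea_bass])
y = np.hstack([np.ones(50), -np.ones(50)])       # +1 = salmon, -1 = sea bass

# Least-squares linear discriminant: minimize ||Xa @ w - y||^2 with a bias term.
Xa = np.hstack([X, np.ones((100, 1))])
w, *_ = np.linalg.lstsq(Xa, y, rcond=None)

predictions = np.sign(Xa @ w)
print(f"training accuracy = {(predictions == y).mean():.2%}")
# The decision boundary is the line w[0]*width + w[1]*lightness + w[2] = 0.
```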

Decision Theory Can we do better than a linear classifier? What is wrong with this decision surface? (Hint: generalization)

Generalization and Risk Why might a smoother decision surface be a better choice? (Hint: Occam’s Razor). This course investigates how to find such “optimal” decision surfaces and how to provide system designers with the tools to make intelligent trade-offs.
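To see why the smoother surface tends to generalize better, here is a toy sketch (synthetic data of my own choosing): a high-order polynomial drives the training error down by chasing noise, while a low-order fit typically does better on unseen points.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: a straight line plus noise (illustrative assumption).
x_train = np.linspace(0.0, 1.0, 12)
y_train = 2.0 * x_train + rng.normal(0.0, 0.2, size=12)
x_test = np.linspace(0.0, 1.0, 200)
y_test = 2.0 * x_test                            # noise-free ground truth

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```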

Correlation Degrees of difficulty: the accompanying scatter plots (not reproduced in this transcript) illustrate increasingly difficult class configurations. Real data is often much harder.

First Principle There are many excellent resources on the Internet that demonstrate pattern recognition concepts, and many MATLAB toolboxes that implement state-of-the-art algorithms. One such resource is a Java applet that lets you quickly explore how a variety of algorithms process the same data. An important first principle: there are no magic equations or algorithms. You must understand the properties of your data and what a priori knowledge you can bring to bear on the problem.

Bayesian Formulations
The channel model of speech (figure): a message source passes the message through linguistic, articulatory, and acoustic channels, producing in turn the message, words, phones, and features.
Bayesian formulation for speech recognition:
Objective: minimize the word error rate by maximizing $P(W|A)$, the posterior probability of the word sequence $W$ given the acoustic observations $A$.
Approach: by Bayes Rule, $P(W|A) = \frac{P(A|W)\,P(W)}{P(A)}$, so maximize $P(A|W)\,P(W)$, where $P(A|W)$ is the acoustic model (hidden Markov models, Gaussian mixtures, etc.) estimated during training, $P(W)$ is the language model (finite state machines, N-grams), and $P(A)$ is the probability of the acoustics (ignored during maximization, since it does not depend on $W$).
Bayes Rule allows us to convert the problem of estimating an unknown posterior probability into a process in which we can postulate a model, collect data under controlled conditions, and estimate the parameters of the model.
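A tiny numeric sketch of this decoding step (toy probabilities of my own invention): pick the hypothesis with the largest $P(A|W)\,P(W)$, since $P(A)$ is the same for every candidate.

```python
# Toy Bayes-rule decoding with hypothetical acoustic and language model scores.
acoustic = {"wreck a nice beach": 0.012, "recognize speech": 0.008}   # P(A|W)
language = {"wreck a nice beach": 0.10, "recognize speech": 0.90}     # P(W)

# P(W|A) is proportional to P(A|W) * P(W); P(A) is constant across hypotheses.
scores = {w: acoustic[w] * language[w] for w in acoustic}
best = max(scores, key=scores.get)
print(best, scores)   # the language model overrules the better acoustic match
```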

Summary Pattern recognition vs. machine learning vs. machine understanding First principle of pattern recognition? We will focus more on decision theory and less on feature extraction. This course emphasizes statistical and data-driven methods for optimizing system design and parameter values. Second most important principle?

Feature Extraction (figure-only slide)

Generalization And Risk How much can we trust isolated data points? The panels in the figure are captioned: "Optimal decision surface is a line"; "Optimal decision surface changes abruptly"; "Optimal decision surface still a line." Can we integrate prior knowledge about data, confidence, or willingness to take risk?

Review Normal (Gaussian) Distributions Multivariate Normal (Gaussian) Distributions Support Regions: a convenient visualization tool

Normal Distributions
Recall the definition of a normal distribution (Gaussian):
$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$
Why is this distribution so important in engineering?
Mean: $\mu = E[x]$. Variance: $\sigma^2 = E[(x-\mu)^2]$.
Statistical independence? Higher-order moments? Occam's Razor? Entropy? Linear combinations of normal random variables? Central Limit Theorem?

Univariate Normal Distribution
A normal or Gaussian density is a powerful model for continuous-valued feature vectors corrupted by noise due to its analytical tractability.
Univariate normal distribution:
$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$
where the mean and variance are defined by:
$\mu = E[x] = \int_{-\infty}^{\infty} x\,p(x)\,dx \qquad \sigma^2 = E[(x-\mu)^2] = \int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx$
The entropy of a univariate normal distribution is given by:
$H(p) = \frac{1}{2}\ln\left(2\pi e \sigma^2\right)$
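A small numerical check (my own sketch, not course code) that the density integrates to one, has the stated moments, and matches the closed-form entropy; the values of $\mu$ and $\sigma$ are arbitrary.

```python
import numpy as np

mu, sigma = 2.0, 1.5   # arbitrary illustrative parameters

def normal_pdf(x):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = x[1] - x[0]
p = normal_pdf(x)

print("integral ~", np.sum(p) * dx)                      # ~ 1.0
print("mean     ~", np.sum(x * p) * dx)                  # ~ mu
print("variance ~", np.sum((x - mu) ** 2 * p) * dx)      # ~ sigma**2

# Differential entropy: numerical integral vs the closed form 0.5*ln(2*pi*e*sigma^2).
print("entropy  ~", -np.sum(p * np.log(p)) * dx,
      "closed form:", 0.5 * np.log(2 * np.pi * np.e * sigma ** 2))
```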

Mean and Variance A normal distribution is completely specified by its mean and variance. The peak is at $x = \mu$. Approximately 68% of the area lies within one $\sigma$ of the mean; 95% within two $\sigma$; 99.7% within three $\sigma$. A normal distribution achieves the maximum entropy of all distributions having a given mean and variance. Central Limit Theorem: the sum of a large number of small, independent random variables tends toward a Gaussian distribution.
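A quick simulation sketch of the Central Limit Theorem (illustrative, not from the slides): sums of independent uniform variables, each far from Gaussian individually, land within one, two, and three standard deviations at close to the Gaussian rates.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sum n independent Uniform(0,1) variables, repeated 100,000 times.
n = 30
sums = rng.random((100_000, n)).sum(axis=1)

# CLT prediction: the sum has mean n/2 and variance n/12.
standardized = (sums - n / 2) / np.sqrt(n / 12)
print("mean ~", standardized.mean(), " std ~", standardized.std())

# Fractions within k standard deviations (compare to ~0.68 / 0.95 / 0.997):
for k in (1, 2, 3):
    print(f"within {k} sigma:", np.mean(np.abs(standardized) <= k))
```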

Multivariate Normal Distributions
A multivariate normal distribution is defined as:
$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]$
where $\boldsymbol{\mu}$ represents the mean (vector) and $\boldsymbol{\Sigma}$ represents the covariance (matrix). Note that the exponent term is a weighted Euclidean distance, the Mahalanobis distance. The covariance matrix is always symmetric and positive semidefinite. How does the shape vary as a function of the covariance?
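A compact sketch (parameters of my own choosing) that evaluates this density directly from the formula, making the Mahalanobis-distance exponent explicit.

```python
import numpy as np

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density evaluated straight from the formula."""
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha2) / norm

# Illustrative 2-D parameters (assumed, not from the lecture).
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
print(mvn_pdf(np.array([1.0, -0.5]), mu, Sigma))
```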

Support Regions A support region is obtained by intersecting a Gaussian distribution with a plane. For a horizontal plane, this intersection generates an ellipse whose points have equal probability density. The shape of the support region is defined by the covariance matrix.
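A short sketch (covariance values assumed for illustration) of how the covariance matrix shapes this equal-density ellipse: the eigenvectors of $\boldsymbol{\Sigma}$ give the ellipse axes and the square roots of the eigenvalues scale the axis lengths.

```python
import numpy as np

# Illustrative covariance (assumed values): correlated, unequal variances.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Eigendecomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Points at Mahalanobis distance c lie on an ellipse whose semi-axes have
# length c * sqrt(eigval) along the corresponding eigenvector directions.
c = 1.0
for lam, v in zip(eigvals, eigvecs.T):
    print(f"semi-axis length {c * np.sqrt(lam):.3f} along direction {v}")
```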

Derivation (figure-only slide)

Identity Covariance (figure-only slide)

Unequal Variances (figure-only slide)

Nonzero Off-Diagonal Elements (figure-only slide)

Unconstrained or "Full" Covariance (figure-only slide)