ECE 8443 – Pattern Recognition / ECE 8527 – Introduction to Machine Learning and Pattern Recognition
LECTURE 12: Advanced Discriminant Analysis
Objectives: Heteroscedastic LDA, Independent Component Analysis, Examples
Resources: Java PR Applet, W.P.: Fisher, DTREG: LDA, S.S.: DFA

ECE 8527: Lecture 12, Slide 1: Heteroscedastic Linear Discriminant Analysis (HLDA)
Heteroscedastic: random variables that have different variances. When might we observe heteroscedasticity?
 Suppose 100 students enroll in a typing class, some of whom have typing experience and some of whom do not.
 After the first class there would be a great deal of dispersion in the number of typing mistakes.
 After the final class the dispersion would be smaller.
 The error variance is non-constant: it decreases as time increases.
In the example shown on the slide, the two classes have nearly the same mean but different variances, and the variances differ in one direction. LDA would project these classes onto a line that does not achieve maximal separation. HLDA seeks a transform that accounts for the unequal variances. HLDA is typically useful when classes have significant overlap.
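A small numerical illustration of this situation (the data below are synthetic and not taken from the slide): two classes whose means coincide but whose variances differ sharply along one axis, which is exactly the case where LDA's mean-based between-class scatter carries no information.

import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Two classes with (nearly) the same mean but very different variance along the second axis.
x1 = rng.multivariate_normal([0.0, 0.0], np.diag([1.0, 0.1]), n)
x2 = rng.multivariate_normal([0.0, 0.0], np.diag([1.0, 3.0]), n)

mean_gap = np.linalg.norm(x1.mean(axis=0) - x2.mean(axis=0))   # ~0: LDA's between-class scatter vanishes
var_ratio = x2.var(axis=0) / x1.var(axis=0)                    # ~[1, 30]: axis 2 separates the classes by variance
print(mean_gap, var_ratio)

Because the class means are indistinguishable, any projection chosen to separate the means (LDA) is uninformative here, while a variance-aware criterion such as HLDA can still exploit the second axis.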

ECE 8527: Lecture 12, Slide 2: Partitioning Our Parameter Vector
Let W be partitioned into the first p columns, corresponding to the dimensions we retain, and the remaining d-p columns, corresponding to the dimensions we discard. Then the dimensionality reduction problem can be viewed in two steps:
 A non-singular transform is applied to x to transform the features, and
 A dimensionality reduction is performed in which the output of this linear transformation, y, is reduced to a lower-dimensional vector, y_p.
We partition the means and variances accordingly: the blocks associated with the discarded d-p dimensions are common to all classes, while the blocks associated with the retained p dimensions are different for each class. (A sketch of this partition follows below.)
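A hedged reconstruction of the partition in LaTeX, assuming the convention y = W^T x so that the first p columns of W produce the retained dimensions (the block notation below is mine, not copied from the slide):

\[
W = \begin{bmatrix} W_p & W_{d-p} \end{bmatrix}, \qquad
y = W^{\mathsf{T}} x, \qquad
y_p = W_p^{\mathsf{T}} x
\]
\[
\mu_j = \begin{bmatrix} \mu_j^{(p)} \\ \mu_0 \end{bmatrix}, \qquad
\Sigma_j = \begin{bmatrix} \Sigma_j^{(p)} & 0 \\ 0 & \Sigma_0 \end{bmatrix}
\]

Here the superscript (p) blocks are class-dependent, while μ_0 and Σ_0 (the d-p discarded dimensions) are shared by all classes; this is the modeling assumption that lets HLDA discard those dimensions without losing class information.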

ECE 8527: Lecture 12, Slide 3: Density and Likelihood Functions
The density function of a data point under the model, assuming a Gaussian model (as we did with PCA and LDA), is written in terms of an indicator function that gives the class assignment of each data point. (This is simply the density function for the transformed data.) The log likelihood function is the sum of the log densities over all data points. Differentiating the likelihood with respect to the unknown means and variances yields the class-wise sample means and sample covariances of the transformed data. (A reconstruction of these formulas is sketched below.)
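A hedged reconstruction of the missing formulas, following the standard HLDA formulation; θ denotes the full-rank transform, g(i) the class label of data point x_i, and N_j the number of points in class j (this notation is my assumption, not taken verbatim from the slide):

\[
p(x_i) = \frac{|\det \theta|}{(2\pi)^{d/2}\,|\Sigma_{g(i)}|^{1/2}}
\exp\!\Big( -\tfrac{1}{2}\,(\theta x_i - \mu_{g(i)})^{\mathsf{T}} \Sigma_{g(i)}^{-1} (\theta x_i - \mu_{g(i)}) \Big)
\]
\[
\log L = \sum_{i=1}^{N} \Big[ \log|\det\theta| - \tfrac{1}{2}\log|\Sigma_{g(i)}|
- \tfrac{1}{2}\,(\theta x_i - \mu_{g(i)})^{\mathsf{T}} \Sigma_{g(i)}^{-1} (\theta x_i - \mu_{g(i)}) \Big] + \text{const}
\]
\[
\hat{\mu}_j = \frac{1}{N_j}\sum_{i:\,g(i)=j} \theta x_i, \qquad
\hat{\Sigma}_j = \frac{1}{N_j}\sum_{i:\,g(i)=j} (\theta x_i - \hat{\mu}_j)(\theta x_i - \hat{\mu}_j)^{\mathsf{T}}
\]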

ECE 8527: Lecture 12, Slide 4: Optimal Solution
Substituting the optimal means and variances back into the likelihood equation and then maximizing with respect to θ gives equations that do not have a closed-form solution. For the general case, we must solve them iteratively using a gradient descent algorithm and a two-step process in which we estimate the means and variances from θ and then estimate the optimal value of θ from the means and variances. Simplifications exist for diagonal and equal covariances, but the benefits of the algorithm diminish in these cases. To classify data, one computes the log-likelihood distance from each class and assigns the class with the maximum likelihood. Let's work some examples (class-dependent PCA, LDA, and HLDA). HLDA training is significantly more expensive than PCA or LDA, but classification has the same complexity as PCA and LDA, because it is still essentially a linear transformation plus a Mahalanobis distance computation.
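A minimal classification sketch, assuming the transform theta and the per-class means and covariances have already been estimated by the iterative procedure above (the function name and calling convention are illustrative, not from the slide):

import numpy as np

def hlda_classify(x, theta, means, covs):
    # Project the sample with the trained transform, then score each class with a
    # Gaussian log-likelihood: a Mahalanobis distance plus a log-determinant penalty,
    # i.e., the "linear transform + Mahalanobis distance" computation described above.
    y = theta @ x
    scores = []
    for mu, sigma in zip(means, covs):
        diff = y - mu
        maha = diff @ np.linalg.solve(sigma, diff)
        scores.append(-0.5 * (maha + np.linalg.slogdet(sigma)[1]))
    return int(np.argmax(scores))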

ECE 8527: Lecture 12, Slide 5: Independent Component Analysis (ICA)
The goal is to discover underlying structure in a signal. ICA originally gained popularity for applications in blind source separation (BSS), the process of extracting one or more unknown source signals from observed mixtures (e.g., the cocktail party effect). It is most often applied to time-series analysis, though it can also be used for traditional pattern recognition problems. Define the observed signal as a linear mixture of statistically independent signals: x = As, where A is an unknown mixing matrix and the components of s are statistically independent. If we can estimate A, then we can compute s by inverting A: s = A⁻¹x. This is the basic principle of blind deconvolution or BSS. The unique aspect of ICA is that it models x as a mixture of statistically independent non-Gaussian signals. Why?
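A minimal numerical sketch of the mixing model (all numbers below are made up for illustration): generate two independent non-Gaussian sources, mix them with a matrix A, and verify that knowing A lets us recover the sources. ICA's job is to estimate A (or its inverse) without ever seeing it.

import numpy as np

rng = np.random.default_rng(0)
n = 5000
# Two independent, non-Gaussian sources (one super-Gaussian, one sub-Gaussian).
s = np.column_stack([
    rng.laplace(0.0, 1.0, n),
    rng.uniform(-1.0, 1.0, n),
])
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])              # mixing matrix (unknown in a real BSS problem)
x = s @ A.T                              # observed mixtures: x = A s

s_recovered = x @ np.linalg.inv(A).T     # s = A^{-1} x, possible here only because A is known
print(np.allclose(s_recovered, s))       # True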

ECE 8527: Lecture 12, Slide 6: Objective Function
Unlike mean-square-error approaches, ICA optimizes the parameters of the model using a variety of information-theoretic measures (definitions are sketched below):
 Mutual information
 Negentropy
 Maximum likelihood
Mutual information and negentropy are closely related, which is why minimizing the former is equivalent to maximizing the latter.
It is common in ICA to zero-mean and prewhiten the data (using PCA) so that the technique can focus on the non-Gaussian aspects of the data. Since these are linear operations, they do not impact the non-Gaussian aspects of the model. There is no closed-form solution for the problem described above, and a gradient descent approach must be used to find the model parameters. We will need to develop more powerful mathematics to do this (e.g., the Expectation-Maximization algorithm).
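A hedged sketch of the standard definitions these objectives refer to, following Hyvärinen's formulation; H denotes differential entropy and y_gauss a Gaussian variable with the same covariance as y (notation assumed, not copied from the slide):

\[
I(y_1, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(y)
\qquad\text{(mutual information)}
\]
\[
J(y) = H(y_{\text{gauss}}) - H(y)
\qquad\text{(negentropy)}
\]
For uncorrelated, unit-variance components they are related by
\[
I(y_1, \ldots, y_n) = C - \sum_{i=1}^{n} J(y_i),
\]
where C is a constant that does not depend on the unmixing matrix, so minimizing mutual information amounts to maximizing the total negentropy (non-Gaussianity) of the components.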

ECE 8527: Lecture 12, Slide 7: FastICA
One very popular algorithm for ICA is based on finding a projection of x that maximizes non-Gaussianity, measured by an approximation to negentropy of the form J(y) ≈ [E{G(y)} − E{G(ν)}]², where G is a nonlinear contrast function and ν is a standard Gaussian variable. Use an iterative equation solver to find the weight vector w (a runnable sketch follows below):
 Choose an initial random guess for w.
 Compute: w⁺ = E{x g(wᵀx)} − E{g′(wᵀx)} w, where g is the derivative of the contrast function G.
 Let: w = w⁺ / ||w⁺||.
 If the direction of w is still changing (not yet converged), iterate.
Later in the course we will see many iterative algorithms of this form and formally derive their properties. FastICA is very similar to a gradient descent solution of the maximum likelihood equations. ICA has been successfully applied to a wide variety of BSS problems, including audio, EEG, and financial data.
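A minimal sketch of the one-unit iteration above, assuming the data have already been zero-meaned and prewhitened; the function name and the choice g = tanh are illustrative assumptions, not taken from the slide:

import numpy as np

def fastica_one_unit(z, max_iter=200, tol=1e-6, seed=0):
    # Estimate one unmixing vector w for whitened data z of shape (n_samples, n_features).
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(z.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        wz = z @ w                                   # projections w^T z
        g = np.tanh(wz)                              # nonlinearity g
        g_prime = 1.0 - g ** 2                       # its derivative g'
        w_new = (z * g[:, None]).mean(axis=0) - g_prime.mean() * w   # w+ = E{z g(w^T z)} - E{g'(w^T z)} w
        w_new /= np.linalg.norm(w_new)               # renormalize to unit length
        if 1.0 - abs(w_new @ w) < tol:               # stop when the direction stops changing (up to sign)
            return w_new
        w = w_new
    return w

Applied to the whitened mixtures from the earlier example, the returned w projects the data onto approximately one of the independent sources, up to sign and scale.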

ECE 8527: Lecture 12, Slide 8 Summary HLDA is used when random variables have different variances. ICA assumes the random variables are linear mixtures of some unknown latent variables, and the mixing system is also unknown. The latent variables are assumed non-Gaussian and mutually independent, and they are called the independent components of the observed data. These independent components, also called sources or factors, can be found by ICA. There are many other forms of component analysis including neural network based approaches (e.g., nonlinear PCA, learning vector quantization – LVQ), kernel-based approaches that use data-driven kernels, probabilistic ICA, and support vector machines.