Background on Classification


What is Classification?
- Objects of different types: Classes
- Sensing and digitizing
- Calculating properties: Features
- Mapping features to classes

Features (term means different things to different people)
- A Feature is a quantity used by a classifier
- A Feature Vector is an ordered list of features
- Examples: a full spectrum, a dimensionality-reduced spectrum, a differentiated spectrum, vegetation indices

Features (term means different things to different people)
- For the math behind classifiers, feature vectors are thought of as points in n-dimensional space
  - Spectra live in 426-dimensional (or 426D) space
  - (NDVI, Nitrogen, ChlorA, ChlorB) lives in 4D space
- Dimensionality reduction is used to
  - Visualize in 2D or 3D
  - Mitigate the Curse of Dimensionality

What is a Classifier? A bit more formal:
- x = feature vector (x1, x2, …, xB)'
- L = set of class labels: L = {L1, L2, …, LC}, e.g. {Pinus palustris, Quercus laevis, …}
- A (Discrete) classifier is a function f : R^B -> L

What is a Classifier? A bit more formal:
- x = feature vector (x1, x2, …, xB)'
- L = set of class labels: L = {L1, L2, …, LC}, e.g. {Pinus palustris, Quercus laevis, …}
- A (Continuous) classifier is a function f : R^B -> R^C, producing one real-valued score per class
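To make the distinction concrete, here is a minimal Python sketch. The species labels come from the slide above, but the class means and the nearest-mean rule are invented purely for illustration; they are not part of the lecture.

import numpy as np

labels = ["Pinus palustris", "Quercus laevis"]        # L = {L1, L2}
class_means = np.array([[0.2, 0.6],                   # hypothetical mean feature vector for L1
                        [0.7, 0.3]])                  # hypothetical mean feature vector for L2

def continuous_classifier(x):
    """R^B -> R^C: one score per class (here, negative distance to each class mean)."""
    return -np.linalg.norm(class_means - x, axis=1)

def discrete_classifier(x):
    """R^B -> L: map a feature vector to a single label."""
    return labels[int(np.argmax(continuous_classifier(x)))]

print(discrete_classifier(np.array([0.25, 0.55])))    # -> Pinus palustris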

Linear Discriminants
- Discriminant classifiers are designed to discriminate between classes
- Generative classifiers instead model the classes themselves

Linear Discriminant or Linear Discriminant Analysis
There are many different types. Here are some:
- Ordinary Least Squares
- Ridge Regression
- Lasso
- Canonical
- Perceptron
- Support Vector Machine (without kernels)
- Relevance Vector Machine

Linear Discriminants
- Points on the left side of a line are in the Blue Class
- Points on the right side of a line are in the Red Class
- Which line is best? What does "best" mean?

Linear Discriminants – 2 Classes
[Diagram: the features x1, …, xm (plus x0 = 1 for the bias) are multiplied by weights wk0, wk1, …, wkm and summed:
y = wk0·x0 + wk1·x1 + wk2·x2 + … + wkm·xm
BIG numbers for Class 1, small numbers for Class 2.]

Example of "Best": Support Vector Machine
- Pairwise (2 classes at a time)
- Maximizes the margin between classes
- Minimizes its objective function by solving a quadratic program

Back to Classifiers
Definition: Training. Given
- a data set X = {x1, x2, …, xN},
- corresponding desired, or target, outputs Y = {y1, y2, …, yN},
- a user-defined functional form of a classifier, e.g.
  yn = w0 + w1·xn,1 + w2·xn,2 + … + wB·xn,B,
estimate the parameters {w0, w1, …, wB}. X is called the training set.

Linear Classifiers: Ordinary Least Squares
- Continuous classifier
- Target outputs usually {0, 1} or {-1, 1}
- Minimize the squared error:
  E(w) = Σn (tn - w'xn)^2

Linear Classifiers – Least Squares
How do we minimize? Take the derivative and set it to zero:
  X'X w = X't,  so  w = (X'X)^-1 X't
or, equivalently,
  w = pinv(X)·t

Example (rows of X ~ spectra): X = …, t = …, w = pinv(X)·t = …, X·w = …
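The numerical values from this slide are not preserved in the transcript, so here is a small sketch of the same computation with made-up data (the spectra X and targets t below are random, not the slide's).

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))           # rows ~ spectra: 6 samples, 3 bands
t = np.array([1, 1, 1, -1, -1, -1])   # target outputs in {-1, +1}

# To include the bias w0, prepend a column of ones to X.
w = np.linalg.pinv(X) @ t             # w = pinv(X)·t
print(w)                              # estimated weights
print(X @ w)                          # fitted outputs, to compare against t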

Ridge Regression
Ordinary Least Squares + Regularization
- Diagonal Loading: X'X + λI
- Solution: w = (X'X + λI)^-1 X't

Ridge Regression
- Ordinary Least Squares solution: w = (X'X)^-1 X't
- Ridge Regression solution: w = (X'X + λI)^-1 X't
- Diagonal Loading: X'X + λI
- Diagonal Loading can be crucial for numerical stability
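A minimal sketch of the ridge solution, assuming only the formula above; the regularization value lam is arbitrary and would normally be tuned.

import numpy as np

def ridge_fit(X, t, lam=1e-2):
    """w = (X'X + lam·I)^-1 X't, computed with a linear solve rather than an explicit inverse."""
    B = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(B), X.T @ t)

# Usage, with X and t as in the least-squares sketch above:
# w_ridge = ridge_fit(X, t, lam=0.1)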

Illustrative Example (we'll see its value later)

Notes on Methodology
When developing a "Machine Learning" algorithm, one should test it on simulated data first. This is necessary but not sufficient:
- Necessary: if it doesn't work on simulated data, then it almost certainly will not work on real data
- Sufficient: if it works on simulated data, then it may or may not work on real data
Question: How do we simulate?

Simulating Data
- We usually use Gaussians, because they are often assumed (even though the assumption is not accurate nearly as often)
- A multivariate Gaussian is completely determined by its Mean and its Covariance Matrix

Some Single Gaussians in One Dimension

Fisher Canonical LDA Gaussians – Trickery of Displays: same X-axis, same Y-axis

Fisher Canonical LDA Gaussians – Trickery of Displays: same X-axis, different Y-axis

Fisher Canonical LDA Gaussians – Trickery of Displays: same X-axis, different Y-axis

Some Single Gaussians in Two Dimensions

Formulas for Gaussians
To generate simulated data, we need to draw samples from these distributions.
- Univariate Gaussian:
  p(x) = (1 / (σ·sqrt(2π))) · exp(-(x - μ)^2 / (2σ^2))
- Multivariate Gaussian (e.g. x is a spectrum with B bands):
  p(x) = (1 / ((2π)^(B/2) |Σ|^(1/2))) · exp(-(1/2)(x - μ)' Σ^-1 (x - μ))
  where Σ is the Covariance Matrix
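A small sketch of drawing samples from both distributions with NumPy; the means, variances, and covariance below are invented for illustration.

import numpy as np

rng = np.random.default_rng(42)

# Univariate Gaussian: mu = 0, sigma = 2
x_uni = rng.normal(loc=0.0, scale=2.0, size=1000)

# Multivariate Gaussian in 2 dimensions: mean vector mu and covariance matrix Sigma
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x_multi = rng.multivariate_normal(mu, Sigma, size=1000)   # each row is one sample
print(x_multi.shape)                                      # (1000, 2)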

Sample Covariance
- Definition:
  S = (1/(N-1)) Σn (xn - xbar)(xn - xbar)'
- The term (xn - xbar)(xn - xbar)' is called an Outer Product
- Example Outer Product:
  (a, b)' · (c, d) = [[a·c, a·d], [b·c, b·d]]
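A quick NumPy check of the definition, assuming the 1/(N-1) normalization written above (np.cov uses the same one); the data are random.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))               # 100 samples, 4 bands
xbar = X.mean(axis=0)

S = np.zeros((4, 4))
for x in X:
    d = (x - xbar).reshape(-1, 1)
    S += d @ d.T                            # outer product (xn - xbar)(xn - xbar)'
S /= (X.shape[0] - 1)

print(np.allclose(S, np.cov(X, rowvar=False)))   # True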

Covariance Matrices
If S is a covariance matrix, then people who need to know can calculate matrices U and D with the properties that
  S = U'DU
where
- S is diagonalized
- U is orthogonal (like a rotation)
- D is diagonal

Generating Covariance Matrices
(1) Any matrix of the form A'A is a covariance matrix for some distribution, so we can do the following:
  - Set A = a random square matrix
  - Set S = A'A
(2) So we can also do the following:
  - Make a diagonal matrix D
  - Make a rotation matrix U
  - Make a covariance matrix by setting S = U'DU
We will generate covariance matrices S using Python.

Go To Python
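A sketch of what this Python step might look like, implementing both recipes from the previous slide; the dimensions and the diagonal values of D are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)

# (1) S = A'A for a random square matrix A
A = rng.normal(size=(3, 3))
S1 = A.T @ A

# (2) S = U'DU for a diagonal matrix D and a rotation (orthogonal) matrix U
D = np.diag([4.0, 2.0, 0.5])
U, _ = np.linalg.qr(rng.normal(size=(3, 3)))     # QR gives an orthogonal matrix to use as U
S2 = U.T @ D @ U

# Both are symmetric with non-negative eigenvalues, as a covariance matrix must be
print(np.linalg.eigvalsh(S1))
print(np.linalg.eigvalsh(S2))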

Linear Dimensionality Reduction
- PCA: Principal Components Analysis
  Maximizes the amount of variance in the first k bands compared to all other linear (orthogonal) transforms
- MNF: Minimum Noise Fraction
  Minimizes an estimate of Noise/Signal, or equivalently maximizes an estimate of Signal/Noise

PCA
- Start with a data set of spectra or other samples, implicitly assumed drawn from the same distribution
- Compute the sample mean over all spectra:
  xbar = (1/N) Σn xn
- Compute the sample covariance:
  S = (1/(N-1)) Σn (xn - xbar)(xn - xbar)'
- Diagonalize S:
  S = U'DU
- PCA is defined to be:
  yn = U(xn - xbar)

PCA – Easy Examples
- Eigenvectors (the rows of U) determine the major and minor axes
- The new coordinate system is a rotation (U) and shift (x - xbar) of the original coordinate system
- This assumes elliptical contours, which is the Gaussian case
- Eigenvalues determine the lengths of the major and minor axes

PCA, "Dark Points", and BRDF
- The black points are from Oak Trees
- The red points are from Soil in a Baseball Field

Go To Python
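A sketch of what this Python step might look like, implementing the PCA definition above (yn = U(xn - xbar)) on simulated 2D Gaussian data; the mean and covariance are made up.

import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 1.0]], size=500)

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)

evals, evecs = np.linalg.eigh(S)        # columns of evecs are the eigenvectors of S
order = np.argsort(evals)[::-1]         # sort big to little
evals, evecs = evals[order], evecs[:, order]

U = evecs.T                             # rows of U are the principal directions, S = U'DU
Y = (X - xbar) @ U.T                    # PCA scores: each row is yn = U(xn - xbar)
print(Y.var(axis=0, ddof=1))            # matches evals, in decreasing order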

MNF
- Assumption: Observed Spectrum = Signal + Noise, i.e. x = s + n
- We want to transform x so that the ratio Noise Variance / Signal Variance is minimized
- How do we represent this ratio?

MNF
- Assume the signal and the noise are both random vectors with multivariate Gaussian distributions
- Assume the noise is zero mean: it is equally likely to add or subtract by the same amounts
- The noise variance uniquely determines how much the signal is modified by noise
- Therefore, we should try to minimize the ratio Noise Variance / Signal Variance

MNF – Noise/Signal Ratio
- How do we compute it for spectra? 426 bands -> 426 variances and 426·425/2 covariances
- Dividing element-wise won't work. What should we do? Diagonalize!
- The covariance of n is diagonalizable: Sn = Un'DnUn
- The covariance of x is diagonalizable: Sx = Ux'DxUx

MNF – Noise/Signal Ratio
- The covariance of n is diagonalizable: Sn = Un'DnUn
- The covariance of x is diagonalizable: Sx = Ux'DxUx
- GOOD NEWS! They can be simultaneously diagonalized. It's a little complicated, but basically looks like this: there is a single matrix V with
  V'SnV = Dn  and  V'SxV = Dx
- So the Noise/Signal ratio in transformed band i is Dn(i,i) / Dx(i,i)

MNF: Algorithm
1. Estimate n
2. Calculate the covariance of n, Sn
3. Calculate the covariance of x, Sx
4. Calculate the left eigenvectors and eigenvalues of Sn·Sx^-1 (Noise/Signal) or of Sx·Sn^-1 (Signal/Noise)
5. Make sure the eigenvalues are sorted in order of
   - big to little if maximizing, or
   - little to big if minimizing
6. Only keep the eigenvectors that come early in the sort
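A sketch of the algorithm in Python, assuming the steps above. The noise estimate (scaled differences of neighboring samples) is one common choice, not necessarily the one used in the lecture, and the generalized eigenproblem Sn·v = λ·Sx·v is an equivalent way of getting the eigenvectors of Sx^-1·Sn.

import numpy as np
from scipy.linalg import eigh

def mnf(X, n_keep=10):
    """X: (N samples, B bands). Return the first n_keep MNF components and the sorted eigenvalues."""
    # 1. Estimate the noise n (here: differences of adjacent samples, scaled)
    N_est = np.diff(X, axis=0) / np.sqrt(2.0)
    # 2-3. Covariance of n and covariance of x
    Sn = np.cov(N_est, rowvar=False)
    Sx = np.cov(X, rowvar=False)
    # 4. Generalized eigenproblem Sn v = lambda Sx v; each eigenvalue is a Noise/Signal ratio
    evals, evecs = eigh(Sn, Sx)
    # 5. Sort little to big (we are minimizing Noise/Signal) and
    # 6. only keep the eigenvectors that come early in the sort
    order = np.argsort(evals)
    V = evecs[:, order[:n_keep]]
    return X @ V, evals[order]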