
1 Feature extraction 1. Introduction 2. T-test 3. Signal-to-Noise Ratio (SNR) 4. Linear Correlation Coefficient (LCC) 5. Principal Component Analysis (PCA) 6. Linear Discriminant Analysis (LDA)

2 Feature reduction vs. feature selection Feature reduction – All original features are used – The transformed features are linear combinations of the original features. Feature selection – Only a subset of the original features is used.

3 Why feature reduction? Most machine learning and data mining techniques may not be effective for high-dimensional data – Curse of Dimensionality – Query accuracy and efficiency degrade rapidly as the dimension increases. The intrinsic dimension may be small. – For example, the number of genes responsible for a certain type of disease may be small.

4 Why feature reduction? Visualization: projection of high-dimensional data onto 2D or 3D. Data compression: efficient storage and retrieval. Noise removal: positive effect on query accuracy.

5 Applications of feature reduction Face recognition Handwritten digit recognition Text mining Image retrieval Microarray data analysis Protein classification

6 High-dimensional data in bioinformatics Gene expression Gene expression pattern images

7 High-dimensional data in computer vision Face images Handwritten digits

8 Principal component analysis (PCA) Motivation Example: Given 53 blood and urine measurements (features) from 65 people. How can we visualize the measurements?

9 PCA Matrix format (65 instances x 53 features) Difficult to see the correlations between the features...

10 PCA - Data Visualization Spectral format (53 pictures, one for each feature) Difficult to see the correlations between the features...

11 PCA - Data Visualization Bi-variate and tri-variate plots How can we visualize the other variables? … difficult to see in 4- or higher-dimensional spaces...

12 Data Visualization Is there a representation better than the coordinate axes? Is it really necessary to show all the 53 dimensions? – … what if there are strong correlations between the features? How could we find the smallest subspace of the 53-D space that keeps the most information about the original data? A solution: Principal Component Analysis

13 Principal Component Analysis PCA: orthogonal projection of the data onto a lower-dimensional linear space that... – maximizes the variance of the projected data (purple line) – minimizes the mean squared distance between the data points and their projections (sum of the blue lines)

14 Principal Components Analysis Idea: – Given data points in a d-dimensional space, project them into a lower-dimensional space while preserving as much information as possible E.g., find the best planar approximation to 3D data E.g., find the best 12-D approximation to 10^4-D data – In particular, choose the projection that minimizes the squared error in reconstructing the original data

15 The Principal Components Vectors originating from the center of mass: Principal component #1 points in the direction of the largest variance. Each subsequent principal component… – is orthogonal to the previous ones, and – points in the direction of the largest variance of the residual subspace

16 2D Gaussian dataset

17 1st PCA axis

18 2nd PCA axis

19 PCA algorithm I (sequential) Given the centered data {x_1, …, x_m}, compute the principal vectors one at a time:
1st PCA vector: w_1 = arg max_{||w||=1} (1/m) sum_{i=1..m} (w^T x_i)^2 (maximize the variance of the projection of x)
k-th PCA vector: w_k = arg max_{||w||=1} (1/m) sum_{i=1..m} [ w^T ( x_i − sum_{j=1..k−1} w_j w_j^T x_i ) ]^2 (maximize the variance of the projection in the residual subspace)
PCA reconstruction: x' = w_1(w_1^T x) + w_2(w_2^T x) + … (projection of x onto the principal vectors w_1, w_2, …)
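A minimal numpy sketch of this sequential procedure, using power iteration on the residual data followed by deflation. The function name sequential_pca and the iteration count are illustrative choices, not from the slides; the data are assumed to be centered, one row per point.

import numpy as np

def sequential_pca(X, k, n_iter=200):
    """Compute the top-k principal vectors one at a time (slide 19).
    X: centered data of shape (m, d), one data point per row."""
    m, d = X.shape
    residual = X.copy()
    W = []
    for _ in range(k):
        w = np.random.randn(d)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):          # power iteration on the residual covariance
            w = residual.T @ (residual @ w)
            w /= np.linalg.norm(w)
        W.append(w)
        # deflation: remove the component along w from every data point
        residual = residual - np.outer(residual @ w, w)
    return np.array(W)                   # rows are w_1, ..., w_k

Under these assumptions, the reconstruction on the slide is X @ W.T @ W, i.e. each row becomes x' = w_1(w_1^T x) + ... + w_k(w_k^T x).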

20 PCA algorithm II (sample covariance matrix) Given data {x_1, …, x_m}, compute the sample covariance matrix Σ = (1/m) sum_{i=1..m} (x_i − x̄)(x_i − x̄)^T, where x̄ = (1/m) sum_{i=1..m} x_i. The PCA basis vectors are the eigenvectors of Σ; the larger the eigenvalue, the more important the eigenvector.

21 PCA algorithm II – pseudocode
PCA(X, k): return the top k eigenvalues/eigenvectors
% X = N x m data matrix; each data point x_i is a column vector, i = 1..m
X ← subtract the mean x̄ from each column vector x_i in X
Σ ← X X^T % covariance matrix of X
{λ_i, u_i}_{i=1..N} ← eigenvalues/eigenvectors of Σ, with λ_1 ≥ λ_2 ≥ … ≥ λ_N
Return {λ_i, u_i}_{i=1..k} % the top k principal components
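A minimal numpy sketch of this pseudocode; the function name pca_eig and the use of numpy's eigh (which returns the eigenvalues of a symmetric matrix in ascending order) are illustrative choices.

import numpy as np

def pca_eig(X, k):
    """PCA via the covariance matrix (slide 21).
    X: N x m data matrix, one data point per column."""
    Xc = X - X.mean(axis=1, keepdims=True)     # subtract the mean from each column
    cov = Xc @ Xc.T                            # N x N covariance matrix (up to a 1/m factor)
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]      # indices of the top k eigenvalues
    return eigvals[order], eigvecs[:, order]   # top k principal components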

22 PCA algorithm III (SVD of the data matrix) Singular Value Decomposition of the centered data matrix X (features x samples): X = U S V^T. (Figure: block diagram of the decomposition; the leading singular values and vectors carry the significant structure of the samples, the trailing ones mostly noise.)

23 PCA algorithm III Columns of U – the principal vectors {u^(1), …, u^(k)} – orthogonal and of unit norm, so U^T U = I – the data can be reconstructed using linear combinations of {u^(1), …, u^(k)} Matrix S – diagonal – shows the importance of each eigenvector Columns of V^T – the coefficients for reconstructing the samples
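A minimal numpy sketch of the SVD route; pca_svd is an illustrative name, and the projection/reconstruction comments follow the roles of U, S and V^T described on this slide.

import numpy as np

def pca_svd(X, k):
    """PCA via SVD of the centered data matrix (slides 22-23).
    X: features x samples, one data point per column."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k, :]          # principal vectors, importances, coefficients

# Projection of the samples onto the top-k principal vectors, and reconstruction:
#   scores = U[:, :k].T @ Xc
#   X_hat  = U[:, :k] @ scores + X.mean(axis=1, keepdims=True)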

24 PCA – Input and Output Input: an n x k data matrix with k original variables x_1, x_2, ..., x_k. Output: a k x k transfer matrix W, with rows (a_11 a_12 ... a_1k), (a_21 a_22 ... a_2k), …, (a_k1 a_k2 ... a_kk), together with a k x 1 vector of eigenvalues λ_1, λ_2, …, λ_k.

25 PCA – usage (k-to-k transfer) From the k original variables x_1, x_2, ..., x_k, produce k new variables y_1, y_2, ..., y_k:
y_1 = a_11 x_1 + a_12 x_2 + ... + a_1k x_k
y_2 = a_21 x_1 + a_22 x_2 + ... + a_2k x_k
...
y_k = a_k1 x_1 + a_k2 x_2 + ... + a_kk x_k
such that: the y_i's are uncorrelated (orthogonal), y_1 explains as much as possible of the original variance in the data set, y_2 explains as much as possible of the remaining variance, etc. The y_i's are the Principal Components.

26 PCA - usage Direct transfer: (n x k data) x (k x k W^T) = n x k transferred data. Dimension reduction: (n x k data) x (k x t W'^T) = n x t transferred data.

27 PCA – usage (dimension reduction) From the k original variables x_1, x_2, ..., x_k, produce t new variables y_1, y_2, ..., y_t using only the first t rows of the matrix W:
y_1 = a_11 x_1 + a_12 x_2 + ... + a_1k x_k
y_2 = a_21 x_1 + a_22 x_2 + ... + a_2k x_k
...
y_t = a_t1 x_1 + a_t2 x_2 + ... + a_tk x_k
such that the y_i's are uncorrelated (orthogonal), y_1 explains as much as possible of the original variance in the data set, y_2 explains as much as possible of the remaining variance, etc.
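A minimal numpy sketch of this dimension reduction step, assuming the rows of W are the principal vectors ordered by decreasing eigenvalue (as in the output on slide 24); pca_reduce is an illustrative name.

import numpy as np

def pca_reduce(X, W, t):
    """Project an n x k data matrix onto the first t principal components.
    W: k x k transfer matrix, rows ordered by decreasing eigenvalue."""
    Xc = X - X.mean(axis=0)    # center the k original variables
    Wt = W[:t, :]              # keep only the first t rows of W
    return Xc @ Wt.T           # n x t transferred data y_1, ..., y_t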

28 PCA summary so far Rotates a multivariate dataset into a new configuration that is easier to interpret Purposes – simplify the data – look at relationships between variables – look at patterns among units

29 Example of PCA on Iris Data

30 PCA application to image compression

31 Original Image Divide the original 372x492 image into patches: each patch is an instance containing 12x12 pixels on a grid. View each patch as a 144-D vector.
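A minimal numpy sketch of the patch extraction described here; image_to_patches is an illustrative name, and the image is assumed to be a 2-D grayscale array.

import numpy as np

def image_to_patches(img, patch=12):
    """Cut a grayscale image into non-overlapping patch x patch blocks,
    returning one row (a patch*patch-D vector, here 144-D) per block."""
    h, w = img.shape
    h, w = h - h % patch, w - w % patch                      # drop any ragged border
    blocks = img[:h, :w].reshape(h // patch, patch, w // patch, patch)
    return blocks.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

For the 372x492 image above this yields 31 x 41 = 1271 instances of dimension 144.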

32 L2 reconstruction error vs. PCA dimension

33 PCA compression: 144D → 60D

34 PCA compression: 144D → 16D

35 16 most important eigenvectors

36 PCA compression: 144D → 6D

37 6 most important eigenvectors

38 PCA compression: 144D → 3D

39 3 most important eigenvectors

40 PCA compression: 144D → 1D

41 60 most important eigenvectors. They look like the discrete cosine bases of JPEG!

42 2D Discrete Cosine Basis http://en.wikipedia.org/wiki/Discrete_cosine_transform
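To make the comparison with the previous slide concrete, here is a small numpy sketch that generates the n x n two-dimensional DCT-II basis images (normalization omitted); dct2_basis is an illustrative name.

import numpy as np

def dct2_basis(n=8):
    """Return the n*n two-dimensional DCT-II basis images, each of size n x n."""
    k = np.arange(n)
    # 1-D DCT-II basis: C[u, x] = cos(pi * u * (2x + 1) / (2n))
    C = np.cos(np.pi * np.outer(k, 2 * k + 1) / (2 * n))
    # 2-D basis image (u, v) is the outer product of the u-th and v-th 1-D bases
    return np.array([np.outer(C[u], C[v]) for u in range(n) for v in range(n)])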

43 PCA for image compression (Figure: reconstructions with p = 1, 2, 4, 8, 16, 32, 64, 100 principal components, alongside the original image.)
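A minimal numpy sketch of this kind of compression, assuming the patch matrix from the earlier image_to_patches sketch; compress_patches and the eigendecomposition route are illustrative choices.

import numpy as np

def compress_patches(P, p):
    """Project 144-D patch vectors onto the top-p principal components and reconstruct.
    P: one patch vector per row."""
    mean = P.mean(axis=0)
    Pc = P - mean
    _, eigvecs = np.linalg.eigh(Pc.T @ Pc)   # 144 x 144 scatter matrix, eigenvalues ascending
    V = eigvecs[:, -p:]                      # the p most important eigenvectors
    return (Pc @ V) @ V.T + mean             # compressed-and-reconstructed patches

Re-assembling the reconstructed patches into an image (the inverse of image_to_patches) gives the p = 1, 2, ..., 100 pictures on this slide.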

44 In-class practice Data – Iris.txt (Neucom format) and your own data (if applicable) Method: PCA, LDA, SNR Software – Neucom v0.919 – Steps: Visualization->PCA – Steps: Visualization->LDA – Steps: Data Analysis->SNR

45 Linear Discriminant Analysis First applied by M. Barnard at the suggestion of R. A. Fisher (1936), Fisher linear discriminant analysis (FLDA): Dimension reduction – finds linear combinations of the features X = X_1, ..., X_d with large ratios of between-groups to within-groups sums of squares (the discriminant variables); Classification – predicts the class of an observation X as the class whose mean vector is closest to X in terms of the discriminant variables

46 Is PCA a good criterion for classification? Data variation determines the projection direction What’s missing? – Class information

47 What is a good projection? Similarly, what is a good criterion? – Separating different classes. (Figure: one projection under which the two classes overlap vs. one under which they are separated.)

48 What class information may be useful? Between-class distance – Distance between the centroids of different classes

49 What class information may be useful? Within-class distance – Accumulated distance of an instance to the centroid of its class Between-class distance – Distance between the centroids of different classes

50 Linear discriminant analysis Linear discriminant analysis (LDA) finds the most discriminant projection by maximizing the between-class distance and minimizing the within-class distance

51 Linear discriminant analysis Linear discriminant analysis (LDA) finds the most discriminant projection by maximizing the between-class distance and minimizing the within-class distance

52 Notation Data matrix: training data from the different classes 1, 2, …, k

53 Notation Between-class scatter matrix and within-class scatter matrix. Properties: between-class distance = trace of the between-class scatter (i.e., the sum of the diagonal elements of the scatter matrix); within-class distance = trace of the within-class scatter

54 Discriminant criterion Discriminant criterion in mathematical formulation: maximize the between-class scatter matrix of the projected data relative to its within-class scatter matrix. The optimal transformation is given by solving a generalized eigenvalue problem.
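A minimal numpy/scipy sketch of this criterion. The scatter-matrix formulas in the code are the standard Fisher definitions and the function name lda_transform is illustrative; a small ridge may be needed if the within-class scatter is singular.

import numpy as np
from scipy.linalg import eigh

def lda_transform(X, y, t):
    """Return the t most discriminant projection vectors (as columns).
    X: n x d data matrix, one instance per row; y: class labels."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))                    # within-class scatter
    Sb = np.zeros((d, d))                    # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)    # sum of (x - mu_c)(x - mu_c)^T over class c
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)
    # generalized eigenvalue problem  Sb w = lambda Sw w
    eigvals, eigvecs = eigh(Sb, Sw)          # ascending eigenvalues
    return eigvecs[:, np.argsort(eigvals)[::-1][:t]]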

55 Graphical view of classification (Figure: the n x d training data and a 1 x d test data point are both projected by the d x (K-1) transformation; the test point is then classified by finding its nearest neighbor or nearest centroid among the projected training data.)
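A minimal numpy sketch of the nearest-centroid option shown here, reusing the lda_transform sketch above; names are illustrative.

import numpy as np

def nearest_centroid_predict(X_train, y_train, X_test, W):
    """Classify test points by the nearest class centroid in the discriminant space.
    W: d x (K-1) transformation, e.g. from lda_transform."""
    Z_train, Z_test = X_train @ W, X_test @ W           # project both sets
    classes = np.unique(y_train)
    centroids = np.array([Z_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(Z_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]            # label of the closest centroid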

56 LDA Computation

57 LDA – Input and Output Input: an n x k data matrix with k original variables x_1, x_2, ..., x_k. Output: a k x k transfer matrix W, with rows (a_11 a_12 ... a_1k), (a_21 a_22 ... a_2k), …, (a_k1 a_k2 ... a_kk), together with a k x 1 vector of eigenvalues λ_1, λ_2, …, λ_k.

58 LDA - usage Direct transfer: (n x k data) x (k x k W^T) = n x k transferred data. Dimension reduction: (n x k data) x (k x t W'^T) = n x t transferred data.

59 LDA principle: maximizing between-class distances and minimizing within-class distances

60 An Example: Fisher's Iris Data
Table 1: Linear Discriminant Analysis (APER = 0.0200)
Actual Group   Number of Observations   Predicted Setosa   Predicted Versicolor   Predicted Virginica
Setosa         50                       50                 0                      0
Versicolor     50                       0                  48                     2
Virginica      50                       0                  1                      49
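The apparent error rate (APER) in Table 1 is just the fraction of training observations misclassified: (0 + 2 + 1) / 150 = 0.0200. A one-line numpy check, with an illustrative function name:

import numpy as np

def apparent_error_rate(y_true, y_pred):
    """APER = fraction of (training) observations that are misclassified."""
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))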

61 LDA on Iris Data

62 In-class practice Data – Iris.txt (Neucom format) and your own data (if applicable) Method: PCA, LDA, SNR Software – Neucom v0.919 – Steps: Visualization->PCA – Steps: Visualization->LDA – Steps: Data Analysis->SNR

