Curse of Dimensionality


1 Curse of Dimensionality


13 Dimensionality Reduction


24 Why Reduce Dimensionality?
- Reduces time complexity: less computation
- Reduces space complexity: fewer parameters
- Saves the cost of observing the feature
- Simpler models are more robust on small datasets
- More interpretable; simpler explanation
- Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions
(Source: lecture notes for E. Alpaydın, Introduction to Machine Learning, 2nd ed., MIT Press, 2010.)


31 Feature Selection vs Extraction
- Feature selection: choose k < d important features, ignoring the remaining d − k. Subset selection algorithms.
- Feature extraction: project the original dimensions xi, i = 1,...,d, onto new k < d dimensions zj, j = 1,...,k. Principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA).
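A minimal sketch of the contrast, assuming scikit-learn is available; the Iris data, the scoring function, and k = 2 are illustrative choices, not something the slides prescribe:

```python
# Minimal sketch: feature selection vs. feature extraction (scikit-learn assumed).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)             # d = 4 original features

# Feature selection: keep k = 2 of the original columns, discard the rest.
selector = SelectKBest(score_func=f_classif, k=2)
X_sel = selector.fit_transform(X, y)          # shape (150, 2), original features

# Feature extraction: project onto k = 2 new directions (linear combinations).
pca = PCA(n_components=2)
X_ext = pca.fit_transform(X)                  # shape (150, 2), new features z_j

print(X_sel.shape, X_ext.shape)
```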


36 Principal Component Analysis


38 PCA Intuition: find the axis that shows the greatest variation, and project all points onto this axis. [Figure: original axes f1, f2 with principal directions e1, e2.]


47 Factor Analysis

48 Introduction The purpose of factor analysis is to describe the variation among many variables in terms of a few underlying, but unobservable, random variables called factors. All of the covariances or correlations are explained by the common factors. Any portion of the variance unexplained by the common factors is assigned to residual error terms, which are called unique factors.

49 Factor Analysis Factor analysis is a class of procedures used for data reduction and summarization. It is an interdependence technique: no distinction is made between dependent and independent variables. Factor analysis is used to identify underlying dimensions, or factors, that explain the correlations among a set of variables, and to identify a new, smaller set of uncorrelated variables to replace the original set of correlated variables.

50 Concept Factor analysis can be viewed as a statistical procedure for grouping variables into subsets such that the variables within each subset are mutually highly correlated, while variables in different subsets are relatively uncorrelated.

51 Factor Analysis Assume a set of unobservable ("latent") variables.
Goal: characterize the dependency among the observables using the latent variables. If a group of variables has large correlations among themselves and small correlations with all other variables, that group may be summarized by a single factor.

52 Factor Analysis Assume k latent (unobservable) factor variables generate the d observables. Assume all variation in the observable variables is due either to the latent factors or to noise (with unknown variance). Find the transformation from the unobservables to the observables that explains the data.

53 What is FA? Patterns of correlations are identified and either used as descriptive summaries (PCA) or taken as indicative of an underlying theory (FA). FA provides an operational definition for a latent construct (through a regression equation).

54 Factor Analysis Model Each variable is expressed as a linear combination of factors: some common factors plus a unique factor. The factor model is represented as
Xi = Ai1 F1 + Ai2 F2 + Ai3 F3 + ... + Aim Fm + Vi Ui
where
Xi = i-th standardized variable
Aij = standardized multiple regression coefficient of variable i on common factor j
Fj = common factor j
Vi = standardized regression coefficient of variable i on unique factor i
Ui = the unique factor for variable i
m = number of common factors
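A small numpy sketch of this generative equation; all of the sizes, loadings Aij, and unique-factor coefficients Vi below are made up for illustration:

```python
# Sketch of the factor model X_i = sum_j A_ij F_j + V_i U_i (illustrative numbers only).
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 1000, 5, 2          # samples, observed variables, common factors

A = rng.normal(size=(d, m))   # loadings A_ij (assumed, for illustration)
V = np.full(d, 0.5)           # unique-factor coefficients V_i (assumed)

F = rng.normal(size=(n, m))   # common factors F_j
U = rng.normal(size=(n, d))   # unique factors U_i, one per variable

X = F @ A.T + U * V           # each X_i is a linear combination of factors plus a unique term
print(X.shape)                # (1000, 5)
```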

55 Factor Analysis Model The first set of weights (factor score coefficients) is chosen so that the first factor explains the largest portion of the total variance. Then a second set of weights can be selected so that the second factor explains most of the residual variance, subject to being uncorrelated with the first factor. The same principle applies when selecting additional weights for additional factors.

56 Factor Analysis Model The common factors themselves can be expressed as linear combinations of the observed variables:
Fi = Wi1 X1 + Wi2 X2 + Wi3 X3 + ... + Wik Xk
where
Fi = estimate of the i-th factor
Wij = weight or factor score coefficient of variable j
k = number of variables
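In practice the weights are estimated from data. A minimal sketch using scikit-learn's FactorAnalysis, whose transform method returns estimated factor scores; the synthetic data is only a placeholder:

```python
# Estimating factor scores from data with scikit-learn's FactorAnalysis
# (a maximum-likelihood FA; the score weights W are handled internally).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))          # placeholder data; standardize real data first

fa = FactorAnalysis(n_components=2)
scores = fa.fit_transform(X)           # estimated F_i: one score per sample and factor
loadings = fa.components_              # shape (2, 6): loading of each variable on each factor
print(scores.shape, loadings.shape)
```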

57 Factor Analysis Find V such that S ≈ V VT + Ψ, where S is the estimate of the covariance matrix, V is the d x k loading matrix (k < d) explaining the observables through the latent variables, and Ψ holds the unique (residual) variances. The solution uses eigenvalues and eigenvectors.
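One concrete eigenvalue-based recipe is the principal-component method of factor extraction, sketched below with numpy on synthetic data; maximum-likelihood factor analysis is a common alternative, and nothing here is specific to the slides' exact derivation:

```python
# Principal-component method for estimating loadings V from a sample covariance S.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))
S = np.cov(X, rowvar=False)                  # d x d sample covariance

eigvals, eigvecs = np.linalg.eigh(S)         # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # re-sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
V = eigvecs[:, :k] * np.sqrt(eigvals[:k])    # d x k loading matrix
Psi = np.diag(np.diag(S - V @ V.T))          # unique (residual) variances on the diagonal

resid = S - (V @ V.T + Psi)                  # misfit of the k-factor approximation to S
print(np.abs(resid).max())
```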

58 Conducting Factor Analysis
Fig. 19.2: Problem formulation → Construction of the correlation matrix → Method of factor analysis → Determination of the number of factors → Rotation of factors → Interpretation of factors → Calculation of factor scores → Determination of model fit.

59 General Steps to FA
Step 1: Selecting and measuring a set of variables in a given domain
Step 2: Data screening in order to prepare the correlation matrix
Step 3: Factor extraction
Step 4: Factor rotation to increase interpretability
Step 5: Interpretation
Further steps: validation and reliability of the measures

60 Factor Analysis In FA, factors zj are stretched, rotated and translated to generate x.

61 FA Usage Speech is a function of the positions of a small number of articulators (lungs, lips, tongue). Factor analysis: go from the signal space (4000 points for 500 ms) to the articulation space (20 points). Classify speech (assign a text label) from the 20 points. Speech compression: send 20 values.

62 Independent Component Analysis

63 Motivation A method for finding underlying components from multi-dimensional data. The focus in ICA is on independent and non-Gaussian components, as compared to the uncorrelated and Gaussian components in FA and PCA.

64 Cocktail-party Problem
Multiple speakers in a room (independent sources). Multiple sensors receive signals that are mixtures of the original signals. The task is to estimate the original source signals from the mixtures of received signals. This can be viewed as blind source separation, since the mixing parameters are not known.
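A toy blind-source-separation sketch using scikit-learn's FastICA; the two source signals and the mixing matrix are synthetic stand-ins for the speakers and the room mixing:

```python
# Toy blind source separation with FastICA (synthetic signals and mixing matrix).
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                         # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))                # source 2: square wave
S = np.c_[s1, s2] + 0.02 * np.random.default_rng(3).normal(size=(2000, 2))

A = np.array([[1.0, 0.5],                  # mixing matrix, unknown in practice
              [0.5, 1.0]])
X = S @ A.T                                # observed mixtures (what the sensors record)

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)               # recovered sources (up to order, sign, scale)
print(S_hat.shape)
```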

65 Independent Components Analysis (ICA)
The goal of ICA is to discover both the inputs and how they were mixed. Assumption: the observed data is the sum of a set of inputs which have been mixed together in an unknown fashion (McKeown et al., 1998).

66 PCA finds the directions of maximum variance; ICA finds the directions of maximum independence.

67 Principle: Maximize Information
Q: How can a visual system, natural or synthetic, extract maximum information from multiple visual channels?
A: ICA does this: it maximizes the joint entropy of, and minimizes the mutual information between, its output channels (Bell & Sejnowski, 1995). ICA produces brain-like visual filters for natural images.
Speaker notes: Our visual system has evolved to pick up relevant visual information quickly and efficiently, but with minimal assumptions about exactly what we see (since strong assumptions could lead to frequent hallucinations). One assumption a visual system can make is that the statistical nature of natural scenes is fairly stable. Infomax ICA, developed by Bell and Sejnowski under ONR funding, is a neural-network approach to blind signal processing that seeks to maximize the total information (in Shannon's sense) in its output channels given its input; this is equivalent to minimizing the mutual information contained in pairs of outputs. Applied to image patches from natural scenes, ICA derives maximally informative sets of visual patch filters that strongly resemble the receptive fields of primary visual neurons (the original slide showed a set of 144 ICA filters).

68 Principles of ICA Estimation
"Nongaussian is independent": by the central limit theorem, sums of independent variables become more Gaussian, so maximizing the nongaussianity of the estimated components pulls them toward independence. Measures of nongaussianity:
- Kurtosis: kurt(y) = E[y^4] − 3 (E[y^2])^2, which is 0 for a Gaussian distribution.
- Negentropy: J(y) = H(y_gauss) − H(y), exploiting the fact that a Gaussian variable has the largest entropy among all random variables of equal variance.
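A quick illustration of excess kurtosis as a nongaussianity check, on synthetic Gaussian and Laplacian samples (negentropy is usually approximated rather than computed exactly, so it is not shown here):

```python
# Excess kurtosis as a nongaussianity check: ~0 for Gaussian data,
# clearly nonzero for e.g. Laplacian (super-Gaussian) data.
import numpy as np

def excess_kurtosis(y):
    y = (y - y.mean()) / y.std()          # standardize to zero mean, unit variance
    return np.mean(y ** 4) - 3.0          # E[y^4] - 3(E[y^2])^2 with E[y^2] = 1

rng = np.random.default_rng(4)
print(excess_kurtosis(rng.normal(size=100_000)))    # close to 0
print(excess_kurtosis(rng.laplace(size=100_000)))   # close to 3
```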

69 ICA Definition Observe n random variables which are linear combinations of n mutually independent random variables. In matrix notation, X = AS. Assume the source signals are statistically independent. Estimate the mixing parameters and the source signals: find a linear transformation of the observed signals such that the resulting signals are as independent as possible.

70 Restrictions and Ambiguities
The components are assumed independent. The components must have non-Gaussian densities. The energies (variances) of the independent components cannot be estimated, and there is a sign ambiguity in the independent components.

71 Gaussian and Non-Gaussian components
If some components are Gaussian and some are non-Gaussian, all of the non-Gaussian components can be estimated, while only a linear combination of the Gaussian components can be estimated. If there is only one Gaussian component, the model can still be estimated.

72 Why Non-Gaussian Components
Uncorrelated Gaussian random variables are independent, so an orthogonal mixing matrix cannot be estimated from Gaussian random variables: for Gaussian data the model can only be estimated up to an orthogonal transformation. ICA can therefore be considered a non-Gaussian factor analysis.

73 Summing up

74 The Factor Analysis Model
The generative model for factor analysis assumes that the data was produced in three stages:
1. Pick values independently for some hidden factors that have Gaussian priors.
2. Linearly combine the factors using a factor loading matrix; use more linear combinations than factors.
3. Add Gaussian noise that is different for each input.

75 The Full Gaussian Model
The generative model for the full Gaussian model assumes that the data was produced in two stages:
1. Pick values independently for some hidden factors that have Gaussian priors.
2. Linearly combine the factors using a square matrix.
There is no need to add Gaussian noise because this model can already generate all points in the dataspace.

76 The PCA Model The generative model for PCA assumes that the data was produced in three stages:
1. Pick values independently for some hidden factors that can have any value.
2. Linearly combine the factors using a factor loading matrix; use more linear combinations than factors.
3. Add Gaussian noise that is the same for each input.

77 The Probabilistic PCA Model
The generative model for probabilistic PCA assumes that the data was produced in three stages:
1. Pick values independently for some hidden factors that have Gaussian priors.
2. Linearly combine the factors using a factor loading matrix; use more linear combinations than factors.
3. Add Gaussian noise that is the same for each input.
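To make the contrast drawn in the last few slides concrete, a small sampling sketch; the sizes, loading matrix, and noise levels are invented for illustration. The only coded difference is the noise: a separate variance per input for FA versus one shared variance for probabilistic PCA:

```python
# Sampling sketch: diagonal per-input noise (FA) vs. one shared noise level (probabilistic PCA).
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 1000, 4, 2
W = rng.normal(size=(d, k))               # factor loading matrix (illustrative)
z = rng.normal(size=(n, k))               # hidden factors with Gaussian priors

psi = np.array([0.1, 0.5, 0.2, 0.9])      # FA: a different noise variance per input
x_fa = z @ W.T + rng.normal(size=(n, d)) * np.sqrt(psi)

sigma2 = 0.3                              # PPCA: the same noise variance for every input
x_ppca = z @ W.T + rng.normal(size=(n, d)) * np.sqrt(sigma2)

print(x_fa.shape, x_ppca.shape)
```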

78 Extra slides

79 Dimensionality Reduction
One approach to dealing with high-dimensional data is to reduce its dimensionality: project the high-dimensional data onto a lower-dimensional subspace using linear or non-linear transformations.

80 Dimensionality Reduction
Linear transformations are simple to compute and tractable: z = W x, where z is k x 1, W is k x d, and x is d x 1, with k << d. Classical linear approaches:
- Principal Component Analysis (PCA)
- Fisher Discriminant Analysis (FDA)
- Singular Value Decomposition (SVD)
- Factor Analysis (FA)
- Canonical Correlation Analysis (CCA)

81 Principal Component Analysis

82 Principal Component Analysis
This function is minimized when x0 is equal to the sample mean.


94 Principal Components Analysis (PCA)
Find a low-dimensional space such that when x is projected there, information loss is minimized. The projection of x on the direction of w is z = wT x. Find w such that Var(z) is maximized:
Var(z) = Var(wT x)
       = E[(wT x − wT μ)^2]
       = E[(wT x − wT μ)(wT x − wT μ)]
       = E[wT (x − μ)(x − μ)T w]
       = wT E[(x − μ)(x − μ)T] w
       = wT ∑ w
where Var(x) = E[(x − μ)(x − μ)T] = ∑.
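A quick numerical check of the identity Var(wTx) = wT∑w, using arbitrary synthetic data and an arbitrary unit-length direction w:

```python
# Numerical check that Var(w^T x) = w^T Sigma w (arbitrary data and direction w).
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(5000, 3)) @ rng.normal(size=(3, 3))   # correlated synthetic data
w = np.array([0.2, -0.5, 1.0])
w = w / np.linalg.norm(w)                                  # unit-length direction

z = X @ w                                                  # projections z = w^T x
Sigma = np.cov(X, rowvar=False)                            # sample covariance

print(z.var(ddof=1), w @ Sigma @ w)                        # the two numbers agree
```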

95 Maximize Var(z) subject to ||w||=1
Maximizing Var(z) subject to ||w1|| = 1 gives ∑ w1 = α w1, that is, w1 is an eigenvector of ∑. Choose the eigenvector with the largest eigenvalue for Var(z) to be maximum. Second principal component: maximize Var(z2) subject to ||w2|| = 1 and w2 orthogonal to w1, which gives ∑ w2 = α w2, that is, w2 is another eigenvector of ∑, and so on.

96 What PCA does z = WT(x − m), where the columns of W are the eigenvectors of ∑ and m is the sample mean. This centers the data at the origin and rotates the axes.
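A minimal numpy sketch of exactly this recipe on synthetic data: estimate m and ∑, take the eigenvectors of ∑ as the columns of W, and project z = WT(x − m); the projected coordinates come out decorrelated:

```python
# PCA by eigendecomposition: center the data, rotate onto the eigenvectors of Sigma.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))   # synthetic correlated data

m = X.mean(axis=0)                       # sample mean
Sigma = np.cov(X, rowvar=False)          # sample covariance

eigvals, W = np.linalg.eigh(Sigma)       # columns of W are eigenvectors (ascending order)
order = np.argsort(eigvals)[::-1]
eigvals, W = eigvals[order], W[:, order]

Z = (X - m) @ W                          # z = W^T (x - m) for every sample
print(np.round(np.cov(Z, rowvar=False), 3))   # ~diagonal: the new axes are decorrelated
```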

97 How to choose k? Proportion of Variance (PoV) explained:
PoV = (λ1 + λ2 + ... + λk) / (λ1 + λ2 + ... + λd)
where the λi are sorted in descending order. Typically, stop at PoV > 0.9. A scree graph plots PoV versus k; stop at the "elbow".
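A short sketch of choosing k by PoV on synthetic data, using the 0.9 threshold from the slide:

```python
# Choosing k by proportion of variance explained (PoV > 0.9 as on the slide).
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))      # synthetic data

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # eigenvalues, descending
pov = np.cumsum(eigvals) / eigvals.sum()                     # PoV for k = 1..d

k = int(np.argmax(pov > 0.9)) + 1        # smallest k whose PoV exceeds 0.9
print(pov, k)
```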

98 Factor Analysis In FA, factors zj are stretched, rotated and translated to generate x.

