2 2.4 Nonnegative Matrix Factorization  NMF casts matrix factorization as a constrained optimization problem that seeks to factor the original matrix into the product of two nonnegative matrices.  Motivation:  the factors are easier to interpret  NMF can provide better results in information retrieval and clustering

3 2.4 Nonnegative Matrix Factorization  Definition: given a nonnegative n×p matrix X and a rank k, find nonnegative matrices W (n×k) and H (k×p) that minimize ||X − WH||²_F subject to W ≥ 0 and H ≥ 0.  Solution: there are three general classes of algorithms for constructing a nonnegative matrix factorization  multiplicative update, alternating least squares, and gradient descent algorithms

4 Procedure - Multiplicative Update Algorithm
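The procedure box itself is not reproduced in the transcript. Below is a minimal sketch of the multiplicative update iterations (Lee and Seung style) for minimizing ||X − WH||²_F; the function name, random initialization, and fixed iteration count are assumptions, and the book's exact procedure may differ.

```matlab
% Minimal sketch of multiplicative update NMF.
% X is a nonnegative data matrix, k the chosen rank, maxiter an assumed limit.
function [W, H] = nmf_mult(X, k, maxiter)
    [n, p] = size(X);
    W = rand(n, k);                                % random nonnegative initialization
    H = rand(k, p);
    for iter = 1:maxiter
        H = H .* (W' * X) ./ (W' * W * H + eps);   % update H; stays nonnegative
        W = W .* (X * H') ./ (W * H * H' + eps);   % update W; stays nonnegative
    end
end
```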

5 Weaknesses of the multiplicative update algorithm:  it tends to be more sensitive to initialization  it has also been shown to be slow to converge

6 Procedure - Alternating Least Squares

7  In each iteration we have a least squares step, where we solve for one of the factor matrices, followed by another least squares step to solve for the other one.  In between, we ensure nonnegativity by setting any negative elements to zero.
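A minimal sketch of these alternating least squares steps, assuming a nonnegative data matrix X, a rank k, and a fixed iteration count (the stopping rule used in the book's procedure is omitted):

```matlab
% Minimal sketch of alternating least squares NMF; X, k, maxiter are assumed inputs.
function [W, H] = nmf_als(X, k, maxiter)
    W = rand(size(X, 1), k);        % random nonnegative starting point for W
    for iter = 1:maxiter
        H = max(0, W \ X);          % least squares solve for H, then clip negatives to zero
        W = max(0, X / H);          % least squares solve for W, then clip negatives to zero
    end
end
```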

8 Example 2.4  We factor the term-document matrix into a nonnegative product of two matrices W and H, where W is 6×3 and H is 3×5.  The example utilizes the multiplicative update option of the NMF function.
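A hedged sketch of what such a call looks like with the MATLAB Statistics Toolbox function nnmf; the stand-in matrix termdoc below is illustrative, not the book's actual term-document data:

```matlab
% Illustrative stand-in for a 6-by-5 nonnegative term-document matrix.
rng(42);                                          % fix the random initialization
termdoc = abs(randn(6, 5));
[W, H] = nnmf(termdoc, 3, 'algorithm', 'mult');   % multiplicative update option
approx = W * H;                                   % rank-3 nonnegative approximation
```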

9 2.5 Factor Analysis  The factor analysis model writes each observed variable as a linear combination of d common factors plus an error term: x_i = λ_i1 f_1 + ⋯ + λ_id f_d + ε_i.  The λ_ij in this model are called the factor loadings, and the error terms ε_i are called the specific factors.  The sum of squared loadings λ_i1² + ⋯ + λ_id² is the communality of x_i.

10 2.5 Factor Analysis  Assumptions:  E[ε] = 0, E[f] = 0, E[x] = 0  the error terms ε_i are uncorrelated with each other  the common factors f_j are uncorrelated with the specific factors ε_i.  Under these assumptions the sample covariance (or correlation) matrix is of the form Σ = ΛΛ^T + Ψ, where Ψ is a diagonal matrix representing E[εε^T]. The variance of ε_i is called the specificity of x_i, so the matrix Ψ is also called the specificity matrix.

11 2.5 Factor Analysis  Both Λ and f are unknown and must be estimated, and the estimates are not unique.  Once an initial estimate is obtained, other solutions can be found by rotating Λ.  The goal of some rotations is to make the structure of Λ more interpretable by driving the λ_ij close to one or zero.  Factor rotation methods can be either orthogonal or oblique.  The orthogonal rotation methods include quartimax, varimax, orthomax, and equimax; the promax and procrustes rotations are oblique.

12 2.5 Factor Analysis  We often want to transform the observations using the estimated factor analysis model, either for plotting purposes or for further analysis methods such as clustering or classification.  We can think of these observations as being transformed to the factor space; the transformed values are called factor scores, similarly to PCA.  The factor scores are really estimates and depend on the method that is used.  The MATLAB Statistics Toolbox uses the maximum likelihood method to obtain the factor loadings and implements the various rotation methods mentioned earlier.

13 Example 2.5  Dataset: stockreturns  It consists of 100 observations representing the percent change in stock prices for 10 companies.  It turns out that the first four companies can be classified as technology, the next three as financial, and the last three as retail.  We use factor analysis to see if there is any structure in the data that supports this grouping.

14 Example 2.5  We plot the matrix Lam (the factor loadings) in Figure 2.4.
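A hedged sketch of the kind of code behind this example, using the Statistics Toolbox function factoran; the number of factors (3), the loaded variable name (stocks), and the plotting details are assumptions and may differ from the book's exact script:

```matlab
% Fit a three-factor model to the stock returns and plot the loadings.
load stockreturns                    % assumed to provide the 100-by-10 matrix 'stocks'
[Lam, Psi] = factoran(stocks, 3);    % maximum likelihood loadings (varimax rotation)
plot(Lam(:, 1), Lam(:, 2), 'o')      % first two columns of the loadings matrix
xlabel('Factor 1'); ylabel('Factor 2'); grid on
```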

15 Example 2.5

16 2.6 Fisher’s Linear Discriminant  This method is known as the Fisher linear discriminant or mapping (FLD) [Duda and Hart, 1973] and is one of the tools used for pattern recognition and supervised learning.  The goal of LDA is to reduce the dimensionality to 1-D (a linear projection) so that the projected observations are well separated.

17 2.6 Fisher’s Linear Discriminant  One approach to building a classifier with high-dimensional data is to project the observations onto a line in such a way that they are well separated in this lower-dimensional space.  The linear separability (and thus the classification) of the observations is greatly affected by the position and orientation of this line.

18 2.6 Fisher’s Linear Discriminant  In LDA we seek a linear mapping that maximizes the linear class separability in the new representation of the data.  Definitions:  We consider a set of n p-dimensional observations x_1, …, x_n, with n_1 samples labeled as belonging to class 1 (λ_1) and n_2 samples belonging to class 2 (λ_2). We denote the set of observations in the i-th class by Λ_i.

19 2.6 Fisher’s Linear Discriminant  The p-dimensional sample mean for class i is μ_i = (1/n_i) Σ_{x ∈ Λ_i} x; projecting onto a vector w gives the projected class mean m_i = w^T μ_i.  As our measure of the standard deviations we use the scatter of the projected observations, s_i² = Σ (y − m_i)², summed over the projected points y in class i.

20 2.6 Fisher’s Linear Discriminant  We use Equation 2.14 to measure the separation of the projected means for the two classes, |m_1 − m_2|², and we use the scatter as our measure of the standard deviations.  The LDA is then defined as the vector w that maximizes the criterion J(w) = |m_1 − m_2|² / (s_1² + s_2²) (Equation 2.16).

21 2.6 Fisher’s Linear Discriminant  The solution to the maximization of Equation 2.16 can be written as w = S_W^{-1}(μ_1 − μ_2), where S_W is the within-class scatter matrix (the sum of the scatter matrices of the two classes).

22 2.6 Fisher’s Linear Discriminant  Example 2.6  Generate some observations that are multivariate normal using the mvnrnd function and plot them as points in the top panel of Figure 2.7.
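A minimal sketch in the spirit of this example: simulate two multivariate normal classes with mvnrnd and compute the discriminant direction w = S_W^{-1}(μ_1 − μ_2). The means, covariance, and sample sizes below are illustrative assumptions, not the book's exact values.

```matlab
% Simulate two classes and compute Fisher's linear discriminant direction.
n1 = 100; n2 = 100;
X1 = mvnrnd([0 0], eye(2), n1);     % class 1 (assumed mean and covariance)
X2 = mvnrnd([2 2], eye(2), n2);     % class 2 (assumed mean and covariance)
mu1 = mean(X1)'; mu2 = mean(X2)';
Sw = (X1 - mu1')' * (X1 - mu1') + (X2 - mu2')' * (X2 - mu2');  % within-class scatter
w = Sw \ (mu1 - mu2);               % discriminant direction
y1 = X1 * w; y2 = X2 * w;           % 1-D projections of the two classes
```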

23 2.7 Intrinsic Dimensionality  The intrinsic dimensionality is defined as the smallest number of dimensions or variables needed to model the data without loss.  Approaches:  We describe several local estimators: nearest neighbor, correlation dimension, and maximum likelihood. These are followed by a global method based on packing numbers.

24 2.7.1 Nearest Neighbor Approach  Definitions:  Let r_{k,x} represent the distance from x to the k-th nearest neighbor of x. The average k-th nearest neighbor distance is given by r̄_k = (1/n) Σ_{i=1}^{n} r_{k,x_i}, and in this model its expected value is proportional to k^{1/d} times a constant C_n that is independent of k.

25 2.7.1 Nearest Neighbor Approach  Taking logarithms, we then obtain log r̄_k ≈ (1/d) log k + log C_n, so an estimate of the intrinsic dimensionality d can be read off from the slope of a regression of log r̄_k on log k.

26 2.7.1 Nearest Neighbor Approach  Pettis et al. [1979] found that their algorithm works better if potential outliers are removed before estimating the intrinsic dimensionality, so they first define outliers (roughly, observations whose nearest neighbor distances are unusually large) and remove them.

27 Procedure - Intrinsic Dimensionality
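The procedure box is not reproduced in the transcript. Below is a minimal sketch of the simple slope-based nearest neighbor estimate described above; the full Pettis et al. procedure additionally iterates and removes outliers, and the function name and the choice of kmax are assumptions.

```matlab
% Slope-based nearest neighbor estimate of intrinsic dimensionality (sketch).
% X is an n-by-p data matrix; kmax is the largest neighbor order used.
function dhat = nn_intrinsic_dim(X, kmax)
    D = sort(squareform(pdist(X)), 2);   % sorted pairwise distances; column 1 is zero
    rbar = mean(D(:, 2:kmax+1), 1);      % average k-th nearest neighbor distances
    k = 1:kmax;
    p = polyfit(log(k), log(rbar), 1);   % fit log(rbar_k) ~ (1/d) log k + const
    dhat = 1 / p(1);                     % the slope is approximately 1/d
end
```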

28 Example 2.7  We first generate some data to illustrate the functionality of the algorithm. The data lie on a helix, a curve given by parametric equations in a single parameter, and points are randomly chosen along this path. For this data set the observed dimensionality is 3, but the intrinsic dimensionality is 1. The resulting estimate of the intrinsic dimensionality is 1.14.
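A usage sketch under stated assumptions: the helix below is an illustrative unit helix (the book's exact parametric equations are not shown on the slide), and nn_intrinsic_dim is the hypothetical estimator sketched after the procedure slide above.

```matlab
% Random points along an illustrative 3-D helix that is intrinsically 1-D.
theta = 4*pi*rand(500, 1);                 % random parameter values
X = [cos(theta), sin(theta), 0.5*theta];   % 3-D coordinates on the curve
dhat = nn_intrinsic_dim(X, 10)             % expect an estimate close to 1
```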

29 2.7.2 Correlation Dimension  The correlation dimension estimator is based on the assumption that the number of observations in a hypersphere with radius r is proportional to r^d.  Given a set of observations S_n = {x_1, …, x_n}, we define the correlation sum C(r) as the proportion of pairs (x_i, x_j) whose distance is at most r, and we can use this to estimate the intrinsic dimensionality d.

30 2.7.2 Correlation Dimension  The correlation dimension is given by the limit d_corr = lim_{r→0} log C(r) / log r (Equation 2.24).  Since we have a finite sample, arriving at the limit of zero in Equation 2.24 is not possible.  The intrinsic dimensionality can instead be estimated by calculating C(r) for two values of r and then finding the ratio d̂ = [log C(r_2) − log C(r_1)] / [log r_2 − log r_1].  We wait until the example in Section 2.7.4 to illustrate the use of the correlation dimension.
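A minimal sketch of this two-radius estimate; the function name and the choice of radii are assumptions:

```matlab
% Correlation dimension estimate from two radii (sketch).
% X is n-by-p; r1 < r2 should lie within the range of the pairwise distances.
function dhat = corr_dim(X, r1, r2)
    d = pdist(X);                     % all pairwise Euclidean distances
    C1 = mean(d <= r1);               % correlation sum C(r1)
    C2 = mean(d <= r2);               % correlation sum C(r2)
    dhat = (log(C2) - log(C1)) / (log(r2) - log(r1));
end
```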

31 2.7.3 Maximum Likelihood Approach  We have observations x_1, …, x_n residing in a p-dimensional space R^p.  We assume x_i = g(y_i), where the y_i are sampled from an unknown smooth density f with support on R^d.  We also assume that f(x) is approximately constant within a small hypersphere S_x(r) of radius r around x.

32 2.7.3 Maximum Likelihood Approach  The number of observations within distance t of x defines a counting process N_x(t); at dimensionality d its rate is λ(t) = f(x) V(d) d t^(d−1), where V(d) is the volume of the unit hypersphere in d dimensions.  Levina and Bickel provide a maximum likelihood estimator of the intrinsic dimensionality: for a given r and 0 < t < r, λ_x(t) = N_x(t)/N_x(r) is viewed as a distribution function whose parameter d is unknown, and d is estimated by maximum likelihood.

33 2.7.3 Maximum Likelihood Approach  A more convenient way to arrive at the estimate is to fix the number of neighbors k instead of the radius r.  Levina and Bickel recommend that one obtain an estimate of the intrinsic dimensionality at each observation x_i.  The final estimate is obtained by averaging these results.
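A hedged sketch of this fixed-k maximum likelihood estimate (one common form of the Levina and Bickel estimator; the function name and the averaging convention are assumptions):

```matlab
% Maximum likelihood estimate of intrinsic dimensionality with k neighbors (sketch).
% X is an n-by-p data matrix; k is the assumed number of neighbors.
function dhat = mle_intrinsic_dim(X, k)
    D = sort(squareform(pdist(X)), 2);    % sorted pairwise distances; column 1 is zero
    T = D(:, 2:k+1);                      % distances to the k nearest neighbors
    % Per-observation estimate: inverse of the mean of log(T_k / T_j), j = 1..k-1.
    dx = 1 ./ mean(log(T(:, k) ./ T(:, 1:k-1)), 2);
    dhat = mean(dx);                      % average the local estimates
end
```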

34 2.7.4 Estimation Using Packing Numbers  The r-covering number N(r) of a set S = {x_1, …, x_n} in this space is the smallest number of hyperspheres with radius r needed to cover all observations x_i in the data set S.  For a d-dimensional data set, N(r) is proportional to r^(−d), which leads to the capacity dimension d_cap = −lim_{r→0} log N(r) / log r.

35 2.7.4 Estimation Using Packing Numbers  In practice it is impossible to find the r-covering number, so Kegl uses the r-packing number M(r) instead.  M(r) is defined as the maximum number of observations x_i that can be chosen from the data set S such that every pair of chosen points is more than r apart (an r-separated subset).  The packing number sandwiches the covering number, M(2r) ≤ N(r) ≤ M(r), so it can be used in its place.

36 2.7.4 Estimation Using Packing Numbers  The packing estimate of the intrinsic dimensionality is then found from the packing numbers at two radii r_1 < r_2: d̂_pack = −[log M(r_2) − log M(r_1)] / [log r_2 − log r_1].
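A minimal sketch of this estimate using a greedy r-separated subset to approximate M(r); Kegl's actual procedure repeats this over random orderings of the data and averages, which is omitted here, and the function names and radii are assumptions:

```matlab
% Packing-number estimate of intrinsic dimensionality from two radii (sketch).
function dhat = packing_dim(X, r1, r2)
    M1 = greedy_packing(X, r1);
    M2 = greedy_packing(X, r2);
    dhat = -(log(M2) - log(M1)) / (log(r2) - log(r1));
end

% Greedy approximation to the r-packing number: keep a point only if it is
% more than r away from every point kept so far.
function m = greedy_packing(X, r)
    centers = X(1, :);
    for i = 2:size(X, 1)
        if all(sqrt(sum((centers - X(i, :)).^2, 2)) > r)
            centers = [centers; X(i, :)];   %#ok<AGROW>
        end
    end
    m = size(centers, 1);
end
```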

37 Example 2.8  We use a function from the Dimensionality Reduction Toolbox called generate_data that allows us to generate data based on a helix.  The data are plotted in Figure 2.9, where we see that they fall on a 1-D manifold along a helix.  The packing number estimator provides the best answer on this artificial data set.
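A usage sketch comparing the hypothetical estimators sketched earlier in this section; the generate_data call signature is assumed from the Dimensionality Reduction Toolbox and should be checked against its documentation, and the radii passed to packing_dim are illustrative:

```matlab
% Generate helix data and compare the intrinsic dimensionality estimates.
[X, labels] = generate_data('helix', 2000);   % assumed toolbox call signature
d_nn   = nn_intrinsic_dim(X, 10)              % nearest neighbor slope estimate
d_mle  = mle_intrinsic_dim(X, 10)             % maximum likelihood estimate
d_pack = packing_dim(X, 0.2, 0.5)             % packing estimate (assumed radii)
```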
