2.4 Nonnegative Matrix Factorization

NMF casts matrix factorization as a constrained optimization problem that seeks to factor the original matrix into the product of two nonnegative matrices.

Motivation:
- the nonnegative factors are easier to interpret
- NMF can provide better results in applications such as information retrieval and clustering

2.4 Nonnegative Matrix Factorization

Definition: given a nonnegative n x p matrix X and a desired rank k, find nonnegative matrices W (n x k) and H (k x p) such that X ≈ WH, typically by minimizing ||X - WH||_F^2 subject to the entries of W and H being nonnegative.

Solution: there are three general classes of algorithms for constructing a nonnegative matrix factorization:
- multiplicative update
- alternating least squares
- gradient descent

Procedure - Multiplicative Update Algorithm
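A minimal sketch of the standard multiplicative update rules (Lee and Seung) for the Frobenius-norm objective is given below; the random initialization, fixed iteration count, and the small constant added to the denominators are illustrative assumptions, not necessarily the book's exact procedure.

    % Minimal sketch of the standard multiplicative update rules for NMF
    % (Lee-Seung style updates for the Frobenius-norm objective).
    % X is an n-by-p nonnegative matrix; k is the desired rank.
    function [W, H] = nmf_mult(X, k, maxiter)
        [n, p] = size(X);
        W = rand(n, k);                  % random nonnegative initialization
        H = rand(k, p);
        delta = 1e-9;                    % small constant to avoid division by zero
        for it = 1:maxiter
            H = H .* (W' * X) ./ (W' * W * H + delta);    % update H with W fixed
            W = W .* (X * H') ./ (W * (H * H') + delta);  % update W with H fixed
        end
    end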

Weaknesses: the multiplicative update algorithm tends to be more sensitive to the initialization of W and H, and it has also been shown to be slow to converge.

Procedure - Alternating Least Squares

In each iteration we have a least squares step, where we solve for one of the factor matrices, followed by another least squares step to solve for the other one. In between, we ensure nonnegativity by setting any negative elements to zero. A minimal sketch of this procedure is given below.
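The sketch below illustrates the alternating least squares idea under the same assumptions as the previous sketch (random initialization, fixed iteration count); it is not the book's exact code.

    % Minimal sketch of alternating least squares for NMF.
    % Each factor comes from an unconstrained least squares solve, after
    % which any negative entries are set to zero.
    function [W, H] = nmf_als(X, k, maxiter)
        n = size(X, 1);
        W = rand(n, k);                  % random nonnegative initialization
        for it = 1:maxiter
            H = W \ X;                   % least squares solve for H with W fixed
            H(H < 0) = 0;                % enforce nonnegativity
            W = (H' \ X')';              % least squares solve for W with H fixed
            W(W < 0) = 0;                % enforce nonnegativity
        end
    end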

Example 2.4

We are going to factor the term-document matrix termdoc into a nonnegative product of two matrices W and H, where W is 6 x 3 and H is 3 x 5. The example utilizes the multiplicative update option of the NMF function.
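As a rough sketch of this kind of call, assuming a 6 x 5 nonnegative matrix named termdoc and the Statistics Toolbox function nnmf (the exact function and options in the book's example may differ):

    % Sketch assuming a 6-by-5 nonnegative term-document matrix 'termdoc'
    % and the Statistics Toolbox function nnmf (not necessarily the book's code).
    k = 3;                                            % desired rank
    [W, H] = nnmf(termdoc, k, 'algorithm', 'mult');   % multiplicative updates
    % W is 6-by-3 and H is 3-by-5; termdoc is approximated by W*H
    approx_error = norm(termdoc - W*H, 'fro');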

2.5 Factor Analysis

The factor analysis model expresses each (centered) observed variable as a linear combination of d common factors plus an error term,

x_i = λ_{i1} f_1 + λ_{i2} f_2 + ... + λ_{id} f_d + ε_i,   or in matrix form   x = Λf + ε.

The λ_ij in the above model are called the factor loadings, and the error terms ε_i are called the specific factors. The sum of the squared loadings, λ_{i1}^2 + ... + λ_{id}^2, is called the communality of x_i.

2.5 Factor Analysis

Assumptions:
- E[ε] = 0, E[f] = 0, E[x] = 0
- the error terms ε_i are uncorrelated with each other
- the common factors f_j are uncorrelated with the specific factors ε_i

Under these assumptions, the sample covariance (or correlation) matrix is of the form

Σ = ΛΛ^T + Ψ,

where Ψ is a diagonal matrix representing E[εε^T]. The variance of ε_i is called the specificity of x_i, so the matrix Ψ is also called the specificity matrix.

2.5 Factor Analysis

Both Λ and f are unknown and must be estimated, and the estimates are not unique. Once an initial estimate is obtained, other solutions can be found by rotating Λ.
- The goal of some rotations is to make the structure of Λ more interpretable, by making the λ_ij close to one or zero.
- Factor rotation methods can be either orthogonal or oblique.
- The orthogonal rotation methods include quartimax, varimax, orthomax, and equimax. The promax and procrustes rotations are oblique.

2.5 Factor Analysis

We often want to transform the observations using the estimated factor analysis model, either for plotting purposes or for further analysis methods such as clustering or classification. We can think of these observations as being transformed to the factor space; the transformed values are called factor scores, similarly to PCA. Note that the factor scores are really estimates and depend on the method that is used. The MATLAB Statistics Toolbox uses the maximum likelihood method to obtain the factor loadings, and it implements the various rotation methods mentioned earlier.

Example 2.5

Dataset: stockreturns, which consists of 100 observations representing the percent change in stock prices for 10 companies. It turns out that the first four companies can be classified as technology, the next three as financial, and the last three as retail. We use factor analysis to see if there is any structure in the data that supports this grouping.
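A minimal sketch of how such a factor analysis might be run in MATLAB, assuming the stockreturns sample data set (which loads a 100 x 10 matrix named stocks) and the factoran function; the number of factors and the rotation option are illustrative choices, not necessarily those used in the book:

    % Sketch assuming the sample data set 'stockreturns' (loads a 100-by-10
    % matrix named 'stocks') and the Statistics Toolbox function factoran.
    load stockreturns
    m = 3;                                              % number of common factors
    [Lam, Psi] = factoran(stocks, m, 'rotate', 'promax');
    % Lam is the 10-by-3 matrix of factor loadings; Psi holds the specificities.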

Example 2.5

We plot the matrix Lam (the estimated factor loadings) in Figure 2.4.

Example 2.5

2.6 Fisher’s Linear Discriminant

This method is known as the Fisher linear discriminant or mapping (FLD) [Duda and Hart, 1973] and is one of the tools used for pattern recognition and supervised learning. The goal of LDA is to reduce the dimensionality to 1-D (a linear projection) so that the projected observations are well separated.

2.6 Fisher’s Linear Discriminant

One approach to building a classifier with high-dimensional data is to project the observations onto a line in such a way that they are well separated in this lower-dimensional space. The linear separability (and thus the classification) of the observations is greatly affected by the position or orientation of this line.

2.6 Fisher’s Linear Discriminant

In LDA we seek a linear mapping that maximizes the linear class separability in the new representation of the data.

Definitions: We consider a set of n p-dimensional observations x_1, ..., x_n, with n_1 samples labeled as belonging to class 1 (λ_1) and n_2 samples labeled as belonging to class 2 (λ_2). We will denote the set of observations in the i-th class as Λ_i.

2.6 Fisher’s Linear Discriminant

The p-dimensional sample mean for the i-th class is

μ_i = (1/n_i) Σ_{x ∈ Λ_i} x,

and the mean of the projected observations in class i is m_i = w^T μ_i. As our measure of the standard deviations we use the scatter of the projected observations,

s_i^2 = Σ_{x ∈ Λ_i} (w^T x - m_i)^2.

2.6 Fisher’s Linear Discriminant

We use Equation 2.14 to measure the separation of the means for the two classes,

|m_1 - m_2| = |w^T (μ_1 - μ_2)|,

and we use the scatter as our measure of the standard deviations. The LDA is then defined as the vector w that maximizes the function

J(w) = (m_1 - m_2)^2 / (s_1^2 + s_2^2).     (2.16)

2.6 Fisher’s Linear Discriminant

We can write the solution to the maximization of Equation 2.16 as

w = S_W^{-1} (μ_1 - μ_2),

where S_W is the within-class scatter matrix, the sum of the scatter matrices of the two classes.

2.6 Fisher’s Linear Discriminant

Example 2.6: We generate some observations that are multivariate normal using the mvnrnd function and plot them as points in the top panel of Figure 2.7.
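A minimal sketch of this kind of computation is given below; the class means, covariance, and sample sizes are illustrative assumptions rather than the book's exact values.

    % Sketch: Fisher's linear discriminant for two simulated Gaussian classes.
    % The means, covariance, and sample sizes are illustrative assumptions.
    n1 = 100; n2 = 100;
    X1 = mvnrnd([-2, 2], eye(2), n1);     % class 1 observations
    X2 = mvnrnd([ 2, -2], eye(2), n2);    % class 2 observations

    mu1 = mean(X1)'; mu2 = mean(X2)';
    Sw = (n1 - 1)*cov(X1) + (n2 - 1)*cov(X2);   % within-class scatter matrix

    w = Sw \ (mu1 - mu2);                 % direction maximizing J(w)
    w = w / norm(w);

    y1 = X1 * w;                          % 1-D projections of class 1
    y2 = X2 * w;                          % 1-D projections of class 2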

2.7 Intrinsic Dimensionality

The intrinsic dimensionality is defined as the smallest number of dimensions or variables needed to model the data without loss.

Approaches: we describe several local estimators (nearest neighbor, correlation dimension, and maximum likelihood), followed by a global method based on packing numbers.

2.7.1 Nearest Neighbor Approach

Definitions: Let r_{k,x} represent the distance from x to the k-th nearest neighbor of x. The average k-th nearest neighbor distance is given by

r̄_k = (1/n) Σ_{i=1}^{n} r_{k,x_i},

and its expected value is approximately C_n k^{1/d}, where C_n is independent of k.

2.7.1 Nearest Neighbor Approach

Taking logarithms of both sides, we then obtain the following approximately linear relationship:

log r̄_k ≈ (1/d) log k + log C_n,

so the intrinsic dimensionality d can be estimated from the slope of a least squares fit of log r̄_k against log k.

2.7.1 Nearest Neighbor Approach

Pettis et al. [1979] found that their algorithm works better if potential outliers are removed before estimating the intrinsic dimensionality. Outliers are defined in terms of the nearest neighbor distances: observations whose distances to their nearest neighbors are unusually large are removed before the estimate is computed.

Procedure - Intrinsic Dimensionality
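A rough sketch of the basic idea behind this procedure, omitting the outlier removal and any iterative refinement; the range of k values is an illustrative assumption.

    % Rough sketch: estimate intrinsic dimensionality from the slope of
    % log(average k-th nearest neighbor distance) against log(k).
    % Outlier removal and iterative refinement are omitted.
    function dhat = nn_intrinsic_dim(X, kmax)
        D = squareform(pdist(X));             % pairwise Euclidean distances
        Dsort = sort(D, 2);                   % each row sorted ascending
        k = (1:kmax)';
        rbar = mean(Dsort(:, 2:kmax+1), 1)';  % average k-th neighbor distance
                                              % (column 1 is the zero self-distance)
        p = polyfit(log(k), log(rbar), 1);    % slope approximates 1/d
        dhat = 1 / p(1);
    end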

Example 2.7

We first generate some data to illustrate the functionality of the algorithm. The data lie on a helix, a 3-D curve described by parametric equations in a single angle, and points are randomly chosen along this path. For this data set, the dimensionality is 3, but the intrinsic dimensionality is 1. The resulting estimate of the intrinsic dimensionality is 1.14.
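A minimal sketch of generating helix data of this kind; the particular parametrization, angle range, and sample size are illustrative assumptions, not necessarily the book's exact values.

    % Sketch: generate points along a helix in 3-D.
    % The parametrization, angle range, and sample size are illustrative.
    n = 1000;
    theta = unifrnd(0, 4*pi, n, 1);            % random positions along the curve
    X = [cos(theta), sin(theta), 0.1*theta];   % n-by-3 data on a 1-D manifold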

2.7.2 Correlation Dimension

The correlation dimension estimator is based on the assumption that the number of observations in a hypersphere with radius r is proportional to r^d. Given a set of observations S_n = {x_1, ..., x_n}, we define the correlation integral C(r) as the proportion of pairs of observations that lie within distance r of each other,

C(r) = (2 / (n(n-1))) Σ_{i<j} I(||x_i - x_j|| ≤ r),

so we can use this to estimate the intrinsic dimensionality d.

2.7.2 Correlation Dimension

The correlation dimension is then given by

d_corr = lim_{r→0} log C(r) / log r.      (2.24)

Since we have a finite sample, arriving at the limit of zero in Equation 2.24 is not possible. The intrinsic dimensionality can instead be estimated by calculating C(r) for two values of r and finding the ratio

d̂ = [log C(r_2) - log C(r_1)] / [log r_2 - log r_1].

The example at the end of this section illustrates the use of the correlation dimension.
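A minimal sketch of this ratio estimate; the two radii passed in are illustrative choices.

    % Sketch of the correlation dimension ratio estimate.
    % The two radii r1 < r2 are illustrative choices.
    function dhat = corr_dim(X, r1, r2)
        d = pdist(X);                          % vector of pairwise distances
        C1 = mean(d <= r1);                    % proportion of pairs within r1
        C2 = mean(d <= r2);                    % proportion of pairs within r2
        dhat = (log(C2) - log(C1)) / (log(r2) - log(r1));
    end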

2.7.3 Maximum Likelihood Approach

We have observations x_1, ..., x_n residing in a p-dimensional space R^p, with x_i = g(y_i), where the y_i are sampled from an unknown smooth density f with support on R^d. We assume that f(x) is approximately constant within a small hypersphere S_x(r) of radius r around x.

2.7.3 Maximum Likelihood Approach

Treating the number of observations N_x(t) that fall within distance t of x as a Poisson process, the rate λ(t) of the process N_x(t) at dimensionality d is proportional to

f(x) V(d) d t^{d-1},

where V(d) is the volume of the unit hypersphere in R^d. Levina and Bickel provide a maximum likelihood estimator of the intrinsic dimensionality: with the radius R fixed and 0 < t < R, the normalized count λ_x(t) = N_x(t)/N_x(R) behaves like a distribution function whose parameter d is unknown, and d is estimated by maximum likelihood.

2.7.3 Maximum Likelihood Approach

A more convenient way to arrive at the estimate is to fix the number of neighbors k instead of the radius r, which gives

d̂_k(x_i) = [ (1/(k-1)) Σ_{j=1}^{k-1} log( T_k(x_i) / T_j(x_i) ) ]^{-1},

where T_j(x_i) is the distance from x_i to its j-th nearest neighbor. Levina and Bickel recommend that one obtain estimates of the intrinsic dimensionality at each observation x_i; the final estimate is obtained by averaging these results,

d̂ = (1/n) Σ_{i=1}^{n} d̂_k(x_i).
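A minimal sketch of this k-nearest-neighbor form of the estimator; the value of k and the (k-1) normalization are illustrative assumptions.

    % Sketch of the Levina-Bickel style maximum likelihood estimate with a
    % fixed number of neighbors k (the value of k is an illustrative choice).
    function dhat = mle_intrinsic_dim(X, k)
        D = squareform(pdist(X));
        Dsort = sort(D, 2);                    % row i: sorted distances from x_i
        T = Dsort(:, 2:k+1);                   % distances to the 1st,...,k-th neighbors
        di = 1 ./ mean(log(T(:, k) ./ T(:, 1:k-1)), 2);   % per-point estimates
        dhat = mean(di);                       % average over all observations
    end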

2.7.4 Estimation Using Packing Numbers

The r-covering number N(r) of a set S = {x_1, ..., x_n} in this space is the smallest number of hyperspheres with radius r needed to cover all observations x_i in the data set S. Since N(r) of a d-dimensional data set is proportional to r^{-d}, this leads to the capacity dimension

d_cap = - lim_{r→0} log N(r) / log r.

2.7.4 Estimation Using Packing Numbers

In practice it is impossible to find the r-covering number, so Kegl uses the r-packing number M(r) instead. It is defined as the maximum number of observations x_i in the data set S that can be chosen so that the distance between any pair of them is at least r (equivalently, so that hyperspheres of radius r/2 centered at the chosen observations do not overlap). The packing and covering numbers satisfy the sandwich inequality

M(2r) ≤ N(r) ≤ M(r).

2.7.4 Estimation Using Packing Numbers

Using two radii r_1 < r_2, the packing estimate of the intrinsic dimensionality is found using

d̂_pack = - [log M(r_2) - log M(r_1)] / [log r_2 - log r_1].
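A rough sketch of this estimate, using a simple greedy approximation to the packing number; the greedy selection and the choice of radii are illustrative assumptions, not Kegl's exact algorithm.

    % Sketch of the packing-number estimate of intrinsic dimensionality,
    % using a simple greedy approximation to the r-packing number.
    function dhat = packing_dim(X, r1, r2)
        M1 = greedy_pack(X, r1);
        M2 = greedy_pack(X, r2);
        dhat = -(log(M2) - log(M1)) / (log(r2) - log(r1));
    end

    function M = greedy_pack(X, r)
        D = squareform(pdist(X));
        keep = true(size(X, 1), 1);
        M = 0;
        for i = 1:size(X, 1)
            if keep(i)
                M = M + 1;                     % accept x_i into the packing
                keep(D(i, :)' < r) = false;    % exclude points within r of it
            end
        end
    end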

Example 2.8

We use a function from the Dimensionality Reduction Toolbox called generate_data that allows us to generate data based on a helix. The data are plotted in Figure 2.9, where we see that they fall on a 1-D manifold along a helix. Of the estimators described in this section, the packing number estimator provides the best answer for this artificial data set.
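As a usage sketch tying the earlier estimator sketches together: the helix is generated by hand as in the earlier sketch, since the exact interface of generate_data is not assumed here, and the neighbor counts and radii are illustrative choices.

    % Usage sketch: apply the estimator sketches above to helix-like data.
    % The helix parametrization, neighbor counts, and radii are illustrative.
    theta = unifrnd(0, 4*pi, 2000, 1);
    X = [cos(theta), sin(theta), 0.1*theta];

    d_nn   = nn_intrinsic_dim(X, 10);          % nearest neighbor estimate
    d_corr = corr_dim(X, 0.1, 0.2);            % correlation dimension estimate
    d_mle  = mle_intrinsic_dim(X, 10);         % maximum likelihood estimate
    d_pack = packing_dim(X, 0.1, 0.2);         % packing number estimate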