1 Variational Graph Embedding for Globally and Locally Consistent Feature Extraction
Shuang Hong Yang and Hongyuan Zha (College of Computing, Georgia Tech, USA), S. Kevin Zhou (Siemens Corporate Research Inc., USA), Bao-Gang Hu (Chinese Academy of Sciences, China). Presented by Yang on 09/08/2009 at ECML PKDD 2009.

4 Motivation No.1 Graph-Based Learning
Learning by exploiting relationships in data: kernel / (dis-)similarity learning, metric learning, dimensionality reduction / manifold learning, spectral clustering, semi-supervised classification / clustering, relational learning, collaborative filtering / co-clustering, ...
Data graph: G = (V, E, W). V: node set, where each node v corresponds to a data point x. E: edge set. W: edge weights, w_ij = w(x_i, x_j).
Where does the graph come from? Existing graph constructions:
Prior: prior knowledge, side information, etc.
Naive: Gaussian similarity, ε-ball, kNN, b-matching, etc. (a minimal construction sketch follows below).
GBL is very sensitive to the graph structure and edge weights.
Q1: How do we construct a reliable and effective graph?
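As a point of reference for the "naive" constructions above, a minimal sketch of a symmetric kNN graph with Gaussian edge weights (NumPy; names and defaults are illustrative, not from the paper):

```python
import numpy as np

def knn_gaussian_graph(X, k=5, sigma=1.0):
    """Naive graph construction: symmetric kNN adjacency with Gaussian edge weights.

    X     : (n, d) data matrix, one sample per row
    k     : number of nearest neighbours per node
    sigma : bandwidth of the Gaussian similarity
    """
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                                     # exclude self-edges
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[:k]                                 # k nearest neighbours of node i
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * sigma ** 2))
    return np.maximum(W, W.T)                                        # symmetrise the kNN graph
```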

8 Motivation No.2 Feature Extraction
Find a mapping such that the high-dimensional data x is compressed to a low-dimensional representation z.
Existing FE methods fall into one of two classes:
Global/statistical: optimize globally defined statistical measures (variance, entropy, correlation, Fisher information, etc.). E.g., PCA, FDA and other classical methods. Good at preserving global properties, but suboptimal when the underlying assumptions are violated (e.g., multi-modality).
Local/geometric: preserve the local geometric (submanifold) structure from which the data are sampled. E.g., ISOMAP, LLE, Laplacian Eigenmap, LPP. Good at preserving local/geometric structure, but neglect the overall properties of the data, e.g., relevance to the label, variance, redundancy.
Q2: Is it possible to combine the strengths of both in one framework?
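For concreteness, a standard linear instance of the two families (textbook formulations of PCA and LPP, quoted here only to make the contrast explicit; the notation is ours, not the paper's):

\[
z = A^\top x, \qquad A \in \mathbb{R}^{D \times d},\ d \ll D .
\]

Global/statistical (PCA): \( \max_{A^\top A = I} \operatorname{tr}(A^\top \Sigma A) \), where \( \Sigma \) is the sample covariance of the inputs.
Local/geometric (LPP): \( \min_{A} \sum_{i,j} w_{ij}\, \lVert A^\top x_i - A^\top x_j \rVert^2 \) subject to \( A^\top X \Lambda X^\top A = I \), where the \( w_{ij} \) are graph weights and \( \Lambda \) is the diagonal degree matrix, \( \Lambda_{ii} = \sum_j w_{ij} \).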

9 This Paper A well-defined graph can be established from theoretically justified learning measures. Taking feature extraction as an example graph-based learning task, and Mutual Information (MI) and Bayes Error Rate (BER) as example measures, we show that both the global/statistical and the local/geometric structure of the data can be captured by a single objective, leading to a high-quality feature learner with various appealing properties. Algorithm: Variational Graph Embedding, which iterates between graph learning and spectral graph embedding.

10 Variational Graph Embedding for FE
Outline:
Motivation
Variational Graph Embedding for FE
Theoretically Optimal Measures: MI and BER
Variational Graph Embedding (MI)
Variational Graph Embedding (BER)
Experiments

11 Optimal Feature Extraction
Theoretically optimal measures for feature learning:
Mutual Information (MI): accounts for high-order statistics, i.e., the complete dependency between y and z.
Bayes Error Rate (BER): directly maximizes discrimination / generalization ability.
Practical obstacles:
Estimation: both involve unknown distributions and numerical integration over high-dimensional data.
Optimization: coupled variables, complicated objective.
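For reference, the two measures in their standard generic forms (z is the extracted feature, y the class label; these are textbook definitions, not the paper's specific estimators):

\[
I(z; y) \;=\; \sum_{c} \int p(z, y=c)\,\log \frac{p(z, y=c)}{p(z)\,P(y=c)}\, dz ,
\qquad
\mathrm{BER} \;=\; 1 - \int \Big(\max_{c}\, P(y=c \mid z)\Big)\, p(z)\, dz .
\]

Both involve the unknown densities \( p(z) \) and \( p(z, y) \) and an integral over the feature space, which is exactly the practical obstacle noted above.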

12 Two steps making things easy
Our solution: two steps that make things easy.
1. Nonparametric estimation: eliminates the numerical integration of unknown high-dimensional distributions and reduces the task to kernel-based optimization.
2. Variational kernel approximation: turns the complex optimization into variational graph embedding.
The resulting algorithm is an EM-style iteration between learning a graph (optimizing the variational parameters) and embedding the learned graph (optimizing the learner parameters); a schematic sketch follows below.
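A schematic of the alternation, with deliberately simple stand-in updates (a Gaussian-weighted graph in the E-step and an LPP-style spectral step in the M-step); the paper's actual update rules come from the MI/BER bounds derived in the following slides:

```python
import numpy as np
from scipy.linalg import eigh

def vge_sketch(X, d, n_iters=20, eps=1e-6, seed=0):
    """EM-style alternation: learn a graph on the current embedding, then embed that graph.
    The E- and M-step updates below are illustrative stand-ins, not the paper's exact formulas."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    A = np.linalg.qr(rng.standard_normal((D, d)))[0]   # random orthonormal projection to start
    for _ in range(n_iters):
        Z = X @ A                                      # current low-dimensional embedding
        # "E-step": learn a weighted graph (here: Gaussian weights on embedded distances)
        d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / (np.median(d2) + eps))
        np.fill_diagonal(W, 0.0)
        # "M-step": embed the learned graph via a generalized eigenproblem (LPP-style)
        deg = np.diag(W.sum(axis=1))
        L = deg - W                                    # graph Laplacian
        _, vecs = eigh(X.T @ L @ X + eps * np.eye(D),
                       X.T @ deg @ X + eps * np.eye(D))
        A = vecs[:, :d]                                # smallest generalized eigenvectors
    return A
```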

13 Variational Graph Embedding: MI
1. Max nonparametric quadratic MI:
1.1 Shannon entropy → Renyi's quadratic entropy.
1.2 Kernel density estimation with an isotropic Gaussian kernel.
The result: optimizing a big sum of nonconvex functions!
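The two substitutions in step 1 are standard in information-theoretic learning; a sketch in our notation, with \( G_\sigma \) an isotropic Gaussian kernel:

\[
H_2(z) = -\log \int p(z)^2\, dz, \qquad
\hat{p}(z) = \frac{1}{N}\sum_{i=1}^{N} G_\sigma(z - z_i),
\]

and, since the convolution of two Gaussians is again a Gaussian,

\[
\int \hat{p}(z)^2\, dz \;=\; \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} G_{\sqrt{2}\sigma}(z_i - z_j).
\]

The quadratic MI is built from such pairwise-sum terms, so the objective becomes a large sum of Gaussian, hence nonconvex, functions of the embeddings \( z_i \).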

14 Variational Graph Embedding: MI
1. Max nonparametric quadratic MI → 2. Max variational nonparametric QMI:
1.1 Shannon entropy → Renyi's quadratic entropy.
1.2 Kernel density estimation with an isotropic Gaussian kernel.
2. Kernel term → variational kernel term (Jaakkola-Jordan variational lower bound for the exponential function).
Variational Graph Embedding (MI):
E-step: optimize the variational parameters (equivalent to learning a weighted graph).
M-step: linearly embed the learned graph (solvable by spectral analysis in linear or kernel space).
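A sketch of the kind of bound involved: because \( \exp(\cdot) \) is convex, its tangent at any point is a global lower bound (whether this tangent bound is exactly the one used in the paper is our assumption):

\[
\exp(t) \;\ge\; \exp(\xi)\,\bigl(1 + t - \xi\bigr) \quad \text{for all } \xi,\ \text{with equality at } t = \xi .
\]

Applying it with \( t = -\lVert z_i - z_j \rVert^2 / (2\sigma^2) \) makes a kernel term linear in the squared embedded distance, and the multipliers \( \exp(\xi_{ij}) \) act as edge weights of a graph. Tightening the bound over the \( \xi_{ij} \) corresponds to the E-step (graph learning); optimizing the resulting weighted sum of squared distances over the projection corresponds to the M-step (graph embedding).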

15 Variational Graph Embedding: MI
Initial graph: approximate each kernel term by its first-order Taylor expansion.
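One plausible reading of this initialization (the choice of expansion point is our assumption): linearize each kernel term around distances computed from the raw inputs,

\[
\exp\!\Big(-\tfrac{\lVert z_i - z_j\rVert^2}{2\sigma^2}\Big)
\;\approx\;
\exp\!\Big(-\tfrac{d^{(0)}_{ij}}{2\sigma^2}\Big)
\Big(1 - \tfrac{\lVert z_i - z_j\rVert^2 - d^{(0)}_{ij}}{2\sigma^2}\Big),
\]

so the constants \( \exp\!\big(-d^{(0)}_{ij}/(2\sigma^2)\big) \) serve as the initial edge weights, i.e., a Gaussian-similarity graph on the input data.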

20 Variational Graph Embedding: MI
Justification:
Max-Relevance-Min-Redundancy with a natural tradeoff.
Max discrimination.
Locality preserving.
Connection with LPP and FDA.

21 Variational Graph Embedding: BER
Bayes Error Rate → nonparametric estimation → variational kernel approximation → variational graph embedding.
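For reference, a generic nonparametric plug-in estimate of the Bayes error (a standard leave-one-out Parzen-window form; not necessarily the paper's exact estimator):

\[
\widehat{\mathrm{BER}} = 1 - \frac{1}{N}\sum_{i=1}^{N} \max_{c}\ \widehat{P}(y=c \mid z_i),
\qquad
\widehat{P}(y=c \mid z_i) = \frac{\sum_{j \ne i,\ y_j = c} G_\sigma(z_i - z_j)}{\sum_{j \ne i} G_\sigma(z_i - z_j)} .
\]

The max over classes and the kernel sums are what make direct optimization hard; the variational kernel approximation replaces each Gaussian term with a linear (graph) surrogate, as in the MI case.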

22 Variational Graph Embedding: BER
The BER objective leads to the same Variational Graph Embedding iteration:
E-step: optimize the variational parameters (equivalent to learning a weighted graph).
M-step: linearly embed the learned graph (solvable by spectral analysis in linear or kernel space).

23 Experiments Face recognition
Compared with both global/statistical methods (PCA/KPCA, LDA/KDA) and local/geometric methods (LPP/KLPP, MFA/KMFA), and with both supervised (LDA, MFA) and unsupervised (PCA, LPP) baselines.
Three benchmark facial image sets (Yale, ORL and CMU PIE) are used. The images were taken in different environments, at different times, and with different poses, facial expressions and details. All raw images are normalized to 32x32.
For each data set, we randomly select v images of each person as training data and leave the rest for testing. Only the training data are used to learn features.
To evaluate the different methods, the classification accuracy of a k-NN classifier on the test data is used as the evaluation metric (a protocol sketch follows below).
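A sketch of this protocol (scikit-learn; PCA stands in for the learned feature extractor, and all names and defaults are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA                 # stand-in for the learned feature extractor
from sklearn.neighbors import KNeighborsClassifier

def evaluate_split(X, y, n_train_per_class, n_components=50, k=1, seed=0):
    """Protocol sketch: per-class random split, learn features on the training split only,
    then score a k-NN classifier on the held-out images."""
    rng = np.random.default_rng(seed)
    train = np.zeros(len(y), dtype=bool)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        train[rng.choice(idx, size=n_train_per_class, replace=False)] = True

    fe = PCA(n_components=n_components).fit(X[train])  # swap in MIE / BERE here
    Z_tr, Z_te = fe.transform(X[train]), fe.transform(X[~train])

    clf = KNeighborsClassifier(n_neighbors=k).fit(Z_tr, y[train])
    return clf.score(Z_te, y[~train])                  # classification accuracy on test data
```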

24 Experiments Face recognition

25 Experiments Face recognition
On average, MIE outperforms PCA by 36%, FDA by 8%, and MFA by 4%; BERE outperforms PCA by 39%, FDA by 11%, and MFA by 6%. The improvements are even more significant in the kernel case: kMIE gains (31%, 9%, 7%) and kBERE gains (33%, 10%, 8%) over (KPCA, KDA, KMFA).

26 Discussion 1. It is possible to capture both the global/statistical and the local/geometric structure in one well-defined objective.
2. Graph construction: what makes a good graph?
Predictive: relevant to the target concept we are inferring.
Locally and globally consistent: accounts for both the local and the global information revealed by the data, and strives for a natural tradeoff between them.
Computationally convenient: easy and inexpensive to learn and to apply at test time.
A feasible graph-construction approach: nonparametric measure + variational kernel approximation. Potential applications: kernel / (dis-)similarity learning, metric learning, dimensionality reduction, spectral clustering, semi-supervised classification / clustering, relational learning, collaborative filtering / co-clustering, ...

27 Thanks!

