An Efficient Greedy Method for Unsupervised Feature Selection

An Efficient Greedy Method for Unsupervised Feature Selection. Ahmed Farahat, joint work with Ali Ghodsi and Mohamed Kamel. {afarahat, aghodsib, mkamel}@uwaterloo.ca. ICDM 2011.

Outline: Introduction (Dimension Reduction & Feature Selection, Previous Work); Proposed Work (Feature Selection Criterion, Recursive Formula, Greedy Feature Selection); Experiments and Results; Conclusion.

Dimension Reduction. In data mining applications, data instances are typically described by a huge number of features: images (>2 megapixels), documents (>10K words). Most of these features are irrelevant or redundant. Goal: reduce the dimensionality of the data to allow a better understanding of the data and to improve the performance of other learning tasks.

Feature Selection vs. Extraction. Feature selection (a.k.a. variable selection) searches for a relevant subset of existing features: (−) a combinatorial optimization problem, (+) features are easy to interpret. Feature extraction (a.k.a. feature transformation) learns a new set of features: (+) unique solutions in polynomial time, (−) features are difficult to interpret.

Feature Selection. Wrapper vs. filter methods: wrapper methods search for features which enhance the performance of the learning task ((+) more accurate, (−) more complex); filter methods analyze the intrinsic properties of the data and select highly ranked features according to some criterion ((+) less complex, (−) less accurate). Supervised vs. unsupervised methods: this work considers filter, unsupervised methods.

Previous Work. PCA-based: calculate PCA, associate features with principal components based on their coefficients, and select the features associated with the first principal components (Jolliffe, 2002). Sparse PCA-based: calculate sparse PCA (Zou et al., 2006) and select, for each principal component, the subset of features with non-zero coefficients. Convex Principal Feature Selection (CPFS) (Masaeli et al., SDM'10): formulates a continuous optimization problem which minimizes the reconstruction error of the data matrix under sparsity constraints.

Previous Work (Cont.). Feature Selection using Feature Similarity (FSFS) (Mitra et al., TPAMI'02): groups features into clusters and then selects a representative feature for each cluster. Laplacian Score (LS) (He et al., NIPS'06): selects features that preserve similarities between data instances. Multi-Cluster Feature Selection (MCFS) (Cai et al., KDD'10): selects features that preserve the multi-cluster structure of the data.

This Work. A criterion for unsupervised feature selection that minimizes the reconstruction error of the data matrix based on the selected subset of features; a recursive formula for calculating the criterion; an effective greedy algorithm for unsupervised feature selection.

Feature Selection Criterion. (Figure: the data matrix, with m instances and n features, is approximated by a reconstructed matrix built from the selected features; the goal is to minimize the least-squares reconstruction loss.)

Feature Selection Criterion (Cont.). Problem 1 (Unsupervised Feature Selection): find a subset of features of a given size k that minimizes the least-squares reconstruction error of the data matrix from the selected columns. This is an NP-hard combinatorial optimization problem.
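To make the criterion concrete, here is a minimal NumPy sketch of how such a reconstruction error can be evaluated for a candidate subset. It is an illustration consistent with the description above, not the authors' code; the function name and the use of a plain least-squares solve are my own choices.

```python
import numpy as np

def reconstruction_error(A, selected):
    """Least-squares error of reconstructing every column of A
    from the selected columns: ||A - A_S * coeffs||_F^2.

    A        : (m, n) data matrix (m instances, n features)
    selected : iterable of column indices
    """
    A_S = A[:, list(selected)]                        # m x |S| matrix of the chosen features
    coeffs, *_ = np.linalg.lstsq(A_S, A, rcond=None)  # |S| x n least-squares coefficients
    return float(np.linalg.norm(A - A_S @ coeffs, "fro") ** 2)
```

Exhaustively minimizing this error over all subsets of size k is the NP-hard search referred to above; the greedy method instead adds one feature at a time.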

Recursive Selection Criterion. Theorem 1: given a set of already-selected features S, the criterion for S together with any additional set of features P can be computed recursively from the criterion for S and the residual of the data matrix after reconstruction from S.
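The formula images did not survive the transcript; as a hedged reconstruction based on the criterion described above (the authoritative statement is in the ICDM 2011 paper), write F(S) for the reconstruction error when the features in S are selected, P^{(S)} for the projection onto the span of the selected columns, and E = A − P^{(S)}A for the residual. The recursion then takes the form

$$ F(S \cup P) \;=\; F(S) \;-\; \bigl\| E_{:P}\,(E_{:P}^{\top} E_{:P})^{-1} E_{:P}^{\top} E \bigr\|_F^{2}, $$

i.e., the error decreases by exactly the part of the current residual that the new columns E_{:P} can explain.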

Recursive Selection Criterion (Cont.). Lemma 1: given a set of features S and any additional set of features P, the quantities needed for S ∪ P can be expressed in terms of those already computed for S.

Proof of Lemma 1

Proof of Lemma 1 (Cont.). Take the Schur complement of the block corresponding to S in the partitioned matrix for S ∪ P, and apply the block-wise matrix inversion formula.

Recursive Selection Criterion (Cont.). Corollary 1: given a set of features S and any additional set of features P, an analogous recursive expression holds. Proof: using Lemma 1.

Recursive Selection Criterion. Theorem 1 (restated): given a set of already-selected features S and any additional set of features P, the criterion for S ∪ P can be computed recursively from the criterion for S and the residual of the data matrix after reconstruction from S.

Proof of Theorem 1

Greedy Selection Criterion. Problem 2 (Greedy Feature Selection): at iteration t, find the single feature l whose addition to the currently selected subset minimizes the criterion. Using Theorem 1, the decrease achieved by each candidate depends only on the current residual matrix, so Problem 2 is equivalent to maximizing a simple per-feature score computed from that residual.
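A minimal sketch of a greedy loop of this form follows (my reading of the slides, with illustrative names; the ratio used as the score is the decrease implied by the hedged recursion above when a single feature is added):

```python
import numpy as np

def greedy_feature_selection(A, k):
    """Greedily pick k column indices of A (m x n) so as to reduce the
    least-squares reconstruction error of A from those columns."""
    E = A.astype(float).copy()                     # residual matrix; initially the data itself
    selected = []
    for _ in range(k):
        f = E.T @ E                                # f[:, i] = E^T E_i for each candidate feature i
        g = np.einsum("ij,ij->j", E, E)            # g[i] = ||E_i||^2, residual energy of feature i
        scores = (f * f).sum(axis=0) / np.maximum(g, 1e-12)
        if selected:
            scores[selected] = -np.inf             # never reselect a feature
        l = int(np.argmax(scores))
        selected.append(l)
        # Deflate: remove from every column its component along the chosen residual column
        e = E[:, l:l + 1]
        E = E - e @ (e.T @ E) / max(float((e * e).sum()), 1e-12)
    return selected
```

Note that this naive version rebuilds E^T E from scratch at every iteration; the recursive update formulas for f and g and the memory-efficient variant on the following slides exist precisely to avoid that cost.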

Greedy Selection Criterion (Cont.)

Greedy Selection Criterion (Cont.). At iteration t, computing the scores directly from the residual matrix raises two problems: it is memory-inefficient, and it is computationally complex per iteration.

Greedy Selection Criterion (Cont.). At iteration t, define auxiliary matrices E and G, calculate E and G recursively from their values at the previous iteration, and define from them the per-feature quantities f and g used by the selection score.

Memory-Efficient Selection: update formulas for f and g.

Partition-based Selection. The basic greedy criterion evaluates, at each iteration, n candidate features against n projections, which dominates the per-iteration cost. Solution: partition the features into c << n random groups and select, at each iteration, the feature which best represents the centroids of these groups. Similar update formulas can be developed for f and g, and the per-iteration complexity is reduced accordingly.
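A rough sketch of the partition-based variant (again my reading of the slide, with illustrative names): instead of scoring each candidate against all n residual columns, score it against the residuals of c random-group centroids, which shrinks the per-iteration work.

```python
import numpy as np

def partition_greedy_selection(A, k, c, seed=0):
    """Partition-based greedy sketch: score candidates against the
    centroids of c random feature groups instead of all n features."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    groups = rng.integers(0, c, size=n)            # random assignment of features to c groups
    # Centroid (mean column) of each group: B is m x c
    B = np.stack([A[:, groups == j].mean(axis=1) if np.any(groups == j)
                  else np.zeros(m) for j in range(c)], axis=1)

    E = A.astype(float).copy()                     # residuals of the candidate columns
    R = B.astype(float).copy()                     # residuals of the centroid targets
    selected = []
    for _ in range(k):
        f = E.T @ R                                # n x c: correlation of each candidate with each centroid residual
        g = np.einsum("ij,ij->j", E, E)            # residual energy of each candidate
        scores = (f * f).sum(axis=1) / np.maximum(g, 1e-12)
        if selected:
            scores[selected] = -np.inf
        l = int(np.argmax(scores))
        selected.append(l)
        e = E[:, l:l + 1]
        denom = max(float((e * e).sum()), 1e-12)
        R = R - e @ (e.T @ R) / denom              # deflate centroid residuals
        E = E - e @ (e.T @ E) / denom              # deflate candidate residuals
    return selected
```

With c fixed and much smaller than n, each iteration scores candidates against an m x c target matrix instead of an m x n one.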

Experiments. Seven methods were compared: PCA-LRG, a PCA-based method that selects features associated with the first k principal components (Masaeli et al., 2010); FSFS, Feature Selection using Feature Similarity (Mitra et al., 2002); LS, the Laplacian Score method (He et al., 2006); SPEC, the spectral feature selection method (Zhao and Liu, 2007); MCFS, the Multi-Cluster Feature Selection method (Cai et al., 2010); GreedyFS, the basic greedy algorithm (using recursive update formulas for f and g but without random partitioning); and PartGreedyFS, the partition-based greedy algorithm.

Data Sets. These data sets were recently used by Cai et al. (2010) to evaluate different feature selection methods in comparison to the Multi-Cluster Feature Selection (MCFS) method.

Results – k-means

Results – Affinity Propagation

Results – Run Times

Conclusion. This work presents a novel greedy algorithm for unsupervised feature selection: a feature selection criterion which measures the reconstruction error of the data matrix based on the subset of selected features; a recursive formula for calculating the feature selection criterion; and an efficient greedy algorithm for feature selection, with two memory- and time-efficient variants. It has been empirically shown that the proposed algorithm achieves better clustering performance and is less computationally demanding than methods that give comparable clustering performance.

Thank you!

References
I. Jolliffe, Principal Component Analysis, 2nd ed., Springer, 2002.
H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component analysis,” J. Comput. Graph. Stat., 2006.
M. Masaeli, Y. Yan, Y. Cui, G. Fung, and J. Dy, “Convex principal feature selection,” SIAM SDM, 2010.
X. He, D. Cai, and P. Niyogi, “Laplacian score for feature selection,” NIPS, 2006.
Y. Cui and J. Dy, “Orthogonal principal feature selection,” Sparse Optimization and Variable Selection Workshop, ICML, 2008.
Z. Zhao and H. Liu, “Spectral feature selection for supervised and unsupervised learning,” ICML, 2007.
D. Cai, C. Zhang, and X. He, “Unsupervised feature selection for multi-cluster data,” KDD, 2010.
P. Mitra, C. Murthy, and S. Pal, “Unsupervised feature selection using feature similarity,” IEEE Trans. Pattern Anal. Mach. Intell., 2002.