Learning With Structured Sparsity


Learning With Structured Sparsity
Junzhou Huang, Tong Zhang, Dimitris Metaxas
Rutgers, The State University of New Jersey

Outline
- Motivation for structured sparsity: more priors improve model selection stability
- Generalizing group sparsity to structured sparsity
  - CS: the structured RIP requires fewer samples
  - Statistical estimation: more robust to noise
  - Examples of structured sparsity: graph sparsity
- An efficient algorithm for structured sparsity
  - StructOMP: a structured greedy algorithm

Standard Sparsity
Suppose X is the n × p data matrix and let $y = Xw^* + \epsilon$ denote the noisy observations of a sparse vector $w^*$. supp(w), the support set of a sparse vector w, is the set of indices corresponding to its nonzero entries. The standard sparsity problem, widely studied in past years, is formulated as $\min_w \|Xw - y\|_2^2$ subject to $\|w\|_0 \le k$.
Without priors on supp(w):
- Convex relaxation (L1 regularization), such as Lasso
- Greedy algorithms, such as OMP
Complexity for k-sparse data: O(k ln p)
- CS: related to the number of random projections
- Statistics: related to the 2-norm estimation error
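As a reference point for the greedy approach mentioned above, here is a minimal orthogonal matching pursuit (OMP) sketch; the function name and the fixed-k stopping rule are illustrative assumptions, not the authors' code.

```python
import numpy as np

def omp(X, y, k):
    """Greedily select k columns of X, refitting least squares on the chosen support."""
    n, p = X.shape
    support, residual = [], y.copy()
    for _ in range(k):
        # pick the column most correlated with the current residual
        j = int(np.argmax(np.abs(X.T @ residual)))
        if j not in support:
            support.append(j)
        # refit the coefficients on the selected columns only
        w_s, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ w_s
    w = np.zeros(p)
    w[support] = w_s
    return w, support
```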

Group Sparsity
Partition {1, . . . , p} into m disjoint groups G1, G2, . . . , Gm, and suppose g groups cover the k nonzero features.
Prior on supp(w): entries within a group are either all zero or all nonzero.
Group complexity: O(k + g ln(m))
- choosing g out of m groups costs g ln(m), the feature selection complexity (MDL)
- a penalty of k is paid for estimation with k selected features (AIC)
Rigid, non-overlapping group setting.
Group complexity is much smaller than the standard sparsity complexity when the group structure is meaningful.
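As a quick illustration (not from the slides), the group complexity above can be tallied for a given support and partition; using the natural log and dropping constants is a simplifying assumption.

```python
import numpy as np

def group_complexity(support, groups):
    """support: indices of nonzero features; groups: disjoint index sets covering {0, ..., p-1}."""
    support = set(support)
    k = len(support)
    m = len(groups)
    g = sum(1 for G in groups if support & set(G))  # groups touched by the support
    return k + g * np.log(m)                        # O(k + g ln m)
```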

Motivation: Dimension Effect
- Knowing supp(w) exactly: O(k) complexity
- Lasso finds supp(w) with O(k ln p) complexity
- Group Lasso finds supp(w) with O(g ln m) complexity
Natural question: what if we have partial knowledge of supp(w)?
- Structured sparsity: not all feature combinations are equally likely (e.g., graph sparsity)
- Complexity between k ln p and k: more knowledge leads to reduced complexity

Example: tree-structured sparsity in wavelet compression
[Figure: original image; recovery with unstructured sparsity, O(k ln p); recovery with structured sparsity, O(k)]
Additional structure information leads to better performance.

Related Works (I)
Bayesian framework for group/tree sparsity (Wipf & Rao 2007, Ji et al. 2008, He & Carin 2008)
- Empirical evidence only; no theoretical results showing how much better it is, or under what conditions
Group Lasso
- Extensive literature of empirical evidence (Yuan & Lin 2006)
- Theoretical justifications (Bach 2008, Kowalski & Yuan 2008, Obozinski et al. 2008, Nardi & Rinaldo 2008, Huang & Zhang 2009)
- Limitations: 1) cannot handle more general structures; 2) cannot handle overlapping groups

Related Works (II)
Composite absolute penalty (CAP) [Zhao et al. 2006]
- Handles overlapping groups; no theory on its effectiveness
Mixed norm penalty [Kowalski & Torresani 2009]
- Structured shrinkage operations to identify the structure maps; no additional theoretical justification
Model-based compressive sensing [Baraniuk et al. 2009]
- Some theoretical results for the compressive sensing case
- No generic framework to flexibly describe a wide class of structures

Our Goal
Empirical work clearly shows that better performance can be achieved with additional structure, yet there is no general theoretical framework for structured sparsity that quantifies its effectiveness.
Goals:
- Quantify structured sparsity
- Bound the minimal number of measurements required in CS
- Guarantee estimation accuracy under stochastic noise
- Provide a generic scheme and algorithm to flexibly handle a wide class of structured sparsity problems

Structured Sparsity Regularization
Assumption: not all sparse patterns are equally likely.
Quantifying structure:
- cl(F): the number of binary bits needed to encode a feature set F (Minimum Description Length, MDL)
- Coding complexity: c(F) = |F| + cl(F), combining an AIC-style penalty |F| on the number of selected features with the MDL code length cl(F)
- Both the number of samples needed in CS and the noise tolerance in learning scale with the coding complexity.
Optimization problem: $\min_w \|Xw - y\|_2^2$ subject to $c(\mathrm{supp}(w)) \le s$.
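To make the objects above concrete, here is a small sketch under the simplified assumption that the coding complexity decomposes as c(w) = ||w||_0 + cl(supp(w)), with the code-length function cl supplied by whatever structure is in play:

```python
import numpy as np

def coding_complexity(w, cl):
    """c(w) = ||w||_0 + cl(supp(w)); cl maps a support set to its code length in bits."""
    support = frozenset(np.flatnonzero(w).tolist())
    return len(support) + cl(support)

def structured_objective(X, y, w, cl, s):
    """Least-squares loss, treated as infeasible when the coding complexity exceeds s."""
    if coding_complexity(w, cl) > s:
        return np.inf
    r = X @ w - y
    return float(r @ r)
```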

Examples of Structured Sparsity
- Standard sparsity: complexity s = O(k + k log(2p)), where k is the sparsity level
- Group sparsity: nonzeros tend to occur in groups; complexity s = O(k + g log(2m))
- Graph sparsity (with O(1) maximum degree): if a feature is nonzero, then nearby features are more likely to be nonzero; complexity s = O(k + g log p), where g is the number of connected components
- Random field sparsity: any probability distribution on binary random fields over the features induces a complexity of −log(probability)

Example: Connected Region
- A nonzero pixel implies that adjacent pixels are more likely to be nonzero.
- The complexity is O(k + g ln p), where g is the number of connected components.
- Practical complexity: O(k) when g is small.
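For intuition (not from the slides), the number of connected regions g of an image support can be counted directly; the natural log and 4-connectivity are simplifying assumptions.

```python
import numpy as np
from scipy import ndimage

def connected_region_complexity(mask):
    """mask: boolean H x W array marking the nonzero pixels of the image."""
    k = int(mask.sum())            # number of nonzero pixels
    _, g = ndimage.label(mask)     # g = number of 4-connected regions
    p = mask.size
    return k + g * np.log(p)       # O(k + g ln p); close to O(k) when g is small
```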

Example: Hierarchical Tree
- A nonzero parent implies that its children are more likely to be nonzero.
- Requires the parent to be selected as a feature whenever one of its children is selected (zero-tree structure).
- Complexity: O(k) instead of O(k ln p).
- Implication: O(k) projections suffice for wavelet CS.
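A hypothetical helper (not part of the slides) that enforces the zero-tree constraint by adding every ancestor of a selected wavelet coefficient:

```python
def close_under_parents(support, parent):
    """support: iterable of node ids; parent: dict mapping each node to its parent (root maps to None)."""
    closed = set()
    for j in support:
        while j is not None and j not in closed:
            closed.add(j)    # add the node, then keep walking up towards the root
            j = parent[j]
    return closed
```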

Proof Sketch of the Graph Complexity
- Pick a starting point for every connected component: coding complexity O(g ln p); for a tree, start from the root with coding complexity 0.
- Grow each feature node into adjacent nodes with O(1) coding complexity per node, so coding k nodes requires O(k) bits.
- Total: O(k + g ln p).

Solving Structured Sparsity
Structured sparse eigenvalue (structured RIP) condition: for an n × p random Gaussian projection matrix X, any t > 0 and $\delta \in (0, 1)$, take $n = O((s + t)/\delta^2)$ samples. Then, with probability at least $1 - e^{-t}$, for all vectors w with coding complexity no more than s:
$(1 - \delta)\|w\|_2^2 \le \|Xw\|_2^2 / n \le (1 + \delta)\|w\|_2^2$.
That is, the number of random projections needed scales with the coding complexity s rather than with k ln p.

Coding Complexity Regularization
Coding complexity regularization formulation: $\min_w \|Xw - y\|_2^2$ subject to $c(\mathrm{supp}(w)) \le s$.
With probability 1 − η, the ε-OPT solution of coding complexity regularization satisfies an estimation error bound that scales with the coding complexity s rather than with k ln p.
Good theory, but computationally inefficient:
- Convex relaxation is difficult to apply: in the graph sparsity example we would need to search through all connected components (dynamic groups) and penalize each group.
- Greedy algorithms are easy to apply.

StructOMP
Repeat:
- Select a block of features from a predefined "block set" using the block selection rule below, and add it to the current feature set.
- Find w to minimize Q(w) over the current feature set.
Block selection rule: compute the gain ratio, i.e., the reduction in the objective Q(w) divided by the increase in coding complexity, and pick the feature block that maximizes this gain: the fastest objective value reduction per unit increase of coding complexity.
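A simplified sketch of this greedy loop under assumed interfaces: `blocks`, `block_cost` (a fixed coding-complexity increment per block) and `budget` are illustrative names, least squares stands in for the inner minimization of Q(w), and this is not the authors' reference implementation.

```python
import numpy as np

def struct_omp(X, y, blocks, block_cost, budget):
    """Greedy block selection: maximize objective drop per unit of coding complexity."""
    n, p = X.shape
    support = set()
    spent = 0.0

    def refit(feats):
        feats = sorted(feats)
        w_s, *_ = np.linalg.lstsq(X[:, feats], y, rcond=None)
        r = y - X[:, feats] @ w_s
        return float(r @ r), feats, w_s

    Q = float(y @ y)                      # objective value for the empty support
    while spent < budget:
        best = None
        for b, block in enumerate(blocks):
            new_feats = support | set(block)
            if new_feats == support:      # block adds nothing new
                continue
            Q_new, _, _ = refit(new_feats)
            gain = (Q - Q_new) / block_cost[b]   # gain ratio
            if best is None or gain > best[0]:
                best = (gain, b, Q_new)
        if best is None or best[0] <= 0:
            break
        _, b, Q = best
        support |= set(blocks[b])
        spent += block_cost[b]

    _, feats, w_s = refit(support)
    w = np.zeros(p)
    w[feats] = w_s
    return w, support
```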

Convergence of StructOMP
Assume the structured sparse eigenvalue condition holds at each step. StructOMP then reaches an objective value of OPT(s) + ε with the following solution complexity under coding complexity regularization:
- Strongly sparse signals (coefficients suddenly drop to zero; the worst-case scenario): solution complexity O(s log(1/ε))
- Weakly sparse signals (coefficients decay to zero), i.e., q-compressible signals (decay at power q): solution complexity O(qs)
Under RIP, OMP is competitive with Lasso for weakly sparse signals.

Experiments
- Focus on graph sparsity, to demonstrate the advantage of structured sparsity over standard/group sparsity.
- Compare StructOMP with OMP, Lasso and group Lasso.
- The data matrix X is randomly generated with i.i.d. draws from the standard Gaussian distribution.
- Quantitative evaluation: the recovery error is defined as the relative difference in 2-norm between the estimated sparse coefficients and the ground truth.
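A sketch of this setup (the sizes n = 160 and p = 512 are taken from the first experiment below; the random seed and helper name are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 160, 512
X = rng.standard_normal((n, p))    # i.i.d. standard Gaussian entries

def recovery_error(w_hat, w_true):
    """Relative difference in 2-norm between the estimate and the ground truth."""
    return np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true)
```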

Example: Strongly Sparse Signal
$p=512$, $k=64$, $g=4$, $n=160$.
Recovery results of a 1D signal with line-structured sparsity: (a) original data; (b) recovered results with OMP (error is 0.9921); (c) recovered results with Lasso (error is 0.8660); (d) recovered results with Group Lasso (error is 0.4832 with group size gs=2); (e) recovered results with Group Lasso (error is 0.4832 with group size gs=4); (f) recovered results with Group Lasso (error is 0.2646 with group size gs=8); (g) recovered results with Group Lasso (error is 0.3980 with group size gs=16); (h) recovered results with StructOMP (error is 0.0246).

Example: Weakly Sparse Signal
$p=512$ and none of the signal's coefficients are exactly zero. We define the sparsity $k$ as the number of coefficients that contain $95\%$ of the signal energy. The support set of these signals is composed of $g=2$ connected regions.
Recovery results of a 1D weakly sparse signal with line-structured sparsity: (a) original data; (b) recovered results with OMP (error is 0.5599); (c) recovered results with Lasso (error is 0.6686); (d) recovered results with Group Lasso (error is 0.4732 with group size gs=2); (e) recovered results with Group Lasso (error is 0.2893 with group size gs=4); (f) recovered results with Group Lasso (error is 0.2646 with group size gs=8); (g) recovered results with Group Lasso (error is 0.5459 with group size gs=16); (h) recovered results with StructOMP (error is 0.0846).

Strong vs. Weak Sparsity
Strongly sparse: all nonzero coefficients are significantly different from zero. Weakly sparse: the nonzero coefficients gradually decay to zero.
OMP and Lasso do not achieve good recovery results, whereas the StructOMP algorithm achieves near-perfect recovery of the original signal. As we do not know the predefined groups for group Lasso, we simply try group Lasso with several different group sizes (gs = 2, 4, 8, 16). Although the results obtained with group Lasso are better than those of OMP and Lasso, they are still inferior to the results with StructOMP.
Figure. Recovery error vs. sample size ratio (n/k): (a) 1D strongly sparse signals; (b) 1D weakly sparse signals.

2D Image with Graph Sparsity
$p = HW = 48 \times 48$, $k=160$, $g=4$, $n=640$.
Figure. Recovery results of a 2D gray image: (a) original gray image, (b) recovered image with OMP (error is 0.9012), (c) recovered image with Lasso (error is 0.4556) and (d) recovered image with StructOMP (error is 0.1528).

Hierarchical Structure in Wavelets
$64 \times 64$ image, $p = 64 \times 64$, $n=2048$. We define the sparsity $k$ as the number of wavelet coefficients that contain $95\%$ of the image energy.
Figure. Recovery results: (a) the original image, (b) recovered image with OMP (error is 0.21986), (c) recovered image with Lasso (error is 0.1670) and (d) recovered image with StructOMP (error is 0.0375).

Connected Region Structure
$80 \times 80$ image, $k \approx 300$, $n=900$.
Figure. Recovery results: (a) the background-subtracted image, (b) recovered image with OMP (error is 1.1833), (c) recovered image with Lasso (error is 0.7075) and (d) recovered image with StructOMP (error is 0.1203).

Connected Region Structure
Figure. Recovery error vs. sample size: (a) 2D image with tree-structured sparsity in the wavelet basis; (b) background-subtracted images with structured sparsity.

Summary
Proposed:
- A general theoretical framework for structured sparsity
- A flexible coding scheme for structure descriptions
- An efficient algorithm: StructOMP
- Graph sparsity as the running example
Open questions:
- Backward steps
- Convex relaxation for structured sparsity
- More general structure representations

Thank you!