Sparse Coding Arthur Pece

Outline
- Generative-model-based vision
- Linear, non-Gaussian, over-complete generative models
- The penalty method of Olshausen+Field & Harpur+Prager
- Matching pursuit
- The inhibition method
- An application to medical images
- A hypothesis about the brain

Generative-Model-Based Vision
A less-fuzzy definition of model-based vision. Four basic principles (suggested by the speaker):
- Generative models
- Bayes' theorem (gives an objective function)
- Iterative optimization (for parameter estimation)
- Occam's razor (for model selection)

Why?
- A generative model and Bayes' theorem lead to a better understanding of what the algorithm is doing
- When the MAP solution cannot be found analytically, iterating between top-down and bottom-up becomes necessary (as in EM, Newton-like and conjugate-gradient methods)
- Models should not only be likely, but also lead to precise predictions, hence (one interpretation of) Occam's razor

Linear Generative Models
x = A.s + n
- x is the observation vector (n samples/pixels)
- s is the source vector (m sources)
- A is the mixing matrix (n x m)
- n is the noise vector (n dimensions)
The noise vector is really a part of the source vector.
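As an illustration, a minimal sketch of sampling an observation from x = A.s + n, assuming a random mixing matrix, Laplacian sources and Gaussian noise (the dimensions and scales are arbitrary illustrative choices, not values from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 256                         # n pixels, m sources (over-complete: m > n)
A = rng.normal(size=(n, m))            # mixing matrix (would normally be learned)
s = rng.laplace(scale=1.0, size=m)     # sparse, super-Gaussian sources
noise = rng.normal(scale=0.1, size=n)  # Gaussian noise vector
x = A @ s + noise                      # the observation
```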

Learning vs. Search/Perception
x = A.s + n
- Learning: given an ensemble X = {x_i}, maximize the posterior probability of the mixing matrix A
- Perception: given an instance x, maximize the posterior probability of the source vector s

MAP Estimation
From Bayes' theorem:
log p(A | X) = log p(X | A) + log p(A) - log p(X)   (marginalize over S)
log p(s | x) = log p(x | s) + log p(s) - log p(x)   (marginalize over A)

Statistical independence of the sources
- Why is A not the identity matrix?
- Why is p(s) super-Gaussian (leptokurtic)?
- Why m > n?

Why is A not the identity matrix?
- Pixels are not statistically independent: log p(x) ≠ Σ log p(x_i)
- Sources are (or should be) statistically independent: log p(s) = Σ log p(s_j)
Thus, the log p.d.f. of images is equal to the sum of the log p.d.f.'s of the coefficients, NOT equal to the sum of the log p.d.f.'s of the pixels.

Why is A not the identity matrix? (continued)
From the previous slide: Σ log p(c_j) ≠ Σ log p(x_i)
But, statistically, the estimated probability of an image is higher if the estimate is given by the sum of coefficient probabilities, rather than the sum of pixel probabilities:
E[Σ log p(c_j)] > E[Σ log p(x_i)]
This is equivalent to: H[p(c_j)] < H[p(x_i)]

Why m > n? Why is p(s) super-Gaussian?
- The image "sources" are edges; ultimately, objects
- Edges can be found at any image location and can have any orientation and intensity profile
- Objects can be found at any location in the scene and can have many different shapes
- Many more (potential) edges or (potential) objects than pixels
- Most of these potential edges or objects are not found in a specific image

Linear non-Gaussian generative model
- Super-Gaussian prior p.d.f. of sources
- Gaussian prior p.d.f. of noise
log p(s | x, A) = log p(x | s, A) + log p(s) - log p(x | A)
log p(x | s, A) = log p(x - A.s) = log p(n) = - n.n / (2σ²) - log Z = - ||x - A.s||² / (2σ²) - log Z

Linear non-Gaussian generative model (continued)
Example: Laplacian p.d.f. of sources: log p(s) = - Σ |s_j| / λ - log Q
log p(s | x, A) = log p(x | s, A) + log p(s) - log p(x | A)
              = - ||x - A.s||² / (2σ²) - Σ |s_j| / λ - const.
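A minimal sketch of this unnormalized log-posterior, assuming Gaussian noise with variance sigma2 and a Laplacian prior with scale lam (parameter names and default values are illustrative, not from the talk):

```python
import numpy as np

def log_posterior(s, x, A, sigma2=0.01, lam=1.0):
    """Unnormalized log p(s | x, A) up to an additive constant:
    Gaussian likelihood term plus Laplacian prior term."""
    residual = x - A @ s
    return -residual @ residual / (2 * sigma2) - np.abs(s).sum() / lam
```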

Summary
- Generative-model-based vision
- Learning vs. Perception
- Over-complete expansions
- Sparse prior distribution of sources
- Linear over-complete generative model with Laplacian prior distribution for the sources

The Penalty Method: Coding
Gradient-based optimization of the log-posterior probability of the coefficients:
(d/ds) log p(s | x) = Aᵀ.(x - A.s) / σ² - sign(s) / λ
Note: as the noise variance tends to zero, the quadratic term dominates the right-hand side and the MAP estimate could be obtained by solving a linear system. However, if m > n, then minimizing a quadratic objective function would spread the image energy over non-orthogonal coefficients.
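A minimal sketch of this gradient ascent on the log-posterior; the step size, iteration count and parameter defaults are illustrative assumptions rather than the settings used in the talk:

```python
import numpy as np

def penalty_method_coding(x, A, sigma2=0.01, lam=1.0, step=0.01, n_iter=500):
    """Gradient ascent on log p(s | x): data term A^T (x - A s) / sigma2
    plus the Laplacian penalty term -sign(s) / lam."""
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (x - A @ s) / sigma2 - np.sign(s) / lam
        s += step * grad
    return s
```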

Linear inference from linear generative models with Gaussian prior p.d.f.
- The logarithm of a multivariate Gaussian is a weighted sum of squares
- The gradient of a sum of squares is a linear function
- The MAP solution is the solution of a linear system
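For contrast with the non-Gaussian case below, a sketch of the closed-form MAP estimate under a zero-mean Gaussian prior on s with an assumed variance tau2 (the resulting linear system is just ridge regression):

```python
import numpy as np

def map_gaussian_prior(x, A, sigma2=0.01, tau2=1.0):
    """MAP estimate when noise and prior are both Gaussian: maximize
    -||x - A s||^2 / (2 sigma2) - ||s||^2 / (2 tau2); the gradient is
    linear in s, so the optimum solves a linear system."""
    m = A.shape[1]
    return np.linalg.solve(A.T @ A + (sigma2 / tau2) * np.eye(m), A.T @ x)
```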

Non-linear inference from linear generative models with non-Gaussian prior p.d.f.
- The logarithm of a multivariate non-Gaussian p.d.f. is NOT a weighted sum of squares
- The gradient of a non-Gaussian p.d.f. is NOT a linear function
- The MAP solution is NOT the solution of a linear system: in general, no analytical solution exists (this is why over-complete bases are not popular)

PCA, ICA, SCA
- PCA generative model: multivariate Gaussian -> closed-form solution
- ICA generative model: non-Gaussian -> iterative optimization over image ensemble
- SCA generative model: over-complete non-Gaussian -> iterate for each image for perception, over the image ensemble for learning

The Penalty Method: Learning
Gradient-based optimization of the log-posterior probability* of the mixing matrix:
ΔA = - A (z.cᵀ + I)
where z_j = (d/ds_j) log p(s_j) and c is the MAP estimate of s
* actually log-likelihood
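A hedged sketch of this learning rule, reusing the coding sketch above for the MAP estimate; for the Laplacian prior, z_j = -sign(c_j) / λ. The learning rate and epoch count are arbitrary illustrative choices:

```python
import numpy as np

def penalty_method_learning(X, A, sigma2=0.01, lam=1.0, lr=0.001, n_epochs=10):
    """Apply Delta A = -A (z c^T + I) over an ensemble of observations X,
    scaled by a small learning rate lr for each example."""
    m = A.shape[1]
    for _ in range(n_epochs):
        for x in X:
            c = penalty_method_coding(x, A, sigma2, lam)   # MAP estimate of s
            z = -np.sign(c) / lam                          # (d/ds_j) log p(s_j)
            A = A + lr * (-A @ (np.outer(z, c) + np.eye(m)))
    return A
```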

Summary
- Generative-model-based vision
- Learning vs. Perception
- Over-complete expansions
- Sparse prior distribution of sources
- Linear over-complete generative model with Laplacian prior distribution for the sources
- Iterative coding as MAP estimation of sources
- Learning an over-complete expansion

Vector Quantization
- General VQ: K-means clustering of signals/images
- Shape-gain VQ: clustering on the unit sphere (after a change from Cartesian to polar coordinates)
- Iterated VQ: iterative VQ of the residual signal/image
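A minimal sketch of shape-gain encoding, assuming a codebook whose rows are unit-norm "shape" vectors learned elsewhere (e.g. by clustering on the unit sphere); the gain is left unquantized here for simplicity:

```python
import numpy as np

def shape_gain_encode(x, codebook):
    """Split x into a gain (its norm) and a shape (its direction), and pick
    the codeword closest in direction. Assumes x is non-zero and the
    codebook rows have unit norm."""
    gain = np.linalg.norm(x)
    shape = x / gain
    k = int(np.argmax(codebook @ shape))  # nearest codeword by inner product
    return k, gain                        # x is approximated by gain * codebook[k]
```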

Matching Pursuit
Iterative shape-gain vector quantization (sketched below):
- Projection of the residual image onto all expansion images
- Selection of the largest (in absolute value) projection
- Updating of the corresponding coefficient
- Subtraction of the updated component from the residual image
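A minimal matching-pursuit sketch, assuming a dictionary D whose rows are unit-norm, flattened expansion images; the fixed iteration count (rather than a stopping rule on the residual) is an illustrative choice:

```python
import numpy as np

def matching_pursuit(x, D, n_iter=50):
    """Greedy decomposition of x over the rows of D."""
    residual = x.copy()
    coeffs = np.zeros(D.shape[0])
    for _ in range(n_iter):
        projections = D @ residual
        k = int(np.argmax(np.abs(projections)))  # largest projection in absolute value
        coeffs[k] += projections[k]              # update the corresponding coefficient
        residual -= projections[k] * D[k]        # subtract the component from the residual
    return coeffs, residual
```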

Inhibition Method
Similar iteration structure, but more than one coefficient updated per iteration (see the sketch below):
- Projection of the residual image onto all expansion images
- Selection of the largest (in absolute value) k projections
- Selection of orthogonal elements in this reduced set
- Updating of the corresponding coefficients
- Subtraction of the updated components from the residual image
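One possible reading of a single inhibition-style iteration, not the author's exact algorithm: keep the k largest projections and greedily discard candidates that overlap too much with an already accepted element. Both k and the overlap threshold are illustrative assumptions:

```python
import numpy as np

def inhibition_iteration(residual, D, k=10, overlap_threshold=0.1):
    """Update several (nearly orthogonal) coefficients in one pass over the
    dictionary D (rows assumed unit-norm)."""
    projections = D @ residual
    candidates = np.argsort(-np.abs(projections))[:k]   # k largest projections
    accepted = []
    for j in candidates:
        # keep j only if it is (nearly) orthogonal to every accepted element
        if all(abs(D[j] @ D[i]) < overlap_threshold for i in accepted):
            accepted.append(j)
    for j in accepted:
        residual = residual - projections[j] * D[j]      # subtract updated components
    return accepted, residual
```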

MacKay Diagram

Selection in Matching Pursuit

Selection in the Inhibition Method

Encoding natural images: Lena

Encoding natural images: a landscape

Encoding natural images: a bird

Comparison to the penalty method

Visual comparisons: JPEG, inhibition method, penalty method

Expanding the dictionary

An Application to Medical Images (pipeline sketched below)
- X-ray images decomposed by means of matching pursuit
- Image reconstruction by optimally re-weighting the components obtained by matching pursuit
- Thresholding to detect micro-calcifications
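A hedged sketch of this detection pipeline, reusing the matching-pursuit sketch above; the re-weighting vector and the detection threshold are placeholders, not the scheme actually used in the cited work:

```python
import numpy as np

def detect_microcalcifications(x, D, weights, threshold):
    """Decompose, reconstruct with re-weighted components, then threshold."""
    coeffs, _ = matching_pursuit(x, D)          # matching-pursuit decomposition
    reconstruction = D.T @ (weights * coeffs)   # re-weighted reconstruction
    return reconstruction > threshold           # binary detection map
```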

Tumor detection in mammograms

Residual image after several matching pursuit iterations

Image reconstructed from matching-pursuit components

Weighted reconstruction

Receiver Operating Characteristic (ROC) Curve

A Hypothesis about the Brain
Some facts:
- All input to the cerebral cortex is relayed through the thalamus: e.g. all visual input from the retina is relayed through the LGN
- Connections between cortical areas and thalamic nuclei are always reciprocal
- Feedback to the LGN seems to be negative
Hypothesis: cortico-thalamic loops minimize prediction error.

Additional references
- Donald MacKay (1956)
- D. Field (1994)
- Harpur and Prager (1995)
- Lewicki and Olshausen (1999)
- Yoshida (1999)