The horseshoe estimator for sparse signals. Carlos M. Carvalho, Nicholas G. Polson, James G. Scott. Biometrika (2010). Presented by Eric Wang, 10/14/2010.


Overview. This paper proposes the horseshoe estimator, which is analytically tractable and more robust and adaptive to different sparsity patterns than existing approaches. Two theorems are proved: one characterizes the estimator's tail robustness, and the other demonstrates a super-efficient rate of convergence to the correct estimate of the sampling density in sparse situations. The estimator's performance is demonstrated on both real and simulated data, and the authors show that its answers correspond quite closely to those obtained by Bayesian model averaging.

The horseshoe estimator. Consider a p-dimensional vector of means θ = (θ_1, ..., θ_p) that is believed to be sparse. The authors propose the following model for estimation and prediction:

(y_i | θ_i) ~ N(θ_i, σ²),  (θ_i | λ_i) ~ N(0, λ_i²),  λ_i ~ C+(0, τ),

where C+(0, a) denotes a half-Cauchy distribution on the positive reals with location 0 and scale parameter a. The name horseshoe prior arises from the observation that, for fixed values σ = τ = 1, the shrinkage coefficient κ_i = 1/(1 + λ_i²), which gives the amount of shrinkage toward zero a posteriori through E(θ_i | y) = (1 − κ_i) y_i, has a Be(1/2, 1/2) prior, whose density is horseshoe shaped.
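As a concrete illustration (my own sketch, not from the paper), the following Python snippet forward-simulates this hierarchy with τ and σ fixed at 1:

import numpy as np

rng = np.random.default_rng(0)
p, sigma, tau = 200, 1.0, 1.0               # dimension and (fixed, assumed) scales

lam = tau * np.abs(rng.standard_cauchy(p))  # lambda_i ~ C+(0, tau): |Cauchy| scaled by tau
theta = rng.normal(0.0, lam)                # theta_i | lambda_i ~ N(0, lambda_i^2)
y = rng.normal(theta, sigma)                # y_i | theta_i ~ N(theta_i, sigma^2)

print("median |theta|:", np.median(np.abs(theta)))   # most components are tiny...
print("max |theta|:   ", np.max(np.abs(theta)))      # ...but a few are very large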

The meaning of κ_i is as follows: κ_i ≈ 0 yields virtually no shrinkage and describes signals, while κ_i ≈ 1 yields near-total shrinkage and (hopefully) describes noise. At right is the Be(1/2, 1/2) prior on the shrinkage coefficient κ_i, which is unbounded at both 0 and 1 and gives the horseshoe its name.
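A quick numerical check (my own, with τ = σ = 1) that κ_i = 1/(1 + λ_i²) indeed follows the horseshoe-shaped Be(1/2, 1/2) distribution:

import numpy as np
from scipy.stats import beta, kstest

rng = np.random.default_rng(1)
lam = np.abs(rng.standard_cauchy(100_000))   # lambda_i ~ C+(0, 1)
kappa = 1.0 / (1.0 + lam ** 2)               # shrinkage weight: E(theta_i | y) = (1 - kappa_i) y_i

print(kstest(kappa, beta(0.5, 0.5).cdf))               # should not reject Be(1/2, 1/2)
print("mass near the ends:", np.mean(kappa < 0.05), np.mean(kappa > 0.95))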

The horseshoe density function. The horseshoe prior density p_HS(θ) lacks an analytic form, but very tight bounds are available.

Theorem 1. The univariate horseshoe density satisfies the following:
(a) lim_{θ→0} p_HS(θ) = ∞;
(b) for θ ≠ 0, (K/2) log(1 + 4/θ²) < p_HS(θ) < K log(1 + 2/θ²), where K = 1/√(2π³).

Alternatively, it is possible to integrate over the global scale τ first, yielding a joint prior on (θ_1, ..., θ_p), though the dependence among the θ_i that this induces causes more issues; therefore the authors do not take this approach.
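The bounds in (b) can be checked numerically; the sketch below (mine, not the paper's) evaluates p_HS(θ) by one-dimensional quadrature over λ and compares it with the two logarithmic bounds:

import numpy as np
from scipy import integrate
from scipy.stats import norm, halfcauchy

def horseshoe_density(theta, tau=1.0):
    # p_HS(theta) = int_0^inf N(theta | 0, (lam*tau)^2) C+(lam | 0, 1) dlam
    integrand = lambda lam: norm.pdf(theta, scale=lam * tau) * halfcauchy.pdf(lam)
    return integrate.quad(integrand, 0.0, np.inf)[0]

# Bounds as reconstructed above, with K = (2 pi^3)^(-1/2); treat this as a sanity check.
K = 1.0 / np.sqrt(2.0 * np.pi ** 3)
for theta in (0.1, 0.5, 1.0, 2.0, 5.0):
    lower = 0.5 * K * np.log(1.0 + 4.0 / theta ** 2)
    upper = K * np.log(1.0 + 2.0 / theta ** 2)
    print(f"theta={theta}: {lower:.4f} < {horseshoe_density(theta):.4f} < {upper:.4f}")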

Horseshoe estimator for sparse signals

Review of similar methods. Scott & Berger (2006) studied the discrete mixture θ_i ~ w N(0, τ²) + (1 − w) δ_0, where w is the prior inclusion probability and δ_0 is a point mass at zero. Tipping (2001) studied the Student-t prior, in which the mixing density on each variance λ_i² is inverse-gamma. The double-exponential prior (the Bayesian lasso) has an exponential mixing density on λ_i².

Review of similar methods. The normal-Jeffreys prior is improper and is induced by placing Jeffreys' prior on each variance term, p(λ_i²) ∝ 1/λ_i², leading to p(θ_i) ∝ 1/|θ_i|. This choice is commonly used in the absence of a global scale parameter. The Strawderman-Berger prior does not have an analytic form, but arises from assuming θ_i | κ_i ~ N(0, 1/κ_i − 1) with κ_i ~ Be(1/2, 1). The normal-exponential-gamma family of priors generalizes the lasso specification by using a gamma distribution to mix over the exponential rate parameter, leading to κ_i ~ Be(1, c).

Review of similar methods: the priors above are compared on two criteria, shrinkage of noise and tail robustness of the prior.
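To illustrate the tail-robustness criterion numerically, here is a small comparison (my own, not from the paper) of how the horseshoe marginal density, the Cauchy density, and the double-exponential (Laplace) density decay for large |θ|; the exponentially decaying Laplace tails are what force the Bayesian lasso to over-shrink large signals:

import numpy as np
from scipy import integrate
from scipy.stats import norm, halfcauchy, laplace, cauchy

def horseshoe_density(theta):
    # p_HS(theta) = int_0^inf N(theta | 0, lam^2) C+(lam | 0, 1) dlam
    integrand = lambda lam: norm.pdf(theta, scale=lam) * halfcauchy.pdf(lam)
    return integrate.quad(integrand, 0.0, np.inf)[0]

for theta in (2.0, 5.0, 10.0, 20.0):
    print(f"theta={theta:5.1f}  horseshoe={horseshoe_density(theta):.2e}  "
          f"cauchy={cauchy.pdf(theta):.2e}  laplace={laplace.pdf(theta):.2e}")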

Robustness to large signals. Theorem 2. Let p(y | θ) be the likelihood, and suppose that the prior p(θ) is a zero-mean scale mixture of normals, (θ | λ) ~ N(0, λ²), with λ having a proper prior p(λ). Assume further that the likelihood and p(λ) are such that the marginal density m(y) = ∫ p(y | θ) p(θ) dθ is finite for all y. The theorem then defines three pseudo-densities, which may be improper, and represents the posterior mean E(θ | y) in terms of them.

Robustness to large signals. If p(y | θ) is a Gaussian likelihood, then the result of Theorem 2 reduces to E(θ | y) = y + (d/dy) log m(y). A key consequence is that if the prior on θ is chosen so that the derivative of its log density is bounded and tends to zero in the tails, then the derivative of the log predictive density, (d/dy) log m(y), is likewise bounded and tends to 0 for large |y|. This happens for heavy-tailed priors, including the proposed horseshoe prior, and yields E(θ | y) ≈ y for large |y|: big signals are left essentially unshrunk.
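The sketch below (my own, with τ = σ = 1) checks this numerically: it computes (d/dy) log m(y) under the horseshoe prior by quadrature and finite differences, and compares it with the score under a N(0, 1) prior, for which (d/dy) log m(y) = −y/2 grows without bound:

import numpy as np
from scipy import integrate
from scipy.stats import norm, halfcauchy

def m_horseshoe(y, tau=1.0):
    # Marginal m(y) = int N(y | 0, 1 + lam^2 tau^2) C+(lam | 0, 1) dlam, with theta integrated out.
    f = lambda lam: norm.pdf(y, scale=np.sqrt(1.0 + (lam * tau) ** 2)) * halfcauchy.pdf(lam)
    return integrate.quad(f, 0.0, np.inf)[0]

def score(m, y, h=1e-4):
    # Central finite difference of log m(y).
    return (np.log(m(y + h)) - np.log(m(y - h))) / (2.0 * h)

for y in (1.0, 3.0, 5.0, 10.0, 20.0):
    s_hs = score(m_horseshoe, y)
    s_gauss = -y / 2.0            # exact score when the prior is N(0, 1), since then m(y) = N(y | 0, 2)
    print(f"y={y:5.1f}  horseshoe score={s_hs:7.3f} -> E ~ {y + s_hs:6.2f}   "
          f"Gaussian-prior score={s_gauss:7.2f} -> E = {y + s_gauss:6.2f}")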

The horseshoe score function. Theorem 3. Suppose y ~ N(θ, 1), and let m_τ(y) denote the predictive density under the horseshoe prior with known scale parameter τ. Then |(d/dy) log m_τ(y)| is bounded by a constant that depends upon τ, and (d/dy) log m_τ(y) → 0 as |y| → ∞. Corollary: although the horseshoe prior has no analytic form, it does lead to a posterior mean E(θ | y) with a closed-form expression as a ratio involving Φ₁, a degenerate hypergeometric function of two variables.
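The hypergeometric expression is not reproduced here, but the same posterior mean can be computed by one-dimensional quadrature; the sketch below (mine, with σ = 1 and τ fixed) does exactly that and shows the transition from near-total shrinkage for small y to almost no shrinkage for large y:

import numpy as np
from scipy import integrate
from scipy.stats import norm, halfcauchy

def posterior_mean(y, tau=1.0):
    # E(theta | y) = y * E(1 - kappa | y), computed by quadrature over lambda rather than
    # through the paper's hypergeometric expression.
    def weight(lam):
        return norm.pdf(y, scale=np.sqrt(1.0 + (lam * tau) ** 2)) * halfcauchy.pdf(lam)
    num = integrate.quad(lambda lam: ((lam * tau) ** 2 / (1.0 + (lam * tau) ** 2)) * weight(lam),
                         0.0, np.inf)[0]
    den = integrate.quad(weight, 0.0, np.inf)[0]
    return y * num / den

for y in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"y={y}: E(theta | y) ~= {posterior_mean(y):.3f}")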

Estimating τ. The conditional posterior distribution of the global shrinkage parameter τ admits a simple approximation when the dimensionality p is large, which in turn yields an approximate posterior distribution for τ given the shrinkage weights κ_1, ..., κ_p. The practical point is that if most observations are shrunk toward 0, then τ will be small with high probability: the global parameter adapts to the overall sparsity level.

Comparison to double exponential

Super-efficient convergence. Theorem 4. Suppose the true sampling model is y ~ N(θ_0, 1). Then:
(1) For y ~ N(θ, 1) with θ under the horseshoe prior, the Bayes estimate of the sampling density converges (in Kullback-Leibler risk) at a super-efficient rate when θ_0 = 0; the exact rate, which involves a constant b, is given in the paper. When θ_0 ≠ 0, the optimal rate is the usual parametric rate.
(2) Suppose instead the prior is any other density that is continuous, bounded above, and strictly positive on a neighborhood of the true value. For y ~ N(θ, 1) under that prior, the optimal rate of convergence, regardless of θ_0, is the usual parametric rate.

Example: simulated data. Data were generated from a sparse normal-means model.
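As a toy stand-in for this kind of comparison (not the authors' exact simulation), the following sketch generates a sparse θ, applies the quadrature-based horseshoe posterior mean from the earlier sketch with τ fixed at 1, and compares its total squared error with that of the raw observations:

import numpy as np
from scipy import integrate
from scipy.stats import norm, halfcauchy

def posterior_mean(y, tau=1.0):
    # E(theta | y) under the horseshoe prior, by quadrature over lambda (as in the earlier sketch).
    w = lambda lam: norm.pdf(y, scale=np.sqrt(1.0 + (lam * tau) ** 2)) * halfcauchy.pdf(lam)
    num = integrate.quad(lambda lam: ((lam * tau) ** 2 / (1.0 + (lam * tau) ** 2)) * w(lam),
                         0.0, np.inf)[0]
    return y * num / integrate.quad(w, 0.0, np.inf)[0]

rng = np.random.default_rng(2)
p, n_signal = 200, 20
theta = np.zeros(p)
theta[:n_signal] = rng.normal(0.0, 5.0, n_signal)   # a few sizeable signals, the rest exactly zero
y = theta + rng.standard_normal(p)                  # noisy observations

theta_hat = np.array([posterior_mean(yi) for yi in y])
print("SSE, raw y:     ", np.sum((y - theta) ** 2))
print("SSE, horseshoe: ", np.sum((theta_hat - theta) ** 2))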

Example: Vanguard mutual-fund data. Here the authors show how the horseshoe can provide a regularized estimate of a large covariance matrix whose inverse may be sparse. The Vanguard mutual-fund dataset contains n = 86 weekly returns for p = 59 funds. Suppose the observation matrix is Y, with each p-dimensional row drawn from a zero-mean Gaussian with covariance matrix Σ. We will model the Cholesky decomposition of Σ.

Example: Vanguard mutual-fund data. The goal is to estimate the ensemble of regression models in the implied triangular system, in which the j-th column of Y is regressed on the columns that precede it. The regression coefficients are assumed to have horseshoe priors, and posterior means were computed using MCMC.
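The slide does not spell out the sampler; below is a minimal sketch of one standard way to fit each regression in the triangular system, a Gibbs sampler for horseshoe regression using the inverse-gamma auxiliary representation of the half-Cauchy (in the style of Makalic & Schmidt, 2016), not the authors' original sampler. The matrix Y of fund returns is assumed to be available separately.

import numpy as np

def horseshoe_gibbs(X, y, n_iter=2000, burn=1000, seed=0):
    """Gibbs sampler for y = X beta + eps with horseshoe priors on beta; returns the posterior mean."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    lam2 = np.ones(p)      # local variances lambda_j^2
    nu = np.ones(p)        # auxiliaries for the half-Cauchy on lambda_j
    tau2, xi = 1.0, 1.0    # global variance tau^2 and its auxiliary
    sigma2 = 1.0
    XtX, Xty = X.T @ X, X.T @ y
    draws = []
    for it in range(n_iter):
        # beta | rest ~ N(A^{-1} X'y, sigma^2 A^{-1}), with A = X'X + diag(1 / (lam2 * tau2))
        A = XtX + np.diag(1.0 / (lam2 * tau2))
        L = np.linalg.cholesky(A)
        mean = np.linalg.solve(A, Xty)
        beta = mean + np.sqrt(sigma2) * np.linalg.solve(L.T, rng.standard_normal(p))
        # local scales and auxiliaries (inverse-gamma full conditionals)
        lam2 = 1.0 / rng.gamma(1.0, 1.0 / (1.0 / nu + beta ** 2 / (2.0 * tau2 * sigma2)))
        nu = 1.0 / rng.gamma(1.0, 1.0 / (1.0 + 1.0 / lam2))
        # global scale and its auxiliary
        rate_tau = 1.0 / xi + np.sum(beta ** 2 / lam2) / (2.0 * sigma2)
        tau2 = 1.0 / rng.gamma(0.5 * (p + 1), 1.0 / rate_tau)
        xi = 1.0 / rng.gamma(1.0, 1.0 / (1.0 + 1.0 / tau2))
        # error variance
        resid = y - X @ beta
        rate_sig = 0.5 * (resid @ resid + np.sum(beta ** 2 / (lam2 * tau2)))
        sigma2 = 1.0 / rng.gamma(0.5 * (n + p), 1.0 / rate_sig)
        if it >= burn:
            draws.append(beta.copy())
    return np.mean(draws, axis=0)

# Hypothetical use on the triangular system (Y is the n x p matrix of returns, not bundled here):
#   beta_j = horseshoe_gibbs(Y[:, :j], Y[:, j])   for j = 1, ..., p - 1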

Conclusions. This paper introduces the horseshoe prior as a good default prior for sparse problems. Empirically, the model performs similarly to Bayesian model averaging, the current standard. The model exhibits strong global shrinkage and robust local adaptation to signals.