Model Inference and Averaging


Model Inference and Averaging
Prof. Liqing Zhang
Dept. Computer Science & Engineering, Shanghai Jiaotong University

Contents
- The Bootstrap and Maximum Likelihood Methods
- Bayesian Methods
- Relationship Between the Bootstrap and Bayesian Inference
- The EM Algorithm
- MCMC for Sampling from the Posterior
- Bagging
- Model Averaging and Stacking
- Stochastic Search: Bumping

Bootstrap by Basis Expansions
Consider a linear expansion $\mu(x) = \sum_{j=1}^{p} \beta_j h_j(x)$, where the $h_j(x)$ are basis functions. The least-squares solution is
$$\hat\beta = (\mathbf{H}^T \mathbf{H})^{-1} \mathbf{H}^T \mathbf{y},$$
where $\mathbf{H}$ is the $N \times p$ matrix with elements $h_j(x_i)$. The estimated covariance of $\hat\beta$ is
$$\widehat{\mathrm{Var}}(\hat\beta) = (\mathbf{H}^T \mathbf{H})^{-1} \hat\sigma^2, \qquad \hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} \big(y_i - \hat\mu(x_i)\big)^2.$$
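As a hedged numerical sketch of these formulas (hypothetical data, and a plain polynomial basis standing in for the B-splines of the text; numpy only), the following compares the analytic covariance of $\hat\beta$ with a nonparametric bootstrap:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: noisy sine curve (not from the slides).
N = 50
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(2 * x) + rng.normal(scale=0.3, size=N)

# Simple polynomial basis h_j(x) = x^j as a stand-in for B-splines.
H = np.vander(x, 5, increasing=True)          # N x p basis matrix

# Least-squares solution beta_hat = (H^T H)^{-1} H^T y.
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)
resid = y - H @ beta_hat
sigma2_hat = resid @ resid / N

# Estimated covariance of beta_hat: (H^T H)^{-1} * sigma2_hat.
cov_beta = np.linalg.inv(H.T @ H) * sigma2_hat

# Nonparametric bootstrap: resample (x_i, y_i) pairs and refit.
B = 200
boot_betas = np.empty((B, H.shape[1]))
for b in range(B):
    idx = rng.integers(0, N, N)
    boot_betas[b], *_ = np.linalg.lstsq(H[idx], y[idx], rcond=None)

# The bootstrap covariance should roughly agree with the formula above.
print(np.round(cov_beta, 3))
print(np.round(np.cov(boot_betas.T), 3))
```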

Parametric Model
Assume a parameterized probability density (parametric model) for the observations:
$$z_i \sim g_\theta(z), \qquad i = 1, \ldots, N,$$
where $\theta$ denotes the unknown parameters.

Maximum Likelihood Inference
Suppose we are trying to measure the true value of some quantity $x_T$. We make repeated measurements of this quantity, $\{x_1, x_2, \ldots, x_n\}$. The standard way to estimate $x_T$ from our measurements is to calculate the mean value $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and set $x_T = \bar{x}$. Does this procedure make sense? The maximum likelihood method (MLM) answers this question and provides a general method for estimating parameters of interest from data.

The Maximum Likelihood Method
Statement of the maximum likelihood method: Assume we have made n measurements of x, $\{x_1, x_2, \ldots, x_n\}$. Assume we know the probability distribution function that describes x, $f(x, a)$. Assume we want to determine the parameter a. MLM: pick a to maximize the probability of getting the measurements (the $x_i$'s) we did!

The MLM Implementation
The probability of measuring a value between $x_i$ and $x_i + dx$ is $f(x_i, a)\,dx$. If the measurements are independent, the probability of getting the measurements we did is
$$L = f(x_1, a)\,f(x_2, a)\cdots f(x_n, a)\,dx^n = \left[\prod_{i=1}^{n} f(x_i, a)\right] dx^n.$$
We can drop the $dx^n$ term, as it is only a proportionality constant.

Log Maximum Likelihood Method
We want to pick the a that maximizes L:
$$\frac{\partial L}{\partial a}\Big|_{a = a^*} = 0.$$
It is often easier to maximize $\ln L$; L and $\ln L$ attain their maxima at the same location. We maximize $\ln L$ rather than L itself because the logarithm converts the product into a summation.

Log Maximum Likelihood Method
The new maximization condition is
$$\frac{\partial \ln L}{\partial a}\Big|_{a = a^*} = 0.$$
Here a could be an array of parameters (e.g., slope and intercept) or just a single variable. The equations that determine a range from simple linear equations to coupled nonlinear equations.

An Example: Gaussian
Let $f(x, a)$ be given by a Gaussian distribution function, and let $a = \mu$ be the mean of the Gaussian. We want to use our data plus the MLM to find the mean, i.e., the best estimate of a from our set of n measurements $\{x_1, x_2, \ldots, x_n\}$. Let's assume that $\sigma$ is the same for each measurement.

An Example: Gaussian
Gaussian PDF:
$$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / 2\sigma^2}.$$
The likelihood function for this problem is
$$L = \prod_{i=1}^{n} f(x_i; \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x_i-\mu)^2 / 2\sigma^2}.$$

An Example: Gaussian
We want to find the $\mu$ that maximizes the log-likelihood function:
$$\ln L = \sum_{i=1}^{n}\left[-\ln(\sigma\sqrt{2\pi}) - \frac{(x_i-\mu)^2}{2\sigma^2}\right],$$
$$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0 \quad\Rightarrow\quad \mu^* = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
The maximum likelihood estimate of the Gaussian mean is the sample average.

An Example: Gaussian
If the $\sigma_i$ are different for each data point, then $\mu^*$ is just the weighted average:
$$\mu^* = \frac{\sum_{i=1}^{n} x_i / \sigma_i^2}{\sum_{i=1}^{n} 1 / \sigma_i^2}.$$
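A quick numerical check of this result (made-up measurements and uncertainties, not from the slides): maximizing the Gaussian log-likelihood over $\mu$ numerically reproduces the closed-form weighted average.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Made-up measurements with per-point uncertainties sigma_i.
x = np.array([9.8, 10.3, 10.1, 9.5, 10.6])
sigma = np.array([0.2, 0.5, 0.3, 0.4, 0.8])

def neg_log_like(mu):
    # Gaussian -ln L up to mu-independent constants.
    return np.sum((x - mu) ** 2 / (2 * sigma ** 2))

# Numerical maximization of ln L (minimization of -ln L).
res = minimize_scalar(neg_log_like)

# Closed-form weighted average from the slide.
w = 1 / sigma ** 2
mu_weighted = np.sum(w * x) / np.sum(w)

print(res.x, mu_weighted)   # the two estimates should agree
```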

An Example: Poisson
Let $f(x, a)$ be given by a Poisson distribution, and let $a = \mu$ be the mean of the Poisson. We want the best estimate of a from our set of n measurements $\{x_1, x_2, \ldots, x_n\}$. Poisson PDF:
$$f(x; \mu) = \frac{e^{-\mu}\mu^{x}}{x!}.$$

An Example: Poisson
The likelihood function for this problem is
$$L = \prod_{i=1}^{n} \frac{e^{-\mu}\mu^{x_i}}{x_i!} = \frac{e^{-n\mu}\,\mu^{\sum_i x_i}}{\prod_{i=1}^{n} x_i!}.$$

An Example: Poisson
Find the $\mu$ that maximizes the log-likelihood function:
$$\ln L = -n\mu + \Big(\sum_{i=1}^{n} x_i\Big)\ln\mu - \sum_{i=1}^{n}\ln(x_i!),$$
$$\frac{\partial \ln L}{\partial \mu} = -n + \frac{1}{\mu}\sum_{i=1}^{n} x_i = 0 \quad\Rightarrow\quad \mu^* = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Again, the maximum likelihood estimate is the sample average.
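The same sanity check for the Poisson case (hypothetical counts): the numerical maximizer of $\ln L$ coincides with the sample average.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical count data (not from the slides).
x = np.array([3, 7, 4, 6, 5, 2, 8])
n = len(x)

def neg_log_like(mu):
    # Poisson -ln L up to the mu-independent term sum(ln x_i!).
    return -(-n * mu + np.sum(x) * np.log(mu))

res = minimize_scalar(neg_log_like, bounds=(1e-6, 50), method="bounded")
print(res.x, x.mean())   # both should equal the sample average, 5.0
```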

General Properties of MLM
For large data samples (large n) the likelihood function L approaches a Gaussian distribution. Maximum likelihood estimates are usually consistent: for large n, the estimates converge to the true values of the parameters we wish to determine. Maximum likelihood estimates are usually unbiased, or nearly so: any bias vanishes as n grows (for example, the ML estimate of a Gaussian variance is biased by a factor $(n-1)/n$).

General Properties of MLM
The maximum likelihood estimate is asymptotically efficient: it attains the smallest variance allowed by the Cramér–Rao bound (derived below). The maximum likelihood estimate is sufficient: it uses all the information in the observations (the $x_i$'s). In regular problems the solution from MLM is unique. Bad news: we must know the correct probability distribution for the problem at hand!

Maximum Likelihood
In the notation of the parametric model above, we maximize the likelihood function
$$L(\theta; Z) = \prod_{i=1}^{N} g_\theta(z_i),$$
or equivalently the log-likelihood function
$$\ell(\theta; Z) = \sum_{i=1}^{N} \ln g_\theta(z_i).$$

Score Function
We assess the precision of $\hat\theta$ using the likelihood function. The score function is
$$\dot\ell(\theta; Z) = \sum_{i=1}^{N} \frac{\partial \ell(\theta; z_i)}{\partial \theta}.$$
Assume that L takes its maximum in the interior of the parameter space. Then
$$\dot\ell(\hat\theta; Z) = 0.$$

Likelihood Function
We maximize the likelihood function $L(\theta; Z)$, the probability of the observed data under the model $g_\theta$. We can omit any normalization constant of $g_\theta$, since it only adds a constant factor to L. Think of L as a function of $\theta$ with Z fixed; the log-likelihood function is $\ell(\theta; Z) = \sum_{i=1}^{N} \ln g_\theta(z_i)$.

Fisher Information
The negative sum of second derivatives,
$$I(\theta) = -\sum_{i=1}^{N} \frac{\partial^2 \ell(\theta; z_i)}{\partial\theta\,\partial\theta^T},$$
is the information matrix. Evaluated at $\hat\theta$, $I(\hat\theta)$ is called the observed information; it should be positive (positive definite in the multiparameter case). The Fisher information (expected information) is
$$i(\theta) = \mathrm{E}_\theta[I(\theta)].$$

Sampling Theory
A basic result of sampling theory: the sampling distribution of the maximum likelihood estimator approaches the normal distribution
$$\hat\theta \to N\big(\theta_0,\; i(\theta_0)^{-1}\big)$$
as $N \to \infty$, when we sample independently from $g_{\theta_0}(z)$. This suggests approximating the distribution of $\hat\theta$ with
$$N\big(\hat\theta,\; i(\hat\theta)^{-1}\big) \quad\text{or}\quad N\big(\hat\theta,\; I(\hat\theta)^{-1}\big).$$

Error Bound
The corresponding error estimates (standard errors) for $\hat\theta_j$ are obtained from
$$\sqrt{\big[i(\hat\theta)^{-1}\big]_{jj}} \quad\text{or}\quad \sqrt{\big[I(\hat\theta)^{-1}\big]_{jj}}.$$
The confidence points have the form
$$\hat\theta_j \pm z^{(1-\alpha)}\sqrt{\big[I(\hat\theta)^{-1}\big]_{jj}},$$
where $z^{(1-\alpha)}$ is the $1-\alpha$ percentile of the standard normal distribution.
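A small sketch of these error estimates for the Poisson example above (same hypothetical counts): the observed information at the MLE yields a standard error and approximate 95% confidence points.

```python
import numpy as np

# Hypothetical Poisson counts (same toy model as above).
x = np.array([3, 7, 4, 6, 5, 2, 8])
n = len(x)

mu_hat = x.mean()                       # MLE of the Poisson mean

# Observed information: I(mu) = -d^2 lnL / d mu^2 = sum(x) / mu^2.
I_obs = x.sum() / mu_hat ** 2           # equals n / mu_hat at the MLE

se = np.sqrt(1 / I_obs)                 # standard error from I(mu_hat)^{-1}

z = 1.96                                # z^{(1-alpha)} for a 95% interval
print(f"mu_hat = {mu_hat:.3f}, "
      f"95% CI = ({mu_hat - z * se:.3f}, {mu_hat + z * se:.3f})")
```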

Simplified Form of the Fisher Information
Suppose, in addition, that the operations of integration and differentiation can be swapped for the second derivative of $f(x;\theta)$ as well, i.e.,
$$\frac{\partial^2}{\partial\theta^2}\int f(x;\theta)\,dx = \int \frac{\partial^2}{\partial\theta^2} f(x;\theta)\,dx.$$
In this case, it can be shown that the Fisher information equals
$$i(\theta) = -\,\mathrm{E}\!\left[\frac{\partial^2}{\partial\theta^2}\ln f(X;\theta)\right].$$
The Cramér–Rao bound for an unbiased estimator $\hat\theta$ of $\theta$ can then be written as
$$\mathrm{var}(\hat\theta) \;\ge\; \frac{1}{i(\theta)}.$$

Single-Parameter Proof
Let X be a random variable with probability density function $f(x;\theta)$, and let $T = t(X)$ be a statistic used as an estimator for $\psi(\theta)$, so that $\mathrm{E}(T) = \psi(\theta)$ for all $\theta$. If V is the score, i.e.
$$V = \frac{\partial}{\partial\theta}\ln f(X;\theta),$$
then the expectation of V, written $\mathrm{E}(V)$, is zero. If we consider the covariance $\mathrm{cov}(V,T)$ of V and T, we have $\mathrm{cov}(V,T) = \mathrm{E}(VT)$, because $\mathrm{E}(V) = 0$. Expanding this expression we have
$$\mathrm{cov}(V,T) = \mathrm{E}\!\left[T\cdot\frac{\partial}{\partial\theta}\ln f(X;\theta)\right] = \int t(x)\,\frac{\partial f(x;\theta)/\partial\theta}{f(x;\theta)}\,f(x;\theta)\,dx.$$

Using the chain rule and the definition of expectation gives, after cancelling $f(x;\theta)$,
$$\mathrm{cov}(V,T) = \int t(x)\,\frac{\partial f(x;\theta)}{\partial\theta}\,dx = \frac{\partial}{\partial\theta}\int t(x)\,f(x;\theta)\,dx = \psi'(\theta),$$
because the integration and differentiation operations commute (second condition). The Cauchy–Schwarz inequality shows that
$$\big[\mathrm{cov}(V,T)\big]^2 \le \mathrm{var}(V)\,\mathrm{var}(T).$$
Therefore
$$\mathrm{var}(T) \ge \frac{[\psi'(\theta)]^2}{\mathrm{var}(V)} = \frac{[\psi'(\theta)]^2}{i(\theta)},$$
which is the Cramér–Rao bound.

An Example
Consider again the linear expansion $\mu(x) = \sum_j \beta_j h_j(x)$. The least-squares solution is $\hat\beta = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y}$, and the covariance of $\hat\beta$ is $\widehat{\mathrm{Var}}(\hat\beta) = (\mathbf{H}^T\mathbf{H})^{-1}\hat\sigma^2$.

An Example
The confidence region for $\mu(x) = h(x)^T\beta$ has the form
$$\hat\mu(x) \pm 1.96\,\widehat{\mathrm{se}}[\hat\mu(x)], \qquad \widehat{\mathrm{se}}[\hat\mu(x)] = \big[h(x)^T(\mathbf{H}^T\mathbf{H})^{-1}h(x)\big]^{1/2}\hat\sigma.$$

Bayesian Methods
Given a sampling model $\Pr(Z \mid \theta)$ and a prior $\Pr(\theta)$ for the parameters, estimate the posterior probability
$$\Pr(\theta \mid Z) = \frac{\Pr(Z \mid \theta)\,\Pr(\theta)}{\int \Pr(Z \mid \theta')\,\Pr(\theta')\,d\theta'}$$
by drawing samples from it or estimating its mean or mode. Differences from the frequentist approach: the prior allows for uncertainty present before seeing the data; the posterior allows for uncertainty remaining after seeing the data.

Bayesian Methods
The posterior distribution also affords a predictive distribution for future values:
$$\Pr(z^{\mathrm{new}} \mid Z) = \int \Pr(z^{\mathrm{new}} \mid \theta)\,\Pr(\theta \mid Z)\,d\theta.$$
In contrast, the maximum likelihood approach would predict future data on the basis of $\Pr(z^{\mathrm{new}} \mid \hat\theta)$, not accounting for the uncertainty in the parameters.
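A minimal Monte Carlo sketch of this difference, under an assumed normal model with known unit noise variance (toy numbers, not from the slides): the posterior predictive is wider than the maximum likelihood plug-in because it integrates over parameter uncertainty.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: n draws from N(theta, 1) with unknown theta.
z = rng.normal(loc=1.5, scale=1.0, size=10)
n = len(z)
tau = 100.0                       # variance of the N(0, tau) prior on theta

# Conjugate normal posterior: theta | Z ~ N(m, v).
v = 1.0 / (n + 1.0 / tau)
m = v * z.sum()

# Posterior predictive: integrate N(z_new | theta, 1) over the posterior.
theta_draws = rng.normal(m, np.sqrt(v), size=100_000)
z_pred = rng.normal(theta_draws, 1.0)

# ML plug-in predictive: N(z_new | theta_hat, 1) with theta_hat = sample mean.
z_plugin = rng.normal(z.mean(), 1.0, size=100_000)

print(z_pred.std(), z_plugin.std())   # predictive std > plug-in std (extra v)
```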

An Example
Consider again the linear expansion $\mu(x) = \sum_j \beta_j h_j(x)$ with least-squares solution $\hat\beta = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{y}$, and place a Gaussian prior on the coefficients, $\beta \sim N(0, \tau\Sigma)$.

The posterior distribution for $\beta$ is also Gaussian, with mean and covariance
$$\mathrm{E}(\beta \mid Z) = \Big(\mathbf{H}^T\mathbf{H} + \frac{\sigma^2}{\tau}\Sigma^{-1}\Big)^{-1}\mathbf{H}^T\mathbf{y}, \qquad \mathrm{cov}(\beta \mid Z) = \Big(\mathbf{H}^T\mathbf{H} + \frac{\sigma^2}{\tau}\Sigma^{-1}\Big)^{-1}\sigma^2.$$
The corresponding posterior values for $\mu(x) = h(x)^T\beta$ are
$$\mathrm{E}(\mu(x) \mid Z) = h(x)^T\,\mathrm{E}(\beta \mid Z), \qquad \mathrm{cov}\big[\mu(x), \mu(x') \mid Z\big] = h(x)^T\,\mathrm{cov}(\beta \mid Z)\,h(x').$$
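A numpy sketch of this posterior (the hypothetical data of the earlier sketch, with $\Sigma$ taken to be the identity for simplicity): the posterior mean is a ridge-type shrunken version of the least-squares fit, and approaches it as $\tau \to \infty$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data and polynomial basis, as in the earlier sketch.
N = 50
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(2 * x) + rng.normal(scale=0.3, size=N)
H = np.vander(x, 5, increasing=True)

sigma2 = 0.3 ** 2          # assume the noise variance is known
tau = 1.0                  # prior: beta ~ N(0, tau * I)

# Posterior mean and covariance for beta (Sigma = I).
A = H.T @ H + (sigma2 / tau) * np.eye(H.shape[1])
beta_mean = np.linalg.solve(A, H.T @ y)
beta_cov = np.linalg.inv(A) * sigma2

# Posterior mean and pointwise variance of mu(x) = h(x)^T beta.
mu_mean = H @ beta_mean
mu_var = np.einsum("ij,jk,ik->i", H, beta_cov, H)

print(beta_mean.round(3))
print(mu_var[:5].round(4))
```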

Bootstrap vs Bayesian
The bootstrap mean is an approximate posterior average. Simple example: a single observation z drawn from a normal distribution, $z \sim N(\theta, 1)$. Assume a normal prior for $\theta$: $\theta \sim N(0, \tau)$. The resulting posterior distribution is
$$\theta \mid z \;\sim\; N\!\left(\frac{z}{1 + 1/\tau},\; \frac{1}{1 + 1/\tau}\right).$$
As $\tau \to \infty$, the posterior approaches $N(z, 1)$, which coincides with the parametric bootstrap distribution.
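A short simulation of this correspondence (toy numbers): parametric bootstrap replicates $z^* \sim N(z, 1)$ against posterior draws under a nearly noninformative prior (large $\tau$).

```python
import numpy as np

rng = np.random.default_rng(4)

z = 2.3          # the single observed value
tau = 1e6        # a very flat N(0, tau) prior on theta

# Posterior for theta given z (normal-normal conjugacy).
post_mean = z / (1 + 1 / tau)
post_var = 1 / (1 + 1 / tau)
theta_post = rng.normal(post_mean, np.sqrt(post_var), size=100_000)

# Parametric bootstrap: resample z* ~ N(z, 1).
theta_boot = rng.normal(z, 1.0, size=100_000)

# With tau large, the two distributions essentially coincide.
print(theta_post.mean(), theta_post.std())
print(theta_boot.mean(), theta_boot.std())
```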

Bootstrap vs Bayesian
Three ingredients make this work:
- the choice of a noninformative prior for $\theta$;
- the dependence of the log-likelihood $\ell(\theta; Z)$ on the data Z only through the maximum likelihood estimate $\hat\theta$, so that we can write $\ell(\theta; Z) = \ell(\theta; \hat\theta)$;
- the symmetry of the log-likelihood in $\theta$ and $\hat\theta$, i.e., $\ell(\theta; \hat\theta) = \ell(\hat\theta; \theta) + \text{const}$.

Bootstrap vs Bayesian
The bootstrap distribution represents an (approximate) nonparametric, noninformative posterior distribution for our parameter. But this bootstrap distribution is obtained painlessly, without having to formally specify a prior and without having to sample from the posterior distribution. Hence we might think of the bootstrap distribution as a "poor man's" Bayes posterior. By perturbing the data, the bootstrap approximates the Bayesian effect of perturbing the parameters, and is typically much simpler to carry out.

The EM Algorithm
The EM algorithm for two-component Gaussian mixtures:
1. Take initial guesses for the parameters $\hat\mu_1, \hat\sigma_1^2, \hat\mu_2, \hat\sigma_2^2, \hat\pi$.
2. Expectation step: compute the responsibilities
$$\hat\gamma_i = \frac{\hat\pi\,\phi_{\hat\theta_2}(y_i)}{(1-\hat\pi)\,\phi_{\hat\theta_1}(y_i) + \hat\pi\,\phi_{\hat\theta_2}(y_i)}, \qquad i = 1, \ldots, N,$$
where $\phi_{\hat\theta_k}$ is the normal density with parameters $\hat\theta_k = (\hat\mu_k, \hat\sigma_k^2)$.

The EM Algorithm
3. Maximization step: compute the weighted means and variances
$$\hat\mu_1 = \frac{\sum_{i=1}^{N}(1-\hat\gamma_i)\,y_i}{\sum_{i=1}^{N}(1-\hat\gamma_i)}, \qquad \hat\sigma_1^2 = \frac{\sum_{i=1}^{N}(1-\hat\gamma_i)(y_i-\hat\mu_1)^2}{\sum_{i=1}^{N}(1-\hat\gamma_i)},$$
similarly $\hat\mu_2, \hat\sigma_2^2$ with weights $\hat\gamma_i$, and the mixing probability $\hat\pi = \sum_{i=1}^{N}\hat\gamma_i / N$.
4. Iterate steps 2 and 3 until convergence; see the sketch below.
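A compact, self-contained sketch of this two-component EM (synthetic data; crude initial guesses, as the algorithm only requires a starting point):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic 1-D data from a two-component Gaussian mixture.
y = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(4.0, 0.7, 100)])

def normal_pdf(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# 1. Initial guesses for the parameters.
mu1, var1, mu2, var2, pi = y.min(), y.var(), y.max(), y.var(), 0.5

for step in range(200):
    # 2. E-step: responsibility of component 2 for each observation.
    p1 = (1 - pi) * normal_pdf(y, mu1, var1)
    p2 = pi * normal_pdf(y, mu2, var2)
    gamma = p2 / (p1 + p2)

    # 3. M-step: weighted means, variances, and mixing probability.
    w1, w2 = (1 - gamma).sum(), gamma.sum()
    mu1 = ((1 - gamma) * y).sum() / w1
    var1 = ((1 - gamma) * (y - mu1) ** 2).sum() / w1
    mu2 = (gamma * y).sum() / w2
    var2 = (gamma * (y - mu2) ** 2).sum() / w2
    pi = w2 / len(y)

# 4. After convergence the estimates should be near the generating values.
print(f"mu1={mu1:.2f} var1={var1:.2f} mu2={mu2:.2f} var2={var2:.2f} pi={pi:.2f}")
```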

The EM Algorithm in General
In its general form (known in the hidden Markov model setting as the Baum-Welch algorithm), EM is applicable to problems for which maximizing the log-likelihood directly is difficult, but is simplified by enlarging the sample with unobserved (latent) data; this is called data augmentation.

The EM Algorithm in General
Denote the observed (incomplete) data by Z, the latent (missing) data by $Z^m$, and the complete data by $T = (Z, Z^m)$; write $\ell_0(\theta'; T)$ for the log-likelihood of the complete data.

The EM Algorithm in General
1. Start with initial params $\hat\theta^{(0)}$.
2. Expectation step: at the j-th step, compute
$$Q(\theta', \hat\theta^{(j)}) = \mathrm{E}\big[\ell_0(\theta'; T) \,\big|\, Z, \hat\theta^{(j)}\big]$$
as a function of the dummy argument $\theta'$.
3. Maximization step: determine the new params by maximizing,
$$\hat\theta^{(j+1)} = \operatorname*{argmax}_{\theta'}\, Q(\theta', \hat\theta^{(j)}).$$
4. Iterate steps 2 and 3 until convergence.

Model Averaging and Stacking
Given predictions $\hat f_1(x), \hat f_2(x), \ldots, \hat f_M(x)$, under squared-error loss we seek weights $w = (w_1, \ldots, w_M)$ such that
$$\hat w = \operatorname*{argmin}_{w}\; \mathrm{E}_P\Big[Y - \sum_{m=1}^{M} w_m \hat f_m(x)\Big]^2.$$
Here the input x is fixed and the N observations in Z (and the target Y) are distributed according to P.

Model Averaging and Stacking
The solution is the population linear regression of Y on $\hat F(x)^T \equiv [\hat f_1(x), \ldots, \hat f_M(x)]$, namely
$$\hat w = \mathrm{E}_P\big[\hat F(x)\hat F(x)^T\big]^{-1}\,\mathrm{E}_P\big[\hat F(x)\,Y\big].$$
The full regression has smaller error than any single model, namely
$$\mathrm{E}_P\Big[Y - \sum_{m=1}^{M}\hat w_m \hat f_m(x)\Big]^2 \le \mathrm{E}_P\big[Y - \hat f_m(x)\big]^2 \quad\text{for all } m.$$
The population linear regression is not available, so we replace it by a linear regression over the training set; stacking uses the leave-one-out predictions $\hat f_m^{-i}(x_i)$:
$$\hat w^{\mathrm{st}} = \operatorname*{argmin}_{w}\; \sum_{i=1}^{N}\Big[y_i - \sum_{m=1}^{M} w_m \hat f_m^{-i}(x_i)\Big]^2.$$
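A small sketch of stacking under these formulas (synthetic data; two toy polynomial base models, both hypothetical): the weights come from regressing y on the leave-one-out predictions.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic data (not from the slides).
N = 80
x = np.sort(rng.uniform(-2, 2, N))
y = x ** 2 + rng.normal(scale=0.4, size=N)

def fit_predict(xtr, ytr, xte, degree):
    """Fit a degree-d polynomial and predict at xte (a toy base model)."""
    coef = np.polyfit(xtr, ytr, degree)
    return np.polyval(coef, xte)

degrees = [1, 2]                    # two base models: linear and quadratic
M = len(degrees)

# Leave-one-out predictions f_m^{-i}(x_i) for each base model.
F = np.empty((N, M))
for i in range(N):
    mask = np.arange(N) != i
    for m, d in enumerate(degrees):
        F[i, m] = fit_predict(x[mask], y[mask], x[i], d)

# Stacking weights: least-squares regression of y on the LOO predictions.
w, *_ = np.linalg.lstsq(F, y, rcond=None)
print(np.round(w, 3))   # should put most weight on the quadratic model
```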