Maximum likelihood (ML) and likelihood ratio (LR) test

Maximum likelihood (ML) and likelihood ratio (LR) test
- Conditional distribution and likelihood
- Maximum likelihood estimator
- Information in the data and likelihood
- Observed and Fisher’s information
- Likelihood ratio test
- Exercise

Introduction It is often the case that we are interested in finding the values of some parameters of a system. We design an experiment and obtain observations (x1,...,xn), and we want to use these observations to estimate the parameters of the system. Once we know how the parameters and the observations are related (which may itself be a challenging mathematical problem), we can use this relation to estimate the parameters. Maximum likelihood is one of the techniques for estimating parameters from observations or experiments. There are other estimation techniques also, including Bayesian estimation, least squares, the method of moments and minimum chi-squared. The result of the estimation is a function of the observations – t(x1,...,xn). It is a random variable, and in many cases we also want to find its distribution. In general, finding the distribution of a statistic is a challenging problem, but there are numerical techniques to deal with it.

Desirable properties of estimation:
- Unbiasedness. Bias is defined as the expected difference between the estimator (t) and the true parameter (θ); the expectation is taken using the probability distribution of the observations.
- Efficiency. An efficient estimator is one with minimum variance, var(t).
- Consistency. As the number of observations goes to infinity, the estimator converges to the true value.
- Minimum mean square error (m.s.e.). The m.s.e. is defined as the expected value of the squared difference between the estimator and the true value. Since the m.s.e. is the sum of the variance and the squared bias, an estimator with small m.s.e. must have both small bias and small variance.
It is very difficult to achieve all these properties simultaneously. Under some conditions the ML estimator satisfies them asymptotically. Moreover, the ML estimator is asymptotically normal, which simplifies the interpretation of the results.
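Written out as formulas (a reconstruction of the expressions that did not survive the transcript), with t = t(x1,...,xn) the estimator and θ the true parameter:
\[
\operatorname{bias}(t)=E[t(x_1,\dots,x_n)]-\theta,\qquad
\operatorname{m.s.e.}(t)=E\bigl[(t-\theta)^2\bigr]=\operatorname{var}(t)+\operatorname{bias}(t)^2 .
\]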

Conditional probability distribution and likelihood Let us assume that we know that our random sample points came from a population with a distribution that depends on parameter(s) θ. We do not know θ. If we knew it, we could write the probability distribution of a single observation, f(x|θ). Here f(x|θ) is the conditional distribution of the observed random variable given the parameter. If we observe n independent sample points from the same population, then the joint conditional probability distribution of all observations is the product of the individual probability distributions (written out below); we can take the product because the observations are independent (conditionally independent once the parameters are known). f(x|θ) is the probability of an observation in the discrete case and the density of the distribution in the continuous case. We can interpret f(x1,x2,...,xn|θ) as the probability of observing the given sample points if we knew the parameter θ. If we vary the parameter(s) we get different values of the probability f. Since f is a probability distribution, the parameters are fixed and the observations vary. For a given observation we define the likelihood as proportional to this conditional probability distribution.
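A reconstruction of the formula referred to above (the joint conditional distribution as a product of the individual ones):
\[
f(x_1,x_2,\dots,x_n\mid\theta)=\prod_{i=1}^{n} f(x_i\mid\theta).
\]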

Conditional probability distribution and likelihood: Cont. When we talk about the conditional probability distribution of the observations given the parameter(s), we assume that the parameters are fixed and the observations vary. When we talk about the likelihood, the observations are fixed and the parameters vary. That is the major difference between the likelihood and the conditional probability distribution. Sometimes, to emphasise that the parameters vary and the observations are fixed, the likelihood is written as L(θ | x1,...,xn). In this and the following lectures we will use one notation for both probability and likelihood: when we talk about probability we will assume that the observations vary, and when we talk about likelihood we will assume that the parameters vary. The principle of maximum likelihood states that the best parameters are those that maximise the probability of observing the current values of the observations. Maximum likelihood chooses the parameter values that satisfy the condition written out below.
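A reconstruction of the maximisation condition referred to above:
\[
\hat\theta=\arg\max_{\theta}\,L(\theta\mid x_1,\dots,x_n).
\]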

Maximum likelihood The purpose of maximum likelihood is to maximise the likelihood function and thereby estimate the parameters. If the derivatives of the likelihood function exist, this can be done by solving the likelihood equation given below. The solutions of this equation are the candidate values for the maximum likelihood estimator. If the solution is unique then it is the estimator; in real applications there may be many solutions. Usually the logarithm of the likelihood, rather than the likelihood itself, is maximised. Since the logarithm is a strictly monotonically increasing function, the derivative of the likelihood and the derivative of the log-likelihood have exactly the same roots. If we use the fact that the observations are independent, the joint probability distribution of all observations is equal to the product of the individual probabilities, so the log-likelihood (denoted l) is a sum of individual terms (see below). Working with sums is usually easier than working with products.
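Reconstructed forms of the two formulas referred to above (the likelihood equation and the log-likelihood as a sum over independent observations):
\[
\frac{\partial L(\theta\mid x)}{\partial\theta}=0,
\qquad
l(\theta)=\log L(\theta\mid x_1,\dots,x_n)=\sum_{i=1}^{n}\log f(x_i\mid\theta).
\]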

Maximum likelihood: Example – success and failure Let us consider two examples. The first example corresponds to a discrete probability distribution. Assume that we carry out trials whose possible outcomes are success or failure. The probability of success is θ and the probability of failure is 1−θ. We do not know the value of θ. Assume we have n trials, k of which are successes and n−k failures. The value of the random variable describing each trial is either 0 (failure) or 1 (success). Denote the observations by y=(y1,y2,...,yn). The probability of observation yi at the i-th trial, the likelihood of the n independent trials, its logarithm and its derivative with respect to the unknown parameter are written out in the derivation below. The ML estimator of the parameter is equal to the fraction of successes.
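A reconstruction of the derivation sketched above:
\[
P(y_i\mid\theta)=\theta^{y_i}(1-\theta)^{1-y_i},\qquad
L(\theta\mid y)=\prod_{i=1}^{n}\theta^{y_i}(1-\theta)^{1-y_i}=\theta^{k}(1-\theta)^{n-k},
\]
\[
l(\theta)=k\log\theta+(n-k)\log(1-\theta),\qquad
\frac{dl}{d\theta}=\frac{k}{\theta}-\frac{n-k}{1-\theta}=0
\;\Longrightarrow\;
\hat\theta=\frac{k}{n}.
\]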

Maximum likelihood: Example – success and failure In the example of successes and failures the result was not unexpected, and we could have guessed it intuitively. More interesting problems arise when the parameter θ itself becomes a function of some other parameters and possibly of further observations, say θi = f(β, xi). The xi may themselves be random variables. If the function f is the cumulative distribution function of the normal distribution, the analysis is called probit analysis. The log-likelihood then takes the form sketched below, and finding its maximum is more complicated: it is a non-linear optimisation problem, and such problems are usually solved iteratively, i.e. a solution is guessed and then improved step by step.
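The slide's formulas for θ as a function of other parameters did not survive the transcript. As an illustration only (a minimal sketch, not the slides' own example), assume a logistic link θi = 1/(1 + exp(−(b0 + b1·xi))); the log-likelihood is then l(b) = Σi [yi log θi + (1−yi) log(1−θi)], and it can be maximised iteratively, for example with a quasi-Newton method:

```python
# A minimal sketch (not from the slides) of maximising such a likelihood
# numerically. Assumptions: a single covariate x_i, a logistic link for
# theta_i, and simulated data used purely for illustration.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=200)                      # covariates
true_b = np.array([0.5, 1.5])                 # "unknown" parameters b0, b1
p = 1.0 / (1.0 + np.exp(-(true_b[0] + true_b[1] * x)))
y = rng.binomial(1, p)                        # 0/1 outcomes (failure/success)

def neg_log_likelihood(b):
    """Minus log-likelihood of the Bernoulli model with a logistic link."""
    eta = b[0] + b[1] * x
    theta = 1.0 / (1.0 + np.exp(-eta))
    theta = np.clip(theta, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Iterative improvement from an initial guess, as described in the slide.
result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
print("ML estimates of (b0, b1):", result.x)
```

Replacing the logistic function with the normal cumulative distribution function in the same sketch would correspond to probit analysis.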

Maximum likelihood: Example – normal distribution Now let us assume that the sample points came from a population with a normal distribution with unknown mean and variance. Assume that we have n observations, y=(y1,y2,...,yn), and we want to estimate the population mean and variance. The log-likelihood function has the form given below. Setting its derivatives with respect to the mean and the variance to zero gives two equations. Fortunately the first of these can be solved without knowledge of the second; if we then substitute μ by its estimate in the second equation, we can solve that one as well, and the result is the sample variance.
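A reconstruction of the log-likelihood and of the resulting estimating equations:
\[
l(\mu,\sigma^2)=-\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2,
\]
\[
\frac{\partial l}{\partial\mu}=\frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\mu)=0
\;\Longrightarrow\;\hat\mu=\bar y,
\qquad
\frac{\partial l}{\partial\sigma^2}=0
\;\Longrightarrow\;\hat\sigma^2=\frac{1}{n}\sum_{i=1}^{n}(y_i-\bar y)^2 .
\]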

Maximum likelihood: Example – normal distribution The maximum likelihood estimator in this case gives the sample mean and the biased sample variance. Many statistical techniques are based on maximum likelihood estimation of the parameters when the observations are normally distributed. The parameters of interest are usually contained in the mean value; in other words, the mean μ is a function of the x-s and of parameters β. The problem is then to estimate the parameters β using the maximum likelihood estimator. Usually the x-s are either fixed values (fixed effects model) or random variables (random effects model). If the mean is linear in the parameters then we have linear regression. If the variances are known, the maximum likelihood estimator for normally distributed observations becomes the least-squares estimator.
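Written out explicitly (a reconstruction consistent with the preceding slide): with known variances, maximising the normal log-likelihood over β is equivalent to minimising the (weighted) sum of squares,
\[
l(\beta)=-\frac{1}{2}\sum_{i=1}^{n}\frac{\bigl(y_i-\mu_i(\beta)\bigr)^2}{\sigma_i^2}+\text{const}
\;\Longrightarrow\;
\hat\beta=\arg\min_{\beta}\sum_{i=1}^{n}\frac{\bigl(y_i-\mu_i(\beta)\bigr)^2}{\sigma_i^2}.
\]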

Information matrix: Observed and Fisher’s One important aspect of the likelihood function is its behaviour near the maximum. If the likelihood function is flat, the observations have little to say about the parameters, because changes in the parameters do not cause large changes in the probability: the same observation would occur with similar probabilities for various values of the parameters. On the other hand, if the likelihood has a pronounced peak near the maximum, then small changes of the parameters cause large changes in the probability. In that case we say that the observations carry more information about the parameters. This is usually expressed through the second derivative (curvature) of the minus log-likelihood function. The observed information is equal to the second derivative of the minus log-likelihood function, as written below. When there is more than one parameter it is called the information matrix. It is usually calculated at the maximum of the likelihood. There are other definitions of information as well. Example: in the case of successes and failures we obtain the expression given below.
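A reconstruction of the definition and of the success/failure example:
\[
J(\theta)=-\frac{d^{2}l(\theta)}{d\theta^{2}}
\quad\text{(a matrix of second derivatives when }\theta\text{ is a vector)},
\]
and for \(l(\theta)=k\log\theta+(n-k)\log(1-\theta)\):
\[
J(\theta)=\frac{k}{\theta^{2}}+\frac{n-k}{(1-\theta)^{2}} .
\]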

Information matrix: Observed and Fisher’s The expected value of the observed information matrix is called the expected information matrix, or Fisher’s information; the expectation is taken over the observations. It can be calculated at any value of the parameter. A remarkable fact about Fisher’s information matrix is that it is also equal to the expected value of the product of the gradients (first derivatives); see below. Note that the observed information matrix depends on the particular observations, whereas the expected information matrix depends only on the probability distribution of the observations (this is a result of the integration: when we integrate over some variables we lose the dependence on their particular values). When the sample size becomes large, the maximum likelihood estimator becomes approximately normally distributed with variance close to the inverse of the expected information (see below). Fisher points out that inversion of the observed information matrix gives a slightly better estimate of the variance than inversion of the expected information matrix.
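A reconstruction of the formulas referred to above:
\[
I(\theta)=E\!\left[-\frac{\partial^{2} l(\theta)}{\partial\theta\,\partial\theta^{T}}\right]
=E\!\left[\frac{\partial l(\theta)}{\partial\theta}\,\frac{\partial l(\theta)}{\partial\theta^{T}}\right],
\qquad
\hat\theta\;\approx\;N\!\bigl(\theta,\;I(\theta)^{-1}\bigr)\ \text{for large }n .
\]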

Information matrix: Observed and Fisher’s A more precise relation between the expected information and the variance is given by the Cramér–Rao inequality. According to this inequality, the variance of the maximum likelihood estimator can never be less than the inverse of the information (see below). Now let us return to the example of successes and failures. Taking the expectation of the second derivative of the minus log-likelihood gives the expression below; evaluating it at the maximum likelihood estimate, the variance of the maximum likelihood estimator can be approximated as shown. This statement holds for large sample sizes.
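A reconstruction of the inequality and of the success/failure example, using E[k] = nθ:
\[
\operatorname{var}(\hat\theta)\;\ge\;I(\theta)^{-1},
\qquad
I(\theta)=E\!\left[\frac{k}{\theta^{2}}+\frac{n-k}{(1-\theta)^{2}}\right]=\frac{n}{\theta(1-\theta)},
\qquad
\operatorname{var}(\hat\theta)\approx\frac{\hat\theta(1-\hat\theta)}{n}.
\]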

Likelihood ratio test Let us assume that we have a sample of size n, x=(x1,...,xn), and we want to estimate a parameter vector θ=(θ1,θ2); both θ1 and θ2 may themselves be vectors. We want to test the null hypothesis H0: θ1=θ10 against the alternative H1: θ1≠θ10, with θ2 unrestricted. Let the likelihood function be L(x|θ). The likelihood ratio test works as follows: 1) maximise the likelihood function under the null hypothesis (i.e. fix the parameter(s) θ1 equal to θ10) and record the value of the likelihood at this maximum; 2) maximise the likelihood under the alternative hypothesis (i.e. unconditional maximisation) and record the value of the likelihood at this maximum; then form the ratio written out below. w is the likelihood ratio statistic, and tests carried out using this statistic are called likelihood ratio tests. In this case it is clear that 0 ≤ w ≤ 1. If the value of w is small, the null hypothesis is rejected. If g(w) is the density of the distribution of w, the critical region can be calculated from the condition given below.
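A reconstruction of the likelihood ratio statistic and of the condition defining the critical region (α is the chosen significance level, and the test rejects H0 when w ≤ w_α):
\[
w=\frac{\max_{\theta_2} L(x\mid\theta_{10},\theta_2)}{\max_{\theta_1,\theta_2} L(x\mid\theta_1,\theta_2)},
\qquad 0\le w\le 1,
\qquad
\int_{0}^{w_{\alpha}} g(w)\,dw=\alpha .
\]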

References Berthold, M. and Hand, D.J. (2003). Intelligent Data Analysis. Stuart, A., Ord, J.K. and Arnold, S. (1991). Kendall’s Advanced Theory of Statistics, Volume 2A: Classical Inference and the Linear Models. Arnold, London, Sydney, Auckland.

Exercise 1 a) Assume that we have a sample of size n drawn independently from a population with an exponential density of probability. What is the maximum likelihood estimator for θ? What are the observed and the expected information? b) Assume that we have a sample of size n of two-dimensional vectors (x1,x2) = ((x11,x21), (x12,x22), ..., (x1n,x2n)) from a normal distribution. Find the maximum of the likelihood under the given hypotheses and try to find the likelihood ratio statistic. Note that the variance is also unknown.
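The densities on the exercise slide did not survive the transcript. For part (a), a common parameterisation of the exponential distribution (an assumption, not necessarily the one used on the original slide) is
\[
f(x\mid\theta)=\theta\,e^{-\theta x},\qquad x\ge 0,\ \theta>0,
\]
which gives the log-likelihood \( l(\theta)=n\log\theta-\theta\sum_{i=1}^{n}x_i \) as the starting point for the estimator and the information.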