
Wireless Information Transmission System Lab. Institute of Communications Engineering National Sun Yat-sen University 2011 Summer Training Course ESTIMATION THEORY Chapter 7 Maximum Likelihood Estimation

Outline ◊ Why use MLE? ◊ How to find the MLE? ◊ Properties of MLE ◊ Numerical Determination of the MLE

Introduction ◊ We now investigate an alternative to the MVU estimator, which is desirable in situations where the MVU estimator does not exist or cannot be found even if it does exist. ◊ This estimator, which is based on the maximum likelihood principle, is overwhelmingly the most popular approach to obtaining practical estimators. ◊ It has the distinct advantage of being a turn-the-crank procedure, allowing it to be implemented for complicated estimation problems.

◊ In general, the MLE has the asymptotic properties of being unbiased, achieving the CRLB, and having a Gaussian PDF.

Why use MLE? ◊ Example: DC Level in White Gaussian Noise ◊ Consider the observed data set x[n] = A + w[n], n = 0, 1, ..., N−1, where A is an unknown level, which is assumed to be positive (A > 0), and w[n] is WGN with unknown variance A. ◊ The PDF is given by (7.1):

◊ Taking the derivative of the log-likelihood function, we have: ◊ We can still determine the CRLB for this problem to find that: ◊ We next try to find the MVU estimator by resorting to the theory of sufficient statistics.
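The equations on this slide were lost in the transcript. A reconstruction of the missing quantities, following the standard development of this example (data model x[n] = A + w[n] with w[n] WGN of variance A), is:

```latex
p(\mathbf{x};A) = \frac{1}{(2\pi A)^{N/2}}
   \exp\!\Big[-\frac{1}{2A}\sum_{n=0}^{N-1}\big(x[n]-A\big)^{2}\Big]
   \qquad\text{(the PDF, eq. (7.1))}

\frac{\partial \ln p(\mathbf{x};A)}{\partial A}
   = -\frac{N}{2A} + \frac{1}{A}\sum_{n=0}^{N-1}\big(x[n]-A\big)
     + \frac{1}{2A^{2}}\sum_{n=0}^{N-1}\big(x[n]-A\big)^{2}

\mathrm{var}(\hat{A}) \;\ge\; \frac{A^{2}}{N\left(A+\tfrac{1}{2}\right)}
   \qquad\text{(the CRLB, eq. (7.2))}
```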

◊ Sufficient statistics ◊ Theorem 5.1 (p. 104): the Neyman-Fisher factorization theorem ◊ Theorem 5.2 (p. 109): the Rao-Blackwell-Lehmann-Scheffé theorem ◊ If θ̌ is an unbiased estimator of θ and T(x) is a sufficient statistic for θ, then θ̂ = E(θ̌ | T(x)) is I. a valid estimator for θ (not dependent on θ), II. unbiased, III. of lesser or equal variance than that of θ̌, for all θ. Additionally, if the sufficient statistic is complete, then θ̂ is the MVU estimator. In essence, a statistic is complete if there is only one function of the statistic that is unbiased.

◊ First approach: use Theorem 5.1 ◊ Two steps: ◊ First step: find the sufficient statistic T(x). ◊ Second step: find a function g so that g(T(x)) is an unbiased estimator of A.

◊ First step: ◊ Attempting to factor (7.1) into the form of (5.3), we expand the exponent as Σ(x[n] − A)² = Σ x²[n] − 2A Σ x[n] + NA², ◊ so that the PDF factors as g(T(x), A) · h(x). ◊ Based on the Neyman-Fisher factorization theorem, a single sufficient statistic for A is T(x) = Σ x²[n].

◊ Second step: ◊ Assume that T(x) = Σ x²[n] is a complete sufficient statistic. We then need to find a function g such that E[g(Σ x²[n])] = A for all A > 0; since E(Σ x²[n]) = N(A + A²), ◊ it is not obvious how to choose g.

◊ Second approach: use Theorem 5.2 ◊ This would be to determine the conditional expectation E(Ǎ | Σ x²[n]), where Ǎ is any unbiased estimator of A. ◊ Example: let Ǎ = x[0]; then the MVU estimator would take the form E(x[0] | Σ x²[n]). ◊ Unfortunately, the evaluation of the conditional expectation appears to be a formidable task.

◊ Example: DC Level in White Gaussian Noise ◊ We propose the following estimator (7.6): Â = −1/2 + √((1/N) Σ x²[n] + 1/4). ◊ This estimator is biased, since the expectation of a nonlinear function of the data is not the function of the expectation, so E(Â) ≠ A for finite N.

◊ As N → ∞, we have (1/N) Σ x²[n] → E(x²[n]) = A + A² by the law of large numbers, and therefore from (7.6), Â → −1/2 + √(A + A² + 1/4) = A. ◊ To find the mean and variance of Â as N → ∞, we use the statistical linearization argument described in Section 3.6. ◊ (The argument in Section 3.6 is illustrated with an estimator of a DC level.)

◊ It might be supposed that the estimator is efficient for the parameter it estimates. ◊ Let g denote the estimator as a function of the data. If we linearize g about A, we have a first-order approximation; to first order the estimator is then unbiased, and it achieves the CRLB.

◊ The estimator Â of (7.6) is a function of u = (1/N) Σ x²[n]. ◊ Let g be that function, so that Â = g(u), where E(u) = A + A², and therefore g(E(u)) = A. ◊ Linearizing g about E(u), we have Â ≈ g(E(u)) + g′(E(u)) (u − E(u)), or Â ≈ A + (u − (A + A²)) / (2A + 1).

◊ It now follows that the asymptotic mean is E(Â) = A, so that Â is asymptotically unbiased. Additionally, the asymptotic variance becomes, from (7.7), var(Â) ≈ var(u) / (2A + 1)². ◊ But var(u) can be shown to be (4A³ + 2A²)/N (the proof is on the next page), so that var(Â) ≈ A² / (N(A + 1/2)), which is the CRLB (asymptotically efficient).

◊ Summarizing our results: a. The proposed estimator given by (7.6) is asymptotically unbiased and asymptotically achieves the CRLB. Hence, it is asymptotically efficient. b. Furthermore, by the central limit theorem the random variable u = (1/N) Σ x²[n] is Gaussian as N → ∞. Because Â is approximately a linear function of this Gaussian random variable for large data records, it too will have a Gaussian PDF. (For example, if y = ax + b and x is Gaussian, then y is Gaussian.)

7.4 How to find the MLE? ◊ The MLE for a scalar parameter θ is defined to be the value of θ that maximizes p(x; θ) for x fixed, i.e., the value that maximizes the likelihood function. ◊ Since p(x; θ) will also be a function of x, the maximization produces a θ̂ that is a function of x.

◊ Example: DC Level in White Gaussian Noise ◊ Again x[n] = A + w[n], where w[n] is WGN with unknown variance A. ◊ To actually find the MLE for this problem we first write the PDF from (7.1). ◊ Differentiating the log-likelihood function, we have:

◊ Setting it equal to zero produces
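The equation here was stripped; setting the derivative of the log-likelihood to zero leads to a quadratic in A, whose positive root is the estimator already quoted in (7.6). A reconstruction:

```latex
\frac{\partial \ln p(\mathbf{x};A)}{\partial A} = 0
\;\;\Longrightarrow\;\;
A^{2} + A - \frac{1}{N}\sum_{n=0}^{N-1}x^{2}[n] = 0
\;\;\Longrightarrow\;\;
\hat{A} = -\frac{1}{2} + \sqrt{\frac{1}{N}\sum_{n=0}^{N-1}x^{2}[n] + \frac{1}{4}}
```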

◊ We choose the solution that corresponds to the permissible range of A, i.e., A > 0. ◊ Not only does the maximum likelihood procedure yield an estimator that is asymptotically efficient, it also sometimes yields an efficient estimator for finite data records.

◊ Example: DC Level in White Gaussian Noise ◊ For the received data x[n] = A + w[n], where A is the unknown level to be estimated and w[n] is WGN with known variance σ², the PDF is the usual Gaussian likelihood. ◊ Taking the derivative of the log-likelihood function produces:

◊ Setting this equal to zero yields the MLE Â = (1/N) Σ x[n], the sample mean. ◊ This result is true in general: if an efficient estimator exists, the maximum likelihood procedure will produce it. Proof: because an efficient estimator exists, the log-likelihood derivative satisfies the CRLB equality condition; following the maximum likelihood procedure and setting this derivative to zero yields the efficient estimator, as sketched below.
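A sketch of the proof in equations (using the condition under which the CRLB is attained with equality):

```latex
\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}
   = I(\theta)\,\big(\hat{\theta}_{\mathrm{eff}} - \theta\big)
\qquad\text{(an efficient estimator exists)}

\left.\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}\right|_{\theta=\hat{\theta}_{\mathrm{ML}}} = 0
\;\;\Longrightarrow\;\;
I(\hat{\theta}_{\mathrm{ML}})\big(\hat{\theta}_{\mathrm{eff}} - \hat{\theta}_{\mathrm{ML}}\big) = 0
\;\;\Longrightarrow\;\;
\hat{\theta}_{\mathrm{ML}} = \hat{\theta}_{\mathrm{eff}}
```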

7.5 Properties of the MLE ◊ The example discussed in Section 7.3 led to an estimator that for large data records was unbiased, achieved the CRLB, and had a Gaussian PDF; the MLE was asymptotically distributed as a Gaussian with mean A and variance equal to the CRLB (7.8). ◊ Invariance property (MLE for transformed parameters). ◊ Of course, in practice it is seldom known in advance how large N must be in order for (7.8) to hold. ◊ An analytical expression for the PDF of the MLE is usually impossible to derive. As an alternative means of assessing performance, a computer simulation is usually required.

◊ Example: DC Level in White Gaussian Noise ◊ A computer simulation was performed to determine how large the data record had to be for the asymptotic results to apply. ◊ In principle the exact PDF of Â (see (7.6)) could be found, but doing so would be extremely tedious.

◊ Using the Monte Carlo method, M = 1000 realizations of Â were generated for various data record lengths. ◊ The mean and variance were estimated by their sample averages over the M realizations. ◊ Instead of the CRLB of (7.2), we tabulate a normalized version of the variance so that the record lengths can be compared directly. ◊ For a value of A equal to 1 the results are shown in Table 7.1.
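Table 7.1 itself did not survive the transcript. A minimal Python sketch of the kind of Monte Carlo experiment described (the variable names and the normalization by N are choices made here, not taken from the slide) might look like:

```python
import numpy as np

rng = np.random.default_rng(0)
A, M = 1.0, 1000                       # true level and number of Monte Carlo realizations

for N in (5, 10, 20):                  # a few data record lengths
    # x[n] = A + w[n], w[n] ~ N(0, A); the MLE is the positive root of A^2 + A = (1/N) sum x^2[n]
    x = A + rng.normal(scale=np.sqrt(A), size=(M, N))
    A_hat = -0.5 + np.sqrt(np.mean(x**2, axis=1) + 0.25)

    mean_hat = A_hat.mean()                    # Monte Carlo estimate of E(A_hat)
    var_hat = A_hat.var()                      # Monte Carlo estimate of var(A_hat)
    crlb = A**2 / (N * (A + 0.5))              # CRLB of (7.2) for comparison
    print(f"N={N:3d}  mean={mean_hat:.4f}  N*var={N * var_hat:.4f}  N*CRLB={N * crlb:.4f}")
```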

◊ To check this, the number of realizations was increased to M = 5000 for a data record length of N = 20. This resulted in the mean and normalized variance shown in parentheses.

◊ Next, the PDF of Â was determined using a Monte Carlo computer simulation. This was done for data record lengths of N = 5 and N = 20 (M = 5000).

Theorem 7.1 ◊ Theorem 7.1 (Asymptotic Properties of the MLE): If the PDF p(x; θ) of the data x satisfies some “regularity” conditions, then the MLE of the unknown parameter θ is asymptotically distributed (for large data records) according to θ̂ ∼ N(θ, I⁻¹(θ)), where I(θ) is the Fisher information evaluated at the true value of the unknown parameter.

◊ Regularity condition: the derivatives of the log-likelihood function exist, the Fisher information is nonzero, and E[∂ ln p(x; θ)/∂θ] = 0 for all θ. ◊ From the asymptotic distribution, the MLE is seen to be asymptotically unbiased and asymptotically attains the CRLB. ◊ It is therefore asymptotically efficient, and hence asymptotically optimal.

7.5 Properties of the MLE ◊ Example 7.6 – MLE of the Sinusoidal Phase ◊ We wish to estimate the phase φ of a sinusoid embedded in noise, x[n] = A cos(2πf₀n + φ) + w[n], where w[n] is WGN with variance σ² and the amplitude A and frequency f₀ are assumed to be known. ◊ We saw in Chapter 5 that no single sufficient statistic exists for this problem.

◊ The sufficient statistics were T₁(x) = Σ x[n] cos(2πf₀n) and T₂(x) = Σ x[n] sin(2πf₀n).

◊ The MLE is found by maximizing p(x; φ) or, equivalently, by minimizing J(φ) = Σ (x[n] − A cos(2πf₀n + φ))². ◊ Differentiating with respect to φ produces:

◊ Setting it equal to zero yields Σ x[n] sin(2πf₀n + φ) = A Σ cos(2πf₀n + φ) sin(2πf₀n + φ). ◊ But the right-hand side may be approximated by zero, since (1/N) Σ sin(4πf₀n + 2φ)/2 ≈ 0 for f₀ not near 0 or 1/2 (see p. 33).

◊ Thus, the left-hand side when divided by N and set equal to zero will produce an approximate MLE, which satisfies (1/N) Σ x[n] sin(2πf₀n + φ̂) = 0, i.e., φ̂ = −arctan[ Σ x[n] sin(2πf₀n) / Σ x[n] cos(2πf₀n) ].

◊ According to Theorem 7.1, the asymptotic PDF of the phase estimator is φ̂ ∼ N(φ, I⁻¹(φ)). ◊ From Example 3.4, I(φ) ≈ Nη, so that the asymptotic variance is var(φ̂) = 1/(Nη), where η = A²/(2σ²) is the SNR.
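A small Python sketch of this example (the value of σ² and the arctangent form of the approximate MLE are assumptions consistent with the reconstruction above, not values preserved by the transcript):

```python
import numpy as np

rng = np.random.default_rng(1)
A, f0, phi, sigma2 = 1.0, 0.08, np.pi / 4, 0.05   # sigma2 is an assumed value
N, M = 80, 5000
n = np.arange(N)

# M realizations of x[n] = A cos(2 pi f0 n + phi) + w[n]
x = A * np.cos(2 * np.pi * f0 * n + phi) + rng.normal(scale=np.sqrt(sigma2), size=(M, N))

# approximate MLE: phi_hat = -arctan( sum_n x[n] sin(2 pi f0 n) / sum_n x[n] cos(2 pi f0 n) )
num = x @ np.sin(2 * np.pi * f0 * n)
den = x @ np.cos(2 * np.pi * f0 * n)
phi_hat = -np.arctan(num / den)

eta = A**2 / (2 * sigma2)                         # SNR
print("Monte Carlo variance :", phi_hat.var())
print("asymptotic 1/(N*eta) :", 1 / (N * eta))
```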

◊ To determine the data record length needed for the asymptotic mean and variance to apply, we performed a computer simulation using A = 1, f₀ = 0.08, φ = π/4, and a fixed noise variance σ². (When the measured variance falls below the CRLB, the estimator is presumably biased.)

◊ We fixed the data record length N = 80 and varied the SNR.

◊ In Figure 7.4 we have plotted the Monte Carlo variance of the phase estimate against the SNR, together with the CRLB.

◊ The large error estimates are said to be outliers and cause the threshold effect. ◊ Nonlinear estimators nearly always exhibit this effect. ◊ In summary, the asymptotic PDF of the MLE is valid for large enough data records. ◊ For signal in noise problems the CRLB may be attained even for short data records if the SNR is high enough.

◊ To see why this is so, the phase estimator of Example 7.6 can be written as the true phase plus an error term driven by the noise.

◊ If the data record is large and/or the sinusoidal power is large, the noise term is small. It is this condition (a small estimation error) that allows the MLE to attain its asymptotic distribution. ◊ In some cases the asymptotic distribution does not hold, no matter how large the data record and/or the SNR becomes.

◊ Example 7.7 – DC Level in Nonindependent Non-Gaussian Noise ◊ Consider the observations x[n] = A + w[n], n = 0, 1, ..., N−1.

◊ The noise PDF is symmetric about w[n] = 0 and has a maximum at w[n] = 0. Furthermore, we assume all the noise samples are equal, i.e., w[0] = w[1] = … = w[N−1]. To estimate A, we need to consider only a single observation (x[0] = A + w[0]), since all observations are identical. ◊ The MLE of A is the value that maximizes p(x[0] − A); because the noise PDF has its maximum at zero, we get Â = x[0]. ◊ This estimator has mean E(Â) = A.

◊ The variance of Â is the same as the variance of x[0], i.e., of w[0]. ◊ The CRLB (Problem 3.2) and this variance are not in general equal (see Problem 7.16). ◊ So in this example, the estimation error does not decrease as the data record length increases but remains the same.

7.6 MLE for Transformed Parameters ◊ Example 7.8 – Transformed DC Level in WGN ◊ Consider the data x[n] = A + w[n], where w[n] is WGN with variance σ². ◊ We wish to find the MLE of α = exp(A). ◊ The PDF is given as a function of A.

◊ However, since α is a one-to-one transformation of A, we can equivalently parameterize the PDF as p_T(x; α) = p(x; A = ln α). ◊ Clearly, p_T(x; α) is the PDF of the same data set, expressed in terms of α. ◊ Setting the derivative of ln p_T(x; α) with respect to α equal to zero yields ln α̂ = (1/N) Σ x[n] = x̄, i.e., α̂ = exp(x̄).

◊ But x̄ is just the MLE of A, so that α̂ = exp(Â). ◊ The MLE of the transformed parameter is found by substituting the MLE of the original parameter into the transformation. ◊ This property of the MLE is termed the invariance property.

◊ Example 7.9 – Transformed DC Level in WGN ◊ Consider the transformation α = A² for the data set in the previous example. ◊ Attempting to parameterize p(x; A) with respect to α, we find that two values of A (namely ±√α) map to the same α, since the transformation is not one-to-one. ◊ If we choose only one of them, say A = √α, then some of the possible PDFs will be missing.

◊ We actually require two sets of PDFs (7.23), one for A = √α and one for A = −√α, to characterize all possible PDFs. ◊ It is possible to find the MLE of α as the value of α that yields the maximum over both of these PDFs, i.e., the larger of the two at each α.

◊ Alternatively, we can find the maximum in two steps. ◊ For a given value of α, say α₀, determine whether the PDF with A = √α₀ or the one with A = −√α₀ is larger; denote that larger value as p̄_T(x; α₀). Repeat for all α to form p̄_T(x; α). ◊ The MLE is given as the α that maximizes p̄_T(x; α) over α.

[Figure: Construction of the modified likelihood function]

◊ The function p̄_T(x; α) can be thought of as a modified likelihood function, having been derived from the original likelihood function by choosing, for each α, the value of A that yields the maximum likelihood. ◊ The MLE of α is the value that maximizes this modified likelihood function.

Theorem 7.2 ◊ Theorem 7.2 (Invariance Property of the MLE) ◊ The MLE of the parameter α = g(θ), where the PDF p(x; θ) is parameterized by θ, is given by α̂ = g(θ̂), where θ̂ is the MLE of θ. The MLE of θ is obtained by maximizing p(x; θ). ◊ If g is not a one-to-one function, then α̂ maximizes the modified likelihood function p̄_T(x; α), defined as p̄_T(x; α) = max p(x; θ) over all θ for which α = g(θ).

7.6 MLE for Transformed Parameters ◊ Example 7.10 – Power of WGN in dB ◊ We observe N samples of WGN with variance σ² whose power in dB is to be estimated. ◊ To do so we first find the MLE of σ². Then, we use the invariance principle to find the power P in dB, which is defined as P = 10 log₁₀ σ². ◊ The PDF is the usual zero-mean Gaussian likelihood in σ².

◊ Differentiating the log-likelihood function produces the stationarity condition in σ². ◊ Setting it equal to zero yields the MLE σ̂² = (1/N) Σ x²[n]. ◊ The MLE of the power in dB readily follows as P̂ = 10 log₁₀ σ̂².
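A short Python sketch of Example 7.10, illustrating the invariance principle (the variable names and the true σ² are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2_true, N = 2.0, 1000
w = rng.normal(scale=np.sqrt(sigma2_true), size=N)   # N samples of WGN

sigma2_mle = np.mean(w**2)           # MLE of the variance for zero-mean WGN
P_mle = 10 * np.log10(sigma2_mle)    # invariance: MLE of the power in dB
print(sigma2_mle, P_mle)
```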

7.7 Numerical Determination of the MLE ◊ A distinct advantage of the MLE is that we can always find it for a given data set numerically. ◊ The safest way to find the MLE is a grid search; as long as the spacing between grid points is small enough, we are guaranteed to find the MLE.

◊ If, however, the range is not confined to a finite interval, then a grid search may not be computationally feasible. ◊ We use iterative maximization procedures: the Newton-Raphson method, the scoring approach, and the expectation-maximization algorithm. ◊ These methods will produce the MLE if the initial guess is close to the true maximum. If not, convergence may not be attained, or only convergence to a local maximum.

The Newton-Raphson method ◊ In general, the likelihood equation ∂ ln p(x; θ)/∂θ = 0 is a nonlinear equation in θ and cannot be solved directly. ◊ Consider finding a zero of g(θ) = ∂ ln p(x; θ)/∂θ. ◊ Guess an initial value θ₀ and iterate, linearizing g about the current iterate, as shown below.
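The iteration formula was stripped from the slide; in its standard Newton-Raphson form for finding a zero of the derivative of the log-likelihood it reads:

```latex
\theta_{k+1} = \theta_{k} -
\left.\left[\frac{\partial^{2}\ln p(\mathbf{x};\theta)}{\partial \theta^{2}}\right]^{-1}
\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}\right|_{\theta=\theta_{k}}
```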

◊ Note that at convergence θ_{k+1} = θ_k, so that ∂ ln p(x; θ)/∂θ = 0 at the converged point, as required.

◊ The iteration may not converge when the second derivative of the log-likelihood function is small; the correction term may then fluctuate wildly. ◊ Even if the iteration converges, the point found may not be the global maximum but only a local maximum, or even a local minimum. ◊ Generally, if the initial point is close to the global maximum, the iteration will converge to it.

The Method of Scoring ◊ A second common iterative procedure is the method of scoring; it recognizes that the second derivative of the log-likelihood may be replaced by its expected value, the negative of the Fisher information. ◊ Proof: by the law of large numbers, the average of the second derivatives converges to its expectation as N → ∞.

◊ So we get
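In its standard form, the scoring iteration replaces the second derivative by −I(θ):

```latex
-\frac{\partial^{2}\ln p(\mathbf{x};\theta)}{\partial \theta^{2}} \;\approx\; I(\theta)
\qquad\Longrightarrow\qquad
\theta_{k+1} = \theta_{k} + I^{-1}(\theta_{k})\,
\left.\frac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}\right|_{\theta=\theta_{k}}
```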

◊ Example 7.11 – Exponential in WGN ◊ The data are x[n] = r^n + w[n], n = 0, 1, ..., N−1, where the parameter r, the exponential factor, is to be estimated and w[n] is WGN with variance σ². ◊ So we want to minimize the least-squares criterion J(r) given below. ◊ Differentiating and setting the result equal to zero gives the condition g(r) = 0 below.
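The criterion and stationarity condition that were stripped from this slide, reconstructed for the model x[n] = r^n + w[n] stated above:

```latex
J(r) = \sum_{n=0}^{N-1}\big(x[n]-r^{n}\big)^{2},
\qquad
g(r) = \sum_{n=0}^{N-1}\big(x[n]-r^{n}\big)\,n\,r^{\,n-1} = 0
```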

◊ Applying the Newton-Raphson iteration method:

◊ So we get the corresponding update for r.

◊ Applying the method of scoring, we replace the second derivative by its expected value, −I(r), so that the corresponding iteration results.

Computer Simulation ◊ Consider data generated according to the model of Example 7.11 with the parameter values below.

◊ Using N = 50 and r = 0.5, we apply the Newton-Raphson iteration with several initial guesses (0.8, 0.2, and 1.2). ◊ For 0.2 and 0.8 the iteration quickly converged to the true maximum. However, for r₀ = 1.2 the convergence was much slower, requiring 29 iterations. ◊ If the initial guess was less than 0.18 or greater than 1.2, the succeeding iterates exceeded 1 and kept increasing; the Newton-Raphson iteration fails to converge.
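A Python sketch of this experiment (σ² is an assumed value, and the exact convergence behavior depends on the noise realization, so it will not reproduce the iteration counts quoted above exactly):

```python
import numpy as np

rng = np.random.default_rng(3)
N, r_true, sigma2 = 50, 0.5, 0.01            # N and r from the text; sigma2 is an assumed value
n = np.arange(N)
x = r_true**n + rng.normal(scale=np.sqrt(sigma2), size=N)

def g(r):
    # stationarity condition g(r) = sum_n (x[n] - r^n) * n * r^(n-1)
    return np.sum((x - r**n) * n * r**(n - 1))

def g_prime(r):
    # derivative of g(r), needed for the Newton-Raphson update r <- r - g(r)/g'(r)
    return np.sum((x - r**n) * n * (n - 1) * r**(n - 2) - (n * r**(n - 1))**2)

for r0 in (0.2, 0.8, 1.2):                   # initial guesses tried in the text
    r = r0
    for _ in range(50):
        r = r - g(r) / g_prime(r)
        if not np.isfinite(r) or r <= 0:     # iterates have left the valid range: divergence
            print(f"r0 = {r0}: iteration diverged")
            break
    else:
        print(f"r0 = {r0}: r_hat = {r:.4f}")
```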

Conclusion ◊ If the PDF is known, the MLE can be used. ◊ With the MLE, the unknown parameter is estimated by θ̂ = arg max_θ p(x; θ), where x is the vector of observed data (N samples). ◊ Asymptotically unbiased: E(θ̂) → θ as N → ∞. ◊ Asymptotically efficient: var(θ̂) → CRLB as N → ∞. ◊ In words: find the value of the parameter that maximizes the probability of the observed data.