
II.2 Statistical Inference: Sampling and Estimation

Statistical Model
A statistical model M is a set of distributions (or regression functions), e.g., all unimodal, smooth distributions. M is called a parametric model if it can be completely described by a finite number of parameters, e.g., the family of Normal distributions parameterized by μ and σ:
f(x; μ, σ) = (1 / (σ√(2π))) exp(−(x − μ)² / (2σ²))

Statistical Inference
Given a parametric model M and a sample X₁,...,Xₙ, how do we infer (learn) the parameters of M? For multivariate models with an observed variable X and an "outcome (response)" variable Y, this is called prediction or regression; for a discrete outcome variable it is also called classification. r(x) = E[Y | X=x] is called the regression function.

Idea of Sampling
Distribution X (e.g., a population, objects of interest). Samples X₁,…,Xₙ are drawn from X (e.g., people, objects). Statistical inference asks: what can we say about X based on X₁,…,Xₙ?
Example: Suppose we want to estimate the average salary of employees in German companies.
Sample 1: we look at n=200 top-paid CEOs of major banks.
Sample 2: we look at n=100 employees across all kinds of companies.
Distribution parameters correspond to sample statistics:
Distribution param.   Sample param.
μ_X                   sample mean
σ_X²                  sample variance
size N                sample size n

Basic Types of Statistical Inference
Given a set of iid. samples X₁,...,Xₙ ~ X of an unknown distribution X, e.g., n single-coin-toss experiments X₁,...,Xₙ ~ X: Bernoulli(p).
Parameter estimation, e.g.: What is the parameter p of X: Bernoulli(p)? What is E[X], the cdf F_X of X, the pdf f_X of X, etc.?
Confidence intervals, e.g.: Give all intervals C=(a,b) such that P(p ∈ C) ≥ 0.95, where a and b are derived from the samples X₁,...,Xₙ.
Hypothesis testing, e.g.: H₀: p = 1/2 vs. H₁: p ≠ 1/2.

Statistical Estimators
A point estimator for a parameter θ of a probability distribution X is a random variable derived from an iid. sample X₁,...,Xₙ. Examples:
Sample mean: X̄ := (1/n) Σᵢ Xᵢ
Sample variance: S² := (1/(n−1)) Σᵢ (Xᵢ − X̄)²
An estimator θ̂ₙ for parameter θ is unbiased if E[θ̂ₙ] = θ; otherwise the estimator has bias E[θ̂ₙ] − θ. An estimator on a sample of size n is consistent if θ̂ₙ converges to θ in probability, i.e., limₙ→∞ P[|θ̂ₙ − θ| ≤ ε] = 1 for every ε > 0.
Sample mean and sample variance are unbiased and consistent estimators of μ_X and σ_X².
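As a quick illustration (not part of the original slides), here is a minimal NumPy sketch that checks unbiasedness empirically by averaging both estimators over many simulated samples; the distribution parameters, sample size, and trial count are made-up illustrative values:

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, trials = 5.0, 2.0, 30, 100_000  # hypothetical setup

means, variances = [], []
for _ in range(trials):
    x = rng.normal(mu, sigma, n)
    means.append(x.mean())           # sample mean
    variances.append(x.var(ddof=1))  # sample variance with 1/(n-1)

# Averaging over many samples approximates the estimator's expectation.
print(np.mean(means), mu)               # ~5.0: E[sample mean] = mu
print(np.mean(variances), sigma ** 2)   # ~4.0: E[sample variance] = sigma^2
```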

Estimator Error
Let θ̂ₙ be an estimator for parameter θ over iid. samples X₁,...,Xₙ. The distribution of θ̂ₙ is called the sampling distribution. The standard error for θ̂ₙ is:
se(θ̂ₙ) = √(Var[θ̂ₙ])
The mean squared error (MSE) for θ̂ₙ is:
MSE(θ̂ₙ) = E[(θ̂ₙ − θ)²] = bias²(θ̂ₙ) + Var[θ̂ₙ]
Theorem: If bias → 0 and se → 0 then the estimator is consistent.
The estimator θ̂ₙ is asymptotically Normal if (θ̂ₙ − θ) / se(θ̂ₙ) converges in distribution to the standard Normal N(0,1).
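A small simulation sketch (again an addition, with made-up values) that verifies the MSE decomposition for the biased 1/n variance estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 0.0, 1.0, 10, 200_000  # hypothetical setup
est = np.array([rng.normal(mu, sigma, n).var(ddof=0) for _ in range(trials)])

bias = est.mean() - sigma ** 2        # E[theta_hat] - theta (negative here)
se = est.std()                        # sd of the sampling distribution
mse = np.mean((est - sigma ** 2) ** 2)
print(mse, bias ** 2 + se ** 2)       # the two agree: MSE = bias^2 + se^2
```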

Types of Estimation
Nonparametric estimation: no assumptions about the model M or the parameters θ of the underlying distribution X; uses "plug-in estimators" (e.g., histograms) to approximate X.
Parametric estimation (inference): requires assumptions about the model M and the parameters θ of the underlying distribution X; analytical or numerical methods for estimating θ, e.g., the Method-of-Moments estimator, or the Maximum Likelihood estimator and Expectation Maximization (EM).

Nonparametric Estimation
The empirical distribution function F̂ₙ is the cdf that puts probability mass 1/n at each data point Xᵢ:
F̂ₙ(x) = (1/n) Σᵢ I(Xᵢ ≤ x)
A statistical functional ("statistic") T(F) is any function of F, e.g., mean, variance, skewness, median, quantiles, correlation. The plug-in estimator of θ = T(F) is:
θ̂ₙ = T(F̂ₙ)
That is, simply use F̂ₙ instead of F to calculate the statistic T of interest.
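A minimal sketch of the empirical cdf and a plug-in estimate (an added example; the data points are arbitrary):

```python
import numpy as np

def ecdf(data):
    """Empirical cdf F_hat: probability mass 1/n at each data point."""
    x = np.sort(np.asarray(data))
    return lambda t: np.searchsorted(x, t, side="right") / len(x)

data = [3.1, 0.5, 2.2, 4.8, 2.9]
F = ecdf(data)
print(F(2.5))                 # fraction of points <= 2.5, here 2/5
# Plug-in estimate of the median: the 0.5-quantile of F_hat.
print(np.quantile(data, 0.5))
```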

Histograms as Density Estimators
Instead of the full empirical distribution, compact data synopses may often be used, such as histograms, where X₁,...,Xₙ are grouped into m cells (buckets) c₁,...,cₘ with bucket boundaries lb(cᵢ) and ub(cᵢ) s.t. lb(c₁) = −∞, ub(cₘ) = ∞, ub(cᵢ) = lb(cᵢ₊₁) for 1 ≤ i < m, and relative frequencies freq_f(cᵢ) = (1/n) |{j : Xⱼ ∈ cᵢ}| and freq_F(cᵢ) = (1/n) |{j : Xⱼ ≤ ub(cᵢ)}|.
Histograms provide a (discontinuous) density estimator.
Example: X₁ = 1, X₂ = 1, X₃ = 2, X₄ = 2, X₅ = 2, X₆ = 3, …, X₂₀ = 7, with relative frequencies
x:       1     2     3     4     5     6     7
f_X(x):  2/20  3/20  5/20  4/20  3/20  2/20  1/20
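A sketch of a histogram as a piecewise-constant density estimator (an addition; the synthetic data and bucket count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 1000)

# m equi-width cells; density in cell c_i is freq(c_i) / (n * width(c_i)).
counts, edges = np.histogram(x, bins=10)
widths = np.diff(edges)
f_hat = counts / (len(x) * widths)   # piecewise-constant density estimate
print(np.sum(f_hat * widths))        # integrates to 1.0
```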

Parametric Inference (1): Method of Moments
Suppose parameter θ = (θ₁,…,θₖ) has k components. Compute the j-th moment:
αⱼ(θ) = E_θ[Xʲ] = ∫ xʲ dF_θ(x)
and the j-th sample moment:
α̂ⱼ = (1/n) Σᵢ Xᵢʲ for 1 ≤ j ≤ k
Estimate parameter θ by the method-of-moments estimator θ̂ s.t. α₁(θ̂) = α̂₁ and … and αₖ(θ̂) = α̂ₖ (for the first k moments), i.e., solve an equation system with k equations and k unknowns.
Method-of-moments estimators are usually consistent and asymptotically Normal, but may be biased.
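A minimal sketch of the two-moment case for N(μ, σ²), where α₁ = μ and α₂ = σ² + μ² (an added example with made-up true parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=1.5, size=10_000)

a1 = np.mean(x)          # first sample moment
a2 = np.mean(x ** 2)     # second sample moment
mu_hat = a1              # solving alpha_1(theta) = a1
sigma2_hat = a2 - a1**2  # solving alpha_2(theta) = a2
print(mu_hat, sigma2_hat)  # ~3.0 and ~2.25
```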

Parametric Inference (2): Maximum Likelihood Estimators (MLE)
Let X₁,...,Xₙ be iid. with pdf f(x; θ). Estimate parameter θ of a postulated distribution f(x; θ) such that the likelihood that the sample values x₁,...,xₙ are generated by this distribution is maximized.
Maximum likelihood estimation: maximize L(x₁,...,xₙ; θ) ≈ P[x₁,...,xₙ originate from f(x; θ)], usually formulated as Lₙ(θ) = ∏ᵢ f(Xᵢ; θ), or (equivalently) maximize the log-likelihood lₙ(θ) = log Lₙ(θ).
The value θ̂ that maximizes Lₙ(θ) is the MLE of θ. If this is analytically intractable, use numerical iteration methods.

Simple Example for Maximum Likelihood Estimator
Given: a coin toss experiment (Bernoulli distribution) with unknown parameter p for seeing heads, 1−p for tails.
Sample (data): h heads in n coin tosses, i.e., h = Σᵢ Xᵢ.
Want: the maximum likelihood estimate of p.
Likelihood: L(h, n, p) = pʰ (1−p)ⁿ⁻ʰ
Maximize the log-likelihood function log L(h, n, p) = h log p + (n−h) log(1−p). Setting the derivative h/p − (n−h)/(1−p) to zero yields the MLE p̂ = h/n.
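A sketch that confirms the closed-form result numerically via a grid search over p (an added example; h and n are arbitrary):

```python
import numpy as np

h, n = 13, 20   # hypothetical data: 13 heads in 20 tosses

p = np.linspace(1e-6, 1 - 1e-6, 100_001)
log_L = h * np.log(p) + (n - h) * np.log(1 - p)   # log-likelihood
print(p[np.argmax(log_L)])   # ~0.65 = h/n, matching the closed-form MLE
```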

MLE for Parameters of Normal Distributions
For an iid. sample x₁,...,xₙ from N(μ, σ²), maximize
log L(x₁,...,xₙ; μ, σ²) = −(n/2) log(2πσ²) − (1/(2σ²)) Σᵢ (xᵢ − μ)²
Setting the partial derivatives with respect to μ and σ² to zero yields
μ̂ = (1/n) Σᵢ xᵢ and σ̂² = (1/n) Σᵢ (xᵢ − μ̂)²

MLE Properties
Maximum likelihood estimators are consistent, asymptotically Normal, and asymptotically optimal (i.e., efficient) in the following sense: consider two estimators U and T which are asymptotically Normal, and let u² and t² denote the variances of the two Normal distributions to which U and T converge in probability. The asymptotic relative efficiency of U to T is ARE(U, T) := t²/u².
Theorem: For an MLE θ̂ₙ and any other estimator θ̃ₙ the following inequality holds: ARE(θ̃ₙ, θ̂ₙ) ≤ 1.
That is, among all estimators the MLE has the smallest asymptotic variance.
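As a concrete illustration (an addition, not from the slides): for Normal data the sample mean is the MLE for μ, while the sample median is an alternative asymptotically Normal estimator; a simulation under these assumptions recovers the classical ARE of about 2/π:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 100, 50_000
samples = rng.normal(0.0, 1.0, size=(trials, n))

var_mean = samples.mean(axis=1).var()        # variance of the MLE (sample mean)
var_median = np.median(samples, axis=1).var()  # variance of the sample median
print(var_mean / var_median)   # ARE(median, mean) ~ 2/pi ~ 0.64 < 1: the MLE wins
```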

Bayesian Viewpoint of Parameter Estimation
Assume a prior distribution g(θ) of parameter θ. Choose a statistical model (generative model) f(x | θ) that reflects our beliefs about the RV X. Given RVs X₁,...,Xₙ for the observed data, the posterior distribution is h(θ | x₁,...,xₙ).
For X₁ = x₁,...,Xₙ = xₙ the likelihood is L(x₁,...,xₙ | θ) = ∏ᵢ f(xᵢ | θ), which implies
h(θ | x₁,...,xₙ) ∝ L(x₁,...,xₙ | θ) · g(θ)
(the posterior is proportional to likelihood times prior).
MAP estimator (maximum a posteriori): compute the θ that maximizes h(θ | x₁,…,xₙ) given a prior for θ.
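A minimal sketch for the coin-toss example above, assuming a Beta(a, b) prior (a standard conjugate choice; the prior pseudo-counts and data are made-up values):

```python
# Beta(a, b) prior is conjugate to the Bernoulli likelihood:
# the posterior is Beta(a + h, b + n - h), whose mode is the MAP estimate.
a, b = 2.0, 2.0   # assumed prior pseudo-counts, for illustration
h, n = 13, 20     # hypothetical data: 13 heads in 20 tosses

p_map = (a + h - 1) / (a + b + n - 2)   # mode of Beta(a+h, b+n-h)
p_mle = h / n
print(p_map, p_mle)   # ~0.636 vs 0.65: the prior pulls the estimate toward 0.5
```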

Analytically Non-tractable MLE
Consider samples from a k-mixture of m-dimensional Normal distributions (e.g., height and weight of males and females) with the density
f(x | θ) = Σⱼ₌₁..ₖ πⱼ (2π)^(−m/2) |Σⱼ|^(−1/2) exp(−(1/2)(x − μⱼ)ᵀ Σⱼ⁻¹ (x − μⱼ))
with mixture weights πⱼ, expectation values μⱼ, and invertible, positive definite, symmetric m×m covariance matrices Σⱼ.
Maximizing the log-likelihood function log L(x₁,...,xₙ; θ) = Σᵢ log f(xᵢ | θ) is analytically intractable, because the sum over components sits inside the logarithm.

Expectation-Maximization Method (EM)
Key idea: when L(X₁,...,Xₙ, θ) (where the Xᵢ and θ are possibly multivariate) is analytically intractable, then
introduce latent (i.e., hidden, invisible, missing) random variable(s) Z such that the joint distribution J(X₁,...,Xₙ, Z, θ) of the "complete" data is tractable (often with Z actually being multivariate: Z₁,...,Zₘ), and
iteratively derive the expected complete-data likelihood by integrating out Z and find the best θ: maximize E_{Z|X,θ}[J(X₁,…,Xₙ, Z, θ)].

EM Procedure
Initialization: choose a start estimate θ⁽⁰⁾ (e.g., using a Method-of-Moments estimator).
Iterate (t = 0, 1, …) until convergence:
E step (expectation): estimate the posterior probability of Z, P[Z | X₁,…,Xₙ, θ⁽ᵗ⁾], assuming θ were known and equal to the previous estimate θ⁽ᵗ⁾, and compute E_{Z|X,θ⁽ᵗ⁾}[log J(X₁,…,Xₙ, Z, θ)] by integrating over values for Z.
M step (maximization, MLE step): estimate θ⁽ᵗ⁺¹⁾ = arg max_θ E_{Z|X,θ⁽ᵗ⁾}[log J(X₁,…,Xₙ, Z, θ)].
Convergence is guaranteed (because the E step computes a lower bound of the true L function, and the M step yields monotonically non-decreasing likelihood), but may result in a local maximum of the (log-)likelihood function. A code sketch follows the next slide's example.

EM Example for Multivariate Normal Mixture
Introduce indicators Zᵢⱼ = 1 if the i-th data point Xᵢ was generated by the j-th component, 0 otherwise.
Expectation step (E step): γᵢⱼ := E[Zᵢⱼ | Xᵢ, θ⁽ᵗ⁾] = πⱼ f(Xᵢ; μⱼ, Σⱼ) / Σₗ πₗ f(Xᵢ; μₗ, Σₗ)
Maximization step (M step): πⱼ = (1/n) Σᵢ γᵢⱼ, μⱼ = Σᵢ γᵢⱼ Xᵢ / Σᵢ γᵢⱼ, Σⱼ = Σᵢ γᵢⱼ (Xᵢ − μⱼ)(Xᵢ − μⱼ)ᵀ / Σᵢ γᵢⱼ
See L. Wasserman, p. 121 ff. for k=2, m=1.
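Below is a sketch of these two steps for a one-dimensional two-component mixture, assuming NumPy and SciPy; the synthetic data, start estimates, and iteration count are all made-up illustrative choices:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
# Synthetic data from a 2-component mixture (e.g., two subpopulations).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1.5, 700)])

# Start estimates (e.g., from a rough method-of-moments guess).
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(100):
    # E step: responsibility gamma_ij that point i came from component j.
    dens = pi * norm.pdf(x[:, None], mu, sigma)     # shape (n, 2)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M step: weighted MLE updates using the responsibilities.
    nj = gamma.sum(axis=0)
    pi = nj / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / nj
    sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nj)

# Close to the true (0.3, 0.7), (-2, 3), (1, 1.5), possibly with components swapped.
print(pi, mu, sigma)
```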

Confidence Intervals
An interval estimator for parameter θ is an estimator T together with a width a such that P[T − a ≤ θ ≤ T + a] ≥ 1 − α. [T−a, T+a] is the confidence interval and 1−α is the confidence level.
For the distribution of a random variable X, a value x_γ (0 < γ < 1) with P[X ≤ x_γ] ≥ γ and P[X ≥ x_γ] ≥ 1 − γ is called a γ-quantile; the 0.5-quantile is called the median. For the Normal distribution N(0,1) the γ-quantile is denoted φ_γ.
For a given a or α, find a value z of N(0,1) that denotes the [T−a, T+a] confidence interval or a corresponding γ-quantile for 1−α.

Confidence Intervals for Expectations (1)
Let x₁,...,xₙ be a sample from a distribution with unknown expectation μ and known variance σ². For sufficiently large n, the sample mean X̄ is N(μ, σ²/n) distributed and Z := (X̄ − μ) / (σ/√n) is N(0,1) distributed:
P[−z ≤ Z ≤ z] = Φ(z) − Φ(−z) = 2Φ(z) − 1
For a confidence level 1−α, set z := φ_{1−α/2} (the (1−α/2)-quantile of N(0,1)); then the confidence interval is
[X̄ − z·σ/√n, X̄ + z·σ/√n]
Conversely, for a given interval half-width a, look up Φ(a√n/σ) to find the confidence level 1−α.
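A minimal sketch of the known-variance interval, assuming SciPy for the Normal quantile (the data and parameters are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
sigma = 2.0                        # known standard deviation
x = rng.normal(10.0, sigma, 50)    # hypothetical sample

alpha = 0.05
z = norm.ppf(1 - alpha / 2)        # 0.975-quantile of N(0,1), ~1.96
half = z * sigma / np.sqrt(len(x))
print(x.mean() - half, x.mean() + half)   # 95% confidence interval for mu
```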

Confidence Intervals for Expectations (2)
Let X₁,...,Xₙ be an iid. sample from a distribution X with unknown expectation μ and unknown variance σ², and with sample variance S² computed from the sample. For sufficiently large n, the random variable
T := (X̄ − μ) / (S/√n)
has a t distribution (Student distribution) with n−1 degrees of freedom, where the t density with n degrees of freedom is
f(t) = Γ((n+1)/2) / (√(nπ) Γ(n/2)) · (1 + t²/n)^(−(n+1)/2)
with the Gamma function Γ(x) = ∫₀^∞ t^(x−1) e^(−t) dt for x > 0.
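The corresponding unknown-variance interval replaces σ with S and the Normal quantile with a t quantile; a sketch under the same assumptions as above:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(6)
x = rng.normal(10.0, 2.0, 15)      # small hypothetical sample, variance unknown

alpha = 0.05
q = t.ppf(1 - alpha / 2, df=len(x) - 1)   # t quantile with n-1 degrees of freedom
half = q * x.std(ddof=1) / np.sqrt(len(x))
print(x.mean() - half, x.mean() + half)   # wider than the z interval for small n
```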

Summary of Section II.2
Quality measures for statistical estimators
Nonparametric vs. parametric estimation
Histograms as generic (nonparametric) plug-in estimators
Method-of-Moments estimator: good initial guess, but may be biased
Maximum-Likelihood estimator & Expectation Maximization (EM)
Confidence intervals for parameters

Normal Distribution Table (table not reproduced here)

Student's t Distribution Table (table not reproduced here)