Statistical Decision Theory

Statistical Decision Theory
Abraham Wald (1902 - 1950): Wald's test; rigorous proof of the consistency of the MLE, "Note on the consistency of the maximum likelihood estimate", Ann. Math. Statist., 20 (1949), 595-601.

Statistical Decision Theory and Hypothesis Testing
A major use of statistical inference is its application to decision making under uncertainty, e.g. parameter estimation and hypothesis testing. Unlike classical statistics, which is directed only towards using sampling information to make inferences about unknown numerical quantities, decision theory attempts to combine the sampling information with knowledge of the consequences of our decisions.

Three elements of SDT
- State of nature θ: the unknown quantities, e.g. parameters.
- Decision space D: the space of all possible decisions/actions/rules/estimators.
- Loss function L(θ, d(X)):
  - a non-negative function on Θ x D;
  - a measure of how much we lose by choosing action d when θ is the true state of nature;
  - in estimation, a measure of the accuracy of the estimator d of θ.

For example:
θ = 0 means "nuclear warhead is NOT headed to UBC"; θ = 1 means "nuclear warhead is headed to UBC".
D = {0, 1} = { Stay in Vancouver, Leave }.
L(θ, d): L(0,0) = 0; L(0,1) = cost of moving; L(1,1) = cost of moving + cost of belongings we cannot move; L(1,0) = loss of belongings + ...

Common loss functions
Univariate:
- L1 = |θ - d(x)| (absolute error loss)
- L2 = (θ - d(x))^2 (squared error loss)
Multivariate:
- (Generalized) Euclidean norm: [θ - d(x)]^T Q [θ - d(x)], where Q is positive definite
More generally:
- Non-decreasing functions of L1 or of the Euclidean norm
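
As a minimal sketch (Python/NumPy; the function names are illustrative, not from the slides), these losses might be written as:

```python
import numpy as np

def l1_loss(theta, d):
    """Absolute error loss |theta - d|."""
    return np.abs(theta - d)

def l2_loss(theta, d):
    """Squared error loss (theta - d)^2."""
    return (theta - d) ** 2

def quadratic_loss(theta, d, Q):
    """Generalized Euclidean norm (theta - d)^T Q (theta - d), with Q positive definite."""
    diff = np.asarray(theta) - np.asarray(d)
    return float(diff @ Q @ diff)
```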

The loss L(θ, d(X)) is random. Frequentist: take the expectation over the data X for fixed θ, giving the risk R(θ, d) = E_θ[ L(θ, d(X)) ]. Bayesian: take the expectation over Θ given the observed data x, giving the posterior risk E[ L(Θ, d(x)) | X = x ].

Estimator Comparison
The risk principle: the estimator d1(X) is better than d2(X) in the sense of risk if R(θ, d1) ≤ R(θ, d2) for all θ, with strict inequality for some θ. The best estimator (the uniformly minimum-risk estimator) would be d*(X) = arg min_d R(θ, d(X)) simultaneously for all θ, the minimum being taken over the class of all estimators. However, in general such an estimator does not exist.
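
A minimal simulation sketch (Python/NumPy; the N(θ, 1) model and the two estimators, sample mean vs. sample median, are illustrative assumptions, not from the slides): approximate R(θ, d) = E_θ[L(θ, d(X))] under squared error loss on a grid of θ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 20, 5000

def risk(estimator, theta):
    """Monte Carlo estimate of E_theta[(estimator(X) - theta)^2] for N(theta, 1) samples of size n."""
    X = rng.normal(theta, 1.0, size=(n_rep, n))
    return np.mean((estimator(X) - theta) ** 2)

d1 = lambda X: X.mean(axis=1)         # sample mean
d2 = lambda X: np.median(X, axis=1)   # sample median

for theta in np.linspace(-3, 3, 7):
    print(f"theta={theta:+.1f}  R(theta, mean)={risk(d1, theta):.4f}  R(theta, median)={risk(d2, theta):.4f}")
```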

One remedy: the class of all estimators is too large, so shrink the class of estimators and then find the best estimator within the smaller class. For instance, consider only mean-unbiased estimators; in particular, the UMVUE is the best unbiased estimator when the L2 loss is used.

Another remedy: weaken the optimality criterion by considering the maximum value of the risk over all θ, and then choose the estimator with the smallest maximum risk. The best estimator according to this minimax principle is called the minimax estimator. Notice that the risk depends on θ, so the risk functions of two estimators often cross each other; this is another reason why a uniformly best estimator may not exist. [Figure: risk curves R(θ, d1) and R(θ, d2) crossing, so each estimator is the winner for some θ and the loser for others.]
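
A hedged numerical sketch of crossing risks and the minimax comparison (Python/NumPy; the estimators, the sample mean vs. the constant estimator 0 for a N(θ, 1) mean, are illustrative choices, not from the slides). Under squared error loss, R(θ, mean) = 1/n while R(θ, 0) = θ^2, so the curves cross and the maximum risk over a bounded θ-range decides the minimax comparison.

```python
import numpy as np

n = 20
thetas = np.linspace(-2, 2, 401)            # a bounded grid of states of nature

risk_mean = np.full_like(thetas, 1.0 / n)   # R(theta, Xbar) = 1/n, exact for N(theta, 1)
risk_zero = thetas ** 2                     # R(theta, d = 0) = theta^2

# The two risk curves cross where theta^2 = 1/n, so neither estimator dominates the other.
print("curves cross at |theta| =", 1 / np.sqrt(n))

# Minimax comparison over this grid: pick the estimator with the smaller maximum risk.
print("max risk, mean:", risk_mean.max(), " max risk, zero:", risk_zero.max())
```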

Alternatively, in the Bayesian framework we can find the best estimator by minimizing the average risk with respect to a prior π on θ. Given a prior π, the average risk of the estimator d(X),
r_π(d) = E_π[ R(Θ, d) ] = ∫ R(θ, d) π(θ) dθ,
is called the Bayes risk. The estimator having the smallest Bayes risk with a specific prior π is called the Bayes estimator (with respect to π).
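
A minimal sketch of approximating a Bayes risk by averaging the risk over the prior (Python/NumPy; the N(θ, 1) model with a N(0, 1) prior and the two estimators are illustrative assumptions, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_prior, n_rep = 20, 2000, 500

def bayes_risk(estimator):
    """Approximate r_pi(d) = E_pi[ E_theta[(d(X) - theta)^2] ] by double Monte Carlo."""
    thetas = rng.normal(0.0, 1.0, size=n_prior)        # draws from the prior pi = N(0, 1)
    total = 0.0
    for theta in thetas:
        X = rng.normal(theta, 1.0, size=(n_rep, n))    # data given theta
        total += np.mean((estimator(X) - theta) ** 2)  # risk at this theta
    return total / n_prior

d_mean   = lambda X: X.mean(axis=1)
d_shrunk = lambda X: n / (n + 1) * X.mean(axis=1)      # posterior mean under the N(0, 1) prior

print("r_pi(mean)   =", bayes_risk(d_mean))     # about 1/n
print("r_pi(shrunk) =", bayes_risk(d_shrunk))   # about 1/(n+1), smaller
```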

Under the Bayes risk principle: [Figure: risk curves R(θ, d1) and R(θ, d2) weighted by two different priors π(θ); which of d1 and d2 is the winner depends on the prior.]

In general, it is not easy to find the Bayes estimator by minimizing the Bayes risk directly. However, if the Bayes risk of the Bayes estimator is finite, then the estimator obtained by minimizing the posterior risk (for each observed x) coincides with the Bayes estimator.

Some examples of finding the Bayes estimator
(1) Squared error loss: minimize the posterior risk f(d) = E[(Θ - d)^2 | x] = E[Θ^2 | x] - 2d E[Θ | x] + d^2 over d. Setting f'(d) = 2d - 2 E[Θ | x] = 0 shows that the minimizer of f(d) is E[Θ | x], i.e. the posterior mean.
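
A small worked sketch (Python/SciPy; the Beta-Binomial setup is an illustrative assumption, not from the slides): with a Beta(a, b) prior on θ and x successes in n Bernoulli trials, the posterior is Beta(a + x, b + n - x), so the Bayes estimator under squared error loss is the posterior mean (a + x)/(a + b + n).

```python
from scipy import stats

a, b = 2.0, 2.0          # Beta(a, b) prior on theta
n, x = 10, 7             # observed: 7 successes in 10 trials

posterior = stats.beta(a + x, b + n - x)   # conjugate posterior

bayes_mean = posterior.mean()              # Bayes estimator under squared error loss
print("posterior mean:", bayes_mean)       # (a + x) / (a + b + n) = 9/14
```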

Some examples of finding the Bayes estimator
(2) Absolute error loss: minimize E[|Θ - d| | x] over d; the minimizer is med[Θ | x], i.e. the posterior median.
(3) Linear error loss: L(θ, d) = K_0 (θ - d) if θ - d >= 0 and K_1 (d - θ) if θ - d < 0. The K_0/(K_0 + K_1) quantile of the posterior is the Bayes estimator of θ.
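
Continuing the same hedged Beta-Binomial sketch (Python/SciPy; the numbers are illustrative): under absolute error loss the Bayes estimator is the posterior median, and under the linear (K_0, K_1) loss it is the K_0/(K_0 + K_1) posterior quantile.

```python
from scipy import stats

a, b, n, x = 2.0, 2.0, 10, 7
posterior = stats.beta(a + x, b + n - x)

# Absolute error loss: Bayes estimator is the posterior median.
print("posterior median:", posterior.ppf(0.5))

# Linear loss with cost K0 for underestimation and K1 for overestimation:
K0, K1 = 3.0, 1.0
print("K0/(K0+K1) quantile:", posterior.ppf(K0 / (K0 + K1)))   # the 0.75 posterior quantile
```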

Relationship between minimax and Bayes estimators
Denote by d_π the Bayes estimator with respect to π. If the Bayes risk of d_π is equal to the maximum risk of d_π, i.e. r_π(d_π) = sup_θ R(θ, d_π), then the Bayes estimator d_π is minimax. In particular, if the Bayes estimator has a constant risk, then it is minimax.
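
A hedged numerical check of the classical Binomial(n, p) example (Python/SciPy; the specific estimator is the standard textbook one, not stated on the slides): the Bayes estimator under a Beta(sqrt(n)/2, sqrt(n)/2) prior, d(X) = (X + sqrt(n)/2) / (n + sqrt(n)), has constant squared-error risk n / (4 (n + sqrt(n))^2) and is therefore minimax.

```python
import numpy as np
from scipy import stats

n = 25
root_n = np.sqrt(n)

def risk(p):
    """Exact squared-error risk of d(X) = (X + sqrt(n)/2)/(n + sqrt(n)) at Binomial(n, p)."""
    x = np.arange(n + 1)
    d = (x + root_n / 2) / (n + root_n)
    return np.sum(stats.binom.pmf(x, n, p) * (d - p) ** 2)

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p={p:.1f}  R(p, d)={risk(p):.6f}")        # the same value for every p

print("theoretical constant risk:", n / (4 * (n + root_n) ** 2))
```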

Problems with the risk measure: the risk is too sensitive to the choice of loss function, and all estimators are assumed to have finite risks. So, in general, the risk measure cannot be used in problems with heavy tails or outliers.

Other measures:
(1) Pitman measure of closeness, PMC: d1(X) is Pitman-closer to θ than d2(X) if P_θ( |d1(X) - θ| ≤ |d2(X) - θ| ) ≥ 1/2 for all θ.
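
A minimal simulation sketch of the PMC comparison (Python/NumPy; the N(θ, 1) setting and the two estimators, sample mean vs. sample median, are illustrative assumptions, not from the slides): estimate P_θ( |d1(X) - θ| ≤ |d2(X) - θ| ) and check whether it is at least 1/2.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_rep = 20, 20000

def pitman_prob(theta):
    """Monte Carlo estimate of P_theta(|d1 - theta| <= |d2 - theta|) for N(theta, 1) data."""
    X = rng.normal(theta, 1.0, size=(n_rep, n))
    d1 = X.mean(axis=1)           # sample mean
    d2 = np.median(X, axis=1)     # sample median
    return np.mean(np.abs(d1 - theta) <= np.abs(d2 - theta))

for theta in [-2.0, 0.0, 2.0]:
    print(f"theta={theta:+.1f}  P(|d1-theta| <= |d2-theta|) = {pitman_prob(theta):.3f}")
```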

Other measures:
(2) Universal domination, u.d.: d1(X) is said to universally dominate d2(X) if, for all nondecreasing functions h and all θ, E_θ[ h(||d1(X) - θ||_Q) ] ≤ E_θ[ h(||d2(X) - θ||_Q) ].
(3) Stochastic domination, s.d.: d1(X) is said to stochastically dominate d2(X) if, for every c > 0 and all θ, P_θ[ ||d1(X) - θ||_Q ≤ c ] ≥ P_θ[ ||d2(X) - θ||_Q ≤ c ].

Problems: (1) Pitman measure of closeness, PMC: [Figure: three estimators d1, d2 and d3 relative to θ; pairwise PMC comparisons of three or more estimators need not be transitive, so PMC may fail to single out a best estimator.]

Problems: (2) Universal domination, u.d.: recall that d1(X) universally dominates d2(X) if, for all nondecreasing functions h and all θ, E_θ[ h(||d1(X) - θ||_Q) ] ≤ E_θ[ h(||d2(X) - θ||_Q) ]. Since expectation is a linear operator, taking the linear functions h(t) = a t + b with a > 0 reduces the condition to comparing the expected losses E_θ[ ||d(X) - θ||_Q ].

For all nondecreasing functions h, u.d. requires E_θ[ h(||d1(X) - θ||_Q) ] ≤ E_θ[ h(||d2(X) - θ||_Q) ]; the expectation E plays the role of an operator T here, and the comparison would not depend on the choice of h only for an operator T with the property that T[h(y)] = h[T(y)].

Thank You!!