# Chapter 13: Modeling Considerations and Statistical Information

"All models are wrong; some are useful." (George E. P. Box)


Organization of chapter in ISSO:
- Bias-variance tradeoff
- Model selection: cross-validation
- Fisher information matrix: definition, examples, and efficient computation

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

## 13-2 Model Definition and MSE

Assume the model z = h(θ, x) + v, where z is the output, h(·) is some function, x is the input, v is noise, and θ is the vector of model parameters.
- h(·) may represent a simulation model
- h(·) may represent a "metamodel" (response surface) of an existing simulation

A fundamental goal is to take n data points and estimate θ, forming the estimate θ̂. A common measure of effectiveness for the estimate is the mean of the squared model error (MSE) at a fixed x:

E{[h(θ̂, x) − E(z|x)]² | x}

## 13-3 Bias-Variance Decomposition

The MSE of the model at a fixed x can be decomposed as:

E{[h(θ̂, x) − E(z|x)]² | x} = E{[h(θ̂, x) − E(h(θ̂, x))]² | x} + [E(h(θ̂, x)) − E(z|x)]²
= variance at x + (bias at x)²

where the expectations are computed w.r.t. θ̂. The above implies:
- Model too simple → high bias / low variance
- Model too complex → low bias / high variance
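The decomposition above can be checked numerically. Below is a minimal sketch (not from ISSO) in which the model is a constant h(θ, x) = θ fitted by the sample mean; the true process, noise level, and evaluation point are illustrative assumptions.

```python
import numpy as np

# Illustrative setup: fit a constant model h(theta, x) = theta to data
# z = 2 + x + noise, evaluated at the fixed point x = 1, so E(z|x) = 3.
rng = np.random.default_rng(0)
x_fixed, true_mean = 1.0, 3.0
n, trials = 10, 20000

preds = np.empty(trials)
for t in range(trials):
    z = 2 + x_fixed + rng.normal(0, 1, size=n)  # n noisy observations at x
    preds[t] = z.mean()                          # theta_hat = sample mean

mse = np.mean((preds - true_mean) ** 2)          # E{[h(theta_hat,x) - E(z|x)]^2}
var = preds.var()                                # variance at x (w.r.t. theta_hat)
bias_sq = (preds.mean() - true_mean) ** 2        # (bias at x)^2
print(mse, var + bias_sq)                        # the two sides agree
```

For sample moments the identity is exact (up to floating-point error), which is why the two printed values match.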

## 13-4 An Unbiased Estimator May Not Be Best (Example 13.1 from ISSO)

An unbiased estimator is one for which E[h(θ̂, x)] = E(z|x) (i.e., the mean of the prediction is the same as the mean of the data z).

Example: Let z̄ denote the sample mean of scalar i.i.d. data, used as an estimator of the true mean μ (so h(θ, x) = θ in the notation above). An alternative biased estimator of μ is r z̄, where 0 < r < 1. For suitably chosen r, the MSEs of the biased and unbiased estimators satisfy

MSE(r z̄) < MSE(z̄)

so the biased estimate is better in the MSE sense.
- However, the optimal value of r requires knowledge of the unknown (true) μ.
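A quick Monte Carlo sketch of this idea (the values of μ, σ, n, and r below are illustrative assumptions, not taken from ISSO):

```python
import numpy as np

# Shrinking the sample mean toward zero by a factor r can beat the
# unbiased sample mean in MSE: variance shrinks by r^2 while only a
# small squared bias (r - 1)^2 * mu^2 is introduced.
rng = np.random.default_rng(1)
mu, sigma, n, r, trials = 1.0, 2.0, 5, 0.8, 50000

zbar = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
mse_unbiased = np.mean((zbar - mu) ** 2)     # approx sigma^2 / n = 0.8
mse_biased = np.mean((r * zbar - mu) ** 2)   # approx r^2 sigma^2/n + (r-1)^2 mu^2

print(mse_unbiased, mse_biased)              # biased estimator wins here
```

Analytically, MSE(r z̄) = r²σ²/n + (r − 1)²μ², and the r minimizing this depends on the unknown μ, exactly as the slide notes.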

## 13-5 Bias-Variance Tradeoff in Model Selection in a Simple Problem

(Figure: illustration of the bias-variance tradeoff; plot not reproduced in this transcript.)

## 13-6 Example 13.2 in ISSO: Bias-Variance Tradeoff

Suppose the true process produces output according to z = f(x) + noise, where f(x) = (x + x²)^1.1. Compare linear, quadratic, and cubic approximations. A table in ISSO gives the average bias, variance, and MSE for each model. The overall pattern is decreasing bias and increasing variance as model order grows; the optimal tradeoff is the quadratic model.

## 13-7 Model Selection

The bias-variance tradeoff provides a conceptual framework for determining a good model.
- The tradeoff is not directly useful by itself; a practical method is needed for optimizing it.

The practical aim is to pick a model that minimizes a criterion of the form

f₁(fitting error from given data) + f₂(model complexity)

where f₁ and f₂ are increasing functions. All methods balance fitting error (reducing it lowers bias but tends to raise variance) against model complexity (penalizing it lowers variance but tends to raise bias). The criterion above may or may not be used explicitly in a given method.

## 13-8 Methods for Model Selection

Among many popular methods are:
- Akaike Information Criterion (AIC) (Akaike, 1974): popular in time series analysis
- Bayesian selection (Akaike, 1977)
- Bootstrap-based selection (Efron and Tibshirani, 1997)
- Cross-validation (Stone, 1974)
- Minimum description length (Rissanen, 1978)
- V-C dimension (Vapnik and Chervonenkis, 1971): popular in computer science

Cross-validation appears to be the most popular model selection method.

## 13-9 Cross-Validation

Cross-validation is a simple, general method for comparing candidate models.
- Other, specialized methods may work better in specific problems.

Cross-validation uses the training set of data. The method is based on iteratively partitioning the full set of training data into training and test subsets. For each partition, estimate the model from the training subset and evaluate the model on the test subset.
- Number of training (or test) subsets = number of model fits required

Select the model that performs best over all test subsets.

## 13-10 Choice of Training and Test Subsets

Let n denote the total size of the data set and n_T denote the size of a test subset, n_T < n. A common strategy is leave-one-out: n_T = 1.
- Implies n test subsets during the cross-validation process

It is often better to choose n_T > 1:
- Sometimes more efficient (sampling without replacement)
- Sometimes more accurate model selection

If n_T > 1, sampling may be with or without replacement.
- "With replacement" indicates that there are "n choose n_T" possible test subsets.
- Sampling with replacement may be prohibitive in practice: e.g., n = 30, n_T = 6 implies nearly 600K model fits!
- Sampling without replacement reduces the number of test subsets to n / n_T (disjoint test subsets).
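The two subset counts from the slide can be computed directly; this short check reproduces the "nearly 600K" figure for n = 30, n_T = 6.

```python
import math

# Count test subsets for n = 30 data points and test-subset size n_T = 6:
# sampling with replacement allows any of the "n choose n_T" subsets,
# while disjoint (without-replacement) partitioning needs only n / n_T fits.
n, n_T = 30, 6
with_replacement = math.comb(n, n_T)   # 30 choose 6 = 593,775 model fits
without_replacement = n // n_T         # 5 disjoint test subsets
print(with_replacement, without_replacement)
```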

## 13-11 Conceptual Example of Sampling Without Replacement: Cross-Validation with 3 Disjoint Test Subsets

(Figure: schematic of the data split into 3 disjoint test subsets; plot not reproduced in this transcript.)

## 13-12 Typical Steps for Cross-Validation

Step 0 (initialization): Determine the size of the test subsets and the candidate model. Let i be the counter for the test subset being used.

Step 1 (estimation): For the i-th test subset, let the remaining data be the i-th training subset. Estimate θ from this training subset.

Step 2 (error calculation): Based on the estimate of θ from Step 1 (i-th training subset), calculate the MSE (or other measure) with the data in the i-th test subset.

Step 3 (new training and test subsets): Update i to i + 1 and return to Step 1. Form the mean of the MSE values when all test subsets have been evaluated.

Step 4 (new model): Repeat Steps 1 to 3 for the next model. Choose the model with the lowest mean MSE as best.

## 13-13 Numerical Illustration of Cross-Validation (Example 13.4 in ISSO)

Consider a true system corresponding to a sine function of the input with additive normally distributed noise. Consider three candidate models:
- Linear (affine) model
- 3rd-order polynomial
- 10th-order polynomial

Suppose 30 data points are available, divided into 5 disjoint test subsets (sampling without replacement). Based on RMS error (equivalent to MSE) over the test subsets, the 3rd-order polynomial is preferred. See the following plot.
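A sketch in the spirit of this example, following the steps of slide 13-12: 30 points, 5 disjoint test subsets, and the three candidate polynomial orders. The input range, noise level, and random seed are illustrative assumptions, so the exact RMS values will differ from ISSO's.

```python
import numpy as np

# Sine-function process with additive Gaussian noise (noise level assumed).
rng = np.random.default_rng(2)
n, k = 30, 5
x = np.sort(rng.uniform(0.0, 1.0, n))
z = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, n)

folds = np.array_split(rng.permutation(n), k)   # 5 disjoint test subsets
rms = {}
for order in (1, 3, 10):
    errs = []
    for test_idx in folds:
        train = np.setdiff1d(np.arange(n), test_idx)
        coef = np.polyfit(x[train], z[train], order)        # fit on training subset
        resid = np.polyval(coef, x[test_idx]) - z[test_idx] # evaluate on test subset
        errs.append(np.mean(resid ** 2))
    rms[order] = float(np.sqrt(np.mean(errs)))              # RMS over test subsets
    print(order, rms[order])
```

A straight line cannot track a full sine period, so its test RMS error is much larger than the cubic's; the 10th-order model's error reflects overfitting the noise.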

## 13-14 Numerical Illustration (cont'd): Relative Fits for 3 Models with Low-Noise Observations

(Figure: fitted curves for the linear, 3rd-order, and 10th-order models plotted against the sine wave (process mean); plot not reproduced in this transcript.)

## 13-15 Fisher Information Matrix

A fundamental role of data analysis is to extract information from data, and parameter estimation for models is central to that process. The Fisher information matrix plays a central role in parameter estimation as a measure of information: it summarizes the amount of information in the data relative to the parameters being estimated.

## 13-16 Problem Setting

Consider the classical statistical problem of estimating a parameter vector θ from n data vectors z₁, z₂, …, zₙ. Suppose we have a probability density and/or mass function associated with the data. The parameters θ appear in the probability function and affect the nature of the distribution.
- Example: zᵢ ~ N(mean(θ), covariance(θ)) for all i

Let ℓ(θ | z₁, z₂, …, zₙ) represent the likelihood function, i.e., the p.d.f./p.m.f. viewed as a function of θ conditioned on the data.

## 13-17 Information Matrix: Definition

Recall the likelihood function ℓ(θ | z₁, z₂, …, zₙ). The information matrix is defined as

F_n(θ) = E[ (∂ log ℓ/∂θ) (∂ log ℓ/∂θᵀ) ]

where the expectation is w.r.t. z₁, z₂, …, zₙ. An equivalent form is based on the Hessian matrix:

F_n(θ) = −E[ ∂² log ℓ / ∂θ ∂θᵀ ]

F_n(θ) is positive semidefinite of dimension p × p (p = dim(θ)).
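The outer-product definition can be verified numerically in the simplest scalar case, i.i.d. zᵢ ~ N(θ, σ²) with σ known, where the score is ∂ log ℓ/∂θ = Σ(zᵢ − θ)/σ² and the analytic information is F_n(θ) = n/σ². The particular values of θ, σ, and n below are illustrative.

```python
import numpy as np

# Monte Carlo estimate of E[(d log l / d theta)^2] for i.i.d. N(theta, sigma^2)
# data, compared against the analytic value F_n(theta) = n / sigma^2.
rng = np.random.default_rng(3)
theta, sigma, n, trials = 0.5, 1.5, 20, 200000

z = rng.normal(theta, sigma, size=(trials, n))
score = (z - theta).sum(axis=1) / sigma**2       # gradient of log-likelihood
F_outer = np.mean(score ** 2)                    # outer-product form (scalar case)
F_analytic = n / sigma**2
print(F_outer, F_analytic)                       # the two should nearly agree
```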

## 13-18 Information Matrix: Two Key Properties

The connection between F_n(θ) and the uncertainty in the estimate θ̂_n is rigorously specified via two famous results (θ* = true value of θ):

1. Asymptotic normality: √n (θ̂_n − θ*) converges in distribution to N(0, F̄⁻¹), where F̄ = lim_{n→∞} F_n(θ*)/n.

2. Cramér-Rao inequality: cov(θ̂_n) ≥ F_n(θ*)⁻¹ for any unbiased estimator θ̂_n.

The two results indicate: greater variability of θ̂_n corresponds to a "smaller" F_n(θ) (and vice versa).
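A small sketch of the Cramér-Rao bound in the scalar Gaussian-mean case: the sample mean is unbiased with variance σ²/n, which equals the bound 1/F_n(θ) since F_n(θ) = n/σ². The numbers below are illustrative assumptions.

```python
import numpy as np

# The sample mean attains the Cramer-Rao bound for N(theta, sigma^2) data:
# var(zbar) = sigma^2 / n = 1 / F_n(theta).
rng = np.random.default_rng(4)
theta, sigma, n, trials = 0.0, 2.0, 10, 200000

zbar = rng.normal(theta, sigma, size=(trials, n)).mean(axis=1)
var_zbar = zbar.var()         # Monte Carlo variance of the estimator
crb = sigma**2 / n            # Cramer-Rao bound = 1 / F_n(theta)
print(var_zbar, crb)          # nearly equal: the bound is attained
```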

## 13-19 Selected Applications

The information matrix is a measure of performance for several applications. Four uses are:
1. Confidence regions for parameter estimation
   - Uses asymptotic normality and/or the Cramér-Rao inequality
2. Prediction bounds for mathematical models
3. Basis for the "D-optimal" criterion in experimental design
   - The information matrix serves as a measure of how well θ can be estimated for a given set of inputs
4. Basis for a "noninformative prior" in Bayesian analysis
   - Sometimes used for "objective" Bayesian inference

