Additional Topics in Prediction Methodology

Introduction
The predictive distribution for a random variable Y_0 is meant to capture all the information about Y_0 that is contained in the training data Y^n. It does not completely specify Y_0, but it does provide a probability distribution over more likely and less likely values of Y_0. The conditional mean E{Y_0 | Y^n} is the best MSPE (mean squared prediction error) predictor of Y_0.
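
The claim that the conditional mean is the best MSPE predictor follows from a standard decomposition, sketched here for completeness (not transcribed from the slides):

```latex
% For any predictor g(Y^n), the cross term in the expansion has zero expectation, so
E\bigl[(Y_0 - g(Y^n))^2\bigr]
   = E\bigl[(Y_0 - E\{Y_0 \mid Y^n\})^2\bigr]
   + E\bigl[(E\{Y_0 \mid Y^n\} - g(Y^n))^2\bigr]
   \;\ge\; E\bigl[(Y_0 - E\{Y_0 \mid Y^n\})^2\bigr],
```

with equality exactly when g(Y^n) = E{Y_0 | Y^n}, so the conditional mean minimizes the mean squared prediction error.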

Hierarchical models have two stages. Notation:
– x ∈ R^d
– f_0 = f(x_0): known p×1 vector of regressors at the prediction site x_0
– F = (f_j(x_i)): known n×p matrix of regressors at the training sites
– β: unknown p×1 vector of regression coefficients
– R = (R(x_i − x_j)): known n×n matrix of correlations among the training data Y^n
– r_0 = (R(x_i − x_0)): known n×1 vector of correlations of Y_0 with Y^n
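
Written out in this notation, the two stages take the following form; the prior parameters b_0 and τ²V_0 are generic placeholders used only for illustration, not the slides' notation.

```latex
% Stage 1: given \beta, Y(\cdot) is a Gaussian process with
Y(x) = f^{\top}(x)\,\beta + Z(x),
\qquad
\operatorname{Cov}\bigl(Z(x_i), Z(x_j)\bigr) = \sigma_Z^{2}\, R(x_i - x_j).
% Stage 2: a prior for \beta, either a normal prior
\beta \sim N_p\bigl(b_0, \tau^{2} V_0\bigr),
% or its non-informative limit \pi(\beta) \propto 1.
```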

Predictive Distributions when σ_Z², R and r_0 are Known

Interesting features of (a) and (b):
– The non-informative prior is the limit of the normal prior as the prior variance tends to infinity.
– While the non-informative prior is not a proper distribution, the corresponding predictive distribution is proper.
– The same conditioning argument can be applied to derive the posterior mean for both the non-informative prior and the normal prior.

The mean and variance of the predictive distribution (mean):
– μ_{0|n}(x_0) and σ²_{0|n}(x_0) depend on x_0 only through the regression vector f_0 and the correlation vector r_0.
– μ_{0|n}(x_0) is a linear unbiased predictor of Y(x_0).
– The continuity and other smoothness properties of μ_{0|n}(x_0) are inherited from the correlation function R(·) and the regressors {f_j(·)}, j = 1, …, p.

– μ_{0|n}(x_0) depends on the parameters σ_Z² and τ² only through their ratio.
– μ_{0|n}(x_0) interpolates the training data: when x_0 = x_i, we have f_0 = f(x_i) and r_0^T R^{-1} = e_i^T, the i-th unit vector.
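
The slides do not transcribe the formula for μ_{0|n}(x_0); under the non-informative prior on β it reduces to the familiar BLUP / universal-kriging form (a standard expression supplied here):

```latex
\mu_{0|n}(x_0) = f_0^{\top}\hat{\beta} + r_0^{\top} R^{-1}\bigl(Y^n - F\hat{\beta}\bigr),
\qquad
\hat{\beta} = \bigl(F^{\top} R^{-1} F\bigr)^{-1} F^{\top} R^{-1}\, Y^n .
```

Setting x_0 = x_i and using r_0^T R^{-1} = e_i^T gives μ_{0|n}(x_i) = f^T(x_i)β̂ + (y_i − f^T(x_i)β̂) = y_i, which is exactly the interpolation property noted above.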

The mean and variance of the predictive distribution (variance):
– MSPE(μ_{0|n}(x_0)) = σ²_{0|n}(x_0).
– The variance of the posterior of Y(x_0) given Y^n should be 0 whenever x_0 = x_i, and indeed σ²_{0|n}(x_i) = 0.
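
The corresponding variance, again for the non-informative-β case, has the standard form below (supplied for reference, not transcribed from the slides); the final term reflects the extra uncertainty from estimating β.

```latex
\sigma^{2}_{0|n}(x_0) = \sigma_Z^{2}\Bigl[\, 1 - r_0^{\top} R^{-1} r_0
   + \bigl(f_0 - F^{\top} R^{-1} r_0\bigr)^{\top}
     \bigl(F^{\top} R^{-1} F\bigr)^{-1}
     \bigl(f_0 - F^{\top} R^{-1} r_0\bigr) \Bigr].
```

At a training point x_i, r_0 equals the i-th column of R, so r_0^T R^{-1} r_0 = R(0) = 1 and f_0 − F^T R^{-1} r_0 = 0, giving σ²_{0|n}(x_i) = 0 as stated above.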

Most important use of Theorem 4.1.1
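
The slide body itself was not carried over into the transcript; presumably the use in question is forming pointwise prediction intervals for Y(x_0) from the predictive mean and variance. A minimal numerical sketch of that, assuming a one-dimensional input, a constant mean f(x) = 1, a Gaussian correlation R(h) = exp(−θh²), and a known σ_Z²; the function names and toy data are illustrative, not from the slides.

```python
# Minimal sketch (illustrative, not from the slides): posterior mean/variance
# of Y(x0) for a constant-mean Gaussian-process model with known sigma_z^2,
# plus a 95% pointwise prediction interval from the normal predictive distribution.
import numpy as np

def gauss_corr(a, b, theta=25.0):
    """Correlation matrix R(x_i - x_j) between two sets of 1-d input sites."""
    return np.exp(-theta * (a[:, None] - b[None, :]) ** 2)

def predict(x0, x_train, y_train, sigma2_z=1.0, theta=25.0):
    """Posterior mean and variance of Y at the sites x0 (constant-mean model)."""
    n = len(x_train)
    F = np.ones((n, 1))                               # regressors at training sites
    f0 = np.ones((1, len(x0)))                        # regressors at prediction sites
    R = gauss_corr(x_train, x_train, theta) + 1e-10 * np.eye(n)
    r0 = gauss_corr(x_train, x0, theta)               # n x m correlations with x0
    Ri_y = np.linalg.solve(R, y_train)
    Ri_F = np.linalg.solve(R, F)
    Ri_r0 = np.linalg.solve(R, r0)
    FRF = F.T @ Ri_F                                  # p x p information matrix
    beta_hat = np.linalg.solve(FRF, F.T @ Ri_y)       # generalized least squares
    mean = (f0.T @ beta_hat).ravel() + r0.T @ np.linalg.solve(R, y_train - F @ beta_hat)
    u = f0 - F.T @ Ri_r0                              # accounts for estimating beta
    var = sigma2_z * (1.0 - np.sum(r0 * Ri_r0, axis=0)
                      + np.sum(u * np.linalg.solve(FRF, u), axis=0))
    return mean, np.maximum(var, 0.0)

# toy training data from a smooth function, and predictions at new sites
x_train = np.linspace(0.0, 1.0, 7)
y_train = np.sin(2.0 * np.pi * x_train)
x_new = np.linspace(0.05, 0.95, 5)
m, v = predict(x_new, x_train, y_train)
lower, upper = m - 1.96 * np.sqrt(v), m + 1.96 * np.sqrt(v)
for xi, mi, lo, hi in zip(x_new, m, lower, upper):
    print(f"x0={xi:.2f}  mean={mi:+.3f}  95% interval=({lo:+.3f}, {hi:+.3f})")
```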

Predictive Distributions when R and r_0 are Known
The posterior is a location-shifted and scaled univariate t distribution, with degrees of freedom that are enhanced when there is informative prior information for either β or σ_Z².

Degrees of freedom:
– Base value: ν = n − p.
– p additional degrees of freedom when the prior for β is informative.
– ν_0 additional degrees of freedom, contributed by the prior, when the prior for σ_Z² is informative.

Location shift:
– The same centering value as in the known-σ_Z² case (Theorem 4.1.1).
– The non-informative prior gives the BLUP.

Scale factor σ_i²(x_0) (compare with (4.1.6)):
– An estimate of the scale factor σ²_{0|n}(x_0).
– Q_i²/ν_i estimates σ_Z².
– Q_i² combines information about σ_Z² from the conditional distribution of Y^n given σ_Z² with information from the prior of σ_Z².
– σ_i²(x_i) = 0 when x_i is any of the training data points.

Predictive Distributions when the Correlation Parameters are Unknown
What if the correlations among the observations are unknown (R and r_0 are unknown)?
– Assume Y(·) has a Gaussian prior with correlation function R(·|ψ), where ψ is an unknown vector of parameters.
Two issues:
– The standard error of the plug-in predictor μ_{0|n}(x_0|ψ̂), obtained by substituting an estimate ψ̂ from MLE or REML.
– A Bayesian approach to the uncertainty in ψ, which models it by a prior distribution.
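
A minimal sketch of the first (plug-in) route, assuming a one-dimensional input, a constant mean, and a Gaussian correlation family R(h|ψ) = exp(−ψh²); the helper names and toy data are illustrative, not the slides' notation. The fitted ψ̂ would then be substituted into the predictor, e.g. passed as `theta` to the predict() sketch above.

```python
# Sketch of plug-in estimation: profile the Gaussian-process log-likelihood over
# the correlation parameter psi, with the constant mean and the process variance
# profiled out analytically, then maximize it numerically.
import numpy as np
from scipy.optimize import minimize_scalar

def neg_profile_loglik(log_psi, x, y):
    """Negative profile log-likelihood (up to an additive constant)."""
    n = len(x)
    psi = np.exp(log_psi)                                   # keep psi > 0
    R = np.exp(-psi * (x[:, None] - x[None, :]) ** 2) + 1e-10 * np.eye(n)
    L = np.linalg.cholesky(R)
    ones = np.ones(n)
    Ri_y = np.linalg.solve(R, y)
    Ri_1 = np.linalg.solve(R, ones)
    beta_hat = (ones @ Ri_y) / (ones @ Ri_1)                # GLS estimate of the mean
    resid = y - beta_hat
    sigma2_hat = (resid @ np.linalg.solve(R, resid)) / n    # MLE of sigma_z^2
    log_det_R = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (n * np.log(sigma2_hat) + log_det_R)

x_train = np.linspace(0.0, 1.0, 7)
y_train = np.sin(2.0 * np.pi * x_train)
res = minimize_scalar(neg_profile_loglik, bounds=(-4.0, 8.0),
                      args=(x_train, y_train), method="bounded")
psi_hat = np.exp(res.x)
print("plug-in (MLE) correlation parameter:", psi_hat)
```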

Prediction for Multiple-Response Models
– Several outputs are available from a computer experiment.
– Several codes are available for computing the same response (e.g., a fast code and a slow code).
– Competing responses.
– Several stochastic models for the joint response.
– These models are used to describe the optimal predictor for one of the several computed responses.

Modeling Multiple Outputs
– Z_i(·): marginally mean-zero stationary Gaussian stochastic processes with unknown variances and correlation functions R_i; stationarity of Z_i(·) implies that the correlation between Z_i(x_1) and Z_i(x_2) depends only on x_1 − x_2.
– Assume Cov(Z_i(x_1), Z_j(x_2)) = σ_i σ_j R_ij(x_1 − x_2), where R_ij(·) is the cross-correlation function of Z_i(·) and Z_j(·).
– Linear model: the global mean of the Y_i process, with f_i(·) known regression functions and β_i unknown regression parameters.

Selection of the correlation and cross-correlation functions is complicated.
– Reason: for any choice of input sites x_l^i, the multivariate normally distributed random vector (Z_1(x_1^1), …)^T must have a nonnegative definite covariance matrix.
– Solution: construct the Z_i(·) from a set of elementary processes (usually these processes are mutually independent).
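
One generic way such a construction guarantees a valid joint model (an illustration of the idea, not necessarily the exact construction on the slides): build each Z_i(·) as a linear combination of mutually independent, zero-mean elementary processes W_k(·) with covariance functions C_k.

```latex
Z_i(x) = \sum_{k} a_{ik}\, W_k(x)
\;\;\Longrightarrow\;\;
\operatorname{Cov}\bigl(Z_i(x_1), Z_j(x_2)\bigr) = \sum_{k} a_{ik}\, a_{jk}\, C_k(x_1 - x_2),
```

which is nonnegative definite for every choice of input sites by construction.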

Example by Kennedy and O’Hagan
– Y_i(x): prior for the i-th code level (i = m is the top-level code).
– The autoregressive model: Y_i(x) = ρ_{i−1} Y_{i−1}(x) + δ_i(x), i = 2, …, m. The output of each successively higher-level code i at x is the output of the less precise code i−1 at x plus a refinement δ_i(x).
– Cov(Y_i(x), Y_{i−1}(w) | Y_{i−1}(x)) = 0 for all w ≠ x: no additional second-order knowledge of code i at x can be obtained from the lower-level code i−1 once the value of code i−1 at x is known (a Markov property on the hierarchy of codes).
– When an application has no natural hierarchy of computer codes, we need to find something better.
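
To see what the autoregressive structure implies, assume Y_{i−1}(·) and the refinement δ_i(·) are independent Gaussian processes; the induced (cross-)covariances then follow directly (a short derivation added here for clarity):

```latex
\operatorname{Cov}\bigl(Y_i(x_1), Y_i(x_2)\bigr)
   = \rho_{i-1}^{2}\,\operatorname{Cov}\bigl(Y_{i-1}(x_1), Y_{i-1}(x_2)\bigr)
   + \operatorname{Cov}\bigl(\delta_i(x_1), \delta_i(x_2)\bigr),
\qquad
\operatorname{Cov}\bigl(Y_i(x_1), Y_{i-1}(x_2)\bigr)
   = \rho_{i-1}\,\operatorname{Cov}\bigl(Y_{i-1}(x_1), Y_{i-1}(x_2)\bigr).
```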

A more reasonable model
– Each constraint function is the objective function plus a refinement: Y_i(x) = ρ_i Y_1(x) + δ_i(x), i = 2, …, m+1.
– Ver Hoef and Barry: models from the environmental sciences that include an unknown smooth surface plus a random measurement error, built as moving averages over white noise processes.

The Morris and Mitchell model
– Prior information about y(x) is specified by a Gaussian process Y(·).
– Prior information about the partial derivatives y^{(j)}(x) is obtained by considering the “derivative” processes of Y(·): Y_1(·) = Y(·), Y_2(·) = Y^{(1)}(·), …, Y_{1+m}(·) = Y^{(m)}(·).
– This gives a natural prior for y^{(j)}(x); the covariances between Y(x_1) and Y^{(j)}(x_2), and between Y^{(i)}(x_1) and Y^{(j)}(x_2), are given below.
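
The covariance formulas themselves were not carried over into the transcript; for a zero-mean stationary Gaussian process with Cov(Y(x_1), Y(x_2)) = σ² R(x_1 − x_2), the standard expressions (supplied here under that assumption) are:

```latex
\operatorname{Cov}\bigl(Y(x_1), Y^{(j)}(x_2)\bigr)
   = -\,\sigma^{2}\,\frac{\partial R(h)}{\partial h_j}\Big|_{h = x_1 - x_2},
\qquad
\operatorname{Cov}\bigl(Y^{(i)}(x_1), Y^{(j)}(x_2)\bigr)
   = -\,\sigma^{2}\,\frac{\partial^{2} R(h)}{\partial h_i\,\partial h_j}\Big|_{h = x_1 - x_2}.
```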

Optimal Predictors for Multiple Outputs
The best MSPE predictor of Y_0 based on the training data is the conditional mean E{Y_0 | Y_1^{n_1}, …, Y_m^{n_m}}, where Y_0 = Y_1(x_0), Y_i^{n_i} = (Y_i(x_1^i), …, Y_i(x_{n_i}^i))^T, and y_i^{n_i} is the observed value of Y_i^{n_i} for i = 1, …, m.

The joint distribution of Y_0 and the training data (Y_1^{n_1}, …, Y_m^{n_m}) is multivariate normal.
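
The conditional expectation therefore has the usual multivariate-normal form; writing μ_0, μ_Y for the means and Σ_{00}, Σ_{0Y}, Σ_{YY} for the relevant covariance blocks (generic notation introduced here for illustration), a sketch is:

```latex
E\{\,Y_0 \mid Y = y\,\} = \mu_0 + \Sigma_{0Y}\, \Sigma_{YY}^{-1}\,\bigl(y - \mu_Y\bigr),
\qquad
\operatorname{Var}\{\,Y_0 \mid Y = y\,\} = \Sigma_{00} - \Sigma_{0Y}\, \Sigma_{YY}^{-1}\, \Sigma_{Y0},
```

where Y stacks the training vectors Y_1^{n_1}, …, Y_m^{n_m}.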

Conditional expectation:
– In practice the exact formula is of limited direct use, because it requires knowledge of the marginal correlation functions, the cross-correlation functions, and the ratios of all the process variances.
– Empirical versions are of practical use: assume each of the correlation matrices R_i and cross-correlation matrices R_ij is known up to a vector of parameters, and estimate those parameters using MLE or REML.

Example 1
– The 14-point training design is space-filling: it allows us to learn about the response over the entire input space.
– Compare two models: the predictor of y(·) based on y(·) alone, and the predictor of y(·) based on (y(·), y^{(1)}(·), y^{(2)}(·)).
– The second is both a visually better fit and has a 24% smaller ERMSPE.

Thank you!