Statistical model building Marian Scott Dept of Statistics, University of Glasgow Glasgow, Sept 2007.


Outline of presentation: statistical models (what are the principles? describing variation, empiricism); fitting models (calibration); testing models (validation or verification); quantifying and apportioning variation in model and data; stochastic and deterministic models; model choice.

All models are wrong but some are useful (and some are more useful than others). (All data are useful, but some are more varied than others.) A quote from the famous statistician George Box.

Step 1: why do you want to build a model? What is your objective? What data are available, and how were they collected? Is there a natural response or outcome, and other explanatory variables or covariates?

Modelling objectives: explore relationships; make predictions; improve understanding; test hypotheses.

Conceptual system (diagram): data and policy supply inputs and parameters to the model; model results feed back to the conceptual system.

Why model? Purposes of modelling: describe/summarise; predict (what if...); test hypotheses; manage. What is a good model? Simple, realistic, efficient, reliable, valid.

Value judgements. Different criteria are of unequal importance. A key comparison is often against observational data, but such comparisons must include the model uncertainties and the uncertainties on the observational data (touched on later).

Questions we ask about models: Is the model valid? Are the assumptions reasonable? Does the model make sense based on best scientific knowledge? Is the model credible? Do the model predictions match the observed data? How uncertain are the results?

Statistical models: always include an error term to describe random variation; empirical; descriptive and predictive. The model-building goal is the simplest model which is adequate; used for inference.

Physical/process-based models: use best scientific knowledge; may not explicitly include error, or any random variation; descriptive and predictive. The goal may not be the simplest model; not used for inference.

Models. Mathematical (deterministic/process-based) models tend to be complex and to ignore important sources of uncertainty. Statistical models tend to be empirical and to ignore much of the biological/physical/chemical knowledge.

Stages in modelling. Design and conceptualisation: visualisation of structure; identification of processes (variable selection); choice of parameterisation. Fitting and assessment: parameter estimation (calibration); goodness of fit.

A visual model: atmospheric flux of pollutants. Atmospheric pollutants are dispersed over Europe. In the 1970s considerable environmental damage was caused by acid rain; international action led to the development of the EMEP programme, models and measurements.

The mathematical flux model. L: Monin-Obukhov length; u*: friction velocity of the wind; c_p: constant (=1.01); rho: constant (=1246 g m^-3); T: air temperature (in kelvin); k: constant (=0.41); g: gravitational acceleration (=9.81 m/s^2); H: the rate of heat transfer per unit area; height: the current height at which measurements are taken; d: zero-plane displacement.

What would a statistician do if confronted with this problem? Ask what the objective of modelling is; look at the data, and data quality; try to understand the measurement processes; think about how the scientific knowledge (the conceptual model) relates to what we have measured; think about uncertainty.

Step 2: understand your data. Study your data and learn its properties; the tools are graphical.

Measured atmospheric fluxes for 1997. The measured fluxes for 1997 are still noisy. Is there a statistical signal, and at what timescale?

The monitoring networks. EMEP: 15 stations in the United Kingdom and 4 in the Republic of Ireland, from 1978 (aggregated to monthly means), measuring sulphur and nitrogen species (SO2, SO4, NO2, NO3, NH4, HNO3+NO3, NH3+NH4). UK National Air Quality Information Archive: 8 stations in the United Kingdom corresponding to some of the EMEP stations, from 1983 (hourly data, subsequently aggregated to monthly day and night means), measuring ozone (O3).

GB02 Eskdalemuir GB03 Goonhilly GB04 Stoke Ferry GB05 Ludlow GB06 Lough Navar GB07 Barcombe Mills GB13 Yarner Wood GB14 High Muffles GB15 Strath Vaich Dam GB16 Glen Dye GB36 Harwell GB37 Ladybower GB38 Lullington Heath GB43 Narberth GB45 Wicken Fen IE01 Valentia Obs. IE02 Turlough Hill IE03 The Burren IE04 Ridge of Capard

The data. What evidence is there of a trend in the atmospheric concentrations? Outliers, missing values, discontinuities.

Loch Leven

Key properties of any measurement. Accuracy refers to the deviation of the measurement from the true value. Precision refers to the variation in a series of replicate measurements (obtained under identical conditions).

Accuracy and precision (illustration: accurate but imprecise versus inaccurate but precise).

Evaluation of precision. Analysis of the instrumentation method used to make a single measurement, and the propagation of any errors. Repeat measurements (true replicates), using homogeneous material, repeated subsampling, etc. Precision is linked to variance (standard deviation).
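The repeat-measurement route to precision can be sketched in a few lines. This is an illustrative example, not from the slides; the replicate values are made up.

```python
# Illustrative sketch: quantifying precision as the standard deviation of
# replicate measurements made under identical conditions.
import statistics

# Hypothetical replicate measurements of the same homogeneous material
replicates = [10.1, 9.9, 10.2, 10.0, 9.8]

mean = statistics.mean(replicates)      # best single estimate of the quantity
sd = statistics.stdev(replicates)       # precision (sample standard deviation)
se = sd / len(replicates) ** 0.5        # standard error of the mean

print(f"mean = {mean:.2f}, sd = {sd:.3f}, se = {se:.3f}")
```

The standard deviation summarises the spread of the replicates (precision); it says nothing about accuracy, which needs a reference value.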

The nature of measurement. All measurement is subject to uncertainty. Analytical uncertainty reflects the fact that every time a measurement is made (under identical conditions), the result is different. Sampling uncertainty represents the natural variation in the organism within the environment.

The error and uncertainty in a measurement. The error is a single value, which represents the difference between the measured value and the true value. The uncertainty is a range of values, and describes the errors which might have been observed were the measurement repeated under IDENTICAL conditions. Error (and uncertainty) includes a combination of variance and bias.

Effect of uncertainties. Lack of observations contributes to uncertainties in input data and uncertainty in model parameter values. Conflicting evidence contributes to uncertainty about model form and uncertainty about the validity of assumptions.

Step 3: build the statistical model. Outcomes or responses are sometimes referred to as dependent variables. Causes or explanations are the conditions or environment within which the outcomes or responses have been observed; they are sometimes referred to as independent variables, but are more commonly known as covariates.

Statistical models. In experiments, many of the covariates have been determined by the experimenter, but some may be aspects that the experimenter has no control over yet are relevant to the outcomes or responses. In observational studies, the covariates are usually not under the control of the experimenter but are recorded as possible explanations of the outcomes or responses. We may not know which covariates are important, and should recognise that we may build several models before making the final choice.

Specifying a statistical model. Models specify the way in which outcomes and causes link together, e.g. Metabolite = Temperature. The = sign does not indicate equality in a mathematical sense, and there should be an additional item on the right-hand side giving a formula: Metabolite = Temperature + Error.

Specifying a statistical model. Metabolite = Temperature + Error. In mathematical terms, there will be some unknown parameters to be estimated, and some assumptions will be made about the error distribution: Metabolite = α + β·Temperature + ε, with ε ~ N(0, σ²), appropriate perhaps? α, β and σ are model parameters and are unknown.
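The model on this slide can be simulated directly: a systematic part plus a normally distributed error term. The parameter values and temperatures below are made up purely for illustration.

```python
# Minimal sketch of the slide's model:
# Metabolite = alpha + beta * temperature + error, error ~ N(0, sigma^2).
# alpha, beta, sigma are unknown in practice; here they are fixed to simulate.
import random

random.seed(1)

alpha, beta, sigma = 2.0, 0.5, 0.3
temperature = [5, 10, 15, 20, 25]

# Each observation = systematic part + a fresh random error draw
metabolite = [alpha + beta * t + random.gauss(0, sigma) for t in temperature]
print(metabolite)
```

Running this twice without re-seeding gives different data from the same model, which is exactly what the error term expresses.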

Model calibration. Statisticians tend to talk about model fitting; calibration means something else to them. Methods: least squares or maximum likelihood. Least squares: find the parameter estimates that minimise the sum of squares (SS), SS = Σ(observed y − model fitted y)². Maximum likelihood: find the parameter estimates that maximise the likelihood of the data.
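For a straight line, the least-squares estimates minimising SS have a closed form. A sketch with made-up data (the standard OLS formulas, not anything specific to the slides):

```python
# Least-squares fitting of a straight line: choose intercept and slope that
# minimise SS = sum (observed y - fitted y)^2. Data are illustrative.
x = [5.0, 10.0, 15.0, 20.0, 25.0]
y = [4.4, 7.1, 9.4, 12.2, 14.4]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# OLS estimates: slope = Sxy / Sxx, intercept = ybar - slope * xbar
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = ybar - slope * xbar

fitted = [intercept + slope * xi for xi in x]
ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}, SS = {ss:.4f}")
```

Any other choice of intercept and slope gives a larger SS for these data; that is the defining property of the least-squares fit.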

Calibration: using the data. A good idea, if possible, is to have a training set and a test set of data: split the data (e.g. 90%/10%). Fit the model using the training set, evaluate the model using the test set. Why? Because if we assess how well the model performs on the data that were used to fit it, then we are being over-optimistic.
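The 90%/10% split can be sketched as follows. The data are simulated and the split fraction is the one suggested on the slide; everything else is illustrative.

```python
# Sketch of the training/test split: fit on ~90% of the data, then judge the
# model on the held-out 10% only.
import random

random.seed(42)

# Hypothetical (x, y) pairs from a roughly linear relationship
data = [(x, 2.0 + 0.5 * x + random.gauss(0, 0.3)) for x in range(50)]

random.shuffle(data)
cut = int(0.9 * len(data))
train, test = data[:cut], data[cut:]

# Fit a straight line by least squares on the training set only
n = len(train)
xbar = sum(x for x, _ in train) / n
ybar = sum(y for _, y in train) / n
slope = (sum((x - xbar) * (y - ybar) for x, y in train)
         / sum((x - xbar) ** 2 for x, _ in train))
intercept = ybar - slope * xbar

# Evaluate on the unseen test set: mean squared error of prediction
mse = sum((y - (intercept + slope * x)) ** 2 for x, y in test) / len(test)
print(f"test MSE = {mse:.3f}")
```

The test MSE is an honest estimate of predictive performance precisely because the test points played no role in the fit.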

How good is my statistical model? What criteria do we use to judge the value of our model? This may depend on what the model was built to do. Obvious ones: closeness to the observed data; goodness of predictions at previously unobserved covariate values; % variation in response explained (R²).
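The last criterion is easy to compute from observed and fitted values. A small sketch with made-up numbers:

```python
# Percentage of variation explained: R^2 = 1 - SS_res / SS_tot.
observed = [4.4, 7.1, 9.4, 12.2, 14.4]
fitted = [4.5, 7.0, 9.5, 12.0, 14.5]   # hypothetical model output

ybar = sum(observed) / len(observed)
ss_tot = sum((y - ybar) ** 2 for y in observed)               # total variation
ss_res = sum((y - f) ** 2 for y, f in zip(observed, fitted))  # unexplained part
r_squared = 1 - ss_res / ss_tot

print(f"R^2 = {r_squared:.4f}")
```

A value near 1 means the model accounts for almost all the variation in the response; a value near 0 means it does little better than the mean.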

Model validation. What is validation? Fit the model using the training set, evaluate the model using the test set. Why? Because if we assess how well the model performs on the data that were used to fit it, then we are being over-optimistic. Assessment of goodness of fit: residual sums of squares, mean square error for prediction.

Model validation. Splitting the data set: is it possible? Cross-validation: leave-one-out, leave-k-out; split at random, with a small % kept aside for testing. Other methods: bootstrap and jackknife.
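Leave-one-out cross-validation can be sketched for the straight-line model: each point is predicted from a fit to the remaining n-1 points. The data and the helper `ols_fit` are illustrative, not from the slides.

```python
# Sketch of leave-one-out cross-validation for a straight-line model.
def ols_fit(pairs):
    """Least-squares intercept and slope for a list of (x, y) pairs."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in pairs)
             / sum((x - xbar) ** 2 for x, _ in pairs))
    return ybar - slope * xbar, slope

data = [(1, 2.1), (2, 2.9), (3, 4.2), (4, 5.0), (5, 5.8), (6, 7.1)]

squared_errors = []
for i, (x_out, y_out) in enumerate(data):
    rest = data[:i] + data[i + 1:]      # leave one observation out
    a, b = ols_fit(rest)                # fit on the remaining points
    squared_errors.append((y_out - (a + b * x_out)) ** 2)

cv_mse = sum(squared_errors) / len(squared_errors)
print(f"leave-one-out MSE = {cv_mse:.4f}")
```

Leave-k-out works the same way with blocks of k points held out at a time; the bootstrap instead resamples the data with replacement.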

An aside: how well should models agree? Six physical-deterministic ocean models (process-based: transport, sedimentary processes, numerical solution scheme, grid size) were used to predict the dispersal of a pollutant. The results were to be used to determine a remediation policy for an illegal dumping of radioactive waste: a "what if" scenario investigation. The models differ in their detail and also in their spatial scale.

Predictions of levels of cobalt-60. Different models, same input data. Predictions vary by considerable margins. The magnitude of variation is a function of the spatial distribution of sites.

Model ensembles. It is becoming increasingly common in climate and meteorology to make many model runs, with different models and different starting conditions, and then to average the results. Why would we do this?
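The averaging step is simple to sketch. The three "models" below are toy prediction functions standing in for different simulators or starting conditions; all numbers are made up.

```python
# Sketch of a simple model ensemble: run several models and average their
# outputs; the spread of predictions indicates model disagreement.
models = [
    lambda t: 1.0 + 0.50 * t,   # hypothetical model A
    lambda t: 0.8 + 0.55 * t,   # hypothetical model B
    lambda t: 1.2 + 0.45 * t,   # hypothetical model C
]

t = 10.0
predictions = [m(t) for m in models]
ensemble_mean = sum(predictions) / len(predictions)
spread = max(predictions) - min(predictions)  # crude disagreement measure

print(f"ensemble mean = {ensemble_mean:.2f}, spread = {spread:.2f}")
```

Averaging tends to cancel the idiosyncratic errors of individual models, while the spread gives a rough handle on structural uncertainty.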

The statistical approach to model building and selection. In a regression situation we may have many potential explanatory variables; how do we choose which to include in the final model? The answer may depend on the purpose, and on how many explanatory variables there are. Identify variables that can be omitted on statistical grounds (no evidence of effect) (see the regression sessions with Adrian for testing and CI approaches): is an effect statistically significant?

The statistical approach to model building and selection. Automatic selection procedures can be useful but also potentially dangerous (e.g. stepwise regression, best-subset regression). They often identify the "best" model under a defined criterion within a family of models (such as smallest residual sum of squares), but this "best" model could in an absolute sense be poor.

The statistical approach to model building and selection. Other statistical criteria exist for model choice (AIC, DIC, BIC), based on likelihood approaches; these can be used to compare non-nested models (i.e. the parameter set of one model is not contained within the parameter set of the larger model). We need to be careful of dredging for significance, and remember that statistical significance is not always equal to practical importance.

Information criterion. In the general case, AIC = 2k − 2ln(L), where k is the number of parameters and L is the likelihood function. If we assume that the model errors are normally and independently distributed, let n be the number of observations and RSS the residual sum of squares; then AIC becomes AIC = 2k + n·ln(RSS/n).
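The normal-errors form of AIC is one line of arithmetic. In this sketch the sample size and RSS values for two competing models are made up for illustration.

```python
# Sketch of AIC under normal errors: AIC = 2k + n * ln(RSS / n).
import math

def aic(k, n, rss):
    """AIC for a model with k parameters fitted to n points with residual SS rss."""
    return 2 * k + n * math.log(rss / n)

n = 50
aic_small = aic(k=2, n=n, rss=40.0)   # simpler model, larger RSS
aic_large = aic(k=5, n=n, rss=35.0)   # extra parameters buy a smaller RSS

# Lower AIC is preferred; the 2k term penalises extra parameters
best = "small" if aic_small < aic_large else "large"
print(f"AIC small = {aic_small:.2f}, AIC large = {aic_large:.2f}, prefer {best}")
```

Here the drop in RSS outweighs the penalty for three extra parameters, so the larger model wins; with a smaller RSS improvement the penalty would reverse the choice.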

AIC. Increasing the number of free parameters to be estimated improves the goodness of fit. Hence AIC not only rewards goodness of fit, but also includes a penalty that is an increasing function of the number of estimated parameters. This penalty discourages overfitting. The preferred model is the one with the lowest AIC value. The AIC methodology attempts to find the model that best explains the data with a minimum of free parameters.

Blending the statistical modelling approach with deterministic models. A relatively new area (at least for statisticians), phrased in a Bayesian framework (see the later session on Bayesian methods). It makes use of data (very important) and data modelling, and is still at the research stage (probably most used in climatology).

In summary: model building is iterative, and should combine statistical skills and scientific knowledge. Think about your objectives; think about the data. Model selection: many different approaches. Uncertainty is a factor at all stages and should be considered.