Course round-up: Statistical model building. Marian Scott, University of Glasgow. Glasgow, Aug 2012.

Outline of presentation: Statistical models: what are the principles (describing variation, empiricism)? Knowing your data. Fitting models (calibration). Testing models (validation or verification) and, in general, inference: what matters? Quantifying and apportioning variation in model and data. Stochastic and deterministic models. Introduction to uncertainty and sensitivity analysis. Resources.

Step 1: Why do you want to build a model, and what is your objective? What data are available and how were they collected? Is there a natural response or outcome, and are there other explanatory variables or covariates?

Modelling objectives: explore relationships; make predictions; improve understanding; test hypotheses.

[Diagram: Conceptual system, Data, Model and Policy, linked by inputs & parameters, model results and feedbacks]

Why model? Purposes of modelling: – Describe/summarise – Predict - what if…. – Test hypotheses – Manage What is a good model? – Simple, realistic, efficient, reliable, valid

Value judgements: Different criteria are of unequal importance. The key comparison is often with observational data (RSS, AIC, ...), but such comparisons must include the model uncertainties and the uncertainties on the observational data.

Questions we ask about models Is the model valid? Are the assumptions reasonable? Does the model make sense based on best scientific knowledge? Is the model credible? Do the model predictions match the observed data? How uncertain are the results?

Stages in modelling Design and conceptualisation: – Visualisation of structure – Identification of processes – Choice of parameterisation Fitting and assessment – parameter estimation (calibration) – Goodness of fit

A visual model: atmospheric flux of pollutants. Atmospheric pollutants are dispersed over Europe. In the 1970s, considerable environmental damage was caused by acid rain. International action followed: development of the EMEP programme, models and measurements.

The mathematical flux model. $L$: Monin-Obukhov length; $u_*$: friction velocity of wind; $c_p$: constant (= 1.01); $\rho$: constant (= 1246 g m$^{-3}$); $T$: air temperature (in kelvin); $k$: constant (= 0.41); $g$: gravitational acceleration (= 9.81 m/s$^2$); $H$: the rate of heat transfer per unit area; gasht: current height at which measurements are taken; $d$: zero-plane displacement.
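
The formula itself did not survive in this transcript. For orientation only, the usual textbook definition of the Monin-Obukhov length in terms of the symbols listed, with $\rho$ taken to be the air density (the constant 1246 g m$^{-3}$), is sketched below. Whether the slide used exactly this form, for example $T$ rather than a virtual potential temperature, is an assumption.

```latex
L = -\frac{u_*^{3}\,\rho\,c_p\,T}{k\,g\,H}
```

The measurement height and the zero-plane displacement $d$ do not appear in this expression; they would normally enter the flux calculation through the height above the displacement plane.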

What would a statistician do if confronted with this problem? Look at the data; understand the measurement processes; think about how the scientific knowledge and the conceptual model relate to what we have measured.

Step 2: understand your data. Study your data and learn its properties. Tools: graphical.

The data: variation. Soil or sediment samples taken side by side, samples from different parts of the same plant, or samples from different animals in the same environment, exhibit different activity densities of a given radionuclide. The distribution of values observed will provide an estimate of the variability inherent in the population of samples that, theoretically, could be taken.

Variation. [Figure: activity (log10) of particles (Bq Cs-137) with Normal (Gaussian) density superimposed]

Measured atmospheric fluxes for 1997: the measured fluxes for 1997 are still noisy. Is there a statistical signal, and at what timescale?

Key properties of any measurement Accuracy refers to the deviation of the measurement from the true value Precision refers to the variation in a series of replicate measurements (obtained under identical conditions)

Accuracy and precision. [Figure: panel labels Accurate, Imprecise, Inaccurate, Precise]

Evaluation of precision Analysis of the instrumentation method to make a single measurement, and the propagation of any errors Repeat measurements (true replicates) – using homogeneous material, repeatedly subsampling, etc…. Precision is linked to Variance (standard deviation)

The nature of measurement All measurement is subject to uncertainty Analytical uncertainty reflects that every time a measurement is made (under identical conditions), the result is different. Sampling uncertainty represents the natural variation in the organism within the environment.

The error and uncertainty in a measurement The error is a single value, which represents the difference between the measured value and the true value The uncertainty is a range of values, and describes the errors which might have been observed were the measurement repeated under IDENTICAL conditions Error (and uncertainty) includes a combination of variance and bias
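
The distinction between bias (accuracy) and variance (precision) can be illustrated with simulated replicate measurements. This is a minimal sketch: the true value, the instrument bias and the analytical standard deviation are all hypothetical numbers chosen for illustration, and the errors are assumed to be Normal.

```python
import random
import statistics

random.seed(1)

TRUE_VALUE = 10.0  # hypothetical true value of the quantity
BIAS = 0.5         # hypothetical systematic offset of the instrument
SD = 0.2           # hypothetical analytical standard deviation

# Replicate measurements made under identical conditions
replicates = [TRUE_VALUE + BIAS + random.gauss(0.0, SD) for _ in range(1000)]

# Each error is a single value: measured minus true
errors = [m - TRUE_VALUE for m in replicates]

# The mean error estimates the bias (accuracy);
# the spread of the replicates estimates the precision
estimated_bias = statistics.mean(errors)
estimated_sd = statistics.stdev(replicates)

print(f"estimated bias: {estimated_bias:.3f}")
print(f"estimated standard deviation: {estimated_sd:.3f}")
```

In practice the true value is unknown, so bias can only be assessed against a reference material or an independent method, whereas the standard deviation can be estimated from replicates alone.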

Effect of uncertainties: Lack of observations contributes to uncertainties in input data and uncertainty in model parameter values. Conflicting evidence contributes to uncertainty about model form and uncertainty about the validity of assumptions.

Data properties: Nature and distribution of the data: continuous, counts, ...; Normal, exponential, Poisson; a transformation may be needed. Missing data, outliers, limits of detection. Use pictures to explore.

Step 3: build the statistical model. Outcomes or responses. Causes or explanations: these are the conditions or environment within which the outcomes or responses have been observed (the covariates). This has been the focus of much of the week, whether a linear model, a smooth flexible model, a time series model, a Bayesian model, ...

Statistical models In experiments many of the covariates have been determined by the experimenter but some may be aspects that the experimenter has no control over but that are relevant to the outcomes or responses. In observational studies, these are usually not under the control of the experimenter but are recorded as possible explanations of the outcomes or responses. Some of your experiments are in the lab where you control factors such as temperature, concentration etc, some are in the world laboratory where you observe...

Specifying a statistical model: Models specify the way in which outcomes and causes link together, e.g. Metabolite ~ Temperature. The ~ sign does not indicate equality in a mathematical sense, and there should be an additional item on the right-hand side, giving the formula Metabolite ~ Temperature + Error. Think also about possible random effect models (see Mark's section).

Specifying a statistical model: In mathematical terms, there will be some unknown parameters to be estimated, and some assumptions will be made about the error distribution: $\text{Metabolite} = \alpha + \beta \times \text{temperature} + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2)$. All statistical models make assumptions and we need to check them.

Are you a Bayesian? What does that mean? It means you have prior information (belief) that you want to include in your statistical model. You need to find a way of capturing this in the prior distribution. The model output is then a posterior distribution on the quantity of interest, which automatically incorporates uncertainty.
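
As a small illustration of prior, data and posterior, the sketch below uses the simplest conjugate case: a Normal prior on an unknown mean with a known measurement standard deviation. The prior values, the measurement standard deviation and the simulated data are all hypothetical; real problems rarely have a known variance and usually need simulation methods instead.

```python
import random
import statistics

random.seed(6)

# Hypothetical prior belief about a mean concentration: Normal(prior_mean, prior_sd^2)
prior_mean, prior_sd = 20.0, 5.0
# Assumed known measurement standard deviation
sigma = 2.0

# Hypothetical data: 10 measurements around a true mean of 24
data = [random.gauss(24.0, sigma) for _ in range(10)]
n = len(data)
y_bar = statistics.mean(data)

# Conjugate Normal update: precisions (1 / variance) add,
# and the posterior mean is a precision-weighted average
prior_precision = 1.0 / prior_sd ** 2
data_precision = n / sigma ** 2
post_precision = prior_precision + data_precision
post_mean = (prior_precision * prior_mean + data_precision * y_bar) / post_precision
post_sd = post_precision ** -0.5

print(f"sample mean: {y_bar:.2f}")
print(f"posterior: mean {post_mean:.2f}, sd {post_sd:.2f}")
```

The posterior standard deviation is the uncertainty statement that comes with the estimate; it shrinks as more data are added.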

Model calibration: Statisticians tend to talk about model fitting; calibration means something else to them. Methods: least squares or maximum likelihood. Least squares: find the parameter estimates that minimise the sum of squares (SS), $SS = \sum (\text{observed } y - \text{model fitted } y)^2$. Maximum likelihood: find the parameter estimates that maximise the likelihood of the data.
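
A minimal sketch of least squares fitting for a straight-line model of the Metabolite ~ Temperature kind follows. The data are simulated, and the intercept 2.0, slope 0.5 and error standard deviation 1.0 are hypothetical values; the function names are illustrative only.

```python
import random

random.seed(2)

# Hypothetical data: Metabolite = 2.0 + 0.5 * Temperature + Normal(0, 1) error
temperature = [float(t) for t in range(5, 35)]
metabolite = [2.0 + 0.5 * t + random.gauss(0.0, 1.0) for t in temperature]


def sum_of_squares(intercept, slope, x, y):
    """SS = sum over observations of (observed y - model fitted y)^2."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


def least_squares(x, y):
    """Closed-form estimates that minimise SS for a straight line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


a_hat, b_hat = least_squares(temperature, metabolite)
ss_min = sum_of_squares(a_hat, b_hat, temperature, metabolite)

print(f"intercept {a_hat:.2f}, slope {b_hat:.3f}, minimum SS {ss_min:.1f}")
```

When the errors are independent and Normal with constant variance, the maximum likelihood estimates of the intercept and slope coincide with these least squares estimates.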

Calibration, using the data: It is a good idea, if possible, to have a training set and a test set of data: split the data (90%/10%). Fit the model using the training set, then evaluate the model using the test set. Why? Because if we assess how well the model performs on the data that were used to fit it, we are being over-optimistic. Other methods: bootstrap and jackknife.
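
The 90%/10% split can be sketched as follows. The data are simulated from a hypothetical straight-line relationship, and the helper names `fit_line` and `mean_squared_error` are illustrative rather than taken from any library.

```python
import random

random.seed(3)

# Hypothetical data from a straight-line relationship with Normal error
x = [random.uniform(0.0, 10.0) for _ in range(200)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 1.0) for xi in x]

# Random 90%/10% split into training and test sets
indices = list(range(len(x)))
random.shuffle(indices)
n_train = int(0.9 * len(indices))
train_idx, test_idx = indices[:n_train], indices[n_train:]


def fit_line(xs, ys):
    """Least squares intercept and slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def mean_squared_error(intercept, slope, xs, ys):
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(xs, ys)) / len(xs)


# Fit on the training set only, then evaluate on both sets
x_train, y_train = [x[i] for i in train_idx], [y[i] for i in train_idx]
x_test, y_test = [x[i] for i in test_idx], [y[i] for i in test_idx]

intercept, slope = fit_line(x_train, y_train)
train_mse = mean_squared_error(intercept, slope, x_train, y_train)
test_mse = mean_squared_error(intercept, slope, x_test, y_test)

print(f"training MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```

The test-set error is the more honest summary of predictive performance, although with only 10% of the data held out it can itself be quite variable.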

Model validation: What is validation? Fit the model using the training set, then evaluate the model using the test set. Why? Because if we assess how well the model performs on the data that were used to fit it, we are being over-optimistic. Other methods: bootstrap and jackknife.
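
The bootstrap mentioned above resamples the observed data with replacement to see how much an estimate varies. The sketch below estimates the standard error of a sample mean this way and compares it with the usual formula; the sample is simulated and the number of resamples (2000) is an arbitrary choice.

```python
import random
import statistics

random.seed(4)

# Hypothetical sample of 50 observations
sample = [random.gauss(5.0, 2.0) for _ in range(50)]


def bootstrap_se(data, statistic, n_resamples=2000):
    """Standard deviation of the statistic over resamples drawn with replacement."""
    estimates = []
    for _ in range(n_resamples):
        resample = random.choices(data, k=len(data))
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)


se_boot = bootstrap_se(sample, statistics.mean)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5

print(f"bootstrap SE {se_boot:.3f}, formula SE {se_formula:.3f}")
```

For the mean the two answers should be close; the value of the bootstrap is that the same function works for statistics with no simple standard error formula, such as a median.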

Which variables to include? Use your science knowledge. Use pictures to look for patterns. Maybe use some of the more algorithmic ways to select the set (stepwise, BSR, ...). How to compare models? Nested models (ANOVA, likelihood ratio test).
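
For nested linear models, the ANOVA comparison reduces to an F statistic built from the residual sums of squares of the two fits. The sketch below compares an intercept-only model with a straight line on simulated, hypothetical data; it computes only the statistic, and the p-value would come from an F distribution with (1, n - 2) degrees of freedom.

```python
import random

random.seed(7)

# Hypothetical data with a real slope
x = [float(i) for i in range(40)]
y = [3.0 + 0.4 * xi + random.gauss(0.0, 2.0) for xi in x]
n = len(x)

# Model 0 (smaller): intercept only
y_bar = sum(y) / n
rss0 = sum((yi - y_bar) ** 2 for yi in y)

# Model 1 (larger): intercept + slope, fitted by least squares
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
rss1 = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# F statistic: drop in RSS per extra parameter,
# relative to the residual variance of the larger model
p0, p1 = 1, 2
f_stat = ((rss0 - rss1) / (p1 - p0)) / (rss1 / (n - p1))

print(f"RSS0 {rss0:.1f}, RSS1 {rss1:.1f}, F {f_stat:.1f}")
```

A large F says the extra term explains more variation than would be expected by chance; it does not say the larger model is adequate, so the residuals still need checking.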

More than one model: how well should models agree? Six ocean models (process based: transport, sedimentary processes, numerical solution scheme, grid size) were used to predict the dispersal of a pollutant. The results were to be used to determine a remediation policy for an illegal dumping of radioactive waste. This is the "what if" scenario investigation. The models differ in their detail and also in their spatial scale.

Predictions of levels of cobalt-60: Different models, same input data. Predictions vary by considerable margins. The magnitude of the variation is a function of the spatial distribution of sites. This is another component of uncertainty.

Statistical models and process models: Loch Leven, modelling nutrients. The process model is based on differential equations; the statistical model is based on empirically determined relationships. We can connect the two, but this is a research area rather than a routine...

Uncertainty and sensitivity analysis

Uncertainty (in variables, models, parameters, data): what are uncertainty and sensitivity analyses?

Effect of uncertainties: Lack of observations contributes to uncertainties in input data and uncertainty in model parameter values. Conflicting evidence contributes to uncertainty about model form and uncertainty about the validity of assumptions.

Modelling tools: SA/UA. Sensitivity analysis: determining the amount and kind of change produced in the model predictions by a change in a model parameter. Uncertainty analysis: an assessment/quantification of the uncertainties associated with the parameters, the data and the model structure.

Modellers conduct SA to determine (a) if a model resembles the system or processes under study, (b) the factors that mostly contribute to the output variability, (c) the model parameters (or parts of the model itself) that are insignificant, (d) if there is some region in the space of input factors for which the model variation is maximum, and (e) if and which (group of) factors interact with each other.

SA flow chart (Saltelli, Chan and Scott, 2000)

Design of the SA experiment: Simple factorial designs (one at a time). Factorial designs (including potential interaction terms). Fractional factorial designs. Important difference: in the design of computer code experiments, random variation due to variation in experimental units does not exist.
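
The difference between a one-at-a-time design and a full factorial design can be made concrete by listing the model runs each one requires. In this sketch the three factor names and their two levels are hypothetical.

```python
from itertools import product

# Hypothetical input factors, each with a low and a high level
levels = {
    "rate": [0.1, 0.2],
    "depth": [1.0, 2.0],
    "temp": [5.0, 15.0],
}
names = list(levels)

# Full factorial: every combination of levels (2^3 = 8 runs),
# which allows interactions between factors to be estimated
full_factorial = [dict(zip(names, combo)) for combo in product(*levels.values())]

# One at a time: a baseline run plus one run per factor moved to its other level
# (1 + 3 = 4 runs); cheap, but it cannot detect interactions
baseline = {name: vals[0] for name, vals in levels.items()}
one_at_a_time = [baseline] + [
    {**baseline, name: vals[1]} for name, vals in levels.items()
]

for run in one_at_a_time:
    print(run)
print(len(full_factorial), "runs in the full factorial design")
```

Because a deterministic computer code returns the same output for the same inputs, there is no need to replicate any of these runs.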

Global SA: Global SA apportions the output uncertainty to the uncertainty in the input factors, covering their entire range space. A global method evaluates the effect of $x_j$ while all other $x_i$, $i \neq j$, are varied as well.

How is a sampling-based (global) SA implemented? Step 1: define the model, input factors and outputs. Step 2: assign p.d.f.s to the input parameters/factors and, if necessary, a covariance structure (DIFFICULT). Step 3: simulate realisations from the parameter p.d.f.s to generate a set of model runs, giving the set of output values.
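
The three steps can be sketched in a few lines. Everything specific here is an assumption made for illustration: the toy model (an output equal to an input rate divided by a loss rate), the lognormal and Normal input distributions, the independence of the inputs, and the 1000 runs.

```python
import random

random.seed(5)


# Step 1: define the model, its input factors and its output
def model(k, q):
    """Hypothetical model: output for input rate q and loss rate k."""
    return q / k


# Step 2: assign probability distributions to the input factors
# (treated as independent here, so no covariance structure)
def sample_inputs():
    k = random.lognormvariate(0.0, 0.3)  # positive loss rate
    q = random.gauss(10.0, 1.0)          # input rate
    return k, q


# Step 3: simulate realisations of the inputs and run the model for each
runs = []
for _ in range(1000):
    k, q = sample_inputs()
    runs.append((k, q, model(k, q)))

outputs = sorted(r[2] for r in runs)
print(f"median output {outputs[len(outputs) // 2]:.2f}")
print(f"central 90% of outputs: {outputs[50]:.2f} to {outputs[949]:.2f}")
```

The spread of the outputs is the uncertainty analysis; relating that spread back to the individual inputs is the sensitivity analysis.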

SA analysis: At the end of the computer experiment, the data are of the form $(y_{ij}, x_{1i}, x_{2i}, \ldots, x_{ni})$, where $x_1, \ldots, x_n$ are the realisations of the input factors. Analysis includes regression analysis (on raw and ranked values), standard hypothesis tests of distribution (mean and variance) for subsamples corresponding to given percentiles of $x$, and Analysis of Variance.
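
One simple version of the raw-versus-ranked analysis is to compare the ordinary correlation between each input and the output with the same correlation computed on ranks. The toy model below, strongly nonlinear in `x1` and weakly dependent on `x2`, and the uniform input distributions are hypothetical; a full analysis would use regression with all inputs together.

```python
import math
import random

random.seed(8)


def model(x1, x2):
    """Hypothetical model: strongly nonlinear in x1, weakly linear in x2."""
    return math.exp(3.0 * x1) + 0.1 * x2


def pearson(a, b):
    """Ordinary (raw-value) correlation coefficient."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


def ranks(values):
    """Rank of each value (1 = smallest); ties are not expected for continuous inputs."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r


# Input realisations and the corresponding model outputs
x1 = [random.random() for _ in range(500)]
x2 = [random.random() for _ in range(500)]
y = [model(a, b) for a, b in zip(x1, x2)]

raw_x1, raw_x2 = pearson(x1, y), pearson(x2, y)
rank_x1 = pearson(ranks(x1), ranks(y))
rank_x2 = pearson(ranks(x2), ranks(y))

print(f"x1: raw {raw_x1:.3f}, ranked {rank_x1:.3f}")
print(f"x2: raw {raw_x2:.3f}, ranked {rank_x2:.3f}")
```

Ranking helps when the relationship is monotonic but not linear, as for `x1` here: the rank correlation picks up almost all of the dependence that the raw correlation understates.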

How can SA/UA help? SA/UA have a role to play in all modelling stages: – We learn about model behaviour and robustness to change; – We can generate an envelope of outcomes and see whether the observations fall within the envelope; – We can tune the model and identify reasons/causes for differences between model and observations

On the other hand, uncertainty analysis: Parameter uncertainty is usually quantified in the form of a distribution. Model structural uncertainty: more than one model may be fitted, expressed as a prior on model structure. Scenario uncertainty: uncertainty on future conditions.

Tools for handling uncertainty: Parameter uncertainty: probability distributions and sensitivity analysis. Structural uncertainty: a Bayesian framework, where one possibility is to define a discrete set of models and another is to use a Gaussian process; model averaging.

An uncertainty example (Ron Smith). [Figure panels: Original; Mean of 100 simulations; Standard deviation]

An uncertainty example. [Figure panels: CV from 100 simulations; Possible bias from 100 simulations]

An uncertainty example: Model sensitivity analysis identifies weak areas. Lack of knowledge of the accuracy of inputs is a significant problem. There may be biases in the model output which, although probably small in this case, may be important for critical loads.

Take home message: We are only able to give you a flavour of what might be possible. Good environmental science and good statistical science are key for all problems. Think critically: test and re-test your hypotheses and assumptions.

Take home message: Resources. Many good books (you have seen some of these over the sessions); not one size fits all. JISC mail list: Envstat (worth joining). The Royal Statistical Society has an Environmental Statistics section, which sometimes holds tutorial meetings on topics.