Presentation on theme: "Statistical model building"— Presentation transcript:
1 Statistical model building. Marian Scott and Ron Smith; Dept of Statistics, University of Glasgow; CEH; Glasgow, Aug 2008
2 Outline of presentation: Statistical models (what are the principles, describing variation, empiricism). Fitting models: calibration. Testing models: validation or verification. Quantifying and apportioning variation in model and data. Stochastic and deterministic models. Introduction to uncertainty and sensitivity analysis.
3 Step 1: Why do you want to build a model, and what is your objective? What data are available and how were they collected? Is there a natural response or outcome, and are there other explanatory variables or covariates?
4 Modelling objectives: explore relationships; make predictions; improve understanding; test hypotheses.
5 Conceptual system (diagram labels): feedbacks; data; model; inputs & parameters; policy; model results.
6 Why model? Purposes of modelling: describe/summarise; predict (what if...); test hypotheses; manage. What is a good model? Simple, realistic, efficient, reliable, valid.
7 Value judgements: Different criteria are of unequal importance. The key comparison is often with observational data, but such comparisons must include the model uncertainties and the uncertainties on the observational data.
8 Questions we ask about models: Is the model valid? Are the assumptions reasonable? Does the model make sense based on best scientific knowledge? Is the model credible? Do the model predictions match the observed data? How uncertain are the results?
9 Stages in modelling. Design and conceptualisation: visualisation of structure; identification of processes; choice of parameterisation. Fitting and assessment: parameter estimation (calibration); goodness of fit.
10 A visual model: atmospheric flux of pollutants. Atmospheric pollutants are dispersed over Europe. In the 1970s, considerable environmental damage was caused by acid rain. International action followed: development of the EMEP programme, models and measurements.
11 The mathematical flux model (symbol definitions): L: Monin-Obukhov length; u*: friction velocity of wind; cp: constant (=1.01); ρ: constant (=1246 g m-3); T: air temperature (in Kelvin); k: constant (=0.41); g: gravitational acceleration (=9.81 m/s²); H: the rate of heat transfer per unit area; gasht: current height at which measurements are taken; d: zero plane displacement.
12 What would a statistician do if confronted with this problem? Look at the data; understand the measurement processes; think about how the scientific knowledge and the conceptual model relate to what we have measured.
13 Step 2: understand your data. Study your data; learn its properties; tools: graphical.
14 The data: variation. Soil or sediment samples taken side-by-side, from different parts of the same plant, or from different animals in the same environment, exhibit different activity densities of a given radionuclide. The distribution of values observed will provide an estimate of the variability inherent in the population of samples that, theoretically, could be taken.
15 Variation: Activity (log10) of particles (Bq Cs-137) with Normal or Gaussian density superimposed.
16 Measured atmospheric fluxes for 1997: measured fluxes for 1997 are still noisy. Is there a statistical signal, and at what timescale?
17 Key properties of any measurement: Accuracy refers to the deviation of the measurement from the ‘true’ value. Precision refers to the variation in a series of replicate measurements (obtained under identical conditions).
18 Accuracy and precision (figure panels): Accurate; Inaccurate; Precise; Imprecise.
19 Evaluation of accuracy: In a laboratory inter-comparison, known-concentration material is used to define the ‘true’ concentration. The figure shows a measure of accuracy for individual laboratories. Accuracy is linked to bias.
20 Evaluation of precision: Analysis of the instrumentation method to make a single measurement, and the propagation of any errors. Repeat measurements (true replicates), using homogeneous material, repeatedly subsampling, etc. Precision is linked to variance (standard deviation). Precision, error, uncertainty: all the terminology again, and how to estimate it.
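As a rough illustration of the two slides above, bias (accuracy) and standard deviation (precision) can be estimated from replicate measurements of a reference material. This is only a sketch: the function name `bias_and_precision` and the replicate values are hypothetical and not from the presentation.

```python
import statistics

def bias_and_precision(replicates, reference_value):
    """Return (bias, standard deviation) of replicate measurements."""
    mean_value = statistics.mean(replicates)
    bias = mean_value - reference_value      # accuracy: systematic offset from the 'true' value
    spread = statistics.stdev(replicates)    # precision: sample standard deviation of replicates
    return bias, spread

# Hypothetical replicates of a reference material whose 'true' value is 10.0
replicates = [10.4, 10.6, 10.5, 10.3, 10.7]
bias, spread = bias_and_precision(replicates, reference_value=10.0)
print(f"bias = {bias:.3f}, standard deviation = {spread:.3f}")
```

Here the replicates are tightly clustered (good precision) but sit systematically above the reference value (poor accuracy).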
21 The nature of measurement: All measurement is subject to uncertainty. Analytical uncertainty reflects that every time a measurement is made (under identical conditions), the result is different. Sampling uncertainty represents the ‘natural’ variation in the organism within the environment.
22 The error and uncertainty in a measurement: The error is a single value, which represents the difference between the measured value and the true value. The uncertainty is a range of values, and describes the errors which might have been observed were the measurement repeated under IDENTICAL conditions. Error (and uncertainty) includes a combination of variance and bias.
23 Effect of uncertainties: Lack of observations contributes to uncertainties in input data and uncertainty in model parameter values. Conflicting evidence contributes to uncertainty about model form and uncertainty about validity of assumptions.
24 Step 3: build the statistical model. Outcomes or Responses: these are the results of the practical work and are sometimes referred to as ‘dependent variables’. Causes or Explanations: these are the conditions or environment within which the outcomes or responses have been observed and are sometimes referred to as ‘independent variables’, but more commonly known as covariates.
25 Statistical models: In experiments, many of the covariates have been determined by the experimenter, but some may be aspects that the experimenter has no control over but that are relevant to the outcomes or responses. In observational studies, these are usually not under the control of the experimenter but are recorded as possible explanations of the outcomes or responses.
26 Specifying a statistical model: Models specify the way in which outcomes and causes link together, e.g. Metabolite = Temperature. The = sign does not indicate equality in a mathematical sense, and there should be an additional item on the right hand side, giving the formula: Metabolite = Temperature + Error.
27 Specifying a statistical model: Metabolite = Temperature + Error. In mathematical terms, there will be some unknown parameters to be estimated, and some assumptions will be made about the error distribution: $\text{Metabolite} = \alpha + \beta \cdot \text{temperature} + \varepsilon$, with $\varepsilon \sim N(0, \sigma^2)$.
28 Statistical model interpretation: Metabolite = Temperature + Error. The outcome Metabolite is explained by Temperature and other things that we have not recorded, which we call Error. The task that we then have in terms of data analysis is simply to find out if the effect that Temperature has is ‘large’ in comparison to that which Error has, so that we can say whether or not the Metabolite that we observe is explained by Temperature.
29 Model calibration: Statisticians tend to talk about model fitting; calibration means something else to them. Methods: least squares or maximum likelihood. Least squares: find the parameter estimates that minimise the sum of squares (SS), $SS = \sum (\text{observed } y - \text{model fitted } y)^2$. Maximum likelihood: find the parameter estimates that maximise the likelihood of the data.
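A minimal sketch of least-squares fitting for a straight-line model of the Metabolite/Temperature kind. The data are simulated, and the "true" intercept, slope and error standard deviation are assumptions chosen only for illustration. With Normal errors of constant variance, the least-squares estimates of intercept and slope coincide with the maximum likelihood estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative only): intercept 2.0, slope 0.5, error sd 0.3
temperature = np.linspace(5.0, 25.0, 50)
metabolite = 2.0 + 0.5 * temperature + rng.normal(0.0, 0.3, size=temperature.size)

# Design matrix: a column of ones for the intercept, then the covariate
X = np.column_stack([np.ones_like(temperature), temperature])

# Parameter estimates that minimise SS = sum (observed y - model fitted y)^2
coef, *_ = np.linalg.lstsq(X, metabolite, rcond=None)
alpha_hat, beta_hat = coef

fitted = X @ coef
ss = float(np.sum((metabolite - fitted) ** 2))   # minimised sum of squares
sigma2_hat = ss / (len(metabolite) - 2)          # unbiased estimate of the error variance

print(f"intercept = {alpha_hat:.3f}, slope = {beta_hat:.3f}, SS = {ss:.3f}")
```

Any other choice of intercept and slope gives a sum of squares at least as large as `ss`, which is what "least squares" means.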
30 Calibration, using the data: It is a good idea, if possible, to have a training and a test set of data; split the data (90%/10%). Fit the model using the training set, and evaluate the model using the test set. Why? Because if we assess how well the model performs on the data that were used to fit it, then we are being over-optimistic. Other methods: bootstrap and jackknife.
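The 90%/10% split can be sketched as follows. The data are simulated and the straight-line model is an assumption made purely for illustration; the point is only that the fit uses the training rows and the assessment uses the held-out rows.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)   # simulated data, illustrative only

# Random 90%/10% split of the row indices
idx = rng.permutation(n)
n_train = (9 * n) // 10
train, test = idx[:n_train], idx[n_train:]

# Fit a straight line using the training set only
slope, intercept = np.polyfit(x[train], y[train], deg=1)

def rmse(rows):
    """Root mean squared prediction error on the given rows."""
    pred = intercept + slope * x[rows]
    return float(np.sqrt(np.mean((y[rows] - pred) ** 2)))

rmse_train = rmse(train)   # tends to flatter the model
rmse_test = rmse(test)     # the more honest assessment
print(f"training RMSE = {rmse_train:.3f}, test RMSE = {rmse_test:.3f}")
```

For a single random split the test error is itself noisy, which is one motivation for resampling methods such as the bootstrap and jackknife.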
31 Model validation: what is validation? Fit the model using the training set, and evaluate the model using the test set. Why? Because if we assess how well the model performs on the data that were used to fit it, then we are being over-optimistic. Other methods: bootstrap and jackknife.
32 Example 4: Models, how well should models agree? 6 ocean models (process based: transport, sedimentary processes, numerical solution scheme, grid size) were used to predict the dispersal of a pollutant. Results are to be used to determine a remediation policy for an illegal dumping of “radioactive waste”: the ‘what if’ scenario investigation. The models differ in their detail and also in their spatial scale.
33 Predictions of levels of cobalt-60: Different models, same input data. Predictions vary by considerable margins. The magnitude of variation is a function of the spatial distribution of sites.
34 Statistical models and process models: Loch Leven, modelling nutrients. Process model based on differential equations; statistical model based on empirically determined relationships.
39 Uncertainty (in variables, models, parameters, data): What are uncertainty and sensitivity analyses? An example.
40 Effect of uncertainties: Lack of observations contributes to uncertainties in input data and uncertainty in model parameter values. Conflicting evidence contributes to uncertainty about model form and uncertainty about validity of assumptions.
41 Modelling tools, SA/UA. Sensitivity analysis: determining the amount and kind of change produced in the model predictions by a change in a model parameter. Uncertainty analysis: an assessment/quantification of the uncertainties associated with the parameters, the data and the model structure.
42 Modellers conduct SA to determine (a) if a model resembles the system or processes under study, (b) the factors that contribute most to the output variability, (c) the model parameters (or parts of the model itself) that are insignificant, (d) if there is some region in the space of input factors for which the model variation is maximum, and (e) if and which (groups of) factors interact with each other.
44 Design of the SA experiment: Simple factorial designs (one at a time); factorial designs (including potential interaction terms); fractional factorial designs. Important difference: in the design of computer code experiments, random variation due to variation in experimental units does not exist.
45 SA techniques. Screening techniques: O(ne) A(t) T(ime), factorial, fractional factorial designs used to isolate a set of important factors. Local/differential analysis. Sampling-based (Monte Carlo) methods. Variance-based methods: variance decomposition of output to compute sensitivity indices.
46 Screening: screening experiments can be used to identify, with low computational effort, the parameter subset that controls most of the output variability.
47 Screening methods: Vary one factor at a time (NOT particularly recommended). Morris OAT design (global): estimate the main effect of a factor by computing a number r of local measures at different points x1,…,xr in the input space and then averaging them. Order the input factors.
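The idea of averaging r local measures taken at different points can be sketched as below. This is a simplified one-at-a-time scheme with random base points on the unit hypercube, not the full Morris trajectory design; the function `elementary_effects`, the step size and the toy model are all assumptions for illustration.

```python
import numpy as np

def elementary_effects(model, n_factors, r=20, delta=0.1, seed=0):
    """Mean absolute elementary effect per factor on the unit hypercube.

    Simplified sketch: r random base points, each factor stepped in turn
    by delta (not the full Morris trajectory design).
    """
    rng = np.random.default_rng(seed)
    effects = np.empty((r, n_factors))
    for i in range(r):
        # Keep base + delta inside [0, 1]
        base = rng.uniform(0.0, 1.0 - delta, size=n_factors)
        y0 = model(base)
        for j in range(n_factors):
            stepped = base.copy()
            stepped[j] += delta
            effects[i, j] = (model(stepped) - y0) / delta   # one local measure
    return np.mean(np.abs(effects), axis=0)                 # average over the r points

# Hypothetical test model: factor 0 strong, factor 1 weak, factor 2 inert
def toy_model(x):
    return 5.0 * x[0] + 0.5 * x[1] ** 2 + 0.0 * x[2]

mu_star = elementary_effects(toy_model, n_factors=3)
ranking = np.argsort(mu_star)[::-1]   # factors ordered from most to least influential
print(mu_star, ranking)
```

Each screening run costs r × (number of factors + 1) model evaluations, which is why this kind of design is attractive when the model is expensive.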
48 Local SA: Local SA concentrates on the local impact of the factors on the model. Local SA is usually carried out by computing partial derivatives of the output functions with respect to the input variables. The input parameters are varied in a small interval around a nominal value. The interval is usually the same for all of the variables and is not related to the degree of knowledge of the variables.
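A sketch of local SA by numerical partial derivatives at a nominal point. The central-difference scheme, the relative step size and the exponential-decay model are assumptions for illustration only.

```python
import numpy as np

def local_sensitivities(model, nominal, rel_step=1e-4):
    """Central-difference estimates of d(output)/d(input_j) at the nominal point."""
    nominal = np.asarray(nominal, dtype=float)
    grads = np.empty_like(nominal)
    for j in range(nominal.size):
        h = rel_step * max(abs(nominal[j]), 1.0)   # small interval around the nominal value
        up, down = nominal.copy(), nominal.copy()
        up[j] += h
        down[j] -= h
        grads[j] = (model(up) - model(down)) / (2.0 * h)
    return grads

# Hypothetical model: y = a * exp(-b * t), evaluated at t = 2
def decay(p):
    a, b = p
    return a * np.exp(-b * 2.0)

grads = local_sensitivities(decay, nominal=[10.0, 0.5])
print(grads)   # analytically: dy/da = exp(-1), dy/db = -20 * exp(-1)
```

The derivatives describe the model only near the nominal point; they say nothing about behaviour elsewhere in the input space, which is the motivation for global methods.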
49 Global SA: Global SA apportions the output uncertainty to the uncertainty in the input factors, covering their entire range space. A global method evaluates the effect of $x_j$ while all other $x_i$, $i \neq j$, are varied as well.
50 How is a sampling-based (global) SA implemented? Step 1: define the model, input factors and outputs. Step 2: assign p.d.f.s to input parameters/factors and, if necessary, a covariance structure. DIFFICULT. Step 3: simulate realisations from the parameter p.d.f.s to generate a set of model runs giving the set of output values.
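The three steps can be sketched with simple random (Monte Carlo) sampling. The toy model and both input distributions are assumptions chosen for illustration; in practice, Step 2 is where most of the difficulty lies.

```python
import numpy as np

# Step 1: define the model, input factors and output (hypothetical toy model)
def model(k, c0):
    return c0 * np.exp(-k)

# Step 2: assign p.d.f.s to the input factors (assumed forms, taken as independent)
rng = np.random.default_rng(42)
n_runs = 5000
k = rng.lognormal(mean=np.log(0.5), sigma=0.3, size=n_runs)
c0 = rng.normal(loc=100.0, scale=10.0, size=n_runs)

# Step 3: one model run per sampled input set gives the set of output values
y = model(k, c0)

summary = {
    "mean": float(np.mean(y)),
    "sd": float(np.std(y, ddof=1)),
    "p05": float(np.percentile(y, 5)),
    "p95": float(np.percentile(y, 95)),
}
print(summary)
```

The retained pairs of inputs and outputs are the raw material for the regression and variance-based analyses.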
51 Choice of sampling method. S(imple) or Stratified R(andom) S(ampling): each input factor is sampled independently many times from marginal distributions to create the set of input values (or randomly sampled from the joint distribution). Expensive (relatively) in computational effort if the model has many input factors; may not give good coverage of the entire range space. L(atin) H(ypercube) S(ampling): the range of each input factor is divided into N equal-probability intervals, and one observation of each input factor is made in each interval.
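A minimal Latin hypercube sampler on the unit hypercube, written with NumPy only. The function name `latin_hypercube` is hypothetical; values on [0, 1) would normally be passed through each factor's inverse cumulative distribution function to obtain samples from the chosen marginal distributions.

```python
import numpy as np

def latin_hypercube(n_samples, n_factors, seed=0):
    """Latin hypercube sample on the unit hypercube [0, 1)^n_factors."""
    rng = np.random.default_rng(seed)
    # One uniform draw inside each of the n_samples equal-probability intervals
    u = (np.arange(n_samples)[:, None]
         + rng.uniform(size=(n_samples, n_factors))) / n_samples
    # Shuffle each column independently so intervals pair at random across factors
    for j in range(n_factors):
        u[:, j] = rng.permutation(u[:, j])
    return u

sample = latin_hypercube(n_samples=10, n_factors=3)

# Interval index of every value: each column should contain 0..9 exactly once
strata = np.floor(sample * 10).astype(int)
print(strata)
```

Because every interval of every factor is visited once, coverage of each marginal range is guaranteed even with few model runs, unlike simple random sampling.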
52 SA analysis: At the end of the computer experiment, the data are of the form (yij, x1i, x2i, …, xni), where x1,…,xn are the realisations of the input factors. Analysis includes regression analysis (on raw and ranked values), standard hypothesis tests of distribution (mean and variance) for subsamples corresponding to given percentiles of x, and Analysis of Variance.
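Regression on raw and on ranked values can be sketched with standardised regression coefficients. The helper names, the sampled inputs and the toy model (strongly nonlinear but monotonic in the first factor) are assumptions for illustration.

```python
import numpy as np

def standardised_coefficients(X, y):
    """Least-squares coefficients after scaling inputs and output to zero mean, unit variance."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    coef, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return coef

def to_ranks(a):
    """Replace each value by its rank within its column (assumes no ties)."""
    return np.argsort(np.argsort(a, axis=0), axis=0).astype(float)

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(500, 3))   # sampled input factors
# Hypothetical model output: factor 0 dominant, factor 1 minor, factor 2 inert
y = np.exp(4.0 * X[:, 0]) + 2.0 * X[:, 1] + rng.normal(0.0, 0.1, size=500)

src = standardised_coefficients(X, y)                        # regression on raw values
srrc = standardised_coefficients(to_ranks(X), to_ranks(y))   # regression on ranked values
print(src, srrc)
```

The rank transformation turns a monotonic nonlinear relationship into a nearly linear one, so the ranked regression is often the more reliable guide when the model is nonlinear.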
53 Some ‘newer’ methods of analysis. Measures of importance: $\mathrm{Var}_{X_j}\big(E(Y \mid X_j = x_j)\big) / \mathrm{Var}(Y)$; $\mathrm{HIM}(X_j) = \sum_i y_i y_i' / N$. Sobol sensitivity indices. Fourier Amplitude Sensitivity Test (FAST).
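The importance measure Var(E(Y | Xj)) / Var(Y) can be approximated crudely from a Monte Carlo sample by averaging the output within bins of one input. This binning estimator is an assumption chosen for simplicity (it is slightly biased, and dedicated Sobol or FAST estimators are normally preferred); the additive test model is hypothetical.

```python
import numpy as np

def first_order_index(x, y, n_bins=20):
    """Estimate Var(E(Y | X)) / Var(Y) by averaging y within equal-count bins of x."""
    order = np.argsort(x)
    bins = np.array_split(y[order], n_bins)
    bin_means = np.array([b.mean() for b in bins])   # approximates E(Y | X in bin)
    bin_sizes = np.array([b.size for b in bins])
    var_cond_mean = np.sum(bin_sizes * (bin_means - y.mean()) ** 2) / y.size
    return var_cond_mean / y.var()

rng = np.random.default_rng(7)
n = 20000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + 1.0 * x2    # hypothetical additive model, so Var(Y) = 4 + 1 = 5

s1 = first_order_index(x1, y)   # analytical value 4/5
s2 = first_order_index(x2, y)   # analytical value 1/5
print(f"S1 = {s1:.3f}, S2 = {s2:.3f}")
```

For an additive model the indices sum to about one; a sum well below one points to interactions between factors.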
54 How can SA/UA help? SA/UA have a role to play in all modelling stages: we learn about model behaviour and ‘robustness’ to change; we can generate an envelope of ‘outcomes’ and see whether the observations fall within the envelope; we can ‘tune’ the model and identify reasons/causes for differences between model and observations.
55 On the other hand, uncertainty analysis. Parameter uncertainty: usually quantified in the form of a distribution. Model structural uncertainty: more than one model may be fitted, expressed as a prior on model structure. Scenario uncertainty: uncertainty on future conditions.
56 Tools for handling uncertainty. Parameter uncertainty: probability distributions and sensitivity analysis. Structural uncertainty: Bayesian framework; one possibility is to define a discrete set of models, another is to use a Gaussian process; model averaging.
57 An uncertainty example (1). Wet deposition is rainfall × ion concentration. Rainfall is measured at approximately 4000 locations; map produced by UK Met Office. Rain ion concentrations are measured weekly (now fortnightly or monthly) at around 32 locations.
58 An uncertainty example (2). BUT almost all measurements are at low altitudes, and much of Britain is upland, AND measurement campaigns show that rain increases with altitude and rain ion concentrations increase with altitude. Seeder rain, falling through feeder rain on hills, scavenges cloud droplets with high pollutant concentrations.
59 An uncertainty example (3). Solutions: (a) More measurements. ✗ Measurements at high altitude are not routine and are complicated. (b) Derive relationship with altitude. ✗ Rain shadow and wind drift (over about 10 km downwind) confound any direct altitude relationships. (c) Derive relationship from rainfall map: model rainfall in 2 separate components.
61 An uncertainty example (5). Wet deposition is modelled by deposition = s.c + (r-s).c.f, where r: actual rainfall; s: rainfall on ‘low’ ground (r = s on ‘low’ ground, and (r-s) is excess rainfall caused by the hill); c: rain ion concentration as measured on ‘low’ ground; f: enhancement factor (ratio of rain ion concentration in excess rainfall to rain ion concentration in ‘low’ ground rainfall).
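The deposition formula on this slide translates directly into a small function. The function name and the numerical inputs below are hypothetical and the units are left abstract; only the formula itself comes from the slide.

```python
def wet_deposition(r, s, c, f):
    """deposition = s*c + (r - s)*c*f, as given on the slide.

    r: actual rainfall
    s: rainfall on 'low' ground
    c: rain ion concentration as measured on 'low' ground
    f: enhancement factor for the excess rainfall caused by the hill
    """
    return s * c + (r - s) * c * f

# Illustrative numbers only. On low ground r == s, so f has no effect.
low_ground = wet_deposition(r=1000.0, s=1000.0, c=0.5, f=2.0)
upland = wet_deposition(r=1500.0, s=1000.0, c=0.5, f=2.0)
print(low_ground, upland)
```

In the upland case the excess rainfall (r - s) is half the low-ground rainfall, yet with f = 2 it contributes as much deposition as the low-ground component.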
62 An uncertainty example (6) (map panels): Rainfall; Concentration; Deposition.
63 An uncertainty example (7). a) Modelled rainfall to 5 km squares provided by UKMO: unknown uncertainty. Scale issue: rainfall is a point measurement. Measurement issue: rain gauges are difficult to use at high altitude. Optimistic 30%, pessimistic 50%. How is the uncertainty represented? (Not, e.g., 30% everywhere.)
64 An uncertainty example (8). b) Some sort of smoothed surface (change in prevalence of westerly winds means it alters between years). c) Kriged interpolation of annual rainfall-weighted mean concentrations (variogram not well specified); assume 90% of observations within ±10% of correct value. d) Campaign measurements indicate values between 1.5 and 3.5.
65 An uncertainty example (9). Output measures in the sensitivity analysis are the average flux (kg S ha-1 y-1) for (a) GB, and (b) 3 sample areas.
66 An uncertainty example (10). Morris indices are one way of determining which effects are more important than others, so reducing further work. But different parameters are important in different areas.
67 An uncertainty example (11). 100 simulations; Latin Hypercube Sampling of 3 uncertainty factors: enhancement ratio; % error in rainfall map; % error in concentration.
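A sketch of how 100 Latin hypercube simulations over the three uncertainty factors might be set up for a single site. The slides do not state the distributional forms or how the errors enter the model, so the uniform distributions, the multiplicative errors, the ±30% and ±10% ranges used as hard limits, and the site values are all assumptions made for illustration.

```python
import numpy as np

def latin_hypercube(n_samples, n_factors, rng):
    """Latin hypercube sample on [0, 1)^n_factors (one draw per equal-probability interval)."""
    u = (np.arange(n_samples)[:, None]
         + rng.uniform(size=(n_samples, n_factors))) / n_samples
    for j in range(n_factors):
        u[:, j] = rng.permutation(u[:, j])
    return u

rng = np.random.default_rng(2008)
u = latin_hypercube(100, 3, rng)

# ASSUMED input distributions (uniform, independent); not specified in the slides
f = 1.5 + 2.0 * u[:, 0]          # enhancement ratio between 1.5 and 3.5
rain_err = 0.7 + 0.6 * u[:, 1]   # multiplicative rainfall error within +/-30%
conc_err = 0.9 + 0.2 * u[:, 2]   # multiplicative concentration error within +/-10%

# Hypothetical single upland site (illustrative values, abstract units)
r, s, c = 1500.0, 1000.0, 0.5

# ASSUMPTION: the rainfall error scales r and s alike; the concentration error scales c
deposition = rain_err * conc_err * (s * c + (r - s) * c * f)

mean_dep = float(deposition.mean())
sd_dep = float(deposition.std(ddof=1))
cv_dep = sd_dep / mean_dep       # coefficient of variation over the 100 simulations
print(f"mean = {mean_dep:.1f}, sd = {sd_dep:.1f}, CV = {cv_dep:.2f}")
```

Repeating this calculation for every grid square would give maps of the mean, standard deviation and CV over the simulations.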
68 An uncertainty example (12). Note skewed distributions for GB and for the 3 selected areas.
69 An uncertainty example (13) (map panels): Mean of 100 simulations; Standard deviation; Original.
70 An uncertainty example (14) (map panels): CV from 100 simulations; Possible bias from 100 simulations.
71 An uncertainty example (15). Model sensitivity analysis identifies weak areas. Lack of knowledge of the accuracy of inputs is a significant problem. There may be biases in the model output which, although probably small in this case, may be important for critical loads.
72 Conclusions. The world is rich and varied in its complexity. Modelling is an uncertain activity. SA/UA are important tools in model assessment. The setting of the problem in a unified Bayesian framework allows all the sources of uncertainty to be quantified, and so a fuller assessment to be performed.
73 Challenges. Some challenges: different terminologies in different subject areas; need for more sophisticated tools to deal with the multivariate nature of the problem; challenges in describing the distribution of input parameters; challenges in dealing with the Bayesian formulation of structural uncertainty for complex models; computational challenges in simulations for large and complex computer models with many factors.