Data Mining of Environmental Models for Sensitivity Analysis. Tom Stockton, Paul Black, Andy Schuh, Kate Catlett, John Tauxe. Neptune and Company, Inc., www.neptuneandco.com
Data Mining of Environmental Models for Sensitivity Analysis Tom Stockton Paul Black, Andy Schuh, Kate Catlett, John Tauxe Neptune and Company, Inc. Knowledge Discovery re
Issue
How can a sensitivity analysis be conducted for a complex, high-dimensional probabilistic environmental model?
Decision Modeling
1. Decision model: build and solve
  – Decision actions and outcomes
  – Utility (costs, liabilities, desires)
  – Probabilistic model: scenario, model, parameter
2. Sensitivity analysis (knowledge re-discovery)
3. Value of information analysis (OUT-path)
4. Data collection
5. Update model (Bayesian or ad hoc)
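Decision Modeling
$$
U(d \mid I) = \sup_{d} \int_{S} \int_{M} \int_{\theta_M} \int_{Y} U(d \mid y, S, M, \theta_M)\, p(y \mid \theta_M, M, S)\, p(I \mid \theta_M, M, S)\, p(\theta_M \mid S)\, p(M \mid S)\, p(S)\; dy\, dS\, dM\, d\theta_M
$$
Terms:
– U(d | y, S, M, θ_M): utility function
– p(S): scenario uncertainty
– p(M | S): model uncertainty
– p(θ_M | S): parameter uncertainty
– p(I | θ_M, M, S): data likelihood
– p(y | θ_M, M, S): risk predictive distribution
where:
– U = utility, loss, cost
– d = decision
– I = information/data
– y = risk
– S = scenario
– M = model structure
– θ_M = model parameters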
Sensitivity Analysis
Given a model Y = f(X) [here, Y = GoldSim(X)], sensitivity analysis aims to describe the influence of each input variable X_i on the model response Y.
Desirable Properties of a SA Measure
– Efficiency: accounts for all effects while remaining computationally affordable
– Simplicity: implementable and interpretable
– Model independence: the method can handle non-linearity and non-monotonicity (across time and space)
K. Chan, S. Tarantola, and A. Saltelli (2000), "Variance-Based Methods," in Sensitivity Analysis, A. Saltelli, K. Chan, and E.M. Scott (eds.), John Wiley and Sons.
Sensitivity Measures
For complex probabilistic models, one-at-a-time (OAT) and differential analysis are often
– not efficient, and
– not model independent
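A minimal sketch of why OAT can mislead, assuming a hypothetical two-input response with a pure interaction (the function `model`, the base point, and the step size are illustrative and not taken from the GoldSim model). Perturbing either input alone shows no effect, while perturbing both together changes the response.

```python
def model(x1, x2):
    # Hypothetical response: a pure interaction between the two inputs.
    return x1 * x2


def oat_effects(f, base, delta):
    """Change in f when each input is shifted by delta with the others held at base."""
    y0 = f(*base)
    effects = []
    for i in range(len(base)):
        x = list(base)
        x[i] += delta
        effects.append(f(*x) - y0)
    return effects


effects = oat_effects(model, base=(0.0, 0.0), delta=1.0)
joint = model(1.0, 1.0) - model(0.0, 0.0)
print(effects)  # [0.0, 0.0]: OAT reports no sensitivity at this base point
print(joint)    # 1.0: moving both inputs together does change the response
```

The result depends entirely on the chosen base point, which is one way OAT fails to be model independent.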
Global Sensitivity Measures
– Build a statistical model of the model response and the model inputs using the Monte Carlo simulation results
– Decompose the variance of the output and attribute it to the input variables
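As a rough illustration of attributing output variance to inputs from Monte Carlo results alone, the sketch below estimates Var(E[Y | X_i]) / Var(Y) by averaging Y within equal-count bins of X_i. The binning estimator, the bin count, and the linear test response are assumptions made for this example; they stand in for the statistical model and are not the presenters' method.

```python
import random
from statistics import fmean, pvariance


def first_order_index(x, y, n_bins=20):
    """Estimate Var(E[Y | X]) / Var(Y) from the means of Y in equal-count bins of X."""
    order = sorted(range(len(x)), key=lambda k: x[k])
    size = len(x) // n_bins
    grand_mean = fmean(y)
    between = 0.0
    for b in range(n_bins):
        start = b * size
        stop = len(x) if b == n_bins - 1 else start + size
        members = [y[k] for k in order[start:stop]]
        between += len(members) * (fmean(members) - grand_mean) ** 2
    return between / len(y) / pvariance(y)


# Hypothetical Monte Carlo sample: Y = 4*X1 + X2 with independent uniform inputs.
random.seed(1)
n = 5000
x1 = [random.random() for _ in range(n)]
x2 = [random.random() for _ in range(n)]
y = [4.0 * a + b for a, b in zip(x1, x2)]

s1 = first_order_index(x1, y)
s2 = first_order_index(x2, y)
print(round(s1, 2), round(s2, 2))
```

For this response the exact first-order shares are 16/17 for X1 and 1/17 for X2, so the estimates should land near those values, with a small upward bias for the weak input that comes from binning noise.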
Standardized Rank Regression (SRR)
– Based on the ranks of Y and X_i
– Rank Y and X_i and scale the ranks to a mean of 0 and a variance of 1 for convenience
– Assumes the X_i are independent
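The SRR steps can be sketched in plain Python as follows. The test response (monotonic but non-linear in x1, weakly dependent on x2, independent of x3) and all function names are hypothetical, and ties in the ranks are not averaged, so this is an illustration under those assumptions, not a reference implementation.

```python
import math
import random
from statistics import fmean, pstdev


def standardized_ranks(values):
    """Replace values by their ranks, then scale the ranks to mean 0 and variance 1."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    for r, k in enumerate(order):
        ranks[k] = float(r + 1)  # ties are not averaged in this sketch
    m, s = fmean(ranks), pstdev(ranks)
    return [(r - m) / s for r in ranks]


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def srr_coefficients(inputs, y):
    """Least-squares coefficients of standardized rank(Y) on standardized rank(X_i)."""
    zx = [standardized_ranks(col) for col in inputs]
    zy = standardized_ranks(y)
    n = len(y)
    gram = [[sum(u * v for u, v in zip(a, b)) / n for b in zx] for a in zx]
    rhs = [sum(u * v for u, v in zip(a, zy)) / n for a in zx]
    return solve(gram, rhs)


random.seed(0)
n = 2000
x1 = [random.random() for _ in range(n)]
x2 = [random.random() for _ in range(n)]
x3 = [random.random() for _ in range(n)]  # inert input
y = [math.exp(3.0 * a) + b for a, b in zip(x1, x2)]

coef = srr_coefficients([x1, x2, x3], y)
print([round(c, 3) for c in coef])
```

Because the response is monotonic in each input, the rank transform linearizes it well and x1 receives a coefficient close to 1. A non-monotonic response would not be recovered this way.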
Fourier Amplitude Sensitivity Test (FAST)
– Explores the multidimensional space of the input factors along a search curve and analyzes the model output with a Fourier transform
– Handles main and interaction effects
K. Chan, S. Tarantola, and A. Saltelli (2000), "Variance-Based Methods," in Sensitivity Analysis, A. Saltelli, K. Chan, and E.M. Scott (eds.), John Wiley and Sons.
Issues
– Differential analysis
  – not feasible: requires derivatives of complex models
– SRR and OAT
  – not model independent: trouble with non-monotonic, non-linear models
  – not efficient: trouble with interaction effects in high-dimensional models
– FAST
  – not efficient: requires separate model runs
Possible Solutions
Data mine the probabilistic model output with
– Multivariate Adaptive Regression Splines (MARS)
– Multiple Additive Regression Trees (MART)
Data Mining
– MARS: non-parametric recursive partitioning approach that fits separate splines to distinct intervals of the predictor variables
– MART: explores the multidimensional space of the input factors using gradient boosting of additive regression models
– Advantages
  – Search for interactions between variables, allowing any degree of interaction to be considered
  – Track very complex data structures in high-dimensional data
Sensitivity Indices via ANOVA Decomposition
The sensitivity index for an input x_s is calculated using the basis functions that do not include x_s.
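The slide's formulas did not survive in the transcript. For orientation, the standard variance (ANOVA) decomposition and the resulting indices can be written as below. This is the usual textbook form, assumed here to match what the slide showed, with V_{\sim s} denoting the variance carried by all terms (basis functions) that do not involve x_s.

```latex
f(x) = f_0 + \sum_i f_i(x_i) + \sum_{i<j} f_{ij}(x_i, x_j) + \cdots

V(Y) = \sum_i V_i + \sum_{i<j} V_{ij} + \cdots

S_s = \frac{V_s}{V(Y)}, \qquad
S_{T_s} = 1 - \frac{V_{\sim s}}{V(Y)}
```

With a fitted MARS or MART surrogate, V_{\sim s} can be approximated by the variance of the surrogate restricted to the basis functions that do not contain x_s, which is how "basis functions not including x_s" enters the calculation.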
Analytical Example: Sobol’ g-function
Saltelli A., Tarantola S., and Chan K.P.-S. (1999), “A Quantitative Model-Independent Method for Global Sensitivity Analysis of Model Output,” Technometrics, 41,
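The g-function formula itself is missing from the transcript. Its usual definition is g(x) = Π (|4x_i − 2| + a_i) / (1 + a_i) with x_i uniform on [0, 1], which gives closed-form first-order indices. The sketch below implements that definition. The coefficient vector `a` is a commonly used eight-input test case and is an assumption here, since the values used in the presentation are not preserved.

```python
import random
from math import prod


def g_function(x, a):
    """Sobol' g-function: product of (|4*x_i - 2| + a_i) / (1 + a_i)."""
    return prod((abs(4.0 * xi - 2.0) + ai) / (1.0 + ai) for xi, ai in zip(x, a))


def analytic_first_order(a):
    """Closed-form first-order indices: V_i = 1 / (3 * (1 + a_i)^2), V = prod(1 + V_i) - 1."""
    partial = [1.0 / (3.0 * (1.0 + ai) ** 2) for ai in a]
    total = prod(1.0 + v for v in partial) - 1.0
    return [v / total for v in partial]


# Assumed coefficients: a small a_i marks an influential input.
a = [0.0, 1.0, 4.5, 9.0, 99.0, 99.0, 99.0, 99.0]
s = analytic_first_order(a)
print([round(v, 3) for v in s[:2]])  # [0.716, 0.179]

# Sanity check by Monte Carlo: the mean of g over uniform inputs is 1.
random.seed(3)
sample = [g_function([random.random() for _ in a], a) for _ in range(20000)]
mc_mean = sum(sample) / len(sample)
```

The first-order indices sum to less than 1 because the product form also generates interaction terms, which is what makes the function a useful test for methods that claim to handle interactions.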
Example: Sobol’ g-function
[Table: sensitivities for inputs x1 through x8. Columns: Input, a, Analytic, MART, MARS, FAST, SRR. The numerical values are not preserved in the transcript.]
Saltelli A., Tarantola S., and Chan K.P.-S. (1999), “A Quantitative Model-Independent Method for Global Sensitivity Analysis of Model Output,” Technometrics, 41,
Radioactive Waste Disposal Example
– NTS GCD GoldSim model
– Simulated data is exported to MS Access
– SQL query the data from R and run the sensitivity analysis (R is an open-source statistical programming language)
[Figure: decision flow diagram for waste management. Central question: "Can the risk be managed to regulatory thresholds at an acceptable cost with an acceptable level of uncertainty?" (YES/NO). Other labels include cost elements (analysis, ALARA, monitoring, disposal, closure), disposal fees, potential liabilities, budgets, public benefit, management options (institutional controls, site maintenance, waste acceptance, closure, monitoring/surveillance), regulations and guidance, waste acceptance and closure decisions, maintenance and periodic review, fate and transport, existing and future inventory, "Choose Management Options & Update Management Plan", "Research, Monitoring, Information & Data Collection", and an iteration loop of uncertainty analysis, sensitivity analysis, and value of information.]
Simulation Results
– Model inputs (X)
  – Inventory
  – Fate and transport: upward advection, biotic transport
– Model response (Y)
  – “EPA-SUM”
Summary
MART and MARS appear to provide an approach to data mining probabilistic model results for sensitivity analysis that is
– Efficient
– Simple (?)
– Model independent
Finally…
The decision context:
– Is the uncertainty in the model response too high?
– Is there value in reducing input uncertainty?
– SA and cost are used to estimate the value of collecting additional information.
MARS
– Non-parametric recursive partitioning approach that fits separate splines to distinct intervals of the predictor variables.
– Both the selected variables and the knots are found via a brute-force, exhaustive search, optimized simultaneously by evaluating a "lack of fit" criterion.
– Searches for interactions between variables, allowing any degree of interaction to be considered.
– Tracks very complex data structures in high-dimensional data.
J.H. Friedman (1991), “Multivariate Adaptive Regression Splines,” The Annals of Statistics, 19, 1-14
Software: Trevor Hastie and Robert Tibshirani, MDA Library for R (‘GNU S’).
Ross Ihaka and Robert Gentleman (1996), "R: A Language for Data Analysis and Graphics," Journal of Computational and Graphical Statistics, 5, 3,
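To make the knot search concrete, here is a deliberately reduced sketch: one predictor, one knot, and a pair of hinge basis functions max(0, x − t) and max(0, t − x) fitted by least squares, with the knot chosen by exhaustive search over the observed x values. Full MARS adds forward and backward passes over many variables and products of hinges, and the `mda` library is not used here. All names and the test response |x − 0.3| are hypothetical.

```python
def hinge_pair(x, t):
    """The two mirrored hinge basis functions at knot t."""
    return max(0.0, x - t), max(0.0, t - x)


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_at_knot(xs, ys, t):
    """Least-squares fit of y ~ b0 + b1*max(0, x-t) + b2*max(0, t-x); returns (rss, coef)."""
    rows = [(1.0, *hinge_pair(x, t)) for x in xs]
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    coef = solve(gram, rhs)
    rss = sum((y - sum(c * v for c, v in zip(coef, r))) ** 2 for r, y in zip(rows, ys))
    return rss, coef


def best_knot(xs, ys):
    """Exhaustive search over interior data points for the knot with the smallest RSS."""
    candidates = sorted(set(xs))[1:-1]
    return min((fit_at_knot(xs, ys, t) + (t,) for t in candidates), key=lambda r: r[0])


xs = [k / 50 for k in range(51)]
ys = [abs(x - 0.3) for x in xs]  # non-monotonic response with a kink at 0.3
rss, coef, knot = best_knot(xs, ys)
print(knot, [round(c, 6) for c in coef])
```

The search recovers the kink at 0.3 with unit slopes on both hinges, a shape that a rank regression cannot represent because the response is not monotonic.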
MART: Multiple Additive Regression Trees
– Explores the multidimensional space of the input factors using gradient boosting of additive regression models.
– Handles main and interaction effects.
– Fast
K. Chan, S. Tarantola, and A. Saltelli (2000), "Variance-Based Methods," in Sensitivity Analysis, A. Saltelli, K. Chan, and E.M. Scott (eds.), John Wiley and Sons.
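Gradient boosting of regression trees can be illustrated with the smallest possible tree, a single-split stump. The sketch below boosts stumps on squared-error residuals and accumulates each split's error reduction per input as a relative importance. It is an illustration under stated assumptions: the data, tree count, and learning rate are hypothetical, and stumps capture main effects only, so interaction effects would need deeper trees than this example uses.

```python
import math
import random


def fit_stump(xs, residuals):
    """Best single split (gain, variable, threshold, left mean, right mean) on the residuals."""
    n = len(residuals)
    total = sum(residuals)
    best = None
    for j in range(len(xs[0])):
        order = sorted(range(n), key=lambda k: xs[k][j])
        left_sum = 0.0
        for pos in range(1, n):
            left_sum += residuals[order[pos - 1]]
            lo, hi = xs[order[pos - 1]][j], xs[order[pos]][j]
            if lo == hi:
                continue  # cannot split between equal values
            right_sum = total - left_sum
            # Reduction in squared error compared with predicting one overall mean.
            gain = left_sum ** 2 / pos + right_sum ** 2 / (n - pos) - total ** 2 / n
            if best is None or gain > best[0]:
                best = (gain, j, (lo + hi) / 2.0, left_sum / pos, right_sum / (n - pos))
    return best


def boost(xs, y, n_trees=100, rate=0.1):
    """Boost stumps on residuals; return normalized importances and fitted values."""
    n = len(y)
    pred = [sum(y) / n] * n
    importance = [0.0] * len(xs[0])
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        gain, j, thr, left, right = fit_stump(xs, residuals)
        importance[j] += gain
        pred = [p + rate * (left if row[j] <= thr else right) for p, row in zip(pred, xs)]
    total = sum(importance)
    return [v / total for v in importance], pred


# Hypothetical Monte Carlo sample: strong non-monotonic x1 effect, weak x2, inert x3.
random.seed(2)
n = 400
xs = [[random.random() for _ in range(3)] for _ in range(n)]
y = [math.sin(2.0 * math.pi * r[0]) + 0.3 * r[1] for r in xs]

importance, pred = boost(xs, y)
mean_y = sum(y) / n
var_y = sum((v - mean_y) ** 2 for v in y) / n
mse = sum((a - b) ** 2 for a, b in zip(y, pred)) / n
print([round(v, 3) for v in importance])
```

Most of the accumulated gain goes to x1 even though its effect is non-monotonic, which is the kind of model independence the summary slide claims for tree-based data mining.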