Extreme values Adam Butler Biomathematics & Statistics Scotland Seminar at MLURI, January 2008.


Outline: Motivation; What is EVT?; Applications; Current research

1. Motivation

Photo: flooding, Budapest, 2002 (credit: Graham Berry)

What is the probability that the flood defenses of Budapest will be overtopped during 2008?

Photo: Northern Rock branch, London, 2007 (credit: Alex Gunningham)

What is the probability of today’s value of the Dow Jones index being at least 9.5% lower than yesterday’s?

Log daily return (LDR) = log(value today / value yesterday). If the value drops by 9.5%, then LDR = log(0.905) ≈ –0.10. Q. On this particular day, what is the chance of getting a log daily return of less than –0.10?
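The arithmetic on this slide can be checked with a minimal sketch. It assumes that "log" means the natural logarithm; the function name `log_daily_return` and the index values 100.0 and 90.5 are illustrative only and do not come from the talk.

```python
import math

def log_daily_return(value_today: float, value_yesterday: float) -> float:
    """Natural log of the ratio of today's value to yesterday's value."""
    return math.log(value_today / value_yesterday)

# A 9.5% one-day fall means today's value is 0.905 times yesterday's.
# The values 100.0 and 90.5 are made-up index levels for illustration.
ldr = log_daily_return(90.5, 100.0)
print(round(ldr, 4))  # -0.0998
```

So a fall of 9.5% corresponds to a log daily return just above –0.10.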

Dow Jones Data for the period

To answer this question we clearly need to extrapolate, since –0.1 is well outside the range of the data. Extrapolation should be avoided whenever possible, but in many real-life problems it is unavoidable.

So how should we go about estimating this probability? We could assume that the data are normally distributed…

P(X < –0.1) 

…but the extreme values that have been observed don’t play much of a role when we estimate the parameters (e.g. the mean and variance). Hence, our chosen model (e.g. the normal distribution) might do badly in describing their properties…

Empirical: P(X < –0.05)  Normal: P(X < –0.05) 

…and, worse still, extrapolations beyond the range of the data often differ radically between models that provide a very similar fit to the bulk of the data. For example, we might decide to fit a Cauchy rather than a normal distribution…

Cauchy: P(X < –0.1) ≈ 0.02; Normal: P(X < –0.1)
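The contrast between the two models can be sketched as follows. The parameter values are hypothetical: they are chosen only so that both distributions are centred at zero with a spread of roughly one percent, and they are not the values fitted to the Dow Jones data in the talk.

```python
import math

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Normal CDF, written with erfc so that tiny lower-tail values are not rounded to 0."""
    return 0.5 * math.erfc((mu - x) / (sigma * math.sqrt(2.0)))

def cauchy_cdf(x: float, loc: float, scale: float) -> float:
    """Cauchy CDF with location `loc` and scale `scale`."""
    return 0.5 + math.atan((x - loc) / scale) / math.pi

# Hypothetical parameters (assumptions, not estimates from the talk).
p_normal = normal_cdf(-0.1, mu=0.0, sigma=0.01)
p_cauchy = cauchy_cdf(-0.1, loc=0.0, scale=0.006)

print(f"Normal: P(X < -0.1) = {p_normal:.3g}")
print(f"Cauchy: P(X < -0.1) = {p_cauchy:.3g}")
```

Under these assumed parameters the two tail probabilities differ by many orders of magnitude, which is the point the slide makes about extrapolating from models fitted to the bulk of the data.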

We need an alternative statistical approach that is more robust, in the sense that it does not require us to make strong and untestable assumptions about the process that is generating our data. This is the motivation for EVT: Extreme Value Theory.

2. What is EVT?

General characteristics of an “EVT” problem. We are interested in a process that can be quantified, and for which we have some data, and we want to use these data to say something about the probability that a rare or extreme event will occur. We will usually be interested in events that are beyond the range of the data, i.e. we want to extrapolate.

To deal with such problems, we begin from the principle that our inferences should only be based on the most extreme data that we have actually observed, i.e. we should throw away almost all of the data.

Extreme value theory (EVT) then provides us with some simple and robust models that can be used to describe the properties of these extreme data.

Q. What is the probability of getting more than 100mm of rain on any given day?

We might decide to only use data for days with 25mm or more of rainfall…

Histogram of data above a threshold of 25mm

Threshold exceedance = Value - Threshold

The GPD model. A good statistical model for threshold exceedances is the GPD (Generalised Pareto Distribution). Its cumulative distribution function is of the form $F(x) = 1 - (1 + \xi x / \sigma)^{-1/\xi}$ for $x > 0$. There are two parameters, a scale parameter $\sigma$ and a shape parameter $\xi$, which need to be estimated.
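A minimal sketch of the two ingredients on this slide: turning raw values into threshold exceedances, and evaluating the GPD distribution function with scale `sigma` and shape `xi`. The function names and the rainfall values are invented for illustration, and the `xi = 0` case is handled as the exponential limit.

```python
import math

def exceedances(values, threshold):
    """Threshold exceedance = value - threshold, keeping only values above the threshold."""
    return [v - threshold for v in values if v > threshold]

def gpd_cdf(x: float, sigma: float, xi: float) -> float:
    """P(exceedance <= x) under a GPD with scale sigma > 0 and shape xi."""
    if x <= 0.0:
        return 0.0
    if abs(xi) < 1e-12:
        # Limit as xi -> 0: the exponential distribution.
        return 1.0 - math.exp(-x / sigma)
    t = 1.0 + xi * x / sigma
    if t <= 0.0:
        # Only possible when xi < 0: x lies beyond the upper end point.
        return 1.0
    return 1.0 - t ** (-1.0 / xi)

# Made-up daily rainfall totals in mm (illustrative only).
daily_rain = [3.0, 31.5, 0.0, 26.2, 48.9, 12.4]
excess = exceedances(daily_rain, threshold=25.0)
print(len(excess), "days exceed the 25mm threshold")
```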

GPD model fitted to threshold exceedances. Threshold $u$ = 25mm. $\sigma$ and $\xi$ are estimated by maximum likelihood; the estimate of $\sigma$ is 7.70. The estimated value of P(X > 100) corresponds to about once per 131 years.
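The step from a fitted GPD to a statement such as "once per N years" can be sketched as below. The threshold (25mm) and scale (7.70) are taken from the slide, but the shape value `xi = 0.1` and the proportion of days above the threshold (`exceed_rate = 0.05`) are hypothetical placeholders, so the printed return period is not expected to reproduce the 131 years quoted in the talk.

```python
import math

def tail_probability(x, u, sigma, xi, exceed_rate):
    """P(X > x) for x >= u: P(X > u) multiplied by the GPD survival function of x - u."""
    if abs(xi) < 1e-12:
        return exceed_rate * math.exp(-(x - u) / sigma)
    t = 1.0 + xi * (x - u) / sigma
    if t <= 0.0:
        # Only possible when xi < 0: x lies beyond the upper end point.
        return 0.0
    return exceed_rate * t ** (-1.0 / xi)

def return_period_years(p_day, days_per_year=365.25):
    """Average number of years between events that have daily probability p_day."""
    return 1.0 / (p_day * days_per_year)

# u and sigma come from the slide; xi and exceed_rate are assumptions.
p_day = tail_probability(100.0, u=25.0, sigma=7.70, xi=0.1, exceed_rate=0.05)
print(f"P(X > 100) per day: {p_day:.3g}")
print(f"About once per {return_period_years(p_day):.0f} years")
```

The conversion treats days as exchangeable, which ignores any seasonality or clustering in the rainfall series.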

But why is the GPD a good model to use? The mathematical justification is given by asymptotic theory. The theory says that, for almost any random variable X, the exceedances of a high threshold u will tend towards following the GPD model as u tends towards infinity. In practice, we use a threshold that is high but still finite: we rely on the fact that if this level is sufficiently high then the asymptotic result will still be approximately true.

When choosing a threshold, we need to balance two things. Precision: if the threshold is low then more data are used, so our results will tend to be more certain than if it is high. Bias: extreme value methods will only be valid when the threshold is sufficiently high. We can do this in a partly subjective way using parameter stability plots.

Parameter stability plot for shape parameter, $\xi$
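The numbers behind a parameter stability plot can be sketched by re-estimating the shape parameter over a range of thresholds and checking where the estimates settle down. To stay within the standard library, this sketch swaps maximum likelihood for the simpler method-of-moments estimator (valid for a shape below 0.5), and it uses simulated exponential data, for which the true shape is 0. All names and settings here are illustrative assumptions.

```python
import random
from statistics import mean, variance

def gpd_moment_estimates(excesses):
    """Method-of-moments estimates (sigma, xi) for a GPD; a stand-in for maximum likelihood."""
    m = mean(excesses)
    v = variance(excesses)
    ratio = m * m / v
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * m * (1.0 + ratio)
    return sigma, xi

def shape_stability(values, thresholds):
    """For each threshold, return (threshold, number of exceedances, estimated shape)."""
    rows = []
    for u in thresholds:
        excesses = [v - u for v in values if v > u]
        _, xi = gpd_moment_estimates(excesses)
        rows.append((u, len(excesses), xi))
    return rows

# Simulated data with an exponential tail (true shape 0), fixed seed for repeatability.
rng = random.Random(1)
sample = [rng.expovariate(1.0) for _ in range(20000)]
table = shape_stability(sample, [0.5, 1.0, 1.5, 2.0, 2.5])
for u, n, xi in table:
    print(f"threshold={u:.1f}  exceedances={n}  shape estimate={xi:+.3f}")
```

As the threshold rises the number of exceedances falls, so the estimates become noisier; this is the precision versus bias trade-off described above.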

The GEV model. Another approach involves analysing block maxima. For example, if we have hourly sea level data then we may choose to analyse only the largest value that occurs each year: the annual maximum value. The same method can also be used to analyse minima.

A good statistical model for block maxima is the GEV (Generalised Extreme Value Distribution). Its cumulative distribution function is of the form $F(x) = \exp\{-[1 + \xi ((x - \mu) / \sigma)]^{-1/\xi}\}$. There are three parameters, a location parameter $\mu$, a scale parameter $\sigma$, and a shape parameter $\xi$, which need to be estimated.
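A minimal sketch of the GEV distribution function with location `mu`, scale `sigma` and shape `xi`. The `xi = 0` case is treated as the Gumbel limit, and values outside the support return 0 or 1 as appropriate; the function name and the parameter values in the example are illustrative assumptions.

```python
import math

def gev_cdf(x: float, mu: float, sigma: float, xi: float) -> float:
    """P(block maximum <= x) under a GEV with location mu, scale sigma > 0 and shape xi."""
    z = (x - mu) / sigma
    if abs(xi) < 1e-12:
        # Limit as xi -> 0: the Gumbel distribution.
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:
        # Outside the support: below the lower end point (xi > 0)
        # or above the upper end point (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

# Hypothetical parameters: probability that an annual maximum stays below 5.0.
print(round(gev_cdf(5.0, mu=3.0, sigma=0.5, xi=0.1), 4))
```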

The r-largest model. The GEV model uses only one value per block. An extension of this model involves using the r largest values per block, where r is greater than one, e.g. we might model the 20 highest sea levels per year.
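The data preparation for the block maxima and r-largest approaches can be sketched as below: group the observations by block (here, by year) and keep the r largest values in each. Setting `r = 1` gives the block maxima used by the GEV model. The function name and the small data set are made up for illustration, and the sketch ignores the question of whether the r values within a block are independent.

```python
import heapq
from collections import defaultdict

def r_largest_per_block(records, r):
    """records: iterable of (block_label, value). Returns {label: r largest values, descending}."""
    blocks = defaultdict(list)
    for label, value in records:
        blocks[label].append(value)
    return {label: heapq.nlargest(r, values) for label, values in blocks.items()}

# Made-up (year, sea level) observations.
records = [(2001, 1.2), (2001, 3.4), (2001, 2.8), (2002, 2.0), (2002, 0.5)]
print(r_largest_per_block(records, r=2))  # {2001: [3.4, 2.8], 2002: [2.0, 0.5]}
print(r_largest_per_block(records, r=1))  # {2001: [3.4], 2002: [2.0]}
```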

The shape parameter. All of the extreme value models contain a common parameter $\xi$ that determines the shape of the distribution. The extremes of a light tailed distribution will have a negative shape parameter ($\xi < 0$). The extremes of a heavy tailed distribution will have a positive shape parameter ($\xi > 0$). The extreme values of a normal distribution have $\xi = 0$.

GPD: impact of the shape parameter $\xi$ (curves shown for $\xi = 0$, $\xi = 1$ and $\xi = -0.5$)

Covariates. The properties of extreme values may depend on time, location, or other covariates (explanatory variables). We can easily build these covariates into our extreme value models, in much the same way as we would build them into a regression model or GLM. The key difference is that in a GLM we only build covariates into the mean, whereas in EV models we might build them into any of the three parameters.
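One way to build a covariate into an extreme value model is sketched below: a GEV whose location parameter changes linearly with time, `mu(t) = mu0 + mu1 * t`, with the scale and shape held constant. The code only evaluates the negative log-likelihood; in practice this function would be passed to a numerical optimiser to obtain the estimates. The names `gev_log_density` and `trend_nll` and the parameter layout are assumptions made for this sketch.

```python
import math

def gev_log_density(x, mu, sigma, xi):
    """Log of the GEV density at x; returns -inf outside the support."""
    z = (x - mu) / sigma
    if abs(xi) < 1e-12:
        # Gumbel limit as xi -> 0.
        return -math.log(sigma) - z - math.exp(-z)
    t = 1.0 + xi * z
    if t <= 0.0:
        return -math.inf
    return -math.log(sigma) - (1.0 / xi + 1.0) * math.log(t) - t ** (-1.0 / xi)

def trend_nll(params, years, maxima):
    """Negative log-likelihood for annual maxima with location mu0 + mu1 * year."""
    mu0, mu1, sigma, xi = params
    if sigma <= 0.0:
        return math.inf
    total = 0.0
    for year, x in zip(years, maxima):
        total -= gev_log_density(x, mu0 + mu1 * year, sigma, xi)
    return total

# Made-up annual maxima with an upward drift, and a hypothetical parameter guess.
years = [0, 1, 2, 3, 4]
maxima = [1.1, 1.3, 1.2, 1.6, 1.7]
print(round(trend_nll((1.0, 0.15, 0.2, 0.0), years, maxima), 3))
```

Comparing the minimised value with that of a model in which `mu1` is fixed at zero is one way to judge whether the trend is needed.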

Venice sea level data – linear trend in location parameter

More advanced statistical modelling. Methods to deal with clustering: e.g. declustering algorithms, estimation of the extremal index. Semiparametric modelling: allow trends to vary smoothly over time, using local likelihood or smoothing splines. Bayesian methods: allow for the incorporation of prior information, and for the construction of relatively complicated hierarchical models.

Example of semiparametric modelling: estimated trends in storm surge levels at Dover

Software. Add-on packages are available for R (extRemes, ismev, evir, evd, evdbayes), Splus (EVIS, S+FinMetrics) and Matlab (EVIM, EXTREMES). The extremes toolkit provides a user-friendly interface. Some methods are also available in Genstat. Stand-alone commercial software: Xtremes, HYFRAN.

Should I be using EVT?
Advantages. Robust: relies on weak assumptions; avoids bias. Theoretically sound: justified by asymptotic theory. Quick & relatively easy to use. Honest about the uncertainties involved in making statements about very rare events.
Disadvantages. Inefficient: most of the data are thrown away, so we may over-estimate uncertainty, and the approach relies on having a large sample size. Asymptotics: the theory only holds exactly for infinitely extreme events. Difficult to extend to the multivariate case. Data quality: sensitive to errors in extreme data.

3. Applications

Environmental sciences. EVT is widely used by scientists working in hydrology, climatology, oceanography and fire science. It is also used for operational purposes in flood risk assessment and civil engineering. There is particular interest in studying the impact of climate change upon extreme events, e.g. the MICE project and the WASA project (Waves & Storms in the NE Atlantic).

Photo: Thames Barrier, London (source: Roger Haworth)

Risk assessment and design. Extreme value problems in hydrology and coastal engineering are often phrased in terms of return levels. N-year return level: the level that is exceeded with probability 1/N in a particular year. The definition applies to nonstationary processes too, but interpretation is harder. E.g. Thames Barrier: “…was originally designed to protect London against a flood level with a return period of 1000 years in the year 2030…” (Wikipedia)
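Under a GEV model for annual maxima, the N-year return level is the quantile with non-exceedance probability 1 - 1/N, which has a closed form. The sketch below assumes a stationary GEV and uses hypothetical parameter values; the function name `gev_return_level` is illustrative.

```python
import math

def gev_return_level(n_years: float, mu: float, sigma: float, xi: float) -> float:
    """Level exceeded with probability 1/n_years in any one year under a stationary GEV."""
    y = -math.log(1.0 - 1.0 / n_years)
    if abs(xi) < 1e-12:
        # Gumbel limit as xi -> 0.
        return mu - sigma * math.log(y)
    return mu - (sigma / xi) * (1.0 - y ** (-xi))

# Hypothetical GEV parameters for annual maximum sea level (metres).
for n in (10, 100, 1000):
    level = gev_return_level(n, mu=3.0, sigma=0.5, xi=0.2)
    print(f"{n}-year return level: {level:.2f}")
```

With a positive shape parameter the return level grows quickly with N, which is why estimates of the shape matter so much for design levels such as a 1000-year event.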

Biology. Biologists are also often interested in studying the properties of extreme or rare events, but rarely use EVT. Some likely reasons: relatively small sample sizes (compared to e.g. hydrology); extreme events not so easily defined in quantitative terms. New applications are likely to arise from the increasing use of large datasets (e.g. in genetics), and from an increased focus on quantitative risk assessment.

Genetics. A major application of EVT is in sequence alignment, and extreme value models are used by BLAST and FASTA. To compare a sequence against a vast database of known sequences: 1. define a similarity score; 2. search for the best match within the database; 3. use EVT to evaluate the significance of this match. “…a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences…” (Wikipedia)

Ecology. Review papers by Gaines & Denny (1993) and Katz et al. (2005) focus on disturbance: studying the extremes of environmental processes that are known to lead to ecological disturbance, e.g. sediment rates, fire sizes, frost days. They also consider longevity & survival, i.e. studying the maximum lifespan or size of an individual.

Photo: bumblebee on Echinacea purpurea

Possible new applications in ecology. Dispersal & spread: spatial spread (of diseases, pollen, invasive species) is known to be influenced by long-range dispersal; can EVT be used to analyse dispersal data? Population dynamics: estimating the probability of extinction or explosion of a population. Ecological modelling: study the properties of extreme events simulated by complex process-based ecological models, e.g. mass extinction events.

Other areas where EVT is used. Finance and insurance: in particular, calculation of Value at Risk. Telecommunications: e.g. estimation of very large file sizes in internet traffic. Sport science: trends in record times for athletics. …and many, many more…

4. Current research

Extreme value theory remains an area of active methodological research, with two key strands: 1) Improving the practical utility of existing extreme value methods by making use of recent developments in statistics and computing, e.g. Bayesian extremes. 2) Developing methods for multivariate extremes; this involves much theoretical work.

Multivariate extremes. Standard (“univariate”) extreme value methods concentrate on the extremes of a single random variable. Multivariate extreme value theory studies how the values of different variables are related at extreme levels. The different random variables may relate to genuinely different processes (e.g. tide and waves) or to the same process at different locations (spatial extremes).

Some applications: 1) Calculating the risk that there will be a fall in the overall value of a portfolio of investments. 2) Assessing regional flood risk, e.g. estimating the probability that a severe flood will occur at one or more locations within a region. 3) Evaluating the probability that two atmospheric pollutants will simultaneously reach hazardous levels.

Two random variables $X_1$ and $X_2$ may either be… Asymptotically dependent: the most extreme values of $X_2$ tend to occur when $X_1$ is also extreme. Asymptotically independent: the most extreme values of $X_2$ tend to occur when $X_1$ is not extreme, and vice-versa.
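One commonly used exploratory summary of extremal dependence, not named in the talk and introduced here as an assumption, is the conditional probability that $X_2$ exceeds its own high quantile given that $X_1$ exceeds its own. Values that stay well above zero as the quantile level rises suggest asymptotic dependence; values that shrink towards zero suggest asymptotic independence. The sketch below estimates it empirically on simulated data; the function name `empirical_chi` and the simulation settings are illustrative.

```python
import random

def empirical_chi(x1, x2, q):
    """Estimate P(X2 > its q-quantile | X1 > its q-quantile) from paired samples."""
    n = len(x1)
    k = min(int(q * n), n - 1)
    u1 = sorted(x1)[k]  # empirical threshold for X1
    u2 = sorted(x2)[k]  # empirical threshold for X2
    n_first = sum(1 for a in x1 if a > u1)
    n_both = sum(1 for a, b in zip(x1, x2) if a > u1 and b > u2)
    return n_both / n_first if n_first else float("nan")

rng = random.Random(7)
a = [rng.gauss(0.0, 1.0) for _ in range(20000)]
b = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # generated independently of a

chi_same = empirical_chi(a, a, 0.95)    # perfectly dependent pair
chi_indep = empirical_chi(a, b, 0.95)   # independent pair
print(chi_same, round(chi_indep, 3))
```

This is a descriptive diagnostic only; it does not by itself provide a formal test of asymptotic dependence or independence.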

There is a rich mathematical theory about asymptotic dependence, providing us with statistical models that we can use if we are prepared to make this assumption. There is little theory about asymptotic independence, and practical techniques for dealing with data that exhibit this have only been developed in the past 5-10 years. There are relatively few practical techniques for testing whether your data exhibit asymptotic (in)dependence.

Thank you for listening!