
Review of statistics and frequency analysis
Academic year 2009 - 2010
Associate Professor: Dr. P.H.A.J.M. van Gelder, TU Delft, Faculty of Civil Engineering and Geosciences; UNESCO-IHE, Guest Lecturer Year 2009
Lectures on:
October 27th 8.45 – 10.30h
October 28th 15.45 – 17.30h
November 5th 15.45 – 17.30h
November 6th 8.45 – 10.30h
A selection of these slides will be presented during the course.
Contact for students: Dr.ir. P.H.A.J.M. van Gelder, room: 3.87, ext: 86544

Overall outline of the course Review of statistics and frequency analysis Data analysis, random variables, classification, stat. moments, frequency distributions; samples, populations and probability models; parameter estimation and confidence intervals.

Introduction to this Course
1 + 1 lecture periods (basic statistics), 1 + 1 lecture periods on frequency analysis, and a written exam
Dialogue instead of monologue (ask questions!)
Complete PowerPoint presentations can be found on Van Gelder’s website
Brief self-introduction:
–Background (name, country, university education etc.)
–Interests, experiences with statistics?
–What do you hope to get out of this part of the course?
–Who has experienced hydrological extremes?

Outline for today Introduction on natural hazards and probabilistic design Data sets for river– and coastal engineers Theory on: –Probability and events –Random (or stochastic) variables –Transformations –Multivariate distributions

Number of Floods worldwide Source: Dartmouth Flood Observatory, 2003

Regional Distribution of Large Floods Source: Dartmouth Flood Observatory, 2003 1999-2002 1985-1988

Reasons for Concern Source: Smith et al 2001, TAR IPCC WG II

Does Global Warming Lead to an Intensification of the Hydrological Cycle and More Extreme Events?
More flood-producing rain (but regional differences)
Longer rainless periods and higher evaporation demand of the atmosphere
There are indications for:
–Changing regimes, flood timing
–More frequent floods, landslides etc.
–No reduction of droughts, but decreased summer low flows, e.g. in western Europe

Effects of changing Mean and Variance of Precipitation on Stream Flow (i.e. non-linear processes, thresholds, etc.) (from Middelkoop 2005, after Arnell 1996)

Frequency analyses can be done for …
High flows
–Flood peak discharge
–Flood volume
–Occurrence of high flows in certain periods/months
Low flows
–Minimum low flow discharge
–Runoff deficits
Groundwater levels, groundwater fluctuations etc.
Estimated cost of damage caused by natural hazards
Simulated variables (model outputs)
Daily rainfall, rainfall intensities etc.
….
Let’s start with examples of flood peak discharges

How do we measure floods? (From: Hornberger et al., 1998)

Rating curve (From: Hornberger et al., 1998)

Application of rating curve to measured water levels at a gauge (From: Hornberger et al., 1998)

…. that is not always that easy! (The big flood in the HJ Andrews 1996)

Continuous measurement of discharges (incl. floods and low flows) (From: Hornberger et al., 1998)

Floods in 2002; river Elbe, city of Dresden, Germany

Flood defence structures Storm Surge Barrier Oosterschelde (NL)

Maeslantkering - storm surge barrier (NL)

Variability in annual precipitation (NL)

Variability in daily river discharges (Meuse river, NL)

Example: Evaluation of hydrological extremes (flood runoff); Case in Austria 2005

Example: Evaluation of floods in a regional context (Case study: Austria 2005)

Ongoing EU projects on Flood Risks
EFFS: A European Flood Forecasting System
–to develop a prototype of a European flood forecasting system for 4-10 days in advance, which could provide daily information on potential floods for large rivers, and flash floods in small basins.
SPHERE: Systematic, Palaeoflood and Historical data for the improvement of flood Risk Estimation
–to develop a new approach which complements hydrologic modelling with the application of historical and palaeoflood hydrology to extend the temporal framework of the largest floods to time spans from decades to millennia, in order to improve the estimation of extreme flood occurrences.

THARMIT: Torrent Hazard Control in the European Alps
–to develop practical tools and methodologies for hazard assessment, prevention and mitigation, and to devise methods for saving and monitoring potentially dangerous areas.
CARPE DIEM: Critical assessment of Available Radar Precipitation Estimation techniques and Development of Innovative approaches for Environmental Management
–to improve real-time estimation of radar rainfall fields for flood forecasting, by coupling multi-parameter polarisation radar data and NWP, and exploiting NWP results in order to improve the interpretation of radar observations.

IMPACT: Investigation of Extreme Flood Processes and Uncertainty –to investigate extreme flood and defense failure processes, their risk and uncertainty. Will consider dam breach formation, sediment movement, flood propagation and predictive models, within an overall framework of flood risk management. GLACIORISK: Survey and Prevention of Extreme Glaciological Hazards in European Mountainous Regions –to develop scientific studies for detection, survey and prevention of glacial disasters in order to save lives and reduce damages.

SAFERELNET: Risk assessment of natural hazards in Europe
MITCH: Mitigation of Climate Induced Hazards
–dealing with the mitigation of natural hazards with a meteorological cause, in order to assist planning and management. The main focus will be on flood forecasting and warning, but it will also include other flood related hazards, such as landslips and debris flow, and longer term climate hazards, such as drought, and the possible impact of climate change on the frequency and magnitude of hazards.
ADC-RBM: Advanced Study Course in River Basin Modelling for Flood Risk Mitigation - June 2002

FLOODMAN: Near real-time flood forecasting, warning and management system based on satellite radar images, hydrological and hydraulic models and in-situ data –near real-time monitoring of flood extent using spaceborne SAR, optical data & in-situ measurements, hydrological and hydraulic model data. The result will be an expert decision system for monitoring, management and forecast of floods in selected areas in Europe. The monitoring will also be used to update the hydrological/hydraulic models and thereby improving the quality of flood forecasts.

FLOODSITE: The FLOODsite project covers the physical, environmental, ecological and socio-economic aspects of floods from rivers, estuaries and the sea. The project is arranged into seven themes covering:
–Risk analysis – hazard sources, pathways and vulnerability of receptors.
–Risk management – pre-flood measures and flood emergency management.
–Technological integration – decision support and uncertainty.
–Pilot applications – for river, estuary and coastal sites.
–Training and knowledge uptake – guidance for professionals, public information and educational material.
–Networking, review and assessment.
–Co-ordination and management.

Extreme Events; Two very realistic simulations 1. River dike failure in the Netherlands 2. Asteroid impact in the Atlantic Ocean

The role of Statistics in Water Engineering
Properties of the hydrological system
Data analysis and statistics (later also modeling etc.)
Statistics of hydrological variables for decision support

We need information about …
Water balance: P = R + ET + dS/dt
Variability and heterogeneity of hydrological variables (groundwater levels, precip. patterns etc.)
Hydrological extremes: droughts, floods, x-year flood
Scenarios for:
–Land use change
–Climate change
–Different water management strategies
–etc.
Statistical Analysis !!

4 types of error occur when measuring a hydrological variable
1. Operation and function errors: malfunction of measuring instrument, personal (human) error.
2. Random error: caused by numerous minor impacts, partly independent from each other. If repeated frequently, the values fluctuate around the true value.
3. Constant systematic error: inherent in any kind of equipment (e.g., wrong installation of instruments, wrongly indicated zero point, incorrect rating curve etc.); constant with respect to time, but may vary according to the measuring range.
4. Variable systematic error: usually caused by insufficient control during the measuring period; the origin is mostly in the instrument or its surroundings (e.g., “drift” of the device, growing of plants at the location of measurement etc.). Can be avoided through continuous comparison of the measurements and repeated calibration of the instruments.
Systematic errors cannot be reduced by increasing the number of measurements if equipment and measuring conditions remain the same!

Structural design principles Old methods: –determine a worst case load –determine a worst case strength –determine the geometry of the structure

Disadvantages of old method Unknown how safe the structure is No insight in contribution of different individual failure mechanisms No insight in importance of different input parameters Uncertainties in variables cannot be taken into account Uncertainties in the physical models cannot be taken into account

Failure mechanisms of a dike

Design of a structure Random boundary conditions

Fault tree with AND and OR

Mathematics of AND and OR
In case of an AND-gate, you should multiply the probabilities
In case of an OR-gate, you should add the probabilities (and subtract the product of the two probabilities)
Important condition: This is only true when both mechanisms are fully (statistically) independent
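The two gate rules can be sketched in a few lines of Python (not the Matlab used elsewhere in these slides). The failure probabilities and mechanism names below are hypothetical, and both functions assume fully independent mechanisms, as the slide requires.

```python
def p_and(p_a, p_b):
    """AND-gate: both independent events must occur."""
    return p_a * p_b


def p_or(p_a, p_b):
    """OR-gate: at least one of two independent events occurs."""
    return p_a + p_b - p_a * p_b


# Hypothetical failure probabilities of two independent mechanisms
p_overtopping = 0.01
p_piping = 0.02

print("AND-gate:", p_and(p_overtopping, p_piping))
print("OR-gate: ", p_or(p_overtopping, p_piping))
```

For small probabilities the OR-gate result is close to the plain sum, because the subtracted product is an order of magnitude smaller.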

Example of dependence Modern wave run-up formula is: (for shallow water, last equation is somewhat different and implicit)

Example of dependence (2)
So the answer depends on H and T
But in a single wave field, T = f(H), for example: T = 3.9 * H^0.376
This can be modelled as: T = A * H^B, in which both A and B are stochastic variables with a mean and standard deviation

Example of dependence (3) But this is only true in case of a single wave field (wind waves OR swell waves) When there are more wave fields H and T are NOT statistically independent There is no good model for run-up due to double peaked spectra, but there is an approximation by Van der Meer

Example of dependence (4)

Two approaches
First approach: start at the bottom and calculate the probability of failure according to normal design practice
Second approach: start at the top and assign probabilities to the failure mechanisms

Two approaches (2)
Starting at the bottom, you usually do not arrive at the required overall failure probability
Starting at the top, you usually end up with some elements that cannot be constructed
So in practice, you have to use a mixture of both

Sensitivity analysis
What is the effect of a 10% change in an input on the output?
This determines how important an input parameter is
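A minimal Python sketch of such a 10% sensitivity check, applied for illustration to the wave period relation T = 3.9 * H^0.376 from the dependence example. The function names and the wave height of 2 m are assumptions made for this sketch.

```python
def wave_period(h):
    """Wave period T (s) from wave height H (m): T = 3.9 * H**0.376."""
    return 3.9 * h ** 0.376


def relative_sensitivity(model, x, change=0.10):
    """Relative change in the model output for a relative change in the input x."""
    base = model(x)
    return (model(x * (1.0 + change)) - base) / base


h = 2.0  # hypothetical wave height in m
effect = relative_sensitivity(wave_period, h)
print(f"10% more wave height gives {100 * effect:.1f}% more wave period")
```

Because the relation is a power law, the relative effect equals 1.1^0.376 - 1 regardless of the chosen H; for other models the result depends on the point at which the input is perturbed.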

End of introduction

Data availability
The Internet offers a huge source of past and real-time data

The theory will be explained with examples and data sets taken from river engineering:
The Global Runoff Data Centre http://grdc.bafg.de/servlet/is/910/
Mediterranean Hydrological Cycle Observing System (Med-HYCOS project) http://medhycos.mpl.ird.fr
UNESCO International Hydrological Programme http://webworld.unesco.org/water/ihp/db/
“The Global River Discharge Database” (RIVDIS) http://www.rivdis.sr.unh.edu

Local Websites
Ministry of Water Resources of China http://www.mwr.gov.ch/english/index.asp
Ministry of Water Resources of India http://mowr.gov.in
Water Commission of India http://cwc.nic.in
These sites were useful for obtaining basic information about river basins, but less useful for downloading discharge data
Apart from websites, data is also published in National Water Resources Books

Datasets for river engineers The longest period of observation is recorded for the river Nemunas at Smalinninkai: 1812-2003 (LT). The majority of European rivers have observation records dating from the period 1910-1920 and continuing to 1999-2004. On most Asian rivers water discharges have been observed since the period 1930-1940, although the river Bia at Biisk (Russia) has a record 108 year period of observation (1895 to 2003). The shortest period of observation is found on the Indian rivers (1939 -1979), the Chinese rivers (1930-1985) and the Iranian rivers (1963-1985).

Your data for the exercise is available at: http://www.citg.tudelft.nl/live/pagina.jsp?id=418a276e-b63e-4cec-a6fe-763feb04f984&lang=en

Some snap shots

Data for coastal engineers
www.oceanor.no/
www.knmi.nl/onderzk/oceano/waves/era40/license.cgi
www.globalwavestatisticsonline.com/
http://www.golfklimaat.nl
http://www.actuelewaterdata.nl
http://www.hydraulicengineering.tudelft.nl/public/gelder/paper56-data3.zip

Hm0, H1/3,HTE3, Tm02, TH1/3, Th0, wind direction, wind speed, water level, surge

Important data source for Dutch data

Some snap shots

real time wave data; significant wave heights

wave periods

Water level gauges in Mid West Netherlands

Water levels and astronomical tide

The wave data on the wave climate site is available in data files per year (now: 1979 - 2002). The files contain wave data in the following format:
19880221 0100 90 4 84 12 52 66 351 10 -6 3433402
19880221 0400 76 3 73 15 53 65 339 -95 -2 3433402
19880221 0700 67 3 66 10 42 53 346 -31 -1 3433402
19880221 1000 67 3 64 12 40 48 344 51 -8 3433402
19880221 1300 66 3 65 11 36 44 308 -3 0 3433402
19880221 1600 75 3 73 12 40 47 298 -93 6 3433402
19880221 1900 91 4 79 13 41 48 325 0 3 3433402
19880221 2200 86 4 80 11 41 47 334 95 0 3433402
19880222 0100 84 5 81 13 42 50 323 38 4 0400402
…….. etc.

The files are arranged as follows:
Column nr.   Name                                                        Unit
1            Date                                                        [yyyymmdd]
2            time                                                        [hhmm] MET !
3            wave height Hm0                                             [cm]
4            accuracy wave height Hm0 (standard deviation)               [cm]
5            wave height H1/3                                            [cm]
6            wave height HTE3                                            [cm]
7            wave period Tm02                                            [0.1 s]
8            wave period TH1/3                                           [0.1 s]
9            wave direction Th0                                          [degrees], nautical [1]
10           water level                                                 [cm] NAP/MSL
11           surge                                                       [cm]
12           code number which indicates the origin of the given value   [-]
[1] Nautical degrees: (from) North = 0 degrees, (from) East = 90°, South = 180°, West = 270° and North again = 360°.
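As an illustration of the column layout above, a record can be split and converted to SI units as follows. This is a Python sketch; the function name and the dictionary keys are made up for this example, and the origin code is kept as text so that a leading zero is not lost.

```python
from datetime import datetime


def parse_wave_record(line):
    """Parse one whitespace-separated record of the yearly wave data files."""
    f = line.split()
    return {
        "time": datetime.strptime(f[0] + f[1], "%Y%m%d%H%M"),  # MET
        "Hm0_m": int(f[2]) / 100.0,            # cm -> m
        "Hm0_std_m": int(f[3]) / 100.0,        # cm -> m
        "H13_m": int(f[4]) / 100.0,            # cm -> m
        "HTE3_m": int(f[5]) / 100.0,           # cm -> m
        "Tm02_s": int(f[6]) / 10.0,            # 0.1 s -> s
        "TH13_s": int(f[7]) / 10.0,            # 0.1 s -> s
        "Th0_deg": int(f[8]),                  # nautical degrees
        "water_level_m": int(f[9]) / 100.0,    # cm -> m, relative to NAP/MSL
        "surge_m": int(f[10]) / 100.0,         # cm -> m
        "origin_code": f[11],                  # kept as text
    }


record = parse_wave_record("19880221 0100 90 4 84 12 52 66 351 10 -6 3433402")
print(record["time"], record["Hm0_m"], record["Tm02_s"], record["Th0_deg"])
```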

Probability P(A) = probability of event A Mathematical definition Frequentistic definition

Mathematical definition
Axioms:
1. P(A) ≥ 0
2. P(Ω) = 1, with Ω the certain event (the set of all possible outcomes)
3. P(A or B) = P(A) + P(B) (if A and B are mutually exclusive)

Frequentistic definition
P(A) = N(A) / N
in which:
N(A): number of experiments leading to A
N: total number of experiments
example: probability that a consumer product fails within 1 year after production

example interpretation
P(A) = n(A) / N
P: probability
n(A): number of outcomes in A
N: total number of outcomes
P(A) = 4 / 24 = 1 / 6

example dice
P(x = 4) = 1/6
P(x ≥ 5) = 2/6
P(x even) = 3/6

Some history
1650 Pascal / Fermat
1750 Bernoulli / Bayes
1850 Venn / Boole
1920 Von Mises
1960 Savage / Lindley: decision making
1970 Benjamin / Cornell: decision making

1960’s and onwards ‘Years ago a statistician might have claimed that statistics deals with the processing of data; today statisticians will be more likely to say that statistics is concerned with decision making in the face of uncertainty.’

probability calculation calculation of a probability from other probabilities

Joint events
Union: A or B (A ∪ B)
Intersection (cross section): A and B (A ∩ B)
Implication: A in B (A ⊂ B)
Negation (denial): not A

Union P(A or B)
P(A or B) = P(A) + P(B) - P(A and B)
13/24 = 6/24 + 9/24 - 2/24

Intersection P(A and B)
P(A and B) = n_AB / n = (n_A / n) * (n_AB / n_A) = P(A) * P(B | A) = 6/24 * 2/6 = 2/24

Conditional probability
P(A | B) = probability of A given the fact that event B has occurred
P(A and B) = P(B) P(A | B)
P(A | B) = P(A and B) / P(B)
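The counts used in the union and intersection slides (24 outcomes in total, 6 in A, 9 in B and 2 in both) can be worked through exactly. A short Python sketch with exact fractions; the variable names are chosen for this example only.

```python
from fractions import Fraction

# Counts from the slides: 24 outcomes, 6 in A, 9 in B, 2 in both A and B
n, n_a, n_b, n_ab = 24, 6, 9, 2

p_a = Fraction(n_a, n)
p_b = Fraction(n_b, n)
p_a_and_b = Fraction(n_ab, n)

p_a_or_b = p_a + p_b - p_a_and_b   # union
p_b_given_a = p_a_and_b / p_a      # conditional probability of B given A
p_a_given_b = p_a_and_b / p_b      # conditional probability of A given B

print(p_a_or_b, p_b_given_a, p_a_given_b)
```

Note that P(A | B) and P(B | A) differ here, because A and B contain different numbers of outcomes.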

Conditional probability
P(rain in Delft on Sept. 18, 2024)?
P(rain in Delft on Sept. 18, 2024 | rain in Amsterdam on Sept. 18, 2024)?
P(rain in Delft on Sept. 18, 2024 | rain in Cape Town on Sept. 18, 2024)?

example: dice

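Independence
A and B are independent if P(A | B) = P(A)
In that case: P(A and B) = P(A) P(B)

Important rules
Theorem of total probability: $P(A) = \sum_i P(A \mid B_i)\, P(B_i)$, for mutually exclusive and collectively exhaustive events $B_i$
Generalisation to continuous integral “in which the uncertainty is integrated out”: $P(A) = \int P(A \mid X = \xi)\, f_X(\xi)\, d\xi$
Theorem of Bayes: $P(B_i \mid A) = P(A \mid B_i)\, P(B_i) / P(A)$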

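example: quiz dilemma
Car in A, B or C
U: chooses A
QM: Good that you didn’t choose B, because it is empty. Would you still like to switch to C?

Quiz-dilemma
Theorem of Bayes: P(C | info) = P(info | C) P(C) / P(info) = 1 × (1/3) / (1/2) = 2/3
Yes, switch to C!
Notes:
“info” is the event that the QM opens the empty box B after you chose A.
P(info) = 1/2, by the theorem of total probability: P(info) = P(info | A) P(A) + P(info | B) P(B) + P(info | C) P(C) = 1/2 × 1/3 + 0 × 1/3 + 1 × 1/3 = 1/2 (assuming the QM picks at random between B and C when the car is in A).
P(info | C) = 1, because if the car is in C and you chose A, B is the only empty box the QM can open.
A clever student is invited to write a simulation programme to find out if this is indeed true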
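Before simulating, the Bayes update for the quiz dilemma can also be computed exactly. The Python sketch below assumes, as a modelling choice, that the QM opens B or C with equal probability when the car is in the chosen box A; the dictionary names are hypothetical.

```python
from fractions import Fraction

# Prior: the car is equally likely to be in A, B or C; the player chose A
prior = {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)}

# Likelihood of "QM opens the empty box B" given the location of the car
likelihood = {
    "A": Fraction(1, 2),  # QM may open B or C (assumed: random choice)
    "B": Fraction(0),     # QM never opens the box with the car
    "C": Fraction(1),     # B is the only empty box the QM can open
}

# Theorem of total probability, then theorem of Bayes
p_info = sum(likelihood[box] * prior[box] for box in prior)
posterior = {box: likelihood[box] * prior[box] / p_info for box in prior}

print(p_info, posterior["A"], posterior["C"])
```

Staying with A keeps the original probability of 1/3, while switching to C wins with probability 2/3.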

Solution of Jeroen van den Bos Development of 2 Matlab scripts Quiz_noinfo.m (choose a box and check if the car is there with no information from the QM) Quiz_info.m (choose a box and update your choice when the QM gives his information on another box)

Quiz_noinfo.m
clear;
N = 2000;                  % number of simulated games
NoOfBoxes = 3;
NoSuccess = 0;
Fr_Success = zeros(1,N);   % preallocate the running success frequency
for i = 1:N
    Box_Car = fix(1+rand(1)*NoOfBoxes);              % car is put randomly in a box
    Box_Guess = fix(1+rand(1)*NoOfBoxes);            % random choice of a box
    NoSuccess = NoSuccess + (Box_Car == Box_Guess);  % if guess is right, increase # of successful attempts
    Fr_Success(i) = NoSuccess/i;                     % frequency of success after i attempts
end
P_Success = Fr_Success(N)  % final result
% output
% ------
plot(1:N,Fr_Success,'.',[0 N],[1 1]/NoOfBoxes)
axis([0 N 0 1])
legend('Simulation result',['P = 1/' num2str(NoOfBoxes)])

Quiz_info.m
clear;
N = 2000;                  % number of simulated games
NoOfBoxes = 3;
NoSuccess_Stay = 0;
NoSuccess_Switch = 0;
Fr_Success_Stay = zeros(1,N);
Fr_Success_Switch = zeros(1,N);
for i = 1:N
    Box_Car = fix(1+rand(1)*NoOfBoxes);        % car is put randomly in a box
    Box_1stGuess = fix(1+rand(1)*NoOfBoxes);   % random choice of a box
    k = 0;                                     % select possible empty boxes
    for j = 1:NoOfBoxes
        if (Box_Car ~= j) && (Box_1stGuess ~= j)   % empty box cannot be 'car' or 'guess'
            k = k + 1;
            Empty_Boxes(k) = j;                % vector of empty boxes
        end
    end
    Box_Empty = Empty_Boxes(fix(1+rand(1)*k)); % choose randomly from empty boxes
    k = 0;                                     % select possible alternatives
    for j = 1:NoOfBoxes
        if (Box_Empty ~= j) && (Box_1stGuess ~= j) % alt box cannot be 'empty' or 'guess'
            k = k + 1;
            Alternative_Boxes(k) = j;          % vector of alternative boxes
        end
    end
    Box_Alternative = Alternative_Boxes(fix(1+rand(1)*k));   % choose randomly from alternative boxes
    NoSuccess_Stay = NoSuccess_Stay + (Box_Car == Box_1stGuess);        % 'stay' strategy: 1st guess is right
    NoSuccess_Switch = NoSuccess_Switch + (Box_Car == Box_Alternative); % 'switch' strategy: alt. guess is right
    Fr_Success_Stay(i) = NoSuccess_Stay/i;     % frequency of success after i attempts
    Fr_Success_Switch(i) = NoSuccess_Switch/i; % frequency of success after i attempts
end
P_Success_Stay = Fr_Success_Stay(N)      % final result
P_Success_Switch = Fr_Success_Switch(N)  % final result
% output
% ------
plot(1:N,Fr_Success_Stay,'r.',1:N,Fr_Success_Switch,'b.',[0 N],[1 1]*P_Success_Stay,'r--',[0 N],[1 1]*P_Success_Switch,'b--')
axis([0 N 0 1])
%legend('Simulation result Stay','Simulation result Switch',['P_{Stay} = ' num2str(P_Success_Stay)],['P_{Switch} =', num2str(P_Success_Switch)],'location','EastOutside')
legend(['"Stay" strategy (P_{success} = ' num2str(P_Success_Stay) ')'],['"Switch" strategy (P_{success} = ' num2str(P_Success_Switch) ')']);

Updating process If new information becomes available, new estimates can be made about the failure probability of a system

stochastic variables

What is a stochastic variable? probability distributions Fast characteristics Distribution types Two stochastic variables

stochastic variable
Quantity with uncertainty:
–Natural variation
–Lack of statistical data
–Schematizations
example:
–strength of concrete
–outcome of a die
–temperature in Delft on September 16th, 2014

Relation with events
uncertainty can be expressed with probabilities
probability that stochastic variable X
–is smaller than x
–is larger than x
–is equal to x
–is in the interval [x, x+Δx]
–etc.

probability distribution
probability distribution function = probability P(X ≤ ξ): F_X(ξ) = P(X ≤ ξ), with X the stochast and ξ the dummy variable
[Figure: F_X(ξ) as a function of ξ, rising from 0 to 1]

probability density This is a probability density function

probability density
Differentiation of F to ξ: f_X(ξ) = dF_X(ξ) / dξ
f = probability density function
f_X(ξ) dξ = P(ξ < X ≤ ξ + dξ)

[Figure: F_X(ξ) and f_X(ξ); a strip of width dξ under f_X represents P(ξ < X ≤ ξ + dξ), and the area under f_X to the left of ξ represents P(X ≤ ξ)]

Discrete and continuous

Complementary probability distribution
Complementary cumulative distribution function (ccdf) of variable S: P(S > x) = 1 - P(S ≤ x) = 1 - F_S(x)
P{S > S_d} = P_target
[Figure: ccdf with the design value S_d in the tail of the distribution]

Fast characteristics
μ_X: mean
σ_X: standard deviation, indication for spreading
[Figure: probability density f_X(x) with μ_X and σ_X marked]

Mean ≠ maximum (mode)
Median is the value m for which P(X<m)=50%
[Figure: skewed probability density f_X(x) with μ_X and σ_X marked]

Mean: $\mu_X = E(X) = \int \xi\, f_X(\xi)\, d\xi$
Variance: $\sigma_X^2 = E((X - \mu_X)^2) = \int (\xi - \mu_X)^2 f_X(\xi)\, d\xi$
Standard deviation: $\sigma_X = \sqrt{\sigma_X^2}$
Variation coefficient: $V_X = \sigma_X / \mu_X$

distribution types Uniform distribution Normal distribution Lognormal distribution Gumbel distribution Weibull distribution Gamma distribution ….

Uniform distribution
f_X(ξ) = 1/(b-a) for a ≤ ξ ≤ b
area = total probability = 1
mean μ = (a+b)/2
standard deviation σ = (b-a)/√12

Matlab demonstration
Generate numbers from a Uniform distribution on [0, 1]
Make a histogram (observe the variability around its mean)
Calculate the mean value and standard deviation
They should converge to 0.5 and 0.2887 (1/sqrt(12))
This is indeed confirmed by the simulation
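The same demonstration can be sketched in Python instead of Matlab. The sample size, the seed and the function name are arbitrary choices for this illustration; the histogram step is left out to keep the sketch free of plotting libraries.

```python
import random
import statistics


def uniform_sample_stats(n, seed=1):
    """Mean and standard deviation of n draws from a uniform distribution on [0, 1)."""
    rng = random.Random(seed)
    sample = [rng.random() for _ in range(n)]
    return statistics.fmean(sample), statistics.pstdev(sample)


for n in (100, 10_000, 100_000):
    mean, sd = uniform_sample_stats(n)
    print(n, round(mean, 4), round(sd, 4))
```

With increasing n the two estimates approach the theoretical values (a+b)/2 = 0.5 and (b-a)/√12 ≈ 0.2887.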

Normal distribution
Example: bending strength of timber (N/mm2), normal distribution with mean = 37 N/mm2 and standard deviation = 8.6 N/mm2
[Figure: probability density function of the bending strength, with μ_X and σ_X marked]

Normal distribution in CDF domain

Normal distribution in linearised CDF domain

normal distribution
probability density: $f_X(\xi) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{\xi-\mu}{\sigma}\right)^2\right)$
probability distribution: $F_X(\xi) = \int_{-\infty}^{\xi} f_X(t)\, dt$
in which:
μ: mean
σ: standard deviation (σ > 0)
ξ: dummy variable (−∞ < ξ < ∞)

standard normal distribution
normal distributed variable X: mean μ, standard deviation σ
standard normal distributed variable u: $u = (X - \mu)/\sigma$
probability density: $\varphi(u) = \frac{1}{\sqrt{2\pi}} \exp(-u^2/2)$
probability distribution: $\Phi(u) = \int_{-\infty}^{u} \varphi(t)\, dt$
table or

This table can be used in both directions:
1. Given an x value, what is the corresponding exceedance probability?
2. Given a probability, what is the corresponding x value?
Note that the table only describes the right hand tail of the standard normal distribution. The left hand tail can be obtained by symmetry around the point (0, 0.5).
For ordinary normal distributions, always scale back to a standard normal distribution (by subtracting the mean value and dividing by the standard deviation)
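Both directions of the table lookup can be reproduced numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the two helper names are made up for this example, and the timber strength parameters (mean 37 N/mm2, standard deviation 8.6 N/mm2) are taken from the normal distribution example in these slides.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(0.0, 1.0)


def exceedance_probability(x, mean=0.0, sd=1.0):
    """P(X > x) for a normal variable, via scaling to the standard normal."""
    u = (x - mean) / sd
    return 1.0 - STD_NORMAL.cdf(u)


def value_for_exceedance(p, mean=0.0, sd=1.0):
    """The x value that is exceeded with probability p."""
    u = STD_NORMAL.inv_cdf(1.0 - p)
    return mean + sd * u


# Direction 1: given x, find the exceedance probability
print(exceedance_probability(50.0, mean=37.0, sd=8.6))

# Direction 2: given a probability, find x
print(value_for_exceedance(0.05, mean=37.0, sd=8.6))
```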

normal distribution Why so popular? Central limit theorem Sum of many variables (i.i.d.) is (almost) normally distributed. Y = X 1 + X 2 + X 3 + X 4 + …. i.i.d. = independent identically distributed

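Normal Distributions
A continuous rv X is said to have a normal distribution with parameters μ and σ, where −∞ < μ < ∞ and σ > 0, if the pdf of X is $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$, −∞ < x < ∞

Standard Normal Distributions
The normal distribution with parameter values μ = 0 and σ = 1 is called a standard normal distribution. The random variable is denoted by Z.
The pdf is $f(z; 0, 1) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$
The cdf is $\Phi(z) = P(Z \le z) = \int_{-\infty}^{z} f(y; 0, 1)\, dy$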

Standard Normal Cumulative Areas 0 z Standard normal curve

Standard Normal Distribution
Let Z be the standard normal variable. Find (from table)
a. P(Z ≤ 0.85): area to the left of 0.85 = 0.8023
b. P(Z > 1.32) = 1 – P(Z ≤ 1.32) = 1 – 0.9066 = 0.0934

c. P(–2.1 ≤ Z ≤ 1.78): find the area to the left of 1.78, then subtract the area to the left of –2.1. = 0.9625 – 0.0179 = 0.9446

Notation
$z_\alpha$ will denote the value on the measurement axis for which $\alpha$ of the area under the z curve lies to the right of $z_\alpha$

Ex. Let Z be the standard normal variable. Find z if
a. P(Z < z) = 0.9278. Look at the table and find an entry = 0.9278, then read back to find z = 1.46.
b. P(–z < Z < z) = 0.8132
P(–z < Z < z) = 2P(0 < Z < z) = 2[P(Z < z) – ½] = 2P(Z < z) – 1 = 0.8132
P(Z < z) = 0.9066, so z = 1.32

Nonstandard Normal Distributions
If X has a normal distribution with mean μ and standard deviation σ, then Z = (X – μ)/σ has a standard normal distribution.

Normal Curve
Approximate percentage of area within given standard deviations (empirical rule): 68% within 1, 95% within 2 and 99.7% within 3 standard deviations of the mean.

Ex. Let X be a normal random variable with = 0.2266

Ex. A particular rash has shown up at an elementary school. It has been determined that the length of time that the rash will last is normally distributed with μ = 6 days and σ = 1.5 days. Find the probability that for a student selected at random, the rash will last for between 3.75 and 9 days.

P(3.75 ≤ X ≤ 9) = P(–1.5 ≤ Z ≤ 2) = 0.9772 – 0.0668 = 0.9104

Percentiles of an Arbitrary Normal Distribution
(100p)th percentile for normal (μ, σ) = μ + [(100p)th percentile for standard normal] · σ

Lognormal distribution
[Figure: lognormal probability density f_X(ξ) with μ_X and σ_X marked]

Lognormal distribution
f_X(ξ): lognormal
f_Y(y): normal
If X is lognormally distributed, then Y = ln(X) is normally distributed
y = ln ξ or ξ = exp(y)

Lognormal distribution
X lognormal distributed ⇔ Y = ln(X) normal distributed
probability density function for X: $f_X(\xi) = \frac{1}{\xi\,\sigma_Y\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{\ln\xi-\mu_Y}{\sigma_Y}\right)^2\right)$, ξ > 0
in which μ_Y and σ_Y are the parameters of the lognormal distribution:
μ_Y: mean value of Y (not of X !!)
σ_Y: standard deviation of Y (not of X !!)
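Because μ_Y and σ_Y describe Y = ln(X) and not X itself, it helps to convert them explicitly. The Python sketch below uses the standard lognormal moment relations (mean of X = exp(μ_Y + σ_Y²/2), standard deviation of X = mean × sqrt(exp(σ_Y²) − 1)) and checks them against a simulation; the parameter values, seed and function name are arbitrary choices for this illustration.

```python
import math
import random
import statistics


def lognormal_mean_sd(mu_y, sigma_y):
    """Mean and standard deviation of X when Y = ln(X) is normal(mu_y, sigma_y)."""
    mean_x = math.exp(mu_y + 0.5 * sigma_y ** 2)
    sd_x = mean_x * math.sqrt(math.exp(sigma_y ** 2) - 1.0)
    return mean_x, sd_x


mu_y, sigma_y = 0.0, 0.5           # hypothetical parameters of Y = ln(X)
rng = random.Random(1)
sample = [math.exp(rng.gauss(mu_y, sigma_y)) for _ in range(100_000)]

print(lognormal_mean_sd(mu_y, sigma_y))
print(statistics.fmean(sample), statistics.pstdev(sample))
```

The mean of X is larger than exp(μ_Y), which is the median of X; this is one reason why the parameters of Y must not be read as the moments of X.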

Lognormal distribution X lognormal distributed  Y = ln(X) normal distributed

Lognormal distribution
As a consequence of the central limit theorem: the product of many variables is (almost) lognormally distributed, so its logarithm is (almost) normally distributed.
Definition: log y normal ⇔ y lognormal
Example: salaries in a country are lognormally distributed.

Asymptotic distributions Normal Lognormal Weibull Gumbel Asymptotic distr. return in the 5th year course CT5310 by Vrijling and Van Gelder

Discrete distributions

Exponential Distribution
A continuous rv X has an exponential distribution with parameter λ (λ > 0) if the pdf is $f(x; \lambda) = \lambda e^{-\lambda x}$ for x ≥ 0, and 0 otherwise

Mean and Variance
The mean and variance of a random variable X having the exponential distribution: $\mu = 1/\lambda$, $\sigma^2 = 1/\lambda^2$

Applications of the Exponential Distribution
Suppose that the number of events occurring in any time interval of length t has a Poisson distribution with parameter αt, and that the numbers of occurrences in nonoverlapping intervals are independent of one another. Then the distribution of elapsed time between the occurrences of two successive events is exponential with parameter λ = α
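A rough numerical check of this statement in Python: events are scattered uniformly over a long period (a common way to generate event times of a Poisson process once the number of events is known), and the gaps between successive events are inspected. Holding the number of events fixed at its expected value is a simplification made for this sketch; the rate, duration and function name are hypothetical.

```python
import random
import statistics


def event_gaps(rate, duration, seed=1):
    """Gaps between successive events scattered uniformly over [0, duration]."""
    rng = random.Random(seed)
    n_events = int(rate * duration)   # expected number of events, held fixed here
    times = sorted(rng.uniform(0.0, duration) for _ in range(n_events))
    return [t2 - t1 for t1, t2 in zip(times, times[1:])]


gaps = event_gaps(rate=2.0, duration=10_000.0)
print(statistics.fmean(gaps), statistics.pstdev(gaps))
```

For an exponential distribution the mean and the standard deviation are both 1/λ, so both numbers should be close to 0.5 for a rate of 2 events per unit time.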

Two stochasts

joint probability density

Contour lines of the joint density
[Figure: contour lines of f_XY(ξ, η) with the marginal densities f_X(ξ) and f_Y(η)]

Two stochasts
Relation with events: f_XY(ξ, η) dξ dη = P(ξ < X ≤ ξ + dξ and η < Y ≤ η + dη)
Also here: f = density, F = (cumulative) distribution

Example: Length and Weight
[Figure: marginal probability densities of length (m), density in 1/m, and of weight (kg), density in 1/kg]

Corresponding contour plot?
[Figure: contour plot of the joint density, length (m) versus weight (kg)]

Scatter plot: results of a large survey (health investigation, 1000 observations)
[Figure: scatter plot of length (m) versus weight (kg)]

Dependence
[Figure: scatter plot of length (m) versus weight (kg), with the densities of length (1/m) and weight (1/kg)]

Characteristics
μ_X, μ_Y
σ_X, σ_Y
Dependence: cov_XY covariance, or ρ_XY = cov_XY / (σ_X σ_Y) correlation, between -1 and 1

Covariance Cov(X,Y)=E((X-EX)(Y-EY)) calculation example on black board
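The definition Cov(X,Y) = E((X-EX)(Y-EY)) translates directly into a sample calculation. The Python sketch below uses a small, made-up set of length and weight observations; the function names are chosen for this example.

```python
import math


def covariance(xs, ys):
    """Sample version of Cov(X,Y) = E((X - EX)(Y - EY)), dividing by n."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))
    sd_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)


# Hypothetical observations of length (m) and weight (kg)
length_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [58.0, 70.0, 68.0, 82.0, 90.0]

print(covariance(length_m, weight_kg), correlation(length_m, weight_kg))
```

The covariance carries the units of both variables (here m·kg), whereas the correlation is dimensionless and always lies between -1 and 1.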

Correlation

Correlations between wave height and wave period (data from golfklimaat.nl website)

Introduction
Engineering: structural reliability
Reliability: expressed through the probability that the structure falls apart; the smaller this probability, the larger the reliability
Risk = probability x consequences
Structure: strength and load; the structure falls apart if strength < load

Introduction: Design value - principle
P{S > S_d} = P_target
[Figure: probability density function (pdf) of the load S, f_S(x), with the design value S_d]

Introduction Cumulative distribution function cdf pdf

Introduction: Design value - principle
P{S ≤ S_d} = 1 - P_target
[Figure: cumulative probability distribution function (cdf) of the load S, F_S(x), with the design value S_d]

Introduction: Design value - principle
Complementary cumulative distribution function (ccdf) of the load S = 1 - F_S(x)
P{S > S_d} = P_target
[Figure: ccdf of the load S with the design value S_d]

Introduction: Design value of the load S_d
The design value of the load S_d, or quantile, is defined as: P{S > S_d} = P_target during reference period T_ref
Target probability P_target:
–Depends on consequences of structural failure
–Is specified in building codes
Typical: P_target = 10^-4 to 10^-1 (structural collapse), T_ref = 15 - 100 years

Peaks over Threshold analysis for quantile estimation
Let X1, X2, ..., Xn be a series of independent random observations of a random variable X with the distribution function F(x). To model the upper tail of F(x), consider the k exceedances of X over a threshold u and let Y1, Y2, ..., Yk denote the excesses (or peaks), i.e. Yi = Xi - u.
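Selecting the excesses from a series is a one-line operation. A Python sketch with hypothetical observations and a hypothetical threshold:

```python
def excesses_over_threshold(observations, u):
    """Return the excesses Y_i = X_i - u for all observations that exceed u."""
    return [x - u for x in observations if x > u]


# Hypothetical observations (e.g. water levels in cm) and threshold
observations = [112, 95, 140, 88, 131, 104, 167, 99]
threshold = 110

peaks = excesses_over_threshold(observations, threshold)
print(len(peaks), peaks)
```

The number of excesses k depends on the choice of u: a higher threshold focuses on the tail but leaves fewer data points for fitting.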

Extreme value statistics If we know the distribution of a random variable (for instance, monthly water level, daily wave height, etc), how does the distribution of the maximum of n random variables behave?

Beware of inhomogeneities

Stationary Time Series
Exhibits stationarity in that it fluctuates around a constant long run mean
Has a finite variance that is time invariant
Has a theoretical covariance between values of y_t that depends only on their separation in time

WHITE NOISE PROCESS
X_t = u_t, with u_t ~ IID(0, σ^2)
Stationary time series

Examples of non-stationary series Share PricesExchange Rate Income

Unit Root Tests How do you find out if a series is stationary or not?

Order of Integration of a Series
A series which is stationary after being differenced once is said to be integrated of order 1 and is denoted by I(1). In general, a series which is stationary after being differenced d times is said to be integrated of order d, denoted I(d). A series which is stationary without differencing is said to be I(0).
$Y_t = b_0 + Y_{t-1} + \varepsilon_t \sim I(1)$
$\Delta Y_t = Y_t - Y_{t-1} = b_0 + \varepsilon_t \sim I(0)$

Informal Procedures to identify non-stationary processes
(1) Eyeball the data
(a) Constant mean?
(b) Constant variance?

Statistical Tests for stationarity: Simple t-test
Set up an AR(1) process with drift (b_0): $Y_t = b_0 + b_1 Y_{t-1} + \varepsilon_t$, $\varepsilon_t \sim iid(0, \sigma^2)$ (1)
Simple approach is to estimate eqn (1) using OLS and examine the estimated b_1
Use a t-test with null H0: b_1 = 1 (non-stationary) against alternative Ha: b_1 < 1 (stationary)
Test statistic: TS = (b_1 – 1) / Std.Err.(b_1)
Reject the null hypothesis when the test statistic is large negative; the 5% critical value is -1.65
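The steps of this simple t-test can be sketched in Python without external libraries. The series is simulated here with Gaussian noise and b_1 = 0.5, both assumptions made only to have data to test; the function names are hypothetical, and the critical value of -1.65 is used exactly as given on the slide.

```python
import math
import random


def simulate_ar1(b0, b1, n, sigma=1.0, seed=1):
    """Simulate Y_t = b0 + b1 * Y_{t-1} + e_t with Gaussian white noise e_t."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n):
        y.append(b0 + b1 * y[-1] + rng.gauss(0.0, sigma))
    return y


def ar1_t_statistic(y):
    """OLS fit of Y_t on Y_{t-1}; returns (b1_hat, TS) with TS = (b1_hat - 1) / se(b1_hat)."""
    x, z = y[:-1], y[1:]                      # regressor Y_{t-1} and response Y_t
    n = len(x)
    mean_x, mean_z = sum(x) / n, sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z)) / sxx
    b0 = mean_z - b1 * mean_x
    rss = sum((zi - b0 - b1 * xi) ** 2 for xi, zi in zip(x, z))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)    # standard error of the slope
    return b1, (b1 - 1.0) / se_b1


series = simulate_ar1(b0=0.0, b1=0.5, n=500)
b1_hat, ts = ar1_t_statistic(series)
print(b1_hat, ts, "reject H0" if ts < -1.65 else "do not reject H0")
```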

Distribution of a maximum of random variables

Therefore...

Extreme value distribution of a uniform distribution

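From an ‘operational point of view’ rather than conceptual point of view
Matlab code ct53100.m
n = 12;                % number of observations per sample
y = zeros(1,100);      % preallocate the 100 maxima
for j = 1:100
    x = 5*rand(1,n);   % n draws from a uniform distribution on [0, 5]
    y(j) = max(x);     % maximum of the sample
end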
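The same experiment can be compared with theory. For n independent variables with the same distribution F(x), the maximum has distribution [F(x)]^n, which for a uniform distribution on [0, 5] gives (x/5)^n and a mean of 5n/(n+1). The Python sketch below repeats the Matlab experiment with more repetitions; the function names, seed and number of repetitions are choices made for this illustration.

```python
import random
import statistics


def simulate_maxima(n_obs=12, n_rep=100, upper=5.0, seed=1):
    """Maxima of n_rep samples, each with n_obs draws from uniform(0, upper)."""
    rng = random.Random(seed)
    return [max(upper * rng.random() for _ in range(n_obs)) for _ in range(n_rep)]


def cdf_of_maximum(x, n_obs=12, upper=5.0):
    """F_max(x) = [F(x)]^n for n independent uniform(0, upper) variables, 0 <= x <= upper."""
    return (x / upper) ** n_obs


maxima = simulate_maxima(n_rep=20_000)
empirical = sum(m <= 4.5 for m in maxima) / len(maxima)

print("mean of maxima:", statistics.fmean(maxima), "theory:", 5.0 * 12 / 13)
print("P(max <= 4.5):", empirical, "theory:", cdf_of_maximum(4.5))
```

Changing n_obs shows how the distribution of the maximum shifts towards the upper bound as more observations are included.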

Back to the operational viewpoint - Change the number of observations - Change the distribution type - From minimum to maximum - etc.

Introduction: from a visual point of view
Statistics wind: Mean: 5 m/s, Standard deviation: 2.5 m/s
[Figure: wind speed against time, with its probability density]

Introduction: visual point of view
[Figure: wind speed densities for the instantaneous value, the 1 year maximum and the 50 years maximum; labelled values 20 m/s, 5 m/s and 28 m/s]

visual point of view

Procedure for minima of r.v.’s

Approach 1: extreme value distributions The extreme value distribution Distribution type Distribution parameters Quantile values (design values) Example: wind loads

Approach 1: extreme value distributions
The extreme value distribution
Limit theorem (Fisher-Tippett): the maximum of many random variables has one of the following distributions, regardless of the parent distribution:
–Reverse Weibull (bounded maximum), or
–Gumbel, or
–Frechet (bounded minimum)
Conditions:
–Random variables are independent
–Random variables have the same parent distribution

Approach 1: extreme value distributions Extreme value distributions Reverse Weibull (convex) Gumbel (straight) Frechet (concave)

Approach 1: extreme value distributions
Generalized Extreme Value distribution
All three extreme value distributions are special cases of the Generalized Extreme Value distribution (GEV) with shape parameter ξ:
ξ = 0: Gumbel (EV type I for maxima)
ξ > 0: Frechet (EV type II for maxima)
ξ < 0: Reverse Weibull (EV type III for maxima)

Approach 1: extreme value distributions
Domain of attraction
Asymptotic distribution type of maximum (domain of attraction)    Parent distribution
Frechet                                                           Pareto, Cauchy, Student-t (fat tail)
Gumbel                                                            Normal, exponential, gamma, lognormal, Weibull
Reverse Weibull                                                   uniform, beta (short tail)

Approach 1: extreme value distributions
Example: wind load
Maximum over 50 years
Quantile value at P_target = 0.15 (design load)
Steps:
- Determine extreme value distribution type
- Determine distribution parameters
- Calculate requested quantile (design load)

Approach 1: extreme value distributions Distribution type Statistics: parent distribution is approx. Weibull EV-theory: domain of attraction is Gumbel Plot: monthly maxima of hourly averaged wind speeds Slightly convex?

Approach 1: extreme value distributions Gumbel probability plot CDF on Gumbel probability paper: Reverse Weibull-like deviation (poor convergence)

Approach 1: extreme value distributions Wind speed annual maxima Schiphol

Approach 1: extreme value distributions
Wind pressure monthly maxima Schiphol
Pressure: q = 0.5 ρ U², with ρ the air density and U the wind speed

Transformations

Presenting large datasets In a histogram On probability paper

Classify your data
Order the n observations
Number of classes: 1 + 1.33 ln(n)
All classes preferably have the same width

Histogram Wave height (cm’s): 25 45 35 25 30 70 20 45 65 30 40 40 35 45 55 35 32 37 28 45 49 39 40 60 29 34 47 35 45 49 35 45 34 28 34 54 48 38 32 39 45 58

Histogram #classes: 1 + 1.33 ln(42) ≈ 6 highest - lowest = 70 - 20 = 50 class width about 50 / 6 = 8 (we take 5 to choose a round number)

Histogram
class (cm)     frequency   freq/width (unit = 5)
17.5 - 27.5    3           3/2
27.5 - 32.5    7           7/1
32.5 - 37.5    9           9/1
37.5 - 42.5    6           6/1
42.5 - 47.5    8           8/1
47.5 - 57.5    5           5/2
57.5 - 77.5    4           4/4
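The class count and the frequency table for the wave height data can be reproduced with a short Python sketch. The class edges are the ones used on the slide; the variable names are chosen for this illustration, and each class is taken as open at its lower edge and closed at its upper edge (no observation falls on an edge, so the choice does not matter here).

```python
import math

wave_heights_cm = [25, 45, 35, 25, 30, 70, 20, 45, 65, 30, 40, 40, 35, 45,
                   55, 35, 32, 37, 28, 45, 49, 39, 40, 60, 29, 34, 47, 35,
                   45, 49, 35, 45, 34, 28, 34, 54, 48, 38, 32, 39, 45, 58]

n = len(wave_heights_cm)
n_classes = round(1 + 1.33 * math.log(n))        # rule of thumb from the slides

edges = [17.5, 27.5, 32.5, 37.5, 42.5, 47.5, 57.5, 77.5]
classes = list(zip(edges, edges[1:]))

# Absolute frequency per class
counts = [sum(lo < h <= hi for h in wave_heights_cm) for lo, hi in classes]

# Frequency per unit width (unit = 5 cm), needed because class widths differ
per_unit = [c / ((hi - lo) / 5.0) for c, (lo, hi) in zip(counts, classes)]

print(n, n_classes)
for (lo, hi), c, d in zip(classes, counts, per_unit):
    print(f"{lo:5.1f} - {hi:5.1f}  {c:2d}  {d:4.1f}")
```

Plotting `per_unit` rather than `counts` as the bar height keeps the bar areas proportional to the frequencies when classes are merged in the tails.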

Case study: Groundwater chemistry data set (from; Y. Zhou, 2006: Hydrogeostatistics. UNESCO-IHE lecture note.)

Step 1: Range of data R = x max - x min R(Cl) = 184 - 4.72 = 179.28 mg/l Frequency tables

Step 2: Number of class intervals, m (Frequency tables)
Class number: 6 < m < 25, m = 1 + 1.33 ln(N)
Example Cl: m should be around 6-7 classes
Class width Δx: Δx > R/m
Example Cl: 15 > 179.28/13 = 14
Class limits x_j- (lower limit) and x_j+ (upper limit): x_j- = x_0 + (j-1)·Δx < values in class j < x_j- + Δx = x_j+

Step 3: Number of measurements per class n j : absolute frequency f j =n j /n: relative frequency Frequency tables

Step 4: Creating frequency table Frequency tables

Step 5: Absolute and relative frequencies Creating of a Histogram

Step 6: Cumulative frequency table Frequency tables

Step 7: Cumulative frequency distribution curve Frequency distribution

Some typical frequency distributions

Statistical descriptors Descriptors of central tendency Mode : the value with largest frequency; average value of measurements of the class with the largest frequency Not applicable for distributions with several peaks For Cl: the mode is 19.4 mg/l

Statistical descriptors: Descriptors of central tendency
Median: the value that corresponds to 50% cumulative frequency, i.e. the middle measurement for an odd number of samples or the average of the two middle measurements for an even number of samples.
For Cl: median = 31.4 mg/l
Insensitive to the tails or outliers of the distribution; preferable for data sets with exceptional values.

1. Descriptors of central tendency Quartiles: split the data into quarters –Lower quartile: 25% cumulative frequency For Cl: lower quartile = 16.3 mg/l –Upper quartile: 75% cumulative frequency For Cl: upper quartile = 56.3 mg/l In practice also the 1%, 5%,10%, 90%, 95% and 99% values are used (e.g. discharge data). Statistical descriptors

Statistical descriptors: 1. Descriptors of central tendency
Arithmetic mean: average value of the measurements
For Cl: arithmetic mean = 43.72 mg/l
More representative of the sample, but sensitive to outliers in a small sample. Most distributions are sufficiently characterised by the mean and the variance.

Statistical descriptors: 1. Descriptors of central tendency
Geometric mean
For Cl: geometric mean = 29.53 mg/l
Not applicable for negative values. Hydrogeological variables are often not symmetrical while their log transformations are; the geometric mean is then applicable. The radius of a grain with main axes a, b, and c is characterised best by the cube root of a·b·c.

Statistical descriptors: 1. Descriptors of central tendency
Harmonic mean
For Cl: harmonic mean = 20.08 mg/l
Appropriate for phenomena where small values are more important (e.g. hydraulic conductance; see lecture notes from Zhou, 2006)

Descriptor        Summary of properties
Mode              Indication of abundant values; isolated property; not applicable for distributions with several peaks.
Median            Insensitive to the tails of the distribution; preferable for data sets with exceptional values.
Arithmetic mean   More representative of the sample; sensitive to exceptional values in a small sample. Most distributions are sufficiently characterised by the mean and the variance.
Geometric mean    Not applicable for negative values. The radius of a grain with main axes a, b, and c is characterised best by the cube root of the product a·b·c.
Harmonic mean     More appropriate for phenomena where small values are more important.

Statistical descriptors: Relations between central tendency descriptors
The harmonic mean is smaller than the geometric mean, and the geometric mean is, in turn, smaller than the arithmetic mean. They are equal only if x_1 = x_2 = ... = x_n.
The mode, median and arithmetic mean coincide if the frequency distribution is symmetrical (with a single peak). They are not equal when the distribution is not symmetrical.
The mean of the log x distribution is equal to the logarithm of the geometric mean of x.
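These relations are easy to verify numerically. The sketch below uses Python's `statistics` module on the wave height data from the histogram example (the Cl data set itself is not reproduced in these slides); the variable names are chosen for this illustration.

```python
import math
import statistics

wave_heights_cm = [25, 45, 35, 25, 30, 70, 20, 45, 65, 30, 40, 40, 35, 45,
                   55, 35, 32, 37, 28, 45, 49, 39, 40, 60, 29, 34, 47, 35,
                   45, 49, 35, 45, 34, 28, 34, 54, 48, 38, 32, 39, 45, 58]

arithmetic = statistics.fmean(wave_heights_cm)
geometric = statistics.geometric_mean(wave_heights_cm)
harmonic = statistics.harmonic_mean(wave_heights_cm)
median = statistics.median(wave_heights_cm)

# Mean of the log-transformed data equals the log of the geometric mean
mean_of_logs = statistics.fmean(math.log(h) for h in wave_heights_cm)

print(harmonic, geometric, arithmetic, median)
print(mean_of_logs, math.log(geometric))
```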

Statistical descriptors: 2. Descriptors of dispersion (variation)
Sample variance
For Cl: variance = 1768.72 (mg/l)²
Standard deviation: the square root of the variance
For Cl: standard deviation = 42.06 mg/l

2. Descriptors of dispersion (variation) Statistical descriptors

2. Descriptors of dispersion (variation) Coefficient of variation For Cl: coefficient of variation = 0.96 Useful to compare the variations of two or more data sets. Statistical descriptors

3. Descriptors of asymmetry Coefficients of skewness For Cl: coefficient of skewness α 3 = 1.97 Statistical descriptors Moments or product moments!

Moments or product moments Central moments Statistical descriptors Moments are statistical descriptors of a data set used for: –1 st moment: mean or expected value, µ x (“central tendency”) –2 nd moment: variance, σ 2 x ; standard deviation σ x is square root of variance (“spread around the central value”) –3 rd moment: skewness, γ x (“measure of symmetry”) –4 th moment: kurtosis, κ x (“peakedness of central portion of distribution”)

3. Descriptors of asymmetry Statistical descriptors

4. Descriptor of flatness/’peakedness’ Coefficient of kurtosis For Cl: coefficient of Kurtosis = 7.04 α 4 =3: for Normal distribution α 4 >3: steeper than Normal distribution α 4 <3: flatter than Normal distribution Statistical descriptors
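The formulas for these descriptors did not survive in the slide text, so the sketch below uses common moment-based definitions as an assumption: the sample variance with divisor n-1, and skewness and kurtosis as the third and fourth central moments divided by the matching power of the second central moment (divisor n). Other textbooks apply small-sample corrections, so values may differ slightly from those quoted for Cl.

```python
import math

def describe(x):
    """Return common sample descriptors of dispersion, asymmetry and peakedness."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    variance = sum(d ** 2 for d in dev) / (n - 1)   # sample variance
    sd = math.sqrt(variance)
    m2 = sum(d ** 2 for d in dev) / n               # central moments (divisor n)
    m3 = sum(d ** 3 for d in dev) / n
    m4 = sum(d ** 4 for d in dev) / n
    return {
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "cv": sd / mean,                            # coefficient of variation
        "skewness": m3 / m2 ** 1.5,                 # 0 for symmetric data
        "kurtosis": m4 / m2 ** 2,                   # 3 for a Normal distribution
    }

print(describe([1, 2, 3, 4, 5]))

# Consistency check on the Cl values quoted in the slides
sd_cl = math.sqrt(1768.72)
print(round(sd_cl, 2), round(sd_cl / 43.72, 2))
```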

Normal probability paper
Sort the data from small to large
Assign each observation the probability i/(N+1), in which i is the order number and N the total number of data points

Figure: Daily mean run-off anomalies at Achleiten, Danube River (deseasonalised daily mean run-off against year), illustrating POT

Flood frequency analysis (peak flows)
Annual maximum series (more common)
- One can miss a large event if there is more than one per year; but continuous and easy to process
- Often used for estimating extremes in long records (>10 years)
Partial duration series ("Peaks-Over-Threshold", POT)
- Definition of the threshold is tricky and requires experience
- Often used for short records (<10 years)

Flood frequency analysis (peak flows): Annual max. series vs. partial duration series (Davie, 2002)

Annual max. series vs. partial duration series (POT)
Langbein showed the following relationship (Chow 1964):
1/T = 1 - exp(-1/T_p)
T: return period using annual max. series
T_p: return period using partial duration series
Differences get smaller for larger return periods (about 5% at a 10-year recurrence interval, about 1% at 50 years)!
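A few values of Langbein's relationship show how the two return periods converge. This is a minimal Python sketch; the function name is chosen for this illustration.

```python
import math

def annual_from_partial(t_p):
    """Return period T (annual max. series) for a partial-duration return period T_p."""
    return 1.0 / (1.0 - math.exp(-1.0 / t_p))

for t_p in (1, 2, 5, 10, 50, 100):
    t = annual_from_partial(t_p)
    print(t_p, round(t, 2), f"{100 * (t - t_p) / t_p:.1f}%")
```

The absolute difference T - T_p stays close to half a year, so the relative difference shrinks as the return period grows.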

Assumptions of frequency analysis
- All data points are correct and precisely measured
  - Be aware of the uncertainty of peak flow data (uncertainty and errors come later in this course!)
- Independent events: peak flows are not part of the same event
  - Carefully check the data set; plot the whole record, in particular all events of the POT series
  - Problems with events at the transition of the year (31 Dec - 1 Jan) in humid temperate or some tropical climates
- Random sample: every value in the population has an equal chance of being included in the sample
- The hydrological regime has remained static during the complete time period of the record
  - No land use change, no climate change, no changes in the river channels, no change in the flood water management etc. in the catchment (often not the case for long records!)
- All floods originate from the same statistical population (homogeneity)
  - Different flood generating mechanisms (e.g. rain storms, snow melt, snow-on-ice etc.) might cause floods with different frequencies/recurrence intervals

Describing the frequency mathematically: Probability Distribution Function Typically defined in either of two forms: Probability density function (PDF) Cumulative distribution function (CDF) Discrete Continuous PDF CDF

Basics (examples using measured flow, Q)
- Probability of exceedence, P(X): probability that the flow Q is greater than or equal to X; P(X) ∈ [0,1]
- Relative frequency, F(X): probability of flow Q being less than a value X; F(X) ∈ [0,1]. Can be read from a cumulative probability curve, but be careful with the selected class intervals.
- Average recurrence interval or return period, T(X): statistical term meaning that X is exceeded on average once every T years over a long record (time step is usually one year).
  - Not exactly the number of years between events of a certain size!
  - Rather the average number of years between exceedences of X!
  - No regularity or periodicity in occurrences of exceedences (assumption)
P(X) = 1 - F(X)
T(X) = 1/P(X) = 1/(1 - F(X))

Figure: relative frequency F(X) and probability of exceedance P(X), shown on the PDF and on the CDF

Recurrence intervals for design purposes (flood protection) in Germany (according to DIN 19700, part 99)
Class 1: Settlements, urban areas, important infrastructure: 50-100 years
Class 2: Single buildings, areas not permanently inhabited: 25-50 years
Class 3: Farm land, intensively used: 10-25 years
Class 4: Farm land: 5-10 years
What about large dams, nuclear power plants etc.? Probable maximum flood (PMF)

Exceedence probability for a specified number of time intervals (see Box C-4, page 561 in Dingman, 2002)

Examples: Exceedence probability and return period (based on Box C-4 in Dingman, 2002)
- What is the probability that a flood greater than or equal to the 100-year flood will occur next year? P(X) = 1/T(X) = 0.01
- What is the probability that we will not have a flood greater than or equal to the 50-year flood next year? F(X) = 1 - P(X) = 1 - 0.02 = 0.98
- What is the probability that we will not have a flood greater than or equal to the 20-year flood in the next 5 years? F(X)^n = (1 - P(X))^n = 0.95^5 = 0.774
- What is the probability that the next exceedence of the 100-year flood will occur in the 10th year from now? p = (1 - 0.01)^9 x 0.01 = 0.00914
- What is the probability that the 100-year flood will be exceeded at least once in the next 40 years? p = 1 - (1 - 0.01)^40 = 0.331
- What is the probability that the 50-year flood will be exceeded twice in a row (two independent events in one year), and how many 50-year floods can be expected on average in 1000 years? p = 0.02 x 0.02 = 0.0004; and on average 20 floods in 1000 years.
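The worked examples above can be recomputed directly. This is a small Python sketch that assumes independent years with a constant exceedence probability P = 1/T; the variable names are chosen for this illustration.

```python
def exceedance_probability(return_period):
    """Annual probability P(X) of exceeding the T-year flood."""
    return 1.0 / return_period

p100 = exceedance_probability(100)
p50 = exceedance_probability(50)
p20 = exceedance_probability(20)

no_50yr_next_year = 1 - p50                          # non-exceedence in one year
no_20yr_in_5_years = (1 - p20) ** 5                  # five quiet years in a row
first_100yr_in_year_10 = (1 - p100) ** 9 * p100      # nine quiet years, then one exceedence
at_least_one_100yr_in_40 = 1 - (1 - p100) ** 40      # complement of forty quiet years
two_50yr_in_a_row = p50 * p50                        # two independent exceedences
expected_50yr_in_1000_years = 1000 * p50             # average count

print(no_50yr_next_year, no_20yr_in_5_years, first_100yr_in_year_10)
print(at_least_one_100yr_in_40, two_50yr_in_a_row, expected_50yr_in_1000_years)
```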

(Bedient and Huber, 2002)

But how do we estimate P(X) and F(X) from data? Example: Flood frequency analysis

Example: Annual max. series for the river Wye (1971-97) (not Normal-distributed!) (Davie, 2002)

Plotting position: Weibull formula
Rank the annual maximum series data from low to high (independent data, the year of occurrence is irrelevant)
Calculate F(X) with the rank r and N total data points (i.e. length of record: N years):
F(X) = r/(N+1)
- For example: the largest value of a 25-year record would plot at a recurrence interval of 26 years.
- F(X) can never reach 1
- If you rank from high to low, P(X) = 1 - F(X) is calculated

Gringorten formula
F(X) = (r - 0.44) / (N + 0.12)
The difference from the Weibull formula is often not great; the choice often comes down to personal preference
The empirical constants (0.44 and 0.12) are valid for the Gumbel distribution
Comparison of Weibull and Gringorten formulae (Davie, 2002) [Please note: F(X) = p in the Workshop course note]
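Both plotting position formulas are one-liners. The Python sketch below applies them to a short hypothetical series of annual maxima; the flow values are invented for the example and are not the river Wye data.

```python
def weibull_position(rank, n):
    """F(X) = r / (N + 1), with rank 1 for the smallest value."""
    return rank / (n + 1)

def gringorten_position(rank, n):
    """F(X) = (r - 0.44) / (N + 0.12), constants valid for the Gumbel distribution."""
    return (rank - 0.44) / (n + 0.12)

def return_period(f):
    """T(X) = 1 / (1 - F(X))."""
    return 1.0 / (1.0 - f)

# Hypothetical annual maximum flows in m3/s
annual_maxima = [21.3, 15.8, 30.1, 18.4, 25.0, 17.2, 40.5, 22.9]
n = len(annual_maxima)
for rank, q in enumerate(sorted(annual_maxima), start=1):
    fw = weibull_position(rank, n)
    fg = gringorten_position(rank, n)
    print(q, round(fw, 3), round(return_period(fw), 1),
          round(fg, 3), round(return_period(fg), 1))
```

The two formulas differ most for the largest and smallest ranks, where Gringorten assigns a longer return period to the largest observation.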

Example: Annual max. series for the river Wye (1971-97) (not Normal-distributed!) (Davie, 2002) Reliability is good!

Extrapolation beyond the data set
The Weibull or Gringorten formulae are only good for flood frequency estimates for flows within the measured record, and are unreliable even there near either limiting value
For extrapolation the fit of a probability distribution is needed. Estimate the parameters through:
1. Method of moments (widely used)
2. Method of L-moments (less widely used, quite complex)
3. Method of maximum likelihood (not widely used)
An alternative is a graphical approach to fit the distribution (subjective approach)
The choice of the distribution function is often based on personal preferences (but always take the distribution that fits your data best in a particular region); sometimes there are guidelines (depending on the region)
Extreme values are usually not normally distributed; however, mean annual flows in humid areas are often normally distributed

Method of moments - Example: Gumbel distribution
Product moments are statistical descriptors of a data set (they characterise the probability distribution):
- 1st moment: mean or expected value, µ_x ("central tendency")
- 2nd moment: variance, σ²_x; standard deviation σ_x is the square root of the variance ("spread around the central value")
- 3rd moment: skewness, γ_x ("measure of symmetry"); figure: distributions with γ_x > 0 and γ_x < 0
- 4th moment: kurtosis, κ_x ("peakedness of central portion of distribution")
- Coefficient of variation: measure of spread, CV = σ_x/µ_x
(L-moments are used for small sample sizes; see Dingman (2002), Appendix C)

Method of moments - Gumbel distribution
F(X) = exp(-exp(-b(X - a)))
a = mean(Q) - 0.5772/b
b = π/(σ_Q √6)
F(X) leads to P(X) and T(X) for a certain size of flow X
Re-arranging the formulae leads to the size of flow for a given recurrence interval:
X = a - (1/b) ln[ln(T(X)/(T(X) - 1))]
Example: for the 50-year flood, compute the natural logarithm of 50/49 and then the natural logarithm of this result.
The parameters a and b are estimated from the sample.
Rule of thumb: do not extrapolate recurrence intervals beyond twice the length of your stream flow record

Example: Annual max. series for the river Wye (1971-97); values required for the Gumbel formula
Mean (Q)   Standard deviation (σ_Q)   a value   b value
21.21      6.91                       18.11     0.19
Applying the method of moments and Gumbel formula to the data gives some interesting results. The values used in the formula are shown in the table above and can be easily computed. When the formula is applied to find the flow value for an average recurrence interval of 50 years, it is calculated as 39.1 m³/s. This is less than the largest flow during the record, which under the Weibull formula has an average recurrence interval of 29 years. This discrepancy is due to the method of moments formula treating the highest flow as an extreme outlier. If we invert the formula we can calculate that a flood with a flow of 48.87 m³/s (the largest on record) has an average recurrence interval of around three hundred years. (Davie, 2002)
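The river Wye numbers can be reproduced from the mean and standard deviation alone. The Python sketch below uses the Gumbel method of moments formulas given earlier in these slides; the function names are chosen for this illustration, and the unrounded value of b is carried through, which is why a comes out as about 18.10 rather than the tabulated 18.11.

```python
import math

def gumbel_parameters(mean_q, sd_q):
    """Method of moments estimates (a, b) for F(X) = exp(-exp(-b (X - a)))."""
    b = math.pi / (sd_q * math.sqrt(6.0))
    a = mean_q - 0.5772 / b
    return a, b

def flow_for_return_period(t, a, b):
    """X = a - (1/b) ln ln(T / (T - 1))."""
    return a - math.log(math.log(t / (t - 1.0))) / b

def return_period_for_flow(x, a, b):
    """T(X) = 1 / (1 - F(X))."""
    f = math.exp(-math.exp(-b * (x - a)))
    return 1.0 / (1.0 - f)

a, b = gumbel_parameters(21.21, 6.91)     # river Wye, 1971-97
q50 = flow_for_return_period(50, a, b)
t_largest = return_period_for_flow(48.87, a, b)
print(round(a, 2), round(b, 2), round(q50, 1), round(t_largest))
```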

Distributions often used in hydrology (1/2) (Dingman, 2002)

Distributions often used in hydrology (2/2) (Dingman, 2002)

Use of common distributions in hydrology
Flood frequency analysis
Most commonly applied are the Exponential (EXP), Log-Normal (LN), Log-Pearson 3 (LP3) and Generalised Extreme Value (GEV) distributions.
In practice
- The choice of probability distribution may be dictated by mathematical convenience or by familiarity with a certain distribution ("personal bias").
- Sample estimators must be adopted in order to obtain estimates of the statistics for determining the distribution parameters.
- In some cases, more than one distribution may fit the available data equally well.
General three-step procedure
1) A suitable form of standard frequency distribution is chosen to represent the observations;
2) the chosen distribution is fitted to the data by determining values for its parameters; and
3) the required quantiles are computed from the fitted cumulative distribution function (CDF).

Distributions often used in hydrology 1. Normal distribution (ND) Probably the most important distribution, but often not useful for hydrological extremes PDF: Standard Normal Distribution (S-ND) a: scale parameter = standard deviation, σ c: location parameter = mean, μ CDF: (compare lecture of Dr. Zhou)

Location parameters A probability distribution is characterized by location and scale parameters. Location parameter equal to zero and scale parameter equal to one (Standard- ND) vs. ND with a location parameter of 10 and a scale parameter of 1. Scale Parameter The next plot has a scale parameter of 3 (location parameter is zero). The effect is that the graph is ‘stretched out’.

2. The Lognormal distribution y = ln(x) If y follows the Normal distribution, x follows the Log-normal distribution. Lognormal probability density function: Probability Distribution Functions

Lognormal distribution is skewed to the right! Lognormal distribution function

Effects of shape parameters for the lognormal distribution PDF CDF

Processing Log-normal Distribution Function
- Histogram indicates a positively skewed distribution
- Plot of cumulative frequency on Lognormal probability paper shows a straight line
- Take the logarithmic transformation of the data, y = ln(x)
- Calculate the sample mean and standard deviation of the logarithmic values
- Carry out the analysis on the logarithmic values

3. Pearson type III distribution (often used for flood frequencies)
PDF with parameters: a - scale parameter, b - shape parameter, c - location parameter; Γ( ) - Gamma function
Commonly fitted to the logarithms of floods (so-called log-Pearson type III distribution)
4. Gamma distribution (Pearson type III with c = 0)

Effects of shape parameter, gamma PDF CDF

5. Exponential distribution
PDF: f(x) = λ exp(-λx), x ≥ 0
CDF: F(x) = 1 - exp(-λx)
With a mean of 1/λ, a variance of 1/λ², and a skewness of 2. The Standard Exponential Distribution has λ = 1.
Particularly useful when applying partial duration series

Plots of exponential distribution PDF CDF

Example: Exponential distribution applied to storm interval times (from Bedient & Huber 2002)

6. General extreme value (GEV) distribution
c - location parameter
a - scale parameter
k - shape parameter
k = 0: Extreme value type I (EV1) (Gumbel)
k < 0: Extreme value type II (EV2)
k > 0: Extreme value type III (EV3), closely related to the Weibull distribution

Figure: Comparison of the three types of GEV distributions (PDF): Type I (k = 0), Type II (k < 0) and Type III (k > 0), with x = c marked on the x-axis
If the sample for which a frequency distribution is required exhibits skewness, a three-parameter distribution is useful (e.g. GEV).
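The GEV formula itself is not legible in the slide text. A widely used form that matches the sign convention listed here (k < 0 for type II, k > 0 for type III) is F(x) = exp(-[1 - k(x - c)/a]^(1/k)) for k ≠ 0 and F(x) = exp(-exp(-(x - c)/a)) for k = 0; the Python sketch below assumes that form. Note that this k has the opposite sign to the shape parameter used in the "Approach 1" slides, where a positive value gives the Frechet type.

```python
import math

def gev_cdf(x, c, a, k):
    """GEV CDF with location c, scale a > 0 and shape k (assumed k-convention)."""
    y = (x - c) / a
    if k == 0.0:
        return math.exp(-math.exp(-y))      # type I (Gumbel)
    z = 1.0 - k * y
    if z <= 0.0:
        # Outside the support: above the upper bound (k > 0, type III)
        # or below the lower bound (k < 0, type II)
        return 1.0 if k > 0.0 else 0.0
    return math.exp(-z ** (1.0 / k))

for k in (-0.2, 0.0, 0.2):
    print(k, [round(gev_cdf(x, c=10.0, a=2.0, k=k), 4) for x in (8.0, 10.0, 14.0, 20.0)])
```

At x = c every shape gives F = exp(-1), so the three types differ only in how fast the tails decay.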

General Extreme Value distribution Type I (= Gumbel distribution) PDF: CDF: -> widely used for annual maximum series! (Note: little different in description (use of parameters) than in the example of the river Wye, from Davie (2002)) CDF: (standardized)

Plots of Gumbel distribution PDF CDF

How good is the fit of the distribution function?
Graphic check: visual check of the plotted graph (how well are the observations reproduced by the fitted PDF/CDF?)
Mathematical check: statistical test to determine the goodness of fit
- Chi-square (χ²) test: PDF (needs much data, depends on classified intervals)
- Kolmogorov-Smirnov (K-S) test: CDF
  - Not necessary to divide the data into intervals; thus the error associated with the number and size of intervals is avoided.
  - Good if n > 35 and even better if n > 50.
  - Quick and easy, but only one value is considered
- Unfortunately, often several distributions provide acceptable fits to the available data (no identification of the "true" or "best" distribution); confidence limits are too large

Graphical method (Example of the river Wye) Frequency of flows less than a value X. The F(X) values on the x ‑ axis have undergone a transformation to fit the Gumbel distribution; called ‘reduced variate’ (cf. Workshop in Hydrology). Is this fit suitable for the whole data set?

Comparison of different PDFs (Significant differences in particular for the extremes!)

Example: Kolmogorov-Smirnov (K-S) test (according to Schoenwiese 2000)
F_X(x_i): CDF of the assumed distribution
S_n(x_i): CDF of the observed ordered sample
D_n = max |F_X(x_i) - S_n(x_i)|
If D_n ≤ the tabulated value D_n,α (see below), the assumed distribution is acceptable at the significance level α (n: sample size).
α        0.20       0.10       0.05       0.01       0.001
D_n,α    1.073/√n   1.224/√n   1.358/√n   1.628/√n   1.949/√n
(see sketch on black board)
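A minimal Python sketch of the K-S check follows. It takes D_n as the largest distance between the assumed CDF and the empirical step function, evaluated on both sides of each step, and uses the large-sample critical values tabulated above for α between 0.20 and 0.01. The function names and the three-point sample are chosen for this illustration.

```python
import math

KS_COEFFICIENTS = {0.20: 1.073, 0.10: 1.224, 0.05: 1.358, 0.01: 1.628}

def ks_statistic(sample, cdf):
    """Largest distance between the assumed CDF and the empirical CDF S_n."""
    xs = sorted(sample)
    n = len(xs)
    d_n = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # S_n jumps from (i-1)/n to i/n at x, so check both sides of the step
        d_n = max(d_n, abs(i / n - f), abs(f - (i - 1) / n))
    return d_n

def ks_critical_value(n, alpha=0.05):
    """Tabulated large-sample critical value D_n,alpha = coefficient / sqrt(n)."""
    return KS_COEFFICIENTS[alpha] / math.sqrt(n)

def uniform_cdf(x):
    return min(max(x, 0.0), 1.0)

sample = [0.1, 0.4, 0.7]
d_n = ks_statistic(sample, uniform_cdf)
print(d_n, ks_critical_value(len(sample)))
print("acceptable" if d_n <= ks_critical_value(len(sample)) else "rejected")
```

With only three points the critical value is far too generous to be informative; the slides recommend n > 35.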

Variability of quantile estimates - confidence limits
Sources of errors
- assumption of a particular distribution (cannot be quantified)
- sampling errors in estimation of the parameters of the distribution (quantifiable through standard errors)
Confidence limits (CL), 100(1-α)% confidence interval: CL = Q_T ± t_{1-α/2} · SE
Standard error: SE = C σ / √n
Q_T: quantile
C: constant (see Workshop page 13; course notes from Hall page 31f)
σ: standard deviation from a sample of size n
t_{1-α/2}: value of Student's t-distribution (for α = 5%: the 97.5% value, two-tailed test) with (n-1) degrees of freedom; tabulated

Example: Lognormal with 90%-confidence limits (Bedient & Huber 2002)

Example: Estimation of confidence limits (according to Schoenwiese 2000)
The mean annual temperature at the Hohenpeissenberg, Germany, for the period 1954-1970 (n = 17; Normal-distributed, C = 1) is 6.24 °C with a standard deviation of 0.73 °C. The confidence limits (α = 5%) can be calculated as:
CL (mean annual temperature in °C) = 6.24 ± t_97.5% · (0.73/√17) °C = 6.24 ± 0.38 °C
At a confidence level of 95%, the mean of the annual temperature lies in the interval 6.24 ± 0.38 °C, thus between 5.86 and 6.62 °C. Please note: if the record length had been 120 years (= n) with the same standard deviation, the interval would be 6.24 ± 0.13 °C.
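The arithmetic of this example is short enough to check by hand or in a few lines of Python. The t-values are not computed here but taken from a standard Student's t table (2.120 for 16 degrees of freedom and 1.980 for 119, both for a two-tailed 95% interval), and the function name is chosen for this illustration.

```python
import math

def confidence_half_width(sd, n, t_value, c=1.0):
    """Half-width t * C * sd / sqrt(n) of the confidence interval."""
    return t_value * c * sd / math.sqrt(n)

mean_temp = 6.24    # degrees C, Hohenpeissenberg 1954-1970
sd_temp = 0.73      # degrees C

half_17 = confidence_half_width(sd_temp, 17, t_value=2.120)     # tabulated t, 16 d.o.f.
half_120 = confidence_half_width(sd_temp, 120, t_value=1.980)   # tabulated t, 119 d.o.f.

print(round(half_17, 2), round(mean_temp - half_17, 2), round(mean_temp + half_17, 2))
print(round(half_120, 2))
```

Because the half-width scales with 1/√n, a record seven times longer narrows the interval by a factor of roughly 2.7, plus a small gain from the smaller t-value.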

A few remarks on: Low flow frequency analysis
- Data required: annual minimum series
- Problem of independent events; do not split the year in the middle of a low flow period (i.e. low flow periods can be long)
- Often zero values (e.g. in arid climates or cold climates)
- There is a finite limit on how low a low flow can be (no negative flows!)
  - Different statistical treatment of the data
  - Fit an exponential distribution rather than, for instance, a lognormal distribution
  - Other often used distributions are the Weibull, Gumbel, Pearson Type III, and log-normal distributions

A few remarks on: Low flow frequency analysis Figure 7.15 Two probability density functions. The usual log ‑ normal distribution (solid line) is contrasted with the truncated log ‑ normal distribution (dashed line) that is possible with low flows (where the minimum flow can equal zero). Figure 7.16 Probability values (calculated from the Weibull sorting formula) plotted on a log scale against values of annual minimum flow (hypothetical values).

Application in the Rur river (7-1)
Station Stah (Germany)
Area: 2245 km²
Record: 1953 to 2001
Prepared by Tu Min (2004)

Application in the Rur river (7-2) Annual flood peaks (1954 - 2001) the water years (Nov - Oct) Homogeneity Statistical tests (change point)

Application in the Rur river (7-3) Flood frequency analysis Example – Normal distribution N = 48 μ = 76.1 σ = 28.5 t 97.5 % = 2.011

N = 48 μ y = 4.3 σ y = 0.4 t 97.5 % = 2.011 Application in the Rur river (7-4) Flood frequency analysis Example – LN distribution

Application in the Rur river (7-5) Flood frequency analysis Example – LN3 distribution N = 48 X min = 27.6 X max = 139.3 X med = 73.1 μ y = 5.0 σ y = 0.2 t 97.5 % = 2.011

Application in the Rur river (7-6) Flood frequency analysis Example - Gumbel distribution: distribution fitting

Application in the Rur river (7-7) Flood frequency analysis Example - Gumbel distribution: magnitude of T-year flood

Take home messages
- Frequency analysis, in particular of hydrological extremes, is a prerequisite for sustainable water resources management
- The choice between annual max. series and partial duration series depends on the length of the record
- Understanding of probability, recurrence interval and risk
- Knowledge of often used statistical distributions
- Calculation of confidence intervals
- Test of the goodness of fit of a PDF or CDF
- Special nature of low flow values

Closure probability and event stochastic variables, cont. and discrete Transformations Joint distributions Linear regression Parameter estimation (with Bestfit)

Closure
What is the role of statistics in hydrology and water resources?
You now have knowledge of
- frequency tables,
- histograms,
- distribution functions,
- statistical descriptors, and
- the Standard Normal Distribution and the Log-Normal distribution.
Are you now well prepared for the two assignments?

Assignment 1: Basic Statistics
Use the data set from Van Gelder's website
1. Calculate the range and estimate a reasonable number of intervals as well as class limits.
2. Calculate the relative and absolute frequencies.
3. Make a histogram and cumulative frequency distribution, hand drawn on linear paper (figures). Interpret this briefly (one sentence!)
4. Calculate the median, mode, and arithmetic mean.
5. Calculate the variance, standard deviation and coefficient of variation.
6. Calculate the skewness and kurtosis. Interpret the results briefly (one sentence!)
7. Plot the data as a graph on normal probability paper (figure). Is the Normal distribution suitable for that data set? Compare the mean value and standard deviation from your graph with your results in questions 4 and 5.
Deliver your printed report to Pieter van Gelder on November 5th at 15.45h at the start of the lectures

Assignment 2: Frequency Analysis
- Download your dataset from Van Gelder's website
- Determine the PDF with the lowest Chi-Square value in Bestfit
- Include in your report a plot of the observations and the optimal fit
- Extrapolate the fitted Exponential distribution to a 10^-3 /yr quantile
- Calculate the 95% confidence bounds around the 10^-3 /yr quantile for the Exponential fit
Deliver your printed report to Pieter van Gelder on November 6th at 8.45h at the start of the lectures

Assignment 3: Transformations of distributions
- Download your parameters a and b from Van Gelder's website
- Generate 28 random numbers from the Uniform distribution with lower bound a and upper bound b
- Plot your data in a histogram and draw the PDF of the uniform distribution in the same plot
- Generate 100 sets of 28 random numbers from the above Uniform distribution and take the maximum of each set
- Plot these 100 maxima in a histogram, derive the theoretical distribution function for these 100 maxima and draw the PDF in the same plot
- Assume that the above 100 numbers are monthly maximum wind speeds. Transform your wind speeds to wind pressures and plot your wind pressures in a cumulative distribution plot.
Deliver your printed report to Pieter van Gelder on November 13th at 17.00h at the reception of UNESCO-IHE

Final mark for Review of statistics and frequency analysis (module 1) Weight factor of all computer exercises is 0.5 in your final mark Weight factor of written test is 0.5 in your final mark
