Review of statistics and frequency analysis Academic year 2009 - 2010 Associate Professor: Dr. P.H.A.J.M. van Gelder TU Delft, Faculty of Civil Engineering.


1 Review of statistics and frequency analysis
Academic year 2009 - 2010
Associate Professor: Dr. P.H.A.J.M. van Gelder, TU Delft, Faculty of Civil Engineering and Geosciences; UNESCO-IHE, Guest Lecturer
Year 2009 lectures on:
–October 27th, 8.45 – 10.30h
–October 28th, 15.45 – 17.30h
–November 5th, 15.45 – 17.30h
–November 6th, 8.45 – 10.30h
A selection of these slides will be presented during the course.
Contact for students: Dr.ir. P.H.A.J.M. van Gelder, room: 3.87, ext: 86544

2 Overall outline of the course Review of statistics and frequency analysis Data analysis, random variables, classification, stat. moments, frequency distributions; samples, populations and probability models; parameter estimation and confidence intervals.

3 Introduction to this Course 1 + 1 lecture periods (basic statistics), 1 + 1 lecture periods on frequency analysis and a written exam Dialog instead of monolog (ask questions!) Complete power point presentations can be found on Van Gelder’s website Brief self-introduction: –Background (name, country, university education etc.) –Interests, experiences with statistics? –What do you hope to get out of this part of the course? –Who has experienced hydrological extremes?

4 Outline for today Introduction on natural hazards and probabilistic design Data sets for river– and coastal engineers Theory on: –Probability and events –Random (or stochastic) variables –Transformations –Multivariate distributions

5 Number of Floods worldwide Source: Dartmouth Flood Observatory, 2003

6 Regional Distribution of Large Floods Source: Dartmouth Flood Observatory, 2003 1999-2002 1985-1988

9 Reasons for Concern Source: Smith et al 2001, TAR IPCC WG II

10 Does Global Warming Lead to an Intensification of the Hydrological Cycle and More Extreme Events?
More flood-producing rain (but regional differences)
Longer rainless periods and higher evaporation demand of the atmosphere
There are indications for:
–Changing regimes, flood timing
–More frequent floods, landslides etc.
–No reduction of droughts, but decreased summer low flows, e.g. in western Europe

11 Effects of changing Mean and Variance of Precipitation on Stream Flow (i.e. non-linear processes, thresholds, etc.) (from Middelkoop 2005, after Arnell 1996)

12 Frequency analyses can be done for …
High flows
–Flood peak discharge
–Flood volume
–Occurrence of high flows in certain periods/months
Low flows
–Minimum low flow discharge
–Runoff deficits
Groundwater levels, groundwater fluctuations etc.
Estimated cost of damage caused by natural hazards
Simulated variables (model outputs)
Daily rainfall, rainfall intensities etc.
….
Let’s start with examples of flood peak discharges

13 How do we measure floods? (From: Hornberger et al., 1998)

14 Rating curve (From: Hornberger et al., 1998)

15 Application of rating curve to measured water levels at a gauge (From: Hornberger et al., 1998)

16 …. that is not always that easy! (The big flood in the HJ Andrews 1996)

17 Continuous measurement of discharges (incl. floods and low flows) (From: Hornberger et al., 1998)

18 Floods in 2002; river Elbe, city of Dresden, Germany

19 Flood defence structures Storm Surge Barrier Oosterschelde (NL)

20 Maeslantkering - storm surge barrier (NL)

21 Variability in annual precipitation (NL)

22 Variability in daily river discharges (Meuse river, NL)

23 Example: Evaluation of hydrological extremes (flood runoff); Case in Austria 2005

24 Example: Evaluation of floods in a regional context (Case study: Austria 2005)

25 Ongoing EU projects on Flood Risks
EFFS: A European Flood Forecasting System –to develop a prototype of a European flood forecasting system for 4-10 days in advance, which could provide daily information on potential floods for large rivers, and flash floods in small basins.
SPHERE: Systematic, Palaeoflood and Historical data for the improvement of flood Risk Estimation –to develop a new approach which complements hydrologic modelling and the application of historical and paleoflood hydrology to increase the temporal framework of the largest floods over time spans from decades to millennia, in order to improve the estimation of extreme flood occurrences.

26 THARMIT: Torrent Hazard Control in the European Alps –to develop practical tools and methodologies for hazard assessment, prevention and mitigation, and to devise methods for saving and monitoring potentially dangerous areas.
CARPE DIEM: Critical assessment of Available Radar Precipitation Estimation techniques and Development of Innovative approaches for Environmental Management –to improve real-time estimation of radar rainfall fields for flood forecasting, by coupling multi-parameter polarisation radar data and NWP, and exploiting NWP results in order to improve the interpretation of radar observations.

27 IMPACT: Investigation of Extreme Flood Processes and Uncertainty –to investigate extreme flood and defense failure processes, their risk and uncertainty. Will consider dam breach formation, sediment movement, flood propagation and predictive models, within an overall framework of flood risk management. GLACIORISK: Survey and Prevention of Extreme Glaciological Hazards in European Mountainous Regions –to develop scientific studies for detection, survey and prevention of glacial disasters in order to save lives and reduce damages.

28 SAFERELNET: Risk assessment of natural hazards in Europe
MITCH: Mitigation of Climate Induced Hazards –dealing with the mitigation of natural hazards with a meteorological cause, in order to assist planning and management. The main focus will be on flood forecasting and warning, but it will also include other flood-related hazards, such as landslips and debris flow, and longer-term climate hazards, such as drought, and the possible impact of climate change on the frequency and magnitude of hazards.
ADC-RBM: Advanced Study Course in River Basin Modelling for Flood Risk Mitigation - June 2002

29 FLOODMAN: Near real-time flood forecasting, warning and management system based on satellite radar images, hydrological and hydraulic models and in-situ data –near real-time monitoring of flood extent using spaceborne SAR, optical data & in-situ measurements, hydrological and hydraulic model data. The result will be an expert decision system for monitoring, management and forecast of floods in selected areas in Europe. The monitoring will also be used to update the hydrological/hydraulic models and thereby improving the quality of flood forecasts.

30 FLOODSITE: The FLOODsite project covers the physical, environmental, ecological and socio-economic aspects of floods from rivers, estuaries and the sea. The project is arranged into seven themes covering:
–Risk analysis – hazard sources, pathways and vulnerability of receptors.
–Risk management – pre-flood measures and flood emergency management.
–Technological integration – decision support and uncertainty.
–Pilot applications – for river, estuary and coastal sites.
–Training and knowledge uptake – guidance for professionals, public information and educational material.
–Networking, review and assessment.
–Co-ordination and management.

31 Extreme Events; Two very realistic simulations 1. River dike failure in the Netherlands 2. Asteroid impact in the Atlantic Ocean

32 The role of Statistics in Water Engineering Properties of the hydrological system Data analysis and statistics (later also modeling etc.) Statistics of hydrological variables for decision support

33 We need Information about …
Water balance: P = R + ET + dS/dt
Variability and heterogeneity of hydrological variables (groundwater levels, precip. patterns etc.)
Hydrological extremes: droughts, floods, x-year flood
Scenarios for: land use change, climate change, different water management strategies, etc.
Statistical Analysis !!

34 4 types of error occur when measuring a hydrological variable
1. Operation and function errors: malfunction of measuring instrument, personal (human) error.
2. Random error: caused by numerous minor impacts, partly independent from each other. If the measurement is repeated frequently, the values fluctuate around the true value.
3. Constant systematic error: inherent in any kind of equipment (e.g., wrong installation of instruments, wrongly indicated zero point, incorrect rating curve etc.); constant in respect of time, but may vary according to the measuring range.
4. Variable systematic error: usually caused by insufficient control during the measuring period; mostly originates in the instrument (e.g., “drift” of the device, growing of plants at the location of measurement etc.). Can be avoided through continuous comparison of the measurements and repeated calibration of the instruments.
Systematic errors cannot be reduced by increasing the number of measurements if equipment and measuring conditions remain the same!

35 Structural design principles Old methods: –determine a worst case load –determine a worst case strength –determine the geometry of the structure

36 Disadvantages of old method Unknown how safe the structure is No insight in contribution of different individual failure mechanisms No insight in importance of different input parameters Uncertainties in variables cannot be taken into account Uncertainties in the physical models cannot be taken into account

37 Failure mechanisms of a dike

38 Design of a structure Random boundary conditions

39 Fault tree with AND and OR

40 Mathematics of AND and OR In case of an AND-gate, you should multiply the probabilities. In case of an OR-gate, you should add the probabilities (and subtract the product of the two probabilities). Important condition: This is only true when both mechanisms are fully (statistically) independent
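The two gate rules can be sketched in a few lines of Python. This is only an illustration under the independence condition stated on the slide; the function names and the two failure probabilities are hypothetical and not taken from the course material.

```python
def p_and(p_a: float, p_b: float) -> float:
    """AND-gate: both independent events occur."""
    return p_a * p_b


def p_or(p_a: float, p_b: float) -> float:
    """OR-gate: at least one of two independent events occurs."""
    return p_a + p_b - p_a * p_b


# Hypothetical failure probabilities of two independent mechanisms
p_mechanism_1 = 0.01
p_mechanism_2 = 0.02

print("AND-gate:", p_and(p_mechanism_1, p_mechanism_2))
print("OR-gate: ", p_or(p_mechanism_1, p_mechanism_2))
```

For small probabilities the product term in the OR-gate is negligible, so the OR-gate result is close to the plain sum.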

41 Example of dependence Modern wave run-up formula is: (for shallow water, last equation is somewhat different and implicit)

42 Example of dependence (2) So the answer depends on H and T. But in a single wave field, T = f(H), for example: T = 3.9 * H^0.376. This can be modelled as: T = A * H^B, in which both A and B are stochastic variables with a mean and standard deviation

43 Example of dependence (3) But this is only true in case of a single wave field (wind waves OR swell waves) When there are more wave fields H and T are NOT statistically independent There is no good model for run-up due to double peaked spectra, but there is an approximation by Van der Meer

44 Example of dependence (4)

45 Two approaches First approach: start at the bottom and calculate the probability of failure according to normal design practice. Second approach: start at the top and assign probabilities to the failure mechanisms

46 Two approaches (2) Usually, when starting at the bottom, you do not arrive at the required overall failure probability. Usually, when starting at the top, you cannot construct some elements. So in practice, you have to use a mixture

47 Sensitivity analysis What is the effect of a 10% change in input on the output? This determines how important an input parameter is
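A 10% perturbation test can be sketched as follows. As the example model, the sketch reuses the single-wave-field relation T = 3.9 * H^0.376 quoted earlier in the slides; the function names, the choice of H = 2.0 and the assumed units (H in metres, T in seconds) are illustrative assumptions.

```python
def wave_period(h: float, a: float = 3.9, b: float = 0.376) -> float:
    """Example model: T = a * H**b (relation quoted earlier in the slides)."""
    return a * h ** b


def relative_sensitivity(func, x: float, change: float = 0.10) -> float:
    """Relative change in the output when the input x is increased by `change`."""
    base = func(x)
    return (func(x * (1.0 + change)) - base) / base


sens = relative_sensitivity(wave_period, 2.0)
print(f"10% more wave height changes the period by {100 * sens:.2f}%")
```

For a power law the relative change does not depend on the starting value of H, which makes it a convenient first test case before trying models with thresholds.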

48 End of introduction

49 Data availability The internet offers a huge source of historical and real-time data

50 The theory will be explained with examples and data sets taken from river engineering:
The Global Runoff Data Centre http://grdc.bafg.de/servlet/is/910/
Mediterranean Hydrological Cycle Observing System (Med-HYCOS project) http://medhycos.mpl.ird.fr
UNESCO International Hydrological Programme http://webworld.unesco.org/water/ihp/db/
The Global River Discharge Database (RIVDIS) http://www.rivdis.sr.unh.edu

51 Local Websites
Ministry of Water Resources of China http://www.mwr.gov.ch/english/index.asp
Ministry of Water Resources of India http://mowr.gov.in
Water Commission of India http://cwc.nic.in
These sites were useful for obtaining basic information about river basins, but not so useful for downloading discharge data. Apart from websites, data is also published in National Water Resources Books

52 Datasets for river engineers The longest period of observation is recorded for the river Nemunas at Smalinninkai: 1812-2003 (LT). The majority of European rivers have observation records dating from the period 1910-1920 and continuing to 1999-2004. On most Asian rivers water discharges have been observed since the period 1930-1940, although the river Bia at Biisk (Russia) has a record 108 year period of observation (1895 to 2003). The shortest period of observation is found on the Indian rivers (1939 -1979), the Chinese rivers (1930-1985) and the Iranian rivers (1963-1985).

53 Your data for the exercise is available at: http://www.citg.tudelft.nl/live/pagina.jsp?id=418a276e-b63e-4cec-a6fe-763feb04f984&lang=en

54 Some snap shots

61 Data for coastal engineers
www.oceanor.no/
www.knmi.nl/onderzk/oceano/waves/era40/license.cgi
www.globalwavestatisticsonline.com/
http://www.golfklimaat.nl
http://www.actuelewaterdata.nl
http://www.hydraulicengineering.tudelft.nl/public/gelder/paper56-data3.zip

62 Hm0, H1/3, HTE3, Tm02, TH1/3, Th0, wind direction, wind speed, water level, surge

63 Important data source for Dutch data

64 Some snap shots

69 real time wave data; significant wave heights

70 wave periods

71 Water levels gauges in Mid West Netherlands

72 Water levels and astronomical tide

73 The wave data on the wave climate site is available in data files per year (now: 1979 - 2002). The files contain wave data in the following format:
19880221 0100 90 4 84 12 52 66 351 10 -6 3433402
19880221 0400 76 3 73 15 53 65 339 -95 -2 3433402
19880221 0700 67 3 66 10 42 53 346 -31 -1 3433402
19880221 1000 67 3 64 12 40 48 344 51 -8 3433402
19880221 1300 66 3 65 11 36 44 308 -3 0 3433402
19880221 1600 75 3 73 12 40 47 298 -93 6 3433402
19880221 1900 91 4 79 13 41 48 325 0 3 3433402
19880221 2200 86 4 80 11 41 47 334 95 0 3433402
19880222 0100 84 5 81 13 42 50 323 38 4 0400402
…….. etc.

74 The files are arranged as follows:
Column nr.   Name   Unit
1   date   [yyyymmdd]
2   time   [hhmm] MET !
3   wave height Hm0   [cm]
4   accuracy wave height Hm0 (standard deviation)   [cm]
5   wave height H1/3   [cm]
6   wave height HTE3   [cm]
7   wave period Tm02   [0.1 s]
8   wave period TH1/3   [0.1 s]
9   wave direction Th0   [degrees], nautical [1]
10   water level   [cm] NAP/MSL
11   surge   [cm]
12   code number which indicates the origin of the given value   [-]
[1] Nautical degrees: (from) North = 0 degrees, (from) East = 90°, South = 180°, West = 270° and North again = 360°.
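A record in this layout could be read roughly as follows. This is a sketch, not the site's own tooling: the `WaveRecord` type, its field names and the conversion to metres and seconds are assumptions made for illustration. The origin code is kept as text so that a leading zero is not lost.

```python
from datetime import datetime
from typing import NamedTuple


class WaveRecord(NamedTuple):
    timestamp: datetime      # date and time (MET)
    hm0_m: float             # wave height Hm0 [m]
    hm0_std_m: float         # standard deviation of Hm0 [m]
    h13_m: float             # wave height H1/3 [m]
    hte3_m: float            # wave height HTE3 [m]
    tm02_s: float            # wave period Tm02 [s]
    th13_s: float            # wave period TH1/3 [s]
    direction_deg: float     # wave direction Th0 [nautical degrees]
    water_level_m: float     # water level [m] NAP/MSL
    surge_m: float           # surge [m]
    origin_code: str         # code number for the origin of the values


def parse_wave_line(line: str) -> WaveRecord:
    """Parse one 12-column record; cm and 0.1 s are converted to m and s."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    timestamp = datetime.strptime(fields[0] + fields[1], "%Y%m%d%H%M")
    hm0, hm0_std, h13, hte3 = (int(v) / 100.0 for v in fields[2:6])
    tm02, th13 = (int(v) / 10.0 for v in fields[6:8])
    direction = float(fields[8])
    level, surge = (int(v) / 100.0 for v in fields[9:11])
    return WaveRecord(timestamp, hm0, hm0_std, h13, hte3,
                      tm02, th13, direction, level, surge, fields[11])


sample = "19880221 0100 90 4 84 12 52 66 351 10 -6 3433402"
record = parse_wave_line(sample)
print(record.timestamp, record.hm0_m, record.tm02_s, record.direction_deg)
```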

75 Probability P(A) = probability of event A Mathematical definition Frequentistic definition

76 Mathematical definition Axioms: 1. P(A) ≥ 0 2. P(Ω) = 1 3. P(A or B) = P(A) + P(B) (if A and B are mutually exclusive)

77 Frequentistic definition P(A) = N(A) / N in which:
N(A): number of experiments leading to A
N: total number of experiments
example: probability that a consumer product fails within 1 year after production

78 example interpretation P(A) = n(A) / N
P: probability
n(A): number of outcomes belonging to event A
N: total number of outcomes
P(A) = 4 / 24 = 1 / 6

79 example dice P(x = 4) = 1/6 P(x ≥ 5) = 2/6 P(x even) = 3/6

80 Some history
1650 Pascal / Fermat
1750 Bernoulli / Bayes
1850 Venn / Boole
1920 Von Mises
1960 Savage / Lindley (decision making)
1970 Benjamin / Cornell (decision making)

81 1960’s and onwards ‘Years ago a statistician might have claimed that statistics deals with the processing of data; today statisticians will be more likely to say that statistics is concerned with decision making in the face of uncertainty.’

82 probability calculation calculation of a probability from other probabilities

83 Joint events
Union: A or B (A ∪ B)
Intersection: A and B (A ∩ B)
Implication: A in B (A ⊂ B)
Complement: not A

84 Union P(A or B)
P(A or B) = P(A) + P(B) - P(A and B)
13/24 = 6/24 + 9/24 - 2/24

85 Intersection P(A and B)
P(A and B) = n_AB / n = (n_A / n) * (n_AB / n_A) = P(A) * P(B | A) = 6/24 * 2/6 = 2/24

86 Conditional probability P(A | B) = probability of A given the fact that event B has occurred
P(A and B) = P(B) P(A | B)
P(A | B) = P(A and B) / P(B)

87 Conditional probability
P(rain in Delft on Sept. 18, 2024)?
P(rain in Delft on Sept. 18, 2024 | rain in Amsterdam on Sept. 18, 2024)?
P(rain in Delft on Sept. 18, 2024 | rain in Cape Town on Sept. 18, 2024)?

88 example: dice

89 Independence A and B are independent if P(A | B) = P(A). In that case: P(A and B) = P(A) P(B)

90 Important rules Theorem of total probability Generalisation to continuous integral “in which the uncertainty is integrated out” Theorem of Bayes
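The formulas for these rules did not survive in the transcript. In their standard textbook form, with B_1, ..., B_n mutually exclusive and exhaustive events and f_X the density of a continuous variable X (the symbols are chosen here for illustration), they read:

```latex
% Theorem of total probability
P(A) = \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)

% Generalisation to a continuous integral (the uncertainty in X is integrated out)
P(A) = \int_{-\infty}^{\infty} P(A \mid X = \xi)\, f_X(\xi)\, d\xi

% Theorem of Bayes
P(B_i \mid A) = \frac{P(A \mid B_i)\, P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j)\, P(B_j)}
```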

91 example: quiz dilemma Car in A, B or C. You (U) choose A. Quizmaster (QM): Good that you didn’t choose B, because it is empty. Would you still like to switch to C? [Figure: three boxes A, B and C]

92 Quiz-dilemma Theorem of Bayes: Yes, switch to C! Notes: here "info" is the event that the QM reveals B to be empty. P(info) = 0.5, because the QM, who never opens the box with the car, must open B when the car is in C and, assuming he picks at random when both are empty, opens B half of the time when the car is in A. P(info|C) = 1, because if the car is in C and you have chosen A, B is the only box the QM can open. A clever student is invited to write a simulation programme to find out if this is indeed true
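Written out, the Bayes calculation for this dilemma could look as follows. It assumes the contestant chose A, that "info" means the quizmaster revealed B to be empty, and that he picks at random between B and C when both are empty.

```latex
% Prior: the car is equally likely to be in A, B or C
P(A) = P(B) = P(C) = \tfrac{1}{3}

% Probability that the quizmaster opens B (total probability)
P(\text{info}) = P(\text{info} \mid A)\,P(A) + P(\text{info} \mid B)\,P(B) + P(\text{info} \mid C)\,P(C)
               = \tfrac{1}{2}\cdot\tfrac{1}{3} + 0\cdot\tfrac{1}{3} + 1\cdot\tfrac{1}{3} = \tfrac{1}{2}

% Posterior probabilities after the information
P(C \mid \text{info}) = \frac{P(\text{info} \mid C)\,P(C)}{P(\text{info})} = \frac{1 \cdot \tfrac{1}{3}}{\tfrac{1}{2}} = \tfrac{2}{3},
\qquad
P(A \mid \text{info}) = \frac{\tfrac{1}{2} \cdot \tfrac{1}{3}}{\tfrac{1}{2}} = \tfrac{1}{3}
```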

93 Solution of Jeroen van den Bos Development of 2 Matlab scripts Quiz_noinfo.m (choose a box and check if the car is there with no information from the QM) Quiz_info.m (choose a box and update your choice when the QM gives his information on another box)

94 Quiz_noinfo.m
clear;
N = 2000;
NoOfBoxes = 3;
NoSuccess = 0;
for i = 1:N
    Box_Car = fix(1+rand(1)*(NoOfBoxes));            % car is put randomly in a box
    Box_Guess = fix(1+rand(1)*(NoOfBoxes));          % random choice of a box
    NoSuccess = NoSuccess + (Box_Car == Box_Guess);  % if guess is right, increase # of successful attempts
    Fr_Success(i) = NoSuccess/i;                     % frequency of success after i attempts
end
P_Success = Fr_Success(N)                            % final result
% output
% ------
plot(1:N,Fr_Success,'.',[0 N],[1 1]/NoOfBoxes)
axis([0 N 0 1])
legend('Simulation result',['P = 1/' num2str(NoOfBoxes)])

95 Quiz_info.m
clear;
N = 2000;
NoOfBoxes = 3;
NoSuccess_Stay = 0;
NoSuccess_Switch = 0;
for i = 1:N
    Box_Car = fix(1+rand(1)*(NoOfBoxes));       % car is put randomly in a box
    Box_1stGuess = fix(1+rand(1)*(NoOfBoxes));  % random choice of a box
    k = 0;                                      % select possible empty boxes
    for j = 1:NoOfBoxes
        if (Box_Car ~= j) && (Box_1stGuess ~= j)    % empty box cannot be 'car' or 'guess'
            k = k + 1;
            Empty_Boxes(k) = j;                     % vector of empty boxes
        end
    end                                         % closes the for-loop (missing in the flattened listing)
    Box_Empty = Empty_Boxes(fix(1+rand(1)*k));  % choose randomly from empty boxes
    k = 0;                                      % select possible alternatives
    for j = 1:NoOfBoxes
        if (Box_Empty ~= j) && (Box_1stGuess ~= j)  % alt box cannot be 'empty' or 'guess'
            k = k + 1;
            Alternative_Boxes(k) = j;               % vector of alternative boxes
        end
    end                                         % closes the for-loop (missing in the flattened listing)
    Box_Alternative = Alternative_Boxes(fix(1+rand(1)*k));  % choose randomly from alternative boxes
    NoSuccess_Stay = NoSuccess_Stay + (Box_Car == Box_1stGuess);         % successful attempts, 'stay' strategy
    NoSuccess_Switch = NoSuccess_Switch + (Box_Car == Box_Alternative);  % successful attempts, 'switch' strategy
    Fr_Success_Stay(i) = NoSuccess_Stay/i;      % frequency of success after i attempts
    Fr_Success_Switch(i) = NoSuccess_Switch/i;  % frequency of success after i attempts
end
P_Success_Stay = Fr_Success_Stay(N)             % final result
P_Success_Switch = Fr_Success_Switch(N)         % final result
% output
% ------
plot(1:N,Fr_Success_Stay,'r.',1:N,Fr_Success_Switch,'b.',[0 N],[1 1]*P_Success_Stay,'r--',[0 N],[1 1]*P_Success_Switch,'b--')
axis([0 N 0 1])
%legend('Simulation result Stay','Simulation result Switch',['P_{Stay} = ' num2str(P_Success_Stay)],['P_{Switch} =', num2str(P_Success_Switch)],'location','EastOutside')
legend(['"Stay" strategy (P_{success} = ' num2str(P_Success_Stay) ')'],['"Switch" strategy (P_{success} = ' num2str(P_Success_Switch) ')']);
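For readers without Matlab, the same experiment can be sketched in Python. This is a re-implementation for illustration, not the student's original script; the function name, the fixed seed and the number of trials are assumptions.

```python
import random


def simulate_quiz(n_trials: int = 20000, n_boxes: int = 3, seed: int = 1):
    """Return the success frequencies of the 'stay' and 'switch' strategies."""
    rng = random.Random(seed)
    wins_stay = 0
    wins_switch = 0
    for _ in range(n_trials):
        car = rng.randrange(n_boxes)          # car is put randomly in a box
        first_guess = rng.randrange(n_boxes)  # random first choice
        # The quizmaster opens an empty box that is neither the car nor the guess
        empty = rng.choice([b for b in range(n_boxes)
                            if b != car and b != first_guess])
        # The alternative is any box that is neither opened nor the first guess
        alternative = rng.choice([b for b in range(n_boxes)
                                  if b != empty and b != first_guess])
        wins_stay += (car == first_guess)
        wins_switch += (car == alternative)
    return wins_stay / n_trials, wins_switch / n_trials


p_stay, p_switch = simulate_quiz()
print(f"stay: {p_stay:.3f}, switch: {p_switch:.3f}")
```

With three boxes the two frequencies should settle near 1/3 and 2/3, in line with the Bayes result.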

98 Updating process If new information becomes available, new estimates can be made about the failure probability of a system

99 stochastic variables

100 What is a stochastic variable? probability distributions Fast characteristics Distribution types Two stochastic variables

101 stochastic variable Quantity with uncertainty: –Natural variation –Lack of statistical data –Schematizations example: –strength of concrete –outcome of a die –temperature in Delft on September 16th, 2014

102 Relation with events uncertainty can be expressed with probabilities probability that stochastic variable X –is smaller than x –larger than x –equal to x –is in the interval [x, x+Δx] –etc.

103 probability distribution probability distribution function = probability P(X ≤ ξ): F_X(ξ) = P(X ≤ ξ), with X the stochastic variable (stochast) and ξ a dummy variable [Figure: F_X(ξ) rising from 0 to 1]

104 probability density This is a probability density function

105 probability density Differentiation of F to ξ: f_X(ξ) = dF_X(ξ) / dξ f = probability density function f_X(ξ) dξ = P(ξ < X ≤ ξ + dξ)

106 [Figure: F_X(ξ) and f_X(ξ); P(X ≤ ξ) is the area under f_X to the left of ξ, and f_X(ξ) dξ = P(ξ < X ≤ ξ + dξ)]

107 [Figure: F_X(ξ) and f_X(ξ), with P(X ≤ ξ) shown as an area under f_X]

108 Discrete and continuous

109 Complementary probability distribution Complementary cumulative distribution function (ccdf) of variable S: P(S > x) = 1 - P(S ≤ x) = 1 - F_S(x) P{S > S_d} = P_target, with S_d in the tail of the distribution

111 Fast characteristics μ_X: mean; σ_X: standard deviation, an indication of the spreading [Figure: probability density f_X(x) with μ_X and σ_X marked]

112 Mean ≠ maximum (mode). Median is the value m for which P(X<m) = 50% [Figure: skewed probability density f_X(x) with μ_X and σ_X marked]

113 Mean Variance Standard deviation Variation coefficient
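The definitions behind these four characteristics are missing from the transcript. For a continuous variable X with density f_X they are usually written as below; the symbol V_X for the variation coefficient is an assumption.

```latex
% Mean
\mu_X = E[X] = \int_{-\infty}^{\infty} \xi\, f_X(\xi)\, d\xi

% Variance
\sigma_X^2 = E\big[(X - \mu_X)^2\big] = \int_{-\infty}^{\infty} (\xi - \mu_X)^2 f_X(\xi)\, d\xi

% Standard deviation
\sigma_X = \sqrt{\sigma_X^2}

% Variation coefficient (dimensionless, defined for \mu_X \neq 0)
V_X = \frac{\sigma_X}{\mu_X}
```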

114 distribution types Uniform distribution Normal distribution Lognormal distribution Gumbel distribution Weibull distribution Gamma distribution ….

115 Uniform distribution on [a, b]: f_X(ξ) = 1/(b-a) for a ≤ ξ ≤ b; area = total probability = 1; mean μ = (a+b)/2; standard deviation σ = (b-a)/√12

116 Matlab demonstration Generate numbers from a Uniform distr. Make a histogram (observe the variability around its mean) Calculate the mean value and standard deviation They should converge to 0.5 and 0.2887 (1/sqrt(12)) This is indeed confirmed by the simulation
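The demonstration can be reproduced along these lines in Python; the sample size and the seed are arbitrary choices for this sketch, and the histogram step is left out.

```python
import random
import statistics


def uniform_sample_stats(n: int = 100_000, seed: int = 42):
    """Draw n numbers from the uniform distribution on [0, 1) and summarise them."""
    rng = random.Random(seed)
    sample = [rng.random() for _ in range(n)]
    return statistics.fmean(sample), statistics.stdev(sample)


sample_mean, sample_std = uniform_sample_stats()
print(f"mean = {sample_mean:.4f} (theory 0.5)")
print(f"std  = {sample_std:.4f} (theory {1 / 12 ** 0.5:.4f})")
```

Rerunning with a smaller `n` shows how much the estimates scatter around 0.5 and 0.2887 before they settle.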

117 Normal distribution Example: strength of timber, bending strength (N/mm2); probability density function: normal distribution, mean = 37 N/mm2, standard deviation = 8.6 N/mm2 [Figure: probability density against bending strength (N/mm2), with μ_X and σ_X marked]

118 Normal distribution in CDF domain

119 Normal distribution in linearised CDF domain

120 normal distribution probability density: $f_X(\xi) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(\xi-\mu)^2}{2\sigma^2}\right)$ probability distribution: $F_X(\xi) = \int_{-\infty}^{\xi} f_X(t)\,dt$ in which: μ mean; σ standard deviation (σ > 0); ξ dummy variable (-∞ < ξ < ∞)

121 standard normal distribution normal distributed variable X: mean μ, standard deviation σ; standard normal distributed variable u: $u = (X-\mu)/\sigma$; probability density: $\varphi(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$; probability distribution: $\Phi(u) = \int_{-\infty}^{u} \varphi(t)\,dt$ (table or …)

122 This table can be used in both directions: 1. Given an x value, what is the corresponding exceedance probability? 2. Given a probability, what is the corresponding x value? Note that the table only describes the right-hand tail of the standard normal distribution. The left-hand tail can be obtained by symmetry around the point (0, 0.5). For ordinary normal distributions, always scale back to a standard normal distribution (by subtracting the mean value, and dividing by the standard deviation)

123 normal distribution Why so popular? Central limit theorem Sum of many variables (i.i.d.) is (almost) normally distributed. Y = X 1 + X 2 + X 3 + X 4 + …. i.i.d. = independent identically distributed
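A quick numerical check of the central limit theorem, sketched in Python: sums of 12 independent uniform(0, 1) numbers have mean 6 and standard deviation 1, and roughly 68% of them should fall within one standard deviation of the mean, as for a normal variable. The number of terms, the sample size and the seed are arbitrary choices for this illustration.

```python
import random
import statistics

rng = random.Random(7)
n_terms = 12          # Y = X1 + X2 + ... + X12, all i.i.d. uniform(0, 1)
n_sums = 50_000

sums = [sum(rng.random() for _ in range(n_terms)) for _ in range(n_sums)]

clt_mean = statistics.fmean(sums)    # theory: n_terms * 0.5 = 6
clt_std = statistics.stdev(sums)     # theory: sqrt(n_terms / 12) = 1
frac_within_1sd = sum(abs(s - 6.0) <= 1.0 for s in sums) / n_sums

print(f"mean = {clt_mean:.3f}, std = {clt_std:.3f}")
print(f"fraction within one standard deviation = {frac_within_1sd:.3f}")
```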

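124 Normal Distributions A continuous rv X is said to have a normal distribution with parameters μ and σ (where -∞ < μ < ∞ and σ > 0) if the pdf of X is $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$, -∞ < x < ∞

125 Standard Normal Distributions The normal distribution with parameter values μ = 0 and σ = 1 is called a standard normal distribution. The random variable is denoted by Z. The pdf is $f(z; 0, 1) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$, -∞ < z < ∞. The cdf is $\Phi(z) = P(Z \le z) = \int_{-\infty}^{z} f(y; 0, 1)\,dy$

126 Standard Normal Cumulative Areas [Figure: standard normal curve with the area to the left of z shaded]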

127 Standard Normal Distribution Let Z be the standard normal variable. Find (from table): a. P(Z < 0.85): area to the left of 0.85 = 0.8023 b. P(Z > 1.32) = 1 – P(Z ≤ 1.32) = 1 – 0.9066 = 0.0934

128 c. P(–2.1 < Z < 1.78): find the area to the left of 1.78, then subtract the area to the left of –2.1: = 0.9625 – 0.0179 = 0.9446
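Table look-ups like these can be cross-checked numerically. The sketch below uses the identity Φ(z) = ½[1 + erf(z/√2)] with the error function from Python's standard library; the function name is an assumption. The printed values should agree with the table entries up to rounding.

```python
import math


def std_normal_cdf(z: float) -> float:
    """Standard normal cdf: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


area_left_085 = std_normal_cdf(0.85)                          # area to the left of 0.85
area_right_132 = 1.0 - std_normal_cdf(1.32)                   # P(Z > 1.32)
area_between = std_normal_cdf(1.78) - std_normal_cdf(-2.1)    # P(-2.1 < Z < 1.78)

print(round(area_left_085, 4), round(area_right_132, 4), round(area_between, 4))
```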

129 Notation z_α will denote the value on the measurement axis for which α of the area under the z curve lies to the right of z_α

130 Ex. Let Z be the standard normal variable. Find z if a. P(Z < z) = 0.9278. Look at the table and find an entry = 0.9278, then read back to find z = 1.46. b. P(–z < Z < z) = 0.8132. P(–z < Z < z) = 2P(0 < Z < z) = 2[P(Z < z) – ½] = 2P(Z < z) – 1 = 0.8132, so P(Z < z) = 0.9066 and z = 1.32

131 Nonstandard Normal Distributions If X has a normal distribution with mean μ and standard deviation σ, then Z = (X – μ)/σ has a standard normal distribution.

132 Normal Curve Approximate percentage of area within given standard deviations (empirical rule): 68% within 1, 95% within 2 and 99.7% within 3 standard deviations of the mean.

133 Ex. Let X be a normal random variable with = 0.2266

134 Ex. A particular rash has shown up at an elementary school. It has been determined that the length of time that the rash will last is normally distributed with μ = 6 days and σ = 1.5 days. Find the probability that for a student selected at random, the rash will last for between 3.75 and 9 days.

135 P(3.75 < X < 9) = P(–1.5 < Z < 2) = 0.9772 – 0.0668 = 0.9104

136 Percentiles of an Arbitrary Normal Distribution (100p)th percentile for normal (μ, σ) = μ + [(100p)th percentile for standard normal] · σ

137 Lognormal distribution [Figure: lognormal probability density f_X(ξ), with μ_X and σ_X marked]

138 Lognormal distribution f_X(ξ): lognormal; f_Y(y): normal. If X is lognormally distributed, then Y = ln(X) is normally distributed: y = ln ξ or ξ = exp(y)

139 Lognormal distribution X lognormally distributed ⇔ Y = ln(X) normally distributed. Probability density function for X: $f_X(\xi) = \frac{1}{\xi\,\sigma_Y\sqrt{2\pi}} \exp\left(-\frac{(\ln\xi-\mu_Y)^2}{2\sigma_Y^2}\right)$ for ξ > 0, in which μ_Y and σ_Y are the parameters of the lognormal distribution: μ_Y mean value of Y (not of X !!); σ_Y standard deviation of Y (not of X !!)

140 Lognormal distribution X lognormally distributed ⇔ Y = ln(X) normally distributed
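Because the parameters refer to Y = ln(X) rather than to X itself, the conversion between the two sets of moments is often needed. The standard relations, stated here as a reminder rather than recovered from the slide, are:

```latex
% From the parameters of Y = ln(X) to the moments of X
\mu_X = \exp\!\left(\mu_Y + \tfrac{1}{2}\sigma_Y^2\right),
\qquad
\sigma_X^2 = \mu_X^2 \left(e^{\sigma_Y^2} - 1\right)

% And back, using the variation coefficient V_X = \sigma_X / \mu_X
\sigma_Y^2 = \ln\!\left(1 + V_X^2\right),
\qquad
\mu_Y = \ln \mu_X - \tfrac{1}{2}\sigma_Y^2
```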

141 Lognormal distribution As a consequence of the central limit theorem: the product of many variables is (almost) lognormally distributed, because the logarithm of a product is a sum of logarithms, so log y is (almost) normally distributed. Definition: log y normal ⇔ y lognormal. Example: salaries in a country are lognormally distributed.

142 Asymptotic distributions Normal Lognormal Weibull Gumbel Asymptotic distr. return in the 5th year course CT5310 by Vrijling and Van Gelder

143 Discrete distributions

144 Exponential Distribution A continuous rv X has an exponential distribution with parameter λ (λ > 0) if the pdf is f(x; λ) = λ e^(–λx) for x ≥ 0, and 0 otherwise

145 Mean and Variance The mean and variance of a random variable X having the exponential distribution: μ = 1/λ, σ² = 1/λ²

146 Applications of the Exponential Distribution Suppose that the number of events occurring in any time interval of length t has a Poisson distribution with parameter αt and that the numbers of occurrences in nonoverlapping intervals are independent of one another. Then the distribution of elapsed time between the occurrences of two successive events is exponential with parameter λ = α

147 Two stochastic variables

148 joint probability density

149 Contour lines of the joint density f_XY(ξ, η), with the marginal densities f_X(ξ) and f_Y(η)

150 Two stochastic variables Relation with events: f_XY(ξ, η) dξ dη = P(ξ < X ≤ ξ + dξ and η < Y ≤ η + dη) Also here: f density, F (cumulative) distribution

151 Example Length Weight [Figure: probability densities of length (m), in 1/m, and of weight (kg), in 1/kg]

152 Corresponding contour plot? [Figure: contour plot of length (m) against weight (kg)]

153 Scatter plot results of a large survey: health investigation, 1000 observations [Figure: scatter plot of length (m) against weight (kg)]

154 Dependence [Figure: scatter plot of length (m) against weight (kg), with the densities of weight (1/kg) and length (1/m)]

155 Characteristics μ_X, μ_Y; σ_X, σ_Y. Dependence: cov_XY covariance, or ρ_XY = cov_XY / (σ_X σ_Y) correlation, between -1 and 1

156 Covariance Cov(X,Y)=E((X-EX)(Y-EY)) calculation example on black board
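Since the blackboard example is not part of the transcript, here is a small stand-in calculation in Python. The five length/weight pairs are invented for illustration, and the population form (division by n) is used to match the definition Cov(X,Y) = E((X-EX)(Y-EY)).

```python
def covariance(xs, ys):
    """Population covariance: mean of (x - mean_x) * (y - mean_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Correlation coefficient: cov_XY / (sigma_X * sigma_Y)."""
    sigma_x = covariance(xs, xs) ** 0.5
    sigma_y = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sigma_x * sigma_y)


# Hypothetical observations: length in m, weight in kg
lengths = [1.6, 1.7, 1.8, 1.9, 2.0]
weights = [55.0, 65.0, 70.0, 85.0, 90.0]

cov_xy = covariance(lengths, weights)
rho_xy = correlation(lengths, weights)
print(f"cov = {cov_xy:.3f} m*kg, rho = {rho_xy:.3f}")
```

The covariance carries the units of both variables (here m·kg), whereas the correlation is dimensionless, which is why the latter is easier to compare between data sets.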

157 Correlation

158 Correlations between wave height and wave period (data from golfklimaat.nl website)

159 Introduction Engineering: structural reliability. Reliability is expressed through the probability that the structure falls apart: the smaller the probability, the larger the reliability. Risk = probability x consequences. Structure: Strength, Load. Falls apart if strength < load

160 Introduction Design value - principle P{S > S_d} = P_target [Figure: probability density function (pdf) of the load S, f_S(x), with S_d marked]

161 Introduction Cumulative distribution function cdf pdf

162 Introduction Design value - principle P{S ≤ S_d} = 1 - P_target [Figure: cumulative probability distribution function (cdf) of the load S, F_S(x), with S_d marked]

163 Introduction Design value - principle Complementary cumulative distribution function (ccdf) of the load S = 1 - F_S(x); P{S > S_d} = P_target [Figure: ccdf of the load S, with S_d marked]

164 Introduction Design value load S_d The design value of the load S_d, or quantile, is defined as: P{S > S_d} = P_target during reference period T_ref. Target probability P_target: –Depends on consequences of structural failure –Is specified in building codes Typical: P_target = 10^-4 - 10^-1 (structural collapse), T_ref = 15 - 100 years

165 Peaks over Threshold analysis for quantile estimation Let X1, X2, ..., Xn be a series of independent random observations of a random variable X with the distribution function F(x). To model the upper tail of F(x), consider k exceedances of X over a threshold u and let Y1, Y2, ..., Yk denote the excesses (or peaks), i.e. Yi = Xi - u.

166 Extreme value statistics If we know the distribution of a random variable (for instance, monthly water level, daily wave height, etc.), how does the distribution of the maximum of n random variables behave?

167 Beware of inhomogeneities

168 Stationary Time Series Exhibits stationarity in that it fluctuates around a constant long-run mean. Has a finite variance that is time invariant. Has a theoretical covariance between values of y_t that depends only on the difference apart in time

169 WHITE NOISE PROCESS X_t = u_t, u_t ~ IID(0, σ²) Stationary time series

170 Examples of non-stationary series: Share Prices, Exchange Rate, Income

171 Unit Root Tests How do you find out if a series is stationary or not?

172 Order of Integration of a Series A series which is stationary after being differenced once is said to be integrated of order 1 and is denoted by I(1). In general, a series which is stationary after being differenced d times is said to be integrated of order d, denoted I(d). A series which is stationary without differencing is said to be I(0). Example: $Y_t = b_0 + Y_{t-1} + \varepsilon_t$, so $Y_t \sim I(1)$; $\Delta Y_t = Y_t - Y_{t-1} = b_0 + \varepsilon_t$, so $\Delta Y_t \sim I(0)$

173 Informal Procedures to identify non-stationary processes (1) Eye ball the data (a) Constant mean? (b) Constant variance?

174 Statistical Tests for stationarity: Simple t-test Set up AR(1) process with drift (b_0): Y_t = b_0 + b_1 Y_{t-1} + ε_t, ε_t ~ iid(0, σ²) (1) Simple approach is to estimate eqn (1) using OLS and examine the estimated b_1. Use a t-test with null Ho: b_1 = 1 (non-stationary) against alternative Ha: b_1 < 1 (stationary). Test Statistic: TS = (b_1 – 1) / (Std. Err.(b_1)); reject the null hypothesis when the test statistic is large negative - 5% critical value is -1.65

175 Distribution of a maximum of random variables

176 Therefore...

177 Extreme value distribution of a uniform distribution

178 From an ‘operational point of view’ rather than conceptual point of view Matlab code ct53100.m
for j=1:100,
    n=12;
    for i=1:n,
        x(i)=5*rand(1);
    end
    y(j)=max(x);
end
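The same experiment can be compared with theory. For n independent variables with a common distribution function F(x), the maximum has distribution function F(x)^n; for the uniform distribution on [0, 5] with n = 12 this gives (x/5)^12. The Python sketch below checks this numerically; the function names, the number of repetitions (larger than the 100 in the Matlab loop) and the seed are assumptions.

```python
import random


def simulate_maxima(n_series: int = 10_000, n: int = 12,
                    upper: float = 5.0, seed: int = 0):
    """Maximum of n uniform(0, upper) numbers, repeated n_series times."""
    rng = random.Random(seed)
    return [max(upper * rng.random() for _ in range(n))
            for _ in range(n_series)]


def cdf_of_maximum(x: float, n: int = 12, upper: float = 5.0) -> float:
    """Theoretical cdf of the maximum: F(x)**n with F(x) = x / upper on [0, upper]."""
    u = min(max(x / upper, 0.0), 1.0)
    return u ** n


maxima = simulate_maxima()
x_check = 4.5
empirical = sum(m <= x_check for m in maxima) / len(maxima)
theoretical = cdf_of_maximum(x_check)
print(f"P(max <= {x_check}): simulated {empirical:.3f}, theory {theoretical:.3f}")
```

Changing `n` or swapping `rng.random()` for another generator is a quick way to explore how the parent distribution shapes the distribution of the maximum.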

180 Back to the operational viewpoint - Change the number of observations - Change the distribution type - From minimum to maximum - etc.

181 Introduction: from a visual point of view Wind statistics: mean 5 m/s, standard deviation 2.5 m/s [Figure: wind speed against time, with its probability density]

182 Introduction: visual point of view [Figure: probability densities of wind speed: instantaneous (around 5 m/s), 1-year maximum (around 20 m/s) and 50-year maximum (around 28 m/s)]

183 visual point of view

184 Procedure for minima of r.v.’s

185 Approach 1: extreme value distributions The extreme value distribution Distribution type Distribution parameters Quantile values (design values) Example: wind loads

186 Approach 1: extreme value distributions The extreme value distribution Limit theorem (Fisher-Tippett): regardless of the parent distribution, the maximum of many random variables has one of three distributions: Reverse Weibull (bounded maximum), Gumbel, or Frechet (bounded minimum). Conditions: the random variables are independent and have the same parent distribution.

187 Approach 1: extreme value distributions Extreme value distributions Reverse Weibull (convex) Gumbel (straight) Frechet (concave)

188 Approach 1: extreme value distributions Generalized Extreme Value distribution All three extreme value distributions are special cases of the Generalized Extreme Value distribution (GEV), distinguished by the sign of its shape parameter: shape = 0: Gumbel (EV type I for maxima); shape > 0: Frechet (EV type II for maxima); shape < 0: Reverse Weibull (EV type III for maxima)
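A minimal sketch of the GEV CDF, assuming the common parameterisation F(x) = exp(−(1 + ξ(x − μ)/σ)^(−1/ξ)) with the Gumbel limit exp(−exp(−(x − μ)/σ)) at ξ = 0. The names mu, sigma and xi are assumptions for this example; the sign convention matches the slide (positive shape gives Frechet, negative shape gives Reverse Weibull).

```python
import math


def gev_cdf(x, mu=0.0, sigma=1.0, xi=0.0):
    """GEV CDF: xi = 0 Gumbel, xi > 0 Frechet, xi < 0 Reverse Weibull."""
    z = (x - mu) / sigma
    if abs(xi) < 1e-12:
        # Gumbel limit of the general formula
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0.0:
        # Outside the support: below the lower bound (xi > 0)
        # or above the upper bound (xi < 0)
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))


for shape in (-0.5, 0.0, 0.5):
    print(shape, round(gev_cdf(2.0, xi=shape), 4))
```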

189 Approach 1: extreme value distributions Domain of attraction
Asymptotic distribution type of maximum (domain of attraction): parent distribution
Frechet: Pareto, Cauchy, Student-t (fat tail)
Gumbel: Normal, exponential, gamma, lognormal, Weibull
Reverse Weibull: uniform, beta (short tail)

190 Approach 1: extreme value distributions Example: wind load Example: Wind load Maximum over 50 years Quantile value at P target = 0.15 (design load) Steps: –Determine extreme value distribution type –Determine distribution parameters –Calculate requested quantile (design load)

191 Approach 1: extreme value distributions Distribution type Statistics: parent distribution is approx. Weibull EV-theory: domain of attraction is Gumbel Plot: monthly maxima of hourly averaged wind speeds Slightly convex?

192 Approach 1: extreme value distributions Gumbel probability plot CDF on Gumbel probability paper: Reverse Weibull-like deviation (poor convergence)

193 Approach 1: extreme value distributions Wind speed annual maxima Schiphol

194 Approach 1: extreme value distributions Wind pressure monthly maxima Schiphol pressure: q = 0.5 ρ U², with ρ the air density and U the wind speed

195 Transformations

196

197

198 Presenting large datasets In a histogram On probability paper

199 Classify your data Order the n observations. Number of classes: 1 + 1.33 ln(n). All classes preferably have the same width.

200 Histogram Wave height (cm’s): 25 45 35 25 30 70 20 45 65 30 40 40 35 45 55 35 32 37 28 45 49 39 40 60 29 34 47 35 45 49 35 45 34 28 34 54 48 38 32 39 45 58

201 Histogram #classes: 1 + 1.33 ln(42) ≈ 6 highest - lowest = 70 - 20 = 50 class width about 50 / 6 = 8 (we take 5 to choose a round number)

202 Histogram
class          frequency   freq/width (unit = 5)
17.5 - 27.5    3           3/2
27.5 - 32.5    7           7/1
32.5 - 37.5    9           9/1
37.5 - 42.5    6           6/1
42.5 - 47.5    8           8/1
47.5 - 57.5    5           5/2
57.5 - 77.5    4           4/4
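The histogram steps for the wave height data can be checked with a short Python sketch. The data and class boundaries are taken from the slides; the variable names are chosen for this example only.

```python
import math

wave_heights_cm = [
    25, 45, 35, 25, 30, 70, 20, 45, 65, 30, 40, 40, 35, 45,
    55, 35, 32, 37, 28, 45, 49, 39, 40, 60, 29, 34, 47, 35,
    45, 49, 35, 45, 34, 28, 34, 54, 48, 38, 32, 39, 45, 58,
]

n = len(wave_heights_cm)
n_classes = 1 + 1.33 * math.log(n)                       # rule of thumb from the slides
data_range = max(wave_heights_cm) - min(wave_heights_cm)

# Class boundaries as used on the slide (unequal widths at both ends)
edges = [17.5, 27.5, 32.5, 37.5, 42.5, 47.5, 57.5, 77.5]
unit_width = 5.0

counts = [
    sum(lo < h <= hi for h in wave_heights_cm)
    for lo, hi in zip(edges, edges[1:])
]
# Bar height = frequency per unit class width, so that bar area reflects frequency
bar_heights = [
    c / ((hi - lo) / unit_width)
    for c, (lo, hi) in zip(counts, zip(edges, edges[1:]))
]

print(n, round(n_classes, 2), data_range)
print(counts)
print(bar_heights)
```

Dividing by the class width matters only for the wider outer classes; for the classes of width 5 the bar height equals the frequency.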

203 Case study: Groundwater chemistry data set (from Y. Zhou, 2006: Hydrogeostatistics. UNESCO-IHE lecture notes.)

204 Step 1: Range of data R = x max - x min R(Cl) = 184 - 4.72 = 179.28 mg/l Frequency tables

205 Step 2: Number of class intervals, m Number of classes: 6 < m < 25; m = 1 + 1.33 ln(N). Example Cl: m should be around 6-7 classes. Class width Δx: Δx > R/m. Example Cl: 15 > 179.28/13 ≈ 14. Class limits x_j- (lower limit) and x_j+ (upper limit): x_j- = x_0 + (j-1)·Δx < values in class j < x_j- + Δx = x_j+ Frequency tables

206 Step 3: Number of measurements per class n j : absolute frequency f j =n j /n: relative frequency Frequency tables

207 Step 4: Creating frequency table Frequency tables

208 Step 5: Absolute and relative frequencies Creating of a Histogram

209 Step 6: Cumulative frequency table Frequency tables

210 Step 7: Cumulative frequency distribution curve Frequency distribution

211 Some typical frequency distributions

212 Statistical descriptors Descriptors of central tendency Mode : the value with largest frequency; average value of measurements of the class with the largest frequency Not applicable for distributions with several peaks For Cl: the mode is 19.4 mg/l

213 Descriptors of central tendency Median: the value corresponding to 50% cumulative frequency, i.e. the middle measurement for an odd number of samples or the average of the two middle measurements for an even number of samples For Cl: median = 31.4 mg/l Insensitive to the tails or outliers of the distribution; preferable for data sets with exceptional values Statistical descriptors

214 1. Descriptors of central tendency Quartiles: split the data into quarters –Lower quartile: 25% cumulative frequency For Cl: lower quartile = 16.3 mg/l –Upper quartile: 75% cumulative frequency For Cl: upper quartile = 56.3 mg/l In practice also the 1%, 5%,10%, 90%, 95% and 99% values are used (e.g. discharge data). Statistical descriptors

215 1. Descriptors of central tendency Arithmetic mean: average value of measurements For Cl: arithmetic mean = 43.72 mg/l More representative of the sample, but sensitive to outliers in a small sample. Most distributions are sufficiently characterised by the mean and the variance. Statistical descriptors

216 1. Descriptors of central tendency Geometric mean For Cl: geometric mean = 29.53 mg/l Not applicable for negative values. Hydrogeological variables are often not symmetrically distributed while their logarithms are; the geometric mean is then appropriate. The radius of a grain with main axes a, b, and c is characterised best by the cube root of a·b·c. Statistical descriptors

217 1. Descriptors of central tendency Harmonic mean For Cl: harmonic mean = 20.08 mg/l Appropriate for phenomena where small values are more important (e.g. hydraulic conductance; see lecture notes from Zhou, 2006) Statistical descriptors

218 Descriptors: summary of properties
Mode: indication of abundant values; isolated property; not applicable for distributions with several peaks.
Median: insensitive to the tails of the distribution; preferable for data sets with exceptional values.
Arithmetic mean: more representative of the sample; sensitive to exceptional values in a small sample. Most distributions are sufficiently characterised by the mean and the variance.
Geometric mean: not applicable for negative values. The radius of a grain with main axes a, b, and c is characterised best by the cube root of the product a·b·c.
Harmonic mean: more appropriate for phenomena where small values are more important.

219 Relations between central tendency descriptors The harmonic mean is smaller than or equal to the geometric mean, which in turn is smaller than or equal to the arithmetic mean; the three are equal only if x_1 = x_2 = ... = x_n. The mode, median and arithmetic mean coincide if the frequency distribution is symmetrical (and has a single peak); they generally differ when the distribution is not symmetrical. The mean of the log x distribution is equal to the logarithm of the geometric mean of x. Statistical descriptors
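The ordering of the three means, and the link between the geometric mean and the mean of the logarithms, can be illustrated with a few lines of Python. The sample values are hypothetical and the function names are chosen for this example.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    # Only defined for strictly positive values
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


def harmonic_mean(xs):
    # Small values dominate through their large reciprocals
    return len(xs) / sum(1.0 / x for x in xs)


sample = [2.0, 4.0, 8.0]  # hypothetical positive measurements
am = arithmetic_mean(sample)
gm = geometric_mean(sample)
hm = harmonic_mean(sample)
mean_of_logs = arithmetic_mean([math.log(x) for x in sample])

print(round(hm, 3), round(gm, 3), round(am, 3))
```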

220 2. Descriptors of dispersion (variation) Sample variance For Cl: variance = 1768.72 (mg/l)² Standard deviation: the square root of the variance For Cl: standard deviation = 42.06 mg/l Statistical descriptors

221 2. Descriptors of dispersion (variation) Statistical descriptors

222 2. Descriptors of dispersion (variation) Coefficient of variation For Cl: coefficient of variation = 0.96 Useful to compare the variations of two or more data sets. Statistical descriptors

223 3. Descriptors of asymmetry Coefficients of skewness For Cl: coefficient of skewness α 3 = 1.97 Statistical descriptors Moments or product moments!

224 Moments or product moments Central moments Statistical descriptors Moments are statistical descriptors of a data set:
–1st moment: mean or expected value, µ_x (“central tendency”)
–2nd moment: variance, σ_x²; the standard deviation σ_x is the square root of the variance (“spread around the central value”)
–3rd moment: skewness, γ_x (“measure of symmetry”)
–4th moment: kurtosis, κ_x (“peakedness of central portion of distribution”)

225 3. Descriptors of asymmetry Statistical descriptors

226 4. Descriptor of flatness/’peakedness’ Coefficient of kurtosis For Cl: coefficient of Kurtosis = 7.04 α 4 =3: for Normal distribution α 4 >3: steeper than Normal distribution α 4 <3: flatter than Normal distribution Statistical descriptors

227 Normal probability paper Sort the data from small to large. Assign each observation the plotting position i/(N+1), in which i is the order number and N the total number of data points.

228

229 [Figure: Daily Mean Run-Off Anomalies at Achleiten, Danube River; deseasonalised daily mean run-off against year, illustrating POT]

230 Flood frequency analysis (peak flows)
Annual maximum series (more common)
→ One can miss a large event if more than one occurs per year; but the series is continuous and easy to process
→ Often used for estimating extremes in long records (>10 years)
Partial duration series (“Peaks-Over-Threshold, POT”)
→ Definition of the threshold is tricky and requires experience
→ Often used for short records (<10 years)

231 Flood frequency analysis (peak flows): Annual max. series vs. partial duration series (Davie, 2002)

232 Annual max. series vs. partial duration series (POT) Langbein showed the following relationship (Chow 1964): 1/T = 1 − exp(−1/T_p) T: return period using annual max. series T_p: return period using partial duration series Differences get smaller for larger return periods (at a 10-year recurrence interval the annual exceedance probabilities 1/T and 1/T_p differ by less than 0.01, and T exceeds T_p by about 5%)!
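A small sketch of Langbein's relationship in Python, converting between the two return periods in both directions. The function names are assumptions made for this example.

```python
import math


def annual_from_partial(t_p):
    """Return period T (annual maximum series) from T_p (partial duration series)."""
    return 1.0 / (1.0 - math.exp(-1.0 / t_p))


def partial_from_annual(t):
    """Inverse relation: T_p from T (requires T > 1)."""
    return -1.0 / math.log(1.0 - 1.0 / t)


for t_p in (2.0, 10.0, 50.0, 100.0):
    t = annual_from_partial(t_p)
    # Relative difference between the two return periods, in percent
    print(t_p, round(t, 3), round(100.0 * (t - t_p) / t_p, 2))
```

The absolute gap T − T_p stays close to half a year, so the relative difference shrinks as the return period grows.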

233 Assumptions of frequency analysis
All data points are correct and precisely measured
→ Be aware of the uncertainty of peak flow data (uncertainty and errors come later in this course!)
Independent events: peak flows are not part of the same event
→ Carefully check the data set; plot the whole record, in particular all events of the POT series
→ Problems with events at the transition of the year (31 Dec – 1 Jan) in humid temperate or some tropical climates
Random sample: every value in the population has an equal chance of being included in the sample
The hydrological regime has remained static during the complete time period of the record
→ No land use change, no climate change, no changes in the river channels, no change in the flood water management etc. in the catchment (often not the case for long records!)
All floods originate from the same statistical population (homogeneity)
→ Different flood generating mechanisms (e.g. rain storms, snow melt, snow-on-ice etc.) might cause floods with different frequencies/recurrence intervals

234 Describing the frequency mathematically: Probability Distribution Function Typically defined in either of two forms: Probability density function (PDF) Cumulative distribution function (CDF) Discrete Continuous PDF CDF

235 Basics (examples using measured flow, Q) Probability of exceedance, P(X): probability that the flow Q is greater than or equal to X; P(X) ∈ [0,1] Relative frequency, F(X): probability of the flow Q being less than a value X; F(X) ∈ [0,1]. Can be read from a cumulative probability curve, but be careful with the selected class intervals. Average recurrence interval or return period, T(X): statistical term meaning the chance of exceedance once every T years over a long record (time step is usually one year). –Not exactly the number of years between events of a certain size! –Rather the average number of years between exceedances of X! –No regularity or periodicity in occurrences of exceedances (assumption) P(X) = 1 − F(X) T(X) = 1/P(X) = 1/(1 − F(X))

236 [Figure: relative frequency F(X) and probability of exceedance P(X), shown on both the PDF and the CDF]

237 Recurrence intervals for design purposes (flood protection) in Germany
Class 1 → Settlements, urban areas, important infrastructure: 50-100 years
Class 2 → Single buildings, neighbourhoods that are not permanently inhabited: 25-50 years
Class 3 → Farm land, intensively used: 10-25 years
Class 4 → Farm land: 5-10 years
(according to DIN 19700, part 99)
What about large dams, nuclear power plants etc.? → PMF

238 Exceedence probability for a specified number of time intervals (see Box C-4, page 561 in Dingman, 2002)

239 Examples: Exceedance probability and return period (based on Box C-4 in Dingman, 2002)
–What is the probability that a flood greater than or equal to the 100-year flood will occur next year? → P(X) = 1/T(X) = 0.01
–What is the probability that we will not have a flood greater than or equal to the 50-year flood next year? → F(X) = 1 − P(X) = 1 − 0.02 = 0.98
–What is the probability that we will not have a flood greater than or equal to the 20-year flood in the next 5 years? → (1 − P(X))^n = 0.95^5 = 0.774
–What is the probability that the next exceedance of the 100-year flood will occur in the 10th year from now? → p = (1 − 0.01)^9 × 0.01 = 0.00914
–What is the probability that the 100-year flood will be exceeded at least once in the next 40 years? → p = 1 − (1 − 0.01)^40 = 0.331
–What is the probability that the 50-year flood will be exceeded twice in a row (two independent events in one year), and how many 50-year floods can be expected on average in 1000 years? → p = 0.02 × 0.02 = 0.0004; and on average 20 floods in 1000 years.
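The worked examples on this slide follow from treating each year as an independent trial with exceedance probability 1/T. A minimal Python sketch, with function names chosen for this illustration:

```python
def prob_no_exceedance(return_period, n_years):
    """Probability that the T-year flood is not exceeded in n consecutive years."""
    return (1.0 - 1.0 / return_period) ** n_years


def prob_at_least_one(return_period, n_years):
    """Probability of at least one exceedance in n years."""
    return 1.0 - prob_no_exceedance(return_period, n_years)


def prob_first_in_year(return_period, k):
    """Probability that the next exceedance occurs exactly in year k."""
    p = 1.0 / return_period
    return (1.0 - p) ** (k - 1) * p


no_20yr_in_5 = prob_no_exceedance(20.0, 5)
first_100yr_in_year_10 = prob_first_in_year(100.0, 10)
any_100yr_in_40 = prob_at_least_one(100.0, 40)
expected_50yr_in_1000 = 1000 / 50.0  # expected number of exceedances

print(round(no_20yr_in_5, 3), round(first_100yr_in_year_10, 5),
      round(any_100yr_in_40, 3), expected_50yr_in_1000)
```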

240 (Bedient and Huber, 2002)

241 But how do we estimate P(X) and F(X) from data? Example: Flood frequency analysis

242 Example: Annual max. series for the river Wye (1971-97) (not Normal-distributed!) (Davie, 2002)

243 Plotting position – Weibull formula Rank the annual maximum series data from low to high (independent data, the year of occurrence is irrelevant). Calculate F(X) from the rank r and the N total data points (i.e. length of record: N years): F(X) = r/(N+1) –For example: the largest value of a 25-year record would plot at a recurrence interval of 26 years. –F(X) can never reach 1 –If you rank from high to low, P(X) = 1 − F(X) is calculated

244 Example: Annual max. series for the river Wye (1971-97) (not Normal-distributed!) (Davie, 2002)

245 Gringorten formula Difference to Weibull formula is often not great Use is often down to personal preferences Empirical constants (0.44 and 0.12) are valid for Gumbel distribution F(X) = (r-0.44) / (N+0.12) Comparison of Weibull and Gringorten formulae (Davie, 2002) [Please note: F(X)=p in the Workshop course note]
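Both plotting-position formulae have the form F = (r − a)/(N + b), so they can share one helper. The sketch below is illustrative: the function name and the 25 synthetic "annual maxima" are hypothetical stand-ins, not the river Wye data.

```python
def plotting_positions(values, a=0.0, b=1.0):
    """Rank values from low to high and return (value, F, T) tuples.

    F = (r - a) / (N + b): Weibull uses a = 0, b = 1;
    Gringorten uses a = 0.44, b = 0.12.
    """
    ranked = sorted(values)
    n = len(ranked)
    result = []
    for r, x in enumerate(ranked, start=1):
        f = (r - a) / (n + b)   # relative frequency F(X)
        t = 1.0 / (1.0 - f)     # return period T(X) = 1 / P(X)
        result.append((x, f, t))
    return result


annual_maxima = [float(q) for q in range(1, 26)]  # hypothetical 25-year record
weibull = plotting_positions(annual_maxima)
gringorten = plotting_positions(annual_maxima, a=0.44, b=0.12)

print(round(weibull[-1][2], 1), round(gringorten[-1][2], 1))
```

For the largest value the two formulae differ noticeably in return period, even though the F values are close.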

246 Example: Annual max. series for the river Wye (1971-97) (not Normal-distributed!) (Davie, 2002)

247 Example: Annual max. series for the river Wye (1971-97) (not Normal-distributed!) (Davie, 2002) Reliability is good!

248 Extrapolation beyond the data set The Weibull or Gringorten formulae are only good for flood frequency estimates for flows within the measured record, and even there are unreliable near either limiting value. For extrapolation, the fit of a probability distribution is needed. Estimate the parameters through: 1. Method of moments (widely used) 2. Method of L-moments (less widely used, quite complex) 3. Method of maximum likelihood (not widely used) An alternative is a graphical approach to fit the distribution (subjective approach). The choice of the distribution function is often based on personal preferences (but always take the distribution that fits your data best in a particular region); sometimes there are guidelines (depending on the region). Extreme values are usually not normally distributed; however, mean annual flows in humid areas are often normally distributed.

249 Method of moments – Example: Gumbel distribution Product moments are statistical descriptors of a data set (they characterize the probability distribution):
–1st moment: mean or expected value, µ_x (“central tendency”)
–2nd moment: variance, σ_x²; the standard deviation σ_x is the square root of the variance (“spread around the central value”)
–3rd moment: skewness, γ_x (“measure of symmetry”)
–4th moment: kurtosis, κ_x (“peakedness of central portion of distribution”)
–Coefficient of variation: measure of spread, CV = σ_x/µ_x
(L-moments are used for small sample sizes; see Dingman (2002), Appendix C)
[Figure: example densities with positive (γ_x > 0) and negative (γ_x < 0) skewness]

250 Method of moments – Gumbel distribution F(X) = exp(−exp(−b(X − a))) a = mean(Q) − 0.5772/b b = π/(σ_Q √6) F(X) leads to P(X) and T(X) for a certain size of flow X. Re-arranging the formulae gives the size of flow for a given recurrence interval: X = a − (1/b) ln[ln(T(X)/(T(X) − 1))] Example: for the 50-year flood, compute the natural logarithm of 50/49 and then the natural logarithm of this result. The parameters a and b are estimated from the sample. Rule of thumb: do not extrapolate recurrence intervals beyond twice the length of your stream flow record.

251 Example: Annual max. series for the river Wye (1971-97); values required for the Gumbel formula
Mean (Q)   Standard deviation (σ_Q)   a value   b value
21.21      6.91                       18.11     0.19
Applying the method of moments and Gumbel formula to the data gives some interesting results. The values used in the formula are shown in the table above and can be easily computed. When the formula is applied to find the flow value for an average recurrence interval of 50 years, it is calculated as 39.1 m³/s. This is less than the largest flow during the record, which under the Weibull formula has an average recurrence interval of 29 years. This discrepancy is due to the method of moments formula treating the highest flow as an extreme outlier. If we invert the formula we can calculate that a flood with a flow of 48.87 m³/s (the largest on record) has an average recurrence interval of around three hundred years. (Davie, 2002)
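The Gumbel method-of-moments calculation for the river Wye can be sketched in Python from the mean and standard deviation quoted on the slide. The function names are assumptions made for this example; small differences from the tabulated a and b values come from rounding.

```python
import math

EULER_GAMMA = 0.5772  # constant used in the slide's formula for a


def gumbel_parameters(mean_q, std_q):
    """Method-of-moments estimates (a, b) for F(X) = exp(-exp(-b (X - a)))."""
    b = math.pi / (std_q * math.sqrt(6.0))
    a = mean_q - EULER_GAMMA / b
    return a, b


def flow_for_return_period(a, b, t):
    """Flow X with average recurrence interval t (years)."""
    return a - (1.0 / b) * math.log(math.log(t / (t - 1.0)))


def return_period_for_flow(a, b, x):
    """Average recurrence interval T(X) = 1 / (1 - F(X))."""
    f = math.exp(-math.exp(-b * (x - a)))
    return 1.0 / (1.0 - f)


a, b = gumbel_parameters(21.21, 6.91)          # river Wye statistics from the slide
q50 = flow_for_return_period(a, b, 50.0)       # 50-year flood
t_max = return_period_for_flow(a, b, 48.87)    # largest flow on record

print(round(a, 2), round(b, 3), round(q50, 1), round(t_max))
```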

252 Distributions often used in hydrology (1/2) (Dingman, 2002)

253 Distributions often used in hydrology (2/2) (Dingman, 2002)

254 Use of common distributions in hydrology
Flood frequency analysis: most commonly applied are the Exponential (EXP), Log-Normal (LN), Log-Pearson 3 (LP3) and Generalised Extreme Value (GEV) distributions.
In practice: the choice of probability distribution may be dictated by mathematical convenience or by familiarity with a certain distribution (“personal bias”). Sample estimators must be adopted in order to obtain estimates of the statistics for determining the distribution parameters. In some cases, more than one distribution may fit the available data equally well.
General three-step procedure:
1) A suitable form of standard frequency distribution is chosen to represent the observations;
2) the chosen distribution is fitted to the data by determining values for its parameters; and
3) the required quantiles are computed from the fitted cumulative distribution function (CDF).

255 Distributions often used in hydrology 1. Normal distribution (ND) Probably the most important distribution, but often not useful for hydrological extremes PDF: Standard Normal Distribution (S-ND) a: scale parameter = standard deviation, σ c: location parameter = mean, μ CDF: (compare lecture of Dr. Zhou)

256 Location parameters A probability distribution is characterized by location and scale parameters. Location parameter equal to zero and scale parameter equal to one (Standard- ND) vs. ND with a location parameter of 10 and a scale parameter of 1. Scale Parameter The next plot has a scale parameter of 3 (location parameter is zero). The effect is that the graph is ‘stretched out’.

257 2. The Lognormal distribution y = ln(x) If y follows the Normal distribution, x follows the Log-normal distribution. Lognormal probability density function: Probability Distribution Functions

258 Lognormal distribution is skewed to the right! Lognormal distribution function

259 Effects of shape parameters for the lognormal distribution PDF CDF

260 Processing Log-normal Distribution Function
–Histogram indicates a positively skewed distribution
–Plot of cumulative frequency on lognormal probability paper shows a straight line
–Take the logarithmic transformation of the data: y = ln(x)
–Calculate the sample mean and standard deviation of the logarithmic values
–Carry out the analysis on the logarithmic values
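The processing steps above can be sketched in Python using only the standard library: transform to logarithms, estimate the mean and standard deviation of y = ln(x), evaluate a Normal quantile and transform back. The flow values and the function name are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def lognormal_quantile(values, p):
    """Quantile x_p of a log-normal distribution fitted via y = ln(x)."""
    logs = [math.log(x) for x in values]   # step: logarithmic transformation
    mu_y = mean(logs)                      # sample mean of the logarithms
    sigma_y = stdev(logs)                  # sample standard deviation of the logarithms
    y_p = NormalDist(mu_y, sigma_y).inv_cdf(p)
    return math.exp(y_p)                   # back-transform to the original scale


flows = [35.0, 48.0, 52.0, 61.0, 73.0, 80.0, 95.0, 139.0]  # hypothetical peak flows
q50 = lognormal_quantile(flows, 0.50)   # median of the fitted distribution
q99 = lognormal_quantile(flows, 0.99)   # flow not exceeded with probability 0.99

print(round(q50, 1), round(q99, 1))
```

The fitted median equals the geometric mean of the data, in line with the earlier remark that the mean of the log x distribution is the logarithm of the geometric mean.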

261 3. Pearson type III distribution (often used for flood frequencies) PDF: a - scale parameter b - shape parameter c - location parameter Г( ) – Gamma function → commonly fitted to the logarithms of floods (so-called log-Pearson type III distribution) 4. Gamma Distribution (when c=0)

262 Effects of shape parameter, gamma PDF CDF

263 5. Exponential distribution PDF: Particularly useful when applying partial duration series Standard Exponential Distribution CDF: With a mean of 1/λ, a variance of 1/λ², and a skewness of 2.

264 Plots of exponential distribution PDF CDF

265 Example: Exponential distribution applied to storm interval times (from Bedient & Huber 2002)

266 6. General extreme value (GEV) distribution c - location parameter a - scale parameter k - shape parameter k=0 Extreme value type I (EV1) (Gumbel) k<0 Extreme value type II (EV2) k>0 Extreme value type III (EV3) closely related to the Weibull distribution CDF:

267 [Figure: comparison of the PDFs of the three types of GEV distributions: Type I (k=0), Type II (k<0) and Type III (k>0), with x = c marked on the x-axis] If the sample for which a frequency distribution is required exhibits skewness, a three-parameter distribution is useful (e.g. GEV).

268 General Extreme Value distribution Type I (= Gumbel distribution) PDF: CDF: -> widely used for annual maximum series! (Note: little different in description (use of parameters) than in the example of the river Wye, from Davie (2002)) CDF: (standardized)

269 Plots of Gumbel distribution PDF CDF

270 How good is the fit of the distribution function?
Graphic check: visual check of the plotted graph (how well are the observations reproduced by the fitted PDF/CDF?)
Mathematical check: statistical test to determine the goodness of fit
-chi-square (χ²): PDF (needs much data, depends on classified intervals)
-Kolmogorov-Smirnov (K-S) test: CDF
→ Not necessary to divide the data into intervals; thus the error associated with the number and size of intervals is avoided.
→ Good if n>35 and even better if n>50.
→ Quick and easy (but only one value is considered)
-Unfortunately, often several distributions provide acceptable fits to the available data (no identification of the “true” or “best” distribution); confidence limits are too large

271 Graphical method (Example of the river Wye) Frequency of flows less than a value X. The F(X) values on the x-axis have undergone a transformation to fit the Gumbel distribution, called the ‘reduced variate’ (cf. Workshop in Hydrology). Is this fit suitable for the whole data set?

272 Comparison of different PDFs (Significant differences in particular for the extremes!)

273 Example: Kolmogorov-Smirnov (K-S) test (according to Schoenwiese 2000) F_X(x_i) - CDF of the assumed distribution S_n(x_i) - CDF of the observed ordered sample D_n = max |F_X(x_i) − S_n(x_i)| If D_n ≤ the tabulated value D_n^α (see below), the assumed distribution is acceptable at the significance level α (n: sample size).
α        0.20        0.10        0.05        0.01        0.001
D_n^α    1.073/√n    1.224/√n    1.358/√n    1.628/√n    1.949/√n
(see sketch on black board)
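A minimal Python sketch of the K-S check: compute the largest distance between the assumed CDF and the empirical step function, then compare it with the tabulated large-sample value for α = 0.05. The sample, the assumed Uniform(0, 1) CDF and the function names are hypothetical; with only five points the tabulated constant is a rough guide, since the slide recommends n > 35.

```python
import math


def ks_statistic(sample, cdf):
    """Largest distance D_n between the assumed CDF and the empirical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x: check both sides
        d = max(d, abs(i / n - f), abs(f - (i - 1) / n))
    return d


def ks_critical(n, constant=1.358):
    """Tabulated critical value constant / sqrt(n); 1.358 corresponds to alpha = 0.05."""
    return constant / math.sqrt(n)


sample = [0.1, 0.3, 0.5, 0.7, 0.9]           # hypothetical observations
d_n = ks_statistic(sample, lambda x: x)      # assumed distribution: Uniform(0, 1)
acceptable = d_n <= ks_critical(len(sample))

print(round(d_n, 3), round(ks_critical(len(sample)), 3), acceptable)
```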

274 Variability of quantile estimates – confidence limits Sources of errors: assumption of a particular distribution (cannot be quantified); sampling errors in the estimation of the parameters of the distribution (quantifiable through standard errors) Confidence limits (CL): 100(1−α)% confidence intervals, CL = Q_T ± t_{1−α/2} · SE Standard error: SE = C σ/√n Q_T: quantile C: constant (see Workshop page 13; course notes from Hall page 31f) σ: standard deviation from a sample of size n t_{1−α/2}: value of Student’s t-distribution with (n−1) degrees of freedom, tabulated (e.g. the 97.5% value for a two-tailed 95% confidence interval)

275 Example: Lognormal with 90%-confidence limits (Bedient & Huber 2002)

276 Example: Estimation of confidence limits (according to Schoenwiese 2000) The mean annual temperature at the Hohenpeissenberg, Germany, for the period 1954-1970 (n=17; Normal-distributed, C=1) is 6.24 °C with a standard deviation of 0.73 °C. The confidence limits (α = 5%) can be calculated as: CL (mean annual temperature in °C) = 6.24 ± t_97.5% · (0.73/√17) °C = 6.24 ± 0.38 °C The mean annual temperature therefore lies, at a confidence level of 95%, in the interval 6.24 ± 0.38 °C, thus between 5.86 and 6.62 °C. Please note: if the record length had been 120 years (= n) with the same standard deviation, the interval would be 6.24 ± 0.13 °C.
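The Hohenpeissenberg calculation can be reproduced with a few lines of Python. The t values are entered as tabulated constants (2.120 for 16 degrees of freedom; roughly 1.980 for 119 degrees of freedom) rather than computed, and the function name is an assumption for this example.

```python
import math


def confidence_half_width(std, n, t_value, c=1.0):
    """Half-width t * C * sigma / sqrt(n) of the confidence interval."""
    return t_value * c * std / math.sqrt(n)


T_975_DF16 = 2.120    # tabulated Student's t, 97.5% value, 16 degrees of freedom
T_975_DF119 = 1.980   # approximate tabulated value for 119 degrees of freedom

mean_temp = 6.24      # mean annual temperature in degrees C
std_temp = 0.73       # sample standard deviation in degrees C

half_17 = confidence_half_width(std_temp, 17, T_975_DF16)
lower, upper = mean_temp - half_17, mean_temp + half_17
half_120 = confidence_half_width(std_temp, 120, T_975_DF119)

print(round(half_17, 2), round(lower, 2), round(upper, 2), round(half_120, 2))
```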

277 A few remarks on: Low flow frequency analysis
Data required: annual minimum series
Problem of independent events; do not split the year in the middle of a low-flow period (low flow periods can be long)
Often zero values (e.g. in arid or cold climates)
There is a finite limit on how low a low flow can be (no negative flows!)
→ Different statistical treatment of the data
→ Fit an exponential distribution rather than, for instance, a log-normal distribution
→ Other often used distributions are the Weibull, Gumbel, Pearson Type III, and log-normal distributions

278 A few remarks on: Low flow frequency analysis Figure 7.15 Two probability density functions. The usual log-normal distribution (solid line) is contrasted with the truncated log-normal distribution (dashed line) that is possible with low flows (where the minimum flow can equal zero). Figure 7.16 Probability values (calculated from the Weibull sorting formula) plotted on a log scale against values of annual minimum flow (hypothetical values).

279 Application in the Rur river (7-1) Station Stah (Germany) area: 2245 km² record: 1953 to 2001 Prepared by Tu Min (2004)

280 Application in the Rur river (7-2) Annual flood peaks (1954 - 2001) the water years (Nov - Oct) Homogeneity Statistical tests (change point)

281 Application in the Rur river (7-3) Flood frequency analysis Example – Normal distribution N = 48 μ = 76.1 σ = 28.5 t 97.5 % = 2.011

282 N = 48 μ y = 4.3 σ y = 0.4 t 97.5 % = 2.011 Application in the Rur river (7-4) Flood frequency analysis Example – LN distribution

283 Application in the Rur river (7-5) Flood frequency analysis Example – LN3 distribution N = 48 X min = 27.6 X max = 139.3 X med = 73.1 μ y = 5.0 σ y = 0.2 t 97.5 % = 2.011

284 Application in the Rur river (7-6) Flood frequency analysis Example – Gumbel distribution Distribution fitting

285 Application in the Rur river (7-7) Flood frequency analysis Example – Gumbel distribution Magnitude of T-year flood

286 Take home messages
–Frequency analysis, in particular of hydrological extremes, is a prerequisite for sustainable water resources management
–The choice between annual max. series and partial duration series depends on the length of the record
–Understanding of probability, recurrence interval and risk
–Knowledge of often used statistical distributions
–Calculation of confidence intervals
–Test of the goodness of fit of a PDF or CDF
–Special characteristics of low flow values

287 Closure probability and event stochastic variables, cont. and discrete Transformations Joint distributions Linear regression Parameter estimation (with Bestfit)

288 Closure The role of statistics in hydrology and water resources is what? You now have knowledge of: → Frequency tables, → Histograms, → Distribution functions, → Statistical descriptors, and → the Standard Normal Distribution and Log-Normal distribution. Are you now well prepared for the two assignments?

289 Assignment 1: Basic Statistics Use the data set from Van Gelder’s website
1. Calculate the range and estimate a reasonable number of intervals as well as class limits.
2. Calculate the relative and absolute frequencies.
3. Make a histogram and cumulative frequency distribution, hand drawn on linear paper (figures). Interpret this briefly (one sentence!)
4. Calculate the median, mode, and arithmetic mean.
5. Calculate the variance, standard deviation and coefficient of variation.
6. Calculate the skewness and kurtosis. Interpret the results briefly (one sentence!)
7. Plot the data as a graph on normal probability paper (figure). Is the Normal distribution suitable for that data set? Compare the mean value and standard deviation from your graph with your results in questions 4 and 5.
Deliver your printed report to Pieter van Gelder on November 5th at 15.45h at the start of the lectures

290 Assignment 2: Frequency Analysis Download your dataset from Van Gelder’s website Determine the PDF with the lowest Chi-Square value in Bestfit Include in your report a plot of the observations and optimal fit Extrapolate the fitted Exponential distribution to a 10^-3 /yr quantile Calculate the 95% Confidence Bounds around the 10^-3 /yr quantile for the Exponential fit Deliver your printed report to Pieter van Gelder on November 6th at 8.45h at the start of the lectures

291 Assignment 3: Transformations of distributions Download your parameters a and b from Van Gelder’s website Generate 28 random numbers from the Uniform distribution with lowerbound a and upperbound b Plot your data in a histogram and draw the PDF of the uniform distribution in the same plot Generate 100 sets of 28 random numbers from the above Uniform distribution and take from each set the maximum number Plot these 100 maxima in a histogram, derive the theoretical distribution function for these 100 maxima and draw the PDF in the same plot Assume that the above 100 numbers are monthly maximum wind speeds. Transform your wind speeds to wind pressures and plot your wind pressures in a cumulative distribution plot. Deliver your printed report to Pieter van Gelder on November 13th at 17.00h at the reception of UNESCO-IHE

292 Final mark for Review of statistics and frequency analysis (module 1) Weight factor of all computer exercises is 0.5 in your final mark Weight factor of written test is 0.5 in your final mark

