DDS – 12. December 2011 What is the correct number of break points hidden in a climate record?

Defining Breaks
Relocations of climate stations or changes in instrumentation lead to slightly different measurements in different periods. These small but abrupt changes are called breaks. Breaks become visible when differences to neighbouring stations are considered, because this filters out the dominating natural variability.

Internal and External Variance
Consider the differences between one station and a reference (the kriged ensemble of surrounding stations). Breaks are defined by abrupt changes in the station-minus-reference time series.
Internal variance: the variance within the subperiods.
External variance: the variance between the means of different subperiods.
Criterion: maximum external variance attained by a minimum number of breaks.

Decomposition of Variance
n: total number of years
N: number of subperiods
n_i: number of years within subperiod i
The sum of external and internal variance is constant.
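
The decomposition can be checked numerically. The following is a minimal sketch, assuming that all variances are normalised by the total number of years n and that a break at position b starts a new subperiod at index b; the function names and the sample values are illustrative only and do not come from the talk.

```python
from statistics import fmean


def split_segments(x, breaks):
    """Split x into subperiods; a break at b starts a new subperiod at index b."""
    edges = [0, *sorted(breaks), len(x)]
    return [x[a:b] for a, b in zip(edges[:-1], edges[1:])]


def internal_variance(x, breaks):
    """Variance within the subperiods, normalised by the total length n."""
    total = 0.0
    for seg in split_segments(x, breaks):
        seg_mean = fmean(seg)
        total += sum((v - seg_mean) ** 2 for v in seg)
    return total / len(x)


def external_variance(x, breaks):
    """Variance between the subperiod means, each weighted by its length n_i."""
    grand_mean = fmean(x)
    return sum(
        len(seg) * (fmean(seg) - grand_mean) ** 2
        for seg in split_segments(x, breaks)
    ) / len(x)


def total_variance(x):
    """Variance of the whole series about its overall mean."""
    grand_mean = fmean(x)
    return sum((v - grand_mean) ** 2 for v in x) / len(x)


# Illustrative values only.
series = [0.1, -0.3, 0.2, 1.1, 0.9, 1.3, -0.5, -0.4]
breaks = [3, 6]
print(internal_variance(series, breaks) + external_variance(series, breaks))
print(total_variance(series))
```

Whatever break positions are chosen, the two printed numbers agree up to rounding, so maximising the external variance is the same as minimising the internal variance.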

Two questions
The title of this talk asks: How many breaks? A second question is: Where are they situated?
Testing all permutations is not feasible. The best solution for a fixed number of breaks can be found by Dynamic Programming.

Dynamic Programming (1)
Find the optimum positions for a fixed number of breaks. Consider not only the complete time series, but all possible truncated variants.

Dynamic Programming (2)
Find the best first break by simply testing all permutations.

Dynamic Programming (3)
Fill up all truncated variants. The internal variance now consists of two parts: that of the truncated variant plus that of the rest. Important: variances are additive.

Dynamic Programming (4)
Search for the minimum out of the n candidates.

Dynamic Programming (5)
The 2-break optimum for the full length is found. To begin the search for 3 breaks we need, as before, the previous solutions for all lengths, including the shorter ones. This needs n²/2 searches, which for larger break numbers k is far fewer than testing all permutations (n over k).
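
The procedure can be sketched in code. This is a generic dynamic-programming segmentation under stated assumptions, not the authors' implementation: it minimises the internal sum of squares (equivalent to maximising the external variance), a break at position b starts a new segment at index b, and the name optimal_breaks is hypothetical.

```python
def optimal_breaks(x, k):
    """Return (minimum internal sum of squares, break positions) for exactly k breaks."""
    n = len(x)
    if not 0 <= k <= n - 1:
        raise ValueError("need 0 <= k <= len(x) - 1")

    # Prefix sums make the cost of any segment an O(1) lookup.
    s, q = [0.0], [0.0]
    for v in x:
        s.append(s[-1] + v)
        q.append(q[-1] + v * v)

    def sse(a, b):
        """Sum of squared deviations of x[a:b] from its own mean (b > a)."""
        seg_sum = s[b] - s[a]
        return (q[b] - q[a]) - seg_sum * seg_sum / (b - a)

    inf = float("inf")
    # cost[j][t]: best internal sum of squares of the truncated series x[:t] with j breaks.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    last = [[0] * (n + 1) for _ in range(k + 1)]
    for t in range(1, n + 1):
        cost[0][t] = sse(0, t)
    for j in range(1, k + 1):
        for t in range(j + 1, n + 1):
            for b in range(j, t):  # candidate position of the last break
                # Additivity: truncated variant with j-1 breaks plus the rest.
                c = cost[j - 1][b] + sse(b, t)
                if c < cost[j][t]:
                    cost[j][t], last[j][t] = c, b

    # Walk back through the stored last-break positions.
    breaks, t = [], n
    for j in range(k, 0, -1):
        t = last[j][t]
        breaks.append(t)
    return cost[k][n], breaks[::-1]


# Illustrative series with two obvious level shifts.
example = [0.2, -0.1, 0.0, 1.5, 1.4, 1.6, -0.7, -0.9, -0.8, 0.1]
print(optimal_breaks(example, 2))
```

For each additional break the two inner loops visit roughly n²/2 pairs (t, b), which matches the search count given on the slide.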

Position & Number
Solved: the optimum positions for a fixed number of breaks are known from Dynamic Programming.
Left: find the optimum number of breaks.
The external variance increases in any case with an increasing number of breaks. Use the behaviour of a random time series as reference.

Segment averages with stddev = 1
Segment averages x_i scatter randomly with mean 0 and stddev 1/√n_i, because any deviation from zero can be seen as inaccuracy due to the limited number of members.

External Variance
The external variance is a weighted measure for the variability of the subperiods' means. It is equal to the mean square sum of a random standard-normal distributed variable.

χ²-distribution
n: length of time series (number of years)
k: number of breaks
N = k+1: number of subsegments
[ ]: mean over several break position permutations
[var_ext] = (N-1)/n = k/n
On average, the external variance increases linearly with k. However, we consider the best member as found by DP.
var_ext ~ χ²_N: the external variance is χ²-distributed.
Def.: Take N values out of N(0,1), square them and add them up. By repeating this, a χ²_N-distribution is obtained.

21 years random data (1)
1000 random time series are created. They are only 21 years long, so that explicit tests of all permutations are possible. The mean increases linearly. However, the maximum is relevant (the best solution as found by DP). Can we describe this function? First guess:
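
For a 21-year series the explicit test of all break combinations is cheap. The sketch below shows the idea for a single random series; it assumes that the external variance is expressed as a fraction of the total sample variance (so it cannot exceed 1), and the seed, k = 4 and the function names are illustrative choices, not taken from the talk.

```python
import random
from itertools import combinations


def external_fraction(x, breaks):
    """External variance as a fraction of the total variance of x."""
    n = len(x)
    grand_mean = sum(x) / n
    total = sum((v - grand_mean) ** 2 for v in x)
    edges = [0, *breaks, n]
    external = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        seg_mean = sum(x[a:b]) / (b - a)
        external += (b - a) * (seg_mean - grand_mean) ** 2
    return external / total


def all_external_fractions(x, k):
    """Evaluate every possible placement of k breaks in x."""
    return [external_fraction(x, bs) for bs in combinations(range(1, len(x)), k)]


rng = random.Random(0)  # fixed seed, for reproducibility only
series = [rng.gauss(0.0, 1.0) for _ in range(21)]
values = all_external_fractions(series, 4)
mean_v = sum(values) / len(values)
max_v = max(values)
print(len(values), mean_v, max_v)
```

The maximum over all combinations is the value that Dynamic Programming would return for this series; the mean over combinations is the quantity that grows linearly with k.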

21 years random data (2)
Above, we expected the data for a fixed number of breaks to be χ²-distributed.

From χ² to β distribution
The random data do not fit a χ²-distribution exactly. The reason is that χ² has no upper bound, but var_ext cannot exceed 1. A kind of confined χ² is the beta distribution.
(Figure: n = 21 years, k = 7 breaks, data)

From χ² to β distribution
X ~ χ²(a) and Y ~ χ²(b) ⇒ X / (X+Y) ~ β(a/2, b/2)
If we normalize a χ²-distributed variable by the sum of itself and another χ²-distributed variable, the result is β-distributed. The β-distribution fits the data well and is the theoretical distribution for the external variance of all break position permutations.
(Figure: n = 21 years, k = 7 breaks, data)

From χ² to β distribution
We are interested in the best solution, with the highest external variance, as provided by DP. We need the exceeding probability for high var_ext.
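
As a hedged illustration of the β-distribution in question: assuming that the external variance carries k degrees of freedom and the internal variance the remaining n-1-k (an assumption made here, not a formula preserved in this transcript), the standard relation between χ² and β variables gives the following density for the normalised external variance v.

```latex
v = \frac{X}{X+Y}, \qquad X \sim \chi^2(k), \quad Y \sim \chi^2(n-1-k)
\;\Longrightarrow\;
v \sim \beta\!\left(\tfrac{k}{2},\ \tfrac{n-1-k}{2}\right)

p(v) = \frac{v^{\frac{k}{2}-1}\,(1-v)^{\frac{n-1-k}{2}-1}}
            {B\!\left(\tfrac{k}{2},\ \tfrac{n-1-k}{2}\right)},
\qquad
B(a,b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a+b)}
```

Under this assumption the mean of v is k/(n-1), which grows linearly with k.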

Incomplete Beta Function
The external variance v is β-distributed and depends on n (years) and k (breaks). The exceeding probability P gives the best (maximum) solution for v. It is given by the incomplete beta function, which is solvable for even k and odd n.
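
For even k and odd n both β parameters are integers, and the incomplete beta function then reduces to a finite binomial sum. The sketch below uses that standard identity; the parameter choice β(k/2, (n-1-k)/2) is an assumption made for this illustration, and the function name is hypothetical.

```python
from math import comb


def exceedance_probability(v, n, k):
    """P(V > v) for V ~ Beta(k/2, (n-1-k)/2), valid for even k and odd n.

    Uses the closed form of the regularised incomplete beta function for
    integer parameters a and b:
        1 - I_v(a, b) = sum_{l=0}^{a-1} C(a+b-1, l) v^l (1-v)^(a+b-1-l)
    """
    if k % 2 or n % 2 == 0 or not 2 <= k <= n - 3:
        raise ValueError("closed form needs even k, odd n and 2 <= k <= n - 3")
    a = k // 2
    b = (n - 1 - k) // 2
    m = a + b - 1
    return sum(comb(m, l) * v**l * (1 - v) ** (m - l) for l in range(a))


# n = 21 years and k = 4 breaks give a = 2 and m = 9.
print(exceedance_probability(0.5, 21, 4))
```

With n = 21 and k = 4 the sum has two terms with m = 9, so P(v) = (1-v)^9 + 9 v (1-v)^8.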

Example: 21 years, 4 breaks
k = 4 ⇒ i = 2
n = 21 ⇒ m = 9

Theory and Data
Theory (curve): the random data (hatched) fit well.

Nominal Combination Number
For n = 21 and k = 4 there are 4845 break combinations. If they were all independent, we could read the maximum external variance at an exceeding probability of (4845)^-1 ≈ 0.0002. However, we suspect that the break combinations are not independent. And we know the correct value of var_ext.
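
The nominal reading can be reproduced in a few lines. This sketch counts the break combinations and inverts the exceeding probability by bisection; it assumes the two-term binomial form of P(v) with m = 9 that follows from β(2, 8) for n = 21 and k = 4, and all names are illustrative.

```python
from math import comb


def exceedance(v, m=9, terms=2):
    """P(V > v) as a finite binomial sum (assumed form for n = 21, k = 4)."""
    return sum(comb(m, l) * v**l * (1 - v) ** (m - l) for l in range(terms))


def invert_exceedance(target, iterations=60):
    """Find v with exceedance(v) = target; exceedance decreases from 1 to 0."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if exceedance(mid) > target:
            lo = mid  # probability still too high, so move to larger v
        else:
            hi = mid
    return (lo + hi) / 2


nominal = comb(20, 4)  # ways to place 4 breaks in the 20 gaps of a 21-year series
v_nominal = invert_exceedance(1 / nominal)
print(nominal, v_nominal)
```

The resulting v_nominal is the external variance one would expect from the best of 4845 independent combinations; comparing it with the value actually found shows how strongly the combinations depend on each other.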

Effective and Nominal
Remember the known value of var_ext for k = 4. The reverse reading leads to a 23 times higher exceeding probability. This shows that the break permutations are strongly dependent and that the effective number of combinations is smaller than the nominal one. However, the theoretical function is correct.

From 21 years to 101 years
As we now know the theoretical function, we quit the explicit check by random data and skip from unrealistically short time series (n = 21) to more realistic ones (n = 101). Again the numerical values of the external variance are known, and we can conclude the effective combination numbers. Can we give a formula for … in order to derive v(k)?
(Figure: curves for 2 to 20 breaks)

dv/dk sketch
Increasing the break number from k to k+1 has two consequences:
1. The probability function changes.
2. The number of combinations increases.
Both increase the external variance.
(Figure: curves for k breaks and k+1 breaks)

Using the Slope
P(v) is a complicated function and hard to invert into v(P). Thus, dv is derived from dP / slope. We just derived P(v) by integrating p(v), so the slope p(v) is known.
(Figure: curves for k breaks and k+1 breaks)

The Slope
Insert the known functions. The last summand dominates. Reduce and replace m and i.

Distance between the Curves
The last summand dominates. Reduce and replace m and i.

Effective combination growth
Nominal growth rate: -2 ln((n-1-k)/k)
ln: logarithmic sketch
minus: the number of combinations is reciprocal to the exceeding probability
2: the exceeding probability is only known for even break numbers
(n-1-k)/k: ratio of consecutive binomial coefficients
However, break combinations are not independent, and we know the effective number of combinations.

Ratio: nominal / effective
Table columns: k1, k2, k, nominal, effective, c = nom/eff
The ratio nominal / effective is approximately constant with c = 0.3.

Very Rough Solution
Normalisation, for small k*, for n = 100

The Two Contributions
(Figure: truth and estimate)

Exact Solution

Constancy of Solution
The solution for the exponent is constant for different lengths of time series (21 and 101 years).
(Figure panels: 101 years, 21 years)

The existing algorithm Prodige
Original formulation of Caussinus & Lyazrhi for the penalty term, as adopted by Mestre for Prodige.
Translation into the terms used by us.
Normalisation by k* = k / (n-1).
Differentiation to find the minimum.
In Prodige it is postulated that the relative gain of external variance is a constant for given n.
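
The steps listed on this slide can be written out as follows. This is a reconstruction under assumptions: the penalised log-likelihood criterion is quoted in the form commonly cited for Caussinus and Mestre, with Y the series, Ȳ_j the mean of segment j and n_j its length, and v denotes the external variance as a fraction of the total variance.

```latex
% Criterion to be minimised (form assumed from the published literature)
C_k(Y) = \ln\!\left(1 - \frac{\sum_{j=1}^{k+1} n_j\,(\bar{Y}_j - \bar{Y})^2}
                              {\sum_{i=1}^{n} (Y_i - \bar{Y})^2}\right)
         + \frac{2k}{n-1}\,\ln n

% Translation into the terms used here, with k^* = k/(n-1)
C(k^*) = \ln(1 - v) + 2\,k^*\,\ln n

% Differentiation with respect to k^* to find the minimum
\frac{dC}{dk^*} = -\frac{1}{1-v}\,\frac{dv}{dk^*} + 2\ln n = 0
\quad\Longrightarrow\quad
\frac{1}{1-v}\,\frac{dv}{dk^*} = 2\ln n
```

The right-hand side depends only on n, which is the sense in which the relative gain of external variance is postulated to be constant; for n = 101 it evaluates to 2 ln(101) ≈ 9.2.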

Our Results vs Prodige
We know the function for the relative gain of external variance. Its uncertainty, as given by isolines of exceeding probabilities for 2^-i, is characterised by constant distances. Caussinus and Lyazrhi (adopted by Mestre) propose just a constant of 2 ln(n) ≈ 9.
(Figure isolines: exceeding probability 1/128, 1/64, 1/32, 1/16, 1/8, 1/4)

Wrong Direction
(Figure panels: n = 101 years, n = 21 years)

Conclusion
We have found a general mathematical formulation of how the external variance of a random time series increases as more and more breaks, as given by Dynamic Programming, are inserted. This is much more accurate than existing estimations and can be used in future as a reference to define the optimum number of breaks.

Integrated result
What does the function found look like after integration?
Crosses: test data
Line: theory
Error bars: 90th and 95th percentile

Appendix (1)
Consider the individual summands of the sum for the exceeding probability P(v), where l_i runs from zero to i. The factor of change f between a given summand and its successor follows from the ratio of consecutive binomial coefficients. m and i can be replaced by n and k. Inserting k instead of l_k gives a lower limit for f, because (n-1-l_k)/l_k, the rate of change of the binomial coefficients, decreases monotonically with k. Finally, the expression is normalised by 1/(n-1).

Appendix (2)
We can conclude that each element of the sum given above is larger than the prior element by a factor f. The approximate solution is known, with 1 - v = (1 - k*)^4. For small k* the factor f is greater than about 4, and it grows to infinity for large k*. Consequently, we can approximate the sum by its last summand.
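
The two limits quoted for f can be retraced with a short calculation. It is a sketch under the assumption that the summands have the binomial form C(m, l) v^l (1-v)^(m-l), so that the ratio of consecutive summands is the rate of change of the binomial coefficients, (n-1-l_k)/l_k, times v/(1-v).

```latex
% Ratio of consecutive summands and its lower limit (l_k replaced by k)
f = \frac{n-1-l_k}{l_k}\,\frac{v}{1-v}
  \;\ge\; \frac{n-1-k}{k}\,\frac{v}{1-v}
  = \frac{1-k^*}{k^*}\,\frac{v}{1-v},
\qquad k^* = \frac{k}{n-1}

% Insert the approximate solution 1 - v = (1 - k^*)^4
f \;\ge\; \frac{1-k^*}{k^*}\cdot\frac{1-(1-k^*)^4}{(1-k^*)^4}
  = \frac{1-(1-k^*)^4}{k^*\,(1-k^*)^3}

% Limits
\lim_{k^*\to 0} \frac{1-(1-k^*)^4}{k^*\,(1-k^*)^3} = 4,
\qquad
\lim_{k^*\to 1} \frac{1-(1-k^*)^4}{k^*\,(1-k^*)^3} = \infty
```

For example, k* = 0.1 gives a lower limit of about 4.7, so every summand exceeds its predecessor by at least a factor of 4 and the last summand dominates the sum.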

Application (1)
Insert 5 breaks of variance 1 in each of 1000 random time series. The change of external variance for low break numbers (1, 2, 3, up to about 10) increases, lying above the theoretical function for random time series without any break (arrow). Variances for break numbers higher than 5 still increase, because the 5 inserted breaks are not always the biggest.
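
A test series of this kind could be generated as in the sketch below. Several details are assumptions made for illustration and are not stated on the slide: a length of 101 years, uniformly random break positions, independent segment levels drawn from N(0, 1), and noise of variance 1. The function name is hypothetical.

```python
import random


def make_test_series(n=101, n_breaks=5, break_sd=1.0, noise_sd=1.0, seed=None):
    """Return (series, break positions) for one synthetic test series.

    Each segment gets its own random level (standard deviation break_sd),
    and white noise (standard deviation noise_sd) is added to every year.
    A break at position b means the new level starts at index b.
    """
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(1, n), n_breaks))
    edges = [0, *positions, n]
    series = []
    for a, b in zip(edges[:-1], edges[1:]):
        level = rng.gauss(0.0, break_sd)
        series.extend(level + rng.gauss(0.0, noise_sd) for _ in range(b - a))
    return series, positions


# 1000 test series, as in the experiment described above.
ensemble = [make_test_series(seed=i) for i in range(1000)]
print(ensemble[0][1])
```

Drawing segment levels independently corresponds to breaks as random deviations from a common mean; a random-walk break signal would instead accumulate the jumps from segment to segment.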

Application (2)
Stop the break search when the growth rate of the external variance first drops below the theoretical one for zero breaks.
Figure: 1 example of 1000 test time series. Crosses: observations. Thin line: inserted breaks. Fat line: detected breaks.
On average over 1000 samples:
Added variance: 86% (theoretically 5/6)
Remaining after correction: 27%
Average detected break number: 5.48