Benchmark database inhomogeneous data, surrogate data and synthetic data Victor Venema.

1 Benchmark database inhomogeneous data, surrogate data and synthetic data Victor Venema

2 Victor Venema, Victor.Venema@uni-bonn.de, COST HOME, March 2009, Tarragona, Spain
Content
• Introduction to benchmark dataset
• Some results
• Some questions about exercise
• Question about future work
• Analyse and publish the results

3 Benchmark dataset
1) Real (inhomogeneous) climate records
   • Most realistic case
   • Investigate if various homogenisation algorithms (HA) find the same breaks
2) Synthetic data
   • For example, Gaussian white noise
   • Insert known inhomogeneities
   • Test performance
3) Surrogate data
   • Empirical distribution and correlations
   • Insert known inhomogeneities
   • Compare to synthetic data: test of assumptions

4 Creation benchmark – Outline talk
1) Start with homogeneous data
2) Multiple surrogate and synthetic realisations
3) Mask surrogate records
4) Add global trend
5) Insert inhomogeneities in station time series
6) Published on the web
7) Homogenize by COST participants and third parties
8) Analyse the results and publish

5 1) Start with homogeneous data
• Monthly mean temperature and precipitation
• Later also daily data (WG4), maybe other variables (pressure, wind)
• Homogeneous, no missing data
• Longer surrogates are based on multiple copies
• Generated networks are 100 a

6 2) Multiple surrogate realisations
• Multiple surrogate realisations
  – Temporal correlations
  – Station cross-correlations
  – Empirical distribution function
• Annual cycle removed before, added at the end
• Number of stations: 5, 9 or 15
• Cross correlation varies as much as possible

7 5) Insert inhomogeneities in stations
• Independent breaks
• Determined at random for every station and time
• 5 breaks per 100 a
• Monthly slightly different perturbations
• Temperature
  – Additive
  – Size: Gaussian distribution, σ=0.8°C
• Rain
  – Multiplicative
  – Size: Gaussian distribution, mean=1, σ=10%
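The break rules on this slide can be illustrated with a minimal Python sketch. It is not the benchmark's actual generator: the function name `insert_breaks`, the per-month break probability of 5/1200, and the choice to apply each perturbation to all values from the break onward are assumptions, and the month-to-month variation of the perturbation is left out. The rain factor is taken to be centred on 1.

```python
import random

BREAKS_PER_CENTURY = 5      # from the slide: 5 breaks per 100 a
MONTHS_PER_CENTURY = 1200


def insert_breaks(series, variable, rng):
    """Return (perturbed copy of a monthly series, list of (position, size)).

    Assumption: every month independently starts a break with probability
    5/1200, and the perturbation applies from that month to the end.
    """
    p_break = BREAKS_PER_CENTURY / MONTHS_PER_CENTURY
    out = list(series)
    breaks = []
    for t in range(1, len(out)):
        if rng.random() >= p_break:
            continue
        if variable == "temperature":
            size = rng.gauss(0.0, 0.8)       # additive, sigma = 0.8 degrees C
            for i in range(t, len(out)):
                out[i] += size
        else:
            size = rng.gauss(1.0, 0.10)      # multiplicative, centred on 1, sigma = 10 %
            for i in range(t, len(out)):
                out[i] *= size
        breaks.append((t, size))
    return out, breaks


rng = random.Random(42)
# Gaussian white noise stands in for a homogeneous synthetic temperature series
temperature = [rng.gauss(0.0, 1.0) for _ in range(MONTHS_PER_CENTURY)]
perturbed, breaks = insert_breaks(temperature, "temperature", rng)
print(len(breaks), "breaks inserted")
```

Drawing the break positions month by month gives about 5 breaks per century on average rather than exactly 5 in every station.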

8 Example break perturbations station

9 Example break perturbations network

10 5) Insert inhomogeneities in stations
• Correlated break in network
• One break in 10 % of networks
• In 30 % of the stations simultaneously
• Position random
  – At least 10 % of data points on either side
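A possible way to draw such a correlated break is sketched below in Python. The function name `draw_correlated_break` and the rounding of "30 % of the stations" to a whole number (with at least one station) are assumptions made for illustration only.

```python
import math
import random


def draw_correlated_break(n_stations, n_months, rng,
                          p_network=0.10, station_fraction=0.30, margin=0.10):
    """Return (position, affected station indices), or None for no break."""
    # Only a fraction p_network of the networks receives a correlated break
    if rng.random() >= p_network:
        return None
    # Keep at least `margin` of the data points on either side of the break
    edge = math.ceil(margin * n_months)
    position = rng.randint(edge, n_months - edge)
    # Assumption: round 30 % of the stations to a whole number, minimum one
    n_affected = max(1, round(station_fraction * n_stations))
    stations = sorted(rng.sample(range(n_stations), n_affected))
    return position, stations


rng = random.Random(7)
for network in range(20):
    result = draw_correlated_break(n_stations=9, n_months=1200, rng=rng)
    if result is not None:
        print("network", network, "break at month", result[0], "stations", result[1])
```

All affected stations share the same break position; the size of the perturbation per station is not specified on the slide and is therefore not drawn here.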

11 Example correlated break

12 5) Insert inhomogeneities in stations
• Outliers
• Size
  – Temperature: 99 percentile
  – Rain: 99.9 percentile
• Frequency
  – 50 % of networks: 1 %
  – 50 % of networks: 3 %
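The slide does not say how an outlier is applied to a value, so the following Python sketch rests on a strong assumption: a randomly selected value is overwritten with the empirical 99th (temperature) or 99.9th (rain) percentile of the station's own series. The function names are hypothetical.

```python
import random
import statistics


def outlier_size(series, variable):
    """Empirical percentile used as outlier value (assumed interpretation)."""
    if variable == "temperature":
        # n=100 gives 99 cut points; the last one is the 99th percentile
        return statistics.quantiles(series, n=100)[-1]
    # n=1000 gives 999 cut points; the last one is the 99.9th percentile
    return statistics.quantiles(series, n=1000)[-1]


def insert_outliers(series, variable, frequency, rng):
    """Overwrite a fraction `frequency` of the values, chosen at random."""
    size = outlier_size(series, variable)
    out = list(series)
    positions = []
    for t in range(len(out)):
        if rng.random() < frequency:
            out[t] = size
            positions.append(t)
    return out, positions


rng = random.Random(3)
# Half of the networks get 1 % outliers, the other half 3 %
frequency = rng.choice([0.01, 0.03])
rain = [rng.expovariate(1 / 60.0) for _ in range(1200)]
with_outliers, positions = insert_outliers(rain, "rain", frequency, rng)
print(len(positions), "outliers at frequency", frequency)
```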

13 Example outlier perturbations station

14 Example outliers network

15 5) Insert inhomogeneities in stations
• Local trends (only temperature)
• Linear increase or decrease in one station
• Duration: between 30 and 60 a
• Maximum size: Gaussian distribution, σ=0.8°C
• Frequency: once in 10 % of the stations
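A local trend of this kind could be generated as in the Python sketch below. The function name `local_trend` is hypothetical, and two points are assumptions not stated on the slide: the start month is drawn uniformly, and the offset reached at the end of the trend persists afterwards.

```python
import random


def local_trend(n_months, rng, sigma=0.8, min_years=30, max_years=60):
    """Return (perturbation per month, begin index, end index) of a linear trend."""
    if n_months < max_years * 12:
        raise ValueError("series too short for the longest possible trend")
    duration = rng.randint(min_years * 12, max_years * 12)
    start = rng.randint(0, n_months - duration)
    # The sign of the Gaussian draw decides between increase and decrease
    size = rng.gauss(0.0, sigma)
    perturbation = [0.0] * n_months
    for t in range(start + 1, n_months):
        if t < start + duration:
            perturbation[t] = size * (t - start) / duration   # linear ramp
        else:
            perturbation[t] = size                            # assumed to persist
    return perturbation, start, start + duration


rng = random.Random(11)
station = [rng.gauss(0.0, 1.0) for _ in range(1200)]
# Frequency from the slide: a local trend in 10 % of the stations
if rng.random() < 0.10:
    trend, begin, end = local_trend(len(station), rng)
    station = [value + delta for value, delta in zip(station, trend)]
```

The begin and end indices correspond to what a participant would report as BEGTR and ENDTR.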

16 Example local trends

17 6) Published on the web
• Inhomogeneous data are published on the COST-HOME homepage
• Everyone is welcome to download and homogenize the data
• http://www.meteo.uni-bonn.de/venema/themes/homogenisation

18 7) Homogenize by participants
• Return homogenised data
• Should be in COST-HOME file format (next slide)
  – For real data including quality flags
• Return break detection file
  – BREAK
  – OUTLI
  – BEGTR
  – ENDTR
• Multiple breaks at one date possible

19 Typical errors
• The file format needs to be perfect!
• Forgetting the station-file that describes which stations belong to the homogenised network
• Changing the file names in this station file to homogeneous data files ►
• (Forgetting to return the files with the quality flags)
• The sizes of the breaks are not in the break file
• Please keep the directory structure of the benchmark as it is, also for partial contributions
  – The only difference is the main directory
• All files are tab-delimited ASCII files

20 COST-HOME file format – network file

21 Typical errors
• The file format needs to be perfect!
• Forgetting the station-file that describes which stations belong to the homogenised network
• Changing the file names in this station file to homogeneous data files
• (Forgetting to return the files with the quality flags)
• The sizes of the breaks are not in the break file ►
• Please keep the directory structure of the benchmark as it is, also for partial contributions
  – The only difference is the main directory
• All files are tab-delimited ASCII files

22 Detected breaks file

23 Typical errors – see discussion
• The file format needs to be perfect!
• Forgetting the station-file that describes which stations belong to the homogenised network
• Changing the file names in this station file to homogeneous data files
• (Forgetting to return the files with the quality flags)
• The sizes of the breaks are not in the break file
• Please keep the directory structure of the benchmark as it is, also for partial contributions
  – The only difference is the main directory
• All files are tab-delimited ASCII files ►

24 COST-HOME file format – monthly data

25 Contributions
Participant                          Algorithm                Remarks
1. José Guijarro                     Climatol                 6 versions with different settings
2. Péter Domonkos                    CM-D, MASH-D, SNHT-D     3 versions / detection algorithms
3. Michele Brunetti                  Brunetti                 Detection Craddock based; 2 surrogate temp. networks
4. Dubravka Rasol & Olivier Mestre   PRODIGE                  All surrogate temp.; 13 surrogate precip. networks
5. Matthew Menne & Claude Williams   Automated pairwise hom.  2 versions; “all” temp. networks (part of real #3 is missing)
6. Christine Gruber & Ingeborg Auer  HOCLIS                   1 surrogate temp. & 1 surrogate precip.
7. Gregor Vertacnik                  MASH                     All surrogate temp.
8. Petr Stepanek                     AnClim                   1 surrogate temp. & 1 surrogate precip.
9. Lucie Vincent                     Vincent                  1 surrogate temp.
10. Enric Aguilar                    SNHT                     Not in the right format yet

26 No. homogenised networks - algorithm
Table 1. Number of homogenised networks per algorithm
Homogenisation alg.   All networks   Real netw.   Surrogate netw.   Synthetic netw.
PRODIGE               32             0            32                0
Brunetti              2              0            2                 0
MASH                  25             0            25                0
Vincent               1              0            1                 0
HOCLIS                1              0            1                 0
AnClim                2              0            2                 0
Climatol A            92             12           40                40
Climatol C            92             12           40                40
Climatol D            92             12           40                40
Climatol E            92             12           40                40
Climatol F            92             12           40                40
ClimatolG01           2              0            2                 0
APHa2                 42             5            18                19
APHa1                 42             5            18                19
CM-D                  5              0            5                 0
MASH-D                5              0            5                 0
SNHT-D                5              0            5                 0

27 No. homogenised networks – input data
Table 3. Summary data: Number of homogenised networks per network
Network         No. networks   Temp. netw.   Precip. netw.
All             624            371           253
Real            70             40            30
Surrogate       316            193           123
Surrogate #1    29             17            12
Surrogate ~#1   287            176           111
Synthetic       238            138           100
Synthetic #1    12             7             5
Synthetic ~#1   226            131           95

28 Mean no. outliers per station
Table 21. Mean number of outliers per station for every algorithm
Homogenisation alg.   All networks   Real netw.   Surrogate netw.   Synthetic netw.
PRODIGE               0.0            NaN          0.0               NaN
Brunetti              3.4            NaN          3.4               NaN
MASH                  16.1           NaN          16.1              NaN
Vincent               0.0            NaN          0.0               NaN
HOCLIS                6.0            NaN          6.0               NaN
AnClim                5.5            NaN          5.5               NaN
Climatol A            3.6            0.2          4.0               4.2
Climatol C            88.5           56.4         94.8              91.9
Climatol D            54.4           34.7         58.0              56.7
Climatol E            54.2           31.8         60.7              54.4
Climatol F            43.8           33.2         47.8              42.9
ClimatolG01           4.4            NaN          4.4               NaN
APHa2                 1.9            0.0          2.1               2.2
APHa1                 1.9            0.0          2.1               2.2
CM-D                  15.7           NaN          15.7              NaN
MASH-D                15.5           NaN          15.5              NaN
SNHT-D                15.3           NaN          15.3              NaN

29 Mean no. breaks per station
Table 22. Mean number of breaks per station for every algorithm
Homogenisation alg.   All networks   Real netw.   Surrogate netw.   Synthetic netw.
PRODIGE               2.7            NaN          2.7               NaN
Brunetti              5.0            NaN          5.0               NaN
MASH                  4.6            NaN          4.6               NaN
Vincent               0.0            NaN          0.0               NaN
HOCLIS                2.8            NaN          2.8               NaN
AnClim                1.2            NaN          1.2               NaN
Climatol A            1.3            0.8          1.5               1.3
Climatol C            1.5            0.8          1.6
Climatol D            1.4            1.0          1.5               1.4
Climatol E            1.6            0.9          1.8               1.6
Climatol F            1.5            1.2          1.7               1.4
ClimatolG01           1.2            NaN          1.2               NaN
APHa2                 1.8            1.2          2.1               1.7
APHa1                 1.7            1.0          1.9               1.6
CM-D                  4.6            NaN          4.6               NaN
MASH-D                3.9            NaN          3.9               NaN
SNHT-D                3.2            NaN          3.2               NaN

30 Homogenising the exercise
• Tab-delimited files: also space-delimited?
  – Mixture of strings and numbers
• Data quality files only for real data section
• Do we want to use the Diurnal Temperature Range (DTR)?
  – Not useful for surrogate and synthetic data!
  – If we do, everyone should do it
• End or begin uncorrected?
  – Compute statistics independent of absolute level?
• Filling missing values part of the exercise?
• Human quality control or raw algorithm output?
• Homogenise all or only homogenisable networks, times

31 Contributions – who is missing?
Participant                          Algorithm                Remarks
1. José Guijarro                     Climatol                 6 versions with different settings
2. Péter Domonkos                    CM-D, MASH-D, SNHT-D     3 versions / detection algorithms
3. Michele Brunetti                  Brunetti                 Detection Craddock based; 2 surrogate temp. networks
4. Dubravka Rasol & Olivier Mestre   PRODIGE                  All surrogate temp.; 13 surrogate precip. networks
5. Matthew Menne & Claude Williams   Automated pairwise hom.  2 versions; “all” temp. networks (part of real #3 is missing)
6. Christine Gruber & Ingeborg Auer  HOCLIS                   1 surrogate temp. & 1 surrogate precip.
7. Gregor Vertacnik                  MASH                     All surrogate temp.
8. Petr Stepanek                     AnClim                   1 surrogate temp. & 1 surrogate precip.
9. Lucie Vincent                     Vincent                  1 surrogate temp.
10. Enric Aguilar                    SNHT                     Not in the right format yet

32 Analysing the results
• What measures define a well homogenised dataset?
  – Real data vs. data with known truth
    • Ensemble mean for real data?
  – Breaks
    • Position, hit rate
    • Size distribution
    • Detection probability as function of size
  – Data itself
    • Root mean square error (RMSE)
    • RMSE (without outliers)
    • RMSE (bias corrected)
    • Uncertainty in the network mean trend
• How to study which components are best?
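The three RMSE variants listed for the data itself could be computed as in this Python sketch, which applies only to data with known truth. The function names and the `skip` parameter are hypothetical, and "bias corrected" is interpreted here as removing the mean difference between homogenised and true series before squaring.

```python
import math


def rmse(homogenised, truth, skip=()):
    """Root mean square error; indices in `skip` (e.g. known outliers) are ignored."""
    skipped = set(skip)
    errors = [h - t for i, (h, t) in enumerate(zip(homogenised, truth))
              if i not in skipped]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def rmse_bias_corrected(homogenised, truth):
    """RMSE after removing the mean difference, so the absolute level does not count."""
    errors = [h - t for h, t in zip(homogenised, truth)]
    bias = sum(errors) / len(errors)
    return math.sqrt(sum((e - bias) ** 2 for e in errors) / len(errors))


truth = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]        # constant offset of 1 everywhere
print(rmse(shifted, truth))                 # 1.0
print(rmse_bias_corrected(shifted, truth))  # 0.0
```

The bias-corrected variant is one way to make the statistic independent of whether an algorithm leaves the beginning or the end of a series uncorrected.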

33 Deadline(s)
• Agreed on 09/2009, September this year
• Multiple deadlines
  – For example: synthetic data, real data, surrogate data
  – After deadline the truth can be revealed
  – After deadline the other contributions can be revealed(?)
  – Start earlier analysing the results
  – For example: May, July, September
• Bologna, 25 – 26 May; EGU, 19 – 24 April

34 Articles
• Articles
  – Overview COST Action & benchmark with very basic analysis results
    • Performance difference between synthetic (Gaussian, white noise) and surrogate data
    • How to deal with multiple contributions per algorithm?
    • Do we have references to all algorithms?
  – What should the others be about?
    • Analysing results, which components are best
• Who will organise, coordinate it?
  – Not everyone should do the same analysis
  – How to subdivide the work?
• After deadline: sensitivity analysis

