Ensemble Verification I


Ensemble Verification I
Renate Hagedorn, European Centre for Medium-Range Weather Forecasts

Objective of diagnostic/verification tools
Assessing the goodness of a forecast system involves determining both the skill and the value of its forecasts.
A forecast has skill if it predicts the observed conditions well according to some objective or subjective criteria.
A forecast has value if it helps the user to make better decisions than would be possible without knowledge of the forecast.
Forecasts with poor skill can still be valuable (e.g. when the main error is a location mismatch), and forecasts with high skill can be of little value (e.g. forecasting blue sky over a desert).

Ensemble Prediction System
• 1 control run + 50 perturbed runs (TL399 L62) → added dimension of ensemble members → f(x,y,z,t,e)
• How do we deal with this added dimension when interpreting, verifying and diagnosing EPS output? → Transition from deterministic (yes/no) to probabilistic forecasts

Assessing the quality of a forecast
• The forecast indicated 10% probability for rain
• It did rain on the day
• Was it a good forecast?  □ Yes  □ No  □ I don’t know (what a stupid question…)
• Single probabilistic forecasts are never completely wrong or right (unless they give 0% or 100% probabilities)
• To evaluate a forecast system we need to look at a (large) number of forecast–observation pairs

Assessing the quality of a forecast system
Characteristics of a forecast system:
Consistency*: Do the observations statistically belong to the distributions of the forecast ensembles? (consistent degree of ensemble dispersion)
Reliability: Can I trust the probabilities to mean what they say?
Sharpness: How much do the forecasts differ from the climatological mean probabilities of the event?
Resolution: How much do the forecasts differ from the climatological mean probabilities of the event, and does the system get it right?
Skill: Are the forecasts better than my reference system (chance, climatology, persistence, …)?
* Note that terms like consistency, reliability etc. are not always well defined in verification theory and can be used with different meanings in other contexts.

Rank Histogram
Rank histograms assess whether the ensemble spread is consistent with the assumption that the observations are statistically just another member of the forecast distribution.
Check whether the observations are equally distributed amongst the predicted ensemble.
Sort the ensemble members in increasing order and determine where the observation lies with respect to the ensemble members.
[Figure: two example ensembles sorted by temperature, illustrating a rank 1 case and a rank 4 case]
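
A minimal sketch of this bookkeeping in Python (illustrative only, not part of the lecture; the array shapes and toy data are assumptions):

```python
import numpy as np

def rank_histogram(ens, obs):
    """Count how often the observation falls into each of the M+1 rank bins.
    ens: ensemble forecasts, shape (n_cases, M); obs: observations, shape (n_cases,)."""
    n_cases, M = ens.shape
    # rank = 1 + number of members strictly below the observation (ties not treated here)
    ranks = 1 + np.sum(ens < obs[:, None], axis=1)
    return np.bincount(ranks, minlength=M + 2)[1:]   # counts for ranks 1..M+1

# Toy example: a statistically consistent ensemble gives a roughly flat histogram
rng = np.random.default_rng(0)
ens = rng.normal(size=(1000, 10))
obs = rng.normal(size=1000)
print(rank_histogram(ens, obs))
```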

Rank Histograms
[Figure: three example rank histograms, illustrating the following cases]
- OBS is indistinguishable from any other ensemble member
- OBS is too often below the ensemble members (biased forecast)
- OBS is too often outside the ensemble spread
A uniform rank histogram is a necessary but not sufficient criterion for determining that the ensemble is reliable (see also: T. Hamill, 2001, MWR).

Reliability
A forecast system is reliable if, statistically, the predicted probabilities agree with the observed frequencies, i.e. taking all cases in which the event is predicted to occur with a probability of x%, that event should occur in exactly x% of these cases; not more and not less.
A reliability diagram displays whether a forecast system is reliable (unbiased) or produces over-confident / under-confident probability forecasts.
A reliability diagram also gives information on the resolution (and sharpness) of a forecast system.
[Figure: forecast PDF compared with climatological PDF]

Reliability Diagram
Take a sample of probabilistic forecasts: e.g. 30 days x 2200 grid points = 66,000 forecasts.
How often was the event (T > 25) forecast with probability X?

FC Prob.   # FC    OBS frequency (perfect model)   OBS frequency (imperfect model)
100%       8000    8000 (100%)                     7200 (90%)
 90%       5000    4500 ( 90%)                     4000 (80%)
 80%       4500    3600 ( 80%)                     3000 (66%)
 ...
 10%       5500     550 ( 10%)                      800 (15%)
  0%       7000       0 (  0%)                      700 (10%)

Reliability Diagram (continued)
[Figure: the same table plotted as a reliability diagram, observed frequency (0-100%) on the y-axis against forecast probability (0-100%) on the x-axis]
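
A sketch of how such a reliability curve could be computed from forecast probabilities and binary observations (illustrative Python, assuming probabilities in 0-1 rather than the percentage bins of the slide):

```python
import numpy as np

def reliability_curve(prob, obs, n_bins=10):
    """prob: forecast probabilities in [0,1]; obs: 1 if the event occurred, else 0."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(prob, edges) - 1, 0, n_bins - 1)
    fc_prob, obs_freq, counts = [], [], []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            fc_prob.append(prob[mask].mean())   # mean forecast probability in bin
            obs_freq.append(obs[mask].mean())   # observed relative frequency
            counts.append(mask.sum())           # number of forecasts in bin (sharpness)
    return np.array(fc_prob), np.array(obs_freq), np.array(counts)
```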

Reliability Diagram
[Figure: reliability curve of an over-confident model compared with a perfect model]

Reliability Diagram
[Figure: reliability curve of an under-confident model compared with a perfect model]

Reliability Diagram
Reliability score (the smaller, the better)
[Figure: reliability curves of an imperfect and a perfect model]

Components of the Brier Score
N = total number of cases
I = number of probability bins
n_i = number of cases in probability bin i
f_i = forecast probability in probability bin i
o_i = frequency of the event being observed when forecast with f_i
→ Reliability: forecast probability vs. observed relative frequencies
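
The formula itself did not survive the transcript; with the symbols defined above, the reliability term of the standard (Murphy) decomposition is

```latex
\mathrm{REL} \;=\; \frac{1}{N}\sum_{i=1}^{I} n_i\,\bigl(f_i - o_i\bigr)^{2}
```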

Reliability Diagram
Reliability score (the smaller, the better); Resolution score (the bigger, the better)
Size of the red bullets represents the number of forecasts in each probability category (sharpness)
[Figure: two reliability diagrams, one with poor resolution and one with good resolution; the horizontal line labelled c marks the climatological frequency]

Components of the Brier Score
N = total number of cases
I = number of probability bins
n_i = number of cases in probability bin i
f_i = forecast probability in probability bin i
o_i = frequency of the event being observed when forecast with f_i
c = frequency of the event being observed in the whole sample
→ Reliability: forecast probability vs. observed relative frequencies
→ Resolution: ability to issue reliable forecasts close to 0% or 100%
→ Uncertainty: variance of the observed frequency in the sample
Brier Score = Reliability – Resolution + Uncertainty
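
Writing out the formulas that were shown as an image, the full decomposition with these symbols reads

```latex
\mathrm{BS}
= \underbrace{\frac{1}{N}\sum_{i=1}^{I} n_i\,(f_i - o_i)^{2}}_{\text{Reliability}}
- \underbrace{\frac{1}{N}\sum_{i=1}^{I} n_i\,(o_i - c)^{2}}_{\text{Resolution}}
+ \underbrace{c\,(1-c)}_{\text{Uncertainty}}
```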

Brier Score
• The Brier score is a measure of the accuracy of probability forecasts.
• Considering N forecast–observation pairs, the BS is defined as the mean squared difference between p and o, with
  p: forecast probability (fraction of members predicting the event)
  o: observed outcome (1 if the event occurs; 0 if it does not occur)
• BS varies from 0 (perfect deterministic forecasts) to 1 (perfectly wrong!)
• BS corresponds to the RMS error used for deterministic forecasts
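
The defining formula (shown as an image in the original slide) is the standard one:

```latex
\mathrm{BS} \;=\; \frac{1}{N}\sum_{k=1}^{N}\bigl(p_k - o_k\bigr)^{2}
```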

Brier Skill Score
• Skill scores are used to compare the performance of forecasts with that of a reference forecast, such as climatology or persistence.
• They are constructed so that a perfect forecast takes the value 1 and the reference forecast the value 0:
  Skill score = (score of current FC – score of ref FC) / (score of perfect FC – score of ref FC)
• Positive (negative) BSS → better (worse) than the reference
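
As a small illustration (not from the lecture; the toy numbers are invented), the BS and the BSS against a climatological reference can be computed as follows. Since the perfect BS is 0, the general formula above reduces to BSS = 1 - BS / BS_clim.

```python
import numpy as np

def brier_score(prob, obs):
    """prob: forecast probabilities in [0,1]; obs: binary outcomes (0/1)."""
    return np.mean((np.asarray(prob) - np.asarray(obs)) ** 2)

def brier_skill_score(prob, obs):
    """BSS = 1 - BS / BS_clim, using the sample climatology as the reference forecast."""
    obs = np.asarray(obs, dtype=float)
    bs = brier_score(prob, obs)
    bs_clim = brier_score(np.full_like(obs, obs.mean()), obs)
    return 1.0 - bs / bs_clim

# Example: prob = fraction of ensemble members predicting the event
prob = np.array([0.9, 0.7, 0.2, 0.1, 0.6])
obs  = np.array([1,   1,   0,   0,   1  ])
print(brier_score(prob, obs), brier_skill_score(prob, obs))
```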

Brier Skill Score & Reliability Diagram
• How is the area of positive skill constructed?
[Figure: reliability diagram (observed frequency against forecast probability) showing the perfect-reliability diagonal, the climatological frequency (line of no resolution), the line of no skill, and the shaded area of skill where RES > REL]

Reliability: 2m-Temp. > 0
[Figure: reliability diagrams for DEMETER hindcasts, 1-month lead, start date May, 1980–2001, for the individual models CERFACS, CNRM, ECMWF, INGV, LODYC, MPI, UKMO and the DEMETER multi-model, each annotated with its BSS, reliability score (Rel-Sc) and resolution score (Res-Sc)]

Assessing the quality of a forecast system
Characteristics of a forecast system:
Consistency: Do the observations statistically belong to the distributions of the forecast ensembles? (consistent degree of ensemble dispersion)
Reliability: Can I trust the probabilities to mean what they say?
Sharpness: How much do the forecasts differ from the climatological mean probabilities of the event?
Resolution: How much do the forecasts differ from the climatological mean probabilities of the event, and does the system get it right?
Skill: Are the forecasts better than my reference system (chance, climatology, persistence, …)?
Tools introduced so far: Rank Histogram, Reliability Diagram, Brier Skill Score

Discrimination
Until now, we looked at the question: if the forecast system predicts x, what is the observation y?
When we are interested in the ability of a forecast system to discriminate between events and non-events, we investigate the question: if the event y occurred, what was the forecast x?
Based on signal-detection theory, the Relative Operating Characteristic (ROC) measures this discrimination ability.
The ROC curve is defined as the curve of the hit rate (H) over the false alarm rate (F).
H and F can be calculated from the contingency table.

Verification of two category (yes/no) situations
• Compute the 2 x 2 contingency table (for a set of cases):

                          Event observed
                          Yes      No       total
Event forecasted   Yes    a        b        a+b
                   No     c        d        c+d
                   total  a+c      b+d      a+b+c+d = n

• Event probability: s = (a+c) / n
• Probability of a forecast of occurrence: r = (a+b) / n
• Frequency bias: B = (a+b) / (a+c)
• Proportion correct: PC = (a+d) / n

Example of Finley Tornado Forecasts (1884)
• The 2 x 2 contingency table:

                          Event observed
                          Yes      No       total
Event forecasted   Yes    28       72       100
                   No     23       2680     2703
                   total  51       2752     2803

• Event probability: s = (a+c) / n = 51/2803 = 0.018
• Probability of a forecast of occurrence: r = (a+b) / n = 100/2803 = 0.036
• Frequency bias: B = (a+b) / (a+c) = 100/51 = 1.961
• Proportion correct: PC = (a+d) / n = 2708/2803 = 0.966 → 96.6% accuracy
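
A short, illustrative Python check of these numbers (the function name and layout are my own, not from the slides):

```python
def contingency_summary(a, b, c, d):
    """a = hits, b = false alarms, c = misses, d = correct rejections."""
    n = a + b + c + d
    return {
        "event probability s": (a + c) / n,
        "forecast probability r": (a + b) / n,
        "frequency bias B": (a + b) / (a + c),
        "proportion correct PC": (a + d) / n,
    }

print(contingency_summary(a=28, b=72, c=23, d=2680))
# -> s ~ 0.018, r ~ 0.036, B ~ 1.96, PC ~ 0.966
```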

Example of Finley Tornado Forecasts (1884): never forecasting a tornado
• The 2 x 2 contingency table:

                          Event observed
                          Yes      No       total
Event forecasted   Yes    0        0        0
                   No     51       2752     2803
                   total  51       2752     2803

• Event probability: s = (a+c) / n = 51/2803 = 0.018
• Probability of a forecast of occurrence: r = (a+b) / n = 0/2803 = 0.0
• Frequency bias: B = (a+b) / (a+c) = 0/51 = 0.0
• Proportion correct: PC = (a+d) / n = 2752/2803 = 0.982 → 98.2% accuracy!

Some Scores and Skill Scores

Score                        Formula                                            Finley      Finley           (always
                                                                                (original)  (never fc T.)    fc. T.)
Proportion Correct           PC = (a+d)/n                                       0.966       0.982            0.018
Threat Score                 TS = a/(a+b+c)                                     0.228       0.000
Odds Ratio                   Θ = (ad)/(bc)                                      45.3        -
Odds Ratio Skill Score       Q = (ad-bc)/(ad+bc)                                0.957
Heidke Skill Score           HSS = 2(ad-bc)/[(a+c)(c+d)+(a+b)(b+d)]             0.355       0.0
Peirce Skill Score           PSS = (ad-bc)/[(a+c)(b+d)]                         0.523
Clayton Skill Score          CSS = (ad-bc)/[(a+b)(c+d)]                         0.271
Gilbert Skill Score (ETS)    GSS = (a-a_ref)/(a-a_ref+b+c), a_ref = (a+b)(a+c)/n   0.216
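
The scores in the table can be reproduced from the original Finley counts; an illustrative sketch (not part of the slides):

```python
def categorical_scores(a, b, c, d):
    n = a + b + c + d
    a_ref = (a + b) * (a + c) / n           # hits expected by chance
    return {
        "PC":  (a + d) / n,
        "TS":  a / (a + b + c),
        "OddsRatio": (a * d) / (b * c),
        "Q":   (a * d - b * c) / (a * d + b * c),
        "HSS": 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d)),
        "PSS": (a * d - b * c) / ((a + c) * (b + d)),
        "CSS": (a * d - b * c) / ((a + b) * (c + d)),
        "GSS": (a - a_ref) / (a - a_ref + b + c),   # equitable threat score
    }

print(categorical_scores(28, 72, 23, 2680))
# -> PC ~ 0.966, TS ~ 0.228, HSS ~ 0.355, PSS ~ 0.523, GSS ~ 0.216, ...
```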

Verification of two category (yes/no) situations
• The 2 x 2 contingency table (as before):

                          Event observed
                          Yes      No       total
Event forecasted   Yes    a        b        a+b
                   No     c        d        c+d
                   total  a+c      b+d      a+b+c+d = n

• Event probability: s = (a+c) / n
• Probability of a forecast of occurrence: r = (a+b) / n
• Frequency bias: B = (a+b) / (a+c)
• Hit Rate: H = a / (a+c)
• False Alarm Rate: F = b / (b+d)
• False Alarm Ratio: FAR = b / (a+b)

Example of Finley Tornado Forecasts (1884)
• The 2 x 2 contingency table:

                          Event observed
                          Yes      No       total
Event forecasted   Yes    28       72       100
                   No     23       2680     2703
                   total  51       2752     2803

• Event probability: s = (a+c) / n = 0.018
• Probability of a forecast of occurrence: r = (a+b) / n = 0.036
• Frequency bias: B = (a+b) / (a+c) = 1.961
• Hit Rate: H = a / (a+c) = 0.549
• False Alarm Rate: F = b / (b+d) = 0.026
• False Alarm Ratio: FAR = b / (a+b) = 0.720
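
And the corresponding rates, again as an illustrative check of the slide's numbers:

```python
def rates(a, b, c, d):
    return {
        "hit rate H": a / (a + c),
        "false alarm rate F": b / (b + d),
        "false alarm ratio FAR": b / (a + b),
    }

print(rates(28, 72, 23, 2680))   # -> H ~ 0.549, F ~ 0.026, FAR ~ 0.720
```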

Extension of the 2 x 2 contingency table for probabilistic forecasts
Count, for each forecast-probability category, how often the event was and was not observed; accumulating the counts from the highest category downwards gives a hit rate (H) and false alarm rate (F) for each probability threshold (cells missing from the transcript are left blank):

FC probability   obs: Yes   obs: No      threshold   H      F
>80% - 100%      30         5            >80%        0.29   0.05
>60% - 80%       25         10           >60%        0.52   0.14
>40% - 60%       20         15           >40%        0.71
>20% - 40%                               >20%        0.86   0.48
>0% - 20%                                >0%         0.95
0%                                                   1.00
total            105

[Figure: the (F, H) pairs for thresholds >80, >60, >40, >20, >0 plotted as hit rate against false alarm rate, tracing out a ROC curve]

ROC curve
• The ROC curve is a plot of H against F for a range of probability thresholds.
• The ROC area (area under the ROC curve) is a skill measure: A = 0.5 (no skill), A = 1 (perfect deterministic forecast).
[Figure: ROC curve through points for low, moderate and high probability thresholds; the area under this example curve is A = 0.83]
• The ROC curve is independent of forecast bias, i.e. it represents potential skill.
• ROC is conditioned on the observations (if y occurred, what did the FC predict?)
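
An illustrative sketch of how the (F, H) points and the ROC area could be computed from forecast probabilities and binary observations (thresholds, variable names and the toy data are assumptions, not ECMWF code):

```python
import numpy as np

def roc_points(prob, obs, thresholds=(0.8, 0.6, 0.4, 0.2, 0.0)):
    """prob: forecast probabilities; obs: 1 if the event occurred, else 0.
    Returns false alarm rate F and hit rate H, one pair per probability threshold."""
    prob, obs = np.asarray(prob), np.asarray(obs).astype(bool)
    H = [np.sum((prob > t) & obs) / np.sum(obs) for t in thresholds]
    F = [np.sum((prob > t) & ~obs) / np.sum(~obs) for t in thresholds]
    return np.array(F), np.array(H)

def roc_area(F, H):
    """Area under the ROC curve by trapezoidal integration, closing the curve at (0,0) and (1,1)."""
    order = np.argsort(F)
    F = np.concatenate(([0.0], F[order], [1.0]))
    H = np.concatenate(([0.0], H[order], [1.0]))
    return np.sum(np.diff(F) * (H[1:] + H[:-1]) / 2.0)

# Toy usage: probabilities loosely correlated with the outcomes
rng = np.random.default_rng(1)
obs = (rng.random(500) < 0.3).astype(int)
prob = np.clip(0.25 + 0.5 * obs + 0.2 * rng.standard_normal(500), 0.0, 1.0)
F, H = roc_points(prob, obs)
print(roc_area(F, H))   # should be well above the no-skill value of 0.5
```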

ROCSS vs. BSS
• A ROCSS or BSS > 0 indicates a skilful forecast system.
[Figure: ROC skill score and Brier skill score for Northern Extra-Tropics 500 hPa anomalies > 2σ, spring 2002 (Richardson, 2005)]

Summary I
A forecast has skill if it predicts the observed conditions well according to some objective or subjective criteria.
To evaluate a forecast system we need to look at a (large) number of forecast–observation pairs.
Different scores measure different characteristics of the forecast system: reliability / resolution, Brier Score (BSS), ROC, …
The perceived usefulness of an ensemble may vary with the score used.
It is important to understand the behaviour of the different scores and to choose appropriately.

Goal of Practice Session
How to construct a contingency table
How to plot a Reliability Diagram (including the Frequency Diagram) from the contingency table
How to interpret the Reliability and Frequency Diagrams
How to calculate the Brier Score and Brier Skill Score
  - the “direct” way
  - from the contingency table (BS = REL – RES + UNC)
How to plot a ROC Diagram
Compare the characteristics of the Reliability and ROC diagrams