# 14 May 2001 QPF Verification Workshop: Verification of Probability Forecasts at Points. WMO QPF Verification Workshop, Prague, Czech Republic, 14-16 May 2001


Verification of Probability Forecasts at Points. WMO QPF Verification Workshop, Prague, Czech Republic, 14-16 May 2001.

Barbara G. Brown, NCAR, Boulder, Colorado, U.S.A. (bgb@ucar.edu)

## Why probability forecasts?

“…the widespread practice of ignoring uncertainty when formulating and communicating forecasts represents an extreme form of inconsistency and generally results in the largest possible reductions in quality and value.” (Murphy 1993)

## Outline

1. Background and basics
   - Types of events
   - Types of forecasts
   - Representation of probabilistic forecasts in the verification framework
2. Verification approaches: focus on 2-category case
   - Measures
   - Graphical representations
   - Using statistical models
   - Signal detection theory
   - Ensemble forecast verification
   - Extensions to multi-category verification problem
   - Comparing probabilistic and categorical forecasts
3. Connections to value
4. Summary, conclusions, issues

## Background and basics

Types of events:
- Two-category
- Multi-category

Two-category events:
- Either event A happens or event B happens
- Examples: Rain/No-rain, Hail/No-hail, Tornado/No-tornado

Multi-category events:
- Event A, B, C, ... or Z happens
- Example: Precipitation categories (< 1 mm, 1-5 mm, 5-10 mm, etc.)

## Background and basics (cont.)

Types of forecasts:
- Completely confident
  - Forecast probability is either 0 or 1
  - Example: Rain/No rain
- Probabilistic
  - Objective (deterministic, statistical, ensemble-based)
  - Subjective
  - Probability is stated explicitly

## Background and basics (cont.)

Representation of probabilistic forecasts in the verification framework:
- x = 0 or 1 (the observation)
- f = 0, ..., 1.0 (the forecast probability)
- f may be limited to only certain values between 0 and 1
- Joint distribution: p(f,x), where x = 0, 1
- Example: If there are 12 possible values of f, then p(f,x) consists of 24 elements

## Background and basics (cont.)

Factorizations: conditional and marginal probabilities
- Calibration-Refinement factorization: p(f,x) = p(x|f) p(f)
  - p(x=0|f) = 1 - p(x=1|f) = 1 - E(x|f), so only one number is needed to specify the distribution p(x|f) for each f
  - p(f) is the frequency of use of each forecast probability
- Likelihood-Base Rate factorization: p(f,x) = p(f|x) p(x)
  - p(x) is the relative frequency of a Yes observation (e.g., the sample climatology of precipitation); p(x=1) = E(x)
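
The two factorizations can be checked numerically on a small sample. The sketch below is illustrative only: the forecast and observation arrays are made-up values, and the variable names are assumptions rather than anything from the talk.

```python
import numpy as np

# Hypothetical sample: forecast probabilities f and binary observations x.
f = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9, 0.1])
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0])

f_values = np.unique(f)  # distinct forecast probabilities used
n = len(f)

# Joint distribution p(f, x): one row per forecast value, columns for x=0 and x=1.
joint = np.array([[np.sum((f == fv) & (x == xv)) / n for xv in (0, 1)]
                  for fv in f_values])

# Calibration-refinement factorization: p(f, x) = p(x|f) p(f)
p_f = joint.sum(axis=1)                # marginal p(f)
p_x_given_f = joint / p_f[:, None]     # conditional p(x|f)

# Likelihood-base rate factorization: p(f, x) = p(f|x) p(x)
p_x = joint.sum(axis=0)                # marginal p(x)
p_f_given_x = joint / p_x[None, :]     # conditional p(f|x)

print("p(x=1|f):", dict(zip(f_values, p_x_given_f[:, 1])))
print("p(x=1):", p_x[1])
```

With three distinct forecast values the joint distribution has 6 elements, in line with the 12-value, 24-element example given earlier.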

## Attributes

[Figure: forecast attributes, including sharpness, from Murphy and Winkler (1992)]

## Verification approaches: 2x2 case

Completely confident forecasts:

[Table: 2x2 contingency table of forecast vs. observed Yes/No counts]

Use the counts in this table to compute various common statistics (e.g., POD, POFD, H-K, FAR, CSI, Bias, etc.)

## Verification measures for 2x2 (Yes/No) completely confident forecasts

[Table: verification measures for the 2x2 case]
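
Since the table of measures did not survive in this transcript, the sketch below uses the standard textbook definitions of the statistics named above. The cell names follow the YY/YN/NY/NN notation used later in the signal detection slides (first letter = forecast, second letter = observation); the function name and the counts are hypothetical.

```python
def two_by_two_measures(yy, yn, ny, nn):
    """Common 2x2 measures from contingency-table counts.

    yy: forecast Yes, observed Yes    yn: forecast Yes, observed No
    ny: forecast No,  observed Yes    nn: forecast No,  observed No
    """
    pod = yy / (yy + ny)            # probability of detection
    pofd = yn / (yn + nn)           # probability of false detection
    far = yn / (yy + yn)            # false alarm ratio
    csi = yy / (yy + yn + ny)       # critical success index
    bias = (yy + yn) / (yy + ny)    # frequency bias
    hk = pod - pofd                 # Hanssen-Kuipers discriminant (H-K)
    return {"POD": pod, "POFD": pofd, "FAR": far,
            "CSI": csi, "Bias": bias, "H-K": hk}

# Made-up counts for illustration.
m = two_by_two_measures(yy=30, yn=20, ny=10, nn=140)
print(m)
```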

## Relationships among measures in the 2x2 case

Many of the measures in the 2x2 case are strongly related in surprisingly complex ways. For example:

[Figure: relationships among 2x2 measures. The lines indicate different values of POD and POFD (where POD = POFD): 0.10, 0.30, 0.50, 0.70, 0.90. From Brown and Young (2000)]

[Figure: CSI as a function of p(x=1) and POD = POFD, with lines for 0.1, 0.3, 0.5, 0.7, 0.9]

[Figure: CSI as a function of FAR and POD]
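
The dependencies plotted in these figures follow algebraically from the 2x2 counts. As a sketch (the function names are hypothetical, and the formulas are derived from the standard definitions rather than copied from the slides), CSI can be written in terms of POD and FAR, or in terms of the base rate p(x=1) when POD = POFD:

```python
def csi_from_pod_far(pod, far):
    # From CSI = YY/(YY+YN+NY), POD = YY/(YY+NY), FAR = YN/(YY+YN):
    # 1/CSI = 1/POD + 1/(1 - FAR) - 1
    return 1.0 / (1.0 / pod + 1.0 / (1.0 - far) - 1.0)

def csi_from_base_rate(h, s):
    # Forecasts with POD = POFD = h and base rate s = p(x=1):
    # YY = h*s, YN = h*(1-s), NY = (1-h)*s (as fractions of the sample)
    return h * s / (h + s - h * s)

# Same made-up table as before: YY=30, YN=20, NY=10 gives CSI = 0.5.
print(csi_from_pod_far(pod=0.75, far=0.4))

# CSI rises with the base rate even though POD = POFD (no discrimination).
for s in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(s, round(csi_from_base_rate(h=0.5, s=s), 3))
```

The second function illustrates why CSI is hard to compare across events with different climatological frequencies.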

## Measures for Probabilistic Forecasts

Summary measures:
- Expectation
  - Conditional: E(f|x=0), E(f|x=1), E(x|f)
  - Marginal: E(f), E(x) = p(x=1)
- Correlation
  - Joint distribution
- Variability
  - Conditional: Var(f|x=0), Var(f|x=1), Var(x|f)
  - Marginal: Var(f), Var(x) = E(x)[1-E(x)]

[Table: summary measures for joint and marginal distributions, from Murphy and Winkler (1992)]

[Table: summary measures for conditional distributions, from Murphy and Winkler (1992)]

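## Performance measures

Brier score: $\mathrm{BS} = \frac{1}{N}\sum_{k=1}^{N}(f_k - x_k)^2$
- Analogous to MSE; negative orientation
- For perfect forecasts: BS = 0

Brier skill score: $\mathrm{BSS} = 1 - \mathrm{BS}/\mathrm{BS}_{\mathrm{ref}}$, where $\mathrm{BS}_{\mathrm{ref}}$ is the Brier score of a reference forecast (e.g., climatology)
- Analogous to MSE skill score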
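
A minimal sketch of both scores, assuming the usual definitions (mean squared difference between forecast probability and binary observation, and skill relative to a reference Brier score). The sample climatology is used as the default reference here, which is an assumption; the function names and data are hypothetical.

```python
import numpy as np

def brier_score(f, x):
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    return float(np.mean((f - x) ** 2))

def brier_skill_score(f, x, f_ref=None):
    x = np.asarray(x, dtype=float)
    if f_ref is None:
        # Assumed reference: constant forecast of the sample climatology.
        f_ref = np.full(len(x), x.mean())
    return 1.0 - brier_score(f, x) / brier_score(f_ref, x)

f = [0.9, 0.1, 0.8, 0.3]   # made-up forecast probabilities
x = [1, 0, 1, 0]           # made-up observations
print(brier_score(f, x), brier_skill_score(f, x))
```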

[Figure: from Murphy and Winkler (1992)]

## Brier score displays

[Figure: from Shirey and Erickson, http://www.nws.noaa.gov/tdl/synop/amspapers/masmrfpap.htm]

[Figure: from http://www.nws.noaa.gov/tdl/synop/mrfpop/mainframes.htm]

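## Decomposition of the Brier Score

Break the Brier score into more elemental components:

$$\mathrm{BS} = \underbrace{\frac{1}{N}\sum_{i=1}^{I} N_i\,(f_i - \bar{x}_i)^2}_{\text{Reliability}} \;-\; \underbrace{\frac{1}{N}\sum_{i=1}^{I} N_i\,(\bar{x}_i - \bar{x})^2}_{\text{Resolution}} \;+\; \underbrace{\bar{x}\,(1 - \bar{x})}_{\text{Uncertainty}}$$

where I = the number of distinct probability values, $N_i$ is the number of times $f_i$ was forecast, $\bar{x}_i = p(x=1|f_i)$, and $\bar{x} = p(x=1)$.

Then, the Brier Skill Score (relative to sample climatology) can be re-formulated as

$$\mathrm{BSS} = \frac{\text{Resolution} - \text{Reliability}}{\text{Uncertainty}}$$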
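
The decomposition can be verified on a small sample. This sketch assumes the standard three-term form (reliability minus resolution plus uncertainty, grouped by distinct forecast value); the function name and data are made up for illustration.

```python
import numpy as np

def brier_decomposition(f, x):
    """Return (reliability, resolution, uncertainty) for binary x."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(f)
    xbar = x.mean()                      # sample climatology p(x=1)
    rel = res = 0.0
    for fi in np.unique(f):              # loop over the I distinct forecast values
        mask = f == fi
        n_i = mask.sum()                 # how often f_i was used
        xbar_i = x[mask].mean()          # observed frequency p(x=1|f_i)
        rel += n_i * (fi - xbar_i) ** 2
        res += n_i * (xbar_i - xbar) ** 2
    return rel / n, res / n, xbar * (1.0 - xbar)

f = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9, 0.1])
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0])

rel, res, unc = brier_decomposition(f, x)
bs = np.mean((f - x) ** 2)
print(rel, res, unc, bs)
print("BSS:", (res - rel) / unc)
```

The three terms recombine to the directly computed Brier score, and (Resolution - Reliability) / Uncertainty matches the skill score against sample climatology.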

## Graphical representations of measures

- Reliability diagram: p(x=1|f_i) vs. f_i
- Sharpness diagram: p(f)
- Attributes diagram: Reliability, Resolution, Skill/No-skill
- Discrimination diagram: p(f|x=0) and p(f|x=1)

Together, these diagrams provide a relatively complete picture of the quality of a set of probability forecasts.

## Reliability and Sharpness (from Wilks 1995)

[Figure: example reliability and sharpness diagrams. Panel titles: Climatology; Minimal RES; Underforecasting; Good RES, at expense of REL; Reliable forecasts of rare event; Small sample size]

## Reliability and Sharpness (from Murphy and Winkler 1992)

[Figure: St. Louis 12-24 h PoP, cool season. Reference lines: No skill, No RES. Panels: Model, Sub]

## Attributes diagram (from Wilks 1995)

[Figure: attributes diagram]

## Icing forecast examples

[Figure: icing forecast examples]

## Use of statistical models to describe verification features

Exploratory study by Murphy and Wilks (1998)

Case study:
- Use regression model to model reliability
- Use Beta distribution to model p(f) as measure of sharpness
- Use multivariate diagram to display combinations of characteristics

Promising approach that is worthy of more investigation

## Fit Beta distribution to p(f)

- 2 parameters: p, q
- Ideal: p < 1; q < 1

[Figure: fitted Beta distributions for p(f) on the interval 0 to 1]
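
One simple way to obtain the two Beta parameters is the method of moments. The slides do not say which fitting method Murphy and Wilks used, so this is an assumed stand-in for illustration; the function name and data are hypothetical.

```python
import numpy as np

def beta_method_of_moments(f):
    """Estimate Beta(p, q) parameters from forecast probabilities in (0, 1)."""
    f = np.asarray(f, dtype=float)
    m = f.mean()
    v = f.var()                       # population variance (ddof=0)
    common = m * (1.0 - m) / v - 1.0  # requires 0 < v < m(1-m)
    return m * common, (1.0 - m) * common

f = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9, 0.1])
p, q = beta_method_of_moments(f)
print(p, q)
```

For this sample both parameters fall below 1, which corresponds to a U-shaped p(f): forecasts concentrated near 0 and 1, the sharp case labeled as ideal above.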

## Fit regression to Reliability diagram [p(x|f) vs. f]

- 2 parameters: b_0, b_1

[Figure: from Murphy and Wilks (1998)]
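
A straight-line fit of x on f gives the two reliability parameters in the simplest case. The exact regression specification used by Murphy and Wilks is not given on the slide, so ordinary least squares is assumed here; perfectly reliable forecasts would give b_0 = 0 and b_1 = 1. The data are made up.

```python
import numpy as np

f = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9, 0.1])
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0])

# np.polyfit returns coefficients from highest degree down: [slope, intercept].
# Least squares on the individual (f, x) pairs is equivalent to fitting the
# conditional frequencies p(x=1|f_i), weighted by how often each f_i was used.
b1, b0 = np.polyfit(f, x, deg=1)
print("b0 =", b0, "b1 =", b1)
```

A slope below 1 with a positive intercept, as in this sample, indicates overconfident forecasts: low probabilities verify too often and high probabilities too rarely.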

## Summary Plot

[Figure: from Murphy and Wilks (1998)]

## Signal Detection Theory (SDT)

- Approach that has commonly been applied in medicine and other fields
- Brought to meteorology by Ian Mason (1982)
- Evaluates the ability of forecasts to discriminate between occurrence and non-occurrence of an event
- Summarizes characteristics of the Likelihood-Base Rate decomposition of the framework
- Tests model performance relative to specific threshold
- Ignores calibration
- Allows comparison of categorical and probabilistic forecasts

## Mechanics of SDT

Based on the likelihood-base rate decomposition p(f,x) = p(f|x) p(x)

Basic elements:
- Hit rate (HR): HR = POD = YY / (YY + NY); estimate of p(f=1|x=1)
- False alarm rate (FA): FA = POFD = YN / (YN + NN); estimate of p(f=1|x=0)

Relative Operating Characteristic (ROC) curve:
- Plot HR vs. FA
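
For probability forecasts, each decision threshold turns the forecasts into Yes/No and yields one (FA, HR) point on the ROC. The sketch below assumes a "forecast Yes when f >= threshold" rule; the function name, thresholds, and data are hypothetical.

```python
import numpy as np

def roc_points(f, x, thresholds):
    """Return a list of (FA, HR) pairs, one per probability threshold."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x)
    points = []
    for t in thresholds:
        yes = f >= t                       # categorical forecast at this threshold
        yy = np.sum(yes & (x == 1))        # forecast Yes, observed Yes
        ny = np.sum(~yes & (x == 1))       # forecast No,  observed Yes
        yn = np.sum(yes & (x == 0))        # forecast Yes, observed No
        nn = np.sum(~yes & (x == 0))       # forecast No,  observed No
        hr = yy / (yy + ny)                # estimate of p(f=1|x=1)
        fa = yn / (yn + nn)                # estimate of p(f=1|x=0)
        points.append((float(fa), float(hr)))
    return points

f = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9, 0.1])
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0])
pts = roc_points(f, x, thresholds=[0.1, 0.5, 0.9])
print(pts)
```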

## ROC examples: Mason (1982)

[Figure: ROC examples from Mason (1982)]

## ROC examples: Icing forecasts

[Figure: ROC examples for icing forecasts]

## ROC

Area under the ROC is a measure of forecast skill
- Values less than 0.5 indicate negative skill

Measurement of ROC area often is better if a normal distribution model is used to model HR and FA
- Area can be underestimated if the curve is approximated by straight line segments
- Harvey et al. (1992); Mason (1982); Wilson (2000)
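
The straight-line-segment approximation mentioned above is the trapezoidal rule applied to the ROC points, with the corners (0, 0) and (1, 1) added. The sketch below shows that approximation only; it does not implement the normal-distribution model, and the function name and points are made up.

```python
import numpy as np

def roc_area_trapezoid(points):
    """Area under the ROC from (FA, HR) pairs joined by straight segments."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    fa = np.array([p[0] for p in pts])
    hr = np.array([p[1] for p in pts])
    # Sum of trapezoid areas between consecutive points.
    return float(np.sum(np.diff(fa) * (hr[1:] + hr[:-1]) / 2.0))

area = roc_area_trapezoid([(0.2, 0.6), (0.4, 0.8)])
print(area)
```

With only a few thresholds the segments cut inside a concave curve, which is why this estimate tends to be lower than the area under a smooth fitted ROC.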

## Idealized ROC (Mason 1982)

[Figure: idealized ROC curves for S = 2, S = 1, and S = 0.5, shown with the distributions f(x=1) and f(x=0)]

## Comparison of Approaches

Brier score:
- Based on squared error
- Strictly proper scoring rule
- Calibration is an important factor; lack of calibration impacts scores
- Decompositions provide insight into several performance attributes
- Dependent on frequency of occurrence of the event

ROC:
- Considers forecasts’ ability to discriminate between Yes and No events
- Calibration is not a factor
- Less dependent on frequency of occurrence of event
- Provides verification information for individual decision thresholds

## Relative operating levels

Analogous to the ROC, but from the Calibration-Refinement perspective (i.e., given the forecast)

Curves based on:
- Correct Alarm Ratio: YY / (YY + YN)
- Miss Ratio: NY / (NY + NN)

These statistics are estimates of two conditional probabilities:
- Correct Alarm Ratio: p(x=1|f=1)
- Miss Ratio: p(x=1|f=0)
- For a system with no skill, p(x=1|f=1) = p(x=1|f=0) = p(x=1)
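
The two ROL statistics condition on the forecast rather than on the observation. A minimal sketch, assuming the YY/YN/NY/NN cell notation from the SDT slide and a "forecast Yes when f >= threshold" rule; the function name and data are hypothetical.

```python
import numpy as np

def rol_point(f, x, threshold):
    """Return (miss ratio, correct alarm ratio) for one probability threshold."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x)
    yes = f >= threshold
    yy = np.sum(yes & (x == 1))
    yn = np.sum(yes & (x == 0))
    ny = np.sum(~yes & (x == 1))
    nn = np.sum(~yes & (x == 0))
    # Guard against thresholds that produce no Yes (or no No) forecasts.
    car = yy / (yy + yn) if (yy + yn) > 0 else float("nan")   # p(x=1|f=1)
    miss = ny / (ny + nn) if (ny + nn) > 0 else float("nan")  # p(x=1|f=0)
    return float(miss), float(car)

f = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9, 0.1])
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0])
miss, car = rol_point(f, x, threshold=0.5)
print(miss, car, "base rate:", x.mean())
```

In this made-up sample the correct alarm ratio lies above the base rate and the miss ratio below it, which is the signature of forecasts with some skill.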

## ROC Diagram (Mason and Graham 1999)

[Figure: ROC diagram]

## ROL Diagram (Mason and Graham 1999)

[Figure: ROL diagram]

## Verification of ensemble forecasts

Output of ensemble forecasting systems can be treated as:
- A probability distribution
- A probability
- A categorical forecast

Probabilistic forecasts from ensemble systems can be verified using standard approaches for probabilistic forecasts

Common methods:
- Brier score
- ROC

## Example: Palmer et al. (2000), Reliability

[Figure: reliability diagrams for the ECMWF ensemble and the multi-model ensemble; extracted panel labels: <0, <1]

## Example: Palmer et al. (2000), ROC

[Figure: ROC curves for the ECMWF ensemble and the multi-model ensemble]

## Verification of ensemble forecasts (cont.)

A number of methods have been developed specifically for use with ensemble forecasts. For example:

Rank histograms:
- Rank position of observations relative to ensemble members
- Ideal: Uniform distribution
- Non-ideal can occur for many reasons (Hamill 2001)

Ensemble distribution approach (Wilson et al. 1999):
- Fit distribution to ensemble
- Determine probability associated with that observation
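
A rank histogram can be tallied in a few lines. This sketch counts, for each case, how many ensemble members fall below the observation; ties are ignored, which is a simplifying assumption (they matter for variables such as precipitation with many zeros). The function name and synthetic data are hypothetical.

```python
import numpy as np

def rank_histogram(ens, obs):
    """ens: array (n_cases, n_members); obs: array (n_cases,).

    Returns counts for the n_members + 1 possible rank positions.
    """
    ens = np.asarray(ens, dtype=float)
    obs = np.asarray(obs, dtype=float)
    ranks = np.sum(ens < obs[:, None], axis=1)   # 0 .. n_members
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

# Synthetic check: observations drawn from the same distribution as the
# members should give a roughly flat histogram.
rng = np.random.default_rng(0)
ens = rng.normal(size=(5000, 9))
obs = rng.normal(size=5000)
counts = rank_histogram(ens, obs)
print(counts)
```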

## Rank histograms

[Figure: example rank histograms]

## Distribution approach (Wilson et al. 1999)

[Figure: ensemble distribution approach]

## Extensions to multiple categories

Examples:
- QPF with several thresholds/categories

Approach 1: Evaluate each category on its own
- Compute Brier score, reliability, ROC, etc. for each category separately
- Problems:
  - Some categories will be very rare and have few Yes observations
  - Throws away important information related to the ordering of predictands and magnitude of error

## Example: Brier skill score for several categories

[Figure: from http://www.nws.noaa.gov/tdl/synop/mrfpop/mainframes.htm]

## Extensions to multiple categories (cont.)

Approach 2: Evaluate all categories simultaneously
- Rank Probability Score (RPS)
- Analogous to Brier Score for multiple categories
- Skill score: $\mathrm{RPSS} = 1 - \mathrm{RPS}/\mathrm{RPS}_{\mathrm{ref}}$
- Decompositions analogous to BS, BSS
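
The RPS compares cumulative forecast and observed probabilities across the ordered categories. The sketch below uses the unnormalized sum of squared cumulative differences; some formulations divide by the number of categories minus one, so the scaling here is an assumption. Function names and numbers are made up.

```python
import numpy as np

def rps(probs, obs_category):
    """Ranked probability score for one forecast.

    probs: forecast probabilities for K ordered categories (sum to 1)
    obs_category: index (0-based) of the category that occurred
    """
    probs = np.asarray(probs, dtype=float)
    obs = np.zeros_like(probs)
    obs[obs_category] = 1.0
    # Squared differences between cumulative forecast and observed distributions.
    return float(np.sum((np.cumsum(probs) - np.cumsum(obs)) ** 2))

forecast = [0.2, 0.5, 0.3]          # e.g. three precipitation categories
climatology = [1 / 3, 1 / 3, 1 / 3]  # assumed reference forecast
score = rps(forecast, obs_category=1)
score_ref = rps(climatology, obs_category=1)
print(score, score_ref, "RPSS:", 1.0 - score / score_ref)
```

Because the score works on cumulative probabilities, probability placed in a category far from the observed one is penalized more than probability placed in an adjacent category.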

## Multiple categories: Examples of alternative approaches

Continuous ranked probability score (Bouttier 1994; Brown 1974; Matheson and Winkler 1976; Unger 1985) and decompositions (Hersbach 2000):
- Analogous to RPS with infinite number of classes
- Decompose into Reliability and Resolution/uncertainty components

Multi-category reliability diagrams (Hamill 1997):
- Measures calibration in a cumulative sense
- Reduces impact of categories with few forecasts

Other references:
- Bouttier 1994
- Brown 1974
- Matheson and Winkler 1976
- Unger 1985
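
For an ensemble, the continuous ranked probability score of the empirical ensemble distribution can be computed from absolute differences, without integrating explicitly. This sketch uses that identity and does not reproduce the Hersbach (2000) decomposition; the function name and numbers are hypothetical.

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of the empirical distribution of an ensemble for one observation.

    Uses CRPS = E|X - y| - 0.5 * E|X - X'|, with X and X' drawn
    independently from the ensemble members.
    """
    m = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(m - obs))
    term2 = np.mean(np.abs(m[:, None] - m[None, :]))
    return float(term1 - 0.5 * term2)

value = crps_ensemble([1.0, 2.0, 3.0], obs=2.0)
print(value)
```

For a single-member ensemble the second term vanishes and the score reduces to the absolute error.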

## Continuous RPS example (Hersbach 2000)

[Figure: continuous RPS example]

## MCRD example (Hamill 1997)

[Figure: multi-category reliability diagram example]

## Connections to value

Cost-Loss ratio model:
- Optimal to protect whenever $C < pL$, i.e., $p > C/L$, where p is the probability of adverse weather, C is the cost of protection, and L is the loss incurred if adverse weather occurs without protection

## Wilks’ Value Score (Wilks 2001)

- VS is the percent improvement in value between climatological and perfect information as a function of C/L
- VS is impacted by (lack of) calibration
- VS can be generalized for particular/idealized distributions of C/L
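
A value score of this kind can be sketched with the cost-loss model: compare the mean expense incurred when acting on the forecasts with the expense under climatological and perfect information. The formulation below is a common textbook version and is assumed here, not taken from Wilks (2001); the function name and data are hypothetical.

```python
import numpy as np

def value_score(f, x, cost, loss):
    """Fractional improvement in expected expense over climatology."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    protect = f > cost / loss                       # decision rule p > C/L
    # Pay C when protecting; pay L when unprotected and the event occurs.
    ee_forecast = np.mean(np.where(protect, cost, loss * x))
    base_rate = x.mean()
    ee_climate = min(cost, base_rate * loss)        # always or never protect
    ee_perfect = base_rate * cost                   # protect only when needed
    return float((ee_climate - ee_forecast) / (ee_climate - ee_perfect))

f = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9, 0.1])
x = np.array([0, 0, 1, 0, 1, 1, 1, 1, 0, 0])
vs = value_score(f, x, cost=0.3, loss=1.0)
print(vs)   # multiply by 100 for a percent improvement
```

Repeating the calculation over a range of C/L values gives the value curve; because the decision rule compares f directly with C/L, miscalibrated probabilities lower the score.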

## VS example: Wilks (2001)

[Figure: Las Vegas, PoP, April 1980 – March 1987]

## VS example: Icing forecasts

[Figure: VS for icing forecasts]

## VS: Beta model example (Wilks 2001)

[Figure: VS with a Beta model]

## Richardson approach

- ROC context
- Calibration errors don’t impact the score

## Miscellaneous issues

Quantifying the uncertainty in verification measures:
- Issue: Spatial and temporal correlation
- A few approaches:
  - Parametric methods. Ex: Seaman et al. (1996)
  - Robust methods (confidence intervals for medians). Ex: Brown et al. (1997); Velleman and Hoaglin (1981)
  - Bootstrap methods. Ex: Hamill (1999); Kane and Brown (2000)

Treatment of observations as probabilistic?
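
As an illustration of the bootstrap option, the sketch below builds a percentile confidence interval for the Brier score by resampling forecast-observation pairs. It resamples cases independently, so it ignores the spatial and temporal correlation flagged above and would likely give intervals that are too narrow for correlated data. The function name, defaults, and synthetic data are assumptions.

```python
import numpy as np

def bootstrap_brier_ci(f, x, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Brier score (i.i.d. resampling)."""
    f = np.asarray(f, dtype=float)
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(f)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample pairs with replacement
        scores[b] = np.mean((f[idx] - x[idx]) ** 2)
    lo, hi = np.quantile(scores, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lo), float(hi)

# Synthetic, reliable forecasts: x is drawn with probability f.
rng = np.random.default_rng(1)
f = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9], size=500)
x = (rng.random(500) < f).astype(int)
lo, hi = bootstrap_brier_ci(f, x)
print(np.mean((f - x) ** 2), (lo, hi))
```

For serially correlated samples, resampling blocks of consecutive cases instead of single cases is one way to keep the dependence structure in each replicate.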

## Conclusions

- Basis for evaluating probability forecasts was established many years ago (Brier, Murphy, Epstein)
- Recent renewal in interest has led to new ideas
- Still more to do
  - Develop and implement a cohesive set of meaningful and useful methods
  - Develop greater understanding of methods we have and how they inter-relate

## Verification of Probabilistic QPFs: Selected References

- Brown, B.G., G. Thompson, R.T. Bruintjes, R. Bullock, and T. Kane, 1997: Intercomparison of in-flight icing algorithms. Part II: Statistical verification results. Weather and Forecasting, 12, 890-914.
- Davis, C., and F. Carr, 2000: Summary of the 1998 workshop on mesoscale model verification. Bulletin of the American Meteorological Society, 81, 809-819.
- Hamill, T.M., 1997: Reliability diagrams for multicategory probabilistic forecasts. Weather and Forecasting, 12, 736-741.
- Hamill, T.M., 1999: Hypothesis tests for evaluating numerical precipitation forecasts. Weather and Forecasting, 14, 155-167.
- Hamill, T.M., 2001: Interpretation of rank histograms for verifying ensemble forecasts. Monthly Weather Review, 129, 550-560.
- Harvey, L.O., Jr., K.R. Hammond, C.M. Lusk, and E.F. Mross, 1992: The application of signal detection theory to weather forecasting behavior. Monthly Weather Review, 120, 863-883.
- Hersbach, H., 2000: Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15, 559-570.
- Hsu, W.-R., and A.H. Murphy, 1986: The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. International Journal of Forecasting, 2, 285-293.
- Kane, T.L., and B.G. Brown, 2000: Confidence intervals for some verification measures: a survey of several methods. Preprints, 15th Conference on Probability and Statistics in the Atmospheric Sciences, 8-11 May, Asheville, NC, U.S.A., American Meteorological Society (Boston), 46-49.
- Mason, I., 1982: A model for assessment of weather forecasts. Australian Meteorological Magazine, 30, 291-303.
- Mason, I., 1989: Dependence of the critical success index on sample climate and threshold probability. Australian Meteorological Magazine, 37, 75-81.
- Mason, S., and N.E. Graham, 1999: Conditional probabilities, relative operating characteristics, and relative operating levels. Weather and Forecasting, 14, 713-725.
- Murphy, A.H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Weather and Forecasting, 8, 281-293.
- Murphy, A.H., and D.S. Wilks, 1998: A case study of the use of statistical models in forecast verification: Precipitation probability forecasts. Weather and Forecasting, 13, 795-810.
- Murphy, A.H., and R.L. Winkler, 1992: Diagnostic verification of probability forecasts. International Journal of Forecasting, 7, 435-455.
- Richardson, D.S., 2000: Skill and relative economic value of the ECMWF ensemble prediction system. Quarterly Journal of the Royal Meteorological Society, 126, 649-667.
- Seaman, R., I. Mason, and F. Woodcock, 1996: Confidence intervals for some performance measures of Yes-No forecasts. Australian Meteorological Magazine, 45, 49-53.
- Stanski, H., L.J. Wilson, and W.R. Burrows, 1989: Survey of common verification methods in meteorology. WMO World Weather Watch Tech. Rep. 8, 114 pp.
- Velleman, P.F., and D.C. Hoaglin, 1981: Applications, Basics, and Computing of Exploratory Data Analysis. Duxbury Press, 354 pp.
- Wilks, D.S., 1995: Statistical Methods in the Atmospheric Sciences. Academic Press, San Diego, CA, 467 pp.
- Wilks, D.S., 2001: A skill score based on economic value for probability forecasts. Meteorological Applications, in press.
- Wilson, L.J., W.R. Burrows, and A. Lanzinger, 1999: A strategy for verification of weather element forecasts from an ensemble prediction system. Monthly Weather Review, 127, 956-970.
