Presentation on theme: "1 Forecast verification - did we get it right? Ian Jolliffe Universities of Reading, Southampton, Kent, Aberdeen"— Presentation transcript:
1 Forecast verification - did we get it right? Ian Jolliffe Universities of Reading, Southampton, Kent, Aberdeen firstname.lastname@example.org
2 Outline of talk Introduction – what, why, how? Binary forecasts –Performance measures, ROC curves –Desirable properties Of forecasts Of performance measures Other forecasts –Multi-category, continuous, (probability) Value
3 Forecasts Forecasts are made in many disciplines –Weather and climate –Economics –Sales –Medical diagnosis
4 Why verify/validate/assess forecasts? Decisions are based on past data but also on forecasts of data not yet observed A look back at the accuracy of forecasts is necessary to determine whether current forecasting methods should be continued, abandoned or modified
5 Two (very different) recent references
–I T Jolliffe and D B Stephenson (eds.) (2003) Forecast Verification: A Practitioner’s Guide in Atmospheric Science. Wiley.
–M S Pepe (2003) The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford.
6 Horses for courses Different types of forecast need different methods of verification, for example in the context of weather hazards (TSUNAMI project):
–binary data: damaging frost, yes/no
–categorical: storm damage, slight/moderate/severe
–discrete: how many land-falling hurricanes in a season
–continuous: height of a high tide
–probabilities: of a tornado
Some forecasts (wordy/descriptive) are very difficult to verify at all.
7 Binary forecasts Such forecasts might be –Whether temperature will fall below a threshold, damaging crops or forming ice on roads –Whether maximum river flow will exceed a threshold, causing floods –Whether mortality due to extreme heat will exceed some threshold (PHEWE project) –Whether a tornado will occur in a specified area The classic Finley Tornado example (next 2 slides) illustrates that assessing such forecasts is more subtle than it looks There are many possible verification measures – most have some poor properties
8 Forecasting tornadoes
                       Tornado observed   Tornado not observed   Total
Tornado forecast              28                   72              100
Tornado not forecast          23                 2680             2703
Total                         51                 2752             2803
9 Tornado forecasts Correct decisions: 2708/2803 = 96.6%. Correct decisions by a procedure that always forecasts ‘No Tornado’: 2752/2803 = 98.2%. It’s easy to forecast ‘No Tornado’ and get it right, but more difficult to forecast when tornadoes will occur. Correct decisions when a tornado is forecast: 28/100 = 28.0%. Correct forecasts of observed tornadoes: 28/51 = 54.9%.
10 Forecast/observed contingency table
                      Event observed   Event not observed   Total
Event forecast              a                  b            a + b
Event not forecast          c                  d            c + d
Total                     a + c              b + d            n
11 Some verification measures for (2 x 2) tables
–a/(a+c): hit rate = true positive fraction = sensitivity
–b/(b+d): false alarm rate = 1 – specificity
–b/(a+b): false alarm ratio = 1 – positive predictive value
–c/(c+d): 1 – negative predictive value
–(a+d)/n: proportion correct (PC)
–(a+b)/(a+c): bias
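As a rough illustration of the measures on this slide, the sketch below evaluates them for the Finley tornado counts given earlier (a = 28, b = 72, c = 23, d = 2680). The function name `basic_measures` and the dictionary keys are hypothetical, chosen only for this example.

```python
def basic_measures(a, b, c, d):
    """Measures from the (2 x 2) table: a hits, b false alarms, c misses, d correct rejections."""
    n = a + b + c + d
    return {
        "hit_rate": a / (a + c),
        "false_alarm_rate": b / (b + d),
        "false_alarm_ratio": b / (a + b),
        "negative_predictive_value": d / (c + d),  # so c/(c+d) is its complement
        "proportion_correct": (a + d) / n,
        "bias": (a + b) / (a + c),
    }


# Finley tornado counts from the earlier slide.
finley = basic_measures(a=28, b=72, c=23, d=2680)
for name, value in finley.items():
    print(f"{name}: {value:.3f}")
```

A bias close to 2 here says that tornadoes were forecast roughly twice as often as they were observed, which the proportion correct alone does not reveal.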
12 Skill scores A skill score is a verification measure adjusted to show improvement over some unskilful baseline, typically a forecast of ‘climatology’, a random forecast or a forecast of persistence. Usually adjustment gives zero value for the baseline and unity for a perfect forecast. For (2x2) tables we know how to calculate ‘expected’ values in the cells of the table under a null hypothesis of no association (no skill) for a χ² test.
13 More (2x2) verification measures
–(PC – E)/(1 – E), where E is the expected value of PC assuming no skill: the Heidke (1926) skill score = Cohen’s Kappa (1960), also Doolittle (1885)
–a/(a+b+c): critical success index (CSI) = threat score
–Gilbert’s (1884) skill score: a skill score version of CSI
–(ad – bc)/(ad + bc): Yule’s Q (1900), a skill score version of the odds ratio ad/bc
–a(b+d)/[b(a+c)] and c(b+d)/[d(a+c)]: diagnostic likelihood ratios
Note that neither the list of measures nor the list of names is exhaustive – see, for example, J A Swets (1986), Psychological Bulletin, 99, 100-117
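The sketch below computes these measures for the Finley counts. Two points are assumptions rather than content of the slide: E is taken from the usual no-association expected counts (row total times column total over n), and the Gilbert score uses the commonly quoted form in which the hits expected by chance are removed from both numerator and denominator of CSI. The function name `skill_scores` is hypothetical.

```python
def skill_scores(a, b, c, d):
    """Skill-type measures for a (2 x 2) table of hits a, false alarms b, misses c, correct rejections d."""
    n = a + b + c + d
    pc = (a + d) / n
    # Expected proportion correct under no association (no skill).
    e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    # Hits expected by chance, used in the assumed form of the Gilbert score.
    a_chance = (a + b) * (a + c) / n
    return {
        "heidke": (pc - e) / (1 - e),
        "csi": a / (a + b + c),
        "gilbert": (a - a_chance) / (a + b + c - a_chance),
        "yule_q": (a * d - b * c) / (a * d + b * c),
        "likelihood_ratio_positive": a * (b + d) / (b * (a + c)),
        "likelihood_ratio_negative": c * (b + d) / (d * (a + c)),
    }


scores = skill_scores(a=28, b=72, c=23, d=2680)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

For these counts the Heidke score is far below the 96.6% proportion correct, because most of the correct decisions are ‘No Tornado’ forecasts that an unskilful forecaster would also get right.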
14 The ROC (Relative Operating Characteristic) curve Plots hit rate (proportion of occurrences of the event that were correctly forecast) against false alarm rate (proportion of non-occurrences that were incorrectly forecast) for different thresholds. Especially relevant if a number of different thresholds are of interest. There are a number of verification measures based on ROC curves. The most widely used is probably the area under the curve.
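A minimal sketch of how ROC points might be built from probability forecasts: the event is forecast whenever the probability reaches a threshold, and each threshold gives one (false alarm rate, hit rate) pair. The data are invented for illustration, and the trapezoidal area is only one simple way to estimate the area under the curve.

```python
def roc_points(probs, outcomes, thresholds):
    """Return one (false alarm rate, hit rate) pair per threshold."""
    n_event = sum(outcomes)
    n_non_event = len(outcomes) - n_event
    points = []
    for t in thresholds:
        # The event is forecast when the probability is at least t.
        hits = sum(1 for p, o in zip(probs, outcomes) if p >= t and o == 1)
        false_alarms = sum(1 for p, o in zip(probs, outcomes) if p >= t and o == 0)
        points.append((false_alarms / n_non_event, hits / n_event))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points, with the corners (0,0) and (1,1) added."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Hypothetical forecasts and outcomes (1 = event occurred).
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
outcomes = [1, 1, 0, 1, 0, 1, 0, 0]
points = roc_points(probs, outcomes, thresholds=[0.25, 0.5, 0.75])
auc = area_under_curve(points)
print(points, auc)
```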
15 Desirable properties of measures: hedging and proper scores ‘Hedging’ is when a forecaster gives a forecast different from his/her true belief because he/she believes that the hedged forecasts will improve the (expected) score on a measure used to verify the forecasts. Clearly hedging is undesirable. For probability forecasts, a (strictly) proper score is one for which the forecaster (uniquely) maximises the expected score by forecasting his/her true beliefs, so that there is no advantage in hedging.
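To make the idea concrete, the sketch below compares two scores for a forecaster whose true belief is 0.7. The quadratic score (one minus the squared error, i.e. a positively oriented Brier-type score) and the linear score are choices made for this illustration; neither is named on the slide. Under the quadratic score the expected score is maximised by forecasting 0.7, while under the linear score it is maximised by hedging to 1.

```python
def expected_score(score, belief, forecast):
    """Expected score when the event occurs with probability `belief`."""
    return belief * score(forecast, 1) + (1 - belief) * score(forecast, 0)


def quadratic_score(forecast, outcome):
    # One minus the squared error: strictly proper.
    return 1 - (forecast - outcome) ** 2


def linear_score(forecast, outcome):
    # One minus the absolute error: not proper, rewards hedging to 0 or 1.
    return 1 - abs(forecast - outcome)


def best_forecast(score, belief, grid):
    """Forecast probability on the grid that maximises the expected score."""
    return max(grid, key=lambda q: expected_score(score, belief, q))


grid = [i / 100 for i in range(101)]
belief = 0.7
best_quadratic = best_forecast(quadratic_score, belief, grid)
best_linear = best_forecast(linear_score, belief, grid)
print(best_quadratic, best_linear)
```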
17 Desirable properties of measures: equitability A score for a probability forecast is equitable if it takes the same expected value (often chosen to be zero) for all unskilful forecasts of the type –Forecast the same probability all the time or –Choose a probability randomly from some distribution on the range [0,1]. Equitability is desirable – if two sets of forecasts are made randomly, but with different random mechanisms, one should not score better than the other.
18 Desirable properties of measures III There are a number of other desirable properties of measures, both for probability forecasts and other types of forecast, but equitability and propriety are most often cited in the meteorological literature. Equitability and propriety are incompatible (a new result)
19 Desirable properties (attributes) of forecasts Reliability. Conditionally unbiased. Expected value of the observation equals the forecast value. Resolution. The sensitivity of the expected value of the observation to different forecast values (or more generally the sensitivity of this conditional distribution as a whole). Discrimination. The sensitivity of the conditional distribution of forecasts, given observations, to the value of the observation. Sharpness. Measures spread of marginal distribution of forecasts. Equivalent to resolution for reliable (perfectly calibrated) forecasts. –Other lists of desirable attributes exist.
20 A reliability diagram For a probability forecast of an event based on 850hPa temperature. Lots of grid points, so lots of forecasts (16380). Plots observed proportion of event occurrence for each forecast probability vs. forecast probability (solid line). Forecast probability takes only 17 possible values (0, 1/16, 2/16, … 15/16, 1) because forecast is based on proportion of an ensemble of 16 forecasts that predict the event. Because of the nature of the forecast event, 0 or 1 are forecast most of the time (see inset sharpness diagram).
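The quantities plotted in such a diagram can be sketched as follows: group the cases by forecast probability, then record the observed proportion of event occurrence and the number of cases in each group (the counts are what a sharpness diagram displays). The function names and the small data set are hypothetical; the real example uses 16380 forecasts.

```python
from collections import defaultdict


def ensemble_probability(member_predictions):
    """Forecast probability as the proportion of ensemble members predicting the event."""
    return sum(member_predictions) / len(member_predictions)


def reliability_table(probs, outcomes):
    """Map each distinct forecast probability to (observed proportion, number of forecasts)."""
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        groups[p].append(o)
    return {p: (sum(obs) / len(obs), len(obs)) for p, obs in sorted(groups.items())}


# A 16-member ensemble in which 4 members predict the event gives probability 4/16.
quarter = ensemble_probability([1, 1, 1, 1] + [0] * 12)

# Hypothetical forecast probabilities and outcomes (1 = event occurred).
probs = [0.0, 0.0, 0.0, quarter, quarter, quarter, quarter, 1.0, 1.0, 1.0]
outcomes = [0, 0, 0, 0, 1, 0, 0, 1, 1, 1]
table = reliability_table(probs, outcomes)
for p, (observed, count) in table.items():
    print(f"forecast {p:.4f}: observed {observed:.2f} from {count} forecasts")
```

For perfectly reliable forecasts the observed proportion equals the forecast probability in every group, so the plotted points lie on the diagonal.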
21 Weather/climate forecasts vs medical diagnostic tests Quite different approaches in the two literatures –Weather/climate. Lots of measures used. Literature on properties, but often ignored. Inference (tests, confidence intervals, power) seldom considered –Medical (Pepe). Far fewer measures. Little discussion of properties. More inference: confidence intervals, complex models for ROC curves
22 Multi-category forecasts These are forecasts of the form –Temperature or rainfall ‘above’, ‘below’ or ‘near’ average (a common format for seasonal forecasts) –‘Very High Risk’, ‘High Risk’, ‘Moderate Risk’, ‘Low Risk’ of excess mortality (PHEWE) Different verification measures are relevant depending on whether categories are ordered (as here) or unordered
23 Multi-category forecasts II As with binary forecasts there are many possible verification measures With K categories one class of measures assigns scores to each cell in the (K x K) table of forecast/outcome combinations Then multiply the proportion of observations in each cell by its score, and sum over cells to get an overall score By insisting on certain desirable properties (equitability, symmetry etc) the number of possible measures is narrowed
24 Gerrity (and LEPS) scores for 3 ordered category forecasts with equal probabilities Two possibilities are Gerrity scores or LEPS (Linear Error in Probability Space). In the example, LEPS rewards correct extreme forecasts more, and penalises badly wrong forecasts more, than Gerrity (divide Gerrity (LEPS) by 24 (36) to give the same scaling – an expected maximum value of 1). Scores are shown as Gerrity (LEPS):
  30 (48)     -6 (-6)    -24 (-42)
  -6 (-6)     12 (12)     -6 (-6)
 -24 (-42)    -6 (-6)     30 (48)
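The scaling factors and the equitability of both score tables can be checked with a short calculation: the overall score is the sum over cells of score times proportion of cases. In this sketch rows are taken as forecast category and columns as observed category, which is an assumption (both tables are symmetric, so the choice does not affect the results), and the helper names are hypothetical.

```python
from fractions import Fraction

GERRITY = [[30, -6, -24], [-6, 12, -6], [-24, -6, 30]]
LEPS = [[48, -6, -42], [-6, 12, -6], [-42, -6, 48]]
THIRD = Fraction(1, 3)


def overall_score(scores, proportions):
    """Sum over the (3 x 3) cells of score times proportion of cases in the cell."""
    return sum(scores[i][j] * proportions[i][j] for i in range(3) for j in range(3))


def perfect_forecasts():
    # Each category observed a third of the time and always forecast correctly.
    return [[THIRD if i == j else Fraction(0) for j in range(3)] for i in range(3)]


def constant_forecasts(category):
    # Always forecast the same category; observations equally likely in each category.
    return [[THIRD if i == category else Fraction(0) for j in range(3)] for i in range(3)]


for name, scores in (("Gerrity", GERRITY), ("LEPS", LEPS)):
    best = overall_score(scores, perfect_forecasts())
    unskilful = [overall_score(scores, constant_forecasts(k)) for k in range(3)]
    print(name, "perfect:", best, "constant forecasts:", unskilful)
```

Perfect forecasts score 24 under Gerrity and 36 under LEPS, matching the divisors quoted on the slide, and every constant forecast scores zero under both.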
25 Verification of continuous variables Suppose we make forecasts $f_1, f_2, \ldots, f_n$; the corresponding observed data are $x_1, x_2, \ldots, x_n$. We might assess the forecasts by computing
–$\frac{1}{n}\sum_{i=1}^{n} |f_i - x_i|$ (mean absolute error)
–$\frac{1}{n}\sum_{i=1}^{n} (f_i - x_i)^2$ (mean square error), or its square root
–Some form of correlation between the f’s and x’s
Both MSE and correlation can be highly influenced by a few extreme forecasts/observations. No time here to explore other possibilities.
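A minimal sketch of these three summaries, using the Pearson coefficient as one possible ‘form of correlation’ (the slide does not say which is intended). The forecast and observation values are invented for illustration.

```python
import math


def mean_absolute_error(f, x):
    return sum(abs(fi - xi) for fi, xi in zip(f, x)) / len(f)


def mean_square_error(f, x):
    return sum((fi - xi) ** 2 for fi, xi in zip(f, x)) / len(f)


def pearson_correlation(f, x):
    """Pearson correlation between forecasts and observations."""
    n = len(f)
    mean_f, mean_x = sum(f) / n, sum(x) / n
    cov = sum((fi - mean_f) * (xi - mean_x) for fi, xi in zip(f, x))
    var_f = sum((fi - mean_f) ** 2 for fi in f)
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    return cov / math.sqrt(var_f * var_x)


# Hypothetical forecasts and observations.
forecasts = [10, 12, 14, 16]
observations = [11, 12, 13, 18]
mae = mean_absolute_error(forecasts, observations)
mse = mean_square_error(forecasts, observations)
rmse = math.sqrt(mse)
corr = pearson_correlation(forecasts, observations)
print(mae, mse, rmse, corr)
```

Because MSE squares each error, the single error of 2 contributes 4 of the total 6 squared units here, which hints at how a few extreme cases can dominate the measure.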
26 Skill or value? Our examples have looked at assessing skill. Often we really want to assess value. This needs quantification of the loss/cost of incorrect forecasts in terms of their ‘incorrectness’.
27 Value of Tornado Forecasts If wrong forecasts of any sort cost $1K, then the forecasting system costs $95K, but the naive system costs only $51K. If a false alarm costs $1K, but a missed tornado costs $10K, then the system costs $302K, but naivety costs $510K. If a false alarm costs $1K, but a missed tornado costs $1 million, then the system costs $23.07 million, with naivety costing $51 million.
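The arithmetic behind these figures can be sketched as follows, using the Finley counts: the forecasting system made 72 false alarms and missed 23 tornadoes, while the naive ‘No Tornado’ system makes no false alarms and misses all 51. Costs are in thousands of dollars, and the function name `total_cost` is hypothetical.

```python
def total_cost(false_alarms, misses, cost_false_alarm, cost_miss):
    """Total cost of wrong forecasts, in the same units as the two unit costs."""
    return false_alarms * cost_false_alarm + misses * cost_miss


SYSTEM = {"false_alarms": 72, "misses": 23}
NAIVE = {"false_alarms": 0, "misses": 51}  # always forecasts 'No Tornado'

# (cost of a false alarm, cost of a missed tornado), both in $K.
scenarios = [(1, 1), (1, 10), (1, 1000)]
costs = [
    (total_cost(**SYSTEM, cost_false_alarm=fa, cost_miss=miss),
     total_cost(**NAIVE, cost_false_alarm=fa, cost_miss=miss))
    for fa, miss in scenarios
]
for (fa, miss), (system, naive) in zip(scenarios, costs):
    print(f"false alarm ${fa}K, miss ${miss}K: system ${system}K, naive ${naive}K")
```

The ranking of the two systems flips once a miss is costed more heavily than a false alarm, which is the point of distinguishing value from skill.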
28 Concluding remarks Forecasts should be verified. Forecasts are multi-faceted; verification should reflect this. Interpretation of verification results needs careful thought. Much more could be said, for example, on inference, wordy forecasts, continuous forecasts, probability forecasts, ROC curves, value, spatial forecasts etc.
30 Continuous variables – LEPS scores Also for MSE a difference between forecast and observed of, say, 2 °C is treated the same way, whether it is –a difference between 1 °C above and 1 °C below the long-term mean or –a difference between 3 °C above and 5 °C above the long-term mean. It can be argued that the second forecast is better than the first because the forecast and observed are closer with respect to the probability distribution of temperature.
31 LEPS scores II LEPS (Linear Error in Probability Space) are scores that measure distances with respect to position in a probability distribution. They start from the idea of using $|P_f - P_v|$, where $P_f$, $P_v$ are positions in the cumulative probability distribution of the measured variable for the forecast and observed values, respectively. This has the effect of down-weighting differences between extreme forecasts and outcomes, e.g. a forecast/outcome pair 3 & 4 standard deviations above the mean is deemed ‘closer’ than a pair 1 & 2 SDs above the mean. Hence it gives greater credit to ‘good’ forecasts of extremes.
32 LEPS scores III The basic measure is normalized and adjusted to ensure –the score is doubly equitable –no ‘bending back’ –a simple value for unskilful and for perfect forecasts. We end up with $3(1 - |P_f - P_v| + P_f^2 - P_f + P_v^2 - P_v) - 1$
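The final expression can be explored numerically. The sketch below evaluates it for the 3 & 4 SD and 1 & 2 SD pairs mentioned earlier, assuming a standard normal distribution for the variable (an assumption made only for this illustration), and checks that a constant forecast averages to roughly zero when the observed position is spread uniformly over [0, 1]. The function names are hypothetical.

```python
from statistics import NormalDist


def leps_score(p_f, p_v):
    """LEPS score for cumulative probabilities of the forecast (p_f) and observed (p_v) values."""
    return 3 * (1 - abs(p_f - p_v) + p_f**2 - p_f + p_v**2 - p_v) - 1


def mean_score_for_constant_forecast(p_f, steps=1000):
    """Average score when p_v is spread evenly over [0, 1] (midpoint rule)."""
    return sum(leps_score(p_f, (i + 0.5) / steps) for i in range(steps)) / steps


cdf = NormalDist().cdf  # standard normal, assumed for illustration
extreme_pair = leps_score(cdf(3), cdf(4))
moderate_pair = leps_score(cdf(1), cdf(2))
print(extreme_pair, moderate_pair)
print(mean_score_for_constant_forecast(0.3))
```

A correct forecast at the median (both positions 0.5) scores 0.5, whereas a correct forecast at either extreme scores 2, so the score gives more credit for good forecasts of extremes.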
33 LEPS scores IV LEPS scores can be used on both continuous and categorical data. A skill score, taking values between –100 and 100 (or –1 and 1) for a set of forecasts, can be constructed based on the LEPS score, but it is not doubly equitable. Cross-validation (successively leaving out one of the data points and basing the prediction for that point on a rule derived from all the other data points) can be used to reduce the optimistic bias which exists when the same data are used to construct and to evaluate a rule. It has been used in some applications of LEPS, but is relevant more widely.
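Leave-one-out cross-validation can be sketched with a simple forecasting rule. Here the rule is a least-squares straight line and the score is mean square error; both are stand-ins chosen for illustration, not the rule or score used in the LEPS applications, and the data are invented.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def in_sample_mse(xs, ys):
    """MSE when the same data are used to construct and to evaluate the rule."""
    slope, intercept = fit_line(xs, ys)
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def leave_one_out_mse(xs, ys):
    """MSE when each point is predicted from a rule fitted to all the other points."""
    squared_errors = []
    for i in range(len(xs)):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        squared_errors.append((ys[i] - (slope * xs[i] + intercept)) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Hypothetical predictor and observed values.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.2, 1.9, 3.4, 3.8, 5.3, 5.9]
optimistic = in_sample_mse(xs, ys)
cross_validated = leave_one_out_mse(xs, ys)
print(optimistic, cross_validated)
```

The cross-validated error is larger than the in-sample error, which is the optimistic bias the slide refers to.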