1 Verification Continued… Holly C. Hartmann Department of Hydrology and Water Resources University of Arizona RFC Verification Workshop,


1 Verification Continued… Holly C. Hartmann Department of Hydrology and Water Resources University of Arizona hollyoregon@juno.com RFC Verification Workshop, 08/14/2007

2 Agenda
1. Introduction to Verification
   - Applications, Rationale, Basic Concepts
   - Data Visualization and Exploration
   - Deterministic Scalar measures
2. Categorical measures – KEVIN WERNER
   - Deterministic Forecasts
   - Ensemble Forecasts
3. Diagnostic Verification
   - Reliability
   - Discrimination
   - Conditioning/Structuring Analyses
4. Lab Session/Group Exercise
   - Developing Verification Strategies
   - Connecting to Forecast Operations and Users

3 Probabilistic Ensemble Forecasts From: California-Nevada River Forecast Center

4 Probabilistic Ensemble Forecasts From: California-Nevada River Forecast Center

5 Probabilistic Ensemble Forecasts From: A. Hamlet, University of Washington

6

7 Probabilistic Ensemble Forecasts From: A. Hamlet, University of Washington

8 Identifies systematic flaws of an ensemble prediction system. Shows effectiveness of ensemble distribution in sampling the observations. Does not indicate that the ensemble will be of practical use. Talagrand Diagram – Also Called Ranked Histogram

9 Talagrand Diagram – Also Called Ranked Histogram: Principle Behind Talagrand Diagram
- With only one ensemble member, all observations (100%) will fall “outside” the ensemble.
- With two ensemble members, two out of three observations (2/3 = 67%) should fall outside.
- With three ensemble members, two out of four observations (2/4 = 50%) should fall outside.
- For any number of ensemble members N, a fraction 2/(N + 1) of the observations should fall outside the ensemble.
Identifies systematic flaws of an ensemble prediction system. Shows effectiveness of ensemble distribution in sampling the observations. Does not indicate that the ensemble will be of practical use.
Adapted from A. Persson, 2006
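
The percentages on this slide follow from a counting argument: N ranked members define N + 1 bins, and if the observation is statistically indistinguishable from the members it is equally likely to land in any bin, so the two outer bins together hold 2/(N + 1) of the observations. A minimal sketch of that arithmetic, assuming no ties; the function name is hypothetical and not from the slides:

```python
from fractions import Fraction


def expected_outside_fraction(n_members: int) -> Fraction:
    """Expected share of observations falling outside the ensemble range.

    Assumes a statistically consistent ensemble: each of the
    n_members + 1 rank bins is equally likely, and two of them
    (below the lowest member, above the highest) are "outside".
    """
    if n_members < 1:
        raise ValueError("need at least one ensemble member")
    return Fraction(2, n_members + 1)


for n in (1, 2, 3, 25):
    print(n, expected_outside_fraction(n))
```

For the 25-trace ensembles used later in the deck, the same rule gives 2/26, so roughly 8% of observations would be expected outside the ensemble range.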

10 Talagrand Diagram Computation Example
Four sample ensemble members (E1 – E4) for daily flow forecasts (produced from reforecasts using carryover each year), with the corresponding observation (OBS):

YEAR   E1    E2    E3    E4    OBS
1981    42    74    82    90   112
1982    65   143   223   227   206
1983    82   192   295   300   301
1984   211   397   514   544   516
1985   142   291   349   356   348
1986   114   277   351   356    98
1987    98   170   204   205   156
1988    69   169   229   236   245
1989    94   219   267   270   233
1990    59   175   244   250   248
1991   108   189   227   228   227
1992    94   135   156   158   167

Step 1: Rank lowest to highest for each year. Four members result in 5 bins.
Step 2: Determine which bin the corresponding observation falls into.
Step 3: Tally how many observations fall in each bin.
Step 4: Plot frequency of observations for each ranked bin.

11 Talagrand Diagram Computation Example
Steps 2 and 3 completed:

YEAR   OBS   Bin #
1981   112   5
1982   206   3
1983   301   5
1984   516   4
1985   348   3
1986    98   1
1987   156   2
1988   245   5
1989   233   3
1990   248   4
1991   227   4   (OBS ties E3 = 227; the tally places it in Bin 4)
1992   167   5

Bin #   1   2   3   4   5
Tally   1   1   3   3   4

12 Talagrand Diagram Computation Example
Four sample ensemble members (E1 – E4) ranked lowest to highest for daily flow (produced from reforecasts using carryover in each year).
Figure: frequency of observations in each ranked bin (Bin1 to Bin5), plotted from the tally on slide 11.
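
Steps 1 to 3 of the computation example can be sketched in a few lines of Python. This is an illustration, not the workshop's own code: the names `ENSEMBLE`, `OBS`, and `rank_bin` are hypothetical, and the tie rule (an observation equal to a member is placed in the bin above that member) is an assumption chosen because it reproduces the tally shown on the slides.

```python
from bisect import bisect_right
from collections import Counter

# Ensemble members (E1-E4) and observations from the slide table.
ENSEMBLE = {
    1981: [42, 74, 82, 90],     1982: [65, 143, 223, 227],
    1983: [82, 192, 295, 300],  1984: [211, 397, 514, 544],
    1985: [142, 291, 349, 356], 1986: [114, 277, 351, 356],
    1987: [98, 170, 204, 205],  1988: [69, 169, 229, 236],
    1989: [94, 219, 267, 270],  1990: [59, 175, 244, 250],
    1991: [108, 189, 227, 228], 1992: [94, 135, 156, 158],
}
OBS = {1981: 112, 1982: 206, 1983: 301, 1984: 516, 1985: 348, 1986: 98,
       1987: 156, 1988: 245, 1989: 233, 1990: 248, 1991: 227, 1992: 167}


def rank_bin(members, obs):
    """Return the 1-based bin of obs among the ranked members.

    Bin 1 is below the lowest member; bin len(members) + 1 is above the
    highest. Ties go to the upper bin (assumed tie rule).
    """
    return bisect_right(sorted(members), obs) + 1


bins = {year: rank_bin(ENSEMBLE[year], OBS[year]) for year in ENSEMBLE}
tally = Counter(bins.values())
counts = [tally.get(b, 0) for b in range(1, 6)]  # bins 1..5

print(bins)
print(counts)
```

Step 4 is then a bar chart of `counts` against bin number.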

13 Talagrand Diagram: 25 traces/ensemble, 375 observations
- Example: “U-Shaped”. Observations too often falling outside ensemble. Indicates ensemble spread too small.
- Example: “L-Shaped”. Observations too often larger (smaller) than ensemble. Indicates under- (over-) forecasting bias.
- Example: “N-Shaped” (dome-shaped). Observations too rarely falling outside ensemble. Indicates ensemble spread is too big.
- Example: “Flat-Shaped”. Observations falling uniformly across ensemble. Indicates appropriately sized ensemble distribution.

14 Talagrand Diagram Example: Interpretation?
Four sample ensemble members (E1 – E4) ranked lowest to highest for daily flow (produced from reforecasts using carryover in each year). Ensemble members and observations for 1981–1992 are the same as on slide 10.

Bin #   1   2   3   4   5
Tally   1   1   3   3   4

Figure: frequency of observations in each ranked bin (Bin1 to Bin5). Interpretation: ???

15 Distributions-oriented Forecast Evaluation leads to Diagnostic Verification
It’s all about conditional and marginal distributions!
- P(O|F): Reliability
- P(F|O): Discrimination
- P(F): Sharpness
- P(O): Uncertainty
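
The four distributions on this slide are tied together by the two standard factorizations of the joint distribution of forecasts and observations. The notation below (lowercase f and o for a forecast and an observation) is an assumption for illustration; the slide itself uses F and O.

```latex
p(f, o) \;=\; \underbrace{p(o \mid f)}_{\text{reliability}}\,\underbrace{p(f)}_{\text{sharpness}}
        \;=\; \underbrace{p(f \mid o)}_{\text{discrimination}}\,\underbrace{p(o)}_{\text{uncertainty}}
```

The first factorization conditions on the forecast and underlies the reliability diagram; the second conditions on the observation and underlies the discrimination plots.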

16 Forecast Reliability -- P(O|F)
For a specified forecast condition, what does the distribution of observations look like?
User perspective: “When you say 20% chance of flood flows, how often do flood flows actually happen?”
User perspective: “When you say 80% chance of flood flows, how often do flood flows actually happen?”
Figure: two reliability plots, one per user question. x-axis: forecasted probability (0 to 1); y-axis: relative frequency of observed (0 to 1).

17 Reliability (Attributes) Diagram – Reliability, Sharpness
- Good reliability – close to diagonal
- Sharpness diagram (p(f)) – histogram of forecasts in each probability bin shows the marginal distribution of forecasts
The reliability diagram is conditioned on the forecasts. That is, given that X was predicted, what was the outcome?

18 Reliability Diagram Example Computation
Ensemble members (E1 – E4) and observations (OBS) for 1981–1992 are the same as on slide 10.
Step 1: Choose threshold value to base probability forecasts on. For simplicity we’ll choose the mean forecast over all years and all ensemble members (= 208).

19 Reliability Diagram Example Computation
Step 2: Choose how many forecast probability categories to use (5 here: 0, 0.25, 0.5, 0.75, 1).
Step 3: For each forecast, calculate the forecast probability below the threshold value, P(forecast peak < 208).

20 Reliability Diagram Example Computation
Steps 2 and 3 completed:

YEAR   OBS   P(forecast peak < 208)
1981   112   1.0
1982   206   0.5
1983   301   0.5
1984   516   0.0
1985   348   0.25
1986    98   0.25
1987   156   1.0
1988   245   0.5
1989   233   0.25
1990   248   0.5
1991   227   0.5
1992   167   1.0
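
Steps 1 and 3 can be sketched as follows: the threshold is the mean of all 48 ensemble members, and each year's forecast probability is the fraction of its four members below that threshold. The names `ENSEMBLE` and `prob_below` are hypothetical, and rounding the mean to a whole number is an assumption made to match the slide's value of 208.

```python
# Ensemble members (E1-E4) from the slide table.
ENSEMBLE = {
    1981: [42, 74, 82, 90],     1982: [65, 143, 223, 227],
    1983: [82, 192, 295, 300],  1984: [211, 397, 514, 544],
    1985: [142, 291, 349, 356], 1986: [114, 277, 351, 356],
    1987: [98, 170, 204, 205],  1988: [69, 169, 229, 236],
    1989: [94, 219, 267, 270],  1990: [59, 175, 244, 250],
    1991: [108, 189, 227, 228], 1992: [94, 135, 156, 158],
}

# Step 1: threshold = mean over all years and members (9970 / 48 = 207.7).
all_members = [m for members in ENSEMBLE.values() for m in members]
threshold = round(sum(all_members) / len(all_members))


def prob_below(members, limit):
    """Fraction of ensemble members strictly below the limit."""
    return sum(m < limit for m in members) / len(members)


# Step 3: with four members, the result falls in {0, 0.25, 0.5, 0.75, 1}.
forecast_prob = {year: prob_below(members, threshold)
                 for year, members in ENSEMBLE.items()}

print(threshold)
print(forecast_prob)
```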

21 Reliability Diagram Example Computation
Step 4: Group the observations into groups of equal forecast probability (or, more generally, into forecast probability categories).

22 Reliability Diagram Example Computation
Step 4 completed:
P(forecast peak < 208) = 0.0:    516
P(forecast peak < 208) = 0.25:   348, 98, 233
P(forecast peak < 208) = 0.5:    206, 301, 245, 248, 227
P(forecast peak < 208) = 0.75:   N/A
P(forecast peak < 208) = 1.0:    112, 156, 167

23 Reliability Diagram Example Computation
Step 5: For each group, calculate the frequency of observations below the threshold value, 208 cfs.

24 Reliability Diagram Example Computation
Step 5 completed:
P(obs peak < 208 given [P(forecast peak < 208) = 0.0])  = 0/1 = 0.0
P(obs peak < 208 given [P(forecast peak < 208) = 0.25]) = 1/3 = 0.33
P(obs peak < 208 given [P(forecast peak < 208) = 0.5])  = 1/5 = 0.2
P(obs peak < 208 given [P(forecast peak < 208) = 0.75]) = 0/0 = NA
P(obs peak < 208 given [P(forecast peak < 208) = 1.0])  = 3/3 = 1.0
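
Steps 4 and 5 amount to a group-by followed by a relative frequency. The sketch below assumes the per-year forecast probabilities implied by the groupings on the slides (listed in year order, 1981 to 1992); the variable names are hypothetical.

```python
from collections import defaultdict

THRESHOLD = 208  # cfs, from Step 1
CATEGORIES = (0.0, 0.25, 0.5, 0.75, 1.0)

# P(forecast peak < 208) and observed flow for 1981..1992, in year order.
FORECAST_PROB = [1.0, 0.5, 0.5, 0.0, 0.25, 0.25, 1.0, 0.5, 0.25, 0.5, 0.5, 1.0]
OBS = [112, 206, 301, 516, 348, 98, 156, 245, 233, 248, 227, 167]

# Step 4: group observations by forecast probability.
groups = defaultdict(list)
for prob, obs in zip(FORECAST_PROB, OBS):
    groups[prob].append(obs)

# Step 5: relative frequency of observations below the threshold per group.
observed_freq = {}
for prob in CATEGORIES:
    members = groups.get(prob, [])
    if members:
        observed_freq[prob] = sum(o < THRESHOLD for o in members) / len(members)
    else:
        observed_freq[prob] = None  # empty category (0/0 on the slide)

for prob in CATEGORIES:
    print(prob, groups.get(prob, []), observed_freq[prob])
```

Each non-empty (forecast probability, observed frequency) pair is one point on the reliability diagram.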

25 Step 6: Plot centroid of the forecast category (just points in our case) on the x-axis against the observed frequency within each forecast category on the y-axis. Include the 45 degree diagonal for reference. Reliability Diagram Example Computation

26 Step 7: Include sharpness plot showing the number of observation/forecast pairs in each category. Reliability Diagram Example Computation
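
The sharpness plot in Step 7 is a histogram of how often each forecast probability category was issued. A minimal sketch, assuming the same per-year forecast probabilities as in the worked example (year order, 1981 to 1992); the names are hypothetical.

```python
from collections import Counter

CATEGORIES = (0.0, 0.25, 0.5, 0.75, 1.0)
FORECAST_PROB = [1.0, 0.5, 0.5, 0.0, 0.25, 0.25, 1.0, 0.5, 0.25, 0.5, 0.5, 1.0]

# Number of observation/forecast pairs in each probability category.
issued = Counter(FORECAST_PROB)
sharpness = {prob: issued.get(prob, 0) for prob in CATEGORIES}

# The same counts as relative frequencies, i.e. an estimate of p(f).
relative = {prob: n / len(FORECAST_PROB) for prob, n in sharpness.items()}

print(sharpness)
print(relative)
```

With only 12 pairs, several categories hold three or fewer cases, which is worth keeping in mind when reading the reliability points for this example.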

27 Reliability Diagram – Reliability, Sharpness – P(O|F)
- Good reliability – close to diagonal
- Sharpness diagram (p(f)) – histogram of forecasts in each probability bin shows marginal distribution of forecasts
- Good resolution – wide range of frequency of observations corresponding to forecast probabilities
- Skill – related to Brier Skill Score, in reference to sample climatology (not historical climatology)
The reliability diagram is conditioned on the forecasts. That is, given that X was predicted, what was the outcome?

28 Attributes Diagram – Reliability, Resolution, Skill/No-skill
- No-resolution line: overall relative frequency of observations (sample climatology).
- No-skill line: halfway between perfect-reliability line and no-resolution line, with sample climatology as a reference.
- Points closer to perfect-reliability line than to no-resolution line: subsamples of probabilistic forecast contribute positively to overall skill (as defined by BSS) in reference to sample climatology.
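
The link between the attributes diagram and the Brier Skill Score (BSS) can be made concrete with the standard decomposition of the Brier score into reliability, resolution, and uncertainty terms, where BSS relative to sample climatology equals (resolution - reliability) / uncertainty. The sketch below applies it to the worked example from the earlier slides (event: observed peak < 208); this is an illustrative calculation under that assumption, not a result reported in the presentation.

```python
from collections import defaultdict

THRESHOLD = 208
FORECAST_PROB = [1.0, 0.5, 0.5, 0.0, 0.25, 0.25, 1.0, 0.5, 0.25, 0.5, 0.5, 1.0]
OBS = [112, 206, 301, 516, 348, 98, 156, 245, 233, 248, 227, 167]
EVENT = [int(o < THRESHOLD) for o in OBS]  # 1 if the event occurred

n = len(EVENT)
base_rate = sum(EVENT) / n  # sample climatology (no-resolution line)

groups = defaultdict(list)
for prob, event in zip(FORECAST_PROB, EVENT):
    groups[prob].append(event)

brier = sum((p - e) ** 2 for p, e in zip(FORECAST_PROB, EVENT)) / n
# Reliability: squared distance of each point from the diagonal.
reliability = sum(len(g) * (p - sum(g) / len(g)) ** 2
                  for p, g in groups.items()) / n
# Resolution: squared distance of each point from the no-resolution line.
resolution = sum(len(g) * (sum(g) / len(g) - base_rate) ** 2
                 for g in groups.values()) / n
uncertainty = base_rate * (1 - base_rate)
bss = (resolution - reliability) / uncertainty

print(round(brier, 4), round(reliability, 4), round(resolution, 4),
      round(uncertainty, 4), round(bss, 4))
```

A category adds to the skill score when its resolution term exceeds its reliability term, which is the geometric statement that its point lies closer to the diagonal than to the no-resolution line.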

29 Interpretation of Reliability Diagrams
Figure panels: Climatology; Minimal RESolution; Underforecasting; Good RES, at expense of REL; Reliable forecasts of rare event; Small sample size.
Source: Wilks (1995)

30 Interpretation of Reliability Diagrams
Reliability P[O|F]: Does the frequency of occurrence match your probability statement? Identifies conditional bias.
Figure: reliability diagram with a “No resolution” reference line. x-axis: forecasted probability; y-axis: relative frequency of observations.

31 EVS Reliability Diagram Examples
Arkansas-Red Basin, 24-hr flows, lead time 1-14 days
- 25th Percentile Observed Flows (low flows): sharp forecasts, but low resolution.
- 85th Percentile Observed Flows (high flows): good reliability at shorter lead times; long leads miss high events.
From: J. Brown, EVS Manual

32 Historical seasonal water supply outlooks Colorado River Basin Morrill, Hartmann, and Bales, 2007

33 Reliability: Colorado Basin ESP Seasonal Supply Outlooks
Figure: reliability diagrams by forecast issue date (Jan 1, Mar 1, Apr 1, Jun 1). x-axis: forecast probability; y-axis: relative frequency of observations. Flow categories: high 30%, mid 40%, low 30%.
Panels: LC JM (5 mo. lead), LC MM (3 mo. lead), LC AM (2 mo. lead), UC JJy (7 mo. lead), UC AJy (4 mo. lead), UC JnJy (2 mo. lead).
1) Few high-probability forecasts; good reliability between 10-70% probability; reliability improves.
2) These months show best reliability; low resolution limiting reliability.
3) Reliability decreases for later forecasts as resolution increases; UC good at extremes.
Franz, Hartmann, and Sorooshian, 2003

34 Discrimination – P(F|O)
For a specified observation category, what do the forecast distributions look like?
“When dry conditions happen… What do the forecasts usually look like?”
You sure hope that forecasts look different when there’s a drought, compared to when there’s a flood!

35 Discrimination – P(F|O)
You sure hope that forecasts look different when there’s a drought, compared to when there’s a flood!
Example: NWS CPC Seasonal climate outlooks, sorted into DRY cases (lowest tercile), 1995-2001, all forecasts, all lead-times.
Figure: two panels, one labeled “Good discrimination!” and one labeled “Not much discrimination!”. Each shows the probability of dry and the probability of wet. x-axis: forecasted probability (0.00 to 1.00, climatology marked at 0.33); y-axis: relative frequency of indicated forecast.

36 Discrimination: Lower Colorado ESP Supply Outlooks
When unusually low flows happened… P(F|Low flows). Low < 30th percentile.
Jan 1 forecast (Jan-May): There is some discrimination… Early forecasts warned “High flows less likely”.
Figure: relative frequency of forecasts vs. forecast probability for the High, Mid-, and Low flow categories.
Franz, Hartmann, and Sorooshian (2003)

37 Discrimination: Lower Colorado ESP Supply Outlooks
When unusually low flows happened… P(F|Low flows). Low < 30th percentile.
Jan 1 forecast (Jan-May): There is some discrimination… Early forecasts warned “High flows less likely”.
Apr 1 forecast (Apr-May): Good discrimination… Forecasts were saying: 1) high and mid-flows less likely; 2) low flows more likely.
Figure: relative frequency of forecasts vs. forecast probability for the High, Mid-, and Low flow categories.
Franz, Hartmann, and Sorooshian (2003)

38 Discrimination: Colorado Basin ESP Supply Outlooks
For observed flows in lowest 30% of historic distribution.
Figure: relative frequency of forecasts vs. forecast probability. Flow categories: high 30%, mid 40%, low 30%.
Panels: Lower Colorado Basin, Jan 1 forecast of Jan-May (5 mo. lead) and Apr 1 forecast of April-May (2 mo. lead); Upper Colorado Basin, Jan 1 forecast of Jan-July (7 mo. lead) and Jun 1 forecast of June-July (2 mo. lead).
1) High flows less likely.
2) No discrimination between mid and low flows.
3) Both UC and LC show good discrimination for low flows at 2-month lead time.
Franz, Hartmann, and Sorooshian (2003)

39 Historical seasonal water supply outlooks Colorado River Basin

40 Discrimination: CDF Perspective
The all-observation CDF is plotted and color coded by tercile. Forecast ensemble members are sorted into 3 groups according to which tercile their associated observation falls into. The CDF for each group is plotted in the appropriate color (e.g., high is blue).
Credit: K. Werner

41 In this case, there is relatively good discrimination since the three conditional forecast CDFs separate themselves. Discrimination Credit: K. Werner

42 Discrimination Example Computation
Ensemble members (E1 – E4) and observations (OBS) for 1981–1992 are the same as on slide 10.
Step 1: Order observations and divide ordered list into categories. Here we will use terciles (OBS ≤ 167, 206 ≤ OBS ≤ 245, OBS ≥ 248).

YEAR   OBS   OBS Tercile
1981   112   Low
1982   206   Middle
1983   301   High
1984   516   High
1985   348   High
1986    98   Low
1987   156   Low
1988   245   Middle
1989   233   Middle
1990   248   High
1991   227   Middle
1992   167   Low

Credit: K. Werner

43 Discrimination Example Computation
Step 2: Group forecast ensemble members according to OBS tercile.
Low OBS Forecasts: 42, 74, 82, 90, 114, 277, 351, 356, 98, 170, 204, 205, 94, 135, 156, 158
Credit: K. Werner

45 Discrimination Example Computation
Step 2: Group forecast ensemble members according to OBS tercile.
Mid OBS Forecasts: 65, 143, 223, 227, 69, 169, 229, 236, 94, 219, 267, 270, 108, 189, 227, 228
Credit: K. Werner

47 Discrimination Example Computation
Step 2: Group forecast ensemble members according to OBS tercile.
Hi OBS Forecasts: 82, 192, 295, 300, 211, 397, 514, 544, 142, 291, 349, 356, 59, 175, 244, 250
Credit: K. Werner
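
Steps 1 and 2 of the discrimination example can be sketched as a rank-based split of the observations into three equal groups, followed by pooling each year's ensemble members under that year's observed tercile. The names below are hypothetical, and splitting by rank into groups of four is an assumption that happens to match the tercile boundaries on the slide.

```python
# Ensemble members (E1-E4) and observations from the slide table.
ENSEMBLE = {
    1981: [42, 74, 82, 90],     1982: [65, 143, 223, 227],
    1983: [82, 192, 295, 300],  1984: [211, 397, 514, 544],
    1985: [142, 291, 349, 356], 1986: [114, 277, 351, 356],
    1987: [98, 170, 204, 205],  1988: [69, 169, 229, 236],
    1989: [94, 219, 267, 270],  1990: [59, 175, 244, 250],
    1991: [108, 189, 227, 228], 1992: [94, 135, 156, 158],
}
OBS = {1981: 112, 1982: 206, 1983: 301, 1984: 516, 1985: 348, 1986: 98,
       1987: 156, 1988: 245, 1989: 233, 1990: 248, 1991: 227, 1992: 167}
LABELS = ("Low", "Middle", "High")

# Step 1: order years by observed flow and cut into three equal groups.
ranked_years = sorted(OBS, key=OBS.get)
group_size = len(ranked_years) // 3
tercile = {year: LABELS[min(i // group_size, 2)]
           for i, year in enumerate(ranked_years)}

# Step 2: pool ensemble members by the tercile of their observation.
pooled = {label: [] for label in LABELS}
for year, members in ENSEMBLE.items():
    pooled[tercile[year]].extend(members)

for label in LABELS:
    print(label, pooled[label])
```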

48 Discrimination Example Computation
Step 3: Plot all-observation CDF color coded by tercile (OBS ≤ 167, 206 ≤ OBS ≤ 245, OBS ≥ 248).
OBS (1981–1992): 112, 206, 301, 516, 348, 98, 156, 245, 233, 248, 227, 167
OBS Tercile (1981–1992): Low, Middle, High, High, High, Low, Low, Middle, Middle, High, Middle, Low
Credit: K. Werner

49 Discrimination Example Computation
Step 4: Add the CDFs of the forecasts, conditioned on observed tercile, to the plot.
Low OBS Forecasts: 42, 74, 82, 90, 114, 277, 351, 356, 98, 170, 204, 205, 94, 135, 156, 158
Mid OBS Forecasts: 65, 143, 223, 227, 69, 169, 229, 236, 94, 219, 267, 270, 108, 189, 227, 228
Hi OBS Forecasts: 82, 192, 295, 300, 211, 397, 514, 544, 142, 291, 349, 356, 59, 175, 244, 250
Credit: K. Werner

50 Step 5: Discrimination is shown by the degree to which the conditional forecast CDFs are separated from each other. In this case, high forecasts discriminate better than mid and low forecasts. Discrimination Example Computation Credit: K. Werner
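
The conditional forecast CDFs compared in Step 5 are empirical CDFs of the three pooled member lists. A minimal sketch, using the lists from the slides; the helper name `ecdf` and the choice of evaluation flows are assumptions for illustration.

```python
from bisect import bisect_right

# Forecast members pooled by observed tercile (from the slides).
POOLED = {
    "Low": [42, 74, 82, 90, 114, 277, 351, 356,
            98, 170, 204, 205, 94, 135, 156, 158],
    "Middle": [65, 143, 223, 227, 69, 169, 229, 236,
               94, 219, 267, 270, 108, 189, 227, 228],
    "High": [82, 192, 295, 300, 211, 397, 514, 544,
             142, 291, 349, 356, 59, 175, 244, 250],
}


def ecdf(sample):
    """Return a function giving the fraction of the sample <= x."""
    ordered = sorted(sample)
    size = len(ordered)

    def cdf(x):
        return bisect_right(ordered, x) / size

    return cdf


cdfs = {label: ecdf(values) for label, values in POOLED.items()}

# Compare the three conditional CDFs at a few flow values.
for flow in (100, 200, 300, 400):
    print(flow, {label: cdf(flow) for label, cdf in cdfs.items()})
```

The wider the vertical gaps between the three curves at a given flow, the more the forecasts differ across observed categories.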

51 How well do April – July volume forecasts discriminate when they are made in Jan, Mar, and May? Poor discrimination in Jan between forecasting high and medium flows. Best discrimination in May. Discrimination Credit: K. Werner

52 Discrimination
Another way to look at discrimination is to use PDFs in lieu of CDFs. The more separation between the PDFs, the better the discrimination.
Credit: K. Werner

53 Comparing Deterministic & Probabilistic Forecasts
Deterministic forecasts: traditional in hydrology, sub-optimal for decision making.
Common perspective: “Deterministic model simulations and probabilistic forecasts … are two entirely different types of products. Direct comparison of probabilistic forecasts with deterministic single valued forecasts is extremely difficult” - Anonymous

54 How can we compare deterministic and probabilistic forecasts?
Option: Use ensemble median with standard metrics – No!
Figure: deterministic and probabilistic forecast examples. Source: XEFS Design Team, 2007

55 “Pretend Determinism”
The ensemble mean minimizes error, but doesn’t represent the overall behavior.
From: A. Hamlet, University of Washington

56 What’s wrong with using ‘deterministic’ metrics?
Metrics using only central tendency of each forecast pdf fail to distinguish between forecasts 1-3, but will identify 4 as inferior.
Metrics that reward accuracy but punish spread will rank the forecast skill from 1 to 4.
Figure: four forecast PDFs (labeled 1 to 4) plotted against the observed value. x-axis: value; y-axis: PDF.
From: A. Hamlet, University of Washington
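
The contrast on this slide can be illustrated with a small numeric sketch. The slide does not name a specific metric; the continuous ranked probability score (CRPS) in its standard ensemble form is used here only as one example of a score that rewards accuracy and penalizes unnecessary spread. The three ensembles and the observation are made-up values, not data from the presentation.

```python
def mean_abs_error(members, obs):
    """Absolute error of the ensemble mean (central tendency only)."""
    return abs(sum(members) / len(members) - obs)


def ensemble_crps(members, obs):
    """Ensemble CRPS: mean |x_i - obs| minus half the mean |x_i - x_j|."""
    size = len(members)
    accuracy = sum(abs(m - obs) for m in members) / size
    spread = sum(abs(a - b) for a in members for b in members) / (2 * size**2)
    return accuracy - spread


OBS_VALUE = 100  # hypothetical observation
FORECASTS = {    # hypothetical ensembles
    "narrow, centered": [95, 100, 105],
    "wide, centered": [50, 100, 150],
    "narrow, biased": [145, 150, 155],
}

for name, members in FORECASTS.items():
    print(name,
          round(mean_abs_error(members, OBS_VALUE), 2),
          round(ensemble_crps(members, OBS_VALUE), 2))
```

The ensemble-mean error is identical for the two centered forecasts, while the CRPS separates all three, with lower values being better.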

57 How can we compare deterministic and probabilistic forecasts?
Option: Use ensemble median with standard metrics – No!
Figure: deterministic and probabilistic forecast examples. Source: XEFS Design Team, 2007

58 Deterministic vs. Probabilistic Forecasts
Figure: PDFs plotted against flow, Q, showing the climatology distribution, the forecast distribution, the tercile boundaries (equal probability), the deterministic forecast, and the observation.
Jack-knife calibration error = PDF of error distribution; can determine any quantiles.
Approach used by Morrill, Hartmann, Bales 2007

59 Lab Session -- Group Exercise
Choose a set of forecasts.
Develop strategies for verifying these forecasts from two perspectives:
- Users
- Forecasters during operations
Report back to group.
Repeat for second set of forecasts, if time permits.
