Verification of forecasts from the SWFDP – E Africa

Richard (Rick) Jones
SWFDP Training Workshop on Severe Weather Forecasting
Bujumbura, Burundi, Nov 2013

Verification lawrence.wilson@ec.gc.ca
WMO sponsored Joint Working Group on Forecast Verification Research JWGFVR

Why?
“You can't know where you're going until you know where you've been” (proverb), or George Santayana: “Those who are unaware of history are destined to repeat it.”
Quality management: “Plan, Do, Check, Act” (Deming)
How to verify: “Begin with the end in mind” (Covey)
Training
Product differentiation

Verification as a measure of Forecast quality
To monitor forecast quality
To improve forecast quality
To compare the quality of different forecast systems

Introduction
“Verification activity has value only if the information generated leads to a decision about the forecast or system being verified” (A. Murphy)
“User-oriented verification”: verification methods designed with the needs of a specific user in mind.
“Users” are those who are interested in verification results and who will take action based on them.
Forecasters and modelers are users too.

There has recently been a renewed focus on “user-orientation” in verification. With the SWFDP and other recent initiatives, there has been increasing effort to put out products from global NWP models for countries outside the primary mandate of the center which developed the product. These products are practically never verified, and the center involved may not even consider it important to verify them. It is also safe to assume that models are not tuned for regions outside the mandate area of the model developer, and may not perform well for other locations. It is known, for example, that model performance in the tropics is poorer than in mid-latitudes. Verification has therefore become more essential.

The users of the products must verify wherever possible, and should also put pressure on those who deliver the products to demonstrate their quality by verifying them. Users should work in partnership with the producing centers to exchange data, or to pass on the necessary data to ensure a meaningful verification, and should also participate in this verification as much as possible.

SWFDP Goals PROGRESS AGAINST SWFDP GOALS
To improve the ability of NMSs to forecast severe weather events
To improve the lead time of alerting these events
To improve the interaction of NMSs with Disaster Management and Civil Protection authorities before, during and after severe weather events
To identify gaps and areas for improvements
To improve the skill of products from Global Centres through feedback from NMSs

EVALUATION OF WEATHER WARNINGS
Feedback from the public
Feedback from the DMCPA, to include comments on the timeliness and usefulness of the warnings
Feedback from the media
Warning verification by the NMCs

Goals of Verification Administrative
Justify cost of provision of weather services
Justify additional or new equipment
Monitor the quality of forecasts and track changes
Usually means summarizing the verification into a few numbers (scoring)
Impact: $ and injuries

Scientific verification efforts generally put more demands on the verification system: they look at the forecast quality in more detail. The verification questions to be answered will usually be more focused on specific aspects of the product being verified. For example: “How well does the ECMWF model forecast extreme precipitation events?” Scientific verification often means using diagnostic tools which break down the verification dataset into subsets, for example by separating out extreme events, for assessment of specific qualities of the forecast.

For the SWFDP it is fair to say that verification needs to be done for both main purposes, administrative and scientific. At the administrative level, the need is to demonstrate the impact of the project in terms of improved operational forecasting services. At the scientific level, the main need is to establish the level of accuracy of severe weather forecasts and to determine the accuracy of the guidance products for African countries. Ideally, one would also demonstrate improvements in forecast quality, though this requires some measure of forecast quality from before the project started.

Goals of Verification Scientific
To identify the strengths and weaknesses of a forecast product in sufficient detail that actions can be specified that will lead to improvements in the product, i.e. to provide information to direct R&D.
Demands more detail in verification methodology: “diagnostic verification”
SWFDP: both administrative goals and scientific goals

Forecast “goodness” What makes a forecast good?
QUALITY: How well it corresponds with the actual weather, as revealed by observations. (Verification)
VALUE: The increase or decrease in economic or other value to a user, attributable to their use of the forecast. (Satisfaction)
  Requires information from the user to assess, in addition to verification
  Can be assessed by methods of decision theory (cost-loss etc.)

The general concept of forecast goodness was first presented by Allan Murphy, and extended by others. It comprises not only the quality of forecasts, which is measured by verification techniques, but also the value of the forecast to its users. These two components of forecast goodness are related but not necessarily dependent. It is possible for a perfect forecast to be of no use in an economic sense, and it is also possible that an imperfect forecast would nevertheless be of some value.

In addition to evaluating the forecast, it is often necessary to evaluate the forecast system as a whole, including the capability to deliver the forecast to its users in a timely fashion. A high-quality forecast is useless if it arrives late. One should also consider relevance: is the forecast being effectively communicated to the users, and can they understand and use it? Finally, the forecast delivery system should be robust, in the sense that forecasts are delivered to users in a timely fashion all the time, or at least nearly all the time.

Principles of (Objective) Verification
Verification activity has value only if the information generated leads to a decision about the forecast or system being verified
  User of the information must be identified
  Purpose of the verification must be known in advance
No single verification measure provides complete information about the quality of a forecast product.

One cannot design an effective verification system unless the user of the verification information is determined in advance. One must also know as precisely as possible why the verification is being done. It is useful to actually state the verification question that is to be answered, for example: “To determine whether the NMC forecasts are more accurate than the RSMC forecasts of extreme precipitation.”

Verification is multi-faceted, that is, one usually needs more than one verification measure to assess a forecast adequately. Stated another way, most verification scores are incomplete, and can give misleading information if used alone.

To be verified objectively, a forecast must be stated in a fully objective way. Subjective terms such as “chance of” must be defined objectively, and the forecast statement must be complete in terms of valid location and valid time (period).

Subjective verification can be an important component of a verification effort, but must be approached with caution. To be credible, it should be done by someone completely independent from the forecast production. It should be used only as a last resort, for example, when there is a complete lack of objective verification.

The contingency table

                      Observations
                      Yes             No
Forecasts    Yes      Hits            False alarms
             No       Misses          Correct negatives

This is the standard format for the contingency table for two forecast and observed categories.

Preparation of the event table
Day Fcst to occur? Observed? 1 Yes 2 No 3 4 5 6 7 8 9
Start with matched forecasts and observations
Forecast event is precipitation >50 mm / 24 h
  Next day
  Threshold: medium risk
Count up the number of each of hits, false alarms, misses and correct negatives over the whole sample
Enter them into the corresponding 4 boxes of the table.
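The counting step can be sketched in a few lines of Python. This is only an illustration: the function name `build_contingency_table` and the nine-day sample are invented here, and the sample values are not the workshop data.

```python
from collections import Counter

def build_contingency_table(pairs):
    """Tally (forecast_yes, observed_yes) pairs into the four table boxes."""
    table = Counter(hits=0, false_alarms=0, misses=0, correct_negatives=0)
    for forecast_yes, observed_yes in pairs:
        if forecast_yes and observed_yes:
            table["hits"] += 1
        elif forecast_yes:
            table["false_alarms"] += 1
        elif observed_yes:
            table["misses"] += 1
        else:
            table["correct_negatives"] += 1
    return dict(table)

# Invented nine-day sample: (event forecast?, event observed?) for each day.
sample_pairs = [
    (True, True), (False, False), (True, False),
    (False, True), (False, False), (True, True),
    (False, False), (False, False), (True, False),
]
sample_table = build_contingency_table(sample_pairs)
print(sample_table)
```

Each day must be matched before counting: a forecast with no corresponding observation, or the reverse, cannot be placed in any of the four boxes.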

Exercise: Mozambique contingency table. Review on Saturday 16 Nov

Outline Introduction: Purposes and Principles of verification
Some relevant verification measures: contingency table and scores
Verification of products from the SWFDP
Verification of probability forecasts
Exercise results and interpretation (Saturday)

Forecast “goodness” Evaluation of forecast system Forecast goodness
Evaluation of delivery system
  timeliness (are forecasts issued in time to be useful?)
  relevance (are forecasts delivered to intended users in a form they can understand and use?)
  robustness (level of errors or failures in the delivery of forecasts)

Principles of (Objective) Verification
Forecast must be stated in such a way that it can be verified
What about subjective verification?
  With care, it is OK. If subjective, it should not be done by anyone directly connected with the forecast.
  Sometimes necessary due to lack of objective information

Verification Procedure
Start with dataset of matched observations and forecasts
  Data preparation is the major part of the effort of verification
Establish purpose
  Scientific vs. administrative
  Pose question to be answered, for specific user or set of users
Stratification of dataset
  On basis of user requirements (seasonal, extremes etc.)
  Take care to maintain sufficient sample size

Data preparation is the largest task in all verification activity, consuming up to 80% or so of the required resources. For that reason, it is worthwhile to plan the data preparation carefully, so that the datasets can be used for more than one verification purpose.

The verification purpose must be known in advance, and it is best if the person(s) doing the verification are aware of the purpose. Once the purpose is clearly articulated, it is easier to know whether to stratify the data. When that is done, one must be careful to maintain sufficient sample size in each data subset. “Sufficient sample size” varies with the parameter: in general, the more temporal and spatial variation there is in the sample, the larger the sample needs to be for effective verification. Thus precipitation amounts, which are highly variable in space and time, will require larger samples before meaningful verification results can be obtained.

A method called “bootstrapping” is increasingly being used to establish confidence limits on verification results. It is important to at least obtain the result and report the sample size on which it is based. Confidence limits can then be computed later by someone who has sufficient computer resources and access to the matched verification dataset.

Most of the predicted variables available in the SWFDP are best verified as categorical variables, with thresholds established as used in the project. Probabilistic information (e.g. SAWS EPS, EPS output from global centers, risk tables) can be verified as probabilistic forecasts.
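A rough sketch of the bootstrapping idea, assuming a simple percentile bootstrap (one common variant; the notes do not say which is intended). The function names, the number of resamples and the 95% level are choices made here for illustration. The sample is rebuilt from the medium-risk Madagascar table shown later in this presentation (31 hits, 18 false alarms, 19 misses, 143 correct negatives).

```python
import random

def probability_of_detection(pairs):
    """PoD from (forecast_yes, observed_yes) pairs; None if no events observed."""
    hits = sum(1 for f, o in pairs if f and o)
    misses = sum(1 for f, o in pairs if not f and o)
    return hits / (hits + misses) if hits + misses else None

def bootstrap_interval(pairs, score, n_resamples=2000, alpha=0.05, seed=1):
    """Percentile bootstrap: resample the matched pairs with replacement,
    recompute the score each time, and read off the outer percentiles."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_resamples):
        resample = [rng.choice(pairs) for _ in pairs]
        value = score(resample)
        if value is not None:  # skip resamples where the score is undefined
            values.append(value)
    values.sort()
    lower = values[int((alpha / 2) * len(values))]
    upper = values[int((1 - alpha / 2) * len(values)) - 1]
    return lower, upper

# Matched sample rebuilt from the medium-risk Madagascar table.
matched = ([(True, True)] * 31 + [(True, False)] * 18
           + [(False, True)] * 19 + [(False, False)] * 143)

estimate = probability_of_detection(matched)
lower, upper = bootstrap_interval(matched, probability_of_detection)
print(f"PoD = {estimate:.2f}, approximate 95% interval {lower:.2f} to {upper:.2f}, n = {len(matched)}")
```

Resampling whole forecast/observation pairs keeps each forecast attached to its own observation. Note that this simple version treats days as independent, which may be optimistic for weather data.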

Verification Procedure
Nature of variable being verified
  Continuous: Forecasts of specific value at specified time and place
  Categorical: Forecast of an “event”, defined by a range of values, for a specific time period, and place or area
  Probabilistic: Same as categorical, but uncertainty is estimated
SWFDP: Predicted variables are categorical: extreme events, where extreme is defined by thresholds of precipitation and wind. Some probabilistic forecasts are available too.

What is the Event?
For categorical and probabilistic forecasts, one must be clear about the “event” being forecast
  Location or area for which forecast is valid
  Time range over which it is valid
  Definition of category
Example?

Matching of forecasts and observations is often tricky since they may be in different forms. The decisions about how the event is defined should be made in advance of the forecast. Not only will this help the verification, but it will also help to communicate the forecast to users in a more meaningful way.

What is the Event?
And now, what is defined as a correct forecast? A “hit”
  The event is forecast, and is observed: anywhere in the area? Over some percentage of the area?
  Scaling considerations
Discussion

Events for the SWFDP
Best if “events” are defined for similar time period and similar-sized areas
  One day: 24 h
  Fixed areas; should correspond to forecast areas and have at least one reporting station.
  The smaller the areas, the more useful the forecast, potentially, BUT…
    Predictability lower for smaller areas
    More likely to get missed event/false alarm pairs

Events for the SWFDP
Correct negatives a problem
Data density a problem
  Best to avoid verification where there is no data.
  Non-occurrence – no observation problem

Contingency tables
PoD = hits / (hits + misses)                  range: 0 to 1, best score = 1
FAR = false alarms / (hits + false alarms)    range: 0 to 1, best score = 0
Characteristics:
PoD = “Prefigurance” or “probability of detection”, “hit rate”
  Sensitive only to missed events, not false alarms
  Can always be increased by overforecasting rare events
FAR = “False alarm ratio”
  Sensitive only to false alarms, not missed events
  Can always be improved by underforecasting rare events

The PoD and the FAR are examples of incomplete verification scores that should not be used alone. They can be used together, and compared for two forecasts, to understand the characteristics of each forecast.
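As a small worked example, the two scores can be computed in Python from the medium-risk Madagascar table that appears later in this presentation (31 hits, 18 false alarms, 19 misses). The function names are chosen here for illustration. The last lines show why neither score should be used alone: forecasting the event on all 211 days gives a perfect PoD but a very poor FAR.

```python
def probability_of_detection(hits, misses):
    """Fraction of observed events that were forecast."""
    return hits / (hits + misses)

def false_alarm_ratio(hits, false_alarms):
    """Fraction of 'yes' forecasts that were not followed by the event."""
    return false_alarms / (hits + false_alarms)

# Medium-risk threshold, Madagascar, 211 cases.
pod_medium = probability_of_detection(hits=31, misses=19)
far_medium = false_alarm_ratio(hits=31, false_alarms=18)
print(f"PoD = {pod_medium:.2f}, FAR = {far_medium:.2f}")

# Forecasting 'yes' every day: all 50 events are hit, all 161 non-events are false alarms.
pod_always_yes = probability_of_detection(hits=50, misses=0)
far_always_yes = false_alarm_ratio(hits=50, false_alarms=161)
print(f"Always yes: PoD = {pod_always_yes:.2f}, FAR = {far_always_yes:.2f}")
```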

Contingency tables
PAG = hits / (hits + false alarms)                range: 0 to 1, best score = 1
Bias = (hits + false alarms) / (hits + misses)    best score = 1
Characteristics:
PAG = “Post agreement”
  PAG = (1 - FAR), and has the same characteristics
Bias: this is the frequency bias; it indicates whether the forecast distribution is similar to the observed distribution of the categories (Reliability)

The frequency bias tells you whether you are forecasting the event too often, too infrequently or just about right. Since it requires no direct matching of specific forecasts with observations, it is not a verification per se, but a diagnostic tool. It is not to be confused with the (linear) bias, which is the average error of a continuous forecast variable.
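A short Python sketch of these two quantities, again using the medium-risk Madagascar counts from later in this presentation (31 hits, 18 false alarms, 19 misses); the function names are illustrative only.

```python
def post_agreement(hits, false_alarms):
    """Fraction of 'yes' forecasts that verified; equals 1 - FAR."""
    return hits / (hits + false_alarms)

def frequency_bias(hits, false_alarms, misses):
    """Number of 'yes' forecasts divided by number of observed events."""
    return (hits + false_alarms) / (hits + misses)

pag_medium = post_agreement(hits=31, false_alarms=18)
bias_medium = frequency_bias(hits=31, false_alarms=18, misses=19)
print(f"PAG = {pag_medium:.2f}, frequency bias = {bias_medium:.2f}")
```

A bias just below 1, as here, means the event was forecast slightly less often than it was observed. It says nothing about whether the forecasts fell on the right days.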

Critical Success Index (CSI) = hits / (hits + misses + false alarms)    range: 0 to 1, best score = 1
Characteristics:
  Better known as the Threat Score
  Sensitive to both false alarms and missed events; a more balanced measure than either PoD or FAR

This is a frequently used score, especially in the US. Its advantage is that it takes into account both missed events and false alarms, but it is of less use diagnostically because the two are not separated.
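For illustration, the threat score for the medium-risk Madagascar table shown later in this presentation can be computed as below. The second call is a made-up table with the same total number of errors, to show that the score does not distinguish misses from false alarms.

```python
def threat_score(hits, false_alarms, misses):
    """Hits divided by all cases where the event was forecast or observed."""
    return hits / (hits + false_alarms + misses)

ts_medium = threat_score(hits=31, false_alarms=18, misses=19)

# Hypothetical table with the errors swapped around: same score, different behaviour.
ts_swapped = threat_score(hits=31, false_alarms=37, misses=0)
print(f"Threat score = {ts_medium:.3f} (medium risk), {ts_swapped:.3f} (hypothetical)")
```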

Contingency tables
range: negative value to 1, best score = 1
Characteristics:
  A skill score against chance (as shown)
  Easy to show positive values
  Better to use climatology or persistence: needs another table

This is the most frequently used skill score for contingency tables. A score of 0 means that your forecast is only as good as if you simply guessed the forecast category each time. Positive scores mean that you are doing better than just guessing, and negative scores worse than guessing. Don’t worry, it isn’t hard to show positive values with this score.

Skill scores calculated with respect to persistence or climatology are better to use, but harder to compute because they need a second contingency table. A persistence forecast is defined as forecasting no change from the previous observation, and a climatology forecast is a constant forecast of the most likely event (usually the non-occurrence of severe weather).
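The formula on this slide did not survive in the text. The sketch below assumes the score meant is the Heidke skill score, which is the usual contingency-table skill score against chance; that identification is an assumption, not something stated here. The counts are the medium-risk Madagascar table from later in this presentation.

```python
def heidke_skill_score(hits, false_alarms, misses, correct_negatives):
    """Skill relative to the number of correct forecasts expected by chance."""
    n = hits + false_alarms + misses + correct_negatives
    correct = hits + correct_negatives
    # Correct forecasts expected if forecasts and observations were unrelated
    # but kept their own 'yes' and 'no' frequencies.
    expected = ((hits + false_alarms) * (hits + misses)
                + (misses + correct_negatives) * (false_alarms + correct_negatives)) / n
    return (correct - expected) / (n - expected)

hss_medium = heidke_skill_score(hits=31, false_alarms=18, misses=19, correct_negatives=143)
print(f"Skill against chance = {hss_medium:.3f}")
```

To score against persistence or climatology instead, the `expected` term would be replaced by the number of correct forecasts in the second contingency table built for that reference forecast.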

Hit rate (HR) = hits / (hits + misses)                                      range: 0 to 1, best score = 1
False alarm rate (FA) = false alarms / (false alarms + correct negatives)   range: 0 to 1, best score = 0
Characteristics:
  Hit Rate (HR) is the same as the PoD and has the same characteristics
  False alarm RATE. This is different from the false alarm ratio.
  These two are used together in the Hanssen-Kuipers score, and in the ROC, and are best used in comparison.

These two quantities are computed from the observed occurrences only (HR) and the observed non-occurrences only (FA). The difference of the two, HR-FA, is the Hanssen-Kuipers score. This score is useful for evaluating a forecast strategy: if you choose to forecast the occurrence of the event more often, normally the false alarm rate will also increase. It is important to make sure that the increase in the HR is larger than the increase in the FA, or the forecasts will become less useful.
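A minimal Python sketch of the Hanssen-Kuipers score, using the medium-risk Madagascar counts from later in this presentation; the function names are chosen here for illustration.

```python
def hit_rate(hits, misses):
    """Fraction of observed events that were forecast (same as PoD)."""
    return hits / (hits + misses)

def false_alarm_rate(false_alarms, correct_negatives):
    """Fraction of observed non-events for which the event was forecast."""
    return false_alarms / (false_alarms + correct_negatives)

def hanssen_kuipers(hits, false_alarms, misses, correct_negatives):
    """Hit rate minus false alarm rate."""
    return hit_rate(hits, misses) - false_alarm_rate(false_alarms, correct_negatives)

hk_medium = hanssen_kuipers(hits=31, false_alarms=18, misses=19, correct_negatives=143)
print(f"HR - FA = {hk_medium:.3f}")
```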

Extreme weather scores
Extreme Dependency Score (EDS)
Extreme Dependency Index (EDI)
Symmetric Extremal Dependency Score (SEDS)
Symmetric Extremal Dependency Index (SEDI)

Contingency tables Extreme dependency score characteristics:
EDS = 2 ln((hits + misses) / n) / ln(hits / n) - 1, where n is the total number of cases
range: -1 to 1, best score = 1
Characteristics:
  Score can be improved by incurring more false alarms
  Considered useful for extremes because it does not converge to 0 as the base rate (observed frequency of events) decreases
  A relatively new score, not yet widely used.

The EDS is a relatively new score, published for the first time only a couple of years ago. It has the advantage that it doesn’t go to 0 as the observed frequency of the event goes down to 0 (i.e. for rare events). But its disadvantages are also coming to light: one can get a better score by incurring more false alarms.
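A sketch of the EDS in Python, using its standard published form, EDS = 2 ln((hits + misses)/n) / ln(hits/n) - 1. It is applied to the medium-risk Madagascar table from later in this presentation, which is not a particularly rare event and is used only to have real counts to work with.

```python
import math

def extreme_dependency_score(hits, misses, n):
    """EDS from the hits, the misses and the total number of cases n.
    False alarms enter only through n, so they are not penalised directly."""
    base_rate = (hits + misses) / n
    return 2 * math.log(base_rate) / math.log(hits / n) - 1

eds_medium = extreme_dependency_score(hits=31, misses=19, n=211)
print(f"EDS = {eds_medium:.3f}")

# Hitting every event gives EDS = 1 regardless of how many false alarms that cost.
eds_all_hits = extreme_dependency_score(hits=50, misses=0, n=211)
```

The score is undefined when there are no hits, since ln(0) does not exist; a real implementation would need to handle that case.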

Verification of extreme, high-impact weather
EDS, EDI, SEDS, SEDI: novel categorical measures!
Standard scores tend to zero for rare events
Extremal Dependency Index (EDI)
Symmetric Extremal Dependency Index (SEDI)
Ferro & Stephenson, 2010: Improved verification measures for deterministic forecasts of rare, binary events. Wea. and Forecasting
Base rate independence: functions of H and F
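Since EDI and SEDI are functions of the hit rate H and the false alarm rate F only, they are easy to sketch in Python. The formulas below are the ones given in the Ferro and Stephenson paper cited above, as best recalled here; they should be checked against the paper before operational use. H and F are taken from the medium-risk Madagascar table later in this presentation.

```python
import math

def edi(h, f):
    """Extremal Dependency Index from hit rate h and false alarm rate f."""
    return (math.log(f) - math.log(h)) / (math.log(f) + math.log(h))

def sedi(h, f):
    """Symmetric Extremal Dependency Index from hit rate h and false alarm rate f."""
    numerator = math.log(f) - math.log(h) - math.log(1 - f) + math.log(1 - h)
    denominator = math.log(f) + math.log(h) + math.log(1 - f) + math.log(1 - h)
    return numerator / denominator

H = 31 / 50    # hit rate, medium-risk threshold
F = 18 / 161   # false alarm rate, medium-risk threshold
edi_medium = edi(H, F)
sedi_medium = sedi(H, F)
print(f"EDI = {edi_medium:.3f}, SEDI = {sedi_medium:.3f}")
```

Both functions require 0 < H < 1 and 0 < F < 1; tables with no misses or no false alarms need special handling.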

Weather Warning Index (Canada)

Weather warning index for the ith variable if

Example - Madagascar 211 Cases
Separate tables assuming low, medium, high risk as thresholds
Can plot the hit rate vs the false alarm RATE = FA / total obs no

Low          Obs yes   Obs no   Totals
Fcst yes        35        34       69
Fcst no         15       127      142
Totals          50       161      211

Med          Obs yes   Obs no   Totals
Fcst yes        31        18       49
Fcst no         19       143      162
Totals          50       161      211

High         Obs yes   Obs no   Totals
Fcst yes        13         4       17
Fcst no         37       157      194
Totals          50       161      211

In their second quarterly report, Madagascar added the 2nd quarter data (events list) to the first quarter results. These three tables are based on the first 2 quarters (Nov 2008 to June 2009). They have been created from the RSMC risk tables by assuming low, medium and high risk categories as thresholds in turn. They are thus a verification of the RSMC risk forecasts.
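The hit rate and false alarm rate for each threshold can be computed directly from the three tables above. A short Python sketch (the dictionary layout and names are just one convenient way to hold the counts):

```python
# (hits, false alarms, misses, correct negatives) for each risk threshold.
madagascar_tables = {
    "low": (35, 34, 15, 127),
    "medium": (31, 18, 19, 143),
    "high": (13, 4, 37, 157),
}

rates = {}
for threshold, (hits, false_alarms, misses, correct_negatives) in madagascar_tables.items():
    hit_rate = hits / (hits + misses)
    false_alarm_rate = false_alarms / (false_alarms + correct_negatives)
    rates[threshold] = (hit_rate, false_alarm_rate)
    print(f"{threshold:>6}: hit rate = {hit_rate:.2f}, false alarm rate = {false_alarm_rate:.2f}")
```

Moving from the high to the low threshold, the event is forecast more often, so both the hit rate and the false alarm rate rise.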

Example (contd)
This is a plot of the hit rate vs the false alarm rate for the Madagascar evaluation of the RSMC forecasts. It is called the “relative operating characteristic” (ROC). It is used mainly for evaluation of probabilistic forecasts, in this case the RSMC risk forecasts treated as probabilities. The diagonal line is where HR = FA, which means the forecasts are useless to a user for making decisions. If the points lie above the diagonal, this indicates positive skill in the form of “discrimination” (see below), which means that the forecast is useful in the decision-making process.
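One common way to summarise such a plot in a single number is the area under the curve, which the slide does not mention; it is added here only as an illustration. The sketch joins the three Madagascar points and the corners (0, 0) and (1, 1) with straight lines and applies the trapezoidal rule. An area of 0.5 corresponds to the diagonal.

```python
def roc_area(points):
    """Trapezoidal area under (false alarm rate, hit rate) points,
    with the end points (0, 0) and (1, 1) added."""
    curve = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# (false alarm rate, hit rate) for the high, medium and low risk thresholds.
madagascar_points = [(4 / 161, 13 / 50), (18 / 161, 31 / 50), (34 / 161, 35 / 50)]
area = roc_area(madagascar_points)
print(f"ROC area = {area:.3f}")
```

With only three thresholds the straight-line curve is coarse, so the area should be read as a rough indication only.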

Discrimination User-perspective:
Does the model or forecast tend to give higher values of precipitation when heavy precipitation occurs than when it doesn’t? (or temperature?)

Discrimination is one of many attributes of a forecast, and indicates whether or not the forecast can be used as a basis for decisions. There are several measures of discrimination available, of which the ROC and the Hanssen-Kuipers score are the most often used. The underlying idea is to separate the verification data into two subsets: all cases where the event occurred and all cases where it did not occur. If there is discriminant information in the forecast, then the forecast distributions for these two subsets will be separated. If they lie on top of each other, then the forecast is not discriminating and the user cannot use it for decisions.

The abscissa in the diagram is arbitrary: it is the predictive variable. For example, it could be temperature. The event could be defined as “occurrence of temperature above 0”. Then you would look at the graph and ask the question “When temperatures above 0 occurred, was the forecast more often above 0 than below?” Looking at the graph, one can see that this is true only for temperatures forecast well above 0; otherwise it is not so clear. So, if the user receives a forecast of +2, for example, they cannot be sure whether the observation will be above 0 or not.

Returning to the Madagascar case, the abscissa would be the forecast “risk”, and since there are only 3 values, the graph would be a histogram with the hit rate and false alarm rate for occurrences and non-occurrences respectively, plotted as a function of the three probabilities.

How do we verify this?
Spatial verification is currently a developing research topic. Several different techniques have been proposed and tested. Verification of spatially defined variables requires both spatially continuous observations and a spatial definition of the contingency table components.

Contingency Table for spatial data
[Figure: forecast and observed areas overlaid, with regions labelled False alarms, Hits and Misses]
Possible interpretation for spatially defined threat areas:
  Put a grid of equal-area boxes over the overlaid obs and fcsts
  Entries are just the number of boxes covered by the areas as shown.
  Correct negatives problematic, but could limit to total forecast domain
  Likely to result in overforecasting bias: different interpretation?
  Can be done only where spatially continuous obs and forecasts are available: hydro estimator?

Here is the spatial analogue of the four boxes of the contingency table. For spatial verification, one needs observation data at high spatial resolution, preferably much higher than that of the model or forecast being verified. For the SWFDP, one could use the hydroestimator data that is available. Since this data uses a model (UK model) in its creation, results must be treated with caution if the verification is with respect to forecasts using this same model. However, there are no alternative datasets at sufficiently high spatial resolution, so use of the hydroestimator data is suggested.

In practice, to proceed with this requires definition of a grid for tallying up hits, false alarms, misses and correct negatives. The last is the largest problem, but can be controlled by limiting the domain of evaluation to a specific area, say the total land area of the SWFDP countries. The grid boxes should be chosen so that they are at least as small as the smallest fixed region for which forecasts are issued in any country of the project. One can always evaluate the forecasts at lower resolution if desired, but the resolution of evaluation cannot be increased without reanalysing the data on a finer grid.
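The box-counting idea can be sketched in Python as below. The two small grids are invented for illustration (1 = the box lies inside the threat area, 0 = outside) and stand in for a forecast threat area and an observed area that have already been put on the same grid over a limited evaluation domain.

```python
def spatial_contingency(forecast_grid, observed_grid):
    """Count grid boxes in each contingency category for two 0/1 grids
    of identical shape covering the same evaluation domain."""
    counts = {"hits": 0, "false_alarms": 0, "misses": 0, "correct_negatives": 0}
    for forecast_row, observed_row in zip(forecast_grid, observed_grid):
        for forecast_box, observed_box in zip(forecast_row, observed_row):
            if forecast_box and observed_box:
                counts["hits"] += 1
            elif forecast_box:
                counts["false_alarms"] += 1
            elif observed_box:
                counts["misses"] += 1
            else:
                counts["correct_negatives"] += 1
    return counts

# Invented 3 x 4 domain: the observed area is displaced one box east of the forecast area.
forecast_grid = [[0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 0]]
observed_grid = [[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [0, 0, 0, 0]]
spatial_counts = spatial_contingency(forecast_grid, observed_grid)
print(spatial_counts)
```

The number of correct negatives depends entirely on how large the domain is drawn, which is why the domain should be fixed in advance.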

Verification of regional maps
SAWS, Stephanie Landman: Regional map is discretized into 0 and 1.
• All fields are rescaled to 0.25° resolution.
• SWFDP fields are created for both HE (hydroestimator) and TRMM domains.
• HE and TRMM fields are converted to dichotomous fields for both 25 and 50 mm/day threshold values.
• 25 mm/day is used together with 50 mm/day, since 25 mm/day at 0.25° resolution is considered extreme and falls within the 95th percentile value.
• Statistics are calculated per season as well as for the whole period.
• Daily verification is also done.

Summary – Verification of SWFDP products
Product                               Who should verify   General method
NMC severe weather warnings           NMC                 Contingency tables and scores
RSMC severe weather guidance charts   RSMC                Graphical contingency table
Global centre deterministic models    Global centres      Continuous scores (temperature); contingency tables (precip, wind)
Global EPS                            Global centres      Scores for ensemble pdfs; scores for probability forecasts with respect to relevant categories

A very general summary of the available SWFDP products, with thoughts on who should lead the verification and comments on the best methods to use.

Probability forecast verification – Reliability tables
Reliability: the level of agreement between the forecast probability and the observed frequency of an event
Usually displayed graphically
Measures the bias in a probability forecast: is there a tendency to overforecast or underforecast?
Cannot be evaluated on a single forecast.
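A reliability table can be built by grouping the forecasts by probability and comparing each group's forecast probability with how often the event was actually observed. The Python sketch below uses equal-width bins; the function name, the number of bins and the eight forecasts are all invented for illustration.

```python
def reliability_table(probabilities, outcomes, n_bins=5):
    """Return (mean forecast probability, observed frequency, count)
    for each non-empty equal-width probability bin."""
    bins = [[] for _ in range(n_bins)]
    for probability, occurred in zip(probabilities, outcomes):
        index = min(int(probability * n_bins), n_bins - 1)  # 1.0 goes in the top bin
        bins[index].append((probability, occurred))
    rows = []
    for members in bins:
        if members:
            mean_forecast = sum(p for p, _ in members) / len(members)
            observed_frequency = sum(o for _, o in members) / len(members)
            rows.append((mean_forecast, observed_frequency, len(members)))
    return rows

# Invented sample: forecast probabilities and whether the event occurred (1) or not (0).
sample_probabilities = [0.1, 0.1, 0.3, 0.3, 0.7, 0.7, 0.9, 0.9]
sample_outcomes = [0, 0, 0, 1, 1, 0, 1, 1]
rows = reliability_table(sample_probabilities, sample_outcomes)
for mean_forecast, observed_frequency, count in rows:
    print(f"forecast {mean_forecast:.1f}  observed {observed_frequency:.2f}  n = {count}")
```

Plotting observed frequency against forecast probability gives the reliability diagram; points below the diagonal indicate overforecasting. The count in each bin should always be reported, since bins with few cases are unreliable.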

Reliability

Reliability – Summer 08- Europe 114 h

Summary – NMS products Warnings issued by NMSs
Contingency tables as above, if enough data is gathered
Important for a warning to determine the lead time: must archive the issue time of the warning and the occurrence time of the event.
Data problems: verify the “reporting of the event”
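Once both times are archived, the lead time is simply their difference. A minimal Python sketch, with invented times used purely as an example:

```python
from datetime import datetime

def lead_time_hours(issue_time, onset_time):
    """Hours between warning issue and event onset.
    A negative value means the warning was issued after the event began."""
    return (onset_time - issue_time).total_seconds() / 3600

# Invented example: warning issued at 06:00, event reported from 18:30 the same day.
issued = datetime(2013, 11, 12, 6, 0)
onset = datetime(2013, 11, 12, 18, 30)
lead = lead_time_hours(issued, onset)
print(f"Lead time = {lead:.1f} h")
```

Both times should be stored in the same time zone (for example UTC), otherwise the difference will be wrong.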

Summary and discussion….
Keep the data!
Be clear about all forecasts!
Know why you are verifying and for whom!
Keep the verification simple but relevant!
Just do it!
Case studies: post-mortem

Resources
The EUMETCAL training site on verification – computer aided learning:
The website of the Joint Working Group on Forecast Verification Research:
WMO/TD 1083: Guidelines on performance assessment of Public Weather Systems

SWFDP verification Thank you
