
1 Probability Forecasting, Probability Evaluation, and Scoring Rules: Expanding the Toolbox
Robert L. Winkler, Duke University
Subjective Bayes Workshop – Warwick, December 2009

2 Outline of Presentation
- Probability Forecasting
- Why Probability Evaluation?
- Scoring Rules: Incentives and Evaluation
- Some Issues and Recent Developments:
  - Extended Families of Scoring Rules
  - Relative Evaluation
  - Taking Order into Account
  - Probability Assessment vs. Quantile Assessment
  - Calibration, Sharpness, etc. – What Matters?
  - Competition among Forecasters
- Concluding Thoughts

3 Probability Forecasting
- Formal representation of uncertainty
- Various sources of forecasts, all with some subjective element:
  - Subjective forecasts from "experts"
  - Probabilities from Bayesian analyses
  - Probabilities from other modeling

4 Why Probability Evaluation?
- Recognition of the importance of uncertainty
- Some increase in the use of probability forecasts
- Concern about levels of risk
- Importance of ex post evaluation:
  - Connection of forecasts with reality
  - Concern about the impact of "poor" forecasts
  - Improvement of future probability forecasts
  - Identification of better probability forecasters
  - Keeping forecasters honest (accountability!)

5 Scoring Rules
- As incentives:
  - Strictly proper scoring rules
  - Maximizing expected score → honest reporting, and an incentive for "better" forecasts
- As evaluation measures:
  - Overall measures of accuracy
  - Decompositions for specific characteristics
- Commonly used rules
- Some background, historical notes
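For reference, strict propriety can be stated as follows (the standard definition, not spelled out on the slide): a scoring rule S is strictly proper if reporting the forecaster's true distribution p uniquely maximizes expected score,

\[
\mathbb{E}_{\mathbf{p}}[S(\mathbf{p}, X)] \;>\; \mathbb{E}_{\mathbf{p}}[S(\mathbf{r}, X)]
\quad \text{for every report } \mathbf{r} \neq \mathbf{p},
\]

so honesty is the unique optimal strategy for an expected-score maximizer.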

6 Commonly-Used Rules
[Figure: plots of the score and the expected score (n = 2) for the quadratic, spherical, and logarithmic rules]
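The slide's formulas did not survive the transcript; for a forecast p = (p_1, ..., p_n) when event i occurs, the standard (positively oriented) forms of these three rules are

\[
S_Q(\mathbf{p}, i) = 2p_i - \sum_{j=1}^{n} p_j^2, \qquad
S_S(\mathbf{p}, i) = \frac{p_i}{\bigl(\sum_{j=1}^{n} p_j^2\bigr)^{1/2}}, \qquad
S_L(\mathbf{p}, i) = \ln p_i,
\]

possibly up to whatever affine rescaling the slide used.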

7 Some Issues & Recent Developments My own biased view! Based in part on recent work with Casey Lichtendahl, Victor Richmond Jose, Bob Nau, and others Influenced greatly by many years of work with Allan Murphy

8 Extended Families of Scoring Rules
- Power and pseudospherical families, each with a single parameter β (−∞ < β < ∞)
- The power family includes the quadratic rule (β = 2)
- The pseudospherical family includes the spherical rule (β = 2)
- Both families include the logarithmic rule (β → 1)
- Provide rich families of strictly proper rules

9 Power and Pseudospherical Families
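The formulas on this slide are not in the transcript; one common normalization of the two families (following the forms in Jose, Nau, and Winkler's work on generalized scoring rules, stated here as an assumption about what the slide showed) is

\[
S^{\mathrm{pow}}_{\beta}(\mathbf{p}, i)
  = \frac{\beta\, p_i^{\beta-1} - (\beta-1)\sum_j p_j^{\beta} - 1}{\beta(\beta-1)},
\qquad
S^{\mathrm{ps}}_{\beta}(\mathbf{p}, i)
  = \frac{1}{\beta-1}\Biggl[\Bigl(\frac{p_i}{(\sum_j p_j^{\beta})^{1/\beta}}\Bigr)^{\beta-1} - 1\Biggr].
\]

At β = 2 these are affine transformations of the quadratic and spherical rules, and letting β → 1 in either family yields ln p_i, the logarithmic rule.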

10 Relative Evaluation
- Standard scoring rules are symmetric:
  - The maximum expected score is smallest when the distribution is uniform, so forecasts are rewarded for improvements over a uniform distribution
  - Is the uniform distribution a suitable baseline?
- Evaluation relative to a non-uniform baseline:
  - Often makes more sense
  - Facilitates comparability of scores from different situations, as improvements over their respective baselines
  - Motivated the "skill score" (which, however, is not strictly proper)

11 Scoring Rules with Baseline Distributions
- Strictly proper asymmetric rules
- Generalized power & pseudospherical families:
  - Strictly proper
  - Score = 0 when forecast = baseline
  - Expected score > 0 when forecast ≠ baseline

12 Power and Pseudospherical Families with Baselines
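Again the formulas are missing from the transcript; a form consistent with the properties listed on slide 11 (an assumption, following Jose, Nau, and Winkler's generalized families with baseline distribution q) is

\[
S^{\mathrm{pow}}_{\beta,\mathbf{q}}(\mathbf{p}, i)
  = \frac{\beta (p_i/q_i)^{\beta-1} - (\beta-1)\sum_j q_j (p_j/q_j)^{\beta} - 1}{\beta(\beta-1)},
\qquad
S^{\mathrm{ps}}_{\beta,\mathbf{q}}(\mathbf{p}, i)
  = \frac{1}{\beta-1}\Biggl[\frac{(p_i/q_i)^{\beta-1}}{\bigl(\sum_j q_j (p_j/q_j)^{\beta}\bigr)^{(\beta-1)/\beta}} - 1\Biggr].
\]

Both families give a score of 0 when p = q, tend to ln(p_i/q_i) as β → 1, and recover the symmetric families (up to affine transformation) when q is uniform.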

13 Expected Scores (β = 2)
[Figure: four panels of expected-score plots over three states: the power score with baseline q = (0.2, 0.7, 0.1), the quadratic score, the spherical score, and the pseudospherical score with baseline q = (0.2, 0.7, 0.1)]

14 Taking Order Into Account
- What if the events of interest are ordered?
- Standard scoring rules ignore the order: for events other than the one that occurs, the probabilities matter but not "where they are"
- Do we want the scoring rule to reflect order? Then we want it to be "sensitive to distance": having more probability on events "close" to the event that happens than on events "far" from it results in a higher score

15 Scoring Rules that Reflect Order
- First such rule: the ranked probability score
  - Based on the quadratic scoring rule
  - Reflects order by using cumulative probabilities
- The same approach can be used with any basic scoring rule
- Can generate power and pseudospherical families that include baseline distributions and are sensitive to order

16 Sensitive to Distance Rules
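The transcript drops this slide's formulas; a sketch of the cumulative-probability construction described on slide 15 (the standard ranked probability score, whose orientation and scaling conventions vary) is: with P_k = p_1 + ... + p_k, apply the binary quadratic score to each of the n − 1 ordered events {outcome ≤ k},

\[
S_{\mathrm{RPS}}(\mathbf{p}, i)
  = -\sum_{k=1}^{n-1} \bigl(P_k - \mathbf{1}\{i \le k\}\bigr)^2 .
\]

Replacing the binary quadratic score with any strictly proper binary rule, including the baseline versions of the power and pseudospherical rules, produces the order-sensitive families mentioned on slide 15.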

17 Expected Scores (β = 2)
[Figure: four panels of expected-score plots over three states: the quadratic score, the ranked probability score, and the power score with baselines q = (1/3, 1/3, 1/3) and q = (0.7, 0.2, 0.1)]

18 Probabilities vs. Quantiles
- Often we have ordered events or values of a variable, as noted earlier
- We might have quantile forecasts, or a mix of probability forecasts and quantile forecasts
- Why not just use the previously discussed scoring rules to evaluate quantiles?
  - They provide improper incentives for quantiles
  - They can be gamed to achieve perfect calibration artificially

19 Scoring Rules for Quantiles
- Scores are based on the quantiles and on the realized value of the variable
- Rules based on linear loss functions
- Can be used for multiple quantiles
  - Special case: interval forecasts (two quantiles)
- Strictly proper for quantile assessment
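A sketch of the linear-loss construction (a standard form, not quoted from the slide): if v is the reported α-quantile and x the realized value, the negatively oriented score

\[
S_\alpha(v, x) = \bigl(\mathbf{1}\{x \le v\} - \alpha\bigr)(v - x)
\]

has its expectation minimized exactly when v is an α-quantile of the forecaster's true distribution. Summing over several values of α evaluates multiple quantiles at once; taking α = 0.05 and α = 0.95, say, scores a 90% interval forecast.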

20 Calibration & Sharpness How much should we care about calibration? Evaluation of probability forecasts often focuses almost exclusively on calibration Forecasters can game calibration measures A good Bayesian should try to recalibrate probabilities (viewing them as new information) A key issue: will the forecasts be taken at face value?

21 Calibration & Sharpness, cont. What about sharpness? Sharpness is a better indicator of how informative the forecasts are (or can be) Sharpness can often be improved with effort (more data-gathering, improved models) Limited exploration suggests that sharpness has greater impact on overall scores than calibration Ideal: Maximize sharpness subject to good calibration Pay more attention to sharpness measures

22 Competition Among Forecasters
- Forecasters can have goals other than maximizing expected score:
  - Utility could be nonlinear in the score:
    - Risk averse
    - A step function with the step at a "target score"
  - The goal could be to outscore other forecasters:
    - Brings in game-theoretic considerations
    - Results in forecasts that are more extreme than the forecaster's probabilities (see the sketch below)
  - The goal could be to be similar to other forecasters, or not to be near the bottom of the group in scores:
    - Leads to so-called "herding" behavior
- Tricky to model:
  - We don't know exactly what a forecaster's goals are
  - We can't recover the forecaster's probabilities from those that are reported
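A toy illustration of the extremizing effect (my sketch, not from the slides; the setup and numbers are made up): a forecaster whose true probability is 0.7 faces an honest opponent under the quadratic score, and an exaggerated report wins the head-to-head comparison about 70% of the time, even though it lowers the forecaster's own expected score.

    # Toy simulation (hypothetical setup): quadratic score S(r, x) = 1 - (x - r)^2.
    # Our true probability for the event is p = 0.7; an opponent reports 0.7 honestly.
    # We compare an honest report (0.7) with an exaggerated one (0.85) when the goal
    # is simply to outscore the opponent, not to maximize expected score.
    import random

    def quad_score(r, x):
        """Quadratic score for reported probability r given event indicator x."""
        return 1.0 - (x - r) ** 2

    random.seed(1)
    p, honest, extreme = 0.7, 0.7, 0.85
    trials, wins = 100_000, 0
    for _ in range(trials):
        x = 1 if random.random() < p else 0
        if quad_score(extreme, x) > quad_score(honest, x):
            wins += 1

    # The exaggerated report scores higher exactly when the event occurs (prob. 0.7),
    # so it beats the honest opponent in about 70% of trials, even though honest
    # reporting maximizes the forecaster's own expected quadratic score.
    print(f"exaggerated report wins {wins / trials:.1%} of head-to-head comparisons")

This is the sense in which winner-take-all goals push reported probabilities toward the extremes, so the reports can no longer be read as the forecaster's true probabilities.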

23 Concluding Thoughts/Recommendations
- Encourage greater use of probabilities
- Encourage consistent evaluation of probabilities
- Rich families of scoring rules:
  - Need a better understanding of the characteristics of different rules
  - The choice of a baseline distribution and sensitivity to distance matter more than the value of β
- Tailor rules to other probability-related forecasts, such as quantiles, for proper incentives
- Pay more attention to sharpness, less to calibration
- Be aware of possible other goals and their impact on reported probabilities
- Other issues in probability forecasting & evaluation

