Reliable Probability Forecasting – a Machine Learning Perspective David Lindsay Supervisors: Zhiyuan Luo, Alex Gammerman, Volodya Vovk.


Overview
- What is probability forecasting?
- Reliability and resolution criteria
- Experimental design
- Problems with traditional assessment methods: square loss, log loss and ROC curves
- Probability Calibration Graph (PCG)
- Traditional learners are unreliable yet accurate!
- Extension of the Venn Probability Machine (VPM)
- Which learners are reliable? Psychological and theoretical viewpoint

Probability Forecasting
Qualified predictions are important in many applications (especially medicine). Most machine learning algorithms make bare predictions; those that do make qualified predictions make no claims about how effective the qualifying measures are!

Probability Forecasting: Generalisation of Pattern Recognition
Goal of pattern recognition = find the best label for each new test object.
Example: Abdominal Pain dataset. Training set to learn from (object = patient details, label = diagnosis):
- Name: David, Sex: M, Height: 62 → Appendicitis
- Name: Daniil, Sex: M, Height: 64 → Dyspepsia
- Name: Mark, Sex: M, Height: 61 → Non-specific
- Name: Sian, Sex: F, Height: 58 → Dyspepsia
Test object (true label unknown or withheld from learner):
- Name: Wilma, Sex: F, Height: 56 → ?

Probability Forecasting: Generalisation of Pattern Recognition
Probability forecast = estimate the conditional probability of a label given an observed object.
The learner is given the training set and a test object (e.g. Name: Helen, Sex: F, Height: 56 → ?), and we want it to estimate probabilities for all possible class labels, e.g.:
- label 1 = 0.1
- label 2 = 0.7
- label 3 = 0.2
- etc…

Probability forecasting more formally…
X = object space, Y = label space, Z = X × Y = example space.
Our learner makes probability forecasts p̂(y | x) for all possible labels y ∈ Y.
Use the probability forecasts to predict the most likely label: ŷ = arg max over y ∈ Y of p̂(y | x).
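As a concrete sketch of the formalism (Python rather than the WEKA/Java code mentioned later; the label names and values are hypothetical), a probability forecast is just a distribution over Y, and the prediction is its argmax:

```python
# A probability forecast for one test object: an estimate of
# P(y | x) for every possible label y in Y (hypothetical values).
forecast = {"appendicitis": 0.1, "dyspepsia": 0.7, "non-specific": 0.2}

# Predict the most likely label: the argmax of the forecast.
predicted_label = max(forecast, key=forecast.get)
print(predicted_label)  # dyspepsia
```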

Back to the plan…
- What is probability forecasting?
- Reliability and resolution criteria
- Experimental design
- Problems with traditional assessment methods: square loss, log loss and ROC curves
- Probability Calibration Graph (PCG)
- Traditional learners are unreliable yet accurate!
- Extension of the Venn Probability Machine (VPM)
- Which learners are reliable? Psychological and theoretical viewpoint

Studies of Probability Forecasting
Probability forecasting has been a well-studied area since the 1970s, in:
- Psychology
- Statistics
- Meteorology
These studies assessed two criteria of probability forecasts:
- Reliability = the probability forecasts should not lie
- Resolution = the probability forecasts are practically useful

Reliability
a.k.a. being well calibrated: when an event is predicted with probability p, it should have approximately probability 1 − p of being incorrect. Considered an asymptotic property. Dawid (1985) proved that no deterministic learner can be reliable for all data – but it is still interesting to investigate. This property is often overlooked in practical studies!

Definition of Reliability
Informally: among all predictions made with forecast probability p, the empirical frequency of being correct should approach p as the number of predictions grows.

Resolution
Probability forecasts are practically useful, e.g. they can be used to rank the labels in order of likelihood! Closely related to classification accuracy – a common focus of machine learning. Separate from reliability, i.e. the two do not go hand in hand (Lindsay, 2004).

Back to the plan…
- What is probability forecasting?
- Reliability and resolution criteria
- Experimental design
- Problems with traditional assessment methods: square loss, log loss and ROC curves
- Probability Calibration Graph (PCG)
- Traditional learners are unreliable yet accurate!
- Extension of the Venn Probability Machine (VPM)
- Which learners are reliable? Psychological and theoretical viewpoint

Experimental design
Tested several learners on many datasets in the online setting:
- ZeroR (control)
- K-Nearest Neighbour
- Neural Network
- C4.5 Decision Tree
- Naïve Bayes
- Venn Probability Machine (VPM) meta-learner (see later…)

The Online Learning Setting
1. The learning machine makes a prediction for the new example (label withheld).
2. After the prediction, the true label is revealed and the example is added to the training data for the next trial.
3. Repeat the process for all examples.
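The protocol above can be sketched in a few lines (Python, not the actual WEKA extension used in the thesis; `MajorityLearner` is a made-up stand-in for any learner):

```python
from collections import Counter

class MajorityLearner:
    """Stand-in learner: predicts the most frequent label seen so far."""
    def predict(self, training, x):
        if not training:
            return None
        return Counter(y for _, y in training).most_common(1)[0][0]

def online_evaluation(learner, examples):
    """Online setting: predict with the label withheld, then reveal the
    label and add the example to the training data for the next trial."""
    training, errors = [], 0
    for x, y in examples:
        y_hat = learner.predict(training, x)  # label withheld here
        errors += (y_hat != y)
        training.append((x, y))               # label revealed, data updated
    return errors
```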

Lots of benchmark data
Tested on data available from the UCI Machine Learning repository:
- Abdominal Pain: 6387 examples, 135 features, 9 classes, noisy
- Diabetes: 768 examples, 8 features, 2 classes
- Heart-Statlog: 270 examples, 13 features, 2 classes
- Wisconsin Breast Cancer: 685 examples, 10 features, 2 classes
- American Votes: 435 examples, 16 features, 2 classes
- Lymphography: 148 examples, 18 features, 4 classes
- Credit Card Applications: 690 examples, 15 features, 2 classes
- Iris Flower: 150 examples, 4 features, 3 classes
- And many more…

Programs
Extended the WEKA data mining system (implemented in Java):
- Added the VPM meta-learner to the existing library of algorithms
- Allowed learners to be tested in the online setting
Created Matlab scripts to easily create plots (see later).

Results, papers and website
All results that I discuss today can be found in my 3 tech reports:
- The Probability Calibration Graph – a useful visualisation of the reliability of probability forecasts, Lindsay (2004), CLRC-TR
- Multi-class probability forecasting using the Venn Probability Machine – a comparison with traditional machine learning methods, Lindsay (2004), CLRC-TR
- Rapid implementation of Venn Probability Machines, Lindsay (2004), CLRC-TR
And on my web site.

Back to the plan…
- What is probability forecasting?
- Reliability and resolution criteria
- Experimental design
- Problems with traditional assessment methods: square loss, log loss and ROC curves
- Probability Calibration Graph (PCG)
- Traditional learners are unreliable yet accurate!
- Extension of the Venn Probability Machine (VPM)
- Which learners are reliable? Psychological and theoretical viewpoint

Loss Functions
Square loss and log loss are the two most common; there are many other possible loss functions. DeGroot and Fienberg (1982) showed that all loss functions measure a mixture of reliability and resolution. Log loss punishes more harshly: the learner is forced to spread its bets.
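A small sketch of the two losses for a single multi-class forecast (Python; the `eps` floor is my own addition to avoid log(0)):

```python
import math

def square_loss(forecast, true_label):
    """Square (Brier-style) loss: sum over labels of
    (forecast probability - 0/1 indicator of the true label)^2."""
    return sum((p - (y == true_label)) ** 2 for y, p in forecast.items())

def log_loss(forecast, true_label, eps=1e-15):
    """Log loss: -ln of the probability given to the true label.
    An overconfident wrong forecast is punished much more harshly."""
    return -math.log(max(forecast.get(true_label, 0.0), eps))
```

For example, the forecast {'a': 0.7, 'b': 0.2, 'c': 0.1} with true label 'a' gives square loss 0.14 and log loss ≈ 0.357; pushing the true label's forecast toward 0 sends the log loss to infinity while the square loss stays bounded, which is the "forced to spread its bets" effect.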

ROC Curves
Example: Naïve Bayes on the Abdominal Pain dataset.
1. The graph shows the trade-off between false and true positive predictions.
2. We want the curve to be as close to the upper-left corner as possible (away from the diagonal).
3. My results show that this graph tests resolution.
4. The area under the curve provides a measure of the quality of the probability forecasts.
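The area under the curve has a simple rank interpretation, sketched below for the binary case (Python; toy scores, not the thesis results): it is the probability that a randomly chosen positive example receives a higher forecast than a randomly chosen negative one.

```python
def roc_auc(scores, labels):
    """AUC as a rank statistic: fraction of (positive, negative) pairs
    where the positive example gets the higher score (ties count 1/2)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```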

Table comparing traditional scores (ranks in parentheses)

Algorithm         Error      Sqr Loss   Log Loss   ROC Area
VPM C4.5          40.7 (8)   0.54 (5)   0.8  (4)   0.76 (1)
Naïve Bayes       29.2 (2)   0.50 (4)   1.3  (7)   0.72 (5)
VPM Naïve Bayes   28.9 (1)   0.44 (1)   0.6  (1)   0.75 (2)
10-NN             33.4 (4)   1.0  (11)  2.6  (10)  0.54 (10)
20-NN             33.4 (4)   0.96 (10)  2.2  (9)   0.55 (9)
C4.5              39.6 (7)   0.67 (7)   3.3  (11)  0.57 (8)
Neural Net        30.5 (3)   0.45 (2)   0.72 (2)   0.75 (3)
30-NN             34.3 (5)   0.47 (3)   0.73 (3)   0.74 (4)
VPM 1-NN          41.6 (9)   0.58 (6)   0.9  (5)   0.61 (6)
1-NN              34.6 (6)   0.73 (8)   2.1  (8)   0.59 (7)
ZeroR             55.6 (10)  0.74 (9)   1.1  (6)   0.49 (11)

Problems with Traditional Assessment
Loss functions and ROC curves give more information than the error rate about the quality of probability forecasts. But…
- loss functions measure a mixture of resolution and reliability
- the ROC curve measures resolution
We don't have any method of solely assessing reliability, and no method of telling whether probability forecasts are over- or under-estimated.

Back to the plan…
- What is probability forecasting?
- Reliability and resolution criteria
- Experimental design
- Problems with traditional assessment methods: square loss, log loss and ROC curves
- Probability Calibration Graph (PCG)
- Traditional learners are unreliable yet accurate!
- Extension of the Venn Probability Machine (VPM)
- Which learners are reliable? Psychological and theoretical viewpoint

Inspiration for PCG (Meteorology)
Murphy & Winkler (1977): calibration data for precipitation forecasts. Reliable points lie close to the diagonal.

A PCG plot of ZeroR on Abdominal Pain
Axes: predicted probability (x) vs. empirical frequency of being correct (y); the diagonal is the line of calibration. The PCG coordinates lie close to the line of calibration, i.e. ZeroR is not accurate but it is reliable! The plot may not span the whole axis – ZeroR makes no predictions with high probability.

PCG: a visualisation tool and measure of reliability
Comparing Naïve Bayes with VPM Naïve Bayes on summary statistics of the deviation from the calibration line (total, mean, standard deviation, max, min):
- VPM Naïve Bayes is reliable, as its PCG follows the diagonal!
- Naïve Bayes over- and under-estimates its probabilities – much like real doctors! Unreliable: a forecast of 0.9 has only a 0.55 chance of being right (over-estimate), and a forecast of 0.1 has a 0.3 chance of being right (under-estimate).
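The PCG coordinates can be computed by binning, roughly as follows (a Python sketch under my own reading of the graph, not the thesis's Matlab scripts):

```python
def pcg_points(forecast_probs, correct, n_bins=10):
    """Probability Calibration Graph coordinates: bin the forecast
    probabilities, and within each bin compute the empirical frequency
    of the prediction being correct.  For a reliable learner the points
    lie close to the diagonal (the line of calibration)."""
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(forecast_probs, correct):
        bins[min(int(p * n_bins), n_bins - 1)].append(c)
    return [((i + 0.5) / n_bins, sum(b) / len(b))   # (bin centre, frequency)
            for i, b in enumerate(bins) if b]
```

A learner that says "0.9" but is right only 55% of the time in that bin produces a point near (0.95, 0.55), far below the diagonal – exactly the over-estimation described above.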

Learners predicting like people!
Naïve Bayes vs. people: lots of psychological research shows that people make unreliable probability forecasts.

Back to the plan…
- What is probability forecasting?
- Reliability and resolution criteria
- Experimental design
- Problems with traditional assessment methods: square loss, log loss and ROC curves
- Probability Calibration Graph (PCG)
- Traditional learners are unreliable yet accurate!
- Extension of the Venn Probability Machine (VPM)
- Which learners are reliable? Psychological and theoretical viewpoint

Table comparing scores with PCG (ranks in parentheses; PCG shown by rank)

Algorithm         Error      Sqr Loss   Log Loss   ROC Area   PCG
VPM C4.5          40.7 (8)   0.54 (5)   0.8  (4)   0.76 (1)   (4)
Naïve Bayes       29.2 (2)   0.50 (4)   1.3  (7)   0.72 (5)   (7)
VPM Naïve Bayes   28.9 (1)   0.44 (1)   0.6  (1)   0.75 (2)   (1)
10-NN             33.4 (4)   1.0  (11)  2.6  (10)  0.54 (10)  (11)
20-NN             33.4 (4)   0.96 (10)  2.2  (9)   0.55 (9)   (10)
C4.5              39.6 (7)   0.67 (7)   3.3  (11)  0.57 (8)   (8)
Neural Net        30.5 (3)   0.45 (2)   0.72 (2)   0.75 (3)   (6)
30-NN             34.3 (5)   0.47 (3)   0.73 (3)   0.74 (4)   (5)
VPM 1-NN          41.6 (9)   0.58 (6)   0.9  (5)   0.61 (6)   (2)
1-NN              34.6 (6)   0.73 (8)   2.1  (8)   0.59 (7)   (9)
ZeroR             55.6 (10)  0.74 (9)   1.1  (6)   0.49 (11)  (3)

Correlations of scores

Scores                     Corr. Coeff.  Interpretation
ROC vs. Sqr Reliability    -0.1          No (inverse)
PCG vs. Error               0.26         Weak (direct)
PCG vs. Sqr Resolution      0.04         No (direct)
PCG vs. Sqr Reliability     0.76         Strong (direct)
ROC vs. Error              -0.52         Moderate (inverse)
ROC vs. Sqr Resolution      0.67         Strong (direct)

Back to the plan…
- What is probability forecasting?
- Reliability and resolution criteria
- Experimental design
- Problems with traditional assessment methods: square loss, log loss and ROC curves
- Probability Calibration Graph (PCG)
- Traditional learners are unreliable yet accurate!
- Extension of the Venn Probability Machine (VPM)
- Which learners are reliable? Psychological and theoretical viewpoint

What is the VPM meta-learner?
Volodya's VPM:
1. Predicts a label
2. Produces upper u and lower l bounds for the predicted label only
My VPM extension:
1. Extracts more information
2. Produces a probability forecast for all possible labels
3. Predicts a label using these probability forecasts
4. Produces Volodya's bounds as well!
The VPM is a meta-learning framework: it sits on top of an existing learner Γ to complement its predictions with probability estimates.

Volodya's original use of VPM
Plot of errors against online trial number: the upper (red) and lower (green) bounds lie above and below the actual number of errors (black) made on the data, e.g.:
- Upper error bound: 34.7% (2216.5)
- Actual errors: 28.9% (1835)
- Lower error bound: 22.1% (1414.1)

Output from VPM compared with that of the original underlying learner
A table (Naïve Bayes vs. VPM Naïve Bayes) showing, for each online trial, the probability forecast for each class label (Dysp., Renal, Pancr., Intest. obstr., Choli., Non-spec., Perf. Pept., Div., Appx.), plus the VPM's lower and upper bounds; the predicted label is underlined. Many forecasts are vanishingly small (e.g. 7.6e-11) while the predicted label's forecast dominates.

Back to the plan…
- What is probability forecasting?
- Reliability and resolution criteria
- Experimental design
- Problems with traditional assessment methods: square loss, log loss and ROC curves
- Probability Calibration Graph (PCG)
- Traditional learners are unreliable yet accurate!
- Extension of the Venn Probability Machine (VPM)
- Which learners are reliable? Psychological and theoretical viewpoint

ZeroR (PCG plots on Heart Disease, Lymphography and Diabetes)
- ZeroR outputs probability forecasts which are mere label frequencies; it predicts the majority class label at each trial.
- Uses no information about the objects in its learning – the simplest of all learners.
- Accuracy is poor, but reliability is good.
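ZeroR's forecasting rule fits in a few lines (Python sketch; the real experiments used WEKA's implementation):

```python
from collections import Counter

class ZeroR:
    """ZeroR ignores the object entirely: its forecast is just the
    frequency of each label observed so far (availability heuristic)."""
    def __init__(self):
        self.counts = Counter()

    def forecast(self, x=None):          # the object x is never used
        total = sum(self.counts.values())
        return {y: c / total for y, c in self.counts.items()} if total else {}

    def update(self, y):                 # called after the label is revealed
        self.counts[y] += 1
```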

K-NN (PCG plots for 10-NN, 20-NN, 30-NN)
- K-NN finds the subset of the K closest (nearest-neighbouring) examples in the training data using a distance metric, then counts the label frequencies amongst this subset.
- Acts like a more sophisticated version of ZeroR that uses the information held in the object.
- An appropriate choice of K must be made to obtain reliable probability forecasts (depends on the data).
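The same counting idea with a distance-based neighbourhood, as a sketch (Python; one-dimensional objects and absolute distance as the metric are simplifying assumptions):

```python
from collections import Counter

def knn_forecast(training, x, k):
    """K-NN probability forecast: find the k training examples closest
    to x, then use the label frequencies in that subset as the forecast."""
    neighbours = sorted(training, key=lambda ex: abs(ex[0] - x))[:k]
    counts = Counter(y for _, y in neighbours)
    return {y: c / k for y, c in counts.items()}
```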

Traditional Learners and VPM
Traditional learners can be very unreliable (yet accurate) – it depends on the data. My research shows empirically that the VPM is reliable, and that it can recalibrate a learner's original probability forecasts to make them more reliable! The improvement in reliability often comes without detriment to classification accuracy. (PCG plots: Naïve Bayes vs. VPM Naïve Bayes, C4.5 vs. VPM C4.5, Neural Net vs. VPM Neural Net, 1-NN vs. VPM 1-NN.)

Back to the plan…
- What is probability forecasting?
- Reliability and resolution criteria
- Experimental design
- Problems with traditional assessment methods: square loss, log loss and ROC curves
- Probability Calibration Graph (PCG)
- Traditional learners are unreliable yet accurate!
- Extension of the Venn Probability Machine (VPM)
- Which learners are reliable? Psychological and theoretical viewpoint

Psychological Heuristics
When faced with the difficult task of judging probability, people employ a limited number of heuristics which reduce the judgements to simpler ones:
- Availability – an event is predicted more likely to occur if it has occurred frequently in the past.
- Representativeness – one compares the essential features of the event to those of the structure of previous events.
- Simulation – the ease with which the simulation of a system of events reaches a particular state can be used to judge the propensity of the (real) system to produce that state.

Interpretation of reliable learners using heuristics
ZeroR, K-NN and the VPM learners are reliable probability forecasters, and we can identify these heuristics in the learning algorithms themselves. Remember, psychological research states: more heuristics → more reliable forecasts.

Psychological Interpretation of ZeroR
The simplest of all reliable probability forecasters uses 1 heuristic:
- The learner merely counts the labels it has observed so far, and uses the frequencies of those labels as its forecasts (Availability).

Psychological Interpretation of K-NN
More sophisticated than the ZeroR learner, the K-NN learner uses 2 heuristics:
- Uses the distance metric to find the subset of the K closest examples in the training set (Representativeness).
- Then counts the label frequencies in the subset of K nearest neighbours to make its forecasts (Availability).

Psychological Interpretation of VPM
Even more sophisticated, the VPM meta-learner uses all 3 heuristics:
- The VPM tries each new test example with all possible classifications (Simulation).
- Then, under each tentative simulation, clusters similar training examples into groups (Representativeness).
- Finally, the VPM calculates the frequency of labels in each of these groups to make its forecasts (Availability).
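The three heuristic steps above map onto a rough code sketch (Python; this is a simplification of the real VPM, and the threshold-based taxonomy function in the test is made up, not Volodya's actual construction):

```python
from collections import Counter, defaultdict

def venn_forecast(training, x, labels, taxonomy):
    """For each tentative label (simulation), extend the training set,
    group examples by a taxonomy function (representativeness), and read
    off the label frequencies in the new object's group (availability)."""
    forecasts = {}
    for y_try in labels:                        # simulation
        extended = training + [(x, y_try)]
        groups = defaultdict(list)
        for xi, yi in extended:
            groups[taxonomy(xi)].append(yi)     # representativeness
        group = groups[taxonomy(x)]
        forecasts[y_try] = {y: c / len(group)
                            for y, c in Counter(group).items()}  # availability
    return forecasts
```

The spread between these per-label distributions is what yields the VPM's upper and lower probability bounds.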

Theoretical justifications
- ZeroR can be proven to be asymptotically reliable (and experiments show it behaves well on finite data).
- K-NN has lots of theory, e.g. Stone (1977), to support its convergence to the true probability distribution.
- The VPM has lots of theoretical justification for finite data, using martingales.

Take home points
- Probability forecasting is useful for real-life applications, especially medicine.
- We want learners to be reliable and accurate.
- The PCG can be used to check reliability.
- ZeroR, K-NN and the VPM provide consistently reliable probability forecasts.
- Traditional learners (Naïve Bayes, Neural Net and Decision Tree) can provide unreliable forecasts.
- The VPM can be used to improve the reliability of probability forecasts without detriment to classification accuracy.

Acknowledgments
- Supervision: Alex Gammerman, Volodya Vovk, Zhiyuan Luo
- Mathematical advice: Daniil Riabko, Volodya Vovk, Teo Sharia
- Proofreading: Zhiyuan Luo, Siân Cox
- Graphics & design: Siân Cox
- Catering: Siân Cox
Fin

What next?
- Look at applications in bioinformatics and medicine – noisy data really needs reliable probability forecasts so the user knows whether to trust predictions!
- Results with time-series data.
- Investigate further relationships with psychology.
- Recursive application of the VPM to improve reliability and accuracy.