Presentation transcript:

1 A Prediction Interval for the Misclassification Rate E.B. Laber & S.A. Murphy

2 Outline
– Review
– Three challenges in constructing PIs
– Combining a statistical approach with a learning theory approach to constructing PIs
– Relevance to confidence measures for the value of a dynamic treatment regime

3 Review
– X is the vector of features in R^q; Y is the binary label in {-1, 1}
– Misclassification rate of a classifier: the probability that it mislabels a new observation
– Data: N i.i.d. observations of (Y, X)
– Given a space of classifiers and the data, use some method to construct a classifier
– The goal is to provide a PI for the misclassification rate of the fitted classifier
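The formulas on this slide did not survive the transcript. A minimal restatement in standard notation (the symbols Err, \mathcal{F}, and \hat f are illustrative assumptions, not necessarily the talk's own):

\[
  \operatorname{Err}(f) = P\bigl(Y \neq \operatorname{sign} f(X)\bigr),
  \qquad \hat f \in \mathcal{F} \text{ chosen from the } N \text{ observations},
  \qquad \text{goal: a PI for } \operatorname{Err}(\hat f).
\]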

4 Review
– Since the 0-1 loss is not smooth, one commonly estimates the classifier by minimizing a smooth surrogate loss
– Surrogate loss: L(Y, f(X)), e.g. squared error (the choice used later in the implementation)

5 Review
General approach to providing a PI:
– Estimate the misclassification rate from the data
– Derive an approximate distribution for this estimator
– Use the approximate distribution to construct a prediction interval for the true misclassification rate

6 Review
A common choice of estimator is the resubstitution error or training error: the empirical misclassification rate of the fitted classifier evaluated on the same data used to fit it.
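In the same assumed notation (not recovered from the slides), the resubstitution error is the training-sample average of the 0-1 loss:

\[
  \widehat{\operatorname{Err}}_{\mathrm{resub}}(\hat f)
  = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl\{ Y_i \neq \operatorname{sign}\hat f(X_i) \bigr\}.
\]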

7 Three challenges
1) The space of classifiers is too large, leading to over-fitting and a negatively biased training error
2) The misclassification rate is a non-smooth function of f
3) The estimator may behave like an extreme quantity
No assumption that the fitted classifier is close to optimal.

8 A Challenge
2) The misclassification rate is non-smooth in f.
Example: the unknown optimal classifier has a quadratic decision boundary. We fit, by least squares, a linear decision rule f(x) = sign(β0 + β1 x).
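A minimal Python simulation of this setup (the data-generating mechanism below is an illustrative assumption, not the talk's three-point distribution): draw a feature whose optimal decision boundary is quadratic, fit a linear rule by least squares on the surrogate loss, and evaluate its resubstitution error.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 30

    # Illustrative data: the optimal decision boundary depends on x^2.
    x = rng.uniform(-1, 1, size=N)
    p_pos = 1 / (1 + np.exp(-5 * (x**2 - 0.3)))   # P(Y = 1 | x)
    y = np.where(rng.uniform(size=N) < p_pos, 1, -1)

    # Least-squares (surrogate-loss) fit of a linear rule f(x) = b0 + b1*x.
    design = np.column_stack([np.ones(N), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)

    # Resubstitution (training) error of the fitted sign rule.
    yhat = np.sign(design @ beta)
    print(beta, np.mean(yhat != y))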

9 [Figure: density plots for the Three Point Dist., n = 30 and n = 100]

10 Bias of the common estimator on the Three Point Example

11 Coverage of Bootstrap PI in Three Point Example (goal = 95%)

12 Coverage of Correctly Centered Bootstrap PI (goal = 95%)

13 Coverage of 95% PI (Three Point Example)
[Table: coverage by sample size for the Bootstrap Percentile, Yang CV, and CUD-Bound methods]

14 PIs from Learning Theory
Given a uniform bound that holds for all N over a class known to contain the fitted classifier, the bound yields a conservative 1-δ PI.
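One standard form such a result can take (an illustrative assumption; the slide's exact bound was not recovered): if

\[
  P\Bigl(\sup_{f \in \mathcal{F}} \bigl|\widehat{\operatorname{Err}}(f) - \operatorname{Err}(f)\bigr| \le \varepsilon(N,\delta)\Bigr) \ge 1 - \delta
  \quad \text{for all } N,
\]

then, since \(\hat f \in \mathcal{F}\), the interval \(\widehat{\operatorname{Err}}(\hat f) \pm \varepsilon(N,\delta)\) is a conservative \(1-\delta\) PI for \(\operatorname{Err}(\hat f)\).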

15 Combine statistical ideas with learning theory ideas
Construct a prediction interval over a set of classifiers chosen to be small yet large enough to contain the target
--- from this PI deduce a conservative PI for the misclassification rate
--- use the surrogate loss to perform estimation and to construct the set

16 Construct a prediction interval over a set of classifiers
--- the set should contain all f that are close to the "limiting value" of the fitted classifier

17 Prediction Interval
Construct a prediction interval for the misclassification rate over this set.

18 Prediction Interval

19 Bootstrap
We use the bootstrap to estimate an upper percentile of the relevant distribution, which gives b_U; similarly for b_L. These two quantities form the PI.
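A generic bootstrap-percentile sketch in Python (the statistic passed in is a placeholder, not the talk's CUD-bound quantity):

    import numpy as np

    def bootstrap_percentiles(y, x, stat, B=1000, alpha=0.05, seed=0):
        """Estimate lower/upper percentiles of a statistic's bootstrap distribution."""
        rng = np.random.default_rng(seed)
        n = len(y)
        draws = np.empty(B)
        for b in range(B):
            idx = rng.integers(0, n, size=n)   # resample with replacement
            draws[b] = stat(y[idx], x[idx])    # recompute the statistic on the resample
        return np.quantile(draws, [alpha / 2, 1 - alpha / 2])  # (b_L, b_U)

For example, stat could refit the least-squares linear rule on each resample and return its training error.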

20 Implementation
The approximation space for the classifier is linear in the features; the surrogate loss is least squares (squared error).

21 Implementation
Under these choices the general expression specializes to a computable form.

22 Implementation
Bootstrap version: the expectation is taken with respect to the bootstrap distribution.

23 [Figure: CUD-Bound level sets, Three Point Dist. (n = 30)]

24 Computational Issues
Partition R^q into equivalence classes defined by the 2N possible values of the first term. Each equivalence class can be written as a set of β satisfying linear constraints, and the first term is constant on each class.

25 Computational Issues
On each equivalence class the optimization can be rewritten in a simpler form, since g is non-decreasing.

26 Computational Issues
This reduces the problem to the computation of at most 2N mixed-integer quadratic programming problems. Using commercial solvers (e.g. CPLEX), the CUD bound can be computed for moderately sized data sets in a few minutes on a standard desktop (2.8 GHz processor, 2 GB RAM).

27 Comparisons, 95% PI
Sample size = 30 (1000 data sets)
[Table: coverage on the Magic, Mamm, Ion, Donut, Pt, Balance, and Liver data sets under the CUD bound and the comparison methods]

28 Comparisons, Length of PI
Sample size = 30 (1000 data sets)
[Table: PI length on the Magic, Mamm, Ion, Donut, Pt, Balance, and Liver data sets under the CUD bound and the comparison methods]

29 Intuition
In large samples the estimator behaves like a limiting quantity.

30 Intuition
The large-sample distribution coincides with the distribution of a simpler limiting quantity.

31 Intuition
In one case the distribution is approximately normal (the limiting distribution for a binomial, as expected).

32 Intuition
In the other case the distribution is approximately that of a more complicated limiting quantity.

33 Discussion
Further reduce the conservatism of the CUD-bound:
– Replace components of the bound by other quantities
– Other surrogates (exponential, logit, hinge)
Construct a principle for minimizing the length of the conservative PI?
The real goal is to produce PIs for the Value of a policy.

34 The simplest dynamic treatment regime (i.e., policy) is a single decision rule when there is only one stage of treatment.
For each individual: the observation available at the jth stage, the action at the jth stage (usually a treatment), and the primary outcome Y.

35 Goal: Construct decision rules that input patient information and output a recommended action; these decision rules should lead to a maximal mean Y. In the future one selects the action recommended by the rule.

36 Single Stage
Find a confidence interval for the mean outcome if a particular estimated policy (here one decision rule) is employed. Treatment A is randomized in {-1, 1}. Suppose the decision rule has a linear form; we do not assume the optimal decision boundary is linear.

37 Single Stage
The mean outcome following this policy is a weighted average of the observed outcomes, with weights involving the randomization probability of treatment A.
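A minimal inverse-probability-weighted sketch of this mean outcome in Python (a standard construction under randomization; the 50/50 randomization probability and the particular rule below are illustrative assumptions, not taken from the slides):

    import numpy as np

    def ipw_value(y, a, o, rule, prob_a):
        """IPW estimate of the mean outcome (value) of a single-stage decision rule,
        assuming treatment a in {-1, 1} was randomized with known probability prob_a(a, o)."""
        follows = (np.asarray(a) == rule(o)).astype(float)  # 1 if the observed action matches the rule
        weights = follows / prob_a(a, o)                    # inverse randomization probability
        return np.mean(weights * np.asarray(y))

    # Hypothetical usage with a linear rule and a 50/50 randomization.
    rng = np.random.default_rng(1)
    o = rng.normal(size=200)                             # baseline observation per person
    a = rng.choice([-1, 1], size=200)                     # randomized treatment
    y = 1.0 + 0.5 * a * (o > 0) + rng.normal(size=200)    # outcome favors a = 1 when o > 0
    rule = lambda obs: np.where(obs > 0, 1, -1)           # estimated decision rule (illustrative)
    print(ipw_value(y, a, o, rule, lambda a_, o_: np.full(len(a_), 0.5)))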

38

39 [Diagram: Oslin ExTENd trial. Random assignment to an early or late trigger for nonresponse during 8 weeks of Naltrexone; responders and nonresponders are then re-randomized among Naltrexone, TDM + Naltrexone, CBI, and CBI + Naltrexone options.]

40 This seminar can be found at: seminars/ColumbiaBioStat02.09.ppt
Email Eric or me with questions or if you would like a copy of the associated paper.

41 Bias of the common estimator on the Three Point Example

42 Intuition
Consider the large-sample variance of the estimator. If, in place of the limiting coefficients, we plug in coefficients for which the linear combination is close to 0, then, due to the non-smoothness of the misclassification rate at that point, we can get jittering.