1 A Confidence Interval for the Misclassification Rate S.A. Murphy & E.B. Laber

2 Outline
– Review
– Three challenges in constructing CIs
– Combining a statistical approach with a learning theory approach to constructing CIs
– Relevance to confidence measures for the value of a policy

3 Review
– X is the vector of features; Y is the binary label, coded Y in {−1, 1}
– Misclassification rate: μ(f) = P(f(X) ≠ Y)
– Data: N i.i.d. observations of (Y, X)
– Given a space of classifiers, F, and the data, use some method to construct a classifier, f̂
– The goal is to provide a CI for μ(f̂)
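As a concrete anchor for this notation, here is a minimal sketch (ours, not the authors') of the empirical misclassification rate of a fitted classifier:

```python
import numpy as np

# Hypothetical sketch of the slide-3 quantities: the empirical
# misclassification rate of a fitted classifier f_hat, i.e. the
# fraction of observations that f_hat labels incorrectly.
def misclassification_rate(f_hat, X, Y):
    return np.mean(f_hat(X) != Y)
```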

4 Review
– Since the 0–1 loss is not smooth, one commonly uses a surrogate loss to estimate the classifier
– Surrogate loss: L(Y, f(X)); e.g., squared error (Y − f(X))², as used in the implementation later

5 Review
General approach to providing a CI:
– We estimate μ(f̂) using the data, resulting in μ̂(f̂)
– Derive an approximate distribution for μ̂(f̂) − μ(f̂)
– Use this approximate distribution to construct a confidence interval for μ(f̂)

6 Three challenges
1) F is too large, leading to over-fitting and negative bias of μ̂(f̂)
2) μ̂ is a non-smooth function of the classifier
3) μ(f̂) may behave like an extreme quantity
No assumption that f̂ is close to optimal.

7 Three challenges
2) μ̂ is non-smooth. Example: the unknown Bayes classifier has a quadratic decision boundary. We fit, by least squares, a linear decision boundary f(x) = sign(β0 + β1 x).
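A small simulation in the spirit of this example (the data-generating choices below are ours, not from the slides): the Bayes boundary is quadratic in x, yet we fit a linear sign rule by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labels in {-1, 1} whose Bayes decision boundary is quadratic
# (roughly x^2 = 0.3), while the fitted rule f(x) = sign(b0 + b1*x)
# is constrained to be linear.
N = 200
x = rng.uniform(-1, 1, N)
p_pos = 1.0 / (1.0 + np.exp(-5.0 * (x**2 - 0.3)))
y = np.where(rng.uniform(size=N) < p_pos, 1.0, -1.0)

# Least-squares fit of (b0, b1): minimizes the surrogate (Y - b0 - b1*X)^2.
design = np.column_stack([np.ones(N), x])
b0, b1 = np.linalg.lstsq(design, y, rcond=None)[0]
f_hat = lambda t: np.sign(b0 + b1 * t)
print("training error:", np.mean(f_hat(x) != y))
```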

8 Density of [figure; the plotted quantity was lost in transcription]

9 Bootstrap CI for [figure; the target quantity was lost in transcription]
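The naive bootstrap percentile interval referenced here can be sketched as follows, assuming a refit function that retrains the classifier on a resampled data set (the naming and structure are ours, not the authors'):

```python
import numpy as np

rng = np.random.default_rng(1)

# Naive percentile-bootstrap CI for the misclassification rate:
# refit the classifier on each bootstrap resample, apply the same
# plug-in error estimate to the resample, and take empirical
# quantiles of the resulting estimates.
def bootstrap_percentile_ci(X, Y, refit, B=500, alpha=0.05):
    n = len(Y)
    errs = []
    for _ in range(B):
        idx = rng.integers(0, n, n)          # resample rows with replacement
        f_b = refit(X[idx], Y[idx])
        errs.append(np.mean(f_b(X[idx]) != Y[idx]))
    return tuple(np.quantile(errs, [alpha / 2, 1 - alpha / 2]))
```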

10 Misclassification Rate is Non-smooth
[Table: coverage of 95% CIs by sample size for the Bootstrap Percentile, Yang CV, and CUD-Bound methods; the numerical values were lost in transcription.]

11 CIs for Extreme Quantities
3) μ(f̂) may behave like an extreme quantity. Should this be problematic?
– Highly skewed distribution of μ̂(f̂) − μ(f̂)
– Fast convergence of μ(f̂) to zero

12 CIs from Learning Theory
Given a result of the form P( sup over f in F of |μ̂(f) − μ(f)| ≥ ε(N, δ) ) ≤ δ, where f̂ is known to belong to F, the interval [μ̂(f̂) − ε(N, δ), μ̂(f̂) + ε(N, δ)] forms a 1−δ CI for μ(f̂).
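For intuition, one concrete instance of such a result (our choice, not necessarily the bound meant on the slide) is Hoeffding's inequality with a union bound over a finite class of M classifiers:

```python
import numpy as np

# Hoeffding + union bound over a finite class F with |F| = M:
#   P( sup_f |mu_hat(f) - mu(f)| >= eps ) <= 2*M*exp(-2*N*eps^2),
# so solving 2*M*exp(-2*N*eps^2) = delta for eps gives the radius
# of a 1 - delta CI centered at mu_hat(f_hat).
def eps_finite_class(N, M, delta):
    return np.sqrt(np.log(2.0 * M / delta) / (2.0 * N))

print(eps_finite_class(N=150, M=1000, delta=0.05))   # about 0.188
```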

13 Combine statistical ideas with learning theory ideas
Construct a confidence interval for sup over f in F_N of √N |μ̂(f) − μ(f)|, where F_N is chosen to be small yet contain f̂
– from this CI, deduce a conservative CI for μ(f̂)
– use the surrogate loss to smooth the maximization and to construct F_N

14 Construct a confidence interval for sup over f in F_N of √N |μ̂(f) − μ(f)|
– F_N should contain all f that are close to f*
– that is, all f whose surrogate risk is within a small tolerance of the minimum [exact condition lost in transcription]
– f* is the “limiting value” of f̂

15 Confidence Interval
Construct a confidence interval for sup over f in F_N of √N |μ̂(f) − μ(f)|
– the tolerance defining F_N is a rate; the condition I would like it to satisfy was lost in transcription

16 Confidence Interval [slide formulas lost in transcription]

17 Bootstrap
We use the bootstrap to estimate an upper percentile of the distribution of sup over f in F_N of √N |μ̂*(f) − μ̂(f)|, obtaining b*. The CI is then μ̂(f̂) ± b*/√N.
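A heavily simplified sketch of this bootstrap step, assuming F_N has already been reduced to a finite list of candidate classifiers (the actual computation optimizes exactly over the equivalence classes described on slides 21-23):

```python
import numpy as np

rng = np.random.default_rng(2)

# Bootstrap an upper percentile of
#   sup_{f in F_N} sqrt(N) * |mu_hat*(f) - mu_hat(f)|,
# with F_N approximated by a finite list `classifiers`.
def bootstrap_b_star(classifiers, X, Y, B=500, level=0.95):
    N = len(Y)
    errs = np.array([np.mean(f(X) != Y) for f in classifiers])
    sups = []
    for _ in range(B):
        idx = rng.integers(0, N, N)
        errs_b = np.array([np.mean(f(X[idx]) != Y[idx]) for f in classifiers])
        sups.append(np.sqrt(N) * np.max(np.abs(errs_b - errs)))
    return np.quantile(sups, level)

# The CI is then mu_hat(f_hat) +/- b_star / sqrt(N), clipped to [0, 1].
```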

18 Implementation
– Approximation space for the classifier is linear: f(x) = sign(βᵀx)
– Surrogate loss is least squares: L(Y, f(X)) = (Y − βᵀX)²
– μ̂ is the .632 estimator
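The .632 estimator named here is, in its standard form (Efron's), a weighted combination of the resubstitution error and the leave-one-out bootstrap error; a sketch under that assumption, with refit as before (the slides do not show their exact implementation):

```python
import numpy as np

rng = np.random.default_rng(3)

# Standard .632 estimator: 0.368 * resubstitution error plus
# 0.632 * leave-one-out bootstrap error (each point is scored only
# by classifiers whose bootstrap sample excluded it).
def err_632(refit, X, Y, B=100):
    n = len(Y)
    resub = np.mean(refit(X, Y)(X) != Y)
    loo_err, loo_cnt = np.zeros(n), np.zeros(n)
    for _ in range(B):
        idx = rng.integers(0, n, n)
        out = np.setdiff1d(np.arange(n), idx)    # points not in the resample
        f_b = refit(X[idx], Y[idx])
        loo_err[out] += (f_b(X[out]) != Y[out])
        loo_cnt[out] += 1
    seen = loo_cnt > 0
    return 0.368 * resub + 0.632 * np.mean(loo_err[seen] / loo_cnt[seen])
```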

19 Implementation
[The formula showing what the bootstrap quantity becomes under this implementation was lost in transcription.]

20 [slide content lost in transcription]

21 Computational Issues
– Partition R^p into equivalence classes defined by the pattern of signs (sign(βᵀx_1), …, sign(βᵀx_N))
– Each equivalence class can be written as a set of β satisfying linear constraints
– The term in absolute values is constant on each equivalence class
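The point of the partition is that the empirical error of f(x) = sign(βᵀx) depends on β only through this sign pattern; a tiny illustration (ours, hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two betas in the same equivalence class (identical sign pattern on
# the sample) give identical empirical error, so the supremum over
# R^p reduces to finitely many classes, each cut out by linear
# constraints of the form sign(beta' x_i) = s_i.
X = rng.normal(size=(6, 2))
beta1 = np.array([1.0, 2.0])
beta2 = 3.5 * beta1                       # positive scaling: same pattern
assert np.array_equal(np.sign(X @ beta1), np.sign(X @ beta2))
```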

22 Computational Issues
The maximization over each equivalence class can be written as [formula lost in transcription], since g is non-decreasing.

23 Computational Issues
– This reduces the problem to the computation of a number of convex optimization problems.
– The number of convex optimizations is reduced via use of g together with a branch and bound algorithm.
– With a sample size of N = 150 and 11 features, calculation of the percentiles of the CUD bound using 500 bootstrap samples can be accomplished in a few minutes on a standard desktop (2.4 GHz processor, 2 GB RAM).

24 Comparisons, 95% CI
[Table: coverage of 95% CIs on the Spam, Ion, Heart, Diabetes, Donut, and Outlier data sets for the CUD, BS, M, and Y methods; the numerical values were lost in transcription.]
Sample size = 50

25 Comparisons, length of CI
[Table: CI lengths on the same data sets and methods as the previous slide; the numerical values were lost in transcription.]
Sample size = 50

26 Intuition
If [condition lost in transcription], then we are approximating the distribution of [quantity lost in transcription], where [definition lost in transcription].

27 Intuition
If [conditions lost in transcription], then the distribution is approximately that of the absolute value of a normal (the limiting distribution for a binomial, as expected).

28 Intuition
If [conditions lost in transcription], then the distribution is approximately the distribution of [formula lost in transcription].

29 Intuition
Consider: if in place of f̂ we put f̃, where f̃ is close to f̂, then due to the non-smoothness of μ̂ at f̂ we will get jittering.

30 Discussion
Further reduce the conservatism of the CUD-bound:
– Eliminate the symmetry of the CUD-bound CI
– Trade off the computational burden versus bias by use of a surrogate for the indicator in the misclassification rate
The real goal is to produce CIs for the Value of a policy.

31 The simplest dynamic treatment regime (e.g., policy) is a decision rule, if there is only one stage of treatment (1 stage).
For each individual:
– O_j: observation available at the jth stage
– A_j: action at the jth stage (usually a treatment)
– Y: primary outcome

32 Goal: Construct decision rules that input patient information and output a recommended action; these decision rules should lead to a maximal mean Y.
In the future one selects the action: [formula lost in transcription]

33 Single Stage (k = 1)
– Find a confidence interval for the mean outcome if a particular estimated policy (here, one decision rule) is employed.
– Action A is randomized in {−1, 1}.
– Suppose the decision rule is of the form d(o) = sign(βᵀo) [exact form lost in transcription].
– We do not assume the optimal decision boundary is linear.

34 Single Stage (k = 1)
Mean outcome following this policy is the Value, V(d) [display lost in transcription]; p(a|o) is the randomization probability.
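The display on this slide was lost; a standard inverse-probability-weighted form of such a mean outcome, under the randomization assumption stated on the previous slide, is sketched below. The exact estimator used in the talk may differ.

```python
import numpy as np

# IPW estimate of the mean outcome (Value) of decision rule d from
# randomized trial data (O, A, Y), assuming A was randomized with
# known probabilities p(A|O).  This standard form is a stand-in for
# the slide's lost display.
def value_ipw(d, O, A, Y, p):
    return np.mean((A == d(O)) * Y / p)

# e.g. d = lambda o: np.sign(o @ beta), with p = 0.5 under uniform
# randomization of A in {-1, 1}.
```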

35 [slide content lost in transcription]

36 Oslin ExTENd
[Flow diagram of the SMART design: initial random assignment to an early vs. late trigger for nonresponse, 8 weeks on naltrexone, then re-randomization of responders and nonresponders among naltrexone, TDM + naltrexone, CBI, and CBI + naltrexone options.]

37 This seminar can be found at: seminars/Stanford ppt
Email Eric or me with questions or if you would like a copy of the associated paper: [email addresses lost in transcription]