Diagnostic Metrics, Part 1 Week 2 Video 2

Different Methods, Different Measures
- Today we'll focus on metrics for classifiers
- Later this week we'll discuss metrics for regressors
- Metrics for other methods will be discussed later in the course

Metrics for Classifiers

Accuracy

- One of the easiest measures of model goodness is accuracy
- Also called agreement, when measuring inter-rater reliability

Accuracy = (# of agreements) / (total # of codes/assessments)
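As a quick illustration (not part of the original lecture), here is a minimal Python sketch of accuracy as percent agreement; the labels below are invented purely for the example.

```python
# Minimal sketch of accuracy as percent agreement between two sets of codes.
truth     = ["on-task", "off-task", "on-task", "on-task", "off-task"]
predicted = ["on-task", "on-task",  "on-task", "on-task", "off-task"]

agreements = sum(t == p for t, p in zip(truth, predicted))
accuracy = agreements / len(truth)   # (# of agreements) / (total # of codes)
print(accuracy)                      # 0.8
```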

Accuracy
- There is general agreement across fields that accuracy is not a good metric

Accuracy
- Let's say that my new Kindergarten Failure Detector achieves 92% accuracy
- Good, right?

Non-even assignment to categories
- Accuracy does poorly when there is non-even assignment to categories
- Which is almost always the case
- Imagine an extreme case:
  - 92% of students pass Kindergarten
  - My detector always says PASS
  - Accuracy of 92%
  - But essentially no information
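A quick hypothetical sketch of that extreme case (the numbers are assumptions chosen to match the 92% figure above, and the always-PASS "detector" is my own illustration):

```python
# Hypothetical majority-class detector: on data where 92% of students pass,
# always predicting PASS yields 92% accuracy but carries no information.
labels = ["PASS"] * 92 + ["FAIL"] * 8   # 92 passing students, 8 failing
predictions = ["PASS"] * 100            # the detector always says PASS

accuracy = sum(l == p for l, p in zip(labels, predictions)) / len(labels)
print(accuracy)                         # 0.92
```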

Kappa

Kappa

Kappa = (Agreement - Expected Agreement) / (1 - Expected Agreement)
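As a minimal sketch (my own code, not from the slides), the formula translates directly; the example values anticipate the worked example that follows.

```python
def kappa(agreement, expected_agreement):
    """Kappa = (Agreement - Expected Agreement) / (1 - Expected Agreement)."""
    return (agreement - expected_agreement) / (1 - expected_agreement)

# With 80% observed agreement and 57.5% expected agreement
# (the values derived in the worked example below):
print(kappa(0.80, 0.575))   # ~0.529
```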

Computing Kappa (Simple 2x2 example)

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60

Computing Kappa (Simple 2x2 example)
What is the percent agreement?

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60

Computing Kappa (Simple 2x2 example)
What is the percent agreement? 80%

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60

Computing Kappa (Simple 2x2 example)
What is Data's expected frequency for on-task?

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60

Computing Kappa (Simple 2x2 example)
What is Data's expected frequency for on-task? 75%

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60

Computing Kappa (Simple 2x2 example)
What is Detector's expected frequency for on-task?

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60

Computing Kappa (Simple 2x2 example)
What is Detector's expected frequency for on-task? 65%

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60

Computing Kappa (Simple 2x2 example)
What is the expected on-task agreement?

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60

Computing Kappa (Simple 2x2 example)
What is the expected on-task agreement? 0.65 * 0.75 = 0.4875

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60

Computing Kappa (Simple 2x2 example)
What is the expected on-task agreement? 0.65 * 0.75 = 0.4875

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60 (48.75)

Computing Kappa (Simple 2x2 example)
What are Data and Detector's expected frequencies for off-task behavior?

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60 (48.75)

Computing Kappa (Simple 2x2 example)
What are Data and Detector's expected frequencies for off-task behavior? 25% and 35%

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60 (48.75)

Computing Kappa (Simple 2x2 example)
What is the expected off-task agreement?

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60 (48.75)

Computing Kappa (Simple 2x2 example)
What is the expected off-task agreement? 0.25 * 0.35 = 0.0875

                 Detector Off-Task   Detector On-Task
Data Off-Task           20                  5
Data On-Task            15                 60 (48.75)

Computing Kappa (Simple 2x2 example)
What is the expected off-task agreement? 0.25 * 0.35 = 0.0875

                 Detector Off-Task   Detector On-Task
Data Off-Task           20 (8.75)           5
Data On-Task            15                 60 (48.75)

Computing Kappa (Simple 2x2 example)
What is the total expected agreement?

                 Detector Off-Task   Detector On-Task
Data Off-Task           20 (8.75)           5
Data On-Task            15                 60 (48.75)

Computing Kappa (Simple 2x2 example)
What is the total expected agreement? 0.4875 + 0.0875 = 0.575

                 Detector Off-Task   Detector On-Task
Data Off-Task           20 (8.75)           5
Data On-Task            15                 60 (48.75)

Computing Kappa (Simple 2x2 example)
What is kappa?

                 Detector Off-Task   Detector On-Task
Data Off-Task           20 (8.75)           5
Data On-Task            15                 60 (48.75)

Computing Kappa (Simple 2x2 example)
What is kappa? (0.8 - 0.575) / (1 - 0.575) = 0.225 / 0.425 = 0.529

                 Detector Off-Task   Detector On-Task
Data Off-Task           20 (8.75)           5
Data On-Task            15                 60 (48.75)
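Below is a quick Python sketch (not from the original slides) that reproduces this worked example from the raw counts in the table; the variable names are my own.

```python
# Reproduce the 2x2 worked example: counts come from the table above.
off_off, off_on = 20, 5    # Data Off-Task row: Detector Off-Task, Detector On-Task
on_off,  on_on  = 15, 60   # Data On-Task row
total = off_off + off_on + on_off + on_on          # 100

agreement = (off_off + on_on) / total              # 0.80

data_on     = (on_off + on_on) / total             # 0.75
detector_on = (off_on + on_on) / total             # 0.65
expected = data_on * detector_on + (1 - data_on) * (1 - detector_on)
# expected = 0.4875 + 0.0875 = 0.575

kappa = (agreement - expected) / (1 - expected)
print(round(kappa, 3))                             # 0.529
```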

So is that any good?
What is kappa? (0.8 - 0.575) / (1 - 0.575) = 0.225 / 0.425 = 0.529

                 Detector Off-Task   Detector On-Task
Data Off-Task           20 (8.75)           5
Data On-Task            15                 60 (48.75)

Interpreting Kappa
- Kappa = 0: Agreement is at chance
- Kappa = 1: Agreement is perfect
- Kappa = negative infinity: Agreement is perfectly inverse
- Kappa > 1: You messed up somewhere

Kappa < 0
- This means your model is worse than chance
- Very rare to see, except when using cross-validation (where it is seen more commonly)
- It means your model is junk
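As an assumed illustration (not from the lecture), flipping every prediction in the worked example above produces a worse-than-chance detector and a negative kappa:

```python
# Flipped detector: Data Off-Task row becomes (5, 20), Data On-Task row (60, 15).
agreement = (5 + 15) / 100                    # only 20% agreement now
expected  = 0.75 * 0.35 + 0.25 * 0.65         # new marginals give 0.425
print(round((agreement - expected) / (1 - expected), 3))   # -0.391
```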

0 < Kappa < 1
- What's a good Kappa?
- There is no absolute standard

0 < Kappa < 1
- For data-mined models, a Kappa of roughly 0.3-0.5 is typically considered good enough to call the model better than chance and publishable
- In affective computing, lower is still often OK

Why is there no standard?
- Because Kappa is scaled by the proportion of each category
- When one class is much more prevalent, expected agreement is higher than if classes are evenly balanced
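Here is a hypothetical side-by-side in Python (the counts are my own, not from the lecture): two detectors with identical 80% agreement get very different kappas once one class dominates.

```python
def kappa_2x2(a, b, c, d):
    """Kappa for a 2x2 table: a and d are the agreement cells,
    b and c the disagreement cells."""
    total = a + b + c + d
    agreement = (a + d) / total
    expected = (((a + b) / total) * ((a + c) / total)
                + ((c + d) / total) * ((b + d) / total))
    return (agreement - expected) / (1 - expected)

# Balanced classes, 80% agreement: expected agreement is 0.50
print(round(kappa_2x2(40, 10, 10, 40), 3))   # 0.6
# Skewed classes (90% of cases in one category), still 80% agreement:
# expected agreement rises to 0.74, so kappa drops
print(round(kappa_2x2(5, 5, 15, 75), 3))     # 0.231
```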

Because of this...
- Comparing Kappa values between two data sets, in a principled fashion, is highly difficult
- It is OK to compare Kappa values within a data set
- A lot of work went into statistical methods for comparing Kappa values in the 1990s
- No real consensus
- Informally, you can compare two data sets if the proportions of each category are "similar"

Quiz
What is kappa?  A:   B:   C:   D:

                 Detector Insult during Collaboration   Detector No Insult during Collaboration
Data Insult                      16                                       7
Data No Insult                    8                                      19

Quiz
What is kappa?  A:   B:   C:   D:

                     Detector Academic Suspension   Detector No Academic Suspension
Data Suspension                   1                                2
Data No Suspension                4                              141

Next lecture
- ROC curves
- A'
- Precision
- Recall