Data Science Credibility: Evaluating What's Been Learned
Predicting Performance; Comparing DM Schemes
Rodney Nielsen. Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall. WFH: Data Mining, Sections

Credibility: Evaluating What's Been Learned
- Issues: training, testing, tuning
- Holdout, cross-validation, bootstrap
- Predicting performance
- Comparing schemes: the t-test

Predicting Performance
- Assume the estimated success rate (accuracy) is 75%. How close is this to the true accuracy?
  - Depends on the amount of test data
- Prediction is just like tossing a (biased!) coin: "heads" is a "success", "tails" is an "error"
- In statistics, a succession of independent events like this is called a Bernoulli process
- Statistical theory provides us with confidence intervals for the true underlying proportion
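To make the Bernoulli-process view concrete, here is a small simulation sketch (the true success rate of 75% and test-set size of 100 are illustrative): each test set behaves like a run of independent coin tosses, so the observed success rate f scatters around the true p.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.75   # hypothetical true success rate (unknown in practice)
n_test = 100    # test-set size

# Each test set is n_test independent Bernoulli(p_true) trials; the
# observed success rate f = successes / n_test varies from set to set.
f = rng.binomial(n_test, p_true, size=10_000) / n_test
print(f.mean())  # close to p_true
print(f.std())   # close to sqrt(p_true * (1 - p_true) / n_test), ~0.043
```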

Confidence Intervals
- We can say: p lies within a certain specified interval with a certain specified confidence
- Example: S = 750 successes in N = 1000 trials
  - Estimated success rate: 75%
  - How close is this to the true success rate p?
  - Answer: with 80% confidence, p lies in [73.2%, 76.7%]
- Another example: S = 75 and N = 100
  - Estimated success rate: 75%
  - How close is this to the true success rate p?
  - With 80% confidence, p lies in [69.1%, 80.1%]

Mean and Variance
- Mean and variance for a Bernoulli trial: p, p(1 − p)
- Expected success rate f = S/N
- Mean and variance for f: p, p(1 − p)/N
- For large enough N, f follows an approximate normal distribution
- The c% confidence interval [−z ≤ X ≤ z] for a random variable X with 0 mean is given by: Pr[−z ≤ X ≤ z] = c
- With a symmetric distribution: Pr[−z ≤ X ≤ z] = 1 − 2 × Pr[X ≥ z]

Confidence Limits
- Confidence limits for the normal distribution with 0 mean and a standard deviation of 1:

  Pr[X ≥ z]   z
  0.1%        3.09
  0.5%        2.58
  1%          2.33
  5%          1.65
  10%         1.28
  20%         0.84
  40%         0.25

- Thus: Pr[−1.65 ≤ X ≤ 1.65] = 90%
- To use this we have to reduce our random variable f to have 0 mean and unit standard deviation: X = (f − p) / √(p(1 − p)/N)

Examples
- f = 75%, N = 1000, c = 80% (i.e., z = 1.28): p ∈ [0.732, 0.767]
- f = 75%, N = 100, c = 80%: p ∈ [0.691, 0.801]
- Note that the normal distribution assumption is only valid for large N (roughly N > 100)
- f = 75%, N = 10, c = 80%: p ∈ [0.549, 0.881] (should be taken with a grain of salt)
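A minimal sketch of how these intervals can be computed, using the standard confidence-interval formula for a Bernoulli proportion (the Wilson score interval); the function name and the 80% default are illustrative, not from the source.

```python
import math
from scipy.stats import norm

def success_rate_interval(f, n, confidence=0.80):
    """Confidence interval for the true success rate p, given an observed
    success rate f on n test instances (Wilson score interval)."""
    z = norm.ppf(1 - (1 - confidence) / 2)  # two-sided z; ~1.28 for 80%
    center = (f + z**2 / (2 * n)) / (1 + z**2 / n)
    half = z * math.sqrt(f * (1 - f) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)
    return center - half, center + half

print(success_rate_interval(0.75, 1000))  # ~(0.732, 0.767)
print(success_rate_interval(0.75, 100))   # ~(0.691, 0.801)
print(success_rate_interval(0.75, 10))    # ~(0.549, 0.881)
```

Note how the interval widens as N shrinks: the same observed 75% tells us much less about p from 10 instances than from 1000.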

Credibility: Evaluating What's Been Learned
- Issues: training, testing, tuning
- Predicting performance
- Holdout, cross-validation, bootstrap
- Comparing schemes: the t-test

Comparing Data Mining Schemes
- Frequent question: which of two learning schemes performs better?
- Note: this is domain dependent!
- Obvious way: compare 10-fold CV estimates
- How effective would this be?
- We need to show convincingly that a particular method works better

Comparing DM Schemes II
- Want to show that scheme A is better than scheme B in a particular domain
  - For a given amount of training data
  - On average, across all possible training sets
- Let's assume we have an infinite amount of data from the domain:
  - Sample infinitely many datasets of the specified size
  - Obtain a CV estimate on each dataset for each scheme
  - Check whether the mean accuracy for scheme A is better than the mean accuracy for scheme B
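A rough sketch of one such comparison step, assuming scikit-learn; the synthetic dataset stands in for "a dataset sampled from the domain", and the decision tree and naive Bayes learners are illustrative stand-ins for schemes A and B.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic surrogate for one dataset sampled from the domain
X, y = make_classification(n_samples=1000, random_state=0)

scheme_a = DecisionTreeClassifier(random_state=0)
scheme_b = GaussianNB()

# 10-fold CV accuracy estimates for each scheme
acc_a = cross_val_score(scheme_a, X, y, cv=10)
acc_b = cross_val_score(scheme_b, X, y, cv=10)
print(acc_a.mean(), acc_b.mean())  # compare mean accuracies
```

In practice we have only finitely many such datasets and estimates, which is why the next slides bring in the t-test.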

Paired t-test
- In practice we have limited data and a limited number of estimates for computing the mean
- Student's t-test tells us whether the means of two samples are significantly different
- In our case the samples are cross-validation estimates for different datasets from the domain
- Use a paired t-test because the individual samples are paired: the same CV is applied twice

William Gosset. Born: 1876 in Canterbury; died: 1937 in Beaconsfield, England. Obtained a post as a chemist in the Guinness brewery in Dublin in 1899. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student".

Distribution of the Means
- x_1, x_2, …, x_k and y_1, y_2, …, y_k are the 2k samples for the k different datasets
- m_x and m_y are the means
- With enough samples, the mean of a set of independent samples is normally distributed
- Estimated variances of the means are σ_x²/k and σ_y²/k

Student's Distribution
- With small samples (k < 100) the mean follows Student's distribution with k − 1 degrees of freedom
- Confidence limits, assuming we have 10 estimates (i.e., 9 degrees of freedom), compared with the normal distribution:

  Pr[X ≥ z]   z (9 degrees of freedom)   z (normal distribution)
  0.1%        4.30                       3.09
  0.5%        3.25                       2.58
  1%          2.82                       2.33
  5%          1.83                       1.65
  10%         1.38                       1.28
  20%         0.88                       0.84
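These limits can be reproduced with SciPy; a small sketch whose output matches the table above (note how the t quantiles are fatter-tailed than the normal ones):

```python
from scipy import stats

# Critical values z such that Pr[X >= z] equals each tail probability
for tail in (0.001, 0.005, 0.01, 0.05, 0.10, 0.20):
    z_t = stats.t.ppf(1 - tail, df=9)  # Student's t with 9 degrees of freedom
    z_n = stats.norm.ppf(1 - tail)     # standard normal
    print(f"Pr[X >= z] = {tail:>5.1%}:  t(9) z = {z_t:.2f},  normal z = {z_n:.2f}")
```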

Distribution of the Differences
- Let m_d = m_x − m_y
- The difference of the means (m_d) also has a Student's distribution with k − 1 degrees of freedom
- Let σ_d² be the variance of the difference
- The standardized version of m_d is called the t-statistic: t = m_d / √(σ_d²/k)
- We use t to perform the t-test

Performing the Test
- Fix a significance level α
  - If a difference is significant at the α% level, there is a (100 − α)% chance that the true means differ
- Divide the significance level by two because the test is two-tailed
  - I.e., either result could be greater than the other
- Look up the value for z that corresponds to α/2
- If t ≤ −z or t ≥ z, the difference is significant
  - I.e., the null hypothesis (that the difference is zero) can be rejected: there is less than an α% probability that the difference is due to chance or random factors
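A sketch of the full procedure, using made-up paired 10-fold CV accuracies for the two schemes; scipy.stats.ttest_rel performs the same paired test directly.

```python
import numpy as np
from scipy import stats

# Illustrative paired 10-fold CV accuracies (same folds for both schemes)
acc_a = np.array([0.82, 0.79, 0.85, 0.80, 0.78, 0.83, 0.81, 0.84, 0.80, 0.82])
acc_b = np.array([0.80, 0.77, 0.82, 0.79, 0.76, 0.80, 0.80, 0.81, 0.78, 0.81])

d = acc_a - acc_b
k = len(d)
t = d.mean() / np.sqrt(d.var(ddof=1) / k)  # t-statistic from the previous slide

alpha = 0.05                               # 5% significance level
z = stats.t.ppf(1 - alpha / 2, df=k - 1)   # two-tailed critical value
print(t, z, abs(t) >= z)                   # difference is significant if |t| >= z

print(stats.ttest_rel(acc_a, acc_b))       # same paired test, done by SciPy
```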

Unpaired Observations
- If the CV estimates are from different datasets, they are no longer paired (or maybe we have k estimates for one scheme and j estimates for the other scheme)
- Then we have to use an unpaired t-test with min(k, j) − 1 degrees of freedom
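A sketch with SciPy and illustrative data; note that scipy.stats.ttest_ind with equal_var=False runs Welch's unpaired test, which estimates the degrees of freedom itself rather than using the conservative min(k, j) − 1 above.

```python
import numpy as np
from scipy import stats

# Illustrative estimates from different datasets: k = 6 for A, j = 5 for B
acc_a = np.array([0.82, 0.79, 0.85, 0.80, 0.78, 0.83])
acc_b = np.array([0.80, 0.77, 0.82, 0.79, 0.76])

t, p = stats.ttest_ind(acc_a, acc_b, equal_var=False)  # unpaired (Welch) t-test
print(t, p)
```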

Dependent Estimates
- We assumed that we have enough data to create several datasets of the desired size
- Need to re-use data if that's not the case
  - E.g., running cross-validations with different randomizations on the same data
- Samples become dependent → insignificant differences can become significant
- A heuristic test is the corrected resampled t-test:
  - Assume we use the repeated hold-out method, with n_1 instances for training and n_2 for testing
  - The new test statistic is: t = m_d / √((1/k + n_2/n_1) · σ_d²)
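A minimal sketch of the corrected statistic, assuming k repeated hold-out runs with n_1 training and n_2 test instances; the function name and the example numbers are illustrative.

```python
import numpy as np
from scipy import stats

def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t-test for k repeated hold-out runs:
    diffs[i] is the accuracy difference between the two schemes on run i."""
    d = np.asarray(diffs, dtype=float)
    k = len(d)
    # The variance term is inflated by n2/n1 to compensate for the
    # dependence introduced by the overlapping training sets.
    t = d.mean() / np.sqrt((1.0 / k + n_test / n_train) * d.var(ddof=1))
    p = 2 * stats.t.sf(abs(t), df=k - 1)  # two-tailed p-value, k-1 d.f.
    return t, p

# Example: 10 repetitions of a 90%/10% hold-out on 1,000 instances
t, p = corrected_resampled_t([0.02, 0.01, 0.03, 0.02, 0.00,
                              0.02, 0.01, 0.03, 0.02, 0.01], 900, 100)
print(t, p)
```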

Credibility: Evaluating What's Been Learned
- Issues: training, testing, tuning
- Predicting performance
- Holdout, cross-validation, bootstrap
- Comparing schemes: the t-test
- Later we'll come back to:
  - Predicting probabilities: loss functions
  - Cost-sensitive measures
  - Evaluating numeric prediction
  - The Minimum Description Length principle