# October 1999 Statistical Methods for Computer Science Marie desJardins CMSC 601 April 9, 2012 Material adapted.

October 1999 Statistical Methods for Computer Science Marie desJardins (mariedj@cs.umbc.edu) CMSC 601 April 9, 2012 Material adapted from slides by Tom Dietterich, with permission

Statistical Analysis of Data

- Given a set of measurements of a value, how certain can we be of the value?
- Given a set of measurements of two values, how certain can we be that the two values are different?
- Given a measured outcome, along with several condition or treatment values, how can we remove the effect of unwanted conditions or treatments on the outcome?

Measuring CPU Time

CPU Data Distribution

Kernel Density Estimate

- Kernel density: place a small Gaussian distribution (“kernel”) around each data point, and sum them
  - Useful for visualization; also often used as a regression technique
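
The kernel density idea can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the slides: the function name `gaussian_kde`, the bandwidth of 0.01, and the sample values are all assumptions made for the example. The sum of kernels is divided by the number of points so that the result integrates to 1.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates the density at x by averaging
    Gaussian kernels of width `bandwidth` centered on each data point."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

# Hypothetical CPU-time measurements (seconds), for illustration only.
samples = [0.24, 0.25, 0.25, 0.26, 0.24, 0.25, 0.27, 0.25]
kde = gaussian_kde(samples, bandwidth=0.01)

# Evaluate the estimate on a grid from 0.20 to 0.30, e.g. for plotting.
grid = [0.20 + 0.001 * i for i in range(101)]
curve = [kde(x) for x in grid]
```

The bandwidth controls how smooth the curve looks: a larger value blurs the individual points together, a smaller one shows each point as its own bump.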

Sample Mean

- The data seem to be reasonably close to a normal (Gaussian, or bell curve) distribution
- Given this assumption, we can compute a sample mean:
- How certain can we be that this is the true value?
- Confidence interval [min, max]:
  - Suppose we drew many random samples of size n=37, and computed the sample means
  - 95% of the time, the sample mean would lie between min and max

Confidence Intervals via Resampling

- We can simulate this process algorithmically
- Draw 1000 random subsamples (with replacement) from the original 37 points
  - This process makes no assumption about a Gaussian distribution!
- Sort the means of these subsamples
- Choose the 26th and 975th values as the min and max of a 95% confidence interval (includes 95% of the sample means!)
- Result: The resampled confidence interval is [0.245, 0.251]
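
The resampling procedure above can be sketched as follows. The original 37 CPU measurements are not reproduced in this transcript, so `cpu_times` below is synthetic stand-in data; the function name `bootstrap_mean_ci` and the fixed seeds are likewise assumptions made so the example is repeatable.

```python
import random
import statistics

def bootstrap_mean_ci(data, n_resamples=1000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n points with replacement, many times, and sort the means.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # With 1000 resamples and 95% confidence, k = 25: we take the
    # 26th and 975th sorted values (0-based indices 25 and 974).
    k = round((1.0 - confidence) / 2.0 * n_resamples)
    return means[k], means[n_resamples - k - 1]

# Synthetic stand-in for the 37 CPU measurements (NOT the slide data).
_rng = random.Random(1)
cpu_times = [round(_rng.gauss(0.248, 0.01), 3) for _ in range(37)]

low, high = bootstrap_mean_ci(cpu_times)
```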

Confidence Intervals via Distributional Theory

- The Central Limit Theorem says that, for sufficiently large samples, the distribution of the sample means is approximately normal
- If the original data is normally distributed with mean μ and standard deviation σ, then the sample means will be normally distributed with mean μ and standard deviation σ’ = σ/√n (but we don’t know the original μ and σ...):
- Note that it isn’t important to remember this formula, since Matlab, R, etc. will do this for you. But it is very important to understand why you are computing it!
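
Since μ and σ are unknown, the usual approach is to estimate them from the sample itself. The block below writes this out under the standard textbook definitions; the symbols x_i (the individual measurements) and s (the sample standard deviation) are notation introduced here, not taken from the slides.

```latex
\mu' = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \mu'\right)^2}, \qquad
\sigma' \approx \frac{s}{\sqrt{n}}
```

Because σ’ shrinks with √n, quadrupling the number of measurements only halves the uncertainty in the mean.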

t Distribution

- Instead of assuming a normal distribution, we can use a t distribution (sometimes called a “Student’s t distribution”), which has three parameters: μ, σ, and the degrees of freedom (d.f. = n-1)
  - The probability density function looks somewhat like a normal distribution, but has a lower peak and heavier tails; it approaches the normal distribution as n increases
- This distribution yields slightly wider confidence limits than the normal distribution, using the central limit theorem:

Distributional Confidence Intervals

- We can use the mathematical formula for the t distribution to compute a p (typically, p=0.95) confidence interval:
  - The 0.025 t-value, t_0.025, is the value such that the probability that μ-μ’ < t_0.025 is 0.975
  - The 95% confidence interval is then [μ’-t_0.025, μ’+t_0.025]
- For the CPU example, t_0.025 is 0.028, so the distributional confidence interval is [0.220, 0.276], which is wider than the bootstrapped CI of [0.245, 0.251]
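
A minimal sketch of the t-based interval, using only the Python standard library. The standard library has no t quantile function, so the critical value is passed in from a t table; the data, the function name `t_confidence_interval`, and the choice of n = 10 are assumptions for illustration. The `half_width` computed here plays the role of t_0.025 on the slide, which is already scaled by the standard error.

```python
import math
import statistics

def t_confidence_interval(data, t_critical):
    """Confidence interval for the mean: mean +/- t_critical * s / sqrt(n).
    `t_critical` must be looked up for d.f. = len(data) - 1."""
    n = len(data)
    mean = statistics.fmean(data)
    std_err = statistics.stdev(data) / math.sqrt(n)
    half_width = t_critical * std_err
    return mean - half_width, mean + half_width

# Hypothetical measurements (NOT the slide data); n = 10, so d.f. = 9.
times = [0.24, 0.25, 0.25, 0.26, 0.24, 0.25, 0.27, 0.25, 0.26, 0.23]

# 2.262 is the tabulated two-sided 95% critical value of t with 9 d.f.
low, high = t_confidence_interval(times, t_critical=2.262)
```

In practice a statistics package (Matlab, R, and so on) looks up the critical value for you.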

Bootstrap Computations of Other Statistics

- The bootstrap method can be used to compute other sample statistics for which the distributional method isn’t appropriate:
  - median
  - mode
  - variance
- Because the tails and outlying values may not be well represented in a sample, the bootstrap method is not as useful for statistics involving the “ends” of the distribution:
  - minimum
  - maximum
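
The same resampling loop works for any statistic if the statistic is passed in as a function. The sketch below assumes a hypothetical helper `bootstrap_ci` and made-up data; it is one way to organize the computation, not the method used for the slides.

```python
import random
import statistics

def bootstrap_ci(data, statistic, n_resamples=1000, seed=0):
    """95% percentile bootstrap interval for an arbitrary statistic."""
    rng = random.Random(seed)
    n = len(data)
    values = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    k = round(0.025 * n_resamples)  # 25 when n_resamples is 1000
    return values[k], values[n_resamples - k - 1]

# Hypothetical measurements with one outlier, for illustration only.
times = [0.24, 0.25, 0.25, 0.26, 0.24, 0.25, 0.27, 0.25, 0.26, 0.23, 0.31, 0.25]

median_ci = bootstrap_ci(times, statistics.median)
variance_ci = bootstrap_ci(times, statistics.variance)
```

Passing `max` or `min` as the statistic would run, but as noted above the result is not trustworthy: a resample can never contain a value more extreme than the ones already observed.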

Measuring the Number of Occurrences of Events

- In CS, we often want to know how often something occurs:
  - How many times does a process complete successfully?
  - How many times do we correctly predict membership in a class?
  - How many times do we find the top search result?
- Again, the sample rate θ’ is what we have observed, but we would like to know the “true” rate θ

Bootstrap Confidence Intervals for Rates

- Suppose we have observed 100 predictions of a decision tree, and 88 of them were correct
- Draw many (say, 1000) samples of size n, with replacement, from the n observed predictions (here, n=100), and compute the sample classification rate
- Sort the sample rates θ_i in increasing order
- Choose the 26th and 975th values as the ends of the confidence interval: here, the confidence interval is [0.81, 0.94]
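
The steps above can be sketched directly, using the 88-out-of-100 example from the slide. The function name `bootstrap_rate_ci` and the fixed seed are assumptions; because resampling is random, the endpoints will vary slightly from run to run and from the values quoted on the slide.

```python
import random

def bootstrap_rate_ci(successes, n, n_resamples=1000, seed=0):
    """95% percentile bootstrap interval for a success rate."""
    rng = random.Random(seed)
    # Represent the observed predictions as 1 (correct) and 0 (incorrect).
    outcomes = [1] * successes + [0] * (n - successes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    k = round(0.025 * n_resamples)  # 26th and 975th values for 1000 resamples
    return rates[k], rates[n_resamples - k - 1]

low, high = bootstrap_rate_ci(88, 100)
```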

Binomial Distributional Confidence

- If we assume that the classifier is a “biased coin” with probability θ of coming up heads, then we can use the binomial distribution to analytically compute the confidence interval
  - This requires a small correction because the binomial distribution is actually discrete, but we want a continuous estimate
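
One standard way to get an interval straight from the binomial distribution is the exact (Clopper-Pearson) interval, which works with the discrete distribution itself instead of applying a continuity correction. This is a swapped-in technique chosen for illustration, not necessarily the calculation the slide had in mind; the function names are hypothetical.

```python
import math

def binom_cdf(k, n, theta):
    """P(X <= k) for X ~ Binomial(n, theta)."""
    return sum(
        math.comb(n, i) * theta**i * (1.0 - theta) ** (n - i)
        for i in range(k + 1)
    )

def solve_decreasing(f, target, lo=0.0, hi=1.0, iterations=60):
    """Bisection for f(theta) = target, where f is decreasing in theta."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def exact_binomial_ci(successes, n, confidence=0.95):
    """Clopper-Pearson interval for the true rate theta."""
    alpha = 1.0 - confidence
    # Lower limit: the theta at which P(X >= successes) equals alpha / 2.
    if successes == 0:
        lower = 0.0
    else:
        lower = solve_decreasing(
            lambda t: binom_cdf(successes - 1, n, t), 1.0 - alpha / 2.0
        )
    # Upper limit: the theta at which P(X <= successes) equals alpha / 2.
    if successes == n:
        upper = 1.0
    else:
        upper = solve_decreasing(
            lambda t: binom_cdf(successes, n, t), alpha / 2.0
        )
    return lower, upper

lower, upper = exact_binomial_ci(88, 100)
```

The exact interval tends to be a little wider than the bootstrap interval for the same data, because it guarantees at least the stated coverage.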

Comparing Two Measurements

- Consider the CPU measurements of the earlier example, and suppose we have performed the same computation on a different machine, yielding these CPU times:
  - 0.21 0.20 0.20 0.19 0.20 0.19 0.18 0.20 0.19 0.19 0.19 0.19 0.20 0.18 0.19 0.20 0.22 0.20 0.20 0.20 0.19 0.20 0.18 0.19 0.19 0.20 0.20 0.22 0.18 0.29 0.21 0.23 0.20
- These times certainly seem faster than those of the first machine, which yielded a distributional confidence interval of [0.220, 0.276], but how can we be sure?

Kernel Density Comparison

- Visually, the second machine (Shark3) is much faster than the first (Darwin):

Difference Estimation

- Bootstrap testing:
  - Repeat many times:
    - Draw a bootstrap sample from each of the machines, and compute the sample means
  - If Shark3 is faster than Darwin more than 95% of the time, we can be 95% confident that it really is faster
  - We can also compute a 95% bootstrap confidence interval on the difference between the means; this turns out to be [0.0461, 0.0553]
- If the samples are drawn from t distributions, then the difference between the sample means also has a t distribution
  - Confidence interval on this difference: [0.0463, 0.0555]
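
The bootstrap test can be sketched as below. The Shark3 times are the 33 values listed on the earlier slide; the Darwin measurements are not reproduced in this transcript, so the `darwin` list is a hypothetical stand-in and the resulting numbers will not match the slide. The function name and seed are also assumptions.

```python
import random
import statistics

def bootstrap_mean_difference(a, b, n_resamples=1000, seed=0):
    """Return (fraction of resamples in which mean(a) > mean(b),
    95% percentile interval for mean(a) - mean(b))."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        mean_a = statistics.fmean(rng.choices(a, k=len(a)))
        mean_b = statistics.fmean(rng.choices(b, k=len(b)))
        diffs.append(mean_a - mean_b)
    diffs.sort()
    fraction_a_larger = sum(d > 0 for d in diffs) / n_resamples
    k = round(0.025 * n_resamples)
    return fraction_a_larger, (diffs[k], diffs[n_resamples - k - 1])

# Shark3 CPU times as listed on the "Comparing Two Measurements" slide.
shark3 = [0.21, 0.20, 0.20, 0.19, 0.20, 0.19, 0.18, 0.20, 0.19, 0.19, 0.19,
          0.19, 0.20, 0.18, 0.19, 0.20, 0.22, 0.20, 0.20, 0.20, 0.19, 0.20,
          0.18, 0.19, 0.19, 0.20, 0.20, 0.22, 0.18, 0.29, 0.21, 0.23, 0.20]

# Hypothetical stand-in for the Darwin measurements (NOT the slide data).
darwin = [0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.26, 0.24, 0.25, 0.25, 0.24]

# Larger CPU time means slower, so this is the fraction of resamples
# in which Darwin is slower than Shark3.
fraction_darwin_slower, diff_ci = bootstrap_mean_difference(darwin, shark3)
```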

Hypothesis Testing

- Is the true difference zero, or more than zero?
- Use classical statistical rejection testing
  - Null hypothesis: The two machines have the same speed (i.e., μ, the difference between the mean CPU times, is equal to zero)
  - Can we reject this hypothesis, based on the observed data?
  - If the null hypothesis were true, what is the probability we would have observed this data?
  - We can measure this probability using the t distribution
  - In this case, the computed t value = (μ1 - μ2) / σ’ = 21.69
  - The probability of seeing this t value, if μ were actually zero, is nearly nonexistent: the 99.999% confidence interval (for the null hypothesis) is [-4.59, 4.59], so the probability of this t value is (much) less than 0.00001
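
A sketch of the t value computation for two independent samples. The slides do not say how σ’ was computed for the difference; the version below assumes the unpooled (Welch) standard error, and the data are hypothetical.

```python
import math
import statistics

def two_sample_t(a, b):
    """t = (mean_a - mean_b) / sigma', where sigma' is the unpooled
    standard error sqrt(s_a^2 / n_a + s_b^2 / n_b) (an assumption)."""
    mean_diff = statistics.fmean(a) - statistics.fmean(b)
    std_err = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return mean_diff / std_err

# Hypothetical CPU times for two machines (NOT the slide data).
machine1 = [0.25, 0.24, 0.25, 0.26, 0.24, 0.25, 0.25, 0.26]
machine2 = [0.20, 0.19, 0.20, 0.21, 0.19, 0.20, 0.22, 0.20]

t_value = two_sample_t(machine1, machine2)
```

The t value is then compared against the t distribution: the further it lies outside the interval expected under the null hypothesis, the less plausible it is that the two machines have the same speed.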

Paired Differences

- Suppose we had a set of 10 different benchmark programs that we ran on the two machines, yielding these CPU times:
- Obviously, we don’t want to just compare the means, since the programs have such different running times

Kernel Density Visualization

- CPU1 seems to be systematically faster (offset to the left) than CPU2

Scatterplot Visualization

- CPU1 is always faster than CPU2 (i.e., above the diagonal line that corresponds to equal speed)

Sequential Visualization

- The correlation of program “difficulty” (and faster CPU speed of CPU1) is even more obvious in this ordered (by program number) line plot:

Distribution Analysis I

- If the differences are in the same “units,” we can subtract the CPU times for the “paired” tests and assume a t distribution on these differences
  - The probability of observing a sample mean difference as large as 0.2779, given a null hypothesis of the machines having the same speed, is 0.0000466, so we can reject the null hypothesis
- If we have no prior belief about which machine is faster, we should use a “two-tailed test”
  - The probability of observing a sample mean difference this large in either direction is 0.0000932: twice as large, but still sufficiently improbable that we can be sure that the machines have different speeds
- Note that we can also use a bootstrap analysis on this type of paired data
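
A sketch of the paired analysis. The benchmark table is not reproduced in this transcript, so the ten pairs of times below are hypothetical, as is the function name. The critical values are the tabulated 95% values of t with 9 degrees of freedom (1.833 one-tailed, 2.262 two-tailed).

```python
import math
import statistics

def paired_t_statistic(x, y):
    """Return (mean difference, t value) for paired measurements,
    with differences taken as y - x."""
    diffs = [yi - xi for xi, yi in zip(x, y)]
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff, mean_diff / std_err

# Hypothetical CPU times for 10 benchmark programs (NOT the slide data).
cpu1 = [0.5, 1.2, 3.1, 7.8, 0.9, 15.2, 2.4, 5.5, 10.1, 4.0]
cpu2 = [0.6, 1.5, 3.5, 8.3, 1.0, 16.0, 2.8, 5.9, 10.8, 4.3]

mean_diff, t_value = paired_t_statistic(cpu1, cpu2)

# Tabulated critical values for d.f. = 9 at the 95% level.
reject_one_tailed = t_value > 1.833
reject_two_tailed = abs(t_value) > 2.262
```

Pairing removes the large program-to-program variation: only the per-program differences enter the standard error, which is why the test can detect a small but consistent gap.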

Paired vs. Non-Paired

- If we don’t pair the data (just compare the overall mean, not the differences for paired tests):
  - Distributional analysis doesn’t let us reject the null hypothesis
  - Bootstrap analysis doesn’t let us reject the null hypothesis

Sign Tests

- I mentioned before that the paired t-test is appropriate if the measurements are in the same “units”
- If the magnitude of the difference is not important, or not meaningful, we still can compare performance
- Look at the sign of the difference (here, CPU1 is faster 10 out of 10 times; but in another case, it might only be faster 9 out of 10 times)
- Use the binomial distribution (flip a coin to get the sign) to compute a confidence interval for the probability that CPU1 is faster
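
A closely related calculation is the sign test p-value: if the two machines were equally fast, each sign would be a fair coin flip, so we can ask how likely a split at least this lopsided would be. The sketch below computes that two-tailed probability; the function name is an assumption, and this is the p-value form of the test, not the confidence interval mentioned on the slide.

```python
import math

def sign_test_p_value(wins, n):
    """Two-tailed probability, under a fair coin, of a split at least
    as lopsided as `wins` out of `n`."""
    extreme = max(wins, n - wins)
    tail = sum(math.comb(n, k) for k in range(extreme, n + 1)) / 2**n
    return min(1.0, 2.0 * tail)

print(sign_test_p_value(10, 10))  # 0.001953125  (2 / 1024)
print(sign_test_p_value(9, 10))   # 0.021484375  (22 / 1024)
```

Both cases fall below the conventional 0.05 threshold, so even 9 wins out of 10 would count as evidence that one machine is faster.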

Other Important Topics

- Regression analysis
- Cross-validation
- Human subjects analysis and user study design
- Analysis of Variance (ANOVA)
- For your particular investigation, you need to know which of these topics are relevant, and to learn about them!

Statistically Valid Experimental Design

- Make sure you understand the nuances before you design your experiments, and definitely before you analyze your experimental data!
- Designing the statistical methods (and hypotheses) after the fact is not valid!
  - You can often find a hypothesis and associated statistical method ex post facto, i.e., design an experiment to fit the data instead of the other way around
  - In the worst case, doing this is downright unethical
  - In the best case, it shows a lack of clear research objectives and may not be reproducible or meaningful
