Statistics for Data Miners: Part I (continued) S.T. Balke.

1 Statistics for Data Miners: Part I (continued) S.T. Balke

2 Probability = Relative Frequency

3 Typical Distribution for a Discrete Variable

4 Typical Distribution for a Continuous Variable Normal (Gaussian) Distribution

5 Probability Density Function

6 The Normal Distribution (Also termed the “Gaussian Distribution”) Note: f(x)dx is the probability of observing a value of x between x and x+dx. Note the statement on page 87 of the text re: dx canceling for the Bayesian method.

7 Selecting One Normal Distribution The Normal Distribution can fit data with any mean and any standard deviation. Which one shall we focus on? We do need to focus on just one, for tables and for theoretical developments.

8 Need for the Standard Normal Distribution The mean, μ, and standard deviation, σ, depend upon the data; a wide variety of values are possible. To generalize about data we need: – to define a standard curve and – a method of converting any Normal curve to the standard Normal curve

9 The Standard Normal Distribution μ = 0, σ = 1

10 The Standard Normal Distribution

11 P.D.F. of z

12 Transforming Normal to Standard Normal Distributions Observations x_i are transformed to z_i: $z_i = (x_i - \mu)/\sigma$. This allows us to go from f(x) versus x to f(z) versus z. Areas under f(z) versus z are tabulated.
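
The transformation on this slide, in the usual notation z = (x - μ)/σ, can be sketched in a few lines. The population values (mean 100, standard deviation 15) and the function name `to_z` are illustrative assumptions, not taken from the slides; `statistics.NormalDist` from the Python standard library stands in for the printed table of areas.

```python
from statistics import NormalDist

def to_z(x, mu, sigma):
    """Convert an observation x from a Normal(mu, sigma) curve to the standard normal scale."""
    return (x - mu) / sigma

# Hypothetical population: mean 100, standard deviation 15 (assumed for illustration).
mu, sigma = 100.0, 15.0

z = to_z(130.0, mu, sigma)         # 130 lies 2 standard deviations above the mean
area_below = NormalDist().cdf(z)   # area under f(z) versus z to the left of z

print(z, round(area_below, 4))     # prints: 2.0 0.9772
```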

13 The Use of Standard Normal Curves Statistical Tables Convert x to z Use tables of area of curve segments between different z values on the standard normal curve to define probabilities
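
As a rough sketch of the table lookup described here, the area between two z values is the difference of two cumulative areas. The helper names `area_between` and `prob_x_between` are hypothetical, and the standard library's `NormalDist` is used in place of a printed z table.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)   # the standard normal curve

def area_between(z_low, z_high):
    """Area under the standard normal curve between two z values."""
    return STD_NORMAL.cdf(z_high) - STD_NORMAL.cdf(z_low)

def prob_x_between(x_low, x_high, mu, sigma):
    """Probability that x falls in [x_low, x_high]: convert x to z, then look up the area."""
    return area_between((x_low - mu) / sigma, (x_high - mu) / sigma)

# About 95% of the area lies within 1.96 standard deviations of the mean.
print(round(area_between(-1.96, 1.96), 3))
```

Because the conversion to z happens first, the same lookup serves any Normal curve, which is the reason for standardizing.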

14 Z Table http://www.statsoft.com/textbook/stathome.html

15 Emphasis on Mean Values We are really not interested in individual observations as much as we are in the mean value. Now we have f(x) versus x where x is the value of observations. We need to deal in xbar, the sample mean, instead of individual x values.

16 Introduction to Inferential Statistics Inferential statistics refers to methods for making generalizations about populations on the basis of data from samples

17 Sample Quantities Mean: xbar is an estimate of μ. Standard Deviation: s is an estimate of σ. Note: These quantities can be for any distribution, Normal or otherwise.

18 Population and Sample Measures Parameters: Mean of the Population: μ; Standard Deviation of the Population: σ; Variance of the Population: σ². Statistics (sample estimates of the parameters): Sample estimate of μ: xbar; Sample estimate of σ: s
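
A minimal sketch of the two sample statistics, assuming the usual definitions: the sample mean is the sum of the observations divided by n, and the sample standard deviation uses the n-1 divisor. The slide's own formulas did not survive in this transcript, so the n-1 divisor is an assumption; the data values are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of n = 8 observations.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(sample)

xbar = sum(sample) / n                                    # sample estimate of the population mean
s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))  # sample estimate of the population std. dev.

# The standard library gives the same values (stdev also uses the n-1 divisor).
print(xbar, mean(sample))
print(s, stdev(sample))
```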

19 Population and Samples n observations per sample.

20 P.D.F. of the Sample Means Note: The std. dev. of this distribution is $\sigma_{\bar{x}}$

21 Types of Estimators Point estimator - gives a single value as an estimate of the parameter of interest Interval estimator - specifies a range of values of the parameter and our confidence that the parameter value is in that range

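22 Point Estimators Unbiased estimator: averaged over many repeated samples, the value of the estimator equals the value of the population parameter (its expected value equals the parameter). An estimator whose value approaches the population parameter as the number of observations, n, in the sample increases is termed consistent.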

23 Interval Estimators P(lower limit < parameter < upper limit) = 1-α. Lower limit and upper limit = confidence limits; upper limit - lower limit = width of the confidence interval; 1-α = confidence level (also: degree of confidence, confidence coefficient)

24 Comments on the Need to Transform to z for C.I. of Means We have a point estimate of μ, xbar. Now the interval estimate consists of a lower and an upper bound around our point estimate of the population mean: $P(\mu_{low} < \mu < \mu_{high}) = 1 - \alpha$

25 Confidence Interval for a Population Mean $P(\mu_{low} < \mu < \mu_{high}) = 1 - \alpha$. If f(xbar) versus xbar is a Normal distribution and if we can define z as we did before, then: $\mu_{low} = \bar{x} - z_{\alpha/2}\,\sigma_{\bar{x}}$ and $\mu_{high} = \bar{x} + z_{\alpha/2}\,\sigma_{\bar{x}}$

26 A Standard Distribution for f(xbar) versus xbar Previously we transformed f(x) versus x to f(z) versus z We can still use f(z) versus z as our standard distribution. Now we need to transform f(xbar) versus xbar to f(z) versus z.

27 P.D.F. of the Sample Means Note: The std. dev. of this distribution is $\sigma_{\bar{x}}$

28 P.D.F. of z

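29 Transforming Normal to Standard Normal Distributions This time the sample means, xbar, are transformed to z: $z = (\bar{x} - \mu_{\bar{x}})/\sigma_{\bar{x}}$. Note that now we use the mean, $\mu_{\bar{x}}$, and standard deviation, $\sigma_{\bar{x}}$, of the p.d.f. of xbar.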

30 The Normal Distribution Family

31 Remaining Questions When can we assume that f(xbar) versus xbar is a Normal Distribution? – when f(x) versus x is a Normal Distribution – but what if f(x) versus x is not a Normal Distribution? How can we calculate μ and σ for the f(xbar) versus xbar distribution?

32 The Answer to Both Questions The Central Limit Theorem

33 The Central Limit Theorem If x is distributed with mean μ and standard deviation σ, then the sample mean (xbar) obtained from a random sample of size n will have a distribution that approaches a normal distribution with mean μ and standard deviation $\sigma/\sqrt{n}$ as n is increased. Note that the distribution of x is not necessarily Normal.

34 The Central Limit Theorem If x is distributed with mean μ and standard deviation σ, then the sample mean (xbar) obtained from a random sample of size n will have a distribution that approaches a normal distribution with mean μ and standard deviation $\sigma/\sqrt{n}$ as n is increased. Every member of the population must have an equally likely chance of becoming a member of your sample.

35 The Central Limit Theorem If x is distributed with mean μ and standard deviation σ, then the sample mean (xbar) obtained from a random sample of size n will have a distribution that approaches a normal distribution with mean μ and standard deviation $\sigma/\sqrt{n}$ as n is increased.

36 The Central Limit Theorem If x is distributed with mean μ and standard deviation σ, then the sample mean (xbar) obtained from a random sample of size n will have a distribution that approaches a normal distribution with mean μ and standard deviation $\sigma/\sqrt{n}$ as n is increased.

37 The Central Limit Theorem If x is distributed with mean μ and standard deviation σ, then the sample mean (xbar) obtained from a random sample of size n will have a distribution that approaches a normal distribution with mean μ and standard deviation $\sigma/\sqrt{n}$ as n is increased. Note: The standard deviation depends upon n, the number of replicate observations in each sample.

38 The Central Limit Theorem If x is distributed with mean μ and standard deviation σ, then the sample mean (xbar) obtained from a random sample of size n will have a distribution that approaches a normal distribution with mean μ and standard deviation $\sigma/\sqrt{n}$ as n is increased. Note: n, the number of replicates per sample, should be at least thirty.
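
The theorem can be checked with a small simulation. This is only a sketch under assumed settings: the population is exponential (mean 1, standard deviation 1, clearly not Normal), with n = 30 observations per sample and 5000 samples. None of these numbers come from the slides.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)        # fixed seed so the run is repeatable

n = 30                # observations per sample
num_samples = 5000    # number of samples drawn (illustrative choice)

# Exponential population with rate 1: mean 1, standard deviation 1, strongly skewed.
mu, sigma = 1.0, 1.0

# One sample mean per sample.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

observed_mean = mean(sample_means)   # should be close to mu
observed_sd = stdev(sample_means)    # should be close to sigma / sqrt(n)
predicted_sd = sigma / sqrt(n)

print(observed_mean, observed_sd, predicted_sd)
```

A histogram of `sample_means` would look roughly bell-shaped even though the individual observations are skewed.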

39 Calculating a Confidence Interval Assume: n ≥ 30, σ known

40 Effect of 1-α [Figure: areas under the curve labeled α/2, 1-α, α/2]
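
Under the stated assumptions (large n, known population standard deviation), the interval is xbar plus or minus z times σ divided by the square root of n, where z cuts off an upper tail area of α/2. A minimal sketch follows; the function name and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for the population mean when sigma is known and n is large."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # z value with upper tail area alpha/2
    sigma_xbar = sigma / sqrt(n)                  # std. dev. of the sample mean
    return xbar - z * sigma_xbar, xbar + z * sigma_xbar

# Hypothetical example: xbar = 50 from n = 100 observations, sigma = 10.
low, high = z_confidence_interval(50.0, 10.0, 100, confidence=0.95)
print(round(low, 2), round(high, 2))   # prints: 48.04 51.96
```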

41 Understanding What is a 95% Confidence Interval If we compute values of the confidence interval with many different random samples from the same population, then in about 95% of those samples, the value of the 95% c.i. so calculated would include the value of the population mean, μ. Note that μ is a constant. The c.i. vary because they are each based on a sample.

42 Summary The Binomial Distribution; Histograms and p.d.f.’s; Area segments and the normal distribution; The standard normal distribution; p.d.f. of the sample means; est. of mean = point est. + interval est.; The Central Limit Theorem

43 Improving the Estimate of the Mean Reduce the confidence interval. Variables to examine: 1-α, n, σ

44 Effect of 1-α [Figure: areas under the curve labeled α/2, 1-α, α/2]
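
This repeated-sampling interpretation can be illustrated by simulation. The sketch below assumes a Normal population with mean 10 and standard deviation 2, n = 30 per sample, and 2000 repeated samples; all of these are made-up settings.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(1)            # fixed seed so the run is repeatable

mu, sigma = 10.0, 2.0     # population parameters (mu is a constant)
n = 30                    # observations per sample
trials = 2000             # number of repeated samples

z = NormalDist().inv_cdf(0.975)     # z value for a 95% interval
half_width = z * sigma / sqrt(n)

covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = mean(sample)
    # Each sample gives a different interval; count those that contain mu.
    if xbar - half_width < mu < xbar + half_width:
        covered += 1

coverage = covered / trials
print(coverage)   # expected to be near 0.95
```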

45 Effect of n (the sampling distribution of xbar)

46 Effect of σ
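
The three effects can be compared numerically through the half-width of the interval, z times σ divided by the square root of n. The sketch assumes the large-sample, known-σ interval; the function name and the numbers are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def half_width(confidence, n, sigma):
    """Half-width of the large-sample confidence interval for the mean."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * sigma / sqrt(n)

base = half_width(0.95, 100, 10.0)

# Effect of the confidence level: a lower level gives a narrower interval.
print(base, half_width(0.80, 100, 10.0))

# Effect of n: four times as many observations halves the half-width.
print(base, half_width(0.95, 400, 10.0))

# Effect of sigma: halving the noise halves the half-width.
print(base, half_width(0.95, 100, 5.0))
```

Of the three, only n and σ narrow the interval without giving up confidence.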

47 Understanding the Question If we are asked to estimate the value of the population mean then we provide: –the point estimate + the interval estimate of the mean If we are asked to estimate the noise in the experimental technique then we provide: –the point estimate + the interval estimate of the standard deviation (something not reviewed yet)

48 Complication for Small Samples For small samples (n<30), if the observations, x, follow a Normal distribution, and if σ must be approximated by s, then the standardized sample mean, $(\bar{x} - \mu)/(s/\sqrt{n})$, follows a “Student’s t” distribution (with n-1 degrees of freedom) rather than a Normal distribution. So, we must use t instead of z.

49 Confidence Intervals for Small Samples (n<30) Assume the x_i follow a Normal distribution. Assume σ is unknown. Use t and s instead of z and σ.

50 Large Samples: Estimation of C.I. for μ

51 Small Samples: Estimation of C.I. for μ No Soln
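
A sketch of the small-sample interval, xbar plus or minus t times s divided by the square root of n, with n-1 degrees of freedom. The Python standard library has no t distribution, so a few two-sided 95% critical values are copied in from a standard t table; the lookup table, the function name and the data are all assumptions for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom (n - 1).
# Only a few rows of a standard t table are reproduced here.
T_95 = {4: 2.776, 9: 2.262, 19: 2.093, 29: 2.045}

def t_interval_95(data):
    """95% confidence interval for the mean of a small, Normally distributed sample."""
    n = len(data)
    t = T_95[n - 1]          # raises KeyError for sample sizes not in the excerpt above
    xbar = mean(data)
    s = stdev(data)          # sample standard deviation (n - 1 divisor)
    half = t * s / sqrt(n)
    return xbar - half, xbar + half

# Hypothetical replicate measurements, n = 10.
data = [9.8, 10.2, 10.4, 9.9, 10.0, 10.1, 9.7, 10.3, 10.0, 9.6]
low, high = t_interval_95(data)
print(low, high)
```

With 9 degrees of freedom the critical value is 2.262 rather than the z value of 1.96, so the interval is wider than the large-sample formula would give.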

52 Return to a Data Mining Problem Predicting Classifier Performance...

53 Predicting Classifier Performance (Page 123) y=750 successes (symbol: S in text) n=1000 trials (symbol: N in text) f=y/n=0.750 success rate for the training set What will be the success rate for other data? What is the error in the estimate of f as 0.750? From statistics we can calculate that we are 80% confident that the confidence interval 0.732 to 0.767 will contain the true success rate for any data.

54 The Binomial Distribution The probability of y successes in n trials is: $g(y) = \binom{n}{y} p^{y} (1-p)^{n-y}$. The total probability of having any number of successes is the sum of all the g(y), which is unity. The probability of having any number of successes up to a certain value y’ is the sum of g(y) up to that value of y. See page 178 regarding quantifying the value of a rule.

55 Shape Changes for the Binomial Distribution If np>5 when p≤0.5, or if n(1-p)≥5 when p≥0.5, the Normal Distribution becomes a good approximation to the Binomial distribution: $N(np, \sqrt{np(1-p)}) = N(\mu, \sigma)$

56 Confidence Intervals for p $z = (y - np)/\sqrt{np(1-p)}$ is approximately N(0,1), where f(z) versus z is N(0,1)
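
The binomial probabilities can be computed directly with the standard formula: the number of ways to choose y of n trials, times p to the power y, times (1-p) to the power (n-y). The function names `binom_pmf` and `binom_cdf` and the example values (n = 10, p = 0.3) are hypothetical.

```python
from math import comb

def binom_pmf(y, n, p):
    """Probability of exactly y successes in n independent trials with success probability p."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)

def binom_cdf(y_max, n, p):
    """Probability of at most y_max successes: the sum of the individual probabilities."""
    return sum(binom_pmf(y, n, p) for y in range(y_max + 1))

n, p = 10, 0.3
total = sum(binom_pmf(y, n, p) for y in range(n + 1))

print(total)                 # sums to 1, up to floating-point rounding
print(binom_cdf(3, n, p))    # probability of 3 or fewer successes
```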
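
To see how good the approximation is, the exact binomial sum can be compared with the area under a Normal curve with mean np and standard deviation equal to the square root of np(1-p). In this sketch the +0.5 continuity correction is an addition that the slide does not mention, and the function names and example values (n = 100, p = 0.5) are assumptions.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(y_max, n, p):
    """Exact probability of at most y_max successes in n trials."""
    return sum(comb(n, y) * p**y * (1 - p) ** (n - y) for y in range(y_max + 1))

def normal_approx_cdf(y_max, n, p):
    """Normal approximation to the same probability."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    # The +0.5 continuity correction is an assumption added here, not taken from the slide.
    return NormalDist(mu, sigma).cdf(y_max + 0.5)

def approximation_ok(n, p):
    """Rule of thumb as quoted on the slide."""
    return n * p > 5 if p <= 0.5 else n * (1 - p) >= 5

n, p = 100, 0.5
exact = binom_cdf(55, n, p)
approx = normal_approx_cdf(55, n, p)
print(approximation_ok(n, p), exact, approx)
```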

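57 Calculating a Confidence Interval Recall, for large samples: $P(\bar{x} - z_{\alpha/2}\,\sigma_{\bar{x}} < \mu < \bar{x} + z_{\alpha/2}\,\sigma_{\bar{x}}) = 1 - \alpha$. So, now we could say: $P(y - z_{\alpha/2}\sqrt{np(1-p)} < np < y + z_{\alpha/2}\sqrt{np(1-p)}) = 1 - \alpha$. But we want the limits for p, not np.

58 Focus on p instead of np $P(f - z_{\alpha/2}\sqrt{p(1-p)/n} < p < f + z_{\alpha/2}\sqrt{p(1-p)/n}) = 1 - \alpha$, where f = y/n, but now p is on both sides of the equation!

59 Focus on p instead of np Let’s return to z: $z = (f - p)/\sqrt{p(1-p)/n}$ is approximately N(0,1), and now, solve for p: $p = \left(f + \frac{z^2}{2n} \pm z\sqrt{\frac{f}{n} - \frac{f^2}{n} + \frac{z^2}{4n^2}}\right) / \left(1 + \frac{z^2}{n}\right)$, where f=(y/n)=observed success rate. Two values of p are obtained: the upper and lower limits.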

60 Predicting Classifier Performance (Page 123) y=750 successes (symbol: S in text) n=1000 trials (symbol: N in text) f=y/n=0.750 If 1-α=0.80 (80% confidence=c in text) From z table: z=1.28 Interval from Eqn: 0.732, 0.767
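
The interval quoted on this slide can be reproduced by solving z = (f - p) / sqrt(p(1-p)/n) for p, which gives a quadratic in p with two roots. The sketch below uses the closed-form roots of that quadratic; the function name `success_rate_interval` is hypothetical, and z = 1.28 is the table value given on the slide.

```python
from math import sqrt

def success_rate_interval(y, n, z):
    """Lower and upper confidence limits for the true success rate p."""
    f = y / n                                             # observed success rate
    centre = f + z * z / (2 * n)
    spread = z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

# Values from the slide: 750 successes in 1000 trials, 80% confidence (z = 1.28).
low, high = success_rate_interval(750, 1000, 1.28)
print(round(low, 3), round(high, 3))   # prints: 0.732 0.767
```

The limits are not symmetric about f = 0.750, because p appears on both sides of the original inequality.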

61 Using the z Tables for the Binomial Distribution Where Φ(z) is the value obtained from the z table.

62 Summary The Binomial Distribution; Histograms and p.d.f.’s; Area segments and the normal distribution; The standard normal distribution; p.d.f. of the sample means; est. of mean = point est. + interval est.; The Central Limit Theorem; Confidence Intervals

63 In Two Weeks Hypothesis Testing: How do we know if we can accept a batch of material from a few replicate analyses of a sample? Are the error rates obtained from two data mining methods really different?

