Statistics for Data Miners: Part I (continued) S.T. Balke.

Probability = Relative Frequency

Typical Distribution for a Discrete Variable

Typical Distribution for a Continuous Variable: the Normal (Gaussian) Distribution

Probability Density Function

The Normal Distribution (Also termed the “Gaussian Distribution”) Note: f(x)dx is the probability of observing a value of x between x and x+dx. Note the statement on page 87 of the text re: dx canceling for the Bayesian method.
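The slide's equation is not reproduced in this transcript; for reference, the Normal probability density with mean μ and standard deviation σ has the standard form
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)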

Selecting One Normal Distribution The Normal Distribution can fit data with any mean and any standard deviation… which one shall we focus on? We do need to focus on just one… for tables and for theoretical developments.

Need for the Standard Normal Distribution The mean, μ, and standard deviation, σ, depend upon the data; a wide variety of values is possible. To generalize about data we need: – to define a standard curve and – a method of converting any Normal curve to the standard Normal curve

The Standard Normal Distribution: μ = 0, σ = 1

The Standard Normal Distribution

P.D.F. of z

Transforming Normal to Standard Normal Distributions Observations xi are transformed to zi (the formula is given below). This allows us to go from f(x) versus x to f(z) versus z. Areas under f(z) versus z are tabulated.
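The transformation itself (shown as an image on the original slide) is the usual standardization:
z_i = \frac{x_i - \mu}{\sigma}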

The Use of Standard Normal Curves (Statistical Tables): Convert x to z, then use tables of the area of curve segments between different z values on the standard normal curve to define probabilities.

Z Table

Emphasis on Mean Values We are really not interested in individual observations as much as we are in the mean value. Now we have f(x) versus x where x is the value of observations. We need to deal in xbar, the sample mean, instead of individual x values.

Introduction to Inferential Statistics Inferential statistics refers to methods for making generalizations about populations on the basis of data from samples

Sample Quantities The sample mean, xbar, is an estimate of μ; the sample standard deviation, s, is an estimate of σ. Note: These quantities can be for any distribution, Normal or otherwise.
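For reference, the usual defining formulas (not reproduced in the transcript) are
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}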

Population and Sample Measures Parameters: Mean of the Population, μ; Standard Deviation of the Population, σ; Variance of the Population, σ². Statistics (sample estimates of the parameters): the sample estimate of μ is xbar; the sample estimate of σ is s.

Population and Samples n observations per sample.

P.D.F. of the Sample Means Note: The std. dev. of this distribution is σxbar, the standard error of the mean.

Types of Estimators Point estimator - gives a single value as an estimate of the parameter of interest Interval estimator - specifies a range of values of the parameter and our confidence that the parameter value is in that range

Point Estimators Unbiased estimator: as the number of observations, n, in the sample increases, the average value of the estimator approaches the value of the population parameter.

Interval Estimators P(lower limit < parameter < upper limit) = 1 − α. Lower limit and upper limit = confidence limits; upper limit − lower limit = confidence interval; 1 − α = confidence level (also called degree of confidence or confidence coefficient).

Comments on the Need to Transform to z for C.I. of Means We have a point estimate of μ, xbar. Now the interval estimate consists of a lower and an upper bound around our point estimate of the population mean: P(μlow < μ < μhigh) = 1 − α

Confidence Interval for a Population Mean P(μlow < μ < μhigh) = 1 − α. If f(xbar) versus xbar is a Normal distribution and if we can define z as we did before, then: μlow = xbar − zα/2 σxbar and μhigh = xbar + zα/2 σxbar

A Standard Distribution for f(xbar) versus xbar Previously we transformed f(x) versus x to f(z) versus z We can still use f(z) versus z as our standard distribution. Now we need to transform f(xbar) versus xbar to f(z) versus z.

P.D.F. of the Sample Means Note: The std. dev. of this distribution is σxbar, the standard error of the mean.

P.D.F. of z

Transforming Normal to Standard Normal Distributions This time the sample means, xbar, are transformed to z (see below). Note that now we use μ and σxbar for the p.d.f. of xbar.
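In the usual notation, the transformation for sample means (the slide's equation is an image) is
z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}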

The Normal Distribution Family

Remaining Questions When can we assume that f(xbar) versus xbar is a Normal Distribution? – when f(x) versus x is a Normal Distribution – but… what if f(x) versus x is not a Normal Distribution? How can we calculate μ and σ for the f(xbar) versus xbar distribution?

The Answer to Both Questions The Central Limit Theorem

The Central Limit Theorem If x is distributed with mean μ and standard deviation σ, then the sample mean (xbar) obtained from a random sample of size n will have a distribution that approaches a normal distribution with mean μ and standard deviation σ/n^0.5 as n is increased. Note that the distribution of x is not necessarily Normal.

The Central Limit Theorem (restated with additional notes): – Every member of the population must have an equally likely chance of becoming a member of your sample. – The standard deviation of xbar, σ/n^0.5, depends upon n, the number of replicate observations in each sample. – n, the number of replicates per sample, should be at least thirty.
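A minimal simulation sketch of the theorem (illustrative only; the exponential population, seed, and sample sizes are arbitrary choices, not from the slides): sample means of a skewed population have a spread close to σ/n^0.5 and look increasingly Normal as n grows.
import numpy as np

rng = np.random.default_rng(0)                 # fixed seed so the sketch is reproducible
mu, sigma = 1.0, 1.0                           # mean and std. dev. of an exponential population (not Normal)
for n in (2, 10, 30, 100):                     # observations per sample
    samples = rng.exponential(scale=mu, size=(100_000, n))
    xbar = samples.mean(axis=1)                # 100,000 sample means
    # CLT prediction: mean of xbar ~ mu, std. dev. of xbar ~ sigma / sqrt(n)
    print(n, round(xbar.mean(), 3), round(xbar.std(ddof=1), 3), round(sigma / n**0.5, 3))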

Calculating a Confidence Interval Assume: n ≥ 30, σ known.
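Written out, the interval assumed here (σ known, n at least 30) is
\bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \;<\; \mu \;<\; \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}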

Effect of 1 − α (the central area is 1 − α, with α/2 in each tail)

Understanding What is a 95% Confidence Interval If we compute values of the confidence interval with many different random samples from the same population, then in about 95% of those samples, the value of the 95% c.i. so calculated would include the value of the population mean, μ. Note that μ is a constant. The confidence intervals vary because they are each based on a different sample.

Summary The Binomial Distribution; Histograms and p.d.f.'s; Area segments and the normal distribution; The standard normal distribution; p.d.f. of the sample means; est. of mean = point est. + interval est.; The Central Limit Theorem.

Improving the Estimate of the Mean Reduce the confidence interval. Variables to examine: 1 − α, n, σ

Effect of 1 − α (the central area is 1 − α, with α/2 in each tail)

Effect of n (the sampling distribution of xbar)

Effect of σ

Understanding the Question If we are asked to estimate the value of the population mean then we provide: –the point estimate + the interval estimate of the mean If we are asked to estimate the noise in the experimental technique then we provide: –the point estimate + the interval estimate of the standard deviation (something not reviewed yet)

Complication for Small Samples For small samples (n < 30), if the observations, x, follow a Normal distribution, and if σ must be approximated by s, then the standardized sample mean follows a “Student’s t” distribution rather than the standard Normal distribution. So, we must use t instead of z.

Confidence Intervals for Small Samples (n < 30) Assume the xi follow a Normal distribution. Assume σ is unknown. Use t and s instead of z and σ.
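The corresponding small-sample interval, with n − 1 degrees of freedom, is
\bar{x} - t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} \;<\; \mu \;<\; \bar{x} + t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}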

Large Samples: Estimation of C.I. for σ

Small Samples: Estimation of C.I. for σ (no solution given)

Return to a Data Mining Problem Predicting Classifier Performance…

Predicting Classifier Performance (Page 123) y = 750 successes (symbol: S in text) n = 1000 trials (symbol: N in text) f = y/n = 0.750 success rate for the training set What will be the success rate for other data? What is the error in the estimate of f as 0.750? From statistics we can calculate that we are 80% confident that the confidence interval 0.732 to 0.767 will contain the true success rate for any data.

The Binomial Distribution The probability of y successes in n trials, g(y), is given below. The total probability of having any number of successes is the sum of all the g(y), which is unity. The probability of having any number of successes up to a certain value y' is the sum of g(y) up to that value of y. See page 178 regarding quantifying the value of a rule.
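The probability referred to is the binomial probability mass function, with p the probability of success on a single trial:
g(y) = \binom{n}{y}\, p^{y} (1-p)^{n-y}, \qquad y = 0, 1, \ldots, n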

Shape Changes for the Binomial Distribution If np > 5 when p ≤ 0.5, or if n(1−p) ≥ 5 when p ≥ 0.5, the Normal Distribution becomes a good approximation to the Binomial distribution: N(np, (np(1−p))^0.5) = N(μ, σ)

Confidence Intervals for p The standardized count z (written out below) is approximately N(0,1), where f(z) versus z is N(0,1).
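Using the Normal approximation N(np, (np(1−p))^0.5) from the previous slide, the standardized count is
z = \frac{y - np}{\sqrt{np(1-p)}}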

Calculating a Confidence Interval Recall, for large samples, the confidence limits for the mean are xbar ± zα/2 σxbar. So, now we could say that the count y lies within np ± zα/2 (np(1−p))^0.5. But, we want the limits for p, not np.

Focus on p instead of np Dividing through by n gives limits in terms of p, but now p is on both sides of the equation!

Focus on p instead of np The statistic z written in terms of f is approximately N(0,1). Let's return to z and solve for p (the result is shown below), where f = y/n = observed success rate. Two values of p are obtained: the upper and lower limits.
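Solving the resulting quadratic in p gives the two limits. In terms of f = y/n and the chosen z, the usual form (consistent with the numbers worked on the next slide) is
p = \frac{f + \dfrac{z^2}{2n} \pm z\sqrt{\dfrac{f}{n} - \dfrac{f^2}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}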

Predicting Classifier Performance (Page 123) y=750 successes (symbol: S in text) n=1000 trials (symbol: N in text) f=y/n=0.750 If 1-α=0.80 (80% confidence=c in text) From z table: z=1.28 Interval from Eqn: 0.732, 0.767
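A quick numerical check of these limits, as a sketch: the formula is the one displayed above, and the z value for 80% two-sided confidence comes from the standard Normal inverse c.d.f. rather than a printed table.
import math
from statistics import NormalDist

y, n = 750, 1000
f = y / n                                  # observed success rate
z = NormalDist().inv_cdf(0.90)             # 80% two-sided confidence -> 0.10 in the upper tail, z ≈ 1.28
center = f + z**2 / (2 * n)
half_width = z * math.sqrt(f / n - f**2 / n + z**2 / (4 * n**2))
denom = 1 + z**2 / n
lower, upper = (center - half_width) / denom, (center + half_width) / denom
print(round(lower, 3), round(upper, 3))    # approximately 0.732 and 0.767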

Using the z Tables for the Binomial Distribution, where Φ(z) is the value obtained from the z table.

Summary The Binomial Distribution; Histograms and p.d.f.'s; Area segments and the normal distribution; The standard normal distribution; p.d.f. of the sample means; est. of mean = point est. + interval est.; The Central Limit Theorem; Confidence Intervals.

In Two Weeks Hypothesis Testing: How do we know if we can accept a batch of material from a few replicate analyses of a sample? Are the error rates obtained from two data mining methods really different?