2. Sample proportion
Kerry vs. Bush in 2004
- A Gallup poll: 49% for Kerry (1016 respondents)
- A Rasmussen poll: 45.9% for Kerry (1000 respondents)
Was one poll “wrong”? Should we be surprised that properly selected random samples can give quite different proportions?
Why are the answers different?
- The sample proportion estimates the population proportion.
- There is randomness due to sampling (variability of the sample).
Last week we looked at the example of the 2004 presidential election polls. We were interested in the proportion of the population in favor of Kerry. A Gallup poll showed 49%, and a Rasmussen poll a week later showed 45.9%. Because of the variability of random sampling, every time we draw a sample we get something different.
3. Model
- Let Y be the number of people favoring Kerry in a sample of size n = 1000.
- Y ~ Binomial(n, p), where p is the proportion of people for Kerry in the entire population.
- When n is large, Y can be approximated by a Normal model with mean np and variance npq, where q = 1 - p.
In other words, p is the probability that a randomly selected respondent says he or she would vote for Kerry. We may think of asking a randomly selected respondent as a Bernoulli trial in which the probability of an answer for Kerry is p. Therefore we can model the total number of respondents out of 1000 who said yes by a binomial random variable. We then used computer software to simulate repeated samples and looked at a histogram of the sample proportions from 2000 samples of size 1000. The histogram looks like a Normal model.
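The simulation described above can be sketched roughly as follows. This is a minimal illustration, not the software used in class: the function name `simulate_sample_proportions`, the seed, and the assumed true proportion p = 0.49 are all choices made for this example.

```python
import random
import statistics

def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples polls of size n; return each poll's sample proportion."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(num_samples):
        # Each respondent is one Bernoulli trial with success probability p.
        y = sum(rng.random() < p for _ in range(n))
        proportions.append(y / n)
    return proportions

# p = 0.49 is an assumed "true" proportion, used only for illustration.
props = simulate_sample_proportions(p=0.49, n=1000, num_samples=2000)
sim_mean = statistics.mean(props)
sim_sd = statistics.stdev(props)
print(f"mean of sample proportions: {sim_mean:.4f}")
print(f"sd of sample proportions:   {sim_sd:.4f}")
```

A histogram of `props` (with any plotting tool) should look roughly bell-shaped and centered near the assumed p.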
4. Modeling sample proportion
- The sample proportion is $\hat{p} = Y/n$.
- It has a Normal model with mean $p$ and variance $pq/n$, that is, standard deviation $\sqrt{pq/n}$.
We figured out the mean and standard deviation of the sample proportions and used the Normal model with the same mean and the same standard deviation to approximate the distribution of the sample proportions.
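5. Kerry vs. Bush (cont'd)
- Assume the true population proportion voting for Kerry is 49%.
- The sample proportion $\hat{p} = Y/n$ has a Normal model with mean 0.49 and standard deviation $\sqrt{0.49 \times 0.51 / 1000} \approx 0.0158$ (n = 1000).
- Then both 49% and 45.9% are reasonable values to observe: $(0.459 - 0.49)/0.0158 \approx -1.96$, so 45.9% lies about two standard deviations below the assumed mean.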
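The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch under the slide's assumption that the true proportion is 0.49; the helper name `z_score` is introduced here for illustration.

```python
import math

p = 0.49   # assumed true proportion for Kerry (the slide's assumption)
n = 1000   # sample size used on the slide

# Standard deviation of the sample proportion under the Normal model.
sd = math.sqrt(p * (1 - p) / n)

def z_score(p_hat):
    """Number of standard deviations p_hat lies from the assumed mean p."""
    return (p_hat - p) / sd

z_gallup = z_score(0.49)      # Gallup: 49% for Kerry
z_rasmussen = z_score(0.459)  # Rasmussen: 45.9% for Kerry
print(f"sd = {sd:.4f}")
print(f"z (Gallup) = {z_gallup:.2f}, z (Rasmussen) = {z_rasmussen:.2f}")
```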
6. Sampling Distribution Model
- Consider the sample proportion as a random variable instead of a number.
- The distribution of the sample proportion is called the sampling distribution model for the proportion.
Before we observe the value of the sample proportion, it is a random variable whose distribution reflects sampling variation. We never actually take repeated samples from the same population and make a histogram. We only imagine or simulate them. Still, sampling distribution models are important because
- they act as a bridge from the real world of data to the imaginary model of the statistic, and
- they enable us to say something about the population when all we have is data from the real world.
Sampling distribution models describe the sample-to-sample variation of statistics. For example, we used a Normal model to describe the sample proportion of people favoring Kerry.
7. Left-Handed: Example
- 13% of the population is left-handed.
- A 200-seat school auditorium was built with 15 “leftie seats”.
- In a class of n = 90 students, what is the probability that there will NOT be enough seats for the left-handed students?
Let Y be the number of left-handed students in the class. We want to find $P(Y > 15) = P(Y/n > 0.167) = P(\hat{p} > 0.167)$.
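8. Left-Handed (cont'd)
Check the conditions (n is large enough):
- Randomization
- 10% condition: the population should have at least 900 students, so that n = 90 is no more than 10% of it.
- Success/failure condition: np = 11.7 > 10, nq = 78.3 > 10
Approximate by a Normal model for $\hat{p} = Y/n$:
- Mean = p = 0.13
- Standard deviation = $\sqrt{pq/n} = \sqrt{0.13 \times 0.87 / 90} \approx 0.035$
- $P(\hat{p} > 0.167)$ = normalcdf(0.167, 1E99, 0.13, 0.035) $\approx 0.145$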
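For readers without a calculator that provides `normalcdf`, the same upper-tail probability can be sketched with Python's standard library. The variable names are chosen for this example; the first result uses the slide's rounded inputs (0.167 and 0.035), and the second uses unrounded values, so the two differ slightly.

```python
import math
from statistics import NormalDist

p, n, seats = 0.13, 90, 15
q = 1 - p

# Success/failure condition from the slide: np and nq should both exceed 10.
success_failure_ok = (n * p > 10) and (n * q > 10)

# Standard deviation of the sample proportion.
sd = math.sqrt(p * q / n)

# Upper tail with the slide's rounded inputs, mirroring
# normalcdf(0.167, 1E99, 0.13, 0.035).
prob_rounded = 1 - NormalDist(mu=0.13, sigma=0.035).cdf(0.167)

# The same tail without intermediate rounding.
prob_unrounded = 1 - NormalDist(mu=p, sigma=sd).cdf(seats / n)

print(f"conditions ok: {success_failure_ok}, sd = {sd:.4f}")
print(f"P(not enough seats), rounded inputs:   {prob_rounded:.3f}")
print(f"P(not enough seats), unrounded inputs: {prob_unrounded:.3f}")
```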
9. Example: Sampling Distribution of a Mean
[Figure: simulated sampling distributions of the sample mean; 10,000 simulations for each graph.]
10. Central limit theorem (CLT)
If the observations are drawn
- independently
- from the same population (equivalently, distribution)
then the sampling distribution of the sample mean becomes approximately Normal as the sample size increases.
The population distribution could be unknown.
11. CLT
- Suppose the population distribution has mean $\mu$ and standard deviation $\sigma$.
- The sample mean has mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.
Let $Y_1, \dots, Y_n$ be n independent and identically distributed random variables with $E(Y_1) = \mu$ and $Var(Y_1) = \sigma^2$. Then as n increases, the distribution of $(Y_1 + \dots + Y_n)/n$ tends to a Normal model with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.
More generally, the CLT says that if our sample values are independent, then as the sample size increases, the sample mean becomes approximately a Normal random variable. That is to say, if we have n independent random variables from the same distribution with mean $\mu$ and standard deviation $\sigma$, their mean approximately follows a Normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.
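The claim that the sample mean has mean μ and standard deviation σ/√n can be checked by simulation. The sketch below assumes an exponential population with rate 1 (so μ = σ = 1), chosen here only because it is clearly skewed; the function name, seed, and sample sizes are likewise illustrative.

```python
import math
import random
import statistics

def simulate_sample_means(n, num_samples, seed=1):
    """Return num_samples sample means, each from n exponential(1) draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]

n = 25
means = simulate_sample_means(n=n, num_samples=10_000)

mu, sigma = 1.0, 1.0                # mean and sd of the exponential(1) population
predicted_sd = sigma / math.sqrt(n)  # CLT prediction: sigma / sqrt(n) = 0.2

observed_mean = statistics.fmean(means)
observed_sd = statistics.stdev(means)
print(f"mean of sample means: {observed_mean:.3f} (population mean {mu})")
print(f"sd of sample means:   {observed_sd:.3f} (predicted {predicted_sd:.3f})")
```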
12. The Fundamental Theorem of Statistics
The Central Limit Theorem (CLT): The mean of a random sample has a sampling distribution whose shape can be approximated by a Normal model. The larger the sample, the better the approximation will be.
13. Example: Elevator Overloaded
- Suppose the population distribution of adult weights has mean 175 pounds and standard deviation 25 pounds; the shape is unknown.
- An elevator has a weight limit of 10 persons or 2000 pounds.
- What is the probability that the 10 people who get on the elevator exceed its weight limit?
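14. Let $X_i$, $i = 1, 2, \dots, 10$, be the weight of the $i$th person in the elevator.
Then we want to know $P(X_1 + \dots + X_{10} > 2000) = P(\bar{X} > 200)$.
From the CLT (check the requirements first), the distribution of $\bar{X}$ is approximately Normal with mean 175 pounds and standard deviation $25/\sqrt{10} \approx 7.91$ pounds.
Then $P(\bar{X} > 200) = P\left(Z > \frac{200 - 175}{7.91}\right) \approx P(Z > 3.16) \approx 0.0008$.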
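The elevator calculation can be reproduced with the standard library as a quick check. This is a sketch that simply trusts the Normal approximation for n = 10, which is an assumption since the population shape is unknown; the variable names are chosen for this example.

```python
import math
from statistics import NormalDist

mu, sigma = 175.0, 25.0   # population mean and sd of adult weights (pounds)
n = 10                    # people in the elevator
limit_total = 2000.0      # weight limit (pounds)

# A total above 2000 pounds is the same event as a mean above 200 pounds.
limit_mean = limit_total / n

# Standard deviation of the sample mean, sigma / sqrt(n).
sd_mean = sigma / math.sqrt(n)

z = (limit_mean - mu) / sd_mean
p_overload = 1 - NormalDist(mu=mu, sigma=sd_mean).cdf(limit_mean)

print(f"sd of the sample mean: {sd_mean:.2f} pounds")
print(f"z = {z:.2f}, P(overload) = {p_overload:.5f}")
```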
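15. Standard Error
- Using the CLT, the distribution of the sample proportion $\hat{p}$ is approximately Normal with mean $p$ and standard deviation $\sqrt{pq/n}$.
- In general, by the CLT the distribution of the sample mean of independent sample values is approximately Normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.
- $p$, $\mu$, and $\sigma$ could be unknown in some cases.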
16. Standard Error
- If we don't know $p$ or $\sigma$, the population parameters, we use sample statistics to estimate them.
- The estimated standard deviation of a sampling distribution is called a standard error.
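17. Standard Error (cont.)
- For a sample proportion, the standard error is $SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n}$, where $\hat{q} = 1 - \hat{p}$.
- For the sample mean, the standard error is $SE(\bar{y}) = s/\sqrt{n}$, where $s$ is the sample standard deviation.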
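As a small illustration, standard errors can be computed from sample statistics alone using the usual formulas, sqrt(p̂(1 − p̂)/n) for a proportion and s/√n for a mean. The function names are introduced for this sketch, the poll figures come from the earlier Kerry vs. Bush slides, and the list of weights is hypothetical data made up for the example.

```python
import math
import statistics

def se_proportion(p_hat, n):
    """Standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

def se_mean(values):
    """Standard error of a sample mean: s / sqrt(n), with s the sample sd."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Poll results from the Kerry vs. Bush slides.
se_gallup = se_proportion(0.49, 1016)
se_rasmussen = se_proportion(0.459, 1000)

# Hypothetical sample of adult weights in pounds (not real data).
weights = [162.0, 181.5, 175.0, 149.0, 203.5, 168.0, 190.0, 177.5]
se_weights = se_mean(weights)

print(f"SE (Gallup):    {se_gallup:.4f}")
print(f"SE (Rasmussen): {se_rasmussen:.4f}")
print(f"SE (mean weight, hypothetical sample): {se_weights:.2f}")
```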
18. The Process Going Into the Sampling Distribution Model
19. What Can Go Wrong?
Don't confuse the sampling distribution with the distribution of the sample.
- When you take a sample, you look at the distribution of the values, usually with a histogram, and you may calculate summary statistics.
- The sampling distribution is an imaginary collection of the values that a statistic might have taken for all random samples: the one you got and the ones you didn't get.
20. What Can Go Wrong? (cont.)
Beware of observations that are not independent.
- The CLT depends crucially on the assumption of independence.
- You can't check this with your data; you have to think about how the data were gathered.
Watch out for small samples from skewed populations.
- The more skewed the distribution, the larger the sample size we need for the CLT to work.
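The warning about skewed populations can be illustrated with a rough simulation. The sketch below compares the skewness of simulated sample means from a skewed population (exponential, an assumption made for this example) and a symmetric one (uniform) at two sample sizes. The helper names, seed, and number of simulations are all illustrative choices.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed standardized deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

def sample_mean_skewness(draw, n, num_samples=5000, seed=2):
    """Skewness of the simulated sampling distribution of the mean."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(draw(rng) for _ in range(n))
        for _ in range(num_samples)
    ]
    return skewness(means)

# Assumed example populations: one skewed, one symmetric.
populations = {
    "exponential": lambda rng: rng.expovariate(1.0),
    "uniform": lambda rng: rng.random(),
}

results = {}
for name, draw in populations.items():
    for n in (5, 50):
        results[(name, n)] = sample_mean_skewness(draw, n)
        print(f"{name:12s} n={n:3d}  skewness of sample means: "
              f"{results[(name, n)]:.3f}")
```

For the skewed population the sample means should remain visibly skewed at n = 5 and become more symmetric at n = 50, while for the symmetric population the skewness should already be near zero at the small sample size.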
21. Summary
- Sample proportions and sample means are statistics.
- They are random because samples vary.
- Their distributions can be approximated by a Normal model using the CLT.
- Be aware of when the CLT can be used:
  - n is large.
  - If the population distribution is not symmetric, a much larger n is needed.
- The CLT is about the distribution of the sample mean, not the sample itself.