

Introduction Sample surveys involve chance error. Here we will study how to find the likely size of the chance error in a percentage, for simple random samples from a population whose composition is known. This mainly depends on the size of the sample, not the size of the population.

Example Suppose a health study is based on a representative cross section of 6,672 Americans age 18 to 79. There are 3,091 men and 3,581 women. So men are about 46%. We want to interview a sample of size 100. To avoid bias, we are going to draw the sample at random. To do that, we put the names on 6,672 tickets, and draw out 100 tickets at random. We use a computer program to simulate this process: draw tickets without replacement.
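The drawing process just described can be sketched in a few lines of Python. This is only an illustration, not the program used for the slides: the 0-1 coding of the tickets and the seed are assumptions made here, and the count of men it prints will generally differ from the one reported on the next slide.

```python
import random

# Box of 6,672 tickets: 1 = man, 0 = woman (counts taken from the example).
box = [1] * 3091 + [0] * 3581

rng = random.Random(0)          # arbitrary seed, chosen only for reproducibility
sample = rng.sample(box, 100)   # random.sample draws without replacement

men = sum(sample)               # number of 1's drawn
print(f"men: {men}, women: {100 - men}, percentage of men: {men}%")
```

Changing the seed gives a different sample, which is exactly the chance variability discussed next.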

Example In the previous sampling, there were 51 men and 49 women. This does not match the percentages in the population, where 46% are men and 54% are women. The difference is due to chance variability. The sampling process is similar to the chance processes we studied before. In coin tossing the chances are 50-50, whereas in our example the chances are just about 46-54 on each draw. (Note that when the number of tickets in the box is large relative to the number of draws, the draws can be treated as independent even though we draw without replacement.) Here, the chance error for the percentage of men is 5%, since 51% = 46% + 5%. (Recall: estimate = parameter + chance error.)

Example If we repeat the sampling process, the percentage of men will vary from sample to sample:

Example From the previous table for the 250 samples, we can see the number of men ranged from a low of 34 to a high of 58. Only 17 samples out of the lot have exactly 46 men. Here is a histogram:

Example If we increase the size of the sample, then it will come out more like the population. For instance, we increase the sample size to 400 and draw another 250 samples. The percentage of men again varies from sample to sample. The low is 39%, and the high is 54%. Compared with samples of size 100, samples of size 400 come out a bit closer to the population. (Size 100: low 34%, high 58%. Size 400: low 39%, high 54%. The population: 46%.)

Example Here is the histogram for the percentages of men in samples of size 400:

Remarks Comparing the two sets of samples, multiplying the sample size by 4 cuts the likely size of the chance error in the percentage by a factor of around 2. So we may expect a quantitative relation between the sample size and the chance error in the percentage. Every sample was drawn from the same population of 6,672 people, and the samples all differ from one another. This does not mean that there are 250 × 100 = 25,000 people in total: each sample is drawn afresh from the whole population, so some people may be picked in many samples.
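The comparison between sample sizes 100 and 400 can be repeated by simulation. The sketch below is an assumption-laden illustration rather than a reproduction of the slides' table: the seed, the helper name `sample_percentages`, and the use of the standard deviation of the 250 percentages as a measure of spread are all choices made here.

```python
import random
import statistics

box = [1] * 3091 + [0] * 3581   # 1 = man, 0 = woman
rng = random.Random(1)          # arbitrary seed

def sample_percentages(size, repeats=250):
    """Percentage of men in each of `repeats` simple random samples."""
    return [100 * sum(rng.sample(box, size)) / size for _ in range(repeats)]

pct_100 = sample_percentages(100)
pct_400 = sample_percentages(400)

# Spread of the 250 percentages around their average, for each sample size.
spread_100 = statistics.pstdev(pct_100)
spread_400 = statistics.pstdev(pct_400)

print(f"size 100: low {min(pct_100):.1f}%, high {max(pct_100):.1f}%, spread {spread_100:.2f}")
print(f"size 400: low {min(pct_400):.1f}%, high {max(pct_400):.1f}%, spread {spread_400:.2f}")
print(f"ratio of spreads: {spread_100 / spread_400:.2f}")
```

The ratio of the two spreads should come out near 2, in line with the remark that quadrupling the sample size roughly halves the chance error in the percentage.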

The expected value and standard error With a simple random sample, the expected value for the sample percentage equals the population percentage. The standard error for the percentage is the standard error for the number divided by the sample size, expressed as a percent.

The expected value Let us continue the previous example: take a sample of size 100 from a population of 6,672 people in a health study, of whom about 46% are men and 54% are women. We know that the percentage of men in the sample will be around the percentage of men in the population, that is, about 46%. This is the expected value for the sample percentage in a simple random sample. In practice, the percentage in a sample will not be exactly equal to its expected value; it will be off by a chance error. As in a chance process, the likely size of the chance error in a sampling process is given by the standard error.

The standard error The idea for computing the standard error is this: First, find the SE for the number of men in the sample. Then, convert to percent, relative to the size of the sample, i.e. $\text{SE for percentage} = \dfrac{\text{SE for number}}{\text{size of sample}} \times 100\%$. Note: we are now working with percentages, so the SE must be converted to percent. To compute the SE for the number, we must set up a box model. Since we are counting the men, the box should contain tickets marked 1 and 0. The 1’s stand for men and the 0’s stand for women.

The standard error There are 3,091 men and 3,581 women, so that in the box there are 3,091 1’s and 3,581 0’s.
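With this box, the two steps above can be carried out numerically. The sketch assumes the usual square root law for the sum of draws (SE for the sum = square root of the number of draws × SD of the box) and the shortcut for the SD of a 0-1 box; the variable names are illustrative only.

```python
import math

men, women = 3091, 3581            # numbers of 1's and 0's in the box
n = 100                            # number of draws (sample size)

p = men / (men + women)            # fraction of 1's in the box, about 0.46
sd_box = math.sqrt(p * (1 - p))    # SD of a 0-1 box (shortcut formula)

se_number = math.sqrt(n) * sd_box  # SE for the number of men in the sample
se_percent = se_number / n * 100   # convert to percent of the sample size

print(f"SD of box: {sd_box:.3f}")
print(f"SE for number: {se_number:.2f}")
print(f"SE for percentage: {se_percent:.2f}%")
```

Under these assumptions the SE for the number of men is about 5, so the SE for the percentage is about 5%.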

The standard error

Increase the sample

The formulas
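Consistent with the steps described earlier, the formulas can be written as follows. This is a restatement under the usual box-model assumptions (draws treated as made with replacement), not a copy of the original slide.

```latex
\[
\text{SD of a 0-1 box} = \sqrt{(\text{fraction of 1's}) \times (\text{fraction of 0's})}
\]
\[
\text{SE for number} = \sqrt{\text{size of sample}} \times \text{SD of box}
\]
\[
\text{SE for percentage} = \frac{\text{SE for number}}{\text{size of sample}} \times 100\%
\]
```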

Note

Remarks You may have noticed that the arguments are exact only when drawing with replacement. But as mentioned earlier, when the number of draws is small relative to the total number of tickets in the box, the draws can be considered independent. In this case, there is almost no difference between drawing with or without replacement. We may also notice that the SE for the number and the SE for the percentage behave quite differently: The SE for the sample number goes up like the square root of the sample size. The SE for the sample percentage goes down like the square root of the sample size.
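The opposite behaviour of the two standard errors can be seen by tabulating them for a few sample sizes. The sketch assumes a large 0-1 box with 46% 1’s and draws treated as independent; the sample sizes are chosen here only for illustration.

```python
import math

sd_box = math.sqrt(0.46 * 0.54)    # SD of a 0-1 box with 46% 1's, about 0.5

rows = []
for n in (100, 400, 1600):         # each size is 4 times the previous one
    se_number = math.sqrt(n) * sd_box
    se_percent = se_number / n * 100
    rows.append((n, se_number, se_percent))
    print(f"n = {n:5d}   SE for number = {se_number:5.1f}   "
          f"SE for percentage = {se_percent:4.2f}%")
```

Each time the sample size is multiplied by 4, the SE for the number doubles while the SE for the percentage is cut in half.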

Normal Approximation As before, we use the normal curve to estimate the probability in a certain interval.

Example In a town, the telephone company has 100,000 subscribers. They plan to take a simple random sample of 400 of the subscribers as part of a market research study. According to Census data, 20% of the company’s subscribers earn over $50,000 a year. Q1: The percentage of persons in the sample with incomes over $50,000 a year will be around ____, give or take ____ or so. Q2: Estimate the probability that between 18% and 22% of the persons in the sample earn more than $50,000 a year.

Solution To begin with, we set up a box model. Whether or not a person’s income is over $50,000 a year is a classifying and counting problem, so we use tickets marked 1 and 0. The people earning more than $50,000 a year get 1’s, and the others get 0’s. Taking a sample of 400 from the population of 100,000 is just like drawing 400 tickets at random from a box of 100,000 tickets. We will look at the sum of the draws and the corresponding percentage. From the Census data, 20% of the tickets are 1’s, that is 20,000. The remaining 80,000 are 0’s.

Solution
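For Q1, the expected value and the SE for the sample percentage can be worked out from the box. The sketch below assumes the same square root law and 0-1 box shortcut used earlier, with draws treated as independent; the variable names are illustrative.

```python
import math

n = 400                  # sample size
p = 0.20                 # fraction of 1's in the box (incomes over $50,000)

sd_box = math.sqrt(p * (1 - p))     # SD of the 0-1 box
expected_number = n * p             # expected number of 1's among the draws
se_number = math.sqrt(n) * sd_box   # SE for the number of 1's

expected_percent = expected_number / n * 100
se_percent = se_number / n * 100

print(f"expected number: {expected_number:.0f}, give or take {se_number:.0f}")
print(f"expected percentage: {expected_percent:.0f}%, give or take {se_percent:.0f}%")
```

Under these assumptions the sample percentage will be around 20%, give or take 2% or so.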

Notice that the classifying and counting problem is just a special case of the sum of draws. We know that by the central limit theorem, the probability histogram for the sum of draws follows the normal curve when the number of draws is reasonably large. So when the sample size is large enough, by a change of scale, the probability histogram for the sample percentage can be approximated by the normal curve. (Converting to percent is just a change of scale.) In Q2, we are dealing with the probability histogram for the sample percentage. By the above argument, the normal approximation applies.

Solution Since the expected value is 20%, and the SE is 2%, we now can convert the scale to standard units. The 18% is converted to -1, and the 22% is converted to +1. Recall that the area under the normal curve between -1 and +1 is about 68%. So the probability that between 18% and 22% of the persons in the sample earn more than $50,000 a year is about 68%. This completes the solution to Q2.
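The conversion to standard units and the area under the normal curve can be checked numerically. This sketch computes the normal curve area from the error function in Python's `math` module; the helper name `normal_cdf` is an assumption made here.

```python
import math

def normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

expected, se = 20.0, 2.0           # expected value and SE, in percent

z_low = (18 - expected) / se       # 18% in standard units
z_high = (22 - expected) / se      # 22% in standard units
prob = normal_cdf(z_high) - normal_cdf(z_low)

print(f"standard units: {z_low:+.0f} to {z_high:+.0f}")
print(f"approximate probability: {prob:.1%}")
```

The area between -1 and +1 comes out at roughly 68%, matching the figure quoted in the solution.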

Remark In standard units, the histogram for number and the histogram for percentage are exactly the same. This is an application of the change of scale:

Note When the problem is about classifying and counting in a sample, we set up a 0-1 box to get a percent. Other problems involve adding up the sample values. This is the general case of the sum of draws box model: we then set up the box with tickets showing the sample values, and work with the sum or the average as we did before.

Size of Population We have already seen that the size of the sample determines the standard error for the percentage, and hence the accuracy. We have also seen that when the sample is small relative to the population, the sampling process can be treated as a box model with draws made with replacement. A natural question then arises: will the size of the population affect the accuracy?

Size of Population The answer is: No. When estimating percentages, it is the absolute size of the sample which determines accuracy, not the size relative to the population. This is true if the sample is only a small part of the population, which is the usual case.

Example Consider the 2004 presidential campaign, Bush versus Kerry, and focus on the Southwest: New Mexico and Texas. Pollsters try to predict the results. There are about 1.5 million eligible voters in New Mexico, and about 15 million in Texas. One polling organization takes a simple random sample of 2,500 voters in New Mexico. Another polling organization takes a simple random sample of 2,500 voters from Texas. Let us compare the accuracy of the two predictions.

Example Intuitively, it might seem that the New Mexico poll should be more accurate than the Texas poll, because the New Mexico poll samples 1 voter out of 600, while the Texas poll samples 1 voter out of 6,000. Let us set up two box models to check this. One box has 1.5 million tickets, and the other has 15 million. Each ticket is marked either 1 or 0: the 1’s stand for Democrats, and the 0’s stand for others. To keep life simple, we make the percentage of 1’s equal to 50% in both boxes.

Example

Remark

When the size of the box is large relative to the number of draws, the correction factor is nearly 1 and can be ignored. This again shows that, in general, it is the absolute size of the sample which determines accuracy; the size of the population does not really matter. On the other hand, if the sample is a substantial fraction of the population, the correction factor must be used. But that is not the usual case.

Comments for the sampling All the arguments in this chapter focus on simple random sampling, but the conclusion holds for most probability methods of drawing samples (e.g. multistage cluster sampling). A very important point: the likely size of the chance error in sample percentages depends mainly on the absolute size of the sample, not on the size of the population. That is, the size of the sample mainly determines the accuracy.

Comments for the sampling For example, the Gallup Poll predicts the vote with good accuracy by sampling several thousand eligible voters out of 200 million. This is amazing. A sample of size 2,500 is big enough. Suppose we toss a coin 2,500 times: the standard error for the percentage of heads is only 1%. Similarly, with a sample of 2,500 voters, the likely size of the chance error is only about 1% or so. This will work unless the election is very close, like Bush versus Gore in 2000.
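The two polls can be compared numerically. The sketch below assumes the standard finite population correction, sqrt((N - n) / (N - 1)) for a box of N tickets and n draws, as the "correction factor"; the transcript does not show the formula itself, and the function name `poll_se` is made up here.

```python
import math

def poll_se(population, n, p=0.5):
    """SE for the sample percentage, with and without replacement.

    Returns (SE with replacement, correction factor, SE without replacement).
    """
    sd_box = math.sqrt(p * (1 - p))
    se_with = math.sqrt(n) * sd_box / n * 100          # in percent
    correction = math.sqrt((population - n) / (population - 1))
    return se_with, correction, se_with * correction

for state, voters in (("New Mexico", 1_500_000), ("Texas", 15_000_000)):
    se_with, correction, se_without = poll_se(voters, 2500)
    print(f"{state:10s}  correction factor = {correction:.6f}  "
          f"SE for percentage = {se_without:.4f}%")
```

Both correction factors are within a tenth of a percent of 1, so under these assumptions the two polls have essentially the same SE of about 1%, even though one box is ten times the size of the other.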

Summary

When the sample is only a small part of the population, it is the sample size which determines the accuracy, not the population size. When the box size is large relative to the number of draws, there is almost no difference between drawing with and without replacement. In this case, the correction factor is nearly 1. The process can be considered as drawing with replacement.