# A few slides about sampling and hypothesis testing. ©2013 Michael J. Rosenfeld Draft date: 1/14/2013.


### 1. The sample universe, and our sample

The sample frame, or sample universe, is the population that our sample is drawn from. In the case of the March 2000 CPS, the sample universe includes all people residing in the US in March 2000 who were not living in institutional settings. This sample frame has N members.

[Diagram: the sample universe, with our sample drawn from it by random sampling.]

In theory, our sample is drawn from the sample universe. The simplest way to think about this is that we start with the sample universe and randomly select n cases from it for our sample. Generally we expect n, the size of our sample, to be much smaller than N, the size of the sample universe, i.e. n<<N. In the case of the CPS, our sample is about 1/2000 as large as the sample universe.

If our sample is randomly selected from the sample universe, then our sample is a representative sample (representative of the sample universe), and this means we can use our sample to test hypotheses about the sample universe.
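As a rough illustration of random sampling, here is a minimal Python sketch. The universe below is made-up data (uniformly distributed ages), not the CPS, and the names `universe`, `sample`, and the sizes are assumptions chosen only so that the sampling fraction is 1/2000.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical sample universe: N made-up ages (this is NOT CPS data).
N = 200_000
universe = [random.randint(0, 90) for _ in range(N)]

# Simple random sample without replacement; n is 1/2000 of N,
# mirroring the sampling fraction mentioned for the CPS.
n = N // 2000
sample = random.sample(universe, n)

universe_mean = statistics.mean(universe)
sample_mean = statistics.mean(sample)

print(f"N = {N}, n = {n}, sampling fraction = {n / N}")
print(f"universe mean = {universe_mean:.2f}, sample mean = {sample_mean:.2f}")
```

Because every case has the same chance of selection, the sample mean should land close to the universe mean even though the sample is a tiny fraction of the universe.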

### 2. Hypothesis testing

[Diagram: the sample universe (size N) and our random, representative sample (size n). Data from our sample allow us to either accept or reject Hypothesis 1 about the sample universe.]

Note: We make hypotheses about the sample universe, and we test those hypotheses with data from our sample. There is a lot we don't know about the sample universe, since the data we have on hand (our sample) is only a small part of the much larger sample universe. We don't make hypotheses about our sample because we already know all there is to know about our sample.

### 3. Our sample is one of many potential samples

[Diagram: the sample universe (size N), our random, representative sample (size n), and other potential representative samples drawn from the same sample universe. Hypothesis 1: X=0. In our data, x=b; we decide that a value of x as large as b is unlikely if Hypothesis 1 were true, so we reject Hypothesis 1.]

Let's say that Hypothesis 1 is that X=0. Think of X as some value in the sample universe that we cannot measure directly. In our data, x=b, and b≠0.

One way to think about hypothesis testing is to ask this question: if Hypothesis 1 were true, meaning X=0, how likely would we be to find a value of x as large as b in our sample? In other words, if Hypothesis 1 were true, what percentage of the other random samples would yield a value of x as large as b? If the answer is less than 5%, then we generally reject the hypothesis: evidence from our sample leads us to believe that the true (unmeasured) value of X in the sample universe is not equal to zero.
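The question "what percentage of the other random samples would yield a value of x as large as b?" can be approximated by simulation. The sketch below rests on several assumptions that are not in the slides: x is a sample mean, the universe is normally distributed with standard deviation 1, n = 100, b = 0.3, and "as large as" is read in absolute value (a two-sided comparison).

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

n = 100             # size of each sample (assumed)
b = 0.3             # value of x observed in "our" sample (assumed)
num_samples = 2000  # number of other potential samples to simulate

def sample_mean_under_h1(size):
    """Mean of one random sample from a universe where Hypothesis 1 (X = 0) holds."""
    draws = [random.gauss(0.0, 1.0) for _ in range(size)]
    return sum(draws) / size

means = [sample_mean_under_h1(n) for _ in range(num_samples)]

# Share of simulated samples whose x is at least as large as b (in absolute value).
share_as_large = sum(abs(m) >= abs(b) for m in means) / num_samples
reject_h1 = share_as_large < 0.05

print(f"share of samples with |x| >= {b}: {share_as_large:.4f}")
print("reject Hypothesis 1" if reject_h1 else "do not reject Hypothesis 1")
```

Under these assumptions b is about three standard errors away from zero, so only a very small share of samples should reach it and Hypothesis 1 is rejected.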

### 4. How sample size matters

[Diagram: the sample universe (size N) and our random, representative sample (size n). Hypothesis 1: X=0. In our data, x=b; we decide that a value of x as large as b is unlikely if Hypothesis 1 were true, so we reject Hypothesis 1.]

Interestingly, and surprisingly to most students, the sample size we really care about is n. As long as N is much larger, that is, as long as n<<N, it doesn't matter how big N is. The size of our sample, n, is what determines the standard errors of our means and the power of our tests. Remember:

$$\mathrm{Var}(\mathrm{avg}(X)) = \frac{\mathrm{Var}(X)}{n}$$

and

$$\mathrm{SE}(\mathrm{avg}(X)) = \sqrt{\frac{\mathrm{Var}(X)}{n}}$$

And remember it is the small n, the n of our random sample, that we are talking about here.
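A small simulation can make this concrete. The two universes below are hypothetical (normally distributed values with mean 50 and standard deviation 10), and `empirical_se` is an illustrative helper, not something from the slides. It estimates the standard error of the mean by drawing many random samples and taking the standard deviation of their means.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

def empirical_se(universe, n, reps=1000):
    """Standard deviation of the sample mean across many random samples of size n."""
    means = [sum(random.sample(universe, n)) / n for _ in range(reps)]
    return statistics.stdev(means)

# Two hypothetical universes that differ only in their size N.
small_universe = [random.gauss(50, 10) for _ in range(20_000)]
large_universe = [random.gauss(50, 10) for _ in range(200_000)]

se_small_N = empirical_se(small_universe, n=100)   # N = 20,000,  n = 100
se_large_N = empirical_se(large_universe, n=100)   # N = 200,000, n = 100
se_bigger_n = empirical_se(large_universe, n=400)  # N = 200,000, n = 400

print(f"SE with N=20,000,  n=100: {se_small_N:.3f}")
print(f"SE with N=200,000, n=100: {se_large_N:.3f}")
print(f"SE with N=200,000, n=400: {se_bigger_n:.3f}")
```

Making N ten times larger should leave the standard error roughly unchanged, while quadrupling n should roughly halve it, as the square root of n in the denominator implies.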

### 5. Sampling fraction

[Diagram: the sample universe (size N) and our random, representative sample (size n), drawn by random sampling.]

The ratio n/N is called the sampling fraction. Remember that

$$\mathrm{Var}(\mathrm{avg}(X)) = \frac{\mathrm{Var}(X)}{n}$$

Well, there is actually what is called a finite sample correction, so that the true value of Var(avg(X)) is:

$$\mathrm{Var}(\mathrm{avg}(X)) = \frac{\mathrm{Var}(X)}{n}\left[1 - \frac{n}{N}\right]$$

with the part in square brackets representing the finite sample correction.

When the sampling fraction is small, the finite sample correction is basically 1, and you can ignore it. When the sampling fraction is large (let's say you have half the sample universe in your sample), you are shrinking the variance of your averages. That makes sense, because the variance is a measure of uncertainty, and if you have a substantial proportion of all the possible data in your hand, you have a lot less uncertainty about the sample universe. And when the sampling fraction is 1, when you have the entire sample universe in your hand, the finite sample correction is zero and the variance of the average is zero, which makes sense because there is no uncertainty left.

When we are looking at a dataset of, let's say, how 100 US senators voted on some bill, we can fit models to that data, but we cannot test statistical hypotheses about the models, because there is no larger sample universe that the 100 senators are drawn from. The 100 senators are the entire sample universe of US senators at any one time.
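To see how the correction behaves at different sampling fractions, here is a short numeric sketch. It assumes the correction takes the common textbook form 1 - n/N for simple random sampling without replacement; the function names and the numbers (Var(X) = 100, N = 20,000) are illustrative assumptions, not values from the slides.

```python
def finite_sample_correction(n, N):
    """Correction factor, assuming the textbook form 1 - n/N."""
    return 1 - n / N

def variance_of_mean(var_x, n, N):
    """Var(avg(X)) = (Var(X) / n) * [1 - n/N] under the assumption above."""
    return (var_x / n) * finite_sample_correction(n, N)

VAR_X = 100.0  # assumed variance of X in the sample universe
N = 20_000     # assumed size of the sample universe

for n in (10, 10_000, 20_000):  # sampling fractions 1/2000, 1/2, and 1
    fraction = n / N
    correction = finite_sample_correction(n, N)
    uncorrected = VAR_X / n
    corrected = variance_of_mean(VAR_X, n, N)
    print(f"n/N = {fraction:.4f}  correction = {correction:.4f}  "
          f"Var(X)/n = {uncorrected:.4f}  corrected = {corrected:.4f}")
```

At a sampling fraction of 1/2000 the correction is 0.9995 and can be ignored, at one half it halves the variance of the average, and at 1 it drives the variance to zero.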

### 6. What about convenience samples?

[Diagram: the sample universe (size N) and our sample (size n).]

Let's say you want to study the attitudes of college students. So you create a survey, and you field this survey to your friends. Your friends are what is called a convenience sample: they are a subset of the sample universe of college students, but they are not a random subset. You cannot test hypotheses about the sample universe with a convenience sample, and you cannot generalize from convenience samples. Convenience samples are easy to get (they are convenient), but they are not nearly as useful as random, representative samples.

Any time you are doing research, or reading someone else's research, you should know:

1) What is the sample universe?
2) Is the sampling fraction substantially less than 1? If the sampling fraction is close to 1, then we may have good data, but we do not need statistical tools to analyze the data.
3) If the sampling fraction is small, is our sample a random and representative sample? If so, then the standard statistical tools can be used. If our sample is not a representative sample, then hypothesis testing is not appropriate.
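The difference between a random sample and a convenience sample can be sketched with made-up data. Everything below is hypothetical: a universe of 10,000 students in five majors, an attitude score that rises with the major index, and the assumption that "your friends" all share your major.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical universe: 10,000 college students in 5 majors (0 to 4).
# The attitude score is invented and rises with the major index.
students = []
for _ in range(10_000):
    major = random.randrange(5)
    attitude = random.gauss(40 + 5 * major, 5)
    students.append((major, attitude))

def mean_attitude(group):
    """Average attitude score of a list of (major, attitude) pairs."""
    return sum(attitude for _, attitude in group) / len(group)

universe_mean = mean_attitude(students)

# Random sample: every student has the same chance of being selected.
random_sample = random.sample(students, 100)

# Convenience sample: "your friends", assumed here to all be in major 0.
friends = [s for s in students if s[0] == 0][:100]

print(f"universe mean:           {universe_mean:.2f}")
print(f"random sample mean:      {mean_attitude(random_sample):.2f}")
print(f"convenience sample mean: {mean_attitude(friends):.2f}")
```

Both samples have 100 cases, but only the random sample should track the universe mean. The friends' mean reflects how they were selected, which is why their answers cannot be generalized.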