Chapter 3 Producing Data 1. During most of this semester we go about statistics as if we already have data to work with. This is okay, but a little misleading.


1 Chapter 3 Producing Data

2 During most of this semester we go about statistics as if we already have data to work with. This is okay, but a little misleading, because much time and effort is spent getting the data needed to analyze the things we are interested in. Here we point out some of the details of producing data, along with a lot of terminology. I want to start with an example in the book. At some point the authors tell us there are 230 million adults in the United States. Say we want to know the mean, or average, age of this adult population. A census would have us contact every adult in the U.S. to get their response. From this census one could calculate the mean age. The population mean is an example of a parameter (a number that describes the population).

3 The whole process of conducting a census might prove to be difficult because of the time and money required. An alternative is to collect a sample (a subset of the population). From the sample the sample mean could be calculated. The sample mean is a statistic (a number that describes a sample). Statistical inference is the process of using data from a sample to infer conclusions about the wider population. So, we need data to do statistical inference. To do inference as well as we can we need to have the most relevant data possible. Anecdotal data or evidence may be interesting, but usually is not representative of any larger group. Available data, produced in the past for some other reason, may be useful.
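The difference between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the population below is 10,000 made-up ages (not real census data), and the names `population_ages` and `sample_ages` are invented for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 made-up adult ages (NOT real census data).
population_ages = [random.randint(18, 90) for _ in range(10_000)]

# Parameter: a number that describes the whole population
# (what a census would let us calculate).
population_mean = statistics.mean(population_ages)

# Statistic: a number that describes one sample of 100 people
# drawn from that population.
sample_ages = random.sample(population_ages, 100)
sample_mean = statistics.mean(sample_ages)

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
```

The two printed numbers will usually be close but not identical, which is the gap that statistical inference has to reason about.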

4 Sample surveys may be useful provided they are managed properly. In an observational study the researcher observes subjects without attempting to influence the responses. In an experimental study individuals have some treatment imposed and responses are observed. Often in a study we have an idea that particular values of a variable x will go along with particular values of a variable y. (This is the idea behind correlation and regression.) Here we need to be cautious because of something called confounding. The basic idea is that x is not the only influence on y, and so the relationship we expect between x and y gets mixed up with the effects of other variables.

5 Remember we collect a sample from a larger population. The design of the sample survey refers to the method of selecting the sample. The list of individuals from which the sample is actually selected is called the sampling frame. The response rate is the number who actually responded divided by the number we selected for the sample (the people we wanted to respond). Of course you can multiply the fraction by 100 to get a percentage. The design of a study is biased if it systematically favors certain outcomes. A voluntary response sample may be biased because people with strong or extreme views are the most likely to respond. A convenience sample may be biased because the easily captured individuals may not represent all the folks in the population.

6 A simple random sample for our purposes may be the best in that the sample tends to represent the population well. Note the definition: a simple random sample of size n consists of n individuals from a population chosen such that every set of n individuals has an equal chance to be the sample actually selected. Let’s use an unrealistic example to illustrate! Say we have a population of 3 people: Mr. A, Aunt B, and Missy. Say we want a sample of size 2 and we have a 100% response rate. The possible samples are:
- Mr. A and Aunt B
- Mr. A and Missy
- Aunt B and Missy
If each of these samples of size 2 has an equal chance of being selected then we have a simple random sample process.
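The three possible samples can be listed mechanically. The sketch below uses Python's standard `itertools.combinations`; the variable names are made up for the illustration.

```python
from itertools import combinations

population = ["Mr. A", "Aunt B", "Missy"]
n = 2  # sample size

# Every distinct set of n people that could end up being the sample.
possible_samples = list(combinations(population, n))
for sample in possible_samples:
    print(sample)

# Under simple random sampling each possible sample is equally likely.
chance_of_each = 1 / len(possible_samples)
print(f"{len(possible_samples)} possible samples, each with chance {chance_of_each:.3f}")
```

With 3 people and samples of size 2 there are exactly 3 possible samples, so each one must have a 1 in 3 chance of being the sample selected.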

7 Perhaps the easiest way to think about getting a simple random sample is to put the name of each person in the population on an equal sized piece of paper and put all the names in a hat. Then draw out of the hat, without looking into the hat, the number of names you want in the sample. This easy method is not practical when the population is above a few hundred, but the principle still applies. If you want to do a survey of students at Wayne State College then perhaps the best way to get a representative survey would be to randomly select the names of folks from the list of all students (that could be obtained from college officials). A convenience sample might be to just grab folks as they walk around Gardner Hall. Does anyone see a problem here? Many students rarely get into Gardner Hall and so the sample probably does not represent all students.
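When the hat is too small, a computer can do the drawing. The following is a minimal sketch, assuming a made-up roster of 3,000 placeholder names (the real list would come from college officials); `random.sample` draws without replacement, just like pulling slips from a hat.

```python
import random

random.seed(3)  # fixed seed so the sketch is repeatable

# Hypothetical roster: placeholder names standing in for the real student list.
roster = [f"Student {i}" for i in range(1, 3001)]

sample_size = 25

# Draw 25 distinct names; every set of 25 names is equally likely to be chosen.
srs = random.sample(roster, sample_size)

print(srs[:5])  # show the first few names drawn
```

No student can be drawn twice, and a student who never sets foot in Gardner Hall has the same chance of selection as anyone else.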

8 Undercoverage occurs when some groups in the population are left out of the process of choosing the sample. Again, think about surveying WSC students by only looking at students in Gardner Hall. There would be undercoverage of students with majors outside of business. Now you could counter that students in Gen Ed ECON classes are not always business majors, and I would agree, but you are still undercovering much of the campus. Nonresponse occurs when an individual chosen for the sample cannot be contacted or does not cooperate. Ideally you would then like to contact a person who has characteristics similar to those of the non-responder. This can be difficult.

9 Remember that earlier we said statistical inference is using a sample to get an understanding of the population. Let’s look at an example of this. Say 1009 U.S. adults (out of 230 million) respond to the question “did you eat at a restaurant in the last week?” Also say 605 said yes to the question. The sample proportion that said yes is 605/1009 ≈ 0.60, or 60%. Now, if a different 1009 people made it into the sample then the sample proportion might be different (say only 599 say yes, for a proportion of about 0.59). The idea that the sample proportion will be different with different people in the sample is called sampling variability. It is understood in statistics that sampling variability will happen.
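Sampling variability is easy to see in a small simulation. The sketch below assumes, purely for illustration, that the true population proportion is 0.60 (in practice it is unknown) and treats each respondent as an independent yes/no draw, which is a reasonable stand-in for a simple random sample from a very large population.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

TRUE_PROPORTION = 0.60  # assumed for illustration; the real value is unknown
SAMPLE_SIZE = 1009

def sample_proportion(n, p):
    """Simulate n independent yes/no answers and return the share of yes."""
    yes_count = sum(1 for _ in range(n) if random.random() < p)
    return yes_count / n

# Five different samples of 1009 people give five different proportions.
proportions = [sample_proportion(SAMPLE_SIZE, TRUE_PROPORTION) for _ in range(5)]
for i, p_hat in enumerate(proportions, start=1):
    print(f"sample {i}: {p_hat:.3f}")

# The proportion from the slide's example, for comparison.
print(f"605 out of 1009: {605 / 1009:.2f}")
```

Each simulated sample lands near 0.60 but rarely on exactly the same value, which is what sampling variability means.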

10 At this point in the chapter the authors tell us that a simple random sample eliminates favoritism in drawing samples (and this eliminates bias from how the sample is chosen). A second idea the authors want you to consider is the idea of taking many simple random samples from a population and calculating a statistic (like the sample proportion) for each one. This allows us to see if a pattern of variation from sample to sample exists. Let’s go back to our 3 person population and ask about eating at a restaurant in the last week. Say 2 said yes and 1 said no, for a true population proportion of 2/3, or about 0.67. Then the 3 possible sample proportions, when we sample only 2 people, are 0.5 (the one who said no and one of the two who said yes), 0.5 (the one who said no and the other one who said yes), and 1.0 (both who said yes made it into the sample).

11 Is there a pattern to our sample proportions? Well, the average of the sample proportions is (0.5 + 0.5 + 1) / 3 = 2/3, or about 0.67. This is actually the population proportion! The average of all possible sample proportions is the population proportion! Well, the sample proportion is a specific example of a sample statistic. A sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of a given size from the larger population. In the work we are doing here (getting introductory statistical ideas down) there are recognized patterns that we want you to be aware of and work with. That is our task!
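The claim that the sample proportions average out to the population proportion can be checked exactly for the 3 person example. In the sketch below, which person answered "no" is an assumption made only for the illustration (the example does not say); exact fractions are used so that 2/3 is not blurred by rounding.

```python
from fractions import Fraction
from itertools import combinations

# 1 = "yes", 0 = "no". Which person said no is assumed for illustration.
answers = {"Mr. A": 1, "Aunt B": 1, "Missy": 0}
n = 2  # sample size

population_proportion = Fraction(sum(answers.values()), len(answers))

# Sample proportion for every possible sample of size 2.
sample_proportions = [
    Fraction(sum(answers[person] for person in sample), n)
    for sample in combinations(answers, n)
]

# Mean of the sampling distribution of the sample proportion.
mean_of_proportions = sum(sample_proportions) / len(sample_proportions)

print("sample proportions:", [str(p) for p in sample_proportions])
print("mean of sample proportions:", mean_of_proportions)
print("population proportion:", population_proportion)
```

The list of sample proportions is the whole sampling distribution here, and its mean is exactly 2/3, the same as the population proportion.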

12 Bias concerns the center of the sampling distribution. A statistic used to estimate a parameter is UNBIASED if the mean of its sampling distribution is equal to the true value of the parameter being estimated. Just as in my 3 person example, the sample proportion is an unbiased estimator of the population proportion. To reduce bias, use random sampling. When we start with a list of the entire population, simple random sampling produces unbiased estimates. This means the values of a statistic computed from a simple random sample neither consistently overestimate nor consistently underestimate the value of the population parameter.

13 The variability of a statistic is described by the spread of its sampling distribution. This spread is determined by the sampling design and the sample size n. Statistics from larger samples have smaller spreads. To reduce the variability of a statistic from a simple random sample, use a larger sample. The variability can be made as small as desired by taking a large enough sample. Later we will refer to an idea called statistical significance. The idea is that an observed effect so large that it would rarely occur by chance is called a statistically significant effect. For example, say we put people into two groups and in one group we do one thing and in the other group we do something else. Then when we measure a response, small differences between the groups will not mean much, but large differences indicate that the things we have done to the two groups really do have different effects. Later we will spend time understanding the context of “large” in this setting.
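The effect of sample size on spread can be sketched with a simulation. As an assumption for illustration only, the true proportion is set to 0.60, each respondent is simulated as an independent yes/no draw, and the spread is measured as the standard deviation of 500 simulated sample proportions at each sample size.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

TRUE_PROPORTION = 0.60  # assumed for illustration
REPEATS = 500           # number of simulated samples per sample size

def sample_proportion(n, p):
    """Simulate n independent yes/no answers and return the share of yes."""
    yes_count = sum(1 for _ in range(n) if random.random() < p)
    return yes_count / n

# Estimate the spread of the sampling distribution for several sample sizes.
spread_by_n = {}
for n in (100, 400, 1600):
    p_hats = [sample_proportion(n, TRUE_PROPORTION) for _ in range(REPEATS)]
    spread_by_n[n] = statistics.stdev(p_hats)
    print(f"n = {n:4d}: spread of sample proportions = {spread_by_n[n]:.4f}")
```

Each time the sample size is multiplied by 4 the spread drops by roughly half, so larger samples give steadier statistics but with diminishing returns.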

14 So, what have we done here? The point of the chapter is to express the idea that good statistical work starts with good data collection. The idea of a simple random sample is a good place to start. The other idea the authors wanted to plant in your mind is that even when we use the best sampling methods, there will still be sampling variability. But that variability may be understood in the context of the sampling distribution.

