Statistics Sampling Theory
We hear statistical results on the news constantly: “Bifar has been clinically shown to work better than Pliaff at removing warts,” “The chance of dying from heart disease if you smoke is blank times higher.” But how do we know that these results are correct? How are these results obtained?
First, we can never be 100% certain of anything. But, we would like to be as close to 100% as possible. I want to demonstrate with a simple example what this theory is about.
My friend has a die that he claims is fair. I am skeptical, but since I have no proof I will assume it is fair and continue to look for evidence that it may not be. Now the game begins. I keep track of how many 3’s show up (I could keep track of other values as well, but I want to keep this as simple as possible). Let’s say that out of 100 throws I witness twenty-five 3’s appearing. Should I be alarmed? Is this evidence that the die is not fair? Is 25 appearances an unusual number? The game continues, and on the next 100 throws twenty-nine 3’s appear. Again, the same question is asked. Is this evidence that the die is not fair? Is 29 appearances an unusual number?
What we need to understand is the distribution of the count of 3’s in a sample of 100 throws. It turns out that the mathematical model is well understood; this scenario is described by a binomial distribution (do not worry about not understanding how to create a binomial distribution). To the left is the binomial distribution describing this event (throwing a die 100 times and counting the number of 3’s; this is one event). To create this graph I had to assume the die was fair. I can see from the graph that witnessing twenty-five 3’s out of 100 throws is unusual, and twenty-nine is extremely unusual, if the die is fair.
[Figure: binomial distribution. The x-axis counts the number of 3’s that appear out of 100 throws.]
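If you want to check what the graph shows without the picture, here is a minimal sketch in Python. It assumes the die is fair (p = 1/6) and asks how likely it is to see 25 or more, or 29 or more, 3’s in 100 throws. The function names are my own, chosen for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(k, n, p):
    """Probability of k or more successes in n trials."""
    return sum(binomial_pmf(j, n, p) for j in range(k, n + 1))

n, p = 100, 1 / 6  # 100 throws of a die that is assumed to be fair

tail_25 = upper_tail(25, n, p)
tail_29 = upper_tail(29, n, p)
print(f"P(25 or more threes) = {tail_25:.4f}")
print(f"P(29 or more threes) = {tail_29:.4f}")
```

Both probabilities come out small, and the second is much smaller than the first, which is what “unusual” and “extremely unusual” mean here.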
I now realize that I am witnessing very unusual results if the die is fair. After all, if the die is fair I was expecting (1/6)(100), about 16 or 17, threes to appear. Twenty-five and twenty-nine are very unusual according to the model. The model: this is what we are going to concentrate on. The model allowed us to interpret the results from our data. Without models it is very difficult to say anything meaningful about statistical results.
I will go back to the die example and finish that off, but let me say one more thing about models to convince you of the importance of the idea. In this example I will make some assumptions. You are an American with no knowledge of the indigenous people of Brazil. You are all of a sudden dropped off in the middle of the Amazon. A man approaches you and says that he is to marry a woman from the other village whom he has never met. He tells you (just like in 60’s outer space sci-fi movies, everyone knows English): “I mainly want her to be beautiful. I cannot visit or see her until after we are married. Can you please go to the next village and see for me?”
So, you trek over to the other village and you meet the woman. Now you come to realize something very important. What does beautiful mean to this man? What does beautiful mean in this society? You know that, for example, “Hollywood beauty” would mean nothing in this society, that is, the model of what beautiful is in “Hollywood” would not apply here. Thus, you have data, what the person in front of you looks like, but no way to gauge what that means. Your only hope is to visit the man’s village and start pointing to women in his village he finds attractive and hope to reconstruct a good enough model of “beautiful” so that you can make a decision.
This idea of the model is key to statistics. Without it you are just floating and guessing. You may have lots of data, but no means to interpret the results. So now we are back to our die problem and the concept of a mathematical model. Equations come in families and types. For example, I hope you recognize that the equations y = 5x - 1, y = -3x, and y = 0.5x + 9 are all linear. A linear equation is of the form y = mx + b, where x and y are variables and m and b are constants, as you learned in algebra. Now, a specific line is determined by the values m and b, the slope and the y-coordinate of the y-intercept. The values m and b are called parameters for linear equations. Those values determine specific lines.
For the die problem, our model was the binomial distribution. What are the parameters for the binomial distribution? It has two: the sample size n, and p, the probability of the event in question occurring. For the die problem, n = 100 and p = 1/6.
Change the parameters, and you change the distribution.
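To see the parameters at work, here is a small sketch that describes two binomial distributions: the fair-die model from above (n = 100, p = 1/6) and a second one with p = 1/4. The second value of p is purely my assumption for illustration. It is not a claim about any real die.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def describe(n, p):
    """Return (mean, standard deviation, most likely count) for Binomial(n, p)."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    mode = max(range(n + 1), key=lambda k: binomial_pmf(k, n, p))
    return mean, sd, mode

fair = describe(100, 1 / 6)    # the fair-die model from the text
other = describe(100, 1 / 4)   # hypothetical p, chosen only for comparison

for label, (mean, sd, mode) in (("p = 1/6", fair), ("p = 1/4", other)):
    print(f"{label}: mean = {mean:.2f}, sd = {sd:.2f}, most likely count = {mode}")
```

Same family, different parameters, and the center and spread of the distribution both move.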
OK, I see that models are important, and that mathematical models require parameters to make them specific. What does that have to do with sampling? With statistics? The die example was unique in that I could assume the die was fair, and by making that assumption I knew p = 1/6. The sample size was determined by the number of observations. Often, we do not know the value of the parameter, but we wish to estimate it. Let’s assume that a die is altered so that it is “loaded.” The probabilities associated with a fair die do not apply. How could I discover what the probabilities associated with the outcomes of that die are? The only thing we can do is toss the die and see what numbers show up, and how often. But now we are back to the “beautiful” woman example. If I toss the die 100 times I get one set of numbers. If I toss the die 100 times again, I get another set of numbers. If we all toss the die 100 times we all get different sets of numbers. Who is right? Hopefully, the values are close together and we can get a sense of the true values associated with the loaded die.
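Here is a small simulation of that situation, as a sketch only. The loaded die’s weights are made up by me for illustration (the true chance of a 3 is set to 0.3), and the “players” are just five repeated runs of 100 tosses each.

```python
import random

# Hypothetical loaded die: these weights are an assumption for illustration.
FACES = [1, 2, 3, 4, 5, 6]
WEIGHTS = [0.1, 0.1, 0.3, 0.2, 0.2, 0.1]  # true P(3) = 0.3, unknown to the players

def estimate_p_three(rng, throws=100):
    """Toss the loaded die `throws` times; return the observed proportion of 3's."""
    rolls = rng.choices(FACES, weights=WEIGHTS, k=throws)
    return rolls.count(3) / throws

rng = random.Random(2024)  # fixed seed so the run is repeatable
estimates = [estimate_p_three(rng) for _ in range(5)]

print("Five players' estimates of P(3):", estimates)
print("Average of the estimates:", sum(estimates) / len(estimates))
```

Each player gets a slightly different answer, yet the answers cluster around the value that was built into the die.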
We need a general theory to deal with these discrepancies. We know that the probabilities are long-run values, so how big should I make my sample so that I feel I have a good estimate of the unknown values? For drug tests: how many people would I need to test the drug on so that I can be certain, to some degree, whether the drug works or does not work? And by “work” we mean that a certain proportion of the population, p, can be helped by the drug.
General Theory
To understand how things work, we will assume for the moment (chapter 5) that we know our parameters and the distribution types, and then show you what occurs when you sample and calculate an estimate of some parameter (a statistic). Chapter 5 will use two parameters, µ, the population mean, and p, the population proportion, as examples of what happens to our estimates (called statistics) when we sample. A parameter (often called the true value) is a number that describes some characteristic of a population. We think of this number as being fixed (it does not change). Now keep in mind that if the characteristic that the parameter describes changes in the population, so does the parameter. If the characteristic does not change, neither does the parameter. A statistic is an estimate of a parameter. A statistic is gathered by taking a sample from the population you are interested in and then calculating the desired measurement.
For example, let us say we want to know the exact mean weight of men 18 years old or older in the United States at this exact instant. By exact, I mean the parameter µ. The only way to calculate that number is to measure every male in the United States who meets the criteria; I need a census. If I gathered a sample of 1000 men who fit the description, weighed them, and then calculated the mean of the sample, this would be the sample mean (x-bar), which is a statistic. So a parameter is calculated from a census (the whole population/sample space is measured), while a statistic is calculated from a sample (a subset of the population). Here are some examples of parameters with their corresponding statistics.
Parameter: µ, population mean. Statistic: x-bar, sample mean.
Parameter: σ, population standard deviation. Statistic: s, sample standard deviation.
Parameter: p, population proportion. Statistic: p-hat, sample proportion.
To explain a sampling distribution one could use means or proportions; both can convey the general theories we are looking to explain. I will choose the population mean as my example.
Let the random variable X measure the time required for a person to go from home to work in a large city. Suppose that the mean time required for a person to go from home to work is 45.3 minutes (µ = 45.3 minutes; this is the population mean), with a standard deviation of 7.3 minutes (σ = 7.3 minutes; this is the population standard deviation). Notice that I have not said what the distribution is like; I have merely stated what the mean and standard deviation of this population are. How did I get those parameters? At the moment we are pretending we know all the parameters, so that we can understand what occurs when we sample.
Suppose we sample four individual drivers at random from this city. I then take those four numbers, x1, x2, x3, and x4, and average them. What I have just calculated is the sample average, x-bar: $\bar{x} = (x_1 + x_2 + x_3 + x_4)/4$. This is one possible value of the sample mean.
Now, that one sample mean is one number out of infinitely many possible numbers. If we repeated the sampling we would get different numbers. This creates a distribution of sample means (the sampling distribution of the mean). In general terms, we have created a distribution of some statistic. This is the key to understanding and interpreting data results. There are now two populations to keep in mind. The first is the population of some characteristic that I am interested in knowing more about. The second is the population of a statistic that estimates some parameter, which allows me to characterize some aspect of the population I am studying.
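A quick way to watch this second population appear is to simulate it. The sketch below repeats the “sample four drivers and average” step many times. The text does not say what the distribution of commute times looks like, so the normal shape used here is my assumption, made only so there is something to draw from.

```python
import random
import statistics

MU, SIGMA = 45.3, 7.3   # population mean and standard deviation from the text
N = 4                   # drivers per sample

def sample_mean(rng, n=N):
    """Draw n commute times and return their average (one value of x-bar)."""
    # Assumption: commute times are normally distributed; the text does not say.
    times = [rng.gauss(MU, SIGMA) for _ in range(n)]
    return statistics.mean(times)

rng = random.Random(1)  # fixed seed so the run is repeatable
sample_means = [sample_mean(rng) for _ in range(10_000)]

print("First five sample means:", [round(m, 1) for m in sample_means[:5]])
print("Smallest and largest:", round(min(sample_means), 1), round(max(sample_means), 1))
```

Every run of `sample_mean` plays the role of one study. The list `sample_means` is a large slice of the population of the statistic.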
When you sample from a population you only generate one sample mean. But that one sample mean is one number out of infinitely many possible numbers. If we repeated the sampling we would get different numbers. And then we ask the next question. If my one number is part of an infinite set of sample means, we understand that this creates a distribution of sample means. The next logical question is “what is the mean of the sample averages?”
Now, this may seem strange, but the answer is not at all set in stone. The mean of the sampling distribution will depend on how we sample. A better question to ask is, what would we want it to be? Think about it for a second. This is a very important question. If you could make the mean of the sample averages any number you want, what would that number be?
Whenever I have posed this question, eventually someone says they would like the mean of the population we are sampling from to equal the mean of the sampling distribution of the means. This is exactly what we want! And here is the reason. Suppose that it was not this way. What would be the ramifications?
If the mean of the sample means were less, for example, the individual sample means would consistently underestimate the actual population mean of 45.3 minutes. This is not good.
Now, equality of the means of the two distributions does not occur by chance. It must be engineered to occur this way. How, you ask? This is the purpose of chapter three’s introduction, section 1, and section 2, which concern sampling methods. Our sampling method should consistently produce representative samples with no or very little bias.
If our sampling method creates sample means whose distribution mean is not equal to the mean of the population we sampled from, then the method of sampling is said to be biased. This is the purpose of proper randomization! A properly randomized sampling method should create, in the long run, an unbiased sampling distribution mean: $\mu_{\bar{x}} = \mu$.
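To make bias concrete, here is a sketch that compares two sampling methods on the same made-up population. Everything about the population is my assumption: the normal shape, the 50,000 people, and the biased method, which can only reach people whose commute is under 50 minutes. The names `short_commuters` and `mean_of_sample_means` are hypothetical helpers.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical population of commute times (minutes); normal shape is an assumption.
population = [rng.gauss(45.3, 7.3) for _ in range(50_000)]
mu = statistics.mean(population)

# A biased sampling frame: only people with commutes under 50 minutes are reachable.
short_commuters = [t for t in population if t < 50]

def mean_of_sample_means(frame, n=4, repeats=5_000):
    """Average x-bar over many samples of size n drawn at random from `frame`."""
    return statistics.mean(
        statistics.mean(rng.sample(frame, n)) for _ in range(repeats)
    )

random_method = mean_of_sample_means(population)
biased_method = mean_of_sample_means(short_commuters)

print(f"Population mean:                  {mu:.2f}")
print(f"Mean of x-bar, random sampling:   {random_method:.2f}")
print(f"Mean of x-bar, biased sampling:   {biased_method:.2f}")
```

With random sampling from the whole population, the mean of the sample means lands close to the population mean. With the restricted frame it sits consistently below it, no matter how many samples are taken.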
What about the standard deviations of each group? How are they related? Well, in this case we cannot simply wish a relationship to be true; it will be true because of the physics of the situation, assuming that our sampling method is not biased. But let’s ask the question anyway.
What would we like the relationship to be? $\sigma_{\bar{x}} = \sigma$, $\sigma_{\bar{x}} > \sigma$, or $\sigma_{\bar{x}} < \sigma$?
We would like $\sigma > \sigma_{\bar{x}}$. Why? If $\sigma_{\bar{x}}$ is small, then the x-bar values will not vary by much; all possible x-bar values will be very close together. But if that is the case, then we will get a good idea of what µ really is.
The exact relationship is given by $\sigma_{\bar{x}} = \sigma / \sqrt{n}$, where n is the sample size.
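Here is a rough check of that relationship by simulation, taking the standard result σ_x-bar = σ/√n as the thing being tested. With σ = 7.3 and n = 4 it predicts 7.3/2 = 3.65 minutes. As before, the normal shape of the population is only my assumption.

```python
import random
import statistics
from math import sqrt

MU, SIGMA, N = 45.3, 7.3, 4   # values from the text; N is the sample size

rng = random.Random(11)  # fixed seed so the run is repeatable

# Assumption: normal commute times, used only to have something to sample from.
sample_means = [
    statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(20_000)
]

predicted = SIGMA / sqrt(N)                  # sigma / sqrt(n)
observed = statistics.pstdev(sample_means)   # spread of the simulated x-bar values

print(f"Predicted standard deviation of x-bar: {predicted:.2f}")
print(f"Observed in simulation:                {observed:.2f}")
```

The observed spread of the sample means should sit close to the predicted value, and well below the population’s 7.3.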
Now we have relationships established for the means and the standard deviations of both populations. You may have noticed that I have not talked about the distribution itself. Is the distribution left skewed, right skewed, symmetric, or something else? We now embark on one of the most important theoretical facts about sampling distributions concerning means.
Let’s say that the times are uniformly distributed, with σ = 8.66. I sample two values at a time randomly. Then I calculate the average. What does the distribution for all possible averages of two numbers look like if we sampled from a uniform distribution?
Now I sample four values at a time randomly from the same uniform distribution (σ = 8.66) and calculate the average. What does the distribution for all possible averages of four numbers look like?
As we continue to increase the sample size, the distribution of sample means continues to take on the shape of a normal distribution. Suppose I sample 15 values at a time and calculate the sample mean. What does the distribution for all possible averages of fifteen numbers look like if we sampled from the same uniform distribution (σ = 8.66)? The distribution is approximately normal.
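Without the pictures, one simple way to see the shape change is to count how many sample means fall within one standard deviation of their center. For a normal distribution that share is about 68%. In the sketch below, the range 30.3 to 60.3 is my assumption: a uniform distribution of width 30 has σ = 30/√12, about 8.66, and I centered it on 45.3 to match the commute example.

```python
import random
import statistics

LOW, HIGH = 30.3, 60.3   # assumed range: width 30 gives sigma = 30 / sqrt(12), about 8.66

def share_within_one_sd(n, rng, repeats=20_000):
    """Share of sample means (sample size n) within one standard deviation of their mean."""
    means = [
        statistics.fmean([rng.uniform(LOW, HIGH) for _ in range(n)])
        for _ in range(repeats)
    ]
    center = statistics.fmean(means)
    spread = statistics.pstdev(means)
    return sum(abs(m - center) < spread for m in means) / repeats

rng = random.Random(3)  # fixed seed so the run is repeatable
shares = {n: share_within_one_sd(n, rng) for n in (1, 2, 4, 15)}

for n, share in shares.items():
    print(f"n = {n:2d}: {share:.3f} of sample means within one standard deviation")
```

For n = 1 the share is that of the flat uniform distribution itself. As n grows, it moves toward the normal figure of roughly 0.68.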
The Central Limit Theorem: Suppose we sample from a distribution that is not normally distributed, and we calculate the sample mean. The sample mean belongs to a new population (sample space) consisting of all possible sample means for that particular sample size. The Central Limit Theorem says that if the sample size is large enough, then the distribution of the sample mean is approximately normal. In steps: sample n values (n has to be large enough). Calculate x-bar. This one value belongs to a population of all possible means (all from samples of the same size). The distribution of all possible x-bars is approximately normal.
Summary
Population of interest: the variable X represents the measurement of interest. Let the random variable X measure the time required for a person to go from home to work in a large city. Suppose that the mean time required for a person to go from home to work is 45.3 minutes (µ = 45.3 minutes; this is the population mean), with a standard deviation of 7.3 minutes (σ = 7.3 minutes; this is the population standard deviation). The parameter of interest is the mean of this population, 45.3 minutes. Notice the symbol used to represent this number, $\mu_x$ (the same µ as before; the subscript names the variable X). To estimate $\mu_x$, a sample of size n is gathered and x-bar is calculated. This one value belongs to a distribution of all possible values of x-bar created by averaging n values at a time. Notice the symbol change for the mean of the sampling distribution: $\mu_{\bar{x}}$.
For a large enough sample size, even if the distribution I sample from is not normal, the distribution of sample averages will be approximately normal. If our sampling method is not biased, then $\mu_{\bar{x}} = \mu_x$. Equality is not automatic; the sampling method has to make this happen.
The variation of X is measured by $\sigma_x$. The variation of x-bar is measured by $\sigma_{\bar{x}}$. The equation that defines the relationship between the two, $\sigma_{\bar{x}} = \sigma_x / \sqrt{n}$, shows that as n increases, the variation of the distribution of x-bar decreases. What size sample would you want in order to estimate a mean value? A sample size of n = 50, or a sample size of n = 500? Assume both samples have no bias. Most people would say 500. Why? They recognize it should be more accurate! What does accurate mean? Less variability! Think about it. An accurate ball thrower can get the ball to the target consistently.
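To put numbers on the n = 50 versus n = 500 comparison, here is a short sketch that applies the standard relationship σ_x-bar = σ/√n to the commute example (σ = 7.3 minutes). The function name is mine, for illustration.

```python
from math import sqrt

SIGMA = 7.3  # population standard deviation from the text, in minutes

def standard_deviation_of_mean(n, sigma=SIGMA):
    """Standard deviation of x-bar for unbiased samples of size n."""
    return sigma / sqrt(n)

se_50 = standard_deviation_of_mean(50)
se_500 = standard_deviation_of_mean(500)
ratio = se_50 / se_500

print(f"n = 50:  x-bar varies with standard deviation {se_50:.2f} minutes")
print(f"n = 500: x-bar varies with standard deviation {se_500:.2f} minutes")
print(f"The smaller sample is about {ratio:.2f} times as variable")
```

Ten times the sample size buys a reduction in variability by a factor of √10, a little over 3, not a factor of 10.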