Introduction to Data Analysis Probability Distributions.

Presentation transcript:

Introduction to Data Analysis Probability Distributions

2 Today’s lecture Probability distributions (A&F 4) Normal distribution. Sampling distributions = normal distributions. Standard errors (part 1).

3 Probability – an idiot’s guide We’re interested in how likely, how probable, it is that our sample is similar to the population. In order to make this judgement, we need to think about probability a little bit. In particular we need to think about probability distributions.

4 Probability The proportion of times that an outcome would occur in a long run of repeated observations. Imagine tossing a coin: on any one flip the coin can land heads or tails. If we flip the coin lots of times, the proportion of heads is likely to be close to the proportion of tails (law of large numbers). Thus the probability of a coin landing heads on any one flip is ½, or 0.5, or in bookmakers’ terms ‘evens’. If the coin were double-headed then the probability of heads would be 1, a certainty.

5 Probability Distribution (1) The mean of a probability distribution of a variable is: µ = ∑ y P(y) if y is discrete; µ = ∫ y P(y) dy if y is continuous. This is also called the expected value, E(y): each payoff weighted by its probability. The standard deviation (σ) of a probability distribution measures its variability. A larger σ means a more spread out distribution.
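
The discrete formula above can be sketched in a few lines of Python. This is only an illustration: the distribution used (number of heads in two fair coin flips) and the function name are assumptions made for the example, not something taken from the slides.

```python
import math

# Hypothetical discrete distribution (not from the slides):
# number of heads in two flips of a fair coin.
outcomes = [0, 1, 2]
probabilities = [0.25, 0.5, 0.25]

def mean_and_sd(values, probs):
    """Return (mu, sigma) for a discrete probability distribution."""
    # mu = sum of y * P(y)
    mu = sum(y * p for y, p in zip(values, probs))
    # variance = sum of (y - mu)^2 * P(y); sigma is its square root
    variance = sum((y - mu) ** 2 * p for y, p in zip(values, probs))
    return mu, math.sqrt(variance)

mu, sigma = mean_and_sd(outcomes, probabilities)
print(f"expected value = {mu}, standard deviation = {sigma:.3f}")
```

Spreading more of the probability onto the extreme outcomes would leave µ unchanged here but increase σ, which is the 'more spread out' point made on the slide.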

6 Probability distribution (2) Lists the possible outcomes together with their probabilities. Now, let’s take a continuous-level variable, like hours spent working by students per week. The mean = 20, and standard deviation = 5. But what about the distribution…? We assign probabilities to intervals of numbers; for example, the probability of students working between 0 and 10 hours is (let’s say) 2½ per cent. We can graph this, with the area under the curve over a given interval representing the probability of the variable falling in that interval.

7 Probability distribution (3) [Figure: the area between 0 and 10 is 2.5 per cent of the total area beneath the curve.]

8 Probability distribution (4) Given this distribution, there is a probability of 2.5% (1 in 40, or odds of 39-1 against for the gamblers) that if I picked a student at random they would have done fewer than 10 hours of work in a week. Many continuous variables share a particular bell-shaped distribution, known as the normal distribution. The student work distribution is ‘normal’.

9 What is a ‘normal distribution’? NDs are symmetrical: the distribution above the mean is the mirror image of the distribution below the mean. (Income, by contrast, has a skewed distribution.) For any normal distribution, the probability of falling within z standard deviations of the mean is the same, regardless of the distribution’s standard deviation. The Empirical Rule tells us: for 1 s.d. (a z-value of 1) the probability is 0.68; for 2 s.d. (strictly, 1.96) the probability is 0.95; for 3 s.d. the probability is almost 1.
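
The Empirical Rule figures can be checked numerically. The sketch below uses the standard identity P(|Z| ≤ z) = erf(z / √2) for a standard normal variable; the function name is an assumption made for illustration.

```python
import math

def prob_within(z):
    """Probability that a normal variable falls within z s.d. of its mean."""
    return math.erf(z / math.sqrt(2))

# The z-values mentioned on the slide, plus the exact 1.96 cut-off.
for z in (1, 1.96, 2, 3):
    print(f"within {z} s.d.: {prob_within(z):.4f}")
```

The result does not depend on the mean or standard deviation of the particular normal distribution, which is why a single z table serves for all of them.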

10 Brief aside: what is Z? The Z-score for a value Y on a variable is the number of standard deviations that Y falls from µ, that is, Z = (Y − µ) / σ. We can use Z-scores to determine the probability in the tail of a normal distribution that is beyond a number Y.
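
As a rough sketch of how a Z-score turns into a tail probability, the code below uses the student work example from earlier slides (mean 20 hours, standard deviation 5) and asks how likely it is that a student works fewer than 10 hours. The function names are assumptions; the normal CDF is computed from the error function rather than read from a z table.

```python
import math

def z_score(y, mu, sigma):
    """Number of standard deviations that y lies from the mean."""
    return (y - mu) / sigma

def normal_cdf(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

MU, SIGMA = 20.0, 5.0          # student hours example from the slides
z = z_score(10.0, MU, SIGMA)   # 10 hours is two s.d. below the mean
lower_tail = normal_cdf(z)     # probability of fewer than 10 hours

print(f"z = {z}, P(hours < 10) = {lower_tail:.4f}")
```

The exact tail for z = −2 comes out a little under the 2½ per cent that the slides use as a round figure; 2.5% corresponds to z = −1.96.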

11 Normal distribution (1) Area under the curve here is 0.68 of the total area under the curve. Hence the probability of working between 15 and 25 hours is 0.68. 1 s.d. less than the mean = 15; 1 s.d. more than the mean = 25.

12 Normal distribution (2) For any value of z (i.e. not just whole numbers but, say, 2.34 s.d.), there is a corresponding probability. Most stats books have z tables inside their front or back covers. Thus if we were to pick a student out of our population of known distribution, we could work out how likely it would be that she was a hard worker. Even some non-normal distributions can be transformed to produce approximately normal distributions. For example, incomes are not normally distributed, but we can ‘log’ them to get something close to a normal distribution (more on this later).

13 What’s the point? But, surely we don’t know the distribution or mean of the population (that’s probably why we’re sampling it after all), so what use is all this…?

14 Back to sampling The reason that normal distributions are relevant to us is that sample means are themselves (approximately) normally distributed. To understand what this means, let’s take an example of sampling. I want to take a driving trip around the world, visit every country and pay no attention to speed limits. I don’t particularly want to go to prison, however, so what to do…

15 Sampling example (1) The plan is to bribe all policemen when caught speeding. Thus I want to measure how much it costs on average to bribe a policeman to avoid a speeding ticket. It’s costly to collect this information, so I don’t want to investigate every country before I set off. Therefore I sample the countries to try and estimate the average bribe I will need to pay. [Images: James’ car; Ryan’s car.]

16 Sampling example (2) I randomly sample 5 countries and measure the cost of the bribe. Imagine for the moment that I know what the population distribution looks like (it happens to be normal with a mean of $500). [Figure: population distribution, marking the mean of the population ($500), the sample mean ($450) and one observation ($700).]

17 Sampling distributions (1) If we took lots of samples we would get a distribution of sample means, known as the sampling distribution. It so happens that this sampling distribution (the distribution of sample means, and of many other statistics) is normally distributed, at least approximately. Because of averaging, the sample mean does not vary as widely as the individual observations. Moreover, if we took lots of samples, the distribution of the sample means would be centred on the population mean.

18 Sampling distributions (2) Imagine I took lots of samples. There would be a normal distribution of their means, centred on the population mean. [Figure: population distribution and sampling distribution; labels: mean of population, mean of all sample means.]

19 3 Very Important Things (1) If we have lots of sample means then their average will be the same as the population mean. In technical language, the sample mean is an unbiased estimator of the population mean. (2) If the sample size is large(ish), the distribution of sample means (what is called the sampling distribution) is approximately normal. This is true regardless of the shape of the population distribution. (3) As n (the sample size) increases, the sampling distribution looks more and more like a normal distribution. This is called the central limit theorem.
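
These three points can be illustrated with a small simulation. The sketch below is not from the lecture: it assumes a strongly skewed population (an exponential distribution with mean 1 and standard deviation 1), draws many samples, and looks at how the sample means behave. The sample size, number of samples and seed are arbitrary choices.

```python
import random
import statistics

random.seed(42)        # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 50       # n for each sample (assumed)
NUM_SAMPLES = 2000     # how many samples we draw (assumed)

def draw_sample_mean(n):
    """Mean of one random sample of size n from a skewed population."""
    # Exponential with rate 1: population mean 1, population s.d. 1.
    return statistics.mean([random.expovariate(1.0) for _ in range(n)])

sample_means = [draw_sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.mean(sample_means)   # should sit near 1
sd_of_means = statistics.stdev(sample_means)    # should sit near 1 / sqrt(50)

# Share of sample means within one s.d. of their centre;
# for a normal distribution this would be about 0.68.
within_one_sd = sum(
    abs(m - mean_of_means) <= sd_of_means for m in sample_means
) / NUM_SAMPLES

print(f"mean of sample means: {mean_of_means:.3f}")
print(f"s.d. of sample means: {sd_of_means:.3f}")
print(f"share within one s.d.: {within_one_sd:.3f}")
```

Even though the individual observations are far from normal, the sample means cluster symmetrically around the population mean, and rerunning with a larger SAMPLE_SIZE should make the match to a normal curve closer still.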

20 Sampling distributions (4) [Figure: population distribution and sampling distribution; labels: mean of population, mean of all sample means.]

21 Sampling distributions (5)

22 ‘Accurate’/‘inaccurate’ samples (1) Some sampling distributions are wider than others… The top sampling distribution is better for estimating the population mean, as more of the sample means lie near the population mean. [Figure: sampling distributions of differing spread; labels: mean of population, 68% of distribution.]

23 ‘Accurate’/‘inaccurate’ samples (2) Sampling distributions that are tightly clustered will give us a more accurate estimate on average than those that are more dispersed. Remember, high standard deviations give us a ‘short and flabby’ distribution and low standard deviations give us a ‘tall and tight’ distribution. We need to estimate what our sampling distribution’s standard deviation is. But how do we do this…?

24 A (little) bit of math now… Before we work out what the sampling distribution looks like, some important terms.

25 Standard error (1) For my bribery sample, we know the following: But, we want to know the standard deviation of the sampling distribution, so we can see what the typical deviation from the population mean will be.

26 Standard error (2) Fortunately for us: The standard error is an estimate of how far any sample mean ‘typically’ deviates from the population mean.

27 Standard error (3) For my bribery sample. Thus, the ‘typical’ deviation of a sample mean from the population mean (of $500) would be $64, if we repeatedly sampled the population.
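
The slide formulas did not survive in this transcript, so the following is only a sketch using the textbook definition of the standard error of a mean, the standard deviation divided by the square root of n. The sample size of 5 comes from the earlier slide; the standard deviation of $143 is an assumed value, chosen here only because it reproduces a standard error of roughly $64.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

N_COUNTRIES = 5        # sample size stated in the slides
ASSUMED_SD = 143.0     # ASSUMPTION: not given in the transcript

se = standard_error(ASSUMED_SD, N_COUNTRIES)
print(f"standard error = ${se:.2f}")
```

With different values for the standard deviation or the sample size the same function applies unchanged.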

28 2 More Very Important Things The formula for standard error means that: (1) As the n of the sample increases, the sampling distribution gets tighter. This makes sense: the bigger the sample, the better it is at estimating the population mean. (2) As the distribution of the population becomes tighter, the sampling distribution also gets tighter. This also makes sense: if a population is widely dispersed, we are less likely to get observations near the mean.
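
Both effects can be seen by plugging numbers into the usual standard error formula, sd / √n. The standard deviations and sample sizes below are hypothetical values picked for illustration.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

# Effect of sample size, holding the (hypothetical) s.d. at 100:
# quadrupling n halves the standard error.
for n in (5, 20, 80, 320):
    print(f"sd=100, n={n:>3}: SE = {standard_error(100.0, n):.1f}")

# Effect of population spread, holding n at 20:
# halving the s.d. halves the standard error.
for sd in (100.0, 50.0, 25.0):
    print(f"sd={sd:>5}, n= 20: SE = {standard_error(sd, 20):.1f}")
```

Note the diminishing returns on sample size: because of the square root, getting a standard error half as large needs a sample four times as big.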

29 Binary variables This works for binary variables too, where the mean is just the proportion…

30 Do we trust Blair? Take the example of Blair’s trustworthiness: of the people in the sample, 30% trust him (i.e. the mean is 0.30). Given this, we can work out the standard error. The typical deviation from the proportion would be 1.4 percentage points if we took lots of samples.
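
For a proportion, the usual standard error formula is √(p(1 − p) / n). The sample size is missing from this transcript, so the sketch below assumes n = 1000 purely for illustration; with that assumption the result is close to the 1.4% quoted on the slide.

```python
import math

def proportion_standard_error(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1.0 - p) / n)

P_TRUST = 0.30     # proportion trusting Blair, from the slides
ASSUMED_N = 1000   # ASSUMPTION: the sample size is not in the transcript

se = proportion_standard_error(P_TRUST, ASSUMED_N)
print(f"standard error = {se:.4f} ({se * 100:.1f} percentage points)")
```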

31 And finally, standard error (4) Don’t forget that we know the shape of the distribution of the sample means (it’s normal). We know the sample mean, the shape of the distribution of all the sample means, and how dispersed the distribution of sample means is. So (at last) we can calculate the probability of the sample mean being ‘near’ to the population mean (i.e. calculate a z-score and look up corresponding probability). But wait, there’s more…
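
A minimal sketch of that last step: given a normal sampling distribution and a standard error, the chance that a sample mean lands within some distance of the population mean follows from a z-score. The standard error of $64 is the bribery figure from the slides; the distance of $100 is a hypothetical choice for the example.

```python
import math

def prob_sample_mean_within(distance, standard_error):
    """P(|sample mean - population mean| <= distance), normal sampling dist."""
    z = distance / standard_error          # distance expressed in standard errors
    return math.erf(z / math.sqrt(2))      # two-sided normal probability

STANDARD_ERROR = 64.0   # from the bribery example in the slides
DISTANCE = 100.0        # ASSUMPTION: an illustrative definition of 'near'

p = prob_sample_mean_within(DISTANCE, STANDARD_ERROR)
print(f"P(sample mean within ${DISTANCE:.0f} of population mean) = {p:.3f}")
```

Setting the distance equal to one standard error gives back the familiar 0.68 from the Empirical Rule.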

32 Next week Finish off standard error. Think about how we can measure a range around our sample mean that we can be confident contains the population mean. These ranges are called confidence intervals. Hypothesis testing. What’s a hypothesis? How samples can help us to work out the probability of hypotheses being correct.