PYPR1 lecture 2 : Populations and Samples Dr David Field

General Information This lecture contains material that is crucial for understanding the rest of the course: read the text book (important sections are indicated), go to the workshop, download the lecture, and make use of the university maths support service. A specialist statistics tutor is available every Wednesday afternoon in term time from 2.00pm to 4.00pm. Alternatively, fill in a form with your question on the website and get a reply.

Populations and samples At the end of this lecture we will be able to make statements of the form –“We measured the number of hours slept per night of a sample of 50 students in the UK. The mean number of hours slept was 7.2. We can be 95% confident that the population mean lies between 6.8 and 7.6 hours of sleep per night.” We will aim to understand the logic underlying the confidence interval around the mean in the above statement This is based on the properties of normal distributions and sampling distributions

Populations and samples If we weighed every adult domestic cat in the UK we would be able to calculate the population mean weight and the population SD –other populations of interest might be engineering students or amateur cricketers or cars Measuring the whole population is expensive and impractical, and so scientists invariably measure only a fraction of the population of interest, using sampling The aim of the sampling procedure is to obtain as good an estimate of the unknown true population parameters as possible This requires the relationship between the sample you have and the unknown population to be quantified –If I weigh 100 cats, how confident can I be that my observed mean is close to the unobservable population mean?

Representative and unrepresentative samples We can only assess the relationship between a sample and an unobservable population if the sample is representative of the target population This is an issue of study design, but it determines how broadly we can interpret our numeric statistics If a sample of engineering students was selected exclusively from Oxford University then measures obtained from it might not be an accurate reflection of engineering students in general There are a number of ways to obtain a representative sample, the simplest case being random selection from the entire population

What does random mean? Each time you sample a single case, every member of the underlying population has an equal chance of being selected The classic case is rolling an unbiased die Each time the die is rolled you have an equal chance of the result being 1, 2, 3, 4, 5, or 6 There are no history effects! If you rolled 600,000 dice and recorded the results you would end up with very close to 100,000 occurrences of each outcome –What would the frequency histogram of this data look like? In Psychology we often use opportunity samples and treat them as if they were random samples from a target population
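
A tiny simulation makes this concrete; the sketch below is illustrative only (not lecture code) and uses just the Python standard library.

```python
# Simulate 600,000 rolls of a fair die: each outcome occurs roughly 100,000 times,
# so the frequency histogram is flat (uniform), not bell shaped.
import random
from collections import Counter

random.seed(1)                                  # fixed seed so the sketch is reproducible
rolls = [random.randint(1, 6) for _ in range(600_000)]
counts = Counter(rolls)

for face in range(1, 7):
    print(face, counts[face])                   # each count is close to 100,000
```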

Key concept: the normal distribution Values close to the mean of a variable are often more frequent than values far from the mean –This is true of the height of adults –It is not true of rolling a die repeatedly When true, and if you sample randomly from the population, this produces a bell shaped frequency histogram Many psychological variables are normally distributed, e.g. IQ –in the case of the IQ test, it is designed to be like that Normal distributions can be visualised using frequency histograms just like the ones from Lecture 1

UK cats: Mean 5 Kg, SD 0.8 Kg. Carl Friedrich Gauss (1777–1855)

UK cats: Mean 5 Kg, SD 0.8 Kg. The curve shape is independent of sample size

The standard deviation - revision. A table of scores (pints), their deviations from the mean, and the squared deviations. The sum of the squared deviations is 64. The mean squared deviation (the variance) is therefore 64 / (6 – 1) = 12.8. Square rooting the variance returns it to the original measurement units of the variable. Therefore the SD is 3.58 pints
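
As a quick check of this arithmetic, here is a short Python sketch; the six scores are hypothetical values chosen only so that their squared deviations also sum to 64 (the original table values are not reproduced above).

```python
# Sample standard deviation with the n - 1 denominator, as in the worked example.
scores = [2, 2, 6, 6, 10, 10]            # hypothetical scores (pints), not the original table

n = len(scores)                          # 6 scores
mean = sum(scores) / n                   # mean = 6.0
squared_devs = [(x - mean) ** 2 for x in scores]

print(sum(squared_devs))                 # 64.0
variance = sum(squared_devs) / (n - 1)   # 64 / 5 = 12.8
sd = variance ** 0.5                     # sqrt(12.8) ≈ 3.58
print(variance, sd)
```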

Greek cats: Mean 3.55 Kg, SD 0.4 Kg. UK cats: Mean 5.0 Kg, SD 0.8 Kg

A Greek cat weighing 3.95 Kg and a British cat weighing 5.8 Kg are clearly different from each other in important ways, including their weight But, to a statistician, they are identical in one important respect: –they occupy the same position in their respective sample distributions –relative to their sample means they are both equally “unusual” occurrences –Both can be described by “Mean + 1SD” –If you randomly select 1 cat from the 10,000 Greek and the 10,000 British cats then the probability of sampling a 5.8 Kg British cat is equal to the probability of sampling a 3.95 Kg Greek cat

6,800 Greek cats. UK cats: Mean 5.0 Kg, SD 0.8 Kg

Recall the student sleep example from the start of the lecture, and the 95% confidence interval around the mean number of hours slept. For any population that is normally distributed, 95% of all scores will fall within 1.96 SD either side of the mean This means that if you randomly select one case, its score is 95% likely to fall within 1.96 SD of the mean The confidence interval is based on the properties of normal distributions, but there are some intermediate steps to understand
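
The 1.96 figure can be checked numerically; the sketch below is illustrative only, assumes NumPy is available, and uses the UK cat population (mean 5 Kg, SD 0.8 Kg) as the normally distributed variable.

```python
# About 95% of a normal population lies within 1.96 SD of the mean.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(loc=5.0, scale=0.8, size=1_000_000)   # simulated cat weights (Kg)

within = np.abs(weights - 5.0) <= 1.96 * 0.8
print(within.mean())    # ≈ 0.95
```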

The standard normal distribution and z scores The family of normal distributions has an infinite number of members, each defined by a unique combination of mean and SD There is one particular normal distribution, called the standard normal distribution, which has a special status –It has a mean of 0 –It has a SD of 1

The standard normal distribution and z scores One useful thing about the standard normal distribution is that scores from any other normal distribution can be converted into scores on the standard normal distribution The converted scores are called z scores The new scores lose their original units (e.g. Kg), and are now expressed in units of SD This is useful for comparing between samples –If you know the z score for a Greek cat and for a British cat you can see directly which cat is a relatively heavier example of its own population

Calculating z scores z = (score – sample mean) / SD of sample z score for a British cat weighing 4.7 Kg: (4.7 – 5.0) / 0.8 = –0.38 z score for a Greek cat weighing 4.7 Kg: (4.7 – 3.55) / 0.4 = 2.88 The Greek cat is a very large cat The British cat is fairly typical, perhaps slightly on the small side
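
The same calculation as a small Python sketch, using the figures above:

```python
def z_score(score, mean, sd):
    """Express a raw score in units of SD relative to its own sample."""
    return (score - mean) / sd

print(z_score(4.7, 5.0, 0.8))    # British cat: -0.375, slightly below average
print(z_score(4.7, 3.55, 0.4))   # Greek cat: 2.875, i.e. about 2.88 - a very large cat
```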

What happens when the sample is small? In Psychology we usually work with small samples, and we often have little idea of the underlying population parameters With a small sample, you can still calculate a mean and an SD, although the sample might be too small to assess whether the underlying population is normally distributed Lying at the heart of statistics is the question of the relationship between populations and samples To explore this, we can use examples where population parameters are known

Population: Mean 5 Kg, SD 0.8 Kg. Sample of 10: Mean 4.6 Kg, SD 0.7 Kg

Confidence in the sample mean Given a sample, I can produce a sample mean Statistics is about describing the relationship between measured samples and underlying populations –A statistician will ask how good the sample mean is as an indicator of the population mean If you have a large and representative sample, then the sample mean is such a good estimate of the population mean that the two can be used interchangeably But often we have a small sample, and no population data (or large sample) to compare it with We can use the cats example, treating the full sample of 10,000 as the population, to explore the relationship between small samples and the population A key point is that the mean of a large sample is more likely to lie close to the population mean than the mean of a small sample

Quantifying confidence in the sample mean The sample mean is a point estimate of the population mean –If we collected a second random sample we would have two different point estimates of the population mean –By definition, they can’t both be correct –This situation implies an underlying continuum on which point estimates can lie –A continuum can be thought of as a curve plotted on a graph, like the normal distributions you saw earlier In statistics, we aim to convert the sample point estimate into an interval estimate –This is a range on the underlying continuum –We want to be able to say that given the sample, the population mean lies somewhere between X and Y –We will have to be content to say that we are 95% sure

Mean 5 Kg, SD 0.8 Kg. Where each black line crosses the x axis represents a separate point estimate of the population mean. The horizontal line is a visually judged interval estimate of the population mean, given the 10 samples

Sampling distributions Each sub-sample of 10 cats has a mean weight and SD that is slightly different from the full population of 10,000 Individual samples of 10 are often not normally distributed Imagine collecting 100 separate sub-samples of 10 cats, and producing 100 sample means The mean of the 100 sample means would be an excellent estimate of the population mean It is possible to plot a frequency histogram of the 100 obtained sample means
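
This thought experiment is easy to simulate; the sketch below is an illustration (not lecture code), assumes NumPy, and treats the cat population as normal with mean 5 Kg and SD 0.8 Kg.

```python
# Draw 100 random sub-samples of 10 cats and look at the 100 sample means.
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(5.0, 0.8, size=(100, 10))   # 100 samples of 10 cat weights (Kg)
sample_means = samples.mean(axis=1)

print(sample_means.mean())          # mean of the 100 sample means: very close to 5.0
print(sample_means.std(ddof=1))     # spread of the sample means: about 0.25, much
                                    # smaller than the population SD of 0.8
```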

Sampling distributions The frequency histogram of the 100 sample means will itself be normally distributed –This means the distribution can be defined by its mean and SD More generally, if you collect a sample, then theoretically speaking there is an underlying population of samples, of which yours is just one case –The population of sample means is normally distributed

Sampling distributions In the frequency histogram of the original sample, each case was a single cat In the corresponding sampling distribution, each case is the mean weight of 10 cats selected randomly from the cat population A key property of sampling distributions is that provided the individual samples have N >= 30 they are ALWAYS (at least approximately) normally distributed –even if the actual population is skewed (e.g. reaction time) –or bimodal –if the population is normal, the sampling distribution will also be normal for small samples (N < 30)
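
This key property can be illustrated with a skewed population; the sketch below is illustrative only, assumes NumPy, and uses an exponential distribution as a stand-in for skewed reaction times.

```python
# Sampling distribution from a skewed population: means of N = 30 samples
# look roughly bell shaped even though the raw scores do not.
import numpy as np

rng = np.random.default_rng(7)

def skew(x):
    """Crude skewness estimate: roughly 0 for a symmetric distribution."""
    return float(((x - x.mean()) ** 3).mean() / x.std() ** 3)

# A strongly right-skewed population, standing in for reaction times (ms).
raw_scores = rng.exponential(scale=300.0, size=100_000)

# Means of 5,000 independent samples of N = 30 from the same population.
sample_means = rng.exponential(scale=300.0, size=(5_000, 30)).mean(axis=1)

print(skew(raw_scores))     # around 2: the raw scores are heavily skewed
print(skew(sample_means))   # around 0.3: the sample means are roughly symmetric
```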

The black curves are frequency histograms of the means of samples randomly selected from the pink population distribution

Populations and samples If we had enough samples to plot a sampling distribution, then for one sample we could assess exactly how close it is to the population mean (mean of the sampling distribution) But what if we only have one sample? –With only one sample, because sampling distributions are normally distributed we still know that the mean of the single sample is 95% likely to fall within 1.96 SD either side of the mean of a theoretical sampling distribution Avoid confusion: –the mean of a sampling distribution IS equal to the population mean –the SD of a sampling distribution is NOT equal to the SD of the population

Standard error (SE) In statistics, the standard deviation of a sampling distribution is given a different name to distinguish it from the standard deviation of a single sample or the standard deviation of a population It is called the standard error (SE) Its name contains the word “error” because statisticians use it to estimate measurement error

Standard error (SE) We can be 95% sure that a sample mean will lie within +/– 1.96 SE of the mean of the distribution of sample means –this provides the 95% confidence interval from the example about how long students sleep given at the start of the lecture –7.2 hours – 0.4 hours (1.96 * SE) = 6.8 hours is the lower bound –7.2 hours + 0.4 hours (1.96 * SE) = 7.6 hours is the upper bound But at the moment we have only seen how to calculate the SE if you have collected a huge number of samples of a specific size Therefore, the problem to solve is how to find the SE from a single sample The starting point for solving this problem is the fact that the SE of the sampling distribution shrinks as the individual samples making up the distribution increase in size –This in turn is because the mean of a large sample is more likely to be close to the population mean than the mean of a small sample

The SE of the sampling distribution is smaller when the individual samples making up the distribution are larger

Standard error and confidence intervals The previous slide illustrates that the confidence interval around a sample mean will be smaller if the sample size is bigger –smaller confidence intervals imply more accurate and therefore more useful measurements There is a lawful relationship between the sample size and the SE of the resulting sampling distribution –the SE is halved when the sample size is quadrupled The SE is also dependent upon the SD of the population the samples were drawn from –the SE is smaller when the population SD is smaller –As you don’t know the SD of the population, the sample SD is used as an estimate of the population SD

Standard error and confidence intervals This relationship between the SE, the sample size, and the SD is captured in the following formula: SE = SD of the sample / √(sample size)
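
A minimal sketch of this formula (using the UK population SD of 0.8 Kg purely for illustration); the square root is also what produces the "halved when quadrupled" rule from the previous slide.

```python
# SE = sample SD / square root of sample size.
import math

def standard_error(sd, n):
    return sd / math.sqrt(n)

# Quadrupling the sample size halves the SE (SD held fixed at 0.8 Kg here).
print(standard_error(0.8, 25))    # 0.16
print(standard_error(0.8, 100))   # 0.08
```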

Standard error and confidence intervals Standard error (SE) = sample SD / square root of sample size For the small sample of 10 UK cats we observed a SD of 0.7 Kg. 0.7 / square root of 10 (3.16) = 0.22 Kg One more step is required to arrive at a 95% confidence interval The standard error is 1 SD of the sampling distribution What proportion of samples have means that lie within 1 SE of the mean of the sampling distribution?
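
The question above can be answered with a quick simulation; the sketch below is illustrative only, assumes NumPy, and treats the cat population as normal with mean 5 Kg and SD 0.8 Kg.

```python
# What proportion of sample means lie within 1 SE of the mean of the
# sampling distribution? For a normal distribution it is about 68%.
import numpy as np

rng = np.random.default_rng(3)
means = rng.normal(5.0, 0.8, size=(50_000, 10)).mean(axis=1)   # 50,000 samples of 10

se = 0.8 / np.sqrt(10)                        # SE, using the known population SD here
print(np.mean(np.abs(means - 5.0) <= se))     # ≈ 0.68
```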

It is 68% likely that the population mean falls within 1 standard error above or below the sample mean

Standard error and confidence intervals By convention, we usually want to make the statement that we are 95% certain that the population mean lies between X and Y Therefore, we use the properties of normal distributions, which tell us that 95% of all sample means in the sampling distribution fall within 1.96 SE of its mean The confidence interval we give around the sample mean is 1.96 * SE either side of the mean

Standard error and confidence intervals The mean of the 10 cat small sample was 4.6 Kg (SD 0.7). This gives a standard error of 0.22 Kg Confidence interval = 1.96 * 0.22 = 0.43 Kg either side of the mean –“We measured the weight of 10 adult cats in the UK. The mean weight of the sample was 4.6 Kg. We can be 95% confident that the population mean weight of adult cats in the UK lies between 4.17 and 5.03 Kg.” The mean of the population of 10,000 cats was 5.0 Kg, and we can see that the above statement is true (but only just in this case!)
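
Putting the steps together for the 10-cat sample, a sketch of the arithmetic only:

```python
# 95% confidence interval for the small sample of 10 UK cats.
import math

mean, sd, n = 4.6, 0.7, 10
se = sd / math.sqrt(n)                # ≈ 0.22 Kg
margin = 1.96 * se                    # ≈ 0.43 Kg

print(mean - margin, mean + margin)   # roughly 4.17 to 5.03 Kg
```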

Standard error and confidence intervals It is important to remember that we accept a 5% risk that the interval estimate is wrong, and the population mean does not lie within the range of values it defines This is the risk that the sample we have is located in one of the two tails of the underlying sampling distribution Imagine we draw a second sample, this time of 40 UK adult cats, from the population of 10,000 This time, we obtain a mean weight of 4.87 Kg, with a SD of 0.84 Kg This SD is bigger than the SD of the small sample, which was 0.7, so in a sense this sample is showing greater variation

Standard error and confidence intervals You might expect that the SE and confidence interval will also be larger for this sample But, because the SE formula involves dividing by the square root of the sample size, it turns out that this is not the case Standard error is 0.84 / square root of 40 (6.32) = 0.13 Confidence interval is 1.96 * SE ≈ 0.26 Kg either side of the mean Confidence interval for the small sample was 0.43 Kg either side of the mean
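
The same sketch for the larger sample shows the narrower interval, even though its SD is bigger:

```python
# 95% confidence interval for the sample of 40 UK cats: a larger SD, but a
# narrower interval, because the SE divides by the square root of the sample size.
import math

mean, sd, n = 4.87, 0.84, 40
se = sd / math.sqrt(n)                # 0.84 / 6.32 ≈ 0.13 Kg
margin = 1.96 * se                    # ≈ 0.26 Kg, versus 0.43 Kg for the sample of 10

print(mean - margin, mean + margin)   # roughly 4.61 to 5.13 Kg
```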

List of statistical terms for revision Note that the technical meaning of terms in statistics is not always the same as the everyday meaning of the words. If you understand each of these concepts then you are well on the way to understanding statistics! population sample random normal distribution (also known as a bell curve or Gaussian distribution) frequency standard normal distribution z score sampling distribution standard error confidence interval