Sampling and Confidence Interval



Presentation on theme: "Sampling and Confidence Interval"— Presentation transcript:

1 Sampling and Confidence Interval
Epidemiology/Biostatistics Kenneth Kwan Ho Chui, PhD, MPH Department of Public Health and Community Medicine

2 Learning objectives in the syllabus
- Understand how a histogram can be read as a probability distribution
- Understand the importance of random sampling in statistics
- Understand how sample means can have distributions
- Explain the behavior (distribution) of sample means and the Central Limit Theorem
- Know how to interpret confidence intervals as seen in the medical literature
- Know how to calculate a confidence interval for a mean

3 Course roadmap (concept map)
Population and its parameter; sample and its sample statistics; types of data; how to summarize data (central tendency and variability); how to evaluate graphs. Today: the distribution of sample means, and how to interpret and calculate a confidence interval for statistical inference.

4 Assumed knowledge for today
- Mean
- Variance
- Standard deviation
- The 68-95-99 rule

5 Central tendency: Mean
Consider a variable with data: 1, 2, 3, 3, 4, 4, 4, 5, 5, 6. The mean is the sum of the values divided by the number of values: (1 + 2 + 3 + 3 + 4 + 4 + 4 + 5 + 5 + 6) / 10 = 3.7.
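A minimal R check of this arithmetic (the numbers are the ones on the slide):

x <- c(1, 2, 3, 3, 4, 4, 4, 5, 5, 6)   # the ten values from the slide
sum(x) / length(x)                      # sum of the values divided by how many there are: 3.7
mean(x)                                 # the built-in function gives the same answer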

6 Variance & Standard deviation
To compute the variance, take each observation's deviation from the mean, square the deviations, sum them up, and divide by (sample size − 1). The standard deviation is the square root of the variance: SD = √Variance.
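A short R sketch of the same steps; the data are re-used from slide 5 (an assumption, since this slide's table does not show its values):

x <- c(1, 2, 3, 3, 4, 4, 4, 5, 5, 6)   # data assumed from slide 5
dev2 <- (x - mean(x))^2                 # squared deviation of each observation from the mean
v <- sum(dev2) / (length(x) - 1)        # sum them up, divide by (sample size - 1)
s <- sqrt(v)                            # SD = square root of the variance
c(variance = v, sd = s)
c(var(x), sd(x))                        # R's built-in functions agree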

7 The 68-95-99 rule
68% of samples fall within ±1 SD of the mean, 95% within ±2 SD, and 99% within ±3 SD. In percentile terms, the mean is the 50th percentile; +1, +2, and +3 SD correspond roughly to the 84th, 97.5th, and 99.5th percentiles, and −1, −2, and −3 SD to roughly the 16th, 2.5th, and 0.5th percentiles.

R code used to draw the shaded bell curve on this slide (it writes a PNG to c:\temp):

# External package to generate four shades of blue
# library(RColorBrewer)
# cols <- rev(brewer.pal(4, "Blues"))
cols <- c("#2171B5", "#6BAED6", "#BDD7E7", "#EFF3FF")

# Sequence between -4 and 4 with 0.1 steps
x <- seq(-4, 4, 0.1)

png("c:\\temp\\normalcurve.png", width=2000, height=1000, res=250)

# Plot an empty chart with tight axis boundaries, and axis lines on bottom and left
plot(x, type="n", xaxs="i", yaxs="i", xlim=c(-4, 4), ylim=c(0, 0.4),
     bty="l", xaxt="n", xlab="a variable", ylab="frequency")

# Function to plot each coloured portion of the curve, between "a" and "b", as a
# polygon; the function "dnorm" is the normal probability density function
polysection <- function(a, b, col, n=11){
  dx <- seq(a, b, length.out=n)
  polygon(c(a, dx, b), c(0, dnorm(dx), 0), col=col, border=NA)
  # draw a white vertical line on the "inside" side to separate each section
  segments(a, 0, a, dnorm(a), col="white")
}

# Build the four left and right portions of this bell curve
for(i in 0:3){
  polysection( i,   i+1, col=cols[i+1])   # right of 0
  polysection(-i-1, -i,  col=cols[i+1])   # left of 0
}

# Black outline of bell curve
lines(x, dnorm(x))

# Bottom axis labels, in standard deviations away from the mean
axis(1, at=-3:3, labels=expression(-3*SD, -2*SD, -1*SD, Mean, 1*SD, 2*SD, 3*SD))

# Percent density of each division between x and x+1; pd holds the exact values,
# pd2 the rounded labels actually printed on the slide
pd  <- sprintf("%.1f%%", 100*(pnorm(1:4) - pnorm(0:3)))
pd2 <- c("34%", "13.5%", "2%", "0.5%")
text(c((0:3)+0.5, (0:-3)-0.5), c(0.16, 0.05, 0.04, 0.02), pd2,
     col=c("white","white","black","black"))
segments(c(-2.5, -3.5, 2.5, 3.5), dnorm(c(2.5, 3.5)),
         c(-2.5, -3.5, 2.5, 3.5), c(0.03, 0.01))

dev.off()
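As a quick numerical check of the rule (separate from the plotting code above), pnorm gives the exact areas under the normal curve:

inside <- function(k) pnorm(k) - pnorm(-k)   # probability of falling within k SD of the mean
round(100 * sapply(1:3, inside), 1)          # about 68.3%, 95.4%, 99.7%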

8 Population and sample
The true mean BMI of Boston, Massachusetts is a population parameter, and it is unknown (the "?" in the diagram). A researcher draws a sample from Boston; the mean BMI of that sample is a sample statistic.

9 Sample variation
The whole population is 1, 2, 3, 4, 5, 6 (its true mean is unknown to the researchers, hence the "?"). Four researchers each draw a sample of two: Researcher 1 draws 2, 4 (mean 3.0); Researcher 2 draws 4, 6 (mean 5.0); Researcher 3 draws 1, 2 (mean 1.5); Researcher 4 draws 1, 6 (mean 3.5). Different samples give different means.
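The same experiment can be run in R; this is a sketch (set.seed is only there to make the example reproducible), and every fresh run gives four different means:

pop <- 1:6                            # the whole population from the slide
set.seed(1)                           # for a reproducible example only
replicate(4, mean(sample(pop, 2)))    # four researchers, each drawing a sample of two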

10 Central limit theorem

11 Central limit theorem
The means obtained from many samplings from the same population have the following properties (a small simulation sketch follows this list):
- The distribution of the means is approximately normal if the sample size is big enough (above 120 or so), regardless of the population's distribution
- The mean of the sample means is equal to the population mean
- The standard deviation of the sample means, known as the standard error of the mean (SEM), is inversely related to the sample size: if we repeat the experiment with a bigger sample size, the resulting histogram will be "slimmer"
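A hedged simulation sketch of these properties; the skewed population and the numbers below are illustrative assumptions, not from the slides:

set.seed(42)
pop   <- rexp(10000)                                  # a strongly right-skewed population
means <- replicate(5000, mean(sample(pop, 200)))      # 5,000 repeated samples of n = 200
hist(means, breaks = 40, xlab = "sample mean", main = "Distribution of sample means")
c(population_mean = mean(pop),                        # about 1
  mean_of_means   = mean(means),                      # close to the population mean
  sem             = sd(means),                        # shrinks if n is increased
  sd_over_sqrt_n  = sd(pop) / sqrt(200))              # theoretical SE, close to sem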

12 Understanding CLT through simulation
- Population size: 10,000
- Possible values: 0 through 9, with 1,000 observations of each value
- True population mean: 4.50

13 Simulation scheme
From the population of 10,000, repeatedly draw samples of size n = 500 and compute each sample mean; after 10,000 such draws, plot the frequency distribution (histogram) of the sample means.

14 Sample size = 500; # of draws = 10000
The 10,000 sample means form a histogram centred on the true mean 4.5. Their standard deviation (the SE) is 0.13. Empirically, ±1 SE contains 67.95% of the sample means, ±2 SE contains 95.04%, and ±3 SE contains 99.10%, matching the 68-95-99 rule.

R code used for the simulation and the plots on this slide:

x <- rep(0:9, rep(1000, 10))        # population of 10,000: values 0 to 9, 1,000 of each
m.out <- vector()
lower <- vector()
upper <- vector()
while (length(m.out) < 10000) {
  samp01 <- sample(x, 500)                  # one sample of n = 500
  m01    <- mean(samp01)                    # its sample mean
  se01   <- sd(samp01) / sqrt(500)          # its estimated standard error
  lower01 <- m01 - 1.96 * se01              # lower 95% confidence limit
  upper01 <- m01 + 1.96 * se01              # upper 95% confidence limit
  m.out <- c(m.out, m01)
  lower <- c(lower, lower01)
  upper <- c(upper, upper01)
}
my.data     <- data.frame(m.out, lower, upper)
my.data.500 <- my.data[order(m.out), ]

# Distribution of the 10,000 sample means, with the true mean marked
plot(density(m.out), main = "", xlab = "Sample means")
abline(v = mean(x), col = "#336699", lwd = 2)   # true mean 4.5

# 95% confidence intervals of the first 200 samples, drawn as horizontal segments
plot(my.data.500$m.out[1:200], 1:200,
     xlim = c(min(my.data.500$lower[1:200]), max(my.data.500$upper[1:200])),
     xlab = "Sample means", ylab = "Sample #")
segments(my.data.500$lower[1:200], 1:200, my.data.500$upper[1:200], 1:200)
abline(v = mean(x))

# How many of the 10,000 sample means fall within 1, 1.96 and 2.58 SD of their mean?
varx <- my.data.500$m.out
my.m <- mean(varx)
my.s <- sd(varx)
sum(varx > (my.m - 1.00*my.s) & varx < (my.m + 1.00*my.s))   # about 6,800 (68%)
sum(varx > (my.m - 1.96*my.s) & varx < (my.m + 1.96*my.s))   # about 9,500 (95%)
sum(varx > (my.m - 2.58*my.s) & varx < (my.m + 2.58*my.s))   # about 9,900 (99%)

15 Characteristics for the distribution of means
In the previous slide, the mean 4.5 of the distribution of sample means is the true population parameter, for which we have a Greek name, μ (mu). Similarly, the SD of that distribution, 0.13, is a fixed quantity determined by the population SD, σ (sigma), and the sample size; we call this SD of the means the "standard error of the mean" (SEM) or "standard error" (SE). The SE can be estimated from a single sample as SE = SD / √n, the sample standard deviation divided by the square root of the sample size.
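For example (a sketch; this is not the slide's own code), one sample of 500 from the slide-12 population gives an SE estimate close to the 0.13 seen across the 10,000 simulated means:

x    <- rep(0:9, each = 1000)       # the population of 10,000 from slide 12
samp <- sample(x, 500)              # one researcher's sample
sd(samp) / sqrt(length(samp))       # estimated SE, roughly 0.13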

16 Why bigger sample sizes are often better
(Three histograms of sample means, one per sample size.) With sample size 200, SE = 0.20; with 500, SE = 0.13; with 1,000, SE = 0.08. The bigger the sample, the slimmer the distribution of sample means.
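A sketch that reproduces this pattern by brute-force simulation (the numbers will wobble a little from run to run):

x   <- rep(0:9, each = 1000)
sem <- function(n, draws = 10000) sd(replicate(draws, mean(sample(x, n))))
round(c(n200 = sem(200), n500 = sem(500), n1000 = sem(1000)), 2)   # roughly 0.20, 0.13, 0.08-0.09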

17 Confidence interval

18 I got CLT, so now what?
- The histogram of sample means can be viewed as a "probability distribution"
- The sample mean that any one researcher obtains can land anywhere under the bell curve
- How should we define "acceptably close" to the population mean? By convention, the middle 95% of the curve

R code used to draw the shaded curve of sample means on this slide (the same script as on slide 7, with a pink palette and the axis labelled in SE units around μ):

# External package to generate four shades of colour
# library(RColorBrewer)
# cols <- rev(brewer.pal(4, "Blues"))
cols <- c("#ce1256", "#df65b0", "#d7b5d8", "#f1eef6")

# Sequence between -4 and 4 with 0.1 steps
x <- seq(-4, 4, 0.1)

png("c:\\temp\\normalcurve.png", width=2000, height=1000, res=250)

# Plot an empty chart with tight axis boundaries, and axis lines on bottom and left
plot(x, type="n", xaxs="i", yaxs="i", xlim=c(-4, 4), ylim=c(0, 0.4),
     bty="l", xaxt="n", xlab="sample means", ylab="probability")

# Function to plot each coloured portion of the curve, between "a" and "b", as a
# polygon; the function "dnorm" is the normal probability density function
polysection <- function(a, b, col, n=11){
  dx <- seq(a, b, length.out=n)
  polygon(c(a, dx, b), c(0, dnorm(dx), 0), col=col, border=NA)
  # draw a white vertical line on the "inside" side to separate each section
  segments(a, 0, a, dnorm(a), col="white")
}

# Build the four left and right portions of this bell curve
for(i in 0:3){
  polysection( i,   i+1, col=cols[i+1])   # right of 0
  polysection(-i-1, -i,  col=cols[i+1])   # left of 0
}

# Black outline of bell curve
lines(x, dnorm(x))

# Bottom axis labels, in standard errors away from mu (the population mean)
axis(1, at=-3:3, labels=expression(-3*SE, -2*SE, -1*SE, mu, 1*SE, 2*SE, 3*SE))

# Percent density of each division between x and x+1; pd holds the exact values,
# pd2 the rounded labels actually printed on the slide
pd  <- sprintf("%.1f%%", 100*(pnorm(1:4) - pnorm(0:3)))
pd2 <- c("34%", "13.5%", "2%", "0.5%")
text(c((0:3)+0.5, (0:-3)-0.5), c(0.16, 0.05, 0.04, 0.02), pd2,
     col=c("white","white","black","black"))
segments(c(-2.5, -3.5, 2.5, 3.5), dnorm(c(2.5, 3.5)),
         c(-2.5, -3.5, 2.5, 3.5), c(0.03, 0.01))

dev.off()

19 The confidence interval
The same shaded bell curve as on the previous slide, with the middle 95% of sample means (within about ±2 SE of μ) highlighted. The plotting code is identical to that shown on slide 18.

20 If we put a CI on every sample mean, about 95% of them will include the true mean
In the figure, each horizontal segment is one sample's 95% CI and the vertical line marks the true mean. The two red segments are the "unlucky" samples whose intervals do not include the true mean.
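A small coverage simulation, sketched under the same setup as slide 14 (an assumption; this code is not shown on the slide):

x <- rep(0:9, each = 1000)                      # population with true mean 4.5
covered <- replicate(1000, {
  s  <- sample(x, 500)
  ci <- mean(s) + c(-1.96, 1.96) * sd(s) / sqrt(500)
  ci[1] <= 4.5 && 4.5 <= ci[2]                  # does this CI include the true mean?
})
mean(covered)                                   # close to 0.95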

21 Interpretation of a confidence interval
The mean and 95% confidence interval (CI) of the blood glucose of a sample is: 140 mg/dl (95% CI: 120, 160). We are 95% confident that the interval from 120 to 160 mg/dl includes the true population mean; our best single estimate is 140 mg/dl (the sample mean). Why only 95% confident? Because the sample mean can, unluckily, be an extreme one that falls beyond ±2 SE of the true mean (the outer zones of the bell curve).
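A concrete sketch in R; the glucose values below are simulated, hypothetical data chosen only so the result lands near the slide's 140 (120, 160) example:

set.seed(7)
glucose <- rnorm(100, mean = 140, sd = 100)       # hypothetical sample of 100 patients
m  <- mean(glucose)
se <- sd(glucose) / sqrt(length(glucose))
round(c(mean = m, lower = m - 1.96 * se, upper = m + 1.96 * se))   # roughly 140 (120, 160)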

22 Some common CIs and their z-score multipliers
A confidence interval has two numbers: the lower and upper confidence limits. The multiplier depends on the confidence level (a qnorm check follows the list):
- 90% CI: Mean ± 1.65 × SE
- 95% CI: Mean ± 1.96 × SE (2.00 is an approximation; 1.96 is recommended). This is the most commonly used criterion
- 99% CI: Mean ± 2.58 × SE
The more certain we want the interval to be to include the true mean, the wider the CI becomes: "I am 100% certain that the true mean is between −∞ and ∞."
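These multipliers are just quantiles of the normal distribution; qnorm recovers them (a check, not part of the slide):

round(qnorm(c(0.95, 0.975, 0.995)), 2)   # 1.64 (often quoted as 1.65), 1.96, 2.58 for 90%, 95%, 99% CIs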

23 How to narrow down a confidence interval?
- Lower our certainty, by opting for, say, a 90% CI instead of a 95% CI
- Decrease the sample standard deviation (for instance, by using a more accurate measurement device)
- Increase the sample size (a short numeric sketch of all three levers follows this list)
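A sketch of how each lever changes the width of the interval, using made-up numbers:

ci_width <- function(z, sd, n) 2 * z * sd / sqrt(n)
ci_width(1.96, sd = 10, n = 100)   # baseline 95% CI width
ci_width(1.65, sd = 10, n = 100)   # lower certainty (90% CI): narrower
ci_width(1.96, sd = 5,  n = 100)   # smaller SD: narrower
ci_width(1.96, sd = 10, n = 400)   # bigger sample: narrower (half the baseline width)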

24 Are confidence intervals always symmetric?
Not always. CIs for means of untransformed continuous variables are symmetric. However, CIs for other statistics, such as odds ratios and relative risks, are calculated on the logarithmic scale; when the limits are back-transformed to the ratio scale, the interval becomes asymmetric. For example: "Multivariable analysis revealed a more than 2-fold increase in the risk of total stroke among men with job strain (combination of high job demand and low job control) (hazard ratio, 2.73; 95% confidence interval, )"
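A sketch of the log-scale calculation with a hypothetical standard error (the quoted study's own limits are left blank above and are not reproduced here):

log_hr <- log(2.73)                          # point estimate on the log scale
se_log <- 0.35                               # hypothetical SE of log(HR), for illustration only
exp(log_hr + c(-1.96, 1.96) * se_log)        # back-transformed limits are not symmetric around 2.73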

