Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 3 – Distributions, the confidence interval of a mean, and error bars Marshall University Genomics Core Facility

Distribution
A distribution describes a data set.
– Describes how frequently the values in the data set occur.
– Can be used to describe an actual data set from a sample, or a theoretical data set in a population. Most commonly used in the theoretical case.
– Can be used with discrete (categorical) or continuous variables.

Simple Distribution Example
For a very simple example, we'll use some categorical data from "Improving Adherence To A Mechanical Ventilation Weaning Protocol for Critically Ill Adults", American Journal of Critical Care, May 2006, 15;3.
– Data gives number of patients in the study with each of a collection of health problems which lead to mechanical ventilation.
– For a distribution, the number should be relative to the total.
– The total number of patients in this study was 129, so all numbers were divided by 129.

Comorbid Health Problem

Distributions for continuous data
What if the variable is continuous?
– If measured precisely enough, no value occurs more than once…
– Can graph the distribution with a histogram instead of a bar chart.
– Break the range of the variable into intervals and show the relative frequency in each interval.
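The binning step described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name, the bin width, and the list of ages are invented here and are not taken from the lecture's data. It assumes every value is at least `start`.

```python
def relative_frequency_histogram(values, bin_width, start):
    """Return (lower, upper, relative_frequency) for each interval.

    Assumes every value is >= start; intervals are [lower, upper).
    """
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    total = len(values)
    return [
        (start + i * bin_width, start + (i + 1) * bin_width, c / total)
        for i, c in enumerate(counts)
    ]


# Hypothetical ages, invented for illustration only.
ages = [23, 35, 41, 47, 52, 58, 58, 63, 67, 71]
for lower, upper, freq in relative_frequency_histogram(ages, bin_width=10, start=20):
    print(f"[{lower}, {upper}): {freq:.2f}")
```

Because each count is divided by the total, the relative frequencies always sum to 1, which is what makes the histogram a distribution rather than a table of counts.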

Histogram of Ages

Population Distributions
In many cases, it is useful to be able to describe the distribution of the population, rather than the sample.
– Remember, we never actually know the data from the whole population.
– However, we can sometimes deduce (mathematically) properties that must be true of its distribution. When we can't, we can often make reasonable assumptions.
– In a lab experiment, the population is the set of all possible similar experiments we could perform. Infinitely many data points! In other scenarios, the population is large enough that we can consider it infinite.

Probability Density Functions
A probability density function is like a histogram of relative frequency, but for an infinite number of data points!
– It has the property that if we select any interval of values, the area under the graph in that interval is the probability (or relative frequency) that a value will be in that interval. The area under the whole graph must be equal to 1.
Theoretical distributions are usually described by their probability density functions.

Distributions for Experimental Data
Think about any lab experiment which results in the measurement of a continuous variable. If we repeat the experiment many times under the same conditions, we will get similar, but not identical, results.
What are the possible sources of the variation?
– Imprecise measurement of reagents.
– Imprecise pipetting.
– Nonhomogeneous mixes of solutions, suspensions, etc.
– For experiments based on animal or human samples, natural genetic variation, etc.
– Many others…

Effect of Multiple Errors on Measurements
When multiple sources of variation ("errors") are present in a single measurement, they will often combine additively.
Most of the time, some variation will increase the measured value, and some will decrease it.
– So much of the variation cancels out, and most measured values will be close to the "true" value.
Less often, all the sources of variation will combine to either increase or decrease the measurement.
– So some, but fewer, measured values will lie far from the true mean.
In the 18th and early 19th centuries, mathematicians determined what the distribution for experimental data would look like, assuming there were many sources of variation and that they behaved additively.

The Gaussian Distribution
The resulting distribution is called the Gaussian Distribution or Normal Distribution.

Mathematical Details of the Normal Distribution
Remember, the probability a value lies in a given interval is the area under the curve over that interval.
The formula for the density function is well-known.
However, there is no closed-form formula, in terms of elementary functions, for the area under an arbitrary portion of the curve.
– Even with calculus!
Instead, numerical techniques are used, and values are published in tables or numerically calculated when required by software.
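As a rough illustration of what "numerically calculated" can mean, the sketch below approximates the area under the normal density with composite Simpson's rule. Simpson's rule is chosen here only for simplicity; the lecture does not say which method tables or software actually use, and the function names are invented for this example.

```python
import math


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of the normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def area_under_normal(a, b, mean=0.0, sd=1.0, steps=1000):
    """Approximate the area between a and b with composite Simpson's rule."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of subintervals
    h = (b - a) / steps
    total = normal_pdf(a, mean, sd) + normal_pdf(b, mean, sd)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * normal_pdf(a + i * h, mean, sd)
    return total * h / 3.0


print(round(area_under_normal(-1, 1), 3))  # area within 1 sd of the mean
print(round(area_under_normal(-2, 2), 3))  # area within 2 sd of the mean
```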

Standard Deviation and the Normal Distribution
If a variable is distributed according to the normal distribution, the probability that it lies within 1 standard deviation of the mean is 68.3%.

Standard Deviation and the Normal Distribution
If a variable is distributed according to the normal distribution, the probability that it lies within 2 standard deviations of the mean is 95.5%.

The Standard Normal Distribution
The probability values expressed in terms of standard deviations, as in the previous two slides, are true no matter what the standard deviation or mean are.
Consequently, if we know values for the normal distribution with mean 0 and standard deviation 1, we know the values for any normal distribution.
The normal distribution with mean μ=0 and standard deviation σ=1 is called the standard normal distribution.
If a variable is normally distributed, then subtracting its mean and dividing by its standard deviation yields a variable with the standard normal distribution.
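A short check of this claim, using `NormalDist` from Python's standard `statistics` module. The mean of 100 and standard deviation of 15 are arbitrary values chosen for illustration, and the helper names are invented here.

```python
from statistics import NormalDist


def prob_within_k_sd(mean, sd, k):
    """Probability that a normal variable lies within k sd of its mean."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)


def standardize(x, mean, sd):
    """Subtract the mean and divide by the sd, giving a standard normal value."""
    return (x - mean) / sd


# The same probabilities come out whatever mean and sd are used.
for k in (1, 2):
    print(k, round(prob_within_k_sd(0, 1, k), 3), round(prob_within_k_sd(100, 15, k), 3))

# A value one sd above the mean standardizes to 1.0.
print(standardize(115, 100, 15))
```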

The Central Limit Theorem
The normal distribution is particularly important in statistics for the following reason:
– Take some measurement, with any distribution whatsoever (provided it has a finite mean and standard deviation).
– Sample that measurement n times and compute the mean.
– Repeat that many times to get lots of estimates of the mean.
– Those means will be approximately normally distributed.
  The mean of those means will be the same as the mean of the original variable.
  The standard deviation of those means will be the standard deviation of the original variable, divided by √n.
– The approximation improves as n increases.
Since many statistical tests are concerned only with the difference between means of values, tests that assume the normal distribution may work well even when the variable is not normally distributed.
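The theorem can be checked by simulation. The sketch below draws from an exponential distribution with mean 1 and standard deviation 1, picked here only because it is clearly not normal; the function name, sample size, and seed are likewise choices made for this illustration.

```python
import random
import statistics


def simulate_sample_means(n, repeats, seed=0):
    """Means of `repeats` samples of size n from an exponential distribution."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repeats)
    ]


n = 25
means = simulate_sample_means(n, repeats=4000, seed=1)

# The original variable has mean 1 and sd 1, so the theorem predicts
# sample means centred near 1 with sd near 1/sqrt(25) = 0.2.
print("mean of sample means:", round(statistics.fmean(means), 3))
print("sd of sample means:  ", round(statistics.stdev(means), 3))
```

Plotting a histogram of `means` for increasing n would show the shape moving from skewed toward the symmetric bell curve.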

The lognormal distribution
The argument that many sources of variation lead to a normal distribution relies on the assumption that the sources of variation are additive.
Sometimes the sources of variation are multiplicative.
– In this case an effect is as likely to double a value as to halve it.
– 100 is equally likely to be moved to 50 or 200.
– Not symmetric!
– This leads to a lognormal distribution.

The lognormal distribution

Working with the lognormal distribution
If the data are distributed lognormally:
– The log of the data will be distributed normally.
– Take logs of all the data values and use standard statistical tests on the log data.
– Later in the course we'll see how to test whether data are distributed normally or not.
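A minimal sketch of the log transform, reusing the 50, 100, 200 illustration from the earlier slide. Back-transforming the mean of the logs (which gives the geometric mean) is an addition made here for illustration and is not something the slide itself prescribes.

```python
import math
import statistics


def log_transform(values):
    """Natural log of each value; all values must be positive."""
    return [math.log(v) for v in values]


values = [50.0, 100.0, 200.0]
logs = log_transform(values)

# On the log scale, halving and doubling are equal steps in opposite directions.
print([round(x, 3) for x in logs])

# Mean on the log scale, transformed back to the original units.
geometric_mean = math.exp(statistics.fmean(logs))
print(round(geometric_mean, 1))
```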

The confidence interval of a mean
A common theme in experiments is to collect data from a number of samples, and compute the mean.
– We interpret the mean of our samples as an approximation to the mean over the whole population.
– We need to know the precision of this approximation.
Commonly, we will compute the mean from two or more groups of samples and compare them.
– We are really interested in comparing the means of the populations.
– We cannot make this inference without a sense of the precision of our approximations.
The most intuitive way to do this is to compute the confidence interval for the mean.

Values determining the CI of a mean
Four values are used in determining the CI of a mean:
1. The mean of the sample. The confidence interval is centered on the sample mean.
2. The SD of the sample. The smaller the SD, the narrower (more precise) the confidence interval.
3. The sample size. The larger the sample size, the more precisely we know the mean, and so the narrower the confidence interval.
4. The degree of confidence required. A higher degree of confidence requires a wider interval.

Assumptions for computing the CI of a mean
The calculation for the CI of a mean relies on four assumptions:
1. Representative sample
2. Independent observations
3. Accurate data
4. Population values are approximately normally distributed

Calculating the CI of a mean
Any statistical software worth using will perform this calculation for you. However, it's probably worth seeing how it works.
– Compute the mean, m, and standard deviation, s, of your sample.
– The confidence interval will be centered on the mean m, so we need to know how to compute the width of the interval.
– The half-width of the interval, w, which we call the margin of error, depends on a value from the t distribution, which we'll call t*. We'll discuss the t distribution in the next lecture.
– t* depends on:
  the number of degrees of freedom, which in this case is n-1, where n is the sample size
  the degree of confidence
The margin of error is w=t*s/√n.

Example: CI of the mean
From the mechanical ventilation paper referenced earlier:
– Before the experimental intervention (which was to enhance adherence to clinical protocols), the mean duration of mechanical ventilation was 86.0 hours, with an SD of 68.0, and sample size 63.
– After intervention, the mean was 70.8, SD 67.5, with a sample size of 66.
– From tables, the t* values for 95% confidence are 1.999 for 62 degrees of freedom and 1.997 for 65 degrees of freedom.
– The margins of error are 17.1 (before intervention) and 16.6 (after).

Example continued
The confidence interval for the before-intervention mean thus has a lower limit of 86.0-17.1=68.9 and an upper limit of 86.0+17.1=103.1.
– We say the 95% confidence interval for the mean is [68.9, 103.1].
– Similarly, the 95% confidence interval for the mean duration of mechanical ventilation after intervention is [54.2, 87.4].
Are you confident there is a real difference?
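The arithmetic in this example can be reproduced from the summary statistics alone. In the sketch below the function name is invented, and the t* values are typed in as table values (1.999 for 62 degrees of freedom, 1.997 for 65) rather than computed.

```python
import math


def confidence_interval(mean, sd, n, t_star):
    """Return (lower, upper, margin) where margin = t* x sd / sqrt(n)."""
    margin = t_star * sd / math.sqrt(n)
    return mean - margin, mean + margin, margin


# Summary statistics from the mechanical ventilation example.
before = confidence_interval(mean=86.0, sd=68.0, n=63, t_star=1.999)
after = confidence_interval(mean=70.8, sd=67.5, n=66, t_star=1.997)

for label, (lower, upper, margin) in (("before", before), ("after", after)):
    print(f"{label}: margin {margin:.1f}, 95% CI [{lower:.1f}, {upper:.1f}]")
```

The two intervals overlap substantially, which is worth keeping in mind when answering the question the slide poses.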

Confidence Interval Plot

Confidence Intervals by Resampling
The previous formula for computing confidence intervals assumes the population data are normally distributed. The resampling technique doesn't make this assumption.
Resample the sample data by randomly picking values from the sample. Do this so the resample is the same size as the sample.
– Values from the sample may occur more than once in the resample.
– This is called a pseudosample.
Compute the mean of the resample.
Repeat this many times (say, 1000), to get a distribution of means.
Order the resampled means and identify the middle 95% (i.e. from the 2.5th percentile to the 97.5th percentile).
This is an approximation to the 95% confidence interval for the mean.
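The resampling steps above can be sketched as follows. The function name, the default of 1000 resamples, the way the percentile positions are rounded, and the sample values are all choices made for this illustration; percentile conventions differ slightly between software packages.

```python
import random
import statistics


def bootstrap_ci_mean(sample, resamples=1000, confidence=0.95, seed=None):
    """Approximate CI for the mean from the middle of the resampled means."""
    rng = random.Random(seed)
    n = len(sample)
    # Each pseudosample picks n values from the sample with replacement.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower_index = round(tail * resamples)
    upper_index = min(round((1.0 - tail) * resamples), resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical measurements, invented for illustration only.
sample = [12.1, 15.3, 9.8, 22.4, 13.0, 17.6, 11.2, 30.5, 14.9, 16.3]
low, high = bootstrap_ci_mean(sample, seed=42)
print(f"sample mean {statistics.fmean(sample):.2f}, approx. 95% CI [{low:.2f}, {high:.2f}]")
```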

Resampling
Resampling can be used for many different data and statistics.
No assumptions about the distribution.
– Requires only that the sample size is large enough to generate many (hundreds of) different resamples.
– Nine or ten values usually sufficient.
Reuses the same real sample to generate the pseudosamples; no real additional data is generated.
Theoretically valid approach.
But computationally intensive.

The Central Limit Theorem (again)
Recall the central limit theorem:
– Suppose we have a population with mean μ and standard deviation σ (but any distribution).
– If we repeatedly take samples of size n from this population, and compute their mean m:
  The sample means m will have a normal distribution (approximately).
  The (population) mean of the sample means will also be μ.
  The standard deviation of the sample means will be σ/√n.
Alternatively, we could say that the quantity (μ-m)/(σ/√n) follows (approximately) the standard normal distribution.

Subtle but important point
In the previous slide, the quantity that was stated to be normally distributed was (μ-m)/(σ/√n).
Notice the denominator includes σ, the population sd, not the sample sd.
– Substituting the sample sd will not result in a normal distribution.
– Of course, in most cases we only know the sample sd, not the population sd.

The t distribution
Suppose we have a normally distributed population, with mean μ.
Repeatedly pick a sample of size n from this population.
Compute the sample mean m, the sample sd s, and the quantity t=(μ-m)/(s/√n).
We want to know the distribution of values of t.
It turns out that the distribution depends on the sample size.
We characterize the t distribution by the degrees of freedom, which is n-1.

t distribution for small degrees of freedom

CI of a mean, revisited
We'll revisit the "after intervention" data set from earlier. Mean was m=70.8, sd s=67.5, and sample size n=66.
– For a t distribution with given degrees of freedom, we can find a value t* so that the chance that a value from that distribution lies between -t* and t* is 95%.
– Because the t distribution is symmetrical, t* is the 97.5th percentile (and -t* is the 2.5th percentile).
– If we call the (unknown) population mean μ, there is a 95% chance that (μ-m)/(s/√n) lies between -t* and t*.
– So there is a 95% chance that μ-m lies between -t*s/√n and t*s/√n.
– And so we are 95% confident that μ is in the range m±t*s/√n.
For the t distribution with 65 degrees of freedom, the t* value is 1.997, as we saw earlier.

Error Bars
As discussed earlier, any graphical presentation of data should quantify the variation in the data.
– Required to make inferences about the population, instead of just the sample.
There are three commonly used ways to display a measure of variability, or a measure of the precision with which we estimate a population mean:
– Display the standard deviation of the sample.
– Display the standard error of the mean (SEM), computed from the sample.
– Display a confidence interval (typically the 95% CI).
We are now familiar with the standard deviation and with confidence intervals: we'll start by examining what the SEM actually is.

The Standard Error of the Mean
Recall our standard setup:
– We have some population, with some unknown mean and standard deviation, and we select a sample from it.
– We compute the mean of the sample as an estimate of the mean of the population.
– If we were to repeatedly compute the mean of lots of samples, all of the same size, we would get (slightly) different values each time.
– The best estimate of the standard deviation of all these sample means is s/√n, where s is the sample standard deviation and n is the sample size.
This is the Standard Error of the Mean.

Interpreting the Standard Error of the Mean
Interpreting the SEM is difficult.
– In a sense, it says: "If we computed lots of sample means from samples this big, this is (our best estimate of) how far they would be, on average, from the true population mean."
– The SEM does not quantify the amount of scatter within your sample.
– If you have a large enough sample, the SEM will be very small, even if there is lots of scatter.
– The SEM does, indirectly, quantify how precisely you know the population mean.
– So does a confidence interval.

Why the SEM is hard to interpret
When we work with normally distributed values, we can easily answer the question: "What proportion of values lie within one standard deviation of the mean?"
The SEM is an estimate of the standard deviation of sample means.
But because the SEM is computed from the sample sd, sample means measured in units of the SEM are not normally distributed.
– They are distributed according to the t distribution, which itself depends on the sample size.
So the answer to "What proportion of sample means lie within one SEM of the true mean?" depends on the sample size.
– When n is very large, it is close to 68.3%.
– But for n=3, it is 57.7%.
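The dependence on sample size can be checked by simulation. This sketch assumes a standard normal population (true mean 0) and counts how often the sample mean lands within one SEM, computed from that same sample, of the true mean. The function name, repeat count, and seed are choices made for this illustration.

```python
import math
import random


def proportion_within_one_sem(n, repeats=20000, seed=0):
    """Fraction of samples whose mean is within one SEM of the true mean (0)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
        sem = math.sqrt(variance / n)  # sample sd divided by sqrt(n)
        if abs(mean) <= sem:
            hits += 1
    return hits / repeats


for n in (3, 10, 100):
    print(n, round(proportion_within_one_sem(n), 3))
```

With n=3 the proportion should come out near the 57.7% quoted on the slide, and it should climb toward 68.3% as n grows.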

SEM and Confidence Intervals
Both the SEM and Confidence Intervals quantify the precision with which we estimate the population mean.
– Remember, the margin of error (half-width of the confidence interval) is w=t*s/√n, which we can now write as w=t*(SEM).
Here t* is the value from the t distribution so that the area from -t* to t* is the desired confidence level.
If n is very large, then for a 95% confidence interval, t*≈1.96.
But for n=4, t*=3.18, and for n=3, t*=4.30.
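For readers without tables to hand, t* can be approximated numerically. The sketch below integrates the t density with Simpson's rule and finds t* by bisection. This is one possible approach picked for illustration, not necessarily how statistical software does it, and the search is capped at 100, which is ample for 95% confidence.

```python
import math


def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def central_area(t, df, steps=2000):
    """Area under the t density between -t and t (Simpson's rule, doubled)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2.0 * total * h / 3.0


def t_star(df, confidence=0.95):
    """Find t* so that the area from -t* to t* equals the confidence level."""
    lo, hi = 0.0, 100.0  # assumed search range
    for _ in range(60):
        mid = (lo + hi) / 2
        if central_area(mid, df) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


for n in (3, 4, 66):
    print(f"n={n}: t* is about {t_star(n - 1):.3f}")
```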

Error Bars
When space is limited in an image (for example if you are showing multiple graphs in a figure), you may be restricted to showing just the sample mean and a single-value summary of the variation.
– Again, if possible, show more detail.
  Column scatter plot, box and whisker plot, etc.
Common choices for the "error bars" are:
– Standard deviation
– Standard Error of the Mean
– Confidence Interval

Example: Gene Expression in Breast Cancer Cell Lines
For an example, we'll use a recent experiment in which we measured expression of the gene GRHL2 in 51 different breast cancer cell lines.
– Cell lines categorized by basal type (Basal A, Basal B, Luminal)
– Thirteen values for Basal A. Mean is 4.02, sd

Example Error Bars

Guidelines for using error bars
Whatever you choose to plot, always clearly state what the error bars are.
Decide what you want to show:
– To show variation among samples:
  Use mean±sd or median and quartiles
– To show precision of the estimation of the mean:
  Use mean and a confidence interval
  Possibly use mean±SEM, but interpretation is far harder

Guidelines for reading error bars
Make sure you know what the error bars represent.
– Should be stated in the figure legend.
– If not, might be buried in the methods somewhere…
If the error bars show SEM:
– Convert to SD with s=SEM√n.
– Or, if you are prepared to do more work, convert to a confidence interval with w=t*SEM.
– Don't assume w≈2×SEM unless the sample size is in the hundreds.
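The two conversions in these guidelines are one-liners. The sketch below uses a hypothetical SEM of 1.5 from a sample of 4, with t*=3.18 taken from the earlier slide for n=4; the function names are invented for this example.

```python
import math


def sem_to_sd(sem, n):
    """Recover the sample sd from the SEM: s = SEM x sqrt(n)."""
    return sem * math.sqrt(n)


def sem_to_margin(sem, t_star):
    """Margin of error (half-width of the CI): w = t* x SEM."""
    return t_star * sem


# Hypothetical error bar: SEM of 1.5 from a sample of n=4.
sem, n = 1.5, 4
print("sd:", sem_to_sd(sem, n))
print("95% margin of error:", round(sem_to_margin(sem, t_star=3.18), 2))
print("naive 2 x SEM:", 2 * sem)
```

For a sample this small the margin of error is more than three times the SEM, so the 2×SEM shortcut would understate the interval considerably.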