
Copyright (c) Bani Mallick1 Lecture 2 Stat 651. Copyright (c) Bani Mallick2 Topics in Lecture #2 Population and sample parameters More on populations.


1 Copyright (c) Bani Mallick1 Lecture 2 Stat 651

2 Copyright (c) Bani Mallick2 Topics in Lecture #2 Population and sample parameters More on populations and samples The Median and Percentiles Robustness of the median and IQR, lack of robustness for the mean and standard deviation Variability: standard deviation and interquartile range (IQR) Boxplots

3 Copyright (c) Bani Mallick3 Book Sections Covered in Lecture #2 Chapter 3.4 Chapter 3.5, up to page 88 Chapter 3.6

4 Copyright (c) Bani Mallick4 Review of Lecture #1 We described samples and populations We want to make inference about populations We draw samples from the population to do so Different samples give different results

5 Copyright (c) Bani Mallick5 Review of Lecture #1 I will make quite a big deal about the difference between samples and populations One major thing we do in statistics is to quantify: how far is what we see in the sample from what we cannot see in the population?

6 Copyright (c) Bani Mallick6 Review of Lecture #1 We defined the relative frequency histogram This counts the percentage of the sample that falls into computer-generated categories We studied the NHANES case-control study This had samples from 2 (sub)populations Those who developed breast cancer Those who did not

7 Copyright (c) Bani Mallick7 Review of Lecture #1 In the NHANES data, the sample who developed breast cancer seemed to have smaller values of saturated fat than the sample that did not develop breast cancer What we will try to do today is to quantify those differences

8 Copyright (c) Bani Mallick8 Review of Lecture #1: The Population Mean In many problems, the goal is to make inference about the population mean of a numerical variable, e.g., saturated fat intake The only thing we have available is data from a sample, e.g., the sample mean. Define in words what you mean by the population mean and the sample mean!

9 Copyright (c) Bani Mallick9 Review of Lecture #1: The Population Mean In many problems, the goal is to make inference about the population mean of a numerical variable, e.g., saturated fat intake You’re right! The population mean is the average of all the outcomes in the population It can’t be measured, hence we take samples. BTW, what’s an average?

10 Copyright (c) Bani Mallick10 Review of Lecture #1: The Population Mean Formal definition. If the sample is of size $n$ and the data are $X_1, \ldots, X_n$, then the sample mean is $\bar{X} = \frac{X_1 + \cdots + X_n}{n} = \frac{\text{Total sum of the values in the sample}}{\text{Number of observations in the sample}}$
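The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_mean` and the data values are assumptions, not part of the lecture.

```python
def sample_mean(values):
    """Total sum of the values divided by the number of observations."""
    if not values:
        raise ValueError("the sample mean is undefined for an empty sample")
    return sum(values) / len(values)


# Hypothetical sample of n = 7 observations, chosen only for illustration.
data = [97, 99, 93, 96, 91, 90, 95]
x_bar = sample_mean(data)
print(x_bar)
```

The same value is returned by `statistics.fmean(data)` from the Python standard library.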

11 Copyright (c) Bani Mallick11 Parameters and Statistics Parameter: numerical characteristic of a population Statistic : numerical characteristic of a sample Population Sample

12 Copyright (c) Bani Mallick12 Parameters and Statistics Parameter: numerical characteristic of a population, called $\mu$ Statistic: numerical characteristic of a sample, called $\bar{X}$ Population Sample

13 Copyright (c) Bani Mallick13 Parameters and Statistics Parameter: numerical characteristic of a population, called $\mu$ (never known!!!) Statistic: numerical characteristic of a sample, called $\bar{X}$ Population Sample We want to make inference about population from sample
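A small simulation can make the parameter/statistic distinction concrete. The sketch below assumes a made-up population of 10,000 values (the numbers 3.0 and 0.6 are arbitrary choices, not NHANES values). In real problems the population mean is not available; it is computed here only because the population is simulated.

```python
import random
import statistics

rng = random.Random(651)  # fixed seed so the run is repeatable

# Hypothetical population (an assumption for illustration only).
population = [rng.gauss(3.0, 0.6) for _ in range(10_000)]

# The parameter: the population mean. Known here only because we built the population.
mu = statistics.fmean(population)

# The statistic: each sample of size 50 gives its own sample mean.
sample_means = [statistics.fmean(rng.sample(population, 50)) for _ in range(5)]

print("population mean:", round(mu, 3))
print("sample means:   ", [round(m, 3) for m in sample_means])
```

Each sample mean lands near the population mean but none equals it exactly, and no two samples agree: different samples give different results.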

14 Copyright (c) Bani Mallick14 Case-control data: NHANES log(Saturated Fat) Which sample has the larger sample mean? [Figure: relative frequency histograms (Percent) of Log(Saturated Fat) for the Cancer and Healthy samples]

15 Copyright (c) Bani Mallick15 Case-control data: NHANES log(Saturated Fat) Cancer: sample mean = 2.70 Healthy: sample mean = 2.99 [Figure: relative frequency histograms (Percent) of Log(Saturated Fat) for the Cancer and Healthy samples]

16 Copyright (c) Bani Mallick16 The Concept of a Median When we talk about the population median age of graduate students at Texas A&M, what do we mean? When we talk about the sample median of graduate students at Texas A&M, what do we mean?

17 Copyright (c) Bani Mallick17 The Concept of a Median The population median is the central point 1/2 the population falls below the population median 1/2 the population falls above the population median Look in newspapers for the use of the median and the mean

18 Copyright (c) Bani Mallick18 The Concept of a Median The sample median is the central point of the sample 1/2 the sample falls below the sample median 1/2 the sample falls above the sample median We can use the sample median to try to estimate the population median Remember though, different samples give different numbers

19 Copyright (c) Bani Mallick19 The Concept of a Median The sample median is computed by SPSS, or by hand as follows Order the data If n is an odd number, the sample median is the (n+1)/2 point in order If n is even, it is the average of the n/2 point in order and the (n/2+1) point in order

20 Copyright (c) Bani Mallick20 The Concept of a Median Data 97 99 93 96 91 90 95: n = 7 Ordered: 90 91 93 95 96 97 99 If n is an odd number, the sample median is the (n+1)/2 point in order (n+1)/2 = 4 Sample median = 95

21 Copyright (c) Bani Mallick21 The Concept of a Median Data 97 93 96 91 90 95: n = 6 Ordered: 90 91 93 95 96 97 If n is even, it is the average of the n/2 point in order and the (n/2+1) point in order n/2 = 3, (n/2+1) = 4 Sample median = average of 93 and 95 = 94
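The by-hand rule for the sample median translates directly into code. This is a minimal sketch (the function name `sample_median` is an assumption); it uses the ordered values shown on the two example slides.

```python
def sample_median(values):
    """Sample median by the ordering rule: middle point if n is odd,
    average of the two middle points if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample median is undefined for an empty sample")
    if n % 2 == 1:
        # The (n+1)/2 point in order; subtract 1 because Python lists start at 0.
        return ordered[(n + 1) // 2 - 1]
    # Average of the n/2 point and the (n/2 + 1) point in order.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_sample = [97, 99, 93, 96, 91, 90, 95]   # n = 7
even_sample = [90, 91, 93, 95, 96, 97]      # n = 6
print(sample_median(odd_sample))    # 95
print(sample_median(even_sample))   # 94.0
```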

22 Copyright (c) Bani Mallick22 Summary Statistics in SPSS Select “analyze”, “descriptive statistics”, “explore” Select your variables (“Dependent”) and your populations (“Factor List”) Ask for “Statistics” Cut and paste as needed

23 Copyright (c) Bani Mallick23 SPSS Output for NHANES: you see variable, populations, sample means, medians, minimum and maximum
Descriptives: Log(Saturated Fat) by Healthy Status (Numerical: 0 = Healthy, 1 = Cancer)

                                  Healthy                   Breast Cancer
                                  Statistic   Std. Error    Statistic   Std. Error
Mean                              2.9905      7.969E-02     2.6969      8.362E-02
95% CI for Mean, Lower Bound      2.8310                    2.5295
95% CI for Mean, Upper Bound      3.1500                    2.8642
5% Trimmed Mean                   3.0015                    2.6886
Median                            2.9957                    2.8332
Variance                          .381                      .413
Std. Deviation                    .6173                     .6423
Minimum                           1.39                      1.39
Maximum                           4.26                      4.77
Range                             2.88                      3.38
Interquartile Range               .9130                     .8755
Skewness                          -.332       .309          .156        .311
Kurtosis                          -.138       .608          .748        .613

24 Copyright (c) Bani Mallick24 Summary Statistics in SPSS For both measures of central tendency, healthy women had reported more saturated fat on the day they were interviewed

25 Copyright (c) Bani Mallick25 VARIABILITY Variability is one of the hardest concepts to understand, and to measure. Variability means how widely spread out the population is Populations with large variability should have samples that are more spread out than are samples from populations with low variability Variability is measured by the average distance of points to center

26 Copyright (c) Bani Mallick26 Shelf life of two drugs (past shelf life the drug is harmful). Which drug would you take, assuming they are equally effective on average? Sample from Drug A Sample from Drug B

27 Copyright (c) Bani Mallick27 Measuring Variability by Distances Absolute distance from sample mean Absolute distance from sample median Squared distance to sample mean All numerical measures of variability are measures of “the center of the distance”

28 Copyright (c) Bani Mallick28 Variability as average squared distance The sample variance, $s^2$, is defined as follows Compute $(X - \text{sample mean})^2$ for every observation: squared distances Sum these numbers up Divide by n-1 Except for the n-1, this is the sample mean of the squared differences

29 Copyright (c) Bani Mallick29 COMPUTING FORMULA for the Sample Variance $s^2$: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$, where $\bar{X}$ is the sample mean Note how $s^2$ measures how far the data are from the sample mean

30 Copyright (c) Bani Mallick30 The Standard Deviation, s The sample standard deviation is called s It is the square root of the sample variance Its units are the same as the units of the data
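The steps for the sample variance and standard deviation can be sketched as follows. The function names and the data are assumptions for illustration; the result is cross-checked against Python's standard `statistics` module, which also divides by n-1.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared distances to the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    mean = sum(values) / n
    squared_distances = [(x - mean) ** 2 for x in values]
    return sum(squared_distances) / (n - 1)


def sample_std(values):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(values))


# Hypothetical sample, chosen only for illustration.
data = [90, 91, 93, 95, 96, 97, 99]
print(sample_variance(data), sample_std(data))

# Cross-check against the standard library's n - 1 versions.
assert math.isclose(sample_variance(data), statistics.variance(data))
assert math.isclose(sample_std(data), statistics.stdev(data))
```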

31 Copyright (c) Bani Mallick31 Aortic Stenosis Data Two populations: healthy kids and kids with aortic stenosis Two outcomes: body surface area and aortic valve area Size-adjusted aortic valve area is the ratio of aortic valve area to body surface area

32 Copyright (c) Bani Mallick32 Aortic Valve Area for Healthy and Stenotic Kids Which has larger mean, larger variance? [Figure: relative frequency histograms (Percent) of Aortic Valve Area, axis from 0.000 to 3.000, panels Healthy and Stenotic]

33 Copyright (c) Bani Mallick33 Aortic Valve Area for Healthy and Stenotic Kids Which has larger mean, larger variance? [Figure: relative frequency histograms (Percent) of Aortic Valve Area, axis from 0.000 to 3.000, panels Healthy and Stenotic] Statistics shown on the two panels, in slide order: (1) Mean = 1.06, Median = 0.80, std dev = 0.78; (2) Mean = 0.83, Median = 0.69, std dev = 0.64

34 Copyright (c) Bani Mallick34 SAMPLE MEANS ARE NOT ROBUST [Figure: three panels of HEIGHTS values, each labeled with its median (MED): -4, -2, 0, 2, 4, 6; then -40, -2, 0, 2, 4, 6; then -2, 0, 2, 4, 6, 40]

35 Copyright (c) Bani Mallick35 SAMPLE MEANS ARE NOT ROBUST Outliers affect sample means much more than they do sample medians. You should look out for wild points They may be errors, or naturally occurring variability, but they have the potential for mischief We will develop statistical methods that help us understand whether our conclusions are being driven by a few wild points.
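The effect of a wild point can be checked numerically. The sketch below treats the numbers listed on the preceding slide as three small data sets; that reading is an assumption, since the original figure is not available.

```python
import statistics

# Three small data sets (assumed reading of the slide's numbers).
base = [-4, -2, 0, 2, 4, 6]
low_outlier = [-40, -2, 0, 2, 4, 6]    # -4 replaced by a wild point at -40
high_outlier = [-2, 0, 2, 4, 6, 40]    # a wild point at 40 instead of -4

for name, values in [("base", base),
                     ("low outlier", low_outlier),
                     ("high outlier", high_outlier)]:
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f"{name:13s} mean = {mean:6.2f}   median = {median:5.2f}")
```

Replacing -4 by -40 drags the mean from 1 down to -5, while the median stays at 1. With the point at 40 the median moves only to 3 and stays inside the bulk of the data, whereas the mean jumps above 8.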

36 Copyright (c) Bani Mallick36 PERCENTILES SAT-SCORES If you are in the 90th percentile of the population, what % scored higher than you? If you are in the 25th percentile, what percent scored less than or equal to you? What percent lie between the 25th and 75th percentiles? What percentile is the median?
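One way to explore these questions is to count directly. The sketch below uses a made-up population of 100 distinct scores (an assumption; these are not SAT data), so that the score 90 sits at the 90th percentile.

```python
def percent_at_or_below(scores, value):
    """Percent of scores that are less than or equal to value."""
    return 100 * sum(s <= value for s in scores) / len(scores)


# Hypothetical population: the scores 1, 2, ..., 100.
scores = list(range(1, 101))

at_or_below_90 = percent_at_or_below(scores, 90)
higher_than_90 = 100 - at_or_below_90
between_25_and_75 = percent_at_or_below(scores, 75) - percent_at_or_below(scores, 25)

print(at_or_below_90, higher_than_90, between_25_and_75)
```

In this toy population 10% score higher than the 90th percentile, 50% lie between the 25th and 75th percentiles, and the median is the 50th percentile.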

37 Copyright (c) Bani Mallick37 INTERQUARTILE RANGE (IQR) Defined as the difference between the 75th and 25th percentiles The length of the interval needed to cover the middle 50% of the data. This is a natural, robust measure of variability Why do I say it is robust? Why do I say it is natural?
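A short numerical check suggests why the IQR is called robust. The sketch below uses the standard library's `statistics.quantiles` with `method="inclusive"`. Percentile conventions differ between packages, so SPSS may report slightly different quartiles. The data are made up for illustration.

```python
import statistics


def iqr(values):
    """75th percentile minus 25th percentile (inclusive quantile method)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical sample, and the same sample with one wild point.
data = [90, 91, 93, 95, 96, 97, 99]
with_outlier = [90, 91, 93, 95, 96, 97, 990]

print("IQR:    ", iqr(data), iqr(with_outlier))
print("std dev:", statistics.stdev(data), statistics.stdev(with_outlier))
```

The wild point leaves the IQR unchanged at 4.5, while the standard deviation grows by roughly a factor of 100.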

38 Copyright (c) Bani Mallick38 Aortic Valve Area for Healthy and Stenotic Kids Which has larger mean, larger variance? [Figure: relative frequency histograms (Percent) of Aortic Valve Area, axis from 0.000 to 3.000, panels Healthy and Stenotic] Statistics shown on the two panels, in slide order: (1) Mean = 1.06, Median = 0.80, std dev = 0.78, IQR = 0.98; (2) Mean = 0.83, Median = 0.69, std dev = 0.64, IQR = 0.59

39 Copyright (c) Bani Mallick39 INTERQUARTILE RANGE Defined as the difference between the 75th and 25th percentiles The length of the interval needed to cover the middle 50% of the data. This is a natural, robust measure of variability If comparisons of variability come out differently for the standard deviation and the IQR, there is a good chance of an outlier

40 Copyright (c) Bani Mallick40 Descriptives for Log(Saturated Fat) by Healthy Status (Numerical: 0 = Healthy, 1 = Cancer), the same SPSS output as on slide 23 Look for variance, standard deviation and IQR Healthy: Variance = .381, Std. Deviation = .6173, Interquartile Range = .9130 Breast Cancer: Variance = .413, Std. Deviation = .6423, Interquartile Range = .8755

41 Copyright (c) Bani Mallick41 Box Plots Box plots are a means of visualizing data from different populations, especially comparing them Far less clunky than histograms Clear definitions, don’t have to worry about # of bars, class intervals, etc. Easily available

42 Copyright (c) Bani Mallick42 BASIC FORM OF THE BOXPLOT 75th percentile Median 25th percentile IQR can be read off

43 Copyright (c) Bani Mallick43 Box Plot Additions: Technical To the basic box, whiskers are added: they go out to the furthest data point within 1.5 IQR of the 75th and 25th percentiles Any points beyond the whiskers are called “Suspicious”, “Extreme”, or “Outliers”
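The whisker rule can be sketched in code. The function name `boxplot_summary`, the inclusive quantile method, and the data are all assumptions for illustration; SPSS may compute the quartiles with a different convention and so place whiskers slightly differently.

```python
import statistics


def boxplot_summary(values):
    """Box, whiskers, and flagged points using the 1.5 x IQR rule."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the furthest data points that are still inside the fences.
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Hypothetical sample with one wild point.
data = [90, 91, 93, 95, 96, 97, 99, 120]
summary = boxplot_summary(data)
print(summary)
```

Here the box runs from 92.5 to 97.5, the whiskers reach 90 and 99, and the point 120 is flagged because it lies more than 1.5 IQR above the 75th percentile.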

44 Copyright (c) Bani Mallick44 BASIC FORM OF THE BOXPLOT 75th percentile Median 25th percentile IQR can be read off Point here Outlier

45 Copyright (c) Bani Mallick45 IQR AS A MEASURE OF VARIABILITY The box is from 25th to 75th sample percentile This means 50% of the data are in the box Hence, IQR= range needed to cover 50% of the data IQR is a very robust measure of variability which can be judged graphically

46 Copyright (c) Bani Mallick46 NHANES Saturated Fat Data: A = moderately outlying, S = unusually outlying [Figure: box plots of Saturated Fat, axis from 0.00 to 100.00, for the Cancer and Healthy samples, with outlying points marked A and S]

47 Copyright (c) Bani Mallick47 Box Plots The SPSS plot actually displays the median line, but it did not translate into powerpoint You go to graphs, interactive and boxplot to get these things Here’s about what the thing looks like in SPSS (then I’ll show you SPSS)

48 Copyright (c) Bani Mallick48 NHANES Data not done interactively (just graphs and boxplot): cannot edit

49 Copyright (c) Bani Mallick49 Aortic Valve Area for Healthy and Stenotic Kids: done interactively, hence the median is not labeled when imported to PowerPoint [Figure: box plots of Aortic Valve Area, axis from 0.000 to 3.000, for Healthy and Stenotic kids, with outlying points marked A and S]

