1 Chapter 2: Methods for Describing Sets of Data (Page 19-98) Homework:14ab, 36, 43, 45, 51, 56, 64abc, 71, 79, 85, 89, 96.


1 Chapter 2: Methods for Describing Sets of Data (Pages 19-98) Homework: 14ab, 36, 43, 45, 51, 56, 64abc, 71, 79, 85, 89, 96

2 Section 2.1: Numerical Measures of Central Tendency (Center) Why are we interested in the central tendency of a set of measurements? The central tendency of a set of measurements is the tendency of the data to cluster (or center) about certain numerical values. Because it is important to both descriptive and inferential statistics, several numerical measures, such as the mean, median, and mode, are available to estimate it. No single measure is best for every data set, because data sets have very different characteristics.

3 The most popular measure of central tendency is the mean (or arithmetic mean). We use the Greek letter $\mu$ to stand for the population mean and $\bar{x}$ to stand for the sample mean. The mode is a useful measure of central tendency if one wants to know the measurement that occurs most frequently in the data set. The median is a good measure of central tendency if there are several extremely large (or extremely small) measurements in the data. Which one is the best numerical measure of the central tendency of a set of data?
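As a minimal sketch of how the three measures respond to one extreme value, the Python snippet below uses a small made-up list of weekly expenditures (the values are hypothetical, not the Example 2.1 data) and the standard `statistics` module.

```python
from statistics import mean, median, multimode

# Hypothetical weekly expenditures in dollars (not the Example 2.1 data).
expenses = [0.9, 1.3, 2.4, 2.4, 3.1, 4.8, 15.9]

sample_mean = mean(expenses)        # pulled upward by the single large value 15.9
sample_median = median(expenses)    # middle ordered value, unaffected by 15.9
sample_modes = multimode(expenses)  # list of the most frequent value(s)

print(f"mean={sample_mean:.2f} median={sample_median} modes={sample_modes}")
```

Here the mean lands well above the median because of the one large measurement, which is the situation in which the slide recommends the median.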

4 Example 2.1 (Basic): The following data give the weekly expenditures (in dollars) on nonalcoholic beverages for 45 households randomly selected from the 1996 Diary Survey. Use part of the SAS output in the next three tables to find the sample size, mean, median, and mode of the weekly expenditures.

5 Results for Example 2.1
Variable=EXPENSE
Moments
N          45       Sum Wgts   45
Mean                Sum
Std Dev             Variance
Skewness            Kurtosis
USS                 CSS
CV                  Std Mean
T:Mean=             Pr>|T|
Num ^= 0   45       Num > 0    45
M(Sign)    22.5     Pr>=|M|
Sgn Rank   517.5    Pr>=|S|

6 Quantiles(Def=5)
100% Max            99%
75% Q3              95%
50% Med             90%
25% Q1              10%    1.3
Range      15.8
Q3-Q1      7.6
Mode       0.9

7 Extremes
Lowest   Obs      Highest   Obs
0.7      (27)     12.7      (45)
0.9      (34)     13.5      (22)
0.9      (14)     15.1      (26)
1.3      (39)     15.9      (24)
1.3      (20)     16.5      (41)

8 Example 2.2 (Intermediate): Michelson conducted an experiment to determine the velocity of light between 1879 and Table 2.1 presents Michelson's determinations minus in km/sec. Table 2.1 Velocity of Light

9 Results for Example 2.2
Variable=SPEED
N          100
Mean                Sum
Std Dev             Variance
Skewness            Kurtosis
USS                 CSS
CV                  Std Mean
T:Mean=             Pr>|T|
Num ^=              Num >
M(Sign)    50       Pr>=|M|
Sgn Rank   2525     Pr>=|S|

10 Quantiles(Def=5)
100% Max            99%
75% Q3              95%
50% Med             90%
25% Q1              10%    760
0% Min     620      5%     730
                    1%     635
Range      450
Q3-Q1      90
Mode       810

11 Extremes
Lowest   Obs      Highest   Obs
620      (67)     980       (83)
650      (34)     1000      (4)
720      (60)     1000      (64)
720      (57)     1000      (74)
720      (47)     1070      (33)

12 The data set is skewed to the right if there are several extremely large measurements (see Figure 2.2). In this case the mean is greater than the median, because the extremely large values have a stronger impact on the mean. The data set is skewed to the left if there are several extremely small measurements (see Figure 2.3). In this case the mean is smaller than the median, because the extremely small values also have a stronger impact on the mean. Data sets are well behaved if they are symmetric (see Figure 2.1). Symmetric data sets have several useful properties that will be discussed in later chapters.

13

14

15

16 Section 2.2: Numerical Measures of Variability Why are we interested in numerical measures of the variability of a set of measurements? The variability of a set of measurements is the "spread" of the data. A measure of variability is as important as a measure of central tendency, because significantly different data sets can have the same mean, median, and mode. We introduce three numerical measures of variability: the range, the variance, and the standard deviation.

17 Why is the range sometimes not a good numerical measure of the variability of a set of data? The range depends only on the largest and smallest measurements. Two data sets with similar ranges can therefore have very different variability, and a single extremely large (or extremely small) measurement can alter the range significantly.

18 We use the symbols $s$ and $s^2$ to stand for the sample standard deviation and the sample variance, respectively, and the Greek symbols $\sigma$ and $\sigma^2$ to stand for the population standard deviation and the population variance, respectively. Both the standard deviation and the variance are good measures of the variability of a set of measurements.
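The sketch below illustrates why the range alone can mislead. It assumes the usual sample variance with an n - 1 divisor, and the two data sets are hypothetical: they share the same mean and range but differ in spread.

```python
from statistics import mean, stdev, variance

def sample_variance(xs):
    """Sample variance s^2 = sum((x - xbar)^2) / (n - 1)."""
    xbar = mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

# Two hypothetical data sets with the same mean (5) and the same range (8).
a = [1, 5, 5, 5, 9]
b = [1, 1, 5, 9, 9]

for name, xs in (("a", a), ("b", b)):
    data_range = max(xs) - min(xs)
    s2 = sample_variance(xs)   # agrees with statistics.variance(xs)
    s = stdev(xs)              # square root of the sample variance
    print(f"{name}: range={data_range} variance={s2} std dev={s:.3f}")
```

The range is 8 for both lists, while the variance of `b` is twice that of `a`, so the variance and standard deviation pick up a difference that the range misses.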

19 Is there any set of measurements that can be completely described by the sample mean and the sample standard deviation? Yes. A set of measurements can be described completely by its sample mean and sample standard deviation if its relative frequency distribution is similar to Figure 2.1.

20 Example 2.3 (Basic): Find the variance, the standard deviation and the range from SAS output in Example 2.1.

21 Example 2.4 (Intermediate): a) Find the variance, the standard deviation, and the range from the SAS output in Example 2.2. b) Find the variance, the standard deviation, and the range without the three extreme values. c) Which measure is most affected by the deletion of the extreme values? d) Compare the mean, the median, and the mode before and after the deletion of outliers.

22 Results for Example 2.4 (Without Extreme Values)
Variable=SPEED
N          97
Mean                Sum
Std Dev             Variance
Skewness            Kurtosis
USS                 CSS
CV                  Std Mean
T:Mean=             Pr>|T|
Num ^= 0   97       Num > 0    97
M(Sign)    48.5     Pr>=|M|
Sgn Rank            Pr>=|S|

23 Quantiles(Def=5)
100% Max            99%
75% Q3              95%
50% Med             90%
25% Q1              10%    760
0% Min     720      5%     740
                    1%     720
Range      280
Q3-Q1      80
Mode       810

24 Section 2.3: Interpreting the Standard Deviation The standard deviation measures the variability of a sample: the sample with the larger standard deviation has the higher variability. The standard deviation also helps answer questions such as "How many measurements are within 2 standard deviations of the mean?" for any specific data set. The following two rules are needed to answer such questions.

25 Chebyshev's Rule: For any set of measurements, at least $1 - 1/k^2$ of the measurements will fall within $k$ standard deviations of the mean, for any number $k$ greater than 1. (a) At least 3/4 of the measurements will fall within the interval $(\bar{x} - 2s, \bar{x} + 2s)$ for a sample and $(\mu - 2\sigma, \mu + 2\sigma)$ for a population. (b) At least 8/9 of the measurements will fall within the interval $(\bar{x} - 3s, \bar{x} + 3s)$ for a sample and $(\mu - 3\sigma, \mu + 3\sigma)$ for a population.
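Chebyshev's Rule guarantees that at least 1 - 1/k^2 of the measurements lie within k standard deviations of the mean. The following sketch computes that lower bound and compares it with the observed fraction for a hypothetical, strongly skewed data set. The function names and data are illustrative assumptions.

```python
from statistics import mean, stdev

def chebyshev_bound(k):
    """Lower bound 1 - 1/k^2 on the fraction within k standard deviations."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule requires k > 1")
    return 1 - 1 / k**2

def fraction_within(xs, k):
    """Observed fraction of xs within k sample standard deviations of the mean."""
    xbar, s = mean(xs), stdev(xs)
    return sum(abs(x - xbar) <= k * s for x in xs) / len(xs)

# Hypothetical skewed data with one very large measurement.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]

for k in (2, 3):
    print(k, chebyshev_bound(k), fraction_within(data, k))
```

Even for this skewed list the observed fractions stay above the guaranteed 3/4 and 8/9, as the rule requires for any data set.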

26 The Empirical Rule: The Empirical Rule is a rule of thumb that applies only to samples or populations with frequency distributions that are mound-shaped, i.e., similar to a bell. (a) Approximately 68% of the measurements will fall within the interval $(\bar{x} - s, \bar{x} + s)$ for a sample and $(\mu - \sigma, \mu + \sigma)$ for a population. (b) Approximately 95% of the measurements will fall within the interval $(\bar{x} - 2s, \bar{x} + 2s)$ for a sample and $(\mu - 2\sigma, \mu + 2\sigma)$ for a population. (c) Approximately 99.7% of the measurements will fall within the interval $(\bar{x} - 3s, \bar{x} + 3s)$ for a sample and $(\mu - 3\sigma, \mu + 3\sigma)$ for a population.
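The percentages in the Empirical Rule correspond to areas under a normal (bell-shaped) curve. As a rough check, and assuming the mound-shaped distribution is modeled as exactly normal, the sketch below computes those areas with `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

def normal_fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of its mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {normal_fraction_within(k):.3f}")
```

Real mound-shaped data will only approximate these fractions, which is why the rule says "approximately".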

27 Example 2.5 (Basic): For any set of data, what can be said about the percentage of measurements contained in each of the following intervals? (a) (b) (c)

28 Example 2.6 (Intermediate): The mean and standard deviation of a group of one hundred NBA players are inches and 3.25 inches, respectively. (a) How many players in this group are taller than inches based upon the Empirical Rule? (b) Can we answer part (a) based on Chebyshev's Rule? (c) What assumption is required in order to apply the Empirical Rule?

29 Section 2.4: Numerical Measures of Relative Standing Can you say that you did poorly on an exam if you scored 70 points? You might have done poorly, or you might have done fairly well. You could even have the top score if all other students scored less than 60 points on an extremely difficult exam. Your performance should be judged by your relative standing instead of the numerical score. Descriptive measures of the relationship of a measurement to the rest of the data are called measures of relative standing.

30 Example 2.7 (Basic): Based on the SAS output for Example 2.1, find the following percentiles: (a) 10th percentile (b) 25th percentile (c) 50th percentile (d) 55th percentile (e) 90th percentile Note: 1. The median is the 50th percentile of a quantitative data set. 2. The upper quartile is the 75th percentile and the lower quartile is the 25th percentile of a quantitative data set.

31 Quantile: Let q be any number between 0 and 1. The qth quantile, denoted by Q(q), is a number such that a fraction q of the measurements fall below it and a fraction (1-q) of the measurements fall above it.
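The SAS tables in these slides are headed "Quantiles(Def=5)". As I understand it, definition 5 is the empirical distribution function with averaging: if n*p/100 is a whole number j, the percentile is the average of the j-th and (j+1)-th ordered values, otherwise it is the next ordered value above that position. The sketch below implements that reading. The function name and the data are hypothetical, and p is assumed to be a whole-number percentage.

```python
def percentile_def5(xs, p):
    """p-th percentile (0 < p < 100, integer p) by the EDF-with-averaging rule."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(xs)
    # j is the whole part of n*p/100; remainder is zero when n*p/100 is whole.
    j, remainder = divmod(len(ordered) * p, 100)
    if remainder == 0:
        # Average the j-th and (j+1)-th ordered values (1-based positions).
        return (ordered[j - 1] + ordered[j]) / 2
    # Otherwise take the (j+1)-th ordered value (1-based position).
    return ordered[j]

measurements = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical data, n = 8

lower_quartile = percentile_def5(measurements, 25)
median_value = percentile_def5(measurements, 50)
upper_quartile = percentile_def5(measurements, 75)
print(lower_quartile, median_value, upper_quartile)
```

Other software may use a different percentile definition, so quartiles computed elsewhere can differ slightly from the SAS values.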

32 Sample Z Score: Suppose x is a measurement from a sample with mean $\bar{x}$ and standard deviation $s$. The sample Z score of x is $z = (x - \bar{x})/s$. Population Z Score: Suppose x is a measurement from a population with mean $\mu$ and standard deviation $\sigma$. The population Z score of x is $z = (x - \mu)/\sigma$.
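A Z score expresses how many standard deviations a measurement lies above or below the mean. The sketch below computes sample and population Z scores. The exam scores and the population parameters are hypothetical values chosen for illustration.

```python
from statistics import mean, stdev

def sample_z_score(x, sample):
    """z = (x - xbar) / s, with s the sample standard deviation (n - 1 divisor)."""
    return (x - mean(sample)) / stdev(sample)

def population_z_score(x, mu, sigma):
    """z = (x - mu) / sigma for a known population mean and standard deviation."""
    return (x - mu) / sigma

exam_scores = [60, 70, 80]  # hypothetical sample: mean 70, s = 10

print(sample_z_score(80, exam_scores))  # one standard deviation above the mean
print(population_z_score(70, 55, 5))    # hypothetical hard exam: mu = 55, sigma = 5
```

The second call echoes the exam example: a score of 70 is three standard deviations above the mean when the class mean is 55 and the standard deviation is 5.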

33 Example 2.8: The following data give the yearly contributions (in dollars) to a local church by 35 households randomly selected from the 1996 Interview Survey. (a) Find the mean and median of this set of data. (b) Find the standard deviation and range. (c) Compute the Z score for 200. (d) How many measurements fall within two standard deviations of the mean?

34 Univariate Procedure
Variable=DOLLARS
N          35       Sum Wgts   35
Mean                Sum        4381
Std Dev             Variance
Skewness            Kurtosis
USS                 CSS
CV                  Std Mean
T:Mean=             Pr>|T|
Num ^= 0   35       Num > 0    35
M(Sign)    17.5     Pr>=|M|
Sgn Rank   315      Pr>=|S|

35 Quantiles(Def=5)
100% Max            99%
75% Q3              95%
50% Med    87       90%
25% Q1              10%    18
0% Min     10       5%     15
                    1%     10
Range      490
Q3-Q1      175
Mode       25

36 Extremes
Lowest   Obs      Highest   Obs
10       (20)     275       (33)
15       (17)     300       (6)
15       (12)     300       (27)
18       (19)     400       (25)
24       (30)     500       (26)

37 Section 2.5: Graphic Methods for Describing Data (Bar Chart, Pie Chart, and Histogram) Why do we need graphic methods to describe data? The mean and standard deviation alone cannot characterize the wide variety of distributions that data can have. It is easy to find significantly different data sets that have the same mean and standard deviation. For example, the three data sets in Figure 2.4 all have the same mean, median, standard deviation, and variance, yet they are very different.

38 Figure 2.4 A B C

39 We will not cover bar charts, pie charts, or histograms this semester. First, bar charts and pie charts pose several perception problems, as indicated by the well-known book "The Elements of Graphing Data" (William S. Cleveland, 1995). Second, we focus on quantitative data this semester, but both pie charts and bar charts are graphical tools for qualitative data. Third, there is more information encoded in a well-designed stem-and-leaf display than in a histogram. Box plots and stem-and-leaf displays are the graphical methods discussed in this course.

40 Section 2.6: Stem-and-Leaf Display Figure 2.5 shows a stem-and-leaf display of the ozone data (Tukey 1977). It is a hybrid between a data table and a histogram, since it shows numerical values as numerals but its profile is very much like a histogram (see Figure 2.6). The following steps construct a stem-and-leaf display by hand.
1. Define the stem and leaf to be used.
2. Write the stems in a column, arranged from the smallest stem at the top (bottom) to the largest stem at the bottom (top).

41 3. If the leaves consist of more than one digit, drop the digits after the first digit.
4. Record the leaf for each measurement in the row corresponding to its stem.
5. Find the median and highlight the leaf corresponding to the median.
6. Count the number of leaves in the row with the median and put the count, in parentheses, in the depth column.
7. Count the number of leaves for each row from the top row to the median row and put the cumulative counts in the depth column.
8. Count the number of leaves for each row from the bottom row to the median row and put the cumulative counts in the depth column.
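The construction steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the procedure used to produce Figure 2.5: the data are hypothetical nonnegative integers, the stem is taken as all digits but the last, the leaf is the last digit, and for an even number of measurements the median row is taken to be the row holding the lower of the two middle values.

```python
def stem_and_leaf(values):
    """Return (depth, stem, leaves) rows for nonnegative integer data."""
    ordered = sorted(values)
    n = len(ordered)

    # Steps 1-4: split each value into stem and leaf, grouped by stem.
    leaves_by_stem = {}
    for v in ordered:
        stem, leaf = divmod(v, 10)
        leaves_by_stem.setdefault(stem, []).append(leaf)

    # Include stems with no leaves so the profile resembles a histogram.
    stems = list(range(min(leaves_by_stem), max(leaves_by_stem) + 1))
    counts = [len(leaves_by_stem.get(s, [])) for s in stems]

    # Step 5 (assumption): row holding the middle ordered value.
    median_row = stems.index(ordered[(n - 1) // 2] // 10)

    rows = []
    for i, stem in enumerate(stems):
        if i < median_row:
            depth = str(sum(counts[: i + 1]))  # step 7: cumulative from the top
        elif i > median_row:
            depth = str(sum(counts[i:]))       # step 8: cumulative from the bottom
        else:
            depth = f"({counts[i]})"           # step 6: count in the median row
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append((depth, stem, leaves))
    return rows

# Hypothetical measurements (not the ozone data).
sample = [14, 18, 23, 24, 24, 31, 35, 38, 38, 42, 47, 61]
display = stem_and_leaf(sample)
for depth, stem, leaves in display:
    print(f"{depth:>5} {stem:>3} | {leaves}")
```

Because the values are sorted before the leaves are recorded, each row of leaves comes out in increasing order, which is what makes the display usable for sorting a small data set by hand.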

42 Figure 2.5 Stem-and-Leaf Display (columns: Depth, Stem, Leaf; median row depth (11))

43 Figure 2.6 Stem-and-Leaf Display with 90 Degree Rotation (columns: Depth, Stem, Leaf; median row depth (11))

44 Univariate Procedure
Variable=OZONE
N          125      Sum Wgts   125
Mean                Sum        9911
Std Dev             Variance
Skewness            Kurtosis
USS                 CSS
CV                  Std Mean
T:Mean=             Pr>|T|
Num ^=              Num >
M(Sign)    62.5     Pr>=|M|
Sgn Rank            Pr>=|S|

45 Quantiles(Def=5)
100% Max            99%
75% Q3              95%
50% Med    72       90%
25% Q1              10%    31
0% Min     14       5%     24
                    1%     14
Range      160
Q3-Q1      56
Mode       38

46 Advantages of the stem-and-leaf display: Both the numerical values and the graphical shape can be seen on a stem-and-leaf display. It is very easy to locate an individual measurement on a stem-and-leaf display. You can sort a relatively small data set by hand using a stem-and-leaf display. You can read off the median, mode, range, maximum, minimum, upper quartile, lower quartile, and inner quartile range from a stem-and-leaf display.

47 We can assess the symmetry of a set of measurements from the stem-and-leaf display. A set of measurements is symmetric if its relative frequency distribution looks similar to Figure 2.1. The relative frequency distribution of the ozone data can be seen from the rotated stem-and-leaf display (Figure 2.6). The ozone data are skewed to the right because there are more observations with small values than observations with large values.
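The skewness of the ozone data can also be checked numerically with the mean-versus-median comparison from Section 2.1. The sketch below recovers the mean from the Sum (9911) and N (125) reported in the OZONE output and compares it with the reported median (72). The helper function is a hypothetical name, and the comparison is only a rule of thumb.

```python
def skew_direction(mean_value, median_value):
    """Rule of thumb: the mean is pulled toward the longer tail."""
    if mean_value > median_value:
        return "right"
    if mean_value < median_value:
        return "left"
    return "none indicated"

ozone_mean = 9911 / 125  # Sum / N from the SAS output for OZONE
ozone_median = 72        # 50% Med from the quantile table

print(f"mean={ozone_mean:.3f} median={ozone_median} "
      f"skew={skew_direction(ozone_mean, ozone_median)}")
```

The mean is larger than the median, which agrees with the right skew seen in the rotated display.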

48 Example 2.9: The following table contains 48 measurements of the weight of a group of male students in STA 3023 last year. Table a) Construct a stem-and-leaf display for the data in Table 2.1. b) Is the data symmetric? c) Find the mean, the median, the range, the standard deviation, the lower quartile, and the upper quartile from the SAS output.

49 Stem-and-Leaf Display for Example 2.9 (columns: Depth, Stem, Leaves; median row depth (8))

50 Figure 2.7 Stem-and-Leaf Display with 90 Degree Rotation (columns: Depth, Stem, Leaves; median row depth (8))

51 SAS Output for Example 2.9
Variable=WEIGHT
N          48       Sum Wgts   48
Mean                Sum        8368
Std Dev             Variance
Skewness            Kurtosis
USS                 CSS
CV                  Std Mean
T:Mean=             Pr>|T|
Num ^= 0   48       Num > 0    48
M(Sign)    24       Pr>=|M|
Sgn Rank   588      Pr>=|S|

52 Quantiles(Def=5)
100% Max            99%
75% Q3              95%
50% Med             90%
25% Q1              10%    140
0% Min     123      5%     130
                    1%     123
Range      107
Q3-Q1
Mode       170

53 Section 2.7: Box Plots
Inner Quartile Range (IQR): The upper quartile minus the lower quartile.
Step: 1.5*IQR.
Upper Inner Fence: Upper quartile plus one step.
Lower Inner Fence: Lower quartile minus one step.
Upper Outer Fence: Upper quartile plus two steps.
Lower Outer Fence: Lower quartile minus two steps.
Outside Value: Any measurement that is greater than the upper inner fence or less than the lower inner fence.
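These definitions translate directly into a short calculation. The sketch below computes the step and the four fences from given quartiles and then lists the outside values. The quartiles and the data are hypothetical, and the function names are illustrative.

```python
def box_plot_fences(lower_quartile, upper_quartile):
    """Return the IQR, the step, and the inner and outer fences."""
    iqr = upper_quartile - lower_quartile
    step = 1.5 * iqr
    return {
        "iqr": iqr,
        "step": step,
        "lower_inner": lower_quartile - step,
        "upper_inner": upper_quartile + step,
        "lower_outer": lower_quartile - 2 * step,
        "upper_outer": upper_quartile + 2 * step,
    }

def outside_values(xs, lower_quartile, upper_quartile):
    """Measurements beyond either inner fence."""
    fences = box_plot_fences(lower_quartile, upper_quartile)
    return [x for x in xs
            if x < fences["lower_inner"] or x > fences["upper_inner"]]

# Hypothetical quartiles and measurements.
fences = box_plot_fences(10, 20)
outside = outside_values([2, 12, 15, 18, 36, 60], 10, 20)
print(fences)
print(outside)
```

With quartiles 10 and 20 the step is 15, so the inner fences are -5 and 35 and the outer fences are -20 and 50. The value 36 lies between the inner and outer fences, while 60 lies beyond the outer fence.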

54 Elements of a Box Plot: A rectangle (the box) is drawn with its ends at the lower and upper quartiles (the hinges). The median of the data is shown in the box, usually by a line through the box. The points at a distance of 1.5*IQR from each hinge mark the inner fences of the data set. Horizontal lines (the whiskers) are drawn from each hinge to the most extreme measurement inside the inner fence. A second pair of fences, the outer fences, lies at a distance of 3*IQR from the hinges. One symbol (usually "*" in SAS) is used to represent measurements falling between the inner and outer fences. Another symbol (usually "0" in SAS) is used to represent measurements beyond the outer fences.

55 Interpretation of Box Plots The median shows the central tendency of the data. The length of the box (IQR) provides a measure of the variability of the middle 50% of the data. The individual outside values give the viewer an opportunity to consider the presence of outliers, that is, observations that seem unusually, or even implausibly, large or small. Outside values are not necessarily outliers, but any outlier will almost certainly appear as an outside value. The box plot allows a partial assessment of symmetry. The box plot is symmetric about its median if the data are symmetric. If one whisker is clearly longer, the data are probably skewed in the direction of the longer whisker.

56 Example 2.10: Based on the box plot for the data in Example 2.1, answer the following: (a) Is the data symmetric? (b) Is there any outside value? (c) Find the upper quartile, the median, the lower quartile, the minimum value, and the maximum value.

Figure 2.8 Box Plot for Data in Example 2.1 Weekly Expenditure (in Dollar)

58 Example 2.11: Based on the box plot for the data in Example 2.2, answer the following: a. Is the data symmetric? b. Is there any outside value? c. Find the upper quartile, the median, the lower quartile, the minimum value, and the maximum value. d. Compute the inner quartile range and the step.

Figure 2.9 Velocity of the Light Speed of the Light

60 Example 2.12: Based on the box plot for the data in Example 2.8, answer the following: (a) Is the data symmetric? (b) Is there any outside value? (c) Find the upper quartile, the median, the lower quartile, the minimum value, and the maximum value. (d) Compute the inner quartile range and the step.

Figure 2.10 Box Plot for Data in Example 2.8 Yearly Contributions

62 Quick Review:
Mean, Median, and Mode
Range, Standard Deviation, and Variance
Upper Quartile, Lower Quartile, and IQR
Chebyshev's Rule and Empirical Rule
Z-Score
Symmetry and Skewness
Mound-Shaped Distribution
Box Plot and Stem-and-Leaf Display