Describing Data: Summary Measures. Identifying the Scale of Measurement Before you analyze the data, identify the measurement scale for each variable.


1 Describing Data: Summary Measures

2 Identifying the Scale of Measurement Before you analyze the data, identify the measurement scale for each variable (continuous, nominal, or ordinal). [Figure: a survey variable with the responses Agree, No Opinion, Disagree.]

3 Nominal Variables Variable: Type of Beverage. [Figure: beverage types, labeled by name or by the codes 1, 2, 3.]

4 Ordinal Variables Variable: Size of Beverage (Small, Medium, Large).

5 Continuous Variables Variable: Volume of Beverage (ratio level; scale marked 0, 1.0, 2.0, 3.0, 4.0). Variable: Temperature of Beverage (interval level).

6 Central Tendencies Defined as the tendency of the data to cluster around or center about certain numerical values, such as the: –Mean (arithmetic mean), –Median, and –Mode.

7 Central Tendency – Mean, Median, and Mode Data: 1, 1, 2, 1, 10, 3. –Mean = 3: the sum of all the values in the data set divided by the number of values. –Median = 1.5: the middle value (also known as the 50th percentile). –Mode = 1: the most common or frequent data value.
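
As a quick check of the three definitions, here is a minimal Python sketch using the standard library's `statistics` module. It assumes the slide's data values are 1, 1, 2, 1, 10, 3, which is the reading consistent with the stated mean, median, and mode.

```python
import statistics

# Assumed reading of the slide's data values.
data = [1, 1, 2, 1, 10, 3]

mean = statistics.mean(data)      # sum of the values divided by the number of values
median = statistics.median(data)  # n is even, so this averages the two middle values
mode = statistics.mode(data)      # the most frequent value

print(mean, median, mode)
```

Sorting the data by hand gives 1, 1, 1, 2, 3, 10, so the two middle values are 1 and 2 and the median is their average.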

8 Mean (or Arithmetic Mean) Sum of the values of all of the observations in a data set divided by the number of observations: –The Sample Mean is $\bar{x}$. –The Population Mean is $\mu$. –The formula for calculating the sample mean is: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.

9 Median Defined as the middle point of the ordered set of data, i.e. at least half of the data points are at or above the median and at least half are at or below it: –If the number of data points is odd, it is the middle point of the ordered set of data. –If the number of data points is even, it is the average (mean) of the two middle points of the ordered set of data.

10 Mode Defined as the measurement(s) that occur with the greatest frequency in the sample, i.e. the most common point(s): –A unimodal data set contains only one mode. –A bimodal data set contains two modes. –And so on….

11 Examples and Calculations

12 Mean: Median: –From stem and leaf: the median is 5.8. Mode: –From stem and leaf: the modes are 5.8 and 6.2 (bimodal).

13 Skewing (Mean and Median Comparisons) If the median is less than the mean, the data set is skewed right (extreme data in right tail which increases the mean). If the median is greater than the mean, the data set is skewed left (extreme data in the left tail which decreases the mean). If median equals the mean, the data set is said to be symmetrical.
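
The comparison above can be sketched as a small Python function. This is a rule of thumb rather than a formal skewness test, and the function name and sample values are illustrative assumptions, not part of the original slides.

```python
import statistics

def skew_direction(values):
    """Classify skew by comparing the median with the mean (a heuristic only)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if median < mean:
        return "skewed right"
    if median > mean:
        return "skewed left"
    return "symmetric (by this heuristic)"

print(skew_direction([1, 1, 2, 1, 10, 3]))  # mean 3 is above median 1.5
print(skew_direction([1, 8, 9, 10]))        # mean 7 is below median 8.5
print(skew_direction([1, 2, 3, 4, 5]))      # mean and median are both 3
```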

14 Other Items Notes: –The mean is sensitive to outliers (extreme data) while the median and mode are not. –Outliers can have significant implications for the ability to draw inferences from the data set. For example, consider: Average home sales price. Average exam score. Average jury award amount. –Mean or median may not be feasible.
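
To illustrate the first note, the sketch below adds one extreme value to a small data set and compares how far the mean and the median move. The home sale prices are hypothetical numbers chosen only for illustration.

```python
import statistics

# Hypothetical home sale prices, in thousands of dollars (illustrative only).
prices = [210, 225, 230, 240, 255]
with_outlier = prices + [2500]  # add one extreme sale

for label, values in [("without outlier", prices), ("with outlier", with_outlier)]:
    print(label, "mean:", statistics.mean(values), "median:", statistics.median(values))
```

In this example the single extreme sale moves the mean from 232 to 610, while the median only moves from 230 to 235.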

15 Picturing Distributions: Histogram –Each bar in the histogram represents a group of values (a bin). –The height of the bar represents the frequency or percent of values in the bin. [Figure: histogram with PERCENT on the vertical axis and bins on the horizontal axis.]
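
A minimal sketch of the binning idea follows, in plain Python. The exam scores, the bin width of 10, and the function name are assumptions made for illustration; the original slide does not specify any data.

```python
from collections import Counter

def histogram_percent(values, bin_width, start):
    """Map each bin's lower edge to the percent of values in [edge, edge + bin_width)."""
    counts = Counter(start + bin_width * ((v - start) // bin_width) for v in values)
    n = len(values)
    return {edge: 100 * counts[edge] / n for edge in sorted(counts)}

# Hypothetical integer exam scores (illustrative only).
scores = [55, 62, 67, 71, 74, 78, 83, 85, 91, 98]
bin_width = 10

for edge, pct in histogram_percent(scores, bin_width, start=50).items():
    bar = "#" * round(pct / 10)  # one '#' per 10 percent
    print(f"{edge}-{edge + bin_width - 1}: {bar} ({pct:.0f}%)")
```

Integer data are used here so that the bin edges are exact; with floating-point data the floor division can misplace values that sit on a bin boundary.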

16 Data Distributions Compared to Normal

17 A Normal Distribution

18 Skewness

19 Measures of Data Variability Knowing central tendencies (mean, median, mode) isn’t enough. We also need a method for determining how closely the data are clustered around their center point(s). Three typical measures of data variability: –Range, –Variance, and –Standard Deviation.

20 The Spread of a Distribution: Dispersion Measure and Definition: –Range: the difference between the maximum and minimum data values. –Interquartile Range: the difference between the 25th and 75th percentiles. –Variance: a measure of dispersion of the data around the mean. –Standard Deviation: a measure of dispersion expressed in the same units of measurement as your data (the square root of the variance).

21 Range Simplest measure of variability. Calculated by subtracting the smallest measurement from the largest measurement. GPA Examples: –Data Set (4.0, 2.7, 3.3, 3.2, 2.1, 3.7, 3.5, 1.9). –Range equals 4.0 minus 1.9 which is 2.1. –Data Set (2.7, 3.2, 3.4, 2.9, 1.8, 2.2, 3.0, 2.1). –Range equals 3.4 minus 1.8 which is 1.6.

22 Is Range A Sufficient Measure of Variability? No - Consider the following two stem and leaf diagrams where the range equals 9.0.

23 Another Method for Measuring Data Variability (or Spread) A more sensitive measurement of variation uses the difference between the sample mean and each of the measurements of the sample, also known as the deviation from the mean. Each deviation between the sample member and the mean is first calculated and then squared. These results are then summed.

24 Variance Equal to the sum of the squared distances from the mean divided by (n-1) for a sample: –Sample Variance - $s^2$ –Population Variance - $\sigma^2$ Deviations are squared so that negative and positive differences do not cancel each other.

25 Standard Deviation Variance is expressed in squared units (i.e. “units squared”), which makes it hard to interpret directly; taking the positive square root of the variance provides a metric in the same units as the data itself (i.e. “units”): –Sample Standard Deviation - $s$ –Population Standard Deviation - $\sigma$

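26 Variance Formulas –Sample: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$ –Population: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

27 Shortcut Variance Formulas –Sample: $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1}$ –Population: $\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2 / N}{N}$

28 Standard Deviation Formulas –Sample: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$ –Population: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$

29 Shortcut Standard Deviation Formulas –Sample: $s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1}}$ –Population: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} x_i^2 - \left(\sum_{i=1}^{N} x_i\right)^2 / N}{N}}$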
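
The definitional and shortcut forms of the sample variance named on the preceding slides can be compared with a short Python sketch. The function names are illustrative assumptions; the data are the first GPA data set from the Range slide. The shortcut form used here is the standard one, (sum of x squared minus the square of the sum divided by n) over n-1.

```python
import math

def sample_variance(values):
    """Definitional form: sum of squared deviations from the mean, over n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Shortcut form: (sum of x^2 - (sum of x)^2 / n), over n - 1."""
    n = len(values)
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total ** 2 / n) / (n - 1)

gpa_one = [4.0, 2.7, 3.3, 3.2, 2.1, 3.7, 3.5, 1.9]  # GPA data set from the Range slide

v1 = sample_variance(gpa_one)
v2 = sample_variance_shortcut(gpa_one)
print(round(v1, 3), round(v2, 3), round(math.sqrt(v1), 3))
```

The two forms are algebraically identical. In floating-point arithmetic the shortcut form can lose precision when the mean is large relative to the spread, so the definitional form is usually the safer choice in code.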

30 Notes on Variability For the sample variance (and standard deviation), the sum of squared deviations is divided by one less than the sample size (n-1) rather than by the sample size itself (n). The sample variance (standard deviation) is used to estimate the population variance (standard deviation). We divide by (n-1) rather than (n) so that the sample variance is an unbiased estimator of the population variance.
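
The unbiasedness claim can be checked exactly on a tiny example. The sketch below enumerates every possible sample of size 2 drawn with replacement from a hypothetical three-value population and averages the variance estimates under each divisor. The population, the sample size, and the helper name are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product

def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by the given divisor."""
    mean = Fraction(sum(values), len(values))
    return sum((x - mean) ** 2 for x in values) / divisor

population = [1, 2, 3]                            # hypothetical tiny population
sigma_sq = variance(population, len(population))  # population variance (divisor N)

n = 2
samples = list(product(population, repeat=n))     # all ordered samples, with replacement

# Average the estimate over every possible sample, once per divisor.
avg_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)
avg_n = sum(variance(s, n) for s in samples) / len(samples)

print(sigma_sq, avg_n_minus_1, avg_n)
```

Averaged over all samples, the (n-1) version equals the population variance exactly, while the (n) version comes out too small. Exact fractions are used so the comparison does not depend on floating-point rounding.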

31 Notes on Variability (continued) A data set with larger spread about its mean will have a larger standard deviation. A data set with smaller spread about its mean will have a smaller standard deviation. It is the calculation of the standard deviation that allows for the comparison of the spread (variability) of the two data sets.

32 Examples GPA Data Set One: (4.0, 2.7, 3.3, 3.2, 2.1, 3.7, 3.5, 1.9). –Range equals 4.0 minus 1.9 which is 2.1. –Mean equals: $\bar{x} = 24.4 / 8 = 3.05$. GPA Data Set Two: (2.7, 3.2, 3.4, 2.9, 1.8, 2.2, 3.0, 2.1). –Range equals 3.4 minus 1.8 which is 1.6. –Mean equals: $\bar{x} = 21.3 / 8 \approx 2.66$.

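33 Variance and Standard Deviation Calculations GPA Data Set One: –Variance: $s^2 = 3.96 / 7 \approx 0.566$. –Standard Deviation: $s = \sqrt{0.566} \approx 0.752$. GPA Data Set Two: –Variance: $s^2 = 2.279 / 7 \approx 0.326$. –Standard Deviation: $s = \sqrt{0.326} \approx 0.571$.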

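34 Variance/Standard Deviation Calculations using Shortcut GPA Data Set One: –Variance: $s^2 = \frac{78.38 - (24.4)^2 / 8}{7} = \frac{78.38 - 74.42}{7} \approx 0.566$. GPA Data Set Two: –Variance: $s^2 = \frac{58.99 - (21.3)^2 / 8}{7} = \frac{58.99 - 56.71}{7} \approx 0.326$.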

35 Standard Deviation Useful for comparing the variability of two data sets. The data set with the larger standard deviation is the data set with more variability. From GPA Example: –Data Set One: Mean=3.05, St. Dev.=0.752. –Data Set Two: Mean=2.66, St Dev.=0.571.
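
The figures quoted for the two GPA data sets can be reproduced with Python's standard `statistics` module, as in this minimal sketch. The helper name `summarize` is an illustrative assumption.

```python
import statistics

gpa_one = [4.0, 2.7, 3.3, 3.2, 2.1, 3.7, 3.5, 1.9]
gpa_two = [2.7, 3.2, 3.4, 2.9, 1.8, 2.2, 3.0, 2.1]

def summarize(values):
    """Return (range, mean, sample standard deviation), rounded for display."""
    return (
        round(max(values) - min(values), 2),
        round(statistics.mean(values), 2),
        round(statistics.stdev(values), 3),  # stdev uses the n - 1 divisor
    )

print("Data Set One:", summarize(gpa_one))
print("Data Set Two:", summarize(gpa_two))
```

Data Set One has both the larger range and the larger standard deviation, so both measures agree that it is the more variable of the two.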

36 Relative vs. Absolute Comparison Deviation, or error, has been standardized. Thus, for a single data set, variability can be discussed in terms of how many members of the data set fall within one, two, three, or more standard deviations of the mean. A theorem and a rule describe this behavior: –Chebyshev’s Theorem; –Empirical Rule.

37 Chebyshev’s Theorem In general, at least $1 - 1/k^2$ of the sample members will fall within k standard deviations of the mean (for k > 1). So… –For k=1, the bound is 0, so the theorem guarantees nothing: it is possible for no members to fall strictly within 1 standard deviation of the mean. –For k=2, 75% (3/4) or more of the members will fall within 2 standard deviations of the mean.

38 Chebyshev’s Theorem (cont’d) From Table 2.10, Part 1 (cont’d): –For k=3, 88.9% (8/9) or more of the members will fall within 3 standard deviations of the mean. –For k=4, 93.75% (15/16) or more of the members will fall within 4 standard deviations of the mean. –And so on…. This theorem holds true regardless of the frequency distribution of the data set, i.e. no matter what the histogram looks like.
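
The bound is easy to compute and to compare against real data. The sketch below evaluates 1 - 1/k^2 for several k and checks the observed fraction for the first GPA data set from the Range slide. Both function names are illustrative assumptions.

```python
import statistics

def chebyshev_bound(k):
    """Minimum fraction of values within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2

def fraction_within(values, k):
    """Observed fraction of values within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return sum(abs(x - mean) <= k * sd for x in values) / len(values)

gpa_one = [4.0, 2.7, 3.3, 3.2, 2.1, 3.7, 3.5, 1.9]

for k in (2, 3, 4):
    print(k, round(chebyshev_bound(k), 4), fraction_within(gpa_one, k))
```

The observed fraction is never below the bound. For this data set every value already lies within 2 standard deviations of the mean, which shows how conservative the guarantee can be.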

39 Empirical Rule Based on empirical evidence for mound or bell shaped frequency distributions: –Approximately 68% (0.6826) of the sample members will fall within 1 standard deviation of the mean. –Approximately 95% (0.9544) of the sample members will fall within 2 standard deviations of the mean. –Almost all (0.9974) of the sample members will fall within 3 standard deviations of the mean.
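
For an exactly normal distribution, these percentages can be computed from the error function in Python's `math` module. This is a sketch under the assumption that the data follow a normal curve; the function name is illustrative.

```python
import math

def normal_within(k):
    """P(|Z| <= k) for a standard normal variable Z, via the error function."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_within(k), 4))
```

The computed values are 0.6827, 0.9545, and 0.9973. They differ from the slide's figures only in the fourth decimal place, most likely because the slide's figures were read from a rounded normal table.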

40 Normal Distributions

41 Chebyshev’s Theorem vs. Empirical Rule Percent of sample members using Chebyshev’s Theorem –For k=1, 0% (+) –For k=2, 75% (+) –For k=3, 88.9% (+) –For k=4, 93.75% (+) –For k=5, 96% (+) –And so on…. Percent of sample members using Empirical Rule –For k=1, 68.26% –For k=2, 95.44% –For k=3, 99.74% –For k=4, > 99.8% –For k=5, > 99.9% –And so on….

42 Example using Toyota 4Runner Data

43 Measures of Relative Standing Percentile Ranking Describes how a member of the data set compares to the rest of the data. Percentile ranking - the pth percentile is a number x such that p% of the measurements fall below x. Example: –SAT Scores: a score at the 90th percentile means that 90 percent of the scores are below it.

44 Percentiles Data: 98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42. –75th Percentile (third quartile) = 91. –50th Percentile = 80. –25th Percentile (first quartile) = 59. Quartiles divide your data into quarters.
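
The slide's quartiles match the median-of-halves convention: split the ordered data in half and take the median of each half. The sketch below implements that convention. The function name is an illustrative assumption, as is the choice to leave the middle value out of both halves when the count is odd.

```python
import statistics

def quartiles_median_of_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    Assumption: for an odd number of values, the middle value is left out of
    both halves. Expects at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]   # smallest half of the data
    upper = ordered[-half:]  # largest half of the data
    return (
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
    )

scores = [98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42]
print(quartiles_median_of_halves(scores))
```

Other quantile conventions interpolate differently and can give slightly different quartiles for the same data, so it is worth checking which convention a given tool uses.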

45 Box Plots –The mean is denoted by a ◊. –Upper whisker: largest point <= 1.5 IQR from the box. –Top of box: the 75th percentile. –Line inside box: the 50th percentile (median). –Bottom of box: the 25th percentile. –Lower whisker: smallest point <= 1.5 IQR from the box. –Outliers: points > 1.5 IQR from the box.
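
The 1.5 IQR rule for whiskers and outliers can be sketched in a few lines of Python. The data values and the function name are hypothetical, and the quartiles are passed in by the caller because different tools compute them with different conventions.

```python
def boxplot_summary(values, q1, q3):
    """Whisker ends and outliers under the 1.5 * IQR rule, given the quartiles."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in values if low_fence <= x <= high_fence]
    outliers = sorted(x for x in values if x < low_fence or x > high_fence)
    return {
        "lower_whisker": min(inside),  # smallest point within 1.5 IQR of the box
        "upper_whisker": max(inside),  # largest point within 1.5 IQR of the box
        "outliers": outliers,
    }

# Hypothetical data (illustrative only); 45 and 65 are its quartiles
# under the median-of-halves convention.
values = [2, 40, 45, 50, 52, 55, 58, 60, 65, 70, 120]
print(boxplot_summary(values, q1=45, q3=65))
```

Here the IQR is 20, so the fences sit at 15 and 95. The whiskers stop at 40 and 70, and the values 2 and 120 are flagged as outliers.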

46 Excel and StatPro Add-in Demonstration –Pivot tables –Summary measures (Excel, Add-ins) –Covariance and correlation –Boxplots –Applications

