© Copyright McGraw-Hill 20041 CHAPTER 3 Data Description.

Presentation on theme: "© Copyright McGraw-Hill 20041 CHAPTER 3 Data Description."— Presentation transcript:

1 © Copyright McGraw-Hill 20041 CHAPTER 3 Data Description

2 © Copyright McGraw-Hill 20042 Objectives Summarize data using measures of central tendency, such as the mean, median, mode, and midrange. Describe data using the measures of variation, such as the range, variance, and standard deviation. Identify the position of a data value in a data set using various measures of position, such as percentiles, deciles, and quartiles.

3 © Copyright McGraw-Hill 20043 Objectives (cont’d.) Use the techniques of exploratory data analysis, including boxplots and five-number summaries, to discover various aspects of data.

4 © Copyright McGraw-Hill 20044 Introduction Statistical methods can be used to summarize data. Measures of average are also called measures of central tendency and include the mean, median, mode, and midrange. Measures that determine the spread of data values are called measures of variation or measures of dispersion and include the range, variance, and standard deviation.

5 © Copyright McGraw-Hill 20045 Introduction (cont’d.) Measures of position tell where a specific data value falls within the data set or its relative position in comparison with other data values. The most common measures of position are percentiles, deciles, and quartiles.

6 © Copyright McGraw-Hill 20046 Introduction (cont’d.) The measures of central tendency, variation, and position are part of what is called traditional statistics. Traditional statistics are typically used to confirm conjectures about the data.

7 © Copyright McGraw-Hill 20047 Introduction (cont’d.) Another type of statistics is called exploratory data analysis. These techniques include the boxplot and the five-number summary. They can be used to explore data to see what they show.

8 © Copyright McGraw-Hill 20048 Basic Vocabulary A statistic is a characteristic or measure obtained by using the data values from a sample. A parameter is a characteristic or measure obtained by using all the data values for a specific population. When the data in a data set are ordered, the result is called a data array.

9 © Copyright McGraw-Hill 20049 General Rounding Rule In statistics the basic rounding rule is that intermediate results in a computation should not be rounded; round only the final answer.

10 © Copyright McGraw-Hill 200410 The Arithmetic Average The mean is the sum of the values divided by the total number of values. Rounding rule: the mean should be rounded to one more decimal place than occurs in the raw data. The type of mean that considers an additional factor is called the weighted mean.

11 © Copyright McGraw-Hill 200411 Weighted Mean In some cases, values vary in their degree of importance, so they are weighted accordingly:
$\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$
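As an illustration of the weighted mean, here is a minimal Python sketch. The function name `weighted_mean` and the grade-point data are hypothetical and are not taken from the slides.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical data: grade points weighted by credit hours.
grade_points = [4.0, 2.0, 3.0]
credit_hours = [3, 3, 4]
print(weighted_mean(grade_points, credit_hours))  # (12 + 6 + 12) / 10 = 3.0
```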

12 © Copyright McGraw-Hill 200412 The Arithmetic Average The Greek letter μ (mu) is used to represent the population mean. The symbol x̄ (“x-bar”) represents the sample mean. Assume that data are obtained from a sample unless otherwise specified.

13 © Copyright McGraw-Hill 200413 Median and Mode The median is the halfway point in a data set. The symbol for the median is MD. The median is found by arranging the data in order and selecting the middle point. The value that occurs most often in a data set is called the mode. The mode for grouped data, or the class with the highest frequency, is the modal class.

14 © Copyright McGraw-Hill 200414 Mean and Median Go to: http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html

15 © Copyright McGraw-Hill 200415 Midrange The midrange is defined as the sum of the lowest and highest values in the data set divided by 2. The symbol for midrange is MR.
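The midrange definition translates directly into code. This is a small sketch with a hypothetical function name and made-up data.

```python
def midrange(data):
    """Return (lowest value + highest value) / 2."""
    return (min(data) + max(data)) / 2


# Hypothetical data set.
print(midrange([2, 3, 6, 8, 13]))  # (2 + 13) / 2 = 7.5
```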

16 © Copyright McGraw-Hill 200416 Central Tendency: The Mean One computes the mean by using all the values of the data. The mean varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples. The mean is used in computing other statistics, such as variance.

17 © Copyright McGraw-Hill 200417 Central Tendency: The Mean (cont’d.) The mean for the data set is unique, and not necessarily one of the data values. The mean cannot be computed for an open-ended frequency distribution. The mean is affected by extremely high or low values and may not be the appropriate average to use in these situations.

18 © Copyright McGraw-Hill 200418 Central Tendency: The Median The median is used when one must find the center or middle value of a data set. The median is used when one must determine whether the data values fall into the upper half or lower half of the distribution. The median is used to find the average of an open-ended distribution. The median is affected less than the mean by extremely high or extremely low values.
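To see how an extreme value moves the mean much more than the median, consider this short sketch using Python's standard `statistics` module. The salary figures are invented for illustration.

```python
from statistics import mean, median

# Hypothetical salaries, in thousands.
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [400]  # add one extremely high value

print(mean(salaries), median(salaries))          # both equal 35
print(mean(with_outlier), median(with_outlier))  # mean jumps to about 95.8, median is 36.5
```

The single added value nearly triples the mean, while the median shifts by only 1.5.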

19 © Copyright McGraw-Hill 200419 Central Tendency: The Mode The mode is used when the most typical case is desired. The mode is the easiest average to compute. The mode can be used when the data are nominal, such as religious preference, gender, or political affiliation. The mode is not always unique. A data set can have more than one mode, or the mode may not exist for a data set.

20 © Copyright McGraw-Hill 200420 Examples
a. 5.40 1.10 0.42 0.73 0.48 1.10 (Mode is 1.10)
b. 27 27 27 55 55 55 88 88 99 (Bimodal: 27 and 55)
c. 1 2 3 6 7 8 9 10 (No mode)
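The three mode examples can be checked with a short Python sketch. The helper `modes` is a hypothetical name; it assumes the convention used on the slide, where a data set in which no value repeats has no mode.

```python
from collections import Counter


def modes(data):
    """Return the list of modes; an empty list means no mode."""
    if not data:
        return []
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        # No value occurs more than once, so there is no mode.
        return []
    return [value for value, count in counts.items() if count == highest]


print(modes([5.40, 1.10, 0.42, 0.73, 0.48, 1.10]))   # [1.1]
print(modes([27, 27, 27, 55, 55, 55, 88, 88, 99]))   # [27, 55]
print(modes([1, 2, 3, 6, 7, 8, 9, 10]))              # []
```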

21 © Copyright McGraw-Hill 200421 Central Tendency: The Midrange The midrange is easy to compute. The midrange gives the midpoint. The midrange is affected by extremely high or low values in a data set.

22 © Copyright McGraw-Hill 200422 Best Measure of Center

23 © Copyright McGraw-Hill 200423 Distribution Shapes In a positively skewed or right skewed distribution, the majority of the data values fall to the left of the mean and cluster at the lower end of the distribution.

24 © Copyright McGraw-Hill 200424 Distribution Shapes (cont’d.) In a symmetrical distribution, the data values are evenly distributed on both sides of the mean.

25 © Copyright McGraw-Hill 200425 Distribution Shapes (cont’d.) When the majority of the data values fall to the right of the mean and cluster at the upper end of the distribution, with the tail to the left, the distribution is said to be negatively skewed or left skewed.

26 © Copyright McGraw-Hill 200426 Skewness Figure 2-11

27 © Copyright McGraw-Hill 200427 Recap In this section we have discussed:
– Types of measures of center: mean, median, mode
– Mean from a frequency distribution
– Weighted means
– Best measures of center
– Skewness

28 © Copyright McGraw-Hill 200428 Measures of Variation Because this section introduces the concept of variation, this is one of the most important sections in the entire book.

29 © Copyright McGraw-Hill 200429 Definition The range of a set of data is the difference between the highest value and the lowest value:
range = highest value − lowest value

30 © Copyright McGraw-Hill 200430 Definition The standard deviation of a set of sample values is a measure of variation of values about the mean

31 © Copyright McGraw-Hill 200431 Population Variance The variance is the average of the squares of the distance each value is from the mean. The symbol for the population variance is σ².

32 © Copyright McGraw-Hill 200432 Population Standard Deviation The standard deviation is the square root of the variance. The symbol for the population standard deviation is σ. Rounding rule: The final answer should be rounded to one more decimal place than the original data.

33 © Copyright McGraw-Hill 200433 Sample Standard Deviation Formula
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

34 © Copyright McGraw-Hill 200434 Sample Standard Deviation (Shortcut Formula) Formula 2-5
$s = \sqrt{\frac{n (\sum x^2) - (\sum x)^2}{n (n - 1)}}$
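The definitional and shortcut formulas for the sample standard deviation are algebraically equivalent. The sketch below implements both and compares them with `statistics.stdev` from the Python standard library; the function names and the data are hypothetical.

```python
import math
from statistics import stdev


def sample_sd_definition(data):
    """Square root of the sum of squared deviations divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))


def sample_sd_shortcut(data):
    """Shortcut form: uses only n, the sum of x, and the sum of x squared."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return math.sqrt((n * sum_x2 - sum_x ** 2) / (n * (n - 1)))


data = [10, 60, 50, 30, 40, 20]  # hypothetical sample
print(round(sample_sd_definition(data), 1))  # 18.7
print(round(sample_sd_shortcut(data), 1))    # 18.7
print(round(stdev(data), 1))                 # 18.7
```

Following the rounding rule, the result is reported to one more decimal place than the data, and rounding happens only at the end.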

35 © Copyright McGraw-Hill 200435 Rationale for Formula Why is n – 1 used rather than n?

36 © Copyright McGraw-Hill 200436 Standard Deviation from a Frequency Distribution Use the class midpoints as the x values. Formula 2-6
$s = \sqrt{\frac{n [\sum (f \cdot x^2)] - [\sum (f \cdot x)]^2}{n (n - 1)}}$
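A minimal sketch of the frequency-distribution formula follows, assuming each class is represented by its midpoint and n is the sum of the frequencies. The function name, midpoints, and frequencies are hypothetical.

```python
import math


def grouped_sample_sd(midpoints, frequencies):
    """Sample standard deviation from class midpoints and class frequencies."""
    n = sum(frequencies)
    sum_fx = sum(f * x for x, f in zip(midpoints, frequencies))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, frequencies))
    return math.sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))


# Hypothetical grouped data: class midpoints and their frequencies.
midpoints = [8, 13, 18, 23, 28, 33, 38]
frequencies = [1, 2, 3, 5, 4, 3, 2]
print(round(grouped_sample_sd(midpoints, frequencies), 1))  # 8.3
```

The result matches what the ordinary sample formula gives when every midpoint is repeated as many times as its frequency.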

37 © Copyright McGraw-Hill 200437 Standard Deviation - Key Points
– The standard deviation is a measure of variation of all values from the mean
– The value of the standard deviation s is usually positive
– The value of the standard deviation s can increase dramatically with the inclusion of one or more outliers (data values far away from all others)
– The units of the standard deviation s are the same as the units of the original data values

38 © Copyright McGraw-Hill 200438 Definition The variance of a set of values is a measure of variation equal to the square of the standard deviation.
– Sample variance: square of the sample standard deviation s
– Population variance: square of the population standard deviation σ

39 © Copyright McGraw-Hill 200439 Variance - Notation (standard deviation squared)
– s²: sample variance
– σ²: population variance

40 © Copyright McGraw-Hill 200440 Variance and Standard Deviation Variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. The information is useful in comparing two or more data sets to determine which is more variable. The measures of variance and standard deviation are used to determine the consistency of a variable.

41 © Copyright McGraw-Hill 200441 Round-off Rule for Measures of Variation Carry one more decimal place than is present in the original set of data. Round only the final answer, not values in the middle of a calculation.

42 © Copyright McGraw-Hill 200442 Coefficient of Variation The coefficient of variation is the standard deviation divided by the mean. The result is expressed as a percentage. The coefficient of variation is used to compare standard deviations when the units are different for the two variables being compared.

43 © Copyright McGraw-Hill 200443 Definition The coefficient of variation (or CV) for a set of sample or population data, expressed as a percent, describes the standard deviation relative to the mean.
Sample: $CV = \frac{s}{\bar{x}} \cdot 100\%$
Population: $CV = \frac{\sigma}{\mu} \cdot 100\%$
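Because the coefficient of variation is unit-free, it allows a comparison of spread between variables measured in different units. The sketch below uses invented summary figures (number of sales versus commissions in dollars); the function name is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a mean of zero")
    return std_dev / mean * 100


# Hypothetical summaries in different units.
cv_sales = coefficient_of_variation(std_dev=5, mean=87)             # units: sales
cv_commissions = coefficient_of_variation(std_dev=773, mean=5225)   # units: dollars

print(f"{cv_sales:.1f}%")        # 5.7%
print(f"{cv_commissions:.1f}%")  # 14.8%
```

In this made-up example the commissions are relatively more variable than the sales, even though the raw standard deviations cannot be compared directly.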

44 © Copyright McGraw-Hill 200444 Variance and Standard Deviation (cont’d.) The variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution. The variance and standard deviation are used quite often in inferential statistics.

45 © Copyright McGraw-Hill 200445 Chebyshev’s Theorem The proportion of values from a data set that will fall within k standard deviations of the mean will be at least $1 - 1/k^2$, where k is a number greater than 1. This theorem applies to any distribution regardless of its shape.

46 © Copyright McGraw-Hill 200446 Definition Chebyshev’s Theorem The proportion (or fraction) of any set of data lying within K standard deviations of the mean is always at least $1 - 1/K^2$, where K is any positive number greater than 1.
– For K = 2, at least 3/4 (or 75%) of all values lie within 2 standard deviations of the mean
– For K = 3, at least 8/9 (or 89%) of all values lie within 3 standard deviations of the mean
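Chebyshev's bound can be checked numerically on any data set. The following sketch treats a small, deliberately skewed list as a population (so it uses `statistics.pstdev`); the data and function names are hypothetical.

```python
from statistics import mean, pstdev


def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


def proportion_within(data, k):
    """Observed proportion of values within k population standard deviations."""
    mu = mean(data)
    sigma = pstdev(data)
    inside = sum(1 for x in data if abs(x - mu) <= k * sigma)
    return inside / len(data)


skewed = [1, 1, 2, 2, 3, 3, 4, 5, 8, 40]  # hypothetical, strongly right-skewed
for k in (1.5, 2, 3):
    print(k, round(chebyshev_bound(k), 3), proportion_within(skewed, k))
```

For every k the observed proportion is at least the bound, although the bound is often far from tight.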

47 © Copyright McGraw-Hill 200447 Empirical Rule for Normal Distributions The following apply to a bell-shaped distribution. Approximately 68% of the data values fall within one standard deviation of the mean. Approximately 95% of the data values fall within two standard deviations of the mean. Approximately 99.7% of the data values fall within three standard deviations of the mean.
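The percentages in the empirical rule are rounded areas under the normal curve. As a quick check, this sketch computes them from the standard normal distribution using `statistics.NormalDist` (available in Python 3.8 and later); the helper name `within_k_sd` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def within_k_sd(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sd: {within_k_sd(k):.2%}")
# within 1 sd: 68.27%
# within 2 sd: 95.45%
# within 3 sd: 99.73%
```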

48 © Copyright McGraw-Hill 200448 The Empirical Rule FIGURE 2-13

49 © Copyright McGraw-Hill 200449 The Empirical Rule FIGURE 2-13

50 © Copyright McGraw-Hill 200450 The Empirical Rule FIGURE 2-13

51 © Copyright McGraw-Hill 200451 Recap In this section we have looked at:
– Range
– Standard deviation of a sample and population
– Variance of a sample and population
– Coefficient of variation (CV)
– Standard deviation using a frequency distribution
– Empirical rule
– Chebyshev’s theorem

52 © Copyright McGraw-Hill 200452 Measures of Position

53 © Copyright McGraw-Hill 200453 Standard Scores A standard score or z score is used when direct comparison of raw scores is impossible. A standard score or z score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation.

54 © Copyright McGraw-Hill 200454 Definition z Score (or standard score): the number of standard deviations that a given value x is above or below the mean.

55 © Copyright McGraw-Hill 200455 Measures of Position: z score
Sample: $z = \frac{x - \bar{x}}{s}$
Population: $z = \frac{x - \mu}{\sigma}$
Round to 2 decimal places.
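A z score lets scores from differently scaled tests be compared. The sketch below applies the formula and rounds to 2 decimal places as the slide directs; the function name and the test scores are hypothetical.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if std_dev <= 0:
        raise ValueError("the standard deviation must be positive")
    return round((x - mean) / std_dev, 2)


# Hypothetical scores on two tests with different scales.
print(z_score(65, mean=50, std_dev=10))  # 1.5
print(z_score(30, mean=25, std_dev=5))   # 1.0
print(z_score(40, mean=50, std_dev=10))  # -1.0 (below the mean, so negative)
```

Here the score of 65 has the higher relative position, because its z score is larger.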

56 © Copyright McGraw-Hill 200456 Interpreting Z Scores Whenever a value is less than the mean, its corresponding z score is negative. Ordinary values: z score between –2 and 2 (within 2 standard deviations of the mean). Unusual values: z score less than –2 or greater than 2. FIGURE 2-14

57 © Copyright McGraw-Hill 200457 Percentiles Percentiles are position measures used in educational and health-related fields to indicate the position of an individual in a group. A percentile, P, is an integer between 1 and 99 such that the Pth percentile is a value where P% of the data values are less than or equal to the value and (100 – P)% of the data values are greater than or equal to the value.

58 © Copyright McGraw-Hill 200458 Finding the Percentile of a Given Score
$\text{Percentile of value } x = \frac{(\text{number of values less than } x) + 0.5}{\text{total number of values}} \cdot 100$

59 © Copyright McGraw-Hill 200459 Converting from the pth Percentile to the Corresponding Data Value
Notation:
– n: total number of values in the data set
– p: percentile being used
– c: locator that gives the position of a value
– P_k: kth percentile
$c = \frac{n \cdot p}{100}$
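The percentile of a given score can be computed by counting the values below it. This is a minimal sketch of that formula; the function name and the test scores are hypothetical.

```python
def percentile_rank(data, x):
    """Percentile of x: (number of values less than x + 0.5) / n * 100."""
    below = sum(1 for value in data if value < x)
    return (below + 0.5) * 100 / len(data)


# Hypothetical test scores.
scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_rank(scores, 12))  # six values are below 12, so (6 + 0.5) / 10 * 100 = 65.0
```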
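The locator c gives a position in the sorted data, but the slide does not say what to do when c is or is not a whole number. The sketch below assumes one common textbook convention: if c is not whole, round it up and take the value at that position; if c is whole, average the values at positions c and c + 1. That convention, the function name, and the data are assumptions, not taken from the slides.

```python
import math


def value_at_percentile(data, p):
    """Data value for the pth percentile, using the locator c = n * p / 100."""
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(data)
    n = len(ordered)
    c = n * p / 100
    if c != int(c):
        # Assumed convention: c is not whole, so round up to the next position.
        return ordered[math.ceil(c) - 1]
    # Assumed convention: c is whole, so average positions c and c + 1 (1-based).
    position = int(c)
    return (ordered[position - 1] + ordered[position]) / 2


scores = [2, 3, 5, 6, 8, 10, 12, 15, 18, 20]  # hypothetical data
print(value_at_percentile(scores, 25))  # c = 2.5, round up to position 3: value 5
print(value_at_percentile(scores, 60))  # c = 6, average of 10 and 12: 11.0
```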

60 © Copyright McGraw-Hill 200460 Quartiles and Deciles Quartiles divide the distribution into four groups, separated by Q1, Q2, Q3. Note that Q1 is the same as the 25th percentile; Q2 is the same as the 50th percentile or the median; and Q3 corresponds to the 75th percentile. Deciles divide the distribution into 10 groups. They are denoted by D1, D2, …, D9.

61 © Copyright McGraw-Hill 200461 Quartiles Q1, Q2, Q3 divide ranked scores into four equal parts, each containing 25% of the values. [Diagram: (minimum), Q1, Q2 (median), Q3, (maximum)]

62 © Copyright McGraw-Hill 200462 Definition
– Q1 (First Quartile) separates the bottom 25% of sorted values from the top 75%.
– Q2 (Second Quartile) same as the median; separates the bottom 50% of sorted values from the top 50%.
– Q3 (Third Quartile) separates the bottom 75% of sorted values from the top 25%.

63 © Copyright McGraw-Hill 200463 Percentiles Just as there are quartiles separating data into four parts, there are 99 percentiles denoted P1, P2, ..., P99, which partition the data into 100 groups.

64 © Copyright McGraw-Hill 200464 Some Other Statistics
– Interquartile Range (or IQR): $Q_3 - Q_1$
– Semi-interquartile Range: $\frac{Q_3 - Q_1}{2}$
– Midquartile: $\frac{Q_3 + Q_1}{2}$
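Once the first and third quartiles are known, these three statistics follow from simple arithmetic. In this sketch the quartile values are supplied directly and are hypothetical, as is the function name.

```python
def quartile_statistics(q1, q3):
    """Return the interquartile range, semi-interquartile range, and midquartile."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "semi_iqr": iqr / 2,
        "midquartile": (q3 + q1) / 2,
    }


# Hypothetical quartiles.
print(quartile_statistics(q1=9, q3=20))
# {'iqr': 11, 'semi_iqr': 5.5, 'midquartile': 14.5}
```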

65 © Copyright McGraw-Hill 200465 Recap In this section we have discussed:
– z Scores
– z Scores and unusual values
– Quartiles
– Percentiles
– Converting a percentile to corresponding data values
– Other statistics

66 © Copyright McGraw-Hill 200466 Outliers An outlier is an extremely high or an extremely low data value when compared with the rest of the data values. Outliers can be the result of measurement or observational error. When a distribution is normal or bell-shaped, data values that are beyond three standard deviations of the mean can be considered suspected outliers.

67 © Copyright McGraw-Hill 200467 Important Principles
– An outlier can have a dramatic effect on the mean
– An outlier can have a dramatic effect on the standard deviation
– An outlier can have a dramatic effect on the scale of the histogram so that the true nature of the distribution is totally obscured

68 © Copyright McGraw-Hill 200468 Exploratory Data Analysis The purpose of exploratory data analysis is to examine data in order to find out what information can be discovered. For example:
– Are there any gaps in the data?
– Can any patterns be discerned?

69 © Copyright McGraw-Hill 200469 Boxplots and Five-Number Summaries Boxplots are graphical representations of a five-number summary of a data set. The five specific values that make up a five-number summary are:
– The lowest value of data set (minimum)
– Q1 (or 25th percentile)
– The median (or 50th percentile)
– Q3 (or 75th percentile)
– The highest value of data set (maximum)
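A five-number summary can be computed in a few lines. Quartile conventions differ between textbooks and software; this sketch assumes Q1 and Q3 are the medians of the lower and upper halves of the sorted data, leaving out the middle value when the number of values is odd. That convention, the function name, and the data are assumptions.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    if len(data) < 2:
        raise ValueError("need at least two data values")
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle value when n is odd so both halves have the same size.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


values = [5, 6, 12, 13, 15, 18, 22, 50]  # hypothetical data
print(five_number_summary(values))  # (5, 9.0, 14.0, 20.0, 50)
```

These five values are exactly what a boxplot draws: the box spans Q1 to Q3, a line marks the median, and the whiskers extend to the minimum and maximum.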

70 © Copyright McGraw-Hill 200470 Boxplots Figure 2-16

71 © Copyright McGraw-Hill 200471 Boxplots

72 © Copyright McGraw-Hill 200472 Recap In this section we have looked at:
– Exploratory data analysis
– Effects of outliers
– 5-number summary and boxplots

73 © Copyright McGraw-Hill 200473 Summary Some basic ways to summarize data include measures of central tendency, measures of variation or dispersion, and measures of position. The three most commonly used measures of central tendency are the mean, median, and mode. The midrange is also used to represent an average.

74 © Copyright McGraw-Hill 200474 Summary (cont’d.) The three most commonly used measurements of variation are the range, variance, and standard deviation. The most common measures of position are percentiles, quartiles, and deciles. Data values are distributed according to Chebyshev’s theorem and in special cases, the empirical rule.

75 © Copyright McGraw-Hill 200475 Summary (cont’d.) The coefficient of variation is used to describe the standard deviation in relationship to the mean. These methods are commonly called traditional statistics. Other methods, such as the boxplot and five-number summary, are part of exploratory data analysis; they are used to examine data to see what they reveal.

76 © Copyright McGraw-Hill 200476 Conclusions By combining all of these techniques together, the student is now able to collect, organize, summarize and present data.

77 © Copyright McGraw-Hill 200477 HOMEWORK Start at page 169, Review Exercises 2,5,8,15,20,21,22 Data Analysis page 166 1,2,3 (Use the first 30 values of weight for Databank)

