MEASURES OF DISPERSION


1 MEASURES OF DISPERSION

2 MEASURES OF DISPERSION
The measures of central tendency, such as the mean, median and mode, do not reveal the whole picture of the distribution of a data set. Two data sets with the same mean may have completely different spreads. The variation among the values of observations for one data set may be much larger or smaller than for the other data set. NOTE: the words dispersion, spread and variation have the same meaning.

3 MEASURES OF DISPERSION: example
Consider the following two data sets on the ages of all workers in each of two small companies.

Company 1: 36 39 35 38 40 45 47
Company 2: 18 27 33 52 70

The mean age of workers in both companies is the same: 40 years. Knowing only these means, we might conclude that the workers in the two companies have a similar age distribution. But the variation in the workers' ages is very different for the two companies: the ages of the workers in Company 2 have a much larger variation than the ages of the workers in Company 1.
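As a quick check of this example, the short sketch below recomputes the two means and compares the youngest and oldest worker in each company. It uses only the ages listed above; the variable names are illustrative.

```python
from statistics import mean

# Ages of all workers, taken from the example above.
company_1 = [36, 39, 35, 38, 40, 45, 47]
company_2 = [18, 27, 33, 52, 70]

for name, ages in (("Company 1", company_1), ("Company 2", company_2)):
    # Both means are 40, but the youngest-to-oldest gap differs a lot.
    print(f"{name}: mean = {mean(ages)}, youngest = {min(ages)}, oldest = {max(ages)}")
```

The means agree, yet the ages in Company 2 run from 18 to 70 while those in Company 1 stay between 35 and 47.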

4 MEASURES OF DISPERSION
The mean, median or mode is usually not by itself a sufficient measure to reveal the shape of a distribution of a data set. We also need a measure that can provide some information about the variation among data set values. The measures that help us to know about the spread of a data set are called measures of dispersion. The measures of central tendency and dispersion taken together give a better picture of a data set. We consider 3 measures of dispersion: range, variance and standard deviation.

5 RANGE
Definition: The range is the simplest measure of dispersion. It is obtained by taking the difference between the largest and the smallest values in a data set:

RANGE = LARGEST VALUE – SMALLEST VALUE

6 RANGE: example
The following data set gives the total areas in square miles of the 4 western South-Central states of the United States.

State        Total Area (square miles)
Arkansas      53,182
Louisiana     49,651
Oklahoma      69,903
Texas        267,277

RANGE = LARGEST VALUE – SMALLEST VALUE = 267,277 – 49,651 = 217,626 square miles

Thus, the total areas of these four states are spread over a range of 217,626 square miles.

7 RANGE: disadvantages
The range, like the mean, has the disadvantage of being influenced by outliers. Consequently, it is not a good measure of dispersion to use for a data set containing outliers. The calculation of the range is based on two values only: the largest and the smallest. All other values in a data set are ignored. Thus, the range is not a very satisfactory measure of dispersion and it is, in fact, rarely used.
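The range calculation and its sensitivity to outliers can be sketched in a few lines. The data are the four state areas from the range example; treating Texas as the outlier is an assumption made here for illustration, and `value_range` is a hypothetical helper name.

```python
# Total areas in square miles, taken from the range example.
areas = {
    "Arkansas": 53_182,
    "Louisiana": 49_651,
    "Oklahoma": 69_903,
    "Texas": 267_277,
}

def value_range(values):
    """Range = largest value - smallest value."""
    values = list(values)
    return max(values) - min(values)

full_range = value_range(areas.values())

# Drop Texas (treated here as the outlier) and recompute the range.
range_without_texas = value_range(
    area for state, area in areas.items() if state != "Texas"
)

print(full_range)           # 217626
print(range_without_texas)  # 20252
```

Removing a single state shrinks the range by more than a factor of ten, which is the weakness described on the slide.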

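8 VARIANCE
Definition: The variance is a measure of dispersion of values based on their deviations from the mean. The variance is defined to be:

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$ for a population

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$ for a sample

where $\mu$ and $N$ are the population mean and size, and $\bar{x}$ and $n$ are the sample mean and size.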

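9 VARIANCE
The difference between an observation and the mean, $x - \mu$ or $x - \bar{x}$, is called the deviation from the mean. Consequently, the population variance can also be defined as the arithmetic mean of the squared deviations from the mean (for a sample, the sum of squared deviations is divided by $n - 1$ instead of $n$). From the computational point of view, it is easier and more efficient to use short-cut formulas to calculate the variance:

$$\sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$$ for a population

$$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$ for a sample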
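To see that the definition based on squared deviations and the short-cut form based on ∑x and ∑x² give the same variance, here is a minimal sketch. The function names and the small data set are hypothetical and chosen only for illustration.

```python
import math

def variance_by_definition(xs, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (sample) or n (population)."""
    n = len(xs)
    mean = sum(xs) / n
    squared_deviations = sum((x - mean) ** 2 for x in xs)
    return squared_deviations / (n - 1 if sample else n)

def variance_by_shortcut(xs, sample=True):
    """Short-cut form: (sum of x^2 - (sum of x)^2 / n), divided by n - 1 or n."""
    n = len(xs)
    numerator = sum(x * x for x in xs) - sum(xs) ** 2 / n
    return numerator / (n - 1 if sample else n)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

for sample in (True, False):
    by_definition = variance_by_definition(data, sample)
    by_shortcut = variance_by_shortcut(data, sample)
    label = "sample" if sample else "population"
    print(label, by_definition, by_shortcut, math.isclose(by_definition, by_shortcut))
```

The short-cut form avoids computing each deviation separately, which is why it suits hand calculation with a table of x and x².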

10 VARIANCE: example 1
Refer to the data on 2002 total payrolls of 5 Major League Baseball (MLB) teams.

MLB Team                2002 Total Payroll (millions of dollars)
Anaheim Angels           62
Atlanta Braves           93
New York Yankees        126
St. Louis Cardinals      75
Tampa Bay Devil Rays     34

11 VARIANCE: example 1
We apply the short-cut formula, hence we need to compute the squares of the observations, x².

MLB Team                  x       x²
Anaheim Angels            62    3,844
Atlanta Braves            93    8,649
New York Yankees         126   15,876
St. Louis Cardinals       75    5,625
Tampa Bay Devil Rays      34    1,156

∑x = 390    ∑x² = 35,150
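The short-cut calculation for the payroll data can be sketched as follows. The transcript does not say whether the five teams are treated as a sample or as a population, so the sketch computes both; treating them as a sample of MLB teams is an assumption.

```python
# 2002 total payrolls in millions of dollars, from the table above.
payrolls = {
    "Anaheim Angels": 62,
    "Atlanta Braves": 93,
    "New York Yankees": 126,
    "St. Louis Cardinals": 75,
    "Tampa Bay Devil Rays": 34,
}

x = list(payrolls.values())
n = len(x)
sum_x = sum(x)                    # 390
sum_x2 = sum(v * v for v in x)    # 35150

# Numerator shared by both short-cut formulas.
numerator = sum_x2 - sum_x ** 2 / n

sample_variance = numerator / (n - 1)   # assumption: the teams are a sample
population_variance = numerator / n     # alternative: the teams are the whole population

print(sum_x, sum_x2)
print(sample_variance, population_variance)
```

Both results are in squared units (millions of dollars, squared), which is one reason the standard deviation is usually easier to interpret.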

12 VARIANCE: example 2
The following data are the 2002 earnings (in thousands of dollars) before taxes for all 6 employees of a small company.

x         x²
48.50    2352.25
38.40    1474.56
65.50    4290.25
22.60     510.76
79.80    6368.04
54.60    2981.16

∑x = 309.40    ∑x² = 17,977.02
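Because the data cover all 6 employees, the population formula applies. The sketch below recomputes the column sums and the variance with the short-cut formula and cross-checks the result against `statistics.pvariance` from the Python standard library.

```python
from statistics import pvariance

# 2002 earnings before taxes, in thousands of dollars, for all 6 employees.
earnings = [48.50, 38.40, 65.50, 22.60, 79.80, 54.60]

n = len(earnings)
sum_x = sum(earnings)
sum_x2 = sum(x * x for x in earnings)

# Population short-cut formula: (sum of x^2 - (sum of x)^2 / N) / N
population_variance = (sum_x2 - sum_x ** 2 / n) / n

print(round(sum_x, 2), round(sum_x2, 2))
print(round(population_variance, 4))
print(round(pvariance(earnings), 4))  # library cross-check
```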

13 VARIANCE: frequency distribution
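The formula for variance changes slightly if observations are grouped into a frequency table. Each squared deviation is multiplied by the frequency of its value, and the products are then summed:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2 \, n_i}{N}$$ for a population

$$s^2 = \frac{\sum (x_i - \bar{x})^2 \, n_i}{n - 1}$$ for a sample

The short-cut formulas become:

$$\sigma^2 = \frac{\sum x_i^2 n_i - \frac{(\sum x_i n_i)^2}{N}}{N}$$ for a population

$$s^2 = \frac{\sum x_i^2 n_i - \frac{(\sum x_i n_i)^2}{n}}{n - 1}$$ for a sample

where $n_i$ is the frequency of the value $x_i$ and $N$ (or $n$) is the sum of the frequencies.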

14 VARIANCE: example 3

Vehicles Owned (xi)   Number of Households (ni)   xi * ni   xi²   xi² * ni
0                      2                            0        0       0
1                     18                           18        1      18
2                     11                           22        4      44
3                      4                           12        9      36
4                      3                           12       16      48
5                      2                           10       25      50
Sum                   40                           74              196
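The frequency-table version of the short-cut formula can be sketched as below. The frequencies used here (including a row for 0 vehicles) are the ones consistent with the totals 40, 74 and 196 given in the example; treat them as a reconstruction. The transcript does not state whether the households form a sample or a population, so both variances are computed.

```python
# Vehicles owned (x_i) and number of households (n_i).
# Assumption: these frequencies are reconstructed from the column totals 40, 74, 196.
vehicles = [0, 1, 2, 3, 4, 5]
households = [2, 18, 11, 4, 3, 2]

n = sum(households)                                              # 40
sum_xn = sum(x * f for x, f in zip(vehicles, households))        # 74
sum_x2n = sum(x * x * f for x, f in zip(vehicles, households))   # 196

# Numerator shared by the population and sample short-cut formulas.
numerator = sum_x2n - sum_xn ** 2 / n

population_variance = numerator / n
sample_variance = numerator / (n - 1)

print(n, sum_xn, sum_x2n)
print(round(population_variance, 4), round(sample_variance, 4))
```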

15 Variance: frequency distribution with classes
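Again, when the data set is organized in a frequency distribution with classes, we are approximating the data set by "rounding" each value in a given class to the class midpoint. Thus, the variance of a frequency distribution with classes is given by the short-cut formulas:

$$\sigma^2 = \frac{\sum m_i^2 n_i - \frac{(\sum m_i n_i)^2}{N}}{N}$$ for a population

$$s^2 = \frac{\sum m_i^2 n_i - \frac{(\sum m_i n_i)^2}{n}}{n - 1}$$ for a sample

where $m_i$ is the midpoint of each class interval and $n_i$ is the frequency of that class.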

16 VARIANCE: example 4
The following table gives the frequency distribution of the number of orders received each day during the past 50 days at the office of a mail-order company.

Number of Orders   Number of Days (n)    m     m²    m * n   m² * n
10 – 12             4                    11   121     44      484
13 – 15            12                    14   196    168     2352
16 – 18            20                    17   289    340     5780
19 – 21            14                    20   400    280     5600

n = 50    ∑m*n = 832    ∑m²*n = 14216
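For grouped data the same short-cut calculation runs on class midpoints. The sketch below derives the midpoints from the class limits and reproduces the column totals of the table; treating the 50 days as a sample is an assumption, so the population value is shown as well.

```python
# (lower limit, upper limit, number of days) for each class of daily orders.
classes = [(10, 12, 4), (13, 15, 12), (16, 18, 20), (19, 21, 14)]

midpoints = [(low + high) / 2 for low, high, _ in classes]   # 11, 14, 17, 20
frequencies = [days for _, _, days in classes]

n = sum(frequencies)                                                 # 50
sum_mn = sum(m * f for m, f in zip(midpoints, frequencies))          # 832
sum_m2n = sum(m * m * f for m, f in zip(midpoints, frequencies))     # 14216

numerator = sum_m2n - sum_mn ** 2 / n

sample_variance = numerator / (n - 1)   # assumption: the 50 days are a sample
population_variance = numerator / n

print(n, sum_mn, sum_m2n)
print(round(sample_variance, 4), round(population_variance, 4))
```

Because every value in a class is replaced by its midpoint, the result approximates the variance of the original, ungrouped data.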

17 STANDARD DEVIATION Definition
The standard deviation is the positive square root of the variance.

$$\sigma = \sqrt{\sigma^2}$$ for a population

$$s = \sqrt{s^2}$$ for a sample

18 STANDARD DEVIATION
The standard deviation is the most used measure of dispersion. The value of the standard deviation tells how closely the values of a data set are clustered around the mean. In general, a lower value of the standard deviation for a data set indicates that the values of that data set are spread over a relatively smaller range around the mean. In contrast, a larger value of the standard deviation for a data set indicates that the values of that data set are spread over a relatively larger range around the mean.

19 STANDARD DEVIATION: example 1
MLB Team                2002 Total Payroll (millions of dollars) x       x²
Anaheim Angels                                                  62    3,844
Atlanta Braves                                                  93    8,649
New York Yankees                                               126   15,876
St. Louis Cardinals                                             75    5,625
Tampa Bay Devil Rays                                            34    1,156

∑x = 390    ∑x² = 35,150

20 STANDARD DEVIATION: example 2
Earnings (thousands of dollars) x       x²
48.50                                2352.25
38.40                                1474.56
65.50                                4290.25
22.60                                 510.76
79.80                                6368.04
54.60                                2981.16

∑x = 309.40    ∑x² = 17,977.02
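The two standard deviation examples can be checked with the standard library. The sketch assumes the five payrolls are treated as a sample (the transcript does not say) and uses the population formula for the earnings, since they cover all 6 employees.

```python
from statistics import pstdev, stdev

# Example 1: 2002 total payrolls in millions of dollars (assumed to be a sample).
payrolls = [62, 93, 126, 75, 34]

# Example 2: 2002 earnings in thousands of dollars for all 6 employees (a population).
earnings = [48.50, 38.40, 65.50, 22.60, 79.80, 54.60]

payroll_sd = stdev(payrolls)    # square root of the sample variance
earnings_sd = pstdev(earnings)  # square root of the population variance

print(round(payroll_sd, 4))     # in millions of dollars
print(round(earnings_sd, 4))    # in thousands of dollars
```

Unlike the variance, each standard deviation is expressed in the same units as the original data.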

21 Variance and Standard Deviation: observations
The values of the variance and the standard deviation are never negative. That is, the numerator in the formula for the variance should never produce a negative value. Usually the values of the variance and standard deviation are positive, but if a data set has no variation, then the variance and standard deviation are both zero. Example: 4 persons in a group are the same age, say 35 years. If we calculate the variance and the standard deviation, their values are zero.
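The zero-variation case from the example is easy to confirm; the sketch below uses the four identical ages of 35 and the standard library functions for population and sample measures.

```python
from statistics import pstdev, pvariance, stdev, variance

# Four persons who are all 35 years old: no variation at all.
ages = [35, 35, 35, 35]

# Every deviation from the mean is zero, so all four measures are zero.
print(pvariance(ages), pstdev(ages))
print(variance(ages), stdev(ages))
```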

22 CONTINGENCY TABLES AND ELEMENTS OF PROBABILITY

23 CONTINGENCY TABLES
In many applications the interest is focused on the joint analysis of two variables (qualitative and/or quantitative) with the aim of evaluating the relation between them. The variables are usually presented as a contingency table (or two-way classification table). Whereas a frequency distribution provides the distribution of one variable, a contingency table describes the distribution of two or more variables simultaneously.

24 CONTINGENCY TABLES
All 420 employees of a company were asked if they are smokers or nonsmokers and whether or not they are college graduates.

             College Graduate   Not a College Graduate
Smoker              35                    80
Nonsmoker          130                   175

The table gives the distribution of the 420 employees based on two variables or characters: X, smoking (yes or no), and Y, graduation (yes or no). Each entry of the table is a cell. For example, 80 is the joint frequency of category “Smoker” of X and “Not a College Graduate” of Y.

25 CONTINGENCY TABLES: marginal distributions
             College Graduate   Not a College Graduate   Total
Smoker              35                    80              115
Nonsmoker          130                   175              305
Total              165                   255              420

Rows are the categories of X and columns are the categories of Y. The right-hand column and the bottom row are called the marginal distribution of X and the marginal distribution of Y respectively. The value 420 is the grand total.

26 CONTINGENCY TABLES
Marginal distribution of X:

X            Total
Smoker        115
Nonsmoker     305
Total         420

Marginal distribution of Y:

Y                         Total
College graduate           165
Not a College graduate     255
Total                      420

27 CONTINGENCY TABLES: conditional distributions
Conditional distribution of X given the category “College Graduate” of Y:

X            College Graduate
Smoker              35
Nonsmoker          130
Total              165

Conditional distribution of Y given the category “Smoker” of X:

Y                         Smoker
College graduate            35
Not a College graduate      80
Total                      115
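The marginal and conditional distributions can be derived mechanically from the cell counts. The sketch below stores the contingency table as a nested dictionary (an illustrative choice of data structure) and recovers the totals and the two conditional distributions shown in the slides.

```python
# Joint frequencies: rows are categories of X (smoking), columns are categories of Y (graduation).
counts = {
    "Smoker": {"College Graduate": 35, "Not a College Graduate": 80},
    "Nonsmoker": {"College Graduate": 130, "Not a College Graduate": 175},
}

# Marginal distribution of X: row totals.
marginal_x = {x: sum(row.values()) for x, row in counts.items()}

# Marginal distribution of Y: column totals.
marginal_y = {}
for row in counts.values():
    for y, count in row.items():
        marginal_y[y] = marginal_y.get(y, 0) + count

grand_total = sum(marginal_x.values())

# Conditional distribution of X given Y = "College Graduate": one column of the table.
x_given_graduate = {x: row["College Graduate"] for x, row in counts.items()}

# Conditional distribution of Y given X = "Smoker": one row of the table.
y_given_smoker = dict(counts["Smoker"])

print(marginal_x, marginal_y, grand_total)
print(x_given_graduate, y_given_smoker)
```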

28 Definition of probability
There are three different definitions of probability: the classical definition, the frequentist definition and the subjective (Bayesian) definition.

Frequentist definition of probability: The relative frequency associated with a category of the variable (event) being analyzed can be interpreted as an approximation of the probability of that event.

29 Definition of probability
Example: Ten of the 500 randomly selected cars manufactured at a certain auto factory are found to be lemons. Assuming that the lemons are manufactured randomly, what is the probability that the next car manufactured at this auto factory is a lemon?

Car (xi)    ni     Relative frequency (fi)
Good        490    490/500 = .98
Lemon        10    10/500 = .02

n = 500    Sum = 1.00

NOTE: The relative frequency is an approximation of the probability. Relative frequencies and probabilities get closer as the number of cars increases.
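The relative frequencies in the car example can be computed as follows. The sketch uses `fractions.Fraction` from the standard library to keep the ratios exact; the variable names are illustrative.

```python
from fractions import Fraction

# Observed counts among the 500 randomly selected cars.
counts = {"Good": 490, "Lemon": 10}
n = sum(counts.values())

# Relative frequency of each category: n_i / n.
relative_frequency = {car: Fraction(count, n) for car, count in counts.items()}

for car, frequency in relative_frequency.items():
    print(car, frequency, float(frequency))

# Frequentist reading: the approximate probability that the next car is a lemon.
p_lemon = float(relative_frequency["Lemon"])
print(p_lemon)  # 0.02
```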

30 Marginal Probability
Coming back to the example of the 420 employees, suppose that one employee is selected at random from the 420 employees. The employee may be classified on the basis of smoking alone or graduation alone. The employee can be “smoker” or “nonsmoker”, “graduate” or “nongraduate”. The probability of each characteristic is called a marginal probability.

             College Graduate   Not a College Graduate   Total
Smoker              35                    80              115
Nonsmoker          130                   175              305
Total              165                   255              420

31 Marginal Probability
Marginal (simple) probability is the probability (relative frequency) computed on the marginal distributions:

             College Graduate   Not a College Graduate   Total
Smoker              35                    80              115
Nonsmoker          130                   175              305
Total              165                   255              420

32 Joint Probability
Suppose that one employee is selected at random from these 420. What is the probability that the employee is a smoker and a college graduate?

             College Graduate   Not a College Graduate   Total
Smoker              35                    80              115
Nonsmoker          130                   175              305
Total              165                   255              420

It is written as P(Smoker ∩ College Graduate). The symbol ∩ is read as “and”.

33 Joint Probability
Joint probability is the probability (relative frequency) computed on the joint distributions:

             College Graduate   Not a College Graduate   Total
Smoker              35                    80              115
Nonsmoker          130                   175              305
Total              165                   255              420
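Marginal and joint probabilities for the 420 employees can be sketched from the cell counts alone. The dictionary layout and the helper `marginal` are illustrative names, not part of the slides.

```python
from fractions import Fraction

# Joint frequencies keyed by (category of X, category of Y).
counts = {
    ("Smoker", "College Graduate"): 35,
    ("Smoker", "Not a College Graduate"): 80,
    ("Nonsmoker", "College Graduate"): 130,
    ("Nonsmoker", "Not a College Graduate"): 175,
}
total = sum(counts.values())  # 420

# Joint probability of each cell: cell count / grand total.
joint = {cell: Fraction(count, total) for cell, count in counts.items()}

def marginal(position, category):
    """Sum the joint probabilities over the cells matching one category of X (0) or Y (1)."""
    return sum(p for cell, p in joint.items() if cell[position] == category)

p_smoker_and_graduate = joint[("Smoker", "College Graduate")]   # 35/420
p_smoker = marginal(0, "Smoker")                                # 115/420
p_graduate = marginal(1, "College Graduate")                    # 165/420

print(float(p_smoker_and_graduate), float(p_smoker), float(p_graduate))
```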

34 Conditional Probability
Now suppose that one employee is selected at random from these 420. Assume that it is known that he is a smoker. What is the probability that the employee selected is a college graduate?

             College Graduate   Not a College Graduate   Total
Smoker              35                    80              115
Nonsmoker          130                   175              305
Total              165                   255              420

It is written as P(Graduate | Smoker). It is read as “the probability that he is a college graduate given that he is a smoker”.

35 Conditional Probability
Conditional probability is the probability (relative frequency) computed on the conditional distributions:

             College Graduate   Not a College Graduate   Total
Smoker              35                    80              115
Nonsmoker          130                   175              305
Total              165                   255              420
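The conditional probability asked for in the smoker example can be obtained in two ways: directly from the conditional distribution (the “Smoker” row), or from the standard identity P(A | B) = P(A ∩ B) / P(B), which is not stated in the slides and is included here as a cross-check. A minimal sketch with exact fractions:

```python
from fractions import Fraction

# Counts taken from the contingency table of the 420 employees.
smoker_and_graduate = 35
smokers = 115
employees = 420

# Direct route: relative frequency within the conditional distribution of Y given "Smoker".
p_graduate_given_smoker = Fraction(smoker_and_graduate, smokers)

# Cross-check: joint probability divided by marginal probability.
p_joint = Fraction(smoker_and_graduate, employees)
p_smoker = Fraction(smokers, employees)
p_via_identity = p_joint / p_smoker

print(p_graduate_given_smoker, float(p_graduate_given_smoker))
print(p_via_identity == p_graduate_given_smoker)
```

Knowing that the employee smokes restricts attention to the 115 smokers, so the denominator changes from 420 to 115.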

