 # Understanding Variability

3/25/2017. Instructor: Ron S. Kenett. Course textbook: Modern Industrial Statistics, Kenett and Zacks, Duxbury Press, 1998. (c) 2000, Ron S. Kenett, Ph.D.

Course Syllabus

- Understanding Variability
- Variability in Several Dimensions
- Basic Models of Probability
- Sampling for Estimation of Population Quantities
- Parametric Statistical Inference
- Computer Intensive Techniques
- Multiple Linear Regression
- Statistical Process Control
- Design of Experiments

Discrete Data

A set of data is said to be discrete if the values or observations belonging to it are distinct and separate, that is, they can be counted (1, 2, 3, ...). Examples: the number of kittens in a litter; the number of patients in a doctor's surgery; the number of flaws in one metre of cloth; gender (male, female); blood group (O, A, B, AB).

Continuous Data

A set of data is said to be continuous if the values or observations belonging to it may take on any value within a finite or infinite interval. You can count, order and measure continuous data. Examples: height; weight; temperature; the amount of sugar in an orange; the time required to run a mile.

Types of Variables

- Qualitative variables: attributes or categories. Examples: male/female, registered to vote or not, ethnicity, eye color.
- Quantitative variables:
  - Discrete: usually take on integer values, but can take on fractions when the variable allows. Counts: how many.
  - Continuous: can take on any value at any point along an interval. Measurements: how much.

Self Assessment Test

For each of the following, indicate whether the appropriate variable would be qualitative or quantitative. If the variable is quantitative, indicate whether it would be discrete or continuous.

Self Assessment Test

- a) Whether you own an RCA Colortrak television set. Qualitative variable: two levels (yes/no), no measurement.
- b) Your status as a full-time or a part-time student. Qualitative variable: two levels (full/part), no measurement.
- c) Number of people who attended your school's graduation last year. Quantitative, discrete variable: a countable number, only whole numbers.

Self Assessment Test

- d) The price of your most recent haircut. Quantitative, discrete variable: a countable number, only whole numbers.
- e) Sam's travel time from his dorm to the Student Union. Quantitative, continuous variable: time is measured and can take on any value greater than zero.

Self Assessment Test

- f) The number of students on campus who belong to a social fraternity or sorority. Quantitative, discrete variable: a countable number, only whole numbers.

Scales of Measurement

- Nominal scale: labels represent various levels of a categorical variable.
- Ordinal scale: labels represent an order that indicates either preference or ranking.
- Interval scale: numerical labels indicate order and distance between elements. There is no absolute zero, and multiples of measures are not meaningful.
- Ratio scale: numerical labels indicate order and distance between elements. There is an absolute zero, and multiples of measures are meaningful.

Self Assessment Test

Bill scored 1200 on the Scholastic Aptitude Test and entered college as a physics major. As a freshman, he changed to business because he thought it was more interesting. Because he made the dean's list last semester, his parents gave him \$30 to buy a new Casio calculator. Identify at least one piece of information in each of the following scales of measurement.

Self Assessment Test

a) Nominal scale of measurement:

1. Bill is going to college.
2. Bill will buy a Casio calculator.
3. Bill was a physics major.
4. Bill is a business major.
5. Bill was on the dean's list.

Self Assessment Test

- b) Ordinal scale of measurement: Bill is a freshman.
- c) Interval scale of measurement: Bill earned a 1200 on the SAT.
- d) Ratio scale of measurement: Bill's parents gave him \$30.

Histogram

A histogram is a way of summarising data that are measured on an interval scale (either discrete or continuous). It is often used in exploratory data analysis to illustrate the major features of the distribution of the data in a convenient form. It divides the range of possible values in a data set into classes or groups. For each group, a rectangle is constructed with a base length equal to the range of values in that group and an area proportional to the number of observations falling into that group. This means that the rectangles might be drawn with non-uniform heights.

Key Terms

- Data array: an orderly presentation of data in either ascending or descending numerical order.
- Frequency distribution: a table that represents the data in classes and shows the number of observations in each class.

Key Terms: Frequency Distribution

- Class: the category
- Frequency: the number of observations in each class
- Class limits: the boundaries of each class
- Class interval: the width of each class
- Class mark: the midpoint of each class

Sturges' Rule

Sturges' rule sets the approximate number of classes with which to begin constructing a frequency distribution:

$$k = 1 + 3.322 \log_{10} n$$

where $k$ is the approximate number of classes to use and $n$ is the number of observations in the data set.
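As a minimal sketch, Sturges' rule in its usual textbook form, k = 1 + 3.322 log10(n), can be evaluated as follows. The function name `sturges_classes` is hypothetical, and the constant 3.322 is assumed from the standard statement of the rule.

```python
import math


def sturges_classes(n: int) -> float:
    """Approximate number of classes, k = 1 + 3.322 * log10(n)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 + 3.322 * math.log10(n)


# For the 50 observations of the worked example later in the deck.
k = sturges_classes(50)
print(round(k, 2))  # about 6.64, i.e. approximately 7 classes
```

The result is only a starting point: the steps that follow round the class interval to a convenient value, which can change the final number of classes.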

Frequency Distributions

1. Number of classes. Choose an approximate number of classes for your data. Sturges' rule can help.
2. Estimate the class interval. Divide the range of your data by the approximate number of classes (from Step 1) to find the approximate class interval, where the range is the largest data value minus the smallest data value.
3. Determine the class interval. Round the estimate (from Step 2) to a convenient value.

Frequency Distributions

4. Lower class limit. Determine the lower class limit for the first class by selecting a convenient number that is smaller than the lowest data value.
5. Class limits. Determine the other class limits by repeatedly adding the class width (from Step 3) to the prior class limit, starting with the lower class limit (from Step 4).
6. Define the classes. Use the sequence of class limits to define the classes.
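The six steps can be sketched in code once the lower class limit, class width and number of classes have been chosen. The function `frequency_distribution` is a hypothetical helper, and the ten-value sample is only a small illustrative subset of daily-cost figures, not the full data set of the worked example.

```python
def frequency_distribution(values, lower_limit, width, num_classes):
    """Count observations in classes of the form [lower, lower + width)."""
    counts = [0] * num_classes
    for v in values:
        index = int((v - lower_limit) // width)
        if not 0 <= index < num_classes:
            raise ValueError(f"{v} falls outside the defined classes")
        counts[index] += 1
    # Steps 5 and 6: build the class limits by repeatedly adding the width.
    classes = [
        (lower_limit + i * width, lower_limit + (i + 1) * width)
        for i in range(num_classes)
    ]
    return list(zip(classes, counts))


# Illustrative sample only (an assumption of this sketch).
sample = [482, 506, 537, 652, 666, 775, 784, 823, 885, 1010]
table = frequency_distribution(sample, lower_limit=450, width=100, num_classes=8)
for (low, high), count in table:
    print(f"{low} to under {high}: {count}")
```

Each class includes its lower limit and excludes its upper limit, which matches the "up to" and "under" wording used for class limits in the example slides.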

Relative Frequency Distributions

1. Retain the same classes defined in the frequency distribution.
2. Sum the total number of observations across all classes of the frequency distribution.
3. Divide the frequency for each class by the total number of observations. This gives the proportion of data values in each class (multiply by 100 for a percentage).
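A short sketch of these three steps follows. The class counts are the ones implied by the relative frequencies of the hospital-cost example later in the deck (each relative frequency times 50); the function name is hypothetical.

```python
def relative_frequencies(counts):
    """Divide each class frequency by the total number of observations."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations")
    return [c / total for c in counts]


counts = [4, 3, 9, 9, 11, 7, 6, 1]  # eight classes, 50 observations in all
rel = relative_frequencies(counts)
print([round(r, 2) for r in rel])
# [0.08, 0.06, 0.18, 0.18, 0.22, 0.14, 0.12, 0.02]
```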

Cumulative Relative Frequency Distributions

1. List the number of observations in the lowest class.
2. Add the frequency of the lowest class to the frequency of the second class. Record that cumulative sum for the second class.
3. Continue to add the prior cumulative sum to the frequency for each subsequent class, so that the cumulative sum for the final class is the total number of observations in the data set.

Cumulative Relative Frequency Distributions

4. Divide the accumulated frequency for each class by the total number of observations. This gives the proportion of all observations that occurred up to and including that class.

An alternative: accumulate the relative frequencies for each class instead of the raw frequencies. Then you do not have to divide by the total.
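Both routes to the cumulative relative frequencies can be sketched with `itertools.accumulate`. The class counts are again those implied by the hospital-cost example (an assumption of this sketch).

```python
from itertools import accumulate

counts = [4, 3, 9, 9, 11, 7, 6, 1]  # class frequencies, 50 observations

# Steps 1 to 3: running total of the raw frequencies.
cum_freq = list(accumulate(counts))
total = cum_freq[-1]

# Step 4: divide each running total by the number of observations.
cum_rel = [c / total for c in cum_freq]

# Alternative: accumulate the relative frequencies directly.
cum_rel_alt = list(accumulate(c / total for c in counts))

print(cum_freq)                         # [4, 7, 16, 25, 36, 43, 49, 50]
print([round(c, 2) for c in cum_rel])   # [0.08, 0.14, 0.32, 0.5, 0.72, 0.86, 0.98, 1.0]
```

The two routes agree up to floating-point rounding, so a comparison between them should use a small tolerance rather than exact equality.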

Example

The average daily cost to community hospitals for patient stays during 1993 for each of the 50 U.S. states is given in the next table.

- a) Arrange these into a data array.
- b) Construct a stem-and-leaf display.
- *) Approximately how many classes would be appropriate for these data?
- c & d) Construct a frequency distribution. State the interval width and class mark.
- e) Construct a histogram, a relative frequency distribution, and a cumulative relative frequency distribution.

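Example – Data List

| State | Cost | State | Cost | State | Cost | State | Cost | State | Cost |
|---|---|---|---|---|---|---|---|---|---|
| AL | \$775 | HI | 823 | MA | 1,036 | NM | 1,046 | SD | 506 |
| AK | 1,136 | ID | | MI | | NY | 784 | TN | |
| AZ | 1,091 | IL | | MN | 652 | NC | | TX | 1,010 |
| AR | | IN | | MS | | ND | | UT | 1,081 |
| CA | 1,221 | IA | | MO | | OH | | VT | |
| CO | | KS | 666 | MT | | OK | | VA | |
| CT | 1,058 | KY | | NE | | OR | 1,052 | WA | 1,143 |
| DE | 1,024 | LA | | NV | | PA | | WV | |
| FL | | ME | | NH | | RI | 885 | WI | |
| GA | | MD | | NJ | | SC | | WY | 537 |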

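Example – Data Array

The array is in descending order, reading down each column.

| State | Cost | State | Cost | State | Cost | State | Cost | State | Cost |
|---|---|---|---|---|---|---|---|---|---|
| CA | 1,221 | TX | 1,010 | RI | 885 | NY | 784 | KS | 666 |
| WA | 1,143 | NH | | LA | | AL | 775 | ID | |
| AK | 1,136 | CO | | MO | | GA | | MN | 652 |
| AZ | 1,091 | FL | | PA | | NC | | NE | |
| UT | 1,081 | OH | | TN | | WI | | IA | |
| CT | 1,058 | IL | | SC | | ME | | MS | |
| OR | 1,052 | MI | | VA | | KY | | WY | 537 |
| NM | 1,046 | NV | | NJ | | WV | | ND | |
| MA | 1,036 | IN | | HI | 823 | AR | | SD | 506 |
| DE | 1,024 | MD | | OK | | VT | | MT | |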

Example – Stem and Leaf Display

Stem-and-Leaf Display, N = 50, Leaf Unit: 100

| Stem | Leaves |
|---|---|
| 12 | 21 |
| 11 | 43, 36 |
| 10 | 91, 81, 58, 52, 46, 36, 24, 10 |
| 9 | …, 61, 60, 40, 17, 02, 00 |
| 8 | …, 89, 85, 75, 63, 61, 59, 38, 30, 29, 23 |
| 7 | …, 84, 75, 75, 63, 44, 38, 03, 01 |
| 6 | …, 76, 66, 59, 52, 26, 12 |
| 5 | …, 37, 07, 06 |
| 4 | 82 |

The row with stem 8 holds 11 leaves and contains the median. Range: \$482 - \$1,221

Example – Frequency Distribution

To approximate the number of classes we should use in creating the frequency distribution, use Sturges' rule with $n = 50$:

$$k = 1 + 3.322 \log_{10} 50 \approx 6.64$$

Sturges' rule suggests we use approximately 7 classes.

Example – Frequency Distribution

Step 1. Number of classes. Sturges' rule: approximately 7 classes.

Steps 2 & 3. The class interval. The range is \$1,221 - \$482 = \$739, so \$739/7 ≈ \$106 and \$739/8 ≈ \$92. If we use 8 classes, we can make each class \$100 wide.

Example – Frequency Distribution

Step 4. The lower class limit. If we start at \$450, we can cover the range in 8 classes, each class \$100 in width. The first class is \$450 up to \$550.

Steps 5 & 6. Setting class limits:

- \$450 up to \$550
- \$550 up to \$650
- \$650 up to \$750
- \$750 up to \$850
- \$850 up to \$950
- \$950 up to \$1,050
- \$1,050 up to \$1,150
- \$1,150 up to \$1,250

Example – Frequency Distribution

| Average daily cost | Number | Mark |
|---|---|---|
| \$450 – under \$550 | 4 | \$500 |
| \$550 – under \$650 | 3 | \$600 |
| \$650 – under \$750 | 9 | \$700 |
| \$750 – under \$850 | 9 | \$800 |
| \$850 – under \$950 | 11 | \$900 |
| \$950 – under \$1,050 | 7 | \$1,000 |
| \$1,050 – under \$1,150 | 6 | \$1,100 |
| \$1,150 – under \$1,250 | 1 | \$1,200 |

Interval width: \$100

Example – Histogram [figure]

Example – Relative Frequency Distribution

| Average daily cost | Number | Rel. Freq. |
|---|---|---|
| \$450 – under \$550 | 4 | 4/50 = .08 |
| \$550 – under \$650 | 3 | 3/50 = .06 |
| \$650 – under \$750 | 9 | 9/50 = .18 |
| \$750 – under \$850 | 9 | 9/50 = .18 |
| \$850 – under \$950 | 11 | 11/50 = .22 |
| \$950 – under \$1,050 | 7 | 7/50 = .14 |
| \$1,050 – under \$1,150 | 6 | 6/50 = .12 |
| \$1,150 – under \$1,250 | 1 | 1/50 = .02 |

Example – Polygon [figure]

Example – Cumulative Frequency Distribution

| Average daily cost | Number | Cum. Freq. |
|---|---|---|
| \$450 – under \$550 | 4 | 4 |
| \$550 – under \$650 | 3 | 7 |
| \$650 – under \$750 | 9 | 16 |
| \$750 – under \$850 | 9 | 25 |
| \$850 – under \$950 | 11 | 36 |
| \$950 – under \$1,050 | 7 | 43 |
| \$1,050 – under \$1,150 | 6 | 49 |
| \$1,150 – under \$1,250 | 1 | 50 |

Example – Cumulative Relative Frequency Distribution

| Average daily cost | Cum. Freq. | Cum. Rel. Freq. |
|---|---|---|
| \$450 – under \$550 | 4 | 4/50 = .08 |
| \$550 – under \$650 | 7 | 7/50 = .14 |
| \$650 – under \$750 | 16 | 16/50 = .32 |
| \$750 – under \$850 | 25 | 25/50 = .50 |
| \$850 – under \$950 | 36 | 36/50 = .72 |
| \$950 – under \$1,050 | 43 | 43/50 = .86 |
| \$1,050 – under \$1,150 | 49 | 49/50 = .98 |
| \$1,150 – under \$1,250 | 50 | 50/50 = 1.00 |

Example – Percentage Ogive [figure]

Statistical Description of Data

Key Terms: Measures of Central Tendency (The Center)

- Mean ($\mu$ for a population; $\bar{x}$ for a sample)
- Weighted mean
- Median
- Mode

Key Terms: Measures of Dispersion (The Spread)

- Range
- Mean absolute deviation
- Variance
- Standard deviation
- Interquartile range
- Interquartile deviation
- Coefficient of variation

Key Terms: Measures of Relative Position

- Quantiles: quartiles, deciles, percentiles
- Residuals
- Standardized values

The Mean

The mean is the arithmetic average: (sum of all values) / (number of values).

- Population: $\mu = \left(\sum x_i\right)/N$
- Sample: $\bar{x} = \left(\sum x_i\right)/n$

Problem: Calculate the average number of truck shipments from the United States to five Canadian cities for the following data, given in thousands of bags: Montreal, 64.0; Ottawa, 15.0; Toronto, 285.0; Vancouver, 228.0; Winnipeg, 45.0. (Ans: 127.4)
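The stated answer can be checked with a few lines of code, using the five shipment figures from the problem (in thousands of bags). The variable names are illustrative.

```python
shipments = {
    "Montreal": 64.0,
    "Ottawa": 15.0,
    "Toronto": 285.0,
    "Vancouver": 228.0,
    "Winnipeg": 45.0,
}

# Arithmetic mean: sum of all values divided by the number of values.
mean_shipments = sum(shipments.values()) / len(shipments)
print(round(mean_shipments, 1))  # 127.4
```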

The Weighted Mean

When what you have is grouped data, compute the mean using $\mu = \left(\sum w_i x_i\right) / \sum w_i$.

Problem: Calculate the average profit from truck shipments, United States to Canada, for the following data, given in thousands of bags and profits per thousand bags:

| City | Shipments (thousands of bags) | Profit per thousand bags |
|---|---|---|
| Montreal | | |
| Ottawa | | |
| Toronto | | \$15.50 |
| Vancouver | | |
| Winnipeg | 45.0 | \$14.00 |

(Ans: \$14.04 per thousand bags)
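A minimal sketch of the weighted mean for grouped data follows. Because the profit figures of the problem above are not all legible in this transcript, the sketch instead uses the class marks and class frequencies of the hospital-cost frequency distribution; treating that table as the grouped data is an assumption of this example, and `weighted_mean` is a hypothetical helper.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


class_marks = [500, 600, 700, 800, 900, 1000, 1100, 1200]  # class midpoints
frequencies = [4, 3, 9, 9, 11, 7, 6, 1]                    # observations per class

grouped_mean = weighted_mean(class_marks, frequencies)
print(grouped_mean)  # 840.0
```

The result approximates the mean of the raw data, since every observation in a class is represented by that class's midpoint.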

The Median

To find the median:

1. Put the data in an array.
2. If the data set has an odd number of values, the median is the middle value. If the data set has an even number of values, the median is the average of the middle two values. (Note that the median of an even-sized data set is not necessarily a member of the set.)

The median is particularly useful if there are outliers in the data set, which otherwise tend to sway the value of the arithmetic mean.
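The odd and even cases can be sketched as below, reusing the shipment figures from the mean problem as sample data. The helper name `median_of` is hypothetical.

```python
def median_of(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)  # step 1: put the data in an array
    n = len(ordered)
    if n == 0:
        raise ValueError("no data")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_case = median_of([64.0, 15.0, 285.0, 228.0, 45.0])
even_case = median_of([64.0, 15.0, 285.0, 228.0])
print(odd_case)   # 64.0
print(even_case)  # 146.0
```

For the five shipment values the median (64.0) sits well below the mean (127.4), because the two large values pull the mean upward but leave the middle value unchanged.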

The Mode

The mode is the most frequent value. While there is just one value for the mean and one value for the median, there may be more than one value for the mode of a data set. The mode tends to be used less frequently than the mean or the median.
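Python's standard library can report every mode of a data set through `statistics.multimode` (available from Python 3.8). The data below are hypothetical and chosen so that two values tie for most frequent.

```python
from statistics import multimode

data = [2, 3, 3, 5, 7, 7, 9]  # hypothetical values: 3 and 7 each occur twice

modes = multimode(data)
print(modes)  # [3, 7]
```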

Comparing Measures of Central Tendency

- If mean = median = mode, the shape of the distribution is symmetric.
- If mode < median < mean, the distribution trails to the right and is positively skewed.
- If mean < median < mode, the distribution trails to the left and is negatively skewed.

The Range

The range is the distance between the smallest and the largest data value in the set:

Range = largest value - smallest value

Sometimes the range is reported as an interval, anchored between the smallest and largest data values, rather than as the width of that interval.

Residuals

Residuals are the differences between each data value in the set and the group mean:

- for a population, $x_i - \mu$
- for a sample, $x_i - \bar{x}$

The MAD

The mean absolute deviation is found by summing the absolute values of all residuals and dividing by the number of values in the set:

- for a population, $\text{MAD} = \left(\sum |x_i - \mu|\right)/N$
- for a sample, $\text{MAD} = \left(\sum |x_i - \bar{x}|\right)/n$
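The range, the residuals and the mean absolute deviation can be sketched together. The five shipment figures from the mean problem serve as sample data here and are treated as a complete population, which is an assumption of this example.

```python
values = [64.0, 15.0, 285.0, 228.0, 45.0]  # shipments, thousands of bags

data_range = max(values) - min(values)       # largest minus smallest
mean = sum(values) / len(values)
residuals = [x - mean for x in values]       # x_i - mu
mad = sum(abs(r) for r in residuals) / len(values)

print(data_range)     # 270.0
print(round(mad, 2))  # 103.28
```

The residuals themselves always sum to zero (up to rounding), which is why the MAD averages their absolute values instead.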

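The Variance

Variance is one of the most frequently used measures of spread:

- for a population, $\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N} = \dfrac{\sum x_i^2 - N\mu^2}{N}$
- for a sample, $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1} = \dfrac{\sum x_i^2 - n\bar{x}^2}{n - 1}$

The right side of each equation is often used as a computational shortcut.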

The Standard Deviation

Since variance is given in squared units, we often find uses for the standard deviation, which is the square root of the variance:

- for a population, $\sigma = \sqrt{\sigma^2}$
- for a sample, $s = \sqrt{s^2}$
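A brief sketch of the variance and standard deviation follows, again with the shipment figures as sample data. It assumes the usual definitions: the population variance divides the sum of squared residuals by N, the sample variance divides by n - 1, and the computational shortcut replaces the sum of squared residuals with the sum of squares minus N times the squared mean.

```python
import math

values = [64.0, 15.0, 285.0, 228.0, 45.0]  # shipments, thousands of bags
n = len(values)
mean = sum(values) / n

# Definitional form: sum of squared residuals.
ss_definition = sum((x - mean) ** 2 for x in values)
# Shortcut form: sum of squares minus n * mean^2.
ss_shortcut = sum(x * x for x in values) - n * mean ** 2

population_variance = ss_definition / n
sample_variance = ss_definition / (n - 1)   # assumes the n - 1 divisor
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(round(population_variance, 2), round(sample_variance, 2))
print(round(population_sd, 2), round(sample_sd, 2))
```

The shortcut and definitional sums agree up to floating-point rounding; with large values and small spread the shortcut can lose precision, so the definitional form is the safer default in code.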

Quartiles

One of the most frequently used quantiles is the quartile. Quartiles divide the values of a data set into four subsets of equal size, each comprising 25% of the observations. To find the first, second, and third quartiles:

1. Arrange the $N$ data values into an array.
2. First quartile, $Q_1$ = data value at position $(N + 1)/4$
3. Second quartile, $Q_2$ = data value at position $2(N + 1)/4$
4. Third quartile, $Q_3$ = data value at position $3(N + 1)/4$
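The position rule can be sketched as follows. The slide does not say what to do when a position is not a whole number, so this sketch assumes linear interpolation between the two neighbouring values; that choice, the helper names and the sample data are all assumptions.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position; fractional positions interpolate linearly."""
    if position < 1 or position > len(ordered):
        raise ValueError("position falls outside the data array")
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part, 0 for exact positions
    if fraction == 0:
        return ordered[lower - 1]
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)


def quartiles(values):
    """Q1, Q2, Q3 at positions (N + 1)/4, 2(N + 1)/4 and 3(N + 1)/4."""
    ordered = sorted(values)       # step 1: arrange into an array
    n = len(ordered)
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))


print(quartiles([3, 5, 7, 8, 12, 13, 14]))  # (5, 8, 13)
print(quartiles([1, 2, 3, 4]))              # (1.25, 2.5, 3.75)
```

With seven values the positions are 2, 4 and 6, so each quartile is an actual data value; with four values every position is fractional and the interpolation assumption matters.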

Quartiles [figure]

Standardized Values

A standardized value expresses how far above or below the population mean an individual value lies, in units of standard deviation.

- "How far above or below": (data value - mean), which is the residual.
- "In units of standard deviation": divided by $\sigma$.

Standardized individual value: $z = \dfrac{x - \mu}{\sigma}$

A negative $z$ means the data value falls below the mean.
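As a closing sketch, standardized values can be computed as the residual divided by the population standard deviation, z = (x - mu) / sigma. The shipment figures from the mean problem are used as sample data and treated as a complete population, which is an assumption of this example.

```python
from statistics import fmean, pstdev

shipments = {
    "Montreal": 64.0,
    "Ottawa": 15.0,
    "Toronto": 285.0,
    "Vancouver": 228.0,
    "Winnipeg": 45.0,
}

mu = fmean(shipments.values())      # population mean
sigma = pstdev(shipments.values())  # population standard deviation

# Residual divided by the standard deviation.
z_scores = {city: (x - mu) / sigma for city, x in shipments.items()}

for city, z in z_scores.items():
    print(f"{city}: {z:+.2f}")
```

Toronto and Vancouver lie above the mean and get positive z values; the other three cities lie below it and get negative ones.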