1
Business Statistics, 6th ed. by Ken Black
Chapter 1 Introduction to Statistics
2
Learning Objectives
Define statistics.
Become aware of a wide range of applications of statistics in business.
Differentiate between descriptive and inferential statistics.
Classify numbers by level of data and understand why doing so is important.
3
Statistics in Business
Accounting: auditing and cost estimation
Economics: regional, national, and international economic performance
Finance: investments and portfolio management
Management: human resources, compensation, and quality management
Management Information Systems: performance of systems that gather, summarize, and disseminate information to various managerial levels
Marketing: market analysis and consumer research
International Business: market and demographic analysis
4
What is Statistics?
Science of gathering, analyzing, interpreting, and presenting data on various topics
Branch of mathematics
Course of study
Facts and figures
Measurement taken on a sample
Type of distribution being used to analyze data
5
Statistics in Business
Statistics: the science dealing with the collection, analysis, interpretation, and presentation of numerical data
The word statistics is also used in two narrower senses:
Descriptive measure: computed from a sample and used to make a determination
Distribution: used in the analysis of the data
6
Statistics in Business
Branches of statistics
Descriptive: using data gathered on a group to describe or reach conclusions about that same group
Inferential: using data gathered from a sample to reach conclusions about the population from which the sample was drawn, or about similar groups
7
Population Versus Sample
Population: the whole; a collection of persons, objects, or items under study
Census: gathering data from the entire population
Sample: a portion of the whole; a subset of the population that must be large enough to represent the whole
8
Population
9
Population and Census Data
Identifier  Color  MPG
RD1         Red    12
RD2         Red    10
RD3         Red    13
RD4         Red
RD5         Red    13
BL1         Blue   27
BL2         Blue   24
GR1         Green  35
GR2         Green
GY1         Gray   15
GY2         Gray   18
GY3         Gray   17
10
Sample and Sample Data
Identifier  Color  MPG
RD2         Red    10
RD5         Red    13
GR1         Green  35
GY2         Gray   18
11
Parameter vs. Statistic
Parameter: descriptive measure of the population
Usually represented by Greek letters
Statistic: descriptive measure of a sample
Usually represented by Roman letters
12
Symbols for Population Parameters
13
Symbols for Sample Statistics
14
Process of Inferential Statistics
15
Statistics in Business
The difference between a parameter and a statistic is important only in the use of inferential statistics
Calculating a parameter can be cost prohibitive
When it is, the appropriate statistic is calculated from a sample instead
Researchers use that statistic as an estimate of the parameter
16
Statistics in Business
Inferences about parameters are made under conditions of uncertainty
Uncertainty can be caused by:
a small sample
lack of knowledge about the source of the inferences
a change in conditions not accounted for
17
Statistics in Business
Probability statement: used to express the level of confidence in an inference made under uncertainty
18
Levels of Data Measurement
Nominal: In nominal measurement the numerical values just "name" the attribute uniquely. No ordering of the cases is implied. For example, jersey numbers in basketball are measured at the nominal level. A player with number 30 is not more of anything than a player with number 15, and is certainly not twice whatever number 15 is.
19
Levels of Data Measurement
Ordinal: A variable is ordinally measurable if ranking is possible for values of the variable. For example, a gold medal reflects superior performance to a silver or bronze medal in the Olympics, or you may prefer French toast to waffles, and waffles to oat bran muffins. “First” and “Second” are ordinal measurements.
20
Levels of Data Measurement
Interval: In interval measurement the distance between attributes does have meaning. For example, when measuring temperature (in Fahrenheit), the distance from 30 to 40 degrees is the same as the distance from 70 to 80 degrees. The interval between values is interpretable.
21
Levels of Data Measurement
Ratio — in ratio measurement there is always an absolute zero that is meaningful. This means that you can construct a meaningful fraction (or ratio) with a ratio variable. In applied social research most "count" variables are ratio, for example, the number of clients in past six months.
22
Levels of Data Measurement
Cardinal - A variable is cardinally measurable if a given interval between measures has a consistent meaning, i.e., if the measure corresponds to points along a straight line. For example, height, output, and income are cardinally measurable
23
Nominal Level Data
Numbers are used to classify or categorize
Example: Employment Classification
1 for Educator
2 for Construction Worker
3 for Manufacturing Worker
24
Ordinal Level Data
Numbers are used to indicate rank or order
Relative magnitude of numbers is meaningful
Differences between numbers are not comparable
Example: Ranking productivity of employees
Example: Position within an organization
1 for President
2 for Vice President
3 for Plant Manager
4 for Department Supervisor
5 for Employee
25
Ordinal Data
Faculty and staff should receive preferential treatment for parking space.
1 = Strongly Agree, 2 = Agree, 3 = Neutral, 4 = Disagree, 5 = Strongly Disagree
26
Interval Level Data
Distances between consecutive integers are equal
Relative magnitude of numbers is meaningful
Differences between numbers are comparable
Location of origin, zero, is arbitrary
Vertical intercept of unit of measure transform function is not zero
Example: Fahrenheit Temperature
Example: Monetary Utility
27
Ratio Level Data
Highest level of measurement
Relative magnitude of numbers is meaningful
Differences between numbers are comparable
Location of origin, zero, is absolute (natural)
Vertical intercept of unit of measure transform function is zero
Examples: Height, Weight, and Volume
Example: Monetary variables, such as Profit and Loss, Revenues, Expenses, and financial ratios such as P/E Ratio, Inventory Turnover, and Quick Ratio
28
Ratio Level Data
Parametric statistics: require that the data be interval or ratio
Nonparametric statistics: used if data are nominal or ordinal
Nonparametric statistics can also be used to analyze interval or ratio data
29
Data Level, Operations, and Statistical Methods
Data Level  Meaningful Operations                                                       Statistical Methods
Nominal     Classifying and Counting                                                    Nonparametric
Ordinal     All of the above plus Ranking                                               Nonparametric
Interval    All of the above plus Addition, Subtraction, Multiplication, and Division   Parametric
Ratio       All of the above                                                            Parametric
30
Business Statistics, 6th ed. by Ken Black
Chapter 2 CHARTS and GRAPHS
31
Learning Objectives
Recognize the difference between grouped and ungrouped data
Construct a frequency distribution
Construct a histogram, a frequency polygon, an ogive, a pie chart, a stem and leaf plot, a Pareto chart, and a scatter plot
32
Ungrouped Versus Grouped Data
Ungrouped data
have not been summarized in any way
are also called raw data
Grouped data
have been organized into a frequency distribution
33
Example of Ungrouped Data
Ages of a Sample of Managers from Urban Child Care Centers in the United States
42 30 53 50 52 55 49 61 74 26 58 40 28 36 33 31 37 32 23 43 29 34 47 35 64 46 57 25 60 54
34
Frequency Distribution
Frequency Distribution – summary of data presented in the form of class intervals and frequencies Vary in shape and design Constructed according to the individual researcher's preferences
35
Frequency Distribution
Steps in Frequency Distribution
Step 1: Determine the range of the data. The range is the difference between the highest and the lowest numbers.
Step 2: Determine the number of classes. Don’t use too many or too few classes.
Step 3: Determine the width of the class interval. The approximate class width can be calculated by dividing the range by the number of classes. Each value must fit into only one class.
36
Frequency Distribution of Child Care Manager’s Ages
Class Interval  Frequency
20-under 30     6
30-under 40     18
40-under 50     11
50-under 60     11
60-under 70     3
70-under 80     1
37
Data Range
Smallest value: 23
Largest value: 74
Range = Largest - Smallest = 74 - 23 = 51
38
Number of Classes and Class Width
The number of classes should be between 5 and 15.
Fewer than 5 classes cause excessive summarization.
More than 15 classes leave too much detail.
Class Width
Divide the range by the number of classes for an approximate class width.
Round up to a convenient number.
39
Class Midpoint
The midpoint of each class interval is called the class midpoint or the class mark.
40
Relative Frequency
The relative frequency is the proportion of the total frequency that is in any given class interval in a frequency distribution.
Class Interval  Frequency  Relative Frequency
20-under 30     6          .12
30-under 40     18         .36
40-under 50     11         .22
50-under 60     11         .22
60-under 70     3          .06
70-under 80     1          .02
Total           50         1.00
41
Cumulative Frequency
The cumulative frequency is a running total of frequencies through the classes of a frequency distribution.
Class Interval  Frequency  Cumulative Frequency
20-under 30     6          6
30-under 40     18         24
40-under 50     11         35
50-under 60     11         46
60-under 70     3          49
70-under 80     1          50
Total           50
42
Class Midpoint Class midpoint – The midpoint of each class interval
Midpoint is half way across the class interval Midpoint is the average of the class end points
43
Relative Frequency
Relative frequency: the proportion of the total frequency that is in any given class interval in a frequency distribution
Relative frequency = individual class frequency divided by the total frequency
Example: Frequency/Total = 16/40 = .40
A relative frequency can be read as a probability of occurrence
44
Cumulative Frequency Cumulative frequency – the running total of frequencies through the classes of a frequency distribution Cumulative frequency for each class is the frequency for that class interval added to the preceding cumulative total At the last interval, the cumulative total equals the sum of the frequencies
45
Class Midpoints, Relative Frequencies, and Cumulative Frequencies
Class Interval  Frequency  Midpoint  Relative Frequency  Cumulative Frequency
20-under 30     6          25        .12                 6
30-under 40     18         35        .36                 24
40-under 50     11         45        .22                 35
50-under 60     11         55        .22                 46
60-under 70     3          65        .06                 49
70-under 80     1          75        .02                 50
Total           50                   1.00
46
Cumulative Relative Frequencies
The cumulative relative frequency is a running total of the relative frequencies through the classes of a frequency distribution.
Class Interval  Frequency  Relative Frequency  Cumulative Frequency  Cumulative Relative Frequency
20-under 30     6          .12                 6                     .12
30-under 40     18         .36                 24                    .48
40-under 50     11         .22                 35                    .70
50-under 60     11         .22                 46                    .92
60-under 70     3          .06                 49                    .98
70-under 80     1          .02                 50                    1.00
Total           50         1.00
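The midpoint, relative, cumulative, and cumulative relative columns can all be derived from the class frequencies alone. The following is a minimal sketch, assuming the six class intervals and the frequencies 6, 18, 11, 11, 3, 1 from the child care manager example; the function name `summarize` is illustrative only.

```python
from itertools import accumulate

# Class intervals (lower bound, upper bound) and frequencies
# from the child care manager age example.
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
frequencies = [6, 18, 11, 11, 3, 1]


def summarize(classes, frequencies):
    """Return one row per class with midpoint and frequency measures."""
    total = sum(frequencies)
    cumulative = list(accumulate(frequencies))  # running total of frequencies
    rows = []
    for (low, high), f, cf in zip(classes, frequencies, cumulative):
        rows.append({
            "interval": f"{low}-under {high}",
            "frequency": f,
            "midpoint": (low + high) / 2,       # average of the class endpoints
            "relative": f / total,              # share of the total frequency
            "cumulative": cf,
            "cumulative_relative": cf / total,
        })
    return rows


rows = summarize(classes, frequencies)
for row in rows:
    print(row)
```

The last row's cumulative frequency equals the total frequency, which is a quick check that no class was skipped.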
47
Common Statistical Graphs
Histogram -- vertical bar chart of frequencies
Frequency Polygon -- line graph of frequencies
Ogive -- line graph of cumulative frequencies
Pie Chart -- proportional representation for categories of a whole
48
Common Statistical Graphs
Stem and Leaf Plot -- a graphical method of displaying data. It is particularly useful when your data are not too numerous.
Pareto Chart -- type of chart which contains both bars and a line graph. The bars display the values in descending order, and the line graph shows the cumulative totals of each category, left to right. The purpose is to highlight the most important among a (typically large) set of factors.
49
Common Statistical Graphs
Scatter Plot -- type of display using Cartesian coordinates to display values for two variables for a set of data. The data is displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. A scatter plot is also called a scatter chart, scatter diagram and scatter graph.
50
Histogram (figure: frequencies plotted as vertical bars over the class intervals)
51
Histogram Construction
Class Interval  Frequency
20-under 30     6
30-under 40     18
40-under 50     11
50-under 60     11
60-under 70     3
70-under 80     1
52
Frequency Polygon (figure: line graph of the class frequencies)
53
Ogive
Class Interval  Cumulative Frequency
20-under 30     6
30-under 40     24
40-under 50     35
50-under 60     46
60-under 70     49
70-under 80     50
54
Relative Frequency Ogive
Class Interval  Cumulative Relative Frequency
20-under 30     .12
30-under 40     .48
40-under 50     .70
50-under 60     .92
60-under 70     .98
70-under 80     1.00
55
Complaints by Amtrak Passengers
Category           Number   Proportion  Degrees
Stations, etc.     28,000   .40         144.0
Train Performance  14,700   .21         75.6
Equipment          10,500   .15         54.0
Personnel          9,800    .14         50.4
Schedules, etc.    7,000    .10         36.0
Total              70,000   1.00        360.0
56
Complaints by Amtrak Passengers
57
Second Quarter U.S. Truck Production
Second Quarter Truck Production in the U.S. (Hypothetical values)
Company  Production
A        357,411
B        354,936
C        160,997
D        34,099
E        12,747
Totals   920,190
58
Second Quarter U.S. Truck Production
59
Pie Chart Calculations for Company A
Company  2d Quarter Truck Production  Proportion  Degrees
A        357,411                      .388        140
B        354,936                      .386        139
C        160,997                      .175        63
D        34,099                       .037        13
E        12,747                       .014        5
Totals   920,190                      1.000       360
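Each pie slice is the category's proportion of the total multiplied by 360 degrees. A minimal sketch of that calculation, assuming the hypothetical truck production figures above (the helper name `pie_slices` is illustrative):

```python
# Second quarter truck production by company (hypothetical values from the slide).
production = {"A": 357_411, "B": 354_936, "C": 160_997, "D": 34_099, "E": 12_747}


def pie_slices(counts):
    """Map each category to (proportion of total, degrees of the circle)."""
    total = sum(counts.values())
    return {name: (value / total, 360 * value / total)
            for name, value in counts.items()}


slices = pie_slices(production)
for name, (proportion, degrees) in slices.items():
    print(f"{name}: proportion {proportion:.3f}, degrees {degrees:.0f}")
```

The degrees here are computed from unrounded proportions, so they can differ slightly from values obtained by multiplying a rounded proportion by 360.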
60
Safety Examination Scores for Plant Trainees
Raw Data
86 76 23 77 81 79 68 92 59 75 83 49 91 47 72 82 74 70 56 60 88 97 39 78 94 55 67 89
Stem and Leaf Plot (figure: stems 2 through 9, one row of leaves per stem)
61
Construction of Stem and Leaf Plot
Each raw score is split into a stem (the tens digit) and a leaf (the units digit); for example, the score 86 has stem 8 and leaf 6.
(Figure: step-by-step construction of the stem and leaf plot from the raw data)
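The construction can be sketched in a few lines. This is an illustrative sketch only: it uses the scores as they appear in the slide text, which may not be the complete data set, and assumes two-digit scores so that the stem is the tens digit.

```python
from collections import defaultdict

# Safety examination scores as listed in the slide text (possibly incomplete).
scores = [86, 76, 23, 77, 81, 79, 68, 92, 59, 75, 83, 49, 91, 47,
          72, 82, 74, 70, 56, 60, 88, 97, 39, 78, 94, 55, 67, 89]


def stem_and_leaf(values):
    """Group units digits (leaves) under their tens digit (stem)."""
    plot = defaultdict(list)
    for value in sorted(values):        # sorting keeps stems and leaves ascending
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```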
62
Pareto Chart (figure: bars show the frequency of Poor Wiring, Short in Coil, Defective Plug, and Other in descending order on a 0 to 100 scale; a line shows the cumulative percentage from 0% to 100%)
63
Scatter Plot
Horizontal axis: Registered Vehicles (1000's)
Vertical axis: Gasoline Sales (1000's of Gallons)
Data values: 5 60 15 120 9 90 140 7
64
Business Statistics, 6th ed. by Ken Black
Chapter 3 Descriptive Statistics
65
Learning Objectives
Distinguish between measures of central tendency, measures of variability, measures of shape, and measures of association.
Understand the meanings of mean, median, mode, quartile, percentile, and range.
Compute mean, median, mode, percentile, quartile, range, variance, standard deviation, and mean absolute deviation on ungrouped data.
Differentiate between sample and population variance and standard deviation.
66
Learning Objectives -- Continued
Understand the meaning of standard deviation as it is applied by using the empirical rule and Chebyshev’s theorem.
Compute the mean, median, standard deviation, and variance on grouped data.
Understand box and whisker plots, skewness, and kurtosis.
Compute a coefficient of correlation and interpret it.
67
Measures of Central Tendency: Ungrouped Data
Measures of central tendency yield information about “particular places or locations in a group of numbers.”
Common Measures of Location
Mode
Median
Mean
Percentiles
Quartiles
68
Mode
Mode: the most frequently occurring value in a data set
Applicable to all levels of data measurement (nominal, ordinal, interval, and ratio)
Can be used to determine what categories occur most frequently
Bimodal: in a tie for the most frequently occurring value, two modes are listed
Multimodal: data sets that contain more than two modes
69
Median
Median: middle value in an ordered array of numbers
For an array with an odd number of terms, the median is the middle number
For an array with an even number of terms, the median is the average of the middle two numbers
70
Arithmetic Mean
Mean is the average of a group of numbers
Applicable for interval and ratio data
Not applicable for nominal or ordinal data
Affected by each value in the data set, including extreme values
Computed by summing all values in the data set and dividing the sum by the number of values in the data set
71
Demonstration Problem 3.1
The number of U.S. cars in service by top car rental companies in a recent year according to Auto Rental News follows.
Company          Number of Cars in Service
Enterprise       643,000
Hertz            327,000
National/Alamo   233,000
Avis             204,000
Dollar/Thrifty   167,000
Budget           144,000
Advantage        20,000
U-Save           12,000
Payless          10,000
ACE              9,000
Fox              9,000
Rent-A-Wreck     7,000
Triangle         6,000
Compute the mode, the median, and the mean.
72
Demonstration Problem 3.1
Solution
Mode: 9,000
Median: With 13 different companies in this group, N = 13. The median is located at the (13 + 1)/2 = 7th position. Because the data are already ordered, the 7th term is 20,000, which is the median.
Mean: The total number of cars in service is ∑x = 1,791,000, so μ = ∑x/N = 1,791,000/13 = 137,769.23
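The solution can be checked with Python's standard `statistics` module. This is a minimal sketch using the thirteen values from the problem statement; the variable names are illustrative.

```python
import statistics

# Cars in service for the 13 rental companies in Demonstration Problem 3.1.
cars = [643_000, 327_000, 233_000, 204_000, 167_000, 144_000, 20_000,
        12_000, 10_000, 9_000, 9_000, 7_000, 6_000]

mode = statistics.mode(cars)      # most frequently occurring value
median = statistics.median(cars)  # middle value of the ordered data (7th of 13)
mean = statistics.mean(cars)      # sum of the values divided by N

print(mode, median, round(mean, 2))
```

The mean is far above the median here because a few very large fleets pull the average upward, which illustrates how the mean is affected by extreme values.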
73
Population Mean
μ = ∑x/N, where N is the number of values in the population
74
Sample Mean
x̄ = ∑x/n, where n is the number of values in the sample
75
Percentiles
Percentile: measures of central tendency that divide a group of data into 100 parts
At least n% of the data lie below the nth percentile, and at most (100 - n)% of the data lie above the nth percentile
Example: 90th percentile indicates that at least 90% of the data lie below it, and at most 10% of the data lie above it
76
Quartiles
Quartile: measures of central tendency that divide a group of data into four subgroups
Q1: 25% of the data set is below the first quartile
Q2: 50% of the data set is below the second quartile
Q3: 75% of the data set is below the third quartile
(Figure: Q1, Q2, and Q3 dividing the data into four parts of 25% each)
77
Measures of Variability: Ungrouped Data
Measures of Variability: tools that describe the spread or the dispersion of a set of data
Provide more meaningful information when used with measures of central tendency
78
Measures of Variability: Ungrouped Data
Common Measures of Variability
Range
Interquartile Range
Mean Absolute Deviation
Variance
Standard Deviation
Z scores
Coefficient of Variation
79
Range
The difference between the largest and the smallest values in a set of data
Advantage: easy to compute
Disadvantage: is affected by extreme values
80
Interquartile Range
Interquartile Range: range of values between the first and third quartiles
Range of the “middle half”; middle 50%
Useful when researchers are interested in the middle 50%, and not the extremes
Used in the construction of box and whisker plots
81
Mean Absolute Deviation, Variance, and Standard Deviation
These measures are not meaningful unless the data are at least interval level data
One way for researchers to look at the spread of data is to subtract the mean from each data value
Subtracting the mean from each data value gives the deviation from the mean (X - µ)
82
Mean Absolute Deviation, Variance, and Standard Deviation
An examination of deviation from the mean can reveal information about the variability of the data Deviations are used mostly as a tool to compute other measures of variability The Sum of Deviation from the arithmetic mean is always zero Sum (X - µ) = 0
83
Mean Absolute Deviation, Variance, and Standard Deviation
An obvious way to force the sum of deviations to have a non zero total is to take the absolute value of each deviation around the mean Allows one to solve for the Mean Absolute Deviation
84
Mean Absolute Deviation (MAD)
Mean Absolute Deviation: average of the absolute deviations from the mean
X      X - µ   |X - µ|
5      -8      8
9      -4      4
16     +3      3
17     +4      4
18     +5      5
Total  0       24
µ = 65/5 = 13
MAD = ∑|X - µ|/N = 24/5 = 4.8
85
Population Variance
Variance: average of the squared deviations from the arithmetic mean
Population variance is denoted by σ²
Sum of Squared Deviations (SSD) about the mean of a set of values (called Sum of Squares of X) is used throughout the book
86
Population Variance
Variance = average of the squared deviations from the arithmetic mean
Population variance is denoted by σ²
X      X - µ   (X - µ)²
5      -8      64
9      -4      16
16     +3      9
17     +4      16
18     +5      25
Total  0       130
σ² = ∑(X - µ)²/N = 130/5 = 26
87
Sample Variance
Sample Variance: sum of the squared deviations from the sample mean, divided by n - 1
Sample variance is denoted by S²
X       X - X̄   (X - X̄)²
2,398   625     390,625
1,844   71      5,041
1,539   -234    54,756
1,311   -462    213,444
7,092   0       663,866
S² = ∑(X - X̄)²/(n - 1) = 663,866/3 = 221,288.67
88
Sample Standard Deviation
Sample Std Dev is the square root of the sample variance
For the sample 2,398, 1,844, 1,539, 1,311, the sum of squared deviations is 663,866
S = √(663,866/3) = √221,288.67 = 470.41
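The three measures differ only in what is averaged and in the divisor. A minimal sketch, using the two small data sets from these slides (5, 9, 16, 17, 18 treated as a population and 2,398, 1,844, 1,539, 1,311 treated as a sample); the function names are illustrative.

```python
import math


def mean_absolute_deviation(values):
    """Average of the absolute deviations from the mean."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / len(values)


def population_variance(values):
    """Average of the squared deviations from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


population = [5, 9, 16, 17, 18]
sample = [2398, 1844, 1539, 1311]

print(mean_absolute_deviation(population))     # MAD of the population data
print(population_variance(population))         # population variance
print(round(sample_variance(sample), 2))       # sample variance
print(round(math.sqrt(sample_variance(sample)), 2))  # sample standard deviation
```

Dividing by n - 1 rather than n is what distinguishes the sample variance from the population variance.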
89
Empirical Rule
Empirical Rule: used to state the approximate percentage of values that lie within a given number of standard deviations from the mean of a set of data if the data are normally distributed
Empirical rule is used only for three numbers of standard deviations: 1σ, 2σ, and 3σ
µ ± 1σ contains about 68% of the data; µ ± 2σ about 95%; and µ ± 3σ about 99.7%
90
Chebyshev’s Theorem Empirical rule – applies when data are approximately normally distributed Chebyshev’s Theorem – applies to all distributions, and can be used whenever the data distribution shape is unknown or non-normal
91
Chebyshev’s Theorem
Chebyshev’s Theorem: states that at least 1 – 1/k² of the values fall within ±k standard deviations of the mean regardless of the shape of the distribution, for k > 1
Example: At least 75% of all values are within ±2σ of the mean regardless of the shape of a distribution
When k = 2, 1 – 1/k² = 1 – 1/4 = .75
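The bound is easy to tabulate for several values of k. A minimal sketch, where `chebyshev_lower_bound` is an illustrative helper name:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the expression 1 - 1/k**2 is zero or negative,
        # so the theorem gives no useful information.
        raise ValueError("Chebyshev's theorem is informative only for k > 1")
    return 1 - 1 / k**2


for k in (1.5, 2, 2.5, 3):
    print(k, round(chebyshev_lower_bound(k), 4))
```

For normally distributed data the empirical rule gives much tighter percentages; Chebyshev's bound is the guarantee that holds for any shape.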
92
Demonstration Problem 3.6
The effectiveness of district attorneys can be measured by several variables, including the number of convictions per month, the number of cases handled per month, and the total number of years of conviction per month. A researcher uses a sample of five district attorneys in a city and determines the total number of years of conviction that each attorney won against defendants during the past month, as reported in the first column in the following tabulations. Compute the mean absolute deviation, the variance, and the standard deviation for these figures.
93
Demonstration Problem 3.6
Solution
The researcher computes the mean absolute deviation, the variance, and the standard deviation for these data in the following manner.
x          |x - x̄|   (x - x̄)²
55         41        1,681
100        4         16
125        29        841
140        44        1,936
60         36        1,296
∑x = 480   154       5,770
x̄ = 480/5 = 96
94
Demonstration Problem 3.6
The computational formulas can also be used to solve for s² and s, and the results compared.
s² = 5,770/4 = 1,442.5 and s = √1,442.5 = 37.98
MAD = 154/5 = 30.8
95
Z Scores
Z score: represents the number of standard deviations a value (x) is above or below the mean of a set of numbers when the data are normally distributed
A Z score allows translation of a value’s raw distance from the mean into units of standard deviation
Z = (x - µ)/σ
96
Z Scores If Z is negative, the raw value (x) is below the mean
If Z is positive, the raw value (x) is above the mean
For normally distributed data:
Between Z = ±1 are approximately 68% of the values
Between Z = ±2 are approximately 95% of the values
Between Z = ±3 are approximately 99.7% of the values
97
Coefficient of Variation
Coefficient of Variation (CV): ratio of the standard deviation to the mean, expressed as a percentage
Useful when comparing standard deviations computed from data with different means
Measurement of relative dispersion
98
Coefficient of Variation
CV = (σ/µ)(100)
99
Measures of Central Tendency and Variability: Grouped Data
Measures of Central Tendency
Mean
Median
Mode
Measures of Variability
Variance
Standard Deviation
100
Measures of Central Tendency and Variability: Grouped Data
Mean – The midpoint of each class interval is used to represent all the values in a class interval Midpoint is weighted by the frequency of values in the class interval Mean is computed by summing the products of class midpoint, and the class frequency for each class and dividing that sum by the total number of frequencies
101
Measures of Central Tendency and Variability: Grouped Data
Median – The middle value in an ordered array of numbers Mode – the mode for grouped data is the class midpoint of the modal class The modal class is class interval with the greatest frequency
102
Calculation of Grouped Mean
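Class Interval  Frequency (f)  Class Midpoint (M)  fM
20-under 30     6              25                  150
30-under 40     18             35                  630
40-under 50     11             45                  495
50-under 60     11             55                  605
60-under 70     3              65                  195
70-under 80     1              75                  75
Total           50                                 2150
µ = ∑fM/∑f = 2150/50 = 43.0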
103
Median of Grouped Data - Example
Class Interval  Frequency  Cumulative Frequency
20-under 30     6          6
30-under 40     18         24
40-under 50     11         35
50-under 60     11         46
60-under 70     3          49
70-under 80     1          50
N = 50
104
Mode of Grouped Data
Midpoint of the modal class
Modal class has the greatest frequency
Class Interval  Frequency
20-under 30     6
30-under 40     18
40-under 50     11
50-under 60     11
60-under 70     3
70-under 80     1
The modal class is 30-under 40, so the mode is its midpoint, 35.
105
Variance and Standard Deviation of Grouped Data
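Population: σ² = ∑f(M - µ)²/N and σ = √σ²
Sample: s² = ∑f(M - x̄)²/(n - 1) and s = √s²
Here f is the class frequency and M is the class midpoint.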
106
Population Variance and Standard Deviation of Grouped Data
Class Interval  f    M    fM    M - µ   (M - µ)²   f(M - µ)²
20-under 30     6    25   150   -18     324        1944
30-under 40     18   35   630   -8      64         1152
40-under 50     11   45   495   2       4          44
50-under 60     11   55   605   12      144        1584
60-under 70     3    65   195   22      484        1452
70-under 80     1    75   75    32      1024       1024
Total           50        2150                     7200
µ = 2150/50 = 43
σ² = 7200/50 = 144
σ = √144 = 12
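The grouped calculations treat every value in a class as if it sat at the class midpoint. A minimal sketch of the population versions, assuming the six class intervals and frequencies from the table above; `grouped_population_stats` is an illustrative name.

```python
import math

# Class intervals (lower bound, upper bound) and frequencies from the table.
classes = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
frequencies = [6, 18, 11, 11, 3, 1]


def grouped_population_stats(classes, frequencies):
    """Return (mean, variance, standard deviation) for grouped population data."""
    midpoints = [(low + high) / 2 for low, high in classes]
    n = sum(frequencies)
    # Each midpoint is weighted by the frequency of its class.
    mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
    variance = sum(f * (m - mean) ** 2 for f, m in zip(frequencies, midpoints)) / n
    return mean, variance, math.sqrt(variance)


mean, variance, std_dev = grouped_population_stats(classes, frequencies)
print(mean, variance, std_dev)
```

For sample data the only change would be dividing the weighted sum of squared deviations by n - 1 instead of N.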
107
Measures of Shape
Symmetrical: the right half is a mirror image of the left half
Skewness: shows that the distribution lacks symmetry; used to denote that the data are sparse at one end and piled up at the other end
Absence of symmetry
Extreme values on one side of a distribution
108
Coefficient of Skewness
Coefficient of Skewness (Sk): compares the mean and median in light of the magnitude of the standard deviation
Sk = 3(µ - Md)/σ
Md is the median; Sk is the coefficient of skewness; σ is the standard deviation
109
Coefficient of Skewness
Summary measure for skewness
If Sk < 0, the distribution is negatively skewed (skewed to the left).
If Sk = 0, the distribution is symmetric (not skewed).
If Sk > 0, the distribution is positively skewed (skewed to the right).
110
Business Statistics, 6th ed. by Ken Black
Chapter 4 Probability
111
Learning Objectives
Comprehend the different ways of assigning probability.
Understand and apply marginal, union, joint, and conditional probabilities.
Select the appropriate law of probability to use in solving problems.
Solve problems using the laws of probability, including the laws of addition, multiplication, and conditional probability.
Revise probabilities using Bayes’ rule.
112
Probability
Probabilities of occurrence are assigned in the inferential process under conditions of uncertainty
113
Methods of Assigning Probabilities
Classical method of assigning probability (rules and laws)
Relative frequency of occurrence (cumulated historical data)
Subjective probability (personal intuition or reasoning)
114
Classical Probability
115
Classical Probability
Number of outcomes leading to the event divided by the total number of outcomes possible
Each outcome is equally likely
Determined a priori -- before performing the experiment
Applicable to games of chance
Objective -- everyone correctly using the method assigns an identical probability
116
Relative Frequency Probability
Relative Frequency of Occurrence method – the probability of an event is equal to the number of times the event has occurred in the past divided by the total number of opportunities for the event to have occurred Frequency of occurrence is based on what has happened in the past
117
Relative Frequency Probability
Based on historical data
Computed after performing the experiment
Number of times an event occurred divided by the number of trials
Objective -- everyone correctly using the method assigns an identical probability
118
Subjective Probability
Subjective Probability: comes from a person’s intuition or reasoning
Subjective -- different individuals may (correctly or incorrectly) assign different numeric probabilities to the same event
Degree of belief in the results of the event
Useful for unique (single-trial) experiments:
New product introduction
Site selection decisions
Sporting events
119
Structure of Probability
Experiment: a process that produces an outcome
Event: an outcome of an experiment
Elementary event: an event that cannot be decomposed or broken down into other events
Sample Space: a complete roster or listing of all elementary events for an experiment
Trial: one repetition of the process
120
Structure of Probability
Set Notation: the use of braces to group members
The UNION of x, y is formed by combining elements from both sets, and is denoted by x ∪ y. Read as “x or y.”
An INTERSECTION is denoted x ∩ y. The symbol ∩ is read as “and”: “x and y.”
121
Structure of Probability
Mutually Exclusive Events: events such that the occurrence of one precludes the occurrence of the other
These events have no intersection
Independent Events: the occurrence or nonoccurrence of one has no effect on the occurrence of the others
122
Structure of Probability
Collectively Exhaustive Events – listing of all possible elementary events for an experiment Complementary Events – two events, one of which comprises all the elementary events of an experiment that are not in the other event
123
Sample Space
The set of all elementary events for an experiment
Methods for describing a sample space:
roster or listing
tree diagram
set builder notation
Venn diagram
124
Sample Space: Roster Example
Experiment: randomly select, without replacement, two families from the residents of Tiny Town
Each ordered pair in the sample space is an elementary event, for example (D,C)
Is (A,B) the same as (B,A)?
The residents are tabulated by Family (A, B, C, D), Children in Household (Yes or No), and Number of Automobiles (3, 2, or 1)
Listing of Sample Space
(A,B), (A,C), (A,D), (B,A), (B,C), (B,D), (C,A), (C,B), (C,D), (D,A), (D,B), (D,C)
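The roster can be generated rather than written out by hand. A minimal sketch, assuming the four families A through D and treating each draw order as a distinct elementary event, as the listing above does:

```python
from itertools import combinations, permutations

families = ["A", "B", "C", "D"]

# Ordered pairs drawn without replacement: (A, B) and (B, A) are different events.
sample_space = list(permutations(families, 2))
print(len(sample_space), sample_space)

# If order is ignored, (A, B) and (B, A) collapse into one unordered pair.
unordered_pairs = list(combinations(families, 2))
print(len(unordered_pairs), unordered_pairs)
```

Whether (A,B) is the same as (B,A) therefore depends on whether the experiment records the order of selection.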
125
Sample Space: Set Notation for Random Sample of Two Families
S = {(x,y) | x is the family selected on the first draw, and y is the family selected on the second draw}
Concise description of large sample spaces
126
Union of Sets
The union of two sets contains an instance of each element of the two sets.
(Venn diagram of sets X and Y)
127
Intersection of Sets
The intersection of two sets contains only those elements common to the two sets.
(Venn diagram of sets X and Y)
C = {IBM, DEC, Apple}
F = {Apple, Grape, Lime}
C ∩ F = {Apple}
128
Mutually Exclusive Events
Events with no common outcomes
Occurrence of one event precludes the occurrence of the other event
(Venn diagram of non-overlapping events X and Y)
129
Independent Events
Occurrence of one event does not affect the occurrence or nonoccurrence of the other event
The conditional probability of X given Y is equal to the marginal probability of X.
The conditional probability of Y given X is equal to the marginal probability of Y.
130
Collectively Exhaustive Events
Contains all elementary events for an experiment
(Figure: sample space with three collectively exhaustive events E1, E2, and E3)
131
Complementary Events
All elementary events not in the event ‘A’ are in its complementary event.
(Figure: sample space containing event A and its complement)
132
Counting the Possibilities
mn Rule
Sampling from a Population with Replacement
Combinations: Sampling from a Population without Replacement
133
mn Rule
If an operation can be done m ways and a second operation can be done n ways, then there are mn ways for the two operations to occur in order.
A cafeteria offers 5 salads, 4 meats, 8 vegetables, 3 breads, 4 desserts, and 3 drinks. A meal is two servings of vegetables, which may be identical, and one serving of each other item. How many meals are available?
5 * 4 * 8 * 3 * 4 * 3 = 5760 (Mistake?) This product counts only one vegetable serving. Does order matter on the vegetable choice?
Counting coin and die tosses?
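The question raised on the slide can be explored by counting the meals under different readings of the vegetable choice. This is a sketch under stated assumptions: it assumes one serving of every non-vegetable item, and it uses the standard count of unordered selections with repetition, C(n + 2 - 1, 2), which is not given in the slides.

```python
from math import comb, prod

# Number of choices for each item that is served once.
single_servings = {"salads": 5, "meats": 4, "breads": 3, "desserts": 4, "drinks": 3}
vegetables = 8

base = prod(single_servings.values())  # ways to pick the non-vegetable items

# Reading 1: only one vegetable serving (the product shown on the slide).
one_vegetable = base * vegetables

# Reading 2: two servings, order matters, repeats allowed.
two_ordered = base * vegetables ** 2

# Reading 3: two servings, order ignored, repeats allowed.
two_unordered = base * comb(vegetables + 2 - 1, 2)

print(one_vegetable, two_ordered, two_unordered)
```

The three counts differ substantially, which is why the order question has to be settled before applying the mn rule.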
134
Combinations: Sampling from a Population without Replacement
This counting method uses combinations Selecting n items from a population of N without replacement
135
Combinations
Combinations: sampling n items from a population of size N without replacement gives NCn = N!/(n!(N - n)!) possible samples
A tray contains 1,000 individual tax returns. If 3 returns are randomly selected without replacement from the tray, how many possible samples are there? Does order matter?
136
Combinations: Sampling from a Population without Replacement
For example, suppose a small law firm has 16 employees and three are to be selected randomly to represent the company at the annual meeting of the American Bar Association. How many different combinations of lawyers could be sent to the meeting?
Answer: NCn = 16C3 = 16!/(3!13!) = 560.
Suppose instead the representatives are chosen such that the order is President, Vice-President, and Alternate. How many selections are possible then?
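Both counting questions can be checked with Python's standard `math` module. A minimal sketch; the ordered count uses permutations, N!/(N - n)!, which is the standard formula when each selected person fills a distinct role.

```python
from math import comb, perm

# Unordered selection: 3 representatives from 16 employees.
law_firm_combinations = comb(16, 3)

# Ordered selection: President, Vice-President, and Alternate are distinct roles.
law_firm_ordered = perm(16, 3)

# Unordered samples of 3 tax returns from a tray of 1,000.
tax_return_samples = comb(1000, 3)

print(law_firm_combinations, law_firm_ordered, tax_return_samples)
```

Each unordered group of three can be arranged in 3! = 6 orders, so the ordered count is six times the number of combinations.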
137
Four Types of Probability
Marginal: the probability of X occurring
Union: the probability of X or Y occurring
Joint: the probability of X and Y occurring
Conditional: the probability of X occurring given that Y has occurred
138
General Law of Addition
P(X ∪ Y) = P(X) + P(Y) - P(X ∩ Y)
(Venn diagram of overlapping events X and Y)
139
General Law of Addition -- Example
P(N ∪ S) = P(N) + P(S) - P(N ∩ S)
P(N) = .70, P(S) = .67, P(N ∩ S) = .56
P(N ∪ S) = .70 + .67 - .56 = .81
140
Office Design Problem Probability Matrix
                   Increase Storage Space
Noise Reduction    Yes    No     Total
Yes                .56    .14    .70
No                 .11    .19    .30
Total              .67    .33    1.00
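The general law of addition can be applied directly to the cells of the probability matrix. A minimal sketch, assuming the four joint probabilities .56, .14, .11, and .19 from the office design matrix; the dictionary layout is illustrative.

```python
# Joint probabilities keyed by (noise reduction, increase storage space).
joint = {
    ("yes", "yes"): 0.56,
    ("yes", "no"): 0.14,
    ("no", "yes"): 0.11,
    ("no", "no"): 0.19,
}

# Marginal probabilities are row and column totals of the matrix.
p_noise = sum(p for (noise, _), p in joint.items() if noise == "yes")
p_storage = sum(p for (_, storage), p in joint.items() if storage == "yes")

# General law of addition: P(N or S) = P(N) + P(S) - P(N and S).
p_union = p_noise + p_storage - joint[("yes", "yes")]

print(round(p_noise, 2), round(p_storage, 2), round(p_union, 2))
```

The same union probability can be read off as 1 minus the cell where neither event occurs, which is a useful cross-check.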
141
Demonstration problem 4.3
If a worker is randomly selected from the company described in Demonstration Problem 4.1, what is the probability that the worker is either technical or clerical? What is the probability that the worker is either a professional or a clerical?
142
Demonstration problem 4.3
Examine the raw value matrix of the company’s human resources data shown in Demonstration Problem 4.1. In many raw value and probability matrices like this one, the rows are non-overlapping or mutually exclusive, as are the columns. In this matrix, a worker can be classified as being in only one type of position and as either male or female but not both. Thus, the categories of type of position are mutually exclusive, as are the categories of sex, and the special law of addition can be applied to the human resource data to determine the union probabilities.
143
Demonstration problem 4.3
Let T denote technical, C denote clerical, and P denote professional. The probability that a worker is either technical or clerical is P(T ∪ C) = P(T) + P(C) = 69/155 + 31/155 = 100/155 = .645. The probability that a worker is either professional or clerical is P(P ∪ C) = P(P) + P(C) = 44/155 + 31/155 = 75/155 = .484.
144
Demonstration Problem 4.3
                        Gender
Type of Position     Male    Female    Total
Managerial              8         3       11
Professional           31        13       44
Technical              52        17       69
Clerical                9        22       31
Total                 100        55      155
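The special law of addition for this matrix can be sketched as follows. The clerical total of 31 is taken as 9 + 22 from the clerical row; the helper name p_union is illustrative, not from the text.

```python
from fractions import Fraction

# Row totals from the raw value matrix (155 workers in all).
totals = {"Managerial": 11, "Professional": 44, "Technical": 69, "Clerical": 31}
n_workers = sum(totals.values())

def p_union(*positions):
    """Special law of addition: position types are mutually exclusive,
    so the union probability is just the sum of the marginals."""
    return sum(Fraction(totals[p], n_workers) for p in positions)

p_tech_or_clerical = p_union("Technical", "Clerical")
p_prof_or_clerical = p_union("Professional", "Clerical")

print(round(float(p_tech_or_clerical), 3), round(float(p_prof_or_clerical), 3))
```

Exact fractions are used so that the results can be compared directly with 100/155 and 75/155.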
146
Law of Multiplication Demonstration Problem 4.5
P(X ∩ Y) = P(X) · P(Y | X) = P(Y) · P(X | Y)
147
Law of Multiplication
The intersection of two events is called the joint probability.
The general law of multiplication is used to find the joint probability, that is, the probability that both events X and Y will occur at the same time.
P(X | Y) is a conditional probability, read as the probability of X given Y.
148
Law of Multiplication If a probability matrix is constructed for a problem, the easiest way to solve for the joint probability is to find the appropriate cell in the matrix and select the answer
149
Law of Multiplication Demonstration Problem 4.5
Probability Matrix of Employees
                     Married
Supervisor           Yes       No        Total
Yes                  .1143     .1000     .2143
No                   .4571     .3286     .7857
Total                .5714     .4286     1.00
150
Law of Conditional Probability
Conditional probabilities are based on the prior knowledge you have about one of the two events being studied. If X and Y are two events, the conditional probability of X occurring given that Y is known or has occurred is expressed as P(X | Y).
151
Law of Conditional Probability
The conditional probability of X given Y is the joint probability of X and Y divided by the marginal probability of Y: P(X | Y) = P(X ∩ Y) / P(Y).
152
Law of Conditional Probability
70% of respondents believe noise reduction would improve productivity. 56% of respondents believe both noise reduction and increased storage space would improve productivity. A worker is selected randomly and asked about changes in the office design. What is the probability that a randomly selected person believes storage space would improve productivity, given that the person believes noise reduction improves productivity?
153
Law of Conditional Probability
P(S | N) = P(S ∩ N) / P(N) = .56/.70 = .80
154
Independent Events When X and Y are independent, the conditional probability is solved as a marginal probability: P(X | Y) = P(X) and P(Y | X) = P(Y).
155
Independent Events Demonstration Problem 4.10
                      Geographic Location
                      Northeast   Southeast   Midwest   West
                      D           E           F         G        Total
Finance A             .12         .05         .04       .07      .28
Manufacturing B       .15         .03         .11       .06      .35
Communications C      .14         .09         .06       .08      .37
Total                 .41         .17         .21       .21      1.00
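One way to test this matrix for independence is to compare a conditional probability with the corresponding marginal, or each joint cell with the product of its marginals. The sketch below does both; the rounding tolerance of 0.005 is an assumption chosen because the matrix is given to two decimals.

```python
# Joint probabilities: industry (A, B, C) by geographic location (D, E, F, G).
joint = {
    "A": {"D": 0.12, "E": 0.05, "F": 0.04, "G": 0.07},
    "B": {"D": 0.15, "E": 0.03, "F": 0.11, "G": 0.06},
    "C": {"D": 0.14, "E": 0.09, "F": 0.06, "G": 0.08},
}

# Marginal probabilities are the row and column totals.
row = {i: sum(cells.values()) for i, cells in joint.items()}
col = {g: sum(joint[i][g] for i in joint) for g in "DEFG"}

def conditional(industry, region):
    """P(industry | region) = joint / marginal of the given event."""
    return joint[industry][region] / col[region]

# Independence would require every joint cell to equal the product
# of its marginals (up to rounding in the published matrix).
independent = all(
    abs(joint[i][g] - row[i] * col[g]) < 0.005 for i in joint for g in "DEFG"
)

print(round(conditional("A", "G"), 2), round(row["A"], 2), independent)
```

Here P(A | G) is about .33 while P(A) is .28, so industry type and geographic location are not independent in this matrix.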
156
Revision of Probabilities: Bayes’ Rule
An extension to the conditional law of probabilities. Enables revision of original probabilities with new information.
157
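Bayes’ Rule Bayes’ rule – extends the use of the law of conditional probabilities to allow revision of original probabilities with new information: P(Xi | Y) = P(Y | Xi) · P(Xi) / [P(Y | X1) · P(X1) + P(Y | X2) · P(X2) + . . . + P(Y | Xn) · P(Xn)]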
158
Bayes’ Rule Note that the numerator of Bayes’ rule and of the law of conditional probability are the same. The denominator is a collectively exhaustive listing of mutually exclusive outcomes of Y. The denominator is a weighted average of the conditional probabilities, with the weights being the prior probabilities of the corresponding events.
159
Revision of Probabilities with Bayes’ Rule: Ribbon Problem
160
Revision of Probabilities with Bayes’ Rule: Ribbon Problem
Event           Prior Probability   Conditional Probability   Joint Probability   Revised Probability
                P(Ei)               P(d | Ei)                 P(Ei ∩ d)           P(Ei | d)
Alamo           0.65                0.08                      0.052               0.052/0.094 = 0.553
South Jersey    0.35                0.12                      0.042               0.042/0.094 = 0.447
Total                                                         0.094
161
Revision of Probabilities with Bayes’ Rule: Ribbon Problem
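Alamo (0.65): Defective 0.08, Acceptable 0.92; joint with defective = 0.65 × 0.08 = 0.052
South Jersey (0.35): Defective 0.12, Acceptable 0.88; joint with defective = 0.35 × 0.12 = 0.042
P(defective) = 0.052 + 0.042 = 0.094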
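The revision in the ribbon problem can be reproduced with a few lines. This is a minimal sketch using the priors and conditional probabilities shown on the slides; the dictionary names are illustrative.

```python
# Prior probability that a ribbon came from each supplier.
priors = {"Alamo": 0.65, "South Jersey": 0.35}

# Conditional probability that a ribbon is defective, given the supplier.
p_defective_given = {"Alamo": 0.08, "South Jersey": 0.12}

# Joint probabilities: prior times conditional (numerators of Bayes' rule).
joint = {s: priors[s] * p_defective_given[s] for s in priors}

# Denominator of Bayes' rule: total probability of a defective ribbon.
p_defective = sum(joint.values())

# Revised (posterior) probabilities given that the ribbon is defective.
revised = {s: joint[s] / p_defective for s in priors}

for supplier, p in revised.items():
    print(supplier, round(p, 3))
```

The revised probabilities still sum to 1; the new information shifts weight toward South Jersey because its defect rate is higher.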
162
Business Statistics, 6th ed. by Ken Black
Chapter 5 Discrete Distributions
163
Learning Objectives Distinguish between discrete random variables and continuous random variables. Know how to determine the mean and variance of a discrete distribution. Identify the type of statistical experiments that can be described by the binomial distribution, and know how to work such problems.
164
Discrete vs. Continuous Distributions
Discrete distributions – constructed from discrete (individually distinct) random variables
Continuous distributions – based on continuous random variables
Random variable – a variable which contains the outcomes of a chance experiment
165
Discrete vs. Continuous Distributions
Categories of Random Variables
Discrete random variable – the set of all possible values is finite or countably infinite
Continuous random variable – takes on values at every point over a given interval
166
Describing a Discrete Distribution
A discrete distribution can be described by constructing a graph of the distribution. Measures of central tendency and variability can be applied to discrete distributions. In these computations, the discrete outcome values represent themselves.
167
Describing a Discrete Distribution
The mean of a discrete distribution is the long-run average. If the process is repeated long enough, the average of the outcomes will approach the long-run average (mean). The long-run average is the result of many repetitions of the process. Mean of a discrete distribution: µ = ∑ (X * P(X)), where µ is the long-run average, X = outcome, and P(X) = probability of X.
168
Describing a Discrete Distribution
The variance and standard deviation of a discrete distribution are solved by using the outcomes (X) and the probabilities of the outcomes (P(X)) in a manner similar to computing a mean: σ² = ∑ ((X – µ)² * P(X)). The standard deviation is computed by taking the square root of the variance.
169
Some Special Distributions
Discrete: binomial, Poisson, hypergeometric
Continuous: normal, uniform, exponential, t, chi-square, F
170
Discrete Distribution -- Example
Observe the discrete distribution in the following table. An executive is considering out-of-town business travel for a given Friday. At least one crisis could occur on the day that the executive is gone. The distribution contains the number of crises that could occur during the day the executive is gone and the probability that each number will occur. For example, there is a .37 probability that no crisis will occur, a .31 probability of one crisis, and so on.
171
Discrete Distribution -- Example
Distribution of Daily Crises
Number of Crises    Probability
0                   0.37
1                   0.31
2                   0.18
3                   0.09
4                   0.04
5                   0.01
(Chart: probability plotted against number of crises.)
172
Variance and Standard Deviation of a Discrete Distribution
X      P(X)    (X – µ)    (X – µ)²    (X – µ)² · P(X)
-1     .1      -2         4           .4
0      .2      -1         1           .2
1      .4      0          0           .0
2      .2      1          1           .2
3      .1      2          4           .4
                                      1.2
σ² = 1.2, σ = √1.2 = 1.10
173
Requirements for a Discrete Probability Function -- Examples
X        P(X)     P(X)     P(X)
-1       .1       -.1      .1
0        .2       .3       .3
1        .4       .4       .4
2        .2       .3       .3
3        .1       .1       .1
Total    1.0      1.0      1.2
Valid?   YES      NO       NO
174
Mean of a Discrete Distribution
X      P(X)    X · P(X)
-1     .1      -.1
0      .2      .0
1      .4      .4
2      .2      .4
3      .1      .3
               1.0
µ = ∑ (X · P(X)) = 1.0
175
Mean of the Crises Data Example
X      P(X)    X · P(X)
0      .37     .00
1      .31     .31
2      .18     .36
3      .09     .27
4      .04     .16
5      .01     .05
               1.15
µ = 1.15
176
Variance and Standard Deviation of Crises Data Example
X      P(X)    (X – µ)    (X – µ)²    (X – µ)² · P(X)
0      .37     -1.15      1.32        .49
1      .31     -0.15      0.02        .01
2      .18     0.85       0.72        .13
3      .09     1.85       3.42        .31
4      .04     2.85       8.12        .32
5      .01     3.85       14.82       .15
                                      1.41
σ² = 1.41, σ = √1.41 = 1.19
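The mean and variance of the crises distribution can be recomputed directly from the outcomes and their probabilities. This is a minimal sketch; note that the unrounded variance is 1.4075, which the slide reports as 1.41.

```python
from math import sqrt

# Number of crises and the probability of each outcome.
crises = [0, 1, 2, 3, 4, 5]
probs = [0.37, 0.31, 0.18, 0.09, 0.04, 0.01]

# Mean: sum of outcome times probability.
mu = sum(x * p for x, p in zip(crises, probs))

# Variance: probability-weighted squared deviations from the mean.
variance = sum((x - mu) ** 2 * p for x, p in zip(crises, probs))

# Standard deviation: square root of the variance.
sigma = sqrt(variance)

print(round(mu, 2), round(variance, 2), round(sigma, 2))
```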
177
Binomial Distribution
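Probability function: P(x) = nCx · p^x · q^(n – x) = [n!/(x!(n – x)!)] · p^x · q^(n – x), for 0 ≤ x ≤ n
Mean value: µ = np
Variance and standard deviation: σ² = npq, σ = √(npq)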
178
Binomial Distribution: Demonstration Problem 5.3
According to the U.S. Census Bureau, approximately 6% of all workers in Jackson, Mississippi, are unemployed. In conducting a random telephone survey in Jackson, what is the probability of getting two or fewer unemployed workers in a sample of 20?
179
Binomial Distribution: Demonstration Problem 5.3
In the following example:
6% are unemployed => p = .06
The sample size is 20 => n = 20
94% are employed => q = .94
x is the number of successes desired
What is the probability of getting 2 or fewer unemployed workers in the sample of 20?
The hard part of this problem is identifying p, n, and x – emphasize this when studying the problems.
180
Binomial Distribution: Demonstration Problem 5.3
P(x ≤ 2) = P(x = 0) + P(x = 1) + P(x = 2) = .2901 + .3703 + .2246 = .8850
181
Binomial Distribution Table: Demonstration Problem 5.3
n = 20                PROBABILITY
X         0.05       0.06       0.07
0         0.3585     0.2901     0.2342
1         0.3774     0.3703     0.3526
2         0.1887     0.2246     0.2521
3         0.0596     0.0860     0.1139
4         0.0133     0.0233     0.0364
5         0.0022     0.0048     0.0088
6         0.0003     0.0008     0.0017
7         0.0000     0.0001     0.0002
8 … 20
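The table values for p = 0.06 can be reproduced from the binomial formula rather than looked up. The sketch below is one way to do it; the function name binomial_pmf is illustrative.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial experiment with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 20, 0.06  # sample size and probability that a worker is unemployed

# P(X <= 2): sum the exact probabilities for x = 0, 1, 2.
p_at_most_2 = sum(binomial_pmf(x, n, p) for x in range(3))

# Mean and standard deviation of the binomial distribution.
mean = n * p
std_dev = sqrt(n * p * (1 - p))

print([round(binomial_pmf(x, n, p), 4) for x in range(3)], round(p_at_most_2, 4))
```

The same function with p = 0.05 or 0.07 reproduces the neighbouring columns of the table.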
182
Excel’s Binomial Function
n = 20, p = 0.06
X    P(X)
0    =BINOMDIST(A5,B$1,B$2,FALSE)
1    =BINOMDIST(A6,B$1,B$2,FALSE)
2    =BINOMDIST(A7,B$1,B$2,FALSE)
3    =BINOMDIST(A8,B$1,B$2,FALSE)
4    =BINOMDIST(A9,B$1,B$2,FALSE)
5    =BINOMDIST(A10,B$1,B$2,FALSE)
6    =BINOMDIST(A11,B$1,B$2,FALSE)
7    =BINOMDIST(A12,B$1,B$2,FALSE)
8    =BINOMDIST(A13,B$1,B$2,FALSE)
9    =BINOMDIST(A14,B$1,B$2,FALSE)
183
Minitab’s Binomial Function
X P(X =x) Binomial with n = 23 and p = 0.64
184
Mean and Std Dev of Binomial Distribution
The binomial distribution has an expected value, or long-run average, denoted by µ (mu). If n items are sampled over and over for a long time and p is the probability of success in one trial, the long-run average number of successes per sample is expected to be np. => Mean µ = np => Std Dev σ = √(npq)
185
Poisson Distribution
The Poisson distribution focuses only on the number of discrete occurrences over some interval or continuum.
Poisson does not have a given number of trials (n) as a binomial experiment does.
Occurrences are independent of other occurrences.
Occurrences occur over an interval.
186
Poisson Distribution
If a Poisson distribution is studied over a long period of time, a long-run average can be determined.
The average is denoted by lambda (λ).
Each Poisson problem contains a lambda value from which the probabilities are determined.
A Poisson distribution can be described by λ alone.
187
Poisson Distribution
Probability function: P(x) = λ^x · e^(–λ) / x!, for x = 0, 1, 2, . . .
Mean value: λ
Variance: λ
Standard deviation: √λ
188
Poisson Distribution: Demonstration Problem 5.7
Bank customers arrive randomly on weekday afternoons at an average of 3.2 customers every 4 minutes. What is the probability of having more than 7 customers in a 4-minute interval on a weekday afternoon?
189
Poisson Distribution: Demonstration Problem 5.7
Solution λ = 3.2 customers/4 minutes; x > 7 customers/4 minutes. The solution requires obtaining the probabilities for x = 8, 9, 10, 11, 12, 13, 14, . . . Each x value is determined until the values are so far away from λ = 3.2 that the probabilities approach zero. The exact probabilities are then summed to find P(x > 7). If the bank has been averaging 3.2 customers every 4 minutes on weekday afternoons, it is unlikely that more than 7 people would randomly arrive in any one 4-minute period. This answer indicates that more than 7 people would randomly arrive in a 4-minute period only 1.69% of the time. Bank officers could use these results to help them make staffing decisions.
190
Poisson Distribution: Demonstration Problem 5.7
P(x > 7 | λ = 3.2) = .0169
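Instead of summing the upper tail term by term, the same probability can be obtained as the complement of P(x ≤ 7). This is a sketch under that approach; the function name poisson_pmf is illustrative. The unrounded result is about .0168, which agrees with the 1.69% figure up to the rounding of the table values.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with long-run average lam."""
    return lam**x * exp(-lam) / factorial(x)

lam = 3.2  # average customers per 4-minute interval

# P(X > 7) = 1 - P(X <= 7); the complement avoids an open-ended sum.
p_more_than_7 = 1 - sum(poisson_pmf(x, lam) for x in range(8))

print(round(p_more_than_7, 4))
```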
192
Poisson Distribution: Using the Poisson Tables
                   λ
X        0.5       1.5       1.6       3.0
0        0.6065    0.2231    0.2019    0.0498
1        0.3033    0.3347    0.3230    0.1494
2        0.0758    0.2510    0.2584    0.2240
3        0.0126    0.1255    0.1378    0.2240
4        0.0016    0.0471    0.0551    0.1680
5        0.0002    0.0141    0.0176    0.1008
6        0.0000    0.0035    0.0047    0.0504
7        0.0000    0.0008    0.0011    0.0216
8        0.0000    0.0001    0.0002    0.0081
9        0.0000    0.0000    0.0000    0.0027
10       0.0000    0.0000    0.0000    0.0008
11       0.0000    0.0000    0.0000    0.0002
12       0.0000    0.0000    0.0000    0.0001
193
Excel’s Poisson Function
λ = 1.6
X    P(X)
0    =POISSON(D5,E$1,FALSE)
1    =POISSON(D6,E$1,FALSE)
2    =POISSON(D7,E$1,FALSE)
3    =POISSON(D8,E$1,FALSE)
4    =POISSON(D9,E$1,FALSE)
5    =POISSON(D10,E$1,FALSE)
6    =POISSON(D11,E$1,FALSE)
7    =POISSON(D12,E$1,FALSE)
8    =POISSON(D13,E$1,FALSE)
9    =POISSON(D14,E$1,FALSE)
194
Minitab’s Poisson Function
X P(X =x) Poisson with mean = 1.9
195
Mean and Std Dev of a Poisson Distribution
The mean of a Poisson distribution is λ. Understanding the mean of a Poisson distribution gives a feel for the actual occurrences that are likely to happen. The variance of a Poisson distribution is also λ. Std Dev = √λ
196
Poisson Approximation of the Binomial Distribution
Binomial problems with large sample sizes and small values of p, which then generate rare events, are potential candidates for use of the Poisson distribution. Rule of thumb: if n > 20 and np < 7, the approximation is close enough to use the Poisson distribution for binomial problems.
197
Poisson Approximation of the Binomial Distribution
Procedure for approximating the binomial with the Poisson:
Begin by computing the mean of the binomial distribution, µ = np.
Because µ is the expected value of the binomial, it becomes λ for the Poisson distribution.
Use µ as λ, together with the x from the binomial problem, to approximate the probabilities from the Poisson table or the Poisson formula.
198
Poisson Approximation of the Binomial Distribution
Binomial probabilities are difficult to calculate when n is large. Under certain conditions binomial probabilities may be approximated by Poisson probabilities. Poisson approximation: P(x) ≈ λ^x · e^(–λ) / x!, where λ = np
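To get a feel for how close the approximation is, the sketch below compares exact binomial probabilities with their Poisson approximations. The values n = 50 and p = 0.03 are hypothetical, chosen only because they satisfy the rule of thumb (n > 20 and np < 7); they are not from the text.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """Exact binomial probability of x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """Poisson probability of x occurrences with average lam."""
    return lam**x * exp(-lam) / factorial(x)

# Hypothetical binomial problem (illustrative values only).
n, p = 50, 0.03
lam = n * p  # the binomial mean becomes lambda

# Largest absolute gap between exact and approximate probabilities.
max_diff = max(
    abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(11)
)

for x in range(4):
    print(x, round(binomial_pmf(x, n, p), 4), round(poisson_pmf(x, lam), 4))
```

For these values the two columns differ by well under .01 at every x, which is the sense in which the approximation is "close enough".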
199
Hypergeometric Distribution
Sampling without replacement from a finite population
The number of objects in the population is denoted N.
Each trial has exactly two possible outcomes, success and failure.
Trials are not independent.
X is the number of successes in the n trials.
The binomial is an acceptable approximation if n < 5% of N. Otherwise it is not.
200
Hypergeometric Distribution
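Probability function: P(x) = (ACx · (N – A)C(n – x)) / NCn, where N is population size, n is sample size, A is number of successes in population, and x is number of successes in sample
Mean value: µ = A · n / N
Variance and standard deviation: σ² = A(N – A)n(N – n) / (N²(N – 1)), σ = √σ²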
201
Hypergeometric Distribution: Probability Computations
N = 24, A = 8, n = 5
x    P(x)
0    0.1028
1    0.3426
2    0.3689
3    0.1581
4    0.0264
5    0.0013
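These hypergeometric probabilities can be reproduced with combinations. The sketch assumes N = 24, A = 8, and n = 5, the parameters shown on the Excel and Minitab slides; the function name is illustrative.

```python
from math import comb

def hypergeometric_pmf(x, N, A, n):
    """P(X = x): x successes in a sample of n drawn without replacement
    from a population of N that contains A successes."""
    return comb(A, x) * comb(N - A, n - x) / comb(N, n)

N, A, n = 24, 8, 5
probabilities = [hypergeometric_pmf(x, N, A, n) for x in range(n + 1)]

for x, p in enumerate(probabilities):
    print(x, round(p, 4))
```

The probabilities sum to 1, matching the =SUM check used in the spreadsheet version.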
202
Excel’s Hypergeometric Function
N = 24, A = 8, n = 5
X    P(X)
0    =HYPGEOMDIST(A6,B$3,B$2,B$1)
1    =HYPGEOMDIST(A7,B$3,B$2,B$1)
2    =HYPGEOMDIST(A8,B$3,B$2,B$1)
3    =HYPGEOMDIST(A9,B$3,B$2,B$1)
4    =HYPGEOMDIST(A10,B$3,B$2,B$1)
5    =HYPGEOMDIST(A11,B$3,B$2,B$1)
     =SUM(B6:B11)
203
Minitab’s Hypergeometric Function
X P(X =x) Hypergeometric with N = 24, A = 8, n = 5
204
Business Statistics, 6th ed. by Ken Black
Chapter 6 Continuous Distributions Copyright 2010 John Wiley & Sons, Inc.
205
Learning Objectives Understand concepts of the uniform distribution.
Appreciate the importance of the normal distribution. Recognize normal distribution problems, and know how to solve them. Decide when to use the normal distribution to approximate binomial distribution problems, and know how to work them. Decide when to use the exponential distribution to solve problems in business, and know how to work them.
206
Continuous Distributions
Continuous distributions are constructed from continuous random variables in which values are taken for every point over a given interval. With continuous distributions, probabilities of outcomes occurring between particular points are determined by calculating the area under the curve between these points.
207
Uniform Distribution The uniform distribution is a relatively simple continuous distribution in which the same height, f(x), is obtained over a range of values.
208
Uniform Distribution f(x) = 1/(b – a) for a ≤ x ≤ b, and 0 for all other values; the area under the curve between a and b is 1.
209
Uniform Distribution Mean and standard deviation of a uniform distribution => Mean μ = (a + b)/2 => Std Dev σ = (b – a)/√12
210
Uniform Distribution Mean and Standard Deviation
211
Uniform Distribution of Lot Weights
Area = 1
212
Uniform Distribution Probability
With discrete distributions, the probability function yields the value of the probability. For continuous distributions, probabilities are calculated by determining the area over an interval of the function.
213
Demonstration Problem 6.1
Suppose the amount of time it takes to assemble a plastic module ranges from 27 to 39 seconds and that assembly times are uniformly distributed. Describe the distribution. What is the probability that a given assembly will take between 30 and 35 seconds? Fewer than 30 seconds?
214
Demonstration Problem 6.1
Solution The height of the distribution is f(x) = 1/(39 – 27) = 1/12. The mean is μ = (a + b)/2 = (27 + 39)/2 = 33, and the standard deviation is σ = (b – a)/√12 = (39 – 27)/3.464 = 12/3.464 = 3.464. The mean time is 33 seconds with a standard deviation of 3.464 seconds.
215
Demonstration Problem 6.1
P(30 < x < 35) = (35 – 30)/(39 – 27) = 5/12 = .4167. There is a .4167 probability that it will take between 30 and 35 seconds to assemble the module. P(x < 30) = (30 – 27)/(39 – 27) = 3/12 = .2500. There is a .2500 probability that it will take less than 30 seconds to assemble the module. Because there is no area less than 27 seconds, P(x < 30) is determined by using only the interval 27 ≤ x ≤ 30. In a continuous distribution, there is no area at any one point (only over an interval). Thus the probability of x < 30 is the same as the probability of x ≤ 30.
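The uniform distribution calculations in this demonstration problem can be sketched as follows. The helper uniform_prob is illustrative; it clips the requested interval to [a, b] because the density is zero outside that range.

```python
from math import sqrt

a, b = 27.0, 39.0  # assembly time ranges from 27 to 39 seconds

height = 1 / (b - a)         # f(x), constant over [a, b]
mean = (a + b) / 2           # center of the interval
std_dev = (b - a) / sqrt(12)

def uniform_prob(lo, hi):
    """P(lo < x < hi): area of the rectangle over the part of
    (lo, hi) that lies inside [a, b]."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) * height

p_30_to_35 = uniform_prob(30, 35)
p_under_30 = uniform_prob(0, 30)  # nothing lies below 27, so only 27..30 counts

print(mean, round(std_dev, 3), round(p_30_to_35, 4), round(p_under_30, 4))
```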
216
Properties of the Normal Distribution
Characteristics of the normal distribution:
Continuous distribution - the line does not break
Symmetrical distribution - each half is a mirror image of the other half
Asymptotic to the horizontal axis - the curve does not touch the x axis and goes on forever
Unimodal - the values mound up in only one portion of the graph
Area under the curve = 1; the total of all probabilities = 1
217
Probability Density Function of the Normal Distribution
The normal distribution is characterized by the mean and the Std Dev. Each pair of values of μ and σ produces a normal distribution with density f(x) = [1/(σ√(2π))] · e^(–(1/2)((x – μ)/σ)²).
218
Probability Density Function of the Normal Distribution
219
Standardized Normal Distribution
The conversion formula for any x value of a given normal distribution is z = (x – μ)/σ. The result is called the z-score. A z-score gives the number of standard deviations that a value, x, is above or below the mean.
220
Standardized Normal Distribution
Every unique pair of μ and σ values defines a different normal distribution. Changes in μ or σ give a different distribution. Z distribution – mechanism by which normal distributions can be converted into a single distribution. Z formula => Z = (x – μ)/σ, where σ ≠ 0
221
Standardized Normal Distribution
The Z score is the number of Std Dev that a value, x, is above or below the mean. If the x value is less than the mean, the Z score is negative. If the x value is greater than the mean, the Z score is positive.
222
Standardized Normal Distribution - Continued
Z scores can be used to find probabilities for any normal curve problem that has been converted to Z scores. The Z distribution is a normal distribution with a mean of 0 and a Std Dev of 1.
223
Standardized Normal Distribution - Continued
Z distribution probability values are given in Table A5. Table A5 gives the total area under the Z curve between 0 and any point on the positive Z axis. Since the curve is symmetric, the area under the curve between Z and 0 is the same whether Z is positive or negative.
224
Standardized Normal Distribution - Continued
The standardized normal distribution is a normal distribution with a mean of zero and a standard deviation of one (μ = 0, σ = 1). The Z formula standardizes any normal distribution. The Z score, computed by the Z formula, is the number of standard deviations that a value is away from the mean.
225
Standardized Normal Distribution - Continued
If x is normally distributed with a mean of μ and a standard deviation of σ, then the z-score will also be normally distributed, with a mean of 0 and a standard deviation of 1. Tables have been generated for the standard normal distribution which enable you to determine probabilities for normal variables. The tables are set up to give the probabilities between z = 0 and some other z value, say z0, which is depicted on the next slide.
226
Z Table Second Decimal Place in Z
227
Table Lookup of a Standard Normal Probability
228
Applying the Z Formula Z 0.00 0.01 0.02 0.00 0.0000 0.0040 0.0080
230
Applying the Z Formula 0.5 + 0.2123 = 0.7123
231
Applying the Z Formula 0.5 – 0.4803 = 0.0197
232
Applying the Z Formula 0.4738+ 0.3554 = 0.8292
233
Normal Approximation of the Binomial Distribution
For certain types of binomial distributions, the normal distribution can be used to approximate the probabilities. At large sample sizes, binomial distributions approach the normal distribution in shape regardless of the value of p. The normal distribution is a good approximation for binomial distribution problems for large values of n.
234
Demonstration Problem 6.9
These types of problems can be solved quite easily with the appropriate technology. The output shows the MINITAB solution.
235
Normal Approximation of Binomial: Parameter Conversion
Conversion equations: μ = np and σ = √(npq)
236
Normal Approximation of Binomial: Interval Check
The normal approximation is acceptable when the interval μ ± 3σ lies between 0 and n.
237
Normal Approximation of Binomial: Correcting for Continuity
Values Being Determined    Correction
X >                        +.50
X ≥                        –.50
X <                        –.50
X ≤                        +.50
≤ X ≤                      –.50 and +.50
< X <                      +.50 and –.50
238
Normal Approximation of Binomial: Computations
X        P(X)
25       0.0167
26       0.0096
27       0.0052
28       0.0026
29       0.0012
30       0.0005
31       0.0002
32       0.0001
33       0.0000
Total    0.0361
239
Exponential Distribution
Continuous
Family of distributions
Skewed to the right
X varies from 0 to infinity
Apex is always at X = 0
Steadily decreases as X gets larger
Probability function: f(x) = λe^(–λx), for x ≥ 0 and λ > 0
240
Different Exponential Distributions
241
Exponential Distribution: Probability Computation
242
Business Statistics, 6th ed. by Ken Black
Chapter 7 Sampling and Sampling Distributions
243
Learning Objectives
Determine when to use sampling instead of a census. Distinguish between random and nonrandom sampling. Decide when and how to use various sampling techniques. Be aware of the different types of errors that can occur in a study. Understand the impact of the Central Limit Theorem on statistical analysis. Use the sampling distributions of x̄ and p̂.
244
Reasons for Sampling
Sampling – a means for gathering useful information about a population. Information is gathered from the sample, and conclusions are drawn.
Sampling has advantages over a census:
Sampling can save money.
Sampling can save time.
For the safety of the consumer.
245
Reasons for Taking a Census
Eliminate the possibility that a random sample is not representative of the population. The person authorizing the study is uncomfortable with sample information.
246
Random Versus Nonrandom Sampling
Nonrandom sampling - not every unit of the population has the same probability of being included in the sample. Random sampling - every unit of the population has the same probability of being included in the sample.
247
Random Sampling Techniques
Simple Random Sample – basis for other random sampling techniques. Each unit is numbered from 1 to N. A random number generator can be used to select n items from the population.
248
Random Sampling Techniques
Stratified Random Sample
  Proportionate (the % of the sample taken from each stratum is proportionate to the % that each stratum is within the whole population)
  Disproportionate (the % of the sample taken from each stratum is not proportionate to the % that each stratum is within the whole population)
Systematic Random Sample
Cluster (or Area) Sampling
249
Simple Random Sample: Sample Members
01 Alaska Airlines; 02 Alcoa; 03 Ashland; 04 Bank of America; 05 BellSouth; 06 Chevron; 07 Citigroup; 08 Clorox; 09 Delta Air Lines; 10 Disney; 11 DuPont; 12 Exxon Mobil; 13 General Dynamics; 14 General Electric; 15 General Mills; 16 Halliburton; 17 IBM; 18 Kellogg; 19 KMart; 20 Lowe’s; 21 Lucent; 22 Mattel; 23 Mead; 24 Microsoft; 25 Occidental Petroleum; 26 JCPenney; 27 Procter & Gamble; 28 Ryder; 29 Sears; 30 Time Warner
N = 30, n = 6
250
Simple Random Sampling: Random Number Table
N = 30, n = 6 (random number table not reproduced)
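A random number generator can stand in for the random number table. The sketch below draws n = 6 distinct frame numbers from 1 to N = 30; the fixed seed is an assumption added only so the draw is repeatable, and the frame is represented by its numbers rather than by company names.

```python
import random

N, n = 30, 6                 # population (frame) size and sample size
frame = range(1, N + 1)      # each unit is numbered from 1 to N

rng = random.Random(2010)    # fixed seed (arbitrary) for a repeatable draw

# random.sample draws without replacement, so no unit is picked twice
# and every unit has the same chance of being included.
selected = sorted(rng.sample(frame, n))

print(selected)
```

The selected numbers would then be matched back to the numbered list of companies.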
251
Stratified Random Sample
Stratified random sampling – the population is divided into non-overlapping subpopulations called strata. The researcher extracts a simple random sample from each subpopulation. Stratified random sampling has the potential for reducing error.
252
Stratified Random Sample
Sampling error – occurs when a sample does not represent the population. Stratified random sampling has the potential to match the sample closely to the population. Stratified sampling is more costly. Each stratum should be relatively homogeneous; strata are often based on variables such as race, gender, or religion.
253
Stratified Random Sample
Proportionate -- the percentage of the sample taken from each stratum is proportionate to the percentage that each stratum is within the population. Disproportionate -- proportions of the strata within the sample are different than the proportions of the strata within the population.
254
Systematic Sampling
k = N/n, where: n = sample size, N = population size, k = size of selection interval
Used because of its convenience and ease of administration. Population elements are an ordered sequence (at least, conceptually). With systematic sampling, every kth item is selected to produce a sample of size n from a population of size N.
255
Systematic Sampling The first sample element is selected randomly from the first k population elements. Thereafter, sample elements are selected at a constant interval, k, from the ordered sequence frame. Advantages of systematic sampling: the sample is evenly distributed across the frame, and it is easy to determine whether a sampling plan has been followed. Systematic sampling is based on the assumption that the source of the population is random.
256
Systematic Sampling: Example
Purchase orders for the previous fiscal year are serialized 1 to 10,000 (N = 10,000). A sample of fifty (n = 50) purchase orders is needed for an audit. k = 10,000/50 = 200
257
Systematic Sampling: Example
First sample element randomly selected from the first 200 purchase orders. Assume the 45th purchase order was selected. Subsequent sample elements: 45, 245, 445, 645, . . .
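The purchase order example can be sketched in a few lines. The starting point of 45 is the one assumed on the slide; in practice it would be drawn at random from 1 to k, as noted in the comment.

```python
N, n = 10_000, 50   # population size and sample size
k = N // n          # selection interval: every kth purchase order

# First element: assumed to be the 45th order, as in the example.
# In practice this would be random, e.g. random.randint(1, k).
start = 45

# Every kth serial number from the starting point onward.
sample = list(range(start, N + 1, k))

print(k, sample[:4], sample[-1], len(sample))
```

Because the start is at most k, the walk through the frame always yields exactly n elements.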
258
Cluster Sampling
Cluster sampling – involves dividing the population into non-overlapping areas, or clusters.
Identifies clusters that tend to be internally heterogeneous.
Each cluster is a microcosm of the population.
If the cluster is too large, a second set of clusters is taken from each original cluster. This is two-stage sampling.
259
Cluster Sampling Advantages
More convenient for geographically dispersed populations
Reduced travel costs to contact sample elements
Simplified administration of the survey
Unavailability of sampling frame prohibits using other random sampling methods
260
Cluster Sampling Disadvantages
Statistically less efficient when the cluster elements are similar Costs and problems of statistical analysis are greater than for simple random sampling
261
Nonrandom Sampling
Nonrandom sampling – sampling techniques used to select elements from the population by any mechanism that does not involve a random selection process.
These techniques are not desirable for use in gathering data to be analyzed by inferential statistics.
Sampling error cannot be determined objectively from these techniques.
262
Errors
Data from nonrandom samples are not appropriate for analysis by inferential statistical methods.
Sampling error occurs when the sample is not representative of the population.
Non-sampling errors – all errors other than sampling errors:
  Missing data, recording, data entry, and analysis errors
  Poorly conceived concepts, unclear definitions, and defective questionnaires
  Response errors, which occur when people do not know, will not say, or overstate in their answers
263
Sampling Distribution of Mean
Proper analysis and interpretation of a sample statistic requires knowledge of its distribution. (Figure: process of inferential statistics.)
264
Sample Space for n = 2 with Replacement
Population values: 54, 55, 59, 63, 64, 68, 69, 70. The 64 ordered samples (first draw, second draw) are numbered row by row, from sample 1 = (54,54) to sample 64 = (70,70). Each cell shows the sample mean.

First draw      Second draw
                54      55      59      63      64      68      69      70
54              54.0    54.5    56.5    58.5    59.0    61.0    61.5    62.0
55              54.5    55.0    57.0    59.0    59.5    61.5    62.0    62.5
59              56.5    57.0    59.0    61.0    61.5    63.5    64.0    64.5
63              58.5    59.0    61.0    63.0    63.5    65.5    66.0    66.5
64              59.0    59.5    61.5    63.5    64.0    66.0    66.5    67.0
68              61.0    61.5    63.5    65.5    66.0    68.0    68.5    69.0
69              61.5    62.0    64.0    66.0    66.5    68.5    69.0    69.5
70              62.0    62.5    64.5    66.5    67.0    69.0    69.5    70.0
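The sample space above can be generated rather than listed by hand. This sketch enumerates every ordered sample of size 2 drawn with replacement from the eight population values and computes each sample mean; it also shows that the mean of the sample means equals the population mean.

```python
from itertools import product
from statistics import mean

population = [54, 55, 59, 63, 64, 68, 69, 70]

# All ordered samples of size 2 with replacement: 8 * 8 = 64 samples,
# generated in the same row-by-row order as the numbered table.
samples = list(product(population, repeat=2))
sample_means = [mean(s) for s in samples]

population_mean = mean(population)
mean_of_sample_means = mean(sample_means)

print(len(samples), samples[16], sample_means[16])
print(population_mean, mean_of_sample_means)
```

Sample 17 in the table corresponds to index 16 here, the pair (59, 54) with mean 56.5.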
265
Central Limit Theorem The central limit theorem allows one to study populations with differently shaped distributions. The central limit theorem creates the potential for applying the normal distribution to many problems when the sample size is sufficiently large.
266
Central Limit Theorem An advantage of the central limit theorem is that sample data drawn from populations that are not normally distributed, or from populations of unknown shape, can also be analyzed, because the sample means are approximately normally distributed when sample sizes are large.
267
Central Limit Theorem As sample size increases, the distribution of sample means narrows. This is due to the Std Dev of the mean, σ/√n, which decreases as sample size increases.
268
Sampling from a Normal Population
The distribution of sample means is normal for any sample size.
269
Z Formula for Sample Means
z = (x̄ – μ)/(σ/√n)
270
Tire Store Example Suppose, for example, that the mean expenditure per customer at a tire store is $85.00, with a standard deviation of $9.00. If a random sample of 40 customers is taken, what is the probability that the sample average expenditure per customer for this sample will be $87.00 or more? Because the sample size is greater than 30, the central limit theorem can be used, and the sample means are normally distributed. With μ = $85.00, σ = $9.00, n = 40, and the z formula for sample means, z is computed as shown on the next slide.
271
Solution to Tire Store Example
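z = (x̄ − μ) / (σ/√n) = (87 − 85) / (9/√40) = 2/1.42 = 1.41. From the z table, the area between the mean and z = 1.41 is .4207, so P(x̄ ≥ 87) = .5000 − .4207 = .0793.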
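The tire store probability can be checked numerically. The sketch below uses the standard library's `NormalDist` with the values stated in the example; because it does not round z to two decimals, its answer differs slightly from the table-based .0793.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 85.00, 9.00, 40   # population mean, standard deviation, sample size
x_bar = 87.00                    # sample mean of interest

standard_error = sigma / sqrt(n)
z = (x_bar - mu) / standard_error
p_upper = 1 - NormalDist().cdf(z)   # P(sample mean >= 87)

print(f"z = {z:.2f}, P = {p_upper:.4f}")
```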
272
Graphic Solution to Tire Store Example
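[Figure: normal curve of sample means centered at μ = 85 with x̄ = 87 at z = 1.41. The area between the mean and z = 1.41 is .4207, leaving an upper-tail area of .5000 − .4207 = .0793.]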
273
Demonstration Problem 7.1
Suppose that during any hour in a large department store, the average number of shoppers is 448, with a standard deviation of 21 shoppers. What is the probability that a random sample of 49 different shopping hours will yield a sample mean between 441 and 446 shoppers?
274
Demonstration Problem 7.1
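z = (441 − 448) / (21/√49) = −7/3 = −2.33 and z = (446 − 448) / (21/√49) = −2/3 = −0.67. From the z table, the areas between the mean and these z values are .4901 and .2486, so P(441 ≤ x̄ ≤ 446) = .4901 − .2486 = .2415.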
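Demonstration Problem 7.1 can be verified the same way. This sketch models the sampling distribution of the mean directly as a normal distribution with standard deviation σ/√n; the unrounded result is close to, but not identical with, the table-based .2415.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 448, 21, 49
standard_error = sigma / sqrt(n)            # 21 / 7 = 3

# Sampling distribution of the sample mean.
sampling_dist = NormalDist(mu, standard_error)
p_between = sampling_dist.cdf(446) - sampling_dist.cdf(441)

print(f"P(441 <= sample mean <= 446) = {p_between:.4f}")
```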
275
Graphic Solution for Demonstration Problem 7.1
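[Figure: normal curve of sample means centered at μ = 448 with x̄ = 441 (z = −2.33) and x̄ = 446 (z = −.67). The areas from the mean are .4901 and .2486, so the area between 441 and 446 is .2415.]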
276
Sampling Distribution of p̂
Sampling Distribution of the Sample Proportion Approximately normal if nP > 5 and nQ > 5 (P is the population proportion and Q = 1 − P). The mean of the distribution is P. The standard deviation of the distribution is √(P·Q/n).
277
Sampling Distribution of “p hat”
p̂ (“p hat”) is a sample proportion. Whereas the mean is computed by averaging a set of values, the sample proportion is computed by dividing the frequency with which a given characteristic occurs in a sample by the number of items in the sample (see next slide for formula).
278
Z Formula for Sample Proportions
z = (p̂ − P) / √(P·Q/n), where p̂ = sample proportion, n = sample size, P = population proportion, and Q = 1 − P. The formula applies when nP > 5 and nQ > 5.
279
Demonstration Problem 7.3
If 10% of a population of parts is defective, what is the probability of randomly selecting 80 parts and finding that 12 or more parts are defective?
280
Solution for Demonstration Problem 7.3
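Population parameters: P = .10, Q = 1 − P = .90. Sample: n = 80, X = 12, p̂ = X/n = 12/80 = .15. z = (p̂ − P) / √(P·Q/n) = (.15 − .10) / √((.10)(.90)/80) = .05/.0335 = 1.49. P(p̂ ≥ .15) = P(z ≥ 1.49) = .5000 − .4319 = .0681.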
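The sample proportion calculation for Demonstration Problem 7.3 can be sketched in a few lines. The values come from the problem statement; the variable names are illustrative, and the unrounded probability is about .0680 rather than the table-based .0681.

```python
from math import sqrt
from statistics import NormalDist

P, n, x = 0.10, 80, 12      # population proportion, sample size, defectives found
Q = 1 - P
p_hat = x / n               # sample proportion

# Normal approximation is reasonable when n*P > 5 and n*Q > 5.
approximation_ok = n * P > 5 and n * Q > 5

z = (p_hat - P) / sqrt(P * Q / n)
p_upper = 1 - NormalDist().cdf(z)   # P(p_hat >= .15)

print(f"p_hat = {p_hat:.2f}, z = {z:.2f}, P = {p_upper:.4f}")
```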
281
Graphic Solution for Demonstration Problem 7.3
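[Figure: normal curve of p̂ centered at P = 0.10 with p̂ = 0.15 at z = 1.49. The area between the mean and z = 1.49 is .4319, leaving an upper-tail area of .5000 − .4319 = .0681.]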
282
Business Statistics, 6th ed. by Ken Black
Chapter 8 Statistical Inference: Estimation for Single Populations
283
Learning Objectives Know the difference between point and interval estimation. Estimate a population mean from a sample mean when σ is known. Estimate a population mean from a sample mean when σ is unknown.
284
Learning Objectives Estimate a population proportion from a sample proportion. Estimate the population variance from a sample variance. Estimate the minimum sample size necessary to achieve given statistical goals.
285
Estimating the Population Mean
A point estimate is a statistic taken from a sample that is used to estimate a population parameter. Interval estimate: a range of values within which the analyst can declare, with some confidence, that the population parameter lies.
286
Confidence Interval to Estimate μ when σ is Known
Point estimate: x̄. Interval estimate: x̄ ± zα/2·σ/√n, that is, x̄ − zα/2·σ/√n ≤ μ ≤ x̄ + zα/2·σ/√n.
287
Distribution of Sample Means for 95% Confidence
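[Figure: distribution of sample means for 95% confidence. The central 95% lies between z = −1.96 and z = 1.96 (area .4750 on each side of the mean), with .025 in each tail.]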
288
Estimating the Population Mean
For a 95% confidence interval, α = .05 and α/2 = .025. To find the value of zα/2, or z.025, look in the standard normal distribution table under .5000 − .0250 = .4750. From Table A5, look up .4750 and read 1.96 as the z value from the row and column.
289
Estimating the Population Mean
α is used to locate the z value in constructing the confidence interval. The confidence interval yields a range within which the researcher feels, with some confidence, that the population mean is located. z score: the number of standard deviations a value (x) is above or below the mean of a set of numbers when the data are normally distributed.
290
95% Confidence Intervals for μ
[Figure: 95% confidence intervals for μ constructed around different sample means x̄]
291
95% Confidence Interval for μ
292
Demonstration Problem 8.1
A survey was taken of U.S. companies that do business with firms in India. One of the questions on the survey was: Approximately how many years has your company been trading with firms in India? A random sample of 44 responses to this question yielded a mean of years. Suppose the population standard deviation for this question is 7.7 years. Using this information, construct a 90% confidence interval for the mean number of years that a company has been trading in India for the population of U.S. companies trading with firms in India.
293
Demonstration Problem 8.1
294
Demonstration Problem 8.2
A study is conducted in a company that employs 800 engineers. A random sample of 50 engineers reveals that the average sample age is 34.3 years. Historically, the population standard deviation of the age of the company’s engineers is approximately 8 years. Construct a 98% confidence interval to estimate the average age of all the engineers in this company.
295
Demonstration Problem 8.2
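The interval for Demonstration Problem 8.2 can be sketched as follows. Because the problem states a finite population of N = 800, this sketch assumes the finite correction factor √((N − n)/(N − 1)) is applied to the standard error; that choice and the variable names are assumptions, not taken from the slide.

```python
from math import sqrt
from statistics import NormalDist

N, n = 800, 50               # population size and sample size
x_bar, sigma = 34.3, 8.0     # sample mean and population standard deviation
confidence = 0.98

# z value that leaves (1 - confidence) / 2 in each tail.
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Finite correction factor (assumed here because N is known and finite).
fpc = sqrt((N - n) / (N - 1))

margin = z * sigma / sqrt(n) * fpc
lower, upper = x_bar - margin, x_bar + margin

print(f"{lower:.2f} <= mu <= {upper:.2f}")
```

Dropping `fpc` gives the slightly wider interval that would apply to an effectively infinite population.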
296
Estimating the Mean of a Normal Population: σ Unknown
The population has a normal distribution. When the value of the population standard deviation is unknown, the sample standard deviation must be used in the estimation process. The z distribution is not appropriate under these conditions; the t distribution is appropriate, and the sample standard deviation is used in the t formula.
297
t Distribution A family of distributions, with a unique distribution for each value of its parameter, degrees of freedom (d.f.). Symmetric, unimodal, mean = 0, flatter than the z distribution. The t distribution is used instead of the z distribution for inferential statistics on the population mean when the population standard deviation is unknown and the population is normally distributed. With the t distribution, the sample standard deviation is used.
298
t Distribution A family of distributions, with a unique distribution for each value of its parameter, degrees of freedom (d.f.). Symmetric, unimodal, mean = 0, flatter than the z distribution. t formula: t = (x̄ − μ) / (s/√n), with df = n − 1.
299
t Distribution Characteristics
t distributions are flatter in the middle and have more area in their tails than the normal distribution. t distributions approach the normal curve as n becomes larger. The t distribution is to be used when the population variance or population standard deviation is unknown, regardless of the size of the sample.
300
Reading the t Distribution
t table uses the area in the tail of the distribution Emphasis in the t table is on α, and each tail of the distribution contains α/2 of the area under the curve when confidence intervals are constructed t values are located at the intersection of the df value and the selected α/2 value
301
Confidence Intervals for μ of a Normal Population: σ Unknown
x̄ ± tα/2,n−1·s/√n, that is, x̄ − tα/2,n−1·s/√n ≤ μ ≤ x̄ + tα/2,n−1·s/√n, with df = n − 1.
302
Table of Critical Values of t
df    t0.100  t0.050  t0.025  t0.010  t0.005
1     3.078   6.314   12.706  31.821  63.656
2     1.886   2.920   4.303   6.965   9.925
3     1.638   2.353   3.182   4.541   5.841
4     1.533   2.132   2.776   3.747   4.604
5     1.476   2.015   2.571   3.365   4.032
23    1.319   1.714   2.069   2.500   2.807
24    1.318   1.711   2.064   2.492   2.797
25    1.316   1.708   2.060   2.485   2.787
29    1.311   1.699   2.045   2.462   2.756
30    1.310   1.697   2.042   2.457   2.750
40    1.303   1.684   2.021   2.423   2.704
60    1.296   1.671   2.000   2.390   2.660
120   1.289   1.658   1.980   2.358   2.617
∞     1.282   1.645   1.960   2.327   2.576
With df = 24 and α = 0.05, tα = 1.711.
304
Demonstration Problem 8.3
The owner of a large equipment rental company wants to make a rather quick estimate of the average number of days a piece of ditch digging equipment is rented out per person per time. The company has records of all rentals, but the amount of time required to conduct an audit of all accounts would be prohibitive. The owner decides to take a random sample of rental invoices. Fourteen different rentals of ditch diggers are selected randomly from the files, yielding the following data. She uses these data to construct a 99% confidence interval to estimate the average number of days that a ditch digger is rented and assumes that the number of days per rental is normally distributed in the population.
305
Solution for Demonstration Problem 8.3
306
MINITAB Solution for Demonstration Problem 8.3
307
Comp Time: Excel Normal View
308
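Confidence Interval to Estimate the Population Proportion
Estimates of the population proportion must often be made. The confidence interval to estimate p is p̂ ± zα/2·√(p̂·q̂/n), where p̂ = sample proportion, q̂ = 1 − p̂, and n = sample size.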
309
Demonstration Problem 8.5
A clothing company produces men’s jeans. The jeans are made and sold with either a regular cut or a boot cut. In an effort to estimate the proportion of their men’s jeans market in Oklahoma City that prefers boot-cut jeans, the analyst takes a random sample of 212 jeans sales from the company’s two Oklahoma City retail outlets. Only 34 of the sales were for boot-cut jeans. Construct a 90% confidence interval to estimate the proportion of the population in Oklahoma City who prefer boot-cut jeans.
310
Solution for Demonstration Problem 8.5
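A minimal sketch of the interval for Demonstration Problem 8.5, using the standard large-sample formula p̂ ± z·√(p̂·q̂/n) and the counts from the problem statement. Variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, x = 212, 34               # sales sampled and boot-cut sales observed
confidence = 0.90

p_hat = x / n
q_hat = 1 - p_hat
z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.645

margin = z * sqrt(p_hat * q_hat / n)
lower, upper = p_hat - margin, p_hat + margin

print(f"p_hat = {p_hat:.2f}: {lower:.3f} <= p <= {upper:.3f}")
```

Rounded to two decimals, the interval runs from about .12 to .20.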
311
Estimating the Population Variance
Population parameter: σ². Estimator of σ²: s² = Σ(x − x̄)²/(n − 1). χ² formula for a single variance: χ² = (n − 1)s²/σ², with df = n − 1.
312
Confidence Interval for σ²
(n − 1)s²/χ²α/2 ≤ σ² ≤ (n − 1)s²/χ²1−α/2, with df = n − 1.
313
Two Table Values of χ²
[Figure: chi-square distribution with df = 7. The table values are χ².95,7 = 2.16735 (area .95 to the right) and χ².05,7 = 14.0671 (area .05 to the right), so 90% of the area lies between them.]
314
90% Confidence Interval for σ²
315
Demonstration Problem 8.6
The U.S. Bureau of Labor Statistics publishes data on the hourly compensation costs for production workers in manufacturing for various countries. The latest figures published for Greece show that the average hourly wage for a production worker in manufacturing is $ Suppose the business council of Greece wants to know how consistent this figure is. They randomly select 25 production workers in manufacturing from across the country and determine that the standard deviation of hourly wages for such workers is $1.12. Use this information to develop a 95% confidence interval to estimate the population variance for the hourly wages of production workers in manufacturing in Greece. Assume that the hourly wages for production workers across the country in manufacturing are normally distributed.
316
Solution for Demonstration Problem 8.6
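The variance interval for Demonstration Problem 8.6 can be sketched with the chi-square formula (n − 1)s²/χ². To stay within the standard library, the two critical values below are hard-coded from a standard chi-square table for df = 24; they are table look-ups supplied here, not values shown on the slide.

```python
n, s = 25, 1.12              # sample size and sample standard deviation
s_squared = s ** 2
df = n - 1

# Chi-square table values for df = 24 (areas to the right of the value).
chi2_right_025 = 39.3641     # area .025 to the right
chi2_right_975 = 12.4012     # area .975 to the right

lower = df * s_squared / chi2_right_025
upper = df * s_squared / chi2_right_975

print(f"{lower:.4f} <= sigma^2 <= {upper:.4f}")
```

Note that the larger chi-square value produces the lower bound of the interval.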
317
Determining Sample Size when Estimating μ
It may be necessary to estimate the sample size when working on a project. In studies where μ is being estimated, the size of the sample can be determined by using the z formula for sample means to solve for n. The difference between x̄ and μ is the error of estimation: Error of Estimation = (x̄ − μ).
318
Determining Sample Size when Estimating μ
z formula: z = (x̄ − μ) / (σ/√n). Error of estimation (tolerable error): E = x̄ − μ. Estimated sample size: n = z²α/2·σ²/E². Estimated σ: σ ≈ range/4.
319
Sample Size When Estimating μ: Example
320
Demonstration Problem 8.7
Suppose you want to estimate the average age of all Boeing airplanes now in active domestic U.S. service. You want to be 95% confident, and you want your estimate to be within one year of the actual figure. The was first placed in service about 24 years ago, but you believe that no active s in the U.S. domestic fleet are more than 20 years old. How large of a sample should you take?
321
Solution for Demonstration Problem 8.7
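The sample size for Demonstration Problem 8.7 can be sketched as below. It assumes the common rule of thumb σ ≈ range/4 with a range of 20 years (the oldest aircraft believed to be in service), and the table value z = 1.96 for 95% confidence; both are assumptions stated for illustration.

```python
from math import ceil

z = 1.96                     # table value for 95% confidence
error = 1.0                  # tolerable error, in years
value_range = 20.0           # assumed range of ages, in years
sigma_estimate = value_range / 4   # rule-of-thumb estimate of sigma

n_exact = (z ** 2) * (sigma_estimate ** 2) / (error ** 2)
n = ceil(n_exact)            # always round up to the next whole unit

print(f"n = {n_exact:.2f}, so sample at least {n} airplanes")
```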
322
Determining Sample Size when Estimating p
z formula: z = (p̂ − p) / √(p·q/n). Error of estimation (tolerable error): E = p̂ − p. Estimated sample size: n = z²·p·q/E².
323
Demonstration Problem 8.8
Hewitt Associates conducted a national survey to determine the extent to which employers are promoting health and fitness among their employees. One of the questions asked was, Does your company offer on-site exercise classes? Suppose it was estimated before the study that no more than 40% of the companies would answer Yes. How large a sample would Hewitt Associates have to take in estimating the population proportion to ensure a 98% confidence in the results and to be within .03 of the true population proportion?
324
Solution for Demonstration Problem 8.8
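A sketch of the sample size calculation for Demonstration Problem 8.8, using n = z²·p·q/E². It assumes the two-decimal table value z = 2.33 for 98% confidence; using an unrounded z gives a slightly smaller n, so the rounding convention matters here.

```python
from math import ceil

z = 2.33                     # table value for 98% confidence (assumed rounding)
p = 0.40                     # planning value for the population proportion
q = 1 - p
error = 0.03                 # tolerable error

n_exact = (z ** 2) * p * q / (error ** 2)
n = ceil(n_exact)            # round up to the next whole company

print(f"n = {n_exact:.1f}, so sample at least {n} companies")
```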
325
Business Statistics, 6th ed. by Ken Black
Chapter 9 Statistical Inference: Hypothesis Testing for Single Populations
326
Learning Objectives Understand the logic of hypothesis testing, and know how to establish null and alternate hypotheses. Understand Type I and Type II errors, and know how to solve for Type II errors. Know how to implement the Hypothesis, Test, Action, Business (HTAB) system to test hypotheses. Test hypotheses about a single population mean when σ is known. Test hypotheses about a single population mean when σ is unknown. Test hypotheses about a single population proportion. Test hypotheses about a single population variance.
327
Introduction to Hypothesis Testing
Hypothesis testing lets researchers structure problems in such a way that they can use statistical evidence to test various theories about phenomena.
328
Types of Hypotheses
1. Research Hypothesis: a statement of what the researcher believes will be the outcome of an experiment or a study. 2. Statistical Hypotheses: a more formal structure derived from the research hypothesis, composed of two parts. Null hypothesis (H0): the null condition exists; the old statement is correct. Alternative hypothesis (Ha): the new theory is true.
329
Types of Hypotheses 3. Substantive Hypotheses: a statistically significant difference does not imply or mean a material, substantive difference. If the null hypothesis is rejected and the alternative hypothesis is accepted, then one can say that a statistically significant result has been obtained. With “significant” results, you reject the null hypothesis.
330
Statistical Hypotheses
Two Parts a null hypothesis - nothing new is happening; the null condition exists an alternative hypothesis - something new is happening Notation null: H0 alternative: Ha
331
Null and Alternative Hypotheses
The Null and Alternative Hypotheses are mutually exclusive. Only one of them can be true. The Null and Alternative Hypotheses are collectively exhaustive. The Null Hypothesis is assumed to be true. The burden of proof falls on the Alternative Hypothesis.
332
Null and Alternative Hypotheses: Example
A manufacturer is filling 40 oz. packages with flour. The company wants the package contents to average 40 ounces. H0: μ = 40 oz, Ha: μ ≠ 40 oz.
333
One-tailed and Two-tailed Tests
One-tailed Tests Two-tailed Test
334
8 Steps in Testing Hypotheses
1. Establish hypotheses: state the null and alternative hypotheses. 2. Determine the appropriate statistical test and sampling distribution. 3. Specify the Type I error rate (α). 4. State the decision rule. 5. Gather sample data. 6. Calculate the value of the test statistic. 7. State the statistical conclusion. 8. Make a managerial decision.
335
Rejection and Nonrejection Regions
Conceptually and graphically, statistical outcomes that result in the rejection of the null hypothesis lie in what is termed the rejection region. Statistical outcomes that fail to result in the rejection of the null hypothesis lie in what is termed the nonrejection region.
336
Rejection and Non Rejection Regions
Possible statistical outcomes: Reject the null hypothesis: results lie in the rejection region. Do not reject the null hypothesis: results lie in the nonrejection region. If values fall in the rejection region, you reject the null hypothesis. Draw the rejection and nonrejection graph.
337
Rejection and Non Rejection Regions
[Figure: distribution centered at μ = 40 oz showing the nonrejection region, the rejection region, and the critical value]
338
Type I and Type II Errors
Type I Error Committed by rejecting a true null hypothesis. If the null hypothesis is true, any mean that falls in a rejection region will be a Type I error. The probability of committing a Type I error is called α, the level of significance.
339
Type I and Type II Errors
Type II Error Committed when a researcher fails to reject a false null hypothesis. The probability of committing a Type II error is called β.
340
Decision Table for Hypothesis Testing
                       Null True           Null False
Fail to reject null    Correct decision    Type II error (β)
Reject null            Type I error (α)    Correct decision
341
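One-tailed Tests
[Figure: two one-tailed tests, each on a distribution centered at μ = 40 oz, one with the rejection region in the lower tail and one with the rejection region in the upper tail; a single critical value separates the rejection region from the nonrejection region]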
342
Two-tailed Tests
[Figure: distribution centered at μ = 12 oz with a rejection region in each tail, the nonrejection region in the middle, and two critical values]
343
Testing Hypotheses about a Population Mean Using the z Statistic (σ Known)
The z formula, z = (x̄ − μ) / (σ/√n), can be used to test hypotheses about a single population mean if the sample size (n) is > 30 for any population, or if n < 30 and x is normally distributed.
344
Testing Hypotheses about a Population Mean Using the z Statistic (σ Known)
Example: A survey, done 10 years ago, of CPAs in the U.S. found that their average salary was $74,914. An accounting researcher would like to test whether this average has changed over the years. A sample of 112 CPAs produced a mean salary of $78,695. Assume that the population standard deviation of salaries is σ = $14,530.
345
Testing Hypotheses about a Population Mean Using the z Statistic (σ Known)
Step 1: Hypothesize. H0: μ = $74,914, Ha: μ ≠ $74,914. Step 2: Test. The test statistic is z = (x̄ − μ) / (σ/√n).
346
Testing Hypotheses about a Population Mean Using the z Statistic (σ Known)
Step 3: Specify the Type I error rate. α = 0.05, zα/2 = 1.96. Step 4: Establish the decision rule. Reject H0 if the test statistic < −1.96 or if the test statistic > 1.96.
347
Testing Hypotheses about a Population Mean Using the z Statistic (σ Known)
Step 5: Gather sample data. x̄ = $78,695, n = 112, σ = $14,530, hypothesized μ = $74,914. Step 6: Compute the test statistic. z = (78,695 − 74,914) / (14,530/√112) = 2.75.
348
Testing Hypotheses about a Population Mean Using the z Statistic (σ Known)
Step 7: Reach a statistical conclusion. Since z = 2.75 > 1.96, reject H0. Step 8: Business decision. Statistically, the researcher has enough evidence to reject the figure of $74,914 as the true average salary for CPAs. In addition, the evidence gathered suggests that the average may have increased over the 10-year period.
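The eight steps for the CPA example reduce to a short calculation. This sketch uses the figures from the example; the two-tailed p-value is an addition computed from the standard normal distribution, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 74_914                # hypothesized mean salary
x_bar, n = 78_695, 112       # sample mean and sample size
sigma = 14_530               # population standard deviation
alpha = 0.05

z = (x_bar - mu_0) / (sigma / sqrt(n))
z_critical = NormalDist().inv_cdf(1 - alpha / 2)      # about 1.96

# Two-tailed p-value: probability of a z at least this extreme in either tail.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_h0 = abs(z) > z_critical
print(f"z = {z:.2f}, critical = ±{z_critical:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```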
349
CPA Net Income Example: Two-tailed Test (Part 2)
350
CPA Net Income Example: Critical Value Method (Part 1)
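[Figure: two-tailed test with critical sample means of $72,223 and $77,605. The nonrejection region lies between them, centered at z = 0, and a rejection region lies in each tail.]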
351
Using p value to Test Hypothesis
p-value: another way to reach a statistical conclusion in hypothesis testing. No preset value of α is given in the p-value method. The p-value defines the smallest value of α for which the null hypothesis can be rejected. If p-value < α, reject H0. If p-value ≥ α, do not reject H0.
352
Using p value to Test Hypothesis
For a two-tailed test, α is split to determine the critical values of the test statistic. With the p-value method, the probability of getting a test statistic at least as extreme as the observed value is computed. The p-value is then compared with α (or with α/2 when a one-tailed p-value is used in a two-tailed test) to determine statistical significance.
353
Using the p-Value to Test Hypotheses
One should be careful when using p-values from statistical software outputs. Both MINITAB and EXCEL report the actual p-values for hypothesis tests. MINITAB doubles the p-value for a two-tailed test so you can compare it with α. EXCEL does not double the p-value for a two-tailed test. So when using the p-value from EXCEL, you may multiply the value by 2 and then compare it with α.
354
Demonstration Problem: MINITAB
355
Using the p-Value to Test Hypotheses
356
Critical Value Method to Test Hypotheses
The critical value method determines the critical mean value required for z to be in the rejection region and uses it to test the hypotheses.
357
Critical Value Method to Test Hypotheses
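For the previous example, the critical sample means are x̄c = μ ± zα/2·σ/√n = 74,914 ± 1.96(14,530/√112) = 74,914 ± 2,691, that is, $72,223 and $77,605.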
358
Critical Value Method to Test Hypotheses
Thus, a sample mean greater than $77,605 or less than $72,223 will result in the rejection of the null hypothesis. The test statistic for this test is the sample mean, x̄ = $78,695, which is greater than $77,605, so the null hypothesis is rejected.
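The critical value method for the CPA example can be sketched as follows. It uses the table value z = 1.96 and the figures from the example; the variable names are illustrative.

```python
from math import sqrt

mu_0, sigma, n = 74_914, 14_530, 112
x_bar = 78_695
z = 1.96                     # table value for alpha = 0.05, two-tailed

margin = z * sigma / sqrt(n)
lower_critical = mu_0 - margin
upper_critical = mu_0 + margin

# Reject H0 when the sample mean falls outside the two critical means.
reject_h0 = x_bar < lower_critical or x_bar > upper_critical

print(f"critical means: {lower_critical:,.0f} and {upper_critical:,.0f}; reject H0: {reject_h0}")
```

The decision always agrees with the z test, because comparing x̄ with the critical means is the same comparison expressed in dollars instead of z units.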
359
Testing Hypotheses About a Variance: Demonstration Problem 9.4
A small business has 37 employees. Because of the uncertain demand for its product, the company usually pays overtime on any given week. The company assumed that about 50 total hours of overtime per week is required and that the variance on this figure is about 25. Company officials want to know whether the variance of overtime hours has changed. Given here is a sample of 16 weeks of overtime data (in hours per week). Assume hours of overtime are normally distributed. Use these data to test the null hypothesis that the variance of overtime data is 25. Let α = 0.10.
360
Testing Hypotheses About a Variance: Demonstration Problem 9.4
Step 1: H0: σ² = 25, Ha: σ² ≠ 25. Step 2: The test statistic is χ² = (n − 1)s²/σ².
361
Testing Hypotheses About a Variance: Demonstration Problem 9.4
Step 3: Because this is a two-tailed test, α = 0.10 and α/2 = 0.05. Step 4: The degrees of freedom are 16 – 1 = 15. The two critical chi-square values are χ²(1 – 0.05),15 = χ²0.95,15 = 7.2609 and χ²0.05,15 = 24.9958. Step 5: The data are listed in the text. Step 6: The sample variance s² is computed from the data, and the observed chi-square value is calculated as χ² = (n − 1)s²/σ² = 15s²/25.
362
Testing Hypotheses About a Variance: Demonstration Problem 9.4
Step 7: The observed chi-square value is in the nonrejection region because χ²0.95,15 = 7.2609 < χ²observed < χ²0.05,15 = 24.9958. Step 8: This result indicates to the company managers that the variance of weekly overtime hours is about what they expected.
363
Solving for Type II Errors
When the null hypothesis is not rejected, then either a correct decision is made or an incorrect decision is made. If an incorrect decision is made, that is, if the null hypothesis is not rejected when it is false, then a Type II error, β, has occurred.
364
Solving for Type II Errors (Soft Drink)
Suppose a test is conducted on the following hypotheses: H0: μ = 12 ounces vs. Ha: μ < 12 ounces, when the sample size is 60. The first step in determining the probability of a Type II error is to calculate a critical value for the sample mean (x̄c in this case). For α = 0.05, the critical value for the sample mean is given on the next slide.
365
Solving for Type II Errors (Soft Drink)
In testing the null hypothesis by the critical value method, this value is used as the cutoff for the nonrejection region. For any sample mean obtained that is less than x̄c, the null hypothesis is rejected. For any sample mean greater than x̄c, the null hypothesis is not rejected.
366
Solving for Type II Errors (Soft Drink)
Because the probability of a Type II error, β, varies with the possible values of the alternative parameter, β is computed for a specific alternative mean μ1 (< 12). The corresponding z value is z = (x̄c − μ1) / (σ/√n).
367
Solving for Type II Errors (Soft Drink)
The value of z yields an area from the z table. The probability of committing a Type II error is equal to the area to the right of the critical value of the sample mean, x̄c, under the distribution centered on the alternative mean. This area is β = .8023. Thus, there is an 80.23% chance of committing a Type II error if the alternative mean is μ1. Note: equivalent problems can be solved for sample proportions (see Demonstration Problem 9.6).
368
Operating Characteristic and Power Curve
Because the probability of committing a Type II error changes for each different value of the alternative parameter, it is best to examine a series of possible alternative values. The power of a test is the probability of rejecting the null hypothesis when it is false. Power = 1 − β.
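How β and power change across alternative means can be sketched for the lower-tailed soft drink test (H0: μ = 12, n = 60, α = 0.05). The population standard deviation and the alternative means below are hypothetical inputs chosen for illustration; they are not given in the text above, so the printed values should not be read as the slide's results.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error(mu0, mu_alt, sigma, n, alpha):
    """Beta for a lower-tailed z test of H0: mu = mu0 against Ha: mu < mu0."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(alpha)   # negative critical z for the lower tail
    x_crit = mu0 + z_crit * se             # critical value for the sample mean
    # H0 is not rejected when the sample mean falls at or above x_crit,
    # so beta is the area to the right of x_crit under the alternative.
    return 1 - NormalDist(mu_alt, se).cdf(x_crit)


# sigma = 0.10 and these alternative means are hypothetical inputs.
alternatives = [11.99, 11.98, 11.97, 11.96, 11.95]
betas = [type_ii_error(12.0, m, sigma=0.10, n=60, alpha=0.05) for m in alternatives]

for m, beta in zip(alternatives, betas):
    print(f"alternative mean {m:.2f}: beta = {beta:.4f}, power = {1 - beta:.4f}")
```

Plotting β against the alternative means gives an operating characteristic curve; plotting 1 − β gives the power curve.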
369
Business Statistics, 6th ed. by Ken Black
Chapter 10 Statistical Inference: About Two Populations
370
Learning Objectives Test hypotheses and construct confidence intervals about the difference in two population means using the Z statistic. Test hypotheses and construct confidence intervals about the difference in two population means using the t statistic.
371
Learning Objectives Test hypotheses and construct confidence intervals about the difference in two related populations. Test hypotheses and construct confidence intervals about the differences in two population proportions. Test hypotheses and construct confidence intervals about two population variances.
372
Hypothesis Testing; Confidence Intervals - Difference in Means using z Statistic (Population Variance Known) The difference in two sample means, x̄1 − x̄2, is used to test the difference in the two population means. The central limit theorem states that the difference in two sample means is normally distributed for large sample sizes (both n1 and n2 > 30) regardless of the shape of the populations.
373
Hypothesis Testing for Differences Between Means: The Wage Example
As a specific example, suppose we want to conduct a hypothesis test to determine whether the average annual wage for an advertising manager is different from the average annual wage of an auditing manager. Because we are testing to determine whether the means are different, it might seem logical that the null and alternative hypotheses would be Ho: μ1 = μ2 Ha: μ1 ≠ μ2 where advertising managers are population 1 and auditing managers are population 2.
374
Hypothesis Testing for Differences Between Means: The Wage Example (part 3)
Advertising Managers: 74.256 57.791 71.115 96.234 65.145 67.574 89.807 96.767 59.621 93.261 77.242 62.483 67.056 69.319 74.195 64.276 35.394 75.932 74.194 86.741 80.742 65.360 57.351 39.672 73.904 45.652 54.270 93.083 59.045 63.384 68.508
Auditing Managers: 69.962 77.136 43.649 55.052 66.035 63.369 57.828 54.335 59.676 63.362 42.494 54.449 37.194 83.849 46.394 99.198 67.160 71.804 61.254 37.386 72.401 73.065 59.505 56.470 48.036 72.790 67.814 60.053 71.351 71.492 66.359 58.653 61.261 63.508
375
Hypothesis Testing for Differences Between Means: The Wage Example
α = 0.05, α/2 = 0.025, z0.025 = 1.96
376
Hypothesis Testing for Differences Between Means: Wage Example
Since the observed value of 2.35 is greater than 1.96, reject the null hypothesis. That is, there is a significant difference between the average annual wage of advertising managers and the average annual wage of auditing managers.
377
Hypothesis Testing for Differences Between Means: The Wage Example
H0: μ1 − μ2 = 0, Ha: μ1 − μ2 ≠ 0. [Figure: two-tailed test with a rejection region in each tail and the nonrejection region between the critical values]
378
Hypothesis Testing for Differences Between Means: Wage Example (part 2)
[Figure: two-tailed test with a rejection region in each tail and the nonrejection region between the critical values]
379
Hypothesis Testing for Differences Between Means: Wage Example (part 4)
[Figure: two-tailed test with a rejection region in each tail and the nonrejection region between the critical values]
380
Difference Between Means: Using Excel
z-Test: Two Sample for Means
Mean (Auditing Mgr): 62.187
Known Variance:
Observations: 32 (Adv Mgr), 34 (Auditing Mgr)
Hypothesized Mean Difference:
z: 2.35
P(Z<=z) one-tail: 0.0094
z Critical one-tail: 1.64
P(Z<=z) two-tail: 0.0189
z Critical two-tail: 1.960
381
Hypothesis Testing Because one is testing to determine whether the means are different, it might seem logical that the null and alternative hypotheses would be H0: µ1 − µ2 = 0 and Ha: µ1 − µ2 ≠ 0. The analysis tests whether there is a difference in the average wage. This is a two-tailed test.
382
Demonstration Problem 10.1
A sample of 87 professional working women showed that the average amount paid annually into a private pension fund per person was $3352. The population standard deviation is $1100. A sample of 76 professional working men showed that the average amount paid annually into a private pension fund per person was $5727, with a population standard deviation of $1700. A women’s activist group wants to “prove” that women do not pay as much per year as men into private pension funds. If they use α = .001 and these sample data, will they be able to reject a null hypothesis that women annually pay the same as or more than men into private pension funds? Use the eight-step hypothesis-testing process.
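Demonstration Problem 10.1 can be worked with the z formula for the difference in two sample means with known population standard deviations. This sketch treats women as population 1 and men as population 2 and tests H0: μ1 − μ2 ≥ 0 against Ha: μ1 − μ2 < 0; the group labeling is an arbitrary choice made here.

```python
from math import sqrt
from statistics import NormalDist

# Population 1: women; population 2: men.
x_bar_1, sigma_1, n_1 = 3352, 1100, 87
x_bar_2, sigma_2, n_2 = 5727, 1700, 76
alpha = 0.001

standard_error = sqrt(sigma_1 ** 2 / n_1 + sigma_2 ** 2 / n_2)
z = ((x_bar_1 - x_bar_2) - 0) / standard_error   # hypothesized difference is 0

z_critical = NormalDist().inv_cdf(alpha)         # lower-tail critical value
reject_h0 = z < z_critical

print(f"z = {z:.2f}, critical = {z_critical:.2f}, reject H0: {reject_h0}")
```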
383
Demonstration Problem 10.1 (part 1)
[Figure: one-tailed test showing the nonrejection region, the critical value, and the rejection region]
384
Demonstration Problem 10.1 (part 2)
[Figure: one-tailed test showing the nonrejection region, the critical value, and the rejection region]
385
Confidence Interval Sometimes the appropriate approach is to take a random sample from each of the two populations and study the difference in the two samples. The formula for a confidence interval to estimate (µ1 − µ2) is (x̄1 − x̄2) ± z·√(σ1²/n1 + σ2²/n2). Designating one group as group one and the other as group two is an arbitrary decision.
386
Demonstration Problem 10.2
A consumer test group wants to determine the difference in gasoline mileage of cars using regular unleaded gas and cars using premium unleaded gas. Researchers for the group divided a fleet of 100 cars of the same make in half and tested each car on one tank of gas. Fifty of the cars were filled with regular unleaded gas and 50 were filled with premium unleaded gas. The sample average for the regular gasoline group was miles per gallon (mpg), and the sample average for the premium gasoline group was 24.6 mpg. Assume that the population standard deviation of the regular unleaded gas population is 3.46 mpg, and that the population standard deviation of the premium unleaded gas population is 2.99 mpg. Construct a 95% confidence interval to estimate the difference in the mean gas mileage between the cars using regular gasoline and the cars using premium gasoline.
387
Demonstration Problem 10.2
388
Hypothesis Testing The hypothesis test compares the means of two samples to see whether there is a difference in the two population means from which the samples come. It is used when σ² is unknown and the samples are independent. It assumes that the measurement is normally distributed.
389
Hypothesis Testing If σ is unknown, it can be estimated by pooling the two sample variances and computing a pooled sample standard deviation.
390
t Test for Differences in Population Means
Each of the two populations is normally distributed. The two samples are independent. The values of the population variances are unknown. The variances of the two populations are equal: σ1² = σ2².
391
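t Formula to Test the Difference in Means Assuming σ1² = σ2²
t = ((x̄1 − x̄2) − (μ1 − μ2)) / (sp·√(1/n1 + 1/n2)), where sp = √((s1²(n1 − 1) + s2²(n2 − 1)) / (n1 + n2 − 2)) and df = n1 + n2 − 2.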
392
Hernandez Manufacturing Company
At the Hernandez Manufacturing Company, an application of this test arises. New employees are expected to attend a three-day seminar to learn about the company. At the end of the seminar, they are tested to measure their knowledge about the company. The traditional training method has been lecture and a question-and-answer session. Management decided to experiment with a different training procedure, which processes new employees in two days by using DVDs and having no question-and-answer session.
393
Hernandez Manufacturing Company – Cont’d
If this procedure works, it could save the company thousands of dollars over a period of several years. However, there is some concern about the effectiveness of the two-day method, and company managers would like to know whether there is any difference in the effectiveness of the two training methods.
394
Hernandez Manufacturing Company (part 2)
Training Method A: 56 51 45 47 52 43 42 53 50 48 44
Training Method B: 59 52 53 54 57 56 55 64 65
395
Hernandez Manufacturing Company (part 1)
[Figure: two-tailed test with a rejection region in each tail and the nonrejection region between the critical values]
396
Hernandez Manufacturing Company (part 3)
397
MINITAB Output for Hernandez New-Employee Training Problem
Two sample T for method A vs. method B
N Mean StDev SE Mean
method A
method B
95% C.I. for mu method A - mu method B: (-12.2, -5.3)
T-Test mu method A = mu method B (vs. not =): T = -5.20 P= DF = 25
Both use Pooled StDev = 4.35
398
EXCEL Output for Hernandez New-Employee Training Problem
t-Test: Two-Sample Assuming Equal Variances
                              Variable 1   Variable 2
Mean                          47.73        56.5
Variance                      19.495       18.27
Observations                  15           12
Pooled Variance               18.957
Hypothesized Mean Difference
df                            25
t Stat                        -5.20
P(T<=t) one-tail              1.12E-05
t Critical one-tail           1.71
P(T<=t) two-tail              2.23E-05
t Critical two-tail           2.06
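The pooled variance and t statistic in the output can be reproduced from the summary statistics alone. This sketch reads the means, variances, and sample sizes from the EXCEL output (taking the first mean as 47.73) and takes the two-tailed critical value 2.06 from the same output rather than computing it.

```python
from math import sqrt

# Summary statistics from the EXCEL output (Variable 1 = method A, Variable 2 = method B).
mean_a, var_a, n_a = 47.73, 19.495, 15
mean_b, var_b, n_b = 56.5, 18.27, 12

df = n_a + n_b - 2
pooled_variance = (var_a * (n_a - 1) + var_b * (n_b - 1)) / df

t = (mean_a - mean_b) / sqrt(pooled_variance * (1 / n_a + 1 / n_b))

t_critical = 2.06            # two-tailed critical value reported in the output
reject_h0 = abs(t) > t_critical

print(f"df = {df}, pooled variance = {pooled_variance:.3f}, t = {t:.2f}, reject H0: {reject_h0}")
```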
399
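Confidence Interval to Estimate μ1 − μ2 when σ1² and σ2² are unknown and σ1² = σ2²
(x̄1 − x̄2) ± t·sp·√(1/n1 + 1/n2), where sp = √((s1²(n1 − 1) + s2²(n2 − 1)) / (n1 + n2 − 2)) and df = n1 + n2 − 2.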
400
Demonstration Problem 10.4
A coffee manufacturer is interested in estimating the difference in the average daily coffee consumption of regular-coffee drinkers and decaffeinated-coffee drinkers. Its researcher randomly selects 13 regular-coffee drinkers and asks how many cups of coffee per day they drink. He randomly locates 15 decaffeinated-coffee drinkers and asks how many cups of coffee per day they drink. The average for the regular-coffee drinkers is 4.35 cups, with a standard deviation of 1.20 cups. The average for the decaffeinated-coffee drinkers is 6.84 cups, with a standard deviation of 1.42 cups. The researcher assumes, for each population, that the daily consumption is normally distributed, and he constructs a 95% confidence interval to estimate the difference in the averages of the two populations.
401
Demonstration Problem 10.4
402
Demonstration Problem 10.4
The researcher is 95% confident that the difference in population average daily consumption of cups of coffee between regular- and decaffeinated-coffee drinkers is between 1.46 cups and 3.52 cups.
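The interval for Demonstration Problem 10.4 can be sketched from the summary statistics in the problem. The critical value t = 2.056 (df = 26, α/2 = .025) is a t-table look-up supplied here. With regular-coffee drinkers as group 1, the computed bounds are negative, which corresponds in magnitude to the 1.46 and 3.52 cups quoted above.

```python
from math import sqrt

# Group 1: regular-coffee drinkers; group 2: decaffeinated-coffee drinkers.
mean_1, s_1, n_1 = 4.35, 1.20, 13
mean_2, s_2, n_2 = 6.84, 1.42, 15

df = n_1 + n_2 - 2
t_critical = 2.056           # t table value for df = 26, alpha/2 = .025

pooled_sd = sqrt((s_1 ** 2 * (n_1 - 1) + s_2 ** 2 * (n_2 - 1)) / df)
margin = t_critical * pooled_sd * sqrt(1 / n_1 + 1 / n_2)

difference = mean_1 - mean_2
lower, upper = difference - margin, difference + margin

print(f"{lower:.2f} <= mu1 - mu2 <= {upper:.2f}")
```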
403
Statistical Inferences for Two Related Populations
Dependent samples. Used in before-and-after studies. The after measurement is not independent of the before measurement.
404
Hypothesis Testing The researcher must determine whether the two samples are related to each other. The technique for related samples is different from the technique used to analyze independent samples. The matched-pairs test requires that the two samples be the same size.
405
Dependent Samples
Before and after measurements on the same individual
Studies of twins
Studies of spouses
Individual    1    2    3    4    5    6    7
Before       32   11   21   17   30   38   14
After        39   15   35   13   41   22   25
406
Hypothesis Testing
The following t test for dependent measures uses the sample difference, d, between individual matched samples as the basic measurement of analysis.
An analysis of d converts the problem from a two-sample problem to a single sample of differences.
407
Formulas for Dependent Samples
408
Hypothesis Testing
Analysis of data by this method involves calculating a t value and comparing it with a critical value obtained from the t table.
The n in the degrees of freedom (n – 1) is the number of matched pairs of scores.
409
P/E Ratios for Nine Randomly Selected Companies
Suppose a stock market investor is interested in determining whether there is a significant difference in the P/E (price to earnings) ratio for companies from one year to the next. In an effort to study this question, the investor randomly samples nine companies from the Handbook of Common Stocks and records the P/E ratios for each of these companies at the end of year 1 and at the end of year 2.
410
P/E Ratios for Nine Randomly Selected Companies
Company   2001 P/E Ratio   2002 P/E Ratio
1          8.9              12.7
2         38.1              45.4
3         43.0              10.0
4         34.0              27.2
5         34.5              22.8
6         15.2              24.1
7         20.3              32.3
8         19.9              40.1
9         61.9             106.5
411
Hypothesis Testing with Dependent Samples: P/E Ratios for Nine Companies
[Figure: t distribution showing the rejection regions, the non-rejection region, and the critical value]
412
Hypothesis Testing with Dependent Samples: P/E Ratios for Nine Companies
413
Hypothesis Testing with Dependent Samples: P/E Ratios for Nine Companies – MINITAB Output
414
Hypothesis Testing with Dependent Samples: P/E Ratios for Nine Companies
t-Test: Paired Two Sample for Means
                               2001 P/E Ratio   2002 P/E Ratio
Mean                           30.64            35.68
Variance                       268.1            837.5
Observations                   9
Pearson Correlation            0.674
Hypothesized Mean Difference
df                             8
t Stat                         -0.7
P(T<=t) one-tail               0.252
t Critical one-tail            1.86
P(T<=t) two-tail               0.504
t Critical two-tail            2.306
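The paired t statistic in this output can be recomputed from the nine pairs of P/E ratios. The sketch below follows the matched-pairs approach described earlier (a single sample of differences d with n – 1 degrees of freedom). The hypothesized mean difference of 0 is an assumption, since that cell is blank in the output, and `paired_t` is an illustrative name.

```python
import math

pe_2001 = [8.9, 38.1, 43.0, 34.0, 34.5, 15.2, 20.3, 19.9, 61.9]
pe_2002 = [12.7, 45.4, 10.0, 27.2, 22.8, 24.1, 32.3, 40.1, 106.5]

def paired_t(first, second, hypothesized_diff=0.0):
    """t statistic for matched pairs, computed on the differences d."""
    d = [a - b for a, b in zip(first, second)]
    n = len(d)
    d_bar = sum(d) / n
    # Sample standard deviation of the differences.
    s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))
    t_stat = (d_bar - hypothesized_diff) / (s_d / math.sqrt(n))
    return t_stat, n - 1

t_stat, df = paired_t(pe_2001, pe_2002)
print(f"t = {t_stat:.2f}, df = {df}")
```

Since |t| is well below the two-tail critical value of 2.306, the sketch leads to the same decision as the output: the null hypothesis is not rejected.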
415
Confidence Intervals
Researcher can be interested in estimating the mean difference in two populations for related samples.
This requires a confidence interval of D (the mean population difference of two related samples) to be constructed.
416
Confidence Intervals for Mean Difference for Related Samples
417
Difference in Number of New-House Sales
Realtor May 2001 May 2002 d 1 8 11 -3 2 19 30 -11 3 5 6 -1 4 9 13 -4 -2 7 15 17 -6 12 10 -7 14 22 -8 16 18
418
Statistical Inference about Two Population Proportions (p1 – p2)
Sample proportion is used (p̂1 – p̂2)
419
Confidence Interval for Mean Difference in Number of New-House Sales
The analyst estimates with a 99% level of confidence that the average difference in new-house sales for a real estate company in Indianapolis between 2005 and 2006 is between and houses.
420
Confidence Intervals-MINITAB Solution
421
Sampling Distribution of Differences in Sample Proportions
422
Z Formula for the Difference in Two Population Proportions
423
Hypothesis Testing
Because the population proportions are unknown, the standard deviation of the difference in two sample proportions is estimated by using the sample proportions as point estimates of the population proportions.
424
Z Formula to Test the Difference in Population Proportions
425
Testing the Difference in Population Proportions (Demonstration Problem 10.6)
[Figure: sampling distribution showing the rejection regions, the non-rejection region, and the critical values]
426
Testing the Difference in Population Proportions (Demonstration Problem 10.6)
427
Confidence Interval to Estimate p1 - p2
428
Example Problem: When do men shop for groceries?
429
F Test for Two Population Variances
430
Sheet Metal Example: Hypothesis Test for Equality of Two Population Variances
431
Sheet Metal Example Suppose a machine produces metal sheets that are specified to be 22 millimeters thick. Because of the machine, the operator, the raw material, the manufacturing environment, and other factors, there is variability in the thickness. Two machines produce these sheets. Operators are concerned about the consistency of the two machines. To test consistency, they randomly sample 10 sheets produced by machine 1 and 12 sheets produced by machine 2. The thickness measurements of sheets from each machine are given in the table on the following page. Assume sheet thickness is normally distributed in the population. How can we test to determine whether the variance from each sample comes from the same population variance (population variances are equal) or from different population variances (population variances are not equal)?
432
Sheet Metal Example Machine 1 Machine 2 22.3 21.8 22.2 21.9 21.6 22.4
22.5 Machine 2 22.0 22.1 21.7 51
433
Sheet Metal Example-MINITAB Solution
434
Sheet Metal Example-EXCEL Solution
435
Business Statistics, 6th ed. by Ken Black
Chapter 12 Simple Regression Analysis and Correlation
436
Learning Objectives
Compute the equation of a simple regression line from a sample of data, and interpret the slope and intercept of the equation.
Understand the usefulness of residual analysis in testing the assumptions underlying regression analysis and in examining the fit of the regression line to the data.
Compute a standard error of the estimate and interpret its meaning.
Compute a coefficient of determination and interpret it.
Test hypotheses about the slope of the regression model and interpret the results.
Estimate values of Y using the regression model.
437
Regression and Correlation
Regression analysis is the process of constructing a mathematical model or function that can be used to predict or determine one variable by another variable. Correlation is a measure of the degree of relatedness of two variables.
438
Pearson Product-Moment Correlation Coefficient
439
Degrees of Correlation
Correlation is a measure of the degree of relatedness of variables.
Coefficient of Correlation (r) - applicable only if both variables being analyzed have at least an interval level of data.
440
Degrees of Correlation
The term (r) is a measure of the linear correlation of two variables.
The number ranges from -1 to 0 to +1.
The closer r is to +1 or -1, the stronger the linear relationship between the dependent and the independent variables; values near 0 indicate little or no linear relationship.
See slide 3-82 for the formula for the Pearson product-moment correlation coefficient.
441
Three Degrees of Correlation
442
Computation of r for the Economics Example (Part 1)
Day          Interest X   Futures Index Y   X²        Y²        XY
1            7.43         221               55.205    48,841    1,642.03
2            7.48         222               55.950    49,284    1,660.56
3            8.00         226               64.000    51,076    1,808.00
4            7.75         225               60.063    50,625    1,743.75
5            7.60         224               57.760    50,176    1,702.40
6            7.63         223               58.217    49,729    1,701.49
7            7.68         223               58.982    49,729    1,712.64
8            7.67         226               58.829    51,076    1,733.42
9            7.59         226               57.608    51,076    1,715.34
10           8.07         235               65.125    55,225    1,896.45
11           8.03         233               64.481    54,289    1,870.99
12           8.00         241               64.000    58,081    1,928.00
Summations   92.93        2,725             720.220   619,207   21,115.07
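The Pearson product-moment correlation for this example can be computed directly from the twelve (X, Y) pairs. This is a minimal sketch; the Y values for days 7 to 9 and the X value for day 12 are inferred from the XY column and the column totals, and `pearson_r` is an illustrative name rather than anything defined in the slides.

```python
import math

interest = [7.43, 7.48, 8.00, 7.75, 7.60, 7.63, 7.68, 7.67, 7.59, 8.07, 8.03, 8.00]
futures = [221, 222, 226, 225, 224, 223, 223, 226, 226, 235, 233, 241]

def pearson_r(xs, ys):
    """Pearson correlation from sums of squared deviations and cross-products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    ss_yy = sum((y - mean_y) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)

r = pearson_r(interest, futures)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

The value of r is about .815, so r squared, the share of variability in Y associated with X, is noticeably smaller than r itself.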
443
Computation of r Economics Example (Part 2)
444
Computation of r Economics Example (Part 2)
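r = .815 indicates a fairly strong positive linear relationship between the interest rate and the futures index. Note that r itself is not the proportion of variability explained; that proportion is r² ≈ .66. Is .815 high or low?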
445
Simple Regression Analysis
Bivariate (two variables) linear regression -- the most elementary regression model
dependent variable, the variable to be predicted, usually called Y
independent variable, the predictor or explanatory variable, usually called X
Nonlinear relationships and regression models with more than one independent variable can be explored by using multiple regression models
446
Regression Models
Deterministic Regression Model: Y = β0 + β1X
Probabilistic Regression Model: Y = β0 + β1X + ε
β0 and β1 are population parameters
β0 and β1 are estimated by sample statistics b0 and b1
447
Equation of the Simple Regression Line
448
Least Squares Analysis
Least squares analysis is a process whereby a regression model is developed by producing the minimum sum of the squared error values. The vertical distance from each point to the line is the error of the prediction. The least squares regression line is the regression line that results in the smallest sum of errors squared.
449
Least Squares Analysis
450
Least Squares Analysis
451
Solving for b1 and b0 of the Regression Line: Airline Cost Example (Part 1)
Number of Passengers X   Cost ($1,000) Y   X²      XY
61                       4.28              3,721   261.08
63                       4.08              3,969   257.04
67                       4.42              4,489   296.14
69                       4.17              4,761   287.73
70                       4.48              4,900   313.60
74                       4.30              5,476   318.20
76                       4.82              5,776   366.32
81                       4.70              6,561   380.70
86                       5.11              7,396   439.46
91                       5.13              8,281   466.83
95                       5.64              9,025   535.80
97                       5.56              9,409   539.32
ΣX = 930                 ΣY = 56.69        ΣX² = 73,764   ΣXY = 4,462.22
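The column sums in this table are exactly what the least squares formulas need. The sketch below computes the slope b1 and intercept b0 with the usual computational formulas, b1 = (ΣXY - ΣXΣY/n) / (ΣX² - (ΣX)²/n) and b0 = ȳ - b1·x̄. The function name `least_squares` is illustrative; the costs for 61 and 63 passengers (4.28 and 4.08) are taken from the SSE table later in the chapter.

```python
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]

def least_squares(xs, ys):
    """Slope and intercept of the simple regression line from column sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    ss_xy = sum_xy - sum_x * sum_y / n
    ss_xx = sum_x2 - sum_x**2 / n
    b1 = ss_xy / ss_xx
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1

b0, b1 = least_squares(passengers, cost)
print(f"y-hat = {b0:.2f} + {b1:.4f} x")
```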
452
Solving for b1 and b0 of the Regression Line: Airline Cost Example (Part 2)
453
Airline Cost: Excel Summary Output
Regression Statistics Multiple R R Square Adjusted R Square Standard Error Observations 12 ANOVA df SS MS F Significance F Regression 1 2.7E-06 Residual 10 Total 11 Coefficients t Stat P-value Intercept Number of Passengers 2.692E-06
454
Airline Cost: MINITAB Summary Output
455
Residual Analysis: Airline Cost Example
Number of Passengers X   Cost ($1,000) Y   Predicted Value Ŷ   Residual Y - Ŷ
61                       4.28              4.053                .227
63                       4.08              4.134               -.054
67                       4.42              4.297                .123
69                       4.17              4.378               -.208
70                       4.48              4.419                .061
74                       4.30              4.582               -.282
76                       4.82              4.663                .157
81                       4.70              4.867               -.167
86                       5.11              5.070                .040
91                       5.13              5.274               -.144
95                       5.64              5.436                .204
97                       5.56              5.518                .042
Σ(Y - Ŷ) = -.001
456
Demonstration Problem 14.2
Compute the residuals for Demonstration Problem 12.1 in which a regression model was developed to predict the number of full-time equivalent workers (FTEs) by the number of beds in a hospital. Analyze the residuals by using MINITAB graphic diagnostics.
457
Demonstration Problem 14.2 – MINITAB Computations for Residuals
458
Standard Error of the Estimate
Residuals represent errors of estimation for individual points. A more useful measurement of error is the standard error of the estimate. The standard error of the estimate, denoted se, is a standard deviation of the error of the regression model.
459
Standard Error of the Estimate
Sum of Squares Error: SSE = Σ(Y - Ŷ)²
Standard Error of the Estimate: se = √(SSE / (n - 2))
460
Determining SSE for the Airline Cost Example
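Number of Passengers X   Cost ($1,000) Y   Residual Y - Ŷ   (Y - Ŷ)²
61                       4.28               .227             .05153
63                       4.08              -.054             .00292
67                       4.42               .123             .01513
69                       4.17              -.208             .04326
70                       4.48               .061             .00372
74                       4.30              -.282             .07952
76                       4.82               .157             .02465
81                       4.70              -.167             .02789
86                       5.11               .040             .00160
91                       5.13              -.144             .02074
95                       5.64               .204             .04162
97                       5.56               .042             .00176
Σ(Y - Ŷ) = -.001         Σ(Y - Ŷ)² = .31434
Sum of squares of error = SSE = .31434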
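The residuals, SSE, and standard error of the estimate can be recomputed from the raw data as a check. This sketch fits the line with unrounded coefficients, so its SSE (about .3141) differs slightly from the .31434 in the table, which is built from predicted values rounded to three decimals. The helper name `fit_line` is illustrative.

```python
import math

passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]

def fit_line(xs, ys):
    """Least squares intercept and slope using deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    return mean_y - b1 * mean_x, b1

b0, b1 = fit_line(passengers, cost)
residuals = [y - (b0 + b1 * x) for x, y in zip(passengers, cost)]
sse = sum(e**2 for e in residuals)               # sum of squares of error
se = math.sqrt(sse / (len(passengers) - 2))      # standard error of the estimate
print(f"SSE = {sse:.4f}, se = {se:.4f}")
```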
461
Determining SSE for the Airline Cost Example – MINITAB Output
462
Standard Error of the Estimate for the Airline Cost Example
Sum of Squares Error: SSE = Σ(Y - Ŷ)² = .31434
Standard Error of the Estimate: se = √(SSE / (n - 2)) = √(.31434 / 10) = .1773
463
Standard Error of the Estimate for the Airline Cost Example
464
Coefficient of Determination
The coefficient of determination is the proportion of variability of the dependent variable (y) accounted for or explained by the independent variable (x). The coefficient of determination ranges from 0 to 1. An r² of zero means that the predictor accounts for none of the variability of the dependent variable and that there is no regression prediction of y by x. An r² of 1 means perfect prediction of y by x and that 100% of the variability of y is accounted for by x.
465
Coefficient of Determination
466
Coefficient of Determination for the Airline Cost Example
89.9% of the variability of the cost of flying a Boeing 737 is accounted for by the number of passengers.
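The 89.9% figure can be checked with r² = 1 - SSE/SSyy, where SSyy is the total sum of squares of y about its mean. This is a small sketch using the airline data from the earlier tables; the variable names are illustrative.

```python
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]

n = len(passengers)
mean_x, mean_y = sum(passengers) / n, sum(cost) / n
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(passengers, cost))
ss_xx = sum((x - mean_x) ** 2 for x in passengers)
ss_yy = sum((y - mean_y) ** 2 for y in cost)     # total variability of y

b1 = ss_xy / ss_xx
b0 = mean_y - b1 * mean_x
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(passengers, cost))

r_squared = 1 - sse / ss_yy                      # proportion explained by x
print(f"r squared = {r_squared:.3f}")
```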
467
Coefficient of Determination for the Airline Cost Example
468
Hypothesis Tests for the Slope of the Regression Model
A hypothesis test can be conducted on the sample slope of the regression model to determine whether the population slope is significantly different from zero. Using the non-regression model (the ȳ model, which predicts every y by the mean of y) as a worst case, the researcher can analyze the regression line to determine whether it adds significantly more predictability of y than does the ȳ model.
469
Hypothesis Tests for the Slope of the Regression Model
As the slope of the regression line diverges from zero, the regression model adds predictability that the ȳ line does not provide. Testing the slope of the regression line to determine whether the slope is different from zero is therefore important. If the slope is not different from zero, the regression line does nothing more than the ȳ line (the average of y) in predicting y.
470
Hypothesis Tests for the Slope of the Regression Model
471
Hypothesis Test: Airline Cost Example
472
Hypothesis Test: Airline Cost Example
so reject H0 Note: P-value = 0.000
473
Hypothesis Test: Airline Cost Example
The t value calculated from the sample slope falls in the rejection region and the p-value is .000. The null hypothesis that the population slope is zero is rejected. This linear regression model adds significantly more predictive information than the ȳ model (no regression).
474
Testing the Overall Model
It is common in regression analysis to compute an F test to determine the overall significance of the model. In multiple regression, this test determines whether at least one of the regression coefficients (from multiple predictors) is different from zero. Simple regression provides only one predictor and only one regression coefficient to test. Because the regression coefficient is the slope of the regression line, the F test for overall significance is testing the same thing as the t test in simple regression
475
Testing the Overall Model
476
Testing the Overall Model
F = 89.09 > 4.96, so reject H0. Note: P-value = 0.000
477
Testing the Overall Model
The difference between this value (89.09) and the value obtained by squaring the t statistic (88.92) is due to rounding error. The probability of obtaining an F value this large or larger by chance if there is no regression prediction in this model is .000 according to the ANOVA output (the p-value).
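The relationship between the two tests can be verified numerically. The sketch below computes the t statistic for the slope, t = b1 / (se / √SSxx), and the overall F = MSR / MSE for the airline data; with unrounded intermediate values, t² and F coincide, which supports the rounding explanation above. Variable names are illustrative.

```python
import math

passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]

n = len(passengers)
mean_x, mean_y = sum(passengers) / n, sum(cost) / n
ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(passengers, cost))
ss_xx = sum((x - mean_x) ** 2 for x in passengers)
ss_yy = sum((y - mean_y) ** 2 for y in cost)

b1 = ss_xy / ss_xx
b0 = mean_y - b1 * mean_x
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(passengers, cost))

mse = sse / (n - 2)                   # residual mean square, df = n - 2
msr = (ss_yy - sse) / 1               # regression mean square, df = 1
t_stat = b1 / (math.sqrt(mse) / math.sqrt(ss_xx))
f_stat = msr / mse
print(f"t = {t_stat:.2f}, t squared = {t_stat**2:.2f}, F = {f_stat:.2f}")
```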
478
Estimation One of the main uses of regression analysis is as a prediction tool. If the regression function is a good model, the researcher can use the regression equation to determine values of the dependent variable from various values of the independent variable. In simple regression analysis, a point estimate prediction of y can be made by substituting the associated value of x into the regression equation and solving for y.
479
Point Estimation for the Airline Cost Example
480
Confidence Interval to Estimate Y: Airline Cost Example
481
Confidence Interval to Estimate the Average Value of Y for some Values of X: Airline Cost Example
482
Prediction Interval to Estimate Y for a given value of X
483
MINITAB Regression Analysis of the Airline Cost Example
484
Pearson Product-Moment Correlation Coefficient
485
Pearson Product-Moment Correlation Coefficient- MINITAB output for Airline Cost Example
486
Business Statistics, 6th ed. by Ken Black
Chapter 16 Analysis of Categorical Data
487
Learning Objectives
Understand the χ² goodness-of-fit test and how to use it.
Analyze data using the χ² test of independence.
488
χ² Goodness-of-Fit Test
The χ² goodness-of-fit test compares expected (theoretical) frequencies of categories from a population distribution to the observed (actual) frequencies from a distribution to determine whether there is a difference between what was expected and what was observed. The chi-square goodness-of-fit test is used to analyze probabilities of multinomial distribution trials along a single dimension.
489
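χ² Goodness-of-Fit Test
The formula used to compute the test statistic for a chi-square goodness-of-fit test is χ² = Σ (fo - fe)² / fe, with df = k - 1 - c, where fo is the frequency of observed values, fe is the frequency of expected values, k is the number of categories, and c is the number of parameters estimated from the sample data.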
490
χ² Goodness-of-Fit Test
The formula compares the frequency of observed values to the frequency of the expected values across the distribution.
The test loses one degree of freedom because the total of the expected frequencies must equal the total of the observed frequencies.
The chi-square distribution is the distribution of the sum of the squares of k independent standard normal random variables.
It can never be less than zero; it extends indefinitely in the positive direction.
491
Milk Sales Data for Demonstration Problem 12.1
Dairies would like to know whether the sales of milk are distributed uniformly over a year so they can plan for milk production and storage. A uniform distribution means that the frequencies are the same in all categories. In this situation, the producers are attempting to determine whether the amounts of milk sold are the same for each month of the year. They ascertain the number of gallons of milk sold by sampling one large supermarket each month during a year, obtaining the following data. Use α = .01 to test whether the data fit a uniform distribution.
492
Milk Sales Data for Demonstration Problem 12.1
Month       Gallons
January     1,610
February    1,585
March       1,649
April       1,590
May         1,540
June        1,397
July        1,410
August      1,350
September   1,495
October     1,564
November    1,602
December    1,655
Total       18,447
493
Hypotheses and Decision Rules for Demonstration Problem 12.1
494
Calculations for Demonstration Problem 12.1
Month       fo       fe          (fo - fe)²/fe
January     1,610    1,537.25     3.44
February    1,585    1,537.25     1.48
March       1,649    1,537.25     8.12
April       1,590    1,537.25     1.81
May         1,540    1,537.25     0.00
June        1,397    1,537.25    12.80
July        1,410    1,537.25    10.53
August      1,350    1,537.25    22.81
September   1,495    1,537.25     1.16
October     1,564    1,537.25     0.47
November    1,602    1,537.25     2.73
December    1,655    1,537.25     9.02
Total       18,447   18,447.00   74.38
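The calculation in this table can be sketched in a few lines. Under the uniform hypothesis every month has the same expected frequency, 18,447 / 12 = 1,537.25, and the test statistic sums (fo - fe)²/fe over the twelve months. The function name `chi_square_gof` is illustrative.

```python
gallons = [1610, 1585, 1649, 1590, 1540, 1397,
           1410, 1350, 1495, 1564, 1602, 1655]

def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (fo - fe)^2 / fe."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# Uniform distribution: the same expected frequency in every category.
expected = [sum(gallons) / len(gallons)] * len(gallons)
chi2 = chi_square_gof(gallons, expected)
df = len(gallons) - 1   # k - 1; no parameters are estimated from the data
print(f"chi-square = {chi2:.2f}, df = {df}")
```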
495
Calculations for Demonstration Problem 12.1
The observed chi-square value of 74.38 is greater than the critical value of 24.725 (α = .01, df = 11). The decision is to reject the null hypothesis. The data provide enough evidence to indicate that the distribution of milk sales is not uniform.
496
Calculations for Demonstration Problem 12.1
497
Bank Customer Arrival Data for Demonstration Problem 12.2
An earlier chapter indicated that, quite often in the business world, random arrivals are Poisson distributed. This distribution is characterized by an average arrival rate, λ, per some interval. Suppose a teller supervisor believes the distribution of random arrivals at a local bank is Poisson and sets out to test this hypothesis by gathering information. The following data represent a distribution of frequency of arrivals during 1-minute intervals at the bank. Use α = .05 to test these data in an effort to determine whether they are Poisson distributed.
498
Bank Customer Arrival Data for Demonstration Problem 12.2
Number of Arrivals   Observed Frequencies
0                     7
1                    18
2                    25
3                    17
4                    12
≥5                    5
499
Hypotheses and Decision Rules for Demonstration Problem 12.2
500
Calculations for Demonstration Problem 12.2: Estimating the Mean Arrival Rate
Number of Arrivals X   Observed Frequencies f   f·X
0                       7                        0
1                      18                       18
2                      25                       50
3                      17                       51
4                      12                       48
≥5                      5                       25
Total                  84                      192
Mean Arrival Rate: λ = Σ(f·X) / Σf = 192 / 84 ≈ 2.3
501
Calculations for Demonstration Problem 12.2: Poisson Probabilities for λ = 2.3
Number of Arrivals X   Expected Probabilities P(X)   Expected Frequencies n·P(X)
0                      0.1003                         8.42
1                      0.2306                        19.37
2                      0.2652                        22.28
3                      0.2033                        17.08
4                      0.1169                         9.82
≥5                     0.0838                         7.04
502
χ² Calculations for Demonstration Problem 12.2
Number of Arrivals X   Observed Frequencies fo   Expected Frequencies fe = n·P(X)   (fo - fe)²/fe
0                       7                         8.42                              0.24
1                      18                        19.37                              0.10
2                      25                        22.28                              0.33
3                      17                        17.08                              0.00
4                      12                         9.82                              0.48
≥5                      5                         7.04                              0.59
Total                  84                        84.00                              1.74
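The three steps of this problem (estimating λ, computing Poisson expected frequencies, and summing the chi-square terms) can be sketched as follows. The assumptions are that the last category collects five or more arrivals, so its probability is one minus the others, and that λ is rounded to 2.3 as on the slides. One extra degree of freedom is lost because λ is estimated from the sample.

```python
import math

observed = [7, 18, 25, 17, 12, 5]   # arrivals of 0, 1, 2, 3, 4, and 5 or more
n = sum(observed)

# Step 1: estimate the mean arrival rate from the data (192 / 84).
mean_rate = sum(x * f for x, f in enumerate(observed)) / n
lam = round(mean_rate, 1)           # rounded to 2.3, as on the slides

# Step 2: Poisson probabilities for 0..4, with the remainder assigned to ">= 5".
probs = [math.exp(-lam) * lam**x / math.factorial(x) for x in range(5)]
probs.append(1 - sum(probs))
expected = [n * p for p in probs]

# Step 3: chi-square statistic and degrees of freedom (k - 1 - 1).
chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
df = len(observed) - 1 - 1
print(f"lambda = {lam}, chi-square = {chi2:.2f}, df = {df}")
```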
503
Calculations for Demonstration Problem 12.2
The observed chi-square value of 1.74 is less than the critical value of 9.488 (α = .05, df = 4). The decision is not to reject the null hypothesis. The data do not provide enough evidence to indicate that the distribution of bank arrivals is not Poisson.
504
Calculations for Demonstration Problem 12.2
505
Using a χ² Goodness-of-Fit Test to Test a Population Proportion
506
Using a χ² Goodness-of-Fit Test to Test a Population Proportion: Calculations
              fo    fe
Defects       33    16
Nondefects   167   184
n =          200   200
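With only two categories the same goodness-of-fit statistic tests a single proportion. The sketch below derives the expected frequencies from the hypothesized 8% defect rate and n = 200, then sums the two chi-square terms; the variable names are illustrative.

```python
n = 200
hypothesized_defect_rate = 0.08

observed = [33, 167]                                   # defects, nondefects
expected = [n * hypothesized_defect_rate,
            n * (1 - hypothesized_defect_rate)]        # 16 and 184

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))
df = len(observed) - 1
sample_proportion = observed[0] / n
print(f"chi-square = {chi2:.2f}, df = {df}, sample proportion = {sample_proportion:.3f}")
```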
507
Using a χ² Goodness-of-Fit Test to Test a Population Proportion
The observed chi-square value of 19.63 is greater than the critical value, so the decision is to reject the null hypothesis. The data provide enough evidence to indicate that the manufacturer does not produce 8% defective items. Observing the actual sample result, in which 16.5% of the sample was defective, indicates that the proportion of the population that is defective might be greater than 8%.
508
Using a χ² Goodness-of-Fit Test to Test a Population Proportion – MINITAB Solution
509
χ² Test of Independence
Chi-square goodness-of-fit test - used to analyze the distribution of frequencies for categories of one variable to determine whether the distribution of these frequencies is the same as some hypothesized or expected distribution.
The goodness-of-fit test cannot be used to analyze two variables simultaneously.
Chi-square test of independence - used to analyze the frequencies of two variables with multiple categories to determine whether the two variables are independent.
510
χ² Test of Independence
A different chi-square test, the chi-square test of independence, can be used to analyze the frequencies of two variables with multiple categories to determine whether the two variables are independent.
511
χ² Test of Independence: Investment Example
Suppose a business researcher is interested in determining whether geographic region is independent of type of financial investment. On a questionnaire, the following two questions might be used to measure geographic region and type of financial investment. In which region of the country do you reside?
512
χ² Test of Independence: Investment Example
In which region of the country do you reside? A. Northeast B. Midwest C. South D. West
Which type of financial investment are you most likely to make today? E. Stocks F. Bonds G. Treasury bills
Contingency Table
                     Type of Financial Investment
Geographic Region    E     F     G     Total
A                                O13   nA
B                                      nB
C                                      nC
D                                      nD
Total                nE    nF    nG    N
513
χ² Test of Independence: Investment Example
Contingency Table
                     Type of Financial Investment
Geographic Region    E     F     G     Total
A                          e12         nA
B                                      nB
C                                      nC
D                                      nD
Total                nE    nF    nG    N
514
χ² Test of Independence: Formulas
Expected Frequencies:
e_ij = (n_i)(n_j) / N
where: i = the row
       j = the column
       n_i = the total of row i
       n_j = the total of column j
       N = the total of all frequencies
Calculated χ² (Observed χ²):
χ² = ΣΣ (fo - fe)² / fe
df = (r - 1)(c - 1)
where: r = the number of rows
       c = the number of columns
515
χ² Test of Independence: Gasoline Preference Versus Income Category
Suppose a business researcher wants to determine whether type of gasoline preferred is independent of a person’s income. She takes a random survey of gasoline purchasers, asking them one question about gasoline preference and a second question about income. The respondent is to check whether he or she prefers (1) regular gasoline, (2) premium gasoline, or (3) extra premium gasoline. The respondent also is to check his or her income brackets as being (1) less than $30,000, (2) $30,000 to $49,999, (3) $50,000 to $99,999, or (4) more than $100,000.
516
χ² Test of Independence: Gasoline Preference Versus Income Category
                       Type of Gasoline
Income                 Regular   Premium   Extra Premium
Less than $30,000
$30,000 to $49,999
$50,000 to $99,999
At least $100,000
r = 4, c = 3
517
Gasoline Preference Versus Income Category: Observed Frequencies
                       Type of Gasoline
Income                 Regular   Premium   Extra Premium   Total
Less than $30,000       85        16         6             107
$30,000 to $49,999     102        27        13             142
$50,000 to $99,999      36        22        15              73
At least $100,000       15        23        25              63
Total                  238        88        59             385
518
Gasoline Preference Versus Income Category: Expected Frequencies
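Observed frequencies, with expected frequencies in parentheses:
                       Type of Gasoline
Income                 Regular        Premium       Extra Premium   Total
Less than $30,000       85 (66.15)    16 (24.46)     6 (16.40)      107
$30,000 to $49,999     102 (87.78)    27 (32.46)    13 (21.76)      142
$50,000 to $99,999      36 (45.13)    22 (16.69)    15 (11.19)       73
At least $100,000       15 (38.95)    23 (14.40)    25 (9.65)        63
Total                  238            88            59              385
e_ij = (n_i)(n_j) / N
e11 = (107)(238) / 385 = 66.15
e12 = (107)(88) / 385 = 24.46
e13 = (107)(59) / 385 = 16.40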
519
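Gasoline Preference Versus Income Category: χ² Calculation
χ² = ΣΣ (fo - fe)² / fe
   = (85 - 66.15)²/66.15 + (16 - 24.46)²/24.46 + (6 - 16.40)²/16.40
   + (102 - 87.78)²/87.78 + (27 - 32.46)²/32.46 + (13 - 21.76)²/21.76
   + (36 - 45.13)²/45.13 + (22 - 16.69)²/16.69 + (15 - 11.19)²/11.19
   + (15 - 38.95)²/38.95 + (23 - 14.40)²/14.40 + (25 - 9.65)²/9.65
   = 70.78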
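The test of independence can be sketched directly from the observed contingency table. Each expected frequency is (row total)(column total)/N, and the statistic sums (fo - fe)²/fe over all twelve cells with (r - 1)(c - 1) degrees of freedom. The first cell of the last row (15) is inferred from the row and column totals, and `chi_square_independence` is an illustrative name. Because the expected frequencies are not rounded here, the result is slightly below the hand-calculated value.

```python
observed = [
    [85, 16, 6],     # less than $30,000
    [102, 27, 13],   # $30,000 to $49,999
    [36, 22, 15],    # $50,000 to $99,999
    [15, 23, 25],    # at least $100,000
]

def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / grand_total   # e_ij
            chi2 += (fo - fe) ** 2 / fe
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

chi2, df = chi_square_independence(observed)
print(f"chi-square = {chi2:.2f}, df = {df}")
```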
520
Gasoline Preference Versus Income Category
The observed chi-square value of 70.78 is greater than the critical value, so the decision is to reject the null hypothesis. The data provide enough evidence to indicate that the type of gasoline preferred is not independent of income.
521
Gasoline Preference Versus Income Category: χ² Calculation
522
Gasoline Preference Versus Income Category: MINITAB Output
523
Business Statistics, 6th ed. by Ken Black
Chapter 18 Statistical Quality Control
524
Learning Objectives
Understand the concepts of quality, quality control, and total quality management.
Understand the importance of statistical quality control in total quality management.
Learn about process analysis and some process analysis tools.
Learn how to construct x-bar charts, R charts, p charts, and c charts.
Understand the theory and application of acceptance sampling.
525
Quality
Quality is when a product delivers what is stipulated for in its specifications
Crosby: “quality is conformance to requirements”
Feigenbaum: “quality is a customer determination”
526
Garvin’s Five Dimensions of Quality
Transcendent quality: “innate excellence”
Product quality: quality is measurable
User quality: quality is determined by the consumer
Manufacturing quality: quality is measured by the manufacturer's ability to target the product specifications with little variability
Value quality: has to do with the price and cost
527
Quality Control
Quality control - the collection of strategies, techniques, and actions taken by an organization to assure itself of a quality product.
After-process quality control - involves inspecting the attributes of a finished product to determine whether the product is acceptable
  reporting of the number of defects per time period
  screening defective products from consumers
In-process quality control - techniques that measure product attributes at various intervals throughout the manufacturing process in an effort to pinpoint problem areas.
528
Important Quality Concepts
Benchmarking - examine and emulate the best practices and techniques used in the industry
  a positive, proactive process to make changes that will effect superior performance
Just-In-Time Inventory Systems - necessary parts for production arrive “just in time”
  reduced holding costs, personnel, and space needed for inventory
  no extra raw materials or inventory of parts for production are stored
529
Important Quality Concepts
Reengineering - complete redesign of the core business process in a company
Six sigma - total quality approach that measures the capacity of a process to perform defect-free work
Team Building - employee groups take on managerial responsibilities
530
Process Analysis
A process is a series of actions, changes, or functions that bring about a result.
Flowcharts - schematic representation of all the activities and interactions that occur in a process
Pareto Analysis - quantitative tallying of the number and types of defects that occur with a product
Pareto Chart - ranked vertical bar chart with the most frequently occurring defect on the left
Fishbone Diagram - display of potential cause-and-effect relationships
Control Charts - graphical method for evaluating whether a process is or is not in a “state of statistical control”
531
MINITAB Pareto Chart
532
Types of Control Charts
Control chart – graphical method for evaluating whether a process is or is not in a “state of statistical control”
Control charts for measurements
x-bar charts - graph of sample means computed for a series of small random samples over a period of time
R charts - plot of the sample ranges, often used in conjunction with an x-bar chart
533
Types of Control Charts
Control charts for compliance items
p charts – graph the proportion of sample items in noncompliance for multiple samples
c charts - display the number of nonconformances per item or unit
534
X-bar Control Chart: monitor process location (center)
Decide on the quality to be measured.
Determine a sample size.
Gather 20 to 30 samples.
Compute the sample average for each sample.
Compute the sample range for each sample.
Determine the average sample mean for all samples.
Determine the average sample range (or sample standard deviation) for all samples.
Using the size of the samples, determine the value of A2 or A3.
Compute the UCL and the LCL.
535
Control Chart: Formulas
536
Data for Demonstration Problem 18.1: Samples 1 - 10
A manufacturing facility produces bearings. The diameter specified for the bearings is 5 millimeters. Every 10 minutes, six bearings are sampled and their diameters are measured and recorded. Twenty of these samples of six bearings are gathered. Use the resulting data to construct an x-bar chart.
537
Data for Demonstration Problem 18.1: Samples 1 - 10
Sample   1        2        3        4        5        6        7        8        9        10
         5.13     4.96     5.21     5.02     5.12     4.98     4.99     4.96     4.96     5.03
         4.92     4.98     4.87     5.09     5.08     5.02     5.00     5.01     5.00     4.99
         5.01     4.95     5.02     4.99     5.09     4.97     5.00     5.02     4.91     4.96
         4.88     4.96     5.08     5.02     5.13     4.99     5.02     5.05     4.87     5.14
         5.05     5.01     5.12     5.03     5.06     4.98     5.01     5.04     4.96     5.11
         4.97     4.89     5.04     5.01     5.13     4.99     5.01     5.02     5.01     5.04
X-bar    4.9933   4.9583   5.0567   5.0267   5.1017   4.9883   5.0050   5.0167   4.9517   5.0450
R        0.25     0.12     0.34     0.10     0.07     0.05     0.03     0.09     0.14     0.18
538
Data for Demonstration Problem 18.1: Samples 11 - 20
12 13 14 15 16 17 18 19 20 4.91 4.97 5.09 4.96 4.99 5.01 5.05 4.90 5.04 4.93 4.85 5.03 5.02 4.82 5.00 5.12 4.98 5.07 4.95 5.06 4.88 5.13 4.92 4.86 4.9333 4.9567 5.0483 4.9600 4.9883 5.0767 5.0317 4.9617 4.9200 5.0233 0.22 0.11 0.16 0.21 0.06 0.12 0.09 0.17 X R 16
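The x-bar chart limits for this problem can be sketched from the 20 sample means listed above. Two inputs are assumptions: the average range R-bar = .1360 is the value reported on the R chart for the same data, and A2 = 0.483 is the standard control-chart factor for samples of size 6 taken from published factor tables rather than from these slides. The limits follow the usual form, centerline ± A2·R-bar.

```python
sample_means = [
    4.9933, 4.9583, 5.0567, 5.0267, 5.1017, 4.9883, 5.0050, 5.0167, 4.9517, 5.0450,
    4.9333, 4.9567, 5.0483, 4.9600, 4.9883, 5.0767, 5.0317, 4.9617, 4.9200, 5.0233,
]
R_BAR = 0.1360   # average sample range, as reported on the R chart
A2_N6 = 0.483    # assumed table factor for samples of size n = 6

center = sum(sample_means) / len(sample_means)   # average of the sample means
ucl = center + A2_N6 * R_BAR
lcl = center - A2_N6 * R_BAR

# Samples whose mean falls outside the control limits (1-based numbering).
out_of_control = [i + 1 for i, m in enumerate(sample_means) if m > ucl or m < lcl]
print(f"centerline = {center:.4f}, UCL = {ucl:.4f}, LCL = {lcl:.4f}")
print("samples outside the limits:", out_of_control)
```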
539
Demonstration Problem 18.1: Control Chart Computations
540
Demonstration Problem 18.1: Control Chart
[Figure: x-bar control chart of bearing diameter means for samples 1 to 20, sigma level 3, showing the UCL, the average (centerline), and the LCL]
541
R Chart: monitor process variation
Decide on the quality to be measured.
Determine a sample size.
Gather 20 to 30 samples.
Compute the sample range for each sample.
Determine the average sample range for all samples.
Using the size of the samples, determine the values of D3 and D4.
Compute the UCL and the LCL.
542
R Chart Formulas
543
Demonstration Problem 18.2: R Control Chart
Construct an R chart for the 20 samples of data in Demonstration Problem 18.1 on bearings.
544
Demonstration Problem 18.2: R Control Chart
[Figure: R control chart of bearing diameter ranges for samples 1 to 20, sigma level 3: UCL = .2725, Average = .1360, LCL = .0000]
545
P Charts: monitor proportion in noncompliance
Decide on the quality to be measured.
Determine a sample size.
Gather 20 to 30 samples.
Compute the sample proportion for each sample.
Determine the average sample proportion for all samples.
Compute the UCL and the LCL.
546
P Chart Formulas
547
Demonstration Problem 18.3: Twenty Samples of Bond Paper
A company produces bond paper and, at regular intervals, samples of 50 sheets of paper are inspected. Suppose 20 random samples of 50 sheets of paper each are taken during a certain period of time, with the following numbers of sheets in noncompliance per sample. Construct a p chart from these data.
548
Demonstration Problem 18.3: Twenty Samples of Bond Paper
Number Out of Compliance 1 50 4 11 2 3 12 6 13 14 5 15 16 7 17 8 18 9 19 10 20 24
549
Demonstration Problem 18.3: Preliminary Calculations
Sample n nnon 1 50 4 0.08 11 2 0.04 3 0.06 12 6 0.12 0.02 13 0.00 14 5 0.10 15 16 7 17 8 18 9 19 10 20 p 25
550
Demonstration Problem 18.3: Centerline, UCL, and LCL Computations
551
Demonstration Problem 18.3: P Control Chart
[Figure: p control chart for sample numbers 1 to 20: centerline p = .053, UCL = .148, LCL = 0]
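The limits shown on this chart follow from the centerline and the sample size. A minimal sketch, assuming the usual three-sigma p chart form p ± 3·√(p·q/n) with a lower limit truncated at zero; the centerline .053 and the sample size of 50 sheets come from the problem, while `p_chart_limits` is an illustrative name.

```python
import math

def p_chart_limits(p_bar, n):
    """Three-sigma limits for a p chart; a negative LCL is set to 0."""
    sigma = math.sqrt(p_bar * (1 - p_bar) / n)
    lcl = max(0.0, p_bar - 3 * sigma)
    ucl = p_bar + 3 * sigma
    return lcl, p_bar, ucl

lcl, center, ucl = p_chart_limits(0.053, 50)
print(f"LCL = {lcl:.3f}, centerline = {center:.3f}, UCL = {ucl:.3f}")
```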
552
Demonstration Problem 18.3: MINITAB P Control Chart
553
c Charts: monitor number of nonconformances per item
Decide on nonconformances to be evaluated.
Determine the number of items to be studied (at least 25).
Gather items.
Determine the value of c for each item by summing the number of nonconformances in the item.
Determine the average number of nonconformances per item.
Determine the UCL and the LCL.
554
c Chart Formulas
555
Demonstration Problem 18.4: Number of Nonconformities in Oil Gauges
A manufacturer produces gauges to measure oil pressure. As part of the company’s statistical process control, 25 gauges are randomly selected and tested for non-conformances. The results are shown here. Use these data to construct a c chart that displays the non-conformances per item.
556
Demonstration Problem 18.4: Number of Nonconformities in Oil Gauges
Item Number Number of Nonconformities 1 2 14 15 3 16 4 17 5 18 6 19 7 20 8 21 9 22 10 23 11 24 12 25 13 30
557
Demonstration Problem 18.4: c Chart Calculations
558
Demonstration Problem 18.4: c Chart
[Figure: c control chart for item numbers 1 to 25: centerline c = 2.0, UCL = 6.2, LCL = 0]
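The c chart limits can be checked the same way. This sketch assumes the usual form c ± 3·√c, which rests on the count of nonconformances per item behaving like a Poisson variable, with a negative lower limit set to zero; `c_chart_limits` is an illustrative name and the centerline 2.0 is the value shown on the chart.

```python
import math

def c_chart_limits(c_bar):
    """Three-sigma limits for a c chart; a negative LCL is set to 0."""
    sigma = math.sqrt(c_bar)
    lcl = max(0.0, c_bar - 3 * sigma)
    ucl = c_bar + 3 * sigma
    return lcl, c_bar, ucl

lcl, center, ucl = c_chart_limits(2.0)
print(f"LCL = {lcl:.1f}, centerline = {center:.1f}, UCL = {ucl:.1f}")
```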
559
Demonstration Problem 18.4: MINITAB c Chart
560
Interpreting Control Charts
Points are above UCL and/or below LCL.
Eight or more consecutive points fall above or below the centerline.
Ten out of 11 points fall above or below the centerline.
Twelve out of 14 points fall above or below the centerline.
A trend of 6 or more consecutive points (increasing or decreasing) is present.
Two out of 3 consecutive values are in the outer one-third.
Four out of 5 consecutive values are in the outer two-thirds.
The centerline shifts from chart to chart.