1 Chapter 1: Stratified Data Analysis 1.1 Introduction 1.2 Examining Associations among Variables 1.3 Recursive Partitioning 1.4 Introduction to Logistic.

1 Chapter 1: Stratified Data Analysis. Sections: 1.1 Introduction; 1.2 Examining Associations among Variables; 1.3 Recursive Partitioning; 1.4 Introduction to Logistic Regression

2 Section 1.1: Introduction

3 Objectives: Recognize the differences between categorical and continuous data analysis. Identify the scale of measurement for your response variable.

4 Categorical versus Continuous Data Analysis

5 Identifying the Scale of Measurement: Before analyzing, select the measurement scale for each variable. Example variable with levels Agree, No Opinion, Disagree.

6 Nominal Variables. Variable: Type of Beverage (levels shown by name or coded 1, 2, 3).

7 Ordinal Variables. Variable: Size of Beverage (Small, Medium, Large).

8 Continuous Variables. Variable: Volume of Beverage (axis from 0 to 4.0).

10 1.01 Quiz: A car dealer records several inventory variables, including Type (automatic or standard), Time (the number of seconds it takes for the car to go from 0 to 60 mph), and Model (basic, middle, or luxury). Match the modeling type on the left with the appropriate component on the right.
1. Continuous    A. Type
2. Ordinal       B. Time
3. Nominal       C. Model

11 1.01 Quiz – Correct Answer: 1-B, 2-C, 3-A

12 What’s Next? Ah ha! The opinion variable (Agree, No Opinion, Disagree) is ordinal.

14 Section 1.2: Examining Associations among Variables

15 Objectives: Examine the distribution of categorical variables. Determine whether an association exists among categorical variables. Perform a stratified analysis of categorical variables.

16 Sample Data Set

17 Demonstration: Examining Distributions. This demonstration illustrates the concepts discussed previously.

18 Association: An association exists between two variables if the distribution of one variable changes when the level (or value) of the other variable changes. If there is no association, the distribution of the first variable is the same, regardless of the level of the other variable.

19 No Association. Is your manager’s mood associated with the weather? The mood percentages are the same (72% and 28%) whatever the weather.

20 Association. Is your manager’s mood associated with the weather? The mood percentages differ with the weather: 82% and 18% in one condition, 40% and 60% in the other.

21 Demonstration: Recognizing Associations. This demonstration illustrates the concepts discussed previously.

23 1.02 Quiz: Is there an association between finishing a prescription (Rx) and experiencing a relapse?

24 1.02 Quiz – Correct Answer: Yes. The distribution of Yes/No for Did not finish Rx is different from the distribution of Yes/No for Finished Rx.

25 Tests for Association. Row percents of Income by Purchase:
Income    $100 +    Under $100
Low       32%       68%
Medium    32%       68%
High      48%       52%

26 Null Hypothesis: There is no association between Income and Purchase. The probability of purchasing items of $100 or more is the same, regardless of income level.

27 Alternative Hypothesis: There is an association between Income and Purchase. The probability of purchasing items of $100 or more differs among Low, Medium, and High income customers.

28 Chi-Square Test. NO ASSOCIATION: observed frequencies = expected frequencies. ASSOCIATION: observed frequencies ≠ expected frequencies.

29 p-Value for Chi-Square Test: This p-value is the probability of observing a chi-square statistic at least as large as the one actually observed, given that there is no association between the variables. Put informally, it is the probability of the association you observe in the data occurring by chance.

30 Chi-Square Tests: Chi-square tests and the corresponding p-values determine whether an association exists; do not measure the strength of an association; depend on and reflect the sample size.
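
The point that the chi-square statistic reflects sample size as well as association can be checked with a short sketch. The 2x2 counts below are hypothetical (they are not from the course data), and the p-value helper covers only the 1-degree-of-freedom case that a 2x2 table gives.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under no association: row total * column total / n
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df1(stat):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: both tables have the same row percentages (60% / 40%).
small = [[12, 8], [8, 12]]
large = [[120, 80], [80, 120]]  # ten times the sample size

chi_small = chi_square(small)
chi_large = chi_square(large)
print(chi_small, p_value_df1(chi_small))
print(chi_large, p_value_df1(chi_large))
```

The row percentages are identical in both tables, yet only the larger sample gives a p-value below .05, which is why the test by itself says nothing about the strength of the association.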

31 Demonstration: Chi-Square Test. This demonstration illustrates the concepts discussed previously.

33 1.03 Quiz: Is there sufficient evidence that an association exists between Relapsed and Rx Status?

34 1.03 Quiz – Correct Answer: Yes, there is sufficient evidence that an association exists between Relapsed and Rx Status. The p-value for the Pearson chi-square statistic is .0005, so at alpha = .05, there is sufficient evidence to reject the null hypothesis (that no association exists) in favor of the alternative (that an association exists).

35 When Not to Use the Chi-Square Test: When more than 20% of the cells have expected counts less than five.

36 Observed versus Expected Values
Observed Values      Expected Values
1   5   8            3.43   4.57   6.00
5   6   7            4.41   5.88   7.71
6   5   6            4.16   5.55   7.29
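
As a check on the expected values, here is a minimal sketch that recomputes them as row total times column total divided by the grand total. It assumes the slide's observed counts form the 3x3 table 1 5 8 / 5 6 7 / 6 5 6, a reading of the flattened layout that does reproduce the listed expected values.

```python
# Observed counts, read row by row from the slide (layout is an assumption).
observed = [
    [1, 5, 8],
    [5, 6, 7],
    [6, 5, 6],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under no association.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Rule of thumb from the previous slide: count cells with expected < 5.
cell_count = len(observed) * len(observed[0])
small_cells = sum(1 for row in expected for e in row if e < 5)
share_small = small_cells / cell_count

for row in expected:
    print([round(e, 2) for e in row])
print(small_cells, "of", cell_count, "cells have expected counts below five")
```

Four of the nine cells fall below five, well over the 20% threshold, so this table is a case where the chi-square approximation should not be relied on.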

37 Small Samples – Fisher’s Exact Test: Fisher’s exact test is intended for small sample sizes.

38 Example: Tea and Milk. Suppose you want to test whether someone can determine if a cup of tea with milk had the milk poured first or the tea poured first.

39 Fisher’s Exact Test Example: 9 cups of tea, 4 with milk first (M) and 5 with tea first (T). Predict which cups had tea poured first. The marginal totals are fixed: Actual M = 4, T = 5; Guess M = 4, T = 5.

40 Basis for Fisher’s Exact Test: With the row and column totals fixed, these are the possible tables (rows: Actual M, T; columns: Guess M, T).
Observed table: 1 3 / 3 2
Other possibilities: 0 4 / 4 1; 2 2 / 2 3; 3 1 / 1 4; 4 0 / 0 5

41 Fisher’s Exact Test Hypotheses. Null Hypothesis: There is no association. Alternative Hypothesis: There is an association (two-tailed, left-tailed, or right-tailed).

42 Left-Tailed Alternative Hypothesis: The left-tailed p-value uses the tables 0 4 / 4 1 and 1 3 / 3 2.

43 Right-Tailed Alternative Hypothesis: The right-tailed p-value uses the tables 1 3 / 3 2; 2 2 / 2 3; 3 1 / 1 4; 4 0 / 0 5.

44 Two-Tailed Alternative Hypothesis: The two-tailed p-value considers tables in both directions from the observed table 1 3 / 3 2.
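
The three p-values can be sketched directly from the hypergeometric probabilities of the five possible tables. This assumes the observed table has 1 milk-first cup guessed as milk first (as the slide residue suggests), and it uses the common convention that the two-tailed p-value sums every table no more probable than the observed one; the course software may define it differently.

```python
from math import comb

MILK_FIRST, TEA_FIRST = 4, 5   # fixed row totals (actual)
GUESSED_MILK = 4               # fixed column total (cups guessed as milk first)

def table_probability(k):
    """Probability that exactly k milk-first cups are guessed as milk first."""
    total = comb(MILK_FIRST + TEA_FIRST, GUESSED_MILK)
    return comb(MILK_FIRST, k) * comb(TEA_FIRST, GUESSED_MILK - k) / total

probabilities = {k: table_probability(k) for k in range(5)}

observed_k = 1  # assumption: observed table is 1 3 / 3 2

left_p = sum(p for k, p in probabilities.items() if k <= observed_k)
right_p = sum(p for k, p in probabilities.items() if k >= observed_k)
# Two-tailed: tables at most as probable as the observed one (small tolerance for float ties).
two_tailed_p = sum(p for p in probabilities.values() if p <= probabilities[observed_k] + 1e-12)

print(round(left_p, 4), round(right_p, 4), round(two_tailed_p, 4))
```

Because the margins are fixed, a single cell determines the whole table, so enumerating k from 0 to 4 covers every possibility.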

45 Demonstration: Fisher’s Exact Test. This demonstration illustrates the concepts discussed previously.

47 1.04 Quiz: What can you conclude from each of the p-values from the Fisher’s Exact Test for the association between Relapsed and Rx Status?

48 1.04 Quiz – Correct Answer: The Left p-value = .0007, so there is sufficient evidence to conclude that the probability of a relapse is greater for those who did not finish the Rx than for those who did. The Right p-value = .9999, so there is not sufficient evidence to conclude that the probability of a relapse is greater for those who finished the Rx than for those who did not. The 2-Tail p-value = .0008, so there is sufficient evidence to conclude that the probability of a relapse is different depending on whether an Rx was finished or not.

49 What Happens If There Is a Third Variable? Variables shown: Income, Gender, and $100 purchases.

50 Stratified Data Analysis: Stratified data analysis is the process of dividing subjects into groups defined by the levels of a third variable. Use this analysis when you want to examine the association between two variables within the levels of a third variable.

51 Stratified Data Analysis: Of the 39 single people, 23% have lung cancer and 77% do not. Of the 36 married people, 17% have lung cancer and 83% do not.

52 Stratified Data Analysis: Of the 28 single smokers, 28% have lung cancer and 72% do not. Of the 14 married smokers, 28% have lung cancer and 72% do not.
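
A small sketch of why stratifying matters here. The slides give only group sizes and rounded percentages, so the case counts below are back-calculated assumptions (8 of 28 and 4 of 14 among smokers, 9 of 39 and 6 of 36 overall, non-smokers by subtraction); they are illustrative, not the course data.

```python
# (cancer cases, group size) by marital status within each smoking stratum.
# All counts are assumptions reconstructed from the rounded slide percentages.
strata = {
    "smoker": {"single": (8, 28), "married": (4, 14)},
    "non-smoker": {"single": (1, 11), "married": (2, 22)},
}

def rate(cases, size):
    """Proportion of the group with lung cancer."""
    return cases / size

# Pooled rates ignore smoking status.
pooled = {}
for status in ("single", "married"):
    cases = sum(groups[status][0] for groups in strata.values())
    size = sum(groups[status][1] for groups in strata.values())
    pooled[status] = rate(cases, size)

# Stratified rates are computed separately within each smoking stratum.
within = {
    stratum: {status: rate(*counts) for status, counts in groups.items()}
    for stratum, groups in strata.items()
}

print("pooled:", {k: round(v, 3) for k, v in pooled.items()})
for stratum, rates in within.items():
    print(stratum, {k: round(v, 3) for k, v in rates.items()})
```

Pooled, single people look more likely to have lung cancer; within each smoking stratum the rates are equal, so the pooled difference reflects the fact that more of the single people smoke.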

53 Cochran-Mantel-Haenszel Statistics

54 CMH versus Chi-Square

55 1. Correlation of Scores: Tests for linear association between A and B.

56 2. Row Scores by Column Categories: Tests for equal row scores.

57 3. Column Scores by Row Categories: Tests for equal column scores.

58 4. General Association of Categories: Tests for general association between A and B (chi-square).

59 CMH Statistics and 2x2 Tables: For 2 x 2 tables, the CMH statistics are all equal.

60 When Do CMH Statistics Lack Power? When the response is reversed across strata.
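
The loss of power when the response reverses across strata can be seen in a minimal sketch of the CMH statistic for 2x2 strata. The function below uses the textbook form without a continuity correction, and the counts are hypothetical; software output may differ in such details.

```python
def cmh_2x2(strata):
    """CMH statistic for a list of 2x2 tables [[a, b], [c, d]] (no continuity correction)."""
    difference = 0.0
    variance = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        # Observed minus expected count in the top-left cell of this stratum.
        difference += a - row1 * col1 / n
        # Hypergeometric variance of the top-left cell given fixed margins.
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return difference ** 2 / variance

# Hypothetical strata: the association points the same way in both...
same_direction = [[[8, 2], [2, 8]], [[8, 2], [2, 8]]]
# ...or reverses from one stratum to the other.
reversed_direction = [[[8, 2], [2, 8]], [[2, 8], [8, 2]]]

cmh_same = cmh_2x2(same_direction)
cmh_reversed = cmh_2x2(reversed_direction)
print(cmh_same, cmh_reversed)
```

Each stratum in the reversed case shows a strong association, but the deviations have opposite signs and cancel in the sum, so the statistic is zero.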

61 Demonstration: CMH Tests. This demonstration illustrates the concepts discussed previously.

63 Exercise: This exercise reinforces the concepts discussed previously.

65 1.05 Multiple Choice Poll: The Correlation of Scores CMH test has which null hypothesis?
a. There is no linear association between the row and column variables in any stratum.
b. The mean scores for each column are equal in each stratum.
c. The mean scores for each row are equal in each stratum.
d. There is no association between the row and column variables in any stratum.

66 1.05 Multiple Choice Poll – Correct Answer: a. There is no linear association between the row and column variables in any stratum.

67 Section 1.3: Recursive Partitioning

68 Objectives: Define partitioning. Understand the splitting criteria used in JMP. Review algorithm parameters available in JMP. Use the Partition platform in JMP.

69 Recursive Partitioning: Partitioning refers to segmenting the data into groups that are as homogeneous as possible with respect to the dependent variable (Y).

70 Divide and Conquer: What factors affect the country from which cars are purchased? The Country node (n = 303) splits on size into Large (n = 42) and Medium or Small (n = 261).

71 Tree Algorithm: Calculate Separation of the Response (for X1)

72 Tree Algorithm: Find Best Split for the Independent Variable (X1)

73 Tree Algorithm: Repeat for the Other Independent Variables (X2)

74 Tree Algorithm: Compare the Best Splits (best split on X1 versus best split on X2)

75 Tree Algorithm: Partition with Best Split

76 Tree Algorithm: Repeat within Partitions
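
The tree algorithm steps can be sketched for a categorical response as follows. This is a simplified illustration, not JMP's implementation: it scores only one-level-versus-rest splits by the raw likelihood-ratio chi-square (G^2) and applies no p-value adjustment, and the car records are hypothetical.

```python
import math
from collections import Counter

def g_squared(groups):
    """Likelihood-ratio chi-square (G^2) for response lists, one list per branch."""
    all_responses = [y for group in groups for y in group]
    n = len(all_responses)
    overall = Counter(all_responses)
    g2 = 0.0
    for group in groups:
        for level, observed in Counter(group).items():
            expected = len(group) * overall[level] / n
            g2 += 2 * observed * math.log(observed / expected)
    return g2

def best_split(rows, response, predictors):
    """Return (G^2, predictor, level) for the best one-level-versus-rest split."""
    best = None
    for predictor in predictors:
        for level in sorted({row[predictor] for row in rows}):
            left = [row[response] for row in rows if row[predictor] == level]
            right = [row[response] for row in rows if row[predictor] != level]
            if not left or not right:
                continue  # not a real split
            candidate = (g_squared([left, right]), predictor, level)
            if best is None or candidate[0] > best[0]:
                best = candidate
    return best

# Hypothetical records in the spirit of the car example.
cars = [
    {"size": "Large", "type": "Family", "country": "American"},
    {"size": "Large", "type": "Sporty", "country": "American"},
    {"size": "Large", "type": "Family", "country": "American"},
    {"size": "Medium", "type": "Family", "country": "Japanese"},
    {"size": "Medium", "type": "Sporty", "country": "Japanese"},
    {"size": "Small", "type": "Sporty", "country": "Japanese"},
    {"size": "Small", "type": "Family", "country": "European"},
    {"size": "Small", "type": "Sporty", "country": "European"},
]

score, predictor, level = best_split(cars, "country", ["size", "type"])
print(predictor, level, round(score, 3))
```

To repeat within partitions, the same function would be called again on the rows of each branch, stopping when no candidate split separates the response well enough.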

77 Demonstration: Recursive Partitioning. This demonstration illustrates the concepts discussed previously.

79 Exercise: This exercise reinforces the concepts discussed previously.

81 1.06 Quiz: In which leaf, and on what variable, will JMP next split?

82 1.06 Quiz – Correct Answer: Of the leaves, the highest LogWorth is for Age (.7313), in the Gender(Female) leaf. This is where JMP will next split.
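
LogWorth is the negative base-10 logarithm of a p-value, so larger values mean stronger evidence for a split. A minimal sketch of the conversion, using the quiz value for Age (the helper names are illustrative, not JMP functions):

```python
import math

def logworth(p_value):
    """LogWorth = -log10(p-value)."""
    return -math.log10(p_value)

def p_from_logworth(value):
    """Invert the LogWorth transformation."""
    return 10 ** (-value)

p_age = p_from_logworth(0.7313)
print(round(p_age, 4))
print(logworth(0.01))
```

A LogWorth of .7313 corresponds to a p-value of roughly 0.19, so Age is the best available split in the quiz but not a statistically strong one; a LogWorth of 2 corresponds to a p-value of .01.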

83 Section 1.4: Introduction to Logistic Regression

84 Objectives: Explain the concepts of logistic regression. Fit a logistic regression model using JMP software. Examine logistic regression output.

85 Overview
Response       Predictor                    Analysis
Continuous     Categorical or Continuous    Linear Regression Analysis
Categorical    Categorical or Continuous    Logistic Regression Analysis

86 Types of Logistic Regression
Response Variable                      Type of Logistic Regression
Two categories (Yes / No)              Binary
Three or more categories, nominal      Nominal
Three or more categories, ordinal      Ordinal

87 What Does Logistic Regression Do? The logistic regression model uses the predictor variables, which can be categorical or continuous, to predict the probability of specific outcomes. In other words, logistic regression is designed to describe probabilities associated with the values of the response variable.

88 The Logistic Curve: The relationship between the probability of a response variable and a predictor variable might be an S-shaped curve. Linear regression cannot model this relationship, but logistic regression can.

89 Logistic Regression Curves: This graph shows the relationship between the probability of Sale and Price.

90 Logit Transformation: logit(p_i) = log(p_i / (1 - p_i)), where
i indexes all cases (observations);
p_i is the probability that the event (a sale, for example) occurs in the i-th case;
1 - p_i is the probability that the event (a sale, for example) does not occur in the i-th case;
log is the natural log (to the base e).

91 Assumption: The logit transform of p_i is linearly related to the predictor.

92 Logistic Regression Model: logit(p_i) = B_0 + B_1 X_1, where
logit(p_i) is the logit transformation of the probability of the event;
B_0 is the intercept of the regression line;
B_1 is the slope of the regression line.
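
A short sketch of the logit transformation and its inverse, which turns the linear predictor B_0 + B_1 X_1 back into a probability. The function names and the coefficient values in the example are illustrative assumptions.

```python
import math

def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(x):
    """Map a value on the logit scale back to a probability."""
    return 1 / (1 + math.exp(-x))

def predicted_probability(b0, b1, x1):
    """Probability of the event implied by logit(p) = b0 + b1 * x1."""
    return inverse_logit(b0 + b1 * x1)

# Hypothetical coefficients: the predicted probability rises with x1 but stays between 0 and 1.
curve = [predicted_probability(-3.0, 0.5, x) for x in range(0, 13, 2)]
print([round(p, 3) for p in curve])
```

On the logit scale the model is a straight line; on the probability scale the same model traces the S-shaped logistic curve.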

93 Likelihood Function: A likelihood function expresses the probability of the observed data as a function of the unknown model parameters. The goal is to derive values of the parameters such that the probability of the observed data is as large as possible.

94 Maximum Likelihood Estimate. Figure: the log-likelihood as a function of the parameter.

95 Model Inference. Figure: the log-likelihood function, with LogL_0 and LogL_1 marked.

96 Logistic Curve. Panels: Weak Relationship, Strong Relationship, Very Strong Relationship.

97 Example of Binary Logistic Regression Model: You want to predict the probability of defaulting on credit card payments based on having or not having a history of late payments. You can postulate this model: logit(Probability of Defaulting) = B_0 + B_1 * (Late Payment)
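
Maximum likelihood estimation for this model can be sketched with a few Newton-Raphson steps. The counts are the deck's loan figures (20 defaults among 80 people with late payments, 10 among 100 without); coding Late Payment as 0/1 is an assumption of this sketch, so the estimates need not match the values a particular package prints.

```python
import math

# Grouped data: (late_payment indicator, defaults, group size).
# The 0/1 coding of the predictor is an assumption of this sketch.
GROUPS = [(1, 20, 80), (0, 10, 100)]

def probability(b0, b1, x):
    """Inverse logit of the linear predictor b0 + b1 * x."""
    return 1 / (1 + math.exp(-(b0 + b1 * x)))

def log_likelihood(b0, b1, groups=GROUPS):
    """Binomial log-likelihood of the observed defaults (up to a constant)."""
    total = 0.0
    for x, events, size in groups:
        p = probability(b0, b1, x)
        total += events * math.log(p) + (size - events) * math.log(1 - p)
    return total

def fit(groups=GROUPS, iterations=25):
    """Maximize the log-likelihood with Newton-Raphson steps starting at (0, 0)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, events, size in groups:
            p = probability(b0, b1, x)
            residual = events - size * p      # contribution to the gradient
            weight = size * p * (1 - p)       # contribution to the information matrix
            g0 += residual
            g1 += x * residual
            h00 += weight
            h01 += x * weight
            h11 += x * x * weight
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0_hat, b1_hat = fit()
print(round(b0_hat, 4), round(b1_hat, 4), round(math.exp(b1_hat), 4))
```

With a single two-level predictor the fitted probabilities reproduce the observed proportions, so under 0/1 coding exp(B_1) equals the odds ratio of 3 derived on the odds slides.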

98 Demonstration: Binary Logistic Regression. This demonstration illustrates the concepts discussed previously.

100 1.07 Quiz: You want to predict the probability of a defect, given the width of a product. What kind of association exists between Defect and Width: a strong relationship or a weak relationship?

101 1.07 Quiz – Correct Answer: Weak. The fitted regression line is nearly flat, indicating a weak association between Defect and Width.

102 Multiple Logistic Regression

103 Interaction

104 Demonstration: Multiple Logistic Regression. This demonstration illustrates the concepts discussed previously.

105 What Is an Odds Ratio? An odds ratio indicates how much more likely, with respect to odds, a certain event occurs in one group relative to its occurrence in another group.
Example: How much more likely are females to purchase 100 dollars or more in items compared to males?
Example: How much more likely is a person with a history of late payments on credit cards to default on a loan relative to a person who does not have a history of late payments?

106 Probability of Outcome
                              Default on Loan
                              Yes     No      Total
Late Payments (Group A)       20      60      80
No Late Payments (Group B)    10      90      100
Total                         30      150     180
Probability of defaulting in Group A = 20/80 = .25
Probability of not defaulting in Group A = 60/80 = .75

107 Odds of Outcome in Group A = (probability of defaulting in group with history of late payments) ÷ (probability of not defaulting in group with history of late payments) = 0.25 ÷ 0.75 = 0.33

108 Odds Ratio of Group A to Group B = (odds of defaulting in group with history of late payments) ÷ (odds of defaulting in group with no history of late payments) = 0.33 ÷ 0.11 = 3
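
The odds and odds-ratio arithmetic from the loan table can be reproduced in a few lines. The helper name is illustrative.

```python
def odds(events, non_events):
    """Odds of the event: events divided by non-events (equivalently p / (1 - p))."""
    return events / non_events

# Counts from the loan table: defaults versus non-defaults in each group.
odds_a = odds(20, 60)   # history of late payments
odds_b = odds(10, 90)   # no history of late payments
odds_ratio = odds_a / odds_b

print(round(odds_a, 2), round(odds_b, 2), round(odds_ratio, 2))
```

Working from the counts avoids the rounding in 0.33 ÷ 0.11 and gives the same answer: the odds of default are three times as high in the group with a history of late payments.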

109 Properties of the Odds Ratio

110 Odds Ratio from a Logistic Regression Model: For a predictor variable that has only two levels, you can exponentiate twice the parameter estimate that JMP provides to obtain the odds ratio. Estimated odds ratio = exp(2 * parameter estimate). How do the odds that a female purchases 100 dollars or more in items compare to the odds for a male?
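
The factor of 2 follows if the two-level predictor is coded +1 and -1, so that the two groups differ by two units on the predictor scale; that coding is the assumption behind this sketch. Using the loan proportions from the odds slides (0.25 and 0.10):

```python
import math

def logit(p):
    """Log odds of a probability p."""
    return math.log(p / (1 - p))

p_a = 20 / 80    # defaults with a history of late payments (coded +1)
p_b = 10 / 100   # defaults with no such history (coded -1)

# With +1/-1 coding: logit(p_a) = b0 + b1 and logit(p_b) = b0 - b1.
b0 = (logit(p_a) + logit(p_b)) / 2
b1 = (logit(p_a) - logit(p_b)) / 2

odds_ratio = math.exp(2 * b1)
print(round(b1, 4), round(odds_ratio, 4))
```

Here exp(b1) alone is about 1.73, the square root of the odds ratio; doubling the estimate before exponentiating recovers the odds ratio of 3.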

111 Demonstration: Odds Ratios. This demonstration illustrates the concepts discussed previously.

113 Exercise: This exercise reinforces the concepts discussed previously.

115 1.08 Multiple Choice Poll: Suppose processes A and B are used to make a product, and each product is evaluated as defective or non-defective. Suppose the probability of a defective from A is .2 and of a non-defective from A is .8. Which is true?
a. The odds of a defective from group A is given by .8/.2 = 4.
b. The odds of a defective from group A is given by .2/.8 = .25.

116 1.08 Multiple Choice Poll – Correct Answer: b. The odds of a defective from group A is given by .2/.8 = .25.

117 1.09 Multiple Choice Poll: The odds of getting a defective product from process A is .25. What is its interpretation?
a. You expect only 1/4 as many defectives as non-defectives from process A.
b. You expect only 1/4 as many defectives as non-defectives from process B.

118 1.09 Multiple Choice Poll – Correct Answer: a. You expect only 1/4 as many defectives as non-defectives from process A.

