Intermediate Applied Statistics STAT 460 Lecture 20, 11/19/2004 Instructor: Aleksandra (Seša) Slavković TA: Wang Yu

Intermediate Applied Statistics STAT 460 Lecture 20, 11/19/2004 Instructor: Aleksandra (Seša) Slavković sesa@stat.psu.edu TA: Wang Yu wangyu@stat.psu.edu

Revised schedule
- Nov 8: lab on 2-way ANOVA
- Nov 10: lecture on two-way ANOVA and blocking; post HW9
- Nov 12: lecture on repeated measures and review
- Nov 15: lab on repeated measures
- Nov 17: lecture on categorical data/logistic regression; HW9 due; post HW10
- Nov 19: lecture on categorical data/logistic regression
- Nov 22: lab on logistic regression & project II introduction
- No class (Thanksgiving), two sessions
- Nov 29: lab
- Dec 1: lecture; HW10 due; post HW11
- Dec 3: lecture and quiz
- Dec 6: lab
- Dec 8: lecture; HW11 due
- Dec 10: lecture & project II due
- Dec 13: Project II due

Last lecture
- Categorical Data

This lecture
- Categorical Data/Response (ch. 18, 19, 20)
- Odds

Review: Categorical Variable
- Notation:
  - Population proportion = π (sometimes we use p)
  - Population size = N
  - Sample proportion = p̂ = X/n = # with trait / total #
  - Sample size = n
- The Rule for Sample Proportions: if numerous samples of size n are taken, the frequency curve of the sample proportions (p̂'s) from the various samples will be approximately normal with mean π and standard deviation √(π(1 − π)/n):
  p̂ ~ N(π, π(1 − π)/n), where the second argument is the variance.
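As a small numeric illustration of the rule for sample proportions, the sketch below computes the standard deviation of the sample proportion and the range that should contain roughly 95% of sample proportions. The values pi = 0.4 and n = 100 are made up for illustration and do not come from the lecture.

```python
import math

def sample_proportion_sd(pi: float, n: int) -> float:
    """Standard deviation of the sample proportion: sqrt(pi * (1 - pi) / n)."""
    return math.sqrt(pi * (1 - pi) / n)

# Hypothetical values, chosen only for illustration.
pi, n = 0.4, 100
sd = sample_proportion_sd(pi, n)

# Roughly 95% of sample proportions fall within 2 standard deviations of pi.
low, high = pi - 2 * sd, pi + 2 * sd
print(f"sd = {sd:.4f}, approximate 95% range = ({low:.3f}, {high:.3f})")
```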

One-sample approximate z test and z-interval for π.
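The formulas for this slide are not reproduced in the transcript, so the following is only a sketch of the standard textbook versions: the test statistic uses the standard error under the null value, and the interval uses the standard error at the sample proportion. The counts (55 admitted out of 140 applicants) come from the admissions example later in this lecture; the null value 0.5 is a hypothetical choice for illustration.

```python
import math
from statistics import NormalDist

def one_sample_z(x: int, n: int, pi0: float, level: float = 0.95):
    """Approximate z test of H0: pi = pi0 and a z-interval for pi."""
    p_hat = x / n
    # Test statistic: standard error computed under the null hypothesis.
    se_null = math.sqrt(pi0 * (1 - pi0) / n)
    z = (p_hat - pi0) / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    # Interval: standard error computed at the sample proportion.
    se_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ci = (p_hat - z_star * se_hat, p_hat + z_star * se_hat)
    return z, p_value, ci

# 55 of 140 applicants admitted; pi0 = 0.5 is an assumed null value.
z, p_value, ci = one_sample_z(55, 140, 0.5)
print(f"z = {z:.3f}, p-value = {p_value:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```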

Difference between proportions These tests can be extended to test the difference in parameters π between two groups.
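A minimal sketch of a two-sample z test for the difference between two proportions follows. The slide's own formula is not in the transcript, so the pooled standard error used here is an assumption (it is one common textbook choice). The counts are the male and female admission counts from the example later in this lecture (35 of 80 males and 20 of 60 females admitted).

```python
import math
from statistics import NormalDist

def two_proportion_z(x1: int, n1: int, x2: int, n2: int):
    """Approximate z test of H0: pi1 = pi2, using a pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value

z, p_value = two_proportion_z(35, 80, 20, 60)
print(f"z = {z:.3f}, z^2 = {z * z:.3f}, p-value = {p_value:.3f}")
```

For a 2x2 table the square of this pooled z statistic equals the chi-squared statistic for independence, so the two approaches lead to the same conclusion.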

Warning: z-tests for proportions are based on an approximation. They don’t work for small samples. It is often said that n is large enough if nπ and n(1 − π) both exceed a small threshold (5 and 10 are commonly quoted values). Because of improved computing power, an exact test based on the binomial distribution rather than the normal is now available in most software.
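To make the idea of an exact binomial test concrete, here is a small sketch using only the Python standard library. It uses one common definition of the two-sided p-value (the total probability of all outcomes that are no more likely than the observed one under the null hypothesis); software packages may use other conventions. The data (3 successes in 10 trials, null value 0.5) are hypothetical.

```python
from math import comb

def binom_pmf(k: int, n: int, pi: float) -> float:
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)

def exact_binomial_p_value(x: int, n: int, pi0: float) -> float:
    """Two-sided exact p-value: sum the probabilities of all outcomes
    that are no more likely than the observed count x under H0."""
    p_obs = binom_pmf(x, n, pi0)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    return sum(
        binom_pmf(k, n, pi0)
        for k in range(n + 1)
        if binom_pmf(k, n, pi0) <= p_obs * tolerance
    )

# Hypothetical small sample where the normal approximation is doubtful.
p_value = exact_binomial_p_value(3, 10, 0.5)
print(f"exact two-sided p-value = {p_value:.5f}")
```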

Analysis Grid (ref. Handout)

                      | Quantitative Explanatory | Discrete Explanatory            | Both
Quantitative Outcome  | Regression               | ANOVA                           | Regression (ANCOVA)
Discrete Outcome      | Logistic Regression      | Chi-Square Test of Independence | Logistic Regression

Contingency Table
- A statistical tool for summarizing and displaying results for categorical variables
- A two-way table is for two categorical variables
- 2x2 table: for two categorical variables, each with two categories
- Place the counts of each combination of the two variables in the appropriate cells of the table.
- Explanatory variable as labels for the rows, response variable as labels for the columns.

Example
- A university offers only two degree programs: English and Computer Science. Admission is competitive and there is a suspicion of discrimination against women in the admission process. Here is a two-way table of all applicants by sex and admission status:

         Male  Female  Total
Admit      35      20     55
Deny       45      40     85
Total      80      60    140

- These data show an association between the sex of the applicants and their success in obtaining admission.

Marginal & Conditional Distributions
- Marginal distributions:
  - Row variable (admission status): add up the counts across each row, ignoring the column variable
    - In our example the distribution is: 55, 85 (total 140)
    - Observed proportions: ‘admit’ = 55/140 = 0.39, ‘deny’ = 85/140 = 0.61
    - NOTE: they add up to 1
  - Column variable (sex): add up the counts down each column, ignoring the row variable
    - In our example the distribution is?
    - Observed proportions are:
    - Do they add up to 1?

Marginal & Conditional Distributions
- Conditional distribution: conditional percentages, that is, what percent of a particular row or column total the count in a cell is.
  - Conditional distribution of gender for those admitted:
    - % of admitted who are male = 35/55 = 0.64 = 64%
    - % of admitted who are female = ?
  - What is:
    - % of male applicants admitted = ?
    - % of female applicants admitted = ?
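The marginal and conditional distributions can be computed directly from the cell counts. The sketch below does this for the admissions table; the dictionary layout and variable names are illustrative choices, not anything prescribed by the lecture.

```python
# Cell counts from the admissions example, keyed by (status, sex).
counts = {
    ("admit", "male"): 35, ("admit", "female"): 20,
    ("deny", "male"): 45, ("deny", "female"): 40,
}
total = sum(counts.values())

# Marginal totals: sum over the other variable.
row_totals = {}  # by admission status
col_totals = {}  # by sex
for (status, sex), c in counts.items():
    row_totals[status] = row_totals.get(status, 0) + c
    col_totals[sex] = col_totals.get(sex, 0) + c

# Marginal distributions as proportions of the table total.
status_marginal = {s: t / total for s, t in row_totals.items()}
sex_marginal = {s: t / total for s, t in col_totals.items()}

# Conditional distribution of sex among admitted applicants (row percentages).
sex_given_admit = {
    sex: counts[("admit", sex)] / row_totals["admit"] for sex in col_totals
}
# Conditional admission rate within each sex (column percentages).
admit_given_sex = {
    sex: counts[("admit", sex)] / col_totals[sex] for sex in col_totals
}

print(status_marginal, sex_marginal)
print(sex_given_admit, admit_given_sex)
```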

Statistical Significance
- An observed relationship is statistically significant if the chances of observing the relationship in the sample when there is no actual relationship in the population are small (usually less than 5%)
- In other words, a relationship is statistically significant if that relationship is stronger than 95% of the relationships we would expect to see just by chance.
- If we say that there was no statistically significant relationship found, that does not mean that there is no relationship at all!
- Warnings:
  - If the sample size is small, strong relationships may not achieve significance
  - If the sample size is large, even minor relationships could achieve significance, but these might not have practical importance

Chi-Squared Test (χ² Test)
- A chi-squared test for independence
- The chi-squared statistic (χ²) for a contingency table:
  - Follows a χ² distribution
    - Skewed to the right
    - Min = 0, max = infinity
  - As the strength of the observed relationship in the sample increases, the statistic increases.
  - It combines information about the strength of the relationship and the sample size into one number.
  - Can be calculated for a contingency table of any size
  - For a 2 x 2 table: if χ² > 3.84, then we have a statistically significant relationship (at the 5% level)
- We either reject (χ² > 3.84) or fail to reject (χ² ≤ 3.84) the claim of independence between the two variables, which is our null hypothesis.
- H0: the variables are independent
- HA: the variables are NOT independent

22  The chi-squared distribution with k-1 degrees of freedom acts as though it was the sum the squares of k-1 independent Normal(0,1) distributions. (Not that you need to know.)  See table on pages 1100-1101 in textbook.

You Must Know:
- How to calculate the χ² statistic
  - Compute the expected numbers
  - Compare the expected and observed numbers
  - Compute the χ² statistic
- How to compare it to 3.84 for 2x2 tables
- How to draw a proper conclusion about the statistical relationship, and in general about the question of interest, for any two-way and k-way tables.

For our example:
- Computing the χ² statistic:
  - Expected number = the count (number of individuals) that we expect to fall in a particular cell if the variables are independent = (row total)(column total)/(table total)
    - Expected number of admitted male students = (55 x 80)/140 = 31.43
    - Expected number of admitted female students = ?
  - Observed number = the count in the cell
    - Observed number of admitted male students = 35
    - Observed number of admitted female students = ?
  - Compare the observed and expected numbers: (observed − expected)²/(expected)
    - For male students: (35 − 31.43)²/31.43 = 0.41
    - For female students: = ?
  - Compute the statistic = sum of the above quantities over all the cells
    - In our case χ² = 1.56
  - Compare it to 3.84
    - Is it statistically significant? Are admission decisions independent of gender?
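The steps above (expected counts, cell-by-cell comparison, and the sum) can be sketched in a few lines of Python. The function name and the list-of-rows layout are illustrative choices. Without any intermediate rounding the statistic for the admissions table evaluates to about 1.56.

```python
def chi_square(table):
    """Chi-squared statistic for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: admit, deny. Columns: male, female.
observed = [[35, 20], [45, 40]]
chi2 = chi_square(observed)
print(f"chi-squared = {chi2:.2f}; significant at 5%? {chi2 > 3.84}")
```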

Relative Risk, Increased Risk, Odds Ratio
- Quantify the chances of a particular outcome and how these chances change
- What are the chances that a randomly selected individual falls into a particular category of a categorical variable?
- There are two basic ways to express these chances:
  - Proportions = expressing one category as a proportion of the total
    - Proportion of admitted students who are female = 20/55 = 0.36
  - Odds = comparing one category to another
    - Odds of being admitted = 55 to 85 = 55/85 to 1

Expressing Proportions & Odds
- There are 4 equivalent ways to express proportions: Percent = Proportion = Probability = Risk
  - 36% (percent) of all admitted students are female
  - The proportion of admitted students who are female is 0.36
  - The probability that an admitted student is female is 0.36
  - The risk of being female among admitted students is 0.36
- Odds = expressed by reducing the numbers with and without the characteristic of interest to the smallest possible whole numbers:
  - The odds of being admitted = 55 to 85 = 11 to 17 = 11/17 to 1
- Going back and forth between proportions and odds:
  - If the proportion has value π, then the odds are π/(1 − π) to 1
  - If the odds of having a characteristic are a to b, then the proportion with the characteristic is a/(a + b)

Generalized forms for the expressions:
- Percentage with the characteristic = (number with the characteristic/total) x 100%
- Proportion with the characteristic = (number with the characteristic/total)
- Probability of having the characteristic = (number with the characteristic/total)
- Risk of having the characteristic = (number with the characteristic/total)
- Odds of having the characteristic = (number with the characteristic/number without the characteristic) to 1 = π/(1 − π)
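The conversions between proportions and odds are simple enough to check numerically. The sketch below uses the admissions totals (55 admitted, 85 denied); the helper names are hypothetical. Reducing 55 to 85 to lowest terms gives 11 to 17.

```python
from fractions import Fraction

def proportion_to_odds(p: float) -> float:
    """Odds (to 1) corresponding to a proportion p."""
    return p / (1 - p)

def odds_to_proportion(a: float, b: float) -> float:
    """Proportion corresponding to odds of a to b."""
    return a / (a + b)

admitted, denied = 55, 85

# Fraction reduces the odds to the smallest whole numbers.
odds_admit = Fraction(admitted, denied)
print(f"odds of admission = {odds_admit.numerator} to {odds_admit.denominator}")

p_admit = odds_to_proportion(admitted, denied)  # 55/140
print(f"proportion admitted = {p_admit:.3f}")
print(f"back to odds = {proportion_to_odds(p_admit):.3f} to 1")
```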

Types of Risk: Relative Risk & Increased Risk
- Relative risk = the ratio of the risks for each category of the explanatory variable
  - Relative risk of being female based on whether you are rejected or accepted:
    - Risk of being female if you are rejected = 40/85 = 0.47
    - Risk of being female if you are accepted = 20/55 = 0.36
    - Relative risk = 0.47/0.36 = 1.31 to 1 (1.29 if the risks are not rounded first)
  - What does this mean? What does a relative risk of 1 mean?
- Increased risk = usually, the percent increase in risk
  - Increased risk = (change in risk/original risk) x 100%
    - Change in risk = 0.47 − 0.36 = 0.11
    - Original risk = baseline risk = 0.36
    - Increased risk = (0.11/0.36) x 100% = 31%
  - The proportion of females is 31% higher among rejected applicants than among accepted applicants.
  - Increased risk = (relative risk − 1.0) x 100%
    - Increased risk = (1.31 − 1.0) x 100% = 31%
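A short sketch of the same calculation without intermediate rounding follows. The slide divides the rounded risks 0.47 and 0.36, which gives 1.31; carrying full precision gives a relative risk of about 1.29 and an increased risk of about 29%. Variable names are illustrative.

```python
# Proportion of females among rejected and among accepted applicants.
risk_rejected = 40 / 85
risk_accepted = 20 / 55   # baseline risk

relative_risk = risk_rejected / risk_accepted

# Two equivalent ways to get the percent increase in risk.
increased_risk_a = (risk_rejected - risk_accepted) / risk_accepted * 100
increased_risk_b = (relative_risk - 1.0) * 100

print(f"relative risk = {relative_risk:.2f}")
print(f"increased risk = {increased_risk_a:.1f}%")
```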

Odds Ratio
- First calculate the odds of having a characteristic versus not having it:
  - Odds of being female (versus male) among admitted applicants = 20/35 = 0.571429
  - Odds of being female (versus male) among rejected applicants = 40/45 = 0.888889
- Then take the ratio of these odds:
  - Odds ratio = 0.888889/0.571429 = 1.5556
  - Not too close to the relative risk of 1.31, but sometimes the two can be close
- Odds ratio = (upper left * lower right)/(upper right * lower left)
  - Sometimes you need to swap the numerator and denominator so that the ratio is greater than 1 (easier to interpret)
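Both routes to the odds ratio (ratio of the two odds, and the cross-product of the table cells) can be verified on the admissions table. This is a minimal sketch with illustrative variable names.

```python
# Admissions table: rows admit/deny, columns male/female.
admit_male, admit_female = 35, 20
deny_male, deny_female = 45, 40

# Route 1: odds of being female within each admission outcome, then their ratio.
odds_female_admitted = admit_female / admit_male   # 20/35
odds_female_rejected = deny_female / deny_male     # 40/45
odds_ratio = odds_female_rejected / odds_female_admitted

# Route 2: cross-product (upper left * lower right) / (upper right * lower left).
cross_product = (admit_male * deny_female) / (admit_female * deny_male)

print(f"odds ratio = {odds_ratio:.4f}, cross-product = {cross_product:.4f}")
```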

Misleading items about Risk/Odds
- The baseline risk is missing
- The time period of the risk is not identified
- The reported risk is not necessarily your risk (relative risk vs. your risk)
- Retrospective vs. prospective study
  - Prospective: take a random sample and record success and failure in the future
  - Retrospective: take a random sample and record success and failure that happened in the past
  - In a retrospective study you can meaningfully interpret the odds ratio, but not the individual odds

Simpson’s Paradox
- Lurking variable = a variable that changes the nature of the association, or even reverses the direction of the relationship, between two other variables.
- The nature of an association changes due to a lurking variable
- In our example we didn’t consider the type of program (major) as a variable. What happens if we do, and construct two separate tables, one for each major?

Example of Simpson’s Paradox
- Computer Science admits 50% of both males and females
- English admits ¼ of both males and females
- Now there doesn’t seem to be an association between sex and admission decision in either program
- Hence, type of program was a lurking variable

Computer Science
         Male  Female
Admit      30      10
Deny       30      10
Total      60      20

English
         Male  Female
Admit       5      10
Deny       15      30
Total      20      40
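The paradox can be reproduced by computing admission rates within each program and then for the two programs combined. The sketch below uses the counts from the two tables above; the data layout and names are illustrative.

```python
# counts[program][sex] = (admitted, denied)
counts = {
    "Computer Science": {"male": (30, 30), "female": (10, 10)},
    "English": {"male": (5, 15), "female": (10, 30)},
}

def admit_rate(admitted: int, denied: int) -> float:
    return admitted / (admitted + denied)

# Admission rate by sex within each program.
by_program = {
    program: {sex: admit_rate(*cell) for sex, cell in cells.items()}
    for program, cells in counts.items()
}

# Admission rate by sex after collapsing over program.
combined = {}
for sex in ("male", "female"):
    admitted = sum(counts[p][sex][0] for p in counts)
    denied = sum(counts[p][sex][1] for p in counts)
    combined[sex] = admit_rate(admitted, denied)

print(by_program)   # equal rates for males and females within each program
print(combined)     # different rates once the programs are pooled
```

Within each program the rates are identical for males and females, yet the pooled rates differ because most males applied to the program with the higher admission rate.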

Commands in SAS
- To create contingency tables, calculate the chi-square statistic, etc.: Statistics/Table Analysis
- To run the logistic regression: Statistics/Regression/Logistic

Next
- Lab Monday: Categorical Data, Logistic Regression. We will work through the lab together and learn about logistic regression.
- Project II
