Descriptive Statistics


1 Descriptive Statistics

2 Bar Chart for SI categories
[Bar chart: Number of Patients by Shock Index Category.] Much easier to extract information from a bar chart than from a table!

3 Box plot and histograms: for continuous variables
To show the distribution (shape, center, range, variation) of continuous variables. Does everybody know what I mean when I say percentiles? What is the median? Anyone?

4 Box Plot: Shock Index
[Box plot of SI, in Shock Index Units (0.0 to 2.0):] outliers above the upper "whisker," which ends at Q3 + 1.5(IQR) = .8 + 1.5(.25) = 1.175; maximum (1.7); 75th percentile (0.8); interquartile range (IQR) = .8 - .55 = .25; median (.66); 25th percentile (0.55); minimum (or Q1 - 1.5 IQR)

5 Histogram of SI
[Histogram of SI: Percent vs. SI (0.0 to 2.0), bins of size 0.1.] Note the "right skew." Discussion: 1. Bin sizes may be altered. 2. How many observations do you think are in each bin? 3. Where do you think the center of the data is (what's your best guess at the average value)? 4. On average, how far do you think a given observation is from the center/mean?

6 100 bins (too much detail)

7 2 bins (too little detail)

8 Box Plot: Shock Index
[Box plot of SI, in Shock Index Units (0.0 to 2.0).] Also shows the "right skew"

9 Box Plot: Age
[Box plot of AGE, in Years (0.0 to 100.0), labeled with maximum, 75th percentile, interquartile range, median, 25th percentile, and minimum.] More symmetric

10 Histogram: Age
[Histogram of AGE (Years): Percent vs. AGE.] Not skewed, but not bell-shaped either…

11 Some histograms from your class (n=24)
Starting with politics…

12

13

14 Feelings about math and writing…

15 Optimism…

16 Diet…

17 Habits…

18 Measures of central tendency
Mean Median Mode

19 Central Tendency Mean – the average; the balancing point
calculation: the sum of values divided by the sample size. Balance the bell curve on a point: the point of balance has equal average mass on each side. In math shorthand: X̄ = (Σ xᵢ)/n

20 Mean: example Some data: Age of participants: 17 19 21 22 23 23 23 38. Mean = (17+19+21+22+23+23+23+38)/8 = 186/8 = 23.25

21 Mean of age in Kline’s data
[Software output: Means Section of AGE, reporting Mean, Median, Geometric Mean, Harmonic Mean, Sum, and Mode, shown over the AGE histogram.]

22 Mean of age in Kline’s data
[Histogram of AGE: Percent vs. AGE.] The mean is the balancing point.

23 Mean of Pulmonary Embolism? (Binary variable?)
No PE: 80.56% (750); PE: 19.44% (181). The mean of a binary (0/1) variable is just the proportion of 1's: 181/931 = .1944

24 Mean The mean is affected by extreme values (outliers)
For example, the data 1 2 3 4 5 have Mean = 3, but changing the 5 to a 10 (1 2 3 4 10) gives Mean = 4. Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

25 Central Tendency Median – the exact middle value Calculation:
If there are an odd number of observations, find the middle value If there are an even number of observations, find the middle two values and average them.

26 Median: example Some data:
Age of participants: 17 19 21 22 23 23 23 38. With an even number of observations (n=8), average the middle two values: Median = (22+23)/2 = 22.5

27 Median of age in Kline’s data
[Software output: Means Section of AGE, reporting Mean, Median, Geometric Mean, Harmonic Mean, Sum, and Mode, shown over the AGE (Years) histogram.]

28 Median of age in Kline’s data
[Histogram of AGE: Percent vs. AGE.] The median splits the distribution into 50% of the mass on each side.

29 Does PE have a median? Yes, if you line up the 0’s and 1’s, the middle number is 0.

30 Median The median is not affected by extreme values (outliers).
For example, the data 1 2 3 4 5 have Median = 3, and 1 2 3 4 10 still have Median = 3. Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

31 Central Tendency Mode – the value that occurs most frequently

32 Mode: example Some data: Age of participants: 17 19 21 22 23 23 23 38
Mode = 23 (occurs 3 times)
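As a quick check of the hand calculations on these slides, here is a minimal Python sketch (an illustration, not part of the original deck) using only the standard library:

    from statistics import mean, median, mode

    ages = [17, 19, 21, 22, 23, 23, 23, 38]  # the example data above
    print(mean(ages))    # 23.25
    print(median(ages))  # 22.5 (average of the middle two values, 22 and 23)
    print(mode(ages))    # 23 (occurs 3 times)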

33 Mode of age in Kline’s data
[Software output: Means Section of AGE; the Mode column gives the most frequently occurring AGE.]

34 Mode of PE? 0 appears more often than 1, so 0 is the mode.

35 Measures of Variation/Dispersion
Range Percentiles/quartiles Interquartile range Standard deviation/Variance

36 Range Difference between the largest and the smallest observations.

37 Range of age: 94 years-15 years = 79 years
[Histogram of AGE (Years): Percent vs. AGE (0.0 to 100.0).]

38 Range of PE? 1-0 = 1

39 Quartiles Q1, Q2, and Q3 split the data into four groups of 25% each. The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger. Q2 is the same as the median (50% are smaller, 50% are larger). Only 25% of the observations are greater than the third quartile.

40 Interquartile Range Interquartile range = 3rd quartile – 1st quartile = Q3 – Q1

41 Interquartile Range: age
[Box plot of AGE annotated with minimum, Q1, median (Q2), Q3, and maximum; each segment holds 25% of the data.] Interquartile range = 65 - 35 = 30

42 Variance Average (roughly) of squared deviations of values from the mean

43 Why squared deviations?
Adding deviations will yield a sum of 0. Absolute values are tricky! Squares eliminate the negatives. Result: Increasing contribution to the variance as you go farther from the mean.

44 Standard Deviation Most commonly used measure of variation
Shows variation about the mean Has the same units as the original data

45 Calculation Example: Sample Standard Deviation
Age data (n=8): 17 19 21 22 23 23 23 38; Mean = X̄ = 23.25. s = √[Σ(xᵢ - X̄)²/(n-1)] = √(281.5/7) ≈ 6.34
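A quick Python check of this calculation (illustrative only); note that statistics.stdev uses the same n-1 denominator:

    from statistics import stdev, variance

    ages = [17, 19, 21, 22, 23, 23, 23, 38]
    print(variance(ages))  # 40.21... (= 281.5/7)
    print(stdev(ages))     # 6.34...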

46 Std. dev is a measure of the “average” scatter around the mean.
[Histogram of AGE (Years): Percent vs. AGE.] Estimation method: if the distribution is bell shaped, the range is around 6 SD, so here a rough guess for the SD is 79/6 ≈ 13.

47 Std. Deviation age
[Software output: Variation Section of AGE, reporting Variance and Standard Deviation.]

48 Std Dev of Shock Index
[Histogram of SI: Count vs. SI (0.0 to 2.0).] Std. dev is a measure of the "average" scatter around the mean. Estimation method: if the distribution is bell shaped, the range is around 6 SD, so here a rough guess for the SD is 1.4/6 ≈ .23. Discussion: 1. Bin sizes may be altered. 2. How many observations do you think are in each bin? 3. Where do you think the center of the data is? 4. On average, how far is a given observation from the center/mean?

49 Std. Deviation SI
[Software output: Variation Section of SI, reporting Variance, Standard Deviation, Std Error of Mean, Interquartile Range, and Range.]

50 Std. Dev of binary variable, PE
Std. dev is a measure of the "average" scatter around the mean. Here 80.56% of the values are 0 and 19.44% are 1; for a binary variable the variance is approximately p(1-p), so the SD is roughly √(.1944 × .8056) ≈ .40.

51 Std. Deviation PE
[Software output: Variation Section of PE, reporting Variance and Standard Deviation.]

52 Comparing Standard Deviations
Data A: Mean = 15.5, S = 3.338. Data B: Mean = 15.5, S = 0.926. Data C: Mean = 15.5, S = 4.570. Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

53 Bienaymé-Chebyshev Rule
Regardless of how the data are distributed, a certain percentage of values must fall within k standard deviations of the mean. Note the use of σ (sigma) to represent "standard deviation" and µ (mu) to represent "mean." Within k=1 (µ ± 1σ): at least (1 - 1/1²) = 0%. Within k=2 (µ ± 2σ): at least (1 - 1/2²) = 75%. Within k=3 (µ ± 3σ): at least (1 - 1/3²) ≈ 89%.

54 Symbol Clarification S = sample standard deviation (example of a "sample statistic"); σ = standard deviation of the entire population (example of a "population parameter") or from a theoretical probability distribution; X̄ = sample mean; µ = population or theoretical mean

55 **The beauty of the normal curve:
No matter what µ and σ are, the area between µ-σ and µ+σ is about 68%; the area between µ-2σ and µ+2σ is about 95%; and the area between µ-3σ and µ+3σ is about 99.7%. Almost all values fall within 3 standard deviations.

56 68-95-99.7 Rule
68% of the data fall within 1 standard deviation either way of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations. This works for all normal curves, no matter how skinny or fat!
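A small simulation (an illustrative sketch, not from the slides; assumes numpy is installed) confirms the rule on normally distributed draws:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0, scale=1, size=100_000)
    for k in (1, 2, 3):
        frac = np.mean(np.abs(x) <= k)  # fraction within k SDs of the mean
        print(k, round(frac, 3))        # ~0.68, ~0.95, ~0.997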

57 Summary of Symbols S² = sample variance; S = sample standard dev.
σ² = population (true or theoretical) variance; σ = population standard dev.; X̄ = sample mean; µ = population mean; IQR = interquartile range (middle 50%)

58 Examples of bad graphics

59 What’s wrong with this graph?
from: ER Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut, 1983, p.69

60 Notice the X-axis From: Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot Wainer, H. 1997, p.29.

61 Correctly scaled X-axis…

62 Report of the Presidential Commission on the Space Shuttle Challenger Accident, 1986 (vol 1, p. 145)
The graph excludes the observations where no O-rings failed.

63 Smooth curve at least shows the trend toward failure at high and low temperatures…

64 Even better: graph all the data (including non- failures) using a logistic regression model
Tappin, L. (1994). "Analyzing data relating to the Challenger disaster". Mathematics Teacher, 87,

65 What’s wrong with this graph?
from: ER Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut, 1983, p.74

66

67 What’s the message here?
Diagraphics II, 1994

68 Diagraphics II, 1994

69 From: Johnson R. Just the Essentials of Statistics. Duxbury Press, 1995.

70 From: Johnson R. Just the Essentials of Statistics. Duxbury Press, 1995.

71 From: Johnson R. Just the Essentials of Statistics. Duxbury Press, 1995.

72 From: Johnson R. Just the Essentials of Statistics. Duxbury Press, 1995.

73 For more examples…

74 “Lying” with statistics
More accurately, misleading with statistics…

75 Example 1: projected statistics
Lifetime risk of melanoma: 1935: 1/1500 1960: 1/600 1985: 1/150 2000: 1/74 2006: 1/60

76 Example 1: projected statistics
How do you think these statistics are calculated? How do we know what the lifetime risk of a person born in 2006 will be?

77 Example 1: projected statistics
Interestingly, a clever clinical researcher recently went back and calculated (using SEER data) the actual lifetime risk (or risk up to 70 years) of melanoma for a person born in 1935. The answer? Closer to 1/150 (one order of magnitude off) (Martin Weinstock of Brown University, AAD conference 2006)

78 Example 2: propagation of statistics
In many papers and reviews of eating disorders in women athletes, authors cite the statistic that 15 to 62% of female athletes have disordered eating. I’ve found that this statistic is attributed to about 50 different sources in the literature and cited all over the place with or without citations...

79 For example… In a recent review (Hobart and Smucker, The Female Athlete Triad, American Family Physician, 2000): “Although the exact prevalence of the female athlete triad is unknown, studies have reported disordered eating behavior in 15 to 62 percent of female college athletes.” No citations given.

80 And… Fact Sheet on eating disorders:
“Among female athletes, the prevalence of eating disorders is reported to be between 15% and 62%.” Citation given: Costin, Carolyn. (1999) The Eating Disorder Source Book: A comprehensive guide to the causes, treatment, and prevention of eating disorders. 2nd edition. Lowell House: Los Angeles.

81 And… From a Fact Sheet on disordered eating from a college website:
“Eating disorders are significantly higher (15 to 62 percent) in the athletic population than the general population.” No citation given.

82 And… “Studies report between 15% and 62% of college women engage in problematic weight control behaviors (Berry & Howe, 2000).” (in The Sport Journal, 2004) Citation: Berry, T.R. & Howe, B.L. (2000, Sept). Risk factors for disordered eating in female university athletes. Journal of Sport Behavior, 23(3),

83 And… 1999 NY Times article “But informal surveys suggest that 15 percent to 62 percent of female athletes are affected by disordered behavior that ranges from a preoccupation with losing weight to anorexia or bulimia.”

84 And "It has been estimated that the prevalence of disordered eating in female athletes ranges from 15% to 62%." (in Journal of General Internal Medicine, 15(8)) Citations: Steen SN. The competitive athlete. In: Rickert VI, ed. Adolescent Nutrition: Assessment and Management. New York, NY: Chapman and Hall; 1996. Tofler IR, Stryer BK, Micheli LJ. Physical and emotional problems of elite female gymnasts. N Engl J Med. 1996;335:281-3.

85 Where did the statistics come from?
The 15%: Dummer GM, Rosen LW, Heusner WW, Roberts PJ, and Counsilman JE. Pathogenic weight-control behaviors of young competitive swimmers. Physician Sportsmed 1987; 15: The “to”: Rosen LW, McKeag DB, O’Hough D, Curley VC. Pathogenic weight-control behaviors in female athletes. Physician Sportsmed. 1986; 14: The 62%:Rosen LW, Hough DO. Pathogenic weight-control behaviors of female college gymnasts. Physician Sportsmed 1988; 16:

86 Where did the statistics come from?
Study design? Control group? Cross-sectional survey (all) No non-athlete control groups Population/sample size? Convenience samples Rosen et al. 1986: 182 varsity athletes from two midwestern universities (basketball, field hockey, golf, running, swimming, gymnastics, volleyball, etc.) Dummer et al. 1987: year old swimmers at a swim camp Rosen et al. 1988: 42 college gymnasts from 5 teams at an athletic conference

87 Where did the statistics come from?
Measurement? Instrument: Michigan State University Weight Control Survey Disordered eating = at least one pathogenic weight control behavior: Self-induced vomiting fasting Laxatives Diet pills Diuretics In the 1986 survey, they required use 1/month; in the 1988 survey, they required use twice-weekly In the 1988 survey, they added fluid restriction

88 Where did the statistics come from?
Findings? Rosen et al. 1986: 32% used at least one “pathogenic weight-control behavior” (ranges: 8% of 13 basketball players to 73.7% of 19 gymnasts) Dummer et al. 1987: 15.4% of swimmers used at least one of these behaviors Rosen et al. 1988: 62% of gymnasts used at least one of these behaviors

89 References
http://www.math.yorku.ca/SCS/Gallery/
Kline et al. Annals of Emergency Medicine 2002; 39.
Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall.
Tappin, L. (1994). "Analyzing data relating to the Challenger disaster." Mathematics Teacher, 87.
Tufte, ER. The Visual Display of Quantitative Information. Graphics Press, Cheshire, Connecticut, 1983.
Wainer, H. Visual Revelations: Graphical Tales of Fate and Deception from Napoleon Bonaparte to Ross Perot. 1997.

90 Gambling, Probability, and Risk (Basic Probability and Counting Methods)

91 A gambling experiment Everyone in the room takes 2 cards from the deck (keep face down) Rules, most to least valuable: Pair of the same color (both red or both black) Mixed-color pair (1 red, 1 black) Any two cards of the same suit Any two cards of the same color In the event of a tie, highest card wins (ace is top)

92 What do you want to bet? Look at your two cards. Will you fold or bet?
What is the most rational strategy given your hand?

93 Rational strategy There are N people in the room
What are the chances that someone in the room has a better hand than you? Need to know the probabilities of different scenarios We’ll return to this later in the lecture…

94 Probability Probability – the chance that an uncertain event will occur (always between 0 and 1) Symbols: P(event A) = "the probability that event A will occur" P(red card) = "the probability of a red card" P(~event A) = "the probability of NOT getting event A" [complement] P(~red card) = "the probability of NOT getting a red card" P(A & B) = "the probability that both A and B happen" [joint probability] P(red card & ace) = "the probability of getting a red ace"

95 Assessing Probability
1. Theoretical/Classical probability—based on theory (a priori understanding of a phenomenon), e.g.: the theoretical probability of rolling a 2 on a standard die is 1/6; the theoretical probability of choosing an ace from a standard deck is 4/52; the theoretical probability of getting heads on a regular coin is 1/2. 2. Empirical probability—based on empirical data, e.g.: you toss an irregular die (probabilities unknown) 100 times and find that you get a 2 twenty-five times; the empirical probability of rolling a 2 is 1/4. The empirical probability of an earthquake in the Bay Area by [year] is .62 (based on historical data); the empirical probability of a lifetime smoker developing lung cancer is 15 percent (based on empirical data).

96 Recent headlines on earthquake probabilities…
taly-quake-experts-manslaughter-charge

97 Computing theoretical probabilities:counting methods
Great for gambling! Fun to compute! If outcomes are equally likely to occur, then P(A) = (# of ways A can occur) / (total # of possible outcomes). Note: these are called "counting methods" because we have to count the number of ways A can occur and the number of total possible outcomes.

98 Counting methods: Example 1
Example 1: You draw one card from a deck of cards. What's the probability that you draw an ace? P(ace) = 4/52 = 1/13 ≈ .077

99 Counting methods: Example 2
Example 2. What’s the probability that you draw 2 aces when you draw two cards from the deck? This is a “joint probability”—we’ll get back to this on Wednesday

100 Counting methods: Example 2
Two counting-method ways to calculate this: 1. Consider order: Numerator: the ordered pairs of aces (ace of spades then ace of hearts, ace of hearts then ace of spades, etc.) = 4 x 3 = 12. Denominator = 52 x 51 = 2,652 (why? 52 cards are available for the first draw and 51 for the second). So P = 12/2,652 ≈ .0045

101 Counting methods: Example 2
2. Ignore order: Numerator: the unordered pairs of aces = 6. Denominator = (52 x 51)/2 = 1,326 (divide out order!). Either way, P = 6/1,326 = 12/2,652 ≈ .0045

102 Summary of Counting Methods
Counting methods for computing probabilities Permutations— order matters! Combinations— Order doesn’t matter With replacement Without replacement Without replacement

103 Summary of Counting Methods
Counting methods for computing probabilities Permutations— order matters! With replacement Without replacement

104 Permutations—Order matters!
A permutation is an ordered arrangement of objects. With replacement=once an event occurs, it can occur again (after you roll a 6, you can roll a 6 again on the same die). Without replacement=an event cannot repeat (after you draw an ace of spades out of a deck, there is 0 probability of getting it again).

105 Summary of Counting Methods
Counting methods for computing probabilities Permutations— order matters! With replacement

106 Permutations—with replacement
With Replacement – Think coin tosses, dice, and DNA. "Memoryless" – after you get heads, you have an equally likely chance of getting heads on the next toss (unlike in the cards example, where you can't draw the same card twice from a single deck). What's the probability of getting two heads in a row ("HH") when tossing a coin? Toss 1: 2 outcomes (H or T). Toss 2: 2 outcomes. 2^2 = 4 total possible outcomes: {HH, HT, TH, TT}, so P(HH) = 1/4.

107 Permutations—with replacement
What's the probability of 3 heads in a row? Toss 1: 2 outcomes. Toss 2: 2 outcomes. Toss 3: 2 outcomes. 2^3 = 8 total possible outcomes: {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}, so P(HHH) = 1/8.

108 Permutations—with replacement
When you roll a pair of dice (or 1 die twice), what's the probability of rolling 2 sixes? (1/6)(1/6) = 1/36. What's the probability of rolling a 5 and a 6? 2 x (1/6)(1/6) = 2/36 = 1/18 (the 5 can come first or second).
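These small sample spaces are easy to verify by brute-force enumeration; a Python sketch (an illustration, not from the slides):

    from itertools import product

    coin = list(product("HT", repeat=3))         # 2^3 = 8 equally likely sequences
    print(sum(t == ("H", "H", "H") for t in coin) / len(coin))  # 1/8 = 0.125

    dice = list(product(range(1, 7), repeat=2))  # 6^2 = 36 equally likely rolls
    print(sum(r == (6, 6) for r in dice) / len(dice))        # 1/36
    print(sum(set(r) == {5, 6} for r in dice) / len(dice))   # 2/36 = 1/18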

109 Summary: order matters, with replacement
Formally, "order matters" and "with replacement" use powers: the number of ways to fill r ordered slots from n options is n^r.

110 Summary of Counting Methods
Counting methods for computing probabilities Permutations— order matters! Without replacement

111 Permutations—without replacement
Without replacement—Think cards (w/o reshuffling) and seating arrangements.   Example: You are moderating a debate of gubernatorial candidates. How many different ways can you seat the panelists in a row? Call them Arianna, Buster, Camejo, Donald, and Eve.

112 Permutation—without replacement
"Trial and error" method: systematically write out all arrangements: A B C D E, A B C E D, A B D C E, A B D E C, A B E C D, A B E D C … Quickly becomes a pain! Easier to figure out patterns using the probability tree!

113 Permutation—without replacement
B A C D ……. Seat One: 5 possible Seat Two: only 4 possible Etc…. # of permutations = 5 x 4 x 3 x 2 x 1 = 5! There are 5! ways to order 5 people in 5 chairs (since a person cannot repeat)

114 Permutation—without replacement
What if you had to arrange 5 people in only 3 chairs (meaning 2 are out)? Seat One: 5 possible. Seat Two: only 4 possible. Seat Three: only 3 possible. So 5 x 4 x 3 = 60 arrangements.

115 Permutation—without replacement
Note this also works for 5 people and 5 chairs: 5!/(5-5)! = 5!/0! = 5! = 120 (recall that 0! = 1).

116 Permutation—without replacement
How many two-card hands can I draw from a deck when order matters (e.g., ace of spades followed by ten of clubs is different than ten of clubs followed by ace of spades)? 52 cards for the first draw x 51 cards for the second = 52 x 51 = 2,652

117 Summary: order matters, without replacement
Formally, "order matters" and "without replacement" use factorials: n!/(n-r)! = n(n-1)(n-2)…(n-r+1)

118 Practice problems: A wine taster claims that she can distinguish four vintages of a particular Cabernet. What is the probability that she can do this by merely guessing (she is confronted with 4 unlabeled glasses)? (hint: without replacement) In some states, license plates have six characters: three letters followed by three numbers. How many distinct such plates are possible? (hint: with replacement)

119 Answer 1 A wine taster claims that she can distinguish four vintages of a particular Cabernet. What is the probability that she can do this by merely guessing (she is confronted with 4 unlabeled glasses)? (hint: without replacement) P(success) = 1 (there's only one way to get it right!) / total # of guesses she could make. Total # of guesses one could make randomly: glass one: 4 choices; glass two: 3 vintages left; glass three: 2 left; glass four: no "degrees of freedom" left. Total = 4 x 3 x 2 x 1 = 4! P(success) = 1/4! = 1/24 ≈ .042

120 Answer 2 In some states, license plates have six characters: three letters followed by three numbers. How many distinct such plates are possible? (hint: with replacement) 26^3 different ways to choose the letters and 10^3 different ways to choose the digits, so the total number = 26^3 x 10^3 = 17,576 x 1,000 = 17,576,000
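Python's math module has these counting tools built in; an illustrative check of the answers above:

    import math

    print(1 / math.factorial(4))  # wine taster: 1/4! = 0.041666...
    print(26**3 * 10**3)          # license plates: 17576000
    print(math.perm(5, 3))        # 5 people in 3 chairs: 60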

121 Counting methods for computing probabilities
Summary of Counting Methods Counting methods for computing probabilities Combinations— Order doesn’t matter Without replacement

122 2. Combinations—Order doesn’t matter
Introduction to the combination function, or "choosing": written as nCr = n!/(r!(n-r)!). Spoken: "n choose r"

123 Combinations How many two-card hands can I draw from a deck when order does not matter (e.g., ace of spades followed by ten of clubs is the same as ten of clubs followed by ace of spades)? (52 x 51)/2 = 1,326

124 Combinations How many five-card hands can I draw from a deck when order does not matter? 52 x 51 x 50 x 49 x 48 ordered draws; but this counts each 5-card hand many times…

125 Combinations 1. 2. 3. …. How many repeats total??

126 Combinations 1. 2. 3. …. i.e., how many different ways can you arrange 5 cards…?

127 Combinations
How many times is each 5-card hand repeated in that count? That's a permutation without replacement: 5! = 120. So divide: (52 x 51 x 50 x 49 x 48)/5! = 2,598,960 distinct hands.

128 Combinations How many unique 2-card sets out of 52 cards? 5-card sets? r-card sets? r-card sets out of n cards?
2-card sets: (52 x 51)/2! = 1,326. 5-card sets: 52!/(5!47!) = 2,598,960. r-card sets out of n cards: n!/(r!(n-r)!)

129 Summary: combinations
If r objects are taken from a set of n objects without replacement and disregarding order, how many different samples are possible? Formally, "order doesn't matter" and "without replacement" use choosing: nCr = n!/(r!(n-r)!)

130 Examples—Combinations
A lottery works by picking 6 numbers from 1 to 49. How many combinations of 6 numbers could you choose? 49C6 = 13,983,816. Which of course means that your probability of winning is 1/13,983,816!

131 Examples How many ways can you get 3 heads in 5 coin tosses? 5C3 = 5!/(3!2!) = 10
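All of these combinations can be checked with math.comb (illustrative):

    import math

    print(math.comb(5, 3))    # 10 ways to choose which 3 of 5 tosses are heads
    print(math.comb(52, 2))   # 1326 two-card hands
    print(math.comb(52, 5))   # 2598960 five-card hands
    print(math.comb(49, 6))   # 13983816 lottery tickets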

132 Summary of Counting Methods
Counting methods for computing probabilities. Permutations—order matters! With replacement: n^r. Without replacement: n(n-1)(n-2)…(n-r+1) = n!/(n-r)!. Combinations—order doesn't matter. Without replacement: nCr = n!/(r!(n-r)!)

133 Gambling, revisited What are the probabilities of the following hands?
Pair of the same color Pair of different colors Any two cards of the same suit Any two cards of the same color

134 Pair of the same color? P(pair of the same color) =
Numerator = red aces, black aces; red kings, black kings; etc.… = 2x13 = 26. Denominator = 1,326. P(pair of the same color) = 26/1,326 ≈ 2%

135 Any old pair? P(any pair) = [13 ranks x 4C2 = 13 x 6 = 78] / 1,326 ≈ 5.9%

136 Two cards of same suit? P = [4 suits x 13C2 = 4 x 78 = 312] / 1,326 ≈ 23.5%

137 Two cards of same color? Numerator: 26C2 x 2 colors = 26!/(24!2!) x 2 = 325 x 2 = 650. Denominator = 1,326. So, P(two cards of the same color) = 650/1,326 = 49% chance. A little non-intuitive? Here's another way to look at it: start from any first card (52 cards: 26 red branches, 26 black branches). From a red branch: 26 black left, 25 red left; from a black branch: 26 red left, 25 black left. The ordered counts are RR 26x25, RB 26x26, BR 26x26, BB 26x25, so P(same color) = 25/51 = 50/102, not quite 50/100.
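All four hand probabilities can be verified by enumerating the 1,326 unordered two-card hands; a brute-force Python sketch (illustrative only):

    from itertools import combinations

    ranks = "23456789TJQKA"
    suits = "shdc"            # spades, hearts, diamonds, clubs
    red = {"h", "d"}
    deck = [r + s for r in ranks for s in suits]

    hands = list(combinations(deck, 2))   # all 1326 equally likely hands
    n = len(hands)
    print(sum(a[0] == b[0] for a, b in hands) / n)                    # any pair: 78/1326
    print(sum(a[1] == b[1] for a, b in hands) / n)                    # same suit: 312/1326
    print(sum((a[1] in red) == (b[1] in red) for a, b in hands) / n)  # same color: 650/1326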

138 Rational strategy? To bet or fold?
It would be really complicated to take into account the dependence between hands in the class (since we all drew from the same deck), so we’re going to fudge this and pretend that everyone had equal probabilities of each type of hand (pretend we have “independence”)…  Just to get a rough idea...

139 Rational strategy? P(at least one same-color pair in the class)=
**Trick! P(at least 1) = 1 - P(0). P(at least one same-color pair in the class) = 1 - P(no same-color pairs in the whole class) = 1 - (1 - 26/1,326)^40 ≈ 1 - (.98)^40 ≈ 55%

140 Rational strategy? P(at least one pair)= 1-P(no pairs)=
1-(.94)^40 = 1 - 8% = 92% chance. P(≥1 same suit) = 1 - P(all different suits) = 1-(.765)^40 ≈ 100%. P(≥1 same color) = 1 - P(all different colors) = 1-(.51)^40 ≈ 100%

141 Rational strategy… Fold unless you have a same-color pair or a numerically high pair (e.g., Queen, King, Ace). How does this compare to class? -anyone with a same-color pair? -any pair? -same suit? -same color?

142 Practice problem: A classic problem: “The Birthday Problem.” What’s the probability that two people in a class of 25 have the same birthday? (disregard leap years) What would you guess is the probability?

143 Birthday Problem Answer
1. A classic problem: "The Birthday Problem." What's the probability that two people in a class of 25 have the same birthday? (disregard leap years) **Trick! P(at least one) = 1 - P(none). Use the complement: it's easier to calculate 1 - P(no matches), which equals the probability that at least one pair of people have the same birthday. What's the probability of no matches? Denominator: how many sets of 25 birthdays are there? With replacement (order matters): 365^25. Numerator: how many different ways can you distribute 365 birthdays to 25 people without replacement? Order matters, without replacement: 365!/(365-25)! = 365 x 364 x 363 x … x (365-24) = 365 x 364 x … x 341. So P(no matches) = [365 x 364 x … x 341] / 365^25

144 Use SAS as a calculator
Use SAS as a calculator… (my calculator won't do factorials as high as 365, so I had to improvise by using a loop, which you'll learn later in HRP 223):
%LET num = 25; *set number in the class;
data _null_;
top=1; *initialize numerator;
do j=0 to (&num-1) by 1;
top=(365-j)*top;
end;
BDayProb=1-(top/365**&num);
put BDayProb;
run;
From the SAS log: 0.568699704, so 57% chance!
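For readers without SAS, an equivalent Python sketch (an illustration, not part of the original deck):

    def birthday_match_prob(n):
        """P(at least two of n people share a birthday), ignoring leap years."""
        p_no_match = 1.0
        for j in range(n):
            p_no_match *= (365 - j) / 365
        return 1 - p_no_match

    print(birthday_match_prob(25))  # 0.5687..., so 57% chance
    print(birthday_match_prob(40))  # 0.891..., so 89% chance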

145 For class of 40 (our class)?
%LET num = 40; *set number in the class;
data _null_;
top=1; *initialize numerator;
do j=0 to (&num-1) by 1;
top=(365-j)*top;
end;
BDayProb=1-(top/365**&num);
put BDayProb;
run;
From the SAS log: ≈ 0.891, i.e. 89% chance of a match!

146 In this class? --Jan? --Feb? --March? --April? --May? --June? --July?
--August? --September? ….

147 And the odds ratio and risk ratio as conditional probability

148 Today’s lecture Probability trees Statistical independence
Joint probability Conditional probability Marginal probability Bayes’ Rule Risk ratio Odds ratio

149 Probability example Sample space: the set of all possible outcomes.
For example, in genetics, if both the mother and father carry one copy of a recessive disease-causing mutation (d), there are three possible outcomes (the sample space): child is not a carrier (DD), child is a carrier (Dd), child has the disease (dd). Probabilities: the likelihood of each of the possible outcomes (always 0 ≤ P ≤ 1.0). P(genotype=DD)=.25, P(genotype=Dd)=.50, P(genotype=dd)=.25. Note: mutually exclusive, exhaustive probabilities sum to 1.

150 Using a probability tree
Mendel example: What's the chance of having a heterozygote child (Dd) if both parents are heterozygote (Dd)? Tree: Mother's allele: P(♀D)=.5, P(♀d)=.5. Father's allele: P(♂D)=.5, P(♂d)=.5. Child's outcomes: P(DD)=.5*.5=.25, P(Dd)=.5*.5=.25, P(dD)=.5*.5=.25, P(dd)=.5*.5=.25 (total = 1.0). So P(heterozygote) = P(Dd) + P(dD) = .50. Rule of thumb: in probability, "and" means multiply, "or" means add.

151 Independence Formal definition: A and B are independent if and only if P(A&B)=P(A)*P(B). The mother's and father's alleles are segregating independently. Conditional probability: P(♂D/♀D)=.5 and P(♂D/♀d)=.5, read as "the probability that the father passes a D allele given that the mother passes a D (or d) allele." What the father's gamete looks like is not dependent on the mother's: it doesn't depend which branch you start on! Joint probability: the probability of two events happening simultaneously; formally, P(DD)=.25=P(D♂)*P(D♀). Marginal probability: this is the probability that an event happens at all, ignoring all other outcomes.

152 On the tree Conditional probability Marginal probability: mother
[The same probability tree, annotated:] the branch probabilities P(♂D/♀D)=.5, P(♂d)=.5, etc. are conditional probabilities; P(♀D=.5) and P(♀d=.5) are the mother's marginal probabilities and P(♂D=.5), P(♂d=.5) the father's; the end-of-branch products P(DD)=.5*.5=.25, P(Dd)=.25, P(dD)=.25, P(dd)=.25 (summing to 1.0) are joint probabilities.

153 Conditional, marginal, joint
The marginal probability that player 1 gets two aces is 12/2652. The marginal probability that player 5 gets two aces is 12/2652. The marginal probability that player 9 gets two aces is 12/2652. The joint probability that all three players get pairs of aces is 0. The conditional probability that player 5 gets two aces given that player 1 got 2 aces is (2/50*1/49).

154 Test of independence event A=player 1 gets pair of aces
event B = player 2 gets pair of aces; event C = player 3 gets pair of aces. P(A&B&C) = 0, but P(A)*P(B)*P(C) = (12/2652)^3 ≠ 0. Not independent.

155 Independent ≠ mutually exclusive
Events A and ~A are mutually exclusive, but they are NOT independent. P(A&~A) = 0, but P(A)*P(~A) ≠ 0. Conceptually, once A has happened, ~A is impossible; thus, they are completely dependent.

156 Practice problem If HIV has a prevalence of 3% in San Francisco, and a particular HIV test has a false positive rate of .001 and a false negative rate of .01, what is the probability that a random person selected off the street will test positive?

157 Answer P(test +)=.0297+.00097=.03067 P(+&test+)P(+)*P(test+)
Conditional probability: the probability of testing + given that a person is + Joint probability of being + and testing + Marginal probability of carrying the virus. P(test +)=.99 P(test - )= .01 P (+, test +)=.0297 P(+)=.03 P(-)=.97 P(+, test -)=.003 P(test +) = .001 P(test -) = .999 P(-, test +)=.00097 ______________ 1.0 P(-, test -) = Marginal probability of testing positive P(test +)= =.03067 P(+&test+)P(+)*P(test+) .0297 .03* (=.00092)  Dependent!

158 Law of total probability
One of these has to be true (mutually exclusive, collectively exhaustive). They sum to 1.0.

159 Law of total probability
Formal Rule: Marginal probability for event A = P(A) = Σ P(A/Bᵢ)*P(Bᵢ), where the Bᵢ (B1, B2, B3, …) are mutually exclusive and collectively exhaustive.

160 Example 2 A 54-year-old woman has an abnormal mammogram; what is the chance that she has breast cancer?

161 Example: Mammography
Tree: marginal probabilities of breast cancer (prevalence among all 54-year-olds): P(BC+)=.003, P(BC-)=.997. Sensitivity: P(test +/BC+)=.90 and P(test -/BC+)=.10, so P(+, test +)=.0027 and P(+, test -)=.0003. Specificity: P(test -/BC-)=.89, so P(test +/BC-)=.11, giving P(-, test +)=.10967 and P(-, test -)=.88733. (Total = 1.0.) P(BC/test+)=.0027/(.0027+.10967)=2.4%

162 Bayes’ rule

163 Bayes’ Rule: derivation
Definition: Let A and B be two events with P(B) ≠ 0. The conditional probability of A given B is: P(A/B) = P(A&B)/P(B). The idea: if we are given that the event B occurred, the relevant sample space is reduced to B {P(B)=1 because we know B is true} and conditional probability becomes a probability measure on B.

164 Bayes’ Rule: derivation
P(A/B) = P(A&B)/P(B) can be re-arranged to: P(A&B) = P(A/B)*P(B), and, since A&B = B&A, also: P(A&B) = P(B/A)*P(A). Setting the two equal and dividing by P(B) gives Bayes' Rule: P(A/B) = P(B/A)*P(A)/P(B).

165 Bayes' Rule: OR From the "Law of Total Probability": P(A/B) = P(B/A)*P(A) / [P(B/A)*P(A) + P(B/~A)*P(~A)]

166 Bayes’ Rule: Why do we care?? Why is Bayes’ Rule useful??
It turns out that sometimes it is very useful to be able to “flip” conditional probabilities. That is, we may know the probability of A given B, but the probability of B given A may not be obvious. An example will help…

167 In-Class Exercise If HIV has a prevalence of 3% in San Francisco, and a particular HIV test has a false positive rate of .001 and a false negative rate of .01, what is the probability that a random person who tests positive is actually infected (also known as “positive predictive value”)?

168 Answer: using probability tree
Tree: P(+)=.03 with P(test +/+)=.99 and P(test -/+)=.01, giving P(+, test +)=.0297 and P(+, test -)=.0003; P(-)=.97 with P(test +/-)=.001 and P(test -/-)=.999, giving P(-, test +)=.00097 and P(-, test -)=.96903. (Total = 1.0.) A positive test places one on either of the two "test +" branches. But only the top branch also fulfills the event "true infection." Therefore, the probability of being infected is the probability of being on the top branch given that you are on one of the two "test +" branches: .0297/(.0297+.00097) = .0297/.03067 ≈ .97.

169 Answer: using Bayes' rule P(+/test+) = P(test+/+)*P(+) / P(test+) = (.99)(.03)/.03067 ≈ .97
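The same arithmetic as a small Python function (an illustrative sketch; the prevalence and error rates are the slide's values):

    def positive_predictive_value(prevalence, sensitivity, false_pos_rate):
        # P(disease | test +) via Bayes' rule / law of total probability
        p_test_pos = sensitivity * prevalence + false_pos_rate * (1 - prevalence)
        return sensitivity * prevalence / p_test_pos

    print(positive_predictive_value(0.03, 0.99, 0.001))  # ~0.97 (HIV example)
    print(positive_predictive_value(0.003, 0.90, 0.11))  # ~0.024 (mammography example)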

170 Practice problem An insurance company believes that drivers can be divided into two classes—those that are of high risk and those that are of low risk. Their statistics show that a high-risk driver will have an accident at some time within a year with probability .4, but this probability is only .1 for low risk drivers. Assuming that 20% of the drivers are high-risk, what is the probability that a new policy holder will have an accident within a year of purchasing a policy? If a new policy holder has an accident within a year of purchasing a policy, what is the probability that he is a high-risk type driver?

171 Answer to (a) Use law of total probability: P(accident)=
Assuming that 20% of the drivers are high-risk, what is the probability that a new policy holder will have an accident within a year of purchasing a policy? Use the law of total probability: P(accident) = P(accident/high risk)*P(high risk) + P(accident/low risk)*P(low risk) = .40(.20) + .10(.80) = .08 + .08 = .16

172 Answer to (b) P(high risk/accident)=.08/.16=50%
If a new policy holder has an accident within a year of purchasing a policy, what is the probability that he is a high-risk type driver? P(high-risk/accident) = P(accident/high risk)*P(high risk)/P(accident) = .40(.20)/.16 = 50%. Or use the tree: P(high risk)=.20 with P(accident/HR)=.4 and P(no acc/HR)=.6, giving P(accident, high risk)=.08 and P(no accident, high risk)=.12; P(low risk)=.80 with P(accident/LR)=.1 and P(no accident/LR)=.9, giving P(accident, low risk)=.08 and P(no accident, low risk)=.72. (Total = 1.0.) P(high risk/accident)=.08/.16=50%
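The same two steps in code (illustrative):

    p_hr, p_lr = 0.20, 0.80          # P(high risk), P(low risk)
    p_acc_hr, p_acc_lr = 0.40, 0.10  # P(accident | risk class)

    p_accident = p_acc_hr * p_hr + p_acc_lr * p_lr  # law of total probability
    print(p_accident)                               # 0.16
    print(p_acc_hr * p_hr / p_accident)             # Bayes' rule: 0.5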

173 Fun example/bad investment

174 Conditional Probability for Epidemiology: The odds ratio and risk ratio as conditional probability

175 The Risk Ratio and the Odds Ratio as conditional probability
In epidemiology, the association between a risk factor or protective factor (exposure) and a disease may be evaluated by the “risk ratio” (RR) or the “odds ratio” (OR). Both are measures of “relative risk”— the general concept of comparing disease risks in exposed vs. unexposed individuals.

176 Odds and Risk (probability)
Definitions: Risk = P(A) = cumulative probability (you specify the time period!) For example, what’s the probability that a person with a high sugar intake develops diabetes in 1 year, 5 years, or over a lifetime? Odds = P(A)/P(~A) For example, “the odds are 3 to 1 against a horse” means that the horse has a 25% probability of winning. Note: An odds is always higher than its corresponding probability, unless the probability is 100%.

177 Odds vs. Risk=probability
If the risk is ½ (50%), the odds are 1:1; risk ¾ (75%), odds 3:1; risk 1/10 (10%), odds 1:9; risk 1/100 (1%), odds 1:99. Note: An odds is always higher than its corresponding probability, unless the probability is 100%.

178 Cohort Studies (risk ratio)
[Diagram: from the target population, a disease-free cohort is split into Exposed and Not Exposed groups, each followed over TIME into Disease vs. Disease-free.]

179 The Risk Ratio
2x2 table with columns Exposure (E) and No Exposure (~E), rows Disease (D): a, b and No Disease (~D): c, d, and column totals a+c and b+d. Risk ratio = risk to the exposed / risk to the unexposed = [a/(a+c)] / [b/(b+d)]

180 Hypothetical Data
Congestive Heart Failure / No CHF / Total: High Systolic BP: 400 / 1100 / 1500; Normal BP: 400 / 2600 / 3000. RR = (400/1500)/(400/3000) = .267/.133 = 2.0

181 Case-Control Studies (odds ratio)
[Diagram: from the target population, sample Disease (Cases) and No Disease (Controls), then look back in time to classify each as Exposed in past or Not exposed.]

182 Case-control study example:
You sample 50 stroke patients and 50 controls without stroke and ask about their smoking in the past.

183 Hypothetical results:
Stroke (D): 15 smokers (E), 35 non-smokers (~E), total 50. No Stroke (~D): 8 smokers, 42 non-smokers, total 50.

184 What’s the risk ratio here?
Stroke (D): 15 smokers (E), 35 non-smokers (~E). No Stroke (~D): 8 smokers, 42 non-smokers. (50 cases, 50 controls.) Tricky: There is no risk ratio, because we cannot calculate the risk of disease!! The 50/50 split was fixed by the sampling design.

185 The odds ratio… We cannot calculate a risk ratio from a case-control study. BUT, we can calculate a measure called the odds ratio…

186 The Odds Ratio (OR) Smoker (E) Stroke (D) No Stroke (~D)
Stroke (D): 15 smokers (E), 35 non-smokers (~E), total 50. No Stroke (~D): 8 smokers, 42 non-smokers, total 50. These data give: P(E/D) and P(E/~D). Luckily, you can flip the conditional probabilities using Bayes' Rule. Unfortunately, our sampling scheme precludes calculation of the marginals P(E) and P(D), but it turns out we don't need these if we use an odds ratio, because the marginals cancel out!

187 The Odds Ratio (OR) Odds of exposure in the cases
Using the same 2x2 table (Exposure (E) / No Exposure (~E); Disease (D): a, b; No Disease (~D): c, d): OR = odds of exposure in the cases / odds of exposure in the controls = (a/b) / (c/d) = ad/bc

188 The Odds Ratio (OR) Odds of disease in the exposed
Odds of exposure in the cases / odds of exposure in the controls is backward from what we want… But this expression is mathematically equivalent to: odds of disease in the exposed / odds of disease in the unexposed, which is the direction of interest! (a/b)/(c/d) = (a/c)/(b/d) = ad/bc.

189 Proof via Bayes’ Rule What we want! = Odds of exposure in the cases
Odds of exposure in the cases / odds of exposure in the controls: apply Bayes' Rule to each conditional probability, and the marginals P(D) and P(E) cancel, leaving odds of disease in the exposed / odds of disease in the unexposed. What we want!

190 The odds ratio here:
Stroke (D): 15 smokers (E), 35 non-smokers (~E); No Stroke (~D): 8 smokers, 42 non-smokers (total 50 each). OR = (15 x 42)/(8 x 35) = 630/280 = 2.25. Interpretation: there is a 2.25-fold higher odds of stroke in smokers vs. non-smokers.

191 Interpretation of the odds ratio:
The odds ratio will always be bigger than the corresponding risk ratio if RR >1 and smaller if RR <1 (the harmful or protective effect always appears larger) The magnitude of the inflation depends on the prevalence of the disease.

192 The rare disease assumption
When a disease is rare: P(~D) = 1 - P(D) ≈ 1, so the odds of disease ≈ the risk of disease, and OR ≈ RR.

193 The odds ratio vs. the risk ratio
[Graph: for a rare outcome, the odds ratio and risk ratio lie close together around 1.0 (the null); for a common outcome, the odds ratio falls farther from the null than the risk ratio.]

194 Odds ratios in cross-sectional and cohort studies…
Many cohort and cross-sectional studies report ORs rather than RRs even though the data necessary to calculate RRs are available. Why? If you have a binary outcome and want to adjust for confounders, you have to use logistic regression. Logistic regression gives adjusted odds ratios, not risk ratios (more on this in HRP 261). These odds ratios must be interpreted cautiously (as increased odds, not risk) when the outcome is common. When the outcome is common, authors should also report unadjusted risk ratios and/or use a simple formula to convert adjusted odds ratios back to adjusted risk ratios.

195 Example, wrinkle study…
A cross-sectional study on risk factors for wrinkles found that heavy smoking significantly increases the risk of prominent wrinkles. Adjusted OR=3.92 (heavy smokers vs. nonsmokers) calculated from logistic regression. Interpretation: heavy smoking increases risk of prominent wrinkles nearly 4-fold?? The prevalence of prominent wrinkles in non-smokers is roughly 45%. So, it’s not possible to have a 4-fold increase in risk (=180%)! Raduan et al. J Eur Acad Dermatol Venereol Jul 3.

196 Interpreting ORs when the outcome is common…
If the outcome has a 10% prevalence in the unexposed/reference group*, the maximum possible RR=10.0. For 20% prevalence, the maximum possible RR=5.0 For 30% prevalence, the maximum possible RR=3.3. For 40% prevalence, maximum possible RR=2.5. For 50% prevalence, maximum possible RR=2.0. *Authors should report the prevalence/risk of the outcome in the unexposed/reference group, but they often don’t. If this number is not given, you can usually estimate it from other data in the paper (or, if it’s important enough, the authors).

197 Interpreting ORs when the outcome is common…
If data are from a cross-sectional or cohort study, then you can convert ORs (from logistic regression) back to RRs with a simple formula: RR = OR / [(1 - P0) + (P0 x OR)], where: OR = odds ratio from logistic regression (e.g., 3.92), and P0 = P(D/~E) = probability/prevalence of the outcome in the unexposed/reference group (e.g. ~45%). Formula from: Zhang J. What's the Relative Risk? A Method of Correcting the Odds Ratio in Cohort Studies of Common Outcomes. JAMA. 1998;280.

198 For wrinkle study… RR = 3.92 / [(1 - .45) + (.45 x 3.92)] = 3.92/2.31 ≈ 1.69. So, the risk (prevalence) of wrinkles is increased by 69%, not 292%. Zhang J. What's the Relative Risk? A Method of Correcting the Odds Ratio in Cohort Studies of Common Outcomes. JAMA. 1998;280.
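A small helper implementing the Zhang-style correction described above (an illustrative sketch):

    def or_to_rr(odds_ratio, p0):
        # Approximate RR from an OR, given outcome prevalence p0 in the
        # unexposed/reference group (Zhang, JAMA 1998)
        return odds_ratio / ((1 - p0) + p0 * odds_ratio)

    print(or_to_rr(3.92, 0.45))  # ~1.69: the wrinkle example
    print(or_to_rr(5.12, 0.25))  # ~2.5: sleep/hypertension example (next slide)
    print(or_to_rr(3.53, 0.25))  # ~2.2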

199 Sleep and hypertension study…
OR(hypertension) = 5.12 for chronic insomniacs who sleep ≤5 hours per night vs. the reference (good sleep) group. OR(hypertension) = 3.53 for chronic insomniacs who sleep 5-6 hours per night vs. the reference group. Interpretation: risk of hypertension is increased 500% and 350% in these groups? No: ~25% of the reference group has hypertension. Use the formula to find the corresponding RRs = 2.5 and 2.2. Correct interpretation: hypertension risk is increased 150% and 120% in these groups. -Sainani KL, Schmajuk G, Liu V. A Caution on Interpreting Odds Ratios. SLEEP, Vol. 32, No. 8. -Vgontzas AN, Liao D, Bixler EO, Chrousos GP, Vela-Bueno A. Insomnia with objective short sleep duration is associated with a high risk for hypertension. Sleep 2009;32:491-7.

200 Practice problem: 1. Suppose the following data were collected on a random sample of subjects (the researchers did not sample on exposure or disease status). Own a cell phone: neck pain 143, no neck pain 209. Don't own a cell phone: neck pain 22, no neck pain 69. Calculate the odds ratio and risk ratio for the association between cell phone usage and neck pain (common outcome).

201 Answer OR = (69*143)/(22*209) = 2.15 RR = (143/352)/(22/91) = 1.68
Own a cell phone: neck pain 143, no neck pain 209 (total 352). Don't own a cell phone: neck pain 22, no neck pain 69 (total 91). OR = (69*143)/(22*209) = 2.15. RR = (143/352)/(22/91) = 1.68
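A generic 2x2 helper (an illustrative sketch) reproduces both practice answers from the cell counts:

    def two_by_two(a, b, c, d):
        # a = exposed with outcome, b = exposed without,
        # c = unexposed with outcome, d = unexposed without
        odds_ratio = (a * d) / (b * c)
        risk_ratio = (a / (a + b)) / (c / (c + d))
        return odds_ratio, risk_ratio

    print(two_by_two(143, 209, 22, 69))  # neck pain (common): OR 2.15, RR 1.68
    print(two_by_two(5, 347, 3, 88))     # brain tumor (rare): OR 0.42, RR 0.43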

202 Practice problem: 2. Suppose the following data were collected on a random sample of subjects (the researchers did not sample on exposure or disease status). Own a cell phone: brain tumor 5, no brain tumor 347. Don't own a cell phone: brain tumor 3, no brain tumor 88. Calculate the odds ratio and risk ratio for the association between cell phone usage and brain tumor (rare outcome).

203 Answer OR = (5*88)/(3*347) = .42267 RR = (5/352)/(3/91) = .43087
Own a cell phone: brain tumor 5, no brain tumor 347 (total 352). Don't own a cell phone: brain tumor 3, no brain tumor 88 (total 91). OR = (5*88)/(3*347) = .42267. RR = (5/352)/(3/91) = .43087. With a rare outcome, the OR and RR are nearly identical.

204 Thought problem… Another classic first-year statistics problem. You are on the Monty Hall show. You are presented with 3 doors (A, B, C), only one of which has something valuable to you behind it (the others are bogus). You do not know what is behind any of the doors. You choose door A; Monty Hall opens door B and shows you that there is nothing behind it. Then he gives you the option of sticking with A or switching to C. Do you stay or switch? Does it matter?

205 Some Monty Hall links… html?res=9D0CEFDD1E3FF932A15754C 0A &sec=&spon=&pagewant ed=all /science/08tier.html?_r=1&em&ex= &en=81bdecc33f60033e&ei= 5087%0A&oref=slogin /science/08monty.html#

206 Probability Distributions

207 Random Variable A random variable x takes on a defined set of values with different probabilities. For example, if you roll a die, the outcome is random (not fixed) and there are 6 possible outcomes, each of which occurs with probability one-sixth. For example, if you poll people about their voting preferences, the percentage of the sample that responds "Yes on Proposition 100" is also a random variable (the percentage will be slightly different every time you poll). Roughly, probability is how frequently we expect different outcomes to occur if we repeat the experiment over and over ("frequentist" view).

208 Random variables can be discrete or continuous
Discrete random variables have a countable number of outcomes Examples: Dead/alive, treatment/placebo, dice, counts, etc. Continuous random variables have an infinite continuum of possible values. Examples: blood pressure, weight, the speed of a car, the real numbers from 1 to 6.

209 Probability functions
A probability function maps the possible values of x against their respective probabilities of occurrence, p(x). p(x) is a number from 0 to 1.0. The area under a probability function is always 1. [Speaker notes: It turns out that if you were to go out and sample many, many times, most sample statistics you could calculate would follow a normal distribution. Recall the 2 parameters that define any normal distribution: a mean and a variability (SD). The standard deviation is the natural variability of the population; the standard error is the standard deviation of any sample statistic, e.g., the standard error of the mean, of the odds ratio, or of the difference of 2 means.]

210 Discrete example: roll of a die
[Plot: p(x) = 1/6 for each face x = 1, 2, 3, 4, 5, 6.]

211 Probability mass function (pmf)
x: 1, 2, 3, 4, 5, 6; p(x): p(x=1)=1/6, p(x=2)=1/6, p(x=3)=1/6, p(x=4)=1/6, p(x=5)=1/6, p(x=6)=1/6. Total = 1.0.

212 Cumulative distribution function (CDF)
[Step plot of the CDF P(x): 1/6, 1/3, 1/2, 2/3, 5/6, 1.0 at x = 1 through 6.]

213 Cumulative distribution function
x: 1, 2, 3, 4, 5, 6; P(x≤A): P(x≤1)=1/6, P(x≤2)=2/6, P(x≤3)=3/6, P(x≤4)=4/6, P(x≤5)=5/6, P(x≤6)=6/6.

214 Examples 1. What’s the probability that you roll a 3 or less?
P(x≤3)=1/2 2. What’s the probability that you roll a 5 or higher? P(x≥5) = 1 – P(x≤4) = 1-2/3 = 1/3
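The die pmf and CDF in a few lines of Python (illustrative; Fraction keeps exact sixths):

    from fractions import Fraction

    pmf = {x: Fraction(1, 6) for x in range(1, 7)}
    cdf = {x: sum(pmf[k] for k in range(1, x + 1)) for x in pmf}

    print(cdf[3])      # 1/2 -> P(roll a 3 or less)
    print(1 - cdf[4])  # 1/3 -> P(roll a 5 or higher)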

215 Practice Problem Which of the following are probability functions?
a. f(x)=.25 for x=9,10,11,12
b. f(x)=(3-x)/2 for x=1,2,3,4
c. f(x)=(x²+x+1)/25 for x=0,1,2,3

216 Answer (a)
a. f(x)=.25 for x=9,10,11,12: f(9)=.25, f(10)=.25, f(11)=.25, f(12)=.25; total = 1.0. Yes, probability function!

217 Answer (b)
b. f(x)=(3-x)/2 for x=1,2,3,4: f(1)=(3-1)/2=1.0, f(2)=(3-2)/2=.5, f(3)=(3-3)/2=0, f(4)=(3-4)/2=-.5. Though this sums to 1, you can't have a negative probability; therefore, it's not a probability function.

218 Answer (c)
c. f(x)=(x²+x+1)/25 for x=0,1,2,3: f(0)=1/25, f(1)=3/25, f(2)=7/25, f(3)=13/25. The sum is 24/25, which doesn't sum to 1. Thus, it's not a probability function.

219 Practice Problem: Find the probability that on a given day:
The number of ships to arrive at a harbor on any given day is a random variable represented by x. The probability distribution for x is: x: 10, 11, 12, 13, 14; P(x): .4, .2, .2, .1, .1. Find the probability that on a given day: a. exactly 14 ships arrive: p(x=14) = .1. b. at least 12 ships arrive: p(x≥12) = (.2+.1+.1) = .4. c. at most 11 ships arrive: p(x≤11) = (.4+.2) = .6

220 Practice Problem: You are lecturing to a group of 1000 students. You ask them to each randomly pick an integer between 1 and 10. Assuming their picks are truly random: What's your best guess for how many students picked the number 9? Since p(x=9) = 1/10, we'd expect about 1/10th of the students, or about 100 students, to pick 9. What percentage of the students would you expect picked a number less than or equal to 6? Since p(x≤6) = 1/10 + 1/10 + 1/10 + 1/10 + 1/10 + 1/10 = .6, about 60%.

221 Important discrete distributions in epidemiology…
Binomial Yes/no outcomes (dead/alive, treated/untreated, smoker/non-smoker, sick/well, etc.) Poisson Counts (e.g., how many cases of disease in a given area)

222 Continuous case The probability function that accompanies a continuous random variable is a continuous mathematical function that integrates to 1. The probabilities associated with continuous functions are just areas under the curve (integrals!). Probabilities are given for a range of values, rather than a particular value (e.g., the probability of getting a math SAT score between 700 and 800 is 2%).

223 Continuous case For example, recall the negative exponential function (in probability, this is called an "exponential distribution"): f(x) = e^(-x) for x ≥ 0. This function integrates to 1: the integral of e^(-x) from 0 to ∞ equals 1.

224 Continuous case: “probability density function” (pdf)
[Plot: p(x)=e^(-x).] The probability that x is any one exact particular value is 0; we can only assign probabilities to possible ranges of x.

225 For example, the probability of x falling within 1 to 2:
[Shaded area under p(x)=e^(-x) from 1 to 2.] P(1 ≤ x ≤ 2) = e^(-1) - e^(-2) ≈ .368 - .135 = .23

226 Cumulative distribution function
As in the discrete case, we can specify the "cumulative distribution function" (CDF): the CDF here = P(x≤A) = the integral of e^(-x) from 0 to A = 1 - e^(-A)

227 Example 2 [Plot: p(x) = 1 on 0 ≤ x ≤ 1.]

228 Example 2: Uniform distribution
The uniform distribution: all values are equally likely. f(x) = 1 for 0 ≤ x ≤ 1. We can see it's a probability distribution because it integrates to 1 (the area under the curve is 1): base x height = 1 x 1 = 1.

229 Example: Uniform distribution
What's the probability that x is between ¼ and ½? P(¼ ≤ x ≤ ½) = width x height = (½ - ¼)(1) = ¼

230 Practice Problem 4. Suppose that survival drops off rapidly in the year following diagnosis of a certain type of advanced cancer. Suppose that the length of survival (or time-to-death) is a random variable that approximately follows an exponential distribution with parameter 2 (which makes it a steeper drop-off): f(x) = 2e^(-2x) for x ≥ 0. What's the probability that a person who is diagnosed with this illness survives a year?

231 Answer The probability of dying within 1 year can be calculated using the cumulative distribution function: P(x≤A) = 1 - e^(-2A). The chance of surviving past 1 year is: P(x≥1) = 1 - P(x≤1) = 1 - (1 - e^(-2)) = e^(-2) ≈ .135, about a 14% chance.
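Checking the survival calculation numerically (illustrative; standard library only):

    import math

    def expo_cdf(a, rate=2):
        # P(x <= a) for an exponential distribution: 1 - e^(-rate*a)
        return 1 - math.exp(-rate * a)

    print(1 - expo_cdf(1))  # survival past 1 year: e^(-2) = 0.135...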

232 Expected Value and Variance
All probability distributions are characterized by an expected value and a variance (standard deviation squared).

233 For example, bell-curve (normal) distribution:
Mean () One standard deviation from the mean ()

234 Expected value, or mean If we understand the underlying probability function of a certain phenomenon, then we can make informed decisions based on how we expect x to behave on-average over the long-run…(so called “frequentist” theory of probability). Expected value is just the weighted average or mean (µ) of random variable x. Imagine placing the masses p(x) at the points X on a beam; the balance point of the beam is the expected value of x.

235 Example: expected value
Recall the following probability distribution of ship arrivals: x: 10, 11, 12, 13, 14; P(x): .4, .2, .2, .1, .1. E(x) = 10(.4) + 11(.2) + 12(.2) + 13(.1) + 14(.1) = 11.3

236 Expected value, formally
Discrete case: E(X) = Σ xᵢ p(xᵢ), summed over all possible values. Continuous case: E(X) = ∫ x p(x) dx over the range of x.

237 Empirical Mean is a special case of Expected Value…
Sample mean, for a sample of n subjects: X̄ = Σ xᵢ (1/n). The probability (frequency) of each person in the sample is 1/n.

238 Expected value, formally
Discrete case: E(X) = Σ xᵢ p(xᵢ). Continuous case: E(X) = ∫ x p(x) dx.

239 Extension to continuous case: uniform distribution
[Plot: p(x) = 1 on 0 ≤ x ≤ 1.] E(X) = ∫ x(1) dx from 0 to 1 = x²/2 evaluated from 0 to 1 = ½

240 Symbol Interlude E(X) = µ; these symbols are used interchangeably

241 Expected Value Expected value is an extremely useful concept for good decision- making!

242 Example: the lottery The Lottery (also known as a tax on people who are bad at math…) A certain lottery works by picking 6 numbers from 1 to 49. It costs $1.00 to play the lottery, and if you win, you win $2 million after taxes. If you play the lottery once, what are your expected winnings or losses?

243 Lottery Calculate the probability of winning in 1 try:
"49 choose 6": out of 49 numbers, this is the number of distinct combinations of 6 = 13,983,816, so P(win) = 1/13,983,816 ≈ 7.2 x 10^-8. The probability function (note, it sums to 1.0): x$ = -1 with p(x) ≈ 1; x$ = +2 million with p(x) = 7.2 x 10^-8

244 Expected Value The probability function Expected Value
The probability function: x$ = -1 with p(x) ≈ 1; x$ = +2 million with p(x) = 7.2 x 10^-8. Expected Value: E(X) = P(win)*$2,000,000 + P(lose)*(-$1.00) = 2.0 x 10^6 * 7.2 x 10^-8 - 1 ≈ .144 - 1 = -$.86. Negative expected value is never good! You shouldn't play if you expect to lose money!

245 Expected Value If you play the lottery every week for 10 years, what are your expected winnings or losses? 520 x (-.86) = -$447.20

246 Gambling (or how casinos can afford to give so many free drinks…)
A roulette wheel has the numbers 1 through 36, as well as 0 and 00. If you bet $1 that an odd number comes up, you win or lose $1 according to whether or not that event occurs. If random variable X denotes your net gain, X=1 with probability 18/38 and X= -1 with probability 20/38. E(X) = 1(18/38) – 1 (20/38) = -$.053 On average, the casino wins (and the player loses) 5 cents per game. The casino rakes in even more if the stakes are higher: E(X) = 10(18/38) – 10 (20/38) = -$.53 If the cost is $10 per game, the casino wins an average of 53 cents per game. If 10,000 games are played in a night, that’s a cool $5300.
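Expected value as a tiny Python helper (illustrative), applied to the lottery and roulette examples:

    def expected_value(outcomes):
        # outcomes: list of (payoff, probability) pairs
        return sum(payoff * p for payoff, p in outcomes)

    p_win = 1 / 13_983_816
    print(expected_value([(2_000_000, p_win), (-1, 1 - p_win)]))  # ~ -$0.86
    print(expected_value([(1, 18/38), (-1, 20/38)]))              # ~ -$0.053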

247 **A few notes about Expected Value as a mathematical operator:
If c= a constant number (i.e., not a variable) and X and Y are any random variables… E(c) = c E(cX)=cE(X) E(c + X)=c + E(X) E(X+Y)= E(X) + E(Y)

248 E(c) = c Example: If you cash in soda cans in CA, you always get 5 cents per can. Therefore, there's no randomness. You always expect to (and do) get 5 cents.

249 E(cX)=cE(X)
Example: If the casino charges $10 per game instead of $1, then the casino expects to make 10 times as much on average from the game (See roulette example above!)

250 E(c + X)=c + E(X)
Example, if the casino throws in a free drink worth exactly $5.00 every time you play a game, you always expect to (and do) gain an extra $5.00 regardless of the outcome of the game.

251 E(X+Y)= E(X) + E(Y)
Example: If you play the lottery twice, you expect to lose: -$.86 + -$.86 = -$1.72. NOTE: This works even if X and Y are dependent!! Does not require independence!! Proof left for later…

252 Practice Problem If a disease is fairly rare and the antibody test is fairly expensive, in a resource-poor region, one strategy is to take half of the serum from each sample and pool it with n other halved samples, and test the pooled lot. If the pooled lot is negative, this saves n-1 tests. If it’s positive, then you go back and test each sample individually, requiring n+1 tests total. Suppose a particular disease has a prevalence of 10% in a third-world population and you have 500 blood samples to screen. If you pool 20 samples at a time (25 lots), how many tests do you expect to have to run (assuming the test is perfect!)?  What if you pool only 10 samples at a time? 5 samples at a time?

253 Answer (a) a. Suppose a particular disease has a prevalence of 10% in a third-world population and you have 500 blood samples to screen. If you pool 20 samples at a time (25 lots), how many tests do you expect to have to run (assuming the test is perfect!)? Let X = a random variable that is the number of tests you have to run per lot: E(X) = P(pooled lot is negative)(1) + P(pooled lot is positive)(21) = (.90)^20 (1) + [1-(.90)^20] (21) = 12.2% (1) + 87.8% (21) = 18.56 tests on average per lot. E(total number of tests) = 25*18.56 = 464

254 Answer (b) b. What if you pool only 10 samples at a time?
E(X) = (.90)^10 (1) + [1-(.90)^10] (11) = 35% (1) + 65% (11) = 7.5 tests on average per lot. 50 lots * 7.5 = 375

255 Answer (c) c. 5 samples at a time?
E(X) = (.90)^5 (1) + [1-(.90)^5] (6) = 59% (1) + 41% (6) = 3.05 tests on average per lot. 100 lots * 3.05 = 305
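The pattern generalizes to any pool size; an illustrative Python sketch:

    def expected_tests(total=500, pool=20, prevalence=0.10):
        # expected number of tests to screen `total` samples in pools of `pool`
        p_neg = (1 - prevalence) ** pool  # pooled lot tests negative
        per_lot = p_neg * 1 + (1 - p_neg) * (pool + 1)
        return (total / pool) * per_lot

    for pool in (20, 10, 5):
        print(pool, expected_tests(pool=pool))  # ~464, ~376, ~305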

256 Practice Problem If X is a random integer between 1 and 10, what’s the expected value of X?

257 Answer If X is a random integer between 1 and 10, what's the expected value of X? E(X) = Σ x(1/10) = (1+2+…+10)/10 = 55/10 = 5.5

258 Expected value isn’t everything though…
Take the show “Deal or No Deal” Everyone know the rules? Let’s say you are down to two cases left. $1 and $400,000. The banker offers you $200,000. So, Deal or No Deal?

259 Deal or No Deal… This could really be represented as a probability distribution and a non-random variable: No Deal: x$ = +1 with p(x)=.50, x$ = +$400,000 with p(x)=.50. Deal: x$ = +$200,000 with p(x)=1.0

260 Expected value doesn’t help…
No Deal: E(X) = 1(.50) + 400,000(.50) = $200,000.50. Deal: E(X) = 200,000(1.0) = $200,000. The expected values are essentially identical, so expected value alone can't choose between them.

261 How to decide? Variance! If you take the deal, the variance/standard deviation is 0. If you don’t take the deal, what is average deviation from the mean? What’s your gut guess?

262 Variance/standard deviation
“The average (expected) squared distance (or deviation) from the mean” **We square because squaring has better properties than absolute value. Take square root to get back linear average distance from the mean (=”standard deviation”).

263 Variance, formally Discrete case: σ² = Var(X) = E[(X − μ)²] = Σ (xᵢ − μ)² p(xᵢ) Continuous case: σ² = Var(X) = ∫ (x − μ)² f(x) dx

264 Similarity to empirical variance
The variance of a sample: s² = Σ (xᵢ − x̄)² / (n − 1) Division by n−1 reflects the fact that we have lost a "degree of freedom" (piece of information) because we had to estimate the sample mean before we could estimate the sample variance.

265 Symbol Interlude Var(X) = σ²; these symbols are used interchangeably

266 Variance: Deal or No Deal
Now you examine your personal risk tolerance…
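For intuition, here is a minimal SAS sketch of the "no deal" gamble's spread (our own check, not from the slides):
data _null_;
  mean = .5*1 + .5*400000;                       * = $200,000.50;
  var  = .5*(1 - mean)**2 + .5*(400000 - mean)**2;
  sd   = sqrt(var);                              * about $200,000;
  put mean= sd=;
run;
So the no-deal gamble deviates from its mean by about $200,000 on average, versus $0 for the sure thing.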

267 Practice Problem A roulette wheel has the numbers 1 through 36, as well as 0 and 00. If you bet $1.00 that an odd number comes up, you win or lose $1.00 according to whether or not that event occurs. If X denotes your net gain, X=1 with probability 18/38 and X= -1 with probability 20/38. We already calculated the mean to be −$.053. What's the variance of X?

268 Answer Standard deviation is $.99. Interpretation: On average, you’re either 1 dollar above or 1 dollar below the mean, which is just under zero. Makes sense!
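A minimal SAS check of this answer (variable names are ours):
data _null_;
  mean = 1*(18/38) + (-1)*(20/38);        * expected net gain, about -$.053;
  ex2  = 1**2*(18/38) + (-1)**2*(20/38);  * E(X squared) = 1;
  var  = ex2 - mean**2;                   * about .997;
  sd   = sqrt(var);                       * about $.99;
  put mean= var= sd=;
run;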

269 Handy calculation formula!
Handy calculation formula (if you ever need to calculate by hand!): Var(X) = E[(X − μ)²] = (intervening algebra!) = E(X²) − [E(X)]²

270 Var(X) = E(X − μ)² = E(X²) – [E(X)]² (your calculation formula!)
Proofs (optional!): E(X − μ)² = E(X² – 2μX + μ²) [remember "FOIL"?!] = E(X²) – E(2μX) + E(μ²) [use rules of expected value: E(X+Y) = E(X) + E(Y)] = E(X²) – 2μE(X) + μ² [E(cX) = cE(X); E(c) = c] = E(X²) – 2μ² + μ² [since E(X) = μ] = E(X²) – μ² = E(X²) – [E(X)]²

271 For example, what’s the variance and standard deviation of the roll of a die?
x p(x): 1 1/6; 2 1/6; 3 1/6; 4 1/6; 5 1/6; 6 1/6 (total = 1.0) The mean is 3.5; what's the average (squared) distance from the mean?
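Here is a minimal SAS sketch of the die calculation (our own illustration of the formula above):
data _null_;
  ex = 0; ex2 = 0;
  do x = 1 to 6;
    ex  = ex  + x*(1/6);      * builds E(X) = 3.5;
    ex2 = ex2 + x**2*(1/6);   * builds E(X squared) = 91/6;
  end;
  var = ex2 - ex**2;          * = 35/12, about 2.92;
  sd  = sqrt(var);            * about 1.71;
  put ex= var= sd=;
run;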

272 **A few notes about Variance as a mathematical operator:
If c is a constant (i.e., not a random variable) and X and Y are random variables, then Var(c) = 0 Var(c + X) = Var(X) Var(cX) = c²Var(X) Var(X+Y) = Var(X) + Var(Y) ONLY IF X and Y are independent!!!! {Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y) IF X and Y are not independent}

273 Var(c) = 0 Constants don't vary!

274 Var(c + X) = Var(X)
Adding a constant to every instance of a random variable doesn't change the variability. It just shifts the whole distribution by c. If everybody grew 5 inches suddenly, the variability in the population would still be the same.

276 Var(cX) = c²Var(X)
Multiplying each instance of the random variable by c makes the distribution c times as wide, which corresponds to c² times as much variance (deviation squared). For example, if everyone suddenly became twice as tall, there'd be twice the deviation and 4 times the variance in heights in the population.

277 Var(X+Y)= Var(X) + Var(Y)
Var(X+Y) = Var(X) + Var(Y) ONLY IF X and Y are independent! With two random variables, you have more opportunity for variation, unless they vary together (are dependent, or have covariance): Var(X+Y) = Var(X) + Var(Y) + 2Cov(X, Y)

278 Example of Var(X+Y)= Var(X) + Var(Y): TPMT
TPMT metabolizes the drugs mercaptopurine, azathioprine, and 6-thioguanine (chemotherapy drugs) People with TPMT-/TPMT+ have reduced levels of activity (10% prevalence) People with TPMT-/TPMT- have no TPMT activity (prevalence 0.3%). They cannot metabolize mercaptopurine, azathioprine, and 6-thioguanine, and risk bone marrow toxicity if given these drugs.

279 TPMT activity by genotype
Weinshilboum R. Drug Metab Dispos 2001 Apr;29(4 Pt 2):601-5

280 TPMT activity by genotype
The variability in TPMT activity is much higher in wild-types than heterozygotes. Weinshilboum R. Drug Metab Dispos 2001 Apr;29(4 Pt 2):601-5

281 TPMT activity by genotype
There is variability in expression from each wild-type allele. With two copies of the good gene present, there's "twice as much" variability. No variability in expression here, since there's no working gene. Weinshilboum R. Drug Metab Dispos 2001 Apr;29(4 Pt 2):601-5

282 Practice Problem Find the variance and standard deviation for the number of ships to arrive at the harbor (recall that the mean is 11.3). x: 10 11 12 13 14 P(x): .4 .2 .2 .1 .1

283 Answer: variance and std dev
x²: 100 121 144 169 196 P(x): .4 .2 .2 .1 .1 Var(x) = E(x²) − [E(x)]² = 129.5 − (11.3)² = 1.81, so SD = 1.35 Interpretation: On an average day, we expect 11.3 ships to arrive in the harbor, plus or minus 1.35. This gives you a feel for what would be considered a usual day!
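A minimal SAS sketch of this calculation; we assume the probabilities .4, .2, .2, .1, .1, which are the values consistent with the stated mean of 11.3:
data _null_;
  array xs[5] _temporary_ (10 11 12 13 14);
  array ps[5] _temporary_ (.4 .2 .2 .1 .1);  * assumed probabilities;
  ex = 0; ex2 = 0;
  do i = 1 to 5;
    ex  = ex  + xs[i]*ps[i];
    ex2 = ex2 + xs[i]**2*ps[i];
  end;
  var = ex2 - ex**2;   * = 1.81;
  sd  = sqrt(var);     * about 1.35;
  put ex= var= sd=;
run;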

284 Practice Problem You toss a coin 100 times. What’s the expected number of heads? What’s the variance of the number of heads?

285 Answer: expected value
Intuitively, we'd probably all agree that we expect around 50 heads, right? Another way to show this: think of tossing 1 coin. E(X = number of heads on 1 toss) = (1)P(heads) + (0)P(tails) = 1(.5) + 0 = .5 If we do this 100 times, we're looking for the sum of 100 tosses, where we assign 1 for a heads and 0 for a tails (these are 100 "independent, identically distributed (i.i.d.)" events). E(X1 + X2 + X3 + … + X100) = E(X1) + E(X2) + E(X3) + … + E(X100) = 100E(X1) = 50

286 Answer: variance What's the variability, though? More tricky. But, again, we could do this for 1 coin and then use our rules of variance. Think of tossing 1 coin. E(X² = number of heads squared) = 1²P(heads) + 0²P(tails) = 1(.5) + 0 = .5 Var(X) = E(X²) − [E(X)]² = .5 − .25 = .25 Then, using our rule: Var(X+Y) = Var(X) + Var(Y) (coin tosses are independent!) Var(X1 + X2 + X3 + … + X100) = Var(X1) + Var(X2) + … + Var(X100) = 100Var(X1) = 100(.25) = 25 SD(X) = 5 Interpretation: When we toss a coin 100 times, we expect to get 50 heads plus or minus 5.

287 Or use computer simulation…
Flip coins virtually! Flip a virtual coin 100 times; count the number of heads. Repeat this over and over again a large number of times (we’ll try 30,000 repeats!) Plot the 30,000 results.
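A minimal SAS sketch of this simulation (the seed and names are ours):
data cointoss;
  call streaminit(1234);                 * arbitrary seed, for reproducibility;
  do rep = 1 to 30000;
    heads = rand('binomial', 0.5, 100);  * heads in 100 fair tosses;
    output;
  end;
run;
proc means data=cointoss mean std;       * mean should be near 50, std near 5;
  var heads;
run;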

288 Coin tosses… Mean = 50 Std. dev = 5 Follows a normal distribution
95% of the time, we get between 40 and 60 heads…

289 Covariance: joint probability
The covariance measures the strength of the linear relationship between two variables The covariance: cov(X,Y) = E[(X − μX)(Y − μY)]

290 The Sample Covariance The sample covariance: cov(x,y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)

291 Interpreting Covariance
Covariance between two random variables: cov(X,Y) > 0: X and Y are positively correlated cov(X,Y) < 0: X and Y are inversely correlated cov(X,Y) = 0: X and Y are uncorrelated (independent variables have zero covariance, but zero covariance alone does not guarantee independence)

292 The binomial and Poisson distributions
Examples of discrete probability distributions: The binomial and Poisson distributions

293 Binomial Probability Distribution
A fixed number of observations (trials), n e.g., 15 tosses of a coin; 20 patients; 1000 people surveyed A binary random variable e.g., head or tail in each toss of a coin; defective or not defective light bulb Generally called “success” and “failure” Probability of success is p, probability of failure is 1 – p Constant probability for each observation e.g., Probability of getting a tail is the same each time we toss the coin

294 Binomial example Take the example of 5 coin tosses. What’s the probability that you flip exactly 3 heads in 5 coin tosses?

295 Binomial distribution
Solution: One way to get exactly 3 heads: HHHTT What's the probability of this exact arrangement? P(heads) × P(heads) × P(heads) × P(tails) × P(tails) = (1/2)^3 × (1/2)^2 Another way to get exactly 3 heads: THHHT Probability of this exact outcome = (1/2)^1 × (1/2)^3 × (1/2)^1 = (1/2)^3 × (1/2)^2

296 Binomial distribution
In fact, (1/2)^3 × (1/2)^2 is the probability of each unique outcome that has exactly 3 heads and 2 tails. So, the overall probability of 3 heads and 2 tails is: (1/2)^3 × (1/2)^2 + (1/2)^3 × (1/2)^2 + (1/2)^3 × (1/2)^2 + ….. for as many unique arrangements as there are—but how many are there??

297 5C3 = 5!/(3!2!) = 10
Outcome / Probability: HHHTT, HHTHT, HHTTH, HTHHT, HTHTH, HTTHH, THHHT, THHTH, THTHH, TTHHH; each unique outcome has the same probability, (1/2)^3 × (1/2)^2. There are 5C3 = 5!/(3!2!) = 10 ways to arrange 3 heads in 5 trials, so: 10 arrangements × (1/2)^3 × (1/2)^2

298 P(3 heads and 2 tails) = 5C3 × P(heads)^3 × P(tails)^2 = 10 × (1/2)^5 = 10/32 = .3125

299 Binomial distribution function: X= the number of heads tossed in 5 coin tosses
[Histogram: p(x) plotted against x = number of heads, 0 through 5]

300 Example 2 As voters exit the polls, you ask a representative random sample of 6 voters if they voted for proposition 100. If the true percentage of voters who vote for the proposition is 55.1%, what is the probability that, in your sample, exactly 2 voted for the proposition and 4 did not?

301 Solution:
Outcome / Probability: YYNNNN = (.551)^2 × (.449)^4 NYYNNN = (.449)^1 × (.551)^2 × (.449)^3 = (.551)^2 × (.449)^4 NNYYNN = (.449)^2 × (.551)^2 × (.449)^2 = (.551)^2 × (.449)^4 NNNYYN = (.449)^3 × (.551)^2 × (.449)^1 = (.551)^2 × (.449)^4 NNNNYY = (.449)^4 × (.551)^2 = (.551)^2 × (.449)^4 … There are 6C2 = 15 ways to arrange 2 yes votes among 6 voters: 15 arrangements × (.551)^2 × (.449)^4 P(2 yes votes exactly) = 15 × (.551)^2 × (.449)^4 = 18.5%

302 Binomial distribution, generally
Note the general pattern emerging: if you have only two possible outcomes (call them 1/0 or yes/no or success/failure) in n independent trials, then the probability of exactly X "successes" is P(X) = nCX p^X (1 − p)^(n−X), where n = number of trials, X = # successes out of n trials, p = probability of success, and 1 − p = probability of failure

303 Definitions: Binomial
Binomial: Suppose that n independent experiments, or trials, are performed, where n is a fixed number, and that each experiment results in a "success" with probability p and a "failure" with probability 1−p. The total number of successes, X, is a binomial random variable with parameters n and p. We write: X ~ Bin (n, p) {reads: "X is distributed binomially with parameters n and p"} And the probability that X=r (i.e., that there are exactly r successes) is: P(X=r) = nCr p^r (1−p)^(n−r)

304 Definitions: Bernoulli
Bernoulli trial: If there is only 1 trial with probability of success p and probability of failure 1−p, this is called a Bernoulli distribution (a special case of the binomial with n=1). Probability of success: P(X=1) = p Probability of failure: P(X=0) = 1−p

305 Binomial distribution: example
If I toss a coin 20 times, what's the probability of getting exactly 10 heads? P(X=10) = 20C10 (1/2)^10 (1/2)^10 = 184,756 × (1/2)^20 = .176

306 Binomial distribution: example
If I toss a coin 20 times, what's the probability of getting 2 or fewer heads? P(X≤2) = [20C0 + 20C1 + 20C2] × (1/2)^20 = (1 + 20 + 190)/1,048,576 = .0002
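Both answers can be checked with SAS's built-in binomial functions (a minimal sketch):
data _null_;
  exactly10 = pdf('binomial', 10, 0.5, 20);  * P(X = 10), about .176;
  atmost2   = cdf('binomial',  2, 0.5, 20);  * P(X <= 2), about .0002;
  put exactly10= atmost2=;
run;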

307 **All probability distributions are characterized by an expected value and a variance:
If X follows a binomial distribution with parameters n and p: X ~ Bin (n, p) Then: μx = E(X) = np σx² = Var(X) = np(1−p) σx = SD(X) = √(np(1−p)) Note: the variance always lies between 0 and .25n, since p(1−p) reaches its maximum of .25 at p = .5

308 Characteristics of the Bernoulli distribution
For a Bernoulli trial (n=1): E(X) = p Var(X) = p(1−p)

309 Variance Proof (optional!)
For Y ~ Bernoulli(p), with Y=1 if yes and Y=0 if no: Var(Y) = E(Y²) − [E(Y)]² = p − p² = p(1−p) For X ~ Bin(N, p): X is the sum of N independent Bernoulli trials, so Var(X) = Np(1−p)

310 Recall coin toss example
X= number of heads in 100 tosses of a coin X ~ Bin (100, .5) E(x) = 100*.5=50 Var(X) = 100*.5*.5 = 25 SD(X) = 5

311 Things that follow a binomial distribution…
Cohort study (or cross-sectional): The number of exposed individuals in your sample that develop the disease The number of unexposed individuals in your sample that develop the disease Case-control study: The number of cases that have had the exposure The number of controls that have had the exposure

312 Practice problems 1. You are performing a cohort study. If the probability of developing disease in the exposed group is .05 for the study duration, then if you sample (randomly) 500 exposed people, how many do you expect to develop the disease? Give a margin of error (+/- 1 standard deviation) for your estimate. 2. What’s the probability that at most 10 exposed people develop the disease?

313 Answer X ~ binomial (500, .05) E(X) = 500 (.05) = 25
1. You are performing a cohort study. If the probability of developing disease in the exposed group is .05 for the study duration, then if you sample (randomly) 500 exposed people, how many do you expect to develop the disease? Give a margin of error (+/- 1 standard deviation) for your estimate. X ~ binomial (500, .05) E(X) = 500(.05) = 25 Var(X) = 500(.05)(.95) = 23.75 StdDev(X) = square root of 23.75 = 4.87 Answer: 25 ± 4.87

314 Answer 2. What's the probability that at most 10 exposed subjects develop the disease? This is asking for a CUMULATIVE PROBABILITY: the probability of 0 getting the disease or 1 or 2 or 3 or 4 or up to 10. P(X≤10) = P(X=0) + P(X=1) + P(X=2) + P(X=3) + P(X=4) + … + P(X=10) (we'll learn how to approximate this long sum next week)

315 A brief distraction: Pascal’s Triangle Trick
You'll rarely calculate the binomial by hand. However, it is good to know how to… Pascal's Triangle Trick for calculating binomial coefficients Recall from math in your past that Pascal's Triangle is used to get the coefficients for binomial expansion… For example, to expand (p + q)^5, the powers follow a set pattern: p^5 + p^4q^1 + p^3q^2 + p^2q^3 + p^1q^4 + q^5 But what are the coefficients? Use Pascal's Magic Triangle…

316 Pascal's Triangle Edges are all 1's
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
Add the two numbers in the row above to get the number below, e.g.: 3+1=4; 5+10=15. To get the coefficients for expanding to the 5th power, use the row that starts 1 5: (p + q)^5 = 1p^5 + 5p^4q^1 + 10p^3q^2 + 10p^2q^3 + 5p^1q^4 + 1q^5

317 Same coefficients for X~Bin(5,p)
For example, X = # heads in 5 coin tosses: P(X=0) = 1q^5, P(X=1) = 5p^1q^4, P(X=2) = 10p^2q^3, P(X=3) = 10p^3q^2, P(X=4) = 5p^4q^1, P(X=5) = 1p^5 The coefficients come from line 5 of Pascal's triangle!

318 Relationship between binomial probability distribution and binomial expansion
(q + p)^5 = 1q^5 + 5q^4p^1 + 10q^3p^2 + 10q^2p^3 + 5q^1p^4 + 1p^5, and term by term these are P(X=0), P(X=1), P(X=2), P(X=3), P(X=4), P(X=5); the whole expansion sums to 1

319 Practice problems If the probability of being a smoker among a group of cases with lung cancer is .6, what’s the probability that in a group of 8 cases you have less than 2 smokers? More than 5? What are the expected value and variance of the number of smokers?

320 Answer [Pascal's triangle, extended down to the row beginning 1 8, supplies the 8Cx coefficients]

321 Answer, continued [Histogram of the Bin(8, .6) probabilities for x = 0 through 8]

322 Answer, continued
P(>5) = P(6) + P(7) + P(8) = .21 + .09 + .0168 = .3168 P(<2) = P(0) + P(1) = .00066 + .0079 = .0085 E(X) = 8(.6) = 4.8 Var(X) = 8(.6)(.4) = 1.92 StdDev(X) = 1.39
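These tail probabilities are quick to verify in SAS (a minimal sketch):
data _null_;
  less2 = cdf('binomial', 1, 0.6, 8);      * P(X < 2) = P(X <= 1), about .0085;
  more5 = 1 - cdf('binomial', 5, 0.6, 8);  * P(X > 5), about .32;
  put less2= more5=;
run;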

323 Practice problem If Stanford tickets cars in the medical center 'A' lot approximately twice a week (2 of 5 weekdays), and you want to park in the 'A' lot twice a week for the year, are you financially better off buying a parking sticker (which costs $726 for the year) or parking illegally (tickets are $35 each)?

324 Answer If Stanford tickets in the medical center ‘A’ lot approximately twice a week (2/5 weekdays), if you want to park in the ‘A’ lot twice a week for the year, are you financially better off buying a parking sticker (which costs $726 for the year) or parking illegally (tickets are $35 each)? Use Binomial Let X be a random variable that is the number of tickets you receive in a year. Assuming 2 weeks vacation, there are 50x2 days (twice a week for 50 weeks) you’ll be parking illegally. p=.40 is the chance of receiving a ticket on a given day: X~bin (100, .40) E(X) = 100x.40 = 40 tickets expected (with std dev of about 5) 40 x $35 = $1400 in tickets (+/- $200); better to buy the sticker!

325 Multinomial distribution (beyond the scope of this course)
The multinomial is a generalization of the binomial. It is used when there are more than 2 possible outcomes (for ordinal or nominal, rather than binary, random variables). Instead of partitioning n trials into 2 outcomes (yes with probability p / no with probability 1−p), you are partitioning n trials into 3 or more outcomes (with probabilities p1, p2, p3, …) General formula for 3 outcomes: P(X1=x1, X2=x2, X3=x3) = n!/(x1! x2! x3!) × p1^x1 × p2^x2 × p3^x3

326 Multinomial example Specific Example: if you are randomly choosing 8 people from an audience that contains 50% democrats, 30% republicans, and 20% green party, what's the probability of choosing exactly 4 democrats, 3 republicans, and 1 green party member? 8!/(4! 3! 1!) × (.5)^4 × (.3)^3 × (.2)^1 = 280 × .0625 × .027 × .2 ≈ .0945 You can see that it gets hard to calculate very fast! The multinomial has many uses in genetics, where a person may have 1 of many possible alleles (that occur with certain probabilities in a given population) at a gene locus.

327 Introduction to the Poisson Distribution
Poisson distribution is for counts—if events happen at a constant rate over time, the Poisson distribution gives the probability of X number of events occurring in time T.

328 Poisson Mean and Variance
For a Poisson random variable, the variance and mean are the same! Mean: μ = λ Variance: σ² = λ, so the standard deviation is σ = √λ where λ = expected number of hits in a given time period

329 Poisson Distribution, example
The Poisson distribution models counts, such as the number of new cases of SARS that occur in women in New England next month. The distribution tells you the probability of all possible numbers of new cases, from 0 to infinity. If X = # of new cases next month and X ~ Poisson (λ), then the probability that X=k (a particular count) is: P(X=k) = (λ^k)(e^−λ)/k!

330 Example For example, if new cases of West Nile Virus in New England are occurring at a rate of about 2 per month, then these are the probabilities that: 0,1, 2, 3, 4, 5, 6, to 1000 to 1 million to… cases will occur in New England in the next month:

331 Poisson Probability table
X P(X)
0 (2^0)(e^−2)/0! = .135
1 (2^1)(e^−2)/1! = .27
2 (2^2)(e^−2)/2! = .27
3 (2^3)(e^−2)/3! = .18
4 (2^4)(e^−2)/4! = .09
5 (2^5)(e^−2)/5! = .036

332 Example: Poisson distribution
Suppose that a rare disease has an incidence of 1 in 1000 person-years. Assuming that members of the population are affected independently, find the probability of k cases in a population of 10,000 (followed over 1 year) for k = 0, 1, 2. The expected value (mean) λ = .001 × 10,000 = 10; 10 new cases are expected in this population per year

333 more on Poisson… “Poisson Process” (rates)
Note that the Poisson parameter λ can be given as the mean number of events that occur in a defined time period OR, equivalently, λ can be given as a rate, such as λ = 2/month (2 events per 1 month), that must be multiplied by t = time (called a "Poisson Process"): X ~ Poisson (λt) E(X) = λt Var(X) = λt

334 Example For example, if new cases of West Nile in New England are occurring at a rate of about 2 per month, then what's the probability that exactly 4 cases will occur in the next 3 months? X ~ Poisson (λt = 2/month × 3 months = 6) P(X=4) = (6^4)(e^−6)/4! = .134 Exactly 6 cases? P(X=6) = (6^6)(e^−6)/6! = .161
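The same numbers fall out of SAS's Poisson function (a minimal sketch; names are ours):
data _null_;
  lambda_t = 2*3;                    * rate of 2 per month over t = 3 months;
  p4 = pdf('poisson', 4, lambda_t);  * P(X = 4), about .134;
  p6 = pdf('poisson', 6, lambda_t);  * P(X = 6), about .161;
  put p4= p6=;
run;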

335 Practice problems 1a. If calls to your cell phone are a Poisson process with a constant rate λ = 2 calls per hour, what's the probability that, if you forget to turn your phone off in a 1.5 hour movie, your phone rings during that time? 1b. How many phone calls do you expect to get during the movie?

336 Answer P(X≥1)=1 – .05 = 95% chance
1a. If calls to your cell phone are a Poisson process with a constant rate =2 calls per hour, what’s the probability that, if you forget to turn your phone off in a 1.5 hour movie, your phone rings during that time? X ~ Poisson (=2 calls/hour) P(X≥1)=1 – P(X=0) P(X≥1)=1 – .05 = 95% chance 1b. How many phone calls do you expect to get during the movie? E(X) = t = 2(1.5) = 3

337 Calculating probabilities in SAS
For the binomial probability distribution function: P(X=C) = pdf('binomial', C, p, N) For the binomial cumulative distribution function: P(X≤C) = cdf('binomial', C, p, N) For the Poisson probability distribution function: P(X=C) = pdf('poisson', C, λ) For the Poisson cumulative distribution function: P(X≤C) = cdf('poisson', C, λ)

338 SAS examples
data _null_;
  TwoSixes = pdf('binomial', 8, .0278, 100);
  put TwoSixes;
run;
For the other three probabilities, swap in: TwoSixes = cdf('binomial', 8, .0278, 100); or TwoSixes = pdf('poisson', 8, 2.78); or TwoSixes = cdf('poisson', 8, 2.78);

339 The normal and standard normal
Examples of continuous probability distributions: The normal and standard normal

340 The Normal Distribution
[Normal curves: f(X) plotted against X] Changing μ shifts the distribution left or right. Changing σ increases or decreases the spread.

341 The Normal Distribution: as mathematical function (pdf)
This is a bell-shaped curve with different centers and spreads depending on μ and σ Note the constants: π = 3.14159… and e = 2.71828…

342 The Normal PDF It's a probability function, so no matter what the values of μ and σ, it must integrate to 1!

343 Normal distribution is defined by its mean and standard dev.
E(X)= = Var(X)=2 = Standard Deviation(X)=

344 **The beauty of the normal curve:
No matter what μ and σ are, the area between μ−σ and μ+σ is about 68%; the area between μ−2σ and μ+2σ is about 95%; and the area between μ−3σ and μ+3σ is about 99.7%. Almost all values fall within 3 standard deviations.

345 68-95-99.7 Rule 68% of the data 95% of the data 99.7% of the data 345
SAY: within 1 standard deviation either way of the mean; within 2 standard deviations either way of the mean; within 3 standard deviations either way of the mean. WORKS FOR ALL NORMAL CURVES, NO MATTER HOW SKINNY OR FAT.

346 Rule in Math terms…

347 How good is rule for real data?
Check some example data: The mean of the weight of the women = 127.8 The standard deviation (SD) = 15.5

348 68% of 120 = .68 × 120 ≈ 82 runners In fact, 79 runners fall within 1 SD (15.5 lbs) of the mean (112.3 to 143.3).

349 95% of 120 = .95 × 120 ≈ 114 runners In fact, 115 runners fall within 2 SDs of the mean (96.8 to 158.8).

350 99.7% of 120 = .997 × 120 = 119.6, i.e., essentially all 120 runners In fact, all 120 runners fall within 3 SDs of the mean (81.3 to 174.3).

351 Example Suppose SAT scores roughly follow a normal distribution in the U.S. population of college-bound students (with range restricted to 200-800), and the average math SAT is 500 with a standard deviation of 50; then: 68% of students will have scores between 450 and 550 95% will be between 400 and 600 99.7% will be between 350 and 650

352 Example
What if you wanted to know the math SAT score corresponding to the 90th percentile (= 90% of students are lower)? P(X≤Q) = .90; solve for Q? … Yikes! BUT…

353 The Standard Normal (Z): “Universal Currency”
The formula for the standardized normal probability density function is p(Z) = (1/√(2π)) e^(−Z²/2)

354 The Standard Normal Distribution (Z)
All normal distributions can be converted into the standard normal curve by subtracting the mean and dividing by the standard deviation: Z = (X − μ)/σ Somebody calculated all the integrals for the standard normal and put them in a table! So we never have to integrate! Even better, computers now do all the integration.

355 Comparing X and Z units
[X scale: 100 and 200, for a normal with μ = 100, σ = 50, correspond to Z = 0 and 2.0 on the standard normal scale (μ = 0, σ = 1)]

356 Example For example: What's the probability of getting a math SAT score of 575 or less, if μ = 500 and σ = 50? Z = (575 − 500)/50 = 1.5, i.e., a score of 575 is 1.5 standard deviations above the mean. Yikes, an integral! But looking up Z = 1.5 in the standard normal chart (or entering it into SAS) is no problem! The answer is .9332.

357 Practice problem If birth weights in a population are normally distributed with a mean of 109 oz and a standard deviation of 13 oz: a. What is the chance of obtaining a birth weight of 141 oz or heavier when sampling birth records at random? b. What is the chance of obtaining a birth weight of 120 or lighter?

358 Answer a. What is the chance of obtaining a birth weight of 141 oz or heavier when sampling birth records at random? Z = (141 − 109)/13 = 2.46 From the chart or SAS, a Z of 2.46 corresponds to a right tail (greater than) area of: P(Z≥2.46) = 1 − .9931 = .0069, or .69%

359 Answer b. What is the chance of obtaining a birth weight of 120 or lighter? Z = (120 − 109)/13 = .85 From the chart or SAS, a Z of .85 corresponds to a left tail area of: P(Z≤.85) = .8023 = 80.23%

360 Looking up probabilities in the standard normal table
What is the area to the left of Z = 1.51 in a standard normal curve? Area = .9345, or 93.45%.

361 Normal probabilities in SAS
data _null_; theArea=probnorm(1.5); put theArea; run; And if you wanted to go the other direction (i.e., from the area to the Z score (called the so-called “Probit” function  data _null_; theZValue=probit(.93); put theZValue;   The “probnorm(Z)” function gives you the probability from negative infinity to Z (here 1.5) in a standard normal curve. The “probit(p)” function gives you the Z-value that corresponds to a left-tail area of p (here .93) from a standard normal curve. The probit function is also known as the inverse standard normal function.

362 Probit function: the inverse
Probit(area) = Z: gives the Z-value that goes with the probability you want. For example, recall the SAT math scores example. What's the score that corresponds to the 90th percentile? In the table, find the Z-value that corresponds to an area of .90: Z = 1.28. Or use SAS:
data _null_;
  theZValue = probit(.90);
  put theZValue;
run;
If Z = 1.28, convert back to the raw SAT score: 1.28 = (X − 500)/50, so X = 500 + 1.28(50) = 564 (1.28 standard deviations above the mean!)
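The whole conversion can also be done in one step (a minimal sketch):
data _null_;
  score = 500 + probit(0.90)*50;  * 90th-percentile math SAT score, about 564;
  put score=;
run;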

363 Are my data “normal”? Not all continuous random variables are normally distributed!! It is important to evaluate how well the data are approximated by a normal distribution 363

364 Are my data normally distributed?
Look at the histogram! Does it appear bell shaped? Compute descriptive summary measures—are mean, median, and mode similar? Do 2/3 of observations lie within 1 std dev of the mean? Do 95% of observations lie within 2 std dev of the mean? Look at a normal probability plot—is it approximately linear? Run tests of normality (such as Kolmogorov-Smirnov). But, be cautious, highly influenced by sample size! 364

365 Data from our class… Median = 6 Mean = 7.1 Mode = 0 SD = 6.8
Range = 0 to 24 (= 3.5 σ)

366 Data from our class… Median = 5 Mean = 5.4 Mode = none SD = 1.8
Range = 2 to 9 (~ 4 σ)

367 Data from our class… Median = 3 Mean = 3.4 Mode = 3 SD = 2.5
Range = 0 to 12 (~ 5 σ)

368 Data from our class… Median = 7:00 Mean = 7:04 Mode = 7:00 SD = :55
Range = 5:30 to 9:00 (~ 4 σ)

369 Data from our class… 7.1 +/- 6.8 = 0.3 – 13.9

370 Data from our class… 7.1 +/- 2*6.8 = 0 – 20.7

371 Data from our class… 7.1 +/- 3*6.8 = 0 – 27.5

372 Data from our class… 5.4 +/- 1.8 = 3.6 – 7.2

373 Data from our class… 5.4 +/- 2*1.8 = 1.8 – 9.0 1.8 9.0

374 Data from our class… 5.4 +/- 3*1.8 = 0 – 10.8

375 Data from our class… 3.4 +/- 2.5 = 0.9 – 5.9

376 Data from our class… 3.4 +/- 2*2.5 = 0 – 8.4

377 Data from our class… 3.4 +/- 3*2.5 = 0 – 10.9

378 Data from our class… 7:04 +/- 0:55 = 6:09 – 7:59

379 Data from our class… 7:04 +/- 2*0:55 = 5:14 – 8:54

380 Data from our class… 7:04 +/- 3*0:55 = 4:19 – 9:49

381 The Normal Probability Plot
Order the data. Find the corresponding standardized normal quantile values. Plot the observed data values against the normal quantile values. Evaluate the plot for evidence of linearity.

382 Normal probability plot coffee…
Right-Skewed! (concave up)

383 Normal probability plot love of writing…
Neither right-skewed nor left-skewed, but big gap at 6.

384 Norm prob. plot Exercise…
Right-Skewed! (concave up)

385 Norm prob. plot Wake up time
Closest to a straight line…

386 Formal tests for normality
Results: Coffee: Strong evidence of non- normality (p<.01) Writing love: Moderate evidence of non- normality (p=.01) Exercise: Weak to no evidence of non- normality (p>.10) Wakeup time: No evidence of non- normality (p>.25)

387 Normal approximation to the binomial
When you have a binomial distribution where n is large and p is middle-of-the road (not too small, not too big, closer to .5), then the binomial starts to look like a normal distribution in fact, this doesn’t even take a particularly large n Recall: What is the probability of being a smoker among a group of cases with lung cancer is .6, what’s the probability that in a group of 8 cases you have less than 2 smokers?

388 Normal approximation to the binomial
When you have a binomial distribution where n is large and p isn’t too small (rule of thumb: mean>5), then the binomial starts to look like a normal distribution   Recall: smoking example… 1 4 5 2 3 6 7 8 .27 Starting to have a normal shape even with fairly small n. You can imagine that if n got larger, the bars would get thinner and thinner and this would look more and more like a continuous function, with a bell curve shape. Here np=4.8.

389 Normal approximation to binomial
What is the probability of fewer than 2 smokers? Exact binomial probability (from before) = .00066 + .0079 = .0085 Normal approximation probability: μ = 4.8, σ = 1.39 Z = (2 − 4.8)/1.39 = −2.0, so P(Z < −2.0) = .022

390 A little off, but in the right ballpark… we could also use the value to the left of 1.5 (as we really wanted to know "less than but not including 2"; this is called the "continuity correction"): Z = (1.5 − 4.8)/1.39 = −2.37, and P(Z ≤ −2.37) = .0089, a fairly good approximation of the exact probability

391 Practice problem 1. You are performing a cohort study. If the probability of developing disease in the exposed group is .25 for the study duration, then if you sample (randomly) 500 exposed people, What’s the probability that at most 120 people develop the disease?

392 Answer By hand (yikes!):
By hand (yikes!): P(X≤120) = P(X=0) + P(X=1) + P(X=2) + P(X=3) + P(X=4) + … + P(X=120) OR use SAS:
data _null_;
  Cohort = cdf('binomial', 120, .25, 500);
  put Cohort;
run;
OR use the normal approximation: μ = np = 500(.25) = 125 and σ² = np(1−p) = 93.75, so σ = 9.68 Z = (120 − 125)/9.68 = −.52, and P(Z < −.52) = .3015
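A minimal SAS sketch putting the exact answer and the normal approximation side by side:
data _null_;
  exact  = cdf('binomial', 120, .25, 500);  * exact cumulative binomial;
  approx = probnorm((120 - 125)/9.68);      * normal approximation, about .30;
  put exact= approx=;
run;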

393 Proportions… The binomial distribution forms the basis of statistics for proportions. A proportion is just a binomial count divided by n. For example, if we sample 200 cases and find 60 smokers, X=60 but the observed proportion=.30. Statistics for proportions are similar to binomial counts, but differ by a factor of n.

394 Stats for proportions
For a binomial count X: μ = np and σ² = np(1−p). For a proportion p̂ = X/n: μ = p and σ² = p(1−p)/n; each differs by a factor of n. P-hat (p̂) stands for "sample proportion."

395 It all comes back to Z… Statistics for proportions are based on a normal distribution, because the binomial can be approximated as normal if np>5

396 Statistical inference: CLT, confidence intervals, p-values

397 Statistical Inference The process of making guesses about the truth from a sample.
Sample statistics *hat notation ^ is often used to indicate "estimate" Truth (not observable) Sample (observation) Population parameters Make guesses about the whole population

398 Statistics vs. Parameters
Sample Statistic – any summary measure calculated from data; e.g., could be a mean, a difference in means or proportions, an odds ratio, or a correlation coefficient E.g., the mean vitamin D level in a sample of 100 men is 63 nmol/L E.g., the correlation coefficient between vitamin D and cognitive function in the sample of 100 men is 0.15 Population parameter – the true value/true effect in the entire population of interest E.g., the true mean vitamin D in all middle-aged and older European men is 62 nmol/L E.g., the true correlation between vitamin D and cognitive function in all middle-aged and older European men is 0.15 398

399 Examples of Sample Statistics:
Single population mean Single population proportion Difference in means (ttest) Difference in proportions (Z-test) Odds ratio/risk ratio Correlation coefficient Regression coefficient It turns out that if you were to go out and sample many, many times, most sample statistics that you could calculate would follow a normal distribution. What are the 2 parameters (from last time) that define any normal distribution? Remember that a normal curve is characterized by two parameters, a mean and a variability (SD) What do you think the mean value of a sample statistic would be? The standard deviation? Remember standard deviation is natural variability of the population Standard error can be standard error of the mean or standard error of the odds ratio or standard error of the difference of 2 means, etc. The standard error of any sample statistic. 399

400 Example 1: cognitive function and vitamin D
Hypothetical data loosely based on [1]; cross- sectional study of 100 middle-aged and older European men. Estimation: What is the average serum vitamin D in middle-aged and older European men? Sample statistic: mean vitamin D levels Hypothesis testing: Are vitamin D levels and cognitive function correlated? Sample statistic: correlation coefficient between vitamin D and cognitive function, measured by the Digit Symbol Substitution Test (DSST). 1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry Jul;80(7):722-9.

401 Distribution of a trait: vitamin D
Right-skewed! Mean = 63 nmol/L Standard deviation = 33 nmol/L

402 Distribution of a trait: DSST
Normally distributed Mean = 28 points Standard deviation = 10 points

403 Distribution of a statistic…
Statistics follow distributions too… But the distribution of a statistic is a theoretical construct. Statisticians ask a thought experiment: how much would the value of the statistic fluctuate if one could repeat a particular study over and over again with different samples of the same size? By answering this question, statisticians are able to pinpoint exactly how much uncertainty is associated with a given statistic. 403

404 Distribution of a statistic
Two approaches to determine the distribution of a statistic: 1. Computer simulation Repeat the experiment over and over again virtually! More intuitive; can directly observe the behavior of statistics. 2. Mathematical theory Proofs and formulas! More practical; use formulas to solve problems.

405 Example of computer simulation…
How many heads come up in 100 coin tosses? Flip coins virtually Flip a coin 100 times; count the number of heads. Repeat this over and over again a large number of times (we’ll try 30,000 repeats!) Plot the 30,000 results.

406 Coin tosses… Conclusions:
We usually get between 40 and 60 heads when we flip a coin 100 times. It’s extremely unlikely that we will get 30 heads or 70 heads (didn’t happen in 30,000 experiments!).

407 Distribution of the sample mean, computer simulation…
1. Specify the underlying distribution of vitamin D in all European men aged 40 to 79. Right-skewed Standard deviation = 33 nmol/L True mean = 62 nmol/L (this is arbitrary; does not affect the distribution) 2. Select a random sample of 100 virtual men from the population. 3. Calculate the mean vitamin D for the sample. 4. Repeat steps (2) and (3) a large number of times (say 1000 times). 5. Explore the distribution of the means. 407

408 Distribution of mean vitamin D (a sample statistic)
Normally distributed! Surprise! Mean= 62 nmol/L (the true mean) Standard deviation = 3.3 nmol/L

409 Distribution of mean vitamin D (a sample statistic)
Normally distributed (even though the trait is right-skewed!) Mean = true mean Standard deviation = 3.3 nmol/L The standard deviation of a statistic is called a standard error The standard error of a mean = σ/√n

410 If I increase the sample size to n=400…
Standard error = 1.7 nmol/L

411 If I increase the variability of vitamin D (the trait) to SD=40…
Standard error = 4.0 nmol/L
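All three standard errors follow directly from σ/√n; a minimal SAS check:
data _null_;
  se1 = 33/sqrt(100);  * = 3.3, the original simulation;
  se2 = 33/sqrt(400);  * about 1.7, with n raised to 400;
  se3 = 40/sqrt(100);  * = 4.0, with the more variable trait;
  put se1= se2= se3=;
run;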

412 Mathematical Theory… The Central Limit Theorem!
If all possible random samples, each of size n, are taken from any population with a mean μ and a standard deviation σ, the sampling distribution of the sample means (averages) will: 1. have mean μx̄ = μ 2. have standard deviation σx̄ = σ/√n 3. be approximately normally distributed regardless of the shape of the parent population (normality improves with larger n). It all comes back to Z!

413 Symbol Check
μx̄: the mean of the sample means. σx̄: the standard deviation of the sample means; also called "the standard error of the mean."

414 Mathematical Proof (optional!)
If X is a random variable from any distribution with known mean, E(X), and variance, Var(X), then the expected value and variance of the average of n observations of X are: E(X̄) = E(X) and Var(X̄) = Var(X)/n

415 Computer simulation of the CLT: (this is what we will do in lab next Wednesday!)
1. Pick any probability distribution and specify a mean and standard deviation. 2. Tell the computer to randomly generate observations from that probability distribution E.g., the computer is more likely to spit out values with high probabilities 3. Plot the "observed" values in a histogram. 4. Next, tell the computer to randomly generate averages-of-2 (randomly pick 2 and take their average) from that probability distribution. Plot "observed" averages in histograms. 5. Repeat for averages-of-10, and averages-of-100.
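For instance, here is a minimal SAS sketch of the uniform case (names and seed are ours):
data clt;
  call streaminit(42);                  * arbitrary seed;
  do rep = 1 to 1000;
    total = 0;
    do i = 1 to 100;
      total = total + rand('uniform');  * one draw from Uniform[0,1];
    end;
    avg100 = total/100;                 * one average-of-100;
    output;
  end;
run;
proc univariate data=clt;
  * the histogram should look bell-shaped, centered near .5;
  var avg100;
  histogram;
run;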

416 Uniform on [0,1]: average of 1 (original distribution)

417 Uniform: 1000 averages of 2

418 Uniform: 1000 averages of 5

419 Uniform: 1000 averages of 100

420 ~Exp(1): average of 1 (original distribution)

421 ~Exp(1): 1000 averages of 2

422 ~Exp(1): 1000 averages of 5

423 ~Exp(1): 1000 averages of 100

424 ~Bin(40, .05): average of 1 (original distribution)

425 ~Bin(40, .05): 1000 averages of 2

426 ~Bin(40, .05): 1000 averages of 5

427 ~Bin(40, .05): 1000 averages of 100

428 The Central Limit Theorem:
If all possible random samples, each of size n, are taken from any population with a mean  and a standard deviation , the sampling distribution of the sample means (averages) will: 1. have mean: 2. have standard deviation: It turns out that if you were to go out and sample many, many times, most sample statistics that you could calculate would follow a normal distribution. What are the 2 parameters (from last time) that define any normal distribution? Remember that a normal curve is characterized by two parameters, a mean and a variability (SD) What do you think the mean value of a sample statistic would be? The standard deviation? Remember standard deviation is natural variability of the population Standard error can be standard error of the mean or standard error of the odds ratio or standard error of the difference of 2 means, etc. The standard error of any sample statistic. 3. be approximately normally distributed regardless of the shape of the parent population (normality improves with larger n) 428

429 Central Limit Theorem caveats for small samples:
The sample standard deviation is an imprecise estimate of the true standard deviation (σ); this imprecision changes the distribution to a T-distribution. A t-distribution approaches a normal distribution for large n (≥100), but has fatter tails for small n (<100). If the underlying distribution is non-normal, the distribution of the means may be non-normal. More on T-distributions next week!!

430 Summary: Single population mean (large n)
Hypothesis test: Z = (x̄ − μ0)/(s/√n) Confidence interval: x̄ ± Z × (s/√n)

431 Single population mean (small n, normally distributed trait)
Hypothesis test: t(n−1 df) = (x̄ − μ0)/(s/√n) Confidence interval: x̄ ± t(n−1 df) × (s/√n)

432 Examples of Sample Statistics:
Single population mean Single population proportion Difference in means (t-test) Difference in proportions (Z-test) Odds ratio/risk ratio Correlation coefficient Regression coefficient

433 Distribution of a correlation coefficient?? Computer simulation…
1. Specify the true correlation coefficient Correlation coefficient = 0.15 2. Select a random sample of 100 virtual men from the population. 3. Calculate the correlation coefficient for the sample. 4. Repeat steps (2) and (3) 15,000 times 5. Explore the distribution of the 15,000 correlation coefficients.

434 Distribution of a correlation coefficient…
Normally distributed! Mean = 0.15 (true correlation) Standard error = 0.10 434

435 Distribution of a correlation coefficient in general…
1. Shape of the distribution Normally distributed for large samples T-distribution for small samples (n<100) 2. Mean = true correlation coefficient (r) 3. Standard error ≈ √((1 − r²)/(n − 2)) (≈ 0.10 here, with r = 0.15 and n = 100)

436 Many statistics follow normal (or t-distributions)…
Means/difference in means T-distribution for small samples Proportions/difference in proportions Regression coefficients Natural log of the odds ratio 436

437 Estimation (confidence intervals)…
What is a good estimate for the true mean vitamin D in the population (the population parameter)? 63 nmol/L +/- margin of error

438 95% confidence interval Goal: capture the true effect (e.g., the true mean) most of the time. A 95% confidence interval should include the true effect about 95% of the time. A 99% confidence interval should include the true effect about 99% of the time.

439 Recall: 68-95-99. 7 rule for normal distributions
Recall: the 68-95-99.7 rule for normal distributions! There is a 95% chance that the sample mean will fall within two standard errors of the true mean: 62 +/- 2*3.3 = 55.4 nmol/L to 68.6 nmol/L Mean + 2 Std errors = 68.6 Mean − 2 Std errors = 55.4 To be precise, 95% of observations fall between Z = −1.96 and Z = +1.96 (so the "2" is a rounded number)…

440 95% confidence interval There is a 95% chance that the sample mean is between 55.4 nmol/L and 68.6 nmol/L
For every sample mean in this range, sample mean +/- 2 standard errors will include the true mean. For example, if the sample mean is 68.6 nmol/L: 95% CI = 68.6 +/- 6.6 = 62.0 to 75.2 This interval just hits the true mean, 62.0.
441 95% confidence interval Thus, for normally distributed statistics, the formula for the 95% confidence interval is: sample statistic  2 x (standard error) Examples: 95% CI for mean vitamin D: 63 nmol/L  2 x (3.3) = 56.4 – 69.6 nmol/L 95% CI for the correlation coefficient: 0.15  2 x (0.1) = -.05 – .35

442 Simulation of 20 studies of 100 men…
Vertical line indicates the true mean (62) 95% confidence intervals for the mean vitamin D for each of the simulated studies. Only 1 confidence interval missed the true mean.

443 Confidence Intervals give:
*A plausible range of values for a population parameter. *The precision of an estimate. (When sampling variability is high, the confidence interval will be wide to reflect the uncertainty of the observation.) *Statistical significance (if the 95% CI does not cross the null value, it is significant at .05)

444 Confidence Intervals The value of the statistic in my sample (eg., mean, odds ratio, etc.) point estimate  (measure of how confident we want to be)  (standard error) From a Z table or a T table, depending on the sampling distribution of the statistic. Standard error of the statistic.

445 Common “Z” levels of confidence
Commonly used confidence levels are 90%, 95%, and 99%
Confidence Level → Z value: 80% → 1.28; 90% → 1.645; 95% → 1.96; 98% → 2.33; 99% → 2.58; 99.8% → 3.08; 99.9% → 3.27

446 99% confidence intervals…
99% CI for mean vitamin D: 63 nmol/L ± 2.6 × (3.3) = 54.4 – 71.6 nmol/L 99% CI for the correlation coefficient: 0.15 ± 2.6 × (0.1) = −.11 – .41

447 Testing Hypotheses 1. Is the mean vitamin D in middle- aged and older European men lower than 100 nmol/L (the “desirable” level)? 2. Is cognitive function correlated with vitamin D?

448 Is the mean vitamin D different than 100?
Start by assuming that the mean = 100 This is the “null hypothesis” This is usually the “straw man” that we want to shoot down Determine the distribution of statistics assuming that the null is true…

449 Computer simulation (10,000 repeats)…
This is called the null distribution! Normally distributed Std error = 3.3 Mean = 100

450 Compare the null distribution to the observed value…
What’s the probability of seeing a sample mean of 63 nmol/L if the true mean is 100 nmol/L? It didn’t happen in 10,000 simulated studies. So the probability is less than 1/10,000

451 Compare the null distribution to the observed value…
This is the p- value! P-value < 1/10,000

452 Calculating the p-value with a formula…
Because we know how normal curves work, we can exactly calculate the probability of seeing an average of 63 nmol/L if the true average were 100 nmol/L (i.e., if our null hypothesis is true): Z = (63 − 100)/3.3 = −11.2, P-value << .0001
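A minimal SAS sketch of this Z and p-value (our own check):
data _null_;
  z = (63 - 100)/3.3;        * = -11.2;
  p = 2*probnorm(-abs(z));   * two-sided p-value, essentially 0;
  put z= p=;
run;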

453 The P-value P-value is the probability that we would have seen our data (or something more unexpected) just by chance if the null hypothesis (null value) is true. Small p-values mean the null value is unlikely given our data. Our data are so unlikely given the null hypothesis (<<1/10,000) that I’m going to reject the null hypothesis! (Don’t want to reject our data!)

454 P-value<.0001 means: The probability of seeing what you saw or something more extreme if the null hypothesis is true (due to chance)<.0001 P(empirical data/null hypothesis) <.0001

455 The P-value By convention, p-values of <.05 are often accepted as “statistically significant” in the medical literature; but this is an arbitrary cut-off. A cut-off of p<.05 means that in about 5 of 100 experiments, a result would appear significant just by chance (“Type I error”).

456 Summary: Hypothesis Testing
The Steps: 1.     Define your hypotheses (null, alternative) 2.     Specify your null distribution 3.     Do an experiment 4.     Calculate the p-value of what you observed 5.     Reject or fail to reject (~accept) the null hypothesis

457 Hypothesis Testing The Steps:
Define your hypotheses (null, alternative) The null hypothesis is the “straw man” that we are trying to shoot down. Null here: “mean vitamin D level = 100 nmol/L” Alternative here: “mean vit D < 100 nmol/L” (one-sided) Specify your sampling distribution (under the null) If we repeated this experiment many, many times, the mean vitamin D would be normally distributed around 100 nmol/L with a standard error of 3.3 3. Do a single experiment (observed sample mean = 63 nmol/L) 4. Calculate the p-value of what you observed (p<.0001) 5. Reject or fail to reject the null hypothesis (reject)

458 Confidence intervals give the same information as (and more than) hypothesis tests…

459 Duality with hypothesis tests.
Null value 95% confidence interval Null hypothesis: Average vitamin D is 100 nmol/L Alternative hypothesis: Average vitamin D is not 100 nmol/L (two-sided) P-value < .05 459

460 Duality with hypothesis tests.
Null value 99% confidence interval Null hypothesis: Average vitamin D is 100 nmol/L Alternative hypothesis: Average vitamin D is not 100 nmol/L (two-sided) P-value < .01 460

461 2. Is cognitive function correlated with vitamin D?
Null hypothesis: r = 0 Alternative hypothesis: r ≠ 0 Two-sided hypothesis Doesn't assume that the correlation will be positive or negative.

462 Computer simulation (15,000 repeats)…
Null distribution: Normally distributed Std error = 0.1 Mean = 0

463 What’s the probability of our data?
Even when the true correlation is 0, we get correlations as big as 0.15 or bigger 7% of the time.

464 What’s the probability of our data?
This is a two-sided hypothesis test, so “more extreme” includes as big or bigger negative correlations (<-0.15). P-value = 7% + 7% = 14%

465 What’s the probability of our data?
Our results could have happened purely due to a fluke of chance!

466 Formal hypothesis test
1. Null hypothesis: r = 0; alternative: r ≠ 0 (two-sided)
2. Determine the null distribution: normally distributed, standard error = 0.1
3. Collect data: r = 0.15
4. Calculate the p-value for the data: Z = (0.15 − 0)/0.1 = 1.5; a Z of 1.5 corresponds to a two-sided p-value of 14%
5. Reject or fail to reject the null (fail to reject)
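A quick check of that two-sided p-value in SAS (a minimal sketch using the standard normal CDF function probnorm):

data _null_;
  z = (0.15 - 0) / 0.1;               * observed r minus null value, over standard error;
  pval = 2 * (1 - probnorm(abs(z)));  * two-sided p-value, approx 0.13-0.14;
  put z= pval=;
run;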

467 Or use confidence interval to gauge statistical significance…
95% CI = -0.05 to 0.35 (0.15 ± 1.96 × 0.1). Thus, 0 (the null value) is a plausible value! P>.05

468 Examples of Sample Statistics:
Single population mean; Single population proportion; Difference in means (ttest); Difference in proportions (Z-test); Odds ratio/risk ratio; Correlation coefficient; Regression coefficient. It turns out that if you were to go out and sample many, many times, most sample statistics that you could calculate would follow a normal distribution. What are the 2 parameters (from last time) that define any normal distribution? Remember that a normal curve is characterized by two parameters: a mean and a variability (SD). What do you think the mean value of a sample statistic would be? The standard deviation? Remember, standard deviation is the natural variability of the population. Standard error can be the standard error of the mean, or the standard error of the odds ratio, or the standard error of the difference of 2 means, etc.: the standard error of any sample statistic.

469 Example 2: HIV vaccine trial
Thai HIV vaccine trial (2009): 8197 randomized to vaccine; 8198 randomized to placebo. Generated a lot of public discussion about p-values!

470 51/8197 vs. 74/8198 = 23 excess infections in the placebo group = 2.8 fewer infections per 1000 people vaccinated. Source: BBC news

471 Null hypothesis Null hypothesis: infection rate is the same in the two groups Alternative hypothesis: infection rates differ

472 Computer simulation assuming the null (15,000 repeats)…
Normally distributed, standard error = 11.1

473 Computer simulation assuming the null (15,000 repeats)…
If the vaccine is completely ineffective, we could still get 23 excess infections just by chance. Probability of 23 or more excess infections = 0.04

474 How to interpret p=.04… P(data | null) = .04, but P(null | data) ≠ .04
*estimated using Bayes’ Rule (and prior data on the vaccine) *Gilbert PB, Berger JO, Stablein D, Becker S, Essex M, Hammer SM, Kim JH, DeGruttola VG. Statistical interpretation of the RV144 HIV vaccine efficacy trial in Thailand: a case study for statistical issues in efficacy trials. J Infect Dis 2011; 203:

475 Alternative analysis of the data (“intention to treat”)…
56/8202 (6.8 per 1000) infections in the vaccine group versus 76/8200 (9.3 per 1000)

476 Computer simulation assuming the null (15,000 repeats)…
Probability of 20 or more excess infections = 0.08. P=.08 is only slightly different from p=.04!

477 Confidence intervals…
95% CI (analysis 1): to 95% CI (analysis 2): to The plausible ranges are nearly identical!

478 One sample statistical tests, continued…

479 Recall: Single population mean (large n)
Hypothesis test: Z = (x̄ − μ0)/(s/√n)
Confidence interval: x̄ ± Z(α/2) × (s/√n)

480 Single population mean (small n, normally distributed trait)
Hypothesis test: t(n−1) = (x̄ − μ0)/(s/√n)
Confidence interval: x̄ ± t(n−1, α/2) × (s/√n)

481 What is a T-distribution?
A t-distribution is like a Z distribution, except it has slightly fatter tails to reflect the uncertainty added by estimating σ. The bigger the sample size (i.e., the bigger the sample size used to estimate σ), the closer t becomes to Z. If n>100, t approaches Z.

482 T-distribution with only 1 degree of freedom.

483 T-distribution with 4 degrees of freedom.

484 T-distribution with 9 degrees of freedom.

485 T-distribution with 29 degrees of freedom.

486 T-distribution with 99 degrees of freedom. Looks a lot like Z!!
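To see this convergence numerically, a minimal SAS sketch (tinv returns quantiles of the t-distribution) printing the upper 2.5% critical value at each of these degrees of freedom:

data _null_;
  do df = 1, 4, 9, 29, 99;
    tcrit = tinv(0.975, df);  * two-sided 5% cutoff for this df;
    put df= tcrit=;
  end;
run;

This prints roughly 12.71, 2.78, 2.26, 2.05, and 1.98, closing in on the Z cutoff of 1.96.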

487 Student’s t Distribution
Note: t → Z as n increases. Standard Normal (t with df = ∞); t (df = 13); t (df = 5). t-distributions are bell-shaped and symmetric, but have 'fatter' tails than the normal. from "Statistics for Managers" Using Microsoft® Excel 4th Edition, Prentice-Hall 2004

488 Student’s t Table .05 2 t /2 = .05 2.920 Upper Tail Area df .25 .10 1
Let: n = df = n - 1 =  = /2 =.05 df .25 .10 .05 1 1.000 3.078 6.314 2 0.817 1.886 2.920 /2 = .05 3 0.765 1.638 2.353 The body of the table contains t values, not probabilities t 2.920 from “Statistics for Managers” Using Microsoft® Excel 4th Edition, Prentice-Hall 2004 488

489 t distribution values, with comparison to the Z value
Confidence    t (10 d.f.)   t (20 d.f.)   t (30 d.f.)   Z
Level
.80           1.372         1.325         1.310         1.282
.90           1.812         1.725         1.697         1.645
.95           2.228         2.086         2.042         1.960
.99           3.169         2.845         2.750         2.576
Note: t → Z as n increases. from "Statistics for Managers" Using Microsoft® Excel 4th Edition, Prentice-Hall 2004

490 The T probability density function
What does t look like mathematically? (You may at least recognize some resemblance to the normal distribution function…)

\[ f(t) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}} \]

Where: ν is the degrees of freedom; Γ (gamma) is the Gamma function; π is the constant Pi (3.14159…)

491 The t-distribution in SAS
Yikes! The t-distribution looks like a mess! Don’t want to integrate! Luckily, there are charts and SAS! MUST SPECIFY DEGREES OF FREEDOM! The t-function in SAS is: probt(t-statistic, df)
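For example, the two-sided p-value for a hypothetical observed t of 2.0 with 24 degrees of freedom (made-up numbers, just to show the call):

data _null_;
  pval = 2 * (1 - probt(2.0, 24));  * probt gives P(T <= t) for the stated df;
  put pval=;                        * prints about .06;
run;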

492 The normality assumption…
Ttests (and all linear models, in fact) have a "normality assumption": if the outcome variable is not normally distributed and the sample size is small, a ttest is inappropriate. It takes longer for the CLT to kick in, and the sample means do not immediately follow a t-distribution… This is the source of the "normality assumption" of the ttest…

493 Computer simulation of the distribution of the sample mean (non-normal, small n):
1. Pick any probability distribution and specify a mean and standard deviation (e.g., a true mean of 128 lbs with a variability of 15 lbs).
2. Tell the computer to randomly generate 1000 observations from that probability distribution (e.g., the computer is more likely to spit out values with high probabilities).
3. Calculate 1000 T-statistics: T = (x̄ − μ)/(s/√n)
4. Plot the T-statistics in histograms.
5. Repeat for different sample sizes (n's).
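A minimal SAS sketch of this procedure for the exponential case shown on the next slides (assuming n=5 per sample and 1000 repeats; rand('EXPONENTIAL') draws from an exponential with mean = SD = 1):

data tsim (keep = t);
  array x{5};
  do rep = 1 to 1000;
    do i = 1 to dim(x);
      x{i} = rand('EXPONENTIAL');  * one observation; mean=1, SD=1;
    end;
    xbar = mean(of x{*});
    s = std(of x{*});
    t = (xbar - 1) / (s / sqrt(dim(x)));  * T-statistic, using the true mean of 1;
    output;
  end;
run;

proc sgplot data=tsim;
  histogram t;  * compare the shape against a t-distribution;
run;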

494 n=2, underlying distribution is exponential (mean=1, SD=1)
This is NOT a t-distribution!

495 n=5, underlying distribution is exponential (mean=1, SD=1)
This is NOT a t-distribution!

496 n=10, underlying distribution is exponential (mean=1, SD=1)
This doesn’t yet follow a t- distribution!

497 n=30, underlying distribution is exponential (mean=1, SD=1)
Still not quite a t-distribution! Note the left skew.

498 n=100, underlying distribution is exponential (mean=1, SD=1)
Now, pretty close to a T-distribution!

499 Conclusions If the underlying data are not normally distributed AND n is small**, the means do not follow a t-distribution (so using a ttest will result in erroneous inferences). Data transformation or non-parametric tests should be used instead. **How small is too small? No hard and fast rule; it depends on the true shape of the underlying distribution. Here N>30 (closer to 100) is needed.

500 Practice Problem: A manufacturer of light bulbs claims that its light bulbs have a mean life of 1520 hours with an unknown standard deviation. A random sample of 40 such bulbs is selected for testing. If the sample produces a mean value of hours and a sample standard deviation of 86, is there sufficient evidence to claim that the mean life is significantly less than the manufacturer claimed? Assume that light bulb lifetimes are roughly normally distributed.

501 Answer 1. What is your null hypothesis?
Null hypothesis: mean life = 1520 hours. Alternative hypothesis: mean life < 1520 hours.
2. What is your null distribution? Since we have to estimate the standard deviation, we need to make inferences from a T-curve with 39 degrees of freedom.
3. Empirical evidence: 1 random sample of 40 has a mean of hours.
5. Probably not sufficient evidence to reject the null. We cannot sue the light bulb manufacturer for false advertising! Notice that using the t-distribution to calculate the p-value didn't change much! With n>30, might as well use the Z table.

502 Practice problem You want to estimate the average ages of kids that ride a particular kids' ride at Disneyland. You take a random sample of 8 kids exiting the ride, and find that their ages are: 2,3,4,5,6,6,7,7. Assume that ages are roughly normally distributed. a. Calculate the sample mean. b. Calculate the sample standard deviation. c. Calculate the standard error of the mean. d. Calculate the 99% confidence interval.

503 Answer (a,b) a. Calculate the sample mean.
Sample mean = (2+3+4+5+6+6+7+7)/8 = 40/8 = 5 years. b. Calculate the sample standard deviation: s = √(Σ(x − x̄)²/(n−1)) = √(24/7) ≈ 1.85 years.

504 Answer (c) c. Calculate the standard error of the mean: SE = s/√n = 1.85/√8 ≈ 0.65

505 Answer (d) d. Calculate the 99% confidence interval: t(7, .005) = 3.5, so the 99% CI is 5 ± 3.5 × 0.65 = (2.7, 7.3)
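These hand calculations can be checked in SAS (a minimal sketch; clm with alpha=0.01 requests the 99% confidence limits):

data ages;
  input age @@;
  datalines;
2 3 4 5 6 6 7 7
;
run;

proc means data=ages n mean std stderr clm alpha=0.01;
  var age;
run;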

506 Example problem, class data:
A two-tailed hypothesis test: A researcher claims that Stanford affiliates eat fewer than the recommended intake of 5 servings of fruits and vegetables per day. We have data to address this claim: 24 people in the class provided data on their daily fruit and vegetable intake. Do we have evidence to dispute her claim?

507 Histogram fruit and veggie intake (n=24)…
Mean=3.7 servings Median=3 servings Mode=3 servings Std Dev=1.7 servings

508 Answer 1. Define your hypotheses (null, alternative)
H0: average servings = 5.0; Ha: average servings ≠ 5.0 (two-sided). 2. Specify your null distribution: a T-curve with 23 df, centered at 5.0, with standard error = 1.7/√24 ≈ 0.35

509 Answer, continued 3. Do an experiment
Observed mean in our experiment = 3.7 servings. T23 critical value for p<.05, two-tailed = 2.07. 4. Calculate the p-value of what you observed: p-value < .05. 5. Reject or fail to reject (~accept) the null hypothesis: Reject! Stanford affiliates eat significantly fewer than the recommended servings of fruits and veggies.
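Filling in the arithmetic behind step 4, using the class summary statistics above:

\[ t_{23} = \frac{3.7 - 5.0}{1.7/\sqrt{24}} \approx \frac{-1.3}{0.35} \approx -3.7 \]

Since |−3.7| exceeds the critical value of 2.07, p < .05.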

510 95% Confidence Interval H0: average servings = 5.0
95% CI: 3.7 ± 2.07 × 0.35 = (2.98, 4.42). The 95% CI excludes 5, so p-value < .05

511 Paired data (repeated measures)
Patient   BP Before (diastolic)   BP After
1         100                     92
2         89                      84
3         83                      80
4         98                      93
5         108                     98
6         95                      90
What about these data? How do you analyze these?

512 Example problem: paired ttest
Patient   Diastolic BP Before   D. BP After   Change
1         100                   92            -8
2         89                    84            -5
3         83                    80            -3
4         98                    93            -5
5         108                   98            -10
6         95                    90            -5
Null Hypothesis: Average Change = 0

513 Example problem: paired ttest
Change: -8, -5, -3, -5, -10, -5. Null Hypothesis: Average Change = 0. Mean change = -6; SD of the changes ≈ 2.5; SE = 2.5/√6 ≈ 1.03; T = -6/1.03 ≈ -5.8. With 5 df, |T| > 2.571 corresponds to p<.05 (two-sided test), so reject the null.

514 Example problem: paired ttest
Change: -8, -5, -3, -5, -10, -5. 95% CI: -6 ± 2.571 × 1.03 = (-8.7, -3.3). Note: does not include 0.
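A minimal SAS sketch of this paired analysis (patient 5's follow-up value of 98 is inferred from the -10 change above):

data bp;
  input before after @@;
  datalines;
100 92  89 84  83 80  98 93  108 98  95 90
;
run;

proc ttest data=bp;
  paired before*after;  * tests H0: mean(before - after) = 0 and reports the 95% CI;
run;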

515 Summary: Single population mean (small n, normality)
Hypothesis test: t(n−1) = (x̄ − μ0)/(s/√n)
Confidence interval: x̄ ± t(n−1, α/2) × (s/√n)

516 Summary: paired ttest Hypothesis test: Confidence Interval
Where d=change over time or difference within a pair. 516

517 Summary: Single population mean (large n)
Hypothesis test: Z = (x̄ − μ0)/(s/√n)
Confidence interval: x̄ ± Z(α/2) × (s/√n)

518 Examples of Sample Statistics:
Single population mean (known σ); Single population mean (unknown σ); Single population proportion; Difference in means (ttest); Difference in proportions (Z-test); Odds ratio/risk ratio; Correlation coefficient; Regression coefficient. It turns out that if you were to go out and sample many, many times, most sample statistics that you could calculate would follow a normal distribution. What are the 2 parameters (from last time) that define any normal distribution? Remember that a normal curve is characterized by two parameters: a mean and a variability (SD). What do you think the mean value of a sample statistic would be? The standard deviation? Remember, standard deviation is the natural variability of the population. Standard error can be the standard error of the mean, or the standard error of the odds ratio, or the standard error of the difference of 2 means, etc.: the standard error of any sample statistic.

519 Recall: normal approximation to the binomial…
Statistics for proportions are based on a normal distribution, because the binomial can be approximated as normal if np>5

520 Recall: stats for proportions
For the binomial count X: mean = np, variance = np(1−p). For the proportion p̂ = X/n: mean = p, variance = p(1−p)/n. The two differ by a factor of n. P-hat (p̂) stands for "sample proportion."

521 Sampling distribution of a sample proportion
p̂ is (approximately) normally distributed with mean p and standard error √(p(1−p)/n), where p = the true population proportion. BUT… if you knew p you wouldn't be doing the experiment!

522 Practice Problem A fellow researcher claims that at least 15% of smokers fail to eat any fruits and vegetables at least 3 days a week. You find this hard to believe and decide to check the validity of this statistic by taking a random (representative) sample of smokers. Do you have sufficient evidence to reject your colleague’s claim if you discover that 17 of the 200 smokers in your sample eat no fruits and vegetables at least 3 days a week?

523 Answer 1. What is your null hypothesis?
Null hypothesis: p = proportion of smokers who skip fruits and veggies frequently ≥ .15. Alternative hypothesis: p < .15.
2. What is your null distribution? Var(p̂) = .15 × .85/200 = .00064; SD(p̂) = .025; p̂ ~ N(.15, .025).
3. Empirical evidence: 1 random sample: p̂ = 17/200 = .085.
4. Z = (.085 − .15)/.025 = -2.6; p-value = P(Z < -2.6) = .0047.
5. Sufficient evidence to reject the claim.

524 OR, use computer simulation…
1. Have SAS randomly pick 200 observations from a binomial distribution with p=.15 (the null). 2. Divide the resulting count by 200 to get the observed sample proportion. 3. Repeat this 1000 times (or some arbitrarily large number of times). 4. Plot the resulting distribution of sample proportions in a histogram:
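A minimal SAS sketch of those four steps (rand('BINOMIAL', p, n) draws the count of successes out of n):

data psim (keep = phat);
  do rep = 1 to 1000;
    count = rand('BINOMIAL', 0.15, 200);  * number of veggie-skippers out of 200, under the null;
    phat = count / 200;
    output;
  end;
run;

proc sgplot data=psim;
  histogram phat;
run;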

525 How often did we get observed values of 0.085 or lower when the true p = .15?
Only 4/1000 times! Empirical p-value = .004

526 Practice Problem In Saturday’s newspaper, in a story about poll results from Ohio, the article said that 625 people in Ohio were sampled and claimed that the margin of error in the results was 4%. Can you explain where that 4% margin of error came from?

527 Answer
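Presumably the 4% is the half-width of a 95% confidence interval for a proportion, computed at the worst case p = 0.5:

\[ \text{margin of error} = 1.96\sqrt{\frac{p(1-p)}{n}} \le 1.96\sqrt{\frac{0.5 \times 0.5}{625}} = 1.96 \times 0.02 \approx 0.04 = 4\% \]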

528 Paired data proportions test…
Analogous to paired ttest… Also takes on a slightly different form known as McNemar’s test (we’ll see lots more on this next term…)

529 Paired data proportions test…
1000 subjects were treated with antidepressants for 6 months and with placebo for 6 months (order of tx was randomly assigned) Question: do suicide attempts (yes/no) differ depending on whether a subject is on antidepressants or on placebo?

530 Paired data proportions test…
15 subjects attempted suicide in both conditions (non-informative) 10 subjects attempted suicide in the antidepressant condition but not the placebo condition 5 subjects attempted suicide in the placebo condition but not the antidepressant condition 970 did not attempt suicide in either condition (non-informative) Data boils down to 15 observations… In 10/15 cases (66.6%), antidepressant>placebo.

531 Paired proportions test…
Single proportions test: under the null hypothesis, antidepressants and placebo work equally well. So, Ho: among discordant cases, p(antidepressant > placebo) = 0.5. Observed p̂ = 10/15 ≈ .67. Not enough evidence to reject the null!
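A minimal SAS sketch of the exact version of this test (an exact binomial on the 15 discordant pairs; probbnml(p, n, m) returns P(X ≤ m)):

data _null_;
  p_upper = 1 - probbnml(0.5, 15, 9);  * P(10 or more of 15 pairs favor antidepressant | p=0.5);
  pval = 2 * p_upper;                  * two-sided exact p-value, roughly 0.30;
  put p_upper= pval=;
run;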

532 Key one-sample Hypothesis Tests…
Test for Ho: μ = μ0: t(n−1) = (x̄ − μ0)/(s/√n)
Test for Ho: p = p0: Z = (p̂ − p0)/√(p0(1−p0)/n)
T(n−1) approaches Z for large n. ** If np (expected value) < 5, use the exact binomial rather than the Z approximation.

533 Corresponding confidence intervals…
For a mean: x̄ ± t(n−1, α/2) × (s/√n)
For a proportion: p̂ ± Z(α/2) × √(p̂(1−p̂)/n)
T(n−1) approaches Z for large n. ** If np (expected value) < 5, use the exact binomial rather than the Z approximation…

534 Symbol overload! n: Sample size Z: Z-statistic (standard normal)
tdf: T-statistic (t-distribution with df degrees of freedom)
p̂ ("p-hat"): sample proportion
X̄ ("X-bar"): sample mean
s: sample standard deviation
p0: null hypothesis proportion
μ0: null hypothesis mean

535 Pitfalls of Hypothesis Testing

536 Hypothesis Testing The Steps:
1. Define your hypotheses (null, alternative)
2. Specify your null distribution
3. Do an experiment
4. Calculate the p-value of what you observed
5. Reject or fail to reject (~accept) the null hypothesis
Follows the logic: If A then B; not B; therefore, not A.

537 Summary: The Underlying Logic of hypothesis tests…
Follows this logic: Assume A. If A, then B. Not B. Therefore, Not A. But throw in a bit of uncertainty…If A, then probably B…

538 Error and Power
Type-I Error (also known as "α"): rejecting the null when the effect isn't real. Type-II Error (also known as "β"): failing to reject the null when the effect is real. POWER (the flip side of type-II error: 1 − β): the probability of seeing a true effect if one exists. Note the sneaky conditionals…

539 Think of… Pascal’s Wager
The TRUTH:
Your Decision   | God Exists            | God Doesn't Exist
Reject God      | BIG MISTAKE           | Correct
Accept God      | Correct (Big Pay Off) | MINOR MISTAKE

540 Type I and Type II Error in a box
Your Statistical Decision | H0 True (example: the drug doesn't work) | H0 False (example: the drug works)
Reject H0 (ex: you conclude that the drug works) | Type I error (α) | Correct
Do not reject H0 (ex: you conclude that there is insufficient evidence that the drug works) | Correct | Type II Error (β)

541 Error and Power Type I error rate (or significance level): the probability of finding an effect that isn’t real (false positive). If we require p-value<.05 for statistical significance, this means that 1/20 times we will find a positive result just by chance. Type II error rate: the probability of missing an effect (false negative). Statistical power: the probability of finding an effect if it is there (the probability of not making a type II error). When we design studies, we typically aim for a power of 80% (allowing a false negative rate, or type II error rate, of 20%).

542 Pitfall 1: over-emphasis on p-values
Clinically unimportant effects may be statistically significant if a study is large (and therefore has a small standard error and extreme precision). Pay attention to effect size and confidence intervals.

543 Example: effect size A prospective cohort study of 34,079 women found that women who exercised >21 MET hours per week gained significantly less weight than women who exercised <7.5 MET hours (p<.001) Headlines: “To Stay Trim, Women Need an Hour of Exercise Daily.” Physical Activity and Weight Gain Prevention. JAMA 2010;303:

544 Mean (SD) Differences in Weight Over Any 3-Year Period by Physical Activity Level, Women's Health Study. Lee, I. M. et al. JAMA 2010;303:

545 What was the effect size?
Those who exercised the least gained 0.15 kg (.33 pounds) more than those who exercised the most over 3 years. Extrapolated over the 13 years of the study, the high exercisers gained 1.4 pounds less than the low exercisers! Classic example of a statistically significant effect that is not clinically significant.

546 A picture is worth…

547 Authors explain: "Figure 2 shows the trajectory of weight gain over time by baseline physical activity levels. When classified by this single measure of physical activity, all 3 groups showed similar weight gain patterns over time." But baseline physical activity should predict weight gain in the first three years…do those slopes look different to you?

548 Another recent headline
Drinkers May Exercise More Than Teetotalers Activity levels rise along with alcohol use, survey shows “MONDAY, Aug. 31 (HealthDay News) -- Here's something to toast: Drinkers are often exercisers”… “In reaching their conclusions, the researchers examined data from participants in the 2005 Behavioral Risk Factor Surveillance System, a yearly telephone survey of about 230,000 Americans.”… For women, those who imbibed exercised 7.2 minutes more per week than teetotalers. The results applied equally to men…

549 Pitfall 2: association does not equal causation
Statistical significance does not imply a cause-effect relationship. Interpret results in the context of the study design.

550 Pitfall 3: data dredging/multiple comparisons
In 1980, researchers at Duke randomized 1073 heart disease patients into two groups, but treated the groups equally. Not surprisingly, there was no difference in survival. Then they divided the patients into 18 subgroups based on prognostic factors. In a subgroup of 397 patients (with three-vessel disease and an abnormal left ventricular contraction) survival of those in "group 1" was significantly different from survival of those in "group 2" (p<.025). How could this be since there was no treatment? (Lee et al. "Clinical judgment and statistics: lessons from a simulated randomized trial in coronary artery disease," Circulation, 61: , 1980.)

551 Pitfall 3: multiple comparisons
The difference resulted from the combined effect of small imbalances in the subgroups.

552 Multiple comparisons By using a p-value of 0.05 as the criterion for significance, we’re accepting a 5% chance of a false positive (of calling a difference significant when it really isn’t). If we compare survival of “treatment” and “control” within each of 18 subgroups, that’s 18 comparisons. If these comparisons were independent, the chance of at least one false positive would be…

553 Multiple comparisons With 18 independent comparisons, we have a 60% chance of at least 1 false positive.

554 Multiple comparisons With 18 independent comparisons, we expect about 1 false positive.
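The arithmetic behind these two slides:

\[ P(\text{at least one false positive}) = 1 - (1 - 0.05)^{18} = 1 - 0.95^{18} \approx 0.60 \]
\[ E(\text{false positives}) = 18 \times 0.05 = 0.9 \approx 1 \]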

555 Pitfall 3: multiple comparisons
A significance level of 0.05 means that your false positive rate for one test is 5%. If you run more than one test, your false positive rate will be higher than 5%. Control study-wide type I error by planning a limited number of tests. Distinguish between planned and exploratory tests in the results. Correct for multiple comparisons.

556 Results from Class survey…
My research question was actually to test whether or not being born on odd or even days predicted anything about your future. In fact, I discovered that people who were born on even days: Had significantly better English SATs (p=.04) Tended to enjoy manuscript writing more (p=.09) Tended to be more pessimistic (p=.09)

557 Results from Class survey…
The differences were clinically meaningful. Compared with those born on odd days (n=11), those born on even days (n=13): Scored 65 points higher on the English SAT (720 vs. 655) Enjoyed manuscript writing by 1.5 units more (6.2 vs. 4.8) Were less optimistic by 1.5 units (6.7 vs. 8.2)

558 Results from Class survey…
I can see the NEJM article title now… “Being born on even days makes you a better writer, but may predispose to depression.”

559 Results from Class survey…
Assuming that this difference can’t be explained by astrology, it’s obviously an artifact! What’s going on?…

560 Results from Class survey…
After the odd/even day question, I asked you 25 other questions… I ran 25 statistical tests (comparing the outcome variable between odd-day born people and even-day born people). So, there was a high chance of finding at least one false positive!

561 P-value distribution for the 25 tests…
Under the null hypothesis of no associations (which we'll assume is true here!), p-values follow a uniform distribution… My "significant" and near-significant p-values!

562 Compare with… Next, I generated 25 "p-values" from a random number generator (uniform distribution). These were the results from three runs…

563 In the medical literature…
Researchers examined the relationship between intakes of caffeine/coffee/tea and breast cancer overall and in multiple subgroups (50 tests). Overall, there was no association. Risk ratios were close to 1.0 (ranging from 0.67 to 1.79), indicated protection (<1.0) about as often as harm (>1.0), and showed no consistent dose-response pattern. But they found 4 "significant" p-values in subgroups: coffee intake was linked to increased risk in those with benign breast disease (p=.08); caffeine intake was linked to increased risk of estrogen/progesterone negative tumors and tumors larger than 2 cm (p=.02); decaf coffee was linked to reduced risk of BC in postmenopausal hormone users (p=.02). Ishitani K, Lin J, Manson JE, Buring JE, Zhang SM. Caffeine consumption and the risk of breast cancer in a large prospective cohort of women. Arch Intern Med. 2008;168:

564 Distribution of the p-values from the 50 tests
Likely chance findings! Also, effect sizes showed no consistent pattern. The risk ratios: were close to 1.0 (ranging from 0.67 to 1.79); indicated protection (<1.0) about as often as harm (>1.0); showed no consistent dose-response pattern.

565 Hallmarks of a chance finding:
Analyses are exploratory Many tests have been performed but only a few are significant The significant p-values are modest in size (between p=0.01 and p=0.05) The pattern of effect sizes is inconsistent The p-values are not adjusted for multiple comparisons

566 Pitfall 4: high type II error (low statistical power)
Results that are not statistically significant should not be interpreted as "evidence of no effect," but as "no evidence of effect." Studies may miss effects if they are insufficiently powered (lack precision). Example: A study of 36 postmenopausal women failed to find a significant relationship between hormone replacement therapy and prevention of vertebral fracture. The odds ratio and 95% CI were: 0.38 (0.12, 1.19), indicating a potentially meaningful clinical effect. Failure to find an effect may have been due to insufficient statistical power for this endpoint. Design adequately powered studies and interpret in the context of study power if results are null. Ref: Wimalawansa et al. Am J Med 1998, 104:

567 Pitfall 5: the fallacy of comparing statistical significance
“the effect was significant in the treatment group, but not significant in the control group” does not imply that the groups differ significantly

568 Example In a placebo-controlled randomized trial of DHA oil for eczema, researchers found a statistically significant improvement in the DHA group but not the placebo group. The abstract reports: “DHA, but not the control treatment, resulted in a significant clinical improvement of atopic eczema.” However, the improvement in the treatment group was not significantly better than the improvement in the placebo group, so this is actually a null result.

569 Misleading “significance comparisons”
The improvement in the DHA group (18%) is not significantly greater than the improvement in the control group (11%). Koch C, Dölle S, Metzger M, et al. Docosahexaenoic acid (DHA) supplementation in atopic eczema: a randomized, double-blind, controlled trial. Br J Dermatol 2008;158:

570 Within-group vs. between-group tests
Examples of statistical tests used to evaluate within-group effects versus statistical tests used to evaluate between-group effects:
Statistical tests for within-group effects | Statistical tests for between-group effects
Paired ttest | Two-sample ttest
Wilcoxon sign-rank test | Wilcoxon sum-rank test (equivalently, Mann-Whitney U test)
Repeated-measures ANOVA, time effect | ANOVA; repeated-measures ANOVA, group*time effect
McNemar's test | Difference in proportions, chi-square test, or relative risk

571 Also applies to interactions…
Similarly, “we found a significant effect in subgroup 1 but not subgroup 2” does not constitute prove of interaction For example, if the effect of a drug is significant in men, but not in women, this is not proof of a drug-gender interaction.

572 Overview of statistical tests

573 Which test should I use? Outcome Variable
Outcome variable | Independent observations | Correlated observations | Assumptions
Continuous (e.g. pain scale, cognitive function) | Ttest; ANOVA; Linear correlation; Linear regression | Paired ttest; Repeated-measures ANOVA; Mixed models/GEE modeling | Outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship
Binary or categorical (e.g. fracture yes/no) | Relative risks; Chi-square test; Logistic regression | McNemar's test; Conditional logistic regression; GEE modeling | Sufficient numbers in each cell (>=5)
Time-to-event (e.g. time to fracture) | Kaplan-Meier statistics; Cox regression | n/a | Cox regression assumes proportional hazards between groups

574 Which test should I use? 1. What is the dependent variable?
Outcome variable | Independent observations | Correlated observations | Assumptions
Continuous (e.g. pain scale, cognitive function) | Ttest; ANOVA; Linear correlation; Linear regression | Paired ttest; Repeated-measures ANOVA; Mixed models/GEE modeling | Outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship
Binary or categorical (e.g. fracture yes/no) | Relative risks; Chi-square test; Logistic regression | McNemar's test; Conditional logistic regression; GEE modeling | Sufficient numbers in each cell (>=5)
Time-to-event (e.g. time to fracture) | Kaplan-Meier statistics; Cox regression | n/a | Cox regression assumes proportional hazards between groups

575 Which test should I use? 2. Are the observations correlated?
Outcome variable | Independent observations | Correlated observations | Assumptions
Continuous (e.g. pain scale, cognitive function) | Ttest; ANOVA; Linear correlation; Linear regression | Paired ttest; Repeated-measures ANOVA; Mixed models/GEE modeling | Outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship
Binary or categorical (e.g. fracture yes/no) | Relative risks; Chi-square test; Logistic regression | McNemar's test; Conditional logistic regression; GEE modeling | Sufficient numbers in each cell (>=5)
Time-to-event (e.g. time to fracture) | Kaplan-Meier statistics; Cox regression | n/a | Cox regression assumes proportional hazards between groups

576 Which test should I use? 3. Are key model assumptions met?
Outcome variable | Independent observations | Correlated observations | Assumptions
Continuous (e.g. pain scale, cognitive function) | Ttest; ANOVA; Linear correlation; Linear regression | Paired ttest; Repeated-measures ANOVA; Mixed models/GEE modeling | Outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship
Binary or categorical (e.g. fracture yes/no) | Relative risks; Chi-square test; Logistic regression | McNemar's test; Conditional logistic regression; GEE modeling | Sufficient numbers in each cell (>=5)
Time-to-event (e.g. time to fracture) | Kaplan-Meier statistics; Cox regression | n/a | Cox regression assumes proportional hazards between groups

577 Are the observations correlated?
What is the unit of observation? person (most common); limb; half a face; physician; clinical center.
Are the observations independent or correlated? Independent: observations are unrelated (usually different, unrelated people). Correlated: some observations are related to one another, for example: the same person over time (repeated measures), legs within a person, half a face.

578 Example: correlated data
Split-face trial: Researchers assigned 56 subjects to apply SPF 85 sunscreen to one side of their faces and SPF 50 to the other prior to engaging in 5 hours of outdoor sports during mid-day. The outcome is sunburn (yes/no). Unit of observation = side of a face Are the observations correlated? Yes. Russak JE et al. JAAD 2010; 62:

579 Results ignoring correlation:
Table I -- Dermatologist grading of sunburn after an average of 5 hours of skiing/snowboarding (P = .03; Fisher's exact test)
Sun protection factor | Sunburned | Not sunburned
85 | 1 | 55
50 | 8 | 48
Fisher's exact test compares the following proportions: 1/56 versus 8/56. Note that individuals are being counted twice!

580 Correct analysis of data:
Table 1. Correct presentation of the data from: Russak JE et al. JAAD 2010; 62: (P = .016; McNemar's exact test).
                           | SPF-85 side: Sunburned | SPF-85 side: Not sunburned
SPF-50 side: Sunburned     | 1                      | 7
SPF-50 side: Not sunburned | 0                      | 48
McNemar's exact test evaluates the probability of the following: in all 7 out of 7 cases where the sides of the face were discordant (i.e., one side burnt and the other side did not), the SPF 50 side sustained the burn.

581 Correlations Ignoring correlations will:
overestimate p-values for within-person or within-cluster comparisons; underestimate p-values for between-person or between-cluster comparisons

582 Common statistics for various types of outcome data
Are key model assumptions met?
Outcome variable | Independent observations | Correlated observations | Assumptions
Continuous (e.g. pain scale, cognitive function) | Ttest; ANOVA; Linear correlation; Linear regression | Paired ttest; Repeated-measures ANOVA; Mixed models/GEE modeling | Outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship
Binary or categorical (e.g. fracture yes/no) | Relative risks; Chi-square test; Logistic regression | McNemar's test; Conditional logistic regression; GEE modeling | Sufficient numbers in each cell (>=5)
Time-to-event (e.g. time to fracture) | Kaplan-Meier statistics; Cox regression | n/a | Cox regression assumes proportional hazards between groups

583 Key assumptions of linear models
Assumptions for linear models (ttest, ANOVA, linear correlation, linear regression, paired ttest, repeated-measures ANOVA, mixed models): Normally distributed outcome variable. Most important for small samples; large samples are quite robust against this assumption. Predictors have a linear relationship with the outcome. Graphical displays can help evaluate this.

584 Common statistics for various types of outcome data
Outcome variable | Independent observations | Correlated observations | Assumptions
Continuous (e.g. pain scale, cognitive function) | Ttest; ANOVA; Linear correlation; Linear regression | Paired ttest; Repeated-measures ANOVA; Mixed models/GEE modeling | Outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship
Binary or categorical (e.g. fracture yes/no) | Relative risks; Chi-square test; Logistic regression | McNemar's test; Conditional logistic regression; GEE modeling | Sufficient numbers in each cell (>=5)
Time-to-event (e.g. time to fracture) | Kaplan-Meier statistics; Cox regression | n/a | Cox regression assumes proportional hazards between groups
Are key model assumptions met?

585 Key assumptions for categorical tests
Assumptions for categorical tests (relative risks, chi-square, logistic regression, McNemar’s test): Sufficient numbers in each cell (np>=5) In the sunscreen trial, “exact” tests (Fisher’s exact, McNemar’s exact) were used because of the sparse data.

586 Continuous outcome (means); HRP 259/HRP 262
Outcome variable: Continuous (e.g. pain scale, cognitive function)
Independent observations: Ttest (compares means between two independent groups); ANOVA (compares means between more than two independent groups); Pearson's correlation coefficient (linear correlation; shows linear correlation between two continuous variables); Linear regression (multivariate regression technique used when the outcome is continuous; gives slopes)
Correlated observations: Paired ttest (compares means between two related groups, e.g., the same subjects before and after); Repeated-measures ANOVA (compares changes over time in the means of two or more groups, repeated measurements); Mixed models/GEE modeling (multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time)
Alternatives if the normality assumption is violated (and small sample size), non-parametric statistics: Wilcoxon sign-rank test (non-parametric alternative to the paired ttest); Wilcoxon sum-rank test (= Mann-Whitney U test; non-parametric alternative to the ttest); Kruskal-Wallis test (non-parametric alternative to ANOVA); Spearman rank correlation coefficient (non-parametric alternative to Pearson's correlation coefficient)

587 Binary or categorical outcomes (proportions); HRP 259/HRP 261
Outcome variable: Binary or categorical (e.g. fracture, yes/no)
Independent observations: Chi-square test (compares proportions between two or more groups); Relative risks (odds ratios or risk ratios); Logistic regression (multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios)
Correlated observations: McNemar's chi-square test (compares a binary outcome between correlated groups, e.g., before and after); Conditional logistic regression (multivariate regression technique for a binary outcome when groups are correlated, e.g., matched data); GEE modeling (multivariate regression technique for a binary outcome when groups are correlated, e.g., repeated measures)
Alternative to the chi-square test if sparse cells: Fisher's exact test (compares proportions between independent groups when there are sparse data, some cells <5); McNemar's exact test (compares proportions between correlated groups when there are sparse data, some cells <5)

588 Time-to-event outcome (survival data); HRP 262
Outcome variable: Time-to-event (e.g., time to fracture)
Independent observations: Kaplan-Meier statistics (estimates survival functions for each group, usually displayed graphically; compares survival functions with the log-rank test); Cox regression (multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios)
Correlated observations: n/a (already over time)
Modifications to Cox regression if proportional hazards is violated: time-dependent predictors or time-dependent hazard ratios (tricky!)

589 Two-sample tests

590 Binary or categorical outcomes (proportions)
Outcome variable: Binary or categorical (e.g. fracture, yes/no)
Independent observations: Chi-square test (compares proportions between two or more groups); Relative risks (odds ratios or risk ratios); Logistic regression (multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios)
Correlated observations: McNemar's chi-square test (compares a binary outcome between correlated groups, e.g., before and after); Conditional logistic regression (multivariate regression technique for a binary outcome when groups are correlated, e.g., matched data); GEE modeling (multivariate regression technique for a binary outcome when groups are correlated, e.g., repeated measures)
Alternative to the chi-square test if sparse cells: Fisher's exact test (compares proportions between independent groups when there are sparse data, some cells <5); McNemar's exact test (compares proportions between correlated groups when there are sparse data, some cells <5)

591 Recall: The odds ratio (two samples=cases and controls)
               Smoker (E) | Non-smoker (~E)
Stroke (D)     15         | 35
No Stroke (~D) 8          | 42
(50 cases, 50 controls.) OR = (15 × 42)/(35 × 8) = 2.25. Interpretation: there is a 2.25-fold higher odds of stroke in smokers vs. non-smokers.

592 Inferences about the odds ratio…
Does the sampling distribution follow a normal distribution? What is the standard error?

593 Simulation… 1. In SAS, assume infinite population of cases and controls with equal proportion of smokers (exposure), p=.23 (UNDER THE NULL!) 2. Use the random binomial function to randomly select n=50 cases and n=50 controls each with p=.23 chance of being a smoker. 3. Calculate the observed odds ratio for the resulting 2x2 table. 4. Repeat this 1000 times (or some large number of times). 5. Observe the distribution of odds ratios under the null hypothesis.
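A minimal SAS sketch of this simulation (1000 repeats here; tables with a zero cell, which would make the OR undefined, are simply skipped):

data orsim (keep = or lnor);
  do rep = 1 to 1000;
    expcase = rand('BINOMIAL', 0.23, 50);  * smokers among the 50 cases, under the null;
    expctrl = rand('BINOMIAL', 0.23, 50);  * smokers among the 50 controls;
    if expcase > 0 and expcase < 50 and expctrl > 0 and expctrl < 50 then do;
      or = (expcase * (50 - expctrl)) / (expctrl * (50 - expcase));
      lnor = log(or);
      output;
    end;
  end;
run;

proc sgplot data=orsim;
  histogram or;  * right-skewed; plot lnor instead for the symmetric version;
run;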

594 Properties of the OR (simulation)
(50 cases/50 controls/23% exposed) Under the null, this is the expected variability of the sample OR. Note the right skew.

595 Properties of the lnOR Normal!

596 Properties of the lnOR From the simulation, can get the empirical standard error (~0.5) and p- value (~.10)

597 Properties of the lnOR Or, in general, standard error of lnOR = √(1/a + 1/b + 1/c + 1/d), where a, b, c, and d are the four cells of the 2×2 table.

598 Inferences about the ln(OR)
               Smoker (E) | Non-smoker (~E)
Stroke (D)     15         | 35
No Stroke (~D) 8          | 42
lnOR = ln(2.25) ≈ 0.81; SE(lnOR) = √(1/15 + 1/35 + 1/8 + 1/42) ≈ 0.49; Z ≈ 1.6; p=.10

599 Confidence interval…
               Smoker (E) | Non-smoker (~E)
Stroke (D)     15         | 35
No Stroke (~D) 8          | 42
Final answer: 2.25 (0.85, 5.92)
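The interval comes from working on the log scale and exponentiating back, using the standard error from the previous slide:

\[ \ln(2.25) \pm 1.96 \times 0.49 = 0.81 \pm 0.97 = (-0.16,\ 1.78) \]
\[ (e^{-0.16},\ e^{1.78}) = (0.85,\ 5.92) \]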

600 Practice problem: Suppose the following data were collected in a case-control study of brain tumor and cell phone usage:
                       Brain tumor | No brain tumor
Own a cell phone       20          | 60
Don't own a cell phone 10          | 40
Is there sufficient evidence for an association between cell phones and brain tumor?

601 Answer 1. What is your null hypothesis? Null hypothesis: OR = 1.0; lnOR = 0. Alternative hypothesis: OR ≠ 1.0; lnOR ≠ 0 (TWO-SIDED TEST). 2. What is your null distribution? lnOR ~ N(0, .44²); SD(lnOR) = √(1/20 + 1/60 + 1/10 + 1/40) ≈ .44. 3. Empirical evidence: OR = (20 × 40)/(60 × 10) = 800/600 = 1.33; lnOR = .288. 4. Z = (.288 − 0)/.44 = .65; p-value = P(Z > .65 or Z < -.65) = .26 × 2 = .52 (two-sided: it would be just as extreme if the sample lnOR were .65 standard deviations or more below the null mean). 5. Not enough evidence to reject the null hypothesis of no association.
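The same arithmetic in SAS (a minimal sketch):

data _null_;
  or   = (20 * 40) / (60 * 10);
  lnor = log(or);
  se   = sqrt(1/20 + 1/60 + 1/10 + 1/40);  * standard error of lnOR;
  z    = lnor / se;
  pval = 2 * (1 - probnorm(abs(z)));       * two-sided p-value, approx .52;
  put or= lnor= se= z= pval=;
run;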

602 Key measures of relative risk: 95% CIs OR and RR:
For an odds ratio, 95% confidence limits: exp( lnOR ± 1.96 × √(1/a + 1/b + 1/c + 1/d) )
For a risk ratio, 95% confidence limits: exp( lnRR ± 1.96 × √( (1−p1)/(n1p1) + (1−p2)/(n2p2) ) )

603 Continuous outcome (means)
Outcome variable: Continuous (e.g. pain scale, cognitive function)
Independent observations: Ttest (compares means between two independent groups); ANOVA (compares means between more than two independent groups); Pearson's correlation coefficient (linear correlation; shows linear correlation between two continuous variables); Linear regression (multivariate regression technique used when the outcome is continuous; gives slopes)
Correlated observations: Paired ttest (compares means between two related groups, e.g., the same subjects before and after); Repeated-measures ANOVA (compares changes over time in the means of two or more groups, repeated measurements); Mixed models/GEE modeling (multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time)
Alternatives if the normality assumption is violated (and small sample size), non-parametric statistics: Wilcoxon sign-rank test (non-parametric alternative to the paired ttest); Wilcoxon sum-rank test (= Mann-Whitney U test; non-parametric alternative to the ttest); Kruskal-Wallis test (non-parametric alternative to ANOVA); Spearman rank correlation coefficient (non-parametric alternative to Pearson's correlation coefficient)

604 The two-sample t-test

605 The two-sample T-test Is the difference in means that we observe between two groups more than we'd expect to see based on chance alone?

606 The standard error of the difference of two means
SE(x̄ − ȳ) = √(σx²/n + σy²/m). **First add the variances and then take the square root of the sum to get the standard error. Recall, Var(A − B) = Var(A) + Var(B) if A and B are independent!

607 Shown by simulation: One sample of 30 (with SD=5).
Difference of the two samples.

608 Distribution of differences
If X̄ and Ȳ are the averages of n and m subjects, respectively: X̄ − Ȳ ~ N( μx − μy, σx²/n + σy²/m )

609 But… As before, you usually have to use the sample SD, since you won’t know the true SD ahead of time… So, again becomes a T- distribution... 609

610 Estimated standard error of the difference….
Just plug in the sample standard deviations for each group: estimated SE = √(s_x²/n + s_y²/m)

611 Case 1: un-pooled variance
Question: What are your degrees of freedom here? Answer: Not obvious!

612 Case 1: ttest, unpooled variances
It is complicated to figure out the degrees of freedom here! A good approximation is given as df ≈ harmonic mean (or SAS will tell you!):
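The usual approximation here is the Satterthwaite formula (this is what SAS's proc ttest reports for the unequal-variance case):

\[ df \approx \frac{\left(\frac{s_x^2}{n} + \frac{s_y^2}{m}\right)^2}{\dfrac{(s_x^2/n)^2}{n-1} + \dfrac{(s_y^2/m)^2}{m-1}} \]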

613 Case 2: pooled variance If you assume that the standard deviation of the characteristic (e.g., IQ) is the same in both groups, you can pool all the data to estimate a common standard deviation. This maximizes your degrees of freedom (and thus your power).

614 Estimated standard error (using pooled variance estimate)
The degrees of freedom are n + m − 2
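Written out, the pooled estimate combines the two sample variances, weighted by their degrees of freedom:

\[ s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}, \qquad SE(\bar{x}-\bar{y}) = s_p\sqrt{\frac{1}{n} + \frac{1}{m}}, \qquad t_{n+m-2} = \frac{\bar{x}-\bar{y}}{SE} \]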

615 Case 2: ttest, pooled variances

616 Alternate calculation formula: ttest, pooled variance

617 Pooled vs. unpooled variance
Rule of Thumb: Use pooled unless you have a reason not to. Pooled gives you more degrees of freedom. Pooled has an extra assumption: variances are equal between the two groups. SAS automatically tests this assumption for you ("Equality of Variances" test). If p<.05, this suggests unequal variances, and it is better to use the unpooled ttest.

618 Example: two-sample t-test
In 1980, some researchers reported that "men have more mathematical ability than women" as evidenced by the 1979 SAT's, where a sample of 30 random male adolescents had a mean score ± 1 standard deviation of 436±77 and 30 random female adolescents scored lower: 416±81 (genders were similar in educational backgrounds, socio-economic status, and age). Do you agree with the authors' conclusions?

619 Data Summary
Group          | n  | Sample Mean | Sample Standard Deviation
Group 1: women | 30 | 416         | 81
Group 2: men   | 30 | 436         | 77

620 Two-sample t-test 1. Define your hypotheses (null, alternative)
H0: ♂-♀ math SAT = 0 Ha: ♂-♀ math SAT ≠ 0 [two-sided]

621 Two-sample t-test 2. Specify your null distribution:
F and M have similar standard deviations/variances, so make a "pooled" estimate of variance: s_p² = (29 × 81² + 29 × 77²)/58 = 6245, so s_p ≈ 79 and SE = 79 × √(1/30 + 1/30) ≈ 20.4.

622 Two-sample t-test 3. Observed difference in our experiment = 20 points

623 Two-sample t-test 4. Calculate the p-value of what you observed
data _null_;
  pval = (1 - probt(.98, 58)) * 2;
  put pval;
run;
5. Do not reject null! No evidence that men are better in math ;)
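The t-statistic fed into that call comes from the pooled standard error above:

\[ t_{58} = \frac{436 - 416}{20.4} \approx 0.98, \qquad \text{two-sided } p \approx 0.33 \]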

624 Example 2: Difference in means
Example: Rosenthal, R. and Jacobson, L. (1966) Teachers' expectancies: Determinants of pupils' I.Q. gains. Psychological Reports, 19,

625 The Experiment (note: exact numbers have been altered)
Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n=90). Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as "academic bloomers" (n=18). BUT: the children on the teachers' lists had actually been randomly assigned to the list. At the end of the year, the same I.Q. test was re-administered.

626 Example 2 Statistical question: Do students in the treatment group have more improvement in IQ than students in the control group? What will we actually compare? One-year change in IQ score in the treatment group vs. one-year change in IQ score in the control group.

627 “Academic bloomers” (n=18)
The standard deviation of change scores was 2.0 in both groups. This affects statistical significance… Results: “Academic bloomers” (n=18) Controls (n=72) Change in IQ score: 12.2 (2.0)  8.2 (2.0) 12.2 points 8.2 points Difference=4 points 627

628 What does a 4-point difference mean?
Before we perform any formal statistical analysis on these data, we already have a lot of information. Look at the basic numbers first; THEN consider statistical significance as a secondary guide.

629 Is the association statistically significant?
This 4-point difference could reflect a true effect or it could be a fluke. The question: is a 4-point difference bigger or smaller than the expected sampling variability?

630 Hypothesis testing Step 1: Assume the null hypothesis. Null hypothesis: There is no difference between "academic bloomers" and normal students (= the difference is 0)

631 Hypothesis Testing Step 2: Predict the sampling variability assuming the null hypothesis is true These predictions can be made by mathematical theory or by computer simulation.

632 Hypothesis Testing Step 2: Predict the sampling variability assuming the null hypothesis is true (math theory): SE(difference) = 2.0 × √(1/18 + 1/72) ≈ 0.52

633 Hypothesis Testing Step 2: Predict the sampling variability assuming the null hypothesis is true (computer simulation): In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability. I used computer simulation to take samples of 18 treated and 72 controls.

634 Computer Simulation Results
Standard error is about 0.52

635 3. Empirical data Observed difference in our experiment = 12.2 − 8.2 = 4.0

636 4. P-value t = 4.0/0.52 ≈ 7.7. A t-curve with 88 df's has slightly wider cut-offs for 95% area (t=1.99) than a normal curve (Z=1.96); either way, p-value < .0001

637 Visually… If we ran this study times, we wouldn't expect to get 1 result as big as a difference of 4 (under the null hypothesis).

638 5. Reject null! Conclusion: I.Q. scores can bias expectancies in the teachers' minds and cause them to unintentionally treat "bright" students differently from those seen as less bright.

639 Confidence interval (more information!!)
95% CI for the difference: 4.0 ± 1.99(.52) = (3.0, 5.0). A t-curve with 88 df's has slightly wider cut-offs for 95% area (t=1.99) than a normal curve (Z=1.96)

640 What if our standard deviation had been higher?
The standard deviation for change scores in treatment and control was 2.0 in each group. What if change scores had been much more variable, say a standard deviation of 10.0 (for both)?

641 Std. dev in change scores = 2.0
Std. dev in change scores = 2.0: standard error is 0.52. Std. dev in change scores = 10.0: standard error is 2.58.

642 With a std. dev. of 10.0… LESS STATISTICAL POWER!
Standard error is 2.58. If we ran this study times, we would expect to get +4.0 or -4.0 12% of the time. P-value = .12

643 Don’t forget: The paired T-test
Did the control group in the previous experiment improve at all during the year? Do not apply a two-sample ttest to answer this question! After − Before yields a single sample of differences… a "within-group" rather than "between-group" comparison…

644 Continuous outcome (means);
Outcome variable: Continuous (e.g. pain scale, cognitive function)
Independent observations: Ttest (compares means between two independent groups); ANOVA (compares means between more than two independent groups); Pearson's correlation coefficient (linear correlation; shows linear correlation between two continuous variables); Linear regression (multivariate regression technique used when the outcome is continuous; gives slopes)
Correlated observations: Paired ttest (compares means between two related groups, e.g., the same subjects before and after); Repeated-measures ANOVA (compares changes over time in the means of two or more groups, repeated measurements); Mixed models/GEE modeling (multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time)
Alternatives if the normality assumption is violated (and small sample size), non-parametric statistics: Wilcoxon sign-rank test (non-parametric alternative to the paired ttest); Wilcoxon sum-rank test (= Mann-Whitney U test; non-parametric alternative to the ttest); Kruskal-Wallis test (non-parametric alternative to ANOVA); Spearman rank correlation coefficient (non-parametric alternative to Pearson's correlation coefficient)

645 Sample Standard Deviation
Data Summary
Group           | n  | Sample Mean | Sample Standard Deviation
Group 1: Change | 72 | +8.2        | 2.0

646 Did the control group in the previous experiment improve at all during the year?
t(71) = 8.2/(2.0/√72) ≈ 35; p-value <.0001

647 Normality assumption of ttest
If the distribution of the trait is normal, fine to use a t-test. But if the underlying distribution is not normal and the sample size is small (rule of thumb: n>30 per group if not too skewed; n>100 if distribution is really skewed), the Central Limit Theorem takes some time to kick in. Cannot use ttest. Note: ttest is very robust against the normality assumption!

648 Alternative tests when normality is violated: Non-parametric tests

649 Continuous outcome (means);
Outcome variable: Continuous (e.g. pain scale, cognitive function)
Independent observations: Ttest (compares means between two independent groups); ANOVA (compares means between more than two independent groups); Pearson's correlation coefficient (linear correlation; shows linear correlation between two continuous variables); Linear regression (multivariate regression technique used when the outcome is continuous; gives slopes)
Correlated observations: Paired ttest (compares means between two related groups, e.g., the same subjects before and after); Repeated-measures ANOVA (compares changes over time in the means of two or more groups, repeated measurements); Mixed models/GEE modeling (multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time)
Alternatives if the normality assumption is violated (and small sample size), non-parametric statistics: Wilcoxon sign-rank test (non-parametric alternative to the paired ttest); Wilcoxon sum-rank test (= Mann-Whitney U test; non-parametric alternative to the ttest); Kruskal-Wallis test (non-parametric alternative to ANOVA); Spearman rank correlation coefficient (non-parametric alternative to Pearson's correlation coefficient)

650 Non-parametric tests t-tests require your outcome variable to be normally distributed (or close enough) for small samples. Non-parametric tests are based on RANKS instead of means and standard deviations (= "population parameters").

651 Example: non-parametric tests
10 dieters following Atkin's diet vs. 10 dieters following Jenny Craig. Hypothetical RESULTS: Atkin's group loses an average of 34.5 lbs. J. Craig group loses an average of 18.5 lbs. Conclusion: Atkin's is better?

652 Example: non-parametric tests
BUT, take a closer look at the individual data… Atkin’s, change in weight (lbs): +4, +3, 0, -3, -4, -5, -11, -14, -15, -300 J. Craig, change in weight (lbs) -8, -10, -12, -16, -18, -20, -21, -24, -26, -30

653 Histogram: Jenny Craig weight change (Percent vs. Weight Change, lbs)

654 Atkin’s Weight Change 30 25 20 P e r c 15 e n t 10 5 -300 -280 -260
-300 -280 -260 -240 -220 -200 -180 -160 -140 -120 -100 -80 -60 -40 -20 20 Weight Change

655 t-test inappropriate…
Comparing the mean weight loss of the two groups is not appropriate here. The distributions do not appear to be normally distributed. Moreover, there is an extreme outlier (this outlier influences the mean a great deal).

656 Wilcoxon rank-sum test
RANK the values, 1 being the least weight loss and 20 being the most weight loss.
Atkin's: +4, +3, 0, -3, -4, -5, -11, -14, -15, -300 → ranks 1, 2, 3, 4, 5, 6, 9, 11, 12, 20
J. Craig: -8, -10, -12, -16, -18, -20, -21, -24, -26, -30 → ranks 7, 8, 10, 13, 14, 15, 16, 17, 18, 19

657 Wilcoxon rank-sum test
Sum of Atkin’s ranks:   =73 Sum of Jenny Craig’s ranks: =137 Jenny Craig clearly ranked higher! P-value *(from computer) = .018 *For details of the statistical test, see appendix of these slides…

658 Binary or categorical outcomes (proportions)
Outcome Variable Are the observations correlated? Alternative to the chi-square test if sparse cells: independent correlated Binary or categorical (e.g. fracture, yes/no) Chi-square test: compares proportions between two or more groups Relative risks: odds ratios or risk ratios Logistic regression: multivariate technique used when outcome is binary; gives multivariate- adjusted odds ratios McNemar’s chi-square test: compares binary outcome between two correlated groups (e.g., before and after) Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data) GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures) Fisher’s exact test: compares proportions between independent groups when there are sparse data (some cells <5). McNemar’s exact test: compares proportions between correlated groups when there are sparse data (some cells <5).

659 Difference in proportions (special case of chi-square test)

660 Null distribution of a difference in proportions
Standard error of a proportion = √(p(1−p)/n), which can be estimated by √(p̂(1−p̂)/n) (still normally distributed). Standard error of the difference of two proportions = √( p(1−p)/n1 + p(1−p)/n2 ). The variance of a difference is the sum of the variances (as with the difference in means). Analogous to pooled variance in the ttest.

661 Null distribution of a difference in proportions
Difference of proportions

662 Difference in proportions test
Difference in proportions test. Null hypothesis: The difference in proportions is 0. Z = (p̂1 − p̂2)/√( p̄(1−p̄)(1/n1 + 1/n2) ) follows a normal distribution, because the binomial can be approximated with the normal. Recall, the variance of a proportion is p(1−p)/n. Use the average (or pooled) proportion p̄ in the standard error formula, because under the null hypothesis, the groups have equal proportions.

663 Recall case-control example:
               Smoker (E) | Non-smoker (~E)
Stroke (D)     15         | 35
No Stroke (~D) 8          | 42
(50 cases, 50 controls)

664 Absolute risk: Difference in proportions exposed
                 Smoker (E)   Non-smoker (~E)   Total
Stroke (D)           15             35            50
No Stroke (~D)        8             42            50

665 Difference in proportions exposed
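Filling in the arithmetic the slide implies, from the table above: proportion exposed among cases = 15/50 = .30; proportion exposed among controls = 8/50 = .16; difference in proportions exposed = .30 - .16 = .14. (With pooled proportion 23/100 = .23, the standard error is sqrt(.23*.77*(1/50 + 1/50)) ≈ .084, giving Z ≈ .14/.084 ≈ 1.7.)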

666 Example 2: Difference in proportions
Research Question: Are antidepressants a risk factor for suicide attempts in children and adolescents? Example modified from: Olfson M, et al. Antidepressant drug therapy and suicide in severely depressed children and adults. Arch Gen Psychiatry. 2006;63.

667 Example 2: Difference in Proportions
Design: Case-control study Methods: Researchers used Medicaid records to compare prescription histories between 263 children and teenagers (6-18 years) who had attempted suicide and 1241 controls who had never attempted suicide (all subjects suffered from depression). Statistical question: Is a history of use of antidepressants more common among cases than controls?

668 Example 2 Statistical question: Is a history of use of antidepressants more common among the suicide-attempt cases than among controls? What will we actually compare? The proportion of cases who used antidepressants in the past vs. the proportion of controls who did.

669 Results
                              No. (%) of cases (n=263)   No. (%) of controls (n=1241)
Any antidepressant drug ever        120 (46%)                   448 (36%)
46% vs. 36%: Difference = 10%

670 Is the association statistically significant?
This 10% difference could reflect a true association or it could be a fluke in this particular sample. The question: is 10% bigger or smaller than the expected sampling variability?

671 Hypothesis testing Step 1: Assume the null hypothesis. Null hypothesis: There is no association between antidepressant use and suicide attempts in the target population (= the difference is 0%)

672 Hypothesis Testing Step 2: Predict the sampling variability assuming the null hypothesis is true

673 Also: Computer Simulation Results
Standard error is about 3.3%
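The simulated value can be checked analytically: under the null, the pooled proportion is (120 + 448)/1504 ≈ .378, so the standard error of the difference is sqrt(.378*.622*(1/263 + 1/1241)) ≈ .033, i.e., about 3.3%.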

674 Hypothesis Testing Step 3: Do an experiment. We observed a difference of 10% between cases and controls.

675 Hypothesis Testing Step 4: Calculate a p-value

676 P-value from our simulation…
When we ran this study 1000 times, we got 1 result as big or bigger than 10%. We also got 3 results as small or smaller than -10%.

677 P-value From our simulation, we estimate the p-value to be:
4/1000 or .004

678 Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis. Here we reject the null. Alternative hypothesis: There is an association between antidepressant use and suicide in the target population.

679 What would a lack of statistical significance mean?
If this study had sampled only 50 cases and 50 controls, the sampling variability would have been much higher—as shown in this computer simulation…

680 263 cases and 1241 controls. 50 cases and 50 controls.
Standard error is about 3.3% with 263 cases and 1241 controls; standard error is about 10% with 50 cases and 50 controls.

681 With only 50 cases and 50 controls…
If we ran this study 1000 times, we would expect to get values of 10% or higher 170 times (or 17% of the time). Standard error is about 10%

682 Two-tailed p-value = 17% * 2 = 34%

683 Practice problem… An August 2003 research article in Developmental and Behavioral Pediatrics reported the following about a sample of UK kids: when given a choice of a non-branded chocolate cereal vs. CoCo Pops, 97% (36) of 37 girls and 71% (27) of 38 boys preferred the CoCo Pops. Is this evidence that girls are more likely to choose brand-named products?

684 Answer. Null says the p's are equal, so estimate the standard error using the overall observed p.
1. Hypotheses: H0: p♂ - p♀ = 0; Ha: p♂ - p♀ ≠ 0 [two-sided]
2. Null distribution of the difference of two proportions: overall p = (36+27)/75 = .84, so SE = sqrt(.84*.16*(1/37 + 1/38)) ≈ .085
3. Observed difference in our experiment = .97 - .71 = .26, so Z ≈ .26/.085 ≈ 3.06
4. Calculate the p-value of what you observed: data _null_; pval=(1-probnorm(3.06))*2; put pval; run; (gives p ≈ .002)
5. The p-value is sufficiently low for us to reject the null; there does appear to be a difference in gender preferences here.

685 Key two-sample Hypothesis Tests…
Test for Ho: μx - μy = 0 (σ² unknown, but roughly equal): t(nx+ny-2) = (xbar - ybar) / sqrt(sp²(1/nx + 1/ny)), where sp² is the pooled variance. Test for Ho: p1 - p2 = 0: Z = (phat1 - phat2) / sqrt(pbar(1-pbar)(1/n1 + 1/n2))

686 Corresponding confidence intervals…
For a difference in means, 2 independent samples (σ²'s unknown but roughly equal): (xbar - ybar) ± t(nx+ny-2, .975) * sqrt(sp²(1/nx + 1/ny)). For a difference in proportions, 2 independent samples: (phat1 - phat2) ± 1.96 * sqrt(phat1(1-phat1)/n1 + phat2(1-phat2)/n2)

687 Appendix: details of rank-sum test…

688 Wilcoxon Rank-sum test

689 Example For example, if team 1 and team 2 (two gymnastic teams) are competing, and the judges rank all the individuals in the competition, how can you tell if team 1 has done significantly better than team 2 or vice versa?

690 Answer Intuition: under the null hypothesis of no difference between the two groups… If n1 = n2, the rank sums T1 and T2 should be equal. But if n1 ≠ n2, then T2 (for the bigger group, n2) should automatically be bigger. But how much bigger under the null? For example, if team 1 has 3 people and team 2 has 10, we could rank all 13 participants from 1 to 13 on individual performance. If team 1 (X) and team 2 don't differ in talent, the ranks ought to be spread evenly among the two groups, e.g., an exactly even distribution if team 1 ranks 3rd, 7th, and 11th.

691 Remember this?
n1(n1+1)/2 = sum of within-group ranks for the smaller group; n2(n2+1)/2 = sum of within-group ranks for the larger group. Take-home point: under the null, the groups' rank sums differ only because the group sizes differ.

692 It turns out that, if the null hypothesis is true, the expected difference between the larger-group rank sum and the smaller-group rank sum (T2 - T1) is exactly equal to the difference between the within-group rank sums: n2(n2+1)/2 - n1(n1+1)/2. (In the gymnastics example: expected T2 - T1 = 70 - 21 = 49 = 55 - 6.)

693 From slides 23 and 24, define new statistics: U1 = n1n2 + n1(n1+1)/2 - T1 and U2 = n1n2 + n2(n2+1)/2 - T2. Here, under the null: U1 = 30 + 6 - 21 = 15, U2 = 30 + 55 - 70 = 15, and U2 + U1 = 30

694 Under the null hypothesis, U1 should equal U2:
The U's should be equal to each other and will equal n1n2/2: U1 + U2 = n1n2. Under the null hypothesis, U1 = U2 = U0, so E(U1 + U2) = 2E(U0) = n1n2 and E(U0) = n1n2/2. So, the test statistic here is not quite the difference in the sum-of-ranks of the 2 groups; it's the smaller observed U value, U0. For small n's, take U0 and get the p-value directly from a U table.

695 For large enough n's (>10 per group), use the normal approximation: Z = (U0 - n1n2/2) / sqrt(n1n2(n1+n2+1)/12)

696 Add observed data to the example…
Example: If the girls on the two gymnastics teams were ranked as follows:
Team 1: 1, 5, 7 → Observed T1 = 13
Team 2: 2, 3, 4, 6, 8, 9, 10, 11, 12, 13 → Observed T2 = 78
Are the teams significantly different? Total sum of ranks = 13*14/2 = 91; n1n2 = 3*10 = 30. Under the null hypothesis: expect U1 - U2 = 0 and U1 + U2 = 30 (each should equal about 15 under the null), so U0 = 15. Observed: U1 = 30 + 6 - 13 = 23, U2 = 30 + 55 - 78 = 7, so U0 = 7. Not quite statistically significant in the U table… p = .1084 * 2 for a two-tailed test (see attached)

697 Example problem 2. A study was done to compare the Atkins Diet (low-carb) vs. Jenny Craig (low-cal, low-fat). The following weight changes were obtained; note they are very skewed because someone lost 100 pounds; the mean loss for Atkins is going to look higher because of the bozo, but does that mean the diet is better overall? Conduct a Mann-Whitney U test to compare ranks.
Atkins: -100, -8, -4, +5, +8, +2
Jenny Craig: -11, -15, -5, +6, -20

698 Answer
Corresponding ranks (lower rank = more weight loss!):
Atkins: 1, 5, 7, 9, 11, 8
Jenny Craig: 4, 3, 6, 10, 2
Sum of ranks for JC = 25 (n=5); sum of ranks for Atkins = 41 (n=6); n1n2 = 5*6 = 30. Under the null hypothesis: expect U1 - U2 = 0 and U1 + U2 = 30, so U0 = 15. Observed: U1 = 30 + 15 - 25 = 20, U2 = 30 + 21 - 41 = 10, so U0 = 10; n1=5, n2=6. Go to the Mann-Whitney chart… p = .2143 * 2 ≈ .43

699 Introduction to sample size and power calculations
How much chance do we have to reject the null hypothesis when the alternative is in fact true? (what’s the probability of detecting a real effect?)

700 Can we quantify how much power we have for given sample sizes?

701 For 5% significance level, one-tail area=2.5%
study 1: 263 cases, 1241 controls. Rejection region: any value >= 6.5 (0 + 3.3*1.96). Null distribution: difference = 0. For a 5% significance level, one-tail area = 2.5% (Zα/2 = 1.96). Power = chance of being in the rejection region if the alternative is true = area to the right of this line (in yellow). Clinically relevant alternative: difference = 10%.

702 study 1: 263 cases, 1241 controls. Rejection region: any value >= 6.5 (0 + 3.3*1.96). Power = chance of being in the rejection region if the alternative is true = area to the right of this line (in yellow). Power here: P(Z > (6.5 - 10)/3.3) = P(Z > -1.06) ≈ 85%.

703 study 1: 50 cases, 50 controls. Critical value = 0 + 10*1.96 = 20; 2.5% one-tail area (Zα/2 = 1.96).
Power closer to 15% now: P(Z > (20 - 10)/10) = P(Z > 1) ≈ .16.

704 Study 2: 18 treated, 72 controls, STD DEV = 2
Critical value = 0 + .52*1.96 ≈ 1 (SE = 2*sqrt(1/18 + 1/72) = .52). Clinically relevant alternative: difference = 4 points. Power is nearly 100%!

705 Study 2: 18 treated, 72 controls, STD DEV = 10
Critical value = 0 + 2.6*1.96 ≈ 5 (SE = 10*sqrt(1/18 + 1/72) = 2.6). Power is about 40%

706 Study 2: 18 treated, 72 controls, effect size = 1.0
Critical value = 0 + .52*1.96 ≈ 1. Power is about 50%. Clinically relevant alternative: difference = 1 point

707 Factors Affecting Power
1. Size of the effect. 2. Standard deviation of the characteristic. 3. Bigger sample size. 4. Significance level desired. It turns out that if you were to go out and sample many, many times, most sample statistics that you could calculate would follow a normal distribution. What are the 2 parameters (from last time) that define any normal distribution? Remember that a normal curve is characterized by two parameters: a mean and a variability (SD). What do you think the mean value of a sample statistic would be? The standard deviation? Remember, standard deviation is the natural variability of the population. Standard error can be the standard error of the mean, or the standard error of the odds ratio, or the standard error of the difference of 2 means, etc.: the standard error of any sample statistic.

708 1. Bigger difference from the null mean
average weight from samples of 100 Null Clinically relevant alternative

709 2. Bigger standard deviation
average weight from samples of 100

710 3. Bigger Sample Size average weight from samples of 100

711 4. Higher significance level
Rejection region. average weight from samples of 100

712 Sample size calculations
Based on these elements, you can write a formal mathematical equation that relates power, sample size, effect size, standard deviation, and significance level… **WE WILL DERIVE THESE FORMULAS FORMALLY SHORTLY**

713 Simple formula for difference in means
n/group = 2σ²(Zpower + Zα/2)² / (difference)². Here Zpower represents the desired power (typically .84 for 80% power); n is the sample size in each group (assumes equal sized groups); σ is the standard deviation of the outcome variable; Zα/2 represents the desired level of statistical significance (typically 1.96); and the effect size is the difference in means.

714 Simple formula for difference in proportions
n/group = 2*pbar(1-pbar)*(Zpower + Zα/2)² / (p1 - p2)². Here Zpower represents the desired power (typically .84 for 80% power); n is the sample size in each group (assumes equal sized groups); pbar(1-pbar) is a measure of variability (similar to standard deviation); Zα/2 represents the desired level of statistical significance (typically 1.96); and the effect size is the difference in proportions.

715 Derivation of sample size formula….

716 Study 2: 18 treated, 72 controls, effect size=1.0
Critical value= 0+.52*1.96=1 Power close to 50%

717 SAMPLE SIZE AND POWER FORMULAS
Critical value = 0 + standard error(difference)*Zα/2. Power = area to the right of Z, where Z = Zα/2 - (clinically relevant difference)/SE(difference)

718 Power = area to the right of Z
Power is the area to the right of Z, or equivalently the area to the left of -Z. Since normal charts give us the area to the left by convention, we need to use -Z to get the correct value. Most textbooks just call this "Zβ"; I'll use the term Zpower to avoid confusion.

719 All-purpose power formula: Power = P(Z ≤ Zpower), where Zpower = (difference)/SE(difference) - Zα/2

720 Derivation of a sample size formula…
Sample size is embedded in the standard error…. For example, for a difference in means with equal groups, SE(difference) = sqrt(2σ²/n).

721 Algebra…

722

723 Sample size formula for difference in means: n/group = 2σ²(Zpower + Zα/2)² / (difference)²

724 Examples. Example 1: You want to calculate how much power you will have to see a difference of 3.0 IQ points between two groups: 30 male doctors and 30 female doctors. If you expect the standard deviation to be about 10 on an IQ test for both groups, then the standard error for the difference will be about sqrt(10²/30 + 10²/30) ≈ 2.57

725 Power formula… Zpower = 3/2.57 - 1.96 = -.79; P(Z ≤ -.79) = .21; only 21% power to see a difference of 3 IQ points.
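The same calculation as a SAS data step (probnorm() is the standard normal CDF, as in the earlier data _null_ example; probit(0.975) returns 1.96):

  data _null_;
    zpower = 3/2.57 - probit(0.975);  * = -0.79;
    power  = probnorm(zpower);        * = 0.21;
    put power=;
  run;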

726 Example 2: How many people would you need to sample in each group to achieve power of 80% (corresponds to Zpower = .84)? n/group = 2(10²)(.84 + 1.96)²/3² = 174/group; 348 altogether
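As a sketch, the same answer can be requested from SAS's power procedure; this is our suggested check, not the slides' code, and PROC POWER uses the t distribution, so it returns a value slightly above the normal-approximation answer of 174:

  proc power;
    twosamplemeans test=diff
      meandiff  = 3     /* difference in IQ points to detect */
      stddev    = 10    /* common standard deviation */
      power     = 0.80
      npergroup = .;    /* solve for n per group */
  run;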

727 Sample Size needed for comparing two proportions:
Example: I am going to run a case-control study to determine if pancreatic cancer is linked to drinking coffee. If I want 80% power to detect a 10% difference in the proportion of coffee drinkers among cases vs. controls (if coffee drinking and pancreatic cancer are linked, we would expect that a higher proportion of cases would be coffee drinkers than controls), how many cases and controls should I sample? About half the population drinks coffee.

728 Derivation of a sample size formula:
The standard error of the difference of two proportions is: sqrt(p1(1-p1)/n1 + p2(1-p2)/n2)

729 Derivation of a sample size formula:
Here, if we assume equal sample sizes and that, under the null hypothesis, the proportion of coffee drinkers is .5 in both cases and controls, then s.e.(diff) = sqrt(.5*.5/n + .5*.5/n) = sqrt(.5/n)

730

731 For 80% power… There is 80% area to the left of a Z-score of .84 on a standard normal curve; therefore, there is 80% area to the right of -.84. n/group = 2(.5)(.5)(.84 + 1.96)²/(.10)² = 392: it would take 392 cases and 392 controls to have 80% power! Total = 784
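Or as one line of SAS arithmetic (plugging into the formula from slide 714):

  data _null_;
    n = 2*(0.5*0.5)*((0.84 + 1.96)**2)/(0.10**2);  * = 392 per group;
    put n=;
  run;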

732 Question 2: How many total cases and controls would I have to sample to get 80% power for the same study, if I sample 2 controls for every case? Ask yourself, what changes here?

733 Different size groups…
Need: 294 cases and 2*294 = 588 controls; 882 total. Note: you get the best power for the lowest total sample size if you keep both groups equal (882 > 784). You would only want to make groups unequal if there was an obvious difference in the cost or ease of collecting data on one group, e.g., cases of pancreatic cancer are rare and take time to find.

734 General sample size formula: n/group = 2*(variability of the outcome)*(Zpower + Zα/2)² / (difference)², where the variability term is σ² for a continuous outcome and pbar(1-pbar) for a binary one

735 General sample size needs when outcome is binary: n/group = 2*pbar(1-pbar)*(Zpower + Zα/2)² / (p1 - p2)²

736 Compare with when the outcome is continuous: n/group = 2σ²(Zpower + Zα/2)² / (difference)²

737 Question How many subjects would we need to sample to have 80% power to detect an average increase in MCAT biology score of 1 point, if the average change without instruction (just due to chance) is plus or minus 3 points (=standard deviation of change)?

738 Standard error here = SD(change)/sqrt(n) = 3/sqrt(n)

739 Where D = change from test 1 to test 2 (the difference).
Therefore, need: n = (9)(.84 + 1.96)²/1² ≈ 70 people total

740 Sample size for paired data: n = σD²(Zpower + Zα/2)² / (difference)², where σD is the standard deviation of the change scores

741 Paired data difference in proportion: sample size:

742 More than two groups: ANOVA and Chi-square

743 First, recent news… RESEARCHERS FOUND A NINE-FOLD INCREASE IN THE RISK OF DEVELOPING PARKINSON'S IN INDIVIDUALS EXPOSED IN THE WORKPLACE TO CERTAIN SOLVENTS…

744 The data… Table 3. Solvent Exposure Frequencies and Adjusted Pairwise Odds Ratios in PD-Discordant Twins, n = 99 pairs.

745 Which statistical test?
Outcome: binary or categorical (e.g., fracture, yes/no).
Independent observations: Chi-square test (compares proportions between two or more groups); Relative risks (odds ratios or risk ratios); Logistic regression (multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios).
Correlated observations: McNemar's chi-square test (compares a binary outcome between correlated groups, e.g., before and after); Conditional logistic regression (multivariate regression technique for a binary outcome when groups are correlated, e.g., matched data); GEE modeling (multivariate regression technique for a binary outcome when groups are correlated, e.g., repeated measures).
Alternatives to the chi-square test if sparse cells: Fisher's exact test (independent groups, some cells <5); McNemar's exact test (correlated groups, some cells <5).

746 Comparing more than two groups…

747 Continuous outcome (means)
Outcome: continuous (e.g., pain scale, cognitive function).
Independent observations: Ttest (compares means between two independent groups); ANOVA (compares means between more than two independent groups); Pearson's correlation coefficient (linear correlation between two continuous variables); Linear regression (multivariate regression technique used when the outcome is continuous; gives slopes).
Correlated observations: Paired ttest (compares means between two related groups, e.g., the same subjects before and after); Repeated-measures ANOVA (compares changes over time in the means of two or more groups, repeated measurements); Mixed models/GEE modeling (multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time).
Non-parametric alternatives if the normality assumption is violated (and small sample size): Wilcoxon signed-rank test (alternative to the paired ttest); Wilcoxon rank-sum test (= Mann-Whitney U test; alternative to the ttest); Kruskal-Wallis test (alternative to ANOVA); Spearman rank correlation coefficient (alternative to Pearson's correlation coefficient).

748 ANOVA example. Mean micronutrient intake from the school lunch by school, mean (SD):
Calcium (mg): S1 117.8 (62.4); S2 158.7 (70.5); S3 206.5 (86.2); P = 0.000
Iron (mg): 2.0 (0.6) …; P = 0.854
Folate (μg): S1 26.6 (13.1); S2 38.7 (14.5); S3 42.6 (15.1); P = …
Zinc (mg): S1 1.9 (1.0); S2 1.5 (1.2); S3 1.3 (0.4); P = 0.055
S1 (n=28): most deprived (40% subsidized lunches). S2 (n=25): medium deprived (<10% subsidized). S3 (n=21): least deprived (no subsidization, private school). P-values from ANOVA; significant differences highlighted in bold (P<0.05). FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England: are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.

749 ANOVA (ANalysis Of VAriance)
Idea: For two or more groups, test difference between means, for quantitative normally distributed variables. Just an extension of the t-test (an ANOVA with only two groups is mathematically equivalent to a t- test).

750 One-Way Analysis of Variance
Assumptions (same as the ttest): normally distributed outcome; equal variances between the groups; groups are independent.

751 Hypotheses of One-Way ANOVA: H0: μ1 = μ2 = … = μk; Ha: the means are not all equal

752 ANOVA It’s like this: If I have three groups to compare:
I could do three pair-wise ttests, but this would increase my type I error. So, instead I want to look at the pairwise differences "all at once." To do this, I can recognize that variance is a statistic that lets me look at more than one difference at a time…

753 The "F-test"
Is the difference in the means of the groups more than background noise (= variability within groups)? F = variability between groups / variability within groups. Summarizes the mean differences between all groups at once. Analogous to pooled variance from a ttest. Recall, we have already used an "F-test" to check for equality of variances: if F >> 1 (indicating unequal variances), use the unpooled variance in a t-test.

754 The F-distribution The F-distribution is a continuous probability distribution that depends on two parameters n and m (numerator and denominator degrees of freedom, respectively):

755 The F-distribution
A ratio of variances follows an F-distribution: s1²/s2² ~ F(n1-1, n2-1) under H0: σ1² = σ2². The F-test tests the hypothesis that two variances are equal; F will be close to 1 if the sample variances are equal.

756 How to calculate ANOVA’s by hand…
Observations yij (i = group, j = observation within group), with n = 10 obs./group and k = 4 groups: Treatment 1: y11…y1,10; Treatment 2: y21…y2,10; Treatment 3: y31…y3,10; Treatment 4: y41…y4,10. From these, compute the group means ybar_i = (1/n)Σj yij and the (within) group variances s_i² = Σj (yij - ybar_i)²/(n-1).

757 Sum of Squares Within (SSW), or Sum of Squares Error (SSE)
Add up the (within) group variability: Sum of Squares Within (SSW) = Σi Σj (yij - ybar_i)² = (n-1)(s1² + s2² + … + sk²) (or SSE, for chance error)

758 Sum of Squares Between (SSB), or Sum of Squares Regression (SSR)
Overall mean of all 40 observations ("grand mean"): ybar. Sum of Squares Between (SSB) = n*Σi (ybar_i - ybar)²: the variability of the group means compared to the grand mean (the variability due to the treatment).

759 Total Sum of Squares (SST)
Total sum of squares (TSS) = Σi Σj (yij - ybar)²: the squared difference of every observation from the overall mean (the numerator of the variance of Y!)

760 Partitioning of Variance
SSW + SSB = TSS

761 ANOVA Table (n individuals per group)
Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value
Between (k groups) | k-1 | SSB (sum of squared deviations of group means from grand mean) | SSB/(k-1) | [SSB/(k-1)] / [SSW/(nk-k)] | Go to F(k-1, nk-k) chart
Within | nk-k | SSW (sum of squared deviations of observations from their group mean) | s² = SSW/(nk-k) | |
Total variation | nk-1 | TSS (sum of squared deviations of observations from grand mean); TSS = SSB + SSW | | |

762 ANOVA = t-test (two groups)
Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value
Between (2 groups) | 1 | SSB (squared difference in means multiplied by n) | squared difference in means times n | SSB / pooled variance | Go to F(1, 2n-2) chart; notice the values are just (t with 2n-2 df)²
Within | 2n-2 | SSW (equivalent to the numerator of the pooled variance) | pooled variance | |
Total variation | 2n-1 | TSS | | |

763 Example. Heights (inches) in four treatment groups, 10 observations per group: 60, 50, 48, 47, 67, 52, 49, 42, 43, 54, 55, 56, 68, 62, 59, 61, 65, 64, 60, 72, 63, 71, … (Treatment 1's column, used in Step 2 below: 60, 67, 42, 67, 56, 62, 64, 59, 72, 71.)

764 Example. Step 1) Calculate the sum of squares between groups:
Mean for group 1 = 62.0; group 2 = 59.7; group 3 = 56.3; group 4 = 61.4. Grand mean = 59.85.
SSB = [(62.0-59.85)² + (59.7-59.85)² + (56.3-59.85)² + (61.4-59.85)²] * n per group = 19.65 * 10 = 196.5

765 Example. Step 2) Calculate the sum of squares within groups:
(60-62)² + (67-62)² + (42-62)² + (67-62)² + (56-62)² + (62-62)² + (64-62)² + (59-62)² + (72-62)² + (71-62)² + … (continuing with groups 2-4, a sum of 40 squared deviations) = 2060.6

766 Step 3) Fill in the ANOVA table:
Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value
Between | 3 | 196.5 | 65.5 | 1.14 | .344
Within | 36 | 2060.6 | 57.2 | |
Total | 39 | 2257.1 | | |

767 Step 3) Fill in the ANOVA table:
Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value
Between | 3 | 196.5 | 65.5 | 1.14 | .344
Within | 36 | 2060.6 | 57.2 | |
Total | 39 | 2257.1 | | |
INTERPRETATION of ANOVA: How much of the variance in height is explained by treatment group? R² = "Coefficient of Determination" = SSB/TSS = 196.5/2257.1 = 9%
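A sketch of the same one-way ANOVA in SAS (dataset and variable names invented for illustration):

  proc glm data=heights;
    class treatment;           * treatment = group indicator, 1-4;
    model height = treatment;  * produces the ANOVA table above;
  run;
  quit;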

768 Coefficient of Determination
The amount of variation in the outcome variable (dependent variable) that is explained by the predictor (independent variable).

769 Beyond one-way ANOVA Often, you may want to test more than 1 treatment. ANOVA can accommodate more than 1 treatment or factor, so long as they are independent. Again, the variation partitions beautifully! TSS = SSB1 + SSB2 + SSW

770 ANOVA example. Table 6. Mean micronutrient intake from the school lunch by school, mean (SD), n=25 per school:
Calcium (mg): S1 117.8 (62.4); S2 158.7 (70.5); S3 206.5 (86.2); P = 0.000
Iron (mg): 2.0 (0.6) …; P = 0.854
Folate (μg): S1 26.6 (13.1); S2 38.7 (14.5); S3 42.6 (15.1); P = …
Zinc (mg): S1 1.9 (1.0); S2 1.5 (1.2); S3 1.3 (0.4); P = 0.055
S1: most deprived (40% subsidized lunches). S2: medium deprived (<10% subsidized). S3: least deprived (no subsidization, private school). P-values from ANOVA; significant differences highlighted in bold (P<0.05). FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England: are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.

771 Answer Step 1) calculate the sum of squares between groups:
Mean for School 1 = 117.8; School 2 = 158.7; School 3 = 206.5. Grand mean: 161. SSB = [(117.8-161)² + (158.7-161)² + (206.5-161)²] * 25 per group = 98,113

772 Answer Step 2) calculate the sum of squares within groups:
S.D. for S1 = 62.4; S2 = 70.5; S3 = 86.2. Therefore, the sum of squares within is: (24)[62.4² + 70.5² + 86.2²] = 391,066

773 Answer. Step 3) Fill in your ANOVA table:
Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value
Between | 2 | 98,113 | 49,056 | 9 | <.05
Within | 72 | 391,066 | 5,431 | |
Total | 74 | 489,179 | | |
**R² = 98,113/489,179 = 20%: school explains 20% of the variance in lunchtime calcium intake in these kids.

774 ANOVA summary A statistically significant ANOVA (F-test) only tells you that at least two of the groups differ, but not which ones differ. Determining which groups differ (when it’s unclear) requires more sophisticated analyses to correct for the problem of multiple comparisons…

775 Question: Why not just do 3 pairwise ttests?
Answer: because, at an error rate of 5% per test, you have an overall chance of up to 1-(.95)^3 = 14% of making a type-I error (if all 3 comparisons were independent). → If you wanted to compare 6 groups, you'd have to do 6C2 = 15 pairwise ttests, which would give you a high chance of finding something significant just by chance (if all tests were independent with a type-I error rate of 5% each); probability of at least one type-I error = 1-(.95)^15 = 54%.

776 Recall: Multiple comparisons

777 Correction for multiple comparisons
How to correct for multiple comparisons post-hoc: Bonferroni correction (adjusts by the most conservative amount; assuming all tests are independent, divide the α cut-off by the number of tests); Tukey (adjusts p); Scheffé (adjusts p); Holm/Hochberg (give a p-cutoff beyond which results are not significant).

778 Procedures for Post Hoc Comparisons
If your ANOVA test identifies a difference between group means, then you must identify which of your k groups differ. If you did not specify the comparisons of interest ("contrasts") ahead of time, then you have to pay a price for making all kC2 pairwise comparisons to keep the overall type-I error rate to α. Alternately, run a limited number of planned comparisons (making only those comparisons that are most important to your research question); this limits the number of tests you make.

779 1. Bonferroni
For example, to make a Bonferroni correction, divide your desired alpha cut-off level (usually .05) by the number of comparisons you are making. Assumes complete independence between comparisons, which is way too conservative.
Obtained P-value | Original Alpha | # tests | New Alpha | Significant?
.001 | .05 | 5 | .010 | Yes
.011 | .05 | 4 | .013 | Yes
.019 | .05 | 3 | .017 | No
.032 | .05 | 2 | .025 | No
.048 | .05 | 1 | .050 | Yes

780 2/3. Tukey and Scheffé. Both methods increase your p-values to account for the fact that you've done multiple comparisons, but are less conservative than Bonferroni (let the computer calculate for you!). SAS options in PROC GLM: adjust=tukey, adjust=scheffe.
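For example, extending the hypothetical PROC GLM sketch from the ANOVA example, the adjusted pairwise comparisons can be requested on an LSMEANS statement:

  proc glm data=heights;
    class treatment;
    model height = treatment;
    lsmeans treatment / pdiff adjust=tukey;  * Tukey-adjusted pairwise p-values;
  run;
  quit;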

781 4/5. Holm and Hochberg. Arrange all the resulting p-values (from the T = kC2 pairwise comparisons) in order from smallest (most significant) to largest: p1 to pT.

782 Holm Start with p1, and compare to Bonferroni p (=α/T).
1. If p1 < α/T, then p1 is significant; continue to step 2. If not, then we have no significant p-values and stop here.
2. If p2 < α/(T-1), then p2 is significant; continue to step 3. If not, then p2 thru pT are not significant and stop here.
3. If p3 < α/(T-2), then p3 is significant; continue to step 4. If not, then p3 thru pT are not significant and stop here.
Repeat the pattern…

783 Hochberg
1. Start with the largest (least significant) p-value, pT, and compare to α. If it's significant, so are all the remaining p-values; stop here. If it's not significant, go to step 2.
2. If pT-1 < α/2, then pT-1 is significant, as are all remaining smaller p-values; stop here. If not, then pT-1 is not significant; go to step 3.
Repeat the pattern…
Note: Holm and Hochberg should give you the same results. Use Holm if you anticipate few significant comparisons; use Hochberg if you anticipate many significant comparisons.

784 Practice Problem
A large randomized trial compared an experimental drug and 9 other standard drugs for treating motion sickness. An ANOVA test revealed significant differences between the groups. The investigators wanted to know if the experimental drug ("drug 1") beat any of the standard drugs in reducing total minutes of nausea, and, if so, which ones. The p-values from the pairwise ttests (comparing drug 1 with drugs 2-10) are below.
a. Which differences would be considered statistically significant using a Bonferroni correction? A Holm correction? A Hochberg correction?
Drug 1 vs. drug: 2 (.05), 3 (.3), 4 (.25), 5 (.04), 6 (.001), 7 (.006), 8 (.08), 9 (.002), 10 (.01)

785 Answer
Bonferroni makes the new α value = α/9 = .05/9 = .0056; therefore, using Bonferroni, the new drug is only significantly different from standard drugs 6 and 9.
Arrange the p-values in order: drug 6 (.001), 9 (.002), 7 (.006), 10 (.01), 5 (.04), 2 (.05), 8 (.08), 4 (.25), 3 (.3).
Holm: .001 < .0056; .002 < .05/8 = .00625; .006 < .05/7 = .007; .01 > .05/6 = .0083; therefore, the new drug is only significantly different from standard drugs 6, 9, and 7.
Hochberg: .3 > .05; .25 > .05/2; .08 > .05/3; .05 > .05/4; .04 > .05/5; .01 > .05/6; .006 < .05/7; therefore, drugs 7, 9, and 6 are significantly different.

786 Practice problem b. Your patient is taking one of the standard drugs that was shown to be statistically less effective in minimizing motion sickness (i.e., significant p-value for the comparison with the experimental drug). Assuming that none of these drugs have side effects but that the experimental drug is slightly more costly than your patient’s current drug-of-choice, what (if any) other information would you want to know before you start recommending that patients switch to the new drug?

787 Answer The magnitude of the reduction in minutes of nausea.
If large enough sample size, a 1-minute difference could be statistically significant, but it’s obviously not clinically meaningful and you probably wouldn’t recommend a switch.

788 Continuous outcome (means)
Outcome: continuous (e.g., pain scale, cognitive function).
Independent observations: Ttest (compares means between two independent groups); ANOVA (compares means between more than two independent groups); Pearson's correlation coefficient (linear correlation between two continuous variables); Linear regression (multivariate regression technique used when the outcome is continuous; gives slopes).
Correlated observations: Paired ttest (compares means between two related groups, e.g., the same subjects before and after); Repeated-measures ANOVA (compares changes over time in the means of two or more groups, repeated measurements); Mixed models/GEE modeling (multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time).
Non-parametric alternatives if the normality assumption is violated (and small sample size): Wilcoxon signed-rank test (alternative to the paired ttest); Wilcoxon rank-sum test (= Mann-Whitney U test; alternative to the ttest); Kruskal-Wallis test (alternative to ANOVA); Spearman rank correlation coefficient (alternative to Pearson's correlation coefficient).

789 Non-parametric ANOVA
Kruskal-Wallis one-way ANOVA (just an extension of the Wilcoxon rank-sum (Mann-Whitney U) test to more than 2 groups; based on ranks): Proc NPAR1WAY in SAS.

790 Binary or categorical outcomes (proportions)
Outcome: binary or categorical (e.g., fracture, yes/no).
Independent observations: Chi-square test (compares proportions between two or more groups); Relative risks (odds ratios or risk ratios); Logistic regression (multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios).
Correlated observations: McNemar's chi-square test (compares a binary outcome between correlated groups, e.g., before and after); Conditional logistic regression (multivariate regression technique for a binary outcome when groups are correlated, e.g., matched data); GEE modeling (multivariate regression technique for a binary outcome when groups are correlated, e.g., repeated measures).
Alternatives to the chi-square test if sparse cells: Fisher's exact test (independent groups, some cells <5); McNemar's exact test (correlated groups, some cells <5).

791 Chi-square test for comparing proportions (of a categorical variable) between >2 groups
I. Chi-Square Test of Independence When both your predictor and outcome variables are categorical, they may be cross-classified in a contingency table and compared using a chi-square test of independence. A contingency table with R rows and C columns is an R x C contingency table.

792 Example: Asch, S.E. (1955). Opinions and social pressure. Scientific American, 193.

793 The Experiment A Subject volunteers to participate in a “visual perception study.” Everyone else in the room is actually a conspirator in the study (unbeknownst to the Subject). The “experimenter” reveals a pair of cards…

794 The Task Cards Standard line Comparison lines A, B, and C

795 The Experiment Everyone goes around the room and says which comparison line (A, B, or C) is correct; the true Subject always answers last – after hearing all the others’ answers. The first few times, the 7 “conspirators” give the correct answer. Then, they start purposely giving the (obviously) wrong answer. 75% of Subjects tested went along with the group’s consensus at least once.

796 Further Results In a further experiment, group size (number of conspirators) was altered from 2-10. Does the group size alter the proportion of subjects who conform?

797 Number of group members?
The Chi-Square test
Conformed? | Group size 2 | 4 | 6 | 8 | 10
Yes | 20 | 50 | 75 | 60 | 30
No | 80 | 50 | 25 | 40 | 70
Apparently, conformity is less likely with fewer or with more group members…

798 20 + 50 + 75 + 60 + 30 = 235 conformed out of 500 experiments. Overall likelihood of conforming = 235/500 = .47

799 Calculating the expected, in general
Null hypothesis: variables are independent Recall that under independence: P(A)*P(B)=P(A&B) Therefore, calculate the marginal probability of B and the marginal probability of A. Multiply P(A)*P(B)*N to get the expected cell count.
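Worked out for the conformity data: expected cell count = P(row)*P(column)*N = (row total * column total)/N, so the expected "Yes" count in each group-size column is (235*100)/500 = 47.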

800 Number of group members?
Expected frequencies if no association between group size and conformity:
Conformed? | Group size 2 | 4 | 6 | 8 | 10
Yes | 47 | 47 | 47 | 47 | 47
No | 53 | 53 | 53 | 53 | 53

801 Do observed and expected differ more than expected due to chance?

802 Chi-Square test: χ² = Σ (observed - expected)²/expected, summed over all cells. Degrees of freedom = (rows-1)*(columns-1) = (2-1)*(5-1) = 4

803 The Chi-Square distribution: the sum of squared normal deviates
The expected value and variance of a chi-square: E(x)=df Var(x)=2(df)

804 Chi-Square test. Degrees of freedom = (rows-1)*(columns-1) = (2-1)*(5-1) = 4. Rule of thumb: if the chi-square statistic is much greater than its degrees of freedom, this indicates statistical significance. Here 85 >> 4.
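If you want an exact p-value rather than the rule of thumb, a one-line SAS check (probchi() is the chi-square CDF; 85 is the slide's statistic):

  data _null_;
    pval = 1 - probchi(85, 4);  * chi-square statistic 85 with 4 df;
    put pval=;
  run;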

805 Chi-square example: recall data…
                        Brain tumor   No brain tumor   Total
Own a cell phone              5             347          352
Don't own a cell phone        3              88           91
Total                         8             435          453

806 Same data, but use Chi-square test
            Brain tumor   No brain tumor   Total
Own               5             347          352
Don't own         3              88           91
Total             8             435          453
Expected value in cell c = 1.7, so technically we should use a Fisher's exact test here! Next term…
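A sketch of this analysis in SAS, entering the 2x2 table as cell counts (dataset and variable names invented; the EXACT statement requests Fisher's exact test, as the next slide's caveat recommends for sparse cells):

  data phones;
    input own $ tumor $ count;
    datalines;
  yes yes   5
  yes no  347
  no  yes   3
  no  no   88
  ;
  proc freq data=phones;
    weight count;                       * each row carries a cell count;
    tables own*tumor / chisq expected;  * chi-square test plus expected counts;
    exact fisher;                       * Fisher's exact test;
  run;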

807 Caveat **When the sample size is very small in any cell (expected value<5), Fisher’s exact test is used as an alternative to the chi-square test.

808 Binary or categorical outcomes (proportions)
Outcome: binary or categorical (e.g., fracture, yes/no).
Independent observations: Chi-square test (compares proportions between two or more groups); Relative risks (odds ratios or risk ratios); Logistic regression (multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios).
Correlated observations: McNemar's chi-square test (compares a binary outcome between correlated groups, e.g., before and after); Conditional logistic regression (multivariate regression technique for a binary outcome when groups are correlated, e.g., matched data); GEE modeling (multivariate regression technique for a binary outcome when groups are correlated, e.g., repeated measures).
Alternatives to the chi-square test if sparse cells: Fisher's exact test (independent groups, some cells <5); McNemar's exact test (correlated groups, some cells <5).

809 Linear correlation and linear regression

810 Continuous outcome (means)
Outcome: continuous (e.g., pain scale, cognitive function).
Independent observations: Ttest (compares means between two independent groups); ANOVA (compares means between more than two independent groups); Pearson's correlation coefficient (linear correlation between two continuous variables); Linear regression (multivariate regression technique used when the outcome is continuous; gives slopes).
Correlated observations: Paired ttest (compares means between two related groups, e.g., the same subjects before and after); Repeated-measures ANOVA (compares changes over time in the means of two or more groups, repeated measurements); Mixed models/GEE modeling (multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time).
Non-parametric alternatives if the normality assumption is violated (and small sample size): Wilcoxon signed-rank test (alternative to the paired ttest); Wilcoxon rank-sum test (= Mann-Whitney U test; alternative to the ttest); Kruskal-Wallis test (alternative to ANOVA); Spearman rank correlation coefficient (alternative to Pearson's correlation coefficient).

811 Recall: Covariance: cov(X,Y) = Σ (xi - xbar)(yi - ybar) / (n-1)

812 Interpreting Covariance
cov(X,Y) > 0: X and Y are positively correlated. cov(X,Y) < 0: X and Y are inversely correlated. cov(X,Y) = 0: X and Y are uncorrelated (independence implies zero covariance, though zero covariance alone does not guarantee independence).

813 Correlation coefficient
Pearson’s Correlation Coefficient is standardized covariance (unitless):

814 Correlation: measures the relative strength of the linear relationship between two variables. Unit-less. Ranges between -1 and 1. The closer to -1, the stronger the negative linear relationship; the closer to 1, the stronger the positive linear relationship; the closer to 0, the weaker any linear relationship.

815 Scatter Plots of Data with Various Correlation Coefficients
[Six scatter plots, top row: r = -1, r = -.6, r = 0; bottom row: r = +1, r = +.3, r = 0] Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

816 Linear Correlation: linear relationships vs. curvilinear relationships [four scatter plots]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

817 Linear Correlation: strong relationships vs. weak relationships [four scatter plots]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

818 Linear Correlation: no relationship [two scatter plots]
Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

819 Calculating by hand: r = Σ(xi - xbar)(yi - ybar) / sqrt(Σ(xi - xbar)² * Σ(yi - ybar)²)

820 Simpler calculation formula…
r = SSxy / sqrt(SSxx * SSyy), where SSxy is the numerator of the covariance and SSxx, SSyy are the numerators of the variances.

821 Distribution of the correlation coefficient:
The sample correlation coefficient follows a T-distribution with n-2 degrees of freedom (since you have to estimate the standard error): t(n-2) = r * sqrt((n-2)/(1-r²)). *Note: like a proportion, the variance of the correlation coefficient depends on the correlation coefficient itself; substitute in the estimated r.

822 Continuous outcome (means)
Outcome: continuous (e.g., pain scale, cognitive function).
Independent observations: Ttest (compares means between two independent groups); ANOVA (compares means between more than two independent groups); Pearson's correlation coefficient (linear correlation between two continuous variables); Linear regression (multivariate regression technique used when the outcome is continuous; gives slopes).
Correlated observations: Paired ttest (compares means between two related groups, e.g., the same subjects before and after); Repeated-measures ANOVA (compares changes over time in the means of two or more groups, repeated measurements); Mixed models/GEE modeling (multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time).
Non-parametric alternatives if the normality assumption is violated (and small sample size): Wilcoxon signed-rank test (alternative to the paired ttest); Wilcoxon rank-sum test (= Mann-Whitney U test; alternative to the ttest); Kruskal-Wallis test (alternative to ANOVA); Spearman rank correlation coefficient (alternative to Pearson's correlation coefficient).

823 Linear regression In correlation, the two variables are treated as equals. In regression, one variable is considered independent (=predictor) variable (X) and the other the dependent (=outcome) variable Y.

824 What is "Linear"? Remember this: Y = mX + B? (m = slope; B = intercept)

825 What’s Slope? A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

826 Prediction If you know something about X, this knowledge helps you predict something about Y. (Sound familiar?…sound like conditional probabilities?)

827 Regression equation… Expected value of y at a given level of x: E(y|x) = α + βx

828 Predicted value for an individual…
yi = α + β*xi + random errori. The error follows a normal distribution; α + β*xi is fixed: exactly on the line.

829 Assumptions (or the fine print)
Linear regression assumes that… 1. The relationship between X and Y is linear 2. Y is distributed normally at each value of X 3. The variance of Y at every value of X is the same (homogeneity of variances) 4. The observations are independent

830 The standard error of Y given X (Sy|x) is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.

831 Regression Picture
A² = SStotal: total squared distance of observations from the naïve mean of y (total variation). B² = SSreg: distance from the regression line to the naïve mean of y (variability due to x, the regression). C² = SSresidual: variance around the regression line (additional variability not explained by x; what the least squares method aims to minimize). *Least squares estimation gave us the line (β) that minimized C². R² = SSreg/SStotal.

832 Recall example: cognitive function and vitamin D
Hypothetical data loosely based on [1]; cross-sectional study of 100 middle-aged and older European men. Cognitive function is measured by the Digit Symbol Substitution Test (DSST). 1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.

833 Distribution of vitamin D
Mean = 63 nmol/L; standard deviation = 33 nmol/L

834 Distribution of DSST: normally distributed. Mean = 28 points; standard deviation = 10 points.

835 Four hypothetical datasets
I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST): 0 points per 10 nmol/L; 0.5 points per 10 nmol/L; 1.0 points per 10 nmol/L; 1.5 points per 10 nmol/L

836 Dataset 1: no relationship

837 Dataset 2: weak relationship

838 Dataset 3: weak to moderate relationship

839 Dataset 4: moderate relationship

840 The “Best fit” line Regression equation:
E(Yi) = 28 + 0*vit Di (in 10 nmol/L)

841 The "Best fit" line. Note how the line is a little deceptive; it draws your eye, making the relationship appear stronger than it really is! Regression equation: E(Yi) = 24.9 + 0.5*vit Di (in 10 nmol/L)

842 The “Best fit” line Regression equation:
E(Yi) = 21.7 + 1.0*vit Di (in 10 nmol/L)

843 The “Best fit” line Regression equation:
E(Yi) = 18.6 + 1.5*vit Di (in 10 nmol/L). Note: all the lines go through the point (63, 28)!

844 Estimating the intercept and slope: least squares estimation
A little calculus…. What are we trying to estimate? β, the slope, from y = α + βx. What's the constraint? We are trying to minimize the squared distance (hence the "least squares") between the observations themselves and the predicted values (also called the "residuals", or left-over unexplained variability): Differencei = yi - (βxi + α); Differencei² = (yi - (βxi + α))². Find the β that gives the minimum sum of the squared differences. How do you find the minimum of a function? Take the derivative, set it equal to zero, and solve. A typical max/min problem from calculus…. From here it takes a little math trickery to solve for β…

845 Resulting formulas…
Slope (beta coefficient): betahat = cov(x,y)/var(x) = SSxy/SSxx. Intercept: alphahat = ybar - betahat*xbar.
The regression line always goes through the point (xbar, ybar).

846 Relationship with correlation
In correlation, the two variables are treated as equals; in regression, one variable is considered the independent (=predictor) variable (X) and the other the dependent (=outcome) variable (Y). The two are linked: rhat = betahat * (SDx/SDy).

847 Example: dataset 4 SDx = 33 nmol/L SDy= 10 points
Cov(X,Y) = 163 points*nmol/L. Beta = 163/33² = 0.15 points per nmol/L = 1.5 points per 10 nmol/L. r = 163/(10*33) = 0.49. Or r = 0.15 * (33/10) = 0.49

848 Significance testing…
Slope. Distribution of the slope: betahat ~ T(n-2)(β, s.e.(betahat)). H0: β1 = 0 (no linear relationship); H1: β1 ≠ 0 (linear relationship does exist). T(n-2) = betahat / s.e.(betahat)

849 Formula for the standard error of beta (you will not have to calculate this by hand!): s.e.(betahat) = sqrt(s²y|x / SSxx), where s²y|x = SSresidual/(n-2)

850 Example: dataset 4 Standard error (beta) = 0.03
T98 = 0.15/0.03 = 5, p<.0001 95% Confidence interval = 0.09 to 0.21

851 Residual Analysis: check assumptions
The residual for observation i, ei = yi - yhati, is the difference between its observed and predicted value. Check the assumptions of regression by examining the residuals: examine for the linearity assumption; examine for constant variance at all levels of X (homoscedasticity); evaluate the normal distribution assumption; evaluate the independence assumption. Graphical analysis of residuals: can plot residuals vs. X.

852 Predicted values… For vitamin D = 95 nmol/L (or 9.5 in 10 nmol/L): yhat = 18.6 + 1.5*9.5 ≈ 33 points

853 Residual = observed - predicted
At X = 95 nmol/L, an observed DSST of 34 gives residual = observed - predicted ≈ 34 - 33 = 1.

854 Residual Analysis for Linearity [panels: Y vs. x and residual plots; left pair: Not Linear; right pair: Linear]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

855 Residual Analysis for Homoscedasticity [panels: Y vs. x and residual plots; left pair: Constant variance; right pair: Non-constant variance]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

856 Residual Analysis for Independence
[Residual-vs-X panels: Not Independent vs. Independent]
Slide from: Statistics for Managers Using Microsoft® Excel, 4th Edition, 2004, Prentice-Hall

857 Residual plot, dataset 4

858 Multiple linear regression…
What if age is a confounder here? Older men have lower vitamin D; older men have poorer cognition. "Adjust" for age by putting age in the model: DSST score = intercept + slope1*vitamin D + slope2*age

859 2 predictors: age and vit D…

860 Different 3D view…

861 Fit a plane rather than a line…
On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.

862 Equation of the “Best fit” plane…
DSST score = intercept + β1*vitamin D (in 10 nmol/L) + β2*age (in years). P-value for vitamin D >> .05; P-value for age < .0001. Thus, the relationship with vitamin D was due to confounding by age!
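A sketch of fitting this plane in SAS (dataset and variable names invented for illustration):

  proc reg data=cogstudy;
    model dsst = vitd age;  * DSST score on vitamin D and age;
  run;
  quit;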

863 Multiple Linear Regression
More than one predictor… E(y) = α + β1*X + β2*W + β3*Z… Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.

864 Functions of multivariate analysis:
Control for confounders Test for interactions between predictors (effect modification) Improve predictions

865 A ttest is linear regression!
Divide vitamin D into two groups: Insufficient vitamin D (<50 nmol/L) Sufficient vitamin D (>=50 nmol/L), reference group We can evaluate these data with a ttest or a linear regression…

866 As a linear regression…
Intercept represents the mean value in the sufficient group. Slope represents the difference in means between the groups. Difference is significant.
Variable | Parameter Estimate | Standard Error | t Value | Pr > |t|
Intercept | … | … | … | <.0001
insuff | … | … | … | …

867 ANOVA is linear regression!
Divide vitamin D into three groups: Deficient (<25 nmol/L); Insufficient (>=25 and <50 nmol/L); Sufficient (>=50 nmol/L), the reference group. DSST = α (= value for sufficient) + β1*(1 if insufficient) + β2*(1 if deficient). This is called "dummy coding", where multiple binary variables are created to represent being in each category (or not) of a categorical variable.

868 The picture… Sufficient vs. Insufficient Sufficient vs. Deficient

869 Results… Interpretation:
Parameter Estimates
Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|
Intercept | … | … | … | … | <.0001
deficient | … | … | … | … | …
insufficient | … | … | … | … | …
Interpretation: The deficient group has a mean DSST … points lower than the reference (sufficient) group. The insufficient group has a mean DSST … points lower than the reference (sufficient) group.

870 Other types of multivariate regression
Multiple linear regression is for normally distributed outcomes Logistic regression is for binary outcomes Cox proportional hazards regression is used when time-to-event is the outcome

871 Common multivariate regression models.
Continuous outcome (e.g., blood pressure): linear regression. Example equation: blood pressure (mmHg) = α + βsalt*salt consumption (tsp/day) + βage*age (years) + βsmoker*ever smoker (yes=1/no=0). Coefficients give slopes: how much the outcome variable increases for every 1-unit increase in each predictor.
Binary outcome (e.g., high blood pressure, yes/no): logistic regression. Example equation: ln(odds of high blood pressure) = α + βsalt*salt consumption + βage*age + βsmoker*ever smoker. Coefficients give odds ratios: how much the odds of the outcome increase for every 1-unit increase in each predictor.
Time-to-event outcome (e.g., time-to-death): Cox regression. Example equation: ln(rate of death) = α + βsalt*salt consumption + βage*age + βsmoker*ever smoker. Coefficients give hazard ratios: how much the rate of the outcome increases for every 1-unit increase in each predictor.

872 Multivariate regression pitfalls
Multi-collinearity Residual confounding Overfitting

873 Multicollinearity Multicollinearity arises when two variables that measure the same thing or similar things (e.g., weight and BMI) are both included in a multiple regression model; they will, in effect, cancel each other out and generally destroy your model.   Model building and diagnostics are tricky business!

874 Residual confounding You cannot completely wipe out confounding simply by adjusting for variables in multiple regression unless variables are measured with zero error (which is usually impossible). Example: meat eating and mortality

875 Men who eat a lot of meat are unhealthier for many reasons!
Sinha R, Cross AJ, Graubard BI, Leitzmann MF, Schatzkin A. Meat intake and mortality: a prospective study of over half a million people. Arch Intern Med 2009;169:562-71

876 Mortality risks… Sinha R, Cross AJ, Graubard BI, Leitzmann MF, Schatzkin A. Meat intake and mortality: a prospective study of over half a million people. Arch Intern Med 2009;169:562-71

877 Overfitting In multivariate modeling, you can get highly significant but meaningless results if you put too many predictors in the model. The model is fit perfectly to the quirks of your particular sample, but has no predictive ability in a new sample.

878 Overfitting: class data example
I asked SAS to automatically find predictors of optimism in our class dataset. Here's the resulting linear regression model:
Variable | Parameter Estimate | Standard Error | Type II SS | F Value | Pr > F
Intercept | … | … | … | … | …
exercise | … | … | … | … | …
sleep | … | … | … | … | …
obama | … | … | … | … | <.0001
Clinton | … | … | … | … | …
mathLove | … | … | … | … | …
Exercise, sleep, and high ratings for Clinton are negatively related to optimism (highly significant!), and high ratings for Obama and high love of math are positively related to optimism (highly significant!).

879 If something seems too good to be true…
Clinton, univariate (Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > |t|): Intercept …; Clinton …
Sleep, univariate: Intercept …; sleep …
Exercise, univariate: Intercept … (<.0001); exercise …

880 More univariate models…
Obama, univariate: Intercept …; obama … Compare with the multivariate result: p < .0001.
Love of math, univariate: Intercept …; mathLove … Compare with the multivariate result: p = .0011.

881 Overfitting. Rule of thumb: you need at least 10 subjects for each additional predictor variable in the multivariate regression model. Pure noise variables still produce good R² values if the model is overfitted. [Figure: the distribution of R² values from a series of simulated regression models containing only noise variables.] (Figure 1 from: Babyak MA. What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models. Psychosomatic Medicine. 2004;66.)

882 Review of statistical tests
The following table gives the appropriate choice of a statistical test or measure of association for various types of data (outcome variables and predictor variables) by study design. E.g., blood pressure = pounds + age + treatment (1/0): a continuous outcome with continuous predictors (pounds, age) and a binary predictor (treatment).

883 Statistical procedure or measure of association, by types of variables to be analyzed (predictor | outcome | procedure)
Cross-sectional/case-control studies:
Binary (two groups) | Continuous | T-test
Binary | Ranks/ordinal | Wilcoxon rank-sum test
Categorical (>2 groups) | Continuous | ANOVA
Continuous | Continuous | Simple linear regression
Multivariate (categorical and continuous) | Continuous | Multiple linear regression
Categorical | Categorical | Chi-square test (or Fisher's exact)
Binary | Binary | Odds ratio, risk ratio
Cohort studies/clinical trials:
Multivariate | Binary | Logistic regression
Binary | Binary | Risk ratio
Categorical | Time-to-event | Kaplan-Meier / log-rank test
Multivariate | Time-to-event | Cox proportional hazards regression, hazard ratio
Categorical | Continuous | Repeated-measures ANOVA
Multivariate | Continuous | Mixed models; GEE modeling

884 Alternative summary: statistics for various types of outcome data
Outcome: continuous (e.g., pain scale, cognitive function). Independent: Ttest, ANOVA, Linear correlation, Linear regression. Correlated: Paired ttest, Repeated-measures ANOVA, Mixed models/GEE modeling. Assumptions: outcome is normally distributed (important for small samples); outcome and predictor have a linear relationship.
Outcome: binary or categorical (e.g., fracture yes/no). Independent: Difference in proportions, Relative risks, Chi-square test, Logistic regression. Correlated: McNemar's test, Conditional logistic regression, GEE modeling. Assumption: the chi-square test assumes sufficient numbers in each cell (>=5).
Outcome: time-to-event (e.g., time to fracture). Independent: Kaplan-Meier statistics, Cox regression. Correlated: n/a. Assumption: Cox regression assumes proportional hazards between groups.

885 Continuous outcome (means); HRP 259/HRP 262
Outcome: continuous (e.g., pain scale, cognitive function).
Independent observations: Ttest (compares means between two independent groups); ANOVA (compares means between more than two independent groups); Pearson's correlation coefficient (linear correlation between two continuous variables); Linear regression (multivariate regression technique used when the outcome is continuous; gives slopes).
Correlated observations: Paired ttest (compares means between two related groups, e.g., the same subjects before and after); Repeated-measures ANOVA (compares changes over time in the means of two or more groups, repeated measurements); Mixed models/GEE modeling (multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time).
Non-parametric alternatives if the normality assumption is violated (and small sample size): Wilcoxon signed-rank test (alternative to the paired ttest); Wilcoxon rank-sum test (= Mann-Whitney U test; alternative to the ttest); Kruskal-Wallis test (alternative to ANOVA); Spearman rank correlation coefficient (alternative to Pearson's correlation coefficient).

886 Binary or categorical outcomes (proportions); HRP 259/HRP 261
Outcome: binary or categorical (e.g., fracture, yes/no).
Independent observations: Chi-square test (compares proportions between two or more groups); Relative risks (odds ratios or risk ratios); Logistic regression (multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios).
Correlated observations: McNemar's chi-square test (compares a binary outcome between correlated groups, e.g., before and after); Conditional logistic regression (multivariate regression technique for a binary outcome when groups are correlated, e.g., matched data); GEE modeling (multivariate regression technique for a binary outcome when groups are correlated, e.g., repeated measures).
Alternatives to the chi-square test if sparse cells: Fisher's exact test (independent groups, some cells <5); McNemar's exact test (correlated groups, some cells <5).

887 Time-to-event outcome (survival data); HRP 262
Outcome: time-to-event (e.g., time to fracture).
Independent observation groups: Kaplan-Meier statistics (estimates survival functions for each group, usually displayed graphically; compares survival functions with the log-rank test); Cox regression (multivariate technique for time-to-event data; gives multivariate-adjusted hazard ratios).
Correlated: n/a (already over time).
Modifications to Cox regression if proportional hazards is violated: time-dependent predictors or time-dependent hazard ratios (tricky!).

